Article

Fine-Scale Grassland Classification Using UAV-Based Multi-Sensor Image Fusion and Deep Learning

1 College of Information and Technology, Jilin Agricultural University, Changchun 130118, China
2 State Key Laboratory of Black Soils Conservation and Utilization, Northeast Institute of Geography and Agroecology, Chinese Academy of Sciences, Changchun 130102, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(18), 3190; https://doi.org/10.3390/rs17183190
Submission received: 22 July 2025 / Revised: 8 September 2025 / Accepted: 11 September 2025 / Published: 15 September 2025


Highlights

What are the main findings?
  • MSPF-Net delivered the highest fusion quality, with sharp edges, when fusing RGB, multispectral, TIR, and LiDAR data.
  • UNet++ with SE attention, trained with a Focal Dice loss, achieved the most accurate, boundary-coherent grassland segmentation.
What is the implication of the main finding?
  • Enables boundary-coherent grassland monitoring and planning at the parcel level.
  • Provides a reproducible pipeline and a modality value ranking (CHM > indices > RGB > TIR) for multimodal mapping.

Abstract

Grassland classification via remote sensing is essential for ecosystem monitoring and precision management, yet conventional satellite-based approaches are fundamentally constrained by coarse spatial resolution. To overcome this limitation, we harness high-resolution UAV multi-sensor data, integrating multi-scale image fusion with deep learning to achieve fine-scale grassland classification that satellites cannot provide. First, four categories of UAV data, including RGB, multispectral, thermal infrared, and LiDAR point cloud, were collected, and a fused image tensor consisting of 10 channels (NDVI, VCI, CHM, etc.) was constructed through orthorectification and resampling. For feature-level fusion, four deep fusion networks were designed. Among them, the MultiScale Pyramid Fusion Network, utilizing a pyramid pooling module, effectively integrated spectral and structural features, achieving optimal performance in all six image fusion evaluation metrics, including information entropy (6.84), spatial frequency (15.56), and mean gradient (12.54). Subsequently, training and validation datasets were constructed by integrating visual interpretation samples. Four backbone networks, including UNet++, DeepLabV3+, PSPNet, and FPN, were employed, and attention modules (SE, ECA, and CBAM) were introduced separately to form 12 model combinations. Results indicated that the UNet++ network combined with the SE attention module achieved the best segmentation performance on the validation set, with a mean Intersection over Union (mIoU) of 77.68%, overall accuracy (OA) of 86.98%, F1-score of 81.48%, and Kappa coefficient of 0.82. In the categories of Leymus chinensis and Puccinellia distans, producer’s accuracy (PA)/user’s accuracy (UA) reached 86.46%/82.30% and 82.40%/77.68%, respectively. Whole-image prediction validated the model’s coherent identification capability for patch boundaries. In conclusion, this study provides a systematic approach for integrating multi-source UAV remote sensing data and intelligent grassland interpretation, offering technical support for grassland ecological monitoring and resource assessment.

1. Introduction

Grasslands cover approximately 40% of the Earth’s terrestrial surface and represent critical ecosystems for maintaining carbon sinks, livestock security, and biodiversity. In recent years, grassland ecosystem degradation has become a significant research focus in ecological and environmental sciences, driven by intensified global climate change and expanding human activities. Under the context of global climate change, issues such as drought stress, altered carbon cycling in ecosystems, intensified land use, and the resulting degradation of grassland ecological function, reduction in vegetation cover, and biodiversity loss have increasingly drawn attention from the international ecological research community and resource management agencies [1]. Over the past decade, grassland degradation has been accelerated and ecological heterogeneity at the patch scale significantly intensified by the combined effects of climatic drying and increased grazing pressure [2]. In temperate regions, extreme hydroclimate conditions amplify the biophysical influence of advanced green-up on surface energy partitioning and near-surface temperature, which strengthens seasonal heterogeneity [3]. This underscores the need for high-frequency, multi-sensor observations to capture short phenological windows over grasslands. In this context, vegetation index products characterized by coarse spatial resolution and a revisit cycle of 16 days have become inadequate for capturing grassland stress dynamics in a timely manner. Therefore, grassland monitoring urgently relies on high-frequency, high-resolution, and multi-source information. Particularly under the requirements of precision ecological management, achieving fine-scale and real-time grassland ecological monitoring has become a key driving factor for the development of remote sensing technologies [4].
Traditional satellite-based observation platforms (e.g., Landsat and Sentinel) have long been utilized for wide-area monitoring of grassland ecosystems. For instance, utilizing optical time-series data acquired by the Sentinel-2 constellation at a revisit interval of 5 days, mowing and regeneration events in temperate grasslands have been identified with an approximate F1-score of 0.81, highlighting the significance of high spatiotemporal-density optical observations for fine-scale grassland monitoring [5]. By integrating microwave scattering information from Sentinel-1 and reflectance data from Sentinel-2, the overall accuracy for classifying heterogeneous pasture management intensity can be improved to approximately 84% [6]. Additionally, spatiotemporal fusion frameworks such as STAIR 2.0 have integrated observations from PlanetScope, Landsat-8, and Sentinel satellites to produce daily, cloud-gap-free surface reflectance products at a spatial resolution of 10 m, thus providing spatial details at the management scale for near-real-time pasture early warning [7]. The SwinV2-t model based on Sentinel-2 achieved an overall accuracy of 89% at the 10 m scale for protected area classification, representing an improvement of 11% compared to traditional CNNs [8]. Meanwhile, combining optical, SAR, and meteorological data into a random forest model reduced the estimation error for the first grassland mowing date to within 5 days [9]. Multi-source studies in Germany and Austria further demonstrated that combining Sentinel-1/2 with ERA5 meteorological reanalysis improved the F1-score of mowing events by 0.07 [10]. Although multi-sensor synergy significantly increased the information dimension, cross-platform radiometric consistency and registration remain major bottlenecks [11]. Nationwide Sentinel-2 time series reduced the first mowing date error to 4.6 days, confirming the feasibility of 10 m spatial resolution and 5-day revisit period observations for grassland management monitoring [7]. A short-time series deep learning network achieved rapid mapping of grassland-shrub areas at 10 m resolution in Australia, improving mIoU by 6.3% compared to single-image analysis [12]. A deep cloud-gap filling strategy restored the continuity of grassland reflectance curves in cloudy seasons, significantly improving the accuracy of time-series analysis [13].
Although satellite platforms are suitable for large-area and large-scale grassland vegetation monitoring, their capability to detect localized disturbances (e.g., rodent infestations, fire incidents, and livestock trampling) during precision assessments is limited due to spatial resolutions of 10–30 m and revisit intervals ranging from several days to over ten days. In contrast, unmanned aerial vehicles (UAVs), characterized by centimeter-level spatial resolution, flexible high-frequency deployment, and integration capability with various sensors, offer promising opportunities for ecological monitoring of grasslands. Lu and He [14] finely delineated heterogeneous grasslands using near-infrared-green-blue imagery combined with object-oriented classification, achieving an overall accuracy of approximately 85%. Nahrstedt et al. [15] employed multispectral and texture features as parallel inputs into a random forest model, achieving 82–91% accuracy in grass-weed differentiation across various growth stages, demonstrating UAV data’s keen sensitivity to crop-weed dynamics. In addition to optical imagery, high-density UAV LiDAR point clouds have also shown remarkable potential in measuring three-dimensional grassland structure and monitoring species distribution. Šrollerů and Potůčková [16] conducted multi-temporal LiDAR observations in Arctic tundra, achieving species classification accuracy exceeding 75% through temporal changes in structural parameters and quantifying seasonal height fluctuations, providing breakthroughs for polar vegetation dynamics research. UAVs can also provide reliable vegetation identification results in complex environments. Michez et al. [17] performed object-oriented interpretation of species and health status of multi-layered riparian vegetation using multi-temporal ultra-high-resolution imagery, achieving an overall accuracy of 79.5–84.1%, verifying UAV imagery’s adaptability to fragmented habitats. More importantly, UAV-LiDAR has energized grassland biomass estimation and precision pasture management. Hütt et al. [18] incorporated LiDAR-derived compressed sward height (CSH) into statistical models, achieving an R2 of 0.89 for biomass estimation, enabling productivity mapping of a 200 ha pasture at a 2.5 m grid scale, thus providing robust data support for grazing quotas and spatial-temporal scheduling.
The widespread adoption of deep learning and attention mechanisms has provided new insights into cross-modal information coupling. The Attentive Bilateral Contextual Network (ABCNet), through a spatial-semantic dual-path design, increased the mIoU to 0.52 on the LoveDA dataset [19]. UNetFormer embedded global-local attention into a lightweight encoder–decoder framework, achieving an inference speed of 322 fps with a 512 × 512 px input [20]. The Progressive Adjacent-Layer Coordination Net applied differential attention reorganization to adjacent layer features, improving mIoU by 4.7% on optical-SAR-DSM trimodal data [21]. The transformer-based TMTNet achieved RGB-HSI target tracking at 30 fps, demonstrating that the cross-modal gating unit effectively alleviates spectral redundancy [22]. Furthermore, the Category Attention Guided Network enhanced category-specific channel weights, increasing the F1-score for small patch identification by approximately 5.8–6.1% on GF-3 0.8 m imagery [23]. Current remote sensing data fusion research predominantly utilizes feature-level methods (46%), followed by pixel-level (31%) and decision-level (23%) approaches. Cross-sensor radiometric consistency remains a significant bottleneck [24]. The multi-source fusion-enhanced segmentation framework proposed by Wang and Zhou [25] increased mIoU from 0.467 to 0.481 through Pix2Pix generative fusion, significantly improving the continuity of boundaries for small classes. However, fragmented patches and low-contrast boundaries in grasslands require the model to simultaneously possess multi-scale and edge-preserving capabilities. The Adaptive Feature-Fusion UNet achieved adaptive integration through dual-attention weighting, reducing the average error on multi-source datasets by 3% [26]. After incorporating a feature alignment unit, the Sparse Self-Attention Network (SAANet) improved boundary IoU by 8% in 0.3 m imagery [27]. Meanwhile, pixel-level uncertainty-weighted cross-entropy (U-CE) proposed by Landgraf et al. [28] has provided a novel approach for dynamically suppressing errors from highly uncertain pixels.
Despite the advancements in UAV-based multi-sensor technologies and attention-based deep learning methods, significant challenges remain in their practical implementation for fine-scale grassland classification. First, integrating multi-modal data sources, such as RGB, multispectral, thermal infrared imagery, and LiDAR point clouds, introduces complexities related to spatial resolution mismatches and radiometric inconsistencies, which complicate accurate data fusion. Second, conventional classification approaches often fail to fully utilize the rich, complementary information provided by multi-sensor data, thus limiting the accuracy and robustness of grassland classification. Third, traditional segmentation networks frequently encounter difficulties in accurately delineating fragmented vegetation patches and subtle boundary distinctions due to grassland heterogeneity, necessitating more sophisticated multi-scale and attention-driven network architectures.
Therefore, to effectively support fine-scale grassland monitoring and precise ecological management, it is essential to accurately handle fragmented patches, subtle boundaries, and diverse ecological conditions typically encountered in grassland ecosystems. Achieving these objectives necessitates developing advanced segmentation methods that integrate robust multi-scale data fusion techniques, more sophisticated and context-aware attention mechanisms, and effective strategies for quantifying and managing prediction uncertainties. To address these needs, this study proposes a two-stage pipeline that first performs multi-sensor fusion into a ten-channel product and then performs supervised semantic segmentation, using UAV-derived visible light, multispectral imagery, and airborne LiDAR point clouds acquired from the Yaojingzi Leymus chinensis Grassland Reserve. The main contributions and innovations of this study are summarized as follows:
(1) We implement a practical multi-scale fusion setup for UAV multi-sensor data that harmonizes RGB, vegetation indices, thermal infrared, and canopy height from LiDAR into a unified 10-band product at 0.07 m, providing a reusable input for fine-scale grassland mapping.
(2) Through systematic ablations, we identify a lightweight attention configuration—UNet++ with an SE module—that consistently improves accuracy on fragmented patches and boundaries while preserving computational efficiency.
(3) We adopt a class-imbalance-aware objective that combines Focal and Dice losses, yielding more reliable delineation of minority species and small or bare patches than either loss alone.

2. Materials and Methods

2.1. Study Area

The study area is located in the Yaojingzi Leymus chinensis Grassland Reserve (123°26′–123°40′E, 44°34′–44°43′N) in northwestern Changling County, Jilin Province, with an area of 23,800 hectares and an elevation ranging from 140 to 160 m. It belongs to a semi-arid to semi-humid temperate monsoon climate zone, with annual precipitation of 313–581 mm and annual evaporation of 1368 mm. The area supports over 300 plant species and is predominantly characterized by Leymus chinensis grasslands, representing the largest and best-preserved high-quality Leymus chinensis grassland in Jilin Province (Figure 1). Within the reserve, Leymus chinensis forms extensive meadow steppe on low salinity loams and is the principal forage resource; Puccinellia distans occupies saline and alkali patches in shallow depressions and along salt crust margins; Phragmites australis follows wetter swales and drainage ditches. These three grasses underpin forage supply and soil water regulation locally and compose the dominant patch mosaic across the study landscape. Given the wet patches and drainage features in the reserve, wetland vegetation can exert strong local cooling through biogeophysical pathways in semi arid settings, supporting the use of thermal and height cues in our fusion design [29]. The reserve is divided into a core zone (600 hectares) and a general zone, with differentiated protection measures implemented.

2.2. Data Collection and Preprocessing

2.2.1. Ground Data Collection

Ground sampling and UAV data collection were conducted in this sampling area. Sixty herbaceous quadrats were randomly established within this sampling area, including 30 quadrats with a minimum spacing of 20 m and another 30 quadrats with a minimum spacing of 10 m. Field investigations were conducted from 27 June to 29 June 2024, coinciding with the peak growing season (June to August) of vegetation in the study area. Dominant species within each quadrat were surveyed and recorded, and photographs of quadrats were taken. During the fieldwork, the geographic coordinates (latitude and longitude) of all herbaceous quadrats were recorded using a Hi-Target V200 GNSS RTK (Hi-Target International Group Ltd., Guangzhou, China), achieving a horizontal measurement accuracy within 0.15 m.

2.2.2. UAV Data Acquisition and Processing

Four types of UAV remote sensing data were collected in this study: RGB imagery, multispectral imagery, thermal infrared imagery, and LiDAR point cloud data. All data collection tasks were performed under clear weather conditions between 13:00 and 15:00 to ensure data quality and minimize the impact of illumination variations.
LiDAR data were acquired at an altitude of 100 m using a DJI Matrice 300 RTK equipped with a Zenmuse L1 sensor (DJI, Shenzhen, China), achieving a point cloud density of 367 points/m2 and a ground resolution of 2.73 cm/pixel. Flight parameters included a flight speed of 7 m/s, side overlap of 70%, forward overlap of 80%, triple-echo mode, and sampling frequency of 160 kHz. Thermal infrared (TIR) and visible (RGB) images were collected at 50 m altitude using a DJI Mavic 3T thermal imaging drone, which is equipped with a wide-angle camera featuring a 1/2-inch CMOS sensor (48 MP effective pixels) and a thermal imaging camera with 640 × 512 pixels resolution, supporting an RTK module for centimeter-level positioning accuracy. The ground resolution was 6.59 cm/pixel, with a flight speed of 6 m/s and forward/side overlap rates of 80%/70%. Multispectral data were acquired using a DJI Phantom 4 Multispectral drone, which integrates six cameras: one RGB camera and five multispectral cameras covering blue (450 ± 16 nm), green (560 ± 16 nm), red (650 ± 16 nm), red edge (730 ± 16 nm), and near-infrared (840 ± 26 nm) bands. The flight altitude was 70 m, achieving a ground resolution of 3.7 cm/pixel with forward/side overlap rates of 80%/70%.
RGB and multispectral images were processed using DJI Terra (version 4.0.10) software, which performs aerial triangulation (SfM) based on feature matching between images. High-precision orthomosaics were then generated through dense point cloud construction, digital surface model (DSM) generation, and orthophoto reprojection. During this process, spatial consistency and geometric accuracy of images were ensured, effectively eliminating image distortion caused by terrain variations. The final RGB orthomosaic achieved a spatial resolution of approximately 0.07 m, while the multispectral orthomosaic achieved approximately 0.06 m resolution. Several vegetation indices, including the normalized difference vegetation index (NDVI), green normalized difference vegetation index (GNDVI), ratio vegetation index (RVI), vegetation condition index (VCI), and perpendicular vegetation index (PVI), were calculated from multispectral data, providing essential spectral feature information for subsequent analysis of grassland growth conditions. LiDAR point cloud data processing was conducted in two stages. First, standard LAS format point cloud data were generated using DJI Terra software, and subsequently imported into LiDAR360 (version 5.2) software for detailed processing. This included ground and non-ground point classification, outlier noise removal, point cloud filtering, and interpolation operations. Based on this, a canopy height model (CHM) with a spatial resolution of 0.1 m was constructed by calculating the difference between digital surface model (DSM) and digital elevation model (DEM), accurately quantifying the vertical structural distribution of vegetation. Thermal infrared images were processed in Pix4D mapper (version 4.4.12; Pix4D SA, Lausanne, Switzerland), including geometric correction, radiometric calibration, and apparent temperature conversion. The software achieved high-precision thermal image mosaicking through feature matching between thermal infrared images and optimization of radiometric consistency in overlapping regions. Special attention was given to radiometric calibration of temperature data during processing, using atmospheric transmittance and ambient temperature parameters for temperature inversion. Ultimately, a thermal infrared orthomosaic with 0.07 m resolution was generated, providing reliable spatial temperature distribution data for vegetation stress monitoring and thermal anomaly identification.
To capture the main sources of grassland heterogeneity, we selected channels to represent spectral reflectance, vertical structure, and thermal state. All inputs were coregistered and resampled to a common grid of 0.07 m. Specifically, red, green, and blue imagery provided fine scale color, texture, and edge cues that delineate small patches and bare soil. Vegetation indices stabilized illumination and soil background effects and emphasized complementary plant traits: NDVI (biomass and vigor), GNDVI (higher sensitivity to chlorophyll and nitrogen status), RVI (more linear with leaf area index at moderate to high cover, which reduces NDVI saturation), VCI (a scene normalized NDVI that highlights relative condition and stress), and PVI (a soil line index that suppresses soil influence under sparse cover and near edges). Thermal infrared imagery conveyed local thermal context through canopy and soil temperature differences, which support class separation in shaded zones and in scenes with weak reflectance contrast. Vertical structure, including canopy height and local roughness, was represented by the canopy height model, which separates tall dense stands from low grasses and exposed surfaces. Overall, this channel design is segmentation aware and aligns with the ablation findings.
All data were uniformly geo-registered and transformed to the WGS84 coordinate system using the ArcGIS platform. Geometric refinement was performed using ground control points (GCPs) to ensure spatial consistency among multi-source datasets. Finally, all data were cropped to an area of 260 × 300 m, providing a standardized data foundation for subsequent multi-source data fusion, feature extraction, and vegetation classification analyses (Figure 2).

2.3. Multi-Sensor Image Fusion

2.3.1. Resolution Unification and Blocking Strategy

The input data included RGB, TIR, NDVI, GNDVI, PVI, RVI, VCI, and CHM images, which originated from different sensors and had varying image resolutions. To maintain spatial information integrity during the fusion process, 0.07 m was selected as the unified target resolution. This selection was based on the following considerations: firstly, it matched the original resolutions of RGB and thermal infrared imagery, preventing loss of spatial information; secondly, it retained sufficient spatial detail while effectively reducing computational load for subsequent deep learning analysis; and finally, this resolution clearly distinguished grassland patch boundaries and vegetation type transition zones. The rasterio library in Python (version 3.9.23) was used for resampling to adjust all data sources to the same resolution under the spatial coordinate system of a common reference image. Let the original input image be denoted as follows:
$I_{src} \in \mathbb{R}^{b \times H_{src} \times W_{src}}$
where b represents the number of bands, and resampling was performed using bilinear interpolation, expressed mathematically as follows:
$I_{dst}(x, y) = \sum_{i,j} w_{ij} \, I_{src}(x_i, y_j)$
where $w_{ij}$ denotes the interpolation weights.
After resampling, all input data were standardized into a unified tensor containing 10 bands (10 × H × W), where H and W represent the image height and width, respectively, at 0.07 m resolution. The final data integration was accomplished through a custom function called load_and_merge_all_bands, providing spatially consistent and scale-coordinated multimodal inputs for subsequent deep fusion networks.
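For illustration, the following is a minimal sketch of this harmonization step, assuming single-band GeoTIFF inputs; the file handling is simplified (shape-based resampling rather than full georeferenced reprojection), and only the function name load_and_merge_all_bands is taken from the text.

```python
# Minimal sketch of band harmonization: resample each input onto the grid of a
# reference raster with bilinear interpolation and stack into a (10, H, W) array.
import numpy as np
import rasterio
from rasterio.enums import Resampling

def load_and_merge_all_bands(band_paths, ref_path):
    """Resample every raster in band_paths to the shape of ref_path and stack them."""
    with rasterio.open(ref_path) as ref:
        ref_height, ref_width = ref.height, ref.width

    bands = []
    for path in band_paths:
        with rasterio.open(path) as src:
            # Read band 1 and resample directly to the reference shape
            # (bilinear interpolation corresponds to the weights w_ij above).
            data = src.read(
                1,
                out_shape=(ref_height, ref_width),
                resampling=Resampling.bilinear,
            ).astype(np.float32)
        bands.append(data)

    return np.stack(bands, axis=0)  # shape: (10, H, W) at the 0.07 m target grid

# Example usage (file names are placeholders):
# fused_input = load_and_merge_all_bands(
#     ["R.tif", "G.tif", "B.tif", "TIR.tif", "NDVI.tif", "GNDVI.tif",
#      "PVI.tif", "RVI.tif", "VCI.tif", "CHM.tif"],
#     ref_path="R.tif",
# )
```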
Due to the large size of image data, a block processing strategy was adopted to alleviate GPU memory pressure and improve parallel computational efficiency, employing weighted fusion during block stitching to eliminate seam effects. The fused image was divided according to a predefined block size (e.g., 512 pixels) and overlap region (e.g., 128 pixels). The calculation formula for the block stride was defined as follows:
$step = block\_size - overlap$
This ensured sufficient overlapping regions between adjacent blocks, providing conditions for subsequent edge smoothing processes.
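A short sketch of this blocking and weighted stitching strategy follows; the linear feathering weight is one plausible realization of the seam suppression described above, not necessarily the exact scheme used.

```python
# Sketch of the blocking strategy: step = block_size - overlap, plus a feathering
# weight used when accumulating overlapping blocks back into the mosaic.
import numpy as np

def block_windows(height, width, block_size=512, overlap=128):
    """Yield (row, col) upper-left corners so adjacent blocks share `overlap` pixels."""
    step = block_size - overlap
    rows = list(range(0, max(height - block_size, 0) + 1, step)) or [0]
    cols = list(range(0, max(width - block_size, 0) + 1, step)) or [0]
    for r in rows:
        for c in cols:
            yield r, c

def feather_weight(block_size=512, overlap=128):
    """2D weight ramping from 0 to 1 inside the overlap band to hide seams."""
    ramp = np.ones(block_size, dtype=np.float32)
    ramp[:overlap] = np.linspace(0.0, 1.0, overlap, dtype=np.float32)
    ramp[-overlap:] = np.linspace(1.0, 0.0, overlap, dtype=np.float32)
    return np.outer(ramp, ramp)  # (block_size, block_size)

# During stitching, each fused block is multiplied by this weight, accumulated into
# the output mosaic, and finally normalized by the per-pixel sum of weights.
```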

2.3.2. Core Modules for Deep Feature Fusion

A Channel Attention module was introduced to enhance important channel information. Channel descriptors were obtained using global average pooling and global max pooling, processed through two convolutional layers with ReLU activation, and finally channel weights were generated using a sigmoid function:
$w = \sigma\big(\mathrm{Conv}_2\big(\mathrm{ReLU}\big(\mathrm{Conv}_1(\mathrm{pool}(x))\big)\big)\big)$
where σ denotes the sigmoid function, Conv1 and Conv2 represent the first and second convolutional operations, respectively, and pool(x) denotes the pooling operation applied to the input feature map x. This module enables the network to adaptively enhance information in critical channels, highlighting important features while suppressing less significant ones.
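A minimal PyTorch sketch of this channel attention module is given below; the reduction ratio of 8 follows the settings reported in Section 2.3.3, while details not stated in the text (e.g., summing the average- and max-pooled paths through a shared bottleneck) are assumptions.

```python
# Channel attention sketch: pooled channel descriptors pass through two 1x1
# convolutions (Conv1, Conv2 in the equation) with ReLU, then a sigmoid gate.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Descriptors from global average and max pooling are fused by summation.
        w = self.sigmoid(self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x)))
        return x * w  # reweight informative channels, suppress the rest
```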
A Spatial Attention module was incorporated by combining the average and maximum values along the channel dimension, and spatial attention maps were generated through a 7 × 7 convolution followed by sigmoid activation:
$M_{spatial} = \sigma\big(\mathrm{Conv}_{7 \times 7}\big(\mathrm{concat}(\mathrm{mean}(x), \mathrm{max}(x))\big)\big)$
where $\mathrm{Conv}_{7 \times 7}$ denotes the 7 × 7 convolutional operation, and $\mathrm{concat}(\mathrm{mean}(x), \mathrm{max}(x))$ represents concatenating the average and maximum values of the input feature map x along the channel dimension. This module can locate image regions with significant contributions to the fusion, enhancing their features, thereby increasing the model’s attention to important spatial information.
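A corresponding sketch of the spatial attention module, implemented directly from the equation above:

```python
# Spatial attention sketch: channel-wise mean and max maps are concatenated and
# passed through a 7x7 convolution followed by a sigmoid gate.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        mean_map = torch.mean(x, dim=1, keepdim=True)     # (B, 1, H, W)
        max_map, _ = torch.max(x, dim=1, keepdim=True)    # (B, 1, H, W)
        attn = self.sigmoid(self.conv(torch.cat([mean_map, max_map], dim=1)))
        return x * attn  # emphasize spatially informative regions
```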
A Residual Dense Block (RDB) was designed to extract features layer by layer through multiple convolutional layers and dense connections [30]. Finally, a 1 × 1 convolution was employed to fuse features, which were then added residually to the original input:
$y = x + \mathrm{Conv}_{1 \times 1}\big(\mathrm{concat}(x, \mathrm{conv}_1(x), \ldots, \mathrm{conv}_n(x))\big)$
where $\mathrm{concat}(x, \mathrm{conv}_1(x), \ldots, \mathrm{conv}_n(x))$ represents the concatenation of the input feature map x with the feature maps obtained through multiple convolutional layers. This structure ensures efficient information transfer and comprehensive utilization of multi-level features, helping alleviate gradient vanishing issues and enhancing the feature extraction capability of the model.
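A compact sketch of the Residual Dense Block consistent with this equation; num_layers = 3 and growth = 32 follow the DAEF-Net settings reported in Section 2.3.3, while the activation choice is an assumption.

```python
# Residual Dense Block sketch: densely connected 3x3 convolutions, a 1x1 fusion
# convolution, and a residual addition back to the block input.
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    def __init__(self, channels, growth=32, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        in_ch = channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ))
            in_ch += growth  # dense connectivity: each layer sees all previous outputs
        self.fuse = nn.Conv2d(in_ch, channels, kernel_size=1)

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return x + self.fuse(torch.cat(features, dim=1))  # residual connection
```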
A Pyramid Pooling Module was utilized, performing adaptive average pooling at multiple scales [31], followed by upsampling and concatenation with the original features, and finally fused through a 1 × 1 convolution:
$y = \mathrm{Conv}_{1 \times 1}\big(\mathrm{concat}(x, \mathrm{Up}(P_1(x)), \mathrm{Up}(P_2(x)), \mathrm{Up}(P_3(x)), \mathrm{Up}(P_4(x)))\big)$
where $\mathrm{Up}(P_i(x))$ denotes the upsampling operation applied to the pooled feature map $P_i(x)$ at scale $i$, and the $\mathrm{Conv}_{1 \times 1}$ term represents the fusion of the concatenated feature maps through a 1 × 1 convolution. This module comprehensively utilizes global and local information, improving the model’s capability to capture features and enhancing its ability to better understand contextual information in images.
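The following sketch illustrates the pyramid pooling module; the pool sizes [1, 2, 4] follow the MSPF-Net configuration in Section 2.3.3, and bilinear upsampling is an assumption.

```python
# Pyramid pooling sketch: adaptive average pooling at several scales, upsampling
# back to the input size, concatenation with the input, and a 1x1 fusion convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, channels, pool_sizes=(1, 2, 4)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.fuse = nn.Conv2d(channels * (len(pool_sizes) + 1), channels, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2:]
        pyramids = [x]
        for size in self.pool_sizes:
            pooled = F.adaptive_avg_pool2d(x, output_size=size)              # P_i(x)
            pyramids.append(F.interpolate(pooled, size=(h, w),
                                          mode="bilinear", align_corners=False))  # Up(P_i(x))
        return self.fuse(torch.cat(pyramids, dim=1))                         # 1x1 fusion
```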

2.3.3. Architecture of Four Fusion Networks

Residual Channel Attention Fusion Network (RCAF-Net): a fusion network design combining residual connections with channel attention (Figure 3). RCAF-Net combines residual connections and channel attention to strengthen deep feature fusion [32]. The encoder extracts features with stacks of 3 × 3 convolutions and ReLU activations across three scales {64,128,256} with skip connections to the decoder. Residual Channel Attention Blocks perform squeeze and excitation with a channel attention reduction ratio equal to 8 so that informative channels receive larger weights while the parameter count remains moderate.
Dense Auto Encoder Fusion Network (DAEF-Net): DAEF-Net is designed based on a typical encoder–decoder architecture, introducing dense connections in the decoder to enhance information transfer efficiency [30]. Residual Dense Blocks are used with num layers equal to 3 and growth equal to 32. A lightweight channel attention gate uses a reduction ratio equal to 8.
Dual Attention Fusion Network (DAF-Net): a network design incorporating both channel and spatial attention mechanisms [33]. The spatial attention uses a 7 × 7 context mask, and the channel attention uses a reduction ratio equal to 8. The bottleneck stacks two Residual Dense Blocks to enlarge the effective receptive field, while skip connections link encoder and decoder to recover details during upsampling.
Multi-Scale Pyramid Fusion Network (MSPF-Net): a multi-scale fusion network design based on pyramid pooling [31]. MSPF-Net employs pyramid pooling to aggregate multi scale context. Pool sizes are [1,2,4] and the pooled features are concatenated with the main path. Parallel 3 × 3 and 5 × 5 branches provide complementary receptive fields, followed by a residual refinement head with channel attention where the reduction ratio equals 8. To stabilize color, a shallow RGB skip is blended only into the first three output channels before the refinement head. This design captures global and local information and yields fused images with improved edge preservation and fine structure.

2.3.4. Metrics for Fusion Quality Assessment

To comprehensively evaluate image clarity, contrast, and richness of details, several image quality assessment metrics were introduced, including information entropy, spatial frequency, average gradient, variance of Laplacian, Tenengrad, and RMS contrast. These metrics effectively reflect the visual quality of images and are suitable for quality assessment of multi-band fused images.
Information Entropy (Entropy): Information entropy is a measure of the amount of information in an image, reflecting its complexity and uncertainty. The formula is expressed as follows:
$H(X) = -\sum_{i=0}^{255} p_i \log_2 p_i$
where $p_i$ denotes the probability of occurrence of gray level $i$.
Spatial Frequency (SF): Spatial frequency reflects the richness of details in an image; a higher value indicates greater image clarity. Spatial frequency is computed based on pixel variations in both horizontal and vertical directions by calculating the standard deviations in row (RF) and column (CF) directions. The formula is expressed as follows:
$SF = \sqrt{RF^2 + CF^2}$
Mean Gradient: Mean gradient is a commonly used metric for evaluating image clarity, describing the variation in pixel values in an image. The formula is expressed as follows:
$\mathrm{Mean\ Gradient} = \frac{1}{(H-1)(W-1)} \sum_{x=1}^{H-1} \sum_{y=1}^{W-1} \sqrt{(f(x+1, y) - f(x, y))^2 + (f(x, y+1) - f(x, y))^2}$
Variance of Laplacian: Variance of Laplacian is commonly used to measure image sharpness; a larger value indicates more distinct edges and higher image clarity. The formula is expressed as follows:
$\mathrm{Var\ of\ Laplacian} = \mathrm{Var}(\Delta f)$
where $\Delta f$ denotes the image processed by the Laplacian operator, and $\mathrm{Var}$ denotes the variance.
Tenengrad: Tenengrad is a gradient-based image quality assessment method that calculates image gradients using the Sobel operator, assessing image clarity and edge information. The formula is expressed as follows:
$\mathrm{Tenengrad} = \frac{1}{HW} \sum_{x=1}^{H} \sum_{y=1}^{W} \big[ (\mathrm{Sobel}_x(x, y))^2 + (\mathrm{Sobel}_y(x, y))^2 \big]$
where $\mathrm{Sobel}_x(x, y)$ and $\mathrm{Sobel}_y(x, y)$ denote the image gradients in the x and y directions, respectively.
RMS Contrast: RMS contrast is the standard deviation of image luminance fluctuations, representing the overall contrast of an image. The formula is expressed as follows:
$\mathrm{RMS\ Contrast} = \sqrt{\frac{1}{HW} \sum_{x=1}^{H} \sum_{y=1}^{W} (f(x, y) - \mu)^2}$
where $\mu$ denotes the mean luminance of the image, $f(x, y)$ is the pixel value, and H and W represent the image height and width.
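For reference, a compact sketch of these six metrics for a single-band image is given below; SciPy's ndimage filters are used for the Laplacian and Sobel responses, and per-band averaging over the 10-channel product is assumed.

```python
# Sketch of the six fusion quality metrics on a single grayscale band scaled to 0-255.
import numpy as np
from scipy import ndimage

def fusion_metrics(img):
    img = img.astype(np.float64)

    # Information entropy over 256 gray levels.
    hist, _ = np.histogram(img, bins=256, range=(0, 255))
    p = hist / hist.sum()
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))

    # Spatial frequency from row (RF) and column (CF) differences.
    rf = np.sqrt(np.mean(np.diff(img, axis=1) ** 2))
    cf = np.sqrt(np.mean(np.diff(img, axis=0) ** 2))
    sf = np.sqrt(rf ** 2 + cf ** 2)

    # Mean gradient from horizontal and vertical forward differences.
    dx = img[1:, :-1] - img[:-1, :-1]
    dy = img[:-1, 1:] - img[:-1, :-1]
    mean_grad = np.mean(np.sqrt(dx ** 2 + dy ** 2))

    # Variance of the Laplacian response (sharpness).
    var_lap = ndimage.laplace(img).var()

    # Tenengrad from Sobel gradients.
    gx = ndimage.sobel(img, axis=1)
    gy = ndimage.sobel(img, axis=0)
    tenengrad = np.mean(gx ** 2 + gy ** 2)

    # RMS contrast: standard deviation of luminance.
    rms = img.std()

    return dict(entropy=entropy, sf=sf, mean_grad=mean_grad,
                var_laplacian=var_lap, tenengrad=tenengrad, rms_contrast=rms)
```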

2.3.5. Training Strategy for the Fusion Networks

All four fusion networks were trained using the same protocol. We employed a self-supervised learning approach. This approach uses a composite loss function that promotes per-modal fidelity while enhancing edge details and suppressing artifacts. The objective function is defined as follows:
$L = \lambda_1 L_1 + \lambda_2 L_{MS\text{-}SSIM} + \lambda_3 L_{Grad} + \lambda_4 L_{TV} + \lambda_5 L_{Color_{RGB}} + \lambda_6 L_{Intensity_{RGB}}$
where $L_1$ is the pixel-level L1 loss, used to preserve fidelity to the input data. We set λ = [0.5, 1.0, 0.2, 0.02, 0.5, 0.5]. $L_{MS\text{-}SSIM}$ preserves image structure across scales. $L_{Grad}$ represents the gradient loss, which helps enhance image edges and fine details. $L_{TV}$ stands for the total variation loss, applied to reduce high-frequency noise and artifacts in the fused image. $L_{Color_{RGB}}$ refers to the color loss in the CIELab color space, ensuring color consistency in the fused image. Lastly, $L_{Intensity_{RGB}}$ is the intensity loss, aimed at stabilizing the luminance of the final fused image.
We used AdamW (lr = 3 × 10−4, weight decay = 1 × 10−4) with a 5-epoch warm-up followed by cosine annealing, mixed precision (AMP), a batch size of 2, and gradient clipping at 1.0. Training ran for 80 epochs with early stopping (patience = 10). The fusion networks output a 10-channel fused image at 0.07 m, matching the 10-band input. Model selection was based on a normalized composite of six standard fusion metrics computed on held-out tiles (entropy, spatial frequency, mean gradient, Laplacian variance, Tenengrad, and RMS contrast) rather than on the training loss, ensuring an unbiased model evaluation. Post-processing applied Laplacian-based high-frequency enhancement and adaptive gamma (γ ∈ [0.9, 1.1]) with lightweight RGB color retention.
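For illustration, a simplified sketch of this composite objective is given below: only the L1, gradient, and total variation terms are implemented explicitly, and the MS-SSIM, CIELab color, and RGB intensity terms (as well as the full weighting scheme) would need to be added analogously to reproduce the complete loss.

```python
# Simplified composite fusion loss sketch: L1 fidelity + gradient preservation + TV.
import torch
import torch.nn.functional as F

def gradient(img):
    """Horizontal and vertical forward differences of a (B, C, H, W) tensor."""
    dx = img[..., :, 1:] - img[..., :, :-1]
    dy = img[..., 1:, :] - img[..., :-1, :]
    return dx, dy

def fusion_loss(fused, target, w_l1=0.5, w_grad=0.2, w_tv=0.02):
    # Pixel-level fidelity of the fused output to the stacked input bands.
    l1 = F.l1_loss(fused, target)

    # Gradient loss: preserve edges and fine detail of the reference stack.
    fdx, fdy = gradient(fused)
    tdx, tdy = gradient(target)
    grad = F.l1_loss(fdx, tdx) + F.l1_loss(fdy, tdy)

    # Total variation: suppress high-frequency artifacts in the fused result.
    tv = fdx.abs().mean() + fdy.abs().mean()

    return w_l1 * l1 + w_grad * grad + w_tv * tv
```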

2.4. Image Segmentation Based on Deep Learning

2.4.1. Training Data and Labeling Strategy

Visual interpretation of the optimal fused image was conducted in ArcGIS Pro. Detailed annotations for grassland types and bare soil were produced by integrating texture, spectral, and structural cues with field sample data. To improve label consistency and accuracy, a preliminary supervised classification with a support vector machine (SVM) was performed. Its pixel-level outputs served as pre-annotations to guide and accelerate manual labeling, and the final labels were validated by annotators. The labeled dataset comprises five classes, namely Leymus chinensis, Puccinellia distans, Phragmites australis, Bare land, and Others. The distribution is imbalanced (Figure 4). To mitigate this imbalance during training, we used an objective that combines Focal loss and Dice loss together with class-balanced sampling.
From the fused image, we generated fixed size segmentation image tiles of 256 × 256 pixels with stride 256, that is, no overlap, to remove spatial redundancy and to prevent leakage across data splits. In total, 3000 image and label pairs were produced (Figure 5a–f). The dataset was split once into training, validation, and test sets (70%, 15%, 15%), yielding approximately 2100/450/450 image tiles. To ensure spatial independence, we adopted a geographic split at the parcel or zone level so that tiles from the same parcel and its immediate vicinity appear in only one partition. The validation set was used only for early stopping and model selection, and all performance comparisons and ablations are reported on the held-out test set. Tile lists and a fixed random seed were saved to make the split reproducible and to guarantee that no tile appears in more than one set. This protocol eliminates the information leakage that can occur when overlapped tiling and interleaved indexing are used.
For training-time robustness, we applied lightweight on-the-fly augmentations to the training set only: horizontal and vertical flips with a probability of 0.5, mild brightness and contrast jitter with a probability of 0.5, and tensor normalization. Random cropping was not used because inputs already have a fixed size of 256 × 256. Validation and test tiles underwent only tensor conversion and normalization using training-set channel statistics to avoid distribution shift or implicit leakage.

2.4.2. Segmentation Networks and Attention Mechanism Design

To achieve high-accuracy semantic segmentation, several representative deep learning networks were selected and customized modifications were applied to their structures in this study.
DeepLabV3+ Model: This network utilizes dilated convolutions to effectively enlarge the receptive field, and employs an encoder–decoder structure for multi-scale feature extraction and fusion. The introduction of depthwise separable convolutions effectively reduces computational complexity, and the Atrous Spatial Pyramid Pooling (ASPP) module captures multi-scale contextual information [34].
UNet++ Model: Based on the original UNet architecture, a nested and dense skip connection strategy is employed, fully realizing the fusion of features across different hierarchical levels [35]. Its design advantage lies in its ability to recover fine edge details in complex backgrounds, effectively preventing the loss of semantic information during skip connections.
PSPNet Model: Utilizing a pyramid pooling module, PSPNet hierarchically captures global contextual information and effectively supplements local details [36]. After pyramid pooling, the features are progressively upsampled and fused with shallow-layer features, enhancing the robustness of the model to scale variations.
Feature Pyramid Network (FPN): Employing a top-down structural design and lateral connection strategy, FPN effectively achieves complementarity between high-level semantic information and low-level spatial details, making it particularly suitable for remote sensing image segmentation tasks characterized by significant scale variations [37].
All models were initialized using pre-trained weights to accelerate convergence, with input dimensions and the number of output classes adjusted according to the features of the fused images.
During feature extraction and decoding in base networks, effectively highlighting critical regions amid extensive redundant information has become a core issue limiting model accuracy. Therefore, three types of attention mechanisms were introduced and compared across different network architectures in this study (Figure 6).
The squeeze and excitation (SE) block compresses spatial information into a channel descriptor by global average pooling, and channel weights are then produced by a two layer fully connected network with ReLU and Sigmoid activations [32]. This process enables adaptive adjustment of channel importance and thus feature recalibration. When embedded in the network, the SE block improves recognition of subtle targets, especially in multi-source fusion, and yields better representation of critical spectral bands.
Compared with the SE block, the efficient channel attention (ECA) block uses one dimensional convolution to model interactions within local channel ranges, providing higher computational efficiency with fewer parameters [38]. The ECA block derives channel weights directly from the convolution kernel, balancing local sensitivity and global context and reducing redundancy in complex multi-channel data.
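A minimal sketch of the ECA block follows; the kernel size of 3 is an assumed setting, since the text specifies only the one-dimensional convolution over local channel ranges.

```python
# ECA block sketch: channel weights from a 1D convolution over the globally pooled
# channel descriptor, providing local cross-channel interaction with few parameters.
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    def __init__(self, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # (B, C, H, W) -> (B, C) -> (B, 1, C) for the 1D convolution across channels.
        y = x.mean(dim=(2, 3))                 # global average pooling
        y = self.conv(y.unsqueeze(1))          # local cross-channel interaction
        w = self.sigmoid(y).squeeze(1).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return x * w
```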
The convolutional block attention module (CBAM) combines channel attention and spatial attention. In the channel attention stage, global max pooling and global average pooling generate channel descriptors, and weights are computed by a shared multilayer perceptron (MLP). In the spatial attention stage, feature maps are pooled along the channel dimension (max and average) and then convolved to obtain a spatial attention map [33]. CBAM captures global semantics while highlighting critical spatial regions, which yields finer segmentation.
During implementation, after extracting high-dimensional features in the encoder part, the selected attention modules were integrated into corresponding feature layers. Taking UNet++ as an example (Figure 7), SE Blocks were introduced at each skip connection, allowing the fusion process to preserve original low-level features while enhancing the representation of critical regions guided by high-level semantic information. The model output was progressively upsampled and reconstructed through multiple decoder layers, and the final segmentation probability map was obtained via softmax normalization. The performances of various networks with different combinations of attention mechanisms were compared and evaluated through experiments to determine the optimal combination.

2.4.3. Loss Function Design

To address issues such as class imbalance and boundary ambiguity in remote sensing image segmentation tasks, a combined loss function, termed Focal Dice Loss, was designed:
Focal Loss introduces a modulating parameter γ to the conventional cross-entropy loss, reducing the contribution from easily classified samples and enabling the model to pay greater attention to difficult samples, thus improving recognition accuracy for minority classes [39]. Dice Loss directly measures the overlap between predicted segmentation regions and ground truth labels using the Dice coefficient, effectively optimizing segmentation boundaries and regional segmentation accuracy [40].
Focal Dice Loss is a linear combination of Focal Loss and Dice Loss with predefined weights, simultaneously addressing the class imbalance issue and enhancing region consistency, thereby improving overall segmentation performance [41]. The formula is expressed as follows:
$L_{Dice} = 1 - \frac{2 \sum_{i=1}^{N} p_i y_i + \epsilon}{\sum_{i=1}^{N} p_i + \sum_{i=1}^{N} y_i + \epsilon}$
where $p_i$ and $y_i$ denote the predicted value and ground truth label at pixel $i$, respectively, and $\epsilon$ prevents division by zero.
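A sketch of the combined objective is shown below, assuming softmax outputs and integer class labels; γ = 2 and an equal-weight combination (α = 0.5) follow the settings reported in Section 3.2.2, while other details are assumptions.

```python
# Focal Dice loss sketch: focal term down-weights easy pixels, dice term measures
# soft overlap between predicted probabilities and one-hot labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalDiceLoss(nn.Module):
    def __init__(self, num_classes, gamma=2.0, alpha=0.5, eps=1e-6):
        super().__init__()
        self.num_classes, self.gamma, self.alpha, self.eps = num_classes, gamma, alpha, eps

    def forward(self, logits, target):
        # Focal term: scale pixel-wise cross-entropy by (1 - p_t)^gamma.
        ce = F.cross_entropy(logits, target, reduction="none")          # (B, H, W)
        pt = torch.exp(-ce)
        focal = ((1.0 - pt) ** self.gamma * ce).mean()

        # Dice term: soft overlap computed per class, then averaged.
        probs = F.softmax(logits, dim=1)                                # (B, K, H, W)
        onehot = F.one_hot(target, self.num_classes).permute(0, 3, 1, 2).float()
        inter = (probs * onehot).sum(dim=(0, 2, 3))
        union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
        dice = 1.0 - ((2.0 * inter + self.eps) / (union + self.eps)).mean()

        return self.alpha * focal + (1.0 - self.alpha) * dice
```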

2.4.4. Training Strategy and Experimental Setup

The experiment was conducted on a Windows 10 operating system equipped with 32 GB RAM and an NVIDIA GeForce RTX 3070 GPU. Python 3.8 and PyTorch 2.2.0 libraries were utilized. A series of optimization strategies were adopted to enhance efficiency and effectiveness. Firstly, mixed precision training technology was employed, accelerating forward and backward propagation using half-precision computation, while significantly reducing GPU memory usage. Secondly, the Adam optimizer was selected due to its adaptive learning rate adjustment capability, facilitating rapid model convergence, with an initial learning rate set to 1 × 10−4. Simultaneously, the StepLR learning rate scheduler was applied to gradually decrease the learning rate during later training phases, ensuring model parameter stabilization. Additionally, the batch size was set to 4, and multi-threaded data loading was implemented via DataLoader to ensure efficient synchronization between data transfer and GPU computations. Finally, multiple experiments with various model combinations were designed to compare different base networks and attention modules. Each experiment was conducted under identical hyperparameter settings, including training epochs, batch size, and learning rate decay. Training loss, validation metrics, and convergence conditions were recorded for each epoch (Figure 8). The optimal combination was ultimately determined by evaluating validation metrics, such as mIoU. The trends in training and validation losses for each model indicated convergence within 100 epochs, demonstrating sufficient training.
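The following condensed training-loop sketch reflects the reported setup (Adam at 1 × 10−4, StepLR decay, mixed precision, batch size 4); the StepLR step size and decay factor, the dataset and model objects, and the validation routine are placeholders.

```python
# Condensed training loop sketch with mixed precision, Adam, and StepLR scheduling.
import torch
from torch.utils.data import DataLoader

def train(model, train_set, epochs=100, device="cuda"):
    model = model.to(device)
    criterion = FocalDiceLoss(num_classes=5)          # combined objective sketched in Section 2.4.3
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
    scaler = torch.cuda.amp.GradScaler()
    loader = DataLoader(train_set, batch_size=4, shuffle=True, num_workers=4)

    for epoch in range(epochs):
        model.train()
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            with torch.cuda.amp.autocast():            # half-precision forward pass
                loss = criterion(model(images), labels)
            scaler.scale(loss).backward()              # scaled backward pass
            scaler.step(optimizer)
            scaler.update()
        scheduler.step()                               # gradual learning-rate decay
        # Validation metrics (mIoU, OA, ...) would be computed here for model selection.
```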

2.4.5. Evaluation Metrics

To comprehensively evaluate the performance and relative advantages of the proposed method, this study employed accuracy metrics, including Mean Intersection over Union (mIoU), Overall Accuracy (OA), Precision, Recall, F1-score, and Kappa coefficient [42,43]. The definitions and formulas for each metric are provided as follows:
Mean Intersection over Union (mIoU): In image segmentation tasks, Intersection over Union (IoU) is a crucial metric for measuring the overlap between predicted regions and ground truth regions. For a single class, IoU is defined as the ratio between the area of intersection and the area of union of the predicted and ground truth regions. mIoU is the average IoU across all classes, expressed by the following formula:
$mIoU = \frac{1}{K} \sum_{i=1}^{K} \frac{TP_i}{TP_i + FP_i + FN_i}$
where $TP_i$ denotes the number of true positives for class $i$, $FP_i$ the number of false positives, $FN_i$ the number of false negatives, and $K$ the total number of classes.
Overall Accuracy (OA): This metric denotes the proportion of correctly classified pixels relative to the total number of pixels. The formula is expressed as follows:
$\mathrm{Overall\ Accuracy} = \frac{\sum_{i=1}^{K} (TP_i + TN_i)}{\sum_{i=1}^{K} (TP_i + FP_i + FN_i + TN_i)}$
Precision: Precision denotes the proportion of pixels predicted as a specific class that truly belong to that class. The formula is expressed as follows:
$\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i}$
Recall: Recall denotes the proportion of pixels that truly belong to a specific class and are correctly predicted as such. The formula is expressed as follows:
$\mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i}$
F1-score: The F1-score is the harmonic mean of Precision and Recall, comprehensively reflecting the model’s performance in terms of precision and recall. The formula is expressed as follows:
$F1\text{-}score_i = \frac{2 \times \mathrm{Precision}_i \times \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}$
Kappa coefficient: The Kappa coefficient (Cohen’s Kappa) is used to measure the agreement between classification results and random classification. It is defined as the ratio of the difference between observed accuracy and random accuracy to the maximum possible difference. The formula is expressed as follows:
$Kappa = \frac{p_o - p_e}{1 - p_e}$
where $p_o$ denotes the observed overall accuracy, and $p_e$ denotes the expected (random) accuracy calculated from the class distribution.
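A compact sketch computing these metrics from a multi-class confusion matrix is given below; it assumes rows index ground truth classes and columns index predictions.

```python
# Metric computation sketch from a K x K confusion matrix.
import numpy as np

def segmentation_metrics(conf):
    conf = conf.astype(np.float64)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp      # predicted as class i but belonging elsewhere
    fn = conf.sum(axis=1) - tp      # belonging to class i but predicted elsewhere
    total = conf.sum()

    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    oa = tp.sum() / total           # for the multi-class case, OA reduces to trace / total
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)

    return dict(mIoU=iou.mean(), OA=oa, Precision=precision.mean(),
                Recall=recall.mean(), F1=f1.mean(), Kappa=kappa)
```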

3. Results

3.1. Image Fusion Quality Evaluation

Through detailed observation and analysis of the results from four fusion methods (Figure 9a–d), it can be found that the MSPF-Net method indeed demonstrated the optimal fusion performance. This method, based on a multi-scale feature extraction strategy using pyramid decomposition, effectively integrated spatial detail information from multi-source data while preserving the spectral characteristics of the original images. The fused result exhibited good color restoration and natural transitional features, preserving the true color tones of vegetation and enhancing the distinguishability among different grassland types, thus providing a reliable data foundation for subsequent grassland classification tasks.
Although the DAEF-Net method performed well in local feature extraction, subtle patch-processing traces were observed upon progressive magnification. Such mosaic-like boundary effects could adversely affect the accuracy of subsequent classification tasks. Both the RCAF-Net and DAF-Net methods exhibited noticeable overexposure phenomena, resulting in excessively high overall image brightness and the loss of partial spectral information. This phenomenon might be attributed to the attention mechanism excessively emphasizing the contributions of certain channels during feature weight assignment, causing the fused results to deviate from the spectral characteristics of the original images. Overexposure not only affects visual quality but also leads to spectral homogenization among different grassland types, negatively impacting subsequent classification tasks.
According to the results of six image quality evaluation metrics (Table 1), we conduct a comprehensive assessment of the four fusion methods and include an unfused ten-band reference for comparison. The unfused reference was produced by directly stacking the coregistered bands, followed by the same radiometric normalization and resampling as applied to the fused products, with no feature level fusion or enhancement. This design ensures a fair and consistent evaluation across all metrics. In terms of information entropy, the MSPF-Net method achieved the highest value of 6.84, indicating that its fused image contained the richest information, which is beneficial for subsequent image interpretation and analysis. The information entropy values of the other three methods were similar but slightly lower than that of MSPF-Net. On the metrics of spatial frequency and mean gradient, both of which reflect image clarity and texture features, the MSPF-Net method again showed optimal performance, achieving values of 15.56 and 12.54, respectively. This demonstrates that the MSPF-Net method has clear advantages in preserving and enhancing image edge features. The variance of Laplacian metric reflects image detail information; the MSPF-Net method obtained a value of 5022.92, significantly higher than the other three methods, indicating superior performance in preserving details. This result aligns with the previous qualitative analysis. Among the Tenengrad and RMS contrast metrics, which measure image clarity and contrast, the MSPF-Net method achieved the highest values of 76.42 and 31.22, respectively. This indicates that the MSPF-Net method not only preserved good image contrast but also effectively enhanced edge features.
In summary, the MSPF-Net method achieved optimal values across all six evaluation metrics, clearly indicating its significant advantages in information preservation, texture enhancement, and detail representation, which is highly consistent with the previous qualitative analysis results. Therefore, selecting the fused results obtained by the MSPF-Net method as the foundational data for subsequent grassland classification is justified.

3.2. Model Comparison and Ablation Experiments

3.2.1. Overall Performance of the Combination of Multi-Model Multi-Attention Mechanisms

In this study, four mainstream deep segmentation networks—DeepLabV3+, UNet++, PSPNet, and FPN—were selected, and three attention mechanisms, SE, ECA, and CBAM, were, respectively, introduced, constructing 12 different combination models (Table 2). All combinations were iteratively trained based on the same fused image dataset, with uniform data augmentation and training strategies (including mixed precision training and Focal Dice Loss), and evaluated comprehensively on the validation set. As shown in Table 2, after comparing metrics such as mIoU, Accuracy, Precision, Recall, F1-score, and Kappa coefficient, the UNet++ series demonstrated superior performance on multiple metrics, especially in edge delineation and small patch identification. DeepLabV3+ performed slightly lower overall, although it was close to or slightly behind UNet++ on some metrics. PSPNet and FPN exhibited average performance on this multi-source fused grassland dataset, possibly due to their limitations in extracting high-resolution features or integrating local textures compared to UNet++. Regarding performance improvement from attention modules, SE (Squeeze-and-Excitation) typically provided the most stable and significant performance gain. ECA (Efficient Channel Attention) and CBAM (Channel and Spatial Attention) also enhanced accuracy, but slightly lagged behind SE in a few ground object categories, exhibiting slightly lower stability. In conclusion, the combination of UNet++ and SE achieved the best overall evaluation metrics: mIoU exceeded 77%, OA consistently ranked at the top, and metrics such as F1-score and Kappa also showed superior performance. While combinations of UNet++ with ECA or CBAM achieved good results, they were slightly inferior compared to SE.
Based on the overall performance described above, this study further selected the top three models (UNet++_SE, UNet++_ECA, UNet++_CBAM) from the comprehensive results of evaluation metrics for a more detailed comparison and validation.
From a typical sample image randomly selected from the dataset, it can be observed that the three optimal models showed high consistency in identifying primary ground object types (e.g., Bare land, Leymus chinensis, Puccinellia distans), and generally matched well with the ground truth (Figure 10). The networks exhibited relatively stable recognition in large continuous regions, though slight misclassification might still occur at edges or small-area regions. Compared to ECA and CBAM, the prediction results of UNet++_SE showed superior smoothness and continuity at patch boundaries and small fragmented areas, indicating that the SE channel attention mechanism had a more pronounced effect in enhancing critical features. This visual difference partly confirmed that the global channel weighting in the SE mechanism could effectively emphasize principal features and suppress noise in heterogeneous grassland scenes, providing a greater advantage in depicting complex boundaries.
After selecting UNet++ with SE as the final configuration based on the validation set, we evaluated its generalization on the spatially independent test set defined in Section 2.4.1. The scores on validation and test are close, with mIoU 0.77 and 0.76, OA 0.87 and 0.86, Macro F1 0.81 and 0.80, and Kappa 0.82 and 0.81 (Figure 11); each difference equals 0.01. These results indicate limited overfitting and reliable transfer to unseen parcels under the non-overlapping tiling protocol. All performance reported hereafter refers to the held-out test set so that our conclusions reflect independent and spatially robust evaluation.

3.2.2. Loss Functions: Training Dynamics and Test Performance

Using UNet++ with SE and the split described in Section 2.4.1, we examined how the loss function influences optimization and accuracy. Three objectives were considered: Dice, Focal with γ = 2, and Focal Dice with α = 0.5. The training curves show rapid convergence within about thirty epochs; Focal Dice reaches the lowest terminal training loss at approximately 0.19, followed by Focal at about 0.20 and Dice at about 0.22 (Figure 12a). The validation mIoU presents the same ordering and stabilizes near 78% for Focal Dice (Figure 12b).
On the test set, the summary metrics further support this choice (Table 3). Dice achieves mIoU 75.90%, OA 86.50%, F1 82.60%, and Kappa 80.70%; Focal achieves 76.40%, 86.90%, 83.10%, and 81.20%; Focal Dice achieves 77.70%, 87.50%, 84.20%, and 82.50%, respectively. Relative to Dice, Focal Dice improves mIoU by 1.80, OA by 1.00, F1 by 1.60, and Kappa by 1.80 percentage points; the gains over Focal are 1.30, 0.60, 1.10, and 1.30 percentage points, respectively. For completeness, we also report the test learning dynamics of the selected configuration (Figure 12c,d): the test loss decreases smoothly to about 0.19, and the test mIoU rises to about 78% by epoch thirty and then remains stable. The test curves are shown only for independent assessment and were not used for model selection.

3.2.3. Specific Category Accuracy Assessment

To better investigate the classification effectiveness of different models on specific ground object categories, we assessed accuracy per class on the test set under the spatially independent split. For three selected configurations, the producer’s accuracy (PA) and user’s accuracy (UA) for each category were calculated (Table 4). For the two grassland categories with relatively large proportions, Leymus chinensis and Puccinellia distans, both PA and UA of UNet++_SE remained at a high level (>80%), indicating relatively low omission and commission errors. Both ECA and CBAM exhibited similar or slightly superior performance to SE in certain individual categories (e.g., Bare land), but their overall distribution consistency was slightly inferior. For the “Others” category or Phragmites australis, which partially overlaps with Leymus chinensis, all three models showed some cross-class confusion in recognition, yet UNet++_SE still slightly outperformed the others in most statistical metrics.
In addition to PA and UA, the F1-score is also an important metric for evaluating segmentation quality (Figure 13). The three optimal models achieved high F1-scores for major categories such as Bare land, Puccinellia distans, and Leymus chinensis, while the F1-scores for Phragmites australis and the “Others” category were relatively lower. However, UNet++ combined with SE still performed better or at least equally well compared to the best results in this regard. These results further indicate that the SE module can enhance the recognition capability for typical grassland categories at the channel level, effectively balancing detailed features and overall semantic segmentation performance.

3.2.4. Attention Mechanism Ablation Study

To quantify the contribution of the attention mechanism, we trained a UNet++ model without attention under identical data division, training procedure, and hyperparameters. On the test set, UNet++ with SE achieves mIoU 77.68%, OA 86.98%, Precision 83.16%, Recall 79.87%, F1 score 81.48%, and Kappa 82.47%, whereas the plain UNet++ attains 73.95%, 83.76%, 79.82%, 76.48%, 78.12%, and 75.84%, respectively. Incorporating the SE module therefore yields absolute gains of 3.73% in mIoU, 3.22% in OA, 3.34% in Precision, 3.39% in Recall, 3.36% in F1 score, and 6.63% in Kappa (Table 5), confirming a clear positive effect on segmentation performance in this setting.
Under the same UNet++ decoder and the fused ten-channel stack, efficient channel attention (ECA) yielded 72.97% mIoU, 84.32% OA, 77.64% F1-score, and 74.68% Kappa, while the convolutional block attention module (CBAM) yielded 72.45%, 83.59%, 77.23%, and 73.85% for the same metrics; both therefore fell short of the SE configuration reported above.
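As a reference for the ablated module, a minimal squeeze-and-excitation block [32] is sketched below in PyTorch. The reduction ratio and the placement within the UNet++ decoder follow common practice and are assumptions, not details taken from the released implementation.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention: global pooling followed by a
    bottleneck excitation that rescales each channel."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global spatial context
        self.fc = nn.Sequential(                     # excitation: per-channel gates
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(self.pool(x))                    # (N, C, 1, 1) channel weights
        return x * w                                 # recalibrated feature map
```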
In summary, UNet++_SE consistently outperformed the attention-free UNet++ baseline across all evaluation metrics. Therefore, subsequent whole-image predictions were conducted with UNet++_SE as the optimal network configuration.

3.2.5. Segmentation-Aware Modality Occlusion Sensitivity

To assess the task-level value of each sensing source without relying on image quality assessment (IQA), an inference-time occlusion study was conducted with the trained UNet++ with SE kept fixed; the model had been trained on the full ten-channel input. At each step, one modality group (RGB; the vegetation indices NDVI, GNDVI, RVI, VCI, and PVI; TIR; CHM) was occluded by replacing all of its channels with their per-channel training-set means, while the tiling and normalization pipeline was kept identical to the baseline. The percentage change in mIoU relative to the full input was then reported (Figure 14). Mean replacement preserves the marginal statistics of each channel and avoids the out-of-distribution artifacts introduced by zero masking.
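A minimal sketch of this mean-replacement occlusion is given below. The channel indices of the fused stack, the array names, and the surrounding evaluation call are illustrative assumptions rather than the exact analysis script.

```python
import numpy as np

# Hypothetical channel layout of the fused ten-channel stack (illustrative only)
GROUPS = {
    "RGB": [0, 1, 2],
    "indices": [3, 4, 5, 6, 7],   # NDVI, GNDVI, RVI, VCI, PVI
    "TIR": [8],
    "CHM": [9],
}

def occlude_group(batch, group, channel_means):
    """Replace one modality group with its per-channel training-set means.

    batch: (N, C, H, W) normalized input tiles; channel_means: length-C array of
    training-set means expressed in the same normalized space.
    """
    occluded = batch.copy()
    for c in GROUPS[group]:
        occluded[:, c, :, :] = channel_means[c]
    return occluded

# Sensitivity is the percentage change in mIoU relative to the full input, e.g.
# delta = 100.0 * (miou(occlude_group(x, "CHM", mu)) - miou_full) / miou_full
```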
As shown in Figure 14, the validation and test sets exhibited consistent changes: CHM, −4.30% and −4.10%; indices, −3.50% and −3.20%; RGB, −2.10% and −2.00%; and TIR, −0.90% and −0.80%. This ordering indicates that structural cues from the CHM were the most informative for delineating patch boundaries and separating spectrally similar communities. Vegetation indices added discrimination beyond RGB by enhancing vigor-related contrast, while thermal infrared contributed smaller, context-dependent gains, for example under shade or strong thermal gradients. This segmentation-aware sensitivity was therefore used as the primary evidence when discussing the value of sensing sources and fusion choices, while IQA was reserved for qualitative diagnosis of visual artifacts.

3.3. Panoramic Grassland Mapping Verification

After identifying UNet++_SE as the optimal combination, we applied it to the entire multi-source fused image to validate its practical value in operational and large-scale monitoring scenarios (Figure 15). Given the extensive coverage of the UAV imagery, a sliding-window cropping scheme fed image blocks sequentially into the network for inference, and the block predictions were subsequently merged. The resulting large-scale segmentation generally agreed with manual interpretation, and UNet++_SE exhibited high boundary discriminability among the major grassland classes, especially between vegetation types with high spectral similarity such as saline-alkali grassland and Leymus chinensis. The omission error rate remained within an acceptable range. Although minor misclassification occurred at some transitional patches and edge pixels, the predictions overall demonstrated good continuity and plausibility, providing concrete technical support for subsequent grassland ecological monitoring and ground-feature investigation.
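The whole-scene prediction can be sketched as non-overlapping tiled inference in which the per-tile class maps are written back into a full-size mosaic. The helper below is a simplified illustration; the tile size, reflect padding, and function names are assumptions and do not reproduce the exact operational script.

```python
import numpy as np
import torch

@torch.no_grad()
def predict_full_scene(model, image, tile=256, device="cuda"):
    """Slide a tile x tile window over a (C, H, W) fused image and merge the
    per-tile argmax predictions into a single (H, W) class map."""
    model.eval()
    c, h, w = image.shape
    pad_h, pad_w = (-h) % tile, (-w) % tile                 # pad to a multiple of tile
    padded = np.pad(image, ((0, 0), (0, pad_h), (0, pad_w)), mode="reflect")
    out = np.zeros(padded.shape[1:], dtype=np.int64)
    for y in range(0, padded.shape[1], tile):
        for x in range(0, padded.shape[2], tile):
            block = torch.from_numpy(padded[:, y:y + tile, x:x + tile]).float()
            logits = model(block.unsqueeze(0).to(device))
            out[y:y + tile, x:x + tile] = logits.argmax(1).squeeze(0).cpu().numpy()
    return out[:h, :w]                                      # crop the padding back off
```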

4. Discussion

4.1. The Technical Value of Multiscale Fusion

The Multi-Scale Pyramid Fusion Network (MSPF-Net) proposed in this study demonstrated significant advantages in integrating multi-source UAV remote sensing data. Its fusion results achieved the best values in all six evaluation metrics, including image entropy, spatial frequency, and mean gradient, and visually realized a coordinated integration of structural details and spectral features. These results indicate that in complex surface scenarios, such as coexisting saline-alkali grasslands, Leymus chinensis meadows, and hygrophilous reed vegetation, the pyramid structure effectively addresses fusion difficulties arising from mismatches between structural and textural scales. By introducing a pyramid pooling module at intermediate layers, the network hierarchically extracts and fuses global semantic context and local edge textures at different scales, thereby overcoming the information redundancy and structural ambiguity caused by spatial-frequency mismatches in traditional same-layer concatenation. This result is consistent with the conclusion of Lu et al. [44], whose MFPF-Net for remote sensing change detection uses layer-wise feature aggregation to improve boundary details. Recent work further confirms the wide applicability and clear benefits of multi-scale feature fusion in multi-source and multi-modal remote sensing tasks. BEMS-UNetFormer placed a Multi-Scale Cascaded ASPP module at the encoder–decoder junction and achieved mIoU scores of 86.12% and 83.10% on the Potsdam and Vaihingen datasets, respectively, significantly outperforming its baseline [45]. MSANet integrated SE attention with ASPP to capture cross-scale contextual information, reaching an F1-score of 93.76% and an IoU of 88.25% on the WHU aerial image dataset while remaining superior on the WHU satellite image dataset [46]. For hyperspectral–LiDAR fusion, MAHiDF-Net employed a staged dense attention fusion strategy and outperformed single-modal and shallow fusion methods overall [47]. Together, these studies demonstrate that progressive multi-scale fusion reconciles high-level and low-level feature representations, alleviates the strong structural heterogeneity of grassland patches, and underscores the necessity of feature-level multi-scale fusion for enhancing representation in complex scenarios. The fusion strategy adopted here resolved the scale inconsistency among multi-modal data and highlights the substantial potential of multi-scale feature fusion for improving the accuracy of remote sensing monitoring of complex grasslands.
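For concreteness, the pyramid pooling mechanism discussed above can be sketched as follows. The pooling scales and channel arithmetic follow the generic PSPNet formulation [36] and are illustrative; they are not the exact MSPF-Net configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pool features at several scales, project, upsample, and concatenate, so that
    global semantic context and local texture are fused in one representation."""

    def __init__(self, in_ch, scales=(1, 2, 3, 6)):
        super().__init__()
        branch_ch = in_ch // len(scales)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(s),                       # global-to-coarse context
                nn.Conv2d(in_ch, branch_ch, kernel_size=1),    # channel reduction
                nn.ReLU(inplace=True),
            )
            for s in scales
        ])
        self.project = nn.Conv2d(in_ch + branch_ch * len(scales), in_ch, 3, padding=1)

    def forward(self, x):
        h, w = x.shape[2:]
        pyramids = [x] + [
            F.interpolate(branch(x), size=(h, w), mode="bilinear", align_corners=False)
            for branch in self.branches
        ]
        return self.project(torch.cat(pyramids, dim=1))        # fused multi-scale features
```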

4.2. Channel Attention Gain in High Resolution Segmentation

This study further compared various mainstream segmentation networks combined with different attention mechanisms. In the semantic segmentation stage, the UNet++ model incorporating SE channel attention achieved the highest mIoU, overall accuracy, and F1-score, significantly outperforming baseline models such as DeepLabV3+. This model combination performed prominently in delineating boundaries and fragmented patches within complex scenarios, making it particularly suitable for classification tasks using multi-scale and multi-modal remote sensing data. The two-step mechanism of SE, which involves global average pooling (squeeze) and 1 × 1 convolution excitation, adaptively adjusts channel weights according to the scene context, thereby enhancing inter-modal complementarity and suppressing noise interference. Consequently, this significantly reduces the omission rate in scenarios with high spectral overlap, such as grassland-bare soil and grassland-shrub interfaces. The effectiveness of SE-based attention mechanisms in remote sensing segmentation has been validated by multiple independent studies. Aburaed et al. [48] embedded SE into RUNet (SE-RUNet), increasing mIoU from 0.456 to 0.490 on the Dubai Aerial dataset. Wang et al. [26] proposed AFF-UNet by introducing an adaptive feature fusion and channel attention module for the Potsdam dataset, achieving an average F1-score increase of 1.09% compared with DeepLabV3+, significantly reducing confusion between buildings and roads. Under multi-modal input conditions, Jiang et al. [49] proposed DCAM, which separately fed RGB and NIR into dual-channel attention branches, increasing mIoU from 0.331 (DeepLabV3 baseline) to 0.503 on the GID-15 dataset. Additionally, Duan et al. [50] developed FEM-Attention based on “feature entropy weighting,” achieving a stable mIoU improvement of 2% in binary change detection tasks. In summary, these independent studies consistently support the findings of this research, indicating that low-cost, plug-and-play channel attention mechanisms (especially SE and its derivatives) remain the cost-effective first choice for improving accuracy and robustness in high-resolution remote sensing semantic segmentation. They are particularly effective for distinguishing boundaries between grassland-bare soil and grassland-shrub interfaces in scenarios involving high spectral overlap or complex backgrounds.
Coupling SE channel attention with UNet++ achieved the highest mIoU, demonstrating that lightweight recalibration mechanisms still offer cost-effective advantages. However, more general visual foundation models have recently begun to enter the remote sensing field. Zhang et al. [51] designed a multi-scale adapter for the Segment Anything Model (SAM), increasing the Dice metric by 0.8–6.2% across three datasets. Liu et al. [52] integrated a CNN encoder with a Mamba-based state-space decoder, addressing the contradiction between the high computational cost of Transformers and weak global dependencies of CNNs. Ma et al. [53] fine-tuned the SAM encoder using a “multi-modal adapter,” demonstrating for the first time that elevation information such as DSM could synergize with semantic knowledge from large-scale foundation models, achieving an additional improvement of 2–4% in average mIoU on ISPRS Vaihingen and Potsdam datasets. These studies suggest that future grassland mapping frameworks could integrate cross-modal priors from large-scale pre-trained models on top of lightweight attention modules, further enhancing the recognition accuracy for small patches and complex terrains.

4.3. Analysis of Method Limitations

Although the overall model performance was excellent, several sources of error still require attention. For example, Phragmites australis and Leymus chinensis exhibit similar high near-infrared reflectance characteristics during their vigorous summer growth period and have convergent morphological features in lodging or mature stages, causing dual confusion in CHM and spectral characteristics. Additionally, this study involved only one UAV flight, making it impossible to utilize temporal information regarding the early senescence of Phragmites australis. Furthermore, the model was trained and validated within a single protected area, and its mIoU dropped by up to 12% when transferred across regions, indicating that geographical generalization capability remains to be improved. Due to GPU memory limitations, inference was performed using small sliding windows of 256 × 256 pixels, potentially truncating elongated reed patches and affecting connectivity judgments. To mitigate these issues, future studies could incorporate multi-temporal imagery to leverage phenological differences for enhanced discrimination or combine time-series modeling approaches, such as spatiotemporal encoders integrating transformer structures with LSTM, to improve the model’s ability to characterize species’ growth rhythms.
The limitations of this study mainly lie in the absence of multi-temporal data and the limited cross-regional generalization capability. At the regional scale in China, vegetation greening can reshape summer temperature and rainfall by altering moisture transport, which further justifies multi-temporal observations to capture seasonal contrasts [54]. With only one UAV flight, this study could not capture seasonal characteristics such as the early senescence of Phragmites australis. AMFNet’s dual-temporal attention pyramid effectively integrated temporal semantics and reduced pseudo-changes in multi-temporal change detection tasks [55]. Future work will therefore adopt a three-window acquisition strategy covering early green-up, peak biomass, and pre-senescence to capture phenological trajectories and improve class separability and model robustness. Incorporating Bayesian uncertainty estimation or dynamic loss weighting is expected to further reduce prediction variance in hard-to-classify regions within future multi-temporal, multi-modal grassland sequences and to provide prior constraints for transfer learning to other regions. In addition, hyperspectral data could be introduced to strengthen the discrimination of leaf biochemical information, and domain-adaptive learning methods could be explored to improve model generalization. Combined with active learning to reduce labeling workloads, the methodological framework proposed in this study can be developed into a technical tool more broadly applicable to grassland ecological monitoring and management.

5. Conclusions

This study proposed a multi-source remote sensing image fusion and segmentation method integrating deep learning, providing an innovative solution for processing complex remote sensing images in grassland ecological monitoring. Our results demonstrated that the MSPF-Net method exhibited significant advantages in image fusion tasks, effectively integrating multi-source remote sensing data, preserving detailed information, and enhancing image clarity and edge features. The successful application of this method provided a reliable data foundation for grassland ecological monitoring.
In terms of image segmentation, the combination of the UNet++ network with the SE attention mechanism exhibited the best performance, with clear advantages in handling details and boundaries. By adaptively weighting channel features, the SE module enhanced the model’s recognition of critical regions, thereby improving overall classification accuracy and offering an effective solution for classifying complex ground objects and recognizing fine details. Nevertheless, certain limitations remain: the model still exhibited some misclassification and omission errors, especially where ground-object categories are spectrally similar. Future research could therefore refine network architectures through finer feature extraction and multi-level contextual information fusion to improve performance in fine-grained classification tasks. Furthermore, with the continuous advancement of remote sensing technology and data acquisition methods, future studies could incorporate additional sensor types and more diverse fusion techniques to further enhance model performance and broaden the scope of application.
In conclusion, this study provides novel ideas and methods for the deep learning-based processing of remote sensing imagery, holding significant application value particularly in the field of grassland ecological monitoring. In the future, this method is expected to be extended to additional fields, providing reliable technical support for ecological monitoring, resource management, disaster assessment, and other related applications.

Author Contributions

Conceptualization, L.Y.; methodology, Z.C. and L.Y.; software, Z.C.; validation, Z.C.; formal analysis, Z.C.; investigation, X.G., Z.Y. and J.L.; resources, L.Y.; data curation, Z.C.; writing—original draft preparation, Z.C.; writing—review and editing, Z.C., C.W., L.B. and L.Y.; visualization, Z.C.; supervision, C.W. and L.Y.; project administration, L.Y.; funding acquisition, H.M. and L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Key Research and Development Program of China (2022YFF1300601) and the Youth Innovation Promotion Association of Chinese Academy of Sciences (Grant No. 2023240).

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to ongoing follow-up analyses within the project; they will be made publicly available after project completion.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Bardgett, R.D.; Bullock, J.M.; Lavorel, S.; Manning, P.; Schaffner, U.; Ostle, N.; Chomel, M.; Durigan, G.; Fry, E.L.; Johnson, D.; et al. Combatting Global Grassland Degradation. Nat. Rev. Earth Environ. 2021, 2, 720–735. [Google Scholar] [CrossRef]
  2. Zhang, M.; Sun, J.; Wang, Y.; Li, Y.; Duo, J. State-of-the-Art and Challenges in Global Grassland Degradation Studies. Geogr. Sustain. 2025, 6, 100229. [Google Scholar] [CrossRef]
  3. Yu, L.; Liu, Y.; Shen, M.; Yu, Z.; Li, X.; Liu, H.; Lyne, V.; Jiang, M.; Wu, C. Extreme Hydroclimates Amplify the Biophysical Effects of Advanced Green-up in Temperate China. Agric. For. Meteorol. 2025, 363, 110421. [Google Scholar] [CrossRef]
  4. Stumpf, F.; Schneider, M.K.; Keller, A.; Mayr, A.; Rentschler, T.; Meuli, R.G.; Schaepman, M.; Liebisch, F. Spatial Monitoring of Grassland Management Using Multi-Temporal Satellite Imagery. Ecol. Indic. 2020, 113, 106201. [Google Scholar] [CrossRef]
  5. Dujakovic, A.; Watzig, C.; Schaumberger, A.; Klingler, A.; Atzberger, C.; Vuolo, F. Enhancing Grassland Cut Detection Using Sentinel-2 Time Series through Integration of Sentinel-1 SAR and Weather Data. Remote Sens. Appl. Soc. Environ. 2025, 37, 101453. [Google Scholar] [CrossRef]
  6. Bartold, M.; Kluczek, M.; Wróblewski, K.; Dąbrowska-Zielińska, K.; Goliński, P.; Golińska, B. Mapping Management Intensity Types in Grasslands with Synergistic Use of Sentinel-1 and Sentinel-2 Satellite Images. Sci. Rep. 2024, 14, 32066. [Google Scholar] [CrossRef]
  7. Luo, Y.; Guan, K.; Peng, J.; Wang, S.; Huang, Y. STAIR 2.0: A Generic and Automatic Algorithm to Fuse Modis, Landsat, and Sentinel-2 to Generate 10 m, Daily, and Cloud-/Gap-Free Surface Reflectance Product. Remote Sens. 2020, 12, 3209. [Google Scholar] [CrossRef]
  8. Díaz-Ireland, G.; Gülçin, D.; Lopez-Sanchez, A.; Pla, E.; Burton, J.; Velázquez, J. Classification of Protected Grassland Habitats Using Deep Learning Architectures on Sentinel-2 Satellite Imagery Data. Int. J. Appl. Earth Obs. Geoinf. 2024, 134, 104221. [Google Scholar] [CrossRef]
  9. Rivas, H.; Touchais, H.; Thierion, V.; Millet, J.; Curtet, L.; Fauvel, M. Nationwide Operational Mapping of Grassland First Mowing Dates Combining Machine Learning and Sentinel-2 Time Series. Remote Sens. Environ. 2024, 315, 114476. [Google Scholar] [CrossRef]
  10. Holtgrave, A.-K.; Lobert, F.; Erasmi, S.; Röder, N.; Kleinschmit, B. Grassland Mowing Event Detection Using Combined Optical, SAR, and Weather Time Series. Remote Sens. Environ. 2023, 295, 113680. [Google Scholar] [CrossRef]
  11. Ghamisi, P.; Rasti, B.; Yokoya, N.; Wang, Q.; Hofle, B.; Bruzzone, L.; Bovolo, F.; Chi, M.; Anders, K.; Gloaguen, R. Multisource and Multitemporal Data Fusion in Remote Sensing. arXiv 2018, arXiv:1812.08287. [Google Scholar] [CrossRef]
  12. Abdollahi, A.; Liu, Y.; Pradhan, B.; Huete, A.; Dikshit, A.; Tran, N.N. Short-Time-Series Grassland Mapping Using Sentinel-2 Imagery and Deep Learning-Based Architecture. Egypt. J. Remote Sens. Space Sci. 2022, 25, 673–685. [Google Scholar] [CrossRef]
  13. Tsardanidis, I.; Koukos, A.; Sitokonstantinou, V.; Drivas, T.; Kontoes, C. Cloud Gap-Filling with Deep Learning for Improved Grassland Monitoring. Comput. Electron. Agric. 2025, 230, 109732. [Google Scholar] [CrossRef]
  14. Lu, B.; He, Y. Species Classification Using Unmanned Aerial Vehicle (UAV)-Acquired High Spatial Resolution Imagery in a Heterogeneous Grassland. ISPRS J. Photogramm. Remote Sens. 2017, 128, 73–85. [Google Scholar] [CrossRef]
  15. Nahrstedt, K.; Reuter, T.; Trautz, D.; Waske, B.; Jarmer, T. Classifying Stand Compositions in Clover Grass Based on High-Resolution Multispectral UAV Images. Remote Sens. 2024, 16, 2684. [Google Scholar] [CrossRef]
  16. Šrollerů, A.; Potůčková, M. Evaluating the Applicability of High-Density UAV LiDAR Data for Monitoring Tundra Grassland Vegetation. Int. J. Remote Sens. 2025, 46, 42–76. [Google Scholar] [CrossRef]
  17. Michez, A.; Piégay, H.; Lisein, J.; Claessens, H.; Lejeune, P. Classification of Riparian Forest Species and Health Condition Using Multi-Temporal and Hyperspatial Imagery from Unmanned Aerial System. Environ. Monit. Assess. 2016, 188, 146. [Google Scholar] [CrossRef] [PubMed]
  18. Hütt, C.; Isselstein, J.; Komainda, M.; Schöttker, O.; Sturm, A. UAV LiDAR-Based Grassland Biomass Estimation for Precision Livestock Management. J. Appl. Remote Sens. 2024, 18, 017502. [Google Scholar] [CrossRef]
  19. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Wang, L.; Atkinson, P.M. ABCNet: Attentive Bilateral Contextual Network for Efficient Semantic Segmentation of Fine-Resolution Remotely Sensed Imagery. ISPRS J. Photogramm. Remote Sens. 2021, 181, 84–98. [Google Scholar] [CrossRef]
  20. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  21. Fan, X.; Zhou, W.; Qian, X.; Yan, W. Progressive Adjacent-Layer Coordination Symmetric Cascade Network for Semantic Segmentation of Multimodal Remote Sensing Images. Expert Syst. Appl. 2024, 238, 121999. [Google Scholar] [CrossRef]
  22. Zhao, C.; Liu, H.; Su, N.; Xu, C.; Yan, Y.; Feng, S. TMTNet: A Transformer-Based Multimodality Information Transfer Network for Hyperspectral Object Tracking. Remote Sens. 2023, 15, 1107. [Google Scholar] [CrossRef]
  23. Wang, S.; Hu, Q.; Wang, S.; Zhao, P.; Li, J.; Ai, M. Category Attention Guided Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images. Int. J. Appl. Earth Obs. Geoinf. 2024, 127, 103661. [Google Scholar] [CrossRef]
  24. Samadzadegan, F.; Toosi, A.; Dadrass Javan, F. A Critical Review on Multi-Sensor and Multi-Platform Remote Sensing Data Fusion Approaches: Current Status and Prospects. Int. J. Remote Sens. 2025, 46, 1327–1402. [Google Scholar] [CrossRef]
  25. Wang, S.; Zhou, Q. Multi-Source Fusion Enhanced Feature Segmentation in Remote Sensing Imagery. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2024, 10, 395–401. [Google Scholar] [CrossRef]
  26. Wang, X.; Hu, Z.; Shi, S.; Hou, M.; Xu, L.; Zhang, X. A Deep Learning Method for Optimizing Semantic Segmentation Accuracy of Remote Sensing Images Based on Improved UNet. Sci. Rep. 2023, 13, 7600. [Google Scholar] [CrossRef] [PubMed]
  27. Sun, L.; Zou, H.; Wei, J.; Cao, X.; He, S.; Li, M.; Liu, S. Semantic Segmentation of High-Resolution Remote Sensing Images Based on Sparse Self-Attention and Feature Alignment. Remote Sens. 2023, 15, 1598. [Google Scholar] [CrossRef]
  28. Landgraf, S.; Hillemann, M.; Wursthorn, K.; Ulrich, M. U-CE: Uncertainty-Aware Cross-Entropy for Semantic Segmentation. arXiv 2023, arXiv:2307.09947. [Google Scholar]
  29. Liu, T.; Yu, L.; Yan, Z.; Li, X.; Bu, K.; Yang, J. Enhanced Climate Mitigation Feedbacks by Wetland Vegetation in Semi-arid Compared to Humid Regions. Geophys. Res. Lett. 2025, 52, e2025GL115242. [Google Scholar] [CrossRef]
  30. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual Dense Network for Image Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2472–2481. [Google Scholar]
  31. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  32. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  33. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  34. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  35. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. Unet++: Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation. IEEE Trans. Med. Imaging 2019, 39, 1856–1867. [Google Scholar] [CrossRef]
  36. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  37. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  38. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  39. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  40. Milletari, F.; Navab, N.; Ahmadi, S.-A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 565–571. [Google Scholar]
  41. Jadon, S. A Survey of Loss Functions for Semantic Segmentation. In Proceedings of the 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Via del Mar, Chile, 27–29 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–7. [Google Scholar]
  42. Everingham, M.; Eslami, S.A.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes Challenge: A Retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar]
  43. Minaee, S.; Boykov, Y.; Porikli, F.; Plaza, A.; Kehtarnavaz, N.; Terzopoulos, D. Image Segmentation Using Deep Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3523–3542. [Google Scholar] [CrossRef] [PubMed]
  44. Lu, D.; Cheng, S.; Wang, L.; Song, S. Multi-Scale Feature Progressive Fusion Network for Remote Sensing Image Change Detection. Sci. Rep. 2022, 12, 11968. [Google Scholar] [CrossRef]
  45. Wang, J.; Chen, T.; Zheng, L.; Tie, J.; Zhang, Y.; Chen, P.; Luo, Z.; Song, Q. A Multi-Scale Remote Sensing Semantic Segmentation Model with Boundary Enhancement Based on UNetFormer. Sci. Rep. 2025, 15, 14737. [Google Scholar] [CrossRef]
  46. Chang, J.; He, X.; Li, P.; Tian, T.; Cheng, X.; Qiao, M.; Zhou, T.; Zhang, B.; Chang, Z.; Fan, T. Multi-Scale Attention Network for Building Extraction from High-Resolution Remote Sensing Images. Sensors 2024, 24, 1010. [Google Scholar]
  47. Wang, X.; Feng, Y.; Song, R.; Mu, Z.; Song, C. Multi-Attentive Hierarchical Dense Fusion Net for Fusion Classification of Hyperspectral and LiDAR Data. Inf. Fusion 2022, 82, 1–18. [Google Scholar]
  48. Aburaed, N.; Al-Saad, M.; Alkhatib, M.Q.; Zitouni, M.S.; Almansoori, S.; Al-Ahmad, H. Semantic Segmentation of Remote Sensing Imagery Using AN Enhanced Encoder-Decoder Architecture. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2023, 10, 1015–1020. [Google Scholar]
  49. Jiang, J.; Feng, X.; Huang, H. Semantic Segmentation of Remote Sensing Images Based on Dual-channel Attention Mechanism. IET Image Process. 2024, 18, 2346–2356. [Google Scholar]
  50. Duan, S.; Zhao, J.; Huang, X.; Zhao, S. Semantic Segmentation of Remote Sensing Data Based on Channel Attention and Feature Information Entropy. Sensors 2024, 24, 1324. [Google Scholar] [CrossRef]
  51. Zhang, E.; Liu, J.; Cao, A.; Sun, Z.; Zhang, H.; Wang, H.; Sun, L.; Song, M. RS-SAM: Integrating Multi-Scale Information for Enhanced Remote Sensing Image Segmentation. In Proceedings of the Asian Conference on Computer Vision, Hanoi, Vietnam, 8–12 December 2024; pp. 994–1010. [Google Scholar]
  52. Liu, M.; Dan, J.; Lu, Z.; Yu, Y.; Li, Y.; Li, X. CM-UNet: Hybrid CNN-Mamba UNet for Remote Sensing Image Semantic Segmentation. arXiv 2024, arXiv:2405.10530. [Google Scholar]
  53. Ma, X.; Zhang, X.; Pun, M.-O.; Huang, B. MANet: Fine-Tuning Segment Anything Model for Multimodal Remote Sensing Semantic Segmentation. arXiv 2024, arXiv:2410.11160. [Google Scholar] [CrossRef]
  54. Yu, L.; Liu, Y.; Liu, T.; Yan, F. Impact of Recent Vegetation Greening on Temperature and Precipitation over China. Agric. For. Meteorol. 2020, 295, 108197. [Google Scholar] [CrossRef]
  55. Zhan, Z.; Ren, H.; Xia, M.; Lin, H.; Wang, X.; Li, X. Amfnet: Attention-Guided Multi-Scale Fusion Network for Bi-Temporal Change Detection in Remote Sensing Images. Remote Sens. 2024, 16, 1765. [Google Scholar] [CrossRef]
Figure 1. Location of the study area. (a) Yaojingzi Leymus chinensis Grassland Reserve. (b) Plot area. (c) Plot point distribution. (d) Unmanned aerial vehicle equipment. (e) Radiometric calibration. (f) RTK positioning. (g) Plot landscape. (h) Plot layout.
Figure 2. Technical flow chart of this study.
Figure 3. Block diagrams of four fusion network structures: Residual Channel Attention Fusion Network (RCAF-Net), Dense Auto Encoder Fusion Network (DAEF-Net), Dual Attention Fusion Network (DAF-Net), and Multi-Scale Pyramid Fusion Network (MSPF-Net).
Figure 4. Class distribution of the labeled dataset.
Figure 5. Examples of training image–label pairs. Panels (a–f) show six independent 256 × 256 pixel chips randomly sampled from the fused dataset; the top row displays the input image chips and the bottom row their corresponding ground-truth labels. The label set used in this study includes Leymus chinensis, Puccinellia distans, Phragmites australis, Bare land, and Others.
Figure 6. The module structures for three attention models, including squeeze and excitation (SE), efficient channel attention (ECA) and convolutional block attention module (CBAM).
Figure 7. Schematic diagram of the improved UNet++ network structure.
Figure 8. Training loss curves for each model.
Figure 9. Fused images from four fusion networks in the same area: (a) RCAF-Net, (b) DAEF-Net, (c) DAF-Net, (d) MSPF-Net. Panels were displayed as false color composites from three bands of the ten-band output with the same contrast stretch.
Figure 10. Qualitative comparison on representative test tiles. Columns show the input image, the label, and predictions from UNet++_SE, UNet++_ECA, UNet++_CBAM; the legend lists the five classes.
Figure 11. Validation versus test performance of the selected model under the spatially independent split.
Figure 12. Training and evaluation curves for different loss functions. (a) training loss for Dice, Focal (γ = 2), and Focal Dice (α = 0.5); (b) validation mIoU; (c) test loss of the selected configuration UNet++ with SE and Focal Dice; (d) test mIoU of the same configuration.
Figure 13. Comparison of F1 scores for each category among the three optimal models.
Figure 14. Segmentation-aware modality occlusion sensitivity.
Figure 15. Whole-scene semantic segmentation results.
Table 1. Comparison of experimental results for four fusion networks.
Metric | RCAF-Net | DAEF-Net | DAF-Net | MSPF-Net | Unfused
Entropy | 6.80 | 6.79 | 6.80 | 6.84 | 6.68
Spatial Frequency | 15.52 | 15.52 | 15.54 | 15.56 | 15.33
Mean Gradient | 12.51 | 12.51 | 12.53 | 12.54 | 12.28
Variance of Laplacian | 4996.84 | 5006.41 | 5000.51 | 5022.92 | 4940.76
Tenengrad | 76.17 | 76.12 | 76.35 | 76.42 | 75.12
RMS | 30.68 | 30.56 | 30.84 | 31.22 | 30.18
Table 2. Test set performance of segmentation backbones and attention modules. Metrics include mIoU, OA, Precision, Recall, F1 score, and Kappa.
Model | mIoU (%) | OA (%) | Precision (%) | Recall (%) | F1-Score (%) | Kappa (%)
DeepLabV3+_SE | 70.53 | 79.17 | 75.27 | 72.72 | 73.97 | 72.44
DeepLabV3+_ECA | 68.28 | 78.84 | 75.27 | 72.51 | 73.86 | 70.02
DeepLabV3+_CBAM | 68.10 | 78.72 | 75.10 | 72.43 | 73.74 | 69.85
UNet++_SE | 77.68 | 86.98 | 83.16 | 79.87 | 81.48 | 82.47
UNet++_ECA | 72.97 | 84.32 | 79.82 | 75.58 | 77.64 | 74.68
UNet++_CBAM | 72.45 | 83.59 | 79.23 | 75.32 | 77.23 | 73.85
PSPNet_SE | 57.41 | 67.65 | 54.75 | 57.68 | 56.18 | 54.15
PSPNet_ECA | 54.76 | 69.26 | 63.05 | 59.36 | 61.15 | 53.39
PSPNet_CBAM | 53.36 | 66.21 | 59.61 | 60.50 | 60.05 | 51.28
FPN_SE | 62.70 | 76.94 | 70.10 | 68.11 | 69.09 | 63.81
FPN_ECA | 62.31 | 76.39 | 69.04 | 68.64 | 68.84 | 63.30
FPN_CBAM | 62.89 | 77.32 | 69.81 | 68.87 | 69.34 | 64.10
Table 3. Effect of loss functions on test set performance for UNet++ with SE.
Loss | mIoU (%) | OA (%) | F1 (%) | Kappa (%)
Dice | 75.90 | 86.50 | 82.60 | 80.70
Focal | 76.40 | 86.90 | 83.10 | 81.20
Focal Dice | 77.70 | 87.50 | 84.20 | 82.50
Table 4. PA and UA statistics for the three optimal models by category.
Model | Index | Leymus chinensis | Puccinellia distans | Phragmites australis | Bare land | Others
UNet++_SE | PA | 86.46 | 82.40 | 79.62 | 93.13 | 75.19
UNet++_SE | UA | 82.30 | 77.68 | 78.05 | 89.10 | 70.05
UNet++_ECA | PA | 82.62 | 81.32 | 79.18 | 95.05 | 70.20
UNet++_ECA | UA | 76.70 | 76.10 | 68.13 | 86.05 | 72.21
UNet++_CBAM | PA | 81.16 | 83.40 | 78.85 | 92.33 | 61.83
UNet++_CBAM | UA | 77.20 | 75.51 | 72.10 | 84.40 | 67.32
All values are classification accuracies in %.
Table 5. Ablation experiment results of introducing the SE module into UNet++.
Model | mIoU (%) | OA (%) | Precision (%) | Recall (%) | F1-Score (%) | Kappa (%)
UNet++ | 73.95 | 83.76 | 79.82 | 76.48 | 78.12 | 75.84
UNet++_SE | 77.68 | 86.98 | 83.16 | 79.87 | 81.48 | 82.47
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
