1. Introduction
Global food security has emerged as a critical concern in the 21st century, exacerbated by continued population growth and the intensifying impacts of climate change [1,2,3]. As a staple food crop for over half of the global population [4], rice cultivation plays a pivotal role in ensuring regional food security and socioeconomic stability [5,6,7,8]. Traditional crop monitoring techniques, predominantly dependent on manual field surveys, are increasingly inadequate due to their labor-intensive nature, temporal latency, and inability to provide spatially continuous observations [9,10]. In response, satellite remote sensing has revolutionized agricultural monitoring by enabling precise, large-scale, and dynamic crop mapping, thereby enhancing decision-making efficiency in cultivation management [11,12,13,14].
Optical remote sensing, with its rich spectral information, has been widely adopted for rice identification through various vegetation indices, such as NDVI [5,15,16,17]. However, its effectiveness is severely affected by cloud coverage, particularly during critical rice growing seasons that often coincide with cloudy and rainy weather. Synthetic Aperture Radar (SAR), with its all-weather imaging capability, offers complementary structural insights into rice canopy characteristics, including plant height, density, and moisture content [14,18,19]. Despite these advantages, SAR data present inherent limitations, including speckle noise and interpretational complexity due to distinct scattering mechanisms. Consequently, the integration of optical and SAR data has gained traction as a means to enhance classification accuracy by leveraging their complementary strengths [20,21,22,23].
According to the level of information integration, optical–SAR data fusion approaches can be broadly categorized into three strategies: input-level fusion, feature-level fusion, and decision-level fusion [22,24,25,26]. Input-level fusion preserves complete information by directly combining raw or preprocessed data from both modalities [23,27,28,29]. Feature-level fusion integrates features extracted from different modalities before classification, offering flexibility in feature representation [30,31,32,33]. Decision-level fusion processes each modality independently and combines their classification results, providing a modular approach to multi-source data integration [34,35].
The emergence of deep learning has fundamentally transformed the landscape of optical–SAR data fusion, offering unprecedented capabilities to model complex non-linear relationships between these heterogeneous data sources. Recent studies have predominantly focused on feature-level fusion [30,33,36]. For instance, refs. [32,37,38,39] employ dual-branch designs based on Convolutional Neural Networks (CNNs) to extract spatial features from optical and SAR data separately before fusion. However, their fixed receptive fields and geometric constraints limit their ability to model complex cross-modal interactions. CNN-RNN hybrid architectures incorporate temporal modeling capabilities, attempting to capture both spatial and temporal dependencies in multi-temporal remote sensing data [24,40]. Nevertheless, they face challenges in processing long-range dependencies while maintaining spatial context. Attention-based methods, though promising, often treat attention mechanisms as auxiliary components rather than core architectural elements, thereby restricting their capacity to capture global relationships [25,32,41]. Ofori-Ampofo et al. [25] adopted an attention-based pixel-level classification model to compare the crop identification performance of three image fusion strategies. The study demonstrated that input-level and feature-level data fusion methods achieved a 2% higher F1-score compared to decision-level fusion. However, the three fusion strategies explored in their study were relatively simplistic (e.g., simple stacking or averaging), which failed to effectively leverage the complementary characteristics of optical and SAR data.
The Transformer architecture, initially developed for natural language processing, has revolutionized sequential data modeling through self-attention mechanisms that dynamically capture global dependencies [42]. Adapted to computer vision, Vision Transformers have demonstrated exceptional performance in tasks requiring both local detail preservation and global contextual understanding [43,44,45]. While hybrid CNN–Transformer architectures have been explored in remote sensing [46,47,48,49,50], their reliance on CNNs for local feature extraction may constrain the Transformer's capacity to fully exploit global relationships. Recent studies suggest that pure Transformer architectures can effectively integrate local and global features without convolutional operations [45,51,52], yet their application to optical–SAR fusion remains unexplored.
The extensive exploration of feature-level fusion in deep learning implicitly assumes its superiority over input-level and decision-level strategies. However, this assumption lacks systematic validation, especially in the context of rice identification, where environmental factors and data characteristics may significantly influence the effectiveness of different fusion approaches. Furthermore, the comparative performance of these three strategies within the Transformer framework remains unexplored, limiting insights into their adaptability across diverse agricultural landscapes.
In this context, our study presents a systematic investigation of different fusion strategies within a Transformer-based framework for rice identification. The main contributions of this work are as follows:
- (1) We propose three novel Transformer-based networks—Early Fusion Transformer (EFT), Feature Fusion Transformer (FFT), and Decision Fusion Transformer (DFT)—each implementing a distinct fusion strategy (input-level, feature-level, and decision-level) tailored for optical–SAR synergy;
- (2) We conduct a systematic comparison of these three fusion strategies within the Transformer framework for rice identification, providing comprehensive insights into their performance and practical applicability;
- (3) We demonstrate the superior adaptability and robustness of our proposed fusion networks, particularly in challenging scenarios such as cloud-covered areas.
3. Methodology
3.1. Overview of the Multi-Modal Transformer Framework
To effectively address the challenges in rice identification using multi-modal remote sensing data, we propose a Transformer-based framework that leverages the strengths of both optical and SAR data through different levels of fusion.
Figure 3 provides an overview of our framework, illustrating how the three fusion strategies relate to the core Transformer architecture.
The proposed framework consists of two main components: a Transformer-based feature extraction module and a multi-modal fusion strategy. The backbone of the model is a Swin Transformer encoder [45], which serves as the core feature extractor, capturing both local spatial structures and global contextual relationships across different modalities. Our framework builds upon established fusion concepts from the literature [25,41] but innovatively integrates them with Transformer architectures specifically optimized for rice identification.
As shown in Figure 3, we implement three distinct fusion strategies operating at different processing stages:
- (1) Early Fusion Transformer (EFT): integrates optical and SAR data at the input level, before Transformer processing, through an adaptive channel attention mechanism;
- (2) Feature Fusion Transformer (FFT): employs a dual-stream architecture with parallel Transformer feature encoders. The model dynamically integrates intermediate features during the encoding process, adaptively balancing the contribution of each modality based on feature reliability and relevance;
- (3) Decision Fusion Transformer (DFT): employs a late fusion strategy with independent processing streams. Each modality is processed through a separate Transformer pipeline until the final classification stage, where predictions are combined through a learnable weighting parameter.
All three strategies share the same underlying Transformer-based feature extraction architecture but differ in where and how the fusion occurs. This design allows us to systematically evaluate the effectiveness of fusion at different processing stages for rice identification.
3.2. Transformer-Based Feature Extraction Architecture
The foundation of our framework is a hierarchical vision Transformer [45], which we have adapted for multi-temporal remote sensing data processing (Figure 4). Unlike conventional CNNs, which rely on fixed convolutional kernels, the hierarchical vision Transformer utilizes self-attention mechanisms to dynamically capture spatial dependencies at different scales, making it particularly suitable for capturing the complex spatial patterns of rice fields.
The feature extraction process begins with patch embedding, where the input features are partitioned into non-overlapping patches of size 2 × 2 and linearly projected into a high-dimensional feature space. The embedded patches then undergo hierarchical transformation through four stages, with the feature resolution progressively reduced by a factor of 2 and the channel dimension doubled at each stage. This multi-scale feature representation is particularly beneficial for rice identification. Local-level feature learning captures fine-grained field characteristics, essential for detecting field boundaries and within-field texture patterns that distinguish rice from other crops, while global-level feature aggregation captures broad contextual relationships, crucial for understanding the temporal dynamics of rice growth patterns and regional farming practices.
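To make the patch embedding and hierarchical downsampling concrete, a minimal PyTorch sketch is given below. The embedding dimension of 96 and the strided-convolution implementation are assumptions (only the 2 × 2 patch size and the halving/doubling pattern are specified above).

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Partition the input into non-overlapping 2x2 patches and project them
    linearly, implemented here as a strided convolution (a common choice)."""
    def __init__(self, in_channels, embed_dim=96):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=2, stride=2)

    def forward(self, x):               # x: (B, in_channels, H, W)
        return self.proj(x)             # (B, embed_dim, H/2, W/2)

# Hierarchical stages: at each of the four stages the spatial resolution is
# halved and the channel dimension doubled, e.g. 96 -> 192 -> 384 -> 768.
x = torch.randn(1, 24, 1098, 1098)      # e.g. a stacked multi-temporal input
tokens = PatchEmbed(in_channels=24)(x)  # (1, 96, 549, 549)
```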
Within each Swin Transformer block, we implement two complementary attention mechanisms: Window-based Multi-head Self-Attention (W-MSA) and Shifted Window-based Multi-head Self-Attention (SW-MSA). The W-MSA mechanism processes features within fixed local windows, capturing fine-grained spatial relationships, while SW-MSA enables cross-window information exchange through shifted partition patterns. This dual-attention design efficiently handles multi-scale features while maintaining computational efficiency. The computation process is as follows:

$$\hat{z}^{i} = \text{(S)W-MSA}\big(\text{LN}(z^{i-1})\big) + z^{i-1},$$
$$z^{i} = \text{MLP}\big(\text{LN}(\hat{z}^{i})\big) + \hat{z}^{i},$$

where $\hat{z}^{i}$ represents the output features of (S)W-MSA at layer $i$, $z^{i}$ denotes the block output, LN denotes layer normalization, and MLP is a multi-layer perceptron.
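A compact sketch of the residual structure in the equations above is shown below; standard multi-head attention stands in for the windowed (S)W-MSA, so window partitioning and shifting are omitted, and the hyperparameters are illustrative rather than those of the actual model.

```python
import torch
import torch.nn as nn

class SwinBlockSketch(nn.Module):
    """Residual (S)W-MSA + MLP structure of one Swin Transformer block.
    nn.MultiheadAttention is a stand-in for window-based attention."""
    def __init__(self, dim, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z):                            # z: (B, num_tokens, dim)
        h = self.norm1(z)
        attn_out, _ = self.attn(h, h, h)             # (S)W-MSA(LN(z^{i-1}))
        z_hat = attn_out + z                         # first residual connection
        return self.mlp(self.norm2(z_hat)) + z_hat   # MLP(LN(z_hat)) + z_hat
```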
The decoder includes a Pyramid Pooling Module (PPM) [59] and a Feature Pyramid Network (FPN) [60] to parse and integrate the self-attention-based feature maps. The PPM produces scene-level features, which the FPN combines with the four scales of object-level feature maps from the encoder to form the multi-level integrated features.
3.3. Multi-Modal Fusion Strategies
Based on the Transformer architecture described above, we propose three innovative Transformer-based fusion models, each implementing a different fusion strategy for optical-SAR data integration.
3.3.1. Early Fusion Transformer
The Early Fusion Transformer (EFT) integrates optical and SAR data at the input stage before feature extraction, specifically designed to address the heterogeneous nature of multi-temporal optical and SAR data in rice identification. Rice exhibits distinct spectral and structural characteristics across its growth cycle, ranging from flooded fields during early growth stages, which SAR captures effectively due to its sensitivity to water surfaces, to chlorophyll-rich canopies in later stages, best detected by optical sensors. By integrating these complementary signals at an early stage, EFT enables the model to jointly learn spectral–structural representations, improving classification robustness in diverse environmental conditions.
Unlike traditional feature stacking approaches, EFT employs a channel attention mechanism [61] that dynamically adjusts the importance of optical and SAR information based on their relevance to the classification task. As shown in Figure 5, the fusion process begins by stacking the optical and SAR images along the channel dimension. The optical input, denoted as $X_{opt} \in \mathbb{R}^{T_{opt} \times C_{opt} \times H \times W}$, consists of $T_{opt}$ temporal observations with $C_{opt}$ spectral channels, while the SAR input, $X_{SAR} \in \mathbb{R}^{T_{SAR} \times C_{SAR} \times H \times W}$, contains $T_{SAR}$ temporal observations with $C_{SAR}$ polarimetric channels. These multi-channel inputs are concatenated along the channel dimension to form a stacked representation $y \in \mathbb{R}^{(T_{opt} C_{opt} + T_{SAR} C_{SAR}) \times H \times W}$, which represents the combined feature space encompassing the full temporal evolution of rice fields.
To ensure that the model effectively leverages the most informative features from each modality, a three-step channel attention mechanism—squeeze, excitation, and scaling—is applied to the stacked features. First, the squeeze operation compresses the global spatial information of each channel using global average pooling, producing a descriptor that captures the overall temporal–modal context:

$$z_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} y_c(i, j),$$

where $y$ represents the stacked input features with spatial dimensions $H \times W$, and $i$ and $j$ correspond to pixel locations. This step effectively summarizes the dominant spectral and backscatter responses associated with rice cultivation, identifying the most relevant features at each growth stage.
The excitation operation then learns interdependencies among all temporal observations and spectral–polarimetric channels through a two-layer fully connected network:

$$s = \sigma\big(W_2\,\delta(W_1 z)\big),$$

where $\sigma$ represents the sigmoid activation that normalizes outputs to [0, 1], $\delta$ is the ReLU function, and $W_1$ and $W_2$ are learnable parameters.
Finally, the scaling operation applies these learned weights to the original input features, producing enhanced multi-modal representations:

$$\tilde{y}_c = s_c \cdot y_c.$$
This adaptive weighting mechanism ensures that EFT can dynamically adjust the importance of different temporal observations based on their contribution to rice phenology. Additionally, it allows the model to suppress noise in either modality, mitigating issues such as cloud-covered optical data and speckle noise in SAR imagery.
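The EFT fusion step can be sketched as follows. This is a minimal, hedged implementation assuming the temporal observations are folded into the channel dimension; the reduction ratio `r` is an assumed hyperparameter not reported above.

```python
import torch
import torch.nn as nn

class EarlyFusionAttention(nn.Module):
    """Stack optical and SAR inputs along the channel axis, then apply
    squeeze-excitation-scaling channel attention to re-weight each channel."""
    def __init__(self, opt_channels, sar_channels, r=4):
        super().__init__()
        c = opt_channels + sar_channels
        self.excite = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(inplace=True),
                                    nn.Linear(c // r, c), nn.Sigmoid())

    def forward(self, x_opt, x_sar):              # (B, C_opt, H, W), (B, C_sar, H, W)
        y = torch.cat([x_opt, x_sar], dim=1)      # input-level stacking
        z = y.mean(dim=(2, 3))                    # squeeze: global average pooling
        s = self.excite(z)                        # excitation: weights in [0, 1]
        return y * s[:, :, None, None]            # scaling: channel re-weighting

# Usage with, e.g., 12 optical composites and 12 SAR composites per pixel:
fused = EarlyFusionAttention(12, 12)(torch.randn(2, 12, 64, 64),
                                     torch.randn(2, 12, 64, 64))
```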
The enhanced multi-modal features $\tilde{y}$ are then processed through the Transformer architecture described in Section 3.2, which further leverages the self-attention mechanism to capture the spatial context essential for accurate rice field delineation. The hierarchical feature processing of the Transformer is especially effective at aggregating the multi-temporal information now optimally weighted by the channel attention mechanism.
3.3.2. Feature Fusion Transformer
The Feature Fusion Transformer (FFT) implements a dual-stream architecture in which optical and SAR images are first processed separately before their features are fused at an intermediate stage. This strategy preserves the flexibility of independent feature extraction while enabling deep cross-modal interactions that are particularly beneficial for rice field identification.
As illustrated in Figure 6, the dual-stream architecture is designed to leverage the fundamentally different information contained in optical and SAR observations. Rather than forcing an early integration of these heterogeneous data types, FFT allows each modality to first develop its specialized feature representations. In the optical stream, multi-temporal Sentinel-2 data are processed through a hierarchical Transformer encoder based on the Swin Transformer backbone. This encoder produces optical-specific features, denoted as $F_{opt}$, that encapsulate the spectral–temporal evolution characteristic of rice fields—for example, the marked increase in vegetation indices during early vegetative stages and more gradual changes during grain filling and maturation.
Simultaneously, the SAR stream processes multi-temporal Sentinel-1 data independently to generate SAR-specific features, denoted as $F_{SAR}$, which capture the unique backscatter dynamics associated with rice cultivation practices. Crucially, both streams employ the Swin Transformer architecture but maintain independent parameters to ensure modality-specific optimization. This design ensures that each encoder can specialize in extracting the most relevant features for rice identification from its respective data type without compromise.
The core innovation of FFT lies in its adaptive feature fusion mechanism, which consists of several key steps. First, the refined features from the two streams are concatenated along the channel dimension to form a unified representation:

$$F_{cat} = \mathrm{Concat}(F_{opt}, F_{SAR}),$$

where $\mathrm{Concat}(\cdot)$ represents channel-wise concatenation, which preserves all spatial and contextual information from both modalities.
Then, a convolutional layer processes the concatenated features to learn cross-modal interactions:

$$F_{cross} = \mathrm{Conv}(F_{cat}).$$

This convolution is critical for rice identification, as it enables the network to combine evidence from both modalities—for instance, using SAR backscatter patterns to confirm rice paddies when the optical spectral signatures are ambiguous due to interference from similar crops.
The output of the convolution is then passed through a fully connected layer to generate adaptive fusion weights:

$$w = \sigma(W F_{cross} + b),$$

where $\sigma$ is the sigmoid activation function that constrains the weights to [0, 1], while $W$ and $b$ are learnable parameters. This adaptive mechanism is particularly valuable for handling regional variations in rice cultivation practices and seasonal differences that affect the relative informativeness of each data source.
The final fused features are obtained through a weighted combination:

$$F_{fused} = \big(w \odot F_{opt}\big) \oplus \big((1 - w) \odot F_{SAR}\big),$$

where $\odot$ denotes element-wise multiplication and $\oplus$ denotes element-wise addition.
This adaptive fusion mechanism allows the model to dynamically adjust the contribution of each modality based on their reliability and relevance to rice identification. For instance, when optical features provide clear spectral signatures of rice fields, they may be weighted more heavily. Conversely, when cloud cover compromises optical data quality, the model can rely more on SAR features that capture structural information. The fused features are finally processed through a segmentation head to generate the rice field classification map.
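A hedged sketch of this fusion module is given below; the 3 × 3 kernel and the pooled, per-channel gating are assumptions, since the exact convolution and fully connected layer configurations are not specified above.

```python
import torch
import torch.nn as nn

class AdaptiveFeatureFusion(nn.Module):
    """Concatenate modality-specific features, learn cross-modal interactions
    with a convolution, derive sigmoid fusion weights, and combine the two
    streams with a complementary weighted sum."""
    def __init__(self, dim):
        super().__init__()
        self.cross = nn.Conv2d(2 * dim, dim, kernel_size=3, padding=1)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, f_opt, f_sar):                # both (B, dim, H, W)
        f_cat = torch.cat([f_opt, f_sar], dim=1)    # channel-wise concatenation
        f_cross = self.cross(f_cat)                 # cross-modal interactions
        w = self.gate(f_cross)[:, :, None, None]    # adaptive weights in [0, 1]
        return w * f_opt + (1.0 - w) * f_sar        # weighted feature combination
```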
3.3.3. Decision Fusion Transformer
The Decision Fusion Transformer (DFT) is a novel implementation that combines traditional decision-level fusion principles with the Transformer architecture. Our DFT specifically introduces a Transformer-based approach with an adaptive weighting mechanism for integrating optical and SAR predictions.
The DFT approach, illustrated in Figure 7, maintains complete independence between optical and SAR processing streams until the final decision stage, where their complementary predictions are optimally combined through a learnable weighting mechanism.
Each modality undergoes a complete end-to-end analysis through dedicated Transformer pipelines. The optical stream processes the Sentinel-2 time series through a full Transformer encoder–decoder sequence. This stream specializes in identifying rice fields based on their distinctive spectral signatures—capturing the unique reflectance patterns in NDVI that characterize different rice phenological stages. The optical stream generates a complete classification output $P_{opt}$, representing per-pixel probabilities for each class. These probabilities encapsulate the optical-based evidence for rice presence at each location, drawing on the full temporal evolution of spectral signatures throughout the growing season.
In parallel, the SAR stream separately processes the SAR time series to produce its own classification output, $P_{SAR}$. This stream exploits the distinctive backscatter properties inherent in SAR data—capturing the temporal evolution of VV and VH responses that are characteristic of the flooding and structural properties of rice paddies. Notably, the SAR stream is less affected by cloud cover, ensuring consistent monitoring during adverse weather conditions.
The fusion mechanism employs an adaptive weighting strategy to combine these complementary predictions:

$$P_{fused} = \alpha P_{opt} + (1 - \alpha) P_{SAR},$$

where $P_{opt}$ and $P_{SAR}$ are the softmax probability outputs from the optical and SAR streams, respectively, and $\alpha$ is a learnable parameter that dynamically adjusts the relative contribution of each modality. Unlike fixed-weight fusion approaches, this adaptive mechanism allows the model to determine the optimal balance between modalities based on their observed reliability for rice identification across the diverse landscapes and conditions in the study area.
The entire DFT model, including both processing streams and the fusion parameter $\alpha$, is trained end-to-end using the cross-entropy loss:

$$\mathcal{L} = -\sum_{k} y_k \log\big(P_{fused,k}\big),$$

where $y$ represents the ground truth labels. This unified training strategy allows gradients to flow through both streams simultaneously, ensuring that each modality develops complementary capabilities while the adaptive fusion mechanism converges to an optimal balance.
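The decision-level combination and loss can be sketched as follows; the sigmoid re-parameterisation that keeps α in (0, 1) is an assumption rather than a detail reported above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecisionFusion(nn.Module):
    """Combine per-pixel class probabilities from the optical and SAR streams
    with a single learnable weight alpha."""
    def __init__(self):
        super().__init__()
        self.raw_alpha = nn.Parameter(torch.tensor(0.0))   # sigmoid(0) = 0.5 at init

    def forward(self, logits_opt, logits_sar):              # (B, classes, H, W)
        alpha = torch.sigmoid(self.raw_alpha)
        p_opt = F.softmax(logits_opt, dim=1)
        p_sar = F.softmax(logits_sar, dim=1)
        return alpha * p_opt + (1.0 - alpha) * p_sar         # P_fused

def fusion_loss(p_fused, labels):
    """Cross-entropy on the fused probabilities; labels hold per-pixel class ids."""
    return F.nll_loss(torch.log(p_fused + 1e-8), labels)
```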
The DFT approach offers several distinct advantages for rice identification. By processing each modality independently, each stream can fully exploit its unique strengths—the optical stream for its detailed spectral information and the SAR stream for its robust structural insights during cloudy conditions. The learnable fusion parameter further provides the flexibility to dynamically adjust the importance of each modality based on the specific data quality and environmental conditions encountered. This design not only enhances robustness but also offers improved interpretability, as the value of $\alpha$ can be analyzed to understand the relative contributions of optical and SAR data across different regions or time periods.
3.4. Experimental Setting
The proposed framework was implemented using PyTorch 2.0.1 and trained on two NVIDIA RTX 3090 GPUs with 24 GB of memory each. All experiments were conducted under consistent settings to ensure a fair comparison. The network was trained using the AdamW optimizer with an initial learning rate of 0.0001, adjusted using a cosine annealing schedule. We employed a weight decay of 0.01 to prevent overfitting and set the momentum parameters $\beta_1$ and $\beta_2$ to 0.9 and 0.999, respectively.
The training process utilized a batch size of 8 and ran for 300 epochs. To enhance model generalization and robustness, we implemented a comprehensive data augmentation strategy including random horizontal and vertical flips, random rotation and random scaling. The input images were standardized to 1098 × 1098 pixels to maintain computational efficiency while preserving sufficient spatial information for rice field identification.
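For reference, a minimal sketch of the optimizer and schedule settings described above is given below; the placeholder 1 × 1 convolution stands in for the actual fusion network, and the data loading and augmentation pipeline is omitted.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Conv2d(24, 2, kernel_size=1)        # placeholder for the fusion network
optimizer = AdamW(model.parameters(), lr=1e-4,
                  betas=(0.9, 0.999), weight_decay=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=300)   # cosine annealing over 300 epochs

for epoch in range(300):
    # ... iterate over 1098 x 1098 patches in batches of 8, with random
    #     flips, rotation, and scaling augmentation, then optimizer.step() ...
    scheduler.step()
```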
3.5. Evaluation Metrics
The model’s performance was evaluated using seven metrics, including Overall Accuracy (OA), mean Intersection over Union (MIoU), rice-specific Intersection over Union (IoU_rice), Precision, Recall, F1-score, and Kappa coefficient. These metrics were derived from the confusion matrix, where TP (true positive) represents correctly predicted rice pixels, TN (true negative) represents correctly predicted non-rice pixels, FN (false negative) represents actual rice pixels misclassified as non-rice, and FP (false positive) represents actual non-rice pixels misclassified as rice.
OA quantifies the global classification accuracy. The formula is as follows:

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}.$$

IoU is one of the most critical metrics in semantic segmentation, measuring the ratio of the intersection area between predicted and actual classes to their union area [62]. MIoU averages the IoU across all classes (rice and non-rice); a higher MIoU indicates stronger discriminative ability across all classes. The class-specific IoU (IoU_rice) reflects the overlap between the predicted and ground truth rice categories. The formulas are as follows:

$$\mathrm{IoU}_{rice} = \frac{TP}{TP + FP + FN}, \qquad \mathrm{MIoU} = \frac{1}{2}\big(\mathrm{IoU}_{rice} + \mathrm{IoU}_{non\text{-}rice}\big),$$

where IoU_non-rice is computed analogously from TN, FP, and FN. Precision measures the proportion of correctly identified rice pixels among all predicted rice pixels. The formula is as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$

Recall quantifies the proportion of actual rice pixels correctly identified by the model. The formula is as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$

F1-score is the harmonic mean of precision and recall, serving as a balanced metric to evaluate classification performance under class imbalance. It comprehensively reflects model performance:

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$

The Kappa coefficient measures the agreement between the classification results and the reference data while correcting for chance agreement, ranging over [−1, 1]. A higher value indicates more reliable results:

$$\mathrm{Kappa} = \frac{p_o - p_e}{1 - p_e},$$

where $p_o$ is equivalent to OA, and $p_e$ is calculated as follows:

$$p_e = \frac{(TP + FP)(TP + FN) + (FN + TN)(FP + TN)}{(TP + TN + FP + FN)^2}.$$
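The following sketch computes all seven metrics from the binary (rice/non-rice) confusion matrix defined above; it assumes `pred` and `label` are arrays of 0/1 class ids.

```python
import numpy as np

def rice_metrics(pred, label):
    """Compute OA, MIoU, IoU_rice, Precision, Recall, F1, and Kappa."""
    tp = np.sum((pred == 1) & (label == 1))
    tn = np.sum((pred == 0) & (label == 0))
    fp = np.sum((pred == 1) & (label == 0))
    fn = np.sum((pred == 0) & (label == 1))
    n = tp + tn + fp + fn

    oa = (tp + tn) / n
    iou_rice = tp / (tp + fp + fn)
    iou_nonrice = tn / (tn + fp + fn)
    miou = (iou_rice + iou_nonrice) / 2
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (oa - pe) / (1 - pe)
    return {"OA": oa, "MIoU": miou, "IoU_rice": iou_rice, "Precision": precision,
            "Recall": recall, "F1": f1, "Kappa": kappa}
```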
To ensure a robust evaluation, all experiments were repeated five times with different random seeds, and the average results are reported. Additionally, we conducted a qualitative analysis by visually comparing the predicted rice distribution maps with ground truth labels across various test regions, focusing particularly on areas with complex farming patterns and cloud-covered conditions.
4. Results
4.1. Performance Comparison of Fusion Strategies for Rice Identification
4.1.1. Overall Accuracy and Quantitative Comparison of Different Fusion Strategies
A quantitative comparison of the fusion strategies is provided in Table 1. The Early Fusion Transformer (EFT) demonstrated superior performance across all metrics, achieving an Overall Accuracy (OA) of 98.33% and a Mean Intersection over Union (MIoU) of 96.74%. This represents a significant improvement over single-modality approaches, surpassing both the optical-only (OA = 97.23%, MIoU = 94.69%) and SAR-only (OA = 96.45%, MIoU = 93.21%) baselines. The substantial enhancement in IoU_rice (83.47%) particularly underscores EFT's effectiveness in precise rice field delineation.
The Feature Fusion Transformer (FFT) exhibits competitive but slightly lower performance, with an OA of 97.85% and an IoU_rice of 79.71%. While FFT's precision (90.37%) approached that of EFT, its lower recall rate (87.75% compared to EFT's 90.35%) indicated certain limitations in comprehensive rice field detection. This suggests that feature-level fusion, while effective, may not fully preserve critical information during independent modality processing.
In contrast, the Decision Fusion Transformer (DFT) performs significantly worse (OA = 96.15%, MIoU = 92.71%, IoU_rice = 46.02%), with a poor recall rate of 49.51%. This performance gap suggests challenges in reconciling conflicting predictions from different modalities at the decision level, particularly in areas where optical and SAR data provide divergent information.
4.1.2. Spatial Distribution Performance of Rice Field Identification
The spatial analysis of the identification results (Figure 8) revealed distinct characteristics of each fusion strategy across varying landscape contexts. In high-density rice regions (Figure 8 (1), rice pixel proportion: 33.45%), all approaches achieved satisfactory results, with EFT exhibiting minimal FN (false negative) pixels, indicating superior rice pixel detection capability. FFT demonstrated the lowest FP (false positive) rate, suggesting that feature-level fusion is less likely to misclassify non-rice pixels as rice.
In the second group of Figure 8, where the proportion of rice pixels decreases to 24.3%, the advantages of EFT become more evident, achieving the best rice identification results. The analysis of FP distributions revealed that the SAR-only approach misidentifies a large number of non-rice pixels as rice, and the DFT network exhibits the same problem, whereas EFT and FFT are much less prone to such misclassification. The proportion of FN pixels is smallest for the EFT network, indicating that rice pixels are rarely misclassified as non-rice.
In the area where rice is extremely sparsely distributed (Figure 8 (3)), the proportion of rice pixels is merely 7.07%. Both EFT and FFT maintained robust performance, while DFT showed significant degradation in detection capability, and the SAR-only input again led to a large number of rice misclassifications. From the FN pixels in Figure 8 (3), it can be observed that with the EFT network only one paddy field is missed, whereas with the DFT network most of the paddy fields are not correctly identified.
Based on a comprehensive analysis of the rice identification results in Figure 8, the EFT network achieves the best performance. The FFT network is similarly effective at preventing rice from being misclassified as non-rice, which is consistent with the accuracy evaluation in Table 1. When the proportion of rice in the image is extremely low, the DFT network misses a large number of rice pixels, resulting in low identification accuracy for the rice class. Meanwhile, using only optical data yields better rice identification results than using only SAR data, with fewer misclassified and omitted pixels. Under all field conditions, the EFT network maintains a relatively high and stable performance, accurately identifying rice pixels while reducing false positives in non-rice areas. This highlights the advantages of early-stage fusion for accurate crop type identification.
4.2. Performance of Rice Identification in Complex Scenarios
4.2.1. Field-Level Assessment
Figure 9 illustrates the performance of different approaches in complex field configurations. In Scenario (1), EFT demonstrated superior boundary delineation and classification accuracy. While optical-only approaches showed misclassification in the upper-right non-rice areas, FFT and DFT partially mitigated this issue. SAR-only results, though avoiding commission errors, exhibited significant boundary fragmentation. EFT successfully maintained field integrity while achieving precise boundary delineation.
Scenario (2) featured a regular-shaped rice paddy, where SAR-only and DFT results showed characteristic boundary degradation despite clear field geometry. While optical-only maintained boundary integrity, only EFT successfully avoided crop type misclassification in adjacent areas, demonstrating effective complementary information utilization.
In the more complex Scenario (3), the primary challenge was field omission. Both the optical-only and DFT approaches failed to detect the southeastern rice paddy, while EFT and FFT showed varying degrees of detection success, with EFT achieving the highest detection rate. SAR-only results exhibited systematic boundary distortions, with both over-segmentation and under-segmentation observed in different field regions.
Based on a comprehensive analysis of the field-level identification results in the three regions above, using only optical data carries a high likelihood of misidentifying non-rice fields as rice, while using only SAR data often produces inaccurate boundary segmentation of the identified rice fields. The combined use of optical and SAR data effectively addresses these issues. Among the fusion strategies, the EFT-based results capture field details best, yielding accurate and clear field boundaries, and perform best in minimizing both the misclassification and omission of rice. The FFT network ranks second. In Scenario (2), only the EFT and FFT fusion networks are capable of correctly distinguishing between rice and non-rice. Compared with these two networks, the DFT network shows relatively poor performance with respect to the omission of rice fields and boundary quality.
4.2.2. Cloud Impact Assessment
To evaluate the effectiveness of optical-SAR data fusion in cloud-affected regions, we analyzed rice identification results across three areas subject to varying degrees of cloud interference, as shown in Figure 10. In our study, Sentinel-2 images with cloud cover below 20% were selected, and a long-term NDVI time series was generated using a maximum value composite approach that aggregated NDVI values over 20-day intervals to reduce the impact of cloud cover on the identification results. However, during the period from 24 June to 13 July 2020, only a single Sentinel-2 image (acquired on 7 July 2020) met this cloud cover criterion (see Figure 2). As a result, cloud contamination distorted the NDVI values for this period, impacting rice classification based solely on optical data.
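The 20-day maximum value compositing can be sketched as follows; the interval bookkeeping is an assumption about implementation details not given above.

```python
import numpy as np

def max_value_composite(ndvi_stack, dates, window_days=20):
    """Per-pixel maximum NDVI within each 20-day interval to suppress clouds.
    ndvi_stack: (T, H, W) array; dates: acquisition day-of-year for each scene."""
    dates = np.asarray(dates)
    composites = []
    for start in range(int(dates.min()), int(dates.max()) + 1, window_days):
        in_window = (dates >= start) & (dates < start + window_days)
        if in_window.any():
            composites.append(ndvi_stack[in_window].max(axis=0))
    return np.stack(composites)          # (num_intervals, H, W)
```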
The first column of Figure 10 presents the composite NDVI images for the 24 June–13 July period, where the presence of clouds is clearly visible. The second column displays the corresponding NDPI images, which remain unaffected by cloud cover but exhibit significant speckle noise. The right panels show the rice classification results obtained using NDVI only, NDPI only, and the EFT-based fusion model. When relying solely on optical data, cloud interference significantly degraded rice identification across all three regions, with severe omission errors, particularly in heavily cloud-covered areas (Figure 10 (1)). In contrast, classification based exclusively on SAR data (NDPI) was unaffected by cloud cover but suffered from incomplete field detection and imprecise boundary delineation due to speckle noise. The EFT demonstrated exceptional robustness against cloud interference, achieving complete field segmentation with precise boundary delineation, significantly reducing misclassification errors compared to the optical-only approach while improving boundary accuracy over the SAR-only results.
These results demonstrate that EFT exhibits exceptional robustness against cloud interference by effectively leveraging the complementary strengths of optical and SAR data. Notably, EFT maintained consistent performance across varying cloud conditions and was able to accurately identify most rice fields even in fully cloud-covered areas (Figure 10 (1)), while preserving fine spatial details.
4.3. Ablation Experiment
To evaluate the effectiveness of the three proposed fusion models and the contribution of their key components, we conducted an ablation study. Specifically, for the EFT, we assessed the importance of the channel attention mechanism by comparing the full model against a version where optical and SAR data were simply stacked along the channel dimension without adaptive weighting. For the FFT, we examined the role of the adaptive feature fusion module by replacing it with straightforward feature concatenation. For the DFT, we investigated the impact of the learnable weighting parameter (α) by fixing it at 0.5, assigning equal importance to both modalities without dynamic adjustment.
The ablation study results presented in Table 2 clearly demonstrate the significance of adaptive fusion mechanisms in enhancing model performance. Removing the channel attention mechanism in EFT led to a 4.82% decrease in IoU_rice, confirming its critical role in dynamically adjusting feature contributions from different modalities. Similarly, replacing adaptive feature fusion with simple concatenation in FFT resulted in a 3.27% drop in IoU_rice, highlighting the importance of learning modality interactions instead of performing static feature merging. In the case of DFT, fixing the weighting parameter α led to only a 0.17% decrease in IoU_rice, suggesting that its limitations stem more from the constraints of late fusion than from the weighting mechanism itself.
These findings validate the effectiveness of incorporating adaptive feature integration at different fusion stages, demonstrating that early-level and feature-level fusion approaches benefit significantly from dynamic weighting mechanisms, while decision-level fusion remains inherently constrained in capturing multi-modal interactions.
4.4. Comparison with Other Backbone Architectures
To further evaluate the performance of the proposed fusion framework, we compared EFT against widely used CNN-based architectures (UNet [63], DeepLabV3+ [64]) and another Transformer-based model (SegFormer [65]) under an identical early fusion setting. Each architecture was tested with and without the channel attention mechanism to assess the impact of feature weighting.
As shown in Table 3, EFT achieved the highest performance across all evaluation metrics, particularly in IoU_rice, indicating a strong capability in distinguishing rice fields from other land cover or crop types. The second-best performer was SegFormer, another Transformer-based model, while the CNN-based architectures UNet and DeepLabV3+ showed slightly lower performance than EFT and SegFormer, indicating that Transformer-based fusion strategies outperform CNN-based approaches in multi-modal rice classification.
The impact of channel attention varies significantly across architectures. For CNN-based models, like UNet and DeepLabV3+, adding channel attention provides substantial benefits, increasing IoU_rice by 4.13% and 2.71%, respectively. In contrast, for SegFormer, the effect was relatively minor, with IoU_rice improving by only 0.24%. This suggests that CNNs benefit more from explicit attention mechanisms when handling multi-temporal and multi-modal data, as their standard convolution operations have limited capacity to adaptively weight features across temporal, spectral, and polarimetric dimensions.
These findings provide a comparative assessment of different backbone architectures for multi-modal rice classification, demonstrating the effectiveness of Transformer-based early fusion strategies in leveraging both optical and SAR data.
5. Discussion
5.1. In-Depth Analysis of Different Fusion Strategies and Their Impact on Rice Identification
The superior performance of the Early Fusion Transformer (EFT) highlights the significance of early-stage integration in multi-modal remote sensing data fusion. By integrating optical and SAR data at the input level through a channel attention mechanism, EFT effectively preserves the original spectral–temporal patterns from both modalities, enabling the model to learn joint representations that fully exploit their complementary characteristics. This early integration strategy demonstrates particular advantages in capturing the synergistic relationship between optical and SAR data, where spectral information from optical sensors effectively complements the structural information derived from SAR data.
Previous studies have demonstrated that simply combining [23] or stacking optical and SAR data [29] can yield improved results compared to using a single data source. Building on this foundation, EFT introduces a dynamic channel attention mechanism that automatically adjusts the contribution of each input channel based on its relevance to rice identification. This process further enhances the weighting of key spectral and temporal features. This approach proves especially valuable when dealing with the temporal dynamics of rice growth, where the relative importance of optical and SAR features varies throughout the growing season. For instance, SAR data provide crucial information during the early flooding and transplanting stages, while optical indices become more informative during later vegetative and reproductive phases. The channel attention mechanism effectively captures these changing relationships, leading to more robust rice identification across the entire growing cycle.
The performance gap between the EFT and Feature Fusion Transformer (FFT) reveals that the independent initial processing of different modalities may result in information loss, despite the implementation of sophisticated fusion mechanisms at later stages. While FFT exhibits competitive performance under ideal conditions (e.g., cloud-free scenarios), its slightly lower recall rate suggests the potential loss of subtle yet significant features during independent feature extraction. This observation aligns with recent findings in multi-modal learning research [25], which emphasize that early interactions between modalities facilitate more effective feature learning and representation. This may also be attributed to the fact that effectively integrating and extracting meaningful information from different modal features often requires more complex network architectures. This challenge has become a key research focus, with numerous studies exploring ways to improve feature-level fusion in deep learning models [24,30,33,66] to enhance their understanding of multi-modal features and ultimately improve fusion performance. Our study offers a different perspective—feature-level fusion networks may not always be the optimal choice for crop identification tasks.
The poor performance of the Decision Fusion Transformer (DFT) underscores the limitations of late fusion strategies in handling complex spatio-temporal patterns. The significant decrease in recall (49.51%) indicates that decision-level fusion struggles to reconcile potentially contradictory predictions from different modalities, particularly in challenging scenarios. These limitations become especially pronounced in areas with complex field patterns or adverse weather conditions, where the complementary nature of the two modalities is crucial for accurate identification. This is consistent with the results of previous studies [30,41,67]: whether in fusion methods based on shallow machine learning or deep learning architectures, decision-level fusion performs unsatisfactorily and falls short of feature-level fusion.
Several factors contribute to DFT's poor performance. First, in the DFT architecture, errors from individual modality streams are propagated to the final decision fusion stage. This issue has also been highlighted in previous studies [67], and although we introduced a learnable parameter $\alpha$ to mitigate errors through adaptive weighting, the weighted combination of predictions using $\alpha$ may not effectively compensate for significant misclassifications in the optical or SAR stream, especially when these errors occur in different spatial regions. Second, the learnable parameter $\alpha$ in DFT represents a global weighting factor between modalities. Our analysis reveals that this single parameter struggles to capture the complex, region-specific relationships between optical and SAR data, a limitation similarly observed in other decision-level fusion approaches [41]. Third, by reducing rich feature representations to class probabilities before fusion, DFT eliminates direct interaction between individual modality features [24], creating an information bottleneck that discards potentially valuable complementary features. This loss of information is particularly problematic for rice identification, where subtle cross-modal patterns are crucial for distinguishing rice from similar crops or natural vegetation.
5.2. Model Adaptability and Robustness Under Complex Environmental Conditions
The EFT demonstrates exceptional performance in areas with significant cloud coverage, highlighting the effectiveness of early fusion and the channel attention mechanism in dynamically adjusting contribution weights for each modality. When optical data quality is compromised by cloud cover, EFT automatically increases the influence of SAR data while emphasizing unaffected optical observations from other temporal points. The ability to maintain consistent performance under variable atmospheric conditions addresses one of the most significant challenges in operational agricultural monitoring, particularly in regions with persistent cloud coverage such as Chongqing and Sichuan province in China, where continuous cloud obstruction has historically limited the effectiveness of purely optical approaches.
The model’s robust performance across diverse fields further demonstrates its practical utility. In regions with large, continuous rice fields, all fusion approaches achieved satisfactory results. However, in areas characterized by small, fragmented fields—a common feature in many Chinese agricultural regions—EFT maintained high accuracy while FFT and particularly DFT showed notable performance degradation. This consistent performance across varying field patterns can be attributed to two key architectural advantages: First, the Transformer backbone effectively leverages its self-attention mechanism to model complex spatial dependencies at multiple scales without the receptive field limitations inherent in CNN-based architectures. Second, the early fusion approach preserves important contextual information that might otherwise be lost during independent modality processing, allowing more accurate delineation of field boundaries and identification of isolated rice patches.
5.3. Advantages and Limitations
The primary advantages of our proposed framework lie in its robust performance and adaptive capabilities. The early fusion strategy, combined with the Transformer architecture, effectively captures complex spatio-temporal patterns while maintaining high accuracy across diverse fields. The model’s strong performance under cloud coverage and its ability to handle varying field patterns demonstrate its practical utility for operational agricultural monitoring.
However, several limitations should be noted. Firstly, the current implementation relies on pre-calculated indices (NDVI and NDPI). This approach, while computationally efficient, inherently involves a loss of some of the original spectral and polarimetric information. Given the powerful feature extraction capabilities of Transformer models, it is worth investigating whether using raw spectral/polarimetric bands as input could lead to better performance. Additionally, rice often exhibits strong water-related features, and indices such as the Land Surface Water Index (LSWI), which are more sensitive to water content, should be considered [14]. Furthermore, the VH polarization band, which is more sensitive to rice growth, could also be emphasized to enhance the model's discriminative power. Secondly, the model's effectiveness for other crop types and in different geographical regions requires further validation. While the current study focuses on rice, extending the model to other crops and diverse environmental conditions would provide a more comprehensive assessment of its generalizability and robustness. Future research should aim to evaluate and improve the model's performance across a broader range of agricultural applications.
6. Conclusions
This study provides a comprehensive analysis of multi-modal fusion strategies for rice field identification, integrating optical and SAR remote sensing data. The comparative analysis of different fusion strategies—input-level (EFT), feature-level (FFT), and decision-level (DFT)—reveals that early fusion minimizes information loss by enabling direct interaction between modalities from the initial processing stages. In contrast, decision-level fusion suffers from information bottlenecks, leading to significantly lower accuracy. The proposed EFT achieves superior performance, with an overall accuracy of 98.33% and a rice-specific IoU of 83.47%, significantly outperforming both single-modality baselines and alternative fusion strategies. The model also exhibits strong robustness under complex environmental conditions, maintaining high accuracy even in cloud-affected regions and across diverse agricultural landscapes. The results demonstrate that early-stage fusion, implemented through a Transformer-based framework, is the most effective approach for synthesizing complementary information from heterogeneous data sources.
The ablation study further highlights the critical role of adaptive fusion mechanisms in improving model performance. The channel attention mechanism in EFT is shown to be essential for dynamically adjusting the contribution of each modality, leading to improved feature extraction and classification accuracy. Removing this mechanism results in a 4.82% drop in IoU_rice, underscoring its importance in leveraging the complementary nature of optical and SAR data. The study also compares EFT against widely used CNN-based (UNet, DeepLabV3+) and Transformer-based (SegFormer) architectures under an identical early fusion setting. The results indicate that Transformer-based architectures outperform CNN-based models in capturing long-range dependencies and modeling spatio-temporal relationships. However, the effectiveness of channel attention varies across architectures, benefiting CNN-based models more significantly than Transformer-based ones.
Our study establishes a Transformer-based multi-modal fusion framework, providing a systematic evaluation of fusion strategies for rice identification. The results demonstrate that early-level fusion outperforms feature-level and decision-level fusion, with the EFT model exhibiting high stability in cloud-affected regions. This makes it a reliable solution for agricultural monitoring, particularly in areas where cloud cover limits the availability of optical data. These findings deepen our understanding of multi-modal feature integration and offer valuable insights for optimizing data fusion strategies in large-scale agricultural monitoring. Future research should focus on extending this approach to other crop types and diverse agricultural regions, further enhancing its generalizability and practical applicability.