MMDFRNet: Dynamic Cross-Modal Decoupling and Alignment for Robust Rice Mapping

Fu, Tingyan; Ge, Jia; Tian, Shufang

doi:10.3390/rs18091413


by Tingyan Fu ¹, Jia Ge ²,* and Shufang Tian

¹ School of Earth Sciences and Resources, China University of Geosciences (Beijing), Beijing 100083, China

² Oil and Gas Resources Investigation Center of China Geological Survey, Beijing 100083, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(9), 1413; https://doi.org/10.3390/rs18091413

Submission received: 9 March 2026 / Revised: 30 April 2026 / Accepted: 30 April 2026 / Published: 2 May 2026

(This article belongs to the Special Issue Advanced Remote Sensing Techniques in Agriculture and Artificial Intelligence)


Highlights

What are the main findings?

Developed MMDFRNet, a mechanism-driven dual-stream framework featuring dynamic cross-modal decoupling and alignment to resolve statistical disparities between Sentinel-1 SAR and Sentinel-2 optical imagery.
Achieved state-of-the-art performance (IoU of 0.8612), outperforming recent advanced paradigms (e.g., UNetFormer, STMA) in both mapping accuracy and inference speed.
Resolved the multi-modal “degradation paradox” by effectively suppressing SAR speckle noise and transforming it into a performance booster through pixel-wise adaptive alignment.

What are the implications of the main findings?

Methodological Implication: Validates that pixel-wise adaptive recalibration effectively bridges the physical gap between SAR structural data and optical spectral data, providing a robust paradigm for handling asynchronous data quality.
Practical Implication: Delivers a highly robust and computationally efficient tool for regional food security monitoring, demonstrating exceptional generalization across both fragmented smallholder plots and large-scale agricultural landscapes.

Abstract

Accurate rice mapping is critical for grain yield estimation and food security, yet traditional methods often struggle with asynchronous data quality and the inherent statistical gap between SAR and optical signals. To bridge this gap, we propose MMDFRNet, a novel multi-modal deep learning framework that synergistically integrates Sentinel-1 SAR and Sentinel-2 optical imagery. Unlike conventional static fusion approaches, MMDFRNet features a dual-stream modality-specific encoder architecture designed to decouple structural backscattering signals from spectral reflectance. Central to this framework is the multi-modal feature fusion (MMF) module, which employs an adaptive attention mechanism to dynamically align and recalibrate features based on their reliability, effectively mitigating noise from compromised modalities. Additionally, a multi-scale feature fusion (MSF) module is incorporated to coordinate hierarchical semantic information, enhancing boundary delineation in fragmented landscapes. Extensive experiments conducted across multiple study areas in China demonstrate the superiority of MMDFRNet. The model achieves a Precision of 0.9234, an IoU of 0.8612, and an F1-score of 0.9252. Notably, it consistently outperforms state-of-the-art benchmarks (e.g., UNetFormer, STMA, and CCRNet) and exceeds classic baselines by margins of up to 11.72% (Precision) and 7.39% (IoU). Furthermore, rigorous ablation studies and degradation analyses confirm the model’s robustness, verifying its ability to transform the degradation paradox into a performance booster through pixel-wise adaptive alignment. Consequently, MMDFRNet offers a promising solution for precise rice area statistics and long-term monitoring in complex agricultural landscapes.

Keywords:

agricultural monitoring; crop mapping; deep learning; multi-modal data fusion; Synthetic Aperture Radar (SAR)

1. Introduction

According to the Food and Agriculture Organization (FAO), crop planting areas constitute approximately one-third of the global land surface [1,2]. Among these crops, rice is a primary staple food. Consequently, accurate identification and dynamic monitoring of rice paddies are imperative for grain yield estimation, agricultural resource management, and food security strategies, particularly in the face of climate change and frequent extreme weather events [3]. While remote sensing has become a core technology for large-scale crop monitoring due to its timeliness and broad coverage [1,2,4], traditional single-modality approaches are constrained by limited mapping accuracy and poor cross-regional generalization [5,6,7].

Historically, optical imagery (e.g., Sentinel-2 [8,9,10,11], Landsat [12,13,14]) served as the primary data source for rice mapping. Vegetation indices, such as the Normalized Difference Vegetation Index (NDVI) [2,10,15] and Enhanced Vegetation Index (EVI) [16,17], effectively capture crop phenological dynamics. However, optical sensors are inherently constrained by illumination conditions and cloud contamination. Rice cultivation typically coincides with rainy seasons, leading to significant data gaps and reduced mapping accuracy in cloud-prone regions [16]. In contrast, Synthetic Aperture Radar (SAR) offers all-weather, day-and-night imaging capabilities. SAR is particularly sensitive to the dielectric properties and geometric structures of targets, making it highly effective for detecting the unique flooded conditions of rice paddies [18,19,20,21]. The Sentinel-1 satellite series, with its high spatial resolution and short revisit period, has been widely utilized to extract backscattering coefficients (σ₀) for rice mapping [15,22,23,24,25,26]. Nevertheless, SAR data lacks the rich spectral fidelity of optical imagery, often resulting in confusion between rice and other targets with similar structural properties.

The difficulty of rice mapping is further compounded by its complex phenological changes [27,28]. During the transplanting and vegetative stages, rice paddies are a mixture of water and sparse plants, presenting low backscattering in SAR and unique spectral signatures in optical data. In the reproductive and ripening stages, the canopy closes, and the structural complexity increases. A robust mapping model must adapt to these drastic temporal variations. However, traditional single-source methods struggle to capture these multi-dimensional characteristics simultaneously. Therefore, leveraging the synergy between the spectral details of optical data and the structural penetration of SAR is essential to overcome the limitations of single-source methods [3,22,29].

Early research predominantly employed “input-level” fusion strategies to integrate these heterogeneous data sources. In these approaches, optical and SAR images were directly concatenated along the channel dimension and fed into classic segmentation models such as U-Net [30], SegNet [31], or PSPNet [32]. For instance, Fu et al. [15] stacked heterogeneous data prior to inputting it into an R-Unet model. However, such methods fail to account for the distinct modal properties of the data. Direct concatenation lacks adaptive mechanisms to extract modality-specific features, limiting the model’s ability to bridge the semantic gap between coherent radar signals and optical reflectance [28].

Subsequently, feature-level fusion architectures emerged to enable deeper interaction. Wu et al. [33], for example, proposed the CCR-Net, which fuses hyperspectral, SAR, and LiDAR data through a CNN-based architecture. More recently, advanced deep learning paradigms have further pushed the boundaries of remote sensing segmentation. For instance, Wang et al. [34] introduced UNetFormer, a hybrid architecture combining a lightweight ResNet18 encoder with a Transformer-based decoder to efficiently model global context for land cover classification. In the realm of time-series SAR analysis, Han et al. [22] proposed the STMA (spatio-temporal multi-level attention) method, which incorporates learnable positional encoding to capture multi-scale features for large-scale crop mapping. Similarly, Lin et al. [35] developed a Phenological-Knowledge-Independent (PKI) method, breaking the reliance on region-specific phenology or manual labels. Regarding multi-modal fusion, Garnot et al. [36] evaluated various strategies and highlighted an innovative “Mid-Fusion” scheme using independent geospatial encoders and shared temporal encoders. Furthermore, Yang et al. [37] introduced the POSTAR framework, utilizing the complementarity of optical and SAR data to automate high-quality sample generation. Other studies have also begun to explore basic attention mechanisms to weight feature channels [33,38,39,40,41]. Moreover, the latest developments in 2024 and 2025 increasingly emphasize dynamic feature alignment and dual-stream architectures [42]. Recent frameworks employ Gaussian dynamic fusion networks [43] and dual-stream local–global attention mechanisms [44,45] to better align heterogeneous modalities, moving beyond basic attention structures.

However, despite these significant advancements, current frameworks remain constrained when dealing with asynchronous data quality, a common issue caused by inherent sensor differences or environmental factors (e.g., severe cloud cover). Primarily, while time-series models demonstrate superior performance, they typically assume relatively consistent data availability across the temporal dimension. In the Yangtze River Basin, the frequent “Meiyu” (Plum Rain) season often results in long-term optical data gaps, rendering dense time-series models less effective or reliant on heavy interpolation that introduces synthetic noise. Furthermore, most existing fusion strategies (including CCRNet) rely on static fusion mechanisms, such as element-wise summation or concatenation. These approaches inherently assume equal reliability across modalities, lacking the true adaptive capability required to suppress local interference dynamically [33,46,47]. Even recent attention-based methods often treat multi-modal features as synchronized signals, neglecting the scenario where one modality is severely compromised (e.g., by thick clouds). From a theoretical perspective, effective fusion of heterogeneous data requires two distinct but coupled processes: (1) Modality decoupling, which separates intrinsic semantic information from modality-specific noise or interference; and (2) Cross-modal alignment, which adaptively integrates these decoupled features based on their instantaneous reliability. Most existing studies lack an explicit mechanism for such dynamic decoupling and alignment, leading to performance degradation in complex wetland environments [48,49,50,51,52]. This impedes the network’s ability to distinguish rice under both optimal clear-sky conditions and cloud-prone regions.

To address these challenges, we propose MMDFRNet, a mechanism-driven multi-modal deep learning framework centered on dynamic cross-modal decoupling and alignment. Unlike generic fusion architectures, MMDFRNet is explicitly designed to handle the asynchronous quality of heterogeneous data. By integrating a dual-stream encoder with adaptive recalibration mechanisms, the framework effectively synthesizes the complementary strengths of SAR structural backscattering and optical spectral reflectance. The main contributions of this study are summarized as follows:

Mechanism-driven modality decoupling: We design independent dual-stream encoders to decouple the distinct characteristics of heterogeneous data. This structure effectively separates the penetrative texture features of SAR from the spectral details of optical images, preventing the feature interference and noise propagation common in traditional shared-encoder approaches.
Dynamic cross-modal alignment via multi-modal feature fusion (MMF) module: An attention-based MMF module is proposed to facilitate pixel-wise dynamic alignment. Unlike static fusion, this module acts as a dynamic gate that suppresses noise from compromised modalities (e.g., cloud-covered optical pixels) while enhancing reliable features, ensuring robust cross-modal complementarity.
Multi-scale feature fusion (MSF) module: To address the scale variation inherent in diverse agricultural landscapes, we incorporate an MSF module. This component hierarchically fuses low-level spatial features with high-level semantic representations, significantly enhancing boundary delineation accuracy in both fragmented and large-scale rice fields.

2. Materials and Methods

2.1. Study Area

The rice–crayfish co-culture (RCC) system is a distinctive agricultural model in China that integrates rice cultivation with crayfish breeding. Leveraging the symbiotic relationship between the two, this model promotes ecological sustainability and economic efficiency [27]. RCC fields are widely distributed across the middle and lower reaches of the Yangtze River, particularly in Hubei, Hunan, Anhui, and Jiangxi provinces.

Qianjiang City (Hubei Province) was selected as the core region for model development and training (Figure 1a,b). Situated in the hinterland of the Jianghan Plain, this region is characterized by flat terrain, an extensive water network, and a subtropical monsoon climate with abundant rainfall and sunshine. These geographical conditions are highly favorable for RCC, making Qianjiang a representative region for investigating this agricultural pattern.

To evaluate the generalization capability of the proposed model across different landscapes, three additional regions were selected: Jianli City (Hubei), Huoqiu County (Anhui), and Yongxiu County (Jiangxi). Jianli City shares similar landscape characteristics with Qianjiang, as it is located at the southern end of the Jianghan Plain. Huoqiu and Yongxiu Counties, in contrast, exhibit distinct spatial patterns. Specifically, the paddy fields in these regions tend to be more regular and structured compared to the fragmented landscapes found in Hubei.

The geographical distribution of these study sites is illustrated in Figure 1d–h. This diverse selection of study areas allows for a comprehensive assessment of the model’s robustness under varying field geometries and regional conditions.

2.2. Data Acquisition and Preprocessing

The primary remote sensing data were acquired and processed via the Google Earth Engine (GEE) platform. We employed a hybrid temporal strategy to effectively balance the need for phenological characterization against the constraints of cloud contamination typical in the study area.

For Synthetic Aperture Radar (SAR) data, we utilized Sentinel-1 Ground Range Detected (GRD) products in Interferometric Wide (IW) mode (VH + VV). To track the continuous structural evolution of rice, characterized by the transition from the specular reflection of water bodies to the volume scattering of the canopy, a full monthly time-series covering the entire rice growth period in 2023 was acquired. The specific temporal coverage for each study area is detailed in Table 1. Standard preprocessing on GEE included thermal noise removal, radiometric calibration to backscattering coefficients (σ₀), terrain correction, and monthly temporal compositing to mitigate speckle noise.

Regarding optical data, the persistent cloud cover during the Meiyu season in the Yangtze River Basin makes acquiring dense time-series challenging. Consequently, we adopted a bi-temporal strategy centered on the greening stage. Instead of a continuous series, we selected Level-2A images from two critical phenological windows: the pre-greening stage (transplanting phase characterized by water dominance) and the post-greening stage (heading phase characterized by vigorous vegetation growth). This combination efficiently captures the unique “Water-to-Vegetation” shift essential for identifying rice. The specific acquisition dates for each region are listed in Table 1. The final optical input comprises 10 channels, consisting of Red, Green, Blue, NIR, and NDVI bands for both phases. The NDVI is calculated as follows:

NDVI = \frac{NIR - Red}{NIR + Red}

(1)

where Red and NIR correspond to Band 4 and Band 8 of Sentinel-2, respectively.
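As a minimal sketch of Equation (1), the index can be computed per pixel as below. The `eps` guard and the reflectance values are assumptions made for illustration; they are not taken from the study data or from the authors' GEE workflow.

```python
def ndvi(nir, red, eps=1e-10):
    """Compute NDVI = (NIR - Red) / (NIR + Red) for one pixel.

    `eps` is an assumption added here to avoid division by zero
    when both reflectances are zero (e.g., no-data pixels).
    """
    return (nir - red) / (nir + red + eps)


# Hypothetical surface reflectance values (not taken from the study data).
flooded = ndvi(nir=0.05, red=0.06)     # water-dominated, pre-greening pixel
vegetated = ndvi(nir=0.45, red=0.05)   # dense canopy, post-greening pixel
print(round(flooded, 3), round(vegetated, 3))
```

The contrast between a near-zero or negative value before greening and a high value after it is the “Water-to-Vegetation” shift that the bi-temporal optical input is meant to capture.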

2.3. Ground Truth Generation and Quality Control

Pixel-level ground truth labels were generated via manual visual interpretation following a rigorous quality control protocol. To ensure high geometric precision, sub-meter historical imagery from Google Earth was utilized as the primary reference to delineate the fine boundaries of fields, roads, and ponds, while Sentinel-2 False Color Composites served as auxiliary references to verify crop phenology. Furthermore, to minimize subjective bias and label noise, a peer-review mechanism was implemented. Initial annotations were labeled by researchers and subsequently audited by senior experts; any samples containing ambiguous features or inter-annotator disagreement were strictly excluded to maintain dataset integrity.

Based on this protocol, we constructed a robust dataset by selecting representative sample regions across the study areas. Specifically, in Qianjiang, the dataset comprises 9 regions (2048 × 2048 pixels) for training and validation, alongside 11 regions (1024 × 1024 pixels) dedicated to testing. For Jianli, 5 regions were selected for training and validation, with an additional 7 regions for testing. To evaluate model transferability, a combined total of 5 training/validation regions and 8 testing regions were acquired from Yongxiu and Huoqiu. Prior to model input, these annotated images were cropped into patches of 256 × 256 pixels using a sliding window strategy with a stride of 128 pixels. The final distribution of patches for each study area is summarized in Table 2, and 20% of the training samples were randomly set aside for validation.
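The sliding-window arithmetic can be sketched as follows. This assumes windows are placed without padding and that no patches are discarded, so the per-region counts are only indicative and may differ from the totals in Table 2; the function names are hypothetical.

```python
def patches_per_side(size, patch=256, stride=128):
    """Number of window positions along one image side (no padding)."""
    return (size - patch) // stride + 1


def patch_origins(size, patch=256, stride=128):
    """Top-left offsets of each window along one side."""
    return [i * stride for i in range(patches_per_side(size, patch, stride))]


for size in (2048, 1024):
    n = patches_per_side(size)
    print(f"{size} x {size} region -> {n} x {n} = {n * n} patches")
```

With a stride of half the patch size, neighbouring patches overlap by 50%, so pixels away from the region border appear in several patches.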

2.4. Models and Principles

2.4.1. MMDFRNet

As illustrated in Figure 2, the proposed MMDFRNet comprises four integral components: a dual-stream feature encoder, multi-modal feature fusion (MMF), multi-scale feature fusion (MSF), and atrous spatial pyramid pooling (ASPP). Specifically, the architecture initiates with two parallel branches: the Optical Feature Encoder (OFE), designed to progressively extract spectral phenological representations, and the SAR Feature Encoder (SFE), dedicated to capturing structural backscattering features. Subsequently, the MMF module synthesizes these heterogeneous feature maps to effectively integrate optical spectral information with SAR texture details. Furthermore, the ASPP module employs dilated convolutions with varying rates to capture multi-scale contextual information by expanding the receptive field. Concurrently, the MSF module hierarchically fuses high-level semantic features with low-level spatial details. Finally, the fused representations are processed by the decoder (DeConv) to generate the precise rice distribution map.

2.4.2. Multi-Modal Fusion Module

The multi-modal feature fusion (MMF) module serves as the core engine for dynamic cross-modal alignment, bridging the heterogeneous features decoupled by the dual-stream encoders. As illustrated in Figure 3, this module is designed to facilitate adaptive feature recalibration and robust cross-modal interaction by explicitly addressing the asynchronous quality of optical and SAR data. Specifically, the MMF module operates on the principle that effective fusion requires first recognizing the distinct reliability of each modality (alignment) and then integrating them without mutual interference. It comprises three core components: the feature recalibration unit, which dynamically assesses pixel-wise reliability to align feature contributions; the cross-modal enhancement unit, which leverages complementary information from the dominant modality to enhance the compromised one; and the residual fusion unit, which ensures efficient gradient flow and preserves the intrinsic structural details decoupled in the earlier stages.

Firstly, the input feature tensor x ∈ R^{B×C×H×W} is split into separate optical (F_{opt}) and SAR (F_{sar}) branches along the channel dimension, where B, C, H, and W denote the batch size, channel count, height, and width, respectively. To achieve adaptive feature selection, the attention weights are computed separately for each modality using independent global context-aware gating mechanisms (i.e., channel attention). This mechanism explicitly models inter-channel dependencies to perform feature recalibration, mathematically expressed as:

A_{opt} = σ (W_{2} \cdot δ (W_{1} \cdot GAP (F_{opt})))

(2)

A_{sar} = σ (W_{4} \cdot δ (W_{3} \cdot GAP (F_{sar})))

(3)

where A_{opt} and A_{sar} denote the generated attention weight vectors for optical and SAR features, respectively; GAP represents the global average pooling operation; δ denotes the ReLU activation function; and σ is the Sigmoid activation function. The parameters W_{1}, W_{3} \in R^{\frac{C}{r} \times C} and W_{2}, W_{4} \in R^{C \times \frac{C}{r}} correspond to 1 × 1 convolution kernels, where r is the channel reduction ratio; together they form a bottleneck structure to capture channel correlations. This gating mechanism generates weights in the range of (0,1) to suppress redundant features while enhancing modality-specific discriminative information.
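The gating in Equations (2) and (3) can be sketched for a single modality in plain Python. This is an illustration under stated assumptions, not the authors' PyTorch implementation: biases are omitted, the weights are random, and the sizes (C = 4, r = 2, 3 × 3 maps) and function names are hypothetical.

```python
import math
import random


def channel_attention(feat, w_down, w_up):
    """Eq. (2)/(3): sigmoid(W_up . relu(W_down . GAP(feat))).

    feat   : C feature maps, each an H x W nested list
    w_down : (C/r) x C matrix (plays the role of W1 or W3)
    w_up   : C x (C/r) matrix (plays the role of W2 or W4)
    """
    # Global average pooling: one scalar per channel.
    gap = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    # Bottleneck: squeeze to C/r values, then ReLU.
    hidden = [max(0.0, sum(w * g for w, g in zip(row, gap))) for row in w_down]
    # Expand back to C values and squash to (0, 1) with a sigmoid.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_up]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def recalibrate(feat, weights):
    """Scale every pixel of channel c by that channel's attention weight."""
    return [[[a * v for v in row] for row in ch] for ch, a in zip(feat, weights)]


# Toy example with hypothetical sizes: C = 4 channels, r = 2, 3 x 3 maps.
random.seed(42)
C, r, H, W = 4, 2, 3, 3
feat = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w_down = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]
w_up = [[random.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]

a = channel_attention(feat, w_down, w_up)
gated = recalibrate(feat, a)
print([round(x, 3) for x in a])
```

Because each weight lies strictly between 0 and 1, the gate can only attenuate a channel, which is how less informative channels are suppressed relative to the others.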

Secondly, the input features are independently recalibrated using their respective channel attention weights via element-wise multiplication to suppress redundant modality-specific noise:

F_{opt}^{'} = F_{opt} ⊗ A_{opt}

(4)

F_{sar}^{'} = F_{sar} ⊗ A_{sar}

(5)

Subsequently, the recalibrated feature maps are subjected to a cross-modal attention mechanism. Instead of simple concatenation, the “guidance signal” is implemented through a Query–Key–Value (QKV) matrix multiplication architecture. For instance, to guide the optical features using SAR representations, the Query matrix (Q_{opt}) is linearly projected from F_{opt}^{'}, while the Key (K_{sar}) and Value (V_{sar}) matrices are projected from F_{sar}^{'}. The optical feature guided by SAR, denoted as F_{opt}^{″}, is computed as follows:

F_{opt}^{″} = γ \cdot (V_{sar} \cdot {Softmax (Q_{opt} \cdot K_{sar}^{T})}^{T}) + F_{opt}^{'},

(6)

where γ is a learnable scaling parameter initialized to zero. A symmetric QKV operation is performed simultaneously to obtain the SAR feature guided by optical data (F_{sar}^{″}).
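A small sketch of Equation (6) on flattened features may help show the role of γ. It assumes identity Q, K, and V projections (the paper describes learned linear projections), uses hypothetical toy values, and stands in for the recalibrated inputs with plain nested lists.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(f_opt, f_sar, gamma):
    """Sketch of Eq. (6) on flattened features.

    f_opt, f_sar : C x N nested lists (C channels, N = H*W positions).
    Assumption: the Q, K, V projections are taken as identity maps here,
    whereas the paper describes learned linear projections.
    """
    C, N = len(f_opt), len(f_opt[0])
    # Q and K hold one C-dimensional vector per position (N x C).
    q = [[f_opt[c][i] for c in range(C)] for i in range(N)]
    k = [[f_sar[c][j] for c in range(C)] for j in range(N)]
    v = f_sar  # C x N
    # attn[i][j]: weight of SAR position j for optical query position i.
    attn = [softmax([sum(a * b for a, b in zip(q[i], k[j])) for j in range(N)])
            for i in range(N)]
    # V . attn^T has shape C x N; gamma scales it before the residual add.
    return [[gamma * sum(v[c][j] * attn[i][j] for j in range(N)) + f_opt[c][i]
             for i in range(N)] for c in range(C)]


# Hypothetical toy features: C = 2 channels, N = 3 positions.
f_opt = [[0.2, 0.5, 0.1], [0.9, 0.3, 0.4]]
f_sar = [[1.0, 0.0, 0.5], [0.2, 0.8, 0.6]]

at_init = cross_attention(f_opt, f_sar, gamma=0.0)   # gamma starts at zero
later = cross_attention(f_opt, f_sar, gamma=0.5)     # hypothetical learned value
print(at_init == f_opt)
```

With γ = 0 the block reduces to the identity on the recalibrated optical features, so the cross-modal term only contributes as γ moves away from zero during training.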

Finally, the cross-attended features (F_{opt}^{″} and F_{sar}^{″}) are concatenated along the channel dimension and processed through a residual block. This sophisticated two-stage design (element-wise self-gating followed by QKV cross-attention) effectively synergizes the structural texture of SAR and the spectral fidelity of optical imagery, making the MMF module highly robust in complex agricultural landscapes.

2.4.3. ASPP

As illustrated in Figure 2, the ASPP module comprises five parallel branches designed to capture contextual information at multiple scales. Specifically, the fifth branch integrates a global average pooling (GAP) layer to extract image-level features, thereby incorporating global context into the local representation [1]. This operation is mathematically defined as:

X_{pool} = \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} x_{i, j}

(7)

where x denotes the input feature map, and H and W represent its height and width, respectively.

To handle scale variations, branches 2 to 4 utilize 3 × 3 convolutional layers with varying dilation rates. By adjusting these dilation rates, the network effectively expands its receptive field to encode broader contextual information without losing spatial resolution. Finally, the outputs from all five branches are concatenated along the channel dimension, followed by a dropout operation to enhance model robustness and mitigate overfitting.
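The effect of dilation on the receptive field can be made concrete with the standard relation k + (k − 1)(d − 1) for the span of a dilated kernel. The rates in the loop are illustrative (a common DeepLab-style choice) and are an assumption; the rates used in MMDFRNet are not listed in this section.

```python
def effective_kernel(kernel, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


# Assumption: the rates below are illustrative (the common DeepLab-style choice);
# the rates used in MMDFRNet are not listed in this section.
for rate in (1, 6, 12, 18):
    span = effective_kernel(3, rate)
    print(f"3 x 3 conv, dilation {rate:>2}: covers {span} x {span} pixels, 9 weights")
```

The number of weights stays at nine in every case, which is why dilation widens the context seen by a branch without adding parameters or reducing spatial resolution. For the dilation rate of 2 mentioned elsewhere in the paper, the same formula gives a 5 × 5 span.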

2.4.4. Multi-Scale Feature Fusion Module

The MSF module is designed to bridge the semantic gap between varying feature levels by integrating outputs from the MMF module and the decoder (DeConv). As illustrated in Figure 2, this module comprises three parallel branches. Specifically, the third branch utilizes a dilated convolution with a dilation rate of 2. This design effectively expands the receptive field to capture broader contextual information without sacrificing spatial resolution [15]. By synthesizing feature information across different scales, the MSF module generates a highly discriminative representation, thereby enhancing the network’s ability to delineate complex boundaries in rice mapping.

2.4.5. OFE and SFE

In the feature extraction encoder stage, two feature encoders are designed for optical and SAR data, respectively.

As illustrated in Figure 4, the SFE adopts a multi-scale dilated convolution strategy. Initially, a 3 × 3 convolution layer is employed to halve the channel dimension, focusing on capturing local structural features. Subsequently, a dilated convolution with a rate of 2 is applied to expand the receptive field. These multi-scale features are then integrated via concatenation and a fusion convolution layer. Following this, a residual block is introduced, which performs nonlinear transformations through dual 3 × 3 convolution paths combined with skip connections. This residual design facilitates identity mapping, effectively mitigating the gradient vanishing problem and promoting feature reuse [7].

The OFE utilizes a three-branch parallel architecture, as shown in Figure 5. The primary branch employs Depthwise Separable Convolution to efficiently encode the spatial details of optical imagery. Concurrently, parallel branches utilizing 1 × 1 and 5 × 5 convolutions are deployed to capture broader contextual information. The outputs from these three branches are first concatenated along the channel dimension. Finally, the aggregated features are coupled through a fusion convolution layer and further refined by a residual module to enhance feature representation.
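The efficiency argument for the depthwise separable branch can be illustrated by counting weights. The layer size below is hypothetical, since the OFE channel widths are not stated in this section, and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k filter per input channel + 1 x 1 pointwise mixing."""
    return k * k * c_in + c_in * c_out


# Hypothetical layer size; the OFE channel widths are not stated here.
c_in, c_out, k = 64, 64, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(std, sep, round(std / sep, 2))
```

For this example the separable form needs roughly one eighth of the weights of a standard 3 × 3 convolution, and the saving grows with the number of output channels.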

2.5. Experimental Design for Ablation Studies

To thoroughly evaluate the proposed framework, we designed a two-tiered ablation strategy focusing on both the internal network components and the input modalities.

2.5.1. Module Effectiveness Analysis

To verify the contribution of specific components (MMF, MSF, and ASPP), we constructed four variants based on the MMDFRNet architecture:

MMFNet (w/o MSF): As shown in Figure 6a, this variant is constructed by removing the MSF module from MMDFRNet. The objective is to verify the necessity of multi-scale feature integration. By retaining only the MMF module, we can assess whether the network suffers from insufficient local detail and generalization capability when hierarchical semantic fusion is absent.
MSFNet (w/o MMF): To validate the efficacy of the proposed adaptive fusion mechanism, we constructed the MSFNet (Figure 6b) by removing the MMF module. In this setup, features from the optical and SAR encoders are directly concatenated and fed into the MSF module. This comparison helps clarify whether the dynamic recalibration and cross-modal enhancement (provided by MMF) offer a significant advantage over simple static fusion.
MMDFNet (w/o ASPP): To verify the contribution of the atrous spatial pyramid pooling module, we constructed this variant by replacing the dilated convolutions in the ASPP block with standard 1 × 1 convolutions, as illustrated in Figure 6c. This configuration removes the multi-scale receptive field expansion capability, allowing us to quantify the module’s specific role in capturing long-range contextual dependencies and handling scale variations in rice fields.
RNet (Baseline): As a baseline, the MMF, MSF, and ASPP modules are all removed (Figure 6d). The dual-stream encoder outputs are directly concatenated and passed to the decoder. This model serves to quantify the overall performance gain attributed to the proposed fusion architecture.

2.5.2. Modality Necessity Analysis

Beyond architectural components, we also investigated the specific contribution of SAR and optical data to rice mapping. We designed single-modality variants of MMDFRNet where either the SAR or optical branch is disabled, as illustrated in Figure 7. Specifically, the MMDFRNet (SAR-only) retains only the SFE branch, while the MMDFRNet (optical-only) relies exclusively on the OFE branch. These experiments aim to demonstrate the necessity of multi-modal fusion, particularly in overcoming the limitations of single-source data under complex weather conditions.

2.6. Comparative Methods and Degradation Verification

2.6.1. Classic and SOTA Comparison Models

To comprehensively evaluate the performance of MMDFRNet, we compared it against six benchmarks, ranging from classic architectures to recent state-of-the-art (SOTA) methods. Specifically, for single-stream architectures (i.e., U-Net, PSPNet, UNetFormer, and STMA), we adopted an input-level fusion strategy by stacking the optical and SAR images along the channel dimension:

Classic semantic segmentation models: We selected U-Net [30] and PSPNet [32]. U-Net is a pioneering encoder–decoder architecture that recovers spatial details through symmetric skip connections, though it primarily relies on simple channel-wise concatenation for feature integration. PSPNet utilizes a pyramid pooling module to aggregate global context information at multiple scales, effectively capturing scene-level semantics through fixed-grid spatial pooling.
Domain-specific fusion models: We included R-Unet [15] and CCRNet [33]. R-Unet is a specialized rice mapping model that incorporates deep residual blocks into a U-Net structure to enhance feature extraction from bi-temporal remote sensing data. CCRNet is a representative multi-modal network that employs a cross-level regional response mechanism to align and fuse features from heterogeneous sensors through cross-attention.
SOTA models: To benchmark against advanced deep learning paradigms, we selected UNetFormer [34] and STMA [22]. UNetFormer introduces a hybrid Transformer-based architecture (combining a ResNet encoder with a Transformer decoder) for efficient global modeling. STMA is designed to capture multi-scale spatio-temporal features through learnable positional encodings.

2.6.2. Degradation Verification in Classic Models

To investigate the performance degradation, we designed a specific verification experiment using the U-Net as the representative baseline. We trained the U-Net under three input configurations: Optical-only, SAR-only, and dual-modal (concatenation). This experiment aims to verify whether direct fusion introduces noise that hampers performance in classic architectures, thereby highlighting the necessity of the proposed adaptive fusion mechanism.

2.7. Hyperparameter Settings

The model was implemented using the PyTorch 1.13.0 framework and optimized via the AdamW optimizer. The initial learning rate (lr) was set to 0.001, with a weight decay of 0.01. To stabilize convergence, we employed a cosine annealing learning rate scheduler [53], which dynamically adjusts the lr from a maximum (η_{max}) to a minimum (η_{min}) following a cosine curve. The scheduling formula is defined as:

η_{t} = η_{min} + \frac{1}{2} (η_{max} - η_{min}) (1 + \cos (\frac{T_{cur}}{T_{max}} π))

(8)

where η_{t} denotes the learning rate at the current epoch, and T_{cur} and T_{max} represent the current and maximum number of epochs, respectively. Experimental settings were fixed as follows: batch size = 16, total epochs = 60, and random seed = 42 to ensure reproducibility.
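Equation (8) can be evaluated directly to see the shape of the schedule. The sketch below uses the stated maximum of 0.001 and 60 epochs; the minimum η_min = 0 is an assumption, as the text does not give its value.

```python
import math


def cosine_lr(t_cur, t_max=60, eta_max=1e-3, eta_min=0.0):
    """Eq. (8). eta_min = 0 is an assumption; the text does not state it."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_max))


for epoch in (0, 15, 30, 45, 60):
    print(epoch, f"{cosine_lr(epoch):.6f}")
```

The rate starts at η_max, passes the midpoint between η_max and η_min halfway through training, and reaches η_min at the final epoch. PyTorch ships an equivalent scheduler, `torch.optim.lr_scheduler.CosineAnnealingLR`, with `T_max` and `eta_min` arguments; whether the authors used it is not stated.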

For the loss function, we adopted the Deep Supervision Loss [54] to facilitate gradient flow and hierarchically optimize multi-scale features. The total loss L is a weighted sum of the main and auxiliary losses:

L = w_{0} \cdot L_{main} + w_{1} \cdot L_{aux1} + w_{2} \cdot L_{aux2}

(9)
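Equation (9) amounts to a weighted sum of one main loss and two auxiliary losses. The sketch below shows that combination with plain Python floats. The helper name `deep_supervision_loss`, the weights (1.0, 0.4, 0.2), and the individual loss values are placeholders chosen for illustration, since the weights are not specified in this section.

```python
def deep_supervision_loss(main_loss, aux_losses, weights):
    """Weighted sum of the main loss and the auxiliary losses, as in Equation (9).

    weights[0] scales the main loss; weights[1:] scale the auxiliary losses.
    """
    if len(weights) != 1 + len(aux_losses):
        raise ValueError("expected one weight for the main loss plus one per auxiliary loss")
    total = weights[0] * main_loss
    for w, aux in zip(weights[1:], aux_losses):
        total += w * aux
    return total


# Placeholder values only: neither the weights nor the losses come from the paper.
total = deep_supervision_loss(main_loss=0.30, aux_losses=[0.45, 0.60], weights=[1.0, 0.4, 0.2])
print(f"total loss: {total:.4f}")
```

Because the function only multiplies and adds, it would apply equally to scalar loss tensors in a training loop.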

2.8. Model Evaluation

To quantitatively assess the rice mapping performance, we utilized five standard metrics: Precision, Recall, F1-score, Intersection over Union (IoU), and Matthews Correlation Coefficient (MCC). The mathematical definitions are as follows:

Precision = \frac{TP}{TP + FP}, Recall = \frac{TP}{TP + FN}

(10)

F1-score = 2 \times \frac{Recall \times Precision}{Recall + Precision}

(11)

IoU = \frac{TP}{TP + FP + FN}

(12)

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

(13)

where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively.
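Equations (10) to (13) can be computed from the four confusion-matrix counts in a few lines. The following is a minimal sketch, not the evaluation code used in the study: the function name `binary_metrics` and the example counts are hypothetical, and the function assumes non-degenerate counts (no zero denominators).

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Precision, Recall, F1-score, IoU and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    iou = tp / (tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou, "mcc": mcc}


# Hypothetical pixel counts, not taken from the paper's tables.
m = binary_metrics(tp=80, fp=10, tn=100, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

When F1-score and IoU are computed from the same counts, they are tied by F1 = 2 IoU / (1 + IoU), so the two metrics always rank models identically in that setting, whereas MCC additionally accounts for true negatives.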

3. Results

3.1. Training Dynamics and Convergence Analysis

To evaluate the stability and convergence of the proposed MMDFRNet, we recorded the loss and performance metrics during the training process. As illustrated in Figure 8a, both the training and validation loss curves exhibited a rapid synchronized decline in the first 20 epochs, followed by a gradual stabilization, eventually reaching convergence around the 45th epoch. This consistent downward trend across both sets indicates an efficient optimization process.

Simultaneously, the evolution of mapping accuracy, represented by IoU and F1-score, is shown in Figure 8b. Both metrics increased steadily and reached a stable plateau in the later stages of training. The minimal gap between the training and validation metrics further confirms that MMDFRNet possesses robust generalization capabilities and is free from significant overfitting, providing a reliable foundation for the subsequent comparative analysis.

3.2. Comparative Analysis with SOTA Models

To rigorously evaluate the performance of MMDFRNet, we benchmarked it against seven established models, ranging from classic CNN architectures to recent state-of-the-art (SOTA) methods.

The quantitative performance within the primary study area is summarized in Table 3. Among the comparison methods, the Transformer-based UNetFormer and the attention-based CCRNet demonstrated competitive performance, achieving IoU scores of 0.8207 and 0.8342, respectively. However, the proposed MMDFRNet outperformed all benchmarks on every metric evaluated. Specifically, it achieved a Precision of 0.9234, an IoU of 0.8612, and an F1-score of 0.9252. Compared to UNetFormer, our method improved the IoU by 4.05 percentage points, indicating that our mechanism-driven fusion strategy captures heterogeneous features more effectively than generic global modeling. Similarly, MMDFRNet surpassed the attention-based fusion models CCRNet and STMA by margins of 2.70 and 5.21 percentage points in IoU, respectively. The visual comparisons in Figure 9 further corroborate this superiority. While STMA and UNetFormer occasionally produce fragmented boundaries in mixed wetland areas, MMDFRNet excels at delineating the fine edges of rice–crayfish fields, effectively suppressing the false positives observed in other models.

In addition to accuracy, computational efficiency is critical for large-scale agricultural monitoring. Table 4 details the inference time for each model processing a single image. Classic single-stream models like R-Unet (0.099 s) and PSPNet (0.1488 s) exhibit the fastest speeds due to their lightweight architectures. Although MMDFRNet uses a dual-stream encoder, which inherently increases the computational load (6.0274 s), it is still faster than both SOTA models, UNetFormer (6.4432 s) and STMA (6.7643 s). This result suggests that although Transformer-based architectures and complex spatio-temporal attention mechanisms offer high theoretical capacity, they often incur high computational costs. In contrast, MMDFRNet achieves a favorable trade-off, delivering the highest accuracy with an inference overhead that remains reasonable for regional mapping tasks.

3.3. Internal Module Effectiveness Analysis

To quantify the individual contributions of the MMF, MSF, and ASPP modules, we compared the performance of MMDFRNet against its four ablation variants. The quantitative comparisons are visually illustrated in the bar charts in Figure 10.

The results reveal that the full MMDFRNet consistently achieves the highest scores across all metrics, verifying the necessity of the complete framework. Specifically, removing the MSF module (MMFNet) led to a noticeable decline in hierarchical semantic capture, while removing the MMF module (MSFNet) caused a more significant drop in Precision to 0.8548, confirming the critical role of adaptive cross-modal recalibration. Notably, the contribution of the ASPP module proves to be indispensable. As observed in the MMDFNet variant, the removal of dilated convolutions caused the IoU (0.7703) and MCC (0.7998) to drop even below the RNet baseline. This counter-intuitive result suggests that without the enlarged receptive field provided by ASPP, the network struggles to effectively integrate the complex features extracted by the encoders, leading to fragmented segmentation performance that is inferior to even a simple U-Net-like structure.

Figure 11 presents the visual comparisons in a complex study area characterized by heterogeneous land covers, further corroborating these quantitative findings. Visual inspection reveals distinct error patterns among the variants. The MMDFNet variant, in particular, exhibits significant commission errors, especially in Area 1, where large patches of non-rice vegetation were misclassified as rice. This visual artifact aligns with its low quantitative performance, implying that without the global context from ASPP, the model relies excessively on local spectral signatures and fails to distinguish rice from spectrally similar weeds. Similarly, the RNet and MSFNet variants show varying degrees of omission errors and boundary blurring. In contrast, the full MMDFRNet successfully mitigates both omission and commission errors, producing crisp boundaries that closely match the ground truth. These visual results confirm that the synergistic integration of multi-modal fusion, multi-scale processing, and receptive field expansion is essential for robust rice mapping.

3.4. Modality Necessity and Degradation Verification

To examine the necessity of multi-modal data and the effectiveness of our fusion strategy, we conducted two sets of verification experiments.

3.4.1. Necessity of Multi-Modal Fusion in MMDFRNet

First, we evaluated whether integrating SAR data actually improves performance within our proposed framework. As detailed in Table 5, we compared the full MMDFRNet against its single-modality variants (optical-only and SAR-only). The results confirm that multi-modal fusion is indispensable. In the primary region of Qianjiang, the optical-only variant achieved an IoU of 0.7803. By integrating SAR data via the MMF module, the full model improved the IoU by 8.09 percentage points, reaching 0.8612. A similar trend was observed in the transfer region of Yongxiu, where the fusion model outperformed the optical-only variant by nearly 10 percentage points in IoU (0.8465 vs. 0.7469). Furthermore, the SAR-only performance was significantly lower (IoU 0.6618 in Qianjiang), confirming that while SAR provides critical structural information, it cannot function as a standalone source for precise mapping. These results quantitatively demonstrate that MMDFRNet effectively leverages the complementary strengths of both modalities to achieve superior performance.

3.4.2. Verification of Degradation in Classic Models

Subsequently, we investigated whether this improvement holds true for classic models or if simple fusion leads to performance degradation.

As detailed in Table 6 and visualized in Figure 12, the U-Net (optical-only) variant achieved an IoU of 0.8044, which is higher than both the U-Net (Optical + SAR Concat) baseline (0.7873) and the SAR-only variant (0.6444). This finding confirms a degradation phenomenon in which directly concatenating SAR data into simple networks introduces speckle noise that hampers performance rather than aiding it. In stark contrast, MMDFRNet reverses this trend: the full dual-modal model outperforms its own optical-only variant by a substantial margin of 8.09 percentage points in IoU. This contrast validates that the proposed framework does not merely stack data, but effectively disentangles valid structural features from noise, turning a potential interference source into a critical performance booster.
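The degradation check reduces to comparing each fused model with its best single-modality variant. The sketch below restates that comparison using only the IoU values quoted in Sections 3.4.1 and 3.4.2; the dictionary layout and the helper name `fusion_gain` are illustrative choices, not part of the original experiments.

```python
# IoU values quoted in Sections 3.4.1 and 3.4.2.
reported_iou = {
    "U-Net": {"optical": 0.8044, "sar": 0.6444, "fused": 0.7873},
    "MMDFRNet": {"optical": 0.7803, "sar": 0.6618, "fused": 0.8612},
}


def fusion_gain(scores):
    """IoU of the fused model minus the IoU of its best single-modality variant."""
    return scores["fused"] - max(scores["optical"], scores["sar"])


gains = {model: fusion_gain(scores) for model, scores in reported_iou.items()}
for model, gain in gains.items():
    verdict = "fusion helps" if gain > 0 else "fusion degrades"
    print(f"{model}: {gain * 100:+.2f} percentage points ({verdict})")
```

With the quoted values, the gain is negative for the concatenation-based U-Net (about 1.71 percentage points lost) and positive for MMDFRNet (about 8.09 percentage points gained), which is the sign change that the degradation argument rests on.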

3.5. Model Adaptability Assessment Results

To evaluate the generalization capability of MMDFRNet across diverse agricultural landscapes, we extended the assessment to three additional regions: Jianli, Huoqiu, and Yongxiu.

Table 7 summarizes the quantitative performance across all four study areas. Generally, the model maintained high robustness across these spatially distinct regions. Specifically, in Qianjiang, the model achieved the highest overall accuracy. This was followed closely by Yongxiu, where MMDFRNet yielded an IoU of 0.8465 and an MCC of 0.8612. In Huoqiu, despite a slight fluctuation in MCC (0.7707), the Precision remained robust at 0.8848. Similarly, the results in Jianli were consistent with the primary study area, yielding a Precision of 0.8665 and an MCC of 0.8501. These consistent metrics across varying datasets confirm that MMDFRNet possesses strong adaptability and is not merely overfitting to the source domain.

The visualization results (Figure 13) further corroborate this quantitative success. It is worth noting that the landscape characteristics vary significantly among these regions: rice plots in Huoqiu and Yongxiu are generally larger and more regular compared to the fragmented fields in Qianjiang and Jianli. Against this backdrop, MMDFRNet demonstrated superior spatial consistency. While classic baselines often suffered from “salt-and-pepper” noise (internal misclassification) within the large rice plots of Huoqiu or boundary blurring in Jianli, MMDFRNet produced smooth and complete extraction results. This indicates that the proposed multi-scale and cross-modal fusion mechanisms effectively mitigate the challenges posed by both fragmented boundaries and large-scale homogeneity, ensuring optimal transferability in complex agricultural scenarios.

4. Discussion

4.1. The Analysis of Performance Degradation in Concatenation-Based Fusion

A critical question regarding multi-modal remote sensing is whether the simple addition of data sources guarantees performance improvements. Our degradation verification experiments on classic models reveal a counter-intuitive phenomenon where direct fusion leads to performance regression. As detailed in Table 6, the U-Net trained with concatenated optical and SAR data achieved an IoU of 0.7873, which is lower than that of the U-Net (optical-only) variant (IoU 0.8044). The underlying cause of this performance degradation lies in the optimization dynamics of heterogeneous data. When optical and SAR inputs are directly concatenated, early convolutional layers are forced to process them simultaneously through shared kernels. Because optical reflectance and SAR backscatter possess fundamentally different statistical distributions, the inherent speckle noise in SAR introduces high-frequency artifacts that induce severe modal interference. During backpropagation, the conflicting optimization objectives between clean optical signals and noisy radar features disrupt the gradient flow. The early layers, therefore, struggle to converge on a unified representation, severely compromising the extraction of fundamental low-level spatial details and confusing the overall feature-learning process.

MMDFRNet resolves this conflict. By employing the MMF module for adaptive feature recalibration, our model suppresses SAR noise while extracting valid structural details to compensate for optical deficiencies. This is quantitatively supported by the fact that the full MMDFRNet outperforms its own optical-only variant by a substantial margin of 8.09 percentage points in IoU. This contrast validates that the proposed framework does not merely stack data, but effectively disentangles and synergizes the complementary properties of heterogeneous modalities.

4.2. The Comparative Assessment Against Advanced Deep Learning Paradigms

The evolution of deep learning has introduced advanced paradigms such as Transformers and spatio-temporal networks. In our comparative analysis (Table 8), we benchmarked MMDFRNet against recent SOTA methods, explicitly considering the approaches suggested by recent literature. For instance, UNetFormer, proposed by Wang et al. [34], utilizes a hybrid Transformer architecture to capture global context. However, our results show that MMDFRNet outperforms it by 4.05 percentage points in IoU. This suggests that while Transformers are powerful, their data-hungry nature may make them less effective than our mechanism-driven MMF module at handling the specific “asynchronous quality” of heterogeneous SAR–Optical data in wetland scenarios. Similarly, compared to the time-series-dependent STMA method by Han et al. [22] and the phenology-independent approach by Lin et al. [35], MMDFRNet demonstrates superior boundary delineation (Figure 9). This indicates that our strategy is more robust in cloud-prone regions than models relying on dense time-series, which often suffer from data gaps in the Meiyu season. Furthermore, regarding the multi-modal fusion strategies discussed by Garnot et al. [36] and Yang et al. [37], our results suggest that pixel-wise adaptive recalibration is more efficient than static or mid-level fusion schemes. Notably, regarding computational efficiency (Table 4), MMDFRNet (6.02 s) is faster than both UNetFormer (6.44 s) and STMA (6.76 s), offering a better trade-off between precision and speed for large-scale agricultural applications.

4.3. Evidence of MMF Necessity from Regional Performance Variations

Effective utilization of multi-modal data is a pivotal challenge in rice mapping [55]. To validate the critical role of the MMF module, particularly its robust performance across regions with varying data quality and terrain complexities, we conducted extensive ablation analyses based on the metrics detailed in Table 8.

The results reveal a strong correlation between the module’s contribution and regional characteristics. In cloud-prone regions like Qianjiang and Yongxiu, removing the MMF module (the MSFNet variant) caused a sharp decline in performance. Specifically, the Precision gap between the full model and MSFNet reached 6.86% in Qianjiang and 7.73% in Yongxiu. This significant drop quantitatively confirms that in fragmented terrains where optical data is often compromised by cloud cover or mixed pixels [56], the adaptive recalibration provided by the MMF module is essential to effectively align optical and SAR features [50,57].

Furthermore, a distinct phenomenon was observed in the homogeneous landscapes of Huoqiu. Here, the baseline RNet achieved a deceptively high Precision of 0.9199 but suffered from a disproportionately low IoU of 0.6959. This discrepancy suggests that simple architectures fail to capture structural integrity, leading to high omission errors. In contrast, the full MMDFRNet achieved a balanced IoU of 0.8456. This evidence underscores that the MMF module does not merely fuse data, but ensures structural completeness even in seemingly simple landscapes [24,49].
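The link between a high Precision, a low IoU, and omission errors follows from the metric definitions: because IoU = TP / (TP + FP + FN), the identity 1/IoU = 1/Precision + 1/Recall - 1 holds whenever all three metrics are computed from the same confusion counts. The sketch below uses that identity to back out the Recall implied by the RNet figures quoted above. The function name `implied_recall` is hypothetical, and the result is a derived estimate under that assumption, not a value reported in the paper (it would differ if the metrics were averaged per image).

```python
def implied_recall(precision, iou):
    """Recall implied by Precision and IoU via 1/IoU = 1/Precision + 1/Recall - 1."""
    inverse_recall = 1.0 / iou - 1.0 / precision + 1.0
    return 1.0 / inverse_recall


# RNet in Huoqiu, as quoted in the text: Precision 0.9199, IoU 0.6959.
recall = implied_recall(precision=0.9199, iou=0.6959)
print(f"implied Recall: {recall:.4f}")
print(f"implied omission rate: {1.0 - recall:.4f}")
```

Under this assumption the implied Recall is roughly 0.74, meaning that about a quarter of the rice pixels would be missed, which is consistent with the omission errors described above.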

In the context of regional performance variations, the integrated optimality of MMDFRNet is further evidenced by the balance between Precision and Recall. As shown in Table 3 and Table 9, while some baseline models (e.g., PSPNet) occasionally achieve high Precision, they often suffer from significantly lower IoU or F1-scores, suggesting a “conservative” prediction strategy that misses boundary details. In contrast, MMDFRNet maintains a stable lead in IoU and MCC across all four regions. This superiority stems from the MMF module’s ability to act as a dynamic decoupling gate, resolving the statistical disparity between heterogeneous modalities. By effectively suppressing SAR speckle noise and adaptively recalibrating unreliable optical spectral features, MMDFRNet ensures structural completeness without sacrificing classification accuracy, thus providing the most robust solution for large-scale mapping.

4.4. The Impact of Data Distribution Shifts on Generalization Mechanisms

Deep learning models rely heavily on the consistency of data distribution. Therefore, analyzing how the model adapts to different spatial patterns is crucial for assessing its generalization capability [7,15]. Our experiments in Table 9 reveal how MMDFRNet adapts to distinct spatial patterns compared to baselines. In this study, the test regions exhibit distinct characteristics: Qianjiang and Jianli are dominated by fragmented small-scale fields, whereas Huoqiu and Yongxiu feature large-scale cultivation patterns [58,59].

In fragmented regions (Qianjiang and Jianli), the complex boundaries pose a severe challenge to classic models. For instance, in Jianli, MMDFRNet outperformed other baselines significantly, improving Precision, IoU, F1-score, and MCC by margins of up to 15.10%, 13.81%, 9.61%, and 13.02%, respectively. This indicates that the proposed multi-scale and attention mechanisms are particularly effective in handling high-frequency spatial details found in fragmented landscapes [27].

In regular regions (Huoqiu and Yongxiu), by contrast, the data distribution is relatively simple. Even so, a performance gap between classic models and MMDFRNet remains, which can be attributed to the “conservative strategy” of classic architectures. In large-scale fields, classic models tend to predict only the most confident pixels in the center of fields to ensure high precision but often miss boundary areas [60]. In contrast, MMDFRNet balances Precision and Recall more effectively, maintaining high structural consistency in large fields while capturing fine details.

Crucially, further analysis reveals that the ASPP module is the cornerstone of this adaptability in large-scale regions. As shown in the ablation results (Table 8), removing the ASPP module (MMDFNet) caused the Precision in Yongxiu to plummet to 0.6548, substantially lower than the 0.8834 achieved by the full model. This sharp decline confirms that without the expanded receptive field provided by ASPP, the network fails to establish long-range dependencies, resulting in fragmented predictions within large rice plots. Consequently, the synergistic combination of MMF and ASPP enables the proposed model to maintain robust transferability across both fragmented and large-scale agricultural landscapes.
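The receptive-field argument can be made concrete with the standard relation for a dilated convolution: a kernel of size k with dilation rate d spans k + (k - 1)(d - 1) input pixels along each axis. The sketch below evaluates this for a 3 × 3 kernel. The dilation rates (1, 6, 12, 18) are assumed for illustration, following common ASPP configurations, and are not taken from MMDFRNet; the function name is likewise hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span (in input pixels, per axis) covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Assumed dilation rates (typical of ASPP designs); the paper's rates are not stated here.
spans = {d: effective_kernel_size(3, d) for d in (1, 6, 12, 18)}
for dilation, span in spans.items():
    print(f"dilation {dilation:>2}: 3x3 kernel spans {span} x {span} pixels")
```

Each dilated branch therefore covers a wider neighborhood with the same number of weights, which is the mechanism the paragraph above credits for coherent predictions inside large rice plots.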

4.5. Limitations and Prospects

Although the proposed method has achieved promising results, there are certain limitations. First, regarding agricultural generalization, the current study focused primarily on the rice–crayfish co-culture model in China. Validating the framework across other systems (e.g., single/double-cropping rice) and major international rice-producing regions remains a future objective. Furthermore, generalizing MMDFRNet to dryland crops (e.g., wheat or maize) would require recalibrating the bi-temporal strategy, as it is currently optimized for the unique “water-to-vegetation” transition of flooded paddies.

Second, from a technical perspective, the dual-stream architecture is highly sensitive to precise pixel-level co-registration between Sentinel-1 and Sentinel-2 imagery; spatial misalignment can severely degrade the cross-modal attention mechanism. Regarding data availability, while SAR compensates for optical gaps, potential failure cases may still arise under extreme, prolonged cloud cover that completely obscures the critical phenological windows. Finally, the parallel encoding and complex feature recalibration modules impose increased computational and memory requirements, posing challenges when scaling up to national-level mapping tasks. Future research will explore lightweight network adaptations and missing-data imputation techniques to further address these constraints and enhance the model’s global robustness.

5. Conclusions

In this study, we proposed a Multi-Modal Data Fusion Rice Mapping Network (MMDFRNet) to address the challenges of feature heterogeneity and asynchronous data quality. By integrating Sentinel-1 SAR and Sentinel-2 optical imagery through a dynamic decoupling and alignment architecture, the framework offers a robust solution for all-weather monitoring. Based on the experimental results, the main conclusions are summarized as follows:

Superior mapping accuracy and methodological robustness: MMDFRNet establishes a new benchmark for rice mapping, achieving a Precision of 0.9234 and an IoU of 0.8612. Crucially, our framework effectively reverses the “degradation paradox” observed in classic concatenation-based models. By employing adaptive recalibration, MMDFRNet leverages SAR data to boost IoU by 8.09 percentage points compared to its optical-only variant. Furthermore, it outperforms state-of-the-art paradigms, surpassing the Transformer-based UNetFormer and the time-series model STMA in both segmentation accuracy and boundary delineation.
Exceptional generalization and operational efficiency: The rigorous evaluation across four spatially distinct study areas confirms the model’s superior generalization capability. The synergistic combination of MMF and MSF modules enables the model to capture long-range dependencies and multi-scale features, ensuring high structural consistency in both fragmented smallholder plots and large-scale agricultural landscapes. Furthermore, the model maintains a high inference speed (6.0274 s per image), demonstrating that our mechanism-driven dual-stream design is not only theoretically rigorous but also highly efficient for practical, national-level crop mapping tasks in complex environments.

Author Contributions

Conceptualization, T.F.; methodology, T.F.; software, T.F.; validation, T.F.; formal analysis, T.F.; investigation, T.F.; resources, T.F.; data curation, T.F.; writing—original draft preparation, T.F.; writing—review and editing, T.F., J.G., and S.T.; visualization, T.F.; supervision, T.F. and J.G.; project administration, S.T. and J.G.; funding acquisition, S.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research and APC were funded by the National Mine Development and Ecological Space Monitoring and Evaluation in Key Areas, China University of Geosciences (Beijing), China (project no. DD20230100).

Data Availability Statement

The raw satellite data are publicly available from the Google Earth Engine platform (https://earthengine.google.com/ (accessed on 15 October 2024)). The manually annotated ground truth datasets generated during this study are available on request from the corresponding author. The core implementation of MMDFRNet is available in the GitHub repository (https://github.com/Daisybobo/MMDFRNet (accessible from 31 May 2026)).

Acknowledgments

We would like to thank the Google Earth Engine for providing the cloud processing platform and the ESA for providing the open-source Sentinel-1 and Sentinel-2 remote sensing image data.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Wu, B.; Zhang, M.; Zeng, H.; Tian, F.; Potgieter, A.B.; Qin, X.; Yan, N.; Chang, S.; Zhao, Y.; Dong, Q.; et al. Challenges and Opportunities in Remote Sensing-Based Crop Monitoring: A Review. Natl. Sci. Rev. 2023, 10, nwac290.
Bhandari, B.; Mayer, T. Comparing Deep Learning Models for Mapping Rice Cultivation Area in Bhutan Using High-Resolution Satellite Imagery. ISPRS Open J. Photogramm. Remote Sens. 2025, 15, 100084.
Weiss, M.; Jacob, F.; Duveiller, G. Remote Sensing for Agricultural Applications: A Meta-Review. Remote Sens. Environ. 2020, 236, 111402.
Hashemi, M.G.Z.; Jalilvand, E.; Alemohammad, H.; Tan, P.-N.; Das, N.N. Review of Synthetic Aperture Radar with Deep Learning in Agricultural Applications. ISPRS J. Photogramm. Remote Sens. 2024, 218, 20–49.
Chen, W.; Ouyang, S.; Tong, W.; Li, X.; Zheng, X.; Wang, L. GCSANet: A Global Context Spatial Attention Deep Learning Network for Remote Sensing Scene Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 1150–1162.
Chen, Q.; Kuang, G.; Li, J.; Sui, L.; Li, D. Unsupervised Land Cover/Land Use Classification Using PolSAR Imagery Based on Scattering Similarity. IEEE Trans. Geosci. Remote Sens. 2013, 51, 1817–1825.
Zhan, L.; Ye, P.; Fan, J.; Chen, T. U²ConvFormer: Marrying and Evolving Nested U-Net and Scale-Aware Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5517114.
Yuan, Y.; Lin, L.; Liu, Q.; Hang, R.; Zhou, Z.-G. SITS-Former: A Pre-Trained Spatio-Spectral-Temporal Representation Model for Sentinel-2 Time Series Classification. Int. J. Appl. Earth Obs. Geoinf. 2022, 106, 102651.
Wang, Y.; Feng, L.; Zhang, Z.; Tian, F. An Unsupervised Domain Adaptation Deep Learning Method for Spatial and Temporal Transferable Crop Type Mapping Using Sentinel-2 Imagery. ISPRS J. Photogramm. Remote Sens. 2023, 199, 102–117.
Xu, Y.; Zhou, J.; Zhang, Z. A New Bayesian Semi-Supervised Active Learning Framework for Large-Scale Crop Mapping Using Sentinel-2 Imagery. ISPRS J. Photogramm. Remote Sens. 2024, 209, 17–34.
Fan, L.; Xia, L.; Yang, J.; Sun, X.; Wu, S.; Qiu, B.; Chen, J.; Wu, W.; Yang, P. A Temporal-Spatial Deep Learning Network for Winter Wheat Mapping Using Time-Series Sentinel-2 Imagery. ISPRS J. Photogramm. Remote Sens. 2024, 214, 48–64.
Xia, L.; Zhao, F.; Chen, J.; Yu, L.; Lu, M.; Yu, Q.; Liang, S.; Fan, L.; Sun, X.; Wu, S.; et al. A Full Resolution Deep Learning Network for Paddy Rice Mapping Using Landsat Data. ISPRS J. Photogramm. Remote Sens. 2022, 194, 91–107.
Du, M.; Huang, J.; Wei, P.; Yang, L.; Chai, D.; Peng, D.; Sha, J.; Sun, W.; Huang, R. Dynamic Mapping of Paddy Rice Using Multi-Temporal Landsat Data Based on a Deep Semantic Segmentation Model. Agronomy 2022, 12, 1583.
Zhao, F.; Xia, L.; Kylling, A.; Li, R.Q.; Shang, H.; Xu, M. Detection Flying Aircraft from Landsat 8 OLI Data. ISPRS J. Photogramm. Remote Sens. 2018, 141, 176–184.
Fu, T.; Tian, S.; Ge, J. R-Unet: A Deep Learning Model for Rice Extraction in Rio Grande Do Sul, Brazil. Remote Sens. 2023, 15, 4021.
Zhong, L.; Hu, L.; Zhou, H. Deep Learning Based Multi-Temporal Crop Classification. Remote Sens. Environ. 2019, 221, 430–443.
Jiang, T.; Liu, X.; Wu, L. Method for Mapping Rice Fields in Complex Landscape Areas Based on Pre-Trained Convolutional Neural Network from HJ-1 A/B Data. ISPRS Int. J. Geo-Inf. 2018, 7, 418.
Yang, L.; Huang, R.; Zhang, J.; Huang, J.; Wang, L.; Dong, J.; Shao, J. Inter-Continental Transfer of Pre-Trained Deep Learning Rice Mapping Model and Its Generalization Ability. Remote Sens. 2023, 15, 2443.
Fu, T.; Tian, S.; Zhan, Q. Phenological Analysis and Yield Estimation of Rice Based on Multi-Spectral and SAR Data in Maha Sarakham, Thailand. J. Spat. Sci. 2024, 69, 149–165.
Yang, H.; Pan, B.; Li, N.; Wang, W.; Zhang, J.; Zhang, X. A Systematic Method for Spatio-Temporal Phenology Estimation of Paddy Rice Using Time Series Sentinel-1 Images. Remote Sens. Environ. 2021, 259, 112394.
Lasko, K.; Vadrevu, K.P.; Tran, V.T.; Justice, C. Mapping Double and Single Crop Paddy Rice with Sentinel-1A at Varying Spatial Scales and Polarizations in Hanoi, Vietnam. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 498–512.
Han, Z.; Zhang, C.; Gao, L.; Zeng, Z.; Zhang, B.; Atkinson, P.M. Spatio-Temporal Multi-Level Attention Crop Mapping Method Using Time-Series SAR Imagery. ISPRS J. Photogramm. Remote Sens. 2023, 206, 293–310.
Adrian, J.; Sagan, V.; Maimaitijiang, M. Sentinel SAR-Optical Fusion for Crop Type Mapping Using Deep Learning and Google Earth Engine. ISPRS J. Photogramm. Remote Sens. 2021, 175, 215–235.
Wei, P.; Huang, R.; Lin, T.; Huang, J. Rice Mapping in Training Sample Shortage Regions Using a Deep Semantic Segmentation Model Trained on Pseudo-Labels. Remote Sens. 2022, 14, 328.
Wei, P.; Chai, D.; Lin, T.; Tang, C.; Du, M.; Huang, J. Large-Scale Rice Mapping under Different Years Based on Time-Series Sentinel-1 Images Using Deep Semantic Segmentation Model. ISPRS J. Photogramm. Remote Sens. 2021, 174, 198–214.
Wei, P.; Chai, D.; Huang, R.; Peng, D.; Lin, T.; Sha, J.; Sun, W.; Huang, J. Rice Mapping Based on Sentinel-1 Images Using the Coupling of Prior Knowledge and Deep Semantic Segmentation Network: A Case Study in Northeast China from 2019 to 2021. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102948.
Cai, Z.; Hu, Q.; Zhang, X.; Yang, J.; Wei, H.; Wang, J.; Zeng, Y.; Yin, G.; Li, W.; You, L.; et al. Improving Agricultural Field Parcel Delineation with a Dual Branch Spatiotemporal Fusion Network by Integrating Multimodal Satellite Data. ISPRS J. Photogramm. Remote Sens. 2023, 205, 34–49.
Cai, Z.; Wei, H.; Hu, Q.; Zhou, W.; Zhang, X.; Jin, W.; Wang, L.; Yu, S.; Wang, Z.; Xu, B.; et al. Learning Spectral-Spatial Representations from VHR Images for Fine-Scale Crop Type Mapping: A Case Study of Rice-Crayfish Field Extraction in South China. ISPRS J. Photogramm. Remote Sens. 2023, 199, 28–39.
Li, Y.; Zhou, Y.; Zhang, Y.; Zhong, L.; Wang, J.; Chen, J. DKDFN: Domain Knowledge-Guided Deep Collaborative Fusion Network for Multimodal Unitemporal Remote Sensing Land Cover Classification. ISPRS J. Photogramm. Remote Sens. 2022, 186, 170–189.
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. arXiv 2015, arXiv:1505.04597.
Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. arXiv 2017, arXiv:1612.01105.
Wu, X.; Hong, D.; Chanussot, J. Convolutional Neural Networks for Multimodal Remote Sensing Data Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5517010.
Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214.
Lin, S.; Qi, Z.; Li, X.; Zhang, H.; Lv, Q.; Huang, D. A Phenological-Knowledge-Independent Method for Automatic Paddy Rice Mapping with Time Series of Polarimetric SAR Images. ISPRS J. Photogramm. Remote Sens. 2024, 218, 628–644.
Sainte Fare Garnot, V.; Landrieu, L.; Chehata, N. Multi-Modal Temporal Attention Models for Crop Mapping from Satellite Time Series. ISPRS J. Photogramm. Remote Sens. 2022, 187, 294–305.
Yang, J.; Hu, Q.; Li, W.; Song, Q.; Cai, Z.; Zhang, X.; Wei, H.; Wu, W. An Automated Sample Generation Method by Integrating Phenology Domain Optical-SAR Features in Rice Cropping Pattern Mapping. Remote Sens. Environ. 2024, 314, 114387.
Deng, J.; Hong, D.; Li, C.; Yokoya, N. Joint super-resolution and segmentation for 1-m impervious surface area mapping in China’s Yangtze River economic belt. arXiv 2025, arXiv:2505.05367.
Li, X.; Li, C.; Ghamisi, P.; Hong, D. FlexiMo: A Flexible Remote Sensing Foundation Model. arXiv 2025, arXiv:2503.23844.
Li, H.; Qiu, K.; Chen, L.; Mei, X.; Hong, L.; Tao, C. SCAttNet: Semantic Segmentation Network with Spatial and Channel Attention Mechanism for High-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2021, 18, 905–909.
Ghamisi, P.; Rasti, B.; Yokoya, N.; Wang, Q.; Hofle, B.; Bruzzone, L.; Bovolo, F.; Chi, M.; Anders, K.; Gloaguen, R.; et al. Multisource and Multitemporal Data Fusion in Remote Sensing: A Comprehensive Review of the State of the Art. IEEE Geosci. Remote Sens. Mag. 2019, 7, 6–39.
Liu, C.; Sun, Y.; Xu, Y.; Sun, Z.; Zhang, X.; Lei, L.; Kuang, G. A Review of Optical and SAR Image Deep Feature Fusion in Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 12910–12930.
Wang, H.; Liu, X.; Qiao, Z.; Wang, G.; Chen, H. Multimodal Remote Sensing Data Classification Based on Gaussian Mixture Variational Dynamic Fusion Network. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5621214.
Huang, Y.; Wang, Z.; Tang, T.; Ohtsuki, T.; Gui, G. Dual-Stream Multimodal Fusion with Local–Global Attention for Remote-Sensing Object Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2026, 19, 1691–1702.
Cheng, B.; Xu, B.; Deng, Q.; Shen, T. MIFNet: Multi-Modal Interactive Fusion Network For Remote Sensing Semantic Segmentation. In Proceedings of the IGARSS 2025—2025 IEEE International Geoscience and Remote Sensing Symposium; IEEE: New York, NY, USA, 2025; pp. 6980–6984.
Schmitt, M.; Zhu, X.X. Data Fusion and Remote Sensing: An Ever-Growing Relationship. IEEE Geosci. Remote Sens. Mag. 2016, 4, 6–23.
Schmitt, M.; Tupin, F.; Zhu, X.X. Fusion of SAR and Optical Remote Sensing Data—Challenges and Recent Trends. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS); IEEE: New York, NY, USA, 2017; pp. 5458–5461. [Google Scholar]
Rußwurm, M.; Courty, N.; Emonet, R.; Lefèvre, S.; Tuia, D.; Tavenard, R. End-to-End Learned Early Classification of Time Series for in-Season Crop Type Mapping. ISPRS J. Photogramm. Remote Sens. 2023, 196, 445–456. [Google Scholar] [CrossRef]
Seong, S.; Chang, A.; Mo, J.; Na, S.; Ahn, H.; Oh, J.; Choi, J. Crop Classification in South Korea for Multitemporal PlanetScope Imagery Using SFC-DenseNet-AM. Int. J. Appl. Earth Obs. Geoinf. 2024, 126, 103619. [Google Scholar] [CrossRef]
Hong, D.; Gao, L.; Yokoya, N.; Yao, J.; Chanussot, J.; Du, Q.; Zhang, B. More Diverse Means Better: Multimodal Deep Learning Meets Remote-Sensing Imagery Classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 4340–4354. [Google Scholar] [CrossRef]
Hang, R.; Li, Z.; Ghamisi, P.; Hong, D.; Xia, G.; Liu, Q. Classification of Hyperspectral and LiDAR Data Using Coupled CNNs. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4939–4950. [Google Scholar] [CrossRef]
Zhang, P.; Ke, Y.; Zhang, Z.; Wang, M.; Li, P.; Zhang, S. Urban Land Use and Land Cover Classification Using Novel Deep Learning Models Based on High Spatial Resolution Satellite Imagery. Sensors 2018, 18, 3717. [Google Scholar] [CrossRef]
Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2017, arXiv:1608.03983. [Google Scholar] [CrossRef]
Zhang, L.; Chen, X.; Zhang, J.; Dong, R.; Ma, K. Contrastive deep supervision. arXiv 2022, arXiv:2207.05306. [Google Scholar] [CrossRef]
Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A Deeply Supervised Image Fusion Network for Change Detection in High Resolution Bi-Temporal Remote Sensing Images. ISPRS J. Photogramm. Remote Sens. 2020, 166, 183–200. [Google Scholar] [CrossRef]
Jiang, N.; Li, P.; Feng, Z. Remote Sensing of Swidden Agriculture in the Tropics: A Review. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102876. [Google Scholar] [CrossRef]
Veloso, A.; Mermoz, S.; Bouvet, A.; Le Toan, T.; Planells, M.; Dejoux, J.-F.; Ceschia, E. Understanding the Temporal Behavior of Crops Using Sentinel-1 and Sentinel-2-like Data for Agricultural Applications. Remote Sens. Environ. 2017, 199, 415–426. [Google Scholar] [CrossRef]
Liu, L.; Xiao, X.; Qin, Y.; Wang, J.; Xu, X.; Hu, Y.; Qiao, Z. Mapping Cropping Intensity in China Using Time Series Landsat and Sentinel-2 Images and Google Earth Engine. Remote Sens. Environ. 2020, 239, 111624. [Google Scholar] [CrossRef]
Zhang, G.; Xiao, X.; Dong, J.; Kou, W.; Jin, C.; Qin, Y.; Zhou, Y.; Wang, J.; Menarguez, M.A.; Biradar, C. Mapping Paddy Rice Planting Areas through Time Series Analysis of MODIS Land Surface Temperature and Vegetation Index Data. ISPRS J. Photogramm. Remote Sens. 2015, 106, 157–171. [Google Scholar] [CrossRef]
Zhu, Y.; Pan, Y.; Zhang, D.; Wu, H.; Zhao, C. A Deep Learning Method for Cultivated Land Parcels (CLPs) Delineation from High-Resolution Remote Sensing Images with High-Generalization Capability. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4410525. [Google Scholar] [CrossRef]

Figure 1. Locations of the study areas. (a) Geographic distribution of the four study areas; the three colors mark the provinces in which they are located: Hubei, Anhui, and Jiangxi. (b,d,f,h) Locations of Qianjiang City and Jianli City in Hubei Province, Yongxiu County in Jiangxi Province, and Huoqiu County in Anhui Province, respectively. Red boxes mark the model training and validation areas, and yellow boxes mark the test areas. (c,e,g,i) Ground truth labels for the corresponding regions.

Figure 2. The architecture of MMDFRNet.

Figure 3. The architecture of MMF.

Figure 4. The architecture of SFE.

Figure 5. The architecture of OFE.

Figure 6. Architectures of the four ablation variants. (a) MMFNet (w/o MSF): Constructed by removing the multi-scale feature fusion module. (b) MSFNet (w/o MMF): Constructed by removing the multi-modal feature fusion module. (c) MMDFNet (w/o ASPP): Formed by replacing the dilated convolutions in ASPP with standard convolutions. (d) RNet (Baseline): The reference model using simple channel concatenation without MMF and MSF modules.

Figure 7. The architectures of the single-modality variants constructed for modality necessity analysis. (a) MMDFRNet (SAR-only): Utilizes only Sentinel-1 data fed into the SAR Feature Encoder (SFE). (b) MMDFRNet (optical-only): Utilizes only Sentinel-2 data fed into the Optical Feature Encoder (OFE).

Figure 8. (a) Training and validation loss values of MMDFRNet. (b) The validation metrics of MMDFRNet.

Figure 9. The results of classic and SOTA models. (a–d) represent results from four sub-regions within the Qianjiang study area.

Figure 10. The evaluation metrics of the ablation models.

Figure 11. The results of the ablation experiment. (a–c) represent results from three sub-regions within the Qianjiang study area.

Figure 12. The visual comparison of U-Net performance under different input configurations. (a–c) represent results from three sub-regions within the Qianjiang study area.

Figure 13. The results of model adaptability. (a–c) show representative results from the Jianli, Huoqiu, and Yongxiu study areas, respectively.

Table 1. Acquisition periods of the Sentinel-1 and Sentinel-2 data for each study area.

Dataset	Sentinel-1	Sentinel-2
Qianjiang	April to September 2023	April and August 2023
Jianli		March and August 2023
Huoqiu		May and August 2023
Yongxiu	May to October 2023	May and October 2023

Table 2. The number of patches for each study area.

Number of Patches	Training and Validation	Test
Qianjiang	2025	539
Jianli	1125	343
Huoqiu	450	196
Yongxiu	675	196

Table 3. The quantitative comparison with classic and SOTA models in Qianjiang.

Model	Precision	IoU	F1-Score	MCC
U-Net	0.8062	0.7873	0.8800	0.8200
PSPNet	0.7535	0.7121	0.8295	0.7447
R-Unet	0.7610	0.7503	0.8554	0.7873
CCRNet	0.8775	0.8342	0.9094	0.8314
UNetFormer	0.8632	0.8207	0.9015	0.8413
STMA	0.8491	0.8091	0.8945	0.8299
MMDFRNet	0.9234	0.8612	0.9252	0.8879
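
As an illustration of the four columns in Table 3, the following minimal Python sketch derives Precision, IoU, F1-Score, and MCC from pixel-level confusion counts. It assumes the standard binary definitions of these metrics; the function name `binary_metrics` and the toy counts are hypothetical, and the averaging scheme behind the tabulated values (per patch or per region) is not stated in this section.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Standard binary segmentation metrics from confusion-matrix counts.

    tp/fp/fn/tn are pixel counts with rice as the positive class.
    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # MCC uses all four cells, so it also reflects background (tn) accuracy.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "iou": iou, "f1": f1, "mcc": mcc}


# Toy counts (assumed, not taken from the paper).
scores = binary_metrics(tp=80, fp=10, fn=10, tn=900)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

For a single confusion matrix, F1 and IoU are tied by F1 = 2·IoU/(1 + IoU); values averaged over many patches need not satisfy this relation exactly.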

Table 4. The inference time for each image.

Model	U-Net	PSPNet	R-Unet	CCRNet	UNetFormer	STMA	MMDFRNet
Time/s	0.1649	0.1488	0.099	1.3929	6.4432	6.7643	6.0274

Table 5. The quantitative comparison of MMDFRNet performance using different input modalities.

Region	Input	Precision	IoU	F1-Score	MCC
Qianjiang	Optical-only	0.8091	0.7803	0.8766	0.8014
	SAR-only	0.7107	0.6618	0.7965	0.6658
	Optical-SAR	0.9234	0.8612	0.9252	0.8879
Yongxiu	Optical-only	0.7687	0.7469	0.8551	0.7712
	SAR-only	0.8537	0.7438	0.8531	0.7698
	Optical-SAR	0.8834	0.8465	0.9160	0.8612

Table 6. The quantitative verification of performance degradation in the U-Net using different input modalities.

U-Net	Precision	IoU	F1-Score	MCC
Optical-only	0.8373	0.8044	0.8916	0.8255
SAR-only	0.6888	0.6444	0.7838	0.6441
Optical-SAR	0.8062	0.7873	0.8800	0.8200
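
The contrast between Tables 5 and 6 can be made explicit with a short arithmetic sketch. It copies the Qianjiang IoU values for MMDFRNet (Table 5) and for U-Net (Table 6, whose Optical-SAR row matches the U-Net row of Table 3) and computes the change from the best single-modality input to the fused input. The helper name `fusion_gain` and the dictionary layout are illustrative assumptions.

```python
# IoU values copied from Table 5 (MMDFRNet, Qianjiang) and Table 6 (U-Net).
iou = {
    "MMDFRNet": {"optical": 0.7803, "sar": 0.6618, "fused": 0.8612},
    "U-Net": {"optical": 0.8044, "sar": 0.6444, "fused": 0.7873},
}


def fusion_gain(scores):
    """IoU change when moving from the best single modality to Optical-SAR."""
    best_single = max(scores["optical"], scores["sar"])
    return round(scores["fused"] - best_single, 4)


gains = {model: fusion_gain(scores) for model, scores in iou.items()}
for model, gain in gains.items():
    trend = "improves" if gain > 0 else "degrades"
    print(f"{model}: fused input {trend} IoU by {gain:+.4f}")
```

A negative gain corresponds to the degradation reported for concatenation-based fusion in U-Net, while a positive gain indicates that the fused input outperforms either modality alone.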

Table 7. The accuracy of the MMDFRNet model in different regions.

Region	Precision	IoU	F1-Score	MCC
Qianjiang	0.9234	0.8612	0.9252	0.8879
Jianli	0.8665	0.8129	0.8963	0.8501
Huoqiu	0.8848	0.8456	0.9159	0.7707
Yongxiu	0.8834	0.8465	0.9160	0.8612

Table 8. The model transferability and ablation experiment results.

Region	Model	Precision	IoU	F1-Score	MCC
Qianjiang	MMDFRNet	0.9234	0.8612	0.9252	0.8879
	MMFNet	0.8914	0.8311	0.9075	0.8589
	MSFNet	0.8548	0.8152	0.8974	0.8449
	MMDFNet	0.8102	0.7703	0.8676	0.7998
	RNet	0.8061	0.7848	0.8734	0.8010
Jianli	MMDFRNet	0.8665	0.8129	0.8963	0.8501
	MMFNet	0.8247	0.7476	0.8538	0.7905
	MSFNet	0.8404	0.7812	0.8767	0.8190
	MMDFNet	0.7742	0.7330	0.8442	0.7810
	RNet	0.8451	0.7543	0.859	0.7917
Huoqiu	MMDFRNet	0.8848	0.8456	0.9159	0.7707
	MMFNet	0.9173	0.8405	0.9121	0.7775
	MSFNet	0.8578	0.8115	0.8950	0.7092
	MMDFNet	0.8896	0.8259	0.9033	0.7519
	RNet	0.9199	0.6959	0.8045	0.6450
Yongxiu	MMDFRNet	0.8834	0.8465	0.9160	0.8612
	MMFNet	0.8694	0.8285	0.9048	0.8443
	MSFNet	0.8061	0.7848	0.8734	0.8010
	MMDFNet	0.6548	0.6496	0.7644	0.6883
	RNet	0.8138	0.7484	0.8486	0.7945

Table 9. The model adaptability results compared with the classic and SOTA models.

Region	Model	Precision	IoU	F1-Score	MCC
Jianli	U-Net	0.8372	0.7668	0.8665	0.8079
	PSPNet	0.8234	0.7068	0.8249	0.7546
	R-Unet	0.8297	0.6921	0.8085	0.7447
	CCRNet	0.8400	0.7448	0.8525	0.7876
	UNetFormer	0.8667	0.7446	0.8536	0.7902
	STMA	0.8284	0.7947	0.8856	0.8340
	MMDFRNet	0.8665	0.8129	0.8963	0.8501
Huoqiu	U-Net	0.8570	0.8051	0.8847	0.7477
	PSPNet	0.9088	0.8099	0.8938	0.7366
	R-Unet	0.8792	0.7979	0.8781	0.7516
	CCRNet	0.8966	0.8327	0.9081	0.7584
	UNetFormer	0.8595	0.8160	0.8987	0.7799
	STMA	0.8924	0.8152	0.8982	0.7847
	MMDFRNet	0.8848	0.8456	0.9159	0.7707
Yongxiu	U-Net	0.9370	0.8032	0.8863	0.8337
	PSPNet	0.9265	0.789	0.8781	0.8189
	R-Unet	0.9315	0.8042	0.8871	0.834
	CCRNet	0.8605	0.8244	0.8982	0.8325
	UNetFormer	0.8417	0.7660	0.8675	0.8260
	STMA	0.8698	0.8311	0.9078	0.8538
	MMDFRNet	0.8834	0.8465	0.9160	0.8612

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
