Article

Transformer-Based Dual-Branch Spatial–Temporal–Spectral Feature Fusion Network for Paddy Rice Mapping

1 College of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, China
2 Fujian Key Laboratory of Pattern Recognition and Image Understanding, Xiamen University of Technology, Xiamen 361024, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(12), 1999; https://doi.org/10.3390/rs17121999
Submission received: 22 April 2025 / Revised: 3 June 2025 / Accepted: 7 June 2025 / Published: 10 June 2025

Abstract

Deep neural network fusion approaches utilizing multimodal remote sensing are essential for crop mapping. However, challenges such as insufficient spatiotemporal feature extraction and ineffective fusion strategies still exist, leading to a decrease in mapping accuracy and robustness when these approaches are applied across spatial–temporal regions. In this study, we propose a novel rice mapping approach based on dual-branch transformer fusion networks, named RDTFNet. Specifically, we implemented a dual-branch encoder that is based on two improved transformer architectures. One is a multiscale transformer block used to extract spatial–spectral features from a single-phase optical image, and the other is a Restormer block used to extract spatial–temporal features from time-series synthetic aperture radar (SAR) images. Both extracted features were then fed into a feature fusion module (FFM) to generate fully fused spatial–temporal–spectral (STS) features, which were finally passed to the decoder of the U-Net structure for rice mapping. The model’s performance was evaluated through experiments with the Sentinel-1 and Sentinel-2 datasets from the United States. Compared with conventional models, the RDTFNet model achieved the best performance, with an overall accuracy (OA), intersection over union (IoU), precision, recall and F1-score of 96.95%, 88.12%, 95.14%, 92.27% and 93.68%, respectively. The comparative results show that the OA, IoU, precision, recall and F1-score improved by 1.61%, 5.37%, 5.16%, 1.12% and 2.53%, respectively, over those of the baseline model, demonstrating its superior performance for rice mapping. Furthermore, in subsequent cross-regional and cross-temporal tests, RDTFNet outperformed other classical models, achieving improvements of 7.11% and 12.10% in F1-score, and 11.55% and 18.18% in IoU, respectively. These results further confirm the robustness of the proposed model. Therefore, the proposed RDTFNet model can effectively fuse STS features from multimodal images and exhibit strong generalization capabilities, providing valuable information to support government agricultural management.

1. Introduction

Although global food production has increased over the past several decades, a series of new challenges has emerged, and the number of people experiencing hunger has continued to rise [1,2,3,4]. Accurate knowledge of the types and distributions of crops grown in a region is essential to ensure food security. Remote sensing (RS) technology plays a crucial role in the extraction and monitoring of crop areas [5,6,7,8]. By interpreting RS images, particularly optical satellite and synthetic aperture radar (SAR) images, we can obtain timely and accurate information on crop planting areas, providing essential data support for agricultural production. Furthermore, farmland monitoring based on remote sensing has become fundamental for understanding food security, analyzing changes in crop planting areas, and formulating agricultural policies. Therefore, the efficient utilization of the rich spectral features and spatial–temporal contextual information provided by multimodal RS satellites is essential for agricultural research, such as rice mapping.
Despite significant advancements in deep learning (DL) for remote sensing (RS) image interpretation, current rice mapping approaches predominantly rely on convolutional neural networks (CNNs), such as U-Net [9], SegNet [10], and fully convolutional networks (FCNs) [11], which emphasize spatial feature extraction (e.g., edges, shapes, and positional context) through multiscale feature extraction modules and attention mechanisms. For instance, Fan et al. [12] enhanced U-Net with an atrous spatial pyramid pooling (ASPP) module and a spatial attention module (SAM), achieving 83.21% accuracy in rice classification. However, these models usually suffer from a critical limitation: they focus solely on spatial–spectral information while largely ignoring the temporal phenological dynamics inherent to rice growth cycles. This oversight undermines model robustness in two key ways: (1) Spectral ambiguity: single-temporal models struggle to distinguish crops with similar spectral signatures at specific growth stages. (2) Poor spatiotemporal generalization: models trained in one region often fail when applied to other areas or timeframes due to variability in farmland fragmentation, climate, and cultivation practices. Thus, the primary challenge lies in developing a versatile DL framework capable of integrating spatial–temporal–spectral (STS) features from multimodal RS data. Such a framework must reconcile spatial detail with temporal phenological patterns to improve cross-regional adaptability and classification accuracy.
In order to effectively extract crop phenological characteristics, early approaches relied on recurrent neural networks (RNNs) and related sequence models, such as Long Short-Term Memory (LSTM) networks [13] and one-dimensional convolutional networks (Conv1D) [14]. More recently, breakthroughs in transformer-based architectures have revolutionized temporal sequence modeling, owing to their superior ability to capture long-range dependencies via self-attention mechanisms. Originally developed for natural language processing (NLP) [15], transformers have demonstrated remarkable success in identifying subtle yet critical phenological variations in remote sensing (RS) time-series data. For instance, Garnot et al. [16] introduced a temporal self-attention network that employs a lightweight multi-head attention mechanism to encode image sequences, achieving state-of-the-art performance in multi-crop semantic segmentation tasks.
Beyond temporal modelling, transformers have also reshaped computer vision (CV) by enabling global context modeling in high-dimensional RS imagery. Vision Transformer (ViT), for example, leverages multi-head self-attention to establish long-range spatial dependencies—a capability inherently limited in CNNs due to their localized receptive fields. While transformers excel at capturing global spatiotemporal patterns, their tendency to overlook fine-grained local features poses a significant challenge, particularly in high-resolution optical RS images. Notably, prior studies (e.g., Fan et al., 2022) empirically confirmed that local spatial details remain indispensable for accurate semantic segmentation [12]. Thus, a critical research gap persists, i.e., how to synergize the strengths of CNNs (local feature extraction) and transformers (global sequence modelling) to enhance spatial–temporal–spectral (STS) fusion.
Recent advances in architectural design have yielded several hybrid CNN-transformer models (e.g., ResT [17], BoTNet [18], CoAtNet [19], TransUNet [20]) that typically employ serial architectures where transformer modules replace spatial convolutions in CNN backbones. While these approaches demonstrate promising results in general computer vision tasks, three critical limitations emerge when applied to multimodal RS for rice mapping: (1) Modality-Specific Feature Extraction: The serial architecture’s sequential processing fails to adequately preserve modality-specific characteristics in multimodal RS data (optical/SAR). Parallel architectures with dedicated branches for each modality have demonstrated superior performance in maintaining these distinctive features [21,22,23]. (2) Scale-Sensitive Feature Representation: Complete replacement of spatial convolutions with global self-attention risks losing fine-grained spatial details essential for precise rice identification. A balanced spatial multiscale mechanism is required to simultaneously capture both global context and local discriminative features. (3) Cross-Modal Feature Fusion: Current architectures lack specialized modules for effectively fusing optical and SAR features, which is crucial for robust STS representation. A dedicated cross-modal fusion mechanism could significantly enhance the model’s ability to integrate complementary information from different sensor modalities. These limitations highlight the need for novel architectural designs that (i) implement parallel processing streams to preserve modality-specific features, (ii) incorporate multiscale mechanisms to balance global and local feature extraction, and (iii) include specialized fusion modules for optimal STS feature integration in agricultural remote sensing applications.
In summary, despite significant advances in multimodal remote sensing for rice mapping, existing hybrid transformer-CNN models, such as TransUNet, CoAtNet, and TFBS, still face substantial limitations in effectively fusing spatiotemporal–spectral (STS) features from heterogeneous data sources. These models typically adopt serial architectures, where convolutional and transformer layers are stacked sequentially, limiting their ability to retain and exploit modality-specific characteristics in optical and SAR inputs. In contrast, we propose RDTFNet (Rice Dual-branch Transformer Fusion Network), a novel parallel dual-branch architecture that preserves modality-specific features while enabling cross-modal interaction for enhanced rice mapping. RDTFNet advances multimodal learning through three key innovations. First, it standardizes heterogeneous optical and SAR time series into a unified image representation, facilitating seamless STS feature sharing and avoiding complex modality-dependent preprocessing. Second, the architecture employs specialized attention mechanisms for each modality: the temporal SAR branch utilizes Restormer blocks [24] with cross-channel self-attention to efficiently extract spatiotemporal features, while the spatial optical branch incorporates multiscale transformer blocks [25] to capture both global context and local spectral–spatial details through hierarchical attention. This is the first application of Restormer to SAR time-series modeling in remote sensing and crop phenology. The module’s cross-channel attention effectively captures long-range dependencies critical for detecting rice growth stages. Although Yang et al. [26] and Fu et al. [27] previously explored temporal modeling through channel interactions, the Restormer introduces a more efficient and expressive architecture, marking a significant advancement. Third, RDTFNet introduces an adaptive feature fusion module (FFM) that dynamically combines optical spatial–spectral and SAR temporal–spatial features at multiple network depths, preserving modality-specific traits while enhancing complementary integration. These innovations together enable RDTFNet to achieve superior feature discriminability for rice mapping with high computational efficiency, providing a robust solution for operational agricultural monitoring across diverse landscapes. This integrated framework bridges key gaps in STS feature fusion and attention design, showing strong potential for advancing precision agriculture.
To validate the proposed model, we conducted experiments in the south-central United States and northern California, and the results were compared with those of several conventional models. Section 2 presents the experimental data and data processing methods and introduces the proposed RDTFNet model. Section 3 details the experimental results. Section 4 provides a discussion of these results, and Section 5 concludes this study.

2. Materials and Methods

2.1. Study Area

The study areas are located in two regions of the United States: the south-central region (89.52°W–92.21°W, 34.17°N–37.03°N) and northern California (121.11°W–122.56°W, 38.52°N–40.17°N), referred to as Study Area A and Study Area B, respectively. The two areas cover approximately 78,623 km2 and 31,084 km2, as shown in Figure 1. Study Area A encompasses regions of Arkansas (AR), southern Missouri (MO), western Tennessee (TN), and northwestern Mississippi (MS). It is divided into training, validation, and testing subsets. The testing subset (Test Area A), highlighted in orange, is located in eastern Arkansas and is used to evaluate the baseline performance of the model. Study Area B is located in the Sacramento Valley (SV), and its data are entirely independent of the training and hyperparameter tuning processes, serving to assess the model’s spatiotemporal generalization capabilities. Study Area A is characterized by a humid subtropical climate with evenly distributed precipitation. It mainly cultivates long-grain indica rice and accounts for approximately 76% of total U.S. rice production [26]. The agricultural plots are spatially scattered and intercropped with various crop types, providing diverse features beneficial for model learning. Study Area B has a Mediterranean climate, where rice cultivation relies on artificial irrigation. The fields are more spatially concentrated and exhibit lower crop diversity, with medium-grain japonica rice as the dominant type, accounting for about 20% of the national rice production. The distinct differences in topography, climate, and cropping patterns between the two regions provide an ideal comparative setting for evaluating the model’s cross-regional generalization performance.
In the south-central United States, rice has a growth cycle from germination to maturity that ranges from 105 to 145 days, depending on the variety and environmental conditions [26]. In Study Area A, rice is typically planted between April and May, with a planting period of 3 to 5 weeks, and harvested between September and October [28,29]. Additionally, rice fields in Study Area B were usually irrigated or aerially seeded in May and harvested from September to October [30].
Therefore, considering differences in land use types, cropping structures, and growth cycles, Study Area A is well suited for training and validating rice extraction models, while Study Area B serves as a testbed for evaluating model adaptability under diverse environmental conditions. This combination of regions enables comprehensive evaluation of both the baseline performance and the spatiotemporal generalization capabilities of the algorithm.

2.2. Remote Sensing Data Preparation

2.2.1. Sentinel-1 Time-Series SAR Images

Sentinel-1 SAR data are among the most widely used sources for rice mapping, especially in cloud-prone subtropical regions, as they can effectively mitigate the impact of missing optical data. Sentinel-1 was the first satellite launched under the European Space Agency’s (ESA) Copernicus Program [31]. It consists of two polar-orbiting satellites, Sentinel-1A and Sentinel-1B, which share the same orbit [32]. In this study, dual-polarization time-series data in interferometric wide-swath (IW) mode, including vertical transmit–vertical receive (VV) and vertical transmit–horizontal receive (VH) polarizations, were obtained. The time series, covering the primary rice-growing season, extends from April to November 2019. The collected dual-polarization SAR images were preprocessed via the Google Earth Engine (GEE) platform, including calibration and terrain correction. To reduce speckle noise and emphasize phenological information, a 24-day average composite time series of the dual-polarized backscatter coefficients for the rice-growing season was generated, as shown in Figure 2. This strategy is suitable for SAR data, where speckle is inherent. The rice growth cycle includes about nine phenological stages, each lasting 20–30 days. A longer compositing interval may lose important phenological transitions, while a shorter one may cause redundancy. Thus, a 24-day window offers a balance between noise suppression and phenological detail retention. This design also aligns with Yang et al. [26], ensuring consistency and comparability. Two sets of nine-band time-series composites were obtained for VV and VH polarizations, totaling eighteen bands. Each band represents a 24-day average composite, as detailed in Table 1. The time-series images were then resampled to a resolution of 30 m to ensure consistency with the CDL data.
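As an illustration of the compositing strategy (not the authors’ exact GEE workflow), the following NumPy sketch averages a preprocessed backscatter stack into nine 24-day windows; the array layout, function name, and start date are assumptions.

```python
import numpy as np

def composite_24day(stack, dates, start, n_windows=9, window_days=24):
    """Average a time-series SAR stack into fixed 24-day composites.

    stack : (T, H, W) backscatter array for one polarization (VV or VH)
    dates : length-T array of acquisition dates (numpy datetime64)
    start : first day of the compositing period (e.g. "2019-04-01")
    Returns an (n_windows, H, W) array; windows with no acquisitions are NaN.
    """
    offsets = (dates - np.datetime64(start)) / np.timedelta64(1, "D")
    composites = np.full((n_windows,) + stack.shape[1:], np.nan, dtype=np.float32)
    for w in range(n_windows):
        mask = (offsets >= w * window_days) & (offsets < (w + 1) * window_days)
        if mask.any():
            composites[w] = stack[mask].mean(axis=0)  # 24-day mean backscatter
    return composites

# Example: nine VV and nine VH composites stacked into the 18-band SAR input.
# sar18 = np.concatenate([composite_24day(vv, dates, "2019-04-01"),
#                         composite_24day(vh, dates, "2019-04-01")], axis=0)
```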

2.2.2. Sentinel-2 Multispectral Image

Optical data were obtained from Sentinel-2, a series of Earth observation satellites launched by the European Space Agency (ESA) beginning in 2015. Each Sentinel-2 satellite is equipped with a multispectral instrument (MSI), its primary optical sensor, designed to capture multispectral RS data from the Earth’s surface. MSI data from Sentinel-2, with less than 5% cloud cover, were collected from April 2019 to November 2019 via the Google Earth Engine (GEE) platform. Cloud masking was performed on all the images via the GEE [33]. According to Formulas (1)–(3), the optical indices of rice, including the normalized difference vegetation index (NDVI) [34], enhanced vegetation index (EVI) [35], and land surface water index (LSWI) [36], were calculated through band operations. Finally, the median values of the three vegetation indices during the rice-growing period were computed, resulting in three index bands.
$$\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{RED}}{\mathrm{NIR} + \mathrm{RED}} \tag{1}$$
$$\mathrm{EVI} = 2.5 \times \frac{\mathrm{NIR} - \mathrm{RED}}{\mathrm{NIR} + 6 \times \mathrm{RED} - 7.5 \times \mathrm{BLUE} + 1} \tag{2}$$
$$\mathrm{LSWI} = \frac{\mathrm{NIR} - \mathrm{SWIR}}{\mathrm{NIR} + \mathrm{SWIR}} \tag{3}$$
Here, BLUE, RED, NIR, and SWIR correspond to Sentinel-2 bands 2, 4, 8, and 11, respectively.
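For reference, a minimal NumPy sketch of Equations (1)–(3) is given below, assuming the four Sentinel-2 bands have already been extracted as co-registered reflectance arrays; the function name and the small epsilon guard are illustrative additions.

```python
import numpy as np

def rice_indices(blue, red, nir, swir, eps=1e-6):
    """Compute NDVI, EVI, and LSWI from Sentinel-2 surface-reflectance bands
    (B2, B4, B8, B11), each passed as a 2D array scaled to reflectance [0, 1]."""
    ndvi = (nir - red) / (nir + red + eps)
    evi = 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0 + eps)
    lswi = (nir - swir) / (nir + swir + eps)
    return ndvi, evi, lswi

# Per-scene indices are then reduced to their growing-season median, e.g.
# np.nanmedian(np.stack(ndvi_list), axis=0), giving the three optical bands.
```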

2.2.3. The Ground Truth Data

The Cropland Data Layer (CDL) is an annual 30 m resolution land cover map focused on crops that covers the contiguous United States and serves as reference data for model training and testing [37,38]. It was created by the USDA’s National Agricultural Statistics Service (NASS) and is based on comprehensive agricultural ground truth data and medium-resolution satellite images such as Landsat-8, Sentinel-2, and Deimos images.
CDL products from 2019 and 2021 were used in this study. CDL metadata provide state-specific accuracy for rice. The kappa values for rice in the study areas were 91.55% in 2019 and 88.96% in 2021. The user accuracies for these years were 97.42% and 97.27%, whereas the producer accuracies were 91.90% and 91.04%, respectively. The high precision of the CDL data confirms their reliability for training and testing deep learning models [39,40,41].
It is acknowledged that the CDL is not derived from direct field surveys and, therefore, carries inherent uncertainties. Although the accuracy for rice in the study areas is relatively high (kappa values of 91.55% in 2019 and 88.96% in 2021), it does not reach 100%. Nevertheless, the primary goal of this study is not to produce an absolute classification map but to evaluate the rice extraction capability of different models under a consistent and widely accepted benchmark. The CDL has been extensively used in previous studies as a reference for model performance comparison, making it a suitable choice for this research. While the 30 m spatial resolution may lead to classification errors at field boundaries or in small, irregular plots, our analysis shows that small fields (0.06–0.13 km2) account for only 12.4% of Study Area A and 6.9% of Study Area B. Therefore, the potential impact of resolution limitations is limited, and the CDL remains a reliable reference for model evaluation in this context.

2.2.4. Training, Validation and Test Samples

In this study, tenfold cross-validation was employed, and the average result was used as the final estimate. To comprehensively evaluate the model’s performance, the study area was divided into two regions across two different years to assess both its predictive accuracy and spatiotemporal generalization capability. Study Area A in 2019 was selected as the training, validation, and testing region for rice extraction. The training and validation sets included 1333 samples of 256 × 256 pixels, each corresponding to a spatial extent of 7.68 × 7.68 km2 at 30 m resolution. All samples were generated based on the 30 m pixel size of the Cropland Data Layer (CDL), which also served as the reference label. To ensure spatial consistency across data sources, both Sentinel-1 SAR data (originally at 10 m resolution) and Sentinel-2 optical imagery (originally at 10–20 m resolution) were resampled to 30 m. This harmonization enabled accurate pixel-level alignment with the CDL and facilitated multimodal feature fusion. Test Area A consisted of 108 samples of the same size, used to evaluate the model’s baseline performance. Test Area B, corresponding to Study Area B in 2019 and 2021, included a total of 468 samples. These were used to assess the model’s generalization ability across different geographic locations and time periods. Table 2 shows the dataset’s information.
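The patch-generation step can be sketched as follows, assuming the 18-band SAR composites and 3-band optical indices have already been stacked and resampled to the 30 m CDL grid; the function name is illustrative, and rice is taken as class value 3 in the CDL legend.

```python
import numpy as np

def tile_patches(image, label, patch=256):
    """Cut a co-registered image stack (C, H, W) and CDL label map (H, W)
    into non-overlapping patch x patch samples, discarding partial edges."""
    c, h, w = image.shape
    samples = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            x = image[:, i:i + patch, j:j + patch]
            y = (label[i:i + patch, j:j + patch] == 3).astype(np.uint8)  # CDL code 3 = rice
            samples.append((x, y))
    return samples
```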

2.3. Analysis of Diversity Across Regions

After image clipping, the coverage areas of Study Areas A and B were approximately 78,623 km2 and 31,084 km2, respectively. This study divided Study Area A into three subregions, as shown in Figure 3, with image sizes of 6328 × 3328 pixels, 6656 × 4096 pixels, and 6912 × 5632 pixels. The image size for Study Area B is 3328 × 4608 pixels. We precalculated the crop sample counts and areas for the two regions. As shown in Table 3, Area A primarily cultivates crops such as soybeans (51.93%), cotton (16.43%), and corn (13.40%), with rice accounting for 17.37% of the total crop area. The spatial distribution of rice in Study Area A is widespread (Figure 3), but the field sizes are small, the distribution is discontinuous, and the plot pattern is relatively fragmented. Although no explicit class imbalance mitigation strategy was applied, Study Area A in this study does exhibit a clear class imbalance, with rice accounting for only 17.37% of the total crop area. Such imbalances could potentially impact classification accuracy. However, as demonstrated in the experimental results presented in Section 3, all comparison models, including ours, achieved satisfactory evaluation metrics in this region, suggesting that the rice samples available were sufficient for effective model training and performance. Nevertheless, we recognize that in more extreme cases of imbalance, classification accuracy could be adversely affected. Therefore, future work will consider selecting study areas based on actual crop proportions or applying class imbalance mitigation strategies such as class weighting, focal loss, or resampling to enhance model robustness and generalizability. Area B mainly cultivates winter wheat (8.27%), sunflower (6.86%), and corn (2.65%), with rice comprising 69.54% of the total crop area. Owing to differences in climate and agricultural practices, the rice-growing areas in Study Area B are generally more concentrated (Figure 3), with larger and more densely clustered plots. The two areas present significant differences in terms of their planting ratios, crop structures, and spatial distributions. This distinction is crucial for objectively assessing the generalizability of the proposed model.

2.4. Model and Principles

2.4.1. RDTFNet

Currently, most rice extraction studies that use SAR and optical images simply concatenate data along the channel dimension before they are fed into a single deep learning model for feature extraction. Although this approach preliminarily fuses the spatial features of SAR and optical images, the temporal information in SAR images is at risk of interference by optical data, making it difficult to discern the significance of each feature type. This interference may lead to the loss of certain phenological information about rice, and the fixed input features are not ideal for predictive tasks when data are missing. Therefore, this study designed a dual-branch transformer encoder to separately extract the temporal and spatial features of SAR and optical images, preventing the loss of modality-specific rice features and enhancing the model’s ability to learn and generalize between rice and non-rice areas.
The overall network structure of RDTFNet is illustrated in Figure 4. As a hybrid of U-Net and the transformer, RDTFNet retains the excellent hierarchical structure of U-Net, specifically using skip connections to effectively integrate features between the encoder and decoder. To fully leverage the diverse and complementary characteristics of multisource RS data, we replace the conventional encoder with a dual-branch structure that is designed to extract both shallow and deep spatiotemporal–spectral (STS) features of rice from optical and time-series SAR images separately.
In the optical branch, we adopt a multiscale transformer block [25], which is further enhanced with a multiscale feature module and an attention mechanism, as the spatial encoder to capture rich multiscale global spatial context and detailed spectral information from optical images. The optical branch consists of four hierarchical stages, each comprising a multiscale transformer block and a downsampling module. Each stage progressively scales down the spatial resolution by a factor of 2, resulting in four feature maps at different spatial resolutions: H × W, H/2 × W/2, H/4 × W/4, and H/8 × W/8. In parallel, the time-series SAR branch employs a Restormer block [24] as the temporal encoder, which also follows a four-stage architecture. Each stage includes a Restormer block and a downsampling module, mirroring the structure of the optical branch to ensure alignment. Consequently, feature maps are extracted at the same multiscale resolutions across both branches, facilitating subsequent fusion. Additionally, we design a feature fusion module (FFM) to integrate the spatial and spectral features from optical images with the temporal and spatial features from time-series SAR images in the dual-branch encoder.
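To make the overall data flow concrete, the following PyTorch skeleton mirrors the wiring described above (two four-stage encoders, per-stage fusion, and a U-Net-style decoder). The channel widths, the plain-convolution stand-ins for the transformer blocks, and the 1 × 1 fusion layers are placeholders for illustration, not the published configuration.

```python
import torch
import torch.nn as nn

def conv_stub(channels):
    # Stand-in for the Restormer / multiscale transformer blocks of
    # Sections 2.4.2 and 2.4.3; keeps the channel count unchanged.
    return nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.GELU())

class Stage(nn.Module):
    """One encoder stage: feature block followed by 2x strided downsampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = conv_stub(in_ch)
        self.down = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)
    def forward(self, x):
        f = self.block(x)          # skip feature at the current resolution
        return f, self.down(f)     # (skip, input to the next stage)

class DualBranchUNet(nn.Module):
    """Skeleton of the dual-branch encoder, per-stage fusion, and U-Net decoder."""
    def __init__(self, sar_ch=18, opt_ch=3, widths=(32, 64, 128, 256)):
        super().__init__()
        chs = list(widths) + [widths[-1]]                      # stage inputs + bottleneck
        self.sar_in = nn.Conv2d(sar_ch, chs[0], 1)
        self.opt_in = nn.Conv2d(opt_ch, chs[0], 1)
        self.sar_stages = nn.ModuleList([Stage(chs[i], chs[i + 1]) for i in range(4)])
        self.opt_stages = nn.ModuleList([Stage(chs[i], chs[i + 1]) for i in range(4)])
        self.fuse = nn.ModuleList([nn.Conv2d(2 * chs[i], chs[i], 1) for i in range(4)])  # FFM stand-ins
        self.bott = nn.Conv2d(2 * chs[4], chs[4], 1)
        self.up = nn.ModuleList([nn.ConvTranspose2d(chs[i + 1], chs[i], 2, stride=2)
                                 for i in reversed(range(4))])
        self.dec = nn.ModuleList([nn.Conv2d(2 * chs[i], chs[i], 3, padding=1)
                                  for i in reversed(range(4))])
        self.head = nn.Conv2d(chs[0], 1, 1)                    # per-pixel rice logits

    def forward(self, sar, opt):
        s, o = self.sar_in(sar), self.opt_in(opt)
        skips = []
        for sar_stage, opt_stage, fuse in zip(self.sar_stages, self.opt_stages, self.fuse):
            fs, s = sar_stage(s)
            fo, o = opt_stage(o)
            skips.append(fuse(torch.cat([fs, fo], dim=1)))     # fused STS skip at this scale
        x = self.bott(torch.cat([s, o], dim=1))                # fused bottleneck
        for k, i in enumerate(reversed(range(4))):
            x = self.up[k](x)                                  # upsample by 2
            x = self.dec[k](torch.cat([x, skips[i]], dim=1))   # U-Net skip connection
        return self.head(x)

# e.g. DualBranchUNet()(torch.randn(1, 18, 256, 256), torch.randn(1, 3, 256, 256))
# returns a (1, 1, 256, 256) map of rice logits.
```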

2.4.2. SAR Branch Encoder

The SAR-temporal branch encoder has four stages, each comprising a Restormer temporal information extraction module that includes multi-Dconv head transposed attention (MDTA), a gated-Dconv feed-forward network (GDFN), and a downsampling module, as shown in Figure 5. There are two reasons for choosing the Restormer block for the SAR branch. First, Restormer replaces conventional multi-head self-attention, whose computational complexity grows quadratically with spatial resolution, with MDTA, which has linear complexity. Applying self-attention across the channel dimension rather than the spatial dimension effectively improves image processing efficiency. Second, in contrast to standard transformer modules that rely on positional encoding to model sequence order, the Restormer module eliminates the need for explicit positional encoding along the channel dimension. This is because the order of channels in the SAR time series already encodes phenological sequences, implicitly representing temporal locations. The MDTA module in Restormer applies self-attention along the channel dimension rather than the spatial dimension, computing cross-channel covariance to generate attention maps that implicitly encode global context [24]. This cross-channel interaction enables effective extraction of temporal features along the channel axis in SAR time-series imagery.
Specifically, the feature map first enters MDTA, where a 1 × 1 pointwise convolution aggregates pixelwise cross-channel context to extract temporal features, whereas a 3 × 3 depth-wise convolution encodes channel-wise spatial context to capture spatial features, generating a query (Q), key (K), and value (V). The projections of the query and key are then reshaped, and feature covariance is calculated to produce a global attention map. The global attention map generated by MDTA then enters the GDFN, a module that introduces gating mechanisms instead of conventional feed-forward networks to regulate information flow across channel levels. Like MDTA, the GDFN includes depth-wise convolution, which is beneficial for learning local image information. Therefore, for a given input feature $X \in \mathbb{R}^{T \times H \times W}$, the process definitions of MDTA and the GDFN are given in (4)–(6) and (7)–(8):
$$\hat{X} = W_p\,\mathrm{Attention}(\hat{Q}, \hat{K}, \hat{V}) + X \tag{4}$$
$$\hat{Q},\ \hat{K},\ \hat{V} = \left( W_d^{Q} W_p^{Q}\,\mathrm{LN}(X),\ \ W_d^{K} W_p^{K}\,\mathrm{LN}(X),\ \ W_d^{V} W_p^{V}\,\mathrm{LN}(X) \right) \tag{5}$$
$$\mathrm{Attention}(\hat{Q}, \hat{K}, \hat{V}) = \hat{V}\cdot\mathrm{Softmax}\!\left(\frac{\hat{Q}\,\hat{K}^{T}}{\sqrt{d_k}}\right) \tag{6}$$
$$\hat{Y} = W_p\,\mathrm{Gate}(Y) + Y \tag{7}$$
$$\mathrm{Gate}(Y) = \mathrm{GELU}\!\left(W_d^{1} W_p^{1}\,\mathrm{LN}(Y)\right) \odot W_d^{2} W_p^{2}\,\mathrm{LN}(Y) \tag{8}$$
where $X$ and $\hat{X}$ are the input and output of MDTA, respectively; $Y$ and $\hat{Y}$ are the input and output of the GDFN, respectively; $W_p$ denotes 1 × 1 pointwise convolution; $W_d$ denotes 3 × 3 depth-wise convolution; and $\mathrm{LN}$ denotes layer normalization.
Based on the architecture and working mechanism of MDTA, Equations (4)–(6) describe how MDTA processes the input feature $X \in \mathbb{R}^{T \times H \times W}$; generates the query (Q), key (K), and value (V) matrices; and subsequently computes the self-attention. Unlike conventional methods, MDTA applies attention along the channel dimension rather than the spatial dimension. When processing SAR time series, where each channel corresponds to a different temporal snapshot, this design enables the model to capture temporal dependencies through inter-channel interactions. Time-series SAR imagery typically contains multiple channels, with each channel representing observations from distinct time points. Traditional spatial attention may fail to account for temporal variations, whereas channel-wise attention can directly model temporal relationships, such as phenological changes in rice growth. In addition, both linear projection and depth-wise separable convolution in MDTA are equally essential components. The 1 × 1 pointwise convolution is employed for dimensionality reduction and channel information integration, while the depth-wise separable convolution captures spatial features, thereby reducing computational complexity while preserving spatial context. This design ensures that critical spatial details, such as the shapes and boundaries of farmland, are preserved during temporal feature extraction.
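A simplified PyTorch sketch of channel-wise (transposed) attention in the spirit of MDTA is given below; the head count, the normalization details, and the omission of the GDFN are simplifications for illustration rather than the authors’ exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionMDTA(nn.Module):
    """Self-attention across channels (temporal steps), so attention cost grows
    with channel depth rather than spatial size. Assumes channels % heads == 0."""
    def __init__(self, channels, heads=2):
        super().__init__()
        self.heads = heads
        self.temperature = nn.Parameter(torch.ones(heads, 1, 1))
        self.qkv_point = nn.Conv2d(channels, channels * 3, kernel_size=1)            # Wp: 1x1 pointwise
        self.qkv_depth = nn.Conv2d(channels * 3, channels * 3, kernel_size=3,
                                   padding=1, groups=channels * 3)                   # Wd: 3x3 depth-wise
        self.project = nn.Conv2d(channels, channels, kernel_size=1)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                       # x: (B, C, H, W), C = time steps x polarizations
        b, c, h, w = x.shape
        y = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)   # LN over channels
        q, k, v = self.qkv_depth(self.qkv_point(y)).chunk(3, dim=1)
        # reshape to (B, heads, C/heads, H*W) and attend over the channel axis
        q = q.reshape(b, self.heads, c // self.heads, h * w)
        k = k.reshape(b, self.heads, c // self.heads, h * w)
        v = v.reshape(b, self.heads, c // self.heads, h * w)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature        # channel covariance map
        out = attn.softmax(dim=-1) @ v                             # weighted mix of channel features
        return x + self.project(out.reshape(b, c, h, w))           # residual, as in Eq. (4)
```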

2.4.3. Optical Branch Encoder

The optical–spatial branch encoder also consists of four stages, each comprising a multiscale transformer block with multiscale multi-head self-attention (M2SA), an efficient feed-forward network (E-FFN), and a downsampling module, as illustrated in Figure 6. To capture more powerful multiscale global information, we combine multiscale and multi-head self-attention mechanisms and apply them to optical remote sensing images to extract the multiscale global semantic context of rice spatial features.
Specifically, for the encoder output feature $X \in \mathbb{R}^{C \times H \times W}$, where C denotes the channel dimension and H and W denote the spatial dimensions, feature X is first reduced by a factor of 4 in the channel dimension via three parallel 1 × 1 convolutions. Then, four 3 × 3 depth-wise convolutions with different dilation rates capture multiscale information, with weights shared across branches. As the feature maps are progressively downsampled, the dilation rates decrease stage by stage, with dilation rates $D_i$ set to (7, 9, 11), (5, 7, 9), (3, 5, 7), and (1, 3, 5) for the M2SA modules of the four downsampling stages, respectively. Next, the original channel dimension is restored through three parallel 1 × 1 pointwise convolutions. Finally, the multiscale features from the four branches are summed to produce a multiscale global attention map. This attention map is then fed into the E-FFN. To capture local information from 2D images, the E-FFN module replaces the fully connected (FC) layers in the FFN with 1 × 1 convolutions and incorporates two parallel depth-wise separable convolutions (3 × 3 and 5 × 5), thereby retaining the CNN’s local extraction capabilities. The process definitions of M2SA are shown in (9)–(13):
$$\hat{X} = \mathrm{AdaptivePool}\!\left(W_p \ast (X_1 + X_2 + X_3)\right) \ast X \tag{9}$$
$$X_i = W_p \ast W_d \ast W_p \ast X \tag{10}$$
$$\hat{Q},\ \hat{K},\ \hat{V} = \left( W_p^{Q}\,\hat{X},\ \ W_p^{K}\,\hat{X},\ \ W_p^{V}\,\hat{X} \right) \tag{11}$$
$$\mathrm{Attention}(\hat{Q}, \hat{K}, \hat{V}) = \hat{V}\cdot\mathrm{Softmax}\!\left(\frac{\hat{Q}\,\hat{K}^{T}}{\sqrt{d_k}}\right) \tag{12}$$
$$X_{\mathrm{out}} = \mathrm{Attention}(\hat{Q}, \hat{K}, \hat{V}) + \mathrm{CA}(X) \tag{13}$$
where $X$ and $X_{\mathrm{out}}$ are the input and output of M2SA, respectively; $\hat{X}$ is the feature produced by the multiscale mechanism; $X_i$ denotes the output of each multiscale branch; $W_p$ denotes 1 × 1 pointwise convolution; $W_d$ denotes 3 × 3 depth-wise convolution; and $\mathrm{CA}$ denotes channel attention.
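The multiscale idea can be illustrated loosely as follows; this is not the exact M2SA/E-FFN design of [25], only a sketch of parallel dilated depth-wise branches plus an additive channel-attention term analogous to Equation (13). The reduction ratio and branch dilations are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleSpatialBlock(nn.Module):
    """Loose sketch of the multiscale attention idea in the optical branch:
    parallel dilated depth-wise convolutions capture context at several
    receptive-field sizes, and their sum modulates the input features.
    Assumes `channels` is divisible by `reduction`."""
    def __init__(self, channels, dilations=(1, 3, 5), reduction=4):
        super().__init__()
        hidden = channels // reduction
        self.reduce = nn.Conv2d(channels, hidden, kernel_size=1)        # 1x1 channel reduction
        self.branches = nn.ModuleList([
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=d,
                      dilation=d, groups=hidden)                        # 3x3 depth-wise, dilation d
            for d in dilations])
        self.restore = nn.Conv2d(hidden, channels, kernel_size=1)       # 1x1 channel restoration
        # simple squeeze-and-excitation style gate standing in for the CA term
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(channels, hidden, 1), nn.GELU(),
                                nn.Conv2d(hidden, channels, 1), nn.Sigmoid())

    def forward(self, x):
        h = self.reduce(x)
        context = sum(branch(h) for branch in self.branches)            # multiscale spatial context
        attn = torch.sigmoid(self.restore(context))                     # spatial attention map
        return x * attn + x * self.ca(x)                                # modulated features + CA term
```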

2.4.4. Feature Fusion Module (FFM)

Optical imagery typically contains richer spatial features than SAR imagery does, but optical imagery is susceptible to clouds and extreme rainfall. In contrast, SAR can penetrate clouds and resist most weather disturbances, providing more robust temporal resolution. However, SAR imagery is easily affected by Doppler effects and speckle noise. Therefore, data fusion techniques should be used to combine the strengths of both modalities to improve the accuracy of rice extraction. Inspired by feature-level fusion, we design the feature fusion module (FFM), as illustrated in Figure 7.
The transformer-based SAR-temporal branch encoder primarily captures cross-channel interactions to extract temporal features from the channel dimensions of time-series SAR images. However, it does not explicitly model the relative importance of individual channels. When rice fields exhibit similar spatial characteristics but differ across channels (corresponding to temporal steps), this limitation may lead to feature ambiguity. To address this issue, we improved the channel attention mechanism to enable the feature fusion module (FFM) to highlight critical phenological features of the rice growth cycle within the temporal global features extracted by the SAR-temporal branch encoder. It emphasizes important and representative channel information. Additionally, the FFM incorporates receptive-field attention convolution (RFVConv) [42], which effectively handles details and complex patterns within the multiscale global spatial context information extracted by the optical–spatial branch encoder. This enhances the model’s sensitivity to spatial positions and shapes, accommodating the irregular plot structures typical of agricultural remote sensing. Finally, the FFM fuses the improved temporal and spatial features, enhancing the accuracy of rice field identification and mitigating segmentation ambiguity caused by differences in multisource data characteristics. Specifically, for the dual-branch encoder’s output feature maps $S_n$ and $O_n$ at different levels, $n$ represents the encoder’s layers, and $S$ and $O$ denote the outputs from the SAR-temporal branch encoder and optical–spatial branch encoder, respectively. On the one hand, three pooling strategies (average pooling, max pooling, and soft pooling) are applied to $S_n$ to compute the statistical features across channels. These are passed into a Multilayer Perceptron (MLP), where addition and multiplication operations refine the statistical features to produce $P_{(A+M)\ast S}$. Subsequently, an MLP and sigmoid operation are used to derive the channel dependence $P$. On the other hand, $O_n$ is fed into RFVConv to generate feature map $O_R$. Feature map $O_R$ is multiplied by the channel dependence $P$ to obtain an initial fusion result. Finally, to ensure the completeness of temporal, spatial, and spectral information, the initial fusion result is added to the outputs of the two branches to produce the final fusion result $F$. The process definitions of the FFM are shown in (14)–(16):
$$P_{A+M} = \mathrm{MLP}\!\left(\mathrm{AvgPool}(W_p S_n)\right) + \mathrm{MLP}\!\left(\mathrm{MaxPool}(W_p S_n)\right), \qquad P_{(A+M)\ast S} = P_{A+M} \ast \mathrm{MLP}\!\left(\mathrm{SoftPool}(W_p S_n)\right) \tag{14}$$
$$P = \mathrm{Sigmoid}\!\left(\mathrm{MLP}\!\left(P_{(A+M)\ast S}\right)\right) \tag{15}$$
$$F = S_n + O_n + P \ast O_R \tag{16}$$
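A schematic PyTorch sketch of the fusion logic in Equations (14)–(16) follows; the MLP depths, the soft-pooling implementation, and the plain 3 × 3 convolution standing in for RFVConv are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FFMSketch(nn.Module):
    """Channel statistics of the SAR feature S_n gate a spatially re-encoded
    optical feature O_R, following Eqs. (14)-(16). RFVConv is replaced here
    by a plain 3x3 convolution as a stand-in."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = channels // reduction
        mlp = lambda: nn.Sequential(nn.Conv2d(channels, hidden, 1), nn.GELU(),
                                    nn.Conv2d(hidden, channels, 1))
        self.mlp_avg, self.mlp_max, self.mlp_soft = mlp(), mlp(), mlp()
        self.mlp_out = nn.Conv2d(channels, channels, 1)
        self.rf_conv = nn.Conv2d(channels, channels, 3, padding=1)   # stand-in for RFVConv

    @staticmethod
    def soft_pool(x):
        # Softmax-weighted global pooling over the spatial dimensions.
        w = torch.softmax(x.flatten(2), dim=-1)
        return (w * x.flatten(2)).sum(-1, keepdim=True).unsqueeze(-1)  # (B, C, 1, 1)

    def forward(self, s_n, o_n):
        avg = nn.functional.adaptive_avg_pool2d(s_n, 1)
        mx = nn.functional.adaptive_max_pool2d(s_n, 1)
        p_am = self.mlp_avg(avg) + self.mlp_max(mx)                   # Eq. (14), additive part
        p_ams = p_am * self.mlp_soft(self.soft_pool(s_n))             # Eq. (14), multiplicative part
        p = torch.sigmoid(self.mlp_out(p_ams))                        # Eq. (15): channel dependence P
        o_r = self.rf_conv(o_n)                                       # re-encoded optical feature O_R
        return s_n + o_n + p * o_r                                    # Eq. (16): fused STS feature F
```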

2.4.5. Loss Function

The cross-entropy loss function is widely used in classification tasks because of its excellent performance [27]. It evaluates each pixel’s predicted probability against its target class, and for the binary rice/non-rice case its formula is as follows:
$$\mathrm{Loss} = -\left( y \log P + (1 - y)\log(1 - P) \right) \tag{17}$$
In this formula, Loss is the loss value, y is the true label, and P represents the model’s predicted probability.
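In PyTorch, Equation (17) corresponds to the standard binary cross-entropy; applying it to raw logits, as sketched below, is a common implementation choice for numerical stability (the function name is illustrative).

```python
import torch.nn.functional as F

def rice_loss(logits, target):
    # Per-pixel binary cross-entropy on raw logits; `target` is the 0/1 rice
    # mask with the same shape as `logits`.
    return F.binary_cross_entropy_with_logits(logits, target.float())
```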

2.4.6. Performance Evaluation and Experimental Setup

The primary evaluation metrics used in this study to assess the rice extraction model include overall accuracy (OA), precision, IoU, recall, specificity, and F1-score, with their formulas provided in (18)–(23):
$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN} \tag{18}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{19}$$
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{20}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{21}$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{22}$$
$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}} \tag{23}$$
TP denotes correctly classified rice pixels, TN refers to correctly classified non-rice pixels, FP represents non-rice pixels misclassified as rice, and FN indicates rice pixels misclassified as non-rice.
The overall accuracy (OA) evaluates the model’s classification performance across the dataset [27,33,43]. Precision can be used to quantify the classification accuracy for rice [27,33,43]. The intersection over union (IoU) calculates the ratio of correctly predicted rice pixels to the total rice pixel area [43]. Recall assesses the model’s ability to detect rice pixels [27,33]. Specificity is used to indicate how many of all non-rice samples are predicted to be non-rice. The F1-score, as the harmonic mean of precision and recall, is a balanced evaluation metric [27,33,43,44].
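These metrics can be computed directly from the confusion-matrix counts, for example with the following NumPy sketch (the function name is illustrative, and zero-division guards are omitted for brevity).

```python
import numpy as np

def rice_metrics(pred, truth):
    """Compute the evaluation metrics of Eqs. (18)-(23) from binary maps."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)
    tn = np.sum(~pred & ~truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "OA": (tp + tn) / (tp + tn + fp + fn),
        "IoU": tp / (tp + fp + fn),
        "Precision": precision,
        "Recall": recall,
        "Specificity": tn / (tn + fp),
        "F1": 2 * precision * recall / (precision + recall),
    }
```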
RDTFNet was compared with six classical models, including R-Unet, CMTFNet, TFBS, U-Net, SegNet and DeepLabV3, to evaluate its effectiveness in rice extraction. The 2019 dataset from Study Area A was used for training and validation, while the 2019 test set (Test Area A) served as the test dataset. In Study Area A, the 2019 dataset was used to train and validate seven models. The training loss was recorded, along with the overall accuracy (OA), intersection over union (IoU), precision, recall, and F1-score on the validation set.
The experiments were implemented using Python version 3.8.13 and PyTorch version 1.13.0. Training was conducted on an Intel(R) Core(TM) i7-8700K CPU with 12 GB RAM and an NVIDIA RTX 3090 GPU. The initial learning rate was set to 0.00005, using the Adam optimizer, with 50 training epochs and a batch size of 4.
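A minimal training loop consistent with this setup (Adam, learning rate 5 × 10⁻⁵, 50 epochs, batch size 4) is sketched below; the dataset interface yielding (SAR, optical, label) patch triplets and the use of binary cross-entropy with logits are assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def train(model, train_set, epochs=50, lr=5e-5, batch_size=4, device="cuda"):
    """Sketch of the training procedure; `train_set` is assumed to yield
    (sar, optical, label) tensors with labels shaped like the model output."""
    model = model.to(device)
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        model.train()
        running = 0.0
        for sar, opt, label in loader:
            sar, opt, label = sar.to(device), opt.to(device), label.to(device)
            optimizer.zero_grad()
            logits = model(sar, opt)
            loss = F.binary_cross_entropy_with_logits(logits, label.float())
            loss.backward()
            optimizer.step()
            running += loss.item()
        print(f"epoch {epoch + 1}: mean training loss {running / len(loader):.4f}")
```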

3. Results

3.1. Cross-Regional Temporal Diversity Analysis

Owing to variations in climatic conditions across different regions and years, the growing season of crops exhibits slight differences. As shown in Figure 8, the time-series curves of NDVI, VV, and VH for rice differ significantly between Study Area A and Study Area B. Taking the 2019 NDVI time-series curve as an example, the rice-growing season in Study Area B is slightly delayed compared to that in Study Area A, as indicated by the later onset, peak, and decline in vegetation indices. This indicates that rice in Study Area B is sown later than in Study Area A, further confirming regional differences in climate, cultivation systems, or irrigation management practices.
In addition, the temporal variations in VV and VH polarization signals reflect the distinct characteristics of rice growth between Study Area A and Study Area B. VV polarization is primarily influenced by the crop canopy structure, whereas VH polarization is more sensitive to crop water content and scattering properties. The VV backscatter rise in Study Area B appears delayed, which aligns with the NDVI trend and further supports the conclusion that rice sowing in this region occurs later. Regarding VH polarization, the trough of the VH curve in Study Area B appears later, indicating a delayed sowing period compared to Study Area A. The earlier decline in VH in Study Area A suggests that rice entered the maturation stage and harvesting began sooner, whereas the later decline in Study Area B further reinforces the observation of a delayed growing season in this region.
Overall, the differences in the NDVI, VV, and VH time-series curves collectively reveal variations in rice planting schedules and growth cycles between Study Area A and Study Area B. These patterns show that rice in Study Area B is planted and matures later. The two regions differ in climatic conditions, sowing schedules, and crop development periods. Such differences are critical for objectively evaluating the temporal generalization capability of the proposed model.
Based on the rice remote sensing characteristics of Study Areas A and B, a cross-temporal and cross-spatial diversity analysis was conducted. The analysis results indicate that when models trained on data from specific years and locations are applied to different years or locations, the accuracy of crop classification may be influenced by the model’s spatial–temporal generalization ability.

3.2. Comparative Results of Basic Performance of Rice Extraction

Figure 9 presents the training loss curves of the seven models in Study Area A. As training epochs increased, the training losses steadily decreased. Figure 10 shows the OA, precision, recall, and F1-score of the seven models on the validation samples in Study Area A. After 50 epochs, all evaluation metrics became stable and exceeded 0.85. Among them, the F1-scores of RDTFNet, R-Unet, CMTFNet, TFBS, and U-Net stabilized around 0.92. In contrast, SegNet and DeepLabV3 performed relatively poorly, with F1-scores stabilizing around 0.85. Overall, RDTFNet and R-Unet consistently outperformed the other classical models for rice mapping on the validation samples. This also highlights the effectiveness of the U-Net architecture for rice segmentation tasks.
The untrained 2019 test set from Study Area A was selected to evaluate the performance of the seven classical models by calculating relevant metrics. Table 4 presents the performance of the seven models on the test set. The results show that RDTFNet achieved the highest overall accuracy of 96.95%, an IoU of 88.12%, a precision of 95.14%, a recall of 92.27%, and an F1-score of 93.68%, outperforming all six other classical models for rice extraction. Among the other six models, R-Unet performed relatively well, with metrics including overall accuracy, IoU, precision, recall, and F1-score approaching those of RDTFNet, though a noticeable gap remained. In contrast, DeepLabV3 and SegNet performed poorly, with IoUs of only 83.57% and 83.50%, respectively, substantially lower than those of RDTFNet and R-Unet. This indicates a higher incidence of misclassification and omission errors in rice-growing areas, suggesting that these two models struggle to accurately capture the spatial distribution characteristics of rice fields.
Figure 11 shows the rice prediction results of seven classical models in the 2019 Test Area A. RDTFNet produces more accurate predictions that align closely with the actual distribution of rice fields. The segmentation produced by DeepLabV3 (Figure 11c) is overly smooth and lacks precise field boundary extraction (highlighted in green box), mainly due to its large upsampling ratio, which blurs feature representation. While SegNet, U-Net, and CMTFNet (Figure 11d–f) all adopt encoder–decoder architectures, CMTFNet and U-Net have fewer misclassifications (highlighted in blue box). U-Net’s skip connections preserve spatial context across layers, while CMTFNet replaces decoder convolutions with multiscale transformer modules to capture global multiscale semantic spatial context. These strategies help retain rice field details and improve segmentation precision. TFBS, R-Unet, and RDTFNet (Figure 11g–i) all model temporal features, but with varying effectiveness. TFBS and R-Unet process time-series data more coarsely, limiting boundary preservation and small patch recognition (highlighted in yellow box). In contrast, RDTFNet accurately distinguishes between rice fields and non-rice fields, with improved completeness in identifying rice fields. This is because the temporal and spatial encoding branches effectively capture the temporal features and multiscale global context of rice, allowing for efficient fusion of these learned temporal and spatial features.

3.3. Comparative Results of Cross-Regional Generalization Capabilities

The results of RDTFNet were compared with those of six classical models, including DeepLabV3, SegNet, U-Net, CMTFNet, TFBS and R-Unet. To further evaluate the generalization ability of the seven different models for rice mapping across spatial regions, the 2019 dataset from Study Area A was used for training and validation of all seven models, whereas the 2019 dataset from Study Area B served as the test dataset.
Table 5 presents the performance of the seven models in Test Area B. The results show that RDTFNet outperforms the six classical models, achieving the highest overall accuracy (98.33%), IoU (86.15%), precision (93.25%), recall (91.91%), and F1-score (92.55%). Compared with the non-cross-region test results in Area A, all models exhibited a decline in performance across most evaluation metrics except for overall accuracy and specificity. RDTFNet experienced the smallest performance drop. Among the other six models, DeepLabV3 suffered from excessive upsampling, which led to the loss of spatial detail and a relatively low precision (82.49%), resulting in frequent misclassification of non-rice areas. SegNet and U-Net improved performance using encoder–decoder architectures. U-Net achieved better IoU (79.16%) and F1-score (88.33%) than SegNet due to its skip connections, which preserved more spatial information. CMTFNet further improved performance by integrating multiscale transformer modules into the U-Net architecture, increasing precision to 89.88% and IoU to 79.82%. However, its recall slightly declined to 87.70%, suggesting room for improvement in the model’s recall ability. TFBS and R-Unet enhanced the spatiotemporal feature representation through temporal modeling, yielding better F1-scores compared to spatial-only models. However, both suffered from limited temporal feature extraction, leading to some loss of temporal information. Notably, as shown in Table 5, the OA of all models increased under cross-regional generalization. This is due to differences in planting structure and field size between Test Areas A and B. Area A is smaller and more complex, containing a diverse array of negative samples, whereas Area B is larger, with a simpler, more uniform planting structure. The observed increase in specificity supports this, indicating that models were better at identifying negative samples in Area B.
Figure 12 shows the rice prediction results of seven models on the 2019 dataset for Study Area B. A comparison of the rice extraction results reveals that RDTFNet produces more accurate extractions, with predictions closer to those of the actual ground conditions of rice fields. Overall, DeepLabV3 (Figure 12c) exhibits numerous misclassifications (highlighted in green boxes), particularly between rice and non-rice areas, and between rice fields and water bodies. This is mainly due to its excessive upsampling, which causes blurred feature representation and loss of spatial detail. SegNet, U-Net, and CMTFNet (Figure 12d–f) adopt encoder–decoder architectures. U-Net preserves spatial contextual information through skip connections, while CMTFNet enhances global spatial semantics using multiscale transformer modules, significantly reducing classification errors (highlighted in blue boxes). However, these methods lack temporal modeling, which leads to confusion between rice fields and water bodies. TFBS and R-Unet (Figure 12g,h) attempt to capture temporal dynamics using LSTM and channel attention, respectively. Nevertheless, TFBS’s sequential structure disrupts the continuity of temporal features, and R-Unet’s coarse modeling leads to inadequate temporal extraction, resulting in misclassifications (highlighted in yellow boxes). In contrast, RDTFNet (Figure 12i) effectively differentiates between rice fields and water bodies and improves the overall completeness of rice field identification. This improvement stems from its temporal encoding branch, which captures deep temporal information from SAR time series, enabling recognition of rice fields with weak temporal signatures and distinction of water bodies with irrigation-like features. The spatial encoding branch extracts multiscale global spatial context from optical imagery, ensuring accurate detection of rice field location, boundaries, and shapes.

3.4. Comparison Results of Temporal Generalization Capabilities

To further evaluate the temporal generalization capabilities of different models, this study adopted a cross-temporal and cross-regional testing strategy. Specifically, data from Study Area A in 2019 were used for training and validation, while data from Study Area B in 2021 served as the test set. The two study areas differ in climate conditions, soil types, cropping patterns, and agricultural management practices, and the data were collected in different years. This experimental design enables a comprehensive assessment of each model’s adaptability to spatiotemporal changes, especially their robustness in rice extraction tasks. It also offers a more objective evaluation of RDTFNet’s temporal generalization performance.
Table 6 presents the performance of RDTFNet and six classical models on the 2021 test set from Study Area B. Compared with the 2019 test results, all models exhibited varying degrees of performance degradation on the 2021 dataset, indicating that spatiotemporal variation poses significant challenges to rice extraction. DeepLabV3 showed a 10.08% decrease in IoU and a 7.02% drop in F1-score, highlighting its limited ability to detect rice field boundaries and small patches and its susceptibility to temporal variation. SegNet, U-Net, and CMTFNet experienced IoU reductions of 5.57%, 4.63%, and 7.68%, respectively. Their relatively low generalization capability is attributed to their reliance solely on spatial features, making them less adaptable to interannual variations in rice growth conditions. TFBS and R-Unet showed IoU decreases of 5.45% and 4.50%, respectively. Although both consider temporal information, their feature extraction approaches limit their ability to capture the complex rice growth cycle, resulting in considerable misclassification. In contrast, RDTFNet exhibited only a 3.44% reduction in IoU, with comparatively minor declines in precision (3.44%) and recall (3.16%), maintaining the most stable performance among all models. These results suggest that RDTFNet is more effective in modeling the spatiotemporal characteristics of rice, mitigating feature drift caused by temporal shifts and, thus, demonstrating superior robustness and generalization in cross-temporal rice extraction tasks.
Figure 13 shows the rice extraction results of RDTFNet and six classical deep learning models on the 2021 dataset from Study Area B. A comparative analysis reveals that RDTFNet significantly outperforms the other models, producing extraction results that closely align with actual rice cultivation patterns. Analyzing extraction results at the same locations across 2019 and 2021 demonstrates that RDTFNet maintains high accuracy in cross-year predictions, indicating superior temporal generalization compared to other models. Specifically, RDTFNet exhibits stronger adaptability to interannual variations in rice cultivation. As shown in the figure, despite differences in rice planting areas between 2019 and 2021, RDTFNet consistently achieves high extraction accuracy across both years (highlighted in yellow boxes). TFBS and R-Unet, which incorporate temporal feature modules, show the second-best temporal generalization performance. However, limitations in their network architectures and modeling approaches result in incomplete furrow detection and confusion between rice fields and water bodies during cross-year predictions (highlighted in blue boxes). Other models, such as SegNet and DeepLabV3, display a higher number of false positives and missed detections in their predictions (highlighted in green boxes). The superior performance of RDTFNet can be attributed to its dual-branch spatiotemporal encoding architecture and feature fusion module, which enable better adaptation to image features across varying geographic regions and time points. Notably, in regions with complex terrain and diverse crop distributions, RDTFNet effectively distinguishes rice fields from water bodies, whereas other models are more prone to interference from non-rice crops, leading to increased extraction errors.

4. Discussion

4.1. Ablation Study

Using the untrained 2019 dataset from Study Area B as the test set, an ablation experiment was conducted on the five baseline models.
(1)
Effect of the SAR-temporal branch encoder: As shown in Table 7, the use of the Restormer CNN (R-CNN) branch structure developed in this study for U-Net segmentation of the test set improved the overall accuracy by 0.19%, the IoU by 0.82%, and the recall by 1.42%, with a decrease in precision of 0.56% but an increase in the F1-score of 0.47%. The Restormer module emphasizes the local context, ensuring that context-based global relationships between pixels are implicitly modeled during covariance-based attention map computation [24]. The cross-channel self-attention calculation meets the temporal feature extraction requirements. This result validates that, compared with traditional convolutional branches, the Restormer CNN branch captures richer spatial and temporal features in SAR imagery, improving the model’s capacity to recognize rice.
(2)
Effect of the optical–spatial branch encoder: As shown in Table 7, replacing the convolutional branch with the multiscale transformer-CNN (MTCNN) module as the encoding branch in the U-Net segmentation of the test set improved the overall accuracy by 0.15%, the IoU by 0.61%, the recall by 0.82%, and the F1-score by 0.35% compared with DE-UNet. This confirms the effectiveness of multiscale global attention in rice extraction and demonstrates that adding a multiscale mechanism compensates for insufficient granularity in global context information [25].
(3)
Effect of the FFM: Table 7 shows the results of applying the FFM to fuse optical and SAR features from the dual-branch encoder and process the test set images. The overall accuracy improved by 0.22%, the IoU improved by 0.79%, the precision by 0.76%, the recall improved by 0.15%, and the F1-score improved by 0.45%. The FFM integrates the temporal information from SAR imagery with spatial information from optical imagery at the feature level, learning complementary information from multimodal data [45]. By including the FFM, the network became more sensitive to the phenological characteristics of rice, resulting in fewer misclassifications between rice-growing areas and rivers or lakes, as the unique irrigation period of rice makes its optical features similar to those of water bodies. The results indicate that the FFM effectively captures the phenological characteristics of rice, enhances category differentiation and improves segmentation between rice and non-rice areas.
Figure 14 shows the rice prediction results of five baseline models on the 2019 dataset for Test Area A. Comparing the extraction results reveals that RDTFNet performs well overall, closely matching ground truth rice distribution. DE-UNet and DE-UNet (spatial) tend to misclassify large areas of non-rice crops as rice (green boxes), indicating their emphasis on spatial features while overlooking temporal information. This limits their ability to distinguish irrigated rice from other crops in June imagery. In contrast, DE-UNet (temporal) effectively differentiates rice due to its added temporal branch encoder that better captures rice growth patterns (blue boxes). Integrating spatial and temporal branches in RDTFNet (WithoutFFM) and RDTFNet further enhances classification. Although RDTFNet (WithoutFFM) still misidentifies some rice fields (yellow boxes), misclassifications are noticeably reduced. RDTFNet’s classification errors are mostly concentrated on roads between rice paddies, likely due to the limited spatial resolution of Sentinel data, which results in mixed land-road pixels at field edges and hinders accurate edge extraction. This is a common limitation of current low-to-medium-resolution extraction methods. Additionally, scattered small pixels are misclassified as farmland, which could be improved using postprocessing techniques like filling or filtering.
These ablation results collectively demonstrate the effectiveness of each architectural component in enhancing rice extraction from multimodal remote sensing data. Specifically, the Restormer-based SAR-temporal encoder improves temporal pattern recognition, the MTCNN-based optical–spatial encoder enhances spatial contextual understanding, and the FFM significantly boosts the synergy between modalities. The comparative visualization further underscores the superiority of the full RDTFNet architecture in capturing the phenological characteristics and spatial patterns of rice, especially under challenging conditions such as mixed pixels and phenological similarity to water bodies. These findings highlight the importance of integrating both spatial and temporal attention mechanisms when modeling complex crop types such as rice, offering valuable insights for the design of future multimodal segmentation networks.
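As a rough illustration of how such a fusion step can be realized, the sketch below re-weights the concatenated optical and SAR feature maps with a channel-attention gate before merging them. It is a minimal stand-in written under our own assumptions (a squeeze-and-excitation-style gate and a 3x3 merge convolution), not the exact FFM published with RDTFNet.

```python
# Illustrative attention-weighted fusion step in the spirit of the FFM
# (our simplification): concatenated optical and SAR features are re-weighted
# per channel before being merged, so the network can emphasise whichever
# modality is more informative for a given feature map.
import torch
import torch.nn as nn


class SimpleFusionModule(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        fused = channels * 2
        # squeeze-and-excitation style channel gate over the concatenated features
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(fused, fused // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(fused // reduction, fused, 1),
            nn.Sigmoid(),
        )
        self.merge = nn.Conv2d(fused, channels, kernel_size=3, padding=1)

    def forward(self, f_opt: torch.Tensor, f_sar: torch.Tensor) -> torch.Tensor:
        x = torch.cat([f_opt, f_sar], dim=1)   # stack spatial and temporal features
        x = x * self.gate(x)                   # per-channel importance weights
        return self.merge(x)                   # fused spatial-temporal-spectral features


# Usage: fuse two 64-channel feature maps from the optical and SAR branches.
ffm = SimpleFusionModule(channels=64)
fused = ffm(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
print(fused.shape)  # torch.Size([2, 64, 32, 32])
```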

4.2. The Impact of Single/Dual-Branch Model Structures on Rice Extraction

This study proposes RDTFNet, which is based on U-Net and incorporates Restormer and MSTransformer as encoders to form a dual-branch structure for rice extraction in the study area. To validate the performance of RDTFNet’s dual-branch structure in rice extraction, one branch was removed from the original RDTFNet architecture, resulting in two baseline models: Restormer-UNet and MSTransformer-UNet. To ensure consistency in data modality, both the single- and dual-branch models use multimodal inputs. The difference lies in the input strategy: the single-branch models concatenate temporal SAR and spatial optical data along the channel dimension before feeding them into the model. In contrast, the dual-branch model inputs temporal SAR and spatial optical data separately through two independent encoder branches.
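The difference between the two input strategies can be summarized in a few lines of PyTorch. The sketch below is purely illustrative: the channel counts follow Table 2 (18 stacked SAR channels, 3 optical index channels), while the single-convolution "encoders" are placeholders for the actual Restormer and MSTransformer branches.

```python
# Minimal sketch contrasting the two input strategies compared in this section.
import torch
import torch.nn as nn

sar = torch.randn(1, 18, 256, 256)   # 9 dates x (VV, VH) stacked along channels
opt = torch.randn(1, 3, 256, 256)    # NDVI, EVI, LSWI composite

# Single-branch strategy: concatenate modalities along the channel axis and
# feed one encoder (as in the Restormer-UNet / MSTransformer-UNet baselines).
single_encoder = nn.Conv2d(18 + 3, 64, kernel_size=3, padding=1)
feat_single = single_encoder(torch.cat([sar, opt], dim=1))

# Dual-branch strategy: each modality gets its own encoder, and the two
# feature streams are only combined later by a fusion module.
sar_branch = nn.Conv2d(18, 64, kernel_size=3, padding=1)
opt_branch = nn.Conv2d(3, 64, kernel_size=3, padding=1)
feat_sar, feat_opt = sar_branch(sar), opt_branch(opt)
print(feat_single.shape, feat_sar.shape, feat_opt.shape)
```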
As shown in Table 8, RDTFNet outperforms the three baseline models in terms of OA, IoU, precision, and F1-score. RDTFNet's OA, IoU, and F1-score improved by 0.50%, 1.88%, and 1.07%, respectively, compared with those of Restormer-UNet, and by 0.59%, 2.16%, and 1.23%, respectively, compared with those of MSTransformer-UNet. Compared with U-Net, the modified dual-encoder UNet (DE-UNet) improved the overall accuracy by 0.18%, the IoU by 0.59%, the precision by 0.71%, and the F1-score by 0.34%. Replacing DE-UNet's conventional CNN encoder structure with the transformer dual-branch encoder designed in this study further enhanced segmentation on the test set, increasing the overall accuracy by 0.45%, the IoU by 1.73%, the recall by 1.54%, and the F1-score by 0.99%. These results validate the effectiveness of a parallel dual-branch structure in broadening the model width by extracting different multilevel features in parallel [21,22,23]. Specifically, using two transformers to capture the temporal and spatial features of the SAR and optical images, respectively, effectively enhances model performance.
Additionally, the OA, IoU, precision, and F1-score of Restormer-UNet are 0.09%, 0.28%, 0.51%, and 0.16% higher, respectively, than those of MSTransformer-UNet. The Restormer module extracts pixelwise cross-channel context and channel-wise spatial context, enabling it to capture spatiotemporal features. In contrast, the MSTransformer module focuses more on multiscale global dependencies among spatial pixels, which limits its ability to capture temporal information and leads to a decrease in performance. Among the single-branch structures, Restormer-UNet therefore outperforms MSTransformer-UNet in rice extraction; however, neither single-branch model can effectively learn both the SAR and optical features of rice. Combining the two modules is necessary to learn the features of rice effectively.

4.3. The Impact of Unimodal/Multimodal Data on Rice Extraction

This study further investigates the impact of unimodal and multimodal data on rice mapping and shows that the distinct characteristics of temporal SAR data and single-date optical imagery underscore the necessity of multimodal fusion. However, the effectiveness of such fusion depends on the coordinated integration of temporal and spatial features across modalities. Under unimodal conditions, temporal SAR data alone can capture rice phenological features (e.g., abrupt changes in backscatter following irrigation and variations in polarization indices during maturation). However, the inherent low spatial resolution and speckle noise in SAR imagery limit its ability to delineate field boundaries and mixed cropping areas with precision. Although median-composited single-date optical images can effectively suppress cloud contamination and preserve spatial details, their lack of temporal information makes it difficult to distinguish rice from other spectrally similar crops. Multimodal data, by integrating the temporal sensitivity of SAR with the high spatial resolution of single-date optical imagery, enable a more complementary feature representation. SAR time series can identify the key phenological stages of rice (e.g., transplanting and heading), while optical images provide accurate spatial features at corresponding timestamps. However, simple channel-wise concatenation (as used in single-branch models) may lead to insufficient inter-modal feature coupling, causing spatiotemporal information to be diluted in shallow network layers. In contrast, the dual-branch architecture independently extracts features from each modality and facilitates cross-modal interactions at both shallow and deep levels, enabling the retention of modality-specific information while enhancing inter-modal correlations.
As shown in Table 9, the experimental results further confirm that multimodal inputs offer significant advantages over unimodal inputs for rice mapping. Unimodal SAR and optical data each exhibit distinct strengths. In terms of precision and recall, SAR data yield higher precision (88.20% vs. 80.05%), indicating a lower false-positive rate and a stronger ability to distinguish rice from non-rice areas. In contrast, optical data achieve higher recall (92.42% vs. 84.39%), suggesting more comprehensive detection of rice regions with fewer omissions. These findings highlight the inherent limitations of single-modality data for rice mapping: SAR data are more effective at ensuring classification precision, whereas optical data tend to provide more comprehensive coverage of rice areas. The two modalities therefore offer complementary strengths in the rice extraction task.
Compared with the unimodal inputs, the multimodal approach performs better on all evaluation metrics. Specifically, under the multimodal setting, the overall accuracy (OA) of U-Net and RDTFNet increases to 97.37% and 98.33%, respectively, while their intersection over union (IoU) improves to 79.16% and 86.15%. These results demonstrate that fusing multimodal data enhances the accuracy of rice extraction. RDTFNet, which adopts a dual-branch architecture, achieves the best performance in multimodal fusion, attaining an OA of 98.33%, an IoU of 86.15%, and an F1-score of 92.55%, clearly outperforming the other multimodal models. These findings suggest that the effectiveness of multimodal data in rice extraction depends not only on the complementarity of the data but also on the network's ability to decouple modality-specific features and optimize the fusion strategy. The dual-branch design separates spatial and temporal features and incorporates a hierarchical fusion mechanism, enabling more efficient utilization of multimodal data. This helps overcome the perceptual limitations of unimodal approaches and further enhances both the accuracy and robustness of rice extraction.
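For reference, the accuracy metrics reported in Tables 4-9 follow the standard binary-classification definitions and can be computed from a rice/non-rice confusion matrix as in the short NumPy sketch below (variable names are ours; denominators are assumed non-zero).

```python
# Standard binary segmentation metrics computed from rice/non-rice masks.
import numpy as np


def binary_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()      # rice predicted as rice
    fp = np.logical_and(pred, ~truth).sum()     # non-rice predicted as rice
    fn = np.logical_and(~pred, truth).sum()     # rice missed
    tn = np.logical_and(~pred, ~truth).sum()    # non-rice correctly rejected
    oa = (tp + tn) / (tp + tn + fp + fn)
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"OA": oa, "IoU": iou, "Precision": precision,
            "Recall": recall, "F1": f1}
```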
Figure 15 presents the rice extraction results of RDTFNet using unimodal (optical), unimodal (SAR), and multimodal (optical and SAR) data in the 2019 dataset of Test Area B. Comparative analysis reveals that using spatial optical data alone tends to extract rice-growing areas more comprehensively. However, due to the lack of phenological feature learning, the model misclassifies a large number of non-rice crops (e.g., corn, cotton, almonds) as rice, as highlighted by the green boxes. Although optical images contain rich spatial information (e.g., edges, textures, and colors), the model’s accuracy degrades significantly in scenarios involving spectral similarity between different objects (e.g., coexisting crops) or spectral confusion between distinct objects (e.g., irrigated rice and water bodies). When using temporal SAR data alone, the model effectively distinguishes rice from non-rice crops and water bodies by learning phenological features, thereby reducing the false-positive rate. However, due to limited spatial features, the model exhibits omission errors in rice pixel detection, as indicated by the blue boxes. In contrast, with multimodal data (optical and SAR), the model simultaneously learns spatial features from optical data and temporal phenological patterns from SAR data. Optical imagery contributes rich edge, texture, and spectral information, enabling more comprehensive identification of rice-growing areas. SAR data enhance the model’s ability to learn phenological characteristics through temporal information, effectively distinguishing rice from other crops and water bodies, as highlighted by the yellow boxes. This synergy significantly improves the accuracy of rice extraction.

5. Conclusions

On the basis of Sentinel-2 multispectral remote sensing images and Sentinel-1 time-series SAR remote sensing images, optical vegetation index and time-series polarization index datasets were created for the rice growth periods in the U.S. Midwest and Western U.S. for 2019 and 2021. Building on CNN and transformer architectures, we propose the U-shaped dual-branch fusion network RDTFNet for rice extraction, which comprises a dual-branch encoder built from the Restormer and MSTransformer modules together with an FFM. The transformer dual-branch encoder is designed to focus independently on the different features of the optical and SAR data, and the FFM further extracts and fuses the spatiotemporal information. To validate the impact of datasets from different years and regions on the model's spatiotemporal generalization ability, RDTFNet was compared with six conventional models under different regional and annual conditions. Additionally, to investigate the contributions of the dual-branch structure and the two transformer modules to rice extraction, RDTFNet was compared with four baseline models. The following conclusions were drawn:
(1)
In terms of baseline performance and generalization capability, RDTFNet achieves high classification accuracy in rice extraction, outperforming several existing deep learning models. On in-region test data, it reached an overall accuracy (OA), intersection over union (IoU), and F1-score of 96.95%, 88.12%, and 93.68%, respectively, improving by 1.61%, 5.37%, and 2.53% over other models. In cross-region tests, RDTFNet showed the smallest performance drop, achieving a 92.55% F1-score with only a 1.13% decrease, demonstrating strong spatial generalization. In cross-temporal evaluation, it also had the smallest metric reductions, maintaining an F1-score of 90.53%, just 2.03% lower than in cross-region testing, highlighting its robust temporal generalization.
(2)
Compared with single-branch architectures, the dual-branch design offers higher accuracy and better overall performance. On the 2019 dataset of Study Area A, even the best single-branch model (spatial or temporal) yielded lower F1 (92.61%) and IoU (86.24%) scores, indicating the limitations of single branches in fully exploiting multimodal information. In contrast, RDTFNet achieved an IoU of 88.12%, a precision of 95.14%, and an F1-score of 93.68%. Compared with the single-branch models U-Net, Restormer-UNet, and MSTransformer-UNet, RDTFNet achieved the highest scores across all evaluation metrics, demonstrating superior performance in rice mapping.
(3)
To evaluate the impact of unimodal and multimodal data, experiments were conducted using either optical or SAR data alone, and the results were compared with those obtained from the multimodal RDTFNet. The results demonstrate that the multimodal RDTFNet achieved the best performance, with IoU scores exceeding those of the unimodal optical and SAR inputs by 11.08% and 10.33%, respectively, highlighting the complementary advantages of optical and SAR data.
Although the model in this study employs a dual-branch structure, its ability to support classification predictions when data from one modality are missing remains to be examined. Evaluations of cross-regional scenarios also need to consider various environmental factors, including planting type, terrain, cloud cover, and plot distribution; the two regions selected in this study are insufficient to cover all of these influencing factors, necessitating further testing and improvement. In future research, we will enhance the model's adaptability to missing modality data while maintaining its performance, and improve its robustness under the influence of various environmental factors, with the ultimate goal of developing a universal model for rice extraction.
In conclusion, the RDTFNet model proposed in this study can extract rice planting areas in a timely and accurate manner across different spatiotemporal contexts, demonstrating robust spatiotemporal generalizability. The research results may provide important information for government agricultural management decision making.

Author Contributions

The manuscript was primarily written and revised by X.Z. and H.W., who also designed and conducted the comparative experiments; D.-H.W. provided supervision throughout the study and reviewed and edited the manuscript; Y.S. and H.L. contributed comments and suggestions for improvement and assisted with revisions. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Planning Project of Fujian Province (No. 2023J011429) and the Natural Science Foundation of Hunan Province, China (No. 2025JJ80020).

Data Availability Statement

All datasets are available at https://browser.dataspace.copernicus.eu (accessed on 20 April 2025) and https://nassgeodata.gmu.edu/CropScape (accessed on 20 April 2025). The RDTFNet will be available at https://github.com/GODBYW/RDTFNet (accessed on 30 May 2025).

Acknowledgments

We would like to express our appreciation to the reviewers, whose insightful and constructive comments were highly valuable in improving the quality of our work on previous versions of this document.

Conflicts of Interest

The authors have no conflicts of interest to declare.

References

  1. Boddiger, D. Boosting biofuel crops could threaten food security. Lancet 2007, 370, 923–924.
  2. Bongaarts, J. The State of Food Security and Nutrition in the World 2020. Transforming food systems for affordable healthy diets. Popul. Dev. Rev. 2021, 10–11.
  3. Funk, C.C.; Brown, M.E. Declining global per capita agricultural production and warming oceans threaten food security. Food Secur. 2009, 1, 271–289.
  4. Godfray, H.C.J.; Beddington, J.R.; Crute, I.R.; Haddad, L.; Lawrence, D.; Muir, J.F.; Pretty, J.; Robinson, S.; Thomas, S.M.; Toulmin, C. Food security: The challenge of feeding 9 billion people. Science 2010, 327, 812–818.
  5. Bioucas-Dias, J.M.; Plaza, A.; Camps-Valls, G.; Scheunders, P.; Nasrabadi, N.; Chanussot, J. Hyperspectral remote sensing data analysis and future challenges. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–36.
  6. Bo, L.; Xiaoyang, X.; Xingxing, W.; Wenting, T. Ship detection and classification from optical remote sensing images: A survey. Chin. J. Aeronaut. 2021, 34, 145–163.
  7. Hu, Q.; Yin, H.; Friedl, M.A.; You, L.; Li, Z.; Tang, H.; Wu, W. Integrating coarse-resolution images and agricultural statistics to generate sub-pixel crop type maps and reconciled area estimates. Remote Sens. Environ. 2021, 258, 112365.
  8. Lorenz, S.; Ghamisi, P.; Kirsch, M.; Jackisch, R.; Rasti, B.; Gloaguen, R. Feature extraction for hyperspectral mineral domain mapping: A test of conventional and innovative methods. Remote Sens. Environ. 2021, 252, 112129.
  9. Wei, P.; Huang, R.; Lin, T.; Huang, J. Rice mapping in training sample shortage regions using a deep semantic segmentation model trained on pseudo-labels. Remote Sens. 2022, 14, 328.
  10. Zhu, S.; Li, S.; Yang, Z. Research on the Distribution Map of Weeds in Rice Field Based on SegNet. In 3D Imaging—Multidimensional Signal Processing and Deep Learning: Multidimensional Signals, Images, Video Processing and Applications; Springer: Berlin/Heidelberg, Germany, 2022; Volume 2, pp. 91–99.
  11. Wang, M.; Wang, J.; Cui, Y.; Liu, J.; Chen, L. Agricultural field boundary delineation with satellite image segmentation for high-resolution crop mapping: A case study of rice paddy. Agronomy 2022, 12, 2342.
  12. Fan, X.; Yan, C.; Fan, J.; Wang, N. Improved U-net remote sensing classification algorithm fusing attention and multiscale features. Remote Sens. 2022, 14, 3591.
  13. Crisóstomo de Castro Filho, H.; Abílio de Carvalho Júnior, O.; Ferreira de Carvalho, O.L.; Pozzobon de Bem, P.; dos Santos de Moura, R.; Olino de Albuquerque, A.; Rosa Silva, C.; Guimaraes Ferreira, P.H.; Fontes Guimarães, R.; Trancoso Gomes, R.A. Rice crop detection using LSTM, Bi-LSTM, and machine learning models from Sentinel-1 time series. Remote Sens. 2020, 12, 2655.
  14. Wang, X.; Zhang, J.; Xun, L.; Wang, J.; Wu, Z.; Henchiri, M.; Zhang, S.; Zhang, S.; Bai, Y.; Yang, S. Evaluating the effectiveness of machine learning and deep learning models combined time-series satellite data for multiple crop types classification over a large-scale region. Remote Sens. 2022, 14, 2341.
  15. Radford, A. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://www.mikecaptain.com/resources/pdf/GPT-1.pdf (accessed on 20 April 2024).
  16. Garnot, V.S.F.; Landrieu, L. Panoptic segmentation of satellite image time series with convolutional temporal attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 4872–4881.
  17. Zhang, Q.; Yang, Y.-B. ResT: An efficient transformer for visual recognition. Adv. Neural Inf. Process. Syst. 2021, 34, 15475–15485.
  18. Srinivas, A.; Lin, T.-Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 16519–16529.
  19. Dai, Z.; Liu, H.; Le, Q.V.; Tan, M. CoAtNet: Marrying convolution and attention for all data sizes. Adv. Neural Inf. Process. Syst. 2021, 34, 3965–3977.
  20. Niu, B.; Feng, Q.; Chen, B.; Ou, C.; Liu, Y.; Yang, J. HSI-TransUNet: A transformer based semantic segmentation model for crop mapping from UAV hyperspectral imagery. Comput. Electron. Agric. 2022, 201, 107297.
  21. He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin transformer embedding UNet for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15.
  22. Li, H.; Chen, S.-B.; Huang, L.-L.; Ding, C.; Tang, J.; Luo, B. DEGANet: Road Extraction Using Dual-branch Encoder with Gated Attention Mechanism. IEEE Geosci. Remote Sens. Lett. 2024, 21, 8003705.
  23. Wei, H.; Xu, X.; Ou, N.; Zhang, X.; Dai, Y. DEANet: Dual encoder with attention network for semantic segmentation of remote sensing imagery. Remote Sens. 2021, 13, 3900.
  24. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.-H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 5728–5739.
  25. Wu, H.; Huang, P.; Zhang, M.; Tang, W.; Yu, X. CMTFNet: CNN and multiscale transformer fusion network for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 2004612.
  26. Yang, L.; Huang, R.; Huang, J.; Lin, T.; Wang, L.; Mijiti, R.; Wei, P.; Tang, C.; Shao, J.; Li, Q. Semantic segmentation based on temporal features: Learning of temporal–spatial information from time-series SAR images for paddy rice mapping. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–16.
  27. Fu, T.; Tian, S.; Ge, J. R-Unet: A Deep Learning Model for Rice Extraction in Rio Grande do Sul, Brazil. Remote Sens. 2023, 15, 4021.
  28. Shipp, M. Rice Crop Timeline for the Southern States of Arkansas, Louisiana, and Mississippi; NSF Center for Integrated Pest Management: Raleigh, NC, USA, 2005.
  29. Wilson, C., Jr.; Branson, J. Trends in Arkansas rice production. BR Wells Rice Res. Stud. 2004, 550, 13–22.
  30. Hill, J.; Williams, J.; Mutters, R.; Greer, C. The California rice cropping system: Agronomic and natural resource issues for long-term sustainability. Paddy Water Environ. 2006, 4, 13–19.
  31. Torres, R.; Snoeij, P.; Geudtner, D.; Bibby, D.; Davidson, M.; Attema, E.; Potin, P.; Rommen, B.; Floury, N.; Brown, M. GMES Sentinel-1 mission. Remote Sens. Environ. 2012, 120, 9–24.
  32. Yang, C.; Zhang, D.; Zhao, C.; Han, B.; Sun, R.; Du, J.; Chen, L. Ground deformation revealed by Sentinel-1 MSBAS-InSAR time-series over Karamay Oilfield, China. Remote Sens. 2019, 11, 2027.
  33. Onojeghuo, A.O.; Miao, Y.; Blackburn, G.A. Deep ResU-Net Convolutional Neural Networks Segmentation for Smallholder Paddy Rice Mapping Using Sentinel 1 SAR and Sentinel 2 Optical Imagery. Remote Sens. 2023, 15, 1517.
  34. Gumma, M.K.; Nelson, A.; Thenkabail, P.S.; Singh, A.N. Mapping rice areas of South Asia using MODIS multitemporal data. J. Appl. Remote Sens. 2011, 5, 53547–53573.
  35. Huete, A.; Didan, K.; Miura, T.; Rodriguez, E.P.; Gao, X.; Ferreira, L.G. Overview of the radiometric and biophysical performance of the MODIS vegetation indices. Remote Sens. Environ. 2002, 83, 195–213.
  36. Liu, W.; Dong, J.; Xiang, K.; Wang, S.; Han, W.; Yuan, W. A sub-pixel method for estimating planting fraction of paddy rice in Northeast China. Remote Sens. Environ. 2018, 205, 305–314.
  37. Boryan, C.; Yang, Z.; Mueller, R.; Craig, M. Monitoring US agriculture: The US department of agriculture, national agricultural statistics service, cropland data layer program. Geocarto Int. 2011, 26, 341–358.
  38. Shao, Y.; Lunetta, R.S.; Wheeler, B.; Iiames, J.S.; Campbell, J.B. An evaluation of time-series smoothing algorithms for land-cover classifications using MODIS-NDVI multi-temporal data. Remote Sens. Environ. 2016, 174, 258–265.
  39. Zhong, L.; Hu, L.; Zhou, H.; Tao, X. Deep learning based winter wheat mapping using statistical data as ground references in Kansas and northern Texas, US. Remote Sens. Environ. 2019, 233, 111411.
  40. Sun, Z.; Di, L.; Fang, H. Using long short-term memory recurrent neural network in land cover classification on Landsat and Cropland data layer time series. Int. J. Remote Sens. 2019, 40, 593–614.
  41. Wang, S.; Chen, W.; Xie, S.M.; Azzari, G.; Lobell, D.B. Weakly supervised deep learning for segmentation of remote sensing imagery. Remote Sens. 2020, 12, 207.
  42. Zhang, X.; Liu, C.; Yang, D.; Song, T.; Ye, Y.; Li, K.; Song, Y. RFAConv: Innovating spatial attention and standard convolutional operation. arXiv 2023, arXiv:2304.03198.
  43. de Bem, P.P.; de Carvalho Júnior, O.A.; de Carvalho, O.L.F.; Gomes, R.A.T.; Guimarāes, R.F.; Pimentel, C.M.M. Irrigated rice crop identification in Southern Brazil using convolutional neural networks and Sentinel-1 time series. Remote Sens. Appl. Soc. Environ. 2021, 24, 100627.
  44. Xia, L.; Zhao, F.; Chen, J.; Yu, L.; Lu, M.; Yu, Q.; Liang, S.; Fan, L.; Sun, X.; Wu, S. A full resolution deep learning network for paddy rice mapping using Landsat data. ISPRS J. Photogramm. Remote Sens. 2022, 194, 91–107.
  45. Yang, X.; Li, S.; Chen, Z.; Chanussot, J.; Jia, X.; Zhang, B.; Li, B.; Chen, P. An attention-fused network for semantic segmentation of very-high-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2021, 177, 238–262.
Figure 1. Location of the study area. Arkansas (AR), southern Missouri (MO), western Tennessee (TN), and the Sacramento Valley in northern California (SV). The red frames indicate the selected boundaries of the study areas.
Figure 2. Nine 24-day averaged composite images of Sentinel-1 VH and VV polarization data over Study Area A.
Figure 3. 2019 Cropland Data Layer (CDL) products for Study Area A and Study Area B, shown in the same colors as used on the CDL website. https://nassgeodata.gmu.edu/CropScape/ (accessed on 20 April 2025).
Figure 4. Structure of RDTFNet.
Figure 5. Structure of MDTA and GDFN.
Figure 6. Structure of M2SA and E-FFN.
Figure 7. Structure of FFM. The color ramp represents the relative importance of each feature channel after attention weighting.
Figure 8. Time-series curves of the rice NDVI and backscattering coefficients at VV and VH for different regions and years: (a) 2019 NDVI time-series curves for rice in Study Area A and Study Area B; (b) 2021 NDVI time-series curves for rice in Study Area A and Study Area B; (c) 2019 time-series curve of backscatter coefficients at VV polarization for rice in Study Area A and Study Area B; (d) 2021 time-series curve of backscatter coefficients at VV polarization for rice in Study Area A and Study Area B; (e) 2019 time-series curve of backscatter coefficients at VH polarization for rice in Study Area A and Study Area B; (f) 2021 time-series curve of backscatter coefficients at VH polarization for rice in Study Area A and Study Area B.
Figure 9. (a) Training loss values of RDTFNet and the six conventional models. (b) Validation loss values of RDTFNet and the six conventional models.
Figure 10. OA, precision, recall, and F1-score of RDTFNet and the six conventional models for the validation samples. (a) Overall accuracy, (b) precision, (c) recall, and (d) F1-score.
Figure 11. Comparison of rice extraction results from different models in the 2019 Test Area A: (a) optical index image (RGB corresponds to the NDVI, EVI, and LSWI); (b) ground truth map; and (c–i) rice prediction results from DeepLabV3, SegNet, U-Net, CMTFNet, TFBS, R-Unet and RDTFNet (ours).
Figure 12. Comparison of rice extraction results from different models on the 2019 cross-regional test samples: (a) optical index image (RGB corresponds to the NDVI, EVI, and LSWI); (b) ground truth map; and (c–i) rice prediction results from DeepLabV3, SegNet, U-Net, CMTFNet, TFBS, R-Unet and RDTFNet (ours).
Figure 13. Comparison of the rice extraction results among the models on the 2021 test samples: (a) optical index image (RGB corresponds to the NDVI, EVI, and LSWI); (b) ground truth map; and (c–i) rice prediction results from DeepLabV3, SegNet, U-Net, CMTFNet, TFBS, R-Unet and RDTFNet (ours).
Figure 14. Comparison of rice extraction results from different ablation models on the 2019 test area A: (a) VH backscatter coefficient image (June 2021 VH), (b) optical indices (with RGB representing the NDVI, EVI, and LSWI), (c) ground truth image, and (d–h) rice prediction results for DE-UNet, DE-UNet (spatial), DE-UNet (temporal), RDTFNet (WithoutFFM), and RDTFNet (ours).
Figure 15. Comparison of rice extraction results using unimodal (optical), unimodal (SAR), and multimodal (optical and SAR) data: (a) VH backscatter coefficient image (June 2021 VH), (b) optical indices (with RGB representing the NDVI, EVI, and LSWI), (c,d) ground truth image, and (e–g) rice prediction results of RDTFNet using unimodal (optical), unimodal (SAR), and multimodal (optical and SAR) inputs, respectively.
Table 1. Start and end dates of nine 24-day average composite Sentinel-1 SAR images.
Index | Start Date | End Date
1 | April 1 | April 24
2 | April 25 | May 18
3 | May 19 | June 11
4 | June 12 | July 5
5 | July 6 | July 29
6 | July 30 | August 22
7 | August 23 | September 15
8 | September 16 | October 9
9 | October 10 | November 3
Table 2. Datasets of rice mapping model.
Dataset | Input Bands | Channels | Description
SAR | VV + VH | 18 | VV: vertical–vertical polarization; VH: vertical–horizontal polarization
Optical | Indices (NDVI + EVI + LSWI) | 3 | NDVI: normalized difference vegetation index; EVI: enhanced vegetation index; LSWI: land surface water index
SAR and Optical | VV + VH and Indices | 18 and 3 | —
Table 3. Statistical information in Study Areas A and B.
Study Area | Class | Number of Samples | Area/km²
Study Area A | Corn | 4,146,263 | 3734.5 (13.40%)
Study Area A | Cotton | 5,083,928 | 4579.1 (16.43%)
Study Area A | Rice | 5,376,608 | 4842.7 (17.37%)
Study Area A | Sorghum | 32,314 | 29.1 (0.10%)
Study Area A | Soybeans | 16,070,391 | 14,474.6 (51.93%)
Study Area A | Peanuts | 166,313 | 149.8 (0.54%)
Study Area A | Winter Wheat | 39,842 | 35.9 (0.13%)
Study Area A | Alfalfa | 4200 | 3.8 (0.01%)
Study Area A | Sweet Potatoes | 25,682 | 23.1 (0.08%)
Study Area B | Corn | 88,449 | 79.7 (2.65%)
Study Area B | Cotton | 15,640 | 14.1 (0.47%)
Study Area B | Rice | 2,324,439 | 2093.6 (69.54%)
Study Area B | Sorghum | 5897 | 5.3 (0.18%)
Study Area B | Sunflowers | 229,388 | 206.6 (6.86%)
Study Area B | Barley | 17,126 | 15.4 (0.51%)
Study Area B | Winter Wheat | 276,386 | 248.9 (8.27%)
Study Area B | Alfalfa | 344,142 | 310 (10.30%)
Study Area B | Dry Beans | 41,128 | 37 (1.23%)
Table 4. Test results of RDTFNet and the six classical models on the 2019 Test Area A dataset.
Model | OA | IoU | Precision | Recall | F1-Score | Specificity
DeepLabV3 | 95.34% | 82.74% | 89.98% | 91.15% | 91.15% | 96.70%
SegNet | 95.67% | 83.54% | 92.26% | 89.84% | 91.03% | 97.88%
U-Net | 96.32% | 85.80% | 94.04% | 90.73% | 92.35% | 97.87%
CMTFNet | 96.32% | 85.84% | 93.72% | 91.08% | 92.38% | 98.02%
TFBS | 96.46% | 86.24% | 94.81% | 90.51% | 92.61% | 98.39%
R-Unet | 96.90% | 87.86% | 94.46% | 91.70% | 93.05% | 98.58%
RDTFNet | 96.95% | 88.12% | 95.14% | 92.27% | 93.68% | 98.71%
Table 5. Test results of RDTFNet and the six classical models on the 2019 Study Area B dataset (values in parentheses are differences relative to Table 4).
Model | OA | IoU | Precision | Recall | F1-Score | Specificity
DeepLabV3 | 96.60% (+1.25%) | 74.61% (−8.14%) | 82.49% (−7.49%) | 88.65% (−2.50%) | 85.44% (−5.71%) | 97.61% (+0.90%)
SegNet | 96.95% (+1.29%) | 76.29% (−7.26%) | 86.32% (−5.95%) | 86.90% (−2.94%) | 86.41% (−4.62%) | 98.23% (+0.35%)
U-Net | 97.37% (+1.05%) | 79.16% (−6.64%) | 88.16% (−5.88%) | 88.60% (−2.13%) | 88.33% (−4.03%) | 98.48% (+0.62%)
CMTFNet | 97.50% (+1.18%) | 79.82% (−6.02%) | 89.88% (−3.83%) | 87.70% (−3.39%) | 88.74% (−3.64%) | 98.75% (+0.73%)
TFBS | 98.20% (+1.72%) | 83.20% (−3.04%) | 91.80% (−3.01%) | 89.24% (−1.27%) | 90.49% (−2.12%) | 98.95% (+0.56%)
R-Unet | 97.49% (+0.60%) | 80.13% (−7.72%) | 88.46% (−6.00%) | 89.50% (−2.19%) | 88.93% (−4.12%) | 99.09% (+0.50%)
RDTFNet | 98.33% (+1.38%) | 86.15% (−1.96%) | 93.25% (−1.89%) | 91.91% (−0.36%) | 92.55% (−1.13%) | 99.15% (+0.68%)
Table 6. Test results of RDTFNet and the six conventional models on the 2021 dataset for Study Area B (values in parentheses are differences relative to Table 5).
Model | OA | IoU | Precision | Recall | F1-Score
DeepLabV3 | 96.12% (−0.48%) | 64.53% (−10.08%) | 78.91% (−3.58%) | 78.05% (−10.60%) | 78.42% (−7.02%)
SegNet | 96.80% (−0.16%) | 70.72% (−5.57%) | 83.64% (−2.68%) | 82.18% (−4.72%) | 82.58% (−3.84%)
U-Net | 97.15% (−0.22%) | 74.52% (−4.63%) | 87.10% (−1.07%) | 82.54% (−6.06%) | 85.22% (−3.11%)
CMTFNet | 97.23% (−0.28%) | 72.13% (−7.68%) | 88.84% (−1.04%) | 79.32% (−8.38%) | 83.73% (−5.01%)
TFBS | 97.66% (−0.53%) | 77.75% (−5.45%) | 84.84% (−6.95%) | 87.30% (−1.94%) | 86.05% (−4.44%)
R-Unet | 97.38% (−0.11%) | 75.63% (−4.50%) | 87.62% (−0.84%) | 82.90% (−4.72%) | 85.10% (−3.84%)
RDTFNet | 98.32% (−0.01%) | 82.71% (−3.44%) | 92.44% (−0.81%) | 88.75% (−3.16%) | 90.53% (−2.03%)
Table 7. Test results of RDTFNet and the four ablation models on the 2019 Test Area A dataset.
Model | OA | IoU | Precision | Recall | F1-Score
Base (U-Net) | 96.32% | 85.80% | 94.04% | 90.73% | 92.35%
Base + dual branch (DE-UNet) | 96.50% | 86.38% | 94.76% | 90.73% | 92.69%
Base + dual branch + spatial encoder (Spatial-DE-UNet) | 96.65% | 86.99% | 94.60% | 91.55% | 93.04%
Base + dual branch + temporal encoder (Temporal-DE-UNet) | 96.69% | 87.20% | 94.20% | 92.15% | 93.16%
Base + dual branch + spatial–temporal encoders (WithoutFFM) | 96.73% | 87.33% | 94.38% | 92.12% | 93.23%
Base + dual branch + spatial–temporal encoders + FFM (RDTFNet) | 96.95% | 88.12% | 95.14% | 92.27% | 93.68%
Table 8. Test results of the single/dual-branch models on the 2019 Study Area A dataset.
Model | OA | IoU | Precision | Recall | F1-Score
U-Net | 96.32% | 85.80% | 94.04% | 90.73% | 92.35%
DE-UNet | 96.50% | 86.38% | 94.76% | 90.73% | 92.69%
MSTransformer-UNet | 96.36% | 85.96% | 93.96% | 90.99% | 92.45%
Restormer-UNet | 96.45% | 86.24% | 94.47% | 90.83% | 92.61%
RDTFNet | 96.95% | 88.12% | 95.14% | 92.27% | 93.68%
Table 9. Test results of the unimodal/multimodal data on the 2019 Study Area B dataset.
Model | Modality | OA | IoU | Precision | Recall | F1-Score
U-Net | Unimodal (OPT) | 95.28% | 67.69% | 75.21% | 87.30% | 80.70%
U-Net | Unimodal (SAR) | 95.90% | 67.34% | 86.60% | 75.25% | 80.31%
U-Net | Multimodal | 97.37% | 79.16% | 88.16% | 88.60% | 88.33%
RDTFNet | Unimodal (OPT) | 95.51% | 75.07% | 80.05% | 92.42% | 85.75%
RDTFNet | Unimodal (SAR) | 96.57% | 75.82% | 88.20% | 84.39% | 86.23%
RDTFNet | Multimodal | 98.33% | 86.15% | 93.25% | 91.91% | 92.55%
