TABS-Net: A Temporal Spectral Attentive Block with Space–Time Fusion Network for Robust Cross-Year Crop Mapping

Zhou, Xin; Huang, Yuancheng; Shen, Qian; Yao, Yue; Wen, Qingke; Xi, Fengjiang; Ma, Chendong

doi:10.3390/rs18020365

Open AccessArticle

TABS-Net: A Temporal Spectral Attentive Block with Space–Time Fusion Network for Robust Cross-Year Crop Mapping

by

Xin Zhou

^1,2,

Yuancheng Huang

^1,†,

Qian Shen

^2,3,*,†,

Yue Yao

^2,3

,

Qingke Wen

^2,3,

Fengjiang Xi

⁴ and

Chendong Ma

²

¹

College of Geomatics, Xi’an University of Science and Technology, Xi’an 710054, China

²

Key Laboratory of Digital Earth Science, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China

³

International Research Center of Big Data for Sustainable Development Goals, Beijing 100094, China

⁴

Inner Mongolia Remote Sensing Center Co., Ltd., Hohhot 010000, China

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Remote Sens. 2026, 18(2), 365; https://doi.org/10.3390/rs18020365

Submission received: 23 November 2025 / Revised: 15 January 2026 / Accepted: 19 January 2026 / Published: 21 January 2026

(This article belongs to the Special Issue Advances in Multispectral Image Processing for Land Use and Land Cover Mapping)

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

A 3D CNN framework (TABS-Net) integrating CBAM3D and DOY positional encoding was developed for accurate crop mapping.
The model achieves superior accuracy and Inter-Annual Robustness (IAR) values close to 1 by simulating phenological shifts via Temporal Jitter.

What is the implication of the main findings?

Explicitly aligning seasonal timing and simulating shifts effectively overcomes the “date–spectrum–class” misalignment caused by climate variability.
TABS-Net provides a scalable and transferable solution for consistent large-scale agricultural monitoring across different years.

Abstract

Accurate and stable mapping of crop types is fundamental to agricultural monitoring and food security. However, inter-annual phenological shifts driven by variations in air temperature, precipitation, and sowing dates introduce systematic changes in the spectral distributions associated with the same day of year (DOY). As a result, the “date–spectrum–class” mapping learned during training can become misaligned when applied to a new year, leading to increased misclassification and unstable performance. To tackle this problem, we develop TABS-Net (Temporal–Spectral Attentive Block with Space–Time Fusion Network). The core contributions of this study are summarized as follows: (1) we propose an end-to-end 3D CNN framework to jointly model spatial, temporal, and spectral information; (2) we design and embed CBAM3D modules into the backbone to emphasize informative bands and key time windows; and (3) we introduce DOY positional encoding and temporal jitter during training to explicitly align seasonal timing and simulate phenological shifts, thereby enhancing cross-year robustness. We conduct a comprehensive evaluation on a Cropland Data Layer (CDL) subset. Within a single year, TABS-Net delivers higher and more balanced overall accuracy, Macro-F1, and mIoU than strong baselines, including 2D stacking, 1D temporal convolution/LSTM, and transformer models. In cross-year experiments, we quantify temporal stability using inter-annual robustness (IAR); with both DOY encoding and temporal jitter enabled, the model attains IAR values close to one for major crop classes, effectively compensating for phenological misalignment and inter-annual variability. Ablation studies show that DOY encoding and temporal jitter are the primary contributors to improved inter-annual consistency, while CBAM3D reduces crop–crop and crop–background confusion by focusing on discriminative spectral regions such as the red-edge and near-infrared bands and on key growth stages. Overall, TABS-Net combines higher accuracy with stronger robustness across multiple years, offering a scalable and transferable solution for large-area, multi-year remote sensing crop mapping.

Keywords:

crop classification; sentinel-2 time series; 3D CNN; inter annual robustness; phenological drift; spatio temporal attention

1. Introduction

Corn and soybean rank among the world’s most important staple and cash crops, playing an irreplaceable role in safeguarding global food security, sustaining feed supply, and supporting biofuel production [1]. Being able to identify these two crops accurately and in a timely manner, and to monitor their cultivated area, is of strategic importance for governments [2]: it underpins evidence-based agricultural policy making, grain yield estimation, risk assessment under climate change, and the optimized allocation of agricultural resources. However, traditional ground surveys are time-consuming, labor-intensive, and difficult to scale to large regions, making them poorly suited to the modern demand for frequently updated crop information. In this context, remote sensing technology with its wide spatial coverage, high acquisition frequency, and relatively low cost has become an indispensable tool for the precise extraction and continuous monitoring of major staple crops [3].

Over several decades of development, remote sensing-based crop extraction has seen a steady evolution of methods and continuous gains in accuracy [4]. Early work relied mainly on empirical, spectrum-based approaches such as thresholding and vegetation index analysis. Although these techniques enabled basic crop identification, their generalization was often confined to specific regions and time periods. As remote sensing archives expanded and computing power increased, supervised classification methods built on hand-crafted features and expert knowledge (such as the maximum likelihood method and support vector machines) [5,6] emerged as the mainstream and further improved classification performance. Nonetheless, these traditional approaches typically depend on complex feature engineering and are ill-suited for capturing the rich spatio-temporal dynamics present in remote sensing time series data. In recent years, the rapid advance of artificial intelligence, particularly deep learning, has brought transformative progress to remote sensing image interpretation. Deep learning models, with their strong representation learning capabilities, can automatically derive high-level, abstract spatio-temporal features from massive remote sensing datasets [7], and have demonstrated unprecedented potential in complex crop type discrimination and fine-resolution mapping. As a result, they have become the dominant research direction and core methodology for remote sensing-based crop extraction today [8].

Early applications of deep learning to crop extraction largely focused on single date imagery, where two-dimensional convolutional neural networks (2D CNNs) were used to exploit their strong spatial feature extraction capabilities for classification [9,10,11]. Yet, remote sensing imagery is inherently temporal, and accurate crop discrimination critically depends on distinctive phenological characteristics (phenology) expressed over the full growing season [12]. As a result, research quickly shifted toward models that can process multi-temporal sequences. Representative approaches include using recurrent neural networks (RNNs) and long short-term memory networks (LSTM) to explicitly model pixel wise time series, and adopting transformer architectures to capture long-range temporal dependencies [13,14]. In parallel, one-dimensional temporal convolutional networks (TempCNNs) have gained wide adoption thanks to their efficiency in extracting temporal features [13,15]. While each of these methods has its strengths, three-dimensional convolutional neural networks (3D CNNs) have emerged as one of the mainstream choices because they can directly operate on spatio-temporal data cubes [16,17,18]. Three-dimensional CNN is particularly adept at jointly capturing coupled spatio-temporal patterns, and with its mature architecture, efficient training, and strong compatibility with grid-structured remote sensing imagery, it is often selected as a powerful baseline model [19]. Nevertheless, standard 3D CNN baselines also exhibit clear limitations: they tend to treat all time steps, spectral bands, and spatial locations uniformly, without explicitly highlighting the most informative components. In phenology-driven crop classification, however, not all growth stages or spectral cues contribute equally, which naturally motivates the incorporation of attention mechanisms to refine feature representations and emphasize the most informative temporal and spectral signals [20].

To strengthen how deep learning models perceive salient patterns in remote sensing data, researchers have widely adopted the attention mechanism, whose central idea is to let the network dynamically concentrate on the most informative parts of the input. Typical forms include channel attention and spatial attention. Channel attention learns the relative importance of each channel and highlights spectral bands that contribute more strongly to the task, while spatial attention produces a spatial weighting map that guides the model to focus on key geographic regions in the image [21]. The Convolutional Block Attention Module (CBAM) [22] is a particularly efficient and widely used attention mechanism that applies channel attention and spatial attention in sequence, and has been shown to substantially boost performance across a range of visual tasks [23]. In crop extraction, the two-dimensional variant CBAM2D has already been successfully used for both single date imagery and cases where multi-temporal images are stacked into pseudo multi-channel inputs. By enhancing the model’s capacity to capture spatial and spectral features, CBAM2D effectively improves the accuracy of crop classification [24].

However, when dealing with remote sensing data that contain rich temporal information, restricting attention to only the 2D spatial and channel dimensions is clearly insufficient. Crop phenology evolves as a dynamic time series process: a model must not only understand “where” and “what spectrum” is informative, but, more critically, “when” specific phenological cues are most effective for separating crop types. For instance, during key growth stages, a crop’s spectral response may differ markedly from that of neighboring crops, while at other stages, their signatures can be almost indistinguishable. Conventional 2D CBAM cannot directly apply attention along the temporal axis, which means the model may treat all time steps equally and fail to focus on those phenological periods that provide the strongest discriminative power. Therefore, to enable the model to more intelligently capture the core discriminative patterns emerging from the spatio-temporal evolution of crops and to explicitly attend to critical phenological phases, we urgently need an attention mechanism that operates jointly over channel, spatial, and temporal dimensions.

Despite the substantial progress of deep learning models in remote sensing-based crop extraction, limited cross-year generalization remains a widespread and pressing challenge [25]. Existing methods face fundamental limitations when confronting inter-annual phenological drift. Standard 1D-CNNs and RNNs (e.g., LSTM) typically rely on learning fixed temporal patterns or rigid sequential dependencies from the source domain. They implicitly assume that the spectral features of a specific crop appear at the same calendar time across years. Consequently, when climate-driven drift causes a phenological lag (e.g., a 20-day delay in peak growth), the learned temporal receptive fields fail to capture the shifted key features, leading to misclassification. Similarly, emerging transformer-based models often utilize absolute positional encodings to mark time steps. This design enforces strict correspondence between spectral values and absolute calendar dates, rendering the model sensitive to temporal shifts and limiting its shift-invariance. In essence, these mainstream approaches lack an explicit mechanism to dynamically align or calibrate “date–spectrum” mismatches, making them brittle in cross-year scenarios where the assumption of independent and identically distributed (i.i.d.) data is violated by climate variability.

To mitigate this issue, Unsupervised Domain Adaptation (UDA) has been widely studied. The central idea is to exploit unlabeled target domain data to reduce the discrepancy in feature distributions between the source and target domains. Representative UDA approaches, such as the adversarial training-based Domain Adversarial Neural Network (DANN), introduce a domain discriminator to encourage domain-invariant representations [25]. In practice, however, UDA often involves a complex and unstable training pipeline, requiring delicate hyperparameter tuning and handling convergence issues inherent in adversarial training. Even then, the performance gains may fall short of expectations [26,27,28].

To overcome these limitations without the complexity of adversarial training, we propose a Temporal–Spectral Attentive Block with Space–Time Fusion Network (TABS-Net). Unlike rigid CNN/RNN templates or static transformers, TABS-Net explicitly addresses phenological drift through a dual-alignment strategy. We introduce DOY-based positional encoding to anchor features to relative phenological stages rather than absolute indices, and employ a temporal jitter strategy to simulate potential shifts, forcing the 3D Convolutional Block Attention Module (CBAM3D) to learn shift-invariant representations. This design allows the model to “follow” the crop growth curve even when it drifts, providing a robust and efficient alternative for cross-year mapping.

More concretely, this study is organized around the following core questions:

Relative to mainstream temporal baselines (e.g., 3D CNN and LSTM), to what extent can our phenology-oriented TABS-Net improve the accuracy of fine-grained crop classification?
How do the three components—CBAM3D, day-of-year (DOY) encoding, and temporal jitter—individually contribute to classification accuracy and cross-year generalization, and which component plays the leading role in boosting generalization?

To answer these questions, we make the following contributions:

We propose TABS-Net, a phenology-oriented spatio-temporal network that combines 3D feature learning with attention-driven temporal–spectral refinement, aiming to improve robustness under inter-annual phenological drift.

We introduce a lightweight generalization strategy by integrating CBAM3D, DOY encoding, and temporal jitter, and we systematically quantify their individual and combined effects through controlled ablation studies.

We conduct extensive cross-year experiments and comparisons against representative temporal baselines, demonstrating that our approach achieves a favorable balance among accuracy, generalization, and training simplicity.

The remainder of this paper is organized as follows. Section 2 describes the study area, datasets, and preprocessing. Section 3 presents the proposed method and implementation details. Section 4 reports experimental results and analyses. Section 5 discusses limitations and future work, and Section 6 concludes the paper.

2. Materials

2.1. Study Site

Our study focuses on Iowa in the U.S. Midwest (Iowa) (Figure 1), a state situated along the eastern margin of the Great Plains, spanning roughly 40.4–43.5°N and 89.5–96.6°W. The landscape is dominated by gently rolling loess–till plains at elevations of about 150–500 m [29], underlain by highly fertile black soils that support a classic intensive farming landscape with large, contiguous, and regularly shaped fields [30]. The region has a temperate, continental, humid climate with four distinct seasons, characterized by warm, wet summers and cold, dry winters. Annual precipitation is on the order of 700–1000 mm, most of which falls during the May–September growing season [31], and the frost free period typically lasts 150–180 days, which is well-aligned with the phenological calendar of the major local crops.

The cropping system is dominated by a stable rotation of corn and soybean [32,33]: Corn is usually sown from late April to May, enters rapid vegetative growth in May–June, reaches its peak tasseling, pollination, and grain filling stages from late July to August, and is harvested in September–October. Soybean is generally sown from early to mid-May, and grows vegetatively through June–July; it flowers and fills pods from mid/late July into August, and likewise matures and is harvested in September–October. To clearly illustrate these temporal dynamics, we summarize the typical crop calendars for both corn and soybean in Figure 2, using four representative growth stages and corresponding day-of-year (DOY) ranges. This schematic aids in identifying phenologically critical periods for subsequent Sentinel-2 temporal sampling and feature selection. Together, these geographical and climatic settings and the associated crop phenology make multi-temporal Sentinel-2 observations from May to October particularly suitable for capturing the full growing cycle from green up through peak biomass to maturity. Vegetation-sensitive bands, especially in the red-edge and near-infrared regions, respond strongly to changes in canopy structure, nitrogen status, and water content, providing an ideal testbed and data foundation for subsequent crop classification and cross-year consistency analyses based on full season spectral–temporal signatures.

2.2. Sentinel-2 Data and Preprocessing

We primarily use surface reflectance products from the European Space Agency’s Sentinel-2A constellation multi-spectral imager (MSI) (ESA, Paris, France) as our data source [34]. Priority is given to L2A-level images that have undergone atmospheric correction, after which all scenes are brought into a common UTM projection and subjected to cross-year geometric co-registration and radiometric consistency constraints. To meet the strong dependence of crop classification on vegetation-sensitive bands and temporal continuity, we build a 10-band data stack (B02–B08, B8A, B11, B12) spanning the visible to near-infrared, red-edge, and shortwave-infrared regions. Detailed band information is summarized in Table 1. To construct a spatially consistent data cube, we resampled all utilized bands to a unified 10 m resolution using bicubic interpolation. We acknowledge that upsampling the native 20 m bands (Red-Edge and SWIR) introduces data redundancy without retrieving true sub-pixel details. However, this strategy is preferred over downsampling the native 10 m bands (Visible and NIR), as preserving the high-frequency edge information inherent in the 10 m channels is critical for accurate field boundary delineation in fragmented agricultural landscapes. The proposed 3D-CNN architecture is designed to handle this channel-wise correlation, effectively integrating the fine spatial textures of the 10 m bands with the rich spectral diagnostics of the upsampled bands.

Our experiments cover four consecutive years from 2021 to 2024. For each year, we first select scenes within the May–October window based on a low cloud criterion, and then, on that basis, we sample them as evenly as possible in time to construct a 12 epoch time series; the day of year (DOY) of the selected scenes are shown in Figure 3. All images undergo cloud and cloud shadow masking using the L2A scene classification layer combined with shadow geometry constraints. To address temporal and spatial gaps remaining after cloud removal and to maintain a consistent, smooth time series, we apply cubic spline interpolation, constructing piecewise cubic polynomials between adjacent observations and enforcing continuity of the function value and its first and second derivatives. In this way, small gaps are gently filled without introducing high-frequency artifacts, yielding stable and continuous multi-temporal input sequences.

2.3. CDL Sample Construction and Dataset Preparation

We use the Cropland Data Layer (CDL) released by the U.S. Department of Agriculture as the authoritative reference for surface crop-type labels [35]. To align these labels with the Sentinel-2 time series imagery and enable cross-year generalization experiments, we extract several representative agricultural counties within the Corn Belt from CDL to form study subsets for 2021–2024. These data are reprojected to the study area’s unified UTM coordinate system and a 10 m raster grid, and the integer labels are upsampled using nearest neighbor resampling to avoid class mixing. For the classification scheme, we adopt a “fine for major crops, broad for background” strategy, aggregating the original fine-grained CDL classes into three categories, namely Corn, Soybean, and Other, with “Other” encompassing both non-target crops and non-crop background.

2.4. Analysis of Phenological Drift Mechanism

To systematically characterize the inter-annual phenological variability of corn and soybean from 2021 to 2024, we employed a stratified random sampling approach based on the Cropland Data Layer (CDL). Considering the frequent crop rotation in the study area, sampling was performed independently for each year, with 100 pure pixels randomly selected within the identified corn and soybean areas annually to ensure thematic purity. We constructed NDVI time series from Sentinel-2 imagery and applied the Savitzky–Golay (S-G) [36] filtering algorithm to reconstruct authentic growth trajectories free from atmospheric contamination. Specifically, after removing outliers (NDVI < −1 or >1), we resampled the data to a daily scale via linear interpolation and applied S-G smoothing (window size = 31, polynomial order = 2). This parameter combination was empirically selected to effectively suppress high-frequency noise while preserving key peak characteristics. Finally, the peak DOY (the day of year corresponding to maximum NDVI) was extracted to quantify phenological timing. In Figure 4, the solid lines represent the mean of the smoothed samples, while the shaded areas denote the 95% confidence interval (CI), quantifying estimation uncertainty and reflecting the spatial variability of crop growth within the region.

From a mechanistic perspective, these phenological dynamics are fundamentally driven by climatic interactions rather than rigid calendar dates, exhibiting both systematic regularities and regional heterogeneities. Generally, crop development accelerates with accumulated heat (Growing Degree Days, GDD). As demonstrated by Yang et al. [37], rising temperatures typically drive a systematic advancement in phenology, leading to widespread earlier planting and harvesting dates across major US production regions. However, this regularity is often modulated by regional moisture conditions. For instance, Yang et al. [38] highlighted that in transitional climatic zones, deviations such as increased summer precipitation can counteract the warming effect, significantly delaying harvesting dates and extending the growing season. Consequently, the spectral signature associated with a specific DOY in one year may correspond to a completely different physiological stage in another due to this ‘date–spectrum’ misalignment caused by the interplay of temperature fluctuations and rainfall patterns.

As quantified in the temporal profiles (Figure 4a,c), significant inter-annual shifts are evident. Taking corn as a prominent example, the growth cycle in 2022 was notably delayed compared to 2021. The median peak DOY shifted from approximately 195 in 2021 to 220 in 2022 (see boxplots in Figure 4b). This substantial lag of roughly 25 days implies that a model trained on 2021 data would likely misclassify the vegetative corn of 2022 if relying solely on absolute dates. Similarly, soybean exhibits a marked drift, with the median peak DOY shifting from ~222 in 2021 to ~230 in 2023 (Figure 4d). Beyond these temporal shifts, the boxplots also reveal significant spatial heterogeneity within single years. For instance, the 2023 corn boxplot (Figure 4b) displays a wide inter-quartile range (IQR) spanning from DOY ~185 to ~215. This dispersion suggests that even within the same year, phenology varies drastically due to local microclimates, soil moisture differences, and diverse management practices (e.g., varied sowing dates). This heterogeneity confirms that a simple global time-shift is insufficient for alignment, justifying the need for the pixel-wise dynamic adjustment mechanism proposed in TABS-Net.

3. Methodology

3.1. The Structure of TABS-NET

3.1.1. Overall Architecture

The core of TABS-Net builds on a 3D U-Net topology (Figure 5) composed of a symmetric encoder (Encoder3D) and decoder (Decoder3D) pathway. The encoder takes a five-dimensional spatio-temporal tensor (B×C×T×H×W) as input. Here, B represents the batch size, C denotes the number of spectral channels (e.g., 10 bands for Sentinel-2), T indicates the length of the temporal sequence (e.g., 12 time phases), and H×W refers to the spatial height and width of the input patch (e.g., 512 × 512). The encoder extracts multi-scale features, and progressively downsamples them using MaxPool3d layers. The decoder then gradually upsamples the features with 3D-transposed convolutions to recover spatial resolution. Importantly, this upsampling is intentionally restricted to the spatial dimensions (H, W), while the temporal dimension is kept fixed throughout the decoding stage. To support accurate boundary delineation, the network leverages the classic U-Net skip connection mechanism [39]. A distinctive design choice is that all shallow feature maps forwarded from the encoder to the decoder are first average pooled along the temporal axis (T) prior to fusion. This temporal compression step supplies the decoder with spatial detail that is smoothed and robust over time. In the final stage, a 1×1×1 convolution projects the features to the desired number of classes (N_class) and collapses the temporal dimension, producing a 2D semantic segmentation output of size B×N_class×H×W. Note that the DOY positional encoding is applied directly to the input tensor as an additional channel prior to entering the Encoder3D, ensuring explicit temporal alignment.

3.1.2. Spatio-Temporal Feature Extraction

The encoder backbone of TABS-Net consists of a stack of 3D convolutional blocks. Each block is built from a basic unit that includes a 3D convolution layer with a 3×3×3 kernel, batch normalization (BatchNorm3d), and an ReLU activation function. With 3×3×3 kernels sliding jointly over the spatial and temporal axes, the network can effectively capture local temporal dynamics and spatial texture patterns of crops across successive observations.

To expand the contextual field of view without substantially increasing the parameter count, the deeper encoder blocks (e3, e4) adopt 3D-dilated convolutions with a dilation rate of two [40]. This design allows the model to aggregate information over a larger receptive field, strengthening its ability to represent long-range temporal dependencies across growth stages and to characterize large-scale spatial structures. Furthermore, DropBlock3D regularization is applied at the end of each block to mitigate overfitting in the spatio-temporal domain.

3.1.3. Three-Dimensional Convolutional Block Attention Module

To improve TABS-Net’s ability to discriminate and filter information in multi-temporal, multi-spectral remote sensing data, we insert a 3D Convolutional Block Attention Module (CBAM3D) (Figure 6) into each 3D convolution block of the 3D CNN backbone, directly after the ReLU activation. CBAM3D is adapted from the original CBAM proposed for 2D CNNs [22] by extending its pooling and convolution operations to 3D, so that attention can be learned over the spatio-temporal feature volume (T, H, W). Instead of treating all dimensions equally, CBAM3D adopts a serial ‘Channel-First, Spatio-Temporal-Second’ integration strategy to achieve progressive feature refinement. This design effectively mimics a cognitive process of determining ‘what to look for’ followed by ‘where and when to focus’.

Specifically, CBAM3D consists of two sequential submodules: spectral channel attention and spatio-temporal attention. In Figure 6, F denotes the input feature to CBAM3D (i.e., the input to the channel attention submodule). The spectral channel attention module first estimates the importance of each spectral channel (C) by applying 3D global average pooling and max pooling over the (T, H, W) dimensions and feeding the pooled descriptors into a shared multilayer perceptron (MLP). Functionally, this step acts as a spectral filter, adaptively amplifying vegetation-sensitive bands (e.g., red-edge and NIR) while suppressing redundant or noisy channels. This step produces a channel-refined intermediate feature, denoted as F′, which is then fed into the spatio-temporal attention submodule.

The spatio-temporal attention module then performs average and max pooling along the channel (C) dimension, yielding two aggregated spatio-temporal feature maps of size B×1×T×H×W, which are concatenated. A subsequent 3D convolution followed by a Sigmoid activation produces a 3D spatio-temporal attention map that is applied to the feature maps. This map guides the network to focus on critical phenological time windows (T), such as the peak greenness stage and key spatial locations (H, W), effectively highlighting crop field interiors while suppressing boundary artifacts. After applying spatio-temporal attention to F′, CBAM3D outputs the final refined feature (denoted as F″) for subsequent convolutional processing. Via this serial channel–spatio-temporal attention scheme, CBAM3D adaptively reweights the features, enhancing the expression of discriminative spectral bands and the most informative spatio-temporal regions.

3.1.4. Cross-Year Generalization Strategies

Network architecture alone is not enough to overcome inter-annual phenological drift. At its core, inter-annual phenological drift manifests as a “domain shift” in the data distribution: variations in air temperature, precipitation, and sowing dates cause the same crop to exhibit different spectral signatures and growth stages on the same DOY across different years. During training, deep learning models (no matter how sophisticated their architectures) tend to “overfit” the specific phenological rhythm of the source domain (training year). When these models are later applied to a target domain (test year) where phenology has shifted, the learned “date–spectrum–class” mapping breaks down, resulting in a steep decline in performance. Consequently, merely adding depth or complexity to the network following a purely “model-centric” strategy cannot fundamentally resolve this data mismatch issue. In response, we adopt a “data-centric” perspective and introduce two key strategies that directly tackle and alleviate temporal distribution shifts, thereby improving the model’s robustness across years.

To mitigate inter-annual phenological mismatches that arise when cloud cover leads to irregular multi-temporal observations, we introduce DOY positional encoding [36], providing the network with an explicit seasonal time anchor. Concretely, the DOY integer associated with each acquisition (in the range 1–365) is normalized to [0, 1] and treated as an additional input band. This normalized DOY layer is broadcast to match the spatio-temporal dimensions of the imagery (T×1×H×W) and then concatenated with the original spectral bands (T×C×H×W) along the channel dimension. As a result, the model ingests an extended tensor of size T×(C+1)×H×W, allowing TABS-Net to exploit absolute seasonal timing information to better align phenological trajectories across years and, in turn, substantially improve its cross-year generalization performance.

Although DOY positional encoding provides an absolute seasonal anchor, it cannot fully eliminate the impact of real phenological shifts driven by changes in sowing dates and inter-annual climate variability, nor the observation mismatches introduced by random factors such as cloud cover. These effects still undermine the model’s ability to generalize across years [41]. To address this, we incorporate a temporal jitter augmentation strategy during training, explicitly simulating advances and delays in phenology as well as irregular sampling, so as to improve the model’s robustness to temporal distribution shifts. The strategy has two complementary components: (1) global timeline shifting, where a small random integer offset (for example, ±5 to 10 days) is applied uniformly to the entire DOY sequence of a sample to emulate a crop growth cycle that is globally advanced or delayed; and (2) local temporal perturbation, where a few observation dates within the sequence are randomly selected and their DOY values are perturbed independently by a smaller random offset (for example, ±2 to 5 days) to mimic slight misalignments of individual acquisitions caused by cloud contamination or variations in satellite overpass timing. In practice, the two modes of temporal jitter augmentation are used jointly, with the jitter range carefully constrained to preserve a realistic phenological trajectory.

3.2. Comparison Methods

To objectively assess the effectiveness and cross-year robustness of TABS-Net, we benchmark it against a set of comparison models that represent mainstream methodological routes. First, a 2D CNN spatio-temporal stacking scheme concatenates multi-temporal images along the channel dimension into a high-dimensional input, learning local spatial textures without explicitly modeling temporal dependencies [42]. Second, LSTM/RNN-based models treat pixel-wise multi-temporal spectra as sequences, explicitly capturing temporal dynamics but largely neglecting spatial context [43]. Third, 1D temporal convolution models (such as TempCNN) use one-dimensional convolutions to efficiently extract temporal features, but are similarly constrained in their spatial modeling capacity [14]. Fourth, transformer-based models rely on self-attention to model long-range dependencies and allocate temporal weights, representing a purely attention-driven paradigm [44]. Fifth, a 3D CNN baseline shares the same 3D convolutional backbone as TABS-Net but removes DOY encoding and CBAM3D, allowing us to isolate and quantify the net contribution of the proposed modules.

Together, these baselines embody different assumptions and inductive biases, “space-dominant”, “time-dominant”, “pure attention”, and “spatio-temporal fusion (without explicit temporal/attention enhancement)”, and form a systematic spectrum spanning 2D, 1D, transformer, and 3D architectures. Comparing against this spectrum enables a comprehensive evaluation of TABS-Net’s strengths in spatio-temporal joint representation, alignment of key phenological phases, and cross-year generalization. A detailed comparison of the characteristics, strengths, and weaknesses of these methodological routes is summarized in Table 2.

3.3. Experimental Setup and Evaluation Metrics

To provide a comprehensive and rigorous assessment of the TABS-Net model proposed in this work both in terms of classification accuracy and cross-year generalization on multi-temporal remote sensing crop-mapping tasks, we design a strict experimental protocol and adopt well-established evaluation metrics to ensure that the reported results are objective, reliable, and reproducible.

3.3.1. Experimental Protocols

Because one of the central challenges in remote sensing-based agricultural monitoring is maintaining model stability under year-to-year changes in phenology and environmental conditions, our experimental design places particular emphasis on cross-year generalization. Specifically, within the CDL data subset spanning four agricultural years (2021–2024), we employ a Leave One Year Out Cross Validation (LOYO CV) scheme: in each round, three years are used for training and the remaining year for testing, and this process is repeated four times. This enables a systematic evaluation of temporal robustness and cross-year transferability. The experimental framework, built on the two datasets, not only covers multi-region, multi-crop, and multi-year settings, but also provides a solid foundation for validating the applicability and generalization capacity of TABS-Net in realistic, complex agricultural remote sensing scenarios.

3.3.2. Evaluation Metrics

To thoroughly evaluate performance and inter-annual stability in multi-temporal crop classification, we report a set of multi-dimensional metrics that capture overall agreement, class-level discrimination, spatial overlap quality, and year-to-year stability. These are derived from the four basic elements of the confusion matrix true positive (TP), false positive (FP), false negative (FN), and true negative (TN), and are complemented by Macro-F1(%), mIoU(%), IAR_IoU, IAR_F1, UA_mean, and PA_mean. Here, N denotes the number of classes and the total number of samples.

To ensure a fair and comprehensive evaluation, we adopt standard metrics widely used in remote sensing image classification and introduce auxiliary indices specifically designed to quantify cross-year generalization. The metrics and their computation are defined as follows:

Overall accuracy (OA) is the most commonly used indicator in remote sensing classification and is defined as the proportion of correctly classified pixels relative to all pixels:

OA and Kappa

O A = \frac{\sum_{i = 1}^{N} T P_{i}}{\sum_{i = 1}^{N} (T P_{i} + F P_{i} + F N_{i})},

(1)

K a p p a = \frac{p_{o} - p_{e}}{1 - p_{e}},

(2)

p_{o} = \frac{\sum_{i = 1}^{N} T P_{i}}{T o t a l},

(3)

p_{e} = \frac{1}{{T o t a l}^{2}} \sum_{i = 1}^{N} (T P_{i} + F N_{i}) (T P_{i} + F P_{i}),

(4)

OA reflects the overall proportion of correctly classified samples [45], while Kappa adjusts for agreement that could occur by chance and thus provides a more reliable measure of overall classification performance [46].

UA_mean and PA_mean

For class I, the following equations are used:

{U A}_{i} = {P r e c i s i o n}_{i} = \frac{T P_{i}}{T P_{i} + F P_{i}},

(5)

{P A}_{i} = {R e c a l l}_{i} = \frac{T P_{i}}{T P_{i} + F N_{i}},

(6)

Report its macro-averaged value (the arithmetic mean computed over all classes):

{U A}_{m e a n} = \frac{1}{N} \sum_{i = 1}^{N} U A_{i},

(7)

{P A}_{m e a n} = \frac{1}{N} \sum_{i = 1}^{N} P A_{i},

(8)

UA is more sensitive to FP, whereas PA is more sensitive to FN; taken together, they describe the error structure in terms of misclassified versus omitted samples [47].

F1 and Macro-F1(%)

{F 1}_{i} = \frac{2 U A_{i} \cdot P A_{i}}{U A_{i} + P A_{i}},

(9)

M a c r o - F 1 = \frac{1}{N} \sum_{i = 1}^{N} {F 1}_{i},

(10)

Macro-F1 provides a trade-off between the impacts of false positives and false negatives [48].

IoU and mIoU

{I o U}_{i} = \frac{T P_{i}}{T P_{i} + F P_{i} + F N_{i}},

(11)

{m I o U}_{i} = \frac{1}{N} \sum_{i = 1}^{N} {I o U}_{i},

(12)

IoU jointly penalizes false positives and false negatives, making it particularly sensitive to spatial consistency [49].

Inter-Annual Robustness: IAR_F1 and IAR_IoU

To quantitatively characterize how much a model’s classification accuracy varies from year to year, we introduce the following IAR [50] metric: for a given evaluation measure

M ϵ {F 1, I o U}

, IAR is defined as the following:

{I A R}_{M} = 1 - \frac{σ (M_{y})}{μ (M_{y})},

(13)

Here,

M_{y}

denotes the value of the metric in year y, while

μ

and

σ

represent its mean and standard deviation across years, respectively. An IAR value closer to one indicates smaller inter-annual variation and thus stronger robustness. In this study, we report on IAR_F1 and IAR_IoU to examine how annual consistency changes when DOY, temporal jitter, and attention mechanisms are switched on or off.

3.3.3. Training and Evaluation Procedures

During training, we optimized the network using the Adam optimizer with an initial learning rate of 0.0001, scheduled by a cosine decay policy. The model is trained for 100 epochs, while early stopping on the validation set is employed to mitigate overfitting. The checkpoint that achieves the best validation performance is finally selected and used for independent evaluation on the test set.

3.4. Implementation Details

All models were implemented using the PyTorch framework (version 2.7.0+cu128) on a workstation equipped with a single NVIDIA RTX 5080 GPU (NVIDIA, Santa Clara, CA, USA), an Intel Core i7-12700K CPU(Intel, Santa Clara, CA, USA), and 32 GB of RAM. The network parameters were initialized using the Kaiming initialization method. The specific hyperparameters and architecture configurations were determined through rigorous empirical tuning and grid search on the validation set.

Architecture Configuration: We employed a four-level 3D U-Net backbone to effectively capture multi-scale spatio-temporal contexts. The number of feature channels in the encoder follows a progressive sequence of [32, 64, 128] to balance feature representation capability with memory efficiency; deeper networks were found to overfit on the limited training samples. In the CBAM3D modules inserted at each block, the channel reduction ratio was set to r = 16, and the spatial attention kernel size was set to k = 7.

Regularization and Augmentation: To mitigate overfitting, we integrated DropBlock3D with a block size of five and a drop probability of 0.1 in the deeper layers. Furthermore, a temporal jitter augmentation was applied with a probability of 0.5 during training, which randomly shifts the temporal sequence to simulate inter-annual phenological variations and enhance model robustness.

Optimization and Loss: The model was trained using the Adam optimizer with a weight decay of 1 × 10⁻⁴. We employed a cosine annealing learning rate scheduler to dynamically adjust the learning rate, stabilizing the convergence in later epochs. The objective function was a mixed loss combining dice loss, focal loss, and boundary loss with equal weights (1:1:1). This hybrid loss design specifically addresses the issues of class imbalance and boundary fuzziness in crop mapping. The batch size was set to four, and training ran for 100 epochs with an early stopping mechanism (patience = 10).

4. Results

4.1. Overall Performance

Figure 7 presents the boxplots of accuracy metrics for TABS-Net and the baseline models, while Figure 8 illustrates their corresponding classification maps. Detailed numerical comparisons of these metrics are provided in Table 3.

In terms of quantitative assessment, TABS-Net demonstrates superior capability on the current year test set, achieving an overall accuracy (OA) of 94.10%, Kappa of 0.907, Macro-F1 of 93.22%, and mIoU of 87.47%. When compared to the top-performing baseline, the 3D-CNN baseline (OA = 91.61%, mIoU = 83.46%), our method yields substantial gains of approximately 2.49 percentage points in OA and 4.01 percentage points in mIoU. These statistics indicate that TABS-Net not only excels in global accuracy metrics but also achieves significantly better performance on structural indicators such as mIoU, reflecting its robust capability in capturing class balance and ensuring boundary quality.

Dissecting the performance disparities among the evaluated models reveals critical insights into the underlying mechanisms of crop mapping. The 1D temporal models, such as TempCNN-1D and Transformer-1D, rank at the bottom of the leaderboard (OA ≈ 83.7% and 80.2%, respectively). Their underperformance underscores that methods relying solely on spectral–temporal features—while neglecting spatial context—are insufficient for fine-grained crop mapping in complex agricultural landscapes. More notably, the sequential 2D-LSTM suffers a marked performance drop (OA = 89.03%) under our setting. This degradation is largely attributable to the inherent sensitivity of RNNs to the absolute timing of phenological events. Since RNNs tend to overfit the fixed temporal sequences of the training data, slight phenological shifts within the same year (due to regional microclimate variations) cause the learned temporal patterns to become misaligned, inevitably leading to classification errors. Meanwhile, although the 3D-CNN baseline and 2D-Stack CNN form a competitive second tier (OA ≈ 91.6%), their lack of explicit mechanisms to handle temporal misalignment limits their potential for further improvement.

In distinct contrast to these baselines, TABS-Net effectively overcomes these limitations by explicitly incorporating DOY position encoding and temporal jitter augmentation. These mechanisms effectively decouple the model’s dependency on rigid calendar dates, allowing it to dynamically align spectral features based on relative phenological stages rather than absolute time steps. When coupled with the CBAM3D module, which adaptively filters background noise and emphasizes key spectral channels, the proposed architecture successfully integrates spectral, spatial, and temporal information into a unified and robust representation.

From a qualitative perspective, as visualized in Figure 8, TABS-Net proves more reliable in delineating field boundaries compared to the baselines. It is worth noting that the ground truth CDL labels often contain speckled or “salt-and-pepper” noise due to their pixel-based production pipeline. TABS-Net, however, produces spatially smoother and homogeneous responses within fields, along with sharper interfaces between field and non-field areas. This characteristic effectively suppresses the artifacts found in the raw labels, demonstrating that our method is not merely fitting the noisy ground truth but is instead learning the underlying contiguous field structures, thereby offering higher practical usability.

4.2. Ablation Study and Analysis

To isolate the independent contribution of each component to cross-year generalization and class-level discrimination, we used the full model ALL (3D CNN backbone + DOY temporal positional encoding + temporal jitter + CBAM3D) as the reference and constructed four single-factor variants: DOY-0 (removing DOY so that the model relies solely on the raw temporal order), Jitter-0 (disabling temporal jitter while keeping all other augmentations unchanged), CBAM-0 (stripping out all attention modules while maintaining a comparable convolutional backbone and parameter count), and CBAM-2D (retaining only 2D CBAM and discarding 3D spatio-temporal attention). Apart from these on/off switches, all training and data settings are strictly aligned: a unified Sentinel-2 input configuration (default 10 band full spectrum), a fixed sequence length (T = 12, spanning May–October and aligned by DOY), an identical preprocessing pipeline (atmospheric correction, cloud masking, and nearest neighbor/spline filling for missing frames), and the same loss and optimization setup (identical optimizer, learning rate schedule, batch size, and number of epochs), along with identical data splits and random seeds.

The evaluation follows a cross-year protocol in which models are trained on 2024 and tested independently during 2021–2023. We primarily report inter-annual robustness metrics (IAR_F1, IAR_IoU; values closer to one indicate smaller inter-annual variability), supplemented by per class UA_mean/PA_mean to disentangle false positive (FP) and false negative (FN) error patterns. All metrics are summarized as mean ± std across multiple random seeds, and performance degradation of the four variants is measured relative to ALL, thereby assessing the actual contribution of DOY, temporal jitter, full 3D attention, and 2D, with only attention to both cross-year robustness and within-year accuracy.

Using the full model ALL (3D CNN + CBAM3D + DOY + temporal jitter) as the reference, the four ablation variants exhibited clearly differentiated behavior on corn/soybean in terms of IAR_IoU, IAR_F1, and UA/PA (see Table 4). DOY encoding emerges as the single most important factor for cross-year robustness: when DOY is removed (DOY-0), the IAR_IoU/IAR_F1 for corn drops from 0.997/0.998 to 0.956/0.976, and those for soybean from 0.996/0.994 to 0.906/0.947. For corn, UA_mean decreases from 0.952 to 0.900 while PA_mean remains nearly unchanged between 0.975 and 0.969, indicating that false positives dominate the error. For soybean, PA_mean falls from 0.953 to 0.865, whereas UA_mean only slightly decreases from 0.973 to 0.951, pointing to a marked increase in false negatives. Temporal jitter provides the next most influential boost to robustness: in the Jitter-0 setting, corn and soybean reach IAR_IoU/IAR_F1 of 0.986/0.992 and 0.971/0.984, with corn’s UA_mean reduced to 0.916 and soybean’s PA_mean to 0.911. This pattern suggests that injecting temporal perturbations during training materially strengthens the model’s tolerance to mild temporal shifts. The contribution of attention further manifests in jointly enhancing discriminability and inter-annual consistency, with full 3D spatio-temporal attention outperforming its purely 2D variant. Removing all attention (CBAM-0) reduces the IAR_IoU/IAR_F1 for corn and soybean to 0.989/0.994 and 0.977/0.987, and lowers UA_mean/PA_mean to 0.919/0.960 and 0.960/0.924. Retaining only 2D attention (CBAM-2D) leads to a further decline, with IAR_IoU/IAR_F1 of 0.979/0.989 for corn and 0.960/0.978 for soybean, and UA_mean/PA_mean of 0.914/0.973 and 0.961/0.914. These results indicate that lacking temporal attention weakens the model’s ability to focus on key phenological windows and the main body of fields. Taken together, explicit temporal alignment via DOY is the decisive driver of cross-year robustness; temporal jitter acts as a strong auxiliary reinforcement, and 3D channel spatio-temporal attention, by jointly suppressing background false positives and amplifying signals at critical growth stages, offering clear advantages over 2D attention in both accuracy and stability.

From a qualitative perspective (Figure 9), when DOY encoding is absent (DOY-0), the baseline configurations display pronounced false and missed classifications, along with substantial salt-and-pepper noise in the classification maps, revealing limited discrimination against complex background land cover. After introducing temporal jitter data augmentation (Jitter-0), misclassifications are noticeably reduced, but evident omissions remain in some regions, and the full spatial extent of the crop fields is still not fully recovered. Importantly, when moving from the no attention model (CBAM-0) to the 2D attention enhanced model (CBAM-2D), we observe marked improvements in boundary delineation and fine-scale detail. This consistent trend indicates that incorporating attention mechanisms greatly strengthens the model’s ability to recognize and extract crop features, enabling more accurate delineation of crop boundaries and effectively reducing misclassification.

4.3. Spectro-Temporal Signature Analysis

Ablation experiments identify DOY temporal positional encoding as the dominant contributor to cross-year robustness. To uncover the underlying physical and agronomic mechanisms, we analyzed the spectral reflectance time series for 100 fields (corn and soybean) over the full 2021 growing season (May–October, 12 DOY time steps). For each band, we compute “class mean ± 95% confidence interval”, suppressing individual-level noise and irregularities so that phenology-driven spectral differences can be examined at the population level.

As illustrated in Figure 10, during the early growth phase (DOY 156–171) and at late senescence (DOY 276), corn and soybean show highly overlapping spectral responses in the visible and red-edge bands, with confidence intervals that are difficult to separate, implying that single date observations alone are insufficient for reliable discrimination. Once the crops enter the vigorous growth stage, however, their spectral trajectories diverge sharply. Around DOY 186, corn exhibits slightly higher near-infrared reflectance, but between DOY 226 and 231, a critical “spectral inversion” occurs: soybean overtakes corn in the red-edge and near-infrared bands and reaches its peak. Within the optimal discriminative window (DOY 231–251), soybean maintains a clearly and consistently stronger response than corn in the shortwave infrared (SWIR, B11–B12), and the confidence intervals of the two classes become fully separated.

To move beyond qualitative visual inspection and rigorously quantify the magnitude of these divergences, we employed two statistical metrics: the Jeffries–Matusita (J-M) distance for multivariate separability and Welch’s t-test for band-wise significance. The J-M distance (

J_{C S}

) is widely regarded as a robust separability measure in remote sensing because, unlike simple Euclidean distance, it accounts for the covariance structure of the distributions [51]. It is defined as follows:

J_{C S} = 2 (1 - e^{- B_{C S}}),

(14)

where

B_{C S}

is the Bhattacharyya distance, calculated as the following:

B_{C S} = \frac{1}{8} (μ_{C} - μ_{S})^{T} Σ^{- 1} (μ_{C} - μ_{S}) + \frac{1}{2} \ln (\frac{|Σ|}{\sqrt{|Σ_{C}| |Σ_{S}|}}),

(15)

Here,

μ_{C}

and

μ_{S}

represent the mean spectral vectors of corn and soybean, and

Σ_{C}

and

Σ_{S}

are their covariance matrices (

Σ = \frac{Σ_{C} + Σ_{S}}{2}

). The J-M distance ranges from zero to two, with values exceeding 1.9 typically indicating excellent separability. Furthermore, to assess the statistical significance of differences in individual bands, we utilized Welch’s t-test [52]. This method is preferred over the standard Student’s t-test because it does not assume equal variances (heteroscedasticity), which is appropriate given the distinct spectral variability of the two crops. The t-statistic is computed as follows:

t = \frac{{\bar{X}}_{C} - {\bar{X}}_{S}}{\sqrt{\frac{s_{C}^{2}}{N_{C}} + \frac{s_{S}^{2}}{N_{S}}}},

(16)

where

\bar{X}

,

s^{2}

, and N denote the sample mean, variance, and sample size, respectively.

As presented in Figure 11, the application of these metrics yields critical insights. The temporal profile of the J-M distance (Figure 11) confirms that the separability is dynamic. While the J-M distance remains below 1.5 during the early vegetative stage, indicating moderate confusion, it rapidly ascends to the ‘Excellent Separability’ range (>1.9) from DOY 166, peaking at nearly 2.0 during the reproductive stages (DOY 200–250). Complementing this, Figure 12 visualizes the band-wise separability, where darker colors represent larger absolute reflectance differences (

∆ | M e a n |

) and textual markers denote statistical significance levels (***: p < 0.001; **: p < 0.01; *: p < 0.05; ns: not significant, p ≥ 0.05). This heatmap reveals that while visible bands show statistically significant differences (p < 0.001) throughout the season, their absolute reflectance differences are minimal. In contrast, the SWIR bands (B11, B12) and red-edge bands (B6, B7) exhibit the most substantial absolute differences (dark blue regions) during the critical window of DOY 226–251.

This evolution pattern clarifies why each component of TABS-Net is necessary. DOY encoding supplies an absolute temporal anchor that allows the network to precisely align key phenological events such as the “Spectral Inversion” and the SWIR difference peak across years, mitigating misalignment caused by inter-annual shifts. Temporal jitter augmentation, by randomly perturbing these critical temporal nodes, trains the model to capture phenological patterns that are less tied to fixed calendar dates. CBAM3D, in turn, learns to focus on and amplify the discriminative signals of the red-edge and SWIR bands within these crucial windows. Acting together, these three elements provide the physical basis for the strong cross-year stability exhibited by TABS-Net.

4.4. Extrapolation Experiment and Visual Assessment for 2025

To further assess cross-year generalization, we deploy the model trained on 2024 directly on Sentinel-2 imagery acquired in 2025 for crop classification. As the CDL reference for 2025 is not yet available, we validate the results visually using a near-infrared red green (typical) false color composite. In this rendering, corn typically appears as a darker deep red, whereas soybean shows up as a bright, vivid red; the spectral contrast driven by differences in canopy structure and water content provides a useful basis for visual interpretation. Figure 13 presents the 2025 mapping results over the study area alongside the corresponding false color composites. Even without ground truth labels for the target year, the model preserves coherent spatial patterns and clear separation between classes, indicating strong cross-year recognition performance and good practical applicability.

This section may be divided by subheadings. It should provide a concise and precise description of the experimental results, their interpretation, as well as the experimental conclusions that can be drawn.

5. Discussion

5.1. Computational Efficiency and Spatial Coherence Analysis

Beyond accuracy, we further evaluated the computational efficiency of all models, as operational deployment requires a critical balance between classification performance and resource consumption. Table 5 summarizes the model complexity (parameters and FLOPs) and inference speed (latency and FPS).

As expected, the lightweight 2D-Stack CNN achieves the highest throughput (436.1 FPS) due to its shallow architecture and simple channel concatenation. In contrast, the 1D temporal models exhibit significant computational bottlenecks. The Transformer-1D is the slowest (2.9 FPS), largely due to the quadratic complexity (O(T²)) of its self-attention mechanism, which becomes computationally prohibitive for pixel-wise inference. Similarly, TempCNN-1D (9.1 FPS) suffers from high latency caused by its deep stack of dilated 1D convolutions.

Regarding our proposed method, TABS-Net (12.2 FPS) naturally incurs a higher computational load compared to the standard 3D-CNN Baseline (63.0 FPS). This reduction in speed is the direct trade-off for integrating the CBAM3D modules, which requires additional dense computations for spatio-temporal attention map generation. However, it is crucial to note that TABS-Net remains over four times faster than the Transformer-1D, benefiting from the efficient parallelization of 3D convolutions on GPUs. In practical terms, TABS-Net achieves a processing speed of approximately 19,256 km²/min, which is more than sufficient for large-scale regional mapping. Given that TABS-Net yields a substantial accuracy gain (+2.5% OA and +4.0% mIoU) over the faster but less capable baselines, we consider this computational cost to be a highly acceptable investment for high-precision agricultural monitoring.

This investment in computational resources yields substantial returns in terms of spatial coherence and visual quality. As visualized in Figure 8, the classification maps provide critical insights into architectural behaviors beyond numerical metrics. The 1D baseline models (Transformer-1D and TempCNN), despite their ability to capture temporal dynamics, exhibit noticeable “salt-and-pepper” noise and fragmented field boundaries. This is mechanically attributed to their pixel-wise inference paradigm, where each pixel is classified independently based solely on its spectral–temporal curve, ignoring the semantic consistency of its neighborhood. Consequently, even slight spectral variability within a single field—caused by soil heterogeneity or moisture differences—leads to misclassified noise points.

In distinct contrast, TABS-Net produces spatially coherent maps with smooth field interiors and sharp transitions. This visual superiority stems from two key structural biases inherent in our design. First, the 3D convolutional backbone captures local spatial context (3×3×3), enforcing a physical continuity constraint that effectively smooths out isolated outliers. Second, and more importantly, the spatial attention component of the CBAM3D module explicitly models the logic of “where to focus”. By generating a spatial weight map that highlights salient field regions and suppresses background fluctuations, CBAM3D acts as a semantic regularizer. It encourages the network to treat the crop field as a unified object rather than a collection of disjoint pixels, thereby correcting the fragmented predictions typical of pure time-series models.

5.2. Effect of Band Selection

To assess how changes in input channels affect classification performance and cross-year stability, we fixed the network architecture and training protocol and varied only the band combinations, using single-year accuracy (Macro-F1, mIoU) and cross-year robustness (IAR_F1, IAR_IoU) as the primary evaluation metrics. We designed four Sentinel-2 band configurations: ① RGB (B2/B3/B4, 10 m); ② RGB+NIR (RGB plus a near-infrared band at 10 m); ③ RGB+NIR+RedEdge (further adding multiple red-edge bands, with native 10/20 m bands uniformly upsampled to 10 m); and ④ Full-Spec (a full spectrum setup combining RGB, NIR, red-edge, and SWIR bands). As summarized in Table 6, richer spectral information consistently boosts both current year and cross-year performance. With RGB alone, Macro-F1 and mIoU are 0.756 and 0.693, and IAR_F1 and IAR_IoU are 0.736 and 0.721, indicating that visible bands by themselves cannot adequately capture phenology and canopy structure. Adding NIR (RGB+NIR) increases Macro-F1 and mIoU to 0.837 and 0.796, and raises IAR_F1 and IAR_IoU to 0.759 and 0.743, reflecting the direct benefits of bands sensitive to chlorophyll content and canopy architecture. Incorporating multiple red-edge bands (RGB+NIR+RedEdge) further improves Macro-F1 and mIoU to 0.886 and 0.833, with IAR_F1 and IAR_IoU climbing to 0.874 and 0.847, suggesting that red-edge information helps align phenological stages across years and suppress inter-annual distribution shifts. Under the Full-Spec configuration (including SWIR), Macro-F1 and mIoU reached 0.932 and 0.875, while IAR_F1 and IAR_IoU attained 0.996 and 0.994, indicating near “constant-level” consistency from year to year. This is consistent with SWIR’s strong ability to distinguish soil background and moisture conditions, which markedly reduces false and missed detections along crop–background boundaries. In summary, the trend can be described as follows: RGB is restrictive; RGB+NIR yields clear gains; adding multiple red-edge bands substantially improves robustness; and the full spectrum setup (with SWIR) nearly saturates both accuracy and cross-year consistency, showing that red-edge and SWIR bands not only enhance within year discriminative power but, more importantly, significantly shrink inter-annual variance and reinforce the cross-year generalization capability of TABS-Net.

5.3. Effect of Number of Observations

To investigate how the temporal sequence length T affects model behavior, we constructed subsequences with T ∈ {3,6,9,12} within the same growing season. Specifically, T = 3 uses three scenes from August of the target year; T = 6 uses six scenes spanning July–September; T = 9 uses nine scenes from June–September; and T = 12 uses twelve scenes from May–October. Acquisition dates are chosen to be as evenly spaced as possible, and when the target window does not provide enough valid scenes, missing slots are filled along the time axis using nearest neighbor interpolation. As summarized in Table 7, extending the temporal coverage from T = 3 to T = 12 yields a monotonic improvement in both single-year performance and cross-year robustness, which approaches saturation when the full season is covered: Macro-F1/mIoU increased from 0.793/0.746 to 0.932/0.875, while IAR_F1/IAR_IoU rose from 0.849/0.832 to 0.996/0.994. Using only a few snapshots around peak biomass (T = 3) cannot adequately capture the full phenological trajectory. Once the sequence spans the jointing and peak growth stages (T = 6), Macro-F1 and mIoU reach 0.852 and 0.817, and IAR_F1 and IAR_IoU reach 0.914 and 0.906, indicating a pronounced reduction in sensitivity to inter-annual variability. Further extending coverage to include green up and rapid vegetative growth (T = 9) lifts Macro-F1 to 0.879 and IAR_F1 to 0.958, further strengthening cross-year alignment. With full season coverage (T = 12), the sequence forms a complete phenological loop: mIoU reaches 0.875, and both IAR_F1 and IAR_IoU are close to one, reflecting near “constant level” consistency from year to year.

In summary, the largest performance gain occurs when increasing from T = 3 to T = 6, the additional benefit from T = 6 to T = 9 is more modest, and a further notable improvement appears from T = 9 to T = 12, highlighting that fully capturing key transition windows (such as emergence and senescence) is crucial for cross-year consistency. Considering the trade-off between accuracy and data/compute cost, T ≈ 9−12 emerges as a favorable range, with T = 12 providing the strongest robustness under irregular sampling and inter-annual misalignment.

5.4. Cross-Regional Transferability Experiment: A Case Study in Horqin

To evaluate the model’s ability to generalize across regions, we apply TABS-Net trained in Iowa (Iowa) directly to Horqin Left Rear Banner in Northeast China, where the cropping system is likewise dominated by corn and soybean. Figure 14 presents the study area location, the crop maps generated by TABS-Net, and side-by-side comparisons for three representative subregions (true color/false color composites and the corresponding classification maps). The results show that, even under markedly different climatic and landscape conditions, the model retains a stable discriminative capacity: in Horqin, corn and soybean form a clearly structured spatial pattern, with corn concentrated mainly in the west and soybean relatively clustered in the east. The maps are spatially coherent, with field boundaries delineated consistently, indicating that the proposed 3D spatio-temporal modeling together with DOY-based alignment generalizes well and is practically deployable for cross domain applications.

5.5. Practical Implications and Economic Value

The innovations presented in TABS-Net offer substantial economic benefits for operational agricultural monitoring. First, the proposed method significantly reduces the costs associated with ground truth data collection. Traditional crop mapping approaches often require annual field surveys to collect labeled samples for re-training models due to phenological shifts. This process is labor-intensive, time-consuming, and expensive. By achieving robust cross-year generalization solely on historical source domain data, TABS-Net eliminates the dependency on concurrent ground truth data, thereby drastically cutting operational costs for long-term monitoring. Second, the improved classification accuracy directly supports precision agriculture and downstream economic activities. Accurate identification of crop types is the prerequisite for yield estimation, crop insurance assessment, and agricultural subsidy distribution. The ability of TABS-Net to produce reliable crop maps in new years—even under fluctuating phenological conditions—ensures timely information delivery for government policymakers and agricultural supply chain stakeholders, facilitating more efficient resource allocation and risk management.

5.6. Limitations and Further Study

While TABS-Net demonstrates robust cross-year generalization, we acknowledge two primary limitations inherent in the current design. The methodology currently relies exclusively on optical Sentinel-2 imagery, which restricts its applicability in cloud-prone regions—a persistent challenge in optical remote sensing [53]. Furthermore, although our Domain Generalization (DG) strategy successfully mitigates the need for target labels, its performance remains fundamentally dependent on the diversity of the source domain. As highlighted in the recent literature [54], extreme climate anomalies in the target year that fall significantly outside the source distribution support may still induce a generalization gap, potentially requiring minimal adaptation rather than a purely zero-shot approach.

Placing this work within the broader context of agricultural remote sensing, the landscape of crop mapping has evolved rapidly, primarily driven by the shift towards transformer-based architectures and unsupervised domain adaptation (UDA). Recent studies [55,56] indicate that while self-attention mechanisms excel at modeling long-range temporal dependencies in satellite time series, they typically require massive annotated datasets to overcome the lack of inductive bias [57]. Similarly, UDA methods utilizing adversarial alignment or optimal transport [58,59] have successfully aligned feature distributions across years. However, these methods are inherently transductive, necessitating access to target domain data during training. In contrast, TABS-Net operates under a Domain Generalization paradigm, learning time–invariant representations solely from the source domain. This distinction renders our approach more operationally viable for real-time monitoring tasks where future target data is unavailable beforehand, offering a pragmatic trade-off between accuracy and deployment efficiency.

Informed by these developments, we identify three critical directions to further advance this research. To overcome the limitations of optical data, integrating SAR data (Sentinel-1) is imperative; recent evidence suggests that the early fusion of SAR and optical time series significantly improves robustness against cloud cover and phenological ambiguity [60,61]. Additionally, to reduce label dependency, self-supervised learning (SSL) has emerged as a promising frontier. Pre-training on vast unlabeled satellite archives has been shown to yield generalized spectral–temporal representations, potentially enhancing zero-shot transferability [62,63]. Finally, moving beyond regional scales, we aim to apply meta-learning strategies to adapt to diverse continental cropping systems with minimal support samples, aligning with recent trends in few-shot land cover classification [64].

6. Conclusions

To enhance both the accuracy and cross-year generalization of remote sensing-based mapping of major staple crops such as corn and soybean, we introduce a new Temporal–Spatial-Band Attention Network (TABS-Net). Guided by a systematic evaluation of spectral and temporal configurations, we adopt an input strategy that spans the full growing season (May–October, 12 time steps) and uses a full-spectrum setting with red-edge and shortwave infrared bands (10 bands), thereby maximizing the discriminative power of the spatio-temporal signal. Building on this data configuration, TABS-Net employs a 3D convolutional neural network backbone augmented with DOY positional encoding, temporal jitter augmentation, and an innovative CBAM3D. Focusing on the core challenge of inter-annual phenological drift, we formulate an “implicit phenological alignment” paradigm: DOY encoding provides absolute temporal anchors for aligning key phenological stages across years, Temporal jitter improves adaptability by simulating realistic temporal shifts, and CBAM3D dynamically concentrates on the most informative time windows and spectral bands.

Our experiments demonstrate that TABS-Net achieves state-of-the-art performance for fine-scale extraction of corn and soybean, with substantial gains over conventional temporal deep learning baselines. Ablation studies show that DOY encoding is the primary driver of cross-year robustness, while temporal jitter and CBAM3D each contribute crucial improvements in handling non-rigid phenological variation and enhancing feature discriminability.

Given the current limitations in computational cost and data dependence, we see several promising directions for future work. First, to mitigate the high resource demands of 3D networks, future research could explore lightweight 3D convolutions, hybrid 2D/3D architectures, or knowledge distillation, aiming to preserve spatio-temporal modeling capacity while reducing model complexity and enabling deployment on resource-constrained edge platforms. Second, to address the vulnerability of optical time series to cloud- and rain-induced gaps, future studies should investigate multi-source fusion of Sentinel-2 with synthetic aperture radar (SAR) and hyper-spectral imagery, exploiting radar’s all-weather penetration to compensate for missing optical observations and improve robustness under challenging weather conditions. Finally, in light of background shifts encountered in cross-regional transfer, we encourage combining the generalization strategy proposed here with advanced unsupervised UDA techniques, and testing TABS-Net at larger scales and across more diverse global agricultural landscapes to further probe and extend its generalization limits.

Author Contributions

Conceptualization, Y.H. and Q.S.; methodology, X.Z., Y.H. and Y.Y.; software, X.Z.; validation, Q.W. and F.X.; formal analysis, X.Z.; investigation, X.Z.; resources, Q.S. and F.X.; data curation, X.Z., Q.W. and C.M.; writing—original draft, X.Z.; writing—review and editing, Y.H., Q.S. and Y.Y.; visualization, X.Z.; supervision, Q.S.; project administration, Q.S.; funding acquisition, Q.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Central Guidance on Local Science and Technology Development Fund project “Intelligent Remote Sensing System for Agricultural Disaster Loss Estimation at the County Level in Inner Mongolia” (Grant No. 2024ZY0138).

Data Availability Statement

To facilitate reproducibility and future research, the source code, pre-trained models, and inference scripts developed in this study are available at https://github.com/RS-ZX/TABS-NET.git (accessed on 18 January 2026).

Acknowledgments

The authors would like to thank Qian Shen from the Key Laboratory of Digital. Earth: Science, Aerospace Information Research Institute, Chinese Academy of Sciences for the help with writing. We are also grateful to all the anonymous reviewers for their constructive comments on this study.

Conflicts of Interest

Author F.X. is employed by Inner Mongolia Remote Sensing Center Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Shiferaw, B.; Prasanna, B.M.; Hellin, J.; Bänziger, M. Crops That Feed the World 6. Past Successes and Future Challenges to the Role Played by Maize in Global Food Security. Food Sec. 2011, 3, 307–327. [Google Scholar] [CrossRef]
Ray, D.K.; Mueller, N.D.; West, P.C.; Foley, J.A. Yield Trends Are Insufficient to Double Global Crop Production by 2050. PLoS ONE 2013, 8, e66428. [Google Scholar] [CrossRef]
Karthikeyan, L.; Chawla, I.; Mishra, A.K. A Review of Remote Sensing Applications in Agriculture for Food Security: Crop Growth and Yield, Irrigation, and Crop Losses. J. Hydrol. 2020, 586, 124905. [Google Scholar] [CrossRef]
Wu, B.; Zhang, M.; Zeng, H.; Tian, F.; Potgieter, A.B.; Qin, X.; Yan, N.; Chang, S.; Zhao, Y.; Dong, Q.; et al. Challenges and Opportunities in Remote Sensing-Based Crop Monitoring: A Review. Natl. Sci. Rev. 2023, 10, nwac290. [Google Scholar] [CrossRef] [PubMed]
Belgiu, M.; Drăguţ, L. Random Forest in Remote Sensing: A Review of Applications and Future Directions. ISPRS J. Photogramm. Remote Sens. 2016, 114, 24–31. [Google Scholar] [CrossRef]
Mountrakis, G.; Im, J.; Ogole, C. Support Vector Machines in Remote Sensing: A Review. ISPRS J. Photogramm. Remote Sens. 2011, 66, 247–259. [Google Scholar] [CrossRef]
Li, Z.; He, W.; Cheng, M.; Hu, J.; Yang, G.; Zhang, H. SinoLC-1: The First 1 m Resolution National-Scale Land-Cover Map of China Created with a Deep Learning Framework and Open-Access Data. Earth Syst. Sci. Data 2023, 15, 4749–4780. [Google Scholar] [CrossRef]
Karmakar, P.; Teng, S.W.; Murshed, M.; Pang, S.; Li, Y.; Lin, H. Crop Monitoring by Multimodal Remote Sensing: A Review. Remote Sens. Appl. Soc. Environ. 2024, 33, 101093. [Google Scholar] [CrossRef]
Kussul, N.; Lavreniuk, M.; Skakun, S.; Shelestov, A. Deep Learning Classification of Land Cover and Crop Types Using Remote Sensing Data. IEEE Geosci. Remote Sens. Lett. 2017, 14, 778–782. [Google Scholar] [CrossRef]
Adrian, J.; Sagan, V.; Maimaitijiang, M. Sentinel SAR-Optical Fusion for Crop Type Mapping Using Deep Learning and Google Earth Engine. ISPRS J. Photogramm. Remote Sens. 2021, 175, 215–235. [Google Scholar] [CrossRef]
Lu, T.; Gao, M.; Wang, L. Crop Classification in High-Resolution Remote Sensing Images Based on Multi-Scale Feature Fusion Semantic Segmentation Model. Front. Plant Sci. 2023, 14, 1196634. [Google Scholar] [CrossRef]
Gao, F.; Anderson, M.; Daughtry, C.; Karnieli, A.; Hively, D.; Kustas, W. A Within-Season Approach for Detecting Early Growth Stages in Corn and Soybean Using High Temporal and Spatial Resolution Imagery. Remote Sens. Environ. 2020, 242, 111752. [Google Scholar] [CrossRef]
Pelletier, C.; Webb, G.; Petitjean, F. Temporal Convolutional Neural Network for the Classification of Satellite Image Time Series. Remote Sens. 2019, 11, 523. [Google Scholar] [CrossRef]
Durrani, A.U.R.; Minallah, N.; Aziz, N.; Frnda, J.; Khan, W.; Nedoma, J. Effect of Hyper-Parameters on the Performance of ConvLSTM Based Deep Neural Network in Crop Classification. PLoS ONE 2023, 18, e0275653. [Google Scholar] [CrossRef]
Rußwurm, M.; Körner, M. Multi-Temporal Land Cover Classification with Sequential Recurrent Encoders. ISPRS Int. J. Geo-Inf. 2018, 7, 129. [Google Scholar] [CrossRef]
Zhang, R.; Wu, X.; Li, J.; Zhao, P.; Zhang, Q.; Wuri, L.; Zhang, D.; Zhang, Z.; Yang, L. A Bibliometric Review of Deep Learning in Crop Monitoring: Trends, Challenges, and Future Perspectives. Front. Artif. Intell. 2025, 8, 1636898. [Google Scholar] [CrossRef] [PubMed]
Ji, S.; Zhang, C.; Xu, A.; Shi, Y.; Duan, Y. 3D Convolutional Neural Networks for Crop Classification with Multi-Temporal Remote Sensing Images. Remote Sens. 2018, 10, 75. [Google Scholar] [CrossRef]
Gallo, I.; La Grassa, R.; Landro, N.; Boschetti, M. Sentinel 2 Time Series Analysis with 3D Feature Pyramid Network and Time Domain Class Activation Intervals for Crop Mapping. ISPRS Int. J. Geo-Inf. 2021, 10, 483. [Google Scholar] [CrossRef]
Blickensdörfer, L. Mapping of Crop Types and Crop Sequences with Combined Time Series of Sentinel-1, Sentinel-2 and Landsat 8 Data for Germany. Remote Sens. Environ. 2022, 269, 112831. [Google Scholar] [CrossRef]
Shen, Y.; Zhang, X.; Yang, Z. Mapping Corn and Soybean Phenometrics at Field Scales over the United States Corn Belt by Fusing Time Series of Landsat 8 and Sentinel-2 Data with VIIRS Data. ISPRS J. Photogramm. Remote Sens. 2022, 186, 55–69. [Google Scholar] [CrossRef]
Hu, X.; Wang, X.; Zhong, Y.; Zhang, L. S3ANet: Spectral-Spatial-Scale Attention Network for End-to-End Precise Crop Classification Based on UAV-Borne H2 Imagery. ISPRS J. Photogramm. Remote Sens. 2022, 183, 147–163. [Google Scholar] [CrossRef]
Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; Volume 11211, pp. 3–19. ISBN 978-3-030-01233-5. [Google Scholar]
Wang, Y.; Zhang, Z.; Feng, L.; Ma, Y.; Du, Q. A New Attention-Based CNN Approach for Crop Mapping Using Time Series Sentinel-2 Images. Comput. Electron. Agric. 2021, 184, 106090. [Google Scholar] [CrossRef]
Garnot, V.S.F.; Landrieu, L.; Chehata, N. Multi-Modal Temporal Attention Models for Crop Mapping from Satellite Time Series. ISPRS J. Photogramm. Remote. Sens. 2022, 187, 294–305. [Google Scholar] [CrossRef]
Antonijević, O.; Jelić, S.; Bajat, B.; Kilibarda, M. Transfer Learning Approach Based on Satellite Image Time Series for the Crop Classification Problem. J. Big Data 2023, 10, 54. [Google Scholar] [CrossRef]
Mohammadi, S.; Belgiu, M.; Stein, A. Few-Shot Learning for Crop Mapping from Satellite Image Time Series. Remote Sens. 2024, 16, 1026. [Google Scholar] [CrossRef]
Elshamli, A.; Taylor, G.W.; Areibi, S. Multisource Domain Adaptation for Remote Sensing Using Deep Neural Networks. IEEE Trans. Geosci. Remote Sens. 2020, 58, 3328–3340. [Google Scholar] [CrossRef]
You, N.; Dong, J.; Li, J.; Huang, J.; Jin, Z. Rapid Early-Season Maize Mapping without Crop Labels. Remote Sens. Environ. 2023, 290, 113496. [Google Scholar] [CrossRef]
McDanel, J.J.; Meghani, N.A.; Miller, B.A.; Moore, P.L. Harmonized Landform Regions in the Glaciated Central Lowlands, USA. J. Maps 2022, 18, 448–460. [Google Scholar] [CrossRef]
Lesiv, M.; Laso Bayas, J.C.; See, L.; Duerauer, M.; Dahlia, D.; Durando, N.; Hazarika, R.; Kumar Sahariah, P.; Vakolyuk, M.; Blyshchyk, V.; et al. Estimating the Global Distribution of Field Size Using Crowdsourcing. Glob. Change Biol. 2019, 25, 174–186. [Google Scholar] [CrossRef]
Daly, C.; Halbleib, M.; Smith, J.I.; Gibson, W.P.; Doggett, M.K.; Taylor, G.H.; Curtis, J.; Pasteris, P.P. Physiographically Sensitive Mapping of Climatological Temperature and Precipitation across the Conterminous United States. Int. J. Climatol. 2008, 28, 2031–2064. [Google Scholar] [CrossRef]
Seifert, C.A.; Roberts, M.J.; Lobell, D.B. Continuous Corn and Soybean Yield Penalties across Hundreds of Thousands of Fields. Agron. J. 2017, 109, 541–548. [Google Scholar] [CrossRef]
Martens, D.A.; Jaynes, D.B.; Colvin, T.S.; Kaspar, T.C.; Karlen, D.L. Soil Organic Nitrogen Enrichment Following Soybean in an Iowa Corn–Soybean Rotation. Soil Sci. Soc. Am. J. 2006, 70, 382–392. [Google Scholar] [CrossRef]
Ashourloo, D.; Nematollahi, H.; Huete, A.; Aghighi, H.; Azadbakht, M.; Shahrabi, H.S.; Goodarzdashti, S. A New Phenology-Based Method for Mapping Wheat and Barley Using Time-Series of Sentinel-2 Images. Remote Sens. Environ. 2022, 280, 113206. [Google Scholar] [CrossRef]
Han, W.; Yang, Z.; Di, L.; Mueller, R. CropScape: A Web Service Based Application for Exploring and Disseminating US Conterminous Geospatial Cropland Data Products for Decision Support. Comput. Electron. Agric. 2012, 84, 111–123. [Google Scholar] [CrossRef]
Duan, K.; Vrieling, A.; Schlund, M.; Bhaskar Nidumolu, U.; Ratcliff, C.; Collings, S.; Nelson, A. Detection and Attribution of Cereal Yield Losses Using Sentinel-2 and Weather Data: A Case Study in South Australia. ISPRS J. Photogramm. Remote Sens. 2024, 213, 33–52. [Google Scholar] [CrossRef]
Yang, Y.; Tao, B.; Ruane, A.C.; Shen, C.; Matteson, D.S.; Cousin, R.; Ren, W. Widespread Advances in Corn and Soybean Phenology in Response to Future Climate Change Across the United States. JGR Biogeosci. 2025, 130, e2024JG008266. [Google Scholar] [CrossRef]
Yang, Y.; Tao, B.; Liang, L.; Huang, Y.; Matocha, C.; Lee, C.D.; Sama, M.; Masri, B.E.; Ren, W. Detecting Recent Crop Phenology Dynamics in Corn and Soybean Cropping Systems of Kentucky. Remote Sens. 2021, 13, 1615. [Google Scholar] [CrossRef]
Chen, W.; Liu, G. A Novel Method for Identifying Crops in Parcels Constrained by Environmental Factors Through the Integration of a Gaofen-2 High-Resolution Remote Sensing Image and Sentinel-2 Time Series. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 450–463. [Google Scholar] [CrossRef]
Ople, J.J.M.; Yeh, P.-Y.; Sun, S.-W.; Tsai, I.-T.; Hua, K.-L. Multi-Scale Neural Network with Dilated Convolutions for Image Deblurring. IEEE Access 2020, 8, 53942–53952. [Google Scholar] [CrossRef]
Zhang, S.; Yang, J.; Leng, P.; Ma, Y.; Wang, H.; Song, Q. Crop Type Mapping with Temporal Sample Migration. Int. J. Remote Sens. 2024, 45, 7014–7032. [Google Scholar] [CrossRef]
Reuß, F.; Greimeister-Pfeil, I.; Vreugdenhil, M.; Wagner, W. Comparison of Long Short-Term Memory Networks and Random Forest for Sentinel-1 Time Series Based Large Scale Crop Classification. Remote Sens. 2021, 13, 5000. [Google Scholar] [CrossRef]
Xu, L.; Zhang, H.; Wang, C.; Zhang, B.; Liu, M. Crop Classification Based on Temporal Information Using Sentinel-1 SAR Time-Series Data. Remote Sens. 2018, 11, 53. [Google Scholar] [CrossRef]
Simón Sánchez, A.-M.; González-Piqueras, J.; De La Ossa, L.; Calera, A. Convolutional Neural Networks for Agricultural Land Use Classification from Sentinel-2 Image Time Series. Remote Sens. 2022, 14, 5373. [Google Scholar] [CrossRef]
Bargiel, D. A New Method for Crop Classification Combining Time Series of Radar Images and Crop Phenology Information. Remote Sens. Environ. 2017, 198, 369–383. [Google Scholar] [CrossRef]
Pontius, R.G.; Millones, M. Death to Kappa: Birth of Quantity Disagreement and Allocation Disagreement for Accuracy Assessment. Int. J. Remote Sens. 2011, 32, 4407–4429. [Google Scholar] [CrossRef]
Stehman, S.V. Selecting and Interpreting Measures of Thematic Classification Accuracy. Remote Sens. Environ. 1997, 62, 77–89. [Google Scholar] [CrossRef]
Farhadpour, S.; Warner, T.A.; Maxwell, A.E. Selecting and Interpreting Multiclass Loss and Accuracy Assessment Metrics for Classifications with Class Imbalance: Guidance and Best Practices. Remote Sens. 2024, 16, 533. [Google Scholar] [CrossRef]
He, X.; Zhang, S.; Xue, B.; Zhao, T.; Wu, T. Cross-Modal Change Detection Flood Extraction Based on Convolutional Neural Network. Int. J. Appl. Earth Obs. Geoinf. 2023, 117, 103197. [Google Scholar] [CrossRef]
Frantz, D.; Rufin, P.; Janz, A.; Ernst, S.; Pflugmacher, D.; Schug, F.; Hostert, P. Understanding the Robustness of Spectral-Temporal Metrics across the Global Landsat Archive from 1984 to 2019—A Quantitative Evaluation. Remote Sens. Environ. 2023, 298, 113823. [Google Scholar] [CrossRef]
Swain, P.H.; King, R.C. Two Effective Feature Selection Criteria for Multispectral Remote Sensing. In LARS Technical Reports; Purdue University: West Lafayette, IN, USA, 1973. [Google Scholar]
Welch, B.L. The Generalization of ‘Student’s’ Problem When Several Different Population Variances Are Involved. Biometrika 1947, 34, 28. [Google Scholar] [CrossRef]
Wright, N.; Duncan, J.M.A.; Callow, J.N.; Thompson, S.E.; George, R.J. CloudS2Mask: A Novel Deep Learning Approach for Improved Cloud and Cloud Shadow Masking in Sentinel-2 Imagery. Remote Sens. Environ. 2024, 306, 114122. [Google Scholar] [CrossRef]
Hong, J.; Zhang, Y.-D.; Chen, W. Source-Free Unsupervised Domain Adaptation for Cross-Modality Abdominal Multi-Organ Segmentation. Knowl.-Based Syst. 2022, 250, 109155. [Google Scholar] [CrossRef]
Noman, M.; Fiaz, M.; Cholakkal, H.; Narayan, S.; Muhammad Anwer, R.; Khan, S.; Shahbaz Khan, F. Remote Sensing Change Detection With Transformers Trained From Scratch. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4704214. [Google Scholar] [CrossRef]
Wang, D.; Zhang, Q.; Xu, Y.; Zhang, J.; Du, B.; Tao, D.; Zhang, L. Advancing Plain Vision Transformer Toward Remote Sensing Foundation Model. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5607315. [Google Scholar] [CrossRef]
Wang, D.; Zhang, J.; Du, B.; Xia, G.-S.; Tao, D. An Empirical Study of Remote Sensing Pretraining. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5608020. [Google Scholar] [CrossRef]
Capliez, E.; Ienco, D.; Gaetano, R.; Baghdadi, N.; Salah, A.H.; Le Goff, M.; Chouteau, F. Multisensor Temporal Unsupervised Domain Adaptation for Land Cover Mapping With Spatial Pseudo-Labeling and Adversarial Learning. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5405716. [Google Scholar] [CrossRef]
Shen, Z.; Ni, H.; Guan, H.; Niu, X. Optimal Transport-Based Domain Adaptation for Semantic Segmentation of Remote Sensing Images. Int. J. Remote Sens. 2024, 45, 420–450. [Google Scholar] [CrossRef]
Eisfelder, C.; Boemke, B.; Gessner, U.; Sogno, P.; Alemu, G.; Hailu, R.; Mesmer, C.; Huth, J. Cropland and Crop Type Classification with Sentinel-1 and Sentinel-2 Time Series Using Google Earth Engine for Agricultural Monitoring in Ethiopia. Remote Sens. 2024, 16, 866. [Google Scholar] [CrossRef]
Najem, S.; Baghdadi, N.; Bazzi, H.; Lalande, N.; Bouchet, L. Detection and Mapping of Cover Crops Using Sentinel-1 SAR Remote Sensing Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 1446–1461. [Google Scholar] [CrossRef]
Xu, Y.; Ma, Y.; Zhang, Z. Self-Supervised Pre-Training for Large-Scale Crop Mapping Using Sentinel-2 Time Series. ISPRS J. Photogramm. Remote Sens. 2024, 207, 312–325. [Google Scholar] [CrossRef]
Tao, C.; Qi, J.; Guo, M.; Zhu, Q.; Li, H. Self-Supervised Remote Sensing Feature Learning: Learning Paradigms, Challenges, and Future Works. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5610426. [Google Scholar] [CrossRef]
Ma, S.; Tong, L.; Zhou, J.; Yu, J.; Xiao, C. Self-Supervised Spectral–Spatial Graph Prototypical Network for Few-Shot Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5518915. [Google Scholar] [CrossRef]

Figure 1. Geographic location of the study area. The map displays the experimental region overlaid on true-color satellite imagery. The yellow outline marks the administrative boundary of the state of Iowa, United States, serving as the macro-scale geographic reference. The red rectangle delineates the specific region of interest (ROI) selected for this study.

Figure 2. Simplified crop calendar illustrating typical growth stages of corn and soybean in the study area (DOY-based).

Figure 3. Data temporal distribution: acquisition DOY in the study region, 2021–2024.

Figure 4. Analysis of inter-annual phenological drift and regional variability for corn and soybean from 2021 to 2024. (a) Smoothed temporal NDVI profiles for corn; (b) boxplots of the day of year (DOY) corresponding to the maximum NDVI for corn, illustrating peak growth timing; (c) smoothed temporal NDVI profiles for soybean; (d) boxplots of the peak DOY for soybean. The substantial shifts in peak timing (e.g., corn 2021 vs. 2022) and the variance within years (e.g., corn 2023) highlight the challenges of spectral–temporal misalignment.

Figure 5. Overview of the TABS-Net architecture.

Figure 6. The detailed structure of the 3D Convolutional Block Attention Module (CBAM3D).

Figure 7. Comparison of box lines of performance indicators for each model (OA, Kappa, Macro-F1, mIou).

Figure 8. Qualitative comparison of crop classification results across all benchmarked models in three representative regions. (a–c) False color images of Region 1, Region 2, and Region 3; (d–f) ground truth labels from the CDL dataset; (g–i) classification maps predicted by TABS-Net (ours); (j–l) classification maps predicted by the 3D-CNN baseline; (m–o) classification maps predicted by the 2D CNN; (p–r) classification maps predicted by the 2dlstm model; (s–u) classification maps predicted by the TempCNN-1D model; (v–x) classification maps predicted by the Transformer-1D model. Orange indicates corn, and green indicates soybean.

Figure 9. Visual ablation study comparing the impact of different components in three representative regions. (a–c) Ground truth labels from CDL samples; (d–f) classification results without DOY encoding (DOY-0); (g–i) classification results without temporal jitter augmentation (Jitter-0); (j–l) classification results without attention modules (CBAM-0); (m–o) classification results using 2D attention (CBAM2D); (p–r) classification results of the proposed complete model (ALL). Orange indicates corn, and green indicates soybean.

Figure 10. (a–l) Spectro-temporal signatures of corn and soybean (Mean ± 95%CI) across Sentinel-2 bands key DOYs from 156 to 276 in 2021.

Figure 11. Temporal profile of the Jeffries–Matusita (J-M) distance, where values exceeding 1.9 indicate excellent multivariate separability between corn and soybean.

Figure 12. Heatmap of band-wise statistical significance derived from Welch’s t-test. Darker blue regions indicate larger absolute differences in reflectance.

Figure 13. (a–c) Crop classification in the three regions of the study zone in 2025; (d–f) false color images of three regions.

Figure 14. (a) The location of the Left Rear Banner of Horqin; (b–d) crop classification in three regions with TABS-NET; (e–g) true color images of three regions; (h–j) false color images of three regions.

Table 1. Sentinel-2 MSI bands used in this study and their characteristics.

Bands Name	Central Wavelength (nm)	Resolution (m)	Description
B2	496.6	10	Blue
B3	560.0	10	Green
B4	664.5	10	Red
B5	703.9	20	Red Edge 1
B6	740.2	20	Red Edge 2
B7	782.5	20	Red Edge 3
B8	835.1	10	NIR
B8A	864.8	20	Narrow NIR
B11	1613.7	20	SWIR-1
B12	2202.4	20	SWIR-2

Table 2. Comparison of the characteristics, inductive biases, strengths, and weaknesses of the baseline methods and the proposed TABS-Net.

Method	Method Category	Key Characteristics and Inductive Bias	Strengths	Weaknesses
2D CNN	Space-Dominant	Implicit Temporal: Time steps stacked as channels.	Strong spatial texture learning; high inference speed.	Ignores temporal order; sensitive to phenological shifts.
2D LSTM	Time-Dominant (Recurrent)	Sequential: Pixel-wise recurrent processing.	Captures evolution dynamics; long-term memory.	Lacks spatial context; slow training; salt-and-pepper noise.
TempCNN-1D	Time-Dominant (Convolutional)	Local Temporal: Pixel-wise 1D convolution.	Efficient temporal extraction; interpretable filters.	Interpretable filters. Limited spatial modeling; fixed temporal receptive field.
Transformer-1D	Pure Attention	Global Attention: Self-attention-based weighting.	Models long-range dependencies; global context.	Data-hungry; computationally heavy; ignores local priors.

Table 3. Quantitative comparison of classification performance metrics (OA, Kappa, Macro-F1, and mIoU) between TABS-Net and baseline models on the test dataset.

Model	OA	Kappa	MacroF1	mIoU
TABS-Net	0.941	0.907	0.932	0.875
3D-CNN baseline	0.916	0.866	0.910	0.835
2D CNN	0.913	0.862	0.903	0.825
2dlstm	0.890	0.825	0.867	0.773
TempCNN-1D	0.837	0.736	0.832	0.715
Transformer-1D	0.802	0.677	0.793	0.665

Table 4. Ablation: inter-annual robustness and UA/PA(corn/soybean).

Methods	IAR_IoU		IAR_F1		UA_Mean		PA_Mean
Methods	Corn	Soybean	Corn	Soybean	Corn	Soybean	Corn	Soybean
ALL	0.997	0.996	0.998	0.994	0.952	0.973	0.975	0.953
DOY-0	0.956	0.906	0.976	0.947	0.900	0.951	0.969	0.865
Jitter-0	0.986	0.971	0.992	0.984	0.916	0.964	0.973	0.911
CBAM-0	0.989	0.977	0.994	0.987	0.919	0.960	0.960	0.924
CBAM-2D	0.979	0.960	0.989	0.978	0.914	0.961	0.973	0.914

Table 5. Comparison of computational complexity and inference efficiency (parameters, FLOPs, latency, and FPS) among the evaluated models.

Model	Params (M)	FLOPs (G)	Latency (ms)	FPS	Speed (km²/min)
TABS-Net	0.42	123.70	326.7	12.2	19,256.7
3D-CNN baseline	0.41	118.13	63.5	63.0	99,119.7
2D CNN	0.22	14.43	9.2	436.1	685,959.0
2dlstm	0.20	42.71	53.5	74.8	117,582.1
TempCNN-1D	0.34	1059.92	439.9	9.1	14,300.5
Transformer-1D	0.08	240.16	1356.8	2.9	4637.0

Table 6. Comparative results for band combinations: Macro-F1, mIoU, and IAR.

Band Set	Macro-F1	Miou	IAR_F1	IAR_IoU
RGB (3 bands)	0.756	0.693	0.736	0.721
RGB+NIR (5 bands)	0.837	0.796	0.759	0.743
RGB+NIR+RE (8 bands)	0.886	0.833	0.874	0.847
Full-Spec (10 bands)	0.932	0.875	0.996	0.994

Table 7. Effect of temporal sequence length (T) on accuracy and inter-annual robustness (Macro-F1, mIoU, IAR_F1, IAR_IoU).

T	Macro-F1	mIoU	IAR_F1	IAR_IoU
3	0.793	0.746	0.849	0.832
6	0.852	0.817	0.914	0.906
9	0.879	0.821	0.958	0.949
12	0.932	0.875	0.996	0.994

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhou, X.; Huang, Y.; Shen, Q.; Yao, Y.; Wen, Q.; Xi, F.; Ma, C. TABS-Net: A Temporal Spectral Attentive Block with Space–Time Fusion Network for Robust Cross-Year Crop Mapping. Remote Sens. 2026, 18, 365. https://doi.org/10.3390/rs18020365

AMA Style

Zhou X, Huang Y, Shen Q, Yao Y, Wen Q, Xi F, Ma C. TABS-Net: A Temporal Spectral Attentive Block with Space–Time Fusion Network for Robust Cross-Year Crop Mapping. Remote Sensing. 2026; 18(2):365. https://doi.org/10.3390/rs18020365

Chicago/Turabian Style

Zhou, Xin, Yuancheng Huang, Qian Shen, Yue Yao, Qingke Wen, Fengjiang Xi, and Chendong Ma. 2026. "TABS-Net: A Temporal Spectral Attentive Block with Space–Time Fusion Network for Robust Cross-Year Crop Mapping" Remote Sensing 18, no. 2: 365. https://doi.org/10.3390/rs18020365

APA Style

Zhou, X., Huang, Y., Shen, Q., Yao, Y., Wen, Q., Xi, F., & Ma, C. (2026). TABS-Net: A Temporal Spectral Attentive Block with Space–Time Fusion Network for Robust Cross-Year Crop Mapping. Remote Sensing, 18(2), 365. https://doi.org/10.3390/rs18020365

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

TABS-Net: A Temporal Spectral Attentive Block with Space–Time Fusion Network for Robust Cross-Year Crop Mapping

Highlights

Abstract

1. Introduction

2. Materials

2.1. Study Site

2.2. Sentinel-2 Data and Preprocessing

2.3. CDL Sample Construction and Dataset Preparation

2.4. Analysis of Phenological Drift Mechanism

3. Methodology

3.1. The Structure of TABS-NET

3.1.1. Overall Architecture

3.1.2. Spatio-Temporal Feature Extraction

3.1.3. Three-Dimensional Convolutional Block Attention Module

3.1.4. Cross-Year Generalization Strategies

3.2. Comparison Methods

3.3. Experimental Setup and Evaluation Metrics

3.3.1. Experimental Protocols

3.3.2. Evaluation Metrics

3.3.3. Training and Evaluation Procedures

3.4. Implementation Details

4. Results

4.1. Overall Performance

4.2. Ablation Study and Analysis

4.3. Spectro-Temporal Signature Analysis

4.4. Extrapolation Experiment and Visual Assessment for 2025

5. Discussion

5.1. Computational Efficiency and Spatial Coherence Analysis

5.2. Effect of Band Selection

5.3. Effect of Number of Observations

5.4. Cross-Regional Transferability Experiment: A Case Study in Horqin

5.5. Practical Implications and Economic Value

5.6. Limitations and Further Study

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI