Article

HyperVTCN: A Deep Learning Method with Temporal and Feature Modeling Capabilities for Crop Classification with Multisource Satellite Imagery

1 Guangdong Province Key Laboratory for Agricultural Resources Utilization, College of Natural Resources and Environment, South China Agricultural University, Guangzhou 510642, China
2 College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China
3 School of Public Administration, South China Agricultural University, Guangzhou 510640, China
4 School of Environmental Science and Engineering, Tianjin University, Tianjin 300350, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(17), 3022; https://doi.org/10.3390/rs17173022
Submission received: 7 July 2025 / Revised: 27 August 2025 / Accepted: 28 August 2025 / Published: 31 August 2025

Abstract

Crop distribution represents crucial information in agriculture, playing a key role in ensuring food security and promoting sustainable agricultural development. However, existing methods for crop distribution mapping primarily focus on modeling temporal dependencies while overlooking the interactions and dependencies among different remote sensing features, thus failing to fully exploit the rich information contained in multisource satellite imagery. To address this issue, we propose a deep learning-based method named HyperVTCN, which comprises two key components: the ModernTCN block and the TiVDA attention mechanism. HyperVTCN effectively captures temporal dependencies and uncovers intrinsic correlations among features, thereby enabling more comprehensive data utilization. Compared to other state-of-the-art models, it shows improved performance, with overall accuracy (OA) higher by approximately 2–3%, Kappa by 3–4.5%, and Macro-F1 by about 2–3%. Additionally, ablation experiments suggest that both the attention mechanism (Time-Feature Dual Attention, TiVDA) and the targeted loss optimization strategy contribute to performance improvements. Finally, experiments were conducted to investigate HyperVTCN’s cross-feature and cross-temporal modeling. The results indicate that this joint modeling strategy is effective. This approach has shown potential in enhancing model performance and offers a viable solution for crop classification tasks.

1. Introduction

Driven by the dual pressures of global climate change and a growing population, food security has become one of the most severe global challenges of the 21st century [1,2,3]. Faced with the real conflict between land resource constraints and the increasing demand for food [4], crop classification research demonstrates critical strategic value [5]. Accurate crop classification can effectively identify the spatial distribution, planting area, and yield potential of crops [6]. It not only provides reliable data support for the formulation of food security policies [7], but also guides the optimization of planting structures and the allocation of agricultural resources, promoting the sustainable development of agricultural production [8].
With the rapid advancement of remote sensing technology, the quality and spatiotemporal resolution of satellite imagery have been significantly improved [9]. This enhancement enables Satellite Image Time Series (SITS) to capture dynamic changes throughout the crop growth cycle more accurately. Meanwhile, the availability of multisource satellite imagery, such as optical sensor data and Synthetic Aperture Radar (SAR) data, has increased [10]. These diverse data sources offer more comprehensive information to support crop classification [11,12]. As a result, the integrated use of time series data from multisource satellite imagery has gradually become an important direction for improving crop classification accuracy. Such integration can leverage the complementary advantages of different features, compensating for the limitations of single-feature data under complex conditions [13]. However, the wealth of information contained in these data has not yet been fully exploited in current crop classification research.
Currently, crop classification methods can generally be divided into three categories. The first category refers to traditional methods that rely on feature engineering and thresholding. These methods select a small number of features based on vegetation phenological characteristics and expert knowledge, aiming to explore the optimal threshold for distinguishing between different crop types [14,15,16]. However, the manual selection of features, which is susceptible to subjective factors, often fails to comprehensively capture the complex patterns of crop growth and thus struggles to handle crop types that cannot be distinguished by simple rules [17,18]. Moreover, crop spectra exhibit significant intra-species spectral variability, making manually defined rules difficult to generalize across different regions. As a result, traditional methods often perform poorly in large-scale crop classification [19,20].
The second category refers to machine learning-based methods, such as Random Forest (RF) [21,22,23] and Support Vector Machine (SVM) [24,25]. Compared to traditional methods, machine learning-based methods do not require manual rule design; they learn features automatically in a data-driven manner. These methods can handle complex data and uncover potential hidden patterns and relationships that are not readily discernible by human experts [26]. Additionally, machine learning-based methods are useful for crop classification over large-scale areas, exhibiting stable performance across different regions [27,28]. However, machine learning-based methods still have certain limitations when dealing with high-dimensional and nonlinear problems. When handling high-dimensional data, they face the challenge of the curse of dimensionality [29,30]. As the feature dimensionality increases, the model’s performance may decline, leading to poor performance in capturing high-dimensional features during the crop growth cycle. Meanwhile, changes in crop growth cycles and seasonal variations are often influenced by a variety of factors (such as climate change, soil types, etc.), and are not simple linear changes [31]. Although machine learning-based methods can handle nonlinear relationships using techniques such as kernel functions in SVM, their ability to model nonlinearity under these complex factors remains limited.
In the face of these challenges, deep learning-based methods have demonstrated outstanding modeling capabilities [32]. Deep neural networks use hierarchical feature representations and nonlinear activation mechanisms (such as ReLU, Sigmoid, etc.) to automatically learn and model the highly nonlinear relationships in the data [33]. Many deep learning models, such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer, have been proven to be effective methods for crop classification [34,35]. Currently, research on deep learning-based methods mainly focuses on several key directions. One approach is to expand the temporal receptive field, enabling the model to capture dependencies over a longer time span. Traditional 1D-CNNs are limited by the receptive field of the convolution kernels, often focusing on local temporal features but neglecting global temporal dependencies. To overcome this limitation, researchers have stacked multiple convolutional layers to gradually expand the receptive field, capturing longer temporal dependencies [17,36,37]. In addition to optimizing CNNs, it was also discovered that RNNs are capable of modeling sequential data and capturing longer temporal dependencies through their recurrent connections [38,39]. Further studies compared the performance of unidirectional and bidirectional RNNs in crop classification. They demonstrated that bidirectional RNNs, by propagating information in both forward and backward directions, can fully utilize the entire crop growth cycle’s temporal sequence information [6,40]. The second approach models temporal dependencies across multiple scales, enabling the detection of seasonal changes and other temporal patterns. This method generates a more diverse and comprehensive representation of temporal information. Many hybrid architectures [41,42], such as the dual-path model combining a CNN and an RNN, have been proposed. These architectures leverage the local feature extraction capabilities of different modules and the global temporal dependency modeling capabilities to improve model performance. Additionally, the emergence of Transformer models has shifted research towards attention-based techniques [43,44]. This has led to the exploration of temporal attention mechanisms, further enhancing the modeling of temporal context. These techniques allow the model to autonomously focus on critical phenological stages [45], such as the seedling and grain filling stages, thereby improving the model’s classification performance and efficiency.
Various strategies have been investigated to effectively capture the temporal evolution, thereby improving crop classification accuracy [46]. In recent years, significant progress has been made in expanding the temporal receptive field, modeling temporal dependencies at different scales, and exploring temporal attention mechanisms. However, different crop types may exhibit highly similar temporal patterns in certain remote sensing features, making it difficult to distinguish them based on a single or limited set of spectral bands. Multi-dimensional features are needed to complement each other during classification [47]. Therefore, information from the feature dimension is equally critical, and the correlations among features warrant greater attention. Nevertheless, current studies have not adequately accounted for the interdependencies among features during model design. The feature dependencies in this paper refer to the implicit interrelationships between different remote sensing features. For example, the correlations between features such as Red, Green, Blue, NIR, and SWIR are often used to construct remote sensing indices. The purpose of remote sensing indices is to utilize the implicit information between features, thereby enhancing the representation of land surface information. Similarly, we believe that there are also unexplored latent correlations between other remote sensing features. Although multisource satellite imagery has been extensively utilized [33], studies on existing methods primarily focus on the optimization of temporal modeling. A crucial challenge remains in effectively modeling the dependencies between different features. Therefore, it is essential to develop new crop classification methods to better leverage the hidden relationships among features, thereby enhancing classification accuracy and model generalization capability.
To address the aforementioned limitations, we propose a novel deep learning model, HyperVTCN. It is designed to jointly capture temporal dependencies and feature interactions in multisource satellite imagery for crop classification. We further evaluate its performance through comprehensive experiments.
The main contributions of this study are as follows:
  • HyperVTCN identifies different crops over large areas with good accuracy by exploring the relationships between temporal features, which reflect the growth period of crops.
  • HyperVTCN, with feature modeling components, captures coordinated variation patterns across multiple features. These patterns reflect key characteristics of crop phenology, complement crop growth information, and thereby enhance the differentiation of various crops.
  • A weighted loss function combining Focal Loss and QR-Ortho Loss is proposed to address class imbalance and feature redundancy, enhancing classification accuracy.
The remainder of this paper is organized as follows. Section 2 presents the study area and data. Section 3 describes the data processing workflow, the specific crop classification algorithms, and the validation methods in detail. Section 4 compares the performance of the proposed method with other crop classification approaches and presents the results of the ablation experiments in HyperVTCN. Section 5 discusses several key aspects, including the effectiveness of feature modeling, improvements in temporal modeling, the reliability of crop mapping with HyperVTCN, and the associated uncertainties and future research directions. Finally, Section 6 summarizes the conclusions of this study.

2. Study Area and Data

2.1. Study Area

The study area is Northeast China, which includes Heilongjiang, Jilin, and Liaoning provinces, as shown in Figure 1. It falls within the cold temperate and temperate climate zones, characterized by warm and humid conditions during the crop growing season. The region is predominantly flat and fertile, particularly in the Songhua River Basin of Heilongjiang and Jilin provinces, providing an optimal environment for large-scale agricultural production. The primary agricultural activities involve the cultivation of staple crops such as rice, soybeans, and corn. Due to temperature constraints, a single-season cropping system is predominant in Northeast China. Figure 2 provides a detailed illustration of the phenological periods of the three major crops. Rice is predominantly grown as a monoculture in paddy fields, while soybeans and corn are commonly found both in intercropping systems and in year-to-year rotations in upland areas.

2.2. Remote Sensing Imagery Data

This study utilized Sentinel-1 SAR (C-band Synthetic Aperture Radar), Sentinel-2, Landsat-7/8 TOA data, and MODIS LST products from 2017 to 2019 as the primary data sources. For the Sentinel-2 and Landsat series data, this paper selected five basic bands: red, green, blue, NIR, and SWIR. Considering the limitations of optical imagery under cloudy, hazy, or rainy conditions, Sentinel-1 SAR data were integrated to mitigate the deficiencies of optical imagery [48]. The Sentinel-1 data, comprising two polarization bands (VV and VH), provides high spatial and temporal resolutions, rendering it well suited for crop classification and growth monitoring in extensive agricultural regions. In addition, LST data serves as a thermodynamic feature supplement during the crop growth process, further enhancing the representation of the crop’s physiological state. The data above were preprocessed on the Google Earth Engine (GEE) platform, which included radiometric calibration, geometric correction, and the removal of cloud, haze, and snow cover to enhance data accuracy and consistency. Additionally, bicubic interpolation was applied to resample the spatial resolution of all data to 10 m, ensuring spatial alignment across different data sources.
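As an illustration of this preprocessing workflow, the sketch below shows how a cloud-masked, 10 m resampled Sentinel-2 collection could be assembled with the GEE Python API. The collection ID, band names, QA60-based cloud mask, and the rectangular study-area geometry are assumptions for demonstration and do not reproduce the exact scripts used in this study.

```python
import ee

ee.Initialize()

# Hypothetical bounding box; the actual Northeast China boundary is not reproduced here.
region = ee.Geometry.Rectangle([120.0, 38.0, 135.0, 53.0])

def mask_s2_clouds(image):
    """Mask clouds and cirrus using the Sentinel-2 QA60 band (bits 10 and 11)."""
    qa = image.select('QA60')
    cloud = qa.bitwiseAnd(1 << 10).eq(0)
    cirrus = qa.bitwiseAnd(1 << 11).eq(0)
    return image.updateMask(cloud.And(cirrus))

s2 = (ee.ImageCollection('COPERNICUS/S2_HARMONIZED')
      .filterBounds(region)
      .filterDate('2017-01-01', '2019-12-31')
      .map(mask_s2_clouds)
      .select(['B2', 'B3', 'B4', 'B8', 'B11']))  # blue, green, red, NIR, SWIR1

def to_10m(image):
    """Resample each image to a common 10 m grid with bicubic interpolation."""
    return image.resample('bicubic').reproject(crs='EPSG:4326', scale=10)

s2_10m = s2.map(to_10m)
```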

2.3. Ground Truth Data

This study conducted field surveys in Northeast China from 2017 to 2019 to collect ground truth samples [49]. The main sample types and their distribution are shown in Figure 3, including rice, corn, soybeans, and others. The “others” category comprises wheat, grassland, wetlands, forests, water bodies, built-up areas, and other land cover types. After the field surveys, we conducted visual interpretation of all ground samples using high-resolution images from Google Earth and eliminated abnormal or erroneous samples.

3. Method

Figure 4 illustrates the methodological workflow of this study, which includes three main stages. First, data processing is performed. Then, the processed data are used for model training. Finally, the effectiveness of the proposed HyperVTCN model is validated through quantitative evaluation.

3.1. Data Processing

Based on the pre-processed remote sensing imagery, the following indices were calculated using five spectral bands from the optical imagery data: Normalized Difference Vegetation Index (NDVI), Enhanced Vegetation Index (EVI), Ratio Vegetation Index (RVI), Green Cover Vegetation Index (GCVI), and Land Surface Water Index (LSWI). Their calculation formulas are as follows:
$$NDVI = \frac{NIR - Red}{NIR + Red}$$
$$EVI = G \times \frac{NIR - Red}{NIR + C_1 \times Red - C_2 \times Blue + L}$$
$$RVI = \frac{NIR}{Red}$$
$$GCVI = \frac{NIR - Green}{NIR + Green}$$
$$LSWI = \frac{NIR - SWIR}{NIR + SWIR}$$
Additionally, the ratio index of VV and VH (VV/VH) was calculated using radar imagery data. Each of these indices offers distinct advantages for the crop classification task.
Specifically, NDVI is among the most extensively utilized indices in agricultural remote sensing for evaluating vegetation density and health [50]. However, it suffers from saturation in areas of high vegetation cover, where the index plateaus and further increases in biomass produce little change. EVI has better correction capabilities for atmospheric effects and soil background and exhibits higher sensitivity at higher biomass levels, which effectively compensates for the limitations of NDVI [51]. RVI is more sensitive to changes in areas with low vegetation cover and provides more accurate vegetation information during sparse or early growth stages. GCVI, on the other hand, focuses on characterizing the chlorophyll content of vegetation [52]. Each of these four vegetation indices has its own focus, and together they comprehensively represent the nutrient changes and physical characteristics during crop growth and development. Additionally, LSWI is highly sensitive to water body characteristics [53] and can effectively capture the distinctive features of flooded rice fields, making it crucial for distinguishing water-intensive crops like rice from rainfed crops such as corn and soybeans. It can also further assist in crop type classification by providing information on vegetation and soil moisture. VV/VH is based on the analysis of radar backscatter characteristics [54] and can capture features such as surface roughness, vegetation structure, and moisture content, assisting in the differentiation of different crop types.
These indices can indicate changes in crop growth status, health, and other environmental factors, complementing the original reflectance bands (Blue, Green, Red, NIR, SWIR1, VV, VH). In addition, Land Surface Temperature (LST), as a derived geophysical product, provides important thermal information that helps distinguish crop types by reflecting their specific temperature patterns and growth conditions. Consequently, a total of 14 remote sensing features were ultimately selected for this study, including six remote sensing indices, seven original reflectance bands, and LST.
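For reference, the following sketch computes the six indices from band arrays according to the formulas above. The EVI coefficients (G, C1, C2, L) default to the commonly used values of 2.5, 6, 7.5, and 1; this is an assumption, as the study does not list the constants it adopted.

```python
import numpy as np

def compute_indices(blue, green, red, nir, swir1, vv, vh,
                    G=2.5, C1=6.0, C2=7.5, L=1.0):
    """Compute the remote sensing indices used as model inputs.

    All inputs are NumPy arrays of reflectance (optical) or backscatter (SAR).
    The EVI coefficients default to commonly used values (an assumption).
    """
    eps = 1e-6  # avoid division by zero
    ndvi = (nir - red) / (nir + red + eps)
    evi = G * (nir - red) / (nir + C1 * red - C2 * blue + L + eps)
    rvi = nir / (red + eps)
    gcvi = (nir - green) / (nir + green + eps)
    lswi = (nir - swir1) / (nir + swir1 + eps)
    ratio_vv_vh = vv / (vh + eps)
    return ndvi, evi, rvi, gcvi, lswi, ratio_vv_vh
```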
To mitigate the impact of clouds and the temporal unevenness of observations, this study used the maximum GCVI value, the minimum VH value, and the minimum LST value within each ten days to generate composite index images, from which the corresponding values were extracted as observation data. The missing values were subsequently filled by fitting a linear function [55], and the data were then normalized, resulting in a complete multi-feature time-series remote sensing dataset (as shown in Figure 5). Finally, the time-series remote sensing data were matched with ground truth data to form the sample dataset for crop classification.
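A minimal sketch of this compositing and gap-filling step for a single pixel and feature is shown below, assuming NumPy arrays of observations and their acquisition days; the ten-day window, the choice of reducer, and min-max normalization follow the description above, while the function names are illustrative.

```python
import numpy as np

def ten_day_composite(values, doys, reducer=np.nanmax, year_days=360):
    """Aggregate irregular observations into 36 ten-day composites.

    values : 1-D array of observations for one pixel and one feature.
    doys   : day-of-year of each observation.
    reducer: np.nanmax for GCVI-like features, np.nanmin for VH and LST.
    """
    composites = np.full(year_days // 10, np.nan)
    for i in range(len(composites)):
        mask = (doys >= i * 10) & (doys < (i + 1) * 10)
        if mask.any():
            composites[i] = reducer(values[mask])
    return composites

def fill_and_normalize(series):
    """Linearly interpolate missing composites, then min-max normalize.

    Assumes at least one valid observation is present in the series.
    """
    idx = np.arange(len(series))
    valid = ~np.isnan(series)
    series = np.interp(idx, idx[valid], series[valid])
    return (series - series.min()) / (series.max() - series.min() + 1e-6)
```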

3.2. Algorithms for Crop Classification

To address the challenges in crop classification, we propose a novel deep learning model named HyperVTCN. The model consists of two key components: (1) the ModernTCN block, which captures temporal dependencies and feature dependencies, and (2) the Time-Feature Dual Attention (TiVDA) module, which enables the model to automatically focus on the most relevant time steps and features in the input sequence.
In this section, to avoid confusion with the high-dimensional features extracted by the model, multiple remote sensing features are represented as multiple variables input into the model.

3.2.1. Patch Embedding

The crop growth cycle consists of several stages (e.g., germination, heading, and maturity), and each stage exhibits significantly different feature representations in the data [56]. To capture these stage-specific characteristics in more detail, the input data $X_{in} \in \mathbb{R}^{M \times L}$ (a tensor containing M variables, each with a time length of L) is first divided into smaller time windows (patches). The size of the time window is P, and the stride is S, which determine the temporal receptive field and the downsampling rate, respectively. Subsequently, an embedding layer is introduced, which maps each time window to a high-dimensional feature space using a one-dimensional convolution operation, further extracting its dynamic patterns and features in the temporal dimension.
This operation transforms the data dimensions into $X_{emb} \in \mathbb{R}^{M \times D \times N}$, where D is the embedding dimension and $N = (L - P)/S + 1$ is the number of time steps after downsampling. This transformation introduces a channel structure, enabling each input variable to be represented by D embedded feature channels.
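A possible PyTorch sketch of this patch embedding is given below, assuming each variable is embedded independently by a shared one-dimensional convolution with kernel size P and stride S; the batch-first tensor layout (B, M, D, N) and the hyperparameter values in the example are illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Map each variable's length-L series to D-dimensional patch embeddings."""

    def __init__(self, patch_size: int, stride: int, d_model: int):
        super().__init__()
        # One input channel per variable: the same Conv1d is shared across variables.
        self.proj = nn.Conv1d(1, d_model, kernel_size=patch_size, stride=stride)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, M, L) -> treat each variable as an independent 1-channel series
        B, M, L = x.shape
        x = x.reshape(B * M, 1, L)
        x = self.proj(x)                      # (B*M, D, N), N = (L - P) // S + 1
        D, N = x.shape[1], x.shape[2]
        return x.reshape(B, M, D, N)          # (B, M, D, N)

# Example: 14 variables, 36 time steps, patch size 4, stride 2, embedding dim 64
emb = PatchEmbedding(patch_size=4, stride=2, d_model=64)
out = emb(torch.randn(8, 14, 36))             # -> torch.Size([8, 14, 64, 17])
```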

3.2.2. ModernTCN Block

ModernTCN, proposed by Luo et al. [57], is a modern convolutional structure specifically designed for time series analysis. Within the block, the DWConv submodule processes the time dimension, learning temporal dependencies independently for each variable, while the ConvFFN submodule processes the variable dimension. Based on their roles within the module, they can be categorized into the time information extraction module and the variable information extraction module. This section introduces these two modules separately.
(1)
Time information extraction module DWConv
The DWConv module independently models the temporal dependencies of each variable’s patch through depth-wise convolution.
As illustrated in Figure 6, the input tensor has a shape of M × D × N, where M is the number of variables, D represents the number of feature channels for each variable, and N is the number of time steps. The DWConv module performs depth-wise 1D convolutions with kernel size K along the temporal dimension (N) for each channel independently, preserving the feature-wise structure while capturing temporal dependencies.
$$X_{DWConv} = \mathrm{DWConv}(X_{emb})$$
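The following sketch illustrates how such a depth-wise temporal convolution could be implemented in PyTorch by flattening the variable and channel dimensions and setting groups = M × D; the padding and kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DWConv(nn.Module):
    """Depth-wise 1-D convolution along the time axis, one kernel per (variable, channel) pair."""

    def __init__(self, n_vars: int, d_model: int, kernel_size: int):
        super().__init__()
        channels = n_vars * d_model
        self.dwconv = nn.Conv1d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, M, D, N) -> flatten variable and channel dims so groups = M * D
        B, M, D, N = x.shape
        x = x.reshape(B, M * D, N)
        x = self.dwconv(x)                    # temporal mixing only, no channel mixing
        return x.reshape(B, M, D, N)

out = DWConv(n_vars=14, d_model=64, kernel_size=5)(torch.randn(8, 14, 64, 17))
```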
(2)
Variable information extraction module ConvFFN
The ConvFFN module is inspired by the feed-forward network (FFN) in the Transformer. However, it implements feature interaction among multiple variables through grouped convolution and dimensional transformations. Additionally, to address the computational overhead caused by the high input dimensions of the fully connected layers in a traditional FFN, ConvFFN is designed as an inverted residual structure.
In the ModernTCN block, the ConvFFN module is designed as two complementary submodules: ConvFFN 1, which optimizes the feature channels of each variable independently, and ConvFFN 2, which enhances the interaction between variables. As shown in Figure 7, in ConvFFN 1 (Figure 7a), the input has dimensions M × D × N, and grouped convolutions are applied according to variables (groups = M). In other words, each variable in the data independently processes its channels, enhancing the feature representation within different channels of a single variable. In ConvFFN 2 (Figure 7b), the input is rearranged to D × M × N, and convolutions are grouped by channels (groups = D). This enables cross-variable information mixing, promoting the fusion of information between different variables and capturing the correlations between multiple variables.
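A simplified sketch of the two ConvFFN variants is shown below, assuming grouped 1 × 1 convolutions with an inverted-bottleneck expansion; the expansion ratio and activation are assumptions, and only the grouping logic follows the description above.

```python
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    """Grouped 1x1 ConvFFN: with groups = M it mixes channels within each variable (ConvFFN 1);
    after rearranging the input to (B, D, M, N), groups = D mixes information across variables (ConvFFN 2)."""

    def __init__(self, groups: int, group_width: int, expansion: int = 2):
        super().__init__()
        channels = groups * group_width
        hidden = channels * expansion
        self.net = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1, groups=groups),
            nn.GELU(),
            nn.Conv1d(hidden, channels, kernel_size=1, groups=groups),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, groups, group_width, N) -> flatten so each group's channels stay contiguous
        B, G, C, N = x.shape
        return self.net(x.reshape(B, G * C, N)).reshape(B, G, C, N)

x = torch.randn(8, 14, 64, 17)                     # (B, M, D, N)
ffn1 = ConvFFN(groups=14, group_width=64)          # per-variable channel mixing
x = ffn1(x)
ffn2 = ConvFFN(groups=64, group_width=14)          # cross-variable mixing
x = ffn2(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
```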

3.2.3. Temporal-Variable Dual Attention (TiVDA)

After the ModernTCN block, an attention mechanism, TiVDA (as shown in Figure 8), is proposed, specifically designed for SITS with multiple features. Given the input tensor $X \in \mathbb{R}^{M \times D \times N}$, TiVDA dynamically models feature importance along the variable-channel (M × D) and time-channel (N × D) dimensions through a combination of grouped convolution and dual-path attention.
(1)
Variable-wise Attention
Firstly, the input tensor X is rearranged into a new tensor $X_v$ to focus on the interactions along the variable dimension.
$$X_v = \mathrm{rearrange}(X,\ M\ D\ N \rightarrow (M\ D)\ N)$$
Secondly, a depth-wise convolution is applied along the time dimension N (groups = M × D). After the convolution, adaptive average pooling (AvgPool) and max pooling (MaxPool) are applied separately. The resulting pooled features are then processed through a small multilayer perceptron (MLP) to obtain attention weights for the variable dimension:
$$W_v = \sigma\left(\mathrm{FC}_2\left(\mathrm{ReLU}\left(\mathrm{FC}_1\left(\mathrm{avg\_pool} + \mathrm{max\_pool}\right)\right)\right)\right)$$
(2)
Temporal-wise Attention
Similarly, TiVDA also applies attention weights to the time dimension. The input tensor X is reshaped into $X_t$ to focus on the interactions along the time dimension.
$$X_t = \mathrm{rearrange}(X,\ M\ D\ N \rightarrow (N\ D)\ M)$$
Then, the attention calculation process is similar to that used for the variable dimension:
$$W_t = \sigma\left(\mathrm{FC}_2\left(\mathrm{ReLU}\left(\mathrm{FC}_1\left(\mathrm{avg\_pool} + \mathrm{max\_pool}\right)\right)\right)\right)$$
Finally, by element-wise multiplication, the attention weights from both dimensions are combined to obtain the final attention weights:
$$W = W_v \times W_t$$
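The sketch below outlines one plausible implementation of this dual-path attention, assuming a batch-first (B, M, D, N) input: each path applies a depth-wise convolution, combines average and max pooling, and passes the result through a small MLP with a sigmoid. The pooling axes, reduction ratio, and the way the combined weights are applied back to the features are assumptions rather than the exact published design.

```python
import torch
import torch.nn as nn

class TiVDA(nn.Module):
    """Dual-path attention producing variable-wise and temporal-wise weights (simplified sketch)."""

    def __init__(self, n_vars: int, d_model: int, n_steps: int, reduction: int = 4):
        super().__init__()
        self.conv_v = nn.Conv1d(n_vars * d_model, n_vars * d_model,
                                kernel_size=3, padding=1, groups=n_vars * d_model)
        self.conv_t = nn.Conv1d(n_steps * d_model, n_steps * d_model,
                                kernel_size=3, padding=1, groups=n_steps * d_model)
        self.mlp_v = nn.Sequential(nn.Linear(n_vars, n_vars // reduction + 1), nn.ReLU(),
                                   nn.Linear(n_vars // reduction + 1, n_vars))
        self.mlp_t = nn.Sequential(nn.Linear(n_steps, n_steps // reduction + 1), nn.ReLU(),
                                   nn.Linear(n_steps // reduction + 1, n_steps))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, M, D, N = x.shape
        # Variable path: depth-wise conv along time, pool over channels and time -> weights over M
        xv = self.conv_v(x.reshape(B, M * D, N)).reshape(B, M, D, N)
        pv = xv.mean(dim=(2, 3)) + xv.amax(dim=(2, 3))          # (B, M)
        w_v = torch.sigmoid(self.mlp_v(pv)).view(B, M, 1, 1)
        # Temporal path: depth-wise conv along variables, pool over channels and variables -> weights over N
        xt = x.permute(0, 3, 2, 1).reshape(B, N * D, M)
        xt = self.conv_t(xt).reshape(B, N, D, M)
        pt = xt.mean(dim=(2, 3)) + xt.amax(dim=(2, 3))          # (B, N)
        w_t = torch.sigmoid(self.mlp_t(pt)).view(B, 1, 1, N)
        return x * w_v * w_t                                    # combined weights W = W_v x W_t applied to features

out = TiVDA(n_vars=14, d_model=64, n_steps=17)(torch.randn(8, 14, 64, 17))
```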

3.2.4. Output Module

After TiVDA, three fully connected layers are used as the classifier of the model. The classifier receives the high-level features extracted by the ModernTCN block and the attention weights output by TiVDA, multiplies them element-wise, and maps the result to the final classification output.
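A minimal sketch of such a classifier head is shown below, assuming the attention-weighted features are flattened before the three fully connected layers; the hidden widths are illustrative.

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Three fully connected layers mapping the weighted features to class logits."""

    def __init__(self, in_dim: int, n_classes: int, hidden: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, M, D, N) attention-weighted features -> flatten to (B, M*D*N)
        return self.head(x.flatten(1))

logits = Classifier(in_dim=14 * 64 * 17, n_classes=4)(torch.randn(8, 14, 64, 17))
```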

3.2.5. Loss Function

Moreover, to make the model better adapt to the dataset, a composite loss was designed. During the training process, Focal Loss and QR-Ortho Loss are combined with weighted coefficients. The total loss function is expressed as:
$$L = (1 - \alpha) \times L_{focal} + \alpha \times L_{QR\text{-}Ortho}$$
(1)
Focal loss
Focal Loss is a loss function designed to address the issue of class imbalance in datasets [58]. Based on the standard Cross-Entropy Loss, it dynamically adjusts the sample weights to reduce the focus on easily classified samples while increasing the attention on hard-to-classify samples.
$$L_{focal}(p_t) = -\beta_t \left(1 - p_t\right)^{\gamma} \log\left(p_t\right)$$
where $p_t$ is the model’s predicted probability for the correct class, $\beta_t$ is the balancing factor, used to adjust for class imbalance and prevent the model from favoring the majority class, and $\gamma$ is the focusing factor, which controls the model’s attention on harder-to-classify samples.
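A sketch of this multi-class focal loss is given below, with the per-class balancing factors β and the focusing factor γ passed as arguments; the default γ = 2 is a common choice, not a value reported in this study.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               beta: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Multi-class focal loss: -beta_t * (1 - p_t)^gamma * log(p_t).

    logits : (B, C) raw class scores.
    targets: (B,) integer class labels.
    beta   : (C,) per-class balancing factors.
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)   # log p_t for each sample
    pt = log_pt.exp()
    loss = -beta[targets] * (1.0 - pt) ** gamma * log_pt
    return loss.mean()
```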
(2)
QR-Ortho Loss
QR-Ortho [59] is a regularization method based on QR decomposition [60], aimed at ensuring the orthogonality of features. When handling high-dimensional data, feature redundancy and correlation can affect the model’s ability to distinguish the data. To address this issue, one common approach is to apply an orthogonality constraint. The orthogonality constraint requires that the features be independent of each other, meaning there should be no redundant information between different features. Mathematically, this is expressed as the inner product of feature vectors being close to zero. Suppose Q is an n × m matrix, where each column is a feature vector. To minimize feature redundancy, we want to:
$$Q^{T} Q \approx I$$
where $Q^{T} Q$ is the product of the transpose of the feature matrix Q with the matrix itself, and $I$ is the identity matrix.
$$L_{QR\text{-}Ortho} = \frac{1}{2} \left\| Q^{T} Q - I \right\|_{F}^{2}$$
where $\|\cdot\|_F$ denotes the Frobenius norm, i.e., the square root of the sum of squares of all elements in the matrix. This loss measures the difference between the matrix $Q^{T} Q$ and the identity matrix $I$. By minimizing the QR-Ortho Loss, the column vectors of the feature matrix are forced toward orthogonality, thereby minimizing feature redundancy and improving model performance.
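A compact sketch of the orthogonality penalty and its weighted combination with the focal loss (reusing the focal_loss sketch above) is shown below. How the feature matrix Q is extracted from the network and the value of the weighting coefficient α are not specified in the text, so both are treated as arguments or assumed defaults.

```python
import torch

def qr_ortho_loss(q: torch.Tensor) -> torch.Tensor:
    """0.5 * || Q^T Q - I ||_F^2 for a feature matrix Q whose columns are feature vectors."""
    gram = q.T @ q
    eye = torch.eye(gram.shape[0], device=q.device)
    return 0.5 * (gram - eye).pow(2).sum()

def total_loss(logits, targets, features, beta, alpha: float = 0.1, gamma: float = 2.0):
    """Weighted combination L = (1 - alpha) * L_focal + alpha * L_QR-Ortho (alpha assumed)."""
    return (1.0 - alpha) * focal_loss(logits, targets, beta, gamma) \
        + alpha * qr_ortho_loss(features)
```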

3.3. Validation

To validate the effectiveness of our proposed algorithm, the sample data were first divided into training, validation, and testing sets at a ratio of 8:1:1. Comparative experiments were then conducted by evaluating HyperVTCN against four deep learning models (1D-CNN, LSTM, Transformer, and DCM) and a classical machine learning model (RF).
Specifically, RF is an ensemble learning method based on decision trees, known for its robustness in classification tasks. In this study, the number of trees was set to 100. The 1D-CNN model is configured with three convolutional layers containing 64, 128, and 256 filters (with a kernel size of 3 and padding set to 1), leveraging local receptive fields to capture short-term temporal patterns [17]. LSTM is a recurrent neural network known for its ability to retain long-term dependencies [39]. In this study, the LSTM model consists of 2 stacked layers, each with 512 hidden units. A Transformer uses a self-attention mechanism to model long-range dependencies in sequences [61]. We employ a Transformer with 2 encoder layers, model dimensions of 512, 4 attention heads, and a feedforward hidden size of 1024. Positional encoding uses the scale parameter τ = 10,000. DCM, an attention-based BiLSTM model proposed by Xu et al., has consistently achieved strong results in crop classification tasks [38]; it had a hidden size of 512 and 2 layers in our experiments. All models were implemented in the PyTorch (version 1.12.1) framework. Training was conducted for 100 epochs using the Adam optimizer with a learning rate of 0.0001. The learning rate scheduler employed was StepLR, which decreased the learning rate by a factor of 0.1 every 50 epochs. Additionally, unless otherwise specified, all experiments used the Cross-Entropy Loss function, which is commonly applied in multi-class classification tasks.
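For reference, the training configuration described above (Adam, learning rate 0.0001, StepLR decaying by a factor of 0.1 every 50 epochs, 100 epochs, Cross-Entropy Loss) can be expressed as the following sketch; the model and data loader are hypothetical stand-ins.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical placeholders for the model and the training split produced earlier.
model = torch.nn.Linear(14 * 36, 4)            # stand-in for HyperVTCN or a baseline
train_loader = DataLoader(TensorDataset(torch.randn(64, 14 * 36),
                                        torch.randint(0, 4, (64,))), batch_size=32)

criterion = torch.nn.CrossEntropyLoss()        # default loss unless otherwise specified
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

for epoch in range(100):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()                            # decay the learning rate by 0.1 every 50 epochs
```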
In the experiments, four metrics were selected to quantitatively evaluate the results: OA (Overall Accuracy), Kappa coefficient, Macro-F1 score, and F1 score. These metrics could comprehensively measure the model’s performance in crop classification tasks. OA was used to assess overall classification accuracy, while the Kappa coefficient quantified the agreement between the classification results and the reference labels beyond what would be expected from random classification, effectively eliminating the impact of chance on accuracy. The F1 score and Macro-F1 score reflected the classification performance of each category. The definitions of these metrics are as follows:
$$OA = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} \left(TP_i + FP_i + FN_i + TN_i\right)}$$
where $TP_i$ is the true positive count for class i, $FP_i$ is the false positive count for class i, $FN_i$ is the false negative count for class i, $TN_i$ is the true negative count for class i, and N denotes the total number of classes in the dataset.
$$Kappa = \frac{P_o - P_e}{1 - P_e}$$
$$P_o = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} \left(TP_i + FP_i + FN_i + TN_i\right)}$$
$$P_e = \sum_{i=1}^{N} \frac{TP_i + FP_i}{T} \times \frac{TP_i + FN_i}{T}$$
where $P_o$ is the observed consistency, i.e., the proportion of actual classifications that are correct, $P_e$ is the expected consistency, i.e., the proportion expected to be correct under random conditions, and T is the total number of samples.
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
where Precision measures how accurately the model predicts a specific class, while Recall assesses the model’s ability to identify samples belonging to that class correctly.
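The sketch below computes OA, Kappa, per-class F1, and Macro-F1 from a confusion matrix, following the definitions above; it assumes integer-encoded labels and is provided for illustration only.

```python
import numpy as np

def classification_metrics(y_true: np.ndarray, y_pred: np.ndarray, n_classes: int):
    """Compute OA, Kappa, per-class F1, and Macro-F1 from label arrays."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                          # rows: reference labels, columns: predictions
    total = cm.sum()
    oa = np.trace(cm) / total
    # Expected agreement from row/column marginals (chance consistency P_e)
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1.0 - pe)
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)
    recall = tp / np.maximum(cm.sum(axis=1), 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return oa, kappa, f1, f1.mean()            # Macro-F1 is the unweighted mean of per-class F1
```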

4. Results

4.1. Results of Comparative Experiments

As presented in Table 1, the OA, Kappa, and Macro-F1 scores of HyperVTCN were the highest. Compared with the other models, the OA improved by approximately 2–3%, the Kappa coefficient increased by 3–4.5%, and the Macro-F1 score improved by 2–3%. These results suggest that HyperVTCN achieves competitive performance in crop classification tasks. In particular, the advantage of HyperVTCN was most evident in its accurate identification of difficult-to-classify crop categories. For rice, HyperVTCN offered no clear advantage and was slightly inferior to Transformer and DCM, with a per-class F1 score approximately 0.30% lower than theirs. However, HyperVTCN performed exceptionally well in distinguishing among the corn, soybean, and “others” categories, for which its classification accuracy increased by 1.80–4.24%, 1.41–3.93%, and 2.82–3.72%, respectively.
From the F1 scores of the four categories across different models, it is evident that the corn, soybean, and “others” categories are much more challenging to classify than rice. These categories show more complex variations in the time-series data, with greater environmental differences and instability. Despite slightly lower performance on rice, HyperVTCN achieved better recognition of the more challenging categories, leading to significant improvements in the overall metrics (OA, Kappa, and Macro-F1).
In terms of model size and computational complexity, HyperVTCN maintains a relatively moderate level. As shown in Table 2, the number of parameters in HyperVTCN is comparable to, or even lower than, those of LSTM and Transformer. Its FLOPs are relatively higher, but still lower than those of DCM, indicating that it achieves a relatively good balance between accuracy and efficiency.

4.2. Results of Ablation Experiments in HyperVTCN

To verify the effectiveness of each component in HyperVTCN, we performed ablation studies. HyperVTCN is the full method used in this study, incorporating TiVDA and a weighted loss function combining Focal Loss (FL) and QR-Ortho Loss (QRL). We define HyperVTCN_noTiVDA as HyperVTCN without the attention mechanism TiVDA. HyperVTCN_CEL replaces the loss function in HyperVTCN with Cross-Entropy Loss (CEL). HyperVTCN_FL removes QRL from HyperVTCN. HyperVTCN_CEL + QRL replaces the loss function in HyperVTCN with a composite weighted loss combining CEL and QRL.
As depicted in Table 3, HyperVTCN achieved the best results in both OA and Kappa metrics. The results of the ablation experiments showed that adding the TiVDA attention mechanism module to the baseline improved the performance of HyperVTCN. Under the same experimental conditions, HyperVTCN outperformed HyperVTCN_noTiVDA, achieving a 0.76% improvement in OA and a 1.16% increase in Kappa.
Compared to HyperVTCN_CEL, HyperVTCN_FL achieved slightly higher OA and Kappa, with increases of 0.25% and 0.32%, respectively. This suggests that using FL may offer advantages in addressing class imbalance. When comparing HyperVTCN_CEL to HyperVTCN_CEL + QRL, the Overall Accuracy (OA) increased by 0.5%, and the Kappa increased by 0.67%. Similarly, compared to HyperVTCN_FL, HyperVTCN achieved a 0.5% improvement in OA and a 0.69% improvement in Kappa. These results suggest that the introduction of QRL contributed to further performance gains. QRL encourages orthogonality among the features extracted by the model, allowing each feature to capture independent information from the data and thereby enhancing the model’s ability to represent the data. HyperVTCN improved OA and Kappa over HyperVTCN_CEL, HyperVTCN_FL, and HyperVTCN_CEL + QRL, achieving the best performance among all versions. This indicates that the weighted combination of Focal Loss and QR-Ortho Loss optimized the model’s training process, enabling the model to better address class imbalance and feature redundancy, thereby improving performance in multi-class classification tasks.

5. Discussion

5.1. Evaluating the Effectiveness of Feature Modeling

To better understand the feature modeling mechanism of HyperVTCN, we removed all modules dedicated to cross-feature modeling to create the HyperV_noV model. Specifically, the ConvFFN module within the ModernTCN block and the Variable-wise Attention module within TiVDA were omitted. HyperV_noV retains the temporal modeling capability of HyperVTCN but lacks the cross-feature modeling capability. In this experiment, we systematically evaluated the performance of HyperVTCN and HyperV_noV on multiple datasets with five types of feature combinations, covering different data modalities and numbers of features, as summarized in Table 4.
The experimental results are shown in Figure 9. By comparing the performance on the R and S datasets (Figure 9a,b), it is evident that both HyperVTCN and HyperV_noV exhibit a marked improvement in OA and Kappa on the S dataset compared to the R dataset. This observation is consistent with previous studies, which suggest that effective feature engineering can provide models with richer learnable information, thereby significantly enhancing crop classification accuracy [62,63]. However, when additional selected features are incorporated on top of the S dataset (forming the S-R, S-R-T, and S-R-T-I datasets), the improvement in OA and Kappa for HyperV_noV is limited and in some cases even shows a declining trend. This indicates that the accuracy gains from adding features and feature engineering have an upper limit, with information gain gradually reaching saturation. These additional features are not effectively leveraged by the model and may instead introduce redundancy and noise, negatively affecting model performance [64]. In contrast, HyperVTCN continues to show an upward trend. Therefore, building upon feature engineering, further optimization of the model architecture to better capture inter-feature dependencies is crucial. This approach can unlock the potential of multisource satellite imagery and help overcome performance bottlenecks. As a result, it enables further improvements in classification accuracy.
Further comparison of HyperVTCN and HyperV_noV on the S-R, S-R-T, and S-R-T-I datasets shows the following. From S-R to S-R-T, adding only an LST feature results in a 1.38% increase in Kappa for HyperVTCN. In contrast, HyperV_noV, which lacks feature modeling capability, experiences a 0.56% decrease. On the S-R-T-I dataset, HyperVTCN’s Kappa increases by only 0.5% compared to S-R-T. These results point to the reason behind HyperVTCN’s improved performance in crop classification: its feature modeling components effectively capture the dependencies among optical, radar, and LST multisource remote sensing data. LST, as a direct indicator of surface thermal conditions, characterizes the temperature variations experienced by crops throughout their growth stages. Owing to fundamental genetic differences, different crops exhibit distinct physiological and phenological responses to temperature variations. Therefore, LST can be regarded as a feature that reflects the driving mechanisms underlying crop growth. In contrast, optical and radar data primarily capture the external states of crops under the combined influence of multiple environmental factors such as temperature, moisture, and nutrients. These data represent features that reflect the outcomes of growth processes and do not directly reveal the intrinsic differences among crops. LST and optical/radar data can thus complement each other. By exploring the synergistic relationships between these information sources, the feature modeling components of HyperVTCN better capture the intrinsic physiological differences of crops, thereby enabling more precise differentiation among various crop types. Given that the model has already captured the key dependencies in the S-R-T multisource data, incorporating the remote sensing indices in S-R-T-I provides only limited additional performance gains.
Overall, HyperVTCN models the dependencies among features and achieves better classification performance than HyperV_noV on most datasets. Results in Figure 9c–f further indicate that HyperVTCN’s cross-feature modeling capability is not incidental or limited to specific crops, but can be broadly applied across various crop types. However, when the synergy between data sources is limited or the feature dimensionality is low, its advantages may be constrained. For example, in the R dataset, which contains only radar features (VV and VH), the low feature dimensionality may cause HyperVTCN’s feature modeling components to introduce redundant transformations or increase the risk of overfitting, resulting in slightly lower classification performance compared to the simpler HyperV_noV. On the S-R dataset, HyperVTCN performs comparably to HyperV_noV. This is because the optical data already contain abundant information, as most days in northeastern China are sunny with minimal cloud or fog interference. Much of the information provided by radar is already captured by optical data, so the S-R dataset does not offer additional inter-source dependencies for the model. Consequently, the limited synergy between optical and radar data prevents HyperVTCN’s feature modeling capability from being fully leveraged.

5.2. The Improved Temporal Modeling Performance

HyperVTCN’s performance is attributed to its ability to model both feature interactions and temporal dependencies effectively. To more accurately evaluate its time relationship modeling capability, we conducted crop classification experiments using only single-feature data to eliminate the interference of feature dependencies and focus solely on the modeling effects of the time dimension.
Due to the limited discriminative power of single-feature data [64], we selected features with higher class discrimination potential. Specifically, the chosen feature should exhibit certain separability in the time series of different crop categories to ensure the validity of the experiment. Based on this criterion, we finally selected the three indices: NDVI, GCVI, and LSWI for the experiment. NDVI and GCVI have been widely interpreted as vegetation indices sensitive to crop growth and correlated with the leaf area index (LAI) [65]. LSWI is highly sensitive to leaf and soil moisture, enabling the identification of rice and the classification of corn and soybeans [66]. These three indices possess good discriminative ability, effectively supporting the experiments.
From the experimental results in Table 5, HyperVTCN exhibited outstanding performance across all three distinct features. Its Kappa value was, on average, 3.2% higher than that of the second-best model, and it also performed strongly on the other indicators. Transformer and DCM also show relatively good performance, while CNN and LSTM exhibit clear disadvantages. Interestingly, HyperVTCN, Transformer, and DCM all employ a temporal attention mechanism, which could be a key factor in their excellent performance in time series data modeling.
Additionally, the experimental results show that CNN and LSTM perform poorly on NDVI and GCVI, with an F1 score of 0 for some crop categories. This suggests that the information provided by NDVI or GCVI alone remains relatively limited, preventing these models from learning useful feature representations and leading them to assign most samples to the more prevalent classes. However, even under such information scarcity, HyperVTCN still demonstrates stable performance. Overall, although the classification accuracy of HyperVTCN on the single-feature dataset, with an average Kappa of 0.8075, is not as high as its performance on the S-R-I dataset, it still demonstrates strong temporal modeling capability compared to other models.

5.3. Reliability of Crop Mapping Using HyperVTCN

The 2019 crop map was generated using the HyperVTCN model and compared with the statistical yearbook data at the prefectural level (Figure 10). The consistency between the predicted and statistical areas was evaluated using the coefficient of determination (R2) and the root mean square error (RMSE). As shown in Figure 10b–d, the predicted areas for all three crops showed a strong correlation with the statistical data, with R2 values of 0.90, 0.96, and 0.98 for rice, maize, and soybean. For rice and maize, the slope of the fitted line was slightly greater than 1, indicating a small overestimation. This discrepancy may be attributed to the ability of remote sensing to detect small-scale crop fields not included in the statistical yearbook, as well as potential misclassification during the remote sensing process. However, the overall deviations were within an acceptable range.
Beyond the prefectural-level comparison with statistical data, we further evaluated the temporal robustness and local-scale reliability of the model by examining representative regions in Northeast China. We randomly selected a representative 30 km × 30 km region within each of Heilongjiang Province (HLJ), Jilin Province (JL), and Liaoning Province (LN) and mapped the crop distributions from 2017 to 2020 using the HyperVTCN method (Figure 11). Overall, the spatial patterns of rice distribution remained relatively stable, whereas soybean and corn exhibited clear interannual rotation and intercropping characteristics. Moreover, HyperVTCN demonstrated reliable transferability in predicting crop distributions in 2020, with results largely consistent with those from 2017 to 2019. The area of rice cultivation remained highly consistent with previous years (Figure 12). The areas of soybean and corn exhibited some interannual variations, primarily due to the soybean–corn rotation system in Northeast China. These findings demonstrate the model’s reliability in capturing temporal dynamics and spatial transfer patterns.
To further evaluate the advantages of our result, we compared the mapping results of our proposed method with two existing crop distribution products for Northeast China and two state-of-the-art deep learning approaches (Figure 13). Overall, all methods exhibit similar patterns in the general spatial distribution of crops. However, certain differences can be observed in terms of detail depiction and class discrimination.
Both You-CDL [49] and Su-CDL [67] employ the Random Forest (RF) classification method. Due to its 30 m spatial resolution, Su-CDL exhibits relatively blurred image details, with indistinct crop boundaries and limited recognition of small parcels, resulting in comparatively coarse classification outcomes. You-CDL performs better than Su-CDL in preserving spatial details; however, it still shows some confusion in the soybean–corn intercropping areas of regions a and b, and misclassifies several paddy fields as the “others” category in region c.
Among deep learning methods, both LSTM and DCM are capable of modeling the temporal dependencies of multi-temporal crop features. However, their delineation of crop strip boundaries in the soybean–corn intercropping areas of regions a and b appears fragmented, with poor internal spatial continuity. In contrast, HyperVTCN reduces misclassification and delineates the distribution boundaries of rice, soybean, and corn more precisely, achieving higher classification purity and consistency.

5.4. Uncertainties and Future Directions

The above results provide strong evidence for the reliability of the HyperVTCN method. HyperVTCN not only demonstrates a strong capacity to uncover inherent relationships among remote sensing features, but also excels at capturing temporal patterns in the data. However, some uncertainties still remain in this study.
First, due to weather conditions such as cloud cover and the limitations of satellite revisit cycles, obtaining continuous time series data from medium- and high-resolution remote sensing imagery is often challenging. Although this study mitigated data discontinuities by fusing multisource optical data from Landsat 7, Landsat 8, and Sentinel-2 to ensure more complete temporal coverage, missing observations and the inherent heterogeneity among sensors may still introduce uncertainty into model training and prediction. These factors may affect the reliability of feature extraction and ultimately influence classification performance.
Second, the applicability and generalization ability of HyperVTCN require broader empirical validation. The current experiments were conducted only in the three northeastern provinces of China, a region with a predominantly single-season cropping system. The model’s effectiveness in areas with more complex or multi-season cropping systems remains unknown. Future research could further assess its performance in such contexts and explore its adaptability to diverse agricultural practices.
In future studies, research could further explore the synergistic optimization of temporal and feature modeling capabilities. For models that primarily focus on temporal sequence modeling, such as Transformer-based architectures and Recurrent Neural Networks (RNNs), integrating modules dedicated to feature modeling may enhance performance and represents a promising direction for future research. Moreover, investigating effective ways to optimize the interaction between these two dimensions may enable models to better capture intrinsic data patterns, thereby enhancing overall performance.
Another important avenue for future research concerns the interpretability of the learned temporal and feature dependencies. In the current study, these internal mechanisms remain opaque. Enhancing model interpretability could provide deeper insights into the decision-making process and potentially guide further architectural enhancements.
From an application perspective, it is also worth investigating the adaptability and timeliness of this method for within-season classification tasks. In particular, achieving accurate identification during the early stages of crop growth is of great significance for enabling precision agricultural management. Enhancing the method’s ability to capture crop growth dynamics under limited observations, while enabling earlier and faster classification, is a promising avenue for future research.

6. Conclusions

In this study, we propose a crop classification method named HyperVTCN. It aims to effectively address the challenges of crop classification using time series data from multisource satellite imagery. To validate the effectiveness of the method, we constructed a new dataset. This dataset encompasses the extensive agricultural region of Northeast China and includes data from multiple sources: Sentinel-1 SAR, Sentinel-2, Landsat-7/8 TOA, and MODIS LST. Experimental results show that HyperVTCN achieves higher accuracy compared to existing methods. Additionally, ablation experiments were conducted to evaluate the effectiveness of the attention mechanism and the weighted loss function strategy, further validating the importance of each component.
We further validated the effectiveness of the model’s temporal and feature modeling capabilities, as well as the reliability of HyperVTCN in crop mapping. HyperVTCN has shown effective performance in crop classification and offers a new perspective for modeling time series data from multisource satellite imagery. The application prospects of HyperVTCN are extensive, particularly in the fields of smart agriculture and precision agriculture.

Author Contributions

Conceptualization, X.H. and L.L.; methodology, X.H., L.L. and M.F.; software, X.H. and W.K.; validation, X.H. and J.L.; formal analysis, L.L. and Z.Q.; investigation, X.H., L.L. and W.K.; resources, X.H. and M.F.; data curation, X.H., L.L. and Y.W.; writing—original draft preparation, X.H. and Y.W.; writing—review and editing, X.H., M.F., Z.L. and L.L.; visualization, X.H. and J.L.; supervision, X.H., L.L. and Z.L.; project administration, L.L. and Z.Q.; funding acquisition, L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data in this study can be accessed from the corresponding author upon request due to the privacy requirements of the research project.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bovolo, F.; Bruzzone, L.; Solano-Correa, Y.T. 2.08—Multitemporal Analysis of Remotely Sensed Image Data. In Comprehensive Remote Sensing; Liang, S., Ed.; Elsevier: Oxford, UK, 2018; pp. 156–185. ISBN 978-0-12-803221-3. [Google Scholar]
  2. Liu, L.; Chen, X.; Xu, X.; Wang, Y.; Li, S.; Fu, Y. Changes in Production Potential in China in Response to Climate Change from 1960 to 2010. Adv. Meteorol. 2014, 2014, 640320. [Google Scholar] [CrossRef]
  3. Sachs, J.; Remans, R.; Smukler, S.; Winowiecki, L.; Andelman, S.J.; Cassman, K.G.; Castle, D.; DeFries, R.; Denning, G.; Fanzo, J.; et al. Monitoring the World’s Agriculture. Nature 2010, 466, 558–560. [Google Scholar] [CrossRef] [PubMed]
  4. Qie, L.; Pu, L.; Tang, P.; Liu, R.; Huang, S.; Xu, F.; Zhong, T. Gains and Losses of Farmland Associated with Farmland Protection Policy and Urbanization in China: An Integrated Perspective Based on Goal Orientation. Land Use Policy 2023, 129, 106643. [Google Scholar] [CrossRef]
  5. Atzberger, C. Advances in Remote Sensing of Agriculture: Context Description, Existing Operational Monitoring Systems and Major Information Needs. Remote Sens. 2013, 5, 949–981. [Google Scholar] [CrossRef]
Figure 1. Spatial distribution and category counts of field samples in the study area.
Figure 2. Schematic diagram of the phenological stages of rice, corn, and soybean in Northeast China.
Figure 3. Local enlarged views of representative sample points based on false color composite and Google Earth imagery. Panels (a–f) correspond to rice, soybean, corn, water bodies, built-up areas, and forest, respectively.
Figure 4. Flowchart of the proposed crop classification method.
Figure 5. Reconstructed time series curves of reflectance and other remote sensing features. Only Red, Green, Blue, NIR, SWIR, and NDVI are shown here.
Figure 6. The schematic illustration of DWconv. Each bar in the figure represents the time series of a single feature channel for a specific variable. Different colors indicate different variables, while bars of the same color represent different channels of the same variable. Each feature channel is independently processed by its corresponding depth-wise convolution kernel K through one-dimensional convolution along the time dimension N.
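The depth-wise temporal convolution illustrated above can be summarized in a few lines of PyTorch. The sketch below is illustrative rather than the exact implementation: the variable count, embedding width, and kernel size are assumed values, and only the grouping that keeps each feature channel independent is taken from the caption.

```python
import torch
import torch.nn as nn

# Minimal sketch of the depth-wise temporal convolution (DWConv) in Figure 6,
# assuming an input of shape (batch, M * D, N): M variables, D embedding channels
# per variable, N time steps. Names and the kernel size are illustrative assumptions.
class DepthwiseTemporalConv(nn.Module):
    def __init__(self, num_vars: int, d_model: int, kernel_size: int = 51):
        super().__init__()
        channels = num_vars * d_model
        # groups == channels: each feature channel gets its own 1-D kernel K,
        # so convolution happens only along the time dimension N, never across channels.
        self.dwconv = nn.Conv1d(
            in_channels=channels,
            out_channels=channels,
            kernel_size=kernel_size,
            padding=kernel_size // 2,
            groups=channels,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, M * D, N) -> (batch, M * D, N)
        return self.dwconv(x)

# Example with illustrative sizes: 14 variables, 8 channels each, 36 time steps.
x = torch.randn(2, 14 * 8, 36)
print(DepthwiseTemporalConv(num_vars=14, d_model=8)(x).shape)  # torch.Size([2, 112, 36])
```

Because the number of groups equals the number of channels, information is mixed only along the time axis; cross-channel and cross-variable mixing is left to the ConvFFN layers described next.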
Figure 7. The schematic illustration of ConvFFN. (a) The schematic illustration of ConvFFN1. (b) The schematic illustration of ConvFFN2. Similarly, each bar in the figure represents the time series of a single feature channel for a specific variable. Different colors indicate different variables, while bars of the same color represent different channels of the same variable. The labels a1, b1, c1,…, n1 indicate different input features, where the letters represent distinct variables in the input sequence and the numbers denote different channels of each variable.
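A minimal sketch of the two grouped pointwise-convolution feed-forward variants is given below. It assumes the ModernTCN-style design in which ConvFFN1 groups by variable (mixing the channels within each variable) and ConvFFN2 groups by channel (mixing the variables that share a channel index); layer sizes, the expansion ratio, and the channel rearrangement required before ConvFFN2 are assumptions, not the paper's exact code.

```python
import torch
import torch.nn as nn

# Sketch of the two ConvFFN variants in Figure 7 using grouped 1x1 convolutions.
# Shapes, the expansion ratio r, and the grouping convention are illustrative assumptions.
class ConvFFN(nn.Module):
    def __init__(self, num_vars: int, d_model: int, r: int = 2, per_variable: bool = True):
        super().__init__()
        channels = num_vars * d_model
        # per_variable=True  (ConvFFN1): groups = M, mixes the D channels inside each variable.
        # per_variable=False (ConvFFN2): groups = D, mixes the M variables for each channel,
        #   assuming the tensor was rearranged so same-index channels sit in one group.
        groups = num_vars if per_variable else d_model
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels * r, kernel_size=1, groups=groups),
            nn.GELU(),
            nn.Conv1d(channels * r, channels, kernel_size=1, groups=groups),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, M * D, N) -> (batch, M * D, N)
        return self.net(x)

# Example with illustrative sizes: 14 variables, 8 channels each, 36 time steps.
x = torch.randn(2, 14 * 8, 36)
print(ConvFFN(14, 8, per_variable=True)(x).shape)   # torch.Size([2, 112, 36])
print(ConvFFN(14, 8, per_variable=False)(x).shape)  # torch.Size([2, 112, 36])
```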
Figure 8. The schematic illustration of Temporal-Variable Dual Attention (TiVDA). The input feature map is processed along two separate pathways: the left pathway implements Variable-wise Attention; the right pathway implements Temporal-wise Attention. The outputs are combined to produce the refined representation.
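The dual-pathway idea can be sketched as follows. This is a simplified reading of the caption, assuming average pooling along the complementary axis and a sigmoid-gated additive recombination; the actual TiVDA formulation in the paper may differ in its pooling, scoring, and fusion details.

```python
import torch
import torch.nn as nn

# Simplified sketch of the dual attention in Figure 8: one pathway weights variables,
# the other weights time steps, and the two refined maps are combined. Pooling,
# MLP sizes, and the additive fusion are assumptions, not the paper's exact design.
class DualAttention(nn.Module):
    def __init__(self, num_vars: int, num_steps: int, hidden: int = 16):
        super().__init__()
        self.var_mlp = nn.Sequential(   # variable-wise pathway
            nn.Linear(num_vars, hidden), nn.ReLU(), nn.Linear(hidden, num_vars)
        )
        self.time_mlp = nn.Sequential(  # temporal-wise pathway
            nn.Linear(num_steps, hidden), nn.ReLU(), nn.Linear(hidden, num_steps)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, M, N) with M variables and N time steps
        var_w = torch.sigmoid(self.var_mlp(x.mean(dim=2)))    # (batch, M): pooled over time
        time_w = torch.sigmoid(self.time_mlp(x.mean(dim=1)))  # (batch, N): pooled over variables
        return x * var_w.unsqueeze(2) + x * time_w.unsqueeze(1)

x = torch.randn(4, 14, 36)              # 14 variables, 36 time steps (illustrative)
print(DualAttention(14, 36)(x).shape)   # torch.Size([4, 14, 36])
```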
Figure 9. Evaluation results of HyperVTCN and HyperV_noV on datasets with varying feature groups. (a) Overall Accuracy (OA) results. (b) Kappa coefficient results. (c) Rice_F1 results. (d) Corn_F1 results. (e) Soybean_F1 results. (f) Others_F1 results. Here, R represents Radar Features, S represents Spectral Features, S-R represents Spectral-Radar Features, S-R-T represents Spectral-Radar-Temperature Features, and S-R-T-I represents Spectral-Radar-Temperature-Index Features.
Figure 10. Crop distribution mapping and validation against official statistics in 2019. (a) Predicted crop distribution map; (b) validation for rice planting area; (c) validation for corn planting area; (d) validation for soybean planting area.
Figure 11. Spatiotemporal crop mapping results for representative regions generated by the proposed model (2017–2020).
Figure 12. Annual planting areas of rice, corn, and soybean from 2017 to 2020 in selected 30 km × 30 km regions within (a) Heilongjiang (HLJ), (b) Jilin (JL), and (c) Liaoning (LN), derived from classification results.
Figure 13. Comparison of different datasets and models in 2019.
Table 1. Performance comparison of HyperVTCN with state-of-the-art models. The best score for each metric is shown in bold.
Method        OA      Kappa   Macro-F1  Rice_F1  Corn_F1  Soybean_F1  Others_F1
RF            0.8802  0.8305  0.8903    0.9701   0.8371   0.8752      0.8788
CNN           0.8836  0.8346  0.8939    0.9704   0.8365   0.8901      0.8783
LSTM          0.8911  0.8458  0.8998    0.9637   0.8609   0.8873      0.8873
Transformer   0.8894  0.8437  0.8992    0.9735   0.8427   0.8940      0.8864
DCM           0.8844  0.8363  0.8947    0.9733   0.8544   0.8688      0.8824
HyperVTCN     0.9129  0.8760  0.9182    0.9703   0.8789   0.9081      0.9155
Table 2. Comparison of model parameters and computational complexity.
Method        FLOPs (G)  Param Count (M)
CNN           0.0729     0.5898
LSTM          3.7780     3.1846
Transformer   2.5038     2.1171
DCM           10.0401    8.4675
HyperVTCN     0.9741     7.6252
Table 3. Results of the ablation experiments. Each row corresponds to a different configuration where specific model components are modified. Abbreviations: TiVDA: Temporal-Variable Dual Attention; FL: Focal Loss; QRL: QR-Ortho Loss; CEL: Cross-Entropy Loss.
Method                OA      Kappa
HyperVTCN_noTiVDA     0.8978  0.8543
HyperVTCN_CEL         0.9054  0.8659
HyperVTCN_FL          0.9079  0.8691
HyperVTCN_CEL + QRL   0.9104  0.8726
HyperVTCN             0.9129  0.8760
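For reference, the focal loss (FL) listed in Table 3 follows the standard formulation; a minimal sketch is given below. The focusing parameter gamma is an assumed value, and the QR-Ortho loss (QRL), which is specific to this work, is not reproduced here.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a multi-class focal loss (the "FL" component in Table 3).
# gamma = 2.0 is an assumed value; the paper's QR-Ortho regularizer is not shown.
def focal_loss(logits: torch.Tensor, target: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    # logits: (batch, num_classes); target: (batch,) with integer class indices
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)  # log-prob of the true class
    pt = log_pt.exp()
    # Down-weight well-classified samples by (1 - pt)^gamma
    return -((1.0 - pt) ** gamma * log_pt).mean()

logits = torch.randn(8, 4)              # e.g., 4 classes: rice, corn, soybean, others
target = torch.randint(0, 4, (8,))
print(focal_loss(logits, target))
```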
Table 4. Overview of feature sets used in experiments.
Feature Group                                Number of Features  Feature Types Included
Spectral-Radar-Temperature-Index Features    14                  Red, Green, Blue, NIR, SWIR, VV, VH, NDVI, RVI, EVI, GCVI, LSWI, LST, VV/VH
Spectral-Radar-Temperature Features          8                   Red, Green, Blue, NIR, SWIR, VV, VH, LST
Spectral-Radar Features                      7                   Red, Green, Blue, NIR, SWIR, VV, VH
Spectral Features                            5                   Red, Green, Blue, NIR, SWIR
Radar Features                               2                   VV, VH
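Most of the index features in Table 4 follow standard definitions; a brief sketch of a few of them, computed from the listed optical bands, is shown below for orientation. The exact RVI definition and any band scaling used in the paper are not restated here, so these formulas are illustrative assumptions.

```python
import numpy as np

# Hedged sketch: widely used formulas for some of the "Index Features" in Table 4,
# computed from surface-reflectance bands in [0, 1]. RVI and the VV/VH ratio are
# omitted because their exact definitions in the paper are not restated here.
def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    return (nir - red) / (nir + red + 1e-10)

def evi(nir: np.ndarray, red: np.ndarray, blue: np.ndarray) -> np.ndarray:
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)

def gcvi(nir: np.ndarray, green: np.ndarray) -> np.ndarray:
    return nir / (green + 1e-10) - 1.0

def lswi(nir: np.ndarray, swir: np.ndarray) -> np.ndarray:
    return (nir - swir) / (nir + swir + 1e-10)

# Example with synthetic reflectance values
nir, red, green, blue, swir = (np.array([0.4]), np.array([0.1]),
                               np.array([0.08]), np.array([0.05]), np.array([0.2]))
print(ndvi(nir, red), evi(nir, red, blue), gcvi(nir, green), lswi(nir, swir))
```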
Table 5. Under univariate conditions, the performance of HyperVTCN is compared with state-of-the-art deep learning classification methods. The optimal value for each metric is shown in bold; the second-best value is underlined.
Dataset  Method       OA      Kappa   Macro-F1  Rice_F1  Corn_F1  Soybean_F1  Others_F1
NDVI     CNN          0.6231  0.4566  0.4688    0        0.4916   0.5189      0.8649
NDVI     LSTM         0.7747  0.6812  0.7538    0.7740   0.6175   0.7538      0.8699
NDVI     Transformer  0.8333  0.7649  0.8262    0.8311   0.7544   0.8484      0.8709
NDVI     DCM          0.7923  0.7056  0.7736    0.7724   0.6775   0.7703      0.8741
NDVI     HyperVTCN    0.8476  0.7848  0.8419    0.8652   0.7737   0.8401      0.8887
GCVI     CNN          0.6432  0.486   0.4982    0        0.4881   0.6530      0.8517
GCVI     LSTM         0.4506  0.2310  0.2799    0        0.4378   0           0.6816
GCVI     Transformer  0.8375  0.7707  0.8379    0.8990   0.7460   0.8336      0.8728
GCVI     DCM          0.7965  0.7122  0.7936    0.8571   0.6854   0.7772      0.8548
GCVI     HyperVTCN    0.871   0.8168  0.8729    0.9272   0.8023   0.8752      0.8867
LSWI     CNN          0.804   0.7254  0.8111    0.8721   0.7792   0.7845      0.8084
LSWI     LSTM         0.8157  0.738   0.8216    0.9078   0.7619   0.7795      0.8370
LSWI     Transformer  0.8509  0.7911  0.8588    0.9320   0.8076   0.8405      0.8550
LSWI     DCM          0.8342  0.7656  0.8396    0.9246   0.7782   0.7993      0.8563
LSWI     HyperVTCN    0.8735  0.8210  0.8779    0.9446   0.8204   0.8582      0.8884