Predicting Crop Growth Patterns with Spatial–Temporal Deep Feature Exploration for Early Mapping

: The timely and accurate mapping of crops over large areas is essential for alleviating food crises and formulating agricultural policies. However, most existing classical crop mapping methods usually require the whole-year historical time-series data that cannot respond quickly to the current planting information, let alone for future prediction. To address this issue, we propose a novel spatial–temporal feature and deep integration strategy for crop growth pattern prediction and early mapping (STPM). Speciﬁcally, the STPM ﬁrst learns crop spatial–temporal evolving patterns from historical data to generate future remote sensing images based on the current observations. Then, a robust crop type recognition model is applied by combining the current early data with the predicted images for early crop mapping. Compared to existing spatial–temporal prediction models, our proposed model integrates local, global, and temporal multi-modal features comprehensively. Not only does it achieve the capability to predict longer sequence lengths (exceeding 100 days), but it also demonstrates a signiﬁcant improvement in prediction accuracy for each time step. In addition, this paper analyses the impact of feature dimensionality and initial data length on prediction and early crop mapping accuracy, demonstrating the necessity of multi-modal feature fusion for spatial–temporal prediction of high-resolution remote sensing data and the beneﬁts of longer initial time-series (i.e., longer crop planting time) for crop identiﬁcation. In general, our method has the potential to carry out early crop mapping on a large scale and provide information to formulate changes in agricultural conditions promptly.


Introduction
The rapid increase in the global population has led to further demand for food production [1,2].However, the global food crisis is becoming increasingly serious due to the external conditions of extreme weather caused by global warming and frequent regional conflicts [3][4][5][6].Therefore, it is urgent to implement sustainable management and organic integration of global arable land to adapt to the needs of future social development [7].Timely and accurate mapping of crops on a large scale can provide the basic information for improving crop management, disaster assessment, production, and food price prediction [8][9][10].With the continuous advancement of remote sensing technology, it has been successfully used for rapid extraction and mapping of crop information due to its low cost, large scale, and efficient data acquisition capabilities.
In order to capture the spatial distribution of crop types, previous crop mapping studies have mostly relied on machine learning techniques such as random forest (RF), support vector machine (SVM), and decision tree, using medium-to low-resolution optical remote sensing data (MODIS, AVHARR, and Landsat) covering a full year in the study area to complete the mapping task [11][12][13][14].However, it is difficult to obtain fine-scale crop distribution results with the limited spatial resolution of satellite data.As a supplement, Sentinel-2 (S2) satellite data with higher spatial resolution and shorter revisit periods make it possible to map crops at a finer scale.For example, Hu et al. [15] generated a map of maize planting distribution in Heilongjiang Province, China, in 2018 based on S2 time-series data and SVM method.Wang et al. [16] used Landsat 7/8, Sentinel-1 (S1), and S2 time-series data throughout 2018, and based on the decision tree method, mapped the sugarcane planting distribution in Guangxi Province, China.Overall, these existing studies are mostly based on the historical time-series remote sensing data during crop growing season, which cannot meet the urgent needs of prior rapid crop information mapping.Therefore, it is crucial to develop a method that can achieve fine-scale crop mapping in the early stages during crop growth.
At present, research on early crop mapping is mainly focused on determining the optimal combination of limited time-series remote sensing data and collecting representative training samples [17].Yi et al. [18] used CNN models to compare and analyze the accuracy of crop mapping using S2 data with different time lengths and determine the optimal time span for crop mapping in the Shiyang River Basin.Yan and Ryu [19] generated high-confidence crop identification samples by combining high-quality crop attributes from Google Street View images and remote sensing data in order to complete crop mapping in the early stages of crop growth.However, these methods are often with high data collection costs and limited by geographical variations that prevent them from large-scale applications.Therefore, a universal framework for early crop mapping using limited temporal remote sensing data still needs to be developed.Fundamentally, crop identification through time-series remote sensing data is often based on learning the typical phenological characteristics of different crops to identify their types [20].As more data is collected throughout the entire crop growth cycle, the class-specific representative feature can be derived with higher crop identification accuracy.However, it is difficult to obtain phenological characteristics during the early stages of crop growth even with higher revisit satellites, which fundamentally limits the crop mapping accuracy for early crop mapping.
As a supplement, spatial-temporal prediction methods based on remote sensing data can effectively alleviate the problem by predicting crop phenological features during early crop growth stages.Specifically, the spatial-temporal prediction methods learn the crop growth trends over the past few years, making it possible to forecast longer or even complete annual time-series given early stages of crop growth.Currently, the existing spatial-temporal prediction methods, such as PredRNN and ConvLSTM, are mostly used for urban traffic flow and video prediction [21][22][23][24].A few spatial-temporal prediction studies in remote sensing are mainly focused on predicting remote sensing products such as land cover, surface reflectance, and meteorology.For instance, Azari et al. [25] predicted land cover maps for 2030 using a Decision Forest-Markov chain model.Nakapan and Hongthong [26] successfully predicted surface reflectance and AOD products for the following year using a linear regression model.Still, the challenges of remote sensing data forecasting in crop areas remain unsolved.In general, there are two aspects that prevent accurate crop prediction: (1) crop areas share complex spatial patterns (i.e., texture at various scales) with respect to different types, so it is difficult to encode such spatial patterns for future data construction; (2) crops have non-stationary temporal evolving (i.e., phenological) patterns that also elevate the difficulty to formulate such complex pattern for time-series data forecasting.Therefore, it is crucial to develop a spatial-temporal prediction method that is suitable for higher-resolution remote sensing data (especially in cropland areas) for future data prediction.
Compared with traditional machine learning methods, deep learning has obvious advantages in extracting robust features in remote sensing applications [27].With the continuous development of computational power and artificial intelligence, various deep learning models, such as CNN, Recurrent Neural Networks (RNN), and their variants, have been widely used in remote sensing [28][29][30].In order to generate robust spatial feature representations, the CNN framework uses pre-set convolutional kernels to slide over the input data to obtain robust spatial features, including high-frequency contour information of objects.However, due to the limitation of the kernel size, the receptive field generated by the convolutional kernel is restricted, resulting in local-scale spatial feature exploration.In contrast, the Transformer model encodes all tokens from the input data and uses nested attention mechanisms to extract spatial representation at the global scale [31,32].Meanwhile, the temporal feature also plays an important role in terms of crop phenology formulation.In this scope, RNN is capable of capturing temporal correlations in remote sensing time-series.However, studies have shown that as the length of the timeseries data increases, RNN networks may experience the problem of vanishing gradients or long-range dependencies collapsing, which is fatal for spatial-temporal prediction models [33].To address this issue, the LSTM model alleviates the gradient-vanishing problem by introducing gate mechanisms and memory cells to improve long-term temporal feature formulation.
Aiming to formulate comprehensive spatial-temporal representations for future remote sensing image prediction, especially to map crop types at early stages, we propose an integrated deep learning method called STPM that combines spatial-temporal prediction and crop type mapping.The proposed spatial-temporal prediction and early crop mapping approach proposed in this study comprises two primary components.The first component involves acquiring robust spatial-temporal patterns of crop evolution from historical data, which are utilized to generate future remote sensing images based on cur-rent observations.The second component entails implementing robust models for identifying crop types to facilitate early crop mapping by integrating current early data with predicted imagery.
The main contributions of the proposed method framework are as follows: (1) An efficient spatial-temporal prediction method is proposed by integrating the multimodular feature from "local-global-temporal" domains for future remote sensing data generation.(2) A new paradigm for early crop mapping is proposed that alleviates the problem of insufficient observations by combining crop growth trend prediction with identification models.
The remaining sections of this paper are organized as follows: Section 2 describes the detailed structure of the proposed STPM framework.Section 3 describes the study area, satellite time-series, sample collection, and post-processing flow, as well as the model parameters and computer configuration used in the experiments.In Section 4, the reliability of spatial-temporal prediction and early crop mapping results are discussed.Finally, Section 5 details the importance of different modular features within the spatial-temporal prediction model, the impact of initial length on crop mapping accuracy, and the advantages and limitations of the proposed STPM method.

Methodological Framework Overview
The STPM framework proposed in this study comprises three sub-modules.As illustrated in Figure 1, the first sub-module is the temporal remote sensing data preparation and sample generation module, which is responsible for data pre-processing and sample enhancement in preparation for the next step of feature extraction of crops.The second sub-module focuses on the extraction and fusion of local-global-temporal features in the spatial-temporal prediction stage, whereby the local-global-temporal depth features of the high-resolution spatial-temporal remote sensing data are obtained to fully characterize the crop growth pattern.The last part includes the pre-training of the crop classification model (for subsequent crop mapping tasks) and the rolling prediction of future remote sensing data (to fill in the data gaps for time-series mapping).In the next section, we will provide a detailed introduction to the spatial-temporal feature extraction and multi-modal deep feature fusion process of the spatial-temporal prediction model.
the high-resolution spatial-temporal remote sensing data are obtained to fully characterize the crop growth pattern.The last part includes the pre-training of the crop classification model (for subsequent crop mapping tasks) and the rolling prediction of future remote sensing data (to fill in the data gaps for time-series mapping).In the next section, we will provide a detailed introduction to the spatial-temporal feature extraction and multimodal deep feature fusion process of the spatial-temporal prediction model.

Local-Global-Temporal Multi-Modal Feature Exploration and Fusion
In order to effectively extract robust spatial-temporal evolving features of crops from remote sensing data, we have designed an integrated multi-modal feature extraction network architecture.Figure 2 shows an intuitive representation of the proposed network architecture.This model takes randomly cropped spatial-temporal patches as input samples, followed by an in-depth analysis of the recursive characteristics of remote sensing time-series data for future data generation.From various perspectives, such local-globaltemporal multi-modal features are able to explore the spatial-temporal evolution patterns of crops for better simulation.Specifically, when the input sample  ∈  × × (where H and W represent the height and width of the input patch, and T represents the length of the temporal dimension) enters the feature extraction network, it is extracted by multiple different feature extraction backbone networks for corresponding modal features.For local-scale features, the model uses a shallow depth separable convolution network (DSC) as the feature extractor.Compared with commonly used backbones [34,35], the DSC network can reduce the required training parameters while ensuring efficient feature extraction.For the input sample, this feature extraction stage includes two processes, namely convolution over time (DT) and convolution over pixels (PC).In the DT stage, n (n = T) 3 × 3 two-dimensional convolution kernels are used to perform sliding convolution on different temporal nodes to obtain the feature  that is consistent with the initial dimension of the sample.The calculation process can be represented as  =  ().Then, based on feature  , feature aggregation calculations are performed for each pixel using C convolution kernels of size 1 × 1 × T, resulting in the local-scale feature F1.The calculation in-

Local-Global-Temporal Multi-Modal Feature Exploration and Fusion
In order to effectively extract robust spatial-temporal evolving features of crops from remote sensing data, we have designed an integrated multi-modal feature extraction network architecture.Figure 2 shows an intuitive representation of the proposed network architecture.This model takes randomly cropped spatial-temporal patches as input samples, followed by an in-depth analysis of the recursive characteristics of remote sensing time-series data for future data generation.From various perspectives, such local-globaltemporal multi-modal features are able to explore the spatial-temporal evolution patterns of crops for better simulation.Specifically, when the input sample x ∈ R H×W×T (where H and W represent the height and width of the input patch, and T represents the length of the temporal dimension) enters the feature extraction network, it is extracted by multiple different feature extraction backbone networks for corresponding modal features.For localscale features, the model uses a shallow depth separable convolution network (DSC) as the feature extractor.Compared with commonly used backbones [34,35], the DSC network can reduce the required training parameters while ensuring efficient feature extraction.For the input sample, this feature extraction stage includes two processes, namely convolution over time (DT) and convolution over pixels (PC).In the DT stage, n (n = T) 3 × 3 two-dimensional convolution kernels are used to perform sliding convolution on different temporal nodes to obtain the feature y 1 that is consistent with the initial dimension of the sample.The calculation process can be represented as y 1 = f 1 (x).Then, based on feature y 1 , feature aggregation calculations are performed for each pixel using C convolution kernels of size 1 × 1 × T, resulting in the local-scale feature F 1 .The calculation involved can be represented as F 1 = f 2 y 1 , F 1 ∈ R H×W×C , where f 1 , f 2 represent two different convolution operations.This network branch is used to extract local texture features from the sample.We set the depth of the final output feature to be C.
volved can be represented as  =  ( ),  ∈  × × , where  ,  represent two different convolution operations.This network branch is used to extract local texture features from the sample.We set the depth of the final output feature to be C.For the extraction of global-scale features for the spatial dimension, we used the Transformer model as the feature extraction backbone network.The input sample x is first reshaped and flattened into a two-dimensional vector  ∈  ×( × ) , and then a two-dimensional convolution layer is used to project the feature dimension of the vector to D, resulting in a feature vector  ∈  × .To maintain the order of the sequence within the sample and reduce the computational cost of the model, the backbone network preserves the original 1-D position embedding module.Specifically, a fixed position vector p is added to the feature vector  to obtain a feature vector  * of the same dimension.Finally, the feature vector  * is fed into the stacked encoders, and the corresponding weight information is obtained through the attention mechanism.The internal structure of the encoder is shown in Figure 2. From the figure, it can be seen that each encoder layer consists of a Multi-Head Self-Attention (MHSA) module, an MLP module, and three Layer Norm (LN) layers.Meanwhile, the MLP module is connected by two fully connected layers and a GeLU activation function.Therefore, the entire process of obtaining the final global spatial feature representation F2 from the vector  * through the encoder can be expressed in the formula as follows: (, , ) =  ⋅ = ((( )) +  ) In this equation, l denotes the number of layers in the encoder, and  =  * .Q, K, and V represent the projections of input feature vectors in different feature vector spaces, For the extraction of global-scale features for the spatial dimension, we used the Transformer model as the feature extraction backbone network.The input sample x is first reshaped and flattened into a two-dimensional vector v ∈ R T×(H×W) , and then a two-dimensional convolution layer is used to project the feature dimension of the vector to D, resulting in a feature vector v ∈ R T×D .To maintain the order of the sequence within the sample and reduce the computational cost of the model, the backbone network preserves the original 1-D position embedding module.Specifically, a fixed position vector p is added to the feature vector v to obtain a feature vector v * of the same dimension.Finally, the feature vector v * is fed into the stacked encoders, and the corresponding weight information is obtained through the attention mechanism.The internal structure of the encoder is shown in Figure 2. From the figure, it can be seen that each encoder layer consists of a Multi-Head Self-Attention (MHSA) module, an MLP module, and three Layer Norm (LN) layers.Meanwhile, the MLP module is connected by two fully connected layers and a GeLU activation function.Therefore, the entire process of obtaining the final global spatial feature representation F 2 from the vector v * through the encoder can be expressed in the formula as follows: In this equation, l denotes the number of layers in the encoder, and m 0 = v * .Q, K, and V represent the projections of input feature vectors in different feature vector spaces, which are used to compute similarity and output vectors.W i represents the corresponding projection parameter, while H i denotes the distinct heads in the attention mechanism.
Finally, to effectively extract temporal features from the samples, a set of LSTM layers is integrated into the framework for temporal feature extraction.Specifically, the input sample x is continuously passed through interconnected LSTM cells, allowing the model to iteratively learn the temporal patterns within the data.Within each cell unit, the flow and propagation of information are controlled using forget gates ( f t ), input gates (i t ), and output gates (o t ), while the cell state (C t ) is used to capture and store temporal features.In summary, the process of obtaining temporal features F 3 within the samples using the LSTM feature extraction backbone can be represented by the following equation: where h t denotes the current moment of the hidden state inside the cell.
Once the local-global-temporal features are extracted, we concatenate all the resulting feature vectors to form a fused feature vector.In order to enhance the coupling relationship between the fused features, we design a simple feature aggregation module based on the fused feature.In particular, we use one-dimensional convolution and ReLU activation function in sequence to aggregate information of the concatenated feature, resulting in the final fused feature U.The specific process is shown in Equation (7).
In contrast to simple time-series prediction, in order to enable the model to learn robust spatial-temporal features that characterize the complex remote sensing scenes, we performed rolling training strategy along the temporal axis with representative samples.The training process is shown in Figure 3. which are used to compute similarity and output vectors.Wi represents the corresponding projection parameter, while Hi denotes the distinct heads in the attention mechanism.Finally, to effectively extract temporal features from the samples, a set of LSTM layers is integrated into the framework for temporal feature extraction.Specifically, the input sample x is continuously passed through interconnected LSTM cells, allowing the model to iteratively learn the temporal patterns within the data.Within each cell unit, the flow and propagation of information are controlled using forget gates ( ), input gates ( ), and output gates ( ), while the cell state ( ) is used to capture and store temporal features.In summary, the process of obtaining temporal features F3 within the samples using the LSTM feature extraction backbone can be represented by the following equation: where ℎ denotes the current moment of the hidden state inside the cell.
Once the local-global-temporal features are extracted, we concatenate all the resulting feature vectors to form a fused feature vector.In order to enhance the coupling relationship between the fused features, we design a simple feature aggregation module based on the fused feature.In particular, we use one-dimensional convolution and ReLU activation function in sequence to aggregate information of the concatenated feature, resulting in the final fused feature U.The specific process is shown in Equation (7).
In contrast to simple time-series prediction, in order to enable the model to learn robust spatial-temporal features that characterize the complex remote sensing scenes, we performed rolling training strategy along the temporal axis with representative samples.The training process is shown in Figure 3. Assuming that the input sample can be represented as  ,  ,  , . . .,  , where  ∈  × denotes the two-dimensional spatial data at the i-th time node, we set a fixed temporal length K, which represents the two-dimensional spatial data containing K temporal nodes included in each sliding input, denoted as  ,  , . . .,  ,  << .At the same time, we take the center pixel value of the data corresponding to the K+1 node as the target label for the model.Once the current training step is finished, the model will continue to roll down within the batch of samples and obtain the second training sample, denoted as  ,  , . . .,  , while taking the center pixel value of the K + 2 node as the new target label.The advantage of training the model in this way is that it not only enables infinite rolling prediction of future data, but also allows the spatial-temporal prediction model to be unrestricted by the initial length of input time-series data.
Finally, the model feeds the aggregated feature U into a fully connected layer to predict the corresponding future pixel values.During this process, the algorithm utilizes the root mean square error (RMSE) as the loss function to continuously optimize and guide training process.The calculation method of the RMSE can be expressed as Equation (8).Assuming that the input sample can be represented as [t 1 , t 2 , t 3 , . . . ,t n ], where t i ∈ R H×W denotes the two-dimensional spatial data at the i-th time node, we set a fixed temporal length K, which represents the two-dimensional spatial data containing K temporal nodes included in each sliding input, denoted as [t 1 , t 2 , . . . ,t K ], K << n.At the same time, we take the center pixel value of the data corresponding to the K+1 node as the target label for the model.Once the current training step is finished, the model will continue to roll down within the batch of samples and obtain the second training sample, denoted as [t 2 , t 3 , . . . ,t K+1 ], while taking the center pixel value of the K + 2 node as the new target label.The advantage of training the model in this way is that it not only enables infinite rolling prediction of future data, but also allows the spatial-temporal prediction model to be unrestricted by the initial length of input time-series data.
Finally, the model feeds the aggregated feature U into a fully connected layer to predict the corresponding future pixel values.During this process, the algorithm utilizes the root mean square error (RMSE) as the loss function to continuously optimize and guide training process.The calculation method of the RMSE can be expressed as Equation (8).
where p i , t i represent the predicted and true values, respectively, and N denotes the number of samples.

Spatial-Temporal Prediction for Early Crop Mapping
The spatial-temporal prediction module mentioned in the previous section was conducted to obtain future remote sensing data at the early growth stage of crops.Compared with existing early crop mapping methods [17][18][19], the STPM framework aims to forecast crop growth patterns step-by-step with current remote sensing data and perform early crop type mapping at the same time.The corresponding processing flow is shown in Figure 4.

𝐿 =
∑ ( −  ) where  ,  represent the predicted and true values, respectively, and N denotes the number of samples.

Spatial-Temporal Prediction for Early Crop Mapping
The spatial-temporal prediction module mentioned in the previous section was conducted to obtain future remote sensing data at the early growth stage of crops.Compared with existing early crop mapping methods [17][18][19]  Due to the inherent differences in training methods between the crop type identification module and spatial-temporal prediction module, it is necessary to pre-train a universal and transferable crop identification model.Considering the inter-annual growth variability of crops, in order to obtain a robust crop classification model, we trained a robust crop identification model based on a large number of time-series crop samples over the past years.Compared to existing time-series classification models, the Multi-Scale 1D Res-Net (MS-ResNet) [36] classification network was chosen as the basic model for crop identification module due to its smaller network parameters, superior crop recognition ability, and model transferability.The MS-ResNet refers to the process of extracting features at different time scales from the input vector using multiple one-dimensional convolution kernels of different sizes, and then fusing all the scale features for subsequent crop classification tasks.Assuming that the convolution kernel sizes of the multi-scale convolution layer are  ,  , . . .,  , and the corresponding convolution step sizes are  ,  , . . .,  , the process of extracting multi-scale features  can be represented as follows, = ( ,  , . . .,  ) Throughout the model training, the cross-entropy was adopted as the loss function, as formulated in Equation (11).Due to the inherent differences in training methods between the crop type identification module and spatial-temporal prediction module, it is necessary to pre-train a universal and transferable crop identification model.Considering the inter-annual growth variability of crops, in order to obtain a robust crop classification model, we trained a robust crop identification model based on a large number of time-series crop samples over the past years.Compared to existing time-series classification models, the Multi-Scale 1D ResNet (MS-ResNet) [36] classification network was chosen as the basic model for crop identification module due to its smaller network parameters, superior crop recognition ability, and model transferability.The MS-ResNet refers to the process of extracting features at different time scales from the input vector using multiple one-dimensional convolution kernels of different sizes, and then fusing all the scale features for subsequent crop classification tasks.Assuming that the convolution kernel sizes of the multi-scale convolution layer are {k 1 , k 2 , . . . ,k m }, and the corresponding convolution step sizes are {s 1 , s 2 , . . . ,s m }, the process of extracting multi-scale features ψ can be represented as follows, Throughout the model training, the cross-entropy was adopted as the loss function, as formulated in Equation (11).
Among them, ReLU represents the activation function, BN stands for Batch Normalization layer, and g i , a i denote the true crop label and the predicted result of the model, respectively.Finally, assuming that the time-series remote sensing data available to us in the current year are [T 1 , T 2 , . . . ,T n ], and the future time-series data to be predicted are T n+1 , T n+2 , . . ., T n+m .The final mapping result obtained through the proposed early crop mapping framework can be represented by MS-ResNet T 1 , T 2 , . . ., T n , T n+1 , T n+2 , . . ., T n+m .

Study Area and Ground Reference Data Introduction
The test area selected for this experiment is located in California, United States, as shown in Figure 5. California, located on the west coast of the United States, has a population of approximately 39.5 million and a geographic area of approximately 414,000 square kilometers, making it the third-largest state in the USA.Due to its proximity to the Sierra Nevada Mountains and its location along the Pacific Ocean, the Central Valley has abundant water and thermal resources, providing excellent growing conditions for economically important crops such as fruits and vegetables.This has made California one of the most important agricultural states in the USA.According to official US statistics, the major crops planted in California in 2022 include almonds, walnuts, grapes, tomatoes, and winter wheat [37].Within this state, we selected an area of interest, a 40 × 40 km crop plantation area in Fresno County.This plot is mainly used for growing economically important crops such as cotton, winter wheat, alfalfa, tomatoes, and grapes, and its corresponding crop calendar is described in Appendix A.

Study Area and Ground Reference Data Introduction
The test area selected for this experiment is located in California, United States, as shown in Figure 5. California, located on the west coast of the United States, has a population of approximately 39.5 million and a geographic area of approximately 414,000 square kilometers, making it the third-largest state in the USA.Due to its proximity to the Sierra Nevada Mountains and its location along the Pacific Ocean, the Central Valley has abundant water and thermal resources, providing excellent growing conditions for economically important crops such as fruits and vegetables.This has made California one of the most important agricultural states in the USA.According to official US statistics, the major crops planted in California in 2022 include almonds, walnuts, grapes, tomatoes, and winter wheat [37].Within this state, we selected an area of interest, a 40 × 40 km crop plantation area in Fresno County.This plot is mainly used for growing economically important crops such as cotton, winter wheat, alfalfa, tomatoes, and grapes, and its corresponding crop calendar is described in Appendix A.
The Cropland Data Layer (CDL) [38] is a nationwide dataset created by the United States Department of Agriculture (USDA) using remote sensing, drone imagery, and ground sampling data.The CDL dataset provides information on crop planting and types across most of the United States, with higher spatial resolution.This dataset is not only valuable for agricultural decision-makers and policymakers but also has broad applications in fields such as land use, environmental protection, and natural disaster monitoring.Therefore, we used the CDL data of the study area as the ground reference data for training our crop classification model.Simultaneously, we pre-sampled the CDL data to a spatial resolution of 10 m using the nearest neighbor method to better match the remote sensing imagery.The Cropland Data Layer (CDL) [38] is a nationwide dataset created by the United States Department of Agriculture (USDA) using remote sensing, drone imagery, and ground sampling data.The CDL dataset provides information on crop planting and types across most of the United States, with higher spatial resolution.This dataset is not only valuable for agricultural decision-makers and policymakers but also has broad applications in fields such as land use, environmental protection, and natural disaster monitoring.Therefore, we used the CDL data of the study area as the ground reference data for training our crop classification model.Simultaneously, we pre-sampled the CDL data to a spatial resolution of 10 m using the nearest neighbor method to better match the remote sensing imagery.

Satellite Data Download and Pre-Processing
To test the effectiveness of the STPM, we utilized the widely used NDVI time-series data as the background data for spatial-temporal prediction and crop identification.To this end, we obtained multiple years of NDVI time-series remote sensing data with a time span from 1 January 2019 to 31 December 2022, using the S2 Level-2A data obtained from the Google Earth Engine (GEE) [39] cloud platform under limited conditions of time, cloud cover, and cloud masks (with QA60 band markers).Optical remote sensing data is often affected by cloud cover, which can lead to differences in the amount of effective data available and the time intervals between adjacent data for the same area in different years, thereby interfering with the evaluation of subsequent spatial-temporal prediction and crop identification models.To address this issue, we used the 10-day median synthesis method to synthesize multiple scenes, where all available data for a fixed time window of 10 days are arranged from smallest to largest.Then, the median of all data for these 10 days is taken as the composite value, and the results are stored, thus ensuring that the effective data length for each year in the study area remains at 37 scenes.Finally, we applied S-G filtering [40] to the synthesized data with an initial window size of seven iterations to fill in missing values in the time-series data, reducing the impact of noise or outliers during the model training process.
Figure 6 shows the rules set for the synthesis of data.Assuming that the latest S2-NDVI data on the GEE platform is the 20th synthesized scene in 2022, and three years of CDL data from 2019 to 2021 can be obtained as crop labels, we trained and tested the STPM model based on these data after necessary pre-processing.
To test the effectiveness of the STPM, we utilized the widely used NDVI time-se data as the background data for spatial-temporal prediction and crop identification this end, we obtained multiple years of NDVI time-series remote sensing data with a t span from 1 January 2019 to 31 December 2022, using the S2 Level-2A data obtained f the Google Earth Engine (GEE) [39] cloud platform under limited conditions of t cloud cover, and cloud masks (with QA60 band markers).Optical remote sensing da often affected by cloud cover, which can lead to differences in the amount of effective available and the time intervals between adjacent data for the same area in different ye thereby interfering with the evaluation of subsequent spatial-temporal prediction crop identification models.To address this issue, we used the 10-day median synth method to synthesize multiple scenes, where all available data for a fixed time windo 10 days are arranged from smallest to largest.Then, the median of all data for thes days is taken as the composite value, and the results are stored, thus ensuring that effective data length for each year in the study area remains at 37 scenes.Finally, we plied S-G filtering [40] to the synthesized data with an initial window size of seven it tions to fill in missing values in the time-series data, reducing the impact of noise or liers during the model training process.
Figure 6 shows the rules set for the synthesis of data.Assuming that the latest NDVI data on the GEE platform is the 20th synthesized scene in 2022, and three year CDL data from 2019 to 2021 can be obtained as crop labels, we trained and tested STPM model based on these data after necessary pre-processing.

Sample Acquisition and Splitting
For the crop classification module, we first statistically analyze the proportion different crop types based on historical label data and then reclassify them, as show Table 1.Although crop rotation occurs in the agricultural areas of the USA, the main c types in the research area remain relatively stable across different years.Therefore reclassified the main crop types in the research area into 12 categories.On this basis take a random sampling within each year to extract the NDVI time-series data and la corresponding to each sample point.Specifically, we used the CDL data as a reference took a random scattering of points on top of it to obtain the samples, where the num of samples selected was 0.5% of the total number of pixels.Finally, all crop samples f different years were mixed and randomly divided into mutually independent train testing, and validation sets at a ratio of 5:3:2.

Sample Acquisition and Splitting
For the crop classification module, we first statistically analyze the proportions of different crop types based on historical label data and then reclassify them, as shown in Table 1.Although crop rotation occurs in the agricultural areas of the USA, the main crop types in the research area remain relatively stable across different years.Therefore, we reclassified the main crop types in the research area into 12 categories.On this basis, we take a random sampling within each year to extract the NDVI time-series data and labels corresponding to each sample point.Specifically, we used the CDL data as a reference and took a random scattering of points on top of it to obtain the samples, where the number of samples selected was 0.5% of the total number of pixels.Finally, all crop samples from different years were mixed and randomly divided into mutually independent training, testing, and validation sets at a ratio of 5:3:2.
Unlike the training mode of the classification model, when generating spatial-temporal prediction samples, we first overlay the NDVI time-series data from 2019 to 2022 (with 20 scenes) and then randomly cut out 17 × 17 tiles on the original data at a certain ratio.In order to balance the sample size and the generalization ability of the model, we performed data augmentation on the obtained tiles using commonly used methods such as rotation, translation, and adding noise.Finally, we divided the samples into mutually independent training, testing, and validation sets at a ratio of 5:3:2, consistent with the splitting of classification samples.Tables 2 and 3 show the respective distributions of the two types of samples.
Table 1.Real crop type distribution in the study area and selection of main crop types ("Others" represents the total proportion of all crop types that account for a relatively small percentage compared to the entire study area, and "Impervious" represents all impervious surfaces).

Experimental Design and Model Parameter Configuration
To better capture the spatial-temporal evolution patterns of the crops, we designed multiple feature extraction channels to obtain multi-modal feature information.Specifically, in each stage of training, the model slides to extract input samples within multiple spatialtemporal tiles, with a size of 4 × 17 × 17, and takes the center pixel of the corresponding next time node as the output label.For local feature extraction, the network uses four convolution kernels of sizes 3 × 3 and 1 × 1 to extract local feature vectors, which are then stretched into a size of 4 × 289.In addition, for the Transformer component, the model sets up 6 layers of encoders, with the heads of the multi-head attention set to 17 and the input dimension of the feed-forward network set to 289.Finally, a global feature vector of size 5 × 289 is outputted.For the temporal feature extraction branch, the input dimension is set to 289, the hidden nodes are set to 64, the network layers are set to 3, and the output feature size is set to 4 × 289.After all the feature extraction is completed, the multi-modal features are concatenated to form a fusion feature vector of size 13 × 289, and then an information aggregation is performed using a one-dimensional convolution kernel of size 1 × 13 to obtain a final feature vector of size 1 × 289.
Regarding crop classification, the model takes an input crop time-series of size 1 × 37 and uses 64 1 × 3, 1 × 5, and 1 × 7 convolution kernels to extract features of different scales.These features are then passed through multiple BN layers, ReLU layers, and convolutional layers.Next, multiple mean pooling layers are used to reduce the dimensionality of the extracted multiscale features to a size of 1 × 18.Finally, the features of different scales are concatenated and passed through fully connected layers and activation layers to output crop category information.
All of the modules described above were trained using the PyTorch framework [41].The Adam optimizer was used for training all models, with default settings of beta_1 = 0.9, beta_2 = 0.999, and epsilon = 1 × 10 −8 .For the crop classification model, we set the initial learning rate to 0.0001, batch size to 256, and trained for 100 epochs.For the spatialtemporal prediction model, we set the initial learning rate to 0.001, batch size to 512, and trained for 256 epochs.To prevent overfitting during the training process, we employed an early stopping mechanism to obtain the optimal set of model parameters.Model training was conducted on a server with the CentOS 7.6 operating system, an Intel(R) Xeon(R) Gold 5118 CPU, and two NVIDIA Tesla 100 16G graphics cards.The training process used Python version 3.7.

Model Evaluation Indicators
To better evaluate the potential of the proposed STPM methods in the paper, we used different evaluation metrics to assess the spatial-temporal prediction module and crop classification module, respectively.For the proposed spatial-temporal prediction module, we used RMSE as the evaluation metric to judge the model's training performance during the training phase.When the model made predictions, we also added R 2 as an evaluation metric.As for the crop identification module, we used five commonly used accuracy evaluation metrics, including accuracy, precision, recall, F1-score, and Kappa coefficient, to assess the training accuracy and prediction capabilities.In detail, we used accuracy to evaluate the pre-trained crop classification model from an overall perspective.Recall and precision were selected to assess the model's ability to distinguish between positive and negative samples, and finally, we used F1-score and Kappa coefficient as two comprehensive metrics to evaluate the model's robustness.

Evaluation of Spatial-Temporal Prediction Results
In this section, we employed our proposed spatial-temporal prediction model to forecast the spatial-temporal distribution of S2-NDVI in the selected study area from day 210 to 365 in 2022.To evaluate the performance of the model in predicting remote sensing time-series data, we conducted qualitative and quantitative evaluations.At the qualitative level, we visualized the spatial-temporal prediction results of the first nine prediction scenes and assessed the spatial pattern of the model from various aspects, including the degree of surface texture preservation.Meanwhile, we quantified the NDVI prediction results at each temporal node and the overall prediction results using multiple evaluation methods, such as density scatter plots, R 2 , and RMSE.Finally, we conducted a comparative analysis of the prediction errors of different crop types from different aspects.
Figure 7 presents the NDVI prediction results of the selected study area for days 210-300 in 2022.All spatial predictions of the model are very similar to the real surface distribution in visual inspection.The boundaries of cultivated land and field contours can be accurately distinguished on the map, proving the potential of our proposed spatialtemporal prediction module for large-scale prediction even in complex regions.For the fragmented and heterogeneous regions, as shown in the red box of Figure 7, the proposed method is able to capture the evolution pattern from the spatial-temporal domain, keeping spatial details in all nine prediction maps.In addition, we will discuss the impact of heterogeneity on the prediction results in depth in the following section.
degree of surface texture preservation.Meanwhile, we quantified the NDVI prediction results at each temporal node and the overall prediction results using multiple evaluation methods, such as density scatter plots, R 2 , and RMSE.Finally, we conducted a comparative analysis of the prediction errors of different crop types from different aspects.
Figure 7 presents the NDVI prediction results of the selected study area for days 210-300 in 2022.All spatial predictions of the model are very similar to the real surface distribution in visual inspection.The boundaries of cultivated land and field contours can be accurately distinguished on the map, proving the potential of our proposed spatial-temporal prediction module for large-scale prediction even in complex regions.For the fragmented and heterogeneous regions, as shown in the red box of Figure 7, the proposed method is able to capture the evolution pattern from the spatial-temporal domain, keeping spatial details in all nine prediction maps.In addition, we will discuss the impact of heterogeneity on the prediction results in depth in the following section.In Figures 8 and 9, we quantitatively analyze and evaluate the prediction results from two aspects: density scatter plot and error histogram.These measures clearly demonstrate the similarity between the predicted NDVI time-series data and the real time-series data In Figures 8 and 9, we quantitatively analyze and evaluate the prediction results from two aspects: density scatter plot and error histogram.These measures clearly demonstrate the similarity between the predicted NDVI time-series data and the real time-series data for the first nine predicted scenes.From the distribution of the density scatter plots and the statistics of R 2 and RMSE, we can see that the scatter plot distributions corresponding to the predicted data from the first scene to the ninth scene are gradually diverging towards both ends, with a clear downward trend in R 2 and a significant increase in RMSE.These results indicate that as time goes on, the prediction accuracy of the model gradually decreases.This is an understandable phenomenon for any rolling prediction model because every prediction process of the model will introduce certain errors.As the prediction length increases, the accumulated error also increases, leading to worse prediction results.Still, our proposed model outperforms most existing spatial-temporal prediction networks in terms of prediction accuracy and maximum effective prediction time, which will be discussed in Section 4.2.Meanwhile, we can also observe from the change of the scatter plots that although the distribution range of the scatter plot increases with the increase of the prediction length, most of the red hotspots shown in the figures are distributed on the diagonal, and their relative positions in the figure are continuously decreasing along the line.Considering that the starting time of our prediction model was around the end of June to early July 2022, which is close to the peak growth and harvest periods of most crops in the study area, most NDVI values corresponding to the initial time-series data for prediction are relatively high and gradually decrease as time goes.
wards both ends, with a clear downward trend in R 2 and a significant increase in RMSE.These results indicate that as time goes on, the prediction accuracy of the model gradually decreases.This is an understandable phenomenon for any rolling prediction model because every prediction process of the model will introduce certain errors.As the prediction length increases, the accumulated error also increases, leading to worse prediction results.Still, our proposed model outperforms most existing spatial-temporal prediction networks in terms of prediction accuracy and maximum effective prediction time, which will be discussed in Section 4.2.Meanwhile, we can also observe from the change of the scatter plots that although the distribution range of the scatter plot increases with the increase of the prediction length, most of the red hotspots shown in the figures are distributed on the diagonal, and their relative positions in the figure are continuously decreasing along the line.Considering that the starting time of our prediction model was around the end of June to early July 2022, which is close to the peak growth and harvest periods of most crops in the study area, most NDVI values corresponding to the initial time-series data for prediction are relatively high and gradually decrease as time goes.The histogram of prediction errors (the difference between the true value and the predicted value) for each predicted scene shows that the errors of the first nine scenes are concentrated within the range of (−0.2, 0.2) and almost follow a normal distribution.This undoubtedly indicates that the proposed model has a good predictive ability.When we observe the prediction errors along the time axis, we can see that the corresponding kernel density curve gradually widens and moves towards the positive direction with the in-crease in time, indicating that the corresponding standard deviation is continuously increasing, and the predicted values are gradually decreasing compared to the true values.This provides us with ideas and information sources for future improvement of the model.The histogram of prediction errors (the difference between the true value and the predicted value) for each predicted scene shows that the errors of the first nine scenes are concentrated within the range of (−0.2, 0.2) and almost follow a normal distribution.This undoubtedly indicates that the proposed model has a good predictive ability.When we observe the prediction errors along the time axis, we can see that the corresponding kernel density curve gradually widens and moves towards the positive direction with the increase in time, indicating that the corresponding standard deviation is continuously increasing, and the predicted values are gradually decreasing compared to the true values.This provides us with ideas and information sources for future improvement of the model.
In response to the previous question, to explore the impact of surface heterogeneity on the prediction results, we used a trained spatial-temporal prediction model to perform rolling prediction of the remaining 17 scenes of NDVI data in the study area in 2022 and overlaid them in chronological order.Based on this model, we obtained the spatialized results shown in Figure 10 by calculating the R 2 and RMSE between the predicted temporal data and the actual results on a pixel-by-pixel basis.In response to the previous question, to explore the impact of surface heterogeneity on the prediction results, we used a trained spatial-temporal prediction model to perform rolling prediction of the remaining 17 scenes of NDVI data in the study area in 2022 and overlaid them in chronological order.Based on this model, we obtained the spatialized results shown in Figure 10 by calculating the R 2 and RMSE between the predicted temporal data and the actual results on a pixel-by-pixel basis.Combining Figure 10a,b, it can be found that there are a few outliers in the flat and homogeneous cultivated land blocks, and these outliers exhibit a clear block distribution phenomenon.Considering that different crop types are planted in this area, there are significant differences in the growth and development patterns of different crop types (The anomalous areas are mainly found in plots planted with the three crop types, alfalfa, oth- Combining Figure 10a,b, it can be found that there are a few outliers in the flat and homogeneous cultivated land blocks, and these outliers exhibit a clear block distribution phenomenon.Considering that different crop types are planted in this area, there are significant differences in the growth and development patterns of different crop types (The anomalous areas are mainly found in plots planted with the three crop types, alfalfa, others, and fallow).This error distribution phenomenon may be related to the crop types planted in the parcel, which is also confirmed in our discussion of the impact of different crop types on the prediction results later.Compared with the vast majority of homogeneous and regular parcels, the prediction errors in the surface fragmented area corresponding to the red box are generally at a relatively high level, which indicates that the higher the degree of surface heterogeneity, the more errors are introduced during prediction.
In the previous discussion, we mainly analyzed and evaluated the accuracy of the prediction model at the spatial scale.However, the ultimate goal is to achieve early crop mapping based on remote sensing data through spatial-temporal prediction.Assuming that the crop classification model established can accurately identify all crop types, the prediction ability of the model will directly affect the accuracy of crop mapping.With this assumption, we refer to the CDL data and randomly select 300 sample points for each crop type based on the complete temporal prediction data obtained prior.We then calculate the average error between the predicted data and the actual data and display it in a violin plot.
Based on the quantitative indicators in Figure 11, it can be seen that the median of the average error of most crop types in the study area is distributed around 0. However, the distribution of prediction errors for different crop types in the figure varies significantly, which confirms our previous hypothesis that the prediction accuracy of spatial-temporal prediction models is related to crop types.Specifically, the average prediction errors for fallow, garlic, and onions are clearly represented by a "short and thick" violin plot in the figure, indicating that the spatial-temporal prediction model performs quite well in predicting the spatial-temporal changes of these three crop types.After comparing and analyzing the NDVI values of the three crop types from the 21st to 37th scenes, we found that these three crop types have a common characteristic: they are generally near their lower value during the year.This indicates that the model is sensitive to low values, which is consistent with the fact that the predicted values are generally lower than the true values, as reflected in the error histogram.For the five crop types of almonds, pistachios, winter wheat, tomatoes, and grapes, their corresponding average error violin plots exhibit a relative regularity.This is because their NDVI curves during this period show a process of decreasing from high values to low values, and the low sensitivity of the prediction model to high values leads to an expansion of the error.Finally, for the four types of cotton, alfalfa, impervious, and others, we can see clearly from Figure 11 that their corresponding violin plots are wide and long, indicating that there is a large difference in the average prediction errors among these types and a wide distribution range.Upon closer examination, the poor prediction performance of the model can be attributed to the fact that impervious, others, and alfalfa are not single crop types, making it difficult for the model to learn a stable spatial-temporal evolution pattern within them.Compared with the NDVI regularity changes of other crop types, the high value duration of cotton occupies the vast majority of the predicted time.Combining the previous analysis and error accumulation effects, the relatively scattered average prediction error of cotton can also be explained.

Comparative Experiment Based on Spatial-Temporal Prediction Models
To evaluate the superiority of our model in spatial-temporal prediction of fine-scale remote sensing data, we compared our proposed model with other spatial-temporal prediction models in this section.Since our model is trained based on the spatial-temporal sequence-to-pixel approach, it is difficult to find competitors with similar techniques.Therefore, based on existing research, we selected four competitive models for comparison, namely Memory In Memory (MIM) [42], ConvLSTM [43], PredRNN [44], and Cubi-cRNN [45].Simultaneously, to ensure fairness in the comparison, we used the same set of samples for all models and kept the training method unchanged.Table 4 summarizes the relevant information for all the models used in the experiment.
of cotton, alfalfa, impervious, and others, we can see clearly from Figure 11 that their corresponding violin plots are wide and long, indicating that there is a large difference in the average prediction errors among these types and a wide distribution range.Upon closer examination, the poor prediction performance of the model can be attributed to the fact that impervious, others, and alfalfa are not single crop types, making it difficult for the model to learn a stable spatial-temporal evolution pattern within them.Compared with the NDVI regularity changes of other crop types, the high value duration of cotton occupies the vast majority of the predicted time.Combining the previous analysis and error accumulation effects, the relatively scattered average prediction error of cotton can also be explained.

Comparative Experiment Based on Spatial-Temporal Prediction Models
To evaluate the superiority of our model in spatial-temporal prediction of fine-scale remote sensing data, we compared our proposed model with other spatial-temporal prediction models in this section.Since our model is trained based on the spatial-temporal sequence-to-pixel approach, it is difficult to find competitors with similar techniques.Therefore, based on existing research, we selected four competitive models for comparison, namely Memory In Memory (MIM) [42], ConvLSTM [43], PredRNN [44], and Cu-bicRNN [45].Simultaneously, to ensure fairness in the comparison, we used the same set of samples for all models and kept the training method unchanged.Table 4 summarizes the relevant information for all the models used in the experiment.By observing Table 4, it can be seen that different spatial-temporal prediction models use different building blocks, while almost all of them involve convolution and LSTM computing units internally.Once all models were well-trained, we selected the best training parameters for each model to conduct rolling prediction on S2-NDVI data and summarized the prediction accuracy of all models at each temporal point in Table 5. Regarding Table 5, we first conducted a vertical self-comparison of the four comparison models and found that, except for the ConvLSTM model, the effective prediction length of the remaining three comparison models is only up to the third scene, and the R 2 of predicting the first scene data is not higher than 0.9.This indicates that these three models cannot effectively learn the spatial-temporal evolving rules of large-scale remote sensing data, and their predictive capabilities for unknown remote sensing data are generally weak.The ConvLSTM model has a relatively excellent prediction ability, which may be due to its wide range of applicability and strong spatial-temporal modeling ability.Then, we continued to conduct horizontal comparisons of the prediction results of these four models with the model proposed in this article.It can be clearly seen that our proposed model has an advantage in both the effective length of prediction and the accuracy at each time point.At the same time, the decline rate of the prediction accuracy of all comparison models is significantly higher than that of our proposed model, which indirectly indicates that these models are more sensitive to error accumulation on spatial-temporal scales and have weaker robustness.

Early Crop Mapping Accuracy Assessment
In the process of large-scale crop mapping, the training of the crop identification model is crucial as it directly affects the accuracy of the down-stream crop mapping task.Therefore, we first evaluated the training effectiveness of the crop classification model used based on the hyperparameters determined in the previous chapters.Among them, Figure 12 illustrates five evaluation metrics for crop classification accuracy during the training process and the trend of the loss function on the validation set.We can clearly see that the model's loss function has become stable by around the 40th epoch, and the overall validation accuracy is maintained at around 80%.To improve the generalization ability of the model, we trained a classification model on a collection of crop samples from multiple years, as mentioned in the previous section, in order to enhance its robustness.Combining prior conditions such as label quality and sample type distribution, the final validation accuracy of the classification model is acceptable.

Early Crop Mapping Accuracy Assessment
In the process of large-scale crop mapping, the training of the crop identification model is crucial as it directly affects the accuracy of the down-stream crop mapping task.Therefore, we first evaluated the training effectiveness of the crop classification model used based on the hyperparameters determined in the previous chapters.Among them, Figure 12 illustrates five evaluation metrics for crop classification accuracy during the training process and the trend of the loss function on the validation set.We can clearly see that the model's loss function has become stable by around the 40th epoch, and the overall validation accuracy is maintained at around 80%.To improve the generalization ability of the model, we trained a classification model on a collection of crop samples from multiple years, as mentioned in the previous section, in order to enhance its robustness.Combining prior conditions such as label quality and sample type distribution, the final validation accuracy of the classification model is acceptable.Finally, based on the time-series prediction data obtained in the upper section, we combined the known NDVI time-series data of crop early growth with the predicted results and input them into the crop identification model for prediction, obtaining the crop early mapping results shown in Figure 14.From the figure, we can see that the predicted results are filled with a small amount of noise, and the prediction accuracy of the impervious type is very low, which is consistent with the information displayed in the confusion matrix.This may be due to the weak correlation between the predicted data and the real data.As a result, the NDVI values corresponding to the impervious surfaces gradually decreased over the year from being relatively stable to less stable, and when the cumulative errors between the data exceeded the tolerance range of the crop identification model, recognition errors occured.However, for major crop types such as almonds, tomatoes, and cotton, we can visually observe that the model's recognition accuracy has reached an acceptable level when the initial length is only 20 in 2022.By observing the annual growth curves of different crop types, it can be inferred that the pre-trained crop classification model mainly focuses on the growth peak range [46].Therefore, it is not difficult to conclude that the longer the time span of known remote sensing data, the higher the recognition accuracy, which is discussed in detail in Section 5.2.Finally, based on the time-series prediction data obtained in the upper section, we combined the known NDVI time-series data of crop early growth with the predicted results and input them into the crop identification model for prediction, obtaining the crop early mapping results shown in Figure 14.From the figure, we can see that the predicted results are filled with a small amount of noise, and the prediction accuracy of the impervious type is very low, which is consistent with the information displayed in the confusion matrix.This may be due to the weak correlation between the predicted data and the real data.As a result, the NDVI values corresponding to the impervious surfaces gradually decreased over the year from being relatively stable to less stable, and when the cumulative errors between the data exceeded the tolerance range of the crop identification model, recognition errors occured.However, for major crop types such as almonds, tomatoes, and cotton, we can visually observe that the model's recognition accuracy has reached an acceptable level when the initial length is only 20 in 2022.By observing the annual growth curves of different crop types, it can be inferred that the pre-trained crop classification model mainly focuses on the growth peak range [46].Therefore, it is not difficult to conclude that the longer the time span of known remote sensing data, the higher the recognition accuracy, which is discussed in detail in Section 5.2.

Contribution Evaluation of Multi-Modal Features to the Spatial-Temporal Prediction Module
To better evaluate the contribution of multi-modal features to S2-NDVI spatial-temporal prediction, we trained and predicted the model based on local, global spatial features, and temporal features, respectively, while ensuring that the input data and hyperparameters were the same.Figure 15 shows the corresponding loss function changes of these four models during the validation set training.Several rules can be observed from the figure.Firstly, our proposed model showed the fastest decrease in the loss function during training, followed by the time-based model, while the model that used ViT solely as the feature extraction backbone performed the worst.The second point is that the models based on multi-modal features and temporal features showed a decreasing trend in the loss function during training, while the models based on local and global features exhibited multiple fluctuations in the loss function, indicating that these two models had poor fitting performance and could not effectively learn the temporal and spatial patterns of the data.After all models were trained, we performed rolling predictions along the temporal axis based on their respective best training parameters and evaluated the prediction accuracy.The results are summarized in Table 6.
Remote Sens. 2023, 15, x FOR PEER REVIEW for single-feature prediction, the model based on temporal features performs bes NDVI data, while the model based on global spatial features performs the wo hough there are many improved versions of ViT models that have achieved exce sults in video prediction and other areas [47,48], our experiments demonstrate Transformer itself is not particularly suitable for spatial-temporal prediction of fi remote sensing data.Finally, we can conclude that for our proposed spatial-tempo diction model, temporal features play a dominant role in the entire learning proces local and global spatial features serve as auxiliary learning factors.From Table 6, it is clear that the model based on multi-modal features for spatialtemporal prediction has significant advantages in both prediction accuracy and effective prediction length, which directly proves the validity of our proposed viewpoint.Secondly, for single-feature prediction, the model based on temporal features performs best for S2-NDVI data, while the model based on global spatial features performs the worst.Although there are many improved versions of ViT models that have achieved excellent results in video prediction and other areas [47,48], our experiments demonstrate that the Transformer itself is not particularly suitable for spatial-temporal prediction of fine-scale remote sensing data.Finally, we can conclude that for our proposed spatial-temporal prediction model, temporal features play a dominant role in the entire learning process, while local and global spatial features serve as auxiliary learning factors.the initial data extension.Even the identification accuracy of the other types increased from 0.44 to 0.75.Surprisingly, when the available data in 2022 reached 25-time steps (around mid-June), the early crop mapping framework we proposed almost achieved the same accuracy as that of using data for the entire year.This result confirms the effectiveness of our work.Clearly, these changes align with our initial hypothesis that the longer the initial data length, the better the spatial-temporal prediction model's prediction effect and the lower the accumulated error.

Advantages and Limitations
Against the backdrop of a general lag in mapping research on most existing crops, researchers have begun to focus on early crop mapping.Although a small amount of work has been devoted to early crop mapping based on sample selection and feature selection, these methods are obviously not scalable.Instead, the STPM framework starts from the spatial-temporal prediction of remote sensing data, combined with historical time-series data and crop identification models, fundamentally alleviating the problem of lack of data in the early stages of crop growth.We also analyzed the applicable scope of the spatial-temporal prediction model and the reliable time for early crop mapping from both qualitative and quantitative perspectives, comprehensively exploring the outstanding advantages of the STPM early crop mapping framework in crop mapping production.
However, there are several shortcomings in this work.Firstly, the STPM framework requires a significant amount of computational time and running costs for training, as well as a large sample size.This is because the training of time-series prediction models requires continuous learning of internal changes in data across multiple feature dimensions, and fine-scale remote sensing data contains much more information than other types of data, such as video data.Secondly, the model itself lacks an error correction mechanism.From the above result analysis, we found that unlimited rolling prediction without restrictions leads to a rapid accumulation of prediction errors and a sharp decrease in prediction accuracy.Finally, our proposed STPM model is based on many years of historical remote sensing data and ignores various extreme factors to learn the near-stable spatial-temporal pattern of crop changes, so it is difficult to predict abnormal changes in crop growth caused by sudden natural disasters.In future work, we will focus on lightweight models and error correction mechanisms to achieve timely detection of crop information.

Conclusions
This study proposes a novel deep learning framework for fine-scale remote sensing data prediction and early crop mapping.The results of the experiments demonstrate that our proposed spatial-temporal prediction module can achieve high prediction accuracy in terms of future data generation, even in heterogeneous regions.Furthermore, the analysis of the prediction results shows that the prediction accuracy in the crop planting area is affected by crop types and the degree of surface heterogeneity.When compared to other competing spatial-temporal forecasting models, our model is able to outperform all of them in terms of effective forecast length of over 100 days, as well as forecast accuracy per view.Regarding large-scale early crop mapping, we analyzed the impact of different initial sequence lengths on crop mapping accuracy and examined the transfer effect of pre-trained crop recognition models over time from multiple perspectives.
In summary, this study provides a new deep learning framework for early crop mapping to alleviate the problem of information lag due to the lack of early crop growth data.However, there are still a few issues with this method that need further improvement, for example, the accumulation of errors when the spatial-temporal prediction model is rolling for a long time, etc.

Figure 1 .
Figure 1.Illustration of the general framework of the STPM, consisting of three main modules: (a) preparation and generation of time-series remote sensing data and sample augmentation, (b) extraction and fusion of multi-modal features from the samples, and (c) training of crop classification and spatial-temporal prediction models in combination with early crop mapping.

Figure 1 .
Figure 1.Illustration of the general framework of the STPM, consisting of three main modules: (a) preparation and generation of time-series remote sensing data and sample augmentation, (b) extraction and fusion of multi-modal features from the samples, and (c) training of crop classification and spatial-temporal prediction models in combination with early crop mapping.

Figure 2 .
Figure 2. Details of the multi-modal feature extraction process in the spatial-temporal prediction modular.Specifically, the input samples are extracted by three feature extraction backbone networks simultaneously for the corresponding types of features.These extracted features are then fused together by a deep feature fusion module.

Figure 2 .
Figure 2. Details of the multi-modal feature extraction process in the spatial-temporal prediction modular.Specifically, the input samples are extracted by three feature extraction backbone networks simultaneously for the corresponding types of features.These extracted features are then fused together by a deep feature fusion module.
, the STPM framework aims to forecast crop growth patterns step-by-step with current remote sensing data and perform early crop type mapping at the same time.The corresponding processing flow is shown in Figure 4.

Figure 4 .
Figure 4. Illustration of the STPM framework for collaborative processing, where MS-ResNet represents the crop identification module, and ST-prediction Net denotes the spatial-temporal prediction module.

Figure 4 .
Figure 4. Illustration of the STPM framework for collaborative processing, where MS-ResNet represents the crop identification module, and ST-prediction Net denotes the spatial-temporal prediction module.

Figure 5 .
Figure 5. Illustration of the study area, where the left map shows the geographical location of the study area, and the right map shows the actual surface distribution.

Figure 5 .
Figure 5. Illustration of the study area, where the left map shows the geographical location of the study area, and the right map shows the actual surface distribution.

Figure 6 .
Figure 6.Rules for data synthesis.The sequences from left to right represent time-series data w a year, where the dashed boxes indicate the synthesis process of ten days.

Figure 6 .
Figure 6.Rules for data synthesis.The sequences from left to right represent time-series data within a year, where the dashed boxes indicate the synthesis process of ten days.

Figure 7 .
Figure 7.The spatial-temporal prediction results of the NDVI data, based on our proposed model, are presented using a pseudocolor visualization.From left-up corner to right-down corner, they represent the 21st to 30th scenes of the synthesized data in 2022, corresponding to the real-time span of 210-300 days in 2022.

Figure 7 .
Figure 7.The spatial-temporal prediction results of the NDVI data, based on our proposed model, are presented using a pseudocolor visualization.From left-up corner to right-down corner, they represent the 21st to 30th scenes of the synthesized data in 2022, corresponding to the real-time span of 210-300 days in 2022.

Figure 8 .
Figure 8. Scatter plot results of partial NDVI prediction and actual data.A total of 150,000 representative pixels were randomly selected from the predicted image and the corresponding real image to calculate their R 2 , average RMSE, and fitting function relationship.The time range corresponds to that in Figure 7.

Figure 8 .
Figure 8. Scatter plot results of partial NDVI prediction and actual data.A total of 150,000 representative pixels were randomly selected from the predicted image and the corresponding real image to calculate their R 2 , average RMSE, and fitting function relationship.The time range corresponds to that in Figure 7.

Figure 9 .
Figure 9. Histogram statistics of the error of the top nine scenes of NDVI prediction.

Figure 9 .
Figure 9. Histogram statistics of the error of the top nine scenes of NDVI prediction.

25 Figure 10 .
Figure 10.(a) and (b) respectively illustrate the R 2 and RMSE calculation results between the predicted NDVI time-series based on the proposed model and the actual time-series data.

Figure 10 .
Figure 10.(a) and (b) respectively illustrate the R 2 and RMSE calculation results between the predicted NDVI time-series based on the proposed model and the actual time-series data.

Figure 11 .
Figure 11.The violin plot represents the distribution of prediction errors for different crop types, where the white dot indicates the median of the error distribution, the width of the violin represents the frequency distribution of the prediction errors, and the thin black lines inside the violins indicate the range between the minimum and maximum values.

Figure 11 .
Figure 11.The violin plot represents the distribution of prediction errors for different crop types, where the white dot indicates the median of the error distribution, the width of the violin represents the frequency distribution of the prediction errors, and the thin black lines inside the violins indicate the range between the minimum and maximum values.

Figure 12 .
Figure 12.Changes in various accuracy evaluation metrics and loss function on the validation set as the number of epochs increases during crop classification model training.

Figure 12 .
Figure 12.Changes in various accuracy evaluation metrics and loss function on the validation set as the number of epochs increases during crop classification model training.

Figure 12 .
Figure 12.Changes in various accuracy evaluation metrics and loss function on the validation set as the number of epochs increases during crop classification model training.

Figure 13 .
Figure 13.Confusion matrix results on the validation set during crop classification model training.Figure 13.Confusion matrix results on the validation set during crop classification model training.

Figure 13 .
Figure 13.Confusion matrix results on the validation set during crop classification model training.Figure 13.Confusion matrix results on the validation set during crop classification model training.

Figure 14 .
Figure 14.(a) Visualization of the early crop mapping results in 2022 by combining existing NDVI data with spatial-temporal prediction results.(b) Distribution of actual crop types in 2022.

Figure 14 .
Figure 14.(a) Visualization of the early crop mapping results in 2022 by combining existing NDVI data with spatial-temporal prediction results.(b) Distribution of actual crop types in 2022.

Figure 15 .
Figure 15.Changes in the loss function of the spatial-temporal prediction model based on features during training on the validation set.As the numerical differences correspondin stable state of loss reduction during the training process of different models are too large t played effectively, a double y-axis line graph is used to show the differences between the Each model is represented by a different color, and except for the pink line, which is refer the left axis, the other three lines are referenced to the right axis.

Figure 15 .
Figure 15.Changes in the loss function of the spatial-temporal prediction model based on different features during training on the validation set.As the numerical differences corresponding to the stable state of loss reduction during the training process of different models are too large to be displayed effectively, a double y-axis line graph is used to show the differences between the models.Each model is represented by a different color, and except for the pink line, which is referenced to the left axis, the other three lines are referenced to the right axis.

Table 2 .
The number of samples for each crop type listed in the table refers to only one of the three years 2019-2021, so the number of samples used to train the crop classification model (including the training, validation, and test sets) is equal to the number in the table multiplied by 3.

Table 3 .
Division of spatial-temporal prediction sample quantity distribution.

Table 4 .
Brief summary of the comparison of model structures, including MIM, ConvLSTM, PredRNN, and CubicRNN.

Table 5 .
Comparison of the accuracy of spatial-temporal prediction models.We set the R 2 value of 0.30 as the threshold for determining the validity of the prediction results.Any predictions below this threshold are considered invalid and denoted by a "-" in the table.The same threshold is used for Table6.

Table 6 .
Comparison of accuracy for spatial-temporal prediction models based on multi-m single features.

Table 7 .
Comparison of early crop mapping accuracy for different initial training lengths (where data from 2019-2021 were used for training, the only difference being the length of available data in 2022).