Crop Mapping from Sentinel-1 Polarimetric Time-Series with a Deep Neural Network

Timely and accurate agricultural information is essential for food security assessment and agricultural management. Synthetic aperture radar (SAR) systems are increasingly used in crop mapping, as they provide all-weather imagery. In particular, the Sentinel-1 sensor provides dense time-series data, thus offering a unique opportunity for crop mapping. However, in most studies, the Sentinel-1 complex backscatter coefficient was used directly, which limits the potential of Sentinel-1 in crop mapping. Meanwhile, most of the existing methods may not be tailored for the task of crop classification in time-series polarimetric SAR data. To solve the above problems, we present a novel deep learning strategy in this research. To be specific, we collected Sentinel-1 time-series data in two study areas. The Sentinel-1 image covariance matrix is used as an input to maintain the integrity of the polarimetric information. Then, a depthwise separable convolution recurrent neural network (DSCRNN) architecture is proposed to characterize crop types from multiple perspectives and achieve better classification results. The experimental results indicate that the proposed method achieves better accuracy in complex agricultural areas than other classical methods. Additionally, the variable importance provided by the random forest (RF) illustrates that the covariance vector has a far greater influence than the backscatter coefficient. Consequently, the strategy proposed in this research is effective and promising for crop mapping.


Introduction
Many of the problems resulting from the rapid growth of the global population are related to agricultural production [1,2]. In this context, it is necessary to have a comprehensive understanding of crop production information. Timely and accurate agricultural information can serve a range of important purposes, such as improving agricultural production, ensuring food security, and facilitating ecosystem services valuation [3]. Remote sensing, which provides timely earth observation data with large spatial coverage, can serve as a convenient and reliable method for agricultural monitoring [4]. It is now possible to build a time-series image stack for full-season monitoring and differentiate crop types according to their unique seasonal features [5]. Several studies have demonstrated the advantages of using recurrent neural networks (RNNs) as temporal extractors compared with other methods. For example, Ndikumana et al. [14] designed an RNN framework to explore the temporal correlation in Sentinel-1 data for crop classification. Meanwhile, some studies have proposed combined methods that join cyclic and convolution operations to process spatio-temporal cubes [28]. For instance, Rußwurm and Körner [29] designed a convolutional recurrent model (convRNN) to tackle land cover classification in the Sentinel-2 time series. Compared to single-model methods, the combined models generally provide better performance. Thus, it is necessary to develop a combined model that simultaneously considers the spatial-polarization-temporal features for time-series SAR image classification.
In this study, we propose a Sentinel-1 time-series crop mapping strategy to further improve the classification accuracy. To serve this purpose, deep learning strategies were introduced to understand the spatial-temporal patterns and scattering mechanisms of crops. To be specific, we use the Sentinel-1 covariance matrix as the input vector to provide polarization features information for deep network training. Then, a novel depthwise separable convolution recurrent neural network (DSCRNN) architecture is proposed to better extract complex features from the Sentinel-1 time series, which integrates the operations of cyclic and convolution. Moreover, in order to better model the potential correlations from the phase information, the conventional convolution is replaced by the depthwise separable convolutions. The main contributions of this paper are:

1.
By using the decomposed covariance matrix, the potential of the Sentinel-1 time series in crop discrimination is fully explored.

2.
An effective crop classification method is proposed for time-series polarimetric SAR data by jointly considering the temporal patterns and the polarimetric and spatial characteristics of crops.
The rest of this paper is organized as follows. Study areas and data are described in Section 2. Section 3 details the specific architecture and method of DSCRNN. Section 4 presents the results of the classification. The discussion and conclusion are presented in Sections 5 and 6, respectively.

Study Area
California is the largest agricultural state in the United States of America (U.S.) [30]. This indicates the significance of crop mapping in California. Thus, this study is carried out at two different sites in California, henceforth referred to as study area 1 and study area 2 ( Figure 1).
Study area 1 is situated in Imperial, Southern California, at 33°01′N and 115°35′W, covering a region of about 10 km × 10 km. The area is in the Colorado Desert and has a very hot tropical desert climate. The area has one of the highest yields of crops such as alfalfa, onions, and lettuce in California. The mean annual temperature is higher than 27 °C [31], and the temperature variation is also very large. There is little rain throughout the year, with annual precipitation below the U.S. mean. Six classes were selected for analysis: winter wheat, alfalfa, other hay/non-alfalfa, sugar beets, onions, and lettuce.
Study area 2 is situated in an agricultural district stretching over Solano and Yolo counties in Northern California, at 38°26′N and 121°44′W, covering a region of about 10 km × 10 km. The area has a Mediterranean climate, characterized by dry hot summers and wet cool winters [32]. The region is flat, its agricultural system is complex, and it is one of the most productive agricultural areas in the U.S. It has an annual precipitation of about 750 mm, concentrated in the spring and winter [33]. Seven major crop types were selected for analysis: walnut, almond, alfalfa, winter wheat, corn, sunflower, and tomato.

Sentinel-1 Data
In this study, the Sentinel-1 Interferometric Wide (IW) Single Look Complex (SLC) products were used. All the images were downloaded from the Sentinel-1 Scientific Data Hub. Since the major agricultural practices in both study areas take place in spring and summer, we focused our data analysis on these seasons. Figure 2 shows the time distribution of Sentinel-1 images collected in the two study areas. In total, 15 scenes of Sentinel-1A images from 2018 were collected in study area 1, and 11 Sentinel-1A images from 2019 were collected in study area 2.
The pre-processing of the time-series Sentinel-1 images was done using the Sentinel Application Platform (SNAP) offered by the European Space Agency (ESA). Data preprocessing consists of five steps: (1) terrain observation by progressive scans synthetic aperture radar (TOPSAR) split, (2) radiometric calibration of the Sentinel-1 data to complex values, (3) TOPSAR deburst, (4) refined Lee filtering, and (5) Range-Doppler terrain correction for all images using the same digital elevation model data (SRTM DEM 30 m). Since we hoped that this study would facilitate the fusion of Sentinel-1 and Sentinel-2 features, we projected the data to the UTM reference and resampled it to 10 m for co-registration with Sentinel-2.

Figure 2. Data acquisition dates in the two study areas.

Cropland Reference Data
The U.S. Department of Agriculture (USDA) Cropland Data Layer (CDL) of 2018 and 2019 was used as the reference data for crop classification and to validate the experiments. The data is published regularly by the USDA and covers 48 states [35]. The CDL has been widely used in all kinds of remote sensing crop research because of its high quality. However, there are some misclassifications in the data [36]. Through visual inspection, it was found that the misclassified pixels of the CDL were concentrated at the boundaries of the crop fields. Therefore, we performed a manual drawing of the reference data according to the CDL (Figure 3).
The process of drawing the labeled data consists of three steps. First, the spatial resolution of the CDL is resampled to 10 m, and Sentinel-2 images are overlaid on the CDL image to determine the crop field boundaries. Second, the field of each major crop is manually delineated and buffered one pixel inward from the field boundary. Finally, fields of the same crop type are combined into one class. Detailed information about the modified labeled data is reported in Tables 1 and 2.



Representation of Sentinel-1 Data
A PolSAR image can be represented by a 2 × 2 complex scattering matrix S. However, Sentinel-1 provides only dual-polarization information, so the expression of S must be modified. The backscattering of Sentinel-1 is expressed as the scattering vector

k = [S_VH, S_VV]^T, (1)

where S_VH and S_VV are the complex backscattering coefficients under the two polarimetric combinations, and H and V represent the horizontal and vertical polarization directions of the electromagnetic wave, respectively. Since the scattering matrix is an inadequate representation of the scattering characteristics of complex targets [37], the covariance matrix C is used. This is written as

C_dual = ⟨k k^H⟩ = [[C_11, C_12], [C_21, C_22]] = [[⟨|S_VH|²⟩, ⟨S_VH S_VV*⟩], [⟨S_VV S_VH*⟩, ⟨|S_VV|²⟩]], (2)

where C_11, C_12, C_21, C_22 are the elements of the covariance matrix, ⟨·⟩ denotes spatial averaging, and * is the conjugate operation. It can be seen from Equation (2) that the diagonal elements of the matrix C_dual are real and the off-diagonal elements are complex. Since the matrix C_dual is Hermitian (C_21 = C_12*), the set {C_11, C_12, C_22} contains all the information about C_dual. We separate the real and imaginary parts of C_12 and convert them to real values. Thus, we get the 4-dimensional vector

C_v = [C_11, re(C_12), im(C_12), C_22], (3)

where re and im represent the real and imaginary parts of complex numbers, respectively. Finally, in order to accelerate the convergence of the model, each pixel is normalized per channel:

C_v,norm^i = (C_v^i − C_v−min^i) / (C_v−max^i − C_v−min^i), (4)

where i indexes the channels of C_v, and C_v−max^i and C_v−min^i are the maximum and minimum values of the ith channel, respectively.
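The construction of the covariance vector C_v from the two complex channels can be sketched as follows. This is a minimal NumPy illustration, not the paper's code; `covariance_vector` and `minmax_normalize` are hypothetical helper names, and spatial multilooking of the covariance terms is omitted.

```python
import numpy as np

def covariance_vector(s_vh, s_vv):
    """Build the 4-channel real covariance vector C_v from complex S_VH, S_VV.
    C11 = |S_VH|^2, C12 = S_VH * conj(S_VV), C22 = |S_VV|^2."""
    c11 = np.abs(s_vh) ** 2
    c12 = s_vh * np.conj(s_vv)
    c22 = np.abs(s_vv) ** 2
    # Stack [C11, re(C12), im(C12), C22] along the last (channel) axis.
    return np.stack([c11, c12.real, c12.imag, c22], axis=-1)

def minmax_normalize(cv):
    """Per-channel min-max normalization over the whole image, as in Eq. (4)."""
    cmin = cv.min(axis=(0, 1), keepdims=True)
    cmax = cv.max(axis=(0, 1), keepdims=True)
    return (cv - cmin) / (cmax - cmin + 1e-12)
```

Because C_dual is Hermitian, these four real channels carry the same information as the full 2 × 2 complex matrix.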

Architecture of DSCRNN Network
Figure 4 shows the proposed DSCRNN architecture. In order to maintain the integrity of the Sentinel-1 data, the covariance matrix vectors are sliced into patches as the input of the neural network. Then, the patches of T timestamps are fed into the DSCRNN. In this step, the same depthwise separable convolution operation is performed on the patches of each timestamp to obtain a feature sequence. Finally, the feature sequence is fed into the attentive LSTM layer to produce the crop classification. Next, we introduce the components of the architecture and their advantages.


Depthwise Separable Convolution
As shown in Figure 5, the convolution mechanism in conventional CNNs extracts features from all dimensions of each image, including the spatial dimension and the channel dimension [21]. For conventional CNNs, suppose a three-dimensional (3D) tensor x ∈ IR^(H×W×D) is input to the network, where H, W, and D are the height, width, and depth of the input. The convolution is written as

y(i, j) = Σ_{h,w,d} f(h, w, d) · x(i + h, j + w, d),

where f is the trainable convolution filter, (i, j) is a location in the output feature map, and x(h, w, d) is the element of x at spatial location (h, w) in the d-th channel.
Depthwise separable convolution has been successfully applied in Xception [26] and MobileNet [38]. Different from conventional CNNs, the depthwise separable convolution can be divided into a depthwise convolution and a pointwise convolution. To be specific, the depthwise convolution convolves a separate kernel with each input channel, and the pointwise convolution then combines the resulting channels [39]. This is written as

y = PConv(DConv(x)),

where DConv is the depthwise convolution, PConv is the pointwise convolution, and f_d represents a convolution filter of size 1. Compared with conventional CNNs, the parameters of the depthwise separable convolution are significantly reduced [40].

For data whose channels are closely correlated, the depthwise separable convolution may yield better results [41]. The C_dual matrix contains both phase information and amplitude information. This means that the correlations between the channels can express the structural information of the crop. Therefore, depthwise separable convolution is more suitable than conventional CNNs for feature extraction in PolSAR images [41].
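The parameter reduction claimed above can be checked with simple arithmetic (a sketch; bias terms are not counted): a conventional k × k convolution needs k·k·D_in·D_out weights, while the depthwise-plus-pointwise factorization needs only k·k·D_in + D_in·D_out.

```python
def conv_params(k, d_in, d_out):
    """Parameters of a conventional k x k convolution (no bias)."""
    return k * k * d_in * d_out

def separable_params(k, d_in, d_out):
    """Depthwise (one k x k kernel per input channel) + pointwise (1 x 1)."""
    return k * k * d_in + d_in * d_out

# First DSCRNN layer in study area 1: 3x3 kernels, 4 input, 32 output channels.
conventional = conv_params(3, 4, 32)      # 1152 parameters
separable = separable_params(3, 4, 32)    # 164 parameters
```

For the 4-channel covariance input, the separable layer uses roughly one seventh of the parameters of its conventional counterpart.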

Attentive Long Short-Term Memory Neural Network
LSTM is a representative RNN architecture with the ability to maintain a temporal state between continuous input data, and it learns long-term context dependencies [42]. Compared with a plain RNN, the inner structure of the hidden layer in LSTM is more complex [43]. An LSTM block consists of a memory cell state, a forget gate, an input gate, and an output gate. The specific steps of the LSTM at time t are as follows. The previous cell state C_{t−1} is passed to the forget gate F_t, and the sigmoid activation function σ is used to determine the proportion of discarded information:

F_t = σ(W_Fx · x_t + W_Fh · h_{t−1} + bias_F).

Then, the input gate I_t decides what percentage of the new candidate information C̃_t is stored in the cell state for input x_t:

I_t = σ(W_Ix · x_t + W_Ih · h_{t−1} + bias_I),
C̃_t = tanh(W_Cx · x_t + W_Ch · h_{t−1} + bias_C).

The present cell state C_t is updated by multiplying the previous cell state C_{t−1} by F_t and the candidate information C̃_t by I_t:

C_t = F_t ⊙ C_{t−1} + I_t ⊙ C̃_t.

Finally, the new hidden state h_t is confirmed in the output gate O_t, where the new cell state C_t is used:

O_t = σ(W_Ox · x_t + W_Oh · h_{t−1} + bias_O),
h_t = O_t ⊙ tanh(C_t),

where W_Fx, W_Fh, W_Ix, W_Ih, W_Ox, W_Oh, W_Cx, W_Ch are the weight matrices and bias is the trainable bias term. Finally, we couple the LSTM with an attention mechanism that can combine the information extracted by the recurrent model at different time steps. Intuitively, the attention mechanism allows the model to pay attention to specific timestamps and discard useless contextual information. This is written as

λ_j = softmax(f(x_t, h_j)), rnn_feat = Σ_j λ_j · h_j,

where x_t is the input vector at time t, h_j is the output vector at time j, and f is the set of all trainable attention parameters. The purpose of this step is to learn a set of weights to measure the importance of the temporal information.
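The gate equations above can be sketched as a single LSTM step in NumPy. This is a minimal illustration rather than the paper's implementation; the layout of the weight dictionaries `W` and `b` is an assumption made here for compactness.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps gate name -> (W_gx, W_gh); b maps gate name -> bias."""
    F = sigmoid(x_t @ W["F"][0] + h_prev @ W["F"][1] + b["F"])        # forget gate
    I = sigmoid(x_t @ W["I"][0] + h_prev @ W["I"][1] + b["I"])        # input gate
    C_tilde = np.tanh(x_t @ W["C"][0] + h_prev @ W["C"][1] + b["C"])  # candidate state
    c_t = F * c_prev + I * C_tilde                                    # new cell state
    O = sigmoid(x_t @ W["O"][0] + h_prev @ W["O"][1] + b["O"])        # output gate
    h_t = O * np.tanh(c_t)                                            # new hidden state
    return h_t, c_t
```

Running this step over the T timestamps and weighting the resulting h_t with learned attention weights yields the rnn_feat vector used for classification.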
As discussed in Section 3.2.1, a 3D tensor is input into the convolution network to obtain a feature vector cnn_fea. In this way, the Sentinel-1 time-series data can be regarded as a 4D tensor x ∈ IR^(H×W×D×T), where T is the temporal dimension. This means that for each individual patch, the output returned by the depthwise separable convolution model is the sequence Seq_cnn_feat = (cnn_fea1, cnn_fea2, · · ·, cnn_feaT), where the feature map cnn_feat represents the feature vector of the depthwise separable convolution at time step t, and each output feature has the same dimensions. Then, the convolved feature sequence Seq_cnn_feat is fed into the attentive LSTM layer, which outputs a feature vector rnn_feat. Finally, with the feature vector rnn_feat, labels can be assigned using the softmax classifier:

p_i = exp(z_i) / Σ_j exp(z_j),

where z is the vector of class scores computed from rnn_feat, and p_i is the probability that rnn_feat belongs to class i.
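The softmax assignment can be written as a short, numerically stable sketch; subtracting the row maximum before exponentiating is an implementation detail assumed here and does not change the result.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

The predicted crop label is then simply the argmax of the resulting probability vector.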

Competing Methods
In order to evaluate the performance of DSCRNN, several classification methods such as SVM, RF, Conv1D, and LSTM were selected for comparison. In addition, in order to investigate the interplay among the different components of DSCRNN, we disentangle the different parts of our framework. They are simply recorded as Network (Net) A to Net C in the order of appearance for convenience: Net A is based on a conventional CNN as shown in Figure 6a, Net B is based on depthwise separable convolution as shown in Figure 6b, and Net C is based on a CRNN (using attentive LSTM and conventional CNN) as shown in Figure 6c.

Dataset Partition
In crop classification tasks, the labeled data is usually very limited. Therefore, according to the modified ground truth data, we randomly select 1% of all available samples for each crop type as the training set. For models with a single pixel as the input (e.g., RF, Conv1D, and LSTM), the time series of the pixel corresponding to the labeled data is used as the input vector. For the DSCRNN and its variant models, we take each sample (labeled pixel) as the center point and segment a square patch of size 18 × 18 as input. The remaining non-overlapping samples for each study area are used for testing (Table 3). It is important to note that there is no overlap between training and testing [44].
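The patch segmentation described above can be sketched as follows. `extract_patch` is a hypothetical helper; the paper does not specify how samples near the image border are handled, so this sketch simply skips them.

```python
import numpy as np

def extract_patch(image, row, col, size=18):
    """Cut a size x size patch around a labeled pixel from a (T, H, W, C)
    time-series stack; returns (T, size, size, C), or None near the border."""
    half = size // 2
    r0, c0 = row - half, col - half
    if r0 < 0 or c0 < 0 or r0 + size > image.shape[1] or c0 + size > image.shape[2]:
        return None  # skip samples whose patch would fall outside the image
    return image[:, r0:r0 + size, c0:c0 + size, :]
```

Each extracted patch keeps the full temporal depth T, so a single training sample is a 4D tensor of shape (T, 18, 18, 4).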

Experimental Designs
In this study, a carefully designed DSCRNN model is applied. In order to make full use of the rich information of the Sentinel-1 time series, the covariance matrix is used as the input vector to train the model. The main crop vectors from the manually drawn reference data are identified to guide the sample extraction from the Sentinel-1 data. Taking study area 1 as an example, the sizes of the input samples are set to 15 × 18 × 18 × 4. The first depthwise separable convolution layer converts the inputs to 15 × 16 × 16 × 32. In this step, firstly, four 3 × 3 × 1 convolutional kernels convolve each single channel of the input map to obtain four 15 × 16 × 16 feature maps. Secondly, thirty-two 1 × 1 × 4 convolution kernels are applied to generate the 15 × 16 × 16 × 32 outputs. Then, the second depthwise separable convolution layer has sixty-four convolutional kernels and produces a 15 × 14 × 14 × 64 output map, which is then downsampled to 15 × 7 × 7 × 64 with max pooling. After the max pooling layer, the output feature map is flattened. Next, the temporal features are extracted by the attentive LSTM with 150 hidden units. Finally, a softmax classifier outputs the six class labels. In the training stage, the Adam optimizer [45] was used with fixed settings: learning rate = 0.001, β1 = 0.9, β2 = 0.999, ε = 1 × 10⁻⁷ [5], and the batch size was set to 200. In all the following experiments, the patch size is set to 18.
The Conv1D model consists of one convolution layer with 512 filters, one max-pooling layer, and one fully connection layer. The LSTM model consists of two hidden layers with 150 units for each layer. Net A, Net B, and Net C all have a similar architecture to DSCRNN, except for the differences in some key components in DSCRNN (i.e., depthwise separable convolution and attentive LSTM). The neural net models are implemented using the Python TensorFlow library, while other models are implemented using the Python Scikit-learn library [46].
For the RF and SVM classifiers, we optimize the RF by tuning the number of trees in the forest and the maximum depth of each tree, and the SVM by adjusting C and gamma. We employ a "grid search" strategy to select the optimal parameters: the classifier is trained repeatedly to select the optimal combination of parameter values. The number of trees takes values in {200, 400, 600, 800, 1000}, the maximum depth of each tree in {20, 40, 60, 80, 100}, C in {0.001, 0.01, 0.1, 1, 10, 100, 1000, 3000, 5000, 10000}, and gamma in {0.1, 1, 2, 5, 10}. The average accuracy (AA) [47], overall accuracy (OA), kappa coefficient (kappa), and F1-score are used as the criteria for evaluating the performance of the models.
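The grid search can be sketched with scikit-learn's `GridSearchCV`. This is a minimal illustration on synthetic data with a reduced grid, not the paper's full parameter ranges.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the pixel time-series feature vectors.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Subset of the paper's RF grid (number of trees, maximum depth).
param_grid = {"n_estimators": [200, 400], "max_depth": [20, 40]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
best = search.best_params_  # combination with the highest cross-validated score
```

The same pattern applies to the SVM by swapping in `SVC` and a grid over `C` and `gamma`.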

Classification Results
In this section, we describe and discuss the experimental results obtained on the two study data sets introduced in Section 2. We evaluate the performance of DSCRNN and then compare it with several competing methods.
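The evaluation criteria (OA, AA, and kappa) can be computed from a confusion matrix as sketched below; AA is taken here as the mean of the per-class recalls, which is one common definition, and the helper name is hypothetical.

```python
import numpy as np

def accuracy_metrics(y_true, y_pred, n_classes):
    """Return (OA, AA, kappa) computed from a confusion matrix."""
    cm = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    oa = np.trace(cm) / n                               # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))          # mean per-class accuracy
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n**2 # chance agreement
    kappa = (oa - pe) / (1 - pe)                        # Cohen's kappa
    return oa, aa, kappa
```

The F1-score follows from the same confusion matrix as the harmonic mean of per-class precision and recall.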

Results on Study Area 1
The results of crop classification in study area 1 obtained by the different methods are shown in Figure 7. Obviously, for competing methods such as SVM, RF, Conv1D, and LSTM, there is a lot of speckle noise in the classification maps, which results in low accuracy for crop mapping. However, the spatial feature-based methods, especially DSCRNN, produced noise-resistant classification maps. Table 4 lists the detailed accuracy assessment of these classification methods. It can easily be noticed that the overall classification accuracy of DSCRNN (0.9603) is higher than that of LSTM (0.8744), RF (0.8911), Conv1D (0.8998), and Net A (0.9092), while the differences between Net B (0.9477), Net C (0.9486), and DSCRNN are not significant. However, the AA, OA, Kappa, and F1-score of DSCRNN are all slightly higher than those of Net B and Net C. From the tables, the overall accuracy of Net B (OA: 0.9477) is significantly higher than that of Net A (OA: 0.9092), which indicates that the depthwise separable convolution achieves better classification than the conventional CNN in study area 1. This is because depthwise separable convolution improves the ability to extract information from the phase of Sentinel-1 images. Similarly, the OA of Net C is slightly higher than that of Net A. This result confirms the importance of introducing temporal information into SAR image classification.

Results on Study Area 2
The classification results of study area 2 with the various classification methods are shown in Figure 8. The characteristics of the different crop types in study area 2 are very similar, which elevates the difficulty of performing an accurate classification. Therefore, the classification performance of all methods decreases compared with study area 1. Table 5 reports the classification accuracy of each method; it can be seen that DSCRNN still has a high overall classification accuracy (0.9389).
It is worth noting that the Kappa of DSCRNN is 2.79% and 2.59% higher than that of Net B and Net C, respectively. This shows that the combination of depthwise separable convolution and CRNN brings more improvement in crop classification than using CRNN or depthwise separable convolution alone. In the comparison between Net A and Net B, it can be seen that for the same architecture (CNN vs. depthwise separable convolution), depthwise separable convolution improves the accuracy of alfalfa significantly (0.9096 vs. 0.9575). Similarly, in the comparison between Net A and Net C, we can find that the contextual information of the time series brings a remarkable improvement in recognizing other hay (0.7766 vs. 0.8912). In addition, DSCRNN has good recognition performance for alfalfa and other hay (0.9634 and 0.9549, respectively), which benefits from the utilization of the phase and temporal information of the time-series data.

Influence of Different Input Data
In this section, experiments are implemented with different input data to verify the improvement obtained when using the covariance matrix of the Sentinel-1 images instead of the backscattering coefficients (VV and VH). The common methods (RF, Net A, and Net B) with the amplitude input are abbreviated as RF-v1, Net A-v1, and Net B-v1, while the models using the covariance matrix input are noted as RF-v2, Net A-v2, and Net B-v2. Experiments are carried out in study area 1.
The results are reported in Table 6. In study area 1, the RF with the covariance matrix as input always shows better performance, which confirms our hypothesis that the phase information indeed provides more useful information for the crop classification task. Clearly, the performance of Net B-v2 is much better than that of Net B-v1, which demonstrates that the depthwise separable convolution is helpful for extracting the underlying correlations of the phase information. However, Net A-v1 and Net A-v2 have similar classification accuracy. This means that conventional convolution has limited ability to extract information from the SAR phase.

Phase Information Importance
In this section, we discuss the contribution of the covariance matrix to crop classification. In most previous research on Sentinel-1 image crop mapping, it is common to consider only amplitude information and neglect the unique phase information of SAR. However, the complex-valued polarization scattering matrix provides useful phase information, and can thus generate more accurate descriptions of crop type.
The classification results in Section 4 demonstrated that the RF with the covariance matrix as input achieves greater overall classification accuracy than with backscatter features. It should be noted that the classification accuracy of the conventional convolution on the two input data sets is similar (0.8953 and 0.9092, respectively). One possible reason may be that there is a significant difference between the phase information and the amplitude information, so the phase information may not be fully utilized by conventional amplitude image classification methods (e.g., conventional CNNs).
Further, to validate the importance of feature representations [48], the RF classifier with all available features (90 features) is utilized to investigate the importance of the input features for crop classification. For visual comparison, we sum the contribution scores of the features to represent the importance of each feature representation and list them along the temporal axis in Figure 9.
It is clear from Figure 9 that the features derived from the covariance matrix are generally more important than the backscatter ones. This suggests that the commonly neglected phase information can be used to identify different types of scatterers. Also, for the images collected in January, the covariance matrix is the most important feature representation, with importance values of 0.0653 and 0.0935, respectively. In particular, over the entire time series, the covariance matrix of the images collected from January to March has the most important impact on crop classification. This indicates that the largest separability amongst crop types in study area 1 occurred during this period. In contrast, for the May to June imagery, the phase information has limited influence on crop identification. It is important to note that the more time-series images are collected, the less important the phase information may become.
Figure 9. Importance validation with the random forest (RF) classifier. Each bar represents the sum of the importance of all variables in one feature representation. For example, the first orange bar represents the sum of the importance of the four variables in the Sentinel-1 covariance matrix on January 5.
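The per-date aggregation behind Figure 9 can be sketched as follows. The feature layout here is a hypothetical one that matches the counts in the text: 15 acquisition dates, each contributing two backscatter variables (VV, VH) and four covariance-matrix variables (C11, C22, Re(C12), Im(C12)), flattened date-major into 90 features; the input would be the fitted RF's `feature_importances_` vector:

```python
import numpy as np

# Hypothetical layout: 15 dates x 6 variables per date, date-major,
# giving the 90 features mentioned in the text.
N_DATES, N_VARS = 15, 6

def group_importance(importances):
    """Sum per-feature RF importances into one score per date for the
    backscatter pair and one for the four covariance-matrix variables."""
    imp = np.asarray(importances).reshape(N_DATES, N_VARS)
    backscatter = imp[:, :2].sum(axis=1)  # VV + VH per date
    covariance = imp[:, 2:].sum(axis=1)   # C11 + C22 + Re/Im(C12) per date
    return backscatter, covariance
```

Plotting the two returned vectors against the acquisition dates reproduces the bar-per-representation layout of Figure 9.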

Pros and Cons
In this work, we demonstrated that combining phase information and amplitude information from Sentinel-1 time-series data can be used to classify crops in complex agricultural areas. In most cases, with the Sentinel-1 covariance matrix images, both the classical methods and the model proposed in this study obtain good classification accuracy. It is worth noting that the proposed DSCRNN has the highest overall classification accuracy in the two study areas: the AA, OA, Kappa, and F1-score of DSCRNN are above 0.91 for both. Moreover, we explored the contributions of the depthwise separable convolution and the Attentive LSTM to Sentinel-1 image classification to demonstrate the robustness of the DSCRNN. Specifically, in the comparison between Net A and Net B, the overall accuracy of Net B is about 5 and 2 percentage points higher than that of Net A in the two study areas, respectively. Similarly, the classification results of Net A and Net C show that the temporal features extracted from Sentinel-1 images by the Attentive LSTM are beneficial to the recognition of some complex crops (such as sugar beets). In terms of Net B and Net C, the difference between them is not obvious in study area 1. However, in study area 2, Net C is clearly superior to Net B. These differences appear to be related to the complexity of the study area: in complex agricultural areas, it is difficult to fully describe the unique structure of crops by scattering characteristics alone, which can result in false recognition. In contrast, the unique growth patterns of crops are much easier to distinguish than their structures.
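The efficiency argument for the depthwise separable convolution can be made concrete by counting parameters: a standard convolution couples spatial filtering and channel mixing in one kernel, while the separable version factorizes them into a per-channel depthwise filter followed by a 1×1 pointwise convolution. The layer sizes below are illustrative only, not the actual DSCRNN configuration:

```python
def conv_params(k, c_in, c_out):
    # standard 2-D convolution: one k x k x c_in kernel per output channel
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    # depthwise: one k x k kernel per input channel;
    # pointwise: a 1 x 1 convolution mixing c_in channels into c_out
    return k * k * c_in + c_in * c_out
```

For example, a 3×3 layer mapping the four covariance-matrix channels to 64 feature maps needs 3·3·4·64 = 2304 weights in the standard form but only 3·3·4 + 4·64 = 292 in the separable form, roughly an 8× reduction.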
In addition, some limitations of the proposed methodology must be stressed. First, there is the impact of inaccurately labeled data. The classification method relies heavily on ground-truth maps, which means that low-quality labeled data will degrade performance. Therefore, we visually checked the CDL data and manually labeled the ground-truth maps for higher accuracy. Another weakness of DSCRNN is the complexity of the model, with its high computational cost.

Potential Applications
In this section, we summarize some of the findings of this paper and their potential impact on the collaborative use of Sentinel-1 and Sentinel-2 data for crop mapping. The analysis of the Sentinel-1 time-series data led to recommendations for Sentinel-1 data selection in the synergistic use of these two time series. Our results have shown that the classification results with the Sentinel-1 covariance matrix as input are better than with the backscatter images, suggesting that much more valuable information is provided by the covariance matrix. As such, it is possible to use the covariance matrix data to replace the commonly used backscatter images when coordinating with Sentinel-2 data. The analysis of the DSCRNN classification results led to recommendations for the Sentinel-1 branch in custom architectures. Specifically, some collaborative Sentinel-1 and Sentinel-2 studies have proposed well-designed custom networks (e.g., two-branch architectures), in which the Sentinel-2 and Sentinel-1 information is processed separately by the two branches before being fused. DSCRNN has shown strong competitiveness in the experiments on the two study areas. Therefore, using DSCRNN as the Sentinel-1 branch in two-branch networks may greatly help to improve crop classification performance.

Conclusions
In this study, we proposed a combined strategy for crop classification using Sentinel-1 time-series data. Different from previous studies, the commonly used backscatter signals in the Sentinel-1 time-series stack were replaced by the complex-valued covariance matrix. In this way, the original information of the Sentinel-1 images could be effectively retained. Moreover, we proposed the DSCRNN architecture to characterize crop types from multiple perspectives (spatial characteristics, phase correlation, and temporal information). The architecture utilizes depthwise separable convolution to better formulate the potential correlation of the phase and spatial information of Sentinel-1 images. On this basis, we further introduced the Attentive LSTM into the network to extract temporal relationships from the feature sequences. Compared to previous studies, the proposed method provided accurate crop mapping results even in complex crop areas. In future work, we will focus on the combination of Sentinel-1 and Sentinel-2 data to boost crop mapping accuracy.
Author Contributions: Y.Q. and W.Z. developed the main idea that led to this paper. Y.Q. and W.Z. provided Sentinel-1 processing, classifications and their descriptions. Z.Y. and J.C. helped with the experiments and results analysis. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.