A Dual Attention Convolutional Neural Network for Crop Classification Using Time-Series Sentinel-2 Imagery

Abstract: Accurate and timely mapping of crop types and reliable information about cultivation patterns and areas play a key role in various applications, including food security and sustainable agriculture management. Remote sensing (RS) has been extensively employed for crop type classification. However, accurate mapping of crop types and extents is still a challenge, especially using traditional machine learning methods. Therefore, in this study, a novel framework based on a deep convolutional neural network (CNN) and a dual attention module (DAM), using Sentinel-2 time-series datasets, was proposed to classify crops. A new DAM was implemented to extract informative deep features by taking advantage of both spectral and spatial characteristics of Sentinel-2 datasets. The spectral and spatial attention modules (AMs) were respectively applied to investigate the behavior of crops during the growing season and their neighborhood properties (e.g., textural characteristics and spatial relation to surrounding crops). The proposed network contained two streams: (1) convolution blocks for deep feature extraction and (2) several DAMs, which were employed after each convolution block. The first stream included three multi-scale residual convolution blocks, where the spectral attention blocks were mainly applied to extract deep spectral features. The second stream was built using four multi-scale convolution blocks with a spatial AM. In this study, over 200,000 samples from six different crop types (i.e., alfalfa, broad bean, wheat, barley, canola, and garden) and three non-crop classes (i.e., built-up, barren, and water) were collected to train and validate the proposed framework. The results demonstrated that the proposed method achieved an overall accuracy of 98.54% and a Kappa coefficient of 0.981. It also outperformed other state-of-the-art classification methods, including RF, XGBOOST, R-CNN, 2D-CNN, 3D-CNN, and CBAM, indicating its high potential to discriminate different crop types.


Introduction
Considering the prospect of human population growth, which is expected to reach 8.7 billion by 2030, the food supply system is subjected to escalating pressure [1,2]. Additionally, climate change effects and catastrophic natural disasters (e.g., drought and flood) are already hampering agricultural production and threatening food security from local to global scales [3,4]. Accordingly, it is vital to obtain authentic information about the location, extent, type, health, and yield of crops to ensure food security, poverty reduction, and water resource management [5]. Additionally, it is more appealing to incorporate efficient approaches that facilitate the requirement of sustainability and climate change adaption [6,7]. Thus, it is crucial to employ efficient approaches, such as advanced machine learning along with remote sensing (RS) data, to ensure high-quality information is derived about crops in order to achieve specified goals [8].
Landsat-8 images for crop mapping in fragmented and heterogeneous landscapes of Najaf-Abad, Iran. To this end, long-term in-situ phenological information was combined with satellite images to map annual crop types using intensive decision trees and SVM classifiers. Furthermore, Saadat, et al. [54] employed time-series Sentinel-1 data to map rice in the northern part of Iran. To this end, Gamma Nought, Sigma Nought, and Beta Nought features of Sentinel-1 images in three scenarios were used in the RF classifier. Their results indicated the superiority of Sigma Nought and Gamma Nought Sentinel-1 data in vertical transmittance and horizontal receiving (VH) polarization.
Although many crop mapping frameworks have been proposed by various researchers, they generally have one of the following disadvantages: (I) Most crop mapping studies have focused on conventional machine learning methods (e.g., RF and SVM). These algorithms do not usually provide the highest possible accuracies due to several factors, such as climatic conditions and fluctuations in planting times. (II) Many studies have only used spectral-temporal information for crop mapping. However, spatial information should be included in the classification algorithm to produce highly accurate maps. (III) Many state-of-the-art deep learning methods for crop mapping have only used 2D/3D convolution blocks for extracting deep features. Not all of these extracted deep features are informative for crop mapping, and some provide redundant information. In this regard, attention blocks should be implemented to select the most informative features.
The Iranian crop system is under escalating pressure mainly due to the severe water crisis and population growth [55]. Additionally, climate change and the current dramatic drought condition in Iran also exacerbate the existing pressure [56]. Furthermore, the current economic and political sanctions have become a notable issue that would amplify this pressure in Iran [57,58]. Consequently, incorporation of advanced technologies, such as remote sensing and machine/deep learning algorithms, is required to support efficient agricultural practices in Iran. Considering the importance of crop mapping in Iran, a novel deep learning algorithm was developed in this study for accurate crop classification. The classification model has three main steps: (1) data preparation, (2)

Study Area
The study area was an agricultural area in the southern portions of the Aq Qala counties, Golestan province. This study area is approximately centered at a latitude and longitude of 37°50′N and 54°40′E, respectively (see Figure 1). The climate of the study area is mainly influenced by the Alborz Mountains and the Caspian Sea. Thus, it has different climates with a diverse rate of precipitation and humidity [59]. For instance, the study area contains semi-arid (northern parts) and humid (southern parts) climates with annual precipitation between 249 and 529 mm [60]. Consequently, it includes both irrigated and rainfed agricultural systems. The study area is among the most important counties for crop production in the province of Golestan, and various crops (e.g., wheat, alfalfa, and barley) are cultivated in this region during each growing season, of which wheat is the dominant one. As one of the biggest crop production sources in Golestan, it is essential to establish regular and accurate crop condition monitoring systems and estimate the cultivated crop area with high reliability.
Remote Sens. 2022, 14, 498
Figure 1. (a) The geographical location of the study area, and (b) a false-color composite NDVI image (R: first month NDVI, G: second month NDVI, and B: third month NDVI) from the study area.

Sentinel-2 Imagery
In this study, time-series Sentinel-2 optical satellite images were employed for crop type classification. Sentinel-2 is a European satellite developed via the cooperation of the European Commission initiative Copernicus and the European Space Agency [33]. This platform carries the MultiSpectral Instrument (MSI) sensor, a wide-swath multispectral imager that images the Earth's surface using 13 bands with a spectral range from 443 nm to 2190 nm. These bands are taken from visible to shortwave infrared domains of the electromagnetic spectrum in three different spatial resolutions (i.e., 10-60 m) [61]. Sentinel-2 constellation (Sentinel-2A and B) provides global coverage of the Earth's surface every five days, making it suitable for a variety of land monitoring tasks. In total, 13 Sentinel-2 images were used in this study (see Table 1). As is clear from Table 1, the imagery acquired in the first two weeks of February 2018 and the second two weeks of March 2018 were not used because of the cloud cover over the study area on these two dates. Overall, we could effectively distinguish various types of crops in the study area using these time-series images [62,63].

Table 1.
Dataset            Date          Period
…                  …             The second two weeks
Dataset-Time-9     March 2018    The first two weeks
Dataset-Time-10    March 2018    The second two weeks (high-cloudy, not used)
Dataset-Time-11    April 2018    The first two weeks
Dataset-Time-12    April 2018    The second two weeks
Dataset-Time-13    May 2018      The first two weeks
Dataset-Time-14    May 2018      The second two weeks
Dataset-Time-15    June 2018     The first two weeks

Figure 2 illustrates the distribution of the collected in-situ samples over the study area. These samples were collected from ten classes during several field surveys. The field data were collected in 2018 from April to May for all crop classes. A handheld global positioning system (GPS) with a positional accuracy of <5 m was used to record the locations of the samples.

Figure 2. The distribution of the reference samples from ten classes collected over the study area.
As is clear from Figure 2, more arboretum and agricultural-vegetable areas are located on the right side of the study area, while other crops are dispersed over the study area. Table 2 provides the number of samples for each class. The wheat and broad bean classes had the maximum and minimum numbers of reference samples, respectively. There are different approaches, such as manual splitting, random splitting, and non-random splitting, for the division of reference samples into training, validation, and test samples [64]. In this regard, random sampling is the most common way to split reference samples and has been extensively used for classification tasks using remote sensing images [65-67]. Accordingly, in this study, random sampling was employed to divide reference samples into training (3%), validation (0.1%), and test (96.9%) samples.
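The random split described above can be illustrated with a minimal sketch. It assumes a simple (non-stratified) shuffle of sample indices; the function name `random_split`, the fixed seed, and the illustrative total of 200,000 samples are assumptions, not details taken from the study.

```python
import random


def random_split(n_samples, train_frac=0.03, val_frac=0.001, seed=0):
    """Randomly partition sample indices into training, validation and test sets."""
    indices = list(range(n_samples))
    rng = random.Random(seed)  # fixed seed is an assumption, used for repeatability
    rng.shuffle(indices)
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # everything that is left (about 96.9%)
    return train, val, test


# Illustrative sample count only; the study reports "over 200,000" samples.
train_idx, val_idx, test_idx = random_split(200_000)
print(len(train_idx), len(val_idx), len(test_idx))
```

A stratified split (sampling each class separately) would be an alternative when class sizes are very unequal; the text does not state which variant was used.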

Method
The general framework of crop type classification based on the proposed method is illustrated in Figure 3. The proposed classification framework was implemented in three main steps: (1) data preparation and normalized difference vegetation index (NDVI) calculation, (2) model training and parameters tuning, and (3) prediction and accuracy assessment. The detail of each step is discussed in the following subsections.

Data Preparation and Time-Series NDVI Calculation
Sentinel-2 datasets require several preprocessing steps, such as cloud masking and atmospheric correction. In this regard, we selected only non-cloudy images for the analysis. Moreover, the atmospheric correction was implemented using the Sen2cor module [68], which is available in the SNAP software.
Spectral feature extraction is the most common step in RS classification tasks [61]. The feature extraction can be conducted in two main categories: (1) combining spectral bands using simple mathematical operations, such as the NDVI spectral index [69,70]; and (2) deriving high-order statistical features (i.e., covariance and correlation), such as PCA [71] and factor analysis (FA). Among different spectral indices, NDVI was selected due to its simplicity and its high applicability for crop mapping [72-75]. NDVI was computed based on the red (0.665 µm) and near-infrared (NIR) (0.842 µm) bands (see Equation (1)).
$$\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + \mathrm{Red}} \tag{1}$$
Crops have a dynamic nature because of their growth during their lifetime. Thus, employing time-series datasets is an effective and pertinent solution for mapping crops [76,77]. Consequently, the time-series NDVI features were utilized in this study for crop types classification.
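As a minimal sketch of how a per-pixel NDVI time series could be assembled, the example below applies the standard NDVI ratio, (NIR − Red) / (NIR + Red), to one pixel across several acquisition dates. The function names, the zero-denominator handling, and the reflectance values are illustrative assumptions.

```python
def ndvi(red, nir, eps=1e-12):
    """NDVI for one pixel from red and near-infrared reflectance."""
    denom = nir + red
    if abs(denom) < eps:  # assumed handling for a zero denominator
        return 0.0
    return (nir - red) / denom


def ndvi_time_series(red_series, nir_series):
    """One NDVI value per acquisition date for a single pixel."""
    return [ndvi(r, n) for r, n in zip(red_series, nir_series)]


# Hypothetical reflectances of one pixel on three dates.
red = [0.08, 0.06, 0.05]
nir = [0.30, 0.42, 0.45]
series = ndvi_time_series(red, nir)
print(series)
```

With 13 usable acquisition dates, each pixel would yield a 13-element NDVI vector, which matches the temporal depth of the input patches used later.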

Proposed Deep Learning Architecture
This study proposed a new dual-stream CNN architecture with both spectral and spatial attention blocks. As shown in Figure 4, the proposed method received input patches of 11 × 11 × 13, which were then fed into two separate streams for deep feature extraction.
The first stream explored deep features based on multi-scale residual convolution blocks and spectral attention blocks, and thus focused on deep spectral feature extraction using the spectral AM. In this regard, a shallow multi-layer feature extractor, max-pooling layers, spectral attention blocks, and multi-scale residual blocks were employed. First, shallow features were extracted via a multi-scale convolution block. Then, the spectral attention block was employed to investigate the inter-channel relationship of the feature maps. Subsequently, a max-pooling layer was applied to reduce the size of the generated feature maps. A multi-scale residual block was then employed to find more meaningful features, again followed by a spectral attention block and max-pooling. Finally, the extracted deep features were transferred to the last multi-scale residual and spectral attention blocks to generate high-level deep features.
The second stream investigated deep features while concentrating on deep spatial features using spatial attention blocks. Similarly, this stream had one multi-scale convolution block and three multi-scale residual blocks. Moreover, after the convolution blocks, the spatial attention block and max-pooling layers were employed.
After deep feature extraction based on multi-scale residual blocks and attention blocks, the deep features were flattened using a flattening layer. Then, they were fed to a dense layer, and the decision was made via a soft-max layer.
The main differences between the proposed architecture and other CNN frameworks are: (1) utilizing a dual-stream framework for spatial and spectral deep feature extraction; (2) proposing a novel AM framework for the extraction of informative deep features, with a higher efficiency compared to the convolutional block attention module (CBAM); (3) taking advantage of residual, depth-wise, and separable convolution blocks, as well as combining them for deep feature extraction; and (4) employing separable convolution (point-wise and depth-wise convolution layers), which has a better performance.

Attention Mechanism (AM)
The AM in deep learning was inspired by the psychological attention mechanisms within the human brain [78-82]. The main idea behind the AM is to direct the focus of the network on extracting meaningful features instead of non-essential features [81]. The efficiency of the AM in deep learning models has been proven in previous literature [78,82-85]. In this regard, this study proposed a novel AM to increase the efficiency of the implemented/developed architecture by considering both spectral and spatial AM. The main idea to incorporate the AM was to explore the relationship between spectral-temporal and spatial-temporal information of input patches for the crop type classification task.
The developed spectral AM concentrated on 'what' is meaningful in the given input feature map [83,84,86]. To this end, we introduced a spectral attention block in accordance with the architecture illustrated in Figure 5. The input feature map was fed into a convolution block with a kernel size (a,b) equal to the length and width of the input feature data. The number of filters was c, which was equal to the number of feature maps of the input data, so the size of the output feature map was 1 × 1 × c. After reshaping the output of the previous layer, the features were transferred into a multi-layer perceptron (MLP) with two dense layers of different neuron sizes. The first layer reduced the number of neurons based on the reduction rate, and the second layer reconstructed the features. Simultaneously, a separable convolution layer was applied to the input data. Finally, the output of the first branch and that of the separable convolution layer were fused using multiplication. The separable convolution layer was implemented in two steps: point-wise convolution, followed by depth-wise convolution on the output of the point-wise convolution.
The developed spatial AM considered the inter-spatial relationship of feature maps [84,87,88]. The spatial AM concentrated on 'where' a useful region within the input feature map is [86,89]. This AM was implemented similarly to the spectral AM, but with different output sizes of the convolution layers (see Figure 6). The input feature map was transferred into a convolution block with a kernel size (a,b), with only one kernel and with padding. This means that the output size of the feature map was a × b × 1. After reshaping the output of the previous layer, the features were fed into an MLP with two fully connected layers of different neuron sizes. The first layer reduced the number of neurons based on the reduction rate. Then, the second fully connected layer reconstructed the features. Simultaneously, a separable convolution layer was applied to the input data. Finally, the output of the first branch and that of the separable convolution layer were fused via multiplication.
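The data flow of the spectral attention block can be sketched in plain Python as follows. This is an illustrative reading of the description, not the authors' implementation: the weights are random placeholders, and the ReLU and sigmoid activations, the 3 × 3 depth-wise kernel, the zero padding, the reduction rate of 2, and the function name `spectral_attention` are all assumptions.

```python
import math
import random


def spectral_attention(x, reduction=2, seed=0):
    """Sketch of a spectral attention block on an H x W x C nested list."""
    rng = random.Random(seed)
    h, w, c = len(x), len(x[0]), len(x[0][0])

    def rand():
        return rng.uniform(-0.1, 0.1)  # untrained placeholder weights

    # Branch 1: convolution whose kernel spans the full H x W extent, with C
    # filters, so the result is a 1 x 1 x C descriptor (one value per filter).
    full = [[[[rand() for _ in range(c)] for _ in range(w)] for _ in range(h)]
            for _ in range(c)]
    z = [sum(full[k][i][j][m] * x[i][j][m]
             for i in range(h) for j in range(w) for m in range(c))
         for k in range(c)]

    # Two-layer MLP: reduce to C // reduction neurons, then restore C values.
    hidden_n = max(1, c // reduction)
    w1 = [[rand() for _ in range(c)] for _ in range(hidden_n)]
    w2 = [[rand() for _ in range(hidden_n)] for _ in range(c)]
    hidden = [max(0.0, sum(w1[a][k] * z[k] for k in range(c)))  # ReLU (assumed)
              for a in range(hidden_n)]
    weights = [1.0 / (1.0 + math.exp(-sum(w2[k][a] * hidden[a]
                                          for a in range(hidden_n))))  # sigmoid (assumed)
               for k in range(c)]

    # Branch 2: separable convolution = point-wise (1 x 1) then depth-wise (3 x 3).
    pw = [[rand() for _ in range(c)] for _ in range(c)]
    point = [[[sum(x[i][j][m] * pw[m][k] for m in range(c)) for k in range(c)]
              for j in range(w)] for i in range(h)]
    dw = [[[rand() for _ in range(c)] for _ in range(3)] for _ in range(3)]

    def depth(i, j, k):
        total = 0.0
        for u in range(3):
            for v in range(3):
                ii, jj = i + u - 1, j + v - 1
                if 0 <= ii < h and 0 <= jj < w:  # zero padding at the borders
                    total += point[ii][jj][k] * dw[u][v][k]
        return total

    # Fuse the two branches by channel-wise multiplication.
    fused = [[[depth(i, j, k) * weights[k] for k in range(c)]
              for j in range(w)] for i in range(h)]
    return fused, weights


# Hypothetical 5 x 5 patch with 4 feature maps.
data_rng = random.Random(1)
patch = [[[data_rng.random() for _ in range(4)] for _ in range(5)] for _ in range(5)]
out, channel_weights = spectral_attention(patch)
print(len(out), len(out[0]), len(out[0][0]), channel_weights)
```

The spatial attention block would follow the same pattern, except that the first convolution uses a single padded kernel and produces an a × b × 1 map of per-location weights instead of one weight per channel.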

Convolution Layer
Convolution layers are the central core of CNN frameworks, and the main task of these layers is extracting high-level deep features from input imagery [90]. The convolution layers automatically explore spatial and spectral features at the same time. The basic computation of the convolutional layer can be defined as follows (Equation (2)) [91].
where x is the input data from layer N − 1, φ is an activation function, w and b are the weighted template and bias vector, respectively. The output of jth feature map for a 2D convolution at the spatial location of (x,y) can be computed using Equation (3) [35].
where m is the feature cube connected to the current feature cube in the (N − 1)th layer, and R and S are the length and width of the filter, respectively. This research took advantage of both residual and multi-scale blocks. The multiscale blocks increase the efficiency of the network against the differences in the scale of objects [35]. Moreover, the residual blocks improved the efficiency of the network and helped to prevent gradient vanishing.
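For reference, a standard form of Equations (2) and (3) that is consistent with the symbol definitions given here can be written as below. The output symbols $y^{N}$ and $v$, the summation indices $r$ and $s$, and the placement of the bias term are assumptions based on the common formulation of convolution layers, not a transcription of the original equations.

```latex
\begin{align}
y^{N} &= \varphi\left(w^{N} x^{N-1} + b^{N}\right) \tag{2} \\
v_{N,j}^{x,y} &= \varphi\left(b_{N,j} + \sum_{m} \sum_{r=0}^{R-1} \sum_{s=0}^{S-1}
  W_{N,j,m}^{r,s} \, v_{N-1,m}^{x+r,\,y+s}\right) \tag{3}
\end{align}
```

Here $W_{N,j,m}^{r,s}$ denotes the filter weight at position $(r,s)$ linking the $m$th feature cube of layer $N-1$ to the $j$th feature map of layer $N$.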

Model Training
Since the unknown parameters of the deep learning architecture cannot be calculated through an analytical solution, the iterative framework was employed to optimize the model parameters [90]. The adaptive moment estimation (Adam) optimizer [80] was used in this study to optimize the model parameters. Furthermore, the cross-entropy (CE) loss function was utilized to calculate the error of the network during the training phase. The training phase was conducted based on the training samples, and then the loss value of the trained model was computed using validation samples. The CE loss function can be calculated using Equation (4): where Φ and ϕ are the true and predicted labels, respectively. Moreover, N refers to the number of classes.
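A minimal sketch of the CE loss for a single sample is given below, assuming the common categorical form CE = −Σ Φ_i log(ϕ_i) summed over the N classes, with Φ a one-hot true label and ϕ the soft-max output. The exact form used in Equation (4) may differ (for example, in averaging over samples); the function name and the clipping constant `eps` are assumptions.

```python
import math


def cross_entropy(true_onehot, predicted_probs, eps=1e-12):
    """Categorical cross-entropy for one sample over N classes."""
    # Clip probabilities away from zero so that log() stays finite.
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(true_onehot, predicted_probs))


# Hypothetical four-class example: the true class is the third one.
true_label = [0, 0, 1, 0]
predicted = [0.05, 0.05, 0.85, 0.05]
loss = cross_entropy(true_label, predicted)
print(loss)
```

Only the probability assigned to the true class contributes to the loss, so a confident correct prediction drives the value toward zero.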

Accuracy Assessment
The statistical accuracy assessment was performed using independent test samples. The six most common statistical criteria, all extracted from the confusion matrix of the classification, were utilized to evaluate classification results. These criteria were overall accuracy (OA), user accuracy (UA), producer accuracy (PA), Kappa coefficient (KC), omission error (OE), and commission error (CE).
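The six criteria can be derived from a confusion matrix as in the following sketch. It assumes that rows hold the reference (true) classes and columns hold the predicted classes; the function name and the three-class matrix are hypothetical and are not results from this study.

```python
def accuracy_metrics(cm):
    """OA, Kappa, PA, UA, omission and commission errors from a confusion matrix.

    Assumed layout: cm[i][j] = number of samples of reference class i
    that were assigned to class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]                              # reference totals
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]   # predicted totals
    diag = [cm[i][i] for i in range(n)]

    oa = sum(diag) / total
    # Chance agreement expected from the marginal totals.
    pe = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (oa - pe) / (1 - pe)
    pa = [d / r if r else 0.0 for d, r in zip(diag, row_sums)]  # producer accuracy
    ua = [d / c if c else 0.0 for d, c in zip(diag, col_sums)]  # user accuracy
    return {"OA": oa, "KC": kappa, "PA": pa, "UA": ua,
            "OE": [1 - p for p in pa],   # omission error
            "CE": [1 - u for u in ua]}   # commission error


# Hypothetical confusion matrix for three classes.
cm = [[50, 2, 3],
      [4, 40, 1],
      [1, 2, 47]]
metrics = accuracy_metrics(cm)
print(metrics["OA"], metrics["KC"])
```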

Comparison with Other Classification Methods
Machine learning and deep learning-based methods have been widely applied to crop mapping [92,93]. RF and XGBOOST are the most common machine learning methods and have been widely used in many crop mapping applications based on time-series datasets [93,94]. This research implemented these two machine learning-based methods to evaluate their efficiency in comparison with deep learning-based methods. Thus, six different classifiers, including two commonly used machine learning algorithms (i.e., RF [94] and XGBOOST [95]) and four deep learning models (i.e., recurrent-convolutional neural network (R-CNN) [49], 2D-CNN [47], 3D-CNN [47], and convolutional block attention module (CBAM)), were also implemented to produce a more comprehensive evaluation of the performance of the proposed model. R-CNN, developed by Mazzia, Khaliq and Chiaberge [49], combines LSTM cells and 2D convolution layers for crop mapping based on time-series datasets. Moreover, CBAM [82] combines channel attention and spatial attention after each convolution layer, wherein the channel attention block is employed after the spatial attention block. The inputs of the RF and XGBOOST algorithms were spectral-temporal features with the size of 1 × 13, where 1 and 13 refer to the spectral (i.e., NDVI) and temporal information, respectively. Moreover, the input datasets of the deep learning-based methods were spatial-spectral-temporal information with the size of 11 × 11 × 1 × 13, where 11 × 11 was the width and length of the spatial information, 1 was the spectral information (i.e., NDVI), and 13 was the temporal information. It is worth noting that the size of the spatial information for the deep learning-based methods was determined by trial and error. The patch data were generated by moving a window with the size of 11 × 11 over the image, and the label of each patch corresponded to the central pixel of the patch.
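The moving-window patch generation can be sketched as follows. The function name `extract_patches` is hypothetical, and skipping border pixels where the window does not fit is an assumption, since the border handling is not described.

```python
def extract_patches(image, labels, size=11):
    """Cut size x size patches and label each one by its central pixel.

    image:  H x W x T nested list (T = number of NDVI dates).
    labels: H x W nested list of class labels.
    Border pixels where the window does not fit are skipped (an assumption).
    """
    half = size // 2
    h, w = len(image), len(image[0])
    samples = []
    for i in range(half, h - half):
        for j in range(half, w - half):
            patch = [row[j - half:j + half + 1]
                     for row in image[i - half:i + half + 1]]
            samples.append((patch, labels[i][j]))
    return samples


# Hypothetical 13 x 13 scene with 13 NDVI dates and a unique label per pixel.
image = [[[0.0] * 13 for _ in range(13)] for _ in range(13)]
labels = [[i * 13 + j for j in range(13)] for i in range(13)]
samples = extract_patches(image, labels)
print(len(samples), samples[0][1])
```

For RF and XGBOOST, only the 13-element NDVI vector of the central pixel would be used, which corresponds to the 1 × 13 input described above.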

Parameter Setting
The proposed method and other classifiers have several parameters that need to be set. As described in the Method Section, the optimum values of the parameters for each classifier were determined based on several trial and error attempts (see Table 3). All parameters of the deep learning-based methods were set identically. It is worth noting that the selection of some of these parameters depended on the processing system.

Classification Results
The results of crop mapping based on the proposed deep learning method along with other algorithms are illustrated in Figure 7. A high-resolution image from the study area is also provided in Figure 7a for comparison purposes. The results showed that the map produced using XGBOOST (Figure 7b) included salt and pepper errors. Furthermore, the RF classifier (Figure 7c) could not delineate different classes with a high level of accuracy.
In general, deep learning methods (Figure 7d-h), including the proposed method (Figure 7h), produced better results compared to the XGBOOST and RF models. However, there were still several wrongly classified pixels in the results of the deep learning methods, especially those of the R-CNN and 2D-CNN methods.
Figure 8 shows the confusion matrices of the proposed and all implemented classification methods. Generally, the proposed deep learning method resulted in the lowest confusion between the classes, indicating its high potential for accurate crop type mapping. Among non-agricultural classes, the barren and built-up classes showed considerable confusion with other classes, except broad bean. Furthermore, water had the lowest mixing with other classes. Overall, most confusions occurred between the arboretum, barren, built-up, barley, and wheat classes. These confusions were much higher for the XGBOOST, RF, and R-CNN algorithms compared to other methods. For example, the highest confusion was between the barren and built-up classes (11,918 pixels) using the RF algorithm. The R-CNN algorithm also resulted in relatively high confusion between built-up/barren and barren/wheat. However, other deep learning algorithms, especially the proposed method, had higher accuracies. Among the 2D-CNN, 3D-CNN, and CBAM deep learning algorithms, the highest confusion was observed between barley and wheat using the 2D-CNN algorithm. The RF, XGBOOST, 2D-CNN, and R-CNN classification methods could not discriminate the broad bean class from other classes, mainly due to the lower number of its samples compared to other classes.
The statistical accuracy assessment of the crop maps using different accuracy measures is also summarized in Table 4. Regarding the non-deep learning algorithms, the RF classifier provided the lowest performance (OA = 74% and KC = 0.68), while XGBOOST provided a satisfactory result (OA = 87% and KC = 0.84). However, all the deep learning methods, except the R-CNN algorithm, achieved an OA of more than 90%. In particular, the proposed method provided the highest accuracy in mapping crops with an OA and KC of 98.5% and 0.98, respectively.
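For reference, the overall accuracy (OA) and Kappa coefficient (KC) reported above can be derived from a confusion matrix as in the following sketch. The function names are illustrative, and the matrix in the usage example is a toy example, not data from this study.

```python
def overall_accuracy(cm):
    """Fraction of correctly classified samples (the diagonal of the matrix)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def kappa_coefficient(cm):
    """Cohen's kappa: agreement beyond what is expected by chance."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(n)) / total
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(cm[r][c] for r in range(n)) for c in range(n)]
    # Chance agreement from the product of the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(n)) / total ** 2
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    toy_cm = [[50, 10],
              [5, 35]]  # toy two-class matrix, not from the paper
    print(f"OA = {overall_accuracy(toy_cm):.2f}, KC = {kappa_coefficient(toy_cm):.2f}")
```

Because KC subtracts the agreement expected by chance, it is lower than OA for the same matrix, which is consistent with the OA/KC pairs reported in Table 4.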

Impacts of the Time-Series NDVI on the Classification Results
Crop type classification provides useful information before agricultural products are harvested. This information can be obtained accurately by employing time-series NDVI datasets in a growing season. In this regard, the sensitivity of the classification to the number of NDVI datasets used was investigated in this study. Figure 9 and Table 5 present the calculated confusion matrices and accuracy measures when different NDVI datasets were employed for crop type mapping using the proposed deep learning method. It was observed that using more NDVI images (i.e., the seven-month NDVI dataset) resulted in higher classification accuracies and lower confusion between different classes, closely followed by using the six-month NDVI datasets. As is clear from Figure 9, adding more NDVI images to the proposed method steadily reduced the uncertainties and confusion between the classes. For instance, the total interchangeable confusion between wheat and canola was continuously reduced by nearly 30% (i.e., from 800 wrongly classified pixels to 87 incorrectly classified pixels) when incorporating more time-series NDVI datasets. Furthermore, barley and wheat had the lowest confusion (113 pixels) when employing seven months of NDVI datasets, while the highest confusion (304 pixels) was associated with using two months of NDVI datasets. Overall, as is clear from Table 5, the highest accuracies were obtained when seven months of NDVI datasets were utilized.
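The "total interchangeable confusion" between two classes can be read as the sum of the two off-diagonal entries that link them in the confusion matrix. The sketch below follows that reading, which is an interpretation rather than a definition given by the authors; the function names and the toy matrix are hypothetical.

```python
def mutual_confusion(cm, class_a, class_b):
    """Pixels of class_a labelled as class_b plus pixels of class_b labelled as class_a."""
    return cm[class_a][class_b] + cm[class_b][class_a]


def most_confused_pair(cm):
    """Return (class_a, class_b, count) for the pair with the largest mutual confusion."""
    n = len(cm)
    pairs = ((mutual_confusion(cm, a, b), a, b)
             for a in range(n) for b in range(a + 1, n))
    count, a, b = max(pairs)
    return a, b, count


if __name__ == "__main__":
    toy_cm = [[90, 7, 3],
              [5, 80, 15],
              [1, 20, 79]]  # toy three-class matrix, not from the paper
    print(mutual_confusion(toy_cm, 0, 1))
    print(most_confused_pair(toy_cm))
```

Since the measure is symmetric, it does not depend on whether rows hold the reference labels or the predicted labels.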

Ablation Analysis
Ablation analysis is a crucial step for evaluating the contribution of different components of an artificial intelligence method. Its main purpose is to gain insight into the effect of removing a part of the system on the general performance of the model. In this study, we used ablation analysis to investigate the efficiency of the proposed crop type mapping framework through three scenarios (S): (S#1): without AM, (S#2): without spectral attention block, and (S#3): without spatial attention block. The results of these scenarios were also compared with the proposed method when all the functions were used (i.e., S#4). Figure 10 shows the confusion matrices of the four scenarios of the ablation analysis. Although the obtained classification results were relatively similar, the results indicated the higher potential of the proposed method empowered with the AM mechanism, especially in comparison to S#1. For example, the proposed architecture considerably reduced the confusion between barley and wheat, which was over 1000 pixels in S#1 and reached 113 in the proposed architecture. Moreover, the proposed method successfully reduced the slight mutual confusion between canola/alfalfa and arboretum/alfalfa. Finally, the effect of spatial attention on the classification results was greater than that of spectral attention.
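The four scenarios can be summarized as a small configuration table. The sketch below only encodes which attention blocks are active in each scenario; the class and field names are hypothetical and do not come from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationScenario:
    name: str
    use_spectral_attention: bool
    use_spatial_attention: bool


# Scenario definitions as described in the text (S#4 is the full proposed method).
SCENARIOS = [
    AblationScenario("S#1", use_spectral_attention=False, use_spatial_attention=False),
    AblationScenario("S#2", use_spectral_attention=False, use_spatial_attention=True),
    AblationScenario("S#3", use_spectral_attention=True, use_spatial_attention=False),
    AblationScenario("S#4", use_spectral_attention=True, use_spatial_attention=True),
]


def active_blocks(scenario):
    """List the attention blocks that remain enabled in a scenario."""
    blocks = []
    if scenario.use_spectral_attention:
        blocks.append("spectral")
    if scenario.use_spatial_attention:
        blocks.append("spatial")
    return blocks


if __name__ == "__main__":
    for scenario in SCENARIOS:
        print(scenario.name, active_blocks(scenario) or "no attention")
```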

Accuracy
In this study, a new crop mapping framework was proposed using Sentinel-2 time-series NDVI datasets. The results of crop mapping using different classifiers showed that the deep learning-based methods had relatively high potential. For example, the statistical methods (i.e., RF and XGBOOST) provided accuracies lower than 87%, while deep learning methods generally produced crop maps with more than 95% accuracy. Overall, the proposed method had the lowest errors in terms of OE and CE (under 5% in almost all classes).
Imbalanced reference samples are among the common problems in supervised learning frameworks [96,97]. Due to several limitations in this study, the size of the reference samples was not balanced for all classes. For example, the broad bean class had the lowest number of reference samples (i.e., 72 pixels). Nevertheless, the proposed method was able to classify this class with a UA of more than 76% and a PA of 100%. This indicated the robustness of the proposed network against imbalanced reference samples. Figure 11 shows zoomed-in patches of the classified results using the proposed method. Based on the results, the proposed method provided promising results for both crop and non-crop class types. For instance, the proposed method accurately delineated built-up areas with very few misclassifications. Additionally, the proposed method correctly classified arboretum areas in Figure 11c,d.
Figure 11. Comparison of the results of crop mapping (b,d,f) using the proposed method with very high resolution (VHR) imagery (a,c,e) in different areas. The left column is VHR imagery, and the right column is classified maps with the background of the VHR imagery.
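The per-class measures used in this section (UA, PA, OE, and CE) can be computed from a confusion matrix as sketched below. The row/column convention (rows as reference, columns as predicted) is an assumption, since the paper's matrix orientation is not stated in this section; the function name and the toy matrix are illustrative only.

```python
def per_class_accuracy(cm):
    """Return {class index: {"PA", "UA", "OE", "CE"}} from a confusion matrix.

    Assumes rows hold reference (ground truth) labels and columns hold
    predicted labels. PA = producer's accuracy, UA = user's accuracy,
    OE = omission error (1 - PA), CE = commission error (1 - UA).
    """
    n = len(cm)
    results = {}
    for i in range(n):
        reference_total = sum(cm[i])
        predicted_total = sum(cm[r][i] for r in range(n))
        pa = cm[i][i] / reference_total if reference_total else float("nan")
        ua = cm[i][i] / predicted_total if predicted_total else float("nan")
        results[i] = {"PA": pa, "UA": ua, "OE": 1 - pa, "CE": 1 - ua}
    return results


if __name__ == "__main__":
    toy_cm = [[8, 2],
              [1, 9]]  # toy two-class matrix, not from the paper
    for cls, scores in per_class_accuracy(toy_cm).items():
        print(cls, {name: round(value, 3) for name, value in scores.items()})
```

Under these definitions, a PA of 100% with a UA of about 76%, as reported for broad bean, would mean that no reference broad bean pixels were omitted, while roughly a quarter of the pixels labelled as broad bean belonged to other classes.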

Sensitivity Analysis
The effect of the number of NDVI datasets on the crop classification was also investigated in this study (see Section 4.3). Based on Table 5, the lowest accuracy was obtained with the two-month NDVI datasets (OA = 96%), and the highest accuracy was obtained when the NDVI datasets of all months were used (OA = 98.5%). As a result, although the agricultural crops could be detected with the NDVI datasets after two months of planting, adding NDVI datasets from other months of the growing season could potentially improve the accuracy. Table 6 shows the performance of the proposed method compared to other state-of-the-art deep learning methods.

Proposed Architecture and Deep Feature Extraction
Informative feature extraction is one of the most critical factors in classification tasks. Such features can be obtained by combining spectral and spatial features. The results of pixel-based crop type mapping based on the RF and XGBOOST algorithms showed that these methods had lower capability than deep learning-based approaches, mainly because they employed only spectral features. This indicated the impact of extracting informative spatial features for accurate crop type classification.
Suitable architecture is a key factor for extracting deep features based on CNN methods. In this regard, we designed a new framework for extracting deep features based on multiscale-residual block convolutions. Furthermore, spectral and spatial AMs were implemented to increase the efficiency of the proposed framework [89]. The results of crop type mapping demonstrated the high capability of the proposed algorithm to extract informative deep features, which could enhance the performance of the proposed method compared to other advanced crop mapping techniques.
The stability of deep learning-based methods is the most important factor in classification. To this end, the efficiency of the proposed method was evaluated in ten different epochs, the results of which are provided in Table 7. Based on the results, the proposed method had high stability in different runs because the OA did not considerably change (i.e., 98.49 ± 0.04).
Although semantic segmentation-based methods, such as DeepLabV3+ and U-Net, have achieved promising results in crop mapping [99,103], they require a large number of labeled samples. This is because all pixels of the image dataset must be labeled through field visits, which is time-consuming and resource-intensive. In contrast, the proposed method required nearly 7000 training samples, the collection of which was feasible in comparison with semantic segmentation-based methods.
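A stability figure of the form "mean ± deviation" can be obtained from repeated evaluations as sketched below. The OA values in the usage example are placeholders, not the values from Table 7, and the use of the sample standard deviation is an assumption because the paper does not state which estimator was used.

```python
from statistics import mean, stdev


def summarize_runs(oa_values):
    """Return the mean and sample standard deviation of OA over repeated runs."""
    return mean(oa_values), stdev(oa_values)


def format_summary(oa_values):
    """Format the summary in the 'mean ± std' style used in the text."""
    average, spread = summarize_runs(oa_values)
    return f"{average:.2f} ± {spread:.2f}"


if __name__ == "__main__":
    placeholder_oa = [98.0, 99.0, 100.0]  # placeholder values, not from Table 7
    print(format_summary(placeholder_oa))
```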
The AM increases the performance of deep learning methods in processing tasks [79,81]. The CBAM is the most well-known attention block among the different types of AMs [82]. Based on the results, the proposed AM outperformed the CBAM mechanism in all classes. This indicated the high potential of the proposed AM for extracting deep features. In fact, the AM improved the accuracy of the proposed deep learning framework by focusing the network on informative deep feature extraction.

Conclusions
Timely and accurate crop mapping is one of the most important components for managing and making decisions to support food security. In this regard, this study presented a novel deep learning-based technique for crop type mapping. We evaluated the efficiency of the proposed method on six crop and three non-crop classes. This research used time-series NDVI for mapping crop types mainly because of the dynamic nature of crops. The results of crop type mapping were also compared with other advanced supervised learning techniques. The statistical and visual analyses indicated that the proposed deep learning model produced excellent performance in comparison to different state-of-the-art classification methods. Furthermore, the efficiency of the proposed AM was demonstrated in the crop type classification task, as it achieved higher classification accuracy than the CBAM architecture. Moreover, we assessed the efficiency of the proposed method using different NDVI datasets and observed its high potential, as it achieved high accuracies with different NDVI datasets (i.e., OA = 96% to 98%). The highest accuracy was obtained when the seven-month NDVI datasets were employed.