Improvement in Land Cover and Crop Classification based on Temporal Features Learning from Sentinel-2 Data Using Recurrent-Convolutional Neural Network (R-CNN)

The increasing spatial and temporal resolution of globally available satellite images, such as those provided by Sentinel-2, creates new possibilities for researchers to use freely available multi-spectral optical images, with decametric spatial resolution and frequent revisits, for remote sensing applications such as land cover and crop classification (LC&CC), agricultural monitoring and management, and environment monitoring. Existing solutions dedicated to cropland mapping can be categorized as pixel-based or object-based. However, the task remains challenging when many classes of agricultural crops are considered at a massive scale. In this paper, a novel and optimized deep learning model for pixel-based LC&CC is developed and implemented, based on Recurrent Neural Networks (RNNs) in combination with Convolutional Neural Networks (CNNs), using multi-temporal Sentinel-2 imagery of the central-north part of Italy, which has a diverse agricultural system dominated by economically important crop types. The proposed methodology is capable of automated feature extraction by learning the time correlation of multiple images, which reduces manual feature engineering and the modeling of crop phenological stages. Fifteen classes, including major agricultural crops, were considered in this study. We also tested other widely used traditional machine learning algorithms for comparison, such as support vector machine (SVM), random forest (RF), kernel SVM, and gradient boosting machine (XGBoost). The overall accuracy achieved by our proposed Pixel R-CNN was 96.5%, a considerable improvement over existing mainstream methods. This study showed that the Pixel R-CNN-based model offers a highly accurate way to assess and employ time-series data for multi-temporal classification tasks.


Introduction
The significant increase in population around the globe demands an increase in agricultural productivity; thus, precise land cover and crop classification and the spatial distribution of various crops are becoming significant for governments, policymakers, and farmers to improve the decision-making process for managing agricultural practices and needs. Crop maps are produced at relatively large scales, ranging from global [Wu(2014)] and countrywide [Jin(2019)] to local level [Yang(2007)]. The growing role of agriculture in the management of sustainable natural resources makes the development of effective cropland mapping and monitoring essential [Matton(2015)]. The Group on Earth Observations (GEO), with its Integrated Global Observing Strategy (IGOS), also emphasizes an operational system for monitoring global land cover and mapping the spatial distribution of crops by using remote sensing imagery. Spatial information from crop maps has been the main source for crop growth monitoring [Huang(2015), Battude(2016), Guan(2016)], water resources management [Toureiro(2017)], and decision making for policymakers to ensure food security [Yu(2017)]. Satellite and Geographic Information System (GIS) data have been an important source in establishing and improving the current systems responsible for developing and maintaining land cover and agricultural maps [Boryan(2011)]. Freely available satellite data offers one of the most widely applied sources for mapping agricultural land and assessing important indices that describe the conditions of crop fields [Xie(2008)]. The recently launched Sentinel-2 is equipped with a multispectral imager that can provide up to 10 m per pixel spatial resolution with a revisit time of 5 days, which offers a great opportunity to be exploited in the remote sensing domain.
Multispectral time series data acquired from MODIS and Landsat have been widely used in many agricultural applications, such as crop yield prediction [Johnson(2016)], land cover and crop classification [Senf(2015)], leaf area index estimation [Jin(2016)], plant height estimation [Liaqat(2017)], vegetation variability assessment [Senf(2017)], and many more. Two different data sources can also be used together to extract more features and improve results; for example, Landsat-8 and Sentinel-1 have been used together for LC&CC [Kussul(2016)]. There are supervised and unsupervised algorithms for mapping cropland using mono- or multi-temporal images [Xiong(2017), Yan(2015)]. Multi-temporal images have already proven to achieve better performance than mono-temporal mapping methods [Xiao(2018)]. Imagery covering only the key phenological stages has proved sufficient for crop area estimation [Gallego(2008), Khaliq(2018)]. It has also been found in [Zhou(2013)] that reducing the time-series length affects the average accuracy of the classifier. Crop patterns were established using the Enhanced Vegetation Index derived from 250 m MODIS-Terra time series data and used to classify some major crops, such as corn, cotton, and soybean, in Brazil [Arvor(2011)]. Centimetric resolution imagery is available only at the high price of commercial satellite imagery, or through extensive UAV flight campaigns covering a large area during the whole crop cycle to obtain better spatial and temporal detail. Consequently, most studies have used moderate spatial resolution (10-30 m) freely available satellite imagery for land cover mapping due to its high spectral and temporal resolution, which is difficult to achieve with UAVs and high-resolution satellite imagery. Other than multispectral time series data, several vegetation indices (VIs) derived from different spectral bands have been exploited to enrich the feature space for vegetation assessment and monitoring [Zhong(2012), Wardlow(2012)].
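Among these indices, NDVI is derived from the red and near-infrared bands as (NIR − Red)/(NIR + Red). A minimal sketch of this computation, using illustrative reflectance values rather than real scene data:

```python
import numpy as np

def ndvi(red, nir, eps=1e-10):
    """NDVI = (NIR - Red) / (NIR + Red), guarded against division by zero."""
    red = np.asarray(red, dtype=np.float64)
    nir = np.asarray(nir, dtype=np.float64)
    return (nir - red) / (nir + red + eps)

# Illustrative surface reflectances for a 2x2 pixel patch (not real data).
red = np.array([[0.10, 0.20], [0.05, 0.30]])
nir = np.array([[0.50, 0.40], [0.45, 0.30]])
print(ndvi(red, nir))  # values lie in [-1, 1]; dense vegetation scores high
```

The resulting layer can be stacked with the spectral bands as an extra feature channel per time step.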
VIs such as the normalized difference vegetation index (NDVI), normalized difference water index (NDWI), and enhanced vegetation index (EVI), textural features such as the grey level co-occurrence matrix (GLCM), and statistical features such as mean, standard deviation, and inertial moment are the features most frequently used for crop classification. It is also possible to increase the accuracy of the algorithms using ancillary data such as elevation, census data, road density, or coverage. Nevertheless, all these derived features, along with the phenological metrics, involve a huge volume of data, which may increase computational complexity with little improvement in accuracy [Löw(2013)]. Several feature selection methods have been proposed [Löw(2013)] to deal with this problem. In [Hao(2015)], various features were derived from MODIS time series, and the best feature selection was made using the random forest algorithm. LC&CC methods can also be classified as pixel-based or object-based. Object-based image analysis (OBIA), described by Blaschke, segments satellite images into homogeneous image objects, which can be achieved with high-resolution sensors [Blaschke(2010)]. Various object-based classification approaches have been proposed to produce crop maps using satellite imagery [Novelli(2016), Long(2013), Li(2015)]. In this work, we propose a unique deep neural network architecture for LC&CC, which comprises a Recurrent Neural Network (RNN) that extracts temporal correlations from time series of Sentinel-2 data, in combination with a Convolutional Neural Network (CNN) that analyzes and encapsulates crop patterns through its filters. The remainder of this paper is organized as follows. Section II reviews related work on LC&CC along with an overview of RNNs and CNNs. Section III provides an overview of the raw data collected and exploited during the research. Section IV provides detailed information on the proposed model and the training strategies.
Section V contains a complete description of the experiments, results, and discussion, along with a comparison with previous state-of-the-art results. Finally, Section VI draws some conclusions. Various studies have been proposed in the past to address LC&CC. A common approach for classification tasks is to extract temporal features and phenological metrics from VI time series derived from remotely sensed imagery. There are also simple statistics and threshold-based procedures used to calculate vegetation-related metrics, such as maximum VI and time of peak VI [Walker(2014), Walker(2015)], which have improved classification accuracy compared to using only the VI as a feature [Simonneaux(2008)]. More complex methods have been adopted to extract temporal features and patterns addressing vegetation phenology [Shao(2016)]. Further, VI time series have been represented by sets of functions [Galford(2008)], linear regression [Funk(2009)], Markov models [Siachalou(2015)], and curve-fitting functions. The sigmoid function has been exploited by [Xin(2015), Xin(2016)], achieving better results due to its robustness and the ease of deriving phenological features for the characterization of vegetation variability [Dannenberg(2015)]. Although the above-mentioned methods of temporal feature extraction offer many alternatives and flexibility in deployment for assessing vegetation dynamics, in practice there are important factors, such as manually designed models and feature extraction, intra-class variability, uncertain atmospheric conditions, and empirical seasonal patterns, which make the selection of such methods difficult. Thus, an appropriate approach is needed to fully utilize the sequential information of VI time series to extract temporal patterns. As our proposed DNN architecture is based on pixel classification, the following subsections provide the relevant studies and background.

Pixel based crops classification
A detailed review of state-of-the-art supervised pixel-based methods for land cover mapping was performed in [Khatami(2016)]. It was found that the support vector machine (SVM) for mono-temporal image classification was the most efficient, with an overall accuracy (OA) of about 75%. The second-best approach was a neural network (NN)-based classifier, with almost the same OA of 74%. However, SVM is complex and resource-consuming for time series multispectral data applications with broad-area classification. Another common approach in remote sensing applications is the random forest (RF)-based classifier [Gislason(2006)]. Nevertheless, multiple features should be derived to feed the RF classifier for effective use. Deep Learning (DL) is a branch of machine learning and a powerful tool that is widely used to solve a wide range of problems related to signal processing, computer vision, image processing, image understanding, and natural language processing [LeCun(2015)]. The main idea is to discover not only the mapping from representation to output but also the representation itself. That is achieved by breaking a complex problem into a series of simple mappings, each described by a different layer of the model, and then composing them in a hierarchical fashion. A large number of state-of-the-art models, frameworks, architectures, and benchmark databases of reference imagery exist for the image classification domain.
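As a rough illustration of the classical pixel-based baselines discussed above, a pixel's time series can be flattened into a 1-D feature vector and fed to an RF or kernel SVM classifier. The shapes, random data, and class count below are illustrative stand-ins, not the paper's dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_train = rng.random((300, 45))     # 9 time steps x 5 bands, flattened
y_train = rng.integers(0, 3, 300)   # 3 illustrative classes
X_test = rng.random((100, 45))
y_test = rng.integers(0, 3, 100)

for name, clf in [("RF", RandomForestClassifier(n_estimators=100, random_state=0)),
                  ("kernel SVM", SVC(kernel="rbf"))]:
    clf.fit(X_train, y_train)
    oa = accuracy_score(y_test, clf.predict(X_test))  # overall accuracy (OA)
    print(f"{name}: OA = {oa:.3f}")
```

Flattening discards the temporal ordering, which is exactly the limitation the RNN-based approach proposed later is meant to address.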

Recurrent neural network (RNN)
Sequence data analysis is an important aspect of many domains, ranging from natural language processing, handwriting recognition, and image captioning to robot automation. In recent years, Recurrent Neural Networks (RNNs) have proven to be a fundamental tool for sequence learning, allowing information to be represented from context windows of hundreds of elements. Moreover, the research community has, over the years, come up with different techniques to overcome the difficulty of training over many time steps. For example, long short-term memory (LSTM) [Sutskever(2000)] and gated recurrent unit (GRU) [Chung(2014)] based architectures have achieved ground-breaking results [Chung(2015), Bahdanau(2014)] in comparison to standard RNN models. In remote sensing applications, RNNs are commonly used when sequential data analysis is needed. For example, Lyu et al. [Lyu(2016)] employed an RNN to exploit sequential properties such as the spectral correlation and intra-band variability of multispectral data. They further used an LSTM model to learn a joint spectral-temporal feature representation from an image pair acquired at two different dates for change detection [Lyu(2018)].

Convolutional neural network (CNN)
Convolutional Neural Networks (CNNs) date back decades [LeCun(1998)], emerging from the study of the brain's visual cortex [Hubel(1962)] and classical concepts of computer vision theory [Hubel(2010)], [Hubel(2015)]. Since the 1990s, they have been applied successfully to image classification [LeCun(1998)]. However, due to technical constraints, mainly the lack of hardware performance, the large amount of data required, and theoretical limitations, CNNs did not scale to large applications. Nevertheless, Geoffrey Hinton and his team demonstrated at the annual ImageNet ILSVRC competition [Krizhevsky(2012)] the feasibility of training large architectures capable of learning several layers of features with increasingly abstract internal representations [Zeiler(2014)]. Since that breakthrough achievement, CNNs have become the ultimate symbol of the Deep Learning revolution [LeCun(2015)], embodying the concepts that underpin the entire novel movement. In recent years, DL has been widely used in data mining and remote sensing applications. In particular, image classification studies have exploited several DL architectures due to their flexibility in feature representation and automation capability for end-to-end learning. In DL models, features can be extracted automatically for classification tasks, without feature crafting algorithms, by integrating autoencoders [Wan(2017), Mou(2017)]. 2D CNNs have been broadly used in remote sensing studies to extract spatial features from high-resolution images for object detection and image segmentation [Kampffmeyer(2016), Kampffmeyer(2017), Audebert(2018)]. In crop classification, 2D convolution in the spatial domain performed better than 1D convolution in the spectral domain [Kussul(2017)]. These studies stacked multiple convolutional layers to extract spatial and spectral features from remotely sensed imagery.

Study area and data
The study site near Carpi, Emilia-Romagna, situated in the center-north part of Italy with central coordinates 44°47′01″N, 10°59′37″E, was considered for LC&CC, as shown in Figure 1. The Emilia-Romagna region is one of the most fertile plains of Italy. An area of almost 2640 km² was considered, which covers diverse cropland. The major crops in this region are maize, lucerne, barley, wheat, and vineyards. The yearly averaged temperature and precipitation for this region are 14 °C and 843 mm. Most of the farmers practice single cropping in this area. To characterize the spatial distribution of crops, we studied the Land Use/Cover Area frame statistical Survey (LUCAS) in depth and extracted all the information needed for ground truth data. LUCAS was carried out by Eurostat to monitor agriculture, climate change, biodiversity, forests, and water for almost all of Europe [Kussul(2017)].
The technical reference document of LUCAS-2015 was used to prepare the ground truth data. Microdata containing the spatial information of crops and several land cover types, along with geo-coordinates for the considered region, was imported into Quantum Geographic Information System (QGIS), an open-source software package used for the visualization, editing, and analysis of geographical data. Pixels were selected manually by overlapping the images and the LUCAS data, so that a proper amount of ground truth pixels was extracted for training and testing the algorithm. The Sentinel-2 mission consists of twin polar-orbiting satellites launched by the European Space Agency (ESA) in 2015 and can be used in various application areas, such as land cover change detection, natural disaster monitoring, forest monitoring, and, most importantly, agricultural monitoring and management [Kussul(2017)].
It is equipped with multi-spectral optical sensors that capture 13 bands of different wavelengths. We used only the high-resolution bands with 10 m/pixel resolution, shown in Table 1. It also has a high revisit time (ten days at the equator with one satellite and five days with the twin satellites Sentinel-2A and Sentinel-2B). It has become popular in the remote sensing community because it combines various key features: free access to data products, available at the ESA Sentinel Scientific Data Hub, a reasonable spatial resolution (10 m for the red, green, blue, and near-infrared bands), a high revisit time, and a reasonable spectral resolution compared with other available free data sources. In our study, we used ten multi-temporal Sentinel-2 images, reported in Table 2, which are well co-registered, spanning July 2015 to July 2016 with close to zero cloud coverage. The initial image selection was performed based on the cloudy pixel contribution at the granule level. This pre-screening was followed by further visual inspection of the scenes and resulted in a multi-temporal layer stack of ten images. The Sentinel Application Platform (SNAP) v5.0 along with Sen2Cor v2.5.1 were used to apply radiometric and geometric corrections to obtain Bottom of Atmosphere (BOA) Level 2A images from Top of Atmosphere (TOA) Level 1C products. Further details about the geometric and radiometric correction algorithms used in Sen2Cor can be found in [Kaufman(1988)]. The bands with 10 m/pixel resolution, along with the derived Normalized Difference Vegetation Index (NDVI), were used for the experiments, as shown in Table 1.

Convolutional and recurrent neural networks for pixel-based crops classification

Formulation
A single multi-temporal, multi-spectral pixel can be represented as a two-dimensional matrix X^(i) ∈ R^(t×b), where t and b are the number of time steps and spectral bands, respectively. Our goal is to compute from X^(i) a probability distribution F(X^(i)) consisting of K probabilities, where K is equal to the number of classes. In order to achieve this objective, we propose a compact representation learning architecture composed of three main building blocks:
• Time correlation representation - this operation extracts temporal correlations from multi-spectral, temporal pixels X^(i), exploiting a sequence-to-sequence recurrent neural network based on Long Short-Term Memory (LSTM) cells. A final Time-Distributed layer is used to compress and maintain a sequence-like structure, preserving the multidimensional nature of the data. In this way, it is possible to take advantage of temporal and spectral correlations simultaneously.
• Temporal pattern extraction - this operation consists of a series of convolutional operations followed by rectifier activation functions that non-linearly map the elaborated temporal and spectral patterns onto high-dimensional representations. So, the RNN output temporal sequences are processed by a subsequent cascade of filters which, in a hierarchical fashion, extract the essential features for the successive stage.
• Multiclass classification - this final operation maps the feature space to a probability distribution F(X^(i)) with K different probabilities, where K, as previously stated, is equal to the number of classes.
The comprehensive pipeline of operations constitutes a lightweight, compact architecture able to non-linearly map multi-temporal information with its intrinsic nature, achieving results considerably better than previous state-of-the-art solutions. Human brain mental imagery studies [Mellet(1996)], where images are a form of internal neural representation, inspired the presented architecture. Moreover, the joint effort of RNN and CNN distributes the knowledge representation through the entire model, exploiting one of the most powerful characteristics of deep learning, known as distributed learning. An overview of the overall model, dubbed Pixel R-CNN, is depicted in Fig. 3. Each pixel is extracted simultaneously from all images taken at the different time steps t with all its spectral bands b. In this way, it is possible to create an instance X^(i), which feeds the first layer of the network. Firstly, the model extracts temporal representations from the input sample. Subsequently, these temporal features are further enriched by the convolutional layers, which extract patterns in a hierarchical manner. The overall model acts as a function F(X^(i)) that maps the input sample to its K related probabilities. So, by evaluating the probability distribution, it is possible to identify the class the input sample belongs to.
It is worth noticing that this model is depicted unrolled through time. Indeed, only after all time steps have been processed is the CNN able to analyze and transform the temporal pattern. In the following subsections, we describe each individual block in detail.
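The three building blocks can be sketched in Keras roughly as follows, using the layer sizes reported later in the experimental settings (t = 9, b = 5, K = 15, 32 LSTM units, 16 and 32 convolutional filters with sizes 3 and 7). Note that Keras' stock LSTM is used here in place of the peephole variant, so this is only an approximation of the actual model:

```python
import tensorflow as tf
from tensorflow.keras import layers

t, b, K = 9, 5, 15
model = tf.keras.Sequential([
    tf.keras.Input(shape=(t, b)),
    layers.LSTM(32, return_sequences=True),   # time correlation representation
    layers.Dropout(0.2),                      # regularization used in training
    layers.TimeDistributed(layers.Dense(9)),  # -> (9, 9) temporal "canvas"
    layers.Reshape((9, 9, 1)),                # add a channel dimension
    layers.Conv2D(16, 3, activation="relu"),  # -> (7, 7, 16)
    layers.Conv2D(32, 7, activation="relu"),  # -> (1, 1, 32)
    layers.Flatten(),
    layers.Dense(K, activation="softmax"),    # multiclass classification
])
model.summary()
```

With these sizes, the stock-LSTM version has 30,936 trainable parameters, in line with the count given later in the experimental settings.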

Time correlation representation
Nowadays, a popular strategy in time series data analysis is the use of RNNs, which have proven excellent in many fields of application over the years. The simplest possible RNN, shown in Fig. 4 and composed of just one layer, looks very similar to a feedforward neural network, except that it also has a connection going backward. Indeed, the layer is not only fed by an input vector x^(i), but it also receives h^(i) (cell state), which is equal to the output of the neuron itself, y^(i). So, at each time step t, this recurrent layer receives an input 1-D array x_t^(i) as well as its own output from the previous time step, y_(t−1)^(i). In general, since the output of a recurrent neuron at time step t is a function of all inputs from previous time steps, it has, intuitively, a sort of memory that influences all successive outputs. In this case, it is straightforward to compute a cell's output, as shown in Eq. (1):

y_(t) = φ(W_x^T x_(t) + W_y^T y_(t−1) + b)    (1)

When a whole layer of recurrent neurons is considered, the scheme of Fig. 4 does not change: simply, all neurons are hidden in the depth dimension. Unfortunately, the basic cell just described suffers from major limitations, most of all the fact that, during training, the gradient of the loss function gradually fades away. For this reason, for the time correlation representation, we adopted a more elaborate cell known as the peephole LSTM unit, see Fig. 5. That is an improved variation of the concept proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber [Sutskever(2000)]. The key idea is that the network can learn what to store in a long-term state c_(t), what to throw away, and what to use for the current state h_(t) and output y_(t), which, as for the basic unit, are equal. That is performed with simple element-wise multiplications working as "valves" for the fluxes of information. Those elements, V_1, V_2 and V_3, are controlled by fully connected (FC) layers that have as input the current input state x_(t) and the previous short-term memory term h_(t−1).
Moreover, for the peephole LSTM cell, the previous long-term state c_(t−1) is added as an input to the FC layers of the forget gate, V_1, and the input gate, V_2. Finally, the current long-term state c_(t) is added as an input to the FC layer of the output gate. All "gate controllers" have sigmoid activation functions (green boxes), whereas tanh is used to process the signals themselves (red boxes). So, to summarize, a peephole LSTM block exchanges three signals at its input and output: besides the standard input state x_(t) and cell output y_(t), c and h are the long- and short-term states, respectively, which the unit, by means of its internal controllers and valves, can feed with useful information. Formally, as for the basic cell seen before, Eqs. (2)-(7) summarize how to compute the cell's long-term state, its short-term state, and its output at each time step for a single instance:

f_(t) = σ(W_xf^T x_(t) + W_hf^T h_(t−1) + w_cf ⊙ c_(t−1) + b_f)    (2)
i_(t) = σ(W_xi^T x_(t) + W_hi^T h_(t−1) + w_ci ⊙ c_(t−1) + b_i)    (3)
g_(t) = tanh(W_xg^T x_(t) + W_hg^T h_(t−1) + b_g)    (4)
c_(t) = f_(t) ⊙ c_(t−1) + i_(t) ⊙ g_(t)    (5)
o_(t) = σ(W_xo^T x_(t) + W_ho^T h_(t−1) + w_co ⊙ c_(t) + b_o)    (6)
y_(t) = h_(t) = o_(t) ⊙ tanh(c_(t))    (7)
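A NumPy sketch of a single peephole step following the gating scheme just described (the forget and input gates peek at c_(t−1), the output gate at c_(t)); the weight names and dimensions are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_lstm_step(x, h_prev, c_prev, p):
    # Forget and input gates see x_t, h_{t-1} and peek at c_{t-1}.
    f = sigmoid(p["Wxf"] @ x + p["Whf"] @ h_prev + p["wcf"] * c_prev + p["bf"])
    i = sigmoid(p["Wxi"] @ x + p["Whi"] @ h_prev + p["wci"] * c_prev + p["bi"])
    g = np.tanh(p["Wxg"] @ x + p["Whg"] @ h_prev + p["bg"])
    c = f * c_prev + i * g                      # long-term state update
    # Output gate peeks at the freshly updated c_t.
    o = sigmoid(p["Wxo"] @ x + p["Who"] @ h_prev + p["wco"] * c + p["bo"])
    h = o * np.tanh(c)                          # short-term state = output y_t
    return h, c

rng = np.random.default_rng(0)
nx, nh = 5, 4                                   # illustrative sizes
p = {k: rng.standard_normal((nh, nx)) for k in ("Wxf", "Wxi", "Wxg", "Wxo")}
p.update({k: rng.standard_normal((nh, nh)) for k in ("Whf", "Whi", "Whg", "Who")})
p.update({k: rng.standard_normal(nh)
          for k in ("wcf", "wci", "wco", "bf", "bi", "bg", "bo")})
h, c = peephole_lstm_step(rng.standard_normal(nx), np.zeros(nh), np.zeros(nh), p)
print(h.shape, c.shape)
```

Iterating this step over the t time steps of a pixel yields the cumulative output sequence consumed by the next stage.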
In conclusion, the multi-temporal, multi-spectral pixel X^(i) is processed by a first layer of LSTM peephole cells, obtaining a cumulative output Y^(i)_(lstm). Finally, a TimeDistributedDense layer is applied, which simply executes a Dense function across every output over time, using the same set of weights and preserving the multidimensional nature of the processed data, Eq. (8). In Fig. 6, a graphical representation of the first layer of the network is presented. LSTM cells extract temporal representations from input samples X^(i) with x_t^(i) ∈ R^(1×b) as columns. The output matrix Y^(i)_(lstm) feeds the subsequent TimeDistributedDense layer.

Figure 6: Pixel R-CNN first layer for time correlation extraction. Peephole LSTM cells extract temporal representations from input instances X^(i) ∈ R^(t×b). The output matrix Y^(i)_(lstm) feeds a TimeDistributedDense layer, which preserves the multidimensional nature of the processed data while extracting multi-spectral patterns.

Temporal patterns extraction
The first set of layers extracts a 2-dimensional tensor Y^(i)_(timeD) for each instance. In the second operation, after a simple reshaping that increases the dimensionality of the input tensor from two to three in order to apply the following operations, we map each of these 3-dimensional arrays Y^(i)_(reshape) into a higher dimensional space. That is accomplished by two convolutional operations, built on top of each other, that hierarchically apply learned filters, gradually extracting more abstract representations. More formally, the temporal pattern extraction is expressed, for the first convolutional layer, as the operation

F_conv1(Y^(i)_(reshape)) = max(0, W_1 * Y^(i)_(reshape) + B_1)

where W_1 and B_1 represent filters and biases, respectively, and '*' is the convolution operation. W_1 contains n_1 filters with kernel dimension f_1 × f_1 × c, where f_1 is the spatial size of a filter and c is the number of input channels. As is common for CNNs, the rectified linear unit (ReLU), max(0, x), has been chosen as the activation function for both layers' units. In Fig. 7, a graphical scheme of this section of the model is depicted. So, summarizing, the output matrix Y of the TimeDistributedDense layer, after adding an extra dimension, feeds a stack of two convolutional layers that progressively reduce the first two dimensions, gradually extracting higher-level representations and generating high-dimensional arrays. Moreover, since the n_1 and n_2 filters are shared across all units, the same operation carried out with similarly-sized dense fully connected layers would require a much greater number of parameters and computational power. Instead, the synergy of RNN and CNN makes it possible to elaborate the overall temporal canvas in an optimized and efficient way.

Multiclass classification
In the last stage of the network, the extracted feature tensor Y^(i)_(conv2), after removing the extra dimensions with a simple flatten operation, is mapped to a probability distribution consisting of K probabilities, where K is equal to the number of classes. This is achieved by a weighted sum followed by a softmax activation function:

p_k = exp(s_k(x)) / Σ_(j=1..K) exp(s_j(x))

where s(x) = W^T y^(i)_(flatten−conv2) + B is a vector containing the scores of each class for the input vector y^(i)_(flatten−conv2), which after the flatten operation is a 1-dimensional array. Weights W and bias B are learned during the training process in such a way as to classify arrays of the high-dimensional space into the K different classes. So, p_k is the estimated probability that the extracted feature vector y^(i)_(flatten−conv2) belongs to class k, given the scores of each class for that instance.
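This final stage amounts to a weighted sum followed by a numerically stable softmax; in the sketch below, the feature vector is a random stand-in for y^(i)_(flatten−conv2):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())  # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
y_flat = rng.standard_normal(32)    # stand-in for the flattened Conv2 features
W = rng.standard_normal((32, 15))   # K = 15 classes
B = np.zeros(15)

s = W.T @ y_flat + B                # class scores s(x)
p = softmax(s)                      # estimated class probabilities p_k
print(int(p.argmax()), float(p.sum()))
```

The predicted class is simply the index of the largest probability.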

Training
Learning the overall mapping function F requires the estimation of all network parameters Θ of the three different model parts. This is achieved by minimizing the loss between each pixel class prediction F(X^(i)) and the corresponding ground truth y^(i) with a supervised learning strategy. So, given a data set with n pixel samples {X^(i)} and the respective set of true classes {y^(i)}, we use categorical cross-entropy as the loss function:

L(Θ) = −(1/n) Σ_(i=1..n) Σ_(k=1..K) y_k^(i) log(p_k^(i))    (11)

where y_k^(i) cancels the loss of all classes except the true one. Equation (11) is minimized using the AMSGrad optimizer [Reddi(2019)], an adaptive learning rate method that modifies the basic ADAM algorithm [Kingma(2019)]. The overall update rule, without the debiasing step, is:

m_t = β_1 m_(t−1) + (1 − β_1) g_t    (12)
v_t = β_2 v_(t−1) + (1 − β_2) g_t²    (13)
v̂_t = max(v̂_(t−1), v_t),   θ_(t+1) = θ_t − (η / √(v̂_t)) m_t    (14)

Equations (12) and (13) are the exponential decay of the gradient and of the squared gradient, respectively. Instead, in Eq. (14), keeping the highest v_t term seen so far results in a much lower learning rate η, fixing the exponential moving average and preventing convergence to a sub-optimal point of the cost function. Moreover, we use a technique known as cosine annealing in order to cyclically vary the learning rate between certain boundary values [Smith(2017)]. The initial value can be obtained with a preliminary training procedure, linearly increasing the learning rate while observing the value of the loss function. Finally, we employ, as the only regularization methodology, "Dropout" [Srivastava(2014)] in the time representation stage, inserted between the LSTM and Time-Distributed layers. This simple tweak allows training a temporal pattern extraction stage that is more robust and resilient to noise. Indeed, forcing the CNN to work without relying on certain temporal activations can greatly improve the abstraction of the generated representations, distributing the knowledge across all available units.
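In Keras terms, this training strategy corresponds roughly to the snippet below: categorical cross-entropy, Adam with the AMSGrad correction, and a cosine learning-rate schedule (Keras' built-in CosineDecay stands in for the cyclical variant described above). The model and data are simplified stand-ins:

```python
import numpy as np
import tensorflow as tf

t, b, K = 9, 5, 15
# Tiny stand-in for the Pixel R-CNN model described above.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(t, b)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(K, activation="softmax"),
])
# Cosine learning-rate schedule; the paper varies the rate cyclically instead.
lr = tf.keras.optimizers.schedules.CosineDecay(initial_learning_rate=1e-3,
                                               decay_steps=1000)
opt = tf.keras.optimizers.Adam(learning_rate=lr, amsgrad=True)  # AMSGrad fix
model.compile(optimizer=opt, loss="categorical_crossentropy",
              metrics=["accuracy"])

rng = np.random.default_rng(0)
X = rng.random((64, t, b)).astype("float32")
Y = tf.keras.utils.to_categorical(rng.integers(0, K, 64), K)
history = model.fit(X, Y, epochs=1, batch_size=32, verbose=0)
```

Setting `amsgrad=True` is exactly the max-over-v̂ correction of Eq. (14) applied on top of Adam's moving averages.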

Experimental Results and discussion
We first processed the raw data in order to create a set of n pixel samples X = {X^(i)} with the related ground truth labels Y = {y^(i)}. Then, in order to have a visual inspection of the data set, principal component analysis (PCA), one of the most popular dimensionality reduction algorithms, was applied to project the training set onto a tri-dimensional subspace. Finally, quantitative and qualitative results are discussed, with a detailed description of the architecture settings.

Training data
Sample pixels need to be extracted from the raw data and then reordered in order to feed the devised architecture. Indeed, the first RNN stage requires data points to be collected as slices of time series. So, we separated the labeled pixels from the raw data and divided them into chunks, forming a tri-dimensional tensor X ∈ R^(i×t×b) for the successive pre-processing pipeline. Fig. 8 gives a visual representation of the data set tensor X generation, where fixing the first dimension, X_(i,:,:), gives the individual pixel samples X^(i) with t = 9 time steps and b = 5 spectral bands. It is worth noticing that the number of time steps and bands is an arbitrary choice dictated by the raw data availability.
Subsequently, we adopted a simple two-step pipeline to pre-process the data. First, stratified sampling was applied to divide the data set tensor X, with shape (92116, 9, 5), into a training and a test set. Due to the naturally unbalanced number of instances per class present in the data set (Table 3), this is an important step to preserve the same class percentages in the two sets. After selecting a training split of 60%, we obtained two tensors, X_train and X_test, with shapes (55270, 9, 5) and (36846, 9, 5), respectively. Secondly, as common practice, in order to facilitate training, we adopted standard scaling, (x − µ)/σ, to normalize the two sets of data points.

Figure 8: Overview of the tensor X ∈ R^(i×t×b) generation. The first dimension i represents the collected instances X^(i), the second t the different time steps, and the last one b the five spectral bands: red, green, blue, near-infrared and NDVI. On the top, labeled pixels are extracted simultaneously from all time steps and bands, starting from the raw satellite images. Then, the X_(i,j,k) are reshaped in order to set up the X^(i) = X_(i,:,:) of the data set tensor X.
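The two pre-processing steps can be sketched with scikit-learn as follows; the tensor below is a small random stand-in for the real (92116, 9, 5) data set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 9, 5))        # small stand-in for the (92116, 9, 5) tensor
y = rng.integers(0, 4, 200)        # stand-in class labels

# Stratified 60/40 split preserves the per-class percentages in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6,
                                          stratify=y, random_state=0)
# Standard scaling (x - mu) / sigma, fitted on the training set only.
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0) + 1e-10
X_tr = (X_tr - mu) / sigma
X_te = (X_te - mu) / sigma         # test set scaled with training statistics
print(X_tr.shape, X_te.shape)      # (120, 9, 5) (80, 9, 5)
```

Fitting µ and σ on the training set alone avoids leaking test-set statistics into training.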

Dataset Visualization
In order to explore and visualize the generated set of points, we exploited Principal Component Analysis (PCA), reducing the high dimensionality of the data set. For this operation, we considered the components t and b as features of our data points. So, applying Singular Value Decomposition (SVD) and then selecting the first three principal components, W_d = (c_1, c_2, c_3), it was possible to plot the different classes in a tri-dimensional space, giving a visual representation of the projected data points. In Fig. 9, the projected data points are plotted in tri-dimensional space. Except for water bodies, it is worth pointing out how much intra-class variance is present. Indeed, most of the classes lie on more than one hyperplane, demonstrating the difficulty of the task and of the studied data set. Finally, it was possible to analyze the explained variance ratio while varying the number of dimensions. From Fig. 10, it is worth noticing that, approaching the higher components, the explained variance trend stops growing quickly. So, that point can be considered the intrinsic dimensionality of the data set. Due to this fact, it is reasonable to assume that reducing the number of time steps would not significantly affect the overall results.
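A sketch of this visualization step: each pixel's (t, b) matrix is flattened into a t·b feature vector before projecting onto the first three principal components; the data here is a random stand-in:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((500, 9, 5)).reshape(500, -1)  # (i, t*b) feature matrix

pca = PCA(n_components=3)        # uses SVD internally
X3 = pca.fit_transform(X)        # points to plot in tri-dimensional space
print(X3.shape)                  # (500, 3)
print(pca.explained_variance_ratio_)  # variance captured per component
```

Plotting the cumulative `explained_variance_ratio_` against the number of components reproduces the kind of curve used to judge the intrinsic dimensionality of the data set.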

Experimental settings
In this section, we examine the settings of the final network architecture. The basic Pixel R-CNN model, shown in Fig. 3, is the result of a careful design aimed at obtaining the best performance in terms of accuracy and computational cost. Indeed, the final model is a lightweight model with 30,936 trainable parameters (less than 1 MB), faster and more accurate than existing state-of-the-art solutions. In the suggested approach, we employed a single RNN layer with 32 output units, where each peephole LSTM cell is randomly turned off, with probability p = 0.2, by a Dropout regularization operation. Across all experiments, peephole LSTM cells showed an improvement in overall accuracy of around 0.8% over standard LSTM cells. Then, a Time Distributed Dense layer transforms Y (i) (lstm) into a 9x9 square matrix that feeds a stack of two CNN layers with n 1 = 16 and n 2 = 32 features, respectively. The first layer has a filter size of f 1 = 3 and the second one f 2 = 7, producing a one-dimensional output array. Finally, a fully connected layer with a SoftMax activation function maps Y (i) (conv2) to the probabilities of the K = 15 different classes. Except for the final layer, we adopted ReLU as the activation function. In order to find the best training hyperparameters for the optimizer, we used 10% of the training set to perform a random search evaluation, with few epochs, so as to select the most promising parameters. After this preliminary phase, the analysis focused only on the most promising hyperparameter values, fine-tuning them with a grid search strategy.
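As a sanity check, the layer sizes above can be tallied by hand. The sketch below assumes the standard LSTM parameterization 4(u(u + i) + u) (without the extra peephole weight vectors) and valid-padded convolutions, so a 9x9 input shrinks to 7x7 after the 3x3 layer and to 1x1 after the 7x7 layer.

```python
# Back-of-the-envelope trainable-parameter count for the Pixel R-CNN layout.
def lstm_params(inputs, units):
    # W, U and bias for the 4 gates of a standard LSTM cell
    return 4 * (units * (units + inputs) + units)

def dense_params(inputs, units):
    return units * (inputs + 1)            # weights + bias

def conv2d_params(in_ch, out_ch, k):
    return out_ch * (in_ch * k * k + 1)    # k x k kernels + bias

total = (lstm_params(5, 32)        # 5 bands -> 32 LSTM units
         + dense_params(32, 9)     # Time Distributed Dense -> 9x9 matrix
         + conv2d_params(1, 16, 3)   # first CNN layer, f1 = 3
         + conv2d_params(16, 32, 7)  # second CNN layer, f2 = 7 -> 1x1x32
         + dense_params(32, 15))   # SoftMax head over K = 15 classes
print(total)  # -> 30936
```

With these assumptions the tally comes to exactly the 30,936 trainable parameters reported above.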
So, for the AMSGrad optimizer we set β 1 = 0.86, β 2 = 0.98 and ε = 10 −9 . Finally, as previously introduced, in a preliminary procedure we linearly increased the learning rate η while observing the value of the loss function, in order to estimate the initial value of this important hyperparameter. In conclusion, we fed our model with more than 62,000 samples for 150 epochs with a batch size of 128, cyclically varying the learning rate value with a cosine annealing strategy. All tests were carried out with the TensorFlow framework on a workstation with 64 GB of RAM, an Intel Core i7-9700K CPU, and an Nvidia 2080 Ti GPU.
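The cyclical cosine annealing schedule can be sketched as a plain function of the training step. This is a generic illustration, not the exact schedule code used in the study; `total_steps` (the cycle length) and `eta_min` are assumed parameters, and the modulo produces the cyclic restarts.

```python
import math

def cosine_annealing(step, total_steps, eta_max, eta_min=0.0):
    """Cosine-annealed learning rate, decaying from eta_max to eta_min
    over one cycle of total_steps, then restarting cyclically."""
    cos = math.cos(math.pi * (step % total_steps) / total_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + cos)
```

At the start of each cycle the rate equals the initial η estimated by the linear range test; it falls to `eta_min` mid-cycle-end and jumps back at the restart.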

Classification
Performance of the classifier was evaluated with user's accuracy (UA), producer's accuracy (PA), overall accuracy (OA), and the kappa coefficient (K), shown in the confusion matrix in Table 4, which are the most common metrics used for classification tasks [Kussul(2016), Congalton(1991), Conrad(2014), Skakun(2015)]. Overall accuracy indicates the overall performance of our proposed Pixel R-CNN architecture, calculated as the ratio between the total number of correctly classified pixels and the total number of ground truth pixels over all classes. The diagonal elements of the matrix represent the pixels that were classified correctly for each class. Individual class accuracy was calculated by dividing the number of correctly classified pixels in each category by the total number of pixels in the corresponding row (user's accuracy) or column (producer's accuracy). PA indicates the probability that a certain crop type on the ground is classified as such. UA represents the probability that a pixel classified in a given class actually belongs to that class. Our proposed pixel-based Pixel R-CNN method achieved OA = 96.5% and Kappa = 0.914 with 15 classes over a diverse large-scale area, which exhibits a significant improvement compared to other mainstream methods. Water bodies and trees stand highest in terms of UA, with 99.1% and 99.3%, respectively. That is mainly due to their low intra-class variability and the minor change of NIR band reflectance over time, which was easily learned by our Pixel R-CNN. Most of the classes, including the major crop types such as Maize, Wheat, Lucerne, Vineyard, Soya, Rye, and Barley, were classified with more than 95% UA. Grassland was the worst class, classified with PA = 65% and UA = 63%. The grassland class was mainly confused with Lucerne and Vineyard.
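The four metrics can be computed directly from the confusion matrix as described above. The sketch below follows the row/column convention used here (rows give UA, columns give PA) and uses a small hypothetical two-class matrix rather than the actual Table 4.

```python
import numpy as np

def confusion_metrics(cm):
    """OA, per-class UA/PA and Cohen's kappa from a confusion matrix
    whose rows are predicted classes and columns are ground truth."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    diag = np.diag(cm)
    oa = diag.sum() / total             # overall accuracy
    ua = diag / cm.sum(axis=1)          # user's accuracy (rows)
    pa = diag / cm.sum(axis=0)          # producer's accuracy (columns)
    # chance agreement from the row/column marginals
    pe = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, ua, pa, kappa

cm = np.array([[90, 5],     # toy 2-class example, not Table 4
               [10, 95]])
oa, ua, pa, kappa = confusion_metrics(cm)
```

For this toy matrix OA is 185/200 = 0.925 and kappa is 0.85; on the full 15-class matrix the same routine yields the reported OA = 96.5% and Kappa = 0.914.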
It is worth mentioning that the Artificial class, which covers roads, buildings, and urban areas and therefore presents a mixed nature of pixel reflectances, was accurately detected with UA = 97% and PA = 99%.
For the class Apple, the obtained PA was 86% while UA = 68%, which shows that 86% of the ground truth pixels were identified as Apple, but only 68% of the pixels classified as Apple actually belonged to the class Apple. Some pixels (see Table 4) belonging to Pear and Vineyard were mistakenly classified as Apple. The final classified map is shown in Fig. 11, with an example of a zoomed part and the actual RGB image. To the best of our knowledge, a multi-temporal benchmark dataset is not available to compare classification approaches on an equal footing. Some data sets for crop classification are available online, but they lack ground truth for other land cover types such as trees, artificial land (built-up areas), water bodies, and grassland, which makes it difficult to compare classification approaches on an equal footing. Indeed, a direct quantitative comparison of the classifications performed in these studies is difficult due to various dependencies, such as the number of evaluated ground truth samples, the extent of the considered study area, and the number of classes to be evaluated. Nonetheless, we provide an overview of recent studies in this domain, with their applied approaches, number of considered classes, used sensors, and achieved overall accuracy. [ (2015)] also used Sentinel-2 data and achieved a maximum of 95% classification accuracy using an RF classifier, but only nine classes were considered.
In conclusion, it is interesting to observe the neuron activations inside the network during the classification process. Indeed, it is possible to plot unit values when the network receives specific inputs and compare how the model's behavior changes. In Fig. 12, four samples belonging to the same class, "Artificial", feed the model, creating individual activations in the input layer. Even though they all belong to the same class, the four instances taken into account present noticeable variance. Both the spectral features at a specific time instance (rows) and their temporal variation (columns) present different patterns that make them difficult to classify. However, already after the first LSTM layer with the TimeDistributedDense block, the resulting 9x9 matrices Y (timeD) show a clear pattern that can be used by the following layers to classify the different instances into their respective classes. So, during the training process, the network learns to identify specific temporal schemes, which allows it to build strongly distributed and disentangled representations.

Non deep learning classifiers
We tried four other traditional classifiers on the same dataset for comparison: Support Vector Machine (SVM), kernel SVM, Random Forest (RF), and XGBoost. These classifiers are well known for their high performance and are also considered baseline models in classification tasks [Srivastava(2014)]. SVM can perform nonlinear classification using kernel functions by separating hyperplanes. The widely used RF classifier is an ensemble of decision trees based on the bagging approach [Fernández(2014), Shi(2016)]. XGBoost is a state-of-the-art classifier based on a gradient-boosted model of decision trees, which has attracted much attention in the machine learning community. RF and SVM have been widely used in remote sensing applications [Löw(2013), Hao(2015), Gislason(2006)]. Each classifier involves hyperparameters that need to be tuned during classification model development.
We followed a "random search" approach to optimize the major hyperparameters [Lawrence(2006)]. The best hyperparameter values were selected based on the classification accuracy achieved on the validation set, and are highlighted in bold in Table 6. Further details about the hyperparameters and the achieved overall accuracy (OA) for SVM, kernel SVM, RF, and XGBoost are reported in Table 6. Among these non-deep-learning classifiers, SVM stands highest with OA = 79.6%, while RF, kernel SVM, and XGBoost achieved 77.5%, 76.8%, and 77.2%, respectively. As the results in Table 6 show, our proposed Pixel R-CNN classifier achieved OA = 96.5%, far better than the non-deep-learning classifiers. Learning temporal and spectral correlations from multi-temporal images over a large data set is challenging for traditional non-deep-learning techniques. The introduction of deep learning models in the remote sensing domain brought more flexibility to exploit temporal features in a way that increases the amount of usable information, yielding better and more reliable classification results.
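The baseline setup above can be sketched with scikit-learn. This is an illustrative toy run: the data is random, the hyperparameter grid is hypothetical (not the values from Table 6), and the multi-temporal pixels are flattened into plain feature vectors, since these classifiers cannot exploit the temporal axis directly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Toy stand-in data: 9 time steps x 5 bands, flattened to 45 features.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 45))
y = rng.integers(0, 3, size=300)

# Random search over a hypothetical RF hyperparameter grid,
# scored on held-out cross-validation folds.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [50, 100, 200],
                         "max_depth": [5, 10, None]},
    n_iter=5, cv=3, random_state=0)
search.fit(X, y)
best_rf = search.best_estimator_

# Linear and kernel (RBF) SVM baselines on the same flattened features.
svm_linear = SVC(kernel="linear").fit(X, y)
svm_rbf = SVC(kernel="rbf").fit(X, y)
```

`search.best_params_` plays the role of the bold entries in Table 6; the same random-search-then-evaluate loop was applied per classifier.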

Conclusion
In this study, we developed a novel deep learning model combining Recurrent and Convolutional Neural Networks, called Pixel R-CNN, to perform land cover and crop classification using multi-temporal decametric Sentinel-2 imagery of the central-north part of Italy. Our proposed Pixel R-CNN architecture exhibits a significant improvement compared to other mainstream methods, achieving 96.5% overall accuracy with kappa = 0.914 for 15 classes. We also tested widely used non-deep-learning classifiers, such as SVM, RF, kernel SVM, and XGBoost, for comparison with our proposed classifier, and found that these methods are less effective, especially when temporal feature extraction is the key to increasing classification accuracy. The main advantage of our architecture is its capability for automated feature extraction by learning the time correlation of multiple images, which reduces manual feature engineering and the modeling of crop phenological stages.