1. Introduction
Producing more (food) with less (consumption of natural resources) is one of the biggest challenges our society faces in guaranteeing food security globally, and it is therefore one of the priorities of the UN Sustainable Development Goals. To achieve this, agricultural production must be supported by sustainable planning, and one of the first pieces of information needed is “where and when” crops are cultivated. Satellite remote sensing is the best candidate as an information source for worldwide crop monitoring, but powerful and robust algorithms are needed to provide temporally and spatially explicit information on crop presence. Sentinel-2 (S2) is a Copernicus Earth observation mission that systematically acquires high spatial resolution optical images of Earth. This mission introduces a paradigm shift in the quality and quantity of open-access data, opening a new era for land monitoring systems, especially for the agricultural sector. S2 provides multi-spectral data with a spatial resolution from 10 m up to 60 m, covering a 290 km swath per acquisition with a revisit time of 5 days thanks to the two-satellite constellation. This, however, introduces the notion of big data, and we therefore need models able to exploit this enormous amount of information.
In this framework, we want to contribute to research on crop identification from the analysis of time series of Sentinel-2 satellite imagery. The goal is to identify an approach that can reliably identify (classify) the different crops growing in a given area using an end-to-end (3+2)D convolutional neural network (CNN) for semantic segmentation; the method also has the ambition to predict the period in which a given crop is actually growing in a given season. We develop a model based on Feature Pyramid Networks (FPN), adapted to process the time series of a small area associated with each input sample using 3D kernels, and able to provide a segmentation map as output using 2D kernels. Furthermore, we investigate and propose a solution to understand how the CNN identifies the time intervals that contribute to the determination of the output class: Class Activation Intervals (CAI). This solution allows us to interpret the reasoning made by the CNN in the classification of a single pixel. We demonstrate in a variety of experiments that our network can identify discriminative time intervals in the input feature domain even though the CNN has been trained only to solve a classification task (i.e., a spatial solution). Therefore, with our CAI method we can provide information on “when” the class associated with a pixel is present in the time series of Earth Observation (EO) data. The CAI method becomes useful to discriminate whether a crop is the only one cultivated in a season (e.g., summer maize) or represents the second crop of the season (e.g., winter wheat followed by maize). In the latter case, maize is generally sown later in the season, following the grain harvest and soil preparation. The ability to provide such information will help characterize cropping systems (single or double crops) alongside the crop class itself. Thanks to CAI, the information on the sowing period will provide an idea of the cultivated variety and the destination of crops (e.g., corn for silage or forage).
Moreover, the proposed approach has a two-fold importance: it demonstrates the capacity of the network to correctly interpret the physical process investigated, and it provides additional information to the end-user (i.e., crop presence and its temporal dynamics). The provision of the CAI output is a way to assess the robustness of the model for each semantic class, because it provides an explicit representation of the time period in which the crop is likely to be cultivated; for a specific study area, such information can be confirmed by expert knowledge or provide added-value information for the final user.
Figure 1 provides a graphical representation of the CAI information for a spatial subset of the analysed S2 data where maize is cultivated. Low CAI values occur at the beginning of the time series (December to May), when residues of another crop are present, while the indicator values increase significantly in the actual growing period (May to September).
There are several types of convolutional neural networks (CNNs), and all of them can greatly help improve the speed and accuracy of many computer vision tasks. In particular, 3D CNN models are often used to improve object identification in videos or 3D volumes, such as security camera videos [1] and medical scans of cancerous tissue [2].
The spatiotemporal dimension of Sentinel-2 satellite imagery has many similarities to video; for this reason, we adopt models similar to those used for video analysis. In recent years, spatiotemporal data has been addressed through CNN models that follow three main ideas: 2D CNNs (e.g., Two-stream ConvNets [3] and Temporal Segment Network (TSN) [4]), 3D CNNs (e.g., SFSeg [5] and 3D ResNet [6]) and (2+1)D CNNs (e.g., P3D [7] and R(2+1)D [8]). Our proposal partly follows the idea of 3D CNNs used for video datasets, but at the same time uses 2D kernels (i) to create segmentation maps and (ii) to predict the activation intervals of classes in the time domain.
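To make the distinction between plain 3D and factorized (2+1)D convolutions concrete, the following minimal PyTorch sketch contrasts the two (the channel sizes, and the use of 13 input bands and 30 time steps, are illustrative and not taken from the cited implementations):
```python
import torch
import torch.nn as nn

# A plain 3D convolution mixes space and time in a single kernel, while a
# (2+1)D block factorizes it into a spatial 2D step followed by a temporal
# 1D step. Channel sizes here are illustrative only.
conv3d = nn.Conv3d(13, 64, kernel_size=(3, 3, 3), padding=1)

conv2plus1d = nn.Sequential(
    nn.Conv3d(13, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1)),  # spatial 2D
    nn.ReLU(inplace=True),
    nn.Conv3d(64, 64, kernel_size=(3, 1, 1), padding=(1, 0, 0)),  # temporal 1D
)

# A Sentinel-2 time series batch: (batch, bands, time, height, width)
x = torch.randn(2, 13, 30, 48, 48)
print(conv3d(x).shape)       # torch.Size([2, 64, 30, 48, 48])
print(conv2plus1d(x).shape)  # torch.Size([2, 64, 30, 48, 48])
```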
Recently, the performance of 3D CNNs in various fields has greatly improved [6], and in addition we have a huge amount of free satellite data [9] that can be easily interpreted by 3D models. Motivated by the success of the 2D-FPN network [10] used for multi-class semantic segmentation, which has also been successfully applied to satellite RGB imagery [11], in this paper we develop a (3+2)D FPN for multi-class semantic segmentation based on 3D and 2D convolution layers. In particular, the proposed model is designed for crop semantic segmentation, improving automatic crop recognition accuracy through 3D multi-scale capabilities and increasing the spatial and temporal resolution of the thematic product.
The Feature Pyramid Network (FPN) closely resembles a U-shaped Convolutional Neural Network (U-Net) [12]. Like the U-Net, the FPN has lateral connections between the bottom-up pyramid and the top-down pyramid. However, where the U-Net simply copies the features and merges them, the FPN applies a 1 × 1 convolution before adding them. This allows the bottom-up pyramid, called the “backbone”, to be pretty much anything we want. Due to this greater flexibility, we have chosen the FPN and adapted it to our particular (3+2)D problem. The U-Net has also been adapted to 3D segmentation and successfully applied to many 2D and 3D segmentation problems [13]. Unfortunately, we cannot compare directly with the U-Net because we would have to modify it to work on a 3D input and a 2D output at the same time; however, knowing that FPN and U-Net are two very similar models and that the FPN is the more flexible of the two, we decided to work only with the FPN.
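For illustration, a minimal sketch of the FPN-style lateral connection discussed above, with an assumed pyramid width of 256 channels (a common choice for FPNs, not necessarily the one used in our implementation):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNLateralMerge(nn.Module):
    """One FPN merge step: a 1x1 convolution projects the backbone feature
    map to the pyramid width before adding the upsampled top-down map."""

    def __init__(self, in_channels: int, pyramid_channels: int = 256):
        super().__init__()
        self.lateral = nn.Conv2d(in_channels, pyramid_channels, kernel_size=1)

    def forward(self, bottom_up: torch.Tensor, top_down: torch.Tensor):
        lat = self.lateral(bottom_up)                        # 1x1 projection
        top = F.interpolate(top_down, size=lat.shape[-2:])   # upsample coarser map
        return lat + top                                     # add (not concatenate)

# Example: merge a 512-channel backbone map with a coarser 256-channel map.
merge = FPNLateralMerge(512)
out = merge(torch.randn(1, 512, 24, 24), torch.randn(1, 256, 12, 12))
print(out.shape)  # torch.Size([1, 256, 24, 24])
```
This 1 × 1 projection is exactly what frees the backbone to have any channel widths, which is the flexibility we exploit when swapping in different ResNets later.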
Recent papers by Zhou et al. [14,15] have shown that the convolutional units of various layers of CNNs actually behave as object detectors despite receiving no supervision over the location of the object. A related paper [16] proposes a technique for making CNNs more transparent by adding “visual explanations” to a large class of models. They demonstrated that the network can maintain its remarkable object localization capability up to the final layer, thus allowing it to identify discriminative image regions in a single forward pass for a wide variety of tasks, even those for which the network was not originally trained. Inspired by the Class Activation Maps (CAM) proposed by Zhou et al. [14], we expand the proposed (3+2)D FPN CNN with a mechanism that allows visualizing the time intervals of the time series that contribute to the determination of the class of each pixel.
With the (3+2)D FPN we want to be more precise in answering the question “where is a specific crop located” inside an image, but this does not answer the question “when was that crop present” inside a time series. For this reason we have added to the network a mechanism to predict when the class is present in the input time series, without needing to supply more ground truth to the network. Similarly to [14], where the network is adapted to understand “which area of the input image” contributes most to the determination of the output class, we propose a solution to predict, for each pixel, “in which time interval” a specific class is present in the input time series. Exactly as with CAM, where we can say which area of the image contributed to determining the output class but nothing about the other classes, we use the CAI to tell when the class was active in the input time series without saying anything about the other classes on that pixel. That is, our approach is multi-class on every pixel that the model sees as input, but it is not multi-label.
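To make the CAM analogy concrete, the following sketch shows one plausible way such a temporal class activation could be computed; this is our illustrative reading of the mechanism, not the exact formulation used in the model (the function and tensor shapes are hypothetical):
```python
import torch

def class_activation_interval(features: torch.Tensor,
                              class_weights: torch.Tensor,
                              target_class: int) -> torch.Tensor:
    """Sketch of a CAM-style activation computed along the time axis.

    features:      (channels, time, H, W) feature maps that still retain
                   the temporal dimension of the input series
    class_weights: (num_classes, channels) weights of the final classifier
    Returns a (time, H, W) map telling how strongly each time step of each
    pixel activates the target class.
    """
    w = class_weights[target_class]                  # (channels,)
    return torch.einsum("c,cthw->thw", w, features)  # weighted channel sum

# Toy example: 17 classes, 64 channels, 30 time steps, 48x48 pixels.
cai = class_activation_interval(torch.randn(64, 30, 48, 48),
                                torch.randn(17, 64), target_class=3)
print(cai.shape)  # torch.Size([30, 48, 48])
```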
The novelties we propose in this paper are listed below:
We propose a new semantic segmentation model suitable for remote sensing time series.
We add to the proposed CNN a mechanism that allows visualizing the time interval of the time series that contribute to the determination of the class for each pixel.
We improve on the state of the art on a public dataset of satellite time series.
3. Dataset
In our experiments we used the Munich dataset from [20], which contains square blocks of 48 × 48 pixels including 13 Sentinel-2 bands (see some samples in Figure 6). Each 480-m block was extracted from a large geographical area of interest (102 km × 42 km) located north of Munich, Germany. We used split 0 of the dataset, containing 6534 blocks for the training set, 2016 blocks for the test set, and 1944 blocks for the validation set.
The ground truth is a 2D image containing the segmentation of the various crops present in each sample, where each pixel has an associated class label obtained from the two growing seasons 2016 and 2017. The segmentation data are not given for each date of the time series but are associated with an entire year, so for each pixel the label represents the harvested crop declared in that year. The 17 classes in the dataset are reported in Table 1. The dataset does not contain information on when a crop is present (sowing and plant growth) and when it is absent (harvest) during a year of observations. The dataset was divided into training, validation and test sets; it is very unbalanced, and the cardinality of the last two sets is shown in Table 1. The cardinality reported in [20] does not correspond to ours despite using the same splits, probably because we used time series with a different number of samples (we extracted 30 samples per time series) or because some image augmentation was used in their paper. The original dataset can be downloaded from [21].
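For illustration, a minimal sketch of the kind of random temporal subsampling described above (a hypothetical helper; the actual loader in the released code may differ):
```python
import numpy as np

def sample_time_series(block, num_samples=30, rng=None):
    """Randomly subsample a fixed number of acquisition dates from a block
    of shape (T, bands, H, W), keeping the dates in chronological order."""
    rng = rng or np.random.default_rng()
    idx = np.sort(rng.choice(block.shape[0], size=num_samples, replace=False))
    return block[idx]

year = np.random.rand(70, 13, 48, 48)  # one year of acquisitions (toy data)
series = sample_time_series(year)      # (30, 13, 48, 48)
print(series.shape)
```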
4. Experiments
We have conducted three main groups of experiments: initially, in Section 4.1, we test some techniques to neutralize the effect of unbalanced datasets. As a second experiment, in Section 4.2, we compare the proposed model with the literature. Finally, in Section 4.3, we analyze the CAI produced by the trained model.
In our experiments, we used some well-known metrics to evaluate the goodness of our model. In particular, we mainly use the overall accuracy, Kappa [22], Recall, Precision, and F-measure coefficients to compare our results with the results published in [20].
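For reproducibility, a minimal sketch of how these metrics can be computed on flattened per-pixel labels with scikit-learn (the weighted averaging corresponds to the w.R/w.P/w.F1 columns used in Section 4.1; the helper itself is ours, not part of the released code):
```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             precision_recall_fscore_support)

def evaluate(y_true, y_pred):
    """Flattened per-pixel labels in, the metrics used in this section out."""
    oa = accuracy_score(y_true, y_pred)
    kappa = cohen_kappa_score(y_true, y_pred)
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0)
    return {"OA": oa, "Kappa": kappa, "w.P": p, "w.R": r, "w.F1": f1}

print(evaluate([0, 1, 2, 2, 1], [0, 1, 2, 1, 1]))
```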
We have not conducted any systematic experiments to determine the best value for the hyper-parameter of the loss function defined in Equation (5), and therefore we kept it fixed at a single value in all experiments. We ran each experiment for 300 epochs. The optimizer is SGD [23], with momentum equal to 0.9, weight decay 0.001, an initial learning rate of 0.01, and a scheduler that uses a cosine function to reduce the learning rate after each epoch. All the time series used are made up of 30 samples randomly extracted from each of the two available years.
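The training setup stated above translates into the following PyTorch configuration (a sketch: the placeholder model and the empty training loop stand in for the actual (3+2)D FPN and its forward/backward pass):
```python
import torch

EPOCHS = 300
model = torch.nn.Linear(4, 2)  # placeholder; stands in for the (3+2)D FPN

# Settings as stated in the text: SGD, momentum 0.9, weight decay 0.001,
# initial learning rate 0.01, cosine decay of the learning rate per epoch.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.001)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

for epoch in range(EPOCHS):
    # ... forward pass, loss, backward pass, optimizer.step() ...
    scheduler.step()  # reduce the learning rate with a cosine schedule
```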
The trained models are available on PyTorch Hub [24] and the source code to run the experiments is available in a GitLab repository [24].
4.1. Class Imbalance Experiments
Almost all land-use segmentation datasets suffer from the predominance of some classes over others. Which classes predominate depends on the geographical area, the season, and the extent of the territory analyzed. For this reason, multi-class segmentation for land cover through the analysis of satellite images remains a challenging problem. The number of samples from the training dataset used to estimate the error gradient during training is called the batch size, and it is an important hyperparameter that affects the resulting trained model. If the dataset is unbalanced, the number of pixels belonging to each class that the neural network uses to compute the gradient depends on the batch size. If the batch is too large, the predominant classes completely overwhelm the classes that have very few samples, while if the batch is small, the number of pixels for each class is more balanced. For this reason, in this first experiment, we experimentally analyze the effect of the batch size together with two class weighting techniques.
In this section, we analyze the performance of the proposed model, without the NDVI layer, using different techniques to counter the effect of the unbalanced dataset. In particular, we use the two different types of class weights for the loss function (see Equation (4)) described below, to try to counter the effect of the class unbalancing in the dataset, and we compare them with a unit weight for all classes (no weights). As the bottom-up pathway of the proposed (3+2)D FPN, in these experiments we use a ResNet101.
In Table 2, we report results for the model that uses only the cross-entropy defined in Equation (4) without class weights (no weights), and we compare it with the weighting scheme (batch weights) based on the “Effective Number of Samples” within each batch [25], and with a weighting strategy based on the total number of samples present in each class (global weights). The weighting scheme proposed in [25] counts, for each batch of the training set, the number of actual samples n_c in each class c. The effective weight used in each batch is then defined by the simple formula w_c = (1 − β)/(1 − β^{n_c}), where β is a hyper-parameter. In the global weights scheme, we use a weight for each class computed in the same way over the entire training set.
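A minimal sketch of this weighting, under our reading of [25] (the value beta = 0.999 and the normalization are illustrative choices):
```python
import numpy as np

def effective_number_weights(class_counts, beta=0.999):
    """Class-balanced weights following our reading of [25]:
    w_c = (1 - beta) / (1 - beta ** n_c), normalized so the weights
    sum to the number of classes."""
    counts = np.asarray(class_counts, dtype=np.float64)
    weights = (1.0 - beta) / (1.0 - np.power(beta, counts))
    return weights * len(counts) / weights.sum()

# "Batch weights" recompute the counts on every batch; "global weights"
# use the class pixel counts of the whole training set once.
print(effective_number_weights([50000, 1200, 300]))  # rare classes weigh more
```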
Analyzing the overall accuracy (OA), Kappa, weighted Recall (w.R), weighted Precision (w.P) and weighted F-measure (w.F1) in Table 2 and the graphs in Figure 7, we can see that the loss function weighting techniques do not lead to any advantage on this dataset. On the other hand, it should be noted that the batch size (see the second column, labeled batch, in Table 2) has a great influence on the result: a very small batch size yields the best results. From Table 1 we see that the class with the lowest number of pixels (column #pix) on the validation set is the asparagus class, while in Figure 7 the effect of the loss weights and of the batch size can be analyzed as a function of the number of samples for each class. Analyzing, for example, the asparagus class when the weight is the same for all classes (plot at the top of Figure 7), we can see that the best results are obtained with a small batch size. As a consequence of these experiments, in the remaining experiments we do not use class weights.
4.2. Comparisons
There are few public-domain datasets in the field of remote sensing; in particular, public multi-class segmentation datasets suitable for training a deep model are rare. In this section, we show the results of the comparison between our method and the results associated with the only available public dataset [20]. Furthermore, we analyzed the use of different ResNets as the backbone of the bottom-up block of the proposed model. The results are reported in Table 3 and show that as the complexity of the bottom-up block increases, the classification accuracy improves. We have not used more complex models due to our limited hardware availability, but we would probably obtain better results using more powerful models.
In Table 1, we report the comparison with the results published by Rußwurm and Körner [20], and we can conclude that our proposed model performs better on all the metrics used in this comparison. Although the cardinality per class is not the same, we can see that the behavior of the (3+2)D FPN is very similar on both the validation and the test set.
4.3. Experiments on CAI and Ablation Study
To evaluate the quality of the Class Activation Intervals predicted by the network, we do not have any ground-truth values, and therefore we asked for the opinion of experts. The results produced are in agreement with expert knowledge. For example, from Figure 3 we can see that for the CAI of the maize class the model has predicted the time interval from May to the beginning of November, while for the winter wheat class the model has predicted a CAI from late April until early July, even though the NDVI values remain high afterwards. This high NDVI value is typically due to weed presence, re-growth after harvesting, and a subsequent crop. Indeed, within these noisy and complex data, the model identified the time span that is uniquely related to wheat growth.
To understand how the CAI changes with the patch on which it is calculated, we took into consideration the winter wheat class and calculated all the CAI vectors on the entire test set. We show in Figure 8 an aggregate mean value for each patch, and we can see that the time interval associated with this class does not change according to the patch, even if the average activation values change slightly. The network used in this experiment uses ResNet101 as the bottom-up block, and the numerical results are reported in Table 1.
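As an illustration of this aggregation, a hypothetical sketch that averages the CAI vectors over the pixels of the class of interest in each patch (the exact aggregation used for Figure 8 may differ):
```python
import torch

def mean_cai_per_patch(cai_maps, masks):
    """Aggregate per-pixel CAI vectors into one mean activation curve per
    patch. cai_maps: list of (time, H, W) tensors; masks: list of boolean
    (H, W) tensors selecting the pixels of the class of interest."""
    curves = []
    for cai, mask in zip(cai_maps, masks):
        if mask.any():
            curves.append(cai[:, mask].mean(dim=1))  # mean over class pixels
    return torch.stack(curves)                       # (patches, time)

curves = mean_cai_per_patch([torch.randn(30, 48, 48)],
                            [torch.rand(48, 48) > 0.5])
print(curves.shape)  # torch.Size([1, 30])
```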
To understand the effect of learning the NDVI indices with the aim of predicting the CAI, we also conducted an ablation study in which we eliminated the MSE loss during training and compared the result with the same model trained with this loss. From the numerical results reported in Table 3, we can see that the MSE loss does not greatly affect the classification performance.