1. Introduction
Satellite image time series analysis is now used in many research areas. Contrary to past decades, we nowadays have access to a huge amount of time-spread geospatial data that provides a full description of almost any area of interest in the world [1]. Exploiting satellite image time series (SITS) gives a better comprehension of a study area, its landscape, land cover and evolution than single-image analysis [2]. While some applications demand SITS analysis in order to detect or monitor a specific event (constructions, droughts, deforestation, etc.) [3,4,5,6], others exploit SITS to perform a land cover analysis of the whole area and/or its eventual evolution [7]. For the second type of application, prior knowledge about the temporal behavior of some classes (usually vegetation) is indispensable to produce a correct classification map [8,9].
However, due to the variety of objects present in remotely sensed images, and in SITS in particular, few labeled data are available. For this reason, unsupervised approaches are becoming more and more popular for various projects. Most of the currently used unsupervised approaches for SITS clustering deploy pixel-wise analysis [10,11]. In these approaches, the pixels corresponding to the same geographical position on different images form temporal sequences that are then compared to each other and associated with different classes. Numerous studies have proven the Dynamic Time Warping (DTW) algorithm [12] to be an efficient tool to compute a similarity measure between temporal sequences. The main idea of this approach is to non-linearly map one series to another by minimizing the distance between them. Thus, the DTW distance matrix is computed for every point of the series and used as a similarity measure for a chosen clustering algorithm.
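To make the idea concrete, the following minimal pure-Python sketch implements the classic DTW recurrence (the function name and the absolute-difference local cost are illustrative choices, not taken from the cited works):

```python
# Illustrative sketch of DTW between two univariate temporal sequences,
# such as per-pixel reflectance profiles extracted from a SITS.
def dtw_distance(a, b):
    """Classic dynamic-programming DTW with no warping-window constraint."""
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] = cost of the best warping path aligning a[:i] with b[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

Identical sequences yield a distance of zero, and temporally shifted but similar profiles obtain a lower DTW cost than a rigid point-to-point comparison would give.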
In general, the DTW distance matrix has a high computational cost. As a result, the analysis of large datasets at the pixel level may be extremely time-consuming and, hence, unreasonable. To deal with this issue, several object-based DTW clustering approaches have been proposed [13,14,15] to analyze the data in both the temporal and spatial dimensions. In these methods, spatio-temporal segments (in the form of a 2D map) are extracted for the whole SITS; then, the temporal sequences constructed from segment descriptors are clustered. Object-based SITS analysis therefore drastically reduces the computational cost and ensures more homogeneous clustering results compared with pixel-based approaches.
Nevertheless, few SITS segmentation approaches are available [16] and it can be tricky to create a proper segmentation map for the whole series, as objects sometimes change from one image to another. If a series is short enough (does not cover more than a year), we can presume that object shapes stay invariant and, in this case, we can project a single-image segmentation onto the whole SITS. However, this approach cannot be used for a series that covers a long period of time, especially if it contains permanent changes or important phenological variations. To capture some of these changes, segmentation may be performed on the concatenated product of the two or three most representative images of the SITS [13] or even on the concatenated product of the whole time series [14]. In the first case, we may miss some objects. In the second one, the segmentation may have a high computational cost and be difficult to parameterize if the SITS is long.
To overcome multi-temporal segmentation issues, the authors of Reference [17] propose a graph-based approach to analyze different spatio-temporal dynamics in SITS. In this method, each image is segmented independently and all the spatio-temporal entities that belong to the same geographical location are connected to each other and form evolution graphs. Every graph is characterized by a bounding box, an object whose footprint intersects all the graph objects at different timestamps. Following this method, Reference [18] proposes an algorithm to cluster the extracted multi-annual graphs. Each evolution graph is first described by a simplified representation, a synopsis. Then, spectral and hierarchical clustering algorithms with a DTW distance measure are applied to the graph synopses. This approach showed promising results for the clustering of natural habitat areas. However, it may be complicated to construct evolution graphs for urban areas, as their segmentation is harder due to the non-homogeneity of the features. For this reason, the segmentation results of urban areas are usually not uniform from one image to another, contrary to agricultural lands, where a parcel is represented by one or two well-delimited segments that repeat over time if no changes occur.
To create a single segmentation map for the whole SITS, the authors of Reference [19] propose a time series segmentation approach based on the DTW distance measure. In this approach, each pixel is initially characterized by its temporal sequence and each sequence first represents an isolated segment; segments with a similarity measure higher than a certain threshold are then iteratively merged. However, for the aforementioned reasons, we estimate that this approach can be slow, even if the distances are not computed for all pixel pairs.
In this paper, we propose a SITS object-based clustering algorithm based on SITS compression with a 3D convolutional autoencoder (AE). 3D convolutional networks have been successfully used in remote sensing for supervised classification [20,21] due to their ability to deal with multi-temporal image data and their lower computational cost compared with other temporal models such as the convolutional Long Short-Term Memory (LSTM) network [22]. Contrary to these methods, our 3D convolutional AE model is unsupervised and does not require any labeled data; to our knowledge, no such models have been used in time-series remote sensing yet.
In our work, we deploy an AE neural network structure. Traditionally, autoencoders are used for unsupervised dimensionality reduction or feature learning [23]. Different AE models have been widely used in remote sensing [24,25,26]. In these articles, the features are extracted from a single image using AEs and then used for land scene classification. However, the AE structure can be adapted to any type of data; we therefore propose to use AEs for the feature extraction and compression of image series.
In our method, we first encode the whole SITS into a new feature image with a multi-view 3D convolutional AE. Both the encoder and decoder parts contain two branches that are concatenated before the bottleneck. While the first branch obtains deep features from the spectral bands of the whole SITS, the second one only extracts general information from the corresponding Normalized Difference Vegetation Index (NDVI) [27] images. Second, we perform a preliminary segmentation of the SITS on its two most representative images. Then, we correct the preliminary segmentation using the encoded feature image. Finally, we regroup the obtained objects with a hierarchical clustering algorithm [28] using the encoded features as descriptors. The proposed approach showed good results on two real-life datasets and outperformed competing methods, including those based on the DTW measure.
We summarize our contributions as follows:
We propose a fully unsupervised approach to SITS clustering using deep learning techniques.
We propose a two-branch multi-view AE that extracts more robust features compared with a classic convolutional AE.
We develop a segmentation approach that produces a unique segmentation map for the whole SITS.
The proposed architecture is new and does not rely on a pre-existing or pre-trained network.
The rest of the paper is organized as follows:
Section 2 presents the proposed approach,
Section 3 describes the datasets used, and
Section 4 reviews the experimental results with their qualitative and quantitative evaluation. In the last section, we summarize the work done and discuss future prospects.
2. Methodology
Our proposed approach is developed for the segmentation and clustering of a SITS. Let I1, I2, …, IS be a time series of S co-registered images acquired at timestamps t1, t2, …, tS. The framework is composed of the following steps:
We start with a relative normalization of all the images of the SITS using the algorithm described in Reference [29] and a correction of saturated pixels.
We deploy a two-branch multi-view 3D convolutional AE model in order to extract spatio-temporal features and compress the SITS.
Then, we perform a preliminary SITS segmentation using two farthest images of the dataset taken in different seasons.
We correct the preliminary segmentation using the compressed SITS.
Finally, we perform the clustering of extracted segments using their spatio-temporal features as descriptors.
2.1. Time Series Encoding
For the compression and encoding of the SITS, we propose to use a two-branch multi-view 3D convolutional AE. While the first branch of the AE extracts deep temporal features from the initial series, the second one extracts primary temporal features from the associated NDVI images (Figure 1). The NDVI branch improves the model's capacity to distinguish different vegetation types, especially those with weak seasonal variance. Moreover, by allocating a separate branch to NDVI images instead of simply adding a supplementary NDVI channel to the initial images, we “force” the model to extract more robust and independent vegetation features.
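For reference, NDVI is computed per pixel from the red and near-infrared bands; a minimal NumPy sketch follows (the function name and the epsilon guard are our own illustrative choices):

```python
# Illustrative sketch: computing the NDVI channel fed to the second branch.
# NDVI = (NIR - Red) / (NIR + Red), valued in [-1, 1].
import numpy as np

def ndvi(red, nir, eps=1e-8):
    red = red.astype(np.float64)
    nir = nir.astype(np.float64)
    return (nir - red) / (nir + red + eps)  # eps guards division by zero
```

Dense vegetation (high NIR, low red) yields values close to 1, while bare soil and water yield values near or below 0.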
Contrary to traditional 2D convolutional networks, where convolution filters are applied in the 2D plane, 3D convolutions preserve the temporal relations in the data by extending the filters to the depth dimension [30]. Therefore, a 3D convolutional network extracts both spatial and temporal features.
The deployment of an AE-type model ensures the extraction of robust spatio-temporal features in an unsupervised manner, without using any reference data. In classic AEs, the model first encodes the input data into a compressed latent representation and then decodes it back to reconstruct the input. In image processing, the encoding pass is usually composed of convolutional and pooling layers for feature map (FM) extraction, followed by some fully-connected (FC) layers for feature compression. The decoding pass is often symmetrical to the encoding one. Once the model is trained, the extracted compressed representation is used to describe the data under study. The encoder-decoder model allows us to compress the whole dataset in a uniform way. Moreover, it can compress any type of data independently of its shape and size.
In the case of our multi-view AE, during the encoding step, we independently extract features from two different stacks of images (the original images and their corresponding NDVI); the features are then merged to obtain a combined descriptor. During the decoding pass, the features are separated and reconstructed independently into the initial stacks of original and NDVI images.
The training and encoding processes of the whole series are performed patch-wise for the stack of SITS images. Patches of size p × p are extracted for every (i, j)-pixel of the SITS (0 ≤ i < H, 0 ≤ j < W, where H and W are the image height and width, respectively) and represent stacks of size p × p × S × B, where B is the number of image bands. Obviously, for the first branch, B corresponds to the number of spectral bands; for the second one, B = 1 as we deal with single-channel NDVI images. To extract deep features from the original images, we propose to use patches of size 9 × 9; however, as we extract only general information from the NDVI images, a patch size of 5 × 5 is sufficient. We consider that 9 × 9 is big enough to capture the necessary information about the neighboring pixels, as it corresponds to a 90 × 90 m surface footprint at 10 m resolution. In addition, it ensures smooth maxpooling with a 3 × 3 window and does not produce an important border effect for the patches that contain two (or more) different classes (see more about this in the next subsection). For the NDVI branch, we believe that 5 × 5 is the minimum sufficient patch size to capture the neighborhood vegetation features (3 × 3 covers only a 1-pixel radius, so this information cannot be considered relevant). Moreover, we apply no padding to the second 3D convolutional layer of the NDVI branch to reduce the size of the extracted feature maps before applying the maxpooling operation. Note that we tend to decrease the network complexity and its training time by choosing a smaller NDVI patch size, as all the important information about land cover textures is extracted in the main branch, while the NDVI branch is used only to detect vegetation tendencies. As one may observe from the model schema, the configuration of the FC layers depends on the number of images in the SITS. This guarantees that all the layers within different models have the same input/output compression ratio. Note that if S is large, one might consider adding a supplementary FC layer.
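The patch extraction can be sketched as follows (an illustrative NumPy version with hypothetical patch sizes and reflect-padding at the image borders; the actual implementation may differ):

```python
# Illustrative sketch: extracting the per-pixel temporal stacks of patches
# fed to the two encoder branches, for a SITS tensor of shape (S, B, H, W).
import numpy as np

def extract_patch(sits, i, j, p):
    """Return the p x p temporal stack centred on pixel (i, j),
    with shape (S, B, p, p). Borders are reflect-padded."""
    r = p // 2
    padded = np.pad(sits, ((0, 0), (0, 0), (r, r), (r, r)), mode="reflect")
    return padded[:, :, i:i + p, j:j + p]

S, B, H, W = 12, 4, 64, 64                  # e.g., 12 images with 4 bands
sits = np.random.rand(S, B, H, W)
main_patch = extract_patch(sits, 0, 0, 9)   # main branch, 9 x 9 patch
ndvi_stack = sits[:, :1]                    # stand-in single-channel stack
ndvi_patch = extract_patch(ndvi_stack, 0, 0, 5)  # NDVI branch, 5 x 5 patch
```

In the full pipeline, one such pair of stacks is produced for every pixel of the series and fed through the two branches of the AE.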
For model evaluation and optimization, we use the mean-squared error (MSE) loss (1):

MSE = (1/N) · Σ_{n=1}^{N} ‖o_n − t_n‖²,      (1)

where o_n is the n-th output patch of the model, t_n is the corresponding target patch and N is the number of patches per training batch.
Once the model is stable, every temporal stack of patches is encoded into a feature vector of size f that corresponds to the (i, j)-pixel of a new feature image of size H × W × f, which is further used as a compressed version of the whole dataset.
2.2. Segmentation
Satellite image segmentation is a task of image processing that partitions an image into non-intersecting regions (segments) so that the ensemble of pixels of each region shares similar properties. Segmentation can therefore be seen as a first step before doing a classification or a clustering of the newly created segments for any object-based method.
As mentioned in the previous section, SITS segmentation can be a complicated and challenging process, especially when the number of images is large. The main idea of our segmentation approach is the following: to get a more robust SITS clustering that is easy to visualize, we need to obtain a unique segmentation map for the whole series. To accomplish this task, we could directly perform the segmentation on the encoded SITS image. However, as the encoding is performed patch-wise for every image pixel, one may observe a border effect. This effect is produced for pixels located close to the border between two regions. As the patches extracted for these pixels contain information about two (or more) different classes, their encoded spatio-temporal features will not be “pure”. For this reason, these pixels may be segmented as new (mostly linear) objects, or segment borders may be shifted. Moreover, linear objects such as roads or rivers may not be distinguished or, on the contrary, may be over-segmented.
Figure 2 presents two examples of the border effect and its eventual correction with our method (explained later in the text). The first row shows the shifted borders in crop segmentation at the limits of different types of crops. The second row displays the segmentation of a road. We can observe that the road is over-segmented and its borders are shifted at the same time.
To tackle this problem, we propose to perform a two-step segmentation that includes a correction of the preliminary segmentation with respect to all object borders of the time series. The preliminary segmentation Seg_p is performed on the two most representative concatenated images of the SITS. To obtain the maximum of coherent spatio-temporal objects in Seg_p, the chosen images should be as far apart as possible (e.g., the first and the last image) and correspond to different seasons.
For all image segmentations, the MeanShift [31] algorithm available in the Orfeo ToolBox software (www.orfeo-toolbox.org) under the QGIS interface was chosen. The most important parameters of the MeanShift segmentation algorithm are the spatial radius r_s, the range (spectral) radius r_r and the minimum segment size m_s. The main idea of the algorithm is to first reproject an n-channel image into an n-dimensional space and simplify its representation by replacing each pixel with the mean of the pixels in its r_s-neighborhood that have values within r_r; the regions smaller than m_s are merged. Second, the algorithm reprojects the data back into the image plane and separates the areas with the same mean value into non-overlapping segments. At the end, the segments smaller than m_s are merged with their neighbors.
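The mode-seeking idea behind MeanShift can be illustrated with a simplified one-dimensional flat-kernel version (this is not Orfeo ToolBox's implementation; the function and parameter names are ours):

```python
# Illustrative sketch of the mean-shift idea: each value is iteratively
# replaced by the mean of its neighbours within the range radius, so that
# values converge towards local modes (the future segment means).
def mean_shift_1d(values, range_radius, n_iter=20):
    values = list(values)
    for _ in range(n_iter):
        new_values = []
        for v in values:
            neighbours = [w for w in values if abs(w - v) <= range_radius]
            new_values.append(sum(neighbours) / len(neighbours))
        values = new_values
    return values
```

After convergence, pixels that ended up on the same mode are grouped into one region, which is the simplification step the real algorithm performs in the n-dimensional spectral space.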
Although the preliminary segmentation Seg_p gives us correct segment borders, it is impossible to identify all the objects present in the SITS on the basis of only two images. Therefore, in the next step, we perform a segmentation Seg_e of the encoded SITS, which is represented as an f-channel image. As mentioned before, this segmentation contains numerous irrelevant objects and shifted borders. Finally, we choose Seg_p as the reference and correct it by fitting the segments from Seg_e into it, obtaining the final segmentation map Seg_f.
The correction process is performed separately for each segment and proceeds as follows (see Figure 3):
Let s be a segment of the preliminary segmentation Seg_p to correct.
We first fill s with the segments from the encoded-image segmentation Seg_e that have a spatial intersection with it; the borders of s are preserved and used as the reference.
Second, we check the average width of these segments along the horizontal and vertical axes of the SITS coordinate system and select the objects whose width is smaller than a threshold w in at least one of the axes. The value of w should not exceed half of the encoder patch size and is set after estimating the influence of the border effect.
Third, each of these objects is merged with the neighbor sharing the biggest common edge if that edge is at least 3 pixels long or if the object's size does not exceed the minimum object size that we want to distinguish in our experiments. Note that when several segments are to be merged, we sort them by ascending size and start by merging the smallest one, while the sizes of the other segments are iteratively updated.
Finally, we fill a new segmentation map Seg_f with the merged segments.
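The neighbor-merging step above can be sketched on a 2D label map as follows (a simplified illustrative version of the "biggest common edge" rule only; it ignores the edge-length and minimum-size conditions, and the function name is ours):

```python
# Illustrative sketch: merging an under-sized segment into the neighbour
# with which it shares the longest common edge, on a 2-D label map.
import numpy as np

def merge_small_segment(labels, seg_id):
    """Relabel `seg_id` with the neighbouring label that shares the most
    4-connected border pixels with it."""
    mask = labels == seg_id
    edge_counts = {}
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        shifted = np.roll(labels, (di, dj), axis=(0, 1))
        # np.roll wraps around the image edges; this is acceptable for an
        # interior-segment sketch but a real implementation should not wrap.
        for lab in shifted[mask]:
            if lab != seg_id:
                edge_counts[lab] = edge_counts.get(lab, 0) + 1
    best = max(edge_counts, key=edge_counts.get)
    labels = labels.copy()
    labels[mask] = best
    return labels
```

For example, a thin two-pixel segment wedged between two larger regions is absorbed by the region with which it shares the most border pixels.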
Our method might still produce some shifted borders for some corrected segments but, at the same time, it allows us to reduce the border effect to a minimum, to preserve the correct shapes of linear objects and to avoid spurious segments that correspond to border pixels.
2.3. Clustering
To regroup the obtained segments, we deploy the hierarchical clustering algorithm (HCA) [28] applied to the segment descriptors.
Often, the output of a clustering algorithm does not correspond to the desired classes, as some of them might be merged or, on the contrary, divided into two or more new clusters. For this reason, the user might need several tries to find the optimal number of clusters for the desired output partition. We chose HCA due to its ability to build a unique model for analyzing cluster data at different levels. Contrary to other clustering algorithms, an HCA model does not demand a predefined number of clusters or a complex set of parameters that would further define the number of clusters. During the algorithm execution, data points initially form separate clusters; then, at every step, the model iteratively merges the two clusters with the highest likelihood value. At the end, the user simply chooses the clustering level that best corresponds to the desired partition.
For the segment descriptors, we use the median values of the encoded features of the pixels within each segment. We choose the median values over the mean ones so that the border pixels are not taken into account. We use Ward's linkage [28] and the Euclidean distance between segments as parameters of the clustering algorithm.
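A minimal sketch of this clustering step using SciPy on synthetic segment descriptors (the descriptor values, group structure and class labels in the comments are invented for illustration):

```python
# Illustrative sketch: Ward's hierarchical clustering of per-segment
# descriptors (median encoded features), cut at a chosen number of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# two synthetic groups of 8-dimensional segment descriptors
descriptors = np.vstack([
    rng.normal(0.0, 0.05, size=(10, 8)),   # e.g., "vegetation" segments
    rng.normal(1.0, 0.05, size=(10, 8)),   # e.g., "urban" segments
])
Z = linkage(descriptors, method="ward")          # Euclidean by construction
cluster_labels = fcluster(Z, t=2, criterion="maxclust")  # cut at 2 clusters
```

The dendrogram encoded in `Z` can be cut at different levels, which is what lets the user explore several candidate partitions from a single model.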
3. Data
We evaluate the proposed approach on two real-life, publicly available time series issued from the SPOT-5 and Sentinel-2 missions. Both SITS are taken over the same geographical location (Montpellier area, France) but differ in terms of spectral and temporal resolution. While the first SITS contains 12 images irregularly sampled over 6 years, the second one contains 24 images taken over 2 years with a more regular temporal resolution.
The SPOT-5 dataset was acquired between 2002 and 2008 and belongs to the Spot World Heritage archive (available at https://theia.cnes.fr/). We filtered out cloudy and flawed images from all the images acquired by the SPOT-5 mission over the considered geographic area and obtained 12 exploitable images with an irregular temporal resolution (the minimum temporal distance between two consecutive images is 2 months, the maximum 14 months and the average 6 months). The distribution of the dataset images is presented in Table 1. All SPOT-5 images provide green, red, NIR and SWIR bands at 10 m resolution.
The Sentinel-2 dataset was acquired between January 2017 and December 2018 (available at https://earthexplorer.usgs.gov/). After deleting unexploitable images as well as images taken less than 15 days after the previous one, we obtained 24 images with a more regular temporal resolution (the minimum temporal distance between two consecutive images is 15 days, the maximum 2.5 months and the average 1 month). The distribution of the dataset images is presented in Table 2. Sentinel-2 images provide multiple spectral bands with different spatial resolutions; however, it was decided to keep only the 10 m resolution spectral bands: blue, green, red and NIR.
The original images of both datasets are clipped to the same rectangular extent and transformed to the UTM zone 31N projection (EPSG:32631). The clipped image extent corresponds to the following latitude and longitude coordinates in the WGS-84 system:
bottom left corner: 43°30′6.0444″N, 3°47′30.066″E
top right corner: 43°39′22.4856″N, 3°59′31.596″E
The pre-processing level of both datasets is 1C (orthorectified images, top-of-atmosphere reflectance). For this reason, both SITS were radiometrically normalized to obtain homogeneous and comparable spectral values over each dataset. For the image normalization, we used an algorithm introduced in Reference [29] that is based on the histogram analysis of pixel distributions.
The ground truth (GT) for both datasets was taken from the open data website of the Montpellier agglomeration (http://data.montpellier3m.fr/) and corresponds to land cover maps that we manually modified to keep only distinguishable classes and to merge the look-alike classes. While for the SPOT-5 dataset we used the Corine Land Cover (CLC) map of 2008, for the Sentinel-2 dataset the CLC of 2017 was taken. We defined 9 well-distinguished GT classes:
urban and artificial area,
wooded area (includes forests, parks, family gardens, etc.),
natural area (not wooded),
water surface,
annual crops,
prairies,
vineyards,
orchards,
olive plantation.
For both datasets, the olive plantation class is very small, so we choose 8 reference classes for our clustering algorithm. The GT olive plantation class will be ignored during the evaluation.
Note that it is difficult to create a GT for a multi-annual SITS analysis, as some objects may go through changes and it is impossible to detect all these changes manually. For this reason, for the SPOT-5 dataset, we use the GT that corresponds to the last year of the SITS. The SPOT-5 dataset was taken over 6 years and contains many change processes, mostly different constructions and permanent crop rotations. A study of change detection in the SPOT-5 dataset is presented in Reference [6]. As these changes are relatively few, they will be considered by most clustering algorithms as outliers and, hence, mixed with the “stable” classes. However, some of these changes last only several timestamps, so we still perform the clustering of the whole SITS instead of only change-free areas. Thus, the change areas will be regrouped with the no-change areas with the most similar temporal behavior, or will even form their own clusters. At the same time, we consider that the Sentinel-2 dataset has no or very few change areas, as it spans only two years.
5. Conclusions
In this article, we have presented a fully unsupervised approach for SITS clustering based on a two-branch multi-view 3D convolutional AE that does not demand any labeled data. The proposed approach exploits the AE model to compress a time series into an encoded image by extracting its spatio-temporal features. It then performs the segmentation of the encoded image with an eventual correction of the shifted segment borders related to the specificity of the encoding. The proposed approach was tested on two real-life datasets and showed its efficiency compared with competing approaches.
The main advantages of the proposed algorithm include the improvement of traditional segmentation methods that are not initially adapted to SITS, which leads to a higher NMI score. In addition, we have shown that we can improve the clustering results by simply introducing a temporal NDVI branch into the AE model. The presented approach is a good alternative to traditional DTW-based methods, as deep learning techniques are able to extract more robust and complex features than traditional machine learning methods.
In future work, we want to adapt our algorithm to all the spectral bands of Sentinel-2 datasets as well as to improve its accuracy for the clustering of linear objects. Other spectral indices can also be integrated into the proposed model; however, a closer study is needed to estimate the influence of each index on the clustering results. Moreover, a contextual constraint may be introduced to distinguish more classes (e.g., artificial areas such as beaches can be discriminated from urban areas as they are close to water).