Unsupervised Deep Learning for Landslide Detection from Multispectral Sentinel-2 Imagery

This paper proposes a new approach based on an unsupervised deep learning (DL) model for landslide detection. Recently, supervised DL models using convolutional neural networks (CNNs) have been widely studied for landslide detection. Even though these models provide robust performance and reliable results, they depend heavily on a large labeled dataset for their training step. As an alternative, in this paper, we developed an unsupervised learning model by employing a convolutional auto-encoder (CAE) to deal with the problem of limited labeled data for training. The CAE was used to learn and extract abstract, high-level features without using training data. To assess the performance of the proposed approach, we used Sentinel-2 imagery and a digital elevation model (DEM) to map landslides in three different case studies in India, China, and Taiwan. Using the minimum noise fraction (MNF) transformation, we reduced the multispectral dimension to three features containing more than 80% of the scene information. Next, these features were stacked with slope data and NDVI as inputs to the CAE model. The Huber loss was used to evaluate the reconstruction of the inputs. We achieved reconstruction losses ranging from 0.10 to 0.147 for the stack of MNF features, slope, and NDVI across all three study areas. The mini-batch K-means clustering method was used to cluster the features into two to five classes. To evaluate the impact of deep features on landslide detection, we first clustered a stack of MNF features, slope, and NDVI, and then the same stack plus the deep features. For all cases, clustering based on deep features provided the highest precision, recall, F1-score, and mean intersection over union in landslide detection.


Introduction
Landslides are among the most dangerous and complicated natural disasters, occurring in different types, frequencies, and intensities worldwide [2], and usually cause severe destruction in natural areas and settlements as well as loss of human life and property [1]. Therefore, studying and analyzing this natural hazard is highly necessary to find appropriate solutions to mitigate its adverse consequences. Rapid detection and mapping of such events are particularly necessary for immediate response and rescue operations. Field surveys and visual interpretation of aerial photographs are the prevailing methods to map landslides. Recent studies have compared CNN models against conventional machine learning (ML) models such as SVM and RF; their accuracy assessment results showed that the CNN model outperformed the other ML models in discriminating landslides from non-landslide areas.
Nevertheless, the authors conclude that the performance and accuracy of CNN models significantly depend on their design (i.e., input window sizes, layer depth, and amount of labeled data for training). Therefore, although CNNs provide excellent results for supervised image classification and object detection, they require a significant amount of labeled data, training time, and powerful hardware for training and testing. Thus, lacking any of these elements, particularly adequate training data, would be problematic to properly run a DL model for RS applications [3]. However, unsupervised DL models can be an alternative solution for image clustering and feature extraction [23].
Autoencoders (AEs) are one of the unsupervised architectures in deep neural networks and have been used in many applications in the field of RS. Accordingly, some researchers have employed AEs to solve critical image processing challenges such as image classification [24][25][26], clustering [23,27,28], spectral unmixing [29], and image segmentation [30,31]. AEs have also been applied to other important problems such as image fusion [32], change detection [33][34][35], pansharpening [36,37], anomaly detection [38,39], and image retrieval [40]. In recent years, the application of AEs for clustering purposes has gained much attention in the RS community. One of the first studies that used an AE for clustering was developed by [41]. They presented results demonstrating the potential of AEs to learn more distinctive and informative representations from data. Zhang et al. [42] investigated the potential of a dual AE consisting of convolutional and LSTM layers for clustering Landsat-8 imagery. Mousavi et al. [23] clustered seismic data over the features produced by their proposed CAE; this study achieved reliable precision compared with supervised methods and provided a model to facilitate the early warning of earthquakes. In [43], the authors introduced a 3D-CAE to encode embedded representations of a hyperspectral dataset and then applied clustering over them. Rahimzad et al. [28] proposed a boosted CAE to learn and extract deep features for automatic RS image clustering in urban areas.
To the best of our knowledge, no study has yet explored the possibility of unsupervised AEs for landslide detection from Sentinel-2A imagery. Therefore, our main aim in this study was to use an unsupervised DL model based on the AE for landslide detection in different geographical areas in India, China, and Taiwan. To this end, we employed an AE built from convolutional layers, known as a convolutional autoencoder (CAE). The proposed CAE was trained on Sentinel-2A images to extract discriminative features from the image, which were then clustered using mini-batch K-means [44]. Some studies [45,46] have shown that using irrelevant features is a main cause of overfitting, whereas incorporating the most relevant features usually positively influences overall image classification performance [47,48]. Therefore, selecting the most landslide-relevant features can significantly improve the discrimination of landslides from non-landslide areas [48]. Stumpf and Kerle (2011) [49] concluded that reducing the number of features led to higher landslide detection accuracy. Therefore, in this study, to improve the performance of the proposed CAE, we applied the minimum noise fraction (MNF) transformation to the Sentinel-2A multispectral bands to reduce dimensionality and separate noise from the data. Additionally, the effectiveness of auxiliary data such as the topographic slope factor and the normalized difference vegetation index (NDVI) for landslide detection was investigated [48,50]. The resulting features of the MNF transformation were then stacked with the slope and NDVI layers. We designed and trained a CAE over the stacked features; the CAE's learned features were then treated as the final deep features and clustered into classes according to their pixel values.
Moreover, to compare the impact of deep features on unsupervised landslide detection, the clustering was performed twice for each study area. First, clustering was conducted based only on the MNF features stacked with slope and NDVI. Second, it was carried out on the former features plus the deep features generated by the CAE. Finally, for accuracy assessment, the detected landslides were evaluated with validation measures such as overall accuracy, precision, recall, F1-score, and mean intersection over union (mIoU), using ground truth data.
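The validation measures named above are straightforward to compute from binary masks. The sketch below is illustrative, not the authors' code; `binary_metrics` is a hypothetical helper, and mIoU would additionally average the IoU over the landslide and background classes:

```python
import numpy as np

def binary_metrics(pred, truth):
    """Precision, recall, F1, and IoU for binary landslide masks."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    tp = np.sum(pred & truth)   # landslide pixels correctly detected
    fp = np.sum(pred & ~truth)  # false alarms
    fn = np.sum(~pred & truth)  # missed landslide pixels
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou
```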

Study Areas
During the monsoon season, which lasts from early July to almost the end of August, there are always heavy rainfalls in Kodagu, Karnataka, India (Figure 1a). During August 2018, heavy rainfall of more than 1200 mm occurred [51,52], while the annual figure for 2018, according to the India Meteorological Department's Customized Rainfall Information System (CRIS), reached almost 3650 mm. This event led to numerous small and massive landslides in the region. As a result, over 200 villages were severely affected, sixteen people lost their lives, and around 40 were listed as missing [53]. Our selected area (250 km²) is a part of Kodagu that contains several landslides. The geomorphological setting of Kodagu is dominated by hill ranges covered by dense forests and agricultural fields. The elevation varies from approximately 45 m above mean sea level (AMSL) to more than 1726 m AMSL.

Shuzheng Valley, Sichuan Province, (China)
The second study area was Shuzheng Valley, located in Sichuan Province (Figure 1b), an area with many scenic spots. Because of its rich ecosystems, lakes, and landscapes, the region is one of the leading tourist destinations in Sichuan Province [54]. On 8 August 2017, Shuzheng Valley and its scenic areas were hit by an earthquake with a magnitude of Mw 7.0, resulting in 25 deaths, 525 injuries, and USD 21 million in economic losses on top of the devastated tourism resorts [55]. In addition to the economic loss, numerous destructive events followed the earthquake, such as dam breaks, landslides, and rockfalls. The earthquake triggered nearly 1780 landslides, mostly on steep slopes. The elevation of this study area ranges from almost 2000 m AMSL to less than 4500 m AMSL. The climate of Shuzheng Valley is semi-humid, and the monsoon season in this area lasts from May until September; most of the landslides and debris flows also take place during this season [55]. For this site, an area of 69.78 km² with several landslides was selected.

Western Taitung County (Taiwan)
The other selected study area was Western Taitung County in Taiwan (Figure 1c). The most powerful and destructive typhoon in Taiwan's recorded history, Typhoon Morakot, struck the island in August 2009. In five days, the tropical storm dumped an aggregate of 2884 mm of rain on southern Taiwan, leading to floods and landslides affecting a region of 274 km² [56]. Consequently, this disaster left 652 people dead, 47 missing, and over three billion dollars in damage [56]. Typhoon Morakot caused more than 22,700 landslides in the mountainous areas. Most of the landslides were shallow, while some were characterized as deep-seated [57]. As with the previous study areas, an area of 348.12 km² with sufficient landslides was chosen for this site. The elevation of this study area reaches up to 3950 m AMSL.

Overall Workflow
In this study, freely available Sentinel-2 multispectral images in 11 bands (B2-B8A, B10-B12) were used along with the slope factor for landslide detection. The overall workflow of this study is as follows (see Figure 2):


- Preparing and resampling multispectral images, NDVI, and the slope factor;
- Applying MNF on the multispectral images for dimensionality reduction;
- Stacking the slope factor and NDVI with the resulting features from the MNF;
- Feeding the CAE with the stacked data;
- Clustering the CAE deep features using mini-batch K-means; and
- Evaluating the clustering results for landslide detection through various accuracy assessment metrics.

Figure 2. Workflow of the methodology for landslide detection using CAE.
Datasets

Sentinel-2A Data
In this study, Sentinel-2A multispectral images were used for landslide detection. The image acquisition dates for the study areas in India, China, and Taiwan were 14 September 2018, 15 May 2018, and 30 June 2019, respectively; these dates had the lowest percentage of cloud cover (less than five percent) over the given areas. The Google Earth Engine (GEE) environment was used to search, acquire, and mask images according to the study areas. In GEE, there are two types of Sentinel-2A data, namely Level-1C and Level-2A. The former dataset has global coverage and provides top-of-atmosphere (TOA) reflectance images, but without atmospheric, terrain, and cirrus corrections.
The latter, in contrast, does include all these corrections [58]. Although Level-2A seems to be the best product for remote sensing applications, its coverage in GEE is not global. Therefore, we applied these corrections using the Sen2Cor [59] plugin in the SNAP software from the European Space Agency (ESA). Sentinel-2A data include 13 bands covering various parts of the electromagnetic spectrum at spatial resolutions of 10, 20, and 60 m. More information on Sentinel-2A images is available in the Sentinel User Handbook [58]. Following the corrections, the aerosol (B1) and water vapor (B9) bands, which are used for coastal and aquatic applications [60] and atmospheric corrections [61], were excluded from the final image stack because they do not provide relevant information for landslide detection. The Sentinel-2 images and DEM data had different coordinate systems and spatial resolutions. In the pre-processing stage, the slope factor generated from the DEM was resampled to 10 m to match the spatial resolution of the corresponding satellite images. Furthermore, in each case study, the same UTM coordinate system was used for the satellite images and the slope factor, and the slope factor was then georeferenced based on the Sentinel-2 images so that they were spatially matched. An NDVI layer was then generated, matching both the images and the slope factor in terms of dimensions, spatial resolution, and coordinate system. Finally, due to the spatial resolution variation among the multispectral bands, all images were resampled to a 10-m spatial resolution as input for the MNF.

Landslide Inventory Data
In contrast to supervised DL models, in which inventory data are used for training and testing, we prepared the inventory data only to test our unsupervised model. Post-landslide PlanetScope VHR imagery from Planet and Google Earth images were used to digitize the landslides visually. The PlanetScope constellation consists of more than 120 optical Dove satellites that provide VHR multispectral images in four bands, namely Blue, Green, Red, and NIR, with a spatial resolution of 3 m [62]. PlanetScope images were used for landslide digitizing; however, for the study area in Taiwan, since no PlanetScope images were available, we used Google Earth images to digitize landslides in the QGIS 3.10 software. Table 1 provides detailed information regarding the landslide features in each study area.

Slope Factor
Surface topography is one of the critical factors in landslides, particularly in hilly and mountainous areas [63]. Since the probability of landslides on steep slopes is very high, the slope factor can play a significant role in landslide susceptibility modeling and landslide detection [64]. In our cases, the topography of the study areas in China and Taiwan is associated with steep slopes, while the study area in India is mostly hilly. Therefore, we chose the slope factor as another essential layer to be fed into the CAE for landslide detection. The steepness of the surface or terrain can be represented through slope angles, which correlate directly with the probability of landslides. Thus, for the three cases, we extracted slope angle data from a digital elevation model (DEM) with a spatial resolution of 12.5 m generated by the ALOS PALSAR sensor [65]. The slope factor was then resampled to 10 m to be consistent with the other layers fed into our CAE model.
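Deriving slope angles from a DEM grid can be sketched with central-difference gradients via `np.gradient`; this is a simplification assumed here for illustration (GIS tools typically use the Horn method), with the 12.5 m cell size taken from the ALOS PALSAR DEM described above:

```python
import numpy as np

def slope_deg(dem, cell_size=12.5):
    """Slope angle in degrees from a DEM on a square grid.

    Uses central-difference gradients; simpler than the Horn
    method used by typical GIS software.
    """
    dzdy, dzdx = np.gradient(dem, cell_size)      # rise per metre
    return np.degrees(np.arctan(np.hypot(dzdx, dzdy)))
```

For a plane tilted at 30 degrees, every interior cell evaluates to a slope of 30.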

Normalized Difference Vegetation Index (NDVI)
NDVI is a numerical indicator for analyzing and mapping vegetation cover and other ecological applications [66,67]. It can also be an asset in landslide detection because landslide-affected areas are mostly not vegetated, so landslides can be better distinguished from non-landslide areas [10]. This index is derived from the Red and NIR bands of multispectral satellite images such as the Landsat series and the Sentinel-2 constellation (Equation (1)):

NDVI = (NIR − Red) / (NIR + Red). (1)

Since it is a normalized ratio, its values range from −1 to 1; values less than zero typically indicate water, while values closer to 1 indicate denser vegetation cover. In this case, we derived NDVI using the Red (band 4) and NIR (band 8) bands of the Sentinel-2A images generated with the Sen2Cor plugin in the QGIS environment (https://plugins.qgis.org/plugins/sen2cor_adapter/, accessed on 7 November 2021).
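Equation (1) is straightforward to compute from the band arrays; the small epsilon guarding against division by zero over dark pixels is an implementation assumption, not part of the definition:

```python
import numpy as np

def ndvi(red, nir):
    """NDVI = (NIR - Red) / (NIR + Red) for band arrays (Equation (1))."""
    red = np.asarray(red, dtype=float)
    nir = np.asarray(nir, dtype=float)
    return (nir - red) / (nir + red + 1e-12)  # epsilon avoids 0/0
```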

Minimum Noise Fraction (MNF) Transformation
To reduce noise, computational requirements, and dimensionality, the MNF transformation was applied to the Sentinel-2A multispectral data before extracting deep features through the CAE. This transformation, a modified version of PCA proposed by Green et al. [68], systematically reduces and segregates noise from the data through a normalized linear combination that maximizes the signal-to-noise ratio. The MNF method reduces the spectral band dimensions through a linear transformation in which the data are rotated twice using PCA. First, a de-correlation and rescaling (noise-whitening) step removes band-to-band correlations and yields data with unit noise variance. The second rotation then applies a standard PCA to the noise-whitened data [69]. The components resulting from an MNF transformation are ordered by image quality: whereas PCA ranks components by variance, with the variance decreasing as the component number increases, MNF ranks components by their signal-to-noise ratio (SNR), so the first component has the highest SNR and information content [70,71]. The mathematical expression of MNF is as follows [72]. Assume noisy data x with n bands of the form x = s + e, where s and e are the uncorrelated signal and noise components of x. The covariance matrices of s and e are then related as (Equation (2)):

Σ = Cov{x} = Σ_S + Σ_N, (2)

where Σ_S and Σ_N are the covariance matrices of the signal and noise, respectively. The ratio Var{e_i}/Var{x_i} is the noise fraction of band i, i.e., the ratio of the noise variance to the total variance of that band. The MNF transform then chooses the linear transformation (Equation (3)):

y_i = a_i^T x, i = 1, 2, . . . , n, (3)

where y is the new dataset (n bands) Y^T = (y_1, y_2, y_3, . . . , y_n), a linear transform of the original data, and the linear transform coefficients A = (a_1, a_2, a_3, . . . , a_n) are obtained by solving the eigenvalue equation (Equation (4)):

Σ^{-1} Σ_N a_i = λ_i a_i, (4)

where Λ is the diagonal matrix of the eigenvalues, and λ_i, the eigenvalue corresponding to a_i, equals the noise fraction in y_i, i = 1, 2, . . . , n. We performed the MNF transformation using the Spectral Python 0.21 library.
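As a rough illustration of Equations (2)-(4), the numpy sketch below estimates the noise covariance from horizontally adjacent pixel differences (a common shift-difference heuristic, assumed here since the paper does not state its noise model) and solves the eigenproblem Σ⁻¹Σ_N a = λa, keeping the components with the smallest noise fractions:

```python
import numpy as np

def mnf_transform(cube, n_components=3):
    """Minimum noise fraction of an (H, W, B) image cube (sketch)."""
    H, W, B = cube.shape
    X = cube.reshape(-1, B).astype(float)
    X -= X.mean(axis=0)
    # Noise estimate: differences of horizontally adjacent pixels
    N = (cube[:, 1:, :] - cube[:, :-1, :]).reshape(-1, B) / np.sqrt(2)
    cov_x = np.cov(X, rowvar=False)
    cov_n = np.cov(N, rowvar=False)
    # Eigenproblem of Equation (4): eigenvalues are noise fractions,
    # so the smallest ones correspond to the most informative bands
    vals, vecs = np.linalg.eig(np.linalg.solve(cov_x, cov_n))
    order = np.argsort(vals.real)              # ascending noise fraction
    A = vecs[:, order].real[:, :n_components]
    return (X @ A).reshape(H, W, n_components)
```

In practice, the Spectral Python library mentioned above provides this directly; the hand-rolled version is only meant to make the eigenvalue ordering explicit.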

Convolutional Auto-Encoder (CAE)
AEs are a widely used deep neural network architecture that uses its own input as the training label. The network tries to reconstruct its input during the learning process and, in doing so, automatically extracts the most representative features over a sufficient number of iterations [25,73,74]. This type of network is constructed by stacking deep layers in an AE form consisting of two main parts, an encoder and a decoder (see Figure 3). The encoder transforms the input data into a new feature space through a mapping function, while the decoder tries to rebuild the original input data from the encoded features with minimum loss [23,75,76]. The middle hidden layer of the network (bottleneck) is considered the layer of interest and is employed as the deep discriminative features [77]. Since the bottleneck is the layer from which the AE reconstructs and usually has smaller dimensionality than the original data, the network forces the learned representations to capture the most salient features of the data [74]. A CAE is a type of AE employing convolutional layers to discover the inner information of images [76]. In a CAE, the weights are shared among all locations within each feature map, thus preserving spatial locality and reducing parameter redundancy [78]. More detail on the applied CAE is described in Section 3.4.1.

To extract the deep features, let D, W, and H denote the depth (i.e., number of bands), width, and height of the data, respectively, and let n be the number of pixels. For each member x_i of the set X, an image patch of size 7 × 7 × D centered on x_i is extracted. The set X can thus be represented as image patches, and each patch x*_i is fed into the encoder block.

For the input x*_i, the hidden-layer mapping (latent representation) of the k-th feature map is given by (Equation (5)) [79]:

h^k = σ(x*_i ∗ W^k + b^k), (5)

where b^k is the bias, σ is an activation function, in this case a parametric rectified linear unit (PReLU), and the symbol ∗ corresponds to the 2D convolution. The reconstruction is obtained using (Equation (6)):

y = σ( Σ_{k∈H} h^k ∗ W̃^k + b̃ ), (6)

where there is one bias b̃ for each input channel, H identifies the group of latent feature maps, W̃ corresponds to the flip operation over both dimensions of the weights W, and y is the predicted value [80]. To determine the parameter vector θ representing the complete CAE structure, one can minimize the following cost function (Equation (7)) [25]:

E(θ) = (1/2n) Σ_{i=1}^{n} ||x*_i − y_i||², (7)

To minimize this function, we calculate the gradient of the cost function with respect to the convolution kernels (W, W̃) and biases (b, b̃) [80] (see Equations (8) and (9)):

∂E(θ)/∂W^k = x*_i ∗ δh^k + h̃^k ∗ δy, (8)

∂E(θ)/∂b^k = Σ δh^k, ∂E(θ)/∂b̃ = Σ δy, (9)

where δh and δy are the deltas of the hidden states and of the reconstruction, respectively, and h̃^k denotes the feature map flipped over both dimensions. The weights are then updated using the optimization method [81]. Finally, the CAE parameters can be obtained once the loss function converges. The output feature maps of the encoder block are considered the deep features. In this work, batch normalization (BN) [82] was applied to tackle the internal covariate shift phenomenon and improve the performance of the network by rescaling and re-centering the layer inputs [83]. BN helps the network learn faster as well as increase accuracy [84].
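Equations (5) and (6) can be illustrated with a minimal numpy forward pass for a single patch. This is only a sketch: the kernel values, biases, and PReLU slope are illustrative assumptions, and a single-channel patch stands in for the 7 × 7 × D input:

```python
import numpy as np

def conv2d_valid(x, w):
    """Plain 'valid' 2D cross-correlation of image x with kernel w."""
    kh, kw = w.shape
    H, W = x.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def prelu(z, alpha=0.1):
    return np.where(z > 0, z, alpha * z)

def cae_forward(x, kernels, biases, b_tilde, alpha=0.1):
    """Equation (5): h^k = PReLU(x * W^k + b^k);
    Equation (6): y = PReLU(sum_k h^k *_full flip(W^k) + b_tilde)."""
    hs = [prelu(conv2d_valid(x, Wk) + bk, alpha)
          for Wk, bk in zip(kernels, biases)]
    kh, kw = kernels[0].shape
    y = np.zeros_like(x)
    for h, Wk in zip(hs, kernels):
        Wf = Wk[::-1, ::-1]                     # flip over both dimensions
        hp = np.pad(h, ((kh - 1,), (kw - 1,)))  # pad for 'full' convolution
        y += conv2d_valid(hp, Wf)               # restores the input size
    return hs, prelu(y + b_tilde, alpha)
```

The 'full' convolution with the flipped kernel is what makes the reconstruction y the same size as the input patch x.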

Parameter Setting
Before introducing the proposed CAE's hyperparameter setting, we describe the network's framework and configuration for image patches in detail (Table 2). In the encoder block, the numbers of filters of CNN1 and CNN2 are set to 8 and 12, respectively, and the kernel sizes of CNN1 and CNN2 are both set to 3 × 3. In the decoder block, the kernel size is set to 1 × 1 to use the complete spatial information of the input cube; here we chose 8 and D (i.e., the number of bands) for the outputs of the convolutional layers CNN3 and CNN4, respectively. Based on trial and error over different combinations with Keras Tuner, for the three experimental datasets, the learning rate, batch size, and number of epochs were set to 0.1, 10,000, and 100, respectively. In the next step, we set the parameters of the regularization techniques. In the proposed network model, BN [82] is taken into account as a regularization technique; as already mentioned, BN is used to tackle the internal covariate shift phenomenon [85]. Accordingly, BN is applied to the third dimension of each layer output to make the training process more efficient. The Adam optimizer [86] was used to optimize the Huber loss function in the training process. Afterward, the optimized hyperparameters were applied in the prediction procedure, which provides the final deep features.
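The configuration above can be sketched as a hypothetical Keras model. The filter counts, kernel sizes, BN/PReLU placement, optimizer, and Huber loss follow the description, but the padding, strides, and exact layer ordering are assumptions not fully specified here:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cae(depth):
    """CAE sketch for 7x7xD patches: encoder CNN1 (8 filters, 3x3) and
    CNN2 (12 filters, 3x3), decoder CNN3 (8 filters, 1x1) and CNN4
    (D filters, 1x1), each followed by BN and PReLU (assumed layout)."""
    inp = layers.Input(shape=(7, 7, depth))
    x = inp
    for filters, k in [(8, 3), (12, 3), (8, 1), (depth, 1)]:
        x = layers.Conv2D(filters, k, padding="same")(x)
        x = layers.BatchNormalization()(x)   # tackles internal covariate shift
        x = layers.PReLU(shared_axes=[1, 2])(x)
    model = tf.keras.Model(inp, x)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
                  loss=tf.keras.losses.Huber())
    return model
```

Training would call `model.fit(patches, patches, batch_size=10000, epochs=100)`, with the deep features read from the CNN2 (bottleneck) output.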

Mini-Batch K-Means
One of the most widely used methods for clustering remote sensing imagery is K-means because it is easy to implement and does not require labeled data for training. However, as the size of the dataset increases, its performance degrades since it requires the whole dataset in main memory [44], and in most cases such computational resources are not available. To overcome this challenge, Sculley [44] introduced mini-batch K-means, a modified, fast, and memory-efficient clustering algorithm based on K-means. The main idea behind mini-batch K-means is to reduce the computational cost by using small random batches of data with a fixed size that standard computers can handle. This algorithm introduces lower stochastic noise and requires less computation time when clustering large datasets compared to standard K-means. More information on mini-batch K-means can be found in [44,86]. In this work, a mini-batch K-means algorithm with a batch size of 150, an initialization size of 500, and a learning rate based on the inverse of the number of samples was used to cluster the features generated by the CAE. We performed clustering with two to five classes to categorize landslide features, and the most suitable clustering was then selected through visual inspection. In the final step, the non-landslide classes were merged to obtain only two classes: landslide and background.
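The clustering step above can be sketched with scikit-learn's MiniBatchKMeans. The batch size (150) and initialization size (500) follow the text; the feature matrix here is synthetic and stands in for the per-pixel CAE deep features.

```python
# Mini-batch K-means on per-pixel features (synthetic stand-in data).
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(4000, 12))   # e.g., 12 deep features per pixel

mbk = MiniBatchKMeans(n_clusters=4, batch_size=150, init_size=500,
                      random_state=0, n_init=3)
labels = mbk.fit_predict(features)       # one cluster id per pixel
```

After clustering, the labels are reshaped back to the image grid, and all non-landslide clusters are merged into a single background class.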

MNF Transformation
The MNF transformation was applied in the pre-processing stage. We carried out this transformation in Python for each case study with the help of the Scikit-Learn 1.0 package, which produced 11 transformed images, ordered from the most informative bands to the noisiest ones, with the same size as the original dataset. In all case studies, it is noticeable from Table 3 and Figure 4 that beyond the third MNF band the features become noisy and contain no explicit information. Moreover, considering the statistical information about the transformed components, the first three layers (Figure 5) comprise more than 80% of the information of the 11 image bands. Therefore, to reduce redundancy and noise, the first three MNF bands were selected and then stacked with NDVI and the slope layer, derived from the DEM in QGIS 3.10, to be fed into the CAE model.
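Scikit-learn has no built-in MNF, so the transformation is commonly implemented as PCA applied to noise-whitened data. The numpy sketch below illustrates this idea under the usual assumption that noise can be estimated from differences between adjacent pixels; it is not the authors' exact implementation.

```python
# Illustrative MNF sketch: PCA on noise-whitened data, with noise
# estimated from horizontal shift-differences between adjacent pixels.
import numpy as np

def mnf(cube):
    """cube: (rows, cols, bands) -> (rows, cols, bands) MNF components,
    ordered from highest to lowest signal-to-noise ratio."""
    r, c, b = cube.shape
    X = cube.reshape(-1, b).astype(float)
    X -= X.mean(axis=0)
    # Noise estimate: differences of horizontally adjacent pixels
    noise = (cube[:, 1:, :] - cube[:, :-1, :]).reshape(-1, b)
    Sn = np.cov(noise, rowvar=False)
    # Whiten with respect to the noise covariance
    evals, evecs = np.linalg.eigh(Sn)
    W = evecs / np.sqrt(np.maximum(evals, 1e-12))
    Xw = X @ W
    # PCA on the whitened data: components ordered by variance (= SNR)
    _, _, Vt = np.linalg.svd(Xw - Xw.mean(axis=0), full_matrices=False)
    return (Xw @ Vt.T).reshape(r, c, b)
```

Selecting the first three output bands then corresponds to keeping the components with the highest signal-to-noise ratio, as done in the text.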

Experimental Results from the Proposed CAE Model
Achieving an optimal model requires tuning many inter-dependent parameters [28]. In this work, the optimal parameters for the learning process, particularly the number of filters of the encoder output, the learning rate, and the batch size, were obtained via the Keras Tuner optimizer developed by the Google team [87]. We tested learning-rate values of 0.1, 0.01, and 0.001 and batch sizes of 100, 1000, and 10,000. Additionally, a range of 10 to 15 was explored for the number of feature maps in the last layer of the encoder for the three cases. As a result, in all cases, the optimal values for the learning rate, batch size, and number of feature maps were 0.1, 10,000, and 12, respectively. Furthermore, the Huber loss function was used to evaluate the reconstruction loss of the inputs. A conventional loss function such as mean squared error (MSE) is strongly influenced by outliers, so the loss value depends on outlier data; Huber, in contrast, is a robust loss function that is less sensitive to outliers and does not assign them higher weights [88]. To evaluate the impact of slope and NDVI on the loss value, we applied the Huber loss to three different sets of data: MNF, MNF + slope, and MNF + slope + NDVI. The results (see Figure 6) indicated that the Huber loss function yields the lowest reconstruction loss for the stack of MNF features, slope, and NDVI layers in each case. Figure 7 shows the improvement in the reconstruction loss based on the Huber loss function for each study area over 100 epochs. All experiments were conducted on a machine with an Intel(R) Core(TM) i9-7900X 3.31 GHz CPU, 32 GB RAM, and an NVIDIA GeForce GTX 1080 12 GB GPU, and the CAEs were applied to our dataset in Python 3.7 using the TensorFlow 2.1.0 package.
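The robustness argument for the Huber loss can be made concrete with a small numerical example: for residuals below the threshold δ, Huber matches half the squared error, while larger residuals are penalized only linearly, so a single outlier contributes far less to the total loss.

```python
# Huber vs. (half) squared error on residuals containing one outlier.
import numpy as np

def huber(residual, delta=1.0):
    a = np.abs(residual)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))

residuals = np.array([0.1, 0.2, -0.1, 5.0])   # last value is an outlier
mse_terms = 0.5 * residuals**2
huber_terms = huber(residuals)

# The outlier term is 12.5 under squared error but only 4.5 under Huber;
# the small residuals are penalized identically by both.
```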

The stack of MNF features, slope, and NDVI layers was fed to the optimized CAE model (using the optimal parameters obtained in the tuning process), and the model's outputs were 12 reconstructed features of the input data; Figure 9 shows the reconstructed deep features for China's case study. In addition, Figure 11 shows the minimum and maximum values of each deep feature for the different case studies.

Clustering Deep Features Using Mini-Batch K-Means
Mini-batch K-means, available in the Scikit-learn 1.0 package, was used to cluster the input data and the deep features separately. Since landslide inventory data were not used to train the model, the number of classes was unknown; thus, we generated clustering maps with a minimum of two and a maximum of five classes for both the input data and the deep features. Through visual analysis of the clustering maps, we chose the map with the best result in each case. For example, in all cases, clustering MNF features stacked with slope and NDVI provided the best result with five classes, whereas this was not the case for clustering based on deep features: the best results were obtained with four classes for mapping landslides in India (Figure 12) and China (Figure 13), while for Taiwan (Figure 14), clustering with three classes provided the highest accuracy. More information on the accuracy assessment is available in Section 4.3.


Accuracy Assessment
In this study, the accuracy of the detected landslides was evaluated based on four widely applied metrics, namely precision, recall, F1-score [89], and the mean intersection over union (mIOU) [4] (Figure 15), which is commonly used in computer vision. Precision measures the model's performance in detecting landslides, recall indicates the proportion of actual landslides that were detected, and the F1-score expresses the balance between precision and recall. Where the inventory datasets are polygon-based, mIOU is an appropriate metric for assessing a model's accuracy in detecting objects such as landslides: the mean area of overlap is divided by the mean area of the union of the detected landslides and the inventory map in polygon format. These metrics are expressed in Equations (10)-(13):

Precision = TP / (TP + FP)  (10)
Recall = TP / (TP + FN)  (11)
F1-score = 2 × Precision × Recall / (Precision + Recall)  (12)
mIOU = (1/N) Σ_i |P_i ∩ G_i| / |P_i ∪ G_i|  (13)

where P_i and G_i denote the i-th detected landslide polygon and its corresponding inventory polygon, respectively.

In these equations, TP (true positive) denotes correctly detected landslides; FP (false positive) denotes features detected as landslides that, according to the inventory map, are not landslides; and FN (false negative) denotes undetected landslides.
According to the inventory map of the case study in India, the precision of landslide detection through clustering MNF features stacked with slope and NDVI was 28%, while its recall reached 73%, showing that the clustering was relatively good at mapping actual landslides; however, due to the low precision, the F1-score was only 41%. On the other hand, the precision of landslides detected through clustering deep features reached 76%, almost three times higher than the previous value, and recall increased significantly to 91%, resulting in an F1-score of 83%. Although the precision value for India's case was low, it was better for the case in China, reaching 44%; however, the recall there was not promising at 34%, indicating that only about one-third of the landslides were correctly detected.
Furthermore, the precision, recall, and F1-score values for landslides mapped through clustering deep features in China's case were 72%, 87%, and 79%, respectively. Finally, clustering based on MNF features, slope, and NDVI for Taiwan's case provided much better results than the other cases under the same clustering scenario: precision and recall reached 58% and 70%, respectively, leading to an F1-score of 63%, higher than the same metric in the other cases. In this case, clustering using deep features, as in the other cases, yielded higher precision, recall, and F1-score values of 77%, 82%, and 79%, respectively. The mIOU also shows that clustering with deep features provides more robust results than clustering only the MNF features and auxiliary datasets such as slope and NDVI. For the cases in India and China, the mIOU for clustering based on deep features was 70% and 60%, respectively, whereas the corresponding values for clustering without deep features were 26% and 24%. In contrast, the mIOU for Taiwan's landslide map was relatively better for both clustering maps, primarily because of the higher precision values in both clustering scenarios: clustering using MNF features, slope, and NDVI resulted in an mIOU of 50%, while for deep features it reached 81%. More details are available in Table 4.
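For reference, the four metrics can be computed from binary landslide masks as in the following sketch; `pred` and `truth` are illustrative toy arrays, and the IoU here is computed pixel-wise for the landslide class rather than per polygon.

```python
# Precision, recall, F1-score, and IoU from binary masks
# (1 = landslide, 0 = background); toy data for illustration.
import numpy as np

pred  = np.array([1, 1, 0, 1, 0, 0, 1, 0])
truth = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((pred == 1) & (truth == 1))   # correctly detected landslides
fp = np.sum((pred == 1) & (truth == 0))   # false detections
fn = np.sum((pred == 0) & (truth == 1))   # missed landslides

precision = tp / (tp + fp)                                  # Eq. (10)
recall    = tp / (tp + fn)                                  # Eq. (11)
f1        = 2 * precision * recall / (precision + recall)   # Eq. (12)
iou       = tp / (tp + fp + fn)           # IoU of the landslide class
```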

Discussion
This study used an unsupervised model, a CAE combined with the mini-batch K-means clustering algorithm, to detect landslides. The main novelty of the present study was transforming and using multi-source data in a deep learning model that can extract features helpful for mapping landslides without the supervision of labeled training data. Applying MNF to the multispectral data reduced the dimensions from 13 to three layers while retaining more than 80% of the information in all cases. Stacking these with the slope and NDVI layers and training a CAE on them as mid-level features produced feature maps that increased the overall accuracy of landslide clustering. This study clearly shows that using relevant data and appropriate methods can improve the results even in unsupervised clustering problems, and it indicates that a similar dataset processed with different techniques can produce different results when mapping phenomena such as landslides. According to the inventory data, clustering MNF features along with the slope and NDVI layers yielded lower accuracy than clustering based on the deep features learned from them. For example, in the case study in India, despite adding slope, which is an essential factor in landslide detection [4], to the MNF features, residential areas were still clustered together with landslides (Figure 16(a2)); in contrast, there was no such issue in the clustering maps based on deep features (Figure 16(a3)). The same issue exists in the clustering for landslides in China and Taiwan, but instead of residential areas, most rivers were clustered into the same classes as landslides. For example, in Figure 16(b2), which relates to landslides in China, almost all rivers were clustered into the same class as landslides, while in Figure 16(c2), which shows clustering based on deep features, this issue did not occur and rivers are not in the same class as landslides.
The same scenario occurred for landslides in Taiwan (Figure 16(b3)), and clustering based on deep features solved this problem. Clustering with deep features provides relatively good results because the CAE learned and extracted multiple features from the five inputs (three MNF features, slope, and NDVI), enabling the clustering to be more precise, particularly for landslides. In some deep features, landslides were separable from residential areas and rivers in all cases, which indicates that the parameters the autoencoder learned from the input data captured both spatial and spectral information, consequently leading to much better clustering maps than those based on MNF features, slope, and NDVI alone. The results therefore rely on deep features that could extract more consistent landslide segment boundaries. As shown in Figure 16(b1,b2), detecting smaller, narrow-shaped landslides is difficult; however, our proposed CAE overcame this issue and extracted landslides with these spatial characteristics (Figure 16(c1,c2)).
In a study similar to ours, Bărbu et al. [90] applied a CAE to Landsat multi-temporal images for landslide change detection in Vâlcea County, Romania. They achieved an accuracy of 96.13% for landslide change detection by extracting features from pre- and post-landslide images. Although they achieved higher accuracy than our accuracy scores, they used image pairs whereas we used only single-date images, and our study areas differed. In addition, our results support their claim that CAE models are effective at extracting meaningful features from medium-resolution satellite imagery such as Sentinel-2 and Landsat 8. Despite the outstanding performance of both unsupervised and supervised deep learning models in feature extraction and classification, they still suffer from some issues. For example, there is no standard procedure for defining a deep learning model: in most cases, different parameters such as window size and kernel size, as well as the activation function, loss function, and input dataset, must be examined to achieve a good one. For instance, we tried three different loss functions and obtained the lowest loss value with the Huber function, which indicates that choosing the wrong parameters can adversely affect the performance of a deep learning model. Finally, achieving relatively high precision and recall scores in our study using Sentinel-2 data and, more importantly, through an unsupervised procedure, shows that CAE models have remarkable capabilities in satellite image processing, handling multi-source data, and feature extraction, and they deserve further attention.

Conclusions
Supervised DL and ML models have demonstrated their potential for accurate landslide mapping. However, the transferability of a trained DL model depends on the data used for training, and accuracy is also related to the quality and quantity of the training data. Generally, remote sensing image classification and object detection using supervised ML and DL models face two main issues: providing sufficient labeled data to train the model, and relatively long training times due to the many training parameters that must be adjusted and optimized. Therefore, mapping and updating landslide inventories with supervised DL and ML models remain challenging tasks in the RS community. In this study, to address this issue, we used a CAE model to extract deep features from Sentinel-2A images and auxiliary data such as NDVI and slope, and then clustered them to detect landslides. Our results explicitly indicate that a clustering map based on deep features is competitive with state-of-the-art results.
Moreover, this study shows that in an emergency, combining satellite imagery and topographical data with a CAE and the mini-batch K-means clustering algorithm can be a reliable approach for rapid landslide mapping and can provide a primary inventory map. Future studies will focus on designing (almost) identical networks and implementing them in both supervised and unsupervised procedures for landslide detection; comparing the results of these matched networks may help improve the unsupervised one toward results more accurate than those of the supervised network.