An Efficient Multi-Sensor Remote Sensing Image Clustering in Urban Areas via Boosted Convolutional Autoencoder (BCAE)

High-resolution urban image clustering has remained a challenging task. This is mainly because its performance strongly depends on the discrimination power of features. Recently, several studies focused on unsupervised learning methods by autoencoders to learn and extract more efficient features for clustering purposes. This paper proposes a Boosted Convolutional AutoEncoder (BCAE) method based on feature learning for efficient urban image clustering. The proposed method was applied to multi-sensor remote-sensing images through a multistep workflow. The optical data were first preprocessed by applying a Minimum Noise Fraction (MNF) transformation. Then, these MNF features, in addition to the normalized Digital Surface Model (nDSM) and vegetation indexes such as Normalized Difference Vegetation Index (NDVI) and Excess Green (ExG(2)), were used as the inputs of the BCAE model. Next, our proposed convolutional autoencoder was trained to automatically encode upgraded features and boost the hand-crafted features for producing more clustering-friendly ones. Then, we employed the Mini Batch K-Means algorithm to cluster deep features. Finally, the comparative feature sets were manually designed in three modes to prove the efficiency of the proposed method in extracting compelling features. Experiments on three datasets show the efficiency of BCAE for feature learning. According to the experimental results, by applying the proposed method, the ultimate features become more suitable for clustering, and spatial correlation among the pixels in the feature learning process is also considered.


Introduction
Satellite and airborne image classification is one of the most demanding remote sensing (RS) applications [1].In general, image classification can be categorized as supervised and unsupervised approaches [2].Although supervised algorithms perform better than unsupervised ones, they require labeled or training samples.As a result, the unavailability of such high-quality and high-quantity training data justifies the use of clustering algorithms [3].Another advantage of unsupervised methods is that they carry out the classification task by extracting valuable information from the data without any a priori knowledge or specific assumption about data.
Clustering has been one of the fundamental research topics in data-mining studies [4,5].It is usually referred to as a subcategory of an unsupervised algorithm employed for dividing data into categories of a similar pattern without using training samples [6].Furthermore, the efficiency of machine learning algorithms highly relies on the representation or feature selection from data [7,8].In other words, choosing proper feature representation makes classification more efficient and accurate.Thus, proper feature representation for classification is an open question that has recently received much scientific attention [9].
Feature extraction methods try to transform the data space into a new feature space that is more compatible with clustering tasks [10].Depending on the used features, landcover identification methods can generally be based on hand-crafted features, unsupervisedlearned features, and deep automated extracted features [11].
The first generation of research works on scene classification have been mostly based on hand-crafted features, requiring a considerable amount of domain-expert knowledge and time-consuming hand-tuning [12].Traditionally, various hand-crafted feature extraction methods, such as scale-invariant feature transform (SIFT) [13], local binary pattern (LBP) [14], non-negative matrix factorization (NMF) [15], and complex wavelet structural similarity (CW-SSIM) [16], have been applied to extract the helpful information.
Automatic feature learning methods are considered a more helpful approach to overcome the limitations of hand-crafted based feature extraction methods [17].A typical and commonly used unsupervised feature learning method is principal component analysis (PCA) [18].However, such methods fail to model the nonlinear data structures [19].Compared to traditional unsupervised feature-learning techniques, the deep-learning-based method includes several processing units and layers and can extract more abstract representations from data with the hierarchy of complexity levels.Furthermore, deep-feature learning approaches effectively detect complicated patterns and hidden information in high-dimensional data.These make it very promising to use deep-learning-based feature extraction method for classification tasks [20].
The autoencoder (AE) deep neural network produces a nonlinear mapping function through a learning process during iterations.The encoders map the input data to their corresponding feature representation, and then the decoder regenerates input from the features extracted by the encoding procedure.The learning procedure is iterative, ensuring that the mapping function is efficient for the input data [21].Clustering based on independent features extracted from AEs has been studied in recent years.Song et al. [22] have developed one of the first AE-based clustering models.They obtained a practical and abstract feature, which is more informative for clustering purposes.They showed that their proposed deep network could learn a nonlinear mapping by effectively partitioning the transformed feature space [23].Huang et al. [24] proposed a deep embedded network to use a multilayer Gaussian restricted Boltzmann machine (GRBM) for feature extraction with preserving spatial locality and group sparsity constraints, enabling the model to learn more robust representations for clustering tasks.Tian et al. [25] proposed a graph-based encoder method to use deep sparse AEs for clustering and obtained better accuracy than spectral ones.
Instead of using AE as a preprocessing, joint AE and clustering were also considered in the literature [26].In Reference [27], a deep embedded clustering (DEC), as a jointly optimized algorithm that learns feature and clusters data simultaneously, is presented.The DEC works by mapping the data space to a new feature space through iteratively optimizing a clustering objective.Guo et al. [28] argued that the feature space could be adversely affected by clustering loss of DEC; accordingly, improved deep embedded clustering (IDEC) was presented to integrate the AE clustering loss and reconstruction loss and improve the performance by preserving the local-spatiality of data.Guo et al. argued the inefficiency of dense layers for clustering tasks and then used a convolutional autoencoder (CAE).Their results demonstrated improvement in both DEC and IDEC accuracies [29].Thus, the method is introduced as deep convolutional embedded clustering (DCEC).
Recently, several research works have utilized AEs for clustering applications and demonstrated notable performances over clustering using only hand-crafted features [8,30,31].In the multilayer structure of deep learning models, the feature of one level is transformed into a higher and more abstract level [32].However, AE had little contribution because it cannot guarantee that similar input data obtain similar representations in the latent space, which is essential for clustering purposes [33].On the other hand, several studies, including References [34][35][36][37], have indicated that hand-crafted features, despite their simplicity, contain information independent of the deep features.
Moreover, in remote-sensing image classification, single-source data generally cannot achieve high accuracy due to the lack of rich and diverse information to distinguish different land-cover classes [38].Multisource data, such as optical, radar, lidar, thermal, etc., have been used to overcome these limitations and significantly increased the discrimination among the land-cover classes [39][40][41].However, the limited representation of the handcrafted features extracted from multi-sensor data remains a significant challenge and needs further investigation.This leads to the overlapped classes and corrupts the performance of classification [42].Therefore, to simultaneously use the advantages of deep learning methods and available information in multi-sensor features in the proposed method in this article, instead of using raw images, hand-crafted features were used for training the CAE.The hand-crafted features applied are normalized difference surface model (nDSM), normalized difference vegetation index (NDVI) or Excess Green (ExG(2)), and components of minimum noise fraction (MNF) transformation.
To the best of our knowledge, no study has yet explored the possibility of CAEs for land cover clustering in urban areas.In this paper, a new boosted convolutional autoencoder (BCAE) with hand-crafted features is proposed to extract more effective deep features for clustering RS images automatically.The proposed model uses functional mid-level features to train a light-weighted network to increase the separation of clusters in feature space and boost the clustering results instead of applying raw images as input or employing complex network architectures.The model results are compared with the three most commonly used sets of features to prove the efficiency of the proposed method in extracting robust features.In addition, three different datasets were used to verify the performance of our proposed method compared to the competing sets.Experimental results demonstrate the effectiveness of the proposed method over the other three competing feature sets in terms of required processing time and accuracy of clustering applied to datasets.

Remotely Sensed Data
To evaluate the performance of the proposed BCAE method, we perform several experiments on three datasets, namely the Tunis, University of Houston, and ISPRS Vaihingen.Datasets descriptions are as follows: 1.
Tunis: These data are an improved satellite image in terms of spatial resolution by the Gram-Schmidt technique and includes eight spectral bands with a spatial resolution of 50 cm acquired by a WorldView-2 sensor.In terms of dimensions, this image is 809 × 809 pixels.This dataset includes digital terrain model (DTM) and digital surface model (DSM) of the study area; 2.
University of Houston: The imagery was acquired by National Center for Airborne Laser Mapping (NCALM) on 16 February 2017.The recording sensors consist of an Optech Titan MW (14SEN/CON340) with an integrated camera (a LiDAR imager operating at 1550, 1064, and 532 nm) and a DiMAC ULTRALIGHT + (a very highresolution color sensor) with a 70 mm focal length.We produced DTM and DSM from this dataset from multispectral LiDAR point cloud data at a 50 cm ground sample distance (GSD) and a very high-resolution RGB image at a 5 cm GSD.The data cube used in the study includes a crop of the original data with a width and height of 1500 × 1500 pixels, in 5 layers (blue, green, and red bands with DSM and DTM); 3.
ISPRS Vaihingen: The German Association of Photogrammetry, Remote Sensing, and Geoinformation produced the dataset.It consists of 33 image tiles in infra-red, red, and green wavelength and GSD of 9 cm/pixel that there is ground truth for 16 of them.This imagery also contains DSM extracted from dense LiDAR data.We used a 1500 × 1500 pixel cube from the 26th image patch in the dataset in 4 layers (IR-R-G with nDSM).It should be noted that we used the nDSM generated by Gerke [43].
and green wavelength and GSD of 9 cm/pixel that there is ground truth for 16 of them.This imagery also contains DSM extracted from dense LiDAR data.We used a 1500 × 1500 pixel cube from the 26 th image patch in the dataset in 4 layers (IR-R-G with nDSM).It should be noted that we used the nDSM generated by Gerke [43].

Methodology
This paper proposes a two-step framework feature learning and clustering of RS images, as illustrated in Figure 1.The proposed CAE was trained with hand-crafted features to extract more effective deep features in the first step.The extracted deep features are fed into the Mini Batch K-Means clustering algorithm to identify clusters in the second step.The central core of this framework is the proposed BCAE, which generates deep features boosted by using hand-crafted features.

Boosted Convolutional Autoencoder (BCAE)
In this paper, we propose to boost CAE by using hand-crafted features.In our model, a preprocessing step was performed to create more robust representations through our proposed CAE.In this phase, two practical hand-crafted features (i.e., normalized digital surface model (nDSM) and vegetation indexes, such as NDVI and ExG(2) for urban-scene classification, were used.We also used minimum noise fraction (MNF) [44] as another boosting feature.This workflow upgraded our deep features after unsupervised learning by CAE.
CAE [45] is an unsupervised feature learning method that recently attracted scientific attention.CAE is a multilevel feature extraction model aimed at discovering the inner information of images [46].Compared with conventional dense AE, CAE utilizes the spatial-locality of the original images, which is critical for image clustering and decreases the possibility of overfitting caused by parameter redundancy [30].
As illustrated in Figure 2, the proposed CAE includes two main blocks of encoder and decoder.The transformation from the original image into the hidden layer is named encoding.In contrast, the transforming feature map from the hidden layer toward the output image is described as a decoding procedure.The target of the encoder is to mine the inner information encapsulated in the input data and extracts them as constructive features, and the goal of the decoder block is to reconstruct the input data from the extracted features [46].

Boosted Convolutional Autoencoder (BCAE)
In this paper, we propose to boost CAE by using hand-crafted features.In our model, a preprocessing step was performed to create more robust representations through our proposed CAE.In this phase, two practical hand-crafted features (i.e., normalized digital surface model (nDSM) and vegetation indexes, such as NDVI and ExG(2) for urban-scene classification, were used.We also used minimum noise fraction (MNF) [44] as another boosting feature.This workflow upgraded our deep features after unsupervised learning by CAE.
CAE [45] is an unsupervised feature learning method that recently attracted scientific attention.CAE is a multilevel feature extraction model aimed at discovering the inner information of images [46].Compared with conventional dense AE, CAE utilizes the spatial-locality of the original images, which is critical for image clustering and decreases the possibility of overfitting caused by parameter redundancy [30].
As illustrated in Figure 2, the proposed CAE includes two main blocks of encoder and decoder.The transformation from the original image into the hidden layer is named encoding.In contrast, the transforming feature map from the hidden layer toward the output image is described as a decoding procedure.The target of the encoder is to mine the inner information encapsulated in the input data and extracts them as constructive features, and the goal of the decoder block is to reconstruct the input data from the extracted features [46].In the proposed CAE,  = { 1 ,  2 , … ,   } ∈  ×× is used as the input tensor, where D, W, and H indicate the depth (i.e., number of bands), width, and height of the input image, respectively, and n is the number of pixels.The X consists of the image patches where b is the bias, σ is an activation function (in this work, the rectified linear unit (ReLU)), and the symbol * corresponds to the 2D convolution.Then, the reconstruction is obtained by using Equation ( 2): In the proposed CAE, X = {x 1 , x 2 , . . . ,x n } ∈ R H×W×D is used as the input tensor, where D, W, and H indicate the depth (i.e., number of bands), width, and height of the input image, respectively, and n is the number of pixels.The X consists of the image patches x * i with the size of 7 × 7 × D (x * i ∈ R 7×7×D ) which are extracted from the input image.In the following, each patch is fed into the encoder Block.For the input x * i , the hidden layer mapping (latent representation) of the k th feature map is given by Equation ( 1) [47]: where b is the bias, σ is an activation function (in this work, the rectified linear unit (ReLU)), and the symbol * corresponds to the 2D convolution.Then, the reconstruction is obtained by using Equation ( 2): where b represents the bias for each input channel, and h denotes the encoded feature maps.W is the transposition of W, and y is the predicted value [48].To calculate the parameter , the following loss function should minimize [49]: (3) Remote Sens. 2021, 13, 2501 6 of 18 In order to minimize the loss function (Equation ( 3)), its gradient with respect to the convolution window parameters W, W, b, b should be calculated as Equation ( 4) [48]: where δh and δy are the deltas of the hidden states and the reconstruction, respectively.The weights are then updated by using adaptive learning rate methods (ADAMs) [50].Finally, the ultimate parameters of CAE are estimated once the cost function converges.The output feature maps of the encoder block are considered deep features.
In this work, the dropout [51] strategy is added to improve the computational efficiency and reduce the overfitting of CAE [52].In addition, the Batch Normalization (BN) [53] is also applied to improve the network's performance.BN helps networks learn faster, as well as increase accuracy [54].

Boosting Deep Representations with Hand-crafted Features
The classification of high-spatial-resolution RS images has become challenged by technology development in pixel size due to the high spectral mixture among different classes, and multispectral images are insufficient for such classification tasks [55].Light detection and ranging (LiDAR) data like nDSM is a crucial component that provides highaccuracy data about absolute elevation of objects which is almost perfect to distinguish objects of different heights, such as buildings and roads in scene classification [56].Huang et al. [55] also emphasized the effectiveness of using this elevation data in extracting buildings.Specifically, nDSM is advantageous for separating high and low vegetation.Recently, many building detection approaches that used aerial imagery and LIDAR data have shown that best correctness and completeness are achieved through spectral and elevation information [57,58].Nowadays, the urban scene classification task mainly relies on elevation information rather than near-infrared (NIR) spectral bands [55].
NDVI, the popular vegetation index, quantifies the vegetation by the difference in photosynthetic response to red-light absorption and near-infrared reflectance.The addition of an NDVI image layer led to a more accurate clustering [59].It was also added to detect vegetation [60].According to MacFaden et al. [61], the use of NDVI and nDSM is critical to extract vegetation (in particular trees) next to buildings.The NDVI is based on the spectral response in the NIR spectrum, which does not exist in the RGB data captured by UAVs.Torres-Sánchez et al. [62] investigated several vegetation indices calculated from RGB bands and showed that the excess green (ExG(2)) index [63] obtained the best result for vegetation mapping in UAV data.
The MNF transformation, a modified version of PCA, aims to minimize the correlation between bands and reduce systematic noise in the image [56].The approach is a conventional dimensionality reducing approach to determine the inherent dimensionality, reduce computational requirements, and segregate data noise [64].The hand-crafted features applied in this study are nDSM, NDVI or ExG(2), and MNF components (Table 1).

MNF
Considering noisy data as x with n-bands in the form of x = s + e, where s and e are the signals and noise parts of x, the covariance matrices of s and e can be calculated as follow: Then, the noise variance of the i th band with respect to the variance for i th band can be described as: In the following, the MNF transform is considered as a linear transformation: where, y is a produced dataset with n bands, which is a transformation of the original bands, the unknown coefficients (A) are obtained by calculating the eigenvectors associated with sorted eigenvalues: A∑ e ∑ −1 = ΛA where, Λ is eigenvalue matrix (λ i ), each eigenvalue associated with a i is the noise ratio in y i , i = 1, 2, . . ., n [65].

Accuracy Assessment
Cluster validity indices (CVIs) are applied to evaluate the performance of clustering algorithms.Unfortunately, most of CVIs, including the Davies-Bouldin index (DBI) and Xie-Beni index (XBI), were not suitable for RS images due to considerable between-cluster overlaps [66].Thus, in order to estimate the clustering accuracy, the confusion matrix was used, and the overall accuracy (OA), producer's accuracy (PA), and Kappa coefficient (κ) were calculated by using this matrix.
It is noticeable that, due to the dependence of the Mini Batch K-Means algorithm on the initial values and to eliminate its random influence and, also, because k-means may get stuck in local optima, we run Mini Batch K-Means 5 times and display the clustering with the smallest error [67].Tables 2-4 describe the ground truth data used for each dataset.

Parameter Setting
The encoder block of the proposed CAE framework consists of two convolutional layers (CNN1 and CNN2) having 12 and 24 filters, respectively.The kernel size of these layers is set to be 3 × 3.There are also two convolutional layers (CNN3 and CNN4) with the kernel size of 1 × 1, making it applicable to completely use the spatial information from the input datasets without considering the neighborhood and extracting features based more on the depth of the data.In this block, we choose 12 and D to output convolutional layers (CNN3 and CNN4) in our proposed model, where D is the depth of the dataset in use.Based on trial and error, the learning rate and batch size were chosen to be 0.01 and 10, respectively.Regularization is also used to avoid overfitting.The BN is added to the third dimension of each activation map of the convolutional layer to overcome the internal covariant shift problem.A 30% dropout is applied to the CNN1 and CNN4 layers to enhance the generalization ability by ignoring random connections [68].In the training process, the Adam optimizer optimized the mean squared error (MSE) cost function.Table 5 summarizes the specifications of the layers in the proposed BCAE framework.
Table 5.The configuration of BCAE for the feature learning of the image path dataset with a 7 × 7 × D window size for the input cube.

Block
Unit Input Shape Kernel Size Regularization Output Shape

Competing Features
In order to evaluate the performance of the BCAE method, three sets of features were considered through three scenarios of input data.The configurations of these features are as follows.The first feature set contained spectral features, including original spectral bands from optical imageries.The second feature set contained spectral and spatial features, including nDSM, NDVI, or ExG (2).Finally, the third feature set contained deep features extracted by training proposed CAE over raw spectral information of each band.

Mini Batch K-Means
The Mini Batch K-Means [69] clustering algorithm is a modified version of the Kmeans algorithm that is considered faster, simpler to implement, and generally used for large datasets [70].The main idea of this algorithm is to make batches of a fixed size randomly in each iteration and update the clusters.A learning rate is defined by inversing of samples to decrease the number of iterations.Therefore, as the number of samples increases, the effect of new samples is reduced [71].In this study, Mini Batch K-Means with a batch size of 10 was applied to cluster features extracted by BCAE.First, the number of clusters was determined as the number of classes in each dataset.Then, we labeled the clusters through visual inspection.

Preprocessing
Two preprocessing steps of feature extraction and resampling were applied to the datasets, as summarized in Table 6.The first one was to extract MNF, nDSM, NDVI, and ExG(2) as hand-crafted features, and the second one was to make the resolution of the dataset consistent.

UH Tunis Vaihingen
Features Extraction

MNF
Two preprocessing steps of feature extraction and resampling were applied to the datasets, as summarized in Table 6.The first one was to extract MNF, nDSM, NDVI, and ExG(2) as hand-crafted features, and the second one was to make the resolution of the dataset consistent.
The MNF transformation was applied for the preprocessing phase.First, we carried out MNF through Python programming, using Spectral Python (Spy) 0.21 package, leading to a list of corresponding images from informative bands to noisy ones (Table 7).It is evident from (Table 7) that in Tunis, after four first bands (a-d), the others are noisy and contain no clear feature.On the other hand, considering the statistical information about transformed components (Table 8), we noticed that the first layer in UH has the highest eigenvalue with a meaningful interval.It is also true for the Vaihingen dataset.Accordingly, for optimum result and reducing the redundancy, the first three bands in Tunis and the first band in UH and Vaihingen were selected.Then, selected MNF bands were stacked to the nDSM and the NDVI or ExG(2) features.Next, a dataset composed of patches (equal to the number of pixels in the dataset) taken from the stacked cube was generated.Finally, the proposed CAE was trained over the patches, and through unsupervised learning, our deep representations were extracted.It should be noted that all the experiments here were conducted on an Intel Core i3-3120M CPU.Moreover, CAE has been implemented in the Python 3.7 language, which has utilized Tensorflow 2.1.0.
Two preprocessing steps of feature extraction and resampling were appl datasets, as summarized in Table 6.The first one was to extract MNF, nDSM, N ExG(2) as hand-crafted features, and the second one was to make the resolut dataset consistent.

UH Tunis Vai
Features Extraction The MNF transformation was applied for the preprocessing phase.First, w out MNF through Python programming, using Spectral Python (Spy) 0.21 pack ing to a list of corresponding images from informative bands to noisy ones (Tab evident from (Table 7) that in Tunis, after four first bands (a-d), the others are contain no clear feature.On the other hand, considering the statistical informat transformed components (Table 8), we noticed that the first layer in UH has th eigenvalue with a meaningful interval.It is also true for the Vaihingen datase ingly, for optimum result and reducing the redundancy, the first three bands in the first band in UH and Vaihingen were selected.Then, selected MNF ba stacked to the nDSM and the NDVI or ExG(2) features.Next, a dataset com patches (equal to the number of pixels in the dataset) taken from the stacked generated.Finally, the proposed CAE was trained over the patches, and throu pervised learning, our deep representations were extracted.It should be noted t experiments here were conducted on an Intel Core i3-3120M CPU.Moreover, been implemented in the Python 3.7 language, which has utilized Tensorflow 2 Two preprocessing steps of feature extraction and resam datasets, as summarized in Table 6.The first one was to extract ExG(2) as hand-crafted features, and the second one was to m dataset consistent.

UH Tu
Features Extraction The MNF transformation was applied for the preprocessin out MNF through Python programming, using Spectral Python ing to a list of corresponding images from informative bands to evident from (Table 7) that in Tunis, after four first bands (a-d contain no clear feature.On the other hand, considering the sta transformed components (Table 8), we noticed that the first la eigenvalue with a meaningful interval.It is also true for the V ingly, for optimum result and reducing the redundancy, the firs the first band in UH and Vaihingen were selected.Then, se stacked to the nDSM and the NDVI or ExG(2) features.Nex patches (equal to the number of pixels in the dataset) taken fr generated.Finally, the proposed CAE was trained over the pa pervised learning, our deep representations were extracted.It s experiments here were conducted on an Intel Core i3-3120M been implemented in the Python 3.7 language, which has utiliz nDSM

Preprocessing
Two preprocessing steps of feature extraction and resampling were applied to the datasets, as summarized in Table 6.The first one was to extract MNF, nDSM, NDVI, and ExG(2) as hand-crafted features, and the second one was to make the resolution of the dataset consistent.

UH Tunis Vaihingen
Features Extraction The MNF transformation was applied for the preprocessing phase.First, we carried out MNF through Python programming, using Spectral Python (Spy) 0.21 package, leading to a list of corresponding images from informative bands to noisy ones (Table 7).It is evident from (Table 7) that in Tunis, after four first bands (a-d), the others are noisy and contain no clear feature.On the other hand, considering the statistical information about transformed components (Table 8), we noticed that the first layer in UH has the highest eigenvalue with a meaningful interval.It is also true for the Vaihingen dataset.Accordingly, for optimum result and reducing the redundancy, the first three bands in Tunis and the first band in UH and Vaihingen were selected.Then, selected MNF bands were stacked to the nDSM and the NDVI or ExG(2) features.Next, a dataset composed of patches (equal to the number of pixels in the dataset) taken from the stacked cube was generated.Finally, the proposed CAE was trained over the patches, and through unsupervised learning, our deep representations were extracted.It should be noted that all the experiments here were conducted on an Intel Core i3-3120M CPU.Moreover, CAE has been implemented in the Python 3.7 language, which has utilized Tensorflow 2.1.0.

Preprocessing
Two preprocessing steps of feature extraction and resampling were appl datasets, as summarized in Table 6.The first one was to extract MNF, nDSM, N ExG(2) as hand-crafted features, and the second one was to make the resolut dataset consistent.

UH Tunis Vai
Features Extraction The MNF transformation was applied for the preprocessing phase.First, w out MNF through Python programming, using Spectral Python (Spy) 0.21 pack ing to a list of corresponding images from informative bands to noisy ones (Tab evident from (Table 7) that in Tunis, after four first bands (a-d), the others are contain no clear feature.On the other hand, considering the statistical informat transformed components (Table 8), we noticed that the first layer in UH has th eigenvalue with a meaningful interval.It is also true for the Vaihingen datase ingly, for optimum result and reducing the redundancy, the first three bands in the first band in UH and Vaihingen were selected.Then, selected MNF ba stacked to the nDSM and the NDVI or ExG(2) features.Next, a dataset com patches (equal to the number of pixels in the dataset) taken from the stacked generated.Finally, the proposed CAE was trained over the patches, and throu pervised learning, our deep representations were extracted.It should be noted t experiments here were conducted on an Intel Core i3-3120M CPU.Moreover, been implemented in the Python 3.7 language, which has utilized Tensorflow 2

Preprocessing
Two preprocessing steps of feature extraction and resam datasets, as summarized in Table 6.The first one was to extract ExG(2) as hand-crafted features, and the second one was to m dataset consistent.

UH Tu
Features Extraction The MNF transformation was applied for the preprocessin out MNF through Python programming, using Spectral Python ing to a list of corresponding images from informative bands to evident from (Table 7) that in Tunis, after four first bands (a-d contain no clear feature.On the other hand, considering the sta transformed components (Table 8), we noticed that the first la eigenvalue with a meaningful interval.It is also true for the V ingly, for optimum result and reducing the redundancy, the firs the first band in UH and Vaihingen were selected.Then, se stacked to the nDSM and the NDVI or ExG(2) features.Nex patches (equal to the number of pixels in the dataset) taken fr generated.Finally, the proposed CAE was trained over the pa pervised learning, our deep representations were extracted.It s experiments here were conducted on an Intel Core i3-3120M been implemented in the Python 3.7 language, which has utiliz NDVI -

Preprocessing
Two preprocessing steps of feature extraction and resampling were appl datasets, as summarized in Table 6.The first one was to extract MNF, nDSM, N ExG(2) as hand-crafted features, and the second one was to make the resolut dataset consistent.
The MNF transformation was applied for the preprocessing phase.First, w out MNF through Python programming, using Spectral Python (Spy) 0.21 pack ing to a list of corresponding images from informative bands to noisy ones (Tab evident from (Table 7) that in Tunis, after four first bands (a-d), the others are contain no clear feature.On the other hand, considering the statistical informat transformed components (Table 8), we noticed that the first layer in UH has th eigenvalue with a meaningful interval.It is also true for the Vaihingen datase ingly, for optimum result and reducing the redundancy, the first three bands in the first band in UH and Vaihingen were selected.Then, selected MNF ba stacked to the nDSM and the NDVI or ExG(2) features.Next, a dataset com patches (equal to the number of pixels in the dataset) taken from the stacked generated.Finally, the proposed CAE was trained over the patches, and throu pervised learning, our deep representations were extracted.It should be noted t experiments here were conducted on an Intel Core i3-3120M CPU.Moreover, been implemented in the Python 3.7 language, which has utilized Tensorflow 2

Preprocessing
Two preprocessing steps of feature extraction and resam datasets, as summarized in Table 6.The first one was to extract ExG(2) as hand-crafted features, and the second one was to m dataset consistent.


The MNF transformation was applied for the preprocessin out MNF through Python programming, using Spectral Python ing to a list of corresponding images from informative bands to evident from (Table 7) that in Tunis, after four first bands (a-d contain no clear feature.On the other hand, considering the sta transformed components (Table 8), we noticed that the first la eigenvalue with a meaningful interval.It is also true for the V ingly, for optimum result and reducing the redundancy, the firs the first band in UH and Vaihingen were selected.Then, se stacked to the nDSM and the NDVI or ExG(2) features.Nex patches (equal to the number of pixels in the dataset) taken fr generated.Finally, the proposed CAE was trained over the pa pervised learning, our deep representations were extracted.It s experiments here were conducted on an Intel Core i3-3120M been implemented in the Python 3.7 language, which has utiliz ExG(2)

Preprocessing
Two preprocessing steps of feature extraction and resampling were applied to the datasets, as summarized in Table 6.The first one was to extract MNF, nDSM, NDVI, and ExG(2) as hand-crafted features, and the second one was to make the resolution of the dataset consistent.
The MNF transformation was applied for the preprocessing phase.First, we carried out MNF through Python programming, using Spectral Python (Spy) 0.21 package, lead-ing to a list of corresponding images from informative bands to noisy ones (Table 7).It is evident from (Table 7) that in Tunis, after four first bands (a-d), the others are noisy and contain no clear feature.On the other hand, considering the statistical information about transformed components (Table 8), we noticed that the first layer in UH has the highest eigenvalue with a meaningful interval.It is also true for the Vaihingen dataset.Accord-ingly, for optimum result and reducing the redundancy, the first three bands in Tunis and the first band in UH and Vaihingen were selected.Then, selected MNF bands were stacked to the nDSM and the NDVI or ExG(2) features.Next, a dataset composed of patches (equal to the number of pixels in the dataset) taken from the stacked cube was generated.Finally, the proposed CAE was trained over the patches, and through unsu-pervised learning, our deep representations were extracted.It should be noted that all the experiments here were conducted on an Intel Core i3-3120M CPU.Moreover, CAE has been implemented in the Python 3.7 language, which has utilized Tensorflow 2.1.0. -- The MNF transformation was applied for the preprocessing phase.First, we carried out MNF through Python programming, using Spectral Python (Spy) 0.21 package, leading to a list of corresponding images from informative bands to noisy ones (Table 7).It is evident from (Table 7) that in Tunis, after four first bands (a-d), the others are noisy and contain no clear feature.On the other hand, considering the statistical information about transformed components (Table 8), we noticed that the first layer in UH has the highest eigenvalue with a meaningful interval.It is also true for the Vaihingen dataset.Accordingly, for optimum result and reducing the redundancy, the first three bands in Tunis and the first band in UH and Vaihingen were selected.Then, selected MNF bands were stacked to the nDSM and the NDVI or ExG(2) features.Next, a dataset composed of patches (equal to the number of pixels in the dataset) taken from the stacked cube was generated.Finally, the proposed CAE was trained over the patches, and through unsupervised learning, our deep representations were extracted.It should be noted that all the experiments here were conducted on an Intel Core i3-3120M CPU.Moreover, CAE has been implemented in the Python 3.7 language, which has utilized Tensorflow 2.1.0.

Clustering Results
We performed the BCAE method on the three datasets and applied the Mini Batch K-Means algorithm to the encoded features.The clustering maps (Figures 3-5) and quantitative results (Tables 9-11) are listed below.

Tunis Dataset
Figure 3 shows the clustering maps for different feature sets.Visually, the performance of the boosted deep features (i.e., BCAE) is similar to spectral and spatial features (i.e., MDN), and they both outperform two other ones (i.e., MS and CAE-MS).It should be mentioned that BCAE leads to compensate noise and extract the correct boundaries of the vast majority of classes.Bare land had a similar spectral representation to some buildings.Hence, spectral features (MS) alone were insufficient to separate these land-covers.Even though the CAE-MS features perform slightly better than MS, summarizing these results makes it possible to prove that the information in its raw form for use as features will significantly impact reducing clustering performance.Table 9 presents the clustering accuracies for different feature sets.MDN, i.e., spatial and spectral features (i.e., MNF, nDSM, and NDVI), significantly improved OA by 16% compared to the first feature set scenario, i.e., using only spectral features (MS).BCAE led to superior performance to the competing features.The corresponding OA and Kappa were 20% and 30% higher than the MS ones, 4% and 6% higher than the MDN ones, 14% and 21% higher than the CAE-MS ones.The improved accuracy over MDN, particularly for building class (6%), indicates that the proposed BCAE method boosted spatial and spectral features.Based on the comparing results in Table 9, the challenging class, i.e., building, demonstrates low clustering accuracy over MS and CAE-MS, while the BCAE has remarkably improved the accuracy.It increased about 6% in OA in comparison with the MDN feature set.

UH Dataset
The UH dataset has two important spectral mixings in land covers, including spectral similarity between trees and low vegetation and bare land and building.In addition, there are some difficulties in the classification of these two classes for trees near buildings or trees without green leaves.The MS and CAE-MS were built on the spectral information only; thus, they are insufficient to address clustering in this dataset and show poor OA and Kappa values.On the other hand, compared to the MS, clustering with the CAE-MS leads to about 2.84% and 3% improvements in OA and Kappa, respectively.According to a recent comparison, the use of CAE has a positive effect on increasing cluster accuracy by extracting more distinct features.In the case of MDE, even with the significant improvement in three classes, compared to MS and CAE-MS feature sets, there was no noticeable improvement for building class; however, we have to note the fact that the use of MNF, nDSM, and ExG(2) features lead to 12% and 16% improvements in the OA and κ compared to the MS, respectively.In particular, MDE used the height information alongside ExG(2) and facilitated tree class distinguishing.The BCAE approach had a visually highlighted enhancement (Figure 4).Moreover, the BCAE method achieved better OA of 30% and κ of 40%.For the most challenging class (i.e., tree) in the UH dataset, the results (Table 10) show poor clustering performance over competing features, while the BCAE has a considerably improved accuracy of 77.37%.

UH Dataset
The UH dataset has two important spectral mixings in land covers, including spectral similarity between trees and low vegetation and bare land and building.In addition, there are some difficulties in the classification of these two classes for trees near buildings or trees without green leaves.The MS and CAE-MS were built on the spectral information only; thus, they are insufficient to address clustering in this dataset and show poor OA and Kappa values.On the other hand, compared to the MS, clustering with the CAE-MS leads to about 2.84% and 3% improvements in OA and Kappa, respectively.According to a recent comparison, the use of CAE has a positive effect on increasing cluster accuracy by extracting more distinct features.In the case of MDE, even with the significant improvement in three classes, compared to MS and CAE-MS feature sets, there was no noticeable improvement for building class; however, we have to note the fact that the use of MNF, nDSM, and ExG(2) features lead to 12% and 16% improvements in the OA and κ compared to the MS, respectively.In particular, MDE used the height information alongside ExG(2) and facilitated tree class distinguishing.The BCAE approach had a visually highlighted enhancement (Figure 4).Moreover, the BCAE method achieved better OA of 30% and κ of 40%.For the most challenging class (i.e., tree) in the UH dataset, the results (Table 10) show poor clustering performance over competing features, while the BCAE has a considerably improved accuracy of 77.37%.5 and Table 11 demonstrate the clustering maps and accuracies, respectively.In this dataset, we observe that CAE-MS performance is worse than MS by 3% and 2% in terms of κ and OA, respectively.Generally, the spectral overlap between low vegetation and tree classes leads to the worst accuracy among MS and CAE-MS.In this case, the

Vaihingen Dataset
Figure 5 and Table 11 demonstrate the clustering maps and accuracies, respectively.In this dataset, we observe that CAE-MS performance is worse than MS by 3% and 2% in terms of κ and OA, respectively.Generally, the spectral overlap between low vegetation and tree classes leads to the worst accuracy among MS and CAE-MS.In this case, the features learned directly from the raw data by the proposed ACE had a negative effect on accuracy.These results show that the clustering of images by using only spectral information in raw data cannot lead to acceptable performance.As shown in Figure 5, although the OA varies from 45 to 54% in three competing feature sets, the clustering maps do not represent the accurate content, especially in building class and bare land class affected by shadow.For very high-resolution imagery, there is no success with clustering on only the spectral data at all.There is a significant improvement for MDN compared to MS and CAE-MS (i.e., OA of 6-8% and Kappa of 9-12%).This is not surprising because the DSM can help in efficiently classifying houses and trees, while the infrared band can discriminate vegetation better.However, the water class is not classified correctly.Accordingly, to achieve high accuracy in such clustering tasks, spatial information is not representative enough.In BCAE features, all classes achieve remarkably better accuracy by using deep-boosted features than other comparing feature sets.
Remote Sens. 2021, 13, x FOR PEER REVIEW 14 of 19 the OA varies from 45 to 54% in three competing feature sets, the clustering maps do not represent the accurate content, especially in building class and bare land class affected by shadow.For very high-resolution imagery, there is no success with clustering on only the spectral data at all.There is a significant improvement for MDN compared to MS and CAE-MS (i.e., OA of 6-8% and Kappa of 9-12%).This is not surprising because the DSM can help in efficiently classifying houses and trees, while the infrared band can discriminate vegetation better.However, the water class is not classified correctly.Accordingly, to achieve high accuracy in such clustering tasks, spatial information is not representative enough.In BCAE features, all classes achieve remarkably better accuracy by using deepboosted features than other comparing feature sets.

Running Time Comparison
The execution time of the K-Means-based algorithm increases with the growth of the size and dimension of the dataset.Hence clustering datasets usually is not a time-efficient task [72].Therefore, it is necessary to accelerate these algorithms with dimension reduction and efficiently reduce data size.As shown in Table 12, it can be seen that the clustering of features extracted by BCAE has performed well compared to the three competing feature sets in terms of running time, and the clusters have converged faster.As a result, in addition to better accuracy, the proposed method is time-efficient.

Discussion
The main issue associated with image classification from using supervised methods is the requirement of a large set of labeled training samples.The lack of sufficient training data could highlight the importance of using unsupervised approaches.Clustering of high-resolution urban scenes in remote sensing, considering high complexity and interclass diversity of land cover, is faced with low performances.Therefore, one solution to overcome this problem would be extracting discriminative hand-crafted features, which requires a considerable amount of experience and time.
In this study, we used an unsupervised feature learning method to produce high-level features automatically.We also used multi-sensor data to boost deep features discrimination power by complementary spatial information and reducing dimensionality.The experimental results from the three datasets validated the efficiency and versatility of the BCAE for image clustering.The deep features outperform the manually designed ones.It turns out to work surprisingly well with clustering algorithm by use of the following factors: (1) the convolutional structure for local spatial-preserving and reducing data redundancy, (2) the BN for decreasing interval covariate shift of network and dropout for computational efficiency and increasing the generalizability, (3) fusion of complementary spatial information (i.e., nDSM) with spectral features (i.e., MNF and NDVI), and (4) no need for large labeled data and using patch learning with the scene image.In addition, since our proposed network is light-weighted, its run-time is short.The numerical results demonstrate the efficiency and superiority of the proposed method in terms of OA and κ, compared with three competing features that rely only on the hand-crafted features or only deep features learned from raw data.On the other hand, the low potential of clustering methods, specifically the K-Means-based algorithm, limits the number of correctly identified classes.

Conclusions
Many researchers prove that feature extraction plays a vital role in data processing; thus, in this article, we proposed a practical approach (BCAE) to learn more discriminative features based on CAE and hand-crafted (spectral and spatial) features for clustering RS images.The BCAE is constructed by stacking convolution layers in an AE form that learns features from hand-crafted features as its input, and our network boosts the discrimination of hand-crafted features.The feature maps of the last layer in the encoder part of BCAE that reflect the RS data's essential information can be used to input the clustering algorithm.The extracted features have more separable and compact patterns than the hand-crafted input features, which helps the clustering purpose.Our goal is to map data to a latent space where Mini Batch K-Means is a suitable tool for clustering.The learned deep feature representation is highly discriminative.
We will apply the proposed method to a broader range of data in our future work to establish a more global one.In addition, investigating various hand-crafted features and designing a robust autoencoder model and segment-based instead of pixel-wise clustering would be an open issue for future studies.

Figure 1 .
Figure 1.An overall overview of the proposed clustering method.

Figure 1 .
Figure 1.An overall overview of the proposed clustering method.

Figure 2 .
Figure 2. The proposed architecture of the CAE.

Figure 2 .
Figure 2. The proposed architecture of the CAE.
-DTM where DSM is the digital surface model, and DTM is the digital terrain model.NDVI NDV I = (ρ N IR − ρ red )/(ρ N IR + ρ red )where ρ N IR is the reflectance of the near-infrared wavelength band and ρ red is the reflectance of the red wavelength band.ExG(2)ExG(2) = 2g − r − b where r, g, and b are the color band divided by the sum of three bands per pixel[60].

Table 1 .
Extraction of hand-crafted features.

Table 2 .
Ground truths of the Tunis dataset.

Table 3 .
Ground truths of the UH dataset.

Table 4 .
Ground truths of the Vaihingen dataset.

Table 6 .
Hand-crafted features for all datasets.

Table 6 .
Hand-crafted features for all datasets.

Table 6 .
Hand-crafted features for all datasets.

Table 6 .
Hand-crafted features for all datasets.

Table 6 .
Hand-crafted features for all datasets.

Table 6 .
Hand-crafted features for all datasets.

Table 6 .
Hand-crafted features for all datasets.

Table 6 .
Hand-crafted features for all datasets.

Table 6 .
Hand-crafted features for all datasets.

Table 6 .
Hand-crafted features for all datasets.

Table 8 .
Statistics of MNF bands.

Table 8 .
Statistics of MNF bands.

Table 9 .
A comparison between the clustering results of the Tunis dataset.

Table 10 .
A comparison between the clustering results of the UH dataset.

Table 10 .
A comparison between the clustering results of the UH dataset.

Table 11 .
A comparison between the clustering results of the Vaihingen dataset.

Table 11 .
A comparison between the clustering results of the Vaihingen dataset.

Table 12 .
The computational time of the proposed and competing methods for the three datasets.