Stacked Autoencoders Driven by Semi-Supervised Learning for Building Extraction from near Infrared Remote Sensing Imagery

Abstract: In this paper, we propose a Stacked Autoencoder (SAE)-driven and Semi-Supervised Learning (SSL)-based Deep Neural Network (DNN) to extract buildings from relatively low-cost satellite near infrared images. The novelty of our scheme is that we employ only an extremely small portion of labeled data for training the deep model, constituting less than 0.08% of the total data. This way, we significantly reduce the manual effort needed to complete an annotation process, and thus the time required for creating a reliable labeled dataset. Instead, we apply novel semi-supervised techniques to estimate soft labels (targets) for the vast amount of existing unlabeled data and then utilize these soft estimates to improve model training. Overall, four SSL schemes are employed: the Anchor Graph, the Safe Semi-Supervised Regression (SAFER), the Squared-loss Mutual Information Regularization (SMIR), and an equal-importance Weighted Average of them (WeiAve). To retain only the most meaningful information of the input data, labeled and unlabeled alike, we also employ a Stacked Autoencoder (SAE) trained in an unsupervised manner. This way, we handle noise in the input signals, attributed to dimensionality redundancy, without sacrificing meaningful information. Experimental results on the benchmark dataset of Vaihingen city in Germany indicate that our approach outperforms all state-of-the-art methods in the field using the same type of color orthoimages, despite the fact that a limited dataset is utilized (10 times less data or better, compared to other approaches), while our performance is close to that achieved by far more expensive and much more precise input information, such as that derived from Light Detection and Ranging (LiDAR) sensors. In addition, the proposed approach can be easily expanded to handle any number of classes, including buildings, vegetation, and ground.


Introduction
Land cover classification has been a widely studied field since the appearance of the first satellite images. In the last two decades, the sensors attached to satellites have evolved in a way that nowadays allows the capture of high-resolution images which may go beyond the Red Green Blue (RGB) visible spectrum. This technological advance has made the detection and classification of buildings and other man-made structures from satellite images possible [1]. The automatic identification of buildings in urban areas, using remote sensing data, can be beneficial in many applications, including cadaster, urban and rural planning, urban change detection, mapping, geographic information systems, monitoring, housing value estimation, and navigation [2][3][4].
Typically, for remote sensing applications, RGB, thermal, multi- and hyper-spectral, Near Infrared (NIR) imaging, and LiDAR sensors are employed. Each sensor presents its own advantages and drawbacks, including the purchase cost and the manual effort needed for data collection, processing, and analysis. In this paper, we employ the relatively low-cost imaging data of NIR sensors. A key application of remote sensing data, like the NIR ones, is to produce semantic labels of the inputs to assist experts in their analysis. To derive a semantic segmentation, classification schemes can be applied [5]. These schemes usually involve (i) a feature extraction phase, in which a set of appropriate descriptors (even the raw image data) is selected, and (ii) a classification phase, which employs models (classifiers) to categorize the input features into semantic labels, such as buildings, vegetation, and ground.
The main drawback, however, of this classification-based approach is twofold. First, a feature-based analysis is information redundant, which, apart from the computational and memory burdens it imposes on the classifiers, may also result in decreased performance, especially in the case of complicated data content. Second, classification requires a training phase, in which a labeled dataset of pairs of (automatically extracted) features, along with desired outputs (targets), is fed to the classifier through a learning process to estimate appropriate classifier parameters (weights). The goal of the learning process is to minimize the error between the classifier outputs and the desired targets over all training samples (in a way that avoids overfitting). However, to produce the desired targets, an annotation process must be applied, which is, most of the time, laborious, requires high manual effort, and takes long to complete.
Regarding information redundancy reduction, many methods can be applied, such as vector quantization and mixture models, Principal Component Analysis (PCA), Singular Value Decomposition (SVD), etc. [6]. In this paper, we chose to use a deep Stacked Autoencoder (SAE) to significantly reduce the dimensionality of the input data while retaining most of the meaningful information. We select such a scheme due to its highly non-linear capabilities in discarding redundant data, compared to linear approaches such as PCA, its unified structure that can be utilized for different application cases, and its easy applicability under a parallel computing framework, making the scheme ready to be applied to large-scale input data [7]. The reduced dimension space, as provided by the SAE encoding part, mitigates all drawbacks attributed to the high dimensionality of the original data.
To minimize data annotation effort, Semi-Supervised Learning (SSL) schemes can be employed. In SSL schemes, the classifier is trained with two sets: a small portion of labeled (annotated) data and a larger set of unlabeled data. For the latter, the required targets are unknown and are estimated by the SSL algorithms by transferring knowledge from the small annotated dataset to the larger unlabeled set. The reduction in the number of labeled data does not influence the feature extraction process. However, it significantly reduces the time needed to annotate the data (which requires laborious manual effort) for training, since the targets of the unlabeled data used in the training phase are estimated automatically by applying an SSL algorithm. Thus, no additional manual effort is required, meaning that no additional resources are wasted on annotation. As shown in the experimental results section, this dramatic decrease in the number of labeled samples, and hence in the respective manual effort, affects the classification performance only insignificantly.
Classifiers are usually deep network structures. Among the most popular deep models adopted are Convolutional Neural Networks (CNNs) [8,9], which give excellent performance on remote sensing imagery for classification purposes. This is also shown in our earlier works [10,11]. However, CNNs deeply convolve the input signals to discover proper relations among them using many convolutional filters. Thus, they cannot yield a compact representation of reduced input data, which imposes computational costs when combined with the SSL methods. For this purpose, in this paper, a Deep Neural Network (DNN) model is used to execute the classification.

Description of the Current State-of-the-Art
Building extraction from urban scenes with complex architectural structures still remains challenging due to the inherent artifacts, e.g., shadows, of the used data (remote sensing imagery). Our scheme addresses this by training on soft labels provided by multiple SSL techniques applied to the encoded data. The SAE-based compression scheme is combined with four novel semi-supervised algorithms, namely Anchor Graph, SAFER, and SMIR (see Section 4), plus a weighted combination of the above assuming equal importance for each scheme.
The adopted SSL approaches run over the non-linearly transformed input data generated by the encoder part of the SAE. The encoder reduces the redundant information, creating much more reliable and robust training samples. The much smaller dimension of the input signals helps reduce unnecessary, or even contradictory, information. Given a set of robust soft labels over a large set of unlabeled data, we are able to boost DNN performance.
The proposed autoencoder scheme is nicely interwoven with the SSL algorithms. The SSL techniques require no modifications to operate on the data provided by the encoder, e.g., any type of preprocessing of the input data. At the same time, the DNN does not require any custom layers to incorporate the SSL outcomes. Therefore, the trained deep models can be easily utilized by third-party applications as-is or through transfer learning [33]. The semi-supervised, fine-tuned DNN model can detect buildings in satellite NIR images with high accuracy. This paper is organized as follows: Section 2 presents the conceptual background of this paper. The proposed methodology is given in Section 3. Section 4 presents the employed SSL approaches. Section 5 provides extensive experimental results and a comparison against other state-of-the-art approaches. Finally, Section 6 gives discussions and Section 7 concludes this paper.

Input Data Compression Using a Deep SAE Framework
Deep SAEs [34] have been employed for remote sensing data classification [35,36], resulting in accurate performance. Typically, training of a deep autoencoder consists of two steps: (a) training per layer and (b) fine-tuning of the entire network [37]. Training per layer is an unsupervised process exploiting all available data, both labeled and unlabeled, since only the input features are needed and no target values are required.
Nevertheless, in remote sensing applications, the available training data are only a small portion of the total data entities [10], often resulting in low performance scores, especially when the inputs cannot be sufficiently represented in the training set. In the layer-wise training step, each layer learns to reconstruct the input values using fewer computational nodes. This is in fact a compression scheme: we retain the input information using fewer neurons. In this study, we utilize only the encoder part of an SAE to compress the data. The compressed data are then exploited by a semi-supervised technique to generate rough estimations (soft labels) that can be beneficial during the fine-tuning training phase [38]. Typically, the entire network is fine-tuned using the backpropagation algorithm.
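The greedy layer-wise compression described above can be sketched as follows. This is a minimal illustration assuming linear autoencoder layers, plain gradient descent, and toy layer sizes (32 → 16 → 8); the paper's SAE uses non-linear layers trained by backpropagation, and the learning rate and sizes here are hypothetical:

```python
import numpy as np

def train_ae_layer(X, n_hidden, lr=0.05, epochs=300, seed=0):
    """Train one linear autoencoder layer to reconstruct X; return the encoder.

    A linear AE with plain gradient descent is a simplification; the paper's
    SAE uses non-linear layers trained by backpropagation.
    """
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    We = rng.normal(0, 0.1, (n_in, n_hidden))     # encoder weights (kept)
    Wd = rng.normal(0, 0.1, (n_hidden, n_in))     # decoder weights (discarded later)
    for _ in range(epochs):
        H = X @ We                                # encode
        err = H @ Wd - X                          # reconstruction error
        Wd -= lr * H.T @ err / len(X)             # gradient step on decoder
        We -= lr * X.T @ (err @ Wd.T) / len(X)    # gradient step on encoder
    return We

# Greedy layer-wise stacking: 32 -> 16 -> 8, feeding codes to the next layer.
X = np.random.default_rng(1).random((200, 32))
X -= X.mean(axis=0)                               # center features for stability
encoders, H = [], X
for n_hidden in (16, 8):
    We = train_ae_layer(H, n_hidden)
    encoders.append(We)
    H = H @ We                                    # keep only the encoder output
print(H.shape)
```

After this loop, only the stacked encoders are retained; their output `H` is the compressed representation passed to the SSL stage.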

Semi-Supervised Learning (SSL) Schemes
Conventional training of deep neural network models is performed over the available labeled data instances, which form, in fact, a limited set, since labeling a large amount of NIR images requires high manual effort and is a time-consuming process. One approach to overcome this drawback is to apply Semi-Supervised Learning (SSL) [39] to transfer knowledge from the labeled data to the unlabeled ones.
Overall, four novel semi-supervised schemes are adopted to estimate labels for the vast amount of unlabeled data: Anchor Graphs [40], SAFE Semi-Supervised Regression (SAFER) [41], Squared-loss Mutual Information Regularization (SMIR) [42], and an equally weighted average of the above methods, called WeiAve. The latter acts as a simple fusion technique across the first three SSL schemes. The anchor graph approach optimally estimates soft labels for the unlabeled data based on a small portion of "anchor data" which act as representatives. SAFER, on the other hand, employs a linear programming methodology to estimate the best classifier of unlabeled data, which yields performance at least as good as a traditional supervised classification scheme. Finally, SMIR exploits Bayesian classification concepts (the maximum information principle) to transfer knowledge from the labeled to the unlabeled data.
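To illustrate the general idea of transferring labels from a few labeled samples to many unlabeled ones, the following is a generic graph-based label propagation sketch. It is a stand-in for intuition only, not a reimplementation of Anchor Graph, SAFER, or SMIR; the RBF affinity, sigma, and iteration count are assumptions:

```python
import numpy as np

def propagate_labels(X, y, n_labeled, n_iter=50, sigma=1.0):
    """Soft-label estimation by simple graph label propagation.

    A generic stand-in for intuition, not a reimplementation of Anchor Graph,
    SAFER, or SMIR. y holds one-hot targets for the first n_labeled rows.
    """
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))        # RBF affinity graph
    P = W / W.sum(axis=1, keepdims=True)      # row-stochastic transition matrix
    F = np.zeros_like(y)
    F[:n_labeled] = y[:n_labeled]
    for _ in range(n_iter):
        F = P @ F                             # diffuse label mass over the graph
        F[:n_labeled] = y[:n_labeled]         # clamp the labeled rows
    return F                                  # soft label scores per class

# Two tight clusters; one labeled sample per class (rows 0 and 1).
X = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.0],
              [0.0, 0.1], [5.1, 5.0], [5.0, 5.1]])
y = np.zeros((6, 2))
y[0, 0] = 1.0                                 # labeled point of class 0
y[1, 1] = 1.0                                 # labeled point of class 1
soft = propagate_labels(X, y, n_labeled=2)
print(soft.argmax(1))  # [0 1 0 0 1 1]
```

The unlabeled points inherit soft labels from the nearby labeled point in their cluster, which is the behavior the adopted SSL schemes refine with stronger optimization criteria.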

Deep Neural Networks (DNNs) for Buildings' Extraction
To execute the final classification, we utilize a DNN structure which consists of (i) the encoding layers of the SAE and (ii) a fully connected neural network of one hidden layer and one output layer that decides whether the input corresponds to a building or not. Concerning DNN training, we employ all available data, that is, the small portion of labeled samples and the many unlabeled data for which soft label estimates have been generated by applying the SSL techniques.
Our novel methodology achieves a building detection performance on satellite NIR data close to that achieved by much more costly methodologies, such as LiDAR, or by a great number of training samples, which entail high manual effort and are time-consuming. This performance has been derived using real-life NIR datasets to increase the reliability of our approach. We emphasize that the main contribution lies in the extremely narrow set of labeled data required, i.e., less than 0.08% of all data are labeled.

Description of the Overall Architecture
The training phase: As shown in Figure 1a, we initially collect a large set of NIR images. In our case, the data correspond to city areas located in Germany. We should stress that these data have been used as benchmark data within the remote sensing community. This way, we can easily compare our results to other state-of-the-art approaches.
More specifically, the NIR input data are split into two subgroups: a small set containing all the labeled data and a much larger set containing unlabeled ones. The labeled dataset includes the corresponding outputs, provided by one or more experts using a crowdsourcing concept, described in the experimental setup section. The unlabeled dataset does not contain any information on the outputs (no annotation effort). At this point, we utilize the SSL techniques to estimate soft target values, i.e., labels for the outputs. Before doing so, the input data are non-linearly mapped to a much smaller dimension, using the encoder part of an SAE.
In our case, four SSL schemes are used, which are described in the following section, to generate the soft estimated target outputs (labels) of the unlabeled data: Anchor Graph, SAFER, SMIR, and a weighted average of the above SSL schemes. The adopted approach results in the creation of a DNN classifier. This module is trained using conventional learning strategies, such as backpropagation. The output indicates to which of the three available classes (buildings, vegetation, ground) each NIR image pixel is assigned.
The testing phase: Figure 1b shows a block diagram of the testing phase. The deep module now receives inputs different from the labeled and unlabeled training data and classifies them into the three class categories. In this case, the encoder part of the SAE is part of the deep structure, reducing the redundant information of the inputs. The SSL schemes are not applicable in this case.

Figure 2 describes the main steps of the proposed solution for enhancing buildings' classification in NIR images. The methodology adopted consists of four main steps. The first is the unsupervised learning of the SAE to generate proper weights of the model, enabling it to carry out the dimensionality reduction. This includes the collection of the data, the construction (training) of the SAE, the retention of only its encoder part, and the projection of the data to reduce their dimensionality. The second step is the collection and annotation of a small portion of data through the crowdsourcing scheme, and then the application of an SSL method to the unlabeled data to approximate (softly) their desired targets; the SSL schemes are applied on the reduced data inputs x_i^(r). The third step is the fine-tuning (training) of the entire DNN structure by exploiting all data (labeled and unlabeled). Finally, the fourth and last step is the application of the model to the test (unseen) data.


Description of Our Dataset and of the Extracted Features
Study areas, namely Area 1, Area 2, and Area 3, situated in Vaihingen city in Germany, were used for training and evaluation purposes (Figure 3). Area 1 mainly consists of historic buildings with notably complex structure and has sporadic, often high, vegetation. Area 2 mainly has high residential buildings with horizontal multiple planes, surrounded by long arrays or groups of dense high trees. Area 3 is a purely residential area with small, detached houses that consist of sloped surfaces, and relatively low vegetation also exists there. Figure 3 depicts characteristic content of these three areas. In the same figure, we have overlaid the small polygons used by the users to select ground truth data of buildings, vegetation, and ground, forming the small set of l labeled data.

Table 1 shows the flying parameters and supplementary information about the used datasets of the Vaihingen study areas, as well as the software instruments we use. For the Vaihingen case, the DSM is extracted from high-resolution digital color-infrared (CIR) aerial images by applying Dense Image Matching (DIM) methods. These images contain the near infrared (NIR) band, which is a very good source for the detection of vegetation, exploiting vegetation indices such as the Normalized Difference Vegetation Index (NDVI). The CIR aerial images consist of NIR, Red, and Green bands and were mainly introduced in [43] in order to contribute vegetation features. Based on this DIM-derived DSM, an orthoimage is generated. Table 1 also presents the accuracy and specification of the generated DSMs and orthoimages, expressed in terms of aerial triangulation accuracy.

One interesting aspect of this dataset is that an annotation of the data into two categories, buildings and non-buildings, is provided. This way, we can benchmark our model outcomes against other state-of-the-art approaches using a reference annotation scheme.
In our case, as described in Section 6, two evaluation methods are considered: one using the polygons obtained by our expert users for the three categories (buildings, vegetation, and ground) and one using the provided benchmark annotation of the dataset into two categories, buildings and non-buildings (a two-class classification problem).
A Multi-Dimensional Feature Vector (MDFV) is created to feed the classifier, as we have done in one of our earlier works [20]. The MDFV includes image information from the color components of the NIR images (that is, NIR, Red, and Green), the vegetation index, and the height. The vegetation index for every pixel is estimated as

NDVI = (NIR − R) / (NIR + R)    (1)

where R and NIR refer to the red and near infrared image bands. It should be mentioned that the NDVI can be computed only for datasets where the NIR channel is available. The height is estimated through 3D information of the data. This is accomplished in our case by applying a Dense Image Matching (DIM) approach to extract the Digital Surface Model (DSM) of the terrain. The cloth simulation and closest point methods [44] are applied to estimate a normalized height from the DIM-derived DSM, called the normalized DSM (nDSM) [10]. The selected parameters of the cloth simulation algorithm for all the test sites are (i) steep slope and slope processing for the scene, (ii) cloth resolution = 1.5, (iii) max iterations = 500, and (iv) classification threshold = 0.5. This parameter selection provides information similar to a LiDAR system, avoiding, however, the related acquisition and processing costs of these sensors. Figure 4 shows a visual representation of the two additional feature values used in an MDFV: the normalized nDSM values measuring the height (first row, Figure 4a) and the vegetation index (second row, Figure 4b).
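The per-pixel NDVI computation is straightforward to vectorize over whole bands; a minimal numpy sketch (the `eps` guard against zero denominators is our addition):

```python
import numpy as np

def ndvi(nir, red, eps=1e-8):
    """Per-pixel Normalized Difference Vegetation Index.

    NDVI = (NIR - R) / (NIR + R); eps guards against division by zero
    (an implementation detail, not part of the definition).
    """
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return (nir - red) / (nir + red + eps)

# Toy 2x2 bands: vegetation reflects strongly in NIR relative to red.
nir_band = np.array([[0.8, 0.6], [0.2, 0.5]])
red_band = np.array([[0.2, 0.3], [0.2, 0.5]])
print(ndvi(nir_band, red_band))
```

High values (close to 1) indicate vegetation; values near 0 indicate bare ground or buildings, which is why the index is a useful extra feature channel in the MDFV.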
To avoid labeling the entirety of the images, which is a time-consuming process, only a small ground truth dataset is created, via a crowdsourcing approach. By these means, we accelerate the construction of the ground truth dataset. In particular, we ask the expert users to draw a few polygons over the images. The only constraint is the number of classes: users had to create (sketch) at least one polygon, serving annotation purposes, for each of the following three categories: Buildings (1), Vegetation (2), and Ground (3). This set consists of representative sample polygons for data of each class. Concerning the vegetation class, trees of medium and high height are considered "good" indicative samples. The ground class contains the bare earth, roads, and low vegetation (grass, low shrubs, etc.). The class of buildings contains all the man-made structures. To improve classification, shadowed areas of each class are also included. In addition, the training sample polygons are spatially created to improve the representativity of each class and take into account the spatial coherency of the content. Some examples of these polygons are shown in Figure 3.

Creation of the Small Portion of Labeled Data (Ground Truth)
We then split the annotated data within the polygons into three subsets to train and validate the classifiers. The created subsets, namely labeled, unlabeled, and unseen data, were formed using approximately 16%, 64%, and 20% of the samples within the polygons, respectively. The labeled and unlabeled datasets are used to train the network, while the unseen data are used to test the classifier performance on data different from those used in training. For the labeled data, the desired targets are known. For the unlabeled data, the unknown targets (desired outputs) are estimated by transferring knowledge from the labeled samples to the unlabeled ones. The target outputs of the unseen data are estimated by the classifier after training. This constitutes the major advantage of our method: less than 0.08% of all data are considered labeled, yet they suffice to create a robust classifier, as we will see below, for each of the classes.
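The three-way split can be sketched as follows, assuming a simple shuffle-and-slice policy; only the approximate 16/64/20% fractions come from the text, while the seed and rounding behavior are our assumptions:

```python
import numpy as np

def split_indices(n, fractions=(0.16, 0.64, 0.20), seed=0):
    """Shuffle n sample indices and split into labeled / unlabeled / unseen.

    The fractions follow the paper's approximate 16/64/20% split of the
    annotated polygon pixels; the seed and rounding policy are assumptions.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_lab = int(fractions[0] * n)
    n_unlab = int(fractions[1] * n)
    labeled = idx[:n_lab]
    unlabeled = idx[n_lab:n_lab + n_unlab]
    unseen = idx[n_lab + n_unlab:]          # remainder goes to the test set
    return labeled, unlabeled, unseen

lab, unlab, unseen = split_indices(1000)
print(len(lab), len(unlab), len(unseen))  # 160 640 200
```

Only the `labeled` targets are human-annotated; the `unlabeled` targets are filled in by the SSL schemes, and `unseen` is held out entirely.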
Table 2 demonstrates how the user annotations, using polygons, are distributed for each of the three examined city areas and each of the three categories. At this point, we should note two things. First, the annotated data used constitute only 0.43% of Area 1, 0.39% of Area 2, and 0.52% of Area 3; this includes all data (labeled, unlabeled, and test), whereas the labeled data used are less than 0.08% of the total data. This number is extremely low compared to other works on the same dataset: the labeled data we use are 10 times fewer than in the work of Maltezos et al. [5], and much fewer than in the other supervised approaches. Second, we have unbalanced datasets for all the areas. Area 1 annotations resulted in a ratio greater than 3 building pixels to 1 of any other category. Areas 2 and 3 also have unbalanced annotated instances, but not as severely as Area 1.

Figure 5 illustrates the proposed SAE-driven DNN model. The first two layers correspond to the SAE encoder, whose weights are set through an unsupervised learning process where inputs and outputs are the same [34]. The other two layers of the model are one hidden layer and one output layer, responsible for conducting the final classification. Parameters of the hidden and output layers were randomly initialized. Then, a fine-tuning training step, using the backpropagation algorithm, is applied to the entire network.
The initial image is separated into overlapping blocks of size 15 × 15 × 5 = 1125. The DNN classifier utilizes these 1125 values and decides the corresponding class for the pixel at the center of the patch. The first two hidden layers are encoders, trained in an unsupervised way. They serve as non-linear mappers, reducing the dimensionality of the feature space from 1125 to 400, and then to 80. A hidden layer of 27 neurons then performs a final mapping, allowing classification into one of the three pre-defined classes.
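The layer dimensions above (1125 → 400 → 80 → 27 → 3) can be sanity-checked with a toy forward pass. The random weights, sigmoid activations, and softmax output are assumptions for illustration, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [1125, 400, 80, 27, 3]     # flattened patch -> encoder -> hidden -> classes

# Random weights stand in for (i) the unsupervisedly pre-trained encoder
# (1125 -> 400 -> 80) and (ii) the randomly initialized hidden/output layers.
weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Forward pass; the sigmoid hidden activations and softmax output are
    assumptions -- the text does not fix the activation functions here."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = 1.0 / (1.0 + np.exp(-(x @ W + b)))    # sigmoid hidden layers
    z = x @ weights[-1] + biases[-1]
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)      # softmax over the 3 classes

patch = rng.random(1125)            # one flattened 15 x 15 x 5 feature patch
probs = forward(patch)
print(probs.shape)  # (3,)
```

The argmax of `probs` would give the predicted class (buildings, vegetation, or ground) for the center pixel of the patch.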

Evaluation Metrics
In order to objectively evaluate our results, four different metrics are considered: accuracy, precision, recall, and the Critical Success Index (CSI). We should note that the F1-score is directly calculated from the precision and recall values. Accuracy (ACC) is defined as:

ACC = (TP + TN) / (TP + TN + FP + FN)

where the numerator contains the true positive (TP) and true negative (TN) samples, while the denominator additionally contains the false positives (FP) and false negatives (FN). Precision, recall, and F1-score are given as:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN),  F1 = 2 · Precision · Recall / (Precision + Recall)

Finally, the Critical Success Index (CSI) is defined as:

CSI = TP / (TP + FP + FN)
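These definitions translate directly into code; a minimal sketch treating the building class as the positive class:

```python
import numpy as np

def confusion_counts(y_true, y_pred, positive=1):
    """TP/TN/FP/FN counts, with the 'building' class treated as positive."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    csi = tp / (tp + fp + fn)            # Critical Success Index
    return dict(acc=acc, precision=precision, recall=recall, f1=f1, csi=csi)

# Toy example: 6 pixels, 1 = building, 0 = non-building.
m = metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(m)
```

Note that CSI, unlike accuracy, ignores true negatives, which makes it more informative when buildings cover only a small fraction of the scene.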

Problem Formulation
Let us denote by X ∈ R^d the set of input data (or features originating from them) and by x_i the i-th (feature) input datum, while we assume that n data are available, i.e., i = 1, ..., n. In this notation, variable d denotes the input dimension. As described in Section 5.1, in our case the input signals are 15 × 15 overlapped patches of NIR images, while for each patch we retain the three color components (NIR, R, and G), the vegetation index (see Equation (1)), and the (normalized) height derived through Digital Surface Model (DSM) measurements, called nDSM. This means that the input dimension is d = 15 × 15 × 5 = 1125. As stated in Section 3.1, only a small portion of the n available data are labeled, say l ≪ n. Without loss of generality, we can assume that the first l out of n data are the labeled ones and the remaining n − l are the unlabeled ones. Then, for the labeled inputs x_i, i = 1, ..., l, we know the respective targets (desired outputs) t_i. Vectors t_i are part of the set T ∈ R^c, where c is the number of classes, equaling three in our case (c = 3), i.e., buildings, vegetation, and ground. This means that, if we denote by X_l = {x_1, ..., x_l} the set of labeled input data, then we know the target outputs of all these data, T = {t_1, ..., t_l}, through an annotation process which, in our case, relies on a crowdsourcing interface. In the sequel, the pairs (x_i, t_i) of input-output relationships can be used through a training procedure to estimate the deep network parameters (weights).
The main drawback of the above-described process is that collecting the annotated (labeled) data is a tough task, requiring a lot of manual effort and time. On the contrary, the overwhelming majority of data can be found in the unlabeled set X_u = {x_{l+1}, ..., x_n}, for which the desired targets t_i, i = l + 1, ..., n, are unknown. What we want to do is to approximate these unknown targets and generate reliable estimates t̂_i, to be able to include them in the training process and thus to estimate the deep network parameters not only from the small portion of l labeled data but from the large pool of both labeled and unlabeled ones. This way, we aim to improve the classification performance, since more information is considered.
In particular, if we denote by E(·) the loss evaluation function of our deep network and by y_{w,i} the network output when datum x_i is fed as input and the network parameters (weights) are w, then the optimal weights ŵ are estimated in our semi-supervised learning approach as

ŵ = argmin_w E(Y_w^(n), T^(n))
In this equation, the matrices Y_w^(n) = [y_{w,1} · · · y_{w,n}] and T^(n) = [t_1 · · · t_l t̂_{l+1} · · · t̂_n] include the network outputs for a specific set of parameters w and the respective targets and approximate targets obtained through an SSL scheme. The superscript (n) is added to demonstrate that, in this case, all n data (labeled and unlabeled) are taken into account during training, and not only the small portion of l labeled data. This constitutes one of the main novelties of this article.
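The training objective above, which mixes true targets for the l labeled rows with SSL soft labels t̂ for the rest, can be sketched as a single loss; the squared-error form and the optional down-weighting of soft targets are our assumptions:

```python
import numpy as np

def ssl_loss(Y, T_labeled, T_soft, n_labeled, unlabeled_weight=1.0):
    """Squared-error loss over all n samples: true targets t_i for the first
    l labeled rows, SSL soft-label estimates for the remaining rows.
    The optional down-weighting of soft targets is our assumption."""
    T = np.vstack([T_labeled, T_soft])        # T^(n) = [t_1..t_l, soft t_{l+1}..t_n]
    w = np.ones(len(Y))
    w[n_labeled:] = unlabeled_weight          # optionally trust soft targets less
    return np.mean(w[:, None] * (Y - T) ** 2)

# Toy network outputs for n = 4 samples (l = 1 labeled), c = 3 classes.
Y = np.array([[1.0, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]])
T_labeled = np.array([[1.0, 0, 0]])
T_soft = np.array([[0.0, 1, 0], [0.5, 0.5, 0], [0, 0, 1]])
loss = ssl_loss(Y, T_labeled, T_soft, n_labeled=1)
print(round(loss, 4))  # 0.0417
```

Minimizing this loss over the network weights w is exactly what distinguishes the semi-supervised setup from training on the l labeled samples alone.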
The second major innovation is the utilization of an SAE compression scheme as the first part of our proposed DNN structure. The goal of this encoding part is, through unsupervised learning, to map the input set X ⊂ R^d to a reduced one X^(r) ⊂ R^o, o ≪ d. In our case, only 80 out of 1125 input elements are retained, achieving a dimensionality reduction of 92.89%. The main advantage of such a compression scheme is that we keep only the most salient information of the input data, reducing both the computational and memory requirements for training the DNN, while simultaneously avoiding the learning of "confused" and "contradictory" information due to the high redundancy of the input signals. This means that the inputs of the DNN are the signals x_i^(r), of significantly lower dimension than the original x_i.
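Greedy layer-wise SAE training of this kind can be sketched in numpy as below. The dimensions (1125 → 80) follow the text, but the intermediate width, tied weights, tanh activation, and plain gradient descent are our assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_ae_layer(X, hidden, lr=0.05, epochs=30):
    """Greedily train one tied-weight autoencoder layer by batch
    gradient descent on the reconstruction MSE; return the encoding."""
    d = X.shape[1]
    W = rng.normal(0.0, 0.01, (d, hidden))
    for _ in range(epochs):
        H = np.tanh(X @ W)          # encoder
        R = H @ W.T                 # linear decoder (tied weights)
        E = R - X                   # reconstruction error
        dH = (E @ W) * (1.0 - H ** 2)
        W -= lr * (X.T @ dH + E.T @ H) / len(X)
    return np.tanh(X @ W)

# Hypothetical toy batch: 50 patches of dimension d = 1125
# (15 x 15 pixels x 5 channels, as in the paper).
X = rng.normal(size=(50, 1125))

# Stack two layers, 1125 -> 300 -> 80, giving the reduced inputs x_i^(r).
H1 = train_ae_layer(X, 300)
X_r = train_ae_layer(H1, 80)
```

After greedy pre-training, the stacked encoder is what feeds the classifier; only the 80-dimensional code x_i^(r) propagates further.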

The Anchor Graph Method
Anchor graph [40] is a graph-based approach built on a small portion of p < l labeled data, called anchors. These anchors act as representatives of the l labeled samples. The anchor labels form a matrix A = [a_1, . . . , a_c] ∈ R^{p×c}, where we recall that c is the number of classes; each column vector accounts for one class. Then, the SSL scheme minimizes the following objective [45]:

min_A ‖Z_l A − I_l‖²_F + γ tr(A^T L̂ A),

where Z ∈ R^{n×p} is a sample-adaptive weight matrix that describes how the n samples are "projected" onto the p anchor samples (Z_l and I_l denote the rows corresponding to the l labeled samples), and L̂ = Z^T L Z is a memory-wise and computationally tractable alternative to the Laplacian matrix L. Since L ∈ R^{n×n}, we have L̂ ∈ R^{p×p}. The matrix I = [i_1, . . . , i_c] ∈ R^{n×c} is a class indicator matrix on the labeled samples, with I_ij = 1 if the label l_i of sample i corresponds to class j and I_ij = 0 otherwise. The Laplacian matrix is computed as L = D − W, where D ∈ R^{n×n} is the diagonal degree matrix of the adjacency matrix W, which is in turn built from Z as W = Z Λ^{−1} Z^T, with Λ_kk = Σ_i Z_ik for all k = 1, 2, . . . , p. The solution of Equation (4) has the closed form [45]:

A* = (Z_l^T Z_l + γ L̂)^{−1} Z_l^T I_l.
where A* is the optimal estimate of matrix A. The scalar γ of Equations (6) and (7) defines the weight of the second (regularization) term of both equations. Each sample label is then given by:

l̂_i = argmax_{j ∈ {1,...,c}} (Z_i a*_j) / λ_j,

where Z_i ∈ R^{1×p} denotes the i-th row of Z, and the factor λ_j = 1^T Z a*_j balances skewed class distributions.
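The anchor-graph pipeline above can be sketched end to end in numpy. The weight matrix Z, the regularization weight γ, and all sizes below are synthetic placeholders; the closed-form solve and the λ_j balancing follow the equations of this subsection:

```python
import numpy as np

rng = np.random.default_rng(0)
n, l, p, c = 200, 20, 10, 3        # samples, labeled, anchors, classes
gamma = 0.1

# Hypothetical sample-to-anchor weight matrix Z (rows sum to 1).
Z = rng.random((n, p))
Z /= Z.sum(axis=1, keepdims=True)

# Reduced Laplacian L-hat = Z' L Z, with the anchor-graph adjacency
# W = Z Lam^-1 Z' and Lam = diag(Z' 1).
Lam = np.diag(Z.sum(axis=0))
W = Z @ np.linalg.inv(Lam) @ Z.T
L = np.diag(W.sum(axis=1)) - W
L_hat = Z.T @ L @ Z                 # p x p, memory-wise tractable

# One-hot indicator rows for the l labeled samples.
I_l = np.eye(c)[rng.integers(0, c, size=l)]
Z_l = Z[:l]

# Closed-form solution A* of the regularized least-squares objective.
A_star = np.linalg.solve(Z_l.T @ Z_l + gamma * L_hat, Z_l.T @ I_l)

# Label assignment with skew-balancing factors lambda_j = 1' Z a*_j.
lam = Z.sum(axis=0) @ A_star        # one lambda_j per class j
labels = np.argmax((Z @ A_star) / lam, axis=1)
```

Every one of the n samples, labeled or not, receives a class score through its p anchor weights, which is what makes the method scale.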

SAFER: Safe Semi-Supervised Regression
Assume a set of b semi-supervised classifiers with soft outputs (hence, we can refer to these models as semi-supervised regressors, SSRs) applied over the unlabeled set X_u. The outcome is b predictions, i.e., {f_1, . . . , f_b}. Let us also denote by f_0 the output over the same unlabeled set X_u of a known, traditional supervised approach. For each regressor, we set an importance weight a_i ≥ 0. Then, we would like to find the optimal regressor f so that its worst-case gain over f_0 is maximized [41]:

max_f min_{a ∈ M} Σ_{i=1}^{b} a_i ( ‖f_0 − f_i‖² − ‖f − f_i‖² ).

In this formulation, both the optimal soft classifier output (the regressor f) and the weights a = [a_1 · · · a_b]^T are unknown. To solve this problem, we constrain the weights so that a ≥ 0 and 1^T a = 1, that is, the sum of all weights should be one [46]; the feasible set is thus M = {a | 1^T a = 1, a ≥ 0}. The problem is concave in f and convex in a, and is thus recognized as a saddle-point convex-concave optimization [34]. As described in recent work [16], it can be reformulated as a geometric projection problem, which keeps the computational load tractable. Specifically, setting the derivative with respect to f to zero yields the closed form f = Σ_{i=1}^{b} a_i f_i, so the optimal weight coefficients a* are first obtained by projecting f_0 onto the convex hull of {f_1, . . . , f_b}, and the optimal regressor is then estimated as the corresponding convex combination f* = Σ_{i=1}^{b} a*_i f_i.
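The geometric-projection view can be sketched as follows. The predictions F and baseline f_0 are synthetic, and the projected-gradient loop with a Euclidean simplex projection is our own minimal solver, not the solver used in SAFER:

```python
import numpy as np

rng = np.random.default_rng(0)
b, m = 3, 50                       # number of SSRs, unlabeled samples
F = rng.random((b, m))             # predictions f_1 ... f_b
f0 = rng.random(m)                 # baseline supervised prediction f_0

def project_simplex(v):
    """Euclidean projection of v onto {a : a >= 0, 1'a = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

# Project f_0 onto the convex hull of {f_1, ..., f_b}: minimise
# g(a) = ||a'F - f0||^2 over the probability simplex by projected
# gradient descent.
a = np.full(b, 1.0 / b)
lr = 0.5 / np.linalg.norm(F @ F.T, 2)   # step from the Lipschitz constant
for _ in range(500):
    grad = 2.0 * F @ (a @ F - f0)
    a = project_simplex(a - lr * grad)

f_star = a @ F                     # final "safe" prediction
```

Because f* is a convex combination of the SSR outputs anchored toward f_0, a bad individual regressor cannot pull the final prediction arbitrarily far from the supervised baseline.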

SMIR: Squared-Loss Mutual Information Regularization
The Squared-loss Mutual Information Regularization (SMIR) is a probabilistic framework trained in an unsupervised way so that a given information measure between the data and the cluster assignments is maximized (i.e., how well the clusters represent the data). Maximization is achieved through a convex optimization strategy (under some mild assumptions regarding cluster overlapping) and thus results in a globally optimal solution [42].
For a given input x ∈ X, we would like to estimate the class to which this input is assigned by maximizing the posterior probability, t̂ = argmax_t p(t|x). In this notation, we adopt a scalar network output t instead of a vector one. This is not a real restriction, since any vectorized output over a finite number of c classes can be mapped onto a one-dimensional space. The SMIR approach approximates the class-posterior probability p(t|x) as follows. Assuming a uniform class-prior probability p(t) = 1/c (equal importance of all output classes), the Squared-loss Mutual Information (SMI) (without the regularization terms) takes the following form [47]:

SMI = (c/2) Σ_{t=1}^{c} ∫ p(t|x)² p(x) dx − 1/2.    (12)

The unknown probability p(t|x) of Equation (12) can be approximated by a kernel model,

q(t|x; A) = Σ_{j=1}^{n} a_{t,j} k(x, x_j),    (13)

where q(·) is the approximation of the probability p(t|x), A = [a_1 . . . a_c] ∈ R^{n×c} holds the model parameters with columns a_r = [a_{r,1}, . . . , a_{r,n}]^T, and k(·,·) is a kernel X × X → R which takes two inputs and returns a scalar. If we approximate the probability p(x) of Equation (12) by the empirical average over the n samples, the SMI estimate becomes

ŜMI = (c/2n) Σ_{t=1}^{c} a_t^T K² a_t − 1/2,    (14)

where K ∈ R^{n×n} is the kernel matrix over all n samples. In principle, any kernel model linear with respect to a_t can be used to approximate the probability p(t|x). However, this may lead to a non-convex optimization, and thus the solution can be trapped in local optima. To avoid this, a regularization term is adopted in [42]. This is done by introducing a new feature map Φ_n which maps the inputs from the input space X to the n-dimensional space R^n, that is,

Φ_n(x) = [k(x, x_1), . . . , k(x, x_n)]^T.    (15)

If we denote by d_i = Σ_{j=1}^{n} k(x_i, x_j) the degree of x_i and by D = diag(d_1, d_2, . . . , d_n) the degree diagonal matrix, then we can approximate the class-posterior probability p(t|x) by

q(t|x; A) ∝ ⟨a_t, D^{−1/2} K^{−1/2} Φ_n(x)⟩,    (16)

where ⟨·,·⟩ is the inner product. This equation is valid assuming that K is a full-rank matrix, so that K^{−1/2} is well defined.
Plugging Equation (16) into Equation (12), we obtain an alternative SMI criterion, augmented with a regularization term, called the Squared-loss Mutual Information Regularization (SMIR) and denoted ŜMIR(A) in Equation (17), where A ∈ R^{n×c} is the matrix representation of the model's parameters as in Equation (13). Equation (17) is used to regularize a loss function ∆(p, q) between the actual class-posterior probability p(·) and its approximate version q(·); the function ∆(·,·) expresses how much the actual probability diverges from the approximate one. The objectives are then (i) to minimize ∆(p, q), (ii) to maximize ŜMIR, and (iii) to regularize the model parameters A. Hence, the SMIR optimization problem of Equation (18) combines these three terms, where γ, λ > 0 are regularization parameters. If the kernel function k(·) is nonnegative and λ > γc/n, Equation (18) is convex and always converges to a global optimum. Thus, we can lower-bound λ by this value to guarantee the convexity property.
Under this regularization scheme, the optimal estimate of the class posterior p(t|x) is given in Equations (19) and (20) [42]. In Equation (20), β_t is a normalized version of the optimal model parameters a*_t, and π_t is an estimate of the class-prior probability p(t).

Data Post Processing
To purify the classifier output from noisy data, only the building category is initially selected from the available classes, and the building mask is then refined by post-processing. The goal of the post-processing is to remove noisy regions, such as isolated pixels or tiny blobs of pixels, while retaining the local coherency of the data. Toward this, a majority voting technique with a radius of 21 pixels is first applied; an erosion filter with a 7 × 7 window is applied afterwards.
The majority voting filter categorizes each potential building pixel with respect to the outputs of the neighboring data, exploiting the spatial coherency that a building exhibits. Since the orthoimages are generated based on DSMs, the building boundaries are blurred due to mismatches arising during the application of the DIM algorithm, which dilates the building boundaries in the results. Thus, the erosion filter was applied to "absorb" possible excessive interpolations on the boundaries of the buildings by reducing their dilated size.
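A minimal numpy sketch of this post-processing chain follows, applied to a synthetic binary mask. The window sizes come from the text; the disk-shaped neighbourhood, padding modes, and tie-breaking are our assumptions:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
mask = rng.random((100, 100)) > 0.6     # hypothetical raw building mask

def majority_vote(m, radius):
    """Keep a pixel as 'building' iff buildings win the vote inside a
    disk-shaped neighbourhood of the given radius."""
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    disk = (yy ** 2 + xx ** 2) <= radius ** 2
    padded = np.pad(m, radius, mode="edge")
    win = sliding_window_view(padded, disk.shape)   # (H, W, k, k)
    votes = (win & disk).sum(axis=(2, 3))
    return votes > disk.sum() / 2

def erode(m, k):
    """Binary erosion with a k x k square structuring element."""
    pad = k // 2
    padded = np.pad(m, pad, mode="constant", constant_values=False)
    win = sliding_window_view(padded, (k, k))
    return win.all(axis=(2, 3))

# Majority voting with a 21-pixel radius, then a 7 x 7 erosion.
refined = erode(majority_vote(mask, 21), 7)
```

In practice, morphology routines from scipy.ndimage or OpenCV would do the same job faster; the sliding-window version is kept here for self-containment.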

Performance Evaluation
A total of two alternative approaches are considered for the evaluation of the model performance: (i) over the polygon-bounded areas in which the three class categories (buildings, vegetation, and ground) are discriminated, and (ii) over the original annotations provided by the dataset, which include only two categories, buildings and non-buildings, as stated in Section 3.2. The first evaluation case is a typical multiclass classification problem, while the second reduces to a binary classification.

The Multi-Class Evaluation Approach
In this scenario, we evaluate the performance of our model over the three available classes, i.e., Buildings (1), Vegetation (2), and Ground (3), given the annotated samples from the crowdsourced data. The SAE-driven and SSL-based DNN model has been trained using the small portion of labeled data together with the unlabeled ones. Regarding the unlabeled data, one of the proposed SSL schemes is applied to estimate the targets and, through them, to accomplish the model training. Table 3 demonstrates the proposed model's performance over the unlabeled and unseen (test) data. This means that, after having trained the model using both labeled and unlabeled data, we feed as inputs to the classifier only the unlabeled and the unseen data to evaluate its classification performance. The unseen (test) set assesses the model performance on data totally outside the training phase, i.e., data that the model has not seen during the learning process. The unlabeled data assess how well the model behaves on data whose targets have been estimated by the SSL methods, i.e., how well the selected SSL techniques work. The results have been obtained using the Accuracy, Precision, and Recall objective criteria (see Section 3.5) for all three examined areas, averaging over all of them and over all categories. Table 3. Building classification performance of the proposed SAE-driven and SSL-based DNN when each of the proposed SSL techniques is applied during model training and a three-class classification is adopted (buildings, vegetation, and ground). The results are obtained as the average over the three examined areas and over all three categories.
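The averaging used here (overall accuracy plus per-class precision and recall averaged over the three classes) can be sketched as follows; the toy label vectors are hypothetical and only illustrate the computation:

```python
import numpy as np

def macro_scores(pred, truth, c=3):
    """Overall accuracy plus macro-averaged precision and recall over
    the c classes (e.g., buildings, vegetation, ground)."""
    acc = np.mean(pred == truth)
    prs, res = [], []
    for k in range(c):
        tp = np.sum((pred == k) & (truth == k))
        fp = np.sum((pred == k) & (truth != k))
        fn = np.sum((pred != k) & (truth == k))
        prs.append(tp / (tp + fp) if tp + fp else 0.0)
        res.append(tp / (tp + fn) if tp + fn else 0.0)
    return acc, np.mean(prs), np.mean(res)

# Toy 6-sample example with classes 0, 1, 2.
truth = np.array([0, 0, 1, 1, 2, 2])
pred  = np.array([0, 0, 1, 2, 2, 2])
acc, pr, re = macro_scores(pred, truth)   # -> 5/6, 8/9, 5/6
```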

[Table 3 is garbled in this extraction; the surviving rows for the unlabeled set read: Anchor Graph [40]: 0.967, 0.969, 0.971; SAFER [41]: 0.967, 0.970, 0.970; the SMIR [42] and WeiAve rows are truncated.]

We can see that high-performance results are obtained. The results are slightly better when the simple fusion SSL method, WeiAve, is employed, but all SSL techniques work very well in correctly classifying the unlabeled and test data.
Ablation Study: We now proceed to an ablation study to show how the different components of our system affect the final performance. First, we examine how well the proposed SSL algorithms work. Table 4 presents how accurately the four proposed SSL algorithms estimate the actual targets (labels) of the unlabeled samples. That is, we have compared the soft labels generated by the four SSL schemes with the actual ones assigned by the expert users. Evaluation is carried out for the three examined Areas using two objective criteria: the root mean squared error and the F1-score. As observed, all the proposed SSL schemes correctly estimate the labels of the data. Table 4. Evaluation of the performance of the proposed Semi-Supervised Learning (SSL) techniques in estimating the actual targets (labels) of the unlabeled data.

Another ablation analysis examines how our model behaves without the SSL and SAE schemes, that is, without the two main components of our approach. Toward this, we first train the DNN model using both the labeled and the unlabeled data, but treat the latter as labeled, considering their actual targets during training. Then, we evaluate the performance of the trained DNN on the unseen (test) data. Table 5 shows the results obtained using three objective criteria (Accuracy, Precision, and Recall) by averaging over the three examined Areas. In this table, we also present the results of the WeiAve SSL approach from Table 3 for direct comparison. As we can see, the results are very close, which is justified by the fact that the SSL methods correctly estimate the labels of the data (see Table 4). However, the disadvantage of treating unlabeled data as labeled is the additional manual effort needed to generate these labels and the extra human and financial resources this imposes. Thus, our approach yields the same classification performance with a much smaller portion of labeled data. Table 5. Comparison of the classification performance of the proposed scheme with the one derived without the use of any SSL algorithm and without the SAE encoding.

In the same Table 5, we also present classification results when the SAE encoding part is removed from the network. First, such a process dramatically increases the computational cost and the memory requirements for training, due to the high dimension of the input signals. In addition, the classification results worsen. This is due to the noise embedded in the highly redundant information the input signals carry. Thus, the SAE scheme not only reduces the computational and memory costs of training but also eliminates information noise in the inputs which may confuse classification.
In Table 6, we report the computational cost of the four SSL schemes, the time needed for the whole system to classify all pixels of the image, and the time for the SAE component to reduce information redundancy. We observe that SMIR is the fastest SSL technique, requiring only a few seconds to complete, whereas SAFER is much slower. The SAE encoding takes considerable time, but it is activated only in the training phase of the classifier. The time needed to classify all image pixels is also reported in Table 6. Recall that annotation requires the creation of overlapping patches of size 15 × 15; in our case, we use simple loop parsing for this creation, and numerical tensor manipulation could reduce this time significantly.

The Two-Class Evaluation Approach
In this case, the evaluation is carried out on the provided annotation of the dataset, which assigns the data into two categories: buildings and non-buildings. This way, we can provide a comparative analysis of our results against other approaches. Table 7 shows the results obtained in this case as the average over the two class categories, using the objective criteria of Recall, Precision, F1-score, and Critical Success Index (CSI). The results are displayed for the three examined areas and the four different SSL methods. The highest F1-scores are achieved when the WeiAve approach is adopted for Areas 1 and 2, while the best score for Area 3 is achieved using the SAFER technique. All cases result in high scores. In the same table, the ranking order of each method is also displayed. Figure 6 demonstrates the DNN classifier's performance over small objects (e.g., single trees). Pixel annotation similarity exceeds 85% for all images. Generally, when the object spans less than 10 × 10 pixels, detection capabilities decline. This can be partially explained by the fact that most of the block pixels, i.e., (15 × 15) − (10 × 10) = 125 pixels, describe something different. The best DNN model in this case is trained using the WeiAve SSL scheme.

Figure 6. Illustrating the model classification outputs for Area 2, using different SSL methods to estimate the soft labels of the unlabeled data during training.

Figure 7 evaluates the building detection capabilities of our model against the ground truth data for the three examined areas. In particular, for Figure 7a,b of Areas 1 and 2, the WeiAve SSL technique has been applied to estimate the soft labels of the unlabeled data used during the training phase, while for Area 3 the SAFER SSL is exploited. This differentiation is adopted since WeiAve performs best for Areas 1 and 2, while SAFER is best for Area 3. Yellow corresponds to pixels of a building that the model classified as building (True Positive). Red indicates pixels that the model classified as buildings but whose actual label was either vegetation or ground (False Positive). Finally, blue indicates areas that are buildings but which the model failed to recognize (False Negative). As observed, segmentation of building blocks is extremely accurate considering the limited training sample. Misclassifications involve inner yards, kiosk-sized buildings (e.g., bus stations), and the edges of buildings.
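The binary criteria used in this evaluation (Recall, Precision, F1, CSI) follow directly from the True Positive, False Positive, and False Negative pixel counts just described; a minimal sketch with a hypothetical 4-pixel example:

```python
import numpy as np

def binary_scores(pred, truth):
    """Recall, Precision, F1, and Critical Success Index (CSI) from
    pixel-wise binary building masks."""
    tp = np.sum(pred & truth)       # building predicted and true
    fp = np.sum(pred & ~truth)      # building predicted, not true
    fn = np.sum(~pred & truth)      # building missed
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    csi = tp / (tp + fp + fn)       # a.k.a. Jaccard index / IoU
    return recall, precision, f1, csi

# Toy example: 2 TP, 1 FP, 1 FN.
pred  = np.array([True, True, True,  False])
truth = np.array([True, True, False, True])
re, pr, f1, csi = binary_scores(pred, truth)   # -> 2/3, 2/3, 2/3, 0.5
```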

Comparison with Other State-of-the-Art Approaches
In this section, we compare the performance of our approach with other state-of-the-art techniques that exploit the same dataset as ours. This is the main value of selecting a benchmark dataset for conducting our experiments, i.e., direct comparison of our results with other methods. Table 8 presents the comparison results. More specifically, in this table we show our results using the proposed SAE-driven and SSL-based DNN when different SSL techniques are applied for labeling the unlabeled data. We also compare our results against other state-of-the-art methods using (a) the same type of data (orthoimages plus height estimation using DIM and DSM modeling), (b) orthoimages combined with expensive LiDAR information, and (c) only expensive LiDAR information. All results are reported using the CSI score.
Our method outperforms all the state-of-the-art methods using the same data types (low-cost orthoimages plus an estimation of the height through DIM and DSM). If the expensive and more precise LiDAR information is utilized, the results are slightly better and, in some cases (such as the work of [48]), even worse than ours. This reveals that our method, although it exploits only cheap and less precise height information, gives results of similar performance. We should also stress that in our approach less than 0.08% of the total data is utilized for labeled training, significantly reducing the effort required to annotate the data. As a result, it is clear that our methodology gives performance similar to state-of-the-art techniques, despite the fact that we use a very limited labeled dataset and relatively cheap orthoimage information instead of highly expensive LiDAR data.


Discussion
The main problem in classifying satellite remote sensing data into semantic categories is the creation of the annotated (labeled) dataset needed for the training process. Creating this dataset requires considerable manual effort, which is time consuming and costly. In addition, data annotation consumes human and financial resources, which can make the whole process unaffordable. The main innovation of this paper is the utilization of a very small labeled dataset for the semantic segmentation of remote sensing NIR data. This reduces the annotation cost and lets experts spend their time conducting remote sensing work rather than annotating data to create labels. However, a reduction in the number of training data would normally deteriorate the classification accuracy as well. To compensate, we adopt the concept of semi-supervised learning (SSL). The goal of an SSL scheme is to enrich the model training phase with additional unlabeled data, the targets of which are estimated by transferring knowledge from the small labeled dataset to the unlabeled one. Furthermore, a non-linear encoding scheme is adopted through the use of Stacked Autoencoders (SAE) to remove information redundancy from the input signals.
The experiments show that the proposed SSL schemes estimate the labels of the unlabeled data with an accuracy of almost 99.9%. This implies that we can reduce the labeled dataset by a factor of ten or more compared to other state-of-the-art works while keeping the classification performance almost the same; in our case, less than 0.08% of the total data are used as labeled data. In addition, the experiments show that the proposed scheme yields classification results very close to those obtained using high-cost sensors such as LiDAR.
The advantages of our proposed SAE-driven and SSL-boosted DNN model are: (a) limited effort and time to construct the training set, since few labeled data are required for training (i.e., less than 0.08%, while the closest supervised approach uses approximately ten times more data [10]); (b) adaptability to user needs, since the user can define the number and type of classes to classify (thus, we can easily apply the same concept to different scenarios, e.g., classification of different types of objects instead of buildings); and (c) applicability, in the sense that the proposed scheme supports the transfer learning concept, since a pretrained network can easily be updated to handle different types of problems.

Conclusions
In this paper, we employ semi-supervised learning (SSL) methods together with Stacked Autoencoders (SAE) for semantically segmenting NIR remote sensing images. The SSL schemes transfer knowledge from a small set of labeled data to estimate the targets (soft labels) of the unlabeled data. Deep neural network training is then carried out using both this small portion of labeled samples and the estimated labels of the vast amount of unlabeled data. As a result, the effort required to annotate the data is minimized while classification performance is kept at acceptable levels. Overall, four SSL methods are adopted for estimating the targets of the unlabeled samples: the Anchor Graph, the Safe Semi-Supervised Regression (SAFER), the Squared-loss Mutual Information Regularization (SMIR), and an equal-importance Weighted Average of them, called WeiAve.
Another novelty of our paper is the use of a Stacked Autoencoder (SAE) scheme to reduce redundancy in the input signals while keeping almost all the meaningful information. The goal of the SAE encoding is to map the input data into a space of much smaller dimension, in a highly non-linear way, so as to retain most of the knowledge within the input samples. This way, we avoid noise effects in the signals and potentially contradictory information.
The combination of the above-mentioned novelties yields a new deep learning model, called the SAE-driven SSL-based Deep Neural Network (DNN). A DNN is selected instead of a Convolutional Neural Network (CNN) to avoid the computational burden of propagating high-dimensional inputs through multiple convolutions. The model's classification performance is tested on a benchmark dataset, the Vaihingen city dataset in Germany, which allows us to directly compare our approach with other state-of-the-art methodologies. The results show that our approach outperforms the compared works when they exploit orthoimages as data types, even though an exceedingly small portion (less than 0.08%) of the total data has been used for the labeled set. We have also compared our method with methodologies employing highly sensitive but much more expensive sensors such as LiDAR. The results indicate that our methodology yields results close to those obtained with LiDAR samples, despite the fact that our data are much less precise and only a very small portion of labeled samples is utilized.

Data Availability Statement:
The data presented in this study are openly available in the ISPRS Test Project on Urban Classification, 3D Building Reconstruction and Semantic Labeling: https://www2.isprs.org/commissions/comm2/wg4/benchmark/detection-and-reconstruction/.