Learn to Extract Building Outline from Misaligned Annotation through Nearest Feature Selector



Introduction
The rooftops of buildings are dominant features in urban satellite or aerial imagery. For many remote sensing applications, such as slum mapping [1], urban planning [2], and solar panel capacity analysis [3], the spatial distribution of buildings and its changes over time are critical. This information is conventionally collected through labor-intensive and time-consuming field surveys [4]. For analyses at the city or country scale, especially in developing countries, a robust and cost-efficient method for automatic building extraction is preferred.

Unsupervised Methods
For most unsupervised methods, building outlines are extracted by thresholding pixel values or histograms [11], edge detectors [12], and region techniques [13,14]. Because of their simplicity, these methods do not require additional training data and are fast. However, when applied to residential areas with complex backgrounds, artifacts and noise are inevitable in the extracted building outlines.

Supervised Methods
Unlike unsupervised methods, supervised methods extract building outlines from images through patterns learned from ground truths. By learning from correct examples, supervised methods typically perform better in terms of both generalization and precision [15][16][17].
In the early stages, a two-stage approach that combines handcrafted descriptors for feature extraction [18][19][20][21] and classifiers for categorization [22][23][24] was adopted in supervised methods. Because of this separation, an optimal combination of feature descriptor and classifier is difficult to achieve. In contrast to the two-stage approach, convolutional neural network (CNN) methods enable unified feature extraction and classification through sequential convolutional and fully connected layers [25,26]. Initially, CNN-based methods were constructed in a patch-by-patch manner that predicts the class of a pixel from its surrounding patch [27]. Subsequently, fully convolutional networks (FCNs) were introduced to reduce memory costs and improve computational efficiency through sequential convolutional, subsampling, and upsampling operations [28,29]. Because of the information loss caused by subsampling and upsampling operations, the prediction results of classic FCN models often present blurred edges. Hence, advanced FCN-based methods using various strategies have been proposed, such as unpooling [30], deconvolution [31], skip connections [32,33], multi-constraints [34], and stacking [35]. Among FCN-based methods, two different approaches exist: (a) indirect and (b) direct approaches.

Indirect Approach
In the indirect approach, instead of extracting the building outline directly from the input aerial or satellite image, semantic maps are generated first. The outlines are then computed on top of those maps. Because the outlines are derived from the segmentation output, the final accuracy relies significantly on the robustness of the semantic segmentation.
In principle, all FCN-based methods mentioned above can be used for indirect building outline extraction. However, owing to the sensitivity of the outline/boundary, training with only semantic information typically results in an inconsistent outline or boundary. To prevent this, BR-Net [36] utilizes a modified U-Net and a multitask framework to generate predictions for semantic maps and building outlines based on a consistent feature representation from a shared backend.

Direct Approach
Unlike the indirect approach, the direct approach extracts the building outlines directly from the input aerial or satellite images. Compared with the indirect approach, the direct approach learns the extraction pattern directly from the ground truth outline, which preserves a higher fidelity. In the direct approach, building outline extraction is considered a segmentation or pixel-level classification problem that involves extremely biased data [37]. In recent years, some advanced FCN-based models, such as RSRCNN [38], ResUNet [39], and D-LinkNet [40], have been proposed for better outline extraction.
However, these models focus on deeper network architectures to better utilize the feature representation capability of hidden layers. Furthermore, regardless of how these models generate predictions, their loss functions are computed directly from the pixel-to-pixel similarity to the ground truth. Owing to the extremely biased distribution of positive and negative pixels, gradient explosion during training becomes a severe problem. Additionally, because of occasional human errors, misalignments of several to tens of pixels will inevitably occur between the annotation and the corresponding aerial image. Because the building outline contains far fewer positive pixels, the pixel-to-pixel losses are extremely sensitive to these misalignments.
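The sensitivity of thin outlines to small offsets can be illustrated with a toy example. The sketch below is not taken from the paper's experiments; it assumes a hypothetical 1-pixel-wide square outline and compares it with a copy displaced by one pixel diagonally, using the pixel-to-pixel Jaccard index. The helper names `square_outline` and `jaccard` are illustrative only.

```python
def square_outline(top, left, size):
    """Return the set of (row, col) pixels on a 1-pixel-wide square outline."""
    bottom, right = top + size - 1, left + size - 1
    pixels = set()
    for k in range(size):
        pixels.add((top, left + k))      # top edge
        pixels.add((bottom, left + k))   # bottom edge
        pixels.add((top + k, left))      # left edge
        pixels.add((top + k, right))     # right edge
    return pixels


def jaccard(a, b):
    """Pixel-to-pixel Jaccard index (intersection over union) of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


annotation = square_outline(top=5, left=5, size=10)
aligned = square_outline(top=5, left=5, size=10)
shifted = square_outline(top=6, left=6, size=10)   # one-pixel diagonal offset

print(jaccard(annotation, aligned))   # identical outlines
print(jaccard(annotation, shifted))   # only the two crossing points overlap
```

In this toy case, a perfectly shaped outline that is offset by a single pixel shares only two pixels with the annotation, so a pixel-to-pixel criterion scores it almost as poorly as an unrelated prediction.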
Hence, we propose a nearest feature selector (NFS) module that enables a dynamic re-alignment between the ground truth and prediction. A dynamic matching between the ground truth and prediction is performed at every iteration to determine the matched position. Subsequently, the overlapped areas of both the ground truth and prediction are used for loss computation. Because the NFS operates upstream of the loss computation, it can be seamlessly integrated with all existing loss functions. The effectiveness of the proposed NFS module is demonstrated using a very-high-resolution (VHR) image dataset [36] located in New Zealand (see Section 2.1). In comparative experiments, under different loss functions, the addition of the NFS yields significantly higher values of the f1-score, Jaccard index [41], and kappa coefficient [42].
The main contributions of this study are as follows:
• We design a fully convolutional network framework for direct building outline extraction from aerial imagery.
• We propose the nearest feature selector (NFS) module to dynamically re-align the prediction and annotation, so that training is not misled by slightly misaligned annotations.
• We analyze the effectiveness of the NFS with different loss functions to understand its effects on the performance of deep CNN models.
The rest of the paper is organized as follows. First, we introduce the materials and methods used in this research in Section 2. Then, we present the learning curves and the quantitative and qualitative results in Section 3. Finally, we present our discussion and conclusions in Sections 4 and 5, respectively.

Data
To evaluate the performance of different methods, a research area located in Christchurch, New Zealand, is selected. The original aerial imagery and the annotated building polygons are hosted by Land Information New Zealand (LINZ) (https://data.linz.govt.nz/layer/53413nz-building-outlines-pilot/). The aerial images have a spatial resolution of 0.075 m. Prior to performing our experiments, we evenly partition the study area into two areas for training (i.e., Figure 1a, left) and testing (i.e., Figure 1a, right). The original annotations provided by LINZ are registered to the corresponding building grounds instead of rooftops (confirmed by visual interpretation using the QGIS GUI (https://qgis.org/)). For accurate outline extraction, we manually adjust the vectorized building outlines to ensure that all building polygons and aerial rooftops are roughly registered (i.e., Figure 1b). Because of the large number of buildings and occasional human errors, misalignments of a sub-pixel to several pixels are inevitable. Thus, we have to train the models with an imperfect "ground truth". As shown in Figure 1a, the study area is covered mainly by residential buildings with sparsely distributed factories, trees, and lakes. From the training and testing areas, 16,635 and 14,834 patches are extracted, respectively. The size of each patch is 224 × 224 pixels. As shown in Figure 1c, within each pair of patches, the buildings are located in the center area.

Methodology
In this study, we aim to correctly train and evaluate a model using imperfect annotations. Because of the inevitable misalignments, values of loss functions or metrics that are computed directly by pixel-to-pixel comparison of the prediction and annotation are inaccurate. To avoid this, we introduce the nearest feature selector (NFS) module to perform similarity selection during the training and testing stages.
As shown in Figure 2, in the training phase, the NFS is applied to the prediction and the imperfect annotation to generate an aligned prediction and annotation for accurate loss estimation and proper back-propagation. In the testing phase, the NFS is applied in the same way to generate an aligned prediction and annotation that can be used for reliable accuracy analysis. Since the NFS selects the best-matching overlap, it can avoid misalignments in the ground truth and produce a more reliable accuracy or prediction error.

Figure 3 presents the workflow for building outline extraction. The aerial images and their corresponding building outlines are partitioned into two sets for training and testing. Through several cycles of training and validation, the hyperparameters, including the batch size, number of iterations, random seed, and initial learning rate, are determined and optimized using the basic model (i.e., SegNet + L1 loss). Subsequently, the predictions generated by the optimized models are evaluated using the patches within the test set. For performance evaluation, we select three typically used balanced metrics, i.e., the f1-score, Jaccard index, and kappa coefficient. These metrics are computed before the post-processing operations [43,44].

Data Preprocessing
According to the location and extent of every building polygon, a square window is applied to the centroid of the polygon to extract the corresponding image patch. All patches are then resized to 224 × 224 pixels. After data preprocessing, 16,635 and 14,834 image patches are extracted from the training and testing areas, respectively. Since we have carefully checked the annotations, there are no negative patches to be discarded. Then, the image patches within the training area are shuffled and partitioned into two groups: training (70%) and validation (30%). Consequently, the numbers of patches used for training, validation, and testing are 11,644, 4990, and 14,834, respectively.
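The shuffle-and-partition step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_patches`, the fixed seed, and the truncation used to obtain the training count are assumptions, since the text does not state the rounding convention.

```python
import random


def split_patches(patch_ids, train_fraction=0.7, seed=0):
    """Shuffle patch identifiers and split them into training and validation."""
    ids = list(patch_ids)
    random.Random(seed).shuffle(ids)            # reproducible shuffle
    n_train = int(len(ids) * train_fraction)    # assumed: truncate to an integer
    return ids[:n_train], ids[n_train:]


# 16,635 patches from the training area, identified here by index only.
train_ids, val_ids = split_patches(range(16635))
print(len(train_ids), len(val_ids))
```

Fixing the seed keeps the partition identical across runs, which matters when several loss functions are compared on the same validation set.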

Proposed Model
For an efficient building outline extraction, we utilize a modified SegNet [30] for feature extraction and the NFS to achieve a dynamic alignment between the ground truth and prediction (see Figure 4).

Feature extraction
In this study, we utilize a modified SegNet for effective feature extraction from very-high-resolution aerial images. As shown in Figure 4, the modified SegNet comprises sequential operation layers, including convolution, nonlinear activation, batch normalization, subsampling, and unpooling operations.
The convolution operation is an element-wise multiplication, followed by summation, within a two-dimensional kernel (e.g., 3 × 3 or 5 × 5). The size of the kernel determines the receptive field and computational efficiency of the convolution operation. Owing to the complexity of the task, we set the number of kernels of the corresponding convolutional layers to [24, 48, 96, 192, 384, 192, 96, 48, 24] [34]. Subsequently, the convolution output is passed through a rectified linear unit [45], which sets all values less than zero to zero. To accelerate network training, a batch normalization [46] layer is appended to every activation function except for the final layer. Max-pooling [47] and the corresponding unpooling [30] are used to reduce and upsample the width and height of intermediate features, respectively.
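The pairing of max-pooling with index-based unpooling can be sketched on a single-channel feature map. The code below is an illustration under simplifying assumptions (plain Python lists, a 2 × 2 window, even width and height); an actual model would use the tensor operations of a deep learning framework.

```python
def max_pool_2x2(feature):
    """2x2 max-pooling that also records where each maximum came from."""
    rows, cols = len(feature), len(feature[0])
    pooled, indices = [], []
    for r in range(0, rows, 2):
        pooled_row, index_row = [], []
        for c in range(0, cols, 2):
            window = [(feature[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool_2x2(pooled, indices, rows, cols):
    """Place pooled values back at their recorded positions; zeros elsewhere."""
    restored = [[0.0] * cols for _ in range(rows)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            restored[r][c] = value
    return restored


feature = [[1.0, 3.0, 0.0, 2.0],
           [4.0, 2.0, 1.0, 0.0],
           [0.0, 1.0, 5.0, 6.0],
           [2.0, 0.0, 7.0, 1.0]]
pooled, indices = max_pool_2x2(feature)
restored = max_unpool_2x2(pooled, indices, rows=4, cols=4)
print(pooled)
print(restored)
```

Because the unpooling step reuses the recorded positions, the strongest responses return to their original locations, which helps thin structures such as outlines stay spatially sharp after upsampling.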

Nearest Feature Selector (NFS)
Figure 5 shows the mechanism of the NFS. The center area of the ground truth slides over the corresponding prediction along both the X- and Y-axes to generate overlaps between the prediction region $X_{i,j}$ and the ground-truth center $Y_c$, where i and j are the distances from the initial position. To balance computational efficiency against the sliding field, we set the maximum values of both i and j to five. Subsequently, the overlaps are used for similarity estimation through different criteria according to the number of channels of the output.
For a prediction and ground truth containing a single channel, the classic L1 distance is used. Thus, the distance of the (i,j) overlap can be formulated as:
$$D_{i,j} = \lVert X_{i,j} - Y_c \rVert_1$$
where $X$ is the prediction and $Y$ is the corresponding ground truth. Both $X$ and $Y$ are $\in \mathbb{R}^{W \times H}$, where W and H are the width and height of the corresponding output, respectively.
For a prediction and ground truth containing multiple channels, the average cosine similarity along the channels is calculated, and the distance of each overlap is derived from it.
From all overlaps, the location indices of the one with the closest distance to the ground truth are determined as:
$$(i_{min}, j_{min}) = \arg\min_{i,j} D_{i,j}$$
The nearest overlap ($X_{i_{min},j_{min}}$) and the corresponding ground truth ($Y_c$) are selected for the final loss estimation. Four well-known loss functions, namely, L1, mean square error (MSE), binary cross-entropy (BCE) [48], and focal loss [49], are chosen in this study. The L1 and MSE losses can be formulated as:
$$L_{L1} = \frac{1}{W \times H} \sum_{m=1}^{W} \sum_{n=1}^{H} \left| y_{m,n} - g_{m,n} \right|$$
$$L_{MSE} = \frac{1}{W \times H} \sum_{m=1}^{W} \sum_{n=1}^{H} \left( y_{m,n} - g_{m,n} \right)^2$$
where W and H represent the width and height of the nearest overlap ($X_{i_{min},j_{min}}$) and corresponding ground truth ($Y_c$). The values of $y_{m,n}$ and $g_{m,n}$ are the predicted probability and ground truth, respectively.
For notational convenience, we define $p_{m,n}$:
$$p_{m,n} = \begin{cases} y_{m,n} & \text{if } g_{m,n} = 1 \\ 1 - y_{m,n} & \text{otherwise} \end{cases}$$
As compared with traditional cross-entropy, focal loss introduces a scaling factor (γ) to focus on difficult samples. Mathematically, the BCE and focal loss can be formulated as:
$$L_{BCE} = -\frac{1}{W \times H} \sum_{m=1}^{W} \sum_{n=1}^{H} \log\left(p_{m,n}\right)$$
$$L_{focal} = -\frac{1}{W \times H} \sum_{m=1}^{W} \sum_{n=1}^{H} \left(1 - p_{m,n}\right)^{\gamma} \log\left(p_{m,n}\right)$$
Because the NFS is computed dynamically, it can be seamlessly integrated into the existing loss without further modification.
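The sliding-and-matching step for the single-channel case can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes plain Python lists instead of tensors, offsets i and j ranging symmetrically over [-max_shift, max_shift], and a ground-truth center obtained by trimming max_shift pixels from each side. The helper names (`crop`, `l1_distance`, `translate`, `nearest_feature_selector`) are hypothetical.

```python
def crop(grid, top, left, height, width):
    """Return a height x width window of a 2D list starting at (top, left)."""
    return [row[left:left + width] for row in grid[top:top + height]]


def l1_distance(a, b):
    """Sum of absolute differences between two equally sized 2D lists."""
    return sum(abs(x - y) for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


def nearest_feature_selector(prediction, ground_truth, max_shift=5):
    """Pick the prediction window closest (in L1) to the ground-truth center.

    Assumption: offsets i, j range over [-max_shift, max_shift].
    Returns ((i_min, j_min), nearest overlap, ground-truth center).
    """
    height, width = len(ground_truth), len(ground_truth[0])
    crop_h, crop_w = height - 2 * max_shift, width - 2 * max_shift
    center = crop(ground_truth, max_shift, max_shift, crop_h, crop_w)  # Y_c
    best = None
    for i in range(-max_shift, max_shift + 1):
        for j in range(-max_shift, max_shift + 1):
            overlap = crop(prediction, max_shift + i, max_shift + j,
                           crop_h, crop_w)                             # X_{i,j}
            distance = l1_distance(overlap, center)
            if best is None or distance < best[0]:
                best = (distance, (i, j), overlap)
    _, shift, overlap = best
    return shift, overlap, center


def translate(grid, down, right):
    """Shift a 2D list by (down, right) pixels, filling vacated cells with 0."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if 0 <= r - down < h and 0 <= c - right < w:
                out[r][c] = grid[r - down][c - right]
    return out


# Toy data: a square outline as "annotation" and a copy displaced by
# (1, 2) pixels as "prediction", mimicking a slightly misaligned label.
size = 16
ground_truth = [[0.0] * size for _ in range(size)]
for k in range(4, 12):
    ground_truth[4][k] = ground_truth[11][k] = 1.0
    ground_truth[k][4] = ground_truth[k][11] = 1.0
prediction = translate(ground_truth, down=1, right=2)

shift, overlap, center = nearest_feature_selector(prediction, ground_truth,
                                                  max_shift=2)
naive_l1 = l1_distance(prediction, ground_truth) / (size * size)
aligned_l1 = l1_distance(overlap, center) / (len(center) * len(center[0]))
print(shift, naive_l1, aligned_l1)
```

In this toy case the selector recovers the displacement and the loss on the aligned pair vanishes, whereas the pixel-to-pixel loss on the unaligned pair does not. Any loss function can then be evaluated on the returned pair in place of the original prediction and annotation.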
Three typically used balanced metrics, i.e., the f1-score, Jaccard index, and kappa coefficient, are used for the quantitative evaluation.Compared with unbalanced metrics such as precision and recall, the selected metrics provide a more generalized accuracy level by considering both precision and recall.
$$Jaccard = \frac{TP}{TP + FP + FN} \quad (10)$$
where TP, FP, FN, and TN represent the number of true positives, false positives, false negatives, and true negatives, respectively.
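The three metrics can be computed from the confusion-matrix counts as sketched below. The Jaccard index follows Equation (10); the f1-score and kappa coefficient use their standard textbook definitions, which are assumed here rather than copied from the source. The function name `evaluate` and the example counts are hypothetical, and the sketch does not guard against empty classes.

```python
def evaluate(tp, fp, fn, tn):
    """Return (f1-score, Jaccard index, kappa coefficient) from pixel counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    jaccard = tp / (tp + fp + fn)
    # Observed agreement and the agreement expected by chance.
    observed = (tp + tn) / total
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return f1, jaccard, kappa


# Hypothetical counts with far more background than outline pixels.
f1, jaccard, kappa = evaluate(tp=60, fp=20, fn=20, tn=900)
print(f1, jaccard, kappa)
```

Of the three, only the kappa coefficient uses TN; the f1-score and Jaccard index ignore the dominant background class entirely.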

Results
Four well-known loss functions, i.e., L1, mean square error (MSE), binary cross-entropy (BCE) [48], and focal loss [49], are used in this study. The L1 and MSE can be regarded as the most classic and typically used criteria for pixel-to-pixel comparisons. The BCE is a typical loss function for binary classification whose value grows logarithmically as the predicted probability diverges from the label. The focal loss introduces a scale factor to the BCE to reduce the importance of easy examples. Models are trained with each of these loss functions, either with or without the NFS, separately. All experiments are performed on the same dataset and processing platform.
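The four losses can be sketched for a single-channel probability map as follows. This is an illustration using the standard definitions; the default γ = 2.0 and the clamping constant `eps` are assumptions and are not stated in the text.

```python
import math


def _pairs(pred, truth):
    """Yield (y, g) pairs: predicted probability and ground-truth label."""
    for row_y, row_g in zip(pred, truth):
        yield from zip(row_y, row_g)


def _p(y, g, eps=1e-7):
    """Probability assigned to the true class, clamped for a finite log."""
    p = y if g == 1 else 1.0 - y
    return min(max(p, eps), 1.0 - eps)


def l1_loss(pred, truth):
    pairs = list(_pairs(pred, truth))
    return sum(abs(y - g) for y, g in pairs) / len(pairs)


def mse_loss(pred, truth):
    pairs = list(_pairs(pred, truth))
    return sum((y - g) ** 2 for y, g in pairs) / len(pairs)


def bce_loss(pred, truth):
    pairs = list(_pairs(pred, truth))
    return -sum(math.log(_p(y, g)) for y, g in pairs) / len(pairs)


def focal_loss(pred, truth, gamma=2.0):
    """BCE scaled by (1 - p)^gamma so that easy pixels contribute less."""
    pairs = list(_pairs(pred, truth))
    return -sum((1.0 - _p(y, g)) ** gamma * math.log(_p(y, g))
                for y, g in pairs) / len(pairs)


pred = [[0.9, 0.2], [0.6, 0.1]]
truth = [[1, 0], [1, 0]]
print(l1_loss(pred, truth), mse_loss(pred, truth))
print(bce_loss(pred, truth), focal_loss(pred, truth))
```

With γ = 0 the focal loss reduces to the BCE; larger values of γ shrink the contribution of pixels that are already classified with high confidence.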

Learning Curves
Figure 6 shows the relative values of loss from different loss functions under the validation dataset. Among all the loss functions (i.e., L1, MSE, BCE, and focal), the loss with the NFS (i.e., +NFS) converges faster than the loss without it (i.e., −NFS).

Quantitative Results
Figure 8a shows the relative performances of different loss functions under the test dataset. Among all loss functions (i.e., L1, MSE, BCE, and focal), the loss with the NFS yields higher values for all evaluation metrics.
Figure 8b shows the corresponding values of the evaluation metrics over various loss functions. Among the four loss functions, regardless of whether the NFS is used, the focal loss is generally better than the BCE, MSE, and L1 losses. L1 loss without the NFS (L1 − NFS) shows the lowest values for all metrics in all conditions. The best performance is achieved by focal loss with the NFS, i.e., 0.651 for the f1-score, 0.490 for the Jaccard index, and 0.626 for the kappa coefficient. Under all loss functions, the addition of the NFS results in significantly higher values for all evaluation metrics. The result indicates that the proposed NFS can effectively manage the slight misalignments in the annotation and achieve better performance. Interestingly, on the weakest L1 loss, the addition of the NFS results in the most significant increments among the three evaluation metrics. The increments of the f1-score, kappa coefficient, and Jaccard index reach 8.8%, 8.9%, and 9.8%, respectively.

Qualitative Results
Figure 9 presents six representative results of outlines extracted from the model trained by L1 loss with/without the NFS under the test dataset. The backgrounds, red lines, and green circles represent the aerial input, predicted outline, and focused area, respectively. In general, the addition of the NFS yields a better building outline extraction, particularly in shadowed areas (e.g., green circles in a, b, and e) and at turning corners (e.g., green circles in d and f). Additionally, the model trained with the NFS yields a more intact outline (e.g., green circles in c).
Figure 10 shows six representative groups of building outlines extracted from the model trained by the MSE loss with/without the NFS. Generally, the addition of the NFS yields a slightly better building outline extraction. Using the NFS, the extracted outlines contain fewer false positives within buildings (e.g., green circles in a and b) and fewer breakpoints (e.g., green circles in c, d, e, and f).
Figure 11 shows six representative groups of outlines extracted from the model trained by BCE loss with or without the NFS. The backgrounds, red lines, and green circles represent the aerial input, predicted outline, and focused area, respectively. As shown in the figure, the addition of the NFS yields a slightly better line extraction in areas shadowed by surrounding trees (e.g., green circles in columns a, e, and f). Moreover, the additional NFS results in better line continuity around the corners of the buildings (e.g., green circles in columns b, c, and d). In general, using the proposed NFS, the building outline extracted from the aerial image is more intact, particularly at building corners and in shadowed areas.
Figure 12 presents six representative pairs of building outlines extracted from the model trained with the focal loss with or without the NFS. Owing to the robustness of the focal loss, even without the NFS, the model successfully recognizes and extracts the major parts of the building outline from the aerial input (e.g., b, c, and f). However, with the additional NFS, the generated outlines contain fewer false positives around corners with complicated backgrounds (e.g., a, d, and e). Compared with L1 loss, the addition of the NFS has a less significant effect on the model trained with focal loss. This observation is consistent with the quantitative result shown in Figure 8b.
Table 1 shows the computing speeds of the methods in frames per second (FPS). For all the loss functions, the additional NFS results in a slightly longer processing time during both training and testing. However, the decline in FPS is not significant.

Regarding the NFS
In recent years, fully convolutional networks have demonstrated their ability to automatically extract line features, including roads and building outlines [36,39,54]. However, those studies mainly focused on designing deeper or more complex network architectures to enhance the representation capability for better predictions. The loss functions of fully convolutional networks cannot handle misalignments or rotations between inputs and manually created annotations. Because the building outline occupies a small portion of pixels, misalignments and rotations will severely interfere with the building outline extraction accuracy.
Herein, we propose the NFS module to dynamically re-align the prediction and corresponding annotation. The proposed framework can be easily combined with existing loss functions, such as L1, MSE, and focal loss. Through a dynamic re-alignment, the addition of the NFS enables the correct position of the annotation to be located for an appropriate loss calculation. Qualitative and quantitative results based on the testing data demonstrate the effectiveness of our proposed NFS.

Accuracies, Uncertainties, and Limitations
Among all methods, the focal loss with the NFS shows the highest values for all evaluation metrics. Its values of the f1-score, Jaccard index, and kappa coefficient are 0.624, 0.597, and 0.468. Compared with the naive L1 loss, the addition of the NFS results in significant increments in all evaluation metrics. The increments of the f1-score, kappa coefficient, and Jaccard index reach 8.8%, 8.9%, and 9.8%, respectively. As it is arguable that the kappa coefficient is unsuitable for the assessment and comparison of accuracy [55], the actual performance gained from the NFS might be less significant (i.e., less than 9.8%). For robust loss functions (e.g., focal and BCE loss), the improvement afforded by the NFS is less significant (see details in Figure 8b). Owing to the sliding-and-matching mechanism, the proposed NFS cannot be applied to annotations that require rotation correction. Since the methods are designed and trained on image patches with dense buildings, the trained model is not appropriate for evaluating the entire study area, where buildings are sparsely distributed.
Through the analysis of computational efficiency, we observe a slight decrease in processing speed when the NFS is applied. Considering the performance gain from the NFS, the degradation in computational efficiency is negligible. Because the NFS is independent of the characteristics of aerial imagery, in principle, it should apply not only to aerial images but also to other data sources (e.g., satellite, SAR, and UAV). The effectiveness of the NFS will be further evaluated using publicly available datasets from various sources [56].
Because of the extremely biased negative/positive ratio, complete building outline extraction is still challenging. With the current classification-based scheme, the model is trained to generate pixel-to-pixel predictions using features extracted from sequential convolutional layers. The predicted pixels of the building outline lack internal connectivity, so some pixels might be misclassified as non-outline (e.g., 2nd and 3rd rows in Figure 9).

Conclusions
For accurate building outline extraction, we design a nearest feature selector (NFS) module to dynamically re-align predictions and slightly misaligned annotations. The proposed module can be easily combined with existing loss functions to manage sub-pixel or pixel-level misalignments of the manually created annotations more effectively. For all loss functions, the addition of the proposed NFS yields significantly better performance in all the evaluation metrics. For the classic L1 loss, the increments gained by using the additional NFS are 8.8%, 8.9%, and 9.8% for the f1-score, kappa coefficient, and Jaccard index, respectively. We plan to improve the similarity selection mechanism and apply it to other data sources to achieve better generalization capacity for large-scale applications.

Figure 1. (a) Aerial imagery of the study area ranging from 172°33′E to 172°40′E and 43°30′S to 43°32′S, encompassing approximately 32 km². (b) Manual adjustment of provided annotation (e.g., from red to green polygon). (c) Sample pairs of the extracted patches.

Figure 2. Experimental design for model training and evaluation under imperfect annotation. The proposed nearest feature selector (NFS) is applied to perform similarity selection during the training and testing stages.

Figure 3. Experimental workflow for building outline extraction. Existing loss functions and the proposed nearest feature selector are trained and evaluated using 224 × 224 image patches extracted from the original dataset.

Figure 4. Overview of the proposed model. The model consists of a modified SegNet for feature extraction and the nearest feature selector (NFS) module for dynamic alignment.

Figure 5. Overview of the nearest feature selector (NFS) module. The center area of the ground truth slides over the prediction along the X- and Y-axes to generate overlaps that are used for similarity selection.

Figure 6. Trends in validation loss values over different iterations.

Figure 7 shows the trend of kappa coefficient values over various iterations from four different loss functions under the validation dataset. Among all the conditions, the focal loss trained with the proposed NFS (i.e., focal + NFS) shows the highest kappa coefficient values in most of the iterations. By contrast, the L1 loss trained without the NFS (i.e., L1 − NFS) shows the lowest kappa coefficient values for almost every iteration.

Figure 7. Trends in validation accuracy values over different iterations.

Figure 8. Performances of different losses, either with or without the nearest feature selector (NFS). (a) Bar chart for comparison of relative performances. (b) Table of performances under different loss functions. For each loss function, the highest values are highlighted in bold.

Figure 12. Representative results of outlines extracted from the model trained by focal loss with/without the nearest feature selector (NFS). Backgrounds, red lines, and green circles represent the aerial input, predicted outline, and focused area, respectively. Selected results are denoted as (a-f).

Figure 13 presents four representative pairs of failure cases from models trained with or without the nearest feature selector (NFS). Compared with the model trained without the NFS, the addition of the NFS might lead to unexpected misclassification around corners.

Table 1. Comparison of the computational efficiencies of different loss functions with and without the NFS.