A Novel Object-Based Deep Learning Framework for Semantic Segmentation of Very High-Resolution Remote Sensing Data: Comparison with Convolutional and Fully Convolutional Networks

Deep learning architectures have received much attention in recent years, demonstrating state-of-the-art performance in several segmentation, classification and other computer vision tasks. Most of these deep networks are based on either convolutional or fully convolutional architectures. In this paper, we propose a novel object-based deep-learning framework for semantic segmentation in very high-resolution satellite data. In particular, we exploit object-based priors integrated into a fully convolutional neural network by incorporating an anisotropic diffusion data preprocessing step and an additional loss term during the training process. Under this constrained framework, the goal is to enforce pixels that belong to the same object to be classified into the same semantic category. We thoroughly compared the novel object-based framework with the currently dominating convolutional and fully convolutional deep networks. In particular, numerous experiments were conducted on the publicly available ISPRS WGII/4 benchmark datasets, namely Vaihingen and Potsdam, for validation and inter-comparison based on a variety of metrics. Quantitatively, experimental results indicate that, overall, the proposed object-based framework slightly outperformed the current state-of-the-art fully convolutional networks by more than 1% in terms of overall accuracy, while intersection over union results are improved for all semantic categories. Qualitatively, man-made classes with stricter geometry, such as buildings, benefited most from our method, especially along object boundaries, highlighting the great potential of the developed approach.


Introduction
Semantic segmentation has received much research and development effort since it plays an important role in many critical computer vision tasks such as scene understanding, pattern recognition, object detection and tracking, etc. Several approaches have been adopted in order to improve classification results and create powerful, generic models independent of the training dataset. Currently, deep learning methods deliver state-of-the-art results in numerous image classification benchmark datasets based on mainly two different architectures. In particular, the patch-based methods [1,2] are based on convolutional networks (CNNs) [3][4][5] that receive as input fixed-size patches centered on each image pixel and thus, every single pixel is represented by the corresponding image region of this specific patch. Although such models are able to perform quite well, especially for sparsely annotated datasets [6], they require much computational power, which sometimes exceeds the capacity of available resources [7]. The second architecture is based on fully convolutional networks (F-CNNs) [8], which consist only of convolutional layers. Unlike patch-based architectures, they can deliver dense predictions as they do not contain fully connected layers with fixed size dimensions. Several F-CNNs have been proposed in the literature [9,10] and currently outperform other approaches in several benchmarks. However, there are still significant challenges towards the effective detection of objects with specific geometry, i.e., a detection that retains and follows object edges and boundaries.
In the remote sensing community, object-based image analysis (also known as geographic object-based image analysis) employs a classification procedure based on image objects (i.e., image segments), ameliorating results for numerous mapping tasks [11]. The terms object, segment or superpixel refer to a particular image region which can be enclosed in a polygon while all included pixels have common attributes (e.g., spectral values) and ideally belong to the same semantic category. Object-based approaches have delivered quite promising results in many applications, especially in cases with very/ultra high-resolution data and when combined with knowledge-based and/or other machine learning frameworks [12,13]. However, object-based frameworks are still employing mainly shallow, kernel-based classifiers or integrating superpixel information during a pre-/post-processing step.
To this end, in this study, we designed, developed and validated an object-based, deep-learning framework that can integrate object representations in the neural network independent of the employed deep architecture for the given semantic segmentation task. Under such a framework, an additional loss quantity penalizes the pixels of each superpixel whose predicted label differs from the dominant label of that superpixel, encouraging all of them to share the same semantic label. More specifically, the contributions of this work are twofold. Firstly, we propose a novel, object-based semantic segmentation approach based on deep neural networks, which can exploit information from object representations and constrain the predictions accordingly. To our knowledge, it is the first time that a generic object-specific loss for F-CNNs is presented. Our goal is to constructively combine the networks' rich feature representations with object-based information derived from groups of pixels and investigate the potential by comparing the results with plain deep architectures. Secondly, we present a thorough experimental design, and the comparison and validation of several state-of-the-art deep learning models from the two dominating architectures, i.e., CNNs and F-CNNs, and the proposed object-based one on publicly available remote sensing datasets.

Convolutional Networks for Semantic Segmentation (Patch-Based Learning)
Patch-based models can extract complicated features by combining spectral and spatial information at the same time. Historically, these were the first architectures adopted by the remote sensing community for the semantic segmentation of very high resolution datasets [14,15]. In [16], the authors examined three different ways of exploiting multiple convolutional architectures (PatreoNet [17], AlexNet [4], CaffeNet [18], GoogLeNet [19], VGG ConvNets [5], and OverFeat ConvNets [20]): (i) training the architectures from scratch using only the dataset of interest; (ii) employing pre-trained convolutional networks and fine tuning them according to the dataset of interest; and (iii) using a pre-trained convolutional network as a feature extractor and replacing the last softmax layer with an SVM classifier. Various patch-based convolutional architectures (DenseNet121 [21], InceptionV3 [19], VGG19 [5], Xception [22], ResNet50 [23], and InceptionResNetV2) are also explored in [24] for the successful mapping of wetlands. In addition, different urban environment categories are detected in [25] by exploiting ResNet [23] and VGG [5]. Similar approaches have also been explored for crop identification from high resolution imagery [26]. Moreover, the proposed patch-based method in [27] employs an AlexNet-based pre-trained architecture integrated with Spatial Pyramid Pooling (SPP) and Side Supervision (SS) techniques. The former term implies the concatenation of all the convolved feature maps that are produced from the intermediate pooling layers. SS, on the other hand, indicates the enrichment of the classification decision mechanism with a convex strategy that performs intermediate supervision. This strategy forces the classification output to depend not only on the final layer outcome, but also on the intermediate parts of the network, since objective functions are added to each hidden layer. Lastly, the authors of [28] improved the performance of convolutional architectures by enriching the standard RGB patch information with Local Binary Patterns (LBP) features [29]. This extra feature information is incorporated into the training process using late and early fusion techniques, experimenting with two different schemes: VGG-M [30] and ResNet [23]. In the late fusion case, the RGB and LBP features are fed separately into two different network streams and later fused in the fully-connected layers. Conversely, in the early fusion case, the raw RGB and LBP features are concatenated in a single vector and then passed to the network for training.

Fully-Convolutional Networks for Semantic Segmentation (Pixel-Based Learning)
Fully convolutional networks currently deliver state-of-the-art results in several semantic segmentation benchmark challenges. In particular, several successful segmentation approaches based on fully-convolutional encoder-decoder architectures [31][32][33] were presented in the DeepGlobe CVPR-2018 challenge. For example, Iglovikov et al. [34] extended a semantic segmentation model to perform instance segmentation for building surfaces. Specifically, the authors employed a U-Net-like [9] architecture where the encoder is replaced with the first five convolutional blocks of the WideResNet-38 network [35]. The model output is a two-channel volume, one channel related to the binary building-non-building mask, and the other related to touching borders of building instances. Similarly, Seferbekov et al. [36] dealt with the automatic multi-class land segmentation problem using a Feature Pyramid Network [37] whose encoder is based on the ResNet50 network [23].
In general, encoder-decoder architectures have been widely applied to very high resolution remote sensing datasets owing to their promising results. For example, in [38], variations of the fully convolutional SegNet [10] architecture are employed for semantic segmentation on the ISPRS (WGII/4) benchmark dataset. Similarly, Audebert et al. [39] exploited a multi-scale SegNet, which outputs classification maps at different resolutions, while early and late fusion techniques [40] related to the dataset's DSM (digital surface model) are also investigated. In addition, the authors of [41] based their experiments on an encoder-decoder U-Net-like [9] deep architecture to effectively distinguish sea from land areas. Marmanis et al. [42] also explored the use of F-CNNs [8] by providing spectral and elevation information to an ensemble of CNNs.
Recently, the authors of [43] tried to tackle the segmentation problem and preserve object boundaries [44][45][46] by formulating a bidirectional network called RiFCN (Recurrent Network in Fully Convolutional Network). The forward stream of this network is based on the VGG-16 [5] architecture, while the backward stream deconvolves the pooled feature maps, with each deconvolution taking as input not only the features produced by the forward pass, but also features from the previous deconvolution levels. Similarly, Marmanis et al. [44] proposed a fully-convolutional architecture where boundary information is integrated into the learning process. Specifically, color and elevation information is passed through a boundary-detection encoder-decoder architecture, which outputs a scalar image of boundary-likelihoods. Then, this scalar volume is concatenated with the original raw image and is given as input to different segmentation encoder-decoder models, the results of which are averaged to produce the final classification scores. Moreover, the authors of [47] proposed a two-step process that trains an F-CNN for building detection and then combines its predictions with a CRF to integrate information from the boundaries of the objects.
Continuing with analogous approaches, the authors of [48] utilized a fully-convolutional encoder-decoder framework where the encoder is based on the ResNet architecture [23] followed by atrous spatial pyramid pooling [49]. An additional multi-scale loss function applied on different scales of the produced features is also exploited and, finally, a superpixel-based dense conditional random fields framework is used as a post-processing step to smooth the results.

Object-Based Learning for Semantic Segmentation
Instead of exploiting raw image features extracted by a deep learning network, several recent studies try to use the object/super-pixel information during semantic segmentation tasks. Objects/super-pixels also serve the need for lower computational complexity, which is usually required since, for deep convolutional networks, millions of parameters must be tuned, consuming significant GPU and CPU core hours. The authors of [50] enriched the objects' representations by including information from proximal, distant and global regions. The local region describes the pixels included in the segment, while the proximal region incorporates the local region and is usually twice the object's size. The distant region represents an even larger surrounding part of the object and, finally, the global region describes the entire image scene. This enhanced feature representation, which better describes the different scale dependencies, is then given as input to a feedforward multilayer network. An asymmetric loss function, which balances the weights among frequent and infrequent classes, is also considered in this network. Recently, in [51], gridized superpixels were employed for the extraction of salient objects. Gridized superpixels resemble pixels, since they are similar in shape, producing in this way a grid-like oversegmentation. Each segment is represented by its mean color value, thus resulting in much lower image dimensions. The corresponding binary ground truth images (salient-non-salient) are similarly processed by assigning the dominant label value in each segmented region. The encoded images are then given as input to a fully convolutional network with residual blocks and, eventually, the predicted output is reconstructed back to its original dimensions.
Object information has also been integrated during pre-/post-processing steps with promising results. Recently, the authors of [52] examined different segmentation algorithms for training data formulated as follows: patches of different sizes (32 × 32, 64 × 64 and 128 × 128) centered on each object are extracted and then reshaped to 228 × 228, while labels are given according to the dominating value inside each superpixel. All reshaped patches go through a pretrained AlexNet network to produce a final feature vector, which is then fed to a linear SVM classifier. During the testing phase, each segmented region is labeled with the class that the model predicted for the corresponding centered patch. Moreover, further experiments are conducted using also context features of neighboring segments included inside a certain radius for each superpixel. In a similar manner, the authors of [53] also adopted the preprocessing segmentation step; however, they employed a different technique for extracting image patches. Superpixels replace pixels based on the assumption that the over-segmentation of an image produces superpixels that are very similar in shape and size (usually also rounded without following object geometry and boundaries, such as in [54]). This approach drastically reduced the testing time by making the sliding-window process much faster. The reported accuracy is similar to that of the standard pixel-based sliding-window method.

The Developed Object-Based Learning Framework
In the aforementioned related studies, the object information is mainly integrated in a pre-processing or post-processing manner. Therefore, the main motivation here was to design a deep learning framework that could efficiently integrate object representations. To do so, we included object-based constraints by incorporating simplified image representations as priors to an additional loss term in the deep neural network. In Figure 1, a representative graphical illustration of the developed approach is presented.
More specifically, let us define an image $I_s$ represented by a set of patches $\{P_i\}$ for $i = 1, 2, \ldots, Z$ with corresponding dense ground truth annotation $S_s$ including a set of labels $\{l_i\}$ for $i = 1, 2, \ldots, K$. Both the image and the corresponding ground truth include $s = 1, 2, \ldots, D$ pixels. In our experiments, we dealt with a classification problem with more than two classes and thus we employed the multiclass cross entropy for the optimization of all the architectures:
$$L_1 = -\sum_{s=1}^{D} \sum_{l=1}^{K} y_{s,l} \log(p_{s,l})$$
where $y_{s,l}$ is a binary indicator that shows if class $l$ is the correct answer for observation $s$ and $p_{s,l}$ holds the probability that observation $s$ belongs to class $l$.
Image objects can be generated by any segmentation algorithm to create regions with similar spectral or any other characteristics. To this end, let us consider, without loss of generality, a set of objects or segmented regions $\{C_i\}$ for $i = 1, 2, \ldots, M$. Assuming that all the pixels inside an object should have the same label, we formulate our loss as
$$L_2 = \sum_{i=1}^{M} \sum_{s \in C_i} \psi\big(\arg\max_{l}(p_{s,l}),\; l_d(C_i)\big)$$
where $\arg\max_{l}(p_{s,l})$ indicates the label with the maximum probability for pixel $s$ of the object $C_i$ and $l_d(C_i)$ the dominant category of the object $C_i$. As $\psi(\cdot)$, one can consider any distance function. In our case, and for all experiments, we used a Potts model such as
$$\psi\big(\arg\max_{l}(p_{s,l}),\; l_d(C_i)\big) = \begin{cases} 0, & \text{if } \arg\max_{l}(p_{s,l}) = l_d(C_i) \\ c_1, & \text{otherwise} \end{cases}$$
where $c_1$ is a constant value that defines the penalty that will be given to each pixel of the object that does not belong to the dominating class. In all our experiments, we set $c_1 = 1$.
At this point, we should mention that the object-based loss $L_2$ penalizes every prediction that differs from the dominant label $l_d$ equally and does not take into account which class was predicted instead. The loss works similarly to a smoothness term, enforcing homogeneity inside the regions of the objects $C_i$. Hence, the optimal label for each pixel is obtained by the weighted ($w_1$) sum of the classification and object-based losses as follows:
$$L = L_1 + w_1 L_2$$
where $L_1$ denotes the cross-entropy classification loss and $L_2$ the object-based loss.
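To make the combined objective concrete, the following is a minimal sketch in plain Python (not the PyTorch code used for the experiments). It evaluates the cross entropy, the Potts object penalty and their weighted sum on flat lists of pixels. The function names, the summation without normalization, and the choice of taking the dominant category of each object from a reference label list (here the ground truth) are assumptions made for illustration. How gradients are handled around the arg max is not specified in the text and is not addressed here.

```python
import math
from collections import Counter, defaultdict


def cross_entropy(probs, labels):
    """Multiclass cross entropy summed over pixels.

    probs[s][l] is the predicted probability of class l at pixel s,
    labels[s] is the correct class index of pixel s.
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, labels))


def potts_object_loss(probs, reference, segments, c1=1.0):
    """Adds c1 for every pixel whose arg-max class differs from the
    dominant class of its object (assumption: dominance is computed
    from `reference`, e.g. the ground truth labels)."""
    members = defaultdict(list)
    for s, seg in enumerate(segments):
        members[seg].append(s)

    loss = 0.0
    for pixels in members.values():
        dominant = Counter(reference[s] for s in pixels).most_common(1)[0][0]
        for s in pixels:
            predicted = max(range(len(probs[s])), key=probs[s].__getitem__)
            if predicted != dominant:
                loss += c1
    return loss


def total_loss(probs, labels, segments, w1=2.0, c1=1.0):
    """Weighted sum of the classification and object-based terms."""
    return cross_entropy(probs, labels) + w1 * potts_object_loss(
        probs, labels, segments, c1
    )


# Four pixels, two classes, two objects (segment ids 0 and 1).
probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
labels = [0, 0, 0, 1]
segments = [0, 0, 0, 1]
print(potts_object_loss(probs, labels, segments))  # one pixel disagrees
print(total_loss(probs, labels, segments, w1=2.0))
```

In this toy input, the third pixel is predicted as class 1 inside an object dominated by class 0, so it is the only pixel that contributes to the object term.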

Implementation Details
In this section, we provide a brief overview of the plain patch-based and pixel-based methods that were employed to compare the results with the proposed framework and conduct a comprehensive evaluation. Moreover, implementation details, selected hyperparameters as well as the required computational training duration are presented for each model. All experiments were conducted using the PyTorch deep learning framework [55].

Patch-Based Learning
Three commonly used architectures were implemented, namely ConvNet, AlexNet and VGG-16, which provide one single $l_i$ per $P_i$. In particular, ConvNet has a relatively simple architecture. There are 4 blocks of layers: 2 convolutional and 2 fully-connected [56]. The first convolutional layer includes 3 × 3 kernels, a stride of 1 and padding equal to 0. It is then followed by a ReLU layer and a max-pooling operation of 3 × 3 kernels and a stride of 2. The second convolutional block follows the same pattern and 2 fully-connected layers follow to produce the final classification product.
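As a worked example of the layer arithmetic described for ConvNet, the sketch below traces the spatial size of a patch through the two convolutional blocks. It assumes a 29 × 29 input (the patch size reported in the training procedure), no padding in the pooling layers, and floor rounding of the output size; the function names are illustrative.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


def trace_convnet(n=29):
    """Spatial sizes after each conv and max-pool of the two blocks."""
    sizes = [n]
    for _ in range(2):
        n = out_size(n, kernel=3, stride=1, padding=0)  # 3 x 3 convolution
        sizes.append(n)
        n = out_size(n, kernel=3, stride=2)             # 3 x 3 max-pooling
        sizes.append(n)
    return sizes


print(trace_convnet())  # [29, 27, 13, 11, 5]
```

Under these assumptions, the fully-connected layers would receive 5 × 5 feature maps.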
The AlexNet architecture comprises 8 blocks of layers following the same sequence: 5 convolutional and 3 fully-connected [56]. In more detail, the first convolutional layer applies 3 × 3 filters with a stride of 1, followed by a ReLU activation function and a max-pooling operation with 3 × 3 kernels and a stride of 2. The second convolutional layer follows the same pattern, while the third and fourth lack the max-pooling operation. The fifth convolutional layer is again the same, only this time the max-pooling filters are of size 2 × 2. After that, fully-connected layers produce linear transformations, while dropout layers randomly mask part of their input, which reduces the chance of overfitting. More precisely, the dropout probability is equal to 0.5, so each input element is zeroed independently with probability 0.5 using binary samples drawn from a Bernoulli distribution.
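The masking performed by a dropout layer can be illustrated with a short sketch in plain Python. It follows the common "inverted dropout" convention, in which kept elements are rescaled by 1 / (1 - p) during training; the function name and the explicit random generator argument are illustrative assumptions, not part of the described implementation.

```python
import random


def dropout(values, p=0.5, rng=None):
    """Zero each element independently with probability p (Bernoulli
    mask) and rescale the kept elements by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    rng = rng or random.Random()
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


masked = dropout([1.0] * 8, p=0.5, rng=random.Random(0))
print(masked)  # each entry is either 0.0 or 2.0
```

With p = 0.5, about half of the elements are dropped on average; the exact fraction varies from one forward pass to the next.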
The VGG-16 architecture, which is much deeper [56] than the previous ones, is built from a block in which the data pass through convolutional filters, batch normalization and a ReLU activation function. Data dimensions change only in terms of depth after being passed through this block, since the convolution filters are of size 3 × 3 with a stride and padding of 1. Such a convolutional block appears 13 times throughout the whole architecture, while dropout and max-pooling layers are also evenly distributed among the blocks.
Regarding the implementation details of the patch-based architectures, ConvNet, AlexNet and VGG-16 were optimized by standard Stochastic Gradient Descent with a learning rate of 0.04, a momentum of 0.9, a weight decay of 0.0005 and a batch size equal to 100.

Pixel-Based Learning
Different fully convolutional architectures were tested in this work and we present them in this section. Starting with the SegNet architecture [10], it includes an encoder and a decoder part. The input image is passed firstly through the encoder to be downsampled to a very low resolution, while at the same time a variety of features are calculated. In this specific case, the encoder consists of 5 convolution blocks. Each block computes consecutive convolutional, activation function and batch normalization operations. The convolutional operation involves filters of size 3 × 3, a stride of 1 and padding equal to 1. One max-pooling operation, which applies 2 × 2 filters with a stride of 2, is also included in each block. In this way, the original input volume is halved five times. Next, the encoded volume is passed through the decoder, where upsampling procedures take place in a symmetric manner. To be more specific, the decoder also consists of 5 convolution blocks, while the max-pooling operations are replaced by unpooling operations, which bring the feature maps back to the original input size. At the end, the model generates a heatmap where each pixel consists of n probability values, where n is the total number of semantic classes. It should also be noted that all the activation operations apply the rectified linear unit (ReLU) function to the input, apart from the last layer, which produces the final segmentation output using a softmax activation.
The U-Net architecture [9] is also based on a downsampling-upsampling procedure but involves skip connections that concatenate feature maps between the encoder and the decoder. The employed U-Net's encoder consists of five convolutional blocks. Each one applies 2 convolutions to the input in the form of Conv-Batch-ReLU. The convolution operations always involve 3 × 3 filters with both stride and padding being equal to 1. In the first convolutional block, the depth is increased to 64, while the height and width dimensions remain unchanged. The following convolutional blocks halve the spatial dimensions using a 2 × 2 max-pooling operation and double the input depth, apart from the last block, where the depth is unaltered. After the encoder, the decoder receives the low resolution volume to upsample it back to its original dimensions. For this purpose, four convolutional blocks of the same form are used, this time applying 2 × 2 upsampling operations. In addition, the resulting feature map of each upsampling operation is concatenated with the feature map of the symmetrical block in the encoder part. In this way, higher resolution information is combined with lower resolution information, producing more sophisticated features and maintaining spatial knowledge. Finally, at the end of the model, a 1 × 1 convolution operation is applied to produce the final probability heat map for the existing classes.
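The encoder description above can be checked with a small shape trace. The sketch assumes a 256 × 256 input patch (the size used for the pixel-based models) and reports (depth, spatial size) after each of the five encoder blocks; the function name and its parameters are illustrative.

```python
def unet_encoder_shapes(size=256, first_depth=64, blocks=5):
    """(depth, height/width) after each encoder block.

    Block 1 sets the depth to `first_depth` and keeps the spatial size.
    Blocks 2..5 halve the spatial size; the depth doubles in every
    block except the last one, where it is left unchanged.
    """
    depth = first_depth
    shapes = [(depth, size)]
    for block in range(2, blocks + 1):
        size //= 2
        if block < blocks:
            depth *= 2
        shapes.append((depth, size))
    return shapes


shapes = unet_encoder_shapes()
print(shapes)  # [(64, 256), (128, 128), (256, 64), (512, 32), (512, 16)]

# Four 2 x 2 upsampling steps in the decoder restore the input size.
print(shapes[-1][1] * 2 ** 4)  # 256
```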
Apart from the SegNet and U-Net models, the Fully Convolutional Network (FCN), which was proposed by Long et al. [8], was also employed. The authors proposed to replace the last fully-connected layers of various architectures with convolutional ones, allowing in this way the model to produce heat maps instead of simple 1-D predictions. This is accomplished by exploiting deconvolutional layers, which are considered as a backward convolution that restores the downsampled volume. The actual upsampling is achieved by using a 32-stride deconvolution at the end of the model. Such models are independent of the input data dimensions and thus very flexible and easy to use. To boost the performance, a further skip layer was added, which collects information from intermediate parts of the model. In particular, apart from the 32-stride deconvolution, additional information from higher resolution layers is added by employing 16-stride and 8-stride layers. Then, the predictions of all upsampled skip layers are combined to produce the final classification map. The convolutionalized architectures were AlexNet, VGG and GoogLeNet. In all our experiments, we used the convolutionalized VGG-16 architecture with 32-stride and 16-stride deconvolutions. Firstly, the 32-stride fully convolutional model was trained using the VGG-16 weights as initialization. Then, the 16-stride network was in turn initialized with the parameters that the 32-stride scheme produced.
The term "pixel-based" implies that the network learns to provide dense predictions for each pixel $s$ of the patch $P_i$. The patch information is passed through the model, which downsamples it to a low resolution and then upsamples it back to the original dimensions following an autoencoder scheme. Pixel-based architectures are fully convolutional and in this work we employed three that are commonly used for very high-resolution satellite data, i.e., SegNet, U-Net and FCN-16. These were also employed and integrated in the developed object-based framework (as described in Section 3.1). Regarding the implementation details, in the case of SegNet, the weights were initialized using the pretrained VGG-16 ImageNet model. All pixel-based models were optimized by Stochastic Gradient Descent with a batch size, learning rate, momentum and weight decay equal to 10, 0.01, 0.9 and 0.0005, respectively.

Generation of Objects/Superpixels
One important component of the employed framework is the strategy for generating the objects. Regarding the image segmentation process, the choices are plenty; however, they were narrowed down significantly since the goal here was to select methods that respect the following criteria: (i) perform a nonlinear, anisotropic diffusion by taking also into account the fact that signal continuity in spectrum is, usually, more plausible than continuity in space; (ii) take into account the fact that objects/segments/superpixels in the spatial directions should be enhanced, smoothed and elegantly simplified while their contours/edges/boundaries must remain perfectly spatially localized: no edge displacements, intensity shifts or spurious extrema should occur; and (iii) tackle only the kind of noise that never forms a coherent structure in both spatial and spectral directions. Among the available methods, those that produce segments through an effective, anisotropic data simplification process, integrating spatial and spectral information while respecting the aforementioned criteria, are the ones based on scale-space morphological filtering and anisotropic diffusion markers [57,58]. Such simplification techniques have been applied successfully for edge detection and image segmentation tasks [59], as well as for smoothing, simplifying and reducing the dimensionality of hyperspectral data [60,61].
In particular, compared with standard superpixel methods such as SLIC [54], the selected anisotropic morphological levelings (AMLs) can simplify the optical, multispectral data while successfully preserving image contours and object boundaries (Figure 2). More specifically, during the superpixel creation, although the parameterized color proximity and spatial proximity distances are combined, the resulting objects tend to present rather rounded shapes that do not approximate object boundaries adequately and correctly. This is mainly due to the fact that, for larger objects or objects with irregular shapes, the spatial distances outweigh color proximity, giving more relative importance to spatial proximity than spectral coherence. This produces compact superpixels that do not adhere well to image boundaries. In the proposed processing pipeline, the dataset is simplified with AMLs in order to efficiently preserve the boundaries/edges of a variety of objects representing classes such as roads, roofs, trees, pavements, cars, other man-made objects, soil, vegetation, etc. Regarding the selected scale of filtering, although objects appear at different scales in the images and a standard object-based image analysis pipeline proposes a multiscale procedure [11,12], here we simplify the dataset at a relatively small scale, creating a number of segments inside a single semantic object and letting the convolutional layers address the scale variance. From the simplified data, objects are then derived with the Quickshift segmentation algorithm [62], where each pixel is represented by a feature vector of its spectral information, based on which the visually similar image regions are formed. The algorithm generates a forest of pixels from which the objects/segments are created. In all our experiments, we assigned the values of 0.5, 2 and 12 to the ratio, kernelsize and maxdist parameters, respectively.
In Figure 2, the original raw image crops are presented in the left column along with the corresponding results from Quickshift, SLIC and the proposed AML-QS procedure. After a close look, one can observe that the objects derived from the AML-QS approach have been merged more semantically. In numerous cases, small objects (e.g., roof materials, roof objects, chimneys, asphalt lines, etc.) appear in a more spectrally compact representation without spurious extrema. Moreover, the object boundaries and edges are represented more accurately, several small linear features that made object edges noisier are not present, and small objects with a specific shape (e.g., squared white rooftop skylights in the top row) have accurately retained their boundaries and geometry.
For simplicity, from now on we refer to the object-based approaches with the following abbreviations: OB_Snet for object-based SegNet, OB_Unet for object-based U-Net and OB_FCN for object-based FCN-16. The training processes for the object-based architectures were implemented using the exact same parameters as described for the plain pixel-based architectures (Section 3.2.2) in order to conduct an accurate comparison between them, i.e., SegNet was trained with the same hyperparameters as OB_Snet; the same applies to OB_Unet and OB_FCN. Regarding the additional object-based loss function (Section 3.1), the value of $w_1$ was defined using grid search for each architecture. In particular, regarding the Vaihingen dataset, $w_1$ was equal to 2 for OB_Snet and OB_Unet, and 0.2 for OB_FCN. For the Potsdam dataset, $w_1$ was equal to 2 and 1 for OB_Snet and OB_Unet, respectively.

Dataset and Training Procedure
All our experiments were performed on the two publicly available ISPRS (WGII/4) benchmark datasets provided by Commission III of the ISPRS [63], depicting two different cities of Germany: Vaihingen and Potsdam. Both regions have been annotated with six different classes: Impervious Surfaces, Buildings, Low Vegetation, Trees, Cars and Clutter, which represents everything else that is not included in the other five classes. Regarding Vaihingen, it consists of 33 very high resolution images of average size 2494 × 2064 that have 3 available channels (InfraRed, Red, and Green) and a ground sample distance of 9 cm (i.e., the real ground value that corresponds to the distance of two adjacent pixel centers). In the case of Potsdam, 38 ortho-rectified images are available, with a size of 6000 × 6000 and a ground sample distance of 5 cm. Here, 4 different spectral channels are available: Red, Green, Blue and InfraRed. It should be noted that the different categories of this dataset are not proportionally balanced, i.e., some categories (e.g., Buildings) are much more common compared to others (e.g., Cars). In Table 1, the proportion of each class in relation to the training images is presented. The Vaihingen dataset of the ISPRS benchmark consists of 16 tiles for training and 17 for testing purposes. For our experiments, we further divided the 16 training tiles into 14 for training (i.e., Areas 11, 13, 1, 21, 23, 26, 28, 30, 32, 34, 37, 3, 5 and 7) and 2 for validation (i.e., Areas 15 and 17). Regarding Potsdam, there are 24 training and 14 testing tiles. In a similar way, from the 24 training tiles, we used 17 for training (i.e., Areas 2_10, 3_10, 3_11, 3_12, 4_11, 4_10, 5_10, 5_12, 6_8, 6_9, 6_10, 6_11, 6_12, 7_7, 7_9, 7_11 and 7_12) and 7 for validation (i.e., Areas 2_11, 2_12, 4_10, 5_11, 6_7, 7_8 and 7_10).
For all the patch-based architectures, 29 × 29 patches were extracted randomly by taking 1% of each class from every training image, resulting in approximately 1.1 million training and 38 thousand validation patches. All data were normalized by subtracting the mean and dividing by the standard deviation of the three available channels.
For the pixel-based architectures, patches of size 256 × 256 were extracted from the Vaihingen images using a step of 64 along both rows and columns, forming in this way small overlapping regions. Approximately 13,800 training and 120 validation patches were created. All data were normalized via mean and standard deviation before being processed by the networks. In the case of Potsdam, patches were again of size 256 × 256, but this time they were formed with a step of 128, creating approximately 34,400 patches for training and 3700 for validation.
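The overlapping patch extraction described above can be sketched as a regular sliding window. This is only an illustration under stated assumptions: the function names `patch_origins` and `overlap_fraction` are hypothetical, and the sketch assumes that only full patches lying entirely inside a tile are kept, which the text does not specify.

```python
def patch_origins(height, width, patch=256, step=64):
    """Top-left (row, col) corners of all full patches on a regular grid.

    Assumption: patches that would cross the tile border are discarded.
    """
    rows = range(0, height - patch + 1, step)
    cols = range(0, width - patch + 1, step)
    return [(r, c) for r in rows for c in cols]


def overlap_fraction(patch=256, step=64):
    """Fraction of a patch shared with its neighbour along one axis."""
    return 1.0 - step / patch


if __name__ == "__main__":
    # One 6000 x 6000 Potsdam tile with a step of 128.
    per_tile = len(patch_origins(6000, 6000, patch=256, step=128))
    print(per_tile, per_tile * 17)
    # A step of 64 means neighbouring patches share 75% of their extent.
    print(overlap_fraction(256, 64))
```

Under this tiling assumption, a 6000 × 6000 tile with a step of 128 yields 45 × 45 = 2025 patches, so the 17 Potsdam training tiles would give 34,425 patches, which is in line with the approximately 34,400 training patches reported above.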

Training Time and Optimal Stop Points
All implemented models included in this work were trained using early stopping criteria. More precisely, training was stopped when two requirements were satisfied. Firstly, the validation accuracy should not have increased for a specific number of epochs, which is called patience. Here, we used a patience of 10 epochs. Secondly, the difference between the training and validation accuracy should be minimal. Once both criteria were met, the training process was terminated. All training tasks were assigned to the same GeForce GTX 1080 GPU. In Table 2, we provide all the relevant information, including computational costs and number of epochs for each of the architectures for both datasets. In general, the object-based approaches were the most time-demanding to train.
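A minimal sketch of such a stopping rule is given below. It is not the authors' implementation: the class name `EarlyStopping` and the `max_gap` parameter are hypothetical, and since the text does not quantify how small the training/validation gap must be, the gap check is optional and its threshold is an assumption.

```python
class EarlyStopping:
    """Patience-based early stopping with an optional train/validation gap check."""

    def __init__(self, patience=10, max_gap=None):
        self.patience = patience      # epochs allowed without improvement
        self.max_gap = max_gap        # hypothetical threshold; None disables the check
        self.best_val = float("-inf")
        self.best_epoch = -1
        self.stale = 0                # epochs since the last improvement

    def update(self, epoch, train_acc, val_acc):
        """Record one epoch and return True when training should stop."""
        if val_acc > self.best_val:
            self.best_val = val_acc
            self.best_epoch = epoch
            self.stale = 0
        else:
            self.stale += 1
        plateau = self.stale >= self.patience
        gap_ok = self.max_gap is None or abs(train_acc - val_acc) <= self.max_gap
        return plateau and gap_ok


if __name__ == "__main__":
    stopper = EarlyStopping(patience=3)
    history = [(0.70, 0.50), (0.80, 0.60), (0.85, 0.60), (0.90, 0.59), (0.92, 0.58)]
    for epoch, (train_acc, val_acc) in enumerate(history):
        if stopper.update(epoch, train_acc, val_acc):
            print("stop at epoch", epoch, "best epoch", stopper.best_epoch)
            break
```

The epoch stored in `best_epoch` plays the role of the optimal stop point: the weights saved at that epoch would be the ones kept for testing.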

Quantitative Evaluation Metrics
To assess the quality of the results, we employed four different evaluation metrics: Overall Accuracy, Precision, Recall and F1 score. They are all expressed through the calculated TP (True Positives), FP (False Positives) and FN (False Negatives). For a given class l, TP is the number of pixels that have been correctly classified as l, and FP is the number of pixels that have been wrongly classified as l. Finally, FN is the number of pixels that belong to l but have been assigned by the model to some other class.
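As an illustration, the four metrics can be derived from a multi-class confusion matrix using the standard definitions: Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 as the harmonic mean of the two, and Overall Accuracy as the fraction of correctly classified pixels. The sketch below assumes that rows of the matrix index the reference class and columns the predicted class; the function name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(conf):
    """Overall accuracy plus per-class precision, recall and F1.

    conf[i][j] = number of pixels of reference class i predicted as class j
    (row/column convention is an assumption of this sketch).
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    overall_accuracy = sum(conf[k][k] for k in range(n)) / total
    metrics = []
    for l in range(n):
        tp = conf[l][l]
        fp = sum(conf[i][l] for i in range(n)) - tp  # predicted l, but wrong
        fn = sum(conf[l]) - tp                       # belongs to l, but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return overall_accuracy, metrics


if __name__ == "__main__":
    toy = [[8, 2],
           [1, 9]]
    oa, per_class = per_class_metrics(toy)
    print(oa, per_class[0])
```

Classes with no predicted or no reference pixels are assigned a score of 0 here, which is a convention of the sketch rather than something stated in the text.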
Moreover, to further assess the performance of the developed approach, we employed the Intersection-over-Union (IoU) and the Hausdorff distance (HD), which indicate how close the predicted objects are to the ground truth. In Equation (5), A and B are two different data samples. Since in our case we have a multi-class segmentation problem, the IoU is calculated on each semantic category separately. The Hausdorff distance measures the maximum distance that exists between the predicted object and the ground truth. For two different data samples (i.e., A and B), the Hausdorff distance can be expressed as Equation (6), where a and b are the points of A and B, while d(a, b) is the L2 norm.
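Both measures can be sketched on sets of pixel coordinates. The code below uses the standard definitions, IoU = |A ∩ B| / |A ∪ B| and the symmetric Hausdorff distance max(max_a min_b d(a, b), max_b min_a d(a, b)) with d the Euclidean distance; whether Equation (6) uses the symmetric or the directed form is an assumption here, and all function names are hypothetical. The brute-force search is quadratic in the number of points, so it is meant for small examples only.

```python
import math


def iou(a, b):
    """Intersection-over-Union of two sets of (row, col) pixel coordinates."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        raise ValueError("IoU is undefined when both sets are empty")
    return len(a & b) / len(union)


def directed_hausdorff(a, b):
    """Largest distance from a point of a to its nearest point of b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def mean_over_images(values):
    """Average a per-image score (e.g., the IoU of one class) over all test images."""
    return sum(values) / len(values)


if __name__ == "__main__":
    pred = {(0, 0), (0, 1), (1, 0), (1, 1)}
    ref = {(0, 1), (1, 1), (0, 2), (1, 2)}
    print(iou(pred, ref), hausdorff(pred, ref))
```

For the multi-class case, `iou` would be called once per semantic category on the pixels carrying that label, and the per-image values averaged with `mean_over_images`.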

Quantitative Evaluation
The developed object-based learning frameworks (i.e., OB_Snet, OB_Unet and OB_FCN) were applied to the Vaihingen dataset and compared with the patch-based (i.e., ConvNet, AlexNet and VGG-16) and pixel-based fully convolutional (i.e., SegNet, U-Net and FCN-16) networks. The resulting Overall Accuracy rates are presented in Figure 3 (left). The two object-based frameworks that achieved the highest accuracy rates in Vaihingen (i.e., OB_Snet and OB_Unet) were also applied to the Potsdam dataset (Figure 3, right). More specifically:
Patch-based Learning Frameworks: Figure 4 presents the quantitative results for ConvNet, AlexNet and VGG-16 for the Vaihingen dataset. The highest Overall Accuracy (OA) rate was obtained by AlexNet (83.10%), followed by ConvNet (82.50%), while VGG-16 gave the lowest OA rate (79.79%). We can observe that, even though VGG-16 is the deepest architecture, its F1 rates were lower than those of the other two models. Between ConvNet and AlexNet, the latter delivered higher accuracy rates, while ConvNet achieved a higher F1 rate only for the Cars class.
Pixel-based Learning Frameworks: Quantitative results from the fully convolutional frameworks (SegNet, U-Net and FCN-16) are presented in Figure 5 (left). Generally, compared with the patch-based frameworks, the F1 rates were higher, apart from the case of FCN-16, where certain classes (e.g., Low_Vegetation and Trees) were outperformed by the patch-based frameworks in terms of F1 score. The FCN-16 F1 rates were lower than those of SegNet for all classes. U-Net achieved very similar results to SegNet, with Cars reaching the highest F1 rate among the pixel-based methods. The overall accuracy rates were equal to 89.40 [...]
Developed Object-based Framework: In Figure 5 (right), the quantitative results of the developed OB_Snet, OB_FCN and OB_Unet are presented for the Vaihingen dataset. In particular, the OAs were 89.40%, 87.36% and 88.11%, respectively. Compared with the aforementioned pixel-based frameworks, the additional object-based loss slightly increased the accuracy rate in the FCN-16 case, produced an equal OA in the SegNet case and resulted in a score lower by 0.2% in the U-Net case. This can also be seen in Figure 3, where all the resulting OA rates for both datasets are presented. In the Potsdam dataset, the object-based framework outperformed the pixel-based frameworks in all cases by more than 1%. The more effective performance of the object-based approach was also observed for all classes in the Potsdam dataset.
In particular, the resulting confusion matrices (Figure 6) as well as the resulting F1 scores for the six different classes (Figure 7) demonstrate the improvement. All the resulting F1 rates were higher than those of the standard pixel-based fully convolutional networks, and the object-based framework produced better results especially for Buildings and Trees. As far as the IoUs of the different semantic categories are concerned, one can observe in Table 3 (left) that the object-based approaches produced slightly better results for Impervious_Surfaces, Trees, Cars and Clutter in the Vaihingen dataset. The highest IoU rates were achieved by SegNet and OB_Snet, while U-Net and OB_Unet detected the Clutter category much more successfully. Regarding FCN-16 and OB_FCN, Table 3 (left) shows that they delivered the lowest IoU values. Compared with Vaihingen, the developed framework performed more effectively in the Potsdam dataset, probably because it exploited more efficiently the additional spectral information (i.e., the blue channel) that was available. This is also evident from the IoU values obtained on the testing images (Table 3, right), since the highest IoU for each semantic category resulted from the proposed object-based frameworks. Specifically, OB_Snet achieved the highest rates for Impervious_Surfaces, Buildings, Low_Vegetation, Trees and Clutter, while OB_Unet had higher rates in the Cars case.
Qualitatively, even though the patch-based architectures achieved relatively high quantitative results, the predicted maps are noisy, with significant gaps and fragmented outputs. Figure 8 shows representative examples of these noisy results in indicative regions of the testing areas. The various classes are not well separated and the boundaries are scattered and blurry. The corresponding results on the same zoomed regions derived from the fully convolutional pixel-based networks (SegNet, FCN-16 and U-Net) are presented in Figure 9. These results are less noisy and less fragmented. However, certain objects have not been detected accurately in terms of object compactness, overall geometry and accuracy along their boundaries.
In Figure 10, the corresponding results from the same regions of the Vaihingen testing dataset are presented for the developed object-based learning approach. By comparing Figures 9 and 10, one can notice on close inspection that in several cases the additional loss made the model more effective, for example in the case of the building depicted in the second row of Figures 9 and 10, which was more accurately detected by the developed OB_Unet network. Generally speaking, compared with the plain fully convolutional networks (in Figures 9 and 10 as well as in Figure 11 for the Potsdam testing dataset), the developed object-based networks preserved the shapes and overall geometry of the ground truth more faithfully. This was mainly due to the additional object-based priors that were integrated in the process, which force and constrain the model to retain object shapes. Moreover, we observed that building boundaries derived from the object-based frameworks were more accurate, which was also supported by the HD metrics computed for the class Buildings on both datasets (Table 4). In particular, the HD metric calculates the maximum distance between the predicted Buildings pixels and the ones from the reference data. In the Vaihingen case, the object-based approaches generally resulted in better HD metrics, with the object-based SegNet approach being closest to the reference geometry, attaining an HD score of 20.89. Regarding Potsdam, both proposed object-based techniques performed better than the standard ones, especially if one compares the standard U-Net (with a score of almost 60) with the proposed OB_Unet (with a score of less than 45). In particular, the best HD score was achieved by OB_Snet and was equal to 41.87.

Discussion
From both the aforementioned quantitative and qualitative evaluation results, the following outcomes can be highlighted. Generally speaking, for semantic segmentation tasks in very high resolution images, the fully convolutional frameworks (both pixel-based and object-based) are more robust and effective than the patch-based ones. More specifically, the overall accuracy levels rose by 5-10%, depending on the model that was employed in each case. The developed object-based learning approach generally improved the overall accuracy rates and F1 scores or, otherwise, the resulting per-class accuracy rates.
Moreover, as far as the Vaihingen dataset is concerned, the recall rates of Impervious_Surfaces and Cars were improved (see also Figure 5) for all object-based models (OB_Snet, OB_FCN and OB_Unet). In addition, the object-based OB_FCN network outperformed the standard FCN-16 by 1% in terms of overall accuracy (see also Figure 3), with all classes except Clutter achieving higher F1 rates.
Regarding the Potsdam dataset, higher accuracy rates were obtained for both object-based models (i.e., OB_Snet and OB_Unet, Figure 3). Specifically, the resulting F1 scores were improved for Impervious_Surfaces, Buildings, Low_Vegetation and Trees, while they stayed almost the same for the remaining classes (Figure 7). Judging by the achieved performance on the Vaihingen and Potsdam datasets, the more spectral information is available in the datasets, the higher the resulting accuracy rates for the object-based approach, which is logical and in accordance with the literature. For the object-based procedure, this can be further explained by the fact that the AML implementation takes into account that edges can also occur along the spectral dimension and not only in the spatial image domain represented through, e.g., the Lab/CIELAB color spaces. Indeed, the proposed approach preserves both spatial and spectral edges and object boundaries.
Generally speaking, it should also be noted that, regarding the size of the calculated objects and their scale, relatively smaller objects are preferable to larger ones. In particular, if the average size of a single object is large, then it is highly likely that it actually consists of more than one semantic class. In such cases, the proposed approach cannot properly address the semantic segmentation task and will most probably assign the dominating label.
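The effect of object size on the dominant label can be illustrated with a small sketch. Note that this is a hard, post hoc relabeling used purely for illustration, whereas the proposed framework applies the object prior as a loss term during training; the function names, the flat pixel lists and the single-letter class codes are all assumptions of the example.

```python
from collections import Counter


def dominant_labels(labels, segments):
    """Map each object (segment id) to its most frequent class label."""
    counts = {}
    for label, segment in zip(labels, segments):
        counts.setdefault(segment, Counter())[label] += 1
    return {segment: c.most_common(1)[0][0] for segment, c in counts.items()}


def enforce_object_labels(labels, segments):
    """Relabel every pixel with the dominant label of the object it belongs to."""
    dominant = dominant_labels(labels, segments)
    return [dominant[segment] for segment in segments]


if __name__ == "__main__":
    # Six pixels: B = Buildings, C = Cars, T = Trees (hypothetical codes).
    labels = ["B", "B", "B", "C", "T", "T"]
    # Object 0 is large and mixes Buildings with a single Cars pixel.
    segments = [0, 0, 0, 0, 1, 1]
    print(enforce_object_labels(labels, segments))
```

In this toy case the single Cars pixel inside the large object is absorbed by the Buildings label, which is the failure mode described above; a finer segmentation that isolates the car in its own object would preserve it.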

Comparison with Other State-Of-The-Art Methods on the ISPRS Dataset
Apart from the aforementioned comparison with the state-of-the-art networks, we also compared our results with other methods in the literature that were tested on the publicly available ISPRS dataset and that employ object-based approaches, preserving shapes and boundaries. For example, Marmanis et al. [44] exploited a fully convolutional architecture that takes advantage of boundary-related information. On the Vaihingen dataset, the OA and F1 rates of this method are similar to the ones reported by our proposed object-based method, even though they use additional information from the publicly available Digital Surface Model (DSM) of the ISPRS dataset. In fact, our Cars F1 rate is higher by 0.029 compared to the method in [44], which indicates that the lack of DSM information for Cars degrades the quality of their outcome. Regarding Potsdam, Marmanis et al. [44] only conducted experiments on the validation set. These results are also similar to ours, even with the additional information. Furthermore, Liu et al. [46] attempted to preserve object boundaries by exploiting features from both VHR images and LiDAR data. Specifically, both data sources are passed through a fully convolutional network which produces probabilistic predictions. Then, a higher-order CRF receives the fused classification outputs and the final segmentation map is formulated through graph cut inference methods [64]. The OA of this method is 0.5% lower than ours for the Potsdam dataset. At the same time, our method attains higher F1 rates for all semantic categories, especially for Cars, where our result is equal to 0.949 compared to the 0.928 of Liu et al. [46]. Moreover, the method in [48] (described in more detail in the Introduction) reports OA rates of 87.0% and 88.4% for Vaihingen and Potsdam, respectively, and is outperformed by the method proposed here on both datasets. In [45], the authors tried to preserve spatial boundary information by simultaneously performing semantic segmentation and edge detection using an encoder-decoder architecture. This is achieved by incorporating additional intermediate supervisions in the form of weighted losses related to the edge ground truth. Even though the achieved results are not based on the ISPRS Vaihingen testing benchmark, they are similar to those of our proposed method, with OA being 0.5% lower than the one we have presented. In the same way, Liu et al. [65] utilized an HourGlass-like architecture inspired by Newell et al. [66], the result of which is then post-processed by weighted belief propagation [67]. In this case, the Potsdam OA rates are higher than our method (89.42%), whereas, in the Vaihingen case, the best OA rate is equal to 88.82%, which is approximately 0.6% lower than our quantitative results.

Conclusions
In this study, an object-based deep learning framework was designed and developed based on the integration of AML simplification and a loss function that constrains the learning process with object priors. The developed approach is generic and can be integrated with different fully convolutional deep networks. The method ultimately encourages pixels belonging to the same object to be assigned to the corresponding dominant class, while retaining spectral and spatial characteristics. Based on the quantitative evaluation, higher accuracy rates, overall and per class, were achieved compared with the state-of-the-art. Qualitatively, the method also produced more compact and less noisy outcomes, while it retained more effectively the overall shape, geometry, object edges and boundaries. Future perspectives include the automatic selection of the learning weights in the loss function as well as the integration of different simplification scales and image representations towards tackling scale-space issues more efficiently.

Figure 1. Illustration of the proposed object-based deep learning framework. The input image is fed to an F-CNN architecture with encoding (green) and decoding (yellow) layers. During the pixel-based optimization procedure, the semantic loss is calculated by comparing the network output with the reference data, while the object-based loss constrains the semantic labels to be the same as the dominant label inside each superpixel. The two losses are then combined to produce the final segmentation map.

Figure 2. Comparing image objects derived from the Quickshift and SLIC algorithms with the proposed AML object-based approach (AML-QS). Image crops from the Vaihingen dataset are presented. Yellow dashed line: indicative areas where the proposed approach results in more semantically merged objects and clearer, more accurate edges and boundaries.

Figure 3. The resulting Overall Accuracy (OA) rates for the Vaihingen (left) and Potsdam (right) datasets after the application of the developed object-based learning frameworks as well as the current state-of-the-art (either patch-based or pixel-based) learning networks.

Figure 5. Resulting Confusion Matrices for the developed object-based learning frameworks (OB_Snet, OB_Unet and OB_FCN) on the Vaihingen dataset (right) and the corresponding ones for the state-of-the-art fully convolutional (SegNet, U-Net and FCN-16) networks (left).

Figure 8. Experimental results from the ConvNet, AlexNet and VGG-16 convolutional networks on indicative regions from the Vaihingen dataset. Along with a false color composite (R-G-NIR), the corresponding ground truth is presented as well (White, Impervious_Surfaces; Blue, Buildings; Light Blue, Low_Vegetation; Green, Trees; Yellow, Cars; Red, Clutter).

Figure 9. Experimental results from the SegNet, FCN-16 and U-Net fully convolutional networks on indicative regions from the Vaihingen dataset. Along with a false color composite (R-G-NIR), the corresponding ground truth is presented as well (White, Impervious_Surfaces; Blue, Buildings; Light Blue, Low_Vegetation; Green, Trees; Yellow, Cars; Red, Clutter).

Figure 11. Experimental results from the developed OB_Snet and OB_Unet networks on indicative regions from the Potsdam dataset. The corresponding ground truth along with the results from the state-of-the-art SegNet and U-Net deep networks are presented as well (White, Impervious_Surfaces; Blue, Buildings; Light Blue, Low_Vegetation; Green, Trees; Yellow, Cars; Red, Clutter).

Table 1. Proportion of each semantic category in the training datasets. The proportion corresponds to the number of pixels belonging to the specific class divided by the total number of pixels in the training datasets.

Table 2. The required training time in minutes as well as the optimal epoch that was picked for each architecture are presented for Vaihingen (a) and Potsdam (b). The frameworks proposed in this paper, i.e., OB_Snet, OB_Unet and OB_FCN, are shown in bold.
Figure 6. Resulting Confusion Matrices for the developed object-based learning frameworks (OB_Snet and OB_Unet) in the Potsdam dataset (right) and the corresponding ones for the fully convolutional (SegNet and U-Net) networks (left).
Figure 7. The resulting F1 scores from all considered methods for all semantic categories of the Potsdam dataset. In all cases, the developed object-based learning framework managed to outperform the current state-of-the-art fully convolutional networks. The highest achieved F1 scores are shown in bold.

Table 3. Intersection-over-Union (IoU) results for the Vaihingen (left) and Potsdam (right) datasets. The overall IoU value for each semantic category is calculated by adding the IoUs of all the testing images and dividing by their number.

Table 4. The calculated mean Hausdorff distance (HD) between the predicted and reference Buildings pixels for the Vaihingen and Potsdam datasets.