Semantic Labeling of High Resolution Aerial Imagery and LiDAR Data with Fine Segmentation Network

In this paper, a novel convolutional neural network (CNN)-based architecture, named fine segmentation network (FSN), is proposed for semantic segmentation of high resolution aerial images and light detection and ranging (LiDAR) data. The proposed architecture follows the encoder–decoder paradigm, and multi-sensor fusion is accomplished at the feature level using a multi-layer perceptron (MLP). The encoder consists of two parts: a main encoder based on the convolutional layers of the Vgg-16 network for color-infrared images, and a lightweight branch for LiDAR data. In the decoder stage, to adaptively upscale the coarse outputs from the encoder, sub-pixel convolution layers replace the transposed convolutional layers or other common up-sampling layers. Based on this design, the features from different stages and sensors are integrated for MLP-based high-level learning. In the training phase, transfer learning is employed to transfer the features learned from a generic dataset to remote sensing data. The proposed FSN is evaluated on the International Society for Photogrammetry and Remote Sensing (ISPRS) Potsdam and Vaihingen 2D Semantic Labeling datasets. Experimental results demonstrate that the proposed framework brings considerable improvement over other related networks.


Introduction
Semantic segmentation of high resolution remote sensing images aims at assigning each pixel a semantic class, for instance, building, car, tree, or low vegetation. Accurate and timely acquisition of segmentation results is fundamental for precise urban planning, environmental monitoring and economic forecasting. With the development of aerospace remote sensing technology, the spatial resolution of remote sensing images has notably increased. Higher spatial resolution reveals many tiny objects and fine details, but also causes large intra-class variance and small inter-class differences, which often lead to segmentation ambiguities [1].
Several approaches based on spectral statistical features have been proposed for high resolution remote sensing image classification, including the maximum likelihood method [2], the minimum distance method [3] and K-means [4]. Moreover, methods based on machine learning, such as neural networks (NN) [5] and support vector machines (SVM) [6], have been developed for this task as well. Finally, new models based on object-oriented classification [7] and sparse representation [8] have also been developed. Although these frameworks might obtain satisfactory classification performance, they typically suffer from major drawbacks which can jeopardize the processing outcomes. Indeed, in scenarios characterized by high spectral complexity, these architectures are not able to accurately track the interplay among samples on global and local scales. Furthermore, these shallow learning networks cannot satisfy the requirements for the complexity and diversity of functions and training samples because they usually have only one hidden layer.
Over the last few years, deep learning has received widespread attention in the image analysis field [9]. Convolutional neural networks (CNNs) play a key role among deep learning techniques. They have achieved remarkable results in several applications, especially recognition and bounding-box object detection [10,11]. Semantic segmentation is a more demanding task than object detection. Naturally, semantic segmentation attracts more research attention in the deep learning field as a progression from coarse to fine inference. The available CNNs for semantic segmentation can be divided into two categories: patch-based methods and pixel-based methods. In patch-based methods, an image patch around each pixel of the input image is extracted, and a single label is predicted for each patch by a CNN designed for whole-image classification [12][13][14][15][16]. This class of algorithms provides a remarkable enhancement in image segmentation; however, it also has drawbacks, such as huge RAM cost, low computational efficiency, and a limited receptive field.
On the other hand, pixel-based methods can predict labels for all the pixels of a whole image at once. The most classic architecture is the fully convolutional network (FCN) proposed by Long et al. [17], which replaces fully connected layers with convolutional layers and uses transposed convolutional layers to upscale the coarse outputs into fine segmentation results. Following the encoder-decoder paradigm of FCN, multiple CNN architectures have been developed and have achieved better results. To mitigate the classification ambiguities caused by the large upsampling factor in FCN, Chen et al. [18] proposed "DeepLab" with a fully connected Conditional Random Field (DeepLab-CRF), which introduced "atrous" convolutions to offset the effect of removing pooling layers and integrated the responses at the final CNN layer with a fully connected CRF to smooth the raw segmentation results. Noh et al. [19] proposed to replace bilinear interpolation in the upscaling stage with a multi-layer deconvolution network composed of deconvolution and unpooling layers (DeconvNet). Badrinarayanan et al. [20] presented a novel CNN architecture named SegNet, which has an architecture similar to DeconvNet but with far fewer parameters, and is easier to train end-to-end.
In recent years, CNNs have gradually been applied to semantic segmentation of remote sensing images. In the patch-based category, Paisitkriangkrai et al. [21] suggested integrating CNN features with handcrafted features, and used CRFs to further improve the classification performance. Nogueira et al. [22] compared several existing powerful CNNs under three training strategies: training from scratch, fine-tuning, and using CNNs as feature extractors. Experimental results on three remote sensing datasets indicated that fine-tuning is the best strategy. In [1], an FCN architecture without downsampling was proposed to mitigate the loss of feature detail. The architecture had a pre-trained network for color-infrared (CIR) images and a network trained from scratch for digital surface model data. The features of these two networks were then combined by concatenation. Although this architecture achieved state-of-the-art semantic labeling accuracy for high-resolution aerial imagery, it was soon outperformed by downsample-then-upsample architectures because of its limited receptive field. Moreover, its decision-level fusion strategy requires training two separate neural networks for CIR images and LiDAR data, respectively, which means the number of trainable parameters is doubled.
Audebert et al. [23] transferred a powerful semantic segmentation architecture for generic images (SegNet [20]) to remote sensing images. They compared decision-level and feature-level fusion methods, and showed that their proposed dual-stream SegNet, which fuses multi-sensor data by residual correction at the feature level, performed slightly better. Our FSN also applies feature-level fusion, but differs from their method in two ways. First, we process LiDAR data with a lighter-weight branch, which reduces the computational overhead. Second, since CIR data contain more information than LiDAR data, we concatenate features from the main encoder and the lightweight branch at different depths, whereas they concatenate the features of heterogeneous data at the same depth. In [24], Liu et al. fused an FCN trained on CIR images and a logistic regression trained on LiDAR data at the decision level using a higher-order CRF framework, which outperformed the original counterparts. In [25], Volpi et al. applied learnable transposed convolutional layers in the decoder stage to decrease the spatial information loss, but the segmentation accuracy was still limited. In [26], Maggiori et al. analyzed in depth several CNNs for dense semantic labeling of high-resolution remote sensing imagery, and derived a CNN framework with a multi-layer perceptron that learns to combine features at different resolutions. In [27], an hourglass-shaped network following the downsample-then-upsample paradigm was designed, which introduced inception modules to take advantage of multi-scale receptive fields, together with residual units to feed information from the encoder to the decoder directly. However, the methods in [25][26][27] did not pay much attention to the multi-sensor data fusion problem. They fused the CIR images and the LiDAR data in the first layer, which usually cannot fully exploit the features of each data source and cannot obtain satisfying results.
In summary, pixel-based methods achieve better performance than patch-based methods, and training from scratch is often outperformed by fine-tuning a pre-trained network. Although deep learning has achieved solid success in semantic segmentation of remote sensing images, the well-known trade-off between recognition and localization [18,19] remains a challenge. Down-sampling operations offer the network a wider receptive field and thus more accurate recognition, but the smaller feature maps may result in inaccurate localization. Besides, the existing upscaling methods, such as transposed convolutional layers and unpooling layers, tend to introduce artificial values into the low resolution feature maps when upscaling them. This can cause information loss and therefore decrease the segmentation accuracy. Moreover, multi-sensor data are often fused at the decision level or stacked together as input. The former approach suffers from a large number of parameters, and the latter prevents the network from being initialized with pre-trained weights, which has been shown to be superior to random initialization [22][23][24].
In this paper, a new CNN-based architecture named fine segmentation network (FSN) is proposed for semantic segmentation of high resolution aerial imagery. The FSN belongs to the pixel-based category and follows the encoder-decoder paradigm of FCN; moreover, it fuses the multi-sensor data at the feature level with a multi-layer perceptron (MLP). The main contributions of the proposed network can be summarized as follows:

• The encoder is structured into two parts: a main encoder and a lightweight branch. The main encoder is based on Vgg16 [28] and processes CIR images. The lightweight branch is designed to process the corresponding LiDAR data, i.e., the digital surface models (DSMs) and the normalized DSMs (nDSMs), independently. This design accomplishes the feature extraction of multi-sensor data with relatively few parameters.

• Sub-pixel convolution layers, proposed for image and video super-resolution [29], replace the traditional deconvolution layers in the proposed FSN. Without adding any artificial values, a sub-pixel convolution layer computes convolutions on low resolution feature maps and upscales them in a single step. Thus, a filter of the same size as in a common up-sampling layer covers a larger contextual area.

• An MLP is used to accomplish effective feature-level fusion of multi-sensor remote sensing data at the back end of the structure. Moreover, multi-resolution feature maps are also fed into the MLP to mitigate the recognition/localization trade-off.
The proposed FSN is evaluated on the International Society for Photogrammetry and Remote Sensing (ISPRS) Potsdam and Vaihingen 2D Semantic Labeling datasets. Experimental results demonstrate that it brings considerable improvement over other related networks.
The remainder of this paper is organized as follows. In Section 2, the main components of convolutional neural networks are briefly introduced. In Section 3, the proposed FSN is detailed, followed by a presentation of the post-processing method. Section 4 presents the data, experimental settings, and experimental results. Section 5 draws the conclusions.

Convolutional Neural Network
A convolutional neural network [30] is a special type of deep neural network characterized by sparse connectivity and parameter sharing. Sparse connectivity means that the neurons of a CNN are not fully connected, as shown in Figure 1a; instead, adjacent layers are only partially connected to make better use of local spatial characteristics. Parameter sharing means that the neurons in one feature map share the same parameter matrix. As shown in Figure 1b, layer n has four neurons belonging to one feature map, and connection lines of the same color denote the same weights. These two characteristics reduce the complexity of the network structure and the total number of parameters. A CNN can also be regarded as a complex non-linear function that maps the inputs to target variables. In this section, we briefly introduce the main components of CNNs. The basic types of CNN layers in semantic segmentation include the convolutional layer, non-linear activation layer, spatial pooling layer, transposed convolutional layer, and unpooling layer.

Convolutional Layer
The convolutional layers are the main component of CNNs. A convolutional layer can be considered as a set of neurons or filters. Each filter has a series of learnable parameters arranged as a convolution kernel of size $P \times Q \times D$, where $P$, $Q$ and $D$ represent the length, width and depth of the convolution kernel, respectively. A convolutional layer (a set of $D'$ filters) converts an input of size $N_i \times L_i \times D$ into an output of size $N_o \times L_o \times D'$. When a filter is centered on the spatial position $(i,j)$ of the input, the response of the $d'$-th filter can be written as:
$$y_{ijd'} = \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{d=1}^{D} W_{pqd}\, x_{pqd} + b$$
where $y_{ijd'}$ is the response of the $d'$-th filter, $x_{pqd}$ is the window surrounding the spatial position $(i,j)$ of the input, $W_{pqd}$ denotes the learnable weights of the convolution kernel, and $b$ is a learnable bias. The spatial dimensions of the output can be calculated as:
$$N_o = \frac{N_i - P + 2Z}{S} + 1, \qquad L_o = \frac{L_i - Q + 2Z}{S} + 1$$
where $Z$ is the number of rows and columns padded on the borders of the input, and $S$ is the stride of the sliding convolution kernel. Figure 2 reports an example where the input image is of size 7 × 7 × 3 and the convolution kernel is of size 3 × 3 × 3. The convolution is performed with stride 1 and without padding, and only one filter is employed. The size of the output is therefore 5 × 5 × 1.
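The output-size rule and the single-filter response can be checked with a minimal pure-Python sketch. The function names `conv_output_size` and `conv2d_single_filter` are hypothetical and chosen for illustration only; the window is indexed from its top-left corner, which is equivalent to the centered formulation when no padding is used.

```python
def conv_output_size(n_in, kernel, padding=0, stride=1):
    """Output length along one spatial axis: (n_in - kernel + 2*padding) / stride + 1."""
    return (n_in - kernel + 2 * padding) // stride + 1


def conv2d_single_filter(x, w, b=0.0, stride=1):
    """Unpadded convolution of x[row][col][depth] with one filter w[p][q][d]."""
    n_i, l_i, depth = len(x), len(x[0]), len(x[0][0])
    p_size, q_size = len(w), len(w[0])
    n_o = conv_output_size(n_i, p_size, 0, stride)
    l_o = conv_output_size(l_i, q_size, 0, stride)
    out = [[0.0] * l_o for _ in range(n_o)]
    for i in range(n_o):
        for j in range(l_o):
            acc = b  # start from the bias, then sum over the P x Q x D window
            for p in range(p_size):
                for q in range(q_size):
                    for d in range(depth):
                        acc += w[p][q][d] * x[i * stride + p][j * stride + q][d]
            out[i][j] = acc
    return out


# The 7 x 7 x 3 input and 3 x 3 x 3 kernel of the example, filled with ones.
x = [[[1.0] * 3 for _ in range(7)] for _ in range(7)]
w = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
y = conv2d_single_filter(x, w)
print(len(y), len(y[0]), y[0][0])  # 5 5 27.0
```

With stride 1 and no padding, the sketch reproduces the 5 × 5 × 1 output size of the example.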
Remote Sens. 2018, 10, x FOR PEER REVIEW 5 of 24

Nonlinear Activation Layer
Generally, a neural network for image processing applies convolutions to images, but this operation is linear. Thus, nonlinear activation layers are introduced into CNNs to increase the network's ability to express complex non-linear mappings. The most common nonlinear activation applied in CNNs is the Rectified Linear Unit (ReLU) function [31], formulated as:
$$f(x) = \max(0, x)$$
Other common activation functions include the sigmoid function, the tanh function, and the leaky ReLU function [32].
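As a small illustration, ReLU and leaky ReLU can be written as scalar functions. The negative slope of 0.01 used for leaky ReLU is an assumed, commonly used default and is not a value taken from this paper.

```python
def relu(x):
    """Rectified Linear Unit: passes positive inputs, clamps the rest to zero."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """Leaky ReLU: keeps a small slope for negative inputs (slope is an assumption)."""
    return x if x > 0 else slope * x


print([relu(v) for v in (-2.0, 0.0, 3.0)])  # [0.0, 0.0, 3.0]
```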

Spatial Pooling Layer
The role of the spatial pooling layer is to reduce the dimensionality of the representation and to create invariance to small shifts and distortions by pooling small windows into single values [33]. A spatial pooling operation typically slides a small window (e.g., 2 × 2 or 3 × 3) over the feature maps and converts the information within the window into a single value. Existing pooling methods include max pooling, mean pooling, and sum pooling. In this paper, max pooling is adopted due to its stability and wide application in the literature. Given a window of size $P \times P$, let $P_{ij}$ denote the window centered on the spatial location $(i,j)$. Then, the max pooling function returns the maximum value within the window area as:
$$y_{ij} = \max_{(p,q) \in P_{ij}} x_{pq}$$
where $x_{pq}$ is the input value at position $(p,q)$ and $y_{ij}$ is the pooled output.
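A minimal sketch of max pooling on a single feature map follows. The function name `max_pool2d` and the non-overlapping 2 × 2 window with stride 2 are illustrative assumptions; the window is indexed from its top-left corner.

```python
def max_pool2d(fmap, size=2, stride=2):
    """Slide a size x size window over fmap[row][col] and keep each window's maximum."""
    rows = (len(fmap) - size) // stride + 1
    cols = (len(fmap[0]) - size) // stride + 1
    return [
        [
            max(
                fmap[i * stride + p][j * stride + q]
                for p in range(size)
                for q in range(size)
            )
            for j in range(cols)
        ]
        for i in range(rows)
    ]


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 6],
]
print(max_pool2d(fmap))  # [[4, 5], [3, 7]]
```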

Transposed Convolutional Layer
Transposed convolution is also called deconvolution [34]. In semantic segmentation, transposed convolution is a popular approach to recover the feature details lost through pooling or other downsampling operations. The low resolution input is first upscaled by bilinear interpolation or by inserting zeros, and convolutions are then applied to the raw upscaled result to fill in sensible values.
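The two-step view described above (zero insertion followed by a convolution) can be sketched in one dimension. The helper names and the fixed triangular kernel are assumptions made for illustration; in a network the kernel weights would be learned.

```python
def zero_insert(signal, factor=2):
    """Upscale by placing (factor - 1) zeros after every sample."""
    out = []
    for value in signal:
        out.append(value)
        out.extend([0.0] * (factor - 1))
    return out


def conv1d_same(signal, kernel):
    """Zero-padded 1D convolution that keeps the input length (odd, symmetric kernel)."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(kernel[k] * padded[i + k] for k in range(len(kernel)))
        for i in range(len(signal))
    ]


low_res = [1.0, 2.0, 3.0]
raw = zero_insert(low_res)               # [1.0, 0.0, 2.0, 0.0, 3.0, 0.0]
filled = conv1d_same(raw, [0.5, 1.0, 0.5])
print(filled)  # [1.0, 1.5, 2.0, 2.5, 3.0, 1.5]
```

With this particular kernel, the inserted zeros are replaced by the average of their neighbors, which shows how the convolution fills the artificial values with sensible ones.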

Unpooling Layer
Unpooling is the reverse operation of pooling. Pooling itself is irreversible; therefore, the position of the maximum value is recorded in the pooling stage (taking max pooling as an example). In the unpooling stage, only the recorded position is activated, whilst the values at all other positions are set to 0.
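The record-then-restore mechanism can be sketched as follows. The function names `max_pool_with_indices` and `max_unpool` are hypothetical, and non-overlapping windows are assumed for simplicity.

```python
def max_pool_with_indices(fmap, size=2):
    """Non-overlapping max pooling that also records where each maximum came from."""
    rows, cols = len(fmap) // size, len(fmap[0]) // size
    pooled = [[0] * cols for _ in range(rows)]
    indices = [[None] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            best = None
            for p in range(size):
                for q in range(size):
                    r, c = i * size + p, j * size + q
                    if best is None or fmap[r][c] > fmap[best[0]][best[1]]:
                        best = (r, c)
            pooled[i][j] = fmap[best[0]][best[1]]
            indices[i][j] = best
    return pooled, indices


def max_unpool(pooled, indices, out_rows, out_cols):
    """Write each pooled value back to its recorded position; all other cells stay 0."""
    out = [[0] * out_cols for _ in range(out_rows)]
    for i, row in enumerate(pooled):
        for j, value in enumerate(row):
            r, c = indices[i][j]
            out[r][c] = value
    return out


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 6],
]
pooled, indices = max_pool_with_indices(fmap)
restored = max_unpool(pooled, indices, 4, 4)
print(restored)  # [[0, 0, 0, 0], [4, 0, 0, 5], [0, 0, 7, 0], [3, 0, 0, 0]]
```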

Proposed FSN Method
In this section, we detail the proposed FSN method for semantic segmentation of high resolution aerial imagery. We first present the general network design of FSN in Section 3.1, and then introduce the post-processing method in Section 3.2.

Network Architecture of FSN
The proposed FSN belongs to the pixel-based methods and is designed with an encoder-decoder structure, as shown in Figure 3. The encoder consists of two parts: the main encoder for color-infrared images and the lightweight branch for DSM data. The main encoder is based on the first 13 layers of the Vgg16 net, and each convolutional layer is followed by a ReLU activation. Five pooling layers are employed to downsample the feature maps to achieve wider receptive fields.

To obtain accurate fine-resolution segmentation, we also consider DSM and nDSM data. These additional records characterize the height of ground objects and can therefore help the network recognize buildings and trees. However, when these data are concatenated with the CIR images as input to the main encoder, the pre-trained weights can no longer be used to initialize the network, i.e., the network cannot be trained by fine-tuning. Furthermore, although the recognition of buildings and trees can benefit from DSM/nDSM data, the analysis of other objects may be degraded, because LiDAR data carry no useful information about them. Therefore, an extra branch is designed to extract features from the LiDAR data independently, and these features are combined with those of the CIR images by an MLP in an efficient and flexible manner [26]. Since LiDAR data carry no spectral information and each pixel value represents the height at that pixel, these data can to some extent be regarded as a probability map of buildings and trees. Likewise, each pixel of a heat map represents the probability that the pixel belongs to a specific class, which makes LiDAR data similar to the score maps from the upper layer of the decoder. Therefore, a lightweight convolutional network can generate appropriate features to complement the preliminarily upscaled features from the upper layer of the main encoder. Hence, we designed a structure composed of three convolutional layers (the first two followed by ReLU activation). Moreover, three max-pooling layers are employed to downsample the feature maps. The detailed configuration of the encoder is listed in Table 1. In the decoder stage, to enrich the receptive fields and contextual information of FSN, we introduce inception modules (the yellow blocks in Figure 3), which consist of convolutional layers with kernels of multiple sizes [27,35] (see Figure 4).
We also introduce a powerful super-resolution module, the sub-pixel convolution layer [29] (orange blocks in Figure 3), to perform upsampling in the decoder stage. Unlike previous upscaling methods, sub-pixel convolution increases the resolution after the convolutional operations; therefore, convolutions with a smaller kernel size can integrate the same information while maintaining a given contextual area. Figure 5 shows an example of a transposed convolutional layer and a sub-pixel convolution layer. A transposed convolutional layer first increases the resolution by interpolation or by inserting zeros, and then applies convolutions to the raw result to fill in sensible values. A sub-pixel convolution layer, without filling any artificial values into the space between pixels, simply applies regular convolutions to the low-resolution feature map and reshapes the result to high resolution by phase shifting in a single step.
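The reshaping step of a sub-pixel convolution layer (often called periodic shuffling or pixel shuffle) can be sketched in pure Python. The function name `pixel_shuffle` and the channel ordering `c * r * r + i * r + j` are assumptions following a common convention; the paper does not specify its indexing. The preceding regular convolution, which would produce the `C * r * r` low-resolution channels, is omitted here.

```python
def pixel_shuffle(fmaps, r):
    """Rearrange fmaps[C * r * r][H][W] into out[C][H * r][W * r] without inserting zeros."""
    channels_in = len(fmaps)
    height, width = len(fmaps[0]), len(fmaps[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("number of channels must be divisible by r * r")
    channels_out = channels_in // (r * r)
    out = [
        [[0] * (width * r) for _ in range(height * r)]
        for _ in range(channels_out)
    ]
    for c in range(channels_out):
        for h in range(height):
            for w in range(width):
                for i in range(r):
                    for j in range(r):
                        # Each of the r * r channels fills one sub-pixel phase (i, j).
                        out[c][h * r + i][w * r + j] = fmaps[c * r * r + i * r + j][h][w]
    return out


# Four 2 x 2 low-resolution maps become one 4 x 4 map (upscaling factor r = 2).
low_res = [[[k * 10 + h * 2 + w for w in range(2)] for h in range(2)] for k in range(4)]
high_res = pixel_shuffle(low_res, 2)
for row in high_res[0]:
    print(row)
```

Every output pixel is taken directly from a computed low-resolution value, so no artificial zeros or interpolated samples enter the upscaled map.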
Instead of using only the output of the upper layer, feature maps of different resolutions are combined to improve the segmentation performance. This combination addresses the trade-off between recognition and localization. Indeed, the high-resolution feature maps from lower layers are spatially precise but have a small receptive field, whereas the low-resolution feature maps from upper layers carry little spatial detail but cover a wider context. Hence, upper layers can detect some objects that lower layers cannot. Therefore, it is not wise to blindly reduce the depth of the network or to discard the high-resolution feature maps.
The common approach to this issue is element-wise addition. For instance, the skip connections of FCN-8s first upscale the lower-resolution feature maps to match the higher-resolution ones and then add them element-wise. However, this linear combination cannot accurately characterize practical scenarios, and a nonlinear combination is required. Here, we propose to use a multi-layer perceptron (MLP) [26] to learn how to combine feature maps at different resolutions. The MLP is a minimal system with one hidden layer, as shown in Figure 3. Specifically, multi-scale feature maps, including the features of the lightweight branch, are concatenated in depth, and a hidden layer with 1 × 1 convolutional kernels is applied to the pool of features to approximate the combining function. The detailed configuration of the decoder is shown in Table 2.
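The MLP-based fusion can be sketched as depth-wise concatenation followed by 1 × 1 convolutions, i.e., the same small perceptron applied independently at every pixel. All names, shapes and weight values below are hypothetical and chosen only to make the arithmetic easy to follow; the actual layer sizes are those listed in Table 2.

```python
def concat_depth(*feature_maps):
    """Concatenate maps shaped [row][col][channels] along the channel axis."""
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [sum((fm[i][j] for fm in feature_maps), []) for j in range(cols)]
        for i in range(rows)
    ]


def conv1x1(fmap, weights, biases, activation=None):
    """Apply the same linear map (one weight row per output channel) at every pixel."""
    out = []
    for row in fmap:
        out_row = []
        for pixel in row:
            values = [
                sum(w * v for w, v in zip(w_row, pixel)) + b
                for w_row, b in zip(weights, biases)
            ]
            if activation is not None:
                values = [activation(v) for v in values]
            out_row.append(values)
        out.append(out_row)
    return out


def relu(x):
    return max(0.0, x)


# Hypothetical 2 x 2 maps: two CIR-derived channels and one LiDAR-derived channel.
cir_feats = [[[1.0, 2.0], [0.0, 1.0]], [[3.0, -1.0], [2.0, 2.0]]]
lidar_feats = [[[0.5], [1.0]], [[0.0], [2.0]]]

fused = concat_depth(cir_feats, lidar_feats)              # 3 channels per pixel
hidden = conv1x1(fused, [[1.0, 0.0, 1.0], [0.0, -1.0, 2.0]], [0.0, 0.0], relu)
scores = conv1x1(hidden, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.5])
print(scores[0][0])  # [1.5, 0.5]
```

Because the hidden layer is followed by a nonlinearity, the combination of the concatenated features is no longer a plain weighted sum, in contrast to element-wise addition.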

Post-Processing Method for FSN-Based Segmentation
The proposed FSN network provides relatively fine segmentation results. However, it still shows some drawbacks, such as slight inaccuracy in determining the borders of objects. These effects might be caused by the maximum-probability label assignment criterion, which does not consider the occurrence of classes with a lower probability. To further improve the accuracy of the segmentation results, we adopt fully connected conditional random fields (CRFs) as a post-processing method in our segmentation task.
Several works have already applied CRFs to refine the segmentation results and improved the segmentation performance [18,21]. In this work, the output of the softmax layer (heat maps) is fed into the fully connected CRFs as the unary potential, and the CIR image is fed in as the pairwise potential with color and position information. Thus, the CIR image serves as the pairwise potential that describes the "distance" between pixels, both in position and in color. The general segmentation pipeline is shown in Figure 6.
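The two potentials can be sketched in a few lines of NumPy. The paper does not give its kernel, so the form below (an appearance kernel over position and color plus a smoothness kernel over position, as commonly used with fully connected CRFs) is an assumption, as are all weights, bandwidths, and function names.

```python
import numpy as np

def unary_from_softmax(probs, eps=1e-8):
    """Unary potential: negative log of the softmax output, shape (H, W, classes)."""
    return -np.log(np.clip(probs, eps, 1.0))

def pairwise_kernel(pos_i, pos_j, color_i, color_j,
                    w_appearance=1.0, w_smooth=1.0,
                    theta_pos=10.0, theta_color=10.0, theta_smooth=3.0):
    """Similarity of two pixels from their positions and CIR colors.

    Assumed form: Gaussian appearance kernel (position + color) plus
    Gaussian smoothness kernel (position only). All parameter values
    are placeholders, not values from the paper.
    """
    d_pos = np.sum((np.asarray(pos_i, float) - np.asarray(pos_j, float)) ** 2)
    d_col = np.sum((np.asarray(color_i, float) - np.asarray(color_j, float)) ** 2)
    appearance = np.exp(-d_pos / (2 * theta_pos ** 2) - d_col / (2 * theta_color ** 2))
    smoothness = np.exp(-d_pos / (2 * theta_smooth ** 2))
    return w_appearance * appearance + w_smooth * smoothness
```

Nearby pixels with similar CIR values receive a large kernel value and are therefore encouraged to share a label, which is how the post-processing sharpens object borders.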

Experimental Design
To compare the proposed network with the state-of-the-art methods, we tested all the considered algorithms on two open benchmarks of aerial image labeling provided by Commission III of ISPRS [36,37], namely the Potsdam and Vaihingen datasets. In this section, we first briefly introduce the datasets, and then the competing schemes are presented. Finally, the evaluation metrics for the test results are introduced.

(1) Dataset description
The datasets used in this work are two well-known open airborne datasets provided by Commission III of ISPRS [36,37]. These datasets include very high resolution true orthophoto (TOP) tiles, DSMs, and corresponding ground truth maps of two German regions. Both regions cover urban scenes. Specifically, Potsdam is a historic city with a dense settlement structure, whilst Vaihingen is a small village with detached buildings.
The Potsdam dataset [36] consists of 38 TOP tiles: each tile has six channels, i.e., near-infrared, red, green, blue, DSMs, and nDSMs. The spatial resolution of the image tiles is 5 cm, and they are all of size 6000 × 6000 pixels. Six classes (impervious surfaces, building, low vegetation, tree, car, and clutter) have been labeled pixel-wise on 24 tiles. The considered segmentation architectures use five channels, i.e., near-infrared, red, green, DSMs, and nDSMs. We randomly choose four tiles as the validation set, namely 5_10, 6_7, 6_12, and 7_10, and four tiles as the test set, namely 2_12, 4_10, 5_11, and 7_11. The other 16 tiles are employed for training.
The Vaihingen dataset [37] includes 33 TOP tiles with a spatial resolution of 9 cm. For each tile, there are five channels including near-infrared, red, green, DSMs, and nDSMs, as also provided by Gerke et al. [38]. The average size of the TOP tiles is 2494 × 2064 pixels. Only 16 tiles have ground truths, which contain the same six classes as the Potsdam dataset. Here, we also employ the same five channels. Tiles 17, 34, and 37 are selected for validation, and tiles 3, 11, and 32 are selected for testing, while the remaining 10 tiles are used for training.


(2) Training and Inference Strategy
The proposed FSN network is trained with the sparse softmax cross-entropy loss function. The Adam optimizer [39] is employed to optimize the loss function. We utilize the parameters of the pre-trained Vgg16 net [40] to initialize the main encoder, i.e., parameter-transfer learning. The lightweight branch and the decoder part are initialized with normally distributed random variables. For the Potsdam dataset, we set a low learning rate of $10^{-5}$, which is stepped down five times every five epochs; the batch size is set to 10, and the image patch size is 512 × 512 pixels. For the Vaihingen dataset, the initial learning rate is $10^{-5}$, again stepped down five times every five epochs; the batch size is set to 20, and the image patch size is 256 × 256 pixels. Further details of these two datasets are presented in Section 4.1. Data augmentation is adopted to mitigate the over-fitting caused by the limited amount of labeled data. The image patches are extracted with 50% overlap, and each image patch is flipped vertically and horizontally and then rotated by 90°, 180°, and 270° [27].
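The patch extraction and augmentation described above can be sketched as follows. The text does not say whether the rotations are also applied to the flipped copies; this sketch assumes they are not, so each patch yields six variants. The function names are hypothetical, and tile borders that do not fit a full patch are simply skipped here.

```python
import numpy as np

def extract_patches(tile, patch, overlap=0.5):
    """Cut square patches from a (H, W, C) tile with the given fractional overlap."""
    stride = int(patch * (1 - overlap))
    h, w = tile.shape[:2]
    return [tile[r:r + patch, c:c + patch]
            for r in range(0, h - patch + 1, stride)
            for c in range(0, w - patch + 1, stride)]

def augment(patch):
    """Return the patch, its two flips, and its three rotations (assumed reading)."""
    return [patch,
            np.flipud(patch),      # vertical flip
            np.fliplr(patch),      # horizontal flip
            np.rot90(patch, 1),    # 90 degrees
            np.rot90(patch, 2),    # 180 degrees
            np.rot90(patch, 3)]    # 270 degrees
```

The same geometric transform has to be applied to the label map of each patch so that image and ground truth stay aligned.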
In the inference stage, sliding window overlap is employed to mitigate the border effect. The full-tile test images of the two datasets used in this work are larger than 2000 × 2000 pixels, so we need to clip each image into small patches to fit the memory constraint. This processing leads to the problem that the segmentation results at patch borders tend to be inconsistent. Thus, we set a 75% overlap in the inference procedure, as this size has been shown to achieve the best accuracy in previous works [23,27].
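A minimal sketch of overlapping sliding-window inference is given below. The paper does not state how the overlapping patch predictions are merged; averaging the class scores is an assumption, and `predict_patch` is a hypothetical callable standing in for the trained network.

```python
import numpy as np

def window_starts(length, patch, stride):
    """Start offsets covering [0, length), with a last window flush to the edge."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)
    return starts

def predict_tile(tile, predict_patch, n_classes, patch=256, overlap=0.75):
    """Label a full (H, W, C) tile by averaging scores of overlapping patches.

    predict_patch maps a (h, w, C) window to (h, w, n_classes) scores.
    """
    stride = max(1, int(patch * (1 - overlap)))
    h, w = tile.shape[:2]
    scores = np.zeros((h, w, n_classes))
    counts = np.zeros((h, w, 1))
    for r in window_starts(h, patch, stride):
        for c in window_starts(w, patch, stride):
            window = tile[r:r + patch, c:c + patch]
            scores[r:r + patch, c:c + patch] += predict_patch(window)
            counts[r:r + patch, c:c + patch] += 1
    return (scores / counts).argmax(axis=-1)
```

With 75% overlap, most interior pixels are covered by several windows, so a prediction made near one patch border is averaged with predictions in which the same pixel lies closer to a patch center.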

(3) Competing methods
To justify the lightweight design, the proposed lightweight (3 layers) branch for DSMs and nDSMs is first compared with middleweight (6 layers) and heavyweight (9 layers) branches. The versions of the proposed network without sub-pixel convolution layers and without MLP are then compared with the full version. The former removes all sub-pixel convolution layers (SP1-3) and replaces them with transposed convolutional layers, to check whether the sub-pixel convolution layer benefits the segmentation task. The latter is set up to study whether MLP achieves better feature combination performance than the common element-wise addition. Hence, we remove the MLP from our design and combine the feature maps at different resolutions by element-wise addition. In this case, to equalize the number of feature maps for the addition, the filter numbers of Conv6 and ConvL3 are set to 64.
We further evaluate the performance of the proposed FSN and FSN-noL by comparing them with FCN-8s, SegNet, and HSN, where FSN-noL denotes the version of FSN without the lightweight branch. Since FCN-8s and SegNet act as strong baselines for CIR images only, we use the near-infrared, red, and green channels to train these architectures. HSN is designed for five-channel input, so it serves as the CIR + LiDAR data baseline. The details of these baselines can be found in Appendix A. Overlap inference with a 75% overlap is employed in the inference stage of all experiments.

(4) Evaluation metrics
To guarantee a fair comparison, we evaluated the performance of the considered frameworks in terms of overall accuracy (OA), per-class F1 score, and average F1 score. Moreover, the trainable weights of each tested model are counted. OA is the ratio of the number of correctly labeled pixels to the total number of pixels in the whole image. The F1 score can be expressed as:

$$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

where precision and recall can be calculated from the confusion matrices. Precision is the number of true positive pixels divided by the sum of true positive and false positive pixels. Recall is the ratio of true positive pixels to the sum of true positive and false negative pixels.
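These metrics follow directly from a confusion matrix; the F1 score is the harmonic mean of precision and recall. The sketch below assumes integer label maps with rows of the matrix indexed by the true class and columns by the predicted class; the function names are illustrative.

```python
import numpy as np

def confusion_matrix(truth, pred, n_classes):
    """Rows: true class, columns: predicted class."""
    idx = truth.ravel() * n_classes + pred.ravel()
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)

def overall_accuracy_and_f1(cm):
    """Return (OA, per-class F1) from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                    # predicted as the class, but wrong
    fn = cm.sum(axis=1) - tp                    # belonging to the class, but missed
    precision = tp / np.maximum(tp + fp, 1)     # guard against empty classes
    recall = tp / np.maximum(tp + fn, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    oa = tp.sum() / cm.sum()
    return oa, f1
```

The average F1 score reported in the tables corresponds to `f1.mean()` over the evaluated classes.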

Validation of Lightweight Branches
We first present the comparison results of the middleweight, the heavyweight, and the proposed lightweight branches. Specifically, the light-, medium-, and heavy-weight branches are characterized by 3, 6, and 9 convolutional layers, respectively. Figure 7 shows the structures of these branches, in which all max pooling layers have size 2 and stride 2. Table 3 shows the configurations of the middleweight and heavyweight branches, whereas the configuration of the lightweight branch is presented in Section 3.1 (see ConvL1~PoolL3 in Table 1). Table 4 shows the segmentation performance of the three extra branches on the Potsdam dataset; the results are assessed against the original ground truth (GT) and its eroded version. The eroded ground truth (E-GT) aims to exclude the impact of uncertain border definitions, so the boundaries of objects are eroded by a circular disc of 3 pixel radius. The lightweight branch provides the best results in most cases. Moreover, the accuracy decreases as the depth of the extra branch increases. This indicates that the lightweight convolutional network can generate appropriate features to complement the preliminarily upscaled features from the upper layers of the main encoder.
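The effect of the eroded ground truth can be approximated with a short NumPy sketch: a pixel is kept for evaluation only if every pixel within a disc of radius 3 around it carries the same label. The benchmark supplies its own eroded maps, so this is only an illustration of the idea under that assumption, with a hypothetical function name.

```python
import numpy as np

def eroded_mask(labels, radius=3):
    """True where all pixels within `radius` (Euclidean) share the centre label."""
    h, w = labels.shape
    padded = np.pad(labels, radius, mode="edge")   # image borders do not count as edges
    keep = np.ones((h, w), dtype=bool)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy * dy + dx * dx > radius * radius:
                continue                            # outside the circular disc
            shifted = padded[radius + dy:radius + dy + h,
                             radius + dx:radius + dx + w]
            keep &= shifted == labels
    return keep
```

Metrics on E-GT are then computed only over pixels where the mask is true, so small disagreements along object borders do not influence the scores.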

Validation of Sub-Pixel Convolution Layers and Multi-Layer Perceptron
The results of the proposed FSN, FSN without sub-pixel convolution layers (FSN-noSC), and FSN without multi-layer perceptron (FSN-noMLP) are reported in this subsection. Table 5 shows the numerical results of these three models evaluated on the test set of the Potsdam dataset. FSN achieves the best performance under all the accuracy metrics. Compared with FSN-noSC, the improvement of the proposed FSN in terms of overall accuracy reaches up to 0.6%. An improvement of the same scale is found in the comparison between the proposed FSN and FSN-noMLP. This outcome indicates that the sub-pixel convolution layer and the multi-layer perceptron contribute strongly to the segmentation performance of the FSN design. Visual comparisons (Figure 8) show that the segmentation results of FSN are more accurate and coherent than those of FSN-noSC and FSN-noMLP. For instance, the building with a roof lawn in Figure 8a-e is easily confused with low vegetation. When the multi-layer perceptron or the sub-pixel convolution layer is removed, the building is mislabeled as low vegetation to different degrees; the proposed FSN, on the other hand, labels it correctly. In Figure 8f-j, we can also observe the mislabeled car in the middle of the segmentation results of FSN-noSC and FSN-noMLP. In contrast, the segmentation results of the proposed FSN are more precise. From the above analysis, we can state that the sub-pixel convolution layer has a strong ability to obtain richer representations and that MLP can learn to combine features in an appropriate manner.
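For readers unfamiliar with sub-pixel convolution, its upscaling step is a fixed rearrangement of channels into space applied after an ordinary convolution. The sketch below shows only that rearrangement for a channels-last array; the channel ordering is one possible convention and is an assumption, as is the function name.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (H, W, C * r * r) into (H * r, W * r, C).

    Channel (i * r + j) * C + k of input pixel (h, w) becomes channel k of
    output pixel (h * r + i, w * r + j). Other orderings are equally valid.
    """
    h, w, c = x.shape
    c_out = c // (r * r)
    x = x.reshape(h, w, r, r, c_out)
    x = x.transpose(0, 2, 1, 3, 4)          # (h, r, w, r, c_out)
    return x.reshape(h * r, w * r, c_out)
```

Because the preceding convolution produces the `r * r` sub-pixel channels, the upscaling filters are learned rather than fixed, which is what distinguishes this layer from interpolation.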

Potsdam Dataset Results
In this subsection, we evaluate FCN-8s, SegNet, FSN-noL, HSN, FSN, and FSN with post-processing (FSN + CRFs) on the Potsdam test set. The comparisons are reported in Table 6. As expected, the proposed FSN outperforms the other methods in almost every evaluation metric. For CIR images only, FSN-noL performs better than FCN-8s and SegNet. The class clutter of the Potsdam dataset is hard to label correctly in most previous works because of its high intra-class variance, caused by its diversified components. The proposed FSN increases the F1 score of this class by over 5% compared with the other networks. It is worth noting that HSN integrated with a Markov random field can smooth the raw segmentation results and further improve the accuracy. In this case, we employ fully connected CRFs as a post-processing method to further improve the accuracy of our work. With the help of the color and position information of the original images, we achieve an increase of about 0.5% in overall accuracy. Figure 9a,b shows the errors of commission and omission of each method per class. We can observe that FSN-noL achieves lower errors than FCN-8s and SegNet: specifically, it makes fewer mistakes in the class impervious surface without LiDAR data. FSN and FSN + CRFs show the lowest errors of commission and omission. By comparison, HSN suffers from higher errors in the classes tree and car because of its deficiencies in multi-sensor fusion.
Visually, we can observe in Figure 10a-h that impervious surface with scattered lawn is hard to recognize. FCN-8s, SegNet, and FSN-noL fail to label this kind of impervious surface correctly. Among them, FSN-noL achieves a better performance on it. Thanks to the LiDAR data, HSN, FSN, and FSN + CRFs can label the building and impervious surface more accurately; in particular, the results of FSN and FSN + CRFs indicate that FSN can properly fuse multi-sensor features. The sparse low vegetation shares a similar color with impervious surface and deciduous trees in this dataset: in fact, as shown in Figure 10i-p, part of the impervious surface is labeled as low vegetation in the segmentation results of FCN-8s, SegNet, and HSN. Additionally, the class clutter consists of different kinds of objects (e.g., water bodies, playgrounds, and containers), so the networks can hardly label this class. In Figure 10q-x, FSN and FSN-noL correctly segment the class clutter and show a better ability to characterize details. Thanks to the inception layers, which provide FSN with a multi-scale receptive field, it achieves a relatively complete building segmentation result. In addition, the MLP brings more appropriate features to mitigate the recognition/localization trade-off, so that the segmentation results of small objects such as cars and small-scale clutter have accurate outlines. Furthermore, the post-processing (i.e., fully connected CRFs) smooths the raw segmentation results and amends some tiny mislabeled blocks. Figure A4 in Appendix B shows the full tile predictions of the Potsdam dataset. To highlight the differences between methods, each tile is followed by its red/green image, which marks mislabeled pixels in red and correctly labeled pixels in green.
Figure 10. Semantic labeling results for three patches of the Potsdam test set. Classes: impervious surface (white); buildings (blue); low vegetation (cyan); tree (green); car (yellow); clutter (red). In the first row, (a) is true orthophoto, (b) is ground truth, (c-h) are inference results of FCN-8s, SegNet, FSN-noL, HSN, FSN and FSN+CRFs for image patch with buildings; in the second row, (i) is true orthophoto, (j) is ground truth, (k-p) are inference results of FCN-8s, SegNet, FSN-noL, HSN, FSN and FSN+CRFs for image patch with a street between buildings; in the third row, (q) is true orthophoto, (r) is ground truth, (s-x) are inference results of FCN-8s, SegNet, FSN-noL, HSN, FSN and FSN+CRFs for image patch with clutters.


Vaihingen Dataset Results
The lower spatial resolution and the shortage of labeled data in the Vaihingen dataset make the segmentation performance worse than on the Potsdam images. The results are listed in Table 7. We can see that FSN exhibits the best performance in both average F1 score and overall accuracy compared with the other methods. Moreover, FSN-noL achieves higher overall accuracy than FCN-8s and SegNet. As for the per-class F1 score, FSN + CRFs achieves the highest score in most classes. Figure 11a,b shows the errors of commission and omission of each method per class on the Vaihingen test set. Thanks to the wider receptive field and the appropriate feature fusion approach of FSN, the errors of FSN-noL are lower than those of FCN-8s and SegNet for the majority of the classes. Further, FSN and FSN + CRFs outperform HSN and the other networks. Figure 12 illustrates some close-up predictions, and Figure A5 in Appendix B shows the full tile prediction and its red/green images for the Vaihingen dataset. In Figure 12a-h, the shadows of trees pose difficulties for the segmentation task, whilst trees and low vegetation are similar in color, so most networks fail to distinguish them correctly. Cement roads and low-rise buildings with cement roofs are also prone to being confused by the networks, as shown in Figure 12i-p. The main reason for these ambiguities is the insufficient contextual and spatial information of the networks. As expected, these classes with small inter-class differences are well labeled by FSN. Moreover, FSN-noL outperforms FCN-8s and SegNet in terms of accuracy on those classes, since the inception modules and sub-pixel convolution layers provide a more diverse and wider view for the proposed network. The lower spatial resolution makes small-scale objects such as the cars in this dataset harder to segment correctly than in the Potsdam dataset. As illustrated in Figure 12q-x, FCN-8s, SegNet, and HSN mislabel part of the cars as impervious surface, while FSN-noL and the proposed FSN can accurately label the cars.

Figure 12. In the first row, (a) is true orthophoto, (b) is ground truth, (c-h) are inference results of FCN-8s, SegNet, FSN-noL, HSN, FSN and FSN+CRFs for image patch with trees and low vegetation areas; in the second row, (i) is true orthophoto, (j) is ground truth, (k-p) are inference results of FCN-8s, SegNet, FSN-noL, HSN, FSN and FSN+CRFs for image patch with a low-rise building; in the third row, (q) is true orthophoto, (r) is ground truth, (s-x) are inference results of FCN-8s, SegNet, FSN-noL, HSN, FSN and FSN+CRFs for image patch with cars parked around buildings.

Submission to the ISPRS Challenge
We submitted the test results on the hidden test sets of the Potsdam dataset (ID: "CASDE") and the Vaihingen dataset (ID: "CASRS") to the ISPRS Challenge; they can be accessed online [41,42]. We scored 90.0% and 89.5% in overall accuracy on the Potsdam and Vaihingen test sets, respectively. Our scores rank in the upper middle of the leaderboard. We believe that these results provide a significant point for discussion and enhancement regarding the application of deep learning techniques in the remote sensing community. In fact, compared with the other architectures in the competition, our framework is characterized by a smaller computational cost. Hence, the trade-off between accuracy and computational complexity of the proposed approach is more favorable than that of several models in the ISPRS challenge. This makes the proposed FSN a valid option for semantic labeling of remote sensing data by means of deep learning methods, especially in terms of system efficiency. Specifically, some architectures have mainly focused on improving accuracy by using multi-model feature fusion, which reasonably leads to a considerable increase in computational complexity. For instance, in the Potsdam 2D Labelling challenge, a recent submission (ID: "BKHN2") [41] employed all channels and an ensemble of five FCN-8s (VGG) models to improve the accuracy. Compared with the proposed FSN, it achieves a 0.6% increase in overall accuracy. On the other hand, the amount of trainable weights is strongly increased, since a single FCN-8s already has many more trainable weights than FSN. Moreover, when we compare our approach with similar-scale networks such as "RITL7" [24], it is worth noting that the latter achieves an overall accuracy of 88.4% by fusing multi-sensor features at the decision level by means of higher order CRFs. Hence, the FSN we introduce (ID: "CASDE2") outperforms "RITL7" both in terms of overall accuracy and per-class F1 score. Additionally, we also submitted the results of FSN-noL (ID: "CASDE1"), which scored 89.7% in overall accuracy. Compared with FSN, the F1 scores of the classes building and tree dropped by 0.8% and 0.3%, respectively. This means that the fusion with LiDAR data delivers a valuable enhancement to the recognition of buildings and trees.
In the Vaihingen 2D Labelling challenge, the SegNet with multi-kernel convolutional layer and dual-stream fusion strategy (ID: "ONE_7") [23] achieved a 0.3% increase in overall accuracy compared with the proposed FSN. However, it also suffers from a more than twofold increase in trainable weights. Moreover, the proposed FSN (ID: "CASRS1") outperforms many other networks on this dataset, such as ID: "UZ_1" (FCN + nDSM) [25] with an 87.3% overall accuracy and ID: "DST_2" (FCN-noDS + RF + CRFs) [1] with an 89.1% overall accuracy. The results of FSN-noL on this dataset (ID: "CASRS2") have been submitted to the ISPRS challenge evaluation as well. This test scored 88.7% overall accuracy. Analogously to the Potsdam dataset, the F1 scores of impervious surface, building, and tree are lower than those of FSN, which further shows the benefits of fusion with LiDAR data.

Trainable Weights and Receptive Fields
Table 8 reports the trainable weight counts of each model considered in this paper. Since FCN-8s, SegNet, and FSN share the same encoder structure and at least 13 layers are employed in each encoder stage, these three models have more trainable weights than HSN, which has only nine layers in the corresponding stage. However, as the experimental results show, reducing the number of encoder layers and the downsampling scale may cause a loss of segmentation accuracy, especially when the spatial resolution increases. For instance, HSN has fewer encoder layers, and its evaluation results on the Potsdam dataset (which has a higher spatial resolution, 5 cm, than the Vaihingen dataset) are worse than those of the other three models. On the other hand, a large amount of trainable weights makes the training phase more difficult and causes over-fitting, which becomes more apparent when the number of training samples is limited. For example, the evaluation results of FCN-8s on the Vaihingen dataset are worse than its performance on the Potsdam dataset. Finally, the proposed FSN balances the trade-off between performance and computational complexity and achieves a higher accuracy with relatively fewer parameters. The largest receptive field of each model is also included in Table 8. We can observe from the table that FSN has a wider receptive field than the other models. It is worth noting that both HSN and FSN have multi-scale receptive areas: this is due to the use of inception layers, which allows them to achieve higher overall accuracy by enriching the contextual information.
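Both quantities in Table 8 can be derived from a layer list. The sketch below uses the standard recurrences for the receptive field of stacked layers and for the weight count of a convolution; the layer list is a generic VGG-16-style stack of 3 × 3 convolutions and 2 × 2 poolings chosen for illustration, not the exact FSN configuration, and it does not reproduce the values in Table 8.

```python
def receptive_field(layers):
    """Receptive field of a stack of (kernel, stride) layers, input to output."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) input steps
        jump *= stride              # distance between neighbouring outputs, in input pixels
    return rf

def conv_weights(kernel, c_in, c_out, bias=True):
    """Trainable weights of one square convolution layer."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)

CONV, POOL = (3, 1), (2, 2)
# Assumed example: 13 convolutions and 5 poolings arranged as in VGG-16.
vgg16_like = [CONV] * 2 + [POOL] + [CONV] * 2 + [POOL] + ([CONV] * 3 + [POOL]) * 3

print(receptive_field(vgg16_like))   # 212
print(conv_weights(3, 3, 64))        # 1792
```

The recurrence makes the trade-off in the text visible: every pooling layer doubles `jump`, so deeper encoders gain receptive field quickly, while each added convolution contributes `k * k * c_in * c_out` weights.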

Conclusions
A novel fine segmentation network (FSN) for semantic segmentation of multi-sensor remote sensing data is proposed in this paper. The architecture follows an encoder-decoder structure with a feature-level fusion approach. The encoder includes a main encoder for CIR data and a lightweight branch designed for LiDAR data. In the decoder stage, inception modules are introduced to enrich the receptive field and contextual information of the network. Sub-pixel convolution layers are employed to allow the network to adaptively upscale the feature maps. Furthermore, the multi-sensor and multi-resolution feature maps are combined by a multi-layer perceptron in an efficient and flexible manner. Transfer learning is used to tackle the shortage of training samples. Overlapping inference is used to mitigate border effects, and fully connected CRFs serve as a post-processing method to further improve the accuracy. Experimental results on the ISPRS 2D Potsdam and Vaihingen datasets show that the proposed FSN outperforms the other related networks and provides better segmentation results with moderate computational complexity. Future work will consider applying K-fold cross-validation for hyperparameter search.
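For readers unfamiliar with sub-pixel convolution, the rearrangement step that follows the convolution can be sketched as below. This is a minimal pure-Python illustration of the generic periodic shuffling operation, assuming a channel-first layout and an integer upscaling factor `r`; the function name `pixel_shuffle` and the toy input are hypothetical and do not reproduce the layer configuration of FSN.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Output pixel (h*r + i, w*r + j) of channel c is taken from input
    channel c*r*r + i*r + j at position (h, w).
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    height, width = len(x[0]), len(x[0][0])

    out = [[[0] * (width * r) for _ in range(height * r)]
           for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out


# Toy input: four 2x2 feature maps become one 4x4 map (r = 2).
feature_maps = [[[ch * 10 + 2 * h + w for w in range(2)] for h in range(2)]
                for ch in range(4)]
upscaled = pixel_shuffle(feature_maps, 2)
for row in upscaled[0]:
    print(row)
```

Since the preceding convolution learns the values of all r x r sub-pixel positions, the upscaling is learned from data rather than fixed, which is the sense in which the layer is adaptive.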

Figure 1. Illustration of sparse connectivity and parameter sharing for CNN: (a) sparse connectivity; and (b) parameter sharing.

Figure 3. Network architecture of FSN, where the structure depicted in the solid-line box is the encoder and the structure depicted in the dashed-line box is the decoder.

Figure 6. General procedure of the image segmentation.

Figure 8. Semantic labeling results for two patches of the Potsdam validation set. Classes: impervious surface (white); buildings (blue); low vegetation (cyan); tree (green); car (yellow); clutter (red). In the first row, (a) is the true orthophoto, (b) is the ground truth, and (c-e) are the inference results of FSN-noMLP, FSN-noSC and the proposed FSN for an image patch containing a building with a roof lawn; in the second row, (f) is the true orthophoto, (g) is the ground truth, and (h-j) are the inference results of FSN-noMLP, FSN-noSC and the proposed FSN for an image patch with a street between buildings.

Figure 9. Errors of commission and omission of each model per class (Potsdam dataset): (a) error of commission; and (b) error of omission. Lower values indicate better segmentation performance.

Figure 10. Semantic labeling results for three patches of the Potsdam test set. Classes: impervious surface (white); buildings (blue); low vegetation (cyan); tree (green); car (yellow); clutter (red). In the first row, (a) is the true orthophoto, (b) is the ground truth, and (c-h) are the inference results of FCN-8s, SegNet, FSN-noL, HSN, FSN and FSN+CRFs for an image patch with buildings; in the second row, (i) is the true orthophoto, (j) is the ground truth, and (k-p) are the inference results of FCN-8s, SegNet, FSN-noL, HSN, FSN and FSN+CRFs for an image patch with a street between buildings; in the third row, (q) is the true orthophoto, (r) is the ground truth, and (s-x) are the inference results of FCN-8s, SegNet, FSN-noL, HSN, FSN and FSN+CRFs for an image patch with clutter.

Figure 11. Errors of commission and omission of each model per class (Vaihingen dataset): (a) error of commission; and (b) error of omission. Lower values indicate better segmentation performance.

Figure 12. Semantic labeling results for three patches of the Vaihingen test set. Classes: impervious surface (white); buildings (blue); low vegetation (cyan); tree (green); car (yellow); clutter (red). In the first row, (a) is the true orthophoto, (b) is the ground truth, and (c-h) are the inference results of FCN-8s, SegNet, FSN-noL, HSN, FSN and FSN+CRFs for an image patch with trees and low vegetation areas; in the second row, (i) is the true orthophoto, (j) is the ground truth, and (k-p) are the inference results of FCN-8s, SegNet, FSN-noL, HSN, FSN and FSN+CRFs for an image patch with a low-rise building; in the third row, (q) is the true orthophoto, (r) is the ground truth, and (s-x) are the inference results of FCN-8s, SegNet, FSN-noL, HSN, FSN and FSN+CRFs for an image patch with cars parked around buildings.

suggestions and comments for the network design and experiments. B.Z. completed the theoretical framework. F.Y. provided support regarding the application of the post-processing method. P.G. provided important suggestions for improving the quality of the paper.

Figure A2. Architecture of SegNet, where the structure depicted in the solid-line box is the encoder and the structure depicted in the dashed-line box is the decoder.

Table 1. Configurations of the lightweight branch.

Table 2. Configurations of the layers in the decoder.

Table 3. Configurations of the convolutional layers in the middleweight and heavyweight branches. ConvM is a convolutional layer of the middleweight branch; ConvH is a convolutional layer of the heavyweight branch.

Table 4. Experimental results on the multi-scale extra branch (Potsdam validation set). HW is FSN with the heavyweight branch; MW is FSN with the middleweight branch; LW is FSN with the lightweight branch (the proposed version); GT is Ground Truth; E-GT is Eroded Ground Truth.

Table 5. Experimental results on the effect of sub-pixel convolution and multi-layer perceptron (Potsdam validation set). FSN-noSC is FSN without sub-pixel convolution; FSN-noMLP is FSN without the multi-layer perceptron.

Table 6. Experimental results on the Potsdam test set.

Table 7. Experimental results on the Vaihingen test set.

FSN, errors of FSN-noL were lower than those delivered by FCN-8s and SegNet for the majority of the classes. Further, FSN and FSN + CRFs outperform HSN and other networks.

Table 8. Trainable weights and receptive fields of FCN-8s, SegNet, HSN and the proposed FSN.