Deep Feature Fusion with Integration of Residual Connection and Attention Model for Classification of VHR Remote Sensing Images

The classification of very-high-resolution (VHR) remote sensing images is essential in many applications. However, high intraclass and low interclass variations in these kinds of images pose serious challenges. Fully convolutional network (FCN) models, which benefit from a powerful feature learning ability, have shown impressive performance and great potential. Nevertheless, only classification results with coarse resolution can be obtained from the original FCN method. Deep feature fusion is often employed to improve the resolution of outputs. Existing strategies for such fusion are not capable of properly utilizing the low-level features and considering the importance of features at different scales. This paper proposes a novel, end-to-end, fully convolutional network to integrate a multiconnection ResNet model and a class-specific attention model into a unified framework to overcome these problems. The former fuses multilevel deep features without introducing any redundant information from low-level features. The latter can learn the contributions from different features of each geo-object at each scale. Extensive experiments on two open datasets indicate that the proposed method can achieve class-specific scale-adaptive classification results and it outperforms other state-of-the-art methods. The results were submitted to the International Society for Photogrammetry and Remote Sensing (ISPRS) online contest for comparison with more than 50 other methods. The results indicate that the proposed method (ID: SWJ_2) ranks #1 in terms of overall accuracy, even though no additional digital surface model (DSM) data that were offered by ISPRS were used and no postprocessing was applied.


Introduction
Recent developments in remote sensing technology have greatly contributed to the increasing availability of remotely sensed data with very high resolution (VHR). These images are adequate for capturing detailed information regarding the observed surface, which makes them more suitable for various applications that require high geometrical precision and thematic details, such as precision agriculture [1], disaster assessment [2], and urban environment analysis [3].
Image classification is one of the most commonly used techniques in order to effectively derive thematic geographic information from VHR remote sensing images. Image classification, which is also known as "semantic segmentation" [4] in the field of computer vision, aims to assign each pixel with a specific class label, and it has been a core topic in the remote sensing community for many years. The spectral information of each pixel is sufficient to separate it from other pixels for low-

1. A multiconnection ResNet is proposed to fuse multilevel deep features corresponding to different layers of the FCN. The multiconnection residual shortcuts make it possible for low-level features to learn to cooperate with high-level features without introducing redundant information from the low-level features.
2. A class-specific attention model is proposed to combine multiscale features. It can learn the contributions of various features for each geo-object at each scale. Thus, a class-specific scale-adaptive classification map can be achieved.
3. A novel, end-to-end FCN is developed to integrate the multiconnection ResNet and class-specific attention model into a unified framework.
The remainder of this paper is organized as follows. Section 2 reviews related works. In Section 3, the framework and main components of the proposed method are presented. Section 4 gives the experimental results and comparisons. A discussion is provided in Section 5, and conclusions are drawn in Section 6.

Related Works
The remote sensing community has invested a great deal of effort into developing advanced effective methods for the classification of VHR images in recent years. A comprehensive review of existing approaches is beyond the scope of this paper. Here, we focus on CNN-related methods of image classification.

From Convolutional Neural Networks to Fully Convolutional Networks
CNN-based methods are state-of-the-art tools for image analysis tasks and achieve competitive performance in many benchmarks and contests [14]. A standard CNN consists of multiple convolutional layers (often combined with pooling layers) and it is ended by one or more fully connected layers. As shown in Figure 1a, a CNN that is designed for a classification task receives an image as input and produces a probability distribution over different classes as output. The class label with the maximum probability is allocated to the input image as its predicted label. In this regard, a standard CNN is more suitable for solving the problem of image-level labelling.  However, a dense class map is often expected for remote sensing image classification. In other words, a pixel-level labelling result is required. For this reason, great attention is devoted to the development of sophisticated strategies that are based on the CNN for image classification. Patch-based CNN models are widely used approaches, in which the class label for each pixel is predicted by performing patch-level labeling for the neighboring block around the target pixel while using the CNN [27][28][29][30][31]. The patches can be derived by sliding-window [29,30] and segmentation [18,31]. The main reason to use patches for classification is that the standard CNN has at least one fully connected layer and it therefore requires input images with a fixed size. However, this technique ignores the relationship between patches and it yields huge redundant computations [32]. As an alternative, the FCN is proposed to popularize CNN architectures by replacing the fully connected layers with convolutional layers. In this way, dense classification with lower resolution can be achieved, as shown in Figure 1b. The FCN allows for classification maps to be generated for input images of any size and it is also much faster than the patch-based CNN models [33]. 
Currently, almost all of the state-of-the-art CNN-related methods for pixel-level labelling are developed based on FCN.

Fusion of Multilevel Deep Features
Multiple stages of pooling and convolutional layers in the FCN progressively down-sample the initial image, which generates low-resolution feature maps that are accompanied by the loss of spatial details and finer structures. This presents challenges for exact alignment of class maps for pixel-level labelling tasks. To address this issue, it is necessary to incorporate low-level features, which are beneficial for characterizing boundaries and details, together with high-level features to refine the classification results. One group of approaches for the fusion of multilevel features is referred to as the encoder-decoder paradigm, which includes FCN-8 [20], U-Net [34], DeconvNet [24], and SegNet [21]. In the encoder stage, these methods use multiple stacks of convolutional and pooling layers to obtain hierarchical features at different levels. During the decoder stage, these learned features are then used to up-sample the feature maps with low resolution to higher-resolution classification maps. However, this kind of method directly concatenates or adds the low- and high-level features together as a joint representation, which may introduce redundant information from the low-level features, leading to an adverse impact on the accuracy of the classification results. Two types of methods have been exploited in order to conduct multilevel feature fusion in a better way. The first one designs several gates to select the useful information from the low-level features when fusing features from different levels [26,35]. However, designing proper gates is a nontrivial task, especially without the necessary prior knowledge. The second method attempts to directly add or concatenate multilevel features, followed by a latent fitting strategy, which is expected to eliminate adverse effects brought by the redundant information [25]. Although this method has achieved impressive classification, the postprocessing scheme still has difficulty in removing the redundancy between the features at different levels.
There is another type of strategy to embed low-level features that draws support from extra methods or data. For example, conditional random field (CRF) models are used as a postprocessing step to improve the classification by imposing a smoothness prior in low-level vision, so that neighboring pixels are more likely to be allocated the same class label [36]. An object-based classification method, which is adequate for describing boundaries of geo-objects by image segmentation, is combined with abstracted deep features to increase the classification accuracy [31,37]. In addition, a variety of methods are used in an attempt to add additional data to the FCN, such as digital surface models (DSM), vegetation indices (VI), and edge information [38][39][40][41]. Although these strategies can work well to some extent, they rely on extra handcrafted methods, and the additional data are not always available. In this paper, we aim to develop a novel end-to-end FCN to achieve state-of-the-art performance without introducing extra methods or additional data.

Fusion of Multiscale Deep Features
It is essential to exploit multiscale features in the FCN setting since various objects are always presented in images to be classified in the form of multiple scales. To this end, two major types of FCN architecture have been proposed. The first is to generate multiscale features in the last few layers by cascading. ParseNet uses global average pooling to produce more abstract features and then combines them with the original feature map [42]. The pyramid scene parsing network (PSPNet) then extends ParseNet by applying multiple global average pooling layers to generate hierarchical scale features [43]. Additionally, dilated convolution is applied to generate multiscale features [44]. However, for these methods, the output map needs to be resampled to the resolution of the input image, which may blur the fine objects or parts. The second type first creates a multiscale representation of the original images and then feeds them through the network. For example, a Laplacian pyramid can be employed to pass images from each scale through a shared network and fuse the features across all scales [45]. Some studies directly represent the original input as resized versions of several scales and fuse multiple features across all scales following a coarse-to-fine sequence [46,47]. When compared to the first method, the second method can directly acquire multiscale features by means of a multiscale representation of the original images, but it usually treats features at each scale for fusion equally. In other words, the different contributions of multiscale features for classifying various objects are not considered.
To overcome these shortcomings, an attention mechanism that aims to generate a specific weight for each scale is applied to fuse the multiscale features in the FCN [48]. However, the main problem of this technique is that the weight is shared for all objects in each scale. Yet, it may be more reasonable that various objects are assigned different weights in each scale.

Deep Feature Fusion for Classification of VHR Remote Sensing Images
We present our approach for classification of VHR remote sensing images by fusing deep features with integration of residual connections and an attention model to solve the above two problems regarding deep feature fusion, i.e., avoid introducing redundant information and consider the contributions of features when fusing multiscale features. Figure 2 illustrates a schematic diagram of the proposed method. As can be seen, an input image is resampled to different sizes to produce a multiscale representation as the input of a shared deep network. The proposed network is based on two modules, a multiconnection ResNet and a class-specific attention model. In the following, we describe three main aspects of the proposed method: (1) the multiconnection ResNet for fusion of multilevel features, (2) the class-specific attention model for fusion of multiscale features, and (3) the model learning and inference.


Multiconnection ResNet for Fusion of Multilevel Features
As shown in Figure 3, the detailed architecture of the proposed multiconnection ResNet is an encoder-decoder network. The encoder part is based on the residual network (ResNet), with 101 layers [49]. The ResNet is mainly constructed of four blocks, each of which contains several stacked residual units (ResUs). The fully connected layer and the average pooling layer at the end of the original ResNet are removed in order to get the dense feature map. The resolution of output feature maps is resampled to one-eighth of the original image in the encoder stage by using dilated convolution following a similar strategy as in [36]. During the decoder stage, the designed transposed residual units (Trans-ResUs) are applied to up-sample the low-resolution feature maps to higher-resolution classification maps. Different from ResU, the middle convolution in Trans-ResU is changed to the transposed convolution. ResU fuses the multiple features that are generated by previous layers in the encoder stage during the up-sampling process.

However, there is one potential problem that is associated with a traditional encoder-decoder deep network: it is difficult to effectively fuse multilevel features without incorporating feature redundancies, which can adversely influence the classification accuracy. The proposed multiconnection ResNet is intended to make low-level features learn how to cooperate with high-level features to address this problem, which is achieved by stacked convolutional layers. Nevertheless, as pointed out in [49], it may be difficult for the stacked convolution layers to directly learn how to cooperate with high-level features in deep networks. For this reason, we add one type of residual connection to these layers between the encoder and decoder. These layers and added residual connections constitute the residual units between the encoder and decoder, as shown in Figure 3. To be specific, given a low-level feature map $f$ that is generated by the encoder, let $F(\cdot)$ denote a mapping of the stacked convolutional layers and $H(\cdot)$ be the desired underlying mapping to make low-level features learn to cooperate. Instead of directly learning the mapping $H(\cdot)$, the stacked convolutional layers are expected to fit another mapping, $F(\cdot)$, which is called the residual mapping [49]:

$$F(f) = H(f) - f$$

Thus, the desired $H(\cdot)$ can be recast into $F(\cdot) + f$.
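The residual formulation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `residual_unit` and `stacked_layers` are hypothetical, and a plain list of numbers stands in for a feature map. It only shows that the unit outputs $F(f) + f$, so that the stacked layers need to learn only the residual.

```python
def residual_unit(f, stacked_layers):
    """Return H(f) = F(f) + f for a feature vector f.

    `stacked_layers` is a hypothetical stand-in for the stacked
    convolutional layers F(.); it must preserve the length of f.
    """
    residual = stacked_layers(f)
    if len(residual) != len(f):
        raise ValueError("F(f) must have the same shape as f")
    # The shortcut adds the input back, so F only models H(f) - f.
    return [r + x for r, x in zip(residual, f)]


def zero_layers(f):
    """A residual branch that has learned nothing: F(f) = 0."""
    return [0.0 for _ in f]


low_level = [0.5, -1.0, 2.0]
# With F(f) = 0 the unit reduces to the identity mapping.
fused = residual_unit(low_level, zero_layers)
```

When the residual branch outputs zeros, the low-level features pass through unchanged, which is the property that makes the identity mapping easy to represent.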
On the other hand, the depth of the network is increased by adding stacked convolutional layers in the encoder and decoder parts, which will result in a potential degradation problem, i.e., as the depth of the network increases, the accuracy becomes saturated and then rapidly degrades [49]. To reduce this phenomenon, two more types of residual connection are added to the proposed multiconnection ResNet, as shown in Figure 3: residual connections between residual blocks in the encoder part, and residual connections in ResU and Trans-ResU in the decoder part. As described in [25], the third type of residual connection can also correct the latent fitting residual between multilevel features if the learning of ResUs between the encoder and decoder parts is insufficient.

Class-Specific Attention Model for Fusion of Multiscale Features
A class-specific attention model is proposed to fuse multiscale features to improve the multiconnection ResNet further, as shown in Figure 4. Different from direct fusion methods, e.g., averaging or concatenating, which ignore the importance of multiple features at different scales, this model aims to learn the different contributions of various features for each geo-object in each scale.

To be specific, each original input image is resized into a set of scales $s \in \{1, \ldots, S\}$ using bilinear resampling. Each resized image is passed through a shared multiconnection ResNet. First, a ResU and a separate 1 × 1 convolutional layer for each scale are applied to generate the weight score maps in the class-specific attention model. Mathematically, let $g_c^s(x_i)$ denote the weight score for class $c$ at scale $s$ corresponding to the $i$th pixel $x_i$ in the weight score map, where $c \in \{1, \ldots, C\}$ stands for the geo-object class. Here, $S$ and $C$ represent the number of scales and geo-object classes, respectively.
Subsequently, a softmax function in the scale space is adopted to obtain the class-specific weight $\alpha_c^s(x_i)$ for each scale:

$$\alpha_c^s(x_i) = \frac{\exp(g_c^s(x_i))}{\sum_{t=1}^{S} \exp(g_c^t(x_i))}$$

Finally, the classification score of the $i$th pixel, $f_c(x_i)$, can be computed through a weighted sum of score maps across all scales:

$$f_c(x_i) = \sum_{s=1}^{S} \alpha_c^s(x_i) \, f_c^s(x_i)$$

where $f_c^s(x_i)$ is the classification score map of scale $s$, which is generated by a specific 1 × 1 convolutional layer that is applied to the output of the multiconnection ResNet. Classification results can be inferred based on $f_c(x_i)$.
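The fusion step for a single pixel can be sketched in plain Python as follows. This is an illustrative sketch rather than the released TensorFlow code: the function names are hypothetical, and nested lists indexed as `[scale][class]` stand in for the weight score maps $g_c^s(x_i)$ and the classification score maps $f_c^s(x_i)$ at one pixel.

```python
import math


def scale_softmax(weight_scores):
    """Softmax across scales, computed separately for each class.

    weight_scores[s][c] holds g_c^s(x_i); the result holds alpha_c^s(x_i).
    """
    num_scales = len(weight_scores)
    num_classes = len(weight_scores[0])
    alphas = [[0.0] * num_classes for _ in range(num_scales)]
    for c in range(num_classes):
        column = [weight_scores[s][c] for s in range(num_scales)]
        peak = max(column)  # subtract the maximum for numerical stability
        exps = [math.exp(g - peak) for g in column]
        total = sum(exps)
        for s in range(num_scales):
            alphas[s][c] = exps[s] / total
    return alphas


def fuse_scores(weight_scores, class_scores):
    """Weighted sum over scales: f_c(x_i) = sum_s alpha_c^s(x_i) * f_c^s(x_i)."""
    alphas = scale_softmax(weight_scores)
    num_scales = len(class_scores)
    num_classes = len(class_scores[0])
    return [
        sum(alphas[s][c] * class_scores[s][c] for s in range(num_scales))
        for c in range(num_classes)
    ]
```

Because the softmax runs over the scale axis for each class separately, two classes at the same pixel can favor different scales, which is what distinguishes this model from an attention weight shared by all classes.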

Model Learning and Inference
For pixel-level image classification, each pixel $x_i$ can be allocated a class label with the maximum posterior probability:

$$c_i^* = \arg\max_{c \in \{1, \ldots, C\}} P_c(x_i; \theta)$$

where $c_i^*$ denotes the most likely class label for pixel $x_i$ and $P_c(x_i; \theta)$ stands for the posterior probability of allocating class $c$, which can be obtained by the proposed FCN model with parameter $\theta$. To be specific, $P_c(x_i; \theta)$ can be calculated by a softmax function across all classes:

$$P_c(x_i; \theta) = \frac{\exp(f_c(x_i))}{\sum_{k=1}^{C} \exp(f_k(x_i))}$$

Thus, the image classification problem is equivalent to estimating the optimal parameter $\theta$.
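As a small illustration of this decision rule, the following Python sketch converts the fused class scores of one pixel into posterior probabilities and selects the most likely label. The function names are hypothetical, and class indices start at 0 here rather than 1.

```python
import math


def class_posteriors(scores):
    """Softmax across classes: scores[c] = f_c(x_i) -> P_c(x_i)."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scores]
    total = sum(exps)
    return [value / total for value in exps]


def predict_label(scores):
    """Return the class index with the maximum posterior probability."""
    probabilities = class_posteriors(scores)
    return max(range(len(probabilities)), key=probabilities.__getitem__)
```

Since the softmax is monotonic, taking the arg max of the raw scores would give the same label; the probabilities are only needed for the loss during training.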
A typical FCN often contains millions of parameters, which are learned by minimizing the losses defined by the difference between the prediction and the ground truth. In the learning stage of the proposed FCN, multiple intermediate losses that are generated by the network branches corresponding to different scales are used jointly with the final loss to train the network. Specifically, the total loss $Loss_T$ can be expressed as the sum of the final loss and the intermediate losses of different scales, which is given by

$$Loss_T = Loss(P(x; \theta), y) + \sum_{s=1}^{S} Loss(P(x^s; \theta), y^s)$$

where $x$ denotes the original input image and $y$ denotes the associated ground truth map, and $x^s$ and $y^s$ denote the resized image and ground truth map corresponding to scale $s$, respectively. The Loss function is defined as [...], where $N$ is the number of pixels in the input image and $y$ [...].
The stochastic gradient descent (SGD) algorithm is applied to train the proposed network. Thus, the first derivatives of the loss with respect to the parameters in the network need to be calculated to update the parameters during the training stage, which is given by

$$\theta \leftarrow \theta - \eta \frac{\partial Loss_T}{\partial \theta}$$

where $\eta$ is the learning rate. We use the chain rule to obtain the derivative of each parameter. The dashed arrow in Figure 2 denotes the flow of loss. For clarity, the details of $\partial Loss / \partial \theta$ are not presented here; the reader is referred to textbooks, such as [50].
At the inference stage, the network is initialized by the learned parameters from the training stage. The image that is to be classified is resized into multiple scales of images, which are fed into the network to get the classification score map for each scale by the shared multiconnection ResNet. The class-specific attention model first calculates the weight for each scale, and then fuses the feature maps to generate the final prediction map.
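The training objective can be sketched in Python under one explicit assumption: the per-branch Loss is taken to be the mean per-pixel cross-entropy, which is a common choice for softmax outputs but is not confirmed by the text above. The function names are hypothetical, and the update shown is plain SGD without momentum.

```python
import math


def cross_entropy(probabilities, labels):
    """Mean negative log-likelihood over N pixels (ASSUMED form of Loss).

    probabilities[i][c] is P_c(x_i); labels[i] is the true class index.
    """
    n = len(labels)
    return -sum(math.log(probabilities[i][labels[i]]) for i in range(n)) / n


def total_loss(final_pair, scale_pairs):
    """Final loss plus one intermediate loss per scale branch.

    Each pair is (probabilities, labels) for the corresponding output.
    """
    loss = cross_entropy(*final_pair)
    for probabilities, labels in scale_pairs:
        loss += cross_entropy(probabilities, labels)
    return loss


def sgd_step(params, grads, learning_rate):
    """One plain SGD update: theta <- theta - eta * dLoss/dtheta."""
    return [p - learning_rate * g for p, g in zip(params, grads)]
```

Summing the branch losses means every scale receives its own supervision signal, in addition to the gradient that reaches it through the fused output.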

Experimental Data
Two open datasets of aerial image labelling were selected to evaluate the proposed method: the Massachusetts building dataset and the International Society for Photogrammetry and Remote Sensing (ISPRS) Potsdam dataset.
(1) Massachusetts building dataset. This dataset includes a collection of 151 aerial images of the Boston area produced by [51]. There are three channels for each image: red (R), green (G), and blue (B). Each image has a size of 1500 × 1500 pixels with about 1 m spatial resolution. The two classes in the ground truth map are buildings and background. The ground truth maps of all these images are available for reference. Figure 5 shows a sample image and its corresponding ground truth map. In this experiment, the dataset was split into a training set of 137 images and a testing set of 14 images.
(2) ISPRS Potsdam dataset. This dataset is provided in the framework of the ISPRS Potsdam 2D Semantic Labeling Contest, which includes 38 images with 6000 × 6000 pixels and about 0.05 m spatial resolution, as shown in Figure 6. There are four channels for each image: near infrared (NIR), R, G, and B; corresponding DSM and nDSM data are also available. Six land cover classes are distributed in the dataset: impervious surfaces, buildings, low vegetation, trees, cars, and clutter/background. The clutter/background class corresponds to a mixture of some geo-objects that are not of interest in our classification experiment. The ground truth maps of 24 images are offered and the organizers for the online contest keep the other 14 images. The organizers ended the online competition in the summer of 2018 and published all of the data. In the experiments, we split the 24 images with available ground truth maps into a training set of 20 images and a testing set of four images, which are shown in Figure 6. We extended the testing set with the other 14 images that were used for the online contest to make the division more balanced. It should be noted that only three channels (NIR, R, G) were used for training and testing in the experiments.
Remote Sens. 2019, 11, x FOR PEER REVIEW 10 of 22


Methods for Comparison
Six state-of-the-art methods were compared with the proposed methods for both datasets in the experiments. Descriptions of these models (including our models) are as follows:

• Multiconnection ResNet: This is the first component of our proposed model, which introduces multiconnection residual shortcuts to make it possible for the convolutional layers to fuse multilevel deep features corresponding to different layers of the FCN. Multiconnection ResNet is referred to as mcResNet for convenience.
• Integration of multiconnection ResNet and class-specific attention model: This is the proposed end-to-end FCN, which integrates the multiconnection ResNet and class-specific attention model into a unified framework. For convenience, the proposed FCN is referred to as mcResNet-csAM.
• FCN-8s: There are three variants of FCN models: FCN-32s, FCN-16s, and FCN-8s. We chose FCN-8s for comparison, which has been shown to achieve better classification performance than its counterparts [20].
• SegNet: SegNet was originally proposed for the semantic segmentation of roads and indoor scenes [21]. The main novelty of SegNet is that the decoder performs the nonlinear up-sampling according to max pooling indices in the encoder. Thus, SegNet can provide good performance with low time and space complexity.
• Global convolutional network (GCN): GCN is proposed to address both the classification and localization issues for semantic segmentation [52]. It achieves state-of-the-art performance on two public benchmarks: PASCAL VOC 2012 and Cityscapes.
• RefineNet: RefineNet, which is based on ResNet, is proposed to perform semantic segmentation [53]. It achieves state-of-the-art performance on seven public datasets, including PASCAL VOC 2012 and NYUDv2. The RefineNet based on ResNet-101 was compared with ours in the experiments.
• PSPNet: PSPNet, which introduces the pyramid pooling module to fuse hierarchical scale features, is proposed for scene parsing and semantic segmentation [43]. It ranked first in the ImageNet scene parsing challenge in 2016. We used the modified ResNet-101 as the backbone of PSPNet in the experiments following the official implementation. In the training phase, we also used the auxiliary loss with the weight of 0.4.
• DeepLab V3+: DeepLab is proposed to conduct semantic segmentation by employing multiple dilated convolutions in cascade to capture multiscale context, motivated by the fact that atrous/dilated convolutions can easily increase the field of view [54]. When compared with DeepLab V3, DeepLab V3+ includes a simple decoder part to refine the results [55].

Evaluation Criteria
Two quantitative criteria were used in the experiments to evaluate the classification results of the above methods: the F1 score (F1) and overall accuracy (OA). The F1 score can be calculated as

$$F1 = 2 \times \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

where precision and recall can be represented as

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}$$

Here, TP, TN, FP, and FN stand for true positive, or the number of correct predictions that an instance is positive; true negative, or the number of correct predictions that an instance is negative; false positive, or the number of incorrect positive predictions; and false negative, or the number of incorrect negative predictions. OA can be obtained by

$$OA = \frac{TP + TN}{TP + TN + FP + FN}$$

Moreover, visual comparison is also utilized to evaluate the results.
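Both criteria follow directly from the four confusion counts, as the short Python sketch below shows. The function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch rather than something stated in the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


def overall_accuracy(tp, tn, fp, fn):
    """Return the fraction of all pixels that are labelled correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total > 0 else 0.0
```

For a multiclass map such as the Potsdam dataset, the F1 score is computed per class by treating that class as positive and all others as negative.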

Parameter Setting
Because of limited GPU memory, the images in both datasets were not fed directly to the networks. Instead, patches were cropped from the raw images as input to train the networks. In detail, at each training step, patches of 320 × 320 pixels were sampled at random positions in the original images. To augment the training data, these patches were randomly processed by one or a combination of the following operations: mirroring, rotation, and Gaussian blur. In the inference stage, a fixed stride was set to obtain overlapping patches as inputs. The predictions on the overlapping regions were averaged, which can reduce border effects and improve the overall accuracy. In the experiments, the stride was empirically set to 59 pixels for the Massachusetts building dataset and 80 pixels for the ISPRS Potsdam dataset. The models were trained by using stochastic gradient descent (SGD) with momentum. We set the momentum to 0.9 and the weight decay to 0.0005. In addition, inspired by the work in [43], we applied the poly-like learning rate rule, i.e., the learning rate is the base learning rate multiplied by $(1 - t/T)^{power}$. Here, $t$ and $T$ denote the current iteration and the maximum iteration, respectively. The base learning rate was set to 0.01 and the power variable was set to 0.9. All of the models were trained with a batch size of five and a total of 1,000,000 iterations. For the proposed model, three scales of input images were fed to the network: 0.5, 0.75, and 1 times the size of the original images (denoted as Scales = {1, 0.75, 0.5}).
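The learning rate rule and the overlapped inference windows can be sketched in Python as follows. The function names are hypothetical, and the handling of a final window that would not end flush with the image border is an assumption of this sketch, not a detail given in the paper.

```python
def poly_learning_rate(base_lr, iteration, max_iterations, power=0.9):
    """Poly rule: base_lr * (1 - t / T) ** power."""
    return base_lr * (1.0 - iteration / max_iterations) ** power


def window_offsets(image_size, patch_size, stride):
    """Top-left offsets of overlapping patches along one image axis.

    ASSUMPTION: if the stride does not land exactly on the border,
    one extra window is appended so that the last patch ends flush
    with the image edge.
    """
    if image_size <= patch_size:
        return [0]
    last = image_size - patch_size
    offsets = list(range(0, last + 1, stride))
    if offsets[-1] != last:
        offsets.append(last)
    return offsets


# Settings quoted in the text: 320-pixel patches, stride 59 on the
# 1500-pixel Massachusetts images and stride 80 on the 6000-pixel
# Potsdam images.
massachusetts = window_offsets(1500, 320, 59)
potsdam = window_offsets(6000, 320, 80)
```

With these settings, both strides divide the span between the first and last window exactly (1180 = 20 × 59 and 5680 = 71 × 80), so the extra border window is never needed in this sketch.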
For a fair comparison, the same parameter settings were used for the different convolutional network methods as far as possible. However, some hyperparameters that are specific to individual models (e.g., momentum and weight decay) may have been set differently so that the models converged quickly during the training stage. Furthermore, all of the models were trained using the transfer learning technique. To be specific, the model pretrained on PASCAL VOC 2012 was used to initialize all of the models for comparison in our experiments. It should also be noted that, in contrast to the other approaches, our proposed methods and GCN only use the pretrained parameters of PASCAL VOC 2012 to initialize the encoder part of the models and adopt the normal distribution to initialize the other parts. All models in the experiments were implemented in Python 3.6 on a Linux platform. The deep learning algorithms are based on TensorFlow. The full implementation (based on TensorFlow) and the trained network are available at GitHub (https://github.com/WindWang2/Multi-connection-attentionnetworks). The experiments were run on two Intel® Xeon® 8-core CPU @ 2.2 GHz processors with an Nvidia® GTX1080Ti GPU (11 GB).

Results of Massachusetts Building Dataset
For the quantitative aspect, Table 1 reports two evaluation criteria (F1 and OA) of the classification results for the different methods, and Figure 7 shows the corresponding bar chart of F1 scores. The proposed mcResNet-csAM yields the highest F1 and OA among all compared methods, achieving the best overall classification performance. Notably, mcResNet still obtained the second highest F1 despite lacking the class-specific attention model, indicating that the multiconnection residual shortcuts are effective for combining high-level abstract features with low-level features for classification. DeepLab V3+ achieved comparable results, with the third best F1 and the second best OA.
Four image patches containing buildings of different sizes and shapes were selected for visual comparison, as shown in Figure 8. FCN-8s performed the worst in identifying buildings of various sizes. Specifically, FCN-8s was not able to detect individual small buildings, but could only identify building areas with blurred boundaries. The results of GCN, DeepLab V3+, RefineNet, PSPNet, and mcResNet are significantly better. However, they are still inadequate for preserving shape details and fine edges. In addition, the classification results of GCN for large buildings are not homogeneous enough. SegNet performs better in detecting small buildings, but it produces more false negatives, especially for large buildings. In contrast, mcResNet-csAM achieves scale-adaptive classification results by taking the multiscale nature of various geo-objects into account, i.e., large buildings appear more homogeneous, and small buildings are detected with well-preserved structural details and boundaries. This supports the claim that the proposed end-to-end FCN is able to fuse multilevel and multiscale deep features to enhance classification performance.
For the ISPRS Potsdam dataset, Table 2 and Figure 9 report the F1 scores and OA for the different methods. It can be observed that (1) mcResNet-csAM outperforms the other methods overall, although DeepLab V3+ achieves the best F1 score for low vegetation by a slight margin, and (2) mcResNet shows performance comparable to that of the other methods, except mcResNet-csAM. This reveals the advantages of the proposed approach. In particular, the proposed mcResNet-csAM and mcResNet exhibit clearly higher F1 values than the other methods for classifying cars.
Four image patches with different scenes were selected for visual comparison, as illustrated in Figure 10. Similar to the experiment for the Massachusetts building dataset, for FCN-8s, large-scale buildings are well classified, while small-scale cars are detected with blurred boundaries. Classification by GCN is improved with regard to preserving detailed information, but its classification maps are not as compact as those of the other methods, which degrades classification accuracy. The results of SegNet and RefineNet are better than those of GCN and FCN-8s, but both produce some commission errors, especially for the building class. The proposed methods achieve more coherent labeling with precise edges and preserved boundaries, although the results of PSPNet and DeepLab V3+ are very close to those of the proposed methods. In addition, benefiting from the ability to learn the contributions of various features for each geo-object at each scale, mcResNet-csAM compares favorably with mcResNet with respect to the scale-adaptive classification of various geo-objects.
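As a reminder of how the two criteria reported in the tables relate to a confusion matrix, the following is a minimal sketch, assuming the standard pixel-wise definitions of per-class F1 and overall accuracy (OA). The function names and the toy label maps are hypothetical and are not taken from the authors' code; any benchmark-specific evaluation rules are not modeled here.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels: rows are reference classes, columns are predicted classes."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    # Encode each (reference, predicted) pair as a single index and count them.
    index = y_true * num_classes + y_pred
    counts = np.bincount(index, minlength=num_classes * num_classes)
    return counts.reshape(num_classes, num_classes)

def f1_and_oa(cm):
    """Return per-class F1 scores and overall accuracy from a confusion matrix."""
    cm = cm.astype(np.float64)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)  # TP + FP per class
    reference = cm.sum(axis=1)  # TP + FN per class
    denom = predicted + reference
    # F1 = 2TP / (2TP + FP + FN); a class absent from both maps gets 0.
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    oa = tp.sum() / cm.sum()
    return f1, oa

# Toy 2 x 3 label maps (0 = background, 1 = building); illustrative values only.
reference_map = np.array([[0, 0, 1], [1, 1, 0]])
predicted_map = np.array([[0, 1, 1], [1, 0, 0]])
cm = confusion_matrix(reference_map, predicted_map, num_classes=2)
f1, oa = f1_and_oa(cm)
print("per-class F1:", f1, "OA:", oa)
```

In this toy case four of the six pixels are labeled correctly, so OA is 4/6, and each class has two true positives, one false positive, and one false negative.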

Results of ISPRS Potsdam 2D Semantic Labeling Contest
The ISPRS Work Group III/4 held an online contest of two-dimensional (2D) semantic labeling for the Potsdam dataset. Participants send the organizers, for evaluation, their classification results for 14 images whose ground truth maps are not provided. All of the participants' results are published on the website (http://www2.isprs.org/commissions/comm2/wg4/potsdam-2d-semantic-labeling.html), which now includes more than 50 methods. We submitted the results of our proposed mcResNet-csAM (ID: SWJ_2) for comparison with the competitors' methods. All of the settings were the same as those of the above experiments, with one difference: we retrained mcResNet-csAM by using all 24 images. The 10 methods (including our method) with the highest overall accuracy were selected for comparison (Table 3). The statistics in Table 3 show that the proposed method ranks #1 in terms of overall accuracy, with the highest value of 91.7%, although only three-channel images (NIR, R, G) were used to train the network, and neither the additional DSM data nor any postprocessing strategy was applied. Moreover, mcResNet-csAM achieved the best F1 score for impervious surfaces. Among the methods that did not use DSM, mcResNet-csAM also obtained the best F1 scores for low vegetation and buildings. These results support the positive impact of the synergistic use of multilevel and multiscale deep features in the proposed unified FCN framework.

Effect of Scale Setting
For the proposed mcResNet-csAM, input images at various scales, which form a multiscale representation, are fed to the network for training. Because the choice of scales may affect the performance of our method, the effect of the scale setting needs to be discussed. Note that the four images (marked in red in Figure 6) with corresponding ground truth maps were used to evaluate the proposed mcResNet-csAM with different scale settings.
Specifically, five settings on the composition of scales were analyzed for both datasets:

Comparison of Different Methods for Fusing Multi-Scale Features
We compared the proposed attention model against other fusion methods, i.e., max pooling, average pooling, and feature pyramid networks (FPN), to show its effectiveness for fusing multi-scale features. Max pooling and average pooling are the most commonly used methods for fusing features in neural networks. FPN, proposed in [61], applies a hierarchical structure to fuse features. In the experiments, we set the scales to {1, 0.75, 0.5}. We also used the four testing images to validate the different fusion methods for the ISPRS Potsdam dataset. Table 5 lists the results of the different methods for fusing multi-scale features. As can be seen, the proposed attention model achieved the best F1 and OA on both datasets. FPN and max pooling obtain comparable results that are slightly worse than ours. Average pooling obtains the worst performance, since it may blur fine details. The results show that the proposed attention model can effectively fuse multi-scale features.
Table 6 reports the complexity of our method and other state-of-the-art methods. Time complexity is measured as the average time to run inference on 100 patches (320 × 320 pixels) with a GTX 1080Ti GPU. As shown in Table 6, mcResNet-csAM has comparable space complexity, while it takes about 12 s more than the most efficient model, SegNet. We consider the extra time worthwhile, because the network processes input images at multiple scales to achieve class-specific, scale-adaptive classification results.
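To make the difference between the pooling-based and attention-based fusion strategies compared in this section concrete, the following is a minimal NumPy sketch. It assumes that the score maps from all scales have already been resized to a common resolution and stacked into an array of shape (scales, classes, height, width). The function names, the array sizes, and the random stand-in for the attention output are hypothetical; in the proposed network the attention weights are learned, and FPN is not covered here.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_max(scores):
    """Max pooling over the scale axis; scores has shape (S, C, H, W)."""
    return scores.max(axis=0)

def fuse_mean(scores):
    """Average pooling over the scale axis; every scale counts equally."""
    return scores.mean(axis=0)

def fuse_attention(scores, attention_logits):
    """Weighted sum over scales with one weight per scale, class, and pixel.

    attention_logits has the same shape as scores. The softmax over the
    scale axis makes the weights at each (class, pixel) site sum to 1.
    """
    weights = softmax(attention_logits, axis=0)
    return (weights * scores).sum(axis=0)

rng = np.random.default_rng(0)
S, C, H, W = 3, 3, 4, 4                  # 3 scales; C, H, W are arbitrary here
scores = rng.normal(size=(S, C, H, W))   # per-scale score maps (stand-in values)
logits = rng.normal(size=(S, C, H, W))   # stand-in for a learned attention output

fused = {
    "max": fuse_max(scores),
    "mean": fuse_mean(scores),
    "attention": fuse_attention(scores, logits),
}
# The final label at each pixel is the class with the highest fused score.
labels = {name: fused_map.argmax(axis=0) for name, fused_map in fused.items()}
```

With all attention logits equal, the weighted sum reduces to average pooling, so the two pooling baselines can be viewed as fixed weighting schemes, whereas the attention model lets the weights vary by class, scale, and pixel.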

Effect of Data Quality
In our experiments, some inaccurate ground truth maps were noticed in the above two datasets, such as those that are shown in Figures 11 and 12. For the Massachusetts building dataset, the associated ground truth building labels were retrieved from OpenStreetMap. Some buildings in the images of the dataset are not labeled in the corresponding ground truth maps, and vice versa, due to the time difference between image acquisition and building information collection by OpenStreetMap, as shown in Figure 11. Moreover, annotation errors are also unavoidable, especially for VHR images. For instance, obvious mislabeling also appeared in the ISPRS Potsdam dataset, as illustrated in Figure 12.

One interesting observation is that the proposed mcResNet-csAM model can obtain correct extraction results even with inaccurate ground truth maps. This suggests that the proposed method is, to some extent, robust against label noise. Nevertheless, the effect of improperly labeled samples, owing to human mistakes or limited labeling conditions, still needs further investigation, which will be carried out in our future work.

Conclusions
In this work, a novel FCN that fuses deep features was presented to address the problems of VHR image classification. The proposed approach showed impressive performance by focusing on two aspects: (1) A multiconnection ResNet is proposed to fuse multilevel deep features that correspond to different layers of FCNs. The multiconnection residual shortcuts make it possible for low-level features to learn how to cooperate with high-level features. Thus, hierarchical features, i.e., high-level abstract knowledge that is invariant to pixel-level variations and useful for roughly locating geo-objects, and low-level features that contribute to recovering geo-object boundaries, can be effectively fused without introducing redundant information from the low-level features. (2) A class-specific attention model is proposed to combine multiscale features in the scale space. It can learn the contributions of various features for each geo-object at each scale. In detail, the model generates a class-specific weight map that softly weights the multiscale features at each pixel site for each geo-object at each scale, and the weighted sum of score maps is then used for further classification. In this way, a class-specific, scale-adaptive classification map can be achieved in the proposed framework.
Extensive experiments indicated that the proposed method outperforms other state-of-the-art methods on two benchmarks and that it can achieve scale-adaptive classification results. For the ISPRS online contest, the proposed method achieved the highest OA without using additional DSM data or postprocessing.
Moreover, we compared the proposed attention model with other fusion methods. The results showed that the proposed attention model achieved better performance than the other fusion methods.
In future work, we will test the performance of the methods when the input data are imperfect, for example when labels are noisy. We will also try to make the algorithm more efficient and apply it to other fine-grained classification tasks.