Multi-scale Adaptive Feature Fusion Network for Semantic Segmentation in Remote Sensing Images

Semantic segmentation of high-resolution remote sensing images is highly challenging due to complicated backgrounds, irregular target shapes, and similar appearances among multiple target categories. Most existing segmentation methods rely only on a simple fusion of the extracted multi-scale features and often fail to provide satisfactory results when the target sizes differ greatly. To handle this problem through multi-scale context extraction and efficient fusion of multi-scale features, in this paper we present an end-to-end multi-scale adaptive feature fusion network (MANet) for semantic segmentation in remote sensing images. It is an encoding and decoding structure that includes a multi-scale context extraction module (MCM) and an adaptive fusion module (AFM). The MCM employs two layers of atrous convolutions with different dilation rates and global average pooling to extract context information at multiple scales in parallel. MANet embeds the channel attention mechanism to fuse semantic features. The high- and low-level semantic information is concatenated, and global features are generated from it via global average pooling. These global features are passed through fully connected layers to obtain adaptive weights for each channel. To accomplish an efficient fusion, these tuned weights are applied to the fused features. The performance of the proposed method has been evaluated by comparing it with six other state-of-the-art networks: fully convolutional networks (FCN), U-net, UZ1, Light-weight RefineNet, DeepLabv3+, and APPD. Experiments performed using the publicly available Potsdam and Vaihingen datasets show that the proposed MANet significantly outperforms the other existing networks, with overall accuracy reaching 89.4% and 88.2%, respectively, and average F1 reaching 90.4% and 86.7%, respectively.


Introduction
With the advancement of global observation technology and the development of increasingly higher-resolution sensors, it is now possible to acquire very high-resolution remote sensing images. Such images capture detailed ground information and facilitate the accurate analysis of scenes and of the objects within them. With the availability of high-resolution remote sensing images, the demand for extracting detailed information about regions of interest in the images has increased. With the fully convolutional network (FCN) [28], end-to-end training was used for the first time to accomplish pixel-wise image classification. Despite demonstrating promising performance, the FCN still has some limitations. Firstly, since the backbone network continuously down-samples to extract image features, the size of the final feature map is 1/32 of the original input size. While reducing the size of the feature map reduces the amount of calculation, it also decreases the spatial resolution of the feature map. This loses useful information, which makes it difficult to recover fine details from the feature map. Secondly, some categories in remote sensing images appear at very different sizes, and it is difficult to extract suitable features for all of them through the backbone network alone.
Multi-scale context information is essential for targets with different scales. Context information of different scales can be concatenated to gain multi-scale information, thereby improving the performance of segmentation. Huang et al. [29] proposed U-net, a segmentation model based on encoding and decoding, which fuses the semantic information in different layers into the corresponding decoding part. It makes full use of different levels of semantic information and effectively reduces the loss of useful shallow feature information. However, the fusion in U-net is performed simply by concatenating each channel's features. PSPNet [30] uses a pyramid pooling module to aggregate context information in different regions, thereby improving the ability to capture global information. However, it is computationally inefficient. DeepLabv3+ [31] uses a backbone network to down-sample the image. Then, multi-scale information is obtained using atrous convolutions with different dilation rates. Finally, the up-sampled features and low-level feature maps are added to make predictions. APPD [32] combines the advantages of DeepLabv3+ and U-net and employs post-processing based on superpixels to further improve the segmentation performance. Both the high-level and the low-level semantic information obtained by the backbone network have a great effect on the segmentation results. Currently available high-level and low-level feature fusion methods are divided into channel-dimensional concatenation and channel-dimensional addition. If the fusion of high-level and low-level features is performed by simple addition or concatenation, the effectiveness of the fusion is reduced. In contrast, an efficient and reasonable fusion of high-level and low-level semantic information can refine the image segmentation result.
In addition, the context information of different scales can alleviate the degradation of segmentation performance caused by the difference in target size.
In this paper, we propose an end-to-end multi-scale adaptive feature fusion network (MANet) for semantic segmentation in remote sensing images. It is an encoding and decoding structure that includes a multi-scale context extraction module and an adaptive fusion module. Specifically, we use ResNet101 as the backbone network to extract various features of the images. The multi-scale context extraction module uses two layers of atrous convolutions with different dilation rates and global average pooling to extract context information at different scales in parallel. The feature map extracted from the backbone network is fed into the multi-scale context extraction module to generate contextual information of different scales, which is then concatenated. We introduce the channel attention mechanism to fuse semantic features. The low-level and high-level semantic information is concatenated, and global features are generated from it via global average pooling. The global features are used as channel weights, and these weights are adaptively learned by the fully connected layers. Finally, the fused features are adjusted by multiplying them with these weights. Experiments performed using the publicly available Potsdam and Vaihingen datasets show that the proposed MANet significantly outperforms the other existing networks, with overall accuracy reaching 89.4% and 88.2%, respectively. The main contributions of this paper are summarized as follows:
• We propose a multi-scale context extraction module. It consists of two layers of atrous convolutions with different dilation rates, global information, and the input feature map itself. The module extracts features of the image at different scales. These features are concatenated to form new features, which are used to tackle the problem of different target sizes in the images.
• We design a high-level and low-level feature adaptive fusion module. It combines both high- and low-level features to form new features and applies channel attention to these new features to obtain weights. These weights are multiplied with the fused features to emphasize useful features and to suppress useless ones. This alleviates the problem of misidentification of similar targets in remote sensing images.
• Based on the above modules, we construct an end-to-end network called MANet for semantic segmentation in remote sensing images. The performance of our proposed MANet on the Potsdam and Vaihingen datasets is compared with that of other state-of-the-art methods.
The remainder of this paper is organized as follows. Section 2 introduces our proposed method in detail. Section 3 details the experiments along with in-depth analysis and discussion of results. Section 4 discusses the performance of our proposed method. Section 5 provides the conclusions and our future perspectives.

Multi-Scale Adaptive Feature Fusion Algorithm
As mentioned earlier, the large variations in target sizes in remote sensing images make it difficult for the model to extract the corresponding useful features of the targets. This in turn has a significant effect on the overall segmentation performance. Moreover, remote sensing images are all taken from above, which may lead to the same visual representation for different categories, e.g., low vegetation areas and areas with trees, resulting in incorrect pixel segmentation. These problems can be handled effectively by our proposed MANet, which consists of a multi-scale context extraction module and an adaptive feature fusion module. In this section, we present the details of these two modules.

Multi-Scale Context Extraction Module
In remote sensing images, the scene is complex and the targets differ in size. In such cases, it is extremely difficult to extract target features at a single scale only. Therefore, multi-scale context information is indispensable for semantic segmentation. The difference in target size and the complexity of the scene affect the feature extraction of the backbone network. In order to solve this problem, we propose a multi-scale context extraction module. It contains three main parts. The first part is responsible for extracting the global information; the second part uses atrous convolution to obtain information at several different scales; the third part is the feature map itself. The outputs of these three parts are concatenated at the end to form a multi-scale feature map. The structure of our multi-scale context extraction module is shown in Figure 1.
In the figure, yellow boxes represent 3 × 3 atrous convolutions with dilation rate d. The stride of each atrous convolution is 1. GAP stands for global average pooling. Con1 × 1 represents a 1 × 1 convolution layer. UP denotes the upsample operation. Concat means that features are concatenated along the channel dimension. The outputs of the three parts are concatenated to achieve multi-scale context information extraction. Atrous convolutions with different dilation rates enlarge the receptive field and extract features at different scales without introducing too many calculations.

Global Information Extraction
In image processing, convolution is the process of convolving an image with a kernel of specified size. This is usually performed to extract local features of the image. Although the receptive field of each convolution kernel increases with the network depth, it is still not possible to obtain the global information in this way. The extraction of global information refers to the generalization and integration of the features of the whole feature map, which yields global context information. In this part, i.e., part A in Figure 1, the features extracted from the last layer of the backbone network are subjected to global average pooling. The features of each channel are averaged, and the number of channels is adjusted by a 1 × 1 convolution. Then, the global feature map is up-sampled. Following this description, this part can be written as:

$$Y_A = \mathrm{UP}\left(C_{1\times 1}\left(\mathrm{GAP}(X)\right)\right)$$

where $X$ represents the feature map extracted by the backbone network, $C_{1\times 1}$ represents a 1 × 1 convolution layer, GAP and UP denote global average pooling and upsampling as in Figure 1, and $Y_A$ denotes the output of this part.
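As a rough illustration of the global branch described above, the following pure-Python sketch applies global average pooling, a bias-free 1 × 1 convolution, and upsampling to a tiny feature map stored as nested lists. The function name `global_context_branch`, the toy weights, and the omission of a bias term are assumptions made for illustration; this is not the authors' implementation.

```python
def global_context_branch(x, weight):
    """x: feature map as nested lists [C_in][H][W]; weight: 1x1 conv as [C_out][C_in]."""
    height, width = len(x[0]), len(x[0][0])
    # Global average pooling: one mean value per input channel.
    pooled = [sum(sum(row) for row in channel) / (height * width) for channel in x]
    # On a 1x1 map, a 1x1 convolution is a matrix-vector product over channels.
    mixed = [sum(w * p for w, p in zip(row, pooled)) for row in weight]
    # Upsampling a 1x1 map back to H x W repeats each value at every position.
    return [[[value] * width for _ in range(height)] for value in mixed]


x = [
    [[1.0, 3.0], [5.0, 7.0]],  # channel 0, mean 4.0
    [[2.0, 2.0], [2.0, 2.0]],  # channel 1, mean 2.0
]
weight = [[0.5, 0.5]]  # hypothetical 1x1 convolution: 2 channels -> 1 channel
out = global_context_branch(x, weight)
print(out)  # [[[3.0, 3.0], [3.0, 3.0]]]
```

Because the pooled map holds a single value per channel, any interpolation scheme used for the upsampling step produces the same constant map, which is why plain repetition suffices in this sketch.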

Parallel Atrous Convolution Multi-Scale Context Extraction
Two layers of convolution operations at different scales can change the receptive field of the convolution kernel and can obtain context information at different scales. However, using multiple convolution kernels of different sizes to extract multi-scale context information from the last feature map increases the number of parameters as well as the amount of computation. Inspired by the network structure of DeepLab [33][34][35], 3 × 3 atrous convolutions with different dilation rates are used in place of conventional convolutions. The main advantage is that multi-scale context information is obtained without introducing too many parameters and computations. The dilation rates of the atrous convolutions are 1, 2, 3, 5, and 7, respectively. In terms of receptive field, this is equivalent to using conventional convolution kernels of sizes 3, 5, 7, 11, and 15, respectively. The size of the training image is 512 × 512 and the size of the feature map after feature extraction is 16 × 16. In order to extract as much context information as possible, the maximum value of the dilation rate is selected to be 7. When the dilation rate of the atrous convolution is 5, it is equivalent to inserting four zeros between adjacent elements of a 3 × 3 convolution kernel. The 3 × 3 convolution is thus transformed into an 11 × 11 convolution, which enlarges the receptive field of the kernel. Here, the strides of all 3 × 3 atrous convolutions are 1. It is worth noting that although the 3 × 3 convolution is transformed into an 11 × 11 one, only the original 3 × 3 kernel elements are used for calculations, that is to say, the total number of computations is not increased. Apart from that, the dilation rate d can be flexibly adjusted according to the image size. The maximum value of d is determined based on the size of the feature map extracted by the backbone network.
Assuming the feature map size is m × m, the minimum and maximum values of d are 1 and (m − 3)/2 + 1, respectively. For our experiments, the size of the feature map extracted by the backbone network is 16 × 16, so the maximum d is 7. When d is 1, 2, or 3, the neighboring pixels are considered in the feature extraction. When d is 5 or 7, the receptive field becomes larger so that more context information is considered in the feature extraction. In order to further increase the nonlinearity of the model, we used batch normalization and a ReLU activation function after each convolution layer. Finally, all the outputs of this part are concatenated together. This second part of our multi-scale context extraction module, i.e., part B in Figure 1, is expressed in Equation (2), where • represents feature concatenation, i.e., the features are concatenated along the channel dimension, and $C^{d}_{3\times 3}$ stands for a 3 × 3 atrous convolution with a dilation rate of $d$. In order to retain the original feature map, the input features extracted by the backbone network are directly combined with the global context information and the parallel multi-scale context information. In this way, we are able to extract the multi-scale context. The overall module is expressed as in Equation (3).
In Equation (3), $C^{d}_{3\times 3}$ stands for a 3 × 3 atrous convolution with a dilation rate of $d$ and $C_{1\times 1}$ is a 1 × 1 convolution layer.
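The arithmetic behind the dilation rates can be checked with a short sketch. It computes the effective kernel size k + (k − 1)(d − 1) of a dilated kernel and the largest dilation rate whose effective kernel still fits in an m × m feature map. The function names are hypothetical, and rounding (m − 3)/2 + 1 down to an integer is an assumption that matches the value 7 reported for m = 16.

```python
def effective_kernel_size(kernel_size, dilation):
    # A dilated kernel spans k + (k - 1) * (d - 1) input positions per axis.
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def max_dilation(feature_size, kernel_size=3):
    # Largest d whose effective kernel still fits inside the feature map.
    # For k = 3 this equals floor((m - 3) / 2 + 1).
    return (feature_size - 1) // (kernel_size - 1)


rates = [1, 2, 3, 5, 7]
print([effective_kernel_size(3, d) for d in rates])  # [3, 5, 7, 11, 15]
print(max_dilation(16))  # 7
```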

Adaptive Fusion Module
As mentioned before, the scene in remote sensing images is complicated, and different categories look similar because the images are taken from above. To alleviate this problem, the fusion of high- and low-level features to generate key features is essential. In semantic segmentation, the features extracted by the backbone network are usually distributed across different levels. The low-level features contain a lot of contour information, whereas the high-level features contain rich semantic information [36]. In recent years, many methods have been presented in the literature for fusing these high- and low-level features. FCN uses a skip-connection operation in which the high-level abstract semantic information and the low-level fine semantic information are directly added channel by channel to form new features. U-net [29] combines the semantic features of each layer extracted from the backbone network with the features of corresponding size along the channel dimension to restore the resolution layer by layer. Most of these fusion methods use pixel-by-pixel addition or channel-by-channel concatenation. Although they fuse high-level and low-level semantic information, these methods do not determine which channels carry useful features and which carry useless ones. Inspired by SENet [27], in this work we introduce a channel attention mechanism into the fusion of high- and low-level semantic information and propose an adaptive fusion module to fuse the high- and low-level semantic features. Its structure is shown in Figure 2. The feature map B is up-sampled via bilinear interpolation, which gives a new feature map B′ that is twice the size of B. The low-level semantic features A extracted from the backbone network and the features B′ are concatenated to form a new feature C. The number of channels of C is the sum of the channels in B′ and A.
Next, the feature map D is generated from C through a 1 × 1 convolution layer. D has half as many channels as C. The weight information α is obtained by applying global average pooling to each channel of D. The weights β are generated from α via two fully connected layers. The normalized weights ω are acquired from β using a sigmoid function. Then, the weights ω are multiplied with D to get a new feature map E. E is thus D with each channel rescaled according to its importance: it highlights the features of useful channels and suppresses the features of unwanted channels. Finally, the low-level feature map A is directly added to the feature map E to obtain the feature F, which is the final fusion feature. Following these steps, this part can be written as:

$$D = C_{1\times 1}\left(\mathrm{UP}(X_1) \bullet X_2\right), \qquad \omega = \mathrm{sig}\left(f_c\left(f_c\left(\mathrm{GAP}(D)\right)\right)\right), \qquad F = \omega \otimes D + X_2$$

where sig() is the sigmoid activation function and $f_c$ is a fully connected layer. UP denotes the upsample operation and GAP denotes global average pooling. • indicates that the features are concatenated along the channel dimension, and ⊗ denotes channel-wise multiplication. $X_1$ represents the high-level semantic features and $X_2$ represents the low-level semantic features. $C_{1\times 1}$ represents a 1 × 1 convolution layer.
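The fusion steps described above (concatenate, 1 × 1 convolution, global average pooling, two fully connected layers, sigmoid, channel-wise reweighting, residual addition of the low-level features) can be traced on a toy example. The sketch below uses nested lists and bias-free layers. All function names and weights are hypothetical, the ReLU between the two fully connected layers is an assumption borrowed from SENet-style attention, and the high-level input is taken to be already up-sampled.

```python
import math


def conv1x1(x, weight):
    """Bias-free 1x1 convolution; x is [C_in][H][W], weight is [C_out][C_in]."""
    height, width = len(x[0]), len(x[0][0])
    return [
        [[sum(w * x[c][i][j] for c, w in enumerate(row)) for j in range(width)]
         for i in range(height)]
        for row in weight
    ]


def linear(vector, weight):
    """Bias-free fully connected layer; weight is [out][in]."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


def adaptive_fusion(high_up, low, conv_w, fc1_w, fc2_w):
    """high_up: up-sampled high-level features; low: low-level features (same shape)."""
    concat = high_up + low                      # channel-wise concatenation -> C
    d = conv1x1(concat, conv_w)                 # reduce channels -> D
    n = len(d[0]) * len(d[0][0])
    alpha = [sum(sum(row) for row in ch) / n for ch in d]  # global average pooling
    hidden = [max(0.0, h) for h in linear(alpha, fc1_w)]   # assumed ReLU between FC layers
    beta = linear(hidden, fc2_w)
    omega = [1.0 / (1.0 + math.exp(-b)) for b in beta]     # sigmoid -> weights in (0, 1)
    # E = omega * D per channel, then F = E + low-level features.
    fused = [
        [[omega[k] * d[k][i][j] + low[k][i][j] for j in range(len(d[k][i]))]
         for i in range(len(d[k]))]
        for k in range(len(d))
    ]
    return fused, omega


fused, omega = adaptive_fusion(
    high_up=[[[1.0, 1.0]]],   # 1 channel, 1 x 2 map
    low=[[[2.0, 4.0]]],
    conv_w=[[1.0, 1.0]],      # sums the two concatenated channels
    fc1_w=[[0.0]],            # zero weights make the attention output sigmoid(0) = 0.5
    fc2_w=[[1.0]],
)
print(omega, fused)  # [0.5] [[[3.5, 6.5]]]
```

With the toy weights the attention weight is exactly 0.5, so each fused value is half of D plus the low-level feature, which makes the reweighting and the residual addition easy to verify by hand.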

Multi-Scale Adaptive Feature Fusion Network (MANet)
Based on the two modules above, we propose an end-to-end multi-scale adaptive feature fusion network. Its overall structure is shown in Figure 3.
The parts A, B, and C of our proposed MANet can be seen in Figure 3. Part A is the backbone network, part B is the multi-scale context extraction module, and part C is the high- and low-level feature adaptive fusion module. The red arrow in the figure represents upsampling. The label '1 * 1' represents a 1 × 1 convolution layer that changes the number of channels. The multi-scale context extraction module (MCM) extracts multi-scale context information to address the problem of different target sizes. The adaptive fusion module introduces channel attention into the fusion of high-level and low-level layers. The efficient fusion of features can alleviate the problem of incorrect segmentation caused by the similar features of similar categories. In this work we use ResNet101 as our backbone network to extract image features. An input image fed to the backbone network passes through four stages of the network, where each stage generates features with different sizes and numbers of channels. As the number of channels differs from stage to stage, a 1 × 1 convolution is adopted to reduce the channels so that each stage's output has the same number of channels. The final-stage features are fed to the context extraction module, where the multi-scale context information is extracted and concatenated. This addresses the problem of excessive target size differences in remote sensing images. The channels of the concatenated features are reduced by a 1 × 1 convolution to generate new features, which are up-sampled and fed to the adaptive fusion module. The adaptive fusion module fuses high-level and low-level semantic information efficiently and emphasizes the useful features through learned weights. The features extracted at each stage of the network are then fused with the integrated features stage by stage by the adaptive fusion module.
Through this we obtain a feature map that is one quarter the size of the input image. Then, we adopt bilinear interpolation to restore the feature map to the same size as the input image. Finally, the output is sent to the classifier to obtain the segmentation results.
In order to classify the output, we use a softmax classifier [37] with our network. Softmax is a multi-class classifier that calculates the probability of each category such that the probabilities of all categories sum to 1. In its standard form, the classification of a pixel can be expressed as:

$$p_i = \frac{e^{z_i}}{\sum_{k=1}^{n} e^{z_k}}$$

where $n$ is the total number of categories, $z_i$ is the network output for category $i$, and $p_i$ represents the probability that a pixel belongs to category $i$. The cross-entropy loss function [38] is adopted to represent the difference between the prediction and the label. Suppose that there is a training sample set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ with $m$ pixels; in its standard form, the cross-entropy loss function is:

$$L = -\frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} I\left(y^{(j)} = i\right) \log p_i^{(j)}$$

where $I(y^{(j)} = i)$ is an indicator function that equals 1 when the label of pixel $j$ is category $i$ and 0 otherwise, and $p_i^{(j)}$ is the predicted probability of category $i$ for pixel $j$. The network obtains a loss via forward propagation, and the back propagation algorithm is used to propagate the loss from the output back to the input in order to update the network parameters [39]. In this paper, the stochastic gradient descent method is used to update the network parameters.
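To make the classifier and the loss concrete, here is a minimal sketch of a per-pixel softmax followed by the mean cross-entropy over a handful of pixels. It uses only the standard library; the function names and the toy scores are assumptions, and subtracting the maximum score is a common numerical-stability step that is not mentioned in the text.

```python
import math


def softmax(scores):
    # Subtracting the maximum avoids overflow and leaves the result unchanged.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(score_rows, labels):
    # Mean negative log-probability of the labelled category over all pixels.
    loss = 0.0
    for scores, label in zip(score_rows, labels):
        loss -= math.log(softmax(scores)[label])
    return loss / len(labels)


probs = softmax([2.0, 1.0, 0.1])
print(round(sum(probs), 6))  # 1.0

# Two pixels with uniform scores over two categories: the loss is ln 2.
uniform_loss = cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1])
print(round(uniform_loss, 4))  # 0.6931
```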

Experiment and Analysis
In order to verify the effectiveness of our proposed method, we have conducted various experiments using publicly available Potsdam and Vaihingen datasets. In this section, we first briefly describe the datasets and training settings, then present the experiments performed on the two datasets. To further verify the validity of our proposed method we have designed some ablation experiments that are presented at the end.

Datasets Description
The Potsdam and Vaihingen datasets are provided by the ISPRS II/4 committee for semantic segmentation [40]. They are well known in the remote sensing community and contain images of the cities and their surroundings. They can be downloaded online from the following address: http://www2.isprs.org/commissions/comm3/wg4/data-request-form2.html. The images in the datasets have six common categories: impervious surface (white), building (blue), low vegetation (cyan), tree (green), car (yellow), and background (red). Sample images along with their corresponding labels from both datasets are shown in Figure 4. Potsdam is a historic city in Germany, with large buildings and densely intertwined streets. The Potsdam dataset contains 38 images at a ground sampling distance (GSD) of about 9 cm, with five channels (infrared, red, green, blue, and a digital surface model (DSM)) and a size of 6000 × 6000 pixels each. For the experiments, out of the available 38 images, we used 24 for training and 14 for testing. Note that we used the ground truth with no eroded boundary labels in the experiments. Vaihingen, on the other hand, is a village with small and scattered buildings. The Vaihingen dataset contains 3-band IRRG (infrared, red, and green) image data, the corresponding DSM, and normalized digital surface model (NDSM) data [41]. At a GSD of about 9 cm, there are 33 images of about 2500 × 2000 pixels. We used 16 of the images for training and 17 for testing. As with the Potsdam dataset, we used the ground truth with no eroded boundary labels in the experiments.
Due to GPU memory limitations, we needed to reduce the size of the images in the datasets. This can be done either by reducing the resolution or by cropping the image. However, reducing the resolution loses useful information in the image. For this reason, we cropped each training image and its corresponding label into patches of 512 × 512 pixels with a 100-pixel overlap. The test images were mirror-padded and cropped into patches of 512 × 512 pixels with an overlap of 256 pixels, and the final prediction results were stitched together. The amount of data used can significantly affect the performance of the model. Sufficient data can prevent the model from overfitting and improve its robustness. Data augmentation is an efficient way to address a limited data volume. In terms of spatial position, we employed horizontal and vertical rotation and random scale rotation. In terms of color, we applied random changes in brightness, saturation, and contrast. It is worth noting that the Vaihingen dataset is comparatively small. Therefore, each original training image was additionally scaled by factors of 0.5, 0.75, 1.25, and 1.5 and cropped according to the above method.
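The overlapping cropping described above can be sketched as a computation of patch origins along one image axis. The helper name `tile_origins` is hypothetical, and shifting the last patch back so that it ends at the image border is an assumption for this sketch (the text states that test images are mirror-padded instead). The sketch also assumes the image is at least as large as one patch.

```python
def tile_origins(length, tile=512, overlap=100):
    """Start coordinates of patches along one axis, covering the full length."""
    stride = tile - overlap
    origins = list(range(0, length - tile + 1, stride))
    # Add a final patch aligned with the border if the regular grid stops short.
    if origins[-1] + tile < length:
        origins.append(length - tile)
    return origins


train = tile_origins(6000, overlap=100)  # stride of 412 pixels
print(len(train), train[:3], train[-1])  # 15 [0, 412, 824] 5488

# Pair the per-axis origins to enumerate the patches of a 6000 x 6000 image.
patches = [(top, left) for top in train for left in train]
print(len(patches))  # 225
```

With the 256-pixel overlap used for testing, the stride becomes 256 pixels, so neighbouring patches share half of their area.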

Compared State-of-the-Art Methods
In order to better evaluate the performance of our proposed MANet, we compared it with FCN8s [28], U-net [29], UZ1 [42], Light-weight RefineNet [43], DeepLabv3+ [31], and APPD [32]. FCN is a well-known method in semantic segmentation. It is further divided into FCN32s, FCN16s, and FCN8s based on the type of skip-connections. In this work, we chose FCN8s, which gives the finest edge segmentation, for comparison. UZ1 is a CNN-based encoder-decoder model that employs deconvolution layers as the decoder. U-net employs high-level and low-level feature fusion. It simply concatenates the features along the channel dimension to form more features. Light-weight RefineNet is a lightweight and efficient network that uses a chained residual pooling module and a layer-by-layer fusion module for segmentation. DeepLabv3+ uses a spatial pyramid to obtain multi-scale semantic information and adopts an encoding and decoding structure to refine the segmentation results. APPD combines the advantages of both DeepLabv3+ and U-net, considering the strategies of multi-scale context information and multi-scale feature fusion. It also uses post-processing based on superpixels.

Training Details
The proposed method was implemented using the PyTorch library. The size of the images used for both training and testing was 512 × 512. ResNet101 pre-trained on ImageNet was used for our backbone network. The MANet was trained on two Titan 1080 GPUs, each with 12 GB of memory. The total batch size was set to 12. Cross entropy was used as a loss function to update network parameters. Stochastic gradient descent with momentum was applied to optimize the network. The momentum and weight decay were set to 0.99 and 0.0005, respectively. We adjusted the learning rate according to the training epochs. The initial learning rate was set to 0.001. A new learning rate was set every 30 epochs, as shown in Table 1. It is worth noting that the experimental settings of all the compared algorithms are the same as the proposed MANet.
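For readers unfamiliar with the optimizer, the sketch below shows one common formulation of a stochastic gradient descent step with momentum and L2 weight decay for a single scalar parameter, using the hyperparameters quoted above as defaults. The text does not state which momentum variant was used, so the update rule, the function name, and the constant toy gradient are assumptions.

```python
def sgd_momentum_step(weight, grad, velocity, lr=0.001, momentum=0.99, weight_decay=0.0005):
    # L2 weight decay is folded into the gradient before the momentum update.
    grad = grad + weight_decay * weight
    # The velocity accumulates a decaying sum of past gradients.
    velocity = momentum * velocity + grad
    return weight - lr * velocity, velocity


w, v = 1.0, 0.0
for _ in range(3):
    w, v = sgd_momentum_step(w, grad=0.5, velocity=v)
print(w < 1.0, v > 0.5)  # True True
```

With a momentum close to 1, the velocity keeps growing while the gradient direction stays the same, so the effective step becomes considerably larger than the learning rate times the gradient.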

Metrics
In order to comprehensively evaluate the performance of different networks, we chose overall accuracy (OA), F1, precision, and recall [44] as our evaluation metrics. In their standard form, they are formulated as:

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. It is worth noting that the overall accuracy is the accuracy over all categories, including background. In addition, for the task of pixel classification, precision and recall are used for evaluation when the categories are not balanced. To this end, we draw a precision-recall (PR) curve to show the relationship between the precision and the recall of each category. Technically, we first obtain the model's score map for each category and select a series of thresholds between 0 and 1. Next, we obtain the numbers of TP, FP, and FN according to the thresholds. Then, we calculate the precision and recall under the different thresholds. Finally, we draw the PR curve with recall as the horizontal axis and precision as the vertical axis.
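The threshold sweep behind the PR curve can be sketched for a single category as follows. The function names, the toy scores, and the convention of returning 0 when a denominator is empty are assumptions made for this illustration.

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def pr_curve(scores, labels, thresholds):
    """scores: per-pixel score for one category; labels: 1 if the pixel is that category."""
    points = []
    for t in thresholds:
        # A pixel is predicted positive when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        precision, recall, _ = precision_recall_f1(tp, fp, fn)
        points.append((recall, precision))  # recall on the x-axis, precision on the y-axis
    return points


scores = [0.9, 0.8, 0.4, 0.2]
labels = [1, 0, 1, 0]
curve = pr_curve(scores, labels, thresholds=[0.3, 0.5])
for recall, precision in curve:
    print(round(recall, 3), round(precision, 3))
```

Lowering the threshold from 0.5 to 0.3 recovers the second positive pixel, so recall rises from 0.5 to 1.0, while precision changes from 0.5 to about 0.667 for this toy input.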

Experiments on the Potsdam Dataset
The first set of experiments was performed on the Potsdam dataset. For all the compared models, we computed the F1 for each category as well as the average of the other metrics. The results are summarized in Table 2. It can be seen from the results that the proposed MANet achieved 89.4% in OA and 90.4% in average F1. In comparison with APPD, this is an improvement of 1% in OA and 1.1% in average F1. MANet achieved improvements of 0.6%, 0.3%, 1.1%, 1.8%, and 2%, respectively, in the F1 of each category compared with APPD. Although the six compared methods consider the fusion of high-level and low-level features, they do not take into account the weights of the feature fusion. In contrast, our proposed method readjusts the fused features by learning their weights. Because of this, our method can better distinguish categories and reduce false positives. Areas of low vegetation and trees look very similar and are very prone to incorrect segmentation. Nevertheless, our method showed better performance in these two categories, with F1 improvements of 1.1% and 1.8%, respectively. These results demonstrate the effectiveness of our adaptive fusion module. In the dataset, the car is a small target compared with the other categories, and the impervious surface is a large target. From Table 2, it can be seen that the F1 values for the impervious surface and car categories increased by 0.6% and 2%, respectively. This shows that our proposed multi-scale context extraction module can handle the problem of variable target sizes, i.e., cases in which the difference between target sizes is very large. Based on these two modules, the network can extract features more accurately and fuse high-level and low-level semantic features efficiently, thereby improving the overall segmentation performance.
When the data are not balanced, precision and recall are important indicators for evaluating segmentation performance, and the PR curve clearly shows the relationship between them. The PR curves for each category of each model on the Potsdam dataset are shown in Figure 5. It can be seen that MANet achieved better results for all categories even though the data are unevenly distributed across the categories. In order to compare the segmentation performance of our proposed MANet more intuitively, the visualization results of FCN8s, U-net, UZ1, Light-weight RefineNet, DeepLabv3+, APPD, and MANet are shown in Figure 6. In the figure, the first and the third rows show the image, the ground truth, and the predictions from all seven methods. The second and the fourth rows show the predicted segmentation results for a small region of the image, marked with a red box. It can be noticed that the two similar categories of low vegetation and tree in the second row cannot be distinguished well by the other models. However, MANet produced a smoother result for these challenging categories. These results demonstrate the effectiveness of our method for semantic segmentation.

Experiments on the Vaihingen Dataset
Similar tests were performed on the Vaihingen dataset. The F1 for each category and the average of four metrics for all the seven models are summarized in Table 3. From the obtained results, the OA of MANet on the Vaihingen dataset is 88.2%, and the average of F1 is 86.7%, which is 1.7% and 1% higher,respectively, than its nearest competitor method APPD. Even though the amount of Vaihingen data is comparatively smaller than that of the Potsdam data, our method still managed to obtain better performance. Especially for the car category, where it achieved a 5.2% higher F1 than the APPD. Because the cars occupy a small proportion of pixels in the total training images and are occluded by buildings and trees, it is difficult for the other models to extract corresponding features for correct pixel classification. With our proposed MANet, the features of different scale targets are extracted by a multi-scale context extraction module, and the adaptive fusion module fuses the high-level and low-level semantic information adaptively. Due to this, even when the targets occupy small areas in the images, they can be extracted and fused to form valid features for correct segmentation. Even though the categories are not evenly distributed, the average of precision and recall in all categories are increased by 1.5% and 3.2%, respectively.
In addition, the PR curves of each model for each category of the Vaihingen dataset are shown in Figure 7. As before, the proposed MANet demonstrated better segmentation performance than the other six models. Visual comparison results for this dataset are shown in Figure 8. It can be seen from the second row that FCN8s and UZ1 perform poorly on low vegetation and trees. None of the six comparison methods segmented the car occluded by a tree. The fourth row shows that the two similar categories of tree and low vegetation are easily confused. In addition, the segmentation results of UZ1 and U-net on buildings are incomplete. Even in these cases, the results in the first and third rows show that the proposed method produces smoother segmentation.

Ablation Experiment
We decomposed and recombined the proposed network to verify the effectiveness of each module using the F1 and OA metrics. This experiment used the Potsdam dataset. The ResNet101 model was used as the basic network (Res101), and its last output feature map was up-sampled and passed to a classifier for segmentation. First, Res101+MCM was employed to verify the effectiveness of the context extraction module. The feature map extracted by Res101 was fed into the multi-scale context extraction module to obtain a new feature map, which was then up-sampled for the prediction. Next, Res101 + adaptive fusion module (AFM) was adopted to verify the effectiveness of feature fusion. The feature maps extracted by each stage of the backbone network were fed into the adaptive fusion module, and the output was up-sampled for the prediction. Finally, we integrated the two modules together as Res101+MCM+AFM. The experimental results are summarized in Table 4. "Res101+MCM" yields 86.2% in OA and 86% in mean F1, which are 1.2% and 1.5% higher than those of "Res101". Multi-scale context extraction further extracts multi-scale features from the feature map of the backbone network, which addresses the problem of large differences in target size. In addition, "Res101+AFM" outperforms "Res101" by 3.3% in OA and 4.9% in mean F1. The features extracted by each layer of the backbone network are rich in semantic information, and a reasonable and efficient fusion of these features improves the segmentation performance. Our model introduces channel attention into the fusion process, which readjusts the features through learned weights. When the two modules are used together, the performance is boosted further: compared with "Res101", "Res101+MCM+AFM" improves the segmentation performance by 4.4% in OA and 5.9% in mean F1. The multi-scale context extraction module extracts multi-scale context information, and the adaptive fusion module efficiently fuses the features to alleviate the misclassification of similar categories. The results of the ablation experiments show that both the proposed multi-scale context extraction module and the adaptive fusion module significantly improve the performance of semantic segmentation of remote sensing images.
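The channel-attention fusion evaluated in this ablation can be sketched as follows. This is only an illustration of the idea described in the paper (concatenation, global average pooling, a fully connected layer, and reweighting of the fused features), not the authors' implementation. The sigmoid gate, the single fully connected layer, the channel counts, and the random parameters `w` and `b` are all assumptions made for the example.

```python
import numpy as np


def adaptive_fusion(high, low, w, b):
    """Fuse high- and low-level features with channel attention (sketch).

    high: array of shape (C_h, H, W); low: array of shape (C_l, H, W),
    assumed to be already resized to the same spatial size.
    w, b: hypothetical fully connected parameters of shape (C, C) and (C,),
    where C = C_h + C_l.
    """
    fused = np.concatenate([high, low], axis=0)          # (C, H, W)
    pooled = fused.mean(axis=(1, 2))                     # global average pooling -> (C,)
    weights = 1.0 / (1.0 + np.exp(-(w @ pooled + b)))    # sigmoid gate (assumed)
    # Broadcast one weight per channel over the spatial dimensions.
    return fused * weights[:, None, None], weights


rng = np.random.default_rng(0)
high = rng.standard_normal((4, 8, 8))   # illustrative high-level features
low = rng.standard_normal((2, 8, 8))    # illustrative low-level features
w = rng.standard_normal((6, 6)) * 0.1   # untrained, random FC weights
b = np.zeros(6)
out, weights = adaptive_fusion(high, low, w, b)
print(out.shape, weights.round(3))
```

In a trained network the fully connected parameters would be learned, so channels that help to separate similar categories can receive larger weights than less informative ones.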

Model Complexity
In this subsection, we analyze the computational complexity of MANet. We report the total number of floating point operations (FLOPs), the number of network parameters, and the average test time for each image of size 512 × 512 pixels on two Titan 1080 GPUs with 12G RAM. The results are shown in Table 5. The computational cost of MANet is lower than that of FCN8s, UZ1, DeepLabv3+, and APPD, and slightly higher than that of LWRefineNet. However, the total number of parameters in MANet is larger than that of the other methods, because the multi-scale context extraction module in MANet uses many parallel convolutions. The average test time of the proposed method is comparable to that of the compared models.
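To see why parallel convolutions increase the parameter count, the cost of a single convolution layer can be estimated with the standard formulas below. This is a back-of-the-envelope sketch: the 2048 input channels correspond to the last stage of ResNet101, while the 256 output channels and the 32 × 32 feature map are assumptions chosen for illustration and are not values from Table 5. The function counts multiply-accumulate operations; some tools report twice this number as FLOPs.

```python
def conv2d_cost(c_in, c_out, k, h_out, w_out, bias=True):
    """Return (parameters, multiply-accumulates) of one k x k convolution.

    The dilation rate of an atrous convolution does not appear here: it
    enlarges the receptive field without changing either quantity.
    """
    params = c_out * (c_in * k * k + (1 if bias else 0))
    macs = c_out * c_in * k * k * h_out * w_out
    return params, macs


# One 3 x 3 branch on a 2048-channel feature map (output size assumed).
params, macs = conv2d_cost(c_in=2048, c_out=256, k=3, h_out=32, w_out=32)
print(f"one branch:  {params:,} parameters, {macs:,} MACs")

# Each additional parallel branch adds the same amount again.
n_branches = 2
print(f"{n_branches} branches: {n_branches * params:,} parameters")
```

Because every parallel branch carries a full set of kernels on a wide feature map, the parameter count grows linearly with the number of branches.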

Experiments on Small-Scale Dataset
To verify the performance of our model on a small dataset, the proposed method was tested on a set of 20 multi-spectral Very-High-Resolution (VHR) images acquired over the city of Zurich by the QuickBird satellite in 2002 [45]. The average image size is 1000 × 1150 pixels, and each image consists of four channels that span the near-infrared to visible spectrum (NIR-R-G-B). The spatial resolution of the pan-sharpened images is 0.61 m/pixel. The images in the Zurich dataset are divided into eight categories: Road, Building, Tree, Grass, Bare Soil, Water, Rail, and Pool. An example along with the urban class legend is shown in Figure 9. Note that the white background is not considered a separate class. Images zh1-zh13 were used for training and zh14-zh20 were used for testing. Each training image and its corresponding label were cropped into patches of 512 × 512 pixels with an overlap of 256 pixels. The original training images were also scaled by factors of 0.5, 0.75, 1.25, and 1.5 and cropped in the same way. The test images were mirror-filled and cropped into patches of 512 × 512 pixels with an overlap of 256 pixels. The experimental setup was the same as the training details in Section 3.3. The F1 for each category of MANet and of the six compared algorithms is shown in Table 6. As can be seen from Table 6, MANet is superior to the other models in the categories of Road, Building, Bare Soil, Water, and Pool, with improvements of 0.9%, 0.6%, 3.2%, 0.4%, and 6.8%, respectively, in the F1 of those categories compared with APPD. However, the F1 in the three categories of Tree, Grass, and Rail is lower than that of the compared algorithms. The two categories of grass and trees are similar in the Zurich dataset, and insufficient training data hinders our model from extracting the corresponding features, making it difficult to classify these pixels correctly. Table 6 therefore shows that the proposed method brings improvements on small datasets, but the gains are not pronounced.
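The cropping scheme described above (512 × 512 patches with an overlap of 256 pixels, i.e., a stride of 256) can be sketched as follows. The helper `tile_origins` is hypothetical, and the rule of padding each side up to the next length that fits a whole number of strides is an assumption: the paper states only that the test images were mirror-filled.

```python
def tile_origins(length, tile=512, overlap=256):
    """Return (origins, padded_length) for sliding-window cropping along one axis.

    origins: start coordinates of the patches along the axis.
    padded_length: axis length after padding (e.g., by mirroring) so that
    the last patch fits entirely inside the image.
    """
    stride = tile - overlap
    if length <= tile:
        return [0], tile
    n_steps = -(-(length - tile) // stride)  # ceiling division
    padded = tile + n_steps * stride
    return [i * stride for i in range(n_steps + 1)], padded


# Average Zurich image size reported in the text: 1000 x 1150 pixels.
rows, padded_h = tile_origins(1000)
cols, padded_w = tile_origins(1150)
print(f"padded size: {padded_h} x {padded_w}")
print(f"patches per image: {len(rows) * len(cols)}")
print("row origins:", rows)
print("column origins:", cols)
```

With this stride, every interior pixel is covered by several patches, so overlapping predictions can be merged when the full-size result is reassembled.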

Conclusions
In this paper, we propose a multi-scale adaptive feature fusion network, MANet, for semantic segmentation in remote sensing images. MANet uses an encoding-decoding architecture: ResNet101 serves as the backbone network for encoding, and the multi-scale context extraction and adaptive fusion modules are used for decoding. The multi-scale context extraction module combines two parallel layers of atrous convolutions with different dilation rates, global information, and its own input features. It extracts multi-scale context information to address the large differences in target sizes in remote sensing images. The adaptive fusion module introduces the channel attention mechanism into the fusion of high-level and low-level semantic features and adjusts the fused features by the learned weights. Integrating these two modules significantly boosts the overall segmentation performance.
We have evaluated the proposed MANet on two publicly available benchmark datasets, Potsdam and Vaihingen. On the Potsdam dataset, MANet achieved 89.4% in overall accuracy and 90.4% in the average of F1, which are improvements of 1% and 1.1%, respectively, over APPD. On the Vaihingen dataset, the OA of MANet is 88.2% and the average of F1 is 86.7%. The experimental results show that the proposed MANet outperforms other state-of-the-art models on both datasets in terms of overall accuracy and F1. In the ablation experiments, the performance of "Res101+MCM+AFM" exceeds that of "Res101" by 4.4% in OA and 5.9% in mean F1, which further verifies the effectiveness of the proposed modules in semantic segmentation of remote sensing images. It is worth noting that the available DSM and NDSM data in the Potsdam dataset were not used to assist the segmentation. However, the number of parameters of the proposed method is large, and in the future we aim to reduce it. Moreover, semantic segmentation of remote sensing images is a supervised learning task, and the data labeling workload is huge, especially for remote sensing images. Semantic segmentation based on semi-supervised and unsupervised learning is therefore an important direction for future research.