DGFNet: Dual Gate Fusion Network for Land Cover Classification in Very High-Resolution Images

Deep convolutional neural networks (DCNNs) have achieved state-of-the-art performance on land-cover classification thanks to their outstanding nonlinear feature extraction ability. DCNNs for land-cover classification in very high-resolution (VHR) remote sensing images are usually designed as an encoder–decoder architecture. The encoder captures semantic representations by stacking convolution layers and shrinking the spatial resolution, while the decoder restores spatial information by upsampling and combines it with features of different levels through summation or skip connections. However, a semantic gap remains between features of different levels, and a simple summation or skip connection degrades land-cover classification performance. To overcome this problem, we propose a novel end-to-end network named the Dual Gate Fusion Network (DGFNet) to restrain the impact of the semantic gap. In detail, DGFNet consists of two main components: a Feature Enhancement Module (FEM) and a Dual Gate Fusion Module (DGFM). Firstly, the FEM combines local information with global contents and strengthens the feature representation in the encoder. Secondly, the DGFM reduces the semantic gap between features of different levels, effectively fusing low-level spatial information and high-level semantic information in the decoder. Extensive experiments conducted on the LandCover dataset and the ISPRS Potsdam dataset prove the effectiveness of the proposed network. DGFNet achieves state-of-the-art performance with 88.87% MIoU on the LandCover dataset and 72.25% MIoU on the ISPRS Potsdam dataset.


Introduction
The rapid development of remote sensing sensors allows diverse access to very high-resolution (VHR) remote sensing images. Pixel-wise land-cover classification, also known as semantic segmentation, of such images has significant application value in land resource management [1,2], urban planning [3,4], change detection [5,6], and other fields. Since optical sensors reflect the spectral characteristics of ground targets and show features consistent with the human visual system, optical remote sensing has become the mainstream approach to fine land-cover mapping. However, the clear yet complex spatial structures in VHR images greatly increase the difficulty of land-cover classification [7]. Typical land-cover classification methods can be roughly separated into three categories: pixel-based, object-based, and patch-based methods. For pixel-based methods, the spectral information provided by high-resolution images shows large intra-class variance and inter-class similarity, leading to lower land-cover mapping accuracies [8]. Furthermore, VHR remote sensing images usually contain only a few bands, and pixel-based classification considers only spectral information; it takes no account of the spatial characteristics and topological relationships of ground objects, making land-cover classification in VHR images more difficult. Object-based methods can be divided into two stages: object generation and object determination. Such methods usually first use feature extraction or clustering algorithms, such as simple linear iterative clustering (SLIC) [9], to generate objects; the spatio-temporal aggregation of multispectral data is then a good choice for determining the attributes of those objects. Patch-based methods are usually combined with DCNNs, which can capture more robust features. Different from traditional feature-extraction methods, such as SIFT [10], SURF [11], HOG [12], and ORB [13], which are expensive and require special design, deep convolutional neural networks (DCNNs) extract features automatically and have more outstanding feature expression abilities. In addition, DCNNs have a stronger nonlinear fitting ability than other classifiers, making land-cover classification more accurate.
A DCNN is a well-known model for feature learning, which automatically learns features of different levels from raw images by stacking convolutional layers and downsampling operators. In 2012, Krizhevsky et al. [14] proposed AlexNet and won the ILSVRC contest, which played a significant role in deep learning. Since then, DCNNs have seen explosive development and have been applied to different tasks, such as object detection [15–17], semantic segmentation [18–20], and image retrieval [21–23]. For semantic segmentation, Long et al. [18] pioneered the fully convolutional network (FCN), which predicts pixel-level labels in an end-to-end manner. However, such an architecture captures semantic information by stacking convolution layers with non-linearities and downsampling, reducing the spatial information of the original images. Considering this, U-Net [24] adopted skip connections for feature fusion, reusing low-level features to retain spatial detail to a certain extent. SegNet [25] recorded the corresponding max-pooling indices during encoding and used them during decoding to improve performance. DANet [26] added two types of attention modules to the traditional dilated FCN to model semantic interdependence in the spatial and channel dimensions separately. PSPNet [27] introduced a pyramid pooling module that aggregates context information over different regions to mine global context and improve segmentation. HRNet [28] achieves strong semantic information and precise location information through parallel multi-resolution branches and continuous information exchange between them.
Methods based on DCNNs were naturally introduced into remote sensing scenes. Differing from natural images, VHR remote sensing images are much larger in scale and much higher in radiometric resolution, so they contain many complicated scenes. For example, there are many multi-scale surface objects, such as huge buildings and small cars. For small-scale ground objects, structural information may be lost as spatial resolution decreases, resulting in poor segmentation. According to these characteristics of remote sensing images, some researchers have built networks based on multi-scale feature fusion. Nogueira et al. [29] used dilated convolution [30–32] to enhance the context information of aggregated features. Li et al. [33] designed an additional branch that takes the boundary information of the original images as input to improve segmentation. Marmanis et al. [34] combined multi-scale features of different layers and used auxiliary digital surface model (DSM) data to improve land-cover classification accuracy. In [35], Wang et al. proposed a gated convolutional neural network named GSN that uses the entropy of low-level features as a gate to refine high-level features. The core idea of the above research is to fuse features of different levels directly. It is worth noting that low-level features in the shallow layers of a DCNN provide more detailed structural information, while high-level features in the deeper layers contain more discriminative semantic information. If this difference in semantic information is ignored, direct fusion inevitably embeds the background noise of low-level features, affecting the robustness of the fused features and possibly losing detailed spatial information. We regard the difference between features of different levels as a semantic gap and propose an end-to-end network named the Dual Gate Fusion Network (DGFNet). In detail, DGFNet consists of two main components: a Feature Enhancement Module (FEM) and a Dual Gate Fusion Module (DGFM). The FEM combines local information with global contents and strengthens the feature representation in the encoder. The DGFM reduces the semantic gap between features of different levels, effectively fusing low-level spatial information and high-level semantic information in the decoder. In general, the main contributions of this paper can be summarized as follows:
1. We propose a simple but efficient encoder–decoder segmentation network, which effectively captures global content and fuses multi-level features, improving the performance of land-cover classification in VHR images.
2. We propose a novel feature enhancement module (FEM) that combines local information and global context information, enhancing the representation of features at different layers.
3. A dual gate fusion module (DGFM) with a gate mechanism is proposed, which effectively promotes the fusion of low-level spatial features and high-level semantic features.
4. Exhaustive experiments are conducted to prove the effectiveness of the proposed network. We achieve state-of-the-art performance of 88.87% MIoU on the LandCover dataset and 72.25% MIoU on the ISPRS Potsdam dataset.
The remainder of this paper is organized as follows: related work is presented in Section 2. In Section 3, we introduce the proposed DGFNet in detail. Section 4 presents the experimental details and results that validate our approach, followed by the conclusions in Section 5.

DCNNs in Land-Cover Classification
Land-cover classification (also known as semantic segmentation) in VHR remote sensing images is difficult because of the large scale of the original images in this pixel-level task, which results in significant intra-class variation and inter-class similarity (e.g., trees and low vegetation). Since Long et al. [18] first built a fully convolutional network (FCN) to achieve state-of-the-art performance on pixel-level tasks, a considerable number of works have focused on land-cover classification. For example, Mou and Zhu [36] proposed RiFCN, which recursively embeds features of different scales into the learning framework to achieve accurate boundary inference and land-cover classification. To increase the representation capacity of FCN-like frameworks for land-cover classification in high-resolution remote sensing images, they also designed a relation module [37] to describe the relationships between observations in convolved images and to augment feature representations. Liu et al. [38] improved classification results by integrating a spatial and channel relation-enhanced block into neural networks, which increases the variety of receptive field sizes. Chen et al. [39] fused features from different DCNN layers to restore spatial resolution and improve land-cover classification. Sun and Wang [40] used an additional digital surface model (DSM) to restore dark areas, such as shadows; they integrated the spectral information of color images with the geometric information of the DSM, which improves the accuracy of land-cover classification. Diakogiannis et al. [41] proposed a novel DCNN architecture named ResUNet-a, which uses a UNet encoder/decoder as its backbone in combination with residual connections, atrous convolutions, pyramid scene parsing pooling, and multi-task inference. ResUNet-a infers a series of tasks sequentially, including the boundary of the objects, the distance transform of the segmentation mask, the segmentation mask itself, and a colored reconstruction of the input. Each task is conditioned on the inference of the previous ones, establishing a conditioned relationship between the various tasks and improving the final land-cover classification performance.
The above work fully demonstrates the powerful feature extraction ability of DCNNs in land-cover classification. With the increase in spatial resolution, VHR remote sensing images capture more diverse scenes, which provide rich geometric and feature information but also increase the difficulty of land-cover classification and reduce classification accuracy to a certain extent. Owing to their superior feature extraction ability, models based on DCNNs can be applied to complex scenes, including VHR images. At present, DCNN-based methods have become one of the mainstream approaches to land-cover classification in VHR images.

Gate Mechanism in Neural Networks
Long short-term memory (LSTM) [42] is a famous framework for processing sequence data. Through its gate mechanism, LSTM selectively transmits previous information to the current state, so it can effectively handle long-distance dependencies. More recently, Dauphin et al. [43] and Gehring et al. [44] extended the notion of gate mechanisms to convolutional networks: they regard a convolutional layer without a non-linear function, followed by a unit with a sigmoid function, as a "gate" unit. Li and Kameoka [45] used such a gated CNN architecture instead of LSTM to model word sequences for language modeling and showed that it outperforms LSTM language models trained in a similar setting. Subsequently, the gate mechanism has been applied to various computer vision tasks. Yang et al. [46] combined the gate mechanism with hybrid connectivity for image classification to retain the capability of feature re-exploitation to some extent, improving classification accuracy. Yu et al. [47] used gated convolution instead of partial convolution to obtain better restoration for image inpainting, and Chang et al. [48] proposed 3D gated convolutions to tackle the uncertainty of free-form masks in video inpainting. Rayatdoost et al. [49] utilized the gate mechanism to fuse features among different models for emotion recognition from facial behaviors. Cao et al. [50] built a classification network with a linear skip gated connection that benefits information propagation for action recognition. For aspect-category sentiment analysis (ACSA) and aspect-term sentiment analysis (ATSA) tasks, Xue and Li [51] proposed an efficient convolutional neural network with gating mechanisms that achieves state-of-the-art performance in related fields.
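As a concrete illustration of the gated convolutional unit described above, the following minimal PyTorch sketch pairs a linear (activation-free) convolution with a sigmoid-activated gate branch, in the spirit of Dauphin et al. [43]; the class and layer names are ours, chosen for illustration.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Minimal gated convolution: a convolution without a non-linearity,
    modulated by a parallel sigmoid-activated 'gate' convolution."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)  # linear branch
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)     # gate branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # each output activation is scaled by a learned weight in (0, 1)
        return self.feature(x) * torch.sigmoid(self.gate(x))

# quick shape check
y = GatedConv2d(64, 64)(torch.randn(2, 64, 32, 32))
print(y.shape)  # torch.Size([2, 64, 32, 32])
```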
The research mentioned above verifies that the gate mechanism can promote the fusion and transmission of feature information. For pixel-level land-cover classification tasks, DCNNs usually adopt an encoder–decoder architecture: the encoder captures semantic representations by stacking convolution layers and shrinking the spatial resolution, and the decoder restores spatial information by upsampling and combines it with features of different levels through summation or skip connections. Due to the semantic gap between features of different levels, simple addition and skip-connection operations cannot fully fuse feature maps of different levels. Considering the character of the gate mechanism, we use it instead of summation and skip-connection operations to integrate features of different levels. In short, embedding the gate mechanism in neural networks is a simple and effective method for feature learning and fusion.

Methods
With the increase in remote sensing image resolution, the distinct and complex spatial structures of remote sensing images become more visible. DCNNs capture semantic representations with global contents by stacking convolution layers with non-linearities and downsampling. This reduces the spatial information of the original images, which may greatly affect land-cover classification accuracy. At present, existing methods cannot recover the lost spatial information well. Therefore, DCNN-based architectures still need an effective fusion approach in the decoder.
In this section, we first introduce the overall framework of DGFNet, shown in Figure 1. The two main modules of DGFNet, the FEM and the DGFM, are then described in detail.

Overall Framework
Our overall segmentation model is shown in Figure 1. We adopt the encoder–decoder architecture proposed in FCN [18] as the semantic segmentation framework. The encoder is composed of a single-stream feature extractor and the feature enhancement module. In the encoder stage, we use ResNet-50 [52], combined with the atrous spatial pyramid pooling (ASPP) module proposed in [30], which consists of dilated convolutions with different rates, as the feature extractor to obtain feature maps of different levels. Following the feature extractor, the FEM combines local information with global contents, making the extracted features more robust. In the decoder stage, we propose a dual gate fusion module (DGFM) to fuse the low-level (shallow-layer) and high-level (deeper-layer) features, making information fusion more sufficient, which benefits the recovery of spatial details in the decoding phase. The FEM and DGFM are detailed in the subsequent subsections.
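To make the wiring concrete, the sketch below shows one plausible way to assemble the encoder from off-the-shelf torchvision components. The stage-to-level mapping, the ASPP dilation rates, and all names are our assumptions for illustration, not the authors' released code; the FEM and DGFM are sketched in the following subsections.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.models._utils import IntermediateLayerGetter
from torchvision.models.segmentation.deeplabv3 import ASPP

class EncoderSketch(nn.Module):
    """Hypothetical encoder wiring: ResNet-50 stages as multi-level
    features plus ASPP on the deepest map, mirroring Figure 1."""

    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)
        # expose the four residual stages as low -> high level feature maps
        self.body = IntermediateLayerGetter(
            backbone,
            return_layers={"layer1": "c2", "layer2": "c3",
                           "layer3": "c4", "layer4": "c5"})
        # dilated convolutions with different rates; the rates are an assumption
        self.aspp = ASPP(in_channels=2048, atrous_rates=[6, 12, 18])

    def forward(self, x: torch.Tensor):
        feats = self.body(x)                  # dict of multi-level features
        feats["c5"] = self.aspp(feats["c5"])  # enrich the deepest level with context
        return feats

feats = EncoderSketch()(torch.randn(1, 3, 320, 320))
print({k: tuple(v.shape) for k, v in feats.items()})
```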

Feature Enhancement Module
The semantic segmentation problem can be divided into pixel-wise classification and localization tasks [53]: the classification task requires global contents, obtained by stacking small convolution kernels, while the localization task calls for large convolution kernels, so the requirements on kernel size are contradictory. DCNNs stack small-kernel convolutions with downsampling to obtain global context, which reduces the spatial resolution of the original images. Inspired by the SE module proposed in [54], we design the FEM to combine local information with global context instead of using large convolution kernels. Differing from the SE module, we define a context aggregation over the input feature maps, making the representation of the global context more explicit. As shown in Figure 2, the FEM consists of two substructures: a channel regulation structure (Figure 2a) and a context aggregation structure (Figure 2b).

We model the channel regulation structure as a simple residual layout, whose main purpose is to regulate the channels of the feature maps of different levels from the feature extractor. More specifically, we define the regulated feature map V as

V = WU + R(U),    (1)

where W denotes the parameters of a linear transformation, U is a coarse feature map, and R(·) is the residual branch. We observe that Equation (1) is similar to the formula of image sharpening, so the channel regulation structure helps enhance the spatial information to a certain extent.

Following the channel regulation structure, we design the context aggregation structure to combine global contents with local information. Let X ∈ R^(C×H×W) be the feature map of one input instance, where C is the number of channels, x_i ∈ R^N is the i-th channel of X flattened into a vector, N = H × W is the number of positions in the feature map, and y ∈ R^(C×1×1) and Z ∈ R^(C×H×W) denote the intermediate context and the output, respectively. The output of the context aggregation structure can be formulated as

z_i = x_i ⊕ H(W_l y),    (2)

where the context y = G(X) is computed by the softmax-weighted aggregation

G(X) = Σ_{j=1}^{N} ( e^(W_k x_j) / Σ_{m=1}^{N} e^(W_k x_m) ) x_j,    (3)

with x_j here denoting the feature vector at position j, ⊕ the broadcast element-wise product operator, H(·) a batch-normalization operator followed by a ReLU function, and W_k and W_l the parameters of linear transformations. In Equation (3), the term e^(W_k x_j) / Σ_{m=1}^{N} e^(W_k x_m) indicates the weight of each position in the feature map, so the weighted sum aggregates the context information over all positions. The final output z_i thus combines the global context y with the local information x_i to strengthen the representation of features at each level. Compared to using large convolution kernels to capture global contents, the context aggregation structure needs fewer parameters and computing resources: for input features X ∈ R^(C×H×W), covering the full feature map with large kernels requires C² × H × W parameters, whereas the context aggregation structure requires only C × (S + 2) with S ≪ C.
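A minimal PyTorch sketch of how Equations (1)–(3) could be realized is given below. The kernel sizes, the two-convolution residual branch R, and the use of a single-channel 1 × 1 convolution for the position-scoring transform W_k are our assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FEMSketch(nn.Module):
    """Sketch of the FEM: channel regulation V = WU + R(U) (Eq. 1),
    then softmax-weighted context aggregation (Eqs. 2-3)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # channel regulation: W as a 1x1 transform, R as a small residual branch
        self.W = nn.Conv2d(in_ch, out_ch, 1)
        self.R = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1))
        # context aggregation: W_k scores positions, W_l transforms the context
        self.Wk = nn.Conv2d(out_ch, 1, 1)
        self.Wl = nn.Conv2d(out_ch, out_ch, 1)
        self.H = nn.Sequential(nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        v = self.W(u) + self.R(u)                                 # Eq. (1)
        b, c, h, w = v.shape
        attn = F.softmax(self.Wk(v).view(b, 1, h * w), dim=-1)    # position weights
        y = torch.bmm(v.view(b, c, h * w), attn.transpose(1, 2))  # Eq. (3): context, (b, c, 1)
        # Eq. (2): broadcast product of local features with the transformed context
        return v * self.H(self.Wl(y.view(b, c, 1, 1)))

fem = FEMSketch(256, 256).eval()  # eval() so BatchNorm runs on a tiny demo batch
print(fem(torch.randn(2, 256, 40, 40)).shape)  # torch.Size([2, 256, 40, 40])
```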

Dual Gate Fusion Module
As is well known, high-level (deeper-layer) features contain more discriminative information; utilizing it allows the category of objects (including the background) to be identified more accurately. Low-level (shallow-layer) features contain more spatial information, which helps restore spatial details. Most DCNNs directly fuse these features of different levels using simple element-wise addition, e.g., FCN [18], or skip connections, e.g., U-Net [24]. As there exists a semantic gap between shallow-layer and deeper-layer features, direct fusion inevitably embeds the background noise of the deeper-layer features. Considering that information is transferred from high-level features to low-level features in the decoding stage, we design a dual gate fusion module to combine high-level semantic information with low-level spatial structural information, which makes information fusion more effective.

As shown in Figure 3, let X_l and X_h represent the low-level and high-level features, respectively. The first gate, the "position gate" p_t, is defined as

p_t = σ(W_p X_l + b_p),    (5)

where W_p and b_p are the weights and bias of a linear transformation, and σ is the sigmoid function, which keeps p_t between 0 and 1. The "position gate" p_t indicates, through the low-level features, how important each spatial position of the high-level features is. To make full use of the discriminant information of the high-level features, we define the second gate, the "filter gate" f_t, as

f_t = σ(W_f X_h + b_f),    (6)

where, as with W_p and b_p, W_f and b_f are the weights and bias of a linear transformation and σ is the sigmoid function. The "filter gate" f_t decides how to refine the low-level features, suppressing their background noise to a certain extent through the high-level features. Finally, with ⊗ denoting the broadcast element-wise multiplication operator, the output feature X_o can be expressed as

X_o = p_t ⊗ X_h + f_t ⊗ X_l.    (7)

Based on Equation (7), we use the "position gate" p_t to refine the high-level features X_h and the "filter gate" f_t to update the low-level features X_l. In doing so, features of different scales mutually constrain each other, which promotes the integration of low-level spatial features and high-level semantic features.
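The following PyTorch sketch transcribes Equations (5)–(7) directly. Using 1 × 1 convolutions as the linear transforms W_p and W_f is our assumption, and X_h is assumed to have already been upsampled to the resolution and channel width of X_l.

```python
import torch
import torch.nn as nn

class DGFMSketch(nn.Module):
    """Sketch of the dual gate fusion: the position gate p_t (from X_l)
    re-weights X_h, while the filter gate f_t (from X_h) refines X_l."""

    def __init__(self, ch: int):
        super().__init__()
        self.Wp = nn.Conv2d(ch, ch, 1)  # linear transform for the position gate
        self.Wf = nn.Conv2d(ch, ch, 1)  # linear transform for the filter gate

    def forward(self, x_l: torch.Tensor, x_h: torch.Tensor) -> torch.Tensor:
        p_t = torch.sigmoid(self.Wp(x_l))  # Eq. (5): where to trust X_h spatially
        f_t = torch.sigmoid(self.Wf(x_h))  # Eq. (6): how to filter X_l's noise
        return p_t * x_h + f_t * x_l       # Eq. (7): mutually gated fusion

fuse = DGFMSketch(256)
x_l, x_h = torch.randn(2, 256, 80, 80), torch.randn(2, 256, 80, 80)
print(fuse(x_l, x_h).shape)  # torch.Size([2, 256, 80, 80])
```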

Dataset Description
LandCover Dataset: this dataset, proposed in [55], contains 41 tiles of RGB images covering areas across Poland, grouped into four common land-cover categories: building, woodland, water, and background. In detail, the dataset contains 33 tiles at a resolution of 25 cm (ca. 9000 × 9500 pixels) and 8 tiles at a resolution of 50 cm (ca. 4200 × 4700 pixels), covering 176.76 km² and 39.51 km², respectively, or 216.27 km² overall. In [55], these images are partitioned into non-overlapping patches by a 512 × 512 grid, yielding 10,674 patches, of which 7470 are used for training, 1602 for validation, and 1602 for testing. Examples of the LandCover dataset are shown in Figure 4.
ISPRS Potsdam Dataset: the ISPRS Potsdam dataset [56] consists of 38 aerial image tiles (6000 × 6000 pixels), each with a corresponding DSM at the same spatial resolution of 5 cm. The dataset contains the six most common land-cover categories, namely impervious surfaces (e.g., roads), buildings, low vegetation, trees, cars, and clutter/background. Among the tiles, 24 are available for training and the remaining 14 for testing. The images are partitioned into non-overlapping patches by a 400 × 400 grid, yielding 5400 patches for training and 3150 for testing. In addition, the images in the ISPRS Potsdam dataset have different channel compositions, including IRRG, RGB, and RGBIR; in this paper, we only use the RGB channels for training and testing. Examples of the ISPRS Potsdam dataset are shown in Figure 5.
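For reference, the grid partitioning used by both datasets can be reproduced with a few lines of NumPy. Dropping incomplete border strips, as done here, is consistent with the reported total of 10,674 LandCover patches when the tiles match their nominal sizes, although the dataset tooling may handle borders differently.

```python
import numpy as np

def tile_to_patches(tile: np.ndarray, patch: int = 512) -> list:
    """Partition an (H, W, C) tile into non-overlapping patch x patch chips;
    incomplete border strips are dropped."""
    h, w = tile.shape[:2]
    return [tile[r:r + patch, c:c + patch]
            for r in range(0, h - patch + 1, patch)
            for c in range(0, w - patch + 1, patch)]

# A 9000 x 9500 LandCover tile yields (9000 // 512) * (9500 // 512) = 17 * 18 = 306 chips;
# with 33 such tiles plus 8 tiles of 4200 x 4700 (8 * 9 = 72 chips each),
# 33 * 306 + 8 * 72 = 10,674, matching the total reported above.
print(len(tile_to_patches(np.zeros((4200, 4700, 3), dtype=np.uint8))))  # 72
```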

Parameter Setting
We employ a pre-trained ResNet-50 as the backbone network of DGFNet, implemented in PyTorch. We use a standard stochastic gradient descent (SGD) optimizer with momentum 0.9 and weight decay 0.001. Data augmentation with random Gaussian blur and random flipping is applied at each iteration during training. The learning rate is scheduled by the poly policy, starting at 7 × 10⁻³ for the LandCover dataset and 5 × 10⁻³ for the ISPRS Potsdam dataset. All comparative experiments are trained with a batch size of 20 for the LandCover dataset and 16 for the ISPRS Potsdam dataset. We retrain all models on one NVIDIA A100 GPU for 500 epochs on both datasets. For the loss function, we only use the cross-entropy loss, defined as

L = − Σ_c t_c log(p_c),

where t_c is a one-hot vector and p_c indicates the predicted probability that the sample belongs to class c.
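The schedule above can be restated in a few lines of PyTorch. The poly exponent (power = 0.9 is a common choice) and the toy model and loop are our assumptions for illustration; only the optimizer settings, base learning rates, and the cross-entropy loss come from the text.

```python
import torch
import torch.nn as nn

def poly_lr(base_lr: float, cur_iter: int, max_iter: int, power: float = 0.9) -> float:
    """'Poly' policy: lr decays as (1 - iter / max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

model = nn.Conv2d(3, 4, 1)  # stand-in for DGFNet (4 LandCover classes)
optimizer = torch.optim.SGD(model.parameters(), lr=7e-3,
                            momentum=0.9, weight_decay=1e-3)
criterion = nn.CrossEntropyLoss()  # implements L = -sum_c t_c log(p_c)

max_iter = 10  # illustrative; the paper trains for 500 epochs
for it in range(max_iter):
    images = torch.randn(2, 3, 320, 320)         # dummy batch
    labels = torch.randint(0, 4, (2, 320, 320))  # dummy pixel labels
    for group in optimizer.param_groups:
        group["lr"] = poly_lr(7e-3, it, max_iter)
    loss = criterion(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```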

Evaluation Metrics
To assess quantitative performance, four mainstream metrics for semantic segmentation are used: pixel accuracy (PA), mean pixel accuracy (MPA), mean intersection over union (MIoU), and frequency-weighted intersection over union (FWIoU). Suppose there are K land-cover classes. Let m_ij be the number of pixels belonging to class i that are predicted to belong to class j, m_i = Σ_{j=1}^{K} m_ij the total number of pixels belonging to class i, and n_i = Σ_{j=1}^{K} m_ji the total number of pixels predicted as class i. These metrics can be defined as

PA = Σ_{i=1}^{K} m_ii / Σ_{i=1}^{K} m_i,
MPA = (1/K) Σ_{i=1}^{K} (m_ii / m_i),
MIoU = (1/K) Σ_{i=1}^{K} m_ii / (m_i + n_i − m_ii),
FWIoU = (1 / Σ_{k=1}^{K} m_k) Σ_{i=1}^{K} m_i · m_ii / (m_i + n_i − m_ii).

The MIoU, FWIoU, MPA, and PA describe the global land-cover classification performance. For example, a 0.1% improvement in PA indicates that millions of additional pixels are identified correctly in such a pixel-level task. In addition, the MIoU, FWIoU, and MPA can avoid land-cover classification bias caused by the class imbalance of the LandCover and ISPRS Potsdam datasets. Besides these mainstream metrics, we use the classical metrics precision (P), recall (R), and F1-score as auxiliary metrics to evaluate classification results. For class i, they are defined as

P_i = TP_i / (TP_i + FP_i),
R_i = TP_i / (TP_i + FN_i),
F1_i = 2 · P_i · R_i / (P_i + R_i),

where TP is the true positives, FP the false positives, TN the true negatives, FN the false negatives, and the index i indicates that the sample belongs to class i.
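These definitions translate directly into a confusion-matrix computation. The sketch below is a plain NumPy transcription; it assumes every class occurs at least once in both the ground truth and the predictions, otherwise the divisions need guarding.

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """PA, MPA, MIoU, FWIoU from a K x K confusion matrix where
    conf[i, j] counts pixels of class i predicted as class j."""
    m_i = conf.sum(axis=1).astype(float)  # pixels belonging to class i
    n_i = conf.sum(axis=0).astype(float)  # pixels predicted as class i
    tp = np.diag(conf).astype(float)
    iou = tp / (m_i + n_i - tp)
    pa = tp.sum() / conf.sum()
    mpa = np.mean(tp / m_i)
    miou = np.mean(iou)
    fwiou = np.sum((m_i / conf.sum()) * iou)
    return pa, mpa, miou, fwiou

conf = np.array([[50, 2, 1],
                 [3, 40, 2],
                 [1, 1, 30]])
print(segmentation_metrics(conf))
```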

Influence of Different Modules on Classification
To verify the effectiveness of the proposed FEM and DGFM, we conducted a series of ablation experiments on the LandCover dataset. In the ablation experiments, we used a 1 × 1 convolution combined with SE [54] instead of the FEM and a simple addition operation instead of the DGFM as our baseline. As the numerical results in Table 1 show, when we use the FEM instead of the 1 × 1 convolution and SE alone, the MIoU and MPA increase by 1.97 and 1.48 percentage points, respectively. This shows that the FEM can effectively combine local information and global features to some extent. When we use the DGFM instead of the addition operation, the MIoU increases by 2.55 percentage points and the other metrics also improve to different degrees, demonstrating the effectiveness of the proposed DGFM and indicating that it helps fuse low-level spatial features and high-level semantic features. Finally, when we integrate both the FEM and DGFM into the baseline, the MIoU shows an obvious increase of 3.15 percentage points. More specifically, from the per-class IoU results shown in Table 2, the IoU of the category "buildings" improves by 8.54 percentage points. Compared with the other categories, buildings occupy small regions in these VHR remote sensing images, which means that our DGFNet can use global contents to correctly detect small objects.

Influence of Different Training Sizes on Classification
In addition to the ablation experiments on the proposed feature enhancement module (FEM) and dual gate fusion module (DGFM), we also conduct a series of experiments using different input image sizes to train DGFNet, testing on the LandCover test set. As shown in Table 3, we crop the images to 3 × 288 × 288, 3 × 320 × 320, 3 × 352 × 352, and 3 × 384 × 384 to train DGFNet. We observe the best results when training with images of size 3 × 384 × 384: classification results improve with increasing training image size to a certain extent. Owing to the feature enhancement module, DGFNet can capture more global context as the training image size increases, improving land-cover classification performance.

LandCover Dataset

To show the effectiveness of DGFNet, the proposed method is compared with state-of-the-art methods, as listed in Table 4. DANet [26], PSPNet [27], FCN [18], and deeplabv3+ [32] use a pre-trained ResNet-50 as their backbone, DenseASPP [57] takes DenseNet [58] as its backbone, and all of them are implemented with the PyTorch framework. In addition, our method is compared with other research recently published on the same dataset, such as DFFAN [59] and MFANet [60]. As the results in Table 4 show, our method achieves the best overall performance.

In Table 5, per-class IoU is computed to estimate the performance of recognizing distinct objects. The results indicate that our network is better at distinguishing small-scale objects, such as buildings, because the feature enhancement module combines local information with global contents, strengthening the feature representation. From another perspective, our network can better distinguish complex background information owing to the DGFM, which makes features of different levels mutually constrain each other and promotes the integration of low-level spatial information and high-level semantic information. In Table 6, we also obtain the best performance in terms of P, R, and F1-score.

Figures 6 and 7 show several example results on remote sensing images in different scenarios. For the more complex scene in Figure 6, our prediction (Figure 6g) is closest to the real land-cover classification result (Figure 6b). In more detail, as displayed in the first and second rows of Figure 6, other methods mistakenly classify woodland as background, whereas our proposed model distinguishes each category correctly, largely thanks to the designed feature enhancement module: the FEM integrates local and global information to enhance the representation ability of features, so it recognizes different objects very well. In addition, our model restores the spatial resolution more accurately, e.g., the boundary of the water in the third row. Different from Figure 6, the scene in Figure 7 is simple, but the spatial information is clearer, which makes it convenient to observe the restoration of the spatial structure of the original images, such as the boundary of the woodland. Compared with other methods, our model better recovers the spatial edge details of objects. For example, comparison models such as deeplabv3+ and SegNet cannot recover the edge information of the woodland in the first row of Figure 7, and a large area of woodland is misclassified as background in the third row, which makes the boundary discontinuous.
In contrast to the other methods, our DGFNet shows the most complete and accurate land-cover mapping results. This indicates that our network has an advantage in distinguishing complex scenes and recovers spatial information better, because the dual gate fusion module reduces the semantic gap between different levels, allowing low-level spatial features and high-level semantic features to be fused more effectively. Compared with other networks, we obtain more precise land-cover classification results, which verifies the above conclusion again.

ISPRS Potsdam Dataset
To further illustrate the effectiveness of the proposed DGFNet, we also conducted comparative experiments on the ISPRS Potsdam dataset. Compared with the LandCover dataset, the scenes in the ISPRS Potsdam dataset are more complex and include more small targets, such as vehicles. As the experimental results in Table 7 show, our network achieves the best results in terms of MIoU, FWIoU, MPA, and PA. Compared with DANet, our model improves MIoU, FWIoU, MPA, and PA by 12.04, 7.73, 10.08, and 5.16 percentage points, respectively. Compared with the other state-of-the-art methods, DGFNet increases the MIoU and MPA by at least 1.24 and 1.79 percentage points, and the other metrics improve to different degrees. For the per-class IoU results shown in Figure 8, DGFNet has a better capability to detect small objects, such as cars. This success in detecting small targets is owed to the FEM, which combines local information with global content to strengthen the representation of features at different levels. In Table 8, we also obtain the best performance in terms of R and F1-score.

Figure 9 shows the visualization results of three different scenarios on the test set. In the first row, the scene contains occluded objects, such as low vegetation disturbed by shadows. SegNet cannot distinguish the low vegetation in the shadow area, while deeplabv3+, HRNet, and DenseASPP discriminate only part of it; DGFNet recognizes the low vegetation in the whole shadow area completely. The second scenario focuses on small objects. As shown in the second row of Figure 9, other models miss the small cars, while DGFNet has a better capability to detect the category "car". This again shows that the FEM in DGFNet effectively combines local information with global content to enhance the identification of small targets. The last scenario is the case of different objects with the same spectrum. As shown in the third row, the buildings and roads are similar in appearance, which means only a slight bias between the different classes. In this case, the other models perform worse, producing many misjudgments, whereas our DGFNet distinguishes the different categories correctly and restores the spatial details completely; for example, it recovers the boundary of the building better, as shown in the third row of Figure 9. This indicates that the DGFM can suppress the semantic gap between different scales and effectively fuse the low-level and high-level features.

Model Size and Efficiency Analysis
To analyze the size and efficiency of the proposed model, we calculate the number of trainable parameters and the average inference time for a single image on the LandCover dataset. The input size for all networks is 3 × 320 × 320. As shown in Table 9, the number of parameters of our network is comparable to that of DANet, but the MIoU is higher by 13.09 percentage points. At the same time, compared with the other models, our network achieves the best performance in terms of MIoU. However, the mean inference time for a single image is higher than that of all other models except DenseASPP. This is because we adopt the DGFM in the decoding stage: the DGFM promotes the fusion of low-level spatial features and high-level semantic features to improve land-cover classification accuracy, but its multiple branches increase the inference time. On the premise of ensuring land-cover classification accuracy, our future work will focus on designing a lightweight network model to improve computational efficiency.
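For reproducibility, the measurement protocol described above (trainable parameters plus mean single-image inference time at 3 × 320 × 320) can be scripted as follows. The warm-up and run counts are our choices; torch.cuda.synchronize() is required for meaningful GPU timings.

```python
import time
import torch
import torch.nn as nn

def profile(model: nn.Module, runs: int = 100):
    """Return (trainable parameter count, mean per-image inference seconds)."""
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(1, 3, 320, 320, device=device)
    with torch.no_grad():
        for _ in range(10):  # warm-up iterations
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return n_params, (time.perf_counter() - start) / runs

print(profile(nn.Conv2d(3, 4, 1)))  # stand-in model
```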

Conclusions
In this paper, a simple but efficient segmentation network named DGFNet is proposed for land-cover classification in VHR remote sensing images. The proposed DGFNet contains two novel modules: the FEM and the DGFM. The FEM combines local information with global contents, strengthening the representation ability of feature maps in the encoder. The DGFM, with its gate mechanism, makes features of different levels constrain each other, which promotes the fusion of multi-scale features and the restoration of spatial structure information. With these well-motivated modules, DGFNet captures more robust features by combining local information and integrating multi-scale features, improving the performance of land-cover classification in VHR images. Exhaustive experiments prove the effectiveness of the proposed DGFNet, which achieves state-of-the-art performance of 88.87% MIoU on the LandCover dataset and 72.25% MIoU on the ISPRS Potsdam dataset.

Acknowledgments: The authors would like to thank the anonymous reviewers for their constructive comments and suggestions.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

VHR: very high-resolution
DCNN: deep convolutional neural network
FCN: fully convolutional network
DGFNet: Dual Gate Fusion Network
FEM: feature enhancement module
DGFM: dual gate fusion module
ASPP: atrous spatial pyramid pooling
DSM: digital surface model
GAP: global average pooling
SGD: stochastic gradient descent
LSTM: long short-term memory
PA: pixel accuracy
MPA: mean pixel accuracy
MIoU: mean intersection over union
FWIoU: frequency-weighted intersection over union