Semantic Segmentation on Remotely-Sensed Images Using Enhanced Global Convolutional Network with Channel Attention and Domain Specific Transfer Learning

Abstract: In the remote sensing domain, it is crucial to annotate semantics, e

of the "Refinement Residual Block" from [15]. The first component of the block is a 1 × 1 convolution layer, which we use to unify the number of channels to 21 while also combining information across all channels. It is followed by a basic residual block, which refines the feature map. Furthermore, this block strengthens the recognition ability of each stage, inspired by the architecture of ResNet [7].

In this paper, there are two benchmark corpora: (i) the ISPRS Vaihingen Challenge corpus and (ii) the Landsat-8 data set. They consist of very high and medium resolution images, respectively. More details of the data sets are given in Section 4.1 and Section 4.2. Before discussing the model, it is worth explaining our data preprocessing procedure, since preprocessing is required when working with neural networks and deep learning models. Thus, mean subtraction is executed.
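The mean-subtraction step above can be sketched as follows. Computing one mean per channel over the whole batch is an assumption on our part; the paper does not state whether the statistic is per channel or per data set.

```python
# Minimal sketch of per-channel mean subtraction (assumed variant).
import numpy as np

def subtract_mean(images):
    """Subtract the per-channel mean, computed over the whole batch,
    from every image.  `images` has shape (N, H, W, C)."""
    mean = images.mean(axis=(0, 1, 2), keepdims=True)  # shape (1, 1, 1, C)
    return images - mean

batch = np.random.rand(4, 64, 64, 3).astype(np.float32)
centered = subtract_mean(batch)
# After centering, each channel of `centered` has (near-)zero mean.
```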

In addition, data augmentation is often required for more complex object recognition tasks; therefore, a random horizontal flip is applied to increase the training data. For the ISPRS corpus, all images are standardized and cropped into 512 × 512 pixels with a resolution of 9 cm²/pixel. For the Landsat-8 corpus, each image is also flipped horizontally and scaled to 512 × 512 with a resolution of 30 m²/pixel from the original images (16,800 × 15,800 pixels).

Although the GCN architecture has shown promising prediction performance, it can still be further improved by varying the backbone using ResNet [7] with different numbers of layers, i.e., ResNet50, ResNet101, and ResNet152, as shown in Figure 3. Also, GCN is suggested to work with a large kernel size. In this paper, we set the large kernel size to 9, following previous work [15].
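The flip-and-crop augmentation described above can be sketched as follows. The random crop position is our assumption; the paper may instead tile the images on a fixed grid.

```python
# Sketch of the augmentation pipeline: random horizontal flip plus a
# 512 x 512 crop.  The crop position being random is an assumption.
import numpy as np

rng = np.random.default_rng(0)

def random_hflip(image, label):
    """Flip image and label left-right together with probability 0.5."""
    if rng.random() < 0.5:
        return image[:, ::-1], label[:, ::-1]
    return image, label

def random_crop(image, label, size=512):
    """Cut a size x size patch at a random position, same window for both."""
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return (image[top:top + size, left:left + size],
            label[top:top + size, left:left + size])

img = np.zeros((600, 700, 3), dtype=np.float32)
lbl = np.zeros((600, 700), dtype=np.int64)
img, lbl = random_hflip(*random_crop(img, lbl))
```

Flipping the label map together with the image keeps every pixel aligned with its class annotation.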

Attention mechanisms [16,17] in neural networks are very loosely based on the visual attention mechanism found in humans. Human visual attention is well studied, and while different models exist, all of them essentially come down to focusing on a certain region of an image in "very high resolution" while perceiving the surrounding image in "medium resolution", and then adjusting the focal point over time.

To apply this attentional layer to our network, we use the channel attention block shown as Block "A" in Figure 5.
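Since the internal layer sizes of Block "A" are not reproduced here, the following is a generic sketch of a squeeze-and-excitation-style channel gate: global average pooling produces one statistic per channel, a learned projection `w` (hypothetical) and a sigmoid turn it into per-channel weights in (0, 1), and the feature maps are re-weighted.

```python
# Generic channel-attention sketch; the projection `w` stands in for the
# learned layers of the paper's Block "A", whose exact shapes are assumed.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(features, w):
    """Re-weight feature maps channel-wise.
    features: (C, H, W); w: (C, C) learned projection."""
    pooled = features.mean(axis=(1, 2))          # global average pool -> (C,)
    weights = sigmoid(w @ pooled)                # per-channel gate in (0, 1)
    return features * weights[:, None, None]     # broadcast over H and W

feats = np.random.rand(21, 16, 16)               # 21 channels, as in the text
gated = channel_attention(feats, np.eye(21))
# Gating scales each channel by a factor in (0, 1), never amplifying it.
```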

In our experiments, two types of data sets are used: (i) medium resolution imagery (satellite images) and (ii) very high resolution imagery. In the first type, the satellite images are from Nan, a province in Thailand. The data set is obtained from the Landsat-8 satellite and consists of 1,012 satellite images; some samples are shown in Figure 6.

Figure 6. Sample satellite images from Nan, a province in Thailand (left) and the corresponding ground truth (right). The labels of the medium resolution data set include five categories: Agriculture (yellow), Forest (green), Miscellaneous (Misc, brown), Urban (red), and Water (blue).

The multi-class classification task can be considered as multi-class segmentation, where each pixel is assigned a class. To evaluate the performance of the compared deep models, we discuss two major metrics, the F1 score and the mean of class-wise Intersection over Union (Mean IoU), on each category, together with their mean values to assess the average performance.
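The two metrics can be computed from per-class true positive, false positive, and false negative counts; the helper below is our own naming, not the paper's.

```python
# Per-class F1 and mean class-wise IoU from flattened prediction and
# ground-truth label arrays.
import numpy as np

def per_class_f1_and_miou(pred, target, num_classes):
    f1s, ious = [], []
    for c in range(num_classes):
        tp = int(np.sum((pred == c) & (target == c)))
        fp = int(np.sum((pred == c) & (target != c)))
        fn = int(np.sum((pred != c) & (target == c)))
        f1s.append(2 * tp / max(2 * tp + fp + fn, 1))   # F1 = 2TP/(2TP+FP+FN)
        ious.append(tp / max(tp + fp + fn, 1))          # IoU = TP/(TP+FP+FN)
    return np.array(f1s), float(np.mean(ious))

pred   = np.array([0, 0, 1, 1, 2, 2])
target = np.array([0, 1, 1, 1, 2, 0])
f1s, miou = per_class_f1_and_miou(pred, target, 3)
# f1s -> [0.5, 0.8, 0.6667]; miou -> 0.5
```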

The implementation is based on a deep learning framework. We use the mean pixel intersection-over-union (Mean IoU) and the F1 score as the metrics.

Inspired by [16,26,36], we use the "poly" learning rate policy, where the initial learning rate (4e−3) is multiplied by the factor in Eq. 6 with power 0.9.
All models are trained for 50 epochs with a mini-batch size of 4; each batch contains cropped images randomly selected from the training patches. These patches are resized to 512 × 512 pixels.
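Eq. 6 is not reproduced here; the standard form of the "poly" policy, consistent with the stated power (0.9) and initial learning rate (4e−3), can be sketched as:

```python
# Standard "poly" learning-rate schedule:
#   lr = base_lr * (1 - iteration / max_iter) ** power
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    return base_lr * (1.0 - iteration / max_iter) ** power

base = 4e-3
# The rate decays smoothly from base_lr at iteration 0 to zero at max_iter.
start = poly_lr(base, 0, 1000)
mid = poly_lr(base, 500, 1000)
end = poly_lr(base, 1000, 1000)
```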

The statistics of batch normalization are updated over the whole mini-batch. First, the "GCN152" method is compared to the "GCN50" and "GCN101" methods to vary the backbone using ResNet with different numbers of layers in the GCN network strategy. Second, the "GCN152-A" method is compared to the "GCN152" method for the "Channel Attention" strategy. Third, the full proposed technique, the "GCN152-TL-A" method, is compared to existing methods for the concept of domain-specific transfer learning. ResNet with a large number of layers is more robust than one with a small number of layers.

When comparing the results between the original GCN method and the enhanced GCN methods on the Landsat-8 corpus, our second mechanism focuses on applying the "Channel Attention Block" (details in Section 3.4) to change the weights of the features at each stage to enhance consistency. From Table 2, this block helps the network obtain discriminative features stage-wise and makes the prediction intra-class consistent, because all feature maps of each layer are re-weighted. Our last strategy aims to use the approach of domain-specific "Transfer Learning". From Table 2 and Table 3, the F1 of the "GCN152-TL-A" method is the winner; it clearly outperforms not only the baseline but also all previous generations. Its F1 is higher than that of DCED (the baseline) by 17.80%, and its Mean IoU is higher than that of DCED by 17.94%. The results also illustrate that the concept of domain-specific "Transfer Learning" can enhance both precision (0.8293) and recall (0.8476), as shown for the strategies we added to the network in Figure 9(c-f) and Figure 12(c-f).

To achieve the highest accuracy, the network must be configured and trained for many epochs until all parameters converge. Figure 11(a) illustrates that the proposed network was properly set up and trained until it truly converged, running more smoothly than the baseline in Figure 10(a); Figure 10(b) and Figure 11(b) are consistent with this. As can be seen in Figure 9 and Figure 12, the performance of our best model outperforms other advanced models by a considerable margin on each category, especially for Agriculture, Miscellaneous (Misc), and Water. Furthermore, the loss curves shown in Figure 11(a) exhibit that our best model performs better on all given categories.
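The domain-specific transfer-learning step, initializing a new model from weights trained on another remotely sensed corpus while re-initializing the classifier head for the new label set, can be sketched with plain weight dictionaries. All names here are illustrative, not the paper's:

```python
# Illustrative weight-transfer sketch: copy matching backbone tensors from
# a pretrained model, but leave the classifier head freshly initialized
# because the target corpus has a different set of classes.
import numpy as np

def transfer_weights(pretrained, model, skip_prefix="classifier"):
    """Copy every pretrained tensor whose name does not start with
    `skip_prefix` and whose shape matches the target model."""
    for name, tensor in pretrained.items():
        if name.startswith(skip_prefix):
            continue                      # new task, new classifier head
        if name in model and model[name].shape == tensor.shape:
            model[name] = tensor.copy()
    return model

pretrained = {"backbone.conv1": np.ones((3, 3)), "classifier.w": np.ones((21, 5))}
model      = {"backbone.conv1": np.zeros((3, 3)), "classifier.w": np.zeros((21, 6))}
model = transfer_weights(pretrained, model)
# backbone.conv1 is copied; classifier.w keeps its fresh initialization.
```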

In this subsection, the experiment was conducted on the ISPRS Vaihingen Challenge corpus. The results are shown in Table 4 and Table 5, comparing the baseline with variations of the proposed techniques. They show that our network with all strategies (GCN152-TL-A) outperforms the other methods.

More details are discussed to show that each of the proposed techniques can really improve accuracy. In this experiment, there is one baseline: the DCED network. From Table 4 and Table 5, the F1 of GCN152 (0.7864) outperforms that of GCN50 (0.776), GCN101 (0.768), and the baseline method, DCED (0.7693); this yields higher F1 by 1.04%, 1.84%, and 1.71%, respectively.

The Mean IoU of GCN152 (0.8977) outperforms that of GCN50 (0.8776), GCN101 (0.8972), and the baseline method, DCED (0.8651); this yields higher Mean IoU by 2.01%, 0.05%, and 3.26%, respectively. This implies that the enhanced GCN is also more accurate than the DCED approach on the very high resolution data set. ResNet with a large number of layers is still more robust than one with a small number of layers, the same as observed on the Landsat-8 corpus (Section 5.1.1).

When comparing the results between the original GCN method and the enhanced GCN methods on the ISPRS Vaihingen corpus, our second mechanism focuses on utilizing the "Channel Attention Block" to change the weights of the features at each stage to enhance consistency. From Table 4 and Table 5, the F1 of GCN152-A (0.7902) is greater than that of GCN152 (0.7864), a gain of 0.38%, and the Mean IoU of GCN152-A (0.9057) is better than that of GCN152 (0.8977), a gain of 0.80%.

The results (Figure 13e and Figure 14e) show that the attention block can also make the network obtain discriminative features stage-wise, making the prediction intra-class consistent on very high resolution images. From Table 4 and Table 5, the F1 of the "GCN152-TL-A" method is the winner; it clearly outperforms not only the baseline but also all previous generations. Its F1 is higher than that of DCED (the baseline) and GCN by 2.49% and 1.82%, respectively. Its Mean IoU is higher than that of DCED and GCN by 4.76% and 3.51%, respectively. Also, the improvements from the added strategies can be seen in Figure 13(c-f) and Figure 14(c-f).

To further evaluate the effectiveness of the proposed "GCN152-TL-A", comparisons with the baseline method on one challenging benchmark and one private benchmark are presented in Table 2 and Table 3 for the Landsat-8 data set on the Nan province (Thailand) corpus, and in Table 4 and Table 5 for the ISPRS Vaihingen corpus. Figure 13 and Figure 14 show twelve sample testing results from the proposed method on the ISPRS Vaihingen corpus. The results in the last column are similar to the ground truth in the second column, the same as observed on the Landsat-8 corpus. Considering each class (shown in Table 3 and Table 5), as can be seen in Figure 13 and Figure 14, the performance of our best model outperforms the other models. Domain-specific "Transfer Learning" is introduced to allay the scarcity issue by training the initial weights using other remotely sensed corpora whose resolutions can be different. The experiments were conducted on two data sets: the Landsat-8 corpus and the ISPRS Vaihingen corpus.

The following abbreviations are used in this manuscript: