AMM-FuseNet: Attention-Based Multi-Modal Image Fusion Network for Land Cover Mapping

: Land cover mapping provides spatial information on the physical properties of the Earth’s surface for various classes of wetlands, artiﬁcial surface and constructions, vineyards, water bodies, etc. Having reliable information on land cover is crucial to developing solutions to a variety of environmental problems, such as the destruction of important wetlands/forests, and loss of ﬁsh and wildlife habitats. This has made land cover mapping become one of the most widespread applications in remote sensing computational imaging. However, due to the differences between modalities in terms of resolutions, content, and sensors, integrating complementary information that multi-modal remote sensing imagery exhibits into a robust and accurate system still remains challenging, and classical segmentation approaches generally do not give satisfactory results for land cover mapping. In this paper, we propose a novel dynamic deep network architecture, AMM-FuseNet that promotes the use of multi-modal remote sensing images for the purpose of land cover mapping. The proposed network exploits the hybrid approach of the channel attention mechanism and densely connected atrous spatial pyramid pooling (DenseASPP). In the experimental analysis, in order to verify the validity of the proposed method, we test AMM-FuseNet with three datasets whilst comparing it to the six state-of-the-art models of DeepLabV3+, PSPNet, UNet, SegNet, DenseASPP, and DANet. In addition, we demonstrate the capability of AMM-FuseNet under minimal training supervision (reduced number of training samples) compared to the state of the art, achieving less accuracy loss, even for the case with 1/20 of the training samples.


Introduction
During the past few decades, human activities have posed serious threats to the environment, such as over-logging, over mining, illegal hunting, plastic pollution [1], which makes it necessary to monitor the Earth for the purpose of preventing damages to the environment. With the rapid development of remote sensing technology in the last couple of decades, various space-or air-borne remote sensing sensors have been made available to provide useful and large-scope information about the Earth, such as forest cover, glacier conditions, ocean surface, urban construction. Thus, utilizing remote sensing images for the purpose of environmental applications has become a feasible solution, but there are still many challenges for using remote sensing images in various environmental applications. To name but a few, (1) it is expensive and challenging to obtain high-resolution images for all possible problematic areas; (2) passive remote sensors (e.g., optical) are at the mercy of the clouds and the amount of sunshine; and (3) the contents of remote sensing images are generally very complicated and difficult to analyze.
In order to help overcome the aforementioned challenges and beyond, taking the advantage of leveraging different remote sensing modalities is a potential solution. Various environmental applications can benefit from multi-modal image fusion by exploiting complementary features provided by different types of remote sensors. Specifically, for active-passive sensor data fusion, passive-optical sensors play the role of feeding the system with high spectral resolution of the Earth surface, which are useful for image analysis. However, for optical remote sensing imagery, in order to provide multi-spectral information, they tend to reduce spatial resolution so as to maintain acceptable bandwidth [2,3]. Additionally, this type of sensor is subject to the weather conditions and only provide useful information during the daytime and under clear weather conditions. On the contrary, active remote sensing sensors are able to acquire images without being affected by inclement weather conditions, such as heavy fog, storm, sandstorm, snowstorm. Additionally, they usually provide sufficient textural and structural information of observed objects [4]; however, they are mostly not capable of collecting color/spectral information. Lastly, since synthetic aperture radar (SAR) images are obtained via wave reflection, they have an important problem that degrades statistical inference, which is the presence of multiplicative speckle noise. The received back-scattered signals sum up coherently and then undergo nonlinear transformations. This in turn gives the resulting images a granular appearance, which is referred to as speckle noise [5]. Considering all these advantages and disadvantages, exploring remote sensing data coming from different modalities becomes crucial for many environmental applications via making use of the complementary advantages of each type of sensors.
Land cover mapping is one of the most widespread and important remote sensing applications in the literature. This is because, nowadays, decisions that concern the environment made by governments, politicians or organizations highly depend on adequate information for many complex interrelated aspects, where land cover/use is one such aspect. Furthermore, an improved understanding of land cover can help act to solve environmental problems, such as disorganized construction, loss of prime agricultural lands, destruction of important wetlands or forests, and loss of fish and wildlife habitats [6].
Land cover classification or mapping is a long-established application area that has been developing since the 1970s. The earliest Landsat land cover classification approach was mostly based on visual and manual approaches. This was done by drawing boundaries of different land cover types and marking each of the land cover classes [7]. In the late 1970s, with the development of computer technology, digital image analysis has become more widespread, and some platforms such as geographic information systems (GIS) were developed to make the analysis of remote sensing data more convenient. Following this, utilizing computer-based approaches for the purpose of land cover classification has become the common practice by geographic analysis specialists. In addition, due to the development of early automatic image processing methods, such as smoothing, sharpening and feature extraction [8], geographic experts have been able to use various traditional image processing algorithms to help perform land cover classification. Although one can generate digital land cover maps by using computers, manual annotation is generally required, which is time consuming and labor intensive. Specifically, in cases when the target scene accommodates plenty of objects to be classified, and the scene covers huge areas, manual annotation becomes more challenging.
In recent years, with the great success of deep learning in computer vision, automated land cover classification approaches have been significantly improved, which assign a class for each pixel among many particular classes, such as artificial surfaces, cultivated areas, and water bodies, as shown in Figure 1. Due to the similarity of land cover mapping and semantic segmentation, researchers have started to use segmentation networks to perform land cover mapping. Furthermore, machine learning based end-to-end frameworks make use of remote sensing data (spatial and spectral information) to achieve better performance in land cover classification compared to the traditional pixel-based methods [9]. However, considering the diversity and complexity of remote sensing data and along with imbalanced training samples, it is still challenging to achieve high performance for land cover classification. For the purpose of improving the performance of land cover classification, multi-modal data fusion is an important choice whilst exploiting complementary features of different modalities. Although, in the current circumstances, multi-modal remote sensing data are available with the development of remote sensing sensors and observation techniques (e.g., active and passive), the literature is still far from fully leveraging the advantages of using multiple modalities for land cover classification. In order to successfully implement multi-modal remote sensing image fusion for environmental applications, there generally are two types of fusion approaches: (1) machine learning based; and (2) traditional methods, such as component substitution (CS) and multi-scale decomposition (MSD).
Although classical/traditional image fusion methods have been well studied for a few decades, there still are various challenges; for example, (i) precise and complex registration processing is required before the fusion step; (ii) it is highly dependent on the correlation between images being fused; and (iii) it is likely to lose information during the fusion process while replacing a part of the component of the original data during the processing.
On the other hand, ML-based methods generally demonstrate more powerful outcomes for image fusion. Thus, utilizing machine learning approaches in remote sensing imagery related applications has become a hot topic in the literature. However, since the contents in remote sensing images appear very different to the classical natural images, widely used network structures for natural images are not capable of and optimal for processing remote sensing imagery. Meanwhile, the machine/deep learning approaches of remote sensing data fusion, especially for the environmental applications, can still be seen in early stages. Therefore, more robust and generalizable machine learning based methods, specifically for remote sensing data fusion, need to be explored and developed in order to provide suitable and accurate solutions for land cover applications, and beyond.
In this paper, we propose a novel dynamic deep network architecture, AMM-FuseNet, for the purposes of the land cover mapping application. AMM-FuseNet promotes the use of multi-modal remote sensing images whilst exploiting the hybrid approach of the channel attention mechanism and densely connected atrous spatial pyramid pooling (DenseASPP). In order to verify the validity of the proposed method, we test AMM-FuseNet under four test cases from three datasets (Potsdam [10], DFC2020 [11] and Hunan [12]). A comparative study is implemented to test AMM-FuseNet performance against six state-ofthe-art network architectures of DeepLab V3+ [13], PSPNet [14], UNet [15], SegNet [16], DenseASPP [17], and DANet [18]. The contributions of this paper are as follow: We design a novel encoder module, which combines a channel-attention mechanism and densely connected atrous spatial pyramid pooling (DenseASPP) module. This proposed feature extraction module enhances the representational power of the network by successfully weighting the output features obtained by the atrous convolution. This module can be easily extended to any other networks with an encoder-decoder structure.

2.
We propose a machine learning based land cover mapping method specifically suitable for multi-modal remote sensing image fusion. The proposed network extracts information from multiple modalities in a parallel fashion, but performs training with a single loss function to make use of their complementary features. Meanwhile, the encoders in a parallel fashion show a better ability to cope with minimal (small number of) training samples. This has been experimentally validated in a set of test cases (Section 5.3), where we gradually reduce the number of training samples and measure the model performance using the same test sample set.

3.
The proposed hybrid network exploits and combines many advantages of existing networks for the purpose of improving the performance of land cover mappings. The encoder of the proposed network combines two feature extraction modules (ResNet and Dense ASPP) in a parallel fashion to improve the feature extraction capabilities for each modality. In order to make more efficient use of the extracted features, skip connections are used to benefit from the low-, middle-, and high-level features at the same time. The proposed multi-modal image fusion network shows competitive performance for land cover mapping and outperforms the state of the art.
The rest of the paper is organized as follows: a general background and literature review are presented in Section 2, whilst Section 3 presents the proposed method AMM-FuseNet. Section 4 gives details of the datasets we have used, and Section 5 covers the experimental analysis. Section 6 concludes the paper with a brief summary and future works.

Related Work
Following the development in big data research area and its effects on computer vision research, especially in recent years, multi-modal remote sensing data for various applications have been made available under open-access licences (e.g., ESA Sentinel-1/2, fusion contest datasets, including DEM, airborne/UAV-based optical data). Thanks to their complementary features, multi-modal remote sensing imagery provides much richer information compared to single modality, especially for land cover/use applications. However, in the literature, most of the land cover mapping papers still use single modality data [19][20][21]. Along with the technical developments in computational imaging and deep/machine learning research, the usage of multi-modal for land cover mapping data [22,23], despite being in early stages and insufficient, have started to appear in some works in the recent years. Considering the increasing demand for multi-modal information for land cover mapping in the literature, obviously, the key challenge of this research lies within answering this research question: "how to make efficient use of complementary features in multi-modal remote sensing data?". One of the most common answer to this question in the literature is to implement an image fusion approach which directly concatenates multi-modal images and provide them as the input of land cover mapping networks. Furthermore, Land cover mapping can be basically described as a classification application, which classifies each pixel of remote sensing images to one of various categories (analogous to the semantic segmentation). Especially in the last decade, semantic segmentation networks, such as UNet [15], DeepLabv3+ [13], SegNet [16] and PSPNet [14], developed rapidly, and have achieved great success for some natural imaging datasets, such as COCO [24] and PASCAL VOC 2012 [25]. Thus, one can develop a machine learning approach built upon classical semantic segmentation networks mentioned above and try to develop some improved architectures for the purpose of land cover mapping application.
When it comes to pixel-level classification, either semantic segmentation in computer vision or land cover classification/mapping in remote sensing, fully convolutional networks (FCNs) [26] have made a considerable contribution, which have made these models and their variants become the state of the art in the literature. SegNet [16] and UNet [15] adopt a symmetrical encoder-decoder structure and skip connections, whilst making use of multistage features in the encoder. Alternatively, PSPNet [14] proposes to use a pyramid pooling structure, which provides a global contextual prior to pixel-level scene parsing. Instead of leveraging from the traditional convolution layer used in the aforementioned networks, atrous convolution and atrous spatial pyramid pooling (ASPP) are proposed in Deeplab architecture [27]. This fact helps the DeepLab architecture to exploit the ability to perceive multi-scale spatial information, even using fixed-size convolution kernels. Regardless of the fact that the ASPP can benefit from acquiring information from multi-scale features, DenseASPP [17] argues that the feature resolution in the scale-axis is not dense enough. Thus, DenseASPP combined dense networks [28] and an Atrous convolution network to generate densely scaled receptive fields. DeepLabv3+ [13] proposed an improved hybrid approach that combines an encoder-decoder structure and the ASPP, which can control the resolution of extracted encoder features, trade-off precision and runtime via setting different dilation rates. Specifically, in DeepLabv3+, appending the ASPP module after a backbone of ResNet makes the network exploit deeper levels and extract high-level features with an aim of improving the performance around the segmentation boundaries [29,30]. However, DeepLabv3+ just simply concatenates the two levels of features that come from the output of the first backbone layer and the output of the ASPP module in its decoder. In this case, the network misses the features of the intermediate process of extracting features, which basically reduces the classification performance.
Remote sensing imagery has more complex challenges compared to natural images, such as (1) hardly separable land cover classes; (2) imbalanced class distributions; and (3) imagery content under strong random noise, such as speckle in radar imagery. These might sometimes cause the aforementioned semantic segmentation networks to achieve unsatisfactory results. For the purpose of finding solutions for the challenges mentioned above, the literature includes some semantic segmentation networks for land cover classification applications. Fusion-FCN [31] improves the FCN network and uses it for multi-modal remote sensing for land cover classification, and this network is the winner of 2018 IEEE GRSS Data Fusion Contest. DKDFN [12] is also based on FCN and collaboratively fuses multimodal data and assimilates highly generalisable domain knowledge (e.g., remote sensing indices such as NDVI, NDBI, and NDWI) at the same time. The performance of DKDFN is better than that of some of these classical semantic segmentation networks such as UNet, SegNet, PSPNet, DeepLab. It is worth noting that both Fusion-FCN and DKDFN extract multi modal features with different encoders, and this use of multi-encoders also appears in RGB-D fusion for the semantic segmentation of natural images [32]. DISNet [33] is another network for land cover classification, which uses the DeepLabv3+ framework and only adds an attention-mechanism-based module in both the encoder and decoder of the network. This change improves the performance of land cover classification compared to the original DeepLabv3+ [13]. Xia [29] also implemented DeepLabv3+ and similarly proposed the global attention based up-sample module in the networks. Xia's network also passes multi-level features to the decoder to obtain efficient segmentation with accurate results. Similarly, Lei [34] proposed a multi-scale fusion network based on a variety of attention mechanisms for land cover classification, which also shows competitive performance. In order to combine the advantages of UNet and DeepLabv3+, ASPP-U-Net [35] was proposed for land cover classification and showed better results compared to the UNet and DeepLabv3+.

The Proposed AMM-FuseNet Architecture
In order to make use of multi-modal data for the purposes of land cover classification application, we propose a deep neural network named AMM-FuseNet, which explores the effect of channel attention on multi-modal data fusion. Specifically, a couple of channel attention modules are applied in the main structure and we also propose a channel attention based feature extractor called CADenseASPP. In the sequel, we explain the stages of the proposed AMM-FuseNet.

Channel-Attention Module
Channel attention is a kind of squeeze-and-excitation block, which focuses on finding channels that are meaningful and encourage decoders to use the most relevant features. Generally, the outputs of various channel attention modules are a weighted combination of input features according to the importance of each channel. Thus, the use of channel attention modules is prone to enhance the representative power of networks, and channel attention is able to make features more informative. Specifically, this paper adopts the efficient channel attention (ECA-Net) [36] module, which spatially squeezes the feature map and excites along the channel, as shown in Figure 2. Assume the feature map is where u i ∈ R H×W refers to ith channel of the feature map. Spatially squeezing means applying a global average pooling operation to generate a vector Z ∈ R 1×1×C , where the number of corresponding channels is C. In order to consider local cross-channel interaction, the vector will be input to a 1-dimensional convolution layer W ∈ R k×C , where k refers to the size of the convolutional kernel. Thus, the feature values Z for each channel become (2) Here, we use convolution layers rather than fully connected (FC) ones since the convolution operation has much lower complexity compared to the FC, despite achieving similar performance (recommended in [36]). Following this, a sigmoid layer σ(·) is used to normalize the transformed vectorẐ into [0, 1]. Finally, the channel attention will perform excitation of the original feature map along the channel aŝ It is worth noting that the value of σ(ẑ i ) is the attention score of the ith channel, which represents the importance of the channel in the feature map.

CADenseASPP Module
This paper proposes a channel-attention-based dense atrous spatial pyramid pooling (CADenseASPP) module (named after the DenseASPP [17] module) which is an extension of DenseNet [28]. As is mentioned in Section 3.1, channel attention is able to produce a weighted combination of features and make these derived features more informative. This paper also explores the effect of the channel attention module for features obtained by atrous convolution to promote better representation of features in the proposed model. CADenseASPP combines DenseASPP and channel attention modules, and as shown in Figure 3, channel attention modules are appended just after each Atrous convolution layer to weight the output features by their channel importance scores. Following this, similar with the DenseNet [28] architecture, all features obtained from each atrous convolution branch are concatenated together with features from the additional pooling and identity mapping layers. Due to its dense characteristics, the proposed module also shares the advantages of DenseNet [28], including alleviating the gradient-vanishing problem and having substantially fewer parameters.

AMM-FuseNet
The proposed multi-modal land cover mapping architecture, AMM-FuseNet, exploits the following: As shown in Figure 4, the proposed network exploits the use of multi-modal imagery via two encoders in a parallel fashion. AMM-FuseNet adopts ResNet-50 as its backbone and implements CADenseASPP to extract information from low level features that are obtained by the first convolution layer of the backbone. Additionally, in order to compensate for the low resolution of high-level features and make use of feature maps from middle or early layers, skip-connections are designed from encoder to decoder to make use of multi level features. Additionally, all branches from the backbone are optimized by corresponding channel attention modules. Then all features are up-sampled into the same spatial size and concatenated together with features obtained by the CADenseASPP module. In the AMM-FuseNet decoder, all features coming from both encoders are concatenated, and finally a segmentation head, consisting of two convolution layers and a ReLu activation function, carries out semantic segmentation. The proposed network, AMM-FuseNet, is specifically designed for two-modality data fusion applications. The differences between multi-modal remote sensing data caused by different technologies for obtaining information (e.g., Sentinel-2 data mainly focuses on spectral information whilst DEM data mainly collects elevation of the objects) might cause various problems. To name but a few, (i) low correlation between multi modal images; (ii) registration error especially for large scale land cover data sets. Therefore, when twomodality data have low correlation and strong registration error, sharing parameters of the same encoder will not be optimal and cause severe performance degradation. In addition, using two encoders for multi-modal data fusion has been proved to be more efficient in [37] compared to various deep learning architectures. Thus, the proposed method splits each modality into different encoders to promote their useful features in different network levels, yet fusion operation of the multi-modal features has been performed in the later stages of the architecture.
The proposed AMM-FuseNet promotes using two feature extractors in parallel for each data modality to efficiently make use of low-, middle-and high-level features (grey and green boxes in Figure 4). ResNet-50 (grey) provides the information that has a deeper understanding of the semantic information. On the other hand, the CADenseASPP (green) module is used for acquiring low-level features, which are helpful for the extraction of textural features. Meanwhile, in order to avoid losing information in the process of the forward propagation of ResNet-50, the proposed network also exploits features in each stage of ResNet-50 to increase the semantic knowledge.
Since there are many channels in remote sensing data, typically Sentinel-2 (13 bands), the channel attention mechanism plays a key role in reducing the impact of redundant data and in making use of information coming from the most relevant channels. On the contrary, for natural images, e.g., RGB images only have 3 bands, exploiting all information from all spectral bands is general practice in the literature. However, when the number of channels of input data increases, channel attention becomes highly useful to make efficient use of 'important' channels and reduce the role of 'unimportant' channels in order to obtain accurate classification or segmentation results. In addition, the usage of multiple channel attention module makes the proposed AMM-FuseNet a kind of dynamic neural network. Since the channel attention module rescales the features with input-dependent soft attention, applying the attention mechanism on features is equivalent to performing convolution with dynamic weights, which has been proved to enhance the representation power of deep networks [38]. Thus, we choose to use channel attention in the skip-connection branches and feature extractors.

Methodological Framework
A methodological overview of the whole approach is given in Figure 5. Each dataset used in this paper provides non-overlapping training and testing data. Both training and testing data include multi-modal remote sensing images and corresponding labels. The whole methodological approach consists of training and testing stages: • In the training stage, shown in the upper branch of Figure 5, multi-modal images are given as input to the deep learning networks followed by acquiring the land cover prediction. Then, the prediction and corresponding ground truth are used to calculate the loss by using some functions, such as cross entropy, mean square error and/or Dice loss functions. Then, this loss value is used to update the parameters in the networks by using optimizers, such as stochastic gradient descent (SGD) [39], which is illustrated in detail in Section 5.1.

•
In the testing stage, shown in the lower branch of Figure 5, multi-modal images are given as input to the trained models, and the land cover mapping predictions are obtained. Different from the training stage, the network will not be updated, and the prediction and ground truth are used for calculating the performance of the utilized network by using related assessment metrics of overall accuracy (OA), user's accuracy (UA), producer's accuracy (PA), mean intersection over union (mIoU), and F1-score, which are also stated in detail in Section 5.1.

Data
In this paper, we use three open-access multi-modal data sets, which are constructed for land cover classification application, namely (1) Hunan [12], (2) Potsdam [10] and (3) DFC2020 [11] datasets. Although most of the open-access land cover mapping datasets in the literature focus on single modality, all of the three datasets used in this paper are multi-modal remote sensing image datasets for land cover mapping. In particular, the DFC2020 and Potsdam datasets are highly representative and commonly used in the literature, whilst the Hunan dataset is a new dataset published in 2022 [12]. It includes three different remote sensing modalities for land cover mapping, which makes it highly suitable for testing the proposed method's performance. Some details regarding all three utilized datasets are presented in Table 1

Hunan
Hunan [12] is a multi-modal dataset for land cover mapping of Hunan province in China. This dataset consists of three remote sensing modalities of multi-spectral (Sentinel-2), SAR (Sentinel-1) and SRTM digital elevation model data (DEM). Specifically, the temporal resolutions of Sentinel-2 MSI and Sentinel-1 SAR imagery in Hunan are 5 and 6 days (combined constellation), respectively [40,41], which were captured in 2017. The SRTM (shuttle radar topography mission) is mounted on a space shuttle and obtains Earth surface data by utilizing a synthetic aperture radar. During its 11-day flight, from 11 February 2000 to 22 February 2000, it obtained data covering 80% of the Earth's surface [42]. The obtained data were converted into digital elevation model (DEM) data, which provide height information of the Earth surface. More details for this dataset are listed in Table 2. All 13 bands in Sentinel-2 are used in our experiments. Sentinel-1 data in Hunan dataset were pre-processed by thermal noise removal, radiometric calibration, terrain correction and logarithmic conversion. There are two bands in Sentinel-1 data, corresponding to dual-polarization of VV and VH, respectively. SRTM provides both elevation and slope data, which provide extra topographic information, but only elevation data are used in our experiments. The creators of the Hunan dataset resampled all data to a spatial resolution of 10 m by the default resampling strategy nearest neighbor in GEE [12]. Since the Hunan dataset contains three different modalities of remote sensing images as mentioned above, we have therefore divided them into two fusion pairs of (i) Sentinel-2 and Sentinel-1, and (ii) Sentinel-2 and DEM. The Hunan dataset consists of 500 image tiles for each modality, as well as their corresponding land cover labels. The size of all the images is 256 × 256. Geology experts manually labeled the data according to the Sentinel-2 mosaic. This dataset contains 7 imbalanced class labels, which are cropland (23.34%), forest (42.37%), grassland (7.35%), wetland (1.89%), water (13.35%), unused land (1.56%), and built-up area (10.14%). This distribution of land cover classes is based on the data collected in Hunan province, China in 2017.

DFC2020
DFC2020 is based on the SEN12MS dataset [43], which provides Sentinel-1 SAR imagery, Sentinel-2 multispectral imagery, and corresponding land cover maps on 7 areas (see Table 3) in the world between 2016 and 2017. The temporal resolution and collection time of modalities in DFC2020 is listed in Table 4. The size of all patches is 256 × 256 pixels. The fine-grained IGBP classification scheme in SEN12MS was aggregated to 10 coarsergrained classes, which are forest (11.3%), shrubland (6.9%), savanna (23.6%), grassland (16.8%), wetlands (1.1%), croplands (17.9%), urban/built-up (10.6%), snow/ice (0.0%), barren (5.2%), and water (6.5%). The class distributions are similar to the SEN12MS and DFC2020 datasets. On the other hand, since the DFC2020 dataset is a subset of SEN12MS, there is one class showing zero percentage. The comparison between the standard IGBP classes of SEN12MS and DFC2020 label classes can be found in detail in [11].  Table 4. Temporal resolution and collection time of the modalities in DFC2020.

ISPRS Potsdam
The ISPRS Potsdam Semantic Labeling dataset is an open-access benchmark dataset provided by the International Society for Photogrammetry and Remote Sensing (ISPRS). This dataset provides 38 multi-source patches (all of size 6000 × 6000), which contains infrared (IR), red, green and blue orthorectified optical images with corresponding digital surface models (DSM). For calculation purposes, we sub-divided all these data tiles into 512 × 512 patches, which leads to 3456 and 2016 samples for the training and test, respectively. The ground sampling distance of the two modalities of the true orthophoto (TOP) and the DSM is 5 cm. This dataset was classified manually into six land cover classes, which are impervious surfaces, buildings, low vegetation, trees, cars, clutter/background.

Implementation Details
We implemented our methods by using the Pytorch [44] environment. Following [13], we used a mini-batch SGD optimizer and adopted a poly learning rate policy, where the current learning rate equals the initial learning rate multiplied by 1 − iter max-iter power .
We set the initial learning rate and power to 0.01 and 0.9, respectively. The batch size of the input data was also set to 10. The objective function for training models is cross-entropy loss function. All the experiments were performed in the GW4 Supercomputer Isambard [45], the details of which are shown in Table 5. In the comparison analysis, we compared the proposed AMM-FuseNet to six state-ofthe-art models of DeepLabv3+, Unet, SegNet, PSPNet, DenseASPP, and DANet. In order to make the reference models suitable for loading multi-band images, the number of input channels in the first convolution layer was set to the number of bands of each dataset. By following the original papers of the reference models, we initiated the backbones of DeepLabv3+, SegNet, PSPNet, DANet with pretrained weights on ImageNet. The AMM-FuseNet has two experimental versions, namely the initial backbone with and without pretrained weights on ImageNet.
We comprehensively analyzed all the models by quantifying performance via class-related measures, such as overall accuracy (OA), user's accuracy (UA), producer's accuracy (PA), mean intersection over union (mIoU), and F 1 -score. It is worth noting that the UA represents correct positive predictions relative to the total positive predictions, whilst the PA represents correct positive predictions relative to total actual positives. Expressions of all five performance metrics are given as follows: where TP, TN, FP, and FN refer to the numbers of pixels that are true positives, true negatives, false positives, and false negatives for each class, respectively.

Quantitative Results and Analysis
The proposed method was tested from two different perspectives:

1.
We first used four test data coming from the three data sets discussed in Section 4 in order to promote the generalization capacity of the proposed method.

2.
Subsequently, we tested the capabilities of AMM-FuseNet under minimal training supervision in comparison to the reference methods.
For the first set of experiments, we first demonstrated the performance of all methods for each land cover class from the Hunan data set (Sentinel 1/2). This analysis aims to demonstrate the capabilities of each model for different land cover classes and different multi-modal remote sensing data. Figure 6 depicts IoU values of each of the seven land cover classes of the Hunan data set. Examining the bar plots in Figure 6, AMM-FuseNet achieved the highest IoU values for five out of seven categories, with particular considerable improvements for cropland, wetland and bare land. For the remaining two classes (forest and grassland), the IoU results are closer for each model, whilst SegNet, UNet, DeepLabv3+ and AMM-FuseNet compete to become the best performing method. In order to test the overall performance of each model, as mentioned above, we created four pairs of two-modality remote sensing data from three data sets. In order to reflect a better comprehensive performance of the models, we used mIoU, average OA, UA, PA and F 1 of all land cover classes for each model. We present these values in Tables 6-9. Please note that in our initial analysis, pretraining on ImageNet does not yield a consistent result due to the different characteristics of each dataset. Hence, we decided to test each model with both a pretrained backbone and a backbone that was not pretrained, and then only share the best performing result for each with an indicator in the "PT" columns in each table. Furthermore, in order to provide visual evidence for the performance metrics presented in Tables 6-9, we plotted land cover mapping outputs of each model for a randomly selected image from each of the two-modality data pairs in Figure 7-10.
Examining the performance metrics in Table 6, the proposed AMM-FuseNet showed the best performance on the first two-modality Hunan (Sentinel 1/2) data set for most of the cases. In particular, AMM-FuseNet obtained 4% UA gain compared to the second best model. Furthermore, the mIoU value of AMM-FuseNet is about 3 % higher than that of the second best model whilst PSPNet is slightly better than AMM-FuseNet in terms of the PA value. The second two-modality data set consists of Sentinel-2 and DEM modalities, which also come from the Hunan data set. AMM-FuseNet shows the best performance in terms of all performance metrics presented in Table 7. The proposed network has a considerable mIoU improvement of more than 4% when compared with the second best model, UNet, whilst around 5% improvement is obtained in terms of F 1 score compared to the DeepLabV3+. From the two experimental sets of Hunan dataset presented in Tables 6 and 7, the stateof-the-art models show instabilities when dealing with different combinations of multimodal imagery. For example, apart from AMM-FuseNet, DeepLabv3+ achieved the second best performance for most of the metrics on Sentinel 1/2 fusion whereas, in terms of Sentinel-2 and DEM fusion, UNet appeared to beat DeeplabV3+ and became the best. Despite these inconsistencies of the state-of-the-art models, the proposed AMM-FuseNet has shown consistent performance with better generalisability. It showed competitive performance either for Sentinel 1/2 or Sentinel-2/DEM fusion, even there are no pretrained weights to initialize the network.
Similar to the first Hunan dataset, the DFC2020 dataset also provides Sentinel 1/2 imagery. Examining the performance results presented in Table 8, AMM-FuseNet showed a competitive performance on the DFC2020 dataset in terms of all performance metrics. AMM-FuseNet improved slightly for all metrics compared to the second best network, Unet. Furthermore, the performance difference between the AMM-FuseNet and the remaining networks is relatively high. For example, there are more than 6% and 2% improvements in terms of mIoU and OA, respectively, between the proposed network and PSPNet. As of the last analysis of the first set of experiments, we tested all models on the Potsdam dataset which consists of IRRGB and DEM imagery. Unlike previous datasets, the proposed AMM-FuseNet has not shown the best performance when compared to the other networks. AMM-FuseNet is about 1% lower than the best model as shown in Table 9. Our first thought on this result is based on the dramatic, significant differences between the Potsdam dataset and the three previously used test cases in terms of frequency bands, semantic details and spatial resolution. We are going to discuss this in the next section in more detail. In order to demonstrate results visually and provide visual evidence for the performance metrics presented in the tables above, we chose to show four cases from each two-modality data sets. As shown in Figures 7-10, we depict the RGB imagery, ground truth land cover labels and results predicted by all the models. On each example of land cover mapping results, we draw a rectangular box to help readers to compare the differences between different models.
For the Hunan dataset, the prediction of AMM-FuseNet is visually closer to the ground truth compared with other predictions. Additionally, for DFC2020, the prediction of AMM-FuseNet is more accurate when compared to other models. Similarly, for Potsdam, AMM-FuseNet shows the ability to understand the data more deeply. Only AMM-FuseNet and UNet correctly recognize the single tree structure (green area in the rectangular box) in the Potsdam case, where all other networks incorrectly detect more tree related regions. Additionally, the UNet prediction shows a redundant and incorrect classification in the upper right part of the rectangular box (red pixels).

Minimal Supervision Analysis
As we mentioned in the previous section, all the models showed similar performance for the Potsdam dataset, where the proposed AMM-FuseNet could not manage to obtain the best performance outcome in terms of all five metrics. We consider three categories of potential reasons for these results.

1.
Due to the lower number of spectral bands in the data set, AMM-FuseNet's ability to extract useful information was affected and became closer to the state of the art.

2.
Since all imagery is collected in an urban area with very high spatial resolution (5 cm), discrimination between the objects becomes much easier compared to the other datasets, which leads to all methods performing at a similar level. 3.
The Potsdam dataset has a relatively high number of training samples compared to the other datasets (around four times compared to DFC2020 considering the input image size of 512 × 512). This has given the state-of-the-art networks enough data samples to fully learn.
Considering remote sensing imagery applications mostly having highly limited number of labeled samples (as in the case of Hunan dataset), the Potsdam experimental analysis cannot directly lead to correct conclusions for the purposes of remote sensing imagery applications.
Our purpose whilst developing AMM-FuseNet (along with having good performance in land cover mapping applications) is to make it suitable for minimal training supervision cases. AMM-FuseNet separates each modality into different encoders via utilising the same ground truth information, for the purpose of inheriting the advantages of consistency regularization [46], which shows considerable success in minimal supervision cases in the literature [47,48]. The key idea of consistency regularization is to force perturbed models (or perturbed inputs) to have consistent outputs to supervise each other, even when there is a very small number of labeled data. In this case, one modality can be regarded as a perturbed version of another modality and they are expected to have the same output. Thus, the weights in two encoders are supervised by each other during the training.
In order to experimentally prove this point and provide numerical evidence, we gradually reduced the number of training samples of the Potsdam dataset. Although the number of training samples is reduced, the class distributions in each minimal supervision test case is kept relatively unchanged. In addition, we used the same fixed set of test data samples for all the minimal supervision test cases so as to enable a consistent comparison of the results. Details of each test case in this set of experiments are shown in Table 10. The results of each minimal training supervision test cases for 1 2 − 1 20 are shown in Tables 11 and 12. Whilst gradually reducing the number of training samples, AMM-FuseNet starts to show its capability to work with smaller number of training samples. In contrast, the state-of-the-art approaches are affected by the small number of training samples and their results deteriorated dramatically. In particular, from ratio 1 6 , namely 576 training samples, AMM-FuseNet starts to be the best model in terms of mIoU and overall accuracy on the Potsdam dataset by having lesser performance loss. Even for only 173 training samples for the 1 20 case, the proposed AMM-FuseNet obtained more than 60% mIoU and 80% overall accuracy values. In order to visually draw a conclusion on the values in the tables, we depict two graphs (Figures 11 and 12) of percentage decreases on mIoU and accuracy for different partitions of the Potsdam training samples for each network. It is clear to see that performance loss of AMM-FuseNet is 8% in terms of mIoU and 4% of accuracy even though the training set is 20 times smaller than the original dataset. This provides numerical/experimental evidence to our remark that the proposed AMM-FuseNet architecture is better able to cope with cases where the data set has few training samples, and its performance improvement can be seen better under test cases with a small number of training samples. It is also evident that single-encoder architectures show drastic performance drops when the number of training samples is reduced. It is also important to note that methods such as DeepLabV3+ and Unet, which are two of the best approaches on the Hunan and DFC2020 datasets, have the highest performance drops in terms of both mIoU and accuracy, showing they are highly sensitive to the number of training samples.

Conclusions
In this work, we proposed a channel-attention based multi modal image fusion network, AMM-FuseNet, constructed with a proposed novel feature extraction module, CA-DenseASPP. The proposed network showed competitive performance when compared to the state-of-the-art segmentation networks, such as DeepLabv3+, SegNet and Unet, and appeared to be more robust and generalisable when applied to various multi-modal remote sensing data sets. For most of the cases with different remote sensing modalities of RGB, multi-spectral, SAR and DEM, the AMM-FuseNet showed a consistent performance by being the best model in terms of various performance metrics.
In presenting the proposed approach AMM-FuseNet in this paper, • We contributed to the literature with a multi-modal attention-based deep network architecture with improved land cover mapping/classification performance compared to the state of the art. • The parallel feed of multi-modal remote sensing information into the hybrid proposed encoder module of CADenseASPP improved the segmentation performance dramatically via weighting features in a dense atrous convolution operation. • It was experimentally proven by using the Potsdam data that the proposed network showed more powerful performance under small number of training sample (minimal training supervision) despite its relatively complex structure due to having two parallel encoders. This issue proved our contribution to the literature that the proposed approach could be a great choice for the segmentation applications that only have a small amount of labeled information. • As it stands, AMM-FuseNet appears as a candidate model to be a high-performing approach for other segmentation tasks in remote sensing and computer vision beyond the land cover mapping application.
The indistinguishable performance of all models under the full-Potsdam dataset has shown us future research directions in terms of the AMM-FuseNet architecture. Ongoing work includes performing a detailed complexity analysis, and exploring AMM-FuseNet's capabilities (i) to deal with minimal supervision, as well as (ii) extracting useful information from higher spatial resolution imagery.