1. Introduction
Over the past few decades, human activities such as over-logging, over-mining, illegal hunting and plastic pollution have posed serious threats to the environment [1], which makes it necessary to monitor the Earth in order to prevent further environmental damage. With the rapid development of remote sensing technology over the last couple of decades, various space- or air-borne sensors have become available that provide useful, large-scale information about the Earth, such as forest cover, glacier conditions, the ocean surface and urban construction. Utilizing remote sensing images for environmental applications has therefore become feasible, but several challenges remain. To name but a few: (1) it is expensive and challenging to obtain high-resolution images for all possible problematic areas; (2) passive remote sensors (e.g., optical) are at the mercy of cloud cover and the amount of sunshine; and (3) the contents of remote sensing images are generally very complicated and difficult to analyze.
Leveraging different remote sensing modalities is a potential solution to these challenges and beyond. Various environmental applications can benefit from multi-modal image fusion by exploiting the complementary features provided by different types of remote sensors. In active-passive sensor data fusion, passive optical sensors provide high spectral resolution observations of the Earth's surface, which are useful for image analysis. However, in order to provide multi-spectral information within an acceptable bandwidth, optical remote sensing imagery tends to sacrifice spatial resolution [2,3]. Moreover, this type of sensor is subject to weather conditions and only provides useful information during the daytime and under clear skies. In contrast, active remote sensing sensors can acquire images unaffected by inclement weather, such as heavy fog, storms, sandstorms, or snowstorms. They usually also provide rich textural and structural information about observed objects [4]; however, they are mostly incapable of collecting color/spectral information. Lastly, since synthetic aperture radar (SAR) images are obtained via wave reflection, they suffer from an important problem that degrades statistical inference: multiplicative speckle noise. The received back-scattered signals sum coherently and then undergo nonlinear transformations, which gives the resulting images a granular appearance referred to as speckle noise [5]. Considering all these advantages and disadvantages, exploiting remote sensing data from different modalities, and thereby the complementary advantages of each type of sensor, becomes crucial for many environmental applications.
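As a worked illustration of the multiplicative nature of speckle, the following NumPy sketch multiplies a noise-free reflectivity image by unit-mean Gamma-distributed noise, a commonly used model for L-look SAR intensity imagery; the model and the number of looks are illustrative assumptions, not a description of any specific sensor used in this paper.

```python
# Minimal sketch of the multiplicative (Gamma) speckle model: observed intensity
# = noise-free reflectivity x unit-mean Gamma noise. Values are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def add_speckle(reflectivity: np.ndarray, looks: int = 4) -> np.ndarray:
    """Multiply a noise-free reflectivity image by unit-mean Gamma speckle."""
    speckle = rng.gamma(shape=looks, scale=1.0 / looks, size=reflectivity.shape)
    return reflectivity * speckle

clean = np.ones((64, 64))             # flat, noise-free scene
noisy = add_speckle(clean, looks=1)   # single look: strongest granular appearance
print(noisy.mean(), noisy.std())      # mean ~1, std ~1 for a single look
```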
Land cover mapping is one of the most widespread and important remote sensing applications in the literature. This is because decisions concerning the environment made by governments, politicians or organizations nowadays depend strongly on adequate information about many complex, interrelated aspects, and land cover/use is one such aspect. Furthermore, an improved understanding of land cover can help address environmental problems such as disorganized construction, loss of prime agricultural land, destruction of important wetlands or forests, and loss of fish and wildlife habitats [6].
Land cover classification, or mapping, is a long-established application area that has been developing since the 1970s. The earliest Landsat land cover classification was mostly visual and manual, performed by drawing the boundaries of different land cover types and labeling each land cover class [7]. In the late 1970s, with the development of computer technology, digital image analysis became more widespread, and platforms such as geographic information systems (GIS) were developed to make the analysis of remote sensing data more convenient. Computer-based land cover classification subsequently became common practice among geographic analysts. In addition, the development of early automatic image processing methods, such as smoothing, sharpening and feature extraction [8], allowed experts to use various traditional image processing algorithms to support land cover classification. Although digital land cover maps can be generated by computers, manual annotation is generally still required, which is time consuming and labor intensive. Annotation becomes even more challenging when the target scene contains many objects to be classified and covers huge areas.
In recent years, with the great success of deep learning in computer vision, automated land cover classification approaches have improved significantly; such approaches assign each pixel to one of several classes, such as artificial surfaces, cultivated areas and water bodies, as shown in Figure 1. Owing to the similarity between land cover mapping and semantic segmentation, researchers have started to use segmentation networks to perform land cover mapping. Machine learning based end-to-end frameworks exploit both the spatial and spectral information in remote sensing data and achieve better land cover classification performance than traditional pixel-based methods [9]. However, given the diversity and complexity of remote sensing data along with imbalanced training samples, achieving high land cover classification performance remains challenging.
Multi-modal data fusion, which exploits the complementary features of different modalities, is an important means of improving land cover classification performance. Although multi-modal remote sensing data are now widely available thanks to the development of remote sensing sensors and observation techniques (e.g., active and passive), the literature is still far from fully leveraging the advantages of multiple modalities for land cover classification. For multi-modal remote sensing image fusion in environmental applications, there are generally two types of fusion approaches: (1) machine learning based methods; and (2) traditional methods, such as component substitution (CS) and multi-scale decomposition (MSD).
Although classical/traditional image fusion methods have been well studied for a few decades, various challenges remain; for example, (i) precise and complex registration is required before the fusion step; (ii) performance depends strongly on the correlation between the images being fused; and (iii) information is likely to be lost during fusion when components of the original data are replaced.
On the other hand, machine learning based methods generally yield more powerful image fusion results, and utilizing machine learning in remote sensing applications has become a hot topic in the literature. However, since the contents of remote sensing images differ considerably from classical natural images, network structures widely used for natural images are neither sufficient nor optimal for processing remote sensing imagery. Meanwhile, machine/deep learning approaches to remote sensing data fusion, especially for environmental applications, are still at an early stage. Therefore, more robust and generalizable machine learning based methods designed specifically for remote sensing data fusion need to be explored and developed in order to provide suitable and accurate solutions for land cover applications and beyond.
In this paper, we propose a novel dynamic deep network architecture, AMM-FuseNet, for the land cover mapping application. AMM-FuseNet promotes the use of multi-modal remote sensing images whilst exploiting a hybrid approach that combines the channel attention mechanism and densely connected atrous spatial pyramid pooling (DenseASPP). In order to verify the validity of the proposed method, we test AMM-FuseNet under four test cases drawn from three datasets (Potsdam [10], DFC2020 [11] and Hunan [12]). A comparative study evaluates AMM-FuseNet against six state-of-the-art network architectures: DeepLabv3+ [13], PSPNet [14], UNet [15], SegNet [16], DenseASPP [17], and DANet [18]. The contributions of this paper are as follows:
We design a novel encoder module, which combines a channel-attention mechanism and a densely connected atrous spatial pyramid pooling (DenseASPP) module. This feature extraction module enhances the representational power of the network by weighting the output features obtained by the atrous convolutions, and can easily be extended to any other network with an encoder–decoder structure (a simplified sketch of this module, together with the parallel-encoder fusion described in the next contribution, is given after this list).
We propose a machine learning based land cover mapping method specifically suited to multi-modal remote sensing image fusion. The proposed network extracts information from multiple modalities in parallel, but is trained with a single loss function so as to make use of their complementary features. In addition, the parallel encoders cope better with minimal (small numbers of) training samples. This has been experimentally validated in a set of test cases (Section 5.3), where we gradually reduce the number of training samples and measure model performance on the same test set.
The proposed hybrid network exploits and combines several advantages of existing networks in order to improve land cover mapping performance. Its encoder combines two feature extraction modules (ResNet and DenseASPP) in a parallel fashion to improve the feature extraction capability for each modality. To use the extracted features more efficiently, skip connections allow the network to benefit from low-, middle- and high-level features at the same time. The proposed multi-modal image fusion network shows competitive performance for land cover mapping and outperforms the state of the art.
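The following is a minimal PyTorch sketch of the two ideas referenced in the contributions above: channel-attention weighting of densely connected atrous features, and parallel per-modality encoders trained with a single loss. The module names, layer widths, dilation rates, the small convolutional stem standing in for the ResNet backbone, and the simple concatenation head are all illustrative assumptions; this is not the exact AMM-FuseNet configuration (the decoder and skip connections are omitted).

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: global pooling -> two FC layers -> sigmoid weights."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                       # re-weight feature maps channel-wise

class DenseAtrousStack(nn.Module):
    """Simplified DenseASPP: each atrous layer sees the concatenation of all previous outputs."""
    def __init__(self, in_ch: int, growth: int = 64, rates=(3, 6, 12)):
        super().__init__()
        self.layers, ch = nn.ModuleList(), in_ch
        for r in rates:
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, growth, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(growth), nn.ReLU(inplace=True)))
            ch += growth
        self.out_channels = ch

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

class CAEncoderBranch(nn.Module):
    """One per-modality encoder branch: stem -> dense atrous stack -> channel attention."""
    def __init__(self, in_ch: int, base_ch: int = 64):
        super().__init__()
        self.stem = nn.Sequential(          # stand-in for a ResNet backbone
            nn.Conv2d(in_ch, base_ch, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(base_ch), nn.ReLU(inplace=True),
            nn.Conv2d(base_ch, base_ch, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(base_ch), nn.ReLU(inplace=True))
        self.daspp = DenseAtrousStack(base_ch)
        self.attn = ChannelAttention(self.daspp.out_channels)

    def forward(self, x):
        return self.attn(self.daspp(self.stem(x)))

class DualEncoderFusion(nn.Module):
    """Two parallel branches (one per modality), fused by concatenation, trained with one loss."""
    def __init__(self, ch_a: int, ch_b: int, num_classes: int):
        super().__init__()
        self.enc_a, self.enc_b = CAEncoderBranch(ch_a), CAEncoderBranch(ch_b)
        fused = self.enc_a.daspp.out_channels + self.enc_b.daspp.out_channels
        self.head = nn.Sequential(
            nn.Conv2d(fused, num_classes, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False))

    def forward(self, xa, xb):
        return self.head(torch.cat([self.enc_a(xa), self.enc_b(xb)], dim=1))

# Toy forward/backward pass: 13-band optical + 2-band SAR inputs, 7 classes, one loss.
model = DualEncoderFusion(ch_a=13, ch_b=2, num_classes=7)
optical, sar = torch.randn(2, 13, 128, 128), torch.randn(2, 2, 128, 128)
labels = torch.randint(0, 7, (2, 128, 128))
loss = nn.CrossEntropyLoss()(model(optical, sar), labels)
loss.backward()
```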
The rest of the paper is organized as follows: Section 2 presents a general background and literature review, whilst Section 3 presents the proposed method, AMM-FuseNet. Section 4 gives details of the datasets used, and Section 5 covers the experimental analysis. Section 6 concludes the paper with a brief summary and future work.
2. Related Work
Following developments in big data research and their effects on computer vision, especially in recent years, multi-modal remote sensing data for various applications have been made available under open-access licences (e.g., ESA Sentinel-1/2 and fusion contest datasets, including DEM and airborne/UAV-based optical data). Thanks to their complementary features, multi-modal remote sensing imagery provides much richer information than a single modality, especially for land cover/use applications. However, most land cover mapping papers in the literature still use single-modality data [19,20,21]. Along with technical developments in computational imaging and deep/machine learning research, the use of multi-modal data for land cover mapping [22,23], although still at an early stage, has started to appear in some recent works.
Considering the increasing demand for multi-modal information for land cover mapping, the key challenge of this research clearly lies in answering the question: “how can the complementary features in multi-modal remote sensing data be used efficiently?”. One of the most common answers in the literature is an image fusion approach that directly concatenates multi-modal images and provides them as the input to a land cover mapping network (a minimal sketch of this strategy is given below). Land cover mapping itself can be described as a classification application that assigns each pixel of a remote sensing image to one of several categories, analogous to semantic segmentation. Especially in the last decade, semantic segmentation networks such as UNet [15], DeepLabv3+ [13], SegNet [16] and PSPNet [14] have developed rapidly and achieved great success on natural image datasets such as COCO [24] and PASCAL VOC 2012 [25]. One can therefore build upon these classical semantic segmentation networks and develop improved architectures for the land cover mapping application.
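As an illustration of the concatenation-based ("early fusion") strategy mentioned above, the following short PyTorch sketch stacks two modalities along the channel axis before a segmentation network; band counts and tensor sizes are illustrative only.

```python
import torch
import torch.nn as nn

# 13-band optical and 2-band SAR patches stacked along the channel axis (early fusion).
optical = torch.randn(4, 13, 256, 256)
sar = torch.randn(4, 2, 256, 256)
fused_input = torch.cat([optical, sar], dim=1)   # shape (4, 15, 256, 256)

# An off-the-shelf segmentation network then only needs its first conv layer adapted
# to the new number of input channels.
first_conv = nn.Conv2d(in_channels=15, out_channels=64, kernel_size=3, padding=1)
print(first_conv(fused_input).shape)             # torch.Size([4, 64, 256, 256])
```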
When it comes to pixel-level classification, whether semantic segmentation in computer vision or land cover classification/mapping in remote sensing, fully convolutional networks (FCNs) [26] have made a considerable contribution, and these models and their variants have become the state of the art in the literature. SegNet [16] and UNet [15] adopt a symmetrical encoder–decoder structure with skip connections, making use of multi-stage features in the encoder. Alternatively, PSPNet [14] uses a pyramid pooling structure, which provides a global contextual prior for pixel-level scene parsing. Instead of the traditional convolution layers used in the aforementioned networks, atrous convolution and atrous spatial pyramid pooling (ASPP) were proposed in the DeepLab architecture [27]. This allows DeepLab to perceive multi-scale spatial information even with fixed-size convolution kernels. Although the ASPP benefits from acquiring multi-scale features, DenseASPP [17] argues that the feature resolution along the scale axis is not dense enough; it therefore combines dense networks [28] with atrous convolutions to generate densely scaled receptive fields. DeepLabv3+ [13] proposed an improved hybrid approach that combines an encoder–decoder structure with the ASPP, which can control the resolution of the extracted encoder features and trade off precision against runtime via different dilation rates. Specifically, in DeepLabv3+, appending the ASPP module after a ResNet backbone lets the network extract deeper, high-level features with the aim of improving performance around segmentation boundaries [29,30]. However, in its decoder, DeepLabv3+ simply concatenates two levels of features, the output of the first backbone layer and the output of the ASPP module; the network therefore misses the features produced during the intermediate stages of feature extraction, which reduces classification performance. A minimal sketch of the ASPP idea is given below.
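The following PyTorch sketch condenses the ASPP idea discussed above into a few lines: parallel 3×3 atrous branches with different dilation rates plus a 1×1 branch, concatenated and projected back. The dilation rates and channel widths are illustrative, and the image-level pooling branch and batch normalization used in the actual DeepLab implementations are omitted for brevity.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Minimal ASPP: fixed-size kernels see several receptive-field scales via dilation."""
    def __init__(self, in_ch: int, out_ch: int = 256, rates=(6, 12, 18)):
        super().__init__()
        branches = [nn.Conv2d(in_ch, out_ch, 1, bias=False)]                 # 1x1 branch
        branches += [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False)
                     for r in rates]                                          # atrous branches
        self.branches = nn.ModuleList(branches)
        self.project = nn.Conv2d(out_ch * len(branches), out_ch, 1)          # fuse branches

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

aspp = ASPP(in_ch=512)
print(aspp(torch.randn(1, 512, 32, 32)).shape)   # torch.Size([1, 256, 32, 32])
```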
Remote sensing imagery poses more complex challenges than natural images, such as (1) land cover classes that are hard to separate; (2) imbalanced class distributions; and (3) image content corrupted by strong random noise, such as speckle in radar imagery. These challenges can cause the aforementioned semantic segmentation networks to produce unsatisfactory results. To address them, the literature includes several semantic segmentation networks tailored to land cover classification. Fusion-FCN [31] adapts the FCN for multi-modal remote sensing land cover classification and was the winner of the 2018 IEEE GRSS Data Fusion Contest. DKDFN [12] is also based on the FCN; it collaboratively fuses multi-modal data while assimilating highly generalisable domain knowledge (e.g., remote sensing indices such as NDVI, NDBI, and NDWI). DKDFN outperforms several classical semantic segmentation networks such as UNet, SegNet, PSPNet and DeepLab. It is worth noting that both Fusion-FCN and DKDFN extract multi-modal features with separate encoders, and this multi-encoder design also appears in RGB-D fusion for the semantic segmentation of natural images [32]. DISNet [33] is another network for land cover classification; it uses the DeepLabv3+ framework and adds an attention-based module in both the encoder and decoder, which improves land cover classification performance compared to the original DeepLabv3+ [13]. Xia [29] also built upon DeepLabv3+ and similarly proposed a global attention based up-sampling module, passing multi-level features to the decoder to obtain efficient and accurate segmentation. Similarly, Lei [34] proposed a multi-scale fusion network based on a variety of attention mechanisms for land cover classification, which also shows competitive performance. In order to combine the advantages of UNet and DeepLabv3+, ASPP-U-Net [35] was proposed for land cover classification and showed better results than both UNet and DeepLabv3+.
4. Data
In this paper, we use three open-access multi-modal datasets constructed for land cover classification: (1) Hunan [12], (2) Potsdam [10] and (3) DFC2020 [11]. Although most open-access land cover mapping datasets in the literature focus on a single modality, all three datasets used in this paper are multi-modal remote sensing image datasets for land cover mapping. In particular, the DFC2020 and Potsdam datasets are highly representative and commonly used in the literature, whilst the Hunan dataset is a new dataset published in 2022 [12]. It includes three different remote sensing modalities for land cover mapping, which makes it highly suitable for testing the proposed method's performance. Details of all three datasets are presented in Table 1, where:
SRTM refers to Shuttle Radar Topography Mission Digital Elevation Model (DEM) data;
TOP refers to the true orthophoto;
DSM refers to a digital surface model.
4.1. Hunan
Hunan [12] is a multi-modal dataset for land cover mapping of Hunan province, China. It consists of three remote sensing modalities: multi-spectral (Sentinel-2), SAR (Sentinel-1) and SRTM digital elevation model (DEM) data. The Sentinel-2 MSI and Sentinel-1 SAR imagery, captured in 2017, have temporal resolutions of 5 and 6 days (combined constellation), respectively [40,41]. The SRTM (Shuttle Radar Topography Mission) instrument was flown on the Space Shuttle and acquired Earth surface data using a synthetic aperture radar. During its 11-day flight, from 11 February 2000 to 22 February 2000, it obtained data covering 80% of the Earth's surface [42]. These data were converted into digital elevation model (DEM) data, which provide height information about the Earth's surface. More details of this dataset are listed in Table 2. All 13 Sentinel-2 bands are used in our experiments. The Sentinel-1 data in the Hunan dataset were pre-processed by thermal noise removal, radiometric calibration, terrain correction and logarithmic conversion; they contain two bands, corresponding to the dual polarizations VV and VH. The SRTM provides both elevation and slope data, which offer extra topographic information, but only the elevation data are used in our experiments. The creators of the Hunan dataset resampled all data to a spatial resolution of 10 m using the default nearest-neighbor resampling strategy in Google Earth Engine (GEE) [12]. Since the Hunan dataset contains three different remote sensing modalities, we divide them into two fusion pairs: (i) Sentinel-2 and Sentinel-1, and (ii) Sentinel-2 and DEM (an illustrative loading sketch is given below).
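As a purely illustrative example of how the two fusion pairs might be assembled, the following sketch loads one tile per modality with rasterio and returns the (Sentinel-2, Sentinel-1) and (Sentinel-2, DEM) pairs. The directory layout and file names are hypothetical; only the band counts follow the dataset description above.

```python
import numpy as np
import rasterio

def load_fusion_pairs(tile_id: str, root: str = "hunan"):
    """Load one tile per modality and return the two fusion pairs used in the experiments.
    File layout and names are hypothetical; band counts follow the dataset description."""
    with rasterio.open(f"{root}/s2/{tile_id}.tif") as src:
        s2 = src.read().astype(np.float32)           # (13, H, W) Sentinel-2 bands
    with rasterio.open(f"{root}/s1/{tile_id}.tif") as src:
        s1 = src.read().astype(np.float32)           # (2, H, W) VV and VH
    with rasterio.open(f"{root}/dem/{tile_id}.tif") as src:
        dem = src.read(1).astype(np.float32)[None]   # (1, H, W) SRTM elevation only
    return (s2, s1), (s2, dem)                       # pair (i): S2+S1, pair (ii): S2+DEM
```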
The Hunan dataset consists of 500 image tiles for each modality, as well as their corresponding land cover labels. The size of all the images is . Geology experts manually labeled the data according to the Sentinel-2 mosaic. This dataset contains 7 imbalanced class labels, which are cropland (23.34%), forest (42.37%), grassland (7.35%), wetland (1.89%), water (13.35%), unused land (1.56%), and built-up area (10.14%). This distribution of land cover classes is based on the data collected in Hunan province, China in 2017.
4.2. DFC2020
DFC2020 is based on the SEN12MS dataset [43], which provides Sentinel-1 SAR imagery, Sentinel-2 multispectral imagery and corresponding land cover maps over 7 areas of the world (see Table 3), collected between 2016 and 2017. The temporal resolution and collection time of the modalities in DFC2020 are listed in Table 4. The size of all patches is  pixels. The fine-grained IGBP classification scheme of SEN12MS was aggregated into 10 coarser-grained classes: forest (11.3%), shrubland (6.9%), savanna (23.6%), grassland (16.8%), wetlands (1.1%), croplands (17.9%), urban/built-up (10.6%), snow/ice (0.0%), barren (5.2%), and water (6.5%). The class distributions of SEN12MS and DFC2020 are similar; however, since DFC2020 is a subset of SEN12MS, one class has zero percentage. A detailed comparison between the standard IGBP classes of SEN12MS and the DFC2020 label classes can be found in [11].
4.3. ISPRS Potsdam
The ISPRS Potsdam Semantic Labeling dataset is an open-access benchmark provided by the International Society for Photogrammetry and Remote Sensing (ISPRS). It provides 38 multi-source tiles (all of size ), which contain infrared (IR), red, green and blue orthorectified optical images with corresponding digital surface models (DSM). For computational purposes, we sub-divided these tiles into patches (an illustrative tiling sketch is given below), leading to 3456 training and 2016 test samples. The ground sampling distance of both modalities, the true orthophoto (TOP) and the DSM, is 5 cm. The dataset was manually classified into six land cover classes: impervious surfaces, buildings, low vegetation, trees, cars, and clutter/background.
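The tile-to-patch sub-division step can be sketched generically as follows; the patch size and stride are left as parameters rather than the specific values used in this paper, and the 5-channel example tile is illustrative only.

```python
from typing import Optional
import numpy as np

def tile_into_patches(image: np.ndarray, patch_size: int,
                      stride: Optional[int] = None) -> np.ndarray:
    """Split a (C, H, W) tile into square (C, patch_size, patch_size) patches."""
    stride = stride or patch_size            # non-overlapping patches by default
    _, height, width = image.shape
    patches = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patches.append(image[:, top:top + patch_size, left:left + patch_size])
    return np.stack(patches)

# e.g., a 5-channel tile (IR, R, G, B, DSM); sizes here are illustrative only
tile = np.zeros((5, 1024, 1024), dtype=np.float32)
print(tile_into_patches(tile, patch_size=256).shape)   # (16, 5, 256, 256)
```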
6. Conclusions
In this work, we proposed a channel-attention based multi-modal image fusion network, AMM-FuseNet, built around a novel feature extraction module, CADenseASPP. The proposed network showed competitive performance compared to state-of-the-art segmentation networks such as DeepLabv3+, SegNet and UNet, and appeared to be more robust and generalisable when applied to various multi-modal remote sensing datasets. For most of the test cases, covering the remote sensing modalities of RGB, multi-spectral, SAR and DEM, AMM-FuseNet performed consistently and was the best model in terms of various performance metrics.
In presenting the proposed approach, AMM-FuseNet, in this paper:
We contributed to the literature a multi-modal, attention-based deep network architecture that improves land cover mapping/classification performance compared to the state of the art.
Feeding multi-modal remote sensing information in parallel into the proposed hybrid encoder module, CADenseASPP, improved segmentation performance dramatically by weighting the features produced by the dense atrous convolution operations.
Experiments on the Potsdam data showed that the proposed network performs strongly with a small number of training samples (minimal training supervision), despite the relatively complex structure arising from its two parallel encoders. This supports our claim that the proposed approach is a strong choice for segmentation applications with only a small amount of labeled data.
As it stands, AMM-FuseNet is a candidate for high-performing segmentation in remote sensing and computer vision tasks beyond the land cover mapping application.
The indistinguishable performance of all models on the full Potsdam dataset points to future research directions for the AMM-FuseNet architecture. Ongoing work includes a detailed complexity analysis and an exploration of AMM-FuseNet's capabilities (i) to deal with minimal supervision and (ii) to extract useful information from higher spatial resolution imagery.