ACFNet: A Feature Fusion Network for Glacial Lake Extraction Based on Optical and Synthetic Aperture Radar Images

Abstract: Glacial lake extraction is essential for studying the response of glacial lakes to climate change and assessing the risks of glacial lake outburst floods. Most methods for glacial lake extraction are based on either optical images or synthetic aperture radar (SAR) images. Although deep learning methods can extract features of optical and SAR images well, efficiently fusing the two modalities' features for glacial lake extraction with high accuracy is challenging. In this study, to make full use of the spectral characteristics of optical images and the geometric characteristics of SAR images, we propose an atrous convolution fusion network (ACFNet) to extract glacial lakes based on Landsat 8 optical images and Sentinel-1 SAR images. ACFNet adequately fuses high-level features of optical and SAR data in different receptive fields using atrous convolution. Compared with four fusion models in which data fusion occurs at the input, encoder, decoder, and output stages, two classical semantic segmentation models (SegNet and DeepLabV3+), and a recently proposed model based on U-Net, our model achieves the best results with an intersection-over-union of 0.8278. The experiments show that fully extracting the characteristics of optical and SAR data and appropriately fusing them are vital steps in a network's performance of glacial lake extraction.


Introduction
With global warming, glaciers have experienced extensive negative mass changes and have greatly contributed to sea level rise [1]. Glacial lakes slightly alleviate sea level rise [2] by storing a small percentage of glacier meltwater. However, this small fraction of glacier meltwater has rapidly increased the size and number of glacial lakes over the last few decades [2][3][4]. As a glacial lake expands in area and depth, additional pressure is exerted on the moraine dam, increasing the probability of a glacial lake outburst flood (GLOF) [5]. Moreover, under continued climate warming and deglaciation, GLOF risks will increase in the future [6]. GLOFs can inundate buildings, bridges, and hydropower systems in their flow paths [7], as well as destroy communities downstream [8]. For disaster preparedness, many studies assessing GLOF hazards and risks have been published [6,7,[9][10][11]. In addition, some scholars found that glaciers terminating in lakes have more negative mass balances than glaciers terminating on land [12,13] due to mechanical calving and thermal melting [14]. To better understand glacier dynamic evolution, glacial lakes connected with glaciers should be studied. An inventory of glacial lakes is a prerequisite for most studies related to glacial lakes.
Glacial lakes are mostly located in alpine areas, which makes field surveying difficult. With the development of remote sensing technology and the growing number of Earth observation satellites, scholars can more easily obtain the outlines and areas of glacial lakes. Often, glacial lakes are mapped from remote sensing images by manual vectorization [10,[15][16][17][18][19][20] using geographic information system (GIS) software. Although manual vectorization is the most accurate method for generating glacial lake boundaries, it is labor intensive, especially for large study areas. Thus, some researchers use threshold methods based on the normalized difference water index (NDWI) [21], band ratios [22,23], mountain shadow masks [24,25], slope maps [11,[26][27][28][29], and brightness temperature [30] to rapidly extract glacial lakes from optical multiband images. The success of NDWI and band-ratio methods for lake extraction rests on the spectral characteristics of water: water has low reflectance in the near-infrared band and high reflectance in the green band. Mountain shadow masks and slope maps derived from a digital elevation model (DEM) are widely used to differentiate glacial lakes from mountain shadows; glacial lakes and mountain shadows have similar spectral characteristics, but glacial lakes lie on gentler slopes. The brightness temperature derived from the thermal band of a Landsat 8 scene can help distinguish water surfaces from glacier zones covered by wet snow. In addition, for the adaptive segmentation of each lake, global-local iterative schemes [31,32] and active contour models [33,34] have been used in mapping glacial lakes. In the methods above, the thresholds for NDWI, band ratio, and slope are set empirically, which adds uncertainty to glacial lake mapping. Thus, some scholars utilize machine learning methods, such as support vector machines (SVM) and random forests (RF), to carry out glacial lake pixel classification [35,36].
However, the features input into SVM or RF classifiers, such as NDWI, are still manually designed.
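The NDWI thresholding idea described above can be sketched in a few lines of numpy; the 0.2 threshold below is purely illustrative, since, as noted, such thresholds are set empirically per scene:

```python
import numpy as np

def ndwi_water_mask(green, nir, threshold=0.2):
    """Compute NDWI = (Green - NIR) / (Green + NIR) and threshold it.

    The threshold of 0.2 is an illustrative value only; published studies
    tune it empirically, which is the uncertainty noted in the text.
    """
    green = green.astype(np.float64)
    nir = nir.astype(np.float64)
    ndwi = (green - nir) / (green + nir + 1e-12)  # avoid division by zero
    return ndwi, ndwi > threshold

# Toy 2x2 reflectance patch: water has high green and low NIR reflectance.
green = np.array([[0.30, 0.05], [0.28, 0.06]])
nir   = np.array([[0.05, 0.30], [0.04, 0.28]])
ndwi, mask = ndwi_water_mask(green, nir)
```

The top-left pixel (bright in green, dark in NIR) is classified as water; the top-right pixel, with the opposite signature, is not.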
Over the last few years, deep learning (DL) methods have been widely used in the remote sensing field [37][38][39][40], big Earth data analysis [41], and real-life applications in other fields [42,43]. There is no need to design input features manually in a DL model due to its powerful representation learning ability. However, to the best of our knowledge, there are few studies on DL applications for glacial lake extraction. Qayyum et al. applied a U-Net model to glacial lake extraction on very high resolution PlanetScope imagery and obtained better results than those acquired with SVM and RF classifiers [44]. Chen applied U-Net to supraglacial lake extraction using high-resolution GaoFen-3 SAR images [45]; that study used SAR images alone and did not modify the network. Wu et al. proposed a model based on U-Net for glacial lake extraction using a combination of Landsat 8 optical images and Sentinel-1 SAR images [46]. Their research showed that the addition of SAR features helps to identify glacial lakes. However, the authors simply concatenated two groups of shallow features, each filtered from the optical or SAR images by a convolution layer, as the input of U-Net; moreover, their study area was limited to southeastern Tibet. The appropriate and efficient fusion of two modality features for mapping glacial lakes with high precision is challenging. In this study, to effectively use the spectral characteristics of optical images and the geometric characteristics of SAR images to accurately extract glacial lakes, we propose an atrous convolution fusion network (ACFNet) that sufficiently fuses the features of optical and SAR data using atrous convolutions. Furthermore, we compare the performance of ACFNet to the following: four fusion models in which data fusion occurs at the input, encoder, decoder, and output stages; Wu's model [46]; and two typical methods in the semantic segmentation (SS) field, SegNet [47] and DeepLabV3+ [48].
These models were all trained and evaluated based on glacial lake data distributed throughout the Himalaya (Figure 1). The main contributions of this study are as follows: (1) We proposed a deep learning fusion model for glacial lake mapping from Landsat 8 optical images and Sentinel-1 SAR images.
(2) We explored the applicability of the proposed model and several typical fusion models in extracting glacial lakes.
(3) We explored the influence of imaging time intervals between optical images and SAR images on glacial lake extraction under different fusion models.

Study Area
The Himalaya is an arc-shaped region (Figure 1) in the southwestern Tibetan Plateau and the source of the Indus, Ganges, and Brahmaputra Rivers. Compared to its north-south span of 150-400 km, the Himalaya has a longer east-west span of over 2000 km [49]. Glaciers cover an area of ~22,800 km² [50]. Over the last few decades, glaciers in the Himalaya have experienced moderate mass loss with intra-region variability [51] caused by morphological variables [12], the existence of glacial lakes at glacier termini [13], and heterogeneous regional climates [50]. The climate in the Himalaya is dominated by a monsoon system, including the Indian and East Asian monsoons, and by the westerlies [52]. In the southeastern Himalaya, most precipitation is associated with the summer monsoon. In the northwestern Himalaya, the westerlies provide most of the winter precipitation [52]. Thus, in the eastern and central parts of the Himalaya, most glaciers undergo summer accumulation, whereas in the northwestern part, winter accumulation is more important [50]. In addition, precipitation decreases sharply from south to north over the entire Himalaya because the mountains form a moisture barrier [50]. Since the 1990s, as glaciers have shrunk, glacial lakes have commonly increased in the Himalaya [53]. The expansion rates of glacial lakes are highest in the south-central Himalaya and lowest in the western Himalaya [53]. An inventory of Himalayan glacial lakes from 2015 shows that most are located at altitudes between 4000 m and 5700 m [53]. According to the statistics of a 2018 inventory of glacial lakes [3], approximately 85.3% of Himalayan glacial lakes have areas smaller than 0.1 km². The complex freezing conditions of the vast Himalaya, as well as its rugged terrain and the diversity of glacial lakes' sizes, colors, and turbidity, make automatic glacial lake mapping difficult.
Hence, selecting the Himalaya as the study area helps to evaluate the applicability and robustness of the methods for glacial lake extraction.


Optical Dataset
Landsat series satellite images have been widely used to investigate the evolution of glacial lakes. The first Landsat satellite was launched in 1972; since then, the Landsat mission has launched multiple satellites for continuous observation. Thus, Landsat imagery has the longest continuous temporal archive of Earth's surface, making it popular for land cover change research. Landsat imagery also provides a good compromise between spatial resolution and swath width, and it is free to access; thus, it can facilitate studies at a large scale [54]. The Landsat 8 satellite, equipped with the improved Operational Land Imager (OLI), provides better image quality than past Landsat satellites. In addition, some published glacial lake inventories derived from Landsat 8 imagery [3,4] can help us to delineate glacial lakes. Hence, we used Landsat 8 imagery to create our optical dataset. To avoid the adverse effects of clouds and seasonal snowfall, we carefully selected Landsat 8 Level 1 Terrain-corrected (L1T) images, which were downloaded from the United States Geological Survey (https://www.usgs.gov/, last accessed 16 May 2021). The imaging times ranged from September to November (autumn). In this period, the evaporation of water from glaciers is much smaller than in summer, minimizing clouds and fog in the images. Glacial lakes also reach their maximum area after glacier ablation, helping to identify small glacial lakes. To obtain stable and valid image values, we applied radiometric calibration to the raw digital numbers (DNs) using ENVI 5.3 software, converting the data to top-of-atmosphere (TOA) reflectance.
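For reference, the standard USGS DN-to-TOA-reflectance conversion applies per-scene rescaling coefficients from the scene's MTL metadata and a sun-elevation correction; a minimal sketch (the numeric values below are typical MTL values, not those of any specific scene used in this study):

```python
import math

def dn_to_toa_reflectance(dn, mult, add, sun_elev_deg):
    """Convert a Landsat 8 DN to TOA reflectance.

    mult (REFLECTANCE_MULT_BAND_x) and add (REFLECTANCE_ADD_BAND_x) come
    from the scene's MTL metadata file; sun_elev_deg is SUN_ELEVATION.
    """
    rho_prime = mult * dn + add  # TOA reflectance without sun-angle correction
    return rho_prime / math.sin(math.radians(sun_elev_deg))

# Example with the multiplier/offset values typical of Landsat 8 MTL files.
toa = dn_to_toa_reflectance(10000, mult=2.0e-5, add=-0.1, sun_elev_deg=45.0)
```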
With the assistance of a high-mountain Asia (HMA) glacial lake inventory [4], an expert manually delineated the boundaries of glacial lakes in ArcGIS 10.6 software. Thus, the outlined glacial lake contours are of high quality and consistency. A total of 11,127 glacial lakes located throughout the Himalaya were mapped (Figure 1). The vector file of the glacial lake boundaries was further converted into a raster mask file with a 30 m spatial resolution. Limited by graphics processing unit (GPU) memory, it is not feasible to feed a large image with thousands of columns and rows into a network. When dealing with such a large image, decomposing it into sub-images is a commonly used pretreatment [55]. Thus, each Landsat 8 image and its corresponding mask file were clipped into patches using a 256 × 256 sliding window with a stride of 128, making the number of patches as large as possible while keeping adjacent patches somewhat independent. To focus the network on glacial lakes, patches without glacial lakes were removed. Finally, we generated a dataset of 9261 patches of 256 × 256 pixels at a 30 m spatial resolution. The 9261 patches were randomly split into 4626 patches for training, 1848 for validation, and 2787 for testing. Considering that most convolutional neural networks (CNNs) are designed for and applied to natural images that contain only the three bands of red, green, and blue (RGB), only the RGB bands were retained in our optical dataset.
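The clipping step above can be sketched in numpy; this is a simplified version that ignores the image border where a full 256 × 256 window does not fit:

```python
import numpy as np

def clip_to_patches(image, mask, size=256, stride=128):
    """Clip an image/mask pair into size x size patches with the given
    stride, keeping only patches that contain at least one lake pixel."""
    patches = []
    h, w = mask.shape
    for r in range(0, h - size + 1, stride):
        for c in range(0, w - size + 1, stride):
            m = mask[r:r + size, c:c + size]
            if m.any():  # drop patches without glacial lakes
                patches.append((image[r:r + size, c:c + size], m))
    return patches

# Toy 512x512 scene with one small 10x10 "lake" in the top-left corner:
# only the window anchored at (0, 0) covers it, so one patch survives.
img = np.zeros((512, 512, 3))
msk = np.zeros((512, 512), dtype=bool)
msk[10:20, 10:20] = True
pairs = clip_to_patches(img, msk)
```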

SAR Dataset
The Sentinel-1 mission comprises two satellites carrying C-band synthetic aperture radar (SAR) instruments, providing a revisit time of six days. The main operational mode of Sentinel-1 is interferometric wide swath (IW), which has a swath of 400 km. C-band SAR signals can penetrate clouds and fog, allowing Sentinel-1 to effectively capture the Earth's surface under any weather conditions. Sentinel-1 data are also freely available, making them advantageous for land cover monitoring, emergency response, and other scientific studies. SAR data also reflect the geometric structure of ground objects. Generally, a water surface has a low backscatter coefficient in SAR images due to the specular reflection of the signals. The short revisit period of the Sentinel-1 satellites helps us to find the SAR images with imaging times nearest to those of the Landsat 8 optical images. Therefore, Sentinel-1 SAR images are utilized as auxiliary data for the Landsat 8 optical images to extract glacial lakes.
The Google Earth Engine (GEE) platform provides level-1 Sentinel-1 data, which have been processed to backscatter coefficients through radiometric calibration and terrain correction. To homogenize the dataset used in this study, only the images acquired in the IW mode were used. The polarization band was also restricted to VV, in which the SAR signal is transmitted and received vertically. Our preparation of optical-SAR patch pairs follows another study [56]. For each Landsat 8 optical patch prepared, the corresponding Sentinel-1 SAR patch was obtained from GEE. Figure 2a shows the process in detail. First, an optical patch's boundary vector file with geographic coordinates and its imaging time were uploaded to GEE. Second, Sentinel-1 scenes that contained the uploaded boundary were filtered out through GEE. Third, the unique Sentinel-1 scene with the imaging time nearest to that of the optical patch was selected. Finally, the selected Sentinel-1 scene was clipped to the optical patch's boundary and resampled to 30 m to be consistent with the optical patch. Considering that deep CNNs can learn contextual information and high-level semantic information, no noise-suppression preprocessing such as that in [57] was adopted to deal with possible noise in the SAR images. Figure 2b shows three example RGB images and their corresponding VV images. The final result is three sets of data: we use RGB and VV to denote the Landsat 8 optical RGB data and Sentinel-1 SAR VV data, respectively, and the RGB and VV data were concatenated to form 4-band RGB+VV data.
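The third step, nearest-in-time scene selection, reduces to a simple minimization over acquisition timestamps. A sketch with hypothetical scene IDs (in practice the candidate list would come from a GEE spatial/temporal filter):

```python
from datetime import datetime

def nearest_scene(optical_time, sar_scenes):
    """Pick the SAR scene whose acquisition time is closest to the
    optical patch's imaging time (step 3 of the pairing workflow).

    sar_scenes is a list of (scene_id, acquisition_datetime) tuples.
    """
    return min(sar_scenes, key=lambda s: abs(s[1] - optical_time))

optical_t = datetime(2019, 10, 15)
candidates = [
    ("S1A_20190402", datetime(2019, 4, 2)),    # half a year earlier: a "bad" pair
    ("S1B_20191017", datetime(2019, 10, 17)),  # two days later: a "good" pair
]
scene_id, _ = nearest_scene(optical_t, candidates)
```

With a six-day revisit time, the selected scene is usually only a few days from the optical acquisition, which is what distinguishes the "good" VV images from the "bad" ones in Figure 2b.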
(a) (b) Figure 2. (a) The flow diagram for preparing an optical-SAR patch pair; (b) RGB images and their corresponding "good" VV images with imaging times nearest those of the RGB images and "bad" VV images that were acquired half a year earlier than the RGB images.

Methodology
Convolutional neural networks have achieved great success in image classification [58,59]. To transfer this success to SS tasks, Long et al. [60] designed the fully convolutional network (FCN) by replacing fully connected layers with convolution layers. Since then, many popular networks have emerged in the SS field, such as U-Net [61], SegNet [47], PSP-Net [62], and the DeepLab series of models [48,[63][64][65]. U-Net gradually integrates images' shallow appearance features and high-level semantic features in its decoding process, which is conducive to extracting small objects. Given that most of the glacial lakes in our study area are small and the pixel size of the data used is 30 m, U-Net is well suited for glacial lake extraction and is utilized as a component in ACFNet.

U-Net Structure
U-Net provides a concise symmetrical encoder-decoder structure. The encoder part consists of a series of convolution filters and max pooling operations. Symmetrically, a series of upsampling operations and convolution layers comprise the decoder part. In U-Net, the high-level semantic features produced after upsampling are concatenated with the corresponding shallow appearance features and then merged via subsequent convolution layers. Thus, the detailed spatial information lost in the encoder part due to max pooling is gradually recovered during the decoding process. The adequate integration of high-level semantic features and shallow appearance features in U-Net facilitates the boundary extraction of targets and the recognition of small objects. Figure 3 shows the U-Net architecture with ResNet as the backbone/encoder.
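A decoder block of this kind (upsample, concatenate the skip features, merge with convolutions) could be sketched in PyTorch as follows; the channel sizes are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One U-Net-style decoder block: upsample the high-level features by
    2x, concatenate the shallow skip features, then merge them with two
    conv(3x3)-BN-ReLU blocks."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, high, skip):
        return self.conv(torch.cat([self.up(high), skip], dim=1))

# High-level features at 1/32 scale merged with 1/16-scale skip features.
block = DecoderBlock(in_ch=512, skip_ch=256, out_ch=256)
out = block(torch.randn(1, 512, 8, 8), torch.randn(1, 256, 16, 16))
```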

ResNet Backbone
To address the degradation problem when training deep networks, He et al. proposed the deep residual learning network [58]. Specifically, for a feature x, let the expected output learned via several stacked nonlinear layers be H(x). The residual network forces the stacked nonlinear layers to learn F(x) = H(x) − x by adding an identity mapping of x onto F(x). When x equals H(x), the stacked nonlinear layers only need to learn a zero mapping F(x) = 0. If x is close to H(x), then it is also easier to learn a residual correction of x than to learn a new mapping from scratch. Residual learning makes deep residual networks easy to optimize [58]. Thus, we used ResNet as the backbone/encoder of our model in this study.
The architecture of U-Net with ResNet as the backbone is shown in Figure 3. An input image is filtered by Conv, Res1, Res2, Res3, and Res4, generating feature maps with sizes of 1/2, 1/4, 1/8, 1/16, and 1/32 of the input image size. Conv represents the first convolution layer of ResNet, and Res1, Res2, Res3, and Res4 represent the first, second, third, and fourth residual convolution blocks in ResNet, respectively. The high-level feature maps are gradually upsampled by 2× and merged with the corresponding shallow feature maps in decoder blocks. One decoder block contains two convolution blocks, each consisting of a 3 × 3 convolution layer, a batch normalization layer, and a ReLU activation function. The feature map generated by the last decoder block is passed through a 3 × 3 convolution layer, a 2× upsampling layer, and a sigmoid activation function to create the final prediction map.
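The identity-shortcut idea can be sketched as a basic residual block in PyTorch (this is the two-layer "basic" variant; deeper ResNets use a bottleneck design instead):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic residual block: the stacked layers learn F(x) = H(x) - x,
    and the identity shortcut adds x back, so the output is F(x) + x."""
    def __init__(self, ch):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)  # identity mapping of x added onto F(x)

x = torch.randn(2, 64, 32, 32)
y = ResidualBlock(64)(x)
```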

ACFNet Architecture
The architecture of the proposed ACFNet is shown in Figure 4. RGB and VV features are extracted by independent ResNet encoders. The first decoder block of each branch generates a group of features with a spatial size of 1/16 of the input image size. The two groups of features are adequately fused through the proposed atrous convolution block (ACB). In this block, the features of the two modes are concatenated and filtered by three 3 × 3 atrous convolutions with dilation rates of 1, 2, and 3. Note that a 3 × 3 atrous convolution with a dilation rate of 1 is the standard 3 × 3 convolution. Atrous convolution allows the fusion of the two modes of features under a large receptive field while keeping the number of filter parameters constant. The three groups of features filtered under the different receptive fields are integrated by a 1 × 1 convolution. The 1 × 1 and 3 × 3 convolution layers in the ACB are both followed by a batch normalization layer and a ReLU activation function. The integrated features flow in parallel into the subsequent RGB and VV decoder branches to perform the second decoding separately. The features generated by the second decoder blocks of the RGB and VV branches are fused by element-wise summation. The fused features continue to flow in parallel through the last two decoder blocks of the two branches. The last decoder block of each branch generates a group of features carrying more of that branch's modality characteristics. To take full advantage of these two groups of semantic features for a highly accurate prediction, they are integrated again by an ACB. As in the U-Net with ResNet backbone described in Section 3.2, the final integrated semantic features are passed through a 3 × 3 convolution layer, a 2× upsampling layer, and a sigmoid activation function to generate the final prediction map.
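A minimal PyTorch sketch of how such a block could look, assuming the three dilated branches are concatenated before the 1 × 1 integration (the exact wiring of Figure 4 is not reproduced here, and the channel counts are illustrative):

```python
import torch
import torch.nn as nn

class AtrousConvBlock(nn.Module):
    """Sketch of an atrous convolution block (ACB): concatenate the RGB
    and VV feature groups, filter them with three 3x3 convolutions at
    dilation rates 1, 2, and 3 (padding = dilation keeps the spatial
    size), then integrate the three results with a 1x1 convolution.
    Every convolution is followed by BN + ReLU, as in the text."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        def cbr(i, o, k, d=1):  # conv-BN-ReLU helper
            return nn.Sequential(
                nn.Conv2d(i, o, k, padding=d * (k - 1) // 2, dilation=d),
                nn.BatchNorm2d(o), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList(
            [cbr(2 * in_ch, out_ch, 3, d) for d in (1, 2, 3)])
        self.project = cbr(3 * out_ch, out_ch, 1)

    def forward(self, rgb_feat, vv_feat):
        x = torch.cat([rgb_feat, vv_feat], dim=1)
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

acb = AtrousConvBlock(in_ch=256, out_ch=256)
fused = acb(torch.randn(1, 256, 16, 16), torch.randn(1, 256, 16, 16))
```

Setting padding equal to the dilation rate keeps the 1/16-scale spatial size unchanged, so the fused features can flow directly into both decoder branches.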

Fusion Methods in the Encoder-Decoder Structure
In the encoder-decoder semantic segmentation structure, the fusion of data from two modalities can occur at the input, encoder, decoder, or output stage, resulting in four fusion methods: (1) input fusion, in which the two modalities' data are concatenated as the input of an SS network [66]; (2) encoder fusion (Figure 5a), in which the extracted features of one modality are fused into the corresponding extracted features of the other modality throughout the encoding process [67]; (3) decoder fusion (Figure 5b), in which the features of the two modalities extracted in the encoding process are fused before they merge with the high-level upsampled features in the decoding process [68][69][70]; and (4) output fusion, in which the predictions made by two independent SS networks are fused by element-wise summation to produce the final prediction. In neither input fusion nor output fusion are intermediate features fused. In this study, for a fair comparison of these four fusion methods in fusing RGB and VV data for glacial lake extraction, we adopted U-Net and ResNet as the structure and backbone/feature extractor, respectively, for all four fusion methods. In both encoder fusion and decoder fusion, the features are fused via element-wise summation, as shown in Figure 5. Comparing these four typical fusion models helps to explore the influence of the position at which data fusion occurs on the networks' performance.
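The two fusion methods that involve no intermediate features can be sketched in numpy with stand-in arrays (the prediction maps below are random placeholders for the two networks' sigmoid outputs):

```python
import numpy as np

# Input fusion: concatenate the 3-band RGB patch and the 1-band VV patch
# along the channel axis before feeding a single SS network.
rgb = np.random.rand(3, 256, 256)
vv = np.random.rand(1, 256, 256)
rgb_vv = np.concatenate([rgb, vv], axis=0)   # 4-band RGB+VV input

# Output fusion: two independent networks each produce a prediction map,
# and the maps are combined by element-wise summation.
pred_rgb = np.random.rand(256, 256)          # stand-in for the RGB network's output
pred_vv = np.random.rand(256, 256)           # stand-in for the VV network's output
final = pred_rgb + pred_vv
```

Encoder and decoder fusion differ only in where the analogous element-wise summation is applied: on intermediate feature maps rather than on inputs or predictions.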


Implementation Details
The implementation of our method is based on the PyTorch library. We used stochastic gradient descent (SGD) [71] to optimize the networks. The weight decay and momentum were set to 0.0001 and 0.9, respectively. We set the initial learning rate to 0.01. After 20 epochs, the learning rate was decayed to 0.001, i.e., multiplied by a factor of 0.1. This learning rate setting was found after a systematic search. The batch size was set to 16 due to the limited GPU memory, as in [62]. We trained all of the models for 50 epochs, which is sufficient for the models to converge.
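The step schedule described above amounts to a single 10× decay at epoch 20. A minimal helper reproducing it (the function name is ours, not from the paper) could read:

```python
def lr_at_epoch(epoch, initial_lr=0.01, decay_epoch=20, factor=0.1):
    """Return the learning rate at a given epoch under the step schedule
    described in the text: 0.01 for the first 20 epochs, 0.001 afterwards."""
    return initial_lr if epoch < decay_epoch else initial_lr * factor

# In PyTorch this corresponds roughly to
# torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1).
```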


Loss Function and Evaluation Metrics
In remote sensing images of the Himalayan regions, glacial lakes occupy a very small area compared with glaciers, vegetation, and bare land; thus, there is a large class imbalance. In this situation, the cross-entropy loss widely used for SS biases network predictions towards background objects. To mitigate this issue, dice loss [72] was selected as our loss function to optimize the networks. The true positives (TP), false positives (FP), and false negatives (FN) can be obtained by comparing the predictions with the ground truth. According to the definition of the dice coefficient in [73], the dice loss can be expressed as:

Dice loss = 1 − 2TP / (2TP + FP + FN)

Considering the large class imbalance, we used precision, recall, F1, and intersection-over-union (IOU) [74] to evaluate the predictions of the networks on the test set. Precision is defined as:

Precision = TP / (TP + FP)

Precision, also known as user accuracy, reflects how many of the predicted positive samples are correctly classified. Recall is defined as:

Recall = TP / (TP + FN)

Recall, also known as producer accuracy, reflects how many positive samples in the ground truth are correctly predicted. F1 is defined as:

F1 = 2 × Precision × Recall / (Precision + Recall)

The F1 score, a compromise between precision and recall, comprehensively reflects the prediction accuracy of a model. Note that the F1 score is actually the dice coefficient [73]. IOU can be expressed as:

IOU = TP / (TP + FP + FN)

For binary classification, predictions and ground truth both belong to the set {0, 1}. IOU is the ratio of the intersection of the two binary masks to their union. A higher IOU corresponds to a higher model prediction accuracy.
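These counts and metrics can be implemented directly on binary masks; the following pure-Python sketch (written for clarity rather than speed, with function names of our own choosing) follows the definitions above:

```python
def confusion_counts(pred, truth):
    """Count TP/FP/FN between two flat binary masks (values in {0, 1})."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, fn

def dice_loss(pred, truth):
    """Dice loss = 1 - dice coefficient = 1 - 2TP / (2TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, truth)
    return 1.0 - 2.0 * tp / (2.0 * tp + fp + fn)

def metrics(pred, truth):
    """Return (precision, recall, F1, IOU) for two binary masks."""
    tp, fp, fn = confusion_counts(pred, truth)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou
```

As noted in the text, the F1 value returned here equals the dice coefficient, so the dice loss is simply 1 − F1.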

Results
Ref. [58] provided ResNets of various depths: 18-layer, 34-layer, 50-layer, 101-layer, and 152-layer. We trained and evaluated U-Nets with ResNets of different depths as the backbone. Given the limited GPU memory and the time required to train very deep networks, the 101-layer and 152-layer ResNets were not adopted in this study. The evaluation results of these U-Nets are detailed in Table 1. In addition to the four fusion methods and ACFNet mentioned in Section 3, we also trained and evaluated Wu's model [46] and two classical semantic segmentation models (SegNet and DeepLabV3+). Similar to ACFNet, Wu's model was proposed to extract glacial lakes based on optical and SAR images. SegNet and DeepLabV3+ are advanced and popular models in the semantic segmentation field. For a fair comparison with ACFNet, input fusion, Wu's model, SegNet, and DeepLabV3+ all utilize the 50-layer ResNet as the backbone. The evaluation results of input fusion, encoder fusion, decoder fusion, output fusion, SegNet, DeepLabV3+, Wu's model, and ACFNet are detailed in Table 2.


Backbone Depth for RGB and VV Data
For the RGB data, as seen from Figure 6a, U-Nets with different depths present similar training curves. In the validation stage (Figure 6b), the U-Net with a 50-layer ResNet as the backbone produces a lower loss than the U-Nets with 18-layer and 34-layer ResNets as the backbone after 21 epochs. For the test set, the U-Net with a 50-layer ResNet as the backbone achieved the highest IOU of 0.7724, as shown in Table 1. For the VV data, as seen in Figure 6c, the U-Net with an 18-layer ResNet as the backbone converged significantly faster than the U-Nets with 34-layer and 50-layer ResNets as the backbone, and it achieved the lowest training loss. In validation (Figure 6d), the loss curves of the U-Nets with 18-layer and 34-layer ResNets as the backbone were more stable than that of the U-Net with a 50-layer ResNet after 21 epochs. As shown in Table 1, the U-Net with an 18-layer ResNet as the backbone achieved the highest IOU of 0.5883 with the test set. Therefore, we chose 50-layer and 18-layer ResNets as the backbone/encoders of the RGB and VV branches, respectively, for ACFNet, encoder fusion, decoder fusion, and output fusion.


Effects of Fusion Methods
In this section, we discuss the effects of input fusion, encoder fusion, decoder fusion, output fusion, and ACFNet on fusing RGB and VV data for glacial lake extraction. Input fusion simply expands an RGB image's channels with the addition of a SAR image's VV band. RGB data reflect an object's spectral information, while VV data reflect the geometrical structure of an object and its surroundings. Directly concatenating RGB and VV data as input is not appropriate because they belong to two separate modalities yet pass through the same network. Among the four fusion methods and our model, input fusion presented the highest loss during training and validation (Figure 7). As expected, input fusion also achieved a poor IOU of 0.7596 with the test set (Table 2). In addition, the imaging time of the VV images was close to, but not the same as, that of the RGB images, which signifies that a difference in the texture of glacial lakes may exist between the two types of images. Therefore, the texture features extracted from RGB+VV images will be a mixture of an object's texture features from the RGB and VV images, bringing uncertainty to the network's predictions.
Encoder fusion gradually incorporates the extracted VV features into the feature extraction branch of the RGB data via summation. In this process, VV and RGB features are fused at locations with different network depths. Due to the relatively sufficient feature fusion, encoder fusion achieved an IOU of 0.7944 with the test set (Table 2), which is 3.48% higher than that of input fusion. However, gradually integrating VV features with RGB features in the encoding process may have disturbed the feature extraction of the RGB data, because the features fed to the RGB branch's encoding layers are a mixture of two modality features.
In decoder fusion, the feature extraction branches for the RGB and VV data at the encoder stage are independent, guaranteeing that the extracted features retain their own modality characteristics. Note that decoder fusion has exactly the same number of parameters as encoder fusion; the only difference between the two is the stage at which the extracted features are fused. However, decoder fusion converged faster than encoder fusion in the training process and presented lower loss during validation (Figure 7). With the test set, decoder fusion achieved an IOU of 0.8132 (Table 2), which is 1.88% higher than that of encoder fusion. This finding suggests that extracting RGB and VV features independently is better than incorporating VV features into RGB features during the encoding process.
Output fusion simply adds the prediction of the RGB branch to the prediction of the VV branch to form the final prediction. No features of the RGB and VV data are fused in the encoding or decoding stage. Thus, there is no disturbance between the RGB and VV data in the encoding or decoding processes, facilitating the training of the network. Output fusion converged fastest in the training process and achieved the lowest training loss (Figure 7a). However, due to the poor predictive ability of the VV branch, output fusion performed worse than ACFNet in validation (Figure 7b). With the test set, output fusion achieved an IOU of 0.8067 (Table 2), which is 4.71% and 1.23% higher than that of input fusion and encoder fusion, respectively. This finding suggests that sufficiently extracting the characteristics of each modality is vital for the full use of multimodal data. It is worth noting that output fusion achieved the highest precision of 0.9283 with the test set (Table 2). Because the prediction generated by output fusion can be regarded as the sum of the probabilities given by the RGB and VV branches, a pixel that both branches identify as a glacial lake pixel has a high probability of being one.
Unlike the shallow features fused in encoder fusion and decoder fusion, the features generated by the first and second decoder blocks in ACFNet sit at deep positions in the network. Thus, these features carry a higher level of semantic information than shallow features of the same spatial size. Fusing two modality features with high-level semantics could mitigate the negative influence caused by the different imaging times of the optical and SAR images. ACFNet builds on output fusion. In output fusion, the two predictions of the RGB branch and VV branch are simply fused by an element-wise summation; there are no neurons to learn the relationship between these two predictions. In ACFNet, the two groups of features generated by the last decoder blocks of the RGB branch and VV branch are fused through the atrous convolution block, in which many neurons learn a nonlinear map that converts these two groups of features into a new group of features for a high-accuracy prediction. Because of the two additional atrous convolution blocks, ACFNet is harder to train than output fusion and incurs a higher training loss (Figure 7a). However, ACFNet achieved the lowest loss in validation (Figure 7b) and the highest IOU of 0.8278 with the test set (Table 2), which is 2.11% higher than that of output fusion. This indicates that our network architecture and atrous convolution block are effective for fusing RGB and VV data for glacial lake extraction.
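The kind of multi-receptive-field fusion described above can be sketched in PyTorch as follows. This is a hypothetical illustration only: the dilation rates, channel counts, and layer layout are our assumptions, not the exact block definition from this paper.

```python
import torch
import torch.nn as nn

class AtrousFusionBlock(nn.Module):
    """Illustrative sketch of an atrous-convolution fusion block:
    concatenated RGB and VV decoder features pass through parallel
    3x3 convolutions with different dilation rates, so the two
    modalities are blended under several receptive fields at once.
    Rates and layout are assumptions, not taken from the paper."""

    def __init__(self, channels, rates=(1, 2, 4)):
        super().__init__()
        # One dilated branch per rate; padding=r keeps spatial size.
        self.branches = nn.ModuleList(
            nn.Conv2d(2 * channels, channels, kernel_size=3,
                      padding=r, dilation=r)
            for r in rates
        )
        # 1x1 convolution to project the concatenated branches back.
        self.project = nn.Conv2d(len(rates) * channels, channels,
                                 kernel_size=1)

    def forward(self, feat_rgb, feat_vv):
        x = torch.cat([feat_rgb, feat_vv], dim=1)        # stack modalities
        multi_scale = [branch(x) for branch in self.branches]
        return self.project(torch.cat(multi_scale, dim=1))
```

With `channels=64` and two 1×64×32×32 inputs, the block returns a fused 1×64×32×32 feature map, i.e., the spatial size is preserved while the modalities are mixed at three dilation rates.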

Comparisons with Other Models
As shown in Table 2, Wu's model achieved a higher IOU than input fusion, with a margin of 2.65%; this indicates that concatenating the shallow features of the RGB and VV images as input is better than directly concatenating the RGB and VV images as input. Because Wu's model cannot effectively extract and blend features of the RGB and VV data, its IOU was lower than that of ACFNet, by a large margin of 4.17%. Although SegNet utilizes the max pooling indices stored in the encoding process to recover details in the decoding process, it achieved the worst IOU of 0.7221 with the test set (Table 2), which is 3.75% lower than that of input fusion. This result indicates that shallow features can provide more information than max pooling indices for recovering object details. DeepLabV3+ achieved an IOU of 0.7389, which is 2.07% lower than that of input fusion, because the 4× upsampling at the end of the network loses many boundary details of glacial lakes in the prediction. Like input fusion, SegNet and DeepLabV3+ simply concatenate the RGB and VV images as input without specifically extracting and blending the two modality features. Thus, their performances with the test set were far inferior to that of ACFNet (Table 2). Figure 8 shows the glacial lake extraction results of input fusion, encoder fusion, decoder fusion, output fusion, ACFNet, Wu's model, SegNet, and DeepLabV3+.
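SegNet's use of pooling indices, mentioned above, can be illustrated with PyTorch's paired pooling/unpooling operators (a generic illustration of the mechanism, not SegNet's full decoder):

```python
import torch
import torch.nn as nn

# return_indices=True makes the pooling layer also report, for each
# output value, where in the input window the maximum came from.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.rand(1, 1, 4, 4)
pooled, indices = pool(x)           # 2x2 map plus argmax locations
restored = unpool(pooled, indices)  # 4x4 map: maxima back in place,
                                    # zeros everywhere else
```

Only the argmax locations are restored; the remaining activations come back as zeros, which is consistent with the observation above that skip-connected shallow features can carry more detail than pooling indices alone.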

Impacts of Imaging Time Intervals between SAR and Optical Images
In our dataset, the imaging time of each SAR patch was as close to that of the corresponding optical patch as possible. The distribution of the imaging time intervals between the SAR patches and optical patches is shown in Figure 9a. Approximately 95.56% of the imaging time intervals were within six days. Thus, the edge details of the glacial lakes in the SAR images are very similar to those in the optical images. Given that our optical images were mainly acquired during autumn, when seasonal snowfall is sparse, the glacial lakes in the SAR images exhibit a flat interior and a low backscatter coefficient, as shown in the "good" VV column of Figure 2b. Short imaging time intervals between the SAR and optical images are vital to the success of multisource data fusion. To further understand the impacts of the imaging time intervals between the SAR and optical images on glacial lake extraction, we generated a set of SAR patches that were acquired half a year earlier than the optical patches. We call this set of SAR images the "bad" VV images. The distribution of the imaging time intervals between the "bad" VV patches and optical patches is shown in Figure 9b. Approximately 95.37% of the imaging time intervals were between −188 and −178 days. Thus, the "bad" VV images were mainly acquired between March and May. During this period, many glacial lakes have frozen surfaces or are partially covered by seasonal snowfall, resulting in a heterogeneous interior structure and a relatively high backscatter coefficient in the SAR images, as shown in the "bad" VV column of Figure 2b. This makes it difficult to distinguish glacial lakes from background objects in the "bad" VV images.

Figure 9. Imaging time intervals between the SAR images and optical images: (a) imaging time intervals between "good" VV patches and RGB patches; (b) imaging time intervals between "bad" VV patches and RGB patches.
We trained and evaluated the models mentioned previously with the dataset comprising the RGB data and "bad" VV data. The evaluation results with the test set are detailed in Table 3. Compared with the evaluation results (Table 2) with the test set comprising the RGB data and "good" VV data, the accuracies of all of the models decreased. This is because the "bad" VV data cannot provide obvious features of glacial lakes due to the snow and ice cover. Wu's model achieved the smallest IOU decrease of 0.54%. We attribute this result to its simple and inadequate fusion of the RGB and VV features, which means the model only needs to learn a few parameters to ignore the "bad" VV features. Simply concatenating the RGB and VV images as input, input fusion produced a small IOU decrease of 0.63%. Similarly, DeepLabV3+ produced a small IOU decrease of 1.16%. The largest IOU decrease, 5.59%, was observed for SegNet. Although the RGB and VV images were simply concatenated as input in SegNet, the "bad" VV data produced incorrect max pooling indices that influenced the network's decoding process and predictions. Output fusion also produced a large IOU decrease of 4.22% due to the poor predictions from the "bad" VV data. Encoder fusion, decoder fusion, and ACFNet showed relatively large IOU decreases. We attribute this result to the sufficient feature fusion amplifying the effect of the "bad" VV data. Even so, ACFNet achieved the highest IOU of 0.7814 among the eight methods (Table 3).

Conclusions
We proposed a feature fusion network (ACFNet) to extract glacial lakes using Landsat 8 optical RGB images and Sentinel-1 SAR VV images. In the proposed model, the features of the optical and SAR images are independently extracted by two CNN branches in the encoder stage. The two modality high-level semantic features generated by the decoder blocks are adequately fused under different receptive fields by atrous convolution blocks. Input fusion, encoder fusion, decoder fusion, output fusion, SegNet, DeepLabV3+, and Wu's model were compared with our model. Due to the sufficient feature extraction of the single-modal data (optical/SAR) and the adequate fusion of the optical and SAR features, our model achieved the best glacial lake extraction with an F1 score of 0.9057. Although the selected SAR patches had imaging times closest to those of the optical patches, the boundary details of the glacial lakes in the two types of images were slightly different. Fusing more advanced features that have a larger receptive field and more abstract semantics, rather than shallow features, helps to mitigate the influence of this discrepancy; this also explains why our model works effectively. However, SAR images acquired in a different season than the optical images greatly affected the networks that adequately fuse optical and SAR features, because many neurons must be suppressed to neglect the "bad" SAR features. Our method could be used to monitor the long-term changes of glacial lakes, providing a basis for assessing GLOF risks and forecasting GLOFs. However, owing to the ubiquitous cloud and snow in the Himalayas and the revisit periods of the satellites, there would be many monitoring gaps. Note that in our method, SAR images serve as auxiliary data to the optical data for extracting glacial lakes; if the glacial lakes are covered by cloud, our method would give uncertain predictions.
In addition, the data used in this study were limited to optical RGB and SAR VV images. Given the complex environments in glaciated alpine regions, in future work we will integrate additional data, such as surface temperature and a DEM, into the model to map glacial lakes with higher accuracy.

Conflicts of Interest:
The authors declare no conflict of interest.