Exploring Contrastive Representation for Weakly-Supervised Glacial Lake Extraction

Against the background of ongoing atmospheric warming, the glacial lakes that are nourished and expanding in High Mountain Asia pose growing risks of glacial lake outburst floods (GLOFs) and increasing threats to downstream areas. Effectively extracting the area and consistently monitoring the dynamics of these lakes are of great significance in predicting and preventing GLOF events. To automatically extract the lake areas, many deep learning (DL) methods capable of capturing the multi-level features of lakes have been proposed for segmentation and classification tasks. However, the portability of these supervised DL methods needs to be improved before they can be directly applied to different data sources, as they require laborious effort to collect labeled lake masks. In this work, we proposed a simple glacial lake extraction model (SimGL) via weakly-supervised contrastive learning to extend and improve the extraction performance in cases that lack labeled lake masks. In SimGL, a Siamese network was employed to learn similar objects by maximizing the similarity between the input image and its augmentations. Then, a simple Normalized Difference Water Index (NDWI) map was provided as the location cue instead of labeled lake masks, constraining the model to capture the representations related to glacial lakes and the segmentations to coincide with the true lake areas. Finally, the experimental results of glacial lake extraction on 1540 Landsat-8 image patches showed that our approach, SimGL, offers competitive performance with some supervised methods (such as Random Forest) and outperforms other unsupervised image segmentation methods in cases that lack true image labels.


Introduction
Glacial lakes are of considerable interest due to their sensitivity to ongoing climate change and the outburst risks they pose to downstream communities in High Mountain Asia [1][2][3]. As a typical component of water resources, glacial lakes are positioned in glacierized regions and fed by melting glaciers. As they are adversely impacted by glacier shrinkage and seasonal climate variation, glacial lakes have grown rapidly in both number and area [1][2][3][4][5], especially in recent decades, concomitant with increases in glacier-related hazards such as glacial lake outburst floods (GLOFs) [6,7]. To monitor the dynamics of lakes and give forewarning of GLOFs, the automatic and accurate extraction of glacial lakes from remotely sensed images is a prerequisite for the fast evaluation of these outlying lakes.
Many approaches have been explored by coupling two or more types of optical imagery, including the digital elevation model (DEM), synthetic aperture radar (SAR), thermal infrared image and satellite altimetry for glacial lake extraction. For example, Li et al. [8] and Song et al. [9] employed the Normalized Difference Water Index (NDWI) to highlight the lake information and extracted this information by leveraging a local threshold derived from bimodal histograms in the buffer zone of each potential lake area. Gardelle et al. [10] combined green, Near Infrared (NIR) and Short Wave Infrared (SWIR) bands to map frozen glacial lakes or lakes with floating ice. Wangchuk et al. [11] collected a large number of pixels from lake and non-lake areas and assigned a category to each pixel according to the patterns of glacial lakes learned by a random forest in high-dimensional space. Shen et al. [12] and Li et al. [13] used an object-oriented method to identify the potential lake areas and refined the final lake area with a pre-defined water extraction decision ruleset. Zhao et al. [14] integrated the advantages of threshold segmentation and a simplified active contour model to extract small glacial lakes. Mitkari et al. [15] proposed an object-based image analysis (OBIA) method for mapping small supraglacial lakes from high spatial resolution LISS-IV data. Zhang et al. [16] used a phase-congruency-based detector on C-band SAR images to extract the area of glacial lakes according to their outline and texture. However, many influencing factors, including the physical properties of lakes (glacial lakes of differing size, turbidity, depth and coloring) and complicated environmental conditions (such as shadows from mountains and clouds, and melting glaciers) [2,3,8,9], still pose challenges to large-scale glacial lake extraction. In particular, the different intrinsic physical properties cause lakes to show varying spectral responses, and the extrinsic geomorphic factors often show spectral values similar to those of glacial lakes; together, these make it difficult to differentiate glacial lakes from diverse factors in remotely sensed images. Although they achieve good performance and high accuracy, these traditional methods still require extensive pre- or post-processing to determine the optimal segmentation parameters and to eliminate the effects of these factors.
Recently, with the advancement of DL models, many tasks in classification, segmentation and object detection have made great progress, as DL methods can capture the high-level features of objects and make final decisions based on these feature patterns. Some DL models have also been successfully applied to glacial lake extraction. For example, Kaushik et al. [17] used a Deep Convolutional Neural Network (DCNN) to automatically map glacial lakes from multisource remote sensing (RS) data; Qayyum et al. [18] and Wu et al. [19] employed a U-net model to extract the contours of glacial lakes; Thati et al. [20] utilized a V-net model to segment water and non-water bodies from satellite imagery; and Zhao et al. [21] proposed a GAN-based architecture for glacial lake mapping. These five DL-based models utilized convolution operations to capture the high-level spatial features and skip connections to integrate the features at different scales, so that the model learns the patterns of glacial lakes and greatly improves the accuracy of glacial lake extraction. Although their performance far exceeds that of the traditional methods, without the use of any ancillary data (such as the Digital Elevation Model (DEM) [18][19][20][21]) or post-processing work, they require a lot of effort to prepare the training labels. This reliance on supervision limits the applicability of these DL methods to different data modalities.
To avoid assembling such large sets of annotated image labels and to extend model training to downstream tasks, some works explore unsupervised (without labeled data) or semi-supervised (with few labels) representation methods and have made notable progress [22][23][24][25][26][27][28][29][30]. As a conspicuous breakthrough in unsupervised learning, contrastive representation learning is based on the intuition that the same object in various transformations of an image (multi-view, color change, rotation, blurring) should have similar representations, whilst being dissimilar to the representations of other objects. Several contrastive learning works, including SimCLR [26], MoCo [27], SwAV [28], BYOL [29] and Simsiam [30], use a Siamese network to find valuable representations by maximizing the similarity between the input image and its augmentations. SimCLR [26] and MoCo [27] provided a training strategy without a memory bank; SwAV [28] applied the online clustering mechanism to the Siamese network; BYOL [29] designed an asynchronous momentum encoder to solve the issues of constructing negative pairs; Simsiam [30] showed that a contrastive network learns meaningful representations without negative sample pairs, momentum encoders or large batches in the training stage. These works encouraged us to explore unsupervised DL methods for glacial lake extraction. As seen in Figure 1, the traditional DL methods require both an image and a true mask of the glacial lake as input, a requirement that prevents the model from capturing the lake features and learning the lake patterns when no labeled masks are available. Contrastive learning instead finds the lake areas by transforming the input image and generalizing the features of the lakes.
In this work, we proposed a simple glacial lake extraction network (SimGL) via weakly-supervised contrastive learning. In our design, a remote sensing image is the only input. Zhang et al. [31] evaluated the extraction performance of glacial lakes using 23 classical spectral features and concluded that the NDWI has great potential in glacial lake mapping; therefore, we segment the NDWI map with a tight threshold to provide rough location cues of glacial lakes, which serve as a pseudo label. Inspired by contrastive learning works [26][27][28][29][30], a Siamese network cascaded with a prediction head is introduced to learn similar representations and sort out the same objects. Our loss function consists of two parts: a contrastive loss in the Siamese network, which constrains the segmentation process to learn similar representations, and a location loss between the segmentations and the NDWI maps, which helps capture the precise boundaries of the glacial lakes. The evaluation results of glacial lake mapping on Landsat-8 imagery demonstrate that our model achieves competitive performance with some supervised methods and shows great potential in learning the patterns of glacial lakes without labeled data.
We summarize the contributions of our work threefold:

• We proposed a simple yet effective glacial lake extraction model, SimGL, which effectively learns lake representations from unlabeled RS images via contrastive learning in the training stage.

• We further introduced the NDWI map into our model to provide location cues of the glacial lakes and proposed a location loss to encourage the segmentations to coincide with the true glacial lake boundaries.

• We evaluated our model SimGL using four metrics and compared the segmentation performance with other glacial lake mapping methods on the Landsat-8 imagery. The results demonstrate that our model SimGL surpasses other unsupervised methods and narrows the performance difference with supervised DL methods.

Methodology
In this section, we introduce the architecture of SimGL (as shown in Figure 2) and explain the loss function, the components of which are detailed in the following subsections. As Figure 2 shows, our model consists of two parts: (1) an NDWI map, regarded as a pseudo label, provides the lake location cues and guides the network in segmenting the lake masks; (2) a Siamese network learns the meaningful representations of objects by maximizing the similarity between two inputs. Note that we only use some simple strategies to provide the location cues of the lakes. Neither image label supervision nor a sophisticated structure (such as salient object detection) is involved in the model training. In the Siamese part, a prediction layer is applied to the projected features from one branch to predict the transformed features from the other branch, and a contrastive loss measures the similarity between the two features. The other part takes only the RS image as input; a location loss is then calculated between the output map (generated by decoding the multi-scale features) and the location cues (generated by thresholding the NDWI map) to constrain the segmentation results.

Preliminaries
Given an RS image $x \in \mathbb{R}^{h \times w \times c}$ with height $h$, width $w$ and $c$ channels, we aim to find a network $f$ that predicts the lake segmentation mask $y \in \mathbb{R}^{h \times w \times 1}$ from the input image $x$ using a weakly-supervised training strategy. If any location in image $x$ is denoted by $u$, this process can be modeled as $p(y_u \mid x) = f(x_u; \theta)$, where $\theta$ denotes the parameters of the network. Our model uses two loss terms to learn the segmentation parameters $\theta$: a contrastive loss and a location loss.

Weak Location Cues from NDWI Map
Benefiting from its simple calculation and convenient application, the Water Index (WI) is the most frequently used method for locating lake areas. Among the WIs, the NDWI has been widely applied in glacial lake mapping because it effectively suppresses factors that are easily confused with glacial lakes, such as mountain shadows and melting glaciers [8,9]. Here, the NDWI is defined following [32]:
$$\mathrm{NDWI} = \frac{\rho_{green} - \rho_{NIR}}{\rho_{green} + \rho_{NIR}} \quad (1)$$
where $\rho_{green}$ and $\rho_{NIR}$ represent the Top-of-Atmosphere (TOA) reflectance values in the green and NIR bands measured by the sensor, respectively.
The NDWI map contains the position cues of the glacial lakes. Most previous research has set a threshold in the range of 0.0 to 0.4 to segment glacial lake areas in the NDWI map. Although a low threshold retains as many glacial lake pixels as possible, other pixels with spectral feature values similar to those of glacial lakes are also retained. We aim to find accurate lake pixels to serve as a pseudo label that provides lake information to the model; a high (tight) threshold should keep glacial lake pixels only and avoid the effects of other objects. Therefore, we set a tight threshold T on the NDWI map to obtain the binary lake mask B, in which 0 is assigned to the background and 1 to the glacial lake. This step is called weak localization and is illustrated in the brown part of Figure 2. With a tight threshold, all non-lake pixels and some lake pixels confused with the background are removed, leaving only glacial lake areas in the binary mask. Although the lake areas in this mask may be incomplete and imprecise, the mask helps guide the network to seek objects that are similar to glacial lakes. To describe the similarity between the masks and the segmentation results, we introduce a location loss (Formula (2)). There are many existing ways to generate the binary mask B used in the location loss, such as salient object detection (SOD) [33,34]; however, glacial lakes are too small to be identified by SOD. Thus, we use a simple WI to localize the glacial lakes.
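The weak localization step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `ndwi_pseudo_label`, the small `eps` guard against division by zero and the toy reflectance values are assumptions, and the default threshold of 0.6 is the value reported later in the experiments.

```python
import numpy as np

def ndwi_pseudo_label(green, nir, threshold=0.6, eps=1e-8):
    """Compute an NDWI map and a binary pseudo label B (hypothetical helper).

    green, nir: arrays of TOA reflectance with the same shape.
    threshold:  tight threshold T; pixels with NDWI > T are labeled as lake (1).
    eps:        assumed guard against a zero denominator, not part of the paper.
    """
    green = np.asarray(green, dtype=np.float64)
    nir = np.asarray(nir, dtype=np.float64)
    ndwi = (green - nir) / (green + nir + eps)
    mask = (ndwi > threshold).astype(np.uint8)  # 1 = glacial lake, 0 = background
    return ndwi, mask

# Toy 2 x 2 example with made-up reflectance values.
green = np.array([[0.30, 0.10], [0.25, 0.05]])
nir = np.array([[0.05, 0.12], [0.20, 0.04]])
ndwi, mask = ndwi_pseudo_label(green, nir, threshold=0.6)
```

Only the top-left pixel (NDWI of about 0.71) exceeds the tight threshold here. A looser threshold of 0.1 would also keep the two pixels with NDWI of about 0.11, which is the kind of contamination the tight threshold is meant to avoid.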

Contrastive Semantic Segmentation
In our Siamese network, we use a weight-shared encoder, which contains four down-sampling blocks, to capture similar objects from two inputs at different scales. The multi-scale feature maps are represented by $q = [q_1, q_2, q_3, q_4]$. To measure the similarity of two sets of feature maps, the feature maps are input to a projector that outputs the embedding vectors [29,31]. The vector of an image should be predictive of the vector of the transformed image [29]; therefore, we employ a predictor $p$ to transform one vector and match it to the vector from the other branch (see "predictor" in Figure 2) [29,31]. We use the negative cosine similarity (Formula (3)) to measure their matching score, where $q'$ denotes the feature map encoded from the transformed image $x'$. Following [31], we use a stop-gradient operation (see "stop-grad" in Figure 2) to prevent model collapse; Formula (3) is modified accordingly into Formula (4), in which $q'$ is regarded as a constant. Following [29,31], our contrastive loss (Formula (5)) consists of two symmetrical components. Overall, our model was trained with a composition of the two loss functions, in which the hyper-parameter λ balances the two loss terms. The location loss, computed from an input image and the NDWI map, provides weakly supervised lake information, while the contrastive loss in Formula (5) forces the network to contrast the similarities between the inputs and their augmentations and guides the network to seek the feature maps of the same objects.
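The symmetric negative-cosine objective described above can be illustrated with the following NumPy sketch. It assumes the SimSiam-style form with equal weights of 1/2 on the two components, which the text implies but does not state; the names `neg_cosine` and `symmetric_contrastive_loss` are hypothetical. NumPy has no automatic differentiation, so the stop-gradient is only indicated in comments: in a DL framework the target vectors would be wrapped in that framework's stop-gradient operation.

```python
import numpy as np

def neg_cosine(pred, target, eps=1e-12):
    """Negative cosine similarity, averaged over the batch.

    pred:   predictor outputs, shape (batch, dim).
    target: projected vectors from the other branch, shape (batch, dim).
            In training these are treated as constants (stop-gradient).
    """
    pred = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return -np.mean(np.sum(pred * target, axis=-1))

def symmetric_contrastive_loss(p1, z2, p2, z1):
    """Two symmetrical components; the 1/2 weights are an assumption.

    p1, p2: predictor outputs for the image and its augmentation.
    z1, z2: projector outputs for the image and its augmentation.
    """
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

# Perfectly aligned vectors give the minimum value of -1.
a = np.array([[3.0, 4.0]])
aligned = symmetric_contrastive_loss(a, 2 * a, a, a)
```

The loss ranges from -1 (both branches agree up to scale) to 1 (opposite directions), so minimizing it maximizes the similarity between an image and its augmentation.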

Image Augmentation
We defined seven ways to generate the augmentations of multi-band RS images: (1) color jitter, (2) gray scaling, (3) flipping, (4) rotating, (5) blurring, (6) random area erasing and (7) noise addition. For blurring, the images are blurred with a Gaussian kernel; here, the kernel size is 3 × 3 and the other parameters keep their default values. For random area erasing, we randomly masked some pixels (less than 1% of the image size) with 0 to erase the spectral information. For noise addition, we add Gaussian noise to the image.
An example of transforming a glacial lake image in different ways is shown in Figure 3. These transformations perturb the spectral or spatial features, which is crucial for enhancing the robustness of the model training.
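A minimal sketch of how four of the seven transformations (flipping, rotating, random area erasing and noise addition) could be applied stochastically to a multi-band patch is given below. The function name `augment`, the horizontal flip direction, the 90-degree rotation steps and the noise standard deviation are assumptions for illustration; the default probabilities are the ones reported in the augmentation experiments. Color jitter, gray scaling and blurring are omitted to keep the example dependency-free.

```python
import numpy as np

def augment(image, rng, p_flip=0.8, p_rot=0.7, p_erase=0.5, p_noise=0.5,
            noise_std=0.01):
    """Randomly transform an (h, w, c) patch; the input is left unchanged."""
    out = np.array(image, dtype=np.float64)  # working copy
    if rng.random() < p_flip:
        out = out[:, ::-1, :]  # horizontal flip (direction is an assumption)
    if rng.random() < p_rot:
        # Rotate by 90, 180 or 270 degrees in the spatial plane.
        out = np.rot90(out, k=int(rng.integers(1, 4)), axes=(0, 1))
    if rng.random() < p_erase:
        out = np.ascontiguousarray(out)
        h, w, c = out.shape
        # Erase fewer than 1% of the pixels (holds for patches of >= 200 pixels).
        n_erase = int(rng.integers(1, max(2, int(0.01 * h * w))))
        idx = rng.choice(h * w, size=n_erase, replace=False)
        out.reshape(-1, c)[idx] = 0.0  # zero all bands of the chosen pixels
    if rng.random() < p_noise:
        out = out + rng.normal(0.0, noise_std, size=out.shape)
    return np.ascontiguousarray(out)

rng = np.random.default_rng(0)
patch = rng.random((256, 256, 7))   # stand-in for a 7-band Landsat-8 patch
augmented = augment(patch, rng)
```

Passing the generator explicitly keeps the augmentation reproducible, which is convenient when the same transformed view has to be regenerated for debugging.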

Detailed Network Architecture
Overall, the components of our model include (see Figure 2):

• Encoder and decoder: The encoder and decoder are the same as the structures in the U-net. We use the encoder to capture the feature maps at different scales from the input image and the decoder to reconstruct the segmentation results of the lakes from the feature maps. In our model, each selected feature map is taken from the results before the down-sampling operation.

• Projector: The projector has three fully-connected (fc) layers and batch normalization (BN) layers, and the first two layers are activated by ReLU. The output of the projector is a 2048-d vector.

• Predictor: The predictor has two fc layers. The first is followed by a BN and a ReLU layer, and the last has no further operations. The output dimension is 2048-d, while the hidden layer is set to 512-d.

Experiment Results
In this section, we validate SimGL in detail and compare our model with state-of-the-art models.

Dataset and Evaluation Metrics
Dataset: Landsat-8 OLI images have a suitable spatial resolution (30 m) and a moderate revisit period (16 days), benefiting glacial lake investigation over large-scale regions. We collected 103 Landsat-8 images (all acquired in the autumn of 2016, as the boundaries of the glacial lakes are clearer in this period) and then randomly cropped 256 × 256 × 7 image patches from these images. Only patches in which lake pixels cover more than 1% of the patch area were kept. To make the corresponding lake labels, we first converted the 2016 High Mountain Asian Glacial Lake Inventory dataset (Hi-MAG) [1] into a raster file with 30 m spatial resolution. Secondly, we cropped the raster file according to the geographical coordinates of the patches so that labels and patches cover the same extent. Finally, our dataset contains 1540 image patches and 1540 corresponding labels. In our experiments, we split the dataset into 70% for training, 20% for validation and 10% for testing. More details of our dataset are summarized in Table 1.
Evaluation metrics: In our study, we use four metrics to evaluate the segmentation of glacial lakes: Precision (P), Recall (R), F1 Score (F1) and Intersection over Union (IoU). They are defined following [21]:
P = correctly extracted water pixels / all extracted pixels;
R = correctly extracted water pixels / all water pixels;
F1 = 2 × P × R / (P + R);
IoU = (extracted water pixels ∩ true water pixels) / (extracted water pixels ∪ true water pixels).
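The four metrics can be computed directly from a predicted binary mask and its label. The sketch below is illustrative only: the function name `segmentation_metrics` and the convention of returning 0 when a denominator is empty are assumptions, not details taken from the paper.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Precision, Recall, F1 and IoU for binary lake masks (1 = water)."""
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    tp = int(np.logical_and(pred, truth).sum())    # correctly extracted water
    fp = int(np.logical_and(pred, ~truth).sum())   # extracted but not water
    fn = int(np.logical_and(~pred, truth).sum())   # water that was missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return {"P": precision, "R": recall, "F1": f1, "IoU": iou}

# Toy example: one pixel correct, one false alarm, one missed pixel.
pred = np.array([[1, 1], [0, 0]])
truth = np.array([[1, 0], [1, 0]])
scores = segmentation_metrics(pred, truth)
```

In this toy case P, R and F1 are all 0.5 while IoU is 1/3, which shows why IoU is the stricter of the two summary scores when errors occur on both sides.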

Implementation Details
All of the experiments were implemented using TensorFlow 1.14 on the Python 3.7 platform with one GTX 1660 Ti GPU (6 GB GPU memory). For the hyper-parameters, we followed the conventions of [31] and set the initial learning rate to 0.005. The learning rate was then scheduled following a cosine decay policy, with a decay rate of 0.0001. For training, we set the loss coefficient λ to 10. Our model SimGL was trained with the Adam optimizer for 100 epochs with a batch size of 8.
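For readers unfamiliar with cosine decay, the schedule can be sketched as follows. This is one plausible reading of the settings above, not a confirmed reproduction: it assumes that the reported decay rate of 0.0001 plays the role of the `alpha` floor (the minimum learning rate as a fraction of the initial rate) in the standard cosine decay formula, and the total step count is a hypothetical figure derived from 70% of 1540 patches, a batch size of 8 and 100 epochs.

```python
import math

def cosine_decay_lr(step, total_steps, initial_lr=0.005, alpha=0.0001):
    """Cosine-decayed learning rate.

    alpha is ASSUMED to correspond to the paper's "decay rate"; it sets the
    final learning rate to initial_lr * alpha.
    """
    step = min(step, total_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return initial_lr * ((1.0 - alpha) * cosine + alpha)

# Hypothetical step count: ceil(0.7 * 1540 / 8) = 135 steps per epoch.
total_steps = 135 * 100
lr_start = cosine_decay_lr(0, total_steps)
lr_mid = cosine_decay_lr(total_steps // 2, total_steps)
lr_end = cosine_decay_lr(total_steps, total_steps)
```

Under these assumptions the rate starts at 0.005, passes roughly half of that at the midpoint and ends near 5e-7.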

Diagnostic Experiments
Coefficient λ in joint loss: The coefficient λ balances the influence of the two types of losses. We report the segmentation performance for values of λ ranging from 10^−4 to 10^4, as shown in Figure 4. The best F1 score and IoU occurred when λ reached 10. Therefore, we set λ to 10, as it balances the two loss terms well.
Loss term ablation: To test whether using the NDWI only or the contrastive module only is enough for glacial lake mapping, we examined the model when part of it was ablated; namely, with only one type of loss or with both loss terms. The loss term ablation results are shown in Table 2. As our Landsat patches are similar in content (glacial lakes, glaciers, shadows, etc.), these high-frequency objects are also extracted together with the glacial lakes when using contrastive learning only, which caused a low F1 score (0.1356) and IoU (0.1083) in the loss ablation results. Although the model can yield good results using the location loss only, the location cues provided by the NDWI map did not match the true lake boundaries accurately enough. Thus, our model SimGL combined the advantages of the two loss terms and further improved the segmentation results.
Water Index and thresholds: Many WIs are designed to highlight the lake information, including NDWI [32], MNDWI [35] and MI [36]. They are usually combined with threshold segmentation to convert the WI maps into binary lake masks. In our model, a rough lake location map is required as a pseudo label to directly guide the segmentations to be more accurate and less contaminated by the background. In most glacial lake mapping research, the thresholds of these three WIs range between 0.0 and 0.4 [1,8,9]. Therefore, we explore how the segmentation performance and the evaluation metrics are affected when setting gradually tighter thresholds in a broad range of [−0.1, 0.8], as shown in Figure 5.
Specifically, Figure 5 shows that the F1 and IoU progressively increase as the threshold is tightened to 0.6 for the three WIs, but decrease when the threshold is set to a high value (greater than 0.7). For example, when we set the NDWI threshold to 0.6, our model achieves its best performance, with the highest Precision, F1 and IoU (see Figure 5a). Similarly, the best thresholds are 0.6 for MNDWI (see Figure 5b) and 0.7 for MI (see Figure 5c). Glacial lake pixels always show high WI values, as do some pixels from melting glaciers and mountain shadows; thus, a tight threshold keeps mostly glacial lake pixels, whereas a loose threshold also extracts noise pixels.
Moreover, we visualized the results of SimGL when setting different thresholds on NDWI, MNDWI and MI, as shown in Figure 6. The experimental results were heavily contaminated by glaciers when using MNDWI and MI. This also indicates that the NDWI is more accurate in generating lake masks, especially in glacierized regions. In addition, almost all of the extracted pixels belonged to glacial lakes when the NDWI threshold was set to be greater than 0.5, indicating that a tight threshold helps the model learn the lake information more accurately.
Types of image augmentations: Seven methods are defined for transforming an image in our experiments. We further divide them into two groups to explore the influence of each type and group. One group contains {color jitter, gray scaling, blurring, random area erasing and noise addition}, which change the spectral distribution of an image. The other group uses {flipping and rotating} operations to change the location of each pixel. Table 3 shows the evaluation metrics when employing a single type or a single group of transformation methods. Each type in a group is applied with a probability of 0.6 in the evaluation experiments.
From Table 3, we found that the flipping operation obtained the highest F1 and IoU scores, and random area erasing the lowest. Thus, we gave a higher probability to the more important transformations; finally, we set applying probabilities of {0.7, 0.6, 0.8, 0.7, 0.5, 0.5, 0.5} for the transformations {color jitter, gray scaling, flipping, rotating, blurring, random area erasing and noise addition}.

Comparison with the State of the Art
In this subsection, we compare our model with other widely used mapping methods, including both supervised and unsupervised methods:
• NDWI: Water Indexes (WIs), including NDWI [32] and MNDWI [35], are the simplest and most widely used methods in glacial lake mapping. Among these indexes, the NDWI is the most feasible way to highlight the lake information and suppress the background information [31]. To test the segmentation performance of using NDWI only, we set the segmentation threshold to 0.6, the same value used in our model.
• Global-local iterative segmentation algorithm (GLSeg) [8,9]: The GLSeg includes two hierarchical image segmentation stages. First, the NDWI map is segmented with a global threshold to delineate the potential lake areas. Second, a local threshold is calculated to determine the extent of each potential lake within a buffer zone of the lake. Moreover, auxiliary data (such as a DEM) are introduced to filter out noise pixels whose NDWI values are similar to those of glacial lakes. For fairness, we only use the RS imagery as the input, and the parameters are set following [8,9]. We set NDWI > 0.1, NIR < 0.15 and SWIR < 0.05 in the global segmentation stage, and the local threshold was computed according to the mean and variance of the lake and background pixels.
• C-V model [14,37]: As a region-based segmentation method, the C-V model shows great anti-noise ability, which improves the segmentation accuracy in the homogeneous areas of glacial lakes and avoids the influence of individual noise pixels from the surroundings. The C-V model employs an active curve to separate the image into inner and outer parts and uses an energy function to evaluate the segmentation results. When the energy function reaches an optimal state, the curve converges to the true lake boundaries. The parameters are set following [14,37].
• Random Forest classification (RF) [11]: RF has good robustness and generalization in classification tasks because of its random sampling of the input data and features in each decision tree. For the RF training, we set 1000 trees to vote on whether a pixel belongs to a glacial lake or not, and our training set includes 93,431 glacial lake samples and 93,431 non-lake samples, each of which has seven band values and a class label.
• U-net [18,20,38]: U-net is the first DL model for glacial lake segmentation. It learns the pattern of glacial lakes in order to eliminate the dependence on auxiliary data (such as using a DEM to remove mountain shadows). U-net contains four pairs of encoder and decoder units, and skip connections are employed to concatenate feature maps from different scales and capture more details of the lake boundaries. Finally, the output mask is the segmentation result. We set the parameters to be the same as in [18].
• GAN-GL [21]: GAN-GL uses a zero-sum game between a generator and a discriminator to find the stable state, and a water attention module is also introduced to accelerate the convergence process. This GAN-based method can delineate the glacial lake boundaries more easily, without any distribution assumptions.
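The two threshold-based baselines above can be illustrated with a short NumPy sketch. It assumes the standard NDWI definition, (Green − NIR) / (Green + NIR), and reflectance inputs scaled to [0, 1]; the function names and the toy pixel values are hypothetical, and only the global stage of GLSeg is shown (the local, per-lake thresholding stage is omitted).

```python
import numpy as np

def ndwi(green, nir, eps=1e-8):
    # Normalized Difference Water Index; eps avoids division by zero.
    return (green - nir) / (green + nir + eps)

def ndwi_mask(green, nir, threshold=0.6):
    # NDWI-only baseline: pixels above the threshold are marked as lake.
    return ndwi(green, nir) > threshold

def glseg_global_mask(green, nir, swir,
                      ndwi_min=0.1, nir_max=0.15, swir_max=0.05):
    # Global stage of GLSeg with the thresholds quoted in the text.
    return (ndwi(green, nir) > ndwi_min) & (nir < nir_max) & (swir < swir_max)

# Toy 2x2 reflectance values (hypothetical).
green = np.array([[0.30, 0.10], [0.20, 0.08]])
nir = np.array([[0.05, 0.12], [0.10, 0.30]])
swir = np.array([[0.02, 0.10], [0.03, 0.20]])

strict = ndwi_mask(green, nir)                  # threshold 0.6
candidates = glseg_global_mask(green, nir, swir)  # threshold 0.1 plus band tests
```

In this toy example the lower global threshold keeps one additional candidate pixel that the 0.6 threshold rejects.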
We evaluated the glacial lake extraction performance of these segmentation methods on the validation dataset; the results are shown in Table 4. Judging by the F1 score and IoU, the unsupervised segmentation methods (NDWI, GLSeg, C-V) perform clearly worse than the supervised methods (RF, U-net, GAN-GL). Among the methods that do not use labeled lake masks, our model yields the best performance and is even competitive with the supervised classification method (RF).
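For reference, the F1 score and IoU of a binary lake mask can be computed as in the sketch below. It uses the standard pixel-wise definitions; the function name and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
import numpy as np

def f1_and_iou(pred, truth):
    """Pixel-wise F1 score and IoU for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))    # lake pixels correctly extracted
    fp = int(np.sum(pred & ~truth))   # background marked as lake
    fn = int(np.sum(~pred & truth))   # lake pixels that were missed
    union = tp + fp + fn
    if union == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0, 1.0
    f1 = 2.0 * tp / (2.0 * tp + fp + fn)
    iou = tp / union
    return f1, iou

pred = np.array([[1, 1], [0, 0]])
truth = np.array([[1, 0], [1, 0]])
f1, iou = f1_and_iou(pred, truth)
```

The two metrics are monotonically related (F1 = 2·IoU / (1 + IoU)), so they always rank methods in the same order on a single image.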

Discussion
In this section, we discuss the benefits and shortcomings of our model SimGL, and its applicability to other sensors.

Visualization of Comparisons
Figure 7 shows the visualization results of the glacial lake extraction performances. Compared with the results from other glacial lake extraction methods, our SimGL shows several potential improvements. First, the pixel-based methods (NDWI, GLSeg and RF) are prone to producing noise in the form of isolated pixels (see Figure 7a,c), which may come from mountain shadows, melting glaciers and floating ice, whereas our model SimGL uses convolution operations to capture the spatial features of glacial lakes and therefore shows an excellent anti-noise ability. Second, compared to the supervised method RF, our model SimGL performs well in extracting lakes with floating ice or a frozen surface (see Figure 7b,d,f). Considering the three unsupervised glacial lake extraction methods (NDWI, the C-V model and the GLSeg model), some small lakes were detected by the C-V model and the GLSeg model but not extracted by NDWI. Because a low global threshold of 0.1 was set in C-V and GLSeg, some small lakes that are easily confused with the background (they always have a low NDWI value) are still detected. This means that the location cues provided by the NDWI map do not cover the glacial lake areas accurately; nevertheless, our model SimGL extracted more lake pixels than NDWI alone (see Figure 7f), which also illustrates that our model SimGL can learn the patterns of glacial lakes from limited lake information.
Although the supervised DL models (U-net and GAN-GL) provide an automatic scheme to segment glacial lakes and obtain exceptional performance, they still inevitably require a large number of training images and labels. Therefore, our method SimGL provides a new scheme to segment glacial lakes over large-scale areas when true lake labels are lacking.

Applicability to Different Sensors
To determine the generalizability and robustness of the model, we tested our model on four types of data: the Landsat-5 Thematic Mapper (TM), Landsat-7 Enhanced Thematic Mapper Plus (ETM+), Landsat-8 Operational Land Imager (OLI) and Sentinel-2 MultiSpectral Instrument (MSI). Here, the Landsat-8 OLI has a higher radiometric resolution (16 bits) than its predecessors (eight bits in Landsat-5 TM and Landsat-7 ETM+), so its images carry richer color information. Although the Sentinel-2A MSI image has a high cloud cover of 66.07%, the region where the glacial lake lies is fortunately cloud-free, and the lake boundary is clear enough for segmentation experiments. Moreover, considering that the Sentinel-2A MSI imagery has 13 bands with different spatial resolutions, we stacked bands 2/3/4/8 (corresponding to the blue/green/red/NIR bands) into one image file, as they all have a 10 m spatial resolution. The detailed information of the four images is listed in Table 5. Finally, the experimental results for the different sensors are shown in Figure 8. Comparing the first and third rows in Figure 8, the glacial lake areas extracted by our model SimGL are very close to the true lake areas and free of noise interference, even though different RS images are used, which indicates that our model has good applicability and can be easily applied to other RS images.
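The band stacking step for the Sentinel-2A image can be sketched as below. The band keys ("B02", "B03", "B04", "B08"), the dictionary-based input and the function name are hypothetical; reading the bands from disk is left out, and the shape check merely stands in for verifying that all selected bands share the same 10 m grid.

```python
import numpy as np

# Blue, green, red and NIR bands, all at 10 m resolution (hypothetical keys).
TEN_METRE_BANDS = ("B02", "B03", "B04", "B08")

def stack_10m_bands(bands):
    """Stack the four 10 m bands into one (H, W, 4) array."""
    layers = [np.asarray(bands[name]) for name in TEN_METRE_BANDS]
    shapes = {layer.shape for layer in layers}
    if len(shapes) != 1:
        raise ValueError(f"bands are not on a common grid: {sorted(shapes)}")
    return np.stack(layers, axis=-1)

# Toy example: four 10 m bands plus one coarser band that is ignored.
bands = {name: np.full((4, 4), i, dtype=np.float32)
         for i, name in enumerate(TEN_METRE_BANDS)}
bands["B11"] = np.zeros((2, 2), dtype=np.float32)  # 20 m band, not stacked
stacked = stack_10m_bands(bands)
```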

Possibility in Monitoring GLOF Events
All glacial lake extraction methods are ultimately expected to make it convenient to monitor and find potentially dangerous lakes on a large scale, and then to give early warning for dangerous lakes under long-term observation by remote sensing. Recently, some works have attempted to extract glacial lakes at different periods and analyze them to identify the lakes with a high outburst risk. For example, Ahmed et al. [39] used a simple weighted index on high-resolution satellite data for glacial lake mapping and change detection analysis, and 21 glacial lakes were marked as potentially dangerous lakes in the upper Jhelum basin, Kashmir Himalaya, India; Nie et al. [40] extracted the glacial lake extent in 1990, 2000, 2005 and 2010, employing an object-oriented segmentation method and manual inspection, ultimately identifying 118 lakes as potentially vulnerable lakes in the Himalaya; Shrestha et al. [41] delineated the glacial lake boundaries using NDWI for the years 1977, 1990, 2000 and 2010 in the Koshi basin of the Himalaya and found 42 rapidly growing glacial lakes that deserve more attention in terms of GLOFs.
These works use simple segmentation methods (such as NDWI) to extract glacial lakes at a regional scale and identify the dangerous ones through the comparative analysis of lake areas in different periods. In contrast, our model SimGL has better applicability to RS sensors and a better segmentation performance, which greatly reduces the pre- and post-processing work in glacial lake extraction. In the future, the temporal mapping of glacial lake areas using our model, together with the investigation of lakes with high expansion rates, will be important for recognizing dangerous lakes and giving early warning of GLOF events.
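As an illustration of such a multi-temporal analysis, the sketch below converts two binary lake masks into areas and a mean annual expansion rate. The 30 m pixel size corresponds to Landsat imagery; the function names, the toy masks and the dates are hypothetical, and no risk threshold is implied.

```python
import numpy as np

def lake_area_km2(mask, pixel_size_m=30.0):
    # Area covered by lake pixels, assuming square pixels of the given size.
    return float(np.count_nonzero(mask)) * pixel_size_m ** 2 / 1e6

def annual_expansion_rate(area_start, area_end, years):
    # Mean relative area change per year (fraction of the initial area).
    if area_start <= 0 or years <= 0:
        raise ValueError("initial area and time span must be positive")
    return (area_end - area_start) / area_start / years

# Toy masks of the same lake at two dates (hypothetical).
mask_1990 = np.zeros((20, 20), dtype=bool)
mask_1990[5:15, 5:15] = True    # 100 lake pixels
mask_2010 = np.zeros((20, 20), dtype=bool)
mask_2010[5:15, 3:18] = True    # 150 lake pixels

a0 = lake_area_km2(mask_1990)
a1 = lake_area_km2(mask_2010)
rate = annual_expansion_rate(a0, a1, years=20)
```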

Impact of Location Cues
Because the images in the dataset have similar contents, always containing objects such as glaciers, glacial lakes and vegetation, these objects are extracted together with the glacial lakes when only contrastive learning is used, resulting in a low F1 score (0.1356) and IoU (0.1083) in the loss term ablation. To extract glacial lakes in cases lacking true lake labels, we combined contrastive learning with rough location cues provided by a simple Water Index. In our model, illustrated in Figure 2, the NDWI map provides rough location cues of glacial lakes, guiding the extracted area to coincide more closely with the true lake boundaries. These weak location cues are critical to the segmentation performance. Therefore, we explored how to obtain effective location cues of glacial lakes, and whether our model learned something useful with the help of this weakly-supervised information.
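The way the two training signals could be combined is sketched below in NumPy. This is an assumed form, not the exact loss of SimGL: the contrastive term is written as a negative cosine similarity between predicted and projected features, the location term as a binary cross-entropy against the NDWI pseudo mask, and the coefficient λ (set to 10 here) is assumed to weight the location term. All function names are hypothetical, and gradient handling (for example, stopping gradients on one branch) is not modeled.

```python
import numpy as np

def negative_cosine_similarity(p, z, eps=1e-8):
    # p: predicted features of one branch, z: projected features of the other.
    p = p / (np.linalg.norm(p, axis=1, keepdims=True) + eps)
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
    return float(-np.mean(np.sum(p * z, axis=1)))

def location_loss(prob_map, cue_mask, eps=1e-7):
    # Binary cross-entropy between the output map and the NDWI location cues.
    p = np.clip(prob_map, eps, 1.0 - eps)
    y = np.asarray(cue_mask, dtype=float)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def total_loss(p, z, prob_map, cue_mask, lam=10.0):
    # Assumed combination: contrastive term plus lam times the location term.
    return negative_cosine_similarity(p, z) + lam * location_loss(prob_map, cue_mask)

features = np.array([[1.0, 0.0], [0.0, 2.0]])
cues = np.array([[1, 0], [0, 1]])
good = total_loss(features, features, cues.astype(float), cues)
bad = total_loss(features, features, 1.0 - cues, cues)
```

With identical features and an output map that matches the cues, the loss approaches its minimum of −1; an output map that contradicts the cues is penalized heavily, which is the constraint that keeps the segmentation focused on lake areas.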
The model performance was first evaluated under the condition of using different Water Indexes and setting threshold values.By analyzing the results from Figures 5 and 6, we can conclude that the model achieved the optimal effect when setting a threshold of 0.6 to the NDWI.
As seen in Figures 7 and 8, we further visualized the results of our model SimGL and the results of using NDWI only. Three merits of our model can be deduced from the comparison of the two methods: (1) NDWI is a pixel-based method that processes each pixel independently by applying a pre-defined threshold, which may contaminate the segmentation with many isolated noise pixels if these pixels have NDWI values similar to those of lake pixels (for example, in Figure 7a, many mountain shadow pixels are extracted by NDWI as noise). In contrast, our model SimGL can effectively avoid the interference of noise, as it segments the lake areas by identifying the high-level spatial features of lakes; (2) The glacial lake boundary extracted by NDWI is relatively unsmooth, as a high threshold yields an inaccurate lake boundary (for example, in Figure 8, the boundary of the glacial lake was ragged when we used NDWI to map the glacial lake from Landsat-7 ETM+ imagery), whereas our model, having learned the spatial features, can provide complete glacial lake areas; (3) NDWI fails to identify lake pixels that are covered by floating ice (such as the extraction results in Figure 7f and the result evaluated on Sentinel-2A MSI imagery in Figure 8). However, SimGL can eliminate the influence of floating ice to some extent. Specifically, glacial lake areas can still be discriminated by SimGL even when the surface is covered by thin floating ice. These three merits suggest that our model SimGL, combining contrastive learning and rough location cues, can effectively learn the features and patterns of glacial lakes from the limited cues provided by NDWI and give a better mapping result of glacial lakes.
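The isolated-pixel noise mentioned in merit (1) can be quantified with a simple check such as the one below. It is only an illustrative diagnostic, not part of SimGL: a pixel is counted as isolated when none of its eight neighbours is marked as lake, and the function name and toy mask are hypothetical.

```python
import numpy as np

def isolated_pixel_mask(mask):
    """Mark lake pixels that have no lake pixel among their 8 neighbours."""
    m = np.asarray(mask, dtype=bool)
    h, w = m.shape
    padded = np.pad(m, 1, mode="constant", constant_values=False)
    neighbours = np.zeros((h, w), dtype=int)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue  # skip the pixel itself
            neighbours += padded[1 + di:1 + di + h, 1 + dj:1 + dj + w]
    return m & (neighbours == 0)

# Toy mask: one compact 2x2 lake and one stray pixel (e.g., a shadow pixel).
mask = np.zeros((5, 5), dtype=bool)
mask[0:2, 0:2] = True
mask[4, 4] = True
isolated = isolated_pixel_mask(mask)
noise_count = int(isolated.sum())
```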

Conclusions
In this work, we proposed a simple glacial lake extraction network (SimGL) via a weakly-supervised training strategy. This weakly-supervised DL method extends and improves the extraction performance when true labeled lake masks are lacking in the model training stage, and therefore shows good applicability to different RS data. In SimGL, a Siamese model was utilized to capture similar objects from the input image and its augmentation via unsupervised contrastive learning. Then, a pseudo lake label, obtained by masking the NDWI map with a tight threshold, was used to give the lake location cues and guide the segmentation. The evaluation results of the glacial lake segmentation on the 1540 Landsat-8 image patches indicated that our model outperformed the other unsupervised image segmentation methods and achieved a competitive performance with some supervised methods (such as Random Forest).
Through the comparisons with the NDWI segmentation method and the exploration of the applicability to data from other RS sensors, our model shows clear benefits in its anti-noise ability and applicability. In addition, although we use the NDWI map to generate the location cues for SimGL, our model can learn the features and patterns of glacial lakes from limited weakly-supervised information and segment the glacial lakes more accurately. In general, our work provides a new technique for segmenting glacial lakes from RS imagery, even without lake labels in the training stage, which improves glacial lake mapping over large-scale areas.

Figure 1. Different inputs and training strategies in contrastive learning and traditional deep learning (DL) methods in glacial lake mapping. DL methods always need labeled lake masks as input to construct the loss function, while contrastive learning calculates the loss function only between the input image and its augmentations.

Figure 2. The architecture of the proposed SimGL. It consists of two parts. The first takes the RS image and its augmentations as input pairs for a Siamese network, which generates a set of feature maps at different scales. A prediction layer is then applied to the projected features from one branch to predict the transformed features from the other branch, and a contrastive loss measures the similarity between the two features. The second part takes only the RS image as input; a location loss is calculated between the output map (generated by decoding the multi-scale features) and the location cues (generated by thresholding the NDWI map) to constrain the segmentation results.

(1) Color jitter. We use color jitter with {brightness, contrast, saturation, hue} strengths of {0.4, 0.4, 0.4, 0.2} on the RGB bands of the RS images, as the hue is only well-defined for RGB data.
(2) Gray scaling. We use gray scaling to remove the color information and represent each pixel only by its intensity.
(3) Flipping. We randomly flip an image in the horizontal or vertical direction.
(4) Rotating. We randomly rotate an image by an angle in the set {90°, 180°, 270°}.

Figure 3. Different ways to generate the augmentations of multi-band RS images.

Figure 4. The effects of the coefficient λ on the segmentation results. The horizontal axis is the range of λ from 10⁻⁴ to 10⁴, and the vertical axis reflects the value of each metric. The optimal segmentation performance in terms of F1 and IoU occurred when λ reached 10.

Figure 5. The relationship between thresholds and evaluation metrics for the NDWI, MNDWI and MI, respectively. (a) Threshold ablation for NDWI. We set the NDWI threshold to 0.6 for further experiments, as it balances the lake information and noise information. (b) Threshold ablation for MNDWI. The best threshold of MNDWI should be 0.6 for providing pseudo lake masks. (c) Threshold ablation for MI. The best threshold of MI should be 0.7 for providing pseudo lake masks.

Figure 6. Two visualized samples of SimGL when setting different thresholds and WIs in the pseudo label generation stage. The blue areas are lakes extracted by SimGL. From the visualizations, the model output is closest to the ground truth when we set the threshold in the range of [0.5, 0.7] on NDWI.

Figure 7. Visualization of the glacial lake segmentation results obtained with different methods. Extracted lakes are marked in blue. (a-f) are six regions of glacial lakes developed in different surroundings.

Figure 8. The visualization results of our model SimGL applied to four different RS images. The blue areas are extracted lakes. The first row shows the RS images from Landsat-5 TM, Landsat-7 ETM+, Landsat-8 OLI and Sentinel-2A. The second row shows the segmentation results obtained by thresholding the NDWI map with a value of 0.6; pixels greater than this value are marked as lake pixels. The third row shows the testing results using our model SimGL.

Table 1. Details of our dataset.

Table 2. Ablation results of using different loss terms.

Table 3. Evaluation results of different types of image transformation.

Table 4. Evaluation results of the segmentation methods, indicating whether each involves labels or thresholds.

Table 5. Detailed information of the images from four different sensors.