Large-Scale River Mapping Using Contrastive Learning and Multi-Source Satellite Imagery

: River system is critical for the future sustainability of our planet but is always under the pressure of food, water and energy demands. Recent advances in machine learning bring a great potential for automatic river mapping using satellite imagery. Surface river mapping can provide accurate and timely water extent information that is highly valuable for solid policy and management decisions. However, accurate large-scale river mapping remains challenging given limited labels, spatial heterogeneity and noise in satellite imagery (e.g., clouds and aerosols). In this paper, we propose a new multi-source data-driven method for large-scale river mapping by combining multispectral imagery and synthetic aperture radar data. In particular, we build a multi-source data segmentation model, which uses contrastive learning to extract the common information between multiple data sources while also preserving distinct knowledge from each data source. Moreover, we create the ﬁrst large-scale multi-source river imagery dataset based on Sentinel-1 and Sentinel-2 satellite data, along with 1013 handmade accurate river segmentation mask (which will be released to the public). In this dataset, our method has been shown to produce superior performance (F1-score is 91.53%) over multiple state-of-the-art segmentation algorithms. We also demonstrate the effectiveness of the proposed contrastive learning model in mapping river extent when we have limited and noisy data.


Introduction
River system plays a crucial role in global carbon circulation, as it delivers the carbonaceous matter within the global ecosystem and maintains the connection between ocean and land [1,2]. Rivers are also important in many countries due to the increasing demand to supply drinking water, irrigation and farming practices and power generation. Hence, the capacity to accurately map large-scale rivers is urgently needed for making policy and management decisions.
With the recent development of the space remote sensing technology, satellite images become available over large regions (often at global scale), which enables large-scale surface area monitoring [3,4]. Many existing methods utilize pre-defined water indices, such as Normalized Difference Water Index (NDWI) [5] and Modified Normalized Difference Water Index (MNDWI) [6], which are computed from a subset of spectral bands. These indices are designed to enhance the water representation in contrast to other land covers based on the reflectance characteristics.
Recently, there is a growing interest in land cover mapping using machine learning algorithms including Support Vector Machine (SVM) [7], Random Forest (RF) [8] and deep learning-based image segmentation algorithms such as UNET [9] and Pspnet [10]. Compared with the water indices, these machine learning methods can better extract informative non-linear combinations of reflectance values from all the spectral bands that can reflect water extent [11,12]. Additionally, deep learning-based image segmentation methods often outperform pre-defined water indices and traditional pixel-wise machine learning algorithms due to their capacity to capture image-level information. In particular, these deep learning models use receptive fields to capture spatial dependencies across pixels and thus have a better chance at capturing informative patterns related to river shapes, compactness and surrounding land covers [13,14].
However, large-scale river mapping still remains a challenge due to several reasons [15][16][17][18]. Firstly, we often have limited labeled training samples due to the high cost associated with manual annotation through visual inspection. This become especially serious for training advanced deep learning models. Second, rivers located in different regions can have different water properties, catchment characteristics and surrounding land covers and thus show different reflectance spectral value in the satellite imagery [19]. Traditional machine learning methods that are trained from limited labeled samples collected from specific regions may not be able to generalize to other regions [20,21]. Third, given the nature of remote sensing imagery, the river area in the multi-spectral satellite images may be blocked by unpredictable noise such as aerosol and clouds, leading to missing information during the river mapping process [22,23].
To address these challenges, we propose a new machine learning method that combines multi-spectral data and synthetic aperture radar (SAR) data to jointly map the extent of rivers over large regions, the main idea of the proposed method is shown in Figure 1.
Remote Sens. 2021, 13, x FOR PEER REVIEW 2 of 18 indices are designed to enhance the water representation in contrast to other land covers based on the reflectance characteristics.
Recently, there is a growing interest in land cover mapping using machine learning algorithms including Support Vector Machine (SVM) [7], Random Forest (RF) [8] and deep learning-based image segmentation algorithms such as UNET [9] and Pspnet [10]. Compared with the water indices, these machine learning methods can better extract informative non-linear combinations of reflectance values from all the spectral bands that can reflect water extent [11,12]. Additionally, deep learning-based image segmentation methods often outperform pre-defined water indices and traditional pixel-wise machine learning algorithms due to their capacity to capture image-level information. In particular, these deep learning models use receptive fields to capture spatial dependencies across pixels and thus have a better chance at capturing informative patterns related to river shapes, compactness and surrounding land covers [13,14].
However, large-scale river mapping still remains a challenge due to several reasons [15][16][17][18]. Firstly, we often have limited labeled training samples due to the high cost associated with manual annotation through visual inspection. This become especially serious for training advanced deep learning models. Second, rivers located in different regions can have different water properties, catchment characteristics and surrounding land covers and thus show different reflectance spectral value in the satellite imagery [19]. Traditional machine learning methods that are trained from limited labeled samples collected from specific regions may not be able to generalize to other regions [20,21]. Third, given the nature of remote sensing imagery, the river area in the multi-spectral satellite images may be blocked by unpredictable noise such as aerosol and clouds, leading to missing information during the river mapping process [22,23].
To address these challenges, we propose a new machine learning method that combines multi-spectral data and synthetic aperture radar (SAR) data to jointly map the extent of rivers over large regions, the main idea of the proposed method is shown in Figure 1. The proposed contrastive learning based multi-source data segmentation structure. The segmentation result is generated using the self-information and common information, extracting from the multi-source data pair, including Data 1 and Data 2. For the common information extracting process, the contrastive learning idea is to minimize the relevant image information embedding between the same location multi-source data. Here, the relevant image information are extracted from multi-source data pair of same location and the irrelevant image information are extracted from multi-source data pair of different location. Thus, the common information path can be more focused on the clear area of the image (without aerosol and cloud) and can guide the segmentation process, along with the self-information extracted from each multi-source data. The proposed contrastive learning based multi-source data segmentation structure. The segmentation result is generated using the self-information and common information, extracting from the multi-source data pair, including Data 1 and Data 2. For the common information extracting process, the contrastive learning idea is to minimize the relevant image information embedding between the same location multi-source data. Here, the relevant image information are extracted from multi-source data pair of same location and the irrelevant image information are extracted from multi-source data pair of different location. Thus, the common information path can be more focused on the clear area of the image (without aerosol and cloud) and can guide the segmentation process, along with the self-information extracted from each multi-source data.
The multi-spectral data can capture optical reflectance characteristics of different land covers while SAR data is more sensitive to surface structures and also less impacted by clouds and aerosols. We build a novel machine learning algorithm that can leverage Remote Sens. 2021, 13, 2893 3 of 18 supplementary strengths of these two types of data to improve the mapping performance. Our major innovation and contributions can be summarized as follows:

•
We present a contrastive learning based multi-source satellite data segmentation structure for large-scale river mapping, as shown in Figure 1. To better extract representative patterns from limited and noisy data, we include the contrastive learning strategies, where the common information and self-information are extracted from the multi-source data and fused for the river segmentation.

•
We create a large-scale multi-source river imagery dataset, which, to the best of our knowledge, is the first multi-source large-scale satellite dataset (with aligned multi-spectral data and SAR data) towards river mapping. The dataset contains multispectral data and SAR data for 1013 river sites that are distributed over the entire globe and handmade ground truth masks for these river sites. • On our dataset, we have demonstrated the superiority of the proposed multi-source segmentation method over multiple widely used methods including water indicesbased method, traditional machine learning approaches and deep learning approaches.

Study Area and Data
Aiming at the large-scale river mapping task, we first create a large-scale multi-source river imagery dataset, which to the best of our knowledge, is the first multi-source largescale satellite dataset toward river mapping. This dataset contains 1013 image pairs of rivers based on Sentinel-1 and Sentinel-2 satellite and 1013 manually annotated river ground truth masks. The annotation is conducted through visual inspection by referring to both Seninel-1 and Sentinel-2 Satellite imagery. Table 1 shows the details of these available satellite data sources. We show the details of the created multi-source river imagery dataset in Table 2. The dataset contains the rich river imagery samples from different continents of the world, covering the time period from January 2016 to December 2016. During the labeling process, the annotators carefully inspect both Sentinel-1 and Sentinle-2 images on close dates. Here, the Sentinel-1 is Synthetic Aperture Radar style imagery, which contains less noise information such as cloud and aerosol, but in low resolution. In the meantime, Sentinel-2 is Multi-Spectral style imagery, which is high resolution, but contains more cloud and aerosol. Thus, we may have a better chance at obtaining accurate labels by leveraging the advantages of both types of the satellite imagery. Figure 2 shows an example of the multi-resource satellite imagery in our dataset. Here, we select the VV bands of the Sentinel-1 data to generate the gray image and the Band #9 (Band name: Water vapour), #7 (Band name: Vegetation Red Edge) and #3 (Band name: Green) of the Sentinel-2 data to generate the false color composite image for sample visualization (the sample visualization shown in the rest paper follows same fashion). We manually create the river segmentation mask according to the Sentinel-1 and Sentinel-2 imagery.
During the labeling process, the annotators carefully inspect both Sentinel-1 and Sentinle-2 images on close dates. Here, the Sentinel-1 is Synthetic Aperture Radar style imagery, which contains less noise information such as cloud and aerosol, but in low resolution. In the meantime, Sentinel-2 is Multi-Spectral style imagery, which is high resolution, but contains more cloud and aerosol. Thus, we may have a better chance at obtaining accurate labels by leveraging the advantages of both types of the satellite imagery. Figure 2 shows an example of the multi-resource satellite imagery in our dataset. Here, we select the VV bands of the Sentinel-1 data to generate the gray image and the Band #9 (Band name: Water vapour), #7 (Band name: Vegetation Red Edge) and #3 (Band name: Green) of the Sentinel-2 data to generate the false color composite image for sample visualization (the sample visualization shown in the rest paper follows same fashion). We manually create the river segmentation mask according to the Sentinel-1 and Sentinel-2 imagery.

Proposed Method
We propose a contrastive learning process to extract representative hidden features from multi-spectral data and SAR data. These hidden features are extracted by the proposed deep learning models and contain the common information between the multisource data and the self-information of each type of the data. Here, we first describe the idea of contrastive learning used in our proposed method. Let x , x + and x − represent three samples within a dataset. Here, x serves as an anchor sample and we consider the relationship between x and x + to be relevant, x and x − to be irrelevant. For example, can be a multi-spectral image patch from a specific region and can be a SAR image patch from the same region while is a SAR image patch from a different region. The goal of contrastive learning is to learn an embedding ( ) f ⋅ that achieves higher score for the relevant pair ( , ) x x + and lower score for the irrelevant pair ( , ) x x − [24], which is shown as follows:

Proposed Method
We propose a contrastive learning process to extract representative hidden features from multi-spectral data and SAR data. These hidden features are extracted by the proposed deep learning models and contain the common information between the multi-source data and the self-information of each type of the data. Here, we first describe the idea of contrastive learning used in our proposed method. Let x, x + and x − represent three samples within a dataset. Here, x serves as an anchor sample and we consider the relationship between x and x + to be relevant, x and x − to be irrelevant. For example, x can be a multi-spectral image patch from a specific region and x + can be a SAR image patch from the same region while x − is a SAR image patch from a different region. The goal of contrastive learning is to learn an embedding f (·) that achieves higher score for the relevant pair (x, x + ) and lower score for the irrelevant pair (x, x − ) [24], which is shown as follows: where the function score() can be any similarity measure between two feature embeddings.
Here, we hypothesize that data in different modalities (e.g., SAR and multi-spectral images) both contain key patterns that describe our detection target (i.e., water extent). The idea behind contrastive learning is to better extract such relevant information that is shared between multi-source input samples.
In our problem, we further extend such standard contrastive learning approach to improve the detection to extract both the relevant information shared across different data sources and the self-information that is unique to each data source. The intuition is that we not only make use of common knowledge from multi-spectral and SAR data, but also leverage their supplementary strengths (e.g., land cover spectral characteristics by multispectral data and surface textures and elevations by SAR) to improve the river mapping. In particular, we use A = {a 0 , a 1 , . . .} and B = {b 0 , b 1 , . . .} to represent two different types of data as input for the river segmentation task. We use the relevant pair (a i , b i ) to represent multi-source data samples (i.e., image patches) of two different data types taken from the same geographical region. In the meantime, the irrelevant pair (a i , b j ) represents multi-source data samples of two different data types taken from different geographical regions. The proposed algorithm in this paper contains three steps: (1) multi-source data common information extraction, (2) multi-source data self-information extraction and (3) information fusion-based segmentation.

Common Information Extraction
We develop a neural network structure to extract common information from multisource data input, as shown in Figure 3.
, where the function () can be any similarity measure between two feature embeddings. Here, we hypothesize that data in different modalities (e.g., SAR and multi-spectral images) both contain key patterns that describe our detection target (i.e., water extent). The idea behind contrastive learning is to better extract such relevant information that is shared between multi-source input samples. In our problem, we further extend such standard contrastive learning approach to improve the detection to extract both the relevant information shared across different data sources and the self-information that is unique to each data source. The intuition is that we not only make use of common knowledge from multi-spectral and SAR data, but also leverage their supplementary strengths (e.g., land cover spectral characteristics by multispectral data and surface textures and elevations by SAR) to improve the river mapping. In particular, we use represents multi-source data samples of two different data types taken from different geographical regions. The proposed algorithm in this paper contains three steps: (1) multisource data common information extraction, (2) multi-source data self-information extraction and (3) information fusion-based segmentation.

Common Information Extraction
We develop a neural network structure to extract common information from multisource data input, as shown in Figure 3. Here, we utilize the UNET staked convolutional and pooling layers in implementing the neural network structure of f (x). The parameters in these neural network layers (used to compute f (x)) can be trained based on the contrastive learning and will be further used for model testing.
During the training and testing process, we first train the proposed structure based on pseudo-code in Algorithm 1 and Algorithm 2. Algorithm 1 shows the pseudo-code for the model training. Here, we set up the training set with two different types of the sample pairs which are marked with a flag variable. Here, the flag variables are manually obtained during the multi-source sample pairs creation for common information extraction model training. the training sample pair with flag = 1 represents the relevant pair (a i ,b i ) and the training sample pair with f lag = 0 represents the irrelevant pair (a i ,b j ). Given two input samples Input 1 (e.g., a i ) and Input2 (e.g., b i or b j ), the loss function for extracting common information is defined as follows: When flag = 0, the loss Loss f lag=0 is computed as follows: Training process of the common information extraction structure Input: Multi-source training sample pairs (SAR and multi-spectral data), flag (represents a multi-source training sample pair belongs to relevant or irrelevant type) Output: Distance 1: Block the 'Feature1' and 'Feature2' layers during the training process. 2: Enable the 'Distance' layer during the training process. 3: Set the 'Input 1' as the SAR sample input, the 'Input 2' as the multi-spectral sample input. 4: for each multi-source training sample pairs do 5: Fine-tuning the network parameters using loss function Loss (Input1, Input2, flag) defined in Equation (2).

6: end for
The m parameter is a constant, where m > 0. The m is the expected minimum distance between f (Input1) and f (Input2). Here, Distance represents the distance between embeddings of Input1 and Imput2 and its value can be calculated using the Euclidean distance, as follows: It can be observed that Loss f lag=0 and Distance are negatively correlated, as shown in Equation (3). Thus, the goal of Loss f lag=0 is to maximize the Distance between irrelevant pairs. Here, m denotes a threshold for distance penalty, i.e., we do not penalize an irrelevant pair (a i , b j ) if the distance of between them is greater or equal to m.
When flag = 1, the loss Loss f lag=1 is calculated as follows: It can be observed that Loss f lag=1 and Distance are positively correlated, as shown in Equation (5). Thus, the goal of Loss f lag=1 is to minimize the Distance between relevant pair (a i , b i ). Thus, during the model training process, the model loss Loss(Input1, Input2, f lag) can be simplified into Loss f lag=0 or Loss f lag=1 based on the vector flag of current input training sample and could implement by the tensorflow platform.
Furthermore, Algorithm 2 shows the pseudo-code for the model testing. During the testing process, the multi-sources testing samples are set as the input of the structure and the common information matrices of the multi-sources testing samples are extracted from the 'Feature 1 and 'Feature' layers and will be further used in the information fusion process as described in Section 2.2.3.

Algorithm 2. Testing process of the common information extraction structure
Input: Multi-source testing sample pairs (SAR and multi-spectral data) Output: Common information at 'Feature 1' and 'Feature 2' layers 1: Enable the 'Feature1' and 'Feature2' layers during the testing process. 2: Block the 'Distance' layer during the testing process. 3: Set the 'Input 1' as the SAR sample input, the 'Input 2' as the multi-spectral sample input. 4: for each multi-source testing sample pairs do 5: Achieve the common information matrices for multi-source testing sample pair from the 'Feature1' and 'Feature2' layers, separately. 6: end for

Self-Information Extraction
We also build a self-information extraction structure for each type of input data (i.e., multi-spectral data or SAR data). Our model architecture is inspired by the multi-scale feature extraction structure in UNET. As shown in Figure 4, the model is trained in a supervised way using data from each single source as input to predict the segmentation ground truth. We maintain two self-information extraction structures and have them trained separately using two types of input data. In this way, the hidden representation obtained from this extraction structure (e.g., the hidden layer before the final convolutional layers) can encode discriminative information about water extent using each single-source data.

Information Fusion
In order to combine the extracted common information and self-information from the multi-source data inputs and perform the river segmentation task, we build an information fusion structure. We show the overall structure in Figure 5, which contains three paths. Firstly, the common information path extracts the common information from the multi-source data input. Secondly, the self-information path extracts the self-information from each type of the data within the multi-source data. Thirdly, the information fusion path concatenates the common information and self-information, then generates the final segmentation result.
During the training process, the common information path and the self-information path are trained separately. Then, the parameters of the common information path and the self-information path are copied from the well-trained common information extraction model and self-information extraction model to initialize respective components in the overall model structure. Finally, the entire model will be trained using two types of inputs and training labels in a supervised fashion. The Binary Cross Entropy Loss is used for supervised training. Assuming we have n training samples and we useŷ and y to represent predicted labels and manually annotated labels, respectively, the loss function is defined as follows: and training labels in a supervised fashion. The Binary Cross Entropy Loss is used for supervised training. Assuming we have training samples and we use and to represent predicted labels and manually annotated labels, respectively, the loss function is defined as follows: Figure 4. The proposed self-information extraction structure. The self-information of the input data is captured by a multi-stage feature extraction structure and further generates the output using fused multi-stage feature and convolutional layers.

Experiment Setup
The experiment is conducted with Window 10, Python 2.7 and GTX2080 hardware environment. We also compare with several baseline segmentation methods including a popular water index-based approach, traditional machine learning approaches and stateof-the-art deep learning approaches.
Here, we setup several baselines in both single-source input style model and multisource input style model. In order to achieve the multi-source input style model, we combine the SAR data (the sample size is 96 × 96 × 1) and multi-spectral data (the sample size is 96 × 96 × 9), which finally becomes a multi-source style input (the sample size is 96 × 96 × 10). These baselines include: • NDWI: Normalized Difference Water Index is widely used to map water extent. We compute NDWI from multi-spectral imagery and threshold the obtained values of different pixels to obtain the river map. • SVM and RF: Support Vector Machine and Random Forest are popular machine learning methods that have been widely used in remote sensing. Here, SVM-S2 and RF-S2 train the SVM and RF model using only Sentinel-2 data (SAR data) as input, separately. SVM-M and RF-M utilize the multi-source input as the model input during the model training and model testing.

•
UNET: UNet is a popular deep learning model for pixel-wise classification (or segmentation) and was originally designed for medical image segmentation. Here, we setup three baseline methods, including UNET-S1, UNET-S2, UNET-M. UNET-S1 train the UNET model using only Sentinel-1 data (SAR data) as input. UNET-S2 train the UNET model using only Sentinel-2 data (SAR data) as input. UNET-M utilize the multi-source input as the model input during the model training and model testing.

•
Non-contrastive: In order to verify the contribution of the common information during the segmentation process, we design the non-contrastive baseline as a variant of Figure 5. The proposed information fusion based multi-source data segmentation structure. The proposed structure contains two self-information extraction path and a common information extraction path. The weights of these paths are copied from well-trained the common information extraction structure and self-information extraction structure, as mentioned in Figures 3 and 4.

Experiment Setup
The experiment is conducted with Window 10, Python 2.7 and GTX2080 hardware environment. We also compare with several baseline segmentation methods including a popular water index-based approach, traditional machine learning approaches and state-of-the-art deep learning approaches.
Here, we setup several baselines in both single-source input style model and multisource input style model. In order to achieve the multi-source input style model, we combine the SAR data (the sample size is 96 × 96 × 1) and multi-spectral data (the sample size is 96 × 96 × 9), which finally becomes a multi-source style input (the sample size is 96 × 96 × 10). These baselines include: • NDWI: Normalized Difference Water Index is widely used to map water extent. We compute NDWI from multi-spectral imagery and threshold the obtained values of different pixels to obtain the river map. • SVM and RF: Support Vector Machine and Random Forest are popular machine learning methods that have been widely used in remote sensing. Here, SVM-S2 and RF-S2 train the SVM and RF model using only Sentinel-2 data (SAR data) as input, separately. SVM-M and RF-M utilize the multi-source input as the model input during the model training and model testing. • UNET: UNet is a popular deep learning model for pixel-wise classification (or segmentation) and was originally designed for medical image segmentation. Here, we setup three baseline methods, including UNET-S1, UNET-S2, UNET-M. UNET-S1 train the UNET model using only Sentinel-1 data (SAR data) as input. UNET-S2 train the UNET model using only Sentinel-2 data (SAR data) as input. UNET-M utilize the multi-source input as the model input during the model training and model testing. • Non-contrastive: In order to verify the contribution of the common information during the segmentation process, we design the non-contrastive baseline as a variant of the proposed method by removing the common information path in the proposed structure. Here, the non-contrastive method utilized the multi-scale idea of the UNET and further setup a convolutional path starting with the concatenate layer, which could fusion the information from two self-information extraction path.

•
We implement two versions of the proposed algorithm: • Proposed Method A: The self-information extraction path and common information extraction path both utilize same amount of training samples (for which we have training labels) during the training process. • Proposed Method B: The self-information extraction path is trained using labeled training samples. The common information extraction path utilizes all the samples (both labeled and unlabeled) since it can be trained in an unsupervised fashion. Finally, the information fusion part is trained using only the labeled data.

•
The key hyper-parameters for the baselines and proposed method are shown in Table  3. The Contrastive Loss in Table 3 are described in Section 2.2.1.

Large-Scale River Mapping
We divide the entire dataset (1013 samples) by randomly selecting 60% samples as the training set and the remaining 40% samples as the testing set. Then, we train the proposed method and baselines using the training set and validate these methods on the testing set. Here, the VV bands of the Sentinel-1 and the nine 10/20-m spatial resolution bands of the Sentinel-2 data are used for training and testing process.  Then, we evaluate the performance by measuring the F1-score, precision and recall among different methods, by using the 10m spatial resolution ground truth mask. Here, we briefly introduce the definition of these measures. Let TP represents the number of pixels that have been correctly predicted among river pixels. FP represents the number of pixels that have not been correctly predicted among river pixels. TN represents the number of pixels that have been correctly predicted among land pixels. FN represents the number of pixels that have not been correctly predicted among land pixels.
The precision is calculated by Equation (7): The recall is calculated by Equation (8): After gathering the precision and recall values, the F1-score is calculated by Equation (9). The F1-score is one of the widely used measuring methods and could convey the balance between the precision and the recall.  Then, we evaluate the performance by measuring the F1-score, precision and recall among different methods, by using the 10m spatial resolution ground truth mask. Here, we briefly introduce the definition of these measures. Let TP represents the number of pixels that have been correctly predicted among river pixels. FP represents the number of pixels that have not been correctly predicted among river pixels. TN represents the number of pixels that have been correctly predicted among land pixels. FN represents the number of pixels that have not been correctly predicted among land pixels.
The precision is calculated by Equation (7): The recall is calculated by Equation (8): After gathering the precision and recall values, the F1-score is calculated by Equation (9). The F1-score is one of the widely used measuring methods and could convey the balance between the precision and the recall. Then, we evaluate the performance by measuring the F1-score, precision and recall among different methods, by using the 10m spatial resolution ground truth mask. Here, we briefly introduce the definition of these measures. Let TP represents the number of pixels that have been correctly predicted among river pixels. FP represents the number of pixels that have not been correctly predicted among river pixels. TN represents the number of pixels that have been correctly predicted among land pixels. FN represents the number of pixels that have not been correctly predicted among land pixels.
The precision is calculated by Equation (7): The recall is calculated by Equation (8): After gathering the precision and recall values, the F1-score is calculated by Equation (9). The F1-score is one of the widely used measuring methods and could convey the balance between the precision and the recall.
The comparison between the proposed method and baseline methods segmentation performance on the testing dataset is shown in Table 4. It can be observed that thresholdbased NDWI method produces low F1-score for the large-scale river mapping task, due to the uncertainty in the reflectance spectral characteristics among different surface regions. Hence, we may need to use different thresholds for rivers in different places. The pixelbased machine learning method SVM and RF shows better performance on F1-score, compared with NDWI method. UNET-S1 method can better segment the river based on the concept of the receptive field idea, compared with SVM, RF and NDWI. Due to the improved spatial resolution of Sentinel-2 data, UNET-S2 method shows better segmentation result, compared with UNET-S1. Furthermore, the non-contrastive method outperforms other baselines on the F1-score. Overall, the proposed method shows best segmentation performance than other baselines. By comparing the proposed method A and the Noncontrastive method, we confirm the effectiveness of incorporating the common information extracted through contrastive learning in the final segmentation process. Furthermore, the Proposed Method B slightly outperforms the Proposed Method A, which demonstrates that the unsupervised common information model training process using more unlabeled data could better help the river segmentation process. We further visualize the segmentation results among the proposed method and baselines, using noise samples and limit samples. Figure 8 shows examples of river mapping results made by the proposed method and other state-of-the-art methods on large-scale river segmentation using noise samples. Each sample is first visualized by the Sentinel-1. Sentinel-2 and manually annotated ground truth, based on the proposed large-scale multisource river imagery dataset, as mentioned in Section 2.1. As shown in Figure 8, the aerosol and cloud affect the river segmentation result among the threshold based NDWI method and the pixel-based machine learning methods. Here, we compare the single-source input style machine learning method RF-S2, SVM-S2 with multi-source input style machine learning model RF-M, SVM-M. It can be seen that simply combine the multi-source data as the input could not help too much in the segmentation performance. UNET-S1 shows that the SAR data is able to better handle the fog and cloud and thus could obtain good river shape. However, due to the low resolution of the SAR data, the boundary of the segmentation result still need to be improved. In addition, the SAR data only has one channel, which makes it challenging for distinguishing the river content with many other different land covers. In addition, the proposed method B shows better river segmentation performance, especially within the noise part of the satellite image, compared with the baseline deep learning method UNET-S1, UNET-S2 and UNET-M. Additionally, Proposed Method A and Proposed Method B outperform the Non-contrastive method, which confirms the contribution of the common information extracted by the contrastive learning concept. The red circle within column (n) represents the significate improvement of the proposed method B, compared with other baseline methods. The result demonstrates the proposed method B is able to achieve better segmentation for large-scale surface area river mapping by using both labeled and unlabeled information.

River Mapping with Limited Labels
In practice, we may not have sufficient training samples if we want to map rivers in a specific year. Hence, we further evaluate the performance of our method and other existing methods on large-scale river segmentation using 20/100/200/400 training samples from the entire training set and present the F1-score performance among different methods in Table 5. It can be seen that Multi-source input model Non-contrastive better segments the river under limited training sample condition, compared with single source input model UNET-S1 and UNET-S2. Furthermore, since the labeled data is often limited among many real world cases, the Proposed Method B is able to utilized the unlabeled data through offline training. Thus, the Proposed Method B performs better than the Proposed Method A due to the additional information from the unlabeled training samples, especially under limited labeled training sample conditions. In addition, we show an example of detected river extent in Figure 9. It can be observed that the proposed method B shows better segmentation result among rest comparison methods. A due to the additional information from the unlabeled training samples, especially under limited labeled training sample conditions. In addition, we show an example of detected river extent in Figure 9. It can be observed that the proposed method B shows better segmentation result among rest comparison methods. We further explore the method performance under different cloud coverage percentage condition. Figure 10 shows the segmentation accuracy curves using different cloud coverage samples. It can be observed that cloud coverage could highly influence the segmentation performance among all methods, due to the missing information in the foggy and cloudy satellite images. In addition, multi-source input model Non-contrastive outperforms several single-source input models, such as NDWI, SVM, RF and UNET-S1 and UNET-S2. Moreover, the Proposed Method B shows better segmentation F1-score than the Proposed Method A due to its ability to leverage unlabeled information during the unsupervised common information path training process.

Long-term River Monitoring
Finally, we can apply our proposed method for long-term river mapping, which can be very useful for monitoring and managing water resources. Figure 11 shows an example of the river monitoring over a two-year period. Figure 12 shows the river area estimation based on the water pixels from the river segmentation results. We can observe that the river regularly changes from season to season. In addition, it can be seen that the November's river area from 2015 to 2017 shows an increasing trend. We further explore the method performance under different cloud coverage percentage condition. Figure 10 shows the segmentation accuracy curves using different cloud coverage samples. It can be observed that cloud coverage could highly influence the segmentation performance among all methods, due to the missing information in the foggy and cloudy satellite images. In addition, multi-source input model Non-contrastive outperforms several single-source input models, such as NDWI, SVM, RF and UNET-S1 and UNET-S2. Moreover, the Proposed Method B shows better segmentation F1-score than the Proposed Method A due to its ability to leverage unlabeled information during the unsupervised common information path training process.

Long-Term River Monitoring
Finally, we can apply our proposed method for long-term river mapping, which can be very useful for monitoring and managing water resources. Figure 11 shows an example of the river monitoring over a two-year period. Figure 12 shows the river area estimation based on the water pixels from the river segmentation results. We can observe that the river regularly changes from season to season. In addition, it can be seen that the November's river area from 2015 to 2017 shows an increasing trend. e Sens. 2021, 13, x FOR PEER REVIEW 16 of 18

Conclusions
In this paper we present a large-scale river mapping deep learning model based on contrastive learning and multi-source satellite data. We also create a large-scale multisource river imagery dataset, along with handmade accurate river ground truth masks. The experiments show that the proposed method outperforms the several state-of-the-art segmentation methods and demonstrate that the proposed contrastive learning method is able to better segment the river over large spatial regions. Moreover, the proposed method shows better performance when multi-spectral satellite images are noisy (e.g., blocked by clouds).
In the future work, we plan to further validate our method using other types of highresolution satellite data for longer term river monitoring and integrate these data sources in our dataset. Moreover, since there could be existed pixel error within current version of the proposed dataset when both optical and SAR images contain artifacts, we plan to refine the labels by utilizing an independent satellite data source (such as DigitalGlobe or WorldView). We will also study and extract temporal changes of water area in rivers, in order to automatically detect river expansion or shrinkage.