Terrain Shadow Interference Reduction for Water Surface Extraction in the Hindu Kush Himalaya Using a Transformer-Based Network

Abstract: Water is the basis for human survival and growth, and it holds great importance for ecological and environmental protection. The Hindu Kush Himalaya (HKH) is known as the “Water Tower of Asia”, where water influences changes in the global water cycle and ecosystem. It is thus very important to efficiently measure the status of water in this region and to monitor its changes. With the development of satellite-borne sensors, water surface extraction based on remote sensing images has become an important way to do so, and deep learning networks are among the most advanced and accurate methods for water surface extraction. We designed a network based on the state-of-the-art Vision Transformer to automatically extract the water surface in the HKH region; however, in this region, terrain shadows are often misclassified as water surfaces during extraction due to their spectral similarity. Therefore, we adjusted the training dataset in different ways to improve the accuracy of water surface extraction and explored whether these adjustments help to reduce the interference of terrain shadows. Our experimental results show that, with the designed network, adding terrain shadow samples can significantly enhance the accuracy of water surface extraction in high mountainous areas, such as the HKH region, while adding terrain data does not reduce the interference from terrain shadows. We obtained the water surface extraction results for the HKH region in 2021 using the network trained on datasets containing both water surfaces and terrain shadows. A comparison with the Global Surface Water data products showed that our water surface extraction results are highly accurate and that the extracted water surface boundaries are finer, which supports the applicability and advantages of the proposed water surface extraction approach in a wide range of complex surface environments.


Introduction
Water has been described as the "catalyst of life on Earth and the source of all things". As an indispensable source of energy for human society, water influences human economic prosperity and productive development [1,2]. The Hindu Kush Himalaya (HKH), also known as the "Water Tower of Asia", supplies water to 1.3 billion people and affects the ecosystems of river basins both within and beyond this region [3,4]. It is of great significance to efficiently obtain the spatial distribution of water surfaces in this region in order to monitor their status and changes [5,6]. The HKH is one of the largest mountain systems in the world; its high mountains, intermountain valleys, and plateaus result in complex terrain and undulating topography that make terrain shadow interference a major challenge for water surface extraction in this region [7].
Over several years of research, many different types of water surface extraction methods have been developed, including single- and multi-band spectral analysis, water indexing, shallow machine learning, and deep learning methods [8]. Single- and multi-band spectral analysis methods, as well as water index methods [9,10], mainly achieve water surface extraction based on human a priori knowledge, and they can therefore be considered water surface extraction methods based on manually designed features [11]. On the other hand, shallow machine learning and deep learning methods allow machines to automatically learn water features from samples [12,13]. Moreover, the deep learning-based method is an end-to-end water surface extraction method with higher accuracy and large-scale applicability that has been developed with the recent breakthroughs in artificial intelligence (AI) [14–16]. Its advantage is that the artificial neural network used in deep learning can account for the complex nonlinear features of the heterogeneous surface [17–19]. Therefore, models based on deep neural networks are more robust than shallow machine learning methods and traditional methods based on spectral analysis and water indexes [20–22].
However, shadows, such as those from terrain, clouds, and other phenomena, are often major distractions during water surface extraction because their spectral features are similar to those of water surfaces. The HKH region is the most undulating and extensive mountainous region in Asia and, indeed, the world; therefore, the accuracy of existing water surface products, such as the European Commission's Joint Research Center's Global Surface Water (JRC GSW) [23], is often reduced in this region by strong interference from terrain shadows. In previous studies, two main approaches have been investigated to avoid or reduce terrain shadow interference. One approach uses Digital Surface Model (DSM) or Digital Elevation Model (DEM) data [24–26] to obtain various terrain factors, which are then used as terrain-related prior knowledge to remove the terrain shadows that are misclassified as water surfaces [27,28]; this approach depends on the accuracy of the terrain data and the reliability of the terrain-related prior knowledge [29,30]. The other approach does not use any terrain data, but only uses spectral images to improve water surface extraction in mountainous areas based on a band math calculation algorithm [31].
The issue of water surface extraction being disturbed by terrain shadows is not well resolved because of manual modeling and potentially limited a priori knowledge of the terrain [23,32]. Data-driven AI models [33], especially the large models based on the novel Transformer architecture, show excellent performance in both the computer vision and remote sensing communities [34–37]. This study therefore targets big data-driven AI models by introducing samples of terrain shadows in high mountainous areas, exploring whether these models can effectively distinguish water surfaces from terrain shadows based only on spectral information and whether terrain data, such as DSM or DEM, are useful in distinguishing water surfaces from terrain shadows. To this end, we developed an end-to-end Transformer-based AI model for water surface extraction, and we specifically prepared remote sensing image samples containing terrain shadows in the HKH region in order to investigate the effect of adding terrain shadow samples on the interference of these shadows during water surface extraction. Furthermore, we introduced DSM data and prepared specific samples containing the relevant terrain features in order to investigate whether terrain data significantly help to reduce the interference of terrain shadows during water surface extraction with the Transformer-based AI models. Finally, we extracted surface water distribution data over the whole HKH region, using the model that demonstrated the best performance in our experiment, and compared them with the JRC GSW data to evaluate the accuracy of our extraction results and to verify the robustness of the proposed method in reducing the interference of terrain shadows in particularly large and high mountainous areas.

Study Area
The entire Hindu Kush Himalaya (HKH) region was adopted as the study area for our research (Figure 1). This area is located at 60.8539°E to 105.0447°E and 15.9579°N to 39.3187°N. Many large and important lakes and rivers are distributed in this area, such as Lake Qinghai, Lake Selincuo, Lake Namtso, and the Yangtze and Yellow Rivers [38], thus affecting the water circulation in the region and worldwide [39,40]. The HKH region also includes many globally significant mountains, including the Himalaya, Pamir, Tianshan, and Hengduan Mountains, among others [41]. These mountains contribute to the unique climatic conditions and rich biodiversity of the region (Figure 2) [42,43]; however, they also pose a significant terrain shadow interference problem for water surface extraction from remote sensing images. Therefore, the selected HKH region is appropriate for studying the influence of terrain shadows during water surface extraction based on remote sensing images. Additionally, the HKH region covers more than 4.3 million km² and spans eight countries, which is ideal for demonstrating that the Transformer-based method for reducing terrain shadow interference during water surface extraction can be applied to a large, high, mountainous area.

Data Sources
Sentinel-2 remote sensing imagery, ALOS World 3D terrain data, and the European Space Agency (ESA) WorldCover 10 m 2020 product were the main data sources used in this study. They are described as follows:
(1) Sentinel-2 remote sensing imagery
Sentinel-2 imagery, with its high spatiotemporal resolution, was the primary source used in our study to extract water surface data for the HKH region. It contains 13 spectral bands, with spatial resolutions of 10 m, 20 m, and 60 m, and a 5-day revisit period. It is one of the most popular data sources for land monitoring [44]. In this study, Short-wave Infra-Red 1 (SWIR1) with 20 m resolution, Near Infra-Red (NIR) with 10 m resolution, and Red with 10 m resolution were selected as the input bands for the water surface extraction network due to their sensitivity to water surfaces [45]. The representativeness of different water surface and non-water objects was comprehensively considered when selecting images from the Sentinel-2 data source, and only images with less than 20% cloud cover were considered in order to ensure good quality in the samples generated from them. The Sentinel-2 images we used were L2A products, that is, Bottom-of-Atmosphere (BOA) reflectance images that had undergone pre-processing, including radiometric processing and atmospheric correction.
(2) ALOS World 3D terrain data
ALOS World 3D-30m (AW3D30) is a digital surface model (DSM) derived from the Panchromatic Remote-sensing Instrument for Stereo Mapping (PRISM), an optical sensor on board the Advanced Land Observing Satellite (ALOS), and it is provided by the Japan Aerospace Exploration Agency (JAXA). It contains not only the elevation information of the terrain but also the height information of surface buildings, trees, and so on [46]. It has a horizontal resolution of 30 m and an elevation accuracy of 5 m. It is one of the most commonly used and accurate terrain data products [47], and we used it as a data source for terrain features in the HKH region. There are several historical versions of this product, and the version we used is version 2.1.
(3) European Space Agency (ESA) WorldCover 10 m 2020 product
The ESA WorldCover data product was used to accelerate the labeling of water surfaces in the training dataset. This product is a global land cover product with 10 m resolution [48], which is generated based on Sentinel-1 and Sentinel-2 data. It has high overall accuracy and detailed categories, and it is one of the most popular global land cover products, featuring the following 11 land cover categories: "Tree cover", "Shrubland", "Grassland", "Cropland", "Built-up", "Bare/sparse vegetation", "Snow and Ice", "Permanent water bodies", "Herbaceous Wetland", "Mangrove", and "Moss and lichen". In this study, we extracted the category "Permanent water bodies" to assist in water surface labeling.

Water Surface Extraction Network
The water surface extraction network proposed in this study uses an encoder-decoder architecture [49]. The overall network structure is shown in Figure 3. To extract water surfaces of different types and sizes in the HKH region, the network was built on a Swin Transformer [50], which is based on an attention mechanism [51]. This mechanism imitates the way the human brain observes things by focusing on important information, and it can extract water surface features at various levels, from texture to scene [52], allowing the network to better capture the differences between water surfaces and non-water objects in the HKH region.
The encoder consists of four stages. After the input remote sensing image, with a size of H × W × 3, is divided into patches through the patch partition module, it is processed through these four stages in sequence. Each stage includes a Swin Transformer Block and a Linear Embedding or Patch Merging Module. The Swin Transformer Block consists of Window Multihead Self-Attention (W-MSA) and Shifted-Window Multihead Self-Attention (SW-MSA). The W-MSA calculates local self-attention within the windows, while the SW-MSA shifts the windows so that information can interact across them. In the first stage, the image is transformed into a one-dimensional vector via the Linear Embedding Module. In stages 2–4, the Patch Merging Module downscales the feature maps to form pyramid feature maps. Each stage generates an object-level feature map, so a total of four feature maps are obtained, with sizes of H/4 × W/4 × C, H/8 × W/8 × 2C, H/16 × W/16 × 4C, and H/32 × W/32 × 8C.
The decoder consists of a Pyramid Pooling Module (PPM) [53], a Feature Pyramid Network (FPN) [54], a 3 × 3 convolution, and a classifier. Scene-level feature maps are produced with the PPM and combined with the object-level feature maps from the encoder with the FPN to form a fused feature map. With the 3 × 3 convolution, the fused feature map is then upsampled to the original image size, and the classifier is applied to produce a pixel-level classification output.
The Transformer-based water surface extraction network has 62.3 million parameters. The embedding dimension is set to 96, the patch size is set to 2, the window size is set to 9 × 9, and the numbers of heads for the multihead attention mechanisms in the four sequential stages are set to 3, 6, 12, and 24, respectively.
In addition to the water surface extraction network proposed in this study, we also utilized two CNNs (U-Net [55] and Deeplab V3 with a ResNet-50 backbone [56]) to extract water surfaces, and they were compared to the Transformer-based water surface extraction network. This step was conducted to further verify whether the Transformer-based network outperforms the CNNs in water surface extraction.

Production of Training Datasets
Figure 4 shows the process of producing the training datasets, including image data preparation and ground truth data preparation.
During image data preparation, we stacked the SWIR1, NIR, and Red bands of the Sentinel-2 images to obtain the images. To keep the spatial resolution of the model input bands consistent, the SWIR1 band was upsampled to 10 m resolution using the nearest neighbor algorithm, and the images were divided into water surface images and terrain shadow images. The water surface images contained mainly water surfaces and sample backgrounds with sparse vegetation, while the terrain shadow images contained different types of terrain shadows; both types of images were manually classified and selected. To keep the water surface distribution consistent between the training datasets with and without terrain shadow samples, the selected terrain shadow images did not contain any water. We sampled from these images, generated image data of specified sizes, and obtained both water surface image data and terrain shadow image data. The utilized image size was 768 pixels × 768 pixels.
To investigate whether terrain data have significant impacts on the Transformer-based water surface extraction model, we stacked the spectral images and the pre-processed AW3D30 DSM data as inputs for the model. The specific pre-processing of the DSM data included reprojection, upsampling, and cropping. The 30 m DSM data were reprojected and upsampled to 10 m in order to match the spatial resolution of the 10 m Sentinel-2 data. In addition, the DSM data were cropped based on the corresponding boundaries of the image data. The pre-processed AW3D30 DSM data were stacked with the water surface image data to obtain combined water surface image and terrain data, and they were stacked with the terrain shadow image data to obtain combined terrain shadow image and terrain data.
For ground truth data preparation, we extracted the "Permanent water bodies" category layer from the ESA WorldCover 10 m 2020 product, and then we manually modified the layer based on the selected Sentinel-2 images. The modifications included deleting redundant water surface objects that did not exist in the images, adding water surface objects that existed in the images, and modifying the specific boundaries of the water surfaces; in this way, accurate water surface ground truth images corresponding to each Sentinel-2 image were obtained. The water surface ground truth images were also cropped based on the corresponding image data to produce ground truth data with the same spatial range as the image data.
Finally, we produced the four following training datasets, as shown in Table 1: spectral data (no terrain shadow), spectral data (with terrain shadow), spectral and terrain data (no terrain shadow), and spectral and terrain data (with terrain shadow). In this experiment, the number of samples containing terrain shadows was 352, accounting for 16.8% of the total number of samples in the dataset with terrain shadow samples. It can be calculated from the ground truth data that water surface pixels account for approximately 8.6% of the total number of pixels in the dataset without terrain shadow samples and approximately 7.2% of the total number of pixels in the dataset with terrain shadow samples. Using the same method, we produced two validation datasets, one with terrain data and the other without terrain data. The validation dataset with terrain data was designed to evaluate the accuracy of the models trained using the training datasets with terrain data; the validation dataset without terrain data was designed to evaluate the accuracy of the models trained using the training datasets without terrain data. To ensure fairness in the accuracy evaluation, the number and distribution of samples in the two validation datasets were the same, and they both covered different types of water surfaces and terrain shadows in multiple areas.
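The band preparation described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `upsample_nearest`, `stack_inputs`, and `tile_image` are hypothetical, all arrays are assumed to be already co-registered, the optional DSM layer is assumed to have been reprojected and resampled to 10 m beforehand, and the use of non-overlapping tiles is an assumption, since the text only states that samples of 768 × 768 pixels were generated.

```python
import numpy as np


def upsample_nearest(band: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour upsampling: each pixel becomes a factor x factor block."""
    return np.repeat(np.repeat(band, factor, axis=0), factor, axis=1)


def stack_inputs(swir1_20m, nir_10m, red_10m, dsm_10m=None):
    """Stack SWIR1, NIR and Red (and optionally a DSM layer) into H x W x C."""
    swir1_10m = upsample_nearest(swir1_20m, 2)  # 20 m -> 10 m
    layers = [swir1_10m, nir_10m, red_10m]
    if dsm_10m is not None:
        # Assumption: the DSM was already reprojected and resampled to 10 m.
        layers.append(dsm_10m)
    shapes = {layer.shape for layer in layers}
    if len(shapes) != 1:
        raise ValueError(f"layers are not co-registered: {shapes}")
    return np.stack(layers, axis=-1)


def tile_image(image: np.ndarray, size: int = 768):
    """Cut non-overlapping size x size tiles; incomplete edge tiles are dropped."""
    rows, cols = image.shape[0] // size, image.shape[1] // size
    return [
        image[r * size:(r + 1) * size, c * size:(c + 1) * size]
        for r in range(rows)
        for c in range(cols)
    ]
```

Passing a fourth layer through `dsm_10m` yields the four-channel "spectral and terrain" input, while omitting it yields the three-channel spectral input.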

Training Settings
The proposed network was implemented using the PyTorch framework (version 1.13.1), and the hardware environment was an Intel(R) Xeon(R) W-2245 CPU @ 3.90 GHz with 16.0 GB RAM and an NVIDIA GeForce RTX3080 Ti GPU with 7424 CUDA cores and 12 GB memory. The optimizer for training was adaptive moment estimation (ADAM) with a total of 300 training epochs. A linear learning rate schedule was applied, starting at 0.000003 and increasing to 0.00006 after 10 epochs by linearly changing a small multiplicative factor. In order to prevent overfitting, we set the weight decay to 0.01, and we set the batch size to 4 based on the GPU memory capacity.
Under this hardware environment, using all training samples to train the model for one epoch took about 15 min, and each model was trained for 300 epochs. During the prediction phase, extracting the water surface result from a Sentinel-2 image took about 90 s.
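A minimal sketch of the learning rate schedule described above, assuming the rate ramps linearly from 0.000003 to 0.00006 over the first 10 epochs and then stays constant (the text does not state what happens after the ramp). The function name `linear_lr` and its parameter names are hypothetical.

```python
def linear_lr(epoch: int,
              target_lr: float = 6e-5,
              start_lr: float = 3e-6,
              ramp_epochs: int = 10) -> float:
    """Learning rate at a given epoch for a linear ramp followed by a constant rate."""
    if epoch >= ramp_epochs:
        # Assumption: the rate is held at the target value after the ramp.
        return target_lr
    # Linear interpolation between the start and target learning rates.
    return start_lr + (target_lr - start_lr) * epoch / ramp_epochs


if __name__ == "__main__":
    for epoch in (0, 5, 10, 299):
        print(epoch, linear_lr(epoch))
```

In PyTorch, a comparable ramp can be obtained with `torch.optim.lr_scheduler.LinearLR` using `start_factor=0.05` and `total_iters=10` on a base learning rate of 0.00006; whether the authors used this particular scheduler is not stated in the text.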

Accuracy Evaluation
Overall Accuracy (OA), Intersection over Union (IoU), and Kappa were used as the metrics with which to quantitatively evaluate the accuracy levels of the trained models.
OA and IoU are calculated with the confusion matrix, composed of True-Positives (TP), False-Positives (FP), False-Negatives (FN), and True-Negatives (TN), which, respectively, represent the number of correctly predicted water surface pixels, the number of non-water pixels incorrectly predicted as water surface, the number of water surface pixels that were not predicted as water surface, and the number of correctly predicted non-water pixels. The equations for OA and IoU can be represented as follows:
$$\mathrm{OA} = \frac{TP + TN}{TP + FP + FN + TN}$$
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
Kappa is used to measure the consistency between the number of predicted pixels of elements of different classes and the real number of pixels [57]; its equation is as follows:
$$\mathrm{Kappa} = \frac{P_o - P_e}{1 - P_e}$$
$P_o$ has the same meaning as OA; the equation of $P_e$ is as follows:
$$P_e = \frac{a_1 b_1 + a_2 b_2 + \cdots + a_x b_x}{n^2}$$
In the equation, $a_1, a_2, \ldots, a_x$ represent the numbers of true pixels in each class; $b_1, b_2, \ldots, b_x$ represent the predicted numbers of pixels in each class; $n$ is the total number of pixels.
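As a worked illustration, the three metrics can be computed directly from the four confusion-matrix counts. This is a minimal sketch assuming the standard two-class (water / non-water) definitions of OA, IoU, and Cohen's Kappa; the function name `water_metrics` is hypothetical, and degenerate cases with zero denominators are not handled.

```python
def water_metrics(tp: int, fp: int, fn: int, tn: int):
    """Return (OA, IoU, Kappa) for a two-class water / non-water confusion matrix."""
    n = tp + fp + fn + tn
    oa = (tp + tn) / n                 # overall accuracy, equal to P_o
    iou = tp / (tp + fp + fn)          # intersection over union for the water class
    # Expected agreement: sum over classes of (true count * predicted count) / n^2.
    true_water, true_other = tp + fn, fp + tn
    pred_water, pred_other = tp + fp, fn + tn
    pe = (true_water * pred_water + true_other * pred_other) / n ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, iou, kappa


if __name__ == "__main__":
    # Toy counts chosen only for illustration, not taken from the experiments.
    print(water_metrics(tp=40, fp=10, fn=10, tn=40))
```

With the toy counts above, OA is 0.8, IoU is 40/60, and Kappa is 0.6, which shows how IoU and Kappa penalize errors on the minority water class more strongly than OA does.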

Accuracy Evaluation Results
The evaluation results of the water surface extraction models trained with the training dataset "Spectral data (no terrain shadow)" in Section 3.2 are shown in Table 2. To reduce the influence of chance on the results, the scores of the three epochs with the highest accuracy are shown, together with their averages. "Model_Transformer_Baseline" in Table 2 is a baseline model trained using samples without terrain data (TD) or terrain shadow samples (TS). Its average OA, IoU, and Kappa scores were 0.9985, 0.9816, and 0.9899, respectively, showing that the proposed Transformer-based deep learning network can successfully extract water surfaces from Sentinel-2 images. The other two models in Table 2 were obtained by training U-Net and Deeplab V3 using the same training dataset. The average OA, IoU, and Kappa scores for Model_U-Net_Baseline and Model_Deeplab V3_Baseline were 0.9781, 0.8573, 0.7689 and 0.9964, 0.9563, 0.9757, respectively. This indicates that, under the conditions of the experiment conducted in this study, the Transformer-based network has higher accuracy for water surface extraction than the widely used CNN networks. The "*" indicates the average value with the highest accuracy.
Table 3 shows the evaluation results of the four models obtained by training the proposed Transformer-based water surface extraction network using four different training datasets.
"Model_Transformer_Baseline" in Table 3 is the same as "Model_Transformer_Baseline" in Table 2.
"Model_Transformer_TS" in Table 3 was trained with the samples containing terrain shadow images. It demonstrated the highest average OA, IoU, and Kappa scores of 0.9985, 0.9823, and 0.9903, respectively. Thus, adding terrain shadow images to the training samples can improve the accuracy of water surface extraction results based on the proposed Transformer-based network.
"Model_Transformer_TD" in Table 3 was trained with the samples containing DSM data. The average OA, IoU, and Kappa scores were 0.9980, 0.9725, and 0.9850, respectively, lower than those of Model_Transformer_Baseline, thus demonstrating that the introduction of additional terrain data reduces the accuracy of water surface extraction results based on the proposed Transformer-based network.
When terrain data were added to the dataset with terrain shadow samples, the average OA, IoU, and Kappa scores of "Model_Transformer_TS+TD" were 0.9981, 0.9730, and 0.9853, lower than those of "Model_Transformer_TS". Compared with "Model_Transformer_TD", "Model_Transformer_TS+TD" had higher scores, which also shows that adding terrain shadow samples can improve the accuracy of the model.

Results of the Misclassification of Terrain Shadows as Water Surfaces
The misclassification of terrain shadows as water surfaces was analyzed on the basis of the two following area scenes: unvegetated areas and vegetated areas.
(1) Misclassification of terrain shadows in unvegetated areas
The misclassification of terrain shadows in unvegetated areas is shown in Figure 5. When "Model_Transformer_Baseline" was used to extract water surfaces, many terrain shadows in the bare area, regardless of their sizes, were misclassified as water surfaces (Figure 5b), thus showing that, when the training dataset contains neither terrain shadow samples nor terrain data, the trained model cannot distinguish water surfaces from terrain shadows in the bare area.
When "Model_Transformer_TS" was used to extract water surfaces, almost no terrain shadows in unvegetated areas were misclassified as water surfaces (Figure 5c), thus showing that adding terrain shadow samples can improve the ability of the trained model to distinguish terrain shadows from water surfaces in unvegetated areas.
When "Model_Transformer_TD" was used to extract water surfaces, no terrain shadows in the unvegetated areas were misclassified as water surfaces, similar to when "Model_Transformer_TS" was used (Figure 5d), showing that, in unvegetated areas, the addition of terrain data to the training dataset can also reduce the interference of terrain shadows in water surface extraction.
(2) Misclassification of terrain shadows in vegetated areas
The misclassification of terrain shadows as water surfaces in vegetated areas is shown in Figure 6.When "Model_Transformer_Baseline" was used to extract water surfaces, a small number of terrain shadows in areas with high vegetation coverage were misclassified as water surfaces (Figure 6b), fewer than the misclassified terrain shadows in the unvegetated areas.
When "Model_Transformer_TS" was used to extract water surfaces, no terrain shadows in the vegetated areas were misclassified as water surfaces (Figure 6c), thus showing that, whether in unvegetated or vegetated areas, adding terrain shadow samples to the training dataset can improve the model's performance in distinguishing terrain shadows from water surfaces.
When "Model_Transformer_TD" was used to extract water surfaces, many terrain shadows in the vegetated areas were misclassified as water surfaces (Figure 6d), and the areas of misclassified terrain shadows were larger than those resulting from using "Model_Transformer_Baseline", showing that, in vegetated areas, adding terrain data impairs the model's performance in distinguishing between terrain shadows and water surfaces.

Water Surface Extraction Results
Figure 7 shows the water surface extraction results. We present the results in terms of rivers, small lakes, medium lakes, and large lakes. Small lakes are lakes with an area of less than 1 km², large lakes are lakes with an area greater than 100 km², and the remaining lakes are medium lakes.
When using "Model_Transformer_Baseline" to extract rivers, most of the river pixels were present in the extraction result, but a small number of pixels were missed (Figure 7a). When using "Model_Transformer_TS" to extract rivers, the extracted river morphology was more complete than that extracted with "Model_Transformer_Baseline". However, when "Model_Transformer_TD" was used to extract rivers, the river surface was largely not extracted. Since "Model_Transformer_TD" differs from "Model_Transformer_Baseline" only in the addition of terrain data, this suggests that the use of terrain data significantly reduces the accuracy of river extraction. When "Model_Transformer_TS+TD" was used to extract rivers, it could only extract a small number of river pixels. This indicates that, even if there are terrain shadow samples in the training dataset, the use of terrain data still weakens the model's performance in extracting rivers.

"Model_Transformer_Baseline" could accurately extract small, medium, and large lakes, with clear boundaries and complete lake surfaces, and no lakes were missed (Figure 7b-d).
When "Model_Transformer_TS" was used to extract lakes, the results were the same as those of "Model_Transformer_Baseline", which indicates that adding terrain shadow samples to the training dataset does not affect lake extraction. However, when "Model_Transformer_TD" was used to extract lakes, many small and medium lake pixels were missed, a small number of large lake pixels were missed, and some non-water pixels were extracted, showing that adding terrain data reduces the accuracy of lake extraction. When "Model_Transformer_TS+TD" was used to extract small lakes, the results were similar to those of "Model_Transformer_TD", indicating that, even if the training dataset contains terrain shadow samples, adding terrain data to the training dataset still reduces the accuracy of lake extraction.
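The lake size classes used in this section can be written as a small helper. This is a sketch of the thresholds stated above (less than 1 km² for small lakes, greater than 100 km² for large lakes); the function name `lake_size_class` is illustrative, and lakes of exactly 1 km² or 100 km² are assumed to fall in the medium class, as "the rest".

```python
def lake_size_class(area_km2):
    """Classify a lake by surface area in km^2 using the thresholds stated in the text."""
    if area_km2 < 0:
        raise ValueError("area must be non-negative")
    if area_km2 < 1.0:
        return "small"
    if area_km2 > 100.0:
        return "large"
    # Everything between the two thresholds, boundaries included, is medium.
    return "medium"
```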

Results of the Misclassification of Other Non-Water Objects
The results of the misclassification of non-water objects as water surfaces are shown in Figure 8. When using "Model_Transformer_Baseline" and "Model_Transformer_TS" to extract water surfaces, non-water objects other than terrain shadows were not misclassified as water surfaces, showing that the proposed water surface extraction network could distinguish water surfaces from non-water objects well and that adding terrain shadow samples to the training dataset does not affect the model's performance in distinguishing water surfaces from non-water objects.
However, when using "Model_Transformer_TD" to extract water surfaces, we found that some non-water objects were misclassified as water surfaces, mainly clouds, cloud shadows, ice, and snow, as shown in Figure 8b,c, which shows that adding the terrain data to the training dataset impairs the model's performance in distinguishing water surfaces from non-water objects. In addition, when "Model_Transformer_TS+TD" was used to extract water surfaces, the extraction results were similar to the results of "Model_Transformer_TD", which indicates that, even when the training dataset contains terrain shadow samples, adding terrain data still impairs the model's performance in distinguishing water surfaces from non-water objects.

Results of Water Surface Extraction in the HKH Region
In order to validate the robustness of the water surface extraction network, "Model_Transformer_TS", which demonstrated the best performance among the experimental models, was used to extract water surfaces from Sentinel-2 images of the HKH region; as shown in Figure 9, this region covers 594 tiles of Sentinel-2 images. Since the rainy season in the HKH region is from July to October, we used only the images with less than 20% cloud cover during this period in 2021 to extract water surfaces. The model directly extracted water surfaces from these images without any pre- or post-processing to refine the extraction results. After the extraction, we used a maximum-area algorithm to composite the water surface extraction results with different dates from the same tile and merged the composited results to form the overall water surface extraction results for the HKH region, as shown in Figure 10.
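The scene selection and per-tile compositing described above can be sketched as follows. The text does not spell out the maximum-area algorithm, so this sketch assumes one plausible reading: among the masks extracted for a tile on different dates, keep the one with the largest water area. A per-pixel union of all dates would be another possible reading. The scene dictionary keys (`date`, `cloud`) and both function names are hypothetical.

```python
from datetime import date
import numpy as np

def select_scenes(scenes, year=2021, months=(7, 8, 9, 10), max_cloud=20.0):
    """Keep scenes from the rainy season with cloud cover below max_cloud (percent)."""
    return [
        s for s in scenes
        if s["date"].year == year
        and s["date"].month in months
        and s["cloud"] < max_cloud
    ]

def max_area_composite(masks):
    """Return the binary water mask with the most water pixels.

    Assumed interpretation of 'maximum-area' compositing; not confirmed by the text.
    """
    if not masks:
        raise ValueError("at least one mask is required")
    areas = [int(np.count_nonzero(m)) for m in masks]
    return masks[int(np.argmax(areas))]

# Illustrative usage with made-up scenes for a single tile.
scenes = [
    {"date": date(2021, 6, 30), "cloud": 5.0},   # outside July-October
    {"date": date(2021, 8, 12), "cloud": 35.0},  # too cloudy
    {"date": date(2021, 9, 3), "cloud": 12.0},   # kept
]
kept = select_scenes(scenes)
```

Under this reading, the composite favors the date on which the water surface was most fully visible, which limits the influence of residual cloud on any single date.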


Accuracy Evaluation
To evaluate our water surface results in the HKH region, we compared our water surface extraction results with the Global Surface Water (GSW) seasonality product, one of the most accurate and widely used water surface products, with a resolution of 30 m. The comparison includes a quantitative comparison and an extraction result comparison.

Quantitative Comparison
Table 4 shows the results of the comparison between our results and those of the GSW. It can be seen that the consistency proportions of the selected lakes are all higher than 99%, which shows that our water surface extraction results are highly consistent with the GSW seasonality product for lakes with large areas and high ecological value.
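As an illustration of how an area comparison of this kind can be computed, the sketch below converts a binary water mask into an area in km² and compares two areas. The 10 m pixel size matches the Sentinel-2 resolution mentioned in the text. The formula used for the consistency proportion, 1 - |A_ours - A_GSW| / A_GSW, is an assumption made for illustration, since the definition is not given in this section; both function names are hypothetical.

```python
import numpy as np

def water_area_km2(mask, pixel_size_m=10.0):
    """Area of the water pixels in a binary mask, in km^2."""
    n_water = int(np.count_nonzero(mask))
    return n_water * pixel_size_m ** 2 / 1e6

def consistency_proportion(area_ours_km2, area_ref_km2):
    """Assumed definition: 1 minus the relative area difference to the reference."""
    if area_ref_km2 <= 0:
        raise ValueError("reference area must be positive")
    return 1.0 - abs(area_ours_km2 - area_ref_km2) / area_ref_km2
```

For example, a 100 x 100 block of water pixels at 10 m resolution covers 1 km², and areas of 99.5 km² and 100 km² give a consistency proportion of 99.5% under the assumed formula.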

Extraction Result Comparison
To further compare our extraction results with those of the GSW seasonality product, we compared them based on the spatial distribution of the water surface using visual interpretation.
Figure 11 shows our water surface extraction results and the GSW seasonality product for the whole HKH region. It can be seen that our extraction results are consistent with the GSW seasonality product in both the spatial distribution of the water surface and the overall water surface morphology, with no redundant or abnormal segments in our data, which shows that terrain shadows were not misclassified as water surfaces in our large-scale extraction results.
In addition, we compared our extraction results with the GSW seasonality product for some specific water surfaces, including large lakes, medium lakes, small lakes, and rivers; the results are shown in Figures 12, 13, 14, and 15, respectively. For large lakes and medium lakes, our extraction results are basically consistent with the GSW seasonality product in terms of the distribution, overall shape, and boundaries of the lakes, and they are very clear and accurate. For small lakes and rivers, our extraction results are consistent with the GSW seasonality product in regard to overall shape. Since we used 10 m Sentinel-2 images to extract water surfaces, our extraction results are finer at the boundaries and can better reflect the specific shapes of small lakes and rivers, as shown in Figures 14 and 15.

Discussion
In high mountainous areas, such as the HKH region, terrain shadows are an important factor that interferes with water surface extraction; thus, we designed a water surface extraction network based on the Vision Transformer, adding terrain shadow samples to increase the sample representativeness, thereby improving the performance of the network used to distinguish water surface and terrain shadows.
Our experiment shows that, when terrain shadow samples were added to the training dataset, the model's performance in distinguishing water surface and terrain shadows significantly improved and the water surface extraction result became more accurate. This clearly demonstrates that preparing samples containing terrain shadows is a very effective tool when using the Transformer-based deep learning network to reduce the interference of terrain shadows in water surface extraction. Moreover, the proportion of terrain shadow samples in our experiment was low, at about 16.8%. This also confirms that samples containing terrain shadows are extremely important for water surface extraction in high mountainous areas, since the misclassification of terrain shadows as water surfaces can be mostly avoided without adding too many terrain shadow samples. In addition, preparing samples containing terrain shadows (and not having to label them) is much easier than other methods, including adding and processing terrain data. The application case whereby the entire water surface distribution of the HKH region in 2021 was extracted, based on the proposed deep learning method, further validates the effectiveness and usability of Transformer-based deep learning networks with samples containing both water surface and terrain shadows. In addition to terrain shadows, some other non-water objects, such as clouds, cloud shadows, glaciers, and snow, are also potentially misclassified as water surfaces. We believe that these misclassifications can also be reduced by adding a certain number of relevant samples to the training dataset.
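To make the 16.8% figure concrete, the short sketch below computes how many terrain shadow samples would have to be added to an existing set of samples to reach a target proportion of the combined dataset. The sample count in the usage line is purely illustrative and is not taken from the study; the function name `shadow_samples_needed` is hypothetical.

```python
def shadow_samples_needed(n_other, target_fraction):
    """Number of shadow samples s such that s / (n_other + s) equals target_fraction."""
    if not 0.0 <= target_fraction < 1.0:
        raise ValueError("target_fraction must be in [0, 1)")
    # Solve s / (n_other + s) = p for s, giving s = p * n_other / (1 - p).
    return round(target_fraction * n_other / (1.0 - target_fraction))

# Illustrative only: 832 existing samples and a 16.8% target share.
n_shadow = shadow_samples_needed(832, 0.168)
```

With these illustrative numbers, 168 shadow samples out of 1000 total samples correspond to a 16.8% share.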
In order to investigate whether terrain data significantly help to reduce the interference of terrain shadows during water surface extraction in high mountainous areas, we introduced AW3D30 terrain data and prepared samples that contained terrain data to train the same Transformer-based deep learning network. Our experimental results show that adding the AW3D30 terrain data to the training dataset is not very effective; while it can reduce the misclassification of terrain shadows as water surface in some unvegetated areas, misclassification is not reduced, or becomes even more severe, in vegetated areas. In addition, the performance of the water surface extraction model decreases significantly after adding the terrain data, as demonstrated in the experimental results showing incomplete or incorrect water surface boundaries. We wondered whether the accuracy of the terrain data might have caused the undesirable results above; in reality, however, the terrain data we used were already among those of the highest accuracy publicly available today. Moreover, the economic and labor costs of obtaining more accurate terrain data would be higher. In addition, it is possible that the deep learning networks themselves may perform poorly when faced with training data containing large differences, since terrain data and spectral imagery have two completely different physical features.
Finally, it is important to note that all of the experimental results and conclusions presented in this research are based on the approach of deep learning networks, especially Transformer-based networks; therefore, the method of reducing terrain shadow interference by adding terrain shadow samples may only be effective for water surface extraction using these deep learning networks. It is unclear whether the approach used in our research is applicable to other extraction methods, such as water index or shallow machine learning methods.

Conclusions
In this study, we designed a water surface extraction network, based on the Vision Transformer, in order to efficiently and automatically extract water surfaces in high mountainous areas, such as the HKH region, and explored methods to reduce the interference of terrain shadows during water surface extraction. Our results show that adding terrain shadow samples can greatly improve the model's performance in distinguishing between water surface and terrain shadows, and the model can accurately extract water surfaces; meanwhile, adding specific terrain data is not very effective in reducing the misclassification of terrain shadows as water surfaces, and this approach could reduce the accuracy of water surface extraction in the HKH region. Using the Transformer-based network and sufficient samples of both water surface and terrain shadows, we quickly obtained water surface extraction results for the HKH region in the year 2021. Comparison of our extraction results with the GSW seasonality product shows that the two are highly consistent, and our extraction results are finer at the water surface boundaries. This demonstrates the high spatiotemporal generalization ability of our water surface extraction network, as well as the broad feasibility of reducing terrain shadow interference by simply adding terrain shadow samples. In the future, we plan to produce more samples from different geographical areas and apply the proposed method to various terrain environments beyond the HKH region in order to verify the generalization capability of the method.

Figure 1. The location of the Hindu Kush Himalaya.

Figure 2. The terrain diagram of the Hindu Kush Himalaya.

Figure 3. The structure of the water surface extraction network.

Figure 4 shows the process of producing the training datasets, including image data preparation and ground truth data preparation.

Figure 4. The preparation of training datasets.

Figure 5. Misclassification of terrain shadows in unvegetated areas (areas circled in red represent terrain shadows misclassified as water surfaces).

Figure 6. Misclassification of terrain shadows in vegetated areas (areas circled in red represent terrain shadows misclassified as water surfaces).

Figure 7. Water surface extraction results (areas circled in yellow represent missed water surface pixels, and areas circled in red represent non-water objects misclassified as water surfaces).

Figure 8. Misclassification of non-water objects (areas circled in red represent non-water objects misclassified as water surface).

Figure 9. The distribution of Sentinel-2 imaging tiles in the HKH region.

Figure 10. Water surface extraction results of the HKH region in 2021.

Figure 11. Our water surface extraction results and the GSW seasonality product for the HKH region in 2021.

Figure 12. Our water surface extraction results and the GSW seasonality product for large lakes.

Figure 13. Our water surface extraction results and the GSW seasonality product for medium lakes.

Figure 14. Our water surface extraction results and the GSW seasonality product for small lakes.

Figure 15. Our water surface extraction results and the GSW seasonality product for rivers.

Author Contributions: X.Y.: Writing (original draft), Methodology, Validation, Visualization, Data Curation. J.S.: Writing (review and editing), Conceptualization, Methodology, Investigation, Funding Acquisition, Software, Supervision, Resources. All authors have read and agreed to the published version of the manuscript.

Funding: This research was funded by the National Key Research and Development Program of China (grant numbers: 2022YFF0711602, 2021YFE0117800) and the Strategic Priority Research Program of the Chinese Academy of Sciences (grant number: XDB0740200).

Table 1. The four training datasets utilized in the experiment.

Table 2. The evaluation results of the model obtained by training the Transformer-based network using the training dataset "Spectral data (no terrain shadow)".

Table 3. The evaluation results of the model obtained by training the Transformer-based network using different training datasets.

Table 4. The water surface areas of our extraction results compared with the GSW seasonality product in 2021.