The source data used in this study is the JRC Global Surface Water (GSW) dataset [34]; specifically, we used the JRC Global Surface Water Mapping Layers, v1.4 Extent data product. The GRNWRZ V2.0 global river network dataset [9] was used as auxiliary data, and for certain regions, Landsat 8 imagery [35] served as a reference for manual water body type labeling. During vector extraction from the raster data, we applied the classic Douglas–Peucker algorithm for edge simplification to reduce vertex redundancy in the vector features. The deep learning classifier is a hybrid architecture combining a CNN and a Transformer. Finally, manually labeled data and OSM data from selected regions were used as benchmarks for evaluating the classification results.
Figure 2 illustrates the workflow and the data used.
3.1. Source Data and Pre-Processing
The JRC Global Surface Water (GSW) dataset is one of the most widely used global surface water remote sensing data products. Developed by the Joint Research Centre (JRC) of the European Commission based on Landsat series satellite imagery, it covers a 34-year time span from 1987 to 2020 at a spatial resolution of 30 m. The dataset provides multiple products, including annual classifications, occurrence frequency, and seasonality indices, and serves as a fundamental resource for studying surface water dynamics.
The preprocessing stage mainly involved data format conversion, optimization, and label annotation. This stage consisted of multiple steps: (1) Collecting global surface water remote sensing data products and converting them into vector format; (2) Applying a contour simplification algorithm to smooth and simplify features; (3) Selecting representative regions for manual labeling to provide training and testing datasets for the classifier.
In this study, we used the JRC Global Surface Water Mapping Layers, v1.4 Extent as the primary dataset. This product includes all surface water bodies detected by JRC over the monitoring period. It consists of 504 raster tiles, each covering a 10° × 10° area, collectively spanning from 60° S to 80° N. The product contains a single band, in which the pixel values of 1 represent water, 0 represents no water, and 255 represents no data. After excluding tiles without water pixels, 353 raster files remained. Based on pixel values and the Douglas–Peucker algorithm, we were able to vectorize water bodies and simplify vertices efficiently.
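As an illustration of the Douglas–Peucker simplification used in the vectorization step, here is a minimal pure-Python sketch of the classic algorithm (not the exact implementation used in this study):

```python
import math

def perpendicular_distance(p, a, b):
    # Distance from point p to the line through a and b.
    (x, y), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)

def douglas_peucker(points, epsilon):
    """Recursively keep only vertices farther than epsilon from the chord."""
    if len(points) < 3:
        return list(points)
    # Find the vertex farthest from the chord joining the endpoints.
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, idx = d, i
    if dmax <= epsilon:
        # All intermediate vertices are within tolerance: drop them.
        return [points[0], points[-1]]
    # Otherwise split at the farthest vertex and simplify both halves.
    left = douglas_peucker(points[:idx + 1], epsilon)
    right = douglas_peucker(points[idx:], epsilon)
    return left[:-1] + right
```

The tolerance epsilon trades geometric fidelity for vertex count; a larger value yields fewer vertices.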
Following [34,36], we selected seven representative regions: the Amazon River floodplain, the Mekong Delta, parts of the Tibetan Plateau, parts of the Arctic Circle, the East African Rift Lakes Region, parts of Australia, and the North American Great Lakes Region. A detailed description of these regions is provided in Table 1.
Within these seven regions, manual classification and labeling of surface water vector features were performed. An example is shown in Figure 3.
3.2. Aggregation of Fragmented Water Bodies
The fragmented features produced by the vectorization process can be categorized into three main types: disconnected narrow river fragments, wide river and lake edge fragments, and isolated fragments.
Figure 4 illustrates typical examples of these three categories.
To address these three types of fragmentation, we adopted a three-stage approach to identify and aggregate fragmented water bodies. This is a progressive, multi-tiered feature integration strategy that transitions from strong constraints based on a priori knowledge to weaker constraints based on spatial neighbor relationships. The three stages, river network constraint, classification buffer, and topological neighbor, are designed to systematically restore the spatial integrity of global water landscapes while minimizing mis-aggregation errors.
In the first stage, we employed the GRNWRZ V2.0 global river network dataset to restore hydrological connectivity disrupted by narrow river fragmentation. GRNWRZ V2.0 provides a seven-level classification system based on river basin characteristics: L1 represents rivers draining into the same ocean or inland basin, L2 distinguishes exorheic basins by drainage area or endorheic basins by catchment extent, while L3–L7 are further subdivided according to decreasing basin size [9]. Through comparative experiments and analysis, we found that the L5 river network data strikes a balance: it ensures coverage of all major rivers within the study area while effectively distinguishing between interconnected lakes and independent rivers.
A 150 m buffer was constructed around the river network, and all water body features within this buffer were labeled as rivers. Fragmented water bodies intersecting the buffer were assumed to originate from the corresponding river, and fragments within the same buffer zone were considered hydrologically related.
Figure 5 shows a schematic diagram of the aggregation strategy. The main rationale behind this step is that narrow river fragments are usually derived from the river itself, either as a result of seasonal water level fluctuations or as errors introduced during remote sensing extraction. This stage is particularly effective in addressing linear fragmentation along narrow river channels and reconstructing the continuity of original river features.
In the second stage, to address wide river and lake edge fragments, we applied a classification buffer approach. In this stage, a 180 m buffer was created around water body features labeled as rivers in the first stage, while a 120 m buffer was constructed for unlabeled water body features. Fragments whose buffers intersected were aggregated into the same feature.
Figure 6 illustrates the aggregation strategy. The underlying principle is that two or more spatially proximate water body fragments are highly likely to have originated from the same continuous water system. Similarly, extremely small fragments along the edges of larger rivers or lakes are very likely to be part of the same water body, primarily due to remote sensing observation errors. This stage effectively reduces fragmentation in independent water body data.
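The buffer-intersection grouping of this stage can be sketched with a union-find structure. The snippet below is a simplified illustration rather than the production GIS workflow: each fragment is reduced to a representative point (a stand-in for true polygon buffering), and two fragments are merged when their distance is smaller than the sum of their buffer radii (180 m for river-labeled fragments, 120 m otherwise).

```python
import math

def aggregate_by_buffers(fragments):
    """fragments: list of (x, y, is_river) tuples in a metric (e.g., UTM) CRS.
    Returns one group id per fragment; fragments whose buffers overlap
    end up in the same group."""
    parent = list(range(len(fragments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Buffer radii from the text: 180 m for rivers, 120 m otherwise.
    radius = [180.0 if is_river else 120.0 for (_, _, is_river) in fragments]
    for i in range(len(fragments)):
        for j in range(i + 1, len(fragments)):
            dist = math.hypot(fragments[i][0] - fragments[j][0],
                              fragments[i][1] - fragments[j][1])
            if dist < radius[i] + radius[j]:  # buffers intersect
                union(i, j)
    return [find(i) for i in range(len(fragments))]
```

A real implementation would operate on polygon geometries and use a spatial index instead of the quadratic pairwise scan.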
In the third stage, aimed at addressing isolated fragments, we employed a GIS-based nearest neighbor analysis. In this stage, water body features not labeled as rivers after the first two stages were analyzed using neighborhood relationships and a distance threshold. The goal was to enhance global connectivity and improve the classification value of very small water bodies. In reality, small water bodies are often separated by bridges, walkways, ditches, or appear fragmented in remote sensing imagery due to clouds, shadows, snow/ice, or noise. This stage is designed to handle fragmentation of relatively isolated water body features. A distance threshold of 150 m was set, and nearest neighbor searches were conducted using Euclidean distance.
Figure 7 illustrates the aggregation strategy. The main principle is that geographically adjacent water body fragments are often homologous in natural formation processes. Aggregating small nearby fragments before classification improves practical relevance and ensures greater logical consistency.
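The nearest-neighbor check of this stage can be sketched as follows (hypothetical function name; a real implementation would use a spatial index rather than a brute-force scan): a fragment is attached to its nearest neighbor only if that neighbor lies within the 150 m threshold.

```python
import math

THRESHOLD_M = 150.0  # distance threshold from the text

def nearest_within_threshold(target, candidates, threshold=THRESHOLD_M):
    """Return (index, distance) of the nearest candidate point within
    `threshold` of `target`, or None if no candidate qualifies.
    Coordinates are assumed to be in a metric CRS (e.g., UTM)."""
    best = None
    for i, (x, y) in enumerate(candidates):
        d = math.hypot(x - target[0], y - target[1])
        if d <= threshold and (best is None or d < best[1]):
            best = (i, d)
    return best
```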
It is worth noting that, to ensure metric accuracy during geometric operations, we mitigated the projection distortion inherent in the source WGS 1984 geographic coordinate system (EPSG:4326) by dynamically projecting each tile to its corresponding UTM zone based on the tile's centroid. The selection criteria are described in Equations (1) and (2):

$$\mathrm{zone} = \left\lfloor \frac{\lambda + 180^{\circ}}{6^{\circ}} \right\rfloor + 1 \quad (1)$$

$$\mathrm{EPSG} = \begin{cases} 32600 + \mathrm{zone}, & \varphi \ge 0 \\ 32700 + \mathrm{zone}, & \varphi < 0 \end{cases} \quad (2)$$

In these equations, $\lambda$ represents the longitude of the tile center, and $\varphi$ represents the latitude of the tile center. UTM divides zones at 6° intervals, with EPSG codes ranging from 32601 to 32660 in the Northern Hemisphere and from 32701 to 32760 in the Southern Hemisphere. After processing, all vector features were reprojected back to WGS 1984.
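The zone-selection rule of Equations (1) and (2) amounts to a few lines of arithmetic, sketched below:

```python
import math

def utm_epsg(lon, lat):
    """EPSG code of the UTM zone containing (lon, lat), per the 6-degree
    zoning rule: 326xx in the Northern Hemisphere, 327xx in the Southern."""
    zone = int(math.floor((lon + 180.0) / 6.0)) + 1
    zone = min(max(zone, 1), 60)  # clamp lon == 180 into zone 60
    return (32600 if lat >= 0 else 32700) + zone
```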
These three steps can significantly reduce the number of fragmented water bodies. The combination of hydrological priors and spatial reasoning enables this method to be reliably applied under diverse climatic and topographic conditions. The aggregated water body dataset produced through this process forms a solid foundation for subsequent river and lake classification. Furthermore, although this strategy for mitigating fragmentation at the global scale cannot guarantee perfectly accurate aggregation results, the aggregation operations do not significantly alter the shape factors of independent water body fragments. In other words, they do not affect subsequent classification results, making this a highly reliable aggregation approach.
3.3. Subtype Classification of Water Bodies
Surface water body subtypes generally include rivers, lakes, wetlands, canals, and others. In this study, only rivers and lakes are considered. Surface water bodies were classified into these two primary subtypes based on morphological characteristics, hydrological dynamics, and limnological properties. Lakes are defined as relatively stationary natural or artificial water bodies, surrounded by land, and typically replenished by precipitation, groundwater, or inflowing rivers. Water is discharged from lakes either through evaporation or outflowing rivers, and their surface area can range from 0.03 km² to several tens of thousands of km². Rivers, in contrast, are continuous and directional, flowing from high to low elevations and ultimately discharging into oceans, lakes, or other water bodies. Rivers are usually elongated and meandering in shape. Detailed definitions of each subtype are provided in Table 2. Other water body subtypes and potential directions for future research are discussed in detail in Section 5. In addition, the final classification results in this study include an "Other" category, which does not correspond to actual surface water. This category primarily represents buffers established along land coastlines in the JRC GSW dataset, as shown in Figure 8.
To achieve fine-grained classification of surface water body subtypes, we employed a hybrid CNN-Transformer architecture classifier, referred to as the Hybrid CNN-Transformer Classifier (HCTC). This model combines the local feature extraction capability of CNNs with a lightweight Transformer architecture capable of capturing long-range spatial dependencies and contextual relationships. The network architecture of HCTC is illustrated in Figure 9.
This hybrid design is specifically intended to handle the complexity of water body contours: rivers are typically elongated and meandering, whereas lakes tend to be more rounded and enclosed. The model employs a simplified ResNet-18 network, with the average pooling and fully connected layers removed, retaining only the feature extraction portion. Features from different levels are projected to a unified channel dimension using 1 × 1 convolutions and then upsampled layer by layer for top-down fusion. Subsequently, a spatial attention mechanism emphasizes important regions, and a Transformer encodes the global context, compensating for the local limitations of the CNN. The aggregated features are finally fed into a linear classification head for three-class prediction: rivers, lakes, and other.
All water body features are resampled to 512 × 512 images for input to the model. Initial low-level features are obtained through a 7 × 7 convolution followed by max pooling, and multi-level features $C_1$, $C_2$, $C_3$, and $C_4$ are progressively extracted through the four residual blocks of ResNet-18, in which $C_1 \in \mathbb{R}^{64 \times 128 \times 128}$, $C_2 \in \mathbb{R}^{128 \times 64 \times 64}$, $C_3 \in \mathbb{R}^{256 \times 32 \times 32}$, and $C_4 \in \mathbb{R}^{512 \times 16 \times 16}$.
To leverage information at multiple scales, we adopted the Feature Pyramid Network (FPN) concept. Features from different layers are projected to a unified channel dimension $d$ through 1 × 1 convolutions and then progressively upsampled in a top-down manner and summed. Specifically, the high-level semantic features are upsampled and added to the corresponding low-level detailed features to generate the multi-scale feature maps $P_i$:

$$P_4 = \phi_4(C_4), \qquad P_i = \phi_i(C_i) + \mathrm{Up}(P_{i+1}), \quad i = 3, 2, 1,$$

where $\phi_i$ denotes the 1 × 1 convolution mapping and $\mathrm{Up}(\cdot)$ is 2× upsampling. This results in a set of multi-scale fused features $\{P_1, P_2, P_3, P_4\}$, which incorporate both low-level details and high-level semantic information. Then we apply the spatial attention mechanism independently to each feature map $P_i$ to enhance salient regions and suppress irrelevant areas:

$$A_i = \sigma(\mathrm{Conv}(P_i)), \qquad \tilde{P}_i = A_i \odot P_i,$$

where $\tilde{P}_i$ represents the attention-enhanced feature maps, $\mathrm{Conv}(\cdot)$ denotes the convolution operation, $\sigma$ is the sigmoid activation, $A_i$ is the attention weight map, and $\odot$ denotes element-wise multiplication. This process highlights water body boundaries and main shapes while attenuating interference.
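This spatial attention step can be sketched in PyTorch. The snippet below is a minimal illustration: the 7 × 7 kernel size is an assumption, as the exact convolution configuration is not specified here.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of A = sigmoid(Conv(P)); P_tilde = A * P (element-wise)."""
    def __init__(self, channels, kernel_size=7):
        super().__init__()
        # Project to a single-channel attention map (kernel size assumed).
        self.conv = nn.Conv2d(channels, 1, kernel_size,
                              padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, p):
        a = self.sigmoid(self.conv(p))  # attention weights in (0, 1), shape (B, 1, H, W)
        return a * p                    # broadcast element-wise multiplication
```

Because the weights lie in (0, 1), the module can only attenuate responses, emphasizing salient regions relative to the suppressed background.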
Each attention-enhanced feature map is flattened into a sequence of tokens $T_i$. These sequences are then concatenated to form a unified multi-scale sequence. Since the number of tokens from low-level features (high resolution) is significantly larger than that from high-level features, there is a risk of semantic information being overwhelmed. To address this, we prepend a learnable [CLS] token to the sequence. Learnable positional encodings $E_{\mathrm{pos}}$ are added to retain spatial structure information. The final input to the Transformer is defined as

$$Z_0 = [\,z_{\mathrm{cls}};\, T_1;\, T_2;\, T_3;\, T_4\,] + E_{\mathrm{pos}}.$$
The Transformer encoder then processes this sequence. Through the multi-head self-attention (MHSA) mechanism, the [CLS] token dynamically interacts with tokens from all scales, aggregating the most relevant features for classification, whether fine-grained boundaries from the low-level feature maps or global shapes from the high-level ones. Finally, the output state of the [CLS] token is used for the final prediction.
Each encoder layer consists of an MHSA and an FFN. The number of Transformer encoder layers was set to 4. This depth was empirically determined to strike an optimal balance between capturing global contextual dependencies and maintaining computational efficiency, while avoiding overfitting on the available dataset.
The multi-head self-attention is formulated as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \qquad Q = Z W_Q,\quad K = Z W_K,\quad V = Z W_V,$$

where $W_Q$, $W_K$, $W_V$, and the output projection $W_O$ are learnable parameters, and $\sqrt{d_k}$ is the scaling factor. The update of an encoder layer can be expressed as

$$Z' = Z + \mathrm{MHSA}(\mathrm{LN}(Z)), \qquad Z_{\mathrm{out}} = Z' + \mathrm{FFN}(\mathrm{LN}(Z')),$$

where $\mathrm{LN}(\cdot)$ denotes layer normalization.
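The scaled dot-product at the core of MHSA can be written compactly; the following single-head NumPy sketch omits the learnable projections and the multi-head split:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D arrays Q (n_q, d_k),
    K (n_k, d_k), V (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V                         # convex combination of V rows
```

Each output row is a convex combination of the rows of V, weighted by query-key similarity.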
The final global context representation $Z_L$ is the output sequence of the last encoder layer. We extract the first token (corresponding to the [CLS] token) as the global semantic representation of the water body:

$$g = Z_L^{(0)}.$$

Finally, the object-level global feature vector $g$ is fed into a classification head, consisting of a linear transformation and a softmax activation, to yield the predicted probabilities:

$$\hat{y} = \mathrm{softmax}(W g + b),$$

where $W$ and $b$ are the parameters of the classification layer, and $\hat{y}$ is the final classification result.
The overall model architecture and principles are as described above. The model was trained and tested on manually labeled data from the seven regions introduced earlier. The manually labeled dataset contains over 1300 Lake instances, over 600 River instances, and only 78 Other instances corresponding to coastal buffer areas. Due to this extreme class imbalance, weighted cross-entropy was applied during training, with class weights set in proportion to the inverse square root of the per-class sample counts:

$$w_c \propto \frac{1}{\sqrt{N_c}},$$

where $N_c$ is the number of samples in class $c$. Additional hyperparameter settings of the model are summarized in Table 3.
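The inverse-square-root weighting can be computed directly from the class counts; the counts below are approximations taken from the text ("over 1300", "over 600", 78), and the sum-to-one normalization is an assumption for illustration.

```python
import math

# Approximate per-class sample counts from the text.
counts = {"Lake": 1300, "River": 600, "Other": 78}

# w_c proportional to 1 / sqrt(N_c), normalized to sum to 1.
raw = {c: 1.0 / math.sqrt(n) for c, n in counts.items()}
total = sum(raw.values())
weights = {c: w / total for c, w in raw.items()}
```

The rare "Other" class thus receives the largest weight, counteracting its scarcity in the loss.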