Article

Improved Convolutional Neural Network with Attention Mechanisms for River Extraction

1 School of Computer Science, Zhuhai College of Science and Technology, Zhuhai 519041, China
2 School of Computer Science and Engineering, Faculty of Innovation Engineering, Macau University of Science and Technology, Macao 999078, China
3 Academy of Interdisciplinary Studies, The Hong Kong University of Science and Technology, Hong Kong 999077, China
* Author to whom correspondence should be addressed.
Water 2025, 17(12), 1762; https://doi.org/10.3390/w17121762
Submission received: 1 April 2025 / Revised: 29 May 2025 / Accepted: 10 June 2025 / Published: 12 June 2025

Abstract

Rivers, as fundamental components of freshwater supply and wetland ecosystems, play an essential role in sustaining biodiversity and facilitating sustainable resource utilization. This study introduces the integration of the attention mechanism within the convolutional neural network (CNN) framework and constructs seven enhanced models. A novel dataset has been independently developed utilizing high spatial resolution remote sensing images obtained from China’s Gaofen-2 satellite (GF-2), which enables the efficient and precise extraction of river distribution. The city of Zhuhai, characterized by its intricate river network located in the lower reaches of the Pearl River Basin, has been selected as the experimental area for this research. The experimental results indicate that the CNN model enhanced by the attention mechanism significantly surpasses the baseline model across several performance metrics, including overall accuracy, Kappa coefficient, Precision, Recall, F1-score, Mean Intersection over Union, and the extraction result map. Notably, the model incorporating the Bottleneck Attention Module demonstrates the highest performance, achieving overall accuracy and Kappa coefficient values of 93.09% and 0.8618, respectively, which surpass the baseline model by 12.62% and 0.2524. This study thus provides crucial spatial data and method support for river resource management, supporting ecological conservation and sustainable wetland management.

1. Introduction

As an important category of wetlands, rivers play a vital role in wetland systems. Through the alternation of flood and dry periods, rivers maintain the hydrological connectivity and dynamic balance of wetlands, promoting the health and stability of wetland ecosystems. At the same time, however, river volumes surge and overflow during flood periods, and in areas of high mountains and deep valleys rapid pooling and runoff can produce flash floods, causing significant loss of life and property. Traditional river extraction techniques rely mainly on manual sampling and field surveys, which are inefficient and make it difficult to obtain information at scale. In 2017, the State Council released the New Generation Artificial Intelligence Development Plan, which explicitly proposes strengthening applied research on AI technology in water resources management and ecological environmental protection. Machine learning, especially deep learning, can learn rich spectral and spatial features from remote sensing data without requiring complex spectral characterization of images, significantly improving classification accuracy and automation. Applying machine learning methods to river extraction from high spatial resolution remote sensing images therefore has important application value and can provide powerful support for water resource management and ecological environmental protection.
River extraction methods based on remote sensing images have evolved from early visual interpretation, through band-combination approaches, to machine learning and deep learning methods. Models such as Deep-WaterMap [1], CNN [2], Artificial Neural Network (ANN) [3], and Deep Neural Network (DNN) [4] have been applied, and all show good river extraction performance, though there is still room to improve overall accuracy. For large rivers, the clear difference between pure-water spectral features and those of surrounding land cover allows accurate extraction with traditional methods. For small rivers, however, the river signal is mixed with background spectral information, and limited spatial resolution blurs river details, making extraction more difficult. With the improvement of the spatial and temporal resolution of remote sensing imagery, the use of high-resolution data sources has become the trend in river extraction. Deep learning methods can more effectively extract rich intrinsic information from remote sensing image data, which is the direction for improving the accuracy of river extraction. CNNs have achieved remarkable results in image feature extraction, but their feature characterization ability remains limited. The attention mechanism enhances the model's focus on key features by dynamically allocating weight parameters, improving feature characterization and computational efficiency. High-resolution remote sensing images exhibit strong spatial correlation: traditional pixel-level methods ignore spatial information, while object-oriented methods face the challenge of processing very large volumes of data. Introducing the attention mechanism, which essentially simulates the resource allocation mechanism of human vision, can promote multi-scale feature fusion and enhance model adaptability and classification accuracy.
In this paper, Chinese high-resolution remote sensing imagery from GF-2 is used as the data source, and a dedicated dataset is constructed independently. Seven attention mechanisms are selected to improve the CNN model, namely Squeeze-and-Excitation Networks, Efficient Channel Attention, Convolutional Block Attention Module, Coordinate Attention, Criss-Cross Attention, Residual Attention Network, and Bottleneck Attention Module, all of which are lightweight, efficient, and easy to embed. Traditional CNN models are sensitive to complex backgrounds and noise, have difficulty distinguishing key features among multi-scale targets and cluttered backgrounds, and capture features at different scales insufficiently. Adding an attention mechanism module enhances important features, suppresses irrelevant ones, optimizes the network's representations, and improves classification accuracy. Taking Zhuhai City as the study area, we carry out river extraction for the city and analyze and compare the performance of the different models in extracting rivers from high spatial resolution images.

2. Overview of the Study Area

The study area is located between 21°48′–22°27′ N and 113°03′–114°19′ E, on the west bank of the Pearl River estuary in the southern part of Guangdong Province, with a total land area of 1732 km². The region has a subtropical oceanic climate, frequently affected by the southern subtropical monsoon, with many thunderstorms and abundant rainfall. The average annual temperature is 22.4 °C, and the mean annual rainfall is 1700–2300 mm, of which precipitation from April to September accounts for 85% of the yearly total; catastrophic weather includes tropical cyclones and heavy rainfall. The study area lies in the alluvial plain of the Pearl River Delta, formed by the deposition of the Xijiang, Beijiang, and Dongjiang rivers as they enter the sea; it is the largest alluvial plain in the southern subtropical region of China, with typical plain, river-network, and hilly geomorphological features. The water system in the region is well developed, and the main rivers include the Xijiang River's outlet channels to the sea (the Modaomen, Jidimen, and Hutiaomen waterways, among others) and the mountain streams on the hills and islands (the Doumen River Stream, Dachikan River, Feisha River, Nanxi River, Jishan River, and Shenqian River, among others). Hydrological monitoring data show that the city's multi-year average runoff totals 1429.68 m³, with total water resources of 1757 million m³.

3. Materials and Methods

3.1. Data

The GF-2 satellite, the data source for this paper, was successfully launched on 19 August 2014. It was the first civilian optical remote sensing satellite developed independently by China with a spatial resolution better than 1 m, and it is also China's current highest-resolution civilian land observation satellite. It carries a 1 m panchromatic and 4 m multispectral high-resolution camera, with a spatial resolution of up to 0.8 m at nadir, featuring sub-meter spatial resolution and high positioning accuracy.
To better extract the river, this paper selects, as far as possible, data from April to September, when precipitation in the study area is most concentrated. Considering cloud coverage, data quality, and other factors, 17 GF-2 remote sensing images were selected to cover the entire study area. The acquisition dates are concentrated on 23 July 2016, 30 April 2017, and 26 August 2017, and the remaining uncovered areas were supplemented with 2015 and 2018 data.

3.2. Methods

3.2.1. Data Processing

The GF-2 images were processed using ENVI 5.3 software. Radiometric calibration, atmospheric correction, and orthorectification were performed on the multispectral (MSS) images, and radiometric calibration and orthorectification were performed on the panchromatic (PAN) images. The MSS and PAN images were then fused, which effectively improves the spatial resolution of the fused images while preserving the spectral information of the multispectral images. We cropped the images to the boundary of the study area and mosaicked the cropped images to obtain a remote sensing image of the whole study area, in preparation for extracting the river using the different methods (shown in Figure 1).
This paper uses different methods to extract two classes, namely rivers and all other features (hereinafter referred to as non-rivers). ArcGIS Pro 2.5.0 software was used to create the deep-learning samples. To better extract rivers, Band3, Band4, and Band2 were selected as the band combination scheme. One researcher delineated all rivers and non-rivers by visual interpretation. The interpretation signs are shown in Figure 2, where (a) is the river and the rest are typical representatives of non-rivers. The interpretation results were checked against the findings in "China Wetland Resources (Guangdong Volume)", and sample accuracy was further ensured through field surveys. When building the dataset, we compared river extraction results for 3 × 3, 5 × 5, 7 × 7, and 11 × 11 patch sizes and found that 7 × 7 performed best, so the sample size was set to 7 × 7. The non-river samples outnumbered the river samples, producing an imbalanced dataset that would affect model accuracy; we therefore randomly subsampled the non-river samples to match the river sample count. The final numbers of river and non-river samples were both 54,168. The samples were then shuffled so that the two classes were evenly distributed across the training and test sets. Using the hold-out method, also known as simple cross-validation, the dataset was divided into two mutually exclusive sets, a training set and a test set, in a 60%/40% ratio, yielding 65,001 and 43,335 samples, respectively. To keep the scale of each input consistent and avoid the model being overly sensitive to any one type of data, the sample data were normalized to [0, 1] or [−1, 1]. All four GF-2 bands were used as inputs so that the model could better learn the characteristics of rivers and non-rivers.
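The balancing, shuffling, hold-out split, and normalization steps described above can be sketched as follows; the arrays and class counts are illustrative stand-ins for the real GF-2 patch samples:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-ins for the real GF-2 patches: 7x7 patches with 4 bands.
# Class counts are illustrative, not the paper's actual sample sizes.
river = rng.integers(0, 1024, size=(500, 7, 7, 4)).astype(np.float64)
non_river = rng.integers(0, 1024, size=(800, 7, 7, 4)).astype(np.float64)

# Balance: randomly subsample the majority class to the minority size.
idx = rng.choice(len(non_river), size=len(river), replace=False)
non_river = non_river[idx]

X = np.concatenate([river, non_river])
y = np.concatenate([np.ones(len(river)), np.zeros(len(non_river))])

# Shuffle so both classes are evenly distributed across the split.
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]

# Min-max normalize the data to [0, 1].
X = (X - X.min()) / (X.max() - X.min())

# Hold-out split: 60% training, 40% test.
n_train = int(0.6 * len(X))
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]
```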

3.2.2. CNN

CNNs originated as a modification of the Multi-Layer Perceptron (MLP), inspired by the work of biologists Hubel and Wiesel [5] on the visual cortex of cats. The development of CNNs took a step further when Yann LeCun et al. designed a CNN to recognize handwritten zip codes on envelopes supplied by the U.S. Postal Service. A CNN model generally consists of an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer. The input layer accepts the initial image used to train the network; the convolutional layer extracts various features from the image by convolving the input and reduces the effect of noise on the classification result; and the pooling layer, usually used together with the convolutional layer, downsamples the convolutional layer's output feature map to reduce the feature dimension. The fully connected layer vectorizes the feature data, and the features extracted by the feature layers (convolutional and pooling layers) are passed to the classifier to complete the image classification. The 'convolution-pooling' structure enables a CNN to automatically extract various image features, greatly reducing the number of parameters to be trained and improving its generalization ability, which allows CNNs to be applied well in other fields. The CNN architecture designed in this paper is shown in Figure 3. The model consists of three convolutional layers alternating with dropout layers, followed by flattening of the feature map and two fully connected layers of 256 and 128 units for river extraction. The dropout layers are designed to prevent overfitting and improve the generalization ability of the model.
Except for the output layer, which uses Softmax, all other layers use the ReLU activation function. The model is implemented with the Keras library; the loss function is sparse_categorical_crossentropy, the optimizer is RMSprop, and the learning rate is 0.001. When training the model, epochs are set to 23 and the batch size to 32. In this study, we use only the CPU. The experimental configuration is as follows: the CPU is an Intel Core i9, the operating system is macOS 12.6, and the development environment is TensorFlow 2.11.0 and Python 3.8.19.
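A minimal Keras sketch of the described architecture follows. The filter counts, kernel sizes, and dropout rate are not specified in the paper, so the values below are assumptions for illustration; the input shape, fully connected widths, activation functions, loss, and optimizer follow the text:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Filter counts, kernel size, and dropout rate are NOT given in the paper;
# the values below are placeholders chosen for illustration.
model = models.Sequential([
    layers.Input(shape=(7, 7, 4)),          # 7x7 patches, 4 GF-2 bands
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.Dropout(0.25),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.Dropout(0.25),
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(2, activation="softmax"),  # river vs. non-river
])

model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```

Training would then call `model.fit(X_train, y_train, epochs=23, batch_size=32)` with the prepared samples.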

3.2.3. Attention Mechanism Module

In this paper, seven different attention mechanisms are selected, namely Squeeze-and-Excitation Networks, Efficient Channel Attention, Convolutional Block Attention Module, Coordinate Attention, Criss-Cross Attention, Residual Attention Network, and Bottleneck Attention Module. They are lightweight, easy to embed, and can maintain low computational cost, which can make up for the shortcomings of the traditional CNN model and improve the classification accuracy effectively. The architecture of the CNN model combining the attention mechanism module is shown in Figure 4. Through experimental comparison and to increase comparability, all attention mechanism modules are embedded after two sets of convolutional layers and dropout layers.
1. Squeeze-and-Excitation Networks
Squeeze-and-Excitation (SE) Networks, introduced by Jie Hu et al. in 2018 [6], represent a sophisticated channel attention mechanism designed to optimize the properties of neural networks. This framework enhances the representation of significant features while diminishing the influence of irrelevant ones. The primary objective of SE is to bolster the feature selection capabilities of CNN.
The implementation process of SE consists of the following steps:
(1) Squeeze: the features of each channel are compressed by global average pooling, which compresses the spatial information of each channel into a global descriptor.
(2) Excitation: the model uses a small fully connected network to learn channel excitations, and a sigmoid activation function then produces a weight between 0 and 1 indicating the importance of each channel.
(3) Recalibration: the adjusted channel weights are applied to the original feature map, reinforcing important channels and suppressing irrelevant ones.
The channel compression ratio is set to 4. Compared with the CNN, total parameters increase by 0.0021 M, and floating-point operations per second (FLOPs) increase by 0.59% (as shown in Table 1). The CNN alone cannot adequately capture correlations between channels, which leads to less accurate feature extraction and makes it difficult to distinguish the target from a complex background. SE adaptively adjusts the weight of each channel according to its importance, improving the network's ability to identify key features and thus enhancing the accuracy of river extraction.
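A minimal NumPy sketch of the SE forward pass on a single feature map, with random matrices standing in for the learned excitation weights:

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-Excitation forward pass on an (H, W, C) feature map.

    w1/b1 and w2/b2 are the two FC layers of the excitation network;
    here they are random stand-ins for learned parameters.
    """
    # Squeeze: global average pooling -> one descriptor per channel.
    z = x.mean(axis=(0, 1))                      # shape (C,)
    # Excitation: FC -> ReLU -> FC -> sigmoid.
    h = np.maximum(z @ w1 + b1, 0.0)             # shape (C // r,)
    s = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))     # shape (C,), values in (0, 1)
    # Recalibration: scale each channel by its weight.
    return x * s, s

rng = np.random.default_rng(0)
C, r = 16, 4                                     # reduction ratio r = 4, as in the paper
x = rng.normal(size=(7, 7, C))
w1 = rng.normal(size=(C, C // r)); b1 = np.zeros(C // r)
w2 = rng.normal(size=(C // r, C)); b2 = np.zeros(C)
y, s = se_block(x, w1, b1, w2, b2)
```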
2. Efficient Channel Attention
Efficient Channel Attention (ECA) represents a novel channel attention mechanism introduced by Qilong Wang et al. in 2020 [7], designed to optimize the selective representation of features within CNN. This mechanism is characterized by its lightweight and efficient nature, distinguishing itself from the SE by utilizing one-dimensional convolution to model channel dependencies.
The implementation process of ECA consists of the following steps:
(1) Perform global average pooling on each channel of the input feature map to generate a global descriptor for each channel.
(2) Compute channel weights through a one-dimensional convolutional layer, learning local interactions between channels without relying on a complex fully connected layer; this yields a more lightweight structure that retains strong feature modeling capability while reducing parameters and computational complexity.
(3) Normalize the weights with a sigmoid function to obtain the weight of each channel, then apply these weights to the input feature map to recalibrate the channel features.
The convolution kernel size is set to 3. Compared with the CNN, total parameters are almost unchanged, but FLOPs increase by 0.43% (as shown in Table 1). In remote sensing image classification, a CNN has difficulty distinguishing key features among multi-scale targets and complex backgrounds. ECA effectively enhances important channel features, increasing the model's sensitivity to them, reducing the influence of background noise, and avoiding computational overhead, which improves the accuracy and efficiency of river extraction from remote sensing images.
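The ECA forward pass can be sketched in NumPy as follows, with a random kernel standing in for the learned 1-D convolution weights:

```python
import numpy as np

def eca_block(x, kernel):
    """Efficient Channel Attention forward pass on an (H, W, C) map.

    `kernel` is the k-sized 1-D convolution filter (k = 3 here),
    a random stand-in for the learned weights.
    """
    z = x.mean(axis=(0, 1))                          # global average pooling, (C,)
    k = len(kernel)
    zp = np.pad(z, k // 2)                           # same-padding for the 1-D conv
    conv = np.array([zp[i:i + k] @ kernel for i in range(len(z))])
    s = 1.0 / (1.0 + np.exp(-conv))                  # sigmoid channel weights
    return x * s, s

rng = np.random.default_rng(1)
x = rng.normal(size=(7, 7, 16))
y, s = eca_block(x, rng.normal(size=3))              # kernel size 3, as in the text
```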
3. Convolutional Block Attention Module
The Convolutional Block Attention Module (CBAM), proposed by Sanghyun Woo et al. in 2018 [8], represents a significant advancement in the integration of attention mechanisms within convolutional neural networks. CBAM synergistically combines channel attention and spatial attention, enabling the model to assign differential weights to various elements within the input feature map. This capability allows for the amplification of salient features while simultaneously mitigating the influence of less relevant components.
The implementation process of CBAM consists of the following steps:
(1) Channel attention: global channel information is obtained by compressing each channel through global average pooling and global maximum pooling. This information is passed through a shared MLP (multilayer perceptron) to generate per-channel attention weights, and the weights obtained via the sigmoid activation function are applied channel by channel to the original feature map.
(2) Spatial attention: on the channel-weighted feature maps, CBAM generates a spatial attention map using convolutional operations to further emphasize important regions of the image. The module fuses information from maximum pooling and average pooling and generates spatial attention weights through a convolutional layer.
The dimension reduction ratio in channel attention is set to 0.25, and the spatial attention convolution kernel size is 1. Compared with the CNN, total parameters increased by 0.0043 M and FLOPs increased by 1.33% (as shown in Table 1). CBAM enhances the model’s ability to capture key features through the channel and spatial attention mechanisms, while suppressing irrelevant information to compensate for the CNN’s difficulty in effectively handling multi-scale targets and background interference in complex scenes, thus improving the accuracy and robustness of remote sensing image classification.
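A simplified NumPy sketch of the CBAM forward pass; the MLP and convolution weights are random stand-ins, and the spatial branch uses a 1 × 1 convolution, matching the kernel size stated above:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cbam_block(x, w1, w2, w_sp):
    """CBAM forward pass on an (H, W, C) map: channel then spatial attention.

    w1/w2 are the shared-MLP weights of the channel branch and w_sp the
    1x1 spatial-convolution weights; all are random stand-ins.
    """
    # Channel attention: shared MLP over avg- and max-pooled descriptors.
    avg, mx = x.mean(axis=(0, 1)), x.max(axis=(0, 1))
    mc = sigmoid(np.maximum(avg @ w1, 0) @ w2 + np.maximum(mx @ w1, 0) @ w2)
    x = x * mc
    # Spatial attention: 1x1 conv over channel-wise avg and max maps.
    stacked = np.stack([x.mean(axis=2), x.max(axis=2)], axis=2)  # (H, W, 2)
    ms = sigmoid(stacked @ w_sp)                                  # (H, W)
    return x * ms[..., None]

rng = np.random.default_rng(2)
C, r = 16, 4          # reduction ratio of 4, matching the 0.25 ratio in the text
x = rng.normal(size=(7, 7, C))
y = cbam_block(x, rng.normal(size=(C, C // r)), rng.normal(size=(C // r, C)),
               rng.normal(size=2))
```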
4. Coordinate Attention
Coordinate Attention (CA), as introduced by Qibin Hou et al. in 2021 [9], represents a significant advancement in attention mechanisms tailored for CNNs. Unlike traditional channel and spatial attention frameworks, CA enhances the capacity of CNNs to effectively model the spatial relationships between target locations and channel features by capturing spatial location information within the feature map in both horizontal and vertical dimensions.
The implementation process of CA consists of the following steps:
(1) Coordinate information (e.g., coordinates relative to the image center) is generated for each pixel location and passed to the attention module; this information helps the network capture relationships between different spatial locations.
(2) Weights are assigned to the horizontal and vertical coordinates by introducing 1D convolution operations along the coordinate axes (X and Y), explicitly capturing spatial relationships in the two directions.
(3) The learned coordinate weights are applied to the input feature map to enhance the feature response of important regions while suppressing irrelevant regions.
The coordinate dimension reduction ratio is 32, and the spatial attention convolution kernel size is 1. Compared with the CNN, total parameters increase by about 0.0017 M, and FLOPs increase by 13.45% (as shown in Table 1). In remote sensing image classification, CNNs have difficulty capturing the spatial locations of targets and long-distance dependencies in complex backgrounds; by utilizing explicit coordinate information, CA enhances the model's sensitivity to target locations.
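A simplified NumPy sketch of the coordinate attention idea, pooling along each axis separately; the per-direction transforms are random stand-ins for the module's shared 1 × 1 convolutions:

```python
import numpy as np

def ca_block(x, wh, ww):
    """Simplified Coordinate Attention on an (H, W, C) feature map.

    wh/ww are per-direction (C, C) transforms, random stand-ins for the
    shared 1x1 convolutions of the original module.
    """
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    h_desc = x.mean(axis=1)              # pool along width  -> (H, C)
    w_desc = x.mean(axis=0)              # pool along height -> (W, C)
    ah = sigmoid(h_desc @ wh)            # attention along the vertical axis
    aw = sigmoid(w_desc @ ww)            # attention along the horizontal axis
    return x * ah[:, None, :] * aw[None, :, :]

rng = np.random.default_rng(4)
x = rng.normal(size=(7, 7, 16))
y = ca_block(x, rng.normal(size=(16, 16)), rng.normal(size=(16, 16)))
```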
5. Criss-Cross Attention
Criss-Cross Attention (CCA), introduced by Zilong Huang et al. in 2019 [10], represents an innovative approach in the realm of image feature learning networks, aimed at effectively capturing dependencies among distant pixels within an image. This advancement significantly enhances the model’s proficiency in perceiving global information.
The implementation process of CCA consists of the following steps:
(1) A criss-cross structure extracts features along the horizontal and vertical directions of the image, allowing the model to capture a wider range of spatial information.
(2) Features are propagated along the rows and columns of the image, incrementally augmenting the contextual information between pixels to form long-range dependencies both horizontally and vertically, which avoids the computationally intensive global attention mechanism.
(3) Features from the different directions are fused, enhancing important information and suppressing irrelevant background through weighting.
Compared with the CNN, total parameters hardly increase, and FLOPs increase by 7.38% (as shown in Table 1). In remote sensing image classification, it is difficult for a CNN to capture long-distance spatial dependencies, especially in complex scenes. CCA enhances the model's ability to model global contextual information, improving classification accuracy, especially when dealing with multi-scale targets and complex backgrounds.
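The criss-cross aggregation can be sketched in NumPy as follows; for clarity, the query, key, and value projections are shared (the same map is passed for all three), and a plain loop replaces the original's efficient implementation:

```python
import numpy as np

def cca_block(q, k, v):
    """Simplified Criss-Cross Attention on (H, W, C) query/key/value maps.

    Each position attends to every position in its own row and column;
    q, k, v stand in for the projected feature maps of the original module.
    """
    H, W, C = q.shape
    out = np.zeros_like(v)
    for i in range(H):
        for j in range(W):
            # Keys/values on the criss-cross path of (i, j): row i and column j.
            keys = np.concatenate([k[i, :, :], k[:, j, :]])      # (W + H, C)
            vals = np.concatenate([v[i, :, :], v[:, j, :]])
            scores = keys @ q[i, j]                               # (W + H,)
            scores = np.exp(scores - scores.max())
            attn = scores / scores.sum()                          # softmax weights
            out[i, j] = attn @ vals
    return out

rng = np.random.default_rng(3)
x = rng.normal(size=(7, 7, 8))
out = cca_block(x, x, x)
```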
6. Residual Attention Network
The Residual Attention Network (RA), introduced by Ke Zhu et al. in 2021 [11], represents a significant advancement in the field of deep learning, particularly in the enhancement of feature extraction capabilities. This model amalgamates the principles of Residual Networks (ResNet) with an adaptive attention mechanism to improve the focus on salient features within an input image.
The implementation process of RA contains the following steps:
(1) Extract the feature map of the input image with convolutional layers;
(2) Generate spatial attention weights with the attention module to weight the feature maps and highlight important regions;
(3) Sum the weighted feature maps with the original feature maps through a residual connection, retaining the original information while enhancing key features. This design avoids information loss while improving the model's ability to handle multi-target scenes.
Compared with the CNN, total parameters increase by about 0.0041 M, and FLOPs increase by 17.90% (as shown in Table 1). By introducing the attention mechanism, RA effectively enhances the network's focus on key information and reduces over-learning of background and noise, compensating for the shortcomings of traditional CNN models. Combined with residual learning, the network avoids vanishing gradients and training difficulties, enabling the model to extract deeper features and improve classification performance.
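A minimal NumPy sketch of the residual attention idea, with a random 1 × 1 projection standing in for the learned attention branch:

```python
import numpy as np

def ra_block(x, w):
    """Simplified residual attention on an (H, W, C) feature map.

    `w` is a 1x1 projection (random stand-in) producing a spatial
    attention map; attended features are added back residually.
    """
    a = 1.0 / (1.0 + np.exp(-(x @ w)))   # (H, W) sigmoid spatial weights
    return x + x * a[..., None]          # residual sum keeps the original signal

rng = np.random.default_rng(5)
x = rng.normal(size=(7, 7, 16))
y = ra_block(x, rng.normal(size=16))
```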
7. Bottleneck Attention Module
The Bottleneck Attention Module (BAM), introduced by Jongchan Park et al. in 2018 [12], represents a significant advancement in the field of deep learning by integrating channel attention and spatial attention mechanisms to enhance feature selectivity and improve accuracy within neural networks.
The implementation process of BAM contains the following steps:
(1) Input features are processed through a 'bottleneck layer' to reduce the feature dimension (i.e., the number of channels);
(2) The channel attention branch generates channel weights through global average pooling and a multilayer perceptron (MLP) to emphasize important feature channels, while the spatial attention branch generates spatial weights through convolutional layers to highlight key regions in the feature map;
(3) The outputs of the two branches are combined into attention weights and multiplied with the original feature map to enhance the feature representation.
The dimensionality reduction ratio in channel attention is set to 16, and the dilation rate of the dilated convolution in spatial attention is set to 2. Compared with the CNN, total parameters increase by 0.0014 M and FLOPs by 4.36% (as shown in Table 1). By introducing channel and spatial attention, BAM effectively focuses on key regions and important features in the image, reduces the influence of background noise, compensates for the shortcomings of the traditional CNN, and enhances the model's ability to capture details and complex patterns in remote sensing images, thereby improving classification accuracy.
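A simplified NumPy sketch of the BAM forward pass; the weights are random stand-ins, and the dilated convolution of the spatial branch is reduced to a 1 × 1 projection for brevity:

```python
import numpy as np

def bam_block(x, w1, w2, w_sp):
    """Simplified BAM forward pass on an (H, W, C) feature map.

    Channel and spatial logits are summed, squashed by a sigmoid, and
    applied residually (x + x * M) as in the original BAM. Weights are
    random stand-ins; the dilated convolution is omitted in this sketch.
    """
    z = x.mean(axis=(0, 1))                    # GAP channel descriptor, (C,)
    mc = np.maximum(z @ w1, 0.0) @ w2          # bottleneck MLP -> channel logits
    ms = x @ w_sp                              # 1x1 projection -> spatial logits, (H, W)
    m = 1.0 / (1.0 + np.exp(-(mc[None, None, :] + ms[..., None])))
    return x + x * m

rng = np.random.default_rng(6)
C, r = 32, 16                                  # reduction ratio 16, as in the text
x = rng.normal(size=(7, 7, C))
y = bam_block(x, rng.normal(size=(C, C // r)), rng.normal(size=(C // r, C)),
              rng.normal(size=C))
```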
In summary, in terms of total parameters, CNN-CBAM shows the largest increase over the CNN; CNN-CCA and CNN-ECA are almost unchanged, followed by CNN-BAM, which increases by only 0.0014 M. In terms of FLOPs, CNN-RA requires the most computation, at 2.7068 M. In terms of training time, CNN-CCA takes the longest at 9119.9862 s, while the other models take roughly 100–200 s (as shown in Table 1).

4. Results

This paper uses the confusion matrix to verify the remote sensing extraction results of the eight models (CNN, CNN-CCA, CNN-SE, CNN-CBAM, CNN-ECA, CNN-CA, CNN-RA, CNN-BAM). As can be seen in Figure 5, the TP and TN counts in the confusion matrices of all models are much larger than FP and FN, indicating that all models perform well.
To further evaluate the accuracy of the extraction results, overall accuracy (OA), the Kappa coefficient, Precision, Recall, F1-score, and Mean Intersection over Union (MIoU) were selected as evaluation indicators. Table 2 is arranged in ascending order of OA. The OA of every model with an attention mechanism is improved over the baseline CNN, and all exceed 90% except CNN-CCA. Among them, CNN-BAM achieves the best OA of 93.09%, an improvement of 12.62% over the CNN. Except for the CNN and CNN-CCA, the Kappa coefficients of the other six models are greater than 0.82, indicating very good classification; CNN-BAM achieves the best Kappa coefficient of 0.8618, an increase of 0.2524 over the CNN. The CNN has the lowest Precision at 72.85%; except for CNN-CCA and CNN-SE, the Precision of all other models exceeds 90%, with CNN-BAM highest, 18.48% above the CNN. Recall is greater than 95% for all models, indicating few missed detections and effective extraction of river samples. CNN-BAM has the highest F1-score, 9.98% above the CNN; except for the CNN and CNN-CCA, all F1-scores exceed 91%, indicating good overall performance. Except for the CNN and CNN-CCA, MIoU exceeds 84%, with CNN-BAM highest at 87.32%, 16.01% above the lowest value (the CNN's).
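The evaluation indicators can be computed from a binary confusion matrix as follows (the counts below are illustrative, not the paper's results):

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """OA, Kappa, Precision, Recall, F1, and MIoU for a binary
    river / non-river confusion matrix."""
    n = tp + fp + fn + tn
    oa = (tp + tn) / n
    # Expected agreement by chance, used in Cohen's kappa.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (oa - pe) / (1 - pe)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Per-class IoU: union = correct + both kinds of disagreement.
    iou_river = tp / (tp + fp + fn)
    iou_non_river = tn / (tn + fp + fn)
    miou = (iou_river + iou_non_river) / 2
    return oa, kappa, precision, recall, f1, miou

# Illustrative counts, not the paper's numbers.
oa, kappa, p, r, f1, miou = metrics_from_confusion(tp=900, fp=100, fn=50, tn=950)
```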
As shown in Figure 6, comparing the river extraction results of the eight models, all major rivers were extracted, but some non-rivers, such as reservoirs and aquaculture farms, were incorrectly extracted. Sorted from worst to best, the extraction results are: CNN, CNN-CCA, CNN-SE, CNN-RA, CNN-CBAM, CNN-CA, CNN-ECA, and CNN-BAM. In the central-west region of the extraction maps of the CNN and CNN-CCA, two long strips of water oriented east–west and north–south are clearly visible. This is a large artificial lake, Baiteng Lake, created at the Niwanmen outlet by the Baiteng sea-blocking project of the late 1950s, when a sea-blocking dike about 5.8 km long was built there. In the models beyond these two, the erroneous extraction of Baiteng Lake gradually disappears. The west of the study area is the most affected region, where aquaculture farms are wrongly extracted as rivers, but following the ordering above, this mis-extraction gradually improves. Combining the evaluation indicators and the extraction maps, adding the BAM attention module to the CNN performs best.
Although attention modules such as CCA, SE, CBAM, ECA, CA, RA, and BAM have improved the ability to express features in many visual tasks, they still have their shortcomings in complex remote sensing river extraction scenarios—some ignore spatial direction, some lose channel information, and some have insufficient receptive fields. These limitations lead to attention bias towards the wrong area in cases of spectral confusion, shadow interference, and similarity between small water bodies and backgrounds, resulting in mis-extraction. Even in the CNN-BAM model with the best results in this paper, BAM performs dimensionality compression before channel and spatial attention. Although it reduces the amount of calculation, it also loses fine-grained features, making it difficult to accurately distinguish narrow rivers. In addition, it is difficult to consider low, medium, and high-level features in one attention operation, and rivers with diverse forms are prone to extraction omissions or mis-extraction. The design of each attention mechanism focuses on different aspects—some strengthen channel dependence, some focus on spatial perception, and some take into account coordinate guidance—but they are all limited by their focus range and dimensionality reduction strategies. In remote sensing river extraction, further research can be conducted on adaptively selecting and fusing multiple attention modules to suppress mis-extraction and improve extraction accuracy effectively.

5. Conclusions

This paper takes a CNN as the baseline model and introduces seven attention mechanism modules to study the efficient, accurate, and real-time extraction of rivers from high spatial resolution GF-2 remote sensing images. The results show that these models have great potential for river extraction and provide effective tools for the sustainable management and protection of river resources. The attention mechanisms are lightweight, easy to embed, and computationally efficient; by compensating for the shortcomings of the baseline CNN, they enhance the model’s ability to capture key features and suppress extraneous information, improving the accuracy and robustness of remote sensing image classification. Considering the evaluation metrics (confusion matrix, OA, Kappa coefficient, Precision, Recall, F1-score, and MIoU) together with model complexity (total params, FLOPs, and training time), the CNN combined with the BAM attention mechanism module achieves the best river extraction performance. Building on the attention mechanism, future work could further extract river resources from multi-source remote sensing images; the more features the model learns, the better the accuracy it can achieve.
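All of the evaluation metrics above can be derived from a binary confusion matrix over river and non-river pixels. The sketch below shows the standard definitions; the counts are hypothetical and are not the confusion matrix of the experiments in this paper.

```python
import numpy as np

def metrics_from_confusion(cm):
    """Compute OA, Kappa, Precision, Recall, F1-score, and MIoU from a
    2x2 confusion matrix cm = [[TN, FP], [FN, TP]] (river = positive class)."""
    cm = np.asarray(cm, dtype=float)
    tn, fp, fn, tp = cm.ravel()
    n = cm.sum()
    oa = (tp + tn) / n                       # overall accuracy
    # Kappa: agreement beyond chance, with pe the expected chance agreement
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (oa - pe) / (1 - pe)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # MIoU: mean of the river-class and background-class IoU
    iou_river = tp / (tp + fp + fn)
    iou_background = tn / (tn + fp + fn)
    miou = (iou_river + iou_background) / 2
    return dict(OA=oa, Kappa=kappa, Precision=precision,
                Recall=recall, F1=f1, MIoU=miou)

# Hypothetical pixel counts for illustration only
m = metrics_from_confusion([[900, 50], [30, 820]])
```

For example, with these counts the overall accuracy is (820 + 900) / 1800 ≈ 0.956, while MIoU, which also penalizes false positives and false negatives per class, is lower at about 0.915; this gap is why OA alone can overstate performance on imbalanced river/background scenes.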

Author Contributions

Conceptualization and methodology, H.C., J.L., C.L., and X.T.; software and validation, H.C., J.L., and C.L.; data curation, H.C., J.L., and C.L.; writing—original draft preparation, H.C., J.L., and C.L.; writing—review and editing, H.C. and X.T.; funding acquisition, H.C. and X.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Department of Education of Guangdong Province, grant number 2021KTSCX176, and the Science and Technology Development Fund of Macau, grant number 0020/2024/RIA1.

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing study. Requests to access the datasets should be directed to the corresponding author.

Acknowledgments

The authors are grateful for the support of the Department of Education of Guangdong Province “Innovation and Strengthening Project” Scientific Research Project (Scientific Research Platform and Project of Guangdong Colleges and Universities, 2021KTSCX176), and the Science and Technology Development Fund of Macau (0020/2024/RIA1).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Isikdogan, F.; Bovik, A.C.; Passalacqua, P. Surface Water Mapping by Deep Learning. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2017, 10, 4909–4918. [Google Scholar] [CrossRef]
  2. Chen, Y.; Fan, R.S.; Yang, X.C.; Wang, J.X.; Latif, A. Extraction of Urban Water Bodies from High-Resolution Remote-Sensing Imagery Using Deep Learning. Water 2018, 10, 585. [Google Scholar] [CrossRef]
  3. Bui, X.N.; Nguyen, H.; Tran, Q.H.; Nguyen, D.A.; Bui, H.B. Predicting Ground Vibrations Due to Mine Blasting Using a Novel Artificial Neural Network-Based Cuckoo Search Optimization. Nat. Resour. Res. 2021, 30, 2663–2685. [Google Scholar] [CrossRef]
  4. Li, K.; Wang, J.L.; Yao, J.Y. Effectiveness of Machine Learning Methods for Water Segmentation with Roi as the Label: A Case Study of The Tuul River in Mongolia. Int. J. Appl. Earth Obs. Geoinf. 2021, 103, 102497–102507. [Google Scholar] [CrossRef]
  5. Hubel, D.H.; Wiesel, T.N. Receptive Fields of Single Neurones in The Cat’s Striate Cortex. J. Physiol. 1959, 148, 574–591. [Google Scholar] [CrossRef] [PubMed]
  6. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  7. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  8. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  9. Hou, Q.B.; Zhou, D.Q.; Feng, J.S. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 19–25 June 2021. [Google Scholar]
  10. Huang, Z.; Wang, X.; Wei, Y.; Huang, L.; Shi, H.; Liu, W.; Huang, T.S. CCNet: Criss-Cross Attention for Semantic Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  11. Zhu, K.; Wu, J.X. Residual Attention: A Simple but Effective Method for Multi-Label Recognition. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021. [Google Scholar]
  12. Park, J.; Woo, S.; Lee, J.-Y.; Kweon, I.S. BAM: Bottleneck Attention Module. In Proceedings of the British Machine Vision Conference (BMVC), Newcastle, UK, 3–6 September 2018. [Google Scholar]
Figure 1. Research framework.
Figure 2. Interpretation signs in GF-2 images.
Figure 3. The architecture of the CNN model.
Figure 4. The CNN architecture with attention mechanism module.
Figure 5. The confusion matrix heatmap of different models.
Figure 6. River extraction result of different models.
Table 1. Comparison of different models’ total params, FLOPs, and training time.

Model    | Total Params (M) | FLOPs (M) | Training Time (s)
CNN      | 0.8433           | 2.2959    | 125.6719
CNN-CCA  | 0.8433           | 2.4653    | 9119.9862
CNN-SE   | 0.8454           | 2.3095    | 145.2962
CNN-CBAM | 0.8476           | 2.3264    | 187.7041
CNN-ECA  | 0.8433           | 2.3057    | 140.9470
CNN-CA   | 0.8450           | 2.6048    | 271.4967
CNN-RA   | 0.8474           | 2.7068    | 171.2819
CNN-BAM  | 0.8447           | 2.3961    | 277.7017
Table 2. Six evaluation indicators of different models.

Model    | OA (%) | Kappa Coefficient | Precision (%) | Recall (%) | F1-Score (%) | MIoU (%)
CNN      | 80.47  | 0.6095            | 72.85         | 97.12      | 83.25        | 71.31
CNN-CCA  | 87.02  | 0.7405            | 81.02         | 96.68      | 88.16        | 78.83
CNN-SE   | 91.10  | 0.8220            | 87.33         | 96.14      | 91.52        | 84.37
CNN-CBAM | 92.67  | 0.8534            | 90.34         | 95.55      | 92.87        | 86.69
CNN-ECA  | 92.87  | 0.8574            | 90.83         | 95.36      | 93.04        | 86.99
CNN-CA   | 92.88  | 0.8577            | 90.66         | 95.61      | 93.07        | 87.04
CNN-RA   | 92.88  | 0.8576            | 90.64         | 95.63      | 93.07        | 87.03
CNN-BAM  | 93.09  | 0.8618            | 91.33         | 95.22      | 93.23        | 87.32