1. Introduction
In recent years, with the acceleration of urbanization processes around the world, the area occupied by cities and towns has expanded. This is causing a series of environmental problems, such as reductions in and degradation of natural habitats, losses in biodiversity, land subsidence and water pollution [
1]. According to the United Nations, the urbanization of developing countries has been particularly prominent in the past decade. Their data show that urban expansion is most obvious in Asia, especially China and India [
2]. Rapid urbanization has resulted in increases in the area of impervious surfaces. The United States Geological Survey (USGS) defines impervious surfaces as hard areas that do not allow water to seep into the ground [
3]. Specifically, they refer to any natural or man-made substance that can hinder water infiltration and, thus, affect the flood runoff, material precipitation and pollution profile. They include building roofs covered with waterproof materials, parking lots and sidewalks. In general, with ongoing global urbanization, the expansion of urban impervious surfaces is having very important impacts on the ecological balance, hydrological conditions and environment of urban areas [
4]. To monitor and evaluate sustainable urban development, the United Nations proposed 17 sustainable development goals (SDGs) in 2015. Among them, SDG11 refers to sustainable cities and communities; specifically, strategies to make cities and other human settlements inclusive, safe, resilient and sustainable. The accurate quantification of impervious surfaces is an important planning tool for urban land use development. Careful planning can mitigate the adverse effects of urban heat islands, water quality degradation and natural habitat loss caused by increases in impervious surface area [
5]. The automatic extraction of accurate real-time data on impervious surfaces is very important for urban planning and environmental and resource management [
6,
7]. Research on the automatic extraction of impervious surface data is of great significance to urban ecological construction and the monitoring of urban dynamics and for achieving sustainable development in urban and rural areas.
In the early days, impervious surfaces were mainly studied by manual survey and mapping. Although such methods are highly accurate, they are costly and have poor real-time performance. Compared with traditional surveying and mapping methods, satellite-based remote sensing technology is cheaper, very practical and provides wider coverage. With its rapid development, remote sensing technology has become widely used to obtain data on impervious surfaces and has become an important research method in sustainable urban development. The traditional method of extracting impervious surface data from remote sensing images is to analyse differences in the reflected spectral characteristics of different ground objects through spectral analysis and mixed pixel decomposition. However, data resolution and spectral interference from different ground objects limit the accuracy of impervious surface extraction. For instance, Deyong et al. applied a classification and regression tree (CART) to Landsat and night light data to effectively extract data on impervious surfaces [
8]. Yu et al. proposed the joint use of multi-source remote sensing data, including multispectral images, high-spatial-resolution images and airborne LIDAR data, to extract impervious surfaces [
9]. They made full use of visible light, near-infrared radiation, thermal infrared radiation, elevation and other features extracted from the multi-source remote sensing data to achieve a more accurate understanding of urban impervious surfaces. In general, in early research, the extraction of impervious surfaces was mostly based on simple machine learning algorithms. Nevertheless, the features and algorithms must be adjusted for use in different scenarios, applications or geographical areas [
10]. These approaches presented many problems, such as a low utilization rate of underlying features, heavy dependence on manual work, poor automation of the extraction process and poor overall accuracy.
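The mixed pixel decomposition mentioned above can be illustrated with a minimal sketch of linear spectral unmixing for two endmembers. The reflectance values, band order and the function name `unmix_two_endmembers` are hypothetical and chosen only for illustration; they are not taken from the cited studies.

```python
def unmix_two_endmembers(pixel, impervious, pervious):
    """Least-squares impervious fraction f for a linear two-endmember mix.

    Model: pixel ~ f * impervious + (1 - f) * pervious, with f clipped to [0, 1].
    """
    diff = [a - b for a, b in zip(impervious, pervious)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("endmembers must differ in at least one band")
    # Project (pixel - pervious) onto the line joining the two endmembers.
    numer = sum((p - b) * d for p, b, d in zip(pixel, pervious, diff))
    return min(1.0, max(0.0, numer / denom))


# Hypothetical reflectances in four bands (blue, green, red, near-infrared).
IMPERVIOUS = [0.20, 0.22, 0.25, 0.28]
VEGETATION = [0.04, 0.08, 0.05, 0.45]

# Synthetic mixed pixel built as 30% impervious and 70% vegetation.
mixed = [0.3 * a + 0.7 * b for a, b in zip(IMPERVIOUS, VEGETATION)]
fraction = unmix_two_endmembers(mixed, IMPERVIOUS, VEGETATION)
print(round(fraction, 3))
```

With real imagery, the endmember spectra vary between scenes and materials, which is one reason such hand-tuned approaches need adjusting for each new area.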
In recent years, deep learning has become a major focus of machine learning. It is characterized by its unique automatic feature learning ability and strong ability to represent and fit nonlinear functions. It can generate abstract high-level representations, attributes or features by processing and integrating low-level features [
9]. Owing to its great advantages over traditional machine learning algorithms, deep learning has been successfully applied to various computer vision tasks, such as image classification, instance segmentation and object detection. Convolutional neural networks (CNNs) have been gradually applied to remote sensing image processing because they can automatically mine the contextual representation of images and learn abstract image features [
11]. The fully convolutional network (FCN) extends image-level classification to the pixel level, greatly promoting the development of semantic segmentation networks [
12]. The Unet model, based on an encoder–decoder architecture, combines deconvolution with skip connections. Many studies have applied it to remote sensing image research and achieved good results [
13]. The Feature Pyramid Network (FPN) is a feature pyramid model that combines multi-level features to solve multi-scale problems. It fuses high- and low-level features to increase the expression ability of low-level features and improve network performance. This allows targets of different scales to be allocated to different layers for prediction, following a strategy of “divide and conquer” [
14]. The DeepLabv3 network architecture adds a module for multi-scale object segmentation and uses serial and parallel atrous (dilated) convolution modules. It uses a variety of atrous rates to obtain multi-scale context information, which improves the performance of multi-scale object segmentation [
15,
16]. LinkNet links an encoder and decoder to maintain the accuracy of a network model while substantially reducing the number of parameters [
17]. Through context aggregation based on different regions, the Pyramid Scene Parsing Network (PSPNet) allows the network model to make full use of context information and improve the network’s performance under scenarios with different resolutions [
18]. The DeepLabv3+ architecture adds a new decoding module to the DeepLabv3 architecture to reconstruct object boundaries more accurately for image segmentation [
19]. The Pixel Aggregation Network (PAN) architecture adds a bottom-up pyramid based on FPN to transfer the underlying features. This allows the model to combine semantic and positioning information to improve performance [
20]. The multi-scale attention network (MAnet) introduces a Position-wise Attention Block (PAB) and Multi-scale Fusion Attention Block (MFAB) to capture the channel dependencies between any feature maps by multi-scale semantic feature fusion, providing advancements in medical image segmentation [
21]. Compared with classic machine-learning methods, deep-learning methods have better performance in image segmentation [
22]. Several network models have been used to extract impervious surfaces. Bowen et al. used a deep convolutional neural network to extract data on impervious surfaces from Gaofen-2 (GF-2) satellite remote sensing images of Wuhan city [
23]. The efficiency and accuracy of the deep-learning methods were better than those of traditional machine-learning algorithms such as random forest and support vector machine. Parekh et al. used a Unet series to extract data on impervious surfaces from Landsat 8 remote sensing images and achieved good results [
3]. Based on the local attention mechanism model in a densely-connected FCN, Pang Bo et al. extracted data on impervious surfaces from GF-2 remote sensing images of Tianjin [
24]. Their method extracted the details of impervious surfaces from remote sensing images more completely than other methods. In addition, the research of Furkan et al. shows that, even if the sample annotation precision is less than 100%, using a deep neural network classifier to classify remote sensing images can still obtain superior classification results [
25]. Even though previous studies have demonstrated satisfactory performance in impervious surface data extraction based on DL networks, some limitations remain that need to be tackled [
26]. For instance, as the network hierarchy deepens, small details, such as those of impervious surfaces and edges, can be lost. In addition, due to incomplete imaging, such models may produce commission or omission errors for certain ground objects. To retain detailed information and extract more accurate impervious surface data, further exploration of network models is required. Such models must be able to extract multi-scale image features and be suitable for the extraction of impervious surface data from high-resolution remote sensing images [
27,
28].
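The atrous (dilated) convolution that DeepLabv3 relies on for multi-scale context, as described above, can be sketched in one dimension. This is a simplified illustration under stated assumptions, not the DeepLabv3 implementation: the function names, the signal and the rates are hypothetical and chosen only to show how the rate widens the receptive field without adding weights.

```python
def effective_kernel_size(kernel_size, rate):
    """Span covered by a kernel whose taps are spaced `rate` samples apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def atrous_conv1d(signal, kernel, rate):
    """1D atrous convolution without padding (cross-correlation, as in CNNs)."""
    span = effective_kernel_size(len(kernel), rate)
    output = []
    for start in range(len(signal) - span + 1):
        # Each kernel weight reads the input at a stride of `rate` samples.
        output.append(sum(w * signal[start + i * rate] for i, w in enumerate(kernel)))
    return output


signal = list(range(10))   # hypothetical input: 0, 1, ..., 9
kernel = [1, 1, 1]         # three weights regardless of the rate

dense = atrous_conv1d(signal, kernel, rate=1)    # ordinary convolution
dilated = atrous_conv1d(signal, kernel, rate=2)  # same weights, wider context

for rate in (1, 2, 4):
    print(rate, effective_kernel_size(len(kernel), rate))
```

Running several rates in parallel and fusing the outputs gives each pixel context at several scales for the same parameter budget per branch.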
To sum up, using a deep-learning method to extract impervious surface information can overcome the main shortcoming of traditional methods: the requirement for a large amount of prior knowledge. Its end-to-end learning method can optimise the model parameters, reduce the dependence on prior knowledge and human intervention, and provide more accurate extraction of impervious surfaces. The present study produced an impervious surface dataset for deep learning, which is based on high-resolution remote sensing images of Chengdu, a typical Chinese city. The data are analysed using a proposed model, the Small Attention Hybrid Unet (SAH-Unet). Compared with other classical semantic segmentation networks, SAH-Unet demonstrates better performance in extracting impervious surface data. This study thus proposes a new method for the automatic extraction of impervious surface information from high-resolution remote sensing images, which provides support for monitoring the sustainable development of cities.
4. Discussion
This paper proposes the SAH-Unet network model for extracting impervious surface data from high-resolution remote sensing images. The experimental results show that the design of the network structure is effective.
Table 5 and
Figure 11 show that the introduction of CBAM helps the model to extract impervious surface information more accurately, while the introduction of MFF enhances its ability to extract impervious surface details.
Table 1 and
Table 5 show that the introduction of depthwise separable convolution greatly reduces the number of model parameters while maintaining model performance. In the experiment testing the generalisation ability of the proposed model with high-resolution remote sensing images of Chengdu, good results in the extraction of impervious surface information were achieved.
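The parameter saving from depthwise separable convolution noted above follows from simple counting. The sketch below compares the two layer types for one hypothetical layer; the channel counts and kernel size are illustrative assumptions and do not describe the actual SAH-Unet configuration. Bias terms are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution: one k x k x c_in filter per output channel."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 64 input channels, 128 output channels, 3 x 3 kernel.
C_IN, C_OUT, K = 64, 128, 3
standard = standard_conv_params(C_IN, C_OUT, K)
separable = depthwise_separable_params(C_IN, C_OUT, K)
reduction = standard / separable

print(standard, separable, round(reduction, 2))
```

In general the ratio of separable to standard weights is 1/c_out + 1/k², so for 3 × 3 kernels and wide layers the saving approaches a factor of nine.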
Historically, impervious surface modeling has been based on statistical indices computed to accentuate impervious surfaces in satellite imagery; the use of deep-learning methods to extract impervious surfaces is still a frontier topic [
55]. SAH-Unet achieves the best results in terms of target edges and details and has certain advantages over the LinkNet, DeepLabV3+, PAN, Unet, MAnet, PSPNet and FPN frameworks.
Table 3 and
Table 4 show the accuracy results of each model on the training, validation and test sets: the overall extraction accuracy of SAH-Unet on the test set was 0.9159, while the MIOU, F-score, Recall and Precision were 0.8467, 0.9117, 0.9199 and 0.9042, respectively, the best of all models.
Figure 9 shows the visualisation results of impervious surface extraction.
Figure 10 uses high-resolution remote sensing images from different times and regions to test the generalisation ability of SAH-Unet. Here, too, SAH-Unet has clear advantages over the other models. The method also improves the extraction of fine details and shadowed areas, which are difficult cases in impervious surface extraction.
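For readers who want to relate the metrics reported above to pixel counts, the following sketch computes them for a binary impervious/non-impervious segmentation. It assumes the standard definitions and a two-class MIoU; the averaging actually used in the experiments may differ, and the confusion counts are hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard pixel-wise metrics from binary confusion counts.

    tp/fp/fn/tn count pixels, with 'impervious' as the positive class.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    iou_impervious = tp / (tp + fp + fn)
    iou_background = tn / (tn + fn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "miou": (iou_impervious + iou_background) / 2,  # mean over the two classes
    }


# Hypothetical confusion counts, not taken from the experiments.
metrics = binary_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because MIoU penalises both false positives and false negatives in every class, it is typically lower than the F-score for the same predictions.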
It is worth mentioning that both buildings and vegetation cast shadows. When large shadowed areas are used as input features, the incomplete information contained in the image greatly increases the difficulty of model recognition and leads to misclassification. In addition, because only a small amount of bare ground is exposed in the urban area, the model may not recognise it adequately.
In general, there is still much room for improvement in the extraction of impervious surface information from high-resolution remote sensing images. With the continuous development of network architectures and remote sensing technology, further progress can be expected in the extraction of data on urban impervious surfaces via deep learning.