In less developed countries in Asia and Africa, urban sprawl is usually accompanied by the emergence of unplanned settlements, such as slums, shantytowns, and urban villages [1]. Rapid urbanization, together with the housing demand of low-income city dwellers, has contributed to the emergence of unplanned settlements [3]. Such built-up areas are often packed with small buildings at high density. Although unplanned settlements do provide housing for low-income city dwellers, their existence contributes to unequal living conditions: inhabitants often lack equal access to services, face unsanitary living conditions, and are exposed to public safety problems [4]. Effective management, improvement, and reconstruction of unplanned settlements have become important policies in some cities [2]. Identifying and characterizing unplanned settlements is therefore essential for urban planners and policymakers evaluating urban reconstruction [1]. In China, the urban village is the most common type of unplanned urban settlement.
Traditionally, mapping unplanned buildings in urbanized areas has relied on field surveys by land management departments, in which building measurements are collected and digitized [1]. However, field surveys are costly and labor-intensive because urban villages are so widespread [8]. Remote sensing technologies are low-cost and cover large areas; they complement field surveys and help update base map data [1]. Manually digitizing building footprints from low-altitude aerial photos or Unmanned Aerial Vehicle (UAV) images is another conventional mapping method, but it is time-consuming. Therefore, a highly efficient, intelligent, image-based methodology is urgently needed to cope with these challenges.
Remote sensing-based mapping of urban buildings has a long history and a substantial literature [10]. With the increasing availability of high-resolution optical images, object-based image analysis (OBIA) has come to dominate the mapping of urban buildings [9]. Most current techniques for image target detection are based on spectral and spatial features [14]. In recent years, Random Forest (RF), a machine learning method, has become one of the most popular approaches for mapping built-up areas [15]. Although numerous classification methods have been developed for urban land use mapping, in high-density built-up areas the widely used pixel- or object-based methods usually do not extract individual buildings but instead cluster several buildings into one segment [17]. In a complex urban built-up area, features (spectral, texture, shape, etc.) are difficult to describe with conventional remote sensing methods, owing to the large variance among urban buildings [7]. Segmentation is the process of partitioning an image into segments by grouping neighboring pixels with similar feature values (brightness, texture, color, etc.) [13]. In high-density built-up areas, segmentation with OBIA is often impeded by difficulties such as scale selection and rule definition. It is challenging to delineate building boundaries completely and preserve building shapes because noise and texture along building edges usually degrade the performance of image segmentation [21]. Additionally, a generalized and straightforward methodology is hard to obtain and has not been reported. Therefore, developing a reliable and accurate building segmentation method for mapping unplanned urban settlements remains challenging.
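The merging problem described above, in which grouping neighboring pixels clusters several buildings into one segment, can be illustrated with a minimal sketch. It assumes a binary built-up mask and plain 4-connected labeling, which is a simplification and not the OBIA workflow used in the cited studies; the function name `label_segments` and the toy masks are hypothetical.

```python
from collections import deque


def label_segments(mask):
    """Label 4-connected foreground regions of a binary grid (list of lists).

    Returns the label grid and the number of segments found.
    """
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not labels[r][c]:
                # Start a new segment and flood-fill its neighbors.
                count += 1
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and mask[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


# Hypothetical toy masks: 1 = built-up pixel, 0 = background.
# Two buildings separated by a one-pixel alley.
separated = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
]
# Same scene, but one alley pixel is misclassified as built-up.
touching = [
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
]

_, n_separated = label_segments(separated)
_, n_touching = label_segments(touching)
print(n_separated, n_touching)
```

A single misclassified alley pixel is enough to fuse the two footprints, which is why narrow gaps between buildings in dense settlements are hard for grouping-based segmentation.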
On the other hand, several studies have demonstrated that high-density slums can be mapped from remote sensing images based on physical characteristics that distinguish them from formal settlements [1]. It is nevertheless crucial to recognize that slum mapping needs to go beyond delineating the whole area; characterizing individual buildings meets a wider range of demands. For instance, to assess the potential and benefits of urban reconstruction, both governmental and private decision-makers need information on reconstruction incentives, public services, and environmental improvements [22]. Such information can only be derived from individually extracted buildings if it is to address the needs of different stakeholders. Therefore, classifying and characterizing individual buildings is as important as delineating the settlements themselves.
Semantic segmentation is a classic computer vision problem in which every pixel of an image is assigned a class label, masking out regions of interest. In recent years, machine learning technologies, especially deep learning, have attracted the attention of the remote sensing community [14]. Deep convolutional neural networks (CNNs) have been applied to semantic segmentation [20]. One of the most recognized deep learning algorithms for image segmentation is the fully convolutional network introduced by Long et al. [28]. This architecture performs semantic segmentation end to end, pixel by pixel. It captures fine feature details and encodes them in the trained network so that they can be recognized. CNNs are therefore gaining attention owing to their capability to automatically discover relevant contextual features in image categorization [28]. CNN-based semantic segmentation has been applied to many pixel-wise image tasks, such as road extraction, building extraction, urban land use classification, maritime semantic labeling, vehicle extraction, and damage mapping [1].
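The pixel-by-pixel labeling step of semantic segmentation can be sketched as follows. This is only an illustration of how per-pixel class scores become a label mask; the scores here are made-up numbers rather than network output, and the class names and the helper `scores_to_mask` are hypothetical.

```python
def scores_to_mask(scores):
    """Turn per-pixel class scores into a label mask.

    scores[r][c] is a list with one score per class; each pixel is
    assigned the index of its highest-scoring class.
    """
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]


# Hypothetical two-class setup: index 0 = background, index 1 = building.
CLASSES = ("background", "building")

# Made-up scores for a 2 x 2 image (not produced by a trained network).
scores = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.4, 0.6], [0.7, 0.3]],
]

mask = scores_to_mask(scores)
for row in mask:
    print([CLASSES[label] for label in row])
```

In a real network the scores would come from the final layer for every pixel at once; the decision rule per pixel stays the same.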
Conventional image classification methods have encountered bottlenecks in high-density built-up areas. CNNs often outperform conventional image classification methods [1] and could potentially segment individual buildings in high-density unplanned settlements. Deep learning-based semantic segmentation uses a convolutional neural network to learn prior knowledge of features and extract objects from images. Applying CNNs to unplanned urban settlements has several advantages. First, extraction with a CNN is fast. The learning process does not require human intervention, and network training takes hours on a computer. Once training is complete, classifying each image takes only seconds [26]. Second, it is truly automatic. The intrinsic difference between deep learning methods and traditional visual recognition methods is that deep learning methods autonomously learn feature representations from a large amount of data, without much expertise or effort spent predefining features [7]. In a complex urban area, manually predefined features are unlikely to cover all land cover types. Learning features automatically from remote sensing imagery, rather than defining them manually, is therefore of great significance [26]. Moreover, deep learning for remote sensing images offers sample reusability and adaptability: training samples can be reused by the CNN, which significantly reduces the manual effort of hand-crafting features. More importantly, most conventional remote sensing methods have great difficulty segmenting the boundaries between high-density buildings. Deep learning-based semantic segmentation could potentially overcome this bottleneck [19].
In recent years, many CNNs with excellent performance have been reported. Compared with others, the U-net proposed by Ronneberger et al. [33] appears to be more popular and has been more quickly adopted and modified for remote sensing image segmentation. Earlier CNN approaches relied on a sliding-window setup that predicts the class label of each pixel from a local region (patch) around that pixel and required a substantial amount of training data. The U-net, by contrast, works with notably few training samples and yields precise semantic segmentation. The U-net has been adopted and highly rated by researchers [19].
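The sliding-window idea mentioned above, in which the class of a pixel is predicted from a local patch around it, can be sketched as a patch-extraction step. This is a minimal illustration and not the U-net itself: the helper names `reflect` and `extract_patch` are hypothetical, and mirroring the image at its borders is an assumption made here so that edge pixels also receive a full patch.

```python
def reflect(index, size):
    """Mirror an out-of-range index back into [0, size)."""
    if index < 0:
        return -index
    if index >= size:
        return 2 * (size - 1) - index
    return index


def extract_patch(image, row, col, half):
    """Return the (2*half+1) x (2*half+1) patch centered on (row, col).

    Pixels outside the image are filled by mirroring (assumes half < image size).
    """
    rows, cols = len(image), len(image[0])
    return [
        [image[reflect(r, rows)][reflect(c, cols)]
         for c in range(col - half, col + half + 1)]
        for r in range(row - half, row + half + 1)
    ]


# Hypothetical 3 x 3 single-band image.
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

center = extract_patch(image, 1, 1, 1)  # patch around the middle pixel
corner = extract_patch(image, 0, 0, 1)  # patch around the top-left pixel
print(center)
print(corner)
```

Running a classifier on one such patch per pixel repeats a great deal of computation across overlapping patches, which is one reason fully convolutional designs that label all pixels in one pass are preferred.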
Although remote sensing detection has benefited greatly from deep learning methods, whether they can separate adjacent buildings in an overcrowded urban village is not yet understood. Semantic segmentation is naturally related to classification problems, yet the feasibility, capability, and accuracy of the U-net convolutional neural network for classifying high-density urban village buildings have not been well understood either. The objective of this study was therefore the semantic segmentation and classification of high-density buildings in urban villages. In this article, a deep learning framework based on U-net is proposed for the semantic segmentation of high-resolution images. By implementing and applying the U-net-based CNN, the capability to map individual buildings in urban villages is demonstrated and its accuracy validated.