Article

A Geoscience-Aware Network (GASlumNet) Combining UNet and ConvNeXt for Slum Mapping

1 State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
2 College of Resources and Environment, University of Chinese Academy of Sciences, Beijing 100049, China
3 Key Laboratory for Geographical Process Analysis & Simulation of Hubei Province, Central China Normal University, Wuhan 430079, China
4 College of Urban and Environmental Sciences, Central China Normal University, Wuhan 430079, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(2), 260; https://doi.org/10.3390/rs16020260
Submission received: 29 November 2023 / Revised: 3 January 2024 / Accepted: 8 January 2024 / Published: 9 January 2024
(This article belongs to the Section AI Remote Sensing)

Abstract
Approximately 1 billion people worldwide currently inhabit slum areas. The UN Sustainable Development Goal (SDG 11.1) underscores the imperative of upgrading all slums by 2030 to ensure adequate housing for everyone. Geo-locations of slums help local governments with upgrading slums and alleviating urban poverty. Remote sensing (RS) technology, with its excellent Earth observation capabilities, can play an important role in slum mapping. Deep learning (DL)-based RS information extraction methods have attracted a lot of attention. Currently, DL-based slum mapping studies typically use three optical bands to adapt to existing models, neglecting essential geo-scientific information, such as spectral and textural characteristics, which are beneficial for slum mapping. Inspired by the geoscience-aware DL paradigm, we propose the Geoscience-Aware Network for slum mapping (GASlumNet), aiming to improve slum mapping accuracy by incorporating geoscientific prior knowledge into the DL model. GASlumNet employs a two-stream architecture, combining ConvNeXt and UNet. One stream concentrates on optical feature representation, while the other emphasizes geo-scientific features. Further, feature-level and decision-level fusion mechanisms are applied to optimize deep features and enhance model performance. We used Jilin-1 Spectrum 01 and Sentinel-2 images to perform experiments in Mumbai, India. The results demonstrate that GASlumNet achieves higher slum mapping accuracy than the comparison models, with an intersection over union (IoU) of 58.41%. Specifically, GASlumNet improves the IoU by 4.60~5.97% over the baseline models, i.e., UNet and ConvNeXt-UNet, which exclusively utilize optical bands. Furthermore, GASlumNet enhances the IoU by 10.97% compared to FuseNet, a model that combines optical bands and geo-scientific features. Our method presents a new technical solution to achieve accurate slum mapping, offering potential benefits for regional and global slum mapping and upgrading initiatives.


1. Introduction

In developing nations and regions, pronounced rural–urban migration poses a substantial challenge to urban development, which struggles to keep pace with population urbanization. The rapid influx of population impairs the capacity of cities to adequately cater to the evolving needs of living conditions, employment opportunities and public health management, eventually leading to urban poverty [1,2]. Urban slum development and growth, as well as an increase in the population of the poor, are evidence of urban poverty [3,4]. About one in eight of the global population currently lives in slum-like environments [5,6]. Access to accurate geospatial information on slums is critical to achieving SDG 11.1 [7]. Traditional field surveys and manual enumeration are time-consuming and labor-intensive; therefore, remote sensing (RS) is necessary for rapid and accurate slum mapping [8,9].
Figuring out “What is a slum?” is the first step in RS-based slum extraction and mapping. According to UN-Habitat [10], slums typically lack clean water sources, sanitation, sufficient living space, stability and security. However, this definition cannot be visualized on RS imagery directly. Thus, Kohli et al. [11] proposed the generic slum ontology (GSO), a methodology to establish a link between the appearance of slums and image features at three levels, i.e., the environment level, the neighborhood level and the object level. We synthesized the literature [12,13,14] to collate the GSO information in Table 1. Based on the GSO, real-world slum appearance characteristics can be converted into computer-understandable parameters, such as textural features, geometric features, morphological features and so on. An analysis of GSO-related studies shows that vegetation spectral indexes and textural characteristics are crucial for discriminating slums from non-slums [15]. Specifically, houses are tightly packed and disordered in slums, whereas houses are relatively sparsely packed and orderly in formal settlements [16]. In addition, there are fewer green spaces in slums, whereas there are adequate green spaces in formal settlements [17].
Earlier research designed image feature sets according to the GSO and used object-oriented analysis (OOA) and machine learning (ML) methods to map slums [18,19,20,21]. The OOA-based slum mapping methods are generally characterized by low levels of automation and limited transferability. This can be attributed to the following reasons: (1) OOA methods necessitate trial and error to ascertain suitable scale parameters, and (2) the formulation of classification rule sets is contingent upon expert knowledge [21]. ML methods can automatically differentiate between slum and non-slum areas by training models with input features and samples [15,22,23]. For example, Prabhu et al. [23] extracted textural features, wavelet frame transformation features and morphological profiles. They employed the minimum redundancy maximum relevance algorithm for feature selection. Subsequently, the selected features were input into a support vector machine (SVM) for slum mapping. The effectiveness of ML methods is substantially influenced by the discernibility of the input feature sets. This accentuates the importance of feature engineering and data preprocessing, as deficiencies in these areas can compromise the intelligence and performance of ML methods [24]. In addition, the similarity of spectral and textural features among different land cover types may compromise the effectiveness of purely GSO-based methods. Techniques endowed with robust capabilities in feature representation and information extraction are therefore imperative for achieving higher-accuracy slum mapping.
Deep learning (DL) techniques have gained popularity in recent years for RS image information analysis [25,26,27,28]. DL models automatically extract discriminative high-level semantic features without the need for complex feature engineering of the input data [25], outperforming previous techniques in the capacity to extract information. Studies [29,30,31,32,33] on slum mapping based on DL have also generated fine results. Wurm et al. [29] employed a fully convolutional neural network (FCN) to extract slums and obtained an overall accuracy (OA) of 90.64% on QuickBird images with a resolution of 2 m and an OA of 86.71% on Sentinel-2 images with a resolution of 10 m, demonstrating the advantage of DL for slum mapping. Verma et al. [30] verified the potential of medium-resolution imagery like Sentinel-2 for mapping slums with DL and transfer learning. They extracted slums from medium-resolution imagery (10 m) using a DL model pre-trained on high-resolution imagery (0.5 m) through transfer learning and obtained an IoU accuracy of 43.20% on the Sentinel-2 images. However, in contrast to earlier GSO-based methods, current DL-based slum extraction studies predominantly rely on three optical bands while disregarding other geo-scientific information associated with slums, such as the aforementioned spectral indices and textural features. Integrating such geo-scientific information as supplementary knowledge into optical band-based slum extraction is therefore worth exploring, and it is anticipated that incorporating geo-scientific information into DL models will further enhance the accuracy of slum extraction.
Ge et al. [34] proposed a geoscience-aware deep learning paradigm for RS, in which knowledge-driven approaches that provide geo-scientific prior knowledge and improve model interpretability complement data-driven approaches with excellent information extraction and feature representation capabilities. Some researchers have proposed dual-stream architectures [28,35] to synchronously learn information from visual features and other auxiliary features for better performance. Audebert et al. [36] used FuseNet [37] to conduct semantic segmentation on RS images, which is an example of embedding geoscientific knowledge into a DL model. In their work, they used the normalized difference vegetation index (NDVI) and the digital surface model (DSM) as the geo-scientific features, and their experimental results demonstrated that the DL models that included geo-scientific knowledge achieved an F1 score about 1% higher than the DL model that did not. Similarly, He et al. [38] proposed an intuition-inspired network with two streams to mine information from multispectral imagery and DSM data. Xiong et al. [39] used a geometry-aware semantic segmentation network incorporating RGB bands and normalized DSM data to achieve accurate land cover classification. While these studies have provided valuable insights into RS image information extraction, the effectiveness of DL models incorporating geo-scientific knowledge for slum mapping tasks remains unclear, presenting an opportunity for further exploration.
Therefore, the objective of this study is to enhance the performance of a slum extraction model by incorporating slum-related geo-scientific prior knowledge into DL models. We propose a two-stream DL model, named GASlumNet, which bridges UNet and ConvNeXt. One stream is dedicated to learning deep features from the optical bands, while the other stream focuses on learning deep features from geo-scientific information. GASlumNet is compared with existing models, i.e., UNet [40], ConvNeXt-UNet [40,41] and FuseNet [36,37], in the study area of Mumbai, India. The model performance is evaluated using four indices: precision, recall, overall accuracy (OA) and intersection over union (IoU).

2. Study Area

Urban poverty is widespread in India; in 2020, 49.0% of India’s urban population resided in slums. Our study area is situated in Mumbai (Figure 1a), the largest harbor city of India and the state capital of Maharashtra. The vegetation areas are concentrated in the north-western, northern and eastern parts of the city, while built-up areas are densely spread in the other parts (Figure 1c). The total population of Mumbai has reached 20.4 million, with a population density of approximately 229 persons per hectare. Mumbai is also one of the most slum-populated cities. About 55% of the urban population in Mumbai now resides in slums, with Dharavi (Figure 1d), one of the city’s largest slum communities, covering 2.39 km2 and being home to approximately 1 million poor people. The Maharashtra government established the Slum Rehabilitation Authority (SRA) in 1995 and launched a program to improve slum conditions in response to the urgent slum issue. The SRA compiled a map of the slums located in Mumbai in 2015–2016 (https://sra.gov.in/ (accessed on 27 November 2023)) (Figure 1b).

3. Methods

3.1. Preparing Slum Dataset

A slum dataset of Mumbai was built to facilitate the model training and validation in this study. The dataset comprises ground truth data of the slums, optical bands and spectral and textural features. Specifically, the optical bands were initially extracted from high-resolution RS imagery, while the spectral and textural features, enriched with geo-scientific knowledge related to slums, were calculated based on medium-resolution multispectral RS imagery.
The red (R), the green (G) and the blue (B) bands, from Jilin-1 Spectrum-01 data of 5-m spatial resolution, were used as the optical bands, of which the wavelengths are 630–680 nm, 525–600 nm and 450–515 nm, respectively. The Jilin-1 Spectrum-01 Level-1 radiometric-corrected images used in this study were acquired on 20 January 2020. Two images covered our study area, i.e., ‘JL1GP01_PMS1_20200120151147_200020875_102_0013_001_L1’ and ‘JL1GP01_PMS1_20200120151147_200020875_102_0014_001_L1’. Orthorectification was performed based on the rational polynomial coefficients (RPC) provided by the image sources. Further, image stitching and layer stacking were conducted to produce the RGB image. Figure 2a shows the RGB synthesis of the Jilin-1 Spectrum-01 image. The projection of the image is WGS 84/UTM zone 43N (EPSG: 32643).
The spectral and textural features used in this study were calculated from Sentinel-2 imagery. We collected the Sentinel-2 Level-1C orthorectified top-of-atmosphere reflectance images of Mumbai in 2016 using the JavaScript API on the Google Earth Engine (GEE) platform, i.e., the ‘COPERNICUS/S2_HARMONIZED’ collection in the GEE data catalogue. Initially, we filtered the images with cloud coverage percentages less than 20%. Subsequently, cloud masking was conducted based on the QA60 band to generate a cloud-free image collection. Ultimately, a composite Sentinel-2 image was created using the median values of the filtered cloud-free collection. From the composite Sentinel-2 image, we computed the spectral features, namely the normalized difference vegetation index (NDVI), the normalized difference soil–vegetation index (NDSGI) [42] and the normalized difference built-up index (NDBI) [43]. NDVI and NDSGI can aid in distinguishing slums, which are characterized by fewer vegetated areas, from areas with tree cover, grassland and the parts of formal settlements that exhibit more green space. NDBI may assist in separating slums from soil backgrounds, parts of the non-slum built-up areas and water areas. The formulations of the three spectral indexes are expressed in Equations (1)–(3).
$$\mathrm{NDVI} = \frac{\rho_{\mathrm{NIR}} - \rho_{\mathrm{R}}}{\rho_{\mathrm{NIR}} + \rho_{\mathrm{R}}} \tag{1}$$
$$\mathrm{NDSGI} = \frac{\rho_{\mathrm{G}} - \rho_{\mathrm{R}}}{\rho_{\mathrm{G}} + \rho_{\mathrm{R}}} \tag{2}$$
$$\mathrm{NDBI} = \frac{\rho_{\mathrm{SWIR}} - \rho_{\mathrm{NIR}}}{\rho_{\mathrm{SWIR}} + \rho_{\mathrm{NIR}}} \tag{3}$$
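As an illustration, the following sketch reproduces this compositing and index computation with the Earth Engine Python API (the study used the JavaScript API); the Mumbai rectangle, the calendar-year date range and the band choices (B8 = NIR, B4 = red, B3 = green, B11 = SWIR) are our assumptions rather than parameters reported by the authors.

```python
import ee

ee.Initialize()

# Assumed area of interest for Mumbai; the exact geometry is not reported in the paper.
mumbai = ee.Geometry.Rectangle([72.75, 18.85, 73.05, 19.30])

def mask_s2_clouds(image):
    """Mask opaque clouds (bit 10) and cirrus (bit 11) using the QA60 band."""
    qa = image.select('QA60')
    clear = qa.bitwiseAnd(1 << 10).eq(0).And(qa.bitwiseAnd(1 << 11).eq(0))
    return image.updateMask(clear)

composite = (
    ee.ImageCollection('COPERNICUS/S2_HARMONIZED')
    .filterBounds(mumbai)
    .filterDate('2016-01-01', '2017-01-01')
    .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 20))
    .map(mask_s2_clouds)
    .median()
)

# Equations (1)-(3); band choices assume B8 = NIR, B4 = red, B3 = green, B11 = SWIR.
ndvi = composite.normalizedDifference(['B8', 'B4']).rename('NDVI')
ndsgi = composite.normalizedDifference(['B3', 'B4']).rename('NDSGI')
ndbi = composite.normalizedDifference(['B11', 'B8']).rename('NDBI')
indices = ee.Image.cat([ndvi, ndsgi, ndbi])
```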
Referring to previous studies [44], the short-wave infrared (SWIR) band plays an important role in identifying slums from backgrounds, while the near-infrared (NIR) band and the red (R) band are indispensable for distinguishing slums from vegetated areas. Thus, we calculated the textural features based on the SWIR, NIR and R bands via a gray-level co-occurrence matrix (GLCM) [45]. As for the selection of GLCM coefficients, variance and contrast [13,46] were often used in existing studies to extract slums, considering that slums have lower GLCM variance and contrast values, while non-slum built-up areas have relatively higher values because of their complex scenes. Nevertheless, abrupt changes at the boundaries of slums may introduce significant variance and contrast, potentially resulting in the misclassification of these boundary pixels. In this study, GLCM mean values, expressed as GLCM_SWIR mean, GLCM_NIR mean and GLCM_R mean, were used as the textural coefficients. Intuitively, pixels in slum areas may exhibit higher GLCM_SWIR mean values, lower GLCM_NIR mean values and relatively lower GLCM_R mean values compared to non-slum areas. This is attributed to the presence of limited vegetated areas and the prevalence of dense materials, such as metal or shingle roofs [47], in slums. The window size for the GLCM is set as 5 × 5. All the geo-scientific indexes calculated from the Sentinel-2 image were re-sampled to 5 m resolution via bilinear interpolation and reprojected to WGS 84/UTM zone 43N (EPSG: 32643) (Figure 2b,c).
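The GLCM mean can be approximated per pixel as sketched below; the 32-level quantization, the single offset (distance 1, angle 0) and the row-marginal definition of the GLCM mean are our assumptions, and the nested loops are written for clarity rather than speed.

```python
import numpy as np
from skimage.feature import graycomatrix
from skimage.util import view_as_windows

def glcm_mean(band, window=5, levels=32):
    """Per-pixel GLCM mean in a sliding window, taken here as the mean of the
    row marginal of the normalized co-occurrence matrix: sum_i i * p(i)."""
    # Quantize reflectance values to a small number of gray levels.
    edges = np.linspace(band.min(), band.max(), levels)
    quantized = (np.digitize(band, edges) - 1).astype(np.uint8)
    pad = window // 2
    padded = np.pad(quantized, pad, mode='reflect')
    windows = view_as_windows(padded, (window, window))   # shape (H, W, window, window)
    out = np.zeros(band.shape, dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):                      # slow but explicit
            p = graycomatrix(windows[i, j], distances=[1], angles=[0],
                             levels=levels, symmetric=True, normed=True)[:, :, 0, 0]
            out[i, j] = np.sum(np.arange(levels) * p.sum(axis=1))
    return out

# Hypothetical usage on bands already resampled to the 5 m grid:
# glcm_swir_mean, glcm_nir_mean, glcm_r_mean = (glcm_mean(b) for b in (swir, nir, red))
```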
The slum map in portable document format released by SRA was used as the ground truth data in this study. We transferred the PDF data into shapefile format and performed geo-referencing. The shapefile data were then transformed into a geo-tiff format with a 5 m spatial resolution and the projection of WGS 84/UTM zone 43N (EPSG: 32643) (Figure 2d).
The RGB image, spectral indexes and textural features were divided into non-overlapping image patches, each with a size of 64 × 64 pixels. The ground truth data of the slums were used as the labels for model training and validation and were also divided into patches with a size of 64 × 64 pixels. The label patches are binary, i.e., 0 indicates non-slum and 1 indicates slum.
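A minimal sketch of this tiling step, assuming the inputs have already been co-registered on the 5 m grid; the variable names and the 9-channel stacking are hypothetical.

```python
import numpy as np

def to_patches(array, patch=64):
    """Split an (H, W) or (H, W, C) array into non-overlapping patch x patch tiles,
    discarding incomplete tiles at the right and bottom edges."""
    rows, cols = array.shape[0] // patch, array.shape[1] // patch
    return np.stack([
        array[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
        for r in range(rows) for c in range(cols)
    ])

# Hypothetical usage: stack RGB, the three spectral indexes and the three GLCM means,
# then tile the 9-channel image and the binary slum labels identically.
# x_patches = to_patches(np.dstack([rgb, spectral, textural]))   # (N, 64, 64, 9)
# y_patches = to_patches(labels)                                 # (N, 64, 64), 0 = non-slum, 1 = slum
```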

3.2. Architecture of GASlumNet

The architecture of the proposed GASlumNet is presented in Figure 3. The proposed GASlumNet has two streams, of which one stream (UNet stream) takes the RGB bands as input and the other stream (ConvNeXt stream) takes the spectral and textural features as the input. GASlumNet incorporates ConvNeXt and UNet. Specifically, the UNet encoder learns deep features from the optical bands, while the ConvNeXt learns deep features from the geo-scientific information. A feature-level fusion mechanism is performed to fuse and optimize deep features from the two encoders hierarchically. Then, the UNet decoder generates Output1, and a modified decoder is stacked to the ConvNeXt to generate Output2. Additionally, we introduced a decision-level fusion mechanism to generate the final output by combining Output1 and Output2. The remainder of this section expands on GASlumNet in more detail.

3.2.1. UNet Stream

We used the original UNet architecture proposed by [40] as the UNet stream of GASlumNet. UNet [40] has an encoder (the contracting path) and a decoder (the expansive path). The UNet encoder has five down-sampling stages, of which the last down-sampling stage is used to bridge the encoder and the decoder. Each down-sampling stage consists of two convolution-batch normalization-rectified linear unit (CBR) blocks, and a maximum pooling layer connects every two down-sampling stages. As the encoder goes deeper, the size of the feature maps is halved, and the number of feature channels is doubled. The UNet decoder has four up-sampling stages, of which each has one transposed convolutional layer, two CBR blocks and a skip connection (concatenation). During decoding, feature maps generated by the encoder are fused with the feature maps generated by the decoder at the corresponding level via the skip connection. As the decoder goes deeper, the size of the feature maps is doubled, and the number of feature channels is halved. At last, a 1 × 1 convolutional layer is used to generate the pixel-wise prediction map.
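The following PyTorch sketch illustrates the CBR block and the down-/up-sampling stages described above; it follows the textual description rather than the authors' released code, and channel sizes are left as constructor arguments.

```python
import torch
import torch.nn as nn

class CBR(nn.Module):
    """Convolution-BatchNorm-ReLU block, used twice in every UNet stage."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class DownStage(nn.Module):
    """One UNet down-sampling stage: two CBR blocks; max pooling halves the feature map."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(CBR(in_ch, out_ch), CBR(out_ch, out_ch))
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        feat = self.convs(x)            # kept for the skip connection
        return feat, self.pool(feat)

class UpStage(nn.Module):
    """One UNet up-sampling stage: transposed convolution, skip concatenation, two CBR blocks."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.convs = nn.Sequential(CBR(in_ch, out_ch), CBR(out_ch, out_ch))

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)  # out_ch + out_ch channels = in_ch
        return self.convs(x)
```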

3.2.2. ConvNeXt Stream

The ConvNeXt stream of GASlumNet contains the original ConvNeXt [41] and a modified decoder. ConvNeXt [41] was initially proposed as a pure CNN model with the aim of outperforming the vision transformer [48] by upgrading the traditional ResNet [49] towards the architecture of the Swin Transformer [50]. The original ConvNeXt has one stem block and four down-sampling stages. The stem block consists of a convolutional layer and a layer normalization. Each down-sampling stage has several ConvNeXt blocks. As presented in Figure 4, a ConvNeXt block adopts a depth-wise convolutional layer with a large kernel size, layer normalization, the Gaussian error linear unit (GeLU) and two point-wise convolutional layers. A down-sampling layer consisting of a layer normalization and a 2 × 2 convolutional layer connects every two down-sampling stages. We used a modified UNet decoder as the decoder of the ConvNeXt stream by replacing the CBR blocks with convolution-batch normalization-GeLU (CBG) blocks. Similar to the UNet stream, the ConvNeXt stream utilizes skip connections in the decoder and uses a 1 × 1 convolutional layer to obtain pixel-wise results.
Table 2 and Table 3 depict the structure of the encoders and decoders in our GASlumNet. In the tables, ks denotes the kernel size; s and p denote stride and padding; h, w and c denote the output height, width and number of channels; DS means the down-sampling stage; Maxp means the max pooling layer; Dlayer means the down-sample layer in the ConvNeXt; US means the up-sampling stage; and H and W are the height and width of the input image. It should be noted that the kernel size and stride of the stem block are 2 × 2 and 2 in our GASlumNet, whereas they are 4 × 4 and 4 in the original ConvNeXt proposed in [41]. We made this change to align the feature map sizes for feature fusion between the UNet encoder and the ConvNeXt encoder. The ConvNeXt stream has larger receptive fields than the UNet stream because of the large-kernel depth-wise convolutional layers used in the ConvNeXt; thus, the ConvNeXt stream can capture global semantic features of slums from the medium-resolution geo-scientific information [41], while the UNet stream learns local slum semantics from the high-resolution optical bands [40].
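A simplified PyTorch sketch of a ConvNeXt block and of the modified stem is given below; the layer-scale and stochastic-depth components of the original ConvNeXt are omitted, GroupNorm is used as a stand-in for a channels-first LayerNorm, and the six input channels (three spectral plus three textural features) are an assumption.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """A ConvNeXt block [41]: 7x7 depth-wise convolution, LayerNorm,
    point-wise expansion with GELU, point-wise projection, residual connection."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)             # applied over the channel dimension
        self.pwconv1 = nn.Linear(dim, 4 * dim)    # 1x1 convolution expressed as a linear layer
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                         # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                 # (N, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)
        return shortcut + x

# Modified stem used in GASlumNet: a 2x2 kernel with stride 2 (instead of 4x4/4 in the
# original ConvNeXt), so the ConvNeXt feature maps align with the UNet encoder stages.
stem = nn.Sequential(
    nn.Conv2d(6, 128, kernel_size=2, stride=2),   # 6 input channels are an assumption
    nn.GroupNorm(1, 128),                         # stand-in for a channels-first LayerNorm
)
```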

3.2.3. Feature-Level Fusion with Multi-Scale Attention

The feature-level fusion mechanism is introduced into GASlumNet to fuse the deep features extracted from the optical bands and the geo-scientific information and thereby enhance the model performance. We first performed channel-wise addition on the deep feature maps from the UNet encoder and the ConvNeXt encoder, and then exploited the multi-scale channel attention mechanism (MSCAM) [51] to optimize the fused features. As illustrated in Figure 5, the MSCAM uses a global adaptive average pooling layer to obtain global information and utilizes point-wise convolutional layers to aggregate global and local context. A bottleneck mechanism combines the aggregated multi-scale information and the input features.
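A sketch of the MSCAM-based fusion is shown below, following the description of [51]; the channel-reduction ratio of 4 is an assumption, as the paper does not report it.

```python
import torch
import torch.nn as nn

class MSCAM(nn.Module):
    """Multi-scale channel attention module [51]: a global branch (adaptive average
    pooling + point-wise bottleneck convolutions) and a local branch (point-wise
    bottleneck convolutions) are summed, passed through a sigmoid and used to
    re-weight the input features."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, kernel_size=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1), nn.BatchNorm2d(channels),
        )
        self.local_branch = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1), nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        attn = torch.sigmoid(self.global_branch(x) + self.local_branch(x))
        return x * attn

def fuse(unet_feat, convnext_feat, mscam):
    """Feature-level fusion at one encoder level: channel-wise addition followed by MSCAM."""
    return mscam(unet_feat + convnext_feat)
```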

3.2.4. Decision-Level Fusion and Joint Loss Function

As for the decision fusion mechanism, we used a hyperparameter $\gamma$ to balance the weights between the outputs of the UNet stream and the ConvNeXt stream. During the model training phase, the hyperparameter is used to compute the joint loss $L_J$, defined as:
$$L_J = \gamma L_O + (1 - \gamma) L_G \tag{4}$$
where $L_O$ denotes the loss computed for the UNet stream, which learns deep features from the optical bands, and $L_G$ denotes the loss computed for the ConvNeXt stream, which learns deep features from the geo-scientific information. Thus, the parameters of the streams in GASlumNet are synchronously optimized by minimizing the joint loss. Specifically, $L_O$ and $L_G$ are computed in the same way as in [35], combining the Dice loss $L_{\mathrm{dice}}$ [52] and the weighted cross entropy (WCE) loss $L_{\mathrm{wce}}$ [53] to alleviate the problem of class imbalance:
$$L_O = \delta L_{\mathrm{wce}} + (1 - \delta) L_{\mathrm{dice}} \tag{5}$$
$$L_G = \delta L_{\mathrm{wce}} + (1 - \delta) L_{\mathrm{dice}} \tag{6}$$
where $\delta$, set to 0.4 in this study, balances the two losses.
During the inference phase, the balance parameter $\gamma$ is used to combine the prediction maps generated by the two streams as:
$$\hat{y}^{i,j} = \mathrm{softmax}\left(\gamma\,\hat{y}_O^{i,j} + (1 - \gamma)\,\hat{y}_G^{i,j}\right) \tag{7}$$
where $(i, j)$ indicates the location of a pixel in the input RS image; $\hat{y}_O^{i,j}$ and $\hat{y}_G^{i,j}$ are the logits predicted at pixel $(i, j)$ by the UNet stream and the ConvNeXt stream, respectively; $\hat{y}^{i,j}$ is the final predicted label at pixel $(i, j)$; and $\mathrm{softmax}(\cdot)$ is the softmax function used to generate binary classification labels. The decision fusion helps eliminate errors generated by a single stream and improves robustness.
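The losses and the decision fusion can be sketched as follows; the class weights of the WCE loss and the default $\gamma$ value are illustrative assumptions (the paper reports $\delta$ = 0.4, and its ablation found the best IoU around $\gamma$ = 0.7).

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss on the slum (positive-class) probability."""
    prob = torch.softmax(logits, dim=1)[:, 1]          # (N, H, W)
    inter = (prob * target).sum(dim=(1, 2))
    union = prob.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def stream_loss(logits, target, class_weights, delta=0.4):
    """Equations (5)-(6): weighted cross entropy combined with the Dice loss."""
    wce = F.cross_entropy(logits, target, weight=class_weights)
    return delta * wce + (1 - delta) * dice_loss(logits, target.float())

def joint_loss(logits_o, logits_g, target, class_weights, gamma=0.7):
    """Equation (4): gamma balances the optical (UNet) and geo-scientific (ConvNeXt) streams."""
    return (gamma * stream_loss(logits_o, target, class_weights)
            + (1 - gamma) * stream_loss(logits_g, target, class_weights))

def fuse_predictions(logits_o, logits_g, gamma=0.7):
    """Equation (7): decision-level fusion of the two streams at inference time."""
    fused = gamma * logits_o + (1 - gamma) * logits_g
    return torch.softmax(fused, dim=1).argmax(dim=1)   # 0 = non-slum, 1 = slum

# Hypothetical shapes: two-class logits of size (N, 2, 64, 64), integer labels of size (N, 64, 64).
# weights = torch.tensor([0.3, 0.7])   # class weights are an assumption, not reported in the paper
```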

3.3. Evaluation Metrics

To assess the model performance, we calculated the precision, the recall, the overall accuracy (OA) and IoU based on the classification results. We defined the slum as the positive class and the non-slum as the negative class. Then, the evaluation metrics selected in this study were calculated as follows:
$$\mathrm{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \times 100 \tag{8}$$
$$\mathrm{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \times 100 \tag{9}$$
$$\mathrm{OA} = \frac{\text{true positives} + \text{true negatives}}{\text{true positives} + \text{true negatives} + \text{false positives} + \text{false negatives}} \times 100 \tag{10}$$
$$\mathrm{IoU} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives} + \text{false positives}} \times 100 \tag{11}$$
Specifically, the true positives denote the number of pixels that are classified as slums correctly, the true negatives denote the number of pixels that are classified as non-slums correctly, the false positives denote the number of pixels that are misclassified as slums and the false negatives denote the number of pixels that are misclassified as non-slums. All the metrics calculated range from 0 to 100%. The precision can depict the fraction of correctly identified slums from all the predicted slums. The recall can depict the fraction of correctly identified slums from all the slum ground truth. Higher precision denotes fewer classification errors, and higher recall denotes fewer classification omissions. OA is an overall metric to evaluate performance for separating slums from non-slums, and IoU depicts the similarity between ground truth slums and predicted slums.
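For reference, a direct NumPy transcription of Equations (8)–(11):

```python
import numpy as np

def slum_metrics(pred, truth):
    """Precision, recall, OA and IoU in percent, with slum = 1 as the positive class."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)          # pixels correctly classified as slums
    tn = np.sum(~pred & ~truth)        # pixels correctly classified as non-slums
    fp = np.sum(pred & ~truth)         # pixels misclassified as slums
    fn = np.sum(~pred & truth)         # pixels misclassified as non-slums
    precision = 100 * tp / (tp + fp)
    recall = 100 * tp / (tp + fn)
    oa = 100 * (tp + tn) / (tp + tn + fp + fn)
    iou = 100 * tp / (tp + fn + fp)
    return precision, recall, oa, iou
```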

4. Experiment and Results

4.1. Experimental Settings

As described in Section 3.1, the Jilin-1 RGB image and Sentinel-2 geo-scientific features are divided into patches with a size of 64 × 64 pixels. The total number of patches is 3620, of which 2892 image patches are used for model training and 728 patches for testing. Table 4 presents the experimental settings in this study.
The proposed GASlumNet was formulated based on the architectural principles of UNet and ConvNeXt. Consequently, we designated UNet and ConvNeXt-UNet as our baseline models. GASlumNet and FuseNet both adopt a two-stream architecture, enabling the integration of RGB bands and geo-scientific auxiliary features; hence, we chose FuseNet as the comparative model for assessing GASlumNet’s performance. For fair comparisons, all the models in our experiments were trained from scratch; we did not use model fine-tuning, and thus no pre-trained backbones, such as ConvNeXt-base [41] or VGG16 [54], were used.

4.2. Comparisons of Slum Mapping among Different Methods

Table 5 presents the quantitative results on the testing datasets among different models. Among the comparative models, UNet and ConvNeXt-UNet used RGB images to identify slums, while FuseNet and our GASlumNet used RGB images and geo-scientific features. Higher OA and IoU values indicate that the detected slums are more similar to the ground truth slums. GASlumNet achieved the highest OA and IoU values, showing the best performance on slum classification among these models. Specifically, GASlumNet outperformed the baseline models (UNet, ConvNeXt-UNet) with OA increases of 1.06~1.35% and IoU increases of 4.60~5.97%. GASlumNet outperformed the contrastive model (FuseNet) with an OA increase of 2.25% and an IoU increase of 10.97%.
The precision and recall values can indirectly reflect the classification errors and omissions of the models. GASlumNet obtained the highest precision and recall values, i.e., 72.82% and 74.69%, respectively, indicating that GASlumNet generated fewer omissions and errors than the other models.
Figure 6 shows the slum detection results of Mumbai generated by the baseline models (UNet and ConvNeXt-UNet), the contrastive model (FuseNet) and our GASlumNet. Generally, our GASlumNet obtained more accurate and concise slum boundaries than the other models in Figure 6, although all the models generated false positives and false negatives in varying degrees.

4.3. Patch-Based Accuracy Assessment among Different Methods

We further quantitatively compared the mapping results according to the sizes of the slum patches. Slums are divided into three groups: large slum patches (≥25 ha), medium slum patches (≥5 ha and <25 ha) and small pockets (<5 ha). Table 6 presents the recall values of slums of various sizes detected by different models. UNet, ConvNeXt-UNet and our GASlumNet all exhibited promising performance in identifying slums of a medium or large size, with recall values near or over 80%. However, all the models had varying degrees of difficulty in identifying slums smaller than 5 ha; only GASlumNet identified about half of the small slums.
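One plausible way to reproduce this size-stratified assessment is sketched below: ground-truth slum patches are labelled as connected components, assigned to a size class using the 5 m pixel size (0.0025 ha per pixel), and the pixel recall is computed per class. Whether the authors computed pixel-based or object-based recall within each class is not stated, so this reading is an assumption.

```python
import numpy as np
from scipy import ndimage

def recall_by_size(truth, pred, pixel_size_m=5.0):
    """Group ground-truth slum patches by area (small < 5 ha, medium 5-25 ha,
    large >= 25 ha) and report the pixel recall within each size class."""
    pixel_area_ha = pixel_size_m ** 2 / 10_000            # a 5 m pixel covers 0.0025 ha
    labels, n_patches = ndimage.label(truth.astype(bool))
    counts = {'small': [0, 0], 'medium': [0, 0], 'large': [0, 0]}   # [detected pixels, total pixels]
    pred = pred.astype(bool)
    for k in range(1, n_patches + 1):
        patch = labels == k
        area_ha = patch.sum() * pixel_area_ha
        group = 'large' if area_ha >= 25 else ('medium' if area_ha >= 5 else 'small')
        counts[group][0] += int(np.sum(patch & pred))
        counts[group][1] += int(patch.sum())
    return {g: 100 * tp / total for g, (tp, total) in counts.items() if total > 0}
```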
According to the qualitative comparisons in Figure 7, it can be found that when detecting the large slum patches, all the models can generate fine results (Figure 7A). When detecting medium-sized and small-sized slums, our GASlumNet generated fewer omissions, for instance, as the black dashed ellipses indicate in Figure 7B,D. Our GASlumNet also generated fewer errors, as the black dashed rectangles highlight in Figure 7C,E.

4.4. Land Cover Types of False Positives and False Negatives Generated by Different Methods

To find out which land cover types are easily misclassified as slums, we calculated the area of the ESA land cover types in the false positives and false negatives (Table 7). Land cover types including bare/sparse vegetation and built-up are the most easily confused with slums. Overall, GASlumNet generated fewer omissions and errors when separating slums from these two land cover types.
Figure 8 presents examples of slum classification results under different scenes. It can be found that tree cover pixels and non-slum built-up pixels at the boundaries of slum patches are easily misclassified (the first and fourth columns of Figure 8). ConvNeXt-UNet and FuseNet generated severe omissions when detecting slums in the scene of bare/sparse vegetation areas (the third column of Figure 8), while UNet and ConvNeXt-UNet misclassified a lot of non-slum built-up pixels as slums (the fifth column of Figure 8). GASlumNet performed better in handling these scenarios, which are prone to omissions and errors in classifications.

4.5. Results under Different Ancillary Geo-Scientific Features

Table 8 presents slum mapping accuracies generated by different models utilizing various input features. The results indicate the following:
(1) Overall, GASlumNet consistently achieved the highest slum mapping accuracies across different input feature combinations. Specifically, when utilizing RGB, spectral and textural features, GASlumNet improved the IoU by 2.52%, 3.09% and 10.97% and the OA by 0.35%, 0.29% and 2.25% compared to UNet, ConvNeXt-UNet and FuseNet, respectively. GASlumNet also attained the highest IoU values among all the models.
(2) The incorporation of ancillary geographic features into the models positively impacted performance. With the exception of FuseNet, the models in Table 8 that simultaneously used multiple input features outperformed those using only RGB bands, spectral features or textural features. For instance, when the RGB bands were concatenated with ancillary geographic features (spectral or textural) and fed into the model, UNet and ConvNeXt-UNet achieved higher accuracies than when using only RGB bands.
(3) In comparison to FuseNet, which also employed a dual-stream architecture and multiple input features, GASlumNet consistently exhibited significantly higher precision, recall, OA and IoU values, underscoring the superior effectiveness of GASlumNet over FuseNet.
Table 8. Quantitative results among different models with different input features. For GASlumNet, the input features are listed per stream (UNet/RGB stream vs. ConvNeXt/auxiliary stream).

Models | Input Features | Precision (%) | Recall (%) | OA (%) | IoU (%)
UNet | RGB | 68.54 | 71.47 | 91.99 | 53.81
UNet | Spectral | 70.84 | 62.21 | 91.72 | 49.53
UNet | Textural | 69.36 | 60.74 | 91.37 | 47.89
UNet | RGB, spectral | 76.87 | 67.29 | 93.08 | 55.96
UNet | RGB, textural | 71.49 | 72.02 | 92.59 | 55.95
UNet | RGB, spectral, textural | 72.61 | 70.82 | 92.70 | 55.89
ConvNeXt-UNet | RGB | 67.59 | 70.05 | 91.70 | 52.44
ConvNeXt-UNet | Spectral | 68.12 | 63.52 | 91.35 | 48.96
ConvNeXt-UNet | Textural | 66.53 | 62.27 | 90.98 | 47.42
ConvNeXt-UNet | RGB, spectral | 66.80 | 75.76 | 91.92 | 55.04
ConvNeXt-UNet | RGB, textural | 73.10 | 69.79 | 92.70 | 55.53
ConvNeXt-UNet | RGB, spectral, textural | 74.03 | 68.64 | 92.76 | 55.32
FuseNet | RGB, spectral | 64.14 | 58.76 | 90.32 | 44.23
FuseNet | RGB, textural | 65.02 | 67.80 | 91.03 | 49.68
FuseNet | RGB, spectral, textural | 65.12 | 63.61 | 90.80 | 47.44
GASlumNet | UNet stream: RGB; ConvNeXt stream: spectral | 69.81 | 77.55 | 92.69 | 58.07
GASlumNet | UNet stream: RGB; ConvNeXt stream: textural | 74.05 | 71.98 | 93.05 | 57.48
GASlumNet | UNet stream: RGB, spectral; ConvNeXt stream: textural | 75.02 | 70.04 | 93.10 | 57.25
GASlumNet | UNet stream: RGB, textural; ConvNeXt stream: spectral | 73.83 | 71.68 | 92.98 | 57.16
GASlumNet | UNet stream: RGB; ConvNeXt stream: spectral, textural | 72.82 | 74.69 | 93.05 | 58.41

4.6. Results under Different Balance Parameters

We tested different values of $\gamma \in [0, 1]$ in the joint loss function to measure the contributions of the UNet stream and the ConvNeXt stream to the classification results. The higher $\gamma$ is, the greater the importance of the UNet stream and the input RGB bands; conversely, the lower $\gamma$ is, the greater the importance of the ConvNeXt stream and the input ancillary geographic features. The results generated when $\gamma = 0, 0.1, 0.3, 0.5, 0.7, 0.9, 1$ are presented in Table 9. Specifically, the model reduces to UNet when $\gamma = 1$ and to ConvNeXt-UNet when $\gamma = 0$. Comparing the results under $0 < \gamma < 1$ with those under $\gamma = 0$ or $\gamma = 1$ demonstrates that adding geographic features to the model improves the accuracies; thus, both the RGB bands and the geographic features contribute to the slum detection performance. As $\gamma$ increased, the OA and IoU values first increased and then decreased. The highest precision and OA values were obtained when $\gamma$ was 0.5, while the highest recall and IoU values were achieved when $\gamma$ was 0.7. This suggests that the RGB stream contributes more than the geographic features to the accuracy improvement.

5. Discussion

5.1. Differences from Existing Related Studies

In this study, a geoscience-aware network, named GASlumNet, was proposed to accurately map slums in Mumbai, a city with numerous slums and a large slum population. Previous researchers [30,44,55] have applied RS techniques to map slums in the same study area. These studies were all dedicated to improving the accuracy of slum mapping, but with different focuses. Specifically, the CSSIs [44] were proposed to validate the performance and potential of multispectral medium-resolution imagery for slum mapping. The CCF [55] was validated for its applicability in mapping slums in developing countries. The CNN-based study [30] demonstrated the effectiveness of DL models for slum mapping with both high-resolution and medium-resolution images. Our GASlumNet was proposed to validate the effectiveness of incorporating geoscientific knowledge, including spectral features and textural information, into deep learning algorithms. Whether compared to the DL-based methods (e.g., CNN and CNN transfer learning) or the prior knowledge-based methods (e.g., CSSIs), the proposed GASlumNet achieved better slum classification results when mapping slums across the whole of Mumbai (Table 10). Even when compared to the study [30] using ultra-high-resolution RS imagery (e.g., Pleiades with 0.5 m spatial resolution) and DL methods, GASlumNet obtained an increase of 5.56% in the IoU value. In addition, GASlumNet kept a promising balance between classification omissions and errors, with a smaller difference between the precision and recall values than the results of the CSSIs. The comparisons above demonstrate the advancement of the proposed GASlumNet.

5.2. Performance of GASlumNet

The proposed GASlumNet is characterized by a two-stream architecture with feature-level fusion and decision-level fusion. According to Table 5 and Table 8, the two-stream architecture allowed the model to learn discriminative features from both the RGB bands and prior geoscientific knowledge, which is confirmed by the comparison between GASlumNet and UNet or ConvNeXt-UNet. Additionally, the hierarchical feature-level fusion scheme also contributes to higher-accuracy slum mapping. Specifically, when compared to UNet or ConvNeXt-UNet using both RGB bands and geoscientific knowledge via input-level feature fusion, GASlumNet, using hierarchical feature-level fusion, obtained increases of 2.52–3.09% in the IoU values.
The contrastive model, FuseNet, can also combine the RGB bands and auxiliary geographic information via hierarchical feature-level fusion [36,37]; however, our GASlumNet outperformed FuseNet according to the quantitative and qualitative results. We attribute the superiority of GASlumNet to its structure, which incorporates the advantages of UNet and ConvNeXt through the multi-scale attention mechanism and the decision-level fusion. FuseNet uses two structurally identical VGG-16 encoders to extract features from the RGB bands and the geographic information. In contrast, the ConvNeXt stream of our GASlumNet exploits convolutions with large kernel sizes and large receptive fields to extract global slum features from the geographic indexes, while the UNet stream has relatively smaller receptive fields to extract local slum features from the high-resolution RGB images. Thus, the two streams of GASlumNet complement each other and improve the robustness of the model through feature-level fusion with a multi-scale attention mechanism. In addition, the decision fusion scheme is shown to improve the performance of the two-stream model structure.

5.3. Applicability and Limitations of GASlumNet

The proposed GASlumNet performed better in detecting medium-sized and large-sized slum patches (≥5 ha) than the comparative models according to Table 6; however, when identifying slum pockets smaller than 5 ha, all the models generated omissions. We attribute this to the influence of mixed pixels. Slums of a small size are easily blended with other classes on RS imagery, and these mixed pixels inevitably undermine the mapping results. Nevertheless, GASlumNet correctly identified about half of all small-sized slums, with a recall value of 50.79%, outperforming the other models. It is reasonable to assume that more small-sized slums will be detected correctly by GASlumNet when ultra-high-resolution imagery (<1 m) is used in future work.
Previous prior knowledge-based studies [20,46] generally used textural indexes to separate slums from non-slum built-up areas and vegetation spectral indexes to separate slums from green areas. Referring to these studies, spectral indexes and textural indexes are selected as the geoscientific features to help map slums. However, there are no standard indexes that perfectly identify slums, owing to intra-class variability and inter-class similarity, which cause inevitable classification errors among different land cover types. Land cover types including non-slum built-up and bare/sparse vegetation are easily misclassified as slums, according to Table 7 and Figure 8. For example, dense built-up neighborhoods with little vegetation may be wrongly classified as slums to varying degrees by different models (e.g., the fifth column in Figure 8). The category of bare/sparse vegetation shares similar spectral features with slums because of the small fraction of vegetation and the influence of the soil background in these classes. In this paper, spectral indexes, including NDVI, NDSGI and NDBI, and textural indexes calculated by the GLCM mean were intuitively selected to map slums. The comparisons between GASlumNet and the other models have demonstrated the promising performance of combining geoscientific features and DL models; however, other spectral and textural indexes were not explored owing to space constraints. In addition, spatial structural information can also help with image classification [56]. In future work, graph theory and graph neural networks (GNNs) could be exploited to represent slum features and further improve mapping performance.

6. Conclusions

This study introduces a two-stream network, GASlumNet, aimed at achieving high-accuracy slum mapping. GASlumNet combines UNet and ConvNeXt to capture slum features inherent in both RGB and geographic features. The ConvNeXt stream acquires global slum features from medium-resolution geographic index maps through depth-wise convolutions with large kernel sizes, while the UNet stream focuses on learning local slum features from high-resolution RGB images. Additionally, feature-level fusion and decision-level fusion mechanisms are incorporated to hierarchically bridge the two streams, enhancing the overall model performance.
The experimental results for slum classification in Mumbai, India, demonstrate that GASlumNet achieves the highest overall accuracy (OA) and intersection over union (IoU) values, surpassing UNet, ConvNeXt-UNet and FuseNet. The integration of ancillary geo-scientific information with deep learning (DL) models proves beneficial in enhancing the accuracy of slum mapping. The proposed model serves as a technical reference for accurately mapping slums, contributing to the enrichment of land cover and land use classification schemes in the developing world.

Author Contributions

Conceptualization, W.L.; methodology, W.L.; software, W.L.; validation, Y.H. and F.P.; formal analysis, W.L. and Y.H.; investigation, W.L. and F.P.; writing—original draft preparation, W.L.; writing—review and editing, Y.H., F.P., Z.F. and Y.Y.; visualization, W.L.; supervision, Y.H.; project administration, Y.H.; funding acquisition, Y.H., F.P. and Z.F. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Network Security and Information Program of the Chinese Academy of Sciences (CAS-WX2021SF-0106-02); Z.F. was supported by the Second Tibetan Plateau Scientific Expedition and Research Program (STEP) (20190ZKK1006); Y.H. was supported by the National Natural Science Foundation of China (42130508) and the Key Project of Innovation LREIS (KPI011); F.P. was supported by the National Natural Science Foundation of China (42071389).

Data Availability Statement

The ground truth data of the slums were acquired from the website https://sra.gov.in (accessed on 27 November 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. UN-Habitat. World Cities Report 2020: The Value of Sustainable Urbanization; United Nations Human Settlements Programme: Nairobi, Kenya, 2020. [Google Scholar]
  2. Wirastri, M.V.; Morrison, N.; Paine, G. The Connection between Slums and COVID-19 Cases in Jakarta, Indonesia: A Case Study of Kapuk Urban Village. Habitat Int. 2023, 134, 102765. [Google Scholar] [CrossRef]
  3. Thomson, D.R.; Stevens, F.R.; Chen, R.; Yetman, G.; Sorichetta, A.; Gaughan, A.E. Improving the Accuracy of Gridded Population Estimates in Cities and Slums to Monitor SDG 11: Evidence from a Simulation Study in Namibia. Land Use Policy 2022, 123, 106392. [Google Scholar] [CrossRef]
  4. Maung, N.L.; Kawasaki, A.; Amrith, S. Spatial and Temporal Impacts on Socio-Economic Conditions in the Yangon Slums. Habitat Int. 2023, 134, 102768. [Google Scholar] [CrossRef]
  5. UN-Habitat. The Challenge of Slums: Global Report on Human Settlements, 2003; Routledge: London, UK, 2003. [Google Scholar]
  6. UN-Habitat. Slum Almanac 2015–2016: Tracking Improvement in the Lives of Slum Dwellers. Participatory Slum Upgrading Programme. 2016. Available online: https://unhabitat.org/sites/default/files/documents/2019-05/slum_almanac_2015-2016_psup.pdf. (accessed on 27 November 2023).
  7. United Nations. Transforming Our World: The 2030 Agenda for Sustainable Development; United Nations: New York, NY, USA, 2015. [Google Scholar]
  8. MacTavish, R.; Bixby, H.; Cavanaugh, A.; Agyei-Mensah, S.; Bawah, A.; Owusu, G.; Ezzati, M.; Arku, R.; Robinson, B.; Schmidt, A.M. Identifying Deprived “Slum” Neighbourhoods in the Greater Accra Metropolitan Area of Ghana Using Census and Remote Sensing Data. World Dev. 2023, 167, 106253. [Google Scholar] [CrossRef] [PubMed]
  9. Kuffer, M.; Abascal, A.; Vanhuysse, S.; Georganos, S.; Wang, J.; Thomson, D.R.; Boanada, A.; Roca, P. Data and Urban Poverty: Detecting and Characterising Slums and Deprived Urban Areas in Low-and Middle-Income Countries. In Advanced Remote Sensing for Urban and Landscape Ecology; Springer: Cham, Switzerland, 2023; pp. 1–22. [Google Scholar]
  10. UN-Habitat. Metadata on SDGs Indicator 11.1. 1 Indicator Category: Tier I. UN Human Settlements Program, Nairobi. 2018. Available online: http://unhabitat.org/sites/default/files/2020/06/metadata_on_sdg_indicator_11.1.1.pdf (accessed on 27 November 2023).
  11. Kohli, D.; Sliuzas, R.; Kerle, N.; Stein, A. An Ontology of Slums for Image-Based Classification. Comput. Environ. Urban Syst. 2012, 36, 154–163. [Google Scholar] [CrossRef]
  12. Kohli, D.; Kerle, N.; Sliuzas, R. Local Ontologies for Object-Based Slum Identification and Classification. Environs 2012, 3, 3. [Google Scholar]
  13. Kohli, D.; Sliuzas, R.; Stein, A. Urban Slum Detection Using Texture and Spatial Metrics Derived from Satellite Imagery. J. Spat. Sci. 2016, 61, 405–426. [Google Scholar] [CrossRef]
  14. Badmos, O.S.; Rienow, A.; Callo-Concha, D.; Greve, K.; Jürgens, C. Urban Development in West Africa—Monitoring and Intensity Analysis of Slum Growth in Lagos: Linking Pattern and Process. Remote Sens. 2018, 10, 1044. [Google Scholar] [CrossRef]
  15. Kuffer, M.; Pfeffer, K.; Sliuzas, R.; Baud, I. Extraction of Slum Areas from VHR Imagery Using GLCM Variance. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 1830–1840. [Google Scholar] [CrossRef]
  16. Mudau, N.; Mhangara, P. Mapping and Assessment of Housing Informality Using Object-Based Image Analysis: A Review. Urban Sci. 2023, 7, 98. [Google Scholar] [CrossRef]
  17. Abed, A.D. Urban Upgrading of Slums: Baghdad and London Slums as Study Models for Urban Rehabilitation. Comput. Urban Sci. 2023, 3, 31. [Google Scholar] [CrossRef]
  18. Mahabir, R.; Croitoru, A.; Crooks, A.T.; Agouris, P.; Stefanidis, A. A Critical Review of High and Very High-Resolution Remote Sensing Approaches for Detecting and Mapping Slums: Trends, Challenges and Emerging Opportunities. Urban Sci. 2018, 2, 8. [Google Scholar] [CrossRef]
  19. Kuffer, M.; Wang, J.; Nagenborg, M.; Pfeffer, K.; Kohli, D.; Sliuzas, R.; Persello, C. The Scope of Earth-Observation to Improve the Consistency of the SDG Slum Indicator. ISPRS Int. J. Geo-Inf. 2018, 7, 428. [Google Scholar] [CrossRef]
  20. Trento Oliveira, L.; Kuffer, M.; Schwarz, N.; Pedrassoli, J.C. Capturing Deprived Areas Using Unsupervised Machine Learning and Open Data: A Case Study in São Paulo, Brazil. Eur. J. Remote Sens. 2023, 56, 2214690. [Google Scholar] [CrossRef]
  21. Dewan, A.; Alrasheedi, K.; El-Mowafy, A. Mapping Informal Settings Using Machine Learning Techniques, Object-Based Image Analysis and Local Knowledge. In Proceedings of the International Geoscience and Remote Sensing Symposium (IGARSS), Pasadena, CA, USA, 16–21 July 2023; pp. 7249–7252. [Google Scholar]
  22. Duque, J.C.; Patino, J.E.; Betancourt, A. Exploring the Potential of Machine Learning for Automatic Slum Identification from VHR Imagery. Remote Sens. 2017, 9, 895. [Google Scholar] [CrossRef]
  23. Prabhu, R.; Parvathavarthini, B.; Alagu Raja, R.A. Slum Extraction from High Resolution Satellite Data Using Mathematical Morphology Based Approach. Int. J. Remote Sens. 2021, 42, 172–190. [Google Scholar] [CrossRef]
  24. Brenning, A. Interpreting Machine-Learning Models in Transformed Feature Space with an Application to Remote-Sensing Classification. Mach. Learn. 2023, 112, 3455–3471. [Google Scholar] [CrossRef]
  25. Yuan, Q.; Shen, H.; Li, T.; Li, Z.; Li, S.; Jiang, Y.; Xu, H.; Tan, W.; Yang, Q.; Wang, J. Deep Learning in Environmental Remote Sensing: Achievements and Challenges. Remote Sens. Environ. 2020, 241, 111716. [Google Scholar] [CrossRef]
  26. Hong, D.; Gao, L.; Yokoya, N.; Yao, J.; Chanussot, J.; Du, Q.; Zhang, B. More Diverse Means Better: Multimodal Deep Learning Meets Remote-Sensing Imagery Classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4340–4354. [Google Scholar] [CrossRef]
  27. Li, J.; Hong, D.; Gao, L.; Yao, J.; Zheng, K.; Zhang, B.; Chanussot, J. Deep Learning in Multimodal Remote Sensing Data Fusion: A Comprehensive Review. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102926. [Google Scholar] [CrossRef]
  28. Bergamasco, L.; Bovolo, F.; Bruzzone, L. A Dual-Branch Deep Learning Architecture for Multisensor and Multitemporal Remote Sensing Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 2147–2162. [Google Scholar] [CrossRef]
  29. Wurm, M.; Stark, T.; Zhu, X.X.; Weigand, M.; Taubenböck, H. Semantic Segmentation of Slums in Satellite Images Using Transfer Learning on Fully Convolutional Neural Networks. ISPRS J. Photogramm. Remote Sens. 2019, 150, 59–69. [Google Scholar] [CrossRef]
  30. Verma, D.; Jana, A.; Ramamritham, K. Transfer Learning Approach to Map Urban Slums Using High and Medium Resolution Satellite Imagery. Habitat Int. 2019, 88, 101981. [Google Scholar] [CrossRef]
  31. Stark, T.; Wurm, M.; Zhu, X.X.; Taubenböck, H. Satellite-Based Mapping of Urban Poverty with Transfer-Learned Slum Morphologies. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 5251–5263. [Google Scholar] [CrossRef]
  32. Rehman, M.F.U.; Aftab, I.; Sultani, W.; Ali, M. Mapping Temporary Slums from Satellite Imagery Using a Semi-Supervised Approach. IEEE Geosci. Remote Sens. Lett. 2022, 19, 3512805. [Google Scholar] [CrossRef]
  33. El Moudden, T.; Dahmani, R.; Amnai, M.; Fora, A.A. Slum Image Detection and Localization Using Transfer Learning: A Case Study in Northern Morocco. Int. J. Electr. Comput. Eng. 2023, 13, 3299–3310. [Google Scholar] [CrossRef]
  34. Ge, Y.; Zhang, X.; Atkinson, P.M.; Stein, A.; Li, L. Geoscience-Aware Deep Learning: A New Paradigm for Remote Sensing. Sci. Remote Sens. 2022, 5, 100047. [Google Scholar] [CrossRef]
  35. Lu, W.; Hu, Y.; Zhang, Z.; Cao, W. A Dual-Encoder U-Net for Landslide Detection Using Sentinel-2 and DEM Data. Landslides 2023, 20, 1975–1987. [Google Scholar] [CrossRef]
  36. Audebert, N.; Le Saux, B.; Lefèvre, S. Beyond RGB: Very High Resolution Urban Remote Sensing with Multimodal Deep Networks. ISPRS J. Photogramm. Remote Sens. 2018, 140, 20–32. [Google Scholar] [CrossRef]
  37. Hazirbas, C.; Ma, L.; Domokos, C.; Cremers, D. FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture. In Proceedings of the Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Revised Selected Papers, Part I; Springer: Berlin/Heidelberg, Germany, 2017; pp. 213–228. [Google Scholar]
  38. He, Q.; Sun, X.; Diao, W.; Yan, Z.; Yao, F.; Fu, K. Multimodal Remote Sensing Image Segmentation with Intuition-Inspired Hypergraph Modeling. IEEE Trans. Image Process. 2023, 32, 1474–1487. [Google Scholar] [CrossRef]
  39. Xiong, Z.; Chen, S.; Wang, Y.; Mou, L.; Zhu, X.X. GAMUS: A Geometry-Aware Multi-Modal Semantic Segmentation Benchmark for Remote Sensing Data. arXiv 2023, arXiv:2305.14914. [Google Scholar]
  40. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  41. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A Convnet for the 2020s. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  42. Philpot, W.; Jacquemoud, S.; Tian, J. ND-Space: Normalized Difference Spectral Mapping. Remote Sens. Environ. 2021, 264, 112622. [Google Scholar] [CrossRef]
  43. Zha, Y.; Gao, J.; Ni, S. Use of Normalized Difference Built-up Index in Automatically Mapping Urban Areas from TM Imagery. Int. J. Remote Sens. 2003, 24, 583–594. [Google Scholar] [CrossRef]
  44. Peng, F.; Lu, W.; Hu, Y.; Jiang, L. Mapping Slums in Mumbai, India, Using Sentinel-2 Imagery: Evaluating Composite Slum Spectral Indices (CSSIs). Remote Sens. 2023, 15, 4671. [Google Scholar] [CrossRef]
  45. Haralick, R.M.; Shanmugam, K.; Dinstein, I.H. Textural Features for Image Classification. IEEE Trans. Syst. Man Cybern. 1973, 6, 610–621. [Google Scholar] [CrossRef]
  46. Wurm, M.; Weigand, M.; Schmitt, A.; Geiß, C.; Taubenböck, H. Exploitation of Textural and Morphological Image Features in Sentinel-2A Data for Slum Mapping. In Proceedings of the 2017 Joint Urban Remote Sensing Event (JURSE), Dubai, United Arab Emirates, 6–8 March 2017; pp. 1–4. [Google Scholar]
  47. Kotthaus, S.; Smith, T.E.; Wooster, M.J.; Grimmond, C.S.B. Derivation of an Urban Materials Spectral Library through Emittance and Reflectance Spectroscopy. ISPRS J. Photogramm. Remote Sens. 2014, 94, 194–212. [Google Scholar] [CrossRef]
  48. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  49. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  50. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  51. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional Feature Fusion. In Proceedings of the 2021 IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3560–3569. [Google Scholar]
  52. Milletari, F.; Navab, N.; Ahmadi, S.-A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
  53. Phan, T.H.; Yamamoto, K. Resolving Class Imbalance in Object Detection with Weighted Cross Entropy Losses. arXiv 2020, arXiv:2006.01413. [Google Scholar]
  54. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  55. Gram-Hansen, B.J.; Helber, P.; Varatharajan, I.; Azam, F.; Coca-Castro, A.; Kopackova, V.; Bilinski, P. Mapping Informal Settlements in Developing Countries Using Machine Learning and Low Resolution Multi-Spectral Data. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, Honolulu, HI, USA, 27–28 January 2019; pp. 361–368. [Google Scholar]
  56. Song, X.; Hua, Z.; Li, J. GMTS: GNN-Based Multi-Scale Transformer Siamese Network for Remote Sensing Building Change Detection. Int. J. Digit. Earth 2023, 16, 1685–1706. [Google Scholar] [CrossRef]
Figure 1. Illustration of the study area and ground truth data. (a) The location of Mumbai; (b) slums of Mumbai with the base map of Google Satellite imagery; (c) land cover of Mumbai presented by the European Space Agency (ESA) WorldCover 10 m v100 data; (d) a zoomed-in view of Dharavi with the base map of Google Satellite imagery.
Figure 2. The RGB bands, spectral indexes and textural features used in this study. (a) RGB composite of the Jilin-1 image; (b) the false-color composite of NDVI, NDSGI and NDBI; (c) the false-color composite of GLCM_SWIR mean, GLCM_NIR mean and GLCM_R mean; (d) ground truth of slums.
Figure 3. The architecture of GASlumNet.
Figure 4. A ConvNeXt block [41].
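For readers unfamiliar with the block in Figure 4, the following is a minimal PyTorch sketch of a ConvNeXt block as described in [41] (layer scale and stochastic depth are omitted for brevity); it is illustrative and not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Minimal ConvNeXt block: 7x7 depthwise conv -> LayerNorm -> 1x1 expansion ->
    GELU -> 1x1 projection, wrapped in a residual connection."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)            # normalizes over the channel dimension
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # 1x1 conv expressed as a linear layer
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # (N, C, H, W) -> (N, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to (N, C, H, W)
        return shortcut + x
```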
Figure 5. The MSCAM [51].
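Similarly, the snippet below sketches the multi-scale channel attention module (MS-CAM) of [51], in which a global (pooled) branch and a local point-wise bottleneck are summed and passed through a sigmoid to re-weight the input feature map. The reduction ratio r is an assumed hyperparameter; the exact fusion wiring in GASlumNet follows the architecture in Figure 3.

```python
import torch
import torch.nn as nn

class MSCAM(nn.Module):
    """Sketch of the multi-scale channel attention module described in [51]."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        mid = channels // r

        def bottleneck() -> nn.Sequential:
            # point-wise conv -> BN -> ReLU -> point-wise conv -> BN
            return nn.Sequential(
                nn.Conv2d(channels, mid, kernel_size=1),
                nn.BatchNorm2d(mid),
                nn.ReLU(inplace=True),
                nn.Conv2d(mid, channels, kernel_size=1),
                nn.BatchNorm2d(channels),
            )

        self.local_att = bottleneck()
        self.global_att = nn.Sequential(nn.AdaptiveAvgPool2d(1), bottleneck())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.sigmoid(self.local_att(x) + self.global_att(x))
        return x * weights
```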
Figure 6. Slum maps of Mumbai generated by UNet (a), ConvNeXt-UNet (b), FuseNet (c) and GASlumNet (d).
Figure 7. Examples of classification results for slums of different sizes among the baseline models (UNet and ConvNeXt-UNet), the contrastive model (FuseNet) and the proposed GASlumNet. FP: false positive, TP: true positive, FN: false negative. (A) scenes containing large slums; (B) scenes containing large and medium slums; (C,D) scenes containing large and small slums; (E) scenes containing small slum pockets.
Figure 8. Examples of classification results under different landcover scenes. FP: false positives, TP: true positives, FN: false negatives.
Table 1. Examples of GSO.

Level | Indicators | Observation in Slums | Features
Environment | Location | Hazardous and flood-prone areas; close to railways, highways and major roads; close to water areas; some on steep slopes | Association: distance to rivers and roads
Environment | Neighborhood characteristics | Close to the CBD, middle/high socioeconomic-status areas and industrial areas | Association: distance to socioeconomic-status areas
Settlement | Shape | Generally irregular; elongated formation following a river or railway | Geometry
Settlement | Density | Highly compact; high roof coverage; low vegetation/open-space coverage | Texture
Object | Access network | Generally unpaved, narrow, irregular roads and footpaths | Geometry/spectral features
Object | Building characteristics | Roofs of iron sheet, asbestos, plastic, fiber or clay tiles; less bright than formal settlements; small building size | Spectral/morphological features
Table 2. The encoder structures of the two streams in GASlumNet (k = kernel size; s = stride; p = padding; h, w, c = output height, width and channels).

UNet stream:
Stage | k | s, p | h, w, c
DS1 | 3 × 3 | 1, 1 | H, W, 64
Maxp | 2 × 2 | 2, 0 | H/2, W/2, 64
DS2 | 3 × 3 | 1, 1 | H/2, W/2, 128
Maxp | 2 × 2 | 2, 0 | H/4, W/4, 128
DS3 | 3 × 3 | 1, 1 | H/4, W/4, 256
Maxp | 2 × 2 | 2, 0 | H/8, W/8, 256
DS4 | 3 × 3 | 1, 1 | H/8, W/8, 512
Maxp | 2 × 2 | 2, 0 | H/16, W/16, 512
DS5 | 3 × 3 | 1, 1 | H/16, W/16, 1024

ConvNeXt stream:
Stage | k | s, p | h, w, c
Stem | 2 × 2 | 2, 0 | H/2, W/2, 128
DS1 | [7 × 7, 1 × 1, 1 × 1] × 3 | [1, 3], [1, 0], [1, 0] | H/2, W/2, 128
Dlayer1 | 2 × 2 | 2, 0 | H/4, W/4, 256
DS2 | [7 × 7, 1 × 1, 1 × 1] × 3 | [1, 3], [1, 0], [1, 0] | H/4, W/4, 256
Dlayer2 | 2 × 2 | 2, 0 | H/8, W/8, 512
DS3 | [7 × 7, 1 × 1, 1 × 1] × 27 | [1, 3], [1, 0], [1, 0] | H/8, W/8, 512
Dlayer3 | 2 × 2 | 2, 0 | H/16, W/16, 1024
DS4 | [7 × 7, 1 × 1, 1 × 1] × 3 | [1, 3], [1, 0], [1, 0] | H/16, W/16, 1024
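To make the shape bookkeeping in Table 2 concrete, the sketch below reproduces the first two stages of the UNet stream, assuming the conventional double 3 × 3 convolution with batch normalization and ReLU per DS stage (an assumption; Table 2 lists only kernel, stride and padding). The ConvNeXt stream interleaves the Stem/Dlayer downsampling layers with stacks of blocks like the one sketched after Figure 4.

```python
import torch
import torch.nn as nn

def unet_stage(in_ch: int, out_ch: int) -> nn.Sequential:
    """One DS stage of the UNet stream in Table 2: two 3x3 conv-BN-ReLU layers
    (stride 1, padding 1) that keep the spatial size and change the channel count."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # the Maxp rows: halve H and W

x = torch.randn(1, 3, 256, 256)                # H x W optical input (assumed 3 bands)
f1 = unet_stage(3, 64)(x)                      # -> H,   W,   64   (DS1)
f2 = unet_stage(64, 128)(pool(f1))             # -> H/2, W/2, 128  (DS2)
```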
Table 3. The decoder structure of GASlumNet (k = kernel size; s = stride; p = padding; h, w, c = output height, width and channels).

Stage | Layer | k | s, p | h, w, c
US1 | ConvT1 | 2 × 2 | 2, 0 | H/8, W/8, 512
US1 | CBR/CBG | 3 × 3 | 1, 1 | H/8, W/8, 512
US2 | ConvT2 | 2 × 2 | 2, 0 | H/4, W/4, 256
US2 | CBR/CBG | 3 × 3 | 1, 1 | H/4, W/4, 256
US3 | ConvT3 | 2 × 2 | 2, 0 | H/2, W/2, 128
US3 | CBR/CBG | 3 × 3 | 1, 1 | H/2, W/2, 128
US4 | ConvT4 | 2 × 2 | 2, 0 | H, W, 64
US4 | CBR/CBG | 3 × 3 | 1, 1 | H, W, 64
- | 1 × 1 Conv | 1 × 1 | 1, 1 | H, W, 2
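A minimal sketch of one upsampling stage in Table 3 follows, assuming that CBR/CBG denotes convolution–batch normalization–ReLU/GELU; skip connections and the fusion modules of Figure 3 are omitted here for brevity.

```python
import torch
import torch.nn as nn

def up_stage(in_ch: int, out_ch: int, use_gelu: bool = False) -> nn.Sequential:
    """One US stage of Table 3: a 2x2 transposed convolution (stride 2) that doubles
    H and W, followed by a 3x3 conv + BN + activation (CBR or CBG)."""
    act = nn.GELU() if use_gelu else nn.ReLU(inplace=True)
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        act,
    )

x = torch.randn(1, 1024, 16, 16)      # H/16 x W/16 bottleneck features
y = up_stage(1024, 512)(x)            # -> H/8 x W/8 x 512 (US1)
```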
Table 4. The experimental settings.

Group | Item | Setting
Hyperparameters | No. of categories | 2
Hyperparameters | Balance parameter γ | 0.7
Model training | Batch size | 64
Model training | Epochs | 100
Model training | Optimizer | Adam
Model training | Learning rate | 1 × 10−3
Model training | Weight decay | 5 × 10−4
Experimental environment | System | Windows 10
Experimental environment | Language | Python
Experimental environment | Framework | PyTorch 1.11.0
Experimental environment | CPU | Intel(R) Xeon(R) Silver 4210R with 64 GB memory
Experimental environment | GPU | NVIDIA GeForce RTX 3090 with 24 GB memory
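The settings in Table 4 translate into a straightforward training configuration; the sketch below shows only the optimizer and hyperparameters, with a trivial placeholder standing in for GASlumNet and the data pipeline.

```python
import torch

# Minimal training-configuration sketch mirroring Table 4.
model = torch.nn.Conv2d(3, 2, kernel_size=1)  # placeholder standing in for GASlumNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)

BATCH_SIZE = 64    # Table 4: batch size
NUM_EPOCHS = 100   # Table 4: epochs
NUM_CLASSES = 2    # slum / non-slum
GAMMA = 0.7        # balance parameter
```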
Table 5. Classification accuracies on the testing dataset among different models.

Models | Precision (%) | Recall (%) | OA (%) | IoU (%)
UNet | 68.54 | 71.47 | 91.99 | 53.81
ConvNeXt-UNet | 67.59 | 70.05 | 91.70 | 52.44
FuseNet | 65.12 | 63.61 | 90.80 | 47.44
GASlumNet | 72.82 | 74.69 | 93.05 | 58.41
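The accuracies reported in Tables 5–7 can be computed from the binary confusion matrix, assuming the conventional pixel-wise definitions of precision, recall, overall accuracy (OA) and intersection over union (IoU), as sketched below.

```python
import numpy as np

def slum_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    """Pixel-wise metrics for the slum class (binary masks, 1 = slum), in percent."""
    tp = np.sum((pred == 1) & (truth == 1))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    tn = np.sum((pred == 0) & (truth == 0))
    return {
        "Precision": 100.0 * tp / (tp + fp),
        "Recall":    100.0 * tp / (tp + fn),
        "OA":        100.0 * (tp + tn) / (tp + fp + fn + tn),
        "IoU":       100.0 * tp / (tp + fp + fn),
    }
```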
Table 6. Recalls (%) of detected slums of different sizes.

Models | Small Slum Pockets (<5 ha) | Medium Slum Patches (5~25 ha) | Large Slum Patches (≥25 ha)
UNet | 47.88 | 79.99 | 82.47
ConvNeXt-UNet | 46.01 | 79.10 | 80.75
FuseNet | 41.61 | 69.63 | 76.39
GASlumNet | 50.79 | 83.36 | 85.81
Table 7. Area of main land cover types misclassified as slums on the testing dataset. FP = false positives of slum classification results; FN = false negatives of slum classification results. Land cover columns follow the ESA WorldCover classes.

FP/FN | Models | Built-Up (ha) | Bare/Sparse Vegetation (ha) | Tree Cover (ha) | Water (ha) | Others (ha)
FP | UNet | 216.34 | 81.39 | 15.41 | 1.87 | 4.40
FP | ConvNeXt-UNet | 232.68 | 75.90 | 11.76 | 1.75 | 4.74
FP | FuseNet | 223.48 | 88.02 | 12.85 | 0.27 | 7.18
FP | GASlumNet | 183.92 | 73.07 | 10.04 | 0.63 | 3.27
FN | UNet | 145.42 | 99.83 | 27.08 | 1.63 | 5.51
FN | ConvNeXt-UNet | 154.59 | 102.95 | 28.88 | 1.51 | 3.83
FN | FuseNet | 193.57 | 122.59 | 32.18 | 1.87 | 4.18
FN | GASlumNet | 119.92 | 92.58 | 28.60 | 1.63 | 3.69
Table 9. Quantitative results under different balance parameters.

γ | Precision (%) | Recall (%) | OA (%) | IoU (%)
0 | 67.59 | 70.05 | 91.70 | 52.44
0.1 | 75.03 | 70.67 | 93.10 | 57.21
0.3 | 75.80 | 70.30 | 93.19 | 57.41
0.5 | 76.28 | 70.49 | 93.28 | 57.81
0.7 | 72.82 | 74.69 | 93.05 | 58.41
0.9 | 73.00 | 73.98 | 93.03 | 58.09
1 | 68.54 | 71.47 | 91.99 | 53.81
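Table 9 shows that γ = 0 and γ = 1 reproduce the ConvNeXt-UNet and UNet results of Table 5, which suggests that γ interpolates between the two streams at the decision level. The sketch below illustrates one plausible γ-weighted dual-stream objective under that reading; it is an assumption for illustration only, not the authors' verified loss, which may additionally combine Dice [52] and weighted cross-entropy [53] terms.

```python
import torch
import torch.nn.functional as F

def balanced_dual_stream_loss(logits_unet: torch.Tensor,
                              logits_convnext: torch.Tensor,
                              target: torch.Tensor,
                              gamma: float = 0.7) -> torch.Tensor:
    """Hypothetical reading of the balance parameter: gamma weights the UNet-stream
    term against the ConvNeXt-stream term, so gamma = 1 trains the UNet stream only
    and gamma = 0 the ConvNeXt stream only (consistent with Table 9)."""
    loss_unet = F.cross_entropy(logits_unet, target)
    loss_convnext = F.cross_entropy(logits_convnext, target)
    return gamma * loss_unet + (1.0 - gamma) * loss_convnext
```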
Table 10. Slum mapping accuracies for the whole of Mumbai across different studies. CSSIs = composite slum spectral indexes; CCF = canonical correlation forest; CNN = convolutional neural network.

Methods | RS Imagery (Spatial Resolution) | Precision (%) | Recall (%) | OA (%) | IoU (%)
CSSIs (threshold-based) [44] | Sentinel-2 (10 m) | 63.86 | 58.38 | - | 43.89
CSSIs (ML-based) [44] | Sentinel-2 (10 m) | 61.56 | 82.50 | - | 54.45
CCF [55] | Sentinel-2 (10 m) | - | - | - | 40.30
CNN transfer learning [30] | Pleiades (0.5 m), Sentinel-2 (10 m) | - | - | 92.00 | 43.20
CNN [30] | Pleiades (0.5 m) | - | - | 94.30 | 58.30
GASlumNet | Jilin-1 (5 m), Sentinel-2 (10 m) | 76.13 | 79.85 | 94.89 | 63.86
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
