Article

A Geoscience-Aware Network (GASlumNet) Combining UNet and ConvNeXt for Slum Mapping

1 State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
2 College of Resources and Environment, University of Chinese Academy of Sciences, Beijing 100049, China
3 Key Laboratory for Geographical Process Analysis & Simulation of Hubei Province, Central China Normal University, Wuhan 430079, China
4 College of Urban and Environmental Sciences, Central China Normal University, Wuhan 430079, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(2), 260; https://doi.org/10.3390/rs16020260
Submission received: 29 November 2023 / Revised: 3 January 2024 / Accepted: 8 January 2024 / Published: 9 January 2024
(This article belongs to the Section AI Remote Sensing)

Abstract
Approximately 1 billion people worldwide currently inhabit slum areas. The UN Sustainable Development Goal (SDG 11.1) underscores the imperative of upgrading all slums by 2030 to ensure adequate housing for everyone. Geo-locations of slums help local governments with upgrading slums and alleviating urban poverty. Remote sensing (RS) technology, with its excellent Earth observation capabilities, can play an important role in slum mapping. Deep learning (DL)-based RS information extraction methods have attracted a lot of attention. Currently, DL-based slum mapping studies typically use three optical bands to adapt to existing models, neglecting essential geo-scientific information, such as spectral and textural characteristics, which are beneficial for slum mapping. Inspired by the geoscience-aware DL paradigm, we propose the Geoscience-Aware Network for slum mapping (GASlumNet), aiming to improve slum mapping accuracy by incorporating geoscientific prior knowledge into the DL model. GASlumNet employs a two-stream architecture, combining ConvNeXt and UNet. One stream concentrates on optical feature representation, while the other emphasizes geo-scientific features. Further, feature-level and decision-level fusion mechanisms are applied to optimize deep features and enhance model performance. We used Jilin-1 Spectrum 01 and Sentinel-2 images to perform experiments in Mumbai, India. The results demonstrate that GASlumNet achieves higher slum mapping accuracy than the comparison models, with an intersection over union (IoU) of 58.41%. Specifically, GASlumNet improves the IoU by 4.60~5.97% over the baseline models, i.e., UNet and ConvNeXt-UNet, which exclusively utilize optical bands. Furthermore, GASlumNet enhances the IoU by 10.97% compared to FuseNet, a model that combines optical bands and geo-scientific features. Our method presents a new technical solution to achieve accurate slum mapping, offering potential benefits for regional and global slum mapping and upgrading initiatives.


1. Introduction

In developing nations and regions, pronounced rural–urban migration poses a substantial challenge to urban development, which struggles to keep pace with population urbanization. The rapid influx of population impairs the capacity of cities to adequately cater to the evolving needs of living conditions, employment opportunities and public health management, eventually leading to urban poverty [1,2]. Urban slum development and growth, as well as an increase in the population of the poor, are evidence of urban poverty [3,4]. About one in eight of the global population currently lives in slum-like environments [5,6]. Access to accurate geospatial information on slums is critical to achieving SDG 11.1 [7]. Traditional field surveys and manual enumeration are time-consuming and labor-intensive; therefore, remote sensing (RS) is necessary for rapid and accurate slum mapping [8,9].
Figuring out “What is a slum?” is the first step in RS-based slum extraction and mapping. According to UN-Habitat [10], slums typically lack clean water sources, sanitation, sufficient living space, stability and security. However, this definition cannot be visualized on RS imagery directly. Thus, Kohli et al. [11] proposed the generic slum ontology (GSO), a methodology to establish a link between the appearance of slums and image features at three levels, i.e., the environment level, the neighborhood level and the object level. We synthesized the literature [12,13,14] to collate the GSO information in Table 1. Based on the GSO, real-world slum appearance characteristics can be converted into computer-understandable parameters, such as textural features, geometric features, morphological features and so on. An analysis of GSO-related studies shows that vegetation spectral indexes and textural characteristics are crucial for discriminating slums from non-slums [15]. Specifically, houses are tightly packed and disordered in slums, whereas houses are relatively sparsely packed and orderly in formal settlements [16]. In addition, there are fewer green spaces in slums, whereas there are adequate green spaces in formal settlements [17].
Earlier research designed image feature sets according to the GSO and used object-oriented analysis (OOA) and machine learning (ML) methods to map slums [18,19,20,21]. The OOA-based slum mapping methods are generally characterized by low levels of automation and limited transferability. This can be attributed to the following reasons: (1) OOA methods necessitate trial and error to ascertain suitable scale parameters, and (2) the formulation of classification rule sets is contingent upon expert knowledge [21]. ML methods can automatically differentiate between slum and non-slum areas by training models with input features and samples [15,22,23]. For example, Prabhu et al. [23] extracted textural features, wavelet frame transformation features and morphological profiles. They employed the minimum redundancy maximum relevance algorithm for feature selection. Subsequently, the selected features were input into a support vector machine (SVM) for slum mapping. The effectiveness of ML methods is substantially influenced by the discernibility of the input feature sets. This accentuates the importance of feature engineering and data preprocessing, as deficiencies in these areas can compromise the intelligence and performance of ML methods [24]. In addition, the similarity of spectral and textural features among different land cover types may compromise the effectiveness of purely GSO-based methods. Techniques endowed with robust capabilities in feature representation and information extraction are therefore imperative for achieving higher-accuracy slum mapping.
Deep learning (DL) techniques have gained popularity in recent years for RS image information analysis [25,26,27,28]. DL models automatically extract discriminative high-level semantic features without the need for complex feature engineering of the input data [25], outperforming previous techniques in the capacity to extract information. Studies [29,30,31,32,33] on slum mapping based on DL have also generated fine results. Wurm et al. [29] employed a fully convolutional neural network (FCN) to extract slums and obtained an overall accuracy (OA) of 90.64% on QuickBird images with a resolution of 2 m and an OA of 86.71% on Sentinel-2 images with a resolution of 10 m, demonstrating the advantage of DL for slum mapping. Verma et al. [30] verified the potential of medium-resolution imagery like Sentinel-2 for mapping slums with DL and transfer learning. They extracted slums from medium-resolution imagery (10 m) using a DL model pre-trained on high-resolution imagery (0.5 m) through transfer learning and obtained an IoU accuracy of 43.20% on the Sentinel-2 images. However, in contrast to earlier GSO-based methods, current DL-based slum extraction studies predominantly rely on three optical bands while disregarding other geo-scientific information associated with slums, such as the aforementioned spectral indices and textural features. Integrating such geo-scientific information as supplementary knowledge into optical band-based slum extraction is therefore worth exploring, and it is anticipated that incorporating geo-scientific information into DL models will further enhance the accuracy of slum extraction.
Ge et al. [34] proposed a geoscience-aware deep learning paradigm for RS, in which knowledge-driven approaches that provide geo-scientific prior knowledge and improve model interpretability complement data-driven approaches with excellent information extraction and feature representation capabilities. Some researchers have proposed dual-stream architectures [28,35] to synchronously learn information from visual features and other auxiliary features for better performance. Audebert et al. [36] used FuseNet [37] to conduct semantic segmentation on RS images, which is an example of embedding geoscientific knowledge into a DL model. In their work, they used the normalized difference vegetation index (NDVI) and the digital surface model (DSM) as the geo-scientific features, and their experimental results demonstrated that the DL models that included geo-scientific knowledge achieved an F1 score about 1% higher than the DL model that did not. Similarly, He et al. [38] proposed an intuition-inspired network with two streams to mine information from multispectral imagery and DSM data. Xiong et al. [39] used a geometry-aware semantic segmentation network incorporating RGB bands and normalized DSM data to achieve accurate land cover classification. While these studies have provided valuable insights into RS image information extraction, the effectiveness of DL models incorporating geo-scientific knowledge for slum mapping tasks remains unclear, presenting an opportunity for further exploration.
Therefore, the objective of this study is to enhance the performance of a slum extraction model by incorporating slum-related geo-scientific prior knowledge into DL models. We propose a two-stream DL model, named GASlumNet, which bridges UNet and ConvNeXt. One stream is dedicated to learning deep features from the optical bands, while the other stream focuses on learning deep features from geo-scientific information. GASlumNet is compared with existing models, i.e., UNet [40], ConvNeXt-UNet [40,41] and FuseNet [36,37], in the study area of Mumbai, India. The model performance is evaluated using four indices: precision, recall, overall accuracy (OA) and intersection over union (IoU).

2. Study Area

Urban poverty is widespread in India; in 2020, 49.0% of India’s urban population resided in slums. Our study area is situated in Mumbai (Figure 1a), the largest harbor city of India and the state capital of Maharashtra. The vegetation areas are concentrated in the north-western, northern and eastern parts of the city, while built-up areas are densely spread in the other parts (Figure 1c). The total population of Mumbai has reached 20.4 million, with a population density of approximately 229 persons per hectare. Mumbai is also one of the most slum-populated cities. About 55% of the urban population in Mumbai now resides in slums, with Dharavi (Figure 1d), one of the city’s largest slum communities, covering 2.39 km2 and being home to approximately 1 million poor people. The Maharashtra government established the Slum Rehabilitation Authority (SRA) in 1995 and launched a program to improve slum conditions in response to the urgent slum issue. The SRA compiled a map of the slums located in Mumbai in 2015–2016 (https://sra.gov.in/ (accessed on 27 November 2023)) (Figure 1b).

3. Methods

3.1. Preparing Slum Dataset

A slum dataset of Mumbai was built to facilitate the model training and validation in this study. The dataset comprises ground truth data of the slums, optical bands and spectral and textural features. Specifically, the optical bands were initially extracted from high-resolution RS imagery, while the spectral and textural features, enriched with geo-scientific knowledge related to slums, were calculated based on medium-resolution multispectral RS imagery.
The red (R), the green (G) and the blue (B) bands, from Jilin-1 Spectrum-01 data of 5-m spatial resolution, were used as the optical bands, of which the wavelengths are 630–680 nm, 525–600 nm and 450–515 nm, respectively. The Jilin-1 Spectrum-01 Level-1 radiometric-corrected images used in this study were acquired on 20 January 2020. Two images covered our study area, i.e., ‘JL1GP01_PMS1_20200120151147_200020875_102_0013_001_L1’ and ‘JL1GP01_PMS1_20200120151147_200020875_102_0014_001_L1’. Orthorectification was performed based on the rational polynomial coefficients (RPC) provided by the image sources. Further, image stitching and layer stacking were conducted to produce the RGB image. Figure 2a shows the RGB synthesis of the Jilin-1 Spectrum-01 image. The projection of the image is WGS 84/UTM zone 43N (EPSG: 32643).
The spectral and textural features used in this study were calculated from Sentinel-2 imagery. We collected the Sentinel-2 Level-1C orthorectified top-of-atmosphere reflectance images of Mumbai in 2016 using the JavaScript API on the Google Earth Engine (GEE) platform, i.e., the ‘COPERNICUS/S2_HARMONIZED’ collection in the GEE data catalogue. Initially, we filtered the images with cloud coverage percentages less than 20%. Subsequently, cloud masking was conducted based on the QA60 band to generate a cloud-free image collection. Ultimately, a composite Sentinel-2 image was created using the median values of the filtered cloud-free collection. From the composite Sentinel-2 image, we computed the spectral features, namely the normalized difference vegetation index (NDVI), the normalized difference soil–vegetation index (NDSGI) [42] and the normalized difference built-up index (NDBI) [43]. NDVI and NDSGI can aid in distinguishing slums, which are characterized by fewer vegetated areas, from areas with tree cover, grassland and the parts of formal settlements that exhibit more green space. NDBI may assist in separating slums from soil backgrounds, parts of the non-slum built-up areas and water areas. The formulations of the three spectral indexes are expressed in Equations (1)–(3).
$$\mathrm{NDVI} = \frac{\rho_{\mathrm{NIR}} - \rho_{\mathrm{R}}}{\rho_{\mathrm{NIR}} + \rho_{\mathrm{R}}} \tag{1}$$
$$\mathrm{NDSGI} = \frac{\rho_{\mathrm{G}} - \rho_{\mathrm{R}}}{\rho_{\mathrm{G}} + \rho_{\mathrm{R}}} \tag{2}$$
$$\mathrm{NDBI} = \frac{\rho_{\mathrm{SWIR}} - \rho_{\mathrm{NIR}}}{\rho_{\mathrm{SWIR}} + \rho_{\mathrm{NIR}}} \tag{3}$$
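As an illustration, the following sketch reproduces this compositing and index computation with the Earth Engine Python API (the study used the JavaScript API); the Mumbai rectangle, the calendar-year date range and the band choices (B8 = NIR, B4 = red, B3 = green, B11 = SWIR) are our assumptions rather than parameters reported by the authors.

```python
import ee

ee.Initialize()

# Assumed area of interest for Mumbai; the exact geometry is not reported in the paper.
mumbai = ee.Geometry.Rectangle([72.75, 18.85, 73.05, 19.30])

def mask_s2_clouds(image):
    """Mask opaque clouds (bit 10) and cirrus (bit 11) using the QA60 band."""
    qa = image.select('QA60')
    clear = qa.bitwiseAnd(1 << 10).eq(0).And(qa.bitwiseAnd(1 << 11).eq(0))
    return image.updateMask(clear)

composite = (
    ee.ImageCollection('COPERNICUS/S2_HARMONIZED')
    .filterBounds(mumbai)
    .filterDate('2016-01-01', '2017-01-01')
    .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 20))
    .map(mask_s2_clouds)
    .median()
)

# Equations (1)-(3); band choices assume B8 = NIR, B4 = red, B3 = green, B11 = SWIR.
ndvi = composite.normalizedDifference(['B8', 'B4']).rename('NDVI')
ndsgi = composite.normalizedDifference(['B3', 'B4']).rename('NDSGI')
ndbi = composite.normalizedDifference(['B11', 'B8']).rename('NDBI')
indices = ee.Image.cat([ndvi, ndsgi, ndbi])
```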
Referring to previous studies [44], the short-wave infrared (SWIR) band plays an important role in identifying slums from backgrounds, while the near-infrared (NIR) band and the red (R) band are indispensable for distinguishing slums from vegetated areas. Thus, we calculated the textural features based on the SWIR, NIR and R bands via a gray-level co-occurrence matrix (GLCM) [45]. As for the selection of GLCM coefficients, variance and contrast [13,46] were often used in existing studies to extract slums, considering that slums have lower GLCM variance and contrast values, while non-slum built-up areas have relatively higher values because of their complex scenes. Nevertheless, abrupt changes at the boundaries of slums may introduce significant variance and contrast, potentially resulting in the misclassification of these boundary pixels. In this study, GLCM mean values, expressed as GLCM_SWIR mean, GLCM_NIR mean and GLCM_R mean, were used as the textural coefficients. Intuitively, pixels in slum areas may exhibit higher GLCM_SWIR mean values, lower GLCM_NIR mean values and relatively lower GLCM_R mean values compared to non-slum areas. This is attributed to the presence of limited vegetated areas and the prevalence of dense materials, such as metal or shingle roofs [47], in slums. The window size for the GLCM is set as 5 × 5. All the geo-scientific indexes calculated from the Sentinel-2 image were re-sampled to 5 m resolution via bilinear interpolation and reprojected to WGS 84/UTM zone 43N (EPSG: 32643) (Figure 2b,c).
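The GLCM mean can be approximated per pixel as sketched below; the 32-level quantization, the single offset (distance 1, angle 0) and the row-marginal definition of the GLCM mean are our assumptions, and the nested loops are written for clarity rather than speed.

```python
import numpy as np
from skimage.feature import graycomatrix
from skimage.util import view_as_windows

def glcm_mean(band, window=5, levels=32):
    """Per-pixel GLCM mean in a sliding window, taken here as the mean of the
    row marginal of the normalized co-occurrence matrix: sum_i i * p(i)."""
    # Quantize reflectance values to a small number of gray levels.
    edges = np.linspace(band.min(), band.max(), levels)
    quantized = (np.digitize(band, edges) - 1).astype(np.uint8)
    pad = window // 2
    padded = np.pad(quantized, pad, mode='reflect')
    windows = view_as_windows(padded, (window, window))   # shape (H, W, window, window)
    out = np.zeros(band.shape, dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):                      # slow but explicit
            p = graycomatrix(windows[i, j], distances=[1], angles=[0],
                             levels=levels, symmetric=True, normed=True)[:, :, 0, 0]
            out[i, j] = np.sum(np.arange(levels) * p.sum(axis=1))
    return out

# Hypothetical usage on bands already resampled to the 5 m grid:
# glcm_swir_mean, glcm_nir_mean, glcm_r_mean = (glcm_mean(b) for b in (swir, nir, red))
```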
The slum map in portable document format released by SRA was used as the ground truth data in this study. We transferred the PDF data into shapefile format and performed geo-referencing. The shapefile data were then transformed into a geo-tiff format with a 5 m spatial resolution and the projection of WGS 84/UTM zone 43N (EPSG: 32643) (Figure 2d).
The RGB image, spectral indexes and textural features were divided into non-overlapping image patches, each with a size of 64 × 64 pixels. The ground truth data of the slums were used as the labels for model training and validation and were also divided into patches with a size of 64 × 64 pixels. The label patches are binary, i.e., 0 indicates non-slum and 1 indicates slum.
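A minimal sketch of this tiling step, assuming the inputs have already been co-registered on the 5 m grid; the variable names and the 9-channel stacking are hypothetical.

```python
import numpy as np

def to_patches(array, patch=64):
    """Split an (H, W) or (H, W, C) array into non-overlapping patch x patch tiles,
    discarding incomplete tiles at the right and bottom edges."""
    rows, cols = array.shape[0] // patch, array.shape[1] // patch
    return np.stack([
        array[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
        for r in range(rows) for c in range(cols)
    ])

# Hypothetical usage: stack RGB, the three spectral indexes and the three GLCM means,
# then tile the 9-channel image and the binary slum labels identically.
# x_patches = to_patches(np.dstack([rgb, spectral, textural]))   # (N, 64, 64, 9)
# y_patches = to_patches(labels)                                 # (N, 64, 64), 0 = non-slum, 1 = slum
```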

3.2. Architecture of GASlumNet

The architecture of the proposed GASlumNet is presented in Figure 3. The proposed GASlumNet has two streams, of which one stream (UNet stream) takes the RGB bands as input and the other stream (ConvNeXt stream) takes the spectral and textural features as the input. GASlumNet incorporates ConvNeXt and UNet. Specifically, the UNet encoder learns deep features from the optical bands, while the ConvNeXt learns deep features from the geo-scientific information. A feature-level fusion mechanism is performed to fuse and optimize deep features from the two encoders hierarchically. Then, the UNet decoder generates Output1, and a modified decoder is stacked to the ConvNeXt to generate Output2. Additionally, we introduced a decision-level fusion mechanism to generate the final output by combining Output1 and Output2. The remainder of this section expands on GASlumNet in more detail.

3.2.1. UNet Stream

We used the original UNet architecture proposed by [40] as the UNet stream of GASlumNet. UNet [40] has an encoder (the contracting path) and a decoder (the expansive path). The UNet encoder has five down-sampling stages, of which the last down-sampling stage is used to bridge the encoder and the decoder. Each down-sampling stage consists of two convolution-batch normalization-rectified linear unit (CBR) blocks, and a maximum pooling layer connects every two down-sampling stages. As the encoder goes deeper, the size of the feature maps is halved, and the number of feature channels is doubled. The UNet decoder has four up-sampling stages, of which each has one transposed convolutional layer, two CBR blocks and a skip connection (concatenation). During decoding, feature maps generated by the encoder are fused with the feature maps generated by the decoder at the corresponding level via the skip connection. As the decoder goes deeper, the size of the feature maps is doubled, and the number of feature channels is halved. At last, a 1 × 1 convolutional layer is used to generate the pixel-wise prediction map.
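The following PyTorch sketch illustrates the CBR block and the down-/up-sampling stages described above; it follows the textual description rather than the authors' released code, and channel sizes are left as constructor arguments.

```python
import torch
import torch.nn as nn

class CBR(nn.Module):
    """Convolution-BatchNorm-ReLU block, used twice in every UNet stage."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class DownStage(nn.Module):
    """One UNet down-sampling stage: two CBR blocks; max pooling halves the feature map."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(CBR(in_ch, out_ch), CBR(out_ch, out_ch))
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        feat = self.convs(x)            # kept for the skip connection
        return feat, self.pool(feat)

class UpStage(nn.Module):
    """One UNet up-sampling stage: transposed convolution, skip concatenation, two CBR blocks."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.convs = nn.Sequential(CBR(in_ch, out_ch), CBR(out_ch, out_ch))

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)  # out_ch + out_ch channels = in_ch
        return self.convs(x)
```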

3.2.2. ConvNeXt Stream

The ConvNeXt stream of GASlumNet contains the original ConvNeXt [41] and a modified decoder. ConvNeXt [41] was initially proposed as a pure CNN model with the aim of outperforming the vision transformer [48] by upgrading the traditional ResNet [49] towards the architecture of the Swin Transformer [50]. The original ConvNeXt has one stem block and four down-sampling stages. The stem block consists of a convolutional layer and a layer normalization. Each down-sampling stage has several ConvNeXt blocks. As presented in Figure 4, a ConvNeXt block adopts a depth-wise convolutional layer with a large kernel size, layer normalization, the Gaussian error linear unit (GeLU) and two point-wise convolutional layers. A down-sampling layer consisting of a layer normalization and a 2 × 2 convolutional layer connects every two down-sampling stages. We used a modified UNet decoder as the decoder of the ConvNeXt stream by replacing the CBR blocks with convolution-batch normalization-GeLU (CBG) blocks. Similar to the UNet stream, the ConvNeXt stream utilizes skip connections in the decoder and uses a 1 × 1 convolutional layer to obtain pixel-wise results.
Table 2 and Table 3 depict the structure of the encoders and decoders in our GASlumNet. In the tables, ks denotes the kernel size; s and p denote stride and padding; h, w and c denote the output height, width and number of channels; DS means the down-sampling stage; Maxp means the max pooling layer; Dlayer means the down-sample layer in the ConvNeXt; US means the up-sampling stage; and H and W are the height and width of the input image. It should be noted that the kernel size and stride of the stem block are 2 × 2 and 2 in our GASlumNet, whereas they are 4 × 4 and 4 in the original ConvNeXt proposed in [41]. We made this change to align the feature map sizes for feature fusion between the UNet encoder and the ConvNeXt encoder. The ConvNeXt stream has larger receptive fields than the UNet stream because of the large-kernel depth-wise convolutional layers used in the ConvNeXt; thus, the ConvNeXt stream can capture global semantic features of slums from the medium-resolution geo-scientific information [41], while the UNet stream learns local slum semantics from the high-resolution optical bands [40].
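A simplified PyTorch sketch of a ConvNeXt block and of the modified stem is given below; the layer-scale and stochastic-depth components of the original ConvNeXt are omitted, GroupNorm is used as a stand-in for a channels-first LayerNorm, and the six input channels (three spectral plus three textural features) are an assumption.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """A ConvNeXt block [41]: 7x7 depth-wise convolution, LayerNorm,
    point-wise expansion with GELU, point-wise projection, residual connection."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)             # applied over the channel dimension
        self.pwconv1 = nn.Linear(dim, 4 * dim)    # 1x1 convolution expressed as a linear layer
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                         # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                 # (N, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)
        return shortcut + x

# Modified stem used in GASlumNet: a 2x2 kernel with stride 2 (instead of 4x4/4 in the
# original ConvNeXt), so the ConvNeXt feature maps align with the UNet encoder stages.
stem = nn.Sequential(
    nn.Conv2d(6, 128, kernel_size=2, stride=2),   # 6 input channels are an assumption
    nn.GroupNorm(1, 128),                         # stand-in for a channels-first LayerNorm
)
```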

3.2.3. Feature-Level Fusion with Multi-Scale Attention

The feature-level fusion mechanism is introduced into GASlumNet to fuse the deep features extracted from the optical bands and the geo-scientific information and thereby enhance the model performance. We first performed channel-wise addition on the deep feature maps from the UNet encoder and the ConvNeXt encoder, and then exploited the multi-scale channel attention mechanism (MSCAM) [51] to optimize the fused features. As illustrated in Figure 5, the MSCAM uses a global adaptive average pooling layer to obtain global information and utilizes point-wise convolutional layers to aggregate global and local context. A bottleneck mechanism combines the aggregated multi-scale information and the input features.
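A sketch of the MSCAM-based fusion is shown below, following the description of [51]; the channel-reduction ratio of 4 is an assumption, as the paper does not report it.

```python
import torch
import torch.nn as nn

class MSCAM(nn.Module):
    """Multi-scale channel attention module [51]: a global branch (adaptive average
    pooling + point-wise bottleneck convolutions) and a local branch (point-wise
    bottleneck convolutions) are summed, passed through a sigmoid and used to
    re-weight the input features."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, kernel_size=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1), nn.BatchNorm2d(channels),
        )
        self.local_branch = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1), nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        attn = torch.sigmoid(self.global_branch(x) + self.local_branch(x))
        return x * attn

def fuse(unet_feat, convnext_feat, mscam):
    """Feature-level fusion at one encoder level: channel-wise addition followed by MSCAM."""
    return mscam(unet_feat + convnext_feat)
```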

3.2.4. Decision-Level Fusion and Joint Loss Function

As for the decision fusion mechanism, we used a hyperparameter $\gamma$ to balance the weights between the outputs of the UNet stream and the ConvNeXt stream. During the model training phase, the hyperparameter is used to compute the joint loss $L_J$, defined as:
$$L_J = \gamma L_O + (1 - \gamma) L_G \tag{4}$$
where $L_O$ denotes the loss computed for the UNet stream, which learns deep features from the optical bands, and $L_G$ denotes the loss computed for the ConvNeXt stream, which learns deep features from the geo-scientific information. Thus, the parameters of the streams in GASlumNet are synchronously optimized by minimizing the joint loss. Specifically, $L_O$ and $L_G$ are computed in the same way as in [35], combining the Dice loss $L_{\mathrm{dice}}$ [52] and the weighted cross entropy (WCE) loss $L_{\mathrm{wce}}$ [53] to alleviate the problem of class imbalance:
$$L_O = \delta L_{\mathrm{wce}} + (1 - \delta) L_{\mathrm{dice}} \tag{5}$$
$$L_G = \delta L_{\mathrm{wce}} + (1 - \delta) L_{\mathrm{dice}} \tag{6}$$
where $\delta$, set to 0.4 in this study, balances the two losses.
During the inference phase, the balance parameter $\gamma$ is used to combine the prediction maps generated by the two streams as:
$$\hat{y}^{i,j} = \mathrm{softmax}\left(\gamma\,\hat{y}_O^{i,j} + (1 - \gamma)\,\hat{y}_G^{i,j}\right) \tag{7}$$
where $(i, j)$ indicates the location of a pixel in the input RS image; $\hat{y}_O^{i,j}$ and $\hat{y}_G^{i,j}$ are the logits predicted at pixel $(i, j)$ by the UNet stream and the ConvNeXt stream, respectively; $\hat{y}^{i,j}$ is the final predicted label at pixel $(i, j)$; and $\mathrm{softmax}(\cdot)$ is the softmax function used to generate binary classification labels. The decision fusion helps eliminate errors generated by a single stream and improves robustness.
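The losses and the decision fusion can be sketched as follows; the class weights of the WCE loss and the default $\gamma$ value are illustrative assumptions (the paper reports $\delta$ = 0.4, and its ablation found the best IoU around $\gamma$ = 0.7).

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss on the slum (positive-class) probability."""
    prob = torch.softmax(logits, dim=1)[:, 1]          # (N, H, W)
    inter = (prob * target).sum(dim=(1, 2))
    union = prob.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def stream_loss(logits, target, class_weights, delta=0.4):
    """Equations (5)-(6): weighted cross entropy combined with the Dice loss."""
    wce = F.cross_entropy(logits, target, weight=class_weights)
    return delta * wce + (1 - delta) * dice_loss(logits, target.float())

def joint_loss(logits_o, logits_g, target, class_weights, gamma=0.7):
    """Equation (4): gamma balances the optical (UNet) and geo-scientific (ConvNeXt) streams."""
    return (gamma * stream_loss(logits_o, target, class_weights)
            + (1 - gamma) * stream_loss(logits_g, target, class_weights))

def fuse_predictions(logits_o, logits_g, gamma=0.7):
    """Equation (7): decision-level fusion of the two streams at inference time."""
    fused = gamma * logits_o + (1 - gamma) * logits_g
    return torch.softmax(fused, dim=1).argmax(dim=1)   # 0 = non-slum, 1 = slum

# Hypothetical shapes: two-class logits of size (N, 2, 64, 64), integer labels of size (N, 64, 64).
# weights = torch.tensor([0.3, 0.7])   # class weights are an assumption, not reported in the paper
```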

3.3. Evaluation Metrics

To assess the model performance, we calculated the precision, the recall, the overall accuracy (OA) and IoU based on the classification results. We defined the slum as the positive class and the non-slum as the negative class. Then, the evaluation metrics selected in this study were calculated as follows:
$$\mathrm{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \times 100 \tag{8}$$
$$\mathrm{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \times 100 \tag{9}$$
$$\mathrm{OA} = \frac{\text{true positives} + \text{true negatives}}{\text{true positives} + \text{true negatives} + \text{false positives} + \text{false negatives}} \times 100 \tag{10}$$
$$\mathrm{IoU} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives} + \text{false positives}} \times 100 \tag{11}$$
Specifically, the true positives denote the number of pixels that are classified as slums correctly, the true negatives denote the number of pixels that are classified as non-slums correctly, the false positives denote the number of pixels that are misclassified as slums and the false negatives denote the number of pixels that are misclassified as non-slums. All the metrics calculated range from 0 to 100%. The precision can depict the fraction of correctly identified slums from all the predicted slums. The recall can depict the fraction of correctly identified slums from all the slum ground truth. Higher precision denotes fewer classification errors, and higher recall denotes fewer classification omissions. OA is an overall metric to evaluate performance for separating slums from non-slums, and IoU depicts the similarity between ground truth slums and predicted slums.
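For reference, a direct NumPy transcription of Equations (8)–(11):

```python
import numpy as np

def slum_metrics(pred, truth):
    """Precision, recall, OA and IoU in percent, with slum = 1 as the positive class."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)          # pixels correctly classified as slums
    tn = np.sum(~pred & ~truth)        # pixels correctly classified as non-slums
    fp = np.sum(pred & ~truth)         # pixels misclassified as slums
    fn = np.sum(~pred & truth)         # pixels misclassified as non-slums
    precision = 100 * tp / (tp + fp)
    recall = 100 * tp / (tp + fn)
    oa = 100 * (tp + tn) / (tp + tn + fp + fn)
    iou = 100 * tp / (tp + fn + fp)
    return precision, recall, oa, iou
```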

4. Experiment and Results

4.1. Experimental Settings

As described in Section 3.1, the Jilin-1 RGB image and Sentinel-2 geo-scientific features are divided into patches with a size of 64 × 64 pixels. The total number of patches is 3620, of which 2892 image patches are used for model training and 728 patches for testing. Table 4 presents the experimental settings in this study.
The proposed GASlumNet was formulated based on the architectural principles of UNet and ConvNeXt. Consequently, we designated UNet and ConvNeXt-UNet as our baseline models. GASlumNet and FuseNet both adopt a two-stream architecture, enabling the integration of RGB bands and geo-scientific auxiliary features; hence, we chose FuseNet as the comparative model for assessing GASlumNet’s performance. For fair comparisons, all the models in our experiments were trained from scratch; we did not use model fine-tuning, and thus no pre-trained backbones, such as ConvNeXt-base [41] or VGG16 [54], were used.

4.2. Comparisons of Slum Mapping among Different Methods

Table 5 presents the quantitative results on the testing datasets among different models. Among the comparative models, UNet and ConvNeXt-UNet used RGB images to identify slums, while FuseNet and our GASlumNet used RGB images and geo-scientific features. Higher OA and IoU values indicate that the detected slums are more similar to the ground truth slums. GASlumNet achieved the highest OA and IoU values, showing the best performance on slum classification among these models. Specifically, GASlumNet outperformed the baseline models (UNet, ConvNeXt-UNet) with OA increases of 1.06~1.35% and IoU increases of 4.60~5.97%. GASlumNet outperformed the contrastive model (FuseNet) with an OA increase of 2.25% and an IoU increase of 10.97%.
The precision and recall values can indirectly reflect the classification errors and omissions of the models. GASlumNet obtained the highest precision and recall values, i.e., 72.82% and 74.69%, respectively, indicating that GASlumNet generated fewer omissions and errors than the other models.
Figure 6 shows the slum detection results of Mumbai generated by the baseline models (UNet and ConvNeXt-UNet), the contrastive model (FuseNet) and our GASlumNet. Generally, our GASlumNet obtained more accurate and concise slum boundaries than the other models in Figure 6, although all the models generated false positives and false negatives in varying degrees.

4.3. Patch-Based Accuracy Assessment among Different Methods

We further quantitatively compared the mapping results according to the sizes of the slum patches. Slums are divided into three groups: large slum patches (≥25 ha), medium slum patches (≥5 ha and <25 ha) and small pockets (<5 ha). Table 6 presents the recall values of slums of various sizes detected by different models. UNet, ConvNeXt-UNet and our GASlumNet all exhibited promising performance in identifying slums of a medium or large size, with recall values near or over 80%. However, all the models had varying degrees of difficulty in identifying slums smaller than 5 ha; only GASlumNet identified about half of the small slums.
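One plausible way to reproduce this size-stratified assessment is sketched below: ground-truth slum patches are labelled as connected components, assigned to a size class using the 5 m pixel size (0.0025 ha per pixel), and the pixel recall is computed per class. Whether the authors computed pixel-based or object-based recall within each class is not stated, so this reading is an assumption.

```python
import numpy as np
from scipy import ndimage

def recall_by_size(truth, pred, pixel_size_m=5.0):
    """Group ground-truth slum patches by area (small < 5 ha, medium 5-25 ha,
    large >= 25 ha) and report the pixel recall within each size class."""
    pixel_area_ha = pixel_size_m ** 2 / 10_000            # a 5 m pixel covers 0.0025 ha
    labels, n_patches = ndimage.label(truth.astype(bool))
    counts = {'small': [0, 0], 'medium': [0, 0], 'large': [0, 0]}   # [detected pixels, total pixels]
    pred = pred.astype(bool)
    for k in range(1, n_patches + 1):
        patch = labels == k
        area_ha = patch.sum() * pixel_area_ha
        group = 'large' if area_ha >= 25 else ('medium' if area_ha >= 5 else 'small')
        counts[group][0] += int(np.sum(patch & pred))
        counts[group][1] += int(patch.sum())
    return {g: 100 * tp / total for g, (tp, total) in counts.items() if total > 0}
```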
According to the qualitative comparisons in Figure 7, it can be found that when detecting the large slum patches, all the models can generate fine results (Figure 7A). When detecting medium-sized and small-sized slums, our GASlumNet generated fewer omissions, for instance, as the black dashed ellipses indicate in Figure 7B,D. Our GASlumNet also generated fewer errors, as the black dashed rectangles highlight in Figure 7C,E.

4.4. Land Cover Types of False Positives and False Negatives Generated by Different Methods

To find out which land cover types are easily misclassified as slums, we calculated the area of the ESA land cover types in the false positives and false negatives (Table 7). Land cover types including bare/sparse vegetation and built-up are the most easily confused with slums. Overall, GASlumNet generated fewer omissions and errors when separating slums from these two land cover types.
Figure 8 presents examples of slum classification results under different scenes. It can be found that tree cover pixels and non-slum built-up pixels at the boundaries of slum patches are easily misclassified (the first and fourth columns of Figure 8). ConvNeXt-UNet and FuseNet generated severe omissions when detecting slums in the scene of bare/sparse vegetation areas (the third column of Figure 8), while UNet and ConvNeXt-UNet misclassified a lot of non-slum built-up pixels as slums (the fifth column of Figure 8). GASlumNet performed better in handling these scenarios, which are prone to omissions and errors in classifications.

4.5. Results under Different Ancillary Geo-Scientific Features

Table 8 presents slum mapping accuracies generated by different models utilizing various input features. The results indicate the following:
(1) Overall, GASlumNet consistently achieved the highest slum mapping accuracies across different input feature combinations. Specifically, when utilizing RGB, spectral and textural features, GASlumNet improved the IoU by 2.52%, 3.09% and 10.97% and the OA by 0.35%, 0.29% and 2.25% compared to UNet, ConvNeXt-UNet and FuseNet, respectively. GASlumNet also attained the highest IoU values among all the models.
(2) The incorporation of ancillary geographic features into the models positively impacted performance. With the exception of FuseNet, the models in Table 8 that simultaneously used multiple input features outperformed those using only RGB bands, spectral features or textural features. For instance, when the RGB bands were concatenated with ancillary geographic features (spectral or textural) and fed into the model, UNet and ConvNeXt-UNet achieved higher accuracies than when using only RGB bands.
(3) In comparison to FuseNet, which also employed a dual-stream architecture and multiple input features, GASlumNet consistently exhibited significantly higher precision, recall, OA and IoU values, underscoring the superior effectiveness of GASlumNet over FuseNet.
Table 8. Quantitative results among different models with different input features. For GASlumNet, the input features are listed per stream (UNet/RGB stream vs. ConvNeXt/auxiliary stream).

Models | Input Features | Precision (%) | Recall (%) | OA (%) | IoU (%)
UNet | RGB | 68.54 | 71.47 | 91.99 | 53.81
UNet | Spectral | 70.84 | 62.21 | 91.72 | 49.53
UNet | Textural | 69.36 | 60.74 | 91.37 | 47.89
UNet | RGB, spectral | 76.87 | 67.29 | 93.08 | 55.96
UNet | RGB, textural | 71.49 | 72.02 | 92.59 | 55.95
UNet | RGB, spectral, textural | 72.61 | 70.82 | 92.70 | 55.89
ConvNeXt-UNet | RGB | 67.59 | 70.05 | 91.70 | 52.44
ConvNeXt-UNet | Spectral | 68.12 | 63.52 | 91.35 | 48.96
ConvNeXt-UNet | Textural | 66.53 | 62.27 | 90.98 | 47.42
ConvNeXt-UNet | RGB, spectral | 66.80 | 75.76 | 91.92 | 55.04
ConvNeXt-UNet | RGB, textural | 73.10 | 69.79 | 92.70 | 55.53
ConvNeXt-UNet | RGB, spectral, textural | 74.03 | 68.64 | 92.76 | 55.32
FuseNet | RGB, spectral | 64.14 | 58.76 | 90.32 | 44.23
FuseNet | RGB, textural | 65.02 | 67.80 | 91.03 | 49.68
FuseNet | RGB, spectral, textural | 65.12 | 63.61 | 90.80 | 47.44
GASlumNet | UNet stream: RGB; ConvNeXt stream: spectral | 69.81 | 77.55 | 92.69 | 58.07
GASlumNet | UNet stream: RGB; ConvNeXt stream: textural | 74.05 | 71.98 | 93.05 | 57.48
GASlumNet | UNet stream: RGB, spectral; ConvNeXt stream: textural | 75.02 | 70.04 | 93.10 | 57.25
GASlumNet | UNet stream: RGB, textural; ConvNeXt stream: spectral | 73.83 | 71.68 | 92.98 | 57.16
GASlumNet | UNet stream: RGB; ConvNeXt stream: spectral, textural | 72.82 | 74.69 | 93.05 | 58.41

4.6. Results under Different Balance Parameters

We tested different values of $\gamma \in [0, 1]$ in the joint loss function to measure the contributions of the UNet stream and the ConvNeXt stream to the classification results. The higher $\gamma$ is, the greater the importance of the UNet stream and the input RGB bands; conversely, the lower $\gamma$ is, the greater the importance of the ConvNeXt stream and the input ancillary geographic features. The results generated when $\gamma = 0, 0.1, 0.3, 0.5, 0.7, 0.9, 1$ are presented in Table 9. Specifically, the model reduces to UNet when $\gamma = 1$ and to ConvNeXt-UNet when $\gamma = 0$. Comparing the results under $0 < \gamma < 1$ with those under $\gamma = 0$ or $\gamma = 1$ demonstrates that adding geographic features to the model improves the accuracies; thus, both the RGB bands and the geographic features contribute to the slum detection performance. As $\gamma$ increased, the OA and IoU values first increased and then decreased. The highest precision and OA values were obtained when $\gamma$ was 0.5, while the highest recall and IoU values were achieved when $\gamma$ was 0.7. This suggests that the RGB stream contributes more than the geographic features to the accuracy improvement.

5. Discussion

5.1. Differences from Existing Related Studies

In this study, a geoscience-aware network, named GASlumNet, was proposed to accurately map slums in Mumbai, a city with numerous slums and a large slum population. Previous researchers [30,44,55] have applied RS techniques to map slums in the same study area. These studies were all dedicated to improving the accuracy of slum mapping, but with different focuses. Specifically, the CSSIs [44] were proposed to validate the performance and potential of multispectral medium-resolution imagery for slum mapping. The CCF [55] was validated for its applicability in mapping slums in developing countries. The CNN-based study [30] demonstrated the effectiveness of DL models for slum mapping with both high-resolution and medium-resolution images. Our GASlumNet was proposed to validate the effectiveness of incorporating geoscientific knowledge, including spectral features and textural information, into deep learning algorithms. Whether compared to the DL-based methods (e.g., CNN and CNN transfer learning) or the prior knowledge-based methods (e.g., CSSIs), the proposed GASlumNet achieved better slum classification results when mapping slums across the whole of Mumbai (Table 10). Even when compared to the study [30] using ultra-high-resolution RS imagery (e.g., Pleiades with 0.5 m spatial resolution) and DL methods, GASlumNet obtained an increase of 5.56% in the IoU value. In addition, GASlumNet kept a promising balance between classification omissions and errors, with a smaller difference between the precision and recall values than the results of the CSSIs. The comparisons above demonstrate the advancement of the proposed GASlumNet.

5.2. Performance of GASlumNet

The proposed GASlumNet is characterized by a two-stream architecture with feature-level fusion and decision-level fusion. According to Table 5 and Table 8, the two-stream architecture allowed the model to learn discriminative features from both the RGB bands and prior geoscientific knowledge, which is confirmed by the comparison between GASlumNet and UNet or ConvNeXt-UNet. Additionally, the hierarchical feature-level fusion scheme also contributes to higher-accuracy slum mapping. Specifically, when compared to UNet or ConvNeXt-UNet using both RGB bands and geoscientific knowledge via input-level feature fusion, GASlumNet, using hierarchical feature-level fusion, obtained increases of 2.52–3.09% in the IoU values.
The contrastive model, FuseNet, can also combine the RGB bands and auxiliary geographic information via hierarchical feature-level fusion [36,37]; however, our GASlumNet outperformed FuseNet according to the quantitative and qualitative results. We attribute the superiority of GASlumNet to its structure, which incorporates the advantages of UNet and ConvNeXt through the multi-scale attention mechanism and the decision-level fusion. FuseNet uses two structurally identical VGG-16 encoders to extract features from the RGB bands and the geographic information. In contrast, the ConvNeXt stream of our GASlumNet exploits convolutions with large kernel sizes and large receptive fields to extract global slum features from the geographic indexes, while the UNet stream has relatively smaller receptive fields to extract local slum features from the high-resolution RGB images. Thus, the two streams of GASlumNet complement each other and improve the robustness of the model through feature-level fusion with a multi-scale attention mechanism. In addition, the decision fusion scheme is shown to improve the performance of the two-stream model structure.

5.3. Applicability and Limitations of GASlumNet

The proposed GASlumNet performed better in detecting medium-sized and large-sized slum patches (≥5 ha) than the comparative models according to Table 6; however, when identifying slum pockets smaller than 5 ha, all the models generated omissions. We attribute this to the influence of mixed pixels. Slums of a small size are easily blended with other classes on RS imagery, and these mixed pixels inevitably undermine the mapping results. Nevertheless, GASlumNet correctly identified about half of all small-sized slums, with a recall value of 50.79%, outperforming the other models. It is reasonable to assume that more small-sized slums will be detected correctly by GASlumNet when ultra-high-resolution imagery (<1 m) is used in future work.
Previous prior knowledge-based studies [20,46] generally used textural indexes to separate slums from non-slum built-up areas and vegetation spectral indexes to separate slums from green areas. Referring to these studies, spectral indexes and textural indexes are selected as the geoscientific features to help map slums. However, there are no standard indexes that perfectly identify slums, owing to intra-class variability and inter-class similarity, which cause inevitable classification errors among different land cover types. Land cover types including non-slum built-up and bare/sparse vegetation are easily misclassified as slums, according to Table 7 and Figure 8. For example, dense built-up neighborhoods with little vegetation may be wrongly classified as slums to varying degrees by different models (e.g., the fifth column in Figure 8). The category of bare/sparse vegetation shares similar spectral features with slums because of the small fraction of vegetation and the influence of the soil background in these classes. In this paper, spectral indexes, including NDVI, NDSGI and NDBI, and textural indexes calculated by the GLCM mean were intuitively selected to map slums. The comparisons between GASlumNet and the other models have demonstrated the promising performance of combining geoscientific features and DL models; however, other spectral and textural indexes were not explored owing to space constraints. In addition, spatial structural information can also help with image classification [56]. In future work, graph theory and graph neural networks (GNNs) could be exploited to represent slum features and further improve mapping performance.

6. Conclusions

This study introduces a two-stream network, GASlumNet, aimed at achieving high-accuracy slum mapping. GASlumNet combines UNet and ConvNeXt to capture slum features inherent in both RGB and geographic features. The ConvNeXt stream acquires global slum features from medium-resolution geographic index maps through depth-wise convolutions with large kernel sizes, while the UNet stream focuses on learning local slum features from high-resolution RGB images. Additionally, feature-level fusion and decision-level fusion mechanisms are incorporated to hierarchically bridge the two streams, enhancing the overall model performance.
The experimental results for slum classification in Mumbai, India, demonstrate that GASlumNet achieves the highest overall accuracy (OA) and intersection over union (IoU) values, surpassing UNet, ConvNeXt-UNet and FuseNet. The integration of ancillary geo-scientific information with deep learning (DL) models proves beneficial in enhancing the accuracy of slum mapping. The proposed model serves as a technical reference for accurately mapping slums, contributing to the enrichment of land cover and land use classification schemes in the developing world.

Author Contributions

Conceptualization, W.L.; methodology, W.L.; software, W.L.; validation, Y.H. and F.P.; formal analysis, W.L. and Y.H.; investigation, W.L. and F.P.; writing—original draft preparation, W.L.; writing—review and editing, Y.H., F.P., Z.F. and Y.Y.; visualization, W.L.; supervision, Y.H.; project administration, Y.H.; funding acquisition, Y.H., F.P. and Z.F. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Network Security and Information Program of the Chinese Academy of Sciences (CAS-WX2021SF-0106-02); Z.F. was supported by the Second Tibetan Plateau Scientific Expedition and Research Program (STEP) (20190ZKK1006); Y.H. was supported by the National Natural Science Foundation of China (42130508) and the Key Project of Innovation LREIS (KPI011); F.P. was supported by the National Natural Science Foundation of China (42071389).

Data Availability Statement

The ground truth data of the slums were acquired from the website https://sra.gov.in (accessed on 27 November 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. UN-Habitat. World Cities Report 2020: The Value of Sustainable Urbanization; United Nations Human Settlements Programme: Nairobi, Kenya, 2020. [Google Scholar]
  2. Wirastri, M.V.; Morrison, N.; Paine, G. The Connection between Slums and COVID-19 Cases in Jakarta, Indonesia: A Case Study of Kapuk Urban Village. Habitat Int. 2023, 134, 102765. [Google Scholar] [CrossRef]
  3. Thomson, D.R.; Stevens, F.R.; Chen, R.; Yetman, G.; Sorichetta, A.; Gaughan, A.E. Improving the Accuracy of Gridded Population Estimates in Cities and Slums to Monitor SDG 11: Evidence from a Simulation Study in Namibia. Land Use Policy 2022, 123, 106392. [Google Scholar] [CrossRef]
  4. Maung, N.L.; Kawasaki, A.; Amrith, S. Spatial and Temporal Impacts on Socio-Economic Conditions in the Yangon Slums. Habitat Int. 2023, 134, 102768. [Google Scholar] [CrossRef]
  5. UN-Habitat. The Challenge of Slums: Global Report on Human Settlements, 2003; Routledge: London, UK, 2003. [Google Scholar]
  6. UN-Habitat. Slum Almanac 2015–2016: Tracking Improvement in the Lives of Slum Dwellers. Participatory Slum Upgrading Programme. 2016. Available online: https://unhabitat.org/sites/default/files/documents/2019-05/slum_almanac_2015-2016_psup.pdf. (accessed on 27 November 2023).
  7. United Nations. Transforming Our World: The 2030 Agenda for Sustainable Development; United Nations: New York, NY, USA, 2015. [Google Scholar]
  8. MacTavish, R.; Bixby, H.; Cavanaugh, A.; Agyei-Mensah, S.; Bawah, A.; Owusu, G.; Ezzati, M.; Arku, R.; Robinson, B.; Schmidt, A.M. Identifying Deprived “Slum” Neighbourhoods in the Greater Accra Metropolitan Area of Ghana Using Census and Remote Sensing Data. World Dev. 2023, 167, 106253. [Google Scholar] [CrossRef] [PubMed]
  9. Kuffer, M.; Abascal, A.; Vanhuysse, S.; Georganos, S.; Wang, J.; Thomson, D.R.; Boanada, A.; Roca, P. Data and Urban Poverty: Detecting and Characterising Slums and Deprived Urban Areas in Low-and Middle-Income Countries. In Advanced Remote Sensing for Urban and Landscape Ecology; Springer: Cham, Switzerland, 2023; pp. 1–22. [Google Scholar]
  10. UN-Habitat. Metadata on SDGs Indicator 11.1. 1 Indicator Category: Tier I. UN Human Settlements Program, Nairobi. 2018. Available online: http://unhabitat.org/sites/default/files/2020/06/metadata_on_sdg_indicator_11.1.1.pdf (accessed on 27 November 2023).
  11. Kohli, D.; Sliuzas, R.; Kerle, N.; Stein, A. An Ontology of Slums for Image-Based Classification. Comput. Environ. Urban Syst. 2012, 36, 154–163. [Google Scholar] [CrossRef]
  12. Kohli, D.; Kerle, N.; Sliuzas, R. Local Ontologies for Object-Based Slum Identification and Classification. Environs 2012, 3, 3. [Google Scholar]
  13. Kohli, D.; Sliuzas, R.; Stein, A. Urban Slum Detection Using Texture and Spatial Metrics Derived from Satellite Imagery. J. Spat. Sci. 2016, 61, 405–426. [Google Scholar] [CrossRef]
  14. Badmos, O.S.; Rienow, A.; Callo-Concha, D.; Greve, K.; Jürgens, C. Urban Development in West Africa—Monitoring and Intensity Analysis of Slum Growth in Lagos: Linking Pattern and Process. Remote Sens. 2018, 10, 1044. [Google Scholar] [CrossRef]
  15. Kuffer, M.; Pfeffer, K.; Sliuzas, R.; Baud, I. Extraction of Slum Areas from VHR Imagery Using GLCM Variance. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 1830–1840. [Google Scholar] [CrossRef]
  16. Mudau, N.; Mhangara, P. Mapping and Assessment of Housing Informality Using Object-Based Image Analysis: A Review. Urban Sci. 2023, 7, 98. [Google Scholar] [CrossRef]
  17. Abed, A.D. Urban Upgrading of Slums: Baghdad and London Slums as Study Models for Urban Rehabilitation. Comput. Urban Sci. 2023, 3, 31. [Google Scholar] [CrossRef]
  18. Mahabir, R.; Croitoru, A.; Crooks, A.T.; Agouris, P.; Stefanidis, A. A Critical Review of High and Very High-Resolution Remote Sensing Approaches for Detecting and Mapping Slums: Trends, Challenges and Emerging Opportunities. Urban Sci. 2018, 2, 8. [Google Scholar] [CrossRef]
  19. Kuffer, M.; Wang, J.; Nagenborg, M.; Pfeffer, K.; Kohli, D.; Sliuzas, R.; Persello, C. The Scope of Earth-Observation to Improve the Consistency of the SDG Slum Indicator. ISPRS Int. J. Geo-Inf. 2018, 7, 428. [Google Scholar] [CrossRef]
  20. Trento Oliveira, L.; Kuffer, M.; Schwarz, N.; Pedrassoli, J.C. Capturing Deprived Areas Using Unsupervised Machine Learning and Open Data: A Case Study in São Paulo, Brazil. Eur. J. Remote Sens. 2023, 56, 2214690. [Google Scholar] [CrossRef]
  21. Dewan, A.; Alrasheedi, K.; El-Mowafy, A. Mapping Informal Settings Using Machine Learning Techniques, Object-Based Image Analysis and Local Knowledge. In Proceedings of the International Geoscience and Remote Sensing Symposium (IGARSS), Pasadena, CA, USA, 16–21 July 2023; pp. 7249–7252. [Google Scholar]
  22. Duque, J.C.; Patino, J.E.; Betancourt, A. Exploring the Potential of Machine Learning for Automatic Slum Identification from VHR Imagery. Remote Sens. 2017, 9, 895. [Google Scholar] [CrossRef]
  23. Prabhu, R.; Parvathavarthini, B.; Alagu Raja, R.A. Slum Extraction from High Resolution Satellite Data Using Mathematical Morphology Based Approach. Int. J. Remote Sens. 2021, 42, 172–190. [Google Scholar] [CrossRef]
  24. Brenning, A. Interpreting Machine-Learning Models in Transformed Feature Space with an Application to Remote-Sensing Classification. Mach. Learn. 2023, 112, 3455–3471. [Google Scholar] [CrossRef]
  25. Yuan, Q.; Shen, H.; Li, T.; Li, Z.; Li, S.; Jiang, Y.; Xu, H.; Tan, W.; Yang, Q.; Wang, J. Deep Learning in Environmental Remote Sensing: Achievements and Challenges. Remote Sens. Environ. 2020, 241, 111716. [Google Scholar] [CrossRef]
  26. Hong, D.; Gao, L.; Yokoya, N.; Yao, J.; Chanussot, J.; Du, Q.; Zhang, B. More Diverse Means Better: Multimodal Deep Learning Meets Remote-Sensing Imagery Classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4340–4354. [Google Scholar] [CrossRef]
  27. Li, J.; Hong, D.; Gao, L.; Yao, J.; Zheng, K.; Zhang, B.; Chanussot, J. Deep Learning in Multimodal Remote Sensing Data Fusion: A Comprehensive Review. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102926. [Google Scholar] [CrossRef]
  28. Bergamasco, L.; Bovolo, F.; Bruzzone, L. A Dual-Branch Deep Learning Architecture for Multisensor and Multitemporal Remote Sensing Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 2147–2162. [Google Scholar] [CrossRef]
  29. Wurm, M.; Stark, T.; Zhu, X.X.; Weigand, M.; Taubenböck, H. Semantic Segmentation of Slums in Satellite Images Using Transfer Learning on Fully Convolutional Neural Networks. ISPRS J. Photogramm. Remote Sens. 2019, 150, 59–69. [Google Scholar] [CrossRef]
  30. Verma, D.; Jana, A.; Ramamritham, K. Transfer Learning Approach to Map Urban Slums Using High and Medium Resolution Satellite Imagery. Habitat Int. 2019, 88, 101981. [Google Scholar] [CrossRef]
  31. Stark, T.; Wurm, M.; Zhu, X.X.; Taubenböck, H. Satellite-Based Mapping of Urban Poverty with Transfer-Learned Slum Morphologies. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 5251–5263. [Google Scholar] [CrossRef]
  32. Rehman, M.F.U.; Aftab, I.; Sultani, W.; Ali, M. Mapping Temporary Slums from Satellite Imagery Using a Semi-Supervised Approach. IEEE Geosci. Remote Sens. Lett. 2022, 19, 3512805. [Google Scholar] [CrossRef]
  33. El Moudden, T.; Dahmani, R.; Amnai, M.; Fora, A.A. Slum Image Detection and Localization Using Transfer Learning: A Case Study in Northern Morocco. Int. J. Electr. Comput. Eng. 2023, 13, 3299–3310. [Google Scholar] [CrossRef]
  34. Ge, Y.; Zhang, X.; Atkinson, P.M.; Stein, A.; Li, L. Geoscience-Aware Deep Learning: A New Paradigm for Remote Sensing. Sci. Remote Sens. 2022, 5, 100047. [Google Scholar] [CrossRef]
  35. Lu, W.; Hu, Y.; Zhang, Z.; Cao, W. A Dual-Encoder U-Net for Landslide Detection Using Sentinel-2 and DEM Data. Landslides 2023, 20, 1975–1987. [Google Scholar] [CrossRef]
  36. Audebert, N.; Le Saux, B.; Lefèvre, S. Beyond RGB: Very High Resolution Urban Remote Sensing with Multimodal Deep Networks. ISPRS J. Photogramm. Remote Sens. 2018, 140, 20–32. [Google Scholar] [CrossRef]
  37. Hazirbas, C.; Ma, L.; Domokos, C.; Cremers, D. FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture. In Proceedings of the Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Revised Selected Papers, Part I; Springer: Berlin/Heidelberg, Germany, 2017; pp. 213–228. [Google Scholar]
  38. He, Q.; Sun, X.; Diao, W.; Yan, Z.; Yao, F.; Fu, K. Multimodal Remote Sensing Image Segmentation with Intuition-Inspired Hypergraph Modeling. IEEE Trans. Image Process. 2023, 32, 1474–1487. [Google Scholar] [CrossRef]
  39. Xiong, Z.; Chen, S.; Wang, Y.; Mou, L.; Zhu, X.X. GAMUS: A Geometry-Aware Multi-Modal Semantic Segmentation Benchmark for Remote Sensing Data. arXiv 2023, arXiv:2305.14914. [Google Scholar]
  40. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  41. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A Convnet for the 2020s. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  42. Philpot, W.; Jacquemoud, S.; Tian, J. ND-Space: Normalized Difference Spectral Mapping. Remote Sens. Environ. 2021, 264, 112622. [Google Scholar] [CrossRef]
  43. Zha, Y.; Gao, J.; Ni, S. Use of Normalized Difference Built-up Index in Automatically Mapping Urban Areas from TM Imagery. Int. J. Remote Sens. 2003, 24, 583–594. [Google Scholar] [CrossRef]
  44. Peng, F.; Lu, W.; Hu, Y.; Jiang, L. Mapping Slums in Mumbai, India, Using Sentinel-2 Imagery: Evaluating Composite Slum Spectral Indices (CSSIs). Remote Sens. 2023, 15, 4671. [Google Scholar] [CrossRef]
  45. Haralick, R.M.; Shanmugam, K.; Dinstein, I.H. Textural Features for Image Classification. IEEE Trans. Syst. Man Cybern. 1973, 6, 610–621. [Google Scholar] [CrossRef]
  46. Wurm, M.; Weigand, M.; Schmitt, A.; Geiß, C.; Taubenböck, H. Exploitation of Textural and Morphological Image Features in Sentinel-2A Data for Slum Mapping. In Proceedings of the 2017 Joint Urban Remote Sensing Event (JURSE), Dubai, United Arab Emirates, 6–8 March 2017; pp. 1–4. [Google Scholar]
  47. Kotthaus, S.; Smith, T.E.; Wooster, M.J.; Grimmond, C.S.B. Derivation of an Urban Materials Spectral Library through Emittance and Reflectance Spectroscopy. ISPRS J. Photogramm. Remote Sens. 2014, 94, 194–212. [Google Scholar] [CrossRef]
  48. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  49. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  50. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  51. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional Feature Fusion. In Proceedings of the 2021 IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3560–3569. [Google Scholar]
  52. Milletari, F.; Navab, N.; Ahmadi, S.-A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
  53. Phan, T.H.; Yamamoto, K. Resolving Class Imbalance in Object Detection with Weighted Cross Entropy Losses. arXiv 2020, arXiv:2006.01413. [Google Scholar]
  54. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  55. Gram-Hansen, B.J.; Helber, P.; Varatharajan, I.; Azam, F.; Coca-Castro, A.; Kopackova, V.; Bilinski, P. Mapping Informal Settlements in Developing Countries Using Machine Learning and Low Resolution Multi-Spectral Data. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, Honolulu, HI, USA, 27–28 January 2019; pp. 361–368. [Google Scholar]
  56. Song, X.; Hua, Z.; Li, J. GMTS: GNN-Based Multi-Scale Transformer Siamese Network for Remote Sensing Building Change Detection. Int. J. Digit. Earth 2023, 16, 1685–1706. [Google Scholar] [CrossRef]
Figure 1. Illustration of the study area and ground truth data. (a) The location of Mumbai; (b) slums of Mumbai with the base map of Google Satellite imagery; (c) land cover of Mumbai presented by the European Space Agency (ESA) WorldCover 10 m v100 data; (d) a zoomed-in view of Dharavi with the base map of Google Satellite imagery.
Figure 2. The RGB bands, spectral indexes and textural features used in this study. (a) RGB composite of the Jilin-1 image; (b) the false-color composite of NDVI, NDSGI and NDBI; (c) the false-color composite of GLCM_SWIR mean, GLCM_NIR mean and GLCM_R mean; (d) ground truth of slums.
Figure 3. The architecture of GASlumNet.
Figure 4. A ConvNeXt block [41].
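For readers unfamiliar with the block in Figure 4, the following is a minimal PyTorch sketch of a ConvNeXt block as described in [41] (layer scale and stochastic depth are omitted for brevity); it is illustrative and not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Minimal ConvNeXt block: 7x7 depthwise conv -> LayerNorm -> 1x1 expansion ->
    GELU -> 1x1 projection, wrapped in a residual connection."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)            # normalizes over the channel dimension
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # 1x1 conv expressed as a linear layer
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # (N, C, H, W) -> (N, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to (N, C, H, W)
        return shortcut + x
```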
Figure 5. The MSCAM [51].
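Similarly, the snippet below sketches the multi-scale channel attention module (MS-CAM) of [51], in which a global (pooled) branch and a local point-wise bottleneck are summed and passed through a sigmoid to re-weight the input feature map. The reduction ratio r is an assumed hyperparameter; the exact fusion wiring in GASlumNet follows the architecture in Figure 3.

```python
import torch
import torch.nn as nn

class MSCAM(nn.Module):
    """Sketch of the multi-scale channel attention module described in [51]."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        mid = channels // r

        def bottleneck() -> nn.Sequential:
            # point-wise conv -> BN -> ReLU -> point-wise conv -> BN
            return nn.Sequential(
                nn.Conv2d(channels, mid, kernel_size=1),
                nn.BatchNorm2d(mid),
                nn.ReLU(inplace=True),
                nn.Conv2d(mid, channels, kernel_size=1),
                nn.BatchNorm2d(channels),
            )

        self.local_att = bottleneck()
        self.global_att = nn.Sequential(nn.AdaptiveAvgPool2d(1), bottleneck())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.sigmoid(self.local_att(x) + self.global_att(x))
        return x * weights
```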
Figure 6. Slum maps of Mumbai generated by UNet (a), ConvNeXt-UNet (b), FuseNet (c) and GASlumNet (d).
Figure 7. Examples of classification results for slums of different sizes among the baseline models (UNet and ConvNeXt-UNet), the contrastive model (FuseNet) and the proposed GASlumNet. FP: false positive, TP: true positive, FN: false negative. (A) scenes containing large slums; (B) scenes containing large and medium slums; (C,D) scenes containing large and small slums; (E) scenes containing small slum pockets.
Figure 8. Examples of classification results under different landcover scenes. FP: false positives, TP: true positives, FN: false negatives.
Table 1. Examples of GSO.

Level | Indicators | Observation in Slums | Features
Environment | Location | Hazardous and flood-prone areas; close to railways, highways and major roads; close to water areas; some on steep slopes | Association: distance to rivers and roads
Environment | Neighborhood characteristics | Close to the CBD, middle/high socioeconomic-status areas and industrial areas | Association: distance to socioeconomic-status areas
Settlement | Shape | Generally irregular; elongated formation following a river or railway | Geometry
Settlement | Density | Highly compact; high roof coverage; low vegetation/open-space coverage | Texture
Object | Access network | Generally unpaved, narrow, irregular roads and footpaths | Geometry/spectral features
Object | Building characteristics | Roofs of iron sheet, asbestos, plastic, fiber or clay tiles; less bright than formal settlements; small building size | Spectral/morphological features
Table 2. The encoder structures of the two streams in GASlumNet (k = kernel size; s = stride; p = padding; h, w, c = output height, width and channels).

UNet stream:
Stage | k | s, p | h, w, c
DS1 | 3 × 3 | 1, 1 | H, W, 64
Maxp | 2 × 2 | 2, 0 | H/2, W/2, 64
DS2 | 3 × 3 | 1, 1 | H/2, W/2, 128
Maxp | 2 × 2 | 2, 0 | H/4, W/4, 128
DS3 | 3 × 3 | 1, 1 | H/4, W/4, 256
Maxp | 2 × 2 | 2, 0 | H/8, W/8, 256
DS4 | 3 × 3 | 1, 1 | H/8, W/8, 512
Maxp | 2 × 2 | 2, 0 | H/16, W/16, 512
DS5 | 3 × 3 | 1, 1 | H/16, W/16, 1024

ConvNeXt stream:
Stage | k | s, p | h, w, c
Stem | 2 × 2 | 2, 0 | H/2, W/2, 128
DS1 | [7 × 7, 1 × 1, 1 × 1] × 3 | [1, 3], [1, 0], [1, 0] | H/2, W/2, 128
Dlayer1 | 2 × 2 | 2, 0 | H/4, W/4, 256
DS2 | [7 × 7, 1 × 1, 1 × 1] × 3 | [1, 3], [1, 0], [1, 0] | H/4, W/4, 256
Dlayer2 | 2 × 2 | 2, 0 | H/8, W/8, 512
DS3 | [7 × 7, 1 × 1, 1 × 1] × 27 | [1, 3], [1, 0], [1, 0] | H/8, W/8, 512
Dlayer3 | 2 × 2 | 2, 0 | H/16, W/16, 1024
DS4 | [7 × 7, 1 × 1, 1 × 1] × 3 | [1, 3], [1, 0], [1, 0] | H/16, W/16, 1024
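To make the shape bookkeeping in Table 2 concrete, the sketch below reproduces the first two stages of the UNet stream, assuming the conventional double 3 × 3 convolution with batch normalization and ReLU per DS stage (an assumption; Table 2 lists only kernel, stride and padding). The ConvNeXt stream interleaves the Stem/Dlayer downsampling layers with stacks of blocks like the one sketched after Figure 4.

```python
import torch
import torch.nn as nn

def unet_stage(in_ch: int, out_ch: int) -> nn.Sequential:
    """One DS stage of the UNet stream in Table 2: two 3x3 conv-BN-ReLU layers
    (stride 1, padding 1) that keep the spatial size and change the channel count."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # the Maxp rows: halve H and W

x = torch.randn(1, 3, 256, 256)                # H x W optical input (assumed 3 bands)
f1 = unet_stage(3, 64)(x)                      # -> H,   W,   64   (DS1)
f2 = unet_stage(64, 128)(pool(f1))             # -> H/2, W/2, 128  (DS2)
```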
Table 3. The decoder structure of GASlumNet (k = kernel size; s = stride; p = padding; h, w, c = output height, width and channels).

Stage | Layer | k | s, p | h, w, c
US1 | ConvT1 | 2 × 2 | 2, 0 | H/8, W/8, 512
US1 | CBR/CBG | 3 × 3 | 1, 1 | H/8, W/8, 512
US2 | ConvT2 | 2 × 2 | 2, 0 | H/4, W/4, 256
US2 | CBR/CBG | 3 × 3 | 1, 1 | H/4, W/4, 256
US3 | ConvT3 | 2 × 2 | 2, 0 | H/2, W/2, 128
US3 | CBR/CBG | 3 × 3 | 1, 1 | H/2, W/2, 128
US4 | ConvT4 | 2 × 2 | 2, 0 | H, W, 64
US4 | CBR/CBG | 3 × 3 | 1, 1 | H, W, 64
- | 1 × 1 Conv | 1 × 1 | 1, 1 | H, W, 2
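A minimal sketch of one upsampling stage in Table 3 follows, assuming that CBR/CBG denotes convolution–batch normalization–ReLU/GELU; skip connections and the fusion modules of Figure 3 are omitted here for brevity.

```python
import torch
import torch.nn as nn

def up_stage(in_ch: int, out_ch: int, use_gelu: bool = False) -> nn.Sequential:
    """One US stage of Table 3: a 2x2 transposed convolution (stride 2) that doubles
    H and W, followed by a 3x3 conv + BN + activation (CBR or CBG)."""
    act = nn.GELU() if use_gelu else nn.ReLU(inplace=True)
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        act,
    )

x = torch.randn(1, 1024, 16, 16)      # H/16 x W/16 bottleneck features
y = up_stage(1024, 512)(x)            # -> H/8 x W/8 x 512 (US1)
```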
Table 4. The experimental settings.

Group | Item | Setting
Hyperparameters | No. of categories | 2
Hyperparameters | Balance parameter γ | 0.7
Model training | Batch size | 64
Model training | Epochs | 100
Model training | Optimizer | Adam
Model training | Learning rate | 1 × 10−3
Model training | Weight decay | 5 × 10−4
Experimental environment | System | Windows 10
Experimental environment | Language | Python
Experimental environment | Framework | PyTorch 1.11.0
Experimental environment | CPU | Intel(R) Xeon(R) Silver 4210R with 64 GB memory
Experimental environment | GPU | NVIDIA GeForce RTX 3090 with 24 GB memory
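The settings in Table 4 translate into a straightforward training configuration; the sketch below shows only the optimizer and hyperparameters, with a trivial placeholder standing in for GASlumNet and the data pipeline.

```python
import torch

# Minimal training-configuration sketch mirroring Table 4.
model = torch.nn.Conv2d(3, 2, kernel_size=1)  # placeholder standing in for GASlumNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)

BATCH_SIZE = 64    # Table 4: batch size
NUM_EPOCHS = 100   # Table 4: epochs
NUM_CLASSES = 2    # slum / non-slum
GAMMA = 0.7        # balance parameter
```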
Table 5. Classification accuracies on the testing dataset among different models.

Models | Precision (%) | Recall (%) | OA (%) | IoU (%)
UNet | 68.54 | 71.47 | 91.99 | 53.81
ConvNeXt-UNet | 67.59 | 70.05 | 91.70 | 52.44
FuseNet | 65.12 | 63.61 | 90.80 | 47.44
GASlumNet | 72.82 | 74.69 | 93.05 | 58.41
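The accuracies reported in Tables 5–7 can be computed from the binary confusion matrix, assuming the conventional pixel-wise definitions of precision, recall, overall accuracy (OA) and intersection over union (IoU), as sketched below.

```python
import numpy as np

def slum_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    """Pixel-wise metrics for the slum class (binary masks, 1 = slum), in percent."""
    tp = np.sum((pred == 1) & (truth == 1))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    tn = np.sum((pred == 0) & (truth == 0))
    return {
        "Precision": 100.0 * tp / (tp + fp),
        "Recall":    100.0 * tp / (tp + fn),
        "OA":        100.0 * (tp + tn) / (tp + fp + fn + tn),
        "IoU":       100.0 * tp / (tp + fp + fn),
    }
```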
Table 6. Recalls (%) of detected slums of different sizes.

Models | Small Slum Pockets (<5 ha) | Medium Slum Patches (5~25 ha) | Large Slum Patches (≥25 ha)
UNet | 47.88 | 79.99 | 82.47
ConvNeXt-UNet | 46.01 | 79.10 | 80.75
FuseNet | 41.61 | 69.63 | 76.39
GASlumNet | 50.79 | 83.36 | 85.81
Table 7. Area of main land cover types misclassified as slums on the testing dataset. FP = false positives of slum classification results; FN = false negatives of slum classification results. Land cover columns follow the ESA WorldCover classes.

FP/FN | Models | Built-Up (ha) | Bare/Sparse Vegetation (ha) | Tree Cover (ha) | Water (ha) | Others (ha)
FP | UNet | 216.34 | 81.39 | 15.41 | 1.87 | 4.40
FP | ConvNeXt-UNet | 232.68 | 75.90 | 11.76 | 1.75 | 4.74
FP | FuseNet | 223.48 | 88.02 | 12.85 | 0.27 | 7.18
FP | GASlumNet | 183.92 | 73.07 | 10.04 | 0.63 | 3.27
FN | UNet | 145.42 | 99.83 | 27.08 | 1.63 | 5.51
FN | ConvNeXt-UNet | 154.59 | 102.95 | 28.88 | 1.51 | 3.83
FN | FuseNet | 193.57 | 122.59 | 32.18 | 1.87 | 4.18
FN | GASlumNet | 119.92 | 92.58 | 28.60 | 1.63 | 3.69
Table 9. Quantitative results under different balance parameters.

γ | Precision (%) | Recall (%) | OA (%) | IoU (%)
0 | 67.59 | 70.05 | 91.70 | 52.44
0.1 | 75.03 | 70.67 | 93.10 | 57.21
0.3 | 75.80 | 70.30 | 93.19 | 57.41
0.5 | 76.28 | 70.49 | 93.28 | 57.81
0.7 | 72.82 | 74.69 | 93.05 | 58.41
0.9 | 73.00 | 73.98 | 93.03 | 58.09
1 | 68.54 | 71.47 | 91.99 | 53.81
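Table 9 shows that γ = 0 and γ = 1 reproduce the ConvNeXt-UNet and UNet results of Table 5, which suggests that γ interpolates between the two streams at the decision level. The sketch below illustrates one plausible γ-weighted dual-stream objective under that reading; it is an assumption for illustration only, not the authors' verified loss, which may additionally combine Dice [52] and weighted cross-entropy [53] terms.

```python
import torch
import torch.nn.functional as F

def balanced_dual_stream_loss(logits_unet: torch.Tensor,
                              logits_convnext: torch.Tensor,
                              target: torch.Tensor,
                              gamma: float = 0.7) -> torch.Tensor:
    """Hypothetical reading of the balance parameter: gamma weights the UNet-stream
    term against the ConvNeXt-stream term, so gamma = 1 trains the UNet stream only
    and gamma = 0 the ConvNeXt stream only (consistent with Table 9)."""
    loss_unet = F.cross_entropy(logits_unet, target)
    loss_convnext = F.cross_entropy(logits_convnext, target)
    return gamma * loss_unet + (1.0 - gamma) * loss_convnext
```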
Table 10. Slum mapping accuracies for the whole of Mumbai across different studies. CSSIs = composite slum spectral indexes; CCF = canonical correlation forest; CNN = convolutional neural network.

Methods | RS Imagery (Spatial Resolution) | Precision (%) | Recall (%) | OA (%) | IoU (%)
CSSIs (threshold-based) [44] | Sentinel-2 (10 m) | 63.86 | 58.38 | - | 43.89
CSSIs (ML-based) [44] | Sentinel-2 (10 m) | 61.56 | 82.50 | - | 54.45
CCF [55] | Sentinel-2 (10 m) | - | - | - | 40.30
CNN transfer learning [30] | Pleiades (0.5 m), Sentinel-2 (10 m) | - | - | 92.00 | 43.20
CNN [30] | Pleiades (0.5 m) | - | - | 94.30 | 58.30
GASlumNet | Jilin-1 (5 m), Sentinel-2 (10 m) | 76.13 | 79.85 | 94.89 | 63.86
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
