Article

Urban Informal Settlement Classification via Cross-Scale Hierarchical Perception Fusion Network Using Remote Sensing and Street View Images

School of Computer Science, China University of Geosciences, Wuhan 430074, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(23), 3841; https://doi.org/10.3390/rs17233841
Submission received: 11 October 2025 / Revised: 17 November 2025 / Accepted: 19 November 2025 / Published: 27 November 2025
(This article belongs to the Section Urban Remote Sensing)

Highlights

What are the main findings?
  • We proposed PanFusion-Net, a cross-modal fusion framework that combines multi-scale remote sensing structures with fine-grained street-view information and employs dual multi-linear pooling to strengthen high-order interactions and deep semantic fusion across heterogeneous modalities.
  • The proposed method achieved a consistently superior performance on the WuhanUIS dataset we constructed, as well as on the ChinaUIS and S²UV datasets.
What are the implications of the main findings?
  • This work provides urban planners with automated and highly accurate tools for identifying informal settlements.
  • The proposed approach establishes a new technical paradigm for cross-modal geospatial analyses and can be extended to a broader range of monitoring applications.

Abstract

Urban informal settlements (UISs), characterized by self-organized housing, a high population density, inadequate infrastructure, and insecure land tenure, constitute a critical, yet underexplored, aspect of contemporary urbanization. They necessitate scholarly scrutiny to tackle pressing challenges pertaining to equity, sustainability, and urban governance. The automated, accurate, and rapid extraction of UISs is of paramount importance for sustainable urban development. Despite its significance, this process encounters substantial obstacles. Firstly, from a remote sensing standpoint, informal settlements are typically characterized by a low elevation and a high density, giving rise to intricate spatial relationships. Secondly, the remote sensing observational features of these areas are often indistinct due to variations in shooting angles and imaging environments. Prior studies in remote sensing and geospatial data analysis have often overlooked the cross-modal interactions of features, as well as the progressive information encoded in the intrinsic hierarchies of each modality. To address these problems, we introduce a spatial network that combines panoramic and coarse-to-fine asymptotic perspectives, using remote sensing images and urban street view images to support a hierarchical analysis through fusion. Specifically, we utilized a multi-linear pooling technique and then integrated coarse-to-fine-grained and panoramic viewpoint details within a unified structure, the panoramic fusion network (PanFusion-Net). Comprehensive testing was performed on a self-constructed WuhanUIS dataset as well as two open-source datasets, ChinaUIS and S²UV. The experimental results confirmed that PanFusion-Net outperformed all comparative models across all of the above datasets.

1. Introduction

Urban informal or unplanned settlements have long been a focus of policymakers, non-governmental organizations (NGOs), and academia. Urban informal residential areas typically have problems such as a high population density, inadequate facilities, lagging functions, deteriorating spatial quality, and weak governance capabilities, reflecting imbalances and inadequacies in urban development [1]. The identification of informal residential areas is crucial for understanding the socio-economic activities and spatial relationships within these areas, and for analyzing their causes, governance models, and the effectiveness of governance mechanisms [2]. This is especially true in developing countries, where these areas are often severely damaged by catastrophic events to an almost irreparable extent. Currently, nearly one billion people reside in urban informal settlements across the globe [3]. Sudden disasters often overwhelm the response capacities of these settlements, leading to major disruptions within the community [4]. In the process of advancing the United Nations Sustainable Development Goals (SDG 11) [5,6], detailed mapping of the geographical and demographic characteristics of informal urban settlements is essential. Therefore, the rapid and accurate identification of informal residential areas plays a vital role in sustainable urban development, urban renewal, and the early warning of disasters. Urban remote sensing is a pivotal branch within the broader remote sensing (RS) domain [7,8]. It primarily utilizes RS technology to obtain urban information [9], monitor dynamic processes, understand underlying mechanisms, predict future trends, and support urban planning [10,11,12], disaster prevention, and sustainable development decision-making.
However, due to the limited information provided by RS technology (such as spectral, texture, and temporal information), the use of RS technology alone cannot adequately capture the intricate complexity and extensive diversity of urban functional patterns [13,14]. The rapid evolution of information and communication technologies has unlocked unprecedented access to geospatial big data [15,16,17,18] for use in land use mapping, the classification of informal settlements using drone images [19], and the analysis of urban lighting landscapes using nighttime street-view images [20].
Street view images (SVIs) have gained considerable momentum in urban studies in recent years [21]. They have become an invaluable resource for capturing geospatial data and conducting urban analyses, yielding rich insights and supporting data-driven decisions [21,22].
The precise identification and analysis of urban informal settlements (UISs) via remote sensing images (RSIs) or other geospatial data [23,24,25] face three principal challenges, notwithstanding their critical importance for urban studies, planning, and assessing global living conditions. Firstly, these settlements exhibit distinct morphological characteristics marked by a low-rise, high-density morphology. Predominantly comprising informally constructed structures without regulatory oversight, the absence of infrastructure and building codes results in predominantly single-story or low-rise buildings. This contrasts sharply with formal urban high-rises. Extreme population pressure on limited land resources forces compact dwelling arrangements, generating intricate spatial relationships [26,27,28]. Narrow, winding alleys and minimal open spaces cause structures to visually merge in imagery, hindering individual building delineation. Secondly, single satellite acquisition angles can cause the obscuration, distortion, or foreshortening of structures, impeding accurate dimension measurements and boundary identification. Furthermore, the image quality is adversely affected by environmental factors, notably persistent cloud cover in tropical regions, as well as atmospheric haze and variable illumination conditions. Haze reduces contrast, complicating the discrimination of surface features like roads versus buildings. Thirdly, limitations in multimodal data analyses persist. While complementary data sources (e.g., optical imagery, radar data, elevation models) offer the potential for comprehensive characterization, the current methodologies inadequately exploit intermodal synergies and intramodal correlations [29]. A significant shortcoming in the current methodologies is the ineffective fusion of optical and radar data. While optical imagery provides textural and appearance-based information, radar data contributes structural and elevation insights. However, their disparate analysis hinders the exploitation of their complementary strengths. Within single modalities, such as optical imagery, spectral band correlations (e.g., between visible RGB bands and the near-infrared band) remain underutilized. These correlations could yield valuable insights into vegetation cover, building materials, and water bodies, but are frequently overlooked in the existing approaches. Consequently, significant obstacles impede the accurate RS-based identification and analysis of informal settlements.
In this work, to address the aforementioned problems, we propose a hierarchical network that integrates panoramic and coarse-to-fine asymptotic perspectives, leveraging RSIs and SVIs to support a hierarchical analysis within and across these modalities. The network combines feature extraction using a multi-scale pyramid ResNet18 with feature fusion using multi-linear pooling [30], which mines the features among the levels within a single modality and integrates the correlation features among different modalities, thereby enhancing UIS interpretation. More specifically, we engineered a multi-linear pooling feature fusion module that incorporates bi-linear and tri-linear relationships between distinct layers; this module is designed to learn highly discriminative representations for both cross-scale, hierarchically perceived progressive RSIs and panoramic SVIs. The mapping simulates the human observation and recognition pattern from far to near and from left to right. Moreover, we integrated the multi-linear pooling-based feature fusion module with the multi-scale pyramid ResNet18 feature-extraction module to address the UIS problem. With this approach, we achieved superior results with a dual multi-linear pooling hierarchy model (D-MLPH), namely the panoramic fusion network (PanFusion-Net).
Our key contribution is the proposed panoramic fusion network (PanFusion-Net). This novel framework is designed to hierarchically align multi-level morphological features from aerial imagery with fine-grained facade semantics from street-view images, while simultaneously modeling high-order interactions between the two modalities. By integrating the macroscopic “bird’s-eye” perspective with the granular “on-the-ground” view, PanFusion-Net generates a holistic and discriminative representation that significantly enhances the robustness of urban informal settlement (UIS) identification.
To assess the efficacy of the proposed methodology, three datasets were used for UIS classification: WuhanUIS, ChinaUIS, and S²UV. The experimental results across all three datasets demonstrated that the proposed PanFusion-Net significantly outperformed the other advanced UIS classification methods. The principal contributions of this paper are outlined below:
  • For precise UIS classification, we proposed a novel multimodal PanFusion-Net designed to fully integrate multi-level and inter-level features from RSIs and SVIs.
  • The proposed D-MLPH employs feature extraction using multi-scale pyramid ResNet18 to mimic the human eye’s coarse-to-fine progressive perspective from distant to near views and the panoramic perspective from left to right, efficiently extracting feature information from multi-resolution data sources.
  • The proposed PanFusion-Net incorporates novel dual-feature fusion using a multi-linear pooling structure to jointly fuse and integrate modality-specific and cross-modality hierarchical features from both progressive RSIs and panoramic SVIs, dramatically boosting UIS mapping.
This paper proceeds as follows: Section 2 systematically elaborates on the proposed cross-scale hierarchical perception fusion network architecture, which integrates panoramic and asymptotic visual modalities. Section 3 details the experimental datasets and a rigorous performance analysis. Section 4 synthesizes the key findings and addresses the study’s limitations. Conclusive remarks with future research directions are presented in Section 5.

2. Panoramic Fusion Network (PanFusion-Net)

The concept of coarse-to-fine optimization was initially adopted to address large-scale document problems in natural language processing [31,32]. Due to the significant time constraints involved in retrieving information from massive document collections, researchers have proposed coarse-to-fine retrieval strategies to improve efficiency [33,34]. Over time, this paradigm has also been extended to various image recognition tasks, where hierarchical processing and multiscale modeling have proven to be highly effective for handling complex visual patterns [35,36]. Such a hierarchical optimization framework enables the system to progressively refine the search domain and concentrate analytical efforts on the most relevant regions. However, its application in remote sensing remains limited.
In this study, we extend this idea to UIS recognition in multimodal scenes [7,37,38]. In the coarse stage, the model identifies potential UIS regions from low- to high-resolution satellite views and assigns initial labels. At a finer scale, these labels are further refined using multi-angle street-view imagery, while maintaining left-to-right spatial consistency across the panoramic scenes. This progressive strategy also aligns with the way humans observe and recognize.

2.1. Network Architecture (PanFusion-Net)

Two multi-linear pooling (MLP) models [39,40,41] were used as classifiers, and the entire model is referred to as the dual multi-linear pooling method based on a hierarchical framework (D-MLPH), as shown in Figure 1. It can be seen that three RS datasets with different resolutions and three image datasets with different orientations were associated with three different classification tasks, using ResNet-18 [42] as a deep feature extractor. The four directional street-view images (0°, 90°, 180°, and 270°) are processed independently through separate D-MLPH branches to extract direction-specific features. However, these directional features are subsequently fused through fully connected layers with remote sensing features, which optimizes the overall classification performance, but limits the ability to attribute classification decisions to specific directional inputs. Notably, feature4_0, feature4_1, and feature4_2 serve as the three principal extraction stages (i.e., α, β, and γ), since, in contrast to earlier layers, they encapsulate richer semantic cues.
Let the loss computed at each resolution be defined as follows:
$$\mathcal{L}_{\text{high}} = \text{loss}(I_{\text{high}}),$$
$$\mathcal{L}_{\text{medium}} = \text{loss}(I_{\text{medium}}),$$
$$\mathcal{L}_{\text{low}} = \text{loss}(I_{\text{low}}).$$
In total, the overall loss of the proposed D-MLPH model is formulated as
$$\mathcal{L}_{\text{full}} = \mathcal{L}_{\text{high}} + \mathcal{L}_{\text{medium}} + \mathcal{L}_{\text{low}}.$$
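A minimal PyTorch sketch of this objective is given below. The use of cross-entropy as the per-resolution loss and the logits-plus-label calling convention are assumptions; the text only specifies that one loss term is computed per resolution branch and that the three terms are summed.

```python
import torch.nn as nn

# Assumed per-resolution loss: cross-entropy over the UIS / other classes.
criterion = nn.CrossEntropyLoss()

def dmlph_loss(logits_high, logits_medium, logits_low, labels):
    """Sum the three resolution-specific losses into the overall D-MLPH objective."""
    loss_high = criterion(logits_high, labels)      # high-resolution branch
    loss_medium = criterion(logits_medium, labels)  # medium-resolution branch
    loss_low = criterion(logits_low, labels)        # low-resolution branch
    return loss_high + loss_medium + loss_low
```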
This closely mirrors human and non-human primate vision [43]: the global scene structure is grasped before the finer details are resolved, following a far-to-near, coarse-to-fine, left-to-right progression [44]. For instance, neurons in the macaque inferior temporal cortex exhibit activation during the encoding phase of face perception [45,46,47]. This process initiates with holistic face categorization, followed by the encoding of finer details like identity or expression [48,49].

2.2. Feature Extraction Using Multi-Scale Pyramid ResNet18 (FE-ResNet18-FPN)

In the proposed D-MLPH framework, feature extraction [50] is a critical step for obtaining rich and discriminative representations from both RSIs and SVIs. For RSIs, we first generated three versions with different spatial resolutions to capture information at multiple scales. We adopted ResNet-18 [42] as the backbone feature extractor for all input modalities. For each input image, we extracted feature maps from three different blocks within the last stage (stage 4) of ResNet-18, denoted as feature4_0, feature4_1, and feature4_2. All these feature maps reside in the network’s deeper tiers and distill high-level semantics at varying depths within the final stage.
To amplify their expressiveness, every feature map is first mapped into a high-dimensional embedding via 1 × 1 point-wise convolutions [51]. This projection enables the subsequent multi-linear pooling module to effectively model complex interactions between different deep features. By leveraging multiple deep features from the same stage, the model can better capture subtle differences and complementary information, which is crucial for distinguishing between UIS and non-UIS (other) areas.
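The sketch below illustrates this extraction step under explicit assumptions: a torchvision-style ResNet-18 (weights API of torchvision ≥ 0.13), three deep residual blocks tapped via forward hooks (the exact block choices, the last block of layer3 and the two blocks of layer4, are illustrative, since torchvision's stage 4 contains only two blocks), and 1 × 1 convolutions that project each map to 8192 channels as in Algorithm 1 below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class DeepFeatureExtractor(nn.Module):
    """Captures three deep feature maps from a ResNet-18 backbone and projects
    each to a high-dimensional embedding with a 1x1 convolution. The tapped
    blocks are an illustrative assumption; the 8192-channel projection follows
    Algorithm 1."""

    def __init__(self, embed_dim=8192):
        super().__init__()
        self.backbone = resnet18(weights=None)
        self._maps = []
        # Forward hooks collect the outputs of three deep residual blocks.
        for block in (self.backbone.layer3[-1],
                      self.backbone.layer4[0],
                      self.backbone.layer4[1]):
            block.register_forward_hook(lambda m, inp, out: self._maps.append(out))
        # Lazy 1x1 convolutions infer their input channel counts on first use.
        self.proj = nn.ModuleList(nn.LazyConv2d(embed_dim, kernel_size=1)
                                  for _ in range(3))

    def forward(self, x):
        self._maps.clear()
        self.backbone(x)  # the hooks populate self._maps in execution order
        # Bring all maps to a common spatial size before projection.
        h, w = self._maps[-1].shape[-2:]
        maps = [F.adaptive_avg_pool2d(m, (h, w)) for m in self._maps]
        return [proj(m) for proj, m in zip(self.proj, maps)]
```

Calling the extractor on a (B, 3, 256, 256) batch returns three projected maps that can be passed to the fusion module described in Section 2.3.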

2.3. Feature Fusion Using Multi-Linear Pooling (FF-MLP)

Bilinear-pooling architectures [52] originally tackled fine-grained recognition, in which the objective is to discriminate subtle subclasses that share a common high-level visual category [53]. Owing to minute inter-class discrepancies and a high sensitivity to the pose, viewing angle, and object placement, the task remains highly challenging.
Consequently, the differences between classes are often smaller than the differences within classes. The identification of informal settlements shares this characteristic: the formal and informal settlements we discuss are both located in urban areas and belong to the category of urban land, and both fall within the category of residential areas. Bi-linear pooling computes the outer product of features at distinct spatial locations and pools the resultant matrices through averaging to generate bi-linear representations. This product encodes pairwise channel correlations while remaining invariant to spatial shifts [39,54]. Relative to its linear counterparts, bi-linear pooling yields more expressive features and admits full end-to-end training, delivering an accuracy that rivals or surpasses part-based methods.
For an image L at position l with two features $f_A(l, L) \in \mathbb{R}^{T \times M}$ and $f_B(l, L) \in \mathbb{R}^{T \times N}$, where M and N are the numbers of feature channels, the two features are bi-linearly fused at the same position to obtain matrix b:
$$b(l, L, f_A, f_B) = f_A(l, L)^{T} f_B(l, L) \in \mathbb{R}^{M \times N},$$
Sum pooling [55] is then applied across all positions of b to obtain matrix $\delta$:
$$\delta(L) = \sum_{l} b(l, L, f_A, f_B) \in \mathbb{R}^{M \times N},$$
The bi-linear representation x is generated by vectorizing matrix $\delta$:
$$x = \mathrm{vec}(\delta(L)) \in \mathbb{R}^{MN \times 1},$$
The feature vector x is first processed by signed square-root (matrix) normalization, producing an intermediate vector y. Subsequently, y undergoes $L_2$ normalization, yielding the final fused feature representation z:
$$y = \mathrm{sign}(x) \odot \sqrt{|x|} \in \mathbb{R}^{MN \times 1},$$
$$z = \frac{y}{\| y \|_2} \in \mathbb{R}^{MN \times 1},$$
For standard image features, T = 1, and M and N are the numbers of channels of the two features, respectively. At position l, the two features can therefore be written as column vectors:
$$a_l = f_A^{T}(l, L) \in \mathbb{R}^{M \times 1}, \quad b_l = f_B^{T}(l, L) \in \mathbb{R}^{N \times 1},$$
Then
$$\delta(L) = \sum_{l} a_l b_l^{T} \in \mathbb{R}^{M \times N}.$$
Letting
$$A = [a_1, \ldots, a_L] \in \mathbb{R}^{M \times L}, \quad B = [b_1, \ldots, b_L] \in \mathbb{R}^{N \times L},$$
we obtain
$$\delta(L) = A B^{T} \in \mathbb{R}^{M \times N}.$$
The authors of [52] used bi-linear pooling in this way to fuse two feature extractors for fine-grained classification and achieved good results.
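As a concrete illustration of the equations above, the following sketch computes the bi-linear descriptor of a single image from the two feature matrices A and B defined above; the tensor sizes are placeholders.

```python
import torch

def bilinear_pool(A, B):
    """Classic bi-linear pooling for one image.
    A: (M, L) and B: (N, L) hold the two features at L spatial positions."""
    delta = A @ B.T                              # sum of per-position outer products, (M, N)
    x = delta.flatten()                          # vectorize to (M*N,)
    y = torch.sign(x) * torch.sqrt(x.abs())      # signed square-root normalization
    return y / (y.norm(p=2) + 1e-12)             # L2 normalization

# Placeholder feature matrices with M = 4, N = 3 channels and L = 49 positions.
z = bilinear_pool(torch.randn(4, 49), torch.randn(3, 49))   # shape (12,)
```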
Factorized bi-linear pooling [56] aims to augment the original bi-linear pooling with an effective multimodal attention mechanism. The complete formulation is given below:
$$f = \alpha^{T} F \alpha,$$
Here, $F \in \mathbb{R}^{hw \times hw}$ acts as the projection operator and f denotes the bi-linear output. Setting F = I collapses FBP to the vanilla bi-linear form; sum pooling replaces the conventional average. By exploiting the factorization strategy, F is decomposed into two low-rank projection matrices:
$$f = \alpha^{T} F \alpha = P^{T}\left(\Phi_1^{T} \alpha \circ \Phi_2^{T} \alpha\right),$$
Among them, $\Phi_1, \Phi_2 \in \mathbb{R}^{hw \times d}$ act as projection operators, $P \in \mathbb{R}^{d \times c}$ serves as the classifier with c output classes, ∘ denotes element-wise multiplication, and d specifies the joint-embedding dimension.
The current bi-linear approaches predominantly tap only the final activation layer of a CNN, often with duplicate feature extractors, thereby overlooking subtle, fine-grained cues and failing to capture the semantic components of diverse objects. Moreover, they abandon intermediate convolutional responses, discarding the valuable discriminative details needed for fine-grained classification [57,58].
Cross-layer feature interactions and fine-grained representation learning reinforce one another [59]. To model these inter-layer dependencies effectively, cross-layer bi-linear schemes [60] offer a parameter-efficient and powerful solution by explicitly relating activations across depths without extra trainable weights. By design, such cross-layer bi-linear pooling distills nuanced, fine-level image cues [61]. Thus, we formalize the factorized cross-layer bi-linear pooling framework as follows:
$$f = P^{T}\left(\Phi_1^{T} \alpha \circ \Phi_2^{T} \beta\right),$$
where α and β denote distinct layers, $\Phi_1, \Phi_2 \in \mathbb{R}^{hw \times d}$ act as projection operators, $P \in \mathbb{R}^{d \times c}$ serves as the classifier, ∘ denotes element-wise multiplication, and d specifies the joint-embedding dimension. Building on this, we introduced a tri-linear pooling scheme [62] that integrates three separate layers, α, β, and γ, by extending the pairwise Hadamard product to a three-way interaction, formulated as follows:
$$f = P^{T}\left(\Phi_1^{T} \alpha \circ \Phi_2^{T} \beta \circ \Phi_3^{T} \gamma\right),$$
where $\Phi_3 \in \mathbb{R}^{hw \times d}$ is the projection matrix for the third layer, so that three independent layers are combined.
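A minimal numerical sketch of the factorized cross-layer and tri-linear forms above is shown below; the dimensions hw, d, and c are placeholders, and the projection matrices would be learnable parameters in practice.

```python
import torch

hw, d, c = 64, 512, 2                                        # placeholder dimensions
Phi1, Phi2, Phi3 = (torch.randn(hw, d) for _ in range(3))    # projection operators
P = torch.randn(d, c)                                        # classifier
alpha, beta, gamma = (torch.randn(hw) for _ in range(3))     # activations of three layers

# Factorized cross-layer bi-linear pooling: f = P^T (Phi1^T a ∘ Phi2^T b)
f_bi = P.T @ ((Phi1.T @ alpha) * (Phi2.T @ beta))                      # shape (c,)

# Tri-linear extension: f = P^T (Phi1^T a ∘ Phi2^T b ∘ Phi3^T g)
f_tri = P.T @ ((Phi1.T @ alpha) * (Phi2.T @ beta) * (Phi3.T @ gamma))  # shape (c,)
```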
While bi-linear pooling remains a key enabler for fine-grained detection by modeling pairwise feature relations, the prevailing variants still confine themselves to a single convolutional layer, thereby neglecting vital cross-layer exchanges [52,63]. Activations isolated from any one layer furnish only a partial view, since objects and their constituent parts exhibit multiple attributes essential for sub-category discrimination [64].
In practice, categorizing an image often demands the simultaneous consideration of multiple part-level cues. Motivated by this, prior work [65] has harvested bi-linear descriptors from several CNN layers, treating each convolutional block as an attribute extractor, and later fused these descriptors via element-wise multiplication [66]. Yet, this strategy remains limited to pairwise layer interactions [67]. In this paper, we introduce a dual multi-linear pooling hierarchy (D-MLPH) architecture that interlinks two groups of cross-layer bi-linear units with tri-linear pooling modules, jointly modeling both cross-layer interactions and their higher-order relationships. As Figure 1 illustrates, an input image is forwarded through a CNN to yield multi-layer feature maps. These maps are individually projected into a high-dimensional space to encode part-specific attributes, after which their interactions are captured via element-wise multiplication. Tri-linear pooling is then applied to the resulting inter-layer context before sum pooling condenses the high-dimensional activations into compact representations. The final D-MLPH formulation is expressed as follows:
$$\begin{aligned} f_{D\text{-}MLPH} = P^{T}\,\mathrm{concat}\Big( & P^{T}\,\mathrm{concat}\big(\Phi_1^{T}\alpha \circ \Phi_2^{T}\beta,\ \Phi_1^{T}\alpha \circ \Phi_3^{T}\gamma,\ \Phi_2^{T}\beta \circ \Phi_3^{T}\gamma,\ \Phi_1^{T}\alpha \circ \Phi_2^{T}\beta \circ \Phi_3^{T}\gamma\big), \\ & P^{T}\,\mathrm{concat}\big(\Phi_1^{T}\alpha \circ \Phi_2^{T}\beta,\ \Phi_1^{T}\alpha \circ \Phi_3^{T}\gamma,\ \Phi_2^{T}\beta \circ \Phi_3^{T}\gamma,\ \Phi_1^{T}\alpha \circ \Phi_2^{T}\beta \circ \Phi_3^{T}\gamma\big) \Big) \end{aligned}$$
where α, β, and γ denote distinct layers, $\Phi_1$, $\Phi_2$, and $\Phi_3$ act as projection operators (with $\Phi_3$ projecting the third layer so that three independent layers are combined), and ∘ denotes element-wise multiplication.
The input data is fed forward through a CNN to harvest feature tensors across multiple depths. Each tensor is independently projected into a high-dimensional embedding to encode part-specific attributes, after which their mutual interactions are captured via element-wise products.
Algorithm 1 details the multi-linear pooling process that fuses three feature maps ( F 1 , F 2 , F 3 ) extracted from different ResNet-18 blocks. The algorithm first projects features to higher dimensions, and then computes pairwise and tri-linear interactions to capture cross-layer relationships. After global pooling and normalization, these interactions are concatenated to form the final fused representation F M L P .
Algorithm 1 Multi-linear pooling feature fusion.
Require: Feature maps $F_1, F_2, F_3 \in \mathbb{R}^{B \times 512 \times H \times W}$
Ensure: Fused feature representation $F_{MLP}$
1: Feature Projection:
2: $\hat{F}_1 \leftarrow \mathrm{Conv}_{1\times1}(F_1)$ ▷ Project to 8192 channels
3: $\hat{F}_2 \leftarrow \mathrm{Conv}_{1\times1}(F_2)$
4: $\hat{F}_3 \leftarrow \mathrm{Conv}_{1\times1}(F_3)$
5: Bi-Linear Interactions:
6: $I_{12} \leftarrow \hat{F}_1 \circ \hat{F}_2$ ▷ Element-wise multiplication
7: $I_{13} \leftarrow \hat{F}_1 \circ \hat{F}_3$
8: $I_{23} \leftarrow \hat{F}_2 \circ \hat{F}_3$
9: Tri-Linear Interaction:
10: $I_{123} \leftarrow \hat{F}_1 \circ \hat{F}_2 \circ \hat{F}_3$
11: Global Average Pooling:
12: for each interaction $I \in \{I_{12}, I_{13}, I_{23}, I_{123}\}$ do
13:   $\bar{I} \leftarrow \mathrm{GlobalAvgPool}(I)$
14: end for
15: Normalization:
16: for each pooled feature $\bar{I}$ do
17:   $N \leftarrow \mathrm{SignedSqrt}(\bar{I})$
18:   $N \leftarrow \mathrm{L2Normalize}(N)$
19: end for
20: Feature Concatenation:
21: $F_{MLP} \leftarrow \mathrm{Concat}([N_{12}, N_{13}, N_{23}, N_{123}])$
22: return $F_{MLP}$
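A PyTorch sketch of Algorithm 1 is given below. It follows the steps above directly; the 512 input channels and the 8192-channel projection are taken from the algorithm, while treating the module as learnable 1 × 1 convolutions plus parameter-free pooling is an implementation assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLinearPooling(nn.Module):
    """Fuses three deep feature maps via pairwise and tri-linear element-wise
    interactions, following Algorithm 1."""

    def __init__(self, in_channels=512, proj_channels=8192):
        super().__init__()
        # Step 1: feature projection with 1x1 convolutions.
        self.proj = nn.ModuleList(nn.Conv2d(in_channels, proj_channels, kernel_size=1)
                                  for _ in range(3))

    @staticmethod
    def _pool_and_normalize(x):
        v = x.mean(dim=(2, 3))                            # global average pooling -> (B, C)
        v = torch.sign(v) * torch.sqrt(v.abs() + 1e-12)   # signed square root
        return F.normalize(v, p=2, dim=1)                 # L2 normalization

    def forward(self, f1, f2, f3):
        h1, h2, h3 = (p(f) for p, f in zip(self.proj, (f1, f2, f3)))
        # Bi-linear and tri-linear interactions (element-wise products).
        interactions = (h1 * h2, h1 * h3, h2 * h3, h1 * h2 * h3)
        # Pool, normalize, and concatenate into F_MLP.
        return torch.cat([self._pool_and_normalize(i) for i in interactions], dim=1)

# Example: three (B, 512, 8, 8) maps yield a fused vector of size 4 * 8192.
f1, f2, f3 = (torch.randn(2, 512, 8, 8) for _ in range(3))
fused = MultiLinearPooling()(f1, f2, f3)     # shape (2, 32768)
```

In the full D-MLPH, two such fusion branches are used, and their outputs are concatenated before classification, as expressed in the formulation above.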

3. Experimental Results and Analysis

3.1. Study Area

Wuhan is a major megacity in central China and the capital of Hubei Province. With a population of over 13 million within its 8569 km² municipal area, Wuhan’s status as a national transportation and economic hub has been accompanied by rapid urbanization and large-scale spatial expansion. This transformation has yielded a complex urban landscape where planned developments intersect with self-organized neighborhoods and “urban villages”. Furthermore, Wuhan’s inherent polycentric structure fosters diverse manifestations of informality, encompassing both dense inner-city enclaves and peripheral informal settlements.
These informal settlements are typically characterized by irregular land use, a high building density, and the lack of formal urban services. Most of them originated as rural villages that were gradually engulfed by expanding urban areas, yet they retain collective land ownership and localized governance structures. Morphologically, they often feature tightly packed, multi-story concrete buildings; narrow alleyways; and mixed commercial–residential functions, creating a visual and structural contrast with planned urban neighborhoods. These conditions make Wuhan a compelling and valuable testbed for studying a wide spectrum of informal settlements within a single administrative boundary.

3.2. The Datasets

To demonstrate the efficacy of our proposed framework, this section conducts systematic experiments on a self-constructed WuhanUIS dataset and two representative and authoritative multimodal datasets: ChinaUIS and S²UV. These datasets are among the latest publicly available collections widely recognized in the RS and urban settlement research communities. They contain RSIs and SVIs from different cities and regions in China, covering both UISs and others.

3.2.1. WuhanUIS Dataset

The WuhanUIS dataset was independently constructed in this study to support experiments on UIS classification. To ensure geographic diversity and mitigate spatial bias, we employed a stratified sampling strategy. The city of Wuhan was divided into a grid, and sampling locations were systematically selected from each grid cell, covering the city’s core urban areas, suburban districts, and developing outskirts. This approach ensured that our dataset was not limited to a few key areas, but captured a wide spectrum of urban environments. We first identified specific areas in Wuhan, and then collected street-level images through Baidu Map Street View. For each location, four rectified SVI views at 0°, 90°, 180°, and 270° were extracted. The four SVI views (0°, 90°, 180°, 270°) denoted sector ranges 0–90°, 90–180°, 180–270°, and 270–360°, jointly providing full 360° coverage per location. The RSIs had a spatial resolution of 1.19 m. Centered on the geographical coordinates of each SV location, we obtained high-, medium-, and low-resolution RSIs from Google Earth, taken in 2020. Each image had a size of (256 × 256) pixels and had three color channels. Each data point thus contained three RSIs of different resolutions, each shaped as (3 × 256 × 256), and four SVIs from different perspectives, also shaped as (3 × 256 × 256). Some of these samples are visualized in Figure 2.
To facilitate model input and ensure data consistency, all the images were resized to ( 3 × 256 × 256 ) . A manual visual inspection was carried out to categorize each image as either belonging to a UIS or not a UIS (others). The UIS labels were determined by visible indicators jointly observed from the RSIs and SVIs at the image level (e.g., high-density, low-rise morphology and irregular alley patterns in RSIs; deteriorated or makeshift facades, informal extensions, and cluttered narrow lanes in SVIs). An example of annotated images is shown in Figure 3, where the regions considered as a UIS are marked with red boxes in both the remote sensing image and the four street-view images. The annotation phase took into account the following criteria: (1) in cases where the RSIs and at least one SVI from any angle displayed characteristics of informal settlements, the sample was designated as a UIS; (2) in cases where none of the SVIs or the RSIs exhibited characteristics of informal settlements, the corresponding sample was designated as other areas; and (3) in cases where the RSIs exhibited features of informal settlements while none of the SVIs did, or vice versa, the sample was excluded to ensure consistency and avoid annotation ambiguity. This third situation mainly arose due to temporal or spatial inconsistencies between the acquisition of SVIs and RSIs. To maintain the reliability and integrity of the dataset, such samples were filtered out.
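For clarity, the sketch below shows how one WuhanUIS sample can be organized as tensors with the shapes described above; the helper function, label encoding, and placeholder tensors are hypothetical and do not reflect the released dataset's actual file layout.

```python
import torch
from typing import Dict, List

def make_sample(rs_images: List[torch.Tensor],
                sv_images: List[torch.Tensor],
                label: int) -> Dict[str, torch.Tensor]:
    """Packages one sample: three multi-resolution RSIs and four directional SVIs,
    each shaped (3, 256, 256), plus a binary UIS / other label (encoding assumed)."""
    assert len(rs_images) == 3 and len(sv_images) == 4
    return {
        "rs": torch.stack(rs_images),   # (3, 3, 256, 256): high / medium / low resolution
        "sv": torch.stack(sv_images),   # (4, 3, 256, 256): 0°, 90°, 180°, 270° views
        "label": torch.tensor(label),   # assumed: 1 = UIS, 0 = other
    }

# Placeholder tensors standing in for decoded images.
sample = make_sample([torch.rand(3, 256, 256) for _ in range(3)],
                     [torch.rand(3, 256, 256) for _ in range(4)], label=1)
```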
Finally, a total of 3832 valid instances were obtained, including 1500 UIS samples and 2332 other instances. The distribution across categories is presented in Table 1.

3.2.2. ChinaUIS Dataset

The ChinaUIS dataset comprises images from eight major Chinese megacities—Beijing, Shanghai, Guangzhou, Shenzhen, Tianjin, Chengdu, Wuhan, and Chongqing—each with an urban population exceeding 10 million. It encompasses both UISs and other regions, with sufficient samples in each category to facilitate robust model training and evaluation.
Specifically, it comprises 1833 sample sets, each containing one RSI with dimensions of (3 × 224 × 224) and four SVIs of size (3 × 512 × 1024) captured from different angles (0°, 90°, 180°, and 270°). Among them, 643 samples are from UIS regions and 1190 from other areas. The detailed distribution across cities and categories is provided in Table 2. All the samples were carefully selected to represent the diverse urban landscapes of China’s major metropolitan regions. Representative examples are illustrated in Figure 4.

3.2.3. S²UV Dataset

The S²UV dataset is a widely recognized multimodal dataset tailored for UIS classification tasks, built from high-resolution RSIs and SVIs. It features samples from key cities in the Beijing–Tianjin–Hebei region, covering both UISs and other areas.
Specifically, it contains 2570 sample sets, comprising 856 UISs and 1714 other instances. Each set contains one RSI and four SVIs captured from distinct perspectives (0°, 90°, 180°, and 270°). All the images were preprocessed and resized to (3 × 224 × 224) pixels for consistency. This well-structured dataset laid strong groundwork for assessing multimodal classification approaches in UIS detection.

3.3. Running Environment

All the experiments were carried out using PyTorch (v2.5.1 + cu121) on a workstation featuring an NVIDIA GeForce RTX 3060 Laptop GPU. We used the Adam optimizer, a learning rate of 1 × 10⁻⁴, a batch size of 32, and 200 training epochs.
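The corresponding optimizer setup is sketched below; the model and dataset objects are placeholders, while the optimizer type, learning rate, batch size, and epoch count follow the settings listed above (the cross-entropy criterion is an assumption).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholders: a stand-in model and synthetic dataset replace PanFusion-Net
# and the real training set in this sketch.
model = torch.nn.Linear(10, 2)
train_set = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))

loader = DataLoader(train_set, batch_size=32, shuffle=True)   # batch size 32
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)     # Adam, lr = 1e-4
criterion = torch.nn.CrossEntropyLoss()                       # assumed loss

for epoch in range(200):                                      # 200 training epochs
    for inputs, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
```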

3.4. Experiment on WuhanUIS Dataset

3.4.1. Experimental Operation

To rigorously assess the effectiveness of the proposed framework, experiments were conducted on the WuhanUIS dataset.
Each comparative model was evaluated under three distinct input configurations to demonstrate the superiority of multimodal fusion and the enhanced performance of our proposed model, ensuring a thorough and robust experimental assessment:
  • Only RSIs: Using only high-resolution RSIs.
  • Only SVIs: Using only SVIs.
  • RSIs and SVIs Merged: Combining high-resolution RSIs and SVIs to leverage complementary information.
A progressive comparison, from a single-input configuration to a combined-input configuration, not only clarifies the impact of the input type on the model performance, but also highlights the importance of integrating near and far views to improve the classification accuracy.
The PanFusion-Net uniquely processes RS data by simultaneously using three resolution scales of the same RSI and handles SV data by concurrently analyzing SVIs from three angles. This multi-scale and multi-angle approach enables PanFusion-Net to capture features at different levels of detail and from various perspectives.
For all experiments, the overall accuracy (OA) was measured on the independently maintained test set to ensure a fair comparison across all the models and configurations.

3.4.2. Experimental Results

Table 3 presents the category accuracy and OA of the different models. Each comparative model was trained using either only SVIs or only high-resolution RSIs; to highlight the multimodal fusion gain, the table intentionally uses asymmetric input settings by row (RS, SV, RS + SV), as indicated in the input column. It can be seen that the proposed method achieved the highest OA of 96.14%, a significant improvement over the baseline models.
Figure 5 illustrates two typical types of classification errors of PanFusion-Net on the WuhanUIS dataset. Note that our model performs image-level classification, i.e., it predicts whether the entire image is a UIS or not. One type of error occurs when the SVI is too open and lacks visible buildings, causing the model to mistakenly classify the area as a UIS. The other type of error happens in urban construction areas, where irregular buildings lead the model to misclassify them as others.
Table 4 shows the category accuracy and OA of different models, where each model was trained using the RSI–SVI merged training method, combining both RSIs and SVIs for enhanced classification.
The training loss in Figure 6 decreases steadily and converges, while the validation loss decreases and then stabilizes without a sustained upward trend, indicating convergence without obvious overfitting.

3.5. Experiment on ChinaUIS Dataset

3.5.1. Experimental Results

This section evaluates the performance of the proposed PanFusion-Net on the ChinaUIS dataset for UIS classification. To assess the classification accuracy quantitatively, we computed the confusion matrix on the test set, as summarized in Table 5 and Figure 7. The model achieved an OA of 91.07% and a kappa coefficient of 80.31%, indicating its strong capability in handling multimodal classification tasks that integrate RSIs and SVIs.
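For reference, the OA and kappa values reported throughout this section can be computed from a confusion matrix as in the short sketch below; the example matrix is a hypothetical placeholder, not the values behind Table 5.

```python
import numpy as np

def oa_and_kappa(cm: np.ndarray):
    """Overall accuracy and Cohen's kappa from a confusion matrix
    (rows: true classes, columns: predicted classes)."""
    total = cm.sum()
    oa = np.trace(cm) / total
    # Expected agreement under chance, from row and column marginals.
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2
    kappa = (oa - pe) / (1 - pe)
    return oa, kappa

# Hypothetical 2x2 matrix (UIS vs. other); the entries are placeholders.
print(oa_and_kappa(np.array([[120, 15], [18, 214]])))
```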
Figure 8 illustrates two common types of misclassification made by the PanFusion-Net model on the ChinaUIS dataset. The first occurred in urban park regions, where dense tree coverage causes these areas to visually resemble open spaces typically labeled as UISs, leading to their incorrect classification as others. The second type happened in villa districts or characteristic commercial streets, where the buildings are lower and have unique styles, causing the model to misclassify them as UISs.

3.5.2. Experimental Validity

To verify the contribution of the multimodal inputs, hierarchical structure, and multi-linear pooling modules in the proposed PanFusion-Net, we carried out comprehensive experiments on the ChinaUIS dataset. The summarized outcomes are presented in Table 6.
Initially, we evaluated the effectiveness of individual modalities. For SVI, each viewpoint (0°, 90°, 180°, 270°) was tested separately as well as combined. According to Table 6, the accuracies from the single-angle SVI inputs ranged between 86.34% and 88.52%, with the 270° view performing best. When aggregating all four perspectives without integrating remote sensing data, the model achieved an OA of 87.43% alongside a kappa of 79.22%.
For RSIs alone (single-modal-rs), the model yielded an OA of 89.44% and a kappa coefficient of 77.37%. This clearly surpasses all SVI-based models, indicating that, for UIS classification, RSIs provide more distinctive features, whereas SVIs involve a richer environmental complexity that is harder to decipher individually.
Ultimately, the PanFusion-Net, which fuses RSI and SVI inputs through a dual multi-linear pooling hierarchy, attained an OA of 91.07% and a kappa coefficient of 80.31%. Compared to the single-modal RS approach, PanFusion-Net achieved a competitive performance, demonstrating the effectiveness of multimodal feature fusion for UIS classification in complex urban environments.

3.6. Experiment on S²UV Dataset

3.6.1. Experimental Results

The S²UV dataset was utilized in this experiment to examine the robustness and cross-dataset adaptability of PanFusion-Net.
To further verify the performance of PanFusion-Net on the S²UV dataset, we computed the confusion matrix in Figure 9 based on its prediction outcomes, as presented in Table 7. The results indicate that the model achieved an OA of 96.11% and a kappa coefficient of 91.29% on the test set. These findings highlight the strong performance and generalization capability of PanFusion-Net on the UIS classification benchmark.

3.6.2. Experimental Comparison with Multimodal Models

To evaluate the generalization performance of PanFusion-Net and benchmark it against existing multimodal approaches, we carried out experiments using the S²UV dataset. The compared models include Trans-MDCNN [68] and FusionMixer [28].
Table 8 presents the test set results for various multimodal models. Among them, our proposed PanFusion-Net achieved the best performance, with an OA of 96.11% and a kappa coefficient of 91.29%. In comparison with FusionMixer, the second-best model, PanFusion-Net achieved an improvement of 1.81% in the OA and 3.95% in the kappa. These improvements demonstrate the effectiveness of PanFusion-Net in deeply integrating RSI and SVI features, which significantly boosts the classification performance.
To further examine PanFusion-Net’s advantage over other multimodal approaches for UIS classification, we analyzed the category-wise accuracy on both the UIS and other classes, as summarized in Table 9. Among all the evaluated models, the classification performance on the UIS class was generally lower than that for the other class, indicating greater ambiguity in distinguishing UIS regions. However, PanFusion-Net outperformed the competing methods in both categories. Specifically, for the UIS class, PanFusion-Net achieved a 95.00% accuracy, surpassing the other two models by 8.95% and 1.88%, respectively. For the other class, PanFusion-Net reached 96.67%, representing improvements of 2.39% and 3.11% compared to the alternatives.
The experimental findings indicate that PanFusion-Net outperformed the other multimodal UIS classification models, achieving the highest OA and kappa coefficient across all categories. Moreover, the model delivers a balanced prediction performance, with similar accuracy levels between the UIS and other classes. Overall, PanFusion-Net demonstrated a superior accuracy and robustness compared to the latest competing methods. This advantage stems from its dual multi-linear pooling architecture, which enhances the model’s capacity to handle heterogeneous multimodal inputs and fully exploit modality-specific features. By leveraging a hierarchical feature fusion framework, PanFusion-Net effectively integrates multi-level features, both within and across modalities, combining a coarse-to-fine granularity with panoramic contextual information to achieve more accurate UIS classification.

4. Discussion

4.1. Performance Gains from Multimodal Integration

The experimental results across the WuhanUIS, ChinaUIS, and S²UV datasets consistently confirm the effectiveness of the proposed architecture. On the WuhanUIS dataset, PanFusion-Net achieved a clearly superior performance over all single-modality baselines, indicating that the combination of RS and SV information substantially improved the discriminatory capability in complex urban environments. With multimodal inputs, conventional networks such as ResNet-18 [42] and HBP [65] also exhibit noticeable improvements compared with their single-modality variants, further validating the complementary characteristics of the two data sources. In particular, many instances that were misclassified under single-modality conditions were correctly identified when multimodal information was incorporated, and representative examples are illustrated in Figure 10. Nevertheless, their performance remains slightly lower than that of PanFusion-Net.

4.2. Mechanisms Behind the Performance Improvements

These improvements arise mainly from two aspects of the proposed design. RSIs provide stable macro-level spatial context, whereas SVIs supply fine-scale local cues that are often invisible from overhead views. The hierarchical spatial analysis mechanism further stabilizes this fusion by first performing coarse-scale region localization [69] to narrow the spatial search space [70] and then applying fine-scale semantic refinement to reduce local ambiguity. In addition, the proposed dual-branch multi-linear pooling module strengthens high-order interactions between heterogeneous modalities, enabling the network to capture non-linear relationships between macro structures and street-level visual patterns. When combined with the cross-scale feature aggregation strategy, these components produce more complete and consistent multimodal representations across different spatial resolutions and viewpoints.

4.3. Cross-Dataset Robustness and Generalization

Similar improvements were observed on the ChinaUIS dataset, where PanFusion-Net enhanced both the overall accuracy and the kappa coefficient relative to single-modality remote sensing input and the average performance of street-view inputs from different directions. On the S²UV dataset, PanFusion-Net again showed a strong robustness and generalization, surpassing existing multimodal fusion models [71], including FusionMixer [28] and Trans-MDCNN [68]. Together, these findings highlight the critical role of multimodal information in improving fine-grained informal settlement classification.

4.4. Limitations and Future Directions

Despite the demonstrated benefits of multimodal fusion for informal settlement identification, the approach proposed in this study, along with other advanced fusion techniques, is subject to several inherent limitations. First, this study relied exclusively on imagery data. Second, the high costs associated with acquiring and annotating street-view images have limited the diversity of features we could obtain. Moreover, deep and multi-level fusion architectures, while capable of achieving expressive hierarchical representations, often incur substantial computational overhead and parameter complexity. As fusion modules are repeatedly stacked across scales, challenges related to the training cost, inference time, and optimization stability become more pronounced. In addition, as illustrated in Figure 10, the current fusion strategy may lead to over-fusion in a small number of cases, where the excessive blending of heterogeneous features suppresses informative modality-specific cues. These failure cases indicate that the fusion process is not always optimally balanced and further suggest the need for more adaptive mechanisms capable of dynamically regulating the degree of fusion based on scene characteristics.
To address these challenges, future work should aim to develop more adaptive and efficient fusion mechanisms. Subsequent research would benefit from integrating additional geospatial data streams, including synthetic aperture radar (SAR), nighttime lights, dense point cloud data, and other relevant multimodal datasets. Such auxiliary information has the potential to deliver richer contextual cues, thereby enabling the more precise and holistic detection and characterization of urban informal settlements. In addition, cross-modal or self-attention modules can dynamically select informative modality features at different scales and semantic levels while suppressing noise or low-quality inputs, thereby maintaining cross-view complementarity while avoiding unnecessary deep fusion operations. This design improves the model’s adaptability to heterogeneous and imperfect data [72]. Future research may combine hierarchical fusion structures with lightweight attention mechanisms to better balance the performance, efficiency, and robustness, providing a more scalable solution for informal settlement identification in complex urban environments [73,74].

5. Conclusions

This study introduces PanFusion-Net, a novel framework that advances urban informal settlement classification by synergistically fusing multi-resolution satellite and panoramic street-view imagery. Its superior performance is driven by a modality-specific dual multi-linear pooling hierarchy, which effectively guides the spatial feature learning across these disparate data sources. Within PanFusion-Net, modality-specific D-MLPH branches hierarchically encode cross-view spatial representations from fused multi-scale remote sensing and panoramic street-view inputs, spanning coarse-to-fine and panoramic perspectives. This method enables the rapid and accurate completion of UIS classification tasks within a limited visual dataset. At the same time, this approach coincides with the human habit of recognizing objects from far to near, and from the whole to the part. Rigorous benchmarking on a self-constructed WuhanUIS dataset as well as two open-source datasets, ChinaUIS and S²UV, confirmed the state-of-the-art efficacy of our framework. Subsequent research directions include investigating transformer-based architectures and multi-task learning paradigms to advance the segmentation precision. More methods related to UIS recognition should be explored to better support the United Nations SDG 11.

Author Contributions

Conceptualization, J.H.; Methodology, J.H.; Software, J.H.; Validation, J.H.; Formal analysis, J.H. and T.R.; Investigation, J.H.; Resources, J.H.; Data curation, J.H.; Writing—original draft, J.H.; Writing—review & editing, J.H.; Visualization, J.H. and L.Z.; Supervision, J.H. and X.H.; Project administration, J.H.; Funding acquisition, X.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant No. U21A2013).

Data Availability Statement

The data presented in this study are openly available in The Urban Informal Settlements Mapping in Wuhan at https://doi.org/10.57760/sciencedb.28677.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hasan, A. Informal settlements and urban sustainability in Bangladesh. Environ. Urban. 2010, 22, 13–28. [Google Scholar]
  2. Mahabir, R.; Croitoru, A.; Crooks, A.T.; Agouris, P.; Stefanidis, A. A Review of Spatial Characteristics to Inform Slum Classification: The Case of Informal Settlements in Haiti. Urban Sci. 2018, 2, 8. [Google Scholar] [CrossRef]
  3. United Nations Statistics Division (UNSD). The Sustainable Development Goals Report; United Nations Statistics Division (UNSD): New York, NY, USA, 2025. [Google Scholar]
  4. Kaiser, Z.R.M.A.; Sakil, A.H.; Baikady, R.; Deb, A.; Hossain, M.T. Building resilience in urban slums: Exploring urban poverty and policy responses amid crises. Discov. Glob. Soc. 2025, 3. [Google Scholar] [CrossRef]
  5. Gupta, S.; Degbelo, A. An Empirical Analysis of AI Contributions to Sustainable Cities (SDG 11). In The Ethics of Artificial Intelligence for the Sustainable Development Goals; Mazzi, F., Floridi, L., Eds.; Springer International Publishing: Cham, Switzerland, 2023; pp. 461–484. [Google Scholar] [CrossRef]
  6. Han, W.; Wang, L.; Wang, Y.; Li, J.; Yan, J.; Shao, Y. A novel framework for leveraging geological environment big data to assess sustainable development goals. Innov. Geosci. 2025, 3, 100122. [Google Scholar] [CrossRef]
  7. Fan, R.; Li, J.; Song, W.; Han, W.; Yan, J.; Wang, L. Urban informal settlements classification via a transformer-based spatial-temporal fusion network using multimodal remote sensing and time-series human activity data. Int. J. Appl. Earth Obs. Geoinf. 2022, 111, 102831. [Google Scholar] [CrossRef]
  8. Fan, R.; Wang, L.; Xu, Z.; Niu, H.; Chen, J.; Zhou, Z.; Li, W.; Wang, H.; Sun, Y.; Feng, R. The first urban open space product of global 169 megacities using remote sensing and geospatial data. Sci. Data 2025, 12, 586. [Google Scholar] [CrossRef]
  9. Gong, J.; Liu, C.; Huang, X. Advances in urban information extraction from high-resolution remote sensing imagery. Sci. China Earth Sci. 2020, 63, 463–475. [Google Scholar] [CrossRef]
  10. Zhou, W.; Ming, D.; Lv, X.; Zhou, K.; Bao, H. SO–CNN based urban functional zone fine division with VHR remote sensing image. Remote Sens. Environ. 2019, 236. [Google Scholar] [CrossRef]
  11. Li, M.; Stein, A.; Beurs, K.M.D. A Bayesian characterization of urban land use configurations from VHR remote sensing images. Int. J. Appl. Earth Obs. Geoinf. 2020, 92, 102175. [Google Scholar] [CrossRef]
  12. Wang, Z.; Hao, Z.; Lin, J.; Feng, Y.; Guo, Y. UP-Diff: Latent Diffusion Model for Remote Sensing Urban Prediction. IEEE Geosci. Remote Sens. Lett. 2024, 22, 7502505. [Google Scholar] [CrossRef]
  13. Cao, R.; Tu, W.; Yang, C.; Li, Q.; Liu, J.; Zhu, J.; Zhang, Q.; Li, Q.; Qiu, G. Deep learning-based remote and social sensing data fusion for urban region function recognition. ISPRS J. Photogramm. Remote Sens. 2020, 163, 82–97. [Google Scholar] [CrossRef]
  14. Alawode, G.L.; Oluwajuwon, T.V.; Hammed, R.A.; Olasuyi, K.E.; Krasovskiy, A.; Ogundipe, O.C.; Kraxner, F. Spatiotemporal assessment of land use land cover dynamics in Mödling district, Austria, using remote sensing techniques. Heliyon 2025, 11, e43454. [Google Scholar] [CrossRef]
  15. Li, S.; Dragicevic, S.; Castro, F.A.; Sester, M.; Winter, S.; Coltekin, A.; Pettit, C.; Jiang, B.; Haworth, J.; Stein, A.; et al. Geospatial big data handling theory and methods: A review and research challenges. ISPRS J. Photogramm. Remote Sens. 2016, 115, 119–133. [Google Scholar] [CrossRef]
  16. Li, X.; Hu, T.; Gong, P.; Du, S.; Chen, B.; Li, X.; Dai, Q. Mapping essential urban land use categories in Beijing with a fast area of interest (AOI)-based method. Remote Sens. 2021, 13, 477. [Google Scholar] [CrossRef]
  17. Yin, J.; Dong, J.; Hamm, N.A.; Li, Z.; Wang, J.; Xing, H.; Fu, P. Integrating remote sensing and geospatial big data for urban land use mapping: A review. Int. J. Appl. Earth Obs. Geoinf. 2021, 103, 102514. [Google Scholar] [CrossRef]
  18. Chen, J.; Fan, R.; Niu, H.; Xu, Z.; Yan, J.; Song, W.; Feng, R. A unified multimodal learning method for urban functional zone identification by fusing inner-street visual–textual information from street-view and satellite images. Int. J. Appl. Earth Obs. Geoinf. 2025, 142, 104685. [Google Scholar] [CrossRef]
  19. Gevaert, C.; Persello, C.; Sliuzas, R.; Vosselman, G. Informal settlement classification using point-cloud and image-based features from UAV data. ISPRS J. Photogramm. Remote Sens. 2017, 125, 225–236. [Google Scholar] [CrossRef]
  20. Fan, Z.; Biljecki, F. Nighttime Street View Imagery: A new perspective for sensing urban lighting landscape. Sustain. Cities Soc. 2024, 116, 105862. [Google Scholar] [CrossRef]
  21. Zhang, F.; Zhou, B.; Liu, L.; Liu, Y.; Fung, H.H.; Lin, H.; Ratti, C. Measuring human perceptions of a large-scale urban region using machine learning. Landsc. Urban Plan. 2018, 180, 148–160. [Google Scholar] [CrossRef]
  22. Biljecki, F.; Ito, K. Street view imagery in urban analytics and GIS: A review. Landsc. Urban Plan. 2021, 215, 104217. [Google Scholar] [CrossRef]
  23. Fan, R.; Niu, H.; Xu, Z.; Chen, J.; Feng, R.; Wang, L. Refined Urban Informal Settlements’ Mapping at Agglomeration Scale With the Guidance of Background Knowledge From Easy-Accessed Crowdsourced Geospatial Data. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–16. [Google Scholar] [CrossRef]
  24. Gram-Hansen, B.J.; Helber, P.; Varatharajan, I.; Azam, F.; Coca-Castro, A.; Kopackova, V.; Bilinski, P. Mapping informal settlements in developing countries using machine learning and low resolution multi-spectral data. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, Honolulu, HI, USA, 27–28 January 2019; pp. 361–368. [Google Scholar]
  25. Wang, L.; Zhang, J.; Wang, Y.; Song, X.; Sun, Z. Artificial Intelligence Reshapes River Basin Governance. Sci. Bull. 2024, 70, 1564–1567. [Google Scholar] [CrossRef]
  26. Pressick, R.D. Architecture & Legitimacy: Strategies for the Development of Urban Informal Settlements. Ph.D. Dissertation, Toronto Metropolitan University, Toronto, ON, Canada, 2010. [Google Scholar] [CrossRef]
  27. Matarira, D.; Mutanga, O.; Naidu, M.; Vizzari, M. Object-Based Informal Settlement Mapping in Google Earth Engine Using the Integration of Sentinel-1, Sentinel-2, and PlanetScope Satellite Data. Land 2022, 12, 99. [Google Scholar] [CrossRef]
  28. Fan, R.; Li, J.; Li, F.; Han, W.; Wang, L. Multilevel spatial-channel feature fusion network for urban village classification by fusing satellite and streetview images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5630813. [Google Scholar] [CrossRef]
  29. Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef]
  30. Kong, D.; Fowlkes, C. Low-rank bilinear pooling for fine-grained classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 365–374. [Google Scholar]
  31. Sarlin, P.E.; Cadena, C.; Siegwart, R.; Dymczyk, M. From coarse to fine: Robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12716–12725. [Google Scholar]
  32. Hirschberg, J.; Manning, C.D. Advances in natural language processing. Science 2015, 349, 261–266. [Google Scholar] [CrossRef] [PubMed]
  33. Charniak, E.; Johnson, M. Coarse-to-Fine n-Best Parsing and MaxEnt Discriminative Reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL‘05), Ann Arbor, MI, USA, 25–30 June 2005; Knight, K., Ng, H.T., Oflazer, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2005; pp. 173–180. [Google Scholar] [CrossRef]
  34. Che, W.; Zhang, M.; Aw, A.; Tan, C.; Liu, T.; Li, S. Using a hybrid convolution tree kernel for semantic role labeling. ACM Trans. Asian Lang. Inf. Process. 2008, 7, 1–23. [Google Scholar] [CrossRef]
  35. Brox, T.; Bruhn, A.; Papenberg, N.; Weickert, J. High accuracy optical flow estimation based on a theory for warping. In Proceedings of the European Conference on Computer Vision, Prague, Czech Republic, 11–14 May 2004; Springer: Berlin/Heidelberg, Germany, 2004; pp. 25–36. [Google Scholar]
  36. Zhu, S.; Li, C.; Change Loy, C.; Tang, X. Face alignment by coarse-to-fine shape searching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4998–5006. [Google Scholar]
  37. Alrasheedi, K.G.; Dewan, A.; El-Mowafy, A. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 1121–1130. [Google Scholar] [CrossRef]
  38. Dabra, A.; Kumar, V. Neural Computing and Applications. Neural Comput. Appl. 2023, 34, 1001–1010. [Google Scholar] [CrossRef]
  39. Fukui, A.; Park, D.H.; Yang, D.; Rohrbach, A.; Darrell, T.; Rohrbach, M. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016. [Google Scholar]
  40. Algashaam, F.M.; Nguyen, K.; Alkanhal, M.; Chandran, V.; Boles, W.; Banks, J. Multispectral periocular classification with multimodal compact multi-linear pooling. IEEE Access 2017, 5, 14572–14578. [Google Scholar] [CrossRef]
  41. Huo, Y.; Lu, Y.; Niu, Y.; Lu, Z.; Wen, J.R. Coarse-to-fine grained classification. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 1033–1036. [Google Scholar]
  42. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  43. Killian, N.J.; Vurro, M.; Keith, S.B.; Kyada, M.J.; Pezaris, J.S. Perceptual learning in a non-human primate model of artificial vision. Sci. Rep. 2016, 6, 36329. [Google Scholar] [CrossRef]
  44. Peters, J.C.; Goebel, R.; Goffaux, V. From coarse to fine: Interactive feature processing precedes local feature analysis in human face perception. Biol. Psychol. 2018, 138, 1–10. [Google Scholar] [CrossRef] [PubMed]
  45. Leopold, D.A.; Bondar, I.V.; Giese, M.A. Norm-based face encoding by single neurons in the monkey inferotemporal cortex. Nature 2006, 442, 572–575. [Google Scholar] [CrossRef]
  46. Bell, A.H.; Summerfield, C.; Morin, E.L.; Malecek, N.J.; Ungerleider, L.G. Encoding of Stimulus Probability in Macaque Inferior Temporal Cortex. Curr. Biol. 2016, 26, 2280–2290. [Google Scholar] [CrossRef]
  47. Russ, B.E.; Koyano, K.W.; Day-Cooney, J.; Perwez, N.; Leopold, D.A. Temporal continuity shapes visual responses of macaque face patch neurons. Neuron 2023, 111, 903–914.e3. [Google Scholar] [CrossRef] [PubMed]
  48. Freiwald, W.A.; Tsao, D.Y.; Livingstone, M.S. A face feature space in the macaque temporal lobe. Nat. Neurosci. 2009, 12, 1187. [Google Scholar] [CrossRef] [PubMed]
  49. Sugase, Y.; Yamane, S.; Ueno, S.; Kawano, K. Global and fine information coded by single neurons in the temporal visual cortex. Nature 1999, 400, 869–873. [Google Scholar] [CrossRef]
  50. Guyon, I.; Elisseeff, A. An introduction to feature extraction. In Feature Extraction: Foundations and Applications; Springer: Berlin/Heidelberg, Germany, 2006; pp. 1–25. [Google Scholar]
  51. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  52. Lin, T.Y.; RoyChowdhury, A.; Maji, S. Bilinear CNN models for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1449–1457. [Google Scholar]
  53. Gao, Y.; Beijbom, O.; Zhang, N.; Darrell, T. Compact bilinear pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 317–326. [Google Scholar]
  54. Yu, Z.; Yu, J.; Xiang, C.; Fan, J.; Tao, D. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 5947–5959. [Google Scholar] [CrossRef]
  55. Zhao, L.; Zhang, Z. A improved pooling method for convolutional neural networks. Sci. Rep. 2024, 14, 1589. [Google Scholar] [CrossRef]
  56. Kim, J.H.; On, K.W.; Lim, W.; Kim, J.; Ha, J.W.; Zhang, B.T. Hadamard Product for Low-rank Bilinear Pooling. arXiv 2016, arXiv:1610.04325. [Google Scholar]
  57. Cui, Y.; Zhou, F.; Wang, J.; Liu, X.; Lin, Y.; Belongie, S. Kernel pooling for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2921–2930. [Google Scholar]
  58. Zhao, B.; Wu, X.; Feng, J.; Peng, Q.; Yan, S. Diversified visual attention networks for fine-grained object classification. IEEE Trans. Multimed. 2017, 19, 1245–1256. [Google Scholar] [CrossRef]
  59. Leyva, I.; Sevilla-Escoboza, R.; Sendiña-Nadal, I.; Gutiérrez, R.; Buldú, J.; Boccaletti, S. Inter-layer synchronization in non-identical multi-layer networks. Sci. Rep. 2017, 7, 45475. [Google Scholar] [CrossRef]
  60. Huo, Y.; Gang, S.; Guan, C. FCIHMRT: Feature cross-layer interaction hybrid method based on Res2Net and transformer for remote sensing scene classification. Electronics 2023, 12, 4362. [Google Scholar] [CrossRef]
  61. Li, Z.; Lang, C.; Liew, J.H.; Li, Y.; Hou, Q.; Feng, J. Cross-layer feature pyramid network for salient object detection. IEEE Trans. Image Process. 2021, 30, 4587–4598. [Google Scholar] [CrossRef]
  62. Zheng, H.; Fu, J.; Zha, Z.J.; Luo, J. Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5012–5021. [Google Scholar]
  63. Tan, M.; Yuan, F.; Yu, J.; Wang, G.; Gu, X. Fine-grained image classification via multi-scale selective hierarchical biquadratic pooling. ACM Trans. Multimed. Comput. Commun. Appl. 2022, 18, 1–23. [Google Scholar] [CrossRef]
  64. Zhang, N.; Donahue, J.; Girshick, R.; Darrell, T. Part-based R-CNNs for fine-grained category detection. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part I 13. Springer: Cham, Switzerland, 2014; pp. 834–849. [Google Scholar]
  65. Yu, C.; Zhao, X.; Zheng, Q.; Zhang, P.; You, X. Hierarchical bilinear pooling for fine-grained visual recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 574–589. [Google Scholar]
  66. Lee, J.; Kim, D.; Ham, B. Network quantization with element-wise gradient scaling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 6448–6457. [Google Scholar]
  67. Wang, Y.; Yang, L.; Liu, X.; Shen, C.; Huang, J. Deep Co-Interaction for Multi-Granularity Feature Fusion in Fine-Grained Visual Recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 452–468. [Google Scholar]
  68. Chen, B.; Feng, Q.; Niu, B.; Yan, F.; Gao, B.; Yang, J.; Gong, J.; Liu, J. Multi-modal fusion of satellite and street-view images for urban village classification based on a dual-branch deep neural network. Int. J. Appl. Earth Obs. Geoinf. 2022, 109, 102794. [Google Scholar] [CrossRef]
  69. Weng, Q. Remote sensing of impervious surfaces in the urban areas: Requirements, methods, and trends. Remote Sens. Environ. 2012, 117, 34–49. [Google Scholar] [CrossRef]
  70. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  71. Congalton, R.G. A review of assessing the accuracy of classifications of remotely sensed data. Remote Sens. Environ. 1991, 37, 35–46. [Google Scholar] [CrossRef]
  72. Zhou, G.; Qian, L.; Gamba, P. Advances on Multimodal Remote Sensing Foundation Models for Earth Observation Downstream Tasks: A Survey. Remote Sens. 2025, 17, 3532. [Google Scholar] [CrossRef]
  73. Do, M.K.; Han, K.; Lai, P.; Phan, K.T.; Xiang, W. RobSense: A Robust Multi-modal Foundation Model for Remote Sensing with Static, Temporal, and Incomplete Data Adaptability. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 7427–7436. [Google Scholar]
  74. Samadzadegan, F.; Toosi, A.; Dadrass Javan, F. A critical review on multi-sensor and multi-platform remote sensing data fusion approaches: Current status and prospects. Int. J. Remote Sens. 2025, 46, 1327–1402. [Google Scholar] [CrossRef]
Figure 1. The dual multi-linear pooling (D-MLP) model.
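Figure 1 summarizes the dual multi-linear pooling module. For readers unfamiliar with this family of operators, the following minimal sketch illustrates generic low-rank (Hadamard-product) bilinear pooling in the spirit of [30,52,56], i.e., the kind of high-order cross-modal interaction that D-MLP builds on; the class name, feature dimensions, and two-branch setup are illustrative assumptions and do not reproduce the authors' actual D-MLP implementation.

```python
import torch
import torch.nn as nn

class LowRankBilinearPooling(nn.Module):
    """Illustrative low-rank (Hadamard-product) bilinear pooling, after [30,56].

    NOTE: a generic sketch of the pooling family, not the authors' D-MLP module;
    feature dimensions and layer names are hypothetical.
    """

    def __init__(self, dim_rs: int = 512, dim_sv: int = 512,
                 rank: int = 256, num_classes: int = 2):
        super().__init__()
        self.proj_rs = nn.Linear(dim_rs, rank)   # project remote sensing features
        self.proj_sv = nn.Linear(dim_sv, rank)   # project street view features
        self.classifier = nn.Linear(rank, num_classes)

    def forward(self, f_rs: torch.Tensor, f_sv: torch.Tensor) -> torch.Tensor:
        # The element-wise (Hadamard) product of the two projections approximates
        # the full bilinear outer-product interaction at a fraction of its cost.
        joint = torch.tanh(self.proj_rs(f_rs)) * torch.tanh(self.proj_sv(f_sv))
        return self.classifier(joint)

# Hypothetical usage with 512-d backbone features for a batch of 4 image groups:
pool = LowRankBilinearPooling()
logits = pool(torch.randn(4, 512), torch.randn(4, 512))  # shape (4, 2): UIS vs. Others
```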
Figure 2. Representative samples from WuhanUIS. (a,b) Examples of UIS labeled groups, each composed of one RSI and four SVIs captured from four viewpoints; (c,d) Examples of groups labeled as Others with the same structure.
Figure 3. Example of annotated images in WuhanUIS.
Figure 4. Representative samples from ChinaUIS. (a) A UIS labeled group from Beijing, composed of one RSI and four SVIs captured from four viewpoints; (b) A UIS labeled group from Shenzhen with the same structure; (c) A group labeled as Others from Guangzhou with the same structure; (d) A group labeled as Others from Chongqing with the same structure.
Figure 5. Common failure cases observed in WuhanUIS. (a,b) Examples of UIS labeled groups misclassified as Others, each composed of one RSI and four SVIs captured from four viewpoints; (c,d) Examples of groups labeled as Others misclassified as UIS with the same structure.
Figure 6. Training and validation loss curves of PanFusion-Net on the WuhanUIS dataset.
Figure 7. Confusion matrix of PanFusion-Net on ChinaUIS.
Figure 8. Common failure cases observed in ChinaUIS. (a) A UIS labeled group from Guangzhou misclassified as Others, composed of one RSI and four SVIs captured from four viewpoints; (b) A UIS labeled group from Chengdu misclassified as Others with the same structure; (c) A group labeled as Others from Shenzhen misclassified as UIS with the same structure; (d) A group labeled as Others from Chengdu misclassified as UIS with the same structure.
Figure 9. Confusion matrix of PanFusion-Net on S 2 U V .
Figure 10. Representative cases of multimodal fusion performance on the WuhanUIS dataset. (a) Both RSI and SVI modalities produce incorrect predictions, while the fusion result is correct (ground truth is UIS); (b) Both RSI and SVI modalities produce incorrect predictions, while the fusion result is correct (ground truth is Others); (c) The RSI prediction is incorrect whereas the SVI prediction is correct, and the fusion output remains correct (ground truth is Others); (d) The RSI prediction is correct while the SVI prediction is incorrect, and the fusion output is still correct (ground truth is UIS).
Table 1. Sample distribution on WuhanUIS.
Category             UIS     Others   Total
Number of samples    1500    2332     3832
Table 2. Sample distribution across selected cities on ChinaUIS.
City         UIS    Others   Total
Beijing      127    136      263
Shanghai     90     151      241
Guangzhou    71     138      209
Shenzhen     44     153      197
Tianjin      100    157      257
Chengdu      77     156      233
Chongqing    53     156      209
Wuhan        81     143      224
Total        643    1190     1833
Table 3. Overall classification accuracy (%) of different models under different input modalities on WuhanUIS.
Models                  Input      Overall Accuracy (%)
ResNet-18 [42]          RS         94.72
B-CNN [52]              RS         95.85
HBP [65]                RS         95.95
MLP [41]                RS         95.85
ResNet-18 [42]          SV         82.66
B-CNN [52]              SV         82.76
HBP [65]                SV         84.84
MLP [41]                SV         82.92
PanFusion-Net (ours)    RS + SV    96.14
Table 4. Classification accuracy (%) of different models by category on WuhanUIS.
Models                  Input      Overall Accuracy (%)
ResNet-18 [42]          RS + SV    95.74
HBP [65]                RS + SV    95.54
PanFusion-Net (ours)    RS + SV    96.14
Table 5. Confusion matrix of PanFusion-Net on ChinaUIS.
Classification   UIS      Others   Total   P.A. (%)
UIS              334      23       357     93.56
Others           26       166      192     86.46
Total            360      189      549
U.A. (%)         92.78    87.83
O.A. (%): 91.07
Kappa (%): 80.31
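The per-class producer's accuracy (P.A.), user's accuracy (U.A.), overall accuracy (O.A.), and Cohen's kappa in Tables 5 and 7 follow the standard confusion-matrix accuracy measures reviewed by Congalton [71]. As a minimal sketch (the function name and the row-as-reference/column-as-prediction orientation are our assumptions, chosen because they reproduce the reported values), the figures can be re-derived as follows:

```python
import numpy as np

def confusion_metrics(cm):
    """P.A., U.A., O.A., and Cohen's kappa from a square confusion matrix
    (rows assumed to be reference classes, columns predicted classes)."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    pa = np.diag(cm) / cm.sum(axis=1)                            # producer's accuracy per class
    ua = np.diag(cm) / cm.sum(axis=0)                            # user's accuracy per class
    oa = np.diag(cm).sum() / total                               # overall accuracy
    pe = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total ** 2    # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return pa, ua, oa, kappa

# Table 5 (ChinaUIS), classes ordered [UIS, Others]:
print(confusion_metrics([[334, 23], [26, 166]]))
# -> P.A. ~ (93.56%, 86.46%), U.A. ~ (92.78%, 87.83%), O.A. ~ 91.07%, kappa ~ 80.31%

# Table 7 (S2UV), classes ordered [UIS, Others]:
print(confusion_metrics([[114, 6], [8, 232]]))
# -> O.A. ~ 96.11%, kappa ~ 91.29%
```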
Table 6. Results of different models on ChinaUIS.
Models                     Category Accuracy           OA (%)    Kappa (%)
                           UIS (%)      Others (%)
Single-modal-sv-0°         80.21        89.64           86.34     69.93
Single-modal-sv-90°        77.08        92.44           87.07     70.97
Single-modal-sv-180°       77.60        92.44           87.25     71.42
Single-modal-sv-270°       81.25        92.44           88.52     74.49
Single-modal-sv-Comb.      78.12        92.44           87.43     79.22
Single-modal-rs            85.40        91.62           89.44     77.37
PanFusion-Net (ours)       86.46        93.56           91.07     80.31
Table 7. Confusion matrix of PanFusion-Net on S 2 U V .
Classification   UIS      Others   Total   P.A. (%)
UIS              114      6        120     95.00
Others           8        232      240     96.67
Total            122      238      360
U.A. (%)         93.44    97.48
O.A. (%): 96.11
Kappa (%): 91.29
Table 8. Results for different multimodal models on S 2 U V .
Model Name              OA (%)   Kappa (%)
Trans-MDCNN [68]        92.61    83.52
FusionMixer [28]        94.30    87.34
PanFusion-Net (ours)    96.11    91.29
Table 9. Intra-class performance comparison of multimodal classification models.
Class          Trans-MDCNN   FusionMixer   PanFusion-Net (Ours)
UIS (%)        86.05         93.12         95.00
Others (%)     94.28         93.56         96.67

