1. Introduction
Urban areas occupy less than 2% of the Earth’s land area but host more than half of the global population [1], which continues to grow. By 2030, the global urban population is projected to reach 5 billion. The unprecedented pace of urbanization has led to rapid changes in urban landscapes. Therefore, quickly identifying various urban functional zones to monitor urban land use provides essential information to decision-makers. Identifying and analyzing urban functional zones plays a critical role in urban structure optimization, resource allocation, land use management, commercial site selection, geographic monitoring, disaster assessment, and urban planning. Urban functional zones are defined as areas characterized by similar land use, intensity, orientation, benchmark land value, land use efficiency, and potential. In megacities with high population densities, land use is highly diverse and complex, making it particularly challenging to classify urban functional zones [2].
The classification and identification of urban functional zones have been focal points of research. Early methods typically relied on statistical data, expert knowledge, or land use/land cover information extracted from remote sensing images [3]. These methods often required the manual interpretation of images, which was time-consuming, labor-intensive, and lacking in precision. To address these limitations, some researchers proposed methods based on density analysis and clustering analysis. These methods extract features from data density within units and from static or dynamic urban data such as mobile phone data [4,5], traffic data, and check-in records [6] to classify urban functional zones based on spatial and temporal patterns of regional functionality or human activity. While these methods enable the automated extraction of urban functional zones, they depend on shallow features and fail to capture complex semantic information, making it difficult to distinguish between similar functional zones at a fine-grained level [7].
To extract deeper feature representations, researchers have introduced deep learning into urban functional zone classification [8,9,10,11]. Deep learning methods typically use neural network models to identify and classify functional zones by extracting high-level features from remote sensing imagery. For instance, Bao [12], Zhang [7], Shao [13], Wang [14], and Li [15] have applied deep convolutional neural networks (CNNs), multi-scale pooling, residual refinement, Transformers, and graph neural networks to enhance feature extraction and classification accuracy for buildings and functional zones in remote sensing imagery. In addition, Li et al. [16,17,18] improved the suitability of existing models for remote sensing imagery by refining modules in traditional deep learning frameworks, and other researchers [18,19] have employed various neural network architectures to extract deep features from remote sensing images for functional zone classification, significantly improving classification accuracy. These techniques provide effective approaches to urban land cover and functional zone classification. However, because these models rely primarily on daytime remote sensing imagery, which captures urban landscapes from a top-down perspective, they can access only the physical information visible during the day. Consequently, relying solely on daytime images for urban functional zone classification presents significant limitations [20].
Identifying urban functional zones requires not only physical attributes but also social attribute features, which are difficult to extract from standard daytime optical remote sensing images. With advances in remote sensing technology, nighttime light (NTL) imagery has made substantial progress, enabling the extraction of social attribute features for urban analysis. The contrast between brightness and darkness in NTL imagery provides insights into urbanization that are distinct from daytime data [21]. Like daytime data, NTL data can represent the spatial extent of urbanization (e.g., urban boundaries), but its intensity serves as a direct indicator of human activity [22]. NTL intensity reveals intra-urban variations in urbanization intensity and correlates strongly with socio-economic variables [23], making it suitable for modeling and spatializing these variables [24]. For example, Zhang et al. [25] used NTL data from 1992 to 2008 to iteratively classify urban change with an unsupervised algorithm, effectively removing noise and generating urban change classification maps; their study demonstrated that NTL data captures urbanization dynamics, supporting analyses of urban land cover, population, and economic activity. Building on this, researchers [26] have combined NTL imagery with data such as points of interest (POI) [27], the Normalized Difference Vegetation Index (NDVI), and Baidu Migration (BM) data, applying models such as SVM and U-net to extract semantic features for urban functional zone classification. However, both NTL-based and daytime image-based classification share a common limitation: each captures only specific attributes and cannot fully describe the observed scenes, which significantly constrains subsequent applications. Thus, combining NTL imagery with daytime remote sensing features to extract comprehensive physical and social attribute information remains an open research problem.
With the development of remote sensing technology, a wide array of geographic information data has emerged, and multi-modal RS data fusion has seen significant progress. Integrating complementary information from multiple data modalities enables more robust and reliable decision-making in tasks such as LULC classification, making multi-modal data fusion a feasible approach for identifying urban functional zones. Researchers have combined remote sensing imagery, street-view imagery [28], and geospatial text data [29] to extract complementary feature information, addressing the limitations of single-source data. Huang et al. [20] mapped urban functional zones using high-resolution nighttime light imagery and daytime multi-view images, achieving an average OA of 80%. In addition, scholars have re-identified urban functional areas by integrating remote sensing imagery with social media data, leveraging Bag-of-Visual-Words models [30,31,32] and a three-level Bayesian model [33] to establish relationships between urban visual features, quantitative categories, and hierarchical structures. However, because of differences in data sources, significant parameter disparities can lead to feature misalignment, complicating multi-source data alignment. Strict alignment methods often result in feature loss and lower alignment quality, reducing classification efficiency and increasing computational costs, thereby limiting a model’s performance.
To integrate features from daytime remote sensing imagery and NTL imagery and to address cross-modal feature alignment challenges for improved urban functional zone classification, we propose a Cross-Modal Spatial Alignment Gated Fusion Neural Network (CSAGFNet). This model uses an offset-guided adaptive feature alignment mechanism and a cross-modal gated fusion mechanism to align and fuse features from the two image modalities. The offset-guided adaptive feature alignment mechanism adaptively adjusts the relative positions of multi-modal features, addressing weak alignment between modalities and reducing the impact of modality gaps on spatial matching. The cross-modal gated fusion mechanism weights each modality and suppresses irrelevant parts to adaptively learn discriminative features; it fuses image features extracted from high-resolution remote sensing imagery with NTL features for pixel-level urban land use classification. Tests conducted in various urban areas demonstrate the model’s robustness and generalization capability.
The contributions of this paper are reflected in four aspects:
We propose a method to address feature misalignment between different modalities by employing a weak alignment mechanism. This approach adaptively adjusts the relative positions of multi-modal features, achieving adaptive feature alignment instead of strict alignment.
We develop an improved method for feature-level fusion of urban physical attributes extracted from VHR remote sensing imagery and social attribute features extracted from NTL imagery, enhancing the classification accuracy of urban functional zones.
We investigate the impact of different data fusion methods on model accuracy and conduct relevant experiments, demonstrating the effectiveness of the gated fusion mechanism.
We compare the proposed model with other popular single-modal deep learning models, validating its effectiveness.
The rest of this paper is organized as follows:
Section 2 introduces the network architecture and provides a detailed explanation of the key components of the model.
Section 3 describes the study areas and the dataset preparation process for training the model.
Section 4 presents the results of testing our model on the dataset and analyzes the outcomes. We also detail the evaluation metrics and experimental setup used during model training and perform ablation studies to verify the effectiveness of various components of the model.
Section 5 discusses the experimental results and applies the model to two new areas not included in the training dataset, assessing its robustness and generalization capabilities.
Section 6 concludes the paper based on the analyses above.
2. Methods
To fuse VHR remote sensing imagery with NTL imagery for extracting urban functional zones, we propose a deep convolutional neural network model based on an offset-guided adaptive feature alignment mechanism and cross-modal gated fusion. The structure of the model is illustrated in Figure 1.
Specifically, we input VHR remote sensing imagery and NTL imagery into a dual-stream network to extract image features. The extracted features are then fed into a Cross-Modal Spatial Offset Modeling (CSOM) module, which constructs a shared subspace and estimates precise feature-level offsets, thereby reducing the impact of modality gaps on spatial matching. Subsequently, an Offset-Guided Deformable Alignment (ODAF) module captures the optimal alignment positions for feature fusion, avoiding the losses and errors caused by strict feature alignment. After alignment, the features are fed into a Gated Fusion Module (GFM), which estimates the utility of each corresponding lateral feature from the VHR and NTL modalities and aggregates the information accordingly. Feature fusion is performed at the optimal alignment positions, producing a fused feature map of the VHR and NTL images. Finally, the fused feature map is passed through a DeepLabV3Plus classification head to produce the urban functional zone classification results.
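Before detailing each component, the overall pipeline can be summarized in a short PyTorch-style skeleton. This is a structural sketch only: the submodule interfaces (what CSOM returns, how ODAF consumes the offset) are assumptions that mirror the description above, not the released code.

```python
import torch
import torch.nn as nn

class CSAGFNet(nn.Module):
    """Structural sketch of the pipeline in Figure 1. Submodules are injected so the
    sketch stays self-contained; their exact interfaces are assumptions."""
    def __init__(self, vhr_stream: nn.Module, ntl_stream: nn.Module,
                 csom: nn.Module, odaf: nn.Module, gfm: nn.Module, head: nn.Module):
        super().__init__()
        self.vhr_stream = vhr_stream   # dual-stream feature extractors (Section 2.1)
        self.ntl_stream = ntl_stream
        self.csom = csom               # cross-modal spatial offset modeling (Section 2.2)
        self.odaf = odaf               # offset-guided deformable alignment (Section 2.3)
        self.gfm = gfm                 # gated fusion module (Section 2.4)
        self.head = head               # DeepLabV3Plus-style classification head

    def forward(self, vhr: torch.Tensor, ntl: torch.Tensor) -> torch.Tensor:
        f_vhr = self.vhr_stream(vhr)                    # e.g., B x 256 x 64 x 64
        f_ntl = self.ntl_stream(ntl)
        base_offset = self.csom(f_vhr, f_ntl)           # estimated feature-level offset
        f_vhr_a, f_ntl_a = self.odaf(f_vhr, f_ntl, base_offset)  # weak (adaptive) alignment
        fused = self.gfm(f_vhr_a, f_ntl_a)              # gated cross-modal fusion
        return self.head(fused)                         # pixel-level functional-zone logits
```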
2.1. Dual-Stream Feature Extractor
To better extract features from VHR and NTL imagery, we designed four feature extractors tailored to the distinct characteristics of each image type. These extractors are used to extract invariant features and specific features from VHR and NTL imagery. In image processing tasks, invariant features typically represent robust global properties; the invariant feature extractor therefore adopts a simple structure with fewer layers to enhance computational efficiency and enable rapid feature extraction. In contrast, specific features often require deeper layers to capture fine-grained details and enhance discrimination, so the specific feature extractor uses a more complex network structure to capture richer image details. To extract specific features from VHR imagery, we designed a network with five convolutional layers; its structure is illustrated in Figure 2. The model incrementally downsamples the resolution to 64 × 64 and expands the channel size to 256, thereby enlarging the receptive field and reducing the number of network parameters. Each convolutional layer is followed by Batch Normalization and ReLU activation.
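As a concrete reference, the following PyTorch sketch instantiates a five-layer convolutional stack of the kind described above. The exact kernel sizes, stride placement, and intermediate channel widths are assumptions chosen so that a 256 × 256 tile is reduced to 64 × 64 with 256 channels; they are not the released implementation.

```python
import torch.nn as nn

def conv_bn_relu(c_in, c_out, stride=1):
    """3x3 convolution followed by Batch Normalization and ReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class VHRSpecificExtractor(nn.Module):
    """Five conv layers; two stride-2 layers reduce 256x256 input to 64x64, channels grow to 256."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.layers = nn.Sequential(
            conv_bn_relu(in_channels, 32, stride=1),
            conv_bn_relu(32, 64, stride=2),    # 256 -> 128
            conv_bn_relu(64, 128, stride=1),
            conv_bn_relu(128, 256, stride=2),  # 128 -> 64
            conv_bn_relu(256, 256, stride=1),
        )

    def forward(self, x):
        return self.layers(x)  # B x 256 x 64 x 64 for a 256x256 input
```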
When extracting invariant features from VHR imagery, we utilize a depthwise separable convolution module [34]; the structure of the invariant feature extractor for VHR imagery is illustrated in Figure 3. Depthwise separable convolution decomposes a standard convolution into a depthwise convolution and a pointwise convolution, effectively reducing the computational cost and the number of parameters. This approach extracts richer feature information while maintaining high accuracy.
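A depthwise separable block of the kind referenced above can be sketched as follows; the 3 × 3 kernel, padding, and placement of normalization and activation are illustrative assumptions.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (per-channel spatial filtering) followed by a 1x1 pointwise conv."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, kernel_size=3, stride=stride,
                                   padding=1, groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```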
Unlike typical natural images, NTL imagery has an original spatial resolution of only 40 m, making it unsuitable for excessive downsampling, which would discard critical information. The structure of the specific feature extractor for NTL imagery is illustrated in Figure 4. In the NTL-specific feature extractor, we downsample the images to 64 × 64 and expand the channel count to 256 to preserve as much information as possible while extracting deep features. The network consists of three convolutional layers, each followed by Batch Normalization and ReLU activation to enhance feature representation. For the NTL invariant feature extractor, we designed a shallower network with two convolutional layers, aimed at quickly extracting low-level invariant features while reducing spatial dimensions; its structure is illustrated in Figure 5.
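For completeness, a possible layout of the two NTL extractors (three-layer specific, two-layer invariant) is sketched below. The single-band input, strides, and channel widths are assumptions chosen so the specific branch reaches a 64 × 64 × 256 output.

```python
import torch.nn as nn

def conv_bn_relu(c_in, c_out, stride):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class NTLSpecificExtractor(nn.Module):
    """Three conv layers expanding channels to 256 while limiting downsampling to 64x64."""
    def __init__(self, in_channels=1):
        super().__init__()
        self.layers = nn.Sequential(
            conv_bn_relu(in_channels, 64, stride=2),
            conv_bn_relu(64, 128, stride=2),
            conv_bn_relu(128, 256, stride=1),
        )

    def forward(self, x):
        return self.layers(x)

class NTLInvariantExtractor(nn.Module):
    """Two shallow conv layers for fast extraction of low-level invariant features."""
    def __init__(self, in_channels=1):
        super().__init__()
        self.layers = nn.Sequential(
            conv_bn_relu(in_channels, 64, stride=2),
            conv_bn_relu(64, 128, stride=2),
        )

    def forward(self, x):
        return self.layers(x)
```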
2.2. Cross-Modal Spatial Offset Modeling
Cross-modal spatial offset modeling is the core component for achieving weak feature alignment. Its goal is to align cross-modal features by predicting the spatial offset between VHR and NTL features. The module structure is illustrated in Figure 6.
The spatial offset modeling submodule predicts the spatial offset $\Delta p_{\mathrm{base}}$ by estimating the spatial difference between the VHR feature $F_{\mathrm{VHR}}$ and the NTL feature $F_{\mathrm{NTL}}$. To accurately estimate the spatial difference between the two types of images, feature enhancement is required. First, the input feature $F$ undergoes both max-pooling and average-pooling operations to obtain two different spatial context descriptors. Then, these two descriptors are concatenated and passed through a 7 × 7 convolutional layer to generate a spatial attention map. Finally, the generated spatial attention map is element-wise multiplied with the input feature to obtain the spatially enhanced feature $F'$:

$$F' = \sigma\!\left(f^{7\times 7}\!\big(\big[\mathrm{MaxPool}(F);\ \mathrm{AvgPool}(F)\big]\big)\right)\odot F,$$

where $\sigma$ represents the sigmoid function, $f^{7\times 7}$ denotes the 7 × 7 convolutional layer, $[\,\cdot\,;\,\cdot\,]$ indicates the concatenation operation, $\mathrm{MaxPool}$ and $\mathrm{AvgPool}$ refer to the max-pooling and average-pooling operations, and $\odot$ represents the Hadamard product (element-wise multiplication).
After the feature enhancement process is completed, the spatially enhanced VHR feature $F'_{\mathrm{VHR}}$ and NTL feature $F'_{\mathrm{NTL}}$ are concatenated to obtain the feature difference representation $F_{\mathrm{diff}}$, from which the basic offset is predicted.
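A compact sketch of the CSOM computation described above is given below: spatial attention enhancement of each modality, concatenation, and an offset-prediction head. The offset head (a 3 × 3 convolution producing one (dy, dx) pair per kernel sampling location) is an assumption about how the basic offset is parameterized.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Channel max/avg pooling -> concat -> 7x7 conv -> sigmoid map, applied multiplicatively."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, f):
        max_desc, _ = f.max(dim=1, keepdim=True)   # B x 1 x H x W
        avg_desc = f.mean(dim=1, keepdim=True)     # B x 1 x H x W
        attn = torch.sigmoid(self.conv(torch.cat([max_desc, avg_desc], dim=1)))
        return f * attn                            # spatially enhanced feature F'

class CSOM(nn.Module):
    """Predicts a basic spatial offset field from the enhanced VHR/NTL feature difference."""
    def __init__(self, channels=256, kernel_size=3):
        super().__init__()
        self.sa_vhr = SpatialAttention()
        self.sa_ntl = SpatialAttention()
        # one (dy, dx) pair per kernel sampling location, as required by deformable conv
        self.offset_head = nn.Conv2d(2 * channels, 2 * kernel_size * kernel_size,
                                     kernel_size=3, padding=1)

    def forward(self, f_vhr, f_ntl):
        f_diff = torch.cat([self.sa_vhr(f_vhr), self.sa_ntl(f_ntl)], dim=1)
        return self.offset_head(f_diff)            # basic offset (Delta p_base)
```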
2.3. Offset-Guided Deformable Alignment Module
The Offset-Guided Deformable Alignment (ODAF) module is one of the core components of the OAFA method. Its goal is to achieve adaptive fusion of NTL and VHR features through implicit offset compensation and adaptive alignment. The module structure is illustrated in Figure 7.
The ODAF module uses deformable convolution to achieve implicit offset compensation and adaptive alignment. Deformable convolution builds upon traditional convolution by adding learned offsets that adaptively adjust the sampling positions on the feature map. For each sampling location $p_n$ on the regular kernel grid $\mathcal{R}$, a learned offset $\Delta p_n$ is introduced, allowing the convolution kernel to dynamically adjust its sampling positions based on the input data:

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x\big(p_0 + p_n + \Delta p_n\big),$$

where $x$ is the input feature map, $y$ is the output feature map, $w(p_n)$ denotes the kernel weights, and $p_0$ is the output position.

In standard deformable convolution, the offsets of the convolution kernel are learned from its own input features. In the ODAF module, however, these offsets are derived from the basic offset $\Delta p_{\mathrm{base}}$ produced by the CSOM module, which serves as the initial value for the offset compensation.

After feature alignment is completed, the ODAF module combines the VHR and NTL features through decoupled feature fusion to generate the representation used for the final classification. Traditional fusion pipelines typically concatenate the modality-invariant features and modality-specific features to create a fused feature, but this may lead to information redundancy. In the ODAF module, we first align the VHR and NTL features and then optimize them before fusion to eliminate redundant information and enhance discriminative representations. Furthermore, during fusion we employ a gated fusion mechanism to fully leverage the complementary information between the two modalities; the fusion formulas are given in Section 2.4.
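The alignment step can be sketched with torchvision's DeformConv2d, with the sampling offsets initialized from the CSOM basic offset and refined by a small residual head. The residual refinement and the choice to resample the NTL stream onto the VHR grid are design assumptions for illustration.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class ODAF(nn.Module):
    """Aligns NTL features to VHR features with a deformable convolution whose sampling
    offsets start from the CSOM basic offset (illustrative sketch, not the released code)."""
    def __init__(self, channels=256, kernel_size=3):
        super().__init__()
        self.deform = DeformConv2d(channels, channels, kernel_size=kernel_size,
                                   padding=kernel_size // 2)
        # residual offset refinement on top of the CSOM basic offset (assumed design)
        self.offset_refine = nn.Conv2d(2 * kernel_size * kernel_size,
                                       2 * kernel_size * kernel_size,
                                       kernel_size=3, padding=1)

    def forward(self, f_vhr, f_ntl, base_offset):
        offset = base_offset + self.offset_refine(base_offset)  # offset compensation
        f_ntl_aligned = self.deform(f_ntl, offset)               # resample NTL onto the VHR grid
        return f_vhr, f_ntl_aligned
```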
2.4. Gated Fusion Module
Traditional cross-modal feature fusion methods rely primarily on element-wise summation and concatenation, neither of which can effectively distinguish the importance of features from different modalities. Element-wise summation adds features element by element, implicitly assuming that every modality contributes equally; it also performs only a linear operation, without the nonlinear interaction needed to capture deeper relationships between modalities. In reality, features from different modalities vary in quality and contribution, and simple summation cannot distinguish or highlight the important ones. Concatenation, on the other hand, simply stitches features together, preserving the original information of each modality but ignoring their correlation and complementarity, which can lead to information redundancy and feature clutter. Moreover, concatenated features require further processing by subsequent network layers to achieve information interaction, so simple concatenation contributes little to cross-modal interaction in the initial fusion stage and fails to fully exploit the complementary information between modalities.
To address the issues mentioned above, our model employs a gating mechanism to enhance the quality and efficiency of feature fusion, thereby improving the overall performance of the model. The overall structure of the Gated Fusion Module (GFM) is shown in Figure 8.
First, the feature maps output by the two decoders are concatenated to generate a fused feature map $F_{\mathrm{cat}}$. A 1 × 1 convolution $W_{1\times1}$ is then applied to the fused feature map to compute the correlation between modalities and reduce the channel dimensionality, and the result is passed through the sigmoid function to generate a weighted probability matrix $G$. Using the weighting matrices $G$ and $(1-G)$, the VHR and NTL feature maps are weighted separately to obtain the weighted feature maps $\hat{F}_{\mathrm{VHR}}$ and $\hat{F}_{\mathrm{NTL}}$. Finally, the weighted VHR and NTL feature maps are combined via a Hadamard product to generate the gated fusion feature map $F_{\mathrm{GF}}$:

$$G = \sigma\!\big(W_{1\times1} * F_{\mathrm{cat}}\big), \qquad F_{\mathrm{cat}} = \big[F_{\mathrm{VHR}};\ F_{\mathrm{NTL}}\big],$$

$$\hat{F}_{\mathrm{VHR}} = G \odot F_{\mathrm{VHR}}, \qquad \hat{F}_{\mathrm{NTL}} = (1-G) \odot F_{\mathrm{NTL}},$$

$$F_{\mathrm{GF}} = \hat{F}_{\mathrm{VHR}} \odot \hat{F}_{\mathrm{NTL}}.$$
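A minimal sketch of the gated fusion described by these formulas follows; the 1 × 1 gate convolution mapping 2C channels to C and the (1 − G) weighting of the NTL branch follow the reconstruction above and should be read as illustrative.

```python
import torch
import torch.nn as nn

class GatedFusionModule(nn.Module):
    """Gated fusion: a 1x1 conv + sigmoid produces a gate G; the modalities are weighted
    by G and (1 - G) and the weighted maps are combined with a Hadamard product."""
    def __init__(self, channels=256):
        super().__init__()
        self.gate_conv = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_vhr, f_ntl):
        g = torch.sigmoid(self.gate_conv(torch.cat([f_vhr, f_ntl], dim=1)))  # gate G
        f_vhr_w = g * f_vhr                 # weighted VHR features
        f_ntl_w = (1.0 - g) * f_ntl         # weighted NTL features
        return f_vhr_w * f_ntl_w            # Hadamard product of the weighted maps
```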
2.5. Loss Function
Because a single loss function cannot adequately evaluate the model, and to comprehensively improve classification accuracy, especially in complex scenarios with imbalanced class distributions and small-region detection, we use a combination of Weighted Cross-Entropy Loss [35], Dice Loss [36], and Focal Loss [37]. Combining these loss functions better addresses class imbalance, insufficient attention to small objects or minority classes, and hard-to-classify samples.
The Weighted Cross-Entropy Loss adds a class-weight term to the standard cross-entropy loss to address class imbalance. When certain classes have significantly fewer samples than others, the model is more likely to ignore these minority classes; assigning higher weights to the minority classes makes the model place more emphasis on them during training. The Weighted Cross-Entropy Loss is defined as

$$L_{\mathrm{WCE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} w_c\, y_{i,c}\,\log\big(p_{i,c}\big),$$

where $N$ represents the total number of samples, $C$ represents the total number of classes, $y_{i,c}$ is the true label indicating whether sample $i$ belongs to class $c$, $p_{i,c}$ is the model’s predicted probability that sample $i$ belongs to class $c$, and $w_c$ is the weight for class $c$, which is typically inversely proportional to the class frequency, with higher weights assigned to less frequent classes.
The Dice Loss measures the overlap between two regions and is particularly suitable for handling minority classes or small objects. Dice Loss performs well under class imbalance because it increases the loss weight of samples from minority classes; by improving the Dice coefficient for small objects or minority classes, the model becomes more sensitive to them, avoiding class omission. The Dice Loss is defined as

$$L_{\mathrm{Dice}} = 1 - \frac{2\sum_{i} p_i\, g_i}{\sum_{i} p_i + \sum_{i} g_i},$$

where $p_i$ represents the model’s predicted class probability and $g_i$ represents the true label’s class probability.
Focal Loss aims to address the issue of class imbalance. For easy-to-classify samples, Focal Loss reduces their loss weight, allowing the model to focus more on hard-to-classify samples. By adjusting the focusing parameter $\gamma$, Focal Loss applies a greater loss weight to misclassified and hard-to-classify samples, thereby improving the model’s performance on challenging samples. The Focal Loss is defined as

$$L_{\mathrm{Focal}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} \alpha_c\,\big(1 - p_{i,c}\big)^{\gamma}\, y_{i,c}\,\log\big(p_{i,c}\big),$$

where $\alpha_c$ is the class balance coefficient, which controls the loss weight of samples from different classes; $\gamma$ is the focusing parameter, which controls the emphasis on hard-to-classify samples; $p_{i,c}$ is the model’s predicted probability for class $c$; and $y_{i,c}$ is the true label indicating whether sample $i$ belongs to class $c$.
By combining the three loss functions mentioned above, the model can better address issues such as class imbalance, hard-to-classify samples, and the detection of small objects or minority classes. The combined loss is defined as

$$L = \lambda_1 L_{\mathrm{WCE}} + \lambda_2 L_{\mathrm{Dice}} + \lambda_3 L_{\mathrm{Focal}},$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weighting parameters used to adjust the relative importance of the three loss functions. Based on the experimental results, we set $\lambda_1 = 0.3$.
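A possible PyTorch implementation of the combined loss is sketched below. Only λ1 = 0.3 is stated in the text; the remaining λ values, the focusing parameter γ, and the per-pixel α term are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinedLoss(nn.Module):
    """Weighted cross-entropy + Dice + Focal loss; weights other than lambda1 are assumed."""
    def __init__(self, class_weights, lambdas=(0.3, 0.3, 0.4), gamma=2.0, eps=1e-6):
        super().__init__()
        self.register_buffer("class_weights", class_weights)  # w_c, e.g. inverse class frequency
        self.l1, self.l2, self.l3 = lambdas
        self.gamma = gamma
        self.eps = eps

    def forward(self, logits, target):
        # logits: B x C x H x W, target: B x H x W with class indices
        wce = F.cross_entropy(logits, target, weight=self.class_weights)

        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(0, 2, 3))
        dice = 1.0 - ((2.0 * inter + self.eps)
                      / (probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3)) + self.eps)).mean()

        pt = (probs * one_hot).sum(dim=1).clamp_min(self.eps)  # probability of the true class
        alpha = self.class_weights[target]                     # per-pixel class balance term
        focal = (-alpha * (1.0 - pt) ** self.gamma * pt.log()).mean()

        return self.l1 * wce + self.l2 * dice + self.l3 * focal
```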
2.6. Evaluation Metrics
To validate the accuracy of the classification results, we used the mean Intersection over Union (mIoU), F1 Score, and overall accuracy (OA) as the evaluation metrics for our model.
(1) mIoU
The mIoU is a commonly used evaluation metric in semantic segmentation tasks. It measures the average segmentation performance of a model across multiple classes, reflecting the overall performance of the model in image segmentation tasks. IoU calculates the ratio of the intersection to the union of two sets; in semantic segmentation, it measures the overlap between the predicted results and the ground-truth labels:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}, \qquad \mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{IoU}_c,$$

where, for each class:
TP: the number of samples predicted as positive and actually positive.
TN: the number of samples predicted as negative and actually negative.
FP: the number of samples predicted as positive but actually negative.
FN: the number of samples predicted as negative but actually positive.
(2) F1 Score
The F1 Score combines the precision and recall of a classification model and is especially useful in situations with imbalanced data. It is the harmonic mean of precision and recall, providing a single measure of the model’s overall performance:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
(3) Overall Accuracy
Overall Accuracy is one of the fundamental metrics for evaluating the performance of a classification model. It represents the proportion of correctly predicted samples out of all samples, is applicable to a wide range of classification problems, and is one of the most intuitive evaluation metrics. In classification tasks, the overall accuracy is

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}.$$
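All three metrics can be computed from a single confusion matrix, as in the following sketch (macro-averaged F1 is assumed; function names are illustrative).

```python
import numpy as np

def confusion_matrix(pred, label, num_classes):
    """Accumulate a num_classes x num_classes confusion matrix from flat index arrays."""
    mask = (label >= 0) & (label < num_classes)
    idx = num_classes * label[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def evaluate(cm):
    """Compute mIoU, macro F1, and overall accuracy from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return {"mIoU": iou.mean(), "F1": f1.mean(), "OA": tp.sum() / cm.sum()}
```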
4. Experiments and Results
4.1. Implementation Details
In the experiment, we used PyTorch as the framework for the model with CUDA version v12.1. The CPU was an Intel(R) Core(TM) i9-14900K model, and the GPU was an RTX 4080s with 16 GB of memory.
To improve the model’s generalization ability and to prevent overfitting, we used AdamW as the optimizer during the training phase, with an initial learning rate set to 0.001. The cosine annealing method was applied to adjust the learning rate, allowing it to change cyclically during training, helping the model escape local optima and enhancing training performance. Additionally, to address the issue of class imbalance, we not only oversampled the underrepresented classes in the dataset but also incorporated Focal Loss as part of the loss function. Focal Loss assigns lower weights to easily classified samples and higher weights to hard-to-classify samples, enabling the model to focus more on challenging samples during training. Due to memory limitations, all of the images were divided into 256 × 256 tiles using a sliding window method before being input into the model for training. The batch size was set to 48. During training, we calculated the mIoU value for each epoch and employed an early stopping strategy to prevent overfitting. If the mIoU value on the validation set did not improve for ten consecutive epochs, the training was stopped and the model with the best mIoU value was saved. This approach ensured the model did not overfit. The entire training process took approximately 8 h.
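The training procedure corresponding to these settings can be sketched as follows; model, train_loader, val_loader, criterion, and evaluate_miou are assumed to be defined elsewhere, and the cosine-annealing cycle length and epoch cap are illustrative rather than the exact values used.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
optimizer = AdamW(model.parameters(), lr=1e-3)          # initial learning rate 0.001
scheduler = CosineAnnealingLR(optimizer, T_max=50)      # cosine annealing (cycle length assumed)

best_miou, patience, bad_epochs = 0.0, 10, 0
for epoch in range(200):                                # upper bound; early stopping decides
    model.train()
    for vhr, ntl, target in train_loader:               # 256x256 tiles, batch size 48
        vhr, ntl, target = vhr.to(device), ntl.to(device), target.to(device)
        optimizer.zero_grad()
        loss = criterion(model(vhr, ntl), target)
        loss.backward()
        optimizer.step()
    scheduler.step()

    miou = evaluate_miou(model, val_loader, device)      # validation mIoU per epoch
    if miou > best_miou:
        best_miou, bad_epochs = miou, 0
        torch.save(model.state_dict(), "csagfnet_best.pth")  # keep the best model
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                       # stop after 10 epochs without improvement
            break
```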
4.2. Results
To evaluate the accuracy of the model, we tested it on the test set and calculated its mIoU, F1-score, and overall accuracy. The model output results are illustrated in Figure 11, the overall metrics are given in Table 2, and the IoU values for each functional zone are shown in Figure 12. Partial output results from the validation set are also shown in Figure 11.
From the experimental results, it is evident that the CSAGFNet model performs well on the three key metrics: mIoU, F1-score, and overall accuracy. The mIoU is 0.853, indicating that the model achieved a high intersection-over-union across all functional zones, showing its ability to accurately capture functional zone boundaries. The F1-score is 0.894, demonstrating the model’s excellent balance between precision and recall and its accurate classification of functional zones. The overall accuracy was 0.934, showing that the model achieved very high overall classification accuracy and can effectively predict different urban functional zones.
From the IoU values of each functional zone, the model demonstrates excellent performance in urban functional zone segmentation tasks, particularly in areas such as water bodies, residential areas, and parks/green spaces, where it accurately identifies boundaries and achieves high IoU values. For non-construction zones, due to the limited number of samples, even after processing the dataset with methods such as oversampling, the model was unable to learn sufficient features during training, resulting in a relatively lower IoU value compared to other zones.
Based on its performance across specific categories, the CSAGFNet model exhibited significant disparities in recognition capabilities for different urban functional zones (residential areas, commercial areas, industrial areas, green space, street and transportation, and water and non-development zones), primarily attributable to inherent differences in spectral characteristics, spatial distribution patterns, and sample quantities among categories. For water body identification, the model achieved an outstanding IoU value of 0.89. This result not only stems from its effective capture of water’s distinctive low reflectance in near-infrared bands but also benefits from the innovative integration of nighttime light data features. Specifically, water bodies exhibit unique zero-value characteristics in nighttime light imagery, contrasting sharply with other urban functional zones. By constructing a Gated Fusion Module, the model successfully combines these complementary features, thereby significantly enhancing the robustness of water body identification. The residential area IoU of 0.88 reflects the model’s capability in capturing spatial distribution patterns of buildings, particularly maintaining satisfactory segmentation consistency in high-density urban clusters.
In contrast, the non-development zones showed a markedly lower IoU of 0.79 compared to other categories. In-depth analysis revealed three primary contributing factors: First, from a data perspective, this category constituted merely 2.7% of the total training samples. Even with oversampling techniques, its effective sample size remained less than one-fifth of other categories, substantially limiting the model’s capacity to learn discriminative features. Second, this category exhibits exceptionally high internal heterogeneity, encompassing various subclasses such as bare land, fallow farmland, and gravel areas. Its spectral feature coefficient of variation reaches 0.35, which is 2-3 times higher than that of other categories. This diversity challenges the model in establishing clear decision boundaries within the feature space. Finally, transition zones between these areas and adjacent functional zones accounted for 42% of the total boundary length, where ambiguous edge pixels frequently caused misclassification. These findings align with recent studies highlighting few-shot learning challenges, particularly suggesting that conventional data augmentation strategies may prove insufficient when addressing highly heterogeneous geographical features.
4.3. Ablation Study
To evaluate the efficiency and accuracy of the model in urban functional zone classification, we conducted ablation experiments by independently removing or modifying components of the network. Our model, named CSAGFNet, was used as a baseline for easier comparison with other variants.
We examined the structure of cross-modal fusion by removing data streams to validate the effectiveness of cross-modal integration. Next, we investigated the role of different fusion methods in cross-modal and multi-level feature fusion. Note that all settings and metadata for the training and validation processes were fixed across all ablation experiments.
Initially, we completely removed the NTL stream from CSAGFNet and trained a new model based solely on VHR images, naming this variant CSAGFNet-VHR. Then, we trained two independent VHR and NTL streams without any feature-level fusion between them. Subsequently, the output features of the final layer from each VHR and NTL stream were fused at the decision level without feature alignment; this model was named CSAGFNet-Cat. Finally, we input the VHR and NTL streams into the OAFA module for feature alignment but used feature concatenation as the final fusion method. This model was named CSAGFNet-OAFA. We also included our original model, CSAGFNet, in the experiments. This model aligns VHR and NTL features using the OAFA module and fuses them through the GFM.
Table 3 shows the experimental results of each model on the validation set, Figure 13 illustrates the urban functional zone extraction performance of the different models, and Figure 14 shows a detailed view of the classification results.
We conducted ablation experiments to compare the impact of each module in CSAGFNet on classification performance, systematically verifying how multi-source feature alignment and fusion affect the urban functional zone interpretation task. As shown in Table 3, the CSAGFNet-VHR model, which uses only VHR imagery, exhibits a significantly lower mIoU (0.743) and F1-score (0.727) than the multi-source fusion model, indicating that a single data source has inherent limitations in extracting high-level semantic features. Although VHR imagery offers spatial resolution advantages in the range of 0.3–1 m, the class-internal heterogeneity of its spectral-spatial features tends to cause the misclassification of shadows. Without an additional data stream, the globality and robustness of the features are insufficient, which degrades performance in target classification and segmentation. As the result images show, the model trained solely on VHR data fails to distinguish shadows from actual building areas, erroneously classifying shadows as urban functional areas.

The CSAGFNet-Cat model achieves a primary fusion of VHR and NTL (nighttime light) data via feature concatenation but does not account for the heterogeneous feature differences between multi-source remote sensing data. Its mIoU (0.713) is 0.030 lower than that of CSAGFNet-VHR (0.743), revealing that the direct fusion of heterogeneous features without alignment can lead to feature conflicts: the high-frequency texture features of VHR imagery and the radiance intensity features of NTL data suffer from dimensional mismatches in the uncalibrated feature space, reducing feature interaction efficiency and output accuracy.

After introducing the OAFA module, the CSAGFNet-OAFA model improves the mIoU by 10.6% to 0.789 and the F1-score by 10.7% to 0.804, confirming the importance of bridging feature spaces for multi-source remote sensing interpretation. This module establishes a cross-modal feature projection matrix, achieving orthogonal decomposition and recalibration of the local geometric features from VHR and the global radiance features from NTL in a shared feature space, effectively addressing the distribution shift between heterogeneous data. However, because it still relies on linear concatenation for fusion, its overall accuracy (0.843) falls well short of the optimal model.

Finally, the full CSAGFNet model, which adopts the gated fusion mechanism (GFM), achieves the best results in mIoU (0.853), F1-score (0.894), and overall accuracy (0.934), improvements of 8.1%, 11.2%, and 10.8% over the OAFA variant, respectively. The GFM dynamically adjusts the contribution of multi-source features through learnable gating weights. This nonlinear fusion mechanism not only suppresses feature redundancy but also enhances the complementarity of cross-modal features, particularly in shadow-building boundary areas, where the fine boundary information from VHR and the semantic intensity information from NTL reinforce each other.
The quantitative results of the experiment, combined with the visualization analysis, collectively indicate that the performance improvement in multi-source remote sensing interpretation occurs in two key stages. The feature alignment stage primarily addresses the spatial-semantic matching issues of heterogeneous data (with an OAFA contribution of approximately 65.3%), while the feature fusion stage optimizes multi-modal feature representation through dynamic weight allocation (with a GFM contribution of approximately 34.7%).
In conclusion, the feature alignment module plays a crucial role in enhancing performance, while the fusion method further influences the final feature representation. CSAGFNet demonstrates a more optimal fusion strategy, enabling it to achieve the best segmentation performance.
5. Discussion
5.1. Comparison with Single-Modal Models
In recent years, with the development of deep learning, single-modal DCNN models have become quite mature. To further assess the effectiveness of our method, we compared CSAGFNet with five advanced single-modal networks: U-net [38], DeepLabV3Plus [39], FPN [40], PSPNet [41], and PAN [42]. These methods were chosen because they have all been proven to classify images effectively, and they are all open-source and easy to use. To ensure fairness, all single-modal models used ResNet34 as the encoder. The specific parameters of the comparative models are shown in Figure 15, the experimental results are given in Table 4, and the classification results for the functional areas are displayed in Figure 16. A partial detailed view of the model output results is shown in Figure 17.
By comparing the performance differences between five typical single-modal segmentation models (Res-UNet, DeepLabV3Plus, FPN, PSPNet, and PAN) and the multi-modal fusion model CSAGFNet, we systematically validated the benefit of cross-modal feature fusion for urban functional zone remote sensing interpretation. As shown in Table 4, CSAGFNet significantly outperforms all single-modal models on the three metrics mIoU (0.853), F1-score (0.894), and overall accuracy (0.934), confirming the necessity of the collaborative interpretation of multi-source remote sensing data.
Single-modal models are limited by the information representation capabilities of a single data source, and their performance differences reflect the model architecture’s adaptability to feature extraction. Res-UNet, benefiting from residual connections and an encoder-decoder structure, performs the best among the single-modal models (mIoU = 0.812). However, the VHR imagery it relies on only captures spatial geometric features (such as building contours and texture details) and cannot acquire socio-economic attributes, such as the intensity of human activity, leading to insufficient distinction between functionally similar building groups (e.g., office buildings and residential buildings). DeepLabV3Plus uses atrous spatial pyramid pooling (ASPP) to enhance multi-scale feature extraction, but its F1-score (0.836) is 6.9% lower than that of CSAGFNet, indicating a bottleneck in high-level semantic association modeling under a single modality. FPN and PSPNet, due to differences in their feature pyramid fusion strategies, achieve mIoU values of 0.789 and 0.747, respectively. However, both models show significant misclassification in shadow-covered areas (false detection rates > 18%), confirming the interpretive ambiguity of single-modal data in complex scenarios. The low performance of PAN (mIoU = 0.693) further reveals that unoptimized feature aggregation mechanisms exacerbate inter-class confusion, particularly in low-contrast areas (e.g., industrial and commercial zones).
CSAGFNet achieves multi-source feature collaborative optimization and improves model performance through Cross-modal Spatial Offset Modeling (CSOM) and the Gated Fusion Module (GFM). The CSOM module adaptively adjusts the features between different modalities, reducing the disparities in image resolution, observation angles, and data attributes, allowing multi-modal features to be aligned in similar spatial dimensions. The GFM module further utilizes a gating mechanism to weight and fuse features from different modalities, effectively enhancing the capture of useful features while filtering out irrelevant information. This enables CSAGFNet to simultaneously capture physical features from VHR imagery and socio-economic features from NTL imagery, fully leveraging the complementary advantages of multi-modal data.
CSAGFNet demonstrates significant advantages in special scenarios, such as shadow-interfered areas and functionally ambiguous regions. Single-modal models, such as Res-UNet, tend to misclassify shadows as low-density buildings with a probability of 24.6%, while CSAGFNet reduces the misclassification rate to 6.3% by using human activity intensity features from NTL data (e.g., the light intensity in shadow areas approaching 0). The F1-score difference between industrial areas (high NTL intensity, regular geometric layout) and commercial areas (high NTL intensity, complex textures) decreases from 12.4% in single-modal models to 4.7%, demonstrating that multi-modal features can enhance inter-class separability. For the underrepresented non-constructed areas (<5% of the total), CSAGFNet improves the IoU by 5% through the temporal stability features of NTL data, overcoming the false-negative problem caused by sample imbalances in single-modal models.
5.2. Model Generalization
To further evaluate the generalization of the proposed model, we conducted tests on two additional administrative districts. We input high-resolution imagery and NTL data from these regions into the model to obtain classification maps of urban functional zones. However, because ground-truth label data were unavailable, validation could only be performed through visual interpretation combined with random point sampling, and the overall accuracy over each region was calculated from the confusion matrix. The test results are shown in Table 5, the classification results for the functional zones are presented in Figure 18 and Figure 19, and a partial detailed view of the model output results is shown in Figure 20.
Based on the data presented in Table 5, the model demonstrates strong cross-domain adaptability in the untrained YueXiu (overall accuracy 91.8%) and TianHe (87.8%) districts; the performance difference (approximately 4.0 percentage points) reflects the impact of geographic environmental heterogeneity on model generalization. At the macro scale, the feature alignment and dynamic fusion mechanisms enable robust generalization (overall accuracy > 87%). At the micro scale, regional heterogeneity and the coupling of fine-grained features mean that additional geographic data fusion and adaptive training strategies are still required. These results confirm that an architecture based on multi-modal feature alignment (the OAFA module) and gated fusion (the GFM module) can effectively alleviate inter-domain feature distribution shifts.
The OAFA module reduces intra-class variance of cross-region features by orthogonal projection, mapping local geometric features from VHR imagery (such as building contour curvature) and radiance intensity features from NTL data to a unified semantic space. This enhances the model’s robustness to regional differences in building density, road network structure, and other characteristics. The GFM module dynamically adjusts the contribution of multi-modal features through gating weights. In the YueXiu region (high building density, regular road network), the model emphasizes VHR spatial details (weight proportion > 0.73), while in the TianHe region (mixed-use area, complex textures), the model strengthens the semantic intensity features of NTL (weight proportion > 0.68), thus achieving scene-adaptive feature expression optimization.
Although the model performs excellently at the macro scale, there remains a significant bottleneck in fine-grained classification tasks (e.g., distinguishing residential and commercial areas), with an F1-score difference of 14.2%. Feature confusion and inter-domain feature shifts still exist. In high-density residential areas (floor area ratio > 2.5), the architectural layout is similar to that of commercial areas (e.g., grid-like arrangement), leading to insufficient distinction of VHR texture features. In nighttime-active residential areas, human activity intensity features overlap with commercial areas, weakening the discriminative power of NTL data. The building function distribution in non-training regions exhibits systematic differences from the training set, making it difficult for the model to capture domain-invariant features for fine-grained classification.
For fine-grained classification tasks, a more refined feature extraction module or the use of auxiliary features (e.g., POI data, building height) may be necessary to improve the model’s discriminatory ability. To further enhance the accuracy of residential and commercial area classification, incorporating more granular features or specific category data augmentation strategies may be considered to meet the specific task requirements in non-training regions.
6. Conclusions
The primary objective of this study is to extract information from VHR imagery and NTL imagery for urban functional zone classification. To this end, we propose an end-to-end cross-modal spatial alignment gated fusion deep neural network (CSAGFNet), centered on the multi-modal fusion of very-high-resolution remote sensing imagery (VHR) and nighttime light data (NTL). The network is designed specifically for the urban functional zone classification task, combining cross-modal spatial alignment and gated fusion mechanisms. Through systematic experimental validation, the model achieves an mIoU of 0.853, an F1-score of 0.894, and an overall accuracy of 0.934 on the test set, an improvement of 5.2–8.7% over single-modal baseline models, which validates the gain obtained from the collaboration of multi-source remote sensing data for urban functional zone recognition. The main contributions of this study are reflected in the following three dimensions:
First, at the feature representation level, the proposed OAFA module addresses the feature space mismatch problem caused by spatial resolution differences (0.5 m vs. 500 m) and radiometric representation differences (reflectance vs. radiance intensity) between VHR and NTL data by establishing a cross-modal attention mechanism. Ablation experiments show that this module improves classification accuracy by more than 5% on the new urban area test set.
Second, in terms of feature fusion strategy, the Gated Fusion Module (GFM) achieves adaptive fusion of multi-modal features through a dynamic weight allocation mechanism. Quantitative analysis demonstrates that, compared to traditional concatenation or summation fusion methods, GFM improves the OA metric by 10.8% while maintaining the same model parameters. This is particularly evident in commercial-residential mixed-function areas, where the model shows enhanced discriminative power (with an 11.2% improvement in F1-score).
Third, for model generalization validation, we constructed a cross-domain test set for the main urban areas of Guangzhou. CSAGFNet achieved overall accuracy scores of 91.8% and 87.8% in two untrained new regions, confirming the model’s strong adaptability to the spatial structural heterogeneity of cities and demonstrating its robust generalization ability and resilience.
Although the current study made progress in multi-modal fusion methods, limitations remain in fine-grained classification (e.g., residential area density grading) and dynamic functional zone recognition. Future research will therefore focus on three aspects:
Incorporating building contour vector data and POI semantic information to construct a multi-scale feature pyramid to enhance the spatial-semantic representation of urban functions.
Developing differentiable morphological operators to improve the analytical accuracy of linear features such as road networks.
Establishing a spatiotemporal collaborative fusion framework to integrate temporal NTL fluctuation features with quarterly VHR vegetation indices for dynamic monitoring of the evolution of urban functional areas.
The cross-modal alignment theoretical framework proposed in this study offers a new methodological reference for multi-source remote sensing data fusion, and it has practical value for the refined management of smart cities.