1. Introduction
Urban areas occupy less than 2% of the Earth’s land area but host more than half of the global population [1], which continues to grow. By 2030, the global urban population is projected to reach 5 billion. The unprecedented pace of urbanization has led to rapid changes in urban landscapes. Therefore, quickly identifying various urban functional zones to monitor urban land use provides essential information to decision-makers. Identifying and analyzing urban functional zones plays a critical role in urban structure optimization, resource allocation, land use management, commercial site selection, geographic monitoring, disaster assessment, and urban planning. Urban functional zones are defined as areas characterized by similar land use, intensity, orientation, benchmark land value, land use efficiency, and potential. In megacities with high population densities, land use is highly diverse and complex, making it particularly challenging to classify urban functional zones [2].
The classification and identification of urban functional zones have been focal points of research. Early methods typically relied on statistical data, expert knowledge, or land use/land cover information extracted from remote sensing images [3]. These methods often required the manual interpretation of images, which was time-consuming, labor-intensive, and lacking in precision. To address these limitations, some researchers proposed methods based on density analysis and clustering analysis. These methods extract features from data density within units and from static or dynamic urban data such as mobile phone data [4,5], traffic data, and check-in records [6] to classify urban functional zones based on spatial and temporal patterns of regional functionality or human activity. While these methods enable the automated extraction of urban functional zones, they depend on shallow features and fail to capture complex semantic information, making it difficult to distinguish between similar functional zones at a fine-grained level [7].
To extract deeper feature representations, researchers have introduced deep learning into urban functional zone classification [8,9,10,11]. Deep learning methods typically use neural network models to identify and classify functional zones by extracting high-level features from remote sensing imagery. For instance, Bao [12], Zhang [7], Shao [13], Wang [14], and Li [15] have applied deep convolutional neural networks (CNNs), multi-scale pooling, residual refinement, Transformers, and graph neural networks to enhance feature extraction and classification accuracy for buildings and functional zones in remote sensing imagery. In addition, Li et al. [16,17,18] improved the suitability of existing models for remote sensing imagery by refining modules in traditional deep learning frameworks, and other researchers [18,19] have employed various neural network architectures to extract deep features from remote sensing images for functional zone classification, significantly improving classification accuracy. These techniques provide effective approaches to urban land cover and functional zone classification. However, because these models rely primarily on daytime remote sensing imagery, which captures urban landscapes from a top-down perspective, they can access only the physical information visible during the day. Consequently, relying solely on daytime images for urban functional zone classification presents significant limitations [20].
Identifying urban functional zones requires not only physical attributes but also social attribute features, which are difficult to extract from standard daytime optical remote sensing images. With advances in remote sensing technology, nighttime light (NTL) imagery has made substantial progress, enabling the extraction of social attribute features for urban analysis. The contrast between brightness and darkness in NTL imagery provides insights into urbanization that are distinct from daytime data [21]. Like daytime data, NTL data can represent the spatial extent of urbanization (e.g., urban boundaries), but its intensity serves as a direct indicator of human activity [22]. NTL intensity reveals intra-urban variations in urbanization intensity and correlates strongly with socio-economic variables [23], making it suitable for modeling and spatializing these variables [24]. For example, Zhang et al. [25] used NTL data from 1992 to 2008 to iteratively classify urban change with an unsupervised algorithm, effectively removing noise and generating urban change classification maps; their study demonstrated that NTL data captures urbanization dynamics, supporting analyses of urban land cover, population, and economic activity. Building on this, researchers [26] have combined NTL imagery with data such as points of interest (POI) [27], the Normalized Difference Vegetation Index (NDVI), and Baidu Migration (BM) data, applying models such as SVM and U-net to extract semantic features for urban functional zone classification. However, both NTL-based and daytime image-based classification share a common limitation: each captures only specific attributes and cannot fully describe the observed scenes, which significantly constrains subsequent applications. Thus, combining NTL imagery with daytime remote sensing features to extract comprehensive physical and social attribute information remains an open research problem.
With the development of remote sensing technology, a wide array of geographic information data has emerged, and multi-modal RS data fusion has seen significant progress. Integrating complementary information from multiple data modalities enables more robust and reliable decision-making in tasks such as LULC classification, making multi-modal data fusion a feasible approach for identifying urban functional zones. Researchers have combined remote sensing imagery, street-view imagery [28], and geospatial text data [29] to extract complementary feature information, addressing the limitations of single-source data. Huang et al. [20] mapped urban functional zones using high-resolution nighttime light imagery and daytime multi-view images, achieving an average OA of 80%. In addition, scholars have re-identified urban functional areas by integrating remote sensing imagery with social media data, leveraging Bag-of-Visual-Words models [30,31,32] and a three-level Bayesian model [33] to establish relationships between urban visual features, quantitative categories, and hierarchical structures. However, because of differences in data sources, significant parameter disparities can lead to feature misalignment, complicating multi-source data alignment. Strict alignment methods often result in feature loss and lower alignment quality, reducing classification efficiency and increasing computational costs, thereby limiting a model’s performance.
To integrate features from daytime remote sensing imagery and NTL imagery and to address cross-modal feature alignment challenges for improved urban functional zone classification, we propose a Cross-Modal Spatial Alignment Gated Fusion Neural Network (CSAGFNet). This model uses an offset-guided adaptive feature alignment mechanism and a cross-modal gated fusion mechanism to align and fuse features from the two image modalities. The offset-guided adaptive feature alignment mechanism adaptively adjusts the relative positions of multi-modal features, addressing weak alignment between modalities and reducing the impact of modality gaps on spatial matching. The cross-modal gated fusion mechanism weights each modality and suppresses irrelevant parts to adaptively learn discriminative features; it fuses image features extracted from high-resolution remote sensing imagery with NTL features for pixel-level urban land use classification. Tests conducted in various urban areas demonstrate the model’s robustness and generalization capability.
The contributions of this paper are reflected in four aspects:
We propose a method to address feature misalignment between different modalities by employing a weak alignment mechanism. This approach adaptively adjusts the relative positions of multi-modal features, achieving adaptive feature alignment instead of strict alignment.
We develop an improved method for feature-level fusion of urban physical attributes extracted from VHR remote sensing imagery and social attribute features extracted from NTL imagery, enhancing the classification accuracy of urban functional zones.
We investigate the impact of different data fusion methods on model accuracy and conduct relevant experiments, demonstrating the effectiveness of the gated fusion mechanism.
We compare the proposed model with other popular single-modal deep learning models, validating its effectiveness.
The rest of this paper is organized as follows:
Section 2 introduces the network architecture and provides a detailed explanation of the key components of the model.
Section 3 describes the study areas and the dataset preparation process for training the model.
Section 4 presents the results of testing our model on the dataset and analyzes the outcomes. We also detail the evaluation metrics and experimental setup used during model training and perform ablation studies to verify the effectiveness of various components of the model.
Section 5 discusses the experimental results and applies the model to two new areas not included in the training dataset, assessing its robustness and generalization capabilities.
Section 6 concludes the paper based on the analyses above.
2. Methods
To fuse VHR remote sensing imagery with NTL imagery for extracting urban functional zones, we propose a deep convolutional neural network model based on an offset-guided adaptive feature alignment mechanism and cross-modal gated fusion. The structure of the model is illustrated in Figure 1.
Specifically, we input VHR remote sensing imagery and NTL imagery into a dual-stream network to extract image features. The extracted features are then fed into a Cross-Modal Spatial Offset Modeling (CSOM) module, which constructs a shared subspace and estimates precise feature-level offsets, thereby reducing the impact of modality gaps on spatial matching. Subsequently, an Offset-Guided Deformable Alignment (ODAF) module captures the optimal alignment positions for feature fusion, avoiding the losses and errors caused by strict feature alignment. After alignment, the features are fed into a Gated Fusion Module (GFM), which estimates the utility of each corresponding lateral feature from the VHR and NTL modalities and aggregates the information accordingly. Feature fusion is performed at the optimal alignment positions, producing a fused feature map of the VHR and NTL images. Finally, the fused feature map is passed through a DeepLabV3Plus classification head to produce the urban functional zone classification results.
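Before detailing each component, the overall pipeline can be summarized in a short PyTorch-style skeleton. This is a structural sketch only: the submodule interfaces (what CSOM returns, how ODAF consumes the offset) are assumptions that mirror the description above, not the released code.

```python
import torch
import torch.nn as nn

class CSAGFNet(nn.Module):
    """Structural sketch of the pipeline in Figure 1. Submodules are injected so the
    sketch stays self-contained; their exact interfaces are assumptions."""
    def __init__(self, vhr_stream: nn.Module, ntl_stream: nn.Module,
                 csom: nn.Module, odaf: nn.Module, gfm: nn.Module, head: nn.Module):
        super().__init__()
        self.vhr_stream = vhr_stream   # dual-stream feature extractors (Section 2.1)
        self.ntl_stream = ntl_stream
        self.csom = csom               # cross-modal spatial offset modeling (Section 2.2)
        self.odaf = odaf               # offset-guided deformable alignment (Section 2.3)
        self.gfm = gfm                 # gated fusion module (Section 2.4)
        self.head = head               # DeepLabV3Plus-style classification head

    def forward(self, vhr: torch.Tensor, ntl: torch.Tensor) -> torch.Tensor:
        f_vhr = self.vhr_stream(vhr)                    # e.g., B x 256 x 64 x 64
        f_ntl = self.ntl_stream(ntl)
        base_offset = self.csom(f_vhr, f_ntl)           # estimated feature-level offset
        f_vhr_a, f_ntl_a = self.odaf(f_vhr, f_ntl, base_offset)  # weak (adaptive) alignment
        fused = self.gfm(f_vhr_a, f_ntl_a)              # gated cross-modal fusion
        return self.head(fused)                         # pixel-level functional-zone logits
```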
2.1. Dual-Stream Feature Extractor
To better extract features from VHR and NTL imagery, we designed four feature extractors tailored to the distinct characteristics of each image type. These extractors are used to extract invariant features and specific features from VHR and NTL imagery. In image processing tasks, invariant features typically represent robust global properties; the invariant feature extractor therefore adopts a simple structure with fewer layers to enhance computational efficiency and enable rapid feature extraction. In contrast, specific features often require deeper layers to capture fine-grained details and enhance discrimination, so the specific feature extractor uses a more complex network structure to capture richer image details. To extract specific features from VHR imagery, we designed a network with five convolutional layers; its structure is illustrated in Figure 2. The model incrementally downsamples the resolution to 64 × 64 and expands the channel size to 256, thereby enlarging the receptive field and reducing the number of network parameters. Each convolutional layer is followed by Batch Normalization and ReLU activation.
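As a concrete reference, the following PyTorch sketch instantiates a five-layer convolutional stack of the kind described above. The exact kernel sizes, stride placement, and intermediate channel widths are assumptions chosen so that a 256 × 256 tile is reduced to 64 × 64 with 256 channels; they are not the released implementation.

```python
import torch.nn as nn

def conv_bn_relu(c_in, c_out, stride=1):
    """3x3 convolution followed by Batch Normalization and ReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class VHRSpecificExtractor(nn.Module):
    """Five conv layers; two stride-2 layers reduce 256x256 input to 64x64, channels grow to 256."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.layers = nn.Sequential(
            conv_bn_relu(in_channels, 32, stride=1),
            conv_bn_relu(32, 64, stride=2),    # 256 -> 128
            conv_bn_relu(64, 128, stride=1),
            conv_bn_relu(128, 256, stride=2),  # 128 -> 64
            conv_bn_relu(256, 256, stride=1),
        )

    def forward(self, x):
        return self.layers(x)  # B x 256 x 64 x 64 for a 256x256 input
```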
When extracting invariant features from VHR imagery, we utilize a depthwise separable convolution module [34]; the structure of the invariant feature extractor for VHR imagery is illustrated in Figure 3. Depthwise separable convolution decomposes a standard convolution into a depthwise convolution and a pointwise convolution, effectively reducing the computational cost and the number of parameters. This approach extracts richer feature information while maintaining high accuracy.
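A depthwise separable block of the kind referenced above can be sketched as follows; the 3 × 3 kernel, padding, and placement of normalization and activation are illustrative assumptions.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (per-channel spatial filtering) followed by a 1x1 pointwise conv."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, kernel_size=3, stride=stride,
                                   padding=1, groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```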
Unlike typical natural images, NTL imagery has an original spatial resolution of only 40 m, making it unsuitable for excessive downsampling, which would discard critical information. The structure of the specific feature extractor for NTL imagery is illustrated in Figure 4. In the NTL-specific feature extractor, we downsample the images to 64 × 64 and expand the channel count to 256 to preserve as much information as possible while extracting deep features. The network consists of three convolutional layers, each followed by Batch Normalization and ReLU activation to enhance feature representation. For the NTL invariant feature extractor, we designed a shallower network with two convolutional layers, aimed at quickly extracting low-level invariant features while reducing spatial dimensions; its structure is illustrated in Figure 5.
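For completeness, a possible layout of the two NTL extractors (three-layer specific, two-layer invariant) is sketched below. The single-band input, strides, and channel widths are assumptions chosen so the specific branch reaches a 64 × 64 × 256 output.

```python
import torch.nn as nn

def conv_bn_relu(c_in, c_out, stride):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class NTLSpecificExtractor(nn.Module):
    """Three conv layers expanding channels to 256 while limiting downsampling to 64x64."""
    def __init__(self, in_channels=1):
        super().__init__()
        self.layers = nn.Sequential(
            conv_bn_relu(in_channels, 64, stride=2),
            conv_bn_relu(64, 128, stride=2),
            conv_bn_relu(128, 256, stride=1),
        )

    def forward(self, x):
        return self.layers(x)

class NTLInvariantExtractor(nn.Module):
    """Two shallow conv layers for fast extraction of low-level invariant features."""
    def __init__(self, in_channels=1):
        super().__init__()
        self.layers = nn.Sequential(
            conv_bn_relu(in_channels, 64, stride=2),
            conv_bn_relu(64, 128, stride=2),
        )

    def forward(self, x):
        return self.layers(x)
```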
2.2. Cross-Modal Spatial Offset Modeling
Cross-modal spatial offset modeling is the core component for achieving weak feature alignment. Its goal is to align cross-modal features by predicting the spatial offset between VHR and NTL features. The module structure is illustrated in Figure 6.
The spatial offset modeling submodule predicts the spatial offset $\Delta p_{\mathrm{base}}$ by estimating the spatial difference between the VHR feature $F_{\mathrm{VHR}}$ and the NTL feature $F_{\mathrm{NTL}}$. To accurately estimate the spatial difference between the two types of images, feature enhancement is required. First, the input feature $F$ undergoes both max-pooling and average-pooling operations to obtain two different spatial context descriptors. Then, these two descriptors are concatenated and passed through a 7 × 7 convolutional layer to generate a spatial attention map. Finally, the generated spatial attention map is element-wise multiplied with the input feature to obtain the spatially enhanced feature $F'$:

$$F' = \sigma\!\left(f^{7\times 7}\!\big(\big[\mathrm{MaxPool}(F);\ \mathrm{AvgPool}(F)\big]\big)\right)\odot F,$$

where $\sigma$ represents the sigmoid function, $f^{7\times 7}$ denotes the 7 × 7 convolutional layer, $[\,\cdot\,;\,\cdot\,]$ indicates the concatenation operation, $\mathrm{MaxPool}$ and $\mathrm{AvgPool}$ refer to the max-pooling and average-pooling operations, and $\odot$ represents the Hadamard product (element-wise multiplication).
After the feature enhancement process is completed, the spatially enhanced VHR feature $F'_{\mathrm{VHR}}$ and NTL feature $F'_{\mathrm{NTL}}$ are concatenated to obtain the feature difference representation $F_{\mathrm{diff}}$, from which the basic offset is predicted.
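A compact sketch of the CSOM computation described above is given below: spatial attention enhancement of each modality, concatenation, and an offset-prediction head. The offset head (a 3 × 3 convolution producing one (dy, dx) pair per kernel sampling location) is an assumption about how the basic offset is parameterized.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Channel max/avg pooling -> concat -> 7x7 conv -> sigmoid map, applied multiplicatively."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, f):
        max_desc, _ = f.max(dim=1, keepdim=True)   # B x 1 x H x W
        avg_desc = f.mean(dim=1, keepdim=True)     # B x 1 x H x W
        attn = torch.sigmoid(self.conv(torch.cat([max_desc, avg_desc], dim=1)))
        return f * attn                            # spatially enhanced feature F'

class CSOM(nn.Module):
    """Predicts a basic spatial offset field from the enhanced VHR/NTL feature difference."""
    def __init__(self, channels=256, kernel_size=3):
        super().__init__()
        self.sa_vhr = SpatialAttention()
        self.sa_ntl = SpatialAttention()
        # one (dy, dx) pair per kernel sampling location, as required by deformable conv
        self.offset_head = nn.Conv2d(2 * channels, 2 * kernel_size * kernel_size,
                                     kernel_size=3, padding=1)

    def forward(self, f_vhr, f_ntl):
        f_diff = torch.cat([self.sa_vhr(f_vhr), self.sa_ntl(f_ntl)], dim=1)
        return self.offset_head(f_diff)            # basic offset (Delta p_base)
```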
2.3. Offset-Guided Deformable Alignment Module
The Offset-Guided Deformable Alignment (ODAF) module is one of the core components of the OAFA method. Its goal is to achieve adaptive fusion of NTL and VHR features through implicit offset compensation and adaptive alignment. The module structure is illustrated in Figure 7.
The ODAF module uses deformable convolution to achieve implicit offset compensation and adaptive alignment. Deformable convolution builds upon traditional convolution by adding learned offsets that adaptively adjust the sampling positions on the feature map. For each sampling location $p_n$ on the regular kernel grid $\mathcal{R}$, a learned offset $\Delta p_n$ is introduced, allowing the convolution kernel to dynamically adjust its sampling positions based on the input data:

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x\big(p_0 + p_n + \Delta p_n\big),$$

where $x$ is the input feature map, $y$ is the output feature map, $w(p_n)$ denotes the kernel weights, and $p_0$ is the output position.

In standard deformable convolution, the offsets of the convolution kernel are learned from its own input features. In the ODAF module, however, these offsets are derived from the basic offset $\Delta p_{\mathrm{base}}$ produced by the CSOM module, which serves as the initial value for the offset compensation.

After feature alignment is completed, the ODAF module combines the VHR and NTL features through decoupled feature fusion to generate the representation used for the final classification. Traditional fusion pipelines typically concatenate the modality-invariant features and modality-specific features to create a fused feature, but this may lead to information redundancy. In the ODAF module, we first align the VHR and NTL features and then optimize them before fusion to eliminate redundant information and enhance discriminative representations. Furthermore, during fusion we employ a gated fusion mechanism to fully leverage the complementary information between the two modalities; the fusion formulas are given in Section 2.4.
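The alignment step can be sketched with torchvision's DeformConv2d, with the sampling offsets initialized from the CSOM basic offset and refined by a small residual head. The residual refinement and the choice to resample the NTL stream onto the VHR grid are design assumptions for illustration.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class ODAF(nn.Module):
    """Aligns NTL features to VHR features with a deformable convolution whose sampling
    offsets start from the CSOM basic offset (illustrative sketch, not the released code)."""
    def __init__(self, channels=256, kernel_size=3):
        super().__init__()
        self.deform = DeformConv2d(channels, channels, kernel_size=kernel_size,
                                   padding=kernel_size // 2)
        # residual offset refinement on top of the CSOM basic offset (assumed design)
        self.offset_refine = nn.Conv2d(2 * kernel_size * kernel_size,
                                       2 * kernel_size * kernel_size,
                                       kernel_size=3, padding=1)

    def forward(self, f_vhr, f_ntl, base_offset):
        offset = base_offset + self.offset_refine(base_offset)  # offset compensation
        f_ntl_aligned = self.deform(f_ntl, offset)               # resample NTL onto the VHR grid
        return f_vhr, f_ntl_aligned
```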
2.4. Gated Fusion Module
Traditional cross-modal feature fusion methods rely primarily on element-wise summation and concatenation, neither of which can effectively distinguish the importance of features from different modalities. Element-wise summation adds features element by element, implicitly assuming that every modality contributes equally; it also performs only a linear operation, without the nonlinear interaction needed to capture deeper relationships between modalities. In reality, features from different modalities vary in quality and contribution, and simple summation cannot distinguish or highlight the important ones. Concatenation, on the other hand, simply stitches features together, preserving the original information of each modality but ignoring their correlation and complementarity, which can lead to information redundancy and feature clutter. Moreover, concatenated features require further processing by subsequent network layers to achieve information interaction, so simple concatenation contributes little to cross-modal interaction in the initial fusion stage and fails to fully exploit the complementary information between modalities.
To address the issues mentioned above, our model employs a gating mechanism to enhance the quality and efficiency of feature fusion, thereby improving the overall performance of the model. The overall structure of the Gated Fusion Module (GFM) is shown in Figure 8.
First, the feature maps output by the two decoders are concatenated to generate a fused feature map $F_{\mathrm{cat}}$. A 1 × 1 convolution $W_{1\times1}$ is then applied to the fused feature map to compute the correlation between modalities and reduce the channel dimensionality, and the result is passed through the sigmoid function to generate a weighted probability matrix $G$. Using the weighting matrices $G$ and $(1-G)$, the VHR and NTL feature maps are weighted separately to obtain the weighted feature maps $\hat{F}_{\mathrm{VHR}}$ and $\hat{F}_{\mathrm{NTL}}$. Finally, the weighted VHR and NTL feature maps are combined via a Hadamard product to generate the gated fusion feature map $F_{\mathrm{GF}}$:

$$G = \sigma\!\big(W_{1\times1} * F_{\mathrm{cat}}\big), \qquad F_{\mathrm{cat}} = \big[F_{\mathrm{VHR}};\ F_{\mathrm{NTL}}\big],$$

$$\hat{F}_{\mathrm{VHR}} = G \odot F_{\mathrm{VHR}}, \qquad \hat{F}_{\mathrm{NTL}} = (1-G) \odot F_{\mathrm{NTL}},$$

$$F_{\mathrm{GF}} = \hat{F}_{\mathrm{VHR}} \odot \hat{F}_{\mathrm{NTL}}.$$
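A minimal sketch of the gated fusion described by these formulas follows; the 1 × 1 gate convolution mapping 2C channels to C and the (1 − G) weighting of the NTL branch follow the reconstruction above and should be read as illustrative.

```python
import torch
import torch.nn as nn

class GatedFusionModule(nn.Module):
    """Gated fusion: a 1x1 conv + sigmoid produces a gate G; the modalities are weighted
    by G and (1 - G) and the weighted maps are combined with a Hadamard product."""
    def __init__(self, channels=256):
        super().__init__()
        self.gate_conv = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_vhr, f_ntl):
        g = torch.sigmoid(self.gate_conv(torch.cat([f_vhr, f_ntl], dim=1)))  # gate G
        f_vhr_w = g * f_vhr                 # weighted VHR features
        f_ntl_w = (1.0 - g) * f_ntl         # weighted NTL features
        return f_vhr_w * f_ntl_w            # Hadamard product of the weighted maps
```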
2.5. Loss Function
Because a single loss function cannot adequately evaluate the model, and to comprehensively improve classification accuracy, especially in complex scenarios with imbalanced class distributions and small-region detection, we use a combination of Weighted Cross-Entropy Loss [35], Dice Loss [36], and Focal Loss [37]. Combining these loss functions better addresses class imbalance, insufficient attention to small objects or minority classes, and hard-to-classify samples.
The Weighted Cross-Entropy Loss adds a class-weight term to the standard cross-entropy loss to address class imbalance. When certain classes have significantly fewer samples than others, the model is more likely to ignore these minority classes; assigning higher weights to the minority classes makes the model place more emphasis on them during training. The Weighted Cross-Entropy Loss is defined as

$$L_{\mathrm{WCE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} w_c\, y_{i,c}\,\log\big(p_{i,c}\big),$$

where $N$ represents the total number of samples, $C$ represents the total number of classes, $y_{i,c}$ is the true label indicating whether sample $i$ belongs to class $c$, $p_{i,c}$ is the model’s predicted probability that sample $i$ belongs to class $c$, and $w_c$ is the weight for class $c$, which is typically inversely proportional to the class frequency, with higher weights assigned to less frequent classes.
The Dice Loss measures the overlap between two regions and is particularly suitable for handling minority classes or small objects. Dice Loss performs well under class imbalance because it increases the loss weight of samples from minority classes; by improving the Dice coefficient for small objects or minority classes, the model becomes more sensitive to them, avoiding class omission. The Dice Loss is defined as

$$L_{\mathrm{Dice}} = 1 - \frac{2\sum_{i} p_i\, g_i}{\sum_{i} p_i + \sum_{i} g_i},$$

where $p_i$ represents the model’s predicted class probability and $g_i$ represents the true label’s class probability.
Focal Loss aims to address the issue of class imbalance. For easy-to-classify samples, Focal Loss reduces their loss weight, allowing the model to focus more on hard-to-classify samples. By adjusting the focusing parameter $\gamma$, Focal Loss applies a greater loss weight to misclassified and hard-to-classify samples, thereby improving the model’s performance on challenging samples. The Focal Loss is defined as

$$L_{\mathrm{Focal}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} \alpha_c\,\big(1 - p_{i,c}\big)^{\gamma}\, y_{i,c}\,\log\big(p_{i,c}\big),$$

where $\alpha_c$ is the class balance coefficient, which controls the loss weight of samples from different classes; $\gamma$ is the focusing parameter, which controls the emphasis on hard-to-classify samples; $p_{i,c}$ is the model’s predicted probability for class $c$; and $y_{i,c}$ is the true label indicating whether sample $i$ belongs to class $c$.
By combining the three loss functions mentioned above, the model can better address issues such as class imbalance, hard-to-classify samples, and the detection of small objects or minority classes. The combined loss is defined as

$$L = \lambda_1 L_{\mathrm{WCE}} + \lambda_2 L_{\mathrm{Dice}} + \lambda_3 L_{\mathrm{Focal}},$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weighting parameters used to adjust the relative importance of the three loss functions. Based on the experimental results, we set $\lambda_1 = 0.3$.
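A possible PyTorch implementation of the combined loss is sketched below. Only λ1 = 0.3 is stated in the text; the remaining λ values, the focusing parameter γ, and the per-pixel α term are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinedLoss(nn.Module):
    """Weighted cross-entropy + Dice + Focal loss; weights other than lambda1 are assumed."""
    def __init__(self, class_weights, lambdas=(0.3, 0.3, 0.4), gamma=2.0, eps=1e-6):
        super().__init__()
        self.register_buffer("class_weights", class_weights)  # w_c, e.g. inverse class frequency
        self.l1, self.l2, self.l3 = lambdas
        self.gamma = gamma
        self.eps = eps

    def forward(self, logits, target):
        # logits: B x C x H x W, target: B x H x W with class indices
        wce = F.cross_entropy(logits, target, weight=self.class_weights)

        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(0, 2, 3))
        dice = 1.0 - ((2.0 * inter + self.eps)
                      / (probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3)) + self.eps)).mean()

        pt = (probs * one_hot).sum(dim=1).clamp_min(self.eps)  # probability of the true class
        alpha = self.class_weights[target]                     # per-pixel class balance term
        focal = (-alpha * (1.0 - pt) ** self.gamma * pt.log()).mean()

        return self.l1 * wce + self.l2 * dice + self.l3 * focal
```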
2.6. Evaluation Metrics
To validate the accuracy of the classification results, we used the mean Intersection over Union (mIoU), F1 Score, and overall accuracy (OA) as the evaluation metrics for our model.
(1) mIoU
The mIoU is a commonly used evaluation metric in semantic segmentation tasks. It measures the average segmentation performance of a model across multiple classes, reflecting the overall performance of the model in image segmentation tasks. IoU calculates the ratio of the intersection to the union of two sets; in semantic segmentation, it measures the overlap between the predicted results and the ground-truth labels:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}, \qquad \mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{IoU}_c,$$

where, for each class:
TP: the number of samples predicted as positive and actually positive.
TN: the number of samples predicted as negative and actually negative.
FP: the number of samples predicted as positive but actually negative.
FN: the number of samples predicted as negative but actually positive.
(2) F1 Score
The F1 Score combines the precision and recall of a classification model and is especially useful in situations with imbalanced data. It is the harmonic mean of precision and recall, providing a single measure of the model’s overall performance:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
(3) Overall Accuracy
Overall Accuracy is one of the fundamental metrics for evaluating the performance of a classification model. It represents the proportion of correctly predicted samples out of all samples, is applicable to a wide range of classification problems, and is one of the most intuitive evaluation metrics. In classification tasks, the overall accuracy is

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}.$$
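All three metrics can be computed from a single confusion matrix, as in the following sketch (macro-averaged F1 is assumed; function names are illustrative).

```python
import numpy as np

def confusion_matrix(pred, label, num_classes):
    """Accumulate a num_classes x num_classes confusion matrix from flat index arrays."""
    mask = (label >= 0) & (label < num_classes)
    idx = num_classes * label[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def evaluate(cm):
    """Compute mIoU, macro F1, and overall accuracy from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return {"mIoU": iou.mean(), "F1": f1.mean(), "OA": tp.sum() / cm.sum()}
```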
4. Experiments and Results
4.1. Implementation Details
In the experiment, we used PyTorch as the framework for the model with CUDA version v12.1. The CPU was an Intel(R) Core(TM) i9-14900K model, and the GPU was an RTX 4080s with 16 GB of memory.
To improve the model’s generalization ability and to prevent overfitting, we used AdamW as the optimizer during the training phase, with an initial learning rate set to 0.001. The cosine annealing method was applied to adjust the learning rate, allowing it to change cyclically during training, helping the model escape local optima and enhancing training performance. Additionally, to address the issue of class imbalance, we not only oversampled the underrepresented classes in the dataset but also incorporated Focal Loss as part of the loss function. Focal Loss assigns lower weights to easily classified samples and higher weights to hard-to-classify samples, enabling the model to focus more on challenging samples during training. Due to memory limitations, all of the images were divided into 256 × 256 tiles using a sliding window method before being input into the model for training. The batch size was set to 48. During training, we calculated the mIoU value for each epoch and employed an early stopping strategy to prevent overfitting. If the mIoU value on the validation set did not improve for ten consecutive epochs, the training was stopped and the model with the best mIoU value was saved. This approach ensured the model did not overfit. The entire training process took approximately 8 h.
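The training procedure corresponding to these settings can be sketched as follows; model, train_loader, val_loader, criterion, and evaluate_miou are assumed to be defined elsewhere, and the cosine-annealing cycle length and epoch cap are illustrative rather than the exact values used.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
optimizer = AdamW(model.parameters(), lr=1e-3)          # initial learning rate 0.001
scheduler = CosineAnnealingLR(optimizer, T_max=50)      # cosine annealing (cycle length assumed)

best_miou, patience, bad_epochs = 0.0, 10, 0
for epoch in range(200):                                # upper bound; early stopping decides
    model.train()
    for vhr, ntl, target in train_loader:               # 256x256 tiles, batch size 48
        vhr, ntl, target = vhr.to(device), ntl.to(device), target.to(device)
        optimizer.zero_grad()
        loss = criterion(model(vhr, ntl), target)
        loss.backward()
        optimizer.step()
    scheduler.step()

    miou = evaluate_miou(model, val_loader, device)      # validation mIoU per epoch
    if miou > best_miou:
        best_miou, bad_epochs = miou, 0
        torch.save(model.state_dict(), "csagfnet_best.pth")  # keep the best model
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                       # stop after 10 epochs without improvement
            break
```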
4.2. Results
To evaluate the accuracy of the model, we tested it on the test set and calculated its mIoU, F1-score, and overall accuracy. The model output results are illustrated in Figure 11, the overall metrics are given in Table 2, and the IoU values for each functional zone are shown in Figure 12. Partial output results from the validation set are also shown in Figure 11.
From the experimental results, it is evident that the CSAGFNet model performs well on the three key metrics: mIoU, F1-score, and overall accuracy. The mIoU is 0.853, indicating that the model achieved a high intersection-over-union across all functional zones, showing its ability to accurately capture functional zone boundaries. The F1-score is 0.894, demonstrating the model’s excellent balance between precision and recall and its accurate classification of functional zones. The overall accuracy was 0.934, showing that the model achieved very high overall classification accuracy and can effectively predict different urban functional zones.
From the IoU values of each functional zone, the model demonstrates excellent performance in urban functional zone segmentation tasks, particularly in areas such as water bodies, residential areas, and parks/green spaces, where it accurately identifies boundaries and achieves high IoU values. For non-construction zones, due to the limited number of samples, even after processing the dataset with methods such as oversampling, the model was unable to learn sufficient features during training, resulting in a relatively lower IoU value compared to other zones.
Based on its performance across specific categories, the CSAGFNet model exhibited significant disparities in recognition capabilities for different urban functional zones (residential areas, commercial areas, industrial areas, green space, street and transportation, and water and non-development zones), primarily attributable to inherent differences in spectral characteristics, spatial distribution patterns, and sample quantities among categories. For water body identification, the model achieved an outstanding IoU value of 0.89. This result not only stems from its effective capture of water’s distinctive low reflectance in near-infrared bands but also benefits from the innovative integration of nighttime light data features. Specifically, water bodies exhibit unique zero-value characteristics in nighttime light imagery, contrasting sharply with other urban functional zones. By constructing a Gated Fusion Module, the model successfully combines these complementary features, thereby significantly enhancing the robustness of water body identification. The residential area IoU of 0.88 reflects the model’s capability in capturing spatial distribution patterns of buildings, particularly maintaining satisfactory segmentation consistency in high-density urban clusters.
In contrast, the non-development zones showed a markedly lower IoU of 0.79 compared to other categories. In-depth analysis revealed three primary contributing factors: First, from a data perspective, this category constituted merely 2.7% of the total training samples. Even with oversampling techniques, its effective sample size remained less than one-fifth of other categories, substantially limiting the model’s capacity to learn discriminative features. Second, this category exhibits exceptionally high internal heterogeneity, encompassing various subclasses such as bare land, fallow farmland, and gravel areas. Its spectral feature coefficient of variation reaches 0.35, which is 2-3 times higher than that of other categories. This diversity challenges the model in establishing clear decision boundaries within the feature space. Finally, transition zones between these areas and adjacent functional zones accounted for 42% of the total boundary length, where ambiguous edge pixels frequently caused misclassification. These findings align with recent studies highlighting few-shot learning challenges, particularly suggesting that conventional data augmentation strategies may prove insufficient when addressing highly heterogeneous geographical features.
4.3. Ablation Study
To evaluate the efficiency and accuracy of the model in urban functional zone classification, we conducted ablation experiments by independently removing or modifying components of the network. Our model, named CSAGFNet, was used as a baseline for easier comparison with other variants.
We examined the structure of cross-modal fusion by removing data streams to validate the effectiveness of cross-modal integration. Next, we investigated the role of different fusion methods in cross-modal and multi-level feature fusion. Note that all settings and metadata for the training and validation processes were fixed across all ablation experiments.
Initially, we completely removed the NTL stream from CSAGFNet and trained a new model based solely on VHR images, naming this variant CSAGFNet-VHR. Then, we trained two independent VHR and NTL streams without any feature-level fusion between them. Subsequently, the output features of the final layer from each VHR and NTL stream were fused at the decision level without feature alignment; this model was named CSAGFNet-Cat. Finally, we input the VHR and NTL streams into the OAFA module for feature alignment but used feature concatenation as the final fusion method. This model was named CSAGFNet-OAFA. We also included our original model, CSAGFNet, in the experiments. This model aligns VHR and NTL features using the OAFA module and fuses them through the GFM.
Table 3 shows the experimental results of each model on the validation set, Figure 13 illustrates the urban functional zone extraction performance of the different models, and Figure 14 shows a detailed view of the classification results.
We conducted ablation experiments to compare the impact of each module in CSAGFNet on classification performance, systematically verifying how multi-source feature alignment and fusion affect the urban functional zone interpretation task. As shown in Table 3, the CSAGFNet-VHR model, which uses only VHR imagery, exhibits a significantly lower mIoU (0.743) and F1-score (0.727) than the multi-source fusion model, indicating that a single data source has inherent limitations in extracting high-level semantic features. Although VHR imagery offers spatial resolution advantages in the range of 0.3–1 m, the class-internal heterogeneity of its spectral-spatial features tends to cause the misclassification of shadows. Without an additional data stream, the globality and robustness of the features are insufficient, which degrades performance in target classification and segmentation. As the result images show, the model trained solely on VHR data fails to distinguish shadows from actual building areas, erroneously classifying shadows as urban functional areas.

The CSAGFNet-Cat model achieves a primary fusion of VHR and NTL (nighttime light) data via feature concatenation but does not account for the heterogeneous feature differences between multi-source remote sensing data. Its mIoU (0.713) is 0.030 lower than that of CSAGFNet-VHR (0.743), revealing that the direct fusion of heterogeneous features without alignment can lead to feature conflicts: the high-frequency texture features of VHR imagery and the radiance intensity features of NTL data suffer from dimensional mismatches in the uncalibrated feature space, reducing feature interaction efficiency and output accuracy.

After introducing the OAFA module, the CSAGFNet-OAFA model improves the mIoU by 10.6% to 0.789 and the F1-score by 10.7% to 0.804, confirming the importance of bridging feature spaces for multi-source remote sensing interpretation. This module establishes a cross-modal feature projection matrix, achieving orthogonal decomposition and recalibration of the local geometric features from VHR and the global radiance features from NTL in a shared feature space, effectively addressing the distribution shift between heterogeneous data. However, because it still relies on linear concatenation for fusion, its overall accuracy (0.843) falls well short of the optimal model.

Finally, the full CSAGFNet model, which adopts the gated fusion mechanism (GFM), achieves the best results in mIoU (0.853), F1-score (0.894), and overall accuracy (0.934), improvements of 8.1%, 11.2%, and 10.8% over the OAFA variant, respectively. The GFM dynamically adjusts the contribution of multi-source features through learnable gating weights. This nonlinear fusion mechanism not only suppresses feature redundancy but also enhances the complementarity of cross-modal features, particularly in shadow-building boundary areas, where the fine boundary information from VHR and the semantic intensity information from NTL reinforce each other.
The quantitative results of the experiment, combined with the visualization analysis, collectively indicate that the performance improvement in multi-source remote sensing interpretation occurs in two key stages. The feature alignment stage primarily addresses the spatial-semantic matching issues of heterogeneous data (with an OAFA contribution of approximately 65.3%), while the feature fusion stage optimizes multi-modal feature representation through dynamic weight allocation (with a GFM contribution of approximately 34.7%).
In conclusion, the feature alignment module plays a crucial role in enhancing performance, while the fusion method further influences the final feature representation. CSAGFNet demonstrates a more optimal fusion strategy, enabling it to achieve the best segmentation performance.
5. Discussion
5.1. Comparison with Single-Modal Models
In recent years, with the development of deep learning, single-modal DCNN models have become quite mature. To further assess the effectiveness of our method, we compared CSAGFNet with five advanced single-modal networks: U-net [38], DeepLabV3Plus [39], FPN [40], PSPNet [41], and PAN [42]. These methods were chosen because they have all been proven to classify images effectively, and they are all open-source and easy to use. To ensure fairness, all single-modal models used ResNet34 as the encoder. The specific parameters of the comparative models are shown in Figure 15, the experimental results are given in Table 4, and the classification results for the functional areas are displayed in Figure 16. A partial detailed view of the model output results is shown in Figure 17.
By comparing the performance differences between five typical single-modal segmentation models (Res-UNet, DeepLabV3Plus, FPN, PSPNet, and PAN) and the multi-modal fusion model CSAGFNet, we systematically validated the benefit of cross-modal feature fusion for urban functional zone remote sensing interpretation. As shown in Table 4, CSAGFNet significantly outperforms all single-modal models on the three metrics mIoU (0.853), F1-score (0.894), and overall accuracy (0.934), confirming the necessity of the collaborative interpretation of multi-source remote sensing data.
Single-modal models are limited by the information representation capabilities of a single data source, and their performance differences reflect the model architecture’s adaptability to feature extraction. Res-UNet, benefiting from residual connections and an encoder-decoder structure, performs the best among the single-modal models (mIoU = 0.812). However, the VHR imagery it relies on only captures spatial geometric features (such as building contours and texture details) and cannot acquire socio-economic attributes, such as the intensity of human activity, leading to insufficient distinction between functionally similar building groups (e.g., office buildings and residential buildings). DeepLabV3Plus uses atrous spatial pyramid pooling (ASPP) to enhance multi-scale feature extraction, but its F1-score (0.836) is 6.9% lower than that of CSAGFNet, indicating a bottleneck in high-level semantic association modeling under a single modality. FPN and PSPNet, due to differences in their feature pyramid fusion strategies, achieve mIoU values of 0.789 and 0.747, respectively. However, both models show significant misclassification in shadow-covered areas (false detection rates > 18%), confirming the interpretive ambiguity of single-modal data in complex scenarios. The low performance of PAN (mIoU = 0.693) further reveals that unoptimized feature aggregation mechanisms exacerbate inter-class confusion, particularly in low-contrast areas (e.g., industrial and commercial zones).
CSAGFNet achieves multi-source feature collaborative optimization and improves model performance through Cross-modal Spatial Offset Modeling (CSOM) and the Gated Fusion Module (GFM). The CSOM module adaptively adjusts the features between different modalities, reducing the disparities in image resolution, observation angles, and data attributes, allowing multi-modal features to be aligned in similar spatial dimensions. The GFM module further utilizes a gating mechanism to weight and fuse features from different modalities, effectively enhancing the capture of useful features while filtering out irrelevant information. This enables CSAGFNet to simultaneously capture physical features from VHR imagery and socio-economic features from NTL imagery, fully leveraging the complementary advantages of multi-modal data.
CSAGFNet demonstrates significant advantages in special scenarios, such as shadow-interfered areas and functionally ambiguous regions. Single-modal models, such as Res-UNet, tend to misclassify shadows as low-density buildings with a probability of 24.6%, while CSAGFNet reduces the misclassification rate to 6.3% by using human activity intensity features from NTL data (e.g., the light intensity in shadow areas approaching 0). The F1-score difference between industrial areas (high NTL intensity, regular geometric layout) and commercial areas (high NTL intensity, complex textures) decreases from 12.4% in single-modal models to 4.7%, demonstrating that multi-modal features can enhance inter-class separability. For the underrepresented non-constructed areas (<5% of the total), CSAGFNet improves the IoU by 5% through the temporal stability features of NTL data, overcoming the false-negative problem caused by sample imbalances in single-modal models.
5.2. Model Generalization
To further evaluate the generalization of the proposed model, we conducted tests on two additional administrative districts. We input high-resolution imagery and NTL data from these regions into the model to obtain classification maps of urban functional zones. However, because ground-truth label data were unavailable, validation could only be performed through visual interpretation combined with random point sampling, and the overall accuracy over each region was calculated from the confusion matrix. The test results are shown in Table 5, the classification results for the functional zones are presented in Figure 18 and Figure 19, and a partial detailed view of the model output results is shown in Figure 20.
Based on the data presented in Table 5, the model demonstrates strong cross-domain adaptability in the untrained YueXiu (overall accuracy 91.8%) and TianHe (87.8%) districts; the performance difference (approximately 4.0 percentage points) reflects the impact of geographic environmental heterogeneity on model generalization. At the macro scale, the feature alignment and dynamic fusion mechanisms enable robust generalization (overall accuracy > 87%). At the micro scale, regional heterogeneity and the coupling of fine-grained features mean that additional geographic data fusion and adaptive training strategies are still required. These results confirm that an architecture based on multi-modal feature alignment (the OAFA module) and gated fusion (the GFM module) can effectively alleviate inter-domain feature distribution shifts.
The OAFA module reduces intra-class variance of cross-region features by orthogonal projection, mapping local geometric features from VHR imagery (such as building contour curvature) and radiance intensity features from NTL data to a unified semantic space. This enhances the model’s robustness to regional differences in building density, road network structure, and other characteristics. The GFM module dynamically adjusts the contribution of multi-modal features through gating weights. In the YueXiu region (high building density, regular road network), the model emphasizes VHR spatial details (weight proportion > 0.73), while in the TianHe region (mixed-use area, complex textures), the model strengthens the semantic intensity features of NTL (weight proportion > 0.68), thus achieving scene-adaptive feature expression optimization.
Although the model performs excellently at the macro scale, there remains a significant bottleneck in fine-grained classification tasks (e.g., distinguishing residential and commercial areas), with an F1-score difference of 14.2%. Feature confusion and inter-domain feature shifts still exist. In high-density residential areas (floor area ratio > 2.5), the architectural layout is similar to that of commercial areas (e.g., grid-like arrangement), leading to insufficient distinction of VHR texture features. In nighttime-active residential areas, human activity intensity features overlap with commercial areas, weakening the discriminative power of NTL data. The building function distribution in non-training regions exhibits systematic differences from the training set, making it difficult for the model to capture domain-invariant features for fine-grained classification.
For fine-grained classification tasks, a more refined feature extraction module or the use of auxiliary features (e.g., POI data, building height) may be necessary to improve the model’s discriminatory ability. To further enhance the accuracy of residential and commercial area classification, incorporating more granular features or specific category data augmentation strategies may be considered to meet the specific task requirements in non-training regions.
6. Conclusions
The primary objective of this study is to extract information from VHR imagery and NTL imagery for urban functional zone classification. To this end, we propose an end-to-end cross-modal spatial alignment gated fusion deep neural network (CSAGFNet), centered on the multi-modal fusion of very-high-resolution remote sensing imagery (VHR) and nighttime light data (NTL). The network is designed specifically for the urban functional zone classification task, combining cross-modal spatial alignment and gated fusion mechanisms. Through systematic experimental validation, the model achieves an mIoU of 0.853, an F1-score of 0.894, and an overall accuracy of 0.934 on the test set, an improvement of 5.2–8.7% over single-modal baseline models, which validates the gain obtained from the collaboration of multi-source remote sensing data for urban functional zone recognition. The main contributions of this study are reflected in the following three dimensions:
First, at the feature representation level, the proposed OAFA module addresses the feature space mismatch problem caused by spatial resolution differences (0.5 m vs. 500 m) and radiometric representation differences (reflectance vs. radiance intensity) between VHR and NTL data by establishing a cross-modal attention mechanism. Ablation experiments show that this module improves classification accuracy by more than 5% on the new urban area test set.
Second, in terms of feature fusion strategy, the Gated Fusion Module (GFM) achieves adaptive fusion of multi-modal features through a dynamic weight allocation mechanism. Quantitative analysis demonstrates that, compared to traditional concatenation or summation fusion methods, GFM improves the OA metric by 10.8% while maintaining the same model parameters. This is particularly evident in commercial-residential mixed-function areas, where the model shows enhanced discriminative power (with an 11.2% improvement in F1-score).
Third, for model generalization validation, we constructed a cross-domain test set for the main urban areas of Guangzhou. CSAGFNet achieved overall accuracy scores of 91.8% and 87.8% in two untrained new regions, confirming the model’s strong adaptability to the spatial structural heterogeneity of cities and demonstrating its robust generalization ability and resilience.
Although the current study made progress in multi-modal fusion methods, limitations remain in fine-grained classification (e.g., residential area density grading) and dynamic functional zone recognition. Future research will therefore focus on three aspects:
Incorporating building contour vector data and POI semantic information to construct a multi-scale feature pyramid to enhance the spatial-semantic representation of urban functions.
Developing differentiable morphological operators to improve the analytical accuracy of linear features such as road networks.
Establishing a spatiotemporal collaborative fusion framework to integrate temporal NTL fluctuation features with quarterly VHR vegetation indices for dynamic monitoring of the evolution of urban functional areas.
The cross-modal alignment theoretical framework proposed in this study offers a new methodological reference for multi-source remote sensing data fusion, and it has practical value for the refined management of smart cities.