A Deep-Learning-Based Multimodal Data Fusion Framework for Urban Region Function Recognition

Accurate and efficient classification maps of urban functional zones (UFZs) are crucial to urban planning, management, and decision making. Due to the complex socioeconomic UFZ properties, it is increasingly challenging to identify urban functional zones by using remote-sensing images (RSIs) alone. Point-of-interest (POI) data and remote-sensing image data play important roles in UFZ extraction. However, many existing methods only use a single type of data or simply combine the two, failing to take full advantage of the complementary advantages between them. Therefore, we designed a deep-learning framework that integrates the above two types of data to identify urban functional areas. In the first part of the complementary feature-learning and fusion module, we use a convolutional neural network (CNN) to extract visual features and social features. Specifically, we extract visual features from RSI data, while POI data are converted into a distance heatmap tensor that is input into the CNN with gated attention mechanisms to extract social features. Then, we use a feature fusion module (FFM) with adaptive weights to fuse the two types of features. The second part is the spatial-relationship-modeling module. We designed a new spatial-relationship-learning network based on a vision transformer model with long- and short-distance attention, which can simultaneously learn the global and local spatial relationships of the urban functional zones. Finally, a feature aggregation module (FGM) utilizes the two spatial relationships efficiently. The experimental results show that the proposed model can fully extract visual features, social features, and spatial relationship features from RSIs and POIs for more accurate UFZ recognition.


Introduction
Today, more than half of the world's population lives in cities, yet cities cover only a tiny fraction of the Earth's surface. Asian and African countries are continuously urbanizing, and the urban population continues to grow; the world's urban population is expected to increase by 500 million by 2030 [1]. Therefore, with a growing urban population, it is critical to manage and monitor limited urban areas. An urban functional zone is a concept that describes people's different activities in a certain area, such as industrial areas, commercial areas, or residential areas [2,3]. As a basic urban unit, an accurate UFZ map is very important for urban planning, management, and decision making [4,5]. However, with the rapid development of urbanization in the world, urban functional zone maps managed by the government cannot be updated in a timely manner [6,7]. Therefore, it is crucial to make accurate and timely UFZ maps.
With the rapid development of related disciplines and technologies, high-resolution remote-sensing image data have gradually shown potential in the task of UFZ recognition. Research based on remote-sensing images continues to develop, and the use of RSIs is recognized as one of the most effective and efficient methods [8,9]. In the past, some traditional methods, such as the scale-invariant feature transform (SIFT) [10] and the histogram of oriented gradients (HOG) [11,12], were used to identify UFZs from remote-sensing images. These methods perform well in traditional land use and land cover (LULC) classification tasks but do not perform well in urban area recognition tasks with complex structures and fine semantics. Therefore, some scholars have proposed topic models to further identify UFZs from remote-sensing imagery. Additionally, probabilistic topic models [13,14] can further exploit the potential semantic relationships of urban areas by using object features. However, such methods are still only based on low-level manual features and cannot effectively represent the complex high-level semantic feature relationships of UFZs. In recent years, computer science has developed rapidly, and the advantages of many deep-learning methods [15,16,17] in identifying urban functional zones and scene-level LULC classification have become increasingly obvious. For example, benefiting from the development of the transformer [18], Wang [19] designed a U-shaped transformer network based on the UNet [20] model architecture to interpret high-resolution urban scene images and obtained satisfactory results on four challenging datasets. Zhou [21] introduced the concept of the super object (SO) and classified the urban functional area based on the methods of frequency statistics and convolutional neural networks. Du [22] mapped large-scale and fine-grained urban functional zones from remote-sensing images using a multi-scale semantic segmentation network and an object-based approach. In general, the use of high-resolution RSIs can achieve relatively good results for producing fine-grained urban area classification maps [13,23,24]. However, RSI data perform well in extracting the physical characteristics of ground objects, such as the distribution of buildings and the spatial structure of cities, but they fail to reflect the dynamically changing properties and human social activity information [25,26]. In addition, some UFZ components are visually very similar, and the method based on remote-sensing image data can extract the visual semantic features efficiently, but it cannot analyze the social semantic features between and within different functional areas. The urban functional zone contains various functional properties of regions that are directly related to human social activities [27,28]. This relationship is complex to some extent. Therefore, it is difficult to obtain high-precision UFZ classification maps using only RSI data.
It has been shown that social perception and human activities are better methods for dynamically identifying urban areas [26,29]. Socioperceptual big data that record human activity in real time are becoming increasingly available, such as points of interest (POIs) [2,30], mobile phone positioning data [31,32], social media check-in data [33], geotagged photos [34,35], and vehicle GPS trajectory data [36,37]. Unlike remote-sensing imagery, these data are byproducts of human social activities, and many have temporal features; therefore, they contain rich socioeconomic attributes. For example, by fusing remote-sensing image data and taxi trajectory data, Qian [38] used the road network as the basic segmentation unit to identify urban functional districts based on the residual network framework. Cao [39] proposed two strategies: enforcing cross-modal feature consistency (CMFC) and cross-modal triplet (CMT) constraints. Time-dependent social sensing signature features were extracted based on a long short-term memory network (LSTM) and a one-dimensional convolutional neural network and fused with the remote-sensing image data for more accurate urban functional area classification. POIs are the main static data in social sensing data; they are not only easy to obtain but can also provide comprehensive land use information based on human activities and geographical locations [40]. Xu [41] calculated the statistical features of POIs and fused them with remote-sensing image data to identify UFZs. Lu [42] proposed a unified deep-learning framework that can simultaneously extract visual features, social features, and spatial relationship features from RSI data and POI data. Bao [43] proposed the deeper-feature convolutional neural network (DFCNN) and integrated remote-sensing data and POI data to identify urban functional zones. There have been many efforts by scholars to recognize urban functional zones, but many of the previous studies could not effectively identify the implicit relationship between UFZs and POIs because the relationship between them is not one-to-one or one-to-many, and the synergy mechanism between POIs and RSIs has rarely been studied. Simple fusion strategies (stacking or adding) [41,43] have difficulty taking full advantage of the complementary advantages of multimodal data. Additionally, fine-grained spatial relationships are important for identifying UFZs for the following reasons [44]. The first is that the layout of urban functional zones itself helps to reduce visual ambiguity. For example, determining whether a retail area is a commercial or residential area requires information about the surrounding area. Second, aggregating spatial information between and within different urban functional areas is conducive to determining the types of UFZs. For example, there are often recreational parks next to residential areas, while there are few large green-space areas around industrial areas. This co-occurrence relationship is very clear to some extent.
To address the above problems, we built a unified deep-learning framework, Lufz-CrossFormer (Lu-CF), which can efficiently exploit the complementary advantages between RSIs and POIs and simultaneously capture the potential spatial relationships of urban functional zones. The framework is divided into two parts: a complementary feature fusion module and a spatial-relationship-modeling module. In the first module, we use a common CNN to extract visual features from remote-sensing imagery data and a CNN with a gated attention mechanism to extract social features from point-of-interest data. To utilize the POIs conveniently, we first convert them into the corresponding hierarchical distance heatmap tensor according to the number of POI categories. Then, we use a layer-weighted model (LWM) to capture the possible co-occurrence relationship between POIs and UFZs. Finally, an adaptive feature fusion module (FFM) is used to efficiently fuse the two types of features. In the second part, the fused features are recoded according to the location relationship. We designed a new network structure based on the long-short-distance attention (LSDA) network CrossFormer [45], which can simultaneously capture the local spatial relationships and global spatial relationships of the fused features. Finally, a feature aggregation module (FGM) is used to efficiently utilize the two spatial relationships for more accurate urban functional area recognition. The main contributions of this paper are summarized as follows:
(1) We designed a unified deep-learning framework and integrated remote-sensing images and POIs to recognize urban functional zones. Our method can extract visual features, social features, and spatial relationship features from different data for more accurate urban functional zone recognition, while existing relevant studies rarely take into account all three features simultaneously.
(2) We investigated which POI categories have a greater impact on the final urban functional zone recognition accuracy, as well as the advantages of using POI data compared to using RSI data, through a series of experiments, which contributes to a further understanding of the role of multimodal data in the urban functional zone recognition task.
(3) The synergy mechanism of remote-sensing image data and point-of-interest data in the urban functional zone recognition task has rarely been studied. In this study, we used a feature fusion module to adaptively fuse the visual and social features and further analyzed the specific effects of this synergy mechanism for different urban functional areas.

Data Source
As shown in Figure 1, we used the CSU-RSISC10 dataset [44] as the main research dataset, which was collected via Google Earth in Santa Monica, a coastal city in Los Angeles County, USA, that covers an area of approximately 20 km². It contains 288 sample-level images with a size of 2000 × 2000 pixels and a spatial resolution of 0.15 m per pixel. Each sample-level image was further divided into 400 nonoverlapping patch-level images of size 100 × 100. Figure 2 illustrates the two-level hierarchical structure of this dataset. The first part of the model uses unordered patch-level images as training data. The second part uses sample-level images containing spatial location information for training. In our experiments, we chose 70% of the data as the training set and the remaining data as the test set. For the patch-level data, the distribution of the number of different UFZ categories in the training and test sets is shown in Table 1. In addition, we downloaded 22,162 POIs from the open-source OpenStreetMap (OSM) website and reclassified them into 9 classes: airport, industry, supermarket, retail, hotel, institution, public service, nature, and residence. The distribution of the number of each type of POI is shown in Table 2. Finally, according to the "Code for Classification of Urban Land Use and Planning Standards of Development Land (GB 50137-2011)" and the LBCS standards, we classified the urban functional zones into commerce, industry, residence, construction, institution, transport, open space, and water.
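The two-level structure of the dataset follows directly from the image sizes stated above. The short sketch below works through that arithmetic; the function name `patch_boxes` and the row-major ordering of patches are illustrative assumptions, not part of the dataset definition.

```python
SAMPLE_SIZE = 2000    # side of a sample-level image, in pixels
PATCH_SIZE = 100      # side of a patch-level image, in pixels
RESOLUTION_M = 0.15   # ground distance covered by one pixel, in metres
NUM_SAMPLES = 288     # number of sample-level images


def patch_boxes(sample_size, patch_size):
    """Return (row0, col0, row1, col1) boxes of nonoverlapping patches, row-major.

    The row-major order is an assumption made for this illustration.
    """
    if sample_size % patch_size != 0:
        raise ValueError("sample size must be a multiple of the patch size")
    starts = range(0, sample_size, patch_size)
    return [(r, c, r + patch_size, c + patch_size) for r in starts for c in starts]


boxes = patch_boxes(SAMPLE_SIZE, PATCH_SIZE)
patches_per_sample = len(boxes)                    # 20 * 20 patches per sample
total_patches = NUM_SAMPLES * patches_per_sample   # upper bound if every sample is fully tiled
sample_side_m = SAMPLE_SIZE * RESOLUTION_M         # ground side of one sample, about 300 m
patch_side_m = PATCH_SIZE * RESOLUTION_M           # ground side of one patch, about 15 m
```

Each patch therefore covers roughly 15 m × 15 m on the ground, which is the unit at which a UFZ label is assigned in the first part of the model.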

Methods
The overall structure of the network framework established in this paper is shown in Figure 3. In the first part, we use a CNN to extract visual and social features while fusing them efficiently. In the second part, we design an architecture based on the CrossFormer network to extract local spatial relationships and global spatial relationships from the fused features generated in the first part and utilize them efficiently for the final classification output. The following subsections explain the details of the proposed framework.

Complementary Feature Learning and Fusion
In this part, we use two CNNs to extract visual features and social features from RSIs and POIs. Then, the two types of features are fused efficiently to obtain fused features with strong representation ability based on the FFM.

Visual Feature Learning
First, we crop a large-scale three-band RGB RSI with a size of $3 \times H \times W$ according to the window step ($h \times w$) to obtain a group of image units $n_j$ ($j = 1, 2, 3, \ldots, \frac{H}{h} \times \frac{W}{w}$), where each individual image unit represents a type of urban functional area. Then, ResNet-50 [46] is used as the backbone network to extract features from these image units, and a global average pooling operation is used to obtain the final visual feature $f_l \in \mathbb{R}^{2048 \times 1 \times 1}$.

Social Feature Learning
Many UFZs are visually similar; for example, commercial buildings and institutional buildings may look extremely similar. Therefore, relying on remote-sensing imagery alone will not yield satisfactory results in many cases. We introduce POI data with rich social attributes as complementary features to RSI data and use a CNN with an attention mechanism to capture possible co-occurrence relationships between urban functional zones and the corresponding points of interest.
Specifically, we first convert the POIs into the corresponding hierarchical distance heatmap tensor according to the number of POI categories and input it into the convolutional neural network, as the CNN is better at processing two-dimensional continuous image data. Assume that the region contains $M$ types of POIs, and the number of POIs in each category is $n_u$ ($u = 1, 2, \ldots, M$). For the $u$th-category point of interest, we calculate the minimum distance from pixel point $(x_i, y_i)$ in the remote-sensing image to the $n_u$ POIs according to Equation (1) to generate the distance heatmap $DP_u$:

$$DP_u(x_i, y_i) = \min_{l = 1, \ldots, n_u} \sqrt{(x_i - x_l)^2 + (y_i - y_l)^2} \tag{1}$$

where $(x_l, y_l)$ represents the spatial coordinates of the $l$th POI of the $u$th category, $l = 1, 2, \ldots, n_u$. As a result, we obtain $M$ types of distance heatmaps and then stack them to form a distance heatmap tensor $T \in \mathbb{R}^{M \times H \times W}$. Similar to the remote-sensing-image-cropping method, we crop the heatmap tensor $T$ into $\frac{H}{h} \times \frac{W}{w}$ nonoverlapping patches $t_j \in \mathbb{R}^{M \times h \times w}$, $j = 1, 2, \ldots, \frac{H}{h} \times \frac{W}{w}$.

Second, the relationships between POIs and UFZs are neither one-to-one nor one-to-many. An urban functional district may have multiple types of POIs, and a POI type may appear in different urban functional zones simultaneously. For example, there will be retail POIs in both commercial and residential areas. To this end, inspired by Lu's approach [42], we adaptively weight $t_j$ by adding a layer-weighted model (LWM) to the CNN to capture possible co-occurrence relationships between different types of POIs and UFZs. The structure of the LWM model is shown in Figure 4.
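The construction of the distance heatmap tensor and its cropping into patches can be sketched as follows. This is a minimal illustration that takes the distance in Equation (1) to be Euclidean and measures it in pixel units; the function names and the toy POI coordinates are hypothetical.

```python
import math


def distance_heatmap(height, width, pois):
    """Minimum Euclidean distance from every pixel (x, y) to the POIs of one category."""
    if not pois:
        raise ValueError("each category needs at least one POI")
    return [
        [min(math.hypot(x - px, y - py) for px, py in pois) for x in range(width)]
        for y in range(height)
    ]


def heatmap_tensor(height, width, pois_by_category):
    """Stack one heatmap per POI category into an M x H x W nested list (T)."""
    return [distance_heatmap(height, width, pois) for pois in pois_by_category]


def crop_patches(tensor, h, w):
    """Split an M x H x W tensor into (H/h) * (W/w) nonoverlapping M x h x w patches (t_j)."""
    H, W = len(tensor[0]), len(tensor[0][0])
    if H % h or W % w:
        raise ValueError("H and W must be multiples of h and w")
    return [
        [[row[c:c + w] for row in layer[r:r + h]] for layer in tensor]
        for r in range(0, H, h)
        for c in range(0, W, w)
    ]


# Toy example: a 4 x 4 image with M = 2 hypothetical POI categories, given as (x, y).
pois_by_category = [
    [(0, 0)],
    [(3, 3), (1, 2)],
]
T = heatmap_tensor(4, 4, pois_by_category)
patches = crop_patches(T, 2, 2)   # four patches, each 2 x 2 x 2
```

Pixels that coincide with a POI get a distance of zero in that category's layer, so low values mark areas close to that kind of activity.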
We first use global average pooling to encode $t_j$ into an intermediate heatmap tensor $p_j \in \mathbb{R}^{M \times 1}$, and then two fully connected (FC) layers are used to obtain the interlayer weights:

$$w_j = \phi\left(w_2\,\delta\left(w_1 p_j\right)\right) \tag{2}$$

where $\delta(x)$ is the ReLU activation function, and $\phi(x)$ is the sigmoid function. The first FC layer has a learnable weight $w_1 \in \mathbb{R}^{D_{r1} \times M}$ to extend the dimension of the feature $p_j$, and the second FC layer has a learnable weight $w_2 \in \mathbb{R}^{M \times D_{r1}}$ to reduce the output dimension of the first FC layer back to $M$. Then, we use the output $w_j$ to adaptively weight the layers of $t_j$ by performing layerwise multiplication between $t_j$ and $w_j$:

$$\tilde{t}_j = w_j \odot t_j \tag{3}$$

where $\tilde{t}_j$ represents the heatmap tensor after adaptive weighting. Finally, we incorporate the LWM into the CNN to extract the social feature $f_s \in \mathbb{R}^{2048 \times 1 \times 1}$.
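The layer weighting performed by the LWM can be sketched in a few lines. The code below is a simplified illustration, assuming bias-free FC layers, random untrained weights, and a hypothetical hidden size of 6; in the framework itself the weights are learned jointly with the CNN.

```python
import math
import random


def global_average_pool(patch):
    """Encode an M x h x w patch as a length-M vector of per-layer means (p_j)."""
    return [sum(map(sum, layer)) / (len(layer) * len(layer[0])) for layer in patch]


def linear(weight, vector):
    """Bias-free fully connected layer: one output per row of `weight`."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


def relu(vector):
    return [max(0.0, x) for x in vector]


def sigmoid(vector):
    return [1.0 / (1.0 + math.exp(-x)) for x in vector]


def layer_weights(patch, w1, w2):
    """Interlayer weights: sigmoid(w2 . relu(w1 . p_j)), one weight per POI layer."""
    return sigmoid(linear(w2, relu(linear(w1, global_average_pool(patch)))))


def reweight(patch, weights):
    """Layerwise multiplication: scale every value of layer u by weights[u]."""
    return [
        [[weight * value for value in row] for row in layer]
        for layer, weight in zip(patch, weights)
    ]


# Toy example: M = 3 POI layers, a 2 x 2 patch, hypothetical hidden size 6.
rng = random.Random(0)
M, hidden = 3, 6
w1 = [[rng.uniform(-1, 1) for _ in range(M)] for _ in range(hidden)]
w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(M)]
patch = [
    [[0.0, 1.0], [1.0, 2.0]],
    [[5.0, 4.0], [4.0, 3.0]],
    [[9.0, 9.0], [8.0, 8.0]],
]
weights = layer_weights(patch, w1, w2)
weighted_patch = reweight(patch, weights)
```

Because each weight lies between 0 and 1, the module can only attenuate a POI layer relative to the others, which is how it expresses that some POI categories matter less for a given patch.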

Complementary Feature Fusion
Visual features are effective for identifying certain urban functional areas, such as water and many open spaces. However, for commercial, industrial, and institutional areas, we need social features that contain rich socioeconomic attributes for more accurate recognition, suggesting that visual and social features have different discriminative abilities for different UFZ categories. Therefore, similar to the Social Feature Learning section, we use an adaptive feature fusion module (FFM) for better feature fusion. As shown in Figure 5, the module learns the adaptive fusion weights from the visual feature $f_l$ and the social feature $f_s$, as shown in Equation (4):

$$w_f = \phi\left(w_4\,\delta\left(w_3 f\right)\right) \tag{4}$$

where $f = [f_l \; f_s] \in \mathbb{R}^{4096}$. The first FC layer is a downsampling layer to reduce the output dimension, and its weight is $w_3 \in \mathbb{R}^{\frac{4096}{r} \times 4096}$; the second FC layer, with weight $w_4$, is a rescaling layer to compute the feature's weight factor, and $r$ is the rescaling factor. $w_f = [w_l \; w_s]$ is a feature tensor of length 2, and $w_l$ and $w_s$ denote the ability of visual and social features to discriminate between different urban functional zones, respectively. Then, through Equation (5), we obtain the final fused feature $f_c$ by adaptively weighting the visual features and social features.
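The adaptive fusion step can be illustrated with a small sketch. Two points here are assumptions rather than statements of the method: the FC layers are taken to use the same ReLU and sigmoid activations as the LWM, and the fused feature is taken to be the weighted sum $w_l f_l + w_s f_s$, since the exact form of Equation (5) is not reproduced in the text. The toy dimension of 8 per modality and the value r = 4 are also hypothetical.

```python
import math
import random


def linear(weight, vector):
    """Bias-free fully connected layer: one output per row of `weight`."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


def relu(vector):
    return [max(0.0, x) for x in vector]


def sigmoid(vector):
    return [1.0 / (1.0 + math.exp(-x)) for x in vector]


def fusion_weights(f_l, f_s, w3, w4):
    """Return (w_l, w_s) computed from the concatenated feature [f_l, f_s].

    ASSUMPTION: ReLU after the downsampling layer, sigmoid after the rescaling layer.
    """
    f = f_l + f_s  # list concatenation, length 2 * dim
    w_l, w_s = sigmoid(linear(w4, relu(linear(w3, f))))
    return w_l, w_s


def fuse(f_l, f_s, w_l, w_s):
    """ASSUMPTION: the fused feature is the weighted sum of the two modalities."""
    return [w_l * a + w_s * b for a, b in zip(f_l, f_s)]


# Toy example: 8 values per modality instead of 2048, hypothetical rescaling factor r = 4.
rng = random.Random(0)
dim, r = 8, 4
w3 = [[rng.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(2 * dim // r)]
w4 = [[rng.uniform(-1, 1) for _ in range(2 * dim // r)] for _ in range(2)]
f_l = [rng.random() for _ in range(dim)]
f_s = [rng.random() for _ in range(dim)]
w_l, w_s = fusion_weights(f_l, f_s, w3, w4)
f_c = fuse(f_l, f_s, w_l, w_s)
```

The two scalars are computed per patch, so a water patch can lean on the visual feature while a commercial patch can lean on the social feature.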

Spatial Relationship Modeling
In the first part, we convert the input into a group of fused features $f_c$ ($c = 1, 2, 3, \ldots, \frac{H}{h} \times \frac{W}{w}$). After that, we construct the feature tensor $F_p$ according to the position relation of the tensors $f_c$, as defined in Equation (6). Each tensor $f_c$ in the feature tensor $F_p$ corresponds to a patch-level image clipped from the large-scale image. Then, a simple convolutional network is used to extract local spatial relationships. Finally, we convert the feature $F_p$ into a one-dimensional sequence $F_o$ and input it into the global module to extract the global spatial relationships:

$$F_o = [f_1, f_2, f_3, \ldots, f_L] \in \mathbb{R}^{L \times E} \tag{7}$$

where $L = \frac{H}{h} \times \frac{W}{w}$ is the sequence length, and $E$ is the embedding dimension. As shown in Figure 6, the local spatial relationships can be obtained through two groups of parallel convolution operations, and four groups of CrossFormer blocks (CF blocks) are used to obtain the global spatial relationships. The global features and local features are added, and the result is combined with the original input $F_p$ through a weighted sum. Finally, a feature aggregation module (FGM) is used to obtain the final global-local context.
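The rearrangement of fused features into a position-aware grid and then into a sequence can be sketched as follows. The row-major ordering of patches and the toy embedding dimension of 4 are assumptions made for illustration; the grid size of 20 × 20 follows from 2000 × 2000 samples cropped with a 100 × 100 window.

```python
def build_feature_grid(fused, rows, cols):
    """Arrange the fused features f_c into a rows x cols grid (F_p).

    ASSUMPTION: `fused` lists the patches in row-major order.
    """
    if len(fused) != rows * cols:
        raise ValueError("expected rows * cols fused features")
    return [fused[r * cols:(r + 1) * cols] for r in range(rows)]


def flatten_to_sequence(grid):
    """Flatten the grid row by row into a sequence of length L = rows * cols (F_o)."""
    return [feature for row in grid for feature in row]


# Toy example: H = W = 2000 and h = w = 100 give a 20 x 20 grid; E = 4 is hypothetical.
rows = cols = 2000 // 100
E = 4
fused = [[float(c)] * E for c in range(rows * cols)]
F_p = build_feature_grid(fused, rows, cols)
F_o = flatten_to_sequence(F_p)
L = len(F_o)
```

The grid form is what the convolutional (local) branch consumes, while the flattened sequence of L embeddings is what the CrossFormer (global) branch consumes.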

Local Spatial Relationship Modeling
While global spatial relationships are crucial to identifying complex UFZs, local information is also important for maintaining rich spatial details. As shown in the right half of Figure 6, we use two groups of parallel convolutions containing batch normalization operations to extract local information, with convolution kernel sizes of 1 and 3, and finally perform a sum operation.

Global Spatial Relationship Modeling
On a large scale, global information represents the relationships between different urban functional areas or between a UFZ and its sub-UFZs. Some studies [19,47] have shown that vision-transformer-based structures have unique advantages in capturing global spatial relationships. In this paper, we present a network architecture designed to capture the global spatial relationships of the urban functional district based on CrossFormer. As shown in the left half of Figure 6, we first transform the original input $F_p$ into a one-dimensional sequence and then input it into the CrossFormer blocks sequentially. As shown in Figure 7a, a CF block consists of a series of modules containing short-distance attention (SDA), long-distance attention (LDA), dynamic position bias (DPB), and a multilayer perceptron (MLP). After that, we perform reshaping, upsampling, and convolution operations to restore the output to the size of the input. Finally, a residual connection is used to prevent model degradation, and we stack 6 global modules to obtain the final global features. In the following, we explain the details of the CrossFormer block.

(a) Dynamic Position Bias
The commonly used relative position bias (RPB) represents the relative positions of embeddings by adding a bias to the self-attention. The following equation represents the use of RPB in long-short-distance attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V \tag{8}$$

where $Q, K, V \in \mathbb{R}^{G^2 \times D}$ denote the query, key, and value in the self-attention module, respectively, $G$ is the group size, and $D$ is the dimension of the embeddings. $\sqrt{d}$ is a constant normalizer, and $B \in \mathbb{R}^{G^2 \times G^2}$ is the RPB matrix. In the past, $B_{i,j} = \hat{B}_{\Delta x_{ij}, \Delta y_{ij}}$, where $\hat{B}$ is a fixed-size matrix, and $(\Delta x_{ij}, \Delta y_{ij})$ is the coordinate distance between the $i$th and $j$th embeddings. Because $\hat{B}$ has a fixed size, the size of the image or group is limited: offsets $(\Delta x_{ij}, \Delta y_{ij})$ that exceed the size of $\hat{B}$ cannot be indexed. The MLP-based module DPB is designed to solve this problem:

$$B_{i,j} = \mathrm{DPB}(\Delta x_{ij}, \Delta y_{ij}) \tag{9}$$

As shown in Figure 7b, the structure comprises three FC layers containing layer normalization (LN) and ReLU activation, and the dimension of the middle layer is set to $D/4$. The output $B_{i,j}$ is a scalar encoding the relative position between the $i$th and $j$th embeddings. The DPB is a trainable module that can be optimized along with the entire model; it can handle arbitrary input group sizes without being limited by $(\Delta x_{ij}, \Delta y_{ij})$.
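A minimal sketch of a dynamic position bias is given below to show why it is not tied to a particular group size. It is a simplified stand-in, assuming three FC layers with random untrained weights, layer normalization without learnable scale and shift, and an embedding dimension of 16 (so the middle layer has D/4 = 4 units); the exact layer ordering of the published module may differ.

```python
import math
import random


def linear(weight, bias, vector):
    """Fully connected layer with bias."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weight, bias)]


def layer_norm(vector, eps=1e-5):
    """Layer normalization without learnable scale/shift (a simplification)."""
    mean = sum(vector) / len(vector)
    var = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(var + eps) for x in vector]


def relu(vector):
    return [max(0.0, x) for x in vector]


def init_dpb(embed_dim, seed=0):
    """Random parameters for FC(2 -> D/4), FC(D/4 -> D/4), FC(D/4 -> 1)."""
    rng = random.Random(seed)
    hidden = embed_dim // 4

    def fc(n_out, n_in):
        weight = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        return weight, [0.0] * n_out

    return [fc(hidden, 2), fc(hidden, hidden), fc(1, hidden)]


def dpb(params, dx, dy):
    """Map a coordinate offset (dx, dy) to a scalar bias B_ij."""
    (w1, b1), (w2, b2), (w3, b3) = params
    x = relu(layer_norm(linear(w1, b1, [float(dx), float(dy)])))
    x = relu(layer_norm(linear(w2, b2, x)))
    return linear(w3, b3, x)[0]


def bias_matrix(params, group_size):
    """Build the G^2 x G^2 bias matrix for one G x G group of embeddings."""
    coords = [(x, y) for x in range(group_size) for y in range(group_size)]
    return [[dpb(params, xi - xj, yi - yj) for xj, yj in coords] for xi, yi in coords]


params = init_dpb(embed_dim=16)
B3 = bias_matrix(params, 3)   # 9 x 9 bias matrix for G = 3
B2 = bias_matrix(params, 2)   # same parameters reused for G = 2
```

The same parameters serve both group sizes because the bias depends only on the offset between two embeddings, whereas a fixed lookup table would need one entry per possible offset.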

(b) Long- and Short-Distance Attention
The self-attention module in the CrossFormer block is divided into two parts: short-distance attention (SDA) and long-distance attention (LDA). For SDA, adjacent embeddings are divided into groups using windows of size G × G; Figure 8a illustrates the case where G = 3. Unlike the fixed window of the Swin Transformer, SDA has a variable group size. In this experiment, the parameter G of the four CF blocks is {2, 4, 5, 4}. For LDA, given an input of M × M, the embeddings are sampled at a fixed interval I. As shown in Figure 8b (M = 9, I = 3), all of the embeddings belonging to the red boxes form a group, and those belonging to the purple boxes form another group. The width and height of each group can be computed by G = M/I. After grouping the embeddings, both SDA and LDA compute the self-attention within their respective groups. As a result, the computational cost of the self-attention module is reduced while the fine-grained spatial relationships are still captured efficiently.
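The two grouping rules can be sketched in a few lines of Python. This is an illustrative sketch only; the function names `sda_groups` and `lda_groups` are hypothetical, and the grid side M is assumed to be divisible by the window size and by the interval.

```python
def sda_groups(m, g):
    """Short-distance attention: partition an m x m grid into adjacent g x g windows."""
    groups = {}
    for r in range(m):
        for c in range(m):
            groups.setdefault((r // g, c // g), []).append((r, c))
    return list(groups.values())


def lda_groups(m, interval):
    """Long-distance attention: embeddings sampled at a fixed interval share a group."""
    groups = {}
    for r in range(m):
        for c in range(m):
            groups.setdefault((r % interval, c % interval), []).append((r, c))
    return list(groups.values())


M, I = 9, 3
sda = sda_groups(M, g=3)
lda = lda_groups(M, interval=I)
G = M // I  # side length of each LDA group

print(len(sda), len(sda[0]))  # number of SDA groups, embeddings per group
print(len(lda), len(lda[0]))  # number of LDA groups, embeddings per group
print(lda[0])                 # positions that are I steps apart in each direction
```

With M = 9 and I = 3, each LDA group contains G × G = 9 embeddings spread across the whole grid, whereas each SDA group contains 9 neighbouring embeddings. Attention is then computed over 9 embeddings per group instead of all 81.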

Feature Aggregation Module
The original feature F_P retains rich spatial details but lacks semantic attributes. Conversely, the global-local feature has fine-grained semantic information, but its spatial resolution is insufficient. Therefore, adding the two directly may reduce the classification accuracy [48]. In our method, we use a feature aggregation module (FGM) to narrow the semantic gap between the two types of features and achieve more accurate UFZ recognition.

First, a weighted-sum operation is performed on the two types of features, and the weights can be updated during model training to make full use of the rich spatial details and precise semantic attributes of the different features. As shown in Figure 9, after a 3 × 3 convolution operation, the fused features are input into the spatial path and the channel path. Second, the two paths designed in the model help to strengthen the channel-based and space-based feature representation. Specifically, for the spatial path, the model uses a depthwise separable convolution to produce a space-based attention feature $S \in \mathbb{R}^{h \times w \times 1}$, where h and w represent the spatial resolution of the feature map. After processing by the sigmoid function, matrix multiplication is used to obtain the path output. For the channel path, we first use the global average pooling operation to obtain the channel-based attention feature $C \in \mathbb{R}^{1 \times 1 \times c}$, where c represents the channel dimension. In addition, the rescaling operation consists of two 1 × 1 convolution layers, which reduce and then restore the channel dimension by a fixed factor. Similar to the spatial path, the sigmoid function and matrix multiplication are used to obtain the final path output. Finally, we use a 1 × 1 convolution layer to obtain the FGM output.
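The steps described above can be summarized in the following sketch. The symbols $\alpha$, $\beta$ (trainable fusion weights), $F_{GL}$ (the global-local feature), $O_s$, and $O_c$ (path outputs) are introduced here for illustration. The text does not state how the two path outputs are combined or whether an activation sits between the two 1 × 1 convolutions, so the addition in the last line is an assumption rather than the authors' exact formulation.

```latex
\begin{aligned}
F &= \mathrm{Conv}_{3\times 3}\left(\alpha F_P + \beta F_{GL}\right) \\
S &= \mathrm{DWConv}(F) \in \mathbb{R}^{h \times w \times 1}, \qquad
O_s = \sigma(S) \otimes F \\
C &= \mathrm{GAP}(F) \in \mathbb{R}^{1 \times 1 \times c}, \qquad
O_c = \sigma\left(\mathrm{Conv}_{1\times 1}\left(\mathrm{Conv}_{1\times 1}(C)\right)\right) \otimes F \\
F_{\mathrm{FGM}} &= \mathrm{Conv}_{1\times 1}\left(O_s + O_c\right)
\end{aligned}
```

Here $\sigma$ is the sigmoid function, GAP is global average pooling, and $\otimes$ denotes multiplication with broadcasting over the missing dimensions.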

Implementation Details
In the first part, we use ResNet-50 as the backbone network to extract visual features from the remote-sensing imagery. Meanwhile, the points of interest are transformed into a hierarchical distance heatmap tensor and input into the convolutional neural network to extract social features. Finally, a module based on an attention mechanism is used to efficiently fuse the two types of features. To train this part, we used the cross-entropy (CE) loss and the Adam optimizer. The batch size was set to 16; the learning rate was initially set to 1 × 10^−5 and was reduced to 0.98 times its value every five epochs.
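The learning rate schedule can be written as a small function. This is a sketch under the assumption that the factor of 0.98 compounds every five epochs (a standard step decay); the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-5, decay=0.98, step=5):
    """Step decay: multiply the rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)


for epoch in (0, 4, 5, 50, 199):
    print(epoch, learning_rate(epoch))
```

In PyTorch, a schedule of this shape is commonly expressed with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.98)`.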
In the second part, we use two modules to extract global and local spatial relationships from the fused features, and then a feature aggregation module is used to obtain the final global-local context. After the model of the first part was trained, we saved the fused features to local storage as the input for the second part. The loss function, optimizer, batch size, and learning rate adjustment strategy used in this part were exactly the same as in the first part.
We used the PyTorch deep-learning framework for training on the Windows 10 operating system with an NVIDIA GeForce RTX 3090 graphics card (24 GB of memory). For the complementary feature-learning and fusion part, the model converged after 200 epochs; for the spatial-relationship-modeling part, the model converged after 150 epochs.

Evaluation Metrics
In the experiment, we used the kappa coefficient (Kappa) to evaluate the comprehensive performance of each model, the F1 score to measure the accuracy of each category, and the overall accuracy (OA) to evaluate the overall classification accuracy, as shown in the following equations:

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$F1 = \frac{2\,TP}{2\,TP + FP + FN}$$

$$\mathrm{Kappa} = \frac{P_o - P_e}{1 - P_e}, \qquad P_e = \frac{a_1 b_1 + a_2 b_2 + \cdots + a_C b_C}{n^2}$$

where TP denotes the number of pixels correctly classified into the positive class, TN denotes the number of pixels correctly classified into the negative class, FP denotes the number of pixels misclassified into the positive class, and FN denotes the number of pixels misclassified into the negative class. $P_o$ denotes the overall classification accuracy; $a_1, a_2, \ldots, a_C$ denote the number of true samples in each category; $b_1, b_2, \ldots, b_C$ denote the number of predicted samples in each category; and n is the total number of samples.
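As an illustration of how these metrics follow from a confusion matrix, the sketch below computes OA, per-class F1, and Kappa in plain Python. The function name `evaluate` and the toy matrix are hypothetical and are not data from this study. For more than two classes, OA is computed as the fraction of correctly classified samples.

```python
def evaluate(confusion):
    """Compute OA, per-class F1, and Kappa from a C x C confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    num_classes = len(confusion)
    n = sum(sum(row) for row in confusion)
    true_totals = [sum(row) for row in confusion]                    # a_1 ... a_C
    pred_totals = [sum(row[j] for row in confusion)
                   for j in range(num_classes)]                      # b_1 ... b_C

    overall_accuracy = sum(confusion[i][i] for i in range(num_classes)) / n    # P_o
    expected = sum(a * b for a, b in zip(true_totals, pred_totals)) / (n * n)  # P_e
    kappa = (overall_accuracy - expected) / (1 - expected)

    f1_scores = []
    for i in range(num_classes):
        tp = confusion[i][i]
        fp = pred_totals[i] - tp
        fn = true_totals[i] - tp
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return overall_accuracy, f1_scores, kappa


# Toy three-class example (made-up counts, for illustration only).
toy = [
    [50, 5, 0],
    [10, 30, 5],
    [0, 5, 45],
]
oa, f1, kappa = evaluate(toy)
print(f"OA={oa:.4f}  F1={[round(x, 4) for x in f1]}  Kappa={kappa:.4f}")
```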

Comparative Experiments for the Second Part
In the following experiment, we kept the structure of the first part unchanged and compared different models for the second part to further verify the importance of spatial relationships for urban functional zone recognition. The models we chose to compare include two hierarchical vision transformer networks based on sparse attention mechanisms, the Swin Transformer (SWINT) [49] and the pyramid vision transformer (PVT) [50], and two networks based on time-series modeling, the long short-term memory network (LSTM) [51] and the gated recurrent unit network (GRU) [52]. For a fair comparison, we stack six global modules and one local module for all experimental models.
As shown in Figure 10, the areas classified by our method are purer and the road structure is more complete compared with other methods. The main reason is that our model can obtain richer composition patterns of spatial relationships by adding the variable group size and long-short-distance attention. As shown by the red circles in the first row, our method accurately identifies most of the institutional areas and has fewer misclassified spots, with a clearer road network compared to the results obtained by the GRU and LSTM. As marked by the red circles in the fourth row, our method obtains better results than SWINT and PVT due to the consideration of finer spatial relationships. In the sixth row, the classification results of our method are purer than those of SWINT and PVT. Although the GRU and LSTM have no obvious misclassified spots in this region, the urban functional zone edges recognized by them are not precise enough.
The quantitative evaluation results of the models are shown in Table 3. The GRU and LSTM can only capture the long-range dependence information in the horizontal and vertical directions, so they are the least effective. Our method improves Kappa by 1.03 percentage points over the second-best method and 7.04 percentage points over the worst method, and the overall accuracy is improved by at least 1.8% over other methods. In addition, the F1 scores of each UFZ category further demonstrate the advantage of our method. Our model performs best in six categories; in particular, its F1 score for the transport category reaches 0.8820. Meanwhile, the accuracy in the construction category is not satisfactory because samples are scarce. For the other two categories, the accuracy gap between our method and the best method is less than 1%. The qualitative and quantitative evaluation results show that our model can make full use of the fine-grained spatial relationships for more accurate urban functional zone recognition.

In this part, we used three related methods and a benchmark model for comparative experiments. The first one is the RPFM (Remote Sensing Images and Point-of-Interest Fused Model) [41], which also uses the image patch as the basic mapping unit and represents the relationship between POIs and urban region functions based on the distance metric. The second one is DHAO (deep integration of high-resolution imagery and OpenStreetMap) [53], which uses statistical features from POI data and visual features from RSI data for urban functional zone classification, with the road network as the basic mapping unit. The third one is the SO-CNN (Super-Object-based CNN) [21], which uses only remote-sensing image data to identify UFZs, with the super object as the basic mapping unit. In addition, the benchmark model is the widely used UNet network [20], which only utilizes image data to identify urban functional zones.
On the one hand, social perception information is crucial for accurate UFZ recognition, and our method can sufficiently capture it through the complementary feature-learning and fusion module. By learning rich social features, our approach can identify institutional areas, commercial areas, and open spaces more accurately. The results obtained by the SO-CNN are not accurate enough because only RSI data are considered. Although the RPFM and DHAO also use both POI data and RSI data, they simply incorporate the statistical characteristics of POI data and therefore fail to take full advantage of their social attributes. As a result, the inadequate representation can produce misclassified results, such as those in the red circles in Figure 11. We further give two specific examples. The region in the fourth row of Figure 11 consists of a large number of commercial areas as well as some residential areas. Accurately identifying commercial areas has always been a difficult problem in the UFZ classification task. Nevertheless, our method correctly identified almost all commercial areas in the region and has fewer misclassified spots. In the fifth row, our model accurately identified the area where the beach meets the ocean, while all other models misclassified this area as the transport category.
On the other hand, spatial relationship features are also important for accurate UFZ classification. As marked by the yellow circles in Figure 11, the results obtained by our method are purer and more continuous due to the consideration of the spatial position relationships between different patches, while other methods only utilize the features of a single UFZ. In addition, the benchmark model obtained the worst results due to its model structure and its use of a single data type.
The quantitative evaluation results are shown in Table 4, which shows that our method obtains the highest scores on most of the metrics. The Kappa value of our method is at least 2.84% higher than that of other methods, and it has an overall accuracy improvement of 3.84% over the second-best method. In addition, our model achieves the highest F1 scores on seven UFZ categories, with the scores improved by at least 6.50% and 3.61% for the commercial and institutional areas, respectively. The qualitative and quantitative evaluation results show that our model has a better overall performance in the urban functional zone recognition task.

Ablation Experiment
In this section, we analyze the specific contribution of each module in the proposed framework to the final urban functional zone recognition accuracy through a series of ablation experiments. The results are shown in Table 5. We conducted the experiments by successively adding new modules to verify the contribution of each module to the UFZ recognition accuracy. For example, using both RSI data and POI data (Stage 2) improves the Kappa value by 0.0793 and the OA value by 0.0779 compared to using only remote-sensing imagery (Stage 1), and the use of the LWM (Stage 3) increases the Kappa value to 0.8246; meanwhile, the addition of the FFM (Stage 4) increases the Kappa value from 0.8246 to 0.8379. The spatial relationship modules also have an important impact on improving the urban functional zone classification accuracy. After adding the local module and global module (Stage 7), the Kappa value increases from 0.8379 to 0.8654. Finally, our model obtains the optimal performance (Kappa = 0.8697, OA = 0.9136) after adding the FGM (Stage 8). It is worth noting that including only the local module (Stage 6) leads to a greater accuracy improvement than including only the global module (Stage 5), suggesting that the local spatial relationship is also crucial to accurate UFZ classification. The results of the ablation experiments show that all of the modules of our framework have a positive effect on improving the urban functional zone recognition accuracy, and using POI data leads to the greatest improvement in accuracy; therefore, we discuss the relationship between points of interest and urban functional zone recognition accuracy in the following section.
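The incremental gains implied by the Kappa values quoted above can be checked with a short calculation. The sketch uses only numbers stated in the text; the values for Stages 1, 2, 5, and 6 are not quoted there, so only the reported Stage 1 to Stage 2 gain is included for those stages. The dictionary keys are labels chosen for illustration.

```python
# Kappa values quoted in the text for successive stages of the ablation study.
kappa_by_stage = {
    "Stage 3 (+LWM)": 0.8246,
    "Stage 4 (+FFM)": 0.8379,
    "Stage 7 (+LOCAL, +GLOBAL)": 0.8654,
    "Stage 8 (+FGM)": 0.8697,
}

stages = list(kappa_by_stage.items())
gains = {
    f"{prev_name} -> {name}": round(value - prev_value, 4)
    for (prev_name, prev_value), (name, value) in zip(stages, stages[1:])
}
gains["Stage 1 -> Stage 2 (+POI)"] = 0.0793  # reported directly in the text

largest = max(gains, key=gains.get)
for step, gain in gains.items():
    print(f"{step}: +{gain:.4f}")
print("largest gain:", largest)
```

Among the gains that can be derived from the text, the addition of POI data is the largest, which is consistent with the conclusion drawn from Table 5.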

Discussion
In this section, we discuss the contributions of different point-of-interest categories to the UFZ classification accuracy and the synergy mechanism between RSI data and POI data in the urban functional zone recognition task. Finally, we visualize the layer activation of visual features.

Specific Impact of POIs on UFZ Recognition
The relationship between POIs and UFZs is not simply one-to-one or one-to-many; in many cases, it is complex. To determine which point-of-interest categories have a more important impact on the urban functional zone classification accuracy, we deleted one POI category at a time and repeated the experiment. As shown in Table 6, regardless of which POI type is deleted, the final recognition accuracy decreases, indicating that all POIs have a positive effect on improving the urban functional zone classification accuracy. When we remove POIs in the institution, residence, industry, and public service categories, the model accuracy decreases the most, suggesting that these POI categories have a relatively strong association with the urban functional zones. In contrast, deleting retail, hotel, and supermarket POIs has less of an impact on the classification accuracy, mainly because these points of interest can occur in almost all urban functional zone categories, and there is no clear representative relationship between them and any particular zone.
In addition, we investigated the cost of using both remote-sensing images and points of interest versus training with more remote-sensing imagery alone when a similar classification accuracy is to be obtained. As Table 7 shows, training with both RSIs and POIs results in higher accuracy than training with twice as much RSI data alone and is close to the accuracy obtained when training with three times as many remote-sensing images alone, further demonstrating the advantages of using the point-of-interest data.

Synergy Mechanism of POIs and RSIs in UFZ Recognition
The discriminative abilities of visual and social features differ across urban functional zone categories; as we can imagine, visual features are more important for identifying water, while social features may be more crucial to identifying commercial or institutional areas. To further clarify this synergy mechanism, as shown in Figure 12, for patch-level data, we take the weight of the visual feature w_l and the weight of the social feature w_s to generate the point (w_l, w_s) and visualize in two dimensions the discriminative ability of the different features for each UFZ category. We observe two phenomena. First, even for the same UFZ category, the fusion weights change dynamically, and the distribution of fusion weights for all urban functional zones tends to be relatively dispersed and concentrated. Second, visual features are more important for identifying urban functional zones in the transport and water categories, while social features are more crucial for recognizing the commerce, industry, institution, open space, and residence classes. However, for the residence and open space categories, although the social features are dominant, the complementary role of visual features becomes increasingly obvious. For the transport category, the rule is the opposite. For the construction area, the two types of features have a relatively equivalent status.
Based on the above discussion, we conjecture that this synergy mechanism is manifested by taking one feature as the main fusion object and the other as an auxiliary object. Nevertheless, for some urban functional zone categories, when the discriminative ability of the dominant feature reaches a peak or even declines, that of the complementary feature continues to improve and approaches the dominant feature. As shown in Figure 13, we give two examples to further illustrate this synergy mechanism. In Figure 13a, it is difficult to determine whether the sports field belongs to an institutional area, residential area, or open space based on visual features alone, but the surrounding residential POIs provide strong social signals indicating that the field belongs to a residential area. In Figure 13b, a large area of green space provides strong visual information indicating that the area belongs to the open space category, while the points of interest are scarce in this area, and relying only on the surrounding residential POIs would produce incorrect results. As a result, our model tends to give the social features a larger weight in Figure 13a and the visual features a larger weight in Figure 13b. Therefore, based on the different features obtained from RSIs and POIs, the synergy mechanism is a dynamic process in specific urban functional zones, and the experiments show that the model in this paper can efficiently utilize this synergy mechanism for more accurate urban functional zone recognition.

Layer Activation Visualization of Visual Features
In the above sections, we focused on the importance of social features for urban functional zone classification; however, visual features are also crucial in almost all relevant tasks. Therefore, we use the Grad-CAM [54] method to visualize the layer activation of the network for a given input during the visual feature extraction phase. As shown in Figure 14, in the shallow stage (Layer 1 or Layer 2), the network tends to focus on the important regions of the image at a larger scale. As the layers deepen, the network focuses on a specific region of the image at a relatively small scale. Meanwhile, for urban functional zones containing more buildings, the edge and corner information of some buildings is more important; for the institutional area, we chose a relatively representative image and can see that the network mainly focuses on the areas of different surface materials on the roof; for green spaces belonging to the open space category, the network focuses on the texture information of the image; for water, the network tends to focus on the edge region or texture information of the image. As a result, visual features with strong representation ability lay a solid foundation for accurate urban functional zone recognition.
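For reference, the class activation map computed by Grad-CAM can be written as follows. This is the standard formulation of the method rather than anything specific to this paper; $A^k$ denotes the $k$th feature map of the visualized layer, $y^c$ the score of class $c$, and $Z$ the number of spatial positions in the feature map.

```latex
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}},
\qquad
L^c_{\mathrm{Grad\text{-}CAM}} = \mathrm{ReLU}\left( \sum_{k} \alpha_k^c A^k \right)
```

The ReLU keeps only the regions that contribute positively to the class score, which is why the maps in Figure 14 highlight the image regions that support the predicted category.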


Conclusions
In this study, a unified deep-learning framework was designed for learning visual, social, and spatial relationship features simultaneously from remote-sensing imagery data and point-of-interest data for more accurate urban functional zone recognition. In the complementary feature-learning and fusion module, we use two convolutional neural networks to extract visual features and social features from RSIs and POIs, respectively, and then the feature fusion module is used to fuse them efficiently. In the spatial-relationship-modeling module, taking the fused features obtained in the first part as input, we designed a new network structure based on the CrossFormer to extract the global and local spatial information of the urban functional zone distribution, and then a feature aggregation module is used to utilize the two spatial relationships efficiently. The comparative experimental results for the second part show that the Kappa value (0.8697) of our model on the test dataset is 1.03 percentage points higher than that of the second-best method (0.8594) and 7.04 percentage points higher than that of the worst method (0.7993). Additionally, it obtains the highest F1 scores in six urban functional zone categories, showing that the proposed model can effectively utilize the spatial relationship composition patterns of different urban functional zones to obtain more accurate results. Meanwhile, the results of comparative experiments with other methods show that the Kappa and OA values are improved by 2.84% and 3.84%, respectively, over the second-best method, showing that our method can effectively utilize the synergy mechanism in different situations and integrate the visual features, social features, and spatial relationship features for more rational and effective urban functional zone recognition.
In addition, we investigate the influence of different point-of-interest categories on the urban functional zone classification accuracy, the synergy mechanism of POIs and RSIs in the UFZ classification task, and the layer activation visualization of visual features. In particular, we demonstrate that the synergy mechanism is a dynamic process based on the discriminative ability of complementary features from RSIs and POIs in different UFZ categories, and our method can capture this mechanism to obtain a better urban functional zone recognition result. The framework in this paper is instructive for accurate urban functional zone recognition using multimodal data. In the future, we will attempt to use more representative datasets and investigate the impact of different mapping units on the UFZ classification accuracy to achieve more efficient and accurate urban functional zone recognition.

Figure 1. Overview of the dataset used in this study.

Figure 2. Two-level hierarchy of the dataset.
Figure 3. The overall structure of the proposed framework.

Figure 4. The structure of the layer-weighted model.

Figure 5. The structure of the adaptive feature fusion module.
SPRS Int.J. Geo-Inf.2023, 12, x FOR PEER REVIEW 8 of 22 features are added, and the results are weighted sums with the original input  .Finally, a feature aggregation module (FGM) is used to obtain the final global-local context.

Figure 6. Illustration of the spatial-relationship-modeling module.

Figure 7. The structures of the CF block and DPB. (a) CF block; (b) DPB.

(a) Dynamic Position Bias
The commonly used relative position bias (RPB) represents the relative embedding positions by adding a bias to the self-attention. The following equation represents the use of RPB to represent long-short-distance attention:

Figure 9. The structure of the FGM.

Figure 10. The test results of different models for the second part.

Figure 11. The test results of different models.
Figure 12. Two-dimensional visualization of discriminative ability of different features for different UFZ categories. Ws represents the weight of social features, Wl represents the weight of visual features, and the blue dotted line denotes the case where Ws = Wl. When the point is above the dotted line, social features are more important. When the point is below the dotted line, visual features are more important.

Figure 13. A case of the synergy mechanism. (a) Social features are more important; (b) visual features are more important.

Figure 14. Layer activation visualization of visual features.

Table 1. Quantitative distribution of different UFZs in training and test sets.

Table 2. Quantitative distribution of different POI categories.

Table 3. Quantitative evaluation results of different models for the second part.

Table 4. Quantitative evaluation results of different models.

Table 5. The results of the ablation experiment. RSI: use of remote-sensing imagery. POI: use of distance heatmap tensor of POIs. LWM: layer-weighted module. FFM: feature fusion module. LOCAL: local spatial relationship modeling. GLOBAL: global spatial relationship modeling. FGM: feature aggregation module.

Table 6. Impact of removing different types of POIs on model accuracy.

Table 7. The advantages of using POI data.