Article

RE-SAT: Spatial-Aware Transformers with Semantic-Guided Prompting for Urban Region Embedding

1
School of Artificial Intelligence, Shenzhen Technology University, Shenzhen 518118, China
2
College of Applied Science, Shenzhen University, Shenzhen 518060, China
3
School of Cyber Science and Technology, University of Science and Technology of China, Hefei 230026, China
*
Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Urban Sci. 2026, 10(3), 168; https://doi.org/10.3390/urbansci10030168
Submission received: 1 January 2026 / Revised: 9 March 2026 / Accepted: 17 March 2026 / Published: 19 March 2026

Abstract

Learning transferable region embeddings is a fundamental problem in urban computing, as such representations support a wide range of downstream prediction tasks. Existing methods leverage multi-view and multimodal urban data but often fail to explicitly model spatial relations across views or effectively align general region embeddings with task-specific objectives. In this paper, we propose a spatial-aware Transformer (RE-SAT) network with semantic-guided prompting for urban region embedding. RE-SAT adopts a two-stage learning paradigm. In the first stage, a spatial-aware Transformer encoder injects connectivity and distance-based spatial priors into the attention mechanism to learn task-agnostic region embeddings from multi-view urban data. In the second stage, RE-SAT adapts the learned embeddings to downstream tasks via a semantic-guided prompt learning mechanism, which generates task-aware soft prompts from textual task descriptions without modifying the universal embeddings. Extensive experiments on multiple urban prediction tasks across different cities demonstrate that RE-SAT consistently outperforms state-of-the-art methods, achieving maximum relative improvements of 12.2% in MAE, 12.2% in RMSE, and 6.7% in $R^2$, validating its effectiveness and generalizability. Consequently, this framework serves as a robust decision-support tool for urban planners and policymakers, facilitating efficient resource allocation and intelligent city management across diverse scenarios.

1. Introduction

Learning region embeddings is a fundamental problem in urban computing [1]. The goal is to encode heterogeneous urban data into compact region-level representations that can be directly used by downstream models. These data sources include structured information such as points of interest, human mobility, and land use, as well as unstructured modalities like satellite imagery, street-view images, and geo-referenced texts [2,3]. By providing a unified representation, region embeddings connect diverse urban data with learning-based decision models. Well-learned region embeddings have been widely applied to urban tasks such as crime prediction, check-in forecasting, and service demand estimation [4]. An important advantage of these representations is their reusability across tasks and cities, which reduces the need for repeated model training. As large-scale urban data become increasingly accessible, region representation learning has emerged as a key component in urban analytics systems.
With the development of deep learning, most recent studies adopt data-driven methods to learn region representations [5,6]. Early work mainly focused on integrating multiple urban data views, including mobility patterns and points of interest (POIs) [7]. These approaches typically generated a single embedding for each region using simple fusion strategies, such as feature concatenation, weighted aggregation, or dimensionality reduction through multilayer perceptrons and autoencoders [8]. Although effective in capturing view-level information, they largely ignored spatial dependencies between regions.
Later studies addressed this limitation by explicitly modeling spatial relations [9]. These methods constructed graph structures to represent region–region interactions and employed graph neural networks as encoders [10,11]. To capture diverse spatial and functional relationships, multiple graphs were often used. In addition, attention mechanisms were introduced to adaptively fuse information from different views and, in some cases, across regions, as demonstrated in models such as Urban Multi-Modal and Multi-View Dual Contrastive Learning (UrbanMMCL) [12], Multi-View Joint Representation Learning Framework for Urban Region Embedding (MVURE) [13], and Hybrid Attentive Fusion (HAFusion) [14]. These designs improved the ability to model spatial interactions in region representations. More recently, region representation learning has been extended to multimodal settings. Beyond structured urban data, these methods incorporate visual and textual information, such as satellite images, street-view data, and POI-related texts. Cross-modal consistency is typically achieved through contrastive learning or hierarchical modeling strategies. Representative approaches include Region Dual Contrastive Learning (RegionDCL) [15], Urban Region Profiling with Contrastive Language-Image Pretraining (UrbanCLIP) [16], and City Foundation Models (CityFM) [17], which aim to align region semantics across modalities.
To further improve adaptation to downstream tasks, prompt learning has been introduced into region representation learning [18]. In this line of work, general-purpose region embeddings are first learned, and task-related prompts are then injected to guide task-specific prediction [16]. These prompts are often derived from multimodal inputs and are designed to introduce task semantics with limited parameter updates [19]. For example, Heterogeneous Region Embedding with Prompt Learning (HREP) [18] applies randomly initialized soft prompts for efficient task adaptation, while Flexible Urban Region Representation Learning (FlexiReg) [19] uses image and text prompts to enhance semantic richness and generalization across tasks and cities.
Despite these advances, two major challenges remain. First, spatial relations within and across views are still insufficiently explored. Most existing methods encode each view independently. On the one hand, spatial dependencies between regions are not fully considered in some view fusion strategies, which weakens region-level interactions. On the other hand, spatial interactions across different views are rarely modeled. This design may lead to fragmented or inconsistent region semantics across views. In practice, spatial relations are essential not only within individual views but also across views, because urban region functions exhibit spatial continuity, and different views are essentially diverse observations of the same underlying spatial structure. Ignoring this shared spatial structure limits the ability to learn coherent region representations. Second, existing prompt-based methods lack a clear alignment with task objectives. Task prompts are often constructed in a random or isolated manner, which prevents the model from capturing intrinsic relationships among tasks. However, urban tasks are not independent. Tasks such as check-in prediction and human mobility flow prediction rely on similar data sources and spatial activity patterns, and are therefore more closely related to each other than to tasks like crime prediction. Such task-level similarities should be reflected in the prompt representations. Nevertheless, current methods fail to explicitly model these relationships, treating prompts for related tasks independently. As a result, prompt-based fine-tuning may have limited effectiveness in improving downstream performance.
To address the above challenges, we propose a spatial-aware Transformer network with semantic-guided prompting for region embedding (RE-SAT), a unified framework for urban region representation learning and task-specific adaptation. RE-SAT is designed to explicitly model spatial relations within and across views, while enabling effective task alignment through semantic-guided prompt learning. RE-SAT follows a two-stage learning paradigm. In the first stage, the model learns task-agnostic region embeddings from multi-view urban data through a multi-view representation learning module. As the core component of this module, we design a spatial-aware Transformer encoder (STE), which injects spatial priors—such as region connectivity and distance-based adjacency—directly into the self-attention mechanism. This design enables the model to jointly capture global semantic correlations and local spatial dependencies within each view, while preserving spatial consistency across different views. The resulting view-specific representations are further integrated via an adaptive multi-view fusion module to produce universal region embeddings.
In the second stage, RE-SAT adapts the frozen region embeddings to specific downstream tasks through a semantic-guided prompt learning mechanism. Instead of manually designing or randomly initializing task prompts as in previous work [18], we encode textual task descriptions using a BERT-based text encoder to obtain task semantic representations. A lightweight prompt generation module then aligns task semantics with region embeddings in a shared latent space and generates soft prompts via attention-based fusion. These prompts are concatenated with the original region embeddings to form task-aware representations for downstream prediction. By explicitly modeling task semantics and their relationships with region representations, RE-SAT enables effective and stable task adaptation without modifying the universal embeddings.
To summarize, this paper aims to answer two fundamental research questions: (1) How can we design a unified encoding mechanism that effectively balances complex global semantic correlations (e.g., functionally similar but geographically distant regions) and local spatial dependencies (e.g., connectivity and proximity) across multi-source urban views? (2) How can we leverage task-specific semantic information to efficiently guide the adaptation of frozen region embeddings for diverse downstream applications? The main contributions of this paper are summarized as follows:
  • We propose RE-SAT, a unified two-stage framework for urban region representation learning that explicitly models spatial relations within and across views, while enabling semantic-guided task adaptation via prompt learning.
  • We design the STE that incorporates connectivity encoding and distance-based spatial priors into the attention mechanism, allowing the model to capture both global semantic dependencies and local spatial structures from multi-view urban data.
  • We introduce a semantic-guided prompt generation module that aligns textual task semantics with pre-trained region embeddings in a shared latent space, generating task-aware soft prompts without modifying the universal embeddings.
  • Extensive experiments on multiple downstream tasks demonstrate that RE-SAT consistently outperforms state-of-the-art baselines, validating the effectiveness and generalizability of the proposed framework.
The structure of the paper is as follows: Section 2 reviews related work on urban region embedding strategies and the application of prompt learning in urban computing. Section 3 details the methodology of the proposed RE-SAT framework, elaborating on the spatial-aware Transformer encoder and the semantic-guided prompt generation mechanism. Section 4 presents the experimental datasets, implementation details, and a comprehensive evaluation of the model’s performance across multiple downstream tasks compared with state-of-the-art baselines. Section 5 provides an in-depth discussion on the model’s limitations, practical implications, and future research directions. Finally, Section 6 concludes the study.

2. Related Work

2.1. Urban Region Embedding

Early research on urban region embedding primarily focused on single-view approaches. Wang and Li [20] pioneered region embedding methods based on taxi flow graphs, treating regions as nodes and movement flows as edges, learning transition patterns through an improved word2vec framework. Yao et al. [21] further integrated spatio-temporal features of moving trajectories (direction/time distribution) and destination attractiveness to enhance the accuracy of urban functional zone identification. Huang et al. [22] proposed the Hierarchical Graph Infomax (HGI) framework, which maximizes hierarchical graph information to capture both the uniqueness of POIs and inter-region interactions. Wu et al. [23] introduced multi-graph fusion networks (MGFN), modeling regional movement patterns through a spatio-temporal similarity graph and cross-modal message passing.
More recent studies have shifted towards collaboratively modeling multi-view heterogeneous data. Jenkins et al. [10] were among the first to integrate satellite imagery, POIs, mobility data, and a spatial graph, capturing regional semantics through multi-modal embeddings. Zhang et al. [13] designed MVURE, utilizing graph attention networks to achieve adaptive fusion of human mobility and regional attributes. Luo et al. [11] proposed a multi-graph representation learning framework for urban region profiling (Region2Vec), constructing a tri-graph structure encompassing mobility, geographical proximity, and POI data, learning region profiles via an attention-based fusion module. Sun et al. [14] introduced HAFusion, which explicitly models the internal and external correlations across multiple views through a hybrid attention feature learning paradigm. Xu and Zhou [24] designed the Coarsened Graph Attention Pooling (CGAP) model, employing coarsened graph attention pooling to hierarchically aggregate neighborhood information, thereby mitigating the over-smoothing issue inherent in traditional graph neural networks (GNNs). Cao et al. [12] proposed UrbanMMCL, constructing a multi-view graph structure encompassing functional, mobility, and geographical views with variational graph autoencoders, learning unified urban region representations via cross-view contrastive learning.
These existing studies often grapple with the trade-off between capturing “global correlation” and preserving “local structurality.” In contrast to these works, we propose the STE to address this trade-off by simultaneously capturing long-range global correlations among regions and the spatial proximity and connectivity derived from the graph topology.

2.2. Prompt Learning

Inspired by its success in the natural language processing domain, the “pre-train-and-prompt” workflow has been extended to various fields. Prompt learning guides the behavior of a frozen foundation model by designing the input (i.e., the prompt), rather than updating all model weights. For instance, Graph Prompt Feature (GPF) [25] adjusts the feature representations of downstream graphs by introducing learnable perturbation vectors in the node feature space, thereby enabling task adaptation. Edge Prompt Tuning (EdgePrompt) [26] integrates additional learnable edge prompt vectors into the message-passing mechanism of a pre-trained GNN to better embed graph structural information. Furthermore, All in One [27] designs a unified prompt graph with tokens, structures, and insertion patterns to reformulate node and edge tasks into graph-level tasks via meta-learning for efficient multi-task adaptation.
In the domain of urban region embedding, recent research has leveraged prompt learning to enhance region representations for downstream tasks. HREP [18] pioneered this direction by utilizing prefix tuning to fine-tune pre-trained region representations. FlexiReg [19] employed a prompt enhancer to fuse text and street-view features, generating augmented prefix prompts. UrbanMMCL [12] utilized prompt engineering to generate corresponding text for a region’s satellite and street-view images. However, these methods are often limited to static prompts or lack relevance to the task semantics. Our semantic-guided prompt learning addresses this limitation by tightly aligning the region representation with the semantics of the task, such as “crime prediction”.

3. Methodology

In this section, we describe the proposed RE-SAT framework in detail. We first formulate the problem of urban region embedding and introduce the necessary notations. We then provide an overview of the overall architectural pipeline of RE-SAT. Next, we present the two key components of the framework: the STE for multi-view region representation learning, and the semantic-guided prompt generation module for task-specific adaptation. Finally, we describe the two-stage training paradigm and the corresponding optimization objectives.

3.1. Problem Formulation

Definition 1 (Urban Region Embedding). Given a set of non-overlapping urban regions $\mathcal{R} = \{r_1, r_2, \ldots, r_N\}$, the objective of urban region embedding is to project each region $r_i$ into a low-dimensional embedding $e_i \in \mathbb{R}^d$, where $d$ denotes the embedding dimensionality. These representations are subsequently utilized for downstream predictive tasks, such as crime prediction, which aim to map regional embeddings to specific target values.
Definition 2 (POI Features). Let $P \in \mathbb{R}^{N \times n_p}$ denote the points of interest feature matrix, where $P_{i,j} \in P$ represents the count of POIs belonging to category $j$ in region $r_i$, and $n_p$ represents the number of POI categories.
Definition 3 (Land-Use Features). Let $L \in \mathbb{R}^{N \times n_l}$ denote the land-use feature matrix, where $L_{i,j} \in L$ quantifies the count of areas falling into land-use category $j$ within region $r_i$, and $n_l$ represents the number of land categories.
Definition 4 (Human Mobility Features). Let $M \in \mathbb{R}^{N \times N}$ denote the mobility flow matrix, where $M_{i,j} \in M$ represents the volume of human transitions from region $r_i$ to region $r_j$ over a specific period.
Definition 5 (Geographic Proximity Features). Let $A \in \mathbb{R}^{N \times N}$ be the distance-based adjacency matrix, where $A_{i,j} \in A$ is the normalized proximity coefficient between regions $r_i$ and $r_j$. A higher $A_{i,j}$ indicates a shorter geographic distance.

3.2. Data Preprocessing

Following [14], we constructed multi-view datasets for New York City (NYC) and Chicago (CHI). The preprocessing pipeline consists of the following steps, which transform raw urban data into region-level feature matrices.
Region Partitioning: We discretized the geographical space into non-overlapping functional units. For NYC, the study area focuses on Manhattan, divided into 180 census tracts. For CHI, the city is delineated into 77 official community areas.
Human Mobility Feature Extraction: We utilized large-scale taxi trajectory records (pickup and drop-off coordinates) to construct the human mobility view. We mapped each trip’s origin and destination to the corresponding regions and aggregated the total volume of trips from region $r_i$ to region $r_j$ over the observation period. This results in a mobility flow matrix $M$, where entries represent the transition intensity between regions.
POI/Land-Use Feature Extraction: We processed POI and land-use data to capture regional functionality. Raw POI records were collected from OpenStreetMap and mapped to regions based on their coordinates. We categorized them into $n_p$ standard types (e.g., restaurants, schools) and computed the frequency for each region to form the POI feature matrix $P$. Similarly, land-use data were processed by counting the number of functional zones (e.g., residential, commercial) within each region, resulting in feature matrix $L$.
Geographic Adjacency Matrix Construction: To explicitly characterize the spatial proximity between urban regions, we construct a distance-based adjacency matrix $A \in \mathbb{R}^{N \times N}$. Specifically, we calculate the pairwise Manhattan distances between regional centroids based on their geographic coordinates. These physical distances are then transformed into normalized proximity coefficients via a Gaussian kernel function. To ensure graph sparsity and mitigate the influence of weak, long-range dependencies, we exclude self-loops and truncate connections exceeding a predefined distance threshold.
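The adjacency construction described above can be illustrated with a minimal pure-Python sketch. The kernel form exp(-dist² / (2σ²)), the bandwidth `sigma`, the threshold `max_dist`, and the toy centroids are all assumptions for illustration; the text does not report the actual kernel parameters or threshold.

```python
import math

def proximity_matrix(centroids, sigma, max_dist):
    """Distance-based adjacency from region centroids.

    sigma (kernel bandwidth) and max_dist (truncation threshold) are
    hypothetical parameters chosen for this sketch.
    """
    n = len(centroids)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # exclude self-loops
            # Manhattan (L1) distance between the two centroids
            dist = (abs(centroids[i][0] - centroids[j][0])
                    + abs(centroids[i][1] - centroids[j][1]))
            if dist > max_dist:
                continue  # truncate weak, long-range connections
            # Gaussian kernel: closer regions get coefficients nearer to 1
            A[i][j] = math.exp(-(dist ** 2) / (2.0 * sigma ** 2))
    return A

# Toy example: two nearby regions and one distant region.
centroids = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
A = proximity_matrix(centroids, sigma=1.0, max_dist=3.0)
```

In this toy case only the first two regions are linked; the third lies beyond the threshold and keeps an all-zero row and column.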

3.3. Model Overview

The overall architecture of RE-SAT is illustrated in Figure 1. RE-SAT follows a two-stage paradigm for urban region representation learning and task adaptation. In the first stage, the model ingests multi-source urban features, including POI, land-use, mobility, and proximity information, and learns general-purpose region representations through a multi-view representation learning module. This module captures both global semantic correlations and local spatial dependencies among regions. In the second stage, task-specific semantics are encoded using a BERT-based text encoder. A lightweight prompt generation module then integrates these task semantic representations with the pre-trained region embeddings to synthesize soft prompt vectors. The generated prompts are concatenated with the original embeddings to form task-aware region representations for downstream applications, such as crime prediction and check-in forecasting.

3.4. Multi-View Representation Learning

This stage aims to extract universal regional semantics by balancing global and local perspectives through a two-step process.

3.4.1. STE: Spatial-Aware Transformer Encoder

To explicitly model spatial dependencies, we utilize three different STEs to learn view-specific representations for POI, land-use, and mobility.
  • Connectivity Encoding: Given a feature matrix $X \in \{P, L, M\}$ and adjacency matrix $A$, we first compute the degree centrality of each region to measure its connectivity and accessibility, as expressed in Equation (1):
    $$\mathrm{deg}_i = \frac{\sum_{j=1}^{N} A_{i,j}}{\max_k \left( \sum_{j=1}^{N} A_{k,j} \right) + \epsilon}, \tag{1}$$
    where $\mathrm{deg} \in \mathbb{R}^{N \times 1}$ represents the degree centrality vector of all regions, and $\epsilon$ is a very small constant used to prevent division by zero.
This is transformed via a linear layer into a connectivity encoding and injected into the input features, which are calculated via Equation (2):
$$H = X + \mathrm{Linear}(\mathrm{deg}). \tag{2}$$
  • Spatial-Prior Augmented Attention: The core of STE is a multi-head self-attention (MHSA) mechanism augmented by spatial priors. To capture both long-range semantic similarity and local spatial proximity, we map the adjacency matrix $A$ into proximity encodings and integrate them into the attention weights. The attention score between regions $r_i$ and $r_j$ is formulated as Equation (3):
    $$\alpha_{ij} = \mathrm{Softmax}_j \left( \frac{(W_Q H_i)^\top (W_K H_j)}{\sqrt{d}} + \beta \, \mathrm{Linear}(A) \right), \tag{3}$$
    where $W_Q$ and $W_K$ are learnable parameters, and $\beta$ is a hyperparameter. Unlike molecular graphs or social networks, which often exhibit complex, irregular topologies with high-order structural significance, urban region graphs are typically characterized by regular geometric arrangements and strong local spatial autocorrelation. According to Tobler’s First Law of Geography [28] (“near things are more related than distant things”), first-order proximity contains the most critical spatial information for urban profiling. While the self-attention mechanism inherently captures global semantic correlations (long-range dependencies), it lacks structural awareness. By injecting a lightweight linear bias derived from the adjacency matrix $A$, we explicitly introduce a local spatial inductive bias that prioritizes immediate neighborhood connectivity. This design effectively complements the global receptive field of Transformers without incurring the high computational overhead associated with complex eigen-decomposition or shortest-path encodings used in general graph transformers [29,30]. Furthermore, the simplicity of the linear bias acts as a regularizer, mitigating the risk of over-smoothing often observed in deep GNNs when aggregating high-order neighborhoods.
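The two spatial priors above can be sketched in a few lines of pure Python. This is a single-head illustration under stated assumptions, not the authors' implementation: `Q` and `K` stand for rows that have already been projected by the query and key matrices, and the learned linear map over the adjacency matrix is simplified to an element-wise `w * A[i][j] + b` with hypothetical scalars `w` and `b`.

```python
import math

def degree_centrality(A, eps=1e-8):
    """Equation (1): row sums of A scaled by the largest row sum."""
    row_sums = [sum(row) for row in A]
    denom = max(row_sums) + eps
    return [s / denom for s in row_sums]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def spatial_attention(Q, K, A, beta, w=1.0, b=0.0):
    """Single-head attention weights with an additive spatial bias.

    The bias term beta * (w * A[i][j] + b) is a simplified stand-in
    for the learned linear map over the adjacency matrix (assumption).
    """
    d = len(Q[0])
    weights = []
    for i in range(len(Q)):
        scores = []
        for j in range(len(K)):
            dot = sum(q * k for q, k in zip(Q[i], K[j]))
            scores.append(dot / math.sqrt(d) + beta * (w * A[i][j] + b))
        weights.append(softmax(scores))  # normalize over j
    return weights

A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.5],
     [0.0, 0.5, 0.0]]
deg = degree_centrality(A)
# Identical rows make the content scores equal, so only the bias differs.
Q = K = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
alpha = spatial_attention(Q, K, A, beta=2.0)
```

Because the content scores are identical here, region 0 attends more strongly to its neighbor (region 1) than to region 2, which shows the local inductive bias in isolation.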
The STE module consists of several consecutive layers stacked to capture high-order spatial and semantic dependencies. Formally, the transformation within the $l$-th layer is defined as Equations (4) and (5):
$$H'^{(l)} = \mathrm{MHSA}\left(\mathrm{LN}\left(H^{(l-1)}\right)\right) + H^{(l-1)}, \tag{4}$$
$$H^{(l)} = \mathrm{FFN}\left(\mathrm{LN}\left(H'^{(l)}\right)\right) + H'^{(l)}, \tag{5}$$
where $\mathrm{FFN}(\cdot)$ is a feed-forward block and $\mathrm{LN}(\cdot)$ is the layer normalization.
The resulting representation is refined via a lightweight multi-layer perceptron to perform feature projection and nonlinear abstraction. This process yields the final spatial-aware view-specific representation $Z_v \in \mathbb{R}^{N \times d}$, $v \in \{P, L, M\}$, which encapsulates both the global semantic context and the underlying topological structure of the urban region.

3.4.2. Fusion Module

To synthesize embeddings from heterogeneous views, we adopt the DAFusion mechanism [14]. It consists of:
  • ViewFusion: Learns adaptive weights for each view (POI, land-use, mobility) by calculating pairwise correlations, resulting in a fused representation: $Z = \sum_{v} \gamma_v \cdot Z_v$.
  • RegionFusion: Utilizes a Transformer-based structure to encode high-order correlations across different regional fused representations and finally generate the general region embeddings $E = \{e_1, e_2, \ldots, e_N\}$.
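The weighted sum in ViewFusion can be sketched as follows. Only the combination step is shown; the per-view `scores` are placeholder logits, and normalizing them with a softmax is an assumption of this sketch. In DAFusion the weights come from learned cross-view correlations, which are not modeled here.

```python
import math

def fuse_views(views, scores):
    """Z = sum_v gamma_v * Z_v, with gamma = softmax(scores).

    `scores` are hypothetical per-view logits standing in for the
    learned view weights.
    """
    m = max(scores.values())
    exps = {v: math.exp(s - m) for v, s in scores.items()}
    total = sum(exps.values())
    gamma = {v: e / total for v, e in exps.items()}
    names = list(views)
    n = len(views[names[0]])      # number of regions
    d = len(views[names[0]][0])   # embedding dimension
    Z = [[sum(gamma[v] * views[v][i][k] for v in names) for k in range(d)]
         for i in range(n)]
    return Z, gamma

# Toy view-specific representations for two regions.
views = {
    "poi":      [[1.0, 0.0], [0.0, 1.0]],
    "landuse":  [[0.0, 1.0], [1.0, 0.0]],
    "mobility": [[1.0, 1.0], [1.0, 1.0]],
}
Z, gamma = fuse_views(views, {"poi": 0.0, "landuse": 0.0, "mobility": 0.0})
```

With equal scores every view receives weight 1/3, so the fused representation is the plain average of the three views.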

3.5. Semantic-Guided Prompt Learning

We introduce a semantic-guided prompting mechanism via prefix tuning to bridge the gap between general representations and specific downstream tasks.

3.5.1. Semantic-Guided Prompt Generation

To align task semantics with region embeddings in a shared latent space, we design a lightweight generator. We first utilize BERT [31] to encode textual task descriptions (e.g., “User check-in behavior at a location is driven by the attraction of its Points of Interest, the broader functionality defined by its land use category, and the volume and origin of human mobility inflows. Popular destinations often have a specific POI profile and are embedded within areas of complementary land use.”). The BERT-output [CLS] token $T$ and the pre-trained region embeddings $E$ are projected into a unified semantic space, as shown in Equation (6):
$$E' = E W_E, \quad T' = T W_T, \tag{6}$$
where $W_E$ and $W_T$ are learnable projection matrices.
Subsequently, we calculate attention weights between task semantics and each region to perform weighted fusion via Equation (7), incorporating a residual connection to preserve the original regional profile expressed in Equation (8):
$$\alpha = \mathrm{Softmax}\left(\mathrm{Tanh}\left(E' + T'\right) W_a\right), \tag{7}$$
$$\tilde{S} = E' + \alpha T', \tag{8}$$
where $\mathrm{Tanh}(\cdot)$ is an activation function and $W_a$ is a learnable parameter matrix. After layer normalization and output projection, we obtain the final semantic-guided soft prompts $S$.
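The projection, attention, and residual steps can be sketched in pure Python as follows. Several choices here are assumptions for illustration: the attention parameter is reduced to a single vector `w_a`, the softmax is taken over regions, all matrices are small hand-written toy values, and the final layer normalization and output projection are omitted.

```python
import math

def matmul(X, W):
    """Plain matrix product for lists of rows."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)]
            for row in X]

def soft_prompts(E, t, W_E, W_T, w_a):
    """Projection, attention, and residual fusion (pre-normalization).

    E: N x d region embeddings; t: task vector (e.g., a [CLS] embedding);
    W_E: d x h, W_T: d_t x h; w_a: length-h vector (assumed shape).
    """
    E_p = matmul(E, W_E)       # projected regions, N x h
    t_p = matmul([t], W_T)[0]  # projected task vector, length h
    # One scalar score per region from the tanh-activated sum
    scores = [sum(math.tanh(e + tt) * a for e, tt, a in zip(row, t_p, w_a))
              for row in E_p]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [x / total for x in exps]  # softmax over regions (assumption)
    # Residual connection keeps the projected regional profile
    S = [[e + a * tt for e, tt in zip(row, t_p)]
         for row, a in zip(E_p, alpha)]
    return S, alpha

E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
t = [0.5, -0.5, 1.0]
W_E = [[1.0, 0.0], [0.0, 1.0]]
W_T = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_a = [1.0, 1.0]
S, alpha = soft_prompts(E, t, W_E, W_T, w_a)
```

Each row of `S` is the projected region embedding shifted toward the task vector by a region-specific amount, which is the sense in which the prompt depends on both the task and the region.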

3.5.2. Downstream Prediction

The soft prompts are concatenated with the original embeddings to construct task-aware region embeddings. A feed-forward network (FNN) is then employed for the final prediction, which is formulated in Equation (9):
$$\hat{Y} = \mathrm{FNN}\left(S \,\|\, E\right), \tag{9}$$
where $\|$ denotes concatenation.

3.6. Model Training

RE-SAT is trained in two distinct phases:
  • Multi-view Representation Learning. The pseudo-code for this stage is shown in Algorithm 1. We employ a self-supervised multi-task reconstruction objective $\mathcal{L}_{Task}$ to ensure embedding quality:
  • Regional Similarity Reconstruction: To ensure that the universal embedding $E$ effectively preserves view-specific semantic structures, we employ specialized linear projection heads to map $E$ back into the POI and land-use subspaces, denoted as $P'$ and $L'$, respectively. We then formulate a reconstruction objective that constrains the pairwise inner products of these projected embeddings to approximate the empirical region similarity matrices $A^{poi}$ and $A^{landuse}$. The learning objectives $\mathcal{L}_{poi}$ and $\mathcal{L}_{landuse}$ are defined as Equations (10) and (11):
    $$\mathcal{L}_{poi} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( A^{poi}_{i,j} - (P'_i)^\top \cdot P'_j \right), \tag{10}$$
    $$\mathcal{L}_{landuse} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( A^{landuse}_{i,j} - (L'_i)^\top \cdot L'_j \right). \tag{11}$$
  • Mobility Distribution Reconstruction: To effectively characterize urban mobility dynamics, we aim to approximate the empirical transition probability distributions between regions. Specifically, the universal embedding $E$ is projected into two distinct latent subspaces: a source space $M^S$ and a destination space $M^D$. We compute two transition probabilities from these matrices. For an origin region $i$, the outbound transition probability to a destination region $j$ is defined as Equation (12):
    $$\widehat{\mathrm{Prob}}_s(r_j \mid r_i) = \frac{\exp\left((M^S_i)^\top \cdot M^D_j\right)}{\sum_{k \in \mathcal{R}} \exp\left((M^S_i)^\top \cdot M^D_k\right)}. \tag{12}$$
    Conversely, for a destination region $j$, the inbound transition probability originating from region $i$ is formulated as Equation (13):
    $$\widehat{\mathrm{Prob}}_d(r_i \mid r_j) = \frac{\exp\left((M^S_i)^\top \cdot M^D_j\right)}{\sum_{k \in \mathcal{R}} \exp\left((M^S_k)^\top \cdot M^D_j\right)}. \tag{13}$$
    The loss $\mathcal{L}_{mobility}$ is defined as the KL-divergence between the predicted and ground truth probability distributions, as shown in Equation (14):
$$\mathcal{L}_{mobility} = -\sum_{i=1}^{N} \sum_{j=1}^{N} \left( \mathrm{Prob}_s(r_j \mid r_i) \log \widehat{\mathrm{Prob}}_s(r_j \mid r_i) + \mathrm{Prob}_d(r_j \mid r_i) \log \widehat{\mathrm{Prob}}_d(r_j \mid r_i) \right). \tag{14}$$
Equation (15) presents the total multi-task reconstruction objective:
$$\mathcal{L}_{Task} = \lambda_1 \mathcal{L}_{poi} + \lambda_2 \mathcal{L}_{landuse} + \lambda_3 \mathcal{L}_{mobility}, \tag{15}$$
where $\lambda_1, \lambda_2, \lambda_3$ are learnable parameters that automatically balance the three loss functions, with $\lambda_1 + \lambda_2 + \lambda_3 = 1$.
  • Semantic-Guided Prompt Learning. The pseudo-code for this stage is shown in Algorithm 2. We adapt the frozen pre-trained embeddings to specific tasks. All pre-trained parameters remain frozen to preserve universal urban knowledge. This enables the model to effectively specialize the general embeddings for specific task scenarios without catastrophic forgetting. We optimize the task-specific soft prompts and the prediction head using a mean squared error (MSE) loss, as expressed in Equation (16):
    $$\mathcal{L}_{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2, \tag{16}$$
    where $\hat{y}_i \in \hat{Y}$ is the prediction and $y_i$ is the ground truth for region $r_i$.
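As an illustration of the mobility reconstruction term in Equations (12) and (14), the following pure-Python sketch computes predicted outbound transition probabilities from source and destination embeddings and compares them with empirical flows. Only the outbound half is shown; the inbound half would normalize over sources instead. The loss is written as a cross-entropy, which differs from the KL divergence only by a term that does not depend on the model. The toy flows and embeddings are made up for the example.

```python
import math

def transition_probs(M_S, M_D):
    """Softmax over destinations of source/destination inner products."""
    probs = []
    for src in M_S:
        logits = [sum(a * b for a, b in zip(src, dst)) for dst in M_D]
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs

def empirical_probs(flows):
    """Row-normalize a flow matrix into outbound probabilities."""
    out = []
    for row in flows:
        s = sum(row)
        out.append([f / s if s > 0 else 0.0 for f in row])
    return out

def cross_entropy(P, P_hat, eps=1e-12):
    """-sum_ij P_ij * log(P_hat_ij); eps guards against log(0)."""
    return -sum(p * math.log(q + eps)
                for rp, rq in zip(P, P_hat)
                for p, q in zip(rp, rq))

# Hypothetical trip counts between three regions and toy embeddings.
flows = [[0, 8, 2], [5, 0, 5], [1, 9, 0]]
M_S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
M_D = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
P = empirical_probs(flows)
P_hat = transition_probs(M_S, M_D)
loss = cross_entropy(P, P_hat)
```

Minimizing this quantity with respect to the embeddings pushes each predicted row of `P_hat` toward the corresponding empirical row of `P`.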
Algorithm 1 Multi-View Representation Learning (Stage 1)
Input: Feature matrices: POI $P$, land-use $L$, mobility $M$, and proximity matrix $A$.
Output: Pre-trained universal region embeddings $E$.
Initialize parameters of STE and loss weights $\lambda_i$;
for each epoch do
    for each view $X \in \{P, L, M\}$ do
        Extract view-specific features $Z_v$ using STE with $A$: $Z_v \leftarrow \mathrm{STE}(X, A)$;
    end for
    Compute fused region embeddings via DAFusion: $E = \sum_{v \in \{P, L, M\}} \gamma_v \cdot Z_v$;
    Calculate reconstruction losses:
        $\mathcal{L}_{poi} \leftarrow$ Compute POI similarity reconstruction loss based on Equation (10);
        $\mathcal{L}_{landuse} \leftarrow$ Compute land-use similarity reconstruction loss based on Equation (11);
        $\mathcal{L}_{mobility} \leftarrow$ Compute KL-divergence of flow distribution based on Equation (14);
    Compute total loss: $\mathcal{L}_{Task} = \lambda_1 \mathcal{L}_{poi} + \lambda_2 \mathcal{L}_{landuse} + \lambda_3 \mathcal{L}_{mobility}$;
    Update parameters via backpropagation;
end for
return Optimal universal embeddings $E$.
Algorithm 2 Semantic-Guided Prompt Learning (Stage 2)
Input: Frozen embeddings $E$, downstream task description $Text$, ground truth $Y$.
Output: Task-specific prediction $\hat{Y}$.
Initialize PromptGenerator and Task Head (FNN);
for each epoch do
    Encode task text: $T = \mathrm{BERT}(Text)$;
    Generate soft prompts: $S \leftarrow \mathrm{PromptGenerator}(E, T)$;
    Concatenate soft prompts and embeddings: $X_{task} = [S \,\|\, E]$;
    Final prediction: $\hat{Y} = \mathrm{FNN}(X_{task})$;
    $\mathcal{L}_{MSE} \leftarrow$ Compute MSE loss based on Equation (16);
    Update PromptGenerator, $S$ and Task Head;
end for
return Prediction results $\hat{Y}$.

4. Results

In this section, we first provide a detailed exposition of the datasets and experimental configurations. Subsequently, we evaluate the efficacy of the learned region representations through four critical downstream tasks—check-in prediction, crime prediction, service call estimation, and population prediction—conducted across two major metropolitan areas: NYC and CHI.

4.1. Datasets

In this paper, we utilize real-world datasets collected from NYC and CHI. The geographical granularity for NYC is defined by the census tracts of Manhattan, while CHI is delineated by its official community area boundaries. We leverage multi-source urban features, including POIs, land-use categories, and taxi trajectory records, to capture the foundational semantics and mobility patterns of each region during the first stage of training. To assess the representational power of the embeddings, we gather longitudinal records comprising criminal incidents, user check-ins, public service requests, and population statistics. Detailed statistics are summarized in Table 1.

4.2. Experiment Setup

Baselines. We compare the performance of RE-SAT with several state-of-the-art urban region embedding methods.
  • MVURE [13]: This work introduces a multi-view joint graph representation learning framework that leverages graph attention networks to adaptively fuse human mobility patterns with multi-dimensional regional attributes, such as POIs and check-in data.
  • MGFN [23]: MGFN constructs a comprehensive multi-graph architecture, encompassing mobility flow and spatio-temporal similarity graphs, and employs a cross-modal message-passing mechanism to capture intricate dynamic interactions and spatial dependencies between urban regions.
  • ReCP [37]: This framework proposes a multi-view contrastive prediction paradigm that aligns heterogeneous views—including mobility, geographic proximity, and POIs—within a shared latent space to learn robust and generalizable region representations.
  • HAFusion [14]: HAFusion develops a hybrid attention-based fusion mechanism that explicitly models both intra-view semantic consistency and inter-view complementary correlations, thereby enhancing the representation quality of heterogeneous urban data.
  • HREP [18]: As a pioneering effort in prompt-based regional modeling, HREP utilizes a prefix-tuning strategy to adapt pre-trained region representations to diverse downstream tasks while keeping the backbone parameters frozen.
  • FlexiReg [19]: FlexiReg introduces a flexible prompt enhancement module that synergistically integrates textual instructions with multimodal features, such as street-view imagery, to generate augmented prompts that bolster the adaptability of region embeddings.
  • UrbanCLIP [16]: Drawing inspiration from the contrastive language-image pre-training paradigm, UrbanCLIP achieves cross-modal semantic alignment between satellite imagery and rich textual descriptions through prompt engineering to construct fine-grained urban region profiles.
Implementation Details. RE-SAT is trained in two sequential phases of 3000 epochs each to ensure convergence. The learning rate is initialized at $5 \times 10^{-4}$ for the pre-training phase and set to 0.5 for the semantic-guided prompt learning phase. The STE module has a depth of three layers. For the spatial weighting hyperparameter, we set $\beta = 0.4$ for the NYC dataset and $\beta = 0.9$ for the CHI dataset. Following the settings in [18,19], the dimensionality of both the universal region embeddings and the semantic soft prompts is fixed at $d = 144$. All hyperparameters are determined through an exhaustive grid search on the validation set.
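For readers who want to reproduce the setup, the reported settings can be gathered into one configuration object. The sketch below is only a convenience structure: the class name `ReSatConfig` and its field names are hypothetical, and the pre-training learning rate is taken to be 5e-4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReSatConfig:
    """Hyperparameters as reported in the implementation details (names are illustrative)."""
    city: str
    beta: float                    # spatial weighting hyperparameter, city-specific
    epochs_per_stage: int = 3000   # both stages use the same number of epochs
    lr_pretrain: float = 5e-4      # Stage 1 learning rate (assumed exponent of -4)
    lr_prompt: float = 0.5         # Stage 2 learning rate
    ste_layers: int = 3            # depth of the STE module
    embed_dim: int = 144           # size of region embeddings and soft prompts


CONFIGS = {
    "NYC": ReSatConfig(city="NYC", beta=0.4),
    "CHI": ReSatConfig(city="CHI", beta=0.9),
}

print(CONFIGS["CHI"])
```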
Evaluation Metrics. To rigorously evaluate the predictive performance of the learned representations across various downstream tasks, we adopt three widely recognized metrics consistent with prior works [14,18]:
  • Mean Absolute Error (MAE): Measures the average magnitude of the errors in a set of predictions, without considering their direction.
  • Root Mean Square Error (RMSE): Provides a measure of the square root of the average of squared differences between prediction and actual observation, penalizing larger errors more heavily.
  • Coefficient of Determination ($R^2$): Represents the proportion of the variance in the dependent variable that is explained by the independent variables in the model.
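The three metrics follow their standard definitions, which can be written out in a few lines of plain Python. The function names and the toy values below are illustrative only and are not taken from the paper's code.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r2_score(y_true, y_pred))
```

Lower is better for MAE and RMSE, while higher is better for $R^2$, which is why the improvement percentages for $R^2$ are computed in the opposite direction.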

4.3. Overall Performance

We compare RE-SAT with a suite of state-of-the-art baselines on four downstream tasks in NYC and CHI, with the results summarized in Table 2. Statistical significance tests indicate that RE-SAT outperforms the strongest baseline, FlexiReg, by a significant margin ($p < 0.05$), establishing a new state of the art.
A first observation is that methods integrating multiple data views generally outperform single-view approaches. This result highlights the inherent complexity and heterogeneity of urban regions, whose semantics cannot be fully captured from a single perspective. For example, MGFN relies solely on human mobility flows and therefore neglects rich intra-region attributes, while UrbanCLIP is restricted to satellite imagery that mainly reflects external visual characteristics and fails to model deep semantic interactions or inter-regional connectivity. In contrast, models such as ReCP and HAFusion achieve better performance by jointly modeling functional and semantic information from multiple complementary views.
We further observe that prompt-tuning-based methods generally surpass traditional single-stage learning approaches. This advantage can be attributed to the distribution gap between pre-training pretext tasks and downstream applications, which often limits the direct transferability of general region representations. By enabling parameter-efficient task adaptation, prefix-tuning effectively bridges this gap and yields substantial performance improvements with minimal computational overhead.
Most importantly, RE-SAT consistently outperforms all competing baselines across all tasks and datasets, achieving a maximum relative improvement of 12.2%. This superiority stems from its ability to jointly model global semantic correlations and local spatial structures across multi-source urban data, while explicitly incorporating task-specific semantics through semantic-guided prompt learning. Compared with the strongest competitor, RE-SAT achieves average improvements of 3.9% and 5.6% on the NYC and CHI datasets, respectively. The observed gains are further validated by t-test results, confirming their statistical significance ($p < 0.05$).
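The text does not spell out how the relative improvement is computed, but the percentages in Table 2 are consistent with the usual definition, shown here with two worked values from the NYC check-in task (RE-SAT against FlexiReg). The symbols $m_{\text{base}}$ and $m_{\text{ours}}$ are introduced only for this illustration.

```latex
% Error metrics (MAE, RMSE): lower is better
\[
\mathrm{Imp}_{\downarrow} = \frac{m_{\text{base}} - m_{\text{ours}}}{m_{\text{base}}} \times 100\%
\]
% Goodness of fit (R^2): higher is better
\[
\mathrm{Imp}_{R^2} = \frac{m_{\text{ours}} - m_{\text{base}}}{m_{\text{base}}} \times 100\%
\]
% Worked examples, NYC check-in prediction
\[
\mathrm{RMSE:}\quad \frac{309.0 - 271.3}{309.0} \times 100\% \approx 12.2\%,
\qquad
R^2\mathrm{:}\quad \frac{0.885 - 0.850}{0.850} \times 100\% \approx 4.1\%
\]
```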

4.4. Ablation Study

To investigate the significance of modeling local structural dependencies, we conduct an ablation study by removing the spatial-aware components (specifically, the connectivity and proximity encodings) from the STE module. This effectively reduces the encoder to a vanilla Transformer architecture. As evidenced by the results in Table 3, RE-SAT with spatial awareness consistently outperforms its counterpart. Compared with the "w/o Spatial-aware" variant, RE-SAT achieves average performance gains of 6.3% in check-in prediction, 1.2% in crime prediction, 4.3% in service call estimation, and 6.5% in population prediction. We attribute this advantage to the model's ability to explicitly characterize and balance local structure with global correlations, thereby capturing the inherent spatial autocorrelation of urban regions more effectively.
Furthermore, to validate whether the semantic-guided prompt learning module effectively aligns universal region representations with task-specific semantics, we adopt a random initialization strategy for soft prompts, following the paradigm in HREP, rather than using semantic-guided generation. The empirical results demonstrate that the semantic-guided variant yields significantly enhanced performance. Compared to the randomly initialized version, RE-SAT exhibits average improvements of 5.9%, 2.4%, 4.1%, and 7.8% across the four downstream tasks, respectively. This performance gap highlights that random initialization fails to encapsulate the nuanced differences between diverse tasks. In contrast, our semantic-guided mechanism strengthens the task-awareness of region embeddings, facilitating more robust adaptation of universal representations to various downstream applications.
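The average gains quoted for the check-in task can be reproduced from Table 3(a) by averaging the relative gain over MAE, RMSE and $R^2$. That averaging scheme is an assumption inferred from the numbers rather than something the text states, and the function names below are hypothetical.

```python
from typing import Dict

# Direction of each metric: True means lower values are better.
LOWER_IS_BETTER = {"MAE": True, "RMSE": True, "R2": False}


def relative_gain(full: float, ablated: float, lower_is_better: bool) -> float:
    """Relative gain (in %) of the full model over an ablated variant."""
    if lower_is_better:
        return (ablated - full) / ablated * 100.0
    return (full - ablated) / ablated * 100.0


def average_gain(full: Dict[str, float], ablated: Dict[str, float]) -> float:
    """Mean of the per-metric relative gains."""
    gains = [
        relative_gain(full[m], ablated[m], LOWER_IS_BETTER[m])
        for m in LOWER_IS_BETTER
    ]
    return sum(gains) / len(gains)


# Check-in prediction on NYC, values from Table 3(a).
full = {"MAE": 191.4, "RMSE": 271.3, "R2": 0.885}
wo_spatial = {"MAE": 199.1, "RMSE": 305.1, "R2": 0.852}
wo_prompt = {"MAE": 198.1, "RMSE": 306.4, "R2": 0.860}

print(round(average_gain(full, wo_spatial), 1))  # 6.3
print(round(average_gain(full, wo_prompt), 1))   # 5.9
```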

4.5. Cross-City Transferability Analysis

To further evaluate the generalization capability of the learned representations across different spatial contexts, we conducted a cross-city transferability experiment. Specifically, we pre-trained the multi-view representation learning backbone on NYC. The frozen encoder weights were then directly transferred to CHI using a zero-padding strategy to align the spatial dimensions. During the downstream adaptation on CHI, we solely trained the lightweight semantic-guided prompt module without fine-tuning the stage 1 backbone. As shown in Table 4, the transferred model, denoted as RE-SAT (NYC → CHI), exhibits remarkable robustness. While experiencing a minor, expected performance drop compared to the RE-SAT model fully trained on CHI, it still significantly outperforms the locally trained HREP and UrbanCLIP baselines across all downstream tasks.
Remarkably, in the population prediction task, RE-SAT (NYC → CHI) achieves an $R^2$ of 0.718, surpassing the strongest locally trained baseline, FlexiReg (0.698). These findings indicate that the STE module learns transferable spatial-semantic correlations that extend beyond specific city boundaries. By achieving competitive performance through prompting alone, without retraining the encoder, RE-SAT emerges as an efficient and generalizable framework for cross-city urban analytics.
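The zero-padding step is described only at a high level. One way to read it, sketched below under explicit assumptions, is to place the smaller city's square spatial matrix (77 regions for CHI, per Table 1) in the top-left corner of a zero matrix sized for the source city (180 regions for NYC) and to keep a mask marking the real regions. The helper names `zero_pad_square` and `region_mask` are hypothetical, and the authors may pad differently (for example, padding feature matrices rather than the proximity matrix).

```python
from typing import List

Matrix = List[List[float]]


def zero_pad_square(matrix: Matrix, target_size: int) -> Matrix:
    """Embed an n x n matrix in the top-left corner of a target_size x target_size zero matrix."""
    n = len(matrix)
    if n > target_size:
        raise ValueError("target_size must be at least the original size")
    padded = [[0.0] * target_size for _ in range(target_size)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            padded[i][j] = value
    return padded


def region_mask(n_real: int, target_size: int) -> List[bool]:
    """True for real regions, False for padded placeholder regions."""
    return [i < n_real for i in range(target_size)]


N_NYC, N_CHI = 180, 77  # region counts from Table 1

# Toy CHI proximity matrix: each region is adjacent only to itself.
chi_proximity = [[1.0 if i == j else 0.0 for j in range(N_CHI)] for i in range(N_CHI)]

padded = zero_pad_square(chi_proximity, N_NYC)
mask = region_mask(N_CHI, N_NYC)
print(len(padded), len(padded[0]), sum(mask))
```

A mask of this kind would let downstream losses and metrics ignore the placeholder rows, so that only the 77 real regions contribute to the reported scores.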

4.6. Parameter Analysis

Spatial Inductive Bias $\beta$. The parameter $\beta$ serves as the spatial inductive bias, governing the spatial awareness of the STE module. Specifically, a larger $\beta$ encourages the model to emphasize local structural dependencies, whereas a smaller $\beta$ shifts the focus toward capturing long-range regional correlations. To investigate the sensitivity of model performance to this parameter, we varied $\beta$ within the set $\{0.2, 0.3, 0.4, 0.5, 0.6\}$ using the NYC dataset. As illustrated in Figure 2, for the check-in prediction task, performance improves as $\beta$ increases from 0.2, peaks at $\beta = 0.5$, and then declines. For crime prediction, service call estimation, and population prediction, the best performance is consistently observed at $\beta = 0.4$. Based on these observations, we set $\beta = 0.4$ for NYC, which provides a robust trade-off between local and global spatial modeling and yields the best performance on the majority of downstream tasks.
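To build intuition for why a larger $\beta$ favors local structure, the toy example below adds a distance penalty scaled by $\beta$ to the attention logits of one query region, in the spirit of the graph Transformer biases in [29]. This is an illustrative form chosen for the sketch, not the formulation used by RE-SAT, which is defined in the method section and may differ. The function names and the toy distances are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_row(content_scores: List[float],
                         distances: List[float],
                         beta: float) -> List[float]:
    """Attention weights for one query when a distance penalty of strength beta is added.

    Assumed form: logit_j = content_j - beta * distance_j.
    """
    logits = [s - beta * d for s, d in zip(content_scores, distances)]
    return softmax(logits)


# One query region attending to three others with equal content similarity,
# located at increasing (toy) distances.
content = [0.0, 0.0, 0.0]
distances = [1.0, 2.0, 5.0]

weak = biased_attention_row(content, distances, beta=0.2)
strong = biased_attention_row(content, distances, beta=0.6)
print([round(w, 3) for w in weak])
print([round(w, 3) for w in strong])
```

With the larger value, more attention mass moves to the nearest region and the farthest region is almost ignored, which mirrors the local versus long-range trade-off described above.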
Impact of the Number of STE Layers (#layer). To investigate the influence of STE depth on model efficacy, we conducted a sensitivity analysis by varying the number of stacked spatial-aware Transformer layers within the set $\{2, 3, 4\}$ using the NYC dataset. As illustrated in Figure 3, RE-SAT exhibits stable performance across all four downstream tasks (check-in prediction, crime prediction, service call estimation, and population prediction). This consistency underscores the robustness of the STE module in capturing complex spatial dependencies under moderate variations in architectural depth. Notably, the configuration with #layer = 3 achieves a marginal performance lead over the other settings, suggesting a good balance between model capacity and generalization. Consequently, we fix the number of STE layers at #layer = 3 for all subsequent experiments.

4.7. Running Time of Downstream Task

To investigate the computational overhead incurred during the downstream task learning phase, Table 5 presents a comparative analysis of the training duration for RE-SAT alongside other representative two-stage frameworks. The empirical results demonstrate that RE-SAT achieves the shortest adaptation time, thereby validating the lightweight nature of our proposed prompt learning paradigm. Specifically, although HREP utilizes a seemingly simpler random initialization strategy, it necessitates an extensive training regimen of up to 6000 epochs to ensure the quality of prompt tuning, which significantly increases the overall execution latency. Furthermore, UrbanCLIP incurs substantial costs due to its auxiliary contrastive learning objectives, while FlexiReg introduces significant overhead by incorporating complex multimodal features (e.g., imagery and text) for prompt enhancement. In contrast, our semantic-guided mechanism achieves a superior performance–efficiency trade-off, delivering higher predictive accuracy while maintaining a significantly lower computational footprint.

5. Discussion

The proposed RE-SAT framework demonstrates substantial improvements in urban region embedding; however, several limitations must be acknowledged, and they point to opportunities for future research.
From a practical standpoint, the primary advantage of RE-SAT lies in its parameter-efficient adaptation. By leveraging semantic-guided prompting, the model circumvents the prohibitive computational costs of retraining large-scale encoders for every new task, offering a scalable decision-support mechanism for urban planners and policymakers. Nevertheless, this efficiency is inherently bounded by data dependency. The generation of high-quality universal embeddings relies heavily on the availability of multi-view data, particularly human mobility trajectories, which are often proprietary, costly to procure, and subject to strict privacy regulations. In data-scarce scenarios where certain modalities are unavailable, the representational capacity of the model inevitably degrades. Developing modality-agnostic architectures or utilizing cross-modal imputation techniques to maintain robustness under severe missing data conditions remains a critical direction for future research.
Methodologically, the current instantiation of the spatial-aware Transformer encoder operates primarily in a transductive setting. The spatial priors injected into the attention mechanism assume a fixed spatial topology defined by an N × N adjacency matrix. Although we effectively mitigated the dimensional mismatch during cross-city transfer via a zero-padding strategy, this approach serves as a heuristic workaround rather than a fundamental solution. To achieve seamless and universal cross-city transferability, transitioning from transductive attention mechanisms to inductive graph representation paradigms—capable of generalizing across urban graphs of arbitrarily varying sizes—will be a primary focus of our subsequent work.

6. Conclusions

This paper presents RE-SAT, a novel multi-view representation learning and prompt-tuning framework designed to generate robust and task-adaptive embeddings for urban regions. By leveraging heterogeneous data sources, including POIs, land-use patterns, and human mobility trajectories, we construct comprehensive regional profiles that capture multifaceted urban dynamics. A key innovation is the spatial-aware Transformer encoder, which explicitly models and balances long-range global correlations and fine-grained local structurality, thereby producing highly representative embeddings. Furthermore, we propose a semantic-guided prompt learning mechanism that utilizes task-specific textual descriptions to bridge the gap between universal region representations and downstream objectives. This mechanism facilitates the seamless adaptation of pre-trained embeddings to diverse tasks. Extensive experiments conducted on two real-world datasets across four downstream tasks—check-in prediction, crime prediction, service call estimation, and population prediction—demonstrate that RE-SAT consistently outperforms state-of-the-art regional embedding baselines.

Author Contributions

Conceptualization, G.D., Z.G. and B.Z.; Methodology, G.D., Z.G. and B.Z.; Software, Z.G.; Validation, Z.G.; Formal analysis, Z.G.; Resources, Z.G. and B.Z.; Writing—original draft, Z.G.; Writing—review & editing, G.D., B.Z. and J.C.; Visualization, Z.G.; Supervision, G.D., B.Z., X.F., L.D., J.C. and H.H.; Project administration, G.D., B.Z., X.F., L.D., J.C. and H.H.; Funding acquisition, B.Z., X.F., L.D., J.C. and H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (No. 2025YFE0103100), Shenzhen Science and Technology Program (No. JCYJ20240813113300001, 20231127180406001) and Natural Science Foundation of Top Talent of SZTU (grant no. GDRC202518).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
| Abbreviation | Definition |
|---|---|
| POIs | Points of Interest |
| UrbanMMCL | Urban Multi-Modal and Multi-View Dual Contrastive Learning |
| MVURE | Multi-View Joint Representation Learning Framework for Urban Region Embedding |
| HAFusion | Hybrid Attentive Fusion |
| RegionDCL | Region Dual Contrastive Learning |
| UrbanCLIP | Urban Region Profiling with Contrastive Language-Image Pretraining |
| CityFM | City Foundation Models |
| HREP | Heterogeneous Region Embedding with Prompt Learning |
| FlexiReg | Flexible Urban Region Representation Learning |
| HGI | Hierarchical Graph Infomax |
| MGFN | Multi-Graph Fusion Networks |
| Region2Vec | Multi-Graph Representation Learning Framework for Urban Region Profiling |
| CGAP | Coarsened Graph Attention Pooling |
| GNNs | Graph Neural Networks |
| GPF | Graph Prompt Feature |
| EdgePrompt | Edge Prompt Tuning |
| NYC | New York City |
| CHI | Chicago |

References

  1. Zhang, L.; Long, C.; Cong, G. Region embedding with intra and inter-view contrastive learning. IEEE Trans. Knowl. Data Eng. 2022, 35, 9031–9036.
  2. Hao, X.; Chen, W.; Yan, Y.; Zhong, S.; Wang, K.; Wen, Q.; Liang, Y. UrbanVLP: Multi-granularity vision-language pretraining for urban socioeconomic indicator prediction. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2025; Volume 39, pp. 28061–28069.
  3. Xiao, C.; Zhou, J.; Xiao, Y.; Huang, J.; Xiong, H. ReFound: Crafting a foundation model for urban region understanding upon language and visual foundations. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2024; pp. 3527–3538.
  4. Xu, Y.; Deng, Z.; Zhu, T.; Han, L.; Sun, L.; Chen, Z.; Sheng, H. Generating evolving region embedding with memory-based graph for dynamic urban sensing. Inf. Fusion 2025, 124, 103341.
  5. Chan, W.; Ren, Q. Region-wise attentive multi-view representation learning for urban region embedding. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management; ACM: New York, NY, USA, 2023; pp. 3763–3767.
  6. Kim, N.; Yoon, Y. Effective urban region representation learning using heterogeneous urban graph attention network (HUGAT). IEEE Access 2025, 13, 102602–102612.
  7. Zhang, Y.; Fu, Y.; Wang, P.; Li, X.; Zheng, Y. Unifying inter-region autocorrelation and intra-region structures for spatial embedding via collective adversarial learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; ACM: New York, NY, USA, 2019; pp. 1700–1708.
  8. Fu, Y.; Wang, P.; Du, J.; Wu, L.; Li, X. Efficient region embedding with multi-view spatial networks: A perspective of locality-constrained spatial autocorrelations. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2019; Volume 33, pp. 906–913.
  9. Dai, G.; Yi, W.; Cao, J.; Gong, Z.; Fu, X.; Zhang, B. CRRL: Contrastive Region Relevance Learning Framework for Cross-city Traffic Prediction. Inf. Fusion 2025, 122, 103215.
  10. Jenkins, P.; Farag, A.; Wang, S.; Li, Z. Unsupervised representation learning of spatial data via multimodal embedding. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management; ACM: New York, NY, USA, 2019; pp. 1993–2002.
  11. Luo, Y.; Chung, F.l.; Chen, K. Urban region profiling via multi-graph representation learning. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management; ACM: New York, NY, USA, 2022; pp. 4294–4298.
  12. Cao, J.; Chen, J.; Wang, X.; Huang, W.; Chen, D.; Zhao, T.; Tu, W.; Li, Q. UrbanMMCL: Urban Region Representations via Multi-Modal and Multi-Graph Self-Supervised Contrastive Learning. ISPRS J. Photogramm. Remote Sens. 2026, 232, 75–93.
  13. Zhang, M.; Li, T.; Li, Y.; Hui, P. Multi-view joint graph representation learning for urban region embedding. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence; IJCAI: Montreal, AC, Canada, 2020; pp. 4431–4437.
  14. Sun, F.; Qi, J.; Chang, Y.; Fan, X.; Karunasekera, S.; Tanin, E. Urban region representation learning with attentive fusion. In Proceedings of the 2024 IEEE 40th International Conference on Data Engineering (ICDE); IEEE: Piscataway, NJ, USA, 2024; pp. 4409–4421.
  15. Li, Y.; Huang, W.; Cong, G.; Wang, H.; Wang, Z. Urban region representation learning with openstreetmap building footprints. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2023; pp. 1363–1373.
  16. Yan, Y.; Wen, H.; Zhong, S.; Chen, W.; Chen, H.; Wen, Q.; Zimmermann, R.; Liang, Y. UrbanCLIP: Learning text-enhanced urban region profiling with contrastive language-image pretraining from the web. In Proceedings of the ACM Web Conference 2024; ACM: New York, NY, USA, 2024; pp. 4006–4017.
  17. Balsebre, P.; Huang, W.; Cong, G.; Li, Y. City foundation models for learning general purpose representations from openstreetmap. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management; ACM: New York, NY, USA, 2024; pp. 87–97.
  18. Zhou, S.; He, D.; Chen, L.; Shang, S.; Han, P. Heterogeneous region embedding with prompt learning. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2023; Volume 37, pp. 4981–4989.
  19. Sun, F.; Chang, Y.; Tanin, E.; Karunasekera, S.; Qi, J. FlexiReg: Flexible Urban Region Representation Learning. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining; V. 2; ACM: New York, NY, USA, 2025; pp. 2702–2713.
  20. Wang, H.; Li, Z. Region representation learning via mobility flow. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management; ACM: New York, NY, USA, 2017; pp. 237–246.
  21. Yao, Z.; Fu, Y.; Liu, B.; Hu, W.; Xiong, H. Representing urban functions through zone embedding with human mobility patterns. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18); IJCAI: Montreal, AC, Canada, 2018.
  22. Huang, W.; Zhang, D.; Mai, G.; Guo, X.; Cui, L. Learning urban region representations with POIs and hierarchical graph infomax. ISPRS J. Photogramm. Remote Sens. 2023, 196, 134–145.
  23. Wu, S.; Yan, X.; Fan, X.; Pan, S.; Zhu, S.; Zheng, C.; Cheng, M.; Wang, C. Multi-graph fusion networks for urban region embedding. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22); IJCAI: Montreal, AC, Canada, 2022; pp. 321–327.
  24. Xu, Z.; Zhou, X. CGAP: Urban region representation learning with coarsened graph attention pooling. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence; IJCAI: Montreal, AC, Canada, 2024; pp. 7518–7526.
  25. Fang, T.; Zhang, Y.; Yang, Y.; Wang, C.; Chen, L. Universal prompt tuning for graph neural networks. Adv. Neural Inf. Process. Syst. 2023, 36, 52464–52489.
  26. Fu, X.; He, Y.; Li, J. Edge Prompt Tuning for Graph Neural Networks. In Proceedings of the Thirteenth International Conference on Learning Representations; ICLR: Singapore, 2025; Available online: https://openreview.net/forum?id=92vMaHotTM (accessed on 16 March 2026).
  27. Sun, X.; Cheng, H.; Li, J.; Liu, B.; Guan, J. All in One: Multi-Task Prompting for Graph Neural Networks. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; KDD ’23; ACM: New York, NY, USA, 2023; pp. 2120–2131.
  28. Tobler, W.R. A computer movie simulating urban growth in the Detroit region. Econ. Geogr. 1970, 46, 234–240.
  29. Ying, C.; Cai, T.; Luo, S.; Zheng, S.; Ke, G.; He, D.; Shen, Y.; Liu, T.Y. Do Transformers Really Perform Badly for Graph Representation? In Proceedings of the Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2021; Volume 34, pp. 28877–28888. Available online: https://proceedings.neurips.cc/paper/2021/hash/f1c1592588411002af340cbaedd6fc33-Abstract.html (accessed on 16 March 2026).
  30. Hu, Z.; Dong, Y.; Wang, K.; Sun, Y. Heterogeneous graph transformer. In Proceedings of the Web Conference 2020; ACM: New York, NY, USA, 2020; pp. 2704–2710.
  31. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Long and Short Papers; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 1, pp. 4171–4186.
  32. NYC Government. NYC Open Data. 2025. Available online: https://opendata.cityofnewyork.us/ (accessed on 28 June 2025).
  33. Chicago Government. Chicago Data Portal. 2025. Available online: https://data.cityofchicago.org/ (accessed on 28 June 2025).
  34. OpenStreetMap. 2025. Available online: https://www.openstreetmap.org/ (accessed on 28 June 2025).
  35. Foursquare. 2025. Available online: https://foursquare.com/ (accessed on 28 June 2025).
  36. WorldPop. 2025. Available online: https://www.worldpop.org/ (accessed on 28 June 2025).
  37. Li, Z.; Huang, W.; Zhao, K.; Yang, M.; Gong, Y.; Chen, M. Urban region embedding via multi-view contrastive prediction. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2024; Volume 38, pp. 8724–8732.
Figure 1. Overall framework of the proposed RE-SAT.
Figure 2. Parameter analysis of $\beta$ on NYC.
Figure 3. Parameter analysis of #layer on NYC.
Table 1. Dataset Statistics.
| Dataset | NYC | CHI | Source |
|---|---|---|---|
| Regions | 180 | 77 | Open Portal [32,33] |
| POIs | 24,496 | 57,891 | OpenStreetMap [34] |
| POI categories | 26 | 26 | OpenStreetMap [34] |
| Land-use categories | 11 | 12 | OpenStreetMap [34] |
| Taxi trips | 10,953,879 | 3,381,807 | Open Portal [32,33] |
| Crime records | 35,335 | 18,200 | Open Portal [32,33] |
| Check-in counts | 106,902 | 167,232 | Foursquare [35] |
| Service call records | 516,187 | 24,350 | Open Portal [32,33] |
| Population counts | 1,540,692 | 2,508,984 | WorldPop [36] |
Table 2. Performance comparison on four downstream tasks. Best: bold; second: underline. * indicates a statistically significant improvement with p-value < 0.05. The improvement is calculated against the strongest baseline, FlexiReg.

Check-in
| Method | NYC MAE ↓ | NYC RMSE ↓ | NYC $R^2$ | CHI MAE ↓ | CHI RMSE ↓ | CHI $R^2$ |
|---|---|---|---|---|---|---|
| MVURE | 285.1 ± 6.2 | 461.0 ± 8.4 | 0.682 ± 0.015 | 1693 ± 74 | 3171 ± 128 | 0.656 ± 0.029 |
| MGFN | 345.0 ± 13.3 | 503.5 ± 20.8 | 0.621 ± 0.032 | 1281 ± 41 | 2276 ± 86 | 0.817 ± 0.011 |
| ReCP | 233.8 ± 3.6 | 392.7 ± 19.1 | 0.763 ± 0.027 | 1272 ± 92 | 2341 ± 267 | 0.804 ± 0.045 |
| HAFusion | 202.8 ± 7.2 | 322.8 ± 12.6 | 0.844 ± 0.012 | 929 ± 62 | 1947 ± 75 | 0.870 ± 0.010 |
| HREP | 274.1 ± 8.3 | 417.7 ± 14.9 | 0.739 ± 0.008 | 1679 ± 71 | 3135 ± 79 | 0.664 ± 0.017 |
| UrbanCLIP | 393.6 ± 5.9 | 602.4 ± 3.1 | 0.458 ± 0.005 | 2612 ± 29 | 4885 ± 73 | 0.186 ± 0.024 |
| FlexiReg | 198.2 ± 3.6 | 309.0 ± 17.2 | 0.850 ± 0.011 | 922 ± 76 | 1775 ± 199 | 0.891 ± 0.022 |
| RE-SAT * | 191.4 ± 2.9 | 271.3 ± 14.5 | 0.885 ± 0.008 | 809.8 ± 53 | 1594 ± 77 | 0.924 ± 0.013 |
| Improvement | 3.4% | 12.2% | 4.1% | 12.2% | 10.2% | 3.7% |

Crime
| Method | NYC MAE ↓ | NYC RMSE ↓ | NYC $R^2$ | CHI MAE ↓ | CHI RMSE ↓ | CHI $R^2$ |
|---|---|---|---|---|---|---|
| MVURE | 67.4 ± 0.9 | 90.5 ± 1.3 | 0.625 ± 0.017 | 100.4 ± 6.6 | 129.2 ± 7.3 | 0.461 ± 0.062 |
| MGFN | 73.2 ± 2.6 | 91.4 ± 2.9 | 0.618 ± 0.014 | 107.4 ± 5.4 | 137.9 ± 5.2 | 0.386 ± 0.047 |
| ReCP | 81.4 ± 2.0 | 101.3 ± 1.9 | 0.483 ± 0.025 | 86.9 ± 5.5 | 120.1 ± 7.1 | 0.534 ± 0.057 |
| HAFusion | 56.1 ± 0.7 | 76.1 ± 2.0 | 0.734 ± 0.014 | 77.8 ± 1.2 | 107.1 ± 4.4 | 0.631 ± 0.033 |
| HREP | 66.1 ± 2.7 | 84.4 ± 2.3 | 0.674 ± 0.009 | 88.3 ± 6.4 | 114.4 ± 5.5 | 0.578 ± 0.041 |
| UrbanCLIP | 97.4 ± 2.6 | 126.1 ± 1.9 | 0.267 ± 0.012 | 101.6 ± 0.6 | 134.7 ± 1.7 | 0.416 ± 0.006 |
| FlexiReg | 53.2 ± 0.4 | 73.6 ± 1.6 | 0.751 ± 0.009 | 61.7 ± 0.2 | 85.1 ± 2.0 | 0.766 ± 0.011 |
| RE-SAT * | 52.9 ± 0.2 | 70.4 ± 1.8 | 0.766 ± 0.006 | 61.5 ± 0.2 | 82.0 ± 1.7 | 0.783 ± 0.008 |
| Improvement | 0.6% | 4.3% | 2.0% | 0.3% | 3.6% | 2.2% |

Service Call
| Method | NYC MAE ↓ | NYC RMSE ↓ | NYC $R^2$ | CHI MAE ↓ | CHI RMSE ↓ | CHI $R^2$ |
|---|---|---|---|---|---|---|
| MVURE | 1402 ± 27 | 2128 ± 36 | 0.398 ± 0.022 | 190.3 ± 9.8 | 266.9 ± 12.1 | 0.441 ± 0.050 |
| MGFN | 1653 ± 70 | 2250 ± 108 | 0.327 ± 0.060 | 208.2 ± 11.3 | 293.4 ± 16.6 | 0.329 ± 0.077 |
| ReCP | 1478 ± 59 | 2136 ± 41 | 0.366 ± 0.017 | 206.7 ± 11.1 | 303.4 ± 16.1 | 0.284 ± 0.076 |
| HAFusion | 1273 ± 20 | 1951 ± 27 | 0.493 ± 0.014 | 159.3 ± 13.9 | 222.0 ± 18.9 | 0.613 ± 0.067 |
| HREP | 1396 ± 20 | 2011 ± 36 | 0.462 ± 0.021 | 185.7 ± 6.1 | 262.2 ± 10.8 | 0.468 ± 0.022 |
| UrbanCLIP | 1409 ± 7 | 2401 ± 16 | 0.232 ± 0.005 | 183.2 ± 0.9 | 256.3 ± 1.8 | 0.491 ± 0.003 |
| FlexiReg | 1303 ± 16 | 1989 ± 43 | 0.497 ± 0.012 | 121.1 ± 2.3 | 178.2 ± 5.1 | 0.753 ± 0.014 |
| RE-SAT * | 1264 ± 14 | 1851 ± 41 | 0.523 ± 0.010 | 117.6 ± 2.4 | 170.2 ± 4.4 | 0.775 ± 0.010 |
| Improvement | 3.0% | 6.9% | 5.2% | 2.9% | 4.5% | 2.9% |

Population
| Method | NYC MAE ↓ | NYC RMSE ↓ | NYC $R^2$ | CHI MAE ↓ | CHI RMSE ↓ | CHI $R^2$ |
|---|---|---|---|---|---|---|
| MVURE | 2899 ± 53 | 3708 ± 70 | 0.508 ± 0.007 | 13,717 ± 322 | 17,174 ± 552 | 0.313 ± 0.043 |
| MGFN | 3222 ± 47 | 4207 ± 119 | 0.367 ± 0.026 | 13,071 ± 505 | 16,578 ± 707 | 0.359 ± 0.054 |
| ReCP | 3527 ± 101 | 4434 ± 147 | 0.329 ± 0.043 | 12,085 ± 400 | 17,029 ± 561 | 0.325 ± 0.044 |
| HAFusion | 2497 ± 50 | 3277 ± 82 | 0.616 ± 0.019 | 10,678 ± 390 | 13,988 ± 548 | 0.544 ± 0.035 |
| HREP | 3118 ± 67 | 4023 ± 121 | 0.421 ± 0.037 | 12,063 ± 539 | 15,397 ± 832 | 0.447 ± 0.061 |
| UrbanCLIP | 3338 ± 11 | 4499 ± 16 | 0.276 ± 0.002 | 13,328 ± 69 | 17,498 ± 74 | 0.288 ± 0.006 |
| FlexiReg | 2231 ± 24 | 2974 ± 60 | 0.701 ± 0.002 | 8126 ± 320 | 11,395 ± 508 | 0.698 ± 0.028 |
| RE-SAT * | 2193 ± 19 | 2881 ± 43 | 0.703 ± 0.002 | 7638 ± 303 | 10,030 ± 625 | 0.745 ± 0.022 |
| Improvement | 1.7% | 3.1% | 0.3% | 6.0% | 12.0% | 6.7% |
Table 3. Ablation study on NYC.

(a) Check-in Prediction
| Method | MAE ↓ | RMSE ↓ | $R^2$ |
|---|---|---|---|
| w/o Spatial-aware | 199.1 | 305.1 | 0.852 |
| w/o Semantic-guided Prompt Learning | 198.1 | 306.4 | 0.860 |
| RE-SAT | 191.4 | 271.3 | 0.885 |

(b) Crime Prediction
| Method | MAE ↓ | RMSE ↓ | $R^2$ |
|---|---|---|---|
| w/o Spatial-aware | 53.2 | 71.9 | 0.760 |
| w/o Semantic-guided Prompt Learning | 53.5 | 72.6 | 0.759 |
| RE-SAT | 52.9 | 70.4 | 0.766 |

(c) Service Call Estimation
| Method | MAE ↓ | RMSE ↓ | $R^2$ |
|---|---|---|---|
| w/o Spatial-aware | 1288 | 1959 | 0.495 |
| w/o Semantic-guided Prompt Learning | 1324 | 1985 | 0.489 |
| RE-SAT | 1264 | 1851 | 0.523 |

(d) Population Prediction
| Method | MAE ↓ | RMSE ↓ | $R^2$ |
|---|---|---|---|
| w/o Spatial-aware | 2265 | 3144 | 0.652 |
| w/o Semantic-guided Prompt Learning | 2361 | 3134 | 0.649 |
| RE-SAT | 2193 | 2881 | 0.703 |
Table 4. Cross-city transferability analysis ($R^2$).
| Method | Check-in | Crime | Service Call | Population |
|---|---|---|---|---|
| HREP | 0.664 | 0.578 | 0.468 | 0.447 |
| UrbanCLIP | 0.186 | 0.416 | 0.491 | 0.288 |
| FlexiReg | 0.891 | 0.766 | 0.753 | 0.698 |
| RE-SAT | 0.924 | 0.783 | 0.775 | 0.745 |
| RE-SAT (NYC → CHI) | 0.866 | 0.677 | 0.646 | 0.718 |
Table 5. Running time of downstream task (s). The best results are in boldface.
| Method | NYC | CHI |
|---|---|---|
| HREP | 92 | 146 |
| UrbanCLIP | 86 | 86 |
| FlexiReg | 137 | 103 |
| RE-SAT | 51 | 52 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Dai, G.; Guo, Z.; Zhang, B.; Fu, X.; Dong, L.; Cao, J.; Huang, H. RE-SAT: Spatial-Aware Transformers with Semantic-Guided Prompting for Urban Region Embedding. Urban Sci. 2026, 10, 168. https://doi.org/10.3390/urbansci10030168
