1. Introduction
Learning region embeddings is a fundamental problem in urban computing [1]. The goal is to encode heterogeneous urban data into compact region-level representations that can be directly used by downstream models. These data sources include structured information such as points of interest, human mobility, and land use, as well as unstructured modalities like satellite imagery, street-view images, and geo-referenced texts [2,3]. By providing a unified representation, region embeddings connect diverse urban data with learning-based decision models. Well-learned region embeddings have been widely applied to urban tasks such as crime prediction, check-in forecasting, and service demand estimation [4]. An important advantage of these representations is their reusability across tasks and cities, which reduces the need for repeated model training. As large-scale urban data become increasingly accessible, region representation learning has emerged as a key component in urban analytics systems.
With the development of deep learning, most recent studies adopt data-driven methods to learn region representations [5,6]. Early work mainly focused on integrating multiple urban data views, including mobility patterns and points of interest (POIs) [7]. These approaches typically generated a single embedding for each region using simple fusion strategies, such as feature concatenation, weighted aggregation, or dimensionality reduction through multilayer perceptrons and autoencoders [8]. Although effective in capturing view-level information, they largely ignored spatial dependencies between regions.
Later studies addressed this limitation by explicitly modeling spatial relations [9]. These methods constructed graph structures to represent region–region interactions and employed graph neural networks as encoders [10,11]. To capture diverse spatial and functional relationships, multiple graphs were often used. In addition, attention mechanisms were introduced to adaptively fuse information from different views and, in some cases, across regions, as demonstrated in models such as Urban Multi-Modal and Multi-View Dual Contrastive Learning (UrbanMMCL) [12], Multi-View Joint Representation Learning Framework for Urban Region Embedding (MVURE) [13], and Hybrid Attentive Fusion (HAFusion) [14]. These designs improved the ability to model spatial interactions in region representations. More recently, region representation learning has been extended to multimodal settings. Beyond structured urban data, these methods incorporate visual and textual information, such as satellite images, street-view data, and POI-related texts. Cross-modal consistency is typically achieved through contrastive learning or hierarchical modeling strategies. Representative approaches include Region Dual Contrastive Learning (RegionDCL) [15], Urban Region Profiling with Contrastive Language-Image Pretraining (UrbanCLIP) [16], and City Foundation Models (CityFM) [17], which aim to align region semantics across modalities.
To further improve adaptation to downstream tasks, prompt learning has been introduced into region representation learning [18]. In this line of work, general-purpose region embeddings are first learned, and task-related prompts are then injected to guide task-specific prediction [16]. These prompts are often derived from multimodal inputs and are designed to introduce task semantics with limited parameter updates [19]. For example, Heterogeneous Region Embedding with Prompt Learning (HREP) [18] applies randomly initialized soft prompts for efficient task adaptation, while Flexible Urban Region Representation Learning (FlexiReg) [19] uses image and text prompts to enhance semantic richness and generalization across tasks and cities.
Despite these advances, two major challenges remain. First, spatial relations within and across views are still insufficiently explored. Most existing methods encode each view independently. On the one hand, spatial dependencies between regions are not fully considered in some view fusion strategies, which weakens region-level interactions. On the other hand, spatial interactions across different views are rarely modeled. This design may lead to fragmented or inconsistent region semantics across views. In practice, spatial relations are essential not only within individual views but also across views, because urban region functions exhibit spatial continuity, and different views are essentially diverse observations of the same underlying spatial structure. Ignoring this shared spatial structure limits the ability to learn coherent region representations. Second, existing prompt-based methods lack a clear alignment with task objectives. Task prompts are often constructed in a random or isolated manner, which prevents the model from capturing intrinsic relationships among tasks. However, urban tasks are not independent. Tasks such as check-in prediction and human mobility flow prediction rely on similar data sources and spatial activity patterns, and are therefore more closely related to each other than to tasks like crime prediction. Such task-level similarities should be reflected in the prompt representations. Nevertheless, current methods fail to explicitly model these relationships, treating prompts for related tasks independently. As a result, prompt-based fine-tuning may have limited effectiveness in improving downstream performance.
To address the above challenges, we propose a spatial-aware Transformer network with semantic-guided prompting for region embedding (RE-SAT), a unified framework for urban region representation learning and task-specific adaptation. RE-SAT is designed to explicitly model spatial relations within and across views, while enabling effective task alignment through semantic-guided prompt learning. RE-SAT follows a two-stage learning paradigm. In the first stage, the model learns task-agnostic region embeddings from multi-view urban data through a multi-view representation learning module. As the core component of this module, we design a spatial-aware Transformer encoder (STE), which injects spatial priors—such as region connectivity and distance-based adjacency—directly into the self-attention mechanism. This design enables the model to jointly capture global semantic correlations and local spatial dependencies within each view, while preserving spatial consistency across different views. The resulting view-specific representations are further integrated via an adaptive multi-view fusion module to produce universal region embeddings.
In the second stage, RE-SAT adapts the frozen region embeddings to specific downstream tasks through a semantic-guided prompt learning mechanism. Instead of manually designing or randomly initializing task prompts as in previous work [18], we encode textual task descriptions using a BERT-based text encoder to obtain task semantic representations. A lightweight prompt generation module then aligns task semantics with region embeddings in a shared latent space and generates soft prompts via attention-based fusion. These prompts are concatenated with the original region embeddings to form task-aware representations for downstream prediction. By explicitly modeling task semantics and their relationships with region representations, RE-SAT enables effective and stable task adaptation without modifying the universal embeddings.
To summarize, this paper aims to answer two fundamental research questions: (1) How can we design a unified encoding mechanism that effectively balances complex global semantic correlations (e.g., functionally similar but geographically distant regions) and local spatial dependencies (e.g., connectivity and proximity) across multi-source urban views? (2) How can we leverage task-specific semantic information to efficiently guide the adaptation of frozen region embeddings for diverse downstream applications? The main contributions of this paper are summarized as follows:
We propose RE-SAT, a unified two-stage framework for urban region representation learning that explicitly models spatial relations within and across views, while enabling semantic-guided task adaptation via prompt learning.
We design the STE that incorporates connectivity encoding and distance-based spatial priors into the attention mechanism, allowing the model to capture both global semantic dependencies and local spatial structures from multi-view urban data.
We introduce a semantic-guided prompt generation module that aligns textual task semantics with pre-trained region embeddings in a shared latent space, generating task-aware soft prompts without modifying the universal embeddings.
Extensive experiments on multiple downstream tasks demonstrate that RE-SAT consistently outperforms state-of-the-art baselines, validating the effectiveness and generalizability of the proposed framework.
The structure of the paper is as follows:
Section 2 reviews related work on urban region embedding strategies and the application of prompt learning in urban computing.
Section 3 details the methodology of the proposed RE-SAT framework, elaborating on the spatial-aware Transformer encoder and the semantic-guided prompt generation mechanism.
Section 4 presents the experimental datasets, implementation details, and a comprehensive evaluation of the model’s performance across multiple downstream tasks compared with state-of-the-art baselines.
Section 5 provides an in-depth discussion on the model’s limitations, practical implications, and future research directions.
Finally, Section 6 concludes the study.
3. Methodology
In this section, we describe the proposed RE-SAT framework in detail. We first formulate the problem of urban region embedding and introduce the necessary notations. We then provide an overview of the overall architectural pipeline of RE-SAT. Next, we present the two key components of the framework: the STE for multi-view region representation learning, and the semantic-guided prompt generation module for task-specific adaptation. Finally, we describe the two-stage training paradigm and the corresponding optimization objectives.
3.1. Problem Formulation
Definition 1 (Urban Region Embedding). Given a set of non-overlapping urban regions , the objective of urban region embedding is to project each region into a low-dimensional embedding , where d denotes the embedding dimensionality. These representations are subsequently utilized for downstream predictive tasks, such as crime prediction, which aim to map regional embeddings to specific target values.
Definition 2 (POI Features). Let denote the points of interest feature matrix, where represents the count of POIs belonging to category j in region , and represents the number of POI categories.
Definition 3 (Land-Use Features). Let denote the land-use feature matrix, where quantifies the count of areas falling into land-use category j within region , and represents the number of land categories.
Definition 4 (Human Mobility Features). Let denote the mobility flow matrix, where represents the volume of human transitions from region to region over a specific period.
Definition 5 (Geographic Proximity Features). Let be the distance-based adjacency matrix, where is the normalized proximity coefficient between region and . A higher indicates a shorter geographic distance.
3.2. Data Preprocessing
Following [14], we constructed multi-view datasets for New York City (NYC) and Chicago (CHI). The preprocessing pipeline consists of four main steps that transform raw urban data into region-level feature matrices.
Region Partitioning: We discretized the geographical space into non-overlapping functional units. For NYC, the study area focuses on Manhattan, divided into 180 census tracts. For CHI, the city is delineated into 77 official community areas.
Human Mobility Feature Extraction: We utilized large-scale taxi trajectory records (pickup and drop-off coordinates) to construct the human mobility view. We mapped each trip’s origin and destination to the corresponding regions and aggregated the total volume of trips from region to region over the observation period. This results in a mobility flow matrix , where entries represent the transition intensity between regions.
POI/Land-Use Feature Extraction: We processed POI and land-use data to capture regional functionality. Raw POI records were collected from OpenStreetMap and mapped to regions based on their coordinates. We categorized them into standard types (e.g., restaurants, schools) and computed the frequency for each region to form the POI feature matrix . Similarly, land-use data were processed by counting the number of functional zones (e.g., residential, commercial) within each region, resulting in feature matrix .
Geographic Adjacency Matrix Construction. To explicitly characterize the spatial proximity between urban regions, we construct a distance-based adjacency matrix . Specifically, we calculate the pairwise Manhattan distances between regional centroids based on their geographic coordinates. These physical distances are then transformed into normalized proximity coefficients via a Gaussian kernel function. To ensure graph sparsity and mitigate the influence of weak, long-range dependencies, we exclude self-loops and truncate connections exceeding a predefined distance threshold.
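The mobility aggregation and adjacency construction steps above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the kernel bandwidth `sigma`, and the threshold `max_dist` are hypothetical, and the centroids are assumed to be in projected coordinates so that distances share the unit of `max_dist`.

```python
import numpy as np


def mobility_flow_matrix(origin_ids, dest_ids, n_regions):
    """Count trips between regions; entry (i, j) is the trip volume from i to j."""
    flows = np.zeros((n_regions, n_regions))
    # np.add.at accumulates correctly even when the same (i, j) pair repeats.
    np.add.at(flows, (np.asarray(origin_ids), np.asarray(dest_ids)), 1.0)
    return flows


def gaussian_adjacency(centroids, sigma, max_dist):
    """Distance-based adjacency from region centroids.

    centroids: (N, 2) array of projected x/y coordinates.
    sigma:     Gaussian kernel bandwidth (assumed hyperparameter).
    max_dist:  links longer than this distance are truncated to zero.
    """
    centroids = np.asarray(centroids, dtype=float)
    # Pairwise Manhattan (L1) distances between centroids.
    diff = np.abs(centroids[:, None, :] - centroids[None, :, :])
    dist = diff.sum(axis=-1)
    # Gaussian kernel: shorter distance gives a larger proximity coefficient.
    adj = np.exp(-(dist ** 2) / (2.0 * sigma ** 2))
    adj[dist > max_dist] = 0.0   # drop weak long-range links
    np.fill_diagonal(adj, 0.0)   # exclude self-loops
    return adj
```

With this form, the resulting matrix is symmetric and sparse for a tight threshold, which matches the stated goal of suppressing weak long-range dependencies.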
3.3. Model Overview
The overall architecture of RE-SAT is illustrated in Figure 1. RE-SAT follows a two-stage paradigm for urban region representation learning and task adaptation. In the first stage, the model ingests multi-source urban features, including POI, land-use, mobility, and proximity information, and learns general-purpose region representations through a multi-view representation learning module. This module captures both global semantic correlations and local spatial dependencies among regions. In the second stage, task-specific semantics are encoded using a BERT-based text encoder. A lightweight prompt generation module then integrates these task semantic representations with the pre-trained region embeddings to synthesize soft prompt vectors. The generated prompts are concatenated with the original embeddings to form task-aware region representations for downstream applications, such as crime prediction and check-in forecasting.
3.4. Multi-View Representation Learning
This stage aims to extract universal regional semantics by balancing global and local perspectives through a two-step process.
3.4.1. STE: Spatial-Aware Transformer Encoder
To explicitly model spatial dependencies, we utilize three different STEs to learn view-specific representations for POI, land-use, and mobility.
Connectivity Encoding: Given a feature matrix
and adjacency matrix
, we first compute the degree centrality of each region to measure its connectivity and accessibility, as expressed in Equation (
1):
where
represents the degree centrality vector of all regions, and
is a very small constant used to prevent division by zero.
The centrality vector is then transformed by a linear layer into a connectivity encoding and injected into the input features, as given in Equation (
2):
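As a rough illustration of the connectivity encoding, the sketch below computes a normalized degree centrality and adds its linear projection to the region features. Several details are assumptions made for the example: the degree is normalized by the maximum degree plus a small constant, the encoding is injected by addition, and the names `w_c` and `b_c` for the linear layer are hypothetical.

```python
import numpy as np


def degree_centrality(adj, eps=1e-8):
    """Weighted degree of each region, scaled to [0, 1).

    eps guards against division by zero when the graph has no edges.
    Normalizing by the maximum degree is an assumption of this sketch.
    """
    deg = adj.sum(axis=1)
    return deg / (deg.max() + eps)


def inject_connectivity(x, adj, w_c, b_c):
    """Add a linear connectivity encoding to the region features.

    x:   (N, d) region features, assumed already projected to width d.
    w_c: (1, d) weight and b_c: (d,) bias of the hypothetical linear layer.
    """
    c = degree_centrality(adj)[:, None]   # (N, 1) centrality column
    return x + c @ w_c + b_c              # broadcast to (N, d)
```

Regions with more or stronger neighbors receive a larger encoding, which gives the attention layers a simple signal about accessibility before any pairwise bias is applied.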
Spatial-Prior Augmented Attention: The core of STE is a multi-head self-attention (MHSA) mechanism augmented by spatial priors. To capture both long-range semantic similarity and local spatial proximity, we map the adjacency matrix
into proximity encodings and integrate them into the attention weights. The attention score between region
and
is formulated as Equation (
3):
where
and
are learnable parameters, and
is the spatial weighting hyperparameter. Unlike molecular graphs or social networks, which often exhibit complex, irregular topologies with high-order structural significance, urban region graphs are typically characterized by regular geometric arrangements and strong local spatial autocorrelation. According to Tobler’s First Law of Geography [
28] (“near things are more related than distant things”), first-order proximity contains the most critical spatial information for urban profiling. While the self-attention mechanism inherently captures global semantic correlations (long-range dependencies), it lacks structural awareness. By injecting a lightweight linear bias derived from the adjacency matrix
, we explicitly introduce a local spatial inductive bias that prioritizes immediate neighborhood connectivity. This design effectively complements the global receptive field of Transformers without incurring the high computational overhead associated with complex eigen-decomposition or shortest-path encodings used in general graph transformers [
29,
30]. Furthermore, the simplicity of the linear bias acts as a regularizer, mitigating the risk of over-smoothing often observed in deep GNNs when aggregating high-order neighborhoods.
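The idea of adding a linear, adjacency-derived bias to the attention logits can be sketched as below. This is a single-head simplification under stated assumptions: the bias is taken to be a scalar affine map of the adjacency matrix (`w_a * adj + b_a`), scaled by a weight `lam`, and added to the scaled dot-product scores before the softmax. The parameter names are illustrative and the exact form used in the paper may differ.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_attention(x, adj, w_q, w_k, w_v, w_a, b_a, lam):
    """Single-head self-attention with an additive spatial bias.

    x:   (N, d) region features; adj: (N, N) proximity matrix.
    w_a, b_a: scalar parameters of the assumed linear bias.
    lam: weight that trades off spatial bias against semantic similarity.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    semantic = q @ k.T / np.sqrt(q.shape[-1])   # global semantic similarity
    spatial = w_a * adj + b_a                   # local proximity bias
    attn = softmax(semantic + lam * spatial, axis=-1)
    return attn @ v, attn
```

Setting `lam` to zero recovers plain self-attention, while larger values shift attention mass toward adjacent regions, which is the trade-off the hyperparameter is described as controlling.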
The STE module consists of several consecutive layers stacked to capture high-order spatial and semantic dependencies. Formally, the transformation within the
l-th layer is defined as Equations (
4) and (
5):
where
is a feed-forward block and
is the layer normalization.
The resulting representation is refined via a lightweight multi-layer perceptron to perform feature projection and nonlinear abstraction. This process yields the final spatial-aware view-specific representation , which encapsulates both the global semantic context and the underlying topological structure of the urban region.
3.4.2. Fusion Module
To synthesize embeddings from heterogeneous views, we adopt the DAFusion mechanism [
14]. It consists of:
ViewFusion: Learns adaptive weights for each view (POI, land-use, mobility) by calculating pairwise correlations, resulting in a fused representation: .
RegionFusion: Utilizes a Transformer-based structure to encode high-order correlations across different regional fused representations and finally generate the general region embeddings .
3.5. Semantic-Guided Prompt Learning
We introduce a semantic-guided prompting mechanism via prefix tuning to bridge the gap between general representations and specific downstream tasks.
3.5.1. Semantic-Guided Prompt Generation
To align task semantics with region embeddings in a shared latent space, we design a lightweight prompt generator. We first use BERT [
31] to encode textual task descriptions (e.g.,
“User check-in behavior at a location is driven by the attraction of its Points of Interest, the broader functionality defined by its land use category, and the volume and origin of human mobility inflows. Popular destinations often have a specific POI profile and are embedded within areas of complementary land use.”). The BERT-output
token and the pre-trained region embeddings are projected into a unified semantic space, as shown in Equation (
6):
where
and
are learnable projection matrices.
Subsequently, we calculate attention weights between task semantics and each region to perform weighted fusion via Equation (
7), incorporating a residual connection to preserve the original regional profile expressed in Equation (
8):
where
is an activation function and
is a learnable parameter matrix. After layer normalization and output projection, we obtain the final semantic-guided soft prompts
.
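One plausible reading of the prompt generation steps is sketched below. It is not the authors' formulation: the softmax over regions, the `tanh` activation, the additive residual, and the weight names `w_t`, `w_r`, and `w_f` are all assumptions, and the final output projection is omitted for brevity. The task vector stands in for the sentence-level BERT output.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Layer normalization over the last axis, without learnable affine terms."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def generate_prompts(task_vec, regions, w_t, w_r, w_f):
    """Generate one soft prompt per region from a task description vector.

    task_vec: (d_text,) task representation from the text encoder.
    regions:  (N, d_region) frozen region embeddings.
    w_t, w_r: projections into a shared space of width d (assumed names).
    w_f:      (d, d) fusion matrix applied before the activation.
    """
    t = task_vec @ w_t                          # (d,) task semantics
    r = regions @ w_r                           # (N, d) projected regions
    scores = r @ t / np.sqrt(t.shape[0])        # task-region affinity
    scores = scores - scores.max()              # stabilize the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    fused = np.tanh((alpha[:, None] * t[None, :]) @ w_f)
    return layer_norm(fused + r)                # residual keeps the region profile
```

Because the prompts depend on the task text rather than on random initialization, descriptions of related tasks would map to nearby task vectors and therefore to similar prompts, which is the behavior the semantic-guided design aims for.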
3.5.2. Downstream Prediction
The soft prompts are concatenated with the original embeddings to construct task-aware region embeddings. A feed-forward network (FNN) is then employed for the final prediction, which is formulated in Equation (
9):
where
denotes concatenation.
3.6. Model Training
RE-SAT is trained in two distinct phases:
Regional Similarity Reconstruction: To ensure that the universal embedding
effectively preserves view-specific semantic structures, we employ specialized linear projection heads to map
back into the POI and land-use subspaces, denoted as
and
, respectively. We then formulate a reconstruction objective that constrains the pairwise inner products of these projected embeddings to approximate the empirical region similarity matrices
and
. The learning objectives
and
are defined as Equations (
10) and (
11):
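The similarity reconstruction objective can be illustrated with a short sketch. It assumes a mean squared difference between the pairwise inner products of the projected embeddings and the empirical similarity matrix; the reduction (mean versus sum) and the projection name `w_proj` are assumptions of this example.

```python
import numpy as np


def similarity_reconstruction_loss(h, w_proj, sim):
    """Mean squared gap between predicted and empirical region similarity.

    h:      (N, d) universal region embeddings.
    w_proj: (d, d_view) linear head for one view (POI or land use).
    sim:    (N, N) empirical similarity matrix for that view.
    """
    z = h @ w_proj        # project into the view-specific subspace
    pred = z @ z.T        # pairwise inner products
    return float(np.mean((pred - sim) ** 2))
```

The same function would be applied twice, once with the POI head and similarity matrix and once with the land-use head and similarity matrix.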
Mobility Distribution Reconstruction: To effectively characterize urban mobility dynamics, we aim to approximate the empirical transition probability distributions between regions. Specifically, the universal embedding
is projected into two distinct latent subspaces—a source space
and a destination space
. From these two matrices, we compute outbound and inbound transition probabilities. For an origin region
i, the outbound transition probability to a destination region
j is defined as Equation (
12):
Conversely, for a destination region
j, the inbound transition probability originating from region
i is formulated as Equation (
13):
The loss
is defined as the KL-divergence between the predicted and ground truth probability distributions, as shown in Equation (
14):
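A minimal sketch of the mobility objective follows. It assumes that the predicted probabilities are softmax-normalized inner products between source and destination projections, and that the KL divergence is taken from the empirical distribution to the prediction; the direction of the divergence and the weight names `w_src` and `w_dst` are assumptions.

```python
import numpy as np


def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def mobility_kl_loss(h, w_src, w_dst, flows, eps=1e-12):
    """KL divergence between empirical and predicted transition distributions.

    h:     (N, d) universal embeddings; flows: (N, N) trip counts i -> j.
    w_src, w_dst: projections into the source and destination subspaces.
    """
    s, d = h @ w_src, h @ w_dst
    logits = s @ d.T                           # logits[i, j]: origin i, destination j
    q_out = softmax(logits, axis=1)            # over destinations for each origin
    q_in = softmax(logits, axis=0)             # over origins for each destination
    p_out = flows / (flows.sum(axis=1, keepdims=True) + eps)
    p_in = flows / (flows.sum(axis=0, keepdims=True) + eps)
    kl_out = np.sum(p_out * (np.log(p_out + eps) - np.log(q_out + eps)))
    kl_in = np.sum(p_in * (np.log(p_in + eps) - np.log(q_in + eps)))
    return float(kl_out + kl_in)
```

Normalizing the flow matrix by rows gives the outbound distribution of each origin, and normalizing by columns gives the inbound distribution of each destination, so both directions of movement constrain the same embeddings.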
Equation (
15) presents the total multi-task reconstruction objective:
where
are learnable parameters that automatically balance the three loss functions, and
.
Semantic-Guided Prompt Learning. The pseudo-code for this stage is shown in Algorithm 2. We adapt the frozen pre-trained embeddings to specific tasks. All pre-trained parameters remain frozen to preserve universal urban knowledge. This enables the model to effectively specialize the general embeddings for specific task scenarios without catastrophic forgetting. We optimize the task-specific soft prompts and the prediction head using a mean squared error (MSE) loss, as expressed in Equation (
16):
where
is the prediction and
is the ground truth for region
.
| Algorithm 1 Multi-View Representation Learning (Stage 1) |
| Input: Feature matrices: POI , Land-use , Mobility , and Proximity matrix . |
| Output: Pre-trained universal region embeddings . |
1: Initialize parameters of STE and loss weights;
2: for each epoch do
3:   for each view do
4:     Extract view-specific features using STE with A;
5:   end for
6:   Compute fused region embeddings via DAFusion;
7:   Calculate reconstruction losses:
8:     Compute POI similarity reconstruction loss based on Equation (10);
9:     Compute land-use similarity reconstruction loss based on Equation (11);
10:    Compute KL-divergence of flow distribution based on Equation (14);
11:  Compute total loss based on Equation (15);
12:  Update parameters via backpropagation;
13: end for
14: return Optimal universal embeddings.
| Algorithm 2 Semantic-Guided Prompt Learning (Stage 2) |
| Input: Frozen embeddings , Downstream task description , Ground truth Y. |
| Output: Task-specific prediction . |
1: Initialize PromptGenerator and Task Head (FNN);
2: for each epoch do
3:   Encode task text;
4:   Generate soft prompts;
5:   Concatenate soft prompts and embeddings;
6:   Compute final prediction;
7:   Compute MSE loss based on Equation (16);
8:   Update PromptGenerator and Task Head;
9: end for
10: return Prediction results.
4. Results
In this section, we first provide a detailed exposition of the datasets and experimental configurations. Subsequently, we evaluate the efficacy of the learned region representations through four critical downstream tasks—check-in prediction, crime prediction, service call estimation, and population prediction—conducted across two major metropolitan areas: NYC and CHI.
4.1. Datasets
In this paper, we utilize real-world datasets collected from NYC and CHI. The geographical granularity for NYC is defined by the census tracts of Manhattan, while CHI is delineated by its official community area boundaries. We leverage multi-source urban features, including POIs, land-use categories, and taxi trajectory records, to capture the foundational semantics and mobility patterns of each region during the first stage of training. To assess the representational power of the embeddings, we gather longitudinal records comprising criminal incidents, user check-ins, public service requests, and population statistics. Detailed statistics are summarized in
Table 1.
4.2. Experiment Setup
Baselines. We compare the performance of RE-SAT with several state-of-the-art urban region embedding methods.
MVURE [13]: This work introduces a multi-view joint graph representation learning framework that leverages graph attention networks to adaptively fuse human mobility patterns with multi-dimensional regional attributes, such as POIs and check-in data.
MGFN [23]: MGFN constructs a comprehensive multi-graph architecture, encompassing mobility flow and spatio-temporal similarity graphs, and employs a cross-modal message-passing mechanism to capture intricate dynamic interactions and spatial dependencies between urban regions.
ReCP [37]: This framework proposes a multi-view contrastive prediction paradigm that aligns heterogeneous views (including mobility, geographic proximity, and POIs) within a shared latent space to learn robust and generalizable region representations.
HAFusion [14]: HAFusion develops a hybrid attention-based fusion mechanism that explicitly models both intra-view semantic consistency and inter-view complementary correlations, thereby enhancing the representation quality of heterogeneous urban data.
HREP [18]: As a pioneering effort in prompt-based regional modeling, HREP utilizes a prefix-tuning strategy to adapt pre-trained region representations to diverse downstream tasks while keeping the backbone parameters frozen.
FlexiReg [19]: FlexiReg introduces a flexible prompt enhancement module that synergistically integrates textual instructions with multimodal features, such as street-view imagery, to generate augmented prompts that bolster the adaptability of region embeddings.
UrbanCLIP [16]: Drawing inspiration from the contrastive language-image pre-training paradigm, UrbanCLIP achieves cross-modal semantic alignment between satellite imagery and rich textual descriptions through prompt engineering to construct fine-grained urban region profiles.
Implementation Details. The training of RE-SAT is executed in two sequential phases, each spanning 3000 epochs to ensure robust convergence. For the pre-training phase, the learning rate is initialized at
, while for the semantic-guided prompt learning phase, it is set to
. The STE module is configured with a depth of three layers. Regarding the spatial weighting hyperparameter, we set
for the NYC dataset and
for the CHI dataset. Following the settings in [
18,
19], the dimensionality for both the universal region embeddings and the semantic soft prompts is fixed at
. To ensure optimal model configuration, all hyperparameters are determined through an exhaustive grid search on the validation set.
Evaluation Metrics. To rigorously evaluate the predictive performance of the learned representations across various downstream tasks, we adopt three widely recognized metrics consistent with prior work [14,18]:
Mean Absolute Error (MAE): Measures the average magnitude of the errors in a set of predictions, without considering their direction.
Root Mean Square Error (RMSE): Provides a measure of the square root of the average of squared differences between prediction and actual observation, penalizing larger errors more heavily.
Coefficient of Determination (R²): Represents the proportion of the variance in the dependent variable that is explained by the independent variables in the model.
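For reference, the three metrics can be computed as in the following sketch, which uses their standard definitions. The function names are illustrative, and the inputs are assumed to be one-dimensional arrays of per-region targets and predictions.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the residuals."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean square error: penalizes large residuals more heavily."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

Lower values are better for MAE and RMSE, whereas higher values are better for the coefficient of determination, which can become negative when a model fits worse than predicting the mean.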
4.3. Overall Performance
We compare RE-SAT with a suite of state-of-the-art baselines on four downstream tasks in NYC and CHI, with the results summarized in
Table 2. Statistical significance tests indicate that RE-SAT consistently outperforms the strongest baseline, FlexiReg, with a significant margin (
), establishing a new state-of-the-art performance.
A first observation is that methods integrating multiple data views consistently outperform single-view approaches. This result highlights the inherent complexity and heterogeneity of urban regions, whose semantics cannot be fully captured from a single perspective. For example, MGFN relies solely on human mobility flows and therefore neglects rich intra-region attributes, while UrbanCLIP is restricted to satellite imagery that mainly reflects external visual characteristics and fails to model deep semantic interactions or inter-regional connectivity. In contrast, models such as ReCP and HAFusion achieve better performance by jointly modeling functional and semantic information from multiple complementary views.
We further observe that prompt-tuning-based methods generally surpass traditional single-stage learning approaches. This advantage can be attributed to the distribution gap between pre-training pretext tasks and downstream applications, which often limits the direct transferability of general region representations. By enabling parameter-efficient task adaptation, prefix-tuning effectively bridges this gap and yields substantial performance improvements with minimal computational overhead.
Most importantly, RE-SAT consistently outperforms all competing baselines across all tasks and datasets, achieving a maximum relative improvement of 12.2%. This superiority stems from its ability to jointly model global semantic correlations and local spatial structures across multi-source urban data, while explicitly incorporating task-specific semantics through semantic-guided prompt learning. Compared with the strongest competitor, RE-SAT achieves average improvements of 3.9% and 5.6% on the NYC and CHI datasets, respectively. The observed gains are further validated by t-test results, confirming their statistical significance ().
4.4. Ablation Study
To investigate the significance of modeling local structural dependencies, we conduct an ablation study by removing the spatial-aware components—specifically, the connectivity and proximity encodings—from the STE module. This effectively reduces the encoder to a vanilla Transformer architecture. As evidenced by the results in
Table 3, RE-SAT with spatial awareness consistently outperforms its counterpart. Compared with the “w/o Spatial-aware” variant, RE-SAT achieves average performance gains of 6.3% in check-in prediction, 1.2% in crime prediction, 4.3% in service call estimation, and 6.5% in population prediction. We attribute this advantage to the model’s ability to explicitly characterize and balance local spatial structure with global correlations, thereby capturing the inherent spatial autocorrelation of urban regions more effectively.
Furthermore, to validate whether the semantic-guided prompt learning module effectively aligns universal region representations with task-specific semantics, we adopt a random initialization strategy for soft prompts, following the paradigm in HREP, rather than using semantic-guided generation. The empirical results demonstrate that the semantic-guided variant yields significantly enhanced performance. Compared to the randomly initialized version, RE-SAT exhibits average improvements of 5.9%, 2.4%, 4.1%, and 7.8% across the four downstream tasks, respectively. This performance gap highlights that random initialization fails to encapsulate the nuanced differences between diverse tasks. In contrast, our semantic-guided mechanism strengthens the task-awareness of region embeddings, facilitating more robust adaptation of universal representations to various downstream applications.
4.5. Cross-City Transferability Analysis
To further evaluate the generalization capability of the learned representations across different spatial contexts, we conducted a cross-city transferability experiment. Specifically, we pre-trained the multi-view representation learning backbone on NYC. The frozen encoder weights were then directly transferred to CHI using a zero-padding strategy to align the spatial dimensions. During the downstream adaptation on CHI, we solely trained the lightweight semantic-guided prompt module without fine-tuning the stage 1 backbone. As shown in
Table 4, the transferred model, denoted as RE-SAT (NYC → CHI), exhibits remarkable robustness. While experiencing a minor, expected performance drop compared to the RE-SAT model fully trained on CHI, it still significantly outperforms the locally trained HREP and UrbanCLIP baselines across all downstream tasks.
Notably, in the population prediction task, RE-SAT (NYC → CHI) achieves an R² of 0.718, surpassing the strongest locally trained baseline, FlexiReg (0.698). These findings indicate that the STE module learns transferable spatial-semantic correlations that are not tied to a specific city. By achieving competitive performance through prompting alone, without retraining the encoder, RE-SAT proves to be an efficient and generalizable framework for cross-city urban analytics.
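The zero-padding step can be illustrated with a minimal sketch. It assumes that the frozen encoder expects region-indexed inputs of a fixed size and that the target city has fewer regions, so its adjacency matrix and feature matrix are padded with zeros and a mask marks the real regions. Which tensors are actually padded in RE-SAT, and in which direction, is not specified here; the helper names `pad_square`, `pad_rows`, and `valid_mask` and the toy sizes are hypothetical.

```python
# Illustrative sketch of zero-padding region-indexed inputs; all names and sizes are assumptions.

def pad_square(matrix, target_size):
    """Zero-pad an n x n matrix (list of lists) to target_size x target_size."""
    n = len(matrix)
    if n > target_size:
        raise ValueError("matrix is larger than the target size")
    extra = target_size - n
    padded = [list(row) + [0.0] * extra for row in matrix]
    padded += [[0.0] * target_size for _ in range(extra)]
    return padded


def pad_rows(features, target_size):
    """Zero-pad an n x d feature matrix to target_size x d."""
    n = len(features)
    if n > target_size:
        raise ValueError("feature matrix has more rows than the target size")
    dim = len(features[0])
    return [list(row) for row in features] + [[0.0] * dim for _ in range(target_size - n)]


def valid_mask(n, target_size):
    """True for real regions, False for padded placeholder regions."""
    return [True] * n + [False] * (target_size - n)


# Toy target city with 3 regions; the (hypothetical) encoder expects 5.
adjacency = [[0.0, 1.0, 0.0],
             [1.0, 0.0, 1.0],
             [0.0, 1.0, 0.0]]
features = [[0.2, 0.7], [0.5, 0.1], [0.9, 0.3]]

adjacency_padded = pad_square(adjacency, 5)
features_padded = pad_rows(features, 5)
mask = valid_mask(len(features), 5)
print(len(adjacency_padded), len(adjacency_padded[0]), len(features_padded), mask)
```

Keeping the mask alongside the padded inputs makes it possible to exclude the placeholder regions when computing downstream losses or metrics.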
4.6. Parameter Analysis
Spatial Inductive Bias. The parameter
serves as the spatial inductive bias, governing the spatial awareness of the STE module. Specifically, a larger
encourages the model to emphasize local structural dependencies, whereas a smaller
shifts the focus toward capturing long-range regional correlations. To investigate the sensitivity of model performance to this parameter, we varied
within the range
using the NYC dataset. As illustrated in
Figure 2, for the check-in prediction task, the performance exhibits an upward trend as
increases from 0.2, reaching its zenith at
before subsequently declining. Conversely, for crime prediction, service call estimation, and population prediction, the optimal performance threshold is consistently observed at
. Based on these empirical observations, we conclude that setting
provides a robust and effective trade-off between local and global spatial modeling, yielding superior performance across the majority of downstream tasks for the RE-SAT framework.
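To make the local-versus-global trade-off concrete, the sketch below shows one generic way a spatial bias weight can shift attention toward nearby regions. It assumes the bias enters as a distance penalty subtracted from the attention logits before the softmax; the exact formulation used in the STE module is defined earlier in the paper and may differ. The name `bias_weight` is a hypothetical stand-in for the spatial inductive bias parameter, and the distances and weight values are arbitrary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_row(scores, distances, bias_weight):
    """Attention weights of one query region over all regions.

    scores:      content-based attention logits
    distances:   spatial distance from the query region to each region
    bias_weight: hypothetical stand-in for the spatial inductive bias parameter
    """
    logits = [s - bias_weight * d for s, d in zip(scores, distances)]
    return softmax(logits)


# Equal content scores, so any difference comes from the spatial bias alone.
scores = [0.0, 0.0, 0.0, 0.0]
distances = [0.0, 1.0, 2.0, 3.0]  # made-up hop counts from the query region

for weight in (0.2, 0.5, 0.8):  # arbitrary illustrative values
    row = biased_attention_row(scores, distances, weight)
    print(weight, [round(w, 3) for w in row])
```

Under this assumed form, a larger weight concentrates attention on the closest regions, while a weight near zero leaves attention driven by content similarity alone, which mirrors the qualitative behavior described above.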
Impact of the Number of STE Layers. To investigate the influence of the depth of the STE module on model efficacy, we conducted a sensitivity analysis by varying the number of stacked spatial-aware Transformer layers within the set
using the NYC dataset. As illustrated in
Figure 3, RE-SAT exhibits consistently stable performance across all four downstream tasks—namely, check-in prediction, crime prediction, service call estimation, and population prediction. This consistency underscores the inherent robustness of the STE module in capturing complex spatial dependencies, regardless of moderate variations in its architectural depth. Notably, the configuration with
layers achieves a marginal performance lead compared to other settings, suggesting an optimal balance between model capacity and generalization. Consequently, we fix the number of layers in the STE module at
for all subsequent experiments.
4.7. Running Time of Downstream Task
To investigate the computational overhead incurred during the downstream task learning phase,
Table 5 presents a comparative analysis of the training duration for RE-SAT alongside other representative two-stage frameworks. The empirical results demonstrate that RE-SAT achieves the shortest adaptation time, thereby validating the lightweight nature of our proposed prompt learning paradigm. Specifically, although HREP utilizes a seemingly simpler random initialization strategy, it necessitates an extensive training regimen of up to 6000 epochs to ensure the quality of prompt tuning, which significantly increases the overall execution latency. Furthermore, UrbanCLIP incurs substantial costs due to its auxiliary contrastive learning objectives, while FlexiReg introduces significant overhead by incorporating complex multimodal features (e.g., imagery and text) for prompt enhancement. In contrast, our semantic-guided mechanism achieves a superior performance–efficiency trade-off, delivering higher predictive accuracy while maintaining a significantly lower computational footprint.
5. Discussion
The proposed RE-SAT framework demonstrates substantial improvements in urban region embedding; however, its limitations must be acknowledged, and they point to opportunities for future research.
From a practical standpoint, the primary advantage of RE-SAT lies in its parameter-efficient adaptation. By leveraging semantic-guided prompting, the model circumvents the prohibitive computational costs of retraining large-scale encoders for every new task, offering a scalable decision-support mechanism for urban planners and policymakers. Nevertheless, this efficiency is inherently bounded by data dependency. The generation of high-quality universal embeddings relies heavily on the availability of multi-view data, particularly human mobility trajectories, which are often proprietary, costly to procure, and subject to strict privacy regulations. In data-scarce scenarios where certain modalities are unavailable, the representational capacity of the model inevitably degrades. Developing modality-agnostic architectures or utilizing cross-modal imputation techniques to maintain robustness under severe missing data conditions remains a critical direction for future research.
Methodologically, the current instantiation of the spatial-aware Transformer encoder operates primarily in a transductive setting. The spatial priors injected into the attention mechanism assume a fixed spatial topology defined by an adjacency matrix. Although we effectively mitigated the dimensional mismatch during cross-city transfer via a zero-padding strategy, this approach serves as a heuristic workaround rather than a fundamental solution. To achieve seamless and universal cross-city transferability, transitioning from transductive attention mechanisms to inductive graph representation paradigms—capable of generalizing across urban graphs of arbitrarily varying sizes—will be a primary focus of our subsequent work.