1. Introduction
The exponential growth of geospatial data, particularly Point of Interest (POI) data, has created significant opportunities for advancing location-based services (LBS) [
1]. Applications such as intelligent transportation [
2], urban planning [
3], and business analytics [
4] rely heavily on an accurate and comprehensive understanding of POIs. A POI inherently encapsulates two fundamental types of information: intrinsic attributes, such as geographic coordinates and categories, and spatial context, which reflects the relationships among POIs within their environment [
5]. Robust information representations enable downstream models to better understand POI semantics, uncover spatial patterns, and generalize across diverse geospatial contexts. Effectively capturing and integrating these diverse sources of information into feature vectors is critical for downstream tasks such as classification [
6], prediction [
7], and recommendation [
8].
To address the need for effective POI representation, existing methods primarily focus on modeling intrinsic attributes and spatial contexts. For intrinsic attributes, coordinate encoders are used to process geographic features [
9], while text encoders capture semantic information from POI categories and descriptions [
10]. These encoded features are then fused to form unified representations [
5]. For spatial context, graph neural networks (GNNs) have become a dominant approach due to their ability to model relational dependencies among POIs [
11,
12,
13]. By aggregating information from neighboring POIs, GNNs enable the encoding of spatial structures and dependencies.
Despite their utility, POI datasets present unique challenges, the most notable being label sparsity, which significantly hampers the effectiveness of supervised learning approaches. In POI datasets, labels are typically limited to categorical information, which is often incorporated as part of the input features rather than being used as targets for learning tasks. This lack of diverse and task-specific labels restricts the range of meaningful supervised tasks that can be formulated, making it challenging for models to learn reliable representations. These limitations have driven increasing interest in self-supervised learning (SSL), a paradigm that derives meaningful representations through pretext tasks leveraging the inherent structure of the data [
14]. SSL has gained substantial traction in domains such as computer vision [
15,
16,
17,
18] and natural language processing [
19,
20,
21], where it has successfully mitigated the dependence on labeled data and unlocked novel opportunities for representation learning.
Self-supervised learning (SSL) methods for POI representation can be broadly categorized into skip-gram-based methods and GNN-based methods, each with distinct strengths and limitations. Skip-gram-based methods draw inspiration from Word2Vec [
22], treating the spatial neighborhood of a POI as its context and minimizing errors in predicting this context using POI embeddings [
23,
24]. These methods excel at capturing localized spatial relationships and category-level similarities. GNN-based methods construct graphs based on spatial proximity and leverage graph neural networks to encode local structures into latent representations [
25,
26]. These approaches offer greater flexibility, allowing for predictive tasks such as reconstructing node attributes or inferring relationships. While both approaches offer valuable insights, they face notable limitations. Skip-gram methods are constrained by their reliance on local pairwise relationships, which restrict their ability to capture more complex spatial structures. GNN-based contrastive learning relies heavily on the quality of graph construction, which determines how well spatial relationships are represented, and the design of pretext tasks, which guide the learning process.
Recently, masked modeling, a powerful SSL paradigm, has emerged as a promising direction for representation learning. By masking and reconstructing parts of the data, this approach enables models to capture contextual dependencies and semantic features effectively. Notable examples include masked language modeling [
27,
28] and masked image modeling [
29,
30], which have demonstrated significant success in natural language processing and computer vision by designing tasks that exploit the inherent structure of textual and visual data. Inspired by these advancements, researchers have begun to explore the potential of masked modeling for POI representation learning.
For POI data, masked modeling has primarily been implemented through sequence-based approaches, such as GeoBERT [
31] and SpaBERT [
32]. These methods transform POI attributes into textual sequences and apply masked language modeling to predict missing components. While effective in some scenarios, these approaches often neglect or oversimplify the spatial relationships among POIs during the transformation process, leading to a partial loss of critical spatial context. This limitation highlights the need for innovative methods that can simultaneously preserve and leverage the structural integrity of POI data for more comprehensive representation learning.
To address these limitations, this paper introduces MaskPOI, a novel self-supervised learning framework that combines the strengths of graph neural networks (GNNs) and masked modeling. MaskPOI is designed to jointly capture both the spatial context and intrinsic attributes of POIs. By leveraging GNNs to model spatial relationships and implementing a graph-specific masking strategy, MaskPOI ensures the preservation of structural and spatial integrity while enabling robust and generalizable representation learning.
The proposed framework consists of two main components:
An edge mask-based graph autoencoder, which predicts the existence of edges in the graph. This module focuses on modeling the spatial topology of POIs and uncovering hidden spatial relationships that may not be explicitly annotated in the data.
A feature mask-based graph autoencoder, which masks and reconstructs node features. This module allows the model to deeply explore attribute characteristics, leading to richer and more distinguishable representations.
Both components are integrated into a GNN encoder–decoder architecture, where the GNN encoder extracts graph-level representations and the edge and feature decoders reconstruct spatial and attribute relationships, respectively. These self-supervised tasks enable the model to learn high-quality POI representations without requiring explicit supervision, capturing the complex interplay between spatial and intrinsic attribute information.
The primary contributions of this paper are summarized as follows:
We propose a self-supervised learning framework based on graph neural networks, which jointly learns the node and edge features in the graph to capture both spatial and attribute information between POIs. This approach generates high-quality POI representations by effectively utilizing the structural characteristics of POI data, thereby improving the accuracy and generalization ability of representation learning.
We design an edge mask-based graph modeling task that enhances the modeling of spatial relationships between POIs by predicting the existence of edges. This approach captures the spatial topology of POIs and uncovers potential spatial relationships that may not be explicitly labeled, thereby improving the model’s ability to characterize spatial interactions.
We propose a feature mask-based graph modeling task, which focuses on learning by masking and reconstructing POI node features. This method enables the deep exploration of POI attribute features, resulting in richer and more distinguishable representations that significantly improve performance on downstream tasks.
We conduct extensive experiments in Beijing and Xiamen, evaluating the proposed framework on two downstream tasks: functional zone classification and population density prediction. The results demonstrate the superiority of MaskPOI in practical applications. Additionally, through ablation experiments, we provide an in-depth analysis of each module’s effectiveness, further validating the innovation and practicality of our approach.
3. Problem Formulation
Let $\mathcal{P} = \{p_1, p_2, \ldots, p_t\}$ represent a collection of $t$ POIs. Each POI $p_i$ (e.g., a public park or a shopping mall) is characterized by its two-dimensional geographic coordinates $(x_i, y_i)$ and a categorical set $C_i$. The set is expressed as $C_i = \{c_i^1, c_i^2, \ldots, c_i^L\}$, where $c_i^j$ specifies the category of $p_i$ at the $j$-th hierarchical level, and $L$ denotes the depth of the hierarchy (commonly $L = 3$ for standard POI datasets). For instance, a POI's categories might include {food service, fast food restaurant, McDonald's}, illustrating the hierarchical structure where broader first-level categories encompass multiple specific second-level categories.
POI representation learning aims to map each $p_i$ into a low-dimensional vector space $\mathbb{R}^d$ such that the vector $\mathbf{z}_i \in \mathbb{R}^d$ effectively captures the spatial, categorical, and potentially other contextual attributes of the POI. The goal of representation learning is to ensure that the learned embeddings can be utilized in downstream tasks, such as POI recommendation, clustering, or spatial analysis, while preserving meaningful relationships between POIs, including geographic proximity, categorical similarity, and other latent connections.
4. Methodology
4.1. Overview of MaskPOI
MaskPOI is a self-supervised graph neural network framework designed for POI representation learning. It adopts a dual-branch architecture to simultaneously leverage spatial relationships and feature dependencies within the constructed POI graph structure. The overall architecture of MaskPOI is depicted in
Figure 1a. The core idea of MaskPOI is to randomly mask edges or node features in the graph, process the graph through the GNN encoder, and then predict the masked components using separate decoders.
Firstly, a graph is constructed where the nodes represent POIs and the edges capture the spatial relationships between them. Each POI node is initialized with a feature vector that encodes its attributes, including the location and category.
The core of MaskPOI is the GNN encoder, which aggregates information from neighboring nodes to produce contextualized node embeddings. To enhance the robustness of the learned embeddings, MaskPOI introduces two key self-supervised learning mechanisms:
Edge Masking: A subset of edges is randomly masked, and the model learns to reconstruct these masked connections using the encoded node embeddings. This mechanism encourages the model to infer missing relationships based on the structural information in the graph.
Feature Masking: A subset of node features is randomly masked, and the model is trained to reconstruct these masked features. This task forces the model to understand and predict the underlying characteristics of POIs based on their graph context.
Finally, the outputs from the GNN encoder are fed into two separate decoders: the edge decoder, which predicts the presence of masked edges, and the feature decoder, which reconstructs the masked feature values. The combination of these two reconstruction tasks forms the self-supervised training objective, ensuring the learned embeddings capture both structural- and feature-level dependencies.
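To make this dual-branch design concrete, the following is a minimal sketch assuming PyTorch Geometric; the class and attribute names (e.g., MaskPOISketch, edge_decoder) are illustrative rather than the authors' implementation, and single-layer GCN modules are assumed for the encoder and feature decoder as reported in the implementation details.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv


class MaskPOISketch(nn.Module):
    """Illustrative dual-branch masked graph autoencoder: a shared GCN encoder,
    an MLP edge decoder, and a GCN feature decoder (names are hypothetical)."""

    def __init__(self, in_dim: int, hid_dim: int = 128):
        super().__init__()
        self.encoder = GCNConv(in_dim, hid_dim)              # shared GNN encoder
        self.edge_decoder = nn.Sequential(                   # predicts masked edges
            nn.Linear(hid_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, 1))
        self.feat_decoder = GCNConv(hid_dim, in_dim)          # reconstructs masked features
        self.mask_token = nn.Parameter(torch.zeros(1, in_dim))  # learnable [MASK] vector

    def encode(self, x, edge_index):
        return torch.relu(self.encoder(x, edge_index))

    def decode_edges(self, z, edge_pairs):                    # edge_pairs: [2, num_edges]
        h = z[edge_pairs[0]] * z[edge_pairs[1]]               # elementwise product z_u ⊙ z_v
        return torch.sigmoid(self.edge_decoder(h)).squeeze(-1)

    def decode_features(self, z, edge_index):
        return self.feat_decoder(z, edge_index)
```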
MaskPOI’s architecture is designed to generalize across various downstream tasks, as shown in
Figure 1b. The GNN encoder produces task-agnostic representations for individual POIs, which can be directly utilized for node-level tasks such as POI classification or clustering. For region-level tasks, such as urban functional zone classification or population density prediction, an additional aggregation step combines embeddings of POIs within the same region to capture macro-scale patterns. The resulting representations, whether node level or region level, are passed to a simple linear probe for efficient task-specific predictions, showcasing the versatility of MaskPOI across various application scenarios.
4.2. Graph Construction and POI Feature Generation
To accurately represent the Points of Interest (POIs) and their spatial interconnections, we constructed a topological graph grounded in the road network. This design captures both the functional diversity of POIs and their real-world spatial relationships.
The graph construction process begins by dividing the study area into spatial grids. In each grid, roads and their vertices are added as edges and nodes to form the initial structural framework, as shown in
Figure 2a. These nodes do not carry any attribute information. POIs, which are not located directly on roads, are connected to the road network based on the shortest distance to road segments. If the nearest point on a road segment does not coincide with an existing road vertex, a new intersection node is added at this location. The POI is then connected to this newly created node, ensuring an accurate representation of spatial relationships, as shown in
Figure 2b.
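As an illustration of this snapping step, the sketch below connects one POI to its nearest road segment using Shapely; the function name and the toy geometry are assumptions for illustration, not the actual preprocessing code.

```python
from shapely.geometry import Point, LineString
from shapely.ops import nearest_points


def attach_poi_to_road(poi_xy, road_segments):
    """Connect a POI to its nearest road segment; if the snap point is not an
    existing road vertex, it becomes a new intersection node (illustrative only)."""
    poi = Point(poi_xy)
    seg = min(road_segments, key=lambda s: s.distance(poi))   # nearest road segment
    snap = nearest_points(seg, poi)[0]                         # closest point on that road
    is_existing_vertex = any(snap.equals(Point(c)) for c in seg.coords)
    return (snap.x, snap.y), is_existing_vertex


# toy usage: one road segment and one off-road POI
roads = [LineString([(0, 0), (10, 0)])]
node, reuse = attach_poi_to_road((3.0, 2.0), roads)   # -> (3.0, 0.0), False
```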
After constructing the topological graph, we generate the initial features for the nodes based on the POI attributes. The initial features for POIs combine semantic and spatial information to enable robust representation learning. Semantic attributes, including a three-level categorical hierarchy (e.g., food service, fast food restaurant, McDonald's), are encoded using a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model [28]. BERT is pre-trained on a large corpus using Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks, enabling it to learn deep contextual representations. The model processes input text by tokenizing it into subword units, adding special tokens (e.g., [CLS]), and passing the sequence through multiple Transformer layers. The [CLS] token's output is commonly used as a global sentence representation, while the remaining tokens capture token-level semantics. By extracting 768-dimensional embeddings from the [CLS] token of each category description, BERT effectively captures semantic relationships between different POI categories.
To incorporate spatial information, the geographic coordinates of POIs are normalized within the bounding box of the grid. These normalized coordinates, consisting of two values (latitude and longitude), are concatenated with the BERT-based semantic embeddings. As a result, the final feature vector for each POI node has a total of $3 \times 768 + 2 = 2306$ dimensions, effectively capturing both spatial and semantic characteristics. The process can be expressed by the following formula:
$$\mathbf{x}_i = \Bigl[\, \mathrm{BERT}(c_i^1) \,\Vert\, \mathrm{BERT}(c_i^2) \,\Vert\, \mathrm{BERT}(c_i^3) \,\Vert\, \tfrac{x_i - x_{\min}}{x_{\max} - x_{\min}} \,\Vert\, \tfrac{y_i - y_{\min}}{y_{\max} - y_{\min}} \,\Bigr],$$
where $c_i^1, c_i^2, c_i^3$ represent the three-level categorical attributes of the POI; $\mathrm{BERT}(C)$ denotes the embedding generated by the BERT model for the category $C$; $(x_i, y_i)$ are the geographic coordinates of the POI; and $x_{\min}, x_{\max}, y_{\min}, y_{\max}$ are the minimum and maximum coordinate values of the grid.
Road intersection nodes, which lack semantic attributes, are assigned feature vectors consisting of normalized coordinates concatenated with zero vectors matching the dimensionality of the BERT embeddings. This uniform feature dimensionality across all nodes ensures consistency and facilitates the learning process.
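A minimal sketch of this feature construction follows, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (the specific pre-trained BERT variant is an assumption); poi_feature and road_node_feature are hypothetical helper names.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()


def category_embedding(text):
    """768-d [CLS] embedding for one category description."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[:, 0, :].squeeze(0)   # [CLS] token output


def poi_feature(levels, xy, bbox):
    """Concatenate the three category embeddings with grid-normalized coordinates."""
    (x, y), (xmin, ymin, xmax, ymax) = xy, bbox
    coords = torch.tensor([(x - xmin) / (xmax - xmin), (y - ymin) / (ymax - ymin)])
    return torch.cat([category_embedding(c) for c in levels] + [coords])  # 3*768 + 2 = 2306


def road_node_feature(xy, bbox):
    """Road intersection nodes: zero semantic part plus normalized coordinates."""
    (x, y), (xmin, ymin, xmax, ymax) = xy, bbox
    coords = torch.tensor([(x - xmin) / (xmax - xmin), (y - ymin) / (ymax - ymin)])
    return torch.cat([torch.zeros(3 * 768), coords])
```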
By integrating POIs into road networks and leveraging semantic embeddings from BERT, the constructed graph effectively combines spatial and attribute information. This design not only reflects real-world spatial interactions but also enhances the interpretability and utility of the learned POI representations. The impact of different graph construction methods and feature encoding strategies is discussed in
Section 7.1 and
Section 7.2. We compared alternative graph construction techniques, such as Delaunay triangulation, with the proposed road-based method. Additionally, the effectiveness of the BERT-based embeddings was evaluated in comparison to simpler encoding approaches, such as one-hot encoding.
4.3. Edge Mask Modeling
We adopted an edge-wise random masking strategy to facilitate the edge prediction task in MaskPOI. Given a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, we randomly sampled a subset of edges $\mathcal{E}_{\mathrm{mask}}$ as the masked edges and retained the remaining edges $\mathcal{E}_{\mathrm{vis}}$ to construct the visible graph $\mathcal{G}_{\mathrm{vis}} = (\mathcal{V}, \mathcal{E}_{\mathrm{vis}})$. The relationship between these components satisfies $\mathcal{E} = \mathcal{E}_{\mathrm{mask}} \cup \mathcal{E}_{\mathrm{vis}}$ with $\mathcal{E}_{\mathrm{mask}} \cap \mathcal{E}_{\mathrm{vis}} = \emptyset$, ensuring that the masked and visible graphs together represent the original graph structure.
The masked edges $\mathcal{E}_{\mathrm{mask}}$ are sampled based on a Bernoulli distribution, where each edge in $\mathcal{E}$ is independently selected for masking with a probability $p$. Formally, the sampling process can be expressed as:
$$\mathcal{E}_{\mathrm{mask}} = \{\, e \in \mathcal{E} \mid m_e = 1,\; m_e \sim \mathrm{Bernoulli}(p) \,\},$$
where $p \in [0, 1]$ defines the edge masking ratio. This ratio determines the proportion of edges that are removed and thus controls the difficulty of the edge reconstruction task. A higher $p$ introduces more masked edges, forcing the model to rely more heavily on the remaining graph context for prediction.
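A possible implementation of this Bernoulli edge sampling is sketched below; sample_edge_mask is a hypothetical helper, not the authors' code.

```python
import torch


def sample_edge_mask(edge_index, p=0.3):
    """Bernoulli edge masking: each edge is masked independently with probability p.
    Returns the visible edges (kept for message passing) and the masked edges
    (targets for the edge decoder)."""
    num_edges = edge_index.size(1)
    mask = torch.rand(num_edges) < p                   # m_e ~ Bernoulli(p)
    return edge_index[:, ~mask], edge_index[:, mask]   # visible, masked


# toy usage
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]])
visible_edges, masked_edges = sample_edge_mask(edge_index, p=0.3)
```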
The GNN encoder in the proposed MaskPOI framework is implemented using a multi-layer Graph Convolutional Network (GCN), which is designed to capture both local and global node features within the graph. The encoder takes the graph structure (edge index) and node features as inputs and processes them through multiple GCN layers. Each layer propagates and aggregates neighborhood information to produce richer node embeddings. The shared GNN encoder is responsible for producing node embeddings for both the edge-masking and feature-masking tasks. For each task, the input graph is processed by the encoder, and the resulting node embeddings are passed to the respective decoders.
The edge decoder in MaskPOI reconstructs the masked edges by predicting their existence based on the node embeddings generated by the GNN encoder. For a given masked edge $(u, v)$, where $(u, v) \in \mathcal{E}_{\mathrm{mask}}$, the decoder first combines the node embeddings $\mathbf{z}_u$ and $\mathbf{z}_v$ through an elementwise product:
$$\mathbf{h}_{uv} = \mathbf{z}_u \odot \mathbf{z}_v,$$
where $\mathbf{z}_u, \mathbf{z}_v \in \mathbb{R}^d$ are the $d$-dimensional embeddings of nodes $u$ and $v$, and $\odot$ represents the elementwise product. This operation captures the pairwise interaction between the nodes.
The combined representation $\mathbf{h}_{uv}$ is passed through a multi-layer perceptron (MLP) to model non-linear relationships, and a sigmoid activation function is applied to compute the probability of the edge's existence:
$$\hat{y}_{uv} = \sigma\bigl(\mathrm{MLP}(\mathbf{h}_{uv})\bigr),$$
where $\sigma(\cdot)$ is the sigmoid function, and $\hat{y}_{uv}$ represents the predicted likelihood of the edge $(u, v)$ existing in the graph.
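The decoder computation described above could look as follows in PyTorch; edge_score and the toy two-layer MLP are illustrative assumptions, not the exact architecture.

```python
import torch
import torch.nn as nn


def edge_score(z, edge_pairs, mlp):
    """sigma(MLP(z_u ⊙ z_v)) for each candidate edge (u, v); edge_pairs is [2, E]."""
    h = z[edge_pairs[0]] * z[edge_pairs[1]]     # elementwise product of node embeddings
    return torch.sigmoid(mlp(h)).squeeze(-1)    # predicted existence probability per edge


# toy usage with random embeddings and a small MLP decoder
z = torch.randn(5, 128)                                            # 5 nodes, d = 128
mlp = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
probs = edge_score(z, torch.tensor([[0, 1], [2, 3]]), mlp)         # edges (0,2) and (1,3)
```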
4.4. Feature Mask Modeling
To facilitate the self-supervised learning task of feature reconstruction, MaskPOI employs a feature masking strategy combined with a dedicated decoder for reconstructing node attributes. Given the input node feature matrix X, the masking process selects a portion of the nodes to mask based on a predefined masking ratio p. For each selected node, its features are replaced with a learnable masking token.
The masked node features and the graph structure are input into the shared GNN encoder mentioned above. The shared encoder ensures that the learned embeddings capture both structural- and feature-level dependencies. The output embeddings Z from the encoder serve as the input for the feature decoder, enabling the reconstruction of the original node features. The feature decoder in MaskPOI is implemented using a GCN instead of a standard MLP. By leveraging the GCN layers, the decoder incorporates structural relationships from the graph, effectively utilizing neighborhood information to reconstruct features.
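A brief sketch of this feature-masking branch is given below, assuming PyTorch Geometric; the 2306-dimensional toy features and the helper name mask_node_features are illustrative.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv


def mask_node_features(x, mask_token, p=0.6):
    """Replace the features of a random subset of nodes with a learnable token."""
    masked = torch.rand(x.size(0)) < p
    x_masked = x.clone()
    x_masked[masked] = mask_token          # broadcast the [MASK] vector onto masked rows
    return x_masked, masked


# toy usage: encode masked features, then reconstruct them with a GCN decoder
x = torch.randn(6, 2306)                                   # 6 nodes with 2306-d features
edge_index = torch.tensor([[0, 1, 2, 3, 4], [1, 2, 3, 4, 5]])
mask_token = nn.Parameter(torch.zeros(2306))
x_masked, masked = mask_node_features(x, mask_token)

encoder = GCNConv(2306, 128)
decoder = GCNConv(128, 2306)                               # GCN decoder uses neighborhood context
x_rec = decoder(torch.relu(encoder(x_masked, edge_index)), edge_index)
```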
4.5. Training Objective
The training objective of MaskPOI aims to learn robust POI representations through self-supervised learning by optimizing two primary reconstruction tasks: edge reconstruction and feature reconstruction. By addressing these tasks simultaneously, the model effectively captures both the structural relationships and the feature-level dependencies among POIs.
Edge Reconstruction Task: The edge reconstruction task aims to predict the existence of masked edges using the latent node representations. The edge decoder is tasked with predicting whether the masked edges (positive samples) exist, as well as predicting a set of randomly sampled non-existent edges (negative samples). The edge reconstruction loss is computed by comparing the predicted probabilities of the edges' existence with the actual labels, as follows:
$$\mathcal{L}_{\mathrm{edge}} = -\frac{1}{|\mathcal{E}_{\mathrm{mask}}|} \sum_{(u,v) \in \mathcal{E}_{\mathrm{mask}}} \log \sigma\bigl(h^{+}_{uv}\bigr) \;-\; \frac{1}{|\mathcal{E}^{-}|} \sum_{(u',v') \in \mathcal{E}^{-}} \log \bigl(1 - \sigma\bigl(h^{-}_{u'v'}\bigr)\bigr),$$
where $h^{+}_{uv}$ and $h^{-}_{u'v'}$ represent the decoder's predicted scores for the masked edges and the randomly sampled non-existent edges $\mathcal{E}^{-}$, respectively, and $\sigma(\cdot)$ denotes the sigmoid activation function. The first term penalizes the incorrect predictions of the masked edges (positive samples), while the second term penalizes the incorrect predictions of the negative edges (non-existent samples), encouraging the model to learn to distinguish between actual edges and random samples.
Feature Reconstruction Task: The feature reconstruction task aims to reconstruct the masked node features using the graph structure and the remaining visible node features. The feature decoder predicts the original feature values for the masked nodes. The reconstruction loss is computed based on the cosine similarity between the reconstructed feature vector and the original feature vector for each masked node:
$$\mathcal{L}_{\mathrm{feat}} = \frac{1}{|\mathcal{V}_{\mathrm{mask}}|} \sum_{i \in \mathcal{V}_{\mathrm{mask}}} \left( 1 - \frac{\hat{\mathbf{x}}_i^{\top} \mathbf{x}_i}{\lVert \hat{\mathbf{x}}_i \rVert \, \lVert \mathbf{x}_i \rVert} \right),$$
where $\hat{\mathbf{x}}_i$ is the reconstructed feature vector for node $i$, $\mathbf{x}_i$ is the ground truth feature vector for node $i$, and $\mathcal{V}_{\mathrm{mask}}$ denotes the set of masked nodes. The cosine similarity loss measures how similar the reconstructed and original feature vectors are, encouraging the model to preserve important feature-level information while learning the dependencies among POIs.
Total Training Loss: The total training objective combines both the edge reconstruction loss and the feature reconstruction loss as follows:
$$\mathcal{L} = \mathcal{L}_{\mathrm{edge}} + \mathcal{L}_{\mathrm{feat}}.$$
This multi-task learning objective allows the MaskPOI model to learn both structural and feature dependencies in the graph, resulting in robust and generalizable POI embeddings.
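To tie the two objectives together, the sketch below shows one way to compute them in PyTorch; negative edges could be drawn with torch_geometric's negative_sampling utility, and the unweighted sum of the two terms is an assumption consistent with the description above.

```python
import torch
import torch.nn.functional as F
from torch_geometric.utils import negative_sampling  # used to draw non-existent edges


def edge_loss(pos_scores, neg_scores):
    """Binary cross-entropy over masked (positive) and sampled non-existent (negative)
    edges; scores are assumed to be post-sigmoid probabilities."""
    pos = F.binary_cross_entropy(pos_scores, torch.ones_like(pos_scores))
    neg = F.binary_cross_entropy(neg_scores, torch.zeros_like(neg_scores))
    return pos + neg


def feature_loss(x_rec, x_true, masked):
    """1 - cosine similarity, averaged over the masked nodes."""
    cos = F.cosine_similarity(x_rec[masked], x_true[masked], dim=-1)
    return (1.0 - cos).mean()


# total objective (illustrative): unweighted sum of the two reconstruction terms
# neg_edges = negative_sampling(edge_index, num_nodes)
# loss = edge_loss(pos_scores, neg_scores) + feature_loss(x_rec, x, masked)
```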
5. Experiment Settings
5.1. Study Areas and Data
This research selects Beijing and Xiamen as representative cities of northern and southern China, respectively. The study area in Beijing focuses on the central urban area, approximately a 9 km square centered on Tiananmen Square, effectively covering the city's core, as illustrated in
Figure 3a. This region represents the economic and administrative center of Beijing, characterized by a high degree of urbanization and a dense concentration of POIs, including commercial, residential, cultural, and governmental facilities. The study area in Xiamen is confined to Xiamen Island, as illustrated in
Figure 3b. Xiamen Island serves as the political, economic, and cultural hub of Xiamen, exhibiting a compact urban structure. The island’s unique geographic setting and its role as a major port city contribute to its distinct urban dynamics compared to Beijing.
The POIs used in this study were collected from Amap
https://lbs.amap.com/ (accessed on 20 February 2025) and represent a wide range of urban facilities and services. A total of 206,331 POIs were included in the dataset for Beijing, categorized into a hierarchical structure with three levels as follows: 23 first-level categories (e.g., food services, retail, public services), 193 second-level categories (e.g., fast food restaurants, shopping malls, government agencies), and 587 third-level categories (e.g., McDonald’s, Starbucks, local grocery stores). A total of 48,282 POIs were included in the dataset for Xiamen, categorized into 14 first-level categories, 93 second-level categories, and 437 third-level categories. Each POI is associated with its geographic coordinates and three category labels from the hierarchical structure, providing both spatial and functional context for analysis. In addition, the latest OpenStreetMap (OSM) road vector data for Beijing was used in this study
https://www.openstreetmap.org/ (accessed on 20 February 2025).
To construct the topological network, both POIs and roads were divided into grids with a cell size of 500 m. This grid-based approach allows for efficient spatial analysis by associating POIs and roads with specific grid cells, facilitating the creation of a topological network that captures the relationships between geographic features within the study area.
Similar to the approach in study [
50], we validated the learned POI representations on two downstream tasks: urban functional zone classification and population density prediction. The ground truth data of urban functional zone classification are shown in
Figure 3. The urban functions of Beijing are sourced from EULUC-China
https://data-starcloud.pcl.ac.cn/resource/7 (accessed on 20 February 2025) [
51]. EULUC-China is a comprehensive land use dataset for China, developed through the integration of remote sensing imagery and GIS, providing detailed information on land use types across the country. It includes 12 categories: transportation, road, sports and cultural, park and green space, medical and healthcare, commercial service, business, residential, industrial, educational and research, airport, and administrative office. The urban functions of Xiamen are derived from the Urbanscape Essential Dataset of Peking University
http://geoscape.pku.edu.cn/en.html (accessed on 20 February 2025), which offers comprehensive spatial data on 12 urban functions across the study areas. These include forest, water, undeveloped land, transportation, green space, industrial areas, educational and governmental facilities, commercial zones, residential (type 1, 2, 3), and agricultural land. The dataset was compiled using a combination of remote sensing data, POI information, and human-driven corrections and adjustments. Each grid in the study area is assigned a 12-dimensional vector, representing the proportion of urban function in each of these categories. This vector is computed by overlaying the spatial grid with the urban function dataset, ensuring that each grid captures the relative distribution of these functional zones.
The population density data is sourced from WorldPop. We utilized the most recent dataset, representing the spatial distribution of China’s population in 2020 with a resolution of 100 m
https://hub.worldpop.org/geodata/summary?id=49730 (accessed on 20 February 2025). By overlaying this dataset with the spatial grid, we calculated the population density for each grid cell, expressed as the number of people per 100 square meters.
5.2. Implementation Details of MaskPOI
The implementation of MaskPOI involved the systematic tuning of hyperparameters and the selection of model components to optimize performance. For edge masking, a range of mask rates between 0.1 and 1.0 was explored, with a rate of 0.3 selected as optimal. Similarly, the feature masking rates were tuned between 0.1 and 0.9, leading to a chosen rate of 0.6. The dimensionality of node representations was adjusted between 32 and 256, with the best performance observed at a dimension of 128. Both the GNN encoder and the feature decoder were implemented as single-layer Graph Convolutional Networks (GCNs). The analysis of these parameters can be found in
Section 6.2.
The training process employed a learning rate of with a cosine annealing scheduler, a weight decay of , and a batch size of . The model was trained for 100 epochs on an NVIDIA A800 GPU with 80 GB of memory, enabling the efficient handling of large-scale graph data. For the experiments conducted on the Beijing dataset, the model used approximately 76,524 MB of GPU memory, with a training time of around 10 min. For the Xiamen dataset, the model consumed 16,278 MB of GPU memory, and the training time was approximately 3 min.
For the downstream experiments, we used Set2Set [
24,
52,
53] to obtain the region representations. We randomly selected 80% of the data for training, 10% for validation, and 10% for testing. Each experiment was run 5 times with random initialization. The mean and standard deviation are reported.
5.3. Compared Methods
To evaluate the effectiveness of our proposed approach, we compared it with several state-of-the-art methods for POI representation learning and graph self-supervised learning. These methods include the following:
Semantic Embedding [
24]: A method that utilizes textual or categorical attributes of POIs to learn representations, focusing on semantic similarity between POIs.
DeepWalk [
54]: A classic graph-based representation learning algorithm that performs random walks on graphs to generate node embeddings. It is effective for learning structural information and is commonly used in spatial and networked data analyses.
Node2Vec [
55]: An extension of DeepWalk that introduces a biased random walk strategy to capture both local and global graph structures, enabling the embedding to encode more diverse contextual information.
DGI (Deep Graph Infomax) [
39]: A graph self-supervised learning method that maximizes mutual information between local node embeddings and the global graph representation. This approach has proven effective in unsupervised learning on graph data.
GAE (Graph Autoencoder) [
46]: A graph neural network-based autoencoder model that learns low-dimensional representations by reconstructing the graph structure. It is commonly used for unsupervised node and graph representation learning.
GraphMAE (Graph Masked Autoencoder) [
49]: A self-supervised graph learning model that uses masked graph modeling to learn robust embeddings by reconstructing missing node or edge features.
MaskGAE (Masked Graph Autoencoder) [
56]: An advanced variant of graph autoencoder that integrates masking mechanisms during encoding to enhance the model’s ability to generalize and capture latent patterns in graph data.
UniMP (Unified Message Passing Model) [
57]: A model that unifies feature and label propagation within a Graph Transformer and uses a masked label prediction strategy. It adopts the vanilla multi-head attention of the Transformer in graph learning, taking the node features and label embeddings as input for information propagation between nodes.
GCA (Graph Contrastive Learning with Adaptive Augmentation) [
58]: An unsupervised graph representation learning method that generates two graph views via adaptive augmentation at the topology and node-attribute levels and trains the model with a contrastive loss, outperforming state-of-the-art methods in node classification tasks.
5.4. Evaluation Metrics
For the evaluation of the proposed methods, we adopted different metrics tailored to the tasks. To measure the similarity between vectors in the urban functional composition classification task, we employed the following metrics; ↓ indicates that smaller values are better, while ↑ indicates that larger values are better.
L1 Distance (L1)↓: The L1 distance between the estimated functional composition vector $\hat{\mathbf{y}}_i$ and the ground truth vector $\mathbf{y}_i$ for region $i$ is defined as:
$$L1_i = \sum_{k=1}^{K} \lvert \hat{y}_{i,k} - y_{i,k} \rvert.$$
Kullback–Leibler Divergence (KL)↓: The KL divergence measures the difference between the estimated functional composition $\hat{\mathbf{y}}_i$ and the ground truth $\mathbf{y}_i$:
$$KL_i = \sum_{k=1}^{K} y_{i,k} \log \frac{y_{i,k}}{\hat{y}_{i,k}}.$$
Cosine Similarity (Cosine)↑: The cosine similarity between the estimated functional composition $\hat{\mathbf{y}}_i$ and the ground truth $\mathbf{y}_i$ is computed as:
$$\mathrm{Cosine}_i = \frac{\hat{\mathbf{y}}_i^{\top} \mathbf{y}_i}{\lVert \hat{\mathbf{y}}_i \rVert \, \lVert \mathbf{y}_i \rVert},$$
where $\hat{y}_{i,k}$ is the estimated proportion of the function type $k$ that region $i$ bears, $y_{i,k}$ is the corresponding ground truth proportion, and both satisfy $\sum_{k=1}^{K} \hat{y}_{i,k} = 1$ and $\sum_{k=1}^{K} y_{i,k} = 1$.
For the regression task of predicting the population density, we used the following metrics:
Root Mean Square Error (RMSE)↓: The RMSE evaluates the standard deviation of the prediction errors between the estimated population density $\hat{d}_i$ and the ground truth $d_i$ for all regions, defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \bigl(\hat{d}_i - d_i\bigr)^2}.$$
Mean Absolute Error (MAE)↓: The MAE computes the average magnitude of errors between the estimated population density $\hat{d}_i$ and the ground truth $d_i$, defined as:
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert \hat{d}_i - d_i \rvert.$$
Coefficient of Determination ($R^2$)↑: The $R^2$ metric assesses the proportion of variance in the ground truth population density $d_i$ that is predictable from the estimated population density $\hat{d}_i$, defined as:
$$R^2 = 1 - \frac{\sum_{i=1}^{N} \bigl(d_i - \hat{d}_i\bigr)^2}{\sum_{i=1}^{N} \bigl(d_i - \bar{d}\bigr)^2},$$
where $\hat{d}_i$ is the estimated population density for region $i$, $d_i$ is the corresponding ground truth, $\bar{d}$ is the mean of the ground truth population density, and $N$ is the total number of regions.
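The metrics above can be computed directly; the NumPy reference sketch below assumes probability-vector inputs for the classification metrics and adds a small epsilon for numerical stability (the epsilon is not part of the original definitions).

```python
import numpy as np


def l1_distance(y_hat, y):                     # per-region functional composition vectors
    return np.abs(y_hat - y).sum(axis=-1)


def kl_divergence(y_hat, y, eps=1e-12):        # eps avoids log(0) and division by zero
    return (y * np.log((y + eps) / (y_hat + eps))).sum(axis=-1)


def cosine_similarity(y_hat, y):
    return (y_hat * y).sum(-1) / (np.linalg.norm(y_hat, axis=-1) * np.linalg.norm(y, axis=-1))


def rmse(d_hat, d):
    return np.sqrt(np.mean((d_hat - d) ** 2))


def mae(d_hat, d):
    return np.mean(np.abs(d_hat - d))


def r2(d_hat, d):
    return 1.0 - np.sum((d - d_hat) ** 2) / np.sum((d - d.mean()) ** 2)
```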
These metrics comprehensively evaluate the effectiveness of the proposed methods across both classification and regression tasks, providing robust and interpretable performance measures.
6. Results
6.1. Performance
We evaluated the performance of MaskPOI on two downstream tasks, urban functional zone classification and population density prediction, in both Beijing and Xiamen.
The result of Beijing is shown in
Table 1. MaskPOI outperforms all other models across most metrics on both tasks. In urban functional zone classification, MaskPOI achieves the best results in KL divergence (0.834 ± 0.013) and cosine similarity (0.761 ± 0.003). It also performs exceptionally well in population density prediction, achieving the lowest RMSE (0.133 ± 0.003) and MAE (0.1 ± 0.002), along with the highest
$R^2$ value (0.436 ± 0.005). These results demonstrate the superior ability of MaskPOI to capture complex spatial and functional relationships in the data, highlighting its effectiveness in both tasks.
In contrast, models such as DeepWalk and Node2Vec show relatively lower performance across all metrics, particularly in the urban functional zone classification task. While Semantic exhibits promising results in L1, MaskPOI consistently shows superior performance, making it the most effective model for these types of POI-based tasks.
The result of Xiamen is shown in
Table 2. MaskPOI demonstrates superior performance across both urban function classification and population prediction tasks. It achieves the lowest KL divergence (0.791) for urban function classification, along with the highest cosine similarity (0.703), indicating its ability to effectively capture urban function patterns. For population prediction, MaskPOI also excels, attaining the lowest RMSE (0.173) and MAE (0.138), and the highest
$R^2$ (0.639), showing its robustness in accurately predicting population densities. These results highlight the effectiveness of incorporating POI data and advanced masking techniques in enhancing both classification and regression tasks.
In comparison, other methods such as Semantic, GCA, and GAE fall short of MaskPOI in terms of overall accuracy and error reduction. DGI shows competitive results, while others like DeepWalk and Node2Vec lag behind in both tasks, demonstrating lower cosine similarity and higher prediction errors. Overall, MaskPOI stands out as the most effective approach, consistently delivering the best results across most evaluation metrics.
6.2. Parameter Sensitivity Analyses
To thoroughly evaluate the robustness and effectiveness of the MaskPOI model, we conducted a series of parameter sensitivity analyses focusing on three key aspects that significantly influence model performance on the Beijing dataset. First, we investigated the impact of different GNN encoder architectures, including GraphSAGE [
59], GCN [
60], GIN [
61], and GAT [
62], to identify the most suitable encoder for capturing the spatial and functional relationships among POIs. The choice of encoder directly affects the quality of node representations and thus plays a critical role in determining the overall performance.
Second, we analyzed the sensitivity of the model to the mask rate, which includes both edge masking and feature masking. By varying the proportion of edges and feature elements masked during training, we aimed to explore the trade-off between providing sufficient training signals and preserving essential information. This analysis helps to identify the optimal mask rates that ensure robust self-supervised learning.
Finally, we examined the effect of the representation dimension. Larger dimensions increase the model’s capacity to capture complex patterns but may also lead to overfitting. By evaluating a range of dimensions, we sought to balance model expressiveness and efficiency.
The following subsections provide a detailed analysis of each aspect, discussing the experimental setups, results, and insights.
6.2.1. GNN Encoder Type
The choice of GNN encoder architecture significantly affects the quality of the learned POI representations, as it determines how effectively spatial and functional relationships among POIs are captured. In this study, we evaluated four widely used GNN encoders: GraphSAGE [
59], GCN [
60], GIN [
61], and GAT [
62].
Graph Sample and Aggregate (GraphSAGE) is an inductive learning framework that generates node embeddings by aggregating features from local neighborhoods. Its ability to generalize to unseen nodes makes it particularly suitable for dynamic or incomplete graphs.
Graph Isomorphism Network (GIN) aims to enhance the discriminative power of GNNs by utilizing injective aggregation functions. GIN is particularly effective in capturing fine-grained differences between node neighborhoods, which is critical for applications involving complex spatial or functional relationships.
Graph Attention Network (GAT) incorporates an attention mechanism to learn the importance of neighboring nodes dynamically. By assigning different weights to neighbors, GAT adapts its aggregation process based on the relative significance of nodes.
Graph Convolutional Network (GCN) is a spectral-based approach that applies convolution operations on graph data, focusing on local structural information. The GCN is computationally efficient and widely used for graph-based tasks but may have limited capacity for capturing higher-order dependencies.
The experimental results presented in
Table 3 demonstrate the varying effectiveness of different GNN encoders for the urban functional zone classification and population density prediction tasks. In the urban functional zone classification task, the GCN and GIN exhibit comparable performance, with each excelling in specific metrics. The GCN achieves the best KL divergence (0.834), while the GIN slightly outperforms the GCN in terms of the L1 error (0.865 compared to 0.867) and achieves the highest cosine similarity (0.763 compared to 0.761). For the population density prediction task, the GCN achieves the lowest RMSE (0.133) and MAE (0.1) while attaining the highest
$R^2$ value (0.436). These results highlight the GCN's ability to effectively model and predict continuous spatial distributions, which is critical for this regression-oriented task.
When compared to the other encoders, GraphSAGE and the GAT show relatively weaker performance. While GraphSAGE is designed for inductive learning and dynamic graphs, its simple neighborhood aggregation scheme appears insufficient for capturing the nuanced spatial and functional relationships inherent in POI data. The GAT, which incorporates attention mechanisms to dynamically weight neighboring nodes, achieves moderate results but does not provide a significant advantage over the GCN or GIN. This suggests that, for the given datasets, the additional computational complexity introduced by attention mechanisms may not yield substantial benefits.
The results of this study affirm the effectiveness of the GCN as the preferred encoder for both tasks. Its spectral-based convolutional framework enables the GCN to focus on local structural information, ensuring that the learned representations are well-aligned with the underlying spatial dependencies of population density. Furthermore, the GCN’s computational efficiency and architectural simplicity make it particularly suitable for processing large-scale POI graphs, which often involve high-dimensional features and dense connectivity.
6.2.2. Mask Rate
The edge mask rate controls the proportion of edges that are randomly masked in the graph structure during training. As shown in
Figure 4, we conducted experiments with various edge mask rates ranging from 0.1 to 1.0, evaluating the performance of these rates on two downstream tasks: urban functional zone classification (Task 1) and population density prediction (Task 2).
In Task 1, shown in
Figure 4a, lower values are desirable for the L1 error (represented in blue) and KL divergence (orange), while higher values for cosine similarity (green) are preferred. Upon analyzing the results, no distinct trend emerged for these metrics as the edge mask rate varied. However, it can be observed that edge mask rates of 0.3 and 0.9 generally yield relatively good performance across all three metrics, indicating that these rates offer competitive results without significant degradation.
In Task 2, depicted in
Figure 4b, we examined the RMSE (blue), MAE (orange), and
$R^2$ (green) as the evaluation metrics. For the RMSE and MAE, lower values indicate better performance, while higher values of $R^2$ are preferred. The results reveal an approximately M-shaped trend in performance as the edge mask rate increases. Initially, as the mask rate increases from 0.1 to 0.3, the RMSE and MAE decrease, reaching their optimal values at 0.3. Similarly, $R^2$ reaches its peak at the 0.3 mask rate, suggesting that a moderate amount of edge masking promotes effective learning. However, at higher mask rates, particularly beyond 0.7, the metrics begin to deteriorate, signaling that excessive masking compromises the structural integrity of the graph, leading to a decline in the model's performance.
The observed trends indicate that the edge mask rate plays a crucial role in balancing the amount of structural information retained in the graph. While lower mask rates fail to provide sufficient self-supervised learning, higher rates undermine the graph’s structure, which is essential for learning meaningful representations. Thus, after carefully analyzing the results across both tasks, we determined that an edge mask rate of 0.3 strikes an optimal balance. This rate provides a sufficient amount of masking to encourage the self-supervised learning process while maintaining enough structural information for the model to make accurate predictions. Therefore, we selected an edge mask rate of 0.3 as the optimal setting for this study, as it yields competitive performance across all evaluation metrics for both Task 1 and Task 2.
The feature mask rate controls the proportion of node features that are randomly masked during training, thereby influencing the model’s ability to infer missing attributes and learn more robust representations. As shown in
Figure 5, we evaluated the performance of various feature mask rates ranging from 0.1 to 0.9, applying them to both the urban functional zone classification task (Task 1) and the population density prediction task (Task 2).
In Task 1, represented in
Figure 5a, the model achieves a good balance across all metrics around a mask rate of 0.5, with a low L1 error (blue), low KL divergence (orange), and high cosine similarity (green). This suggests that at this rate, the model is effectively encouraged to infer missing node attributes while retaining enough of the feature information to capture the underlying functional relationships in the data.
In Task 2, shown in
Figure 5b, the model achieves its best performance at a feature mask rate of 0.6, where the RMSE (blue) and MAE (orange) are minimized and $R^2$ (green) reaches its highest value. This indicates that the model best predicts the population density when approximately 60% of the node features are masked. However, when the rate is 0.5 or 0.7, the model's performance decreases noticeably. In Task 1, the performance at 0.5 and 0.6 is similar.
Based on these findings, a feature mask rate of 0.6 was selected as the optimal setting for this study. Masking around 60% of the node features strikes an effective balance between challenging the model to infer missing information and ensuring that it retains enough input context for accurate predictions. At this rate, the model is exposed to sufficient noise to improve its robustness while still preserving enough of the original feature data to perform well on both tasks. This trade-off makes 0.6 the most suitable choice for optimal model performance in this study.
6.2.3. Representation Dimension
The representation dimension represents the size of the learned POI embeddings, which directly affects the expressiveness and capacity of the MaskPOI model. Larger dimensions enable the model to capture more complex relationships and features, while smaller dimensions promote efficiency and may reduce the risk of overfitting. As shown in
Figure 6, we evaluated the impact of different representation dimensions (32, 64, 128, 192, and 256) on both the urban functional zone classification task (Task 1) and the population density prediction task (Task 2).
In Task 1, illustrated in
Figure 6a, the model’s performance initially improves as the representation dimension increases but begins to decline after reaching its peak at 128 dimensions. It clearly shows that 128 is the optimal value for the dimension. In Task 2, shown in
Figure 6b, the model’s performance does not show significant changes with varying dimensions. At a dimension of 64, the model performs noticeably worse, while at 192, the performance is slightly better than at 128. However, in Task 1, the performance at 192 is significantly worse than at 128.
After considering the model’s performance across both tasks, we chose a representation dimension of 128 as the optimal setting for this study. This dimension provides sufficient capacity to capture the underlying relationships in the POI data while maintaining computational efficiency and avoiding the diminishing returns observed at higher dimensions.
6.3. Ablation Study
To evaluate the contributions of the proposed masking strategies in MaskPOI, we conducted ablation experiments by isolating the effects of each component: masking edges (mask edge), masking features (mask feature), and combining both (both). The results are summarized in
Table 4.
The full model, which incorporates both edge masking and feature masking, achieves the best performance across all metrics for both tasks. For urban functional zone classification (Task 1), it achieves the lowest L1 error (0.867) and KL divergence (0.834) and the highest cosine similarity (0.761). For population density prediction (Task 2), it achieves the lowest RMSE (0.133) and MAE (0.1) and the highest $R^2$ (0.436). These results demonstrate the complementary benefits of using both masking strategies, which together enable the model to learn robust and comprehensive POI representations.
When comparing the two individual masking strategies, the results indicate that both contribute positively to the overall performance. However, the mask edge strategy provides a more significant improvement compared to the mask feature. For Task 1, the mask edge outperforms the mask feature in all metrics, achieving a lower L1 error (0.884 vs. 0.972), KL divergence (0.884 vs. 0.967), and a higher cosine similarity (0.748 vs. 0.72). Similarly, in Task 2, the mask edge achieves a lower RMSE (0.134 vs. 0.145) and a higher $R^2$ (0.41 vs. 0.324). These results suggest that masking edges has a stronger impact on the model's ability to capture spatial relationships and structural dependencies, which are crucial for both tasks.
Overall, the results validate the effectiveness of both edge and feature masking, with edge masking providing a more pronounced improvement. The combination of the two strategies yields the best results, highlighting their complementary nature in enhancing the learned POI representations and ensuring robust performance across tasks.
7. Discussion
In this section, we explore how the construction of the POI graph and the initial feature embedding affect POI representation, and we also discuss the limitations of MaskPOI.
7.1. Impact of Graph Construction Method
In this study, we constructed the network by connecting POIs to their nearest roads, creating a hybrid road+POI network that captures both the spatial relationships among POIs and their alignment with the road network. To further evaluate the impact of the network construction methods, we compared this baseline approach with a purely POI-based method using Delaunay triangulation to construct the network. The differences between the two methods of constructing networks for the same POI data are shown in
Figure 7.
The comparison is presented in
Table 5. The results show that the two network construction methods yield similar performances across both tasks. One possible reason for the similar performance between the two network construction methods is that both approaches effectively capture the essential spatial relationships between POIs, which are crucial for the downstream tasks. Although the hybrid road+POI network explicitly incorporates road information, the Delaunay triangulation method also creates a graph structure that closely reflects the spatial distribution of POIs. As a result, both networks provide meaningful topological information that supports the representation learning process.
Another contributing factor might be the robustness of the representation learning framework employed in this study. The use of mask-based modeling enables the discovery of implicit information, such as latent spatial relationships that are not explicitly represented. Consequently, despite the structural differences between the two graphs, the network is capable of learning and capturing the same implicit relationships.
7.2. Impact of Initial POI Feature Embedding
In this study, we utilized BERT [
28] to construct the initial features for POIs, capturing semantic information from their textual descriptions. To analyze the impact of different text encoding methods, we compared the BERT-based embedding approach with a one-hot encoding method.
One-hot encoding is a conventional approach for categorical data representation, where each unique category is mapped to a high-dimensional sparse binary vector. In our case, POIs are categorized using a three-level hierarchical structure. To construct one-hot encoded representations, we follow these steps (see the sketch after the list):
Category Indexing: Each level of the POI hierarchy is assigned a unique index within its respective level.
Binary Vector Representation: Each POI is represented using three separate one-hot vectors, one for each category level. The vector has a length equal to the total number of unique categories at that level, with a single “1” at the corresponding index and all other positions set to “0”.
Concatenation: The three one-hot vectors corresponding to the three levels are concatenated to form the final one-hot representation.
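The three steps above can be sketched as follows; the toy vocabularies are illustrative only, while the real ones contain all 23/193/587 first-, second-, and third-level categories for Beijing.

```python
import numpy as np


def build_one_hot(poi_categories, level_vocabularies):
    """Concatenate one one-hot vector per hierarchy level (three levels here)."""
    parts = []
    for cat, vocab in zip(poi_categories, level_vocabularies):
        vec = np.zeros(len(vocab))
        vec[vocab.index(cat)] = 1.0          # single '1' at the category's index
        parts.append(vec)
    return np.concatenate(parts)


# toy vocabularies; real ones would hold every category of each level
level_vocabularies = [["food service", "retail"],
                      ["fast food restaurant", "shopping mall"],
                      ["McDonald's", "Starbucks"]]
one_hot = build_one_hot(["food service", "fast food restaurant", "McDonald's"],
                        level_vocabularies)   # length 2 + 2 + 2 = 6
```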
The results, presented in
Table 6, demonstrate the performance differences between BERT embeddings and one-hot encoding. To assess whether these improvements are statistically significant, we conducted a paired
t-test for each metric. Statistically significant differences are indicated with an asterisk (*).
For urban functional zone classification, BERT-based embeddings outperform one-hot encoding across all metrics, but the improvements in the L1 error and KL divergence are not statistically significant. However, the improvement in cosine similarity is significant, suggesting that the BERT embeddings better capture the semantic relationships within POI categories.
For population density prediction, the BERT embeddings demonstrate statistically significant improvements across all metrics, including the RMSE, MAE, and $R^2$. These results confirm that the richer semantic representations provided by the BERT embeddings lead to more accurate population estimations.
Overall, these findings validate the choice of BERT embeddings in this study, as they not only enhanced the model’s performance but also introduced greater semantic expressiveness and robustness, particularly in tasks that require a fine-grained understanding of POI relationships. While certain improvements did not reach statistical significance, the consistent trend suggests that the BERT embeddings provide a more informative representation than one-hot encoding, reducing our reliance on manually defined category hierarchies.
7.3. Limitations
MaskPOI presents certain limitations that should be addressed in future research. First, the high dimensionality of the POI feature encoding (2306 dimensions) significantly increases memory consumption. This makes it challenging to construct large graphs, especially when we attempt to capture relationships over extended spatial distances. In this study, we constructed graphs within a 500 m region, but to better capture long-range spatial dependencies, larger graphs would be needed, which, in turn, require more efficient processing techniques. Future work could explore methods for dimensionality reduction or more efficient graph processing algorithms that can handle larger, high-dimensional graphs, allowing for the construction of graphs over broader areas or longer distances.
Second, the POI features used in this study are limited to coordinates and category labels, which provide only a basic representation of each POI. However, POIs typically contain other valuable attributes, such as names, user reviews, ratings, and other metadata, which could significantly enhance the expressiveness of the POI features. Integrating these additional data sources could provide a richer feature set, improving the model’s ability to capture the full spectrum of POI characteristics. Future research should focus on incorporating multi-modal information to create more comprehensive representations of POIs, potentially improving the performance of the MaskPOI framework in various tasks.