Article

Text Geolocation Prediction via Self-Supervised Learning

1 School of Computer Science, China University of Geosciences, Wuhan 430074, China
2 School of Computer Science, McGill University, Montréal, QC H3A 0E9, Canada
* Author to whom correspondence should be addressed.
ISPRS Int. J. Geo-Inf. 2025, 14(4), 170; https://doi.org/10.3390/ijgi14040170
Submission received: 12 March 2025 / Revised: 9 April 2025 / Accepted: 10 April 2025 / Published: 12 April 2025

Abstract

Text geolocation prediction aims to infer the geographic location of a text from its semantics, serving as a fundamental task for various geographic applications. As the mainstream approach, deep learning-based methods follow the supervised learning paradigm, which relies heavily on large numbers of labeled samples to train model parameters. To address this limitation, this paper presents a method for text geolocation prediction without labeled samples, the GeoSG (Geographic Self-Supervised Geolocation) model, which leverages self-supervised learning to improve text geolocation prediction when labeled samples are unavailable. Specifically, GeoSG integrates spatial distance and hierarchical constraints to characterize the interactions of POIs and texts in a geographic relationship graph, and it designs two self-supervised tasks to train a shared network that learns the relationships among POIs and texts. Finally, text geolocations are inferred with the trained shared network. Experimental results on two datasets show that the proposed method outperforms state-of-the-art baselines and is robust. This study provides a methodological reference for geolocating various text documents and offers a solution for numerous geographic intelligence tasks that lack labeled samples.

1. Introduction

Texts tagged with geolocations directly portray the connections between the geographic environment and texts, providing fundamental data for numerous geographic applications, including government policy-making [1], business management [2], and academic research [3]. Restricted by privacy policies and by hardware and software capabilities, only a small proportion of the ever-increasing variety of texts are geo-tagged when produced [4]. Geolocating un-geotagged texts therefore becomes increasingly valuable [5,6,7].
Accurately inferring the geographic locations of texts is highly challenging due to the uncertainty of natural language expressions, compounded by the diversity of contexts within texts and the complex relationships among the geographic features they mention. Early work relied on traditional methods, including text meta-field-based, geographic distribution-based, and rule-based methods, to geolocate texts [8]. These methods employ manually acquired experience and rules to represent the relationships between texts and geographic locations. Usually, such experience and rules are insufficient to capture these complex relationships, resulting in poor geolocation accuracy.
Deep learning-based methods have emerged to improve the accuracy of text geolocation prediction [9,10]. These methods leverage neural networks to model the interactions between texts and geographic locations, benefiting from the strength of deep neural networks in capturing complex relationships [11,12] and significantly improving prediction accuracy. Furthermore, some studies incorporate multi-dimensional or multi-view information to support model inference, such as user metadata or image data in tweet geolocation prediction [13]. These deep learning-based methods follow the supervised machine learning paradigm, which relies heavily on large-scale labeled samples to train models. In practice, general texts typically lack such rich metadata and do not provide sufficient labeled training data for effective neural network parameter learning.
In real-world applications, texts often contain mentions of geographic entities—such as landmarks—that serve as location references. However, effectively leveraging these entity locations to infer the central geographic location of a text remains challenging, as illustrated in Figure 1. This paper focuses on text geolocation prediction without labeled samples, utilizing geographic entities and geographic knowledge bases as auxiliary knowledge. The relationship between a text and its geographic location can be abstracted as the interactions between the locations of the multiple geographic entities mentioned in the text and the semantic location of the text, so these complex relationships can be captured in an unsupervised setting. Following this idea, the paper proposes a novel method, GeoSG (Geographic Self-Supervised Geolocation), which leverages self-supervised learning and the positional information of geographic entities mentioned in the text to improve the prediction accuracy of text geolocations. Specifically, GeoSG incorporates spatial distance and hierarchical constraints into a graph to characterize the interactions between geographic entities and texts. It then designs two self-supervised modules that learn the intricate relationships between geographic entities and texts with a shared graph network backbone. The first module enhances the geolocation capabilities by predicting the masked coordinates of geographic entities. The second module predicts distances between geographic entities based on their names, facilitating the learning of relationships between textual content and geographic features. Finally, GeoSG utilizes the pretrained networks to predict the geolocations of texts. To the best of our knowledge, this is the first effort to predict text geolocations without labeled samples. The contribution of this work is threefold:
(1) This paper highlights the machine learning task of text geolocation in scenarios without labeled samples, and introduces self-supervised learning to address it.
(2) A self-supervised text geolocation method is developed. This method introduces two graph self-supervised modules during the training process, enabling their shared network backbone to learn the complex interactions between nodes, thus improving the ability of geographic entities to infer geolocations. In addition, the method optimizes the results of geolocation by introducing the hierarchical relationships of geographic entities and geolocation influence weights. This work explores an effective approach to predict text geolocation without labeled samples, and provides a methodology reference for various geographic tasks when labeled samples are unavailable.
(3) The experimental results show that the proposed model, GeoSG, outperforms the baseline models and is robust.

2. Related Work

2.1. Text Geolocation Prediction Methods

Previous text geolocation prediction methods can be categorized into four groups, including rule-based, topic model-based, spatial distribution estimation-based, and deep learning-based.
Rule-based methods extract explicit geographic entities from texts, utilizing these geolocations as predicted textual positions while incorporating a priori knowledge. For example, traditional text-based location inference techniques often involve creating documents for each grid cell by concatenating the texts associated with that cell and retrieving the most content-similar document for non-geotagged texts using various retrieval models such as TF-IDF (Term Frequency-Inverse Document Frequency) and BM25 (Best Matching 25) [14]. Additionally, they match text meta-fields with geographic entities, using the average of the coordinates of multiple entities that the text matches as the text location [8]. However, since these methods are based on specific rules, they are less transferable to different text contexts.
Topic model-based approaches aim to learn the joint distributions of words and geolocations, enhancing localization through complex a priori knowledge structures. One example is the use of Latent Dirichlet Allocation (LDA) to infer the location of texts by extracting topic distributions from texts and using these distributions to predict geographic locations. This method assumes that the geographic distribution of topics provides significant clues for location prediction, offering a nuanced understanding of the text's context [15]. Although these methods are innovative, there are inherent differences between topic modeling and text position prediction, making it difficult to geolocate texts accurately.
Spatial distribution estimation-based methods offer a global perspective on text geolocation. For example, the density-based baseline model using the Gaussian mixture model assigns geolocations to geographic word n-grams based on their spatial density, forming an ellipse covering a predefined maximum area and containing a certain ratio of total texts [16]. An alternative approach, the Locality-adapted Kernel Density Estimation (LocKDE), determines the likelihood of terms occurring in specific geolocations by estimating their spatial density. This method uses kernel density estimation with a bandwidth that is adjusted based on the information gain ratio, which helps the model to be finely tuned to account for the local variability in term distributions [17]. These methods, while providing a comprehensive spatial analysis, often lack the ability to effectively model the semantic content of the text for accurate predictions.
Recently, deep learning-based methods have integrated local interactions between textual content and POIs (points of interest) with global interactions that combine spatial and textual features, resulting in a comprehensive framework for accurate semantic location prediction of texts [11]. For example, the BiLSTM (Bidirectional Long Short-Term Memory) regression model has been used to capture the linguistic nuances of tweets, so that their geolocation can be predicted solely from textual content. This approach processes text in both forward and backward directions, capturing long-term dependencies and enhancing the semantic understanding of the content [18]. The EDGE framework combines entity correlation mining with a Gaussian mixture model; it diffuses semantic embeddings between geo-indicative and non-geo-indicative entities over a graph network, predicting location as a mixture of bivariate Gaussian distributions [9]. Other studies predict the location of text by using deep learning models to associate the text with a predefined area. Techniques such as CNNs (convolutional neural networks) [19] and LSTMs [20] have been applied for both coarse-grained [21] and fine-grained region localization [22]. Hierarchical models further refine these predictions by localizing tweets to specific countries and cities, and multi-view learning approaches have been proposed for fine-grained tweet location prediction within a specific area of interest, although they may encounter deviations in precise distance calculations [23]. To enhance fine-grained area localization, POI-level localization tasks have also been proposed [24]. In addition, certain location inference tasks aim to determine a user's position by employing neural networks combined with additional information, such as users' online social interactions or other external knowledge [25]. Overall, deep learning provides an effective solution for modeling the association of text and geolocation, thus improving text geolocation performance. However, this approach is highly reliant on the availability of large-scale labeled samples.

2.2. Machine Learning Models Without Labeled Samples

In view of the lack of manually labeled samples in most applications, techniques that reduce reliance on extensive datasets become essential. Unsupervised learning [26] allows models to make predictions without relying on manually labeled data. Self-supervised learning (SSL) [27] is a specialized form of unsupervised learning that directly generates supervisory signals from the data themselves, bridging the gap between unsupervised and supervised learning. Unlike traditional unsupervised methods, which rely solely on the structure of the data, SSL uses pretext tasks to create labels from the data, allowing models to learn meaningful representations and improve performance.
In the domain of self-supervised learning, particularly in graph learning [28,29], methods are generally classified into four categories: generation-based [30], auxiliary property-based [31], contrast-based [32], and hybrid approaches [33]. Generation-based methods have shown superior performance in learning node representations and capturing the global graph topology. Consequently, GeoSG adopts a generation-based strategy to design a self-supervised task for text geolocation. In contrast to traditional self-supervised learning tasks in geography, such as next-location prediction [34] and trip recommendation [35], which are based on sequential or multi-view information, GeoSG relies more on the semantic information of the text itself.
While self-supervised learning significantly reduces the need for labeled data, applying these techniques to text geolocation presents unique challenges and opportunities. Prior to this study, no work had utilized deep learning models for text geolocation without labeled samples. Traditional rule-based methods, such as those relying on place name recognition followed by database matching [8], do not require labeled samples but lack the ability to model the complex relationships between geographic locations and textual content. By leveraging self-supervised learning techniques, GeoSG seeks to capture the intricate relationships between geographic locations and textual content without labeled samples.

3. Method

3.1. Overview

Given a set of textual documents and a set of geographic entities, this study aims to output the geographic coordinates of each document. Each document consists of a sequence of tokens, denoted as $D = \{x_1, x_2, \dots, x_N\}$, where $N$ represents the total number of tokens; each geographic entity contains a name and a geographic coordinate.
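For concreteness, the inputs and the expected output of the task can be summarized in the following minimal Python sketch; all type and function names here are illustrative assumptions, not identifiers from the released code.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GeoEntity:
    name: str                    # e.g., "Empire State Building"
    coord: Tuple[float, float]   # (latitude, longitude)

@dataclass
class Document:
    tokens: List[str]            # D = {x_1, x_2, ..., x_N}
    mentions: List[GeoEntity]    # geographic entities matched in the text

def predict_geolocation(doc: Document) -> Tuple[float, float]:
    """The target of the task: map a document to a predicted (lat, lon)."""
    raise NotImplementedError
```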
As illustrated in Figure 2, the proposed method, GeoSG, is structured around four interconnected modules: (a) geographic relationship construction, (b) POI geolocation prediction, (c) distance prediction, and (d) document geolocation prediction. The graph structure is constructed by module (a), while modules (b–d) share the graph neural network architecture and parameters. In particular, the backbone of the graph neural network is a graph attention network (GAT). The geographic relationship construction module builds a hierarchical graph of geographic entities that incorporates both hierarchical and distance constraints to better capture complex spatial relationships. GeoSG addresses the lack of labeled samples with two modules: leveraging task-consistent pretraining to enable the model to predict locations from related entities, and learning the complex interactions of geographic relationships through the intrinsic features of geographic entities within the region. Finally, the document geolocation module infers document locations based on the trained model.

3.2. Geographic Relationship Construction

GeoSG constructs geographic relationships by introducing two types of constraints: hierarchical and distance constraints. These constraints facilitate the characterization of the relationships between geographic entities, motivated by the idea that spatial relationships can offer rich geolocation clues. Additionally, following the first law of geography, the physical distance between entities at the same level provides meaningful clues, and closer geographic entities tend to exhibit more spatial relationships or connections.
As illustrated in Figure 3, the geographic entity graph $G$ is constructed to represent both the hierarchical and spatial constraints. The set of geographic entities $E$ within the region $V$ is mapped into this graph $G$. Each node in $G$ represents an individual geographic entity and is characterized by its name, type, and geolocation.
Three types of edges are constructed for the graph: Borough–Street, Street–POI, and POI–POI. The edge types are defined based on the entity types in the OSM data, which provides a comprehensive set of predefined categories that are organized accordingly—for example, restaurants and bars are collectively classified as POIs. Their relationships are established based on proximity. Specifically, a Borough–Street edge represents the relationship between a borough and a street node, with each street linked to its nearest unique borough. A Street–POI edge represents the relationship between a street and a POI node, with each POI linked to its nearest unique street. POI–POI edges represent relations between pairs of POIs that are within a specified distance K, where each POI is linked to every other POI node located within that distance. The three types of edges facilitate the exchange of information between geographic entities, enriching their geolocation semantics during model training.
For graph construction, we utilized the Deep Graph Library (DGL). The name attribute of each node is encoded with BERT [36] and assigned to the node as its initial feature.
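A hedged sketch of this construction is given below, using DGL and a Hugging Face BERT encoder. The edge rules and the POI distance threshold K follow the description above, while the geodesic helpers, the bert-base-uncased checkpoint, and the exact input format are our assumptions.

```python
import math

import dgl
import torch
from transformers import AutoModel, AutoTokenizer

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) pairs, in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def nearest(entity, candidates):
    """Index of the candidate closest to `entity`; items are (name, lat, lon)."""
    return min(range(len(candidates)),
               key=lambda j: haversine_km(entity[1:], candidates[j][1:]))

def build_geo_graph(boroughs, streets, pois, K_km=1.0):
    """boroughs/streets/pois: lists of (name, lat, lon) tuples from OSM."""
    nodes = boroughs + streets + pois
    n_b, n_s = len(boroughs), len(streets)
    src, dst = [], []
    for i, s in enumerate(streets):                  # Borough-Street edges
        src.append(n_b + i); dst.append(nearest(s, boroughs))
    for i, p in enumerate(pois):                     # Street-POI edges
        src.append(n_b + n_s + i); dst.append(n_b + nearest(p, streets))
    for i in range(len(pois)):                       # POI-POI edges within K
        for j in range(i + 1, len(pois)):
            if haversine_km(pois[i][1:], pois[j][1:]) <= K_km:
                src.append(n_b + n_s + i); dst.append(n_b + n_s + j)
    g = dgl.to_bidirected(dgl.graph(
        (torch.tensor(src, dtype=torch.int64),
         torch.tensor(dst, dtype=torch.int64)),
        num_nodes=len(nodes)))
    # Node features: BERT [CLS] embedding of each entity name (768-dim).
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")
    with torch.no_grad():
        enc = tok([n[0] for n in nodes], padding=True, truncation=True,
                  return_tensors="pt")
        g.ndata["feat"] = bert(**enc).last_hidden_state[:, 0]
    g.ndata["loc"] = torch.tensor([[n[1], n[2]] for n in nodes],
                                  dtype=torch.float32)
    return g
```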

3.3. POI Geolocation Prediction Module

This module treats POIs as special geographic entities characterized by specific geographic coordinates, which interact with the other geographic entities in the graph built in the previous subsection. Specifically, the module aims to predict the coordinates of geographic entity nodes that are randomly selected from the graph; the coordinates of the selected nodes are masked. For a graph with $n$ nodes, GeoSG first uses an encoder to encode the nodes of the graph:
$$h_i^L = f_\theta(n_1, n_2, \dots, n_N) \tag{1}$$
$$h^G = \mathrm{GATconv}(h^L, G) \tag{2}$$
where $n_1, n_2, \dots, n_N$ are the original text attributes of the nodes, $f_\theta$ is an encoder, $h^L$ represents the hidden vectors, $\mathrm{GATconv}$ refers to the graph attention convolution operator, and $h^G$ denotes the representation of the node features after the graph neural network information update.
For calculating attention weight coefficients between nodes, GeoSG derives a scalar value using a graph attention network during training:
$$e_{ij} = \mathrm{LeakyReLU}\left(a^{\top}\left[W h_i \,\Vert\, W h_j\right]\right) \tag{3}$$
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{n \in N_i} \exp(e_{in})} \tag{4}$$
where $e_{ij}$ represents the correlation coefficient between node $i$ and node $j$, $W$ is the weight matrix, $h_i$ and $h_j$ are the embedding representations of nodes $i$ and $j$, $a$ is the weight vector, and LeakyReLU is employed as the nonlinear activation function. $N_i$ denotes the set of neighboring nodes directly connected to node $i$ in the graph, so $n \in N_i$ ranges over the nodes adjacent to node $i$. Softmax normalization of the correlation coefficient $e_{ij}$ between node $i$ and all of its neighbors yields the attention coefficient $\alpha_{ij}$.
Finally, the relative weights of all neighbors are normalized using Equation (5):
$$P_{ij} = \frac{\alpha_{ij}}{\sum_{n \in N_i} \alpha_{in}} \tag{5}$$
The predicted geolocation of node $i$ can then be calculated using Equation (6):
$$\widehat{PL}_i = \sum_{n \in N_i} P_{in} \cdot \mathrm{Location}_n \tag{6}$$
where $\mathrm{Location}_n$ represents the latitude and longitude coordinates of the neighboring node $n$. The optimization objective of the model is to minimize the distance between the predicted coordinates $\widehat{PL}$ and the true coordinates $PL$.
The optimization objective for POI geolocation prediction is formulated as:
$$\mathrm{Loss} = \sum_{i \in G} \mathrm{Dis}\left(PL_i, \widehat{PL}_i\right) \tag{7}$$
As a self-supervised task, all geographic entities involved in the graph construction are used as training samples. The input to the POI geolocation prediction task is the same as the input to the textual document location prediction task, namely the graph constructed in Section 3.2, and the essence of the prediction is the same for both tasks: predicting the locations of graph nodes. In this way, the model learned from POI geolocation prediction can be used directly to predict the locations of textual documents.
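The following sketch shows one way to realize this pretext task with DGL's GATConv, reusing the graph from the previous sketch. The per-edge attention scores are re-normalized per destination node as in Equation (5) and used to average neighbor coordinates as in Equation (6); the masking ratio and the choice to score the loss only on masked nodes are assumptions where the paper gives no details.

```python
import torch
import torch.nn as nn
import dgl.function as fn
from dgl.nn import GATConv

class POIGeolocator(nn.Module):
    """Shared GAT backbone plus the attention-weighted location readout."""
    def __init__(self, in_dim=768, hid_dim=64, heads=4):
        super().__init__()
        self.gat = GATConv(in_dim, hid_dim, num_heads=heads,
                           allow_zero_in_degree=True)

    def forward(self, g, feat):
        # h: (N, heads, hid); att: (E, heads, 1), softmax-normalized per node.
        h, att = self.gat(g, feat, get_attention=True)
        g = g.local_var()
        g.edata["a"] = att.mean(1)                       # average attention heads
        # Eq. (5): renormalize edge weights over each node's neighborhood.
        g.update_all(fn.copy_e("a", "m"), fn.sum("m", "z"))
        g.apply_edges(lambda e: {"p": e.data["a"] / e.dst["z"].clamp(min=1e-9)})
        # Eq. (6): predicted location = weighted sum of neighbor coordinates.
        g.update_all(fn.u_mul_e("loc", "p", "m"), fn.sum("m", "pred"))
        return g.ndata["pred"]

def pretrain_step(model, g, opt, mask_ratio=0.15):
    """One self-supervised step, scoring randomly 'masked' nodes. A full
    implementation would also hide masked coordinates from message passing;
    here each prediction already uses only the neighbors' coordinates."""
    mask = torch.rand(g.num_nodes()) < mask_ratio
    pred = model(g, g.ndata["feat"])
    # Eq. (7): minimize the distance between predicted and true coordinates.
    loss = torch.norm(pred[mask] - g.ndata["loc"][mask], dim=1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```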

3.4. Distance Prediction Module

The distance prediction module aims to predict the distances between each pair of nodes in a hierarchical graph of geographic entities. The semantic relatedness of these entities often indicates their geographic proximity. This task uses the actual distances between node pairs as ground-truth to train the graph network.
The model shares the backbone network with the POI geolocation prediction module and adopts the parameters trained in the previous task. A sampling algorithm is then employed to select node pairs and compute their similarity:
$$r_{i,j}^{p} = F\left(\mathrm{sim}\left(h_i^G, h_j^G\right)\right) \tag{8}$$
where $\mathrm{sim}$ denotes the similarity computation function, $F$ represents the mapping function that maps the similarity scores to the interval from 1 to 10, and $r^p$ represents the degree of similarity between nodes.
Subsequently, the training loss is computed based on the vector similarity metrics of the node pairs and the true distance gap metrics of the node pairs:
$$r_{i,j}^{t} = F\left(\mathrm{geo}\left(\mathrm{Location}_i, \mathrm{Location}_j\right)\right) \tag{9}$$
$$\mathrm{Loss} = \frac{1}{M} \sum_{i=1}^{M} \mathrm{MSE}\left(r^p, r^t\right) \tag{10}$$
where $\mathrm{geo}$ is a function that calculates the true distance between two geolocations from their latitudes and longitudes, and $r^t$ denotes the distance metric between two nodes. The function $F$ is the same as before, mapping the actual distance to the interval from 1 to 10. During training, the parameters of the graph network are updated by minimizing the loss.
The distance prediction module aims to predict inter-node distances rather than the geolocations of texts. Since the POI geolocation module and the distance prediction module share a graph neural network, the distance prediction module can be viewed as an optimization of the POI geolocation module by refining the distance between POI points.
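A minimal sketch of this module follows. The paper states only that both quantities are mapped to the interval from 1 to 10; the linear mapping used for $F$, the cosine similarity, the distance cap, and the inversion so that closer pairs receive higher scores are all our assumptions.

```python
import torch
import torch.nn.functional as F

def to_interval(x, lo, hi):
    """Assumed mapping F: values in [lo, hi] mapped linearly onto [1, 10]."""
    return 1.0 + 9.0 * (x - lo) / (hi - lo)

def distance_loss(h, locs, num_pairs=1024, max_km=10.0):
    """h: (N, d) node embeddings from the shared GAT; locs: (N, 2) lat/lon."""
    i = torch.randint(0, h.size(0), (num_pairs,))
    j = torch.randint(0, h.size(0), (num_pairs,))
    # Eq. (8): embedding similarity mapped onto [1, 10].
    r_p = to_interval(F.cosine_similarity(h[i], h[j]), -1.0, 1.0)
    # Eq. (9): true geodesic distance (haversine_km from the earlier sketch),
    # capped and inverted so that closer pairs score higher (assumption).
    d = torch.tensor([min(haversine_km(locs[a], locs[b]), max_km)
                      for a, b in zip(i.tolist(), j.tolist())])
    r_t = to_interval(max_km - d, 0.0, max_km)
    # Eq. (10): mean squared error between the two scores.
    return F.mse_loss(r_p, r_t)
```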

3.5. Geolocation Prediction Module

The document geolocation module determines the location coordinates of a document by utilizing the geolocations of the multiple geographic entities mentioned within it. Initially, the document is encoded with BERT, generating the document's hidden vector representation $h_D$. The document, represented by this hidden vector, is added as a node to the graph $G$. Subsequently, the textual document is matched against an external knowledge base using character matching to identify the names of geographic entities that appear in the document, thereby linking these entities to their corresponding nodes in the original graph $G$.
After the document encoding process is complete, the document geolocation module leverages the backbone network trained in the two pretraining tasks, thus sharing the model parameters acquired during pretraining, as depicted in Figure 4. By calculating the attention coefficients between the document embedding and the embeddings of the geographic entities mentioned in the document, and utilizing Equations (5) and (6), the attention weights of multiple nodes in graph $G$ with respect to the document are computed as $P$. After obtaining the node weights $P$ and the coordinates of the corresponding geographic entities, the location of the document can be predicted by performing the computation shown in Equation (11):
$$\hat{Y}_i = \sum_{n \in N_i} P_{in} \cdot \mathrm{Location}_n \tag{11}$$
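Putting the pieces together, inference for a new document might look like the sketch below; the entity-matching helper and the direction of the added edges are assumptions consistent with the description above.

```python
import torch
import dgl

def geolocate_document(text, g, model, tok, bert, match_entity_ids):
    """match_entity_ids: assumed helper returning graph node ids of the
    geographic entities found in `text` by character matching."""
    with torch.no_grad():
        enc = tok(text, truncation=True, return_tensors="pt")
        h_d = bert(**enc).last_hidden_state[:, 0]        # document [CLS] vector
    doc_id = g.num_nodes()
    g = dgl.add_nodes(g, 1, {"feat": h_d, "loc": torch.zeros(1, 2)})
    # Edges from mentioned entities to the document node, so the document
    # aggregates its mentions' coordinates through Eqs. (5), (6), and (11).
    ids = torch.tensor(match_entity_ids(text, g), dtype=torch.int64)
    g = dgl.add_edges(g, ids, torch.full_like(ids, doc_id))
    with torch.no_grad():
        pred = model(g, g.ndata["feat"])                 # Eq. (11)
    return pred[doc_id]                                  # predicted (lat, lon)
```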

4. Experiment and Result

4.1. Dataset

In this study, the set of geographic entity objects is sourced from OpenStreetMap (OSM), an open-source, free wiki world map project, which serves as the knowledge base of the proposed method. The objects in the set are organized in shapefiles. Each geographic object consists of the attributes 'osm_id', 'code', 'fclass', 'name', 'type', and 'geometry'.
To evaluate the performance of the proposed method, two datasets were constructed from Manhattan and Boston, two major urban areas in the United States. The textual data consist of property descriptions collected from apartment rental and sales websites in the U.S., where the location of each property is considered the semantic location of the text. They are categorized as 'documents' and labeled with five attributes: category, identifier, text content, latitude, and longitude. Geographic entities relevant to these texts were collected from OSM within specific latitude and longitude ranges for each region. The selected bounding boxes are not strictly defined by city boundaries, since geographic references in the textual descriptions may extend beyond administrative boundaries. For Manhattan, 1757 geographic entities were collected, while for Boston, 1429 entities were gathered within a larger latitude and longitude range to provide broader geographic context. Given the relatively simple structure of the model, these data should be sufficient. The geographic entities were classified into three categories: districts, streets, and POIs. The statistical summary of the datasets is presented in Table 1, where each text contains multiple geographic entities as supporting features. This is partly attributable to the choice of three types of geographic entities, which effectively characterizes the relationship between the texts and the geographic entities and thus facilitates model training. Each entity is annotated with attributes including identifier, name, latitude, longitude, and category. The datasets provide a foundation for experiments on semantic location inference. Geographic visualizations of the datasets are presented in Figure 5.

4.2. Evaluation Metric

This paper evaluates the performance of the proposed method using two standard error metrics, mean absolute error (MAE) and root mean square error (RMSE). MAE is a measure of the mean absolute distance between the predicted and true geolocations, quantified in kilometers (km) in our experiments. Formally, for each test instance in the dataset, MAE is computed in accordance with Equation (12),
$$\mathrm{MAE} = \frac{\sum_{i=1}^{M} \mathrm{Dis}\left(Y_i, \hat{Y}_i\right)}{M} \tag{12}$$
where $M$ is the number of samples in the test set, $Y_i$ is the labeled coordinate of the text, $\hat{Y}_i$ is the predicted coordinate of the text, and $\mathrm{Dis}(Y_i, \hat{Y}_i)$ denotes the distance between the predicted geographic location and the true location.
The RMSE considers the square of the model's prediction error and is calculated as Equation (13):
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{M} \mathrm{Dis}\left(Y_i, \hat{Y}_i\right)^2}{M}} \tag{13}$$
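Both metrics can be computed in a few lines; the sketch below reuses the haversine_km helper from the Section 3.2 sketch as the Dis function (the specific geodesic formula is not stated in the paper, so haversine is an assumption).

```python
import math

def mae_rmse(preds, labels):
    """preds, labels: equal-length lists of (lat, lon) pairs; returns km."""
    errs = [haversine_km(p, y) for p, y in zip(preds, labels)]  # Dis(Y, Y^)
    mae = sum(errs) / len(errs)                                 # Eq. (12)
    rmse = math.sqrt(sum(e * e for e in errs) / len(errs))      # Eq. (13)
    return mae, rmse
```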

4.3. Baseline

Eight widely used methods were selected as baselines to investigate the performance of the proposed method, including six unsupervised models and two supervised models. Among them, Mean [8], BD (best in document) [37], BA (best in all) [37], Random [37], Rvs [38], and LLM-Loc (large language models for geolocation) [39] are unsupervised models, while AttReg [11] and Geo-twitter [10] are supervised models. In particular, Mean matches the text against a gazetteer and averages the coordinates of the extracted place names as its prediction. BD computes the semantic similarity between the text and its geographic entities using BERT, selecting the top-ranked location. Random uses the coordinates of the first extracted place name. BA is similar to BD, but instead of calculating the similarity between the text and the entities mentioned in it, it calculates the similarity between the text and all entities in the gazetteer. Rvs employs a large language model, T5, to analyze the text and ascertain its central location; the model is designed either to extract the primary location from the text directly or to compute an average if multiple geolocations are identified. LLM-Loc leverages the advanced capabilities of the Llama 3.1 [40] model to predict geographic coordinates, capitalizing on the model's extensive knowledge base, comprehension, and generative abilities to enhance geolocation accuracy. Of the two supervised models, AttReg designs a feedforward network to geolocate text documents from a sequence of entity coordinates, and Geo-twitter employs BERT to encode text messages into vectors and directly predicts geographic coordinates from these vectors.
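For concreteness, the simplest baseline, Mean, can be sketched as follows; the substring-matching rule is an assumption consistent with its description above.

```python
def mean_baseline(text, gazetteer):
    """gazetteer: iterable of (name, lat, lon); returns (lat, lon) or None."""
    hits = [(lat, lon) for name, lat, lon in gazetteer if name and name in text]
    if not hits:
        return None                      # no place name matched in the text
    return (sum(lat for lat, _ in hits) / len(hits),
            sum(lon for _, lon in hits) / len(hits))
```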

4.4. Implementation Detail

In this study, the BERT model is employed to obtain the initial 768-dimensional embedding vectors of geographic entity names and texts. The number of attention heads is set to 4. The Adam optimizer [41] is employed to train model parameters with an initial learning rate of $10^{-5}$. The learning rate is dynamically adjusted by the ExponentialLR strategy with a gamma value of 0.3. A fixed random seed of 12,345 is set. Importantly, the training data are sourced from relevant geographic knowledge bases rather than labeled samples, meaning that our model is trained on the self-supervised tasks alone. The first self-supervised task involves 1500 and 1400 POIs in its training process for the two regions, respectively, and is trained for 100 epochs. The second self-supervised task involves 600,000 node pairs in each region and is trained over 200 epochs with the coefficient r ranging from 1 to 10.
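The stated setup corresponds roughly to the following training loop, with the model, graph, and loss functions taken from the earlier sketches. The scheduler's step frequency and the wiring between the two stages are not specified in the paper and are assumptions here.

```python
import torch

torch.manual_seed(12345)                      # fixed random seed
model = POIGeolocator()                       # shared GAT backbone
opt = torch.optim.Adam(model.parameters(), lr=1e-5)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.3)

for epoch in range(100):                      # task 1: POI geolocation prediction
    pretrain_step(model, g, opt)              # g from build_geo_graph(...)
sched.step()                                  # one decay per stage (assumption)

for epoch in range(200):                      # task 2: distance prediction
    h, _ = model.gat(g, g.ndata["feat"], get_attention=True)
    loss = distance_loss(h.mean(1), g.ndata["loc"])
    opt.zero_grad(); loss.backward(); opt.step()
sched.step()
```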
The baseline settings are as follows: the T5 model used by Rvs is T5-base. LLM-Loc employs Llama-3.1-8B. AttReg normalizes the latitude and longitude coordinates of entity mentions, followed by processing through two SimpleRNN layers with hidden dimensions of 16 and 32, respectively, using ReLU activation functions and a dropout rate of 0.1; the Adam optimizer is employed, with the final fully connected layer outputting a one-dimensional prediction. Geo-twitter utilizes the BERT model to extract 768-dimensional embedding vectors of geographic texts, processed similarly to GeoSG, followed by a linear regression layer to predict latitude and longitude; it uses 4 attention heads and is trained with the Adam optimizer at an initial learning rate of 0.001.

4.5. Overall Result

Table 2 reports the experimental results of GeoSG and baselines on the text geolocation tasks on two datasets.
The proposed method, GeoSG, achieves the best results on both datasets in terms of both MAE and RMSE, indicating its superior accuracy and stability in text geolocation. We attribute this to two main reasons: firstly, the performance improvement that comes from combining multiple entity mentions, and secondly, the enhanced capabilities brought about by the two self-supervised tasks. In addition, part of the experimental data is obtained from the crowd-sourced project OSM and is of relatively poor quality. That the proposed method achieves the best performance even with these data further demonstrates its effectiveness.
For the six unsupervised baselines, the Mean method represents a straightforward prediction strategy, but exhibits poorer stability because it neglects the varying influences of different geographic entities within the text. The BD and Random methods, both of which infer the text location from the coordinates of a single geographic entity, achieve results that are not far from the Mean method. These methods use geographic entities mentioned in the text to determine the locations of text documents, which works well when the relationships between text locations and the mentioned geographic entities are less complex. For instance, the Random method achieves the third-best MAE score for Manhattan. However, this method exhibits instability, as evidenced by its performance on the other metrics and on the Boston dataset. Moreover, their performance is inferior to that of GeoSG, mainly because they fail to consider the combined effects of multiple geographic entities. In comparison, BA obtained worse results, mainly because BA does not utilize entity information within the text. Since BA matches against all entities within the region, it is more suitable for cases where the text does not mention any geographic entity name. Rvs achieves strong performance thanks to the power of the large language model T5, which can better capture the relationships between entities and texts, thereby improving prediction accuracy. LLM-Loc generates coordinates directly with a generative language model, resulting in poor performance. Although the model exhibited high accuracy in predicting specific locations, attributed to its rich knowledge base and reasoning abilities as detailed in Section 5.6, its performance was unstable due to the lack of fine-tuning for this specific task; this instability produced some incorrect predictions, reducing overall accuracy. Compared to GeoSG, Rvs and LLM-Loc rely on their language modeling capabilities without the aid of geographic entity information, which makes them suitable for location inference tasks in geographic regions where geographic entity information may be lacking.
In addition, experiments are conducted to compare the proposed model with the two supervised regression models trained on limited samples. The inclusion of supervised data is essential here because untrained models would produce highly random predictions in direct regression tasks. Generally, supervised models should achieve better performance than the unsupervised baselines above, since labeled samples are involved in their training processes. However, Geo-twitter, which predicts coordinates directly from textual content, produces significant errors in the absence of sufficient training samples due to the complexity of the inference process. AttReg, which predicts geographic locations by utilizing multiple entities, shows close MAE and RMSE values, indicating stable predictions; nevertheless, its accuracy remains unsatisfactory. This further illustrates that supervised regression methods for geolocation prediction struggle to achieve accurate predictions under sample scarcity. In contrast, the proposed GeoSG method achieves significant performance improvements without labeled samples, thereby demonstrating its practical significance in real-world applications.

5. Discussion

5.1. Ablation Study

In this section, an ablation study is conducted to investigate the effect of the key modules in GeoSG. Four sets of comparison experiments are set up. Specifically, w/o distance prediction denotes the variant that removes the distance prediction module from GeoSG; w/o POI geolocation prediction denotes the variant that removes the POI geolocation prediction module; w/o self-supervised modules denotes the variant that removes both self-supervised modules; and the w/o graph variant removes both the graph structure and the two modules. The experimental results are presented in Table 3.
The experimental results underscore the crucial role of each module in GeoSG. Removing the POI geolocation prediction module eliminates the training process for location prediction, making it difficult to transfer the learned knowledge to the text location prediction task and thus degrading performance. Removing the distance prediction module removes the optimization of the model from the perspective of node distances, detracting from the correct inference of document locations. Additionally, the variant that removed both modules simultaneously achieved worse accuracy, indicating that both modules are effective. Finally, the experimental result of the w/o graph variant highlights the importance of the relationships between geographic entities for geolocation prediction: the hierarchical relationship graph facilitates a better grasp of global location relationships and enhances the model's ability to infer geolocations from texts. The distance prediction module has less impact on model performance than the POI geolocation module. This can be explained by the fact that the distance prediction module is aimed at predicting inter-node distances rather than the geolocations of texts, and can thus be considered an optimization of the POI geolocation module.

5.2. Impact of Module Order on Performance

This subsection will examine the impact of the training order of the two self-supervised modules on model performance. Their experimental results are listed in Table 4. In the table, GeoSG denotes the training order in the proposed method, which first trains POI geolocation prediction. Interchanged denotes the variant that first trains the model with the distance prediction module and then with the POI geolocation prediction module.
The experimental results show that the two self-supervised modules improve the accuracy of model inference regardless of the order in which they are trained. This finding aligns with the expectations outlined in Section 5.1, where the effectiveness of the two self-supervised modules was analyzed. Since the self-supervised modules are trained in two stages rather than jointly, their order slightly affects the results. Strategies for integrating the two modules, such as joint training, may deserve further investigation.

5.3. Effect of K

In Section 3.2, the hyperparameter K is used as a distance threshold to ascertain whether or not to add connecting edges between two POI nodes. The aim of this experiment is to explore the impact of the values of K on model performance. Specifically, the value of K was set to a range of 500 to 2500 m, and experiments were conducted at 500-m intervals. Their experimental results are shown in Figure 6.
Figure 6 shows that the proposed method achieves the best results in terms of both MAE and RMSE metrics when K is set to 1000 m. We argue that when the value of K is smaller than 1000 m, there are few POIs interconnected, thus providing limited geographic interactions. As the value of K increases, POI edges increase, improving model performance. However, when K is excessively large, substantial noise is introduced, resulting in decreased prediction accuracies.

5.4. Model Convergence Analysis

To further examine the convergence of the proposed model, we record the change in loss during training for the two self-supervised tasks, the POI geolocation prediction module and the distance prediction module, in Figure 7. The figure displays the results of the POI geolocation prediction module on the left and those of the distance prediction module on the right. The horizontal axis represents the number of training epochs, and the vertical axis indicates the loss value for each epoch.
The curves of both training and validation loss in Figure 7 demonstrate that the models converge effectively and rapidly, suggesting that the networks are well trained on the two tasks. For the POI geolocation prediction module, the loss curve converges within 20 epochs. The loss curve for the distance prediction module also converges rapidly in the early epochs, albeit with slight oscillation. The two self-supervised tasks reach a stable state with only self-supervision signals, indicating the effectiveness of the two designed modules. The convergence of the loss curves, in conjunction with the improved results discussed in Section 5.1, further validates the effectiveness of the proposed self-supervised paradigm.

5.5. Selection of Graph Network

The proposed method employs a GAT network to encode the graph nodes. To investigate the effect of GAT on model performance, this subsection compares four widely used graph neural network models that differ in feature learning. Three graph networks—GCN (graph convolutional networks) [42], SAGE (graph sample and aggregate) [43], and EDGE (EdgeConv) [44]—were selected as comparison baselines. These baseline networks compute node similarities in lieu of the attention mechanism in GAT, and then predict the text geolocation by using the node vectors of the last layer to weight the geolocations of geographic entity nodes. Their results are illustrated in Figure 8.
As illustrated in Figure 8, the GAT network exhibits the best performance, indicating that attention mechanisms facilitate the capture of complex relationships between nodes and thereby enhance model performance. The EDGE model attempts to capture the importance between nodes by training edge weights, which is somewhat analogous to attention mechanisms but simpler. The performance of EDGE exceeds that of GCN and SAGE, suggesting that even a simplified form of weight learning can be effective and that edge weights can provide additional insight into the information transmission process. Overall, the chosen design is straightforward and efficient.

5.6. Case Study

To further investigate the results of the proposed model and clarify the task, this subsection analyzes a case study of the dataset. The performance of various methods in terms of text location inference is reported, with the results shown in Figure 9.
The central location of the text is closer to the Empire State Building, and relatively farther from other geographical references. GeoSG successfully captures these semantic and geographical features, yielding accurate inference results. Rvs and BD, by understanding semantics, identified the optimal individual geographic entity as the center of the text, achieving good performance in this case as well. Mean was influenced by three geographically less relevant entities, resulting in a biased inference. Similarly, Random selected geographically less relevant entities, leading to a larger deviation. AttReg inferred a location outside the map, possibly due to a decreasing trend in latitude while learning the relationships between coordinates. BA matched a location outside the map. Geo-twitter, due to insufficient training, predicted a significantly deviated location. LLM-Loc, leveraging the capabilities of large models, achieved relatively accurate results.

5.7. Error Distribution Analysis

To further investigate the performance of various models, this subsection will visualize the error distribution of GeoSG and the seven baseline methods. Figure 10 visualizes the error distributions of these methods on the Manhattan dataset for distances of <1 km, 1–3 km, 3–5 km and >5 km.
Figure 10 shows that GeoSG predictions are more concentrated within the 1 km error range compared to the baseline methods, thereby demonstrating the superiority of the proposed method in geolocation prediction accuracy.
The Mean and Random methods exhibit the highest error proportions in the 1–3 km range, suggesting that simple methods based on entity geolocations can only roughly predict locations. BD, which exploits semantic information to a greater extent than Random, exhibits a significant increase in hits within the 1 km range, although there is only a slight decrease at distances greater than 5 km. This is because it selects closer entities by incorporating feature information, which indicates that semantic information can enhance the performance of text geolocation models, although positioning with a single entity mention still has limitations. BA exhibits poor performance across all error ranges, indicating that text location inference in the unlabeled-sample scenario is highly dependent on the mention of geographic entities in the text. In comparison to the Mean method, Rvs shows performance improvements in both the <1 km and <5 km ranges, and can be considered an enhanced Mean method, in alignment with the analysis presented in Section 5.1. LLM-Loc shows a high proportion of predictions within the <1 km range, demonstrating that advanced generative language models have strong geolocation inference capabilities. It should be noted, however, that this method also shows a high proportion of errors in the >5 km range. We argue that this is because prediction errors are amplified during text generation.
The supervised methods, AttReg and Geo-twitter, encounter significant obstacles in the absence of sufficient training data. AttReg performs well across the first three error intervals based on the coordinates of the entity mentions, yet it still falls short of ideal results.

6. Conclusions

This paper presents a novel method that leverages self-supervised learning to improve text geolocation prediction in scenarios without labeled samples. The method introduces two pretraining modules on a geographic entity graph to facilitate the learning of complex relationships between geographic entities and textual context, and then infers the geolocations of texts. Extensive experiments on two datasets demonstrate that the proposed method significantly improves geolocation prediction accuracy and is robust. This study shows that both the spatial relationships of geographic entities and the pretraining modules are highly effective in predicting text locations. It provides a methodological reference for geolocating texts in scenarios without labeled samples and offers a novel solution for the lack of labeled samples in various applications.
One limitation of this study is that text geolocation is based primarily on text content and does not make use of the attributes of texts, such as time and author. Incorporating text attributes would help improve the proposed method. Moreover, research on extending the applicability of the proposed method to text documents written in other scripts or languages would be highly valuable. Additionally, the relationships between geographic entities and texts are inherently complex and diverse, and it is well worth investigating new self-supervised tasks to further capture these relationships and improve the accuracy of text geolocation. Finally, analyzing and reducing the impact of poor-quality crowd-sourced data to optimize model performance is also worthy of further exploration.

Author Contributions

Conceptualization, Yuxing Wu, Zhuang Zeng and Shengwen Li; Methodology, Yuxing Wu, Zhuang Zeng and Shengwen Li; Data curation, Yuxing Wu and Zhuang Zeng; Formal analysis, Yuxing Wu; Funding acquisition, Shengwen Li; Writing—original draft, Yuxing Wu, Zhuang Zeng and Shengwen Li; Writing–review and editing, Yuxing Wu, Zhuang Zeng, Kaiyue Liu, Zhouzheng Xu, Yaqin Ye, Shunping Zhou, Huangbao Yao and Shengwen Li; Supervision, Shengwen Li, Yaqin Ye, and Shunping Zhou; Project administration, Shengwen Li. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China, grant number 42371420.

Data Availability Statement

The data are available at https://github.com/shavings/GeoSG (accessed on 10 April 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Adesina, A.A.; Okwandu, A.C.; Nwokediegwu, Z.Q.S. Geo-information systems in urban planning: A review of business and environmental implications. Magna Sci. Adv. Res. Rev. 2024, 11, 352–367. [Google Scholar] [CrossRef]
  2. Olaniyi, O.O.; Abalaka, A.I.; Olabanji, S.O. Utilizing Big Data Analytics and Business Intelligence for Improved Decision-Making at Leading Fortune Company. J. Sci. Res. Rep. 2023, 29, 64–72. [Google Scholar] [CrossRef]
  3. Guo, F.; Liu, Z.; Lu, Q.; Ji, S.; Zhang, C. Public Opinion About COVID-19 on a Microblog Platform in China: Topic Modeling and Multidimensional Sentiment Analysis of Social Media. J. Med. Internet Res. 2024, 26, e47508. [Google Scholar] [CrossRef] [PubMed]
  4. Morstatter, F.; Pfeffer, J.; Liu, H.; Carley, K.M. Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose. arXiv 2013, arXiv:1306.5204. [Google Scholar] [CrossRef]
  5. Hu, X.; Zhou, Z.; Li, H.; Hu, Y.; Gu, F.; Kersten, J.; Fan, H.; Klan, F. Location Reference Recognition from Texts: A Survey and Comparison. ACM Comput. Surv. 2023, 56, 112. [Google Scholar] [CrossRef]
  6. Ariyachandra, M.R.M.F.; Wedawatta, G. Digital Twin Smart Cities for Disaster Risk Management: A Review of Evolving Concepts. Sustainability 2023, 15, 11910. [Google Scholar] [CrossRef]
  7. Awan, A.T.; Gonzalez, A.D.; Sharma, M. A Neoteric Approach toward Social Media in Public Health Informatics: A Narrative Review of Current Trends and Future Directions. Information 2024, 15, 276. [Google Scholar] [CrossRef]
  8. Zohar, M. Geolocating tweets via spatial inspection of information inferred from tweet meta-fields. Int. J. Appl. Earth Obs. Geoinform. 2021, 105, 102593. [Google Scholar] [CrossRef]
  9. Hui, B.; Chen, H.; Yan, D.; Ku, W.-S. EDGE: Entity-Diffusion Gaussian Ensemble for Interpretable Tweet Geolocation Prediction. In Proceedings of the 2021 IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 19–22 April 2021; pp. 1092–1103. [Google Scholar]
  10. Lutsai, K.; Lampert, C. Predicting the geolocation of tweets using transformer models on customized data. J. Soc. Inf. Sci. 2024, 29, 69–99. [Google Scholar] [CrossRef]
  11. Mousset, P.; Pitarch, Y.; Tamine, L. End-to-End Neural Matching for Semantic Location Prediction of Tweets. ACM Trans. Inf. Syst. TOIS 2020, 39, 1–35. [Google Scholar] [CrossRef]
  12. Li, B.-X.; Chen, C.-Y. Typhoon-DIG: Distinguishing, Identifying and Geo-Tagging Typhoon-Related Social Media Posts in Taiwan. In Proceedings of the 2024 9th International Conference on Big Data Analytics (ICBDA), Tokyo, Japan, 16–18 March 2024; pp. 149–156. [Google Scholar] [CrossRef]
  13. Liu, Y.; Luo, X.Y.; Tao, Z.; Zhang, M.; Du, S. UGCC: Social Media User Geolocation via Cyclic Coupling. IEEE Trans. Big Data 2023, 9, 1128–1141. [Google Scholar] [CrossRef]
  14. Fernández-Martínez, N.J.; Periñán Pascual, C. Knowledge-based rules for the extraction of complex, fine-grained locative references from tweets. Rev. Electrón. Lingüíst. Apl. 2020, 19, 136–163. [Google Scholar]
  15. Lozano, M.G.; Schreiber, J.; Brynielsson, J. Tracking geographical locations using a geo-aware topic model for analyzing social media data. Decis. Support Syst. 2017, 99, 18–29. [Google Scholar] [CrossRef]
  16. Paule, J.D.G.; Sun, Y.; Moshfeghi, Y. On fine-grained geolocalisation of tweets and real-time traffic incident detection. Inf. Process. Manag. 2019, 56, 1119–1132. [Google Scholar] [CrossRef]
  17. Özdikis, Ö.; Ramampiaro, H.; Nørvåg, K. Locality-adapted kernel densities of term co-occurrences for location prediction of tweets. Inf. Process. Manag. 2019, 56, 1280–1299. [Google Scholar] [CrossRef]
  18. Mishra, P. Geolocation of Tweets with a BiLSTM Regression Model. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, Barcelona, Spain, 13 December 2020; Zampieri, M., Nakov, P., Ljubešić, N., Tiedemann, J., Scherrer, Y., Eds.; International Committee on Computational Linguistics (ICCL): New York, NY, USA, 2020; pp. 283–289. [Google Scholar]
  19. Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 1980, 36, 193–202. [Google Scholar] [CrossRef]
  20. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  21. Mahajan, R.; Mansotra, V. Predicting Geolocation of Tweets: Using Combination of CNN and BiLSTM. Data Sci. Eng. 2021, 6, 402–410. [Google Scholar] [CrossRef]
  22. Elteir, M.K. Fine-Grained Arabic Post (Tweet) Geolocation Prediction Using Deep Learning Techniques. Information 2025, 16, 65. [Google Scholar] [CrossRef]
  23. Abboud, M.; Zeitouni, K.; Taher, Y. Fine-grained location prediction of non geo-tagged tweets: A multi-view learning approach. In Proceedings of the 5th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery, Hamburg, Germany, 18 October 2022. [Google Scholar]
  24. Li, M.; Lim, K.H.; Guo, T.; Liu, J. A Transformer-based Framework for POI-level Social Post Geolocation. In European Conference on Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
  25. Gao, J.; Xiong, W.; Chen, L.; Ouyang, X.; Yang, K. SRGCN: Social Relationship Graph Convolutional Network-Based Social Network User Geolocation Prediction. In Proceedings of the 2023 4th International Conference on Intelligent Computing and Human-Computer Interaction (ICHCI), Guangzhou, China, 4–6 August 2023; pp. 281–286. [Google Scholar] [CrossRef]
Figure 1. Task illustration. The text mentions several geographic entities with known locations that can be used to infer the semantic location of the text. (a) Text document, where entities are highlighted in distinct colors, and the corresponding colored location indicators mark the true positions of these entities. (b) The ground truth location and the locations of two geographic entities. (c) An example where using the position of a single entity to represent the text’s location fails to accurately capture the semantic location. (d) An example where averaging the positions of multiple entities to represent the text’s location fails to effectively capture the semantic location.
Figure 2. Overview of GeoSG. The framework comprises four key modules: (a) geographic relationship construction, (b) POI geolocation prediction, (c) distance prediction, and (d) document geolocation prediction. The structured graph constructed in (a) serves as input for modules (b–d). Self-supervised pretraining on the POI geolocation prediction and distance prediction modules initializes the model for the end-to-end document geolocation module.
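To make the four-module pipeline in Figure 2 concrete, the following minimal sketch (PyTorch) shows how a single shared graph encoder could feed three prediction heads. All class names, layer sizes, and the one-layer message-passing encoder are hypothetical stand-ins; GeoSG's actual backbone, losses, and dimensions are not reproduced here.

```python
import torch
import torch.nn as nn

class SharedGraphEncoder(nn.Module):
    """Shared network reused by modules (b)-(d): one message-passing step
    over a row-normalized adjacency, as a stand-in for the real backbone."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, hid_dim)

    def forward(self, x, adj_norm):
        # aggregate neighbor features via the normalized adjacency matrix
        return torch.relu(adj_norm @ self.lin(x))

class GeoSGSketch(nn.Module):
    def __init__(self, in_dim=64, hid_dim=32):
        super().__init__()
        self.encoder = SharedGraphEncoder(in_dim, hid_dim)
        self.poi_head = nn.Linear(hid_dim, 2)       # (b) POI lon/lat regression
        self.dist_head = nn.Linear(2 * hid_dim, 1)  # (c) pairwise distance
        self.doc_head = nn.Linear(hid_dim, 2)       # (d) document lon/lat

    def forward(self, x, adj_norm):
        return self.encoder(x, adj_norm)

# toy usage with a placeholder graph from module (a)
n = 10
x = torch.randn(n, 64)       # node features (POIs and documents)
adj = torch.eye(n)           # placeholder normalized adjacency
model = GeoSGSketch()
h = model(x, adj)            # shared node embeddings
coords = model.doc_head(h)   # (n, 2) predicted coordinates
```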
Figure 3. The process of constructing a hierarchical graph G of geographic entities. Geographic entities are categorized into three levels and then connected according to the formulated rules, where each type of entity is represented by a unique color in the graph.
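A toy illustration of the hierarchical construction in Figure 3 is sketched below with networkx. The three levels and the containment-style edges are assumptions for illustration only; the paper's concrete linking rules are not restated here.

```python
import networkx as nx

G = nx.Graph()
# hypothetical three-level hierarchy of geographic entities
levels = {
    "district": ["Downtown"],                       # level 1
    "street":   ["Main St", "Oak Ave"],             # level 2
    "poi":      ["Cafe A", "Museum B", "Park C"],   # level 3
}
for level, names in levels.items():
    for name in names:
        G.add_node(name, level=level)

# connect adjacent levels (assumed containment relations)
G.add_edges_from([
    ("Downtown", "Main St"), ("Downtown", "Oak Ave"),
    ("Main St", "Cafe A"), ("Main St", "Museum B"),
    ("Oak Ave", "Park C"),
])
print(G.number_of_nodes(), G.number_of_edges())  # 6 nodes, 5 edges
```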
Figure 4. Pretraining and parameter passing processes in the proposed GeoSG framework. Two self-supervised pretraining tasks are introduced on the geographic entity graph: POI geolocation prediction and distance prediction. Both tasks are trained on the graph, with updated parameters from POI geolocation prediction passed to distance prediction. Finally, the parameters learned from the distance prediction module are used to initialize the document geolocation module.
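The parameter-passing scheme in Figure 4 amounts to initializing each stage's shared encoder from the weights of the previous stage. A minimal sketch follows; the encoder architecture and the elided training loops are placeholders, not the paper's actual configuration.

```python
import torch.nn as nn

def make_encoder(in_dim=64, hid_dim=32):
    # placeholder for the shared encoder used by all three modules
    return nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())

poi_encoder = make_encoder()
# stage 1: pretrain poi_encoder with a POI-coordinate regression head ...
dist_encoder = make_encoder()
dist_encoder.load_state_dict(poi_encoder.state_dict())   # hand over weights
# stage 2: pretrain dist_encoder with a pairwise-distance head ...
doc_encoder = make_encoder()
doc_encoder.load_state_dict(dist_encoder.state_dict())   # initialize final module
```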
Figure 5. Spatial distribution of texts and geographic entities in the two datasets.
Figure 6. Effect of K value on model performance.
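Figure 6 examines sensitivity to a neighborhood size K. Assuming (this is not confirmed by the excerpt above) that K denotes the number of nearest-neighbor entities linked per node during graph construction, a brute-force kNN edge builder could look like the sketch below.

```python
import numpy as np

def knn_edges(coords, k):
    """Return (i, j) index pairs linking each point to its k nearest neighbors."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude self-loops
    nbrs = np.argsort(d, axis=1)[:, :k]    # k closest per node
    return [(i, j) for i in range(len(coords)) for j in nbrs[i]]

edges = knn_edges(np.random.rand(20, 2), k=5)  # toy coordinates
```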
Figure 7. Convergence curves of the losses on (a) POI geolocation and (b) distance prediction modules.
Figure 8. Experimental results for several GNN backbones.
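A backbone comparison like that in Figure 8 can be run by swapping interchangeable graph convolution layers. The sketch below uses PyTorch Geometric with arbitrary dimensions and random edges; the specific GNN variants and hyperparameters compared in the paper may differ.

```python
import torch
from torch_geometric.nn import GCNConv, SAGEConv, GraphConv

def make_backbone(name, in_dim, out_dim):
    # map a name to an interchangeable graph convolution layer
    layers = {"gcn": GCNConv, "sage": SAGEConv, "graphconv": GraphConv}
    return layers[name](in_dim, out_dim)

x = torch.randn(100, 64)                       # toy node features
edge_index = torch.randint(0, 100, (2, 400))   # random edges for illustration
for name in ["gcn", "sage", "graphconv"]:
    conv = make_backbone(name, 64, 32)
    h = conv(x, edge_index)
    print(name, tuple(h.shape))                # each yields (100, 32) embeddings
```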
Figure 9. Case study. The top-left corner shows the input text whose coordinates are to be inferred. The given semantic location of the text marks its center position, and the coordinates of the geographic entities mentioned in the text are provided. On the right, the map and its markers display the inference results of each method. The arrows on the map indicate that the respective positions extend beyond the visible map area.
Figure 10. Error distributions of the proposed GeoSG method and the baselines.
Table 1. Statistical description of the two datasets.

Dataset      #Words/Document       #Geographic Entities/Document     #Documents
             MAX       AVG         MAX       AVG                     NUM
Manhattan    358       142.38      10        3.92                    240
Boston       195       113.10      6         2.58                    72
Table 2. Experimental results of GeoSG and the baselines on the two datasets. The best results are highlighted in bold.

Models         Manhattan             Boston
               MAE       RMSE        MAE        RMSE
Mean           3.239     10.202      7.538      45.042
BD             3.286     5.412       7.327      19.080
Random         3.076     4.369       10.795     30.916
BA             8.608     8.921       44.807     55.445
Rvs            2.804     3.417       7.785      18.284
LLM-Loc        14.167    36.552      14.033     41.751
AttReg         4.399     4.964       16.814     29.860
Geo-twitter    81.781    120.014     151.536    186.114
GeoSG          1.790     2.563       6.026      11.847
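The two metrics in Tables 2–4 follow their standard definitions over per-document location errors (e.g., great-circle distances between predicted and true coordinates; the paper's exact distance function is not restated here). A short sketch:

```python
import numpy as np

def mae_rmse(errors):
    """Mean absolute error and root-mean-square error of location errors."""
    errors = np.asarray(errors, dtype=float)
    return errors.mean(), np.sqrt((errors ** 2).mean())

mae, rmse = mae_rmse([1.2, 0.8, 2.5, 0.4])  # toy error distances
print(f"MAE={mae:.3f}, RMSE={rmse:.3f}")
```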
Table 3. Ablation experiment results. The best results are highlighted in bold.

Method                            Manhattan          Boston
                                  MAE      RMSE      MAE      RMSE
GeoSG                             1.790    2.563     6.026    11.847
w/o distance prediction           1.854    2.512     6.147    12.182
w/o POI geolocation prediction    1.837    2.526     6.364    12.764
w/o self-supervised modules       2.293    3.011     6.362    12.755
w/o graph                         2.410    3.104     6.726    12.813
Table 4. Experimental results of the two self-supervised modules. The best results are highlighted in bold.

Order                        Manhattan          Boston
                             MAE      RMSE      MAE      RMSE
GeoSG                        1.790    2.563     6.026    11.847
Interchanged                 1.907    2.588     6.174    12.341
w/o self-supervised tasks    2.293    3.011     6.362    12.755