CHTopo: A Multi-Source Large-Scale Chinese Toponym Annotation Corpus

Ye, Peng; Jiang, Yujin; Wang, Yadi

doi:10.3390/info16070610

by Peng Ye ¹,², Yujin Jiang ³,⁴ and Yadi Wang ⁵,*

¹ Urban Planning and Development Institute, Yangzhou University, Yangzhou 225127, China
² College of Civil Engineering and Transportation, Yangzhou University, Yangzhou 225127, China
³ Zhejiang Academy of Culture and Tourism Development, Hangzhou 311231, China
⁴ School of Applied Digital Technology, Tourism College of Zhejiang, Hangzhou 311231, China
⁵ School of Management, Henan University of Urban Construction, Pingdingshan 467041, China
* Author to whom correspondence should be addressed.

Information 2025, 16(7), 610; https://doi.org/10.3390/info16070610

Submission received: 3 June 2025 / Revised: 6 July 2025 / Accepted: 15 July 2025 / Published: 16 July 2025

(This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications)

Abstract

Toponyms are fundamental geographical resources characterized by their spatial attributes, distinct from general nouns. While natural language provides rich toponymic data beyond traditional surveying methods, its qualitative ambiguity and inherent uncertainty challenge systematic extraction. Traditional toponym recognition methods based on part-of-speech tagging only focus on the surface-level features of words, failing to effectively handle complex scenarios such as alias nesting, metonymy ambiguity, and mixed punctuation. This leads to the loss of toponym semantic integrity and deviations in geographic entity recognition. This study proposes a set of Chinese toponym annotation specifications that integrate spatial semantics. By leveraging the XML markup language, it deeply combines the spatial location characteristics of toponyms with linguistic features, and designs fine-grained annotation rules to address the limitations of traditional methods in semantic integrity and geographic entity recognition. On this basis, by integrating multi-source corpora from the Encyclopedia of China: Chinese Geography and People’s Daily, a large-scale Chinese toponym annotation corpus (CHTopo) covering five major categories of toponyms has been constructed. The performance of this annotated corpus was evaluated through toponym recognition, exploring the construction methods of a large-scale, diversified, and high-coverage Chinese toponym annotated corpus from the perspectives of applicability and practicality. CHTopo is conducive to providing foundational support for geographic information extraction, spatial knowledge graphs, and geoparsing research, bridging linguistic and geospatial intelligence.

Keywords:

Chinese text; toponym; annotated corpus; toponym recognition

1. Introduction

Toponyms refer to the specific names assigned to geographical entities in particular locations or areas. As some of the most commonly used pieces of public information, toponyms are closely related to people’s daily lives and are indispensable foundational geographic information resources for national administration, economic development, and domestic and international exchanges [1,2,3]. The primary sources of toponym data include surveying, mapping, and natural language [4]. Toponyms are a crucial aspect of informatized surveying, encompassing names, coordinates, and types of geographical features [5]. However, although toponym information obtained through surveying is current, it is relatively limited in quantity and often lacks alternate names, historical names, and low-level toponyms.

Toponyms are also an essential component of maps, which are venues for the mass, systematic appearance of toponyms. However, due to the scarcity of map data, difficulties in updating maps, and issues like multiple names for the same location and duplicate names, toponym data are challenging to acquire and share effectively [6,7]. Natural language serves as another important medium for conveying toponym information, supplementing the aforementioned data acquisition methods [8,9,10]. This includes traditional paper documents, electronic files, statistical tables, and web pages. Despite the richness and diversity of toponyms in natural language, they are characterized by qualitative, ambiguous, and uncertain properties [11]. The lack of standardized and structured methods for acquiring and organizing these toponyms hampers large-scale collection and retrieval [12].

Deep learning is the primary technology currently used to extract toponyms from natural language, mainly through training toponym recognition models on annotated corpora [13,14]. A toponym annotated corpus is a specially collected, structured, and sizeable collection of representative texts in which toponyms are marked according to a specific format [15]. The annotation standards for toponyms differ significantly across languages. For English texts, notable standards include the Generalized Upper Model (GUM), Toponym Resolution Markup Language (TRML), GeoTagger, and SpatialML [16,17,18,19]. Representative Chinese toponym annotated corpora include the People’s Daily corpus and the Microsoft corpus.

Related research has gained attention in the field of geo-information science. For instance, using 432 samples from Aesop’s Fables, entities such as toponyms, people, and objects, along with their spatial relationships, were annotated [20]. Methods for annotating spatial entities, spatial relationships, and spatial processes based on semantic roles have been proposed [21]. Additionally, annotation guidelines for geographical entities and spatial relationships in Chinese texts have been developed, leading to the construction of an annotated corpus based on the Encyclopedia of China: Chinese Geography [22,23]. Overall, existing toponym annotation methods are primarily divided into two categories: One category is based on linguistic part-of-speech tagging, which only identifies toponyms in texts through “toponym” part-of-speech tags without considering the core essence of toponyms—“spatial location semantics.” The other category relies on fundamental entity recognition techniques, which can recognize “toponym” entities but do not design annotation specifications tailored to the particularities of the geographic domain. Consequently, they cannot distinguish between geographic element types of toponyms and exhibit insufficient coverage in annotating professional geographic terms.

This study proposes a methodology for constructing a Chinese toponym annotation corpus, with its innovation lying in the deep integration of spatial semantics and linguistic features. This approach not only preserves the capability of traditional methods to capture surface-level linguistic features but also resolves annotation ambiguity in complex scenarios through spatial semantic constraints. The main contributions of this study include the following two aspects:

(1): We propose a spatial-semantic integrated annotation framework for Chinese toponyms. To address prevalent challenges including nested aliases, metonymic expressions, and mixed punctuation in Chinese geographic nomenclature, we developed fine-grained XML annotation rules that integrate spatial attributes with linguistic features. This approach overcomes the limitations of traditional part-of-speech tagging methods in preserving semantic integrity and identifying geographic entities, which provides a novel technical framework for standardized processing of Chinese toponyms.
(2): A multi-source heterogeneous large-scale Chinese toponym annotation corpus (CHTopo) was constructed. By integrating encyclopedic authoritative texts with dynamic news corpora, this resource comprehensively covers five major categories of geographic names including administrative regions, natural landscapes, and transportation facilities. The hybrid training strategy effectively enhanced the model’s generalization capability across cross-domain texts. The corpus provides high-quality foundational data support for geographic information extraction and spatial knowledge graph construction.

The remainder of this paper is organized as follows: Section 2 introduces the description characteristics and annotation specification of Chinese toponyms, Section 3 explains the annotation of the Chinese toponym corpus, Section 4 presents the experimental evaluation and discussion, and Section 5 presents the conclusions and future work.

2. Description Characteristics and Annotation Specification of Chinese Toponyms

2.1. Description Characteristics

Spatial location is one of the key features distinguishing toponyms from general nouns. To determine whether a noun in a text is a toponym, it is not only necessary to consider linguistic aspects such as part of speech and word integrity but, more importantly, whether the “toponym” explicitly conveys its inherent spatial location semantics. For instance, in descriptions like “Nanjing leaders went to Shanghai for research” and “Nanjing salted duck is very famous,” the actual spatial location of “Nanjing” is not the focus and does not need to be understood as a toponym. Toponym descriptions in Chinese texts also have the following characteristics:

(1): Characteristic Words: Toponyms often end with characteristic words that indicate administrative regions and divisions (e.g., province, city) or types of toponyms (e.g., road, mountain, river, island). These characteristic words help in recognizing toponyms, especially in determining the right boundary of the toponym.
(2): Variable Length: Toponyms do not have a strict length limit and can include multi-character words or named entities. Examples include “京” (“Jing” in Chinese), “双江拉祜族佤族布朗族傣族自治县” (“Shuangjiang Lahu, Va, Blang and Dai Autonomous County” in Chinese), and “中山路” (“Zhongshan Road” in Chinese).
(3): Homonyms: Different types and ranges of geographical features often share the same name in Chinese texts, such as mountains and cities (e.g., “黄山”, Huangshan in Chinese, refers to Huangshan Mountain or Huangshan City), lakes and cities (e.g., “巢湖”, Chaohu in Chinese, refers to Chaohu Lake or Chaohu City), and cities and counties (e.g., “芜湖”, Wuhu in Chinese, refers to Wuhu City or Wuhu County).
(4): Historical and Audience Variability: The spatial location of a toponym with the same name can vary across different historical periods and for different audiences. For instance, Beijing during the Jin Dynasty was located near present-day Bairin Left Banner in Inner Mongolia, while the jurisdiction of Beijing in the year 1949 differed from its current extent.

2.2. Markup Language

Using XML as the markup language, and considering the characteristics of Chinese toponym descriptions, this study references the “Rules for the Classification and Codes of Geographical Names” (GB/T 18521-2001) [24] to categorize Chinese toponyms into five major classes: Area, Water, Sea, Landscape, and Transport. Based on this categorization, we have designed a standard for annotating Chinese toponyms. In this annotation standard, toponyms are tagged using the <Place> tag, and each <Place> tag carries three kinds of attributes:

(1): ID: The serial number of the annotation unit.
(2): Type: The type of geographical feature described by the toponym.
(3): StartNode and EndNode: The start and end positions of the toponym in the original text.

The core advantage of adopting XML markup language lies in the modular design of its structured tags. First, the tag hierarchy is extensible. The base tag <Place> supports nested sub-tags; when new annotation dimensions are added, only new attributes or sub-tags need to be appended under the <Place> tag without modifying existing annotation rules. Second, namespace isolation is implemented. By using custom namespaces, conflicts with tags from other XML standards are avoided, enabling parallel expansion of multi-dimensional annotations. Finally, XSD (XML Schema Definition) constraints are applied. XML Schema is employed to clarify tag types, mandatory requirements for attributes, and other specifications, ensuring the standardization of newly added annotations. To support rapid adaptation to mainstream natural language processing frameworks such as HuggingFace and spaCy, XML markup language can be integrated through format conversion. For named entity recognition tasks in HuggingFace and spaCy, XML is converted to JSON formats supported by these frameworks via XSLT (Extensible Stylesheet Language Transformation). The converted JSON can be directly input into HuggingFace and spaCy to generate the data formats required for training, thereby supporting downstream natural language processing tasks such as named entity recognition.
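As a rough illustration of the format conversion described above, the following sketch turns inline <Place> annotations into character-offset spans, character-level BIO tags, and a JSON record. It uses Python's standard library rather than XSLT, so it is a swapped-in technique and not the pipeline used in this study. The <Doc> wrapper element, the inline layout of the tags, the sample sentence, the attribute values, the JSON keys, and the function names `xml_to_spans` and `spans_to_bio` are all assumptions made for the example; offsets are recomputed from the text instead of being read from StartNode and EndNode, because the exact offset convention is not stated here.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence; layout and attribute values are assumed.
SAMPLE = (
    '<Doc>'
    '<Place ID="1" Type="Water" StartNode="0" EndNode="2">长江</Place>'
    '流经'
    '<Place ID="2" Type="Area" StartNode="4" EndNode="7">南京市</Place>'
    '。'
    '</Doc>'
)


def xml_to_spans(xml_string):
    """Return (plain_text, spans); spans are (start, end, type), end-exclusive."""
    root = ET.fromstring(xml_string)
    parts = []
    spans = []
    offset = 0
    if root.text:
        parts.append(root.text)
        offset += len(root.text)
    for child in root:
        inner = "".join(child.itertext())
        if child.tag == "Place":
            spans.append((offset, offset + len(inner), child.get("Type")))
        parts.append(inner)
        offset += len(inner)
        # Text that follows the closing tag belongs to the parent document.
        if child.tail:
            parts.append(child.tail)
            offset += len(child.tail)
    return "".join(parts), spans


def spans_to_bio(text, spans):
    """Convert character spans to one BIO tag per character."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


text, spans = xml_to_spans(SAMPLE)
record = {"text": text, "entities": [list(s) for s in spans]}
print(json.dumps(record, ensure_ascii=False))
print(spans_to_bio(text, spans))
```

Offset triples of this kind are a common input shape for span-based NER tooling, while the BIO sequence suits token-classification training; which of the two a given framework expects should be checked against that framework's documentation.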

2.3. Annotation Specification

Considering the complexity of toponym descriptions and expressions in Chinese texts, this paper defines several special annotation patterns, which include the following aspects. In particular, square brackets [ ] denote the span of a toponym in the original text. Parentheses ( ) within examples provide supplementary explanations (e.g., abbreviations) and are not part of the annotation.

(1) Toponym alias phenomenon: Toponyms often have aliases, abbreviations, and acronyms, which typically need to be annotated. However, words indicated by terms such as “meaning” in the description are not annotated (Table 1).

(2) Derived toponym phenomenon: In Chinese texts, the combination of “toponym + geographical concept” is often used to describe features associated with the toponym, such as “九华山主峰” (the main peak of Jiuhuashan Mountain in Chinese). This phenomenon is referred to as derived toponyms. Derived toponyms are annotated differently due to their expression of features distinct from the recognized geographical entity. When toponyms are followed by terms indicating topography, geographical scope, or organizations, generally only the main toponym is annotated (Table 2).

Some combinations require contextual judgment to determine whether they represent toponyms. When a term only indicates a type of landform and does not refer to a specific location, it should not be annotated as a toponym. For instance, in the phrase “江南丘陵低丘盆地作物可一年三熟或两熟” (“In the Jiangnan hilly low hill basin, crops can be grown three times or twice a year” in Chinese), “江南丘陵低丘盆地” (Jiangnan hilly low hill basin in Chinese) only describes a type of basin. In geography, specific combinations are sometimes used to describe a natural landscape unit as a whole, such as “河西走廊” (Hexi Corridor in Chinese). Since such a term refers to a determinable location, it should be annotated as a toponym (Table 3).

In addition, commonly recognized general derived toponyms such as “黄河下游” (Lower Yellow River in Chinese) and “长江中下游平原” (Middle-Lower Yangtze Plain in Chinese) can be annotated as whole toponyms.

(3) Metonymy in toponyms: The phenomenon of metonymy in toponyms is both common and complex, and generally, such instances do not require annotation as toponyms. For instance, when a toponym is used as an adjectival phrase to modify another object, implying a certain association but not expressing geographical meaning or spatial location, it should not be annotated as a toponym. An instance of this is “南京盐水鸭” (Nanjing salted duck in Chinese), where “Nanjing” is used to describe a type of duck rather than a geographic location. Additionally, when toponyms in texts imply institutions, units, or organizations, and the focus is not on their geographic location, such terms do not possess geographic semantics and are not annotated as toponyms. For instance, terms like “江苏省规定…” (“Jiangsu Province regulations …” in Chinese) or “日本方面认为…” (“The Japanese side believes …” in Chinese) are used to refer to entities or perspectives without geographic significance and should not be annotated as toponyms.

(4) Spatial relations in toponym expressions: Toponyms often accompany spatial relationship terms. Some commonly used toponyms that imply omitted reference objects, such as “华北” (North China in Chinese) and “东南亚” (Southeast Asia in Chinese), should be annotated as a whole, provided there is no public recognition discrepancy. The most frequent occurrence involves expressions of geographical ranges in the form “toponym + spatial relationship,” indicating a specific relationship with reference objects. To ensure the completeness and consistency of the primary toponym, only the primary toponym should be annotated (Table 4).

(5) Challenges with mixed punctuation in toponyms: Mixed punctuation in toponyms can make it difficult to determine the start and end positions for annotation. When annotating such names, the independence and completeness of linguistic elements should be considered. First, consider composite toponyms with shared elements: when two or more toponyms share common linguistic elements and are connected by punctuation, they are presented as a composite toponym (Table 5).

Second, consider combinations of multiple simple toponym elements, in which several simple toponym elements are combined to express a higher-level geographical concept (Table 6).

Third, consider the expansion of abbreviated toponyms, in which the complete form of an abbreviated toponym is given as a detailed explanation (Table 7). The parentheses indicate expanded forms, but only the full compound term (Baotou–Lanzhou Railway) is annotated as a single toponym.

3. Annotation of Chinese Toponym Corpus

3.1. Corpus Data Source

In general, the selection of data sources for corpora should adhere to the following principles: (1) Authenticity: the data should originate from real language usage and reflect actual applications of toponyms. (2) Diverse genres: incorporate texts from various genres to ensure a balanced representation of toponym types and coverage. For instance, travel writings focus on city and geographical landscape names, whereas city-level news reports often include numerous street and community names. (3) Scale and representativeness: the data source should be of sufficient scale and representativeness. (4) Public availability: preferably, the data sources should be publicly available to ensure accessibility and shareability.

This study selects two different types of corpus data sources: (1) the Encyclopedia of China: Chinese Geography (referred to as C1). This is a specialized popular science work that introduces various geographical elements. The descriptions of toponyms in this source are highly standardized, with a large number of toponyms and diverse types. (2) People’s Daily (April to June 1998) (referred to as C2). This source covers a wide range of fields including society, economy, sports, and education. The toponym descriptions in this source are flexible and diverse, with a relatively scattered distribution, predominantly featuring administrative regions and territorial divisions. To ensure the balance of toponym annotation results, this study employs a stratified sampling method to select the corpus. For the Encyclopedia of China: Chinese Geography, classification is conducted based on China’s provincial administrative divisions, where the proportion of the sample size for each province aligns with its proportion in the total entries of the entire book. For People’s Daily, themes are categorized into “current political news,” “economic development,” “cultural tourism,” and “ecological protection,” with proportional sampling implemented within each theme. Additionally, each text entry in the Encyclopedia of China: Chinese Geography contains approximately 800 characters, with a relatively long length and dense toponym distribution, whereas most texts in People’s Daily have around 300 characters, with a relatively short length and sparse toponym distribution. Thus, fewer texts are sampled from the Encyclopedia of China: Chinese Geography, while a relatively larger number of texts are sampled from People’s Daily.
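The proportional stratified sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the sampling script used for CHTopo: the function name `stratified_sample`, the `stratum_of` callback, the fixed seed, and the toy province labels are all hypothetical, and quotas are rounded per stratum, so the sample size can differ slightly from the requested total.

```python
import random
from collections import defaultdict


def stratified_sample(items, stratum_of, total, seed=0):
    """Draw about `total` items so each stratum keeps its population share."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Quota proportional to the stratum's share of all items.
        quota = round(total * len(members) / len(items))
        quota = min(quota, len(members))
        sample.extend(rng.sample(members, quota))
    return sample


# Toy population: 60 entries for one province and 40 for another (invented).
entries = [("Jiangsu", i) for i in range(60)] + [("Anhui", i) for i in range(40)]
picked = stratified_sample(entries, lambda e: e[0], total=10)
print(len(picked))
```

The same function applies to the People's Daily texts if `stratum_of` returns the theme of a text instead of a province.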

3.2. Corpus Annotation and Consistency Control

The annotation team comprised 3 graduate students majoring in geographic information science and 1 linguistics expert. To ensure the accuracy and consistency of annotation results, this study established a full-process quality control system following the workflow of “training–pre-annotation–calibration–verification–revision.” All team members completed two phases of training: first, rule training, involving systematic study of the Chinese Toponym Annotation Specifications and unified understanding of “spatial location semantics” through simulated annotation of 100 typical texts and expert explanations; and second, tool training, focusing on mastering the operation of XML annotation tools to ensure technical implementation aligned with the rules.

Prior to formal annotation, inter-annotator consistency was calibrated through pre-annotation tests. Four annotators independently annotated a common batch of 200 texts (100 from the Encyclopedia of China: Chinese Geography and 100 from People’s Daily). Cohen’s kappa statistic was used to evaluate consistency, with a focus on alignment in annotation boundaries, type labeling, and handling of special patterns. The initial average kappa value from pre-annotation was 0.72. Discrepant cases were discussed by experts, and rules were revised. After two rounds of calibration, the final kappa value increased to 0.85, meeting the requirement for annotation consistency.
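Cohen's kappa is defined for two annotators, so an average value for four annotators can be read as the mean over all annotator pairs. The sketch below follows that reading; it is an assumption, as are the choice of per-character tags as the unit of agreement and the function names `cohen_kappa` and `mean_pairwise_kappa`.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement from each annotator's marginal label distribution.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over every pair of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Toy tags from three hypothetical annotators for a four-character text.
tags = [
    ["B", "I", "O", "O"],
    ["B", "O", "O", "O"],
    ["B", "I", "O", "O"],
]
print(mean_pairwise_kappa(tags))
```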

During formal annotation, a three-level verification mechanism of “cross-validation + expert review + automated checking” was adopted. For every 500 annotated texts, 10% were randomly selected for re-annotation by members not involved in the original process. Discrepancies such as boundary errors and type misclassifications were recorded, with the correction rate controlled within 3%. An audit group composed of linguistics experts and geographic information experts conducted full-volume reviews of batches with an error rate ≥ 2%, focusing on complex cases (e.g., metonymy scenarios and mixed punctuation). The final correction rate was reduced to 1.2%. Under this full-process quality control system, the annotation results for these two data sources are presented in Table 8.

The Chinese dataset “zh_msra.tar.gz” created by Microsoft Research Asia (MSRA) and the New Era People’s Daily Segmented Corpus (NEPD) are both highly significant annotated corpora for modern Chinese. When comparing the CHTopo corpus with these two existing corpora, the following observations can be made: First, NEPD only provides word segmentation annotation and cannot directly determine whether a segment corresponds to a toponym based on segmentation results; thus, it is not applicable for toponym recognition tasks. The zh_msra corpus, derived from real news texts, employs the BIO tagging scheme to annotate Chinese named entities, including 36,517 toponym entities in total. Notably, the zh_msra corpus is designed for general named entity recognition tasks and does not specifically focus on Chinese toponym annotation. In contrast, the CHTopo corpus demonstrates advantages in both the richness of corpus sources and the granularity of toponym annotation.

4. Testing and Analysis of Chinese Toponym Annotation Corpus

The quality evaluation of toponym annotation corpora is a complex task, with different evaluation methods suited to different applications. Toponym recognition is one of the most important applications of toponym annotation corpora. Therefore, this study aims to assess the quality of the annotated corpora through performance evaluation of toponym recognition, including closed testing and open testing. The toponym recognition model employs Conditional Random Fields (CRFs) [25]. As a classical sequence labeling model, its advantages lie in its strong interpretability, well-defined feature functions, low computational cost, and suitability for small-sample scenarios. The CRF method utilized manually designed feature templates based on the experimental corpus [26], and these templates were created according to the requirements of the CRF tool’s corpus format. The template design is illustrated in Table 9. The feature template design of a CRF is closely integrated with the word-formation rules of Chinese toponyms and the requirements of sequence labeling tasks. First, from a linguistic motivation perspective, it needs to cover the core patterns of toponyms. Chinese toponyms are typically composed of “specific name + generic name,” and their key patterns can be captured by unigrams and bigrams. Second, from the perspective of task adaptability, it must match the needs of sequence labeling. The design of the CRF’s feature functions requires direct association with the character information at the current position (i) and contextual positions (i − 1, i + 1) to predict toponym boundaries. Unigram features (i − 1, i, i + 1) and bigram features (i − 1, i; i, i + 1) exactly cover the core contextual windows for sequence labeling, avoiding the explosion of feature dimensions caused by excessively large feature windows.

The CRF model was trained using a feature set including character-level n-grams (n = 1, 2), part-of-speech tags, and contextual position indicators. The regularization parameter was set to 0.1 via grid search.
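The unigram and bigram context window described above can be illustrated with a small character-level feature extractor. This is only a sketch: the feature names, the `<BOS>` and `<EOS>` padding symbols, and the functions `char_features` and `sentence_features` are hypothetical, the actual templates are those of Table 9, and the part-of-speech features are omitted.

```python
def char_features(sentence, i):
    """Features for position i: unigrams at i-1, i, i+1 and two bigrams."""

    def char_at(j):
        # Pad positions that fall outside the sentence.
        if j < 0:
            return "<BOS>"
        if j >= len(sentence):
            return "<EOS>"
        return sentence[j]

    prev_c, cur_c, next_c = char_at(i - 1), char_at(i), char_at(i + 1)
    return {
        "U[-1]": prev_c,
        "U[0]": cur_c,
        "U[+1]": next_c,
        "B[-1,0]": prev_c + "/" + cur_c,
        "B[0,+1]": cur_c + "/" + next_c,
    }


def sentence_features(sentence):
    """One feature dictionary per character, in sentence order."""
    return [char_features(sentence, i) for i in range(len(sentence))]


for feats in sentence_features("中山路"):
    print(feats)
```

For a toponym such as “中山路”, the bigram ending in the generic name “路” is the kind of cue that helps the model place the right boundary.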

This experiment utilized three evaluation metrics from the field of natural language processing to assess the results of Chinese toponym recognition: Precision (P), Recall (R), and F1 Score.

P = \frac{TP}{TP + FP} \times 100\% \quad (1)

R = \frac{TP}{TP + FN} \times 100\% \quad (2)

F_1 = \frac{2 \times P \times R}{P + R} \quad (3)

In Equations (1)–(3), TP is the number of positive samples correctly identified as positive; FP is the number of negative samples incorrectly identified as positive; and FN is the number of positive samples incorrectly identified as negative. TN, the number of negative samples correctly identified as negative, does not enter any of the three metrics. These metrics allow for a comprehensive evaluation of the toponym recognition model’s performance, providing insights into its accuracy, completeness, and overall effectiveness in processing toponym annotations.
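Equations (1)–(3) can be computed directly from sets of gold and predicted toponym spans. The sketch below assumes exact-match scoring, in which a prediction counts as a true positive only when its start, end, and type all agree with a gold span; the matching criterion is an assumption, and the function name `precision_recall_f1` and the sample spans are hypothetical.

```python
def precision_recall_f1(gold_spans, predicted_spans):
    """Return (P, R, F1) in percent under exact span matching."""
    gold = set(gold_spans)
    pred = set(predicted_spans)
    tp = len(gold & pred)   # predicted and correct
    fp = len(pred - gold)   # predicted but not in the gold standard
    fn = len(gold - pred)   # gold spans the model missed
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented spans: (start, end, type).
gold = [(0, 2, "Water"), (4, 7, "Area"), (10, 12, "Area")]
pred = [(0, 2, "Water"), (4, 6, "Area")]
print(precision_recall_f1(gold, pred))
```

In this toy case one of two predictions is correct and one of three gold toponyms is found, so the boundary error on the second span lowers both precision and recall.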

4.1. Closed Testing

Closed testing involves developing a recognition model using a subset of the annotated corpus and then evaluating the model on the remaining portions of the corpus. This testing approach is categorized into two types based on the source of the training corpus: single-corpus testing and mixed-corpus testing.

4.1.1. Single-Corpus Testing

First, 300 texts were randomly selected from C1 to create training datasets, with three iterations denoted as C11, C12, and C13. Similarly, 1000 texts were randomly chosen from C2 for training, resulting in datasets C21, C22, and C23. These six training datasets were used to develop corresponding toponym recognition models. Subsequently, C1 and C2 were used as test corpora for toponym recognition. The test results are presented in Figure 1, where C1 + C2 represents the combined corpus of C1 and C2.

From Figure 1, it is evident that C11, C12, and C13 achieved similar precision and recall rates for C1, indicating a high degree of similarity within the C1 corpus. Randomly selecting different subsets of this corpus yields comparable results. In contrast, C21 and C22 produced similar recognition results for C2, while C23 showed significant deviations, suggesting variability in the People’s Daily corpus and indicating that the testing outcomes are influenced by the selected texts.

The recognition performance of C11, C12, and C13 on C2 was relatively poor, while C21, C22, and C23 performed inadequately on C1. This suggests that the toponym description features in the Encyclopedia of China: Chinese Geography and People’s Daily corpora differ substantially, making cross-corpus recognition challenging. Notably, the model trained with C13 showed slightly better performance for recognizing C1, whereas the recall rate for C2 using the C23 model was significantly higher than that of the other models, indicating that C13 and C23 were the best-performing training sets for their respective source corpora.

Comparing the test results of C11, C12, and C13 with those of C21, C22, and C23 shows that performance varies more among the CRF models trained on C21, C22, and C23. This is primarily because C21, C22, and C23 are drawn from People’s Daily, where reports on different topics exhibit substantial variations in toponym types, character lengths of toponyms, and contextual features. For instance, regional administrative toponyms appear more frequently in reports on current political news and economic development. These toponyms have clear boundaries and distinctive generic names, contributing to more stable boundary recognition. However, the regional administrative toponyms in these two kinds of reports also differ: current political news reports often feature administrative divisions (e.g., “Beijing Municipality,” “Jiangsu Province”) with relatively short character lengths, whereas economic development reports involve economic development zones (e.g., “Suzhou Industrial Park”) with longer character lengths, which may affect the accuracy of boundary recognition. Additionally, cultural and tourism-themed reports involve a large number of scenic spot toponyms (e.g., “Zhouzhuang Ancient Town,” “National Forest Park”). These toponyms typically adopt a composite structure of “specific name + multi-level generic name,” imposing higher requirements on the model’s ability to capture contextual information. It can be concluded that inherent differences in the corpus features of randomly sampled training data affect the toponym recognition performance of CRF models.

4.1.2. Mixed-Corpus Testing

The models trained on single corpora exhibited varying performance for C1 and C2, each showing strengths and weaknesses. To enhance the recognition accuracy and recall rates for both datasets, a mixed training approach was adopted. In this approach, portions of both C1 and C2 corpora were combined to create new training datasets for recognition and testing. Given that C13 and C23 showed the best recognition results in single-corpus testing, they were selected as representative corpora for C1 and C2, respectively, and were mixed to form a new training dataset, C3.

The testing results indicate that C3 leverages the advantages of C13 and C23, providing robust recognition performance for both C1 and C2. This improvement leads to enhanced precision and recall rates overall, and broadens the model’s recognition capabilities. Closed testing results show that models trained on single corpora reflect the toponym and description characteristics of their respective types of text, but may not be directly applicable to corpora with significant differences. The mixed-corpus approach addresses the limitations of single-corpus models, offering a more comprehensive reflection of the overall toponym description features within the corpus. This approach better aligns with the design goals of achieving universality and broad applicability in Chinese toponym annotation corpora.

4.2. Open Testing

Open testing refers to the evaluation of recognition models using texts outside of the annotated corpora. For this purpose, 282 specialized texts were collected from sources such as the internet and news outlets, covering topics like natural disasters, influenza, and the South China Sea issues. These texts were manually annotated, resulting in 2914 toponyms. To comprehensively assess the performance differences in toponym recognition using various corpora, models trained on C13, C23, and C3 were tested on this specialized dataset. The experimental results indicate that the model trained with the mixed corpus C3 achieved approximately a 10% increase in F1 score compared to models trained on C13 and C23. This improvement underscores the broader recognition capability of the mixed corpus (Figure 2).

Both the Encyclopedia of China: Chinese Geography and People’s Daily corpora primarily focus on domestic toponyms and cover a limited range of locations, notably lacking many global toponyms and administrative divisions of counties and towns, which are relatively common in the specialized texts used for open testing. To address this gap, we selected simple texts from online geographical encyclopedias that include these missing toponyms to construct a new web-based corpus data source (C4). The supplementary corpus C4 was constructed by scraping 500 web pages from Baidu Encyclopedia containing county-level toponyms not covered in C1/C2.

The experiments demonstrated that by combining C4 with C13 and C23 to form a new training dataset, C3′, targeted supplementation of toponyms was achieved, thereby enhancing toponym recognition capabilities. Through this open testing, we further realized that constructing an effective toponym recognition model requires not only ensuring the diversity of the training corpus but also ensuring comprehensive coverage of toponyms.
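One simple way to find candidates for such targeted supplementation is to compare the toponyms annotated in the test texts with those seen in training. The sketch below is an assumption about how the gap could be measured, not the procedure used in this study; the function name and the sample toponyms are illustrative.

```python
def uncovered_toponyms(train_toponyms, test_toponyms):
    """Return test toponyms never seen in training, plus type-level coverage."""
    train = set(train_toponyms)
    test = set(test_toponyms)
    missing = sorted(test - train)
    coverage = 1 - len(missing) / len(test) if test else 1.0
    return missing, coverage

# Toy example: two of the four distinct test toponyms are absent from training.
train = ["北京", "上海", "浙江省"]
test = ["北京", "温岭市", "马尼拉", "上海"]
missing, coverage = uncovered_toponyms(train, test)
```

The `missing` list then indicates which toponyms a supplementary web corpus would need to contain.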

4.3. Extended Experiments with Deep Learning

Pre-trained models exemplified by BERT (Bidirectional Encoder Representations from Transformers) demonstrate exceptional capabilities in text representation and comprehension. The prevailing paradigm for named entity recognition (NER) tasks predominantly employs BERT or BERT-BiLSTM as foundational text feature encoders. Long Short-Term Memory (LSTM) networks effectively learn sentence-level contextual features, while the Bidirectional LSTM (BiLSTM) [27] architecture integrates forward and backward LSTMs through concatenated outputs. This bidirectional mechanism ensures that predictions are jointly determined by preceding and subsequent inputs, thereby enhancing output accuracy. Overall, combining the contextual semantic representation capability of pre-trained language models (BERT) with the sequence modeling advantages of BiLSTM yields excellent performance in Chinese named entity recognition tasks, making it one of the mainstream choices for balancing performance and complexity.

In the BERT-BiLSTM-CRF framework, input text sequences first undergo contextual feature extraction through BERT’s character embedding layer, generating low-dimensional character vectors. These vectors subsequently pass through a BiLSTM-based bidirectional encoding layer for higher-dimensional feature extraction. The processed features then enter a CRF decoding layer incorporating MFM (Multi-Feature Mapping) or TLP (Tag Label Projection) mechanisms, which enforces strict compliance with entity category formats during sequence output. This hierarchical architecture ultimately accomplishes the entity prediction task through coordinated multi-layer feature transformation.
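The format constraint that the decoding layer enforces can be made concrete with a small sketch of the legal tag transitions under BIOES tagging. This only illustrates the constraint itself: in a trained CRF it is realized through learned or masked transition scores. The MFM and TLP mechanisms are not modeled, and the label name `LOC` is hypothetical.

```python
def is_valid_transition(prev, cur):
    """Return True if tag `cur` may follow tag `prev` under BIOES."""
    p_pos, _, p_type = prev.partition("-")
    c_pos, _, c_type = cur.partition("-")
    if p_pos in ("B", "I"):
        # An open entity must continue or end, and keep the same type.
        return c_pos in ("I", "E") and c_type == p_type
    # After O, E or S only a new entity (B or S) or O may follow.
    return c_pos in ("O", "B", "S")

def is_valid_sequence(tags):
    """Check a whole tag sequence, treating the sentence start like O."""
    prev = "O"
    for tag in tags:
        if not is_valid_transition(prev, tag):
            return False
        prev = tag
    # A sentence may not end inside an unfinished entity.
    return prev.partition("-")[0] not in ("B", "I")

ok = is_valid_sequence(["B-LOC", "I-LOC", "E-LOC", "O", "S-LOC"])
bad = is_valid_sequence(["B-LOC", "O"])
```

A decoder restricted to valid transitions can never output, for example, an `I-LOC` directly after `O`, which is the kind of format compliance described above.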

This study further implements the BERT-BiLSTM-CRF model for Chinese toponym recognition, with detailed training parameters specified in Table 10. To mitigate catastrophic forgetting, in which BERT may overwrite pre-trained knowledge during fine-tuning, we set the pre-training layer’s learning rate to 0.0001 based on existing research [28] and experimental validation. A dropout rate of 0.2 was applied to randomly deactivate neuron units during training, mitigating overfitting risks. The C3 corpus was randomly partitioned into training, validation, and test sets following a 7:2:1 ratio. Additionally, the XML-formatted annotated data were converted into the BIOES tagging format for model compatibility.
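The two data preparation steps, conversion to BIOES and the 7:2:1 partition, can be sketched as follows. The sketch assumes a flat layout in which `<Place>` elements sit directly inside a sentence wrapper. The wrapper name `<s>`, the label `LOC`, and both function names are illustrative and not taken from the CHTopo specification; nested annotations are not handled.

```python
import random
import xml.etree.ElementTree as ET

def xml_to_bioes(xml_sentence, label="LOC"):
    """Convert one XML-annotated sentence to (character, tag) pairs."""
    root = ET.fromstring(xml_sentence)
    pairs = [(ch, "O") for ch in (root.text or "")]
    for child in root:
        name = "".join(child.itertext())
        if child.tag == "Place" and len(name) == 1:
            pairs.append((name, f"S-{label}"))       # single-character toponym
        elif child.tag == "Place" and name:
            pairs.append((name[0], f"B-{label}"))    # beginning
            pairs.extend((ch, f"I-{label}") for ch in name[1:-1])  # inside
            pairs.append((name[-1], f"E-{label}"))   # end
        else:
            pairs.extend((ch, "O") for ch in name)   # any other element
        pairs.extend((ch, "O") for ch in (child.tail or ""))
    return pairs

def split_7_2_1(sentences, seed=0):
    """Randomly partition sentences into train/validation/test at 7:2:1."""
    items = list(sentences)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 7 // 10
    n_val = len(items) * 2 // 10
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

pairs = xml_to_bioes("<s>台风在<Place>浙江省</Place>登陆</s>")
train, val, test_set = split_7_2_1(range(10))
```

Tagging at the character level avoids a dependency on word segmentation, which matches the character-based input of BERT for Chinese.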

To benchmark against state-of-the-art methods, we fine-tuned a BERT-BiLSTM-CRF model on the CHTopo corpus. The model achieved an F1 score of 92.30% on C1 (vs. 83.40% for CRF) and 78.10% on C2 (vs. 68.44% for CRF), demonstrating superior performance in handling ambiguous and long-tail toponyms. Moreover, we simulated a low-resource scenario by training on 10% of C3. In this setting, CRF achieved an F1 score of 72.1% (vs. 68.5% for BERT-BiLSTM-CRF), indicating that traditional models may still benefit from handcrafted features when training data are limited.

Comparing the recognition results of the BERT-BiLSTM-CRF model with those of the CRF model shows that both models have generally lower recognition rates for three categories of toponyms (Table 11): derived toponyms, mixed-punctuation toponyms, and metonymic toponyms. Addressing these shortcomings requires taking into account the annotation rules and coverage limitations of the CHTopo corpus, and continuing to improve along directions such as corpus supplementation, feature template refinement, and the integration of multiple recognition models.

4.4. Extended Experiments with Practical Applications

In the current big data environment, social media has gradually been applied to the prevention, preparation, response, and recovery work of disaster management. Identifying toponyms from social media texts has become a fundamental task in the promotion and application of disaster big data. Based on the CHTopo corpus and the CRF model trained therefrom, experiments were conducted using social media data from typical disasters. Super Typhoon Lekima (international number: 1909), the strongest typhoon to make landfall in China in 2019, and Sina Weibo (one of the most widely used social media platforms in China) were selected for this study [29]. This research collected Weibo posts containing the keywords “typhoon” and “Lekima” from 9–12 August 2019, and processed/analyzed posts from the major affected regions (Zhejiang Province, Jiangsu Province, Shanghai Municipality, and Shandong Province). A total of 34,825 eligible Weibo posts were obtained [30].
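The selection criteria described above (keywords, date window, and affected regions) can be expressed as a simple filter. This is a minimal sketch under stated assumptions: the post fields `text`, `date`, and `province` are hypothetical, the Chinese keyword forms are assumed, and both keywords are required here because the text says "and".

```python
from datetime import date

KEYWORDS = ("台风", "利奇马")  # assumed Chinese forms of "typhoon" and "Lekima"
REGIONS = {"浙江", "江苏", "上海", "山东"}  # major affected regions
START, END = date(2019, 8, 9), date(2019, 8, 12)

def is_eligible(post):
    """Check one post (a dict with hypothetical keys) against all criteria."""
    return (START <= post["date"] <= END
            and post["province"] in REGIONS
            and all(keyword in post["text"] for keyword in KEYWORDS))

posts = [
    {"text": "台风利奇马登陆温岭", "date": date(2019, 8, 10), "province": "浙江"},
    {"text": "台风利奇马影响", "date": date(2019, 8, 15), "province": "浙江"},
    {"text": "台风来了", "date": date(2019, 8, 10), "province": "上海"},
    {"text": "台风利奇马", "date": date(2019, 8, 11), "province": "广东"},
]
eligible = [post for post in posts if is_eligible(post)]
```

Only the first toy post passes: the others fail on date, keyword, and region respectively.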

Applying the CRF model trained on the C3 corpus to toponym recognition in Sina Weibo texts yields results with P = 81.98%, R = 76.34%, and F1 = 79.06%, with partial toponym recognition results shown in Table 12. It can be observed that in social media texts related to natural disaster monitoring and emergency response, the toponyms involved are primarily administrative divisions. As this type of toponym has been extensively annotated in the CHTopo corpus, the trained CRF model also excels at recognizing administrative division-based toponyms. However, since Weibo users share various disaster-related information at any time and place, the texts may contain descriptions of toponyms at lower administrative levels, which poses challenges to toponym recognition. For instance, “Dongyuan Sicun” (a residential community name) in Table 12 was not identified by the CRF model. Overall, applying the CRF model to social media texts can achieve relatively accurate recognition results in most scenarios. Based on the toponym recognition results from social media texts, more extensive research can be further conducted, such as toponym-based disaster information aggregation, disaster status identification, and disaster process analysis.
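As a consistency check, the reported F1 value follows from the reported precision and recall through the usual harmonic mean:

```latex
\[
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 81.98 \times 76.34}{81.98 + 76.34}
    = \frac{12516.71}{158.32}
    \approx 79.06\%
\]
```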

4.5. Discussion

The CHTopo corpus constructed in this study provides high-quality foundational resources for Chinese toponym annotation. However, constrained by its data sources, annotation rules, and technical implementation, it still has several limitations: the absence of temporal dimension annotation, insufficient coverage of dialectal toponyms, a limited quantity of global toponyms, and the lack of a dynamic update mechanism. Future extensions may focus on the following directions:

(1): Expanding temporal dimension annotation to support historical context analysis

A “Time” attribute will be added to the existing XML annotation scheme to record the existence time range or change time points of toponyms. By supplementing temporal attributes, the corpus can support tasks such as “spatial location comparison of the same toponym across different historical periods” and “statistical analysis of toponym change frequency,” thereby providing data support for historical geography research.

(2): Supplementing dialectal corpora to improve the cultural dimension of toponyms

Expand data sources to include dialectal texts (e.g., local county annals, dialectal literary works, and dialectal versions of local news), and add a “Dialect” attribute to XML tags (recording dialectal pronunciation or regional cultural backgrounds). The supplementation of dialectal annotations will enhance the application value of the corpus in scenarios such as dialectal toponym recognition and regional cultural dissemination.

(3): Adding new global toponym data sources to enhance coverage breadth

Building on the existing data sources of the Encyclopedia of China: Chinese Geography and People’s Daily, three new categories of corpus data sources for international toponyms will be added. The first is authoritative geographical literature, including the Encyclopedia of World Geography and United Nations Geospatial Information Management Documents on Toponymic Standardization, which contains multilingual official toponyms. The second is international news corpora, including the China Daily International Edition, BBC Chinese Website, and Xinhua News Agency International News, which cover multiple fields such as politics, economy, and culture. The third is social media and travelogues, including Ctrip International Travelogues and the Zhihu “Global Travel” Topic, which are useful for supplementing niche toponyms.

(4): Constructing a dynamic update system to support real-time toponym annotation

Develop an automated annotation tool based on natural language processing that integrates real-time data sources such as news and government announcements, enabling dynamic extraction and automatic annotation of toponyms. For example, upon detecting an establishment announcement of “Changbai Korean Autonomous County,” the tool can automatically extract the toponym information and supplement it to the corpus while generating a <Place> tag. The dynamic update system will upgrade the corpus from a “static resource” to “living data,” better supporting real-time application scenarios.
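The proposed extensions to the annotation scheme can be pictured with a short sketch that generates a `<Place>` tag carrying the planned attributes. The attribute names `Time` and `Dialect` follow the directions listed above, but their value formats are not defined in this study; the placeholder value and the function name `make_place_tag` are assumptions.

```python
import xml.etree.ElementTree as ET

def make_place_tag(name, time=None, dialect=None):
    """Build a <Place> element as a string, with optional extension attributes."""
    place = ET.Element("Place")
    place.text = name
    if time is not None:
        place.set("Time", time)        # existence range or change point
    if dialect is not None:
        place.set("Dialect", dialect)  # dialectal pronunciation or background
    return ET.tostring(place, encoding="unicode")

# "YYYY-MM-DD/" is a placeholder for an open-ended validity range,
# not a real establishment date.
tag = make_place_tag("长白朝鲜族自治县", time="YYYY-MM-DD/")
parsed = ET.fromstring(tag)
```

Because the new information is carried in optional attributes, existing annotations without `Time` or `Dialect` would remain valid under such a scheme.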

5. Conclusions

The construction of toponym annotation corpora requires careful consideration of the balance, representativeness, and scale of the data. It is essential to use a comprehensive annotation system to ensure the completeness of spatial semantic information. In practical annotation processes, the determination of whether toponyms reflect spatial semantics may involve a degree of subjectivity, which can impact subsequent training and recognition.

Moreover, comparative analysis of annotation results with those from tokenization software reveals that while common and simple toponyms are relatively consistent, there is a significant discrepancy for complex toponyms. Models trained on multi-source corpora not only effectively capture the descriptive features and patterns within the toponym annotation corpus but also enhance the recognition of a wider array of toponyms.

In practical applications, the toponym recognition model can be expanded based on specific needs. For instance, in the open testing conducted in this study, incorporating web-based corpus data effectively addressed issues with out-of-vocabulary terms and broadened the scope of toponym recognition. This study demonstrates that expanding the corpus through web-sourced texts, particularly targeting administrative divisions such as countries, cities, and towns, offers a more efficient and expedient alternative to traditional methods. This approach is particularly valuable for the challenging task of collecting toponym data for other regions.

Author Contributions

Conceptualization, P.Y.; data curation, Y.J.; formal analysis, Y.J.; investigation, P.Y. and Y.J.; methodology, P.Y.; project administration, P.Y.; validation, Y.J.; visualization, Y.W.; writing—original draft, P.Y.; writing—review and editing, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant no. 42301522), and the Humanities and Social Sciences Foundation of Yangzhou University (grant no. xjj2021-08).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

DeLozier, G.; Baldridge, J.; London, L. Gazetteer-independent toponym resolution using geographic word profiles. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI’15), Austin, TX, USA, 25–30 January 2015; AAAI Press: New York, NY, USA, 2015; pp. 2382–2388.
Kumar, A.; Singh, J.P. Location reference identification from tweets during emergencies: A deep learning approach. Int. J. Disaster Risk Reduct. 2019, 33, 365–375.
Speriosu, M.; Baldridge, J. Text-driven toponym resolution using indirect supervision. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, 4–9 August 2013; Association for Computational Linguistics: Sofia, Bulgaria, 2013; pp. 1466–1476.
Karimzadeh, M.; Pezanowski, S.; MacEachren, A.M.; Wallgrun, J.O. GeoTxt: A scalable geoparsing system for unstructured text geolocation. Trans. GIS 2019, 23, 118–136.
Buscaldi, D. Approaches to disambiguating toponyms. SIGSPATIAL Spec. 2011, 3, 16–19.
Hu, Y.; Mao, H.; McKenzie, G. A natural language processing and geospatial clustering framework for harvesting local place names from geotagged housing advertisements. Int. J. Geogr. Inf. Sci. 2018, 33, 714–738.
Gritta, M.; Pilehvar, M.T.; Collier, N. A pragmatic guide to geoparsing evaluation: Toponyms, named entity recognition and pragmatics. Lang. Resour. Eval. 2020, 54, 683–712.
Mehta, S.; Jain, G.; Mala, S. Natural Language Processing Approach and Geospatial Clustering to Explore the Unexplored Geotags Using Media. In Proceedings of the 2023 13th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 19–20 January 2023; pp. 672–675.
Kuai, X.; Guo, R.; Zhang, Z.; He, B.; Zhao, Z.; Guo, H. Spatial context-based local toponym extraction and Chinese textual address segmentation from urban POI data. ISPRS Int. J. Geo-Inf. 2020, 9, 147.
Berragan, C.; Singleton, A.; Calafiore, A.; Morley, J. Transformer-based named entity recognition for place name extraction from unstructured text. Int. J. Geogr. Inf. Sci. 2023, 37, 747–766.
Halterman, A. Mordecai: Full text geoparsing and event geocoding. J. Open Source Softw. 2017, 2, 91.
Weissenbacher, D.; Magge, A.; O’Connor, K.; Scotch, M.; Gonzalez-Hernandez, G. SemEval-2019 task 12: Toponym resolution in scientific papers. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, MN, USA, 6–7 June 2019; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 907–916.
Wang, S.; Zhang, X.; Ye, P.; Du, M. Deep belief networks based toponym recognition for Chinese text. ISPRS Int. J. Geo-Inf. 2018, 7, 217.
Wallgrun, J.O.; Karimzadeh, M.; MacEachren, A.M.; Pezanowski, S. GeoCorpora: Building a corpus to test and train microblog geoparsers. Int. J. Geogr. Inf. Sci. 2018, 32, 1–29.
Karimzadeh, M.; MacEachren, A.M. GeoAnnotator: A collaborative semi-automatic platform for constructing geo-annotated text corpora. ISPRS Int. J. Geo-Inf. 2019, 8, 161.
Mani, I.; Hitzeman, J.; Richer, J.; Harris, D.; Quimby, R.; Wellner, B. SpatialML: Annotation scheme, corpora, and tools. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, 28–30 May 2008; LPEC: Marrakech, Morocco, 2008.
Talmy, L. The fundamental system of spatial schemes in language. In From Perception to Meaning: Image Schemes in Cognitive Linguistics; Hampe, B., Ed.; De Gruyter: Berlin, Germany, 2005; pp. 199–263.
Leidner, J.L. Toponym Resolution in Text. Ph.D. Thesis, University of Edinburgh, Edinburgh, UK, 2007.
Mani, I.; Doran, C.; Harris, D.; Hitzeman, J.; Quimby, R.; Richer, J.; Wellner, B.; Mardis, S.; Clancy, S. SpatialML: Annotation scheme, resources, and evaluation. Lang. Resour. Eval. 2010, 44, 263–280.
Li, H. Research on Spatial Conceptual Model Based on Natural Language Processing. Ph.D. Thesis, Harbin Institute of Technology, Harbin, China, 2007.
Le, X.; Yang, C.; Yu, W. Spatial concept extraction based on spatial semantic role in natural language. Geomat. Inf. Sci. Wuhan Univ. 2005, 30, 1100–1103.
Zhang, X.; Zhu, S.; Zhang, C. Annotation for geographical named entities in Chinese text. Acta Geod. Cartogr. Sin. 2012, 41, 115–120.
Zhang, X.; Zhang, C.; Zhu, S. Annotation for geographical spatial relations in Chinese text. Acta Geod. Cartogr. Sin. 2012, 41, 468–474.
GB/T 18521-2001; Rules for Classification of Geographical Names and Code Representation. National Standard: Beijing, China, 2001.
Sutton, C.; McCallum, A. An introduction to conditional random fields. Found. Trends Mach. Learn. 2010, 4, 267–373.
Song, S.; Nan, Z.; Huang, H. Named entity recognition based on conditional random fields. Cluster Comput. 2017, 1, 5195–5206.
Qiu, Q.; Xie, Z.; Wu, L.; Tao, L.; Li, W. BiLSTM-CRF for geological named entity recognition from the geoscience literature. Earth Sci. Inform. 2019, 12, 565–579.
Sun, C.; Qiu, X.; Xu, Y.; Huang, X. How to Fine-Tune BERT for Text Classification? In Proceedings of the Chinese Computational Linguistics: 18th China National Conference on Computational Linguistics, Kunming, China, 18–20 October 2019; Springer: Cham, Switzerland, 2019; pp. 194–206.
Ye, P.; Zhang, C.; Chen, M.; Li, S. Typhoon disaster state information extraction for Chinese texts. Sci. Rep. 2024, 14, 7925.
Ye, P.; Zhang, X.; Huai, A.; Tang, W. Information Detection for the Process of Typhoon Events in Microblog Text: A Spatio-Temporal Perspective. ISPRS Int. J. Geo-Inf. 2021, 10, 174.

Figure 1. The results of closed tests. (a) The precision of closed tests. (b) The recall of closed tests. (c) The F1 value of closed tests.