CHTopo: A Multi-Source Large-Scale Chinese Toponym Annotation Corpus
Abstract
1. Introduction
- (1) We propose a spatial-semantic integrated annotation framework for Chinese toponyms. To address prevalent challenges in Chinese geographic nomenclature, including nested aliases, metonymic expressions, and mixed punctuation, we developed fine-grained XML annotation rules that integrate spatial attributes with linguistic features. This approach overcomes the limitations of traditional part-of-speech tagging in preserving semantic integrity and identifying geographic entities, and it provides a technical framework for standardized processing of Chinese toponyms.
- (2) We constructed CHTopo, a multi-source, heterogeneous, large-scale Chinese toponym annotation corpus. By integrating authoritative encyclopedic texts with dynamic news corpora, the resource covers five major categories of geographic names, including administrative regions, natural landscapes, and transportation facilities. The hybrid training strategy enhanced the model’s generalization capability across cross-domain texts. The corpus provides high-quality foundational data for geographic information extraction and spatial knowledge graph construction.
2. Description Characteristics and Annotation Specification of Chinese Toponyms
2.1. Description Characteristics
- (1) Characteristic Words: Toponyms often end with characteristic words that indicate administrative divisions (e.g., province, city) or types of geographical features (e.g., road, mountain, river, island). These characteristic words help in recognizing toponyms, especially in determining the right boundary of a toponym.
- (2) Variable Length: Toponyms have no strict length limit and can include multi-character words or named entities. Examples include “京” (“Jing” in Chinese), “双江拉祜族佤族布朗族傣族自治县” (“Shuangjiang Lahu, Va, Blang and Dai Autonomous County” in Chinese), and “中山路” (“Zhongshan Road” in Chinese).
- (3) Homonyms: Geographical features of different types and extents often share the same name in Chinese texts, such as mountains and cities (e.g., “黄山”, Huangshan in Chinese, refers to Huangshan Mountain or Huangshan City), lakes and cities (e.g., “巢湖”, Chaohu in Chinese, refers to Chaohu Lake or Chaohu City), and cities and counties (e.g., “芜湖”, Wuhu in Chinese, refers to Wuhu City or Wuhu County).
- (4) Historical and Audience Variability: The spatial location of a toponym with the same name can vary across historical periods and for different audiences. For instance, Beijing during the Jin Dynasty was located near present-day Bairin Left Banner in Inner Mongolia, while the jurisdiction of Beijing in 1949 differed from its current extent.
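The role of characteristic words in fixing the right boundary can be sketched as follows. This is a minimal illustration, not the authors' method: the suffix list is an assumed Chinese rendering of the categories named above (province, city, road, mountain, river, island), and `right_boundary_candidates` is a hypothetical helper name.

```python
# Assumed, illustrative suffix inventory: province, city, road, mountain, river, island.
CHARACTERISTIC_WORDS = ("省", "市", "路", "山", "河", "岛")

def right_boundary_candidates(text):
    """Return (end_offset, suffix) pairs where a characteristic word ends.

    end_offset is 1-based and inclusive, counted in characters.
    """
    candidates = []
    for i in range(len(text)):
        for suffix in CHARACTERISTIC_WORDS:
            if text.startswith(suffix, i):
                candidates.append((i + len(suffix), suffix))
    return candidates

candidates = right_boundary_candidates("黄山位于安徽省黄山市。")
print(candidates)
```

Each candidate only proposes where a toponym may end. The sample sentence also shows the homonym problem: “黄山” yields a mountain-type boundary twice, but the second occurrence is part of “黄山市”, so suffix matching alone cannot decide the feature type.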
2.2. Markup Language
- (1) Id: The serial number of the annotation unit.
- (2) Type: The type of geographical feature described by the toponym.
- (3) StartNode and EndNode: The start and end positions of the toponym in the original text.
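As a rough sketch of how these attributes can be read back and checked, the snippet below parses one `<Place>` element with Python's standard `xml.etree.ElementTree` module. It assumes that offsets are 1-based, inclusive, and counted in characters, which is inferred from the annotated examples rather than stated explicitly. The attribute is spelled `Id`, as in the annotated examples, and `check_place` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def check_place(text, annotation):
    """Parse one <Place> element and compare its span with the source text."""
    elem = ET.fromstring(annotation)
    start = int(elem.get("StartNode"))
    end = int(elem.get("EndNode"))
    # Assumption: offsets are 1-based and inclusive, counted in characters.
    surface = text[start - 1:end]
    return {
        "id": elem.get("Id"),
        "type": elem.get("Type"),
        "surface": surface,
        "matches": surface == (elem.text or "").strip(),
    }

text = "枯水期时鄱阳湖湖面急剧萎缩。"
annotation = '<Place Id="3" Type="Water" StartNode="5" EndNode="7">鄱阳湖</Place>'
result = check_place(text, annotation)
print(result)
```

A check like this can flag annotations whose offsets drift from the surface string, for example after whitespace is added to or removed from the source text.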
2.3. Annotation Specification
3. Annotation of Chinese Toponym Corpus
3.1. Corpus Data Source
3.2. Corpus Annotation and Consistency Control
4. Testing and Analysis of Chinese Toponym Annotation Corpus
4.1. Closed Testing
4.1.1. Single-Corpus Testing
4.1.2. Mixed-Corpus Testing
4.2. Open Testing
4.3. Extended Experiments with Deep Learning
4.4. Extended Experiments with Practical Applications
4.5. Discussion
- (1) Expanding temporal dimension annotation to support historical context analysis
- (2) Supplementing dialectal corpora to improve the cultural dimension of toponyms
- (3) Adding new global toponym data sources to enhance coverage breadth
- (4) Constructing a dynamic update system to support real-time toponym annotation
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Example 1 | |
---|---|
Original text (in Chinese) | [扎什伦布寺]最早称“[康建曲批]”,意为“雪城兴佛”。 |
Translation text (in English) | The [Tashilhunpo Monastery] was originally called “[Kangjianqupi]”, meaning “Xue Cheng Xing Fo”. |
Annotated text | <Place Id="1" Type="Area" StartNode="1" EndNode="5">扎什伦布寺 (“Tashilhunpo Monastery” in Chinese)</Place> |
 | <Place Id="2" Type="Area" StartNode="10" EndNode="13">康建曲批 (“Kangjianqupi” in Chinese)</Place> |
Example 2 | |
---|---|
Original text (in Chinese) | 枯水期时[鄱阳湖]湖面急剧萎缩。 |
Translation text (in English) | During the dry season, the lake surface of [Poyang Lake] shrinks dramatically. |
Annotated text | <Place Id="3" Type="Water" StartNode="5" EndNode="7">鄱阳湖 (“Poyang Lake” in Chinese)</Place> |
Example 3 | |
---|---|
Original text (in Chinese) | [华西县]县府搬迁。 |
Translation text (in English) | The county government of [Huaxi County] has been relocated. |
Annotated text | <Place Id="4" Type="Area" StartNode="1" EndNode="3">华西县 (“Huaxi County” in Chinese)</Place> |
Example 4 | |
---|---|
Original text (in Chinese) | [昆仑山脉][塔什库尔干谷地]的海拔3100–3900米。 |
Translation text (in English) | The elevation of the [Kunlun Mountains] and the [Tashkurgan Valley] ranges from 3100 to 3900 m. |
Annotated text | <Place Id="5" Type="Landscape" StartNode="1" EndNode="4">昆仑山脉 (“Kunlun Mountains” in Chinese)</Place> |
 | <Place Id="6" Type="Landscape" StartNode="5" EndNode="11">塔什库尔干谷地 (“Tashkurgan Valley” in Chinese)</Place> |
Example 5 | |
---|---|
Original text (in Chinese) | [蒲圻市]南部为低山丘陵。 |
Translation text (in English) | The southern part of [Puqi City] consists of low mountain hills. |
Annotated text | <Place Id="7" Type="Area" StartNode="1" EndNode="3">蒲圻市 (“Puqi City” in Chinese)</Place> |
Example 6 | |
---|---|
Original text (in Chinese) | [兴安岭]由[大、[小兴安岭]]组成。 |
Translation text (in English) | The [Xing’an Mountains] are composed of [Daxing’an Mountains] and [Xiaoxing’an Mountains]. |
Annotated text | <Place Id="8" Type="Landscape" StartNode="1" EndNode="3">兴安岭 (“Xing’an Mountains” in Chinese)</Place> |
 | <Place Id="9" Type="Landscape" StartNode="7" EndNode="10">小兴安岭 (“Xiaoxing’an Mountains” in Chinese)</Place> |
 | <Place Id="10" Type="Landscape" StartNode="5" EndNode="10">大、小兴安岭 (“Daxing’an and Xiaoxing’an Mountains” in Chinese)</Place> |
Example 7 | |
---|---|
Original text (in Chinese) | [[黑河]–[腾冲]线]是我国地理上非常重要的一条分界线。 |
Translation text (in English) | The [[Heihe]–[Tengchong] Line] is a very important geographical boundary in China. |
Annotated text | <Place Id="11" Type="Area" StartNode="1" EndNode="2">黑河 (“Heihe” in Chinese)</Place> |
 | <Place Id="12" Type="Area" StartNode="4" EndNode="5">腾冲 (“Tengchong” in Chinese)</Place> |
 | <Place Id="13" Type="Landscape" StartNode="1" EndNode="6">黑河–腾冲线 (“Heihe–Tengchong Line” in Chinese)</Place> |
Example 8 | |
---|---|
Original text (in Chinese) | [包(头)兰(州)铁路]全长990公里。 |
Translation text (in English) | The [Baotou–Lanzhou Railway] has a total length of 990 km. |
Annotated text | <Place Id="14" Type="Transport" StartNode="1" EndNode="10">包(头)兰(州)铁路 (“Baotou–Lanzhou Railway” in Chinese)</Place> |
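The nested cases above (Examples 6 and 7) suggest a simple consistency check: any two annotated spans in a sentence should be either disjoint or fully nested, never partially overlapping. That rule is an assumption inferred from the examples, not a stated requirement of the specification, and `crossing_pairs` is a hypothetical helper name.

```python
def crossing_pairs(spans):
    """Return pairs of (start, end) spans that overlap without nesting.

    Spans use 1-based, inclusive character offsets.
    """
    bad = []
    for i, (s1, e1) in enumerate(spans):
        for s2, e2 in spans[i + 1:]:
            disjoint = e1 < s2 or e2 < s1
            nested = (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2)
            if not disjoint and not nested:
                bad.append(((s1, e1), (s2, e2)))
    return bad

# Spans from Example 6: 兴安岭, 小兴安岭, and 大、小兴安岭.
example6 = [(1, 3), (7, 10), (5, 10)]
print(crossing_pairs(example6))
print(crossing_pairs([(1, 4), (3, 6)]))
```

The Example 6 spans pass the check, while the second call reports one crossing pair.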
Data Source | Number of Parts | Area | Water | Sea | Landscape | Transport | Total |
---|---|---|---|---|---|---|---|
Encyclopedia of China: Chinese Geography | 1382 | 38,543 | 10,258 | 10,949 | 2658 | 2638 | 65,053 |
People’s Daily | 8199 | 32,223 | 796 | 1067 | 530 | 852 | 35,480 |
No. | Feature Templates of CRF |
---|---|
1 | # Unigram U00:%x[-2,0] U01:%x[-1,0] U02:%x[0,0] U03:%x[1,0] U04:%x[2,0] U05:%x[-2,0]/%x[-1,0] U06:%x[-1,0]/%x[0,0] U07:%x[0,0]/%x[1,0] U08:%x[1,0]/%x[2,0] U09:%x[-1,0]/%x[0,0]/%x[1,0] |
2 | # Bigram B |
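To make the unigram templates concrete, the sketch below expands each `%x[row,col]` macro at one character position, where `row` is an offset relative to the current character and `col` selects a feature column. It is a simplified illustration, not the CRF toolkit's own implementation: the templates are written with ASCII hyphens, and the `_PAD_` placeholder for out-of-range rows and the `expand` helper are assumptions made here.

```python
import re

TEMPLATES = [
    "U00:%x[-2,0]", "U01:%x[-1,0]", "U02:%x[0,0]", "U03:%x[1,0]", "U04:%x[2,0]",
    "U05:%x[-2,0]/%x[-1,0]", "U06:%x[-1,0]/%x[0,0]", "U07:%x[0,0]/%x[1,0]",
    "U08:%x[1,0]/%x[2,0]", "U09:%x[-1,0]/%x[0,0]/%x[1,0]",
]

MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")

def expand(template, rows, pos, pad="_PAD_"):
    """Replace each %x[row,col] macro with the value at a relative row."""
    def lookup(match):
        row = pos + int(match.group(1))
        col = int(match.group(2))
        if 0 <= row < len(rows):
            return rows[row][col]
        return pad  # Assumed placeholder for positions outside the sentence.
    return MACRO.sub(lookup, template)

# One row per character, with a single column holding the character itself.
rows = [[ch] for ch in "枯水期时鄱阳湖湖面"]
features = [expand(t, rows, pos=6) for t in TEMPLATES]
print(features)
```

At the first “湖” (index 6), the features see only two characters on each side, which illustrates how narrow the context window of these templates is.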
Parameter | Value |
---|---|
Learning rate | 0.0001 |
Dropout | 0.2 |
Maximum gradient | 5 |
Model iterations | 100 |
Label categories | 5 types (BIEOS) |
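The BIEOS scheme in the last row can be illustrated by converting annotated spans into per-character labels. This sketch assumes plain labels without a type suffix and 1-based inclusive spans. It handles only non-overlapping spans, so nested annotations would have to be flattened first. The function name `bieos_labels` is hypothetical.

```python
def bieos_labels(text, spans):
    """Label each character with B, I, E, O, or S from 1-based inclusive spans."""
    labels = ["O"] * len(text)
    for start, end in spans:
        if start == end:
            labels[start - 1] = "S"  # Single-character toponym.
            continue
        labels[start - 1] = "B"      # Beginning of the toponym.
        labels[end - 1] = "E"        # End of the toponym.
        for i in range(start, end - 1):
            labels[i] = "I"          # Inside the toponym.
    return labels

text = "枯水期时鄱阳湖湖面急剧萎缩。"
labels = bieos_labels(text, [(5, 7)])
print(list(zip(text, labels)))
```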
Error Type | Original Text | Recognition Result | Annotated Result | Error Analysis |
---|---|---|---|---|
Boundary omission of derived toponyms | 枯水期时鄱阳湖湖面急剧萎缩。 (in Chinese) | 鄱阳湖湖面 (in Chinese) | 鄱阳湖 (in Chinese) | The CRF feature template covers only the current character and its adjacent characters, so it fails to capture that “lake surface” is a non-toponymic derived description. |
 | During the dry season, the lake surface of Poyang Lake shrinks sharply. (in English) | Poyang Lake surface (in English) | Poyang Lake (in English) | |
Structural misjudgment of mixed punctuation toponyms | 秦岭-淮河线是中国地理区分北方地区和南方地区的地理分界线。 (in Chinese) | 秦岭; 淮河 (in Chinese) | 秦岭-淮河线; 秦岭; 淮河 (in Chinese) | The model is not optimized for mixed punctuation, so it splits “Qinling–Huaihe Line” into independent toponyms. |
 | The Qinling–Huaihe Line is a geographical dividing line in China that distinguishes northern and southern regions. (in English) | Qinling; Huaihe (in English) | Qinling–Huaihe Line; Qinling; Huaihe (in English) | |
Over-annotation of metonymic toponyms | 南京盐水鸭是江苏特产。 (in Chinese) | 南京 (in Chinese) | (none) | Due to the high-frequency feature of the generic name “Jing” (“capital”), the model fails to use the context (where “salted duck” indicates a specialty) to identify “Nanjing” as a metonym. |
 | Nanjing salted duck is a specialty of Jiangsu. (in English) | Nanjing (in English) | (none) | |
Original Weibo Text | Recognition Result | Correct Result |
---|---|---|
台风终于放过广东 广州。 (in Chinese) | 广东; 广州 (in Chinese) | 广东; 广州 (in Chinese) |
The typhoon has finally spared Guangdong and Guangzhou. (in English) | Guangdong; Guangzhou (in English) | Guangdong; Guangzhou (in English) |
安吉县综合执法局执法人员巡查到孝源街道皈山场村时,…土房就倒塌了… (in Chinese) | 安吉县; 孝源街道皈山场村 (in Chinese) | 安吉县; 孝源街道; 皈山场村 (in Chinese) |
When law enforcement officers from the Anji County Comprehensive Law Enforcement Bureau patrolled to Guishanchang Village, Xiaoyuan Subdistrict, … the adobe house collapsed … (in English) | Anji County; Xiaoyuan Subdistrict Guishanchang Village (in English) | Anji County; Xiaoyuan Subdistrict; Guishanchang Village (in English) |
台风天都不用出门了啦啦啦。 上海·东园四村 (in Chinese) | 上海 (in Chinese) | 上海东园四村 (in Chinese) |
No need to go out on this typhoon day, la la la. Shanghai·Dongyuan Sicun (in English) | Shanghai (in English) | Shanghai Dongyuan Sicun (in English) |
…中心位于浙江省金华市磐安县内,… (in Chinese) | 浙江省金华市磐安县 (in Chinese) | 浙江省金华市磐安县 (in Chinese) |
…the center is located within Pan’an County, Jinhua City, Zhejiang Province, … (in English) | Pan’an County, Jinhua City, Zhejiang Province (in English) | Pan’an County, Jinhua City, Zhejiang Province (in English) |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ye, P.; Jiang, Y.; Wang, Y. CHTopo: A Multi-Source Large-Scale Chinese Toponym Annotation Corpus. Information 2025, 16, 610. https://doi.org/10.3390/info16070610