ChineseCTRE: A Model for Geographical Named Entity Recognition and Correction Based on Deep Neural Networks and the BERT Model

: Social media is widely used to share real‑time information and report accidents during natural disasters. Named entity recognition (NER) is a fundamental task of geospatial information applications that aims to extract location names from natural language text. As a result, the identi‑ fication of location names from social media information has gradually become a demand. Named entity correction (NEC), as a complementary task of NER, plays a crucial role in ensuring the accuracy of location names and further improving the accuracy of NER. Despite numerous methods having been adopted for NER, including text statistics‑based and deep learning‑based methods, there has been limited research on NEC. To address this gap, we propose the CTRE model, which is a geospa‑ tial named entity recognition and correction model based on the BERT model framework. Our ap‑ proach enhances the BERT model by introducing incremental pre‑training in the pre‑training phase, significantly improving the model’s recognition accuracy. Subsequently, we adopt the pre‑training fine‑tuning mode of the BERT base model and extend the fine‑tuning process, incorporating a neu‑ ral network framework to construct the geospatial named entity recognition model and geospatial named entity correction model, respectively. The BERT model utilizes data augmentation of VGI (volunteered geographic information) data and social media data for incremental pre‑training, lead‑ ing to an enhancement in the model accuracy from 85% to 87%. The F1 score of the geospatial named entity recognition model reaches an impressive 0.9045, while the precision of the geospatial named entity correction model achieves 0.9765. The experimental results robustly demonstrate the effective‑ ness of our proposed CTRE model, providing a reference for subsequent research on location names.


Introduction
With the continuous promotion of the construction of smart cities, the methods used for obtaining geographic spatial data have become more diverse and widespread.Emerging technologies such as smartphones, the Internet of Things, and social media have led to explosive growth in geographic spatial data [1][2][3].As important basic data for urban construction, geographic spatial data contain information on urban spatial structures, transportation networks, public facilities, and many other aspects, providing important support for urban planning, transportation management, and other fields.Among geographic spatial data, geographic named entity data are very important, including place names, addresses, spatial locations, and other components, which play important roles in building the urban geographic spatial environment.Therefore, the geographic named entity recognition of geographic named entity data is crucial for these data to be used more effectively.Currently, the most advanced research in geographic named entities mainly applies relevant theories in natural language processing to the semantic understanding of the text for geographic named entities.This approach is more suitable for geographic named entity data with a single source and a simple structure, and it can achieve good results.However, due to the complexity of geographic named entity data and their diverse data sources, this method performs poorly in processing the data.Previous research roughly divided geographic named entity recognition into two categories: spatial statistical-based geographic named entity recognition [4] and deep neural network-based geographic named entity recognition [5].Many previous studies [6] have achieved high recognition accuracy in geographic named entity recognition, ignoring the correctness of identified geographic entities; thus, further standardization and precision improvement are needed.Therefore, this study uses the BERT model and introduces the incremental pre-training method to improve the accuracy of the model in the pre-training phase.And we also enhance the corpus by enriching the geographic named entity database after completing the incremental pretraining.Additionally, we compare and verify the number of semantic feature extraction modules of the model to further improve its accuracy [7].In addition, many of the geographic named entity recognition data used in previous research come from VGI (volunteered geographic information) data [8][9][10][11][12].Despite the extraction of place names, errors may still exist in the resulting geographic named entity data.Therefore, place name correction work is also very important, as it plays an important role in expanding the standard place name library for the subsequent efficient use of geographic named entity data.Previous research roughly divided Chinese text correction into three categories: rule-based Chinese spelling correction methods, machine learning-based Chinese spelling methods, and deep learning-based Chinese text correction [13][14][15].The existing research mainly focuses on correcting general natural language texts, while geographic named entities are a special type of text that contains potential spatial information.Therefore, this study conducts geographic named entity text correction based on BERT and provides an understanding of the semantic features of geographic named entities.
The innovative points of this article are as follows: (1) We introduce a transfer learning approach and use multiple sources of VGI data to achieve a highly accurate expansion of the geographic named entity database.(2) Incremental pre-training is introduced in the pre-training stage of the geographic named entity recognition model and the geographic named entity correction model, further improving the accuracy of geographic named entity text recognition and correction.
(3) After performing geographic named entity recognition, we conduct further experiments and attempt to achieve geographic named entity text correction based on the existing geographic named entity recognition results, further improving the accuracy of geographic named entity recognition.
In this paper, we propose two models for Chinese geographical named entity recognition and geographical named entity correction, collectively known as CTRE, where "C" stands for Chinese context, "T" represents toponym, "R" denotes recognition, and "E" stands for error (the full term is error correction, which corresponds to text error correction in Chinese, both of which are based on the BERT framework and are extended and constructed).Previous studies have demonstrated the effectiveness of the CRF [16,17] and BiLSTM [18,19] frameworks in text sequence labeling and named entities, so our proposed model framework extends the structure of CRF and BiLSTM on the basis of BERT.Based on the two proposed models, this study aims to address two major challenges.Firstly, the proposed geographical named entity recognition model is utilized to extract and iden-tify geographical named entities from social media data.However, due to the inherent fallibility of language, such as spelling errors and missing spellings in English, as well as homophonic and typo words in Chinese, and the spatial heterogeneity, which refers to different names for the same place and different places having the same name, the identified entities may contain errors.Therefore, the second objective is to employ the proposed geographical named entity error correction model to rectify the errors that are present in the extracted geographical named entities.For example, when identifying a geographic named entity in a Twitter text, the content is "Taylor Swift performs 'Our Song 'from' Taylor Swift (Debut) 'as the first surprise song for Day 2 of 'The Eras Tour 'in Los Angeles, California!"We can identify the geo-named entity "Los Angeles, California" through the geographical named entity recognition model.Unlike English texts, Chinese vocabulary is based on a Chinese character, and each Chinese character has a specific meaning.In English, words are typically composed of letters or numbers, and each letter has no meaning in itself and needs to be combined to form a vocabulary.For example, in Chinese, Beijing is "BeiJing", where the three characters of "Bei" are spelled as a whole to form a word and are recognized in the process of geographical named entity recognition.In English, the letter a in California can be recognized separately.In addition, due to the diversity of online media data, online text is mostly user-generated, so there are errors such as misspellings, missing spelling, etc.For example, "Los Angeles" is misspelled as "Los Angelas", so we need to correct it.Both the geographical named entity recognition and text correction models contribute to the advancement of natural language processing, providing more efficient, accurate, and intelligent solutions for real-world applications.This study's contribution lies in the further correction research based on the original geographic named entity recognition, offering more efficient, accurate, and intelligent solutions for real-world applications.In addition, by correcting the identified geographic named entity data, it provides assistance in expanding the existing standard address dataset.
The remaining parts of our paper are as follows: The second part introduces the concepts related to geographical named entity recognition and the method for correcting Chinese addresses.The BERT model, the models of geographic named entity recognition and text correction, as well as the related content are introduced in the third part.The datasets, pre-processing methods, experiments for named entity recognition and address correction, and the resulting experimental outcomes are presented in the fourth part.The fifth part presents the study's conclusions and prospects for future research.

Geographic Named Entity Recognition Based on Social Media Platforms
Place names are proper names assigned by people to geographical entities in physical space.In addition to denoting specific geographic locations, place names may also include natural or human features.Place names are widely used in people's daily lives and are the basic resources of geographic information [20].Place name data can be collected through empirical data such as interviews and social surveys [21,22], but these approaches are difficult to popularize and apply on a large scale due to problems such as a high cost, low efficiency, and weak generalization.As geographic spatial information services become more widespread, the data are growing exponentially.Social media, location-based travel blogs, and housing advertisements are becoming more prevalent in people's daily lives, and geographic named entity information is widely present in these different types of text [23][24][25].However, this geographic named entity information contained in text is often not effectively utilized.With the development and progress of natural language processing technology, geographic named entity recognition driven by big data has become a hot research topic.Currently, research on methods for geographic named entity recognition in ubiquitous network text can be broadly categorized into two types, traditional spatial statistical methods and deep neural network-based methods.

Methods Based on Spatial Statistics
The spatial statistical method involves researchers creating rules for recognizing geographic named entities in specific research areas or extracting them from text by analyzing their spatial distribution patterns.For instance, de Bruijn et al. proposed a method for extracting place names by matching them with existing databases and OpenStreetMap [26].However, in ubiquitous network text data, particularly in crowdsourced data, the place name information contributed by users usually has a certain degree of arbitrariness.In addition, there are some local conventions for place names, which often contain irregular language and some abbreviations that cannot be identified using methods based on specific place name databases.Furthermore, McKenzie et al. combined multiple spatial statistical metrics and random forest ensemble learning methods to extract neighborhood names from rental property listings [27].Lai et al. used a spatial point pattern analysis method to extract place names from geotagged tweets [28].Nevertheless, these spatial statistical methods may encounter issues such as high dimensionality, computational complexity, and overfitting despite their effectiveness in certain research domains.

Methods Based on Deep Neural Networks
Geographic named entity recognition (NER) based on deep neural networks is an approach that utilizes deep learning models to automatically identify geographic entities from textual data.This method leverages deep neural networks to extract features and classify input text, enabling the automatic identification of geographic named entities such as countries, cities, rivers, etc.In this method, the geographic named entity recognition task is often treated as a sequence labeling problem, where each word in the text sequence is tagged as a geographic entity or non-geographic entity.To achieve this, deep learning models such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and attention mechanisms are commonly used to model the text sequence, and classifiers such as multilayer perceptrons (MLPs) are used to predict the tag of each word.These deep learning models learn the language features and contextual information of geographic entities in the text by training on large amounts of annotated geographic named entity text data, which improves the recognition effectiveness of geographic named entities.Hu et al. proposed a deep learning architecture called C-LSTM that combines geographic dictionaries and rules and is applied to the process of geotagging Weibo data [29].Additionally, NER tools have been utilized for extracting historical corpora, but this approach may lack generalization.Using artificial neural networks is an effective way to address generality and scalability.However, this method cannot solve the problem of toponym disambiguation because it does not utilize the contextual features of the text.To solve the problem of toponym disambiguation, Wang et al. proposed a neural network geotagging model based on the BiLSTM-CRF architecture to extract locations from social media messages and trained the model to manually annotate tweet data and a dataset from Wikipedia [30].With the development of natural language processing technology, pre-training and finetuning language models represented by BERT have made it possible to efficiently and accurately extract geographic named entities from ubiquitous network texts [31].Liu et al. implemented the Geo-NER for geological reports based on the BERT model [32].However, directly applying BERT to geographic named entity recognition may ignore the interlabel constraint relationship between label sequences, which can significantly affect the model's performance in the task.To address this, Ma et al. proposed a BERT-BiLSTM-CRF deep neural network architecture for Chinese text geotagging tasks [33].Qiu et al. proposed a ChineseTR architecture based on weakly supervised BERT + BiLSTM + CRF for Chinese geotagging and trained the Chinese geotagging model on a training dataset generated from the People's Daily corpus [34].Tao et al. proposed an improved BERT model method for geographical named entity recognition and verified the effectiveness of the method [35].The proposed method employed ALBERT + BiLSTM + CRF.The biggest difference from the model framework in this study is the BERT of the base.Compared with BERT, ALBERT has made certain improvements.It adopts the methods of parameter sharing and embedding layer sharing, which reduce the number of parameters in BERT and can improve the training efficiency and reasoning speed.However, this method has limited resource scenarios.For example, the experimental dataset used by Tao et al. is TPCNER, and the dataset size is at the 600,000 level.The resources in this study are abundant, and the data magnitude reaches the 3,000,000 level.Models trained with different scales of data have different semantic feature extraction abilities.Our data size is larger, so the model has better semantic feature extraction ability and the evaluation scale is relatively large.Therefore, we use BERT to build the model on the base to obtain the semantic entities in the training model.The research conducted by Tao et al. was biased towards the improvement of the accuracy of identifying entities in geographic naming recognition.After identification, the recognition effects of different models and methods on the same dataset were compared, and the quality of the dataset was verified.However, our study further carried out the related tasks of text error correction as an in-depth study and supplement to the identification research, which Tao et al. did not have in the previous study.As a result, the scalability of our methodology is better.Deep learning-based geographic named entity recognition models often divide pre-training and fine-tuning into two stages, allowing them to better learn the semantic features of the text by pre-training on massive unsupervised data.

Chinese Address Correction
Text correction plays an important role in the field of natural language processing, and a good correction model is crucial for improving downstream task performance [36].However, Chinese text correction is a challenging task due to its complexity.Research in related fields has proposed various methods for Chinese text correction, which can be categorized into three main groups: rule-based text spelling correction methods, machine learning-based text spelling correction methods, and deep learning-based text spelling correction methods.

Ruled-Based Chinese Text Spelling Correction Method
Rule-based Chinese text spelling correction methods rely on knowledge resources such as dictionaries.They identify a character as a spelling error if it does not comply with the predefined rules, such as not being present in the dictionary, and provide candidate characters as correction options.Early Chinese text correction methods first detected the misspelled position, generated candidate characters for these positions, and then selected a suitable one to replace the misspelling [37][38][39][40].Another early study on Chinese text spelling correction proposed by Chang [41] used a Chinese character dictionary that accounted for the similarities in shape, pronunciation, meaning, and input method code to handle the spelling correction task.Each Chinese character in the sentence was replaced with a similar one from the dictionary, and the probability of all modified sentences was calculated based on a language model.Zhang et al. proposed another rule-based Chinese text correction method that differentiated between Chinese and English matching methods.This method could handle not only Chinese character replacement errors, but also insertion and deletion errors, greatly improving the correcting performance compared to Chang's method [42].In addition, Huang et al. used a segmentation tool (CKIP) to generate correction candidates for detecting Chinese spelling errors [43].Hung et al. corrected Chinese text spelling errors based on manually edited error templates [44].Similarly, Jiang et al. designed a new grammar rule system for correcting Chinese grammar and spelling errors [45].In rule-based Chinese text spelling correction methods, if a character does not comply with predefined rules, the method identifies it as a spelling error and provides candidate characters as correction options.However, in practical applications, rule-based Chinese text spelling correction methods heavily rely on linguistic knowledge and rules, and building language knowledge and rule libraries requires significant human and time costs and cannot cover the complex linguistic phenomena of Chinese.In addition, due to the complexity of Chinese, there are many unknown and ambiguous words that rule-based methods cannot effectively handle.More importantly, rule-based methods require full-text scanning and matching for each input text, resulting in a low processing efficiency, and thus are not suitable for large-scale text processing.Therefore, the limitations of rulebased Chinese text spelling correction methods restrict their widespread use in practical scenarios.With the continuous development of artificial intelligence and natural language processing technologies, machine learning and deep learning-based Chinese text spelling correction methods have gradually become mainstream.

Machine Learning-Based Spelling Method
The machine learning-based Chinese spelling correction method [46][47][48] is a technique that utilizes machine learning algorithms and language models to automatically detect and correct spelling errors in Chinese text.This method requires a large amount of Chinese language corpus to train the language model and teach it the patterns of spelling errors.When a spelling error is detected, the method uses the trained model to analyze and correct it, providing more accurate spelling correction suggestions.This method can be widely used in areas such as Chinese input methods, word processing software, search engines, and social media to improve the accuracy and efficiency of text processing.Compared with traditional rule-based spell checking methods, machine learning-based spell checking methods can better handle complex language patterns and spelling errors and can provide more accurate corrections based on the user's input history and contextual information.In the research on machine learning-based Chinese text spelling correction methods, unsupervised n-gram language models are often used for error detection [49,50].This approach introduces a confusion set of similar characters after error detection to limit the candidate options.Xie et al. replaced characters with confusion sets and evaluated the modified sentence using a joint bi-gram and tri-gram language model [49].In the studies by Jia et al. and Xin et al., graph models were used to represent sentences, and the singlesource shortest path (SSSP) algorithm was performed on the graph to correct spelling errors [51,52].In addition, transforming text spelling correction tasks into sequence labeling problems and using machine learning methods such as conditional random fields or hidden Markov models is also a solution [50,53].Xiong et al. proposed a Chinese spelling correction framework called HANSpeller, which uses an extended hidden Markov model and ranking model to correct spelling errors in Chinese articles, and achieves further refinement using a rule-based model [54].Unlike rule-based Chinese text spelling correction methods, machine learning-based methods can automatically learn correction rules according to the data, making them more adaptable to different domains and types of text, with relatively good adaptability and scalability.However, these methods require a large amount of annotated data in practical applications, and obtaining and standardizing text data require a lot of labor and time costs.Moreover, these methods can easily produce erroneous prediction results when processing long texts due to inherent issues with the model.

Deep Learning-Based Chinese Text Spelling Correction Method
The deep learning-based Chinese text spelling correction method is a technology that uses deep learning models to automatically detect and correct spelling errors in Chinese text.The purpose of this technology is to automatically identify spelling errors in text by training a model to provide correct suggestions or automatic corrections to improve the accuracy and readability of the text.Deep learning-based methods can typically handle various types of spelling errors, including typos, homophones, and look-alike characters, and can adaptively process various types and styles of text.This technology has been widely applied in various fields such as search engines, intelligent customer service, and natural language processing [35].Some research is based on end-to-end networks (e.g., RNN), directly treating Chinese text spelling correction as a sequence labeling task [55,56].Wang et al. proposed an end-to-end confusion-set-guided encoder-decoder model based on the sequence-to-sequence framework, treating Chinese text spelling correction as a sequence-to-sequence task and injecting confusion set information through a copying mechanism [40].However, this method also has a limited generalization ability and performs poorly when dealing with spelling errors that have not appeared in the dataset.Overall, machine learning-based Chinese text spelling correction methods are becoming increasingly popular.

Methods
The geographical named entity recognition model uses the BERT framework as the encoder, and further adds a decoder, which includes a fully connected feedforward neural network layer, a bidirectional LSTM layer, and a CRF layer.In addition, adding a CRF layer after BERT can increase the constraint of the data sequence order.The geographical named entity correction model adds a decoder on top of the BERT framework, which includes two sub-modules: error detection and correction.The input of the decoder is passed to the fully connected layer for linear transformation, and then binary classification is used to judge whether the character is correctly segmented or not.The correction module uses the maximum value selection method to convert the output of the encoder into the corresponding characters.

BERT Model
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model proposed by Google AI Research [31].It leverages the transformer's encoder structure, and its performance is enhanced through two pre-training tasks on a vast text corpus, Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), to further improve the performance of the model.The model's architecture is shown in Figure 1.
widely applied in various fields such as search engines, intelligent customer service, and natural language processing [35].Some research is based on end-to-end networks (e.g., RNN), directly treating Chinese text spelling correction as a sequence labeling task [55,56].Wang et al. proposed an end-to-end confusion-set-guided encoder-decoder model based on the sequence-to-sequence framework, treating Chinese text spelling correction as a sequence-to-sequence task and injecting confusion set information through a copying mechanism [40].However, this method also has a limited generalization ability and performs poorly when dealing with spelling errors that have not appeared in the dataset.Overall, machine learning-based Chinese text spelling correction methods are becoming increasingly popular.

Methods
The geographical named entity recognition model uses the BERT framework as the encoder, and further adds a decoder, which includes a fully connected feedforward neural network layer, a bidirectional LSTM layer, and a CRF layer.In addition, adding a CRF layer after BERT can increase the constraint of the data sequence order.The geographical named entity correction model adds a decoder on top of the BERT framework, which includes two sub-modules: error detection and correction.The input of the decoder is passed to the fully connected layer for linear transformation, and then binary classification is used to judge whether the character is correctly segmented or not.The correction module uses the maximum value selection method to convert the output of the encoder into the corresponding characters.In the fine-tuning stage, the BERT model is initialized with pre-trained parameters, and a neural network structure is designed for specific downstream tasks.It is then finetuned using labeled downstream task datasets.
The emergence of BERT has revolutionized natural language processing, introducing a new pre-training and fine-tuning method for language models.Compared to previous In the fine-tuning stage, the BERT model is initialized with pre-trained parameters, and a neural network structure is designed for specific downstream tasks.It is then finetuned using labeled downstream task datasets.
The emergence of BERT has revolutionized natural language processing, introducing a new pre-training and fine-tuning method for language models.Compared to previous models, the BERT model has stronger semantic understanding capabilities [57].By leveraging semantic information from the text corpus during pre-training, the self-supervised approach implicitly introduces linguistic knowledge for downstream tasks.In the pretraining of the BERT model, the MASK objective function is used to train the model's loss function, which is commonly referred to as the Masked Language Model (MLM) loss.The formula can be specifically represented as where M is the set of tokens that are replaced with the MASK token, x i_true is the true i − th token, and P(x i_true | X _ {< i}, X _ {> i}) is the probability that the model predicts x i_true , obtained through the softmax function.The objective of this loss function is to minimize the difference between the predicted and true values.During training, the model will attempt to optimize this loss function to better understand the context and semantic information of the input sequence.
The computation formula of the transformer in the BERT model is as follows: In this equation, Q is the query matrix, K is the content to be attended to, and V is the value matrix.The purpose of scaling by d k is to avoid the dot product from being too large, as a large dot product can result in a very small gradient after passing through softmax.

Incremental Pre-Training
Similar to fine-tuning, the incremental pre-training of language models is also a transfer learning method.Incremental pre-training refers to further pre-training on a new dataset based on an existing pre-trained language model to enhance the model's performance in a specific domain.Incremental pre-training usually involves the following steps: first, select an existing pre-trained model; then, conduct additional pre-training on the new dataset, typically using the same or similar tasks as the original pre-training; finally, fine-tune the model that has undergone incremental pre-training to adapt to specific downstream tasks.
The advantage of incremental pre-training is that it can utilize information from the new dataset to enhance the model's representational power.Additionally, as the pretrained model has already learned a significant amount of semantic information, the model can converge faster during incremental pre-training and require less training data.For incremental pre-training, we selected 139,255 Sina Weibo text data points from check-in locations in Jinan city from March to December 2022 after pre-processing and cleaning.We used these data to construct a general web text corpus.Some examples of Sina Weibo text data are shown in Table 1.

Geographic Named Entity Recognition Model
Named entities, also known as name entities, are essential terms in the field of natural language processing, encompassing words and phrases that pertain to specific objects, tasks, places, and more, and are identifiable by their names.Among them, geographical named entities represent a subclass that primarily denotes geographic locations, organization names, institutions, and other relevant entities.For the purpose of this study, geographical named entities specifically refer to texts that are abundant in web texts, imbued with geographic semantics.Apart from merely denoting a geographic location, they encompass descriptions of spatial relationships with specific geographic points, including orientation, distance, and other related aspects.
Based on the BERT model, we constructed a geographic named entity recognition framework that can extract text semantics and perform task-oriented recognition.We first conducted incremental pre-training on the collected ubiquitous network text data, and then added new neural network structures for the geographic named entity recognition task.The resulting model predicts whether each character in the ubiquitous network text belongs to a geographic named entity.The incremental pre-training fine-tuning learning model and training framework were built for the geographic named entity recognition task.The input data for the model are ubiquitous network text data, which are more naturallanguage-like and are therefore used by the geographic named entity semantic model for incremental pre-training.The final neural network structure of the geographic named entity recognition model is shown in Figure 2.
orientation, distance, and other related aspects.
Based on the BERT model, we constructed a geographic named entity recognition framework that can extract text semantics and perform task-oriented recognition.We first conducted incremental pre-training on the collected ubiquitous network text data, and then added new neural network structures for the geographic named entity recognition task.The resulting model predicts whether each character in the ubiquitous network text belongs to a geographic named entity.The incremental pre-training fine-tuning learning model and training framework were built for the geographic named entity recognition task.The input data for the model are ubiquitous network text data, which are more natural-language-like and are therefore used by the geographic named entity semantic model for incremental pre-training.The final neural network structure of the geographic named entity recognition model is shown in Figure 2 Our model consists of two main parts: the encoder and the decoder.The encoder is mainly used for the pre-training and fine-tuning stages of the BERT model.In the pretraining stage, we adopt an incremental pre-training method to further improve the accuracy of semantic recognition.In the fine-tuning stage, we mainly train the module quantity of the semantic feature extraction module to ensure that BERT's semantic feature extraction achieves the best results.The output of the encoder is used as the input of the decoder, and its dimensionality is kept consistent.
For the decoder part of the model, the output of the encoder is used as the input.It first passes through a fully connected feedforward neural network layer and is transformed nonlinearly via an activation function to maintain the same dimensionality as the Our model consists of two main parts: the encoder and the decoder.The encoder is mainly used for the pre-training and fine-tuning stages of the BERT model.In the pretraining stage, we adopt an incremental pre-training method to further improve the accuracy of semantic recognition.In the fine-tuning stage, we mainly train the module quantity of the semantic feature extraction module to ensure that BERT's semantic feature extraction achieves the best results.The output of the encoder is used as the input of the decoder, and its dimensionality is kept consistent.
For the decoder part of the model, the output of the encoder is used as the input.It first passes through a fully connected feedforward neural network layer and is transformed nonlinearly via an activation function to maintain the same dimensionality as the input.Then, the output of this layer is used as the input to the bidirectional LSTM layer.The bi-directional LSTM layer contains two LSTM layers, which read the input sequence from both the forward and backward directions and generate forward and backward hidden state sequences, respectively.Finally, these two sequences are concatenated into a new sequence, and the concatenated output is fed into a fully connected layer for sequence labeling classification.The last layer is a CRF layer, which is added after BERT and can add constraints on the sequence data relationships.The CRF layer calculates the CRF loss function based on the sequence labeling classification output of the fully connected layer, thereby adding constraints on the order of generated labels to ensure that the output results are legal.Compared with the BERT model, the previous modules (i.e., the model's input part and the semantic feature extraction module) are exactly the same.
Next, the objective function and loss function of the geographical named entity recognition model will be introduced.When introducing the model structure, it was mentioned that the role of the CRF layer is to calculate the loss function based on the output of the fully connected layer.Specifically, in the geographical named entity recognition task, the loss function of the CRF layer of the model contains two types of scores: emission score and transition score.In the CRF layer, Equation (3) can be used to represent the probabilities.
In Equation (3), n is the sequence length, and ∑ y i X iyi represents the summation of emission scores, which is used to denote the scores of all labels observed at position i with the given feature X iyi .∑ y i ,y i+1 t y i ,y i+1 represents the summation of transition scores, which is used to denote the scores of all transitions from one label to another.Z(x) represents the normalization factor used to calculate the sum of possible state sequences.
The emission score x iy i comes from the output of the BiLSTM layer, which focuses on expressing which geographical named entity label the current character can be mapped to.The transition score t y i y j represents the transition score between categories in the text sequence.For example, t B−Entity,I−Entity = 0.9 means the score from the B − Entity category to the I − Entity category is 0.9.The transition score matrix composed of all category transitions is a parameter of the geographical named entity recognition model, and these scores are updated during the iterative process of training.When calculating the CRF loss function, the true path score (emission score and transition score) and the total score of all paths (emission score and transition score) are calculated.The goal of model training is to make the score of the true path the highest among all paths.Therefore, the objective function of the model can be obtained: Taking the logarithm of the above equation yields the following: Log Objective Function = log P RealPath P 1 + P 2 + . . .+ P N (5) By taking the negative of the objective function above, since the goal of the model training is to minimize the loss function, we obtain the following loss function: Loss Function = −log P RealPath P 1 + P 2 + . . .+ P N (6) In the equation above, P real path represents the true path fraction when calculating the CRF loss function, which is composed of the transmit fraction and the transfer fraction.P i represents the path fraction of the i − th path, so the denominator in the fraction represents the total fraction of all paths.
To make the fine-tuning task more efficient, we propose several optimization strategies.The early stopping strategy is used to prevent overfitting of the neural network.The hierarchical fine-tuning strategy allows for different layers to use different learning rates for model fine-tuning.The layer-by-layer unfreezing strategy, similar to the hierarchical fine-tuning strategy, only trains the parameters related to the target task in higher layers while freezing the general knowledge in lower layers, ensuring that the model's general knowledge is not forgotten.In addition, to evaluate the accuracy of the geographic named entity recognition model, we introduce the evaluation metrics of precision, recall, and F1 score.To introduce these metrics, we first explain several commonly used concepts: True Positive (TP): The number of positive samples that the model correctly classifies as positive, that is, the number of correctly identified geographic named entities in the sample.
False Positive (FP): The number of negative samples that the model incorrectly classifies as positive, that is, the number of non-geographic named entities identified as geographic named entities in the sample.
True Negative (TN): The number of negative samples that the model correctly classifies as negative, that is, the number of correctly identified non-geographic named entities in the sample.
False Negative (FN): The number of positive samples that the model incorrectly classifies as negative, that is, the number of geographic named entities identified as non-geographic named entities in the sample.
The formulas for calculating precision, recall, and F1 score are as follows: As the above formulas show, precision refers to the samples predicted as positive by the model that are actually positive.Recall refers to the proportion of actual positive samples that are correctly predicted by the model.The F1 score is the harmonic mean of precision and recall, which can comprehensively evaluate the performance of both indicators.

Geographic Named Entity Correction
Chinese geographical named entity text error correction refers to detecting and correcting errors in spelling, words, and other aspects of geographical named entity text to improve the accuracy and readability of the text.Unlike English text, Chinese text has certain particularities in the form of language.Chinese vocabulary is composed of one or more Chinese characters, and each Chinese character has its own meaning, while English words usually represent a letter or number, where each letter has no meaning in itself, so letters need to be combined to form a vocabulary.Therefore, for English, entity error correction is used to correct the letters in the word, while for Chinese, entity error correction is used to correct a Chinese character.Types of errors generally include spelling errors and word errors.Geographic named entity correction provides researchers with a new perspective for studying geographical named entities and expands the limitations of the original geographic named entity recognition, which was only focused on recognition accuracy.Examples of geographical named entity of different error types are shown in Table 2.The geographical named entity text correction model is based on the BERT framework and is also divided into two parts: the encoder and the decoder.The resulting model structure is shown in Figure 3.
The encoder part of the model is consistent with the geographic named entity recognition model.The encoder part of the model is identical to that of the geographic named entity recognition model.As for the decoder part, two modules were designed for error detection and correction.For the error detection module, the input of the decoder first passes through a fully connected layer for linear transformation.Since error detection is a binary classification task that classifies each character as either correct or incorrect, the output dimension of this layer needs to be converted to two dimensions, making the final output dimension suitable for the target task's classification dimension.The softmax function is applied to the linearly transformed output to compute the probabilities of correctness and incorrectness for each character.These probabilities are normalized, resulting in output values of either 0 or 1.In the output vector, the value on the first dimension represents the probability of being incorrect, while the value on the second dimension represents the probability of being correct.For the error correction module, the output of the encoder is converted to the corresponding character using the maximum value selection method.The ID of the dimension with the maximum output value is used to predict the character and query the lookup table to obtain the corresponding model predicted character.
perspective for studying geographical named entities and expands the limitations of the original geographic named entity recognition, which was only focused on recognition accuracy.Examples of geographical named entity of different error types are shown in Table 2.

大名山
The geographical named entity text correction model is based on the BERT framework and is also divided into two parts: the encoder and the decoder.The resulting model structure is shown in Figure 3.The encoder part of the model is consistent with the geographic named entity recognition model.The encoder part of the model is identical to that of the geographic named entity recognition model.As for the decoder part, two modules were designed for error detection and correction.For the error detection module, the input of the decoder first passes through a fully connected layer for linear transformation.Since error detection is a binary classification task that classifies each character as either correct or incorrect, the output dimension of this layer needs to be converted to two dimensions, making the final output dimension suitable for the target task's classification dimension.The softmax function is applied to the linearly transformed output to compute the probabilities of correctness and incorrectness for each character.These probabilities are normalized, resulting in output values of either 0 or 1.In the output vector, the value on the first dimension represents the probability of being incorrect, while the value on the second dimension represents the probability of being correct.For the error correction module, the output of the Next, we will introduce the objective function and loss function of the geographical named entity correction model.The decoder structure of the geographical named entity text correction model consists two parts, which include error detection and error correction subtask modules.Therefore, we will discuss the loss functions of these two subtasks separately.
For the error detection subtask, there is a problem of class imbalance in the dataset, which means that the number of samples with incorrect characters is much smaller than that of correct characters.This may lead to poor performance of the model when predicting incorrect characters, which are the focus of the correction task.Therefore, in this section, we use the improved cross-entropy loss function, Focal Loss, to calculate the loss value of the error detection subtask, which has the following mathematical form: The core idea of the Focal Loss function is to balance the importance of positive and negative samples, reducing the weight of easily classified positive samples while focusing on the weight of difficult-to-classify negative samples.In Equation (10), p t represents the predicted probability of the model for the correct character, and −log(p t ) is consistent with the cross-entropy loss function.That is, the degree of punishment for correctly classified characters is proportional to the logarithm of the predicted probability, meaning that the smaller the predicted probability, the greater the punishment.(1 − p t ) γ is the unique aspect of Focal Loss, controlling the degree of punishment for difficult-to-classify samples through an adjustable parameter γ. α t represents the weight of the sample class.In this study, difficult-to-classify negative samples refer to erroneous characters that need to be detected, and we set the value of α t to 0.25 and the value of γ to 2.
For the error correction subtask, it is similar to the target task of confusion word correction.
Therefore, we only list the mathematical form of the loss value without further explanation.
In the training process of the model, the loss value Loss det of the error detection subtask and the loss value Loss cor of the error correction subtask need to be weighted and summed to obtain the final loss function.x represents the input text sequence, while δ(x) represents the text sequence obtained by making small modifications to the input text, which can include operations such as replacing, inserting, or deleting a single character, as shown in Equation (12).
In Equation ( 12), detection_weight is a hyperparameter for model training, representing the proportion of the loss value of the error detection subtask to the total loss value, which indicates the importance of the error detection subtask in the task of geographical named entity text correction.
To make the task of geographical named entity text correction more effective, we use a gradient optimization strategy during training.This strategy is a way to increase the batch size to improve the training efficiency without increasing the memory usage when training deep neural networks.

Datasets and Pre-Training
The original data used in this study come from two sources, one of which being geographical named entity data that were manually collected by government departments, and the other being geographical named entity data that were extracted from ubiquitous web text data.Jinan is located in the eastern coastal region of China and is the capital city of Shandong Province.It covers an area of approximately 10,244.45square kilometers and has 10 districts and two counties under its jurisdiction, as shown in the specific administrative map in Figure 4.
The general preprocessing of geographical named addresses mainly involves methods such as correcting geographical named addresses, filling geographical named elements, and identifying and filling geographical named address elements to preprocess geographical named address data.However, in this study, semantic representation learning was used to obtain the semantic features of geographical named entities.The geographical named entity corpus used in this study not only includes standardized geographical named address data, but also includes geographical named entity data obtained from ubiquitous networks such as social media.Ubiquitous network text data, taking Weibo as an example of the original data source, are shown in Table 3.During the preprocessing of the geographical named entity data, we mainly corrected and cleaned the erroneous data and eliminated the data noise by removing duplicate geographical named entity records, useless characters, and stop words and unifying full-width/half-width characters.Example of the cleaned and preprocessed geographic named entity data are shown in Table 4.
The original data used in this study come from two sources, one of which being geographical named entity data that were manually collected by government departments, and the other being geographical named entity data that were extracted from ubiquitous web text data.Jinan is located in the eastern coastal region of China and is the capital city of Shandong Province.It covers an area of approximately 10,244.45square kilometers and has 10 districts and two counties under its jurisdiction, as shown in the specific administrative map in Figure 4.The general preprocessing of geographical named addresses mainly involves methods such as correcting geographical named addresses, filling geographical named elements, and identifying and filling geographical named address elements to preprocess geographical named address data.However, in this study, semantic representation learning was used to obtain the semantic features of geographical named entities.The geographical named entity corpus used in this study not only includes standardized geographical named address data, but also includes geographical named entity data obtained from ubiquitous networks such as social media.Ubiquitous network text data, taking Weibo as an example of the original data source, are shown in Table 3.During the preprocessing of the geographical named entity data, we mainly corrected and cleaned the erroneous data and eliminated the data noise by removing duplicate geographical named entity records, useless characters, and stop words and unifying full-width/half-width

济南租房独立卫浴春华居可 短租到1214800
Jinan rental independent bathroom Chunhua residence can be short-term rent to 1,214,800

大明湖没有夏雨荷 只有废阳阳的池某灿
There is no Xia Yuhe in Daming Lake, only Chi Moucan in Abandoned Yangyang This study uses a total of 3,530,611 geographic named entity text data points located in Jinan City, Shandong Province in 2022, which have been preprocessed and cleaned as the experimental corpus for constructing model instances of geographic named entities.

Pre-Training and Validation of Place Names Based on the BERT Model
In order to verify the performance and training cost of the proposed BERT model, we calculated various indicators of model training and validation, including the training time, loss value, perplexity of the validation set, and accuracy of the validation set.These results are shown in Table 5.Unlike supervised learning in downstream tasks, the pre-training phase of the geographic named entity semantic model is a self-supervised learning task, and there are no standard labeled data.Therefore, this study added perplexity as the main indicator to evaluate the performance of the model.The calculation formula is as follows: where e is the natural logarithm, and eval_loss is the average loss value of the model on the validation set.
According to the experimental results, our validation accuracy reached 86.52%, which is at a high level and proves that the BERT model, after improvement, can better understand the semantic meaning of geographic named entity text.After discussing and analyzing the pre-training phase used in previous relevant studies, this study obtained similar conclusions to those of previous studies, proving that even if the research area and data objects are different, the model proposed in this study follows the rules defined by previous studies.This model provides a foundation for subsequent model construction and related application research.

Comparative Analysis of the Number of Semantic Modules in BERT Model
Previous research [48] has conducted comparative studies on the number of semantic modules in the BERT model.To verify whether the BERT-based geographic named entity recognition model conforms to the conclusions of previous research, we conducted comparative validation experiments on the number of modules and on whether digits are uniformly replaced.In the number comparison validation experiment, we compared the output results of the BERT-based geographic named entity recognition model with those of previous research to evaluate the differences and similarities between them.In the comparison validation of whether digits are uniformly replaced, we verified whether our model can correctly identify and process digits and replace them uniformly to improve the accuracy and consistency of geographic named entity recognition, addressing the issues found in previous research.Through these validation experiments, we can more accurately evaluate the performance and reliability of the BERT-based geographic named entity recognition model and provide guidance and reference for further improving and optimizing geographic named entity recognition technology.
Therefore, we conducted a comparative analysis of different numbers of semantic feature extraction modules.After training with the training set of the geographic named entity corpus, we set different numbers of semantic feature extraction modules (6, 8, 10, and 12) for four control models.The specific training and validation indicators are shown in Table 6.According to Table 6, as the number of semantic feature extraction modules increases, the training time also increases correspondingly, showing a positive correlation with the module quantity.The perplexity of the training set is roughly negatively correlated with the number of modules, and the loss of the training set and the perplexity of the validation set are relatively small, indicating that there is no overfitting phenomenon.This study ultimately used a model with 12 semantic feature extraction modules as the basic framework for further research to ensure the ultimate effectiveness of downstream tasks.
To further improve the accuracy of subsequent models based on the BERT framework, we conducted a digit replacement experiment after selecting the upper module features.In the geographical named entity recognition task, identifying digits is relatively difficult and less practical.Therefore, we conducted an experiment to replace Arabic numerals in the geographical named entity text corpus, and the experimental results are shown in Table 7.Based on Table 6, we can see that the Arabic numerals in the geographic named entity corpus affect the model's ability to extract semantic features from the text.At the same time, only a small amount of training time cost was added, and the loss value of the training set and the perplexity of the validation set decreased significantly.In order to improve the semantic feature extraction ability of the BERT model, we replaced the Arabic numerals in the corpus with unified codes.Finally, we decided to use the geographic named entity corpus with the replaced Arabic numerals as the basis for subsequent research.

Geographic Named Entity Recognition
In order to improve the robustness and resilience of the geographic named entity recognition model, we conducted incremental pre-training on a ubiquitous network text corpus constructed from 139,255 Sina Weibo texts that were checked-in within Jinan city from March to December 2022, after cleaning and preprocessing.Sina Weibo is one of the most popular social media platforms in China; it is a microblog-based social network where users can post short texts, pictures, videos, and audios and interact with other users.As of 2022, the number of active users on Sina Weibo exceeded 580 million, making it one of the most important platforms in the field of social media in China.The obtained Weibo text data can be identified using the named entity recognition model to obtain the corresponding entities.The example process of geographic named entity recognition is shown in Figure 5. Examples of recognized entities are shown in Table 8.
most popular social media platforms in China; it is a microblog-based social network where users can post short texts, pictures, videos, and audios and interact with other users.As of 2022, the number of active users on Sina Weibo exceeded 580 million, making it one of the most important platforms in the field of social media in China.The obtained Weibo text data can be identified using the named entity recognition model to obtain the corresponding entities.The example process of geographic named entity recognition is shown in Figure 5. Examples of recognized entities are shown in Table 8.This study first manually constructed 14,728 labeled data points of geographic named entity recognition based on the collected Sina Weibo data, and integrated 2363 data points from the CLUENER2020 dataset to form the dataset for the geographical named entity recognition task.The model was fine-tuned based on the above incremental pre-training.The ratio of the training set, validation set, and testing set during the training process was 8:1:1, and the data selection method was random.In addition, to evaluate the recognition accuracy of the proposed method for geographic named entity recognition, we compared it with several traditional methods for geographic named entity recognition, including Fully Connected Neural Network + CRF, RNN + CRF, and BiLSTM + CRF.Three commonly used classification indicators were used to evaluate the effectiveness of geographic named entities, namely precision, recall, and F1-score.
Similar to the pre-training task of the geographical named entity recognition model, to verify the performance and training cost of the recognition model, we also calculated various indicators for model training and validation, as shown in Table 10.From Table 9, we can see that the model has relatively good performance on the ubiquitous web text corpus with a relatively small amount of data, indicating that the model has strong text semantic understanding ability.After verifying the effectiveness of the model, we compared it with several traditional models.The comparison results of the precision, recall, and F1 score for each method are shown in Table 11.As shown in Table 11, our improved BERT-based geographic named entity recognition model has significant advantages over traditional methods (FCNN + CRF, RNN + CRF, and BiLSTM + CRF), with an F1 score of around 0.90, which proves that incorporating social media information into geographic named entity recognition with pre-trained language models can greatly improve the accuracy of geographic named entity recognition tasks.By comparing BiLSTM + CRF with our method, it can be concluded that the method based on the pre-trained model can enable the model to learn more language features and have better generalization ability, and thus improve the recognition effect.Our proposed geographic named entity recognition model outperforms previous methods in precision, recall, and F1 score, demonstrating its excellent identification performance.Overall, our proposed model performs better than previous traditional models and has better comprehensive performance, laying a foundation for further research on geographic named entity recognition text.

Geographic Named Entity Correction
This study used a total of 3,530,611 geographic named entity text data points located in Jinan City, Shandong Province in 2022, which were preprocessed and cleaned, as the experimental corpus for constructing model instances of geographic named entities.
To perform the correction of geographic named entities, we used a dataset of 779,924 text samples of geographic named entities obtained through manual annotation and data augmentation techniques.The data augmentation methods used included vocabulary replacement, back-translation, random insertion, and random deletion.
The dataset was randomly divided into training, validation, and testing sets with proportions of 85%, 10%, and 5%, respectively.The structure of the dataset included two fields, which were the original geographic named entities and the corrected geographic named entities.The training data accounted for 85% of the total dataset, which was used with the aim of training a geographic named entity correction model with a large amount of data, while the validation data accounted for 10%, and was mainly used to evaluate the trained correction model and adjust and optimize the model parameters based on the evaluation results.Some examples of geographic named entity correction are shown in Table 12.Additionally, the hyperparameters used to train the geographic named entity correction model are shown in Table 13.After data augmentation and model training on the geographical text corpus, in order to evaluate the geographical named entity text correction model, we compared it with the RNN (seq2seq) method.The comparison results of our geographical named entity correction model and the traditional error correction method based on RNN and the seq2seq model architecture on the sequence metrics are shown in Table 14.From Table 14, compared with traditional RNN error correction methods, it can be seen that our proposed error correction method improves the precision rate from 89.8% to 97.6%, improves the recall rate from 43.7% to 76.5%, improves the F1 value from 58.8% to 85.8%, and improves the accuracy rate from 80.5% to 85.9%.We can see that our proposed model for geographic named entity correction outperforms the seq2seq model architecture on all metrics, demonstrating that the deep learning architecture using incremental pre-training and data augmentation can greatly improve the accuracy and effectiveness of geographic named entity text correction tasks.Other parameters of the geographic named entity correction model are shown in Table 15.Based on Table 15, we can see that our proposed geographic named entity text correction model achieved good results in most of the detection, correction, and sequence-level metrics.Our model's F1 scores for detection, correction, and sequence level were 0.9108, 0.8932, and 0.8577, respectively, all of which reached a high level, demonstrating the excellent correction effect of our proposed geographic named entity correction model.This indirectly proves that the geographic named entity correction model has a strong understanding ability of geographic named entity text semantic features, which lays the foundation for future research on geographic named entities.

Conclusions
In this paper, we first introduced the BERT model framework in natural language processing tasks.Then, we fine-tuned the BERT framework and performed incremental pre-training, which improved the BERT framework and achieved good results.Based on the BERT framework, we extended and constructed a geographical named entity recognition (NER) model.By comparing it with other geographical NER methods, we proved that our pre-training fine-tuning approach performed better than the other recognition methods, laying the foundation for further research on geographical named entity recognition.Finally, our proposed geographical named entity correction (NEC) model was also extended and improved based on the BERT framework.Through a comparison with traditional sequence-to-sequence-based correction methods, our proposed geographical NEC model achieved much better correction results.With the continuous development and popularization of natural language processing technology, geographical named entity recognition and correction techniques will be further developed and applied and will become important tools for people to process geographical information and manage geographical knowledge.However, current geographical named entity recognition and correction technologies still have some shortcomings, such as high error rates in geographical named entity correction, incomplete coverage of geographical information data, and differences in processing geographical names in different languages.In summary, geographical named entity recognition and correction techniques have broad application prospects, but also have certain defects.In the future, we will focus on realizing more content related to geographical named entities and further expanding and upgrading these functions.We will also improve the BERT model to rebuild the geographical NER and NEC models and propose functions such as place name recommendation.

Figure 1 .
Figure 1.The overall framework for pre-training and fine-tuning the BERT language model.

Figure 1 .
Figure 1.The overall framework for pre-training and fine-tuning the BERT language model.

Figure 2 .
Figure 2. Network structure of the geographic named entity recognition model.

Figure 2 .
Figure 2. Network structure of the geographic named entity recognition model.

Figure 3 .
Figure 3. Network structure of the geographic named entity correction model.

Figure 3 .
Figure 3. Network structure of the geographic named entity correction model.

Figure 5 .
Figure 5.The example process of geographic named entity recognition.Figure 5.The example process of geographic named entity recognition.

Figure 5 .
Figure 5.The example process of geographic named entity recognition.Figure 5.The example process of geographic named entity recognition.

Table 1 .
An example of incrementally pre-trained Weibo text data.

Table 2 .
Geographical named entity of different error types.

Table 2 .
Geographical named entity of different error types.

Table 3 .
Examples of Weibo text data.

Table 4 .
Examples of cleaned and preprocessed geographical named entity data.

Table 5 .
Training and validation metrics of semantic models for geographical named entities using character encoding strategy in BERT model.

Table 6 .
Training and validation metrics of semantic models for geographical named entities with different numbers of semantic feature extraction modules.

Table 7 .
Training and validation metrics of semantic models for geographical named entities in text corpus where Arabic numerals are replaced with (NUM) and where they are not replaced.

Table 8 .
Training data examples for the geographical named entity recognition model.

Table 9 .
Training hyperparameters for the geographical named entity recognition model.

Table 10 .
Incremental pre-training task for geographical named entity recognition model on various metrics on the training and validation sets.

Table 11 .
Comparison results of precision, recall, and F1 score between the geographical named entity recognition model and other methods.

Table 12 .
Examples of corrected geographical named entities.

Table 14 .
Comparison results of sequence-level metrics between the geographical named entity text correction model and the method based on the seq2seq model structure.

Table 15 .
Various error correction metrics of the geographical named entity correction model.