A Coarse-to-Fine Model for Geolocating Chinese Addresses

Abstract: Address geolocation aims to associate address texts with geographic locations. In China, due to the increasing demand for LBS applications such as take-out services and express delivery, automatically geolocating unstructured address information is the key issue that needs to be solved first. Recently, a few approaches have been proposed to automate address geolocation by directly predicting geographic coordinates. However, such point-based methods ignore the hierarchy information in addresses, which may cause poor geolocation performance. In this paper, we propose a hierarchical region-based approach for geolocating Chinese addresses. We model address geolocation as a Sequence-to-Sequence (Seq2Seq) learning task, that is, the input sequence is a textual address, and the output sequence is a GeoSOT grid code which exactly represents the multi-level regions covered by the address. A novel coarse-to-fine model, which combines BERT and LSTM, is designed to learn the task. The experimental results demonstrate that our model correctly understands Chinese addresses and achieves the highest geolocation accuracy among all the baselines.


Introduction
Addresses, as natural language descriptions of geographic locations, are often used by humans in daily life. In China, there are increasing demands for various LBS applications, such as take-out services, express delivery, online car-hailing services, etc. While the unstructured addresses are easy for humans to understand and locate, they are difficult for computers to operate with. To enable the use of unstructured address information by these applications, one prerequisite is automatically assigning the correct geographic locations for the addresses. This process is commonly called address geolocation.
Accurate estimation of address location is an important factor for LBS applications. In recent years, deep learning models have been explored for the textual geolocation prediction task. Numerous studies adopted deep neural networks to predict the coordinates (i.e., longitudes and latitudes) of text data including blogs, tweets, Wikipedia articles, etc. However, such point-based geolocation methods ignore the hierarchy information in address descriptions (e.g., country, province, city, etc.), which has been shown to be very effective in previous studies [1,2]. In addition, recent work [3] also demonstrates that predicting coarse-grained areas is much easier than predicting fine-grained areas. To address these challenges, we propose a novel coarse-to-fine model for geolocating Chinese addresses. Our proposed model is based on an encoder-decoder framework augmented with an attention mechanism [4]. We address the first challenge by taking the state-of-the-art language model BERT (Bidirectional Encoder Representations from Transformers) as the encoder, which is capable of extracting different embeddings of the same Chinese character according to its context. We tackle the second challenge by adopting a multilevel subdivision scheme for the earth's surface, known in the literature as GeoSOT (Geographical coordinate Subdividing grid with One dimension integer coding on a $2^n$ Tree) [5]. Based on GeoSOT, we first build hierarchies of regions related to target address locations in the training phase. Since each region has a unique identification code, we then train an LSTM (Long Short-Term Memory) [6] network-based decoder to predict each region's code by attending to the semantic meaning of the address.
We make the following contributions in this work: (1) We model Chinese address geolocation as a Seq2Seq learning task in which the input is the textual Chinese address and the output is a GeoSOT grid code. (2) A novel coarse-to-fine model is proposed, which takes BERT as the encoder and an LSTM model as the decoder. (3) We demonstrate the effectiveness of our coarse-to-fine geolocation model by conducting detailed experiments. It significantly outperforms the baseline methods.

Related Work
This section introduces prior studies that are most relevant to our work, including text-based geolocation prediction and neural language modeling.

Textual Geolocation Prediction
Textual geolocation aims at locating textual addresses with language modeling techniques. According to the prediction target, prior studies can be divided into two categories: coordinate-oriented and region-oriented.
Coordinate-oriented approaches view the geolocation task from the regression perspective and directly predict the longitude and latitude of text data. They are widely used in social media data geolocation, such as tweets, blogs, social images, etc. Considerable literature has developed various geolocators by leveraging different features of text content and users, such as location indicative words, metadata, user profiles, and friendship graphs. For example, as a pioneering work, Fink et al. [7] presented a method that uses the place name mentions in a blog to determine the blog's location. Chi et al. [8] integrated location indicative words, city/country names, hashtags, and mentions and trained a multinomial Naive Bayes classifier to predict the locations of tweets. Liu et al. [9] proposed a unified framework to predict geolocations for Flickr images, which combines the information from both image tags and the user profile. Rahimi et al. [10] proposed GCN, a multiview geolocation model based on graph convolutional networks that uses both text and network context. Miura et al. [11] unified text, metadata, and user network representations with a neural network for geolocation prediction. However, most studies are conducted on social media data like Twitter, where metadata and external gazetteers are needed. By contrast, our geolocation model only relies on textual features.
Region-oriented approaches treat geolocation prediction as a classification task by first partitioning the geographic space into discrete subregions using regular grids, adaptive grids or city-level regions. These approaches treat the resulting discrete regions as either a flat list [12][13][14][15][16][17][18] or a nested hierarchy [2,3]. For example, Wing and Baldridge [12] were the first to use an n-gram statistical language model and a discrete, regular grid division of the earth's surface to predict the grid cell to which a document belongs. Roller et al. [13] extended this work with additional consideration of the data distribution, defining an alternative grid construction using k-d trees that adapts more robustly to the data. Rout et al. [16] used an SVM classifier and a number of features that reflect different aspects and characteristics of Twitter user networks to predict city-level locations. Dredze et al. [18] adopted a supervised learning approach, training a multiclass classifier to identify the city of a tweet. Instead of taking discrete regions as a flat list, the other research thread predicts text geolocation hierarchically by treating the discrete regions as a nested hierarchy. Mahmud et al. [1] developed a two-level hierarchical location classifier which first predicts the time zone or state, and then predicts the city label. Wing and Baldridge [2] constructed a grid hierarchy. The probability of the final fine-grained location can be computed recursively from the leaf node up to the root. Recently, Kulkarni et al. [19] proposed a multilevel geocoder (MLG) for geolocating tweets. MLG exploits the natural hierarchy of the geographic locations by jointly predicting at different levels of granularity. However, as the hierarchy deepens, such classification-based geolocation methods can hardly handle the classification because the output space becomes too large.
To overcome this limitation, we propose the coarse-to-fine model (CFM) to achieve multilevel geolocation in a Seq2Seq fashion. To the best of our knowledge, our method is the first deep learning-based method that models geolocation prediction as a Seq2Seq task.

Neural Language Modeling
Language modeling aims to learn the joint probability of word sequences in a language. The first neural language model was proposed by Bengio et al. [20], who represented each word by a continuous real-valued vector and leveraged a feedforward neural network to learn the distributed representation of each word. Compared with the traditional statistical language model, the neural language model substantially ameliorates the curse of dimensionality and exhibits better generalization ability. With the rapid development of deep learning technologies, the feedforward neural network-based language model was later extended to language models based on recurrent neural networks [21] and convolutional neural networks [22]. However, these models for learning word embeddings only allow a single context-independent representation for each word. In other words, they can hardly handle polysemy. To solve this problem, the concept of pretraining word embeddings was proposed [23] and widely adopted in ELMo (Embeddings from Language Models) [24], GPT (Generative Pre-Training) [25] and BERT [26]. ELMo is a two-layer bidirectional LSTM model. It learns the representation of each word depending on the entire context in which it is used. Therefore, even the same word will have different representations if the context is different. ELMo has been shown to work well for the word disambiguation task. GPT uses the Transformer [27] decoder (uni-directional) instead of the LSTM as the language model to better capture long-distance word relations. Moreover, fine-tuning the language model is taken as a training target together with downstream tasks. BERT integrates the advantages of ELMo and GPT by taking the Transformer encoder (bidirectional) as the language model. It achieves great success in a wide range of NLP tasks. Recently, a lite BERT (i.e., ALBERT) [28] was proposed to decrease the memory consumption and increase the training speed of BERT.
In this work, we leverage the pretrained BERT to extract the character representations in Chinese addresses.

Problem Statement
We model the coarse-to-fine geolocation as a Seq2Seq task. The given address $V$ can be viewed as a sequence of $n$ Chinese characters $\{v_1, v_2, \ldots, v_n\}$. The output of the model is the GeoSOT grid code $C$, which is a sequence of $p$ quaternary digits $\{c_1, c_2, \ldots, c_p\}$ with $c_t$ being the digit at time $t$. We formulate the geolocation as the inference over a probabilistic model. The goal of the inference is to generate a code sequence $c^*_{1:p}$ which maximizes $P(c_{1:p} \mid v_{1:n})$:
$$c^*_{1:p} = \arg\max_{c_{1:p}} P(c_{1:p} \mid v_{1:n}) = \arg\max_{c_{1:p}} \prod_{t=1}^{p} P(c_t \mid c_{<t}, v_{1:n}) \qquad (1)$$

Overall Architecture
Figure 2 illustrates the overall architecture of our coarse-to-fine geolocation model. Essentially, it follows an encoder-decoder framework with an attention mechanism. The encoder is used to learn the location-specific information implied in the input address $V$, and the decoder is used to generate the GeoSOT code sequence $C$.
As Chinese addresses are inherently difficult for a machine to understand, we leverage the advanced language model BERT as the encoder to capture the complex relationships between Chinese characters in the address. It consists of $N$ identical Transformer (abbr. Trm) blocks. The encoder takes the Chinese address as input and outputs the feature representation $h_i$ for each Chinese character. Considering that the hierarchical locations covered by the given address are represented by a GeoSOT grid code, an LSTM-based decoder is used to predict the code digits one by one. The probability of each digit is computed after a character-level attention layer. The details of our model are elaborated in the following sections.
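As a rough illustration of digit-by-digit generation, the sketch below builds a quaternary code greedily. The names `greedy_decode`, `step_fn` and `toy_step` are hypothetical stand-ins for the trained decoder, and greedy selection is an assumption, since the text does not state which search strategy is used at inference time.

```python
from typing import Callable, List, Sequence

DIGITS = "0123"  # quaternary alphabet of a GeoSOT-style code


def greedy_decode(step_fn: Callable[[str], Sequence[float]], length: int) -> str:
    """Build a code one digit at a time, always keeping the most probable digit.

    step_fn(prefix) must return four probabilities, one per digit, conditioned
    on the digits generated so far (the encoder states are assumed to be
    captured inside step_fn).
    """
    prefix = ""
    for _ in range(length):
        probs = step_fn(prefix)
        best = max(range(len(DIGITS)), key=lambda k: probs[k])
        prefix += DIGITS[best]
    return prefix


def toy_step(prefix: str) -> List[float]:
    """Hypothetical scorer that prefers the digit equal to len(prefix) mod 4."""
    probs = [0.1, 0.1, 0.1, 0.1]
    probs[len(prefix) % 4] = 0.7
    return probs


code = greedy_decode(toy_step, 6)
print(code)
```

Because each digit is conditioned on the prefix generated so far, stopping the loop early yields the code of a coarser enclosing region.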

GeoSOT Subdivision Scheme
GeoSOT (Geographical coordinate Subdividing grid with One dimension integer coding on a $2^n$ Tree) is a geo-referencing and coding framework [5]. Taking the intersection of the prime meridian and the equator as the central point, GeoSOT recursively divides the surface of the earth into four grid cells. It finally constructs a hierarchical quadtree with 32 levels spanning from the global to the centimeter scale. Table 1 shows the grid size at each level. Grid cells at each level are indexed using a Z-order space-filling curve [29]. Each cell can be represented as a single string of quaternary digits, i.e., '0', '1', '2' and '3'. The longer the GeoSOT code, the finer the grid granularity. The subdivision and coding method is shown in Figure 3. The advantages of GeoSOT codes are two-fold: (1) uniqueness, in that each geographical region on the Earth has exactly one GeoSOT code; (2) recursiveness, in that the lower-level grids are obtained by subdividing the upper-level grids. The GeoSOT grid code can represent the geospatial hierarchies at various levels without relying on external metadata.
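The recursive subdivision can be illustrated with a simplified quadtree coder. This is a sketch under stated assumptions, not the GeoSOT specification: the function name `quadtree_code`, the square 512° x 512° domain centred on (0°, 0°), the uniform halving at every level and the digit convention (2 x north + east) are all choices made here for illustration.

```python
def quadtree_code(lon: float, lat: float, level: int) -> str:
    """Return a quaternary code of `level` digits for a point.

    Simplifying assumptions (not the full GeoSOT specification): the domain is
    the square [-256, 256) x [-256, 256) degrees centred on (0, 0), every level
    halves the cell in both directions, and the digit is 2 * north + east.
    """
    west, east = -256.0, 256.0
    south, north = -256.0, 256.0
    digits = []
    for _ in range(level):
        mid_lon = (west + east) / 2
        mid_lat = (south + north) / 2
        east_half = lon >= mid_lon
        north_half = lat >= mid_lat
        digits.append(str(2 * int(north_half) + int(east_half)))
        # Shrink the cell to the quadrant that contains the point.
        if east_half:
            west = mid_lon
        else:
            east = mid_lon
        if north_half:
            south = mid_lat
        else:
            north = mid_lat
    return "".join(digits)


# Hypothetical point near Beijing (lon 116.40, lat 39.90).
coarse = quadtree_code(116.40, 39.90, 13)
fine = quadtree_code(116.40, 39.90, 17)
print(coarse, fine)
assert fine.startswith(coarse)  # the coarse cell is an ancestor of the fine cell
```

The prefix relation checked in the last line is what allows a single code string to describe a whole chain of nested regions.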

Representing Chinese Textual Addresses
Given the raw Chinese addresses, we first aim to transform them into a computer-operable form, then extract geographical features (e.g., location or spatial relations) that can be understood by computers to support the subsequent geolocation prediction.

Input Processing
Tokenization. Tokenization is the process of splitting the raw text into smaller pieces. Unlike non-logosyllabic languages such as English, Chinese is written as a stream of characters with no white space to separate words. In addition, there are a huge number of word-level combinations in Chinese, which means that a word-level vocabulary is more likely to encounter out-of-vocabulary situations in the testing phase. By contrast, the number of Chinese characters is relatively limited, and we can easily enumerate all Chinese characters to construct the vocabulary. Based on the above observations, we perform character-level tokenization for Chinese address texts in this work.
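A minimal sketch of character-level tokenization follows. The helper names `build_vocab` and `tokenize`, the `[UNK]` placeholder and the two sample addresses are hypothetical; the tokenizer shipped with a pretrained BERT model may differ in its details.

```python
from typing import Dict, List

UNK = "[UNK]"  # hypothetical placeholder for characters unseen in training


def build_vocab(addresses: List[str]) -> Dict[str, int]:
    """Assign an integer id to every distinct character, reserving 0 for UNK."""
    vocab = {UNK: 0}
    for address in addresses:
        for ch in address:
            if ch not in vocab:
                vocab[ch] = len(vocab)
    return vocab


def tokenize(address: str, vocab: Dict[str, int]) -> List[int]:
    """Split an address into characters and map each one to its id."""
    return [vocab.get(ch, vocab[UNK]) for ch in address]


train = ["北京市海淀区", "上海市浦东新区"]
vocab = build_vocab(train)
print(tokenize("北京市浦东区", vocab))
```

Every character of the new address already appears in the two training addresses, so no token falls back to `[UNK]`, even though the address as a whole was never seen.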
Input Embedding. In the previous step, we obtained a sequence of character-level tokens for each address. To further transform them into a computer-operable form, we take advantage of the word embedding technique. Word embeddings are the distributed representations of words, which encode each word into a unique real-valued vector [30,31]. Compared to the traditional one-hot representations, word embeddings are able to overcome the sparsity of training data and greatly reduce trainable parameters.
In our work, the embedding vector e i for each character-level token v i is directly retrieved from an embedding matrix E by a lookup operation. Moreover, the token positions are added to the initial input to record the location information. Similarly, we transform each token's position into an embedding, called position embedding p i , which is retrieved from another embedding matrix P. Both E and P are trainable. For each character, we sum the token embedding e i and the position embedding p i . Finally, an input embedding matrix X is obtained.
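The lookup-and-sum step can be sketched as follows. The toy sizes, the random initialisation and the function name `embed` are assumptions for illustration; in the model, E and P are learned parameters.

```python
import random
from typing import List

random.seed(0)
VOCAB_SIZE, MAX_LEN, DIM = 10, 8, 4  # toy sizes chosen for illustration

# E and P stand in for the trainable token and position embedding matrices.
E = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]
P = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(MAX_LEN)]


def embed(token_ids: List[int]) -> List[List[float]]:
    """Return X, where row i is E[token_ids[i]] + P[i]."""
    return [
        [e + p for e, p in zip(E[tok], P[pos])]
        for pos, tok in enumerate(token_ids)
    ]


X = embed([1, 2, 3, 3])
print(len(X), len(X[0]))  # number of characters, embedding dimension
```

The last two tokens share the same id but sit at different positions, so their rows in X differ only through the position embedding.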

Feature Extraction
After the input embedding layer, each raw address is transformed into a 2D matrix with one embedding vector per Chinese character. We then apply the encoder module, i.e., the BERT model, to extract high-level semantic features from the input embedding matrices. The encoder module consists of N identical blocks (i.e., Transformer blocks). Each block contains a multihead self-attention layer (MultiHead) and a feed-forward layer (FFN).
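The self-attention mechanism [27] allows each character in an address to build an attentive context by weighting the other characters according to their relevance, regardless of the address length. Formally, the computation steps in this layer are as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
$$\mathrm{head}_i = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)$$
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O$$
where $W^Q_i$, $W^K_i$, $W^V_i$, $W^O$ are trainable parameters and $d_k$ is the column dimension of $W^K_i$. In self-attention, $Q$, $K$ and $V$ are all the input matrix of the block. Concatenating the $h$ heads together, we obtain one feature vector $f$ after projection by $W^O$ for each input character in the address. Following the MultiHead layer, the FFN layer is applied to generate the output of the block. Similar to [27], we employ a residual connection (brown dotted line in Figure 2) and layer normalization around each of the two sub-layers.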
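To make the attention step concrete, here is a single-head sketch of the scaled dot-product attention of [27] in plain Python. It omits the learned projections and the multi-head concatenation, and the helper names and toy input are assumptions made for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two row-major matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """softmax(Q K^T / sqrt(d_k)) V for one head."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


# Toy input: three "characters" with two features each; Q = K = V = X.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(X, X, X)
print(out)
```

Each output row is a weighted average of all input rows, which is why the distance between two characters in the address does not limit how strongly they can influence each other.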

Coarse-to-Fine Location Prediction
To conduct the coarse-to-fine location prediction, i.e., predicting the GeoSOT code sequence essentially, we leverage the LSTM architecture with attention mechanism as our decoder.
LSTM [6] is a recurrent neural network which introduces a cell state and three elementwise multiplication gates, called the forget gate, input gate and output gate, to control the cell state. These three gates control how information is stored, forgotten, and exploited inside the network.
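For reference, one common formulation of the LSTM cell update is given below. The notation is an assumption made here rather than the paper's own: $x_t$ is the decoder input at step $t$, $s_t$ the hidden state, $m_t$ the cell state (named $m$ to avoid a clash with the code digit $c_t$), $\sigma$ the logistic sigmoid and $\odot$ elementwise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f [s_{t-1}; x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i [s_{t-1}; x_t] + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_o [s_{t-1}; x_t] + b_o\right) && \text{(output gate)} \\
\tilde{m}_t &= \tanh\left(W_m [s_{t-1}; x_t] + b_m\right) && \text{(candidate cell state)} \\
m_t &= f_t \odot m_{t-1} + i_t \odot \tilde{m}_t \\
s_t &= o_t \odot \tanh(m_t)
\end{aligned}
```

The forget gate scales what is kept from the previous cell state, the input gate scales what is written, and the output gate scales what is exposed as the hidden state.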
As defined in Equation (1), the GeoSOT code digit $c_t$ generated at time $t$ is predicted based on all the previously generated parent codes $c_{<t}$ and the hidden states $H = \{h_t\}_{t=1}^{L}$ of the encoder. Here, $s_t$ is the $t$-th hidden state of the decoder calculated by the LSTM cell, and $a_t$ is the attention vector, which is widely used in many applications. The vanilla attention mechanism is proposed to focus on the semantic relevance between the encoder states $\{h_t\}_{t=1}^{L}$ and the decoder state $s_t$ at time $t$. The attention vector is usually represented by the weighted sum of the encoder hidden states, where $u$, $W^{Att}_1$, and $W^{Att}$ [...]
Implementation Details. The dataset is divided into training, validation and testing sets in 8:1:1 proportions. We implement our approach with PyTorch (https://pytorch.org). In terms of hyper-parameter settings, the number of layers (i.e., Transformer blocks) and the number of self-attention heads in the encoder are both 12. The dimensions of the hidden vectors are set to 768 in the encoder and decoder. We use the Adam optimizer [32] with a batch size of 100 and a learning rate of 0.001. The network was trained for 400 epochs and the best epoch was chosen by observing the performance on the validation set. In addition, all the training in this work was done on a single NVIDIA GeForce GTX 1080 Ti GPU with 11 GB RAM.
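Returning to the attention vector $a_t$ used by the decoder, the weighted sum over encoder states can be sketched as below. The additive scoring function $u^{\top}\tanh(W_1 h_i + W_2 s_t)$ is an assumption suggested by the symbols that survive in the text, not a confirmed description of the model, and `W1`, `W2`, `additive_attention` and the toy values are hypothetical names chosen here.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def additive_attention(H: Matrix, s_t: Vector, W1: Matrix, W2: Matrix, u: Vector) -> Vector:
    """Return a_t = sum_i alpha_i * h_i with alpha = softmax(u . tanh(W1 h_i + W2 s_t))."""
    proj_s = matvec(W2, s_t)
    scores = []
    for h in H:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, h), proj_s)]
        scores.append(sum(ui * hi for ui, hi in zip(u, hidden)))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(H[0])
    return [sum(alpha * h[d] for alpha, h in zip(alphas, H)) for d in range(dim)]


# Toy values: two encoder states of size 2 and identity projections.
I2 = [[1.0, 0.0], [0.0, 1.0]]
H = [[1.0, 0.0], [0.0, 1.0]]
a_t = additive_attention(H, s_t=[0.0, 0.0], W1=I2, W2=I2, u=[1.0, 1.0])
print(a_t)
```

With these symmetric toy inputs both encoder states receive the same score, so the attention vector is their plain average.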

Visualizing the Performance in Polysemy Recognition
We claimed earlier that polysemy is a common phenomenon in Chinese addresses, which presents a challenge to correct geolocation. To demonstrate that our model is able to recognize the different semantic or geographical meanings of the same Chinese character, we visualize the t-SNE [33] plot of the learned character embeddings with TensorBoard (https://tensorflow.google.cn/tensorboard).
First, we explore the Chinese character "市", which is a common example of polysemy in Chinese addresses. Semantically, it can represent both a city-level region (e.g., Beijing, Shanghai, Guangzhou, etc.) and a market (e.g., supermarket, bazaar, country fair, etc.). Figure 5a shows two obvious clusters: the red cluster represents the "region" meaning and the blue cluster refers to the "market" meaning. The two clusters are separated because they have no semantic association at all. We provide further evidence of our model's ability to distinguish meanings with another Chinese character, "区". As shown earlier in the example of Figure 1, this character can represent both a district-level region and a residence-level region. Though they refer to similar semantic meanings, i.e., geographic regions, the geospatial meanings are different. An interesting finding is shown in Figure 5b. Character embeddings that refer to large geographic regions (i.e., districts) are clustered together (see the red dots). Similarly, those that refer to small geographic regions (i.e., residential quarters) are also clustered together (see the blue dots).
The reasons why our model can distinguish polysemy are two-fold. First, the BERT-based encoder helps address this to a certain extent. It can capture the contextual information and obtain the precise semantic meaning of Chinese characters. Second, the coarse-to-fine predicting strategy assists. When decoding different levels of geographic regions, the model is forced to attend to the input characters that are truly useful.

Comparison in Geolocation Prediction
We evaluate our proposed method using three metrics. Accuracy is the percentage of correctly predicted GeoSOT codes. Taking the prediction of the L17 GeoSOT code as an example, a prediction is considered correct only when all 17 digits are predicted correctly. We take it as a hard metric because the GeoSOT code can be directly used in various downstream applications. This is why correctly predicting the entire GeoSOT code is important. Moreover, we use two distance-based metrics which are often used in textual geolocation related works: the mean and median error distances [34] between the centroid of the assigned GeoSOT grid and the actual coordinate.
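The three metrics can be sketched as follows. The use of the haversine great-circle formula, the Earth radius constant and all function names are assumptions made for illustration; the text does not specify how error distances are computed.

```python
import math
from statistics import mean, median
from typing import List, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch


def exact_match_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of codes whose digits all match the reference code."""
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


def haversine_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def error_distances(centroids, truths):
    """Mean and median error between predicted cell centroids and true points."""
    errors = [haversine_km(c, t) for c, t in zip(centroids, truths)]
    return mean(errors), median(errors)


# Hypothetical predictions: the second code differs in its last digit only.
acc = exact_match_accuracy(["3012", "3013", "3020"], ["3012", "3011", "3020"])
print(acc)
```

A code that is wrong in a single digit counts as a miss for accuracy, while the distance metrics still reward it if the predicted cell lies close to the true location.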
Accuracy. Two classic models often used in Seq2Seq learning are taken as baselines. One is the Vanilla-RNN model. It adopts a basic RNN to map the input sequence to a vector of a fixed dimension, and then uses another deep RNN to decode the target sequence from the vector. Similarly, we take the character embedding as the input and predict the corresponding GeoSOT grid code. The other one is the Bi-LSTM model. It is also provided as a strong baseline which uses the bidirectional LSTM units and character level attention mechanism. The performances of different models in terms of GeoSOT code prediction accuracy are presented in Figure 6. The difference between Figure 6a,b is the setting of input address sequence lengths, that is, the average character length of the former is 10 and that of the latter is 20. Moreover, under the same input length, we predict GeoSOT codes of different lengths: 13, 15 and 17, respectively. Please note that longer GeoSOT codes represent finer regions. It is clearly shown in Figure 6 that our method outperforms the other baselines significantly. When the input length is fixed, the prediction accuracy of the three methods will decrease as the GeoSOT code length increases. This is in line with common sense since predicting fine-grained regions is more difficult than predicting coarse-grained regions. However, our model still outperforms the baselines significantly. Moreover, if we fix the output code length, the Vanilla-RNN model and Bi-LSTM model achieve similar performance when the input addresses are short. However, when the address lengths become longer, the Bi-LSTM model outperforms the Vanilla-RNN model. Regardless of the input address length, the geolocation performance of our model is stable and outperforms the baselines. This is because the self-attention mechanism used in our encoder is not sensitive to the sequence length. It exhibits superiority in capturing the correct context information. 
In addition, it is worth noting that the geolocating task becomes more difficult for all methods as the input and output lengths increase.
Distance-based metrics. We take two state-of-the-art machine learning algorithms based on decision trees as baselines. In detail, we experiment with the following two algorithms: (1) XGBoost-regression, (2) XGBoost-classification. XGBoost (extreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm. During the training phase, XGBoost grows a sequence of weak learners (i.e., shallow trees), in which each weak learner focuses on correcting the residual errors of the current model approximation. By aggregating the weak learner outputs, XGBoost generates a strong learner. Given the Chinese addresses, we use XGBoost-regression to predict the coordinates. We calculate the mean and median error distances between the predicted location and the actual location. As for XGBoost-classification, it is used to predict the GeoSOT code. Specifically, given a GeoSOT level, it predicts over a large set of grid cells. Similarly, we take the centroid of the predicted GeoSOT grid to calculate the distance-based metrics. We implement the aforementioned methods with scikit-learn (https://scikit-learn.org/). In terms of the hyper-parameter setting, the max depth and eta are set to 7 and 0.1, respectively. Moreover, we set the hidden size of our model to 768, 1024, and 2048, respectively, for comparison.
The performances of each model are shown in Table 3. The results show that the regression-based method that directly predicts coordinates performs poorly. As for XGBoost-classification, it is also not easy to predict correctly over a large number of classes. Taking the 13th level as an example, there are almost 14 million GeoSOT grids in the world. By contrast, our proposed model consistently outperforms the other two models under any hidden layer dimension. We attribute it to the fact that our method sequentially learns to assign multi-granularity geographic areas according to the hierarchical geographic information implied in the addresses.

Ablation Study
Finally, we explore the impact of different parameter settings on the model performance. Taking the self-attention head as an example, Table 4 shows the comparative performance of our model under different numbers of heads. It can be seen that the increase in accuracy is small, which indicates that increasing the number of self-attention heads in the encoder module can improve the performance, but not significantly. In addition, training the model with 12 heads takes about 5.5 h longer than training it with six heads. Although we choose 12 heads in order to achieve the highest performance in this work, we suggest that researchers balance the trade-off between speed and performance.

Discussion
Geolocating textual addresses is an important task in various LBS applications. Previous studies tried to predict the coordinates in a regression fashion or predict a discrete region by multi-class classification. However, they all suffered from an overly large output space. By contrast, even for a person, the intuitive way of geolocating a textual address is to correlate a series of regions with different scales based on the hierarchical geographic information implied in the address. This motivates us to consider a coarse-to-fine geolocating approach. Specifically, this paper opens up a new paradigm for geolocation prediction, i.e., predicting a series of hierarchical regions in a Seq2Seq fashion. Its strength lies in taking full advantage of the inherent hierarchy information in Chinese addresses without relying on any additional information beyond the texts. Moreover, the discrete global grid system, GeoSOT, provides a globally unified benchmark for hierarchically discretizing the earth's surface. Without relying on any external gazetteers, we sequentially predict a GeoSOT code which exactly represents a set of regions from coarse to fine.
We think that there are at least three limitations and opportunities for new use. First, although our approach focuses on Chinese addresses, it can be generalized to more types of geographical texts, e.g., Weibo posts or travel notes. Theoretically, our proposed model can be trained directly on these datasets. We take this as one of our future works. Second, the GeoSOT grid code can be replaced by any other type of geocode, such as GeoHash [35] and Google S2 [19]. Considering that different LBS applications use different geocoding methods, we plan to support user-defined coding methods in the future. Third, this approach is expected to be intelligent enough to predict the granularity. In this work, we explicitly control the granularity of the predicted region (e.g., L13, L15, L17). However, we believe that it is more intelligent for the model itself to predict the granularity according to the input data. This is because different applications and the amount of original information contained in the input data will affect the final prediction granularity. We plan to take these factors into account and adapt our model to predict GeoSOT codes of variable lengths.

Conclusions
In this paper, we introduce a novel coarse-to-fine model for geolocating Chinese addresses. Our proposed approach first models the geolocation prediction as a Seq2Seq learning task, and then develops a deep learning-based neural network to solve it. Without any additional information beyond texts or external gazetteers, our model takes the textual address as input and outputs the GeoSOT grid code that exactly represents a series of hierarchical regions covered by the address. Compared with previous studies, our method effectively narrows the prediction space. The experimental results in terms of distinguishing polysemy and geolocation accuracy demonstrate the significant advantages of our model in the geolocation task.
Author Contributions: Chunyao Qian conceived, designed, and performed the experiments and wrote the manuscript; Chao Yi collected the dataset and reviewed the manuscript; Jiashu Liu polished the language; Chengqi Cheng supervised the study; and Guoliang Pu offered helpful suggestions and revised the manuscript critically. All authors have read and approved of the submitted manuscript, have agreed to be listed, and have accepted this version for publication.

Conflicts of Interest:
The authors declare no conflict of interest.