1. Introduction
Addresses describe unique spatial locations on Earth and are usually expressed in the form of an addressing system [1]. In recent years, with the rapid development of location-based services, massive amounts of industry data that use addresses as spatial information have emerged. Address matching is a crucial application in address services: it compares addresses referring to the same location in different address databases to obtain the best match for a search address and to determine its position on a map [2]. Traditional address matching techniques struggle to meet the demand for high-precision matching in urban industries such as logistics and online taxi services. An effective address matching method is therefore required to provide accurate and efficient intelligent spatial location services and to promote the development of smart cities.
The arrangement pattern of address elements varies from country to country. For instance, the US address pattern is "room number + street + state + country", and it performs well in creating a national geodatabase [3]. Japanese addresses, on the other hand, are coded based on location and geographic relativity; their overall order is the opposite of the US pattern, and they generally lack "streets" [4]. The address patterns of these countries are nested and relatively standardized. Chinese addresses, however, are more difficult to match because of their complex context and rules, mainly for the following reasons: (1) Chinese addresses are written without separators; (2) Chinese addresses often contain landmarks, points of interest (POIs), or topological descriptions (e.g., road intersections); (3) different government departments manage addresses, leading to inconsistency; and (4) address assignment and updating lag behind rapid urban renewal [5]. The data objects of this study are Chinese addresses.
Address matching methods generally fall into two categories: matching based on rules or statistical methods, and semantic similarity matching based on machine learning and deep learning. Character-based approaches match addresses by calculating string similarity metrics and then manually setting a threshold or training a particular classifier to identify a match [6,7]. String similarity metrics include the edit distance and its variants [8,9,10], the Jaccard similarity metric [11], and the Jaro distance and its variants [12,13]. Santos et al. compared 13 different string similarity metrics for place name matching and found that adjusting the similarity threshold was the key to achieving good performance [14]. In addition, calculating the cosine similarity between N-gram-based embeddings is also a common method [15,16] and performs better than traditional metrics. Recently, Yong et al. proposed a normalization method based on the Euclidean distance between the address to be processed and the addresses in a standard library, but it is only applicable to some specific datasets [17].
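To illustrate the character-based approach, the following minimal Python sketch computes two of the metrics above and applies a manually set threshold (the 0.7 value is our assumption for illustration, not a value from the cited studies):

```python
# Minimal sketch of threshold-based string similarity matching.

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def jaccard(a: str, b: str) -> float:
    """Character-set Jaccard similarity."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def is_match(a: str, b: str, threshold: float = 0.7) -> bool:
    # Normalized edit similarity combined with Jaccard; a real system
    # would tune the threshold per dataset, as Santos et al. observe.
    edit_sim = 1 - levenshtein(a, b) / max(len(a), len(b))
    return max(edit_sim, jaccard(a, b)) >= threshold
```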
Another type of method is based on address elements: it segments addresses into elements using rules or statistical methods and then compares the elements and their hierarchy to determine whether two addresses match [18,19,20]. Lin et al. point out that the matching quality of address elements depends on whether they can be extracted correctly [21]. In general, dictionary queries [22], probabilistic statistical models such as CRF [23] and HMM [24], and hand-crafted matching rules [25,26] are the basic ways of retrieving address elements. Another common approach is to construct a decision tree of matching rules, each rule corresponding to a path in the tree. Kang et al. proposed an address matching tree model based on an analysis of the spatial constraint relationships between address elements, although this requirement makes the address model more complex [27]. Focusing on the problem of incorrect word segmentation, Luo and Huang suggested a method based on a trie tree and a finite-state machine [28]. These techniques, however, frequently struggle with nonstandardized addresses (with missing elements or represented by POIs) and with complexly structured addresses, such as the Chinese addresses characterized above.
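To make the dictionary-query approach mentioned above concrete, below is a minimal sketch of element segmentation via forward maximum matching; the dictionary entries are hypothetical examples, not taken from any of the cited systems:

```python
# Minimal sketch of dictionary-based address element segmentation
# using forward maximum matching (hypothetical dictionary).

ELEMENT_DICT = {"浙江省", "杭州市", "上城区", "解放路", "1号"}
MAX_LEN = max(len(w) for w in ELEMENT_DICT)

def segment(address: str) -> list[str]:
    """Greedily match the longest dictionary entry at each position;
    unknown characters fall through as single-character tokens."""
    elements, i = [], 0
    while i < len(address):
        for length in range(min(MAX_LEN, len(address) - i), 0, -1):
            if address[i:i + length] in ELEMENT_DICT or length == 1:
                elements.append(address[i:i + length])
                i += length
                break
    return elements

print(segment("浙江省杭州市上城区解放路1号"))
# ['浙江省', '杭州市', '上城区', '解放路', '1号']
```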
In recent years, the field of artificial intelligence has seen tremendous progress in natural language processing (NLP), most of which is attributable to the enhanced performance of deep learning. Word2vec [29], ELMo [30], GPT [31], BERT [32], XLNet [33], ERNIE [34], and ELECTRA [35] are a few of the classical language models. Since addresses are special textual descriptions, a growing number of address matching studies have introduced deep learning-based natural language models [36]. Cruz et al. analyzed 41 papers on address matching published between 2002 and 2021 and found that most of the relevant studies used deep learning methods; consistent with the characterization above, Chinese address matching accounted for half of these studies because of its complexity [37].
Comber et al. used CRF and word2vec for address matching to extract address semantics without designing complex rules [38]; Zhang et al. proposed a convolutional neural network (W-TextCNN) for Chinese address pattern classification [39]. With the popularity of gated neural networks, an increasing number of researchers have carried out address matching and normalization based on LSTM and GRU [40,41,42,43]. Santos et al. used a deep neural network based on bidirectional GRUs for place name matching [44]; Shan enriched the address context by collecting address data from the Internet and trained an address representation model with two LSTMs and attention mechanisms to extract address vectors [45]. Li et al. incorporated the hierarchical relationships between address elements into a neural network and proposed a BiLSTM-based multitask learning method [46], while Chen et al. proposed a contrastive learning address matching model based on attention-Bi-LSTM-CNN networks (ABLC) [47]. Subsequently, more and more researchers have used the attention mechanism in their address matching models [48,49,50]. With the popularity of pretrained language models, Lin et al. used the classical enhanced sequence inference model (ESIM) [51] for address record pair modelling [21], whereas Xu et al. and Qian et al. used the BERT model: Xu et al. proposed a BERT-based model for extracting address semantic representations to fuse address semantics with geospatial information [36], and Qian et al. combined BERT and LSTM and proposed a hierarchical region-based approach for the geolocation of Chinese addresses [52]. However, all of the aforementioned methods require address semantic features to be extracted into embeddings, and this can limit the effectiveness of address semantic understanding, as has been demonstrated in the field of NLP [32].
In summary, when dealing with nonstandardized addresses with complicated structures, the aforementioned approaches still lack sufficient comprehension of address semantics, which negatively impacts the accuracy of address matching.
To address the above problems, we adopt a deep transfer learning approach. First, we pretrain on an address corpus so that our address semantic model (abbreviated as ASM) can learn from unsupervised address contexts and better understand address semantics. Then, we use the geospatial property specific to addresses to build a labelled address matching dataset, which converts the matching problem into a binary classification prediction problem. Finally, we fine-tune the ASM on the address matching dataset, which significantly improves its performance.
The contributions of this paper are as follows: (1) A neural network based on a multihead self-attention mechanism and a permutation-based target task is used to train the ASM on a large-scale corpus in an unsupervised, automated manner, enabling the ASM to learn address semantics better. (2) A deep transfer learning approach achieves semantic address matching by fine-tuning the ASM, which improves matching accuracy. (3) A dataset construction method for semantic address matching is proposed that converts address matching into a classification prediction task, using location information as the inference condition to construct a labelled address matching dataset. (4) The results demonstrate that, with the transfer learning approach, a well-performing downstream task such as address matching can be achieved even with microsupervision.
The remainder of this paper is organized as follows. Section 2 introduces the materials used in our study and the data processing procedures, and then presents the methodology, including pretraining and fine-tuning based on XLNet. The results of our experiments are analyzed in Section 3. Section 4 presents our conclusions and future work.
2. Materials and Methods
In this section, we introduce a deep transfer learning approach from NLP and propose a semantic address matching framework. First, we tokenize all address data to be used as model input in the pretraining phase. Then, we use the XLNet model [33] to pretrain on the address corpus so that the model understands address semantics by learning contextual information. Finally, we construct a supervised dataset for semantic address matching, fine-tune the pretrained ASM for address matching, and compare it with multiple models to evaluate its accuracy.
2.1. Dataset
Address records for the raw data were manually collected in 2019 from various government departments. The geographical area covered is Shangcheng District, Hangzhou, Zhejiang Province, China. The address dataset contains a variety of location description types, including standard addresses, nonstandard addresses, POIs, road intersections, and place name abbreviations. The preprocessed address dataset amounted to 1,552,532 records, each consisting of three fields: the address record, longitude, and latitude. This dataset served as the address corpus for the pretraining phase.
To use the address data for semantic address matching, we created a labelled dataset of address pairs based on the address corpus. Starting from the set of addresses filtered to have identical coordinates, we performed manual matching against a standard address database. In addition, to give the model better prediction performance and generalization ability, we augmented the dataset with easy data augmentation [51] methods for text classification tasks, mainly synonym replacement, address element deletion, and address element insertion. To improve the robustness of the model, we constructed mismatched address pairs from the set of address pairs with Jaccard similarity coefficients [11] greater than zero. We finally obtained a dataset of 64,358 address pairs and corresponding labels, a sample of which is shown in Table 1. The statistical features of the dataset used for semantic address matching are shown in Table 2, where we use the difference in the number of characters, the Levenshtein distance [8], and the Jaccard similarity coefficient [11] to characterize the similarity of address pairs. As expected, unmatched address pairs score lower on these text similarity measures.
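The following sketch illustrates the pair construction logic described above, assuming `records` is a list of (address, longitude, latitude) tuples; the function names and grouping granularity are our illustrative choices, not the paper's exact pipeline:

```python
# Minimal sketch: positives from identical coordinates, hard negatives
# from textually similar but spatially distinct pairs.
from collections import defaultdict
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def build_pairs(records):
    by_coord = defaultdict(list)
    for addr, lon, lat in records:
        by_coord[(round(lon, 6), round(lat, 6))].append(addr)

    positives, negatives = [], []
    groups = list(by_coord.values())
    for group in groups:                     # same coordinates -> candidate matches
        positives += [(a, b, 1) for a, b in combinations(set(group), 2)]
    for g1, g2 in combinations(groups, 2):   # different coordinates -> negatives
        for a in g1:
            for b in g2:
                if jaccard(a, b) > 0:        # textually similar yet mismatched
                    negatives.append((a, b, 0))
    # In practice one would sample candidate pairs rather than enumerate all.
    return positives + negatives
```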
2.2. Semantic Address Matching Definition
In this paper, we study address matching in the absence of a standard address database, which we refer to as the semantic address matching task. Semantic address matching is defined as follows:
Given an address dataset $A = \{a_1, a_2, \ldots, a_n\}$, the goal of semantic address matching is to find each address pair $(a_i, a_j)$ satisfying $a_i \approx a_j$, where $a_i \in A$, $a_j \in A$, $i \neq j$, and $\approx$ represents the comparison operator: the operands on either side of the comparison operator refer to the same real-world object with the same coordinates. It is important to note that no information other than the string itself and its corresponding geospatial information is used in this study to calculate the similarity of two addresses. The task addressed in this study therefore focuses on matching addresses with the same location rather than on address disambiguation. In addition, because the same location has many different representations, we believe that a correct match cannot be achieved without processing from a natural language understanding perspective. Hence, in our study, "address semantic understanding" refers to the textual understanding of the address corpus, while the "address semantic reasoning" used for address matching is based on spatial relationship reasoning over addresses.
2.3. Pretraining Phase Using the Address Corpus Based on XLNet
This section presents a transfer learning-based pretraining model for address semantics: the address semantic model (ASM). The ASM is designed around the characteristics of Chinese addresses, combined with the semantic understanding capability of deep learning natural language models. The model takes as input the individual characters of a tokenized Chinese address and uses a multihead self-attention-based semantic extraction module, trained with a permutation-based unknown character prediction objective, to help the model understand address semantics. To solve the practical training problem arising from this prediction objective, a two-stream self-attention structure with target position representations is used. The overall structure of the ASM is shown in Figure 1.
2.3.1. Tokenization of Address Characters
Converting Chinese addresses into input that the ASM can receive is the basis for training. Unlike alphabetic languages such as English, Chinese addresses have no delimiters, so most Chinese address studies start with the segmentation of address elements. Because of the unique hierarchy of addresses, partitioning addresses into address elements is itself a problem worth studying. Our study, however, aims to convert complex address matching into a classification problem that can be computed automatically. Although the SentencePiece method [53] commonly used in NLP can automatically segment Chinese addresses by counting high-frequency co-occurring characters, combining them into subword units, and constructing dictionaries, the subwords it produces are too long, and some of the segmented words do not conform to the common conventions of Chinese addresses, which would affect semantic understanding during pretraining.
We therefore use the Basic Tokenizer, which treats a single character as a unit and separates words and symbols according to spaces. We first add blank characters before and after each character of the address. The characters are then matrix-transformed according to a lookup table into one-hot encodings, where the activated dimension of each one-hot vector is the index of the character in the dictionary key-value pairs. In this study, two dictionaries, one including non-Chinese characters and the other containing only Chinese characters, are created once the individual characters of each address have been obtained; they contain 9425 and 3491 characters, respectively.
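A minimal sketch of this character-level tokenization follows; the toy dictionary is hypothetical (the actual dictionaries contain 9425 and 3491 characters):

```python
# Minimal sketch of Basic Tokenizer-style character tokenization.

def basic_tokenize(address: str) -> list[str]:
    """Surround every character with blanks, then split on spaces:
    each Chinese character or symbol becomes one token."""
    spaced = "".join(f" {ch} " for ch in address)
    return spaced.split()

char_to_id = {ch: i for i, ch in enumerate("杭州海底世界")}  # toy dictionary

def encode(address: str) -> list[int]:
    """Map tokens to dictionary indices (the activated one-hot dimensions)."""
    return [char_to_id[tok] for tok in basic_tokenize(address)]

print(encode("杭州海底世界"))   # [0, 1, 2, 3, 4, 5]
```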
2.3.2. Objective of Permutation Unknown Character Prediction and Two-Stream Self-Attention Structure
The objective of permutation language modeling is derived from the XLNet model [33]. Without altering the character order of the original text, the objective uses rearrangement to disrupt the index order of the text description. This training objective not only preserves the high-order, long-range dependencies present in the text context, but also remedies the disadvantage of earlier autoregressive language modeling objectives that could exploit only unidirectional contexts (forward or backward), enabling a pretrained model to utilize deep bidirectional contextual information more effectively. Addresses, as a special natural language incorporating geospatial information and hierarchy, need to fully exploit bidirectional contextual information, so we use the permutation language modeling objective for pretraining on the address corpus.
Specifically, given an address record $X$ of length $T$, there are $T!$ permutation sequences in total. If all permutations are traversed and the parameters of the model are shared, the model can learn the context of all positions. Take a simplified address record, "Hangzhou Underwater World" ("Hang Zhou Hai Di Shi Jie" in Chinese pinyin), and predict the third character "Hai" under different orders, as shown in Figure 2. In Figure 2b, for instance, the permutation is 3→2→4→1→6→5, so when predicting "Hai" there is no preceding context character, and the prediction can only be made from the previous hidden state. In Figure 2f, the character "Hai" (3) learns from all five context characters except itself.
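As a minimal illustration of this idea (a sketch of the visibility logic, not the model's actual implementation), the following snippet builds the attention visibility implied by the factorization order of Figure 2b:

```python
# 0-based indices; z encodes the 1-based order 3→2→4→1→6→5.
import numpy as np

T = 6
z = np.array([2, 1, 3, 0, 5, 4])         # factorization order
rank = np.empty(T, dtype=int)
rank[z] = np.arange(T)                   # rank[i] = position of token i in z

# visible[i, j] is True when token i may condition on token j,
# i.e. when j precedes i in the factorization order.
visible = rank[None, :] < rank[:, None]

print(visible[2])  # "Hai" (token 3) comes first in z: sees no context
print(visible[4])  # "Shi" (token 5) comes last: sees tokens 3, 2, 4, 1, 6
```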
The objective function of XLNet is to maximize the log-likelihood of the target subsequence conditional on the nontarget subsequence:

$$\max_{\theta} \; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T} \left[ \log p_{\theta}\!\left(x_{\mathbf{z}_{>c}} \mid x_{\mathbf{z}_{\le c}}\right) \right] = \max_{\theta} \; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T} \left[ \sum_{t=c+1}^{T} \log p_{\theta}\!\left(x_{z_t} \mid x_{\mathbf{z}_{<t}}\right) \right]$$

where $\mathcal{Z}_T$ denotes the set of all permutations of the indices of an address record of length $T$; $\mathbf{z} \in \mathcal{Z}_T$ is one such index permutation sequence, in which $z_t$ denotes the $t$-th element and $\mathbf{z}_{<t}$ denotes the first $t-1$ elements of $\mathbf{z}$; $\mathbb{E}$ denotes the expectation; and $p_{\theta}$ denotes the predicted probability. In addition, XLNet uses partial prediction optimization: it slices a permutation $\mathbf{z}$ into two subsequences, $\mathbf{z}_{\le c}$ and $\mathbf{z}_{>c}$, where $c$ is the cutting point that splits the permutation into a nontarget subsequence and a target subsequence, respectively.
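For example, with the order from Figure 2b and a hypothetical cutting point (the value of $c$ below is our illustrative choice, not the model's hyperparameter), partial prediction slices the permutation as follows:

```python
z = [3, 2, 4, 1, 6, 5]       # 1-based factorization order from Figure 2b
c = 4                         # hypothetical cutting point
non_target, target = z[:c], z[c:]
print(non_target)            # [3, 2, 4, 1]  conditioned on, never predicted
print(target)                # [6, 5]        predicted, each sees all earlier z
```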
While the above permutation-based unknown character prediction objective works well for understanding address semantics, it creates a problem: the model does not know the position, in the original address record, of the character to be predicted. XLNet [33] therefore introduces a two-stream self-attention structure that lets the model know explicitly where the character to be predicted is located; it maintains two sets of hidden representations instead of one. The two streams are updated with a shared set of parameters $\theta$ as follows:

$$g_{z_t}^{(m)} \leftarrow \text{Attention}\!\left(Q = g_{z_t}^{(m-1)},\; KV = h_{\mathbf{z}_{<t}}^{(m-1)};\; \theta\right)$$
$$h_{z_t}^{(m)} \leftarrow \text{Attention}\!\left(Q = h_{z_t}^{(m-1)},\; KV = h_{\mathbf{z}_{\le t}}^{(m-1)};\; \theta\right)$$

where $Q$, $K$, and $V$ denote the query, key, and value in an attention operation [54]; $h_{z_t}$ denotes the content representation, which serves a role similar to the standard hidden states in a Transformer; and $g_{z_t}$ denotes the query representation, which has access only to the contextual information $x_{\mathbf{z}_{<t}}$ and the position $z_t$, but not to the content $x_{z_t}$.
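The following is a minimal single-head numpy sketch of these two update rules; the dimensions, random initialization, and the always-visible memory slot are simplifications of ours, not the exact XLNet implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
h = rng.normal(size=(T, d))               # content stream: token embeddings
g = np.tile(rng.normal(size=d), (T, 1))   # query stream: shared learnable vector

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(Q, mask):
    # One always-visible zero "memory" slot keeps the first position in
    # the factorization order from having an empty context; in XLNet this
    # role is played by cached Transformer-XL states.
    K = np.vstack([np.zeros((1, d)), h @ Wk])
    V = np.vstack([np.zeros((1, d)), h @ Wv])
    mask = np.hstack([np.ones((T, 1), dtype=bool), mask])
    scores = np.where(mask, (Q @ Wq) @ K.T / np.sqrt(d), -1e9)
    return softmax(scores) @ V

rank = np.empty(T, dtype=int)
rank[[2, 1, 3, 0, 5, 4]] = np.arange(T)            # same order as Figure 2b
h_new = attend(h, rank[None, :] <= rank[:, None])  # content: sees z<=t, incl. self
g_new = attend(g, rank[None, :] <  rank[:, None])  # query: sees z<t only, no self
```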
In addition, we employ Transformer-XL with a multihead self-attention mechanism as the address semantic feature extractor [55]. Transformer-XL integrates two important techniques, a relative positional encoding scheme and a segment recurrence mechanism, which allow it to adapt well to the two-stream attention permutation language model. Since the number of semantic feature extraction structures affects model performance in the subsequent experiments, each layer of the Transformer-XL module is tentatively referred to in this section as an address-transformer module.
2.4. Fine-Tuning for Semantic Address Matching
Fine-tuning is an implementation of deep transfer learning: task-relevant structures and parameters are added to an already-trained model, which is then retrained on a task-relevant corpus [56]. We therefore used the newly constructed labelled address matching corpus for semantic address matching, adding a new neural network structure to the pretrained model to form a fine-tuned learning model and training framework for the classification task. The added network first applies a fully connected feedforward layer for a nonlinear transformation with the tanh activation function:

$$h = \tanh(W_1 x + b_1)$$

where $x$ is the representation output by the ASM, and $W_1$ and $b_1$ are the weight matrix and bias of the fully connected layer.
After obtaining the feature representation from this fully connected layer, we apply a second fully connected layer without an activation function for a linear transformation. Since semantic address matching is a binary classification task (match or no match), the output of this layer is two-dimensional. Finally, the output scores of this layer are passed through the SoftMax normalization function to predict the probabilities that the address pair matches or does not match. We designed the deep semantic address matching model (abbreviated as DSAMM) with the following objective function. Given that the number of address string pairs per batch iteration is $batch\_size$, the predicted probability output is $prob \in \mathbb{R}^{batch\_size \times 2}$, and the true label sequence is $label \in \{0, 1\}^{batch\_size}$, the predicted probability of the true label for each address pair is

$$p_i = prob\!\left[i,\; label[i]\right], \quad i = 1, \ldots, batch\_size.$$

The final objective function is obtained by taking the logarithm of these probabilities and averaging them; the model is trained by minimizing the negative of this mean (the cross-entropy loss):

$$\mathcal{L} = -\frac{1}{batch\_size} \sum_{i=1}^{batch\_size} \log p_i$$
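A compact PyTorch sketch of this classification head and loss follows; the hidden size and the use of a pooled encoder output are our assumptions following common XLNet fine-tuning practice, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class MatchingHead(nn.Module):
    """Head added on top of the pretrained ASM encoder (sketch)."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)  # nonlinear transform
        self.classifier = nn.Linear(hidden_size, 2)       # match / no match

    def forward(self, pooled):                            # (batch, hidden)
        x = torch.tanh(self.dense(pooled))
        return self.classifier(x)                         # logits (batch, 2)

logits = MatchingHead()(torch.randn(4, 768))              # stand-in ASM output
labels = torch.tensor([1, 0, 1, 1])
# cross_entropy = mean over the batch of -log softmax(logits)[label],
# i.e. exactly the objective function above.
loss = nn.functional.cross_entropy(logits, labels)
```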
The accuracy metrics used in this study are precision, recall, and the F1 score [57]. Precision is the proportion of true positive samples among those predicted to be positive; recall is the rate at which positive examples are predicted correctly and, in semantic address matching, refers to the percentage of correctly matched pairs out of all address pairs that should be matched; and the F1 score is the harmonic mean of precision and recall.
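For completeness, these metrics can be computed as follows (a minimal sketch assuming binary labels with 1 = match):

```python
def precision_recall_f1(y_true: list[int], y_pred: list[int]):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1([1, 0, 1, 1], [1, 0, 0, 1]))  # (1.0, 0.667, 0.8)
```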