Improving a Street-Based Geocoding Algorithm Using Machine Learning Techniques

Abstract: Address matching is a crucial step in geocoding; however, it also forms a bottleneck for geocoding accuracy, because precise input is the biggest challenge for establishing perfect matches. Matches still have to be established despite the inevitability of incorrect address inputs such as misspellings, abbreviations, informal and non-standard names, slang, or coded terms. Thus, this study proposes an address geocoding system that uses machine learning to enhance the address matching implemented on street-based addresses. Three different kinds of machine learning methods are tested to find the method with the highest accuracy. The performance of address matching using machine learning models is compared to multiple text similarity metrics, which are generally used for word matching. Extreme gradient boosting with optimal hyper-parameters proved to be the best machine learning method, with the highest accuracy in the address matching process, and it outperformed the similarity metrics on both training data and input data. The address matching process using machine learning achieved high accuracy and can be applied to any geocoding system to precisely convert addresses into geographic coordinates for various research purposes and applications, including car navigation.


Overview of Geocoding and Its Advancement
In this section, we will review the general steps involved in geocoding and existing studies on geocoding and word matching.

General Steps for Geocoding
As one of the primary functions of a GIS [19], geocoding assigns positions, such as latitude and longitude, to textual addresses using a reference database [3,5,10,15,19]. These addresses are attached to datasets that are applicable across various application domains; hence, geocoding establishes a bridge between spatial and attribute data [20]. This association with geographic information enables not just visual display through accurate maps, but also more in-depth spatial analysis [21,22]. This integration strengthens the function of GIS as a vital tool across various fields of interest such as urban planning and management [13], human activity and movement studies [5], health [9,23,24], emergency dispatching [7], traffic accidents [12], and management of administrative data [4].
Generally, the geocoding process includes three steps: parsing, matching, and locating, as shown in Figure 1 [5]. Previous studies have presented analogous procedures that mostly aim to accept textual input, preprocess this input, and match it against a database to return coordinate pairs for position [8,22,25]. Parsing converts unstructured or semi-structured input addresses into structured ones. This process is crucial for producing precise locations, even if the input addresses or the address databases are imprecise and vague [3]. Input addresses may fail to match because of their unstructured forms. Thus, it is essential to capture meaningful units of addresses and enhance the quality of these address elements, which is critical for improving geocoding accuracy [22]. Matching compares and links the structured input addresses to an address reference database, which contains the information on the addressing system to be matched with the input addresses. Locating finds coordinates based on the results of the matching process.

Existing Studies on the Advancement of Geocoding and Word Matching Using Machine Learning
Geocoding has been the subject of multiple studies, particularly on the improvement of its process. Lee (2009) [5] focused on an area-based address geocoding method, since street-based address geocoding methods are limited to Western countries and thus do not apply to countries with area-based addressing systems. The suggested area-based method was able to accurately define house locations within a block. To do so, it models block boundaries with block number and building number range information, and the network distance defines the geographic coordinates of houses along segments of block polygons using a linear interpolation technique. Furthermore, a 3D address geocoding method was also proposed based on 3D indoor geocoding [26]. The 3D address includes an address for the building and an indoor address, such as an apartment number. For 3D indoor geocoding in particular, network models of buildings are ideal: the constructed network model forms the foundation for calculating network distances with an interpolation method and finding location information within a building. Lee et al. (2017) [27] also suggested a novel idea for 3D indoor geocoding that utilizes optical character recognition to detect the current indoor location and semantic queries to determine a destination of interest. Yao et al. (2015) [6] suggested three fuzzy matching algorithms for Chinese address matching based on a full-text search. They focused more on user input and result control than on address standardization and models, which most current research attends to. With the three fuzzy matching methods, the address matching engine achieved higher match efficiency than traditional database retrieval. In addition, it allowed greater freedom in user input and result control and showed very high accuracy (100%).
Matci and Avdan (2018) [14] proposed a method to standardize addresses to improve geocoding results. The address data undergo parsing, semantic analysis, and reformatting through natural language processing. The developed method was tested on 233 primary school addresses using the Google Geocoding API and the ArcGIS geocoding API. In addition, the test data were standardized in three formats: the Turkish National Post Telephone Telegraph, Google, and ArcGIS. The results indicated that the standardized addresses significantly improved the accuracy of the geocoding results. In particular, when the addresses were standardized in the Google format and geocoded using the Google Geocoding API, the accuracy was highest (99.1%).

Regarding word matching, some studies have suggested ways of using machine learning methods [13,18,28,29]. Christen et al. (2006) [28] developed a novel geocode match engine with a rule-based approach to find an exact match or other approximate matches. In particular, hidden Markov models were adopted to achieve better address standardization accuracy. The developed geocoding engine achieved 94.94% matches at different levels (address, street, locality). Choi and Lee (2019) [18] proposed an approach to alias database management for efficient POI (Point of Interest) retrieval. The authors adopted Word2vec, a simple neural network structure, for word embedding to convert text data into numeric vectors. Word embedding enables a machine learning model to understand similar meanings of words. The suggested method determines the match between a given POI name and the corresponding POI in the alias DB based on text similarity. The most similar word with a similarity of 60% or more is retrieved, and the user confirms whether it is the correct one. Santos et al. (2018) [29] used supervised machine learning methods to combine multiple similarity metrics for toponym matching. Using machine learning with multiple similarity metrics has the benefit of avoiding manually set similarity thresholds. The authors showed that the machine learning methods outperform the individual use of similarity metrics with a manually set threshold. Lin et al. (2020) [13] introduced a novel address matching method based on deep learning techniques for identifying the semantic similarity between address records. The suggested method computed the semantic similarity of the compared address records to determine whether they match. It was evaluated on the Shenzhen Address Database and achieved 97% accuracy, outperforming other current address matching methods.
This study proposes an algorithm for accurate and efficient geocoding. For the matching process, machine learning methods with multiple similarity metrics are used, like Santos et al. (2018) [29]. Compared to Santos et al. (2018) [29], this study introduces more similarity metrics for better performance of matching. Moreover, the proposed method utilizes a hyper-parameter tuning of machine learning methods to select the best machine learning method for the matching process. An alias DB is the source of training data for machine learning methods. Regarding the quality of the training data, we also apply different ratios of matching and non-matching pairs to understand the effect of different combinations of the pairs on the performance of matching.

Geocoding Algorithm Using Machine Learning Techniques
In this section, we discuss the proposed geocoding algorithm, divided into three major parts: address parsing, address matching using machine learning, and address locating. The input addresses undergo a parsing process to determine which parts of the text belong to each address component. Then, each of these components undergoes a matching process integrated with a machine learning method to increase matching accuracy. We then compare the resulting matches to a reference database to calculate coordinate information. Figure 2 summarizes this process, and the succeeding sub-sections discuss it in detail.
Figure 2. Methodology for implementing the geocoding system using machine learning.


Address Parsing
Parsing is the analysis of a sequence of characters; in particular, it breaks given texts into meaningful pieces. In this paper, we used Korean addresses based on the road name address system implemented in 2014. This system follows a hierarchy of administrative units, optionally beginning with a province name (suffixed by "do"), city ("si"), and district ("gu"), which precede the road name (ending with "daero", "ro", or "gil", depending on the number of lanes) and the building number [30]. In the developed algorithm, the parsing step (left side of Figure 2 and Algorithm 1) divides the input address into the city, district, road, and building number so that each piece can work adequately in the matching or locating process. The building number then goes through the building number query to decide the road segment used to compute the geocoded location. The parsed city, district, and road names go to the matching component, where we match them with the records in the Alias DB.
Regular expressions, or simply regexes, are a programming tool available in many languages such as Python, Java, Perl, and PHP [31]. Regexes express patterns and repetitions in texts [32], so they perform well in detecting the parts of an address input. In the case of Korean addresses, they can identify address elements more flexibly than whitespace delimiters alone, since each element may itself contain whitespace. Based on the character suffixes used in Korean addresses, we constructed a regex to detect the characters or numerals of each element in the raw user input. The parsed city, district, and road names undergo address matching using machine learning, resulting in a set of road segments. The address matching process helps to select the set of road segments to be sent to the address locating process. The city, district, and road names are matched one or two times in the matching process to correctly find the records in the Alias DB for geocoding (Algorithm 2 and the middle of Figure 2). In the first matching, the city, district, and road names undergo simple matching against their official names. If no identical official name is found for the city, district, or road name, the second matching is invoked, using machine learning to supplement the first simple matching. Before matching using machine learning, we generate training data from the Alias DB and select the best machine learning model through the evaluation of 17 text similarity metrics (Table 1) and three machine learning models. Each record of the Alias DB has an official name and its aliases, while the training data consist of matching pairs and non-matching pairs. The matching pairs are pairs of one official name and one alias of that official name.
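The suffix-based parsing described above can be sketched with a regular expression. The pattern, field names, and sample address below are illustrative assumptions working on romanized suffixes; the actual implementation operates on Korean-script suffixes.

```python
import re

# Suffix-based parser for a romanized Korean road-name address.
# Province ("do"), city ("si"), and district ("gu") are optional levels
# of the administrative hierarchy; road names end in "daero", "ro", or "gil".
ADDRESS_PATTERN = re.compile(
    r"^(?:(?P<province>\S+do)\s+)?"      # optional province, suffix "do"
    r"(?:(?P<city>\S+si)\s+)?"           # optional city, suffix "si"
    r"(?:(?P<district>\S+gu)\s+)?"       # optional district, suffix "gu"
    r"(?P<road>\S+(?:daero|ro|gil))\s+"  # road name, suffix by lane count
    r"(?P<number>\d+(?:-\d+)?)$"         # building number, e.g. "1" or "10-2"
)

def parse_address(raw: str) -> dict:
    """Split a raw input address into its hierarchical components."""
    match = ADDRESS_PATTERN.match(raw.strip())
    if match is None:
        raise ValueError(f"unparseable address: {raw!r}")
    # Drop optional components that were not present in the input.
    return {k: v for k, v in match.groupdict().items() if v is not None}

parsed = parse_address("Gyeonggi-do Suwon-si Paldal-gu Hyowon-ro 1")
```

The building number from the parse result goes to the locating process, while the remaining components proceed to the matching process.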
On the other hand, the non-matching pairs are defined by the aliases and other unmatched official names in the Alias DB. In this study, we tested different combinations of matching and non-matching pairs for the training data, since the accuracy of the matching results may vary depending on the ratio of matching to non-matching pairs in the training data. For both kinds of pairs, we used variables calculated with 17 text similarity metrics to train the machine learning models.
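The construction of training pairs from the Alias DB can be sketched as follows. The toy records, helper name, and the 1:1 pair ratio are illustrative assumptions; the study varies this ratio.

```python
import random

# Toy alias DB: official road names mapped to their aliases (assumed values).
alias_db = {
    "Hyowon-ro": ["Hyowonro", "Hyowon ro", "Hvowon-ro"],
    "Jeongjo-ro": ["Jeongjoro", "Jongjo-ro"],
    "Paldalmun-ro": ["Paldalmunro"],
}

def build_pairs(db, non_match_ratio=1.0, seed=42):
    """Build (alias, official_name, label) training pairs.

    label 1 = matching pair (alias of that official name),
    label 0 = non-matching pair (alias combined with a different official name).
    """
    rng = random.Random(seed)
    officials = list(db)
    matching = [(alias, official, 1)
                for official, aliases in db.items() for alias in aliases]
    non_matching = []
    target = int(len(matching) * non_match_ratio)
    while len(non_matching) < target:
        official, wrong = rng.sample(officials, 2)   # two distinct officials
        alias = rng.choice(db[official])
        non_matching.append((alias, wrong, 0))
    return matching + non_matching

pairs = build_pairs(alias_db)
```

Each pair is then converted into a feature vector of similarity scores before training.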
Previous studies have widely used similarity metrics for word matching [16,18,33,34]. This methodology uses edit-based and token-based similarity metrics. Edit-based similarity compares texts character by character and is more applicable to short phrases; it is simple, yet inefficient and computationally expensive for long phrases. Token-based similarity, on the other hand, treats text as a set of tokens (words) and is more applicable to long texts.
Table 1. Seventeen kinds of similarity metrics.

Edit-based similarity metrics: Jaro [35], Jaro-Winkler [37], Jaro-Winkler Reversed [29], Jaro-Winkler Sorted [28], Hamming [40], Mlipns [42], Strcmp95 [37], Needleman-Wunsch [44], Gotoh [45], Smith_Waterman [46].
Token-based similarity metrics: Cosine [36], Tversky [38], Overlap [34], Bag [39], Jaccard [41], Sorensen_Dice [43], Monge_elkan [33].
Three different kinds of machine learning methods, support vector machine (SVM), random forest (RF), and extreme gradient boosting (XGB), were tested to find the best method. SVM is a supervised machine learning method. The goal of SVM is to find the hyperplane in an N-dimensional space that best separates the data points; the hyperplane with the maximum margin between data points is selected as the best decision boundary for classifying them. Given training data D = {(x_i, y_i) | i = 1, ..., n}, where x_i is an m-dimensional real vector and y_i is either 1 or −1, indicating the class of the input vector x_i, two parallel hyperplanes (Figure 3) are defined such that w^T x + b = 1 and w^T x + b = −1. Maximizing the distance between these two hyperplanes yields the maximum-margin hyperplane.
RF is also a supervised machine learning method and is built upon decision trees over the data points (Figure 4). It makes predictions from each of the trees with different features and selects the best solution by majority voting. Through majority voting, RF can reduce over-fitting.
XGB, also a supervised machine learning method, is an implementation of gradient boosted decision trees designed for more efficient computation and increased performance. In gradient boosted trees, each tree is a weak learner that tries to minimize the errors of the previous tree, so that together they form a strong learner.
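All three models consume, for each candidate pair, a feature vector of the 17 similarity scores in Table 1. As a minimal illustration, here are pure-Python versions of one edit-based metric (Jaro) and one token-based metric (Jaccard over character bigrams); these re-implementations are sketches, not the library implementations presumably used in the study.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity, an edit-based metric in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):                      # count matching characters
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):                            # count transpositions
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaccard_bigrams(s1: str, s2: str) -> float:
    """Jaccard similarity on character bigrams, a token-based metric."""
    a = {s1[i:i + 2] for i in range(len(s1) - 1)}
    b = {s2[i:i + 2] for i in range(len(s2) - 1)}
    return len(a & b) / len(a | b) if a | b else 1.0

# Two of the 17 features for one candidate pair.
features = [jaro("MARTHA", "MARHTA"), jaccard_bigrams("MARTHA", "MARHTA")]
```

In the full system, one such score per metric is computed for every training pair, and the resulting vectors are fed to SVM, RF, and XGB.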
We performed a 10-fold cross-validation to evaluate the performance of the matching process. This method splits the entire data set into 10 equally sized subsets and selects one subset as the test set; the remaining subsets become the training set, and the cross-validation is repeated 10 times so that each subset is used as the test set exactly once. The performance of the three machine learning models was compared with each similarity metric to see whether the machine learning models outperform the conventional methods.
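The 10-fold split described above can be sketched as an index partition; the helper name and the toy sample count are illustrative.

```python
def k_fold_indices(n_samples: int, k: int = 10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Distribute any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(100, k=10))
```

Each fold trains on 90% of the pairs and evaluates on the held-out 10%, and the reported accuracy is averaged over the 10 repetitions.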
To determine the minimum number of similarity metrics needed and to choose the best model, we tested different numbers of similarity metrics. First, the similarity metrics were arranged in descending order of accuracy and grouped into sets of 2 to 17 based on this order. Second, the smallest number of similarity metrics for which the machine learning models showed higher accuracy than all 17 individual similarity metrics was determined (Figure 10). Last, we chose the best model as the one with the highest accuracy among the three machine learning models at this minimum number of similarity metrics. We measured accuracy for each similarity metric and for the three machine learning models. For each similarity metric, a threshold value needs to be set that determines whether a given pair of texts matches. Therefore, to find the optimal threshold, different thresholds need to be tested (Algorithm 3). This methodology tests threshold values ranging from 0.00 to 1.00 in increments of 0.05 for each similarity metric and sets the optimal threshold as the one with the highest accuracy. When the first and second matching do not find a matched official name, the city, district, or road name is input manually, and the matching is undertaken once again. The address matching process ends with a set of road segments: candidate road segments are selected and sent to the locating process together with the parsed building number. The combination of road segments and the building number helps to narrow down to the final road segment needed for locating.
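The threshold sweep of Algorithm 3 can be sketched as follows; the scored pairs below are illustrative assumptions.

```python
def best_threshold(scores, labels):
    """Sweep thresholds 0.00-1.00 in 0.05 steps and keep the most accurate.

    scores: similarity values per pair; labels: 1 = match, 0 = non-match.
    """
    best_t, best_acc = 0.0, -1.0
    for step in range(21):                           # 0.00, 0.05, ..., 1.00
        t = step * 0.05
        predictions = [1 if s >= t else 0 for s in scores]
        acc = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
        if acc > best_acc:                           # keep the first best
            best_t, best_acc = t, acc
    return best_t, best_acc

threshold, accuracy = best_threshold(
    scores=[0.95, 0.80, 0.90, 0.40, 0.30, 0.50],
    labels=[1, 1, 1, 0, 0, 0],
)
```

The machine learning models need no such manual threshold, which is one motivation for combining the metrics with a classifier.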

Address Locating
The U.S. Census Bureau performs address locating with address ranges [47]. As illustrated in Figure 5, each record in the address database includes two address ranges, for the left and right sides of the road. The left side has odd-numbered addresses, and the right side has even-numbered addresses. For example, geocoding the address with the building number '105' means we must first identify how this building number was assigned. In some territories, such as the United States, the Bureau assigns this number from the address number range of the start and end nodes located at the centerlines of the roads intersecting the concerned road segment, and the building number arises from interpolation. Since the numbering begins at the centerlines, not within the segment itself, the widths of the perpendicular roads containing the start node (start width) and the end node (end width) must be taken into consideration. Therefore, to geocode this building's address, its location is approximated on the left side (odd-numbered addresses) of the road 'Park Avenue' using a linear interpolation method. To adjust the aforementioned road's length, the road types of the adjacent roads (i.e., the road containing the start node and the one containing the end node) are taken into consideration. In some cases, such as in Korea, the start and end nodes are aligned with the two ends of a road segment instead of intersection points (right side of Figure 5). Here, the segment is cut off by half the width of the connected roads on both sides, shortening the total actual segment length by twice this end_offset value. Since 10 m showed the highest geocoding accuracy among 5, 10, and 15 m (not presented here), this study used 10 m for the end_offset. From the address database, the X and Y coordinates of the start and end of the road segments, x_from, y_from, x_to, y_to, were obtained, and these values were used to compute the bearing angle θo.
Using the end_offset value and θo, we calculated the X and Y coordinates, x_start, y_start, x_end, y_end, of the shortened segment after the cut-off. Since the segment cut-off only accounts for the width of the road perpendicular to the road segment in question, the x_start and y_start corresponding to the beginning of the shortened segment are still not in the middle of the sub-segment corresponding to the first building number, for example, '101' in Figure 5 on the right side. Hence, after interpolating the location of building number '103' along the segment, the midpoint was calculated using the quantity mid_offset. This midpoint is the position on the segment directly in front of the building's center.
In both types of addresses, the geocoded point's final location (x, y) is inside a building or a parcel. This final point is calculated along a pre-defined perpendicular offset, perp_offset in Algorithm 4, away from the road. The offset can be a fixed value, like 5 m or 10 m, or a combination of different values considering road types, with an azimuth Az of 90° away from θo, clockwise if the building is on the right or counterclockwise if it is on the left. In this study, the offset is a combination of different values considering road types. Since the width of a road can vary according to its type, we expect the combined offset to work well in moving addresses correctly onto a building or a parcel. Within the context of roads in Korea, the wide road type has a width of 30 m. Similarly, we assigned a width of 6 m to narrow roads with fewer than four lanes, and 20 m to the others. Figure 6 illustrates this process, and the procedure that follows details it.
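Under the Korean-style segment layout of Figure 5, the locating step can be sketched as below. The function name, the flat coordinate system, and the sample address range are assumptions; end_offset, mid_offset centering, and perp_offset follow the quantities described above.

```python
import math

def locate(x_from, y_from, x_to, y_to, from_no, to_no, building_no,
           end_offset=10.0, perp_offset=6.0, side="left"):
    """Interpolate a building position along a road segment (sketch)."""
    theta = math.atan2(y_to - y_from, x_to - x_from)   # bearing angle
    dx, dy = math.cos(theta), math.sin(theta)
    # Cut end_offset off both ends of the segment.
    x_start, y_start = x_from + end_offset * dx, y_from + end_offset * dy
    x_end, y_end = x_to - end_offset * dx, y_to - end_offset * dy
    # One sub-segment per building number on this side (numbers step by 2).
    n_numbers = (to_no - from_no) // 2 + 1
    sub_len = math.hypot(x_end - x_start, y_end - y_start) / n_numbers
    index = (building_no - from_no) // 2
    along = (index + 0.5) * sub_len                    # center of sub-segment
    x_mid, y_mid = x_start + along * dx, y_start + along * dy
    # Perpendicular offset: counterclockwise for left, clockwise for right.
    sign = 1.0 if side == "left" else -1.0
    return (x_mid - sign * perp_offset * dy, y_mid + sign * perp_offset * dx)

# Segment from (0, 0) to (120, 0); left-side range 101-109; locate '103'.
point = locate(0.0, 0.0, 120.0, 0.0, from_no=101, to_no=109, building_no=103)
```

With a 120 m segment, a 10 m end_offset leaves 100 m, giving five 20 m sub-segments; '103' falls at the center of the second one, pushed 6 m off the road.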

Experimental Evaluation
This section presents three different kinds of experiments, as illustrated in Figure 7, to compare the performance of geocoding using machine learning with simple matching and similarity metrics. The input address data commonly go through the address parsing, matching, and locating processes. Experimental case 1 performs address matching with simple matching only, as described in Section 3.2, without similarity metrics or machine learning. Experimental case 2 uses a similarity metric for the address matching; this case tests the similarity metric with the highest accuracy and the one with the lowest accuracy among the edit-based and token-based similarity metrics, respectively, to explore the range of performance. Experimental case 3 performs the address matching with machine learning, for which we generated training data from the Alias DB. For the input data, 1524 input addresses were selected and used as test data. In the input data, we assumed that city and district names were correctly input, but the road names contained spelling errors and mistakes. In most cases, input data contain 25%-75% perfect matches [8], so this study generates and tests three kinds of input data with different percentages of perfect matches: Input data 1 has 30% correct matches, Input data 2 has 50%, and Input data 3 has 70%. For the correct matches, we used the standard addresses obtained from the official address website and made Input data 1, 2, and 3 by changing the number of correct addresses. To make incorrect matches in Input data 1, 2, and 3, we created wrong addresses by randomly inserting a space, special character, number, or typo, or by removing characters from addresses. Figure 8 illustrates the road network data (the address database in Figure 2) and its schema. It consists of road segments in Gyeonggi Province, Suwon City, Paldal District, Korea.
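The generation of incorrect input addresses can be sketched as follows; the corruption operations mirror those listed above, while the helper name and sample road name are illustrative assumptions.

```python
import random

def corrupt(address: str, rng: random.Random) -> str:
    """Apply one random corruption: insert a space, insert a special
    character or digit, remove a character, or introduce a typo."""
    operation = rng.choice(["insert_space", "insert_char",
                            "remove_char", "typo"])
    i = rng.randrange(1, len(address))               # corruption position
    if operation == "insert_space":
        return address[:i] + " " + address[i:]
    if operation == "insert_char":
        return address[:i] + rng.choice("#*0123456789") + address[i:]
    if operation == "remove_char":
        return address[:i] + address[i + 1:]
    # typo: replace one character with a random letter
    return address[:i - 1] + rng.choice("abcdefghijklmnopqrstuvwxyz") + address[i:]

results = [corrupt("Hyowon-ro", random.Random(seed)) for seed in range(20)]
```

Varying how many addresses are corrupted this way produces the 30%, 50%, and 70% correct-match data sets.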
The database schema for the road network contains the road name (roadname) and building numbers for the address ranges. From R is the start building number and To R is the end building number for the right side of the road; From L and To L are the start and end building numbers for the left side. From X, From Y, To X, and To Y are the coordinates of the two ends of the road segment.
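The schema can be illustrated with a small record and a side-selection helper. The underscored field names and sample values are assumptions about the layout shown in Figure 8; odd numbers fall in the left-side range and even numbers in the right-side range.

```python
# One assumed road-segment record following the schema described above.
segment = {
    "roadname": "Hyowon-ro",
    "From_L": 101, "To_L": 109,   # left side: odd building numbers
    "From_R": 102, "To_R": 110,   # right side: even building numbers
    "From_X": 0.0, "From_Y": 0.0, "To_X": 120.0, "To_Y": 0.0,
}

def side_of(segment: dict, building_no: int):
    """Return which side of the segment a building number falls on, if any."""
    if building_no % 2 == 1 and segment["From_L"] <= building_no <= segment["To_L"]:
        return "left"
    if building_no % 2 == 0 and segment["From_R"] <= building_no <= segment["To_R"]:
        return "right"
    return None

side = side_of(segment, 103)
```

The chosen side determines the direction of the perpendicular offset applied in the locating step.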


Experimental Case 1: Address Matching without Any Similarity Metrics
The input data first go through the address parsing process, which divides each input address into city, district, road, and building number. Since we assume that there are no errors or mistakes in the city and district names or in the building numbers, the methodology matches them directly, without a similarity metric, before further processing.
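The parsing plus simple (exact) matching step can be sketched as below. The space-separated "city district road number" form is a simplification of real Korean road-name addresses, and the function names are illustrative.

```python
def parse_address(addr):
    """Split an input address into city, district, road name, and
    building number. Assumes the simplified form
    'city district road-name number'; real inputs need a fuller parser."""
    tokens = addr.split()
    return {"city": tokens[0],
            "district": tokens[1],
            "road": " ".join(tokens[2:-1]),   # road names may contain spaces
            "number": tokens[-1]}

def simple_match(parsed, road_db):
    """Exact matching only: an alias or misspelled road name simply
    fails to match, so it becomes 'unmatched', never 'mismatched'."""
    return parsed["road"] if parsed["road"] in road_db else None
```

This is why the simple-matching case below produces unmatched addresses but no mismatched ones.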
Address matching without any similarity metrics achieves 71.78% accuracy when Input data 3, with 70% of correct matches, is used (Table 2). As the percentage of correct matches in the input data increases, the accuracy rises, as expected. Mismatched addresses are those matched to an incorrect address, whereas unmatched addresses are those for which no match was found. Since simple matching cannot match aliases to official names without a similarity metric, there are no mismatched input addresses; instead, 28.22% of the input addresses in Input data 3 are unmatched. The address locating process then geocodes all 1524 input addresses, as illustrated in Figure 9. A total of 32.74% of the input addresses are geocoded on their corresponding parcels using Input data 3; the rest are geocoded outside the corresponding parcels.
To assess positional accuracy, the distances between the geocoded input addresses and their corresponding parcels were also calculated, excluding mismatched addresses, which may have considerable distances. On average, the geocoded points are 4.43 m away from their corresponding parcels using Input data 3. As the percentage of correct matches increases from Input data 1 to Input data 3, the mean distance decreases, because the number of addresses geocoded on their corresponding parcels (at 0 m distance) rises.

Table 2. Performance of the address matching process without any similarity metrics, percentage of geocoded input addresses on corresponding parcels, and distances between geocoded addresses and the corresponding parcels.


Experimental Case 2: Address Matching with a Similarity Metric
In the address matching with a similarity metric, we tested Jaro-Winkler Sorted and Mlipns among the edit-based similarity metrics, and Tversky and Monge-Elkan among the token-based similarity metrics. Among the four metrics, Jaro-Winkler Sorted and Tversky, which have the highest accuracy when the training data are used (Table 5), also achieve accuracy above 89% with Input data 3, higher than the other two metrics. Mlipns and Monge-Elkan have approximately 14% lower accuracy and 4-5 times more mismatches than Jaro-Winkler Sorted and Tversky, as shown in Table 3. The accuracy of Mlipns and Monge-Elkan falls even further behind Jaro-Winkler Sorted and Tversky for Input data 1 and 2 than for Input data 3. Regarding the quality of matches, Monge-Elkan produces more mismatched addresses than Mlipns, meaning that it more often matches an input to a similar but incorrect name. In total, 38.71% of all input addresses are geocoded on their corresponding parcels with the side offset using Jaro-Winkler Sorted with 70% of correct matches in the input addresses, and the rest (61.29%) are outside the corresponding parcels.

Table 3. Performance of the address matching process with a similarity metric, percentage of geocoded input addresses on corresponding parcels, and distances between geocoded addresses and the corresponding parcels.

The mean distances of Jaro-Winkler Sorted and Tversky are around 4.40 m, lower than those of Mlipns and Monge-Elkan when using Input data 1. Tversky has the lowest mean distance, whereas Monge-Elkan shows the highest, in all three kinds of input data.
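The matching step in this case reduces to scoring the input road name against every candidate and accepting the best candidate above a threshold. As a self-contained sketch, the stdlib `difflib.SequenceMatcher.ratio` stands in for the paper's metrics (Jaro-Winkler Sorted, Mlipns, Tversky, Monge-Elkan); the threshold value is an assumption.

```python
from difflib import SequenceMatcher

def best_match(road, candidates, threshold=0.8):
    """Return the candidate road name most similar to the input, or
    None if nothing reaches the threshold. difflib's ratio is a
    stdlib stand-in for the edit-/token-based metrics in the paper."""
    scored = [(SequenceMatcher(None, road.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None
```

Unlike exact matching, this can recover a misspelled name, but it can also produce a mismatch when a wrong candidate happens to score highest, which is the trade-off the mismatch counts above reflect.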

Experimental Case 3: Address Matching Using Machine Learning
For the address matching using machine learning, the Alias DB and training data derived from it are needed. Each record of the Alias DB contains one official name and its four aliases, which we established systematically based on the rules for creating alias attributes [18], as shown in Table 4. For instance, in Alias 1, a space is inserted after the first two characters. For Alias 2, a similar character replaces the first or second character. The Alias DB has 352 records for the set of official road names and their four aliases.

Table 4. Four attributes of address aliases.

Alias 1: Case using one space in the official name
Alias 2: Case where the official name has only one character removed
Alias 3: Case where the official name has two characters removed
Alias 4: Case where the official name has only one misspelling

The training data consisted of matching pairs and non-matching pairs, as described in Section 3.2, and had 1750 records. For the matching pairs, each official name is paired with itself and with each of its aliases, and similarity values are calculated using the seventeen similarity metrics. For the non-matching pairs, we paired an official name and its four aliases with another, similar official name. We chose similar names for the non-matching pairs to improve the ability of the machine learning models to distinguish a name from its similar forms; one example of a non-matching pair is <Hyowon 94th street, Hyowon 9th street>. We generated and tested three kinds of training data: 20% matching pairs and 80% non-matching pairs (Training data 1), 50% of each (Training data 2), and 80% matching pairs and 20% non-matching pairs (Training data 3). The training data have a column 'Matching' used as the label for training, which indicates whether a record is a matching pair (Matching: 1) or a non-matching pair (Matching: 0).
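The construction of training records can be sketched as follows. This is an illustrative reconstruction: the alias-generation positions, the two stand-in similarity features (difflib's ratio and a character-bigram Jaccard, in place of the seventeen metrics), and all function names are assumptions.

```python
from difflib import SequenceMatcher

def bigram_jaccard(a, b):
    """Token-style feature: Jaccard similarity over character bigrams."""
    A = {a[i:i + 2] for i in range(len(a) - 1)}
    B = {b[i:i + 2] for i in range(len(b) - 1)}
    return len(A & B) / len(A | B) if A | B else 1.0

def make_aliases(name):
    """Generate the four alias attributes of Table 4 at illustrative
    positions: a space inserted, one character removed, two characters
    removed, one misspelling."""
    return [name[:2] + " " + name[2:],   # Alias 1: one space inserted
            name[:3] + name[4:],         # Alias 2: one character removed
            name[:3] + name[5:],         # Alias 3: two characters removed
            name[:3] + "x" + name[4:]]   # Alias 4: one misspelling

def make_training_rows(official, similar_official):
    """Matching pairs: <official, official or alias>, label 1.
    Non-matching pairs: <similar official, official or alias>, label 0.
    Each row holds the similarity features plus the 'Matching' label."""
    rows = []
    for alias in [official] + make_aliases(official):
        for other, label in [(official, 1), (similar_official, 0)]:
            rows.append({"pair": (other, alias),
                         "ratio": SequenceMatcher(None, other, alias).ratio(),
                         "bigram_jaccard": bigram_jaccard(other, alias),
                         "Matching": label})
    return rows
```

In the real training data each row would carry all seventeen metric values as features rather than the two shown here.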
To choose the best machine learning model for the address matching, the performance of the three machine learning models was compared with matching without any similarity metrics and with the 17 similarity metrics, using the training data, through model evaluation and selection. Without any similarity metrics, matching shows 85% accuracy using Training data 1 and 37% using Training data 3 (Table 5). As the ratio of matching pairs grows, the accuracy of matching without similarity metrics drops, because this method only identifies identical word pairs and cannot treat similar words as identical. Among the similarity metrics, Overlap shows the highest accuracy using Training data 1 (95%), whereas Jaro-Winkler Sorted, Tversky, and Jaccard show the best accuracy using Training data 3 (96%). Most similarity metrics show their highest accuracy on Training data 3 compared to Training data 1 and 2, the opposite of matching without similarity metrics, possibly because the similarity metrics can match similar words and Training data 3 has the highest percentage of matching pairs with aliases. However, when the training data are split 50%-50% and are not biased toward matching or non-matching pairs, many similarity metrics have their lowest accuracy, indicating that it is difficult for a similarity metric to identify both matching and non-matching pairs well when the two are present in equal parts. The three machine learning models have higher accuracy than all 17 similarity metrics on all three kinds of training data. All three methods are tuned with optimal hyper-parameters to achieve the best performance. We performed hyper-parameter tuning using a grid search, which takes a list of parameters and a range of values for each parameter.
For each classifier, the algorithm tries every combination of parameter values, and the best set of hyper-parameter values is then chosen based on accuracy. SVM, RF, and XGB all reach the highest accuracy using Training data 1 (100%), and accuracy decreases slightly as the ratio of matching pairs increases. SVM and RF show 99% accuracy using Training data 2, while XGB achieves 100%. Using Training data 3, SVM has 97% accuracy, whereas RF and XGB show 98%. XGB shows the highest accuracy (100%) on Training data 2 when tuned with a learning rate of 0.05, 400 gradient-boosted trees, and a column subsample ratio of 0.7.
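The grid-search mechanics described above can be shown without any ML library. The sketch below searches over a trivial stand-in classifier (a tunable similarity threshold) rather than the paper's SVM, RF, or XGB; with a real model, the parameter grid would hold the XGBoost values reported (learning rate, tree count, column subsample ratio). All names here are illustrative.

```python
from itertools import product

def grid_search(train, param_grid, evaluate):
    """Exhaustive grid search: try every combination of parameter
    values and keep the combination with the highest accuracy, as in
    the hyper-parameter tuning step described above."""
    names = sorted(param_grid)
    best_acc, best_params = -1.0, None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        acc = evaluate(train, params)
        if acc > best_acc:
            best_acc, best_params = acc, params
    return best_params, best_acc

def threshold_accuracy(train, params):
    """Stand-in classifier: predict a match when the similarity
    feature exceeds a tunable threshold; returns training accuracy."""
    correct = sum((sim >= params["threshold"]) == label
                  for sim, label in train)
    return correct / len(train)
```

With a real XGB model, `evaluate` would train the classifier with `params` and return its validation accuracy; the search loop itself is unchanged.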
Among the three kinds of training data, Training data 3, on which the machine learning models show their lowest accuracy, was used to test different numbers of similarity metrics and to select the best machine learning model. We chose 97% as the minimum accuracy that all three machine learning models need to exceed, 1% higher than the highest accuracy of the similarity metrics on Training data 3. We ranked the similarity metrics in descending order of accuracy and tested them in that order. As shown in Figure 10, all the trained models exceed 97% using nine similarity metrics (Jaro-Winkler Sorted, Tversky, Jaccard, Jaro, Jaro-Winkler Reversed, Hamming, Cosine, Sorensen_Dice, Jaro-Winkler). Among them, XGB shows the highest accuracy (98.29%) and is selected as the best model, with nine similarity metrics, for the evaluation of geocoding.
With the trained XGB model, 96.39% of input addresses are correctly matched through the address matching process using machine learning (Table 6) for Input data 3. Since some input addresses include random spelling errors and mistakes that do not follow the systematic rules, the accuracy is lower than the results in Table 5. Regarding the quality of address matching, 1.25% of the input addresses in Input data 3 are mismatched and 2.36% are unmatched. Results show that 40.94% of all input addresses are geocoded on the corresponding parcels with the side offset, and the rest (59.06%) are outside the corresponding parcels. The percentage of addresses geocoded on corresponding parcels does not differ much among the three kinds of input data with different ratios of correct matches.

Table 6. Performance of the address matching process using machine learning, percentage of geocoded input addresses on corresponding parcels, and distances between geocoded addresses and the corresponding parcels.

On average, the geocoded points are 4.44 m away from their corresponding parcels for Input data 1. The mean distance of Input data 1 is lower than for the other two kinds of input data because a few addresses that would be geocoded at long distances are unmatched, so the evaluation excludes their distances. The roughly 4 m mean distance with a 7 m standard deviation is comparable to GPS positional accuracy.
GPS positional accuracy ranges from 4.4 to 10.3 m in an urban environment [48]; considering this, the developed geocoding system is applicable in domains such as GPS-equipped car navigation. Table 7 and Figure 11 show the results of all three experimental cases. The accuracy of address matching with or without similarity metrics and with machine learning increases with the ratio of correct matches, from Input data 1 to Input data 3. The results show that address matching using machine learning outperforms matching with or without similarity metrics. Notably, the accuracy of address matching using XGB is 7% higher than Jaro-Winkler Sorted and Tversky and 21% higher than Mlipns and Monge-Elkan when using input data with 70% of correct matches. When the input data have 30% of correct matches (Input data 1), the accuracy difference between XGB and Jaro-Winkler Sorted and Tversky grows to 14%, twice that for input data with 70% of correct matches, and the difference between XGB and Mlipns and Monge-Elkan grows to 53-55%, about 2.5 times that for input data with 70% of correct matches. This indicates that the performance of Mlipns and Monge-Elkan drops more sharply than that of Jaro-Winkler Sorted and Tversky as the percentage of correct matches in the input data decreases.

* Input data 1: 30% of correct matches; Input data 2: 50% of correct matches; Input data 3: 70% of correct matches.

Figure 11. Accuracy of address matching in three experimental cases.

The performance of XGB is consistently high compared to address matching with or without similarity metrics. The accuracy of XGB is 95-96% across input data with different percentages of correct matches, whereas the accuracy of address matching with or without similarity metrics is lower and varies more. In particular, Jaro-Winkler Sorted and Tversky vary from around 80% for input data with 30% of correct matches to 89% for input data with 70% of correct matches: where XGB changes by 1%, Jaro-Winkler Sorted and Tversky change by 9%. As a result, address matching using machine learning is more stable and accurate for geocoding than simple matching or matching with a similarity metric. Combining the capabilities of various similarity metrics through machine learning is better than using each similarity metric alone for geocoding.
Further, among the four similarity metrics, Jaro-Winkler Sorted and Tversky are more consistent across input data with different ratios of correct matches than Mlipns and Monge-Elkan: where Jaro-Winkler Sorted and Tversky vary by 9%, Mlipns and Monge-Elkan change by approximately 34%. Jaro-Winkler Sorted and Tversky, which have the highest accuracy on the training data, also achieve high accuracy on the input data and are less sensitive to the proportion of correct matches in address matching. On the other hand, Mlipns and Monge-Elkan, which have the lowest accuracy among the edit-based and token-based similarity metrics on the training data, also show low accuracy and are more sensitive to the proportion of correct matches. Thus, Mlipns and Monge-Elkan cannot be expected to perform consistently and are not appropriate for accurate address matching when input addresses have varying percentages of correct matches.

Conclusions
This study suggested an algorithm for accurate and stable geocoding. Address parsing helped to obtain meaningful units of addresses to send to the matching process. We introduced machine learning techniques to achieve high accuracy in the address matching process. XGB with optimal hyper-parameters proved to be the best machine learning method, with the highest accuracy in the address matching process, and its accuracy outperformed the similarity metrics on both training data and input data. The performance of XGB was also consistent across different kinds of input data. The address matching process using machine learning was able to deal with human errors, including spelling errors, in input addresses and match addresses accurately. As a module in the suggested geocoding system, it can be applied to any other geocoding system to precisely convert addresses into geographic coordinates for relevant research and applications. The address locating process narrowed the candidate road segments selected in the address matching down to one road segment and converted each address into a single geocoded point on the map.
This study, however, has some limitations. First, the performance of the matching process depended on the quality of the Alias DB. The training data had only 1750 records of matching and non-matching pairs, an insufficient number for a robust machine learning model, and for each official name we made only four kinds of aliases, so some aliases may have been excluded from the Alias DB and the training process. Thus, for future work, more matching cases with more aliases need to be considered to enhance the robustness of the trained models. Second, the proposed algorithm is only applicable to street-based addresses.
Apart from street-based addresses, there are area-based and hybrid addressing systems [5], which the proposed algorithm did not consider. Hierarchical area-based addressing systems are used in Eastern Asia, while China has a hybrid addressing system. To apply the algorithm in various countries, we suggest extending it to cover these two addressing systems in future research.
Moreover, further effort is needed to reduce mismatched addresses. Mismatched addresses end up geocoded far away from the correct places and decrease the positional accuracy of geocoding. Therefore, we need to find ways to complement the machine learning (e.g., by adding additional rules) to avoid mismatches. Lastly, this study did not consider unstructured addresses in the parsing step of the geocoding. It assumed that the input addresses are structured and normalized, and that no errors occur in the parsing process. Although machine learning techniques in the matching process can handle abbreviations, misspellings, and misplacements, normalization, such as removing punctuation, and standardization are still necessary to some extent to achieve high geocoding performance [49]. Therefore, future work needs to strengthen the parsing process.