Deep Learning for Toponym Resolution: Geocoding Based on Pairs of Toponyms

Abstract: Geocoding aims to assign unambiguous locations (i.e., latitude-longitude coordinates) to place names mentioned in text.


Introduction
Geocoding is a core part of the broader text geoparsing task (also known as toponym resolution), in addition to geotagging. While geotagging deals with the automatic recognition of named entities (i.e., named entity recognition) corresponding to places, geocoding aims to match the identified place entities to the corresponding locations. Usually, the task consists of associating real-world coordinates (i.e., latitude-longitude) or polygon boundaries to toponyms. One important challenge in geocoding is toponym disambiguation [1], which faces, among other types of ambiguity issues [2], the problem of referent ambiguity (also known as geo/geo ambiguity). Referent ambiguity refers to toponyms having multiple possible locations [3] (e.g., Sofia, the capital of Bulgaria, vs. Sofia, a province in Bulgaria). Additionally, as gazetteer lookup methods are widely used for geocoding, gazetteer completeness (i.e., appropriate geospatial and temporal coverage) is also a major issue for the geocoding task. For instance, in the case of historical document analysis, new methods were proposed in order to geocode toponyms and solve toponym ambiguity without using gazetteers [4][5][6].
Place names are often grounded to their geographic area, and specific character sequences (e.g., prefixes and suffixes) can be found in close locations. For instance, place names in the south-west of France tend to end with the suffix "-ac". Character n-grams have proven efficient for different natural language processing tasks, including texts with spelling errors or neologisms. Therefore, we propose a deep neural network architecture to model toponym co-occurrences in various contexts, combining n-gram embeddings with Long Short-Term Memory (LSTM) units (The code is available online: https://git.liris.cnrs.fr/jfize/toponym-geocoding (accessed on 28 November 2021)). The proposed architecture takes pairs of toponyms as input and returns latitude and longitude coordinates as outputs. For each pair, the first entry is the toponym that we want to geocode, and the second entry is used as context. This particular matching choice can easily adapt to different geocoding scenarios or applications (e.g., resolving place names referenced in textual paragraphs or place names appearing in tables or spreadsheets). We describe several experiments and evaluation results based on various contexts. For instance, we built several datasets of pairs of toponyms in order to evaluate the contribution of different relations between toponyms. We consider three types of relations, namely (1) co-occurrences of toponyms in text, based on Wikipedia articles describing geo-located places, (2) spatial proximity, computed from Geonames data and based on a buffer radius around each toponym, and (3) spatial inclusion, also computed from Geonames data and based on the feature type hierarchy (e.g., a city included in an administrative region).
This article is organised as follows: Section 2 presents related work, while Section 3 describes the proposed architecture and the data used for model training. Section 4 presents several experiments and evaluation scores obtained on different datasets. Then, Section 5 discusses limitations of our proposal, and Section 6 concludes the paper.

Related Work
We distinguish four categories of methods for geocoding: two using gazetteer matching with either heuristics or machine learning techniques, and two using no gazetteer data and leveraging language models or deep learning methods.
Several studies used map-based approaches or distance heuristics in combination with gazetteer lookup methods [1,7,8]. These methods are mainly based on the calculation of the distance between place candidates and unambiguous toponyms. Lieberman et al. proposed to combine different spatial contexts: a global context that integrates knowledge from external datasets (i.e., gazetteers) and a local context that uses information extracted from the text itself [9]. In the context of hiking description analysis, Moncla et al. proposed a method based on the DBSCAN (Density-Based Spatial Clustering) algorithm [10], grouping all toponym referents and then selecting the cluster of places that contains the maximum number of distinct toponyms [11]. Other heuristics, such as subtyping using feature type metadata, are also used. In addition to the spatial context (e.g., distance, proximity, density, centroid, etc.), methods involving other types of heuristics (e.g., importance, size, population count, semantic or ontology hierarchical relations, etc.) have also been proposed [12][13][14].
Other data-driven approaches are also based on gazetteers but use machine learning instead of manually designed combinations of heuristics [15][16][17][18]. Hu and Ge used machine learning algorithms such as decision trees on a probability matrix between toponyms and place candidates, where each weight is computed by comparing the geographical features of all candidates [15]. Lieberman and Samet proposed to combine features from the toponym to be disambiguated and other toponyms that appear in a context window [16]. Molina-Villegas et al. proposed the use of word embeddings for geographic named entity recognition and geographic entity disambiguation [17]. This approach aims to explore semantic relationships of words and documents in Mexican Spanish. They use Wikipedia articles of locations to enrich the semantic space of word embedding models with information from different topics such as culture, economy and history.
Several gazetteer-free methods have also been developed. Some of these are based on language models, defined in [19] as a model that "[...] assigns a probability of likelihood of a given word (or a sequence of words) to follow a sequence of words". In geocoding, approaches that use language models compute the probability of a word (or word sequence) being associated with a geographic footprint. Delozier et al. propose to learn the probabilities for words (including toponyms) to be associated with a region, the latter being defined by a rectangular area [4]. Kamallo and Rafieri use different language models learned on specific features, combining them to associate a ranking with each place candidate for a toponym [20]. Speriosu and Baldridge propose different geocoding algorithms, one based on geographical features only (from Geonames) and three text-driven approaches using a subset of Wikipedia, named GEOWIKI, that corresponds to articles for geographic places [21].
More recent studies rely on deep neural network architectures [22][23][24]. For instance, Gritta et al. proposed a network architecture that learns from multiple inputs: context words, context place mentions, and MapVec, i.e., a vector that encodes all place coordinates that share the same toponym as the input [22]. Cardoso et al. [23] combined context-aware word-embeddings [25] and a recurrent neural network based on Bidirectional LSTMs [26].
Context-based methods can obtain a high geocoding accuracy. However, most of these approaches require external resources (even after training) and are still based on analysis performed at the word level. They can thus be unfit for toponym variations or spelling errors. Considering these issues, our contribution lies in proposing a model that does not require a gazetteer and is trained at the subword level. In addition to reducing the impact of spelling errors, the use of sub-words (or character n-grams) is known to be efficient in different natural language processing tasks. Most importantly, certain subwords can carry spatial properties due to their usage. For instance, the prefix tre- is commonly used in Bretagne, France, because it means populated place in the local language. Additionally, in France, the suffix -ac is found almost exclusively in names of places located in the south-west of the country.

Materials and Methods
To address the problem of geocoding toponyms, we propose a neural network architecture that takes two toponyms as input and returns latitude and longitude coordinates corresponding to the location of the first one. We chose to use only two toponyms because we want our model to geocode place names from data with little contextual information, such as tabular data, historical documents, or image and map captions. The two input toponyms are defined as follows: the first toponym t is the one to be geocoded. The second toponym ct is a contextual toponym that helps to disambiguate t. The model can be defined as a function f such that f(t, ct) → (t_lat, t_lon) ∈ R^2, where t_lat and t_lon refer to the latitude and the longitude of t. For our model to adapt to toponym variations (aliases, spelling errors, etc.) and to learn geographic properties of certain affixes, input toponyms are transformed into sequences of character n-grams. For instance, if the sequence size n is 2, the toponym Paris will be represented as the following sequence: {Pa, ar, ri, is}. In this study, based on preliminary experiments, n is set to 4.
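The n-gram decomposition described above can be sketched in a few lines (a minimal illustration; the function name is ours):

```python
def char_ngrams(toponym, n):
    """Split a toponym into its sequence of overlapping character n-grams."""
    return [toponym[i:i + n] for i in range(len(toponym) - n + 1)]

# With n = 2, "Paris" becomes {Pa, ar, ri, is}, as in the example above.
```

With n = 4 (the value used in this study), "Paris" yields only two n-grams ("Pari", "aris"), which is why short toponyms contribute few but highly discriminative subwords.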

Process Overview
The process workflow is divided into three steps: (i) toponym transformation to character n-grams; (ii) latitude-longitude prediction using a recurrent neural network architecture; (iii) reprojecting output coordinates into the WGS84 (https://fr.wikipedia.org/wiki/WGS_84 (accessed on 28 November 2021)) coordinate system.
The first step transforms each input toponym into a character n-gram sequence. In order to be compatible with the neural network, we need to assign each n-gram to a row in an embedding matrix, which contains vector representations for a defined vocabulary, e.g., a set of words or word n-grams. In our approach, this vocabulary corresponds to every n-gram found in a large set of toponyms collected from both Geonames and Wikipedia in multiple languages. N-gram embeddings are generated using the WORD2VEC Skip-gram model [27]. As a first experiment, we did not use more recent approaches like ELMo or BERT because these embeddings are contextualized on larger textual utterances (e.g., sentences), whereas we do not use much textual context in our approach (only two toponyms, and not entire sentences as in several other NLP studies).
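The vocabulary-to-matrix lookup can be sketched as follows. This is a dependency-free illustration only: in the actual pipeline the row vectors come from a trained Skip-gram model, whereas here the matrix is randomly initialised, and all names are ours:

```python
import random

def build_embedding_matrix(ngram_sequences, dim, seed=0):
    """Assign each distinct n-gram a row index, then build an embedding
    matrix with one row per n-gram (randomly initialised here; a real
    pipeline would fill the rows with trained Skip-gram vectors)."""
    vocab = {}
    for seq in ngram_sequences:
        for ng in seq:
            vocab.setdefault(ng, len(vocab))
    rng = random.Random(seed)
    matrix = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in vocab]
    return vocab, matrix

def encode(seq, vocab, matrix):
    """Replace each known n-gram by its embedding vector."""
    return [matrix[vocab[ng]] for ng in seq if ng in vocab]
```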
Once the input is transformed, the next step consists of predicting the coordinates using the neural network illustrated in Figure 1. This neural network is divided into two parts, one responsible for feature extraction (i.e., a Bidirectional LSTM), and the second responsible for predicting coordinates using the extracted features. Bi-LSTM or Bidirectional LSTM networks are well known for their efficiency in extracting features from sequential data. The LSTM cell is defined by the following equations:

f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where x_t ∈ R^d is the input vector to the LSTM unit; f_t ∈ R^h is the forget gate's activation vector; i_t ∈ R^h is the input/update gate's activation vector; o_t ∈ R^h is the output gate's activation vector; h_t ∈ R^h is the hidden state vector, also known as the output vector of the LSTM unit; c̃_t ∈ R^h is the cell input activation vector; c_t ∈ R^h is the cell state vector; and W ∈ R^{h×d}, U ∈ R^{h×h} and b ∈ R^h are weight matrices and bias vector parameters learned during training. Bi-LSTMs are frequently used in NLP for named entity recognition [26,28] or for producing contextual word-embeddings [25]. Once the features are extracted by the Bi-LSTM, we use two multi-layer perceptrons, one for predicting each coordinate (latitude and longitude). Each one is composed of two layers of 500 neurons with a ReLU activation function [29]. The two output layers, one for each coordinate, are finally associated with a sigmoid activation function.
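The standard LSTM cell equations can be checked with a dependency-free implementation of a single time step (a didactic sketch over plain Python lists, not the actual Bi-LSTM used in the model):

```python
import math

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the parameters of the four gates
    keyed "f" (forget), "i" (input), "o" (output) and "c" (cell input):
    each W[g] is h x d, each U[g] is h x h, each b[g] has length h."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    def gate(g, act):
        # act(W_g x + U_g h_prev + b_g), computed component-wise
        return [act(sum(W[g][j][k] * x[k] for k in range(len(x)))
                    + sum(U[g][j][k] * h_prev[k] for k in range(len(h_prev)))
                    + b[g][j])
                for j in range(len(h_prev))]
    f = gate("f", sigmoid)
    i = gate("i", sigmoid)
    o = gate("o", sigmoid)
    c_tilde = gate("c", math.tanh)
    c = [f[j] * c_prev[j] + i[j] * c_tilde[j] for j in range(len(c_prev))]
    h = [o[j] * math.tanh(c[j]) for j in range(len(c))]
    return h, c
```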
Since the network output corresponds to latitude-longitude coordinates, we use the great-circle distance as the loss function. This function is defined by the following (haversine) formula:

d(p, p′) = 2r · arcsin( √( sin²((δ′ − δ)/2) + cos(δ) · cos(δ′) · sin²((λ′ − λ)/2) ) )

where δ and δ′ are the latitudes and λ and λ′ are the longitudes of the two points p and p′, and r is the Earth radius. All coordinate (δ, δ′, λ, λ′) values are normalised between 0 and 1 and converted to radians before computing the distance. Finally, the output coordinates (latitude and longitude) between 0 and 1 are re-projected to WGS84 in the final step.
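The great-circle (haversine) distance used as the loss can be sketched as follows, here over coordinates in decimal degrees rather than the normalised values used during training:

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points
    given in decimal degrees."""
    d_lat = math.radians(lat2 - lat1)
    d_lon = math.radians(lon2 - lon1)
    a = (math.sin(d_lat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(d_lon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

For example, the distance between Paris (48.8566, 2.3522) and London (51.5074, -0.1278) comes out at roughly 344 km.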

Generating Pairs of Toponyms for Training
In order for the neural network to predict the closest latitude-longitude coordinates, we train the network with specific input data. In particular, our model uses toponyms that appear in the same context to perform geocoding. In this section, we present the different contexts in which pairs of toponyms are generated, or extracted from well-known sources (Wikipedia and Geonames), to build the training datasets.

Textual Context
Places that are geographically close tend to appear together in the same text. For instance, imagine we want to geocode the pair Paris and Texas, with Paris the toponym to geocode and Texas the context toponym. In this example, Paris can be associated with Paris, TX due to their proximity. However, most of the time, capital cities or important cities like Paris, FR are chosen for all occurrences of the Paris toponym. Therefore, we propose to build our first training dataset with pairs coming from toponym relationships extracted from texts.
To learn from place co-occurrences in a textual context, we decided to use Wikipedia pages of places. Particularly, we use interlinks between Wikipedia pages. For instance, the Wikipedia page for Paris contains links to other pages of places such as Versailles or Tour Eiffel. Therefore, the following pairs will be generated: {(Paris, Versailles); (Paris, Tour Eiffel)}.
To do this, we designed the process illustrated in Figure 2. First, for identifying pages of places, we use the Wikidata (https://www.wikidata.org/ (accessed on 28 November 2021)) dataset. Wikidata is a knowledge base where each entity is characterised by statements. Each statement is represented as a subject-property-value triplet, e.g., <subject>Barack Obama</subject> <property>is born</property> on the <value>4th of August in 1961</value>. The process starts by filtering places from Wikidata. To do that, we select Wikidata entries based on the presence of the P625 property, which is used to associate latitude-longitude coordinates with an entry. Then, using the existing mapping between Wikidata and Wikipedia [30], we recover the content of the place pages and extract the interlinks used to generate the toponym pairs.
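The filtering and pair-generation steps can be sketched as follows, assuming simplified Wikidata-style records (the record layout and the function names are ours for illustration, not the actual Wikidata dump format):

```python
def filter_places(entities):
    """Keep only entities carrying the P625 (coordinate location)
    property, i.e., entities that correspond to geo-located places.
    `entities` is an iterable of simplified records of the form
    {"id": ..., "claims": {property_id: value, ...}}."""
    return [e for e in entities if "P625" in e.get("claims", {})]

def interlink_pairs(page_title, linked_place_titles):
    """Generate (toponym, context toponym) training pairs from the
    interlinks found on one Wikipedia place page."""
    return [(page_title, other) for other in linked_place_titles]
```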

Spatial Context
If two toponyms, Paris and Lyon, appear in the same context, we can assume that Paris refers to Paris (France). Conversely, if Lyon is replaced by Dallas, then the most likely answer would be Paris (Texas). In this example, we geocode Paris based on the proximity between the two places. Therefore, to complement the co-occurrence information from textual data, we propose to augment our training dataset with pairs of toponyms built from two spatial relationships, namely inclusion and proximity. An inclusion relation means that one place is contained in another one, e.g., Paris → France. We define the proximity relationship as the co-location of two places within a defined radius. To extract such relationships, we base our extraction procedure on the Geonames dataset, which includes official toponyms and centroid coordinates for each place. Concerning inclusion relationships, we use the Geonames hierarchy dataset (available at http://download.geonames.org/export/dump/ (accessed on 28 November 2021)), which directly states inclusion relationships between places. As for proximity relationships, we use a simple approach that avoids heavy computation. We use the hierarchical projection method named Healpix [31] to associate latitude-longitude coordinates with a cell index. The cell area over the globe is defined by a parameter nside, which was set to 256 by default. All places within a cell are considered adjacent.
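The cell-based adjacency idea can be sketched with a simplified equal-angle grid. This is a deliberate stand-in for Healpix (which produces equal-area cells over the sphere, unlike this equirectangular grid); the bucketing principle is the same, but this is not the Healpix indexing itself:

```python
import math
from collections import defaultdict

def cell_index(lat, lon, cell_deg=1.0):
    """Bucket a latitude-longitude point into an equal-angle grid cell
    (a simplified stand-in for a Healpix cell index)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def proximity_pairs(places, cell_deg=1.0):
    """Treat all places within the same cell as adjacent, and emit
    (toponym, context toponym) pairs from each cell."""
    cells = defaultdict(list)
    for name, lat, lon in places:
        cells[cell_index(lat, lon, cell_deg)].append(name)
    pairs = []
    for names in cells.values():
        pairs += [(a, b) for a in names for b in names if a != b]
    return pairs
```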

Sampling
In the case of proximity pairs and co-occurrence pairs, collecting all available combinations within a cell can overload the training dataset. Therefore, we establish a sampling strategy for co-occurrence and proximity pairs. Concerning the proximity pairs, for each place p_i in Geonames, the sampling parameter corresponds to the number of places randomly selected in the same area as p_i. Then, each selected place is associated with p_i to form a pair. For co-occurrence pairs, each place is associated with co-occurrent place names found in Wikipedia (see Section 3.2.1). For each place, we sample k co-occurrent place names; each selected co-occurrent place name is then associated with p_i to form a pair. In our experiments (see Section 4), we compare models trained with datasets generated using different sampling values, set to 4 and 50.
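The sampling strategy can be sketched as follows (function and parameter names are ours):

```python
import random

def sample_pairs(anchor, candidates, k, seed=None):
    """Draw at most k context toponyms for one anchor place p_i and pair
    each of them with the anchor. Note that duplicated candidates can
    yield duplicate pairs, mirroring the limitation discussed in the
    Scalability paragraph of Section 5."""
    rng = random.Random(seed)
    chosen = rng.sample(candidates, min(k, len(candidates)))
    return [(anchor, c) for c in chosen]
```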

Training/Validation Dataset Generation
Based on the extracted pairs of toponyms, we built different datasets combining different contexts of extraction. For example, we build one dataset that contains only co-occurrences, one that contains co-occurrences and proximity, and so on. Once the pairs from the different contexts are gathered, we need to split the produced datasets into training and test toponym pairs. In order to keep geographic consistency, our stratified splitting strategy is to concatenate different random splits executed on different subdivisions of the area of interest. To obtain cells with equal area, we use the Healpix [31] grid system. Healpix allows us to obtain different cell sizes based on a selected resolution, which we set to 128.
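The stratified split can be sketched as follows, assuming a `cell_of` function that maps the coordinates of a pair's target toponym to a spatial cell index (e.g., a Healpix cell); the helper names are ours:

```python
import random
from collections import defaultdict

def stratified_split(pairs, coords, cell_of, train_ratio=0.8, seed=0):
    """Split toponym pairs into train/test sets per spatial cell, then
    concatenate the per-cell splits, so that every subdivision of the
    area of interest is represented in both sets."""
    rng = random.Random(seed)
    by_cell = defaultdict(list)
    for pair, (lat, lon) in zip(pairs, coords):
        by_cell[cell_of(lat, lon)].append(pair)
    train, test = [], []
    for cell_pairs in by_cell.values():
        rng.shuffle(cell_pairs)
        cut = int(len(cell_pairs) * train_ratio)
        train += cell_pairs[:cut]
        test += cell_pairs[cut:]
    return train, test
```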

Model Evaluation
To evaluate our model, we designed three experiments. First, we evaluate our model on pairs of toponyms built using co-occurrences from Wikipedia. Second, we evaluate the capacity of the model to geocode a Wikipedia page based on its toponyms. Third, we evaluate our model using well-known datasets proposed in the literature (i.e., SpatialML, TR-CONLL, Lake District Corpus and War of the Rebellion).

Datasets
In our experiments, we chose to train our model on different geographic areas: France (FR), United-States (US), Great-Britain (GB), Japan (JP), Argentina (AR) and Nigeria (NG). Table 1 shows the number of pairs for each dataset according to the context (i.e., proximity, inclusion and co-occurrences).

Evaluation Metrics
Since asking for exact coordinates for relatively large places (e.g., cities) is difficult, we measure the average distance and the accuracy of our model given a tolerance threshold value. To do that, we use the accuracy@k metric [22], defined by the following formula, where k is the tolerance variable:

accuracy@k = (1/N) · Σ_{i=1}^{N} 1[dist(y_i, ỹ_i) ≤ k]

where the dist(x, y) function corresponds to the haversine distance between points x and y; y_i corresponds to the coordinates predicted by the model, and ỹ_i corresponds to the true coordinates. Based on the literature [32], results in the following experiments are given with k = 161 km. As an extension of accuracy@k, we also compute the Area Under the Curve (AUC) for accuracy@k from 0 km to 1000 km [33]. This method gives a more precise overview of the performance of the models than a single score.
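Both metrics can be sketched as follows (haversine distance, accuracy@k, and a trapezoidal AUC over thresholds from 0 to 1000 km; the exact discretisation used for the AUC in the paper may differ):

```python
import math

def haversine_km(p, q):
    """Haversine distance (km) between two (lat, lon) points in degrees."""
    (lat1, lon1), (lat2, lon2) = p, q
    d_lat, d_lon = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = (math.sin(d_lat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(d_lon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def accuracy_at_k(preds, truths, k=161.0):
    """Fraction of predictions within k km of the true location."""
    hits = sum(haversine_km(y, t) <= k for y, t in zip(preds, truths))
    return hits / len(preds)

def auc_accuracy(preds, truths, max_km=1000.0, step_km=10.0):
    """Area under the accuracy@k curve from 0 to max_km, normalised to
    [0, 1] via the trapezoidal rule over evenly spaced thresholds."""
    ks = [i * step_km for i in range(int(max_km / step_km) + 1)]
    accs = [accuracy_at_k(preds, truths, k) for k in ks]
    area = sum((accs[i] + accs[i + 1]) / 2 * step_km
               for i in range(len(ks) - 1))
    return area / max_km
```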

Results on Pairs of Toponyms
In a first experiment, we evaluate the geocoding accuracy of the model on pairs of toponyms. In order to replicate real-world requests on the model, we use pairs extracted from co-occurrences of places in Wikipedia. Figures 3 and 4 show the results obtained considering different sampling strategies. Figure 5 shows the accuracy@k curve for each geographical scope, sampling, and dataset combination. We observe that models trained with pairs from the proximity-only dataset obtain the lowest accuracy. Furthermore, as shown in Figure 5, to obtain a high accuracy, the threshold value k needs to be high compared to the other models. Focusing on the results obtained with a lower sampling (i.e., 4), our model shows high accuracies except for the US (Figure 3). Furthermore, models trained with only co-occurrences achieve the highest accuracies for some countries such as France (0.91), Great Britain (0.96), Japan (0.88), and the US (0.67). However, for countries like Argentina and Nigeria, co-occurrences are not enough and, for those, the addition of pairs from proximity and inclusion relationships increases the accuracy of the model. For instance, there is a 19% difference between the CP (co-occurrences + proximity) and C (co-occurrence only) models for Nigeria. We observe that pairs from proximity relationships increase the accuracy of some models, mostly for countries with less data. Table 1 highlights the difference between the number of co-occurrence pairs for France (376,088) and Nigeria (5638). The same observation can be made on the evolution of the loss value in Figure 6, where values for the co-occurrence-only model are higher than those for the other models combining proximity and co-occurrences.
Figure 4 shows the results where models have been trained with pairs generated with a higher sampling. Most observations that were made with a lower sampling still apply. In terms of accuracy@k, co-occurrence pairs still give the best models. In addition, the increase of the number of pairs used in training improves the model accuracy (France 0.91 → 0.94, Argentina 0.58 → 0.77, Nigeria 0.78 → 0.87, Japan 0.88 → 0.94, Great-Britain 0.96 → 0.98).

Geocoding Wikipages
In the per-pair experiments, we evaluated the model in its ability to correctly geocode a toponym based on one pair (using another toponym as context). In real-world data, the number of context toponyms most of the time exceeds one. Thus, in this experiment, we evaluate the accuracy for geocoding a place using all the toponyms that appear in its Wikipedia page. To do that, we propose to use our model with the following simple heuristic: we predict the coordinates of every possible pair for a toponym t_i ∈ T and the rest C = {(t_i, t_m) | t_m ∈ T − {t_i}} appearing in the same context (i.e., the content of the Wikipedia page). Once the coordinates are recovered, we assign the coordinates c_{t_i} as the centroid of the pairwise predictions:

c_{t_i} = (1/|C|) · Σ_{p ∈ coords(C)} p

where coords(C) corresponds to the coordinates returned by the model for each pair of C. Figures 7 and 8 show the results obtained with models trained with different sampling parameters. Figure 9 shows the accuracy@k curve for each geographical scope, sampling strategy, and dataset combination. Following the same trends as in the per-pair geocoding experiments, with a sampling threshold value of 4, co-occurrence-only models obtain a high accuracy except for Nigeria and Argentina, where the addition of pairs from proximity and inclusion relationships improves the model accuracy. Unlike models trained with datasets with a lower sampling, a higher sampling causes the co-occurrence-only model to obtain the highest accuracy for every country, even Argentina and Nigeria. In a similar way, proximity-only trained models mostly obtain the lowest accuracy. As illustrated in Figure 9, these models only obtain high accuracies with a high value for k. Finally, there is no significant improvement in the highest accuracy between models trained with datasets with a different sampling.
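Assuming the aggregation is a simple centroid (mean) of the pairwise predictions, which is our reading of the heuristic, it can be sketched as:

```python
def aggregate_predictions(pair_predictions):
    """Combine the (lat, lon) coordinates predicted for every pair
    (t_i, t_m) into a single location for t_i. A plain centroid (mean
    of latitudes and longitudes) is used here as an illustrative
    assumption; note it behaves poorly for toponym sets straddling the
    antimeridian."""
    lats = [lat for lat, lon in pair_predictions]
    lons = [lon for lat, lon in pair_predictions]
    return (sum(lats) / len(lats), sum(lons) / len(lons))
```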


Geocoding Results with Standard Corpora
In order to compare our model with other geocoding approaches, we also evaluate our model on geocoding datasets used in the literature. Here, we use SpatialML [34], TR-CONLL [35], the Lake District Corpus [36], and the War Of The Rebellion corpus [37].
Since our models are trained on pairs of toponyms from specific countries, the SpatialML and TR-CONLL datasets were divided by toponym country membership. Furthermore, there are no places located in Argentina or the United-States in SpatialML. Results are shown in Table 2. For the SpatialML dataset, we obtain high accuracies with places in Great-Britain but poor accuracies with toponyms from other countries. Concerning TR-CONLL, we obtain accurate predictions for toponyms from Argentina, Nigeria and Great Britain. For the Lake District Corpus, we also obtain an overall good accuracy. Finally, we obtain a very low accuracy for the War Of The Rebellion corpus, as expected, since our US model obtains the lowest scores. The fact that our models are trained on contemporary toponyms can also explain why we obtain lower accuracies. As in the previous experiments, results obtained with the model trained on the United-States are weaker than those for the other countries. In comparison, other methods manage to obtain a 93% accuracy [37]. We investigate the reason for such low accuracies for the US in Section 5.4.

Discussion
This section presents a summary discussion of the results.

Scalability
Neural network training can be time-consuming. Therefore, in this preliminary work, we trained different models on specific and controlled geographical areas. We only addressed the scalability issue by sampling pairs of toponyms generated with co-occurrences and proximity relationships. Concerning proximity, we decided to draw random pairs of toponyms in a specific region. As for co-occurrences, we decided to limit the number of pairs of toponyms extracted per article through a random selection process. Two issues arise from these choices. First, the proximity relation process is oversimplified, and a better extraction process could make proximity contribute more significantly to model performance. Second, both selection processes allow duplicates, which may reduce the number of distinct pairs of toponyms.

Selection of Model Parameters
To analyse the impact of different parameter values, we compare the accuracy obtained by the France model when changing the n-gram size, the n-gram generation process, and the number of LSTM sub-networks. Concerning the n-gram generation process, we compare alternatives that split toponyms at the word level or by using the WordPiece algorithm used by BERT [38], which tokenizes a word into specific character n-grams known for their high occurrence in text. Table 3 shows the results obtained by changing these different parameters. First, the results obtained with one LSTM sub-network correspond to a higher accuracy than those of a model with two LSTM sub-networks. Second, we observe that increasing the size of the n-grams improves the model accuracy up to 5-grams. Finally, the use of n-grams at the word level or generated with WordPiece leads to worse accuracy.

Impact of Sampling
As the total number of toponym pairs for a country can be very high, we sample from all available pairs. Figure 10 shows the positive impact of a larger sampling for model training. For France and the US, we only consider a sampling with k = 10 because of memory limits, and the impact is thus limited. The impact is also limited for Great Britain, with only a small increase. For countries with less data, especially with fewer co-occurrences found on Wikipedia pages, increasing the sampling has a strong effect.

Why Does It Not Work for the US?
Our model trained with toponyms from different countries performs well except for the United States. In order to investigate this issue, we first produce prediction maps like the ones in Figure 11a,b. The aim is to reveal regions where our model performed well and where it did not. For producing these maps, we use the "per pair" experimental results, and we associate each pair and its predicted coordinates with a hexagonal cell using the H3 grid system. Finally, we compute, for each cell, the percentage of predictions that were correctly made (with a distance less than 161 km). Four countries covering the different types of results were selected: France, the United States, Argentina and Japan. We observe in Figure 11a,d that predictions for pairs over France and Japan are mainly accurate, except for small areas (in the south-west of France or the north of Japan). Places in these areas are sparse, which may explain the results. Figure 11b shows that for the US many areas are badly predicted by our model, mainly covering the West and South regions. As in France or Japan, places in these areas are more sparse. This is different from the North-East, where the model performed well around major US cities like New-York, Philadelphia, or Chicago. Therefore, one possible reason for the worse performance lies in place sparsity (at least in the dataset used in our tests). This is something we can also highlight for Argentina, as illustrated in Figure 11c.
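The per-cell accuracy maps can be sketched as follows, using a simplified rectangular grid in place of the H3 hexagonal grid to keep the sketch dependency-free (all names are ours):

```python
import math
from collections import defaultdict

def cell_of(lat, lon, cell_deg=2.0):
    """Simplified equal-angle cell index (the paper uses the H3
    hexagonal grid; a rectangular grid serves the same bucketing role)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def per_cell_accuracy(records, threshold_km=161.0):
    """Percentage of correct predictions per cell. `records` is an
    iterable of (true_lat, true_lon, error_km) tuples, where error_km
    is the distance between the predicted and true coordinates."""
    hits, totals = defaultdict(int), defaultdict(int)
    for lat, lon, err in records:
        c = cell_of(lat, lon)
        totals[c] += 1
        hits[c] += err <= threshold_km
    return {c: 100.0 * hits[c] / totals[c] for c in totals}
```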

Another lead lies in the referent ambiguity of toponyms for the selected countries. To quantify the ambiguity of toponyms, we computed the average number of places per toponym for each country. Results are shown in Table 4. If we compare France and the US, we can see that US toponyms belong to more places, but the difference does not seem to be very significant; we will investigate this further in future work.
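This ambiguity statistic can be sketched as follows; a minimal version assuming the gazetteer is a list of (toponym, place identifier) rows, e.g., extracted from GeoNames (the input layout is our assumption, not the paper's actual processing):

```python
from collections import Counter

def average_ambiguity(gazetteer_entries):
    # gazetteer_entries: iterable of (toponym, place_id) rows.
    # Returns the mean number of distinct places sharing one toponym.
    places_per_name = Counter()
    seen = set()
    for name, place_id in gazetteer_entries:
        if (name, place_id) not in seen:
            seen.add((name, place_id))
            places_per_name[name] += 1
    return sum(places_per_name.values()) / len(places_per_name)
```

For example, a gazetteer where "Sofia" maps to two places and "Paris" to one yields an average ambiguity of 1.5.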

Conclusions
In this article, we described an approach for geocoding toponyms using deep learning and character n-gram sequences. Our architecture is based on a neural network that uses LSTM cells to extract features from the character n-gram sequence. Our model requires two toponyms as input and returns latitude-longitude coordinates as output. The first toponym is the one to be geocoded, and the second is used as context to help the model resolve any referent ambiguity. We trained our model on six geographical areas and conducted three types of experiments for evaluation. The first evaluates the model's efficiency for pairs of toponyms. The second evaluates the model's efficiency for geocoding toponyms based on their Wikipedia pages (using multiple pairs and a straightforward heuristic). The third evaluates our model for geocoding standard datasets used in the literature. Results show high accuracy values in the first two experiments, except for the US. They also show that models trained with co-occurrences obtain the highest accuracies in most cases. However, when a country has a lower number of pages in Wikipedia, adding pairs generated from proximity and inclusion relationships increases the efficiency of the model.
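As an illustration of the character n-gram input representation, the following sketch splits a toponym into overlapping n-grams (the boundary markers and n = 3 are our assumptions, not necessarily the exact tokenization used in the released code):

```python
def char_ngrams(toponym, n=3):
    # Split a toponym into overlapping character n-grams, with boundary
    # markers so that prefixes and suffixes (e.g., the "-ac" ending common
    # in south-west France) appear as distinct grams.
    padded = "<" + toponym.lower() + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

Each gram is then mapped to an embedding, and the resulting sequence is fed to the LSTM encoder.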
For future work, we can consider the training and development of a geocoding process that covers the entire world and not just one country. A second possibility is to evaluate the potential of such a model on historical places and to assess the contribution of certain affixes to geocoding.

Figure 1 .
Figure 1.Overview of our proposed deep neural network architecture.

Figure 3 .
Figure 3. Per-pair geocoding accuracy per country and dataset combination with sampling = 4.

Figure 4 .
Figure 4. Per-pair geocoding accuracy per country and dataset combination with sampling = 50.

Figure 8 .
Figure 8. Accuracy for geocoding Wikipedia pages, per country and dataset combination with sampling = 50.

Figure 7 .
Figure 7. Accuracy for geocoding Wikipedia pages, per country and dataset combination with sampling = 4.

Figure 11 .
Figure 11. Prediction maps for the results of our model: (a) France; (b) the United States; (c) Argentina; (d) Japan.

Table 1 .
Size of the dataset (number of pairs of toponyms contained in both the train and test sets) used for model training.

Table 2 .
Results obtained on state-of-the-art corpora.
