Article
Peer-Review Record

A Novel Deep Learning Approach Using Contextual Embeddings for Toponym Resolution

ISPRS Int. J. Geo-Inf. 2022, 11(1), 28; https://doi.org/10.3390/ijgi11010028
by Ana Bárbara Cardoso 1, Bruno Martins 1 and Jacinto Estima 2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 12 November 2021 / Revised: 21 December 2021 / Accepted: 27 December 2021 / Published: 31 December 2021

Round 1

Reviewer 1 Report

This article proposes an approach for toponym resolution based on deep neural networks. The proposed approach aims at directly predicting coordinates for each named spatial feature detected in a text, without using a gazetteer. It is based on a neural network architecture using pre-trained contextual word embeddings and bidirectional LSTM units, which takes multiple inputs for each spatial entity to be resolved. This architecture is used to predict a probability distribution over geo-spatial regions in a hierarchical spatial grid. This result is then combined with the centroid coordinates of the cells of the geo-spatial grid and a second loss function to predict geographic coordinates. Two further tests are performed: combining the intermediate representation on the spatial grid produced by the neural architecture with geophysical information, and adding new data from Wikipedia to train the model.
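
For readers less familiar with this kind of grid-based formulation, the following is a minimal sketch (not the authors' actual implementation) of how a predicted probability distribution over grid cells can be combined with the cell centroid coordinates to produce a single coordinate estimate; the linear averaging of latitude/longitude shown here ignores wrap-around effects and is purely illustrative.

```python
import numpy as np

def expected_coordinates(cell_probs: np.ndarray, cell_centroids: np.ndarray) -> np.ndarray:
    """Combine a softmax distribution over grid cells with the cell centroids.

    cell_probs: shape (num_cells,), probabilities summing to one.
    cell_centroids: shape (num_cells, 2), latitude/longitude of each cell centre.
    Returns the probability-weighted average of the centroids, shape (2,).
    """
    return cell_probs @ cell_centroids

# Toy example with three hypothetical cells centred near Lisbon, Madrid and Paris.
probs = np.array([0.7, 0.2, 0.1])
centroids = np.array([[38.7, -9.1], [40.4, -3.7], [48.9, 2.3]])
print(expected_coordinates(probs, centroids))  # a point pulled towards Lisbon
```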

 

The subject of place name resolution has already been the focus of many research works. However, none of the proposed approaches has been completely satisfactory to date. It is therefore still a research question of great interest, which is relevant for the IJGI journal.

 

The article is well written and easy to read. The state of the art covers the main approaches in the field, but could be more complete. Indeed, the section on unsupervised approaches presents only three works, whereas there are many more (see for example [3], [4], [5], [6]). Similarly, the section dedicated to deep learning approaches presents only two works. However, other approaches such as [1] or [2] have also been proposed, and it would be relevant to present them and compare them to the approach presented in this article.

 

The proposed approach exploits not only place names but also their surrounding text, and uses spatial grid cells to locate each place name, then refines this location and returns coordinates. In parallel, the cell prediction results are combined with geographical data describing the geophysical properties of each cell, and a loss function is used to predict the geophysical properties of the place corresponding to each place name. This approach is definitely original, and the results obtained seem promising, at least for the prediction of coordinates, which seems to give better results than the other approaches tested on the same corpora. However, for the sake of completeness, it would have been interesting to include more approaches in this comparison, such as those presented in [1] and [2].

 

Two ways of improving this base model are proposed and tested: the addition of geophysical knowledge and the addition of training data from Wikipedia. The prediction of geophysical properties associated with place names produces less convincing results than the prediction of coordinates, but the addition of this knowledge seems to have a positive effect on the toponym resolution results, at least for the WOTR and SpatialML corpora, and the authors propose ways to improve that part: this seems to be an interesting avenue to explore. Similarly, the addition of training data extracted from Wikipedia seems to have a minimal effect: here again, avenues of improvement are suggested, but not tested. This greatly diminishes the potential impact of this article, as these are the two main avenues of improvement explored, and neither provides any real evidence of success. It would be useful to test the avenues of improvement mentioned, in order to clearly establish the usefulness of geophysical properties and additional training data.

 

Finally, although the implementation of the approach, the corpora and the tests carried out are described in detail, the code used for the tests is not provided with the article: this limits the possibilities of comparing this approach with other proposals on new corpora. 



[1] Xu, C., Li, J., Luo, X., Pei, J., Li, C. and Ji, D., 2019, May. DLocRL: A deep learning pipeline for fine-grained location recognition and linking in tweets. In The World Wide Web Conference (pp. 3391-3397).

[2] Yan, Z., Yang, C., Hu, L., Zhao, J., Jiang, L. and Gong, J., 2021. The Integration of Linguistic and Geospatial Features Using Global Context Embedding for Automated Text Geocoding. ISPRS International Journal of Geo-Information, 10(9), p.572.

[3] Bensalem, I., & Kholladi, M. K. (2010). Toponym disambiguation by arborescent relationships. Journal of Computer Science, 6(6), 653. 

[4] Blank, D. and Henrich, A., 2015, November. Geocoding place names from historic route descriptions. In Proceedings of the 9th Workshop on Geographic Information Retrieval (p. 9). ACM. 

[5] Brando, C., Frontini, F., & Ganascia, J. G. (2015). Disambiguation of named entities in cultural heritage texts using linked data sets. In New Trends in Databases and Information Systems (pp. 505-514). Springer International Publishing.

[6] Moncla, L. (2015). Automatic reconstruction of itineraries from descriptive texts (Doctoral dissertation, Université de Pau et des Pays de l'Adour; Universidad de Zaragoza).

 

Minor comments and typos:

  • Delete one “the”, line 407: “and the the number of recursive”
  • Line 536: “the surface pf the earth” → “the surface of the earth”
  • In table 3, all median distance values are the same for the SpatialML corpus: either there is a mistake in the reported values or a short explanation is needed to clarify these results.

Author Response

Many thanks for the comments and insightful suggestions.

The main response letter sent to the editor summarises the changes that were made in the manuscript in connection with the comments from the different reviewers. Specifically, regarding your suggestions:

* Following your suggestions, the description of previous related work was complemented with additional studies on unsupervised approaches and on deep learning approaches. However, we were unable to compare these studies against the results obtained with the proposed method, given that those studies used a different evaluation methodology and datasets that are not publicly available. While adding these descriptions, we also tried not to make the discussion of related work even longer, given that Reviewers 3 and 4 pointed out that the discussion of previous work should ideally be reduced.

* We now provide a link to the source code supporting the tests. For the datasets that are available online, we also included them in our GitHub repository and provide the links to the original sources as footnotes in the paper. Still, not all 3 datasets are public (e.g., SpatialML is only available through the LDC, for its subscribers). The SpatialML dataset augmented with Wikipedia articles was not made directly available, since that would require redistributing SpatialML together with our augmentations.

* The median values obtained for the SpatialML corpus, as reported in Table 3 for the different models, are indeed all equal to 9.08 when using a precision of two decimal digits (only slight differences would be seen when considering three digits for the fractional part of the number).

Reviewer 2 Report

The focus of this research is relevant to the community in terms of GeoAI. Considering the limits of neural networks (NNs) in NLP (e.g., the heavy reliance on massive well-labeled data), this research seems interesting. In particular, spatial context is used here, which has rarely been used by non-spatial AI researchers. However, some modifications and improvements are still required to help the readership fully understand the content and highlights of the proposed method.

  1. The introduction of the LSTM in Subsection 3.1 could be modified. What is the pretrained LSTM model? How do the authors fine-tune the pretrained model (e.g., on which benchmark dataset)?
  2. According to Subsection 3.3 and Figure 2, the HEALPix and LSTM components are separate parts. It seems inappropriate to name the proposed model shown in Subsection 3.3 a neural network.
  3. I don't think it is valid to call the method shown in Figure 2 a neural network architecture. The workflow shown in the bottom row is not multi-layer processing. The authors might want to call the proposed method a ***-enhanced neural network.
  4. In Subsection 4.1, BERT and ELMo are the classical NN models for NLP; I guess the Wikipedia models are created by fine-tuning BERT and ELMo with the Wikipedia corpora, and the models integrating geophysical properties are created by fine-tuning BERT and ELMo with the Wikipedia corpora and the geophysical properties. The authors might consider adding a table to compare the differences between these four models.
  5. In Table 2: 1) does "our neural model" refer to the "models integrating geophysical properties"? 2) why are the experimental results obtained with the base BERT models missing?
  6. In Table 3, BERT+Wikipedia and BERT alone seem to outperform BERT+Geophysical; the authors might want to explain this. Generally speaking, we believe that the spatial aspect is critical in place name recognition; however, the results shown in Table 3 seem unexpected.

 

Author Response

Many thanks for the comments and insightful suggestions.

The main response letter sent to the editor summarises the changes that were made in the manuscript in connection with the comments from the different reviewers. Specifically, regarding your suggestions:

* Following your suggestions, we tried to change the description of the LSTM model in Subsection 3.1, and also the descriptions in Subsection 3.3. We are now also naming the process featured in Figure 2 "the workflow for toponym resolution," in which a neural network is the main component. It should be noted that the "LSTM model" is not pre-trained (unless the reviewer is referring to the ELMo embeddings, obtained with a model using LSTMs, which are indeed pre-trained); only the BERT/ELMo models used for generating embeddings are pre-trained. The HEALPix scheme is just the approach used to transform latitude/longitude into a class number (so that a classification objective can be used), applied as a pre-processing step, whereas the "neural network" using LSTMs as components tries to predict HEALPix classes and, from that, the coordinates.
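
As a concrete illustration of this pre-processing step only, the snippet below uses the healpy library to convert coordinates into a HEALPix cell index and back to the cell centre; the resolution parameter is a placeholder and not necessarily the setting used in the paper.

```python
import healpy as hp

nside = 256                    # hypothetical grid resolution; the paper's setting may differ
lat, lon = 38.7223, -9.1393    # example coordinates (Lisbon)

# Latitude/longitude -> HEALPix cell index, usable as a classification target.
cell = hp.ang2pix(nside, lon, lat, lonlat=True)

# Cell index -> coordinates of the cell centre, used when mapping a predicted
# class back into geographic space.
lon_c, lat_c = hp.pix2ang(nside, cell, lonlat=True)
print(cell, lat_c, lon_c)
```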

* According to a suggestion from Reviewer 4, the content featured in the conclusions section was broken down into two different parts, first giving a general discussion (in Subsection 4.3), and then ending the paper with a smaller discussion on conclusions and future work. The general discussion in Subsection 4.3 covers computational efficiency (raised by Reviewer 4), and it also addresses the somewhat surprising results presented in Table 3, which show that the simpler "BERT" model can outperform "BERT+Geophysical" (as you pointed out).

Reviewer 3 Report

The paper introduces a new method for toponym resolution in English texts that uses LSTM as a model and WE/BERT embeddings for data representation.

General comments: 

p.2, method description: please state the task type here, i.e. binary/multiclass classification/regression.

Related work: it is well written, but might be too long. I am unfamiliar with the journal's guidelines on related work -- is this length acceptable to the journal?

section 2.5: corpora statistics are missing, i.e. how many articles per corpus, max/min/avg article length, # of toponyms, avg # of toponyms per article, etc. Can be a table.

 

Detailed comments:

section 3.1: no need to describe what an RNN is, this is not a textbook.

section 3.2: same issue, WE can be described in a paragraph. The authors do need to state which WE they use for their model, what they are trained on, and what vector length is used. The same goes for BERT vectors.

l.391: How was the number 50 chosen? What happens if 40 or 60 is used?

l.419: 'either ELMo or BERT' - does it mean that you used both and compared their performance? If you used just one of them, please say which one and avoid mentioning the other.

l.454: Given that this is categorical cross-entropy, your task needs to be declared as multiclass classification from the start.

l.482: is this dataset publicly available? Please supply the link.

p.13, Table 1: perhaps this table needs to be referenced earlier, in the data description part.

p.14, Table 2: your neural model needs to be well defined. Earlier, several options are mentioned (BERT, ELMo, etc.), but it is not clear which one this is. If you experimented with several setups, please report all of them and mark the best one. The line explaining it (l.544) comes after the table, which is confusing.

Formatting:

1) header says: submitted to Journal Not Specified, please fix that

Author Response

Many thanks for the comments and insightful suggestions.

The main response letter sent to the editor summarises the changes that were made in the manuscript in connection with the comments from the different reviewers. Specifically, regarding your suggestions:

* You noted that Section 3.1 did not need to describe RNNs in detail. However, noting that many IJGI readers may not be familiar with the types of models used in NLP, we decided to keep the formal presentation of RNNs, instead just trying to slightly reduce and simplify it. The same goes for the explanations of word embeddings and BERT, for which we only made some slight adjustments (e.g., removing the discussion on word2vec, which is not important for understanding the proposed method).

* You pointed to a need for clarifying the type of task (binary/multi-class and classification/regression) right from the beginning. We added a description to the introduction, but it should be stated that we are actually using a combination of multi-class classification (i.e., predicting HEALPix classes) and regression (i.e., predicting coordinates, by optimizing the great circle distance as a loss) objectives, within the same model.
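
Purely for illustration, a minimal NumPy sketch of such a combined objective is shown below: a cross-entropy term over the HEALPix class plus a great-circle (haversine) distance term over the predicted coordinates, mixed with a hypothetical weight alpha. This is a simplification of the setup described in the paper, not its actual loss implementation.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def great_circle_distance(lat1, lon1, lat2, lon2):
    """Haversine approximation of the great-circle distance, in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def combined_loss(true_cell, cell_probs, true_coords, pred_coords, alpha=0.01):
    """Cross-entropy over the HEALPix class plus a weighted distance term."""
    cross_entropy = -np.log(cell_probs[true_cell] + 1e-9)
    distance = great_circle_distance(true_coords[0], true_coords[1],
                                     pred_coords[0], pred_coords[1])
    return cross_entropy + alpha * distance

# Toy example: the correct cell gets most of the probability mass, and the
# predicted coordinates are a few hundred kilometres off.
print(combined_loss(true_cell=2,
                    cell_probs=np.array([0.1, 0.2, 0.7]),
                    true_coords=(38.7, -9.1),
                    pred_coords=(40.4, -3.7)))
```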

* The minor typos listed were fixed.

* We have further linked elementary corpus statistics to the discussion of the datasets in Section 2.5, giving there the number of documents in each collection and pointing to Table 1 shown later in the paper.

Reviewer 4 Report

This paper describes the development and validation of a new deep learning architecture for toponym resolution and localization. In general terms, the paper is well written and correctly organized, the experimental design is adequate, the methodology and the proposed system are well designed and scientifically sound, and the experimental results are very promising. Also, I think that this paper could be very interesting for other researchers in this area.

 

I only have one main concern regarding the operation of the proposed system. This is supposed to be an NLP problem, where we have a text and it is analyzed by a computer to extract information. However, from what I understand, in your case the positions of the toponyms in the text are already marked. That is, you don't detect which words are toponyms in the text; they are previously marked by a human. This may be common in previous works (I am not an expert in this domain), but I consider this a deficiency of the proposed approach. It prevents the system from being applied to large volumes of documents, because it would require prior analysis by a human. This should be discussed in the paper.

 

Some other minor comments:

 

  1. In the abstract you present some alternatives that you have tested. It would be interesting to indicate the combinations that worked best. Also, it would be appropriate to show some numbers from the obtained results.

 

  2. The description of the previous works in Section 2 is very interesting. However, it would be more appropriate for a review paper. It is too extensive for a research paper like this.

 

  3. In the equations (1-a)-(1-h), there are some terms that are not defined. For example, what do all the W represent? What is "oe"?

 

  4. Figure 1 appears before it is mentioned in the text. The same happens with Table 2.

 

  5. Where is the footnote corresponding to 4? It is missing.

 

  6. In Table 2, the large difference between the mean and median errors is very curious. The means are 10 times larger than the medians, so I suppose that the cases of large error are also interesting in this research. To represent this information better, I suggest that the authors add cumulative error distribution graphs, so that readers can interpret the obtained results better.

 

  7. There is no information regarding the computational efficiency of the proposed system. That is, what were the training times per model, the average times per document, the times per toponym, the computer used, etc.?

 

  8. The section of conclusions is very long. The conclusions should be clear and concise, highlighting the main findings of the research. I suggest you add a discussion section (similar to what you have in the conclusions), and rewrite the conclusions as a brief set of sentences.

 

  9. In general, the writing is correct. However, you have to completely review the article because there are many small errors, such as:

 

L36: a commonly name used -> a commonly used name / a name commonly used

 

L310, 324: an hyperbolic -> a hyperbolic

 

L538, 540, 551: Kilometers -> kilometers

 

L549, 615: the previous state-of-the-art -> the previous state of the art (you only have to use "state-of-the-art" when it works as an adjective, for example: "this is a state-of-the-art paper")

 

L565: less -> less than

 

  10. Also, you have to adhere better to the format of the journal. For example: don't add blank lines; capitalize the titles of the sections; the first page and the margins are incorrect (I think you have used an old version of the template); affiliations are not complete (where is the Universidade Europeia located? This is an interesting question for a toponym resolver).

Author Response

Many thanks for the comments and insightful suggestions.

The main response letter sent to the editor summarises the changes that were made in the manuscript in connection with the comments from the different reviewers. Specifically, regarding your suggestions:

* You correctly noted that the proposed approach does not deal with the problem of "toponym recognition" (marking the spans of text, within a larger document, that correspond to locations), instead just dealing with toponym disambiguation (associating place names to coordinates, assuming that the spans of text corresponding to place names have already been recognized). We have added a small comment in the introduction to further clarify this, noting that "toponym recognition" is a more standard NLP problem that can be seen as a particular case of named entity recognition. There are nowadays production-ready models, for different domains and languages, that can be used to recognize different types of entities in text (including place names), and thus combining our method with a separate "toponym recognition" approach would be relatively simple.
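
As an illustration only of how such a combination could look, the snippet below uses an off-the-shelf spaCy NER model to mark place name spans that could then be passed to a toponym resolution model; this is a hypothetical sketch, not the pipeline used in the paper.

```python
import spacy

# Off-the-shelf English NER model (requires: python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

text = "The expedition left Lisbon and sailed towards the coast of Brazil."
doc = nlp(text)

# Keep only entity types corresponding to places; the resulting spans (and their
# surrounding context) would be the input to a separate toponym resolution step.
place_spans = [(ent.text, ent.start_char, ent.end_char)
               for ent in doc.ents if ent.label_ in ("GPE", "LOC")]
print(place_spans)
```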

* You also pointed out that the paper does not feature numbers regarding the computational efficiency of the proposed system. We are not presenting these numbers because the experiments were executed on different hardware (i.e., two different computers, although with the same type of GPU) and also with other processes simultaneously using the same machines. We are nonetheless giving some information on the hardware used to execute all the tests, together with a general discussion on the ease of performing these tests with relatively modest hardware.

* The formatting issues pointed out were fixed. We also revised the notation associated with the equations (e.g., in equations (1-a)-(1-h), we now define all the variables).

* According to your suggestion, the content featured in the conclusions section was broken down into two different parts, first giving a general discussion (in Subsection 4.3), and then ending the paper with a smaller discussion on conclusions and future work. The general discussion in Subsection 4.3 covers computational efficiency (which you also raised), and it also addresses the somewhat surprising results presented in Table 3, which show that the simpler "BERT" model can outperform "BERT+Geophysical" (raised by Reviewer 2).
