Peer-Review Record

An Approach to a Linked Corpus Creation for a Literary Heritage Based on the Extraction of Entities from Texts

Appl. Sci. 2024, 14(2), 585; https://doi.org/10.3390/app14020585

by Kenan Kassab and Nikolay Teslya *

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Adrian M.P. Braşoveanu

Submission received: 24 November 2023 / Revised: 3 January 2024 / Accepted: 4 January 2024 / Published: 9 January 2024

(This article belongs to the Special Issue Natural Language Processing: Theory, Methods and Applications)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper presents a proposal for building a linked corpus around the literary heritage of Pushkin’s works, based on the named entity recognition method with analysis of unique entities in different texts. The extraction of semantic entities within literary works is done by use of an optimized multilingual BERT model, used to build a custom NER for Russian texts. The paper provides information regarding the process of filtering and preparing the text dataset for training and evaluating the NER.

Nevertheless, the authors do not seem to state or highlight the methodological devices they considered while developing this research. The paper should present a clear starting question and its connection to the main objective. This main objective should also be broken down into several specific objectives to help the reader follow the thought process behind the research. The paper also lacks a description of the experimental apparatus or place of study or, alternatively, of the theoretical approach and related paradigm from which this research emerges.

Consider improving the explanation of the research design (observational, experimental, etc., or even of a more qualitative sort) and justifying the techniques used to gather, analyze, and present the exposed theory. Some statements regarding the selection of sources and the sampling methods would help, concerning how the data was collected (mainly the references used and other inputs that allowed the authors to express the design principles and the hardware and software transformation schemes). There should also be some insight into the analysis techniques applied to the data, and an indication of how the coding process and categorizations were developed with regard to the hardware and software transformation schemes.

Author Response

We are thankful to the reviewer for the time and valuable comments. We have edited the paper as follows to address your comments:

The reviewer’s comment

Changes in the paper

Answer to the reviewer

We addressed the objectives of our research and the task we are trying to achieve in the “Introduction” section.

(lines 28-31 and 36-42)
Such a structure can be considered as a corpus of linked texts and used in the digital humanities researches. This corpus could provide a more deep understanding between the various literature works, robust visualization of the relationships, and search opportunities with more comprehensive and deep information within the texts. All these materials are connected by mention of writer's literary works, some persons who interacted with the author, places where the author was, dates, and organizations connected with the author and the literary work (Fig.1).

The research presented in this work is aimed to create a linked corpus over the literary heritage of the great Russian poet Alexander Sergeevich Pushkin. Such corpus contains links between heritage parts based on the common entities in them. To extract these entities we propose to use one of the Natural Language Processing approaches related to the search of named entities - Named Entity Recognition (NER), and build a database which will store entities itself, links from entities to texts in corpus that in general will provide links between texts. Since the main language of the A.S. Pushkin heritage is Russian, we additionally address to some language-specific nuances but the whole approach can be adapted to any language and any writer.

Thank you for your feedback.

We mentioned the task and the problems we are trying to solve in the “Introduction” section. We went in detail about describing the problem and also about our proposed solution.

We made some modifications to the “Introduction” and “Conclusion” sections that help clarify our task and what we achieved.
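The linking idea quoted above (texts connected through the entities they share) can be illustrated with a minimal sketch. The text identifiers, entity names, and function names below are hypothetical and are not taken from the paper; the sketch only shows one plausible way to derive text-to-text links from per-text NER output.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical NER output: text id -> set of (entity, label) pairs.
entities_by_text = {
    "letter_1": {("Pushkin", "PERSON"), ("Mikhailovskoye", "LOC")},
    "poem_1": {("Mikhailovskoye", "LOC"), ("Anna Kern", "PERSON")},
    "note_1": {("Anna Kern", "PERSON"), ("Pushkin", "PERSON")},
    "essay_1": {("Odessa", "LOC")},
}


def build_entity_index(entities_by_text):
    """Invert the mapping: entity -> set of texts that mention it."""
    index = defaultdict(set)
    for text_id, entities in entities_by_text.items():
        for entity in entities:
            index[entity].add(text_id)
    return index


def link_texts(index):
    """Link every pair of texts that share at least one entity."""
    links = defaultdict(set)
    for entity, text_ids in index.items():
        for a, b in combinations(sorted(text_ids), 2):
            links[(a, b)].add(entity)
    return dict(links)


index = build_entity_index(entities_by_text)
links = link_texts(index)
for pair, shared in sorted(links.items()):
    print(pair, sorted(shared))
```

In this toy data, `essay_1` shares no entity with any other text and therefore receives no link, which is the behaviour one would expect from an entity-based linked corpus.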

We changed the “Introduction” section to highlight the importance of entity extraction for building linked corpora of literary heritage and to explain why we use the Encyclopedia of A.S. Pushkin as the main source of knowledge (lines 79-82: It provides mentions and descriptions of all works by A.S. Pushkin including letters and notes, persons related to him, places where he had been, and dates of the most important events. Therefore it could be considered as the most full source of the knowledge about A.S. Pushkin heritage).

Thank you for the comment. We agree that the explanation of the research was not sufficiently substantiated and raised some questions. We have edited the “Introduction” section to better justify the research goal and the approach used.

Reviewer 2 Report

Comments and Suggestions for Authors

This innovative work applies Named-Entity Recognition (NER) to Russian literature, focusing in particular on A.S. Pushkin's works. The research is sound and well-motivated, and the work can contribute significantly to the field of literary digitization and language processing. The paper is well-written, but I still have some suggestions:

1. Provide a more comprehensive background on the challenges in NER for Russian literature, including the linguistic and cultural peculiarities that distinguish it from other languages.

2. Explain the technical aspects of the NER model, particularly how it's tailored for Russian texts. This would help in understanding the model's uniqueness and adaptability to other literary works.

3. The paper lacks a more detailed analysis of the model's performance. It would be better to include comparisons with existing SOTA models or benchmarks. Due to time limitations, additional experiments might not be feasible. Therefore, please consider it as future work.

4. Please discuss the broader implications of your work for digital humanities and literary studies, and how it might be applied to other authors or languages.

Comments on the Quality of English Language

Minor editing of English language required

Author Response

We are thankful to the reviewer for the time and valuable comments. We have edited the manuscript as follows to address the comments:

The reviewer’s comment

Changes to the paper

Answer to the reviewer

1. Provide a more comprehensive background on the challenges in NER for Russian literature, including the linguistic and cultural peculiarities that distinguish it from other languages.

We expanded the “Introduction” section. We added explanations for some of the rules and difficulties of building NER for the Russian language.

Lines (64-74)

Building NER for the Russian language is considered more complex than other languages because of the complexity and the unique rules the Russian language has [3], like the morphological complexity and the flexibility of the word order. In the Russian language, about all words like names, adjectives, and verbs change their forms and endings depending on the grammatical situation and context. Also, the flexibility of the word order allows us to write the same sentence with the same words in many different ways where only the word sequence varies. This makes understanding the context crucial to finding the semantic entities. Not to mention that the Russian language and Russian literature have developed and changed a lot through the years which makes building an NER to determine and recognize the entities a challenging and complex task. As a result, there exists a growing demand for NER solutions tailored to Russian linguistic nuances and to their intended domains (like the custom NER we are trying to build for the A.S. Pushkin heritage).

Thank you for the comment.

We added a paragraph explaining in more detail the unique rules of the Russian language that distinguish it from other languages, such as its morphological complexity and flexible word order. These rules make building a NER system for the Russian language challenging.
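A small illustration of the morphological point made above: the same surname appears in different case forms, so exact string matching finds only one of them. The example sentences and helper functions are hypothetical and not part of the authors' pipeline; the prefix match is only a crude stand-in for the contextual modelling that a trained NER provides.

```python
import re

# Case forms of the surname "Пушкин": nominative, genitive, dative, instrumental.
sentences = [
    "Пушкин написал письмо.",
    "Это письмо Пушкина.",
    "Она писала Пушкину.",
    "Они гордились Пушкиным.",
]


def exact_matches(name, sentences):
    """Return sentences that contain the name as a whole word, unchanged."""
    pattern = re.compile(rf"\b{re.escape(name)}\b")
    return [s for s in sentences if pattern.search(s)]


def prefix_matches(stem, sentences):
    """Return sentences that contain any word starting with the stem."""
    pattern = re.compile(rf"\b{re.escape(stem)}\w*")
    return [s for s in sentences if pattern.search(s)]


print(len(exact_matches("Пушкин", sentences)))   # only the nominative form
print(len(prefix_matches("Пушкин", sentences)))  # all four case forms
```

Prefix matching recovers the inflected forms here, but it would also match derived words such as the adjective "Пушкинский", which is one reason rule-based matching alone is a weak substitute for a context-aware model.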

2. Explain the technical aspects of the NER model, particularly how it's tailored for Russian texts. This would help in understanding the model's uniqueness and adaptability to other literary works.

We explained our choice of the NER model for this task in the “Model Training” section.

Lines (309-319)

By using a BERT-based multilingual transformer model we have been able to fine-tune the model to our task. Since these models are multilingual, they may be adjusted to fit the many linguistic contexts found in Pushkin's writings. They can identify words and phrases from Russian and other languages that could be used in Pushkin's writings. These models can be fine-tuned to a specific dataset which is a crucial factor to maximize the performance of our NER. We also enhanced the model performance by providing high-quality annotated entities achieving that through special techniques such as the regular expression annotator for "WORK-OF-ART". By following these steps we were able to train a custom NER that can identify the entities in the Russian heritage of Alexander Pushkin. The training process explained in this study can be generalized to build a similar NER for other literary works while taking into account the uniqueness of each language's literature.

We appreciate your comment.

We optimized a BERT-based multilingual transformer model that had already been trained using pre-learned contextual embeddings inside the SpaCy framework. Using this model, we were able to fine-tune it for our Russian NER task. We also explained how we enhanced the model's performance on our specific literary works by providing high-quality annotated entities, and we described the techniques used for that, such as the regular expression annotator for "WORK-OF-ART".

In our work, we describe in detail the process of building and training the NER for the Russian heritage of Alexander Pushkin. We also explained the data preparation and enhancement steps used to increase the model's performance. This work can therefore be generalized to build a similar NER for other literary works while taking into account the uniqueness of each language's literature.
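The response mentions a regular expression annotator for "WORK-OF-ART" without giving its rules. The sketch below is only an assumption of how such an annotator might look: it supposes that titles are written between guillemets («…»), which the source does not state, and the function name and sample sentence are hypothetical.

```python
import re

# Assumption: work titles are enclosed in guillemets, e.g. «Евгений Онегин».
TITLE_PATTERN = re.compile(r"«([^«»]+)»")


def annotate_works(text):
    """Return (start, end, label) character spans for candidate titles."""
    return [
        (match.start(1), match.end(1), "WORK-OF-ART")
        for match in TITLE_PATTERN.finditer(text)
    ]


text = "Он перечитал роман «Евгений Онегин» и повесть «Капитанская дочка»."
annotations = annotate_works(text)
for start, end, label in annotations:
    print(label, text[start:end])
```

Character-offset spans of this kind are a common input format for annotation tools and NER training data, so the output of such a rule could be merged with manually annotated entities; whether the authors did exactly this is not stated in the record.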

3. It lacks a more detailed analysis of the model's performance. It would be better to include comparisons with existing SOTA models or benchmarks. Due to time limitations, additional experiments might not be feasible. Therefore, please consider it as future work.

We extended the future work described in the “Conclusion” section with a performance evaluation direction (lines 389-391):

Next we are going to provide additional estimation of the created dataset and model trained with it, and comparison of the model with existing models.

Thank you for the comment and we appreciate your feedback.

We will consider it for our future work.

4. Please discuss the broader implications of your work for digital humanities and literary studies, and how it might be applied to other authors or languages.

We added a paragraph to the “Conclusion” section highlighting the impact of our research on digital humanities. It sheds light on how our NER system and the database can help academics and researchers analyze literary works and facilitate their job.

Lines (381-386)

The NER system and the database we build provide significant support for researchers and academics in digital humanities and literature works. It helps them to analyze the literature more efficiently by extracting valuable information from these works. Not to mention reducing the time for examining and studying of these works. Also, the database with the linked entities provides a more deep and comprehensive view of how the multiple literature works are connected through the common entities.

Thank you very much for your comment.

In this paper, we provided a comprehensive explanation of the process of building this NER system. Transferring this knowledge will help others build similar systems in other languages and easily adapt to their purposes. Also, this system could be utilized for other Russian literature works and fine-tuned to match similar tasks in other languages.

Reviewer 3 Report

Comments and Suggestions for Authors

The paper is well-written and well-illustrated. The paper describes a Russian NER system for cultural heritage focused on the works and entities related to A.S. Pushkin.

Overall the paper is interesting, but I think it can be improved on several fronts:

- more references to Russian NER/NEL system would help - for example the top searches on Google Scholar for Russian NER are not even included in the bibliography (e.g., Malykh et al, 2016 or Ta Le et al, 2028 and several others) - this makes me question their state of the art section, especially since even though I am an expert in NER/NEL/Slot Filling, I am not an expert in Russian or Russian NER/NEL systems.

- not immediately clear why the authors focus only on NER and not NEL, especially since NER is almost a solved problem, whereas NEL and Slot Filling are not necessarily solved in all languages

- I suggest adding DeepPavlov and Spacy to the table 1, as otherwise the table is meaningless - why mention all these tools when the selected tools are not included?

- I also want to suggest combining The Tables 2 and 3 or at least adding the names of the models to Table 3 as it is currently pretty complicated to navigate Table 3 otherwise

- several times in the paper the authors mention the unique rules of the Russian language and the complexities these rules pose, but they do not explain any of these rules. In order to make the paper more readable for the casual reader I would like to suggest that at least half a page is dedicated to these rules and complexities of the Russian language that make Russian NER a difficult proposition. This is really important as the score is great, but there's no clear presentation of how difficult the task is.

Author Response

We are thankful to the reviewer for the time and valuable comments. We have edited the paper as follows to address your comments:

The reviewer’s comment

Changes to the paper

Answer to the reviewer

- more references to Russian NER/NEL system would help - for example the top searches on Google Scholar for Russian NER are not even included in the bibliography (e.g., Malykh et al, 2016 or Ta Le et al, 2028 and several others) - this makes me question their state of the art section, especially since even though I am an expert in NER/NEL/Slot Filling, I am not an expert in Russian or Russian NER/NEL systems.

We expanded the “Related Work” section by including more research on the previous models used to build NER for the Russian language.

Lines (117-122)

Many previous studies related to building Russian NER were introduced. A character-aware RNN model using LSTM units for Russian NER was able to identify the entities but struggled with distinguishing between person and organization tokens due to corpus size limitations [5]. The authors of [4] found that adding Conditional Random Fields (CRF) after the Bi-LSTM enhanced the performance of the NER when tested on the Gareev’s, Person-1000, and FactRuEval 2016 datasets.

Lines (130-135)

The effectiveness of multilingual NER models, which use information from other languages to improve NER in Russian, has been studied. In particular, those built on BERT seem to have the potential to improve NER in Russian. When comparing several pre-trained language models, the authors of [7] discover that Trankit-based models perform better on the NER challenge than others. Trials on BERT-based models trained on multi-lingual and Slovene-only data achieved high F-scores in Slovene and multilingual NER [8]. The authors of [9] presented a BERT model followed by a word-level CRF layer to address the problem of multilingual NER for the Slavic languages (including the Russian language). The presented model achieved the best results in the “BSNLP 2019 Shared Task” competition. The study [10] investigates the usage of a bidirectional BERT model that has been extensively trained to improve Russian Named Entity Recognition.

Thank you so much for your comment. The papers you mentioned use Bi-LSTM-CRF and LSTM-based models to solve the NER task. In our research, we discussed and used more advanced architectures, such as the BERT-based transformer model, to build and train the NER, which showed better results in previous research. We included the papers you mentioned, along with other related papers, in the related work to give a more general view.

- not immediately clear why the authors focus only on NER and not NEL, especially since NER is almost a solved problem, whereas NEL and Slot Filling are not necessarily solved in all languages

We added text to the “Related Work” section to explain why we chose NER rather than NEL (lines 98-112):

There is an extension of the NER task that is aimed not only at the extraction of named entities but also at the creation of links from the entities in texts to the entities in some knowledge base. It is called the Named Entity Linking (NEL) problem. The NEL could also be utilized for the creation of linked corpora. It can provide more accurate results since the goal of NEL is to find a direct link to an entity in the knowledge base from the text. In the case of entity linking with the knowledge base, we don't need separate storage and entity verification. However, there is a very important limitation of the NEL problem. It fully relies on a knowledge base used to create links. If there is no appropriate entity in the knowledge base that could be linked with an entity in text then the link will be lost even if the entity is found in several texts and could be used to link them. It could be solved by the creation of problem-specific knowledge base which will contain all valuable entities from the problem domain but it still needs to find all entities with NER. Since we are working with a specific domain of A.S. Pushkin's literary heritage, our main goal is to find as many entities as possible to create links between text, and the NER is considered the better solution here.

We also extended the “Conclusion” section with future work on solving the NEL problem for the found entities (lines 391-393):

The additional direction of our future work is an approach developing to providing links from the found entities to a Wikidata knowledge base as a part of NEL problem solving.

Thank you for the comment. The NER problem is not fully solved for the Russian language, especially for identifying entities in literary works. That is why we built a custom NER system for working with Pushkin's literary works. We do not consider the NEL problem in this work because we focus mainly on creating links between texts based on the entities and their context in the text. NEL relies entirely on an existing knowledge base and its quality. Our other research, presented at a conference, showed that neither Wikidata nor DBpedia provides enough entities in the domain of A.S. Pushkin. The solution is to create our own knowledge base, starting from a list of the entities found across all the texts of the literary heritage, and this is the main goal of the presented research. Since we are already working on NEL in the same project, providing links from the found entities to a knowledge base for the Russian language is considered future work.

- I suggest adding DeepPavlov and Spacy to the table 1, as otherwise the table is meaningless - why mention all these tools when the selected tools are not included?

Table 1 is mentioned in subsection “4.1. Annotating Tool”. It compares multiple annotation tools, highlighting the pros and cons of each tool.

We appreciate your feedback and comments. We would like to note that Table 1 shows a comprehensive comparison of the annotation tools, and we clarified in the manuscript our choice of Brat as our annotation tool. Spacy and DeepPavlov, on the other hand, are not annotation tools and were not used for this purpose in our research. We used Spacy (a Python library for NLP) for training and testing our models. We used DeepPavlov (an open-source BERT-based NER model) for preliminary annotation and as a benchmark model.

- I also want to suggest combining The Tables 2 and 3 or at least adding the names of the models to Table 3 as it is currently pretty complicated to navigate Table 3 otherwise

We modified Table 3 by adding comments to explain the model.

Thank you for the comment. We modified Table 3 as you recommended by adding the names of the models, which makes it easier to navigate.

- several times in the paper the authors mention the unique rules of the Russian language and the complexities these rules pose, but they do not explain any of these rules. In order to make the paper more readable for the casual reader I would like to suggest that at least half a page is dedicated to these rules and complexities of the Russian language that make Russian NER a difficult proposition. This is really important as the score is great, but there's no clear presentation of how difficult the task is.

We expanded the “Introduction” section. We added an explanation of some of the rules and difficulties of building NER for the Russian language.

Lines (64-74)

Building NER for the Russian language is considered more complex than other languages because of the complexity and the unique rules the Russian language has [3], like the morphological complexity and the flexibility of the word order. In the Russian language, about all words like names, adjectives, and verbs change their forms and endings depending on the grammatical situation and context. Also, the flexibility of the word order allows us to write the same sentence with the same words in many different ways where only the word sequence varies. This makes understanding the context crucial to finding the semantic entities. Not to mention that the Russian language and Russian literature have developed and changed a lot through the years which makes building an NER to determine and recognize the entities a challenging and complex task. As a result, there exists a growing demand for NER solutions tailored to Russian linguistic nuances and to their intended domains (like the custom NER we are trying to build for the A.S. Pushkin heritage).

Thank you for the comment. We added a paragraph explaining in more detail the unique rules of the Russian language, such as its morphological complexity and flexible word order. These rules make building a custom NER for Russian literature a challenging task.
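The knowledge-base limitation of NEL discussed in the response above can be shown with a toy example. Everything here is hypothetical: the knowledge-base identifiers are invented, and the mentions are placeholders rather than results from the paper. The sketch only demonstrates how a link between two texts disappears when the shared entity is missing from the knowledge base, while plain NER output still preserves it.

```python
# Hypothetical knowledge base: surface name -> identifier (identifiers are invented).
knowledge_base = {"Pushkin": "KB:0001"}

# Hypothetical NER output: text id -> list of recognized mentions.
mentions_by_text = {
    "letter_1": ["Pushkin", "Trigorskoye"],
    "poem_1": ["Trigorskoye"],
}


def linked_by_kb(mentions_by_text, kb):
    """NEL-style view: keep only mentions that resolve to a KB entry."""
    return {
        text_id: {kb[m] for m in mentions if m in kb}
        for text_id, mentions in mentions_by_text.items()
    }


def shared_mentions(mentions_by_text, a, b):
    """NER-style view: entities mentioned in both texts."""
    return set(mentions_by_text[a]) & set(mentions_by_text[b])


nel_view = linked_by_kb(mentions_by_text, knowledge_base)
print(nel_view["letter_1"] & nel_view["poem_1"])               # no shared KB entity
print(shared_mentions(mentions_by_text, "letter_1", "poem_1"))  # shared mention survives
```

This matches the authors' stated reasoning for starting with NER: a domain-specific knowledge base has to be populated from extracted entities before linking against it can preserve such connections.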
