Information

Editorial

Jump to: Research

3 pages, 149 KB

Open AccessEditorial

Editorial for the Special Issue on “Natural Language Processing and Text Mining”

by Pablo Gamallo and Marcos Garcia

Information 2019, 10(9), 279; https://doi.org/10.3390/info10090279 - 6 Sep 2019

Cited by 2 | Viewed by 3392

Abstract

Natural language processing (NLP) and Text Mining (TM) are a set of overlapping strategies working on unstructured text [...] Full article

(This article belongs to the Special Issue Natural Language Processing and Text Mining)

Research

Jump to: Editorial

17 pages, 944 KB

Open AccessFeature PaperArticle

Transfer Learning for Named Entity Recognition in Financial and Biomedical Documents

by Sumam Francis, Jordy Van Landeghem and Marie-Francine Moens

Information 2019, 10(8), 248; https://doi.org/10.3390/info10080248 - 26 Jul 2019

Cited by 48 | Viewed by 13083

Abstract

Recent deep learning approaches have shown promising results for named entity recognition (NER). A reasonable assumption for training robust deep learning models is that a sufficient amount of high-quality annotated training data is available. However, in many real-world scenarios, labeled training data is [...] Read more.

Recent deep learning approaches have shown promising results for named entity recognition (NER). A reasonable assumption for training robust deep learning models is that a sufficient amount of high-quality annotated training data is available. However, in many real-world scenarios, labeled training data is scarcely present. In this paper we consider two use cases: generic entity extraction from financial and from biomedical documents. First, we have developed a character based model for NER in financial documents and a word and character based model with attention for NER in biomedical documents. Further, we have analyzed how transfer learning addresses the problem of limited training data in a target domain. We demonstrate through experiments that NER models trained on labeled data from a source domain can be used as base models and then be fine-tuned with few labeled data for recognition of different named entity classes in a target domain. We also witness an interest in language models to improve NER as a way of coping with limited labeled data. The current most successful language model is BERT. Because of its success in state-of-the-art models we integrate representations based on BERT in our biomedical NER model along with word and character information. The results are compared with a state-of-the-art model applied on a benchmarking biomedical corpus. Full article

(This article belongs to the Special Issue Natural Language Processing and Text Mining)

► Show Figures

Figure 1

15 pages, 7236 KB

Open AccessArticle

Assisting Forensic Identification through Unsupervised Information Extraction of Free Text Autopsy Reports: The Disappearances Cases during the Brazilian Military Dictatorship

by Patricia Martin-Rodilla, Marcia L. Hattori and Cesar Gonzalez-Perez

Information 2019, 10(7), 231; https://doi.org/10.3390/info10070231 - 5 Jul 2019

Cited by 5 | Viewed by 5755

Abstract

Anthropological, archaeological, and forensic studies situate enforced disappearance as a strategy associated with the Brazilian military dictatorship (1964–1985), leaving hundreds of persons without identity or cause of death identified. Their forensic reports are the only existing clue for people identification and detection of [...] Read more.

Anthropological, archaeological, and forensic studies situate enforced disappearance as a strategy associated with the Brazilian military dictatorship (1964–1985), leaving hundreds of persons without identity or cause of death identified. Their forensic reports are the only existing clue for people identification and detection of possible crimes associated with them. The exchange of information among institutions about the identities of disappeared people was not a common practice. Thus, their analysis requires unsupervised techniques, mainly due to the fact that their contextual annotation is extremely time-consuming, difficult to obtain, and with high dependence on the annotator. The use of these techniques allows researchers to assist in the identification and analysis in four areas: Common causes of death, relevant body locations, personal belongings terminology, and correlations between actors such as doctors and police officers involved in the disappearances. This paper analyzes almost 3000 textual reports of missing persons in São Paulo city during the Brazilian dictatorship through unsupervised algorithms of information extraction in Portuguese, identifying named entities and relevant terminology associated with these four criteria. The analysis allowed us to observe terminological patterns relevant for people identification (e.g., presence of rings or similar personal belongings) and automate the study of correlations between actors. The proposed system acts as a first classificatory and indexing middleware of the reports and represents a feasible system that can assist researchers working in pattern search among autopsy reports. Full article

(This article belongs to the Special Issue Natural Language Processing and Text Mining)

► Show Figures

Figure 1

25 pages, 415 KB

Open AccessArticle

Multilingual Open Information Extraction: Challenges and Opportunities

by Daniela Barreiro Claro, Marlo Souza, Clarissa Castellã Xavier and Leandro Oliveira

Information 2019, 10(7), 228; https://doi.org/10.3390/info10070228 - 2 Jul 2019

Cited by 33 | Viewed by 9318

Abstract

The number of documents published on the Web in languages other than English grows every year. As a consequence, the need to extract useful information from different languages increases, highlighting the importance of research into Open Information Extraction (OIE) techniques. Different OIE methods [...] Read more.

The number of documents published on the Web in languages other than English grows every year. As a consequence, the need to extract useful information from different languages increases, highlighting the importance of research into Open Information Extraction (OIE) techniques. Different OIE methods have dealt with features from a unique language; however, few approaches tackle multilingual aspects. In those approaches, multilingualism is restricted to processing text in different languages, rather than exploring cross-linguistic resources, which results in low precision due to the use of general rules. Multilingual methods have been applied to numerous problems in Natural Language Processing, achieving satisfactory results and demonstrating that knowledge acquisition for a language can be transferred to other languages to improve the quality of the facts extracted. We argue that a multilingual approach can enhance OIE methods as it is ideal to evaluate and compare OIE systems, and therefore can be applied to the collected facts. In this work, we discuss how the transfer knowledge between languages can increase acquisition from multilingual approaches. We provide a roadmap of the Multilingual Open IE area concerning state of the art studies. Additionally, we evaluate the transfer of knowledge to improve the quality of the facts extracted in each language. Moreover, we discuss the importance of a parallel corpus to evaluate and compare multilingual systems. Full article

(This article belongs to the Special Issue Natural Language Processing and Text Mining)

► Show Figures

Figure 1

21 pages, 988 KB

Open AccessArticle

Large Scale Linguistic Processing of Tweets to Understand Social Interactions among Speakers of Less Resourced Languages: The Basque Case

by Joseba Fernandez de Landa, Rodrigo Agerri and Iñaki Alegria

Information 2019, 10(6), 212; https://doi.org/10.3390/info10060212 - 13 Jun 2019

Cited by 7 | Viewed by 8438

Abstract

Social networks like Twitter are increasingly important in the creation of new ways of communication. They have also become useful tools for social and linguistic research due to the massive amounts of public textual data available. This is particularly important for less resourced [...] Read more.

Social networks like Twitter are increasingly important in the creation of new ways of communication. They have also become useful tools for social and linguistic research due to the massive amounts of public textual data available. This is particularly important for less resourced languages, as it allows to apply current natural language processing techniques to large amounts of unstructured data. In this work, we study the linguistic and social aspects of young and adult people’s behaviour based on their tweets’ contents and the social relations that arise from them. With this objective in mind, we have gathered over 10 million tweets from more than 8000 users. First, we classified each user in terms of its life stage (young/adult) according to the writing style of their tweets. Second, we applied topic modelling techniques to the personal tweets to find the most popular topics according to life stages. Third, we established the relations and communities that emerge based on the retweets. We conclude that using large amounts of unstructured data provided by Twitter facilitates social research using computational techniques such as natural language processing, giving the opportunity both to segment communities based on demographic characteristics and to discover how they interact or relate to them. Full article

(This article belongs to the Special Issue Natural Language Processing and Text Mining)

► Show Figures

Figure 1

15 pages, 422 KB

Open AccessFeature PaperArticle

Event Extraction and Representation: A Case Study for the Portuguese Language

by Paulo Quaresma, Vítor Beires Nogueira, Kashyap Raiyani and Roy Bayot

Information 2019, 10(6), 205; https://doi.org/10.3390/info10060205 - 8 Jun 2019

Cited by 6 | Viewed by 7712

Abstract

Text information extraction is an important natural language processing (NLP) task, which aims to automatically identify, extract, and represent information from text. In this context, event extraction plays a relevant role, allowing actions, agents, objects, places, and time periods to be identified and [...] Read more.

Text information extraction is an important natural language processing (NLP) task, which aims to automatically identify, extract, and represent information from text. In this context, event extraction plays a relevant role, allowing actions, agents, objects, places, and time periods to be identified and represented. The extracted information can be represented by specialized ontologies, supporting knowledge-based reasoning and inference processes. In this work, we will describe, in detail, our proposal for event extraction from Portuguese documents. The proposed approach is based on a pipeline of specialized natural language processing tools; namely, a part-of-speech tagger, a named entities recognizer, a dependency parser, semantic role labeling, and a knowledge extraction module. The architecture is language-independent, but its modules are language-dependent and can be built using adequate AI (i.e., rule-based or machine learning) methodologies. The developed system was evaluated with a corpus of Portuguese texts and the obtained results are presented and analysed. The current limitations and future work are discussed in detail. Full article

(This article belongs to the Special Issue Natural Language Processing and Text Mining)

► Show Figures

Figure 1

9 pages, 738 KB

Open AccessArticle

Spelling Correction of Non-Word Errors in Uyghur–Chinese Machine Translation

by Rui Dong, Yating Yang and Tonghai Jiang

Information 2019, 10(6), 202; https://doi.org/10.3390/info10060202 - 6 Jun 2019

Cited by 4 | Viewed by 8586

Abstract

This research was conducted to solve the out-of-vocabulary problem caused by Uyghur spelling errors in Uyghur–Chinese machine translation, so as to improve the quality of Uyghur–Chinese machine translation. This paper assesses three spelling correction methods based on machine translation: 1. Using a Bilingual [...] Read more.

This research was conducted to solve the out-of-vocabulary problem caused by Uyghur spelling errors in Uyghur–Chinese machine translation, so as to improve the quality of Uyghur–Chinese machine translation. This paper assesses three spelling correction methods based on machine translation: 1. Using a Bilingual Evaluation Understudy (BLEU) score; 2. Using a Chinese language model; 3. Using a bilingual language model. The best results were achieved in both the spelling correction task and the machine translation task by using the BLEU score for spelling correction. A maximum F1 score of 0.72 was reached for spelling correction, and the translation result increased the BLEU score by 1.97 points, relative to the baseline system. However, the method of using a BLEU score for spelling correction requires the support of a bilingual parallel corpus, which is a supervised method that can be used in corpus pre-processing. Unsupervised spelling correction can be performed by using either a Chinese language model or a bilingual language model. These two methods can be easily extended to other languages, such as Arabic. Full article

(This article belongs to the Special Issue Natural Language Processing and Text Mining)

► Show Figures

Figure 1

20 pages, 627 KB

Open AccessArticle

An Improved Word Representation for Deep Learning Based NER in Indian Languages

by Ajees A P, Manju K and Sumam Mary Idicula

Information 2019, 10(6), 186; https://doi.org/10.3390/info10060186 - 30 May 2019

Cited by 9 | Viewed by 6751

Abstract

Named Entity Recognition (NER) is the process of identifying the elementary units in a text document and classifying them into predefined categories such as person, location, organization and so forth. NER plays an important role in many Natural Language Processing applications like information [...] Read more.

Named Entity Recognition (NER) is the process of identifying the elementary units in a text document and classifying them into predefined categories such as person, location, organization and so forth. NER plays an important role in many Natural Language Processing applications like information retrieval, question answering, machine translation and so forth. Resolving the ambiguities of lexical items involved in a text document is a challenging task. NER in Indian languages is always a complex task due to their morphological richness and agglutinative nature. Even though different solutions were proposed for NER, it is still an unsolved problem. Traditional approaches to Named Entity Recognition were based on the application of hand-crafted features to classical machine learning techniques such as Hidden Markov Model (HMM), Support Vector Machine (SVM), Conditional Random Field (CRF) and so forth. But the introduction of deep learning techniques to the NER problem changed the scenario, where the state of art results have been achieved using deep learning architectures. In this paper, we address the problem of effective word representation for NER in Indian languages by capturing the syntactic, semantic and morphological information. We propose a deep learning based entity extraction system for Indian languages using a novel combined word representation, including character-level, word-level and affix-level embeddings. We have used ‘ARNEKT-IECSIL 2018’ shared data for training and testing. Our results highlight the improvement that we obtained over the existing pre-trained word representations. Full article

(This article belongs to the Special Issue Natural Language Processing and Text Mining)

► Show Figures

Figure 1

17 pages, 514 KB

Open AccessArticle

Istex: A Database of Twenty Million Scientific Papers with a Mining Tool Which Uses Named Entities

by Denis Maurel, Enza Morale, Nicolas Thouvenin, Patrice Ringot and Angel Turri

Information 2019, 10(5), 178; https://doi.org/10.3390/info10050178 - 22 May 2019

Cited by 3 | Viewed by 6351

Abstract

Istex is a database of twenty million full text scientific papers bought by the French Government for the use of academic libraries. Papers are usually searched for by the title, authors, keywords or possibly the abstract. To authorize new types of queries of [...] Read more.

Istex is a database of twenty million full text scientific papers bought by the French Government for the use of academic libraries. Papers are usually searched for by the title, authors, keywords or possibly the abstract. To authorize new types of queries of Istex, we implemented a system of named entity recognition on all papers and we offer users the possibility to run searches on these entities. After the presentation of the French Istex project, we detail in this paper the named entity recognition with CasEN, a cascade of graphs, implemented on the Unitex Software. CasEN exists in French, but not in English. The first challenge was to build a new cascade in a short time. The results of its evaluation showed a good Precision measure, even if the Recall was not very good. The Precision was very important for this project to ensure it did not return unwanted papers by a query. The second challenge was the implementation of Unitex to parse around twenty millions of documents. We used a dockerized application. Finally, we explain also how to query the resulting Named entities in the Istex website. Full article

(This article belongs to the Special Issue Natural Language Processing and Text Mining)

► Show Figures

Figure 1

Journal Menu

Journal Browser

Natural Language Processing and Text Mining

Share This Special Issue

Special Issue Editors

Special Issue Information

Keywords

Benefits of Publishing in a Special Issue

Published Papers (9 papers)

Editorial

Research

Further Information

Guidelines

MDPI Initiatives

Follow MDPI