Open Access Article

Istex: A Database of Twenty Million Scientific Papers with a Mining Tool Which Uses Named Entities

1 Laboratoire d’Informatique Fondamentale et Appliquée de Tours (LIFAT), Université de Tours, 37000 Tours, France
2 Institut de l’Information Scientifique et Technique, 54500 Nancy, France
3 Lorraine Research Laboratory in Computer Science and Its Applications (Loria), Université de Lorraine, 54506 Nancy, France
* Author to whom correspondence should be addressed.
Information 2019, 10(5), 178; https://doi.org/10.3390/info10050178
Received: 1 April 2019 / Revised: 10 May 2019 / Accepted: 18 May 2019 / Published: 22 May 2019
(This article belongs to the Special Issue Natural Language Processing and Text Mining)

Abstract

Istex is a database of twenty million full-text scientific papers bought by the French Government for use by academic libraries. Papers are usually searched for by title, authors, keywords, or possibly the abstract. To enable new types of queries on Istex, we ran named entity recognition on all papers, and we offer users the possibility of searching on these entities. After presenting the French Istex project, we detail in this paper the named entity recognition performed with CasEN, a cascade of graphs implemented on the Unitex software. CasEN exists in French, but not in English. The first challenge was therefore to build a new cascade in a short time. Its evaluation showed good Precision, although Recall was not very good. Precision was very important for this project, to ensure that a query does not return unwanted papers. The second challenge was deploying Unitex to parse around twenty million documents, for which we used a dockerized application. Finally, we also explain how to query the resulting named entities on the Istex website.
Keywords: text mining; named entity recognition; database of scientific papers; Istex; Unitex; CasEN; Docker

This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).
MDPI and ACS Style

Maurel, D.; Morale, E.; Thouvenin, N.; Ringot, P.; Turri, A. Istex: A Database of Twenty Million Scientific Papers with a Mining Tool Which Uses Named Entities. Information 2019, 10, 178.

Information, EISSN 2078-2489, published by MDPI AG, Basel, Switzerland