
Named Entity Recognition for Sensitive Data Discovery in Portuguese

1 Inov Inesc Inovação—Instituto De Novas Tecnologias, 1000-029 Lisbon, Portugal
2 ISTAR-IUL, Instituto Universitário de Lisboa (ISCTE-IUL), 1649-026 Lisboa, Portugal
3 INESC-ID Lisboa, 1000-029 Lisbon, Portugal
* Author to whom correspondence should be addressed.
Appl. Sci. 2020, 10(7), 2303; https://doi.org/10.3390/app10072303
Received: 20 February 2020 / Revised: 16 March 2020 / Accepted: 20 March 2020 / Published: 27 March 2020
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
Protecting sensitive data is becoming increasingly important, especially as a result of the directives and laws imposed by the European Union. Efforts to create automatic systems are ongoing but, in most cases, the processes behind them are still manual or semi-automatic. In this work, we developed a component that can extract and classify sensitive data from unstructured text in European Portuguese. The objective was to create a system that allows organizations to understand their data and meet legal and security requirements. We studied a hybrid approach to the problem of Named Entity Recognition for the Portuguese language. This approach combines several techniques: rule-based and lexical-based models, machine learning algorithms, and neural networks. The rule-based and lexical-based approaches were used only for a set of specific classes. For the remaining classes of entities, two statistical models were tested, Conditional Random Fields and Random Forest, and, finally, a Bidirectional LSTM (Bi-LSTM) approach was tried. Of the statistical models, Conditional Random Fields obtained the best results, with an F1-score of 65.50%. With the Bi-LSTM approach, we achieved a result of 83.01%. The corpora used for training and testing were the HAREM Golden Collection, the SIGARRA News Corpus, and the DataSense NER Corpus.
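The hybrid idea described in the abstract, rules for some classes and a learned model for the rest, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names (`EMAIL`, `POSTAL_CODE`), the regular expressions, the function names, and the overlap policy are all assumptions made for illustration, and `model_tagger` is a placeholder for whatever trained tagger (CRF, Bi-LSTM, or other) would handle the remaining classes.

```python
import re

# Hypothetical rule set: these class names and patterns are illustrative
# assumptions, not the classes or rules used in the paper.
RULES = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "POSTAL_CODE": re.compile(r"\b\d{4}-\d{3}\b"),  # Portuguese NNNN-NNN format
}


def rule_based_entities(text):
    """Return (start, end, label, surface) tuples found by the regex rules."""
    found = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), label, match.group()))
    return sorted(found)


def hybrid_entities(text, model_tagger=None):
    """Combine rule matches with spans from an optional learned tagger.

    model_tagger is a placeholder: any callable that takes the text and
    returns (start, end, label, surface) tuples. Rule matches take priority
    here (an assumed policy); model spans overlapping a rule match are dropped.
    """
    entities = rule_based_entities(text)
    if model_tagger is not None:
        for span in model_tagger(text):
            overlaps = any(span[0] < e[1] and e[0] < span[1] for e in entities)
            if not overlaps:
                entities.append(span)
    return sorted(entities)


if __name__ == "__main__":
    sample = "Contacto: ana.silva@example.pt, 1649-026 Lisboa"
    for entity in hybrid_entities(sample):
        print(entity)
```

Giving rule matches priority is only one possible design choice; a real system might instead resolve conflicts by confidence score or by class.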
Keywords: sensitive data; general data protection regulation; natural language processing; Portuguese language; named entity recognition

MDPI and ACS Style

Dias, M.; Boné, J.; Ferreira, J.C.; Ribeiro, R.; Maia, R. Named Entity Recognition for Sensitive Data Discovery in Portuguese. Appl. Sci. 2020, 10, 2303. https://doi.org/10.3390/app10072303

AMA Style

Dias M, Boné J, Ferreira JC, Ribeiro R, Maia R. Named Entity Recognition for Sensitive Data Discovery in Portuguese. Applied Sciences. 2020; 10(7):2303. https://doi.org/10.3390/app10072303

Chicago/Turabian Style

Dias, Mariana; Boné, João; Ferreira, João C.; Ribeiro, Ricardo; Maia, Rui. 2020. "Named Entity Recognition for Sensitive Data Discovery in Portuguese" Appl. Sci. 10, no. 7: 2303. https://doi.org/10.3390/app10072303

