Language Processing and Knowledge Extraction

A special issue of Machine Learning and Knowledge Extraction (ISSN 2504-4990).

Deadline for manuscript submissions: closed (15 September 2022) | Viewed by 20120

Special Issue Editors


Dr. Alberto Simões
Guest Editor
2Ai, School of Technology, Polytechnic Institute of Cávado e Ave, 4750-810 Barcelos, Portugal
Interests: natural language processing; programming languages; compilers; computer programming education

Dr. Pedro Rangel Henriques
Guest Editor
Departamento de Informática, Escola de Engenharia, Universidade do Minho, 4710-057 Braga, Portugal
Interests: programming languages; language processing

Special Issue Information

Dear Colleagues,

For many years, Natural Language Processing followed the trends of artificial intelligence, relying on algebraic and rule-based approaches. Tasks ranging from simple tokenization and segmentation, through part-of-speech tagging, up to complex tasks such as machine translation were largely based on human work to formally describe each task.

In recent years, things have changed. The amount of data available for almost every language and every field, together with the evolution of computational power, has led to data-oriented approaches using machine learning algorithms.

Curiously, at first the goal was not to completely replace human-crafted rules with systems using only machine learning approaches. Machine translation is a good example: about ten years ago, the main trend was Example-Based Machine Translation, which used machine learning to extract portions of texts and their translations (thus, examples of translations). The remaining portion of the translation task was still highly based on translation rules.

More recently, with the boom of Deep Learning, these tasks were completely taken over by ML algorithms. Moreover, not just complex tasks, such as machine translation, were affected: currently, almost any task can be solved using machine learning, provided there are enough data to train a model.

In this Special Issue we are interested in the use of machine learning approaches in natural language processing, regardless of the complexity of the task being solved, whether the task is treated as a single ML problem or ML is used to solve a specific portion of it. We are especially interested in applications of ML approaches to languages with limited data availability (usually referred to as under-resourced languages).

Dr. Alberto Simões
Dr. Pedro Rangel Henriques
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers are published continuously in the journal (as soon as accepted) and listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Machine Learning and Knowledge Extraction is an international peer-reviewed open access quarterly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • natural language processing
  • machine learning
  • low-resource languages

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (5 papers)


Research

16 pages, 848 KiB  
Article
IPPT4KRL: Iterative Post-Processing Transfer for Knowledge Representation Learning
by Weihang Zhang, Ovidiu Șerban, Jiahao Sun and Yike Guo
Mach. Learn. Knowl. Extr. 2023, 5(1), 43-58; https://doi.org/10.3390/make5010004 - 6 Jan 2023
Viewed by 2205
Abstract
Knowledge Graphs (KGs), a structural way to model human knowledge, have been a critical component of many artificial intelligence applications. Many KG-based tasks are built using knowledge representation learning, which embeds KG entities and relations into a low-dimensional semantic space. However, the quality of representation learning is often limited by the heterogeneity and sparsity of real-world KGs. Multi-KG representation learning, which utilizes KGs from different sources collaboratively, presents one promising solution. In this paper, we propose a simple, but effective iterative method that post-processes pre-trained knowledge graph embedding (IPPT4KRL) on individual KGs to maximize the knowledge transfer from another KG when a small portion of alignment information is introduced. Specifically, additional triples are iteratively included in the post-processing based on their adjacencies to the cross-KG alignments to refine the pre-trained embedding space of individual KGs. We also provide the benchmarking results of existing multi-KG representation learning methods on several generated and well-known datasets. The empirical results of the link prediction task on these datasets show that the proposed IPPT4KRL method achieved comparable and even superior results when compared against more complex methods in multi-KG representation learning.
(This article belongs to the Special Issue Language Processing and Knowledge Extraction)
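
The adjacency-based triple selection at the heart of the post-processing loop is easy to picture in code. Below is a minimal Python sketch of that one idea, assuming triples are (head, relation, tail) tuples and the alignment is a set of entity ids; the function name and loop structure are illustrative, not the authors' implementation.

```python
def adjacent_triples(triples, aligned_entities, rounds=3):
    """Iteratively collect triples adjacent to the current alignment set:
    any triple touching a frontier entity is included, and its other
    endpoint joins the frontier for the next round."""
    included, frontier = set(), set(aligned_entities)
    for _ in range(rounds):
        new_entities = set()
        for h, r, t in triples:
            if (h, r, t) not in included and (h in frontier or t in frontier):
                included.add((h, r, t))
                new_entities.update({h, t})
        frontier |= new_entities
    return included
```

In the paper's setting, the triples gathered this way would feed further refinement of the pre-trained embeddings; the sketch only shows how the inclusion frontier grows round by round.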

13 pages, 4850 KiB  
Article
Automatic Extraction of Medication Information from Cylindrically Distorted Pill Bottle Labels
by Kseniia Gromova and Vinayak Elangovan
Mach. Learn. Knowl. Extr. 2022, 4(4), 852-864; https://doi.org/10.3390/make4040043 - 27 Sep 2022
Cited by 4 | Viewed by 5286
Abstract
Patient compliance with prescribed medication regimens is critical for maintaining health and managing disease and illness. To encourage patient compliance, multiple aids, like automatic pill dispensers, pill organizers, and various reminder applications, have been developed to help people adhere to their medication regimens. However, when utilizing these aids, the user or patient must manually enter their medication information and schedule. This process is time-consuming and often prone to error. For example, elderly patients may have difficulty reading medication information on the bottle due to decreased eyesight, leading them to enter medication information incorrectly. This study explored methods for extracting pertinent information from cylindrically distorted prescription drug labels using Machine Learning and Computer Vision techniques. This study found that Deep Convolutional Neural Networks (DCNN) performed better than other techniques in identifying label key points under different lighting conditions and various backgrounds. This method achieved a Percentage of Correct Keypoints (PCK@0.03) of 97%. These key points were then used to correct the cylindrical distortion. Next, the multiple dewarped label images were stitched together and processed by an Optical Character Recognition (OCR) engine. Pertinent information, such as patient name, drug name, drug strength, and directions of use, was extracted from the recognized text using Natural Language Processing (NLP) techniques. The system created in this study can be used to improve patient health and compliance by creating an accurate medication schedule.
(This article belongs to the Special Issue Language Processing and Knowledge Extraction)
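
The PCK@0.03 figure quoted above is a standard keypoint metric: a predicted keypoint counts as correct if it falls within 3% of some normalisation length of the ground truth. A minimal NumPy sketch follows; the choice of normalisation (image diagonal here) is an assumption, since conventions vary between papers.

```python
import numpy as np

def pck(pred, gt, norm, alpha=0.03):
    """Percentage of Correct Keypoints.
    pred, gt: (N, K, 2) predicted / ground-truth keypoint coordinates.
    norm: (N, 1) per-image normalisation length (assumed: image diagonal).
    A keypoint is correct if its error is below alpha * norm."""
    errors = np.linalg.norm(pred - gt, axis=-1)   # (N, K) Euclidean errors
    return float((errors < alpha * norm).mean())
```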

24 pages, 1688 KiB  
Article
NER in Archival Finding Aids: Extended
by Luís Filipe da Costa Cunha and José Carlos Ramalho
Mach. Learn. Knowl. Extr. 2022, 4(1), 42-65; https://doi.org/10.3390/make4010003 - 17 Jan 2022
Cited by 5 | Viewed by 3346
Abstract
The amount of information preserved in Portuguese archives has increased over the years. These documents represent a national heritage of high importance, as they portray the country’s history. Currently, most Portuguese archives have made their finding aids available to the public in digital format; however, these data do not have any annotation, so it is not always easy to analyze their content. In this work, Named Entity Recognition solutions were created that allow the identification and classification of several named entities from the archival finding aids. These named entities translate into crucial information about their context and, with high-confidence results, they can be used for several purposes, for example, the creation of smart browsing tools by using entity linking and record linking techniques. In order to achieve high result scores, we annotated several corpora to train our own Machine Learning algorithms in this context domain. We also used different architectures, such as CNNs, LSTMs, and Maximum Entropy models. Finally, all the created datasets and ML models were made available to the public with a developed web platform, NER@DI.
(This article belongs to the Special Issue Language Processing and Knowledge Extraction)
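
Once such a model is trained, applying it to a finding-aid excerpt takes only a few lines. Below is a hedged sketch using the Hugging Face pipeline API; the model identifier is a placeholder, not the actual NER@DI model, and the example sentence is invented.

```python
from transformers import pipeline

# Placeholder model id; substitute a Portuguese NER model of your choice.
ner = pipeline("ner", model="<portuguese-ner-model>",
               aggregation_strategy="simple")

text = "Carta do Conde de Barcelos enviada ao Mosteiro de Tibães em 1643."
for entity in ner(text):
    # Each aggregated entity carries its type, surface form, and confidence.
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```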

11 pages, 3025 KiB  
Article
The Number of Topics Optimization: Clustering Approach
by Fedor Krasnov and Anastasiia Sen
Mach. Learn. Knowl. Extr. 2019, 1(1), 416-426; https://doi.org/10.3390/make1010025 - 30 Jan 2019
Cited by 26 | Viewed by 4817
Abstract
Although topic models have been used to build clusters of documents for more than ten years, there is still a problem of choosing the optimal number of topics. The authors analyzed many fundamental studies undertaken on the subject in recent years. The main problem is the lack of a stable metric of the quality of topics obtained during the construction of the topic model. The authors analyzed the internal metrics of the topic model: coherence, contrast, and purity to determine the optimal number of topics and concluded that they are not applicable to solve this problem. The authors analyzed the approach to choosing the optimal number of topics based on the quality of the clusters. For this purpose, the authors considered the behavior of the cluster validation metrics: the Davies-Bouldin index, the silhouette coefficient, and the Calinski-Harabasz index. A new method for determining the optimal number of topics proposed in this paper is based on the following principles: (1) setting up a topic model with additive regularization (ARTM) to separate noise topics; (2) using dense vector representations (GloVe, FastText, Word2Vec); (3) using a cosine measure of distance in the cluster metrics, which works better than Euclidean distance on high-dimensional vectors. The methodology developed by the authors for obtaining the optimal number of topics was tested on a collection of scientific articles from the OnePetro library, selected by specific themes. The experiment showed that the method proposed by the authors allows assessing the optimal number of topics for a topic model built on a small collection of English documents.
(This article belongs to the Special Issue Language Processing and Knowledge Extraction)
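
The cluster-validity idea is straightforward to try with off-the-shelf tools. Below is a minimal scikit-learn sketch, assuming documents are already represented as dense vectors (e.g., averaged Word2Vec/GloVe/FastText embeddings); it illustrates the selection criterion only, not the authors' ARTM pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_topic_count(doc_vectors, candidates=range(2, 30)):
    """Score each candidate topic count by clustering the document vectors
    and return the count with the highest cosine-based silhouette score."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(doc_vectors)
        # Cosine distance: reported to behave better than Euclidean
        # distance for high-dimensional embedding vectors.
        scores[k] = silhouette_score(doc_vectors, labels, metric="cosine")
    return max(scores, key=scores.get), scores
```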

13 pages, 648 KiB  
Article
Using the Outlier Detection Task to Evaluate Distributional Semantic Models
by Pablo Gamallo
Mach. Learn. Knowl. Extr. 2019, 1(1), 211-223; https://doi.org/10.3390/make1010013 - 22 Nov 2018
Cited by 2 | Viewed by 3371
Abstract
In this article, we define the outlier detection task and use it to compare neural-based word embeddings with transparent count-based distributional representations. Using the English Wikipedia as a text source to train the models, we observed that embeddings outperform count-based representations when their contexts are made up of bag-of-words. However, there are no sharp differences between the two models if the word contexts are defined as syntactic dependencies. In general, syntax-based models tend to perform better than those based on bag-of-words for this specific task. Similar experiments were carried out for Portuguese with similar results. The test datasets we have created for the outlier detection task in English and Portuguese are freely available.
(This article belongs to the Special Issue Language Processing and Knowledge Extraction)
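
The outlier detection task itself is compact enough to state in code: given a set of semantically related words plus one intruder, rank each word by its average similarity to the others; the model succeeds if the intruder ranks lowest. A minimal sketch follows, assuming `vectors` maps words to NumPy arrays; the word set in the usage note is invented.

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def detect_outlier(words, vectors):
    """Return the word with the lowest average similarity to the rest."""
    def compactness(w):
        others = [o for o in words if o != w]
        return np.mean([cosine(vectors[w], vectors[o]) for o in others])
    return min(words, key=compactness)

# Usage: detect_outlier(["cat", "dog", "horse", "carrot"], vectors)
# should return "carrot" for a good distributional model.
```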
