Special Issue "Machine Learning and Natural Language Processing"

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 31 July 2020.

Special Issue Editors

Prof. Maxim Mozgovoy
Guest Editor
Active Knowledge Engineering Lab, The University of Aizu, Tsuruga, Ikki-machi, Aizu-Wakamatsu, 965-8580, Japan
Interests: computer-assisted language learning; AI for computer games
Dr. Calkin Suero Montero
Co-Guest Editor
School of Educational Sciences and Psychology, University of Eastern Finland, Finland
Interests: educational technologies; human-language technologies; digital fabrication; user experience; user-centred design

Special Issue Information

Dear Colleagues,

Recent years have been marked by the growing availability of natural language processing technologies for practical everyday use. The rapid development of open-source instruments, such as NLTK; the wide availability of training corpora; and the general recognition of computational linguistics as an established scientific and technological field have resulted in the wide adoption of language processing in a broad range of software products.
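
As a brief illustration of how readily such toolkits can be applied, the following minimal Python sketch tokenizes and part-of-speech tags a sentence with NLTK; it assumes NLTK is installed and that its standard tokenizer and tagger models can be downloaded.

    import nltk

    # Fetch the tokenizer and tagger models on first use (no-ops once cached).
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    sentence = "Open corpora make language processing tools widely available."
    tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
    print(nltk.pos_tag(tokens))             # assign a part-of-speech tag to each token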

Numerous exciting results have been obtained recently with the help of machine learning technologies that make use of available datasets and corpora to support the whole range of tasks addressed in computational linguistics. These approaches have been especially beneficial for regional languages, which receive state-of-the-art language processing tools as soon as the necessary corpora are developed.

However, the existing tools and corpora still cannot cover all of the needs of researchers and developers working in the area of language processing. Thus, the whole community would benefit from new research perspectives on harnessing the available data, and from the creation and adaptation of new linguistic resources aimed at advancing natural language processing technologies.

Thus, we believe that today’s scientific and technological landscape looks positive for research efforts based on a combination of machine learning and natural language processing. Important results can be achieved by relying on modern approaches and datasets, and the expected practical impact is higher than ever. This observation motivates us to propose a Special Issue of Applied Sciences dedicated specifically to applications of machine learning in natural language processing tasks. We invite both original research and review articles relevant to the proposed topic.

Prof. Maxim Mozgovoy
Dr. Calkin Suero Montero 
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All papers will be peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Lemmatization and part-of-speech tagging
  • Word analysis
  • Syntactic, semantic, and context parsing and analysis
  • Word sense disambiguation
  • Sentence breaking
  • Named entity recognition
  • Machine translation-related tasks
  • Question answering and chatbot development
  • Discourse analysis
  • Speech synthesis and recognition
  • Information retrieval
  • Ontology
  • Corpora development and evaluation
  • Natural language generation
  • Text and speech analysis

Published Papers (6 papers)

Research

Open Access Article
UPC: An Open Word-Sense Annotated Parallel Corpora for Machine Translation Study
Appl. Sci. 2020, 10(11), 3904; https://doi.org/10.3390/app10113904 - 04 Jun 2020
Abstract
Machine translation (MT) has recently attracted much research on various advanced techniques (i.e., statistical-based and deep learning-based) and achieved great results for popular languages. However, research involving low-resource languages such as Korean often suffers from a lack of openly available bilingual language resources. In this research, we built open, extensive parallel corpora for training MT models, named the Ulsan parallel corpora (UPC). Currently, UPC contains two parallel corpora consisting of Korean-English and Korean-Vietnamese datasets. The Korean-English dataset has over 969 thousand sentence pairs, and the Korean-Vietnamese parallel corpus consists of over 412 thousand sentence pairs. Furthermore, the high rate of homographs in Korean causes an ambiguous-word issue in MT. To address this problem, we developed a powerful word-sense annotation system based on a combination of sub-word conditional probability and knowledge-based methods, named UTagger. We applied UTagger to UPC and used these corpora to train both statistical and deep learning-based neural MT systems. The experimental results demonstrated that, using UPC, high-quality MT systems (in terms of Bi-Lingual Evaluation Understudy (BLEU) and Translation Error Rate (TER) scores) can be built. Both UPC and UTagger are available for free download and use.
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
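
The BLEU metric mentioned above can be computed with off-the-shelf tools; the following minimal sketch uses NLTK's sentence-level BLEU on a toy, invented reference/hypothesis pair and is not tied to the UPC corpora or UTagger.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # Toy example: one reference translation and one system hypothesis
    # (both invented for illustration only).
    reference = [["the", "cat", "sat", "on", "the", "mat"]]
    hypothesis = ["the", "cat", "is", "on", "the", "mat"]

    # Smoothing avoids zero scores when some n-gram orders have no matches.
    score = sentence_bleu(reference, hypothesis,
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU = {score:.3f}")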

Open Access Article
Integrated Model for Morphological Analysis and Named Entity Recognition Based on Label Attention Networks in Korean
Appl. Sci. 2020, 10(11), 3740; https://doi.org/10.3390/app10113740 - 28 May 2020
Abstract
In well-spaced Korean sentences, morphological analysis is the first step in natural language processing, in which a Korean sentence is segmented into a sequence of morphemes and the parts of speech of the segmented morphemes are determined. Named entity recognition is a natural language processing task carried out to obtain morpheme sequences with specific meanings, such as person, location, and organization names. Although morphological analysis and named entity recognition are closely associated with each other, they have been studied independently and have exhibited the inevitable error propagation problem. Hence, we propose an integrated model based on label attention networks that simultaneously performs morphological analysis and named entity recognition. The proposed model comprises two layers of neural network models that are closely associated with each other. The lower layer performs morphological analysis, whereas the upper layer performs named entity recognition. In our experiments using a public gold-labeled dataset, the proposed model outperformed previous state-of-the-art models used for morphological analysis and named entity recognition. Furthermore, the results indicated that the integrated architecture could alleviate the error propagation problem.
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
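
To make the stacked architecture concrete, here is a minimal PyTorch sketch of a two-layer joint tagger in which a lower recurrent layer predicts part-of-speech labels and an upper layer, conditioned on the lower layer's output, predicts named-entity labels; the layer sizes and the soft-label conditioning are illustrative assumptions, not the published label attention network.

    import torch
    import torch.nn as nn

    class JointTagger(nn.Module):
        """Two-layer joint tagger: the lower layer predicts POS/morpheme labels,
        the upper layer consumes the lower states to predict named-entity labels
        (hypothetical sizes, not the published model)."""
        def __init__(self, vocab_size=1000, emb_dim=64, hidden=128,
                     n_pos_tags=20, n_ner_tags=9):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lower = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
            self.pos_head = nn.Linear(2 * hidden, n_pos_tags)
            # The upper layer sees the lower states plus the soft POS distribution,
            # so NER decisions are conditioned on the morphological analysis.
            self.upper = nn.LSTM(2 * hidden + n_pos_tags, hidden,
                                 batch_first=True, bidirectional=True)
            self.ner_head = nn.Linear(2 * hidden, n_ner_tags)

        def forward(self, token_ids):
            emb = self.embed(token_ids)
            low_out, _ = self.lower(emb)
            pos_logits = self.pos_head(low_out)
            upper_in = torch.cat([low_out, pos_logits.softmax(-1)], dim=-1)
            up_out, _ = self.upper(upper_in)
            return pos_logits, self.ner_head(up_out)

    model = JointTagger()
    tokens = torch.randint(0, 1000, (2, 12))   # batch of 2 sentences, 12 tokens each
    pos_logits, ner_logits = model(tokens)
    print(pos_logits.shape, ner_logits.shape)  # (2, 12, 20) and (2, 12, 9)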

Open Access Article
Knowledge-Grounded Chatbot Based on Dual Wasserstein Generative Adversarial Networks with Effective Attention Mechanisms
Appl. Sci. 2020, 10(9), 3335; https://doi.org/10.3390/app10093335 - 11 May 2020
Abstract
A conversation is based on internal knowledge that the participants already know or external knowledge that they have gained during the conversation. A chatbot that communicates with humans by using its internal and external knowledge is called a knowledge-grounded chatbot. Although previous studies on knowledge-grounded chatbots have achieved reasonable performance, they may still generate unsuitable responses that are not associated with the given knowledge. To address this problem, we propose a knowledge-grounded chatbot model that effectively reflects the dialogue context and the given knowledge by using well-designed attention mechanisms. The proposed model uses three kinds of attention: query-context attention, query-knowledge attention, and context-knowledge attention. In our experiments with the Wizard-of-Wikipedia dataset, the proposed model showed better performance than the state-of-the-art model across a variety of measures.
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
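
A rough sketch of the three attention views described in the abstract, using plain scaled dot-product attention in PyTorch; the tensor sizes and the simple concatenation-based fusion are assumptions for illustration, not the published model.

    import torch
    import torch.nn.functional as F

    def attend(query, keys, values):
        """Scaled dot-product attention: weight `values` by query-key similarity."""
        scores = query @ keys.transpose(-2, -1) / keys.size(-1) ** 0.5
        return F.softmax(scores, dim=-1) @ values

    d = 32
    query     = torch.randn(1, 5, d)    # encoded user query (5 tokens)
    context   = torch.randn(1, 20, d)   # encoded dialogue history
    knowledge = torch.randn(1, 50, d)   # encoded external knowledge sentences

    # Three attention "views", echoing the abstract's query-context,
    # query-knowledge, and context-knowledge attentions (illustrative only).
    q_ctx  = attend(query, context, context)
    q_know = attend(query, knowledge, knowledge)
    c_know = attend(context, knowledge, knowledge)

    # A simple fusion of the attended query representations for a decoder.
    fused = torch.cat([q_ctx, q_know], dim=-1)
    print(fused.shape, c_know.shape)    # torch.Size([1, 5, 64]) torch.Size([1, 20, 32])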

Open Access Article
Predicting Reputation in the Sharing Economy with Twitter Social Data
Appl. Sci. 2020, 10(8), 2881; https://doi.org/10.3390/app10082881 - 21 Apr 2020
Abstract
In recent years, the sharing economy has become popular, with outstanding examples such as Airbnb, Uber, or BlaBlaCar, to name a few. In the sharing economy, users provide goods and services in a peer-to-peer scheme and expose themselves to material and personal risks. Thus, an essential component of its success is its capability to build trust among strangers. This goal is usually achieved by creating reputation systems in which users rate each other after each transaction. Nevertheless, these systems present challenges such as the lack of information about new users or the reliability of peer ratings. However, users leave their digital footprints on many social networks. These social footprints are used for inferring personal information (e.g., personality and consumer habits) and social behaviors (e.g., flu propagation). This article proposes to advance the state of the art in reputation systems by researching how digital footprints coming from social networks can be used to predict future behaviors on sharing economy platforms. In particular, we have focused on predicting the reputation of users in the second-hand market Wallapop based solely on their Twitter profiles. The main contributions of this research are twofold: (a) a reputation prediction model based on social data; and (b) an anonymized dataset of paired users on the sharing economy site Wallapop and Twitter, which has been collected using the user self-mentioning strategy.
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
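
As a hedged illustration of the general idea (predicting a platform-side label from social text), the following scikit-learn sketch fits a TF-IDF plus logistic-regression classifier on a tiny invented set of tweet texts; it is not the paper's model or dataset.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy training data: concatenated tweet text per user and a binary
    # "good reputation" label (entirely invented for illustration).
    tweets = [
        "love helping buyers, always ship fast, thanks for the great deal",
        "never answer messages, cancelled the order again, do not care",
        "friendly seller, item exactly as described, happy to recommend",
        "spam spam click my link buy followers cheap",
    ]
    reputation = [1, 0, 1, 0]

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression())
    model.fit(tweets, reputation)
    print(model.predict(["quick replies and honest description, great experience"]))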

Open Access Article
Named Entity Recognition for Sensitive Data Discovery in Portuguese
Appl. Sci. 2020, 10(7), 2303; https://doi.org/10.3390/app10072303 - 27 Mar 2020
Abstract
The process of protecting sensitive data is continually growing and becoming increasingly important, especially as a result of the directives and laws imposed by the European Union. The effort to create automatic systems is continuous, but, in most cases, the processes behind them are still manual or semi-automatic. In this work, we have developed a component that can extract and classify sensitive data from unstructured text information in European Portuguese. The objective was to create a system that allows organizations to understand their data and comply with legal and security requirements. We studied a hybrid approach to the problem of Named Entity Recognition for the Portuguese language. This approach combines several techniques such as rule-based/lexical-based models, machine learning algorithms, and neural networks. The rule-based and lexical-based approaches were used only for a set of specific classes. For the remaining classes of entities, two statistical models were tested, Conditional Random Fields and Random Forest, and, finally, a Bidirectional LSTM approach was experimented with. Regarding the statistical models, we found that Conditional Random Fields obtains the best results, with an F1-score of 65.50%. With the Bi-LSTM approach, we achieved a result of 83.01%. The corpora used for training and testing were the HAREM Golden Collection, the SIGARRA News Corpus, and the DataSense NER Corpus.
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
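
For readers unfamiliar with CRF-based tagging, the sketch below shows the typical feature-dictionary setup using the sklearn-crfsuite package on a tiny invented Portuguese sentence; the features and data are illustrative only and unrelated to the corpora or configuration used in the paper.

    import sklearn_crfsuite  # assumption: the sklearn-crfsuite package is installed

    def word_features(sentence, i):
        """Simple per-token features of the kind commonly fed to a CRF tagger."""
        word = sentence[i]
        return {
            "word.lower": word.lower(),
            "word.istitle": word.istitle(),
            "word.isdigit": word.isdigit(),
            "prev.lower": sentence[i - 1].lower() if i > 0 else "<BOS>",
            "next.lower": sentence[i + 1].lower() if i < len(sentence) - 1 else "<EOS>",
        }

    # Tiny toy corpus with IOB labels (invented for illustration only).
    sentences = [["Maria", "vive", "em", "Lisboa", "."]]
    labels = [["B-PER", "O", "O", "B-LOC", "O"]]

    X = [[word_features(s, i) for i in range(len(s))] for s in sentences]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, labels)
    print(crf.predict(X))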

Open Access Article
Comprehensive Document Summarization with Refined Self-Matching Mechanism
Appl. Sci. 2020, 10(5), 1864; https://doi.org/10.3390/app10051864 - 09 Mar 2020
Abstract
Under the constraints of neural network memory capacity and document length, it is difficult to generate summaries with adequate salient information. In this work, the self-matching mechanism is incorporated into the extractive summarization system at the encoder side, which allows the encoder to optimize the encoding information at the global level and effectively improves the memory capacity of a conventional LSTM. Inspired by the human coarse-to-fine understanding mode, localness is modeled by a Gaussian bias to improve contextualization for each sentence and is merged into the self-matching energy. The refined self-matching mechanism not only establishes global document attention but also perceives associations with neighboring signals. At the decoder side, a pointer network is utilized to perform two-hop attention on the context and the extraction state. Evaluations on the CNN/Daily Mail dataset verify that the proposed model outperforms strong baseline models, and the improvement is statistically significant.
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
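
The idea of adding a Gaussian localness bias to a global matching energy can be sketched in a few lines of PyTorch; the toy function below is a simplified reading of the abstract, not the published summarization model.

    import torch
    import torch.nn.functional as F

    def self_matching(sent_repr, sigma=2.0):
        """Self-matching over sentence representations with a Gaussian localness
        bias: each sentence attends to all others, but nearby sentences are
        favoured (illustrative simplification)."""
        n, d = sent_repr.shape
        energy = sent_repr @ sent_repr.t() / d ** 0.5       # global matching energy
        pos = torch.arange(n, dtype=torch.float)
        dist = (pos.unsqueeze(0) - pos.unsqueeze(1)) ** 2
        energy = energy - dist / (2 * sigma ** 2)            # Gaussian localness bias
        weights = F.softmax(energy, dim=-1)
        return weights @ sent_repr                           # refined sentence states

    doc = torch.randn(8, 16)      # 8 sentences, 16-dimensional representations
    refined = self_matching(doc)
    print(refined.shape)          # torch.Size([8, 16])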
