Applications of Natural Language Processing to Data Science

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 30 June 2025 | Viewed by 2532

Special Issue Editors


Prof. Dr. Vincenza Carchiolo
Guest Editor
Dipartimento di Ingegneria Elettrica Elettronica Informatica, Università di Catania, Viale Andrea Doria 9, 95127 Catania, Italy
Interests: network science; natural language processing; data analysis; machine learning; information spread; distributed systems

Dr. Michele Malgeri
Guest Editor
Dipartimento di Ingegneria Elettrica, Elettronica Informatica (DIEEI), Università di Catania, I-95125 Catania, Italy
Interests: information security; machine learning; big data analysis; complex systems; IoT; artificial intelligence; social networking

Special Issue Information

Dear Colleagues,

This Special Issue aims to explore advanced techniques and emerging applications of Natural Language Processing (NLP) within the context of data science. In recent years, the field of NLP has made significant strides due to advancements in machine learning, deep learning, and the availability of large datasets. This evolution has opened new avenues for research and practical implementation across various domains, including business, healthcare, finance, education, and many others.

The main objectives of this Special Issue are as follows:

  • To explore cutting-edge techniques in NLP and data science.
  • To present new methodologies and tools for natural language processing.
  • To discuss current challenges and possible solutions in the application of NLP.
  • To examine case studies and real-world applications of NLP.
  • To promote interdisciplinary research and innovation in the field.

Topics of Interest:

This Special Issue welcomes original, high-quality contributions that address, but are not limited to, the following topics:

  • Advanced Language Models: Transformer-based and deep learning models for NLP, including BERT, GPT, and their variants.
  • Multilingual Natural Language Processing: Challenges and solutions for handling multilingual data.
  • Sentiment Analysis and Opinion Mining: Techniques and applications for extracting and analyzing online opinions.
  • Conversational Agents: Methods for automatic text generation and the development of intelligent chatbots.
  • Information Extraction: Techniques for automatic extraction of structured information from unstructured texts.
  • Applications of NLP in Healthcare.
  • Applications of NLP in Business and Finance.
  • Applications of NLP in the Green Economy.
  • Integration of NLP and Big Data: Methods for processing large volumes of textual data, scalability, and performance.

Prof. Dr. Vincenza Carchiolo
Dr. Michele Malgeri
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Natural Language Processing (NLP)
  • data science
  • sentiment analysis
  • opinion mining
  • information extraction

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (2 papers)


Research

14 pages, 423 KiB  
Article
A Small-Scale Evaluation of Large Language Models Used for Grammatical Error Correction in a German Children’s Literature Corpus: A Comparative Study
by Phuong Thao Nguyen, Bernd Nuss, Roswita Dressler and Katie Ovens
Appl. Sci. 2025, 15(5), 2476; https://doi.org/10.3390/app15052476 - 25 Feb 2025
Viewed by 690
Abstract
Grammatical error correction (GEC) has become increasingly important for enhancing the quality of OCR-scanned texts. This small-scale study explores the application of Large Language Models (LLMs) for GEC in German children’s literature, a genre with unique linguistic challenges due to modified language, colloquial expressions, and complex layouts that often lead to OCR-induced errors. While conventional rule-based and statistical approaches have been used in the past, advancements in machine learning and artificial intelligence have introduced models capable of more contextually nuanced corrections. Despite these developments, limited research has been conducted on evaluating the effectiveness of state-of-the-art LLMs, specifically in the context of German children’s literature. To address this gap, we fine-tuned encoder-based models GBERT and GELECTRA on German children’s literature, and compared their performance to decoder-based models GPT-4o and Llama series (versions 3.2 and 3.1) in a zero-shot setting. Our results demonstrate that all pretrained models, both encoder-based (GBERT, GELECTRA) and decoder-based (GPT-4o, Llama series), failed to effectively remove OCR-generated noise in children’s literature, highlighting the necessity of a preprocessing step to handle structural inconsistencies and artifacts introduced during scanning. This study also addresses the lack of comparative evaluations between encoder-based and decoder-based models for German GEC, with most prior work focusing on English. Quantitative analysis reveals that decoder-based models significantly outperform fine-tuned encoder-based models, with GPT-4o and Llama-3.1-70B achieving the highest accuracy in both error detection and correction. Qualitative assessment further highlights distinct model behaviors: GPT-4o demonstrates the most consistent correction performance, handling grammatical nuances effectively while minimizing overcorrection. Llama-3.1-70B excels in error detection but occasionally relies on frequency-based substitutions over meaning-driven corrections. Unlike earlier decoder-based models, which often exhibited overcorrection tendencies, our findings indicate that state-of-the-art decoder-based models strike a better balance between correction accuracy and semantic preservation. By identifying the strengths and limitations of different model architectures, this study enhances the accessibility and readability of OCR-scanned German children’s literature. It also provides new insights into the role of preprocessing in digitized text correction, the comparative performance of encoder- and decoder-based models, and the evolving correction tendencies of modern LLMs. These findings contribute to language preservation, corpus linguistics, and digital archiving, offering an AI-driven solution for improving the quality of digitized children’s literature while ensuring linguistic and cultural integrity. Future research should explore multimodal approaches that integrate visual context to further enhance correction accuracy for children’s books with image-embedded text.
(This article belongs to the Special Issue Applications of Natural Language Processing to Data Science)
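
To make the zero-shot decoder-based setup described in this abstract concrete, the sketch below asks a chat LLM to correct a noisy German sentence and scores exact-match accuracy against a gold reference. This is a minimal illustrative sketch, not the authors' code: the model name, the German prompt wording, the helper functions, and the example sentence pair are assumptions, and it presumes access to the OpenAI chat completions API with an OPENAI_API_KEY set in the environment.

```python
# Illustrative sketch only -- not the code used in the paper.
# Zero-shot grammatical error correction of OCR-noisy German with a decoder LLM.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def correct_sentence(noisy: str) -> str:
    """Ask the model to fix OCR and grammar errors, returning only the corrected sentence."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name
        temperature=0,
        messages=[
            {"role": "system",
             "content": ("Korrigiere Grammatik- und OCR-Fehler im folgenden deutschen Satz. "
                         "Gib nur den korrigierten Satz zurück.")},
            {"role": "user", "content": noisy},
        ],
    )
    return response.choices[0].message.content.strip()

def exact_match_accuracy(pairs) -> float:
    """Fraction of noisy sentences corrected exactly to their gold reference."""
    return sum(correct_sentence(noisy) == gold for noisy, gold in pairs) / len(pairs)

# Hypothetical noisy/gold pair standing in for an OCR-scanned children's book sentence.
pairs = [("Der kleine Bar gehtt in den Wald .", "Der kleine Bär geht in den Wald.")]
print(exact_match_accuracy(pairs))
```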

20 pages, 376 KiB  
Article
Comparison of Machine Learning Models for Sentiment Analysis of Big Turkish Web-Based Data
by Cemile Gökçe Özmen and Selim Gündüz
Appl. Sci. 2025, 15(5), 2297; https://doi.org/10.3390/app15052297 - 21 Feb 2025
Viewed by 1332
Abstract
E-commerce sites have generated large amounts of unstructured data as they allow millions of users to generate product reviews. Thus, although there have been significant improvements in the characteristics of big data, such as speed and volume, developing various analysis techniques to monitor, understand, and extract useful information from this web-based data has become challenging. This study aims to analyze cosmetic products on a Turkish-based e-commerce website with sentiment analysis and to create a new domain-specific Turkish sentiment dictionary model with manual labeling. In the study, a Turkish sentiment dictionary consisting of 65,378 words was created by manually labeling 875,455 product reviews for 24 cosmetic brands sold on the Turkey-based Trendyol e-commerce site, and sentiment analysis was performed using this dictionary. The dataset, divided into seven product groups, was analyzed using K-NN, SVM, DT, RF, and LR algorithms to address three classification problems. The algorithms were evaluated with comparative analysis using accuracy, precision, recall, and F1-score metrics. SVM gave the highest performance result with over 93% accuracy, 92% precision, 93% recall, and a 91% F1-score in all product groups. The dictionary model created for the cosmetics industry in the study helps businesses and researchers to use their resources more efficiently and save time by performing fast and low-cost analyses on large datasets of product reviews. Moreover, by analyzing customer feedback, brands can offer long-lasting and environmentally friendly products that align with customers’ feelings. Thus, businesses have the opportunity to develop or improve products.
(This article belongs to the Special Issue Applications of Natural Language Processing to Data Science)
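
As a rough illustration of the pipeline summarized in this abstract, the sketch below labels a few Turkish reviews with a toy sentiment lexicon, trains an SVM on TF-IDF features, and reports accuracy, precision, recall, and F1 with scikit-learn. It is a minimal sketch, not the authors' pipeline: the lexicon entries, example reviews, and labeling rule are invented stand-ins for the 65,378-word dictionary and the 875,455 manually labeled reviews described above.

```python
# Illustrative sketch only -- not the pipeline used in the paper.
# Dictionary-based sentiment labeling of Turkish reviews, then SVM classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Toy stand-in for the manually built 65,378-word sentiment dictionary.
LEXICON = {"harika": 1, "mükemmel": 1, "güzel": 1, "kötü": -1, "berbat": -1}

def dictionary_label(review: str) -> int:
    """Label a review positive (1) or negative (0) by summing lexicon scores."""
    score = sum(LEXICON.get(token, 0) for token in review.lower().split())
    return 1 if score >= 0 else 0

# Invented example reviews; the real study used 875,455 Trendyol reviews.
reviews = ["harika bir ürün", "berbat kötü paketleme", "güzel koku mükemmel",
           "kötü kalite berbat", "mükemmel fiyat harika", "kötü berbat deneyim"]
labels = [dictionary_label(r) for r in reviews]

X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.33, random_state=42, stratify=labels)

vectorizer = TfidfVectorizer()
classifier = LinearSVC()
classifier.fit(vectorizer.fit_transform(X_train), y_train)
predictions = classifier.predict(vectorizer.transform(X_test))

precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, predictions, average="weighted", zero_division=0)
print(f"accuracy={accuracy_score(y_test, predictions):.2f} "
      f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```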
