Applications of Natural Language Processing to Data Science

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 30 June 2025 | Viewed by 2532

Special Issue Editors


Prof. Dr. Vincenza Carchiolo
Guest Editor
Dipartimento di Ingegneria Elettrica Elettronica Informatica, Università di Catania, Viale Andrea Doria 9, 95127 Catania, Italy
Interests: network science; natural language processing; data analysis; machine learning; information spread; distributed systems

Dr. Michele Malgeri
Guest Editor
Dipartimento di Ingegneria Elettrica, Elettronica Informatica (DIEEI), Università di Catania, I-95125 Catania, Italy
Interests: information security; machine learning; big data analysis; complex systems; IoT; artificial intelligence; social networking

Special Issue Information

Dear Colleagues,

This Special Issue aims to explore advanced techniques and emerging applications of Natural Language Processing (NLP) within the context of data science. In recent years, the field of NLP has made significant strides due to advancements in machine learning, deep learning, and the availability of large datasets. This evolution has opened new avenues for research and practical implementation across various domains, including business, healthcare, finance, education, and many others.

The main objectives of this Special Issue are as follows:

  • To explore cutting-edge techniques in NLP and data science.
  • To present new methodologies and tools for natural language processing.
  • To discuss current challenges and possible solutions in the application of NLP.
  • To examine case studies and real-world applications of NLP.
  • To promote interdisciplinary research and innovation in the field.

Topics of Interest:

This Special Issue welcomes original, high-quality contributions that address, but are not limited to, the following topics:

  • Advanced Language Models: Transformer-based and deep learning models for NLP, including BERT, GPT, and their variants.
  • Multilingual Natural Language Processing: Challenges and solutions for handling multilingual data.
  • Sentiment Analysis and Opinion Mining: Techniques and applications for extracting and analyzing online opinions.
  • Conversational Agents: Methods for automatic text generation and the development of intelligent chatbots.
  • Information Extraction: Techniques for automatic extraction of structured information from unstructured texts.
  • Applications of NLP in Healthcare.
  • Applications of NLP in Business and Finance.
  • Applications of NLP in the Green Economy.
  • Integration of NLP and Big Data: Methods for processing large volumes of textual data, scalability, and performance.

Prof. Dr. Vincenza Carchiolo
Dr. Michele Malgeri
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Natural Language Processing (NLP)
  • data science
  • sentiment analysis
  • opinion mining
  • information extraction

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (2 papers)


Research

14 pages, 423 KiB  
Article
A Small-Scale Evaluation of Large Language Models Used for Grammatical Error Correction in a German Children’s Literature Corpus: A Comparative Study
by Phuong Thao Nguyen, Bernd Nuss, Roswita Dressler and Katie Ovens
Appl. Sci. 2025, 15(5), 2476; https://doi.org/10.3390/app15052476 - 25 Feb 2025
Viewed by 690
Abstract
Grammatical error correction (GEC) has become increasingly important for enhancing the quality of OCR-scanned texts. This small-scale study explores the application of Large Language Models (LLMs) for GEC in German children’s literature, a genre with unique linguistic challenges due to modified language, colloquial expressions, and complex layouts that often lead to OCR-induced errors. While conventional rule-based and statistical approaches have been used in the past, advancements in machine learning and artificial intelligence have introduced models capable of more contextually nuanced corrections. Despite these developments, limited research has been conducted on evaluating the effectiveness of state-of-the-art LLMs, specifically in the context of German children’s literature. To address this gap, we fine-tuned encoder-based models GBERT and GELECTRA on German children’s literature, and compared their performance to decoder-based models GPT-4o and Llama series (versions 3.2 and 3.1) in a zero-shot setting. Our results demonstrate that all pretrained models, both encoder-based (GBERT, GELECTRA) and decoder-based (GPT-4o, Llama series), failed to effectively remove OCR-generated noise in children’s literature, highlighting the necessity of a preprocessing step to handle structural inconsistencies and artifacts introduced during scanning. This study also addresses the lack of comparative evaluations between encoder-based and decoder-based models for German GEC, with most prior work focusing on English. Quantitative analysis reveals that decoder-based models significantly outperform fine-tuned encoder-based models, with GPT-4o and Llama-3.1-70B achieving the highest accuracy in both error detection and correction. Qualitative assessment further highlights distinct model behaviors: GPT-4o demonstrates the most consistent correction performance, handling grammatical nuances effectively while minimizing overcorrection. Llama-3.1-70B excels in error detection but occasionally relies on frequency-based substitutions over meaning-driven corrections. Unlike earlier decoder-based models, which often exhibited overcorrection tendencies, our findings indicate that state-of-the-art decoder-based models strike a better balance between correction accuracy and semantic preservation. By identifying the strengths and limitations of different model architectures, this study enhances the accessibility and readability of OCR-scanned German children’s literature. It also provides new insights into the role of preprocessing in digitized text correction, the comparative performance of encoder- and decoder-based models, and the evolving correction tendencies of modern LLMs. These findings contribute to language preservation, corpus linguistics, and digital archiving, offering an AI-driven solution for improving the quality of digitized children’s literature while ensuring linguistic and cultural integrity. Future research should explore multimodal approaches that integrate visual context to further enhance correction accuracy for children’s books with image-embedded text.
(This article belongs to the Special Issue Applications of Natural Language Processing to Data Science)
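
To make the zero-shot decoder-based setup described in this abstract concrete, the sketch below asks a chat LLM to correct a noisy German sentence and scores exact-match accuracy against a gold reference. This is a minimal illustrative sketch, not the authors' code: the model name, the German prompt wording, the helper functions, and the example sentence pair are assumptions, and it presumes access to the OpenAI chat completions API with an OPENAI_API_KEY set in the environment.

```python
# Illustrative sketch only -- not the code used in the paper.
# Zero-shot grammatical error correction of OCR-noisy German with a decoder LLM.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def correct_sentence(noisy: str) -> str:
    """Ask the model to fix OCR and grammar errors, returning only the corrected sentence."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name
        temperature=0,
        messages=[
            {"role": "system",
             "content": ("Korrigiere Grammatik- und OCR-Fehler im folgenden deutschen Satz. "
                         "Gib nur den korrigierten Satz zurück.")},
            {"role": "user", "content": noisy},
        ],
    )
    return response.choices[0].message.content.strip()

def exact_match_accuracy(pairs) -> float:
    """Fraction of noisy sentences corrected exactly to their gold reference."""
    return sum(correct_sentence(noisy) == gold for noisy, gold in pairs) / len(pairs)

# Hypothetical noisy/gold pair standing in for an OCR-scanned children's book sentence.
pairs = [("Der kleine Bar gehtt in den Wald .", "Der kleine Bär geht in den Wald.")]
print(exact_match_accuracy(pairs))
```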

20 pages, 376 KiB  
Article
Comparison of Machine Learning Models for Sentiment Analysis of Big Turkish Web-Based Data
by Cemile Gökçe Özmen and Selim Gündüz
Appl. Sci. 2025, 15(5), 2297; https://doi.org/10.3390/app15052297 - 21 Feb 2025
Viewed by 1332
Abstract
E-commerce sites have generated large amounts of unstructured data as they allow millions of users to generate product reviews. Thus, although there have been significant improvements in the characteristics of big data, such as speed and volume, developing various analysis techniques to monitor, understand, and extract useful information from this web-based data has become challenging. This study aims to analyze cosmetic products on a Turkish-based e-commerce website with sentiment analysis and to create a new domain-specific Turkish sentiment dictionary model with manual labeling. In the study, a Turkish sentiment dictionary consisting of 65,378 words was created by manually labeling 875,455 product reviews for 24 cosmetic brands sold on the Turkey-based Trendyol e-commerce site, and sentiment analysis was performed using this dictionary. The dataset, divided into seven product groups, was analyzed using K-NN, SVM, DT, RF, and LR algorithms to address three classification problems. The algorithms were evaluated with comparative analysis using accuracy, precision, recall, and F1-score metrics. SVM gave the highest performance result with over 93% accuracy, 92% precision, 93% recall, and a 91% F1-score in all product groups. The dictionary model created for the cosmetics industry in the study helps businesses and researchers to use their resources more efficiently and save time by performing fast and low-cost analyses on large datasets of product reviews. Moreover, by analyzing customer feedback, brands can offer long-lasting and environmentally friendly products that align with customers’ feelings. Thus, businesses have the opportunity to develop or improve products.
(This article belongs to the Special Issue Applications of Natural Language Processing to Data Science)
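
As a rough illustration of the pipeline summarized in this abstract, the sketch below labels a few Turkish reviews with a toy sentiment lexicon, trains an SVM on TF-IDF features, and reports accuracy, precision, recall, and F1 with scikit-learn. It is a minimal sketch, not the authors' pipeline: the lexicon entries, example reviews, and labeling rule are invented stand-ins for the 65,378-word dictionary and the 875,455 manually labeled reviews described above.

```python
# Illustrative sketch only -- not the pipeline used in the paper.
# Dictionary-based sentiment labeling of Turkish reviews, then SVM classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Toy stand-in for the manually built 65,378-word sentiment dictionary.
LEXICON = {"harika": 1, "mükemmel": 1, "güzel": 1, "kötü": -1, "berbat": -1}

def dictionary_label(review: str) -> int:
    """Label a review positive (1) or negative (0) by summing lexicon scores."""
    score = sum(LEXICON.get(token, 0) for token in review.lower().split())
    return 1 if score >= 0 else 0

# Invented example reviews; the real study used 875,455 Trendyol reviews.
reviews = ["harika bir ürün", "berbat kötü paketleme", "güzel koku mükemmel",
           "kötü kalite berbat", "mükemmel fiyat harika", "kötü berbat deneyim"]
labels = [dictionary_label(r) for r in reviews]

X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.33, random_state=42, stratify=labels)

vectorizer = TfidfVectorizer()
classifier = LinearSVC()
classifier.fit(vectorizer.fit_transform(X_train), y_train)
predictions = classifier.predict(vectorizer.transform(X_test))

precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, predictions, average="weighted", zero_division=0)
print(f"accuracy={accuracy_score(y_test, predictions):.2f} "
      f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```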
