Open Access Article
Comparative Analysis of Natural Language Processing Techniques in the Classification of Press Articles
by Kacper Piasta and Rafał Kotas *
Department of Microelectronics and Computer Science, Lodz University of Technology, ul. Wólczańska 221, 93-005 Łódź, Poland
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(17), 9559; https://doi.org/10.3390/app15179559 (registering DOI)
Submission received: 30 July 2025 / Revised: 27 August 2025 / Accepted: 28 August 2025 / Published: 30 August 2025
Abstract
This study undertook a comprehensive review and comparative analysis of natural language processing techniques for news article classification, with a particular focus on Java libraries. The dataset comprised more than 200,000 news metadata records sourced from The Huffington Post. Both traditional algorithms based on mathematical statistics and deep learning methods were evaluated. The libraries chosen for testing were Apache OpenNLP, Stanford CoreNLP, Waikato Weka, and the Hugging Face ecosystem with the PyTorch backend. The efficacy of the trained models in predicting specific topics was evaluated, and diverse methods for feature extraction and the analysis of word-vector representations were explored. The study considered hardware resource management, implementation simplicity, training time, and the quality of the resulting model in terms of detection, and it examined a range of techniques for attribute selection, feature filtering, vector representation, and the handling of imbalanced datasets. Advanced techniques for word selection and named entity recognition were employed. The models and configurations were compared in terms of their performance and the resources they consumed. Furthermore, the study addressed the difficulties encountered when processing lengthy texts with transformer neural networks and presented potential solutions such as sequence truncation and segment analysis. The elevated computational cost inherent to Java-based languages may present challenges in machine learning tasks. The OpenNLP model achieved 84% accuracy, Weka and CoreNLP attained 86% and 88%, respectively, and DistilBERT emerged as the top performer, with an accuracy of 92%. Deep learning models demonstrated superior performance, shorter training time, and greater ease of implementation compared to conventional statistical algorithms.