LSTM-Based News Article Category Classification
Abstract
1. Introduction
- To develop a deep learning model that uses LSTM networks for classifying news articles into 14 categories;
- To achieve higher classification accuracy and better generalization compared to traditional models while handling a larger and more diverse set of categories.
2. Literature Review
3. System Description
3.1. Existing System
- Leveraging an advanced deep learning model (LSTM);
- Utilizing a richer, more comprehensive dataset that spans 14 distinct news categories;
- Not only improving classification accuracy and robustness, but also enhancing the model’s ability to handle complex and nuanced text, which is often encountered in real-world news content.
3.2. Proposed System
- The Data Loading Module is responsible for loading the dataset from the CSV file.
- The Data Preprocessing Module handles the preparation of raw data, as illustrated in Figure 2. This includes the following steps (a short code sketch of the whole preprocessing pipeline appears after this list):
- Lowercase Conversion: All text data was transformed to lowercase letters to ensure consistency in the representation of words and to facilitate subsequent processing steps.
- Special Characters Removal: News articles frequently contain special symbols, such as currency signs (e.g., £, €, $), that add noise without carrying information useful for classification. These symbols are removed in this step.
- Punctuation Removal: Punctuation marks, including periods, commas, and quotation marks, were removed from the text data. This step was performed to eliminate irrelevant characters that do not convey meaningful information for text classification tasks.
- Label Encoding: The categorical target variable, “category,” needed to be converted into a numerical format for model consumption. This was achieved using Label Encoding, where each unique news category was assigned a unique integer identifier. These integer labels were then one-hot encoded into binary vectors, where a ‘1’ in the position corresponding to a category indicates its presence and ‘0’ otherwise.
- Tokenization: A Tokenizer was initialized with a maximum vocabulary size of 20,000 unique words. The tokenizer builds a vocabulary based on the most frequent words in the training corpus. Words not found in this vocabulary are designated as “out-of-vocabulary” tokens. This process converts each word in the text into a corresponding integer ID from the vocabulary.
- Sequence Padding: After tokenization, the resulting sequences of integer IDs varied in length. The LSTM model requires fixed-size input; hence, each sequence was brought to a fixed maximum length of 300. Shorter sequences were padded with zeros and longer sequences were truncated. This standardization is critical for batch processing in neural networks, as it allows all input sequences in a batch to have the same dimension.
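The preprocessing steps above can be expressed compactly in Python. The following is a minimal sketch, not the authors’ exact code: it assumes a pandas/scikit-learn/Keras toolchain, a CSV with “text” and “category” columns, an illustrative file name, a simple cleaning regex, and an 80/20 train–test split. The 20,000-word vocabulary and the sequence length of 300 follow the description above.

```python
# Minimal preprocessing sketch. Column names, file name, cleaning regex,
# padding position, and the train/test split are assumptions; vocabulary
# size 20,000 and sequence length 300 follow the text above.
import re
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

df = pd.read_csv("news_articles.csv")  # Data Loading Module (path assumed)

def clean(text: str) -> str:
    text = text.lower()                          # lowercase conversion
    text = re.sub(r"[^a-z\s]", " ", text)        # drop special characters and punctuation
    return re.sub(r"\s+", " ", text).strip()

df["clean_text"] = df["text"].astype(str).apply(clean)

encoder = LabelEncoder()                                   # category -> integer ID
y = to_categorical(encoder.fit_transform(df["category"]))  # one-hot label vectors

tokenizer = Tokenizer(num_words=20000, oov_token="<OOV>")  # 20,000-word vocabulary
tokenizer.fit_on_texts(df["clean_text"])
sequences = tokenizer.texts_to_sequences(df["clean_text"])
X = pad_sequences(sequences, maxlen=300, padding="post", truncating="post")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```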
- Model Building and Training Module: This module assembles and trains the neural network, whose layers are described below; a code sketch of the full layer stack follows the layer descriptions.
- Input Layer: The initial stage of this architecture involves preparing raw news article text for machine processing. This begins with the “Input” of the raw text, which is then transformed into “Tokens”. This conversion is essential because the downstream layers operate on sequences of numerical token IDs rather than raw text.
- Embedding Layer: Following tokenization, the “Embedding Layer” transforms the discrete tokens into continuous, dense numerical vectors, labeled in Figure 3 as “Word Embeddings—128 D Vectors”. The embedding layer maps each token to a fixed-size vector of dimension 128. This dimension was chosen to strike a balance between expressive power and computational efficiency: it captures the semantic relationships between words while keeping the model as small as possible. These vectors are not static; they are learned by the model during training.
- LSTM Layer: The “LSTM Layer” is the main sequential processing component of the architecture. It accepts the sequence of 128-dimensional word embeddings as its input and processes the embeddings one at a time while maintaining an internal “cell state” (long-term memory) and a “hidden state” (short-term memory). The parameter “Unit = 64” indicates that this layer consists of 64 memory cells, as shown in Figure 3. This provides enough capacity for the model to learn long-range dependencies while avoiding overfitting and unnecessary training time. Each of the 64 units contributes to preserving and processing information over time, which enables the LSTM to learn complex patterns and dependencies in news article content.
- Max Pooling 1-D Layer: After the LSTM layer, the “Max Pooling 1D” layer performs a down-sampling operation on the sequential output of the previous layer. The LSTM emits a sequence of hidden states, one per input token; the pooling layer condenses this long sequence into a compact, fixed-size representation. Max Pooling 1D slides a window (whose size is set by the pool_size parameter) across the input sequence and keeps the highest value within each window, thereby retaining the most prominent feature from each local region of the LSTM output.
- Dense Layer with ReLU Activation: The “Dense Layer,” also known as a fully connected layer, receives the flattened and aggregated features from the Max Pooling 1D layer. Each neuron in the dense layer is connected to all input features from the preceding layer, enabling it to learn complex combinations and interactions among these features. The “Activation = Relu” (Rectified Linear Unit) applied to the output of this layer introduces the non-linearity needed to model such interactions.
- Dropout Layer: A dropout rate of 0.5 was applied as a regularization strategy, a widely used setting that mitigates overfitting by randomly deactivating half of the neurons during training. These design decisions were made to improve the model’s generalization and stability across the dataset.
- Output Layer with Softmax Activation: The “Output Layer” is the final stage of the neural network, responsible for producing the ultimate predictions. It takes the processed and regularized features from the preceding layers and transforms them into a suitable format for categorizing the input news article. The Softmax activation function is applied in this layer. Softmax is designed for multi-class classification problems, where an input belongs to one of several mutually exclusive categories: it converts the raw output scores (logits) from the network into a probability distribution across all defined classes.
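Concretely, Softmax maps logits z_1, …, z_K to probabilities p_i = exp(z_i) / Σ_j exp(z_j), with K = 14 categories here. The layer stack described above can be written as a short Keras model. The following is a minimal sketch, not the authors’ exact configuration: the pool_size of 4, the Flatten step, the 64-unit dense layer, the Adam optimizer, and the epoch/batch settings are assumptions, while the vocabulary size (20,000), sequence length (300), embedding dimension (128), 64 LSTM units, dropout rate (0.5), and the 14-way Softmax output follow the description above.

```python
# Minimal Keras sketch of the described architecture. pool_size=4, the
# 64-unit dense layer, the optimizer, and the training settings are
# assumptions; the remaining sizes follow the text above.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Embedding, LSTM, MaxPooling1D,
                                     Flatten, Dense, Dropout)

NUM_CLASSES = 14

model = Sequential([
    Input(shape=(300,), dtype="int32"),              # padded token-ID sequences
    Embedding(input_dim=20000, output_dim=128),      # 128-D word embeddings
    LSTM(64, return_sequences=True),                 # 64 memory cells; one hidden state per token
    MaxPooling1D(pool_size=4),                       # down-sample the sequence of hidden states
    Flatten(),                                       # flatten pooled features for the dense layer
    Dense(64, activation="relu"),                    # fully connected layer with ReLU
    Dropout(0.5),                                    # drop half the units during training
    Dense(NUM_CLASSES, activation="softmax"),        # probability distribution over 14 categories
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(X_train, y_train,
                    validation_split=0.1,
                    epochs=5,
                    batch_size=64)
```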
- Evaluation Module: This module evaluates the performance of the model using metrics such as accuracy and precision. It ensures that the model is effectively categorizing news articles.
- Custom Input Categorization Module: Users can supply a custom news article’s title and body, which the system assigns to one of the learned categories using the trained model; a short sketch of evaluation and custom-input prediction follows below.
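To make the Evaluation and Custom Input Categorization Modules concrete, here is a minimal sketch that reuses the model, tokenizer, label encoder, cleaning function, and held-out split from the earlier sketches; the example headline and body are purely illustrative.

```python
# Evaluate on the held-out split, then categorize a user-supplied article.
# Reuses `model`, `tokenizer`, `encoder`, `clean`, `X_test`, `y_test`
# from the sketches above.
import numpy as np

loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {accuracy:.3f}")

def categorize(title: str, body: str) -> str:
    """Clean, tokenize, pad, and classify a custom article."""
    text = clean(title + " " + body)             # same cleaning as the training data
    seq = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(seq, maxlen=300, padding="post", truncating="post")
    probs = model.predict(padded, verbose=0)[0]  # softmax probabilities over categories
    return encoder.inverse_transform([int(np.argmax(probs))])[0]

print(categorize("Stocks rally after rate decision",
                 "Markets rose sharply on Tuesday as the central bank held rates steady."))
```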
3.3. Dataset Description
4. Result Analysis
5. Conclusions
6. Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
| --- | --- |
| NLP | Natural Language Processing |
| LSTM | Long Short-Term Memory |
| RNN | Recurrent Neural Network |
| CNN | Convolutional Neural Network |
| CSE | Computer Science & Engineering |
| SVM | Support Vector Machines |
| NB | Naïve Bayes |
| KNN | K-Nearest Neighbors |
| TF-IDF | Term Frequency–Inverse Document Frequency |
| DL | Deep Learning |
References




