Exploring the Effects of Pre-Processing Techniques on Topic Modeling of an Arabic News Article Data Set
Abstract
1. Introduction
2. Related Work
Topic Modeling Comparative Studies Using Arabic Data Sets
3. Methodology
3.1. Data Set
3.2. Pre-Processing Techniques
- Text Normalization: Standardizing text by removing diacritics, numbers, punctuation, and special characters and unifying character variations [17].
- Tokenization: Splitting text into individual words or tokens, considering Arabic-specific linguistic rules [12].
- Stop Word Removal: Eliminating common Arabic words that do not contribute significant meaning [20].
- Stemming and Lemmatization: Reducing words to their root forms using two different stemmers, ISRI and Tashaphyne:
  - ISRI Stemmer: An Arabic stemmer based on the ISRI algorithm developed by the Information Science Research Institute. This stemmer removes prefixes, suffixes, and infixes from words and incorporates some degree of morphological analysis to accurately determine the roots of Arabic words. This approach generally enhances the accuracy of root identification [21].
  - Tashaphyne Stemmer: Provided by the Tashaphyne library, this is a light Arabic stemmer that merges light stemming and root-based stemming principles. It segments Arabic words into roots and patterns while also removing common prefixes and suffixes, without performing a full morphological analysis. The Tashaphyne stemmer has achieved remarkable results, outperforming other competitive stemmers (e.g., the Khoja, ISRI, Motaz/Light10, FARASA, and Assem stemmers) in extracting roots and stems [22].
- n-Gram Construction: Creating combinations of adjacent words (n-grams) to capture more context within the text [23]. A combined code sketch of these pre-processing steps follows this list.
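The following is a minimal, illustrative sketch of such a pre-processing pipeline. The library calls (NLTK's Arabic stop-word list and ISRIStemmer, Tashaphyne's ArabicLightStemmer, and Gensim's Phrases model) are standard, but the specific normalization rules and thresholds shown here are assumptions rather than the exact configuration used in this study.

```python
# Illustrative Arabic pre-processing pipeline (assumed configuration, not the
# exact rules used in the study). Requires: nltk.download("stopwords").
import re
from nltk.corpus import stopwords
from nltk.stem.isri import ISRIStemmer
from tashaphyne.stemming import ArabicLightStemmer
from gensim.models.phrases import Phrases, Phraser

DIACRITICS = re.compile(r"[\u064B-\u065F\u0670]")        # Arabic tashkeel marks
ARABIC_STOPWORDS = set(stopwords.words("arabic"))

def normalize(text):
    """Text normalization: strip diacritics and non-Arabic characters, unify letter variants."""
    text = DIACRITICS.sub("", text)
    text = re.sub(r"[إأآا]", "ا", text)                   # unify alef variants (assumed rule)
    text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)       # drop digits, punctuation, Latin text
    return text

def preprocess(text, stemmer=None):
    """Tokenize, remove stop words, and optionally stem with ISRI or Tashaphyne."""
    tokens = [t for t in normalize(text).split() if t not in ARABIC_STOPWORDS]
    if stemmer == "isri":
        isri = ISRIStemmer()
        tokens = [isri.stem(t) for t in tokens]           # root-based stemming
    elif stemmer == "tashaphyne":
        light = ArabicLightStemmer()
        tokens = [light.light_stem(t) for t in tokens]    # light stemming
    return tokens

def add_bigrams(tokenized_docs):
    """n-gram construction: merge frequent adjacent word pairs into bigram tokens."""
    bigram = Phraser(Phrases(tokenized_docs, min_count=5, threshold=10))  # assumed thresholds
    return [bigram[doc] for doc in tokenized_docs]
```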
3.3. Modeling
- LDA: A probabilistic model that identifies topics based on word distributions within documents. It aims to infer these concealed topics by examining visible words [24]. Prior research has indicated the efficacy and utility of LDA methods in the field of topic modeling [25]. The LDA model was implemented using the Gensim library, with a structured approach to data preparation and model training. Initially, documents were tokenized into lists of words, and a dictionary was created using the gensim.corpora.Dictionary class to map words to unique integer IDs. This dictionary was then used to generate a bag-of-words representation of the documents, which served as the model's input. For training, the model parameters were configured as follows: the number of topics (num_topics) was set to 7 based on iterative optimization for the data set, while workers was set to 4 to enable efficient parallel computation. The corpus parameter utilized the bag-of-words representation, and id2word was assigned to the dictionary mapping of IDs to words. A code sketch of this setup follows this list.
- NMF: A linear algebraic model that uses statistical techniques to identify thematic structures within a collection of texts. It employs a decompositional strategy based on matrix factorization, placing it within the realm of linear algebraic methods, and stands out as an unsupervised approach for simplifying the complexity of non-negative matrices. Additionally, when it comes to mining topics from brief texts, learning models based on NMF have proven to be a potent alternative to those predicated on LDA [26]. The NMF model was implemented using the Gensim library, with a focus on consistent data preparation and training parameters. The text data were tokenized and transformed into a bag-of-words format using the gensim.corpora.Dictionary class. For model training, the number of topics (num_topics) was set to 7 to ensure consistency with the LDA and BERTopic models. The corpus parameter utilized the bag-of-words representation, and id2word was assigned to the dictionary mapping of IDs to words. A corresponding code sketch follows this list.
- BERTopic: A method for topic modeling that employs an unsupervised clustering approach. It leverages Bidirectional Encoder Representations from Transformers (BERT) to generate contextual embeddings of sentences. These embeddings encapsulate the semantic details of sentences, enabling the algorithm to identify topics through their contextual significance. The BERTopic methodology encompasses a three-stage process. Initially, it transforms documents into embedding representations using a pre-trained language model. Subsequently, it reduces the dimensionality of these embeddings to facilitate more effective clustering. In the final stage, BERTopic derives topic representations from the clusters of documents by applying a class-based variant of TF-IDF, as detailed by Grootendorst (2022) [9]. A key strength of BERTopic lies in its precision in forming clusters and its capability to propose names for topics based on these clusters. Unlike traditional topic modeling approaches, BERTopic does not require the number of topics to be pre-defined, offering a flexible and dynamic framework for topic discovery [27]. The BERTopic model utilized pre-trained transformer models tailored specifically for Arabic text analysis, enabling the generation of robust semantic embeddings. The textual data were transformed into high-dimensional vector representations using a variety of pre-trained models. During training, the dimensionality of the embeddings was reduced to enhance clustering efficiency. To ensure consistency with LDA and NMF, the reduce_topics method was employed, standardizing the number of topics to 7 for direct comparison across models. A code sketch of this pipeline follows this list.
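For the LDA configuration described above, a minimal Gensim sketch is shown below; docs is assumed to be the list of pre-processed, tokenized articles, and any parameters not mentioned in the text are left at their Gensim defaults.

```python
# Sketch of the described Gensim LDA setup; `docs` is the tokenized corpus
# produced by the pre-processing pipeline (assumed variable name).
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

dictionary = Dictionary(docs)                        # map words to unique integer IDs
corpus = [dictionary.doc2bow(doc) for doc in docs]   # bag-of-words representation

lda_model = LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    num_topics=7,    # chosen via iterative optimization for this data set
    workers=4,       # parallel workers, as configured in the experiments
)
print(lda_model.print_topics(num_topics=7))
```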
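The NMF setup follows the same data preparation; the sketch below uses Gensim's Nmf class under the same assumptions, reusing the dictionary and bag-of-words corpus from the LDA sketch.

```python
# Sketch of the described Gensim NMF setup, reusing `corpus` and `dictionary`
# from the LDA sketch above.
from gensim.models.nmf import Nmf

nmf_model = Nmf(
    corpus=corpus,
    id2word=dictionary,
    num_topics=7,    # kept at 7 for consistency with LDA and BERTopic
)
print(nmf_model.show_topics(num_topics=7))
```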
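The BERTopic pipeline can be sketched as follows. The embedding checkpoint named here is one of the Arabic models listed in Table 3, docs holds the (optionally cleaned or stemmed) article strings, and the reduce_topics call assumes a recent BERTopic release; the exact dimensionality-reduction and clustering settings of the original experiments are not reproduced here.

```python
# Sketch of the described BERTopic pipeline; the checkpoint name is one of the
# Table 3 models, and `docs` holds the article strings (assumed variable name).
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# sentence-transformers wraps a plain BERT checkpoint with mean pooling to
# produce sentence embeddings.
embedding_model = SentenceTransformer("aubmindlab/bert-base-arabertv02")

topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs)

# Standardize the number of topics to 7 for direct comparison with LDA and NMF
# (recent BERTopic API; older releases also require the topics list).
topic_model.reduce_topics(docs, nr_topics=7)
print(topic_model.get_topic_info())
```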
3.4. Evaluation Metrics
- Topic Coherence: Interpretability and topic coherence are critical attributes of an effective topic model. A topic is defined as a distribution of words that collectively represent a single semantic theme. To determine whether or not the words within a topic are meaningful to humans, their coherence must be evaluated—specifically, the degree to which the words are semantically related. In this study, we assessed topic coherence using NPMI, a metric that ranges from −1 to 1, allowing for direct comparison with findings from previous research [28].
- Topic Diversity: This metric measures the semantic variety of the topics produced by a model. Topic diversity can be quantified by calculating the proportion of unique words among the top 25 words across all topics. The higher the score on this metric, the more diverse the topics are, suggesting a successful topic model. Conversely, a low diversity score implies that the topics are redundant, reflecting the model’s inability to effectively separate the different themes within the text corpus. It is important to consider how the number of topics chosen for the model influences this diversity. Selecting too many topics may lead to a convergence of similar themes, resulting in a loss of distinctiveness. On the other hand, opting for too few topics may yield overly broad and less interpretable topics [15].
- Perplexity: This is a popular metric for evaluating probabilistic models like LDA, where it measures how well a probability model predicts a sample. However, its applicability is limited when it comes to methods like NMF and BERTopic, because these methods are based on different underlying principles and computations [15]. Short computation sketches for these three metrics are given after this list.
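For the NPMI coherence reported in the result tables, Gensim's CoherenceModel can be used as sketched below; lda_model, docs, and dictionary are the assumed objects from the modeling sketches in Section 3.3, and the same topics= interface can be used with the top-word lists of NMF or BERTopic.

```python
# NPMI topic coherence via Gensim (values lie in [-1, 1]); reuses `lda_model`,
# `docs`, and `dictionary` from the modeling sketches above.
from gensim.models import CoherenceModel

coherence_model = CoherenceModel(
    model=lda_model,        # alternatively pass topics=<list of top-word lists>
    texts=docs,
    dictionary=dictionary,
    coherence="c_npmi",     # the NPMI variant used in this study
)
print("NPMI coherence:", coherence_model.get_coherence())
```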
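Topic diversity, as defined above, reduces to a short helper function; the example call assumes the Gensim LDA model from the earlier sketch, but any list of per-topic top-word lists can be passed in.

```python
# Topic diversity: proportion of unique words among the top-25 words of all topics.
def topic_diversity(topics, top_k=25):
    """`topics` is a list of per-topic lists of top words (highest weight first)."""
    top_words = [word for topic in topics for word in topic[:top_k]]
    return len(set(top_words)) / len(top_words)

# Example with the Gensim LDA model from the earlier sketch (assumed variables):
lda_topics = [[word for word, _ in lda_model.show_topic(t, topn=25)]
              for t in range(lda_model.num_topics)]
print("Topic diversity:", topic_diversity(lda_topics))
```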
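Perplexity is reported only for LDA; with Gensim it can be derived from the per-word likelihood bound, as sketched below (again reusing lda_model and corpus from the earlier sketch).

```python
# Perplexity for the Gensim LDA model; Gensim reports a per-word likelihood
# bound, and perplexity is 2 raised to the negative bound.
import numpy as np

bound = lda_model.log_perplexity(corpus)   # per-word likelihood bound
print("Perplexity:", np.exp2(-bound))
```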
4. Results
4.1. Training Experiments
1. LDA Experiments: The results of the LDA experiments are summarized in Table 1.
- Original Articles without Pre-processing: This baseline experiment resulted in a low coherence score (0.012) and moderate topic diversity (0.571). The model exhibited high perplexity (17,652.513), indicating difficulty in predicting the unseen data, as well as the longest training time (649.00 s).
- Cleaning: After cleaning the text, the coherence score improved significantly to 0.040 and topic diversity increased to 0.729. The perplexity decreased to 14,655.042 and the training time was reduced to 380.82 s, reflecting the benefits of removing noise from the data.
- Cleaning + Stemming (Tashaphyne): This configuration achieved the highest coherence score (0.051) among the LDA experiments, with a slight decrease in topic diversity (0.724). The perplexity dropped dramatically to 2836.581 and the training time decreased further to 289.69 s. This indicates that Tashaphyne stemming effectively enhanced the model’s interpretability and prediction accuracy.
- Cleaning + Stemming (ISRI): While this setup resulted in a slightly lower coherence score (0.037) compared to Tashaphyne, it achieved the lowest perplexity (1424.293) among the LDA experiments, suggesting superior predictive performance. The training time was similar to the Tashaphyne setup, at 289.51 s.

Table 1. Summary of experimental results for the Latent Dirichlet Allocation (LDA) model.

| Pre-Processing | NPMI Coherence Score ↑ | Topic Diversity | Perplexity | Training Time (s) |
|---|---|---|---|---|
| Original articles (without pre-processing) | 0.012 | 0.571 | 17,652.513 | 649.00 |
| Cleaning | 0.040 | 0.729 | 14,655.042 | 380.82 |
| Cleaning + stemming (Tashaphyne) | **0.051** | 0.724 | 2836.581 | 289.69 |
| Cleaning + stemming (ISRI) | 0.037 | 0.643 | 1424.293 | 289.51 |

Note: Bold value indicates the best NPMI score among the compared configurations.
2. NMF Experiments: NMF was also tested under similar pre-processing configurations, with the results summarized in Table 2.
- Original Articles without Pre-processing: The NMF model initially achieved a coherence score of 0.032 and a lower topic diversity (0.557) than LDA. The training time was 2741.79 s, indicating that the model struggled with raw data.
- Cleaning: After cleaning, the coherence score improved to 0.038 and the topic diversity increased to 0.624. The training time significantly decreased to 1084.76 s, demonstrating that cleaning had a substantial positive impact.
- Cleaning + Stemming (Tashaphyne): This configuration resulted in the highest coherence score (0.071) and the best topic diversity (0.714) across all NMF experiments. The training time was further reduced to 528.50 s. The Tashaphyne stemmer clearly improved the model’s ability to produce coherent and diverse topics.
- Cleaning + Stemming (ISRI): The ISRI stemming method also improved coherence (0.053) and diversity (0.681), although not as much as Tashaphyne. The training time was the lowest among all NMF experiments, at 396.90 s, highlighting ISRI's efficiency.

Table 2. Summary of experimental results for the Non-Negative Matrix Factorization (NMF) model.

| Pre-Processing | NPMI Coherence Score ↑ | Topic Diversity | Training Time (s) |
|---|---|---|---|
| Original articles (without pre-processing) | 0.032 | 0.557 | 2741.79 |
| Cleaning | 0.038 | 0.624 | 1084.76 |
| Cleaning + stemming (Tashaphyne) | **0.071** | 0.714 | 528.50 |
| Cleaning + stemming (ISRI) | 0.053 | 0.681 | 396.90 |

Note: Bold value indicates the best NPMI score among the compared configurations.

Experiment 3, which involved cleaning and stemming using Tashaphyne, provided the highest coherence score (0.071) and topic diversity (0.714), making it the preferred choice for this technique. This indicates that Tashaphyne stemming significantly enhances the model's ability to produce coherent and diverse topics. Although Experiment 4 (with ISRI stemming) also improved performance, it did not match the results achieved with Tashaphyne stemming. The Tashaphyne stemmer outperformed the ISRI stemmer primarily because ISRI's aggressive approach can occasionally lead to over-stemming. Over-stemming reduces words excessively, eliminating morphological details essential for differentiating between topics, so that more words are erroneously treated as identical and the resulting topics become less meaningful or coherent. In contrast, Tashaphyne employs a lighter stemming technique that likely avoids these issues. By preserving more of the original word form, Tashaphyne ensures greater accuracy and relevance in the words retained, thereby enhancing the coherence of the generated topics.
3. BERTopic Experiments: Various embedding models were tested using different pre-processing techniques, and the results are summarized in Table 3.
- UBC-NLP/MARBERT Embedding: Basic cleaning yielded the highest coherence score (0.110) and excellent topic diversity (0.871). Interestingly, stemming with ISRI slightly improved the topic diversity (0.886) but reduced coherence (0.046). Tashaphyne stemming resulted in a negative coherence score (−0.004), despite high topic diversity (0.857), suggesting that this stemmer may overly simplify the text for this embedding.
- xlm-roberta-base Embedding: Cleaning significantly improved the coherence to 0.107 and maintained high topic diversity (0.843). Stemming with Tashaphyne and ISRI improved diversity but slightly reduced coherence, indicating that, while stemming is beneficial, it might not always enhance coherence with this embedding.
- aubmindlab/bert-base-arabertv02 Embedding: Cleaning provided the best results, with a coherence score of 0.120 and topic diversity of 0.843. Stemming generally improved diversity, but had mixed effects on coherence.
- CAMeL-Lab/bert-base-arabic-camelbert-da Embedding: Cleaning produced the highest coherence score (0.211) and the best topic diversity (0.886) across all BERTopic experiments. Stemming, while still effective, did not surpass the results achieved with basic cleaning.
- asafaya/bert-base-arabic Embedding: Both cleaning and stemming methods performed well, with cleaning achieving a coherence score of 0.120 and topic diversity of 0.886. The original articles (without pre-processing) had slightly higher coherence (0.126) but lower diversity (0.843).
- qarib/bert-base-qarib Embedding: With this embedding, cleaning provided the highest coherence (0.144) and excellent topic diversity (0.900). Stemming also performed well, maintaining high diversity and coherence close to that obtained after cleaning.

Table 3. Summary of experimental results for the BERTopic model.

| # | Embedding Model | Pre-Processing | NPMI Coherence Score ↑ | Topic Diversity | Training Time (s) |
|---|---|---|---|---|---|
| 1 | UBC-NLP/MARBERT | Original articles (without pre-processing) | 0.062 | 0.814 | 3096.04 |
| 2 | | Cleaning | 0.110 | 0.871 | 2419.38 |
| 3 | | Cleaning + stemming (Tashaphyne) | −0.004 | 0.857 | 2273.24 |
| 4 | | Cleaning + stemming (ISRI) | 0.046 | 0.886 | 2187.69 |
| 5 | xlm-roberta-base | Original articles (without pre-processing) | −0.088 | 0.743 | 2235.53 |
| 6 | | Cleaning | 0.107 | 0.843 | 1974.42 |
| 7 | | Cleaning + stemming (Tashaphyne) | 0.104 | 0.800 | 1730.37 |
| 8 | | Cleaning + stemming (ISRI) | 0.083 | 0.829 | 1695.86 |
| 9 | aubmindlab/bert-base-arabertv02 | Original articles (without pre-processing) | 0.024 | 0.757 | 1653.41 |
| 10 | | Cleaning | 0.120 | 0.843 | 1390.71 |
| 11 | | Cleaning + stemming (Tashaphyne) | 0.074 | 0.800 | 683.57 |
| 12 | | Cleaning + stemming (ISRI) | 0.022 | 0.771 | 642.83 |
| 13 | CAMeL-Lab/bert-base-arabic-camelbert-da | Original articles (without pre-processing) | −0.040 | 0.857 | 1688.56 |
| 14 | | Cleaning | **0.211** | 0.886 | 1386.45 |
| 15 | | Cleaning + stemming (Tashaphyne) | 0.139 | 0.843 | 1259.72 |
| 16 | | Cleaning + stemming (ISRI) | 0.062 | 0.814 | 1225.23 |
| 17 | asafaya/bert-base-arabic | Original articles (without pre-processing) | 0.126 | 0.843 | 3216.70 |
| 18 | | Cleaning | 0.120 | 0.886 | 2725.64 |
| 19 | | Cleaning + stemming (Tashaphyne) | 0.089 | 0.871 | 2486.15 |
| 20 | | Cleaning + stemming (ISRI) | 0.081 | 0.871 | 1240.58 |
| 21 | qarib/bert-base-qarib | Original articles (without pre-processing) | −0.026 | 0.786 | 861.80 |
| 22 | | Cleaning | 0.144 | 0.900 | 711.71 |
| 23 | | Cleaning + stemming (Tashaphyne) | 0.138 | 0.900 | 1238.11 |
| 24 | | Cleaning + stemming (ISRI) | 0.131 | 0.871 | 1177.52 |

Note: Bold value indicates the best NPMI score among the compared configurations.
4.2. Summary of Results
4.3. Evaluation and Optimization
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Farghaly, A.; Shaalan, K. Arabic Natural Language Processing: Challenges and Solutions. ACM Trans. Asian Lang. Inf. Process. 2009, 8, 14. [Google Scholar] [CrossRef]
- Li, X.; Lei, L. A bibliometric analysis of topic modelling studies (2000–2017). J. Inf. Sci. 2021, 47, 161–175. [Google Scholar] [CrossRef]
- Liu, L.; Tang, L.; Dong, W.; Yao, S.; Zhou, W. An overview of topic modeling and its current applications in bioinformatics. SpringerPlus 2016, 5, 1–22. [Google Scholar] [CrossRef] [PubMed]
- Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
- Blei, D.M.; Lafferty, J.D. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 113–120. [Google Scholar]
- Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 1990, 41, 391–407. [Google Scholar] [CrossRef]
- Lee, D.; Seung, H.S. Algorithms for non-negative matrix factorization. Adv. Neural Inf. Process. Syst. 2000, 13, 535–541. [Google Scholar]
- Miao, Y.; Yu, L.; Blunsom, P. Neural variational inference for text processing. In Proceedings of the International Conference on Machine Learning (PMLR), New York, NY, USA, 19–24 June 2016; pp. 1727–1736. [Google Scholar]
- Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794. [Google Scholar]
- Dieng, A.B.; Ruiz, F.J.R.; Blei, D.M. Topic modeling in embedding spaces. Trans. Assoc. Comput. Linguist. 2020, 8, 439–453. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Bagheri, R.; Entezarian, N.; Sharifi, M.H. Topic Modeling on System Thinking Themes Using Latent Dirichlet Allocation, Non-Negative Matrix Factorization and BERTopic. J. Syst. Think. Pract. 2023, 2, 33–56. [Google Scholar]
- Antoun, W.; Baly, F.; Hajj, H. AraBERT: Transformer-based model for Arabic language understanding. arXiv 2020, arXiv:2003.00104. [Google Scholar]
- Ma, T.; Al-Sabri, R.; Zhang, L.; Marah, B.; Al-Nabhan, N. The impact of weighting schemes and stemming process on topic modeling of arabic long and short texts. ACM Trans. Asian Low-Resour. Lang. Inf. Process. (TALLIP) 2020, 19, 1–23. [Google Scholar] [CrossRef]
- Abdelrazek, A.; Eid, Y.; Gawish, E.; Medhat, W.; Hassan, A. Topic modeling algorithms and applications: A survey. Inf. Syst. 2023, 112, 102131. [Google Scholar] [CrossRef]
- Abuzayed, A.; Al-Khalifa, H. BERT for Arabic topic modeling: An experimental study on BERTopic technique. Procedia Comput. Sci. 2021, 189, 191–194. [Google Scholar] [CrossRef]
- Al-Khalifa, S.; Alhumaidhi, F.; Alotaibi, H.; Al-Khalifa, H.S. ChatGPT across Arabic Twitter: A Study of Topics, Sentiments, and Sarcasm. Data 2023, 8, 171. [Google Scholar] [CrossRef]
- Berrimi, M.; Oussalah, M.; Moussaoui, A.; Saidi, M. A Comparative Study of Effective Approaches for Arabic Text Classification. SSRN Electron. J. 2023. Available online: https://ssrn.com/abstract=4361591 (accessed on 1 May 2024).
- Einea, O.; Elnagar, A.; Al Debsi, R. Sanad: Single-label arabic news articles dataset for automatic text categorization. Data Brief 2019, 25, 104076. [Google Scholar] [CrossRef]
- El Kah, A.; Zeroual, I. The effects of pre-processing techniques on Arabic text classification. Int. J. 2021, 10, 41–48. [Google Scholar]
- Taghva, K.; Elkhoury, R.; Coombs, J. Arabic stemming without a root dictionary. In Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC’05)-Volume II, Las Vegas, NV, USA, 4–6 April 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 1, pp. 152–157. [Google Scholar]
- Al-Khatib, R.M.; Zerrouki, T.; Abu Shquier, M.M.; Balla, A. Tashaphyne 0.4: A new Arabic light stemmer based on rhyzome modeling approach. Inf. Retr. J. 2023, 26, 14. [Google Scholar] [CrossRef]
- Nithyashree, V. What Are N-Grams and How to Implement Them in Python? 2021. Available online: https://www.analyticsvidhya.com/blog/2021/09/what-are-n-grams-and-how-to-implement-them-in-python/ (accessed on 15 October 2024).
- Abdelrazek, A.; Medhat, W.; Gawish, E.; Hassan, A. Topic Modeling on Arabic Language Dataset: Comparative Study. In Proceedings of the International Conference on Model and Data Engineering, Cairo, Egypt, 21–24 November 2022; Springer: Berlin, Germany, 2022; pp. 61–71. [Google Scholar]
- Jelodar, H.; Wang, Y.; Yuan, C.; Feng, X.; Jiang, X.; Li, Y.; Zhao, L. Latent Dirichlet allocation (LDA) and topic modeling: Models, applications, a survey. Multimed. Tools Appl. 2019, 78, 15169–15211. [Google Scholar] [CrossRef]
- Chen, Y.; Zhang, H.; Liu, R.; Ye, Z.; Lin, J. Experimental explorations on short text topic mining between LDA and NMF based Schemes. Knowl. Based Syst. 2019, 163, 1–13. [Google Scholar] [CrossRef]
- Egger, R.; Yu, J. A topic modeling comparison between lda, nmf, top2vec, and bertopic to demystify twitter posts. Front. Sociol. 2022, 7, 886498. [Google Scholar] [CrossRef] [PubMed]
- Chang, J.; Gerrish, S.; Wang, C.; Boyd-Graber, J.; Blei, D. Reading tea leaves: How humans interpret topic models. In Proceedings of the Advances in Neural Information Processing Systems 22 (NIPS 2009), Vancouver, BC, Canada, 7–10 December 2009; Volume 22. [Google Scholar]
- Biniz, M. DataSet for Arabic Classification. Mendeley Data 2018, 2. [Google Scholar] [CrossRef]