Comparative Analysis of TF–IDF and Hashing Vectorizer for Fake News Detection in Sindhi: A Machine Learning and Deep Learning Approach

Abstract: Social media has become a popular platform for accessing and sharing news, but it has also led to a rise in fake news, posing serious risks. The ease of dissemination and constant flow of information raise concerns about the spread of incorrect information. Timely verification of news is crucial to combat false news. However, most research on false news identification has focused on English, neglecting South Asian languages. This study examines a dataset of Sindhi tweets, employing text feature extraction techniques such as TF–IDF and hashing vectorizer. Several machine learning algorithms, along with advanced deep learning models such as Transformer BERT, were utilized for analysis.


Introduction
Fake news is false or fraudulent information that may originate from traditional media channels or online platforms. The advent of social media and other digital platforms has resulted in an unprecedented surge in the dissemination of erroneous information, endangering society on an enormous scale. Online false news has the potential to mislead and misinform people, frequently with major political and societal implications. It can negatively alter people's perceptions of events, public figures, and organizations, resulting in polarization and a breakdown of trust.
Given its real-time nature and broad reach, Twitter plays a key part in the propagation of fake news. The platform's ease of sharing information, together with its lack of control and verification processes, makes it a breeding ground for the spread of incorrect information.
The Sindhi language has approximately 37 million speakers globally. It has a rich literary legacy dating back to the 8th century that is distinguished by its diversity and uniqueness. It has been written in many scripts throughout its history, but at present Perso-Arabic and Devanagari are the two most widely used scripts internationally. Many native speakers use Sindhi on social media to share information and express themselves. However, Sindhi content, like that of other languages, suffers from the dissemination of misinformation, which is typically driven by personal gain, political ambitions, or amusement. Because the language has limited resources, detecting fake news in Sindhi poses a substantial challenge.

Related Works
S. Singh et al. [1] highlight a concern about the pernicious effect of web-based entertainment on people and society, owing to the spread of fake news. They argue that machine learning algorithms are required because traditional manual filtering techniques have failed to identify and eradicate this problem. They used various sophisticated machine learning methods to identify and address fake news. The SVM classifier achieved a notable accuracy rate.
I. Ahmad et al. [2] explore different textual properties that can help distinguish real news from fake news, and train a combination of machine learning algorithms using various ensemble methods. The proposed approach is evaluated on four real-world datasets, and the experimental results indicate its superiority over individual learners.
Haseeb Ur Rehman and S. Hussain [3] found widespread fake news on Pakistani social media, particularly Twitter and Facebook. False stories on politics and international relations received attention even after being debunked, indicating the influence of cult followings and populism.
W. Y. Wang [4] presents a new dataset called LIAR for automatic fake news detection. The author used surface-level linguistic features to investigate automatic fake news detection and proposed a hybrid CNN. The work highlights the importance of labelled benchmark datasets for combating fake news.
P. H. A. Faustini and T. F. Covões [5] proposed a model to detect fake news based on text features that can be applied to different language groups. They evaluated their approach on five datasets, including social media posts, and achieved competitive results. Support Vector Machines and Random Forests performed best in classification, and the bag-of-words approach achieved the best results overall.
J. C. S. Reis et al. [6] used multiple classifiers, namely KNN, Naive Bayes, Random Forests, SVM, and XGBoost, to assess the efficacy of various hand-crafted features for spotting false news. The classifiers were evaluated by their AUC and macro F1-scores, and the RF and XGB classifiers produced the best results. According to the ROC curve for XGB, 40% of the actual news can be misclassified, while practically all fraudulent news can be classified accurately.

Proposed Methodology
This study proposes a methodology to evaluate the effectiveness of two feature extractors in combination with machine learning (ML) and deep learning (DL) models for identifying real and fake news in Sindhi on Twitter. Figure 1a illustrates the overview of our methodology. The methodology involves creating a dataset of Sindhi-language tweets, cleaning and labeling them, extracting features, training and testing ML and DL models, and then computing performance metrics, namely accuracy, precision, recall, and F1-score, to assess the efficacy of the feature extraction methods.

Construction of Dataset
Sindhi is a language with limited NLP resources [7], which posed a challenge. To address this, data were collected by scraping Twitter using the Tweepy API. Real news tweets were gathered from Sindhi news network accounts, while fake news tweets were crowdsourced for the development of a labeled dataset.
The dataset comprised 9854 tweets. Figure 1b shows the distribution of the dataset into real and fake news, while Figure 2b illustrates the tweet length across the dataset. The dataset exhibited an imbalance, with approximately one thousand tweets in the "Fake" category representing conversational tweets, which are not considered news.

Data Cleaning and Pre-Processing
Data pre-processing, which includes data cleaning and the preliminary processing of the dataset, comprised the following steps:

• Elimination of superfluous white spaces.
• Removal of all non-textual data.
• Elimination of duplicate and English-language tweets.
• Stop-word removal. Stop words are frequently used terms in a language that are removed from text data because they carry little information for NLP tasks.
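
The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function name clean_tweets, the letters-only filter, the "all-ASCII means English" heuristic, and the stop-word set passed in by the caller are all hypothetical choices made for the example.

```python
import re

def clean_tweets(tweets, stop_words):
    """Apply the four cleaning steps in order and return the cleaned tweets."""
    cleaned, seen = [], set()
    for text in tweets:
        # Remove non-textual data: URLs first, then every run of characters
        # that are not letters (punctuation, digits, emoji, underscores).
        # Caveat: this also strips combining diacritics, which may or may
        # not be desirable for Sindhi text.
        text = re.sub(r"https?://\S+", " ", text)
        text = re.sub(r"[\W\d_]+", " ", text)
        # Collapse superfluous white space.
        text = " ".join(text.split())
        # Heuristic (assumption): a tweet left with only ASCII characters is
        # treated as English; this also skips tweets that became empty.
        if text.isascii():
            continue
        # Drop exact duplicates, keeping the first occurrence.
        if text in seen:
            continue
        seen.add(text)
        # Remove stop words token by token.
        tokens = [tok for tok in text.split() if tok not in stop_words]
        if tokens:
            cleaned.append(" ".join(tokens))
    return cleaned
```

Duplicates are detected after the non-textual characters are removed, so two tweets that differ only in a trailing link or number count as the same tweet. Whether that matches the original procedure is not stated in the text.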

Feature Extraction
Feature extraction entails selecting the most relevant features from the data to train a model. There are different methods for obtaining features. These methods decrease the dimensionality of the data and capture the most pertinent features [8].

Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF is a feature extraction technique in NLP that measures the importance of a word in a document based on its frequency in that document and across the whole dataset. A word's weight increases with its frequency in the document and decreases with the number of documents in the corpus that contain it.
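
To make the weighting concrete, the following sketch computes one common textbook variant, tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. The variant and the function name tf_idf are assumptions for illustration; library implementations typically add smoothing and normalization, and the paper does not state which variant was used.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Term frequency (relative count) times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

With this variant, a term that occurs in every document receives a weight of zero, which is how TF-IDF suppresses words that do not help distinguish one tweet from another.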

Hashing Vectorizer (HV)
HV is a text feature extraction technique used in NLP tasks to convert text documents into a matrix of token occurrences. It uses a hash function to map each word directly to a column index, so no vocabulary has to be built or stored and each word can be processed independently. This scalability makes it suitable for large datasets.
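
The hashing idea can be sketched as follows. This is an illustrative toy, not the implementation used in the study: the helper names hash_index and hashing_vectorize, the use of MD5 as the hash function, and the small default of 16 features are all assumptions chosen to keep the example self-contained and deterministic.

```python
import hashlib

def hash_index(token, n_features):
    """Map a token to a column index using a stable hash (MD5 here, by assumption)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % n_features

def hashing_vectorize(tokens, n_features=16):
    """Return a fixed-length count vector; no vocabulary is built or stored."""
    vec = [0] * n_features
    for tok in tokens:
        # Different tokens may collide in the same column; that is the
        # price paid for a fixed memory footprint.
        vec[hash_index(tok, n_features)] += 1
    return vec
```

Because the index depends only on the token itself, documents can be vectorized one at a time or in parallel. The trade-off is that collisions merge unrelated words and the mapping cannot be inverted to recover which word produced a feature.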

Machine Learning
ML algorithms recognize patterns and generate predictions, and they are employed in an array of applications, including text classification, where producing the required results with traditional algorithms is challenging or infeasible [9]. For text classification, we used SVM, Naive Bayes (NB), neural networks (NN), logistic regression (LR), CatBoost (CB), decision trees (DT), AdaBoost (AB), and random forests (RF).
SVM identifies an optimal hyperplane to separate data into distinct classes while maximizing the margin between them. Naive Bayes employs Bayes' theorem with a feature independence assumption. A neural network learns intricate patterns from data through iterative training. Logistic regression estimates the probability of an instance belonging to a particular class by applying a logistic function to a linear combination of features. CatBoost achieves high predictive accuracy by intelligently managing feature interactions and employing an innovative learning scheme. Random forest (RF) combines many decision trees, which mitigates overfitting and captures diverse feature interactions, improving accuracy and robustness.
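
As a small illustration of the logistic regression description above, the sketch below applies the logistic (sigmoid) function to a linear combination of features. The function name predict_proba and the weights in the usage note are hypothetical; in practice the weights are learned from training data.

```python
import math

def predict_proba(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    # Linear combination of the features.
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Logistic function, written in a form that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

For example, predict_proba([1.0, 2.0], [0.0, 0.0], 0.0) is 0.5, since a zero score sits exactly on the decision boundary. A tweet would be assigned to the positive class when the probability exceeds a chosen threshold, commonly 0.5.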

Deep Learning
Deep learning models are machine learning approaches that use artificial neural networks (ANNs) to extract meaningful patterns and insights from data. To evaluate our dataset, we used CNN, RNN, LSTM, and Transformer models. CNN-based models are trained to identify patterns in text, RNN-based models treat text as a sequence of words, and Transformer models capture long-range dependencies between words or tokens in a sentence [10].

Performance Evaluation
Performance evaluation metrics are utilized to assess the efficacy and performance of machine learning models or algorithms. These metrics offer quantitative measures to evaluate the model's proficiency in addressing a specific task.

Accuracy
It quantifies the overall correctness of predictions by computing the ratio of correctly predicted instances to the total number of instances.

Precision
It measures the proportion of true positive predictions among the total predicted positive instances, focusing on the precision of positive predictions.

Recall
It calculates the ratio of true positive predictions to the actual positive instances, emphasizing the model's ability to correctly identify positive instances.

F1-Score
It represents the harmonic mean of precision and recall, offering a balanced measure that combines both metrics. The F1-score is particularly useful when dealing with class imbalances within the dataset.
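
The four metrics defined above can be computed directly from the confusion counts, as in the following sketch. Treating "fake" as the positive class and the function name classification_metrics are assumptions made for the example; the paper does not state which class was treated as positive.

```python
def classification_metrics(y_true, y_pred, positive="fake"):
    """Accuracy, precision, recall, and F1 for a binary labelling task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    # Guard each ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

On an imbalanced dataset, accuracy can remain high even when the minority class is rarely detected, which is why precision, recall, and F1 are reported alongside it.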

Experiments and Results
The proposed methodology was evaluated using a diverse range of machine learning and deep learning techniques. In the following, we elaborate on each technique along with the corresponding results.
For this evaluation, we implemented all the machine learning algorithms in the same way with both TF-IDF and HV. The results, after calculating the performance metrics, are shown in Table 1. We also generated results for the ensemble methods Bagging, Boosting, and Stacking. The best-performing algorithm across both feature extraction techniques was the neural network (NN), which we attribute to its capability to learn complex patterns in text data, its robustness to noise, and its strong generalization ability. SVM (Support Vector Machines) and logistic regression also demonstrated good overall performance with both feature extraction techniques, benefiting from their ability to handle linear separability, noisy data, and sparse representations. The top-performing ensemble method was Bagging, which combines multiple models trained on different subsets of the training data to reduce variance, improve stability, and enhance generalization.
From the results presented in Table 1, both feature extraction techniques yielded favorable outcomes with most algorithms, except for KNN (K-Nearest Neighbors). TF-IDF outperformed the hashing vectorizer in terms of accuracy in 6 out of 15 instances, while the hashing vectorizer outperformed TF-IDF in 7 out of 15 instances. However, considering the overall performance metrics of precision, recall, and F1-score, TF-IDF showed superior performance compared to the hashing vectorizer on our dataset. We attribute this to TF-IDF's ability to capture the significance of important and distinctive words in the language. The lower performance of KNN could be attributed to the choice of K, which in our case was set to K = 5.
The dataset was further evaluated using deep learning models; the results are shown in Table 2. From Table 2 we conclude that the RNN outperformed the CNN with both feature extractors, likely because RNNs are designed to process sequential data by maintaining an internal memory that enables them to capture long-term dependencies in the text. Overall, models built on TF-IDF features generally had higher accuracy and better performance than those built on the hashing vectorizer. The exception is a Transformer model such as BERT (Bidirectional Encoder Representations from Transformers), a pretrained language model that can learn useful features from raw input text with or without feature engineering [11]. Because BERT is not pretrained on Sindhi and performed suboptimally on our original dataset, we used the Google Trans Library to translate the dataset into English and fine-tuned BERT on the translated text.

Conclusions
In this study, we extracted our data from Twitter, cleaned and pre-processed it, and then analyzed it using various machine learning algorithms, deep learning models, and the Transformer model BERT. We performed our experiments twice for each algorithm or model, first with TF-IDF and then with the hashing vectorizer, to ascertain which feature engineering technique yielded the best performance on the Sindhi text dataset. Our results demonstrate that TF-IDF often performed better than the hashing vectorizer on our dataset for both machine learning and deep learning models. Among the machine learning algorithms, the neural network (NN) achieved the highest accuracy; among the deep learning models, the RNN did; and BERT performed best on the translated dataset. This indicates the significance of selecting an appropriate feature engineering technique for text data pre-processing to obtain optimal results. These findings help facilitate the development of effective techniques for processing and interpreting Sindhi text data on social media platforms.

Limitations and Future Works
This study examined Twitter news content in the Sindhi language, encompassing both real and fake news. Limited resources made it difficult to create a well-balanced dataset, which led us to supplement the fake news entries through crowdsourcing. This may have reduced the diversity of the data and made the fake tweets easier for classifiers and models to detect. Additionally, relying on Google Translate for automatic language translation during BERT fine-tuning proved suboptimal.
In future work, we plan to gather a more extensive corpus of Sindhi fake news from Twitter, enabling the construction of a balanced dataset for reassessment.

Figure 1. (a) Overview of the workflow for proposed methodology. (b) Distribution of dataset into real and fake tweets; blue represents fake news and orange represents real news.

Figure 2. (a) Splitting of real and fake news tweets into train and test datasets; orange for train and tangerine for test set. (b) Distribution of length of tweets across the dataset.

Table 1. Performance evaluation results of ML algorithms using TF-IDF and hashing vectorizer.

Table 2. Performance evaluation results of DL Models using feature extraction and BERT without any feature extractor.