Policy-Based Spam Detection of Tweets Dataset

Dar, Momna; Iqbal, Faiza; Latif, Rabia; Altaf, Ayesha; Jamail, Nor Shahida Mohd

doi:10.3390/electronics12122662


by Momna Dar ¹, Faiza Iqbal ¹,*, Rabia Latif ², Ayesha Altaf ¹,* and Nor Shahida Mohd Jamail ²,*

¹ Department of Computer Science, University of Engineering and Technology, Lahore P.O. Box 54890, Pakistan

² Artificial Intelligence and Data Analytics Laboratory, College of Computer and Information Sciences (CCIS), Prince Sultan University, Riyadh P.O. Box 66833, Saudi Arabia

* Authors to whom correspondence should be addressed.

Electronics 2023, 12(12), 2662; https://doi.org/10.3390/electronics12122662

Submission received: 17 May 2023 / Revised: 8 June 2023 / Accepted: 9 June 2023 / Published: 14 June 2023

(This article belongs to the Section Computer Science & Engineering)


Abstract: Spam communications from spam ads and social media platforms such as Facebook, Twitter, and Instagram are increasing, making spam detection an increasingly important research topic. Spam review identification has been studied in many languages, including Chinese, Urdu, Roman Urdu, English, and Turkish; however, fewer high-quality datasets are available for Urdu. This is mainly because Urdu is less extensively used on social media networks such as Twitter, making it harder to collect large volumes of relevant data. This paper investigates policy-based Urdu tweet spam detection. The study collects over 1,100,000 real-time tweets from multiple users. The dataset is carefully filtered to comply with Twitter’s 100-tweet-per-hour limit. For data collection, the snscrape library is utilized, which is equipped with an API for accessing various attributes such as username, URL, and tweet content. Then, a machine learning pipeline is developed, consisting of TF-IDF, Count Vectorizer, and the following classifiers: multinomial naïve Bayes, support vector classifier with an RBF kernel, logistic regression, and BERT. Based on Twitter policy standards, feature extraction is performed, and the dataset is separated into training and testing sets for spam analysis. Experimental results show that the logistic regression classifier achieves the highest accuracy, with an F1-score of 0.70 and an accuracy of 99.55%. The findings of the study show the effectiveness of policy-based spam detection in Urdu tweets using machine learning and BERT layer models and contribute to the development of a robust Urdu language social media spam detection method.

Keywords: spam detection; Urdu tweets; machine learning

1. Introduction

Online social media is an essential communication tool in today’s high-tech society. Social media sites such as Facebook, Instagram, and Twitter have made it possible to communicate with people around the world. Twitter is a well-known platform that enables communication and global awareness. As an interactive platform, its main objective is to continuously engage users through tweets. Unfortunately, this objective has been hampered by spammers who use Twitter to disseminate harmful information [1] and unsolicited communications. Spam regularly distracts users from the subjects they follow and lowers the quality of the Twitter experience for genuine users, and it has unquestionably grown into a major problem on the platform. Users can report accounts they believe to be spam, and published studies use machine learning (ML) and deep learning (DL) techniques to find and isolate suspected spam accounts. Twitter uses natural language processing (NLP) and tweet-ranking algorithms to analyze thousands of tweets per second, and over time it becomes better at predicting which tweets a user will find most interesting. These algorithms gather as much information about users as possible, including what they prefer to watch, how they react to status updates, and even the links they click. The platforms then use machine learning to make highly accurate predictions of users’ future preferences [2,3]. Deep learning, on the other hand, improves machines’ capacity to detect and amplify even the smallest patterns [4]. Machine learning can also be misused for reconnaissance, allowing attackers to study a target’s traffic patterns, defenses, and potential vulnerabilities.

Spammers pose a significant risk to the internet [5], and spam tweet detection has become a serious issue. Moreover, with the rapid rise in internet users, the number of spam accounts on Twitter has also increased. These accounts are used for unlawful and unethical behavior, phishing, and fraud [5]. Likewise, harmful links attached to spam tweets can expose users’ payment details or trigger other harmful actions. This paper applies the most common ML and DL algorithms to develop a model that determines whether an Urdu tweet is spam or non-spam. We scraped Twitter to construct an Urdu dataset. The study applies ML and DL methods to the processed dataset to determine which algorithm, using policy and timestamp features, detects spam tweets with the highest precision and accuracy.

1.1. Motivation

There is a need for additional research on the Urdu language due to the rising number of Urdu speakers in different nations. In the realm of speech processing, there is a demand for transcribing audio files to text files, such as Urdu audio files to Urdu text files. Aside from this, we have numerous additional needs for Urdu language processing in NLP, speech processing, and possibly other domains. A substantial number of Urdu datasets are available; however, most of them are of lower quality and limited in size as compared to English datasets. It is crucial that we collect an Urdu dataset and make the tools and dataset open source in order to expand the research in this area.

The main motivation of this work is, thus, to compile a high-quality Urdu tweets dataset to distinguish between spam and non-spam tweets utilizing ML and DL algorithms. Compared to English, there are fewer high-quality datasets available for spam review detection in Urdu. Urdu is a less commonly used language on social media platforms such as Twitter, which makes it harder to collect large amounts of relevant data. Table 1 compares the available English and Urdu datasets for spam review detection.

In Table 1, the dataset size corresponds to the number of tweets. Large quantity datasets represent a large number of tweets inside the dataset or corpus, whereas small quantity datasets represent a smaller number of tweets within the corpus. In this context, quality refers to the dependability of the annotations and the neatness of the dataset. High-quality datasets have been meticulously annotated and cleaned, whereas low-quality datasets may contain inaccurate or noisy annotations or be difficult to manipulate due to uneven formatting or missing data.

1.2. Challenges

A major challenge for our study was the choice of Urdu itself, given that little work has been done on raw Urdu data, whether in compiling Urdu datasets or in experimenting with them. In addition, while deriving features from our dataset, we analyzed Twitter’s policies. One policy limits the number of tweets per day to 2400 [15], which corresponds to 100 tweets per hour. Any account exceeding this limit may be deemed spam. We also had to extract features from a limited set of attributes, such as a person’s name, user ID, and tweet. We carefully evaluated the data and aggregated the tweet count per user so that it could be checked against the policy limit of 100 tweets per hour.

1.3. Research Contribution

The research contributions provided by this paper are as follows:

Compiled more than 1,100,000 unique real-time Urdu tweets from multiple users. The dataset is publicly available on Kaggle [16];
Proposed a model to process real-time tweet couplets with feature extraction of Urdu couplets considering timestamp and policy features;
Evaluated the model by applying machine learning and deep learning training and testing on the Urdu dataset to achieve improved accuracy.

The rest of the paper is organized as follows. Section 2 presents a Literature Review, which highlights the latest related work in the domain and the research gap in existing work. Section 3 describes the machine learning and deep learning algorithms. Section 4 presents the proposed algorithm, preprocessing, feature extraction details, and architecture of the model. The results and discussion are presented in Section 5. Section 6 outlines future research directions. Finally, Section 7 concludes the paper.

2. Literature Review

Today, news-related spam is pervasive on social media platforms, with Twitter being the most popular. It is now harder for users to spot fake tweets. In addition, only a few studies have analyzed spam tweets using machine learning and deep learning techniques, as mentioned in the following, with the majority of studies focusing on English, Chinese, Arabic, and Turkish. It has been highlighted that a comprehensive study examining spam review detection in Urdu tweets does not exist. Therefore, it is vital to develop reliable and precise ML and DL techniques employing Urdu tweets. This section examines the available research in order to investigate the Urdu language techniques using social website datasets.

Existing studies are analyzed according to the type of dataset utilized, such as online or offline, the type of algorithm, and the results achieved by each algorithm. In order to improve accuracy, researchers are performing experiments on a variety of languages, leveraging diverse data sets, and presenting results with differing degrees of complexity. Ge et al. [17] conducted forensic analysis on a vast distribution of Urdu corpus. It is tested with Latent Dirichlet Allocation (LDA) and cosine similarity to detect textual similarity. Anwar et al. classify text using an individual and profile-based LDA classification strategy, achieving an f1-measure of 92 percent [18]. Mashooq et al. [19] explored sentiment analysis in Urdu comprehensively, covering future potential, taxonomy development, and associated issues. Hussain et al. [20] make a significant addition to the field of spam identification in Roman Urdu scripts by providing insights and approaches for recognizing misleading product evaluations in the realm of spam review detection in Roman Urdu scripts. Similarly, Hussain et al. [21,22] present a complete analysis of linguistic and spammer behavioral strategies for detecting spam reviews individually and in groups, providing useful insights and methods for efficiently addressing the problem. Duma et al. [23] introduce a unique strategy that integrates review text, overall ratings, and aspect ratings to efficiently detect false reviews, highlighting the potential of a deep hybrid model in enhancing the accuracy of fake review detection.

Combinations of ML and NLP techniques, such as multinomial naïve Bayes (MNB), support vector machine (SVM), and the expectation–maximization (EM) algorithm together with stop-word removal, lemmatization, and stemming [24], are used to identify fake reviews. Mekala et al. [25] demonstrate high precision by utilizing stylistic characteristics and term weight measurement. Saha et al. [26] achieved 96 percent accuracy using a multilayer perceptron (MLP) on a dataset of text from social media platforms. Similarly, Benzebouchi et al. [27] apply word2vec and then an MLP to an English corpus, which raises accuracy to 95.83 percent. It is simple and computationally inexpensive to extract user-profile and message-content information using the Twitter API. In contrast, because of the scale and complexity of the Twitter user graph, collecting graph-based information takes time and money, although it can also be used to identify spam. Assuming that all spam tweets contain URLs, some studies utilize URLs embedded in tweets to identify spam on Twitter [28]. Similarly, Khanday et al. conducted a study to identify propagandistic nodes using social network analysis [29].

Based on the reviewed literature, several researchers have classified spam tweets in specific languages, such as Chinese, Roman Urdu, English, and Turkish. It is therefore necessary to evaluate and identify spam tweets in an Urdu dataset. Table 2 analyzes the research gap in the existing literature. Our proposed study, therefore, utilizes Urdu tweets of Twitter users and a proposed method utilizing policy and timestamp features. Moreover, we used different supervised classification models, including naïve Bayes, logistic regression, k-nearest neighbors, and long short-term memory, and finally compared their accuracy.

3. Machine Learning and Deep Learning Algorithms

Artificial intelligence (AI) includes both machine learning and deep learning. In brief, machine learning refers to AI that can autonomously adapt with minimal human intervention. Deep learning is a kind of machine learning that uses artificial neural networks to simulate the human brain’s learning process. This section explains the existing machine learning and deep learning algorithms that are utilized in this study.

3.1. Multinomial Naïve Bayes

The foundation of naïve Bayes classifiers is Bayesian classification techniques. These depend on Bayes’ theorem, an equation that illustrates the relationship between the conditional probabilities of statistical values. In this scenario, we are attempting to determine the likelihood of a label based on a set of observed features. It is based on a simple (but naïve) application of the Bayesian formula for conditional probability after getting basic statistics from a specified training dataset. It is also popular for creating a baseline categorization performance, which other, more complex techniques must improve upon.

Naïve Bayes is one of the most highly rated algorithms for categorical classification. Moreover, our dataset is labeled for binary classification, with label 0 for spam and label 1 for non-spam. So, we can write the equations as follows:

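P(\text{spam} \mid \text{policy\_feature}) = \frac{P(\text{policy\_feature} \mid \text{spam}) \times P(\text{spam})}{P(\text{policy\_feature})}

(1)

P(\text{ham} \mid \text{policy\_feature}) = \frac{P(\text{policy\_feature} \mid \text{ham}) \times P(\text{ham})}{P(\text{policy\_feature})}

(2)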
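As a worked illustration of Equations (1) and (2), the following minimal sketch applies Bayes' theorem to a single binary policy feature. The prior and the two likelihood values are hypothetical numbers chosen only for illustration; they are not estimates from the dataset.

```python
def posterior(prior_spam, p_feat_given_spam, p_feat_given_ham):
    """Return (P(spam | feature), P(ham | feature)) via Bayes' theorem."""
    prior_ham = 1.0 - prior_spam
    # Evidence P(feature), obtained from the law of total probability.
    evidence = p_feat_given_spam * prior_spam + p_feat_given_ham * prior_ham
    p_spam = p_feat_given_spam * prior_spam / evidence
    p_ham = p_feat_given_ham * prior_ham / evidence
    return p_spam, p_ham


# Hypothetical inputs: 20% of tweets are spam, and the policy feature fires
# for 90% of spam tweets but only 10% of ham tweets.
p_spam, p_ham = posterior(prior_spam=0.2, p_feat_given_spam=0.9, p_feat_given_ham=0.1)
print(round(p_spam, 3), round(p_ham, 3))
```

The two posteriors share the same denominator, so they always sum to one; the classifier simply picks the larger of the two.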

3.2. Logistic Regression

To accomplish binary classification, logistic regression models a dependent variable (Y) in terms of one or more independent variables (X). In other words, it is a generalized linear model that predicts the probability of an event occurring. Specifically, logistic regression models the logit (log-odds) of that probability as a linear function of the independent variables. Equation (3) below represents the logistic regression cost function.

Cost(h_{θ}(x), y) = \begin{cases} -\log(h_{θ}(x)) & \text{if } y = 1 \\ -\log(1 - h_{θ}(x)) & \text{if } y = 0 \end{cases}

(3)

Logistic regression is another popular and highly effective algorithm for binary classification tasks. It is widely used in various domains, including spam detection and sentiment analysis.
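The cost in Equation (3) can be illustrated with a short sketch. It assumes the usual sigmoid hypothesis h_θ(x) = 1 / (1 + e^(−z)), which the text does not state explicitly; the input value 2.0 is an arbitrary example.

```python
import math


def sigmoid(z):
    """Assumed hypothesis function: maps a linear score z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_cost(h, y):
    """Per-example cost from Equation (3); h is the predicted probability of y = 1."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)


h = sigmoid(2.0)                  # a confident prediction of class 1
cost_if_y1 = logistic_cost(h, 1)  # small cost: the prediction agrees with the label
cost_if_y0 = logistic_cost(h, 0)  # large cost: the prediction contradicts the label
print(cost_if_y1, cost_if_y0)
```

The asymmetry between the two printed values is the point of the cost function: confident but wrong predictions are penalized much more heavily than confident and correct ones.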

3.3. Support Vector Machine

Support vector machine (SVM) is a supervised machine learning method applicable to classification and regression problems, although it is most frequently applied to classification. In the SVM technique, each data point is represented as a point in n-dimensional space (where n is the number of features), with the value of each feature corresponding to a coordinate. In our case, we employ a support vector classifier with an RBF kernel, fed by count vectorizer and term frequency–inverse document frequency (TF-IDF) features. Equation (4) represents the SVM cost function.

\min_{θ} C \sum_{i = 1}^{m} \left[ y^{(i)} \, \mathrm{cost}_{1}(θ^{T} x^{(i)}) + (1 - y^{(i)}) \, \mathrm{cost}_{0}(θ^{T} x^{(i)}) \right] + \frac{1}{2} \sum_{j = 1}^{n} θ_{j}^{2}

(4)
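To make Equation (4) concrete, the sketch below evaluates the objective for a linear model. The text does not define cost_1 and cost_0; the hinge forms used here, max(0, 1 − z) and max(0, 1 + z), are an assumption, as is the convention that θ_0 is an unregularized bias paired with a constant feature x_0 = 1. The RBF kernel used in the study is not shown.

```python
def cost_1(z):
    # Assumed hinge cost for positive examples (y = 1): zero once z >= 1.
    return max(0.0, 1.0 - z)


def cost_0(z):
    # Assumed hinge cost for negative examples (y = 0): zero once z <= -1.
    return max(0.0, 1.0 + z)


def svm_objective(theta, X, y, C):
    """Evaluate Equation (4) for a linear model with theta[0] as the bias."""
    data_term = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(t * v for t, v in zip(theta, x_i))  # theta^T x
        data_term += y_i * cost_1(z) + (1 - y_i) * cost_0(z)
    reg_term = 0.5 * sum(t * t for t in theta[1:])  # j = 1..n, bias excluded
    return C * data_term + reg_term


# Two toy points, each with a leading 1.0 for the bias term.
X = [[1.0, 2.0], [1.0, -2.0]]
y = [1, 0]
separating = svm_objective([0.0, 1.0], X, y, C=1.0)  # both points outside the margin
untrained = svm_objective([0.0, 0.0], X, y, C=1.0)   # every point pays a cost of 1
print(separating, untrained)
```

With the separating weights only the regularization term remains, which shows how C trades margin violations against the size of θ.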

3.4. Long Short-Term Memory (LSTM) and Gated Recurrent Neural Network (RNN)

The incorporation of gating units permits the expansion of the family of gated recurrent neural networks (RNN) to include long short-term memory (LSTM). In particular, the system can be expanded to include three gate components. First, the forget gate can be utilized to control a direct copying or a complete state clearing. The input gate follows a similar technique to evaluate if the state should be updated in consideration of the current input signal. At each time step, the amount of data to keep from the perturbation input signal and the previous state signal is learned. In addition to learning long-term time requirements by retaining information, the system must occasionally learn to clear data from its current state.

A sequential model with embedding, LSTM, dropout, and dense layers was used for the LSTM, with ReLU and sigmoid activation functions on the two successive levels of dense and dropout layers. For the Urdu sentences, we employed a tokenizer that was initialized, fitted to the tweet input, and used to convert each tweet into a sequence of numbers.
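The gating behavior described above can be sketched with a single scalar LSTM step. The update rules follow the standard LSTM formulation, which is an assumption here because the text gives no equations; the weight values are hypothetical and chosen so that the forget gate stays open and the input gate stays closed.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell; w maps a gate name to (w_x, w_h, bias)."""
    def pre_activation(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation("forget"))       # share of the old cell state to keep
    i = sigmoid(pre_activation("input"))        # share of the candidate to write
    o = sigmoid(pre_activation("output"))       # share of the cell state to expose
    g = math.tanh(pre_activation("candidate"))  # candidate value from the current input
    c = f * c_prev + i * g                      # new cell state
    h = o * math.tanh(c)                        # new hidden state
    return h, c


# Hypothetical weights: forget gate saturated open, input gate saturated closed.
keep_weights = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h_keep, c_keep = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, w=keep_weights)
print(c_keep)
```

With these weights the cell state is carried forward almost unchanged; flipping the sign of the forget bias would instead clear it, which is the "complete state clearing" case mentioned in the text.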

4. Methodology

This section presents the proposed methodology. It describes data collection details, data pre-processing, feature extraction, architecture, and implementation details of the proposed model.

4.1. Data Collection

The dataset is extracted from Twitter and contains only tweets written in Urdu. The dataset is obtained via Twitter’s API. We retrieved it with snscrape, which provides an API for retrieving different attributes such as username, URL, and tweet content. Table 3 provides the dataset statistics, e.g., the mean and standard deviation of the spam and ham tweet counts, to show the distribution of the dataset.

Figure 1 shows the distribution of spam and ham tweet counts per username. “Spam” and “ham” refer to the content of the tweets, with “spam” tweets being unwanted or unsolicited messages and “ham” tweets being legitimate or desired messages. The distribution shows how many tweets of each type a user has sent over a certain period of time. This information can be used to help identify patterns in the user’s behavior or to classify their tweets as either spam or ham for filtering purposes.

Our research fetched data using snscrape, a Python library that retrieves attributes such as username and tweet content. Data were fetched continuously for nearly 2 h. The dataset is preprocessed to remove English letters and characters such as punctuation, exclamation marks, and other symbols. Twitter’s policy [42] permits only 2400 tweets per day, which equates to 100 tweets per hour. Because we collected tweets continuously for two hours, we chose a threshold of 200 tweets per user. Using this timestamp and policy feature, we can determine whether or not a tweet is spam.

4.2. Preprocessing

Preprocessing is done both manually and programmatically. Several library functions, such as the strip function, the replace function, and tokenization, were employed to clean the text by removing punctuation, numbers, and irrelevant phrases or words. During tokenization, punctuation is removed and the sentence is split into words; stop words are deleted, and the remaining words are reassembled into a sentence using the join function with a space separator. After cleaning, each sentence is saved with a binary label: ‘0’ for spam or ‘1’ for non-spam. The saved dataset is then processed further in the implementation stage.
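The cleaning steps described above can be sketched with the Python standard library. This is a minimal illustration, not the authors' code: the function name `clean_tweet`, the character ranges, and the tiny stop-word list are assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
URDU_STOP_WORDS = {"ہے", "کا", "کی", "کے", "اور"}


def clean_tweet(text):
    """Remove non-Urdu characters, punctuation, digits, and stop words."""
    # Replace everything outside the Arabic Unicode block (used for Urdu script)
    # and whitespace; this drops English letters, ASCII digits, and symbols.
    text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)
    # Replace Arabic-script punctuation and digits that sit inside that block.
    text = re.sub(r"[\u060C\u061B\u061F\u06D4\u0660-\u0669\u06F0-\u06F9]", " ", text)
    # Tokenize on whitespace, drop stop words, and rebuild the sentence.
    tokens = [token for token in text.strip().split() if token not in URDU_STOP_WORDS]
    return " ".join(tokens)


cleaned = clean_tweet("آج موسم اچھا ہے! Visit http://example.com 123")
print(cleaned)
```

A production stop-word list would be much longer, and the character ranges may need extending if the corpus contains Urdu characters outside the basic Arabic block.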

4.3. Features Extraction

This study uses the vector space model (VSM), a method for expressing words as vectors. It is a typical information retrieval technique that allows judgments to be made regarding which terms are comparable. The vocabulary is composed of every word in our corpus. Our research engineers a feature from Twitter policy, from which a check statement is generated to determine whether a tweet is spam or ham. Twitter states that a user can send no more than 100 tweets per hour. Therefore, a user who tweets more than 100 times in an hour is likely a spammer.

As mentioned previously, this research gathered tweets continuously for two hours using snscrape. Because the tweets cover a two-hour window, a check statement is needed to classify a tweet as spam or ham. This check is based on Equation (5), where TR represents the tweet rate of a user:

\mathrm{UserTweetRate}(TR) = \frac{\#\text{ of Tweets} \times \text{Time in Hours}}{\#\text{ of Users}}

(5)

In Equation (5), # of tweets represents the total number of tweets posted on Twitter during a specific time period, # of users represents the total number of users who posted tweets during the same time period, and time in hours represents the duration of that time period.
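The arithmetic behind the policy check can be laid out in a few lines. The sketch below derives the 200-tweet threshold from the figures quoted in the text and implements Equation (5) exactly as written; the function names and the sample inputs are hypothetical.

```python
TWEETS_PER_DAY_LIMIT = 2400  # policy figure cited in the text
HOURS_PER_DAY = 24
COLLECTION_HOURS = 2         # length of the collection window described in the text

hourly_limit = TWEETS_PER_DAY_LIMIT // HOURS_PER_DAY  # 100 tweets per hour
threshold = hourly_limit * COLLECTION_HOURS           # 200 tweets per user in the window


def user_tweet_rate(num_tweets, time_in_hours, num_users):
    """Equation (5) as written: (# of tweets x time in hours) / # of users."""
    return num_tweets * time_in_hours / num_users


def exceeds_policy(user_tweet_count):
    """Check statement: True when a user's tweet count exceeds the threshold."""
    return user_tweet_count > threshold


print(hourly_limit, threshold)
print(user_tweet_rate(num_tweets=1000, time_in_hours=2, num_users=10))
print(exceeds_policy(201), exceeds_policy(200))
```

The comparison is strict, matching the `> 200` test in Algorithm 1, so a user with exactly 200 tweets in the window is still treated as ham.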

4.4. Implementation

Our research used the sklearn libraries. Support vector machine (SVM), multinomial naïve Bayes (MNB), and logistic regression (LR) algorithms are employed to increase accuracy. Algorithm 1 presents the steps to label tweets as ham/spam. Figure 2 shows the count of distinct spam/ham usernames against the number of tweets, and Figure 3 shows the count of distinct spam/ham labels against the number of tweets. Figure 4 represents the tweet count against the count of distinct usernames.

Algorithm 2 represents the procedure for categorizing Urdu-language spam tweets. The input is an unlabeled Urdu tweet, and the output is either a spam or a not-spam label. For each tweet in the corpus obtained from Twitter using the snscrape Python module, we applied policy and timestamp features. The dataset is imbalanced, as shown in Figure 5. The policy and timestamp features examine whether a user has surpassed the 200-tweet limit within the two-hour collection window. Based on this characteristic, a spam or non-spam label is assigned to each tweet, as illustrated in Figure 6.

Following feature engineering, we selected an estimator from naïve Bayes, logistic regression, SVM, and BERT. We divided the dataset into training and testing sets using train_test_split from sklearn with a test size of 30%. After preprocessing, each estimator, e.g., SVM, was fitted through an sklearn pipeline comprising TF-IDF, count vectorizer, and the classifier (multinomial naïve Bayes, support vector classifier with an RBF kernel, logistic regression, or BERT). After training each model, we predicted spam or non-spam labels on the remaining 30% testing set. Finally, we calculated accuracy, precision, recall, and F1-score from a confusion matrix by comparing the actual and predicted values.
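The four evaluation metrics follow directly from the cells of a binary confusion matrix. The sketch below shows the standard definitions; the counts passed in are hypothetical and are not taken from Tables 4 to 7.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for an imbalanced test set of 1000 tweets.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=930)
print(metrics)
```

In this made-up example the accuracy is far higher than the F1-score, the same kind of gap that appears between the reported 99.55% accuracy and 0.70 F1-score when one class dominates.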

Algorithm 1: Labeling Ham/Spam Tweets in the Urdu Language

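def labeler_function(row, username_count):
    # username_count is the name of the column holding each user's tweet count.
    # 200 is the policy threshold: 100 tweets per hour over the two-hour window.
    if row[username_count] > 200:
        label = "Spam"
    else:
        label = "Ham"
    return label

# axis=1 applies the function to each row of the pandas DataFrame.
dataset_label = dataset.apply(lambda row: labeler_function(row, username_count), axis=1)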

Algorithm 2: Spam Tweets in Urdu Language

Input: Tweet T_{i}
Output: Spam or Not-Spam
for each tweet T_{i} in Tweets do
    for Policy/Timestamp feature in Policy/Timestamp features
        assign/label 0 to spam and 1 to non-spam
    end for
end for
for each estimator in estimators (Naïve Bayes, Logistic Regression, SVM, BERT)
    uniqueEstimatorArray = estimator.fit(labeled dataset using Policy/Timestamp features)
end for
if uniqueEstimatorArray.Predict(T_{i}) == 1
    label T_{i} = Non-Spam
elseif uniqueEstimatorArray.Predict(T_{i}) == 0
    label T_{i} = Spam
end if
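The training and prediction steps of Algorithm 2 can be sketched with scikit-learn for one of the estimators. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the ten Urdu sentences and their labels are toy data invented for the example (in the study, labels come from the policy and timestamp features), and the step names and `random_state` are arbitrary choices.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in corpus (hypothetical). Labels follow the paper: 0 = spam, 1 = non-spam.
texts = [
    "مفت انعام جیتیں ابھی کلک کریں",
    "ابھی کلک کریں اور مفت موبائل حاصل کریں",
    "مفت آفر صرف آج کلک کریں",
    "انعام حاصل کریں لنک پر کلک کریں",
    "مفت انعام کے لیے ابھی رابطہ کریں",
    "آج لاہور میں موسم بہت اچھا ہے",
    "میں نے آج ایک نئی کتاب پڑھی",
    "کل کرکٹ میچ بہت دلچسپ تھا",
    "ہم شام کو چائے پینے جائیں گے",
    "امتحان کی تیاری جاری ہے",
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# 70/30 split as described in the text; stratify keeps both classes in each part.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=0, stratify=labels
)

# Count vectorizer followed by TF-IDF weighting, then the classifier.
pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classifier", LogisticRegression()),
])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(list(zip(X_test, predictions)))
```

Swapping the last pipeline step for `MultinomialNB()` or `SVC(kernel="rbf")` gives the other two classical estimators named in the text without changing the rest of the code.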

4.5. Dataset Imbalance

As shown in Figure 5, our dataset contains more spam than genuine tweets sent by users. When one class (such as spam) contains many more examples than the other (such as ham), a precision–recall (PR) curve can be a more useful evaluation of a classification model’s performance than metrics such as accuracy. As the threshold for determining whether a tweet is spam or ham is adjusted, the PR curve illustrates the trade-off between precision (the fraction of identified spam tweets that are genuinely spam) and recall (the fraction of actual spam tweets that are correctly classified as spam). A classifier with high accuracy on the majority class but low precision and recall on the minority class would be unsuitable for an imbalanced dataset. Examining the PR curve therefore gives a better picture of the classifier’s performance than accuracy alone.
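The trade-off traced by a PR curve can be illustrated by computing precision and recall at a few thresholds. The scores and labels below are hypothetical, and label 1 marks the class of interest in this sketch (which differs from the 0 = spam encoding used in the paper).

```python
def precision_recall_points(scores, labels, thresholds):
    """Precision and recall of the positive class at each decision threshold."""
    points = []
    for t in thresholds:
        pairs = list(zip(scores, labels))
        tp = sum(1 for s, y in pairs if s >= t and y == 1)
        fp = sum(1 for s, y in pairs if s >= t and y == 0)
        fn = sum(1 for s, y in pairs if s < t and y == 1)
        precision = tp / (tp + fp) if tp + fp else 1.0  # no positive predictions
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((t, precision, recall))
    return points


# Hypothetical classifier scores and true labels (1 = class of interest).
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
points = precision_recall_points(scores, labels, thresholds=[0.2, 0.5, 0.9])
for threshold, precision, recall in points:
    print(threshold, round(precision, 3), round(recall, 3))
```

Raising the threshold makes the classifier more selective, so precision rises while recall falls; plotting these pairs over many thresholds yields the PR curve.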

4.6. Architecture

Figure 6 depicts the architecture of the proposed technique. It shows how the collected dataset is preprocessed and then passed to the machine learning models. During preprocessing, the dataset is cleaned of non-Urdu characters such as punctuation, numbers, and Unicode characters that are not part of the language. The replace function removes numbers, tokenization removes punctuation marks, and the strip function removes extra lines and Unicode characters that are not part of the Urdu language. After tokenization, a full sentence is reconstructed from the remaining tokens.

5. Experimental Results

This section presents the results and discussion of the implementation of the proposed methodology. We collected more than 1,100,000 unique real-time tweets from multiple users. For multinomial naïve Bayes, we used a pipeline of TF-IDF and count vectorizer to overcome low accuracy; likewise, we used a pipeline of count vectorizer and TF-IDF for the RBF kernel support vector machine. Table 4, Table 5, Table 6 and Table 7 represent the confusion matrices of naïve Bayes, SVM, logistic regression, and BERT.

We employed a sequential model for BERT consisting of dropout and dense layers, using ReLU and softmax activation functions for the two layers that came after the dropout layer. We applied a BERT tokenizer to the tweet data to generate a sequence of numbers for each Urdu sentence. Logistic regression achieves both 99.55% accuracy and 94% precision, as seen in Table 8.

Figure 7 shows the confusion matrices of (a) logistic regression, (b) naïve Bayes, (c) SVM, and (d) BERT, with the true/false and positive/negative counts for spam versus ham tweets. Table 7 represents the accuracy results of training and testing of the models using existing ML and DL models. The results show that logistic regression achieves the highest accuracy. In this binary classification task, logistic regression outperforms SVM, multinomial naïve Bayes, and BERT. Several factors, including the amount and complexity of the dataset, the quality of the data, and the choice of hyperparameters, may influence the relative performance of these methods. Logistic regression is a reasonably simple algorithm with low variance, so it is less susceptible to overfitting, and it is quick to train. This makes it suitable for scenarios in which fresh data are continuously generated and the model must be continually updated.

6. Future Research Directions

There are a number of important topics for future research in the domain of Twitter spam review identification. The following items indicate prospective research and development areas:

Examination of online bullying and harassment: There is a need to examine the frequency, features, and impacts of online bullying and harassment on Twitter in greater detail. This entails studying patterns and trends, comprehending the motivations behind such activities, and investigating the impact on victims and the greater online community;
Methods for detection and prevention: Methods for detecting and preventing cyberbullying and cyber-harassment must be developed. These strategies should take Twitter’s standards for identifying and responding to harassment and abusive behavior into account. It is essential to develop automated methods that can detect and indicate incidents of harassment in real-time, allowing for prompt interventions;
Machine learning and natural language processing: To identify and classify instances of online harassment and cyberbullying on Twitter, it is necessary to construct sophisticated machine learning models and natural language processing techniques. These models should be consistent with Twitter’s guidelines regarding abusive behavior and harassment, allowing for the proper identification and classification of harmful content [43,44];
Dataset expansion and feature extraction: To improve the robustness and generalizability of the models, it is essential to expand the existing dataset utilized for spam identification. The emphasis of research should be on gathering more diverse and representative samples of spam and non-spam tweets. Additionally, studying various feature extraction techniques can result in a more accurate representation of the distinctive qualities of spam content;
Comparison with advanced baseline approaches: While the current study employed simple baseline approaches, future research could investigate more complicated and advanced strategies for spam identification. Comparing the performance of advanced classifiers, such as deep learning models or ensemble approaches, can shed light on their efficacy and potential for improvement over conventional classifiers;
Classifier optimization: Efforts should be made consistently to optimize the performance of spam identification classifiers. This involves fine-tuning the models, refining the feature selection procedure, and researching creative techniques to improve the accuracy, precision, and recall of spam detection.

By pursuing these research directions, we can obtain a greater knowledge of online harassment, enhance the efficacy of detection and prevention approaches, and improve Twitter’s overall safety and user experience.

7. Conclusions

This study compiled a large dataset of more than 1,100,000 unique real-time tweets from multiple users. The tweets were carefully collected and processed to ensure that the data is clean and to extract the important features. For data collection, we used snscrape, which provides an API for retrieving a wide variety of attributes (username, URL, tweet content, and so on). According to Twitter’s rules, users cannot send more than 100 tweets per hour, so tweets from users who exceed this limit are considered spam. After extracting features based on Twitter policy, we trained and tested machine learning and BERT layer models to conduct spam analysis. Logistic regression achieved the best F1-score (0.70) and the highest accuracy (99.55%). The results of this study highlight the effectiveness of policy-based spam detection in Urdu tweets using ML and BERT layer models. The findings contribute to the development of robust spam detection techniques specifically tailored for the Urdu language on social media platforms. Future studies will focus on Word2Vec word embeddings, one-hot embedding, and FastText to increase performance and reduce bias. Convolutional neural networks (CNN) and bilingual BERT can also be utilized in future studies to improve accuracy.

Author Contributions

The authors’ contributions to this paper are as follows. Conceptualization, M.D.; methodology, M.D. and F.I.; software, M.D. and F.I.; literature review, N.S.M.J.; validation, F.I., A.A. and N.S.M.J.; formal analysis, R.L., F.I. and A.A.; investigation, R.L., F.I. and A.A.; resources, R.L. and N.S.M.J.; writing (original draft preparation), M.D., F.I. and A.A.; writing (review and editing), R.L., F.I. and A.A.; supervision, F.I.; project administration, N.S.M.J.; funding acquisition, R.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Artificial Intelligence and Data Analytics Laboratory, College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia, and in part by the University of Engineering and Technology (UET), Lahore.

Data Availability Statement

The dataset is publicly available on Kaggle. Other data will be shared upon reasonable request.

Acknowledgments

The authors would like to acknowledge the support of Prince Sultan University for paying the Article Processing Charges (APC) of this publication.

Conflicts of Interest

The authors declare that they have no conflict of interest to report regarding the present study.

References

Alorini, D.; Rawat, D.B. Automatic spam detection on gulf dialectical. In Proceedings of the Conference on Computing, Networking and Communication, Honolulu, HI, USA, 18–21 February 2019; pp. 2325–2626.
Liu, S.; Wang, Y.; Zhang, J.; Chen, C.; Xiang, Y. Addressing the class imbalance problem in Twitter spam detection using ensemble learning. Comput. Secur. 2017, 69, 35–49.
Wu, T.; Liu, S.; Zhang, J.; Xiang, Y. Twitter spam detection based on deep learning. In Proceedings of the Australasian Computer Science Week Multiconference, Geelong, Australia, 31 January 2017; pp. 1–8.
Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53.
Larabi-marie-sainte, S.; Ghouzali, S.; Saba, T.; Aburahmah, L.; Almohaini, R. Improving spam email detection using deep recurrent neural network. Inst. Adv. Eng. Sci. 2022, 25, 1625–1633.
Pang, B.; Lee, L. Opinion mining and sentiment analysis. Found. Trends Inf. Retr. 2008, 2, 1–135.
Lahoti, P.; Morales, G.D.F.; Gionis, A. Finding topical experts in Twitter via query-dependent personalized PageRank. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017 (ASONAM ’17), Association for Computing Machinery, New York, NY, USA, 31 July–3 August 2017; pp. 155–162.
Rosenthal, M.; Kulkarni, V.; Preoţiuc-Pietro, D.V. Semeval-2015 task 10: Sentiment analysis in twitter. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, CO, USA, 4–5 June 2015; pp. 451–463.
Kolchyna, A.; Hopfgartner, F.; Pasi, G.; Albayrak, S. Exploring crowdsourcing for opinion spam annotation. In Proceedings of the 9th International Conference on Web and Social Media (ICWSM), Shanghai, China, 6–8 November 2015; pp. 437–440.
Maas, A.L.; Daly, R.E.; Pham, P.T.; Huang, D.; Ng, A.Y.; Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; pp. 142–150.
Afzal, N.; Afzal, S.; Shafait, S.; Majeed, F. Leveraging machine learning to investigate public opinion of Pakistan. In Proceedings of the 26th ACM International Conference on Information and Knowledge Management (CIKM), Singapore, 6–10 November 2017; pp. 1883–1886.
Javed, M.N.; Khan, A.; Majeed, F.; Shafait, S. Urdconv: A large-scale urdu conversation corpus. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Online, 19–23 April 2021; pp. 677–687.
Ahmed, A.; Shafait, S. SMS spam filtering for Urdu text messages. In Proceedings of the International Conference on Computational Linguistics (COLING), Beijing, China, 23–27 August 2010; pp. 8–15.
Javed, M.N.; Khan, A.; Majeed, F.; Shafait, S. Towards effective spam detection in social media: The case of Urdu language. In Proceedings of the 20th International Conference on Asian Language Processing (IALP), Singapore, 5–7 December 2017; pp. 92–95.
Mehmood, A.; Farooq, M.S.; Naseem, A.; Rustam, F.; Villar, M.G.; Rodríguez, C.L.; Ashraf, I. Threatening URDU Language Detection from Tweets Using Machine Learning. Appl. Sci. 2022, 12, 10342.
Dar, M.; Iqbal, F. Urdu Tweets Dataset for Spam Detection. Kaggle. Available online: https://www.kaggle.com/datasets/momnadar23/urdu-tweets-dataset-for-spam-detection (accessed on 1 May 2023).
Ge, Z.; Sun, Y.; Smith, M. Authorship attribution using a neural network language model. Proc. AAAI Conf. Artif. Intell. 2016, 30.
Anwar, W.; Bajwa, I.S.; Choudhary, M.A.; Ramzan, S. An empirical study on forensic analysis of Urdu text using LDA-based authorship attribution. IEEE Access 2018, 7, 3224–3234.
Mashooq, M.; Riaz, S.; Farooq, M.S. Urdu Sentiment Analysis: Future Extraction, Taxonomy, and Challenges. VFAST Trans. Softw. Eng. 2022, 10.
Hussain, N.; Mirza, H.T.; Iqbal, F.; Hussain, I. Detecting Spam Product Reviews in Roman Urdu Scripts. Oxf. Comput. J. 2020, 64, 432–450.
Hussain, N.; Turab Mirza, H.; Ali, A.; Iqbal, F.; Hussain, I.; Kaleem, M. Spammer Group Detection and Diversification of Customer Reviews. PeerJ Comput. Sci. 2021, 7, e472.
Hussain, N.; Turab Mirza, H.; Hussain, I.; Iqbal, F.; Memon, I. Spam Review Detection Using the Linguistic and Spammer Behavioral Methods. IEEE Access 2020, 8, 53801–53816.
Duma, R.A.; Niu, Z.; Nyamawe, A.S.; Tchaye-Kondi, J.; Yusuf, A.A. A Deep Hybrid Model for fake review detection by jointly leveraging review text, overall ratings, and aspect ratings. Soft Comput. 2023, 27, 6281–6296.
Vijayakumar, B.; Fuad, M.M.M. A new method to identify short-text authors using combinations of machine learning and natural language processing techniques. Procedia Comput. Sci. 2019, 159, 428–436.
Mekala, S.; Tippireddy, R.R.; Bulusu, V.V. A novel document representation approach for authorship attribution. Int. J. Intell. Eng. Syst. 2018, 11, 261–270.
Saha, N.; Das, P.; Saha, H.N. Authorship attribution of short texts using multi-layer perceptron. Int. J. Appl. Pattern Recognit. 2018, 5, 251–259.
Benzebouchi, N.E.; Azizi, N.; Hammami, N.E.; Schwab, D.; Khelaifia, M.C.E.; Aldwairi, M. Authors’ writing styles based authorship identification system using the text representation vector. In Proceedings of the 2019 16th International Multi-Conference on Systems, Signals & Devices (SSD), Istanbul, Turkey, 21–24 March 2019; pp. 371–376.
Sun, N.; Lin, G.; Qiu, J.; Rimba, P. Near real-time twitter spam detection with machine learning techniques. Int. J. Comput. Appl. 2022, 44, 338–348.
Khanday, A.M.D.; Wani, M.A.; Rabani, S.T.; Khan, Q.R. Hybrid Approach for Detecting Propagandistic Community and Core Node on Social Networks. Sustainability 2023, 15, 1249.
Jain, G.; Sharma, M.; Agarwal, B. Optimizing semantic LSTM for spam detection. Int. J. Inf. Technol. 2019, 11, 239–250.
Li, D.; Ahmed, K.; Zheng, Z.; Mohsan, S.; Alsharif, M.; Myriam, H.; Jamjoom, M.; Mostafa, S. Roman Urdu sentiment analysis using transfer learning. Appl. Sci. 2022, 12, 10344.
Muhammad, K.B.; Burney, S.A. Innovations in Urdu Sentiment Analysis Using Machine and Deep Learning Techniques for Two-Class Classification of Symmetric Datasets. Symmetry 2023, 15, 1027.
Rozaq, A.; Yunitasari, Y.; Sussolaikah, K.; Sari, E.R. Sentiment Analysis of Kampus Mengajar 2 Toward the Implementation of Merdeka Belajar Kampus Merdeka Using Naïve Bayes and Euclidean Distance Methods. Int. J. Adv. Data Inf. Syst. 2022, 3, 30–37.
Hussain, N. Spam Review Detection through Behavioral and Linguistic Approaches. Computational Intelligence, Machine Learning, and Data Analytics. Ph.D. Dissertation, Department of Computer Science COMSATS University Lahore, Lahore, Pakistan, 2022.
Akhter, M.P.; Zheng, J.; Afzal, F.; Lin, H.; Riaz, S.; Mehmood, A. Supervised ensemble learning methods towards automatically filtering Urdu fake news within social media. PeerJ Comput. Sci. 2021, 7, e425.
Akhter, M.P.; Jiangbin, Z.; Naqvi, I.R.; Abdelmajeed, M.; Fayyaz, M. Exploring deep learning approaches for Urdu text classification in product manufacturing. Enterp. Inf. Syst. 2022, 16, 223–248.
Ali, R.; Farooq, U.; Arshad, U.; Shahzad, W.; Beg, M.O. Hate speech detection on Twitter using transfer learning. Comput. Speech Lang. 2022, 74, 101365.
Uzan, M.; HaCohen-Kerner, Y. Detecting Hate Speech Spreaders on Twitter using LSTM and BERT in English and Spanish. In Proceedings of the Conference and Labs of the Evaluation Forum, CLEF (Working Notes), Bucharest, Romania, 21–24 September 2021; pp. 2178–2185.
Akhter, M.P.; Jiangbin, Z.; Naqvi, I.R.; Abdelmajeed, M.; Mehmood, A.; Sadiq, M.T. Document-level text classification using single-layer multisize filters convolutional neural network. IEEE Access 2020, 8, 42689–42707.
Qutab, I.; Malik, K.I.; Arooj, H. Sentiment Classification Using Multinomial Logistic Regression on Roman Urdu Text. Int. J. Innov. Sci. Technol. 2022, 4, 223–335.
Rasheed, I.; Banka, H.; Khan, H.M. A hybrid feature selection approach based on LSI for classification of Urdu text. In Machine Learning Algorithms for Industrial Applications; Springer: Berlin/Heidelberg, Germany, 2021; pp. 3–18.
Twitter. Understanding Twitter Limits (Twitter Help). Available online: https://help.twitter.com/en/rules-and-policies/twitter-limits (accessed on 17 May 2023).
Daud, S.; Ullah, M.; Rehman, A.; Saba, T.; Damaševičius, R.; Sattar, A. Topic Classification of Online News Articles Using Optimized Machine Learning Models. Computers 2023, 12, 16.
Ozdemir, B.; AlGhamdi, H.M. Investigating the Distractors to Explain DIF Effects Across Gender in Large-Scale Tests With Non-Linear Logistic Regression Models. Front. Educ. 2022, 6, 552.

Figure 1. Distribution of the total number of tweets by username.

Figure 2. Spam/ham distinct username count.

Figure 3. Spam/ham labels count against the number of tweets.

Figure 4. Number of tweets count against distinct usernames count.

Figure 5. Depiction of an imbalanced dataset of spam and ham tweets.

Figure 6. Architecture of the proposed methodology for detecting spam in Urdu tweets.

Figure 7. Confusion matrix for (a) logistic regression; (b) naïve Bayes; (c) SVM; and (d) BERT.

Table 1. Comparison of available English and Urdu datasets for spam review.

Dataset Name	Language	Number of Tweets/Messages	Size of Dataset	Quality
Twitter Sentiment Analysis Dataset [6]	English	1.6 million	Large	High
Twitter Spam Classification Dataset [7]	English	25,000	Medium	High
SemEval-2015 Task 10 [8]	English	N/A	N/A	High
CrowdFlower Twitter Spam Corpus [9]	English	10,000	Small	Medium
Stanford Large Movie Review Dataset [10]	English	50,000	Large	High
Urdu Sentiment Corpus [11]	Urdu	5000	Small	Medium
UrdConv Corpus [12]	Urdu	22,000	Medium	Medium
COMSATS Urdu SMS Spam Corpus [13]	Urdu	10,000	Small	Medium
Social Media Urdu Corpus [14]	Urdu	6500	Small	Low

Table 2. Research gap analysis in the existing literature.

Papers	Urdu Dataset	Techniques	Feature	Policy Feature	Timestamp Feature
[30]	×	LSTM	Features on frequent or important Corpus	×	×
[31]	×	Transfer Learning	Feature mapping using CNN	×	×
[32]	✓	Naïve Bayes, SVM	Content feature	×	×
[33]	×	Naïve Bayes	TFIDF feature extraction	×	×
[34]	×	Linguistics Approaches	Linguistic and behavioral features	×	×
[35]	×	Machine Learning	TF-IDF feature extraction	×	×
[36]	✓	Deep Learning	CNN-LSTM n-gram feature	×	×
[37]	✓	BERT	Linguistic feature	×	×
[38]	×	LSTM, BERT	Word n-gram features	×	×
[39]	×	CNN	Feature mapping using CNN	×	×
[40]	✓	Logistic Regression	TFIDF feature extraction	×	×
[41]	✓	Machine Learning	Latent semantic indexing (LSI) extracted features	×	×
Proposed Research	✓	SVM, Logistic Regression, Naïve Bayes, BERT	Policy and timestamp features	✓	✓

Key: ✓ indicates that the feature is available; × indicates that the feature is not available.

Table 3. Data statistics.

Name	Spam	Ham
Count	7319	1,213,068
Mean	0.6%	99.4%
Standard Deviation	7.72%	7.72%
Maximum	900	198
Minimum	208	1
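
As a quick plausibility check on Table 3, the short Python sketch below recomputes the Mean and Standard Deviation rows from the Count row alone. It rests on an assumption that the paper does not state explicitly: that both rows describe a binary 0/1 label indicator, so the mean is the class proportion p and the standard deviation is sqrt(p(1 − p)), which is identical for both classes. The Maximum and Minimum rows are not covered by this check.

```python
import math

# Counts taken from the "Count" row of Table 3.
spam_count = 7319
ham_count = 1_213_068
total = spam_count + ham_count

# Assumption: "Mean" is the share of each class among all tweets.
p_spam = spam_count / total
p_ham = ham_count / total

# Assumption: "Standard Deviation" is that of a 0/1 indicator variable,
# sqrt(p * (1 - p)), which is the same for spam and ham.
std = math.sqrt(p_spam * p_ham)

print(f"spam share: {p_spam:.1%}")  # 0.6%
print(f"ham share:  {p_ham:.1%}")   # 99.4%
print(f"std dev:    {std:.2%}")     # 7.72%
```

Under these assumptions the recomputed values agree with the table, which also makes the degree of class imbalance explicit: roughly 6 spam tweets per 1000.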

Table 4. Confusion matrix for naïve Bayes.

Name	Spam	Ham
Spam	91	2114
Ham	36	363,876

Table 5. Confusion matrix for RBF kernel SVM.

Name	Spam	Ham
Spam	0	2205
Ham	0	363,912

Table 6. Confusion matrix for logistic regression.

Name	Spam	Ham
Spam	590	1615
Ham	83	363,829

Table 7. Confusion matrix for BERT.

Name	Spam	Ham
Spam	0	1098
Ham	0	85,883

Table 8. Accuracy results of the trained models.

Model/Ensemble	Library	Accuracy	F1-Measure	Recall	Precision
Multinomial Naïve Bayes, CV, TF-IDF	SK-Learn	99.40%	0.54	52.0%	86.0%
RBF Kernel SVM, CV, TF-IDF	SK-Learn	99.38%	0.50	50.0%	50.0%
Logistic Regression, CV, TF-IDF	SK-Learn	99.55%	0.70	63.0%	94.0%
BERT	Transformers	99.00%	0.50	50.0%	49.0%
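
The relationship between the confusion matrices in Tables 4–6 and the scores in Table 8 can be sketched in a few lines of Python. Two assumptions are made here that the tables themselves do not state: rows are taken as actual classes and columns as predicted classes, and precision, recall, and F1 are macro-averaged over the spam and ham classes. A class with no predicted positives is given a precision of 0, which is a convention chosen for this sketch. The helper name `macro_metrics` is hypothetical.

```python
def macro_metrics(tp, fn, fp, tn):
    """Accuracy and macro-averaged scores for a 2x2 confusion matrix.

    Spam is treated as the positive class:
    tp = spam predicted spam, fn = spam predicted ham,
    fp = ham predicted spam,  tn = ham predicted ham.
    """
    def safe_div(num, den):
        # Convention for this sketch: undefined ratios count as 0.
        return num / den if den else 0.0

    accuracy = (tp + tn) / (tp + fn + fp + tn)

    # Per-class precision and recall.
    spam_p, spam_r = safe_div(tp, tp + fp), safe_div(tp, tp + fn)
    ham_p, ham_r = safe_div(tn, tn + fn), safe_div(tn, tn + fp)

    # Per-class F1, then the unweighted (macro) mean over both classes.
    spam_f1 = safe_div(2 * spam_p * spam_r, spam_p + spam_r)
    ham_f1 = safe_div(2 * ham_p * ham_r, ham_p + ham_r)

    return {
        "accuracy": accuracy,
        "precision": (spam_p + ham_p) / 2,
        "recall": (spam_r + ham_r) / 2,
        "f1": (spam_f1 + ham_f1) / 2,
    }


# Values copied from Tables 4, 5, and 6.
naive_bayes = macro_metrics(tp=91, fn=2114, fp=36, tn=363_876)
svm = macro_metrics(tp=0, fn=2205, fp=0, tn=363_912)
logreg = macro_metrics(tp=590, fn=1615, fp=83, tn=363_829)

for name, scores in [("NB", naive_bayes), ("SVM", svm), ("LR", logreg)]:
    print(name, {key: round(value, 2) for key, value in scores.items()})
```

Under these assumptions, the precision, recall, and F1 values in Table 8 for naïve Bayes, SVM, and logistic regression are reproduced to two decimal places. The sketch also shows why accuracy alone is uninformative on this dataset: the SVM predicts every tweet as ham and still exceeds 99% accuracy, while its macro recall is only 0.50.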

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
