Detecting Suspicious Texts using Machine Learning Techniques

Abstract: Due to the substantial growth in the number of internet users and easy access via electronic devices, the amount of electronic content has grown enormously in recent years through instant messaging, social networking posts, blogs, online portals, and other digital platforms. Unfortunately, misuse of technology has increased with this rapid growth of online content, leading to a rise in suspicious activities. People misuse web media to disseminate malicious content, perform illegal activities, abuse other people, and publicize suspicious content on the web. Suspicious content is usually available in the form of text, audio, or video, and text has been used in most cases to perform suspicious activities. Thus, one of the most challenging issues for NLP researchers is to develop a system that can identify suspicious text efficiently. In this paper, a Machine Learning (ML)-based classification model (hereafter called STD) is proposed to classify Bengali text into non-suspicious and suspicious categories based on its content. A set of ML classifiers with various features has been applied to our developed corpus of 7000 Bengali text documents, of which 5600 were used for training and 1400 for testing. The performance of the proposed system is compared with a human baseline and existing ML techniques. The SGD classifier with tf-idf and a combination of unigram and bigram features achieved the highest accuracy of 84.57%.


Introduction
Effortless access to the Internet, the web, blogs, social media, discussion forums, and other online platforms via digital gadgets has produced a massive volume of digital text content in recent years. It is observed that not all of this content is genuine or authentic; some of it is fake, fabricated, forged, or even suspicious. Unfortunately, with this rapid growth of digital content, misuse of the Internet has also multiplied, leading to a rise in suspicious activities [1]. Suspicious content is increasing day by day because a few individuals misuse the Internet to promulgate violence, share illegal activities, bully other people, perform smishing, publicize incitement-related content, spread fake news, and so on. According to the FBI's Internet Crime Complaint Center (IC3) report, a total of 467,361 complaints related to internet-facilitated criminal activity were received in the year 2019 [2]. Moreover, several extremist users use social media or blogs to spread suspicious and violent content, which is considered a threat to national security [3].
Around 245 million people speak Bengali as their native tongue, making it the 7th most spoken language in the world [4]. However, research on Bengali Language Processing (BLP) is still at an initial stage, and no significant amount of work has been conducted yet compared to English, Arabic, Chinese, or other European languages, which makes Bengali a resource-constrained language [5]. To the best of our knowledge, no research has been conducted to date on suspicious text detection in the Bengali language. However, such systems are required to ensure security as well as to mitigate national threats in cyberspace.
Suspicious content is content that hurts religious feelings, provokes people against the government and law enforcement agencies, motivates people to perform acts of terrorism, carries out criminal acts through phishing, smishing, or pharming, instigates a community without any reason, or executes acts of extortion [6][7][8][9]. For example, social media has already been used as a medium of communication in the Boston attack and the revolution in Egypt [10]. Suspicious content can be available in the form of video, audio, images, graphics, and text. However, text plays an essential role in communication and is the most widely used medium of communication in the cyber world. Moreover, the semantic meaning of a conversation can be retrieved by analyzing text content, which is difficult with other forms of content. In this work, we focus on analyzing text content and classifying it as suspicious or non-suspicious.
A text is detected as suspicious if it contains suspicious content. It is impossible to manually detect suspicious texts among the enormous amount of internet text content [11]. Therefore, automatic detection of suspicious text content should be developed; responsible agencies are demanding smart tools or systems that can detect suspicious text automatically. Such systems would also help identify potential threats in the cyber world that are communicated through text. An automatic suspicious text detection system can easily and promptly detect fishy or threatening texts, so that law enforcement authorities can take appropriate measures immediately, which in turn helps reduce virtual harassment and suspicious and criminal activities mediated online. However, classifying Bengali text into suspicious and non-suspicious classes is quite challenging due to the language's complex morphological structure, enormous number of synonyms, and rich variation of verb auxiliaries with subject, person, tense, aspect, and gender. Moreover, the scarcity of resources and the lack of a benchmark Bengali text dataset are major barriers to building a suspicious text detection system and make it more difficult to implement compared to other languages. Therefore, the research question addressed in this paper is: "RQ: How can we effectively classify potential Bengali texts into suspicious and non-suspicious categories?" To address this research question, we first develop a dataset of suspicious and non-suspicious texts drawn from a number of well-known Bengali data sources, such as Facebook posts, blogs, websites, and newspapers. In order to process the textual data, we take into account unigram, bigram, and trigram features using tf-idf and bag-of-words feature extraction techniques.
Once feature extraction has been done, we employ the most popular machine learning classifiers (i.e., logistic regression, naive Bayes, random forest, decision tree, stochastic gradient descent) to classify whether a given text is suspicious or not. We have also performed a comparative analysis of these machine learning models on our collected dataset. The key contributions of our work are as follows:
• Develop a corpus containing 7000 text documents labelled as suspicious or non-suspicious.
• Design a classifier model to classify Bengali text documents into suspicious or non-suspicious categories on the developed corpus by exploring different feature combinations.
• Compare the performance of the proposed classifier with various machine learning techniques as well as the existing method.
• Analyze the performance of the proposed classifier on different distributions of the developed dataset.
• Exhibit a performance comparison between human experts (i.e., the baseline) and machine learning algorithms.
We expect that the work presented in this paper will play a pioneering role in the development of Bengali suspicious text detection systems. The rest of the paper is organized as follows: Section 2 presents related work. Section 3 gives a brief description of the development of the suspicious Bengali corpus and its properties. Section 4 explains the proposed Bengali suspicious text classification system and its significant constituents. Section 5 describes the evaluation techniques used to assess the performance of the proposed approach. Several experimental results are presented in Section 6. Finally, Section 7 concludes the paper with a summary and a discussion of future scope.

Related Work
Suspicious content detection is a well-studied research issue for highly resourced languages like Arabic, Chinese, English, and other European languages. However, no meaningful research activities have yet been conducted to classify text with suspicious content in the BLP domain. A machine learning-based system was developed to detect the promotion of terrorism by analyzing the contents of a text: Iskandar et al. [12] collected data from Facebook, Twitter, and numerous micro-blogging sites to train their model. By performing a critical analysis of different algorithms, they showed that Naïve Bayes is best suited for their work as it deals with probabilities [13]. Johnston et al. [14] proposed a neural network-based system that can classify propaganda related to Sunni extremist users on social media platforms; their approach obtained 69.9% accuracy on the developed dataset. A method to identify suspicious profiles within social media was presented in which normalized compression distance was utilized to analyze text [15]. Jiang et al. [16] discuss current trends and provide future directions for determining suspicious behaviour in various mediums of communication. Researchers investigated the novelty of true and false news on 126,000 stories tweeted 4.5 million times using ML techniques [17]. An automated system explained a technique for detecting hate speech from Twitter data [18]; logistic regression with regularization outperformed the other algorithms by attaining an accuracy of 90%. An intelligent system was introduced to detect suspicious messages in Arabic tweets [19]; this system yields a maximum accuracy of 86.72% using SVM with a limited number of data and classes. Dinakar et al. [20] developed a corpus of YouTube comments for detecting textual cyberbullying using multiclass and binary classifiers. A novel approach was presented for detecting Indonesian hate speech using SVM with lexical, word unigram, and tf-idf features [21]. Another method detects abusive content and cyberbullying on Chinese social media; the model achieved 95% accuracy by using an LSTM with characteristic and behavioural features of a user [22]. Hammer [23] discussed a way of detecting violence and threats toward minority groups in online discussions; this work considered manually annotated sentences with bigram features of essential words.
Since Bengali is an under-resourced language, the amount of digitized text related to suspicious, fake, or incitement content is quite small. In addition, no benchmark dataset is available for suspicious text. For these reasons, very few research activities have been carried out in this area of BLP, and they mainly relate to hate, threat, fake, and abusive text detection. Ishmam et al. [24] compared machine learning and deep learning-based models for detecting hateful Bengali language. Their method achieved 70.10% accuracy by employing a gated recurrent neural network (GRNN) on a dataset of 6 classes and 5K documents collected from numerous Facebook pages. The reason behind this poor accuracy is the small number of training documents in each class (approximately 900). Most importantly, they did not define the classes clearly, which is crucial for the hateful text classification task. A recent work explored different machine and deep learning techniques to detect abusive Bengali comments [25]; the model acquired 82% accuracy using an RNN on 4700 Bengali text documents. Ehsan et al. [26] discussed another approach for detecting abusive Bengali text by combining different n-gram features and ML techniques; their method obtained the highest accuracy with SVM and trigram features. A method to identify malicious content in Bengali text was presented by Islam et al. [27]; it achieved 82.44% accuracy on an unbalanced dataset of 1965 instances by applying the Naive Bayes algorithm. Hossain et al. [28] developed a dataset of 50k instances to detect fake news in Bangla and extensively analyzed linguistic as well as machine learning-based features. Another system demonstrated a technique to identify threats and abusive Bengali words on social media using SVM with a linear kernel [29]; the model was evaluated on 5644 text documents and obtained a maximum accuracy of 78%.
To the best of our knowledge, no notable research has been reported so far that focuses on detecting suspicious Bengali text. Our previous approach used logistic regression with the BoW feature extraction technique to detect suspicious Bengali text content [30]. However, that work considered only 2000 text documents, on which it achieved an accuracy of 92%. In this work, our main concern is to develop an ML-based suspicious Bengali text detection model trained on our new dataset by exploring various n-gram features and feature extraction techniques.

A Novel Suspicious Bangla Text Dataset
To date, no dataset is available for identifying Suspicious Bengali Texts (SBT). Therefore, we developed the Suspicious Bengali Text Dataset (SBTD), a novel annotated corpus that serves our purpose. The following subsections explain the definition of SBT with its inherent characteristics and detailed statistics of the developed SBTD.

Suspicious Text and Suspicious Text Detection
A Suspicious Text Detection (STD) system classifies a text t_i ∈ T from a set of texts T = {t_1, t_2, ..., t_m} into a class c_i ∈ C from a set of two classes C = {C_s, C_ns}. The task of STD is to assign this class automatically. Deciding whether a Bengali text is suspicious or not is not simple even for language experts because of the language's complicated morphological structure, rich variation in sentence formation, and lack of well-defined related terminology. Therefore, a clear definition of SBT is crucial for making the task of SBTD smoother. In order to introduce a reasonable definition for the Bengali language, several definitions of violent, inciting, suspicious, and hateful content were analyzed. Most of this information, collected from different social networking websites and scientific papers, is summarized in Table 1:
• Twitter: "One may not promote terrorism or violent extremism, harass or threaten other people, or incite fury toward a particular person or class of people." [31]
• YouTube: "Contents that incite others to promote or commit violence against individuals and groups based on religion, nationality, ethnicity, sex/gender, age, race, disability, gender identity/sexual orientation." [32]
• Council of Europe (COE): "Expression which incites, spreads, promotes or justifies violence toward a specific individual or class of persons for a variety of reasons." [33]
• Paula et al.: "Language that glorifies violence and hate, incites people against groups based on religion, ethnic or national origin, physical appearance, gender identity or other." [7]
The majority of the quoted definitions focus on similar attributes, such as incitement to violence, promotion of hate and terrorism, and threats against a person or group of people. These definitions cover the larger aspects of suspicious content in video, text, image, cartoon, illustration, and graphic form.
• Tokenization: Tokenization produces a list of the words of the input text, such as t_i = ['Ei', 'khulna', 'titans', 'ke', 'tin', 'wickete', 'hariye', 'dilo', 'comilla', 'victoria'].
• Removal of stop words: Words that make no contribution to deciding whether a text t_i belongs to (C_s) or (C_ns) are considered unnecessary. Such words are removed from the document by matching against a list of stop words. After removing the stop words, the processed text t_i = "Khulna titans ke tin wickete hariye dilo comilla victoria" (English translation: Comilla Victoria defeated Khulna Titans by three wickets) is used for training.
With the help of the above operations, a set of processed texts is created. These texts are stored in chronological order in a dictionary using array indexing A[t_1] ... A[t_7000], each with a numeric label (0 or 1), where 0 represents the non-suspicious class and 1 the suspicious class.
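The preprocessing steps above can be sketched as follows. The whitespace tokenizer and the tiny stop-word set are simplified stand-ins for illustration only, not the actual Bengali resources used in this work:

```python
# Sketch of the preprocessing pipeline: tokenize a text, drop stop words,
# and store processed texts in order with a 0/1 label.
STOP_WORDS = {"ei", "ke"}  # hypothetical subset of a Bengali stop-word list

def preprocess(text):
    """Tokenize on whitespace and remove stop words."""
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

# 0 = non-suspicious (C_ns), 1 = suspicious (C_s)
corpus = [
    ("Ei khulna titans ke tin wickete hariye dilo comilla victoria", 0),
]

processed = [(" ".join(preprocess(text)), label) for text, label in corpus]
print(processed[0])
# → ('khulna titans tin wickete hariye dilo comilla victoria', 0)
```

A real pipeline would also strip punctuation and digits before tokenization, but the structure stays the same.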

Feature Extraction
Machine learning models cannot learn directly from the raw texts that we have prepared. Feature extraction maps these texts to numeric representations. This work explored the bag-of-words (BoW) and term frequency-inverse document frequency (tf-idf) feature extraction techniques. The BoW technique uses word frequencies as features: each cell gives the count (c) of a feature word (f_wi) in a text document (t_i). With this technique, unwanted words may get higher weights than context-related words. The tf-idf technique mitigates this weighting problem by calculating the tf-idf value according to Equation 1:

tf-idf(f_wi, t_i) = tf(f_wi, t_i) × log(m / |{t ∈ m : f_wi ∈ t}|)   (1)

Here, tf(f_wi, t_i) is the frequency of word f_wi in text document t_i, m is the total number of text documents, and |{t ∈ m : f_wi ∈ t}| is the number of text documents t containing the word f_wi. The tf-idf value of the feature words (f_w) puts more emphasis on words related to the context than on other words. To find the final weighted representation of a sentence, the Euclidean norm is computed after calculating the tf-idf values of the feature words of the sentence. This normalization puts higher weight on feature words with smaller variance. Equation 2 computes the norm:

X_norm(i) = X_i / sqrt(X_1² + X_2² + ... + X_n²)   (2)
Here, X_norm(i) is the normalized value of the feature word f_wi, and X_1, X_2, ..., X_n are the tf-idf values of the feature words f_w1, f_w2, ..., f_wn, respectively. Features selected by both techniques have been applied to the classifiers. Table 4 shows sample feature values for both feature extraction techniques. Features are represented by an array of size (m × n) with m rows and n columns: the text documents t_1, t_2, ..., t_m are represented in rows, while the feature words f_w1, f_w2, ..., f_wn are represented in columns. In order to reduce complexity and computational cost, the 3000 most frequent words among thousands of unique words are considered as feature words. The model extracted linguistic n-gram features from the texts. The n-gram approach takes account of the word order in a sentence in order to extract more meaning; here 'n' indicates the number of consecutive words treated as one gram. N-gram features, as well as combinations of n-gram features, are applied in the proposed model. Table 5 illustrates various n-gram features. Combinations of the two feature extraction techniques and the n-gram features are applied to find the best-suited model for suspicious Bengali text detection.
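Both feature extraction schemes can be sketched with scikit-learn, one common way to implement them (the paper does not name its toolkit, so this is an assumption). `ngram_range=(1, 2)` corresponds to the unigram+bigram combination, `max_features=3000` to the vocabulary cut-off described above, and `norm="l2"` to the Euclidean normalization of Equation 2; the two toy texts are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = [
    "khulna titans tin wickete hariye dilo comilla victoria",
    "comilla victoria match jitlo",
]

# Bag-of-words: raw term counts per (document, feature word) cell.
bow = CountVectorizer(ngram_range=(1, 2), max_features=3000)
X_bow = bow.fit_transform(texts)

# tf-idf with unigram + bigram features; l2 normalization matches Equation 2.
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=3000, norm="l2")
X_tfidf = tfidf.fit_transform(texts)

print(X_tfidf.shape)  # (documents m, feature words n)
```

Each resulting matrix is the (m × n) document-by-feature array described above, stored sparsely.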

Classification
The features obtained in the previous step are used to train machine learning models employing different popular classification algorithms [37]: stochastic gradient descent (SGD), logistic regression (LR), decision tree (DT), random forest (RF), and multinomial naïve Bayes (MNB). Logistic regression [38] is well suited to our task as it is a binary classification problem. Equations 3 and 4 define the logistic function, which determines the output of logistic regression, and its cost function:

h_θ(x) = 1 / (1 + e^(−θᵀx))   (3)

J(θ) = −(1/m) Σᵢ₌₁..ₘ [y_i log(h_θ(x_i)) + (1 − y_i) log(1 − h_θ(x_i))]   (4)

Here, m is the number of training examples, h_θ(x_i) is the hypothesis for the i-th training example, and y_i is the label of the i-th training example.
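A minimal numeric sketch of Equations 3 and 4, using made-up raw scores for two training examples:

```python
import math

def sigmoid(z):
    # Equation 3: logistic function h_theta(x) for a raw score z = theta^T x
    return 1.0 / (1.0 + math.exp(-z))

def cost(h_vals, y_vals):
    # Equation 4: cross-entropy cost averaged over m training examples
    m = len(y_vals)
    return -sum(y * math.log(h) + (1 - y) * math.log(1 - h)
                for h, y in zip(h_vals, y_vals)) / m

# Two illustrative examples: scores 2.0 (label 1) and -1.0 (label 0)
h = [sigmoid(2.0), sigmoid(-1.0)]
print(round(cost(h, [1, 0]), 4))
# → 0.2201
```

Training adjusts θ to minimize this cost; SGD does so using one randomly chosen example per update.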
The decision tree has two types of nodes, external and internal: external nodes represent the decision classes, while internal nodes hold the features essential for making the classification [39]. The random forest comprises several decision trees that operate individually; each decision tree produces a class prediction, and the class with the maximum number of votes becomes the model prediction. Naïve Bayes [40] follows Bayes' theorem, where the variables V_1, V_2, ..., V_n of class C are conditionally independent of each other given C. Stochastic gradient descent considers only one randomly chosen sample point while adjusting the weights. The SGD classifier [41] performs better than the others, especially on a large dataset, although it takes more steps to converge.

With tf-idf and the F2 feature, SGD beats the others by obtaining a maximum AUC of 89.3%. The values of all classifiers rose except the decision tree, whose value decreased by 0.06%. For the F3 feature with BoW, LR, RF, and SGD performed quite well by acquiring AUC values of 87.7%, 87.5%, and 86.5%, respectively. The AUC value decreased abruptly by 9-10% for tf-idf with the F3 feature, making it the worst feature combination. Critical analysis of the results reveals that the SGD classifier with the combination of unigram and bigram features and the tf-idf feature extraction technique achieved the highest values for most of the evaluation parameters compared to the others. The performance of the proposed classifier (SGD) was analyzed further by varying the number of training documents to gain more insight. Fig. 8 shows accuracy versus the number of training examples. The analysis reveals that classification accuracy increases with dataset size and that tf-idf predominates over BoW with the F2 feature.
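The winning configuration (tf-idf over unigrams and bigrams feeding an SGD classifier) could be assembled as follows, assuming a scikit-learn implementation; the four toy documents and their labels are invented stand-ins for the SBTD corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Toy stand-in corpus: 1 = suspicious (C_s), 0 = non-suspicious (C_ns)
texts = [
    "bomb attack threat message",
    "violence incitement against group",
    "cricket match score update",
    "new restaurant opened today",
]
labels = [1, 1, 0, 0]

# tf-idf with unigram + bigram features (F2) feeding an SGD classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=3000)),
    ("clf", SGDClassifier(random_state=42)),
])
model.fit(texts, labels)
preds = model.predict(texts)
```

With the real 5600-document training split, the same pipeline structure applies unchanged; only the corpus and the tuned hyperparameters differ.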

Human Baseline vs. ML Techniques
The performance of the classifiers was compared with that of two human experts for further investigation. Expert 1 is a university professor whose research domain is BLP; Expert 2 is a PhD fellow pursuing a PhD in BLP. The experts manually classified the test text documents into one of the predefined categories. Among the 1400 test samples, 621 texts are from the non-suspicious (C_ns) class and 779 texts are from the suspicious (C_s) class. The accuracy for each class is computed from the confusion matrix as the ratio between the number of correctly predicted texts and the total number of texts of that class. For example, if a system correctly predicts 730 of the 779 suspicious texts, its accuracy on the suspicious class is 93.7% (730/779). As tf-idf outperformed BoW in the previous evaluation, we compared the classifiers with the human experts only for the tf-idf feature extraction technique. Table 7 summarizes the comparison. The experts outperformed the ML classifiers in both classes, and they classified non-suspicious texts more accurately than suspicious texts. We found an accuracy deviation of approximately 0.5% between the experts. The SGD classifier with the F2 feature obtained good accuracy on both classes, while the others did well on the suspicious class but performed very poorly on the non-suspicious class. A significant difference has
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 2 August 2020 doi:10.20944/preprints202008.0033.v1
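The per-class accuracy computation described above can be sketched from a confusion matrix as follows. The 730/779 split for the suspicious class comes from the worked example in the text; the non-suspicious counts (560 of 621 correct) are invented purely for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical predictions: of 779 suspicious (1) texts, 730 predicted
# correctly; of 621 non-suspicious (0) texts, 560 predicted correctly.
y_true = np.array([1] * 779 + [0] * 621)
y_pred = np.array([1] * 730 + [0] * 49 + [0] * 560 + [1] * 61)

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
# Per-class accuracy: diagonal (correct) over row sums (class totals).
per_class_acc = cm.diagonal() / cm.sum(axis=1)
print(round(per_class_acc[1], 3))  # suspicious-class accuracy → 0.937
```

The same computation yields each cell of Table 7 once the corresponding predictions are available.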