Criminal Behavior Identification Using Social Media Forensics

Ashraf, Noorulain; Mahmood, Danish; Obaidat, Muath A.; Ahmed, Ghufran; Akhunzada, Adnan

doi:10.3390/electronics11193162


by Noorulain Ashraf ¹,*, Danish Mahmood ¹, Muath A. Obaidat ², Ghufran Ahmed ³ and Adnan Akhunzada ⁴

¹ Computer Science Department, SZABIST, Islamabad 44000, Pakistan
² Computer Science Department, City University of New York, New York, NY 10019, USA
³ Department of Computer Science, National University of Computer and Emerging Sciences, Karachi 75100, Pakistan
⁴ College of Computing and Information Technology, Department of Cybersecurity, University of Doha for Science and Technology, Doha 24449, Qatar
* Author to whom correspondence should be addressed.

Electronics 2022, 11(19), 3162; https://doi.org/10.3390/electronics11193162

Submission received: 16 August 2022 / Revised: 6 September 2022 / Accepted: 22 September 2022 / Published: 1 October 2022

(This article belongs to the Special Issue Digital Trustworthiness: Cybersecurity, Privacy and Resilience)


Abstract

Human needs consist of five levels: physiological needs, safety needs, love needs, esteem needs and self-actualization. All these needs shape human behavior. If a person's environment is positive, healthy behavior develops; if the environment is unhealthy, this can be reflected in his/her behavior. Machines can mimic aspects of human intelligence by using machine learning and artificial intelligence techniques. In the modern era, people tend to post their everyday life events on social media in the form of comments, pictures, videos, etc. Social media is therefore a significant source for identifying behaviors such as abusive, aggressive, frustrated and offensive behavior. Detecting behavior by crawling a person's social media profile is an important idea. The challenge of behavior detection can be addressed by applying social media forensics to social media profiles, which involves NLP and deep learning techniques. This paper studies state of the art work on behavior detection and, based on that study, proposes a model for behavior detection. Compared with the models applied in state of the art work, the proposed model achieved an F1 score of 87% in the unigram + bigram class and 88% in the bigram + trigram class. This study can benefit cybercrime and cyber-security agencies in shortlisting profiles that exhibit certain behaviors, in order to prevent future crimes.

Keywords: behavior; social media; twitter; machine learning; natural language processing; aggressive behavior; abusive behavior; cyber hate; antisocial; depressive behavior

1. Introduction

The study of human behavior is a crucial process. It exhibits the needs of the human mind. ‘Human needs’ consist of five levels, which are physiological needs, safety needs, love needs, esteem needs and self-actualization [1]. If these needs are fulfilled accurately, it leads to positive behavior in a person. On the other hand, if these needs are not fulfilled in a constructive way, the negative behavior is accelerated, according to Maslow’s hierarchy (Figure 1).

The behavior of a person is built as a result of the particular events or actions that happen in his/her life. Environments, disasters, family conflicts or other events [2] have an immense effect on human behavior. It is an amalgam of a person’s nature, facial expressions, surroundings and physical gestures, according to Zhang et al. [3] and Albu [4]. It is also defined as the state which is achieved after a series of events and actions [5]. Two types of behaviors can be categorized while studying different fields of society: routine behaviors and political behaviors [6,7]. The daily routine of a person defines routine behavior. On the other hand, the political activities of a person define political behaviors [8].

As human behavior is a vast field of study, machines are being trained to mimic human behaviors by making them learn to perform human-like actions [9]. The process of mimicking human behavior is called artificial intelligence (Figure 2).

Humans are social by nature, so in this electronic era, the adoption of social media platforms has increased significantly; nearly everyone socializes there [10]. Social media platforms can be used to harm other people through negative comments, abusive language, hate speech and online bullying [11]. Such behavior can harm anyone through electronic means [12]. Aggression is a behavior that represents hate. This aggression can stem from gender discrimination, color bias, nationality, religious conflicts, etc. Aggressive behavior can lead to criminal activities. An example is the mosque shooting in New Zealand, where the shooter streamed live footage on social media during the attack, showing his aggression toward the Muslim community. Before the attack, he posted pictures of the guns, which he later used in the attack on the mosque [13]. It is important to identify threats before something worse happens. This critical situation can be handled by using the concept of social media forensics: the process of investigating people through their social media profiles and collecting relevant information by applying learning algorithms in order to prevent crimes [13].

The research focuses on the identification of criminal behavior from social media by analyzing the state of the art work and proposing a better approach. This research helps with shortlisting the text containing certain behaviors, which can help with prevention of the crimes in the future. The proposed research intends to help cybercrime and cyber-security agencies with shortlisting the profiles indicating a certain behavior of individuals who could become involved in any criminal activities in the future.

The major contributions in the research are:

Optimized modeling of framework for criminal behavior detection focusing.
Optimal hybridization of NLP mechanisms during feature engineering process for robust results.
Combining multi-features in neural network architecture optimally.
The improvement of results in finding the context of the text by training the model on unigram, bigram and trigrams.

The paper is organized as follows. Section 2 explains the critical analysis of state of the art work related to behaviors on social media. In Section 3, materials and methods proposed for the detection of criminal behavior are explained. Section 4 explains experimental setup and results are presented in Section 5. Finally, Section 6 concludes this work.

2. Literature Review

Zettabytes of data are available on social media [14] and can be helpful in detecting the behaviors of people. Recent research shows that there has been an enormous amount of work on behavior detection from social media platforms (Table 1).

In [15], a cyber-troll dataset was used for the identification of aggression from tweets. The dataset had 20,001 tweets in total. Multilayer perceptron with TF-IDF was proposed, and researchers compared its performance with DNN models, i.e., CNN-LSTM and CNN-BiLSTM. Statistical results showed that the proposed model detects better aggressive behavior with 92% accuracy in less training time and with a small number of layers. However, experimental results were tested on a single feature, which was TF-IDF. Different features and combinations of features must be identified for better performance.

In [16], a first, practical, real-time framework was introduced for aggression detection. Multiple machine learning classifiers were adapted in an incremental fashion using the bag of words explained in Table 1. The proposed framework was able to achieve the same performance in comparison with base learning models, with almost 93% accuracy, 92% precision, 91% recall, and an F1 score of 90%. The proposed model only uses bag-of-words, which may cause class imbalance. If more features are explored, more accurate detection can be achieved.

A Web-based plug-in user interface [17] was developed by using the pre-trained Google AI model BERT. It visualized and detected aggressive and non-aggressive comments within Twitter and Facebook text on the TRAC dataset [18]. Multiple classifiers were used, including XGBoost, LR, NB, SVM and FFNN. The proposed model was evaluated on the basis of F1 score and ROC-AUC: F1 scores of 0.64 and 0.62 were attained on the TRAC English and Hindi Facebook datasets, respectively. For the English and Hindi Twitter datasets, F1 scores of 0.58 and 0.50 were achieved. However, sentence complexity was neglected and remained unresolved. More pre-trained models must be explored to handle sentence complexity.

In [19], suicide notes were detected from social media blogs by using datasets: the Genuine suicide notes dataset [20], Reddit depression data [21] and Neutral Blog data. In the first part, LIWC was used for the extraction of suicide notes from the datasets. In the second part, a dilated-LSTM model was used for the detection of suicide notes. The proposed model achieved an 88.26% F1 score and 96.1% accuracy. At the end, features were visualized by emoticons such as love. In the proposed research, language patterns were identified on datasets for real-time data; accurate language patterns may not be detected. This problem can be resolved by enhancing LIWC analysis.

Depression behavior was studied in [22] by a manually created dataset from VKontakt users. In total, 6000 users were considered for this purpose. Preprocessing was completed by using MyStemAPI. Sentiment features were extracted by using the Linis-Crowd sentiment dictionary. Moreover, unigram and bigram features were extracted using TF-IDF. For classification, multiple models were tested, and results showed that random forest with PCA gave the best results with an AUC of 0.74, precision of 0.59, recall of 0.71 and F1 score of 0.65. However, the results must be improved by applying more feature fusion techniques. Moreover, neural networks can also be applied in the future in order to achieve better results.

In [23], an online hate classifier was developed for multiple social media platforms using machine learning classifiers. Data were gathered from YouTube, Reddit, Twitter and Wikipedia. Binary labels were given as hateful and non-hateful comments. Multiple features were applied, including LIWC, bag-of-words, TF-IDF, Word2Vec and BERT. Multiple classification models were compared, including LR, NB, SVM, XGBoost and FFNN. Performance was evaluated on the basis of F1 score and ROC-AUC. Results showed that XGBoost performed the best in the proposed model. However, the classification models were compared without hyperparameter optimization; if it were applied, some classifiers might give different results.

Sentiment analysis of reviews of customers was conducted in [24] on four datasets by using Co-LSTM; the researchers compared the performance using a confusion matrix with SVM, Naïve Bayes, LR, CNN and RNN. The results showed that the proposed model performed well in terms of AUC in an air-line dataset with AUC = 0.084. However, the word-embedding model was trained on a limited pre-trained dataset, whereas deep learning models need a huge amount of data. Moreover, in capturing text information, Co-LSTM was unable to capture some important sequence of words. This problem occurs because of the use of connected layers without memory. If connected layers are used with memory, this problem can be resolved.

A hate speech detection model [25] among vulnerable minorities was proposed by web crawling in the Amharic language. Preprocessing was completed using the tool HornMorpho. N-gram features were extracted using Word2Vec and TF-IDF. The classifiers GBT, RNN-LSTM, RRF and RNN-GRU were used for classification. Performance of the model was evaluated on the basis of ACC, ROC and AUC. The results showed that RNN-GRU performed best with an ACC of 92%, ROC of 0.97% and AUC of 0.97%. However, language peculiarities may not be captured because a generalized pre-trained model was used.

Two datasets, DWMW17 [26] and FDCL18 [27], were used in [28] for hate speech detection. Features were extracted by using a combination of features and given names from M1 to M7, where M1 was the baseline feature with the TF-IDF. Experiments were conducted on both datasets separately based on these features. The results showed that M7 outperformed when applied on both datasets. For DWMW17 [26], the results were better, i.e., accuracy of 94.8%, precision of 97.1%, recall of 96.7% and F1 score of 96.9%. However, polysemy words with multiple meanings can give wrong interpretations regarding hate speech. The proposed model can be applied for multi-lingual hate speech detection in future.

Hate speech text was detected from twitter by comparing pre-trained feature extractors with CNN in [29]. Data were collected from twitter API. It had 4575 Hindi–English code-mixed words. Pre-trained features included RoBERTA, XLNet, DistilBERT and BERT. These features were used with a CNN model. Results were generated for each pre-trained embedding. XLNET performed the best for hate class with precision = 0.69, recall = 0.42 and F1 score = 0.53. However, as the comparison was among pre-trained models, some Hindi–English interpretations may be neglected by these embeddings.

Detection of hate speech in Hindi–English code-mix text of social media is discussed in [30]. Data were collected from three different resources [31,32]. Features were extracted on the word, document and character level. For the document level, Doc2Vec was used along with SVM-Linear, SVM-RBF and RF. The results showed that RF performed the best in this scenario with an accuracy of 0.64. In the second experiment, Word2Vec was used along with the same models; the results showed that SVM-RBF performed the best with an accuracy of 0.75. In the third experiment, characters were extracted on the basis of the same models by using FastText. The results showed that SVM-RBF outperformed in this scenario with an accuracy of 85.81. It is seen that character-level features provide more information than document-level and word-level features. However, for better performance, more features must be explored.

The focus of [33] was to identify hate from datasets of multiple classes. For this purpose, the SP-MTL (Shared Private Multitask Learning) model [33] was proposed. The model was based on deep neural network classifiers, i.e., CNN, LSTM, CNN + GRU and CNNa + GRU. This model was implemented on five datasets with the classes hate, aggression, offensive, harassment, racist and sexist [18,34]. Features were extracted by CBOW and Word2Vec. Results showed that the proposed model CNNa + GRU performed best, with macro-F1 = 84.92 and weighted-F1 = 88.31. However, some classes were misclassified. To overcome this problem, domain-specific embeddings should be explored. The multi-verse optimizer, group search optimizer, harmony search optimizer, krill herd algorithm and other genetic algorithms can be explored for better performance.

Hate speech was detected using deep neural network and machine learning techniques in [35] by using the Arabic Abusive Language dataset [36]. Arabic text features were extracted after preprocessing on the basis of n-grams, i.e., 1-, 2-, 3-, 4-, 5- and 6-grams. Feature extraction was completed by using TF-IDF in the SVM, Naïve Bayes and logistic regression cases. For the CNN, LSTM and GRU models, mBert was used. The results showed that SVM with a binary class performed best with F1-Macro: 85.16, and in DNN, CBB + mBert performed best for the two-class case with F1-Macro: 87.05. However, in this research, some classes were misclassified, which can have a huge impact on results. If classified correctly, the results may differ.

In [37], the Vine social network dataset [26] was used. Features were extracted by BoW after preprocessing. For classification, RF, Ada-Boost (AB), Logistic Regression (LR), linear Support Vector Classification (SVC) and Extra-Tree (ET) were applied. Evaluation was made on the basis of ERDE, F1, precision and recall. Experimental results showed that the threshold model improved the baseline of the detection model by 26%. Dual models increase the improvement of cyber-bullying detection up to 42%.

The challenge of cyber bullying detection from text embedding images and infographics is discussed in [38]. Data contained 10,000 comments and posts in the form of text and images. Comments and posts were divided into 60% text, 20% images and 20% infographics. Features were extracted by using Google lens. The data were fed to the model depending on the type of input data. The textual data were fed to the CapsNet model, and image data were fed to the ConNet model. The final prediction was made by adding a multilayer perceptron with the sigmoid activation function for decision-level fusion. The results showed that AUC-ROC achieved a score of 0.98. However, due to the complexity of language, real-time data can be of high dimensions, and it can be imbalanced.

In [39], two sub-categories of abusive language, i.e., aggressive language and offensive language, were detected from an online social media platform using word embedding with Word2Vec. A simple CNN model was used for the detection of text. The Sentiment-Dataset [40] was considered as a dictionary, which had 50,000 words. CNN with Word2Vec was implemented using this dictionary on two datasets, i.e., the Aggressive-language dataset [41] and Offensive-Language Dataset [42], respectively. Experimental results showed that the model pre-trained on sentiments showed better performance with an F1 score of 64%. However, the proposed model was sensitive to spelling variations. By using more word embeddings, the problem of spelling variations can be resolved.

A sentiment analysis framework was proposed in [43] by the self-development of a military sentiment dictionary. The main focus of this study was on two things: One was to make a military dictionary that was MILSentic. The second was to compare MILSentic with existing dictionaries, i.e., NTUD and HowNet. Sigmoid, Tanh and ReLu were used with LSTM and Bi-LSTM models to check performance. Results showed that the self-developed dictionary (MILSentic + HowNet + NTUD) with (Bi-LSTM + Tanh) gave the best results with accuracy of 92.68%. This study was based on sentiment analysis of text only. More research is required for the development of refined dictionary for sentiment analysis of images, videos and cross-lingual characteristics.

Abusive comments were classified in [44] by using deep learning approaches. For this purpose, the toxic comments from the kaggle dataset were labeled as ‘Toxic’, ‘Obscene’, ‘Insult’ and ‘Severe toxic’ comments. Features were extracted on the basis of two main models; one was Glove with CNN and the other was Glove and LSTM with CNN. The results showed that the Glove + CNN model outperformed with an accuracy of 97.27.

Table 1. Evaluation Table Literature Review.

Research Challenges	Literature References
Dataset limitations	[15,18,21]
Imbalanced data annotation	[43,44]
Limited features extraction	[15,17,19,20,25]
Ambiguity in learning models	[37,38,39]
Text context misclassification	[37,44]

3. Materials and Methods

In today’s era, crimes are rapidly increasing. It is important to make a scenario in which criminals can be identified before crimes. For this purpose, a framework is proposed for the prevention of crimes by the early detection of criminal behaviors.

3.1. Proposed Framework

For crime prevention, it is important to keep an eye on people with a criminal mindset, so that quick actions can be taken accordingly by law enforcement agencies. In this research, a novel behavior detection framework is proposed for crime prevention, as shown in Figure 3. In this framework, real-time data are fetched from social media. The system detects the type of data. It then separates the data that are relevant to the behavior of a person. After that, it classifies the behavior regarding whether the data belong to class ‘A’ or class ‘B’.

If it belongs to class ‘A’, it means these data have nothing to do with the behavior of a person. On the other hand, if it belongs to class ‘B’, it means that the data are relevant to the behavior of a person. The ‘B’ class is further divided into other behavioral classes. These behavioral classes identify the behavioral traits of a person that can have a criminal mindset by observing the text posted on social media. After fetching information related to criminal behavior, those social media profiles containing a criminal nature of behavior are stored in the city database.

This database is accessible to law enforcement agencies. They can keep an eye on people with a criminal mindset by accessing the shortlisted profiles of people having criminal behavior. Moreover, they can also use this system to compare the shortlisted profiles with their own records for finding criminals.

The black box of Figure 3 is explained further in Figure 4, where a behavioral model is implemented. The fetched data from social media are saved in a database. From this database, the dataset is extracted. After that, manual annotation is performed to extract more relevant data. After extracting relevant data, a standard dataset is made. On this standard dataset, feature engineering is applied. After that, data splitting is completed with 75% training and 25% testing. At the end, a learning model is applied to check the performance of the claimed model in terms of finding criminal behavior. The process is visualized in Figure 4.

3.2. Proposed Methodology

System architecture is presented in this section. The feature extraction and feature fusion process is shown in Figure 5.

The proposed methodology is shown in Figure 5. The environment for the implementation of the proposed model is Jupyter, and the language is Python. Data are gathered from Twitter. After that, the data-splitting process is initiated. Splitting the whole data into training and testing portions is essential to see how accurately the model performs on unseen data (testing data). It also helps to guard against overfitting. In the proposed model, data are divided into 75% training data and 25% testing data. After that, data preprocessing is performed on the dataset to clean it. Data cleaning is necessary for accurate results of the model. In the proposed model, NLP text preprocessing is performed by the removal of stop words, non-English text, URLs, targets, hashtags, punctuation and missing values. Moreover, text is converted into lowercase, and tokenization is completed. After data preprocessing, feature extraction is performed, where important and relevant features are extracted without losing important information. It also helps by reducing the redundant data to increase processing-time efficiency. Multi-feature extraction techniques are implemented in order to increase the diversity of information produced from text data. In the proposed model, features are extracted on unigrams, bigrams and trigrams by using bag of words (BoW), TF-IDF and GloVe feature extractors to improve the performance of the model. At first, the count of word occurrences is calculated by using BoW. After that, important lexical features of the text are extracted using TF-IDF. At the end, GloVe is used to extract the semantic relatedness of words by learning meaningful vector similarities. Furthermore, a feature fusion operation is performed by taking a summation of each feature extractor.
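The extraction and fusion steps described above can be illustrated with a minimal, standard-library sketch. It covers only the BoW and TF-IDF branches over unigrams + bigrams and fuses them by element-wise summation. The smoothed IDF formula, the helper names (`ngrams`, `bow_and_tfidf`, `fuse_by_sum`) and the toy tweets are assumptions for illustration; the paper does not state its exact TF-IDF variant, and the GloVe branch is omitted because it needs pre-trained vectors and the text does not say how its dimensions are aligned before summation.

```python
import math
from collections import Counter

def ngrams(tokens, n_min, n_max):
    """Return all n-grams with n_min <= n <= n_max as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def bow_and_tfidf(docs, n_min=1, n_max=2):
    """Build BoW count vectors and TF-IDF vectors over a shared vocabulary."""
    grams_per_doc = [ngrams(d.lower().split(), n_min, n_max) for d in docs]
    vocab = sorted({g for grams in grams_per_doc for g in grams})
    n_docs = len(docs)
    # Document frequency: number of documents containing each n-gram.
    df = Counter(g for grams in grams_per_doc for g in set(grams))
    # Smoothed IDF (an assumed variant; the paper does not give its formula).
    idf = {g: math.log((1 + n_docs) / (1 + df[g])) + 1 for g in vocab}
    bow, tfidf = [], []
    for grams in grams_per_doc:
        counts = Counter(grams)
        bow.append([counts[g] for g in vocab])
        tfidf.append([counts[g] * idf[g] for g in vocab])
    return vocab, bow, tfidf

def fuse_by_sum(*feature_matrices):
    """Element-wise sum of equally shaped feature matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*feature_matrices)]

docs = ["it was a good day", "it was a bad day"]  # toy tweets, not from the dataset
vocab, bow, tfidf = bow_and_tfidf(docs)
fused = fuse_by_sum(bow, tfidf)
```

Summation only works because both matrices share the same vocabulary and therefore the same shape; any additional extractor would have to be projected to that shape first.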

After feature extraction, classification is performed. It automatically recognizes and separates the text into the relevant class with the help of the learning model. In the proposed DNN model, a multilayer perceptron is used for the detection of behavior. Aggressive behavior tweets are directed to the aggressive class, and non-aggressive or neutral tweets are directed to the non-aggressive class.

In the end, the proposed model is validated on the basis of accuracy, precision, recall, loss and F1 score. It is important to validate the model to see the performance. Validation parameters show the scale of the proposed model. The scale defines how well the model has performed. The results of the proposed model are evaluated by comparing it with state of the art work.

3.3. Dataset Description

For this research, a cyber-troll dataset is used [45]. It was created for the classification of text by Data Turks. It is a human labeled dataset, as shown in Figure 6. It has 20,001 tweets in it of which 7822 are aggressive and 12,179 are non-aggressive. Aggressive tweets are labeled as 1, and non-aggressive tweets are labeled as 0.

The number of characters with respect to length is almost the same, as shown in Figure 7. The green bar shows the non-aggressive class and the blue bar shows the aggressive class. Data are divided into 25% testing and 75% training.
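As a rough illustration of the 75%/25% division, the following sketch shuffles sample indices with a fixed seed and splits them. The function name, the seed and the rounding rule are assumptions; the paper does not say whether the split was stratified or how fractional counts were rounded.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.25, seed=42):
    """Shuffle sample indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

# Class counts reported for the cyber-troll dataset.
n_aggressive, n_non_aggressive = 7822, 12179
n_total = n_aggressive + n_non_aggressive  # 20,001 tweets

train_idx, test_idx = train_test_split_indices(n_total)
print(len(train_idx), len(test_idx))  # prints: 15001 5000
```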

3.4. Classification Models and Validations

In model classification, fully connected dense layers are used. The experimentation is completed in Python using TensorFlow as the back end and Keras as the front end, within the Jupyter environment. In the model, four hidden layers are implemented. In the first three layers, the ReLU (Rectified Linear Unit) activation function is used, and in the output layer, the sigmoid activation function is used. There are two neurons in the output layer, matching the number of classes in the dataset. After some experimentation, the training hyperparameters were set as follows: the dropout rate is 0.2, the batch size is 128 and the learning rate is 1e-3. Binary cross-entropy is used as the loss function, and the Adam optimizer is used for optimization. For comparison purposes, the base model is implemented as well, and its performance is compared with that of the proposed model.
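To make the layer arrangement concrete, here is a small pure-Python sketch of a forward pass through ReLU dense layers followed by a two-unit sigmoid output, together with binary cross-entropy. It is not the authors' Keras model: the layer widths, the weight initialisation and the input size are hypothetical, and dropout, mini-batching and the Adam update are training-time details that are deliberately left out.

```python
import math
import random

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(W @ inputs + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over the output units."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

rng = random.Random(0)
# Hypothetical widths; only the two output units come from the text.
layer_sizes = [8, 16, 16, 16, 2]
layers = [init_layer(a, b, rng) for a, b in zip(layer_sizes, layer_sizes[1:])]

def forward(features):
    """ReLU on every layer except the last, which uses sigmoid."""
    h = features
    for i, (w, b) in enumerate(layers):
        is_output = i == len(layers) - 1
        h = dense(h, w, b, sigmoid if is_output else relu)
    return h

probs = forward([1.0] * 8)                   # toy 8-dimensional feature vector
loss = binary_cross_entropy([1, 0], probs)   # target: aggressive class
```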

Evaluation metrics include accuracy, recall, F-measure and precision. These measures are calculated from TP (True Positive), TN (True Negative), FN (False Negative) and FP (False Positive). Tweets in which aggressive behavior is classified correctly are TP. Aggressive tweets that are not classified correctly are FN. On the other hand, tweets that are not related to aggressive behavior and are classified correctly are TN, and tweets that are not related to aggressive behavior and are misclassified are FP. The F1 score is used to balance recall and precision in a single value. In this research, the F1 score is used as the main evaluation parameter. The system used for this experimentation is a Core i7 with 16 GB of RAM, running Keras, TensorFlow and Python 3.6.7.
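The standard definitions of these metrics can be written down directly from the four confusion-matrix counts. The sketch below uses a hypothetical function name and made-up counts purely for illustration; they are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts chosen only for illustration (not results from the paper).
metrics = classification_metrics(tp=90, tn=880, fp=20, fn=10)
```

Because F1 is the harmonic mean of precision and recall, it drops quickly when either of the two is low, which is why it is a reasonable headline metric for an imbalanced two-class problem.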

4. Experimental Setup

4.1. Data Preprocessing

State of the art work shows that in NLP-relevant tasks, data preprocessing leads to more accurate results. The first step of data preprocessing is the extraction of useful features by the removal of unnecessary elements of textual data. Feature dimensions become denser, giving better results in less processing time. In this experimentation, the following steps are applied to the dataset.

For sentiment analysis in terms of aggressive behavior, stop words are removed. As meaningless and irrelevant words have a large impact on accuracy, removing them increases the quality of the input data.

Punctuation marks as well as numbers are removed from tweets because they have no input in identifying aggressive behavior.

In order to normalize the text, all words of comments are converted into lowercase. This conversion can reduce the chance of wrong interpretation of words.

At the end, the dataset is split into small tokens by tokenization.
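The preprocessing steps listed above can be sketched with the Python standard library alone. The stop-word set below is a tiny hypothetical sample, the regular expressions are assumptions about what counts as a URL, mention or hashtag, and the example tweet is invented; the authors' actual tooling is not specified in the text.

```python
import re
import string

# Tiny illustrative stop-word list; a real run would use a full list (assumption).
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "to", "and", "of", "in", "you", "are"}

def preprocess(tweet):
    """Clean one tweet and return its tokens."""
    text = tweet.lower()                                  # normalise case
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)                  # drop mentions and hashtags
    text = re.sub(r"\d+", " ", text)                      # drop numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    tokens = text.split()                                 # whitespace tokenisation
    return [t for t in tokens if t not in STOP_WORDS]     # drop stop words

print(preprocess("@user It was a GOOD day!!! 100% http://t.co/xyz #blessed"))
# prints: ['good', 'day']
```

URLs are removed before punctuation so that stripping `:` and `/` does not leave URL fragments behind as tokens.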

After preprocessing, the most common words in the dataset are shown in Figure 7.

4.2. Feature Selection and Engineering

In the proposed methodology, the words of the dataset are divided into two classes to extract features, i.e., Class A (unigram + bigram) and Class B (bigram + trigram). Unigrams, bigrams and trigrams can be explained with this example: in the tweet “It was a good day”, the unigram features are ‘It’, ‘was’, ‘a’, ‘good’, ‘day’; the bigram features are ‘It was’, ‘was a’, ‘a good’, ‘good day’; and the trigram features are ‘It was a’, ‘was a good’, ‘a good day’. The unigrams, bigrams, and trigrams of the dataset are shown in Figure 8, Figure 9 and Figure 10, respectively.
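The n-gram example above can be reproduced with a few lines of Python. The helper name `extract_ngrams` and the variable names are hypothetical; the sketch simply slides a window of n words over the whitespace-separated tweet.

```python
def extract_ngrams(text, n):
    """Return the list of word n-grams of order n."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tweet = "It was a good day"
unigram_bigram = extract_ngrams(tweet, 1) + extract_ngrams(tweet, 2)   # unigram + bigram features
bigram_trigram = extract_ngrams(tweet, 2) + extract_ngrams(tweet, 3)   # bigram + trigram features

print(extract_ngrams(tweet, 3))
# prints: ['It was a', 'was a good', 'a good day']
```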

Moreover, in this experiment, the feature hybridization technique is used to arrange them in series in order to obtain the best results. The group of best features is implemented by studying each feature separately from state of the art work, building a hybrid extraction model, combining them by arranging them in series and identifying which features had a robust performance on results. The following are the features arrangements to check the performance of the model:

Unigram + bigram with BoW + TF-IDF + GloVe;
Bigram + trigram with BoW + TF-IDF + GloVe;
Unigram + bigram with BoW + TF-IDF;
Bigram + trigram with BoW + TF-IDF;
Unigram + bigram with TF-IDF (base model);
Bigram + trigram with TF-IDF (base model).

Bigram + Trigram + TF-IDF is the base feature extractor model against which the performance of all other feature models was evaluated. The process of feature extraction according to Class A and Class B is shown in Figure 11.

In another step of the feature extraction process, the best features are selected. Commonly used features in identifying aggression from Twitter are selected by the p-value of the f_classif function. Here, k is set to 30,000 for the n-gram ranges with the SelectKBest function, and the minimum document frequency (Min_doc_freq) for feature selection is set to 2. The top 20 aggressive words are shown in Figure 12; the top 20 words of the non-aggressive class are shown as a word cloud in Figure 13.
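The selection step names scikit-learn's f_classif and SelectKBest, which rank features by a one-way ANOVA F statistic. The sketch below re-implements that idea in plain Python so the mechanics are visible; it is an approximation for illustration, not the library code. The function names, the toy matrix and k = 2 are assumptions (the paper uses k = 30,000), and ranking is done on the F score rather than the p-value, which gives the same order when all features share the same degrees of freedom.

```python
def anova_f_score(values, labels):
    """One-way ANOVA F statistic of a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    if ss_within == 0:
        # Perfectly separated (or constant) feature: avoid division by zero.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def select_k_best(matrix, labels, k):
    """Return the column indices of the k features with the highest F score."""
    n_features = len(matrix[0])
    scores = [anova_f_score([row[j] for row in matrix], labels)
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Toy count matrix: rows are tweets, columns are n-gram features (illustrative only).
X = [[3, 1, 2], [2, 0, 1], [3, 1, 2], [0, 1, 1], [1, 0, 1], [0, 1, 2]]
y = [1, 1, 1, 0, 0, 0]  # 1 = aggressive, 0 = non-aggressive
selected = select_k_best(X, y, k=2)
```

In this toy matrix the first column separates the two classes almost perfectly, the second is identically distributed in both, and the third differs only slightly, so the first and third columns are kept.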

5. Results and Analysis

The performance of the models, in terms of feature hybridization and classification, is evaluated on the cyber troll dataset. First, the state of the art work [15] is implemented with the same settings as described there. Its authors reported an accuracy of 92% and precision, recall and F1 score of 90% (Figure 14).

However, when the same model was implemented with the same parameters, settings and dataset, the achieved results were quite different: precision of 80%, recall of 90%, accuracy of 91% and F1 score of 84%. This is also shown in Table 2.

The MLP performance is tested with combinations of feature extraction models to make a competitive comparison between all approaches.

As the F1 score is considered the main evaluation measure, it is evident from Table 3 that the proposed MLP using BoW + TF-IDF + GloVe and the MLP using BoW + TF-IDF outperformed the base model with F1 scores of 86% and 87%, respectively, in the unigram + bigram case. Similarly, they achieved F1 scores of 84% and 88% in the bigram + trigram case.

In comparison, the MLP using TF-IDF alone, which is the base model, had an F1 score of 84% in the unigram + bigram case and 83% in the bigram + trigram case when implemented. Table 3 shows the accuracy, precision, recall and F1 score of the MLP with the proposed feature extraction models.
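Since the F1 score is the main evaluation measure, it helps to recall that it is the harmonic mean of precision and recall. The sketch below applies the standard formula to the reproduced base-model values quoted earlier (precision 80%, recall 90%). The article does not state its averaging scheme or rounding, so applying the formula to rounded table entries need not match every reported F1 value exactly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reproduced base model: precision 80%, recall 90%
base_f1 = f1_score(0.80, 0.90)
print(round(base_f1, 3))  # 0.847
```

The result of roughly 0.85 is in line with the 84% reported for the reproduced base model.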

As shown in Table 3, BoW + TF-IDF performed best in both the 1 + 2 gram and 2 + 3 gram cases, with F1 scores of 87% and 88%, respectively. This is also shown in Figure 15, where the x-axis lists the models and the y-axis shows the percentage for each evaluation measure. B1 denotes the base model, and M1 to M5 denote the proposed models. The figure shows that M2 and M3 gave the best results. The main contribution of the proposed research is the introduction of a hybrid model that gives better results than the model implemented in the state of the art work.

In Figure 16 and Figure 17, the best fit proposed models are compared with the base model in terms of F1 score. It can be clearly seen that the proposed model M2 (bag of words, TF-IDF, MLP, 1 + 2 grams) outperformed the base model B1.

The results of the proposed model are also compared with the published results of state of the art models, and the proposed model outperformed them in terms of F1 score as well. This is shown in Table 4.

From the implementation of the models, it is clear that the proposed deep learning model outperformed the base model. Data sparsity is one of the main problems when dealing with tweets because of their short length. To resolve this problem, a dense layer is added, which extracts more accurate features. This research presents an approach for identifying aggressive behavior that introduces hybrid feature extraction into a simple neural network architecture.
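To make the role of the dense layer concrete, the sketch below runs a forward pass through a multilayer perceptron with one hidden dense layer and a sigmoid output. It is illustrative only: the layer sizes, the ReLU and sigmoid activations, and all weights are hypothetical and are not taken from the article.

```python
import math


def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(W x + b), one output per weight row."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mlp_predict(features, hidden_w, hidden_b, out_w, out_b):
    """One hidden dense layer followed by a single sigmoid output unit."""
    hidden = dense(features, hidden_w, hidden_b, relu)
    return dense(hidden, out_w, out_b, sigmoid)[0]


# Hypothetical weights for a 4-feature input and 2 hidden units
hidden_w = [[0.5, -0.2, 0.0, 0.1], [-0.3, 0.8, 0.0, 0.0]]
hidden_b = [0.0, 0.1]
out_w = [[1.0, -1.0]]
out_b = [0.0]

sparse_tweet = [1.0, 0.0, 0.0, 2.0]  # mostly-zero feature vector, as with short tweets
p_aggressive = mlp_predict(sparse_tweet, hidden_w, hidden_b, out_w, out_b)
```

The hidden layer maps the sparse input to a small dense vector, and the output unit turns that vector into a probability of the aggressive class.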

This methodology is applied by obtaining vectors of one-, two- and three-gram word features. In a DNN, the number of hidden layers has a direct impact on the training time: more hidden layers make the network more complex and therefore increase the training time. The proposed methodology reduces the training time by enabling the neural network to learn important features from the proposed feature hybridization technique. The proposed study contributes to the field by (i) modeling a framework for criminal behavior detection; (ii) building a model for aggressive behavior detection by combining multiple features in a neural network architecture; and (iii) improving results in capturing the context of the text by training the model on unigrams, bigrams and trigrams.

The results in Table 2 and Table 3 clearly show a comparison of the state of the art work with the proposed techniques and show that the proposed techniques achieved better F1 scores in finding aggression. The dataset used for experimentation contains more tweets in the non-aggressive class than in the aggressive class. If class balancing is applied by including more tweets in the aggressive class, the results may improve. For this purpose, SMOTE can be explored.

The proposed research implemented the methodology of the state of the art work [15], and the results differed from those stated when the same methodology was applied to the same dataset. The authors of [15] stated 92% accuracy and 90% precision, recall and F1 score, whereas the reproduced model gave an F1 score of 84%. The proposed model performs better because of the multi-feature hybridization approach used for extraction before the features are fed to the multilayer perceptron DNN model. The proposed model (bag of words, TF-IDF, MLP) gave an F1 score of 87% in the unigram + bigram case and 88% in the bigram + trigram case.
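The class-balancing idea mentioned in this discussion can be sketched as follows. SMOTE creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbors. This is a simplified illustration, not something evaluated in the article: the function name `smote`, the parameter defaults, and the toy points are hypothetical.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors.

    Assumes `minority` holds at least two feature vectors of equal length.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of `base` by squared Euclidean distance
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


# Toy aggressive-class feature vectors, invented for illustration
aggressive = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_samples = smote(aggressive, n_new=5)
```

Each synthetic vector lies on a line segment between two existing aggressive-class vectors, so the minority class grows without duplicating tweets exactly.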

6. Conclusions

Human behavior is a very complex phenomenon. It can be seen in the way a person acts or interacts, and it is shaped by the individual's surroundings, values and norms. If a person's environment is productive, positive behavior is nourished; on the other hand, a negative environment has a negative impact on behavior. Identifying human behavior is necessary for the well-being of humanity. In this electronic era, people post their everyday life events on social media as videos, images and text. These data can be used to identify the behavior of a person from his or her social media profile. In this research, state of the art work is studied and analyzed in order to detect criminal behavior. Based on the analysis, a framework is proposed for the detection of criminal behavior from social media. A model is then suggested for aggressive behavior detection in order to identify criminal mindsets. Compared with the state of the art work, the proposed model showed improved performance, with an F1 score of 87% in the unigram + bigram case and 88% in the bigram + trigram case. This research can help cybercrime departments identify people with a criminal mindset.

Author Contributions

Conceptualization, N.A., D.M. and M.A.O.; methodology, N.A., M.A.O. and A.A.; software, N.A., D.M.; validation, M.A.O., G.A. and A.A.; formal analysis, M.A.O., D.M., A.A.; investigation, N.A., D.M.; writing—original draft, N.A., D.M., A.A. and M.A.O.; preparation, N.A., D.M.; writing—review and editing, M.A.O. and A.A.; visualization, G.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Milheim, K.L. Towards a better experience: Examining student needs in the online classroom through Maslow’s hierarchy of needs model. J. Online Learn. Teach. 2012, 8, 159.
2. Al-Qatawneh, S.S.; Alsalhi, N.R.; Eltahir, M.E.; Siddig, O.A. The representation of multiple intelligences in an intermediate Arabic-language textbook, and teachers’ awareness of them in Jordanian schools. Heliyon 2021, 7, e07004.
3. Zhang, Z.; Luo, P.; Loy, C.-C.; Tang, X. Learning social relation traits from face images. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3631–3639.
4. Albu, V. Measuring customer behavior with deep convolutional neural networks. BRAIN Broad Res. Artif. Intell. Neurosci. 2016, 7, 74–79.
5. Xu, Y.; Damen, D. Human routine change detection using bayesian modelling. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 1833–1838.
6. Constantino, S.; Schlüter, M.; Weber, E.; Wijermans, N. Cognition and behavior in context: A framework and theories to explain natural resource use decisions in social-ecological systems. Sustain. Sci. 2021, 16, 1651–1671.
7. Zhang, Y.; Hu, W.; Rao, C.; Zhou, D.; Hu, Y.; Jin, J.; Muddassir, M. Fast photocatalytic organic dye by two metal-organic frameworks with 3D two-fold interpenetrated feature. J. Mol. Struct. 2021, 1227, 129538.
8. Wergeland, G.J.H.; Riise, E.N.; Öst, L.-G. Cognitive behavior therapy for internalizing disorders in children and adolescents in routine clinical care: A systematic review and meta-analysis. Clin. Psychol. Rev. 2021, 83, 101918.
9. Sinnott, J.D.; Rabin, J.S. The Psychology of Political Behavior in a Time of Change; Springer: Berlin/Heidelberg, Germany, 2021.
10. Nanaa, A.; Akkus, Z.; Lee, W.Y.; Pantanowitz, L.; Salama, M.E. Machine learning and augmented human intelligence use in histomorphology for haematolymphoid disorders. Pathology 2021, 53, 400–407.
11. Smit, D. Cyberbullying in South African and American schools: A legal comparative study. S. Afr. J. Educ. 2015, 35, 1076.
12. Grigg, D.W. Cyber-aggression: Definition and concept of cyberbullying. Aust. J. Guid. Couns. 2010, 20, 143.
13. Doyle, G. New Zealand mosque attacker’s plan began and ended online. Reuters Retrieved 2020, 9.
14. Yang, S.J.; Ogata, H.; Matsui, T.; Chen, N.-S. Human-centered artificial intelligence in education: Seeing the invisible through the visible. Comput. Educ. Artif. Intell. 2021, 2, 100008.
15. Sadiq, S.; Mehmood, A.; Ullah, S.; Ahmad, M.; Choi, G.S.; On, B.-W. Aggression detection through deep neural model on twitter. Future Gener. Comput. Syst. 2021, 114, 120–129.
16. Herodotou, H.; Chatzakou, D.; Kourtellis, N. A Streaming Machine Learning Framework for Online Aggression Detection on Twitter. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 5056–5067.
17. Modha, S.; Majumder, P.; Mandl, T.; Mandalia, C. Detecting and visualizing hate speech in social media: A cyber Watchdog for surveillance. Expert Syst. Appl. 2020, 161, 113725.
18. Kumar, R.; Reganti, A.N.; Bhatia, A.; Maheshwari, T. Aggression-annotated corpus of hindi-english code-mixed data. arXiv 2018, arXiv:1803.09402.
19. Schoene, A.M.; Turner, A.; de Mel, G.R.; Dethlefs, N. Hierarchical Multiscale Recurrent Neural Networks for Detecting Suicide Notes. IEEE Trans. Affect. Comput. 2021.
20. Gregory, A. The decision to die: The psychology of the suicide note. In Interviewing and Deception; Routledge: London, UK, 1999; pp. 127–156.
21. Pirina, I.; Çöltekin, Ç. Identifying depression on reddit: The effect of training data. In Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task, Brussels, Belgium, 31 October 2018; pp. 9–12.
22. Stankevich, M.; Smirnov, I.; Kiselnikova, N.; Ushakova, A. Depression detection from social media profiles. In Proceedings of the International Conference on Data Analytics and Management in Data Intensive Domains, Voronezh, Russia, 13–16 October 2019; pp. 181–194.
23. Salminen, J.; Hopf, M.; Chowdhury, S.A.; Jung, S.-g.; Almerekhi, H.; Jansen, B.J. Developing an online hate classifier for multiple social media platforms. Hum. Cent. Comput. Inf. Sci. 2020, 10, 1–34.
24. Behera, R.K.; Jena, M.; Rath, S.K.; Misra, S. Co-LSTM: Convolutional LSTM model for sentiment analysis in social big data. Inf. Process. Manag. 2021, 58, 102435.
25. Mossie, Z.; Wang, J.-H. Vulnerable community identification using hate speech detection on social media. Inf. Process. Manag. 2020, 57, 102087.
26. Davidson, T.; Warmsley, D.; Macy, M.; Weber, I. Automated hate speech detection and the problem of offensive language. In Proceedings of the International AAAI Conference on Web and Social Media, Montreal, QC, Canada, 15–18 May 2017.
27. Founta, A.M.; Djouvas, C.; Chatzakou, D.; Leontiadis, I.; Blackburn, J.; Stringhini, G.; Vakali, A.; Sirivianos, M.; Kourtellis, N. Large scale crowdsourcing and characterization of twitter abusive behavior. In Proceedings of the International AAAI Conference on Web and Social Media, Palo Alto, CA, USA, 25–28 June 2018.
28. Senarath, Y.; Purohit, H. Evaluating Semantic Feature Representations to Efficiently Detect Hate Intent on Social Media. In Proceedings of the 2020 IEEE 14th International Conference on Semantic Computing (ICSC), San Diego, CA, USA, 3–5 February 2020; pp. 199–202.
29. Banerjee, S.; Chakravarthi, B.R.; McCrae, J. Comparison of pretrained embeddings to identify hate speech in Indian code-mixed text. In Proceedings of the 2020 2nd International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), Greater Noida, India, 18–19 December 2020; pp. 21–25.
30. Sreelakshmi, K.; Premjith, B.; Soman, K. Detection of Hate Speech Text in Hindi-English Code-mixed Data. Procedia Comput. Sci. 2020, 171, 737–744.
31. Bohra, A.; Vijay, D.; Singh, V.; Akhtar, S.S.; Shrivastava, M. A dataset of Hindi-English code-mixed social media text for hate speech detection. In Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media, New Orleans, LA, USA, 6 June 2018; pp. 36–41.
32. Mathur, P.; Sawhney, R.; Ayyar, M.; Shah, R. Did you offend me? Classification of offensive tweets in hinglish language. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), Brussels, Belgium, 31 October–1 November 2018; pp. 138–148.
33. Kapil, P.; Ekbal, A. A deep neural network based multi-task learning approach to hate speech detection. Knowl.-Based Syst. 2020, 210, 106458.
34. Zampieri, M.; Malmasi, S.; Nakov, P.; Rosenthal, S.; Farra, N.; Kumar, R. Semeval-2019 task 6: Identifying and categorizing offensive language in social media (offenseval). arXiv 2019, arXiv:1903.08983.
35. Alsafari, S.; Sadaoui, S.; Mouhoub, M. Hate and offensive speech detection on arabic social media. Online Soc. Netw. Media 2020, 19, 100096.
36. van Bruwaene, D.; Huang, Q.; Inkpen, D. A multi-platform dataset for detecting cyberbullying in social media. Lang. Resour. Eval. 2020, 54, 851–874.
37. López-Vizcaíno, M.F.; Nóvoa, F.J.; Carneiro, V.; Cacheda, F. Early detection of cyberbullying on social media networks. Future Gener. Comput. Syst. 2021, 118, 219–229.
38. Kumar, A.; Sachdeva, N. Multimodal cyberbullying detection using capsule network with dynamic routing and deep convolutional neural network. Multimed. Syst. 2021, 2, 1–10.
39. Uban, A.-S.; Dinu, L. On Transfer Learning for Detecting Abusive Language Online. In Proceedings of the International Work-Conference on Artificial Neural Networks, Gran Canaria, Spain, 12–14 June 2019; pp. 688–700.
40. Go, A.; Bhayani, R.; Huang, L. Twitter sentiment classification using distant supervision. CS224N Proj. Rep. Stanf. 2019, 1, 2009.
41. Kumar, R.; Ojha, A.K.; Malmasi, S.; Zampieri, M. Benchmarking aggression identification in social media. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), Santa Fe, NM, USA, 25 August 2018; pp. 1–11.
42. Zampieri, M.; Malmasi, S.; Nakov, P.; Rosenthal, S.; Farra, N.; Kumar, R. Predicting the type and target of offensive posts in social media. arXiv 2019, arXiv:1902.09666.
43. Chen, L.-C.; Lee, C.-M.; Chen, M.-Y. Exploration of social media for sentiment analysis using deep learning. Soft Comput. 2019, 24, 1–11.
44. Anand, M.; Eswari, R. Classification of abusive comments in social media using deep learning. In Proceedings of the 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 27–29 March 2019; pp. 974–977.
45. Sadiq, S. Cyber Troll Dataset [Data set]. Zenodo 2020, Version 1. Available online: https://zenodo.org/record/3665663#.Y0FBPepByHs (accessed on 6 August 2022).

Figure 1. Maslow’s hierarchy [1].

Figure 2. Human vs. Artificial Intelligence.

Figure 3. Framework of Criminal Behavior Detection.

Figure 4. Proposed System Model for Behavior Detection.

Figure 5. Proposed Methodology for Behavior Detection.

Figure 6. Dataset Distribution.

Figure 7. Common Words.

Figure 8. Unigrams of Dataset.

Figure 9. Bigrams of Dataset.

Figure 10. Trigrams of Dataset.

Figure 11. Implementation Framework of Hybrid Features.

Figure 12. Word Cloud of Aggressive Class.

Figure 13. Word Cloud of Non-Aggressive Class.

Figure 14. Model [15] (TF-IDF, MLP, 1 + 2 Grams) Stated vs. Achieved Results.

Figure 15. Sadiq, S (2021) Model vs. Proposed Models.

Figure 16. Sadiq, S (2021) Model vs. Best Fit Proposed Models (Unigram + Bigrams) in Terms of F1 Score.

Figure 17. Sadiq, S (2021) Model vs. Best Fit Proposed Models (Bigram + Trigrams) in Terms of F1 Score.

Table 2. Evaluation Table of State of the Art Model (Sadiq, S (2021) Model [15]) (Stated vs. Reproduced Results).

	Model	Precision	Recall	Accuracy	Loss	F1
Stated Results	Model (TF-IDF, MLP, 1 + 2 Grams)	90	90	92	08	90
Reproduced Results	Base Model (TF-IDF, MLP, 1 + 2 Grams)	80	90	91	12	84

Table 3. Evaluation Table Proposed Model vs. Sadiq, S (2021) Model.

Models Name	Model	Precision	Recall	Accuracy	Loss	F1
B1 [15]	Base Model (TF-IDF, MLP, 1 + 2 Grams)	80	90	91	12	84
M1	TF-IDF, MLP, 2 + 3 Grams	75	93	84	19	83
M2	BOW, TF-IDF, MLP, 1 + 2 Grams	84	92	91	11	87
M3	BOW, TF-IDF, MLP, 2 + 3 Grams	84	94	84	19	88
M4	BOW, TF-IDF, Glove, MLP, 1 + 2 Grams	83	91	91	11	86
M5	BOW, TF-IDF, Glove, MLP, 2 + 3 Grams	75	94	84	19	84

Table 4. F1 Score Proposed Models vs. State of the Art Work.

Models Name/State of the Art Work References	Model	Precision	Recall	Accuracy	Loss	F1
M1	TF-IDF, MLP, 2 + 3 Grams	75	93	84	19	83
M2	BOW, TF-IDF, MLP, 1 + 2 Grams	84	92	91	11	87
M3	BOW, TF-IDF, MLP, 2 + 3 Grams	84	94	84	19	88
M4	BOW, TF-IDF, Glove, MLP, 1 + 2 Grams	83	91	91	11	86
M5	BOW, TF-IDF, Glove, MLP, 2 + 3 Grams	75	94	84	19	84
[15]	Base Model (TF-IDF, MLP, 1 + 2 Grams)	80	90	91	12	84
[17]	BERT, CNN, BiLSTM, BERT, Logistic Regression, SVM	64	62	60	n/a	61

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

