Multi-Ideology ISIS/Jihadist White Supremacist (MIWS) Dataset for Multi-Class Extremism Text Classification

Abstract: Social media platforms are a popular channel for extremist organizations to disseminate their perceptions, beliefs, and ideologies. This information is generally based on selective reporting and is subjective in content. However, the radical presentation of this disinformation and its reach on social media increase the number of susceptible audiences. Hence, detection of extremist text on social media platforms is a significant area of research. Online extremism research faces three challenges: the unavailability of extremism text datasets, the lack of emphasis on classifying extremism text into propaganda, radicalization, and recruitment classes, and the lack of data validation methods, which affects the accuracy of extremism detection. This research addresses these challenges by presenting a multi-ideology, multi-class seed dataset of extremism text and the construction of a multi-ideology ISIS/Jihadist White supremacist (MIWS) dataset from recent tweets collected from Twitter. The presented dataset can be used to classify extremist text into common types such as propaganda, radicalization, and recruitment. Additionally, the seed dataset is statistically validated with the coherence score of Latent Dirichlet Allocation (LDA) and word mover's distance using a pretrained Google News vector. The dataset shows good coherence scores within a topic and appropriate distance measures between topics. This dataset is the first publicly accessible multi-ideology, multi-class extremism text dataset to reinforce research on extremism text detection on social media platforms.


Summary
Extremist organizations exploit social media platforms to spread ideologies and influence youth through propaganda, radicalization, and recruitment. Multiple ideologies originate from numerous organizations in different geographical locations. Organizations like ISIS [1] and Al Qaeda [2] have used Twitter and other social media platforms to spread propaganda and to recruit. White supremacists have also employed Twitter and websites like Stormfront [3] and Gab [4] to recruit youth. A few research works like [5] focus on the automated content restructuring of web forums for better semantic analysis on social media.
Current literature focuses on a limited number of ideologies. Thus, it is necessary to develop an extremism text dataset containing multiple ideologies to detect extremism text. Existing literature on online extremism detection also focuses on a limited number of class labels. Identifying and classifying text and users with binary labels like "extremist" or "non-extremist" provides limited insight. Researchers have classified extremist text on social media into major types based on objectives of a social, political, or religious nature. Classification and analysis of extremist text on social media can help to curb disinformation. This work contributes a multi-ideology extremist text seed dataset that can be used for extremism detection on larger extremism text datasets collected from popular social media platforms. The dataset will also be helpful for further classification of extremism into propaganda, radicalization, and recruitment [6], as seen in Figure 1.

This dataset consists of two parts:
1. Seed dataset consisting of 400 examples collected from diverse sources and manually annotated with the class labels propaganda, radicalization, or recruitment.
2. MIWS dataset consisting of 40,000 tweets collected from Twitter and annotated with the class labels propaganda, radicalization, or recruitment using the seed dataset. Twenty thousand tweets were collected for the ISIS/Jihadist ideology and twenty thousand for the White supremacist ideology.

The seed and resultant MIWS datasets with multiple ideologies are statistically validated and thus can be employed to generate a robust and accurate extremism text dataset.

Research Goal of Our Datasets
The aim of the seed dataset is to classify multi-ideology extremism text into different classes such as propaganda, radicalization, and recruitment.

1. The seed dataset can be used to automatically annotate large extremism text datasets collected from social media platforms.
2. The MIWS dataset is constructed and automatically annotated, using the seed dataset, into propaganda, radicalization, and recruitment.
3. MIWS can also be used to train a classifier that detects extremism text and further classifies it into propaganda, radicalization, and recruitment.
4. The MIWS dataset can be further used to analyze the geographical location of extremism text to understand the spread of extremism.

Data Description
There are 400 records in the seed dataset, collected from diverse sources such as research articles, newspapers, blogs, and websites. The seed dataset contains 200 records for the ISIS/Jihadist ideology and 200 for the White supremacist ideology. The MIWS dataset contains 20,000 tweets of the ISIS/Jihadist ideology and 20,000 tweets of the White supremacist ideology. The details can be seen in Tables 1 and 2.

Figure 2 shows the process flow for the construction of the seed and MIWS datasets. It contains four phases: data collection, seed data validation, data labelling, and merging of data from the different extremist ideologies. Data collection is performed in two parts: first, seed data collection (explained in Section 3.1) and then collection of tweets, or MIWS data collection (explained in Section 6.1). Seed data validation (explained in Section 5) is performed individually for each ideology. Similarly, data labelling (explained in Section 6) is done on each ideology separately. The reason behind this segregation is that the corpora of the two ideologies are different and the LDA topics (explained in Section 5) are based on the probability of keywords within the documents of a particular corpus. In the last phase, the ISIS/Jihadist and White supremacist labelled datasets are merged (explained in Section 6).

Seed Data Collection
For data collection of a seed dataset, we collected research articles from existing literature, examples from extremist identification websites, and blogs recognizing influential propagandists, radicals, and extremist recruiters [7].

Sources
The seed dataset is collected based on ISIS/Jihadist and White supremacist ideologies. The primary objective of the seed dataset is to collect text examples of propaganda, radicalization, and recruitment. Multiple newspaper articles, journal papers, book chapters, and websites are selected. Suitable sources for text examples of propaganda, radicalization, and recruitment are selected using a snowballing technique [8]. Table 3 provides a few examples from the seed dataset.

Research Articles and Reports
Seed text was selected only from journal papers that explicitly identify extremist text as propaganda, radicalization, or recruitment [9,10,14]; such papers are limited in number. Journal papers were selected from a database in a manner similar to our work in [6,15]. The search was limited to the period January 2015 to December 2020. Some older studies were also included using a snowballing technique. A total of 105 research articles and reports were surveyed, of which 18 were selected for this work.

Newspaper, Blogs, and Websites
A snowballing technique was also used to search and select newspaper articles, blogs, and websites for the seed dataset. Most seed examples are chosen from newspaper articles, blogs [16], or counter-extremism websites [7,17]. Some websites classify users as propagandists or recruiters. Tweets or posts by such users were considered propaganda or recruitment, respectively. A total of 86 newspaper articles, blogs, and websites were surveyed, of which 32 were selected for this work.

Seed Data Features
The features of seed data include SOURCE, TYPE_OF_SOURCE, TEXT, LABEL, IDEOLOGY, GEOGRAPHICAL_LOCATION, and AUTHOR_COUNTRY_AFFILIATION.

1. SOURCE contains information like the author name, article name, or website link of the source.
2. TYPE_OF_SOURCE indicates the kind of source, such as research article, newspaper, blog, or website.
3. TEXT contains the actual extremist text, tweet, or speech provided by the source.
4. LABEL denotes whether the text is propaganda, radicalization, or recruitment, as mentioned by the source.
5. IDEOLOGY indicates the extremist ideology to which the text belongs.
6. GEOGRAPHICAL_LOCATION is a manually analyzed field that indicates any country mentioned in the text.
7. AUTHOR_COUNTRY_AFFILIATION indicates the country to which the author belongs.

Data Pre-Processing
The following steps were carried out for data pre-processing, as seen in Figure 3:
• Removal of stopwords. Stopwords such as prepositions can affect the outcome of NLP algorithms, so they are removed.
• Removal of URLs. This work does not focus on the use of URLs, so regular expressions are used to remove them.
• Removal of emojis, hashtags, retweets, and digits. Emojis and digits are not considered in this research work, and symbols like hashtags, @, and retweet markers are out of scope. Thus, they are removed in pre-processing.
• Lemmatization and lowercase. Lemmatization is used to ensure that meaningful words get selected for analysis. The remaining documents are converted into lowercase, so the case of terms does not affect the outcome of algorithms.
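The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used to build the dataset: the stopword list is a tiny placeholder, whole hashtag and mention tokens are dropped (the text does not say whether only the symbols are removed), the order of the steps is our choice, and lemmatization is left as a pluggable hook because the lemmatizer used is not named.

```python
import re

# Tiny illustrative stopword list (assumption); a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "for",
             "and", "or", "is", "are", "with", "by", "from"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
RETWEET_RE = re.compile(r"^\s*RT\b:?", flags=re.IGNORECASE)
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # also drops digits, emojis, punctuation


def identity_lemmatizer(token):
    """Placeholder hook: returns the token unchanged."""
    return token


def preprocess(text, lemmatize=identity_lemmatizer):
    """Return a list of cleaned tokens for one tweet or seed example."""
    text = URL_RE.sub(" ", text)              # remove URLs
    text = RETWEET_RE.sub(" ", text)          # remove a leading retweet marker
    text = MENTION_HASHTAG_RE.sub(" ", text)  # remove @mentions and #hashtags
    text = text.lower()                       # lowercase
    text = NON_LETTER_RE.sub(" ", text)       # keep ASCII letters and whitespace only
    return [lemmatize(tok) for tok in text.split() if tok not in STOPWORDS]


example_tokens = preprocess(
    "RT @user: Visit https://example.com for 3 FREE tips in the #news"
)
print(example_tokens)
```

Any callable that maps a token to its lemma can be passed as `lemmatize`; the identity default only keeps the sketch free of external dependencies.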

LDA with Coherence Score
In this research work, data validation means verifying the manual annotation using the LDA topic modeling technique [18]. Topic modeling groups documents by theme in an unsupervised way. The themes are determined from the sets of keywords present in the corpus, so the relevance of a document can be established by looking at those keywords. Latent Dirichlet allocation (LDA) is a widely used topic modeling technique. LDA works in two parts: it estimates the topics present in each document and the probability of each word belonging to a topic. Thus, LDA is used to determine the importance of specific words in extremism data. We further evaluate the strength of each topic with a coherence score [19]. A coherence score measures the semantic similarity between high-scoring words in a topic: the higher the coherence score, the greater the semantic similarity among the words in the topic. Topic coherence reflects the co-occurrence of words within documents in the corpus, indicating a semantic relation between the words [19]. Word mover's distance [20] is also used to find the relationship between the LDA topics of the seed and the seed labels. Thus, the empirical annotation is statistically validated.

Figure 4a,b shows topic coherence against the number of topics for the seed dataset. As seen in Figure 4a,b, the coherence score was used to determine an optimal number of topics for the seed dataset. As observed, topic counts of 3, 4, and 5 show the highest coherence, of around 0.55 for ISIS and 0.68 for White supremacists. The literature [8,9,14] indicates that extremism has three main types: propaganda, radicalization, and recruitment. Hence, we chose three topics (k = 3) within extremist speech or text. The LDA optimization is also performed using GridSearchCV with 3, 4, and 5 topics. The best LDA model found using GridSearchCV contains only three topics.
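To make the link between coherence and word co-occurrence concrete, the sketch below computes one common coherence measure, UMass coherence, from document frequencies. The choice of UMass is an assumption made for illustration: the coherence measure behind the scores reported above is not named, and UMass values are on a different scale and are not comparable to them. The function name and toy corpus are hypothetical.

```python
import math


def umass_coherence(topic_words, documents):
    """UMass coherence of one topic.

    topic_words: top words of the topic, ordered from most to least probable.
    documents:   iterable of tokenized documents (lists of words).
    Sums log((D(w_m, w_l) + 1) / D(w_l)) over all pairs with l < m, where
    D counts the documents containing the given word(s).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(topic_words)):
        for l in range(m):
            d_l = doc_freq(topic_words[l])
            if d_l == 0:
                raise ValueError(f"word {topic_words[l]!r} never occurs in the corpus")
            score += math.log((doc_freq(topic_words[m], topic_words[l]) + 1) / d_l)
    return score


# Toy corpus with neutral placeholder tokens (illustrative only).
toy_docs = [["a", "b"], ["a", "b"], ["a", "c"], ["c", "d"]]
coherent = umass_coherence(["a", "b"], toy_docs)    # "a" and "b" often co-occur
incoherent = umass_coherence(["a", "d"], toy_docs)  # "a" and "d" never co-occur
print(coherent, incoherent)
```

Words that appear together in many documents push the score up, which is the behavior the coherence-based choice of the number of topics relies on.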

Word Mover's Distance (WMD) Using Google News Pretrained Vector
Word mover's distance is used to verify the similarity between the seed labels and the topics created using LDA. WMD calculates the similarity or dissimilarity between documents, even if they have no words in common [18]. Intuitively, WMD measures the minimum cumulative distance that the embedded words of one document need to travel to reach the embedded words of another [18]. Word embeddings like Word2Vec are necessary to calculate the semantic distance between documents. The advantages of WMD are that it has no hyperparameters, that the distance between documents can be broken down into distances between individual words, and that it works with popular word embeddings like Word2Vec.
In this study, to calculate the WMD, the topic corpus and label corpus are compared using a Google News pretrained vector. Tables 4 and 5 show the results.
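For reference, the standard formulation of WMD as an optimal transport problem can be written as follows. The notation is introduced here for illustration and is not used elsewhere in this paper: $d_i$ and $d'_j$ are the normalized frequencies of word $i$ in the first document and word $j$ in the second, $x_i$ is the embedding of word $i$ (here, a Google News Word2Vec vector), and $T_{ij}$ is the amount of word $i$ that "travels" to word $j$.

```latex
\[
\mathrm{WMD}(d, d') \;=\; \min_{T \ge 0} \; \sum_{i=1}^{n} \sum_{j=1}^{n} T_{ij} \, \lVert x_i - x_j \rVert_2
\quad \text{subject to} \quad
\sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i,
\qquad
\sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j .
\]
```

A lower value means that less "travel" is needed to turn one document into the other, which is why lower distances are read as higher similarity in the tables that follow.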

Inference
The comparison of seed labels and seed topics produces acceptable results. The propaganda sub-corpus of the ISIS/Jihadist seed has the lowest distance, 0.8100, to topic 0 of ISIS/Jihadist. Similarly, the radicalization sub-corpus is at the lowest distance of 0.8107 from topic 1 of ISIS/Jihadist. The recruitment sub-corpus has the lowest distance of 0.7871 from topic 2 of ISIS/Jihadist. A similar comparison is made for the White supremacist (WS) seed labels and WS seed topics. The propaganda sub-corpus of the WS seed is at a distance of 0.7894 from topic 1 of the WS seed. Topic 2 of the WS seed is nearest to the recruitment sub-corpus at a distance of 0.9463, while topic 0 is nearest to radicalization at a distance of 0.9071.

Multi-Ideology ISIS/Jihadist White Supremacist (MIWS) Dataset
The MIWS dataset is constructed with tweets collected from Twitter for the ISIS/Jihadist and White supremacist ideologies. It can be used to train a classifier that detects extremism text and further classifies it into propaganda, radicalization, and recruitment.

MIWS Data Collection
To collect relevant tweets and metadata, we constructed different search queries with different keywords. We used popular keywords that are associated with extremist ideologies, like "munafiq", "kuffar", "white genocide", and "anti-white", as mentioned in Table 6. These keywords are referenced from [8,21-25]. We also used some new keywords like "kufr army", "wesupporttaliban", "talibanourguardians", "globalists", "zog", etc., to collect recent tweets. The geographical locations were identified by manually searching the collected tweets for locations. If no location was present in a tweet, it was labeled as "undefined".

Table 6. Examples of keywords and combinations used for tweet collection.
Munafiq | Antiwhite | Murtadin
White Genocide | Kufr Army | White Power
Kafir | Globalist | Zog
WeStandWithTaliban | WPWW |

The following metadata were collected from tweets using the Twitter API.

1. TWEET_ID: The unique ID of a tweet.
2. CREATED_AT: Time at which the tweet was created or posted.
3. USERNAME: Username of the account that posted the tweet.
4. NAME: Name, if provided by the user.
5. GEO_ENABLED: Boolean value for geographical data about the tweet.
Due to the Twitter data sharing policy, only TWEET_ID, CREATED_AT, and GEO_ENABLED can be shared publicly. Table 7 is a snapshot of collected tweets with geographical location and dominant topic. Table 8 shows that 20,000 tweets were collected for each of the ISIS/Jihadist and White supremacist ideologies. Table 9 shows the count of tweets for some keywords. In extremist tweets, words like 'munafiq', 'munafiqeen', 'kuffar', and 'white lives matter' are frequently mentioned.

Table 8. Count of tweets collected for each ideology.
Ideology | No. of tweets collected from Twitter
ISIS/Jihadist | 20,000
White supremacist | 20,000

Table 9. Count of tweets collected for each keyword.

Construction of MIWS Using Seed Dataset
As described in Section 6.1, tweets related to specific ideologies extracted from Twitter are merged to form the MIWS dataset as seen in Figure 5.

Data Pre-Processing
Data pre-processing is carried out as mentioned in Section 4.

Data Labeling/Annotation
The following steps are performed for data annotation:

LDA on Collected Tweets
As shown in Figure 5, topics from the collected Twitter data are compared with the labeled topics from the seed dataset. To extract topics, the Latent Dirichlet Allocation (LDA) [26] method is used. To confirm the best possible topics, GridSearchCV is used: hyperparameter tuning is performed to select optimal parameters for the best model. Table 10 provides the best parameters for the LDA model applied to the collected data for each ideology.
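A search of this kind can be sketched with scikit-learn as follows. This is an illustration under assumptions rather than the configuration used in this work: the toy documents are neutral placeholders, only the number of topics is searched (over 3, 4, and 5, mirroring the seed-data search), the helper name `select_topic_count` is hypothetical, and the parameters in Table 10 are not reproduced. With no scorer given, GridSearchCV ranks LDA models by the estimator's own score, an approximate log-likelihood on held-out folds.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV


def select_topic_count(documents, candidates=(3, 4, 5), folds=3):
    """Fit LDA for each candidate topic count and return the fitted search."""
    counts = CountVectorizer().fit_transform(documents)  # bag-of-words matrix
    search = GridSearchCV(
        LatentDirichletAllocation(random_state=0),
        param_grid={"n_components": list(candidates)},
        cv=folds,
    )
    search.fit(counts)  # unsupervised: no labels are passed
    return search


# Neutral placeholder documents (illustrative only, already pre-processed).
toy_documents = [
    "rain wind storm cloud forecast", "storm cloud rain thunder",
    "sunny forecast wind temperature", "cloud temperature rain forecast",
    "goal match team player score", "team player coach match",
    "score goal league team", "player league coach score",
    "recipe oven flour sugar bake", "bake flour butter recipe",
    "sugar butter oven cake", "cake recipe bake sugar",
]
search = select_topic_count(toy_documents)
best_k = search.best_params_["n_components"]
print(best_k)
```

On a real corpus, the grid would typically include further LDA parameters; which ones were tuned here is given only in Table 10.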

Comparison between Seed Labels and Topics of Collected Tweets
It is required to compare topics based on the ideology. Labeled topics from the ISIS/Jihadist seed and White supremacist seed are compared with topics from ISIS/Jihadist and White supremacists collected tweets. This was done to maintain uniformity and accuracy across ideologies.
Word mover's distance (WMD) is used to compare collected tweets and seed labels. WMD provides semantically meaningful comparisons because it relies on word embeddings learned from local co-occurrences of words in sentences. Thus, the lower the distance, the greater the similarity between texts. To leverage WMD's properties, the Word2Vec vectors pretrained on Google News are used [27]. As seen from Tables 11 and 12, each topic is given the label to which it has the lowest distance, because that label is more similar to the topic than the others are. The similarity of ISIS/Jihadist seed labels and ISIS/Jihadist tweet topics is shown in Table 11. The propaganda sub-corpus of the ISIS/Jihadist seed has the lowest distance, 0.8455, to topic 1 of the ISIS/Jihadist tweet topics. Similarly, the radicalization sub-corpus is at the lowest distance of 0.8575 from topic 0 of the ISIS/Jihadist tweet topics. The recruitment sub-corpus has the lowest distance of 0.8464 from topic 2 of the ISIS/Jihadist tweet topics.
A similar comparison is made for White supremacist tweets, as seen in Table 12. The propaganda seed sub-corpus is at a distance of 0.7924 from topic 2 of the WS tweets. Topic 0 of the WS tweets is near the recruitment seed sub-corpus at a distance of 0.8032, while topic 1 is near radicalization at a distance of 0.8021.
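The lowest-distance rule described in this section can be sketched as a small function. The distance values below are illustrative placeholders, not the entries of Tables 11 and 12, and the function name `assign_labels` is hypothetical; the conflict check is an added safeguard for the case where two labels would claim the same topic.

```python
def assign_labels(distances):
    """Map each topic to the seed label with the lowest WMD to it.

    distances: {label: {topic_id: wmd_value}}
    Returns {topic_id: label}; raises if two labels pick the same topic.
    """
    topic_to_label = {}
    for label, per_topic in distances.items():
        best_topic = min(per_topic, key=per_topic.get)  # topic at lowest distance
        if best_topic in topic_to_label:
            raise ValueError(
                f"topic {best_topic} is nearest to both "
                f"{topic_to_label[best_topic]!r} and {label!r}"
            )
        topic_to_label[best_topic] = label
    return topic_to_label


# Illustrative values only (assumption), chosen so each label has a distinct nearest topic.
toy_distances = {
    "propaganda":     {0: 0.95, 1: 0.80, 2: 0.90},
    "radicalization": {0: 0.82, 1: 0.93, 2: 0.97},
    "recruitment":    {0: 0.96, 1: 0.91, 2: 0.81},
}
toy_mapping = assign_labels(toy_distances)
print(toy_mapping)
```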

Merging of Datasets
Topic 0 of ISIS/Jihadist and topic 1 of WS are labeled as radicalization and contain 10,120 tweets. Radicalization includes more politically aligned tweets. Topic 1 of ISIS/Jihadist and topic 2 of WS are labeled as propaganda and consist of 19,523 tweets; thus, propaganda is the largest of the three classes. Propaganda contains religious keywords, achievements, and glorification of ideology. Topic 2 of ISIS/Jihadist and topic 0 of WS are labeled as recruitment and contain 10,893 tweets. General hate, discussion about the degradation of old or religious ways, and incitement against a particular group are observed in recruitment. Table 13 shows a few tweets and their annotated labels with ideology, while Table 14 shows the statistical summary for the seed and MIWS datasets. After annotation, the data from both ideologies are merged to form a new dataset called the Multi-Ideology ISIS/Jihadist White Supremacist (MIWS) dataset.
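The labeling and merging step can be sketched as follows. The topic-to-label mappings are the ones stated above; the field names `DOMINANT_TOPIC`, `LABEL`, and `IDEOLOGY`, the helper names, and the four toy rows are assumptions made for illustration.

```python
from collections import Counter

# Topic-to-label mappings as stated in the text above.
ISIS_TOPIC_LABELS = {0: "radicalization", 1: "propaganda", 2: "recruitment"}
WS_TOPIC_LABELS = {0: "recruitment", 1: "radicalization", 2: "propaganda"}


def label_rows(rows, topic_labels, ideology):
    """Attach a class label and an ideology tag to each tweet record."""
    return [
        {**row, "LABEL": topic_labels[row["DOMINANT_TOPIC"]], "IDEOLOGY": ideology}
        for row in rows
    ]


def merge_miws(isis_rows, ws_rows):
    """Label each ideology with its own mapping, then concatenate."""
    return (label_rows(isis_rows, ISIS_TOPIC_LABELS, "ISIS/Jihadist")
            + label_rows(ws_rows, WS_TOPIC_LABELS, "White supremacist"))


# Hypothetical records: only an ID and the dominant LDA topic per tweet.
isis_rows = [{"TWEET_ID": 1, "DOMINANT_TOPIC": 1}, {"TWEET_ID": 2, "DOMINANT_TOPIC": 0}]
ws_rows = [{"TWEET_ID": 3, "DOMINANT_TOPIC": 2}, {"TWEET_ID": 4, "DOMINANT_TOPIC": 0}]

merged = merge_miws(isis_rows, ws_rows)
class_counts = Counter(row["LABEL"] for row in merged)
print(class_counts)
```

Labeling before merging matters because the same topic index refers to different classes in the two ideologies.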

Data Validation of MIWS Dataset
Data validation is performed on the complete MIWS dataset using WMD. For validation, the sub-corpora of the three labels are compared pairwise, which gives three results. The WMD between propaganda and radicalization is 5.4632, while the WMD between propaganda and recruitment is 3.4831. Lastly, the WMD between recruitment and radicalization is 4.6590. These results show a clear separation between the MIWS propaganda, radicalization, and recruitment labels.

Discussion and Implications
The MIWS dataset supports multi-ideology and multi-class classification, specifically into extremism types like propaganda, radicalization, and recruitment. These types of text are mostly disinformation targeted at vulnerable youth. Thus, it is vital to counter extremist text and malicious disinformation on social media. However, there are only a few standard datasets related to extremism, such as the ISIS Kaggle dataset [28], Stormfront dataset [29], and Gab dataset [30]. There are also a few custom datasets that are not publicly available, such as those of Jaki et al. [31], Fraiwan et al. [32], and Ferrara et al. [33]. Most of these datasets are popular in the literature but have a few limitations: they are dated, many of the tweets and posts in them have been deleted or suspended, and classification is limited to binary labels such as extremist/non-extremist or hate/no hate [30,33,34].
MIWS tries to address these issues. The data collected for MIWS is recent and influenced by more recent events, writing styles, and expressions on social media. Most tweets collected are available online, and more information about the extremist text can be gathered [7,35]. The dataset is annotated into propaganda, radicalization, and recruitment, which provides a clear view of the conversations and topics in the extremist text. A recruitment tweet can now be addressed distinctly from propaganda or a radical opinion, all of which are part of the disinformation spread by extremists. This helps in taking effective measures to control disinformation or curb its reach.
This dataset is one of a kind: it also caters to multiple ideologies of extremism and tries to capture the diversity of writing and presentation styles used by activists from ISIS and White supremacist groups, which are the most popular extremist ideologies [36].
Academicians, researchers, and law enforcement agencies can use the MIWS dataset to identify and analyze extremist posts and disinformation on Twitter or any other website. Identifying extremist text as propaganda, radicalization, and recruitment can further help explore the topics and events related to extremism for adequate control of disinformation. The automatic classification that can be offered with this dataset can reduce the time taken for analysis and encourage law enforcement agencies and social media networks to take rapid action against such tweets or posts.
The MIWS dataset can be used in conjunction with [37] to identify a high prevalence of extremism in a particular geographical area using different semantic models like SpaCy. This could extend [37] to inform decisions about help provided to countries, based on more recent events and on the extremist propaganda, radicalization, and recruitment taking place in each country. The MIWS dataset can also be used in addition to the Global Terrorism Dataset (GTD) [38] to analyze propaganda, radicalization, and recruitment leading to terrorist events and their aftereffects. To work with the GTD, MIWS needs to be updated frequently.

Limitations
This research work presents two different datasets, so the limitations of each dataset are as below:

Conclusions
The presented work contributes to the detection of extremist text on social media, characterizes the major types of extremism text, and thus contributes to curbing the spread of disinformation. This research work contributes the construction of a multi-ideology, multi-class extremism text seed dataset and of the MIWS dataset. To the best of our knowledge, the seed dataset collected from research articles, blogs, and counter-extremism websites is the first of its kind and can be used to classify any extremism text into radicalization, recruitment, and propaganda. This hypothesis is validated by constructing the MIWS dataset from recent tweets collected from Twitter, which can in turn be used to classify any extremism text into radicalization, recruitment, and propaganda. The presented seed dataset is statistically validated using a coherence score and WMD, and the MIWS dataset is validated using WMD. This makes both datasets statistically validated resources for further research on extremism text detection and analysis.

Future Work
There are still a few areas that can be improved:

• Size of seed dataset: The size of the seed dataset can be extended with labeled extremism text from the latest research.