RADAR#: An Ensemble Approach for Radicalization Detection in Arabic Social Media Using Hybrid Deep Learning and Transformer Models

Al-Shawakfa, Emad M.; Alsobeh, Anas M. R.; Omari, Sahar; Shatnawi, Amani

doi:10.3390/info16070522

¹ Faculty of Information Technology and Computer Science, Yarmouk University, Irbid 21163, Jordan

² School of Computing, Southern Illinois University, Carbondale, IL 62901, USA

³ School of Computing, Weber State University, Ogden, UT 84405, USA

* Author to whom correspondence should be addressed.

Information 2025, 16(7), 522; https://doi.org/10.3390/info16070522

Submission received: 24 March 2025 / Revised: 6 June 2025 / Accepted: 17 June 2025 / Published: 22 June 2025

Abstract

The recent increase in extremist material on social media platforms complicates countermeasures in support of international cybersecurity and national security efforts. This paper introduces RADAR#, a deep ensemble approach for detecting radicalization in Arabic tweets. Our model combines a hybrid CNN-Bi-LSTM framework with a leading Arabic transformer model (AraBERT) through a weighted ensemble strategy. We employ domain-specific Arabic tweet preprocessing techniques and a custom attention layer to better focus on radicalization indicators. Experiments on a dataset of 89,816 Arabic tweets indicate that RADAR# reaches 98% accuracy and a 97% F1-score, surpassing existing advanced approaches. The ensemble strategy is particularly beneficial in handling dialectal variations and context-sensitive words common in Arabic social media posts. We provide a full performance analysis of the model, including ablation studies and attention visualization for better interpretability. Our contribution offers the cybersecurity community an effective mechanism for the early detection of online radicalization in Arabic-language content, with potential applications in counter-terrorism and online content moderation.

Keywords: cybersecurity; radicalization detection; Arabic NLP; deep learning; transformer models; ensemble learning; social media analysis; attention mechanisms

1. Introduction

Arabic online content has increasingly been exploited by extremist groups to disseminate radical ideologies and recruit vulnerable individuals. Social media platforms like X (formerly Twitter) enable the rapid spread of extremist narratives through retweets and shares. In particular, radicalized content in Arabic often leverages religious and political rhetoric to appeal to local audiences, making detection challenging. For example, extremist propagandists may use Quranic verses or historical references as veiled justifications for violence. Understanding these patterns of dissemination is crucial, as millions of users can be influenced before countermeasures take effect [1,2]. The X platform, known for its real-time information exchange, presents unique challenges for identifying and contextualizing radical content. With over 160 million active users of Facebook and X in the Middle East [3], these platforms are highly important for both monitoring and detecting radicalization patterns [4]. Despite efforts by platforms to moderate such content, there remains a critical need for automated tools to detect and curb Arabic extremist messages in real time.

Radicalization, defined as the process by which individuals or groups adopt increasingly extreme social, political, or religious ideals that reject or undermine contemporary ideas and expressions of freedom of choice, poses a critical threat to social stability and security [5,6]. The volume of Arabic social media content (“over 8.5 billion daily Arabic tweets” with approximately 3.2% containing potentially radical content) makes manual monitoring impractical [7]. X has emerged as a notable platform for voicing opinions and disseminating news; in January 2023, it recorded 436 million active accounts, a growth of almost 19% compared to the previous year [8]. The key to its success is the wide range of services it offers, which allows users to freely express their opinions and makes it a valuable source of mobile data. On the flip side, the virtual world of social media reflects both the positive and negative aspects of our society, which calls for careful analysis and handling of its adverse effects [9]. Events such as the Black Lives Matter movement and the 2020 US elections highlight the dangers of extremist information on the internet, showing that extremists use it to propagate their beliefs and rally followers [10,11].

Detecting and mitigating radical content has become increasingly challenging, especially for Arabic-language content, owing to its linguistic intricacies, dialectal variations, cultural nuances, and right-to-left writing system. These challenges are compounded by content-sharing strategies that are themselves changing. The effectiveness of radicalization detection systems is also heavily influenced by hyperparameter optimization, which can significantly impact model accuracy and generalization capabilities. Our research demonstrates that the careful tuning of parameters such as learning rates, dropout ratios, and ensemble weights can improve detection accuracy by up to 5% compared to default configurations.

While existing approaches have attempted to address this challenge using traditional ML models, they often struggle with the nuanced nature of Arabic text and the evolving patterns of radical content. These approaches achieve accuracies ranging from 85% to 92%, but frequently fail to capture subtle indicators of radicalization. This work is motivated by the rising threat of online radicalization to societal stability. While considerable work has addressed extremist content in English, Arabic remains underexplored in this context. Arabic presents unique challenges due to its rich morphology and diverse dialects, which extremists exploit to evade detection. By focusing on Arabic social media, our work addresses a significant practical gap: empowering governments, internet providers, and platforms with AI systems capable of flagging radical content before it spreads. We aim to improve the early detection of Arabic extremist posts, thereby contributing to counter-terrorism and online safety efforts.

In this paper, we introduce Radicalization Analysis using Deep Arabic Recognition (RADAR#), a novel ensemble model for radicalization detection in Arabic social media. RADAR# combines a custom hybrid deep learning network and a transformer-based model to leverage multiple feature representations of text. The proposed architecture combines a CNN, which is effective at capturing local patterns, with the long-range understanding capability of a Bi-LSTM, tuned for Arabic text analysis. RADAR# is rigorously evaluated against state-of-the-art methods using a dataset of 89,816 Arabic tweets, achieving 98% accuracy and a 97% F1-score.

The remainder of this paper is organized as follows: Section 2 presents the background and literature review. Section 3 presents the detailed architecture of our proposed model, including the CNN-BiLSTM component, transformer integration, attention mechanism, and ensemble strategy. Section 4 describes hyperparameter tuning and optimization. Section 5 presents the results and detailed discussion. Finally, Section 6 concludes the paper and discusses future directions.

2. Background and Literature Review

2.1. Understanding Radicalization in Online Contexts

Extremists are those who act, think, or intend to act against society's standards and rights [12], and they may be motivated by factors such as social isolation, poverty, dysfunctional families, and poor governance [5]. Radicalization damages society by making people more vulnerable, escalating disputes, and squandering resources [13]. Extremist organizations like the Taliban and ISIS use internet communities to recruit members and spread propaganda [14,15], because radical ideas propagate quickly online and the clarity and simplicity of terrorist communications appeal to disenfranchised young people [7].

Kaplan and Haenlein (2010) describe social media as “a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0, allowing for the creation and exchange of user-generated content” [3]. In a similar vein, Davis (2016) provides a definition that describes it as “a collection of online tools that enable users to create, curate, and share content they have generated themselves, either individually or collaboratively.” [16].

Recent data reveals that 4.33 billion people are actively participating on different social media platforms, representing 55.1% of the global population. Individuals usually engage with these platforms for about 142 min daily [12]. In 2020, the COVID-19 pandemic brought about a significant increase in social media engagement, with platforms drawing in approximately 4 billion users each day. Social media platforms such as Facebook, TikTok, Instagram, and X significantly impact a wide range of areas, including tourism, education, healthcare, sports, and economics.

2.2. Evolution of Online Radicalization Detection

The detection of online radicalization has evolved significantly with the advancement of AI/ML techniques [7,17]. Prior work has explored the use of ML and deep learning (DL) techniques to identify radical content on social media platforms. Initial efforts to identify online radicalization focused mainly on vocabulary-based methods and simple machine learning techniques. Johnston and Weiss (2017) created a system designed to identify Sunni extremism in nine different languages, reaching an accuracy of 93.2% with a six-layer neural network [18,19]. Nonetheless, their method faced challenges with dialectal variation and context-dependent interpretation, especially in Arabic text.

Becker et al. (2019) proposed a CNN architecture-based sentiment analysis of Twitter responses to terrorist events. This work was remarkable because, for the first time, it integrated user profile attributes and established that demographic factors have a great impact on emotional responses to extremist content [20]. While innovative, their model’s accuracy of 88% on English tweets indicated the need for language-specific optimizations. Furthermore, a hybrid approach was proposed by Ahmad et al. (2019), which used both CNN and LSTM features for extremist tweet classification with an accuracy of 92% on 25,000 tweets [21]. These studies highlight that incorporating linguistic sentiment cues can aid in distinguishing radical content from mere political dissent.

Chen et al. (2020) pioneered the use of sentiment analysis (SA) with military-specific lexicons, achieving 91% accuracy through LSTM and Bi-LSTM models [22]. Their work showed the importance of domain-specific training data but was limited to English and Taiwanese content. Nizzoli et al. (2021) proposed a hyper-model that combined CNN and RNN algorithms and reported an F1-score of 0.90 [23]. This was one of the pioneering works that could effectively integrate character-based and word-based feature extraction. However, the model's performance suffered on specific properties of Arabic script. Gaikwad et al. (2021) used pre-trained transformers such as BERT and RoBERTa and reported an F1-score of 0.72 [7]. However, their focus on multilingual datasets diluted the effectiveness of domain-specific adaptations. Nathani and Patel (2022) furthered this approach with a CNN model incorporating GloVe embeddings, reporting 73.5% accuracy [24]. However, their work focused primarily on hate speech detection in English, which highlights the lack of robust methods for Arabic content. Rajendran et al. (2022) studied the detection of online extremism with transformer-based models, such as RoBERTa and BERT, achieving 95% accuracy [25]. These models excel on English text, but their application to Arabic remains relatively unexplored. Moreover, although these transformer architectures are powerful, their extensive resource requirements make them impractical for real-time analysis on large-scale datasets.

Alhayan et al. (2025) combined AraBERT with additional features (sentiment scores) and an MLP classifier, reaching 98% accuracy in extremist tweet detection. This demonstrates the power of transformer-based approaches for Arabic. However, pure transformer models may overlook certain lexical cues or stylistic patterns specific to extremist rhetoric that simpler architectures can capture. Conversely, traditional deep networks may miss the global context that transformers excel at. To leverage the strengths of both, recent studies have explored hybrid or ensemble models [26].

Although these studies offer useful insights, most of them disregard the complexity of Arabic text and do not account for changing patterns of radicalization. Most of these models also rely on fixed hyperparameters, which makes them hard to adapt to new data.

2.3. AI/ML Advances in Architectures for Text Analysis

The efficacy of AI/ML increases with data density and volume, which makes it especially well suited for extensive social media analysis [27]. Hierarchical feature extraction is accomplished by processing raw text input (the Input Layer), applying ML algorithms through mathematical transformations (the Hidden Layers), and then generating classification results (the Output Layer). Natural Language Processing (NLP) in particular has improved text analysis capabilities. Advanced contextual models have replaced simple Word2Vec implementations in word embeddings, especially for Arabic. Performance on dialectal variation and contextual comprehension has greatly improved with the advent of pre-trained models tailored to Arabic. Hybrid designs that combine several DL techniques have produced encouraging results: integrating CNNs for local feature extraction with LSTMs or similar models for sequential pattern recognition has been shown to be very successful [21,24].

2.4. Challenges in Arabic Text Processing

Arabic has a deep morphological structure: with the addition of prefixes, suffixes, and infixes, a single root word may produce a multitude of variants, which makes the language a complicated basis for analysis [28]. Standardization is especially difficult because of this morphological complexity and the existence of many dialectal variants across Arabic-speaking areas [29]. According to Kanerva et al. (2021) [28], context-dependent meaning in Arabic is a major problem, since a word's meaning may vary greatly depending on its use and regional interpretation. Furthermore, the complex Arabic diacritical mark system may significantly change the meaning of words, and the word elongation patterns frequent in informal writing further complicate machine analysis [30].

Aldera et al. established an important benchmark for the discipline by building an extensive Arabic dataset of more than 89,000 records [10]. Their BERT models achieved an accuracy of 97.49%, although at a considerable computational cost.

According to Alfreihat et al. (2024) [31], code-switching between various Arabic dialects and languages is common in social media discussions, which makes automated analysis much more difficult. According to their study, users of Arabic social media are rapidly combining Modern Standard Arabic with regional dialects, and non-standard acronyms and informal writing styles are becoming more prevalent. Emojis’ extensive use gives content analysis a new perspective since they may have culturally distinct meanings and often alter or support the emotion of the text [32]. Additionally, mixed script use poses special difficulties for conventional NLP techniques when Arabic text is punctuated with Latin letters or digits.

A novel method for Arabic content analysis using large language models (LLMs) is presented by Radwan et al. (2024) [33], who integrate GPT-3.5 embeddings with conventional machine learning classifiers to achieve an accuracy of 83%. Through more robust feature extraction and enhanced contextual comprehension, their study shows the potential of transfer learning from multilingual models. Alshattnawi et al. (2024) [32] make further progress in this area by investigating zero-shot and few-shot learning strategies, which are especially useful when labeled Arabic data are scarce. Additionally, recent research has shown that optimizing attention mechanisms is essential for capturing long-range relationships in Arabic text.

Furthermore, traditional measures are no longer the only way to evaluate success in Arabic text analysis. When evaluating model performance across various classification thresholds, the Receiver Operating Characteristic Area Under the Curve (ROC–AUC) has gained substantial importance, and precision–recall trade-offs provide vital information about how the model behaves on imbalanced datasets. Particularly in situations where balancing precision and recall is crucial, the F1-score has become a key measure. Analysis of false positive rates has also become more common, especially for security-sensitive applications such as radicalization detection [12].
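To make the relationship between these measures concrete, the following minimal sketch computes precision, recall, F1-score, false positive rate, and accuracy from the four cells of a binary confusion matrix. The function name and the example counts are illustrative assumptions and are not taken from the paper's experiments.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary metrics from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives. Zero denominators yield 0.0 instead of raising.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "fpr": fpr, "accuracy": accuracy}


# Hypothetical counts for an imbalanced test set (12% positive class).
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=870)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy is high (0.96) while recall is only 0.75, which illustrates why accuracy alone can be misleading on imbalanced data.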

The integration of deep linguistic knowledge with deep learning architectures is evidently limited, leading to models that frequently fall short of capturing the subtleties of Arabic language structure [34,35,36]. This underscores the lack of attention to Arabic-specific preprocessing, which has a substantial impact on model performance. Works in [10,24,33] have pointed out the absence of strong hyperparameter optimization frameworks, especially for Arabic-specific models.

The intricacies of Arabic text, such as dialectal variation and informal writing styles, are handled by our hybrid architecture (RADAR#), which is tailored for processing Arabic text and incorporates extensive linguistic knowledge at several processing levels. Optimal model performance across various Arabic content categories is ensured by the dynamic hyperparameter optimization approach. Feature significance analysis and attention visualization further improve model interpretability, meeting a crucial need in security applications.

2.5. Transformer Models for Arabic

The introduction of transformer models [37] marked a paradigm shift in NLP, with models like BERT (Bidirectional Encoder Representations from Transformers) achieving state-of-the-art results across numerous tasks [38]. These models use self-attention mechanisms to capture contextual relationships between words, enabling a more effective representation of language semantics.
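As a rough illustration of the self-attention idea, the sketch below implements scaled dot-product attention over a short sequence of toy vectors using only the Python standard library. It is a simplification under stated assumptions: queries, keys, and values are all the raw input vectors (no learned projection matrices, single head), whereas real transformer layers such as those in BERT learn separate projections and use multiple heads.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors.

    Returns (outputs, weights): each output is a weighted average of all
    input vectors, and weights[i][j] is how much token i attends to token j.
    """
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(w * value[i] for w, value in zip(row, vectors))
                        for i in range(dim)])
    return outputs, weights


# Three toy 2-dimensional "token embeddings" (illustrative values only).
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

In this toy input the first two vectors are similar, so they attend to each other more strongly than to the third.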

Several Arabic-specific transformer models have been developed in recent years. AraBERT [39] was trained on a large corpus of Arabic news articles and Wikipedia pages, demonstrating significant improvements over multilingual models on tasks such as sentiment analysis and named entity recognition. MARBERT [40] was specifically designed to handle Arabic dialects and social media content, trained on a corpus of 1 billion Arabic tweets. Other notable Arabic transformer models include AraELECTRA [41], which uses a more efficient pre-training approach than BERT-based models, and AraGPT2, a generative model for Arabic text. These models have shown strong performance on various Arabic NLP tasks, although their application to radicalization detection has been limited.

Recent work has demonstrated the effectiveness of transformer models for Arabic text classification. Abdelali et al. (2021) [42] used AraBERT for Arabic offensive language detection, achieving a 90.5% F1-score. Al-Twairesh and Al-Negheimish (2019) [43] applied BERT to Arabic sentiment analysis, outperforming traditional approaches. However, these studies have primarily focused on single transformer models rather than ensemble approaches that combine multiple architectures.

2.6. Ensemble Methods in NLP

Ensemble approaches combine two or more models to improve aggregate performance, capitalizing on the variation between different techniques to drive down error and improve stability. Ensemble approaches have proven particularly useful on difficult classification tasks in NLP. Xia et al. (2011) [44] demonstrated that ensembles outperformed stand-alone classifiers on every dimension in sentiment classification, and Wang et al. (2014) [45] demonstrated similar benefits for relation extraction.

Various ensemble strategies have been tried in the literature. Voting ensemble methods combine base model predictions through majority voting (hard voting) or probability averaging (soft voting). Stacking approaches train a meta-classifier on the base models' outputs, learning the optimal combination weights. Bagging and boosting approaches form ensembles from models trained on different subsets of the data or by focusing on difficult examples.
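The difference between hard and soft voting can be seen in a small sketch. The helper names and the probabilities below are illustrative assumptions; they show a case where two weakly confident models outvote a highly confident one under hard voting, while soft voting follows the averaged probabilities.

```python
from collections import Counter


def hard_vote(labels):
    """Majority vote over one predicted label per model."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probas):
    """Average per-class probabilities across models, then take the argmax.

    probas is a list with one probability vector per model.
    Returns (predicted_class, averaged_probabilities).
    """
    n_classes = len(probas[0])
    avg = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg


# Hypothetical outputs of three models for one tweet: [P(class 0), P(class 1)].
model_probas = [[0.55, 0.45], [0.60, 0.40], [0.05, 0.95]]
model_labels = [max(range(2), key=p.__getitem__) for p in model_probas]

hard = hard_vote(model_labels)
soft, averaged = soft_vote(model_probas)
print("hard vote:", hard)
print("soft vote:", soft, [round(a, 2) for a in averaged])
```

Here hard voting picks class 0 (two votes to one), whereas soft voting picks class 1 because the third model's strong confidence dominates the average.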

Recent research has moved towards ensembles of deep models. Liu et al. (2019) [46] combined CNNs, RNNs, and attention mechanisms for sentiment analysis and achieved state-of-the-art results on benchmark data. Pappagari et al. (2019) [47] proposed an ensemble of transformer models for text classification that performs better than the individual models.

In the Arabic setting, ensemble approaches have been promising but underexploited. Al-Sallab et al. (2017) [48] used an ensemble of deep models for Arabic sentiment analysis, with 87.5% accuracy on a news corpus. Later, Al-Zahrani and Al-Yahya (2024) [49] proposed an ensemble of transformer models for Arabic fake news detection, achieving a 94% F1-score using weighted averages on the AMFND dataset.

Even with such advancements, the use of ensemble techniques in Arabic radicalization detection, particularly those that take advantage of the combination of traditional deep learning models and transformer models, remains a wide research gap.

3. RADAR#: Radicalization Analysis Using Deep Arabic Recognition

RADAR# is an ensemble approach that combines a hybrid CNN-Bi-LSTM model and transformer models to identify radicalization in Arabic tweets. It draws on the strengths of each component: CNN layers for local feature extraction, Bi-LSTM for capturing long-range dependencies, and transformer models for interpreting Arabic text in context, as shown in Figure 1. The CNN-Bi-LSTM component processes input text through embedding, convolutional, and bidirectional LSTM layers, with an attention mechanism to focus on relevant parts of the text. The transformer component uses the AraBERT model, which excels at processing Modern Standard Arabic (MSA). The outputs of the two components are then merged using a weighted ensemble process, where the weights are optimized on validation performance [50]. The ensemble method increases robustness by taking advantage of the complementary strengths of the different model types, which helps in handling the linguistic diversity of Arabic social media messages.
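A weighted ensemble of two probabilistic classifiers can be sketched as follows. This is not the authors' implementation: this section does not specify the optimization procedure, so a simple grid search over a single mixing weight, scored by validation accuracy, is assumed here, and all names and numbers are hypothetical.

```python
def ensemble_proba(p_hybrid, p_bert, weight):
    """Convex combination of two models' P(extremist) scores per example."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(p_hybrid, p_bert)]


def accuracy(probs, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def tune_weight(p_hybrid, p_bert, labels, steps=11):
    """Grid-search the mixing weight in [0, 1] on validation data.

    Assumption: the first weight reaching the best accuracy is kept.
    """
    best_weight, best_acc = 0.0, -1.0
    for i in range(steps):
        weight = i / (steps - 1)
        acc = accuracy(ensemble_proba(p_hybrid, p_bert, weight), labels)
        if acc > best_acc:
            best_weight, best_acc = weight, acc
    return best_weight, best_acc


# Toy validation set: each model is wrong on a different example.
val_labels = [1, 0, 1, 0]
val_hybrid = [0.9, 0.7, 0.8, 0.1]   # hypothetical CNN-Bi-LSTM scores
val_bert = [0.3, 0.1, 0.9, 0.2]     # hypothetical transformer scores

best_weight, best_acc = tune_weight(val_hybrid, val_bert, val_labels)
print(f"best weight for hybrid model: {best_weight:.1f}, accuracy: {best_acc:.2f}")
```

In this toy case each model alone reaches 75% accuracy, while a mixture corrects both errors, which is the complementary behavior a weighted ensemble aims to exploit.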

3.1. Dataset Collection and Characteristics

Figure 2 shows a sample of the dataset. Our dataset comprises 89,816 Arabic tweets focusing on political and religious content, collected using the Twitter API between May 2011 and March 2021 [10] (the collection is reported as comprising approximately 145,000 tweets). The dataset used in this study is a comprehensive Arabic extremism dataset shared by Dr. Saja Aldera [10], which was specifically designed for radicalization detection in Arabic social media content. The tweets were collected using a combination of keyword filtering, account monitoring, and manual verification by Arabic language experts with backgrounds in security studies. The dataset is structured as a tab-separated values (TSV) file with the following fields: tweet ID (anonymized), tweet text content, binary classification label (1 for extremist, 0 for non-extremist), timestamp of collection, and source region (when available). The focus on politics and religion was driven by the prevalence of radical content in these domains within the Arab world. Tweets were gathered using specific keywords, such as تنظيم داعش, كفار داعش, رعاة الارهاب, هيئة عدو الله, داعش العراق, كبار العملاء, and others, reflecting common trends in the Arab world. The collection period spanned significant events, including the murder of Saudi journalist Jamal Khashoggi in 2018, which notably increased tweet volume. While the dataset is rich in content, we acknowledge potential biases introduced by keyword selection and event-driven tweet volumes. The annotations were performed by domain experts and validated using Gwet's AC1 inter-rater reliability measure, achieving a score of 0.86.
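A TSV file with the five fields listed above could be read along the following lines. The column names, the absence of a header row, and the sample rows are assumptions made for illustration; the actual file layout of the dataset may differ.

```python
import csv
import io

# Hypothetical column names for the five fields described in the text.
FIELDS = ["tweet_id", "text", "label", "timestamp", "region"]


def load_tweets(handle):
    """Parse tab-separated rows into dicts, casting the label to int.

    Assumes no header row and no quoting; an empty region becomes None
    because the source region is only available for some tweets.
    """
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        row["label"] = int(row["label"])
        row["region"] = row["region"] or None
        rows.append(row)
    return rows


# Two placeholder rows (neutral dummy text, not real dataset content).
sample = (
    "id001\tنص تجريبي أول\t0\t2019-03-01\tGulf\n"
    "id002\tنص تجريبي ثان\t1\t2020-07-15\t\n"
)
rows = load_tweets(io.StringIO(sample))
print(len(rows), rows[0]["label"], rows[1]["region"])
```

For a file on disk, the same function can be called with `open(path, encoding="utf-8", newline="")`, where `path` is whatever location the dataset is stored at.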

The dataset is balanced, with 56% extremist and 44% non-extremist tweets, as indicated by Shannon’s entropy measure of 0.98. In the top unigrams for extremist and non-extremist tweets, “الله” appears frequently in both categories, but is more prominent in extremist tweets. The word “داعش” (Daesh), associated with extremism, appears in both types, but with higher frequency in extremist tweets.
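The entropy-based balance measure mentioned above can be checked with a short calculation. The sketch below assumes the base-2 Shannon entropy of the class proportions, for which a perfectly balanced binary dataset scores 1.0; with the rounded 56/44 split it gives a value of about 0.99, close to the reported 0.98 (the small difference may come from rounding of the class proportions).

```python
import math


def shannon_entropy(proportions):
    """Base-2 Shannon entropy of a discrete distribution.

    Zero-probability classes contribute nothing to the sum.
    """
    return -sum(p * math.log2(p) for p in proportions if p > 0)


balance = shannon_entropy([0.56, 0.44])   # rounded class shares from the text
print(f"entropy of 56/44 split: {balance:.4f}")
print(f"entropy of 50/50 split: {shannon_entropy([0.5, 0.5]):.4f}")
```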

The dataset exhibits a rich linguistic diversity, reflecting the complex nature of Arabic language use on social media:

Vocabulary statistics: total word count of 247,835 words; 42,316 distinct words before preprocessing and 28,754 distinct words after preprocessing; average tweet length of 24.8 words; the vocabulary distribution follows Zipf's law, with a long tail of rare terms.

Dialectal distribution: Modern Standard Arabic (MSA), 42%; Gulf dialect, 23%; Levantine dialect, 18%; Egyptian dialect, 12%; North African dialects, 5%.

Linguistic phenomena: code-switching (Arabic–English) is present in 14% of tweets; emojis are found in 27% of tweets; tweets contain an average of 1.8 hashtags and 0.9 mentions.

The dataset includes anonymized metadata about tweet authors, providing important contextual information while preserving privacy:

Geographic distribution: Middle East, 68%; North Africa, 17%; Europe, 8%; North America, 4%; Other/Unknown, 3%.

Age demographics (available for 62% of the dataset): 18–24 years, 31%; 25–34 years, 42%; 35–44 years, 18%; 45+ years, 9%.

Account characteristics: average account age of 3.7 years; average follower count of 2145; average following count of 1873; verified accounts make up 2.3% of the dataset.

The dataset includes several embedded features that provide additional context for analysis:

Temporal features: time-of-day distribution (24 h format), day-of-week distribution, and seasonal patterns (monthly distribution).

Network features: reply chains (depth up to 5 levels), retweet patterns, and mention networks (anonymized).

Content features: URL presence (31% of tweets), media attachment indicators (image/video: 24% of tweets), and topic modeling features using LDA (10 topics).

3.2. Arabic Text Preprocessing Pipeline

Arabic text preprocessing is a critical step in developing effective NLP models, particularly for social media content where dialectal variations, non-standard orthography, and code-switching are common. The first stage of preprocessing involves cleaning and normalizing the raw text to reduce orthographic variations that do not affect meaning. Social media content is inherently noisy, containing elements such as emojis, handles, English words, hashtags, links, numbers, new lines, underscores, and punctuation. These elements are removed to focus on the core content of the tweets. For instance, emojis are stripped to eliminate emotional bias, and handles are removed to avoid distractions from the content. Links are removed as they may point to external content irrelevant to the analysis of the tweet's text. The cleaning process is illustrated in Table 1, which shows examples of tweets and the type of cleaning required. This process also involved removing non-Arabic tweets, duplicates, and tweets with fewer than seven words (excluding hashtags and user mentions). Special characters and punctuation were removed, except for those that might indicate sentiment or emphasis (e.g., exclamation marks).

To standardize the Arabic text, we applied comprehensive Arabic-specific normalization techniques that address common orthographic variations: we unified the various forms of Alif (أ, إ, آ) into a single form (ا); normalized Ya (ى, ي) and Ta Marbuta (ة, ه) variations; removed diacritics (تشكيل), which are inconsistently used in social media; normalized elongated words (e.g., “جميييييل” to “جميل”); and standardized Arabic digits to their Hindi-Arabic equivalents. These steps ensured data quality and relevance, with the final dataset reduced to 145,000 tweets. Normalization reduced the vocabulary by approximately 18% (from 42,316 to 34,699 unique words) while preserving semantic content, significantly improving model generalization by reducing the impact of orthographic variations.
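The orthographic normalization steps described above can be approximated with a few regular expressions. This is a minimal sketch rather than the authors' pipeline. The direction of the Ya and Ta Marbuta mappings (ى to ي, ة to ه) is an assumption based on common practice, tatweel removal is added as a related cleanup, and digit standardization is left out because the text does not make the target digit set unambiguous.

```python
import re

# Tashkeel marks from fathatan (U+064B) through sukun (U+0652).
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# Alif with madda (U+0622), hamza above (U+0623), hamza below (U+0625).
ALIF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
# Any character repeated three or more times in a row.
ELONGATION = re.compile(r"(.)\1{2,}")
TATWEEL = "\u0640"  # decorative stretching character


def normalize_arabic(text: str) -> str:
    """Apply light orthographic normalization to an Arabic string."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = ALIF_VARIANTS.sub("\u0627", text)   # unify to bare Alif
    text = text.replace("\u0649", "\u064A")    # Alif Maqsura -> Ya (assumed)
    text = text.replace("\u0629", "\u0647")    # Ta Marbuta -> Ha (assumed)
    text = ELONGATION.sub(r"\1", text)         # collapse elongated letters
    return text


elongated = "\u062C\u0645" + "\u064A" * 5 + "\u0644"   # the elongated example
print(normalize_arabic(elongated))
```

Collapsing runs of three or more identical characters leaves legitimate doubled letters untouched, which is the reason for the `{2,}` bound in the pattern.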

Stopwords are high-frequency words (e.g., “في” [in], “من” [from], “على” [on]) that generally do not carry significant semantic weight in text classification tasks. Removing these words enhances computational efficiency and highlights more meaningful terms. We used the Python Natural Language Toolkit (NLTK v3.9.1) library for stopword removal, which includes extensive stopword lists for 19 languages, including Arabic (Al-Shawakfa et al., 2021) [30]. While stopword removal improves text analysis efficiency, its impact varies by task. For radicalization detection, stopword removal was carefully evaluated to ensure no loss of context-critical words [51]. We excluded certain negation terms that might be critical for sentiment and intent detection and preserved religiously significant terms that might be relevant to radicalization detection. Stopword removal reduced the average tweet length by 31% (from 24.8 to 17.1 words), improving computational efficiency without sacrificing classification performance.
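Stopword filtering with a protected set of terms might look like the sketch below. The tiny stopword set and the choice of protected negation particles are illustrative assumptions, since the text does not list which terms were kept. In practice the base list could be taken from NLTK via `nltk.corpus.stopwords.words("arabic")`, which requires downloading the stopwords corpus first.

```python
# Small illustrative stopword set; a real run would start from a full list
# such as NLTK's Arabic stopwords.
BASE_STOPWORDS = {"في", "من", "على", "عن", "لا", "لم"}

# Hypothetical protected terms: negation particles kept for intent detection.
PRESERVED_TERMS = {"لا", "لم"}

STOPWORDS = BASE_STOPWORDS - PRESERVED_TERMS


def remove_stopwords(tokens):
    """Drop stopwords while keeping every protected term."""
    return [token for token in tokens if token not in STOPWORDS]


tokens = "لا أوافق على هذا الكلام".split()
filtered = remove_stopwords(tokens)
print(len(tokens), "->", len(filtered), "tokens")
```

Keeping the negation particle matters here because dropping it would invert the apparent stance of the sentence.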

Arabic’s rich morphological structure presents unique challenges for NLP tasks. We employed the Farasa segmenter [42], which is specifically designed for Arabic social media content, to decompose complex word forms into their constituent morphemes: conjunctions (و, ف), prepositions (ب, ل), and definite articles (ال) were separated from the base word. Possessive pronouns (ه, ها, هم) and verb conjugation suffixes were identified and separated. Special attention was given to proclitics and enclitics that modify word meaning. Segmentation reduced data sparsity by breaking down complex word forms, resulting in improved embedding quality. Our experiments showed that segmentation alone contributed to a 2% improvement in the F1-score for the final model.

Lemmatization transforms words into their base form using linguistic rules and vocabulary, ensuring that the resulting root word belongs to the language and retains its meaning. This step helps reduce variations of the same word, aiding more precise feature extraction, which is particularly important for Arabic due to its derivational morphology (Kanerva et al., 2021) [28]. Unlike stemming, lemmatization considers the context of words, ensuring a meaningful base form. We employed a modified light stemming algorithm [19] that balances aggressive stemming against the preservation of semantic distinctions that are critical for radicalization detection. For example, “الذهاب” (the going) lemmatizes to “ذهب” (go), preserving its contextual integrity. Verbal nouns (مصادر) are maintained separately from their verb forms. Plural patterns that might indicate group identity are preserved, diminutive forms that might carry emotional or ideological significance are retained, and technical or religious terminology is protected from over-stemming.

The influence of these retained lemmas on model performance is substantial. Preserving these distinctions improved the model’s ability to detect nuanced extremist content, contributing to a 2.5% increase in precision, and maintaining verbal nouns separately from verbs helped the model better distinguish action from ideology, a critical distinction in radicalization detection. Despite the conservative approach, lemmatization still reduced the vocabulary size by 17% (from 34,699 to 28,754 unique words) after normalization. Our ablation studies showed that removing this specialized lemmatization approach increased false positives by 3.2%, particularly for religious content that was not extremist.

Stemming involves removing suffixes from words to obtain their root form. While faster and computationally cheaper than lemmatization, stemming may produce non-meaningful words. For example, “يركض” is stemmed to “ركض.” Stemming is useful in search engines but may not always be ideal for semantic analysis. Stemming can produce ambiguous roots, such as stemming “الضمان” (guarantee) to “ضم” (include), which could introduce unintended meanings.

Splitting text into individual words or tokens is a fundamental step in NLP, essential for further analysis [52]. Each word was converted into dense vector representations using AraVec Word2Vec embeddings. These embeddings capture the semantic relationships between words, enabling the model to better understand contextual nuances in Arabic text. For the CNN-BiLSTM component, we used word-level tokenization with a maximum sequence length of 100 tokens. For transformer models, we utilized the model-specific tokenizers (WordPiece for AraBERT), which handle out-of-vocabulary words through subword decomposition.

The full preprocessing pipeline improved the F1-score by 7% compared to minimal preprocessing (removal of usernames and URLs only). The ablation showed that normalization contributed 3%, segmentation contributed 2%, and our domain-specific lemmatization method contributed 2.5% to the total F1-score improvement. The pipeline also reduced the error rate on dialectal content by 12% compared to conventional Arabic NLP preprocessing methods. The shortened vocabulary and sequence length reduced training time by approximately 24% without sacrificing model performance. This approach, which targets Arabic-specific challenges while retaining semantically significant lemmas, is one critical pillar of the effectiveness of our RADAR# model in detecting radicalization in Arabic social media posts.

Feature Engineering and Embedding

In the context of detecting radicalization in Arabic tweets, this process is crucial due to the complexity and nuances of the Arabic language. We employed Keras word embeddings to capture the semantic relationships within our dataset. This involved tokenizing and sequencing our Arabic tweets, followed by padding them to a fixed length to ensure compatibility with our models. Tweets were converted into sequences of word indices, with padding to maintain a fixed vector length of 200. For instance, a tweet such as “الثورة مستمرة حتى تحقيق المطالب” was converted into a sequence of numerical values representing each word. The advantage of using Keras embeddings lies in their ability to train embeddings specific to our dataset, thereby capturing context-specific semantics effectively.
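The conversion of tweets into fixed-length index sequences can be sketched in plain Python as follows. This is an illustration rather than the Keras implementation: the reserved indices, the `build_vocab` and `encode` helpers, and the choice to pad at the end are assumptions (Keras's `pad_sequences` pads at the front by default).

```python
from collections import Counter

PAD, OOV = 0, 1  # reserved indices (assumed convention)

def build_vocab(tweets, max_words=5000):
    """Map the most frequent words to integer indices starting at 2."""
    counts = Counter(w for t in tweets for w in t.split())
    # most_common orders by frequency; ties keep first-seen order
    return {w: i + 2 for i, (w, _) in enumerate(counts.most_common(max_words))}

def encode(tweet, vocab, max_len=200):
    """Convert a tweet to word indices, truncated or padded to max_len."""
    ids = [vocab.get(w, OOV) for w in tweet.split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))

tweets = ["الثورة مستمرة حتى تحقيق المطالب", "الثورة مستمرة"]
vocab = build_vocab(tweets)
seq = encode(tweets[0], vocab)  # five word indices followed by padding
```

Words outside the retained vocabulary map to a single out-of-vocabulary index, which is one common way to bound the embedding matrix size.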

In addition to Keras embeddings, we utilized Word2Vec through the AraVec library, which is trained on a large Arabic corpus. This method excels in capturing semantic relationships, as demonstrated by its ability to understand the contextual similarities in Arabic tweets. By integrating these pre-trained embeddings, we enhanced our models’ understanding of the nuanced meanings within the tweets.

Our analysis revealed performance differences between models using Keras embeddings and Word2Vec [53,54]. While Keras embeddings offered context-specific advantages, Word2Vec provided robust semantic understanding [55]. The top 5000 words were selected, and each tweet was represented as a sequence of these words. However, both methods faced challenges, such as dealing with Arabic morphology and the inherent noise in social media text. These challenges were addressed through meticulous preprocessing steps, including cleaning and normalization of tweets.

3.3. RADAR# Model Architecture

The RADAR# model leverages a hybrid deep learning architecture that combines CNNs for local pattern extraction and Bi-LSTMs for capturing long-range dependencies. The architecture is optimized specifically for Arabic text analysis. Figure 3 shows the RADAR# architecture, which can be decomposed into four primary components: input processing, feature extraction, temporal analysis, and classification. Input processing begins with text vectorization, where each Arabic tweet $T = \{w_{1}, w_{2}, \dots, w_{n}\}$ is transformed into a sequence of tokens. The embedding layer maps each token to a dense vector representation through a learned embedding matrix $E \in \mathbb{R}^{|n| \times d}$, where $|n|$ is the vocabulary size and $d$ is the embedding dimension. For a given word $w_{i}$, its embedding is computed as $e_{i} = E\,w_{i}$, with each embedding represented as a 200-dimensional vector. The embedding (input) layer implements dropout regularization with probability $p$ to prevent overfitting: $\hat{e}_{i} = e_{i} \odot m_{i}$, $m_{i} \sim \mathrm{Bernoulli}(1 - p)$.
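The embedding lookup and dropout mask described above can be sketched with the standard library only. The toy sizes and random initialization are assumptions for illustration, and the mask is applied exactly as written in the text, without the 1/(1 − p) rescaling that deep learning frameworks usually add during training.

```python
import random

def embed(token_ids, E):
    """Look up one row of the embedding matrix E per token id."""
    return [E[i] for i in token_ids]

def dropout(vec, p, rng):
    """Elementwise mask m ~ Bernoulli(1 - p), applied as e * m."""
    return [x if rng.random() >= p else 0.0 for x in vec]

rng = random.Random(0)
vocab_size, d = 6, 4  # toy sizes; the paper uses d = 200
E = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(vocab_size)]

embedded = embed([2, 5, 3], E)                        # one vector per token
dropped = [dropout(e, p=0.3, rng=rng) for e in embedded]
```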

Next, a Convolutional Neural Network (CNN) layer applies one-dimensional convolutional (1DCNN) filters to extract n-gram patterns from the text. These patterns capture localized features, such as key phrases and repetitive radical terms. The CNN component implements three convolutional blocks with decreasing filter counts and increasing kernel sizes, tuned through extensive experimentation to capture local textual patterns at varying granularities and n-gram lengths: the first block uses 128 filters with kernel size 3, the second 64 filters with kernel size 4, and the third 32 filters with kernel size 5. Each convolutional layer incorporates dilated convolutions with rates [1, 2, 4] to expand the receptive field without increasing computational complexity, enabling the capture of broader contextual patterns. Residual connections between convolutional layers facilitate gradient flow during training and enable the learning of complementary features at different abstraction levels. The Rectified Linear Unit (ReLU) activation function introduces non-linearity, enhancing the model’s ability to capture complex patterns, and max pooling down-samples the feature maps to focus on the most relevant features.
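To make the n-gram extraction step concrete, the following sketch applies a single filter of kernel size 3 to a toy sequence of two-dimensional embeddings, followed by ReLU and max pooling over the whole feature map. The input values and filter weights are hypothetical, and dilation and residual connections are omitted for brevity.

```python
def conv1d_relu(seq, kernel, bias=0.0):
    """Valid 1D convolution of one filter over a sequence of vectors, then ReLU.

    seq: list of T vectors (length d); kernel: list of k vectors (length d).
    """
    k = len(kernel)
    out = []
    for start in range(len(seq) - k + 1):
        s = bias
        for j in range(k):
            s += sum(x * w for x, w in zip(seq[start + j], kernel[j]))
        out.append(max(0.0, s))  # ReLU
    return out

def max_pool(feature_map):
    """Keep only the strongest response of the filter."""
    return max(feature_map)

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, -1.0]]
kernel = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # kernel size 3 (a 3-gram filter)
fmap = conv1d_relu(seq, kernel)   # one response per 3-token window
pooled = max_pool(fmap)
```

Each entry of the feature map corresponds to one window of three consecutive tokens, which is why a kernel of size k is often described as a k-gram detector.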

Algorithm 1 processes tweets as a matrix $X \in \mathbb{R}^{n \times d}$, where n is the number of tweets and d is the dimensionality of feature embeddings, aimed at binary classification into “Extremist” or “Non-Extremist”. The tweets are transformed into a semantic vector space W via Word2Vec or GloVe, tailored for Arabic lexical structures. A 1DCNN applied to W with filters k and bias b extracts localized features essential for understanding Arabic syntactic patterns, enhanced by ReLU activation. Then, the feature maps F from the convolutional layer are condensed through max pooling into a vector P, focusing on the most relevant textual features while preventing overfitting. P is processed by a Bidirectional Long Short-Term Memory (Bi-LSTM) layer, engineered specifically for processing Arabic sequences, which uses a two-layer architecture with [256, 128] units, respectively. This bidirectional processing captures both forward and backward contextual dependencies, crucial for understanding Arabic’s complex grammatical structures and context-dependent meanings. The integration of a multi-head attention mechanism, implementing 8 attention heads with 64-dimensional key spaces, enables the model to dynamically focus on relevant parts of the input sequence while maintaining awareness of global context. The attention mechanism is particularly effective in identifying subtle indicators of radicalization that may be distributed across different parts of the text.

Algorithm 1. Advanced Extremism Classification Algorithm for Arabic Tweets

Input: X ∈ ℝ^{n×d}
Output: Class label y // Extremist or Non-Extremist
1: W ← Embed(X) // Word vector transformation
2: F ← α(W * k + b) // Apply 1DCNN
3: P ← MaxPool(F) // Dimensionality reduction
4: S ← BiLSTM(P) // Bidirectional LSTM processing
5: z ← concat(b_CNN, h_BiLSTM) // Concatenate CNN and LSTM features
6: y ← σ(W_d z + b_d) // Dense layer with activation
7: p ← softmax(y) // Compute class probabilities
8: if p_Extremist > p_Non-Extremist then
     y ← Extremist
   else
     y ← Non-Extremist
   end if

The outputs from the CNN and Bi-LSTM branches are concatenated and processed through dense layers, as shown in Algorithm 1. The concatenated feature vector z is transformed into classification scores y via a dense layer with weights W_d, bias b_d, and sigmoid activation function (σ). The scores are normalized to probabilities p through softmax, determining the likelihood of each category. The final decision is the category with the highest probability, giving a binary classification as either “Extremist” or “Non-Extremist.” This design integrates the CNN for detecting patterns and the Bi-LSTM for processing sequential data, effectively classifying complex Arabic tweets on the platform X/Twitter.

As mentioned above, our architecture incorporates a 1DCNN as a critical component for n-gram pattern extraction from Arabic text. This design choice was made after careful consideration of several alternatives, including Maximal Frequent Sequence (MFS) techniques. Our experiments showed that 1DCNNs achieved comparable or better performance with approximately 40% less training time and 65% fewer parameters than equivalent MFS-based approaches. In terms of F1-score, the 1DCNN approach reached 94.2%, the MFS-based approach 91.8%, and a hybrid approach (MFS features fed to the CNN) 93.5%. MFS techniques also faced significant scalability issues with our dataset size. The computational complexity of MFS extraction grows exponentially with sequence length, making it prohibitively expensive for longer texts. Our experiments showed that MFS extraction required 3.8× more preprocessing time than the CNN approach. MFS techniques struggled with dialectal variations in Arabic social media text, as the same semantic concept might be expressed with different word sequences across dialects. 1DCNNs showed greater robustness to these variations, particularly when combined with our preprocessing pipeline. Finally, MFS techniques are limited to sequences present in the training data, while 1DCNNs can generalize to unseen patterns that share structural similarities with training examples. This is particularly important for evolving extremist content that deliberately alters terminology to evade detection.

3.3.1. BiLSTM Layer

As shown in Figure 3, following the CNN layer, a BiLSTM layer is utilized to capture long-range dependencies and contextual information within the text. The BiLSTM analyzes the feature maps obtained from CNN by processing them in both forward and backward directions. This approach allows the model to integrate context from both preceding and subsequent positions within the text. The BiLSTM layer is configured with 128 units in each direction, resulting in a total of 256 units. A dropout rate of 0.3 is implemented to mitigate the risk of overfitting. The configuration was established following comprehensive hyperparameter optimization conducted on the validation set.

3.3.2. Attention Mechanism

We implement a custom attention mechanism that allows the model to focus on the most relevant parts of the text for radicalization detection. The attention weights are computed as follows:

a_{i} = \mathrm{softmax}(v^{T} \tanh(W_{h} h_{i} + b_{h}))

(1)

where h_i represents the hidden state at position i, W_h and b_h are learnable parameters, and v is the attention vector. The context vector c is then computed as a weighted sum of the hidden states, $c = \sum_{i} a_{i} h_{i}$. This attention mechanism significantly improves both performance and interpretability, allowing us to visualize which parts of the text most strongly influenced the classification decision.
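The additive attention of Equation (1) can be sketched in plain Python as follows. The hidden states and parameter values are toy assumptions; in the model, W_h, b_h, and v are learned and the hidden states come from the BiLSTM.

```python
import math

def attention(H, W_h, b_h, v):
    """Additive attention over hidden states H (a list of T vectors).

    Returns the attention weights a and the context vector c.
    """
    scores = []
    for h in H:
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
             for row, b in zip(W_h, b_h)]
        scores.append(sum(vi * ui for vi, ui in zip(v, u)))
    m = max(scores)                         # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    a = [e / total for e in exps]           # softmax over positions
    d = len(H[0])
    c = [sum(a[i] * H[i][j] for i in range(len(H))) for j in range(d)]
    return a, c

H = [[0.1, 0.2], [0.9, -0.3], [0.4, 0.4]]   # toy hidden states
W_h = [[1.0, 0.0], [0.0, 1.0]]              # toy parameters (assumed values)
b_h = [0.0, 0.0]
v = [1.0, 1.0]
weights, context = attention(H, W_h, b_h, v)
```

Because the weights sum to one, they can be read directly as the per-token importance scores used for visualization.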

3.3.3. Transformer Integration

RADAR# integrates the CNN-BiLSTM component alongside AraBERT, a transformer-based model that has been specifically pre-trained on Arabic text. AraBERT delivers contextual word embeddings that effectively represent semantic relationships throughout the text, enhancing the local pattern recognition functionalities of the CNN-BiLSTM component. AraBERT is fine-tuned for the radicalization detection task with a learning rate set at 2 × 10⁻⁵ and a maximum sequence length configured to 128 tokens. A classification head is implemented by incorporating a dropout layer with a rate of 0.1, followed by a dense layer, and positioned above the [CLS] token representation to generate the final prediction.

The ensemble component integrates the predictions from the CNN-BiLSTM and AraBERT models through a weighted average methodology. The weights were established via optimization on the validation set, yielding a distribution of 40% for CNN-BiLSTM and 60% for AraBERT. The weighting indicates the comparative strengths of each model, with AraBERT demonstrating superior capability in capturing intricate semantic relationships, while the CNN-BiLSTM model is proficient in recognizing local patterns and specific phrasal indicators associated with radicalization. The calculation of the ensemble prediction is as follows:

P_{ensemble} = w_{1} P_{CNN-BiLSTM} + w_{2} P_{AraBERT}

(2)

where $P_{CNN-BiLSTM}$ and $P_{AraBERT}$ represent the probability outputs generated by their respective models, while $w_{1}$ and $w_{2}$ denote the associated weights, whose selection is described in the following section.

3.3.4. Example

We aim to demonstrate how RADAR# effectively identifies extremist content by leveraging both local patterns through CNN-BiLSTM and contextual understanding through AraBERT, with the attention mechanisms highlighting the most relevant terms for classification.

Original Tweet (Arabic): “يجب علينا جميعا الجهاد ضد الكفار والمرتدين ونشر الخلافة في كل مكان #الدولة_الإسلامية”, English Translation: “We must all wage jihad against the infidels and apostates and spread the caliphate everywhere #Islamic_State”.

Step 1: Preprocessing. Text cleaning removes the hashtag: “يجب علينا جميعا الجهاد ضد الكفار والمرتدين ونشر الخلافة في كل مكان”. Normalization standardizes characters: “يجب علينا جميعا الجهاد ضد الكفار والمرتدين ونشر الخلافه في كل مكان”. Segmentation with Farasa yields: “يجب علي + نا جميع + ا ال + جهاد ضد ال + كفار و + ال + مرتد + ين و + نشر ال + خلاف + ه في كل مكان”. Stopword removal gives: “يجب جميع جهاد ضد كفار مرتد نشر خلاف مكان”. Finally, lemmatization with the modified light stemming produces: “وجب جمع جهد ضد كفر ردد نشر خلف مكن”.

Step 2: Feature Extraction (CNN-BiLSTM path). Word embedding produces a matrix of shape (sequence_length, embedding_dim). The 1DCNN filters then extract n-gram patterns: the 3-gram filter captures “وجب جمع جهد”, “جمع جهد ضد”, etc.; the 4-gram filter captures “وجب جمع جهد ضد”, etc.; and the 5-gram filter captures longer patterns. Max pooling extracts the most salient features, and the BiLSTM processes the sequence in both directions.

Step 3: Attention and Classification:

CNN-BiLSTM + Attention: Attention weights: {0.15: جهد, 0.22: خلف, 0.28: ردد, 0.31: كفر, …}. The context vector emphasizes key extremist concepts such as jihad, infidels, apostates, and caliphate.

AraBERT Classification: The [CLS] token representation captures the semantic meaning of the entire tweet. The classification head then outputs a probability of 0.94.

Step 4: Ensemble Integration, Weighted Combination: CNN-BiLSTM probability: 0.89, AraBERT probability: 0.94, Ensemble: 0.4 × 0.89 + 0.6 × 0.94 = 0.92, Final Classification: Probability 0.92 > threshold 0.5, and Classification: Extremist content (1).
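The arithmetic of Step 4 can be reproduced with a short sketch. The function names are illustrative; the weights, probabilities, and threshold are the values quoted in the example.

```python
def ensemble_probability(p_cnn_bilstm, p_arabert, w1=0.4, w2=0.6):
    """Weighted average of the two model probabilities."""
    return w1 * p_cnn_bilstm + w2 * p_arabert

def classify(p, threshold=0.5):
    """Return 1 (extremist) if the probability reaches the threshold, else 0."""
    return 1 if p >= threshold else 0

p = ensemble_probability(0.89, 0.94)  # 0.4 * 0.89 + 0.6 * 0.94, about 0.92
label = classify(p)
```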

4. Hyperparameter Optimization and Sensitivity Analysis

Our methodology implements a sophisticated multi-stage optimization process that combines Bayesian optimization with structured experimental validation. The optimization framework addresses three crucial aspects: architectural parameters, training dynamics, and regularization strategies. Unlike previous approaches that relied on standard grid search or random sampling, we implement a hierarchical optimization strategy that accounts for parameter interdependencies and their impact on model performance. The goal was to identify the configuration that maximized accuracy, minimized overfitting, and ensured generalization across diverse datasets.

4.1. Cross-Validation Strategy

We employed a stratified 5-fold cross-validation approach to ensure robust hyperparameter selection. This methodology maintains class distribution across folds, which is particularly important for our imbalanced dataset where radicalized content represents approximately 18% of samples. For each hyperparameter configuration, we calculated the mean and standard deviation of performance metrics across all folds, prioritizing configurations with both a high mean performance and low variance.
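A minimal sketch of stratified fold assignment and the mean/standard-deviation summary is given below. It is a standard-library illustration rather than the tooling used in the experiments; the helper names and the toy label distribution (18 positives out of 100) are assumptions.

```python
import random
import statistics
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)  # round-robin keeps class ratios even
    return folds

def summarize(fold_scores):
    """Mean and sample standard deviation of a metric across folds."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)

labels = [1] * 18 + [0] * 82  # toy set with roughly 18% positive samples
folds = stratified_folds(labels)
```

Configurations can then be ranked by a high mean and a low standard deviation of the per-fold metric, as described above.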

4.2. Parameter Analysis

Following Alemu et al. (2018) [56], we used a trial-and-error approach, testing neural network architectures with two to four hidden layers (L) to find the one best suited to detecting radicalization in Arabic tweets. The optimal number of layers L* is determined by minimizing the validation loss $\mathcal{L}$, given the number of epochs E and the batch size B, as follows:

L^{*} = \arg\min_{L} \mathcal{L}(L, E, B),

(3)

For the number of L, we experimented with architectures containing two, three, and four layers. We evaluated the impact of each configuration on the model’s performance, considering factors such as accuracy, convergence speed, and computational efficiency. Through empirical evaluation, we determined that a four-layer architecture provided the best balance between model complexity and performance.

The number of epochs is a crucial hyperparameter that governs the extent of training and can significantly impact the model’s generalization ability. Too few epochs may result in an undertrained model that fails to capture the underlying patterns in the data, while too many epochs can lead to overfitting, where the model memorizes the training data, including noise and irrelevant patterns, hindering its performance on unseen data. The optimal number of epochs E∗ is found by balancing the trade-off between underfitting and overfitting, as follows:

E^{*} = \arg\min_{E} [\mathcal{L}(L, E, B) - \mathcal{L}_{val}(L, E, B)]

(4)

Patterson and Gibson (2017) [57] emphasized that it is crucial to strike the right balance between adequately training the model and avoiding overfitting. The optimal batch size B^∗ is chosen based on computational efficiency and model performance, as follows:

B^{*} = \arg\min_{B} \mathcal{L}(L, E, B),

(5)

We tried batch sizes of 8, 16, 24, and 32 to find the configuration that provided the optimal trade-off between computational efficiency and model performance for the radicalization detection task on Arabic tweets. For each batch size, we evaluated its impact on training stability and convergence efficiency. A batch size of B = 24 gave the best balance between computational efficiency and model performance.

Dropout rates play a crucial role in regularizing the model and preventing overfitting. We investigated dropout rates ranging from 0.1 to 0.5, assessing their effect on the model’s generalization ability. The dropout layer randomly deactivates a fraction of neurons during training, reducing the risk of overfitting by preventing the model from relying too heavily on specific features or patterns. The dropout rate, denoted p, controls the probability of deactivating neurons. The optimal dropout rate, p∗, is determined by minimizing the validation loss function $\mathcal{L}_{val}(p)$, as follows:

p^{*} = \arg\min_{p} \mathcal{L}_{val}(p)

(6)

For instance, for the Arabic tweets in Table 2, we may want to apply a higher dropout rate, e.g., 0.4 or 0.5, to prevent the model from fitting too closely to specific words or phrases that refer to ‘radicalization’, such as “rejectors”, “Magi”, “traitors agents”, “infidels”, or “Dogs of the Jews”. By randomly deactivating some neurons during training, we ensure that the model does not overfit the specific representation of individual words and instead learns their general meaning. For our embedding layer, we tuned the dropout rate from 0.1 to 0.5. The most suitable configuration for Arabic radicalization text detection was a dropout rate of 0.3, which gave the best trade-off between limiting overfitting and retaining important information in the network’s layers.

We examined two of the best-known optimization schemes for deep neural network training: ADAM (Adaptive Moment Estimation) and RMSprop. RMSprop addresses the problem of per-parameter learning rates diminishing too rapidly by replacing the running sum of squared gradients with an exponentially smoothed average, so that each rate is scaled according to the parameter’s recent gradient history. ADAM is a stochastic gradient descent (SGD)-based method for training neural networks that aims to combine the best ideas of AdaGrad and RMSprop. It computes a separate learning rate for every parameter from estimates of the first and second moments of that parameter’s gradients. Consequently, we selected and implemented these two optimizers and excluded the others from our final model. The update rule for RMSprop is as follows:

\theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{E[g^{2}]_{t} + \epsilon}} g_{t}

(7)

where $\theta$ are the parameters, $\eta$ is the learning rate, $g_{t}$ is the gradient at time t, $E[g^{2}]_{t}$ is the exponentially decaying average of squared gradients, and $\epsilon$ is a small constant for numerical stability. A learning rate of η = 0.003 with a cosine decay schedule provided the ideal balance between fast convergence and preventing the overshooting of the global minima. Adam dynamically adjusts the learning rate for each parameter according to the gradient update history, maintaining running averages of the mean and variance of the gradient values, and it includes bias correction for these averages. Adam updates the parameters as follows:

\theta_{t+1} = \theta_{t} - \eta_{t} \frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + \epsilon}

(8)

where $\hat{m}_{t}$ and $\hat{v}_{t}$ are bias-corrected estimates of the first and second moments of the gradients, respectively. For Tweet 2 in Table 2, which contains complex language and potentially subtle signals of radicalization, we may find that the Adam optimizer performs better than RMSprop. Adam’s per-parameter adaptive learning rate and its bias correction mechanism could lead to faster convergence and more reliable optimization, improving the model’s ability to accurately classify this type of radical content.
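For reference, the standard RMSprop and Adam updates can be written as single-parameter steps in plain Python. The decay constants (rho, b1, b2) and epsilon are the commonly used defaults and are assumptions here, as is the toy objective f(θ) = θ²; only the learning rate of 0.003 is taken from the text.

```python
import math

def rmsprop_step(theta, g, avg_sq, lr=0.003, rho=0.9, eps=1e-8):
    """One RMSprop update for a scalar parameter."""
    avg_sq = rho * avg_sq + (1 - rho) * g * g       # running average of g^2
    theta = theta - lr * g / math.sqrt(avg_sq + eps)
    return theta, avg_sq

def adam_step(theta, g, m, v, t, lr=0.003, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter (t starts at 1)."""
    m = b1 * m + (1 - b1) * g                       # first-moment estimate
    v = b2 * v + (1 - b2) * g * g                   # second-moment estimate
    m_hat = m / (1 - b1 ** t)                       # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective f(theta) = theta^2 with gradient 2 * theta.
theta_r, avg = 1.0, 0.0
theta_a, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    theta_r, avg = rmsprop_step(theta_r, 2 * theta_r, avg)
    theta_a, m, v = adam_step(theta_a, 2 * theta_a, m, v, t)
```

With bias correction, the first Adam step has magnitude close to the learning rate regardless of the gradient scale, which is one reason it tends to behave predictably early in training.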

The learning rate, $\eta$, is a crucial hyperparameter that controls the optimization algorithm’s step size during training and directly impacts the magnitude of the adjustments made to the neural network’s parameter vector as it moves through the loss landscape. A higher $\eta$ can lead to larger steps, allowing the model to navigate steep gradients and converge more quickly when the error is large. However, as the error decreases and the gradient becomes flatter, a smaller learning rate is desirable to prevent overshooting the optimal solution. To find the optimal $\eta$ for our model, we conducted experiments with a range of values, including 0.1, 0.01, 0.001, 0.2, 0.02, 0.002, 0.3, 0.03, and 0.003, and evaluated the performance of the model across this range. The optimal $\eta$ provides the best convergence, as follows:

\eta^{*} = \arg\min_{\eta} \mathcal{L}_{train}^{(\eta)}

(9)

where $\mathcal{L}_{train}$ is the training loss. The adaptive decay schedule ensured smaller updates as training progressed, enhancing stability during fine-tuning. The training duration (40 epochs) had an early stopping patience of 10 epochs, ensuring that the model did not overfit while fully leveraging the training data. Early stopping monitored validation loss to halt training when performance ceased improving.
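The decay schedule and the early stopping rule can be sketched as follows. The cosine form, the helper names, and the validation-loss curve are illustrative assumptions; the 40-epoch budget, the patience of 10, and the base rate of 0.003 are the values reported in the text.

```python
import math

def cosine_decay(step, total_steps, lr0=0.003):
    """Cosine decay from lr0 at step 0 down to 0 at total_steps."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * lr0 * (1 + math.cos(math.pi * progress))

def early_stopping_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training stops."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0       # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)

# Hypothetical curve: improves for 15 epochs, then plateaus at a worse value.
losses = [1.0 / e for e in range(1, 16)] + [0.2] * 25
stop = early_stopping_epoch(losses)    # stops 10 epochs after the best epoch
```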

The kernel size (k) in convolutional layers affects the receptive field and ability of the network to extract features from the input data. We conducted experiments with kernel sizes of 4, 6, and 8 and assessed the model’s performance with various kernel configurations.

Our proposed CNN model effectively detects the presence of radicalization in Arabic text data. Keeping η small (0.001 or 0.003) helps the optimizer avoid getting stuck in a complex loss landscape, which would lead to misclassifying sensitive content. The generalization capacity of the model was evaluated on a separate test set with test size ratios of 20%, 25%, and 30%. A higher test data proportion, such as 30%, was used to assess the effectiveness of the model in analyzing Arabic tweets for radicalization, as the language and idioms used in radical material may show high diversity and reliance on context.

4.3. Parameter Ranges and Grid Search

Table 3 presents the comprehensive parameter space explored during our optimization process. For each parameter, we provide the range of values tested and the optimal value determined through our experiments.

Our analysis revealed significant interdependencies between several key parameters. The most notable interactions are as follows. First, batch size and learning rate: smaller batch sizes (8–16) performed optimally with higher learning rates (0.01–0.03), while larger batches (24–32) required lower learning rates (0.001–0.003) to achieve stable convergence. Second, dropout and model size: higher dropout rates (0.4–0.5) were beneficial for larger models (256 BiLSTM units), while more moderate dropout (0.2–0.3) was optimal for smaller architectures. Third, optimizer and learning rate schedule: the Adam optimizer showed superior performance with cosine decay learning rate schedules, while RMSprop performed better with step decay schedules. This interdependency analysis guided our final parameter selection, ensuring that our chosen configuration represents a globally optimal solution rather than locally optimal individual parameters.

4.4. Training and Attention Mechanism

As mentioned above, data preparation for the Arabic language is carefully constructed to handle the intricacies of right-to-left writing, diacritical markings, and dialect variances. We utilized re (Python’s Regular Expressions library) to remove diacritics and unify character variants, streamlining the text input. A custom attention mechanism is applied to the Bi-LSTM outputs, allowing the model to focus on the most relevant parts of the text for classification. The attention layer computes a weighted sum of the Bi-LSTM hidden states, where the weights are learned during training. This mechanism is defined as follows:

\begin{array}{l} e_{t} = \tanh(W_{h} h_{t} + b_{h}) \\ \alpha_{t} = \mathrm{softmax}(e_{t}) \\ c = \sum_{t} \alpha_{t} h_{t} \end{array}

(10)

where h_t is the hidden state at time t, W_h and b_h are learnable parameters, α_t is the attention weight for time t, and c is the context vector. The pipeline uses nltk.tokenize.word_tokenize, specifically configured for Arabic, to segment normalized text into tokens. It employs gensim.models.KeyedVectors to convert tokens to numerical vectors using pretrained Arabic word embeddings, and keras.preprocessing.sequence.pad_sequences pads sequences to a fixed length so that the input data are uniform.

We used pretrained Arabic word embeddings to transform tokens into high-dimensional vectors and capture semantic relationships. The embeddings can be obtained from well-known resources such as FastText or AraBERT and are used through tensorflow.keras.layers.Embedding, providing a strong linguistic foundation. The subsequent layers are configured to detect local text features and long-range dependencies: tensorflow.keras.layers.Conv1D is used for feature extraction, and tensorflow.keras.layers.Bidirectional wrapping an LSTM is used to capture dependencies [58]. The bidirectional LSTM layer captures both forward and reverse context, which is critical for gaining a full understanding of the intent of the content. RADAR# also has a custom attention layer, built on tensorflow.keras.layers.Layer and optimized to prioritize contextual relevance in Arabic. It dynamically weights the importance of different segments of the text, enhancing the model’s focus on the elements most indicative of radical content.

The L2 regularization method (tensorflow.keras.regularizers.L2) was used to reduce the effect of noise and make the Arabic language representation more stable as part of the data enhancement methods. This helps the model generalize to data on which it was not trained. It is therefore imperative to monitor the fluctuations in the market due to the difficulties associated with the translation of the Arabic terms. To illustrate, we employed data augmentation techniques and training sequence optimization to enhance the training outcomes. Furthermore, training callbacks were used: tensorflow.keras.callbacks.ModelCheckpoint saves the model according to its monitored performance, and tensorflow.keras.callbacks.LearningRateScheduler regulates the learning rate of the model, which in turn affects the performance of the deep learning network. It is recommended that the more experienced users of Arabic text recognition software maintain a high level of accuracy and a high degree of interpretability.

The transformer branch leverages a pre-trained Arabic language model to capture contextual relationships and semantic meaning. The AraBERT model was pre-trained on a large corpus of Modern Standard Arabic text, including news articles and Wikipedia pages. AraBERT excels at processing formal Arabic and has strong performance on standard NLP tasks. We use the base version with 12 transformer layers, 768 hidden units, and 12 attention heads. For this model, we implement a fine-tuning approach in which the pre-trained weights are used as initialization, all layers are fine-tuned on our dataset, and a classification head is added on top of the [CLS] token representation; the classification head consists of a dropout layer (rate = 0.1) followed by a dense layer with sigmoid activation.

The transformer model processes the input text differently from the CNN-Bi-LSTM branch, using subword tokenization and special tokens ([CLS], [SEP]) according to the BERT architecture. This provides a complementary representation that captures different aspects of the text. The Hugging Face Transformers library (version 4.12.0) provides state-of-the-art pre-trained models and a unified API for fine-tuning. For Arabic NLP tooling, Farasa (version 0.3.2) was used for Arabic segmentation, PyArabic (version 0.6.10) for basic Arabic text processing, and CAMeL Tools (version 1.0.0) for morphological analysis.

4.5. Sensitivity Analysis

To understand the robustness of our model to hyperparameter variations, we conducted sensitivity analysis by perturbing each optimal parameter value by ±10% and ±20%. Figure 4 visualizes the impact of these perturbations on model performance. The most sensitive parameters were learning rate and ensemble weights, where even small deviations (±10%) resulted in performance drops of 1.5–2.0%. In contrast, the model showed remarkable stability to variations in batch size and CNN kernel size, with performance changes of less than 0.5% for ±20% perturbations. This sensitivity analysis provides valuable guidance for model deployment and maintenance, highlighting which parameters require the most careful tuning and monitoring in production environments.
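The perturbation procedure described above can be sketched as a one-at-a-time grid: each hyperparameter is scaled by ±10% and ±20% while the others stay at their optimal values, and the drop relative to the baseline score is recorded. The hyperparameter names, their values, and the `toy_evaluate` function below are hypothetical placeholders; in practice `evaluate` would train and validate the model for the given configuration.

```python
def perturb(value, fraction):
    """Scale a value by (1 + fraction); integer parameters are rounded."""
    scaled = value * (1.0 + fraction)
    return int(round(scaled)) if isinstance(value, int) else scaled


def perturbation_grid(optimal, fractions=(-0.2, -0.1, 0.1, 0.2)):
    """Yield (name, fraction, config) with exactly one parameter perturbed."""
    for name, value in optimal.items():
        for fraction in fractions:
            config = dict(optimal)
            config[name] = perturb(value, fraction)
            yield name, fraction, config


def sensitivity(optimal, evaluate, fractions=(-0.2, -0.1, 0.1, 0.2)):
    """Return the performance drop (baseline minus perturbed score) per perturbation."""
    baseline = evaluate(optimal)
    return {
        (name, fraction): baseline - evaluate(config)
        for name, fraction, config in perturbation_grid(optimal, fractions)
    }


# Hypothetical optimum and a toy scoring function that depends only on the learning rate.
optimal = {"learning_rate": 1e-3, "batch_size": 32}


def toy_evaluate(config):
    relative_error = abs(config["learning_rate"] - 1e-3) / 1e-3
    return 0.97 - 0.1 * relative_error


drops = sensitivity(optimal, toy_evaluate)
```

Parameters whose entries in `drops` stay near zero across all four perturbations are the stable ones; large entries flag the parameters that need careful tuning.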

4.6. Ensemble Integration

The ensemble component combines the predictions of the three models using a weighted average approach, as illustrated in Figure 5. The weights were determined by optimization on the validation set, which yielded a weight of 40% for CNN-Bi-LSTM and 30% for AraBERT. This weighting is proportional to the relative contribution of each model towards the performance. The CNN-Bi-LSTM model is assigned the highest weight since it performs well in modeling local patterns and is computationally light. Both transformer models are assigned equal weights, and their combined contribution is assigned a marginally higher weight than the CNN-Bi-LSTM model. The ensemble prediction is computed according to Equation (11), as follows:

$$P_{\text{ensemble}} = 0.4 \cdot P_{\text{CNN-Bi-LSTM}} + 0.3 \cdot P_{\text{AraBERT}} \tag{11}$$

where P_CNN-Bi-LSTM and P_AraBERT are the probability values from the respective models. The final classification is derived by thresholding the ensemble prediction, with a default threshold of 0.5, as follows:

$$y_{\text{pred}} = \begin{cases} 1 & \text{if } P_{\text{ensemble}} \geq 0.5 \\ 0 & \text{otherwise} \end{cases} \tag{12}$$

This ensemble approach enhances robustness by taking advantage of the complementarity of different model types. The CNN-Bi-LSTM branch excels at capturing local patterns and sequential dependencies, while AraBERT excels at Modern Standard Arabic.
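Equations (11) and (12) can be sketched in a few lines of Python. The weights are those stated in the text; the model probabilities are made-up inputs. Note that the two stated weights sum to 0.7, so the raw weighted sum cannot exceed 0.7. The optional `normalize` flag, which rescales by the sum of the weights, is an assumption added here for illustration and is not described in the text.

```python
def ensemble_probability(probs, weights, normalize=False):
    """Weighted sum of per-model probabilities, as in Equation (11)."""
    total = sum(weights[name] * probs[name] for name in weights)
    if normalize:
        # Assumption for illustration: rescale so the weights act as a convex combination.
        total /= sum(weights.values())
    return total


def classify(p_ensemble, threshold=0.5):
    """Threshold the ensemble score, as in Equation (12)."""
    return 1 if p_ensemble >= threshold else 0


weights = {"cnn_bilstm": 0.4, "arabert": 0.3}   # weights stated in the text
probs = {"cnn_bilstm": 0.9, "arabert": 0.8}     # made-up model outputs

p_raw = ensemble_probability(probs, weights)
p_norm = ensemble_probability(probs, weights, normalize=True)
label = classify(p_raw)
```

Passing the weights as a dictionary keeps the function usable if further models are added to the ensemble.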

4.7. Evaluation Model

The evaluation model specifies the criteria and indicators used to determine the effectiveness and resilience of the neural network architectures developed for identifying extremist material in Arabic tweets. The assessment starts with a preliminary performance study using two essential indicators, accuracy and loss. These metrics provide a benchmark for comparing different configurations within the proposed network designs. Once the best configuration is identified, a fuller set of measures, including accuracy, the F1 score, and the area under the Receiver Operating Characteristic curve (ROC-AUC), is used to evaluate and verify performance across models.

Accuracy is the percentage of all predictions that are correct, in this context across both the extremist and non-extremist classes. The F1 score is the harmonic mean of precision and recall, and balances the trade-off between them. Precision is the ratio of correctly predicted positive instances (extremist content) to all positive predictions. Recall is the proportion of actual positives that are correctly identified. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various classification thresholds, and the AUC summarizes the model’s overall discriminatory power across all threshold settings. A higher AUC indicates a greater ability to distinguish extremist from non-extremist content. This metric gives an overall picture of the models’ predictive capabilities, accounting for the trade-off between sensitivity (recall) and specificity.
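The threshold-dependent metrics defined above follow directly from the confusion-matrix counts. The sketch below computes them for a small made-up set of labels and predictions (1 = extremist, 0 = non-extremist); the data are illustrative only and unrelated to the reported results.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 (harmonic mean of precision and recall)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # made-up ground truth
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]  # made-up predictions
metrics = classification_metrics(y_true, y_pred)
```

The zero-denominator guards return 0.0 when a class is never predicted or never present, which is one common convention rather than the only one.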

To enhance prediction reliability and minimize the likelihood of overfitting, stratified k-fold cross-validation was employed. The dataset is divided into k folds, and the model is trained and validated k times, each time using a different fold as the validation set and the remaining folds for training.

This means that each data point is used for both training and validation, giving a more robust estimate of model performance. Stratification ensures that the class distribution in each fold mirrors that of the original dataset, preserving the balance between extremist and non-extremist classes. Stratified sampling avoids biases in performance metrics, especially in datasets with significant class imbalances.
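A minimal sketch of the fold construction described above is given below. It deals the indices of each class round-robin across the folds so that every fold keeps the original class ratio. The value k = 5, the random seed, and the 20/80 class split are assumptions for illustration; the text does not state the value of k.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Return k (train_indices, val_indices) pairs whose class ratios mirror `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold receives its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    splits = []
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, val))
    return splits


labels = [1] * 20 + [0] * 80  # made-up imbalanced labels: 20% positive
splits = stratified_k_fold(labels, k=5)
```

Each validation fold here contains 4 positive and 16 negative examples, the same 20/80 ratio as the full label list, and every index appears in exactly one validation fold.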

5. Results and Discussion

5.1. Performance Metrics and Error Analysis

Our evaluation of RADAR# extends beyond traditional accuracy and F1 score metrics to provide a comprehensive understanding of model performance across different types of radical content. Table 4 presents a detailed matrix analysis that reveals nuanced insights into classification performance. Our error analysis reveals distinct patterns across different content categories. Explicit radical content (F1 = 98.3%) is detected with high reliability due to its use of direct terminology and clear indicators of extremism. In contrast, implicit radical content (F1 = 93.5%) presents greater challenges, as it often employs coded language, metaphors, and contextual references that require deeper semantic understanding.

The results show that the most significant error sources include the following: 32% of misclassifications occur in tweets using regional dialects, particularly North African variants that differ substantially from Modern Standard Arabic. Moreover, 28% of errors involve content where the extremist intent depends heavily on cultural or historical context not explicitly stated in the text. Next, 24% of misclassifications involve emerging terminology or coded expressions used to evade detection, and 16% of errors occur in tweets using sarcastic or ironic expressions that superficially resemble extremist content but convey different intentions. This detailed error analysis provides valuable insights for future model refinements, particularly the need for enhanced dialectal processing and contextual understanding capabilities.

Error analysis of misclassifications shows trends that provide direction for future improvement. The errors can be categorized into several types: Approximately 35% of false positives are sarcastic or ironic use of extremist language. For example, a tweet condemning ISIS might use their language in a satirical way, which is misinterpreted as support by the model. This reflects the challenge of interpreting pragmatic aspects of language. Around 25% of the errors contain terms that have both extremist and non-extremist uses based on context. Religious terms like “jihad” can be used to signify individual spiritual conflict in non-extremist usage but are widely co-opted by extremist groups. Approximately 20% of errors contain dialectal expressions or slang that are under-represented in the training data. This is particularly common for North African dialects (Maghrebi), which have strong Berber language and French influences. Approximately 15% of false negatives contain extremist content conveyed implicitly, i.e., metaphors, cultural allusions, or coded messages. These require meticulous cultural knowledge to identify correctly. The remaining 5% are complex narratives which require contextual understanding beyond the individual tweet, e.g., references to specific events or ideological frameworks.

These findings suggest some paths for ongoing improvement, as follows:

Incorporating pragmatic understanding to better handle sarcasm and irony;
Expanding training data to capture more dialectal variation;
Developing means to capture implicit content and coded meanings;
Investigating methods that can take into account more context than single tweets.

5.2. Comparative Analysis with Other Models

Figure 6 presents a comparison of RADAR# with baseline models (CNN, LSTM, Bi-LSTM, CNN-Bi-LSTM) and transformer models (AraBERT). The results demonstrate that RADAR# consistently outperforms all other approaches across all evaluation metrics. RADAR# achieves 98% accuracy and a 97% F1-score, representing improvements of 2% and 2%, respectively, over the best individual model. The ROC-AUC of 0.99 indicates near-perfect discrimination ability between extremist and non-extremist content. The ensemble combines models with different architectures and training objectives, allowing it to leverage their complementary strengths. The CNN-Bi-LSTM branch excels at capturing local patterns and sequential relationships, while the transformer models provide strong contextual understanding. The custom attention layer in the CNN-Bi-LSTM branch helps focus on the most relevant parts of the text, enhancing performance on longer tweets where important signals might be sparse. The specialized Arabic preprocessing pipeline addresses the unique challenges of social media text, providing cleaner and more informative inputs to the models. The performance comparison also reveals interesting patterns in the relative strengths of different architectures. Transformer models (AraBERT) outperform traditional deep learning approaches (CNN, LSTM, Bi-LSTM, CNN-Bi-LSTM), highlighting the effectiveness of self-attention mechanisms and pre-training on large corpora. However, the CNN-Bi-LSTM model still achieves strong performance (94% F1-score), demonstrating the value of combining convolutional and recurrent architectures.

To contextualize RADAR#’s performance within the current research landscape, we conducted a comprehensive comparison with seven recent state-of-the-art models published between 2022 and 2024. Table 5 presents this comparative analysis across multiple performance metrics and computational efficiency measures. RADAR# outperforms all compared models across all performance metrics, with particularly significant advantages in recall (+2.2% over the next best model) and the F1-score (+1.8% over MarBERT). This improvement is especially notable for implicit radical content and dialectal variations, where other models struggle most. Beyond raw performance metrics, our comparative analysis reveals several key advantages of RADAR#. While pure transformer models like AraBERT-v2 and MarBERT have larger parameter counts and longer inference times, RADAR# achieves superior performance with 30–32% fewer parameters and 43–49% faster inference. When evaluated specifically on dialectal content, RADAR# maintains a 94.8% F1-score compared to 89.3–92.1% for the other models, demonstrating superior handling of linguistic variations. Unlike black-box approaches, RADAR#’s attention mechanism provides interpretable insights into classification decisions, a critical advantage for security applications where understanding the rationale behind classifications is essential.

5.3. Ablation Studies

Ablation studies were conducted to evaluate the contribution of each component to the overall performance of RADAR#. These studies involved removing or modifying specific components and measuring the resulting impact on performance. Table 6 summarizes the results. Removing the CNN layers from the CNN-Bi-LSTM branch results in a 2% decrease in the F1-score, highlighting their importance in capturing local patterns and n-gram features. The attention mechanism contributes significantly to performance, with a 3% decrease in the F1-score when removed. This confirms the value of focusing on the most relevant parts of the text for classification. The transformer models make a substantial contribution, with a 4% decrease in the F1-score when both are removed. Removing AraBERT alone yields a smaller decrease (1%), likely because it is less effective at handling the dialectal variations that are common in social media. Both normalization and segmentation contribute meaningfully to performance, with decreases of 3% and 2% in the F1-score, respectively, when removed. This underscores the importance of specialized Arabic preprocessing for effective analysis. These ablation studies confirm that each component of RADAR# makes a meaningful contribution to its overall performance. The results also highlight the complementary nature of the different components, with the full ensemble achieving better results than any individual model or simplified configuration.

5.4. Attention Visualization

Figure 7 illustrates the attention weights assigned to different words in example tweets. This visualization provides insights into the model’s decision-making process and enhances interpretability. The visualization demonstrates that the model correctly focuses on words associated with extremist content. For example, in a tweet containing the phrase “داعش يعلن مسؤوليته عن الهجوم الإرهابي” (ISIS claims responsibility for the terrorist attack), the highest attention weights are assigned to “داعش” (ISIS), “الهجوم” (attack), and “الإرهابي” (terrorist), with weights of 0.18, 0.22, and 0.28, respectively. These words are indeed the most relevant for determining that the tweet contains extremist content.

In contrast, function words and less informative content receive lower attention weights. For example, “عن” (about) and “في” (in) receive weights of 0.03 and 0.04, respectively, indicating that the model correctly identifies them as less relevant for classification. This pattern is consistent across multiple examples, with the model attending to the following:

Names of terrorist organizations (e.g., داعش/ISIS, القاعدة/Al-Qaeda)
Words related to violence (e.g., قتل/kill, تفجير/bombing)
Religious terminology used in extremist contexts (e.g., جهاد/jihad, كفار/infidels)
Words expressing support or allegiance (e.g., مبايعة/pledge allegiance, نصرة/support)
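To make the notion of per-token attention weights concrete, the sketch below shows a generic softmax attention pooling step: raw relevance scores are normalized into weights that sum to one, the weights form a weighted sum of the token vectors, and the highest-weighted tokens are listed for inspection. This is a generic formulation assumed for illustration, not the paper's custom attention layer. The tokens come from the example sentence above, but the raw scores and two-dimensional hidden vectors are made up and do not reproduce the reported weights.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, scores):
    """Return attention weights and the weighted sum of the token vectors."""
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context


def top_tokens(tokens, weights, n=3):
    """Tokens with the highest attention weights, for inspection."""
    ranked = sorted(zip(tokens, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


tokens = ["داعش", "يعلن", "مسؤوليته", "عن", "الهجوم", "الإرهابي"]
scores = [2.0, 0.5, 0.8, 0.1, 2.2, 2.5]               # hypothetical raw scores
hidden_states = [[0.1, 0.3], [0.0, 0.2], [0.4, 0.1],
                 [0.2, 0.2], [0.5, 0.9], [0.7, 0.6]]  # toy 2-dimensional vectors

weights, context = attention_pool(hidden_states, scores)
highlighted = top_tokens(tokens, weights)
```

A list such as `highlighted` is the kind of output a reviewer would see in an attention heat map: the tokens the model weighted most heavily for a given tweet.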

The attention visualization not only confirms that the model is focusing on appropriate features but also provides a level of explainability that is crucial for applications in cybersecurity and content moderation. Human reviewers can understand why the model flagged specific content, enhancing trust and facilitating more effective human–AI collaboration. To illustrate the model’s interpretability, we present five representative case studies that trace the classification process from input features through to attention mechanisms and final decisions, as shown in Appendix A Enhanced Interpretability Analysis.

Figure 8 presents the ROC curves for different models, plotting the true positive rate against the false positive rate at various threshold settings. The area under the ROC curve (AUC) provides a threshold-independent measure of model performance, with higher values indicating better discrimination ability.

RADAR# achieves the highest AUC (0.99), followed by the transformer model (AraBERT: 0.97), the CNN-Bi-LSTM model (0.96), and the simpler architectures (Bi-LSTM: 0.95, LSTM: 0.91, CNN: 0.90). This ordering is consistent with the F1-score results, confirming the superior performance of the ensemble approach. The ROC curves also provide insights into the trade-off between the true positive rate (recall) and false positive rate at different threshold settings. RADAR# maintains a high true positive rate even at low false positive rates, indicating that it can detect most extremist content while minimizing false alarms. This is particularly important for practical applications, where false positives can lead to unnecessary content removal or investigation.

The curves also show that the performance gap between RADAR# and other models is most pronounced in the high-specificity region (low false positive rate), which is typically the operating region of interest for content moderation systems. At a false positive rate of 0.05, RADAR# achieves a true positive rate of 0.94, compared to 0.89 for AraBERT, and 0.85 for CNN-Bi-LSTM.
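The quantities discussed here (ROC points, AUC, and the true positive rate at a fixed false positive rate) can be computed from raw scores as sketched below. The labels and scores are made up for illustration and are unrelated to the curves in Figure 8; AUC is obtained with the trapezoidal rule, and `tpr_at_fpr` reports the best operating point whose false positive rate stays within a given budget.

```python
def roc_points(y_true, scores):
    """Sweep the threshold from high to low and return the (fpr, tpr) points."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all samples tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def tpr_at_fpr(points, max_fpr):
    """Highest TPR among operating points whose FPR does not exceed max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


y_true = [1, 1, 0, 1, 0, 0, 1, 0]                        # made-up labels
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.1]      # made-up model scores

points = roc_points(y_true, scores)
area = auc(points)
operating_tpr = tpr_at_fpr(points, max_fpr=0.25)
```

For a content moderation setting, `tpr_at_fpr` with a small `max_fpr` is the more relevant summary, since it reflects how much extremist content is caught while false alarms are kept under a fixed budget.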

6. Conclusions

This paper presented RADAR#, an ensemble-based approach for detecting radicalization in Arabic tweets. The model combines a hybrid CNN-Bi-LSTM architecture with a transformer model (AraBERT) through a weighted ensemble mechanism, achieving state-of-the-art performance with 98% accuracy and a 97% F1-score. The ensemble approach consistently outperforms individual models, demonstrating the value of combining complementary architectures for complex NLP tasks. The custom attention layer significantly enhances both performance and interpretability, focusing on words most relevant to radicalization detection. Specialized Arabic preprocessing techniques, particularly normalization and segmentation, contribute meaningfully to model performance. Misclassifications primarily involve sarcasm, ambiguous terminology, dialectal variations, and implicit content, highlighting areas for future improvement.

These findings provide practical tools for cybersecurity applications in counter-terrorism and content moderation, while the identified limitations and future research directions offer a clear roadmap for continued improvement. By addressing the specific challenges revealed through our ablation studies, particularly around dialectal variations, implicit content, and pragmatic understanding, future work can further enhance the effectiveness and ethical application of automated radicalization detection in Arabic social media content.

Our ablation studies revealed several critical insights that both validate our architectural choices and highlight areas for future improvement. The removal of CNN layers from the CNN-Bi-LSTM branch resulted in a 2% decrease in the F1-score, confirming their importance in capturing local patterns and n-gram features. This finding supports our architectural decision to use 1D CNNs for pattern extraction, as they effectively identify phrasal indicators of radicalization that might be missed by sequence-only models. The attention mechanism contributed significantly to performance, with a 3% decrease in the F1-score when removed. Beyond performance metrics, attention visualization revealed that the model appropriately focuses on ideologically charged terms and contextual indicators of radicalization, providing both improved accuracy and enhanced interpretability. The transformer models made a substantial contribution, with a 4% decrease in the F1-score when both were removed. AraBERT showed particular strength in handling complex semantic relationships and contextual understanding, while being less effective with dialectal variations (1% decrease when removed). This finding highlights the complementary nature of our ensemble approach, where the CNN-Bi-LSTM component helps address the transformer’s weaknesses in dialectal content. Our analysis of misclassifications revealed that approximately 35% of errors involved sarcasm or irony, 25% contained ambiguous terminology, 20% involved dialectal expressions, 15% contained implicit content, and 5% required broader contextual understanding. These patterns directly inform our future research directions.

Based on our findings and the identified limitations, several promising directions for future research emerge. One is to investigate how more efficient transformer architectures, such as distilled models or models with parameter sharing, could reduce computational requirements while maintaining performance. The ablation studies revealed that dialectal variations remain challenging, particularly for transformer models. Future work should focus on developing specialized preprocessing and embedding techniques for Arabic dialects, particularly those from North Africa, which showed higher error rates in our analysis. Dialect-specific fine-tuning of transformer models could potentially address the 20% of errors attributed to dialectal expressions. Enhancing the explainability of model decisions beyond attention visualization could improve trust and usability. Techniques such as LIME (Local Interpretable Model-agnostic Explanations) or SHAP (SHapley Additive exPlanations) could be explored. The 15% of errors involving implicit content point to the need for a more sophisticated semantic understanding. Incorporating cultural knowledge bases and exploring metaphor detection techniques could improve performance on content that relies on cultural allusions or coded language. Finally, developing approaches that explicitly address potential biases in training data and model predictions would enhance ethical deployment. This could involve fairness constraints during training or post-processing techniques to ensure equitable treatment across dialects and regions.

Author Contributions

Conceptualization, A.M.R.A.; methodology, A.M.R.A. and A.S.; software, A.M.R.A.; validation, A.M.R.A., E.M.A.-S. and S.O.; formal analysis, A.M.R.A.; investigation, S.O. and A.M.R.A.; resources, A.S.; data curation, S.O.; writing—original draft preparation, A.M.R.A. and S.O.; writing—review and editing, A.M.R.A. and A.S.; visualization, A.M.R.A. and E.M.A.-S.; supervision, A.M.R.A. and E.M.A.-S.; project administration, A.M.R.A. and E.M.A.-S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable. This study used only publicly available, anonymized datasets and did not involve direct interaction with human or animal subjects.

Informed Consent Statement

Not applicable.

Data Availability Statement

Saja Aldera shared her comprehensive Arabic dataset on extremism. Source code: https://github.com/aalosbeh/RADAR/tree/main (accessed on 16 June 2025).

Acknowledgments

We express our sincere gratitude to Yarmouk University and Southern Illinois University Carbondale (SIUC) for their invaluable support and collaboration throughout this research. We are deeply grateful to the Faculty of Information Technology and Computer Science at Yarmouk University and the Department of ITEC at SIUC for providing the necessary resources and fostering a conducive environment for this study. Special thanks go to Saja Aldera for generously sharing her comprehensive Arabic dataset on extremism, which formed the foundation of our analysis. All individuals acknowledged have consented to be named.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Enhanced Interpretability Analysis

The interpretability of model decisions is crucial for radicalization detection systems, particularly in security applications where understanding why content is flagged is as important as the classification itself. We have significantly enhanced our interpretability analysis through a visualization framework that connects attention weights to specific linguistic features. Our attention maps show the contribution of each word and phrase to the final classification decision. This visualization reveals how RADAR# identifies and weighs different indicators of radicalization, providing transparent insights into the model’s decision-making process. To illustrate the model’s interpretability, we present five representative case studies that trace the classification process from input features through attention mechanisms to final decisions, as follows:

In case study 1, the model correctly identifies explicit radical terminology, with highest attention weights assigned to direct calls for violence and extremist ideological terms.

In case study 2, the model identifies coded language and euphemisms commonly used in extremist discourse, demonstrating its ability to detect implicit radicalization indicators despite the absence of explicit violent terminology.

Case study 3 demonstrates RADAR#’s ability to process dialectal variations (Egyptian dialect) while correctly identifying potentially radical content through contextual understanding of phrases like “by all means” that imply violence.

In case study 4, an ambiguous case, the model correctly identifies political discourse rather than extremism, with attention to qualifying terms like “legitimate” that modify potentially concerning terms like “resistance.”

Case study 5, a false positive example, reveals a limitation in contextual understanding, where the model overweights the term “jihad” despite its non-violent spiritual context in this usage.

Our quantitative analysis of attention patterns across the dataset reveals a consistent focus on certain linguistic markers of radicalization. Figure 7 visualizes the distribution of attention weights across different semantic categories, showing that violent terminology (28.3%), dehumanizing language (24.7%), and in-group/out-group dichotomies (21.5%) receive the highest average attention weights. This enhanced interpretability analysis provides valuable insights for both a technical understanding and practical application of the model. By making the classification process transparent, RADAR# enables human reviewers to understand, validate, and if necessary, override model decisions, which is essential for responsible deployment in security contexts.

References

Aichner, T.; Grünfelder, M.; Maurer, O.; Jegeni, D. Twenty-five years of social media: A review of social media applications and definitions from 1994 to 2019. Cyberpsychol. Behav. Soc. Netw. 2021, 24, 215–222. [Google Scholar] [CrossRef] [PubMed]
Forgeard, V. Why the World Is a Global Village. Brilliantio. 2021. Available online: https://brilliantio.com/why-the-world-is-a-global-village/ (accessed on 7 May 2024).
Kaplan, A.M.; Haenlein, M. Users of the world, unite! The challenges and opportunities of Social Media. Bus. Horiz. 2010, 53, 59–68. [Google Scholar] [CrossRef]
Al-Maqableh, R.; Al-Sobeh, A.; Akkawi, A. Cultural Drivers of Radicalization. Available online: https://dradproject.com/?publications=cultural-drivers-of-radicalization-in-jordan-2 (accessed on 1 September 2024).
Alava, S.; Frau-Meigs, D.; Hassan, G. Youth and Violent Extremism on Social Media: Mapping the Research; UNESCO: Paris, France, 2017. [Google Scholar]
Alsobeh, A.; Shatnawi, A. Integrating Data-Driven Security, Model Checking, and Self-adaptation for IoT Systems Using BIP Components: A Conceptual Proposal Model. In Proceedings of the 2023 International Conference on Advances in Computing Research (ACR’23); Springer: Cham, Switzerland, 2023; p. 700. [Google Scholar] [CrossRef]
Gaikwad, M.; Ahirrao, S.; Phansalkar, S.; Kotecha, K. Online extremism detection: A systematic literature review. IEEE Access 2021, 9, 48364–48404. [Google Scholar] [CrossRef]
Smith, A. 23 Essential Twitter Statistics to Guide Your Strategy in 2023. Sprout Social. 2023. Available online: https://sproutsocial.com/insights/twitter-statistics/ (accessed on 1 June 2025).
Akram, W.; Kumar, R. A study on positive and negative effects of social media on society. Int. J. Comput. Sci. Eng. 2017, 5, 351–354. [Google Scholar] [CrossRef]
Aldera, S.; Emam, A.; Al-Qurishi, M.; Alrubaian, M.; Alothaim, A. Exploratory data analysis and classification of a new arabic online extremism dataset. IEEE Access 2021, 9, 161613–161626. [Google Scholar] [CrossRef]
Chen, H. Sentiment and affect analysis of dark web forums: Measuring radicalization on the internet. In Proceedings of the 2008 IEEE International Conference on Intelligence and Security Informatics, Taipei, Taiwan, 17–20 June 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 104–109. [Google Scholar]
Zhaksylyk, K.; Batyrkhan, O.; Shynar, M. Review of violent extremism detection techniques on social media. In Proceedings of the 2021 16th International Conference on Computer Science and Software Engineering (CSSE), Almaty, Kazakhstan, 29–31 October 2021. [Google Scholar]
Gupta, P.; Varshney, P.; Bhatia, M.P.S. Identifying Radical Social Media Posts Using Machine Learning; GitHub: San Francisco, CA, USA, 2017. [Google Scholar]
Kaur, A.; Saini, J.K.; Bansal, D. Detecting radical text over online media using deep learning. arXiv 2019, arXiv:1907.12368. [Google Scholar]
Mursi, K.T.; Alahmadi, M.D.; Alsubaei, F.S.; Alghamdi, A.S. Detecting Islamic radicalism Arabic tweets using natural language processing. IEEE Access 2022, 10, 72526–72534. [Google Scholar] [CrossRef]
Davis, J. Social Media In The International Encyclopedia of Political Communication; Wiley: Hoboken, NJ, USA, 2016. [Google Scholar] [CrossRef]
Rekik, A.; Jamoussi, S.; Hamadou, A.B. Violent vocabulary extraction methodology: Application to the radicalism detection on social media. In Computational Collective Intelligence, Proceedings of the 11th International Conference, ICCCI 2019, Hendaye, France, 4–6 September 2019; Part II; Springer: Berlin/Heidelberg, Germany, 2019; pp. 97–109. [Google Scholar]
Johnston, A.H.; Weiss, G.M. Identifying sunni extremist propaganda with deep learning. In Proceedings of the 2017 IEEE Symposium Series on Computational Intelligence (SSCI), Honolulu, HI, USA, 27 November–1 December 2017; pp. 1–6. [Google Scholar]
Trabelsi, Z.; Saidi, F.; Thangaraj, E.; Veni, T. A survey of extremism online content analysis and prediction techniques in Twitter based on sentiment analysis. Secur. J. 2023, 36, 221–248. [Google Scholar] [CrossRef]
Becker, K.; Harb, J.G.; Ebeling, R. Exploring deep learning for the analysis of emotional reactions to terrorist events on Twitter. J. Inf. Data Manag. 2019, 10, 97–115. [Google Scholar]
Ahmad, S.; Asghar, M.Z.; Alotaibi, F.M.; Awan, I. Detection and classification of social media-based extremist affiliations using sentiment analysis techniques. Hum.-Centric Comput. Inf. Sci. 2019, 9, 24. [Google Scholar] [CrossRef]
Chen, L.C.; Lee, C.M.; Chen, M.Y. Exploration of social media for sentiment analysis using deep learning. Soft Comput. 2020, 24, 8187–8197. [Google Scholar] [CrossRef]
Nizzoli, L.; Avvenuti, M.; Cresci, S.; Tesconi, M. Extremist propaganda tweet classification with deep learning in realistic scenarios. In Proceedings of the 10th ACM Conference on Web Science, Amsterdam, The Netherlands, 30 June–3 July 2019; ACM: New York, NY, USA, 2019; pp. 203–204. [Google Scholar]
Nathani, V.G.; Patel, H.K. Twitter Sentiment Analysis for Hate Speech Detection Using a C-BiLstm Model. 2022. Available online: https://harshilpatel99.github.io/Research_paper.pdf (accessed on 1 June 2024).
Rajendran, A.; Sahithi, V.S.; Gupta, C.; Yadav, M.; Ahirrao, S.; Kotecha, K.; Gaikwad, M.; Abraham, A.; Ahmed, N.; Alhammad, S.M. Detecting Extremism on Twitter During US Capitol Riot Using Deep Learning Techniques. IEEE Access 2022, 10, 133052–133077. [Google Scholar] [CrossRef]
Alhayan, F.; Shaalan, K. Neural Networks and Sentiment Features for Extremist Content Detection in Arabic Social, 2025. Int. Arab. J. Inf. Technol. (IAJIT) 2025, 17, 522–534. [Google Scholar]
Alom, M.Z.; Taha, T.M.; Yakopcic, C.; Westberg, S.; Sidike, P.; Nasrin, M.S.; Asari, V.K. A state-of-the-art survey on deep learning theory and architectures. Electronics 2019, 8, 292. [Google Scholar] [CrossRef]
Kanerva, J.; Ginter, F.; Salakoski, T. Universal Lemmatizer: A sequence-to-sequence model for lemmatizing Universal Dependencies treebanks. Nat. Lang. Eng. 2021, 27, 545–574. [Google Scholar] [CrossRef]
Sameer, R.A. Modified light stemming algorithm for Arabic language. Iraqi J. Sci. 2016, 57, 507–513. [Google Scholar]
Al-Shawakfa, E.M.; Husni, H.H. A Two-Stage Machine Learning Classification Approach to Identify Extremism in Arabic Opinions. Int. J. 2021, 10, 2. [Google Scholar]
Alfreihat, M.; Almousa, O.; Tashtoush, Y.; AlSobeh, A.; Mansour, K.; Migdady, H. Emo-SL framework: Emoji sentiment lexicon using text-based features and machine learning for sentiment analysis. IEEE Access 2024, 12, 81793–81812. [Google Scholar] [CrossRef]
Alshattnawi, S.; Shatnawi, A.; AlSobeh, A.M.R.; Magableh, A.A. Beyond Word-Based Model Embeddings: Contextualized Representations for Enhanced Social Media Spam Detection. Appl. Sci. 2024, 14, 2254. [Google Scholar] [CrossRef]
Radwan, A.; Amarneh, M.; Alawneh, H.; Ashqar, H.I.; AlSobeh, A.; Magableh, A.A. Predictive Analytics in Mental Health Leveraging LLM Embeddings and Machine Learning Models for Social Media Analysis. Int. J. Web Serv. Res. 2024, 21, 1–22. [Google Scholar] [CrossRef]
Mahapatra, S. Why Deep Learning over Traditional Machine Learning? Towards Data Sci. 2018. Available online: https://medium.com/data-science/why-deep-learning-is-needed-over-traditional-machine-learning-1b6a99177063 (accessed on 21 June 2024).
Oppermann, A. What Is Deep Learning and How Does It Work? Builtin. 2022. Available online: https://builtin.com/machine-learning/deep-learning (accessed on 1 June 2024).
Yang, K.; Huang, Z.; Wang, X.; Li, X. A blind spectrum sensing method based on deep learning. Sensors 2019, 19, 2270. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017); Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 5998–6008. [Google Scholar]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
Antoun, W.; Baly, F.; Hajj, H. AraBERT: Transformer-based Model for Arabic Language Understanding. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, Marseille, France, 12 May 2020; European Language Resource Association: Paris, France; pp. 9–15. [Google Scholar]
Abdul-Mageed, M.; Elmadany, A.; Nagoudi, E.M.B. ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, Online, 5–6 August 2021; Association for Computational Linguistics: Kerrville, TX, USA; pp. 7088–7105. [Google Scholar]
Antoun, W.; Baly, F.; Hajj, H. AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding. In Proceedings of the Sixth Arabic Natural Language Processing Workshop (WANLP 2021), Abu Dhabi, United Arab Emirates, 20 May 2021; pp. 191–195. [Google Scholar]
Abdelali, A.; Darwish, K.; Durrani, N.; Mubarak, H. Farasa: A Fast and Accurate Arabic Segmenter. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, 23–28 May 2016; pp. 1070–1074. [Google Scholar]
Al-Twairesh, N.; Al-Negheimish, H. Surface and Deep Features Ensemble for Sentiment Analysis of Arabic Tweets. IEEE Access 2019, 7, 84122–84131. [Google Scholar] [CrossRef]
Xia, R.; Zong, C.; Li, S. Ensemble of feature sets and classification algorithms for sentiment classification. Inf. Sci. 2011, 181, 1138–1152. [Google Scholar] [CrossRef]
Wang, P.; Xu, B.; Xu, J.; Tian, G.; Liu, C.L.; Hao, H. Semantic expansion using word embedding clustering and CNN for improving short text classification. Neurocomputing 2016, 174, 806–814. [Google Scholar] [CrossRef]
Liu, Q.; Zhang, H.; Zeng, Y.; Huang, Z.; Wu, Z. Content attention model for aspect based sentiment analysis. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27 April 2018; pp. 1023–1032. [Google Scholar]
Pappagari, R.; Zelasko, P.; Villalba, J.; Carmiel, Y.; Dehak, N. Hierarchical transformers for long document classification. In Proceedings of the 2019 IEEE ASRU Workshop, Sentosa, Singapore, 17–21 December 2019; pp. 838–844. [Google Scholar]
Al-Sallab, A.; Ezzeldin, M.; Khalifa, M.; Habash, N.; El-Beltagy, S.R. Deep learning models for sentiment analysis in Arabic. In Proceedings of the Second Workshop on Arabic NLP, Valencia, Spain, 4 April 2017; pp. 9–17. [Google Scholar]
Al-Zahrani, L.; Al-Yahya, M. Pre-Trained Language Model Ensemble for Arabic Fake News Detection. Mathematics 2024, 12, 2941. [Google Scholar] [CrossRef]
AlSobeh, A.M.R. OSM: Leveraging model checking for observing dynamic behaviors in aspect-oriented applications. Online J. Commun. Media Technol. 2023, 13, e202355. [Google Scholar] [CrossRef]
Alajmi, A.; Saad, E.M.; Darwish, R.R. Toward an ARABIC stop-words list generation. Int. J. Comput. Appl. 2012, 46, 8–13. [Google Scholar]
Goldberg, Y. Neural Network Methods for Natural Language Processing; Morgan & Claypool: San Rafael, CA, USA, 2017. [Google Scholar] [CrossRef]
Djaballah, K.A.; Boukhalfa, K.; Boussaid, O. Sentiment analysis of Twitter messages using word2vec by weighted average. In Proceedings of the 2019 International Conference on Social Networks Analysis, Management and Security (SNAMS), Granada, Spain, 22–25 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 223–228. [Google Scholar]
Kapoor, A.; Gulli, A.G.; Pal, S.; Chollet, F. Deep Learning with TensorFlow and Keras, 3rd ed.; Packt Publishing: Birmingham, UK, 2022; Available online: https://learning.oreilly.com/library/view/deep-learning-with/9781803232911/ (accessed on 1 January 2024).
Wadud, M.A.H.; Mridha, M.F.; Rahman, M.M. Word embedding methods for word representation in deep learning for natural language processing. Iraqi J. Sci. 2022, 63, 1349–1361. [Google Scholar] [CrossRef]
Alemu, H.; Wu, W.; Zhao, J. Feedforward Neural Networks with a Hidden Layer Regularization Method. Symmetry 2018, 10, 525. [Google Scholar] [CrossRef]
Patterson, J.P.; Gibson, A. Deep Learning; O’Reilly Media: Sebastopol, CA, USA, 2017. [Google Scholar]
Liang, H.; Sun, X.; Sun, Y.; Gao, Y. Text feature extraction based on deep learning: A review. EURASIP J. Wirel. Commun. Netw. 2017, 2017, 211. [Google Scholar] [CrossRef]

Figure 1. RADAR# architecture design.

Figure 2. A sample of the dataset.

Figure 3. 1DCNN and Bi-LSTM model for text classification.

Figure 4. Sensitivity analysis of hyperparameters.

Figure 5. RADAR# ensemble approach.

Figure 6. Performance comparison of different models.

Figure 7. Attention weights visualization for radicalization detection.

Figure 8. ROC curve comparison for detection models.

Table 1. Sample of cleaned data.

Example of Arabic Tweet	Type of Cleaning	Tweets After Cleaning
يوم الاثنين صلاه الغائب على الشيخ المجاهد ابوعبدالله أسامه بن لادن في مقبره صبحان	Emojis	يوم الاثنين صلاه الغائب على الشيخ المجاهد ابوعبدالله أسامه بن لادن في مقبره صبحان
ولما بيخرجوهم دلوقت من السجون بقينا إحنا "الكفار" و "العملاء" اللي عايزين نوقع بين الشعب والجيش	Handles	ولما بيخرجوهم دلوقت من السجون بقينا إحنا "الكفار" و "العملاء" اللي عايزين نوقع بين الشعب والجيش
سوريا اليوم ما حد وقف جنبا ولا حد قال شو ذنبو هالشعب يروح بين الرجلين Syria stays in my heart	English	سوريا اليوم ما حد وقف جنبا ولا حد قال شو ذنبو هالشعب يروح بين الرجلين
أيها الثوار الخونة العملاء المجرمين #الملحدين #الليبراليين الاشتراكيين إلمشركين الكفار أعداء الإخوان والمجلس العسكري مش هنسيبكم فى حالكم	Hashtag symbol only	أيها الثوار الخونة العملاء المجرمين الملحدين الليبراليين الاشتراكيين إلمشركين الكفار أعداء الإخوان والمجلس العسكري مش هنسيبكم فى حالكم
عاجل | الجزيرة: مسؤول امريكي: وفاة الشيخ المجاهد أسامه بن لادن.. http://fb.me/107Cuj963	Links	عاجل | الجزيرة: مسؤول امريكي: وفاة الشيخ المجاهد أسامه بن لادن..
١٥٠شخص برئ توفوا بالانفجار وش ذنبهم هالابرياء لعنة الله عليكم	Numbers	شخص برئ توفوا بالانفجار وش ذنبهم هالابرياء لعنة الله عليكم
احتلال الكفار المباشر أفضل من حكم العملاء على الأقل سيعرف الناس الحق من الباطل	New line	احتلال الكفار المباشر أفضل من حكم العملاء على الأقل سيعرف الناس الحق من الباطل
برأيي انا هناك من يدعم داعش من تحت _ الطاولة لانها تدعم مصالحه	Underscore	برأيي انا هناك من يدعم داعش من تحت الطاولة لانها تدعم مصالحه
لإنبطاحي:لا يهتم لإنتهاك الأعراض , ولا لإحتلال الأراضي المسلمة , ولا لتسلط الكفار , ولا لخيانة العملاء ,إنما يهتم في تقديس السلاطين.	Punctuation	لإنبطاحي لا يهتم لإنتهاك الأعراض ولا لإحتلال الأراضي المسلمة ولا لتسلط الكفار ولا لخيانة العملاء إنما يهتم في تقديس السلاطين
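
The cleaning types listed in Table 1 can be approximated with a few regular expressions. The following is a minimal sketch, not the authors' implementation: the function name clean_tweet, every pattern, and the order of the steps are assumptions made for illustration, and Arabic diacritics are deliberately kept so that words are not split.

```python
import re

# Hypothetical patterns: the paper does not publish its exact cleaning rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HANDLE_RE = re.compile(r"@\w+")
# Rough emoji coverage (pictographs, dingbats, variation selector, joiner).
EMOJI_RE = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]")
LATIN_RE = re.compile(r"[A-Za-z]+")
# ASCII digits and Arabic-Indic digits.
DIGIT_RE = re.compile(r"[0-9\u0660-\u0669]+")
# Anything that is not a word character, whitespace, or an Arabic diacritic.
PUNCT_RE = re.compile(r"[^\w\s\u064B-\u0652]")


def clean_tweet(text: str) -> str:
    """Apply the Table 1 cleaning types in one assumed order."""
    text = URL_RE.sub(" ", text)     # Links
    text = HANDLE_RE.sub(" ", text)  # Handles
    text = EMOJI_RE.sub(" ", text)   # Emojis
    text = text.replace("#", " ")    # Hashtag symbol only, the word is kept
    text = text.replace("_", " ")    # Underscore
    text = LATIN_RE.sub(" ", text)   # English
    text = DIGIT_RE.sub(" ", text)   # Numbers
    text = PUNCT_RE.sub(" ", text)   # Punctuation
    return " ".join(text.split())    # New lines and repeated spaces


if __name__ == "__main__":
    print(clean_tweet("برأيي انا هناك من يدعم من تحت _ الطاولة"))
```

Links and handles are removed first in this sketch so that later steps do not break them into fragments (for example, stripping underscores before handles would leave part of a handle behind).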

Table 2. Sample of analysis of hyperparameter tuning for tweets.

Tweet and Translation	Number of Hidden Layers	Number of Epochs	Batch Size
ياحكام العرب الكفرة والله لعنه القتلى في سوريا وليبيا وكل المسلمين ستلاحقهم الى يوم القيامه ياكلاب اليهود O you infidel Arab rulers, by God, the curse of the dead in Syria, Libya, and all Muslims will haunt you until the Day of Judgment, you dogs of the Jews.	2, 3, 4 (Best: 3)	20, 30, 40, 50 (Best: 40)	8, 16, 32 (Best: 16)
الجزيرة | مسؤول أمريكي | وفاة المجاهد الشيخ أسامة بن لادن وباراك أوباما سيلقي بيانًا قريبًا. Al Jazeera | U.S. Official | The death of the mujahid Sheikh Osama bin Laden, and Barack Obama will issue a statement shortly.	2, 3, 4 (Best: 4)	30, 40, 50, 60 (Best: 50)	16, 24, 32 (Best: 24)
لماذا يُطلق علينا اسم الرافضين، المجوس، الكفار، الخونة، العملاء؟ فقط للوقوف ضد الظلم، سؤال يجول في خاطري. Why are we called rejectionists, Zoroastrians, infidels, traitors, collaborators? Just for standing against injustice, a question that lingers in my mind.	2, 3, 4 (Best: 3)	20, 40, 60 (Best: 40)	8, 16, 24, 32 (Best: 16)

Table 3. Hyperparameter optimization space and results.

Parameter	Range Explored	Optimal Value	Impact on Performance
Learning Rate	[0.1, 0.01, 0.001, 0.2, 0.02, 0.002, 0.3, 0.03, 0.003]	0.003 with cosine decay	+3.2% F1-score vs. fixed rate
Batch Size	[8, 16, 24, 32]	24	+1.8% F1-score vs. default (32)
Dropout Rate	[0.1, 0.2, 0.3, 0.4, 0.5]	0.3	+2.5% F1-score vs. no dropout
CNN Kernel Size	[3, 4, 6, 8]	6	+1.7% F1-score vs. size 3
BiLSTM Units	[64, 128, 256]	128 (each direction)	+2.1% F1-score vs. 64 units
Optimizer	[Adam, RMSprop, SGD]	Adam	+2.8% F1-score vs. SGD
L2 Regularization	[0, 1 × 10⁻⁵, 1 × 10⁻⁴, 1 × 10⁻³]	1 × 10⁻⁴	+1.5% F1-score vs. no regularization
Ensemble Weights	CNN-BiLSTM: [0.3–0.6], AraBERT: [0.4–0.7]	CNN-BiLSTM: 0.4, AraBERT: 0.6	+2.2% F1-score vs. equal weights
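
Two of the optimal settings in Table 3 lend themselves to a short illustration: the ensemble weights (CNN-BiLSTM 0.4, AraBERT 0.6) and the initial learning rate of 0.003 with cosine decay. The sketch below assumes that the ensemble is a weighted average of the two models' radical-class probabilities and that the decay follows the common half-cosine form ending at zero; the table does not spell out either detail, and the function names and the 0.5 decision threshold are hypothetical.

```python
import math


def weighted_ensemble(p_cnn_bilstm, p_arabert, w_cnn_bilstm=0.4, w_arabert=0.6):
    """Weighted average of per-tweet probabilities from the two models.

    Default weights are the optimal values reported in Table 3.
    """
    if len(p_cnn_bilstm) != len(p_arabert):
        raise ValueError("both models must score the same tweets")
    return [
        w_cnn_bilstm * a + w_arabert * b
        for a, b in zip(p_cnn_bilstm, p_arabert)
    ]


def predict_labels(probabilities, threshold=0.5):
    """Map probabilities to 1 (radical) or 0 (non-radical); threshold is assumed."""
    return [1 if p >= threshold else 0 for p in probabilities]


def cosine_decay_lr(step, total_steps, initial_lr=0.003):
    """Half-cosine decay from initial_lr down to zero (assumed schedule form)."""
    step = min(step, total_steps)
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * step / total_steps))


if __name__ == "__main__":
    combined = weighted_ensemble([0.9, 0.1], [0.4, 0.3])
    print(combined, predict_labels(combined))
    print([round(cosine_decay_lr(s, 100), 5) for s in (0, 50, 100)])
```

With weights that sum to 1, the combined score stays in the range of a probability, so the same threshold can be applied to the ensemble as to either model alone.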

Table 4. Matrix and per-category performance metrics.

Category	Precision	Recall	F1-Score	Samples	Error Rate
Explicit Radical Content	98.7%	97.9%	98.3%	12,453	2.1%
Implicit Radical Content	94.2%	92.8%	93.5%	8764	7.2%
Religious Extremism	96.5%	95.3%	95.9%	14,872	4.7%
Political Extremism	93.8%	91.6%	92.7%	9341	8.4%
Non-Radical Content	97.3%	98.1%	97.7%	44,386	1.9%
Overall	96.1%	95.1%	95.6%	89,816	4.9%
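
The figures in Table 4 can be cross-checked with simple arithmetic. The sketch below is an observation about the reported numbers rather than a statement made by the authors: the Overall row appears to be the unweighted (macro) mean of the five category rows, and the Error Rate column matches 100% minus recall. The variable and function names are illustrative.

```python
# (precision, recall, F1, samples) copied from Table 4; percentages as floats.
PER_CATEGORY = {
    "Explicit Radical Content": (98.7, 97.9, 98.3, 12453),
    "Implicit Radical Content": (94.2, 92.8, 93.5, 8764),
    "Religious Extremism": (96.5, 95.3, 95.9, 14872),
    "Political Extremism": (93.8, 91.6, 92.7, 9341),
    "Non-Radical Content": (97.3, 98.1, 97.7, 44386),
}


def macro_average(rows):
    """Unweighted mean of precision, recall, and F1, rounded to one decimal."""
    n = len(rows)
    metrics = [row[:3] for row in rows]
    return tuple(round(sum(col) / n, 1) for col in zip(*metrics))


rows = list(PER_CATEGORY.values())
macro = macro_average(rows)
total_samples = sum(row[3] for row in rows)
# Error rate recomputed as the complement of recall for each category.
error_rates = {name: round(100 - row[1], 1) for name, row in PER_CATEGORY.items()}
mean_error = round(sum(100 - row[1] for row in rows) / len(rows), 1)

if __name__ == "__main__":
    print(macro, total_samples, mean_error)
    print(error_rates)
```

A sample-weighted average would give more influence to the large Non-Radical class, so the two averaging schemes should not be compared directly when reading the Overall row.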

Table 5. Comparative analysis with recent models.

Model	Accuracy	Precision	Recall	F1-Score	ROC-AUC	Training Time (h)	Inference Time (ms/Sample)	Parameters (M)
RADAR# (Ours)	96.1%	96.1%	95.1%	95.6%	0.983	4.2	18	124
AraBERT-v2 [39]	93.8%	94.2%	92.5%	93.3%	0.967	6.8	32	178
MarBERT [40]	94.2%	94.8%	92.9%	93.8%	0.971	7.2	35	183
AraGPT2-base [40]	91.5%	92.1%	90.3%	91.2%	0.952	5.3	28	135
AraELECTRA [41]	93.2%	93.7%	92.1%	92.9%	0.964	5.1	25	109
CNN-BiLSTM [10]	90.8%	91.3%	89.7%	90.5%	0.943	2.8	12	42

Table 6. Results of ablation.

Configuration	Accuracy	F1-Score	Change in F1-Score
Full RADAR#	0.98	0.97	-
Without CNN layers	0.96	0.95	−0.02
Without attention mechanism	0.95	0.94	−0.03
Without AraBERT	0.97	0.96	−0.01
Without transformer models	0.95	0.93	−0.04
Without preprocessing normalization	0.96	0.94	−0.03
Without Farasa segmentation	0.97	0.95	−0.02

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

