Article

Continual Learning for Saudi-Dialect Offensive-Language Detection Under Temporal Linguistic Drift

1 Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
2 Faculty of Computing and Information Technology, University of Jeddah, Al Kamil 25341, Saudi Arabia
* Author to whom correspondence should be addressed.
Information 2026, 17(1), 99; https://doi.org/10.3390/info17010099
Submission received: 18 November 2025 / Revised: 15 January 2026 / Accepted: 16 January 2026 / Published: 18 January 2026
(This article belongs to the Special Issue Social Media Mining: Algorithms, Insights, and Applications)

Abstract

Offensive-language detection systems that perform well at a given point in time often degrade as linguistic patterns evolve, particularly in dialectal Arabic social media, where new terms emerge and familiar expressions shift in meaning. This study investigates temporal linguistic drift in Saudi-dialect offensive-language detection through a systematic evaluation of continual-learning approaches. Building on the Saudi Offensive Dialect (SOD) dataset, we designed test scenarios incorporating newly introduced offensive terms, context-shifting expressions, and varying proportions of historical data to assess both adaptation and knowledge retention. Eight continual-learning configurations—Experience Replay (ER), Elastic Weight Consolidation (EWC), Low-Rank Adaptation (LoRA), and their combinations—were evaluated across five test scenarios. Results show that models without continual learning experience a 13.4-percentage-point decline in F1-macro on evolved patterns. In our experiments, Experience Replay achieved a relatively favorable balance, maintaining 0.812 F1-macro on historical data and 0.976 on contemporary patterns (KR = −0.035; AG = +0.264), though with increased memory and training time. EWC showed moderate retention (KR = −0.052) with comparable adaptation (AG = +0.255). On the SimuReal test set—designed with realistic class imbalance and only 5% drift terms—ER achieved 0.842 and EWC achieved 0.833, compared to the original model’s 0.817, representing modest improvements under realistic conditions. LoRA-based methods showed lower adaptation in our experiments, likely reflecting the specific LoRA configuration used in this study. Further investigation with alternative settings is warranted.

1. Introduction

Social-media platforms have become primary venues for communication in Arabic-speaking communities, with Saudi Arabia ranking among the most active countries on X (formerly Twitter), hosting approximately 15.7 million Arabic-speaking users [1]. These platforms also enable the rapid spread of offensive language, hate speech, and abusive content, motivating the development of automated detection systems. While machine-learning models have achieved high accuracy on benchmark datasets—our previous work on the Saudi Offensive Dialect (SOD) dataset reached a 91% F1-score [2]—their effectiveness diminishes over time as language evolves, particularly in informal social-media contexts, where new expressions appear and familiar ones shift in meaning.
Computational evidence confirms that temporal drift substantially reduces model performance. Florio et al. [3] showed that hate-speech detection on Italian Twitter is highly sensitive to the time gap between training and testing. Arango et al. [4] found that models exceeding 90% F1 within-dataset fell below 80% on temporally distinct data and below 50% on distributions further removed in time. Jin et al. [5] demonstrated that this bias persists across languages and model architectures, indicating that linguistic evolution—rather than implementation details—drives performance degradation.
Social-science research explains the mechanisms underlying this evolution. Online communities accelerate linguistic change through rapid diffusion and re-interpretation of expressions. Eisenstein et al. [6] traced slang propagation across metropolitan areas, showing that new words spread globally within weeks. Lucy and Bamman [7] revealed that online communities develop context-specific meanings shaped by social variables, while platform conventions and shifting norms continually redefine what counts as offensive or acceptable [8,9,10,11]. Together, these findings indicate that offensive-language evolution is both a social and computational phenomenon.
Arabic presents additional challenges due to its morphological richness and dialectal diversity. Modern Standard Arabic differs markedly from the dialects that dominate social media, each with unique lexical and pragmatic traits [12]. Our SOD dataset addressed these through dialect-specific annotation and modeling [2]; however, prior evaluation used data from a single period, leaving temporal robustness unexplored. Given the documented degradation in other languages and the speed of online linguistic change, investigating adaptive strategies is warranted.
This study investigates whether continual-learning methods can help address temporal drift in Saudi-dialect offensive-language detection. Using the SOD dataset, we constructed test scenarios capturing two forms of linguistic evolution—newly emerged offensive terms and context-shifting expressions—combined with varying proportions of historical data to assess both adaptation and retention. Eight continual-learning configurations, including Experience Replay (ER), Elastic Weight Consolidation (EWC), and Low-Rank Adaptation (LoRA), were compared across five test scenarios, with particular attention to mixed-distribution and simulated real-world scenarios that provide a closer approximation to realistic temporal conditions. In our experiments, replay-based methods showed relatively favorable retention while adapting to new patterns, though with increased memory and training time. EWC showed moderate retention with comparable adaptation. LoRA-based methods required less memory and training time but showed lower adaptation in our limited configuration; further investigation is needed. These observations provide initial insights into continual learning for Arabic-dialect offensive-language detection, an area that remains largely underexplored.

2. Related Work

2.1. Offensive-Language Detection in Arabic

Research on Arabic offensive-language detection has progressed through successive methodological stages, each addressing the language’s morphological richness and dialectal diversity. Early lexicon-based systems relied on manually compiled lists of offensive words and rule-based pattern matching. Burnap and Williams [13] demonstrated that, while computationally efficient, such systems produced high false-positive rates and failed to capture contextual nuance or unseen expressions.
The introduction of machine-learning methods marked a significant shift. Feature-based classifiers such as Support Vector Machines and Naïve Bayes used lexical and linguistic cues—including TF-IDF, part-of-speech tags, and sentiment indicators—and achieved moderate accuracy (≈80%) on benchmark datasets. Mubarak et al. [14] showed that these approaches required extensive feature engineering and struggled with dialectal variation where morphological patterns differ substantially from Modern Standard Arabic.
Deep-learning architectures further advanced performance through automatic feature extraction. Badjatiya et al. [15] demonstrated that convolutional and recurrent neural networks learned hierarchical text representations, while Farha and Magdy [16] applied multitask learning frameworks to improve cross-dataset generalization. Transformer-based models such as AraBERT [17], MARBERT [18], and QARiB [19] achieved strong results by pre-training on large Arabic corpora, reaching F1-scores above 85% on several hate-speech benchmarks.
Dialect-specific corpora have demonstrated substantial importance for robust performance. Dahou et al. [12] reported that dialect-focused models outperform those trained on mixed-dialect or MSA-only data. The OSACT shared tasks [20,21] established standardized evaluation settings, highlighting that cross-dialect generalization remains challenging due to lexical and pragmatic differences. Our earlier Saudi Offensive Dialect (SOD) dataset confirmed these observations, with Saudi-specific fine-tuning improving F1-scores by seven percentage points over mixed-Arabic models [2].
Despite these advances, prior Arabic studies have predominantly relied on temporally static splits, assuming that offensive-language patterns remain stable over time. This approach does not account for evidence from multilingual research showing that linguistic drift causes measurable degradation in model accuracy, underscoring the need for temporally adaptive evaluation frameworks.

2.2. Continual Learning: Approaches and Applications

Continual learning addresses the challenge of adapting models to new information while avoiding catastrophic forgetting—the tendency of neural networks to abruptly lose previously learned knowledge when trained on new data [22,23]. Research has developed three principal methodological families offering distinct mechanisms for balancing stability and plasticity.
Regularization-based methods constrain parameter updates to preserve knowledge encoded in weights important for previous tasks. Elastic Weight Consolidation (EWC), introduced by Kirkpatrick et al. [24], computes the Fisher Information Matrix to identify parameters critical for previous task performance and penalizes their modification during adaptation. Kirkpatrick et al. demonstrated that EWC enables sequential learning across multiple tasks, achieving average accuracy above 90% on previously learned tasks compared to below 20% for naïve fine-tuning. Zenke et al. [25] proposed Synaptic Intelligence, which accumulates importance scores online during training. These methods require no storage of previous training data but depend on appropriate regularization strength.
Replay-based methods mitigate forgetting by maintaining a subset of previous training examples and intermixing them with new data. Experience Replay (ER), adapted from reinforcement learning [26], stores representative examples in a memory buffer [27]. Rolnick et al. [27] showed that ER maintains accuracy within 5% of joint training baselines when using memory buffers comprising only 5% of original training data. Gradient Episodic Memory (GEM) and A-GEM constrain gradient updates to prevent increased loss on replayed examples [28,29]. Replay methods demonstrate strong empirical performance but require careful buffer management.
Parameter-efficient methods isolate task-specific and shared parameters to prevent interference. Low-Rank Adaptation (LoRA), proposed by Hu et al. [30], freezes pre-trained model weights and introduces trainable low-rank matrices, reducing trainable parameters by over 99% while maintaining adaptation capability on transfer-learning tasks. Adapter layers achieve similar parameter efficiency [31,32]. However, these methods were developed primarily for transfer-learning scenarios, and their effectiveness for continual learning, where preserving previous-task knowledge is essential, remains understudied.
In English NLP, continual learning has been applied to sentiment analysis [33], toxicity detection [34], and domain adaptation [35]. Jang et al. [33] demonstrated that sequential targeting approaches improved F1-scores by 8–12 percentage points across toxicity-detection datasets, while Agarwal and Chowdary [34] achieved better cross-dataset generalization with adaptive ensemble methods. However, systematic evaluation for Arabic—particularly dialectal Arabic, where morphology and semantics evolve rapidly—is lacking.
Three important gaps emerge from the reviewed literature. First, Arabic offensive-language detection research predominantly evaluates models on temporally static benchmarks, with limited examination of performance under linguistic drift despite empirical evidence from other languages showing substantial degradation. Second, continual-learning methods validated primarily on English have not been systematically evaluated for dialectal Arabic. Third, existing work lacks a unified comparative framework contrasting regularization-, replay-, and parameter-efficient approaches for dialectal offensive-language detection. This study addresses these gaps through systematic evaluation of eight continual-learning configurations across five test scenarios, quantifying both adaptation and retention to explore strategies that may help sustain model robustness.

3. Methodology

3.1. Experimental Framework

This study builds upon the SOD_AraBERT model introduced in our previous work [2], which was developed for Saudi-dialect offensive-language detection by fine-tuning the AraBERT model on over 24,000 annotated tweets. This model serves as the starting point for the continual-learning experiments conducted in this study.

3.2. Dataset Construction

To examine the impact of language evolution on model performance, we developed new training and testing datasets that capture both contemporary and historically stable linguistic patterns.

3.2.1. Training Data

The new training dataset (NEW_DS) was collected from X (formerly Twitter) during 2024–2025. Following the discontinuation of Twitter’s Academic Research API in February 2023, data collection was conducted manually using the platform’s native search interface with Arabic language filtering and geographic targeting of Saudi Arabia and nearby regions. This creates a temporal gap of approximately three years from the original SOD dataset (2019–2022), representing a non-IID setting where covariate shift (new terms emerging) and concept drift (existing words acquiring new meanings) motivate the use of continual-learning approaches.
The dataset captures two main types of linguistic change identified through manual observation of Saudi social-media discourse:
  • Newly emerged offensive terms: Novel expressions without dictionary roots that have developed as offensive terms in everyday discourse and spread through social media, such as “زوط” (zoot) and its morphological variations. These terms arise from natural linguistic evolution, where users organically create and spread new offensive expressions in daily communication.
  • Context-shifting expressions: Neutral words that acquire offensive connotations depending on pragmatic or situational context. For example, “العفوية” (spontaneity) functions literally in “العفوية أجمل شي” (spontaneity is the best thing) but becomes sarcastic mockery in “العفوية تتصل بك” (spontaneity is calling you).
Each context-shifting term appears in both offensive and non-offensive usages to help the model learn contextual cues. The dataset also includes general offensive and non-offensive samples, and sentences containing the original, neutral meaning of the same terms to ensure balanced representation. Table 1 presents examples of both categories. Drift terms were confirmed to be absent from the original SOD corpus (2019–2022). The dataset composition includes approximately 10% of newly emerged terms and variations, 30% context-shifting expressions, and 60% general offensive and non-offensive content.
Preprocessing followed the same pipeline established in our previous work [2]: user mentions were replaced with [USER] tokens, URLs with [URL] tokens, and newlines with [NL] tokens, thereby also ensuring privacy protection by removing user identifiers. Annotation was performed by native Saudi Arabic speakers with expertise in Saudi-dialect offensive-language patterns, applying consistent labeling criteria established during the original SOD dataset development. Inter-annotator agreement exceeded 95%. The relatively modest size of NEW_DS (2000 samples) reflects the proof-of-concept orientation of this study.
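The replacement rules above can be sketched in a few lines (a minimal illustration using Python's `re` module; the [USER], [URL], and [NL] tokens follow the paper, while the exact mention and URL patterns are our assumptions, not the original pipeline's):

```python
import re

def preprocess(tweet: str) -> str:
    """Replace user mentions, URLs, and newlines with placeholder
    tokens, mirroring the SOD preprocessing pipeline."""
    tweet = re.sub(r"@\w+", "[USER]", tweet)         # anonymize mentions
    tweet = re.sub(r"https?://\S+", "[URL]", tweet)  # mask links
    tweet = tweet.replace("\n", " [NL] ")            # mark line breaks
    return re.sub(r"\s+", " ", tweet).strip()        # normalize whitespace

print(preprocess("@user شاهد هذا\nhttps://t.co/abc"))
# → [USER] شاهد هذا [NL] [URL]
```

Replacing mentions before masking URLs avoids the `@` pattern ever matching inside an already-inserted token.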

3.2.2. Test Scenarios

Five test sets were designed to evaluate both adaptation to new linguistic patterns and retention of prior knowledge:
  • Contemporary test set: 500 examples (50% offensive and 50% non-offensive) collected independently from the training dataset (NEW_DS) during the same 2024–2025 period. Offensive examples include newly emerged and context-shifting expressions, while non-offensive examples include both general non-offensive sentences and neutral contextual uses of the same terms, ensuring a balanced distribution.
  • Historical test set: 500 held-out examples from the original SOD collection period, containing no evolved or context-shifting expressions. This dataset was never used during model development, serving as an unseen benchmark for assessing catastrophic forgetting and knowledge retention.
  • Mixed-distribution sets: Two datasets combining contemporary and historical samples in different ratios (20/80 and 40/60 contemporary/historical). These sets evaluate model performance under more realistic conditions, where established offensive patterns from the historical corpus remain predominant, while emerging terms constitute a smaller fraction of encountered content.
  • Simulated Realistic test set (SimuReal): 500 examples with a class distribution approximating real-world conditions: 80% non-offensive and 20% offensive, the latter comprising 2% newly emerged terms, 3% context-shifting expressions, and 15% general offensive content (all percentages of the full set). This distribution reflects that offensive content is rare in practice, and that drift terms constitute a small fraction of offensive content [36]. Table 2 provides a complete overview of all datasets used in this study, including sizes, collection periods, and composition.
Note that the historical, contemporary, and mixed test sets are balanced (50% offensive; 50% non-offensive) to ensure equal evaluation weight for both classes. The SimuReal test set uses an imbalanced distribution to approximate real-world conditions.
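The SimuReal percentages imply the following per-category counts out of 500 examples (a quick sanity check; the derived counts are ours, computed from the stated proportions):

```python
total = 500  # SimuReal test-set size

non_offensive = round(0.80 * total)  # 400
offensive = round(0.20 * total)      # 100

# Offensive portion, as fractions of the full set:
newly_emerged = round(0.02 * total)  # 10 newly emerged terms
context_shift = round(0.03 * total)  # 15 context-shifting expressions
general_off = round(0.15 * total)    # 75 general offensive content

assert newly_emerged + context_shift + general_off == offensive
assert non_offensive + offensive == total
# Drift terms are 5% of the whole set, i.e., 25% of the offensive class.
print((newly_emerged + context_shift) / total)  # → 0.05
```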

3.3. Experimental Configurations

We evaluated three core continual-learning techniques and their combinations, producing eight experimental configurations. To identify suitable hyperparameters for each core technique, we conducted systematic ablation studies examining LoRA rank and target modules, EWC regularization strength (λ), and ER buffer size (replay ratio). Hyperparameters were selected using a weighted criterion (0.6 × KR + 0.4 × AG) to prioritize knowledge retention, given that the historical corpus represents a larger and more established collection of Saudi offensive patterns. The ablation study results are presented in Section 4.1; the configurations described below reflect the selected settings. All experiments were repeated across five random seeds (42, 101, 123, 456, and 789), and results are reported as mean ± standard deviation.
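The weighted selection rule can be expressed directly (a small sketch; the candidate KR/AG values below are hypothetical and not taken from the ablation tables):

```python
def selection_score(kr: float, ag: float) -> float:
    """Weighted hyperparameter-selection criterion: knowledge
    retention (KR) weighted 0.6, adaptation gain (AG) weighted 0.4."""
    return 0.6 * kr + 0.4 * ag

# Hypothetical ablation outcomes: config -> (KR, AG)
candidates = {
    "config_a": (-0.060, 0.260),  # more forgetting, slightly higher gain
    "config_b": (-0.030, 0.250),  # better retention, slightly lower gain
}
best = max(candidates, key=lambda c: selection_score(*candidates[c]))
print(best)  # → config_b: the 0.6 retention weighting favors it
```

The 0.6/0.4 weighting means a configuration can "buy" one point of extra forgetting only with 1.5 points of extra adaptation gain.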

3.3.1. Baseline Models

  • Original: SOD_AraBERT without continual adaptation, serving as the reference for computing Knowledge Retention and Adaptation Gain metrics.
  • Naïve fine-tuning: Standard fine-tuning on contemporary data without mechanisms to prevent forgetting, representing the lower bound for knowledge retention.

3.3.2. Core Techniques

  • Experience Replay (ER). Reinforces prior knowledge by combining samples from the original SOD training split with samples from the training dataset (NEW_DS). The replay buffer is shuffled during training. The ablation study tested replay ratios ∈ {0.1, 0.2, 0.4, 0.6, 0.8}.
  • Elastic Weight Consolidation (EWC). Constrains updates to parameters critical for prior knowledge using the Fisher Information Matrix computed from 1000 historical samples. The ablation study tested λ ∈ {100, 500, 1000, 5000, 10,000}.
  • Low-Rank Adaptation (LoRA). Freezes pre-trained weights and introduces trainable low-rank decomposition matrices within transformer attention layers. The ablation study tested ranks r ∈ {8, 16, 32} with both standard (query, key, and value) and extended (including attention output dense) target modules, with no bias adaptation and dropout rate = 0.1.
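The two retention mechanisms can be sketched framework-agnostically (illustrative only: the experiments themselves use PyTorch models, and we interpret the replay ratio as the fraction of historical samples mixed in relative to the new data, which is an assumption about the paper's setup):

```python
import random

def mix_replay(new_data, buffer, replay_ratio=0.6, seed=42):
    """Experience Replay: draw historical samples from the buffer at
    replay_ratio times the size of the new data, then shuffle."""
    k = min(len(buffer), int(replay_ratio * len(new_data)))
    rng = random.Random(seed)
    mixed = list(new_data) + rng.sample(list(buffer), k)
    rng.shuffle(mixed)
    return mixed

def ewc_penalty(theta, theta_star, fisher, lam=5000.0):
    """EWC: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2, penalizing
    movement of parameters the Fisher information marks as important."""
    return 0.5 * lam * sum(
        f * (t - ts) ** 2 for f, t, ts in zip(fisher, theta, theta_star)
    )

historical = ["hist1", "hist2", "hist3", "hist4"]
new = ["new1", "new2", "new3"]
print(len(mix_replay(new, historical)))  # 3 new + 1 replayed = 4
print(ewc_penalty([1.0, 2.0], [1.0, 1.0], [0.5, 2.0], lam=2.0))  # → 2.0
```

In training, the EWC penalty is simply added to the cross-entropy loss on the new batch, while the replay mixture replaces the new batch entirely.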

3.3.3. Hybrid Configurations

To explore whether combining techniques from different methodological families provides complementary benefits, we evaluated four hybrid configurations: LoRA + ER, LoRA + EWC, LoRA + ER + EWC, and Full Fine-Tuning + ER + EWC. These configurations examine potential synergies between parameter-efficient adaptation, replay-based retention, and regularization-based retention.

3.3.4. Training Configuration

All experiments used consistent hyperparameters: batch size = 32, learning rate = 2 × 10⁻⁵, 5 training epochs, maximum sequence length = 128, and AdamW optimizer with weight decay = 0.01. The training dataset (NEW_DS) was split 80/20 for training/validation. Training was performed using PyTorch 2.9.0 with HuggingFace Transformers and PEFT libraries.

3.4. Evaluation Metrics

We employ three categories of metrics to evaluate model performance, continual-learning effectiveness, and training behavior. A central challenge in continual learning is balancing stability (preservation of prior knowledge) against plasticity (the capacity to acquire new patterns) [37].

3.4.1. Classification Metrics

Performance was evaluated using four classification metrics: Accuracy (proportion of correct predictions), Precision (reliability of positive predictions), Recall (sensitivity to positive cases), and F1-macro (class-averaged harmonic mean of precision and recall). F1-macro serves as the primary metric, as it assigns equal importance to both classes. All metrics were computed across five test scenarios.

3.4.2. Continual-Learning Metrics

To quantify the stability–plasticity trade-off, we define two complementary metrics:
Knowledge Retention (KR) measures how well a model preserves performance on historical data after adaptation:
KR = F1_historical(after) − F1_historical(before)
where F1_historical(before) is the original model’s performance on historical test data, and F1_historical(after) is performance after adaptation. Negative values indicate forgetting; values near zero indicate successful preservation.
Adaptation Gain (AG) measures improvement on contemporary patterns after adaptation:
AG = F1_contemporary(adapted) − F1_contemporary(original)
where F1_contemporary(original) is the original model’s performance on contemporary data, and F1_contemporary(adapted) is performance after adaptation. Positive values indicate successful acquisition of new patterns.
In our setting, the historical task corresponds to the SOD corpus (2019–2022), and the contemporary task to the 2024–2025 dataset. In this framework, a favorable outcome involves maximizing AG while keeping KR close to zero.
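Both metrics are simple F1 differences. Applied to the scores reported in Section 4 for the original model (0.847 historical / 0.713 contemporary) and ER (0.812 / 0.976), they reproduce the quoted values up to rounding of the underlying scores:

```python
def knowledge_retention(f1_hist_after: float, f1_hist_before: float) -> float:
    """KR: change in historical-task F1 after adaptation
    (negative = forgetting; near zero = successful preservation)."""
    return f1_hist_after - f1_hist_before

def adaptation_gain(f1_cont_adapted: float, f1_cont_original: float) -> float:
    """AG: improvement on contemporary patterns after adaptation
    (positive = successful acquisition of new patterns)."""
    return f1_cont_adapted - f1_cont_original

kr = knowledge_retention(0.812, 0.847)  # ER vs. original, historical set
ag = adaptation_gain(0.976, 0.713)      # ER vs. original, contemporary set
print(round(kr, 3), round(ag, 3))  # -0.035 0.263 (the paper reports +0.264,
                                   # computed from unrounded F1 scores)
```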

3.4.3. Training and Efficiency

To assess optimization behavior and verify model convergence, we report training error metrics: Final Training Loss (cross-entropy at the last epoch), Best Validation Loss (minimum achieved during training), and Final Validation Loss (at the last epoch). All values are reported as mean ± standard deviation across random seeds. Additionally, we report the number of trainable parameters for each configuration and wall-clock training time to evaluate the trade-off between parameter efficiency and computational cost.

4. Results

This section presents experimental findings. We first report ablation study results used to select hyperparameters for each core technique, followed by overall classification performance across the five test scenarios. We then analyze the stability–plasticity trade-off using Knowledge Retention and Adaptation Gain metrics, examine the relationship between parameter efficiency and performance, and report training efficiency and convergence metrics.

4.1. Ablation Studies

To identify suitable hyperparameters for each core continual-learning technique, we conducted systematic ablation studies. Each configuration was evaluated across three random seeds. Hyperparameters were selected using the weighted criterion described in Section 3.3.

4.1.1. LoRA Rank and Target Module Selection

Table 3 presents the LoRA ablation results across ranks r ∈ {8, 16, 32} with both standard (query, key, and value) and extended (including attention output dense) target modules. As shown in Figure 1, increasing the rank from 8 to 32 improved AG (from +0.129 to +0.200 with extended modules) at the cost of greater forgetting (KR dropped from −0.038 to −0.130). Extended target modules yielded higher AG than standard modules, though with somewhat lower retention. Based on the weighted criterion, r = 8 with extended modules was selected for main experiments.
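Plugging the extended-module endpoints quoted above into the 0.6 × KR + 0.4 × AG criterion from Section 3.3 shows why the lower rank was preferred (a quick check; the two intermediate ranks are omitted):

```python
# (KR, AG) for LoRA with extended target modules, as quoted in the text.
r8 = (-0.038, 0.129)
r32 = (-0.130, 0.200)

def score(kr, ag):
    return 0.6 * kr + 0.4 * ag  # retention-weighted selection criterion

# r=8 trades a little adaptation gain for much better retention.
print(round(score(*r8), 4), round(score(*r32), 4))
assert score(*r8) > score(*r32)
```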

4.1.2. EWC Regularization Strength

Table 4 presents the EWC ablation results across λ ∈ {100, 500, 1000, 5000, 10,000}.
As illustrated in Figure 2, regularization strength had a limited effect within the tested range. KR varied by approximately 1 percentage point across configurations, while AG remained stable (≈+0.25). λ = 5000 yielded marginally better retention (KR = −0.053) and was selected for main experiments.

4.1.3. Experience Replay Buffer Size

Table 5 presents the ER ablation results across replay ratios ∈ {0.1, 0.2, 0.4, 0.6, 0.8}. As shown in Figure 3, buffer size had a more noticeable effect on retention than EWC lambda. Increasing the replay ratio from 0.1 to 0.6 improved KR (from −0.060 to −0.035), while AG remained stable (≈+0.26). Beyond 0.6, returns diminished; ratio = 0.8 yielded similar results but with higher variance. Ratio = 0.6 was selected for main experiments.

4.2. Overall Performance

Table 6 presents classification performance across five test scenarios. All values represent mean ± standard deviation across five random seeds.

4.2.1. Diagnostic Boundaries: Historical and Contemporary

The historical and contemporary test sets serve as diagnostic boundaries to assess the extremes of retention and adaptation. The original model achieved 0.847 F1-macro on historical data but dropped to 0.713 on contemporary data—a 13.4-percentage-point decline, confirming the presence of linguistic drift. The contemporary test set is intentionally enriched with newly emerged terms and context-shifting expressions to evaluate adaptation capability; it does not represent the typical distribution of offensive content the model would encounter over time but rather isolates the specific challenge of recognizing evolved linguistic patterns.

4.2.2. Mixed Distributions: Closer to Realistic Temporal Scenarios

The mixed 20–80 and mixed 40–60 test sets provide a closer approximation to realistic temporal scenarios, where historical patterns remain predominant (20–80) or moderately predominant (40–60), enabling evaluation under gradual temporal mixing. With these scenarios, the original model achieved 0.798 (mixed 20–80) and 0.691 (mixed 40–60) F1-macro. This degradation—even when contemporary content represents only 20–40% of samples—suggests that the model may struggle with temporal variation in practice.
In our experiments, ER and Full+ER+EWC showed relatively better performance on mixed distributions: 0.909 and 0.911 on mixed 20–80, and 0.932 and 0.933 on mixed 40–60, respectively. EWC achieved 0.894 and 0.910 on these scenarios. Naïve fine-tuning reached 0.883 and 0.908, indicating that adaptation alone can improve mixed-distribution performance, though with reduced retention as observed in the historical boundary test.
In our experimental configuration, LoRA-based methods showed comparatively lower performance on mixed scenarios (0.854–0.870 on mixed 20–80; 0.822–0.852 on mixed 40–60). Adding Experience Replay to LoRA tended to improve results, possibly suggesting that replay helps address some constraints of parameter-efficient adaptation in this setting.

4.2.3. SimuReal: Simulated Real-World Conditions

The SimuReal test set incorporates both realistic class imbalance (80% non-offensive; 20% offensive) and mixed temporal patterns to simulate conditions closer to real-world social-media data. SimuReal generally narrows performance gaps across methods and highlights minority-class detection challenges introduced by class imbalance.
In our experiments, ER and Full+ER+EWC reached 0.842 and 0.844 on SimuReal, followed by EWC (0.833) and naïve FT (0.825). The original model achieved 0.817, while LoRA-based methods ranged from 0.802 to 0.813. The narrower performance gap across methods on SimuReal compared to other scenarios suggests that class imbalance introduces challenges distinct from temporal drift.
As shown in Figure 4, performance patterns varied across test scenarios, with mixed distributions and SimuReal providing more practical indicators of model behavior under temporal variation.

4.3. Stability–Plasticity Trade-Off

Table 7 presents Knowledge Retention (KR) and Adaptation Gain (AG) scores. KR quantifies performance change on historical data relative to the original model (negative values indicate forgetting), while AG measures adaptation capability on contemporary patterns (positive values indicate successful learning).
In our experiments, ER and Full+ER+EWC achieved relatively favorable trade-offs, with KR = −0.035 and AG = +0.264 for ER, and KR = −0.034 and AG = +0.264 for Full+ER+EWC. EWC showed moderate forgetting (KR = −0.052) with comparable adaptation (AG = +0.255), while naïve fine-tuning exhibited greater knowledge loss (KR = −0.062) despite strong adaptation (AG = +0.253). Figure 5 visualizes these trade-offs across all methods.
LoRA-based methods showed different patterns in our configuration. LoRA alone achieved KR = −0.055 and AG = +0.155, while adding EWC appeared to improve retention (KR = −0.043) with similar adaptation (AG = +0.158). Combining LoRA with Experience Replay improved adaptation (AG = +0.202) but showed increased forgetting (KR = −0.069). The comparatively lower adaptation observed in LoRA-based methods may relate to the limited parameter capacity or our specific experimental choices; these observations are discussed further in Section 5.

4.4. Parameter Efficiency vs. Performance

Figure 6 compares parameter efficiency against performance on historical and contemporary test sets. Full fine-tuning methods update 135 M parameters, while LoRA-based methods (r = 8) update only 591 K parameters (0.44%). Both method families achieved similar retention on historical data (left panel), while full fine-tuning methods showed higher adaptation on contemporary data (right panel). Detailed performance values are reported in Table 6 and Table 8.
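The reported counts are consistent with a back-of-the-envelope calculation (assuming AraBERT-base dimensions of 12 layers and hidden size 768; the small gap to the reported 591 K likely comes from additional trainable components such as the classification head):

```python
layers, hidden, r = 12, 768, 8  # assumed AraBERT-base dimensions, LoRA rank
target_modules = 4              # query, key, value, attention-output dense

# Each LoRA adapter adds A (r x hidden) plus B (hidden x r) parameters.
per_module = 2 * r * hidden                     # 12,288
trainable = layers * target_modules * per_module
print(trainable)                                # 589,824, close to 591 K
print(round(100 * trainable / 135_000_000, 2))  # 0.44 (% of the full model)
```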

4.5. Training Efficiency

Figure 7 compares average training time across methods in our setting. Experience Replay approximately doubled training time compared to non-replay methods, reflecting the larger training set from the replay buffer. EWC added modest overhead for Fisher Information Matrix computation. LoRA-based methods showed comparable or shorter times than their full fine-tuning counterparts. Full+ER+EWC showed the longest training time, accumulating costs from both components.

4.6. Training Convergence

Table 8 presents training convergence metrics across all methods in our experiments. Most methods showed similar best and final validation losses, suggesting stable convergence without notable overfitting. Methods incorporating Experience Replay showed slightly higher final validation losses compared to their best values, which may relate to the larger combined training set.
Table 8. Training convergence metrics (mean ± std across 5 seeds).
Method        | Final Train Loss | Best Val Loss  | Final Val Loss
Naïve FT      | 0.002 ± 0.001    | 0.033 ± 0.002  | 0.033 ± 0.002
ER            | 0.025 ± 0.005    | 0.044 ± 0.006  | 0.062 ± 0.004
EWC           | 0.010 ± 0.001    | 0.039 ± 0.002  | 0.039 ± 0.002
LoRA          | 0.389 ± 0.011    | 0.435 ± 0.006  | 0.435 ± 0.006
LoRA+ER       | 0.307 ± 0.008    | 0.287 ± 0.002  | 0.287 ± 0.002
Full+ER+EWC   | 0.058 ± 0.006    | 0.063 ± 0.007  | 0.081 ± 0.002
LoRA+EWC      | 0.448 ± 0.012    | 0.498 ± 0.007  | 0.498 ± 0.007
LoRA+ER+EWC   | 0.325 ± 0.008    | 0.332 ± 0.003  | 0.332 ± 0.003

5. Discussion

This study investigated continual-learning approaches for addressing temporal linguistic drift in Saudi-dialect offensive-language detection. The experiments provide insights into how different methods balance knowledge retention and adaptation in this setting.
The 13.4-percentage-point decline observed when evaluating the original model on the contemporary test set confirms that temporal drift affects model performance. However, this test set was intentionally enriched with newly emerged terms and context-shifting expressions to isolate adaptation capability, representing a diagnostic scenario with a concentrated presence of drift terms.
The mixed-distribution sets (20–80 and 40–60 contemporary/historical ratios) provide observations under conditions where historical patterns remain predominant (20–80) or moderately predominant (40–60), enabling evaluation under gradual temporal mixing. On these sets, the original model achieved 0.798 (mixed 20–80) and 0.691 (mixed 40–60) F1-macro, showing degradation even when contemporary content represents only 20–40% of samples. After continual learning, ER and Full+ER+EWC achieved the strongest performance (0.909 and 0.911 on mixed 20–80; 0.932 and 0.933 on mixed 40–60), while LoRA-based methods ranged from 0.854 to 0.870 on mixed 20–80 and from 0.822 to 0.852 on mixed 40–60. These mixed scenarios help contextualize the results from the contemporary test set.
The SimuReal test set introduces additional considerations: realistic class imbalance (80% non-offensive, 20% offensive), with drift terms constituting only 5% of samples (2% newly emerged, 3% context-shifting). This design keeps the original SOD patterns dominant while incorporating a small proportion of temporal drift. Under these conditions, the performance gaps between methods narrowed: ER achieved 0.842 compared with the original model's 0.817, and the scenario chiefly highlights minority-class detection challenges under class imbalance.
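The stated percentages translate deterministically into sample counts for the 500-sample set; a small illustrative helper (the function name is ours):

```python
def simureal_composition(n_total=500):
    """Sample counts implied by the SimuReal design (illustrative helper):
    80% non-offensive; offensive content split into 2% newly emerged
    terms, 3% context-shifting terms, and 15% general offensive samples
    (all percentages of the full set, so the offensive classes sum to 20%).
    """
    return {
        "non_offensive": round(0.80 * n_total),
        "offensive_new_terms": round(0.02 * n_total),
        "offensive_context_shifting": round(0.03 * n_total),
        "offensive_general": round(0.15 * n_total),
    }

counts = simureal_composition()
# For n_total = 500: 400 non-offensive, 10 new-term, 15 context-shifting,
# and 75 general offensive samples (100 offensive in total).
```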
Confusion matrices for all evaluated methods across the five test scenarios are provided in Appendix A Figure A1. For the contemporary test set, the original model exhibited a high false-negative rate on offensive samples containing evolved linguistic patterns; after continual learning, ER and Full+ER+EWC substantially reduced these errors, although some general offensive samples remained challenging for all methods. For SimuReal, false negatives persisted across methods, reflecting the difficulty introduced by class imbalance. False positives on SimuReal primarily involved non-offensive content containing lexically ambiguous expressions (words that could appear offensive in other contexts), a pattern more pronounced under imbalanced conditions, where the higher proportion of non-offensive samples increases exposure to such cases.
Table 9 summarizes observed trade-offs across methods on SimuReal, including per-class F1-scores.
The per-class results show that performance differences are more pronounced for the offensive class (F1-OFF), which contains the drift terms. LoRA-based methods scored approximately 4–6 percentage points lower on F1-OFF than full fine-tuning methods, while the gap on the non-offensive class (F1-NOT) was smaller (1–2 percentage points).
Several methodological choices in this study should be acknowledged. Hyperparameters for all three core techniques—LoRA rank and target modules, EWC regularization strength (λ), and ER buffer ratio—were selected through ablation studies using a criterion that prioritized retention. Different experimental objectives could lead to different selections. The LoRA configuration in our experiments targeted attention projection layers (query, key, value, and attention output dense) with rank r = 8. This represents a limited exploration of the LoRA design space. Targeting additional modules such as feedforward layers, or using different rank values, could yield different retention–adaptation trade-offs.
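For reference, the regularization term tuned by λ is the standard EWC quadratic penalty of Kirkpatrick et al., anchored at the pre-adaptation weights and scaled by the diagonal Fisher estimate. A minimal plain-Python sketch, using the λ = 5000 value selected in the ablation:

```python
def ewc_penalty(theta, theta_star, fisher, lam=5000.0):
    """Elastic Weight Consolidation penalty:
    (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2,
    where theta* are the weights after the original training phase and
    F_i is the diagonal Fisher information estimate for parameter i.
    lam = 5000 matches the value selected in the ablation study.
    """
    return 0.5 * lam * sum(
        f * (t - ts) ** 2 for t, ts, f in zip(theta, theta_star, fisher)
    )

# Parameters with high Fisher information are penalized more for drifting
# from their anchored values:
penalty = ewc_penalty(theta=[1.1, 0.5], theta_star=[1.0, 0.5], fisher=[2.0, 2.0])
# roughly 0.5 * 5000 * 2.0 * 0.01 = 50
```

Larger λ pulls the adapted model toward the anchor, which is why the ablation shows retention improving up to λ = 5000 before adaptation costs appear.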
Overall, these observations suggest that replay-based methods provided a more consistent balance between adaptation and retention in our experiments, though with increased training time. Parameter-efficient methods offered efficiency gains but showed reduced adaptation in this configuration.

6. Conclusions and Future Work

This study investigated continual-learning approaches for Saudi-dialect offensive-language detection under temporal linguistic drift. Using the SOD dataset and a newly collected contemporary dataset, we evaluated eight configurations—Experience Replay (ER), Elastic Weight Consolidation (EWC), Low-Rank Adaptation (LoRA), and their combinations—across five test scenarios designed to assess both adaptation and retention.
In our experiments, ER and Full+ER+EWC showed relatively favorable trade-offs between retention and adaptation, though with increased training time due to the larger combined training set. ER maintained 0.812 F1-macro on historical data while reaching 0.976 on contemporary patterns (KR = −0.035, AG = +0.264). On the SimuReal test set—designed with realistic class imbalance and only 5% drift terms—ER achieved 0.842 compared to the original model’s 0.817. LoRA-based methods showed lower adaptation in our experiments, with performance gaps of approximately 3–4 percentage points on SimuReal compared to full fine-tuning methods. These results reflect our specific configuration choices: we used a basic LoRA setup targeting only attention projection layers with rank r = 8, selected based on our weighted criterion in ablation studies. More comprehensive configurations—such as targeting feedforward layers or exploring different rank and scaling values—could yield different results.
Several limitations should be acknowledged. This study evaluates a single adaptation episode from historical to contemporary data; multi-round sequential adaptation may reveal cumulative patterns that our design cannot capture. The new training dataset is relatively small, reflecting the proof-of-concept orientation of this study. We do not claim exhaustive coverage of all drift terms; rather, the dataset is designed to probe representative drift phenomena under controlled experimental conditions.
Future research could explore multi-round sequential adaptation across multiple time periods. Alternative LoRA configurations—including feedforward layers and different rank values—warrant investigation for continual-learning settings. Automated or semi-automated approaches for detecting emerging offensive content, such as distributional shift monitoring or active learning pipelines, could support research in this area. Establishing systematic processes for ongoing data collection and annotation would further enable investigation of temporal adaptation.
This work provides initial observations on continual learning for Arabic-dialect offensive-language detection under temporal drift. The experimental framework, including test scenario design distinguishing diagnostic boundaries from realistic conditions, may inform future research in this area.

Author Contributions

Conceptualization, A.A. and M.S.; methodology, A.A.; software, A.A.; validation, A.A. and M.S.; formal analysis, A.A.; investigation, A.A.; resources, A.A.; data curation, A.A.; writing—original draft preparation, A.A.; writing—review and editing, A.A. and M.S.; visualization, A.A.; supervision, M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets presented in this study are available upon request from the corresponding author due to the culturally sensitive nature of Saudi-dialect offensive-language content. The experimental code (implementations of all continual-learning methods evaluated in this study) is publicly available at https://github.com/Afefa-Asiri/Continual-Learning-for-Saudi-Dialect-Offensive-Language-Detection-under-Temporal-Linguistic-Drift, accessed on 2 January 2026.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Confusion Matrices

Figure A1 presents confusion matrices for the original model and all eight continual-learning configurations across the five test scenarios (historical, contemporary, mixed 20–80, mixed 40–60, and SimuReal), providing class-wise diagnostic results complementary to the aggregate metrics in Section 4.
Figure A1. Confusion matrix analysis: All methods × all test sets. Rows represent methods (original, naïve FT, ER, EWC, LoRA, LoRA+ER, Full+ER+EWC, LoRA+EWC, and LoRA+ER+EWC); columns represent test scenarios (historical, contemporary, mixed 20–80, mixed 40–60, and SimuReal). Accuracy is reported for each configuration.

Figure 1. LoRA ablation study: Knowledge Retention (KR, left) and Adaptation Gain (AG, right) across ranks and module configurations.
Figure 2. EWC lambda ablation study: Knowledge Retention (KR, left) and Adaptation Gain (AG, right) across regularization strengths.
Figure 3. ER buffer size ablation study: Knowledge Retention (KR) and Adaptation Gain (AG) across replay ratios.
Figure 4. F1-macro performance across test scenarios for all methods. Error bars represent standard deviation across five random seeds.
Figure 5. Knowledge Retention vs. Adaptation Gain trade-off for all methods. Error bars represent standard deviation across five random seeds.
Figure 6. Parameter efficiency analysis comparing trainable parameters against F1-macro performance on historical (left) and contemporary (right) test sets.
Figure 7. Average training time in seconds for each continual-learning method.
Table 1. Examples of temporal linguistic drift categories in Saudi dialect.
Category         | Term                                | Offensive Context                                                                            | Non-Offensive Context
Newly Emerged    | زوط (zoot); زووط (zooot—elongated)  | انت زوط والله (You are a zoot, seriously—enta zoot wallah)                                   | زووط عليك 😂 (Zoot on you—zoot ʿalayk)
Context-Shifting | مربع (murabbaʿ—square)              | العقل مربع (The mind is square—al-ʿaql murabbaʿ)                                             | مشروع المربع بيكون علامة فارقة SA (The Al-Murabbaʿ project will be a landmark)
Context-Shifting | الفطاير (al-faṭāyir—pastries)       | ويش يبغون منك الفطاير؟ (What do they want from you, pastries?—wesh yibghūn minnak alfatayer) | مين يجيب الفطاير؟ (Who's bringing the pastries?—meen yjeeb alftayer)
Context-Shifting | رفقًا بهم (rifqan bihim—be gentle)   | رفقًا بهم (Be gentle with them—sarcastic)                                                     | رفقًا بهم في هذا الجو (Be kind to them in this weather—rifqan bihim fi hādhā al-jaw)
Context-Shifting | العفوية (al-ʿafawiyah—spontaneity)  | العفوية تتصل بك (Spontaneity is calling you—al-ʿafawiyah tittsil bik)                        | العفوية أجمل شي (Spontaneity is the best thing—al-ʿafawiyah ajmal shay)
Table 2. Dataset statistics and composition.
Dataset           | Size | Period    | Composition
Training (NEW_DS) | 2000 | 2024–2025 | New offensive terms + context-shifting + general offensive/non-offensive + original-context sentences
Historical Test   | 500  | 2019–2022 | Held-out from original SOD samples
Contemporary Test | 500  | 2024–2025 | 50% offensive, 50% non-offensive
Mixed 20–80       | 500  | Mixed     | 20% contemporary, 80% historical
Mixed 40–60       | 500  | Mixed     | 40% contemporary, 60% historical
SimuReal Test     | 500  | Mixed     | 80% non-offensive, 20% offensive (2% new terms, 3% context-shifting, 15% general)
Table 3. LoRA ablation study (mean ± std across 3 seeds).
Configuration    | Parameters | KR             | AG
r = 8, standard  | 444 K      | −0.038 ± 0.002 | +0.129 ± 0.008
r = 8, extended  | 591 K      | −0.055 ± 0.001 | +0.155 ± 0.001
r = 16, standard | 886 K      | −0.077 ± 0.003 | +0.171 ± 0.000
r = 16, extended | 1.2 M      | −0.102 ± 0.001 | +0.199 ± 0.006
r = 32, standard | 1.8 M      | −0.119 ± 0.002 | +0.194 ± 0.001
r = 32, extended | 2.4 M      | −0.130 ± 0.006 | +0.200 ± 0.005
Table 4. EWC lambda ablation study (mean ± std across 3 seeds).
Lambda (λ) | KR             | AG
100        | −0.064 ± 0.005 | +0.250 ± 0.001
500        | −0.060 ± 0.002 | +0.253 ± 0.002
1000       | −0.056 ± 0.002 | +0.253 ± 0.002
5000       | −0.053 ± 0.002 | +0.255 ± 0.001
10,000     | −0.062 ± 0.003 | +0.254 ± 0.001
Table 5. ER buffer size ablation study (mean ± std across 3 seeds).
Replay Ratio | KR             | AG
0.1          | −0.060 ± 0.007 | +0.253 ± 0.002
0.2          | −0.054 ± 0.005 | +0.258 ± 0.001
0.4          | −0.044 ± 0.004 | +0.261 ± 0.002
0.6          | −0.035 ± 0.006 | +0.263 ± 0.003
0.8          | −0.037 ± 0.010 | +0.261 ± 0.001
Table 6. Overall classification performance across test scenarios (mean ± std across 5 seeds).
Historical test set:
Method      | F1            | Acc           | Prec          | Rec
Original    | 0.847         | 0.897         | 0.842         | 0.852
Naïve FT    | 0.785 ± 0.007 | 0.833 ± 0.008 | 0.763 ± 0.006 | 0.843 ± 0.005
ER          | 0.812 ± 0.007 | 0.860 ± 0.007 | 0.789 ± 0.007 | 0.854 ± 0.007
EWC         | 0.795 ± 0.002 | 0.838 ± 0.003 | 0.772 ± 0.002 | 0.861 ± 0.004
LoRA        | 0.792 ± 0.001 | 0.838 ± 0.001 | 0.769 ± 0.001 | 0.852 ± 0.001
LoRA+ER     | 0.778 ± 0.003 | 0.822 ± 0.003 | 0.757 ± 0.002 | 0.850 ± 0.002
Full+ER+EWC | 0.813 ± 0.005 | 0.860 ± 0.005 | 0.790 ± 0.005 | 0.856 ± 0.004
LoRA+EWC    | 0.805 ± 0.002 | 0.850 ± 0.002 | 0.780 ± 0.002 | 0.860 ± 0.001
LoRA+ER+EWC | 0.784 ± 0.002 | 0.828 ± 0.002 | 0.762 ± 0.002 | 0.852 ± 0.003

Contemporary test set:
Method      | F1            | Acc           | Prec          | Rec
Original    | 0.713         | 0.732         | 0.818         | 0.732
Naïve FT    | 0.965 ± 0.002 | 0.965 ± 0.002 | 0.965 ± 0.002 | 0.965 ± 0.002
ER          | 0.976 ± 0.002 | 0.976 ± 0.002 | 0.976 ± 0.002 | 0.976 ± 0.002
EWC         | 0.968 ± 0.001 | 0.968 ± 0.001 | 0.968 ± 0.001 | 0.968 ± 0.001
LoRA        | 0.868 ± 0.001 | 0.869 ± 0.001 | 0.880 ± 0.001 | 0.869 ± 0.001
LoRA+ER     | 0.914 ± 0.001 | 0.914 ± 0.001 | 0.917 ± 0.001 | 0.914 ± 0.001
Full+ER+EWC | 0.976 ± 0.001 | 0.976 ± 0.001 | 0.976 ± 0.001 | 0.976 ± 0.001
LoRA+EWC    | 0.870 ± 0.001 | 0.871 ± 0.001 | 0.884 ± 0.001 | 0.871 ± 0.001
LoRA+ER+EWC | 0.905 ± 0.003 | 0.905 ± 0.003 | 0.909 ± 0.002 | 0.905 ± 0.003

Mixed 20–80 test set:
Method      | F1            | Acc           | Prec          | Rec
Original    | 0.798         | 0.802         | 0.829         | 0.802
Naïve FT    | 0.883 ± 0.004 | 0.883 ± 0.004 | 0.885 ± 0.004 | 0.883 ± 0.004
ER          | 0.909 ± 0.005 | 0.909 ± 0.005 | 0.910 ± 0.005 | 0.909 ± 0.005
EWC         | 0.894 ± 0.003 | 0.895 ± 0.003 | 0.899 ± 0.004 | 0.895 ± 0.003
LoRA        | 0.854 ± 0.002 | 0.854 ± 0.002 | 0.854 ± 0.002 | 0.854 ± 0.002
LoRA+ER     | 0.870 ± 0.001 | 0.870 ± 0.001 | 0.873 ± 0.001 | 0.870 ± 0.001
Full+ER+EWC | 0.911 ± 0.003 | 0.911 ± 0.003 | 0.912 ± 0.003 | 0.911 ± 0.003
LoRA+EWC    | 0.864 ± 0.001 | 0.864 ± 0.001 | 0.864 ± 0.001 | 0.864 ± 0.001
LoRA+ER+EWC | 0.863 ± 0.005 | 0.863 ± 0.005 | 0.865 ± 0.006 | 0.863 ± 0.005

Mixed 40–60 test set:
Method      | F1            | Acc           | Prec          | Rec
Original    | 0.691         | 0.708         | 0.768         | 0.708
Naïve FT    | 0.908 ± 0.003 | 0.909 ± 0.003 | 0.916 ± 0.003 | 0.909 ± 0.003
ER          | 0.932 ± 0.005 | 0.932 ± 0.005 | 0.936 ± 0.005 | 0.932 ± 0.005
EWC         | 0.910 ± 0.004 | 0.911 ± 0.004 | 0.919 ± 0.003 | 0.911 ± 0.004
LoRA        | 0.822 ± 0.002 | 0.822 ± 0.002 | 0.824 ± 0.002 | 0.822 ± 0.002
LoRA+ER     | 0.852 ± 0.002 | 0.852 ± 0.002 | 0.853 ± 0.002 | 0.852 ± 0.002
Full+ER+EWC | 0.933 ± 0.004 | 0.933 ± 0.004 | 0.937 ± 0.003 | 0.933 ± 0.004
LoRA+EWC    | 0.829 ± 0.001 | 0.829 ± 0.001 | 0.833 ± 0.001 | 0.829 ± 0.001
LoRA+ER+EWC | 0.843 ± 0.003 | 0.843 ± 0.003 | 0.843 ± 0.003 | 0.843 ± 0.003

SimuReal test set:
Method      | F1            | Acc           | Prec          | Rec
Original    | 0.817         | 0.888         | 0.834         | 0.803
Naïve FT    | 0.825 ± 0.008 | 0.871 ± 0.007 | 0.798 ± 0.008 | 0.880 ± 0.004
ER          | 0.842 ± 0.005 | 0.889 ± 0.004 | 0.818 ± 0.005 | 0.880 ± 0.006
EWC         | 0.833 ± 0.002 | 0.876 ± 0.002 | 0.804 ± 0.002 | 0.893 ± 0.004
LoRA        | 0.802 ± 0.001 | 0.857 ± 0.001 | 0.778 ± 0.001 | 0.847 ± 0.001
LoRA+ER     | 0.803 ± 0.002 | 0.853 ± 0.002 | 0.776 ± 0.002 | 0.860 ± 0.002
Full+ER+EWC | 0.844 ± 0.004 | 0.890 ± 0.004 | 0.820 ± 0.005 | 0.883 ± 0.004
LoRA+EWC    | 0.813 ± 0.002 | 0.866 ± 0.002 | 0.789 ± 0.003 | 0.853 ± 0.001
LoRA+ER+EWC | 0.806 ± 0.001 | 0.857 ± 0.001 | 0.780 ± 0.001 | 0.858 ± 0.001
In several balanced settings, accuracy may appear close to recall due to the class distribution and thresholding, though the metrics still capture different error profiles.
Table 7. Knowledge Retention and Adaptation Gain analysis (mean ± std across 5 seeds).
Method      | KR             | AG             | Historical F1 | Contemporary F1
Naïve FT    | −0.062 ± 0.007 | +0.253 ± 0.002 | 0.785 ± 0.007 | 0.965 ± 0.002
ER          | −0.035 ± 0.007 | +0.264 ± 0.002 | 0.812 ± 0.007 | 0.976 ± 0.002
EWC         | −0.052 ± 0.002 | +0.255 ± 0.001 | 0.795 ± 0.002 | 0.968 ± 0.001
LoRA        | −0.055 ± 0.001 | +0.155 ± 0.001 | 0.792 ± 0.001 | 0.868 ± 0.001
LoRA+ER     | −0.069 ± 0.003 | +0.202 ± 0.001 | 0.778 ± 0.003 | 0.914 ± 0.001
Full+ER+EWC | −0.034 ± 0.005 | +0.264 ± 0.001 | 0.813 ± 0.005 | 0.976 ± 0.001
LoRA+EWC    | −0.043 ± 0.002 | +0.158 ± 0.001 | 0.805 ± 0.002 | 0.870 ± 0.001
LoRA+ER+EWC | −0.063 ± 0.002 | +0.192 ± 0.003 | 0.784 ± 0.002 | 0.905 ± 0.003
Table 9. Summary of method trade-offs on SimuReal (mean across 5 seeds).
Method      | F1-Macro | F1-OFF | F1-NOT | Parameters    | Training Time
Full+ER+EWC | 0.844    | 0.760  | 0.929  | 135 M (100%)  | 69.0 s
ER          | 0.842    | 0.757  | 0.928  | 135 M (100%)  | 56.4 s
EWC         | 0.833    | 0.748  | 0.917  | 135 M (100%)  | 28.6 s
Naïve FT    | 0.825    | 0.735  | 0.915  | 135 M (100%)  | 23.5 s
LoRA+EWC    | 0.813    | 0.713  | 0.913  | 591 K (0.44%) | 21.2 s
LoRA+ER+EWC | 0.806    | 0.707  | 0.906  | 591 K (0.44%) | 50.0 s
LoRA+ER     | 0.803    | 0.703  | 0.902  | 591 K (0.44%) | 49.4 s
LoRA        | 0.802    | 0.699  | 0.906  | 591 K (0.44%) | 20.9 s

Share and Cite

MDPI and ACS Style

Asiri, A.; Saleh, M. Continual Learning for Saudi-Dialect Offensive-Language Detection Under Temporal Linguistic Drift. Information 2026, 17, 99. https://doi.org/10.3390/info17010099

