1. Introduction
Social media was developed to enhance communication and facilitate the seamless exchange of information, enabling users to share content, including text, photos, audio, and video. The rapid progression of technology has yielded substantial advantages, improving worldwide communication, cooperation, and the exchange of experiences. Nonetheless, this expanded connectedness has made social platforms a venue for unethical and detrimental behavior, such as cyberbullying [
1]. The improper use of social media to disseminate hateful and biased material has resulted in a significant increase in cyberbullying incidents. Cyberbullying is a distinct kind of online harassment which is often intentional, recurrent, and designed to inflict psychological distress and humiliation on people or groups. Cyberbullying can have a profound effect on its victims, often resulting in heightened stress, anxiety, sadness, and a significant reduction in self-esteem [
2]. The extensive consequences of online abuse underscore the pressing need to comprehend and detect cyberbullying, while simultaneously promoting safe online environments for everyone. This urgent need underscores the significance of creating resilient and privacy-centric cyberbullying detection systems that safeguard users by promptly recognizing and mitigating harmful material [
3] in real time [
4]. This proactive strategy mitigates the detrimental impacts of cyberbullying and provides a more secure digital environment for all users.
Diverse machine learning (ML) techniques have effectively detected cyberbullying using different data sources, including textual, visual, and behavioral information. Deep learning (DL) techniques, including Recurrent Neural Networks (RNNs) [5], Long Short-Term Memory (LSTM) networks [6], and Transformer models [7], further facilitate the detection of trends and indicators of abusive online behavior by examining patterns in sequential and contextual data. These algorithms are especially proficient at detecting cyberbullying due to their ability to distinguish emerging patterns, semantic links, and sequences in digital dialogues [
8]. Furthermore, ensemble approaches like Random Forest (RF), Gradient Boosting (GB), and Stacking integrate many classifier outcomes, resulting in a more robust detection system [
4]. Text classification is crucial for identifying cyberbullying, employing ML and DL models to categorise text-based social media content, including posts, comments, and messages. Through the analysis of word frequencies, n-grams, and semantic embeddings, these models can proficiently classify the material as either cyberbullying or non-cyberbullying. Further, natural language processing (NLP) methods are essential for deriving significant insights from text, including sentiment analysis, topic modelling, named entity recognition, part-of-speech tagging, and rule-based language systems. Thus, NLP is crucial for identifying inappropriate words and contextual indicators often linked to cyberbullying [
9].
Privacy and transparency in cyberbullying detection are crucial, particularly since conventional detection approaches need the centralization of user data, thus jeopardizing privacy by revealing sensitive information. Federated learning (FL) mitigates privacy issues by allowing decentralised training of models without transferring user data outside their protected settings [
10]. Each organisation or user maintains authority over its local data, enabling model updates, reducing privacy issues, and facilitating compliance with data protection regulations such as GDPR [
11]. In FL, data preparation, local model training, and secure aggregation are essential to allow the aggregation of models from several organisations or users into a global model. This consolidated approach leverages the pool of data sources, improving the global model's precision and resilience in identifying cyberbullying while maintaining user privacy. Furthermore, Explainable AI (XAI) methodologies, including Local Interpretable Model-agnostic Explanations (LIME), enhance the transparency of the model [
12]. Utilising LIME elucidates the model’s conclusions, such as the rationale for classifying certain content as cyberbullying, offering insights that enhance trust among users and stakeholders [
13]. Together, FL and XAI provide a privacy-focused and transparent framework [
14].
1.1. Motivation
The rise of online communication platforms has resulted in a notable surge in cyberbullying instances, necessitating the development of advanced detection algorithms that function effectively while protecting user privacy. FL offers a compelling alternative to conventional centralized ML approaches, which can include significant privacy concerns due to the need for centralized data storage. FL enables model training across decentralized data sources, improving data security by maintaining user data locally. To augment FL’s efficiency, we use Transfer Learning (TL), allowing the model to utilize existing knowledge from analogous tasks, enhancing training speed and accuracy despite constrained data availability. As detection models become more intricate, comprehending and elucidating model predictions becomes crucial. Utilizing XAI methodologies, particularly LIME, the proposed strategy not only detects cyberbullying but also offers clear insights into the model’s decision-making process. This integration guarantees the interpretability of the model’s activities, augmenting user and stakeholder confidence. This study aims to deliver a comprehensive, secure, and transparent solution for cyberbullying detection by integrating FL, TL, and XAI, tackling the technical, ethical, and interpretability challenges in digital environments while maintaining data privacy and security.
1.2. Research Contribution of This Article
This study explores various TL-based NLP algorithms for cyberbullying detection.
This work integrates ensemble FL (EFL) and TL with DP, enabling decentralized cyberbullying detection with secure data processing while adhering to stringent data protection regulations and safeguarding user privacy.
By employing LIME, the study enhances model transparency, making its decisions interpretable and fostering trust among stakeholders.
The proposed ensemble model achieves a remarkable baseline accuracy of 98.19%, and 96.37% after FL and DP are incorporated, validated through k-fold cross-validation with an average accuracy of 96.07%.
1.3. Organisation
The paper’s structure and key sections are as follows. We review related works in
Section 2, present the proposed system architecture for cyberbullying detection in
Section 3, discuss system performance evaluation in
Section 4, and conclude in
Section 5, also discussing future directions.
2. Related Works
This section presents the results of prior studies and a literature review on several cyberbullying detection approaches. Salawu et al. [
3] assessed the effectiveness of ML and NLP techniques in detecting cyberbullying. Their study classifies cyberbullying detection strategies into supervised learning, lexicon-based, rule-based, and mixed-initiative approaches. In addition, their research investigates the ethical and social consequences of automated systems designed to identify cyberbullying. P. Galan-García et al. [
15] suggested linking counterfeit Twitter (now X) accounts to the act of cyberbullying. The researchers used RF, Variable Importance Measures (VIMs), and OneR algorithms to examine tweets’ linguistic patterns and emotional content originating from troll and authentic accounts. Their methodology was evaluated using a dataset of 2000 Twitter tweets, and the OneR algorithm successfully detected and associated trolls in more than 80% of instances. K. Maity et al. [
16] presented a model called “MTBullyGNN”, designed to identify instances of code-mixed cyberbullying. MTBullyGNN employs a graph neural network (GNN) to identify and examine cases of cyberbullying. The GNN efficiently detects nodes (sentences) without labels or with inaccurate labels by gathering data from similar nodes. MTBullyGNN performs better than the most advanced algorithms on single-task BullySent and Hindi–English code-mixed datasets.
Ottosson [
17] developed a linguistic model specifically intended to identify cyberbullying on social media sites. The research attempted to narrow the disparity in platform moderation by using the GPT-3 Large Language Model (LLM). The findings indicated that the proposed model has abilities comparable to previous models, and that fine-tuning an LLM successfully enhances cyberbullying identification, resulting in an accuracy rate of 90%. Alhloul and Alam [
18] conducted research in which they created a DL-based system to detect harassing tweets. The researchers suggested a CNN–attention architecture that integrated an attention layer with a convolutional pooling layer, effectively extracting cyberbullying-related terms from users’ tweets. The research conducted experiments using two different sets of pairings. At first, CNN was combined with ML models, with the convolutional layers serving as feature extractors and ML models like RF and LR being used for classification. The following methodology used combinations such as CNN-XGB and CNN-LSTM for categorisation. The results demonstrated that the CNN–attention framework achieved a remarkable accuracy of 97.10% compared to other learning models.
Mestry et al. [
19] created a CNN that utilizes fastText word embeddings to detect and categorize harmful and offensive remarks on social networking sites according to their toxicity. The model demonstrated superior accuracy in processing vernacular, jargon, and typographical mistakes and regularly encountered abbreviations in messages. John M. et al. [
20] used a supervised ML technique to detect and mitigate cyberbullying. They used many classifiers to learn and identify bullying behaviors. The assessment of the suggested approach on a dataset related to cyberbullying showed that the Neural Network surpassed other models, with an accuracy of 92.8%. At the same time, the SVM earned a slightly worse accuracy of 90.3%. Qudah et al. [
21] introduced an improved approach for identifying cyberbullying, incorporating an adaptive external dictionary (AED). The authors used ML models, including RF, XGB, and CatBoost, and developed ensemble voting models. The findings demonstrated that the suggested ensemble voting model, when integrated with AED, yielded higher accuracy in identifying instances of cyberbullying. Maram et al. [
22] proposed a cyberbullying detection method using sentiment analysis and ML. Traditional ML methods are effective, but they struggle with online emotional tones. They used sentiment analysis to filter out neutral and positive content and focus on detrimental messages to improve classification accuracy. The authors used SMOTE resampling to address data imbalance and improve model performance across six cyberbullying categories. The Extra Tree model showed the highest accuracy (95.38%) even in balanced datasets.
Mathur et al. [
23] created a system to identify Twitter cyberbullying in real time. The system utilised NLP and ML techniques. The system underwent training using a dataset consisting of tweets related to cyberbullying, and the effectiveness of several ML methods was evaluated and compared. Their research showed that by meticulously choosing preprocessing procedures and optimising the RF model, a remarkable accuracy of 94.06% was attained. Bokolo and Liu [
24] used a DL method, which aims to automatically identify instances of cyberbullying on social media sites. They conducted a comparative analysis of three ML models, namely Naive Bayes (NB), SVM, and Bidirectional Long Short-Term Memory (Bi-LSTM). The analysis was performed using a Twitter dataset, and the findings demonstrated that Bi-LSTM surpassed the other models, with a remarkable accuracy of 98%. The SVM achieved a high accuracy rate of 97%, while the NB algorithm performed less well with an accuracy of 85%. Araque et al. [
25] proposed an ensemble approach for the detection of hate speech through affective computing. It presented feature extraction techniques utilising AffectiveSpace and SenticNet, improving classification efficacy by integrating affective features with conventional textual representations across various datasets. Chiril et al. [
26] proposed emotionally informed hate speech detection from a multi-target perspective utilising annotated datasets to discern specific expressions of hate speech across diverse subjects and targets, employing affective knowledge from EmoSenticNet and Hurtlex. The multi-task models surpassed single-task models in the detection of hate speech.
Muneer et al. [
27] proposed a refined version of BERT and a stacking ensemble model to detect cyberbullying on social media. The researchers used a continuous bag-of-words model in conjunction with a word2vec-like technique for extracting features to determine the weights in the embedding layer. The stacking ensemble learning approach displayed an exceptional accuracy of 97.4%. Several deep learning models, such as Conv1D-LSTM, LSTM, CNN, BiLSTM_Pooling, BiLSTM, and GRU, were used. The findings highlighted the dominance of the attention-based Conv1D-LSTM classifier compared to other methods, obtaining a peak accuracy of 94.49%. Federated learning (FL) enables decentralized model training by sharing model parameters instead of raw data. FL has been applied to healthcare, finance, and mobile applications, and recently in NLP tasks including next-word prediction, sentiment analysis, and toxicity detection. Despite this progress, FL–NLP still faces challenges such as non-IID data distributions, high communication cost, and privacy risks from inference attacks. Numerous studies have extensively documented the practical uses of FL since its inception. Gboard has used FL to train a model for predicting the next word [
28,
29] and to evaluate and implement a model for improved suggestions for online searches, GIFs, and stickers [
30]. FL is used in the medical domain to safeguard patient anonymity while training an image classification algorithm capable of diagnosing COVID-19 by evaluating X-ray pictures from several hospitals [
31,
32]. Qureshi et al. [
33] proposed a hybrid feature fusion model that integrates feature-based and graph-based methodologies for credibility evaluation and attained 95.6% accuracy in credibility assessment.
Bakopoulou et al. [
34] proposed an FL technique that enables devices to collectively train a global model without uploading locally gathered training data. They demonstrated the system’s efficacy in two classification tasks: forecasting personally identifiable information (PII) exposure and identifying ad requests in individual packets. Guo et al. [
35] created an FL framework called FEAT that accurately identifies traffic in various contexts without violating user privacy. This framework tackles the issue of imbalanced client data, which substantially affect the effectiveness of FL-based methods, particularly in the classification of mobile network traffic, due to the diverse range of endpoint setups. Aouedi et al. [
36] investigated a novel federated semi-supervised learning method that utilises labelled and unlabeled data. This methodology used the NSL-KDD and authentic industrial datasets to address network traffic and diverse cyberattacks. Zeng et al. [
37] examined the gradient-matching federated domain adaptation (GM-FedDA) method for categorising brain pictures. They aimed to minimise domain differences and develop accurate local federated models for particular target areas. Basu et al. [
38] used differential privacy (DP) to categorise financial text with privacy protection in finance. Their research created a system that protects sensitive financial information while keeping accurate categorisation.
Recent works have specifically explored federated learning for cyberbullying and toxicity detection. Samee et al. [
1] fused word embeddings and emotional cues within FL, achieving robust accuracy with BERT, CNN, and LSTM on multi-platform datasets. Their work included a theoretical DP-based analysis but did not implement formal guarantees or non-IID evaluation. Khan et al. [
39] proposed a decentralized ring-topology FL to avoid reliance on a central server. While effective in decentralization, their framework assumed IID distributions and lacked formal DP. Nagy et al. [
40] developed local DP with quantization and randomized response for privacy-preserving NLP in FL, though their study was task-agnostic and not cyberbullying-specific. Sharma et al. [
41] introduced an FL pipeline for encrypted social media platforms, leveraging metadata rather than textual content, with DP and secure aggregation under explicit non-IID distributions. Shetty et al. (FedBully) [
42] demonstrated cross-device FL for binary cyberbullying detection using sentence encoders, achieving 93% AUC on IID and 91% on non-IID splits, but without DP or explainability. Alabdali et al. [
43] combined blockchain with FL for cyberbullying detection, enhancing auditability but without integrating differential privacy. Complementary surveys such as that by Khan et al. [
44] systematically reviewed 36 FL–NLP papers, highlighting open issues in convergence, robustness, and the absence of explainability and privacy guarantees.
Most prior FL-based cyberbullying works either (i) restrict themselves to binary detection tasks (e.g., FedBully [
42]), (ii) focus on decentralization or auditability without formal privacy guarantees (e.g., Khan et al. [
39], Alabdali et al. [
43]), or (iii) implement privacy or fairness without applying them to cyberbullying (e.g., Nagy et al. [
40]). Our framework advances the state of the art along three axes: Methodology, enabling multi-class cyberbullying detection with an ensemble of Transformer models; Privacy, incorporating formal $(\epsilon, \delta)$-differential privacy with Gaussian noise, clipping, and secure aggregation; and Evaluation, explicitly modeling non-IID distributions and partial participation while integrating explainability via LIME with fidelity checks. Together, these contributions position our approach as a robust and trustworthy FL solution for social media moderation. The widespread effectiveness of FL across several domains makes it an enticing methodology for identifying cyberbullying. FL allows several organisations, like social networking sites, online forums, and academic institutions, to train a precise cyberbullying detection model [
45]. This is achieved using a distributed architecture that ensures the anonymity of users. This method has been successfully implemented in other industries, such as healthcare, where confidential patient information stays inside each healthcare provider. FL with DP in cyberbullying detection enables the consolidation of knowledge and data from several sources while preserving the privacy of each person [
46]. This technique can significantly improve the effectiveness, dependability, and ethical aspects of identifying cyberbullying, hence promoting a safer and more inclusive online community. Recent breakthroughs in XAI have underscored the essential need for interpretability in intricate deep learning models, especially in areas containing sensitive or user-generated content. Study [
47] presents a thorough taxonomic analysis of LIME upgrades, tackling critical issues including fidelity, stability, and domain applicability. In the field of cyberbullying detection, study [
48] combines LIME and SHAP with user-specific LSTM models, illustrating how XAI tools uncover fundamental predictors—such as race, gender, and previous victimization history—with remarkable accuracy (98%). This research underscores how LIME enhances model openness and ethical accountability by highlighting essential decision-influencing aspects, particularly in socially sensitive NLP tasks. Recent works collectively indicate that LIME and its expansions are progressively tailored for privacy-sensitive applications.
3. Proposed System Architecture
The suggested system paradigm, shown in
Figure 1, has three separate tiers: client-side, server-side, and global model aggregation and assessment. The layers are interlinked to provide cyberbullying detection using FL and DP, ensuring data privacy and decentralisation. This section will examine each layer comprehensively and delineate the mechanisms that transpire at each step.
3.1. Client-Side Operations
In FL, the onus of data gathering and local model training is allocated among clients, such as user devices, smartphones, or edge devices. The server orchestrates the model training while the data processing and model learning happen on the client side. This decentralised method guarantees that no raw data is sent from clients to the server, safeguarding data privacy. The procedure starts when the server alerts a selected group of clients to engage in the ongoing FL training session. Every client is accountable for training a localised model on their data, therefore ensuring that the model conforms to the distinct attributes of each client’s dataset. Let
$C_i$ be the $i$-th client in the FL system, where $i \in \{1, 2, \ldots, N\}$, and $N$ signifies the total number of clients participating in the current federated learning round. The local data for client $C_i$ is represented as $D_i = \{(x_j, y_j)\}$, where $j = 1, 2, \ldots, n_i$. In this form, $x_j$ denotes the individual input data points, such as social media posts, chat messages, or user-generated content, while $y_j$ signifies the label associated with each data point. Upon data collection, the client starts the data preparation step. This step is essential since raw data generally has noise, extraneous information, and discrepancies. Data preparation guarantees that the data presented is clean, normalised, and organised before training. The following operations are executed during this phase.
Data Cleaning: Data cleaning includes eliminating extraneous characters, rectifying spelling errors, and discarding useless information such as URLs and special symbols.
Normalisation: Text normalisation includes converting all text to lowercase, extending abbreviations (e.g., “u” to “you”), and standardising punctuation marks.
Tokenisation: During this phase, the unprocessed text is divided into smaller components known as tokens. Tokens may consist of words, subwords, or characters, contingent upon the tokenisation method used by the model.
The data $D_i$ is prepared for training upon preprocessing. The preprocessed data is divided into three subsets: the training set $D_i^{train}$, the validation set $D_i^{val}$, and the testing set $D_i^{test}$. This division guarantees that the model is trained, verified, and tested on distinct data segments, minimising the likelihood of overfitting.
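As a concrete illustration of the preparation pipeline described above, the following minimal Python sketch cleans, normalises, and splits a hypothetical local client dataset; the column names, abbreviation map, and regular expressions are illustrative placeholders rather than the exact pipeline used in our experiments.

```python
import re
import pandas as pd
from sklearn.model_selection import train_test_split

def clean_and_normalise(text: str) -> str:
    """Data cleaning and normalisation: strip URLs and special symbols, lowercase, expand common abbreviations."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)          # discard URLs
    text = re.sub(r"[^a-zA-Z0-9\s.,!?']", " ", text)             # drop special symbols
    text = text.lower()
    abbreviations = {"u": "you", "r": "are", "pls": "please"}    # illustrative abbreviation map
    return " ".join(abbreviations.get(tok, tok) for tok in text.split())

# Hypothetical local client dataset D_i with 'text' and 'label' columns
df = pd.DataFrame({
    "text": ["U r so dumb!!", "Great game last night :)", "nobody likes u, just leave",
             "Happy birthday!", "go away loser http://spam.example", "See you at practice"],
    "label": [1, 0, 1, 0, 1, 0],
})
df["text"] = df["text"].apply(clean_and_normalise)

# Split into D_train, D_val, and D_test (roughly 70/15/15 here)
train_df, temp_df = train_test_split(df, test_size=0.3, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)
```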
Upon data preparation, the client-side model is configured. The client-side model uses sophisticated TL-based NLP architectures like BERT (Bidirectional Encoder Representations from Transformers) or RoBERTa (Robustly Optimized BERT Pretraining Approach). These models are explicitly designed to apprehend the contextual significance of words, which is essential for comprehending the nuanced intricacies of cyberbullying discourse. Each client customises their model using hyperparameters $h_i$, including the learning rate $\eta$, the model's layer count, and the batch size (see Appendix A, Table A1 for detailed configurations). Upon model configuration, the server alerts the client to begin the local training procedure. The model is trained using the client's dataset $D_i^{train}$ during local training. The training procedure seeks to minimise a loss function, such as cross-entropy loss, quantifying the disparity between anticipated and real labels. Gradient descent is used to modify the model parameters $\theta$ at each iteration:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} \mathcal{L}(\theta_t)$$

In this equation, $\theta_{t+1}$ denotes the revised parameters for the model following the $t$-th iteration, $\eta$ signifies the learning rate that regulates the magnitude of the parameter update, and $\nabla_{\theta} \mathcal{L}(\theta_t)$ represents the gradient of the loss function concerning the model parameters $\theta$. Upon completion of local model training, the client shares the modified model parameters $\theta_i$ with the server.
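A minimal sketch of the local fine-tuning step is shown below, assuming a Hugging Face Transformer classifier trained with AdamW; the model name, hyperparameter values, and helper function are illustrative and do not reproduce the exact configuration reported in Appendix A, Table A1.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical client hyperparameters h_i (not the exact Appendix A values)
MODEL_NAME, LR, BATCH_SIZE, NUM_LABELS = "distilbert-base-uncased", 2e-5, 16, 6

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)

def local_training_epoch(texts, labels):
    """One epoch of local training on D_i^train: minimise cross-entropy via gradient descent."""
    model.train()
    enc = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
    data = list(zip(enc["input_ids"], enc["attention_mask"], torch.tensor(labels)))
    for input_ids, attention_mask, y in DataLoader(data, batch_size=BATCH_SIZE, shuffle=True):
        optimizer.zero_grad()
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()   # gradient of the cross-entropy loss w.r.t. theta
        optimizer.step()      # theta_{t+1} = theta_t - eta * grad
    # Parameters theta_i that would be shared with the server after (optional) DP noising
    return {k: v.detach().clone() for k, v in model.state_dict().items()}
```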
3.2. Server-Side Operations
The server, referred to as $S$, manages the FL process. The server's principal duty is to oversee client interactions, disseminate the global model, and consolidate client-side modifications into a revised global model. In our setup, we adopt a cross-device FL scenario with 20 clients, of which 50% are randomly selected per round. Training is conducted for 100 communication rounds, with a non-IID data partitioning (Dirichlet $\alpha = 0.5$) to simulate heterogeneous client data. To further reflect practical conditions, 10% of clients are assumed to drop out per round, and a 1 MB communication budget is enforced for each client update. At the commencement of each training round $t$, the server picks a subset of clients $S_t$ from the whole pool of clients. The subset $S_t$ comprises clients that satisfy the server's selection requirements (e.g., processing capacity, consistent network connectivity, etc.).
Upon selection of the clients, the server disseminates the current global model with parameters $\theta_G^{(t)}$ to each client inside the chosen subset. Subsequently, each client employs these parameters as the foundation for localised training. The distribution of the model parameters is as follows:

$$\theta_i^{(t)} \leftarrow \theta_G^{(t)}, \quad \forall \, C_i \in S_t$$

This procedure guarantees all clients start training from an identical base model, refined via prior iterations. Each designated client thereafter trains its local model with its data and transmits the revised model parameters $\theta_i^{(t)}$ back to the server. Upon receiving the updated model parameters from each client, the server aggregates the client models to create a new global model $\theta_G^{(t+1)}$. The server aggregates the models with an aggregation method:

$$\theta_G^{(t+1)} = \sum_{C_i \in S_t} \frac{n_i}{n} \, \theta_i^{(t)}, \qquad n = \sum_{C_i \in S_t} n_i$$

Here, $n$ is the total number of data samples from all participating clients, and $n_i$ is the number of data samples on client $C_i$. Employing aggregation guarantees that clients with larger datasets significantly influence the global model update, mitigating bias introduced by clients with smaller datasets. The server iteratively aggregates client models in each round.
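The weighted aggregation rule can be sketched as follows, assuming each selected client returns a PyTorch state_dict together with its local sample count; the function and variable names are illustrative.

```python
import torch

def fedavg(client_states, client_sizes):
    """FedAvg: theta_G = sum_i (n_i / n) * theta_i, with n = sum_i n_i over the selected clients."""
    n = float(sum(client_sizes))
    aggregated = {}
    for key in client_states[0]:
        # Integer buffers (e.g., position ids) may need to be copied rather than averaged in practice.
        aggregated[key] = sum((n_i / n) * state[key].float()
                              for state, n_i in zip(client_states, client_sizes))
    return aggregated

# Usage sketch for one round: states returned by the selected clients S_t
# global_state = fedavg([theta_1, theta_2, theta_3], client_sizes=[1200, 800, 500])
# global_model.load_state_dict(global_state)
```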
3.3. Ensemble-Based Federated Classification Framework
Dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ consists of $N$ samples. Here, $x_i$ represents the $i$-th textual input, and $y_i \in \{1, \ldots, C\}$ denotes the corresponding label for a classification problem with $C$ classes. The proposed framework employs three pretrained Transformer models: DistilBERT ($M_1$), RoBERTa ($M_2$), and ELECTRA ($M_3$). Each model is independently fine-tuned on data from federated clients. Each model translates a tokenized input $x$ into a real-valued logit vector $z_k$, which denotes the unnormalized class scores:

$$z_k = M_k(x) \in \mathbb{R}^{C}, \quad k \in \{1, 2, 3\}$$

The final ensemble logit vector $z_{\mathrm{ens}}$ is calculated using the arithmetic mean (late fusion) of the outputs from individual models:

$$z_{\mathrm{ens}} = \frac{1}{3} \sum_{k=1}^{3} z_k$$

The predicted class label $\hat{y} = \arg\max_{c} \, \mathrm{softmax}(z_{\mathrm{ens}})_c$ is obtained by applying the softmax function and selecting the class that exhibits the highest probability.
Each Transformer model undergoes local fine-tuning on clients within an FL framework, followed by periodic global aggregation. The ensemble fusion mechanism functions as a regularisation strategy, reducing model-specific biases and enhancing the system’s generalisation capability across diverse client data. For ensemble prediction, we employ a late fusion strategy in which the logits from DistilBERT, RoBERTa, and ELECTRA are averaged to form the final prediction. This approach reduces model-specific bias and improves generalization across heterogeneous client data. While late fusion was adopted due to its simplicity and robustness in privacy-preserving settings, alternative ensemble fusion strategies could also be explored, including weighted averaging (where model contributions are proportional to validation performance), majority voting, or stacking meta-learners (where a secondary classifier is trained on the outputs of base models). Future work may investigate these alternatives to further optimize performance and robustness under differential privacy constraints.
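For illustration, the late-fusion rule can be written compactly as below, assuming the three fine-tuned models and their matching tokenizers are already loaded; all names are placeholders and the snippet covers inference on a single text.

```python
import torch

def ensemble_predict(models, tokenizers, text):
    """Late fusion: average the per-model logit vectors z_k, then softmax/argmax for the final label."""
    logits = []
    with torch.no_grad():
        for model, tokenizer in zip(models, tokenizers):
            enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
            logits.append(model(**enc).logits)        # z_k in R^C for this model
    z_ens = torch.stack(logits).mean(dim=0)           # arithmetic mean of the three logit vectors
    probs = torch.softmax(z_ens, dim=-1)
    return probs.argmax(dim=-1).item(), probs

# Usage sketch, assuming the fine-tuned models/tokenizers are already loaded:
# label, probs = ensemble_predict([distilbert, roberta, electra],
#                                 [distilbert_tok, roberta_tok, electra_tok],
#                                 "example post to classify")
```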
3.4. Global Model Evaluation and Deployment
After the server aggregates the client models to formulate a new global model $\theta_G^{(t+1)}$, it assesses the model's efficacy using a validation set. The assessment step is essential for determining the model's ability to effectively identify cyberbullying across diverse datasets. A variety of performance measures are used in this assessment.
Accuracy: This metric evaluates the overall correctness of the model's predictions. It is the ratio of accurately categorised cases to total occurrences.
Precision: Precision quantifies the proportion of occurrences identified as cyberbullying that are indeed cyberbullying. It is advantageous when the expense of false positives is substantial.
Recall: Recall assesses the model’s ability to recognise all cyberbullying episodes accurately. It is advantageous when the repercussions of false negatives are significant.
F1-Score: The F1-score represents the harmonic mean of Precision and Recall. It offers a balanced assessment of both, especially when the dataset is skewed (a reference computation is sketched below).
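A reference computation of these metrics is sketched below using scikit-learn; macro averaging over the cyberbullying classes is an assumption made here for the multi-class setting.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged Precision, Recall, and F1 over the cyberbullying classes."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}

# Example: evaluate([0, 1, 2, 1], [0, 1, 1, 1])
```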
The server concludes the training, and the global model is subsequently implemented for real-time detection.
3.5. Algorithms
In an FL system, each client (denoted $C_i$) is essential for the local training of a model using its dataset $D_i$. In the data acquisition phase, the client gathers local data $D_i$, including input data points $x_j$ and their corresponding labels $y_j$. This ensures that each client uses their private dataset, embodying the client's particular characteristics. The data preparation phase ensures it is cleaned and preprocessed to make it suitable for training.
The preprocessing steps include data cleaning (removing extraneous characters, correcting spelling errors, and eliminating irrelevant information such as URLs), normalisation (converting text to lowercase, expanding abbreviations, and standardising punctuation), and tokenisation (segmenting the text into smaller units, such as words or subwords). Thereafter, it is divided into $D_i^{train}$, $D_i^{val}$, and $D_i^{test}$ sets to ensure the model's capacity for effective generalisation and to mitigate overfitting. The model setup begins by initialising the model with client-specific hyperparameters $h_i$, which include the learning rate $\eta$, along with other parameters and the model architecture. The goal is to reduce the loss $\mathcal{L}(\theta)$. After concluding local training, DP is added to the model weights, and then the updated model $\theta_i$ is sent to the server for aggregation. We adopt the Gaussian mechanism for differential privacy, with noise scale $\sigma$ applied to clipped gradients (L2 norm bound = 1.0). Privacy guarantees are expressed as $(\epsilon, \delta)$-DP. We use the moments accountant to track cumulative privacy loss across communication rounds. Noise injection is performed on the client side prior to transmission of model updates, as outlined in Algorithm 1.
Algorithm 1: Client-Side Local Model Training for Client $C_i$
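The client-side privatisation step (clipping followed by Gaussian noising of the update) can be sketched as follows; the noise multiplier and function names are illustrative, and a production deployment would typically delegate per-example clipping and moments-accountant bookkeeping to a DP library such as Opacus.

```python
import torch

def privatize_update(local_state, global_state, clip_norm=1.0, noise_multiplier=1.1):
    """Clip the client update to an L2 bound and add Gaussian noise before sending it to the server."""
    # Update delta_i = theta_i - theta_G and its overall L2 norm
    delta = {k: local_state[k].float() - global_state[k].float() for k in local_state}
    total_norm = torch.sqrt(sum((d ** 2).sum() for d in delta.values()))
    scale = min(1.0, clip_norm / (float(total_norm) + 1e-12))    # enforce the L2 norm bound
    noisy_state = {}
    for k, d in delta.items():
        clipped = d * scale
        noised = clipped + torch.randn_like(clipped) * noise_multiplier * clip_norm  # Gaussian mechanism
        noisy_state[k] = global_state[k].float() + noised
    return noisy_state
```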
The server side directs the central coordination of the FL training and aggregation process (Algorithm 2). The server, referred to as $S$, begins each training round by choosing a subset of clients $S_t$ from the whole pool of clients $\{C_1, \ldots, C_N\}$. This decision is based on certain factors, including client availability and network connection. Upon choosing the clients, the server disseminates the current global models and parameters $\theta_G^{(t)}$ to each client inside the subset $S_t$. Upon completion of their local training, clients transmit their revised models and parameters $\theta_i^{(t)}$ to the server. The server then consolidates these revised model parameters using an aggregation technique for each model and later shares the updated model with all clients.
Algorithm 2: Server-Side Model Aggregation
1: Input: Client model parameters $\theta_i^{(t)}$ from a subset of clients $S_t$;
2: Output: Updated global model parameters $\theta_G^{(t+1)}$;
3: Select a subset of clients $S_t \subseteq \{C_1, \ldots, C_N\}$;
4: Send current global model parameters $\theta_G^{(t)}$ to each client $C_i \in S_t$;
5: Wait for each client to send back their updated parameters $\theta_i^{(t)}$;
6: Aggregate client models using an aggregation method (e.g., FedAvg): $\theta_G^{(t+1)} = \sum_{C_i \in S_t} \frac{n_i}{n}\,\theta_i^{(t)}$;
7: Update the global model with $\theta_G^{(t+1)}$;
8: Schedule and repeat the training process to reflect on new inputs.
This guarantees that clients with larger datasets have a more significant impact on the revised global model. The server then updates the global model with the aggregated client local model updates $\theta_i^{(t)}$. This procedure is repeated throughout training iterations until the global model attains convergence. The process is repeated at scheduled times so that adaptive learning is enabled and new inputs are processed and identified with high accuracy.
The LIME algorithm operates post-prediction on the client side, enhancing interpretability for each output $\hat{y}_i$ generated by the trained client-side model $f_i$ (Algorithm 3). After predicting whether an input $x_i$ contains cyberbullying content, LIME generates a locally interpretable explanation $E_i$ by creating a perturbed dataset $Z$ around $x_i$. For each perturbation $z \in Z$, the model's response $f_i(z)$ is computed and weighted by similarity $\pi_{x_i}(z)$, where $\pi_{x_i}(z) = \exp\!\big(-d(x_i, z)^2 / \sigma^2\big)$, indicating the proximity to the original instance. This weighted response forms the foundation for fitting a simple, interpretable linear model $g$, where feature weights $w_j$ provide insights into the importance of each feature $j$ in determining $\hat{y}_i$. The connection between the client-side algorithm and this LIME explanation algorithm lies in their complementary functions. While the client-side model $f_i$ is optimised for accurate predictions without sharing raw data (preserving privacy), the LIME algorithm focuses on local interpretability by fitting a simplified model within the vicinity of each prediction. This results in an explanation $E_i = \{w_j\}$, where weights $w_j$ indicate the influence of features $x_{ij}$ on the prediction $\hat{y}_i$. Thus, users gain valuable insights into the model's decision making for each instance, maintaining the privacy-centric principles of FL while enhancing transparency and accountability in the detection of cyberbullying.
Algorithm 3: LIME-Based Explanation for Client-Side Model Predictions
1: Input: Trained model parameters $\theta_i$, test data $D_i^{test}$
2: Output: Explanations $E$; for each instance $x_i \in D_i^{test}$ do
3: Compute prediction $\hat{y}_i = f_i(x_i)$
4: Generate perturbed dataset $Z$ around $x_i$; for each $z \in Z$ do
5: Compute $f_i(z)$
6: Compute weight $\pi_{x_i}(z)$
7: Fit linear model $g$ to approximate $f_i$ locally
8: Minimize weighted loss: $\mathcal{L}(f_i, g, \pi_{x_i}) = \sum_{z \in Z} \pi_{x_i}(z)\,\big(f_i(z) - g(z)\big)^2$
9: Extract explanation $E_i = \{w_j\}$
10: Save $E_i$
11: return $E$
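In practice, Algorithm 3 maps directly onto the lime Python package. The sketch below assumes the fine-tuned client model and tokenizer from the earlier training sketch and shows the binary case for brevity; in the multi-class setting, explicit labels or top_labels would be passed to explain_instance.

```python
import torch
from lime.lime_text import LimeTextExplainer

CLASS_NAMES = ["not_cyberbullying", "cyberbullying"]   # illustrative class names

def predict_proba(texts):
    """classifier_fn for LIME: maps a list of raw strings to class probabilities from the local model."""
    enc = tokenizer(list(texts), padding=True, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1).numpy()

explainer = LimeTextExplainer(class_names=CLASS_NAMES)
explanation = explainer.explain_instance("you are such an idiot", predict_proba, num_features=8)
print(explanation.as_list())   # token weights w_j behind this particular prediction
```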
To extend our study beyond the IID assumption, we additionally simulate realistic federated learning conditions. Non-IID client splits are generated using a Dirichlet distribution with varying concentration parameters $\alpha$, where lower values induce stronger label skew. Furthermore, in each communication round, only a fraction of the clients is selected to participate, reflecting partial availability in real-world FL. FedSGD, FedAvg, and FedProx are all evaluated under these conditions to provide a comprehensive comparison.
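A possible implementation of the Dirichlet-based label-skew partitioning is sketched below, assuming integer class labels; the helper name is hypothetical, and the default α value mirrors the setup described in Section 3.2.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=20, alpha=0.5, seed=0):
    """Assign sample indices to clients with Dirichlet(alpha) label skew; lower alpha = stronger skew."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        shares = rng.dirichlet(alpha * np.ones(num_clients))          # class-c proportion per client
        cut_points = (np.cumsum(shares) * len(idx)).astype(int)[:-1]
        for client_id, chunk in enumerate(np.split(idx, cut_points)):
            client_indices[client_id].extend(chunk.tolist())
    return client_indices

# Usage sketch: partition training labels, then sample a fraction of clients each round
# parts = dirichlet_partition(y_train, num_clients=20, alpha=0.5)
# selected = np.random.default_rng(1).choice(20, size=10, replace=False)
```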
6. Limitations and Future Work
While our study evaluates performance on a single benchmark dataset, the proposed framework is readily applicable to multi-platform scenarios (e.g., Twitter (now X), Reddit, and Facebook). We identify cross-platform validation as an important avenue for future work to assess robustness across heterogeneous linguistic and social contexts. Furthermore, while we report detailed per-class metrics under a stratified train–test split, future work may extend this to k-fold cross-validation to further validate class-specific stability.

This work demonstrated illustrative explanations using LIME at the tokenized input level; we note two important directions for strengthening this aspect. First, a human-in-the-loop evaluation could be incorporated, for example, by comparing highlighted tokens with annotator rationales or domain expert judgments. This would help validate the faithfulness of model explanations. Second, as LIME may exhibit sensitivity when applied to Transformer models with subword tokenization, future work will explore complementary interpretability approaches such as SHAP, Integrated Gradients, or attention-based explanation methods. These extensions would provide more stable and human-aligned insights into model behavior in federated cyberbullying detection.

Although our centralized models achieved >98% accuracy, this may reflect a performance ceiling effect caused by class balancing. While safeguards were applied to prevent data leakage, future work should test the framework on more diverse, naturally imbalanced datasets where performance may be lower but more realistic.

While our experiments focus on moderate-scale simulations, scalability to large-scale social networks is feasible with federated learning. In practice, only a fraction of clients participate in each round (e.g., 1–10%), which reduces the per-round communication from $O(N)$ to $O(pN)$, where $p$ is the participation rate. Prior FL studies have shown that with random sampling of just 5–10% of clients, convergence accuracy remains within 1–2% of full participation, even at million-client scale. Techniques such as update compression (e.g., 8-bit quantization achieving 4–8× communication savings) and sparsification (transmitting only the top 1–5% of gradients) further reduce per-round overhead. Moreover, hierarchical aggregation (edge server to central server) can cut aggregation latency by up to 50% in geo-distributed settings. Integrating these well-established optimizations into our framework ensures that communication costs grow sub-linearly with network size, enabling deployment across social networks with tens of millions of active users.
Automated moderation systems inevitably face trade-offs between false positives and false negatives. False positives (benign posts incorrectly flagged as bullying) may suppress free expression and lead to unnecessary censorship, while false negatives (bullying posts not detected) can allow harmful content to persist and cause real harm to affected individuals. In practice, the balance between these errors should be tailored to platform-specific values and contexts (for example, prioritizing recall in high-risk environments such as adolescent forums, or precision in contexts where over-censorship poses serious concerns). We emphasize that automated tools should complement, not replace, human moderators, and we highlight the importance of transparency and user recourse mechanisms in deployment.