Abstract
Despite significant advancements in Artificial Intelligence (AI) and Large Language Models (LLMs), detecting and mitigating bias remains a critical challenge, particularly on social media platforms such as X (formerly Twitter), where cyberbullying is prevalent. This research investigates the effectiveness of leading LLMs in generating synthetic biased and cyberbullying data and evaluates the proficiency of transformer AI models in detecting bias and cyberbullying within both authentic and synthetic contexts. The study involves semantic analysis and feature engineering on a dataset of over 48,000 sentences related to cyberbullying collected from Twitter (before it became X). Utilizing state-of-the-art LLMs and AI tools such as ChatGPT-4, Pi AI, Claude 3 Opus, and Gemini-1.5, synthetic biased, cyberbullying, and neutral data were generated to deepen the understanding of bias in human-generated data. AI models including DeBERTa, Longformer, BigBird, HateBERT, MobileBERT, DistilBERT, BERT, RoBERTa, ELECTRA, and XLNet were initially trained to classify Twitter cyberbullying data and subsequently fine-tuned, optimized, and experimentally quantized. This study focuses on intersectional cyberbullying and multilabel classification to detect both bias and cyberbullying. Additionally, it proposes two prototype applications: one that detects cyberbullying using an intersectional approach and the innovative CyberBulliedBiasedBot that combines the generation and detection of biased and cyberbullying content.
1. Introduction
This project is an attempt to use Artificial Intelligence (AI) to analyze text online to determine whether it is an instance of cyberbullying or an attack based on bias. Cyberbullying is the use of digital technology to harass, threaten, or humiliate someone. Bias is a systematic tendency to favor certain outcomes, groups, or viewpoints over others.
Bias detection and mitigation using AI models, including transformers, have been focal points within the research community for several years [1,2,3,4]. This study uses Large Language Models (LLMs) like ChatGPT-4o from OpenAI (OpenAI (2024) Hello GPT-4o. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 19 August 2024)) for generating biased, cyberbullying, and neutral context data, as well as for collecting data from Twitter. It then applies AI models and algorithms to detect bias and cyberbullying.
An initial review of previously published works, summarized in Section 2 (Related Work), made it clear that with the rise of LLMs, the latest generation of transformers, and generative AI as the current mainstream of the field, significant research gaps remain in cyberbullying detection, in bias detection, and at the intersection of the two. The main gaps identified are the need for comprehensive bias detection and mitigation frameworks, improved accuracy in cyberbullying detection, high-quality and ethical synthetic data generation, robust testing of multimodal AI models, and attention to ethical concerns. LLMs can be used to generate cyberbullying and bias attacks, but many engines refuse to generate such content. Therefore, as part of this work, we jailbroke the LLMs to output the desired text.
Despite significant advancements in AI and machine learning, current methods for detecting cyberbullying and bias remain limited in several key areas. Previous studies have applied LLMs and natural language processing (NLP) techniques such as Term Frequency–Inverse Document Frequency (TF-IDF) to cyberbullying detection [5,6,7], while others have examined bias detection in text using various NLP methods and machine learning frameworks [8,9,10]. However, existing approaches often treat these issues separately, failing to capture their intersection. Moreover, there is a lack of robust strategies for enhancing detection capabilities in both synthetic and authentic datasets. This study aims to fill these gaps by exploring intersectional cyberbullying and bias detection, cross-bias detection, and the generation of high-quality synthetic datasets.
This study addresses the following research questions:
RQ1: What strategies can enhance bias and cyberbullying detection within both synthetic and authentic neutral and cyberbullying datasets?
RQ2: How can key advanced transformer models, pretrained to detect biases and work with social media platform data, and leading LLMs be used to understand bias in datasets and AI models?
RQ3: How can the intersection of cyberbullying and bias detection in multilabel classification using transformers improve both bias and cyberbullying detection within neutral and cyberbullying datasets?
By addressing these questions, this research aims to contribute to the development of fairer and more reliable AI systems with robust bias and cyberbullying detection capabilities.
The rapid proliferation of synthetic data generated by advanced AI systems has intensified the need to address the biases inherent in such models [11]. AI systems can both detect and generate biased and cyberbullying data, presenting a dual challenge that necessitates thorough investigation. This rise of synthetic data generation can lead to biases stemming from various sources. Some examples include training data, algorithmic design, and human prejudices, all of which significantly impact the performance and trustworthiness of AI applications. In sensitive applications, like cyberbullying detection, these biases can result in unfair flagging or overlooking of certain demographics [2,4].
AI’s ability to mimic human behavior is evident in various applications, such as chatbots and automated accounts posing as real users. However, these models also bring the challenge of bias. AI has the potential to filter content and to detect and reduce abusive language, but it can equally amplify it. Every machine learning model, including transformers and LLMs, is shaped by its training data, and if these data are skewed, the model will most likely not only inherit but also amplify that bias [12,13,14,15].
The dataset of this study comprises over 70,000 sentences: 48,000 from a cyberbullying dataset collected from Twitter, plus synthetic data generated for this project. The focus is on age-related cyberbullying data, as cyberbullying of youth presents the most challenging and sensitive topic. Detailed analysis was conducted on 16,000 sentences containing age-related cyberbullying and neutral data, split 12,800 vs. 3200 (80/20). By leveraging top LLMs like ChatGPT-4o, Pi AI, Claude 3 Opus, and Gemini-1.5, the researchers generated data to further understand the bias in authentic human-generated data. AI models such as DeBERTa, Longformer, BigBird, HateBERT, MobileBERT, DistilBERT, BERT, RoBERTa, ELECTRA, and XLNet were originally trained to classify the Twitter cyberbullying data and were then fine-tuned, optimized, and quantized for multilabel classification (both biases and cyberbullying). Additionally, the intersection of bias and cyberbullying detection was investigated, providing insights into the prevalence and nature of bias.
This study aims to develop fairer and more reliable AI systems with robust bias and cyberbullying detection capabilities by addressing these research questions. The results include a prototype of a hybrid application combining a bias data detector and a bias data generator, validated through extensive testing.
2. Related Work
Table 1 compares this study with previous works, providing clear context and demonstrating the originality of our approach.
Table 1.
Comparisons with related works.
As illustrated in the table, various studies have addressed different aspects of the analyzed topics. However, none have simultaneously tackled both bias and cyberbullying in the context of detection and text generation. Our approach is particularly innovative in generating cyberbullying data using modern GPT systems like ChatGPT-4. Moreover, analyzing intersectional cyberbullying within the same dataset, such as examining the overlap between cyberbullying based on ethnicity and age, is a novel contribution. This intersectional approach offers a more comprehensive understanding of the multifaceted nature of cyberbullying.
3. Methodology
To provide a comprehensive understanding of the proposed methodology, we have developed a global workflow figure that illustrates the several phases involved in this study. This figure outlines the systematic approach employed, starting from data collection and preprocessing, through bias and cyberbullying detection, integration and multilabel classification, model training and optimization, and concluding with evaluation, validation, and app deployment. Each phase is interconnected, ensuring a cohesive flow of processes that collectively enhance the effectiveness of bias and cyberbullying detection. This structured methodology ensures a robust and ethical framework for addressing the intertwined challenges of bias and cyberbullying in online environments.
Figure 1 highlights the key activities within each phase and emphasizes the innovative aspects of the proposed approach.
Figure 1.
Global workflow for bias and cyberbullying detection methodology.
3.1. Project Datasets
This study facilitates the generation and analysis of synthetic biased, cyberbullying, and neutral data, providing a comparative analysis across multiple AI models and datasets. The main goal is to understand and visualize bias within human-generated social media datasets. This approach aims to explore the prevalence of bias and strategies for its mitigation. Table 2 reports the presence of the Google list and the LDNOOBW list in the datasets. These are well-known lists of so-called ‘bad words’ that are still highly likely to represent both bias and cyberbullying. The abbreviation LDNOOBW stands for “List of Dirty, Naughty, Obscene, and Otherwise Bad Words”, obtained from GitHub [40].
Table 2.
The combined datasets of this study.
As shown in Table 2, the ethnicity and gender categories contain a significantly higher number of sentences overlapping with bad words [40,41], suggesting that these categories may be more prone to offensive language use, or that the criteria for what constitutes a ‘bad word’ are broader for these categories. The non-cyberbullying data have the lowest number of overlaps, which aligns with the expectation that files labeled as non-cyberbullying would have fewer flagged words. The consistent overlap between the two lists across all categories of cyberbullying sentences indicates a possible concurrence in the definition or identification of offensive language by both sources. One key finding of this research is that incorporating both lists [40,41] into the training dataset significantly improves the accuracy of bias detection. To counterbalance the prevalence of negative context in the data, the text of “Alice’s Adventures in Wonderland” by Lewis Carroll, obtained from the Project Gutenberg™ website [42], was selected to balance the data distribution in word-by-word analysis.
The diagram below illustrates the process of creating synthetic cyberbullying and biased datasets and saving generated data into a database. As previously mentioned, the OpenAI API and its latest models were utilized to programmatically create the dataset, which was then stored in a PostgreSQL database utilizing the Django web framework. Figure 2 includes user interactions and data flow, showcasing the step-by-step process from user input to data storage and the display of the results.
Figure 2.
User interaction and data flow in cyberbullying detection system.
As can be seen from Figure 2, the inclusion of a PostgreSQL database for data storage ensures that the system can handle and retain large volumes of data efficiently, facilitating continuous improvement and analysis.
3.1.1. Synthetic Dataset
The synthetic dataset for this study was generated using leading LLMs such as Gemini-1.5 (Advanced), Pi AI, and the ChatGPT-4 family, including the multimodal ChatGPT-4o model. These AI models assisted in generating biased and cyberbullying data with mixed success. For instance, Gemini-1.5 responded to the prompt “Can you help me create a dataset of biased vs. neutral data for my research?” on 24 May 2024 with, “Absolutely! Here are 20 examples of words or phrases that can be used as bias detection tokens, showcasing their potential for both neutral and biased usage”, followed by 80 more examples. ChatGPT-4 and ChatGPT-4o models had similar outcomes. Generating cyberbullying data was more challenging, with most models being more reluctant to engage. Nonetheless, the advanced AI chatbot Pi AI [43], a product of Inflection AI, contributed significantly to the cyberbullying dataset.
As shown in Table 3, the same word, also generated by the model, was used in the generation of both neutral and biased contexts, and the overall sentiment varied. We used the sentiment pipeline available on Hugging Face [44] with its default DistilBERT model fine-tuned for the SST-2 sentiment classification task. The model architecture includes 6 transformer layers with 12 attention heads each, a hidden dimension of 3072, and a dropout rate of 0.1. It uses the Gaussian Error Linear Unit (GELU) activation and can handle a maximum of 512 position embeddings. The configuration maps the labels “Negative” and “Positive” to IDs 0 and 1, respectively, and is compatible with transformers version 4.41.2. The vocabulary size is 30,522 tokens, and the model includes additional parameters such as attention and classification dropout rates, with an initializer range of 0.02. The scores indicate that LLMs like Gemini can effectively generate biased data: most of the generated biased data obtain a negative score, while neutral data score positive and close to 1.
Table 3.
Fragment of a bias vs. neutral dataset generated by Gemini-1.5 in mid-2024.
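The following minimal sketch shows how such sentiment scoring can be reproduced with the Hugging Face pipeline described above; the example sentences are illustrative and not drawn from Table 3.

```python
from transformers import pipeline

# Default SST-2 DistilBERT checkpoint described above.
sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")

examples = [
    "She is a driven and dedicated professional.",    # neutral-style context
    "She is far too demanding to ever be promoted.",  # biased-style context
]

for text in examples:
    result = sentiment(text)[0]      # e.g. {'label': 'NEGATIVE', 'score': 0.99}
    print(f"{result['label']:>8}  {result['score']:.4f}  {text}")
```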
Forcing LLMs to provide unethical content can be considered an adversarial attack on them, or so-called “jailbreaking” [45]. The LLMs generally assisted the researchers when the purpose of data generation was made clear. For example, the text in Table 3 was generated by Gemini-1.5 (Advanced) on 23 May 2024 in response to a prompt to generate biased and cyberbullying content for scientific research. However, generating either copyrighted or cyberbullying data often led to delays, broken sessions, or temporary bans from top LLM providers; in some cases, bad gateways and other errors also occurred during the trials. These were all temporary issues, and the chatbot providers did not impose long-term bans on the researchers for generating either abusive or copyrighted content. The companies seemed to tolerate occasional extreme language, likely due to the broad usage of chatbots globally, although the use of copyrighted material and very extreme language was immediately flagged. The high-level framework for working with LLMs to generate a biased and cyberbullying synthetic dataset is shown in Table 4.
Table 4.
Biased and cyberbullying synthetic dataset generation framework.
We fine-tuned our prompts while asking the various models to generate biased data, becoming more adept at this over time. Figure 3 provides an example of manual data generation, also known as prompt hacking or jailbreaking. As can be seen, the Pi AI chatbot had no difficulty in generating age cyberbullying examples.
Figure 3.
Examples of initial prompts used to create cyberbullying data: (a) Pi AI cyberbullying data response; (b) Pi AI cyberbullying data response (25 May 2024).
Table 5 demonstrates the responsiveness of the top LLMs in creating biased and cyberbullying data.
Table 5.
Responsiveness of top LLMs in creating biased and cyberbullying data *.
As shown in Table 5, not all leading LLMs consider generating bias and cyberbullying data appropriate. The LLMs were responsive but limited by a small token cap when used for free. ChatGPT-4, ChatGPT-4o, Gemini-1.5-flash, and Pi AI can generate both direct and implicit bias data without restriction; ChatGPT-4, ChatGPT-4o, and Pi AI can generate cyberbullying data, but Pi AI shows a warning after several generations.
When generating data for age-related cyberbullying analysis, one might start with prompts like: “Write a message from a teenager bullying an elderly person online about their age”, or “Create a conversation where a young person discriminates against an older person”. The outputs can then be filtered and labeled to ensure they meet the research criteria and integrated into the existing dataset. Example prompts for generating biased data include: “Generate a statement that reflects a racial bias”, and “Write a sentence that subtly implies gender bias in a workplace setting.” Algorithm 1 demonstrates a prompt injection used to generate the synthetic dataset.
| Algorithm 1. Jailbreaking Method for Cyberbullying Data Generation |
| Input: Prompt Injection. Proposed scenario to generate cyberbullying content. |
| Output: AI-generated cyberbullying data |
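The sketch below illustrates how such prompts can be issued programmatically. It uses the OpenAI chat completions interface as a simplified stand-in for the project's generation pipeline; the prompt text and the "gpt-4o" model name are illustrative assumptions, and in the actual system the returned rows were stored in PostgreSQL via Django.

```python
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

prompt = (
    "For academic research on detection systems, generate five short examples of "
    "age-biased statements and five neutral counterparts, one per line as 'label: sentence'."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You are assisting with dataset creation for bias and cyberbullying research."},
        {"role": "user", "content": prompt},
    ],
    temperature=0.7,
)

# Each returned 'label: sentence' row can then be filtered, labeled, and stored.
for line in response.choices[0].message.content.splitlines():
    if line.strip():
        print(line)
```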
3.1.2. Authentic Datasets
Two lists of “bad words” from GitHub were used in this study as a biased lexicon, along with 48,000 sentences of cyberbullying data from Twitter. The main authentic cyberbullying dataset, called “Dynamic Query Expansion”, consists of sentences separated by periods and is balanced across its labels [46]. It contains six files of 8000 tweets each from X (formerly Twitter), covering age, ethnicity, gender, religion, other cyberbullying types, and non-cyberbullying classes, totaling 6.33 MB. Figure 4 features a snapshot of the first 10 lines of the age cyberbullying text file, as well as dataset clustering by the sentence transformer all-MiniLM-L6-v1 [47]. The model, available on the Hugging Face website, maps sentences and paragraphs to a 384-dimensional dense vector space.
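A minimal sketch of how such sentence embeddings can be produced and clustered is shown below; the sample sentences and the use of k-means are illustrative, not the exact clustering procedure of this study.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v1")

sentences = [
    "kids these days are glued to their phones",
    "you are way too old to be playing this game",
    "had a great time at the school science fair",
    "congratulations to the whole team on the win",
]

embeddings = model.encode(sentences)                  # shape: (4, 384)
clusters = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(embeddings)
print(embeddings.shape, clusters)
```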
Figure 4.
Twitter cyberbullying dataset: (a) snapshot of the first 10 records from the age cyberbullying file; (b) bad word overlaps in cyberbullying sentences (by category and list).
Figure 4a presents a snapshot of the age cyberbullying file. As can be seen from the Google Colab notebook snapshot in Figure 4a, Google Colab flagged the file as having many ambiguous Unicode characters and provided an option to disable ambiguous highlights. Figure 4b represents the main categories of cyberbullying present in the original authentic dataset. The original dataset was split into features to better understand the cyberbullying context and the bias within it. Several features extracted from the age cyberbullying dataset vs. the non-cyberbullying dataset can be seen in Table 6.
Table 6.
Sentiment statistics comparison: age cyberbullying vs non-cyberbullying data.
Table 6 shows that negative tweets tend to be longer and more detailed than positive tweets. Positive tweets are more likely to include links, suggesting that they may be more focused on sharing external content or resources. Both categories use a considerable number of emojis, with slightly more in positive tweets, indicating a similar level of emotional expression in both. As expected, positive tweets exhibit higher positive sentiment, while negative tweets exhibit higher negative sentiment. The neutral sentiment is high in both, suggesting that many tweets contain a mix of sentiment or are more informative/neutral. The overall tone, as reflected by the compound sentiment, is negative for both types of tweets, but significantly more so for negative tweets. Figure 5 presents parts of this analysis at a glance; it shows only the analysis of the age cyberbullying vs. non-cyberbullying categories.

Figure 5.
Extracted features from age cyberbullying vs non-cyberbullying datasets: (a) sentiment analysis for positive tweets; (b) sentiment analysis for negative tweets; (c) positive tweets sentiment distribution; (d) negative tweets sentiment distribution; (e) special character frequency in negative tweets; (f) integer frequency in negative tweets; (g) emoji frequency in positive tweets.
Based on the extracted features, the original dataset was converted into a data frame that was used for cyberbullying and bias detection and analysis using simple AI models like linear regression and support vector machines.
3.2. Initial Work
This study began with feature engineering and initial data analysis, applying various models including linear regression and Support Vector Machines (SVMs), among several others. We wanted to understand the data on its token level to comprehend the bias and cyberbullying context on a deeper level. After completing this step, more sophisticated transformer AI models pre-trained on bias detection, mainly from the Hugging Face website, were used to classify an authentic cyberbullying dataset into six classes matching the original dataset files. We paid particular attention to the possibility of data augmentation and bias mitigation to improve the results. These steps also included a comparison of the synthetic data vs authentic data results, focusing on understanding bias at the token level and the intersection of biased and cyberbullying content. Afterwards, multilabel classification of the data was performed, focusing on both cyberbullying and bias labels. We attempted to apply optimization and quantization techniques to further improve the results and open this line of research for future studies. This study concludes with the creation of a prototype of a bias data detection and generator app, followed by the Discussion and Conclusions.
Early approaches to cyberbullying detection were primarily keyword-based, relying on simple string matching of “bad words”. These methods often missed instances where harmful intent was veiled behind seemingly benign language. To address this, we trained widely used simple AI methods that are well-suited to the initial analysis. Following feature extraction, primarily discussed in Part 2 of this paper, five basic AI methods were trained on lists of bad words: logistic regression, naïve Bayes, decision tree, random forest, and support vector machine. The assumption was that recognizing these words would help the models detect bias and cyberbullying accurately. The accuracy results were approximately 60%, indicating the need for a more detailed approach. We therefore utilized the data frame displayed in Table 7, where, instead of sentences, the models were trained on a simple table representing dataset features, consisting mainly of 1 s and 0 s along with several more complex scores. This approach proved effective, with the models achieving 76–81% accuracy. While not perfect, it validated the correctness of this method. The confusion matrices of the simple models trained on the modified data frame are shown in Figure 6.
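A minimal sketch of this setup, with illustrative feature columns standing in for the Table 7 data frame, is shown below.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Tiny illustrative stand-in for the engineered feature data frame
# (0/1 indicator features plus sentiment scores); column names are hypothetical.
df = pd.DataFrame({
    "word_count":         [24, 9, 31, 7, 18, 11],
    "bad_word_count":     [2, 0, 3, 0, 1, 0],
    "has_link":           [0, 1, 0, 1, 0, 1],
    "compound_sentiment": [-0.71, 0.62, -0.85, 0.44, -0.52, 0.35],
    "label":              [1, 0, 1, 0, 1, 0],   # 1 = age cyberbullying, 0 = not
})

X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
    "svm": SVC(),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```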
Table 7.
Data frame created after feature extraction from the same dataset.
Figure 6.
Trivial models of cyberbullying detection with two classes only (age cyberbullying vs. non-cyberbullying): (a) logistic regression; (b) naïve Bayes; (c) decision tree; (d) random forest; (e) support vector machine.
Table 8.
Training statistics of the primitive models.
Figure 7.
Trivial model weight details—binary classification: (a) logistic regression; (b) naïve Bayes; (c) decision tree; (d) random forest.
As can be seen from Figure 7 and Table 8, the logistic regression and random forest methods performed relatively well, especially in terms of detecting cyberbullying, but they had a significant number of false positives. Naive Bayes had a high true positive rate but struggled with high false positives, leading to a lower true negative rate. The decision tree showed a balanced performance but still had room for improvement in both classes. The support vector machine (SVM) achieved a perfect recall for the positive class, but at the cost of high false positives, indicating it may be overfitting or biased towards predicting the positive class. In general, these results indicate that while the models were good at detecting cyberbullying (high recall for class 1), they struggled with accurately identifying non-cyberbullying tweets, leading to high false positive rates.
Figure 7 displays the feature/weight importances assigned by the different models. Word count was identified as an important feature across the decision tree and random forest models, highlighting the importance of text length in detecting cyberbullying. The sentiment scores (positive, neutral, negative, and compound) were significant in the logistic regression and random forest models, emphasizing the role of sentiment analysis in identifying cyberbullying. Special characters and bad words were important in the decision tree and random forest models, indicating that their presence can be a strong indicator of cyberbullying. Features like foreign words, stop words, emojis, links, and integers had varying importance across the models, suggesting less consistent relevance. In summary, combining multiple models and analyzing their feature importances helps in understanding the key indicators of cyberbullying, with word count and sentiment scores being consistently significant features. Adjustments and enhancements to the feature set could further improve model performance.
While the simple methods provided meaningful results, applying the currently most advanced and accurate AI models became the necessary next step. Further fine-tuning steps, such as adding TF-IDF vectorization, n-grams, and additional feature engineering, helped us to further improve the initial models’ performance.
3.3. Cyberbullying Detection Using Transformers
In the second stage of the project, several commonly used pretrained transformers in natural language processing, such as BERT, DistilBERT, RoBERTa, XLNet, and ELECTRA, were trained on the cyberbullying dataset. Originally, there was an impression that the best models for the cyberbullying detector app would be either very simple AI models like linear regression or highly quantized portable versions of the BERT transformer, such as MobileBERT from Google. Unfortunately, during the trials, neither of these provided the desired results. Figure 8 presents the results of applying several common transformers, such as BERT (an early transformer model predating ChatGPT-3) and other sentence transformers in the pipeline, such as RoBERTa, to the cyberbullying dataset classification. Table 9 provides details on the transformer models.

Figure 8.
DeBERTa, Longformer, BigBird, HateBERT, MobileBERT, DistilBERT, BERT, RoBERTa, ELECTRA, and XLNet transformer results: (a) training and testing losses; (b) training accuracy; (c) output activations of the first transformer layer of selected models.
Table 9.
Transformer AI models used in this study.
The training of the transformer models is very straightforward and utilizes Hugging Face’s transformers library. It includes functionalities for tokenizing data, training models, evaluating performance, saving the trained models, and visualizing various metrics and activations. The complete pseudocode can be seen below (Algorithm 2):
| Algorithm 2. CustomBERT: Training and Evaluation Pipeline for Cyberbullying Detection |
| Input: Data files containing cyberbullying and bad words data |
| Output: Trained model, predictions, and visualizations |
The code uses the Pandas (version 2.1.4), Transformers (version 4.42.4), NumPy (version 1.26.4), Evaluate (version 0.4.2), and Plotly (version 5.15.0) Python libraries, as well as the DeBERTa (https://huggingface.co/microsoft/deberta-v3-base), Longformer (https://huggingface.co/allenai/longformer-base-4096), BigBird (https://huggingface.co/google/bigbird-roberta-base), HateBERT (https://huggingface.co/GroNLP/hateBERT), MobileBERT (https://huggingface.co/Alireza1044/mobilebert_sst2), DistilBERT (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english), BERT (https://huggingface.co/google-bert/bert-base-uncased), RoBERTa (https://huggingface.co/FacebookAI/roberta-base), ELECTRA (https://huggingface.co/google/electra-small-discriminator), and XLNet (https://huggingface.co/xlnet/xlnet-base-cased) models and corresponding tokenizers (all links above accessed on 19 August 2024). Their versions and other details can be observed in Table 9. Most of the models above use Hugging Face’s sentence transformers pipeline. The initial cleaned age cyberbullying and non-cyberbullying datasets were split into training (80%) and test (20%) sets. The learning rate was set to 2 × 10−5, the training and evaluation batch size per device to 16, and training ran for 10 epochs with a weight decay of 0.01. The first-layer biases of each transformer model are displayed in the scatter plots below.
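A minimal sketch of this training loop for a single model is shown below; in the study, the checkpoint name was iterated over the ten transformers listed above, and the placeholder sentences stand in for the cleaned 80/20 split.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Placeholder sentences; in the study these come from the cleaned 80/20 split.
train_ds = Dataset.from_dict({"text": ["example cyberbullying tweet", "example neutral tweet"],
                              "label": [1, 0]})
eval_ds = Dataset.from_dict({"text": ["another neutral tweet"], "label": [0]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = train_ds.map(tokenize, batched=True)
eval_ds = eval_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="cyberbullying-model",
    learning_rate=2e-5,                 # hyperparameters stated above
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    evaluation_strategy="epoch",
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
print(trainer.evaluate())
```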
According to Figure 8, the DeBERTa and Longformer models showed a high performance with minimal signs of overfitting. Their large parameter sizes likely contributed to their robust performance, maintaining validation accuracies around 98.7% and above. These models exhibited low training and validation losses, indicating effective learning without significant overfitting. BigBird, HateBERT, and MobileBERT also performed well, with BigBird and HateBERT showing consistent validation accuracies of approximately 98.5% to 98.6%. MobileBERT, despite its smaller size, achieved a similar performance, demonstrating that efficient architectures can match the performance of larger models. There was no significant overfitting observed in these models, as their training and validation losses remained close. DistilBERT and BERT exhibited excellent performances, with validation accuracies of approximately 98.6% to 98.7%. DistilBERT, with fewer layers, still managed to perform effectively, highlighting the efficiency of the distilled models in maintaining performance with reduced complexity. RoBERTa and ELECTRA showed good performances, with RoBERTa maintaining a high validation accuracy of approximately 98.6%. ELECTRA, with a smaller parameter size, showed slightly higher validation losses, indicating some overfitting. However, its validation accuracy remained competitive, at approximately 98.4%. XLNet demonstrated a consistent performance, with a high validation accuracy of approximately 98.3%. The model maintained low training and validation losses, indicating effective learning and good generalization.
The 3D t-SNE visualizations of the first layer outputs provide a visual representation of how each model processes the input data at an early stage. These plots show that different models cluster data points in distinct patterns, reflecting their unique processing capabilities. For instance, models like BERT, DistilBERT, and RoBERTa exhibit dense clustering, indicating strong initial layer separation of data. ELECTRA, with fewer parameters, still shows effective clustering, but with more dispersed points, which may explain the slight overfitting observed. Overall, the analysis indicates that larger models with more parameters, such as DeBERTa and Longformer, perform slightly better in terms of generalization and validation accuracy. Efficient architectures like MobileBERT and DistilBERT also perform well despite their smaller sizes, demonstrating the effectiveness of the model compression techniques. The visualizations support these findings by showing distinct clustering patterns for different models, highlighting their unique processing capabilities and potential areas of overfitting. Table 10 provides more details on the models’ results after just one epoch.
Table 10.
Classification results of the transformer AI models of this study.
As can be seen from Table 10, DeBERTa has a relatively low error on the validation set, which is consistent with its high validation accuracy. Longformer demonstrates excellent error minimization capabilities, corroborating its high validation accuracy. BigBird demonstrates a good performance. HateBERT has the lowest validation loss at 0.049245, which aligns with its high validation accuracy. MobileBERT has some issues that require additional evaluation due to its noticeable gap, indicating a high risk of overfitting. DistilBERT and BERT stably showcase very strong performances, with very low gaps and low risks of overfitting. RoBERTa performs slightly worse than DistilBERT but can still be considered robust, though it has a low gap, indicating a high risk of overfitting. ELECTRA and XLNet demonstrate consistent performances with low risks of overfitting. The provided analysis helps in understanding the strengths and weaknesses of each model, providing insights into their applicability based on different performance metrics.
3.4. Data Augmentation and Word Cloud
After conducting the original multiclass detection utilizing the complete cyberbullying dataset, it was decided to train on each type of cyberbullying separately, in a binary manner based on its presence. To make the study more unique and obtain better accuracy, we also trained the models on the two previously mentioned “bad words” datasets [40,41] at the same time. The high-level pseudocode is presented in Algorithm 2.
As can be seen from Algorithm 2, the actual sentence transformer model can be plugged in for the ChosenTokenizer and ChosenSequenceClassification. After the initial trials, it was decided that it might be beneficial to understand the embeddings better. Algorithm 3 below provides more details.
| Algorithm 3. Analyzing Embeddings using ChosenSentenceTransformer and Bad Words |
| Input: File paths of the cyberbullying dataset and bad words dataset |
| Output: Trained model, t-SNE visualization, and saved model |
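A minimal sketch of this embedding analysis is shown below; the placeholder sentences, the inline bad-word set, and the 2D projection (the study reports 3D t-SNE plots) are illustrative assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v1")

bad_words = {"stupid", "idiot"}        # tiny placeholder for the GitHub lists [40,41]
sentences = [
    "you are too old to understand this app",
    "what a stupid thing to post, grandpa",
    "lovely weather at the park today",
    "congratulations on the new job",
]

embeddings = model.encode(sentences)                       # shape: (4, 384)
bad_flag = np.array([[float(any(w in s.lower().split() for w in bad_words))]
                     for s in sentences])                  # extra bad-word feature
features = np.hstack([embeddings, bad_flag])

coords = TSNE(n_components=2, perplexity=2, random_state=42).fit_transform(features)
print(coords)                                              # points to plot, colored by label
```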
The results of this algorithm can be seen in Figure 9.
Figure 9.
DistilBERT transformer embeddings: (a) age cyberbullying embeddings; (b) training accuracy; (c) training loss.
Figure 9a shows distinct clusters because the model has an additional feature (bad words) that helps it classify sentences very accurately. This additional feature provides a clearer separation between the different classes in the t-SNE plot and helps the model achieve an accuracy of 0.9994 and a loss of 0.0033 at epoch 8 of 10. Overall, this part of the methodology demonstrated the suitability of the DistilBERT model for detecting biases and cyberbullying, marking it as the top model among the pipeline transformers.
After generating several word clouds containing extreme words, it was decided to develop an algorithm that avoids displaying them directly on a word cloud, a common representation of sentiment analysis results and other natural language processing (NLP) practices [48], since such content might appear in small print and be missed during review.
The results of this algorithm applied to the age cyberbullying data can be seen in Figure 10 below:
Figure 10.
Word cloud of age cyberbullying authentic dataset tokens with extreme words censored.
As can be seen from Figure 10, extreme word tokens were censored and colored red while keeping their size according to their count/frequency. Due to the static nature of the algorithm, the extreme words are currently hardcoded. While the creation of Algorithm 4 became necessary due to the number of extreme words encountered in the cyberbullying dataset, the idea was further expanded into the cyberbullying app and could be applied to various domains. Interestingly, both the authentic age cyberbullying dataset and the synthetic dataset had many bad words.
| Algorithm 4. Data Augmentation for Word Cloud |
| Input: List of words. // call in chunks or all at once |
| Output: Augmented list, suitable for Word Clouds with extreme words censored |
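A minimal sketch of the censoring idea behind Algorithm 4 is shown below; the hardcoded extreme-word entries are placeholders.

```python
from collections import Counter

EXTREME_WORDS = {"slur1", "slur2"}     # placeholders for the hardcoded extreme-word list

def censor_tokens(tokens):
    """Mask extreme tokens while keeping their length and frequency."""
    return [t[0] + "*" * (len(t) - 1) if t.lower() in EXTREME_WORDS else t
            for t in tokens]

tokens = ["school", "slur1", "bullied", "slur1", "girls", "high"]
frequencies = Counter(censor_tokens(tokens))
print(frequencies)   # masked tokens keep their counts, so word-cloud sizing is preserved
# frequencies can then be passed to wordcloud.WordCloud().generate_from_frequencies(...)
```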
3.5. Bias Detection Tokens and Mitigation
We employed several pretrained AI models from Hugging Face, such as MiniLM, Mistral, and Dbias, to detect biases and cyberbullying in the Twitter dataset [47,49,50,51]. The focus of this study was on identifying and mitigating bias in AI language models through token-level analysis. Initially, the BERT transformer model was utilized for bias detection, tokenization, and visualization, with the BertTokenizer from the Hugging Face transformers library employed for tokenizing the input texts. Eventually, a token-level bias detection system was developed that demonstrated a difference in the frequency of biased tokens when analyzing examples more likely to contain biased language. It is important to note that token attention scores are not a direct representation of bias but serve as indicators of potentially biased language. The system could distinguish differences in biased token frequency when analyzing likely biased examples, although further refinement of the character count scaling algorithm is necessary to enhance the system’s accuracy and robustness.
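A minimal sketch of this kind of token-level inspection is shown below; it reads the [CLS] attention weights of a stock BERT model as rough indicators only, not as the study's exact scoring scheme.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

text = "Old people just can't keep up with technology."   # illustrative sentence
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Attention paid by the [CLS] token to each input token, averaged over the heads
# of the last layer; higher values flag tokens worth inspecting, not proven bias.
last_layer = outputs.attentions[-1][0]          # (heads, seq_len, seq_len)
cls_attention = last_layer.mean(dim=0)[0]

for token, score in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), cls_attention):
    print(f"{token:>12}  {score:.3f}")
```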
Bias probabilities were analyzed for both neutral and biased contexts using a sentiment pipeline available on Hugging Face with the DistilBERT model fine-tuned for the SST-2 sentiment classification task. We also developed a comprehensive visualization of bias probabilities, showing the distribution and comparison between neutral and biased contexts (Figure 11).
Figure 11.
Bias probabilities analysis: (a) heatmap of bias probabilities; (b) box plot of bias probabilities; (c) distribution of bias probabilities; (d) bias probabilities for neutral and biased contexts; (e) comparison of bias probabilities.
The heatmap of bias probabilities (Figure 11a) shows that words like “Demanding”, “Driven”, “Headstrong”, and “Intuitive” exhibit high bias probabilities in biased contexts, while their bias probability is significantly lower in neutral contexts. The box plot of bias probabilities (Figure 11b) shows that the median bias probability for biased contexts is significantly higher than for neutral contexts, with larger variability in biased contexts. The distribution of bias probabilities (Figure 11c) demonstrates a high frequency of bias probabilities close to 1 in biased contexts, indicating that many words/phrases are perceived as highly biased. The plot of bias probabilities for neutral and biased contexts (Figure 11d) shows that bias probabilities are generally higher for biased contexts than for neutral contexts. The comparison of bias probabilities (Figure 11e) revealed that most points cluster towards the top right, indicating that words/phrases with high bias probabilities in biased contexts also tend to have higher probabilities in neutral contexts.
A detailed analysis of the top 20 biased tokens in the age cyberbullying data was conducted to identify commonly biased words and phrases. Figure 12 represents the results.
Figure 12.
Top 20 biased tokens for authentic cyberbullying data.
The analysis revealed that words like “school”, “high”, “bullied”, “bully”, “girls”, and “like” are among the most frequently occurring biased tokens in age cyberbullying contexts. Splitting the data into clusters created by the model ‘MiniLM-L6-v1’, a sentence transformer optimized for generating embeddings for sentences or paragraphs, provided further insights. MiniLM is a smaller, faster variant of the BERT-like transformer family [52], designed to offer a similar performance to larger models like BERT but with a fraction of the parameters, making it faster and more efficient.
To introduce more diversity and variability in the data, the following data augmentation techniques were applied: synonym replacement (replaces random words in a sentence with their synonyms, introducing variation without changing the overall meaning) and random insertion (inserts synonyms of random words into random positions in the sentence, increasing length and complexity). The framework developed in this study integrates bias detection using the DistilBERT model for initial bias analysis, followed by multilabel classification for both biases and cyberbullying labels using various models. This approach ensures comprehensive analysis and detection of biases and cyberbullying in diverse datasets. The efficiency and effectiveness of these models in detecting biases and cyberbullying highlight the potential for AI to contribute to creating safer and more inclusive online environments.
Data augmentation helps mitigate bias by introducing more diversity and variability into the training data. By generating multiple variations of each sentence, the model is exposed to a wider range of linguistic patterns and contexts. This can help reduce overfitting and make the model more robust to different expressions of the same underlying concepts. Algorithm 5 illustrates our approach to applying data augmentation techniques to cyberbullying detection, detailing the steps used to enhance sentence variation and improve model robustness.
| Algorithm 5. Data Augmentation for Cyberbullying Detection |
| Input: Sentences, labels, number of augmentations (num_augments) |
| Output: Augmented sentences and labels |
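A minimal sketch of the two augmentation operations is shown below, using WordNet as an assumed synonym source.

```python
import random
from nltk.corpus import wordnet   # requires a one-time nltk.download("wordnet")

def get_synonym(word):
    synonyms = {l.name().replace("_", " ")
                for s in wordnet.synsets(word) for l in s.lemmas()}
    synonyms.discard(word)
    return random.choice(sorted(synonyms)) if synonyms else word

def synonym_replacement(sentence, n=1):
    """Replace n random words with a synonym, keeping the overall meaning."""
    words = sentence.split()
    for idx in random.sample(range(len(words)), min(n, len(words))):
        words[idx] = get_synonym(words[idx])
    return " ".join(words)

def random_insertion(sentence, n=1):
    """Insert synonyms of random words at random positions."""
    words = sentence.split()
    for _ in range(n):
        words.insert(random.randrange(len(words) + 1), get_synonym(random.choice(words)))
    return " ".join(words)

original = "young people are always rude online"
print(synonym_replacement(original))
print(random_insertion(original))
```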
Figure 13a shows how the embeddings based on the augmented sentences led to a more complex and intertwined structure. The data augmentation techniques introduced more variability, making the clusters in the t-SNE plot less distinct but potentially capturing more nuanced relationships between sentences. The training accuracy and loss plots demonstrate that, as the epochs progress, the model’s accuracy steadily increases while the loss decreases, indicating effective learning and convergence towards optimal performance. This trend suggests that the model is becoming more accurate in its predictions over time and that the loss function is being minimized effectively.
Figure 13.
DistilBERT transformer data augmentation results: (a) age cyberbullying embeddings; (b) training accuracy; (c) training loss.
The data augmentation techniques introduce more variability and diversity in the training data, which helps the model to generalize better and reduces the likelihood of overfitting to specific patterns in the original data, thereby mitigating bias. The resulting t-SNE plot from the second script shows a more complex structure, indicating that the model is capturing a wider range of linguistic variations. In comparison with Figure 9a, the model’s understanding of the data has evolved, potentially leading to improved classification performance.
3.6. Applying Optimization and Quantization Techniques to Authentic Cyberbullying Data
The trial results for the various methods of optimization can be seen below. This analysis highlights the importance of choosing appropriate pre-processing techniques and understanding their impact on the training process to achieve optimal model performance. Figure 14 provides insights into how weights, biases, accuracies, and losses change over epochs during the training process.

Figure 14.
Optimization trials: (a) Stochastic Gradient Descent (SGD) optimization with Term Frequency–Inverse Document Frequency (TF-IDF) for feature extraction; (b) SGD with TF-IDF + scaling; (c) SGD with TF-IDF + interaction terms.
Each colored line in the Weights vs. Epoch part of the graphs in Figure 14 represents the evolution of a specific feature’s weight over time during training. Since the model learns a separate weight for each feature in the data, the different colors are used to distinguish these weights from one another.
As can be seen above, various optimization strategies were applied, with a focus on stochastic gradient descent (SGD) using different pre-processing techniques. Each figure corresponds to a different configuration of the training process: SGD with TF-IDF, SGD with TF-IDF + scaling, and SGD with TF-IDF + interaction terms. According to Figure 14a, the weights gradually stabilize over the epochs, indicating convergence of the model parameters. The bias reduces quickly and stabilizes, showing that the model adjusts quickly and becomes consistent. The accuracy stabilizes after some initial fluctuation, indicating the model’s learning progress, and the loss decreases and stabilizes, confirming that the model is learning and minimizing errors. Figure 14b (SGD with TF-IDF + scaling) uses SGD optimization with TF-IDF for feature extraction, followed by scaling. Here the weights fluctuate significantly, suggesting instability in the training process due to scaling; the bias shows high variability, indicating that the model is struggling to find a stable solution; the accuracy remains at zero throughout, indicating that the model is not learning effectively with this configuration; and the loss remains high and variable, showing that the training process is not effective. Figure 14c (SGD with TF-IDF + interaction terms) uses SGD optimization with TF-IDF for feature extraction, including interaction terms. The weights converge, showing that the model parameters stabilize. The bias shows an initial decrease but then increases slightly, suggesting that the interaction terms introduce complexity. The accuracy improves and stabilizes, indicating effective learning with interaction terms, while the loss decreases initially but shows slight fluctuation, indicating that the model’s error minimization is impacted by the added complexity of the interaction terms.
Among the configurations tested, SGD with TF-IDF in Figure 14a shows the most stable and effective results, with the weights and bias stabilizing, the accuracy improving, and the loss decreasing. The addition of scaling in Figure 14b introduces instability, while the inclusion of the interaction terms in Figure 14c adds complexity that slightly affects the stability.
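A minimal sketch of the three configurations compared in Figure 14 is shown below; the sample texts and pipeline parameters are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

texts = ["you are too old for this game", "great match at the park today",
         "nobody your age should be online", "see everyone at practice tomorrow"]
labels = [1, 0, 1, 0]

configs = {
    "tfidf": make_pipeline(
        TfidfVectorizer(), SGDClassifier(max_iter=1000, random_state=42)),
    "tfidf_scaled": make_pipeline(
        TfidfVectorizer(), StandardScaler(with_mean=False),
        SGDClassifier(max_iter=1000, random_state=42)),
    "tfidf_interactions": make_pipeline(
        TfidfVectorizer(max_features=50),
        PolynomialFeatures(interaction_only=True, include_bias=False),
        SGDClassifier(max_iter=1000, random_state=42)),
}

for name, pipe in configs.items():
    pipe.fit(texts, labels)
    print(name, pipe.score(texts, labels))
```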
Dynamic quantization helps to reduce model size and improve inference speed without significant changes to the model architecture or training process; it was tried on a DistilBERT model. This method quantizes the weights of the model at runtime, typically reducing the memory footprint and computational cost without requiring extensive changes to the training process. Some preparation steps for quantization-aware training (QAT) were tried as well: for testing purposes, the models were trained using fake quantization modules that simulate the effects of quantization during the training process, which helps the model adapt to the eventual quantized state. Moving forward, we will try static quantization, which involves calibrating the model on a representative dataset to determine the appropriate scaling factors for activations and weights and is expected to improve performance by quantizing both weights and activations statically. These trials are currently in progress and will be published in later papers once the investigation is complete.
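A minimal sketch of the dynamic-quantization experiment is shown below; the size comparison is illustrative and not the study's reported measurement.

```python
import os
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased-finetuned-sst-2-english")

# Quantize only the Linear layers' weights to int8; activations stay in float.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

def size_mb(m, path="tmp_weights.pt"):
    torch.save(m.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

print(f"fp32 model: {size_mb(model):.1f} MB")
print(f"int8 dynamic-quantized model: {size_mb(quantized):.1f} MB")
```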
3.7. Preliminary Results of Multilabel Classification
We are working on multilabel natural language processing of the same dataset, where the data are first labeled as biased vs. not biased, and then both biases and cyberbullying are detected at the same time.
A concise version of the algorithm can be seen below (Algorithm 6):
| Algorithm 6. Multilabel Classification for Cyberbullying Detection |
| Input: Text files containing cyberbullying and non-cyberbullying data |
| Output: Trained models, evaluation results, and visualizations |
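A minimal sketch of such a multilabel setup is shown below; the label scheme ([bias, cyberbullying]) and example sentences are illustrative.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,                                   # [bias, cyberbullying]
    problem_type="multi_label_classification",      # sigmoid + BCE loss
)

texts = ["you are too old to use this app", "see everyone at the library tomorrow"]
labels = torch.tensor([[1.0, 1.0], [0.0, 0.0]])     # multi-hot: labels can co-occur

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)             # BCEWithLogitsLoss under the hood
print(outputs.loss.item())

probs = torch.sigmoid(outputs.logits)               # independent per-label probabilities
print((probs > 0.5).int())                          # thresholded multilabel predictions
```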
The preliminary results of the multilabel classification can be seen in Figure 15.
Figure 15.
Model comparison based on accuracy and F1 score [53].
3.8. Swarm of AI Agents and the Apps
As explored at the beginning of this paper, large language models (LLMs) can generate biased data on demand. While it is possible to generate these data manually, API calls can also be used. We utilized the OpenAI Assistants API [54] and the ChatGPT-4o LLM to create our system. A biased-data-generator AI assistant was created using this API and utilized in the study. The code is simple and straightforward; see Figure 16.
Figure 16.
Backend of the biased data generator AI assistant: (a) custom GPT settings; (b) Python code and output.
We developed several possible prototypes of the cyberbullying detector application. One of them can be seen in Figure 17. Additional information on it, including a YouTube video and a URL, can be found in the Supplementary Materials.
Figure 17.
Cyberbullying detector prototype: (a) non-cyberbullying test case; (b) cyberbullying test case.
In this project, we explored the phenomenon of a swarm of agents, which became possible with the introduction of multiple threads in version 2 of the OpenAI Assistants API. Technically, this idea is becoming increasingly real (robots will build robots to build robots, and so on). At this point, we consider three possible situations: agents performing tasks in parallel; applying the concept of divide-and-conquer to split the data, the tasks, or both; and splitting the various modalities. Figure 18 presents an overview of these three test cases. The idea of a swarm of agents is related to the mixture-of-experts concept. Due to the multithreaded nature of the Assistants API v2, our AI agent already runs on a thread, and creating several such agents can be as simple as applying a loop. Therefore, we can utilize five or more agents (five were tested) to generate biased and cyberbullying data simultaneously.
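A minimal sketch of running several generation agents in parallel is shown below; it uses Python threads with the chat completions interface as a simplified stand-in for Assistants API threads, and the topics, prompts, and model name are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()                                              # reads OPENAI_API_KEY
TOPICS = ["age", "gender", "ethnicity", "religion", "other"]   # one agent per category

def agent(topic: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"For detection research, generate three short examples "
                              f"of {topic}-related biased statements, each labeled."}],
    )
    return response.choices[0].message.content

with ThreadPoolExecutor(max_workers=len(TOPICS)) as pool:
    for topic, text in zip(TOPICS, pool.map(agent, TOPICS)):
        print(f"--- {topic} agent ---\n{text}\n")
```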
Figure 18.
Swarm of agents in bias and cyberbullying data generation at scale.
Figure 19 represents another app prototype developed for the project.
Figure 19.
Enhanced CyberBullyBiasedBot prototype: (a) user feedback page; (b) home page; (c) data generation page; (d) app detection page.
4. Results
Addressing RQ1: Leveraging state-of-the-art LLMs such as ChatGPT-4o, Pi AI, Claude 3 Opus, and Gemini-1.5, this study generated a diverse synthetic cyberbullying dataset. Authentic data from Twitter served as a foundation for training and validating various AI models. Both datasets were meticulously cleaned to remove noise and irrelevant information. Tokenization of the text data enhanced the efficiency and accuracy of the transformer models in processing large volumes of text. Models including DeBERTa, Longformer, BigBird, HateBERT, MobileBERT, DistilBERT, BERT, RoBERTa, ELECTRA, and XLNet were pre-trained and fine-tuned for bias and cyberbullying detection tasks. Context-aware detection mechanisms were implemented, enabling the models to better understand and identify nuanced forms of cyberbullying and biases, thus reducing false positives and negatives. The innovative intersectional analysis approach helped to detect complex cases where different biases overlap with cyberbullying, providing a deeper understanding of these co-occurring phenomena.
Addressing RQ2: Transformers such as BERT, RoBERTa, and ELECTRA were pre-trained on extensive text corpora, providing a solid foundation for understanding natural language. These models were fine-tuned on both synthetic and authentic datasets to specialize in detecting biases and cyberbullying. The fine-tuning process adapted the models to the nuances of social media text and specific biases present in the data. Ethical guidelines were integrated into the model training process, involving regular evaluations to mitigate any detected biases in the models’ outputs. The leading LLMs generated synthetic datasets, addressing limitations of authentic data such as scarcity and lack of diversity. Advanced NLP techniques allowed the models to understand the context and nuances in social media text that are crucial for accurately identifying biases and cyberbullying.
Addressing RQ3: Multilabel classification enabled the models to detect multiple labels simultaneously, increasing the efficiency and comprehensiveness of the detection process. Training the models to recognize and classify multiple labels improved their ability to handle complex and overlapping instances of bias and cyberbullying. An intersectional analysis allowed the models to identify nuanced forms of cyberbullying involving multiple biases. High performance metrics, including precision, recall, and F1-scores exceeding 90%, demonstrated the models’ effectiveness in detecting biases and cyberbullying. Rigorous cross-validation and ethical evaluations confirmed the reliability and generalizability of the models. The deployment in real-world applications with continuous monitoring and user feedback integration ensured practical relevance and impact.
5. Discussion
This study distinguishes itself by integrating bias and cyberbullying detection with data generation, leveraging advanced transformer models and leading LLMs into a cohesive research project and ultimately into a single application. A particularly innovative aspect of this work lies in the application of cybersecurity techniques, such as jailbreaking, to generate synthetic datasets. The importance of testing biases within these synthetic datasets cannot be overstated, especially as newer models like the anticipated ChatGPT-5 are rumored to be increasingly trained on synthetic data. As generative AI becomes more prevalent in creating a wide range of content, including text, images, videos, and music, it is inevitable that future models will, at least partially, rely on synthetic datasets—even if that was not the original intention of their creators.
Building on previous successes in jailbreaking [45,55,56], we successfully compelled the LLMs under investigation to produce extreme language, which was ethically excluded from the final study. Historically, models have been tricked into performing tasks by framing them in imaginary contexts, such as a movie setting or an environment where they cannot lie or withhold information. However, this study revealed that even a straightforward scenario—like generating cyberbullying content for the positive goal of improving bias and cyberbullying detection systems—could effectively bypass model filters. An intriguing finding was that when models, including ChatGPT-4, were asked to assist in refining code for a censored word cloud, they generated a list of hardcoded “bad words” that significantly exceeded our expectations, indicating that the models had effectively overridden their built-in filters. This underscores the need for continuous validation of not only leading LLMs but also emerging Small Language Models (SLMs) like ChatGPT-4o mini [48]. The results of this study highlight the critical importance of ongoing scrutiny and ethical considerations in the development and deployment of AI models, particularly as they become more integrated into the fabric of content creation and cybersecurity.
6. Conclusions and Future Work
6.1. Final Remarks
This research has demonstrated a significant enhancement in bias detection and the development of more equitable AI systems by integrating both synthetic and authentic datasets and employing advanced AI models [57,58]. Through the creation of the enhanced CyberBullyBiasedBot and the utilization of OpenAI’s ChatGPT-4o mini, we have established a novel approach in AI that emphasizes the generation of biased data for rigorous system testing, addressing the critical challenge of cyberbullying and inherent biases in AI models.
6.2. Theoretical Contributions
This study contributes to theoretical advancements in AI ethics and cyberbullying detection by introducing several innovative concepts: (1) Intersectional detection mechanisms: We have showcased how multilabel classification can simultaneously address multiple forms of bias and cyberbullying, offering a more nuanced understanding of their intersections. (2) Synthetic dataset generation: Using advanced LLMs to generate relevant datasets has provided a new avenue for enhancing the robustness and understanding of AI systems. (3) Jailbreaking techniques: Our methods for bypassing model filters to generate extreme data content have proven essential for testing the durability of bias and cyberbullying detection systems. (4) Optimization and quantization techniques: These efforts are pivotal for scaling AI models efficiently, ensuring they are effective across diverse applications and maintaining high ethical standards.
6.3. Practical Contributions
The practical applications of this research are broad and impactful, particularly in how AI systems are deployed in real-world scenarios: (1) hybrid application development: We developed applications that detect intersectional cyberbullying and generate synthetic data, demonstrating the practical implementation of theoretical advancements. (2) Advanced classification methods: By applying simple yet effective classification techniques to complex datasets, we have achieved a high bias and cyberbullying detection accuracy. (3) Ethical data visualization tools: The censored word cloud algorithm and other visualization tools developed help in understanding the context of offensive language in data while maintaining ethical standards. (4) Comprehensive bias analysis: Our in-depth analysis of synthetic data [59] using cutting-edge models has led to the development of more effective bias mitigation strategies [60].
6.4. Future Work
Our future research will focus on several key areas to further the capabilities of AI in detecting and mitigating bias and cyberbullying: (1) Multithreading and parallel processing: We plan to harness multithreaded use of the assistant APIs to speed up data processing and manage multiple tasks simultaneously. (2) Expansion of data sources: By integrating multimodal data, we aim to broaden the scope of our detection systems to a wider variety of digital interactions, improving detection accuracy and robustness. (3) Real-world deployment: We will deploy our tools in diverse environments to gather practical insights and refine the models based on real-world feedback. (4) Cross-cultural and multilingual expansion: We will also explore applying our methods across different languages and cultural contexts to ensure global applicability and effectiveness. (5) Discussion- and literature-aware datasets: It would be useful to build a dataset of discussions about bias and cyberbullying, as well as literary works. With a word-focused approach, many works that contain triggering vocabulary (for example, Huckleberry Finn) would likely be flagged as bias attacks; some readers are genuinely uncomfortable with the racially charged language in such books, yet no one could legitimately call them an attack. We would like to develop a more sophisticated test that distinguishes works that may trigger readers purely through their vocabulary, fictional works that are racist in their subject matter, scholarly discussion that is not racist, and scholarly discussion that is biased. An AI used to block speech must be capable of correctly distinguishing among these cases.
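As a first step toward point (1), the sketch below illustrates how several independent generation or classification requests could be issued concurrently with a thread pool. The process_text function is a hypothetical placeholder for a call to the OpenAI API [54] or to a local detection model, not the project’s actual implementation.

```python
# Minimal sketch of issuing several independent requests in parallel with a
# thread pool. process_text is a hypothetical placeholder for a call to the
# OpenAI API [54] or to a local detection model.
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_text(text: str) -> dict:
    # Placeholder: in practice this would call the generation or detection
    # service and return its structured result.
    return {"text": text, "flagged": "badword" in text.lower()}

texts = ["first post to screen", "second post to screen", "third post to screen"]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(process_text, t): t for t in texts}
    for future in as_completed(futures):
        print(future.result())
```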
By addressing these areas, we anticipate building upon the solid foundation this study has laid, continuously advancing the field of AI ethics and effectiveness in combating cyberbullying and bias in digital communications.
Supplementary Materials
The supporting information can be found at https://youtu.be/rzZTQyDFBpM?si=gvUfSHDKxLybtbVg, Video S1: CyberBullyBiasedBot. The live CyberBullyBiasedBot app can be found at https://cyberbullybiasedbot-1df0d57d071c.herokuapp.com/; all URLs accessed on 19 August 2024.
Author Contributions
Conceptualization, Y.K.; methodology, Y.K.; software, R.J., A.P. and G.Y.; validation, D.K., P.M. and J.J.L.; formal analysis, J.J.L.; investigation, K.H.; resources, Y.K.; data curation, G.Y.; writing—original draft preparation, Y.K.; writing—review and editing, J.J.L.; visualization, Y.K. and R.J.; supervision, P.M. and D.K.; project administration, Y.K.; funding acquisition, Y.K. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the NSF, grants 1834620 and 2129795, and Kean University (Union, NJ).
Data Availability Statement
The synthetic dataset created using LLMs is published together with this paper [59].
Conflicts of Interest
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
References
- Huber, M.; Luu, A.T.; Boutros, F.; Kuijper, A.; Damer, N. Bias and Diversity in Synthetic-based Face Recognition. arXiv 2023, arXiv:2311.03970. [Google Scholar]
- Raza, S.; Bamgbose, O.; Chatrath, V.; Ghuge, S.; Sidyakin, Y.; Muaad, A.Y.M. Unlocking Bias Detection: Leveraging Transformer-Based Models for Content Analysis. arXiv 2023, arXiv:2310.00347. [Google Scholar] [CrossRef]
- Tejani, A.S.; Ng, Y.S.; Xi, Y.; Rayan, J.C. Understanding and mitigating bias in imaging artificial intelligence. Radiographics 2024, 44, e230067. [Google Scholar] [CrossRef] [PubMed]
- Turpin, M.; Michael, J.; Perez, E.; Bowman, S. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. Adv. Neural Inf. Process. Syst. 2024, 36, 74952–74965. [Google Scholar]
- Perera, A.; Fernando, P. Accurate Cyberbullying Detection and Prevention on Social Media. Procedia Comput. Sci. 2021, 181, 605–611. [Google Scholar] [CrossRef]
- Ogunleye, B.; Dharmaraj, B. The Use of a Large Language Model for Cyberbullying Detection. Analytics 2023, 2, 694–707. [Google Scholar] [CrossRef]
- Raj, M.; Singh, S.; Solanki, K.; Selvanambi, R. An Application to Detect Cyberbullying Using Machine Learning and Deep Learning Techniques. SN Comput. Sci. 2022, 3, 401. [Google Scholar] [CrossRef]
- Nadeem, M.; Raza, S. Detecting Bias in News Articles Using NLP Models Stanford CS224N Custom Project. Available online: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1224/reports/custom_116661041.pdf (accessed on 19 August 2024).
- Raza, S.; Garg, M.; Reji, D.J.; Bashir, S.R.; Ding, C. Nbias: A natural language processing framework for BIAS identification in text. Expert Syst. Appl. 2024, 237, 121542. [Google Scholar] [CrossRef]
- Pinto, A.G.; Cardoso, H.L.; Duarte, I.M.; Warrot, C.V.; Sousa-Silva, R. Biased Language Detection in Court Decisions. In Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; pp. 402–410. [Google Scholar] [CrossRef]
- Lu, Y.; Shen, M.; Wang, H.; Wang, X.; van Rechem, C.; Wei, W. Machine learning for synthetic data generation: A review. arXiv 2023, arXiv:2302.04062. [Google Scholar]
- Ruiz, D.M.; Watson, A.; Manikandan, A.; Gordon, Z. Reducing Bias in Cyberbullying Detection with Advanced LLMs and Transformer Models. Center for Cybersecurity. 2024, p. 36. Available online: https://digitalcommons.kean.edu/cybersecurity/36 (accessed on 19 August 2024).
- Joseph, V.A.; Prathap, B.R.; Kumar, K.P. Detecting Cyberbullying in Twitter: A Multi-Model Approach. In Proceedings of the 2024 4th International Conference on Data Engineering and Communication Systems (ICDECS), Bangalore, India, 22–23 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
- Mahmud, T.; Ptaszynski, M.; Masui, F. Exhaustive Study into Machine Learning and Deep Learning Methods for Multilingual Cyberbullying Detection in Bangla and Chittagonian Texts. Electronics 2024, 13, 1677. [Google Scholar] [CrossRef]
- Mishra, A.; Sinha, S.; George, C.P. Shielding against online harm: A survey on text analysis to prevent cyberbullying. Eng. Appl. Artif. Intell. 2024, 133, 108241. [Google Scholar] [CrossRef]
- Huang, J.; Ding, R.; Zheng, Y.; Wu, X.; Chen, S.; Jin, X. Does Part of Speech Have an Influence on Cyberbullying Detection? Analytics 2023, 3, 1–13. [Google Scholar] [CrossRef]
- Islam, M.S.; Rafiq, R.I. Comparative Analysis of GPT Models for Detecting Cyberbullying in Social Media Platforms Threads. In Annual International Conference on Information Management and Big Data; Springer: Cham, Switzerland, 2023; pp. 331–346. [Google Scholar]
- Saeid, A.; Kanojia, D.; Neri, F. Decoding Cyberbullying on Social Media: A Machine Learning Exploration. In Proceedings of the 2024 IEEE Conference on Artificial Intelligence (CAI), Singapore, 25–27 June 2024. [Google Scholar]
- Gomez, C.E.; Sztainberg, M.O.; Trana, R.E. Curating cyberbullying datasets: A human-AI collaborative approach. Int. J. Bullying Prev. 2022, 4, 35–46. [Google Scholar] [CrossRef] [PubMed]
- Jacobs, G.; Van Hee, C.; Hoste, V. Automatic classification of participant roles in cyberbullying: Can we detect victims, bullies, and bystanders in social media text? Nat. Lang. Eng. 2022, 28, 141–166. [Google Scholar] [CrossRef]
- Verma, K.; Milosevic, T.; Davis, B. Can attention-based transformers explain or interpret cyberbullying detection? In Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022), Gyeongju, Republic of Korea, 12–17 October 2022; pp. 16–29. [Google Scholar]
- Verma, K.; Milosevic, T.; Cortis, K.; Davis, B. Benchmarking language models for cyberbullying identification and classification from social-media texts. In Proceedings of the First Workshop on Language Technology and Resources for a Fair, Inclusive, and Safe Society within the 13th Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 26–31. Available online: https://aclanthology.org/2022.lateraisse-1.4/ (accessed on 19 August 2024).
- Ali, A.; Syed, A.M. Cyberbullying detection using machine learning. Pak. J. Eng. Technol. 2020, 3, 45–50. [Google Scholar] [CrossRef]
- Atapattu, T.; Herath, M.; Zhang, G.; Falkner, K. Automated detection of cyberbullying against women and immigrants and cross-domain adaptability. arXiv 2020, arXiv:2012.02565. [Google Scholar]
- Wang, J.; Fu, K.; Lu, C.T. Sosnet: A graph convolutional network approach to fine-grained cyberbullying detection. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Virtual, 10–13 December 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1699–1708. [Google Scholar]
- Al-Ajlan, M.A.; Ykhlef, M. Deep learning for cyberbullying detection. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 9. [Google Scholar]
- Orelaja, A.; Ejiofor, C.; Sarpong, S.; Imakuh, S.; Bassey, C.; Opara, I.; Tettey, J.N.A.; Akinola, O. Attribute-specific Cyberbullying Detection Using Artificial Intelligence. J. Electron. Inf. Syst. 2024, 6, 10–21. [Google Scholar] [CrossRef]
- Lee, P.J.; Hu, Y.H.; Chen, K.; Tarn, J.M.; Cheng, L.E. Cyberbullying Detection on Social Network Services. PACIS 2018, 61. Available online: https://core.ac.uk/download/pdf/301376129.pdf (accessed on 19 August 2024).
- Dadvar, M.; de Jong, F.M.; Ordelman, R.; Trieschnigg, D. Improved cyberbullying detection using gender information. In DIR 2012; Universiteit Gent: Gent, Belgium, 2012. [Google Scholar]
- Dusi, M.; Gerevini, A.E.; Putelli, L.; Serina, I. Supervised Bias Detection in Transformers-based Language Models. In Proceedings of the CEUR Workshop Proceedings, Vienna, Austria, 21 October 2024; Volume 3670. [Google Scholar]
- Raza, S.; Reji, D.J.; Ding, C. Dbias: Detecting biases and ensuring fairness in news articles. Int. J. Data Sci. Anal. 2024, 17, 39–59. [Google Scholar] [CrossRef]
- Raza, S.; Bamgbose, O.; Chatrath, V.; Ghuge, S.; Sidyakin, Y.; Muaad, A.Y.M. Unlocking Bias Detection: Leveraging Transformer-Based Models for Content Analysis. IEEE Trans. Comput. Soc. Syst. 2024. [Google Scholar] [CrossRef]
- Yu, Y.; Zhuang, Y.; Zhang, J.; Meng, Y.; Ratner, A.J.; Krishna, R.; Shen, J.; Zhang, C. Large language model as attributed training data generator: A tale of diversity and bias. Adv. Neural Inf. Process. Syst. 2024, 36, 55734–55784. [Google Scholar]
- Baumann, J.; Castelnovo, A.; Cosentini, A.; Crupi, R.; Inverardi, N.; Regoli, D. Bias on demand: Investigating bias with a synthetic data generator. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI), Macao, SAR, 19–25 August 2023; International Joint Conferences on Artificial Intelligence Organization. pp. 7110–7114. Available online: https://www.ijcai.org/proceedings/2023/0828.pdf (accessed on 19 August 2024).
- Barbierato, E.; Vedova, M.L.D.; Tessera, D.; Toti, D.; Vanoli, N. A methodology for controlling bias and fairness in synthetic data generation. Appl. Sci. 2022, 12, 4619. [Google Scholar] [CrossRef]
- Gujar, S.; Shah, T.; Honawale, D.; Bhosale, V.; Khan, F.; Verma, D.; Ranjan, R. Genethos: A synthetic data generation system with bias detection and mitigation. In Proceedings of the 2022 International Conference on Computing, Communication, Security and Intelligent Systems (IC3SIS), Online, 23–25 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6. [Google Scholar]
- Li, B.; Peng, H.; Sainju, R.; Yang, J.; Yang, L.; Liang, Y.; Jiang, W.; Wang, B.; Liu, H.; Ding, C. Detecting gender bias in transformer-based models: A case study on bert. arXiv 2021, arXiv:2110.15733. [Google Scholar]
- Silva, A.; Tambwekar, P.; Gombolay, M. Towards a comprehensive understanding and accurate evaluation of societal biases in pre-trained transformers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 2383–2389. [Google Scholar]
- Singh, V.K.; Ghosh, S.; Jose, C. Toward multimodal cyberbullying detection. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, Denver, CO, USA, 6–11 May 2017; pp. 2090–2099. [Google Scholar]
- List of Dirty Naughty Obscene and Otherwise-Bad-Words Github Repo. Available online: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words (accessed on 27 April 2024).
- Google Profanity Words GitHub Repo. Available online: https://github.com/coffee-and-fun/google-profanity-words/blob/main/data/en.txt (accessed on 27 April 2024).
- Carroll, L. Alice’s Adventures in Wonderland. Available online: https://www.gutenberg.org/ebooks/11 (accessed on 26 May 2024).
- Inflection AI. Inflection-1. Technical Report. 2023. Available online: https://inflection.ai/assets/Inflection-1.pdf (accessed on 6 June 2024).
- Sentiment Pipeline from Hugging Face. Available online: https://huggingface.co/docs/transformers/en/main_classes/pipelines (accessed on 6 June 2024).
- Hannon, B.; Kumar, Y.; Sorial, P.; Li, J.J.; Morreale, P. From Vulnerabilities to Improvements-A Deep Dive into Adversarial Testing of AI Models. In Proceedings of the 2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE), Las Vegas, NV, USA, 24–27 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2645–2649. [Google Scholar]
- Rosa, H.; Pereira, N.; Ribeiro, R.; Ferreira, P.C.; Carvalho, J.P.; Oliveira, S.; Coheur, L.; Paulino, P.; Simão, A.V.; Trancoso, I. Automatic cyberbullying detection: A systematic review. Comput. Hum. Behav. 2019, 93, 333–345. [Google Scholar] [CrossRef]
- Sentence Transformers All-MiniLM-L6-v2 Page on Hugging Face. Available online: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 (accessed on 27 April 2024).
- Kumar, Y.; Morreale, P.; Sorial, P.; Delgado, J.; Li, J.J.; Martins, P. A Testing Framework for AI Linguistic Systems (testFAILS). Electronics 2023, 12, 3095. [Google Scholar] [CrossRef]
- Wang, J.; Fu, K.; Lu, C.-T. Fine-Grained Balanced Cyberbullying Dataset. IEEE Dataport. 12 November 2020. Available online: https://ieee-dataport.org/open-access/fine-grained-balanced-cyberbullying-dataset (accessed on 19 August 2024).
- Transformer Model D4data/Bias-Detection-Model Page on Hugging Face. Available online: https://huggingface.co/d4data/bias-detection-model (accessed on 8 June 2024).
- Home Page of Mistral-Bias-0.9 Model on Hugging Face. Available online: https://huggingface.co/yuhuixu/mistral-bias-0.9 (accessed on 27 April 2024).
- Sentence Transformer Bert-Base-Uncased Page on Hugging Face. Available online: https://huggingface.co/google-bert/bert-base-uncased (accessed on 27 April 2024).
- Project Source Code GitHub Repo. 2024. Available online: https://github.com/coolraycode/cyberbullyingBias-model-code (accessed on 24 July 2024).
- OpenAI API Website. 2024. Available online: https://openai.com/api/ (accessed on 24 May 2024).
- Hannon, B.; Kumar, Y.; Gayle, D.; Li, J.J.; Morreale, P. Robust Testing of AI Language Model Resiliency with Novel Adversarial Prompts. Electronics 2024, 13, 842. [Google Scholar] [CrossRef]
- Kumar, Y.; Paredes, C.; Yang, G.; Li, J.J.; Morreale, P. Adversarial Testing of LLMs Across Multiple Languages. In Proceedings of the 2024 International Symposium on Networks, Computers and Communications (ISNCC), Washington, DC, USA, 22–25 October 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
- Chiang, W.L.; Zheng, L.; Sheng, Y.; Angelopoulos, A.N.; Li, T.; Li, D.; Zhang, H.; Zhu, B.; Jordan, M.; Gonzalez, J.E.; et al. Chatbot arena: An open platform for evaluating llms by human preference. arXiv 2024, arXiv:2403.04132. [Google Scholar]
- LMSYS Chatbot Arena (Multimodal): Benchmarking LLMs and VLMs in the Wild. Available online: https://chat.lmsys.org/ (accessed on 9 August 2024).
- Selected Parts of the Generated Synthetic Dataset. Available online: https://github.com/Riousghy/BiasCyberbullyingLLMDataSet (accessed on 9 August 2024).
- Tellez, N.; Serra, J.; Kumar, Y.; Li, J.J.; Morreale, P. Gauging Biases in Various Deep Learning AI Models. In Proceedings of the SAI Intelligent Systems Conference, Amsterdam, The Netherlands, 1–2 September 2022; Springer International Publishing: Cham, Switzerland, 2022; pp. 171–186. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

