Article

A Context-Aware Representation-Learning-Based Model for Detecting Human-Written and AI-Generated Cryptocurrency Tweets Across Large Language Models

1 School of Systems and Technology, University of Management and Technology, Lahore 54770, Pakistan
2 Department of Finance, Bucharest University of Economic Studies, 6 Piata Romana, 010374 Bucharest, Romania
3 Arfa Karim Technology Incubator, Lahore 54000, Pakistan
* Author to whom correspondence should be addressed.
Math. Comput. Appl. 2025, 30(6), 130; https://doi.org/10.3390/mca30060130
Submission received: 15 October 2025 / Revised: 25 November 2025 / Accepted: 26 November 2025 / Published: 29 November 2025

Abstract

The extensive use of large language models (LLMs), particularly in the finance sector, raises concerns about the authenticity and reliability of generated text. Developing a robust method for distinguishing between human-written and AI-generated financial content is therefore essential. This study addressed this challenge by constructing a dataset based on financial tweets, where original financial tweet texts were regenerated using six LLMs, resulting in seven distinct classes: human-authored text, LLaMA3.2, Phi3.5, Gemma2, Qwen2.5, Mistral, and LLaVA. A context-aware representation-learning-based model, namely DeBERTa, was extensively fine-tuned for this task. Its performance was compared to that of other transformer variants (DistilBERT, BERT Base Uncased, ELECTRA, and ALBERT Base V1) as well as traditional machine learning models (logistic regression, naive Bayes, random forest, decision trees, XGBoost, AdaBoost, and voting (AdaBoost, GradientBoosting, XGBoost)) using Word2Vec embeddings. The proposed DeBERTa-based model achieved an impressive test accuracy, precision, recall, and F1-score, all reaching 94%. In contrast, competing transformer models achieved test accuracies ranging from 0.78 to 0.80, while traditional machine learning models yielded a significantly lower performance (0.39–0.80). These results highlight the effectiveness of context-aware representation learning in distinguishing between human-written and AI-generated financial text, with significant implications for text authentication, authorship verification, and financial information security.

1. Introduction

In recent years, natural language processing (NLP) has emerged as a dominant AI field for human–computer interaction. With the advancement of this field, large language models (LLMs) have recently been introduced. These models are trained on large amounts of data, enabling them to learn human language and provide answers in it [1]. They learn from Internet sources, including texts, books, and journals, to understand language rules and meanings. Continued improvements in these models have led to significant enhancements and will further improve communication between humans and computers.
An LLM-based ChatGPT has been a major turning point in the field of NLP [2]. Due to its training with vast amounts of data, it can quickly create logical and contextual responses. It has a large impact on education and communication. Advanced language models such as GPT-4 and its latest versions have been the biggest reform in the AI world. The performance of these models continues to improve over time, particularly in generating text that resembles human language. The field of NLP has seen remarkable advancements in language understanding and generation, facilitated by deep learning models and language models.
The emergence and rapid advancement of LLMs such as GPTs (generative pre-trained transformers), LLaMA, Mistral, Phi, and BERTs (bidirectional encoder representations from transformers) have transformed the NLP landscape, demonstrating a significant leap in its capabilities. With the growing use of LLM-based models, distinguishing between human- and AI-generated text is becoming increasingly challenging [3], yet doing so is necessary to stop the spread of misinformation and ensure authenticity, especially in high-stakes domains such as finance and cryptocurrency, where sentiment directly impacts market trends and investor decisions.
Social media platforms such as Twitter (now X) [4] serve as critical communication channels for cryptocurrency traders, investors, and analysts. These platforms are flooded daily with millions of short, sentiment-driven posts that can influence market movements [5].
With the rise in publicly available LLMs, synthetic finance-related tweets can be generated easily to manipulate investor sentiment and to promote certain coins or products. Such automated dissemination of machine-generated financial narratives threatens the reliability of online discourse and poses potential risks to digital financial stability. Consequently, distinguishing human- and AI-generated cryptocurrency tweets has become an urgent research need to ensure the accountability of online financial communication.
Although several studies have explored the detection of AI-generated text, they have typically focused on general-purpose text datasets. However, cryptocurrency-related content differs significantly in its structure, vocabulary, and semantics, and detection models trained on general text fail to adapt to this specialized linguistic space. Furthermore, many studies have addressed only binary classification, i.e., distinguishing real from machine-generated text, without identifying which model produced the text. These gaps highlight the need for a study that considers domain-specific (cryptocurrency-related) text identification and also identifies the generating model.
In this study, we considered different LLMs to prepare a machine-generated cryptocurrency text dataset. After dataset preparation, different detection models were fine-tuned to propose a robust model. This domain-specific study extends beyond generic AI text detection, as the proposed model can also identify which LLM was used to generate a given text.

Motivation and Research Objectives

The motivation for this research arose from the increasing infiltration of AI-generated content into financial communication spaces and the absence of specialized detection models for this domain. While most previous studies have focused on binary classification tasks, they have largely ignored the need to differentiate between different LLMs. This oversight limits our understanding of how various models differ in their style, lexicon, and representation of financial language. Additionally, most of the datasets are generic and do not reflect the linguistic characteristics of cryptocurrency discourse, where even a short text plays a role in influencing the market.
Therefore, our research aimed to fill this gap by creating a multiclass, domain-specific dataset and detection model that not only distinguishes human-written and AI-generated text, but that can also identify which LLM family produced the text. This approach enhances the transparency and interpretability in the detection process, thereby maintaining the integrity of financial communication. The objectives of this research are listed below.
  • To construct a balanced, domain-specific dataset comprising real and AI-generated cryptocurrency tweets regenerated using multiple LLMs;
  • To propose a model to identify which LLM model was used to generate the AI-generated cryptocurrency text;
  • To contribute to enhancing AI transparency and trust in financial communication by offering a robust detection model.

2. Literature Review

The growing impact of LLMs has inspired extensive research focused on their architectures, applications, and limitations. The use of LLMs in different domains such as education, healthcare, and finance has also raised concerns regarding originality, authorship, and AI detection. This section provides an overview of related work, beginning with the evolution of LLMs and extending to AI detection techniques and the identified research gaps that motivated this study.

2.1. Evolution of Large Language Models

Meta AI introduced LLaMA (large language model Meta AI) as a series of large language models. These models were designed to efficiently balance model size and computational cost while achieving state-of-the-art results on NLP tasks. Compared to other models, LLaMA models achieve competitive results with a limited number of parameters because they are primarily designed with a focus on resource efficiency [6]. LLaMA-2 was introduced in July 2023 and is more advanced than the base model [7]. It was trained with 2 trillion tokens and fine-tuned using reinforcement learning from human feedback (RLHF) [8]. LLaMA-2 is openly available for commercial use [7] and comes in three variants (7B, 13B, and 70B parameters); the 70B variant is powerful and is used for complex tasks [9]. The LLaMA 3 series models were pre-trained on a vast dataset of 15.6 trillion tokens and support a context window of up to 128,000 tokens [10]. LLaMA 3.1 was a key success of the series due to its focus on enhancing scale, performance, and versatility, and it performs better thanks to enhanced multilingual support covering languages such as French, Spanish, and Hindi. Meta then released the open-source LLaMA 3.2, built on LLaMA 3.1, which provides lightweight language models and medium-sized vision models. It was the first model in the LLaMA series to excel in vision capabilities; as the first LLaMA models to support vision tasks, the 90B and 11B variants required a new architecture capable of handling image processing. Separately, the launch of the Phi-3.5 series introduced advanced features, including Phi-3.5 vision models that understand both visual and textual inputs.
Phi-3-mini [11] was trained on 3.3 trillion tokens, while Phi-3.5-mini is a larger model with 4.2 billion parameters. Phi-3.5-mini has been assessed on two long-context tasks, RULER and RepoQA, and it handles multilingual and long-text data well. The Qwen2.5 LLM models [12] were released as seven open-source models ranging from 0.5B to 72B parameters. Owing to domain-focused training, Qwen2.5 has enhanced capabilities in mathematics and coding; it can understand structured data (such as tables), generate structured output, and supports almost 29 languages.
Mistral 7B is a model with 7B parameters that can easily be fine-tuned for chat [13]. Gemma 2 models range from 2 billion to 27 billion parameters [14] and share the same architecture as the original Gemma models. Gemma 2 can be run on NVIDIA hardware using TensorRT-LLM, users can fine-tune it with Keras and Hugging Face, and it is available under a commercially friendly license.
The LLaVA model [15] depends on a large amount of machine-generated data. When the dataset was reduced to 90%, the smaller model still performed as well as, or better than, the original larger one. When a query is submitted together with an image, LLaVA answers in English even if the query is in another language, and it accepts inputs in different forms, both visual and textual.

2.2. Applications and Comparative Studies

Verma et al. [16] explained the comprehensive use of LLaMA-2, showcasing its ability to generate complete blog content. Given a topic name and a little customization according to the blog's context, LLaMA-2 can generate a full-length blog post closely resembling human-written text.
Malisetty et al. [17] evaluated LLaMA-2 variants for Internet of Things (IoT) privacy policy generation. They used different prompts and employed metrics such as ROUGE-LSum [18], BERTScore precision [19], Word2Vec [20], and GloVe [21], with cosine similarity used to compare the generated texts against real IoT privacy policies.
Labruna et al. [22] conducted a comparative study on BERT and LLaMA-2 in the restaurant domain. They utilized the MultiWOZ 2.4 dataset [23] to evaluate which model generates more effective conversational dialogue. To enhance the model accuracy, they used NLP techniques, fine-tuning models, and instruction-based models.
Hou and Lian [24] conducted a benchmarking study of LLaMA [6], ChatGPT [25], and Mistral [13]. Using the Hugging Face platform, they evaluated the models across multiple dimensions, including linguistic accuracy, computational efficiency, and ethical alignment. They observed that ChatGPT excelled in linguistic accuracy, while Mistral employed unique optimization techniques and LLaMA handled linguistic diversity well.

2.3. Detection of AI-Generated Text

Arshed et al. [26] proposed a study to distinguish real and AI-generated text, especially related to finance. They initially collected tweets related to finance from Twitter and regenerated them using ChatGPT [27] and QuillBot [28] to prepare the final dataset. They applied machine learning models with a Word2Vec [20] approach and achieved an effective accuracy of 0.74 with random forest.
Orenstrakh et al. [29] emphasized the effects of large language models (LLMs) in academia, where students utilize AI to support their academic work. They therefore evaluated tools created to detect plagiarism and AI-written text, and verified their accuracy before and after paraphrasing with QuillBot [28].
Zhang et al. [30] presented a detailed view of controllable text generation (CTG) using transformer-based pre-trained language models (PLMs), such as BERTs, T5, and GPTs. They discussed three methods: fine-tuning the model, retraining/refactoring the PLM, and post-processing, which is used in the decoding phase, but does not change the PLM. The key techniques used are adaptor modules that incorporate sentiments and persona, prompt-based approaches, and reinforcement learning.
Kumarage et al. [31] analyzed the detection of AI-generated tweets versus tweets written by humans based on a stylometric analysis. They used advanced LLM models to identify the timeline and style of tweets by incorporating punctuation style and linguistic diversity. These stylistic features were then embedded with the RoBERTa model [32], producing a prediction of whether a human or an AI wrote the tweet.
Májovský et al. [33] explained that LLM models, such as ChatGPT, can create completely fake articles, even in the most critical fields of medical sciences, including neurosurgery.
Vavekanand et al. [34] explained LLaMA 3.1 of the LLM family. For its creation, the database underwent data cleaning, duplication removal, and quality filtering. The model can generate content in multiple languages and can be effectively integrated into chatbots. It can also assist content creators in writing full articles with proper context.
Elkhatat et al. [35] conducted a comprehensive study evaluating various AI-detection tools in differentiating content generated by GPT-3.5, GPT-4, and humans. They used tools such as OpenAI’s classifier, Writer, Copyleaks, GPTZero, and CrossPlag to analyze the content. These tools showed inconsistencies in detecting human-written text and, in particular, content generated by GPT-4.
Hamed [36] proposed an algorithm for the reliable identification of human-written and machine-generated texts. They assembled different machine learning (ML) techniques to distinguish texts.
Alamleh et al. [37] examined the challenges of distinguishing between AI-generated and human-written texts in academic and scientific contexts. They trained and tested a variety of ML models, such as logistic regression (LR), support vector machines (SVMs), decision trees (DTs), neural networks (NNs), and random forests (RFs), using the accuracy, the computational efficiency, and confusion matrices. Through an analysis, they found that RFs are the most suitable for this task.
Perkins et al. [38] conducted a comprehensive study on LLMs in the writing domain, specifically in the era of students utilizing AI tools in their educational assessments. The study highlighted improvements in composition and writing instruction, collaboration between AI and humans, advancements in automated writing evaluation (AWE) [39], and increased support for English as a foreign language (EFL) learners [33].
Abburi et al. [40] proposed a multifaceted neural approach that combines stylometric features and semantic embeddings, comparing optimized and simpler architectures, to tackle both binary and multiclass AI-text detection.
Latif et al. [41] proposed two modified deep recurrent neural network architectures (DRNN-1 and DRNN-2) to distinguish between AI-generated and human-written text. A dataset of 900 short answers was developed across the information technology (IT), cybersecurity, and cryptography fields, with 450 responses from students and 450 generated by ChatGPT. DRNN-2 achieved the best performance, with an accuracy of 88.52% on the full dataset. Table 1 summarizes related studies on AI-generated text identification.

2.4. Research Gap and Contributions

From a literature point of view, and to our knowledge, few studies have addressed the identification of machine-generated text produced by models such as LLaMA. Moreover, most prior work has focused on general-purpose text identification or academic writing, with very few studies addressing domain-specific social media content such as cryptocurrency-related tweets. Existing models often lack adaptability to the financial and cryptocurrency domains, where linguistic style and sentiment differ significantly from general text.
This gap establishes the need for a domain-adaptive framework capable of distinguishing between human-authored and LLM-generated financial text. To address this, our study makes the following contributions:
  • Novel Dataset Preparation: We prepared a novel finance-based dataset using LLaMA3.2, Phi3.5, Gemma2, Qwen2.5, Mistral, and LLaVA.
  • Dataset Pre-Processing: This study considered different efficient pre-processing steps for dataset cleaning, making it applicable to ML and DL models and preserving the contextual meaning.
  • Fine-Tuning of Transformer Models: We extensively fine-tuned the transformer models to achieve an effective score and run the models with limited resources.
  • Transformer Model Comparison: The proposed fine-tuned DeBERTa base model was compared with other models, such as DistilBERT, BERT base, ELECTRA, and ALBERT base V1.
  • Machine Learning Models: Different machine learning models, such as logistic regression, naive Bayes, random forest, decision trees, XGBoost, AdaBoost, and voting (AdaBoost, GradientBoosting, XGBoost), with a word2vec approach, were applied to the prepared dataset to prove the proposed model’s robustness.

3. Research Methodology

In this section, we describe the various components of our proposed study, including the preparation of the novel dataset, pre-processing, model fine-tuning, and the multiple strategies and techniques employed in our research; see the abstract diagram in Figure 1.

3.1. Dataset and AI Tweet Generation

The growing use and improvements of LLM models are making it increasingly difficult to differentiate between human- and machine-generated content, especially in financial domains such as cryptocurrency, where opinions and sentiments can significantly influence market behavior. Despite significant progress in AI text detection, most existing studies have focused on academic or general-purpose datasets, with limited focus on domain-specific social media content. To address this gap, a balanced domain-specific dataset of cryptocurrency tweets was prepared in this study. The goal was to enable the training and evaluation of a model capable of detecting machine-generated financial text across different LLM architectures.
  • Real Cryptocurrency Tweets: A total of 25,000 real tweets were retrieved from the open-source “Cryptocurrency Tweets” dataset available on Kaggle [42]. These tweets represent genuine user expression related to crypto markets, news, and community discussions.
  • AI-Generated Cryptocurrency Tweets using LLMs: We utilized local LLMs run via Ollama [43] to regenerate the cryptocurrency text and prepare the final dataset. The selected models included LLaMA3.2:3B, Phi3.5:8B, Gemma2:9B, Qwen2.5:7B, Mistral:7B, and LLaVA:7B. Each LLM was prompted with the original human tweets to generate domain-consistent, semantically rich, and sentimentally diverse content (an illustrative generation sketch is given after this list). The inclusion of multiple LLMs was essential to ensure cross-model diversity and generalization. Since different LLMs produce content with distinct stylistic and contextual signatures, their outputs collectively simulate a realistic and challenging detection environment for the proposed model.
  • Dataset Composition: To reduce overfitting and overcome the issues with the majority and minority classes, we considered a balanced dataset. The final prepared dataset consisted of ~175,000 data samples with ~25,000 samples per class before pre-processing. The dataset can easily be extended to incorporate newly emerging LLMs by regenerating the original content corpus using any additional model. This ensures that the detection framework remains future-ready and adaptable.
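Below is a minimal sketch of the regeneration step described above, assuming the ollama Python client; the rewriting prompt and the exact local model tags are illustrative assumptions rather than the authors' documented pipeline.

```python
# Minimal sketch: regenerating real cryptocurrency tweets with local LLMs via Ollama.
# The prompt wording and model tags are assumptions, not the authors' exact setup.
import ollama

MODELS = ["llama3.2:3b", "phi3.5", "gemma2:9b", "qwen2.5:7b", "mistral:7b", "llava:7b"]

def regenerate(tweet: str, model: str) -> str:
    """Ask one local LLM to rewrite a single human-written cryptocurrency tweet."""
    response = ollama.chat(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Rewrite the following cryptocurrency tweet in your own words:\n{tweet}",
        }],
    )
    return response["message"]["content"].strip()

# Build one labelled sample per class for a single real tweet.
real_tweet = "Im participating in the Trust Trading contest to win a thanks to"
samples = [("real", real_tweet)] + [(m, regenerate(real_tweet, m)) for m in MODELS]
```

Repeating this loop over all 25,000 real tweets yields one regenerated sample per model per tweet, giving the seven balanced classes described above.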

3.2. Pre-Processing

Data pre-processing is crucial in machine learning and deep learning [44]. The financial tweet dataset retrieved from Kaggle [42] was properly pre-processed before the LLMs were applied to regenerate the content. Miyajiwala et al. [45] demonstrated that removing common tokens, such as stop words, can degrade the performance of transformer-based models such as BERT. Consequently, retaining stop words and avoiding lemmatization helps maintain consistency with the model’s pretraining corpus and preserves essential semantic and syntactic cues. Therefore, we did not eliminate stop words, preserving the contextual meaning of the text. Table 2 presents samples of the final dataset after pre-processing.
Figure 2 shows the word cloud of the prepared final dataset after pre-processing, and Figure 3 shows the word length.
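As an illustration of the light cleaning described above, the sketch below removes only URLs and user mentions and collapses whitespace, deliberately keeping stop words and word forms intact; the exact cleaning rules used by the authors are an assumption.

```python
import re

def clean_tweet(text: str) -> str:
    """Light tweet cleaning that keeps stop words to preserve contextual cues."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove links
    text = re.sub(r"@\w+", " ", text)                # remove user mentions
    return re.sub(r"\s+", " ", text).strip()         # collapse whitespace

print(clean_tweet("Im participating in the @TrustTrading contest https://t.co/xyz"))
```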

3.3. Linguistic Analysis

To investigate the stylistic difference between real and machine-generated cryptocurrency tweets, a part-of-speech (POS) tag distribution analysis was conducted on a representative subset of 2000 randomly selected samples per class. Figure 4 presents the normalized frequency ratios of key POS categories across all seven classes. The results revealed that machine-generated tweets, particularly those produced by Phi3.5 and Gemma2, exhibited a noun-dominant structure, where nouns constituted approximately 26–30% of the total tokens, compared to 20–22% in real tweets. In contrast, the real class demonstrated higher proportions of verbs (6–8%) and pronouns (3–4%), reflecting more dynamic, conversational, and narrative linguistic patterns typical of genuine social media communication. Machine-generated texts generally show a reduced use of pronouns and a preference for adjectives and noun phrases, suggesting more formal and information-dense sentence constructions. These quantitative and structural differences highlight clear signatures between human- and AI-generated content. This finding further supports the claim that the DeBERTa base model effectively captures latent stylistic and syntactic cues, enhancing both the interpretability and the trust in its classification performance.
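A sketch of how such POS ratios can be computed is given below, assuming spaCy's en_core_web_sm model; the tagging toolkit actually used by the authors is not stated.

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def pos_ratios(texts):
    """Share of each universal POS tag (NOUN, VERB, PRON, ...) across a list of tweets."""
    counts = Counter()
    for doc in nlp.pipe(texts):
        counts.update(tok.pos_ for tok in doc if not tok.is_space)
    total = sum(counts.values())
    return {tag: round(n / total, 3) for tag, n in counts.most_common()}

# e.g., compare pos_ratios(real_samples) with pos_ratios(phi35_samples)
# on 2000 randomly selected tweets per class to reproduce Figure 4.
```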

3.4. Proposed Model (Fine-Tuned)

Recent studies have demonstrated that utilizing pre-trained deep neural networks as language models yields significantly better results [46]. Research on contextual learning has gained popularity, especially in deep learning and NLP [47,48]. Word meaning and contextual accuracy must be preserved in vector embeddings, and this concern is addressed well by BERT models, which use a bidirectional transformer as their training framework [49]. To handle words and sub-words, special tokens ([CLS] and [SEP]) and word embeddings are used to process input sequences in the BERT architecture. Furthermore, a multi-layer transformer architecture with self-attention is employed to capture word relationships and produce contextualized word embeddings. There are many variations of BERT models [32], and the size of these models typically depends on three parameters: the number of transformer layers, the number of self-attention heads, and the hidden state vector dimensionality, as shown in Table 3. The proposed model of this study is based on decoding-enhanced BERT with disentangled attention (DeBERTa), developed by Microsoft [50]. DeBERTa is an improved version (in terms of performance and efficiency) of other transformer models such as BERT and ELECTRA. Two major contributions in the DeBERTa model make it superior to previous models.
  • Disentangled Attention Mechanism: This mechanism separates word content and position representations to capture more complex relationships between words.
  • Enhanced Mask Decoder: This enhances the pre-training objective, making the model more effective in downstream tasks.
An extensive fine-tuning process was applied to achieve an effective score, make the model runnable with the limited available resources, and overcome overfitting. To conserve computational power and adapt the pre-trained model to our dataset, some layers of the network were frozen, while others remained unfrozen to capture new intrinsic information. In this study, we froze eight layers for models with twelve layers, i.e., the BERT base, DeBERTa base, and ELECTRA. In contrast, for the DistilBERT base, half of the network layers were frozen and half were unfrozen, i.e., three layers were frozen.
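The freezing strategy can be expressed with the Hugging Face Transformers API as sketched below; the microsoft/deberta-base checkpoint and the decision to also freeze the embedding layer alongside the first eight encoder layers are assumptions for illustration.

```python
from transformers import AutoModelForSequenceClassification

# 7 classes: real tweets plus six LLM-generated classes.
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-base", num_labels=7)

# Freeze embeddings and the first 8 of 12 encoder layers; keep the top 4 layers
# and the classification head trainable.
for param in model.deberta.embeddings.parameters():
    param.requires_grad = False
for layer in model.deberta.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```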

4. Results and Discussion

In this section, we explore evaluation metrics in depth, delve into the details of our experimental procedures, and present the outcomes of the proposed methodology.

4.1. Performance Evaluation Metrics

Key performance indicators were used to evaluate the performance of the machine and deep learning models. The importance of these indicators cannot be neglected when proposing a well-generalized model [51]. This study focused on four evaluation metrics to assess the proposed model’s generalizability and validity. In Equations (1)–(3), the TP, FP, TN, and FN represent true positives, false positives, true negatives, and false negatives, respectively.
Accuracy: Accuracy measures a model’s overall correctness as the ratio of correctly classified instances to the total number of samples. However, for imbalanced datasets, or when different error types have varying significance, accuracy alone may not provide a complete evaluation. It is calculated using Equation (1):
Accuracy = (TP + TN) / (TP + FP + TN + FN). (1)
Precision: Precision measures the proportion of samples predicted as positive that are actually positive. It is calculated using Equation (2):
Precision = TP / (TP + FP). (2)
Recall: Recall measures the model’s ability to identify all the relevant positive instances. A high recall indicates that the model can easily capture positive cases. Recall is also more important in domains where identifying positive cases is crucial, such as the medical domain. The recall is calculated using Equation (3):
Recall = TP / (TP + FN). (3)
F1-Score: The harmonic mean of the precision (P) and recall (R) is known as the F1-score, which is a single metric used to evaluate the model performance. The F1-score ranges from 0 to 1, with 1 indicating optimal performance. The F1-score is important in cases where false positives and false negatives are important to consider. It is calculated using Equation (4):
F1 = 2 × (P × R) / (P + R). (4)
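These metrics correspond to standard scikit-learn calls; a minimal example with placeholder labels, using the weighted averaging reported later in the tables, is shown below.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder predictions for illustration only.
y_true = ["real", "Phi3.5", "Gemma2", "Mistral", "real"]
y_pred = ["real", "Mistral", "Gemma2", "Mistral", "real"]

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"Accuracy={acc:.2f}  Precision={prec:.2f}  Recall={rec:.2f}  F1={f1:.2f}")
```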

4.2. Experimental and Hardware Setup

The hardware configuration included an Ubuntu 22.04.5 LTS operating system, a 2 TB disk, 32 GB of RAM, and an Nvidia GeForce RTX 3050 GPU with 20 GB of memory, as shown in Table 4.

4.3. Hyperparameter Configurations

We employed an experimental approach to fine-tune key hyperparameters, aiming to achieve a high classification accuracy with the proposed model for classifying machine-generated financial content. These hyperparameters included the batch size, learning rate, optimizer type, number of epochs, scheduler, and loss function (see Table 5 for the hyperparameter values).
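A sketch of this configuration in PyTorch is shown below. The batch size, learning rate, optimizer, epoch count, and linear warmup scheduler follow Table 5, while the warmup length, cross-entropy loss, and training-split size are illustrative assumptions.

```python
import torch
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

EPOCHS, BATCH_SIZE, LR = 5, 32, 2e-5
TRAIN_SAMPLES = 122_500                        # roughly 70% of ~175,000 samples (assumed split)

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-base", num_labels=7)    # as in the fine-tuning sketch above

steps_per_epoch = TRAIN_SAMPLES // BATCH_SIZE
total_steps = EPOCHS * steps_per_epoch

optimizer = torch.optim.Adam(model.parameters(), lr=LR)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.1 * total_steps), num_training_steps=total_steps)
loss_fn = torch.nn.CrossEntropyLoss()          # standard multiclass loss
```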

4.4. Proposed Model Results

In this study, we proposed a fine-tuned DeBERTa base model with eight layers frozen and the remaining four layers unfrozen. Several other BERT variants were also fine-tuned, and the proposed model outperformed all of them. The fine-tuned DeBERTa comprises 12 transformer layers, a hidden state vector of size 768, and 12 self-attention heads. The model was validated using 15% of the data and tested on a further 15%. Overall, the proposed model’s performance was superior to the other BERT variants, with a training accuracy of 0.99, a validation accuracy of 0.94, a test accuracy of 0.94, a weighted precision of 0.94, a weighted recall of 0.94, and a weighted F1-score of 0.94; these results are shown in Table 6, and the classification report of the proposed fine-tuned DeBERTa model, showing per-class performance, is given in Figure 5.
The classification error can be visualized using a confusion matrix, which includes true positives, true negatives, false positives, and false negatives. The proposed model achieved AUC scores of “Gemma2”: 1.00, “LLaMA3.2”: 0.97, “Mistral”: 0.96, “Qwen2.5”: 0.92, “real”: 0.98, “LLaVA”: 1.00, and “Phi3.5”: 0.95. The confusion matrix of the proposed model for identifying machine-generated financial content is presented in Figure 6.
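The confusion matrix and the one-vs-rest AUC scores can be reproduced with scikit-learn as sketched below; the label order, synthetic labels, and probability matrix are placeholders rather than the study's actual outputs.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.preprocessing import label_binarize

class_names = ["Gemma2", "LLaMA3.2", "Mistral", "Qwen2.5", "real", "LLaVA", "Phi3.5"]

rng = np.random.default_rng(0)
y_true = rng.integers(0, len(class_names), size=500)          # placeholder labels
y_prob = rng.dirichlet(np.ones(len(class_names)), size=500)   # placeholder softmax outputs

cm = confusion_matrix(y_true, y_prob.argmax(axis=1))          # 7x7 confusion matrix
y_true_bin = label_binarize(y_true, classes=list(range(len(class_names))))
for i, name in enumerate(class_names):
    print(f"{name}: AUC = {roc_auc_score(y_true_bin[:, i], y_prob[:, i]):.2f}")
```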
Although the confusion matrix and classification reports provide a quantitative overview of the proposed model’s performance, the qualitative analysis also offers more profound insights. Table 7 presents illustrative failure cases.

4.5. Performance of the Machine Learning (ML) Models over the Prepared Dataset

In this study, several ML models were also evaluated using the Word2Vec approach [52], in which words are represented as dense vectors in a continuous vector space. This approach was selected because it can capture word relationships and context, even for short texts. The experimental results of the machine learning models using Word2Vec features are presented in Table 8. Among these models, the voting ensemble performed best, with an accuracy, weighted precision, weighted recall, and weighted F1-score of approximately 0.80.
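The Word2Vec baseline can be sketched as follows: each tweet is embedded as the mean of its word vectors and passed to a soft-voting ensemble of AdaBoost, GradientBoosting, and XGBoost. The vector size, toy placeholder corpus, and default estimator settings are assumptions.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from xgboost import XGBClassifier

# Toy placeholder corpus; in the study, the cleaned seven-class tweet dataset is used.
train_texts = ["bitcoin to the moon", "trust trading contest entry", "hodl your coins",
               "responsible ai ensures user privacy", "web3 plays a key role",
               "decentralized data economy for everyone"]
train_labels = [0, 0, 0, 1, 1, 1]

tokenized = [t.lower().split() for t in train_texts]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1, workers=2)

def embed(tokens):
    """Average the Word2Vec vectors of a tweet's tokens."""
    vecs = [w2v.wv[w] for w in tokens if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([embed(t) for t in tokenized])
voting = VotingClassifier(
    estimators=[("ada", AdaBoostClassifier()),
                ("gb", GradientBoostingClassifier()),
                ("xgb", XGBClassifier())],
    voting="soft")
voting.fit(X, train_labels)
```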
We further evaluated the ML models using GridSearchCV (cv = 3) to optimize the hyperparameters for each ML algorithm. The optimized models outperformed their default counterparts in most cases; see Table 9.
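A representative GridSearchCV call (cv = 3) for the random forest row of Table 9 is sketched below; the weighted-F1 scoring choice is an assumption, and X and train_labels refer to the Word2Vec features from the previous sketch.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [5, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                      cv=3, scoring="f1_weighted", n_jobs=-1)
search.fit(X, train_labels)      # features and labels from the Word2Vec sketch above
print(search.best_params_, search.best_score_)
```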
The proposed study compared different BERT variants and ML models using the Word2Vec approach. The proposed fine-tuned DeBERTa model outperformed all other BERT variants and machine learning models. The voting ensemble scored highest among the machine learning models; however, its performance remained below that of our proposed model. A performance comparison of the proposed fine-tuned DeBERTa model and the voting ensemble is shown in Figure 7.
A direct comparison with state-of-the-art studies was not possible due to differences in classes and datasets; however, we compared this study with some existing finance-related studies, as shown in Table 10.

4.6. Theoretical and Practical Implications

The proposed study contributes to the identification of text authenticity using language models, particularly in the finance domain, with the primary goal of distinguishing between human- and LLM-written cryptocurrency statements. Applying BERT-based models to machine-generated text identification advances the theoretical foundations of NLP and AI in finance. The preparation of a novel dataset of AI-generated text variations highlights how such content is created and underscores the need for identification methods. This research also contributes to transparency in financial communication.
From a practical perspective, this study has significant implications for various stakeholders in the finance domain. Institutes and regulatory bodies can use the developed model to ensure that financial communications are genuine and trustworthy. This research is also helpful in identifying the true source of financial reports and analyses, i.e., authorship verification. Furthermore, this research is of significant importance, as it provides a way to assess and mitigate the risks associated with AI-generated financial content. This study focused on improving text classification while maintaining the integrity of financial information. Additionally, practical considerations such as system latency, confidence-threshold calibration, and model interpretability are essential to ensure reliable and secure integration within financial monitoring systems.

5. Conclusions

This study’s proposed DeBERTa base model demonstrates the effectiveness of distinguishing between human- and AI-generated financial text. With the novel dataset prepared using LLM (large language model) variants, i.e., LLaMA3.2:3b, Phi3.5:8b, Gemma2:9b, Mistral:7b, Qwen2.5:7b, and LLaVA:7b, we evaluated the classification framework, which achieved strong results. Furthermore, we compared the proposed model with various BERT variants, including BERT base, ELECTRA, and DistilBERT, as well as traditional ML models using the Word2Vec approach. The proposed model exhibited a superior, well-generalized performance, with training, validation, and test accuracies of 0.99, 0.94, and 0.94, respectively, and a precision, recall, and F1-score of 0.94 each, outperforming all other models. The findings underscore the importance of the proposed model and of text authenticity in the cryptocurrency sector. Although the proposed model demonstrates a strong performance, future work should expand the dataset and increase the number of classes to capture more information, rather than focusing on only seven classes, and should evaluate larger LLMs such as GPT-4.

Author Contributions

All authors contributed equally. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset used in this study consists of seven classes overall. Subsets of samples for the real class were obtained from publicly available sources (Kaggle), as cited in the manuscript. The authors generated samples for the remaining six classes, which are not publicly available due to ongoing research. However, a subset of the dataset in its raw form, along with the sample code, is available at https://github.com/Muhammad-Asad-Arshed/Machine-Generated-Text-Identification, accessed on 15 October 2025. The full dataset can be provided by the corresponding author upon reasonable request for research purposes only.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

  1. Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A Comprehensive Overview of Large Language Models. ACM Trans. Intell. Syst. Technol. 2025, 16, 1–72. [Google Scholar] [CrossRef]
  2. Wu, T.; He, S.; Liu, J.; Sun, S.; Liu, K.; Han, Q.-L.; Tang, Y. A Brief Overview of ChatGPT: The History, Status Quo and Potential Future Development. IEEE/CAA J. Autom. Sin. 2023, 10, 1122–1136. [Google Scholar] [CrossRef]
  3. Bahrini, A.; Khamoshifar, M.; Abbasimehr, H.; Riggs, R.J.; Esmaeili, M.; Majdabadkohne, R.M.; Pasehvar, M. ChatGPT: Applications, Opportunities, and Threats. In Proceedings of the 2023 Systems and Information Engineering Design Symposium, SIEDS 2023, Charlottesville, VA, USA, 27–28 April 2023; pp. 274–279. [Google Scholar] [CrossRef]
  4. Kalia, P.; Kaur, M.; Thomas, A. E-Tailers’ Twitter (X) Communication: A Textual Analysis. Int. J. Consum. Stud. 2025, 49, e70075. [Google Scholar] [CrossRef]
  5. Tao, Y.; Shao, Y. The Impact of News Media Sentiment on Financial Markets. SSRN Electron. J. 2025. [Google Scholar] [CrossRef]
  6. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  7. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288v2. [Google Scholar] [CrossRef]
  8. Chaudhari, S.; Aggarwal, P.; Murahari, V.; Rajpurohit, T.; Kalyan, A.; Narasimhan, K.; Deshpande, A.; da Silva, B.C. RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs. arXiv 2024, arXiv:2404.08555v2. [Google Scholar] [CrossRef]
  9. Meta Llama 2. Available online: https://www.llama.com/llama2/ (accessed on 20 September 2024).
  10. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783v3. [Google Scholar] [CrossRef]
  11. Abdin, M.; Aneja, J.; Awadalla, H.; Awadallah, A.; Awan, A.A.; Bach, N.; Bahree, A.; Bakhtiari, A.; Bao, J.; Behl, H.; et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv 2024, arXiv:2404.14219v4. [Google Scholar] [CrossRef]
  12. Qwen Team. Qwen2.5 Technical Report. arXiv 2024, arXiv:2412.15115v2. [Google Scholar] [CrossRef]
  13. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825v1. [Google Scholar] [CrossRef]
  14. Gemma Team. Gemma 2: Improving Open Language Models at a Practical Size. arXiv 2024, arXiv:2408.00118. [Google Scholar] [CrossRef]
  15. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. Adv. Neural Inf. Process Syst. 2023, 36, 34892–34916. Available online: https://arxiv.org/pdf/2304.08485 (accessed on 30 September 2025).
  16. Verma, A.A.; Kurupudi, D.; Sathyalakshmi, S. BlogGen-A Blog Generation Application Using Llama-2. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems, ADICS 2024, Chennai, India, 18–19 April 2024. [Google Scholar] [CrossRef]
  17. Malisetty, B.; Perez, A.J. Evaluating Quantized Llama 2 Models for IoT Privacy Policy Language Generation. Future Internet 2024, 16, 224. [Google Scholar] [CrossRef]
  18. Lin, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Barcelona, Spain, 2004. Available online: https://aclanthology.org/W04-1013.pdf (accessed on 20 September 2024).
  19. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. Bertscore: Evaluating Text Generation with Bert. In Proceedings of the 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  20. Liu, W.; Cao, Z.; Wang, J.; Wang, X. Short text classification based on Wikipedia and Word2vec. In Proceedings of the 2016 2nd IEEE International Conference on Computer and Communications (ICCC), Chengdu, China, 14–17 October 2016; Available online: https://ieeexplore.ieee.org/abstract/document/7924894/ (accessed on 29 March 2024).
  21. Pennington, J.; Socher, R.; Manning, C. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. Available online: https://aclanthology.org/D14-1162.pdf (accessed on 20 September 2024).
  22. Labruna, T.; Fondazione, S.B.; Kessler, B.; Fondazione, B.M. Dynamic Task-Oriented Dialogue: A Comparative Study of Llama-2 and Bert in Slot Value Generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, St. Julian’s, Malta, 21–22 March 2024; pp. 358–368. Available online: https://aclanthology.org/2024.eacl-srw.29/ (accessed on 20 September 2024).
  23. Ye, F.; Manotumruksa, J.; Yilmaz, E. MultiWOZ 2.4: A Multi-Domain Task-Oriented Dialogue Dataset with Essential Annotation Corrections to Improve State Tracking Evaluation. In Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue, Edinburgh, UK, 7–9 September 2022; pp. 351–360. [Google Scholar] [CrossRef]
  24. Hou, G.; Lian, Q. Benchmarking of Commercial Large Language Models: ChatGPT, Mistral, and Llama. Res. Sq. 2024. [Google Scholar] [CrossRef]
  25. Shahriar, S.; Hayawi, K. Let’s have a chat! A Conversation with ChatGPT: Technology, Applications, and Limitations. Artif. Intell. Appl. 2023, 2, 11–20. [Google Scholar] [CrossRef]
  26. Arshed, M.A.; Gherghina, Ș.C.; Dewi, C.; Iqbal, A.; Mumtaz, S. Unveiling AI-Generated Financial Text: A Computational Approach Using Natural Language Processing and Generative Artificial Intelligence. Computation 2024, 12, 101. [Google Scholar] [CrossRef]
  27. Haleem, A.; Javaid, M.; Singh, R.P. An era of ChatGPT as a significant futuristic support tool: A study on features, abilities, and challenges. BenchCouncil Trans. Benchmarks Stand. Eval. 2022, 2, 100089. [Google Scholar] [CrossRef]
  28. Penelitian, J.; Pembelajaran, P.; Nurmayanti, N.; Stkip, S.; Banten, S. The Effectiveness of Using Quillbot in Improving Writing for Students of English Education Study Program. J. Teknol. Pendidikan J. Penelit. Dan Pengemb. Pembelajaran 2023, 8, 32–40. [Google Scholar] [CrossRef]
  29. Orenstrakh, M.S.; Karnalim, O.; Suarez, C.A.; Liut, M. Detecting LLM-Generated Text in Computing Education: A Comparative Study for ChatGPT Cases. In Proceedings of the 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), Osaka, Japan, 2–4 July 2024. [Google Scholar] [CrossRef]
  30. Zhang, H.; Song, H.; Li, S.; Zhou, M.; Song, D. A Survey of Controllable Text Generation Using Transformer-based Pre-trained Language Models. ACM Comput. Surv. 2023, 56, 64. [Google Scholar] [CrossRef]
  31. Kumarage, T.; Garland, J.; Bhattacharjee, A.; Trapeznikov, K.; Ruston, S.; Liu, H. Stylometric Detection of AI-Generated Text in Twitter Timelines. arXiv 2023, arXiv:2303.03697. [Google Scholar] [CrossRef]
  32. Joshy, A.; Sundar, S. Analyzing the Performance of Sentiment Analysis using BERT, DistilBERT, and RoBERTa. In Proceedings of the 2022 IEEE International Power and Renewable Energy Conference, IPRECON 2022, Kollam, India, 16–18 December 2022. [Google Scholar] [CrossRef]
  33. Májovský, M.; Černý, M.; Kasal, M.; Komarc, M.; Netuka, D. Artificial Intelligence Can Generate Fraudulent but Authentic-Looking Scientific Medical Articles: Pandora’s Box Has Been Opened. J. Med. Internet Res. 2023, 25, e46924. [Google Scholar] [CrossRef]
  34. Sam, K.; Vavekanand, R. Llama 3.1: An In-Depth Analysis of the Next-Generation Large Language Model. Preprints 2024. [Google Scholar] [CrossRef]
  35. Elkhatat, A.M.; Elsaid, K.; Almeer, S. Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text. Int. J. Educ. Integr. 2023, 19, 17. [Google Scholar] [CrossRef]
  36. Hamed, A. Improving detection of chatgpt-generated fake science using real publication text: Introducing xfakebibs a supervised-learning network algorithm. Preprints 2023. Available online: https://www.preprints.org/manuscript/202304.0350 (accessed on 20 September 2024).
  37. Alamleh, H.; AlQahtani, A.A.S.; ElSaid, A. Distinguishing Human-Written and ChatGPT-Generated Text Using Machine Learning. In Proceedings of the 2023 Systems and Information Engineering Design Symposium (SIEDS), Charlottesville, VA, USA, 27–28 April 2023; Available online: https://ieeexplore.ieee.org/document/10137767 (accessed on 5 September 2023).
  38. Perkins, M. Academic Integrity considerations of AI Large Language Models in the post-pandemic era: ChatGPT and beyond. J. Univ. Teach. Learn. Pr. 2023, 20, 7. [Google Scholar] [CrossRef]
  39. Miranty, D.; Widiati, U. An automated writing evaluation (AWE) in higher education. Pegem J. Educ. Instr. 2021, 11, 126–137. [Google Scholar] [CrossRef]
  40. Abburi, H.; Bhattacharya, S.; Bowen, E.; Pudota, N. AI-generated Text Detection: A Multifaceted Approach to Binary and Multiclass Classification. arXiv 2025, arXiv:2505.11550v1. [Google Scholar] [CrossRef]
  41. Latif, G.; Mohammad, N.; Brahim, G.B.; Alghazo, J.; Fawagreh, K. Detection of AI-written and human-written text using deep recurrent neural networks. In Proceedings of the Fourth Symposium on Pattern Recognition and Applications (SPRA 2023), Napoli, Italy, 1–3 December 2023. [Google Scholar] [CrossRef]
  42. Cryptocurrency Tweets. Available online: https://www.kaggle.com/datasets/infsceps/cryptocurrency-tweets/data (accessed on 2 October 2025).
  43. Ollama. Available online: https://ollama.com/ (accessed on 21 September 2024).
  44. Siino, M.; Tinnirello, I.; La Cascia, M. Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on Transformers and traditional classifiers. Inf. Syst. 2024, 121, 102342. [Google Scholar] [CrossRef]
  45. Miyajiwala, A.; Ladkat, A.; Jagadale, S.; Joshi, R. On Sensitivity of Deep Learning Based Text Classification Algorithms to Practical Input Perturbations. arXiv 2021, arXiv:2201.00318. [Google Scholar]
  46. Garrido-Merchan, E.C.; Gozalo-Brizuela, R.; Gonzalez-Carvajal, S. Comparing BERT against traditional machine learning text classification. J. Comput. Cogn. Eng. 2020, 2, 352–356. [Google Scholar] [CrossRef]
  47. Favre, B. Contextual Language Understanding Thoughts on Machine Learning in Natural Language Processing. 2019. Available online: https://amu.hal.science/tel-02470185 (accessed on 21 September 2024).
  48. Zhou, M.; Duan, N.; Liu, S.; Shum, H.Y. Progress in Neural NLP: Modeling, Learning, and Reasoning. Engineering 2020, 6, 275–290. [Google Scholar] [CrossRef]
  49. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North, Minneapolis, Minnesota, 4–6 October 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  50. He, P.; Liu, X.; Gao, J.; Chen, W. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. In Proceedings of the ICLR 2021-9th International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  51. Naidu, G.; Zuva, T.; Sibanda, E.M. A Review of Evaluation Metrics in Machine Learning Algorithms. Lect. Notes Netw. Syst. 2023, 724, 15–25. [Google Scholar] [CrossRef]
  52. Cahyani, D.E.; Patasik, I. Performance comparison of tf-idf and word2vec models for emotion text classification. Bull. Electr. Eng. Inform. 2021, 10, 2780–2788. [Google Scholar] [CrossRef]
Figure 1. Abstract diagram of the proposed study.
Figure 2. Final prepared dataset word cloud.
Figure 3. Word lengths in the prepared dataset.
Figure 4. Distribution comparison of top ten most frequent parts of speech across classes, analyzed using 2000 randomly selected samples per class.
Figure 5. Proposed model’s per-class evaluation scores.
Figure 6. Confusion matrix of proposed model.
Figure 7. Performance comparison of proposed model and voting model.
Table 1. Summary of related studies on AI-generated text identification.
Study | Domain/Dataset | Approach/Model | Accuracy/Key Findings
Arshed et al. [26] | Financial tweets | Word2Vec + random forest | Accuracy of 0.74; highlights difficulty in distinguishing AI-generated tweets
Kumarage et al. [31] | Twitter (general stylometric) | RoBERTa-based classifier | Successfully detected AI vs. human tweets using linguistic and stylistic cues
Elkhatat et al. [35] | General text (GPT3.5, GPT4, human) | Comparative tool analysis (GPTZero, Copyleaks, CrossPlag) | Found inconsistencies in existing AI detection tools
Hamed [36] | General text | ML ensemble | Proposed reliable ML algorithm for human and AI text discrimination
Alamleh et al. [37] | Academic and general text | SVM, RF, DT, NN | Found random forest to be the most effective
Abburi et al. [40] | Mixed domain | Neural hybrid | Achieved a good accuracy for binary and multiclass AI text detection
Latif et al. [41] | Academic responses (IT, cybersecurity) | Deep RNN architectures (DRNN-1, DRNN-2) | DRNN-2 achieved an accuracy of 88.52% for the mixed dataset
Table 2. Prepared dataset samples.
Class | Text
Real | “Im participating in the Trust Trading contest to win a thanks to”
LLaMA3.2 | “I am participating in the Trust Trading contest with the aim of winning a thank you.”
Phi3.5 | “Engaging in the Trust Exchange competition aims to secure gratitude”
Gemma2 | “I am competing in the Trust Trading contest for a prize.”
Qwen2.5 | “I’m joining the Trust Trading contest to win, thank you for the opportunity”
Mistral | “Competing in the Trust Trading competition with an aim to receive recognition”
LLaVA | “I’m joining the Trust Trading competition to potentially receive gratitude from someone.”
Table 3. Comparison of base transformer models in terms of architecture, vocabulary size, and sequence length.
Model | Transformer Layers | Hidden State Size | Self-Attention Heads | Max Sequence Length
Proposed (DeBERTa base) | 12 | 768 | 12 | 512
DistilBERT (base-uncased) | 6 | 768 | 12 | 512
BERT base (uncased) | 12 | 768 | 12 | 512
ELECTRA base | 12 | 768 | 12 | 512
ALBERT base v1 | 12 | 768 | 12 | 512
Table 4. Hardware configurations for experiments.
Hardware Information | Related Configuration
Operating system | Ubuntu 22.04.5 LTS
System RAM | 32 GB (gigabytes)
Disk | 2 TB (terabytes)
CUDA version | 12.2
GPU | Nvidia GeForce RTX 3050
GPU space | 20 GB (parallel)
Table 5. Hyperparameter configurations for models.
Hyperparameter | Value
Batch size | 32
Learning rate | 0.00002
Optimizer | Adam
Epochs | 5
Scheduler | Linear warmup scheduler
Table 6. Proposed model results and comparison with other BERT variations.
Model | Train | Valid | Test | Precision | Recall | F1-Score
Proposed (DeBERTa base) | 0.99 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94
DistilBERT | 0.95 | 0.78 | 0.78 | 0.79 | 0.79 | 0.79
BERT base | 0.97 | 0.78 | 0.78 | 0.79 | 0.79 | 0.79
ELECTRA | 0.98 | 0.79 | 0.80 | 0.81 | 0.80 | 0.80
ALBERT base V1 | 0.92 | 0.78 | 0.78 | 0.79 | 0.79 | 0.79
Table 7. Misclassified samples.
Text | Actual | Predicted | Observation
“maintain privacy; avoid unnecessary exposure of personal details like name and address due to concerns over crypto privacy.” | Phi3.5 | Mistral | The concise and instructive tone, as well as the balanced phrasing, was evident across both model families, resulting in stylistic overlap.
“collaboratively envision a future where responsible ai ensures user privacy, laying the foundation for a transparent and decentralized data economy centered on individual privacy protections, with web3 cryptocurrency playing a key role.” | Qwen2.5 | LLaMA3.2 | The long, visionary phrasing and use of collective optimism resembled LLaMA’s generative style, causing confusion.
Table 8. Performance of machine learning models using default parameters + Word2Vec approach for latest LLM-generated content identification.
Model | Accuracy | Precision (Weighted) | Recall (Weighted) | F1-Score (Weighted)
Logistic regression | 0.53 | 0.53 | 0.53 | 0.52
GaussianNB | 0.39 | 0.41 | 0.39 | 0.38
Random forest | 0.78 | 0.79 | 0.78 | 0.78
Decision tree | 0.73 | 0.73 | 0.73 | 0.73
XGBoost | 0.68 | 0.68 | 0.68 | 0.68
AdaBoost | 0.43 | 0.43 | 0.43 | 0.42
Voting (AdaBoost, GradientBoosting, XGBoost) | 0.80 | 0.80 | 0.80 | 0.80
Table 9. Performance of machine learning models using optimized parameters + Word2Vec approach for latest LLM-generated content identification.
Model | Parameter Grid | Best Parameters | Accuracy | Precision (Weighted) | Recall (Weighted) | F1-Score (Weighted)
Logistic regression | “C”: [0.1, 1, 10], “solver”: [“liblinear”, “saga”] | {“C”: 0.1, “solver”: “saga”} | 0.53 | 0.53 | 0.53 | 0.53
GaussianNB | “var_smoothing”: [1e-9, 1e-8, 1e-7] | {“var_smoothing”: 1e-09} | 0.39 | 0.41 | 0.39 | 0.38
Random forest | “n_estimators”: [50, 100, 200], “max_depth”: [5, 10, None] | {“max_depth”: None, “n_estimators”: 200} | 0.79 | 0.79 | 0.79 | 0.79
Decision tree | “max_depth”: [5, 10, None], “min_samples_split”: [2, 5, 10] | {“max_depth”: None, “min_samples_split”: 2} | 0.73 | 0.73 | 0.73 | 0.73
XGBoost | “n_estimators”: [50, 100, 200], “learning_rate”: [0.01, 0.1, 0.2], “max_depth”: [3, 5, 7] | {“learning_rate”: 0.2, “max_depth”: 7, “n_estimators”: 200} | 0.77 | 0.78 | 0.77 | 0.78
AdaBoost | “n_estimators”: [50, 100, 200], “learning_rate”: [0.01, 0.1, 1.0] | {“learning_rate”: 1.0, “n_estimators”: 200} | 0.46 | 0.45 | 0.46 | 0.45
Voting (AdaBoost, GradientBoosting, XGBoost) | “voting”: [“hard”, “soft”] | {“voting”: “soft”} | 0.80 | 0.80 | 0.80 | 0.80
Table 10. Comparison with state-of-the-art finance-related studies.
Study | Classes | Domain | Dataset Size | Model | Results
Arshed et al. [26] | 3 classes (real, GPT, QuillBot) | Finance | Dataset-1 (1500), Dataset-2 (3000) | Random forest with Word2Vec | Dataset-1: 0.74 (74%); Dataset-2: 0.72 (72%)
Latif et al. [41] | 2 classes (human and ChatGPT) | IT, cybersecurity, and cryptography | 900 | DRNN-2 | 0.8852 (88.52% on full dataset)
Present Study | 7 classes (real, LLaMA3.2, Phi3.5, Gemma2, Qwen2.5, Mistral, and LLaVA) | Cryptocurrency | ~175,000 | Fine-tuned DeBERTa base | 0.94 (94%)

Share and Cite

MDPI and ACS Style

Arshed, M.A.; Gherghina, Ş.C.; Khalil, I.; Muavia, H.; Saleem, A.; Saleem, H. A Context-Aware Representation-Learning-Based Model for Detecting Human-Written and AI-Generated Cryptocurrency Tweets Across Large Language Models. Math. Comput. Appl. 2025, 30, 130. https://doi.org/10.3390/mca30060130
