1. Introduction
The proliferation of digital communication has fundamentally altered the information landscape, enabling the rapid, global dissemination of content [
1]. This has amplified the societal impact of information, particularly that which is deceptive or misleading. The challenge of distinguishing truth from falsehood is not new, but its scale and complexity in the digital age are unprecedented. Indeed, the very nature of “truth” in an era dominated by competing narratives and subjective interpretations poses a fundamental philosophical problem. While this study cannot resolve this epistemological challenge, it operates on a necessary, pragmatic premise: for computational systems to function, “truth” and “deception” must be operationalized as distinct, classifiable categories based on available ground-truth data. Our work, therefore, addresses the technical feasibility of this classification while acknowledging the broader, complex reality it attempts to model. Automated deception detection has therefore emerged as a critical area of research in natural language processing (NLP) with the aim of developing systems that can identify and flag potentially untruthful content [
2].
Deception is a complex human behavior. Foundational theories, such as Zuckerman’s Four-Factor Model [
3] and DePaulo’s work on the leakage of deceptive cues [
4], suggest that the act of deceiving imposes a cognitive load that can manifest in linguistic and non-verbal signals. However, these signals can be subtle and context-dependent. A significant challenge in computational deception detection is the heterogeneity of deceptive phenomena. The linguistic properties of a spontaneous, high-stakes lie in a legal testimony may differ substantially from those in a planned, low-stakes deceptive product review or in politically motivated disinformation designed for mass consumption [
5,
6].
This study focuses on a specific subtype of deception: instructed, written deception in a low-stakes environment. To investigate this, we use a well-established multilingual dataset where participants were explicitly asked to write both truthful and deceptive statements. It is crucial to define the scope of “deception” within this context. Here, the ground truth is determined not by an external arbiter of facts, but by the self-reported intent of the writer who created the text samples. Therefore, our models are not learning to identify objective falsehood (which would be required in tasks such as fact-checking) but rather to distinguish between the linguistic patterns of two classes: “text written with truthful intent” and “text written with deceptive intent” under specific instructions. This distinction mitigates the risk of the model merely learning statistical artifacts of a topic, focusing instead on the signals correlated with the act of instructed deception itself. This controlled setting allows for a focused analysis of linguistic patterns, although its findings may not be directly generalizable to all forms of real-world misinformation.
Previous computational approaches to this problem have often relied on feature engineering, using classifiers such as Support Vector Machines (SVMs) with features drawn from lexicons like Linguistic Inquiry and Word Count (LIWC) or from n-gram counts [
7,
8]. While foundational, these methods can be brittle and struggle to capture the nuanced semantics of language. The advent of deep learning, particularly transformer-based models like BERT (Bidirectional Encoder Representations from Transformers) [
9], has revolutionized NLP by enabling models to learn rich, contextual representations of text. These models have shown state-of-the-art performance on numerous tasks, including the detection of fake news and propaganda [
10,
11].
More recently, the rise of instruction-tuned Large Language Models (LLMs) presents a third paradigm. These “generalist” models possess vast world knowledge and can perform tasks via in-context learning with zero or few examples, raising the question of their efficacy on specialized classification tasks compared to fine-tuning smaller transformer models. However, much of the research across these paradigms has focused on English, even though multilingual applicability is especially important for tasks such as deception detection, since deception easily crosses linguistic borders. This gap highlights the necessity of models that are effective across multiple languages.
This study aims to address these needs by systematically evaluating and comparing the performance of these three distinct tiers of NLP models. We formulate our goals through the following research questions (RQs):
RQ1: How do traditional feature-based methods (SVM), fine-tuned specialist models (BERT), and in-context learning with generalist models (LLMs) compare in terms of performance on a multilingual instructed deception detection task?
RQ2: To what extent can fine-tuned transformer models outperform the traditional SVM baseline in both within-topic and cross-topic classification scenarios?
RQ3: What are the practical performance differences between using a monolingual versus a multilingual transformer model for deception detection in a dataset containing both English and Spanish texts?
By focusing on the specific task of instructed deception detection, this paper provides a rigorous and controlled comparison of key NLP techniques. We clarify the capabilities of modern architectures on deceptive language detection, make the limits of the study’s scope explicit, and thereby provide a solid base for future investigations into more complex, real-world deception detection challenges.
The remainder of this paper is organized as follows.
Section 2 reviews the evolution of research in deception detection, from its psychological foundations to the latest computational models.
Section 3 details our methodology, including a description of the dataset; the experimental design for the SVM, BERT, and LLM models; and the evaluation protocol.
Section 4 presents the empirical results of our comparative experiments.
Section 5 provides a comprehensive discussion of these findings, interpreting their significance, addressing the study’s limitations, and considering the broader theoretical and ethical implications. Finally,
Section 6 concludes the paper by summarizing our main contributions and suggesting directions for future work.
3. Materials and Methods
This section details the dataset, model architectures, and experimental protocols used in our study. We provide an in-depth description of each component to ensure methodological transparency and reproducibility.
3.1. Dataset
This study utilizes the “cross-cultural deception detection” dataset from Pérez-Rosas and Mihalcea [
12]. This dataset is a valuable resource for multilingual and cross-cultural analysis as it contains texts generated under a controlled protocol. Participants from four cultural backgrounds (American, Indian, Mexican, and Romanian) were instructed to write short, opinion-based essays on three topics: abortion, the death penalty, and best friends. For each topic, they produced both a truthful and a deceptive essay.
Our experiments focus on the English (from the US and India) and Spanish (from Mexico) parts of the dataset. The English datasets are balanced, containing 100 deceptive and 100 truthful samples for each topic (600 total samples per culture). The Spanish dataset is imbalanced, with 71 samples for abortion, 84 for the death penalty, and 188 for the best friend topic. The nature of this dataset, namely explicitly solicited, low-stakes deception, defines the scope of our study. Although different forms of deception share some characteristics, the findings presented in this paper relate specifically to this form of deception. They therefore cannot be directly extrapolated to spontaneous, high-stakes deception scenarios without further investigation; nevertheless, if deception detection can be achieved at a high performance level for the present task, the same methodology also holds potential for higher-stakes scenarios.
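For concreteness, the sketch below shows one way the corpus can be organized for the experiments that follow. The file names and column layout are illustrative assumptions, not the dataset’s official distribution format.

```python
import pandas as pd

# Illustrative assumption: one CSV per culture with "text", "label", and "topic"
# columns; the official distribution format of the dataset may differ.
def load_culture(path: str, language: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    df["language"] = language                                   # "en" or "es"
    df["label"] = df["label"].map({"truthful": 0, "deceptive": 1})
    return df

corpus = pd.concat(
    [
        load_culture("english_us.csv", "en"),
        load_culture("english_india.csv", "en"),
        load_culture("spanish_mexico.csv", "es"),
    ],
    ignore_index=True,
)

# Topics in the dataset: abortion, death penalty, best friend.
print(corpus.groupby(["language", "topic", "label"]).size())
```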
3.2. Experimental Design
We conducted our experiments using Google Colab for its access to GPU resources, necessary for training transformer models. The study was designed to systematically compare the performance of three different model architectures, chosen to represent distinct stages of technological development in NLP.
3.2.1. Baseline Model: Support Vector Machine (SVM)
To establish a performance baseline and connect our work with prior research, we performed a replication of the original study’s unigram-based experiment. We used the SVM implementation from the Scikit-learn library (version 1.2.2) in Python (version 3.12.4), with a linear kernel and the default C-value of 1.0, which is a standard choice for text classification tasks. Input features were unigrams, with a frequency threshold of 10 applied to the vocabulary to filter out rare words. Consistent with the original methodology, stopwords were retained, as they can contain valuable linguistic cues for deception.
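A minimal sketch of this baseline is shown below, assuming the essays and binary labels for one topic are available as Python lists; the frequency threshold is interpreted here as a minimum document frequency, and the original study’s preprocessing may differ in such details.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Placeholder data: texts holds the essays for one topic,
# labels the ground truth (1 = deceptive, 0 = truthful).
texts = ["example essay text ..."] * 20
labels = [0, 1] * 10

svm_baseline = Pipeline([
    # Unigram counts; min_df=10 implements the frequency threshold of 10
    # as a minimum document frequency. Stopwords are deliberately retained.
    ("unigrams", CountVectorizer(ngram_range=(1, 1), min_df=10)),
    # Linear-kernel SVM with the Scikit-learn default C = 1.0.
    ("svm", SVC(kernel="linear", C=1.0)),
])

# Within-topic evaluation: 10-fold stratified cross-validation on accuracy.
scores = cross_val_score(svm_baseline, texts, labels, cv=10, scoring="accuracy")
print(f"Mean within-topic accuracy: {scores.mean():.4f}")
```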
3.2.2. Transformer-Based Models
We evaluated two variants of the BERT model to assess the impact of modern architectures and multilingual pre-training, representing the fine-tuned “specialist” model paradigm. The goal was to compare widely used, publicly available discriminative models. In addition to these fine-tuned discriminative models, we include a comparative analysis of several instruction-tuned generative models (LLMs) to benchmark their zero- and few-shot capabilities on this task.
Monolingual BERT: We used bert-base-uncased, a model pre-trained on a large English corpus. This model represents a standard, high-performance baseline for English NLP tasks.
Multilingual BERT: We used bert-base-multilingual-uncased, which was pre-trained on corpora from 102 languages, including English and Spanish. This model was chosen to evaluate the effectiveness of a multilingual approach on our dataset.
For all BERT-based experiments, input texts were tokenized with a maximum sequence length of 128. The models were fine-tuned using the hyperparameters specified in
Table 2. The number of training epochs was set to three, a common practice in BERT fine-tuning for classification tasks to prevent overfitting on smaller datasets, as recommended by the original authors and established in subsequent literature [
9]. While an exhaustive hyperparameter search was beyond the scope of this benchmarking study, these parameters represent a standard, robust baseline. We monitored the validation loss during training and confirmed that it began to plateau or increase after the third epoch, indicating that further training was unnecessary and that these settings are appropriate for a fair comparison. Applying the monolingual English BERT to the Spanish data served as a control experiment, establishing the performance of a model encountering a language outside its pre-training corpus.
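A condensed sketch of this fine-tuning setup is shown below, assuming the Hugging Face Transformers Trainer API; the batch size and learning rate shown are common defaults used as stand-ins for the values listed in Table 2.

```python
import datasets
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-multilingual-uncased"  # or "bert-base-uncased" for English-only runs
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def tokenize(batch):
    # Maximum sequence length of 128 tokens, as in our experiments.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# Toy placeholders; in practice these are the essays and labels of one experimental split.
train_ds = datasets.Dataset.from_dict({"text": ["..."] * 8, "label": [0, 1] * 4}).map(tokenize, batched=True)
eval_ds = datasets.Dataset.from_dict({"text": ["..."] * 4, "label": [0, 1] * 2}).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-deception",
    num_train_epochs=3,              # three epochs, as described above
    per_device_train_batch_size=16,  # assumed value; see Table 2 for the settings actually used
    learning_rate=2e-5,              # assumed value; see Table 2 for the settings actually used
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
print(trainer.evaluate())  # validation loss was monitored to confirm no gain beyond epoch 3
```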
3.2.3. Large Language Models (LLMs) for Zero-, Few-Shot, and Fine-Tuned Classification
To provide a contemporary point of comparison, we evaluated three prominent instruction-tuned LLMs as “generalist” models. These experiments benchmark their in-context learning and fine-tuning capabilities on this discriminative task. The models, chosen for their strong performance and open availability, were as follows:
mistralai/Mistral-7B-v0.1, a 7-billion-parameter model known for its strong reasoning capabilities.
meta-llama/Llama-3.2-3B, a smaller, efficient variant from the LLaMA 3.2 family.
Qwen/Qwen3-4B-Instruct-2507, a 4-billion-parameter model optimized for multilingual tasks.
For each model, we tested three conditions:
Zero-shot: The model was given a system prompt defining the task and the two labels (“deceptive” and “not deceptive”) and was then asked to classify the target text.
Few-shot-5: The prompt was augmented with five randomly selected, labeled examples from the training set to provide in-context examples. The sampling was performed once for each of the 10 cross-validation folds.
Fine-Tuned: The models were fine-tuned using the Parameter-Efficient Fine-Tuning (PEFT) library with Low-Rank Adaptation (LoRA). We used a 60% training and 40% evaluation split. Standard LoRA parameters were used (rank = 8, alpha = 32, and dropout = 0.1), and the models were trained for 3 epochs.
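The sketch below illustrates the two in-context conditions and the LoRA configuration just described. The prompt wording is an assumption (a paraphrase rather than the exact instructions we used); only the labels and the LoRA hyperparameters (rank = 8, alpha = 32, dropout = 0.1) are taken directly from our setup.

```python
from peft import LoraConfig, TaskType

LABELS = ("deceptive", "not deceptive")

def build_prompt(target_text, examples=None):
    """Builds a zero-shot prompt, or a few-shot-5 prompt when `examples` contains
    five (text, label) pairs sampled from the training fold. Wording is illustrative."""
    prompt = ("Classify the following statement as exactly one of the labels "
              f"'{LABELS[0]}' or '{LABELS[1]}'. Answer with the label only.\n\n")
    for text, label in (examples or []):
        prompt += f"Statement: {text}\nLabel: {label}\n\n"
    prompt += f"Statement: {target_text}\nLabel:"
    return prompt

# LoRA configuration for the fine-tuned condition (PEFT library).
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # the evaluated LLMs are decoder-only generative models
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)
# The adapter-wrapped model is obtained with peft.get_peft_model(base_model, lora_config)
# and trained for 3 epochs on the 60% training split, with the remaining 40% held out.
```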
3.2.4. Evaluation Protocol
All models were evaluated using two distinct classification schemes to test for both topic-specific and generalized performance.
Within-Topic Classification: The model’s ability to distinguish truth from deception within a single topic. Performance was measured using 10-fold stratified cross-validation.
Cross-Topic Classification: The model’s ability to generalize deception cues across different topics. The model was trained on data from two topics and tested on the held-out third topic. This was performed for all three topic combinations, and the results were averaged.
Accuracy was used as the primary evaluation metric for all experiments to facilitate comparison with prior work.
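A minimal sketch of the two evaluation schemes is given below, assuming a pandas DataFrame with "text", "label", and "topic" columns and any scikit-learn-compatible classifier that accepts raw text (such as the SVM pipeline sketched earlier); it mirrors the protocol described above rather than our exact code.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def within_topic_accuracy(df: pd.DataFrame, topic: str, clf) -> float:
    """10-fold stratified cross-validation on a single topic."""
    sub = df[df["topic"] == topic]
    X, y = sub["text"].to_numpy(), sub["label"].to_numpy()
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    accs = []
    for train_idx, test_idx in skf.split(X, y):
        model = clone(clf).fit(X[train_idx], y[train_idx])
        accs.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    return float(np.mean(accs))

def cross_topic_accuracy(df: pd.DataFrame, clf) -> float:
    """Train on two topics, test on the held-out third; average over the three splits."""
    accs = []
    for held_out in df["topic"].unique():
        train, test = df[df["topic"] != held_out], df[df["topic"] == held_out]
        model = clone(clf).fit(train["text"], train["label"])
        accs.append(accuracy_score(test["label"], model.predict(test["text"])))
    return float(np.mean(accs))
```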
4. Results
This section details the empirical results of our experiments, providing a comparative analysis of the SVM, fine-tuned BERT models, and LLMs.
4.1. SVM Baseline Performance and Replication
Our first experiment aimed to establish a baseline by replicating the SVM unigram experiment from Pérez-Rosas and Mihalcea (2014) [
12].
Table 3 presents a comparison of the averaged results reported in the original paper and those obtained in our replication.
Our replication produced comparable, though not identical, results. Within-topic accuracies were slightly higher in our implementation for all three datasets. However, for the English (US) dataset, our cross-topic accuracy (60.58%) was substantially lower than the originally reported figure (72.79%). This discrepancy, which we explore further in the Discussion section, illustrates the sensitivity of such models to implementation details. Nonetheless, the overall performance of the SVM is modest and inconsistent, providing a clear baseline for comparison. The originally reported results are reproduced for comparison in
Table 4, whereas the full results of our replication are in
Table 5.
4.2. Monolingual BERT Performance
The monolingual English BERT model significantly outperformed the SVM baseline on the English datasets, as detailed in
Table 6. For the English (US) data, average accuracy reached 87.17% (within-topic) and 94.68% (cross-topic). For the English (India) data, the results were 90.50% and 96.06%, respectively. As expected, the model’s performance on the Spanish (Mexico) dataset was poor, with average accuracies of 60.73% (within-topic) and 67.20% (cross-topic), reflecting its limited ability to process a language absent from its pre-training corpus. Nonetheless, these results were comparable to those obtained by the SVM, which suggests that transformer models not trained specifically on the target language can still possess some degree of proficiency in it. This may arise from data contamination (text in languages other than English present in the pre-training corpora of nominally English models) or from vocabulary overlap between languages, which the sub-word (or sub-token) tokenization mechanism can exploit.
4.3. Multilingual BERT Performance
The multilingual BERT model not only maintained strong performance on the English datasets but also demonstrated a substantial improvement on the Spanish data (
Table 7). For the Spanish (Mexico) dataset, the average accuracy rose to 82.99% for within-topic and 90.14% for cross-topic classification. This represents a performance gain of over 22 percentage points compared to the monolingual BERT, confirming the effectiveness of the multilingual pre-training strategy. The model’s performance on the English datasets remained high and comparable to the monolingual model, indicating that its multilingual nature does not compromise its capability in a high-resource language like English. A summary of the averaged results across all models and conditions is presented in
Table 8 to provide a high-level overview of the results.
4.4. Large Language Model Performance
The performance of the three instruction-tuned LLMs is summarized in
Table 9. A clear pattern emerged across all three models:
Zero-shot performance was at or near zero. In this setting, the models consistently failed to perform the classification, often outputting a single label regardless of the input or refusing to provide a valid classification. This indicates that, despite their vast general knowledge, the models could not reliably map the task description to a consistent binary classification output without concrete examples.
Few-shot prompting with five examples (few-shot-5) markedly improved performance, generally bringing accuracy to around the 50% chance baseline, with Mistral and LLaMA 3.2 performing best. This shows that the models can recognize the task pattern from in-context examples, yet their ability remains limited.
Fine-tuning yielded the best LLM results, with Mistral-7B reaching an average within-topic accuracy of 71.13%. However, even in this condition, the LLMs did not surpass the performance of the fine-tuned BERT models.
These results suggest that for this specific discriminative task, the generalized, in-context learning abilities of LLMs are substantially less effective than task-specific fine-tuning. Even when fine-tuned, these larger models were outperformed by the smaller, more specialized BERT architecture and in many cases even by the baseline SVM-based model.
These results are particularly interesting when considering the models’ pre-training. It is highly probable that the public dataset used in our study was part of the vast corpora used to train these LLMs. Despite this likely data exposure, the models were unable to leverage that latent knowledge in a zero-shot setting to perform the specific classification task. This suggests a critical distinction between a model’s generalized knowledge and its ability to robustly follow precise, task-specific instructions without explicit in-context examples. The low performance, even with few-shot prompting, highlights the superiority of task-specific fine-tuning over in-context learning for this type of discriminative challenge, confirming that specialized models remain more effective and reliable for this problem.
5. Discussion
This section provides a critical analysis of our findings, interpreting them in the context of prior research, acknowledging the study’s limitations, and considering the broader implications of our work.
5.1. Interpretation of Results and Comparison with Prior Work
The most straightforward finding was the substantial performance gap between the SVM baseline and the transformer-based models. While the SVM’s performance was modest, our results are broadly consistent with the accuracy range reported by Pérez-Rosas and Mihalcea [
12] and other early work using similar feature-based methods [
5,
8]. The high accuracy achieved by both BERT models demonstrates that deep contextual embeddings are far more effective at capturing the subtle linguistic cues of deception than the shallow n-gram features used by SVMs.
The strong performance in the cross-topic setting is particularly noteworthy. The fact that cross-topic accuracy often exceeded within-topic accuracy suggests that the models learned generalizable linguistic patterns of deception that are not topic-specific, and that they benefited from the larger volume of training data available in the cross-topic configuration. This aligns with findings in related domains like fake news detection, where robust models learn genre and style cues rather than topic-specific artifacts [
24].
The comparison between the monolingual and multilingual BERT models on the Spanish dataset provides a critical, practical insight. The strong results of the multilingual BERT, which improved average cross-topic accuracy from 67.20% to 90.14%, demonstrate the value of multilingual pre-training and corroborate findings from previous work on mBERT’s cross-lingual capabilities [
13].
This dramatic improvement directly answers our third research question by showing that the multilingual model’s architecture successfully leverages its pre-training on Spanish to correctly classify deception, whereas the monolingual model could only rely on superficial or shared lexical features.
Furthermore, our experiments with modern LLMs provide a valuable counterpoint and answer the first part of our primary research question. Despite their advanced capabilities and the possibility of data exposure during pre-training, their zero- and few-shot performance was clearly inferior to that of the fine-tuned BERT models. Even when fine-tuned using LoRA, the LLMs did not match the performance of the much smaller BERT-base models. This reinforces a key conclusion: for specialized, discriminative tasks such as this one, task-specific fine-tuning of a smaller model remains a more effective and computationally efficient approach than in-context learning or even parameter-efficient fine-tuning of larger, general-purpose generative models.
5.2. Methodological Considerations and Limitations
Our study, while methodologically consistent for its stated purpose, has limitations that define the boundaries of its conclusions. The primary limitation is the reliance on a single, albeit well-established, dataset. We acknowledge that this constrains the generalizability of our findings. However, the study’s scientific value lies not in proposing a universally applicable deception detection model but in providing a rigorous, reproducible benchmark of different technological paradigms (SVM, BERT, and LLM) on a foundational task in the field. This comparative clarity is a necessary contribution before extending analyses to more varied and complex datasets.
Our attempt to replicate the original SVM experiment yielded results that were not perfectly identical to the original study (
Table 3). Specifically, our cross-topic accuracy for the English (US) dataset was notably lower. Since we do not have access to the original code, we cannot definitively identify the cause. However, potential reasons include differences in SVM library implementations (e.g., Scikit-learn vs. LIBSVM) or subtle variations in text preprocessing and tokenization. This discrepancy highlights a persistent challenge in computational science, namely, that results can be highly sensitive to minor, often undocumented, implementation details. By presenting this discrepancy transparently, we underscore the importance of detailed methodological reporting for reproducibility.
A closely related limitation is the nature of the dataset itself. The texts analyzed are examples of instructed, low-stakes deception. The linguistic characteristics of this behavior may not be representative of real-world, high-stakes deception, such as financial fraud, or of spontaneously generated misinformation. Therefore, the high accuracies achieved here should not be interpreted as a solved problem for all forms of deception. Furthermore, while this comparative study provides a clear benchmark between established architectures, we acknowledge the absence of a full ablation study on model components. Such experiments were beyond the scope of this benchmarking study but represent a vital direction for future work.
Finally, while the dataset is described as “cross-cultural”, our experiments are primarily “multilingual”. We demonstrated that a multilingual model works well on a multilingual dataset. We did not, for instance, test for knowledge transfer between the two English-speaking cultures (US and India) or analyze the specific linguistic features that differ between them. Such an analysis would be required for a truly cross-cultural study and remains a promising direction for future work.
5.3. Theoretical and Ethical Implications
The success of automated deception detection models, while promising, necessitates a deeper consideration of the underlying theoretical and ethical issues. As noted in the Introduction, the definition of “truth” is complex. Our system operationalizes it as a binary classification based on writer intent, but this is a simplification of reality. These initial ethical reflections become concrete when the risks of misuse are considered. For instance, a naively deployed model could learn to associate the linguistic patterns of a specific demographic or of non-native speakers with “deception”, leading to significant algorithmic bias and reinforcing harmful stereotypes. The question of who defines and labels the ground-truth data, be it researchers, corporations, or governments, carries significant weight, as it directly influences the model’s worldview and potential biases.
Furthermore, the interpretation of LLM behavior warrants a less speculative, more grounded approach. The failure of LLMs in the zero-shot setting, despite the possibility of data exposure during pre-training, is less a philosophical point about “knowledge” and more an empirical demonstration of the gap between generalized pre-training and specialized task execution. It suggests that without explicit fine-tuning or very precise prompting, these models lack the specific inductive bias needed for this discriminative task.
This leads to the crucial role of humans in an AI-augmented future. Deception detection tools should not be seen as autonomous arbiters of truth but as assistance mechanisms within a “human-in-the-loop” framework. Rather than simply replacing human judgment, these systems are best used to augment it by flagging inconsistencies, highlighting text with a high probability of being deceptive and providing evidence for a human expert to review. The development of such tools must proceed in tandem with research into explainable AI (XAI) and robust governance frameworks to ensure a responsible and beneficial integration into society.
6. Conclusions
In this study, we conducted a three-way comparative evaluation of machine learning paradigms for the task of multilingual instructed deception detection. Our systematic analysis of traditional SVMs, fine-tuned BERT models, and instruction-tuned LLMs confirms a clear performance hierarchy: fine-tuned BERT models substantially outperform both the traditional baseline and the in-context learning capabilities of modern LLMs.
Our results answer our research questions by demonstrating that (1) specialized, fine-tuned models are superior for this discriminative task; (2) transformer architectures offer significant performance gains over feature-based methods; and (3) multilingual models are essential for robust performance on multilingual datasets. The success of multilingual BERT in improving accuracy on Spanish text by over 22 percentage points especially highlights this final point.
Our findings, while grounded in the specific task of instructed deception detection, provide a clear empirical validation of transformer technology and highlight pathways for developing more reliable and linguistically flexible tools. This work serves as a necessary benchmark that clarifies the relative strengths of different modeling approaches, contributing to the broader goal of enhancing information integrity with a clear understanding of the technical and ethical challenges that remain.