1. Introduction
The proliferation of digital communication has fundamentally altered the information landscape, enabling the rapid, global dissemination of content [
1]. This has amplified the societal impact of information, particularly that which is deceptive or misleading. The challenge of distinguishing truth from falsehood is not new, but its scale and complexity in the digital age are unprecedented. Indeed, the very nature of “truth” in an era dominated by competing narratives and subjective interpretations poses a fundamental philosophical problem. While this study cannot resolve this epistemological challenge, it operates on a necessary, pragmatic premise: for computational systems to function, “truth” and “deception” must be operationalized as distinct, classifiable categories based on available ground-truth data. Our work, therefore, addresses the technical feasibility of this classification while acknowledging the broader, complex reality it attempts to model. Automated deception detection has therefore emerged as a critical area of research in natural language processing (NLP) with the aim of developing systems that can identify and flag potentially untruthful content [
2].
Deception is a complex human behavior. Foundational theories, such as Zuckerman’s Four-Factor Model [
3] and DePaulo’s work on the leakage of deceptive cues [
4], suggest that the act of deceiving imposes a cognitive load that can manifest in linguistic and non-verbal signals. However, these signals can be subtle and context-dependent. A significant challenge in computational deception detection is the heterogeneity of deceptive phenomena. The linguistic properties of a spontaneous, high-stakes lie in a legal testimony may differ substantially from those in a planned, low-stakes deceptive product review or in politically motivated disinformation designed for mass consumption [
5,
6].
This study focuses on a specific subtype of deception: instructed, written deception in a low-stakes environment. To investigate this, we use a well-established multilingual dataset where participants were explicitly asked to write both truthful and deceptive statements. It is crucial to define the scope of “deception” within this context. Here, the ground truth is determined not by an external arbiter of facts, but by the self-reported intent of the writer who created the text samples. Therefore, our models are not learning to identify objective falsehood (which would be required in tasks such as fact-checking) but rather to distinguish between the linguistic patterns of two classes: “text written with truthful intent” and “text written with deceptive intent” under specific instructions. This distinction mitigates the risk of the model merely learning statistical artifacts of a topic, focusing instead on the signals correlated with the act of instructed deception itself. This controlled setting allows for a focused analysis of linguistic patterns, although its findings may not be directly generalizable to all forms of real-world misinformation.
Previous computational approaches to this problem have often relied on feature engineering, using classifiers such as Support Vector Machines (SVMs) with features drawn from lexicons like Linguistic Inquiry and Word Count (LIWC) or from n-gram counts [
7,
8]. While foundational, these methods can be brittle and struggle to capture the nuanced semantics of language. The advent of deep learning, particularly transformer-based models like BERT (Bidirectional Encoder Representations from Transformers) [
9], has revolutionized NLP by enabling models to learn rich, contextual representations of text. These models have shown state-of-the-art performance on numerous tasks, including the detection of fake news and propaganda [
10,
11].
More recently, the rise of instruction-tuned Large Language Models (LLMs) presents a third paradigm. These “generalist” models possess vast world knowledge and can perform tasks via in-context learning with zero or few examples, raising the question of their efficacy on specialized classification tasks compared to fine-tuning smaller transformer models. However, much of the research across these paradigms has focused on English, even though multilingual applicability is especially important for tasks such as deception detection, since deception easily crosses linguistic borders. This gap highlights the necessity of models that are effective across multiple languages.
This study aims to address these needs by systematically evaluating and comparing the performance of these three distinct tiers of NLP models. We formulate our goals through the following research questions (RQs):
RQ1: How do traditional feature-based methods (SVM), fine-tuned specialist models (BERT), and in-context learning with generalist models (LLMs) compare in terms of performance on a multilingual instructed deception detection task?
RQ2: To what extent can fine-tuned transformer models outperform the traditional SVM baseline in both within-topic and cross-topic classification scenarios?
RQ3: What are the practical performance differences between using a monolingual versus a multilingual transformer model for deception detection in a dataset containing both English and Spanish texts?
By focusing on the specific task of instructed deception detection, this paper provides a rigorous and controlled comparison of key NLP techniques. We clarify the capabilities of modern architectures on deceptive language detection, make the limits of the study’s scope explicit, and thereby provide a solid base for future investigations into more complex, real-world deception detection challenges.
The remainder of this paper is organized as follows.
Section 2 reviews the evolution of research in deception detection, from its psychological foundations to the latest computational models.
Section 3 details our methodology, including a description of the dataset; the experimental design for the SVM, BERT, and LLM models; and the evaluation protocol.
Section 4 presents the empirical results of our comparative experiments.
Section 5 provides a comprehensive discussion of these findings, interpreting their significance, addressing the study’s limitations, and considering the broader theoretical and ethical implications. Finally,
Section 6 concludes the paper by summarizing our main contributions and suggesting directions for future work.
3. Materials and Methods
This section details the dataset, model architectures, and experimental protocols used in our study. We provide an in-depth description of each component to ensure methodological transparency and reproducibility.
3.1. Dataset
This study utilizes the “cross-cultural deception detection” dataset from Pérez-Rosas and Mihalcea [
12]. This dataset is a valuable resource for multilingual and cross-cultural analysis as it contains texts generated under a controlled protocol. Participants from four cultural backgrounds (American, Indian, Mexican, and Romanian) were instructed to write short, opinion-based essays on three topics: abortion, the death penalty, and best friends. For each topic, they produced both a truthful and a deceptive essay.
Our experiments focus on the English (from the US and India) and Spanish (from Mexico) parts of the dataset. The English datasets are balanced, containing 100 deceptive and 100 truthful samples for each topic (600 total samples per culture). The Spanish dataset is imbalanced, with 71 samples for abortion, 84 for the death penalty, and 188 for the best friend topic. The nature of this dataset, namely explicitly solicited, low-stakes deception, defines the scope of our study. Although different forms of deception share some characteristics, the findings presented in this paper relate specifically to this form of deception. They therefore cannot be directly extrapolated to spontaneous, high-stakes deception scenarios without further investigation; nevertheless, if deception detection can be achieved at a high performance level for the present task, the same methodology also holds potential for higher-stakes scenarios.
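For concreteness, the sketch below shows one way the corpus can be organized for the experiments that follow. The file names and column layout are illustrative assumptions, not the dataset’s official distribution format.

```python
import pandas as pd

# Illustrative assumption: one CSV per culture with "text", "label", and "topic"
# columns; the official distribution format of the dataset may differ.
def load_culture(path: str, language: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    df["language"] = language                                   # "en" or "es"
    df["label"] = df["label"].map({"truthful": 0, "deceptive": 1})
    return df

corpus = pd.concat(
    [
        load_culture("english_us.csv", "en"),
        load_culture("english_india.csv", "en"),
        load_culture("spanish_mexico.csv", "es"),
    ],
    ignore_index=True,
)

# Topics in the dataset: abortion, death penalty, best friend.
print(corpus.groupby(["language", "topic", "label"]).size())
```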
3.2. Experimental Design
We conducted our experiments using Google Colab for its access to GPU resources, necessary for training transformer models. The study was designed to systematically compare the performance of three different model architectures, chosen to represent distinct stages of technological development in NLP.
3.2.1. Baseline Model: Support Vector Machine (SVM)
To establish a performance baseline and connect our work with prior research, we performed a replication of the original study’s unigram-based experiment. We used the SVM implementation from the Scikit-learn library (version 1.2.2) in Python (version 3.12.4), with a linear kernel and the default C-value of 1.0, which is a standard choice for text classification tasks. Input features were unigrams, with a frequency threshold of 10 applied to the vocabulary to filter out rare words. Consistent with the original methodology, stopwords were retained, as they can contain valuable linguistic cues for deception.
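A minimal sketch of this baseline is shown below, assuming the essays and binary labels for one topic are available as Python lists; the frequency threshold is interpreted here as a minimum document frequency, and the original study’s preprocessing may differ in such details.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Placeholder data: texts holds the essays for one topic,
# labels the ground truth (1 = deceptive, 0 = truthful).
texts = ["example essay text ..."] * 20
labels = [0, 1] * 10

svm_baseline = Pipeline([
    # Unigram counts; min_df=10 implements the frequency threshold of 10
    # as a minimum document frequency. Stopwords are deliberately retained.
    ("unigrams", CountVectorizer(ngram_range=(1, 1), min_df=10)),
    # Linear-kernel SVM with the Scikit-learn default C = 1.0.
    ("svm", SVC(kernel="linear", C=1.0)),
])

# Within-topic evaluation: 10-fold stratified cross-validation on accuracy.
scores = cross_val_score(svm_baseline, texts, labels, cv=10, scoring="accuracy")
print(f"Mean within-topic accuracy: {scores.mean():.4f}")
```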
3.2.2. Transformer-Based Models
We evaluated two variants of the BERT model to assess the impact of modern architectures and multilingual pre-training, representing the fine-tuned “specialist” model paradigm. The goal was to compare widely used, publicly available discriminative models. In addition to these fine-tuned discriminative models, we include a comparative analysis of several instruction-tuned generative models (LLMs) to benchmark their zero- and few-shot capabilities on this task.
Monolingual BERT: We used bert-base-uncased, a model pre-trained on a large English corpus. This model represents a standard, high-performance baseline for English NLP tasks.
Multilingual BERT: We used bert-base-multilingual-uncased, which was pre-trained on corpora from 102 languages, including English and Spanish. This model was chosen to evaluate the effectiveness of a multilingual approach on our dataset.
For all BERT-based experiments, input texts were tokenized with a maximum sequence length of 128. The models were fine-tuned using the hyperparameters specified in
Table 2. The number of training epochs was set to three, a common practice in BERT fine-tuning for classification tasks to prevent overfitting on smaller datasets, as recommended by the original authors and established in subsequent literature [
9]. While an exhaustive hyperparameter search was beyond the scope of this benchmarking study, these parameters represent a standard, robust baseline. We monitored the validation loss during training and confirmed that it began to plateau or increase after the third epoch, indicating that further training was unnecessary and that these settings are appropriate for a fair comparison. Applying the monolingual English BERT to the Spanish data served as a control experiment, establishing the performance of a model encountering a language outside its pre-training corpus.
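A condensed sketch of this fine-tuning setup is shown below, assuming the Hugging Face Transformers Trainer API; the batch size and learning rate shown are common defaults used as stand-ins for the values listed in Table 2.

```python
import datasets
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-multilingual-uncased"  # or "bert-base-uncased" for English-only runs
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def tokenize(batch):
    # Maximum sequence length of 128 tokens, as in our experiments.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# Toy placeholders; in practice these are the essays and labels of one experimental split.
train_ds = datasets.Dataset.from_dict({"text": ["..."] * 8, "label": [0, 1] * 4}).map(tokenize, batched=True)
eval_ds = datasets.Dataset.from_dict({"text": ["..."] * 4, "label": [0, 1] * 2}).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-deception",
    num_train_epochs=3,              # three epochs, as described above
    per_device_train_batch_size=16,  # assumed value; see Table 2 for the settings actually used
    learning_rate=2e-5,              # assumed value; see Table 2 for the settings actually used
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
print(trainer.evaluate())  # validation loss was monitored to confirm no gain beyond epoch 3
```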
3.2.3. Large Language Models (LLMs) for Zero-, Few-Shot, and Fine-Tuned Classification
To provide a contemporary point of comparison, we evaluated three prominent instruction-tuned LLMs as “generalist” models. These experiments benchmark their in-context learning and fine-tuning capabilities on this discriminative task. The models, chosen for their strong performance and open availability, were as follows:
mistralai/Mistral-7B-v0.1, a 7-billion-parameter model known for its strong reasoning capabilities.
meta-llama/Llama-3.2-3B, a smaller, efficient variant from the LLaMA 3.2 family.
Qwen/Qwen3-4B-Instruct-2507, a 4-billion-parameter model optimized for multilingual tasks.
For each model, we tested three conditions:
Zero-shot: The model was given a system prompt defining the task and the two labels (“deceptive” and “not deceptive”) and was then asked to classify the target text.
Few-shot-5: The prompt was augmented with five randomly selected, labeled examples from the training set to provide in-context examples. The sampling was performed once for each of the 10 cross-validation folds.
Fine-Tuned: The models were fine-tuned using the Parameter-Efficient Fine-Tuning (PEFT) library with Low-Rank Adaptation (LoRA). We used a 60% training and 40% evaluation split. Standard LoRA parameters were used (rank = 8, alpha = 32, and dropout = 0.1), and the models were trained for 3 epochs.
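The sketch below illustrates the two in-context conditions and the LoRA configuration just described. The prompt wording is an assumption (a paraphrase rather than the exact instructions we used); only the labels and the LoRA hyperparameters (rank = 8, alpha = 32, dropout = 0.1) are taken directly from our setup.

```python
from peft import LoraConfig, TaskType

LABELS = ("deceptive", "not deceptive")

def build_prompt(target_text, examples=None):
    """Builds a zero-shot prompt, or a few-shot-5 prompt when `examples` contains
    five (text, label) pairs sampled from the training fold. Wording is illustrative."""
    prompt = ("Classify the following statement as exactly one of the labels "
              f"'{LABELS[0]}' or '{LABELS[1]}'. Answer with the label only.\n\n")
    for text, label in (examples or []):
        prompt += f"Statement: {text}\nLabel: {label}\n\n"
    prompt += f"Statement: {target_text}\nLabel:"
    return prompt

# LoRA configuration for the fine-tuned condition (PEFT library).
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # the evaluated LLMs are decoder-only generative models
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)
# The adapter-wrapped model is obtained with peft.get_peft_model(base_model, lora_config)
# and trained for 3 epochs on the 60% training split, with the remaining 40% held out.
```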
3.2.4. Evaluation Protocol
All models were evaluated using two distinct classification schemes to test for both topic-specific and generalized performance.
Within-Topic Classification: The model’s ability to distinguish truth from deception within a single topic. Performance was measured using 10-fold stratified cross-validation.
Cross-Topic Classification: The model’s ability to generalize deception cues across different topics. The model was trained on data from two topics and tested on the held-out third topic. This was performed for all three topic combinations, and the results were averaged.
Accuracy was used as the primary evaluation metric for all experiments to facilitate comparison with prior work.
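A minimal sketch of the two evaluation schemes is given below, assuming a pandas DataFrame with "text", "label", and "topic" columns and any scikit-learn-compatible classifier that accepts raw text (such as the SVM pipeline sketched earlier); it mirrors the protocol described above rather than our exact code.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def within_topic_accuracy(df: pd.DataFrame, topic: str, clf) -> float:
    """10-fold stratified cross-validation on a single topic."""
    sub = df[df["topic"] == topic]
    X, y = sub["text"].to_numpy(), sub["label"].to_numpy()
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    accs = []
    for train_idx, test_idx in skf.split(X, y):
        model = clone(clf).fit(X[train_idx], y[train_idx])
        accs.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    return float(np.mean(accs))

def cross_topic_accuracy(df: pd.DataFrame, clf) -> float:
    """Train on two topics, test on the held-out third; average over the three splits."""
    accs = []
    for held_out in df["topic"].unique():
        train, test = df[df["topic"] != held_out], df[df["topic"] == held_out]
        model = clone(clf).fit(train["text"], train["label"])
        accs.append(accuracy_score(test["label"], model.predict(test["text"])))
    return float(np.mean(accs))
```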
4. Results
This section details the empirical results of our experiments, providing a comparative analysis of the SVM, fine-tuned BERT models, and LLMs.
4.1. SVM Baseline Performance and Replication
Our first experiment aimed to establish a baseline by replicating the SVM unigram experiment from Pérez-Rosas and Mihalcea (2014) [
12].
Table 3 presents a comparison of the averaged results reported in the original paper and those obtained in our replication.
Our replication produced comparable, though not identical, results. Within-topic accuracies were slightly higher in our implementation for all three datasets. However, for the English (US) dataset, our cross-topic accuracy (60.58%) was substantially lower than the originally reported figure (72.79%). This discrepancy, which we explore further in the Discussion section, illustrates the sensitivity of such models to implementation details. Nonetheless, the overall performance of the SVM is modest and inconsistent, providing a clear baseline for comparison. The originally reported results are reproduced for comparison in
Table 4, whereas the full results of our replication are in
Table 5.
4.2. Monolingual BERT Performance
The monolingual English BERT model significantly outperformed the SVM baseline on the English datasets, as detailed in
Table 6. For the English (US) data, average accuracy reached 87.17% (within-topic) and 94.68% (cross-topic). For the English (India) data, the results were 90.50% and 96.06%, respectively. As expected, the model’s performance on the Spanish (Mexico) dataset was poor, with average accuracies of 60.73% (within-topic) and 67.20% (cross-topic), reflecting its limited ability to process a language absent from its pre-training corpus. Nonetheless, these results were comparable to those obtained by the SVM, which suggests that transformer models not trained specifically on the target language can still possess some degree of proficiency in it. This may arise from data contamination (text in languages other than English present in the pre-training corpora of nominally English models) or from vocabulary overlap between languages, which the sub-word (or sub-token) tokenization mechanism can exploit.
4.3. Multilingual BERT Performance
The multilingual BERT model not only maintained strong performance on the English datasets but also demonstrated a substantial improvement on the Spanish data (
Table 7). For the Spanish (Mexico) dataset, the average accuracy rose to 82.99% for within-topic and 90.14% for cross-topic classification. This represents a performance gain of over 22 percentage points compared to the monolingual BERT, confirming the effectiveness of the multilingual pre-training strategy. The model’s performance on the English datasets remained high and comparable to the monolingual model, indicating that its multilingual nature does not compromise its capability in a high-resource language like English. A summary of the averaged results across all models and conditions is presented in
Table 8 to provide a high-level overview of the results.
4.4. Large Language Model Performance
The performance of the three instruction-tuned LLMs is summarized in
Table 9. A clear pattern emerged across all three models:
Zero-shot performance was at or near zero. In this setting, the models consistently failed to perform the classification, often outputting a single label regardless of the input or refusing to provide a valid classification. This indicates that, despite their vast general knowledge, the models could not reliably map the task description to a consistent binary classification output without concrete examples.
Few-shot prompting with five examples (few-shot-5) markedly improved performance, generally bringing accuracy to around the 50% chance baseline, with Mistral and LLaMA 3.2 performing best. This shows that the models can recognize the task pattern from in-context examples, yet their ability remains limited.
Fine-tuning yielded the best LLM results, with Mistral-7B reaching an average within-topic accuracy of 71.13%. However, even in this condition, the LLMs did not surpass the performance of the fine-tuned BERT models.
These results suggest that for this specific discriminative task, the generalized, in-context learning abilities of LLMs are substantially less effective than task-specific fine-tuning. Even when fine-tuned, these larger models were outperformed by the smaller, more specialized BERT architecture and in many cases even by the baseline SVM-based model.
These results are particularly interesting when considering the models’ pre-training. It is highly probable that the public dataset used in our study was part of the vast corpora used to train these LLMs. Despite this likely data exposure, the models were unable to leverage that latent knowledge in a zero-shot setting to perform the specific classification task. This suggests a critical distinction between a model’s generalized knowledge and its ability to robustly follow precise, task-specific instructions without explicit in-context examples. The low performance, even with few-shot prompting, highlights the superiority of task-specific fine-tuning over in-context learning for this type of discriminative challenge, confirming that specialized models remain more effective and reliable for this problem.
5. Discussion
This section provides a critical analysis of our findings, interpreting them in the context of prior research, acknowledging the study’s limitations, and considering the broader implications of our work.
5.1. Interpretation of Results and Comparison with Prior Work
The most straightforward finding was the substantial performance gap between the SVM baseline and the transformer-based models. While the SVM’s performance was modest, our results are broadly consistent with the accuracy range reported by Pérez-Rosas and Mihalcea [
12] and other early work using similar feature-based methods [
5,
8]. The high accuracy achieved by both BERT models demonstrates that deep contextual embeddings are far more effective at capturing the subtle linguistic cues of deception than the shallow n-gram features used by SVMs.
The strong performance in the cross-topic setting is particularly noteworthy. The fact that cross-topic accuracy often exceeded within-topic accuracy suggests that the models learned generalizable linguistic patterns of deception that are not topic-specific, and that they benefited from the larger volume of training data available in the cross-topic configuration. This aligns with findings in related domains like fake news detection, where robust models learn genre and style cues rather than topic-specific artifacts [
24].
The comparison between the monolingual and multilingual BERT models on the Spanish dataset provides a critical, practical insight. The strong results of the multilingual BERT, which improved average cross-topic accuracy from 67.20% to 90.14%, demonstrate the value of multilingual pre-training and corroborate findings from previous work on mBERT’s cross-lingual capabilities [
13].
This dramatic improvement directly answers our third research question by showing that the multilingual model’s architecture successfully leverages its pre-training on Spanish to correctly classify deception, whereas the monolingual model could only rely on superficial or shared lexical features.
Furthermore, our experiments with modern LLMs provide a valuable counterpoint and answer the first part of our primary research question. Despite their advanced capabilities and the possibility of data exposure during pre-training, their zero- and few-shot performance was clearly inferior to that of the fine-tuned BERT models. Even when fine-tuned using LoRA, the LLMs did not match the performance of the much smaller BERT-base models. This reinforces a key conclusion: for specialized, discriminative tasks such as this one, task-specific fine-tuning of a smaller model remains a more effective and computationally efficient approach than in-context learning or even parameter-efficient fine-tuning of larger, general-purpose generative models.
5.2. Methodological Considerations and Limitations
Our study, while methodologically consistent for its stated purpose, has limitations that define the boundaries of its conclusions. The primary limitation is the reliance on a single, albeit well-established, dataset. We acknowledge that this constrains the generalizability of our findings. However, the study’s scientific value lies not in proposing a universally applicable deception detection model but in providing a rigorous, reproducible benchmark of different technological paradigms (SVM, BERT, and LLM) on a foundational task in the field. This comparative clarity is a necessary contribution before extending analyses to more varied and complex datasets.
Our attempt to replicate the original SVM experiment yielded results that were not perfectly identical to the original study (
Table 3). Specifically, our cross-topic accuracy for the English (US) dataset was notably lower. Since we do not have access to the original code, we cannot definitively identify the cause. However, potential reasons include differences in SVM library implementations (e.g., Scikit-learn vs. LIBSVM) or subtle variations in text preprocessing and tokenization. This discrepancy highlights a persistent challenge in computational science, namely, that results can be highly sensitive to minor, often undocumented, implementation details. By presenting this discrepancy transparently, we underscore the importance of detailed methodological reporting for reproducibility.
A closely related limitation is the nature of the dataset itself. The texts analyzed are examples of instructed, low-stakes deception. The linguistic characteristics of this behavior may not be representative of real-world, high-stakes deception, such as financial fraud, or of spontaneously generated misinformation. Therefore, the high accuracies achieved here should not be interpreted as a solved problem for all forms of deception. Furthermore, while this comparative study provides a clear benchmark between established architectures, we acknowledge the absence of a full ablation study on model components. Such experiments were beyond the scope of this benchmarking study but represent a vital direction for future work.
Finally, while the dataset is described as “cross-cultural”, our experiments are primarily “multilingual”. We demonstrated that a multilingual model works well on a multilingual dataset. We did not, for instance, test for knowledge transfer between the two English-speaking cultures (US and India) or analyze the specific linguistic features that differ between them. Such an analysis would be required for a truly cross-cultural study and remains a promising direction for future work.
5.3. Theoretical and Ethical Implications
The success of automated deception detection models, while promising, necessitates a deeper consideration of the underlying theoretical and ethical issues. As noted in the Introduction, the definition of “truth” is complex. Our system operationalizes it as a binary classification based on writer intent, but this is a simplification of reality. These initial ethical reflections become concrete when the risks of misuse are considered. For instance, a naively deployed model could learn to associate the linguistic patterns of a specific demographic or of non-native speakers with “deception”, leading to significant algorithmic bias and reinforcing harmful stereotypes. The question of who defines and labels the ground-truth data, be it researchers, corporations, or governments, carries significant weight, as it directly influences the model’s worldview and potential biases.
Furthermore, the interpretation of LLM behavior warrants a less speculative, more grounded approach. The failure of LLMs in the zero-shot setting, despite the possibility of data exposure during pre-training, is less a philosophical point about “knowledge” and more an empirical demonstration of the gap between generalized pre-training and specialized task execution. It suggests that without explicit fine-tuning or very precise prompting, these models lack the specific inductive bias needed for this discriminative task.
This leads to the crucial role of humans in an AI-augmented future. Deception detection tools should not be seen as autonomous arbiters of truth but as assistance mechanisms within a “human-in-the-loop” framework. Rather than simply replacing human judgment, these systems are best used to augment it by flagging inconsistencies, highlighting text with a high probability of being deceptive and providing evidence for a human expert to review. The development of such tools must proceed in tandem with research into explainable AI (XAI) and robust governance frameworks to ensure a responsible and beneficial integration into society.
6. Conclusions
In this study, we conducted a three-way comparative evaluation of machine learning paradigms for the task of multilingual instructed deception detection. Our systematic analysis of traditional SVMs, fine-tuned BERT models, and instruction-tuned LLMs confirms a clear performance hierarchy: fine-tuned BERT models substantially outperform both the traditional baseline and the in-context learning capabilities of modern LLMs.
Our results answer our research questions by demonstrating that (1) specialized, fine-tuned models are superior for this discriminative task; (2) transformer architectures offer significant performance gains over feature-based methods; and (3) multilingual models are essential for robust performance on multilingual datasets. The success of multilingual BERT in improving accuracy on Spanish text by over 22 percentage points especially highlights this final point.
Our findings, while grounded in the specific task of instructed deception detection, provide a clear empirical validation of transformer technology and highlight pathways for developing more reliable and linguistically flexible tools. This work serves as a necessary benchmark that clarifies the relative strengths of different modeling approaches, contributing to the broader goal of enhancing information integrity with a clear understanding of the technical and ethical challenges that remain.