Multimodal Behavioral Sensors for Lie Detection: Integrating Visual, Auditory, and Generative Reasoning Cues
Highlights
- A multimodal deception detection framework combining visual, audio, and language-based reasoning achieved high accuracy on the DOLOS dataset.
- The ViViT-based visual model reached 74.4% accuracy, while HuBERT audio classification showed strong performance on prosodic cues.
- Multimodal fusion enhances robustness and interpretability in behavioral biometrics for deception analysis.
- Language-guided reasoning with GPT-5 prompt-level fusion provides explainable AI outputs, facilitating trust and real-world applicability.
Abstract
1. Introduction
2. State-of-the-Art
2.1. Traditional Approaches to Lie Detection
2.2. Micro-Expression Analysis in Deception Detection
2.3. Multimodal Deception Detection Systems
2.4. CoT-Based Generative Analysis of Deception
2.5. Emotion-Driven Cues in Deception Detection
3. System Overview
3.1. Visual Processing with ViViT
3.2. Multimodal Prompt-Level Fusion in GPT-5
- Preprocessing of modalities:
- Video frames: Sixteen equally spaced frames are extracted from each video using OpenCV, preserving spatial, facial, and bodily cues while ensuring coverage across the entire utterance.
- Speech transcription: Audio is extracted from the video and transcribed using the OpenAI Whisper-1 model, providing a verbatim text representation of the spoken content.
- Emotion metadata: Paralinguistic emotional features are obtained from the audio signal using the SpeechBrain emotion-recognition-wav2vec2-IEMOCAP model. The predicted categorical label (e.g., angry, neutral, and happy) and its confidence score are recorded.
- Prompt construction: Each modality is converted into a prompt component:
- The 16 frames are attached to the user message as base64-encoded images.
- The transcript text is inserted verbatim.
- The detected emotion and its confidence score are provided in natural language.
Depending on the ablation configuration, one or more modalities are omitted from the prompt to measure their individual impact.
- Reasoning and classification: The system message instructs GPT-5 to act as a deception detection researcher, evaluating behavioral, verbal, and paralinguistic consistency. The model is required to output only a JSON object with three fields: label (“lie” or “truth”), confidence (0.0–1.0), and reasoning (a short explanation of observed cues). Decoding parameters are fixed (temperature = 0.2, top-p = 1.0) to keep the output as close to deterministic as possible.
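As an illustration only, the sketch below shows how this prompt-level fusion could be assembled with OpenCV, the OpenAI Python SDK (Whisper-1 and GPT-5), and SpeechBrain. The model identifiers, frame count, required JSON fields, and decoding parameters follow the description above; the helper names (extract_frames, transcribe, detect_emotion, classify_clip), path handling, and exact message layout are assumptions rather than the authors' code, and the audio track is assumed to have been extracted from the video beforehand (e.g., with ffmpeg).

```python
# Illustrative sketch of the prompt-level fusion described above (not the
# authors' released code). Model identifiers and decoding parameters follow
# the text; helper names, paths, and message layout are assumptions.
import base64
import json

import cv2  # opencv-python
from openai import OpenAI
from speechbrain.inference.interfaces import foreign_class

client = OpenAI()

def extract_frames(video_path, num_frames=16):
    """Sample equally spaced frames and return them as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * (total - 1) / (num_frames - 1)))
        ok, frame = cap.read()
        if ok:
            _, buf = cv2.imencode(".jpg", frame)
            frames.append(base64.b64encode(buf.tobytes()).decode("utf-8"))
    cap.release()
    return frames

def transcribe(audio_path):
    """Verbatim transcript from the Whisper-1 API."""
    with open(audio_path, "rb") as f:
        return client.audio.transcriptions.create(model="whisper-1", file=f).text

def detect_emotion(audio_path):
    """Categorical emotion label and confidence from the SpeechBrain IEMOCAP model."""
    classifier = foreign_class(  # in practice, load this once and reuse it
        source="speechbrain/emotion-recognition-wav2vec2-IEMOCAP",
        pymodule_file="custom_interface.py",
        classname="CustomEncoderWav2vec2Classifier",
    )
    _, score, _, text_lab = classifier.classify_file(audio_path)
    return text_lab[0], float(score.squeeze())  # score treated as confidence

def classify_clip(video_path, audio_path):
    """Build the multimodal prompt and return GPT-5's JSON verdict."""
    frames = extract_frames(video_path)
    transcript = transcribe(audio_path)
    emotion, conf = detect_emotion(audio_path)

    system_msg = (
        "You are a deception detection researcher. Evaluate behavioral, verbal, "
        "and paralinguistic consistency. Respond only with a JSON object with "
        "fields 'label' ('lie' or 'truth'), 'confidence' (0.0-1.0), and 'reasoning'."
    )
    user_content = (
        [{"type": "text", "text": f"Transcript: {transcript}"},
         {"type": "text",
          "text": f"Detected vocal emotion: {emotion} (confidence {conf:.2f})."}]
        + [{"type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"}} for b64 in frames]
    )
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "system", "content": system_msg},
                  {"role": "user", "content": user_content}],
        temperature=0.2,
        top_p=1.0,
    )
    # The system prompt constrains the reply to JSON; parsing assumes compliance.
    return json.loads(response.choices[0].message.content)
```

For an ablation run, the corresponding transcript, emotion, or image components would simply be left out of user_content.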
3.3. DOLOS Dataset
4. Preliminary Experiments
4.1. Visual Stream: Video Transformers for Deception Classification
4.2. Facial Feature Streams: OpenFace and ResNet Backbones
4.3. Classical Models on Multimodal Features
4.4. Ablation Study
- Unimodal: Video frames only (V) and transcript only (T).
- Bimodal: Video + Transcript (V + T), Video + Emotion (V + E), and Transcript + Emotion (T + E).
- Trimodal: Video + Transcript + Emotion (V + T + E), i.e., the full proposed configuration.
4.5. Zero-Shot Inference and Testing Procedure
Algorithm 1: Prompt-level multimodal deception detection (zero-shot GPT-5).
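The body of Algorithm 1 is not reproduced here. As a rough illustration under stated assumptions (a test manifest of (video_path, audio_path, label) triples, the classify_clip sketch from Section 3.2, and scikit-learn for metrics), the zero-shot testing procedure reduces to a loop of the following form, yielding the precision, recall, F1, accuracy, and MCC values reported in the ablation table.

```python
# Illustrative zero-shot evaluation loop; not the authors' exact Algorithm 1.
# Assumes a manifest of (video_path, audio_path, label) triples and the
# classify_clip() sketch from Section 3.2; "lie" is the positive class.
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

def evaluate(test_manifest):
    y_true, y_pred = [], []
    for video_path, audio_path, label in test_manifest:
        verdict = classify_clip(video_path, audio_path)  # zero-shot GPT-5 call
        y_true.append(1 if label == "lie" else 0)
        y_pred.append(1 if verdict["label"] == "lie" else 0)
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }
```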
4.6. Computational Environment
4.7. Fusion Experiments and Modality Ablations
Algorithm 2: Multimodal deception detection experiment pipeline.
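Algorithm 2 is likewise only captioned here. The sketch below shows one way the six modality configurations from Section 4.4 could be enumerated and scored; it assumes a hypothetical classify_clip variant that accepts use_video/use_transcript/use_emotion flags and omits the corresponding prompt components when a flag is False.

```python
# Illustrative ablation driver; not the authors' exact Algorithm 2.
# Assumes a classify_clip(video, audio, use_video=..., use_transcript=...,
# use_emotion=...) variant that drops prompt components whose flag is False.
from sklearn.metrics import accuracy_score, f1_score

ABLATION_CONFIGS = {
    "V":     dict(use_video=True,  use_transcript=False, use_emotion=False),
    "T":     dict(use_video=False, use_transcript=True,  use_emotion=False),
    "V+T":   dict(use_video=True,  use_transcript=True,  use_emotion=False),
    "V+E":   dict(use_video=True,  use_transcript=False, use_emotion=True),
    "T+E":   dict(use_video=False, use_transcript=True,  use_emotion=True),
    "V+T+E": dict(use_video=True,  use_transcript=True,  use_emotion=True),
}

def run_ablations(test_manifest):
    results = {}
    for name, flags in ABLATION_CONFIGS.items():
        y_true, y_pred = [], []
        for video_path, audio_path, label in test_manifest:
            verdict = classify_clip(video_path, audio_path, **flags)
            y_true.append(1 if label == "lie" else 0)
            y_pred.append(1 if verdict["label"] == "lie" else 0)
        results[name] = {"accuracy": accuracy_score(y_true, y_pred),
                         "f1": f1_score(y_true, y_pred)}
    return results
```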
5. Future Work
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Khare, S.K.; Blanes-Vidal, V.; Nadimi, E.S.; Acharya, U.R. Emotion recognition and artificial intelligence: A systematic review (2014–2023) and research recommendations. Inf. Fusion 2024, 102, 102019. [Google Scholar] [CrossRef]
- Ekman, P. Telling Lies: Clues to Deceit in the Marketplace, Politics, and Marriage; W. W. Norton & Company: New York, NY, USA, 2009. [Google Scholar]
- Oh, Y.H.; See, J.; Le Ngo, A.C.; Phan, R.C.W.; Baskaran, V. A Survey of Automatic Facial Micro-Expression Analysis: Databases, Methods, and Challenges. Front. Psychol. 2018, 9, 1128. [Google Scholar] [CrossRef] [PubMed]
- Rahman, H.A.; Shah, Q.; Qasim, F. Artificial Intelligence based Approach to Analyze the Lie Detection using Skin and Facial Expression: Artificial Intelligence using Skin and Facial Expression. Pak. J. Emerg. Sci. Technol. 2023, 4, 4. [Google Scholar]
- Yan, W.J.; Wu, Q.; Liu, Y.J.; Wang, S.J.; Fu, X. CASME Database: A Dataset of Spontaneous Micro-Expressions Collected from Neutralized Faces. In Proceedings of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Shanghai, China, 22–26 April 2013; pp. 1–7. [Google Scholar]
- Chen, H.; Shi, H.; Liu, X.; Li, X.; Zhao, G. Smg: A micro-gesture dataset towards spontaneous body gestures for emotional stress state analysis. Int. J. Comput. Vis. 2023, 131, 1346–1366. [Google Scholar] [CrossRef]
- Wang, Z.; Zhang, K.; Luo, W.; Sankaranarayana, R. HTNet for Micro-Expression Recognition. arXiv 2023, arXiv:2307.14637. [Google Scholar] [CrossRef]
- Tsai, Y.-H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.-P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July–2 August 2019; pp. 6558–6569. Available online: https://aclanthology.org/P19-1656/ (accessed on 3 September 2025).
- Lian, Z.; Chen, H.; Chen, L.; Sun, H.; Sun, L.; Ren, Y.; Cheng, Z.; Liu, B.; Liu, R.; Peng, X.; et al. AffectGPT: A New Dataset, Model, and Benchmark for Emotion Understanding with Multimodal Large Language Models. arXiv 2024, arXiv:2501.16566. [Google Scholar]
- Kalchbrenner, N.; Grefenstette, E.; Blunsom, P. A convolutional neural network for modeling sentences. arXiv 2014, arXiv:1404.2188. [Google Scholar] [CrossRef]
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Stateline, NV, USA, 5–10 December 2013; Volume 26, pp. 3111–3119. [Google Scholar]
- Khalil, M.A.; Ramirez, M.; Can, J.; George, K. Implementation of Machine Learning in BCI Based Lie Detection. In Proceedings of the 2022 IEEE World AI IoT Congress (AIIoT), Seattle, WA, USA, 6–9 June 2022; pp. 213–217. [Google Scholar] [CrossRef]
- Guo, X.; Selvaraj, N.M.; Yu, Z.; Kong, A.; Shen, B.; Kot, A. Audio-Visual Deception Detection: DOLOS Dataset and Parameter-Efficient Crossmodal Learning. In Proceedings of the International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023. [Google Scholar]
- Ye, D.; Bai, S.; Han, Q.; Lu, H.; Wang, X.E.; Wang, S.; Lu, Y.; Bi, B.; Shou, L.; Zhou, M. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. arXiv 2023, arXiv:2304.10592. Available online: https://arxiv.org/abs/2304.10592 (accessed on 3 September 2025).
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. Available online: https://arxiv.org/abs/2106.09685 (accessed on 3 September 2025).
- Baltrušaitis, T.; Robinson, P.; Morency, L.-P. OpenFace: An Open Source Facial Behavior Analysis Toolkit. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. Available online: https://arxiv.org/abs/1512.03385 (accessed on 3 September 2025). [CrossRef]
- Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. ViViT: A Video Vision Transformer. arXiv 2021, arXiv:2103.15691. Available online: https://arxiv.org/abs/2103.15691 (accessed on 3 September 2025).
- Hsu, W.-N.; Bolte, B.; Tsai, Y.-H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. arXiv 2021, arXiv:2106.07447. Available online: https://arxiv.org/abs/2106.07447 (accessed on 3 September 2025). [CrossRef]
- Tong, Z.; Song, Y.; Wang, J.; Wang, L. VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. arXiv 2022, arXiv:2203.12602. Available online: https://arxiv.org/abs/2203.12602 (accessed on 3 September 2025).
Model | Accuracy (Mean) | Training Details |
---|---|---|
ViViT-B-16×2-kinetics + LoRA (32 frames), dense sampling | 0.740 | 30 epochs, Adam (0.00005), no overfitting |
ViViT-B-16×2-kinetics + LoRA (32 frames), uniform sampling | 0.744 | 30 epochs, Adam (0.00005), no overfitting |
TimeSformer-B-kinetics + LoRA (8 frames), uniform sampling | 0.743 | 30 epochs, Adam (0.00005), early stopping used |
ResNet (frozen) + LSTM | ≈0.74 | 30 epochs, Adam StepLR (0.0005), overfitting noted |
Model | Accuracy | F1-Score | Comments |
---|---|---|---|
Linear SVM | 0.54 | 0.52 | - |
RBF SVM | 0.55 | 0.54 | - |
Bernoulli Naive Bayes | 0.56 | 0.55 | Slightly better than other models |
Random Forest | 0.55 | 0.55 | - |
MLP | 0.58 | N/A | Best among classical models |
Configuration | Precision | Recall | F1-Score | Accuracy | MCC |
---|---|---|---|---|---|
Unimodal | |||||
Video only (V) | 0.639 | 0.883 | 0.741 | 0.692 | 0.415 |
Transcript only (T) | 0.602 | 0.933 | 0.732 | 0.658 | 0.379 |
Bimodal | |||||
Video + Transcript (V + T) | 0.671 | 0.917 | 0.775 | 0.733 | 0.502 |
Video + Emotion (V + E) | 0.558 | 0.967 | 0.707 | 0.600 | 0.294 |
Transcript + Emotion (T + E) | 0.598 | 0.917 | 0.724 | 0.650 | 0.355 |
Trimodal | |||||
Video + Transcript + Emotion (V + T + E) | 0.683 | 0.933 | 0.789 | 0.750 | 0.537 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).