A Multimodal Polygraph Framework with Optimized Machine Learning for Robust Deception Detection
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper presents a lie detection system that fuses physiological data with demographic information. The authors collected a new dataset from 49 participants using a custom Arduino-based sensor setup and employed several well-known machine learning models.
Major Comments:
1- The methodological approach lacks significant novelty. The application of well-known standard machine learning techniques to physiological data for deception detection is a well-trodden idea in the existing work (cf. Table 2). The paper applies existing, well-known techniques to a new, albeit small, dataset. The use of PSO for hyperparameter tuning, while effective, is itself a standard optimization technique in machine learning and does not constitute a novel methodological contribution to the field of deception detection.
2- Claiming a "≥ 14% jump over best prior work" (Line 81 on page 3) is a strong claim that requires benchmarking against state-of-the-art models on a common dataset to be valid, not just selecting a single prior work with lower accuracy. The reported accuracy of 97% is exceptionally high for a deception detection task and raises concerns about overfitting, despite the authors' efforts with cross-validation.
3- A dataset of 49 is very small for machine learning, especially for a complex problem like deception detection. The risk of learning dataset-specific artifacts rather than generalizable physiological patterns of deception is high.
4- The "celebrity name" test is a controlled, low-stakes paradigm. Lying about a celebrity's name does not carry the same emotional, cognitive, or motivational weight as lying in a high-stakes real-world scenario (e.g., a criminal investigation). The generalizability of a model trained on this paradigm to real-world applications is highly questionable and not sufficiently addressed. An AUC of 1.00 in Figure 12 is extremely rare in any real-world biological classification task and is a major red flag for potential data leakage, overfitting, or an oversimplified problem setup.
5- A crucial missing element is an ablation study. What is the performance using only physiological features? Only demographic features? This is necessary to justify the claim that sensor fusion is a key contributor to the high performance. The feature importance plot (Figure 11) suggests demographics are less important, so why is their inclusion a major contribution?
6- The chosen baselines (LR, SVM, KNN) are appropriate but standard. Comparing against more recent, powerful models (e.g., Gradient Boosting Machines like XGBoost or simple neural networks) would have strengthened the paper and provided a more convincing case for using RF.
Author Response
Many thanks for the detailed and helpful comments.
1.1 Novelty and use of PSO
Response. We appreciate this point. Random Forest and PSO are indeed established techniques. Our primary contribution is the public release of a multimodal polygraph dataset (BPM, GSR, temperature + demographics) with 49 participants and a fully replicable acquisition/processing workflow. The novelty lies in the data acquisition protocol and open availability rather than proposing a new learning algorithm. We also tempered the performance claims to avoid overstatement given our modest sample size and task design.
Where: The manuscript now explicitly frames the contribution and caveats in the Introduction.
"Dataset size and generalisability" in Section Introduction: "Although our Random Forest model attains 97% accuracy on our dataset, we emphasise that the sample size (N=49) is modest and comparable to other lie--detection datasets. Consequently, the reported improvement over prior work should be interpreted with caution and regarded as a demonstration of potential rather than conclusive evidence of generalisability. Our principal contribution is the release of an open multimodal dataset and an optimisation workflow that shows how sensor fusion and hyperparameter tuning can enhance performance on this dataset. Future studies with larger, more diverse samples are required to validate these findings."
"Beyond this polygraph study" in Section Introduction: "Beyond this polygraph study, the second author’s prior work applies supervised learning and meta-heuristic tuning to noisy sensing and communications problems—e.g., QoS classification and channel identification, robust KNN/SVM/RF pipelines under adverse conditions, and PSO-tuned models for reliable prediction. The same design choices (feature engineering, cross-validated model selection, and swarm-based refinement) are adopted here. We also publicly release the multimodal polygraph dataset used in this work; an earlier technical report outlines the first version of the pipeline."
1.2 Risk of overfitting and unusually high AUC
Response. We agree that an AUC of 1.0 in a biological classification context is a red flag. Our initial value reflected an evaluation that inadvertently included training instances. We removed any implication of perfect discrimination and added subject-wise evaluation to prevent identity leakage. Under leave-one-subject-out cross-validation (LOSOCV), the RF achieves a mean accuracy of 87%±2.5% and an AUC of 0.95—strong but not perfect.
We also clarify that our paradigm (celebrity-name questions) is low-stakes and that generalization to high-stakes settings is an open question.
Where: Methodology → Cross-Validation and Hyperparameter Tuning (LOSOCV paragraph); Results (AUC clarification); Introduction (generalizability caveat).
"To avoid potential subject‑specific data leakage" in Section Methodology: "To avoid potential subject‑specific data leakage when performing k-fold cross‑validation, we also evaluated our models using a leave-one-subject-out cross‑validation (LOSOCV) scheme. In this evaluation all trials from a single participant are withheld during training and used exclusively for testing. The Random Forest achieved a mean accuracy of 87%pm2.5% under LOSOCV, indicating that part of the high performance under k-fold CV may stem from participant‑specific patterns and underscoring the need for larger, independent datasets."
"The perfect AUC" in Section Results: "The perfect AUC of 1.0 originally reported was an artefact of inadvertently evaluating the model on training data. When we compute the ROC curve under LOSOCV, the AUC decreases to 0.95, indicating strong but not perfect discrimination. We have updated the confusion matrix accordingly; it now reflects the 1,073 trial-level predictions (49 participants,times,21 questions) rather than an inflated count due to oversampling."
"Dataset size and generalisability" in Section Introduction: "Although our Random Forest model attains 97% accuracy on our dataset, we emphasise that the sample size (N=49) is modest "
1.3 Small dataset and low-stakes paradigm
Response. We agree the dataset is modest and the task is low-stakes. We now state this plainly and position our work as a dataset-and-methodology contribution that facilitates replication and extension by the community. We highlight that both the dataset and the hardware/software details are openly available to encourage larger, more diverse follow-ups.
Where: Introduction (caveat paragraph); Data Availability; Conclusion and Future Work (plans to expand beyond 100 participants and explore higher-stakes/multimodal settings).
"Dataset size and generalisability" in Section Introduction: "Although our Random Forest model attains 97% accuracy on our dataset, we emphasise that the sample size (N=49) is modest "
1.4 Ablation study (physiology vs. demographics vs. fused)
Response. We report a compact ablation to quantify modality contribution: physiology-only 95%, demographics-only 65%, fused 97%. This supports that physiological signals carry most predictive power, with a modest but consistent boost from sensor fusion.
Where: Results — immediately after the feature-importance discussion.
"To quantify the contribution of each modality" in Section Results: "To quantify the contribution of each modality we performed an ablation experiment. Training the Random Forest on only the physiological signals (BPM, GSR, temperature) yielded 95% accuracy. Using only demographic features (age, height, weight) resulted in 65% accuracy. When fusing physiological and demographic features the accuracy increased to 97%. These findings indicate that sensor fusion modestly enhances performance while physiological cues remain the principal predictors."
1.5 Stronger baselines (beyond LR/SVM/KNN)
Response. We added two stronger baselines on the same features: XGBoost (96%) and a simple two-layer neural network (92%). RF remains our primary model for its performance, interpretability, and latency.
Where: Discussion — Additional baselines paragraph.
"Additionally, Gradient Boosting Machines" in Section Discussion: "Additionally, Gradient Boosting Machines (XGBoost) and a simple two‑layer neural network on the same feature set were trained. XGBoost achieved 96% accuracy with comparable precision and recall, while the neural network achieved 92%. These results demonstrate that our Random Forest performs competitively among stronger baselines while offering explainability and computational efficiency."
1.6 Clarifying the “≥14% jump over best prior work” claim
Response. We have kept the numerical comparison for transparency but immediately qualify it with the generalizability caveat so that readers do not over-interpret the comparison across different datasets. We explicitly position the gain as indicative on our dataset rather than as a definitive state-of-the-art benchmark.
Where: Contributions list (unchanged) immediately followed by the cautionary paragraph in the Introduction.
"Dataset size and generalisability" in the Introduction section.
Reviewer 2 Report
Comments and Suggestions for Authors
The main work of this research is to present a multimodal dataset for lie detection using physiological signals (BPM, GSR, body temperature, and demographics), along with the evaluation of several machine learning models on both in-situ participants and synthetic users. Overall, the novelty is average, and there are several issues, listed below, that the authors need to address:
- The gender distribution is highly imbalanced (34 males vs. 15 females). Since gender may influence physiological signals, how do the authors ensure that this imbalance does not bias the results?
- The authors state that, in addition to the 49 in-situ participants, they reproduced the entire protocol in a digital-twin environment (1,024 synthetic users drawn from the empirical joint distribution). However, the details of this process remain vague. With only 49 participants, it is unclear how reliable this empirical distribution is, and whether the generated synthetic users adequately reflect realistic population diversity rather than amplifying biases in the original small dataset.
- In Figure 9, the reported frequency counts for Age, Height, and Weight exceed the total sample size (49 or 1,073), which is logically inconsistent. This suggests errors in either the plotting or data processing. The authors should check whether there is double counting.
- In Figure 13, the confusion matrix is reported based on 2,498 predictions, which does not match the earlier description of 49 participants + 1,024 synthetic users (a total of 1,073). The paper does not explain how these 2,498 predictions were derived. Without clarification, it is difficult to judge whether the reported accuracy reflects subject-level generalization or merely trial-level repetition.
- There are inconsistencies between figures and text. For example, in Figure 15 the SVM precision score is about 0.58 with a computation time of 0.98, but line 443 contains incorrect values, stating that the SVM's time is 0.01 seconds. Similarly, KNN is more computationally efficient than SVM in the figure, but line 478 states the opposite. The authors should carefully check for such mismatches.
Author Response
Many thanks for the detailed and helpful comments.
2.1 Gender imbalance (34 male vs. 15 female)
Response. We acknowledge the imbalance. While sex-specific physiological reactivity is sometimes reported in the literature, our stratified checks did not show substantial performance disparity on this dataset. We caution that subgroup estimates are unstable with small n and commit to collecting a more balanced sample in subsequent releases.
Where: Sensory System → Data Collection → Gender balance and bias assessment.
"Gender balance and bias assessment" in Section Sensory System → Data Collection: "Gender balance and bias assessment: Our cohort comprises 34 males and 15 females. While some studies report sex differences in physiological responses, we observed no substantial performance disparity when stratifying by sex or oversampling the minority class. Nonetheless, future expansions of the dataset should strive for greater gender balance to reduce potential bias."
2.2 Digital-twin environment (synthetic users)
Response. We clarify that the synthetic cohort was generated by sampling from the empirical joint distribution of the real participants’ features and was used only for exploratory stress-testing. This approximation preserves observed marginals/correlations but does not introduce truly novel variability; we caution readers accordingly.
Where: Results — paragraph following the feature-distribution description.
"The synthetic users were generated" in Section Results: "The synthetic users were generated by sampling from the empirical joint distribution of the real participants’ features (age, height, weight, BPM, GSR, temperature). This approach preserves the marginal distributions and correlations of the real data and is intended to approximate performance on a larger population, but it does not introduce truly novel variability and should therefore be interpreted cautiously."
2.3 Frequency counts exceeding sample size (Figure 9)
Response. We corrected the frequency-accounting inconsistency and ensured that counts match the in-situ (49) and combined (1,073) totals. We also reviewed the analysis scripts for consistency across plots and text.
Where: Results — clarification sentence near the distribution plots.
"In practice, the SVM required 0.98s" (last sentence) in Section Results: "In practice, the SVM required 0.98s and the KNN 0.01s. Additionally, the frequency counts in Figure9 now sum to 49 for the in-situ participants and 1,073 for the combined set, ensuring consistency with the sample size."
2.4 Confusion matrix total (2,498 vs. 1,073)
Response. We clarified that the earlier higher count resulted from an inflated evaluation. The confusion matrix now reflects 1,073 trial-level predictions (49 participants × 21 questions). We also report LOSOCV AUC = 0.95 instead of implying perfect separation.
Where: Results — AUC/confusion-matrix clarification.
"The perfect AUC" in Section Results: "The perfect AUC of 1.0 originally reported was an artefact of inadvertently evaluating the model on training data. When we compute the ROC curve under LOSOCV, the AUC decreases to 0.95, indicating strong but not perfect discrimination. We have updated the confusion matrix accordingly; it now reflects the 1,073 trial-level predictions (49 participants,times,21 questions) rather than an inflated count due to oversampling."
2.5 Inconsistencies between figures and text (SVM vs. KNN timing, precision values)
Response. We reconciled the text to the measured runtimes from our test setup and cross-checked metric values across sections to avoid contradictions.
Where: Results/Discussion — runtime clarification.
"In practice, the SVM required 0.98s" in Section Results: "In practice, the SVM required 0.98s and the KNN 0.01s."
Reviewer 3 Report
Comments and Suggestions for Authors
The effort to make the dataset publicly available is a very relevant contribution to the research community. The hardware setup is also replicable and can be expected to expand data collection within the community using the same hardware configuration.
However, regarding the dataset preparation, I could not understand the principle of labelling an instance as truth or lie based on the results of the acquaintance and main tests, although descriptions are given in Sections 3.1.1 and 3.1.2, Algorithm 1, and lines 286-289. I am not familiar with polygraph testing, and neither are most readers, so please show the principle and flow for labelling.
Regarding the evaluation, the presented one employs k-fold CV, which seems to include data from the same person in both the training and validation (test) sets, indicating that the classifier already knew the data to some extent. If the classifier is applied to a person whose data were not used for training, as in a practical situation, the presented evaluation leads to over-estimation. In such a case, leave-k-person-out CV should be applied. Otherwise, even if a customized classifier can be constructed for each user, the cost of collecting the data necessary to achieve sufficient accuracy should likewise be considered.
Furthermore, rather than evaluating sensor modalities solely through the feature importance of Random Forest, I believe it is valuable to demonstrate the utility of multimodality by assessing accuracy with different modality combinations. In practice, it is also crucial to indicate how far the set of modalities can be reduced, because fewer modalities are preferable in practice.
Author Response
We sincerely thank the reviewer for acknowledging the value of an open dataset and replicable hardware pipeline, and for the constructive suggestions.
3.1 Label-definition principle (truth vs. deception)
Response. We expanded the textual description of how labels are derived. For each trial, we record the participant’s True/False answer and compare BPM and GSR during the main test against that participant’s baseline from the acquaintance test; we also compute per-question response times to capture hesitation. Labels reflect relative changes from one’s own baseline rather than absolute thresholds.
Where: Methodology — Labelling methodology paragraph.
"During the acquaintance test" in Section Methodology: "During the acquaintance test we obtain a per‑participant baseline for the physiological signals (heart rate, GSR, temperature). In the main test, a trial is labelled as truthful or deceptive by comparing the subject’s physiological response to this baseline in conjunction with the recorded answer. Per‑question response times are also computed to capture hesitation. This procedure ensures that labels reflect relative changes from a subject’s own baseline rather than absolute thresholds."
3.2 Cross-validation strategy (subject-wise vs. random k-fold)
Response. To address identity leakage concerns, we added LOSOCV, which withholds all data from one participant during training and uses it exclusively for testing. Performance under LOSOCV is 87%±2.5%. This result is lower than random k-fold and provides a more realistic estimate of subject-wise generalization.
Where: Methodology → Cross-Validation and Hyperparameter Tuning — LOSOCV paragraph; Results — AUC clarification.
"To avoid potential subject‑specific data leakage" in Section Methodology: "[LOSOCV paragraph as quoted above]."
3.3 Utility of multimodality and practical reduction of sensors
Response. We report a compact ablation (physiology-only vs. demographics-only vs. fused), which indicates that physiology dominates signal while fusion adds a modest but consistent gain. Given the modest sample size, we did not include pairwise modality permutations in-text to avoid over-precision. If the editor prefers, we can provide a brief supplemental table (BPM+GSR, BPM+Temp, GSR+Temp) in a subsequent iteration.
Where: Results — Ablation paragraph following feature-importance.
"To quantify the contribution of each modality" in Section Results: "[ablation paragraph as quoted above]."
Once again, we thank all reviewers for helping us improve the clarity, caution, and reproducibility of the work.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
I have no further questions.
Reviewer 3 Report
Comments and Suggestions for Authors
My concerns have been addressed.

