A Comparative Study of Large Language Models for Industrial Cyber-Physical Security
Abstract
1. Introduction
- (1)
- Tabular foundation models applied to OT/ICS security benchmarks. To the best of our knowledge, no prior study has evaluated TabPFN or TabICL on industrial control system intrusion detection. Where previous evaluations of these models concentrated on classical UCI-style tabular benchmarks [25,26] or IT-network IDS [27,28], we benchmark both on three OT/ICS testbeds, water-treatment SCADA (SWaT), multi-process hardware-in-the-loop (HAI), and Industrial IoT network flows (WUSTL-IIoT-2021), under a full-holdout multi-seed protocol with paired statistical testing.
- (2)
- Head-to-head comparison of open-source LLMs and tabular foundation models. Prior LLM-based IDS studies on industrial and IoT settings [22,23] compared against Random Forest or XGBoost anchors but not against modern tabular foundation models; prior tabular-FM evaluations did not include LLM baselines. We compare four open-source LLMs (Qwen3-235B-A22B, Llama-3.3-70B, Hermes-4-70B, Hermes-4-405B) against TabPFN, TabICL, and two classical baselines (Random Forest and XGBoost) on the same protocol, with cross-seed Mann–Whitney tests and paired McNemar tests to ensure the conclusions are statistically defensible.
- (3)
- Cross-domain operational characterisation. We provide a per-attack-class evaluation on the five-class WUSTL taxonomy, a cost-per-correct-prediction analysis of the LLM family, and a false-alarm-rate/detection-rate Pareto characterisation suitable for guiding deployment choices in industrial cyber-physical security.
- (4)
- Max-context sensitivity analysis. We complement the K-shot headline comparison with a max-context sensitivity analysis (Section 5.8) that quantifies the cost of the K-shot constraint on each tabular method family and confirms that the foundation-model-versus-classical comparison is operative in the data-constrained regime that motivates foundation-model approaches.
- (5)
- Deployment-oriented robustness analyses. We add a deployable confidence-gated tabular-to-LLM cascade that exceeds either standalone detector at ∼6% LLM escalation (Section 6.4), a leave-one-attack-type-out study of generalisation to unseen attack families (Section 5.11), a feature-budget ablation (Section 5.9), and a per-method computational-complexity profile.
2. Related Work
2.1. Classical and Deep Learning Intrusion Detection
2.2. Large Language Models in Cybersecurity
2.3. Tabular Foundation Models
2.4. Foundation Models for OT/ICS Intrusion Detection
3. Datasets
3.1. SWaT: Secure Water Treatment
3.2. HAI: HIL-Based Augmented Industrial Dataset
3.3. WUSTL-IIoT-2021
3.4. Data Preparation and Preprocessing
- Leakage column removal.
- Label resolution.
- Feature selection.
- Class subsampling and holdout construction.
- Random vs. chronological splitting.
- Multi-class data availability.
- Summary.
4. Methodology
4.1. Methods Evaluated
4.2. LLM Inference Protocol
- Role-instructed system message.
| Listing 1. Representative SWaT binary prompt ( shown for brevity; the protocol uses ). The listing is schematic: the deployed prompt serialises the 12 mutual-information-selected channels (Section 3.4) as a JSON object per row, e.g., {"AIT201":177.77, …} → Normal, rather than the human-readable form shown here. |
| [system] You are a senior process-control engineer monitoring a six-stage water-treatment plant. The~plant is instrumented with 51 channels: - LIT-xxx: level transmitters (water level in tanks, units: mm). - FIT-xxx: flow transmitters (water flow rate, units: m^3/h). - AIT-xxx: analyzer transmitters (chemistry, e.g.,~pH, conductivity). - MV-xxx, P-xxx, UV-xxx: motorized valves, pumps, UV lamps (state: 0/1/2). Classify each row as one of {Normal, Attack}. Respond with one word~only. [user] Example 1 (label: Normal): LIT-101=521.3, FIT-101=2.42, MV-101=2, P-101=1, .., AIT-503=7.89 → Normal Example 2 (label: Attack): LIT-101=812.7, FIT-101=2.41, MV-101=2, P-101=1, .., AIT-503=7.91 → Attack Query: LIT-101=655.2, FIT-101=2.40, MV-101=2, P-101=1, .., AIT-503=7.88 → |
- In-context examples.
- Query and output parsing.
| Listing 2. Representative WUSTL-IIoT-2021 five-class prompt ( per class shown for brevity). As in Listing 1, the listing is schematic: the deployed prompt serialises the 12 selected flow features (Section 3.4) as a JSON object per row rather than the human-readable form shown here. |
| [system] You are a network and ICS cybersecurity analyst monitoring an Industrial IoT testbed running Modbus/TCP traffic. Each row is a network flow summarised by 41 features: - TotPkts, TotBytes: total packet/byte counts in the flow. - SrcBytes, DstBytes: bytes from source/destination. - Rate, Load: flow rate and load. - sTtl, dTtl: source/destination TCP TTLs. - Sport, Dport, Proto: ports and transport protocol. - Other categorical and flag fields. Classify each flow as one of: Normal, DoS, Reconnaissance, CommandInjection, Backdoor. Respond with exactly one~label. [user] Example 1 (label: Normal): TotPkts=14, TotBytes=842, Rate=12.4, Sport=502, Dport=4711, .. → Normal Example 2 (label: DoS): TotPkts=8731, TotBytes=512040, Rate=4128.1, Sport=502, Dport=4712, .. → DoS Example 3 (label: Reconnaissance): TotPkts=22, TotBytes=1408, Rate=4.1, Sport=1024, Dport=502, .. → Reconnaissance Example 4 (label: CommandInjection): TotPkts=6, TotBytes=312, Rate=2.0, Sport=49152, Dport=502, .. → CommandInjection Example 5 (label: Backdoor): TotPkts=3, TotBytes=180, Rate=0.8, Sport=44321, Dport=502, .. → Backdoor Query: TotPkts=15, TotBytes=920, Rate=11.8, Sport=502, Dport=4711, .. → |
4.3. Tabular Anchor Configuration
- Random Forest.
- TabPFN.
- TabICL.
- Choice of classical baselines.
- Reproducibility summary.
4.4. Evaluation Protocol
- Holdout construction.
- Tabular anchors: full holdout.
- LLM: stratified-natural subsample.
- Multi-seed repetition.
- Two-level comparison protocol.
4.5. Metrics
4.6. Statistical Testing
- Paired McNemar tests.
- Cross-seed Mann–Whitney tests.
- Per-class significance.
- Effect size reporting.
5. Results
5.1. Headline Binary Comparison Across Datasets
5.2. Paired Significance: LLM Versus Each Anchor
5.3. Cross-LLM Significance Against the Baseline
5.4. Multi-Class Evaluation on WUSTL-IIoT-2021
5.5. Operational Characterisation
- Scope and time validity of the cost analysis.
5.6. Robustness Analyses
5.7. Summary of Findings
- No single method is universally best. Qwen3-235B-A22B wins SWaT with statistical confidence; TabICL wins HAI; TabPFN wins WUSTL at saturation.
- LLMs lose to tabular foundation models on HAI by a substantial margin. The accuracy deficit against TabICL on HAI is the largest single-dataset effect in the study and is significant at .
- The WUSTL multi-class deficit is class-specific. LLMs are competitive on rare attacks (Backdoor, CommInj) and substantially weaker on traffic-rich attacks (DoS, Reconn), losing 13 to 28 percentage points of per-class F1 to Random Forest.
- Cross-LLM variability matters for cost-per-correct. Llama-3.3-70B is on the cost/accuracy Pareto frontier on SWaT; Hermes-4-405B is dominated everywhere.
5.8. Max-Context Sensitivity Analysis
- Protocol.
- Findings.
- Rare-class drill-down.
- Implication for the paper’s central claim.
5.9. Sensitivity to the Number of Selected Features
5.10. Chronological-Split Sensitivity Analysis
5.11. Generalisation to Unseen Attack Families
6. Discussion
6.1. Interpreting the Dataset-Dependence Pattern
6.2. Why Tabular Foundation Models Dominate HAI
6.3. Why LLMs Trail on Traffic-Rich Attacks
6.4. From Diagnosis to Deployment: Confidence-Gated Hybrid Detection
6.5. Deployment Recommendations
- Default to a tabular foundation model.
- Add an LLM-based detector where semantic feature interpretation helps.
- Do not rely on an LLM alone for traffic-rich attacks.
- Escalate only low-confidence decisions.
- Read FAR and DR together, not just accuracy.
6.6. Limitations
- Public-mirror data artefacts.
- LLM calibration.
- Multi-testing correction.
- Seed budget.
- LLM model selection.
- Random splitting and the chronological-split robustness check.
- No real-world operational validation.
- K-shot constraint as a deliberate design choice.
6.7. Future Work
- Learned attack-type routing. The confidence-gated cascade of Section 6.4 escalates low-confidence tabular decisions to the LLM using only a scalar confidence threshold. A natural extension is a learned routing head, a lightweight pre-classifier over the same features that predicts which detector family will handle each sample best, which could exploit the per-attack-type complementarity (Section 5.4) more directly than a single global threshold.
- Soft-probability recovery for LLM detectors. Open-source LLMs return only a predicted label, leaving them out of the AUROC and AUPRC comparisons of Section 5.5. Recent calibration techniques that read token-level log-probabilities of the predicted-label token would recover a soft-probability proxy and enable threshold-tunable deployment.
- Cross-dataset transfer. Training on SWaT and deploying on HAI (or vice versa) would address the long-standing OT/ICS-IDS generalisation question on a modern foundation-model substrate, and would clarify whether the dataset-dependence pattern reported here is itself transferable.
- Natural-distribution evaluation on the original iTrust and NSRI releases. Re-running the protocol on the original SWaT and HAI distributions (rather than their Kaggle mirrors) would close the dataset-artefact loop flagged in Section 6.6 and would also expose the multi-class behaviour of the tabular foundation models on those two testbeds.
- Cross-campaign generalisation. The chronological-split analysis (Section 5.10) probes within-run detection, and the leave-one-attack-type-out analysis (Section 5.11) gives a first within-WUSTL measure of generalisation to attack families unseen at training time. Neither is a genuine cross-campaign evaluation, in which the training and test partitions correspond to distinct recording campaigns with potentially non-overlapping attack repertoires; that would require either the original iTrust/NSRI distributions with their campaign-level metadata or purpose-built multi-campaign OT/ICS corpora, and is a natural follow-on study.
- Feature-naming ablation on SWaT. The LLM advantage on SWaT (Section 5.1) is consistent with the hypothesis that semantically named sensor channels (LIT, FIT, P prefixes) let the LLM draw on process-engineering priors from pre-training, but the present protocol does not directly test this. A controlled comparison between the original SWaT prompt and an anonymised variant in which feature names are replaced with neutral identifiers (F1, F2, …, F51) under a generic binary-classifier system message would isolate the contribution of feature semantics to the LLM’s SWaT advantage and clarify whether the effect operates primarily through detection rate or through false-alarm rate.
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| AUPRC | Area Under the Precision–Recall Curve |
| AUROC | Area Under the Receiver Operating Characteristic Curve |
| DoS | Denial-of-Service |
| DR | Detection Rate |
| FAR | False Alarm Rate |
| HAI | HIL-based Augmented Industrial Dataset |
| HIL | Hardware-in-the-Loop |
| HMI | Human–Machine Interface |
| ICL | In-Context Learning |
| ICS | Industrial Control System |
| IDS | Intrusion Detection System |
| IIoT | Industrial Internet of Things |
| LLM | Large Language Model |
| MCC | Matthews Correlation Coefficient |
| OT | Operational Technology |
| PLC | Programmable Logic Controller |
| SCADA | Supervisory Control and Data Acquisition |
| SOC | Security Operations Centre |
| SWaT | Secure Water Treatment |
| TabICL | Tabular In-Context Learning |
| TabPFN | Tabular Prior-Fitted Network |
| WUSTL | Washington University in St. Louis IIoT-2021 Dataset |
| XGBoost | eXtreme Gradient Boosting |
References
- Meneghello, F.; Calore, M.; Zucchetto, D.; Polese, M.; Zanella, A. IoT: Internet of threats? A survey of practical security vulnerabilities in real IoT devices. IEEE Internet Things J. 2019, 6, 8182–8201. [Google Scholar] [CrossRef]
- Liao, H.J.; Lin, C.H.R.; Lin, Y.C.; Tung, K.Y. Intrusion detection system: A comprehensive review. J. Netw. Comput. Appl. 2013, 36, 16–24. [Google Scholar] [CrossRef]
- Buczak, A.L.; Guven, E. A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Commun. Surv. Tutor. 2015, 18, 1153–1176. [Google Scholar] [CrossRef]
- Khraisat, A.; Gondal, I.; Vamplew, P.; Kamruzzaman, J. Survey of intrusion detection systems: Techniques, datasets and challenges. Cybersecurity 2019, 2, 20. [Google Scholar] [CrossRef]
- Liu, H.; Lang, B. Machine learning and deep learning methods for intrusion detection systems: A survey. Appl. Sci. 2019, 9, 4396. [Google Scholar] [CrossRef]
- Lansky, J.; Ali, S.; Mohammadi, M.; Majeed, M.K.; Karim, S.H.T.; Rashidi, S.; Hosseinzadeh, M.; Rahmani, A.M. Deep learning-based intrusion detection systems: A systematic review. IEEE Access 2021, 9, 101574–101599. [Google Scholar] [CrossRef]
- Tavallaee, M.; Bagheri, E.; Lu, W.; Ghorbani, A.A. A detailed analysis of the KDD CUP 99 data set. In Proceedings of the 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications; IEEE: New York, NY, USA, 2009; pp. 1–6. [Google Scholar]
- Moustafa, N.; Slay, J. UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In Proceedings of the 2015 Military Communications and Information Systems Conference (MilCIS); IEEE: New York, NY, USA, 2015; pp. 1–6. [Google Scholar]
- Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp 2018, 1, 108–116. [Google Scholar] [CrossRef]
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
- Fernández, A.; Garcia, S.; Herrera, F.; Chawla, N.V. SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 2018, 61, 863–905. [Google Scholar] [CrossRef]
- Mohammad, R.; Saeed, F.; Almazroi, A.A.; Alsubaei, F.S.; Almazroi, A.A. Enhancing Intrusion Detection Systems Using a Deep Learning and Data Augmentation Approach. Systems 2024, 12, 79. [Google Scholar] [CrossRef]
- Koneru, S.S.; Cho, J. Bridging the Gap: A Comparative Analysis of ICS and IT Datasets for IDS Evaluation. In Proceedings of the 2024 2nd International Conference on Foundation and Large Language Models (FLLM); IEEE: New York, NY, USA, 2024; pp. 300–304. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems; NeurIPS; Curran Associates: Red Hook, NY, USA, 2017; Volume 30, pp. 5998–6008. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems; NeurIPS; Curran Associates: Red Hook, NY, USA, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
- Hasanov, I.; Virtanen, S.; Hakkala, A.; Isoaho, J. Application of Large Language Models in Cybersecurity: A Systematic Literature Review. IEEE Access 2024, 12, 176751–176778. [Google Scholar] [CrossRef]
- Zhang, J.; Bu, H.; Wen, H.; Liu, Y.; Fei, H.; Xi, R.; Li, L.; Yang, Y.; Zhu, H.; Meng, D. When llms meet cybersecurity: A systematic literature review. Cybersecurity 2025, 8, 1–41. [Google Scholar] [CrossRef]
- Balogh, S.; Mlyncek, M.; Vranák, O.; Zajac, P. Using Generative AI Models to Support Cybersecurity Analysts. Electronics 2024, 13, 4718. [Google Scholar] [CrossRef]
- DeCusatis, C.; Tomo, R.; Singh, A.; Khoury, E.; Masone, A. Cybersecurity Applications of Near-Term Large Language Models. Electronics 2025, 14, 2704. [Google Scholar] [CrossRef]
- Coppolino, L.; Iannaccone, A.; Nardone, R.; Petruolo, A. Asset Discovery in Critical Infrastructures: An LLM-Based Approach. Electronics 2025, 14, 3267. [Google Scholar] [CrossRef]
- Keltek, M.; Hu, R.; Sani, M.F.; Li, Z. LSAST: Enhancing Cybersecurity Through LLM-Supported Static Application Security Testing. In Proceedings of the IFIP International Conference on ICT Systems Security and Privacy Protection; Springer: Berlin/Heidelberg, Germany, 2025; pp. 166–179. [Google Scholar]
- Muhammad, M.; Shaaban, A.M.; German, R.; Al Sardy, L. HyLLM-IDS: A Conceptual Hybrid LLM-Assisted Intrusion Detection Framework for Cyber-Physical Systems. In Proceedings of the International Conference on Computer Safety, Reliability, and Security; Springer: Berlin/Heidelberg, Germany, 2025; pp. 129–142. [Google Scholar]
- Li, Y.; Xiang, Z.; Bastian, N.D.; Song, D.; Li, B. IDS-Agent: An LLM Agent for Explainable Intrusion Detection in IoT Networks. In Proceedings of the NeurIPS 2024 Workshop on Open-World Agents; NeurIPS; Curran Associates: Red Hook, NY, USA, 2024. [Google Scholar]
- Hollmann, N.; Müller, S.; Eggensperger, K.; Hutter, F. TabPFN: A transformer that solves small tabular classification problems in a second. In Proceedings of the International Conference on Learning Representations 2023, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Hollmann, N.; Müller, S.; Purucker, L.; Krishnakumar, A.; Körfer, M.; Hoo, S.B.; Schirrmeister, R.T.; Hutter, F. Accurate predictions on small data with a tabular foundation model. Nature 2025, 637, 319–326. [Google Scholar] [CrossRef] [PubMed]
- Qu, J.; Holzmüller, D.; Varoquaux, G.; Morvan, M.L. TabICL: A Tabular Foundation Model for In-Context Learning on Large Data. arXiv 2025, arXiv:2502.05564. [Google Scholar]
- García, P.; de Curtò, J.; de Zarzà, I. Foundation Models for Tabular Intrusion Detection: Evaluating TabPFN and LLM Few-Shot Classification on IoT Network Security. In Proceedings of the 2025 3rd International Conference on Foundation and Large Language Models (FLLM); IEEE: New York, NY, USA, 2025. [Google Scholar]
- García, P.; de Curtò, J.; de Zarzà, I.; Cano, J.C.; Calafate, C.T. Foundation Models for Cybersecurity: A Comprehensive Multi-Modal Evaluation of TabPFN and TabICL for Tabular Intrusion Detection. Electronics 2025, 14, 3792. [Google Scholar] [CrossRef]
- Hossain, M.A.; Islam, M.S. Enhancing DDoS attack detection with hybrid feature selection and ensemble-based classifier: A promising solution for robust cybersecurity. Meas. Sens. 2024, 32, 101037. [Google Scholar] [CrossRef]
- Lai, T.; Farid, F.; Bello, A.; Sabrina, F. Ensemble learning based anomaly detection for IoT cybersecurity via Bayesian hyperparameters sensitivity analysis. Cybersecurity 2024, 7, 44. [Google Scholar] [CrossRef]
- Yan, J.; Wang, Q.; Cheng, Y.; Su, Z.; Zhang, F.; Zhong, M.; Liu, L.; Jin, B.; Zhang, W. Optimized single-image super-resolution reconstruction: A multimodal approach based on reversible guidance and cyclical knowledge distillation. Eng. Appl. Artif. Intell. 2024, 133, 108496. [Google Scholar] [CrossRef]
- Wang, X.; Jiang, H.; Dong, Y.; Mu, M. Spatial-channel collaborative multi-scale graph interaction deep transfer learning for unsupervised rotating machinery fault diagnosis. Eng. Appl. Artif. Intell. 2026, 176, 114691. [Google Scholar] [CrossRef]
- Ismail, S.; Dandan, S.; Qushou, A. Intrusion detection in IoT and IIoT: Comparing lightweight machine learning techniques using TON_IoT, WUSTL-IIOT-2021, and EdgeIIoTset datasets. IEEE Access 2025, 13, 73468–73485. [Google Scholar] [CrossRef]
- Zhou, C.; Li, Q.; Li, C.; Yu, J.; Liu, Y.; Wang, G.; Zhang, K.; Ji, C.; Yan, Q.; He, L.; et al. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt. Int. J. Mach. Learn. Cybern. 2025, 16, 9851–9915. [Google Scholar]
- Yamin, M.M.; Hashmi, E.; Ullah, M.; Katt, B. Applications of llms for generating cyber security exercise scenarios. IEEE Access 2024, 12, 143806–143822. [Google Scholar] [CrossRef]
- Arik, S.Ö.; Pfister, T. Tabnet: Attentive interpretable tabular learning. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Palo Alto, CA, USA, 2021; Volume 35, pp. 6679–6687. [Google Scholar]
- Hegselmann, S.; Buendia, A.; Lang, H.; Agrawal, M.; Jiang, X.; Sontag, D. Tabllm: Few-shot classification of tabular data with large language models. In Proceedings of the International Conference on Artificial Intelligence and Statistics, PMLR, Valencia, Spain, 25–27 April 2023; pp. 5549–5581. [Google Scholar]
- Han, S.; Yoon, J.; Arik, S.O.; Pfister, T. Large language models can automatically engineer features for few-shot tabular learning. arXiv 2024, arXiv:2404.09491. [Google Scholar]
- Mathur, A.P.; Tippenhauer, N.O. SWaT: A water treatment testbed for research and training on ICS security. In Proceedings of the 2016 International Workshop on Cyber-Physical Systems for Smart Water Networks (CySWater); IEEE: New York, NY, USA, 2016; pp. 31–36. [Google Scholar]
- Goh, J.; Adepu, S.; Junejo, K.N.; Mathur, A. A dataset to support research in the design of secure water treatment systems. In Proceedings of the International Conference on Critical Information Infrastructures Security; Springer: Berlin/Heidelberg, Germany, 2016; pp. 88–99. [Google Scholar]
- Shin, H.K.; Lee, W.; Yun, J.H.; Kim, H. {HAI} 1.0:{HIL-based} augmented {ICS} security dataset. In Proceedings of the 13Th USENIX Workshop on Cyber Security Experimentation and Test (CSET 20); USENIX Association: Berkeley, CA, USA, 2020. [Google Scholar]
- Zolanvari, M. WUSTL-IIoT-2021: Industrial IoT Cybersecurity Dataset; IEEE-DataPort: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
- Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2016; pp. 785–794. [Google Scholar]














| Term | Definition as Used in this Paper |
|---|---|
| Foundation model | Large network pre-trained on broad data, deployed without task-specific gradient updates |
| Open-source LLM | Qwen3-235B-A22B, Llama-3.3-70B, Hermes-4-70B/405B, as zero-gradient in-context classifiers |
| Tabular foundation model | TabPFN and TabICL, transformers pre-trained for in-context inference over tabular data |
| Classical anchor | Random Forest and XGBoost reference baselines |
| OT/ICS-IDS | Intrusion detection on operational-technology/industrial-control-system telemetry |
| K-shot regime | Headline protocol with K in-context examples per class for every method |
| Max-context regime | Sensitivity protocol where tabular anchors use their native training-budget maximum |
| Dataset | Domain | Features (Native) | Features (Used) | Attack Prevalence | ||
|---|---|---|---|---|---|---|
| SWaT | Water-treatment SCADA | 51 sensor/actuator | 12 (MI-selected) | 99,980 † | 6000 | 50% (natural: ∼12%) |
| HAI | Multi-process HIL | 79 sensor/actuator | 12 (MI-selected) | 62,010 | 6000 | natural (∼3%) |
| WUSTL | IIoT Modbus/TCP flows | 41 flow features | 12 (MI-selected) | 99,980 † | 6000 | natural (∼7%) |
| Item | Value |
|---|---|
| Python/environment | 3.12 on Google Colab; NVIDIA A100 for orchestration, LLM inference offloaded to Nebius AI Studio API, TabICL on CPU |
| scikit-learn | 1.6.1 |
| xgboost | 3.2.0 |
| tabpfn-client | cloud API (tabpfn-client) |
| tabicl | checkpoint tabicl-classifier-v1.1-20250506 |
| openai (Nebius client) | 2.37.0 |
| numpy/pandas/scipy | 2.0.2/2.2.2/1.16.3 |
| Nebius AI Studio model IDs | |
| Qwen3-235B-A22B | Qwen/Qwen3-235B-A22B-Instruct-2507 |
| Llama-3.3-70B | meta-llama/Llama-3.3-70B-Instruct |
| Hermes-4-70B | NousResearch/Hermes-4-Llama-3.1-70B |
| Hermes-4-405B | NousResearch/Hermes-4-Llama-3.1-405B |
| LLM decoding | temperature 0, top_p 1, max_tokens 8, ; first-valid-label parse; unparseable → Attack (binary)/most common attack class (multi-class) |
| 10 per class (primary); in the E2 sweep | |
| Random Forest | n_estimators=200, max_features="sqrt", class_weight="balanced", otherwise scikit-learn defaults |
| XGBoost | n_estimators=200, max_depth=6, learning_rate=0.1, eval_metric="logloss"/"mlogloss", balanced class weighting |
| TabPFN | TabPFNClassifier cloud API; -per-class ICL support set in the headline E7 protocol, up to stratified support rows (v2 cap) in the max-context sweep |
| TabICL | v1.1 checkpoint, , norm_methods=["none","power"], Latin-square feature shuffling, shifted class shuffling, outlier threshold , softmax temperature , logit averaging |
| Code repository | https://github.com/drdecurto/fm-security (accessed on 2 June 2026) |
| Dataset | Model | Acc. | Macro F1 | MCC | FAR | DR |
|---|---|---|---|---|---|---|
| HAI | RandomForest | 0.707 ± 0.014 | 0.657 ± 0.011 | 0.413 ± 0.014 | 0.326 ± 0.019 | 0.842 ± 0.012 |
| HAI | XGBoost | 0.659 ± 0.000 | 0.607 ± 0.000 | 0.314 ± 0.000 | 0.364 ± 0.000 | 0.759 ± 0.000 |
| HAI | TabPFN | 0.683 ± 0.006 | 0.641 ± 0.006 | 0.406 ± 0.010 | 0.363 ± 0.006 | 0.875 ± 0.006 |
| HAI | TabICL | 0.733 ± 0.019 | 0.678 ± 0.014 | 0.432 ± 0.013 | 0.288 ± 0.029 | 0.822 ± 0.028 |
| HAI | Qwen3-235B-A22B | 0.690 ± 0.003 | 0.633 ± 0.004 | 0.350 ± 0.012 | 0.328 ± 0.003 | 0.764 ± 0.016 |
| SWaT | RandomForest | 0.827 ± 0.001 | 0.822 ± 0.001 | 0.695 ± 0.002 | 0.003 ± 0.002 | 0.657 ± 0.003 |
| SWaT | XGBoost | 0.824 ± 0.000 | 0.819 ± 0.000 | 0.684 ± 0.000 | 0.015 ± 0.000 | 0.662 ± 0.000 |
| SWaT | TabPFN | 0.805 ± 0.000 | 0.803 ± 0.000 | 0.627 ± 0.001 | 0.080 ± 0.002 | 0.691 ± 0.001 |
| SWaT | TabICL | 0.818 ± 0.001 | 0.813 ± 0.001 | 0.671 ± 0.003 | 0.022 ± 0.003 | 0.659 ± 0.003 |
| SWaT | Qwen3-235B-A22B | 0.836 ± 0.003 | 0.833 ± 0.004 | 0.700 ± 0.005 | 0.026 ± 0.003 | 0.699 ± 0.008 |
| WUSTL | RandomForest | 0.985 ± 0.001 | 0.985 ± 0.001 | 0.970 ± 0.002 | 0.023 ± 0.002 | 0.993 ± 0.001 |
| WUSTL | XGBoost | 0.894 ± 0.000 | 0.893 ± 0.000 | 0.806 ± 0.000 | 0.001 ± 0.000 | 0.790 ± 0.000 |
| WUSTL | TabPFN | 0.988 ± 0.000 | 0.988 ± 0.000 | 0.976 ± 0.000 | 0.022 ± 0.000 | 0.998 ± 0.000 |
| WUSTL | TabICL | 0.985 ± 0.000 | 0.985 ± 0.000 | 0.971 ± 0.001 | 0.028 ± 0.000 | 0.999 ± 0.001 |
| WUSTL | Qwen3-235B-A22B | 0.986 ± 0.000 | 0.986 ± 0.000 | 0.972 ± 0.001 | 0.026 ± 0.000 | 0.998 ± 0.000 |
| Dataset | LLM | Anchor | Acc | b (LLM Only) | c (Anc Only) | p |
|---|---|---|---|---|---|---|
| HAI | Qwen3-235B-A22B | RandomForest | −0.036 | 253 | 472 | 5.66 * |
| HAI | Qwen3-235B-A22B | TabICL | −0.072 | 262 | 696 | 1.80 * |
| HAI | Qwen3-235B-A22B | TabPFN | −0.001 | 428 | 432 | 0.919 |
| SWaT | Qwen3-235B-A22B | RandomForest | +0.008 | 137 | 87 | 1.06 * |
| SWaT | Qwen3-235B-A22B | TabICL | +0.017 | 228 | 123 | 2.84 * |
| SWaT | Qwen3-235B-A22B | TabPFN | +0.032 | 364 | 171 | 1.03 * |
| WUSTL | Qwen3-235B-A22B | RandomForest | +0.003 | 18 | 2 | 4.02 * |
| WUSTL | Qwen3-235B-A22B | TabICL | +0.001 | 15 | 8 | 0.210 |
| WUSTL | Qwen3-235B-A22B | TabPFN | +0.001 | 20 | 14 | 0.391 |
| Model | Kind | Accuracy | Macro F1 | MCC |
|---|---|---|---|---|
| TabPFN | Anchor | 0.992 ± 0.000 | 0.924 ± 0.003 | 0.985 ± 0.000 |
| RandomForest | Anchor | 0.988 ± 0.001 | 0.883 ± 0.003 | 0.979 ± 0.001 |
| TabICL | Anchor | 0.986 ± 0.001 | 0.862 ± 0.004 | 0.976 ± 0.002 |
| Qwen3-235B-A22B | LLM | 0.981 ± 0.004 | 0.805 ± 0.011 | 0.967 ± 0.007 |
| Model | Backdoor | Comminj | Dos | Normal | Reconn | Macro |
|---|---|---|---|---|---|---|
| Hermes-4-405B | 0.766 [0.718, 0.799] | 0.871 [0.834, 0.899] | 0.747 [0.700, 0.791] | 0.841 [0.777, 0.900] | 0.763 [0.653, 0.874] | 0.798 |
| Hermes-4-70B | 0.791 [0.764, 0.825] | 0.860 [0.828, 0.897] | 0.694 [0.633, 0.769] | 0.874 [0.829, 0.925] | 0.704 [0.625, 0.818] | 0.785 |
| Llama-3.3-70B | 0.823 [0.786, 0.866] | 0.882 [0.853, 0.911] | 0.819 [0.742, 0.886] | 0.870 [0.813, 0.920] | 0.748 [0.606, 0.889] | 0.828 |
| Qwen3-235B-A22B | 0.824 [0.778, 0.868] | 0.872 [0.835, 0.896] | 0.819 [0.752, 0.898] | 0.922 [0.880, 0.965] | 0.774 [0.673, 0.889] | 0.842 |
| XGBoost | 0.650 | 0.280 | 0.924 | 0.917 | 0.993 | 0.753 |
| RandomForest | 0.812 [0.761, 0.860] | 0.891 [0.868, 0.908] | 0.974 [0.955, 0.992] | 0.938 [0.903, 0.967] | 0.908 [0.859, 0.957] | 0.904 |
| Model | Input (USD/ tok) | Output (USD/ tok) |
|---|---|---|
| Llama-3.3-70B | ||
| Qwen3-235B-A22B | ||
| Hermes-4-70B | ||
| Hermes-4-405B |
| Method | Family | Params (Tot./Act.) | Size | Fit (s) | Infer. (s/1k) | FLOPs/Query |
|---|---|---|---|---|---|---|
| RandomForest | tabular | non-param. | MB | negligible | ||
| XGBoost | tabular | non-param. | MB | < | negligible | |
| TabPFN | tabular | non-param. | MB | negligible | ||
| TabICL | tabular | non-param. | MB | negligible | ||
| Llama-3.3-70B | LLM | 70B / 70B | served | 0 † | 52 | ∼ TFLOP |
| Qwen3-235B-A22B | LLM | 235B / 22B | served | 0 † | 40 | ∼ TFLOP |
| Hermes-4-70B | LLM | 70B / 70B | served | 0 † | 32 | ∼ TFLOP |
| Hermes-4-405B | LLM | 405B / 405B | served | 0 † | 49 | ∼ TFLOP |
| Method | Training Budget |
|---|---|
| RandomForest | Full training pool (∼80,000 SWaT, ∼50,000 HAI, ∼80,000 WUSTL) |
| XGBoost | Full training pool (same as RandomForest) |
| TabPFN | 10,000 stratified support rows (v2 native limit) |
| TabICL | 50,000 stratified support rows (native large-context support) |
| Dataset | Method | (K-Shot) | MCC (K-Shot) | (Max-Ctx) | MCC (Max-Ctx) | |
|---|---|---|---|---|---|---|
| HAI | RandomForest | 20 | 0.446 ± 0.024 | 49,624 | 0.967 ± 0.000 | +0.521 |
| HAI | XGBoost | 20 | 0.313 ± 0.000 | 49,624 | 0.944 ± 0.000 | +0.632 |
| HAI | TabPFN | 20 | 0.595 ± 0.000 | 10,000 | 0.958 ± 0.000 | +0.363 |
| HAI | TabICL | 20 | 0.556 ± 0.006 | 49,624 | 0.959 ± 0.003 | +0.403 |
| SWaT | RandomForest | 20 | 0.649 ± 0.002 | 80,000 | 0.998 ± 0.000 | +0.350 |
| SWaT | XGBoost | 20 | 0.582 ± 0.000 | 80,000 | 0.997 ± 0.000 | +0.415 |
| SWaT | TabPFN | 20 | 0.650 ± 0.000 | 10,000 | 0.994 ± 0.000 | +0.344 |
| SWaT | TabICL | 20 | 0.661 ± 0.000 | 50,000 | 0.997 ± 0.000 | +0.336 |
| WUSTL (binary) | RandomForest | 20 | 0.970 ± 0.001 | 80,000 | 1.000 ± 0.000 | +0.029 |
| WUSTL (binary) | XGBoost | 20 | 0.828 ± 0.000 | 80,000 | 0.999 ± 0.000 | +0.171 |
| WUSTL (binary) | TabPFN | 20 | 0.980 ± 0.000 | 10,000 | 0.999 ± 0.000 | +0.019 |
| WUSTL (binary) | TabICL | 20 | 0.971 ± 0.002 | 50,000 | 1.000 ± 0.000 | +0.029 |
| WUSTL (mc) | RandomForest | 50 | 0.909 ± 0.005 | 86,969 | 1.000 ± 0.000 | +0.090 |
| WUSTL (mc) | XGBoost | 50 | 0.813 ± 0.000 | 86,969 | 0.999 ± 0.000 | +0.187 |
| WUSTL (mc) | TabPFN | 50 | 0.898 ± 0.000 | 10,000 | 0.999 ± 0.000 | +0.101 |
| WUSTL (mc) | TabICL | 50 | 0.955 ± 0.005 | 50,000 | 1.000 ± 0.000 | +0.045 |
| Regime | Method | Backdoor | CommInj | DoS | Normal | Reconn |
|---|---|---|---|---|---|---|
| K-shot | RandomForest | 0.049 | 0.708 | 0.949 | 0.985 | 0.981 |
| K-shot | XGBoost | 0.273 | 0.120 | 0.883 | 0.932 | 0.810 |
| K-shot | TabPFN | 0.091 | 0.233 | 0.937 | 0.989 | 0.900 |
| K-shot | TabICL | 0.121 | 0.622 | 0.983 | 0.985 | 0.981 |
| Max-context | RandomForest | 0.968 | 0.991 | 1.000 | 1.000 | 1.000 |
| Max-context | XGBoost | 0.977 | 1.000 | 1.000 | 1.000 | 1.000 |
| Max-context | TabPFN | 0.914 | 0.981 | 1.000 | 0.999 | 1.000 |
| Max-context | TabICL | 0.962 | 1.000 | 1.000 | 1.000 | 1.000 |
| Backdoor | CommInj | ||||
|---|---|---|---|---|---|
| Regime | Method | Precision | Recall | Precision | Recall |
| K-shot | RandomForest | 0.025 | 0.643 | 0.560 | 0.962 |
| K-shot | XGBoost | 0.164 | 0.810 | 0.064 | 0.923 |
| K-shot | TabPFN | 0.048 | 0.810 | 0.133 | 0.942 |
| K-shot | TabICL | 0.067 | 0.643 | 0.470 | 0.923 |
| Max-context | RandomForest | 0.976 | 0.960 | 0.981 | 1.000 |
| Max-context | XGBoost | 0.955 | 1.000 | 1.000 | 1.000 |
| Max-context | TabPFN | 0.949 | 0.881 | 0.981 | 0.981 |
| Max-context | TabICL | 0.933 | 0.992 | 1.000 | 1.000 |
| Dataset | RandomForest | XGBoost | TabPFN | TabICL | Qwen3-235B-A22B |
|---|---|---|---|---|---|
| SWaT | 0.696 | 0.651 | 0.682 | 0.682 | 0.724 |
| HAI | 0.531 | 0.459 | 0.532 | 0.609 | 0.373 |
| WUSTL | 0.971 | 0.845 | 0.977 | 0.973 | 0.973 |
| Dataset | Model | Acc. | Macro F1 | MCC | FAR | DR |
|---|---|---|---|---|---|---|
| SWaT | TabPFN | 0.989 ± 0.003 | 0.989 ± 0.003 | 0.978 ± 0.007 | 0.012 ± 0.006 | 0.990 ± 0.002 |
| SWaT | Qwen3-235B-A22B | 0.986 ± 0.001 | 0.986 ± 0.001 | 0.972 ± 0.003 | 0.015 ± 0.003 | 0.987 ± 0.002 |
| SWaT | RandomForest | 0.944 ± 0.016 | 0.944 ± 0.016 | 0.893 ± 0.028 | 0.106 ± 0.033 | 0.995 ± 0.002 |
| SWaT | TabICL | 0.914 ± 0.022 | 0.913 ± 0.023 | 0.841 ± 0.039 | 0.000 ± 0.000 | 0.828 ± 0.044 |
| HAI | RandomForest | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.000 ± 0.000 | 1.000 ± 0.000 |
| HAI | TabPFN | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.000 ± 0.000 | 1.000 ± 0.000 |
| HAI | Qwen3-235B-A22B | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.000 ± 0.000 | 1.000 ± 0.000 |
| HAI | TabICL | 0.997 ± 0.003 | 0.995 ± 0.005 | 0.989 ± 0.009 | 0.004 ± 0.004 | 1.000 ± 0.000 |
| Held-Out Family | RandomForest | XGBoost | TabPFN | TabICL | Qwen3-235B-A22B |
|---|---|---|---|---|---|
| Backdoor | 0.668 | 0.064 | 0.664 | 0.640 | 0.893 |
| CommInj | 0.183 | 0.145 | 0.970 | 0.964 | 0.716 |
| DoS | 0.587 | 0.228 | 0.403 | 0.754 | 0.595 |
| Reconn | 0.380 | 0.275 | 0.528 | 0.809 | 0.978 |
| Mean | 0.454 | 0.178 | 0.641 | 0.792 | 0.795 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
de Curtò, J.; de Zarzà, I.; Cano, J.C.; Calafate, C.T. A Comparative Study of Large Language Models for Industrial Cyber-Physical Security. Electronics 2026, 15, 2779. https://doi.org/10.3390/electronics15132779
de Curtò J, de Zarzà I, Cano JC, Calafate CT. A Comparative Study of Large Language Models for Industrial Cyber-Physical Security. Electronics. 2026; 15(13):2779. https://doi.org/10.3390/electronics15132779
Chicago/Turabian Stylede Curtò, J., I. de Zarzà, Juan Carlos Cano, and Carlos T. Calafate. 2026. "A Comparative Study of Large Language Models for Industrial Cyber-Physical Security" Electronics 15, no. 13: 2779. https://doi.org/10.3390/electronics15132779
APA Stylede Curtò, J., de Zarzà, I., Cano, J. C., & Calafate, C. T. (2026). A Comparative Study of Large Language Models for Industrial Cyber-Physical Security. Electronics, 15(13), 2779. https://doi.org/10.3390/electronics15132779
