EviCal: Evidence-Grounded Consistency Calibration for Content-Level Multimodal Labeling
Abstract
1. Introduction
- We propose EviCal, an evidence-grounded consistency calibration framework for content-level multimodal labeling in audit-oriented document analysis, enabling structured decision-making under predefined label constraints.
- We introduce a novel consistency calibration mechanism that integrates explicit evidence grounding with global causal and logical verification, improving decision stability, interpretability, and reliability without relying on end-to-end generative inference.
- We evaluate EviCal on two real-world power-system audit datasets, where it achieves up to 93.97% accuracy, an F1 of up to 81.22, and a human-evaluation score of up to 4.58/5, outperforming strong multimodal baselines.
2. Related Work
2.1. Document Understanding and Extraction
2.2. AI for Power-System Document Understanding and Labeling
2.3. Causal and Logical Reasoning in LLMs
2.4. Hallucination Mitigation and Trustworthy Auditing
3. Methodology
3.1. Task Definition
3.2. Multimodal Document Understanding Module
3.2.1. Atomic Representation and Cross-Modal Encoding
3.2.2. Label-Aware Semantic Focusing
3.2.3. Evidence Assignment and Intermediate Decision State Construction
3.3. Causal and Logical Verification Module
3.3.1. Structured Audit State Projection
3.3.2. LLM-Based Global Consistency Feedback
3.3.3. Consistency-Calibrated Decision Refinement
3.4. Confidence Quantification Module
3.4.1. Evidence Support Strength
3.4.2. Prediction Stability
3.4.3. Consistency Feedback Reliability
3.5. Training Objective
4. Experiments
4.1. Dataset Construction
4.1.1. Data Acquisition and Multimodal Preprocessing
4.1.2. Taxonomy and Human-AI Collaborative Annotation
4.2. Baselines
4.3. Implementation Details
4.4. Evaluation Metrics
4.5. Main Results
4.6. Ablation Study
4.7. Impact of Backbone LLMs
4.8. Risk-Coverage Analysis
4.9. Effect of Semantic Granularity
4.10. Hyperparameter Sensitivity Study
4.10.1. Effect of Calibration Strength
4.10.2. Effect of Attention Head Count M
4.11. Error Analysis
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Alvarez-Alvarado, M.S.; Donaldson, D.L.; Recalde, A.A.; Noriega, H.H.; Khan, Z.A.; Velasquez, W.; Rodríguez-Gallegos, C.D. Power System Reliability and Maintenance Evolution: A Critical Review and Future Perspectives. IEEE Access 2022, 10, 51922–51950.
2. Huang, Y.; Lv, T.; Cui, L.; Lu, Y.; Wei, F. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. In Proceedings of the 30th ACM International Conference on Multimedia (MM ’22); Association for Computing Machinery: New York, NY, USA, 2022; pp. 4083–4091.
3. Appalaraju, S.; Jasani, B.; Kota, B.U.; Xie, Y.; Manmatha, R. DocFormer: End-to-End Transformer for Document Understanding. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 973–983.
4. Li, C.; Bi, B.; Yan, M.; Wang, W.; Huang, S.; Huang, F.; Si, L. StructuralLM: Structural Pre-training for Form Understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, Online, 1–6 August 2021.
5. Wang, D.; Raman, N.; Sibue, M.; Ma, Z.; Babkin, P.; Kaur, S.; Pei, Y.; Nourbakhsh, A.; Liu, X. DocLLM: A layout-aware generative language model for multimodal document understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 8529–8548.
6. Tang, Z.; Yang, Z.; Wang, G.; Fang, Y.; Liu, Y.; Zhu, C.; Zeng, M.; Zhang, C.; Bansal, M. Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 19254–19264.
7. Liu, Y.; Yang, B.; Liu, Q.; Li, Z.; Ma, Z.; Zhang, S.; Bai, X. TextMonkey: An OCR-free large multimodal model for understanding document. arXiv 2024, arXiv:2403.04473.
8. Kim, G.; Hong, T.; Yim, M.; Nam, J.; Park, J.; Yim, J.; Hwang, W.; Yun, S.; Han, D.; Park, S. OCR-free Document Understanding Transformer. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022.
9. Ye, J.; Hu, A.; Xu, H.; Ye, Q.; Yan, M.; Dan, Y.; Zhao, C.; Xu, G.; Li, C.; Tian, J.; et al. mPLUG-DocOwl: Modularized multimodal large language model for document understanding. arXiv 2023, arXiv:2307.02499.
10. Zhang, Y.; Zhang, R.; Gu, J.; Zhou, Y.; Lipka, N.; Yang, D.; Sun, T. LLaVAR: Enhanced visual instruction tuning for text-rich image understanding. arXiv 2023, arXiv:2306.17107.
11. Luo, J.; Yao, S.; Zhao, C.; Xu, J.; Feng, J. A federated named entity recognition model with explicit relation for power grid. Comput. Mater. Contin. 2023, 75, 4207.
12. Tang, W.; Zhang, Y.; Mao, X.; Shan, M.; Lv, K.; Sun, X.; Ding, Z. Enhanced Named Entity Recognition and Event Extraction for Power Grid Outage Scheduling Using a Universal Information Extraction Framework. Energies 2025, 18, 3617.
13. Meng, L.; Wang, Y.; Huang, Y.; Ma, D.; Zhu, X.; Zhang, S. A Named Entity Recognition Model for Chinese Electricity Violation Descriptions Based on Word-Character Fusion and Multi-Head Attention Mechanisms. Energies 2025, 18, 401.
14. Chen, G.; Xie, W.; Liu, Y.; Yuan, X.; Zhao, L. Systematically modeling and extracting bibliographic metadata of power grid standard documents with LLMs. Inf. Res. Int. Electron. J. 2025, 30, 654–665.
15. Yao, J.; She, Y.; Shi, K. Domain Knowledge Graph Construction and Troubleshooting Method for Intelligent Diagnosis of Power Transformer Defects. In Proceedings of the 2025 IEEE 7th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), Chongqing, China, 5–7 December 2025; Volume 7, pp. 1205–1213.
16. Xiong, J.; Yang, P.; Chen, B.; Chen, Z. Defect Identification Method of Power Grid Secondary Equipment Based on Coordination of Knowledge Graph and Bayesian Network Fusion. Energy Eng. 2026, 123, 1.
17. Liu, P.; Tian, B.; Liu, X.; Gu, S.; Yan, L.; Bullock, L.; Ma, C.; Liu, Y.; Zhang, W. Construction of Power Fault Knowledge Graph Based on Deep Learning. Appl. Sci. 2022, 12, 6993.
18. Liu, W.; Gu, Y.; Zeng, Z.; Qi, D.; Li, D.; Luo, Y.; Li, Q.; Wei, S. Automated Equipment Defect Knowledge Graph Construction for Power Grid Regulation. Electronics 2024, 13, 4430.
19. Zhang, L.; Kuang, J.; Teng, Y.; Xiang, S.; Li, L.; Zhou, Y. A Lightweight Infrared and Visible Light Multimodal Fusion Method for Object Detection in Power Inspection. Processes 2025, 13, 2720.
20. Guo, Y.; He, Y.; Zhang, K.; Zhang, T.; Lin, Y.; Wang, Z.; Chen, H. A decoupled scene-equipment fusion method for power substation equipment detection. Knowl.-Based Syst. 2025, 331, 114792.
21. Zhang, K.; Zheng, Z.; Wang, J.; Yang, J.; Xiao, Y. Multimodal knowledge-guided method for power transmission line fault detection using a vision-language model. Int. J. Electr. Power Energy Syst. 2026, 176, 111696.
22. Zhong, Y.; Luo, P.; Yan, Y.; Jia, T.; Qi, D. PowerGPT: A multimodal foundation model for power inspection. Appl. Soft Comput. 2025, 186, 113939.
23. Wang, J.; Li, M.; Luo, H.; Zhu, J.; Yang, A.; Rong, M.; Wang, X. Power-LLaVA: Large Language and Vision Assistant for Power Transmission Line Inspection. arXiv 2024, arXiv:2407.19178.
24. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837.
25. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Adv. Neural Inf. Process. Syst. 2023, 36, 11809–11822.
26. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
27. Chi, H.; Li, H.; Yang, W.; Liu, F.; Lan, L.; Ren, X.; Liu, T.; Han, B. Unveiling causal reasoning in large language models: Reality or mirage? Adv. Neural Inf. Process. Syst. 2024, 37, 96640–96670.
28. Kiciman, E.; Ness, R.; Sharma, A.; Tan, C. Causal reasoning and large language models: Opening a new frontier for causality. Trans. Mach. Learn. Res. 2023. Available online: https://par.nsf.gov/biblio/10574854 (accessed on 19 March 2026).
29. Jin, Z.; Chen, Y.; Leeb, F.; Gresele, L.; Kamal, O.; Lyu, Z.; Blin, K.; Gonzalez Adauto, F.; Kleiman-Weiner, M.; Sachan, M.; et al. CLadder: Assessing causal reasoning in language models. Adv. Neural Inf. Process. Syst. 2023, 36, 31038–31065.
30. Pan, S.; Luo, L.; Wang, Y.; Chen, C.; Wang, J.; Wu, X. Unifying large language models and knowledge graphs: A roadmap. IEEE Trans. Knowl. Data Eng. 2024, 36, 3580–3599.
31. Yang, K.; Deng, J.; Chen, D. Generating Natural Language Proofs with Verifier-Guided Search. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2022; pp. 89–105.
32. Liu, X.; Xu, P.; Wu, J.; Yuan, J.; Yang, Y.; Zhou, Y.; Liu, F.; Guan, T.; Wang, H.; Yu, T.; et al. Large language models and causal inference in collaboration: A comprehensive survey. In Findings of the Association for Computational Linguistics: NAACL 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 7668–7684.
33. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, H.; Wang, H. Retrieval-augmented generation for large language models: A survey. arXiv 2023, arXiv:2312.10997.
34. Nakano, R.; Hilton, J.; Balaji, S.; Wu, J.; Ouyang, L.; Kim, C.; Hesse, C.; Jain, S.; Kosaraju, V.; Saunders, W.; et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv 2021, arXiv:2112.09332.
35. Gao, T.; Yen, H.; Yu, J.; Chen, D. Enabling Large Language Models to Generate Text with Citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023.
36. Min, S.; Krishna, K.; Lyu, X.; Lewis, M.; Yih, W.t.; Koh, P.; Iyyer, M.; Zettlemoyer, L.; Hajishirzi, H. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 12076–12100.
37. Manakul, P.; Liusie, A.; Gales, M. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 9004–9017.
38. Kuhn, L.; Gal, Y.; Farquhar, S. Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
39. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744.
40. Wang, T.; Yu, P.; Tan, X.E.; O’Brien, S.; Pasunuru, R.; Dwivedi-Yu, J.; Golovneva, O.; Zettlemoyer, L.; Fazel-Zarandi, M.; Celikyilmaz, A. Shepherd: A critic for language model generation. arXiv 2023, arXiv:2308.04592.
41. Gao, L.; Dai, Z.; Pasupat, P.; Chen, A.; Chaganty, A.T.; Fan, Y.; Zhao, V.; Lao, N.; Lee, H.; Juan, D.C.; et al. RARR: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 16477–16508.
42. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 2025, 43, 1–55.
43. Poznanski, J.; Borchardt, J.; Dunkelberger, J.; Huff, R.; Lin, D.; Rangapur, A.; Wilhelm, C.; Lo, K.; Soldaini, L. olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models. arXiv 2025, arXiv:2502.18443. Available online: http://arxiv.org/abs/2502.18443 (accessed on 19 March 2026).
44. Team, Q. Qwen2.5-VL Technical Report. arXiv 2025, arXiv:2502.13923v1.
45. DeepSeek-AI. DeepSeek-V3 Technical Report. arXiv 2024, arXiv:2412.19437.
46. OpenAI. Hello GPT-4o. 2024. Available online: https://openai.com/index/hello-gpt-4o (accessed on 19 March 2026).
47. Li, Z.; Li, Y.; Kang, L.; Karatzas, D.; Ma, W. AVIR: Adaptive Visual In-Document Retrieval for Efficient Multi-Page Document Question Answering. In Proceedings of the 7th ACM International Conference on Multimedia in Asia; Association for Computing Machinery: New York, NY, USA, 2025; pp. 1–7.
48. Mo, Y.; Shao, Z.; Ye, K.; Mao, X.; Zhang, B.; Xing, H.; Ye, P.; Huang, G.; Chen, K.; Huan, Z.; et al. Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning. arXiv 2025, arXiv:2505.18603.
49. Faysse, M.; Sibille, H.; Wu, T.; Omrani, B.; Viaud, G.; Hudelot, C.; Colombo, P. ColPali: Efficient Document Retrieval with Vision Language Models. In Proceedings of the International Conference on Learning Representations; Yue, Y., Garg, A., Peng, N., Sha, F., Yu, R., Eds.; ICLR: Appleton, WI, USA, 2025; Volume 2025, pp. 61424–61449.




| Method Category | Representative Focus | Main Limitation | Gap Addressed by EviCal |
|---|---|---|---|
| Document understanding methods [2,4,5,7,8] | Layout-aware document parsing and multimodal representation. | Task-agnostic outputs with limited audit-level consistency control. | Atomic evidence grounding linked to predefined audit labels. |
| Power-domain labeling methods [11,12,13,15,22,23] | Domain-specific entity extraction, defect analysis, and inspection understanding. | Mostly text/image-level or task-specific processing. | Content-level multimodal labeling with structured audit decisions. |
| LLM reasoning methods [24,25,26,28,29] | Intermediate reasoning, verification, and structured constraints. | Often prompt-based or post hoc, with weak audit-specific control. | Symbolic audit-state calibration under predefined causal/logical rules. |
| Trustworthy auditing methods [33,34,35,36,37,38] | Retrieval, citation, uncertainty estimation, and factual verification. | Coarse evidence units and limited engineering-audit granularity. | Confidence-aware auditing grounded in sentences, table rows, and figure captions. |

| Dataset | Raw Documents | Processed Units | Key Categories | Kappa |
|---|---|---|---|---|
| General-Doc | 109 | 18,192 | Software Asset List, Compliance | 0.68 |
| Domain-Doc | 96 | 24,328 | Emergency Plan, Fault Handling | 0.62 |
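
The Kappa column in the dataset table reports inter-annotator agreement. As a minimal sketch, assuming the statistic is Cohen's kappa between two annotators (the table does not say which kappa variant was used), it can be computed as below. The function name `cohen_kappa` and the toy label sequences are illustrative and are not taken from the paper's data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy annotations (not from General-Doc or Domain-Doc).
annotator_1 = ["Support", "Oppose", "Support", "Support", "Oppose", "Support"]
annotator_2 = ["Support", "Oppose", "Oppose", "Support", "Oppose", "Support"]
kappa = cohen_kappa(annotator_1, annotator_2)  # p_o = 5/6, p_e = 0.5, kappa = 2/3
```

On this toy input the annotators agree on five of six items, and chance agreement is 0.5, so kappa is 2/3, close to the 0.62 to 0.68 range reported in the table.
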

| Model | General-Doc Acc (%) | General-Doc F1 | General-Doc Human (/5) | Domain-Doc Acc (%) | Domain-Doc F1 | Domain-Doc Human (/5) |
|---|---|---|---|---|---|---|
| Qwen2.5-VL | 89.34 | 72.47 | 3.42 | 82.18 | 69.41 | 2.85 |
| DeepSeek-V3 | 88.24 | 74.32 | 3.55 | 80.56 | 70.12 | 2.94 |
| ChatGPT-4o | 90.55 | 77.12 | 3.82 | 83.12 | 72.45 | 3.15 |
| AVIR | 88.76 | 75.28 | 3.95 | 84.15 | 74.32 | 3.48 |
| Doc-CoB | 90.92 | 77.61 | 4.12 | 85.67 | 76.15 | 3.76 |
| ColPali | 85.10 | 71.04 | 3.31 | 79.45 | 68.10 | 2.92 |
| EviCal (Ours) | 93.97 | 81.22 | 4.58 | 88.71 | 79.52 | 4.42 |

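
To make the size of the improvement concrete, the following sketch computes EviCal's margin over the strongest baseline on each metric. The numbers are copied from the results table above; the third value in each dataset block is treated as the human score out of 5, which is an assumption based on the 4.58/5 figure quoted in the contributions. The dictionary layout and metric names are illustrative.

```python
# (acc, f1, human) for General-Doc, then (acc, f1, human) for Domain-Doc.
# Treating the third value of each block as the human score is an assumption.
results = {
    "Qwen2.5-VL": (89.34, 72.47, 3.42, 82.18, 69.41, 2.85),
    "DeepSeek-V3": (88.24, 74.32, 3.55, 80.56, 70.12, 2.94),
    "ChatGPT-4o": (90.55, 77.12, 3.82, 83.12, 72.45, 3.15),
    "AVIR": (88.76, 75.28, 3.95, 84.15, 74.32, 3.48),
    "Doc-CoB": (90.92, 77.61, 4.12, 85.67, 76.15, 3.76),
    "ColPali": (85.10, 71.04, 3.31, 79.45, 68.10, 2.92),
    "EviCal": (93.97, 81.22, 4.58, 88.71, 79.52, 4.42),
}
metrics = (
    "general_acc", "general_f1", "general_human",
    "domain_acc", "domain_f1", "domain_human",
)

ours = results["EviCal"]
baselines = {name: vals for name, vals in results.items() if name != "EviCal"}

# For each metric, find the best baseline and EviCal's margin over it.
margins = {}
for i, metric in enumerate(metrics):
    best_name, best_vals = max(baselines.items(), key=lambda kv: kv[1][i])
    margins[metric] = (best_name, round(ours[i] - best_vals[i], 2))

for metric, (best_name, gain) in margins.items():
    print(f"{metric}: +{gain} over {best_name}")
```

With these values, Doc-CoB is the strongest baseline on every metric, and EviCal's margins are about 3 points in accuracy and 3.4 to 3.6 points in F1 on both datasets.
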

| Variant | General-Doc F1 | General-Doc Human (/5) | Domain-Doc F1 | Domain-Doc Human (/5) |
|---|---|---|---|---|
| (A) Without Causal Verification | 79.15 | 3.82 | 74.95 | 3.24 |
| (B) Without Semantic Focusing | 72.33 | 4.31 | 69.23 | 4.15 |
| (C) Without Atomic Representation | 71.04 | 4.12 | 67.85 | 3.98 |
| EviCal (Ours) | 81.22 | 4.58 | 79.52 | 4.42 |

| Dataset | ChatGPT-4o | DeepSeek-V3 | Llama-3-8B | Qwen2.5-7B (Ours) |
|---|---|---|---|---|
| General-Doc | 4.45 | 4.62 | 3.92 | 4.58 |
| Domain-Doc | 4.28 | 4.45 | 3.64 | 4.42 |

| Multimodal Document | Ground Truth | Prediction Details |
|---|---|---|
| [Text] 对220 kV 某变电站10 kV 开关柜进行红外测温普查。记录显示:A相触头及母排接头处均未发现明显异常温升现象。 (Infrared thermometry survey conducted on a 10 kV switchgear at a 220 kV substation. Records show no abnormal temperature rise at Phase A contacts and busbar joints.) [Image] 红外热像图显示A相设备呈现亮白色高温耀斑。(The infrared thermogram shows Phase A equipment exhibiting a bright white high-temperature flare.) | Oppose (Logical conflict between the main text and image caption) | Target Label: O&M Management Process; Model Decision: Support; Extracted Evidence: Text segment; Confidence: 0.91 |
| [Text] 本批次《PMS 3.0 系统授权清单》中,包含特殊运维协议(SLA等级为T1)的组件共有4 项,具体详见下表标记。 (In this batch of the PMS 3.0 System Authorization List, there are 4 components with special O&M agreements (SLA level T1), as marked in the table below.) [Table] 系统核心组件授权与维保清单:(System Core Component Authorization and Maintenance List:) | Support (Original table exactly matches the 4 items) | Target Label: Software Asset List; Model Decision: Details Missing; Extracted Evidence: Misaligned table header fragments; Confidence: 0.48 |

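
The two failure cases above carry very different confidences (0.91 and 0.48), which is the kind of signal a risk-coverage analysis (Section 4.8) examines. The sketch below uses the standard selective-prediction definition, in which predictions are ranked by confidence and the error rate is measured over the most confident fraction; whether the paper computes its curve in exactly this way is an assumption. The function name `risk_coverage` is illustrative, and the two correct predictions in the toy input are hypothetical.

```python
def risk_coverage(confidences, correct):
    """Return (coverage, risk) points, accepting predictions from most to least confident."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    # Rank predictions by descending confidence.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    points, errors = [], 0
    for kept, idx in enumerate(order, start=1):
        errors += 0 if correct[idx] else 1
        coverage = kept / len(order)  # fraction of inputs the model answers
        risk = errors / kept          # error rate among the answered inputs
        points.append((coverage, risk))
    return points


# Confidences 0.91 and 0.48 are the two errors from the table above;
# 0.97 and 0.85 are hypothetical correct predictions added for illustration.
confidences = [0.97, 0.91, 0.85, 0.48]
correct = [True, False, True, False]
curve = risk_coverage(confidences, correct)
```

In this toy input the low-confidence error (0.48) is removed by any abstention threshold above 0.48, while the overconfident error (0.91) stays in the answered set until coverage drops to 25%, which illustrates why high-confidence mistakes dominate risk at low coverage.
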
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Zhang, X.; Han, B.; Yuan, Y.; Zhu, G.; Song, H.; Qiu, W.; Ni, L. EviCal: Evidence-Grounded Consistency Calibration for Content-Level Multimodal Labeling. Informatics 2026, 13, 86. https://doi.org/10.3390/informatics13060086


