Estimating the Utility of Using Structured and Unstructured Data for Extracting Incidents of External Hospitalizations from Patient Documents
Abstract
1. Introduction
2. Materials and Methods
2.1. Information Extraction Process
2.2. Defining External Hospitalizations
2.3. Mapping External Hospitalizations
2.4. Evaluating Performance of Information Extraction
2.5. Automating the Estimation of the Utility of Unstructured Data
2.5.1. Manual Method
2.5.2. Assumptions for Automating the Estimation of P(S|U)
2.5.3. Syntactic Patterns
2.5.4. Using Syntactic Patterns for Automating the Estimation of P(S|U)
3. Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest




| Structured data indicator | Count |
| --- | --- |
| Number of invoices for external hospitalizations (1 January–30 June 2024) | 541,009 |
| Number of community care notes that did not also have an invoice (1 January–30 June 2024) | 162,265 |
| Total external hospitalizations with structured data indicators (1 January–30 June 2024) | 703,274 |

| Measure | Value |
| --- | --- |
| Time frame | 1 January–31 December 2024 |
| Documents that had at least one pattern match | 19,673 |
| Number of unique patients | 13,154 |
| Random sample size (documents) | 1000 |
| Random sample size (patients) | 941 |
| Number in the random sample that also has structured data (documents) | 574 |
| Number in the random sample that also has structured data (patients) | 556 |
| Estimated 1 − P(S\|U) (documents) | 45.3% |
| Estimated 1 − P(S\|U) (patients) | 44.4% |
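
To make the quantity 1 − P(S|U) concrete, the following minimal Python sketch computes the raw share of sampled items that lack structured data, using only the sample counts listed in the table. The names `SampleCounts` and `share_without_structured` are hypothetical and introduced here for illustration. Note that the raw ratios it produces differ from the estimates reported in the table (45.3% and 44.4%); the reported figures presumably reflect a step that the table does not show, which is a guess rather than something stated here, so this sketch should not be read as the authors' estimation procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleCounts:
    """Counts from a random sample of pattern-matched items (hypothetical container)."""
    sample_size: int       # items in the random sample
    with_structured: int   # sampled items that also have structured data


def share_without_structured(counts: SampleCounts) -> float:
    """Raw sample proportion 1 - (with_structured / sample_size).

    This is a plain ratio of the two counts and applies no adjustment.
    """
    if counts.sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= counts.with_structured <= counts.sample_size:
        raise ValueError("with_structured must lie between 0 and sample_size")
    return 1 - counts.with_structured / counts.sample_size


# Counts copied from the table above.
documents = SampleCounts(sample_size=1000, with_structured=574)
patients = SampleCounts(sample_size=941, with_structured=556)

print(f"documents: {share_without_structured(documents):.1%}")  # documents: 42.6%
print(f"patients:  {share_without_structured(patients):.1%}")   # patients:  40.9%
```

The function validates its inputs so that a mistyped count (for example, more matches than sampled items) raises an error instead of silently yielding a negative proportion.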
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Davenport, M.; Hall, R.; Kappala, S.; Michelson, T.; Mitchell, R.; Winski, D.; Hau, C.; Leatherman, S.; Meng, F. Estimating the Utility of Using Structured and Unstructured Data for Extracting Incidents of External Hospitalizations from Patient Documents. Information 2025, 16, 978. https://doi.org/10.3390/info16110978

