Article

Estimating the Utility of Using Structured and Unstructured Data for Extracting Incidents of External Hospitalizations from Patient Documents

1 Cooperative Studies Program Coordinating Center, VA Boston Healthcare System, Boston, MA 02111, USA
2 School of Public Health, Boston University, Boston, MA 02118, USA
3 Chobanian & Avedisian School of Medicine, Boston University, Boston, MA 02118, USA
* Author to whom correspondence should be addressed.
Information 2025, 16(11), 978; https://doi.org/10.3390/info16110978
Submission received: 8 July 2025 / Revised: 20 October 2025 / Accepted: 4 November 2025 / Published: 12 November 2025

Abstract

Patients within the US Department of Veterans Affairs (VA) healthcare system have the option of receiving care at facilities external to the VA network. This work presents a method for identifying external hospitalizations among the VA’s patient population by utilizing data stored in patient records. The process of extracting this information is complicated by the fact that indicators of external hospitalizations come from two sources: well-defined structured data and free-form unstructured text. Though natural language processing (NLP) leveraging Large Language Models (LLMs) has advanced capabilities to automate information extraction from free text, deploying these systems remains complex and costly. Using structured data is low-cost, but its utility must be determined in order to optimally allocate resources. We describe a method for estimating the utility of using structured and unstructured data and show that if specific conditions are met, the level of effort to perform this estimate can be greatly reduced. For external hospitalizations in the VA, our analysis showed that 44.4% of cases identified using unstructured data could not be found using structured data alone.

1. Introduction

The U.S. Department of Veterans Affairs (VA) has instituted policies that allow patients to receive care from external (non-VA) healthcare systems. Veterans have eagerly utilized these opportunities, as evidenced by close to 20 million emergency department visits at external facilities from Fiscal Year 2016 to 2022 [1]. Though positive for patients, this introduces information gaps within VA healthcare records that may cause issues for business analytics, quality assurance/management, and scientific needs. Determining from medical records whether patients obtained external care is complex because this information is mainly recorded in three different sources: (1) billing invoices sent from the external facility to the VA; (2) community care notes generated by the VA when the patient receives external care; and (3) incidental mentions of external care by VA providers when documenting patient encounters. The first two can be directly analyzed by standard querying methods against the underlying database (referred to as structured data). The third source is challenging because accurately extracting information from free text (called unstructured data) is generally laborious and technically difficult. Natural Language Processing (NLP) techniques have been developed over the past several decades to overcome these challenges, but have not seen widespread adoption in clinical settings outside a handful of specialty domains (e.g., mammography reports). The recent advent of state-of-the-art Large Language Models (LLMs) that exhibit highly sophisticated capabilities has made the use of NLP more tractable for various tasks. However, these systems generally cannot be used straight out of the box, are expensive to operate, and are known to generate errors commonly called “hallucinations” [2,3]. Non-trivial efforts are needed to properly integrate and validate these systems before deployment within healthcare to minimize negative impacts to patients and providers.
Thus, characterizing the effectiveness of low-cost methods based on structured data will help determine whether administrators should invest in higher-cost solutions such as NLP. This requires a deep understanding of data generation and utilization workflows within the healthcare system. This paper describes a method for reducing the manual effort for estimating the utility of structured versus unstructured data using simple syntactic patterns.
Research and development of NLP for healthcare has been continually advancing for the past few decades, with varying levels of success and adoption [4,5]. The invention of the transformer-based LLM [6] began the current revolution of NLP systems that exhibit highly sophisticated capabilities, and their potential for impacting healthcare is being actively explored. The subfield of Information Extraction (IE) is specifically focused on using NLP to transform information embedded within free text into well-defined formats that can more easily be used for computational analyses [7]. LLMs have been applied to various healthcare tasks, including text summarization [8], clinical decision support [9], and report generation [10]. Although LLMs are seeing increasing adoption in healthcare, they remain costly to operate and require significant investment in computational and human resources [11]. The cost of deploying an LLM extends beyond the fees charged by AI companies for access to the LLM itself. Even if an LLM is open source (“free”) and brought in-house, operating it within a large institution still incurs costs: the compute and storage resources needed to run the model, validation of its performance (particularly important in healthcare), and continuous maintenance such as fine-tuning and installing the upgrades needed for the system to run over long periods of time. The work presented in this paper leverages traditional pattern-based IE techniques [12] as low-cost, high-precision/low-recall extractors to estimate the utility of higher cost IE methods (e.g., LLMs). We show that under specific constraints, this can be performed at significantly lower cost than using large-scale gold-standard datasets that are completely manually annotated by domain experts.

2. Materials and Methods

Using structured data to extract information is always preferable to using unstructured data, as the results are generally of higher quality while requiring far fewer resources. It follows that utilizing unstructured data should only be necessary if the utility of structured data has been fully exhausted and the required minimal performance metrics have not been satisfied. In our case, subject matter experts indicated that billing invoice data and community care notes (both structured data) are reliable signals for external hospitalizations. Given this, the question to be answered is what percentage of external hospitalizations identified by unstructured data (U) is also covered by structured data (S); in other words, the conditional probability of structured given unstructured, P(S|U). If this value is high, there is less utility in unstructured data because its coverage largely overlaps with that of structured data. If the value is low, processing unstructured data could provide added value. In the diagram shown in Figure 1, the dark gray shaded area represents the indicator of the utility of unstructured data and equals 1-P(S|U). Using set notation, we can calculate P(S|U) = P(S ∩ U)/P(U).
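This calculation can be sketched directly at the patient level. The following is a minimal illustration (with toy patient identifiers, not study data) of estimating 1-P(S|U) from two sets of patient IDs:

```python
# Illustrative sketch (toy data): estimating P(S|U) and the utility of
# unstructured data, 1 - P(S|U), from sets of patient identifiers.
# S = patients with structured indicators (invoices, community care notes);
# U = patients with unstructured indicators (mentions in clinic notes).

def utility_of_unstructured(S: set, U: set) -> float:
    """Return 1 - P(S|U), the fraction of U not covered by S."""
    if not U:
        raise ValueError("U must be non-empty")
    p_s_given_u = len(S & U) / len(U)   # P(S|U) = P(S ∩ U) / P(U)
    return 1.0 - p_s_given_u

# Toy example: 10 patients found via unstructured data, 6 also appear in S.
S = {1, 2, 3, 4, 5, 6, 20, 21}
U = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
print(utility_of_unstructured(S, U))  # 0.4
```

Here the two patients in S but not U (IDs 20 and 21) do not affect the estimate, since the conditional probability is taken with respect to U only.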

2.1. Information Extraction Process

Common IE processes typically involve multiple steps, and the actual implementation will strongly depend on various factors. First, detailed and precise definitions of the data elements to be captured are clearly expressed. These definitions depend not just on clinical knowledge but also on the types of questions that motivate the extraction in the first place. This requires a deep understanding of the data captured in the patient record, including the format, frequency, completeness, and quality, among other characteristics. The questions being studied will strongly impact the definition, even for the same type of data element. For instance, drug relapse may be generally defined as a patient returning to substance abuse after a period of abstinence, but studies may differ in defining the minimum time between two incidents. Next, based on the established definitions, each data element is precisely mapped onto underlying data sources. These mappings could range from simple one-to-one mappings onto specific database tables and columns, to more complex analyses that combine multiple methods. For instance, extracting tumor size may entail using NLP to extract measurements from radiology reports and/or applying image processing directly to radiological images. Finally, minimum quality standards for the extracted data are established based on study needs. For instance, large population-level studies may emphasize recall over precision, while smaller retrospective studies may require high levels of both (typically summarized by the F1-score).

2.2. Defining External Hospitalizations

For this work, external hospitalizations are defined as VA patient hospital stays at any facility that is outside the VA’s healthcare network. This encompasses any facility not directly managed by the VA’s Healthcare Administration. Our definition also includes visits to external emergency departments, even if the patient was not ultimately admitted as an inpatient.

2.3. Mapping External Hospitalizations

As previously mentioned, structured data for identifying external hospitalizations consists of invoice data and community care notes. For community care notes, it is their existence that matters, not an understanding of their content (i.e., no NLP is required). For this work, we assumed that structured data have very high precision due to the inherent business forces within the healthcare system. In other words, we assumed it is very rare, though not entirely impossible, for an invoice or community care note to be generated without an actual instance of external hospitalization. External hospitalizations are also incidentally documented within clinic notes by providers as they see fit to include this information. Since clinic notes throughout the VA do not follow nationally standardized templates, there is high linguistic variation in the phrasing used by providers: the notes are expressed in general-purpose English and are not constrained by any templated standard.

2.4. Evaluating Performance of Information Extraction

Standard evaluation for IE typically utilizes precision/recall metrics. Precision (Positive Predictive Value) indicates how likely extraction results are to be correct, computed as TP/(TP + FP), while recall (Sensitivity) reflects how much of the true answer set was identified, computed as TP/(TP + FN). In practice, determining precision is more straightforward than recall because manual review can be limited to the positive cases identified by the NLP system. Determining recall involves identifying false negatives, which typically requires large random samples from the general patient population, as we must assume false negatives can occur anywhere. In addition, the prevalence of documents that contain information of interest can be low, leading to large amounts of manual effort to retrieve a small subset of cases. As an example, the prevalence of external hospitalization mentions within documents in the general VA patient population was estimated to be 8%. This prevalence was estimated from manual annotation of 12,779 randomly selected documents from a cohort of patients with known structured data indicators of external hospitalizations. In total, 1030 unique documents contained at least one annotation, giving an estimate of 1030/12,779 ≈ 8% for the prevalence of external hospitalization mentions within unique documents. Because the annotated cohort is enriched, we believe the prevalence in the general patient population is much lower, and 8% represents a reasonable upper-bound estimate.
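The definitions above can be made concrete with a short sketch. The counts below are hypothetical and for illustration only:

```python
# Sketch of the standard IE metrics as defined in the text:
# precision = TP / (TP + FP), recall = TP / (TP + FN).

def precision(tp: int, fp: int) -> float:
    """Fraction of extracted positives that are correct."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of true positives that were identified."""
    return tp / (tp + fn)

# Hypothetical confusion counts, not study results.
tp, fp, fn = 910, 90, 200
print(f"precision = {precision(tp, fp):.2f}")  # 0.91
print(f"recall    = {recall(tp, fn):.2f}")     # 0.82
```

Note how the two metrics require different review effort: precision needs only the flagged cases (TP + FP), while recall requires finding false negatives anywhere in the population.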

2.5. Automating the Estimation of the Utility of Unstructured Data

2.5.1. Manual Method

The manual method for estimating P(S|U) requires an expert-annotated document set (the unstructured document set) for mentions of external hospitalizations, where the documents are randomly sampled from the general patient population. For each true positive (unstructured data), the patient record is reviewed for invoice data or community care notes (structured data) corresponding to that incident of external hospitalization. P(S|U) is estimated as the number of patients from the unstructured document set that also had structured data divided by the total number of patients in the unstructured document set. Given a prevalence of 8%, finding 1000 true positives requires reviewing 12,500 documents on average. As mentioned in the previous section, our team performed this exact exercise, and a significant amount of human effort was expended over several weeks. If, out of 1000 patients with unstructured data, 800 also had structured data, the estimate for P(S|U) would be 80%.
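The workload arithmetic behind the manual method can be sketched as follows, using the numbers from the text (8% prevalence, a target of 1000 true positives, and a worked example of 800 with structured data):

```python
# Sketch: expected manual review workload and the resulting P(S|U)
# estimate, using the figures given in the text.

import math

def docs_to_review(target_tp: int, prevalence: float) -> int:
    """Expected number of documents to review to find target_tp true positives."""
    return math.ceil(target_tp / prevalence)

print(docs_to_review(1000, 0.08))  # 12500 documents on average

# Worked example from the text: 800 of 1000 unstructured-data patients
# also had structured data.
p_s_given_u = 800 / 1000
print(p_s_given_u)  # 0.8
```

The inverse relationship with prevalence is what drives the cost: halving the prevalence doubles the expected review effort for the same number of true positives.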

2.5.2. Assumptions for Automating the Estimation of P(S|U)

Our proposed method is predicated on the following assumptions: (1) availability of large amounts of source data (patient records); (2) high-precision structured data (e.g., invoice data and community care notes); (3) high-precision syntactic patterns for extracting the data of interest; and (4) statistical independence (no causal relationships) between the structured data and the linguistic wordings and phrasings that indicate external hospitalizations. For (3), we describe the patterns in more detail in the next section. For (4), the idea is that the structured data does not influence the way documents are linguistically composed by providers. Specifically, we require that the existence of invoice data or community care notes does not impact how a provider would document the same incident of external care within a clinic note. This assumption is reasonable because there is no mandated process within the VA healthcare system requiring providers to review invoice data or community care notes before documenting external care. Additionally, providers documenting external care are, in the vast majority of cases, reporting incidental information told to them by patients. Thus, the linguistic phrasings used for documenting this information are mainly driven by the provider’s own personal style, which results from factors including educational and/or cultural backgrounds.

2.5.3. Syntactic Patterns

Syntactic patterns consist of phrases that are used to match occurrences within documents, similar to standard word searches. Patterns can be made very general (high recall) by using single words or short phrases, or very precise (high precision) by including longer phrases or even entire sentences. Patterns are created either top-down, using subject matter expertise, or bottom-up, by leveraging actual text from documents. In practice, both methods are typically utilized because experts can articulate many examples, but the resulting set is rarely complete. Complementing expert knowledge with examples from actual text results in patterns that exhibit higher coverage.
To support the bottom-up approach, two domain-knowledgeable annotators produced 56 annotations, from which our set of 13 patterns was generated by manual inspection. A previously developed annotation tool [13] was utilized for the annotation process (Figure 2), where the form to be filled out is in the left panel, available documents are listed in the middle panel, and the text of the selected document is shown in the right panel.
If a larger set of more complex patterns were required, existing methods could be applied to automate the pattern generation process [14]. The precision of the patterns was reviewed by a third human reviewer, who was not part of the annotation process that generated the patterns. The reviewer validated the patterns on a random sample of 1000 matches, confirming whether each match described an instance of community hospitalization, and found precision to be 91%. False positives typically occurred in after-visit summaries that included instructions on what the patient should do if admitted to a community hospital. Patterns were matched against patient documents using Microsoft SQL Server’s full-text search querying constructs [15]. Example patterns used in our study are shown in Figure 3.
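The study performed matching with SQL Server full-text search, which handles word-stem matching natively. Purely as an illustration of how high-precision/low-recall phrase patterns behave, the following sketch uses hypothetical regex patterns (not the study’s actual 13 patterns) with crude stem alternation:

```python
# Illustrative sketch only: hypothetical regex analogues of high-precision
# phrase patterns for external hospitalization mentions. The actual study
# used SQL Server full-text search, not regular expressions.

import re

PATTERNS = [
    # "admitted/admission ... outside/community/non-VA hospital"
    r"\badmi(?:t|ts|tted|ssion)\b.{0,40}\b(?:outside|community|non-?VA)\s+hospital\b",
    # "hospitalized ... outside/community/non-VA"
    r"\bhospitalized\b.{0,40}\b(?:outside|community|non-?VA)\b",
]

def matches_external_hospitalization(text: str) -> bool:
    """Return True if any pattern matches the note text."""
    return any(re.search(p, text, re.IGNORECASE) for p in PATTERNS)

print(matches_external_hospitalization(
    "Patient reports recent admission to community hospital for chest pain."))  # True
print(matches_external_hospitalization(
    "Follow up in VA cardiology clinic in 2 weeks."))  # False
```

Tight phrase constraints like these trade recall for precision, which is exactly the property the estimation method relies on.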
Some examples of sentences that matched our patterns are shown below in Figure 4:

2.5.4. Using Syntactic Patterns for Automating the Estimation of P(S|U)

This set of high-precision/low-recall patterns was used to automate the estimation of P(S|U). The patterns are applied to identify a random sample of patients with unstructured data for external hospitalizations: they were matched against general patient documents, and all documents containing a match were retrieved. Because the patterns are high precision, we assumed with minimal validation that all retrieved documents are true positives. Next, for all unique patients who had documents retrieved by the patterns, we determined the percentage that also had structured data, and this percentage was taken as the estimate of P(S|U). As mentioned previously, we assume statistical independence between the presence of invoice data/community care notes and the linguistic constructs used to indicate external hospitalizations in documents. The patterns therefore do not need to exhibit high linguistic variation, because there is no causal link with the existence of invoice data/community care notes, and the sample retrieved by the patterns can be considered a random sample of external hospitalizations with respect to structured data. Because we have a large-scale patient database, we were able to aggregate large sample sizes of documents with mentions of external hospitalizations using a relatively small set of low-recall patterns.
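The estimation step described above can be sketched as follows. The helper below and the toy data are hypothetical; in practice, the two patient sets would come from database queries against pattern matches and structured indicators:

```python
# Sketch of the automated estimation step: sample patients found by the
# high-precision patterns, then compute the share that also has structured
# data. Data below is synthetic, for illustration only.

import random

def estimate_p_s_given_u(pattern_matched_patients: set,
                         structured_patients: set,
                         sample_size: int,
                         seed: int = 0) -> float:
    """Estimate P(S|U) from a random sample of pattern-matched patients."""
    rng = random.Random(seed)
    sample = rng.sample(sorted(pattern_matched_patients),
                        min(sample_size, len(pattern_matched_patients)))
    hits = sum(1 for p in sample if p in structured_patients)
    return hits / len(sample)

# Toy data: 2000 pattern-matched patients; even-numbered IDs have
# structured indicators, so the true P(S|U) is 0.5.
U = set(range(2000))
S = set(range(0, 2000, 2))
print(estimate_p_s_given_u(U, S, 1000))  # close to 0.5
```

Sampling at the patient level (rather than the document level) mirrors the study design, where multiple matching documents can belong to one patient.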

3. Results

External facilities were widely used by the VA patient population. A query of data from 1 January 2024 to 30 June 2024 returned 541,009 invoices that were billed to the VA by external facilities. In addition, 162,265 community care notes were generated during the same time range for patients who did not have an invoice, totaling 703,274 unique patients with structured data that indicated external hospitalizations. We matched our patterns against all patient documents written by 5755 different providers stationed at 125 different locations from 1 January to 31 December 2024 and retrieved 19,673 documents with at least one match, covering 13,154 unique patients. Of these, we randomly sampled 1000 patients and found that 556 had invoice data and/or community care notes. Thus, our estimate for P(S|U) based on this sample is 55.6%, indicating that a non-trivial percentage (44.4%, or 1-P(S|U)) of patients with external hospitalizations are identified only through unstructured data. For studies that require higher levels of recall, the deployment of NLP to automate the extraction of this information may be necessary. The total number of invoices in the first half of 2024 is shown in Table 1, and the results of our estimate for 1-P(S|U) are shown in Table 2.
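As an illustrative check of the sampling uncertainty around the reported estimate (not a calculation from the paper), a normal-approximation 95% confidence interval for 556/1000 can be computed as:

```python
# Illustrative sketch: Wald (normal-approximation) 95% confidence interval
# for the reported estimate P(S|U) = 556/1000. Not part of the study itself.

import math

def wald_ci(successes: int, n: int, z: float = 1.96):
    """Return (lower, upper) bounds of the Wald interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

lo, hi = wald_ci(556, 1000)
print(f"P(S|U) = 0.556, 95% CI ≈ ({lo:.3f}, {hi:.3f})")
```

With n = 1000, the interval half-width is about three percentage points, so the qualitative conclusion (a large fraction of cases found only via unstructured data) is insensitive to sampling noise at this sample size.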

4. Discussion

Estimating the utility of using unstructured versus structured data is an important task within the information extraction framework because processing unstructured data remains costly, complex, and challenging. However, the estimation process could itself be costly because large manually annotated random samples are required to achieve standard confidence intervals and levels for the estimate. If the prevalence of the data of interest is relatively low, the cost of performing the estimate could exceed the resources available for a study. This work shows that under certain conditions, the amount of human labor needed to estimate the utility of unstructured data could be greatly reduced by leveraging high-precision structured indicators that already exist. In our case, we used invoice data and community care notes to indicate external hospitalizations with high precision that resulted from standard workflows within the healthcare system. To avoid the work needed to generate a large manually annotated true positive gold standard, we utilized a small set of high-precision/low-recall syntactic patterns to automatically identify a sample of patients with external hospitalizations. To generate our pattern set, we leveraged a small set of 56 annotations, greatly reducing the required human labor cost. The value added for our estimation method is measured by the annotation cost for generating patterns compared with the cost for generating a large, annotated gold standard. We argue that the effort needed to annotate for generating patterns is significantly less because of two factors. First, since the pattern set is not required to exhibit high linguistic variability, it can be generated by manual inspection of a small set of examples (e.g., 56 annotations). Second, annotating for patterns can be based on an enriched set of documents that have known structured data indicators of external hospitalizations. 
Since all the documents have associated structured data, the likelihood of having unstructured indicators, though not guaranteed, is much greater than a random sample from the general patient population.
A limitation of this work is that the stated assumptions must be met before the method can be practically applied. If there is no structured data indicator, there would be no need for this method, as there is no useful comparison to make with unstructured data. As researchers in the VA, we have the luxury of working with a very large-scale centralized electronic patient repository containing data from over 10 million patients [16]. The size of the repository made identifying a sufficiently large true positive set with a small set of syntactic patterns relatively trivial, without the need for sophisticated machine learning techniques. Finally, a deep understanding of how data is generated within the VA was crucial for determining that our independence assumption between structured data and how documents are linguistically composed by providers was valid.

5. Conclusions

The use of cutting-edge technologies for analyzing unstructured data, such as LLMs for free text, opens up new avenues for capturing data elements of interest, but they remain high-cost to deploy. Even though these systems have made great strides in the past few years, it is still prudent to determine the utility of using structured data before pouring resources into unstructured methods. However, estimating the utility of structured data could also be non-trivial and itself require a significant amount of manual effort. We described a method for reducing the effort for estimating the utility of structured versus unstructured data, given that some assumptions are satisfied. Our method includes utilizing a small set of high-precision/low-recall syntactic patterns to automatically identify true positives against a large-scale patient data repository in order to aggregate a sufficiently large random sample. The pattern set can typically be generated using manual inspection of a small set of examples annotated from patient documents. Efficiently determining estimates of the utility of structured data that are readily available (low cost) before turning to high-cost solutions enables the healthcare system to judiciously allocate computational and analytical resources where they are most needed.

Author Contributions

Conceptualization, S.L. and F.M.; methodology, F.M.; software, M.D.; validation, M.D., D.W., R.M., and F.M.; formal analysis, M.D. and F.M.; investigation, M.D. and F.M.; data curation, M.D., R.H., S.K., T.M., R.M., and D.W.; writing—original draft preparation, F.M.; writing—review and editing, M.D., R.M., D.W., and C.H.; supervision, F.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of VA Boston Healthcare System (Protocol 1671839-10 approved on 25 May 2022).

Informed Consent Statement

Patient consent was waived due to the following reasons. This is a database-only study that processes data for the purpose of developing analytical methods for automating the identification and extraction of patient data from the medical record. Additionally, the cohort was selected from the large patient population, making it impractical to obtain HIPAA authorizations from each subject.

Data Availability Statement

The data used for this study contains Protected Health Information (PHI) of patients, and sharing outside the healthcare system is strongly restricted per VA data security and regulatory policies.

Acknowledgments

We acknowledge support from the VA Boston Cooperative Studies Program Coordinating Center and the VA Office of Research and Development.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Vashi, A.A.; Urech, T.; Wu, S.; Tran, L.D. Community Emergency Care Use by Veterans in an Era of Expanding Choice. JAMA Netw Open 2024, 7, e241626. [Google Scholar] [CrossRef] [PubMed]
  2. Xu, Z.; Jain, S.; Kankanhalli, M. Hallucination is Inevitable: An Innate Limitation of Large Language Models. arXiv 2024, arXiv:2401.11817. [Google Scholar] [CrossRef]
  3. Tonmoy, S.M.; Zaman, S.M.; Jain, V.; Rani, A.; Rawte, V.; Chadha, A.; Das, A. A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models. arXiv 2024, arXiv:2401.01313. [Google Scholar] [CrossRef]
  4. Meng, X.; Yan, X.; Zhang, K.; Liu, D.; Cui, X.; Yang, Y.; Zhang, M.; Cao, C.; Wang, J.; Wang, X.; et al. The application of large language models in medicine: A scoping review. iScience 2024, 27, 109713. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  5. Locke, S.; Bashall, A.; Al-Adely, S.; Moore, J.; Wilson, A.; Kitchen, G. Natural Language Processing in Medicine: A Review. Trends Anaesth. Crit. Care 2021, 38, 4–9. [Google Scholar] [CrossRef]
  6. Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  7. Abdullah, M.H.A.; Aziz, N.; Abdulkadir, S.J.; Alhussian, H.S.A.; Talpur, N. Systematic Literature Review of Information Extraction From Textual Data: Recent Methods, Applications, Trends, and Challenges. IEEE Access 2023, 11, 10535–10562. [Google Scholar] [CrossRef]
  8. Bednarczyk, L.; Reichenpfader, D.; Gaudet-Blavignac, C.; Ette, A.; Zaghir, J.; Zheng, Y.; Bensahla, A.; Bjelogrlic, M.; Lovis, C. Evidence for Clinical Text Summarization Using Large Language Models: Scoping Review. J. Med. Internet Res. 2025, 27, e68998. [Google Scholar] [CrossRef] [PubMed]
  9. Gaber, F.; Shaik, M.; Allega, F.; Bilecz, A.J.; Busch, F.; Goon, K.; Franke, V.; Akalin, A. Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. NPJ Digit. Med. 2025, 8, 263. [Google Scholar] [CrossRef] [PubMed]
  10. Busch, F.; Hoffmann, L.; Dos Santos, D.P.; Makowski, M.R.; Saba, L.; Prucker, P.; Hadamitzky, M.; Navab, N.; Kather, J.N.; Truhn, D.; et al. Large language models for structured reporting in radiology: Past, present, and future. Eur. Radiol. 2025, 35, 2589–2602. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  11. Kandpal, N.; Raffel, C. Position: The Most Expensive Part of an LLM should be its Training Data. arXiv 2025, arXiv:2504.12427. [Google Scholar] [CrossRef]
  12. Chiticariu, L.; Li, Y.; Reiss, F.R. Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems! In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; Association for Computational Linguistics: Seattle, WA, USA, 2013; pp. 827–832. [Google Scholar]
  13. Meng, F.; Morioka, C. Automating the generation of lexical patterns for processing free text in clinical documents. J. Am. Med. Inform. Assoc. 2015, 22, 980–986. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  14. Meng, F.; Morioka, C.A.; Elbers, D.C. Generating Information Extraction Patterns from Overlapping and Variable Length Annotations using Sequence Alignment. arXiv 2019, arXiv:1908.03594. [Google Scholar] [CrossRef]
  15. SQL Server Technical Documentation. Available online: https://learn.microsoft.com/en-us/sql/sql-server/?view=sql-server-ver17 (accessed on 1 March 2025).
  16. Price, L.E.; Shea, K.; Gephart, S. The Veterans Affairs’s Corporate Data Warehouse: Uses and Implications for Nursing Research and Practice. Nurs. Adm. Q. 2015, 39, 311–318. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
Figure 1. Utility of unstructured data as 1-P(S|U). The set of patients with structured data indicating external hospitalizations is represented by S; those with unstructured indicators are represented by U. It is possible for patients to have both structured and unstructured indicators, and this is shown by the light gray overlapping area. We are interested in estimating 1-P(S|U), as those patients have unstructured data but no structured data indicators.
Figure 2. Screenshot of the annotation tool used to label patient documents for external hospitalization indicators. A form representing the information to be extracted is shown in the left panel, the document text is shown in the right panel, and a list of documents for the current patient is shown in the middle column. Annotators highlight text segments from the right panel and associate labels with the text from the form elements in the left panel. In this example, the sentence starting with “Mr. Smith.” would be highlighted and associated with the “Outside Care” element.
Figure 3. Example syntactic patterns. These patterns were used with SQL Server Full-Text Search to return documents with approximate matches (e.g., substitution of prepositions).
Information 16 00978 g003
Figure 4. Examples of text matched by patterns. The matched text is shown in bold and italics. The matches are approximate and based on SQL Server’s Full-Text Search matching algorithm. For instance, in the first match (“recent admission to community hospital…”), the words “recent” and “admission” are approximately matched to the pattern “recently admitted…” based on the corresponding words having the same root.
Information 16 00978 g004
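A pattern of the kind shown in Figures 3 and 4 can be turned into a full-text predicate. The sketch below builds a SQL Server CONTAINS predicate using FORMSOF(INFLECTIONAL, …), which is SQL Server's documented mechanism for matching word variants sharing a root; the table and column names (Documents, ReportText, DocumentID) are hypothetical, the stop-word list is illustrative, and the authors' exact full-text configuration is not specified in the text.

```python
# Sketch: translate a syntactic pattern into a SQL Server Full-Text
# CONTAINS predicate. Dropping prepositions lets them vary between the
# pattern and the matched text, as in the Figure 3 examples.
STOPWORDS = {"to", "at", "in", "a", "an", "the"}

def formsof_predicate(phrase: str) -> str:
    """Match inflectional variants of each content word in the pattern."""
    terms = [w for w in phrase.lower().split() if w not in STOPWORDS]
    return " AND ".join(f'FORMSOF(INFLECTIONAL, "{w}")' for w in terms)

predicate = formsof_predicate("recently admitted to community hospital")
# Hypothetical query; table/column names are placeholders.
query = (
    "SELECT DocumentID FROM Documents "
    f"WHERE CONTAINS(ReportText, '{predicate}')"
)
print(query)
```

Note that FORMSOF(INFLECTIONAL, …) covers inflections (plurals, verb tenses); derivational pairs such as “admission”/“admitted” may require the thesaurus or additional pattern variants.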
Table 1. Number of Invoices and Community Care Notes in the First Half of 2024.
Number of invoices for external hospitalizations (1 January–30 June 2024): 541,009
Number of community care notes that did not also have an invoice (1 January–30 June 2024): 162,265
Total external hospitalizations with structured data indicators (1 January–30 June 2024): 703,274
Table 2. Estimating 1-P(S|U) with data from calendar year 2024. The number of patients is lower than the number of documents because a patient can have multiple documents indicating external hospitalizations within the given time frame; many documents contain text carried over from previous documents.
Time frame: 1 January–31 December 2024
Documents that had at least one pattern match: 19,673
Number of unique patients: 13,154
Random sample size (documents): 1000
Random sample size (patients): 941
Number in the random sample that also had structured data (documents): 574
Number in the random sample that also had structured data (patients): 556
Estimated 1-P(S|U) (documents): 45.3%
Estimated 1-P(S|U) (patients): 44.4%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Davenport, M.; Hall, R.; Kappala, S.; Michelson, T.; Mitchell, R.; Winski, D.; Hau, C.; Leatherman, S.; Meng, F. Estimating the Utility of Using Structured and Unstructured Data for Extracting Incidents of External Hospitalizations from Patient Documents. Information 2025, 16, 978. https://doi.org/10.3390/info16110978

