Development and Evaluation of a Natural Language Processing System for Curating a Trans-Thoracic Echocardiogram (TTE) Database

Background: Although electronic health records (EHR) provide useful insights into disease patterns and patient treatment optimisation, their reliance on unstructured data presents a difficulty. Echocardiography reports, which provide extensive pathology information for cardiovascular patients, are particularly challenging to extract and analyse, because of their narrative structure. Although natural language processing (NLP) has been utilised successfully in a variety of medical fields, it is not commonly used in echocardiography analysis. Objectives: To develop an NLP-based approach for extracting and categorising data from echocardiography reports by accurately converting continuous (e.g., LVOT VTI, AV VTI and TR Vmax) and discrete (e.g., regurgitation severity) outcomes in a semi-structured narrative format into a structured and categorised format, allowing for future research or clinical use. Methods: 135,062 Trans-Thoracic Echocardiogram (TTE) reports were derived from 146967 baseline echocardiogram reports and split into three cohorts: Training and Validation (n = 1075), Test Dataset (n = 98) and Application Dataset (n = 133,889). The NLP system was developed and was iteratively refined using medical expert knowledge. The system was used to curate a moderate-fidelity database from extractions of 133,889 reports. A hold-out validation set of 98 reports was blindly annotated and extracted by two clinicians for comparison with the NLP extraction. Agreement, discrimination, accuracy and calibration of outcome measure extractions were evaluated. Results: Continuous outcomes including LVOT VTI, AV VTI and TR Vmax exhibited perfect inter-rater reliability using intra-class correlation scores (ICC = 1.00, p < 0.05) alongside high R2 values, demonstrating an ideal alignment between the NLP system and clinicians. A good level (ICC = 0.75–0.9, p < 0.05) of inter-rater reliability was observed for outcomes such as LVOT Diam, Lateral MAPSE, Peak E Velocity, Lateral E’ Velocity, PV Vmax, Sinuses of Valsalva and Ascending Aorta diameters. Furthermore, the accuracy rate for discrete outcome measures was 91.38% in the confusion matrix analysis, indicating effective performance. Conclusions: The NLP-based technique yielded good results when it came to extracting and categorising data from echocardiography reports. The system demonstrated a high degree of agreement and concordance with clinician extractions. This study contributes to the effective use of semi-structured data by providing a useful tool for converting semi-structured text to a structured echo report that can be used for data management. Additional validation and implementation in healthcare settings can improve data availability and support research and clinical decision-making.


Introduction
Electronic health records (EHR) have become increasingly important as they generate more and more data leading to "Big Data", holding key insights into disease patterns and opportunities to optimise patient treatment.However, the reliance on semi-structured data is a major barrier in using EHR.These types of data do not have a fixed structure but have some type of organisation, generally through the use of tags, labels or metadata.While semi-structured data enables flexibility in data representation, they are limited in terms of the lack of standardisation, irregular organisation, ambiguity and interpretation issues.Extracting useful information and integrating semi-structured data can be difficult, necessitating the use of advanced techniques such as natural language processing and data mapping.When interacting with patients, healthcare workers frequently use freeform notes to record vital details [1].
The echo report based on echo imaging is the fundamental record of evidence for the diagnosis of a cardiovascular patient.The structure (e.g., valves, cardiac chambers and blood vessels dimensions) and functionalities (e.g., ejection fraction, global longitudinal strain) of the heart are examined and described in the echo report by the echocardiographer with a mix of semi-structured numerical data and free text descriptions.However, the extraction and generation of research data from the original document are challenging, mainly due to the narrative nature of the echo report, different reporting styles of echocardiographers, differences in echo devices (reporting platforms and vendor software) and hospital specific protocol differences.As such, the extraction of structured data from semi-structured echo reports, especially for statistical analyses, tends to be excessively time consuming and requires tremendous effort and cost, owing to its presentation as a narrative document.
While NLP models and tools were used to process data from psychiatry, X-ray radiography [2] and pathology [3] to distinguish healthy from diseased patients [4,5], they are not frequently used in echocardiography analysis.
Whilst the retrieval of categorised data from clinical data repositories is relatively simple, not all data are available in this format.Large organisations, such as the NHS, typically store large amounts of information in the form of free text, within which specific categories of data (e.g., aortic or mitral valve haemodynamics) cannot be easily retrieved for analysis.To enable the data contained within free text to be used for statistical analyses, it is necessary to both extract and convert that information into a structured format.However, there are significant challenges to the extraction process, which can be summarised as: (a) variability of language, which is often ambiguous and complex; (b) lack of standardisation; and (c) incomplete information and privacy concerns.This paper presents an approach that employs a natural language processing (NLP) system to handle variations in echocardiographic data to construct a comprehensive data repository by accurately categorising, extracting and organising semi-structured echocardiographic data.
The primary aim of this study is to improve data management and analysis for more effective use of echocardiographic data in research and clinical applications.To achieve this aim, we have defined two key objectives.Our primary objective is to create a structured moderate-fidelity echocardiogram database using automated data extraction and curation technologies.We envision this database being easily integrated with high-fidelity datasets such as EHRs [6], supporting the aim of effective echocardiographic data usage in research by promoting multimodal learning methodologies [7].In parallel, our second objective is to meticulously identify factors that consistently display superior extraction performance.This aspect of research is critical in terms of future clinical applications, such as patient monitoring and the development of predictive risk scores, both of which have the potential to alter patient care and medical research.
We validated the proposed method by comparison of the NLP extraction with clinician's annotations on a sample of the most common type of echocardiography examination, Trans-Thoracic Echocardiogram (TTE).Finally, the approach is scaled to extract all available echocardiogram free text reports within the local hospital.
The main contributions of the study are summarised as follows: • Improved Data Extraction Using NLP.This study introduces and illustrates the usage of a GATE-based NLP system to extract comprehensive information from echocardiography reports.This approach substantially improves the efficiency while maintaining accuracy to extract meaningful insights from these reports by automating the data extraction process.

•
Data-Driven Clinical Decision-Making.By utilising Big Data resources, this work contributes to the realisation of learning health systems.It provides the foundation for using data-driven [8] insights to help healthcare practitioners make precise diagnoses, identify at-risk patients, develop personalised treatment regimens and enhance medical research in complex patient populations.It enables healthcare workers to immediately identify critical parameters and red flags in preoperative and postoperative settings, such as changes in aortic diameters or valve haemodynamics.This feature can also help to expedite surgical prioritisation, patient monitoring and waiting list administration.

•
Opportunities for Research and Audits.The NLP system's capabilities go beyond treating specific patients.It paves the way to advanced audit and research projects in cardiac care.With the use of this technology, researchers can monitor patterns and examine large datasets, advancing our knowledge of heart conditions and therapeutic outcomes.

•
Interpretable Integration of Data.The study addresses the challenge of integrating both continuous and discrete data, such as qualitative ratings and quantitative measurements, in echocardiogram reports.The structured database generated by the NLP system offers a clear and interpretable representation of this combined information, facilitating its practical use.

•
Potential for Broader Healthcare Applications.While the study focuses on echocardiography reports, the concepts and methods explored here can potentially extend to other investigative modalities, including CT scans.The NLP system's ability to integrate with risk modelling approaches further expands its potential impact on healthcare research and practice.
In the next section we report on some related work.

Related Work
In [9], NLP algorithms were created and verified to identify aortic stenosis (AS) cases from echocardiography reports and compare their precision to diagnosis codes.The promise of NLP for enhancing case identification in population health was demonstrated by the NLP algorithms' greater accuracy in detecting AS cases than diagnostic codes.An approach for obtaining numerical test results and associated descriptions from free-text echocardiogram reports was proposed in another study [10].The system efficiently handled typos, synonyms and abbreviations by using corpus-independent algorithms and fuzzy matching to detect and pair expressions with measurement results.It was useful for processing large numbers of echocardiographic findings and showed potential for assisting medical research or clinical trial verification.In a different study, a method for turning heterogeneous echocardiographic medical notes into structured data based on NLP was provided [11].The researchers developed a unique NLP-based extraction and processing programme, EchoInfer, to automate the extraction and organisation of the 80 frequently evaluated echocardiogram data items.By converting unstructured, semi-structured and structured data from echocardiograms into a format compatible with traditional analytical techniques, EchoInfer's effectiveness and consistency were established.Another study evaluated the generalisability of Left Ventricular Ejection Fraction (LVEF) extraction modules, using a data set known as the TUCP EF year 3 corpus (including echocardiography and radiology data reports) [12].The features for detecting LVEF information were examined, and NLP techniques based on a machine learning (ML) sequential tagger were deployed.The work made contributions by assessing how well previous LVEF extraction modules performed on the new corpus, and by examining how the amount of training data influenced the accuracy of the new NLP modules.Another study showed that NLP may be used to categorise stress echocardiography (SE) data and extracted variables that were frequently utilised in stress testing score models [13].By synthesising data elements from the reports, the NLP system produced a valid summary and demonstrated high accuracy in criteria validity when compared to the reference standard.Construct validity was also evaluated [13], and the results demonstrated that, in line with prior findings, NLP-derived SE results effectively distinguished patients at short-term cardiac risk.
Transformer based models, such as the Bidirectional Encoder Representations from Transformers (BERT), have also been applied for extract knowledge from unstructured notes in the healthcare domain.Studies typically begin with standardisation of text by using the NLTK toolkit for case conversion, removing new line characters, punctuation, single character-words and performing spell checks [14].A recent study found that a Dutch version of this model, referred to BERTje, outperformed traditional approaches (TF-IDF and doc2vec with ML), other health domain specific Dutch transformers RobBERT and MedRobERTa.n in predicting the reasons for undertaking MRI scan from three classes: (1) Diagnosis; (2) Progression; or (3) Monitoring [15].eXplainable Artificial Intelligence (XAI) techniques such as Shapley Additive Explanation (SHAP) and Local Interpretable Model-Agnostic Explanation (LIME) were used to generate variable importance for words that result in class predictions and were ranked by the radiologist to understand the average impact of (perhaps) valid words on predictions.Another study compared French language BERT models against a Bidirectional and Auto-Regressive Transformer (BART) based model for prediction emergency department triage status (hospitalised or discharged).Interestingly, PCA dimensionality reduction of transformer embedding was applied followed by k-means clustering of different cluster sizes ranging from 2-10 and cluster membership robustness assessed through the Silhouette score and the Fowlkes-Mallows (FM) score, suggesting models with larger number of parameters having higher performance [14].
Previous research has mostly concentrated on extracting predefined outcomes from echocardiographic records.There have only been two published studies that integrated corpus-specific knowledge for complete extraction of measurement outcomes [10,11].Few studies have applied automated conversion of units to a standardised dictionary-specified format.In addition, few studies have clinically applied the data extraction approach within the National Health Service (NHS) settings utilising efficient EHR database extraction and data loading processes for maximising the availability of structured echo report results.

Materials and Methods
The register-based cohort study using de-identified patient data is part of a research approved by the Bristol Heart Institute, and the need for patients' consent was waived.Reporting of results follows the TRIPOD statement.

Dataset
The study was performed using the University Hospitals Bristol and Weston NHS Foundation Trust (UHBW) echocardiogram dataset, which comprises UHBW echocardiogram reports prospectively stored by UHBW following patient examinations.Cardiac resynchronisation optimisation echo, 3D TTE, TTE with contrast agents; Stress Echocardiogram (SE) with and without dobutamine and contrast agents; and Trans-Oesophageal Echocardiogram (TOE) cases were excluded, resulting in a total of 135,062 TTE reports for analysis.This study was performed using data from all TTE echocardiogram examinations across the UHBW from January 2009 to November 2020.A total of 200 routinely reported echocardiography outcomes measurements were extracted by the system at baseline.All names and demographics were removed from the echocardiography reports and corresponding metadata.
Although obtaining true labels for examination results through manual annotation can be a laborious process, two cardiologists undertook the task of blindly labelling of 98 reports from this TTE cohort to extract examination results for 43 of the most clinically relevant (35 continuous and 8 discrete) outcomes as the gold standard for the test dataset (Table 1).Differences in extraction were resolved by the lead cardiologists.Variables extracted by the cardiologists was on the same unit scale as that within the original echocardiography report.The dataset was split into three cohorts: Training and Validation (n = 1075), Test Dataset (n = 98) and Application Dataset (n = 133,889).The primary evaluation criteria were discrimination, calibration, and overall accuracy of the NLP system in the extraction of TTE echocardiogram variables into the structured format.

Data Exploration
A list of common occurring words, but which are not specifically echocardiogram variable related, in the echocardiographic examination reports was curated (see Supplementary Materials Figure S1).This list was entered into the high dimensionality Cluto clustering toolkit as exclusion criteria algorithm, along with the training/validation dataset to identify important variables within the current context [16].Hierarchical clustering was utilised to cluster variables that showed a high degree of similarity across documents.
The clustering was analysed using (i) automatic clustering (Figure S2), allowing the algorithm to determine the appropriate detailed level of the clustering structure; and (ii) using a clustering size of 3 to show higher level similarities across variables (Figure S3).Due to the high dimensionality of the heatmap, Ghostscript interpreter was used to visualise the heatmap.This exploratory analysis, along with echocardiography expertise was used to guide the categorisation of system extraction components including the JAPE rules (as discussed later).
An analysis of variations across echocardiogram variables was then conducted for all 200 outcome variables (Table 2).This process was supported by the use of Regular Expressions within macro based excel spreadsheets to rapidly establish presence and location of each variable.The underlined textual elements show features that are taken into account during the text annotation and extraction process.

Data Annotation Approach
In order to capture crucial information such as heart dimensions, ejection fraction, valve properties and other relevant measurements, we first defined specific data categories in preparation for the annotation process.These categories were designed to ensure the comprehensive extraction and standardisation of both discrete and continuous variables.This annotation process adhered to the UK echosonographers guidelines, as well as categories established in collaboration with medical experts [17].
NLP-based annotations were performed using the JAPE engine, which automatically annotated data from echocardiography reports.The annotation pattern description is situated on the left-hand side (LHS) of the JAPE, whereas the annotation manipulation statements are found on the right-hand side (RHS) [18].The annotations identified on the LHS were then processed on the RHS using labels from the LHS to refer to matched text segments.RHS processing tasks included unit conversions, the conversion of qualitative data to binary or ordinal formats, the addition of annotations (units and measurements values (varValue)) for the matched patterns and grouping annotations to higher parent-ontological levels (e.g., Sinotubular Junction is grouped under parent level VesselsMeasurements).The annotations and corresponding measurement varValues are then parsed by the JAVA functions for conversion into structured CSV (comma-separated values) format before insertion into the designated SQL database tables, according to the parent-ontological levels.
Two cardiologists separately performed manual annotations in excel (CSV) spreadsheets to enable the validation of the correctness and reliability of the NLP generated annotations.For both NLP and clinical based CSV annotations, we collapsed the outcome columns in from wide to long formats to enable metric-based performance evaluations.

Model Development
The model was developed with the Java and GATE NLP framework using the Eclipse development environment.The JAPE Transducer was included to enable mapping of the input text to output text, in order to enable other components to be integrated as part of the processing pipeline.We used the GATE tokeniser, which separates the input text into discrete tokens that represent individual meaningful language components, such as words or punctuation.The sentence splitter was then used to separate the text into individual sentences based on sentence boundaries identified by the output of the tokeniser (Figure 1).words or punctuation.The sentence splitter was then used to separate the text into individual sentences based on sentence boundaries identified by the output of the tokeniser (Figure 1).We then used the Java Annotation Patterns Engine (JAPE, as explained further later on) to establish unique annotation rules over the text.We were able to define and match particular token patterns (combinations) using JAPE rules, applying new annotations based on the patterns discovered.
In addition, we used gazetteers [19], referred to herein as dictionaries, which are standardised groupings of terms or phrases associated with a specific topic.Gazetteers contributed to improving the tokenisation process by ensuring proper treatment of echocardiography terms, levels of severity and exclusion topics that could not have been adequately captured in the general language corpora or tokeniser.The JAPE rules and dictionaries were developed for 200 echocardiographic examination outcomes.
Furthermore, our technique is cross data repository compatible, enabling direct extraction to any specified data repository within the same distributed NHS network.JAPE is a rule-based language that is used in GATE to extract information from text.JAPE rules provide patterns and actions for detecting linguistic patterns in the text and create or modify annotations [19].These rules are carried out in a sequential manner, specified by their priority to resolve overlapping annotations.GATE JAPE rules were further We then used the Java Annotation Patterns Engine (JAPE, as explained further later on) to establish unique annotation rules over the text.We were able to define and match particular token patterns (combinations) using JAPE rules, applying new annotations based on the patterns discovered.
In addition, we used gazetteers [19], referred to herein as dictionaries, which are standardised groupings of terms or phrases associated with a specific topic.Gazetteers contributed to improving the tokenisation process by ensuring proper treatment of echocardiography terms, levels of severity and exclusion topics that could not have been adequately captured in the general language corpora or tokeniser.The JAPE rules and dictionaries were developed for 200 echocardiographic examination outcomes.
Furthermore, our technique is cross data repository compatible, enabling direct extraction to any specified data repository within the same distributed NHS network.
JAPE is a rule-based language that is used in GATE to extract information from text.JAPE rules provide patterns and actions for detecting linguistic patterns in the text and create or modify annotations [19].These rules are carried out in a sequential manner, specified by their priority to resolve overlapping annotations.GATE JAPE rules were further used to extract the corresponding numerical (integer or float) or qualitative categorisation values of each variable.Some examples are provided in Table S1.
Apart from the ability to pattern match data variations, the JAPE rules were designed to extract numerical clinical measurement values (Table S1).During this process, our technique automatically converts any measurement values into pre-defined scales and units, standardising to a common unit scale where appropriate.For example, an outcome reported in centimetres (cm) in one examination but in meters (m) in another would have been standardised to cm by multiplication of 100 in the second report, if cm was the most commonly reported unit.
A graphical user interface (GUI) option can be turned on in the system to enable real-time tracking of the extraction process and to provide explainability of the extraction process (Figure 2).Due to increased memory consumption and processing time, this function was turned on during the training and validation phase but was switched off during the test and application phases to enable annotations at high throughput.Red highlighted sections of the report show the relationship annotations generated through the JAPE rules for associating outcomes with their measurements.
The JAVA interface then enables the extraction of these annotations for matching and inserting into the corresponding table columns in the Microsoft SQL server database.The NLP model was iteratively refined using the training and validation dataset using echocardiography expert knowledge to drive human in the loop reinforcement-based improvements.Microsoft SQL server database was used for storing and loading raw and extracted reports through an interface with the main Java application.
A graphical user interface (GUI) option can be turned on in the system to enable realtime tracking of the extraction process and to provide explainability of the extraction process (Figure 2).Due to increased memory consumption and processing time, this function was turned on during the training and validation phase but was switched off during the test and application phases to enable annotations at high throughput.Red highlighted sections of the report show the relationship annotations generated through the JAPE rules for associating outcomes with their measurements.The JAVA interface then enables the extraction of these annotations for matching and inserting into the corresponding table columns in the Microsoft SQL server database.The NLP model was iteratively refined using the training and validation dataset using echocardiography expert knowledge to drive human in the loop reinforcement-based improvements.Microsoft SQL server database was used for storing and loading raw and extracted reports through an interface with the main Java application.

Statistical Analysis and Validation
The results for the aggregated mean and standard deviation of each continuous variable for both the doctor and NLP extractions were reported in the exploratory and calibration analysis.Additionally, coefficients of determination (R2) and intraclass correlation coefficients (ICC) [13,[20][21][22] with p-value for continuous variables were calculated.ICC is a measure of interrater reliability and was categorised as poor: <0.5; moderate: 0.5-0.75;good: 0.75-0.9and excellent: >0.9 [23].
The magnitude calibration of the variable values in the dataset was visualised using bubble plots.The x and y positions in these plots were determined by the total (combined) magnitude of the values across all test dataset reports for each outcome, and the size of the circles indicated how frequently the variable values occurred.
For the performance evaluation of discrete variables, we adopted the confusion matrix, precision-recall and the F1 score.The confusion matrix was determined to evaluate the system's overall accuracy, while precision and recall values were applied to evaluate how accurately the system identified positive cases (presence of measurements), while avoiding false positives (measurement extracted where not actually present) and false negatives (measurement missed in the extraction process), respectively.Rare outcomes

Statistical Analysis and Validation
The results for the aggregated mean and standard deviation of each continuous variable for both the doctor and NLP extractions were reported in the exploratory and calibration analysis.Additionally, coefficients of determination (R2) and intraclass correlation coefficients (ICC) [13,[20][21][22] with p-value for continuous variables were calculated.ICC is a measure of interrater reliability and was categorised as poor: <0.5; moderate: 0.5-0.75;good: 0.75-0.9and excellent: >0.9 [23].
The magnitude calibration of the variable values in the dataset was visualised using bubble plots.The x and y positions in these plots were determined by the total (combined) magnitude of the values across all test dataset reports for each outcome, and the size of the circles indicated how frequently the variable values occurred.
For the performance evaluation of discrete variables, we adopted the confusion matrix, precision-recall and the F1 score.The confusion matrix was determined to evaluate the system's overall accuracy, while precision and recall values were applied to evaluate how accurately the system identified positive cases (presence of measurements), while avoiding false positives (measurement extracted where not actually present) and false negatives (measurement missed in the extraction process), respectively.Rare outcomes were aggregated to prevent undefined (NaN) values that would otherwise prevent the calculation of precision and recall values.F1 score was utilised for evaluation of the overall performance of positive case extraction.

Practical System Application
Once development was complete, the system was installed onto the central hospital server and utilised to process the automated extraction for 133,962 separate TTE reports (Application Dataset).

Demographics
A total 146,967 echocardiographic examinations were conducted on 78,536 patients within UHBW during the study period at baseline.Cardiac resynchronisation optimisation echo, 3D TTE, TTE with contrast agents; SE with and without dobutamine and contrast agents; and Trans-Oesophageal Echocardiogram (TOE) cases were excluded, resulting in 135,062 reports.The pre-processing of data has been described previously in the methods section.A patient flow consort diagram is shown in Supplementary Materials Figure S4.Baseline differences in echocardiogram variables between clinician and algorithm extractions are shown in Table 3.Using coefficients of determination (R 2 ) and intraclass correlation coefficients (ICC) for continuous outcomes, the effectiveness of the system for outcome measures extraction was assessed (Table 4).Variables including LVOT VTI, LVOT Vmax, AV VTI and TR Vmax showed R 2 values of 1, demonstrating a perfect fit between the predictions of the system and the clinicians' extractions.Of these outcomes, LVOT VTI, AV VTI and TR Vmax exhibited perfect ICC scores (1.00, p < 0.05) alongside high R 2 values, demonstrating ideal agreement between the system and clinicians.
The ICC results demonstrated a good level of inter-rater reliability for numerous outcomes.ICC values of 0.75-0.9were observed for variables such as LVOT Diam, Lateral MAPSE, Peak E Velocity, Lateral E' Velocity, PV Vmax, Sinuses of Valsalva and Ascending Aorta diameters.Furthermore, with excellent ICC scores (0.97 b , 0.99 b , 0.92 b and 0.90 b , respectively) and high R 2 values, the outcomes LA Area, LA Volume, RA Area and Peak A Velocity demonstrated excellent inter-rater reliability across the system and clinicians' extractions.
For outcomes with very low R and ICC values e.g., Aortic Valve max Pressure Gradient (AV max PG), we analysed the reports in detail and discovered that the reason for low observed performance was due to the following factors: (i) the system extracted the correct values, but the clinicians missed such values (Figure 3, green annotations), possibly due to incorrect recording by the echocardiographer.In this case, the "38 9" should likely be recorded as "38.9", instead when considering the correct scale required; (ii) the echocardiographer used the wrong terminology (e.g., incorrectly using peak transvalvular velocity to describe AV max PG, Figure 3, purple annotations); (iii) due to low number of cases, there were only 7 AV max PG positive cases recorded by the clinician out of 98 documents in the test dataset.Hence, non-system induced errors (e.g., from (i) and (ii)) can inflate the perceived underperformance of the system.
In the bubble plot, larger bubbles represent frequently occurring outcomes across all reports in the test dataset, while smaller bubbles represent less frequent outcomes (Figure 4).The bubble plot demonstrates that the LA area, located close to the plot's centre, is represented by the largest circle.This shows that the LA area is the variable that is most frequently extracted.Other variables with substantial circle sizes include LA volume, Lateral E' velocity and RA area, indicating that they are also frequently extracted by the NLP system.correct values, but the clinicians missed such values (Figure 3, green annotations), possibly due to incorrect recording by the echocardiographer.In this case, the "38 9" should likely be recorded as "38.9", instead when considering the correct scale required; (ii) the echocardiographer used the wrong terminology (e.g., incorrectly using peak transvalvular velocity to describe AV max PG, Figure 3, purple annotations); (iii) due to low number of cases, there were only 7 AV max PG positive cases recorded by the clinician out of 98 documents in the test dataset.Hence, non-system induced errors (e.g., from (i) and (ii)) can inflate the perceived underperformance of the system.
In the bubble plot, larger bubbles represent frequently occurring outcomes across all reports in the test dataset, while smaller bubbles represent less frequent outcomes (Figure 4).The bubble plot demonstrates that the LA area, located close to the plot's centre, is represented by the largest circle.This shows that the LA area is the variable that is most frequently extracted.Other variables with substantial circle sizes include LA volume, Lateral E' velocity and RA area, indicating that they are also frequently extracted by the NLP system.
The centres of circles of the bubbles in the bubble plot are closely aligned to the diagonal line from the origin, indicating that the system and clinician extractions are well calibrated overall.However, certain variables, including EF, AR PHT and TR max PG, were below the diagonal line, suggesting under-extraction by the system compared to the clinician.Using precision, recall and F1 score metrics, we assessed the effectiveness of the system for extracting discrete outcome measures (Table 5).The centres of circles of the bubbles in the bubble plot are closely aligned to the diagonal line from the origin, indicating that the system and clinician extractions are well calibrated overall.However, certain variables, including EF, AR PHT and TR max PG, were below the diagonal line, suggesting under-extraction by the system compared to the clinician.
Using precision, recall and F1 score metrics, we assessed the effectiveness of the system for extracting discrete outcome measures (Table 5).

Aortic Regurgitation (AR) Level
The system showed a precision of 0.86 and a recall of 0.89 for the identification of AR levels.Only three occurrences (false negatives) were missed while accurately identifying 24 true positive situations.The system generated four instances of false positives.The final F1 score was 0.87, demonstrating a performance that was balanced across recall and precision.

LV Systolic Function
The NLP algorithm achieved a precision of 0.80 and a recall of 0.41 for the classification of Left Ventricular (LV) Systolic Function.While missing 17 occurrences (false negatives), it accurately recognised 12 true positive cases.Three false positive cases were produced by the algorithm.In this situation, there was a trade-off between precision and recall, as indicated by the F1 score of 0.55.

MV Regurgitation Level
The NLP algorithm performed exceptionally in the detection of mitral valve (MV) regurgitation levels.With a precision and recall of 0.98 and 0.97, respectively, it was able to correctly identify 59 true positive cases while producing only two false negatives.One false positive case was obtained by the algorithm.The final F1 score was 0.98, which represents a strong overall performance.

AV+MV+PV+TV Stenosis
The aortic valve (AV), mitral valve (MV), pulmonic valve (PV) and tricuspid valve (TV) stenotic levels that were extracted were aggregated, since occurrences were rare and were classified with moderate precision and recall by the system.The precision and recall were 0.38 and 0.56, respectively.Five true positive cases were accurately recognised by the algorithm, whereas four false negative cases were missed.It produced eight instances of false positives.The F1 score for this result was 0.45, indicating a performance that was reasonably balanced between recall and precision.

TR Level
The system displayed high precision and moderate recall for classifying extractions of tricuspid regurgitation (TR) levels.It achieved a precision of 0.97 and a recall of 0.64.A total of 32 false negatives were missed by the system, while 57 true positive cases were successfully identified.It obtained two false positive cases.The F1 score for this result was 0.77, indicating a favourable balance of precision overall.
The confusion matrix exhibits a predominant pattern where the extractions align diagonally from the top left to the bottom right (Figure 5), indicating a high degree of correct classification.Out of the 882 instances evaluated, the NLP system achieved an accuracy rate of 91.38%.The remaining 8.62% of errors primarily consisted of false negatives, accounting for 7.6% of the total.Conversely, false positives constituted only 1.02% of the errors.

Discussion
This study presents the use of a GATE-based system for extracting outcomes from TTE reports, with a special emphasis on capturing both discrete and continuous variables.Most of the rule-based research in the field of NLP in healthcare have focused on general clinical text, methodology or specific medical domains, with limited exploration in the context of echocardiography reports or the use of GUI based interface to interpret extraction [18,[20][21][22][24][25][26][27][28].To the best of our knowledge, this is one of the first studies to use a GATE-based NLP system for TTE extraction, adding to the understanding of natural language processing (NLP) in this area.
Our study set out to achieve two separate aims that aligned to increase the management and utility of echocardiographic data.First, we attempted to create a moderate-fidelity echocardiography report database by utilising automated data extraction and curation.The intention to integrate this with high-fidelity datasets, exemplified in [6], highlights the potential of multimodality learning strategies [7].
Our analysis also included identifying outcomes with consistently high extraction performance.The selection of robust variables is paramount since it provides the foundation for potential clinical applications and the development of predicted risk scores.These

Discussion
This study presents the use of a GATE-based system for extracting outcomes from TTE reports, with a special emphasis on capturing both discrete and continuous variables.Most of the rule-based research in the field of NLP in healthcare have focused on general clinical text, methodology or specific medical domains, with limited exploration in the context of echocardiography reports or the use of GUI based interface to interpret extraction [18,[20][21][22][24][25][26][27][28].To the best of our knowledge, this is one of the first studies to use a GATE-based NLP system for TTE extraction, adding to the understanding of natural language processing (NLP) in this area.
Our study set out to achieve two separate aims that aligned to increase the management and utility of echocardiographic data.First, we attempted to create a moderate-fidelity echocardiography report database by utilising automated data extraction and curation.
The intention to integrate this with high-fidelity datasets, exemplified in [6], highlights the potential of multimodality learning strategies [7].
Our analysis also included identifying outcomes with consistently high extraction performance.The selection of robust variables is paramount since it provides the foundation for potential clinical applications and the development of predicted risk scores.These two initiatives are aligned, aiming to improve data utilisation, ultimately enabling more efficient use of echocardiographic data in both research and clinical contexts.

Technical Perspective
While studies have generally evaluated outcomes irrespective of the data type, this study presents an evaluation pipeline for considering discrete and continuous outcomes using their own respective sets of evaluation metrics.A recent study demonstrates that, for both discrete and sequential datasets, discrete-feature approaches outperformed sequential time-series (continuous variable) methods [29].This finding contrasted with conventional assumptions and emphasised the importance of integrating discrete data, which is consistent with our exploration of diverse variable types in echocardiography reports.Another study highlighted the advantages of translating datasets into appropriate formats, by comparing continuous (visual analogue scales) against discrete (verbal descriptor scales) rating scales [30].It was found that both continuous and discrete rating scales gave similar performances in terms of inter-rater reliability.However, raters were found to prefer the continuous scale.This is consistent with our work to efficiently classify and handle data from echocardiogram reports using either continuous or qualitative rating scales, when and where appropriate.
In relation to our NLP-system study, applying insights from the broader deep learning domain's debate on merging discrete and continuous processing has important implications, as indicated in [31].The inherent role played by discrete symbols in human communication coincides nicely with the language of medical reports and allows for nuanced comprehension of discrete outcome rating scales.The idea that discrete symbols need context-expansion by incorporating continuous scale outcomes resonates with the explanative nature of the latter type of data.This data framework of integrating discrete and continuous data parallels the structured data generated from the NLP system in this study.
There are numerous methods to mix continuous and discrete processing in ways that support gradient-based learning in addition to attention mechanisms.It is common practise to use policy gradients [32], a technique for backpropagation through discrete decisions [33].A differentiable method for choosing distinct categories in a sampling setting is provided by reparameterisation techniques such as the Gumbel-Softmax distribution [34].Modelling a categorical probability distribution over the discrete elements, determined by the continuous input, is also a popular technique for converting continuous representations into discrete outputs (Bengio et al., 2003) [29,35].Discrete elements (such as word tokens or actions) can be transformed into continuous representations by locating the appropriate token (feature) vector from its matrix embeddings using the corresponding token index.
It has also been suggested that composable neural module networks (NMN) might be advantageous for combining continuous and discrete information [36,37].These modules can be combined into intricate networks according to the computations required to respond to natural language queries.They are specialised for specific subtasks.By concentrating on training component modules and learning how to combine them, this method minimises the need to relearn for every type of issue.The JAPE rules as part of the JAVA interfacing NLP system in this study have similarities in principle to the methodology for composing NMNs modules [38], and may provide the basis for integrating neural based approaches to TTE extraction systems in an interpretable manner.For studies have generally focusing on image based problems, the attention module generally prefers the use of attention extracted from Convolution Neural Networks [39,40].Recurrent and Transformer Neural Networks may be more suitable to model text and non-text based sequential interactions [41], with the latter able to excel in modelling long term dependencies, through parallel processing and self-attention to capture relationship across input all input features simultaneously.

Cardiac Surgery Perspective
In addition to its applications within the context of echocardiography reports, the NLP system may have potential for broader clinical implications and research opportunities.Preoperatively, the algorithm offers the possibility of identifying critical parameters from echo data that could act as "red flags", prompting surgeons to prioritise surgery for patients with specific aortic diameters or valve haemodynamics.This functionality may assist surgeons in managing surgery waiting lists more efficiently.
Following surgical interventions, the system's functionalities may extend to the postoperative phase, assisting clinicians to promptly identify issues related to structural valve degeneration and adjust the frequency of monitoring accordingly.Furthermore, the algorithm may play a potential role in addressing challenges like patient-prosthesis mismatch, enabling clinicians to make informed decisions for clinical interventions and facilitating audits for quality control purposes.
Increasingly, congenital heart disease (CHD) is diagnosed in the antenatal setting, with foetal echocardiography commonplace.Extension of the system's functionalities to antenatal ultrasound imaging modalities could aid antenatal risk modelling and prognostication, facilitating improved surgical planning, perinatal and neonatal care.
The system's applicability includes not only echocardiography report analysis but also other types of investigative modalities.For instance, the system's capabilities might be increased to examine computer tomography (CT) results and aid in aortic surveillance initiatives.Additionally, the system holds potential to integrate risk modelling approaches with automated text extraction, while identifying risks that have the potential for advancing research projects in cardiac care.

Cardiologist Perspective
The ability to extract echo data for reports using NLP would have significant benefits to managing a range of patients, including those with valvular heart disease, heart failure and ischaemic heart disease.As workflows in clinical practice become more automated, it is crucial to be able to prioritise and manage referrals and outpatient waiting lists.Being able to identify patients who have had a significant worsening of their valvular heart disease or cardiac function in a way that does not increase the administrative burden of reviewing every echo report would be extremely appealing to cardiologists.This tool also has great potential for research and auditing.For instance, with the ability to track the rates of change of aortic stenosis over time or monitoring aortic root dimensions.In addition, the combination of this database with other biochemistry, chest imaging and electrocardiogram datasets holds potential for the improved diagnosis of heart failure in patients.
This tool can integrate both the quantitative and qualitative data from an echo report, which is crucial for the optimal management of these patients.

Limitations and Future Work
Future work shall investigate the comparison of ICD-10 diagnostic codes against NLP for detecting cardiac diseases in the evaluation of this patient cohort [9].Due to the cost in terms of time and effort required for clinicians to manually annotate and extract the validation labels, a limited number of evaluation samples were available.Thus, future work should seek to obtain a larger sample of labelled validation samples from the entire hospital extracted set.It would also be interesting to refine the categorisation process to capture more nuanced information from echocardiography reports.This could involve identifying and extracting specific details related to different cardiac conditions e.g., heart failure.Future work should also further improve the NLP system compatibility with existing electronic health record systems to seamlessly integrate with healthcare IT infrastructure for wider adoption, possibly with a web-based view of the extracted data to enable echocardiographers or clinicians to perform quality inspections and manually modify the extracted data.Since manual annotations by clinicians using Excel spreadsheets were laborious, future work may consider the possibility of rapid annotation approaches whereby the NLP pre-annotates the reports in the GUI environment to allow the clinicians to only make minor corrections before curating a high fidelity dataset.This approach may also help to rapidly generate large amounts of gold standard annotations for the evaluation of newly developed transformer-based tools, or for the purpose of mechanistic clinical studies on the high fidelity corrected dataset.The application of Cluto-based clustering using graph representations is possible but was not investigated.This, and similar approaches, would be interesting to analyse in future studies [16,42].While the approach has also been developed for extracting Magnetic Resonance Imaging (MRI) unstructured data, future work should identify clinicians with expertise for curating the validation labels in this type of device.Certain variables, such as LVOT Vmax, showed high R 2 correlation but low ICC scores, suggesting that the scalar value of the NLP system extraction is correct but potentially on a different unit scale to that extracted by the clinician.Future work should aim to develop a plugin to the existing system to convert the clinician extraction to the NLP system unit scale.The mean values for some continuous outcomes extracted by the system were lower than those by the cardiologists, suggesting that any differences are more likely due to false negative than false positive findings.Although any improvements in extraction process would still be limited by the accuracy of the initial report, standardising the reporting of echocardiography across different vendors, national and international societies would be beneficial.Future work should also aim to further develop the approach using hybrid (machine learning plus rule-based) approaches to enhance existing variables with relatively lower extraction performance.For example, the combination with NMN-based approaches through reinforcement learning to select modules may encapsulate greater variations in outcome terminologies across different hospital locations.As another possible direction for future work, more advanced NLP models (e.g., BERT) could be utilised for the text analysis task.BERT models offer the advantage of contextual understanding through embedding representations, allowing them to capture intricate relationships within text and to enhance the accuracy and depth of the analysis.It would also be interesting to incorporate fuzzy [10], foundation learning and reinforcement based approaches on a larger local validation cohort to further improve the modelling performance.Future work should also consider analysing data extracted from echocardiography reports over time to track disease progression and treatment outcomes.One should also aim to validate the NLP system's performance in diverse patient populations, to ensure that it can effectively handle variations in language and medical terminology.

Conclusions
In summary, this study offers a novel application of the GATE-based system in the extraction of echocardiogram report data to curate a moderate fidelity TTE database that could be used for multi-fidelity data fusion.The study contributes to the understanding of NLP in the context of echocardiography by addressing both approaches for discrete and continuous outcomes.These findings emphasise the valuable contribution of NLP in automating data extraction and improving clinical decision-making processes.By bridging the gap between structured and semi-structured healthcare data, our approach holds promise for advancing research, risk prediction and patient care in the realm of echocardiography and beyond.

Figure 1 .
Figure 1.Design of the NLP Echo extraction system.

Figure 1 .
Figure 1.Design of the NLP Echo extraction system.

Figure 2 .
Figure 2. Graphical user interface of system to explainability of extraction for the continuous variable, EF. varValue1 shows the lower range value if this exists; varValue2 shows the upper range value; varValue shows the average of lower and upper range values if these exist.

Figure 2 .
Figure 2. Graphical user interface of system to explainability of extraction for the continuous variable, EF. varValue1 shows the lower range value if this exists; varValue2 shows the upper range value; varValue shows the average of lower and upper range values if these exist.

Figure 3 .
Figure 3.In depth analysis of clinician and system extractions for AV max PG; green and pink region indicates region of features involved in the annotation process.

Figure 4 .
Figure 4. Bubble plot analysis of magnitude and calibration for continuous variable sets.X-axis and y-axis show the total magnitude of extracted outcome measures for clinician and NLP, respectively.The size of circles represents the frequency of each variable extracted by the NLP system.

Figure 4 .
Figure 4. Bubble plot analysis of magnitude and calibration for continuous variable sets.X-axis and y-axis show the total magnitude of extracted outcome measures for clinician and NLP, respectively.The size of circles represents the frequency of each variable extracted by the NLP system.

Bioengineering 2023 , 21 Figure 5 .
Figure 5. Confusion matrix analysis of classification performance for discrete variable sets.system (Predicted) is shown on the y-axis while the clinicians (Actual) is shown on the x-axis.Accuracies are shown in green and errors in red.% represents percentage of total.

Figure 5 .
Figure 5. Confusion matrix analysis of classification performance for discrete variable sets.system (Predicted) is shown on the y-axis while the clinicians (Actual) is shown on the x-axis.Accuracies are shown in green and errors in red.% represents percentage of total.

Table 1 .
Abbreviations and definitions for 43 of the most clinically relevant outcomes labelled by the two clinicians.Heart category and data type are also shown.

Table 2 .
Example of outcome measure variations for continuous and qualitative (discrete) measurements.

Table 5 .
Precision recall results for discrete variables sets in the test dataset.

Table 5 .
Precision recall results for discrete variables sets in the test dataset.