A Weakly-Supervised Named Entity Recognition Machine Learning Approach for Emergency Medical Services Clinical Audit

Clinical performance audits are routinely performed in Emergency Medical Services (EMS) to ensure adherence to treatment protocols, to identify individual areas of weakness for remediation, and to discover systemic deficiencies to guide the development of the training syllabus. At present, these audits are performed by manual chart review, which is time-consuming and laborious. In this paper, we report a weakly-supervised machine learning approach to train a named entity recognition model that can be used for automatic EMS clinical audits. The dataset used in this study contained 58,898 unlabeled ambulance incidents encountered by the Singapore Civil Defence Force from 1st April 2019 to 30th June 2019. With only 5% labeled data, we successfully trained three different models to perform the NER task, achieving F1 scores of around 0.981 under entity type matching evaluation and around 0.976 under strict evaluation. The BiLSTM-CRF model was 1~2 orders of magnitude lighter and faster than our BERT-based models. Our proposed proof-of-concept approach may improve the efficiency of clinical audits and can also help with EMS database research. Further external validation of this approach is needed.


Introduction
Clinical performance audits are thought to be an important part of quality review and continuous quality improvement in healthcare systems and services [1,2]. In emergency medical services (EMS), one of the clinical audits that is conducted involves examining whether paramedics have performed the assessment and treatment steps following the standard operating procedures [3][4][5]. This is usually performed by auditing the free text reports written by the paramedics for the attended cases.
EMS clinical audits need to be routinely performed to ensure adherence to treatment protocols, to identify lapses, and to discover systemic deficiencies to guide paramedic training. However, identification of these items for audit from the free text case reports requires a significant amount of time, resources, and effort [6].
The Singapore Civil Defence Force (SCDF) is the national emergency medical services (EMS) provider in Singapore, handling more than 190,000 medical calls to the national "995" emergency hotline annually [7]. Paramedics in the SCDF are trained to respond to medical emergencies by providing rapid on-scene triage, treatment, and conveyance of casualties to hospitals for further management. At present, all EMS-attended cases are recorded by the paramedics on hardcopy ambulance case records, then transcribed into electronic forms (with the assistance of a digital pen) within 48 h and uploaded onto an internal server for audit and data analysis. Clinical audit of our EMS involves manual, laborious review of randomly selected cases and subsequent follow-up actions by a small team of dedicated auditors. Due to the high call volume and the inherent complexity of the paramedic protocols, only a limited percentage (around 10%) of total cases is audited each year [8].
Named entity recognition (NER) is a natural language processing (NLP) technique that recognizes and labels words mentioning specific entities in sentences. An example of NER is to recognize "Apple" as a brand name in the sentence "I am a big fan of Apple products" but not in "Apple is my favorite fruit". NER has been successfully used for information extraction in medical texts [9,10], but its specific application to paramedic text reports is unexplored. The language used in paramedic text reports differs from that in traditional clinical documents, so existing non-trainable NER models cannot be directly applied in this case. Challenges common to clinical NLP still apply, including widespread and inconsistent use of acronyms, misspellings, flexible formatting, atypical grammar and use of jargon [11]. Lastly, it is impractical to label a large corpus to generate a dataset for fully supervised training.
In this proof-of-concept study, we aimed to develop an NER model on paramedic text reports for clinical audit. We adopted a weakly-supervised approach by creating and fine-tuning a synonym list of keywords and phrases for the entities and using them as pseudo labels. The main contributions of this paper are to (1) propose the use of natural language processing to conduct EMS clinical audits instead of human chart review, (2) use a weakly-supervised method to label a large amount of unlabeled data for downstream training and (3) ascertain the effectiveness of the method on EMS clinical audit data. Since the languages used in different EMS systems are dramatically different, we hoped that this study could serve as a good example for other EMS systems to develop their own clinical NER audit models. The remainder of this paper is structured as follows. In the Methods section, we introduce the steps of (1) data preparation and preprocessing, (2) weakly-supervised labelling, (3) model training and (4) model evaluation. A web demo of the model was also developed. In the Results section, we describe the dataset and report the performance, size and inference speed of the models. In the Discussion section, we discuss our approach, with its advantages and limitations, and analyze the errors.

Data Preparation and Preprocessing
The data used in this proof-of-concept study contained 58,898 ambulance incidents recorded by the Singapore Civil Defence Force between 1st April 2019 and 30th June 2019. As motivating examples, we included all incidents belonging to one of three clinical scenarios commonly encountered in EMS practice: (1) acute coronary syndrome, (2) stroke, and (3) the bleeding patient. After excluding 14,679 incidents that did not result in a patient encounter and 8 cases with missing text reports, the remaining 44,211 incidents were used as the final dataset. Ethics approval for this study was granted by the National Healthcare Group (NHG DSRB 2020/00893) with a waiver of informed consent.
Text reports were converted into lower case as many text reports were entered in all upper case due to the nature of the data entry system. All symbols were removed except for the "%" symbol, while all numbers were retained. Sentences were split into individual tokens using white space tokenization. All preprocessing steps were performed automatically by a simple Python script with native libraries and regular expressions.
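The preprocessing steps described above can be sketched as follows (a minimal illustration using only Python's standard `re` module; the exact rules of the study's script may differ in detail):

```python
import re

def preprocess(report):
    """Lowercase the report, remove all symbols except '%' (digits are kept),
    then split into tokens on white space."""
    text = report.lower()
    # Replace every character that is not a letter, digit, '%' or whitespace.
    text = re.sub(r"[^a-z0-9%\s]", " ", text)
    # White space tokenization: split on any run of whitespace.
    return text.split()

print(preprocess("Pt c/o chest PAIN, SpO2 98%."))
# → ['pt', 'c', 'o', 'chest', 'pain', 'spo2', '98%']
```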

Weakly-Supervised Labelling
As the entire dataset was unlabeled, we used a weakly-supervised learning approach to model training. For the NER labels, we chose 17 different clinical entities spanning 3 different categories (clinical procedure, clinical finding, and medication) based on their clinical relevance in EMS practice. The notation used in this study is the IOB2 notation, which assigns a "B-" tag to the first word of each entity, including entities comprising only 1 word; an "I-" tag to each subsequent word within the entity; and an "O" tag to all other tokens not belonging to any entity [12]. We did not annotate negation because we consider paramedics compliant as long as they documented the entity: documentation indicates that they did not forget the standard operating procedures and may have deliberately deviated from them according to the actual scenario.
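As an illustration of the IOB2 scheme, the sketch below converts token-index entity spans into tags (the entity name is hypothetical):

```python
def to_iob2(tokens, spans):
    """tokens: list of words; spans: list of (start, end_exclusive, entity_type)
    over token indices. Returns one IOB2 tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype              # first word of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype              # subsequent words within it
    return tags

tokens = ["iv", "normal", "saline", "given"]
print(to_iob2(tokens, [(1, 3, "NormalSaline")]))
# → ['O', 'B-NormalSaline', 'I-NormalSaline', 'O']
```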
In the first step, we used a rule-based technique to create dummy labels for the entire dataset. We first created a list of synonyms for each entity based on EMS practice experience. These synonyms might be single or multiple tokens, and each entity can have multiple synonyms. Subsequently, we used the synonyms to match and pseudo-label the entities in the sentences. Fuzzy string matching was used to increase the recall of the bootstrapping process by including terms with minor spelling mistakes, defined as having a maximum of 1 missing character compared to the correct spelling. However, as this might result in high rates of false positives on short phrases, it was only performed on strings that were 5 characters or longer; single- or multi-token entities spanning fewer than 5 characters were matched using exact string matching. The fuzzy string-matching algorithm used was provided by the "fuzzysearch" package [13]. In addition, exact string matching was enforced for any synonym explicitly specified to require it.
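The study used the "fuzzysearch" package; the matching rule itself can be approximated in pure Python as follows (a simplified stand-in for illustration, not the library's algorithm):

```python
def within_one_missing_char(candidate, synonym):
    """True if candidate equals synonym, or equals synonym with exactly one
    character deleted (the 'maximum of 1 missing character' rule)."""
    if candidate == synonym:
        return True
    if len(candidate) != len(synonym) - 1:
        return False
    # Try deleting each character of the synonym in turn.
    return any(synonym[:i] + synonym[i + 1:] == candidate
               for i in range(len(synonym)))

def matches(candidate, synonym):
    """Fuzzy matching only for synonyms of 5+ characters; exact otherwise."""
    if len(synonym) >= 5:
        return within_one_missing_char(candidate, synonym)
    return candidate == synonym

print(matches("aspirn", "aspirin"))  # True: one missing character
print(matches("gt", "gtn"))          # False: short term, exact match required
```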
After pseudo-labelling of the training, development and test sets, the labels in the development and test sets were verified by a clinician and any mistakes made by the bootstrap process were corrected. We performed error analysis only on the development set to fine-tune the synonym list and specify which terms require exact string matching to improve the labelling process. The dataset was split into 95% training (n = 42,000), 2.5% development (n = 1105) and 2.5% test (n = 1106) sets. Only 5% (n = 2211) of the data were human labelled.

Model Training
To perform the NER task, we experimented with a deep learning-based Bidirectional Long Short-Term Memory + Conditional Random Fields (BiLSTM-CRF) model as well as two Bidirectional Encoder Representations from Transformer (BERT) models with different pretrained weights. Both model architectures could automatically learn the useful information from the training data without manual feature engineering.

Bidirectional Long Short-Term Memory + Conditional Random Fields
Conditional Random Fields (CRF) are a class of probabilistic models designed to segment and label sequence data [14], and have been used with success on named entity recognition tasks due to their ability to use customized observation features from both past and future elements in sequences [15]. Bidirectional Long Short-Term Memory [16] combined with an output CRF layer [17] is a recurrent neural network (RNN) model that has achieved state-of-the-art performance on many named entity recognition tasks. Instead of requiring manually crafted features as the traditional CRF model does, a BiLSTM model automatically learns useful features and feeds them into the CRF model. We built a BiLSTM-CRF model using the PyTorch library in Python [18]. No pre-trained word embedding was used; instead, a word embedding layer was initialized with zeros and trained together with the entire model. The batch size was set to 512. We used the Adam optimizer [19] with a default learning rate of 0.001. Early stopping was implemented and would trigger if the validation loss did not decrease in 5 consecutive epochs, to prevent overfitting. The maximum number of training epochs was set to 300. We experimented with different dimensions of the word embedding layer as well as the hidden layer, and used 100 and 64, respectively, in the final model, which yielded the best performance on the validation set. An illustration of the model can be seen in Figure 1.
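At inference time, the CRF layer recovers the highest-scoring tag sequence from the BiLSTM's per-token emission scores via Viterbi decoding. A minimal NumPy sketch of that decoding step (toy scores, not the trained model's weights):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """emissions: (seq_len, num_tags) scores from the BiLSTM for each token.
    transitions: (num_tags, num_tags) score of moving from tag i to tag j.
    Returns the highest-scoring tag sequence as a list of tag indices."""
    seq_len, _ = emissions.shape
    score = emissions[0].copy()          # best score of paths ending in each tag
    backpointers = []
    for t in range(1, seq_len):
        # candidate[i, j]: best path ending at t-1 in tag i, then moving to tag j
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(candidate.argmax(axis=0))
        score = candidate.max(axis=0)
    # Trace back from the best final tag.
    best = [int(score.argmax())]
    for bp in reversed(backpointers):
        best.append(int(bp[best[-1]]))
    return best[::-1]

emissions = np.array([[1.0, 0.0], [0.0, 1.0]])   # 2 tokens, 2 tags
transitions = np.zeros((2, 2))
print(viterbi_decode(emissions, transitions))    # → [0, 1]
```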

BERT-Based Token Classifier
BERT-based models are bidirectional transformer models with contextualized word embeddings pre-trained on large corpora, and have revolutionized deep learning in NLP tasks ever since their introduction [20]. Token classification can be achieved by adding a linear classification layer after the output of the BERT-based model. To build the model, we used the PyTorch implementation from the Transformers Python library by HuggingFace [21]. The first pretrained model we used was the BERT-base-uncased model, which was trained on two large corpora: BooksCorpus [22] and English Wikipedia. Since our corpus is related to the clinical domain, the second pretrained model we used was Clinical-BERT [23], which was trained on clinical notes from the Medical Information Mart for Intensive Care III database [24]. Prior to training, all sentences were tokenized by the pre-trained tokenizer and zero-padded to a constant sequence length of 300 tokens. In the training phase, the pre-trained model was fine-tuned on our training data for 30 epochs, with early stopping after 5 epochs if there was no improvement in token-level accuracy on the development set. We used the AdamW optimizer [25] with a learning rate of 3 × 10⁻⁵, Adam epsilon of 1 × 10⁻⁸ and a weight decay rate of 0.01 over all parameters except for the bias terms as well as the gamma and beta terms in the layer-normalization layers. A learning rate scheduler was used to linearly reduce the learning rate throughout the epochs. When combining the sub-token tag predictions into word-level predictions, we let the model pick the most frequent class other than O as the final prediction for each word. An illustration of the model can be seen in Figure 2.
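The sub-token merging rule described above ("most frequent class except O") can be sketched as follows (tag names are illustrative):

```python
from collections import Counter

def merge_subtoken_tags(subtoken_tags):
    """Combine WordPiece sub-token predictions into one word-level tag:
    pick the most frequent non-'O' tag, falling back to 'O' if there is none."""
    counts = Counter(tag for tag in subtoken_tags if tag != "O")
    if not counts:
        return "O"
    return counts.most_common(1)[0][0]

# e.g. a drug name split into four sub-tokens by the tokenizer:
print(merge_subtoken_tags(["B-Salbutamol", "O", "B-Salbutamol", "O"]))  # → B-Salbutamol
print(merge_subtoken_tags(["O", "O", "O"]))                             # → O
```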

Model Evaluation

Token Class Level
We evaluated the performance of our NER models using the weighted precision, recall and F1-score on all tokens except the uninformative "O" token. Specifically, the weighted metric calculates the metric for each token class and finds their average weighted by the number of true tokens in that class.
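These token-class metrics can be computed with scikit-learn by passing the non-"O" classes through the labels argument, so that "O" is excluded from the weighted average (tag names are illustrative):

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = ["B-ECG", "O", "B-Bleeding", "I-Bleeding", "O"]
y_pred = ["B-ECG", "O", "B-Bleeding", "O",          "O"]

# Weighted average over every class except the uninformative "O" token.
labels = sorted({tag for tag in y_true + y_pred if tag != "O"})
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average="weighted", zero_division=0)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# → precision=0.667 recall=0.667 f1=0.667
```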

Entity Level
We reported the MUC-5 evaluation metrics under both strict evaluation mode and entity type matching mode defined in the 2013 International Workshop on Semantic Evaluation (SemEval'13) to compare their performance at the entity level [26,27]. An entity prediction is defined by both the entity type predicted and the word span (starting word and ending word). MUC-5 categorizes each prediction into 1 of the 5 following types:
• Correct (COR): the system's output is the same as the gold-standard annotation.
• Incorrect (INC): the system's output does not match the gold-standard annotation.
• Partial (PAR): the system's output partially overlaps with the gold-standard annotation.
• Missing (MIS): a gold-standard annotation is not captured by the system.
• Spurious (SPU): the system predicts an entity that does not appear in the gold standard.
Based on these types, the following measures can be calculated: the number of possible entities POS = COR + INC + PAR + MIS; the number of actual predictions ACT = COR + INC + PAR + SPU; precision = COR/ACT; recall = COR/POS; and the F1 score as the harmonic mean of precision and recall. Under both strict evaluation mode and entity type matching mode, there will be no PAR. The only difference between the two modes is that if a predicted entity has the correct type but the word span only overlaps with the gold-standard annotation, it will be INC under strict evaluation, but COR under entity type matching. It is worth noting that POS depends on the model-specific prediction and can be larger than the total number of entities in the data, because 1 gold-standard entity can be compared to more than 1 overlapping prediction and, thus, be counted more than once.
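The difference between the two evaluation modes can be sketched for a single prediction as follows (a simplified stand-in for illustration; the full SemEval scheme also counts MIS for gold entities with no overlapping prediction):

```python
def categorize(pred, golds, mode="strict"):
    """pred and golds are (start, end_exclusive, entity_type) word spans.
    Returns 'COR', 'INC' or 'SPU' for one prediction under the given mode."""
    ps, pe, ptype = pred
    for gs, ge, gtype in golds:
        if not (ps < ge and gs < pe):        # no word overlap with this gold
            continue
        if mode == "strict":
            # strict: both the span and the type must match exactly
            return "COR" if (ps, pe) == (gs, ge) and ptype == gtype else "INC"
        # entity type matching: any overlap with the correct type counts
        return "COR" if ptype == gtype else "INC"
    return "SPU"                             # overlaps no gold entity at all

gold = [(4, 7, "NormalSaline")]
print(categorize((4, 6, "NormalSaline"), gold, mode="strict"))  # → INC
print(categorize((4, 6, "NormalSaline"), gold, mode="type"))    # → COR
```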

Results
The whole dataset consists of 44,211 paramedic reports, with 3,069,578 words and 39,067 unique words. The training, development and test sets contained 41,984, 1105 and 1106 reports, respectively. Table 1 shows some examples of the original reports, the reports after preprocessing and their ground truth NER labels. Statistics about the clinical entities, their relative frequencies, total tokens of the entities and average number of tokens per entity are presented in Table 2.

Based on the prevalence of the entities, Electrocardiogram (ECG), Bleeding and Stroke Assessment are the three most observed entities. Looking at the tokens, Normal Saline has a significantly longer average number of tokens per entity, because it is often represented by phrases such as "i v n s" or "iv ns 0 9%" after the punctuation is removed.

Table 3 shows the performance of our NER models over the entities on the test set. On the test set, our models show indistinguishably excellent F1 scores: around 0.981 under entity type matching evaluation and 0.976 under strict evaluation. Despite the indistinguishable performance, the model complexity and inference speed differ by 1~2 orders of magnitude between the BiLSTM-CRF model and the BERT-based models, as demonstrated in Table 4. Hence, we chose BiLSTM-CRF as our final model. We report the performance of the BiLSTM-CRF model over the token classes on the development set and test set in the supplementary file.

Table 4. Comparison of model complexity and inference speed. Inference wall time is reported using the mean and standard deviation of wall clock time over 100 iterations of predicting a sample sentence, without the overhead of loading libraries and the model itself.


Discussion
In this proof-of-concept study, we developed an NER model on paramedic text reports for the purpose of clinical audit. Although not quantified in this study, it is apparent that this system would enable us to vastly increase not only the number of cases audited, but also the complexity of the audit, as dozens of individual actions could be evaluated. Moreover, this could be achieved in a much shorter period of time than if a team of human auditors were to perform manual chart reviews.
Another possible use case for this model is identifying cases for database research. With the digitalization of ambulance data, there is increasing opportunity for large-scale data analysis and research. The NER model was able to accurately identify a limited set of commonly used clinical entities, and this information can be used to retrieve cases containing a given clinical entity. This would reduce the need for manual chart review, which becomes impractical as the number of cases grows, and would vastly increase the potential sample size for any clinical study. As EMS is an important part of the chain of survival for patients requiring emergency care, there is a need for robust identification of case types for downstream research tasks.
The final BiLSTM-CRF model achieved good performance, with an F1 score of 0.981 under entity type matching evaluation and 0.976 under strict evaluation. Although the overall performance was satisfactory, we observed two major sources of errors. The first is partial capture of the full span of multi-word entities, including ECG (e.g., "ecg 4 leads"), GTN (e.g., "s l gtn") and Normal Saline (e.g., "i v n s 0 9%"). Since our training set was labelled only via the weakly-supervised approach, while the development and test sets were labelled by humans, slight discrepancies are expected. Nevertheless, we believe these mistakes are of lesser significance and would not affect the audit result, since the entities are still labelled. The second source is misspelled words. A misspelled word can either be non-existent ("salbutumol" vs. "salbutamol") or have a different meaning ("facial drop" vs. "facial droop"). Our BiLSTM-CRF model marks the first type of misspelled word as "unknown", to which a special word vector is assigned. The second type of misspelled word is seen in both its normal context and the misspelled context; as a result, such words are more difficult for the model to learn and contribute to the errors. That being said, we also observed that some of these misspelled words were correctly predicted, likely thanks to the CRF module.
We expected the BERT models to mitigate the issue of misspelling by predicting the entities from the subwords produced by WordPiece tokenization [32]. Moreover, we expected Clinical-BERT to perform better thanks to its pretraining on a clinical corpus. Upon examination of the results, we found that the pretrained tokenizers from BERT base and Clinical-BERT did produce different subwords. Based on these subwords, the BERT models predicted more entities than the BiLSTM-CRF model. However, both true positives and false positives increased, and we observed higher recall, lower precision and F1 scores similar to the BiLSTM-CRF model. Reasons why the BERT models may not have worked better than BiLSTM-CRF in our task include the following: (1) WordPiece tokenization is not designed to correct spelling errors, but rather to segment meaningful units from words; (2) in our paramedic reports, the words are highly abbreviated, making the tokenizer less helpful; and (3) the clinical notes on which Clinical-BERT was pre-trained differ from our paramedic reports to a certain extent.
To our knowledge, this is the first study investigating the use of NLP on a large number of paramedic-written free text reports, and the results are promising. We believe that this work can inspire more NLP applications for novel clinical text. Nonetheless, we also recognize some limitations of this work. Firstly, the study was conducted within a single EMS system and further studies are needed to evaluate its external validity. Further planned intervention is also necessary to evaluate the usefulness of this system. Secondly, we did not manage to correct the spelling errors in the text reports, which is an area for future work. Thirdly, we did not experiment with lighter BERT-based models, such as DistilBERT, which are smaller and faster than normal BERT models [33]. Lastly, entity classes such as "Burns Cooling" and "Valsalva Maneuver" were absent in the test data set and could not be evaluated.
Future studies can prospectively evaluate the actual deployment of this software in an EMS system, in both quantitative and qualitative terms, for both auditors and paramedics. Machine learning models used for named entity recognition need to be recalibrated over time to reflect changes in practitioners' documentation practices and in personnel; the evaluation of such changes could be the focus of future studies. Finally, evaluation of the named entity recognition model on data from other EMS systems will help to determine whether the performance we observed is generalizable.

Conclusions
In this proof-of-concept study, we demonstrated the process of developing an NER model that can reliably identify clinical entities in unlabeled paramedic free text reports. This model can be used in an EMS clinical audit system that automates the audit process. It would allow us to increase the proportion of cases that undergo auditing to complete coverage, even as demand for EMS services in Singapore grows, while reducing mental fatigue for human auditors.

Funding: This research and the APC were funded by MOE Academic Research Fund (AcRF) Tier 1 grant (WBS R-608-000-301-114).

Institutional Review Board Statement:
The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by National Healthcare Group Domain Specific Review Board (NHG DSRB 2020/00893).
Informed Consent Statement: Patient consent was waived as data used were de-identified.

Data Availability Statement:
The data supporting the findings of this study are available from the corresponding author upon reasonable request, subject to approval by SCDF.