Templated Text Synthesis for Expert-Guided Multi-Label Extraction from Radiology Reports

Abstract: Training medical image analysis models traditionally requires large amounts of expertly annotated imaging data, which is time-consuming and expensive to obtain. One solution is to automatically extract scan-level labels from radiology reports. Previously, we showed that, by extending BERT with a per-label attention mechanism, we can train a single model to perform automatic extraction of many labels in parallel. However, if we rely on pure data-driven learning, the model sometimes fails to learn critical features or learns the correct answer via simplistic heuristics (e.g., that "likely" indicates positivity), and thus fails to generalise to rarer cases which have not been learned or where the heuristics break down (e.g., "likely represents prominent VR space or lacunar infarct", which indicates uncertainty over two differential diagnoses). In this work, we propose template creation for data synthesis, which enables us to inject expert knowledge about unseen entities from medical ontologies, and to teach the model rules on how to label difficult cases, by producing relevant training examples. Using this technique alongside domain-specific pre-training for our underlying BERT architecture, i.e., PubMedBERT, we improve F1 micro from 0.903 to 0.939 and F1 macro from 0.512 to 0.737 on an independent test set for 33 labels in head CT reports for stroke patients. Our methodology offers a practical way to combine domain knowledge with machine learning for text classification tasks.


Introduction
Training medical imaging models requires large amounts of expertly annotated data, which is time-consuming and expensive to obtain. Fortunately, medical images are often accompanied by free-text reports written by radiologists describing their main radiographic findings (what the radiologist sees in the image, e.g., hyperdensity) and clinical impressions (what the radiologist diagnoses based on the findings, e.g., haemorrhage). Recent approaches to creating large imaging datasets have involved mining these reports to automatically obtain scan-level labels [1,2]. Scan-level labels can then be used to train anomaly detection algorithms, as demonstrated in the CheXpert challenge for automated chest X-ray interpretation [1] and the Radiological Society of North America (RSNA) haemorrhage detection challenge [2]. For the task of extracting labels from head computed tomography (CT) scan reports (see Figures 1 and 2), we have previously shown that we can train a single model to perform automatic extraction of many labels in parallel [3], by extending BERT [4] with a per-label attention mechanism [5]. However, extracting labels from text can be challenging because the language in radiology reports is diverse, domain-specific, and often difficult to interpret. Therefore, the task of reading the radiology report and assigning labels is not trivial and requires a certain degree of medical knowledge on the part of a human annotator [6]. When we rely on pure data-driven learning, we find that the model sometimes fails to learn critical features or learns the correct answer via simple heuristics (e.g., that presence of the word "likely" indicates positivity) rather than valid reasoning, and thus fails to generalise to rarer cases which have not been learned or where the heuristics break down (e.g., "likely represents prominent VR space or lacunar infarct", which indicates uncertainty over two differential diagnoses). McCoy et al.
[7] suggested the use of templates to counteract a similar problem in sentiment analysis for film and product reviews, to prevent syntactic heuristics being learned. We also previously performed basic data synthesis using simple templates, to provide minimal training examples of each class. In this work, we develop the idea of template creation further to perform extensive data synthesis. Our contributions are centred around incorporating medical domain knowledge into a deep learning model for text classification. We propose to use templates to inject expert knowledge of rare classes and class relationships, and to teach the model about labelling rules. Using template data synthesis alongside domain-specific pre-training for our underlying BERT architecture (PubMedBERT [9]), we are able to robustly extract a set of 33 labels related to neurological abnormalities from head CT reports for stroke patients.
Our methodology offers a practical way to combine rules with machine learning for text classification. In summary:

• Building on our work in [3], we propose to use templates to strategically augment the training dataset with rare cases obtained from a medical knowledge graph and with difficult cases obtained from rules created by human experts during the course of manual annotation, enabling expert-guided learning via text data synthesis.

• We analyse the impact of the vocabulary arising from domain-specific pre-training of BERT, and show why this improves accuracy.

• We perform extensive validation of our methods, including a prospective validation on data which was unseen at the point of annotating the training dataset, and show that our methods enable improved generalisation and a convenient mechanism for adaptation.

Radiology Report Labelling
Automatic extraction of labels from radiology reports has traditionally been accomplished using expert medical knowledge to engineer a feature extraction and classification pipeline [10]; this was the approach taken by Irvin et al. to label the CheXpert dataset of chest X-rays [1] and by Grivas et al. in the EdIE-R method for labelling head CT reports [11]. These pipelines separate the individual tasks, such as determining whether a label is mentioned (named entity recognition) and determining whether a mentioned label is affirmed or negated (negation detection). An alternative is to design an end-to-end machine learning model that learns to extract the final labels directly from the text. Simple approaches have been demonstrated using word embeddings or bag-of-words feature representations followed by logistic regression [12] or decision trees [13]. More complex approaches using a variety of neural networks have been shown to be effective for document classification by many authors [14,15], especially with the addition of attention mechanisms [3,5,16–19]. State-of-the-art solutions use existing pre-trained models, such as Bidirectional Encoder Representations from Transformers (BERT) [4], that have learnt underlying language patterns, and fine-tune them on small domain-specific datasets.

Pre-Training for Text Deep Learning Models
Different variants of BERT such as BioBERT [20] (as used by Wood et al. [17]) or PubMedBERT [9] use the same model architecture and pre-training procedures as the original BERT, but use different pre-training datasets, allowing the models to learn the context of domain-specific vocabulary.

Text Data Synthesis
Various approaches have been proposed for text data augmentation, targeting improved performance across diverse natural language processing (NLP) applications. Synthetic data can be generated using very simple rule-based transformations, including noise injection (inserting random words), random word deletion, or number swapping [21,22]. Another approach to creating synthetic text data is to randomly split training documents or sentences into multiple training fragments; this has been shown to improve performance on text classification tasks [23]. Paraphrasing is a more sophisticated approach which is usually achieved by back-translation using neural machine translation models; this was used on the CheXpert dataset by Smit et al. [18]. Back-translation has been used in other tasks and settings too [24–26]. These approaches perform indiscriminate augmentation based on the whole training corpus. By contrast, McCoy et al. [7] suggested the use of templates to target less common cases, which are underrepresented in the training data and do not obey the simple statistical heuristics that models tend to learn; in particular, they focused on creating a balanced dataset in which syntactic heuristics could not solve the majority of cases.

Materials and Methods
In this section, we first describe our dataset and annotation scheme, then the method of data synthesis via templates which is the focus of this paper, and finally the model architectures that we employ for our experiments.

NHS GGC Dataset
Our target dataset contains 28,687 radiology reports supplied by the West of Scotland Safe Haven within NHS Greater Glasgow and Clyde (GGC). We have acquired the ethical approval to use this data: iCAIRD project number 104690, University of St Andrews CS14871. A synthetic example report with a similar format to the NHS GGC reports can be seen in Figure 2.
Our dataset is split into five subsets: Table 1 shows the number of patients, reports, and sentences for each subset. We use the same training and validation datasets as previously used in [3]. We further validate on an independent test set consisting of 317 reports, a prospective test set of 200 reports, and an unlabelled test set of 27,940 reports. We made sure to allocate sentences from reports relating to the same patient to the same data subset to avoid data leakage. The annotation process was performed in two phases; Phase 1 on an initial anonymised subset of the data, and Phase 2 on the full pseudonymised dataset that we accessed onsite at the Safe Haven via Canon Medical's AI training platform. A list of 33 radiographic findings and clinical impressions found in stroke radiology reports was collated by a clinical researcher (5 years clinical experience and 2 years experience leading on text and image annotation) and reviewed by a neurology consultant; this is the set of labels that we aim to classify. Figure 3 shows a complete list of these labels. During the annotation process, each sentence was initially labelled by one of two medical students (third and fourth year students with previous annotation experience). After annotating the sentences, difficult cases were discussed with the clinical researcher and a second pass was made to make labels consistent. We include inter-annotator comparisons between annotators and the final reviewed annotations for a subset of our data (1040 sentences) in Table 2. We see that the agreement between annotator 2 and the final reviewed version is higher than that of annotator 1. Annotator 1 and annotator 2 were slightly offset in annotation time, and the annotation protocol was updated before annotator 2 finished the first annotation iteration, enabling annotator 2 to incorporate these updates into their annotations and resulting in higher comparison scores.
Table 2. Comparisons between the two medical student annotators ("Annotator 1" and "Annotator 2") and the final reviewed data ("Reviewed"). We report Cohen's kappa, F1 micro and F1 macro for 1044 sentences from 138 reports that were annotated by both annotators.

Each sentence is labelled for each finding or impression as one of four certainty classes: positive, uncertain, negative, not mentioned. These are the same certainty classes as used by Smit et al. [18]. In the training dataset, the most common labels, such as Haemorrhage/Haematoma, Infarct/Ischaemia and Hypodensity, have between 150–350 mentions (100–200 positive, 0–50 uncertain, 0–150 negative), while the rarest labels, such as abscess or cyst, occur only once. Full details of the label distribution breakdown can be found in Appendix B. We denote our set of labels as L, where F is the set of findings and I is the set of impressions, and our set of certainty classes as C, such that the number of labels is defined as n_L = |L| = n_F + n_I = |F| + |I| and the number of certainty classes is defined as n_C = |C|. For the NHS GGC dataset, n_F = 15, n_I = 18, n_L = 33 and n_C = 4.

Templates for Text Data Synthesis
In this section, we describe some generic templates based on the labelling scheme, followed by two methods of integrating domain knowledge: "knowledge injection" and "protocol-based templates".

Generic Templates
Our generic templates are shown in Figure 4. Data synthesis involves replacing the ENTITY slot with each of the 33 label names in turn. The set of 3 simple templates allows the model to see every combination of certainty classes and labels (Figure 4). This enables learning of combinations that are not present in the original training data. It works well for labels where there is little variation in the terminology, i.e., where the label name is effectively the only way that the label is described, such as "lesion". We also formulate a further 6 permuted templates, in which we change the word ordering and in particular the position of the label within the sentence, to inject diversity into the data.
There may be [ENTITY] in the brain.
[ENTITY] may be evident in the brain.
There is no [ENTITY] in the brain.
[ENTITY] is not evident in the brain.
There is [ENTITY] in the brain.
[ENTITY] is evident in the brain.
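As an illustration, the generic-template expansion above can be sketched in a few lines of Python (the label names and data structures here are ours for illustration, not the paper's implementation):

```python
# Sketch of generic-template expansion: each certainty class has template
# variants whose ENTITY slot is filled with every label name in turn.
TEMPLATES = {
    "uncertain": ["There may be [ENTITY] in the brain.",
                  "[ENTITY] may be evident in the brain."],
    "negative":  ["There is no [ENTITY] in the brain.",
                  "[ENTITY] is not evident in the brain."],
    "positive":  ["There is [ENTITY] in the brain.",
                  "[ENTITY] is evident in the brain."],
}

def expand_templates(labels):
    """Yield (sentence, label, certainty) triples for every combination."""
    for certainty, variants in TEMPLATES.items():
        for template in variants:
            for label in labels:
                yield template.replace("[ENTITY]", label), label, certainty

# Two illustrative labels stand in for the full set of 33.
synthetic = list(expand_templates(["hyperdensity", "lesion"]))
```

Running this over all 33 labels would yield every (label, certainty) combination, including combinations absent from the real training data.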

Combining Templates
We use the meta-template shown in Figure 5 to generate more complex sentences containing entities with different uncertainty modifiers: [TEMPLATE1] and [TEMPLATE2] (Figure 5: a meta-template which specifies that two templates can be concatenated with the word "and").
An example sentence generated by the above template is "There is hyperdensity in the brain and there is no infarct", which would be labelled as positive hyperdensity and negative infarct. If the random selection results in the same label but with different certainty classes, we use the following precedence rule to label the sentence with a single certainty class for that label: positive > negative > uncertain > not mentioned.
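The concatenation and precedence rule can be sketched as follows (the (sentence, label, certainty) tuple format is illustrative, not the paper's code):

```python
# Certainty classes in decreasing order of precedence, as stated above.
CERTAINTY_PRECEDENCE = ["positive", "negative", "uncertain", "not mentioned"]

def combine(t1, t2):
    """Concatenate two labelled template sentences with 'and'.

    Each t is a (sentence, label, certainty) tuple. If both parts carry
    the same label with different certainties, the precedence rule
    positive > negative > uncertain > not mentioned selects one class.
    """
    sent = t1[0].rstrip(".") + " and " + t2[0][0].lower() + t2[0][1:]
    labels = {t1[1]: t1[2]}
    if t2[1] in labels:
        # keep whichever certainty has higher precedence (lower index)
        labels[t2[1]] = min(labels[t2[1]], t2[2], key=CERTAINTY_PRECEDENCE.index)
    else:
        labels[t2[1]] = t2[2]
    return sent, labels

s, y = combine(("There is hyperdensity in the brain.", "hyperdensity", "positive"),
               ("There is no infarct in the brain.", "infarct", "negative"))
```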

Knowledge Injection into Templates
Some of our labels have many different subtypes which are unlikely to be exhaustively represented in the data, and the label name is only one of many ways of mentioning label entities.In particular, the labels tumour and infection are rare in our dataset of stroke patients, with infection not present in our training data at all, but they have many diversely named subtypes.
We can obtain synonyms of labels from existing medical ontologies and insert these into the templates. In this paper, we use the Unified Medical Language System (UMLS) [8], which is a compendium of biomedical science vocabularies. The UMLS is made up of almost 4 million biomedical concepts, each with its own Concept Unique Identifier (CUI). Each concept in the UMLS knowledge graph has an associated thesaurus of synonyms known as surface forms. Furthermore, UMLS provides relationships between concepts, including hierarchical links (inverse_isa relationships) from general down to more specific concepts. For any given CUI, we can follow the inverse_isa links to identify its child subgraph. In order to obtain synonyms for tumour, we took the intersection of the subgraphs for brain disease (CUI: C0006111) and tumour (CUI: C0027651). In order to obtain synonyms for infection, we took the subgraph of CNS infection (CUI: C0007684); see Figure 6. This process yielded synonyms such as "intracranial glioma" and "brain meningioma" for the label Tumour, and "cerebritis" and "encephalomyelitis" for the label Infection. In total, we retrieve 38 synonyms for tumour (S_tumour) and 304 for infection (S_infection). We inject these synonyms into templates by randomly substituting the label names with UMLS synonyms during training. This substitution technique ensures that labels with many synonyms do not overpower and outnumber labels with fewer or no synonyms.
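The training-time substitution step can be sketched as follows (the synonym sets are truncated to the examples named above; the real sets come from the UMLS subgraphs, and the 50% substitution probability is an illustrative assumption):

```python
import random

# Illustrative synonym sets; the paper retrieves 38 tumour and 304
# infection synonyms from UMLS via inverse_isa subgraph traversal.
SYNONYMS = {
    "tumour": ["intracranial glioma", "brain meningioma"],
    "infection": ["cerebritis", "encephalomyelitis"],
}

def inject_synonym(sentence, label, rng=random):
    """Randomly substitute a label name in a template sentence with one of
    its UMLS synonyms (if any) at training time. Because substitution is
    per-sample rather than enumerating every synonym, labels with many
    synonyms do not outnumber labels with few or none."""
    if label in SYNONYMS and rng.random() < 0.5:
        return sentence.replace(label, rng.choice(SYNONYMS[label]))
    return sentence
```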

Protocol-Derived Templates
Creating a manual annotation protocol is difficult [6], and the protocol constantly evolves as new data are encountered and labelled. It is therefore useful to be able to encode certain phrases/rules from the protocol in a template so that they can be learned by the model. This is particularly useful for the certainty class modifiers, for instance "suggestive" compared to "suspicious". The templates shown in Figure 7 have been derived from the protocol developed during Phase 1 annotation, and were chosen following analysis of the Phase 1 test set failure cases to identify which rules were not learned. We insert only the subset of labels that fit each template, e.g., for the first two templates, we sample suitable entity pairs of findings and impressions according to the finding-to-impression links shown in Figure 3.

Synthetic Dataset Summary
Summary statistics for the synthetic datasets are shown in Table 3. It may be seen that, for some templates, we can generate a larger number of synthetic sentences than for others, due to the number of combinations of labels and label synonyms for each template. The total number of synonyms is S = S_I + S_F, where S_F is the number of finding synonyms and S_I is the number of impression synonyms. For this paper, S_F = n_F = 15 and S_I = n_I + S_infection + S_tumour = 18 + 38 + 304. In Table 3, we use these numbers to define upper bounds for the number of sentences we can generate. The number of unique synthetic sentences is larger than the number of original sentences. When we naively used all of these data, we observed a negative effect on training, so we implement a sampling ratio between real and synthetic sentences to ensure that only 30% of samples in each training batch are from the synthetic dataset. This is applied across all of our synthetic approaches, including baselines. For practical reasons, we pre-select 400 random label combinations for each of the combined and protocol-based approaches, although synonyms are randomly inserted at every training iteration. We chose the sampling ratio of 30% empirically based on our Phase 1 validation dataset.
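The per-batch sampling ratio can be sketched as follows (function and variable names are illustrative; the real training loop is not reproduced in this excerpt):

```python
import random

def mixed_batch(real, synthetic, batch_size=32, synth_frac=0.3, rng=random):
    """Draw one training batch in which only ~30% of samples come from the
    synthetic dataset and the remainder from real sentences."""
    n_synth = round(batch_size * synth_frac)
    batch = rng.sample(synthetic, n_synth) + rng.sample(real, batch_size - n_synth)
    rng.shuffle(batch)  # mix synthetic and real samples within the batch
    return batch
```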
We benchmark against two baseline data synthesis approaches: random deletion and random insertion. In the random deletion approach, we create a synthetic sentence for each original sentence in the training dataset by deleting a single randomly selected word each time. The random insertion approach similarly creates one synthetic sentence for each original sentence in the training dataset; however, here we insert a randomly selected stop word. We use the NLTK library's list of English stop words [28]; stop words are the most frequent words used in a language, such as "a", "for", "in" or "the" [29].
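The two baselines can be sketched as follows (the stop-word set is abbreviated to the examples above; in practice it would be NLTK's full English list):

```python
import random

STOP_WORDS = {"a", "for", "in", "the"}  # abbreviated; NLTK's list in practice

def random_deletion(sentence, rng=random):
    """Baseline: delete one randomly chosen word from the sentence."""
    words = sentence.split()
    words.pop(rng.randrange(len(words)))
    return " ".join(words)

def random_insertion(sentence, rng=random):
    """Baseline: insert one randomly chosen stop word at a random position."""
    words = sentence.split()
    words.insert(rng.randrange(len(words) + 1), rng.choice(sorted(STOP_WORDS)))
    return " ".join(words)
```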

Models
In this section, we describe the models which we employ (implemented in Python). For all methods, data are pre-processed by converting to lower case and padding with zeros to reach a length of n_tok = 50 if the input is shorter. All models finish with n_L softmax classifier outputs, each with n_C classes, and are trained using a weighted categorical cross entropy loss and the Adam optimiser [30]. We weight across the labels but not across certainty classes, as class weighting did not yield improvement. Given a parameter β = 0.9 which controls the size of the label re-weighting, the number of sentences n and the number of not mentioned occurrences of a label o_l, we calculate the weights for each label from the training data. Models are trained for up to 200 epochs with an early stopping patience of 25 epochs on F1 micro; for full details on execution times, see Appendix A. We note that early stopping typically occurs after 60–70 epochs, so models generally converge after 35–45 epochs. Hyperparameter search was performed through manual tuning on the validation set, based on the micro-averaged F1 metric. All models are trained with a constant learning rate of 0.00001 and a batch size of 32.
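The exact weighting formula is not reproduced in this excerpt. As a sketch only, the following assumes a class-balanced "effective number of samples" scheme, which is consistent with the stated quantities (β = 0.9, n sentences, o_l not-mentioned occurrences) but is our guess at the formula, not the paper's:

```python
def label_weights(beta, n, not_mentioned_counts):
    """Hypothetical class-balanced label weighting (assumption, not the
    paper's published formula). Assumes w_l = (1 - beta) / (1 - beta**m_l),
    where m_l = n - o_l is the number of sentences mentioning label l, so
    rarely mentioned labels receive larger weights."""
    weights = {}
    for label, o_l in not_mentioned_counts.items():
        m_l = n - o_l  # sentences that mention the label
        weights[label] = (1 - beta) / (1 - beta ** m_l)
    return weights
```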

BERT Pre-Training Variants
All BERT variants use the same model architecture as the standard pre-trained BERT model; the "bert-base-uncased" weights are available for download online (https://github.com/google-research/bert, accessed on 1 November 2020) and we use the huggingface [31] implementation. We take the output representation for the CLS token of size 768 × 1 at position 0 and follow with the n_L softmax outputs. For BioBERT, we use a Bio-/ClinicalBERT model [20] pre-trained on both PubMed abstracts and the MIMIC-III dataset. We use the same training parameters as for BERT (above). The PubMedBERT model uses a different vocabulary to other BERT variants, extracted from PubMed texts, and is therefore more suited to medical tasks [9]. We use the pre-trained huggingface model (https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext, accessed on 1 November 2020).

ALARM-Based Models
When training neural networks, we find that accuracy can be reduced where there are many classes. Here, we describe the per-label attention mechanism [32] as seen in Figure 8, an adaptation of the multi-label attention mechanism in the CAML model [5]. We can apply this to the output of any given neural network subarchitecture; here, we use it in combination with BERT variants. We define the output of the subnetwork as r ∈ R^(n_tok × h), where n_tok is the number of tokens and h is the hidden representation size. The parameters we learn are the weights W_0 ∈ R^(h × h) and bias b_0 ∈ R^h. For each label l, we learn an independent v_l ∈ R^h to calculate an attention vector α_l ∈ R^(n_tok):

α_l = softmax(tanh(r W_0 + b_0) v_l),   s_l = r^T α_l.

The attended output s_l ∈ R^h is then passed through n_L parallel classification layers, reducing dimensionality from h to n_C, to produce the class scores for each label. During computation, the parallel representations can be concatenated into α and s respectively, as shown in Figure 8.
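A minimal numpy sketch of per-label attention in this CAML-style form, following the dimension definitions above (this is an illustration, not the paper's training code, which operates on batched tensors in a deep learning framework):

```python
import numpy as np

def per_label_attention(r, W0, b0, V):
    """Per-label attention sketch.
    r:  (n_tok, h) token representations from the subnetwork
    W0: (h, h) shared weights; b0: (h,) shared bias
    V:  (n_L, h), one independent attention vector v_l per label
    Returns the attended outputs s of shape (n_L, h)."""
    u = np.tanh(r @ W0 + b0)                        # (n_tok, h)
    scores = u @ V.T                                # (n_tok, n_L)
    scores -= scores.max(axis=0, keepdims=True)     # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    return alpha.T @ r                              # (n_L, h): s_l = r^T alpha_l
```

Each row of the result would then feed the corresponding label's classification layers.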

Our ALARM + per-label-attention model, inspired by the ALARM [17] model, uses the entire learnt representation of size 768 × n_tok instead of a single output vector of size 768 × 1. We employ n_L per-label attention mechanisms instead of a single shared attention mechanism before passing through three fully connected layers per label, and follow with the n_L softmax outputs. Similar to the simple BERT model, we can substitute BioBERT or PubMedBERT for the underlying BERT model in this architecture.

Results
In this section, we firstly investigate the impact of the pre-training dataset and vocabulary on the task accuracy. Secondly, we investigate the effect of our data synthesis templates by validating on our Phase 1 data before going on to show how this model can be used on the prospective Phase 2 dataset. Finally, we compare our model to EdIE-R, a state-of-the-art rules-based approach for label extraction from head CT radiology reports.
In terms of metrics, we report both micro- and macro-averaged F1 score: the micro score is calculated across all labels and gives an idea of the overall performance, whilst the macro score is averaged across labels with equal weighting for each label (we do not weight equally across certainty classes). We note that, although we use micro F1 as our early stopping criterion, we do not observe an obvious difference in the scores if macro F1 is used for early stopping. We exclude the not mentioned class from our metrics, similar to the approach used by Smit et al. [18]. All results are reported as the mean and standard deviation of 10 runs with different random seeds.
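These metrics can be sketched as follows (the sentence-level label-to-certainty dictionary format is our illustration, not necessarily the paper's evaluation code; the 'not mentioned' class is excluded as described above):

```python
def f1_scores(y_true, y_pred, labels, skip="not mentioned"):
    """Micro- and macro-averaged F1 excluding the 'not mentioned' class.
    Macro averages per-label F1 with equal label weight. y_true/y_pred are
    lists of dicts mapping label -> certainty class per sentence."""
    counts = {l: [0, 0, 0] for l in labels}  # tp, fp, fn per label
    for t, p in zip(y_true, y_pred):
        for l in labels:
            tc, pc = t.get(l, skip), p.get(l, skip)
            if pc != skip and pc == tc:
                counts[l][0] += 1           # true positive
            else:
                if pc != skip:
                    counts[l][1] += 1       # false positive (wrong or spurious)
                if tc != skip:
                    counts[l][2] += 1       # false negative (missed or wrong)
    def f1(tp, fp, fn):
        return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    micro = f1(*[sum(c[i] for c in counts.values()) for i in range(3)])
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return micro, macro
```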

What Impact Does the Pre-Training Dataset Have on Task Accuracy?
In this section, we investigate the impact of the BERT pre-training dataset on our model's accuracy for label extraction. Table 4 shows the results for all models on our Phase 1 independent test set. The results show that, regardless of the model architecture variant, the PubMedBERT pre-trained weights produce a positive effect on the results for both micro and macro F1. The main differences between the BERT variants are the pre-training datasets and the vocabulary that the models use, so we investigate this in more detail.
The BERT, BioBERT and PubMedBERT vocabularies contain 30,522; 28,996; and 30,522 words, respectively. BioBERT should have the same vocabulary as the original BERT model as it is initialised with that model's weights; however, we find that the pre-trained implementations we are using have slightly differing vocabularies (a few words have been removed from the BioBERT vocabulary). We have 1827 unique words across the training, validation, and independent test datasets. Of those words, we find that 710 words are not in the vocabulary of the BERT model and similarly 784 words are not in the BioBERT vocabulary; we note that all 710 words that are unknown to BERT are also unknown to BioBERT. In comparison, only 496 words are not in the PubMedBERT vocabulary; 461 of those words overlap with the BERT and BioBERT out-of-vocabulary (OOV) words. Table 5 shows the breakdown for our training, validation, and independent test datasets. Table 6 highlights some different tokenisation outputs for five of our 33 labels. We see that words such as haemorrhage or hydrocephalus are known to the PubMedBERT model but are tokenised into five separate word pieces by the original BERT tokeniser. Thus, the model can learn to attend to one token rather than being required to learn a sequence of five tokens.

What Impact Does Data Synthesis Have on Task Accuracy?
In this section, we report results on our independent test set that we introduced in Section 3.2. The results are shown in Table 7 for our best model, ALARM (PubMedBERT) + per-label attention. In addition to the F1 scores for all labels, we highlight performance on the Tumour label. Table 7 shows that the baseline results for random deletion and insertion do not yield any improvements on our dataset. The sentences in our dataset are quite short, with most words carrying meaning, so deleting words actually harms our model. Comparing results for our subsets of synthetic data created by different types of templates, we can see that the injection of UMLS synonyms for the Tumour label makes a clear difference, giving a significant boost to the F1 macro score.
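The word-piece splitting effect reported for Table 6 can be illustrated with a toy greedy WordPiece tokeniser (the mini-vocabularies below are illustrative stand-ins for the real BERT and PubMedBERT vocabulary files):

```python
def wordpiece(word, vocab):
    """Simplified greedy longest-match-first WordPiece tokenisation."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            sub = word[start:end] if start == 0 else "##" + word[start:end]
            if sub in vocab:
                pieces.append(sub)
                break
            end -= 1
        if end == start:      # no sub-piece matched at all
            return ["[UNK]"]
        start = end
    return pieces

# Toy vocabularies: a BERT-like one lacking the whole word, and a
# PubMedBERT-like one containing it as a single token.
BERT_VOCAB = {"ha", "##em", "##or", "##rh", "##age"}
PUBMED_VOCAB = {"haemorrhage"}
```

Under the BERT-like vocabulary, "haemorrhage" fragments into five pieces; under the PubMedBERT-like vocabulary it stays whole, so the model can attend to a single token.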
When training our model only on the template synthetic data (see ablations in Table 7), we see that the numbers are significantly lower than when combined with the original real data. This shows that the synthetic data do not contain the variety of language that was present in the real data, especially around expressing uncertainty. Furthermore, when adding UMLS synonyms, we targeted labels that were rare and poorly detected. If we wished to rely more heavily on data synthesis, we would also need to provide synonyms for the common labels.

What Impact Does Data Synthesis Have on Task Accuracy for Prospective Data?
During analysis of model performance on the independent test set, we notice recurring patterns in which our model trained with the generic template data repeatedly misclassifies positive and uncertain mentions. This is often due to very specific labelling rules that have been added to the protocol, e.g., "suggestive of" is always labelled as positive compared to "suspicious of" which is always labelled as uncertain. After evaluation of our independent test set, we extracted rules for the most common mistakes into templates as shown previously in Figure 7. We use the prospective test data to evaluate this new set of templates because the protocol was influenced by all three of our training, validation and independent test datasets (see results in Table 8).
To highlight the fast-changing human annotator protocol and the adaptability of our template system, we note that, when labelling the prospective test set, our annotators encountered various mentions of infections which did not fit into the set of labels at the time. The medical experts decided to add the Infection label to our labelling system (see Figure 3). Even though we have no training examples for this label, using our template system, we could easily generate additional training data for this label, including the injection of synonyms from UMLS. Table 8 shows that the protocol-based templates provide small improvements for both micro and macro F1 performance. To further evaluate the addition of the protocol-based templates, we manually extract sentences from the unlabelled dataset which contain the phrases "suggestive", "suspicious" and "rather than". This search results in 510 sentences, of which we randomly select a subset of 100 for evaluation. The generic and protocol-based model predictions for these 100 sentences are compared. The protocol-based template model outperforms the previous model with an F1 micro score of 0.952 compared to 0.902. Figure 9 shows an example sentence for which the generic template model made incorrect predictions and the protocol templates helped the protocol-based model make the correct predictions.

Figure 9. An example sentence from the unlabelled dataset ("[CLS] this is thought most likely to reflect "luxury perfusion" in a relatively new stroke rather than evidence of mass lesion. [SEP]") with predictions made by our generic (top) and protocol-based (bottom) template models. The attention for the infarct and lesion labels is overlaid on the sentences. The protocol-based template shown in the middle row of the figure ("More likely [IMPRESSION1] rather than [IMPRESSION2].") enables the model to correctly classify these sentences. To simplify the visualisation, the higher-weight attention (either for infarct or lesion) has been overlaid on each word.

Comparison of the Proposed Method with a Rules-Based System
EdIE-R [11] is a rule-based system which has also been designed to label radiology reports for head CT scans from stroke patients [33]. However, the labels that the rules were created for are slightly different, so, in order to compare this model with ours, we have mapped the EdIE-R labels to a subset of our labels as follows: Ischaemic stroke to Infarct/Ischaemia; Haemorrhagic stroke and Haemorrhagic transformation to Haemorrhage/Haematoma; Cerebral small vessel disease, Tumour and Atrophy to our identical labels. Since EdIE-R dichotomises labels into negative and positive mentions and does not explicitly model uncertain mentions, we have ignored uncertain mentions in our metric calculations. Therefore, the results in Table 9 are not directly comparable to those in other sections. We use the EdIE-R implementation provided by the authors (https://github.com/Edinburgh-LTG/edieviz, accessed on 1 November 2020).
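The label mapping just described can be written down directly (the dict-of-certainties prediction format is our illustration):

```python
# Mapping of EdIE-R labels onto our label set, as described above.
EDIER_TO_OURS = {
    "Ischaemic stroke": "Infarct/Ischaemia",
    "Haemorrhagic stroke": "Haemorrhage/Haematoma",
    "Haemorrhagic transformation": "Haemorrhage/Haematoma",
    "Cerebral small vessel disease": "Cerebral small vessel disease",
    "Tumour": "Tumour",
    "Atrophy": "Atrophy",
}

def map_edier(predictions):
    """Translate EdIE-R label predictions into our label space, dropping
    any label with no counterpart in our annotation scheme."""
    return {EDIER_TO_OURS[l]: c for l, c in predictions.items() if l in EDIER_TO_OURS}
```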
Table 9. We compare EdIE-R to our best model across five labels that overlap between the two annotation systems. We report the performance of EdIE-R against the mean ± standard deviation of 10 runs with different random seeds for our approach. Results are for our independent test set. Bold indicates the best model for each metric. CSVD = cerebral small vessel disease.

Figure 10 shows confusion matrices of the certainty classes for our best model and the EdIE-R model, respectively; these figures contain results for the subset of labels that the models have in common. From these matrices and Table 9, it is clear that our model has a higher overall accuracy than the rules-based approach when evaluated on the independent test set. The "overall accuracy" metric shown in the figures is the simple accuracy metric over the three certainty classes positive, negative and not mentioned. Due to the differences in label definitions, the EdIE-R approach over-predicts positive mentions of the tumour label. On inspection, we observe that the EdIE-R system labels any mention of "mass" as a tumour, while, in our system, a mass is only labelled as a tumour if there is a specific mention of "tumour" or a subtype of tumour (e.g., "meningioma"); otherwise, we label it as a (non-specific) Lesion. It is therefore likely that the Tumour label is defined differently between our annotation protocol and the protocol used by Alex et al. [33].

Discussion
Our results show that we can successfully augment our training data with synthetic data generated from templates. These synthetic data guide our model to learn provided rules by example, creating a model which can benefit from both rules-based and deep learning approaches. As we have shown in Section 4.3, our approach is adaptable to new labels, as the templates can be used to generate new training examples. However, our results also open some questions, which we discuss in this section.

Difference in Accuracy between Phase 1 and Phase 2 Test Data
Phase 1 performance is close to the inter-annotator agreement F1 performance shown in Table 2, especially for F1 micro. We can expect that humans are better at picking out rare labels, so the gap between human and model F1 macro performance is slightly larger, although we have successfully narrowed this gap with the addition of templates. We observe a drop in performance of approximately 0.06 in both F1 micro and F1 macro between the Phase 1 test data and the prospective Phase 2 test data. This drop is consistent across all metrics and classes. On review of the data, we observe not only a difference in the label distributions (e.g., the new labels Infection and Pneumocephalus appear; see Appendix B) but also a higher proportion of sentences without labels in the Phase 2 dataset: 44% of sentences in the Phase 2 dataset were not assigned any labels by the annotators, resulting in many more potentially confounding sentences than in the Phase 1 dataset, in which 23% were not assigned any labels.
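The sensitivity of F1 macro to rare labels can be seen in a toy calculation (the counts below are invented for illustration, not taken from our results): macro-averaging weights every label equally, so a single poorly handled rare label drags it down, while micro-averaging pools all decisions and is dominated by the common labels.

```python
# Toy illustration (hypothetical counts) of the micro/macro F1 gap.
def f1(tp, fp, fn):
    """Standard F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# (tp, fp, fn) per label: a common label handled well, a rare label handled badly.
per_label = {"common": (900, 50, 50), "rare": (1, 4, 5)}

# Macro: unweighted mean of per-label F1 scores.
macro = sum(f1(*counts) for counts in per_label.values()) / len(per_label)
# Micro: F1 over the pooled counts of all labels.
micro = f1(*(sum(c[i] for c in per_label.values()) for i in range(3)))
```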
We posit two reasons for this increase in label-free examples. First, the Phase 2 dataset did not exclusively contain scans for suspected stroke events but also contained studies performed for other reasons, e.g., sinus abnormalities. This arose because we had access to head CT radiology reports within an 18-month period either side of the stroke event. Second, we changed the method by which we extracted sentences between the two phases. In Phase 1, sentences were manually extracted from the body of the reports by human annotators, whereas, in Phase 2, we implemented an automatic pipeline to extract and segment a report into sentences. As a result, any human bias in sentence selection (effectively curation) was not reproduced in the automatic pipeline, and we observed that many more irrelevant sentences were extracted for annotation, for instance describing results from other types of scans (e.g., CTA) or other non-imaging patient details.
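The Phase 2 extraction pipeline is not specified in detail here; a minimal sketch of a rule-based report-to-sentence segmenter of the kind described, with a deliberately simple (hypothetical) splitting heuristic, might look like:

```python
import re

def segment_report(report: str) -> list[str]:
    """Split a plain-text report body into candidate sentences for annotation.

    Illustrative heuristic only: normalise whitespace, then split after
    sentence-final punctuation that is followed by an uppercase letter or
    digit. A production pipeline would need to handle abbreviations,
    headings and list items.
    """
    text = re.sub(r"\s+", " ", report).strip()
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", text)
    return [p for p in parts if p]
```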
In summary, the Phase 2 dataset gave a good insight into performance "in the wild," and we are satisfied that the performance drop was not excessive.

Limitations of the Comparison with EdIE-R
In this paper, we have shown that our approach performs more accurately than a pure rule-based approach such as EdIE-R. However, we did not have access to the dataset and labels on which EdIE-R was developed and validated, making the comparison an unequal one, especially since there are likely to be differences in the labelling rules.
The fact that the EdIE-R approach [11] is rule-based means it cannot simply be retrained on our dataset. Instead, it would be necessary to rewrite some of the rules in the system. Whilst this makes a fair comparison difficult, it also highlights the benefit of an approach such as ours, which can be adapted to a change in the labelling system: the model can either be retrained on a different labelled dataset, or suitable new data can be synthesised using templates. In the case of the definition of Tumour, in order to validate on a dataset labelled with the EdIE-R protocol, we could include "mass" (and any other synonyms of "mass" that UMLS provides) as a synonym for Tumour.
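Adapting the synonym database in this way is a small, local change; a sketch is shown below, assuming synonyms are kept as a label-to-surface-forms mapping (the structure and entries are hypothetical, not our actual database):

```python
# Hypothetical synonym store: label -> set of surface forms.
SYNONYMS = {
    "Tumour": {"tumour", "tumor", "meningioma", "glioma"},
}

def adapt_to_edier(synonyms):
    """Return a copy of the synonym store where "mass" also counts as Tumour,
    matching the EdIE-R labelling protocol; the original store is unchanged."""
    adapted = {label: set(forms) for label, forms in synonyms.items()}
    adapted["Tumour"].add("mass")
    return adapted
```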

Synthetic Data Distribution
We have been careful to retain a valid data distribution when generating synthetic data. For the protocol-based templates, we selected only valid pairs of findings and impressions according to the scheme shown in Figure 3, and used only impression labels for the templates relating to clinical impressions. We performed an ablation in which synthetic data were created by selecting from all labels at random (results not shown in the paper) and did not notice a significant difference in accuracy, which suggests that the model is not heavily leveraging inter-label relationships.
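As a concrete illustration, protocol-based synthesis of this kind can be sketched as follows; the templates, slot names and finding→impression pairs below are invented for illustration and are not the ones used in the study:

```python
import itertools

# Hypothetical templates with [FINDING]/[IMPRESSION] slots.
TEMPLATES = [
    "There is [FINDING], likely representing [IMPRESSION].",
    "[FINDING] in keeping with [IMPRESSION].",
]

# Hypothetical valid finding -> impression pairs (cf. the scheme in Figure 3);
# only these combinations are used, preserving a valid data distribution.
VALID_PAIRS = {
    "hyperdensity": ["haemorrhage", "haematoma"],
    "low attenuation": ["infarct"],
}

def generate_sentences():
    """Fill every template with every valid finding/impression pair."""
    sentences = []
    for template, (finding, impressions) in itertools.product(
        TEMPLATES, VALID_PAIRS.items()
    ):
        for impression in impressions:
            sentences.append(
                template.replace("[FINDING]", finding)
                        .replace("[IMPRESSION]", impression)
            )
    return sentences
```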

Utilising Templates in an Online Learning Setting
A possible application of template data synthesis could be in an online learning setting. In this scenario, a human could actively spot misclassifications and edit or add new templates or synonyms to the database, triggering the model to be retrained using the new templates. This would allow medical experts to continually update the model and fine-tune it for the datasets they are using. We could go further in the automation process and train a machine learning algorithm to propose templates for misclassifications to the human. The expert would then simply have to accept the template for the model to be retrained.

Conclusions
We have proposed the use of templates to inject expert knowledge of rare classes and to teach the model about labelling rules via synthetic data generated from templates. We have shown that, using this mechanism alongside domain-specific pre-training, we are able to robustly extract multiple labels from head CT reports for stroke patients. Our mechanism both improves generalisation over the existing system and provides the ability to adapt to new classes or examples observed in the test population, without requiring extensive further data annotation effort.

Figure 1. Figure 2.
Figure 1. Our original set of radiology reports is annotated by three medical annotators (clinical researcher and medical students) at sentence level. The training dataset is augmented with synthetic data generated from templates. We inject knowledge from the UMLS [8] meta-thesaurus and medical experts into the templates to teach the model about rare synonyms and annotation protocol rules. Our model then predicts labels for the given sentences.

Figure 3.
Figure 3. Label schema: 13 radiographic findings, 16 clinical impressions and 4 crossover labels, which are indicated with a single asterisk. Finding→impression links are shown schematically. * These labels fit both the finding and impression categories. ** Haematoma can indicate other pathology, e.g., trauma. *** Where labels refer to chronic (rather than acute) phenomena, they indicate brain frailty [27].

Figure 4.
Figure 4. These are the "generic" templates, which aim to provide an example for every entity class (simple templates) with entities at different positions in the sentence (permuted templates).

Figure 7.
Figure 7. Protocol-derived templates which generate examples of protocol-specific rules in action.

Figure 8.
Figure 8. Our model architecture: PubMedBERT [9] maps from input x to a hidden representation r (this sub-architecture, indicated in yellow, can be replaced by another BERT variant); per-label attention maps to s, which contains a separate representation for each label; the attention vector α can be visualised by overlaying it on the original text, again per-label; finally, the representation is passed through three classification layers to produce a per-label, per-class prediction.

Figure 10.
Figure 10. Confusion matrices showing the performance of our model (left) and the EdIE-R system (right) on the independent test data across the certainty class subset of positive, negative and not mentioned.

Table 1.
Summary statistics for the NHS GGC datasets used in this work. The validation set is used for hyperparameter and best model selection.

Table 2. Inter-annotator agreement: Cohen's Kappa, F1 Micro, F1 Macro.
Schematic representation of a part of the subgraph of the UMLS meta-thesaurus used to extract synonyms for infection.Blue ovals represent individual concepts (CUIs), while the attached yellow rectangles contain the different surface forms of that concept.Each arrow represents an inverse_isa relationship.

Table 3.
Summary statistics for the synthetic datasets used in this work. For the templates, the number of generated sentences depends on the total number of synonyms S, the number of finding synonyms S_F and the number of impression synonyms S_I. For the compound templates (combined, protocol-based), the number of generated sentences further depends on the template[label] combinations that are sampled; here, we indicate the upper bound (UB):

Table 4.
Micro- and macro-averaged F1 results as mean ± standard deviation of 10 runs with different random seeds. Bold indicates the best model for each metric.

Table 5.
Comparison of BERT, BioBERT and PubMedBERT unknown vocabulary in our dataset.

Table 6.
Comparison of tokeniser output for original BERT and PubMedBERT. We show the number of tokens in brackets, followed by the tokens separated by the conventional ## symbol.

Table 7.
Micro- and macro-averaged F1 results on our independent test set as mean ± standard deviation of 10 runs with different random seeds. Bold indicates the best model for each metric.

Table 8.
Micro- and macro-averaged F1 results on our prospective test set as mean ± standard deviation of 10 runs with different random seeds. Bold indicates the best model for each metric.