PreRadE: Pretraining Tasks on Radiology Images and Reports Evaluation Framework

Recently, self-supervised pretraining of transformers has gained considerable attention in analyzing electronic medical records. However, systematic evaluation of different pretraining tasks in radiology applications using both images and radiology reports is still lacking. We propose PreRadE, a simple proof-of-concept framework that enables novel evaluation of pretraining tasks in a controlled environment. We investigated the three most commonly used pretraining tasks (MLM — Masked Language Modelling, MFR — Masked Feature Regression, and ITM — Image to Text Matching) and their combinations against downstream radiology classification on MIMIC-CXR, a medical chest X-ray imaging and radiology text report dataset. Our experiments in the multimodal setting show that (1) pretraining with MLM yields the greatest benefit to classification performance, largely due to the task-relevant information learned from the radiology reports, and (2) pretraining with only a single task can introduce variation in classification performance across different fine-tuning episodes, suggesting that composite task objectives incorporating both image and text modalities are better suited to generating reliably performant models.


Introduction
The development of self-supervised pretraining and transformer architectures has produced predictive models that surpass their supervised counterparts on many image and language benchmarks [1,2]. With the availability of large collections of paired radiology X-ray images and text reports, recent attention has been directed towards developing novel implementations of these methods suited to multimodal radiology data [3][4][5]. There has been rapid growth in this field and early results show promise. However, there exists no known domain-specific study into pretraining (pretext) tasks in isolation from the model architecture and training protocol, leading to uncertainty about task suitability.
Evaluation of pretraining methods typically involves assessing the resulting model on downstream tasks and datasets relevant to the domain and/or application [1,6]. Within radiology, common evaluation tasks include multi-label pathoanatomical classification [5], medical visual question answering [4], and medical report generation [3,5]. While domain-agnostic evaluation frameworks exist [6,7] and offer guidance on establishing methods to isolate components in a controlled manner, their focus is on performance across domains at the expense of specificity.
The primary aim of this study is to examine the effect of varying the pretext task used in pretraining on downstream radiology classification tasks. We implemented several common baseline tasks in a controlled evaluation framework using a consistent model architecture and training protocol, and empirically measured their impact on downstream classification performance. Thus, this study:

• Introduced an empirical evaluation framework for model-agnostic evaluation of vision and language pretext tasks on downstream classification tasks. (The PyTorch Lightning implementation of our framework, along with data processing and analysis notebooks, is available at https://github.com/mwcoleman/prerade/.)

• Considered, for the first time, the impact of pretext tasks applied to multimodal radiology data on multi-label pathoanatomical classification.

• Performed controlled studies on Masked Language Modelling (MLM), Masked Feature Regression (MFR), Image to Text Matching (ITM), and their combinations.

Materials and Methods
This section first provides a brief overview of the evaluation framework, followed by an introductory background to self-supervised pretraining and a description of the three pretraining tasks (MLM, MFR and ITM) that were selected for evaluation in this study.

Framework Overview
This study evaluated combinations of common pretraining tasks in a controlled setting by using a consistent model architecture, datasets and training protocols. Figure 1 outlines the framework and process used for this study. Experimental details covering the data processing steps, model components, training protocols and evaluation protocols are provided in Sections 2.5-2.8. Beginning with a single-stream multimodal transformer (refer to Section 2.6), eight instances were pretrained using combinations of three self-supervised learning pretext tasks (refer to Section 2.2) on a large multimodal radiology dataset (refer to Section 2.5). These instances were fine-tuned on a smaller labelled dataset from the same data source, before being evaluated on a held-out test set (from the same source). The fine-tuned instances were also evaluated on a dataset from a different source that was not seen during training. The fine-tuning and evaluation process was repeated 20 times, varying the random seed, to obtain uncertainty estimates of performance.

Figure 1.
The evaluation framework overview. Note: Eight pretraining task combinations were used to pretrain eight model instances. These were then fine-tuned, along with two baseline models, and evaluated 20 times on pathoanatomical classification.

Self-Supervised Pretraining
Self-supervised image and text pretraining aims to learn an encoder function f_0 that maps paired inputs (w, v), where w is a text sequence w = {w_1, ..., w_m} of length m and v is the accompanying set of image features v = {v_1, ..., v_n} of length n, to a representative fixed-dimension vector useful for downstream tasks (refer to Table 1 for notation). A variety of pretext tasks have been developed for multimodal self-supervised pretraining implemented with transformer architectures (commonly referred to as Bidirectional Encoder Representations from Transformers (BERT)). In this study we investigated Masked Language Modeling (MLM), Masked Feature Regression (MFR), and Image to Text Matching (ITM), tasks common to training modern and performant implementations such as UNiversal Image-TExt Representation (UNITER) [8], VisualBERT [9], Learning Cross-Modality Encoder Representations from Transformers (LXMERT) [10], Vision-and-Language BERT (VilBERT) [11] and VL BERT [12]. These state-of-the-art implementations all use individual modality (text/image) embedding layers combined with a transformer (self-attention) encoding base. However, they differ in the architecture of the attention mechanism (Figure 2). Some propose a single-stream joint encoder with self-attention [8,9,12], while others introduce an inductive bias towards inter-modal attention by using two streams and cross-modal attention [10,11].
Figure 2. VL BERT architectures: single stream with concatenated input embeddings and self-attention (left) vs. dual stream with explicit cross-modal attention (right). Note: 'K', 'Q', 'V' are the attention key, query, and value terms, 'W' is a text input and 'V' is a visual input.

Masked Language Modeling (MLM)
Masked Language Modeling is a unimodal (text) pretraining task [13]. The process randomly masks out input tokens, replacing them with a generic token called [MASK]. The encoder learns representations by maximizing the likelihood of predicting each masked token's true label conditioned on the remaining token embeddings. The supervisory signal (loss) is implemented with cross entropy. In the joint image and text setting, the training loss for one training pair is:

L_MLM(w, v) = −Σ_{i∈M} log P(w_i | w_\M, v),

where M is the set of masked token indices and w_\M denotes the remaining (unmasked) tokens.
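For illustration, the following is a minimal PyTorch sketch of the MLM masking step, assuming a tokenizer that exposes a [MASK] token id and the 15% masking rate used later in Section 2.7; the function name and interface are illustrative rather than the framework's actual API:

```python
import torch

def mask_tokens(token_ids: torch.Tensor, mask_token_id: int, p_mask: float = 0.15):
    """Randomly replace a proportion of tokens with [MASK] and build MLM targets.

    token_ids: (seq_len,) tensor of input token ids.
    Returns (masked_ids, labels); labels are -100 at unmasked positions so that
    cross-entropy ignores them.
    """
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < p_mask      # Bernoulli(p_mask) per position
    labels[~mask] = -100                             # only masked positions contribute to the loss
    masked_ids = token_ids.clone()
    masked_ids[mask] = mask_token_id                 # replace selected tokens with [MASK]
    return masked_ids, labels

# Cross-entropy over the vocabulary at masked positions:
# loss = torch.nn.functional.cross_entropy(logits.view(-1, vocab_size),
#                                          labels.view(-1), ignore_index=-100)
```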

Masked Feature Regression (MFR)
Masked Feature Regression randomly selects one or more image features and masks them by replacing them with zero-padded vectors of equivalent length [8]. The encoder learns representations by reconstructing the original feature vectors. The loss for one training pair is implemented as the sum, over masked vector indices, of L2 distances between each reconstructed and original vector:

L_MFR(w, v) = Σ_{i∈M} ‖ v̂_i − v_i ‖_2,

where M is the set of masked feature indices and v̂_i is the reconstruction of the original feature vector v_i.
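A minimal PyTorch sketch of the feature-masking and regression loss described above; names and shapes are illustrative, not the framework's actual API:

```python
import torch

def mask_features(features: torch.Tensor, p_mask: float = 0.15):
    """Zero out a random subset of region features and record which were masked.

    features: (n_regions, feat_dim) tensor of extracted image features.
    """
    mask = torch.rand(features.shape[0]) < p_mask
    masked = features.clone()
    masked[mask] = 0.0                               # replace masked regions with zero vectors
    return masked, mask

def mfr_loss(reconstructed: torch.Tensor, original: torch.Tensor, mask: torch.Tensor):
    """Sum of L2 distances between reconstructed and original masked feature vectors."""
    diff = reconstructed[mask] - original[mask]
    return diff.norm(dim=-1).sum()
```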

Image to Text Matching (ITM)
Image to Text Matching falls within the contrastive learning category of pretext tasks and is a multimodal variant of next sentence prediction [10]. During preprocessing, a selection of samples is randomly chosen and either their image or text input is swapped with the corresponding modality from another data point, resulting in the encoder being presented with a misaligned input pair. The encoder's task is to determine whether the two modalities are paired or not. The representations are learned by maximizing (minimizing) the similarity of paired (unpaired) inputs, with the loss for one training pair implemented via binary cross entropy:

L_ITM(w, v) = −[ y log p(w, v) + (1 − y) log(1 − p(w, v)) ],

where y ∈ {0: unmatched, 1: matched} and p(w, v) is the predicted probability that the pair is matched.
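A minimal sketch of the pair-swapping step described above, assuming the 50% sampling rate used in Section 2.7; the helper name and interface are illustrative:

```python
import torch

def make_itm_batch(texts, images, p_swap: float = 0.5):
    """For a random selection of the batch, pair the text with the image of another sample.

    texts, images: sequences of equal length.
    Returns (texts, out_images, labels) with labels 1 = matched, 0 = unmatched.
    """
    n = len(texts)
    swap = torch.rand(n) < p_swap
    perm = torch.randperm(n)
    out_images, labels = [], []
    for i in range(n):
        j = int(perm[i])
        if swap[i] and j != i:
            out_images.append(images[j])             # misaligned (unmatched) pair
            labels.append(0.0)
        else:
            out_images.append(images[i])             # original (matched) pair
            labels.append(1.0)
    return texts, out_images, torch.tensor(labels)

# Binary cross-entropy on the encoder's matching score:
# loss = torch.nn.functional.binary_cross_entropy_with_logits(match_logits, labels)
```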

Pretraining on Multimodal Radiology Data
Following the release of large publicly available multimodal datasets [14], VL BERT architectures and training protocols have been adapted for radiology tasks. Notably, two studies incorporated variants of the pretext tasks above to pretrain on radiology images and accompanying reports [4,5]. Each study reported state-of-the-art (SOTA) results in terms of accuracy, BiLingual Evaluation Understudy (BLEU) score, Area Under the ROC (Receiver Operating Characteristic) Curve (AUC), and ranked retrieval precision. Yet neither of these studies investigated the effect of varying the pretext task, focusing instead on novel problem domains (medical visual question answering) [4] or training data regimes [5]. Furthermore, neither study conducted repeated evaluations to assess the variance of results.
Within the radiology data domain, the closest work to this current study was a comparative study evaluating four performant pretrained VL BERTs (LXMERT [10], VisualBERT [9], UNITER [8] and PixelBERT [15]) on multimodal and text-only pathoanatomical classification tasks, using the MIMIC-CXR and OpenI publicly available chest X-ray datasets as training and evaluation datasets, respectively. That study found that these models, pretrained on general domain corpora only, transferred well to radiology data and tasks, and reported SOTA performance in terms of per-label and averaged AUC. Our study differed in two key ways: (1) we focused on a model-agnostic evaluation of the choice of pretext task, conducting in-domain pretraining using the same architecture and training protocol, and (2) we performed controlled fine-tuning experiments to evaluate the consistency of results.

Evaluating Pretraining Performance
The evaluation of pretraining methods typically involves assessing the resulting model on downstream tasks and datasets relevant to the domain and/or application [1,6]. While domain-agnostic evaluation frameworks exist [6,7] and offer guidance on establishing methods to isolate components in a controlled manner, their focus has been on performance across domains at the expense of specificity.
Within radiology, common evaluation tasks include multi-label pathoanatomical classification, medical visual question answering, and medical report/image generation. Of these, pathoanatomical classification requires the least amount of task-specific modelling components, which is relevant to our work concerned with model-agnostic evaluations. Even so, there are limited evaluation studies specific to multimodal radiology data (refer to Section 2.3), and no known study evaluating pretext tasks in isolation.
The wider machine learning literature contains established practices for conducting comparative studies in controlled environments. Our approach is closely related in philosophy to a recent study [16] that theoretically and experimentally investigates modelling choices for VL BERTs. While their study includes a large theoretical component (in contrast with ours) and is focused on architecture choices in the general domain, aspects of our experimental approach are influenced by their framework. Controlled experimentation with analysis of uncertainty is a necessary condition for adoption within the medical domain and yet is often neglected in the core machine learning literature [17].

Data
This study used the MIMIC-CXR dataset, a large publicly available chest X-ray dataset containing 377,110 radiograph images corresponding to 227,827 imaging studies with free-text radiology reports, sourced from the Beth Israel Deaconess Medical Center in North America [14]. Labels for 13 pathoanatomical findings were derived by the dataset authors using natural language processing (NLP) methods [18,19], with a reported recall of 0.91 and F1-score of 0.83. Iterative stratification [20,21] was performed to obtain pretraining (n = 112,124), fine-tuning (n = 6228) and test (n = 6228) splits.
This study also used the OpenI dataset, a publicly available chest X-ray dataset containing 8121 radiograph images corresponding to 3996 imaging studies with free-text radiology reports, sourced from different institutions across North America [22]. While no explicit labels are provided, pathoanatomical findings from 14 categories are available in the form of radiologist-provided Medical Subject Headings (MeSH) terms. Due to the limited amount of data in the OpenI dataset, it is not viable for use in self-supervised pretraining. However, we followed existing work [23] in using the dataset as an external evaluation dataset (i.e., no OpenI data were used for training), deriving binary labels for the seven categories that also appear in the MIMIC-CXR dataset.

Data Preprocessing
The preprocessing of the study dataset was undertaken to split the data into pretraining, fine-tuning, and test sets:

• The samples were chosen by selecting imaging studies that contain Antero-Posterior (AP) views (the only view type present in all studies) and a 'findings' section in the radiology text report. A paired sample contained both a text report and an image (i.e., AP view). This resulted in a set of 124,580 paired samples.

• The sample labels were obtained from the provided label set [14]. We followed previous works [23] in processing the labels to obtain a binary label per finding category. Missing or uncertain findings were replaced with a negative (i.e., no finding).

• The image features were extracted using the pretrained feature extractor described in Section 2.6. For studies containing multiple AP views, the first view was selected for feature extraction. After feature extraction, each image was represented by a set of position embeddings V_p ∈ ℝ^(36 × 4) and feature embeddings V_f ∈ ℝ^(36 × 1024). Dimensions were chosen for compatibility with the existing pretrained extractor model parameters.

• The text component was processed to remove all text prior to the 'findings' section in the radiology report. The resulting string was truncated to 125 characters (due to computational constraints, as discussed in Section 4.4). The 'summary' section of the text report was moved to the beginning of the text, to ensure the summary text was not truncated.

• A stratified 90:10 split was conducted on the 124,580 paired samples using iterative stratification (equal distribution, order = 4) to maintain multi-label proportionality and obtain dataset splits that are representative of the parent distribution [20,21]. The training set (90%, 112,124 samples) was used for pretraining the model.

• The remaining 10% was withheld for evaluation, with a 50:50 stratified split into a fine-tuning set and a test set, each containing 6228 samples. The sample sizes were chosen based on the reported availability of fine-tuning data in related radiology use cases [24] and typical test set sizes in published datasets [14,22].
Table 2 shows the prevalence of positive findings for each label in the datasets used in our experiments. The labeled data were processed to create binary labels for each finding category, with missing or uncertain labels modified to 'no finding' [23]. The details and links to data processing code and notebooks are provided in Appendix A. Pleural Effusion had the highest proportion of positive findings across the MIMIC-CXR splits (Pretrain: 27.36%; Finetune: 29.80%; Test: 30.19%), while Cardiomegaly was the most prevalent finding in the OpenI test dataset (8.55%). The proportions are given as a percentage of the total number of samples in the dataset (bottom row). MIM = MIMIC-CXR dataset.
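As an illustration of the label binarisation and stratified splitting steps above, the following sketch uses scikit-multilearn's iterative_train_test_split (which performs second-order iterative stratification and so only approximates the order = 4 stratification reported above); the file name and column names are hypothetical:

```python
import pandas as pd
from skmultilearn.model_selection import iterative_train_test_split

# Hypothetical label frame: one row per study, values 1.0 = positive,
# 0.0 = negative, -1.0 = uncertain, NaN = not mentioned.
labels = pd.read_csv("mimic_cxr_labels.csv")
finding_cols = [c for c in labels.columns if c not in ("subject_id", "study_id")]

# Binarise: missing or uncertain findings become negative (no finding).
y = labels[finding_cols].replace(-1.0, 0.0).fillna(0.0).to_numpy()
X = labels[["study_id"]].to_numpy()

# 90:10 stratified split (pretraining vs. held-out), then split the held-out
# 10% again 50:50 into fine-tuning and test sets.
X_pre, y_pre, X_hold, y_hold = iterative_train_test_split(X, y, test_size=0.10)
X_ft, y_ft, X_test, y_test = iterative_train_test_split(X_hold, y_hold, test_size=0.50)
```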

Model Components
The pretraining and evaluation framework in this study was agnostic to model components, including the encoder, embedding layers, and projection heads (Figure 3). For our experiments, we modeled the text encoder as BERT [13], the image feature extractor as a Convolutional Neural Network (CNN) [25], and used VisualBERT with pretrained weights [9] for the joint modality encoder. The projection head was task specific and was discarded after pretraining. Following existing works [8][9][10], we performed no further training of the CNN image extractor model and computed the image features offline prior to pretraining. Further implementation details are provided in Table A1.
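A simplified sketch of how such components might be wired together using the Hugging Face VisualBERT implementation; the checkpoint name, feature dimension and projection layer are illustrative assumptions rather than the exact configuration used in this study:

```python
import torch
import torch.nn as nn
from transformers import VisualBertModel

class JointEncoder(nn.Module):
    """Word embeddings plus projected offline image features fed to a joint transformer."""
    def __init__(self, img_feat_dim: int = 1024):
        super().__init__()
        self.encoder = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
        # Learnable linear transform mapping extractor features to the dimension
        # expected by the joint encoder's visual embedding layer.
        self.h_v = nn.Linear(img_feat_dim, self.encoder.config.visual_embedding_dim)

    def forward(self, input_ids, attention_mask, visual_feats):
        visual_embeds = self.h_v(visual_feats)                 # (batch, 36, visual_dim)
        visual_mask = torch.ones(visual_embeds.shape[:-1])
        out = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            visual_embeds=visual_embeds,
            visual_attention_mask=visual_mask,
        )
        return out.last_hidden_state                           # joint text+image representation
```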

Figure 3.
An overview of the architecture used in the framework. Note: Model components are shown within the dotted boundary. f_v was modelled as a fixed-parameter image feature extractor, f_w as a learnable word embedding encoder, h_v and h_w as learnable linear transform layers, f_j as a learnable joint modality encoder, and f_p as a learnable (task-specific) projection head. An example MFR pretext process is shown in green to demonstrate the interaction between the model and the pretraining framework.
We followed established practices for pretraining VL BERT models in a multi-task setting [8]. For example, the following describes one forward pass of the pretraining process for the 'MLM, MFR, ITM' task combination:
1. A single pretext task was sampled uniformly from the pretraining scenario (i.e., either MLM, MFR, or ITM was chosen randomly).
2. A batch B of inputs was sampled uniformly from the pretraining dataset (refer to Section 2.5). Each input pair consisted of a text string w from the radiologist report, together with the extracted feature vectors V_f and position vectors V_p corresponding to the paired image.
3. The input batch was processed according to the pretext task:
- For the MLM or MFR task, the relevant input elements (text tokens or image features, respectively) were masked in each sample up to the masking budget, defined as a fixed proportion p_mask of the input length (refer to Section 2.7).
- Alternatively, for the ITM task, a selection of samples B̂ ⊆ B had their text w or image inputs (V_f, V_p) replaced with those of another sample from the batch. Here, B̂ was randomly selected from the batch and its size was calculated as |B̂| = ⌈p_ITM × |B|⌉, where p_ITM is the ITM sampling rate.
4. The input batch was then passed through the embedding layers (f_w for text, with image features already extracted by f_v) and projected by the linear layers h_w and h_v to fixed-dimensional representations. This resulted in ŵ ∈ ℝ^(h_dim × |ŵ|) and v̂ ∈ ℝ^(h_dim × |v̂|), where h_dim was the input dimension of the joint encoder.
5. We obtained a joint representation z with the joint encoder f_j(⋅) and projection function f_p(⋅) acting on the concatenated embeddings: z = f_p(f_j(ŵ ‖ v̂)).
6. Losses were calculated based on the specific pretext objective (refer to Section 2.2).
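The following sketch abstracts the multi-task step above in Python, assuming batched variants of the masking and swapping helpers sketched in Section 2.2 and a model object exposing encode and task-specific loss methods; all names are illustrative:

```python
import random

PRETEXT_TASKS = ["MLM", "MFR", "ITM"]                # the 'MLM, MFR, ITM' scenario

def pretraining_step(model, batch, tasks=PRETEXT_TASKS):
    """One forward pass: sample a task uniformly, corrupt the batch accordingly,
    encode, and return the task-specific loss."""
    task = random.choice(tasks)                      # step 1: sample the pretext task uniformly
    texts, feats, positions = batch                  # step 2: (w, V_f, V_p) for each sample pair

    if task == "MLM":                                # step 3: corrupt inputs for the chosen task
        masked_texts, target = mask_tokens_batch(texts)
        joint = model.encode(masked_texts, feats, positions)   # steps 4-5: embed, project, joint-encode
    elif task == "MFR":
        masked_feats, mask = mask_features_batch(feats)
        joint = model.encode(texts, masked_feats, positions)
        target = (feats, mask)                       # regress back to the original (unmasked) features
    else:                                            # ITM
        texts, feats, target = make_itm_batch(texts, feats)
        joint = model.encode(texts, feats, positions)

    return model.task_loss(task, joint, target)      # step 6: pretext-specific loss
```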

Pretraining Scenarios
We considered all non-empty subsets of {MLM, MFR, ITM} for comparison, yielding seven pretraining scenarios, all pretrained using both medical radiology text and X-ray image data. We further considered a model pretrained with MLM on the radiology text-only inputs. We defined a fixed set of hyper-parameters and random seeds across all training scenarios (details in Table A1) to ensure that the only difference was the choice of tasks. We followed established works [8][9][10][13] for task parameter selection, which included a masking rate of 0.15 for both MLM and MFR, and a sampling rate of 0.50 for the ITM task. For each scenario, we trained the model with the respective training objectives for 200,000 steps, equating to approximately 22 h of training time per model on an RTX 3090 GPU.
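The seven multimodal scenarios can be enumerated directly, for example:

```python
from itertools import combinations

TASKS = ("MLM", "MFR", "ITM")
scenarios = [list(c) for r in range(1, len(TASKS) + 1) for c in combinations(TASKS, r)]
# Seven multimodal scenarios; an eighth instance used MLM on text-only inputs.
print(scenarios)
# [['MLM'], ['MFR'], ['ITM'], ['MLM', 'MFR'], ['MLM', 'ITM'], ['MFR', 'ITM'], ['MLM', 'MFR', 'ITM']]
```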
We evaluated two baseline model instances: (1) random weight initialization (denoted 'No Pretrain'), and (2) weights loaded from the public VisualBERT repository. No domain-specific pretraining was conducted for either baseline.

Evaluation Tasks
The seven pretrained and two baseline models were fine-tuned for six epochs on the MIMIC-CXR fine-tuning dataset (Finetune) using multi-label binary cross-entropy (BCE) loss to update network parameters. Models were evaluated on multimodal multi-label classification on both the MIMIC-CXR and OpenI test sets. We assessed each scenario's fine-tuning sensitivity to random seed initialization by fine-tuning each pretrained model 20 times. The mean per-label AUC, average AUC and sample-weighted average AUC (wAvg AUC), along with the standard deviation, were reported.
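A sketch of how such metrics could be computed with scikit-learn; note that sklearn's 'weighted' average weights each label's AUC by the number of positive samples for that label, which is one plausible reading of the sample-weighted average reported here:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_auc(y_true: np.ndarray, y_score: np.ndarray):
    """Per-label, macro-average, and sample-weighted average AUC for multi-label predictions.

    y_true: (n_samples, n_labels) binary ground truth; y_score: predicted probabilities.
    """
    per_label = [roc_auc_score(y_true[:, j], y_score[:, j]) for j in range(y_true.shape[1])]
    avg_auc = float(roc_auc_score(y_true, y_score, average="macro"))
    wavg_auc = float(roc_auc_score(y_true, y_score, average="weighted"))
    return per_label, avg_auc, wavg_auc
```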

Results
Table 3 reports the mean per-label and average AUC for each scenario on the MIMIC-CXR and OpenI test datasets. On the internal evaluation, 'MLM, MFR' reported the strongest average AUC (0.989) and the best or equal-best per-label AUC in 8/13 finding categories (Table 3). Other training scenarios that incorporated MLM ('MLM, MFR, ITM', 'MLM, ITM', 'MLM (text)', and 'MLM') achieved comparable performance, with average AUC greater than 0.982, and collectively contained the highest AUCs per label category. When pretraining and fine-tuning distributions differed (external evaluation, Table 3), little benefit was observed in terms of mean AUC. All resulting pretrained models obtained high AUC scores, with a minimum average AUC/weighted average AUC of 0.921/0.926 ('MFR'). The 'No Pretrain' experiment reported the highest weighted AUC and the highest per-label AUC for 3/7 categories. Models lacking MLM pretraining reported increased variability in AUC across label categories, with categories having a low number of training samples reporting lower AUC than those with relatively large numbers. This result was not evident in the OpenI results, where label proportionality was more balanced.

Classification Performance
Our results on the MIMIC-CXR test set demonstrate a benefit over both (a) no pretraining ('No Pretrain') and (b) general domain pretraining ('VBert-COCO'). However, given the strong performance of 'MLM (text)', there appears to be little additional benefit from incorporating the visual inputs. We suggest the strong text performance is due in part to the demonstrated prior bias of these model types towards text, and in part to the task-relevant information summarized by the radiologist text report.
Our results on the OpenI test set suggest there may be less benefit provided by pretraining when the test data are from a different source. However, all models (including baselines) reported high performance (Avg. AUC ≥ 0.921), demonstrating a strong multimodal classification capability inherent to the VL BERT architecture. We suggest further investigation is required, using more demanding tasks, in order to assess generalization ability comprehensively.

Sensitivity to Label Imbalance
Our study found that models lacking MLM pretraining report increased variability in AUC across label categories. BCE loss provides a supervisory signal proportionate to the number of training samples, and our experiments suggest that incorporating suitable pretext tasks (MLM in this situation) is beneficial to training models that are robust to this supervision imbalance, at least when evaluated on multi-label classification problems.

Sensitivity to Random Initialisation
Each model was pretrained once and fine-tuned 20 times with different random seeds. A number of pretraining scenarios reported considerable fine-tuning sensitivity to random seed initialization when compared to the baseline models. On the MIMIC-CXR dataset, all single-task pretraining scenarios trained on multimodal input data (i.e., 'MLM', 'MFR', 'ITM') reported higher variability of average AUC than all composite pretraining scenarios incorporating MLM. This was also the case on the OpenI test set, although additionally 'MLM (text)' reported large variability (std. dev. = 0.0235). These results suggest composite pretraining tasks produce more consistent results, as well as highlighting the benefit of investigating fine-tuning sensitivity as part of the pretext task evaluation process.

Strengths and Limitations
This study has a number of strengths. It focuses on controlled experimentation of the pretext task component, removing variation due to model architecture and training/data protocols. Furthermore, it includes repeated evaluations to report the variance of performance, which, while common in health research, is less so in the field of machine learning. Another strength is the focus on reproducibility of results, publishing all data processing protocols and the framework implementation (code). Diverging from the trend in machine learning research towards cluster computing, the implementation was developed to run on a single consumer-grade computer (which may appeal to smaller research groups and health researchers equipped with modest computing budgets).
A limitation of this study was that computational resources placed constraints on our exploration of areas such as pretext task scenarios, hyperparameter tuning, variability analysis, and choice of embedding layers. Only a single model architecture and implementation (VisualBERT joint encoder, Mask-RCNN visual encoder, BERT word embeddings) was tested. However, similar results are expected with other choices.
Even though our results found little evidence to support pretraining when the test distribution differed from the training distribution, drawing a reliable conclusion would require considerably more investigation into distribution shift. This analysis was outside the scope of our study and is a topic for future research. The framework of this study provides a simple proof of concept for the analysis of pretext tasks used in pretraining models for pathoanatomical classification with multimodal radiology data. There exists much opportunity for expanding upon this work to gain a deeper insight into the benefits and limitations of self-supervised learning applied within radiology.

Conclusions
This study introduced and implemented a model-agnostic framework for evaluating self-supervised pretext tasks with multimodal radiology data on pathoanatomical classification. We conducted controlled studies for a selection of widely used pretext tasks with a VL BERT type transformer architecture in order to gain an understanding of their performance and limitations. Pretraining with a composite objective of pretext tasks was found to improve classification performance, reduce sensitivity to class imbalance in the multi-label setting, and reduce fine-tuning variance, predominantly when the training and testing data are sourced from the same distribution. However, when the training and testing distributions differed, pretraining provided little benefit to model classification performance. Our results provide an important evidence-based determination of the relative performance and reliability of multimodal self-supervised learning pretext tasks used to train pathoanatomical classification models for radiology applications.

Table 1 .
Glossary of notation.

Table 2 .
The proportion of samples having positive findings in each label category for each dataset used in experiments.

Table 3 .
The per-label, average, and sample-weighted average AUC for multimodal classification on the MIMIC-CXR and OpenI test sets. Note: Results are reported as mean and std. dev. over 20 fine-tuning episodes, varying the random seed. Orange (teal) represents the highest (lowest) mean AUC across scenarios, with ties broken by std. dev.