Article

GIT-CXR: End-to-End Transformer for Chest X-Ray Report Generation

by Iustin Sîrbu 1,*,†, Iulia-Renata Sîrbu 1,2,†, Jasmina Bogojeska 2 and Traian Rebedea 1,3,*
1 Faculty of Automatic Control and Computer Science, National University of Science and Technology Politehnica of Bucharest, 060042 Bucharest, Romania
2 School of Engineering, Zurich University of Applied Sciences, 8401 Winterthur, Switzerland
3 NVIDIA, Santa Clara, CA 95051, USA
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Information 2025, 16(7), 524; https://doi.org/10.3390/info16070524
Submission received: 8 May 2025 / Revised: 9 June 2025 / Accepted: 19 June 2025 / Published: 23 June 2025
(This article belongs to the Section Information Applications)

Abstract

Medical imaging is crucial for diagnosing, monitoring, and treating medical conditions. The medical reports accompanying radiology images are the primary medium through which medical professionals document their findings, but their writing is time-consuming and requires specialized clinical expertise. Therefore, the automated generation of radiography reports has the potential to improve and standardize patient care and significantly reduce the workload of clinicians. Through our work, we have designed and evaluated an end-to-end transformer-based method to generate accurate and factually complete radiology reports for X-ray images. Additionally, we are the first to introduce curriculum learning for end-to-end transformers in medical imaging and demonstrate its impact on performance. The experiments were conducted using the MIMIC-CXR-JPG database, the largest available chest X-ray dataset. The results obtained are comparable with the current state of the art on the natural language generation (NLG) metrics BLEU and ROUGE-L, while setting new state-of-the-art results on the F1 examples-averaged, F1-macro, and F1-micro metrics for clinical accuracy and on the METEOR metric widely used for NLG.

1. Introduction

Interpreting radiographic images with complex and detailed features is a challenging task that demands significant time [1,2] and specialized clinical expertise [3]. The insights provided by clinicians in these reports are crucial for future patient assessments, and errors can result in misdiagnoses and improper treatments. Consequently, the increasing volume of radiographic images, particularly in public hospitals and densely populated areas, coupled with a limited number of experts, leads to substantial delays that negatively impact diagnosis and treatment outcomes.
Radiology report generation can be framed as an image-captioning task. While recent advances in image captioning, such as Convolutional Neural Network (CNN) encoders paired with Recurrent Neural Network (RNN) decoders [1,2] and transformer-based models with complex modules [4], have shown promise, generating accurate medical reports from images remains an unresolved challenge, with current performance levels insufficient for practical use. The main challenges in medical image captioning include highly similar, complex images with subtle differences, the use of domain-specific language, and brief diagnostic insights embedded within lengthy, repetitive descriptions.
This work focuses on developing and evaluating an end-to-end transformer approach, designed specifically for the automated generation of radiology reports from radiography images. We adapt the Generative Image-to-text Transformer (GIT) [5] by incorporating widely used techniques such as adding a classification head, using patients’ history, and multi-view images. Most distinctively, we integrate curriculum learning (CL) into our training and prove its efficacy. Together with these enhancements, we obtain an end-to-end transformer approach that outperforms existing methods, while experimentally confirming that all these ingredients are essential for our method’s success. Importantly, the techniques we use do not introduce additional training or inference complexity to the network, in contrast to other recent methods [6,7].
Current research still struggles with the generation of long medical reports [8]. We demonstrate the essential role that curriculum learning plays in substantially improving this aspect and argue that this specific issue was not properly addressed in previous work. We believe it deserves greater attention from the medical AI research community.
We conduct our experiments on the largest publicly available chest X-ray dataset—the MIMIC-CXR-JPG dataset introduced by Johnson et al. [9]. Our proposed solution achieves a new state of the art on the natural language generation (NLG) metric METEOR [10], as well as on the clinical accuracy metrics F1-macro and F1-micro. These results demonstrate both the clinical accuracy and factual completeness of our generated reports. Furthermore, our approach performs on par with the state of the art considering the natural language generation metrics BLEU [11] and ROUGE-L [12].
Our main contributions can be summarized as follows:
  • We propose an end-to-end transformer approach for the generation of medical reports from chest X-ray images, demonstrating the validity of simpler architectures.
  • To the best of our knowledge, we are the first to show the effectiveness of curriculum learning for the task of automated radiology report generation using transformers.
  • We demonstrate the capabilities of our setups by obtaining state-of-the-art results on the largest chest radiography benchmark, MIMIC-CXR-JPG, for both clinical accuracy metrics and natural language generation metrics, attesting to both the factual completeness and the accuracy of our generated reports.

2. Related Work

2.1. Transformers in Image Captioning

Attention models have gained large-scale popularity for the task of image captioning [13,14,15], due to their outstanding performance. The GIT Transformer [5] is a Generative Image-to-text Transformer that has obtained state-of-the-art results on various computer vision tasks. It has been pre-trained on 0.8B image–text pairs from various sources, but as far as we know, it has not been evaluated on medical imaging tasks. The architecture of GIT consists of two transformer modules: an image encoder—based on the vision transformer model of Yuan et al. [16]—that extracts features from the input image, and a text decoder (also a transformer module) that uses these visual features to generate the corresponding caption.
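As an illustration of the base model we build upon, the snippet below sketches how the publicly released, general-domain GIT checkpoint can be loaded and used for captioning via the Hugging Face transformers library. The checkpoint name "microsoft/git-base" and the example image path are assumptions for demonstration only and do not refer to our fine-tuned models.

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Hypothetical example image; any chest X-ray JPG would do here.
image = Image.open("example_cxr.jpg").convert("RGB")

# General-domain GIT checkpoint (vision-transformer encoder + text decoder).
processor = AutoProcessor.from_pretrained("microsoft/git-base")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base")

# Encode the image and generate a caption autoregressively.
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```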

2.2. Radiology Report Generation

Radiology report generation from radiographic images (Figure 1) falls under the broader task of image captioning [17,18]. The majority of methods used to address the challenges of medical image captioning rely on deep learning models employing an image-encoder and text-decoder architecture. The encoder is typically a Convolutional Neural Network [2,19,20,21], used to extract image features and create their latent representations. The decoder—usually a Recurrent Neural Network, such as Long Short-Term Memory (LSTM) [22]—is then used to convert the extracted features into the generated reports [1].
Transformers [23] offer the advantage of effectively capturing long-range visual and textual dependencies and have been successfully applied to this task. However, models leveraging transformers often incorporate additional CNN encoders [24,25], or complex modules (e.g., added memory augmentation module [4], Faster R-CNN object detector [6], interpreter–generator–classifier modules [26], and relational memory [27]), which increase inference complexity and require more supervision during training.
Therefore, to address the problems introduced by this added complexity, we investigate the performance of a simpler approach: an end-to-end transformer model. In the context of medical image captioning, this approach had not been studied in depth until the recent works of Wang et al. [28] and Nicolson et al. [29]. However, these works do not address the issue of overly short generated reports.

2.3. Curriculum Learning

First introduced by Bengio et al. [30], the vanilla form of curriculum learning refers to gradually increasing the difficulty of the data samples that pass through the model during training. This is similar to the form in which humans initially acquire knowledge.
The effectiveness of curriculum learning for text generation has previously been studied by Subramanian et al. [31]. They show the importance of constraining their adversarial model to generate increasingly longer sequences. Sequence length has also been used as a difficulty metric for curriculum learning in other natural language processing (NLP) tasks. Spitkovsky et al. [32] show the importance of an easy-to-hard training strategy for unsupervised grammar induction, while Chang et al. [33] apply curriculum learning for data-to-text generation and show that it improves both the generation quality and convergence speed.
In the field of medical imaging, curriculum learning has been applied mainly to computer vision tasks, typically employing a handcrafted curriculum or an ordering based on human annotations [34,35,36,37]. More recently, Alsharid et al. [38] employed a dual-curriculum approach for the task of fetal ultrasound image captioning, using the Wasserstein distance for image data and the TF-IDF metric [39] for text data. Liu et al. [40] applied curriculum learning to the generation of medical reports, employing an iterative two-step approach: first, estimate the difficulty of the training samples and evaluate the competence of the model; second, select the appropriate training samples following the easy-to-hard strategy.
However, curriculum learning has not yet been explored in the context of medical report generation using transformers. Inspired by curriculum methods in NLP and by the observation that longer reports are harder to generate for our baseline models, we introduce a simple but effective curriculum learning approach based on report length, offering a lightweight alternative to the more complex multimodal curricula typically used in medical imaging.

3. Method

In order to adapt the GIT transformer for the complex task of automated report generation for radiography images, we employ a variety of task-specific methods. While some of them are widely used in the field of medical image captioning (e.g., we combine multi-view images, the patient’s medical history, and training the model in a multi-task setting), we are the first to experiment with a curriculum learning method based on the report length. All of these methods are further elaborated in this section.

3.1. GIT-CXR (SV)

The single-view approach is depicted in Figure 2, excluding the multi-label classifier. This serves as our baseline method, where the GIT model is fine-tuned for image captioning on the MIMIC-CXR dataset. A single-view chest X-ray image is passed through the image encoder to obtain an embedding that is fed into the text decoder to generate the corresponding radiology report. In this setting, we keep only the AP and PA views, duplicating the associated report for each image. We focus on these views because they account for 253,714 images (67.2% of the dataset) and provide the most comprehensive view of the patient’s condition, in contrast to lateral images.
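For concreteness, the following sketch illustrates how single-view training samples could be assembled from the dataset metadata. The column names ("ViewPosition", "study_id", "dicom_id", "report") are assumptions about a pre-processed metadata table, not the exact MIMIC-CXR-JPG schema.

```python
import pandas as pd

def build_single_view_samples(metadata: pd.DataFrame, reports: pd.DataFrame) -> pd.DataFrame:
    """Keep only frontal (AP/PA) images and attach the study report to each.

    A study with several frontal images yields several samples sharing the
    same target report (report duplication, as described above).
    """
    frontal = metadata[metadata["ViewPosition"].isin(["AP", "PA"])]
    samples = frontal.merge(reports[["study_id", "report"]], on="study_id", how="inner")
    return samples[["dicom_id", "study_id", "ViewPosition", "report"]]
```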

3.2. GIT-CXR (MV)

For the multi-view approach, we use two images per sample (as these account for more than 90.5% of the entire dataset), paired with their corresponding report, in combinations of AP, PA, LAT, and LL images. We require that at least one of the two images is an AP or PA view, and we duplicate the image of a report if only one is available. The way we adapt the GIT model for the multi-view input is shown in Figure 3. First, each single-view image is passed through the image encoder, obtaining an image embedding for each view. Then, a different temporal embedding is added to each view to distinguish the views within the multi-view image. Finally, the resulting embeddings are concatenated, creating a final image embedding for the entire multi-view image. The next steps are the same as in the single-view approach described above.
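A minimal PyTorch sketch of this multi-view fusion is given below; module names, tensor shapes, and the learnable per-view embeddings are illustrative assumptions rather than the exact GIT implementation.

```python
import torch
import torch.nn as nn

class MultiViewEncoder(nn.Module):
    """Encode each view with a shared image encoder, mark it with a
    view-specific (temporal) embedding, and concatenate the token sequences."""

    def __init__(self, image_encoder: nn.Module, hidden_size: int, num_views: int = 2):
        super().__init__()
        self.image_encoder = image_encoder  # shared vision transformer
        # One learnable embedding per view, added to all of its tokens.
        self.view_embeddings = nn.Parameter(torch.zeros(num_views, 1, hidden_size))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, num_views, channels, height, width)
        num_views = images.shape[1]
        per_view = []
        for v in range(num_views):
            tokens = self.image_encoder(images[:, v])           # (batch, tokens, hidden)
            per_view.append(tokens + self.view_embeddings[v])   # mark which view it is
        # Concatenate the token sequences of all views along the token axis.
        return torch.cat(per_view, dim=1)                       # (batch, views*tokens, hidden)
```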

3.3. Context

For both methods described above—GIT-CXR (SV) and GIT-CXR (MV)—we also incorporate contextual information and define the methods GIT-CXR (SV+C) and GIT-CXR (MV+C), respectively. This context represents additional information about the patient, such as medical history or reason for examination, and it is given together with the target report to be tokenized and fed to the text decoder. The context is obtained by concatenating the ‘indication’ and ‘history’ fields of the reports.
The reports contain only the ‘history’ section in 25.0% of the cases, only the ‘indication’ in 72.7% of the cases, neither in 2.2% of the cases, and both in 0.0%. An example of such context is as follows: ‘_ year old male with a history of metastatic melanoma, presenting with confusion and somnolence. evaluate for acute cardiopulmonary process’.
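The sketch below shows how such a context string could be assembled from the parsed report sections; the dictionary keys and the lower-casing mirror our pre-processing but are assumptions about the parsing step.

```python
def build_context(report_sections: dict) -> str:
    """Concatenate the 'indication' and 'history' sections (when present)
    into the context string that is tokenized together with the report."""
    indication = report_sections.get("indication", "").strip()
    history = report_sections.get("history", "").strip()
    context = " ".join(part for part in (indication, history) if part)
    # Truncation to the 45-token context budget (Section 4.3) happens later,
    # at tokenization time.
    return context.lower()
```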

3.4. GIT-CXR-CLS

We also introduce an auxiliary loss for multi-label classification, illustrated in Figure 2, and define the methods GIT-CXR-CLS (SV+C) and GIT-CXR-CLS (MV+C) for the single-view and multi-view setups, respectively. Similarly to Nguyen et al. [26], the image tokens are fed into a multi-label classifier consisting of one classification head for each of the 14 possible diagnostic categories as defined by the CheXbert labeler [41]. The classification loss is computed as the mean of the weighted cross-entropy losses for each head, as defined in Equation (1). Here, $D$ denotes the number of pathologies (in our case 14), $x$ is the input image, and $f_v(x)$ represents the image encoding obtained by passing $x$ through the vision encoder $f_v$. The prediction of the $i$-th classification head is denoted as $h_i(f_v(x))$, while $y_i$ represents the target label for the $i$-th diagnostic. The cross-entropy loss $CE$ is individually weighted for each pathology to account for class imbalance:

$$\mathcal{L}_{MLC} = \frac{1}{D} \sum_{i=1}^{D} CE\big(y_i, h_i(f_v(x))\big) \quad (1)$$

As we train the model in a multi-task setting, the total loss is the sum of the generation loss of GIT-CXR and the additional classification loss: $\mathcal{L} = \mathcal{L}_{GIT} + \mathcal{L}_{MLC}$.
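A minimal sketch of the auxiliary classifier and of Equation (1) is given below, assuming pooled image features and integer class labels; the per-pathology class weights and the head layout are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLabelHead(nn.Module):
    """One classification head per pathology (14 heads, 4 classes each)."""

    def __init__(self, hidden_size: int, num_pathologies: int = 14, num_classes: int = 4,
                 class_weights=None):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, num_classes) for _ in range(num_pathologies)]
        )
        # Optional per-pathology class weights to counter class imbalance,
        # assumed shape (num_pathologies, num_classes).
        self.class_weights = class_weights

    def forward(self, image_features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, hidden) pooled vision features f_v(x)
        # labels: (batch, num_pathologies) integer class indices in {0..3}
        losses = []
        for i, head in enumerate(self.heads):
            weight = None if self.class_weights is None else self.class_weights[i]
            losses.append(F.cross_entropy(head(image_features), labels[:, i], weight=weight))
        return torch.stack(losses).mean()  # L_MLC, Equation (1)

# Multi-task objective: L = L_GIT + L_MLC (the classification term is scaled
# by 0.1 in the experiments, see Section 4.3).
```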

3.5. Curriculum Learning

One of the main challenges in medical report generation is training the model to produce reports that are sufficiently long and include all the relevant clinical information. This difficulty arises from two perspectives. From an NLG viewpoint, it is more difficult to generate longer texts than shorter ones due to the need to maintain coherence over a longer span and handle long-range dependencies effectively. From a clinical accuracy standpoint, generating a report that describes multiple findings is more difficult, as it requires the model to identify and correctly describe more pathologies. As shown in Figure 4, there is a correlation between report length in the MIMIC-CXR-JPG training set and the number of pathologies identified. This correlation is moderate, with Pearson and Spearman correlation coefficients of 0.31 (p = 0.000) and 0.37 (p = 0.000), respectively.
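The correlation figures above can be reproduced with a short check like the one below, assuming two aligned arrays holding the token length and the number of positive CheXbert labels of each training report.

```python
from scipy.stats import pearsonr, spearmanr

def length_pathology_correlation(report_lengths, num_pathologies):
    """Pearson and Spearman correlation between report length and the
    number of pathologies mentioned in each training report."""
    pearson_r, pearson_p = pearsonr(report_lengths, num_pathologies)
    spearman_r, spearman_p = spearmanr(report_lengths, num_pathologies)
    return {"pearson": (pearson_r, pearson_p), "spearman": (spearman_r, spearman_p)}
```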
However, as discussed in Section 5, the generated reports tend to be shorter than their corresponding targets. This leads to a steep decrease in performance when long reports are expected, which drastically affects the overall performance of the model.
To address this issue, we propose a curriculum learning approach based on the length of the target report: the model is initially trained using shorter (and easier to learn) samples, with the average target length increasing progressively from one epoch to the next. More precisely, we divide the dataset into $b$ bins based on the length of the report, ensuring that each bin contains an equal number of reports. During each training epoch, we assign a weight of $\frac{1}{1 + |i - i_e|}$ to the samples in each bin $1 \leq i \leq b$, where $i_e$ is the bin corresponding to the current epoch. Then, we sample without replacement a fraction $f$ of the dataset using these weights. This ensures that during each epoch most of the samples come from the proximity of bin $i_e$, while still allowing some samples of the opposite difficulty level. For example, in the early stages of training, the model can still see words that might appear only in long reports, while in the final stages of training, the model still sees some easy examples that require the generation of short reports, preventing overfitting to lengthy outputs.
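A minimal sketch of this sampling procedure is given below; the quantile-based binning and the mapping from epoch to target bin are illustrative assumptions consistent with the description above.

```python
import numpy as np

def sample_curriculum_epoch(report_lengths: np.ndarray, epoch_bin: int,
                            num_bins: int = 10, fraction: float = 0.25,
                            seed: int = 0) -> np.ndarray:
    """Return the indices of the training samples drawn for the current epoch."""
    rng = np.random.default_rng(seed)
    # Split the dataset into equally populated bins by target report length.
    edges = np.quantile(report_lengths, np.linspace(0, 1, num_bins + 1)[1:-1])
    bins = np.digitize(report_lengths, edges) + 1          # bin indices in 1..num_bins
    # Weight each sample by 1 / (1 + |i - i_e|), where i is its bin and i_e
    # the bin assigned to the current epoch, then normalize to probabilities.
    weights = 1.0 / (1.0 + np.abs(bins - epoch_bin))
    probs = weights / weights.sum()
    # Sample a fraction f of the dataset without replacement, favouring
    # reports whose length is close to the current difficulty level.
    n_samples = int(fraction * len(report_lengths))
    return rng.choice(len(report_lengths), size=n_samples, replace=False, p=probs)
```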
Due to the correlation between the length of the report and the corresponding number of pathologies, we consider the report length to be a reliable difficulty estimate, able to address the aforementioned challenge from both the NLG and clinical accuracy viewpoints. While our bins are created based on the report length, Figure 5 shows that they also correspond to an increasing number of pathologies.
We incorporate our proposed curriculum learning approach into the previous approaches and experiment with it in various settings. To this end, we define GIT-CXR-CLS (SV+C+CL) and GIT-CXR-CLS (MV+C+CL) that introduce CL into the multi-task architecture in single-view and multi-view setups, respectively. Similarly, GIT-CXR (SV+C+CL) and GIT-CXR (MV+C+CL) introduce CL into the GIT-CXR architecture in single-view and multi-view configurations, respectively.

4. Experiments

4.1. Dataset

The MIMIC Chest X-ray Database v2.0.0 [42] is the largest publicly available dataset containing chest X-ray images with their corresponding free-form text clinical reports. The dataset contains a total of 377,110 DICOM format radiography images that correspond to 227,835 studies of 64,588 patients. This dataset is the basis for the MIMIC-CXR-JPG dataset, or MIMIC Chest X-ray JPG Database v2.0.0 [9], which is entirely derived from MIMIC-CXR. It additionally provides JPG conversions of the original DICOM images and labels for 14 pathologies, extracted from the reports using the CheXpert labeler [43]. Each of the 14 pathologies has four possible classes (‘Positive’, ‘Negative’, ‘Uncertain’ and ‘Missing’). The MIMIC-CXR-JPG [9] dataset offers the standard reference splits that we also adopt in our study. Some studies have been manually reviewed by experts. The test set contains all studies of patients who had at least one report labeled in the manual review. The validation set contains a random set of 500 patients and all associated studies. Finally, all the remaining studies are made available in the training set, resulting in a 222,758–1808–3269 train-validation-test distribution of studies. More details on the data splits are provided by Johnson et al. [9]. Since different works have used the dataset in various settings (e.g., different sections of the reports, different views of the X-ray images), we next provide some details on the structure of the data and the settings employed in this work.

4.1.1. Studies

One patient can have one or more studies, and one study can have one or more chest X-ray images and exactly one report. One report can contain one or more sections—free-form text details about the patient’s condition—such as ‘comparison’, ‘clinical history’, ‘indication’, ‘reasons for examination’, ‘impression’ and ‘findings’. An example of a study can be seen in Figure 1, where Figure 1a is the free-form text report, and the corresponding Figure 1b–d are the chest radiographic images taken from different views of the patient.

4.1.2. Generation of Reports

Previous works use different sections of the reports. For example, Endo et al. [44] use both the ‘findings’ and the ‘impression’, while Nguyen et al. [26], Lovelace and Mortazavi [25] and Miura et al. [24] use only the ‘findings’ field. Considering the distribution of the reports with regard to the sections they contain, of the 227,835 studies, 189,561 (83.2%) reports contain an ‘impression’ section, and 155,716 (68.4%) reports contain a ‘findings’ section; reports containing at least one of the two cover 95.4% of the entire dataset. Therefore, we decide to use both the ‘findings’ and the ‘impression’, discarding the studies whose reports contain neither. Moreover, these two sections incorporate the most relevant information from all sections. We also pre-process the text, eliminating most symbols and upper-case letters.
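A minimal sketch of this target-report construction and text cleaning is shown below; the exact set of symbols removed is an assumption, as we only state that most symbols and upper-case letters are eliminated.

```python
import re

def preprocess_report(findings: str, impression: str) -> str:
    """Concatenate the 'findings' and 'impression' sections, lowercase the
    text, and strip most symbols."""
    text = " ".join(part.strip() for part in (findings, impression) if part and part.strip())
    text = text.lower()
    # Keep letters, digits, and a small set of punctuation marks (assumed).
    text = re.sub(r"[^a-z0-9.,;:()\s-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```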

4.1.3. Images

The MIMIC-CXR-JPG dataset consists of chest X-ray images taken from different views of the patient: the front (Figure 1b, AP), the back (Figure 1c, PA), the lateral (Figure 1d, LAT) or, more specifically, the left-lateral (LL) part of the patient. In our experiments, we compare a single-view approach (also investigated by Endo et al. [44] and Wang et al. [28]) and a multi-view approach (also seen in Nguyen et al. [26] and Miura et al. [24]) of handling the situation involving multiple scans per study.

4.1.4. Labels

For the labeling of our generated reports, we choose to use the CheXbert labeler by Smit et al. [41]. This is a radiology report labeling method based on a biomedically pre-trained BERT [45] that has near radiologist performance in labeling medical conditions and is 5.5% more accurate than CheXpert. Related works that also use the CheXbert labeler are those of Miura et al. [24] and Endo et al. [44], as opposed to works that use the CheXpert labeler [25,26,27].

4.2. Evaluation Metrics

NLG metrics are the default evaluation approach for image-captioning tasks, as they measure the ability of the model to generate coherent text. While BLEU [11] is precision based and ROUGE [12] is recall based, METEOR [10] combines both precision and recall and correlates better with human judgment. However, since Boag et al. [46] found that models can achieve high NLG scores without producing a correct diagnosis, clinical accuracy metrics have been introduced to measure the ability of a model to produce reports from which the right pathologies can be identified. Therefore, for the most complete comparison to previous works and in order to determine both the factual completeness and the clinical accuracy of our approach, we use both NLG metrics—BLEU-1 to BLEU-4 (BL1 to BL4), ROUGE-L (RGL), and METEOR (M)—and clinical accuracy metrics: F1-macro and F1-micro on all 14 labels, F1-micro on just the five most frequent labels (to compare with Miura et al. [24]), and the examples-averaged F1.
The clinical accuracy metrics involve using a pre-trained labeler to identify pathologies from the target report, applying the same labeler to the generated report, and then comparing the resulting classifications using F1 scores. Because the F1 score needs to be computed in a multi-label (the 14 labels of CheXbert) and multi-class (Positive, Negative, Uncertain and Missing) manner, we proceed similarly to Lovelace and Mortazavi [25]: we compute the F1 score of the Positive class for each pathology individually and then we compute their macro-average (F1MA) for comparison with Endo et al. [44] and Lovelace and Mortazavi [25], and micro-average (F1MI) for comparison with Lovelace and Mortazavi [25]. Miura et al. [24] only report the F1-micro on the five most frequent labels of the CheXbert labeler (F1MI5). More recently, Tanida et al. [6] and Nicolson et al. [29] used the example-based average (F1EX) for the reported results. In order to ensure comparison with all methods, we report all the aforementioned F1 averages: F1MA, F1MI, F1MI5 and F1EX.
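The sketch below illustrates how these F1 averages can be computed once both the target and the generated reports have been run through the CheXbert labeler; encoding the ‘Positive’ class as 1 is an assumption about the label pre-processing.

```python
import numpy as np
from sklearn.metrics import f1_score

def clinical_f1(target_labels: np.ndarray, generated_labels: np.ndarray,
                top5_indices=None) -> dict:
    """F1 averages over binarized (Positive vs. rest) CheXbert labels.

    Both arrays have shape (num_reports, 14) with class indices per pathology.
    """
    y_true = (target_labels == 1).astype(int)
    y_pred = (generated_labels == 1).astype(int)
    scores = {
        "F1MA": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "F1MI": f1_score(y_true, y_pred, average="micro", zero_division=0),
        "F1EX": f1_score(y_true, y_pred, average="samples", zero_division=0),
    }
    if top5_indices is not None:  # restrict to the five most frequent labels (F1MI5)
        scores["F1MI5"] = f1_score(y_true[:, top5_indices], y_pred[:, top5_indices],
                                   average="micro", zero_division=0)
    return scores
```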

4.3. Experimental Setup

We train our models on a Linux system with 10 CPUs, 160 GB RAM, and an NVIDIA H100 GPU with 80 GB VRAM. All experiments are trained for up to $N_e = 30$ epochs, with early stopping using a patience of 5 epochs. The best checkpoint for each experiment is selected based on performance on the validation set, using a weighted average of the NLG metrics: 0.25 for METEOR and ROUGE-L, and 0.125 for each of the BLEU scores. We use the AdamW optimizer with a learning rate of $5 \times 10^{-5}$ and a batch size of 32 for all our experiments. The images are resized to $224 \times 224$. The maximum length of the target report together with the context is set to 192 tokens, with a maximum of 45 tokens allocated to the context. The value of 192 tokens is chosen to ensure the coverage of at least 94% of the reports without additional truncation. For GIT-CXR-CLS, the weight of the classification loss is 0.1. In the multi-view setting, the number of views is fixed at 2. For the curriculum learning approach, we use $b = 10$ bins and sample $f = 25\%$ of the dataset at each epoch. To maintain parity with the number of training samples used in the non-curriculum setting, we train the model for $\frac{1}{f} N_e = 120$ epochs and perform validation every $\frac{1}{f} = 4$ epochs.
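For completeness, a sketch of the validation score used to select the best checkpoint is given below; the metric dictionary keys are illustrative.

```python
def checkpoint_score(metrics: dict) -> float:
    """Weighted average of the NLG metrics used for checkpoint selection:
    0.25 each for METEOR and ROUGE-L, 0.125 for each BLEU-n score."""
    return (0.25 * metrics["meteor"] + 0.25 * metrics["rouge_l"]
            + 0.125 * sum(metrics[k] for k in ("bleu1", "bleu2", "bleu3", "bleu4")))
```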

5. Results and Discussion

We compare our two best-performing methods—GIT-CXR (MV+C+CL) and GIT-CXR (SV+C+CL)—with ten current state-of-the-art approaches for the task of medical report generation from chest X-ray images, which also use the MIMIC-CXR-JPG benchmark. The results can be seen in Table 1.
Our method, GIT-CXR (MV+C+CL), achieves state-of-the-art performance on the NLG metric METEOR, surpassing AGA [26] by 14.7 percentage points (pp). Moreover, we outperform all the other methods that report clinical accuracy scores—F1 examples-averaged (F1EX), F1-macro (F1MA) and F1-micro (F1MI)—across the 14 labels. We are only 0.2 pp behind M2TR in terms of F1MI on the five most frequent labels, noting that their results are not based on the official dataset splits. Among the end-to-end transformer approaches, our method obtains state-of-the-art performance on all presented metrics, outperforming ARR TR [28] by at least 5.0 pp in terms of BLEU scores and 2.5 pp in terms of ROUGE-L.

5.1. Ablation Study

Through our ablation study in Table 2, we demonstrate that all four techniques that we introduce (adding context, using multi-view images, incorporating a multi-label classifier, and applying curriculum learning) positively impact performance, across both NLG and clinical accuracy metrics. Compared to the baseline model, GIT-CXR (SV), adding context in the form of patient history and reason for examination leads to consistent improvements across all NLG metrics, suggesting that incorporating even limited textual context helps the model generate more informative and coherent reports. Incorporating a classification head introduces the most substantial gains in clinical accuracy, as it forces the model to focus on the presence or absence of specific findings, improving the image encoder and ultimately the factual correctness of generated reports. The use of multi-view images—by including both frontal and lateral X-rays—further enhances model performance by enriching the visual representation and capturing the complementary information not available in single-view setups. This can be seen in Table 2 for all multi-view methods, compared to their single-view counterparts, for both NLG and clinical accuracy metrics. Finally, applying curriculum learning based on report length yields the most substantial overall improvement, in both single-view and multi-view settings. Notably, the introduction of CL helps more than the combined use of the multi-view images and classification head, as GIT-CXR (SV+C+CL) outperforms GIT-CXR-CLS (MV+C) across the board. Next, we provide a more detailed discussion of the impact of curriculum learning and how it interacts with the other techniques.

Curriculum Learning Impact

An important finding from our ablation study is that while curriculum learning improves the performance of our models overall, the architectures that combine curriculum learning with the multi-label classifier—GIT-CXR-CLS (SV+C+CL) and GIT-CXR-CLS (MV+C+CL)—perform worse than their counterparts without the classifier.
This unexpected result breaks the trend observed in our other experiments, where the addition of a classification head consistently led to performance gains. Although we initially expected GIT-CXR-CLS (MV+C+CL) to produce the best performance, since it incorporates all proposed improvements, our experiments show that the classification head is not compatible with the curriculum learning strategy we employed. We attribute this to the fact that our curriculum methodology radically changes the pathology distribution seen by each classification head across epochs. Shorter reports, used in the early stages of training, often correspond to healthy patients with minimal findings that did not require detailed explanations, whereas longer reports are generally associated with many underlying conditions (Figure 4 and Figure 5). Since many pathologies already have a low prevalence in the training data (half of them appearing in less than 5% of the patients), our length-based curriculum introduces an even more skewed distribution early in training. The prevalence of each pathology across bins is shown in Figure 6. Notably, in the first bin, 57% of the examples have no finding, and this number steadily decreases to about 8% for the last bin. Similarly, analyzing each pathology individually, the prevalence steadily increases from one bin to the next. This constant distribution shift, together with the particularly high imbalance in the early stages of training, hinders the learning of the multi-label classifier. This finding highlights a unique challenge in medical text generation that distinguishes it from general-domain tasks.
We also test our hypothesis that curriculum learning leads to better performance in generating long reports and make a direct comparison between GIT-CXR (MV+C), trained without curriculum learning, and GIT-CXR (MV+C+CL), the corresponding model trained with curriculum learning. Figure 7 shows the generalization performance on the validation set for the two models over epochs. While the METEOR score is used for illustration, similar trends are observed across all NLG metrics. Despite a faster initial improvement for the model without CL—which is expected, as the CL model is initially trained mostly on short reports—the non-CL model quickly stagnates. In contrast, the CL model continues to improve as longer reports are gradually incorporated into the training process. In Figure 8, we show how the evaluation metrics vary with the length of the target report on the test set. First, with respect to METEOR (Figure 8a) and ROUGE-L (Figure 8b), we can see how the two models perform similarly for reports up to 75 tokens. For longer sequences, although both models exhibit a linear drop in performance, the model trained with curriculum learning shows a less steep decline. Second, with respect to F1-micro (Figure 8c), both models perform similarly poorly for very short sequences, but this is mostly due to their rarity and the unreliability of metrics in that regime (Figure 9). However, from 50 tokens onward, the model trained without curriculum learning is increasingly affected by the target length, whereas the model that uses curriculum learning maintains consistently high performance even for very long sequences.
Next, we analyze the impact of curriculum learning for reports corresponding to different pathologies. To this end, Figure 10 shows the Meteor improvement of GIT-CXR (MV+C+CL) over GIT-CXR (MV+C) for each pathology, while Figure 11 shows how the Meteor score of GIT-CXR (MV+C+CL) varies with report length for each pathology, as well as the performance improvement achieved through CL across different report lengths. First, we observe that there is little variation in Meteor scores across pathologies. This is expected for NLG metrics, which are not intended to reflect clinical accuracy. Second, we notice a drop in performance from early bins (0–3) to late bins (7–9), consistent with the trend shown in Figure 8a. However, this trend appears consistently across all pathologies. Then, analyzing the improvement produced by CL, we find it is consistently around 3 pp, with a maximum of 4 pp for Consolidation and Pneumonia, and a minimum of 1 pp for No Finding. Finally, despite some exceptions (e.g., Consolidation, No Finding) the improvement tends to be higher for mid and late bins, demonstrating the benefits of curriculum learning in generating longer coherent reports. Moreover, the early bins also show a positive impact, indicating that our CL approach does not harm performance on shorter reports from an NLG perspective. Although we present only the Meteor score, similar improvements are made for all NLG metrics.
Similarly, Figure 12 shows how the F1 score for each pathology varies with report length, as well as the performance improvement achieved through curriculum learning across different report lengths. First, we notice that the performance of GIT-CXR (MV+C+CL) is consistent across the report lengths for most of the pathologies. Some exceptions are represented by Fracture or No Finding reports, which are both underrepresented in the test set for certain lengths (Table 3 shows their small support in the test set, while Figure 6 shows the skewed prevalence of reports containing Fracture and No Finding toward late and early bins, respectively). Second, we notice that curriculum learning leads to improvements for all report lengths, with up to 25 pp improvement for Pneumonia reports from early bins. Finally, CL proves to be particularly beneficial for long reports from late bins, as 13 of 14 pathologies see an improvement in the F1 score over the same model without curriculum learning.
Although Figure 12 shows a decrease in performance for certain (pathology, bin) pairs, these again correspond to underrepresented settings, where the metric values are unreliable. For example, the decrease of 9 pp for Lung Lesion reports from early bins corresponds to a low support in the test set (Table 3) and a low prevalence in early bins (Figure 6). However, Figure 13 shows that 12 of 14 pathologies benefit from curriculum learning. While the occasional decreases in performance are small (3 pp for Enlarged Cardiomediastinum and 1 pp for Pneumothorax), the performance gains are usually substantial (e.g., 20 pp for Pneumonia, 16 pp for Lung Opacity, 12 pp for Pleural Effusion, etc.), as shown by the gains of 10 pp and 7 pp in the micro and macro averages, respectively.
To conclude, we prove that the reports generated by our best model, trained with curriculum learning, have high clinical accuracy and can be leveraged by the CheXbert labeler to identify pathologies more reliably than using the reports generated by previous works. Moreover, our results show that our newly introduced curriculum learning technique has a greater positive impact than the widely used method of adding a classification head (e.g., GIT-CXR (MV+C+CL) vs. GIT-CXR-CLS (MV+C)). This highlights curriculum learning as a meaningful advancement in X-ray report generation. A promising direction for future work is exploring ways to combine curriculum learning with classification heads, enabling models to take advantage of both techniques effectively.

5.2. Labels Analysis

As there are 13 pathologies that may appear in the reports (14 labels if we include ‘No Finding’), a model may perform very differently in identifying each one of them: the labels are unevenly represented in the dataset, and some pathologies are easier to identify from a radiographic image than others. Therefore, we conduct an analysis of how well our best model performs for each individual label, based on the results shown in Table 3.
Table 3. Clinical accuracy metrics per pathology for our best model GIT-CXR (MV+C+CL).

Category                      F1     P      R      Support
Enlarged Cardiomediastinum    0.088  0.203  0.057  230
Cardiomegaly                  0.642  0.627  0.658  1168
Lung Opacity                  0.478  0.532  0.434  1131
Lung Lesion                   0.148  0.333  0.096  178
Edema                         0.461  0.492  0.433  695
Consolidation                 0.162  0.262  0.118  187
Pneumonia                     0.259  0.282  0.239  213
Atelectasis                   0.432  0.463  0.404  890
Pneumothorax                  0.310  0.310  0.310  71
Pleural Effusion              0.669  0.676  0.661  1116
Pleural Other                 0.113  0.296  0.070  114
Fracture                      0.033  0.143  0.019  161
Support Devices               0.766  0.776  0.757  1327
No Finding                    0.317  0.244  0.451  193
MACRO_AVG                     0.349  0.403  0.336
MICRO_AVG                     0.537  0.573  0.506
First, we notice that the precision score tends to be higher, with a minimum value of 0.143 for the Fracture label, whereas the recall reaches values smaller than 0.1 for several labels, including Enlarged Cardiomediastinum, Lung Lesion, Pleural Other, and Fracture. This shows a tendency of the model to fail to identify a pathology rather than falsely mention it in the report. This is further supported by the fact that the precision is considerably greater than the recall for multiple categories, such as Enlarged Cardiomediastinum (3.5 times larger), Lung Lesion (3.5 times larger), Consolidation (2.2 times larger), Pleural Other (4.2 times larger), or Fracture (7.5 times larger).
Second, we notice a correlation between the support (i.e., number of instances) of a label and the gap between its precision and recall. More precisely, for all labels with support smaller than 300 samples, with the exception of Pneumothorax and No Finding (which marks the absence of pathologies), the precision is considerably higher than the recall. On the other hand, for the better-represented categories (more than 300 samples), precision and recall are usually close to each other, and recall even exceeds precision in the case of Cardiomegaly.
We therefore conclude that poor model performance on certain pathologies is primarily due to their underrepresentation in the dataset, rather than the inherent difficulty of those tasks. The threshold of 300 samples corresponds to approximately 10% of the dataset, as this analysis is performed on the test set comprising 3082 samples. The strong correlation between performance and label frequency also explains the discrepancy between the macro average (which weights all labels equally) and the considerably higher micro average (which weights all samples equally). This further demonstrates that computing the metrics based solely on the five most represented labels [24] yields artificially better scores.

5.3. Reports Analysis

In Table 4, we present three examples of generated reports alongside their corresponding context and ground truth report, using the GIT-CXR (MV+C+CL) model.
The first example is a complete report that contains both the ‘impression’ and the ‘findings’ sections. Although the ROUGE-L (0.349) and METEOR (0.330) scores are relatively good, the BLEU score is significantly lower than the overall performance of our model (BLEU-1 is 0.165 vs. 0.396). This illustrates how the length discrepancy between the generated and target reports negatively affects the BLEU score, due to the brevity penalty applied when the generated output is too short.
The second example lacks the ‘impression’ section in the target report, but despite this, our model is able to generate with decent accuracy the ‘findings’ section and leaves the ‘impression’ section empty. As a result, the generated report achieves very high scores on all the NLG metrics (BLEU-1—0.571, ROUGE-L—0.492, METEOR—0.701).
The final example features a minimal context ‘picc.’ and a lengthy target report that includes both the ‘impression’ and ‘findings’ sections. The results in this case are very poor on all NLG metrics. That is because the short and uninformative context contributes little, and the generated report is much shorter than the target, omitting the ‘impression’ section entirely. Furthermore, the METEOR score is also very low (0.162), as the generated content fails to capture the meaning of the target report.

6. Conclusions

In this work, we developed and evaluated an end-to-end transformer approach utilizing the GIT transformer, combined with additional components, to address the task of X-ray medical report generation. Curriculum learning proved particularly beneficial, improving performance on more challenging cases that required the generation of longer reports. In our experiments, we integrated various relevant techniques and achieved results that exceed current state-of-the-art methods on application-relevant metrics, such as the NLG metric METEOR and the clinical accuracy metric F1. Furthermore, we obtained on-par results on other NLG metrics, including BLEU and ROUGE-L. We conducted an extensive ablation study and a thorough analysis of the label-wise performance and generated reports to better understand the strengths and limitations of our approach. Our work paves the way for further research and improvements in this direction.

7. Limitations

Our work has a few limitations that should be acknowledged. First, we used only a single dataset to test our proposed method. However, we made sure to use the largest publicly available dataset, MIMIC-CXR, which is about 50 times larger than IU-Xray [47], the other widely used dataset for radiology report generation. We focused on maximizing comparability to past and future work by employing the most complete set of evaluation metrics and by using the original train–validation–test split, unlike many other papers. Second, while we obtained state-of-the-art performance on a variety of metrics by introducing our curriculum learning method, there is still room for research on the compatibility between this technique and the setup using an additional classification head. This would enable future approaches to benefit from both techniques, as both individually led to an impressive performance boost. Third, while our work focused on simplicity, both in terms of the architecture used and on the introduction of the NLP-inspired curriculum learning method, based strictly on the report length, we acknowledge the fact that a different curriculum learning method designed specifically for medical imaging could improve the results even further. Finally, even though our results are state of the art in many aspects, they are yet to be on par with radiologist performance to allow our approach to be used in real medical settings.

Author Contributions

Conceptualization, J.B. and T.R.; methodology, I.S. and I.-R.S.; software, I.S. and I.-R.S.; validation, I.S. and I.-R.S.; formal analysis, I.S. and I.-R.S.; investigation, I.S. and I.-R.S.; resources, I.S. and I.-R.S.; data curation, I.S. and I.-R.S.; writing—original draft preparation, I.S. and I.-R.S.; writing—review and editing, J.B. and T.R.; visualization, I.S. and I.-R.S.; supervision, J.B. and T.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the project “Romanian Hub for Artificial Intelligence-HRIA”, Smart Growth, Digitization and Financial Instruments Program, 2021-2027, MySMIS no. 334906.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available as follows: MIMIC-CXR-JPG dataset is available at https://physionet.org/content/mimic-cxr-jpg/2.1.0/ (accessed on 23 February 2024) and the code is available on GitHub at https://github.com/iustinsirbu13/GIT-CXR (accessed on 4 April 2025).

Conflicts of Interest

Author Traian Rebedea was employed by the company NVIDIA. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AP    Anterior–Posterior (regarding the viewpoint of a chest X-ray image)
CL    Curriculum Learning
CNN   Convolutional Neural Network
GIT   Generative Image-to-text Transformer
LAT   Lateral (regarding the viewpoint of a chest X-ray image)
LL    Left Lateral (regarding the viewpoint of a chest X-ray image)
LSTM  Long Short-Term Memory
NLG   Natural Language Generation
NLP   Natural Language Processing
PA    Posterior–Anterior (regarding the viewpoint of a chest X-ray image)
RNN   Recurrent Neural Network

References

  1. Jing, B.; Xie, P.; Xing, E. On the automatic generation of medical imaging reports. arXiv 2017, arXiv:1711.08195. [Google Scholar]
  2. Li, Y.; Liang, X.; Hu, Z.; Xing, E.P. Hybrid retrieval-generation reinforced agent for medical image report generation. Adv. Neural Inf. Process. Syst. 2018, 31, 1537–1547. [Google Scholar]
  3. Delrue, L.; Gosselin, R.; Ilsen, B.; Van Landeghem, A.; de Mey, J.; Duyck, P. Difficulties in the interpretation of chest radiography. In Comparative Interpretation of CT and Standard Radiography of the Chest; Springer: Berlin/Heidelberg, Germany, 2011; pp. 27–49. [Google Scholar]
  4. Cao, Y.; Cui, L.; Zhang, L.; Yu, F.; Li, Z.; Xu, Y. MMTN: Multi-modal memory transformer network for image-report consistent medical report generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 277–285. [Google Scholar]
  5. Wang, J.; Yang, Z.; Hu, X.; Li, L.; Lin, K.; Gan, Z.; Liu, Z.; Liu, C.; Wang, L. Git: A generative image-to-text transformer for vision and language. arXiv 2022, arXiv:2205.14100. [Google Scholar]
  6. Tanida, T.; Müller, P.; Kaissis, G.; Rueckert, D. Interactive and explainable region-guided radiology report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7433–7442. [Google Scholar]
  7. Bu, S.; Li, T.; Yang, Y.; Dai, Z. Instance-level Expert Knowledge and Aggregate Discriminative Attention for Radiology Report Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 14194–14204. [Google Scholar]
  8. Zhao, G.; Zhao, Z.; Gong, W.; Li, F. Radiology report generation with medical knowledge and multilevel image-report alignment: A new method and its verification. Artif. Intell. Med. 2023, 146, 102714. [Google Scholar] [CrossRef]
  9. Johnson, A.E.; Pollard, T.J.; Greenbaum, N.R.; Lungren, M.P.; Deng, C.y.; Peng, Y.; Lu, Z.; Mark, R.G.; Berkowitz, S.J.; Horng, S. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv 2019, arXiv:1901.07042. [Google Scholar]
  10. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
  11. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  12. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81. [Google Scholar]
  13. Cornia, M.; Stefanini, M.; Baraldi, L.; Cucchiara, R. Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10578–10587. [Google Scholar]
  14. Nguyen, V.Q.; Suganuma, M.; Okatani, T. Grit: Faster and better image captioning transformer using dual visual features. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 167–184. [Google Scholar]
  15. Zhang, X.; Sun, X.; Luo, Y.; Ji, J.; Zhou, Y.; Wu, Y.; Huang, F.; Ji, R. Rstnet: Captioning with adaptive attention on visual and non-visual words. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15465–15474. [Google Scholar]
  16. Yuan, L.; Chen, D.; Chen, Y.L.; Codella, N.; Dai, X.; Gao, J.; Hu, H.; Huang, X.; Li, B.; Li, C.; et al. Florence: A new foundation model for computer vision. arXiv 2021, arXiv:2111.11432. [Google Scholar]
  17. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164. [Google Scholar]
  18. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 2048–2057. [Google Scholar]
  19. Li, C.Y.; Liang, X.; Hu, Z.; Xing, E.P. Knowledge-driven encode, retrieve, paraphrase for medical image report generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6666–6673. [Google Scholar]
  20. Srinivasan, P.; Thapar, D.; Bhavsar, A.; Nigam, A. Hierarchical X-ray report generation via pathology tags and multi head attention. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  21. Yin, C.; Qian, B.; Wei, J.; Li, X.; Zhang, X.; Li, Y.; Zheng, Q. Automatic generation of medical imaging diagnostic report with hierarchical recurrent neural network. In Proceedings of the 2019 IEEE International Conference on Data Mining (ICDM), Beijing, China, 8–11 November 2019; pp. 728–737. [Google Scholar]
  22. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  24. Miura, Y.; Zhang, Y.; Tsai, E.B.; Langlotz, C.P.; Jurafsky, D. Improving factual completeness and consistency of image-to-text radiology report generation. arXiv 2020, arXiv:2010.10042. [Google Scholar]
  25. Lovelace, J.; Mortazavi, B. Learning to generate clinically coherent chest X-ray reports. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 1235–1243. [Google Scholar]
  26. Nguyen, H.T.; Nie, D.; Badamdorj, T.; Liu, Y.; Zhu, Y.; Truong, J.; Cheng, L. Automated generation of accurate & fluent medical X-ray reports. arXiv 2021, arXiv:2108.12126. [Google Scholar]
  27. Chen, Z.; Song, Y.; Chang, T.H.; Wan, X. Generating Radiology Reports via Memory-driven Transformer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 1439–1449. [Google Scholar]
  28. Wang, Z.; Han, H.; Wang, L.; Li, X.; Zhou, L. Automated radiographic report generation purely on transformer: A multicriteria supervised approach. IEEE Trans. Med Imaging 2022, 41, 2803–2813. [Google Scholar] [CrossRef] [PubMed]
  29. Nicolson, A.; Dowling, J.; Koopman, B. Improving chest X-ray report generation by leveraging warm starting. Artif. Intell. Med. 2023, 144, 102633. [Google Scholar] [CrossRef] [PubMed]
  30. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 41–48. [Google Scholar]
  31. Subramanian, S.; Rajeswar, S.; Dutil, F.; Pal, C.; Courville, A. Adversarial generation of natural language. In Proceedings of the 2nd Workshop on Representation Learning for NLP, Vancouver, BC, Canada, 3 August 2017; pp. 241–251. [Google Scholar]
  32. Spitkovsky, V.I.; Alshawi, H.; Jurafsky, D. Baby Steps: How “Less is More” in unsupervised dependency parsing. In NIPS: Grammar Induction, Representation of Language and Language Learning; Neural Information Processing Systems Foundation: San Diego, CA, USA, 2009; pp. 1–10. [Google Scholar]
  33. Chang, E.; Yeh, H.S.; Demberg, V. Does the order of training samples matter? improving neural data-to-text generation with curriculum learning. arXiv 2021, arXiv:2102.03554. [Google Scholar]
  34. Lotter, W.; Sorensen, G.; Cox, D. A multi-scale CNN and curriculum learning strategy for mammogram classification. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, 14 September 2017; Proceedings 3. Springer: Berlin/Heidelberg, Germany, 2017; pp. 169–177. [Google Scholar]
  35. Jiménez-Sánchez, A.; Mateus, D.; Kirchhoff, S.; Kirchhoff, C.; Biberthaler, P.; Navab, N.; Ballester, M.A.G.; Piella, G. Curriculum learning for improved femur fracture classification: Scheduling data with prior knowledge and uncertainty. Med. Image Anal. 2022, 75, 102273. [Google Scholar] [CrossRef]
  36. Oksuz, I.; Ruijsink, B.; Puyol-Antón, E.; Clough, J.R.; Cruz, G.; Bustin, A.; Prieto, C.; Botnar, R.; Rueckert, D.; Schnabel, J.A.; et al. Automatic CNN-based detection of cardiac MR motion artefacts using k-space data augmentation and curriculum learning. Med. Image Anal. 2019, 55, 136–147. [Google Scholar] [CrossRef]
  37. Wei, J.; Suriawinata, A.; Ren, B.; Liu, X.; Lisovsky, M.; Vaickus, L.; Brown, C.; Baker, M.; Nasir-Moin, M.; Tomita, N.; et al. Learn like a pathologist: Curriculum learning by annotator agreement for histopathology image classification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 2473–2483. [Google Scholar]
  38. Alsharid, M.; El-Bouri, R.; Sharma, H.; Drukker, L.; Papageorghiou, A.T.; Noble, J.A. A curriculum learning based approach to captioning ultrasound images. In Proceedings of the Medical Ultrasound, and Preterm, Perinatal and Paediatric Image Analysis: First International Workshop, ASMUS 2020, and 5th International Workshop, PIPPI 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, 4–8 October 2020; Proceedings 1. Springer: Berlin/Heidelberg, Germany, 2020; pp. 75–84. [Google Scholar]
  39. Sparck Jones, K. A statistical interpretation of term specificity and its application in retrieval. J. Doc. 1972, 28, 11–21. [Google Scholar] [CrossRef]
  40. Liu, F.; Ge, S.; Zou, Y.; Wu, X. Competence-based multimodal curriculum learning for medical report generation. arXiv 2022, arXiv:2206.14579. [Google Scholar]
  41. Smit, A.; Jain, S.; Rajpurkar, P.; Pareek, A.; Ng, A.Y.; Lungren, M.P. CheXbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. arXiv 2020, arXiv:2004.09167. [Google Scholar]
  42. Johnson, A.E.; Pollard, T.J.; Berkowitz, S.J.; Greenbaum, N.R.; Lungren, M.P.; Deng, C.y.; Mark, R.G.; Horng, S. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 2019, 6, 317. [Google Scholar] [CrossRef]
  43. Irvin, J.; Rajpurkar, P.; Ko, M.; Yu, Y.; Ciurea-Ilcus, S.; Chute, C.; Marklund, H.; Haghgoo, B.; Ball, R.; Shpanskaya, K.; et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 590–597. [Google Scholar]
  44. Endo, M.; Krishnan, R.; Krishna, V.; Ng, A.Y.; Rajpurkar, P. Retrieval-based chest x-ray report generation using a pre-trained contrastive language-image model. In Proceedings of the Machine Learning for Health, Virtual, 6–7 August 2021; pp. 209–219. [Google Scholar]
  45. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  46. Boag, W.; Hsu, T.M.H.; McDermott, M.; Berner, G.; Alesentzer, E.; Szolovits, P. Baselines for chest X-ray report generation. In Proceedings of the Machine Learning for Health Workshop, Virtual, 11 December 2020; pp. 126–140. [Google Scholar]
  47. Demner-Fushman, D.; Kohli, M.D.; Rosenman, M.B.; Shooshan, S.E.; Rodriguez, L.; Antani, S.; Thoma, G.R.; McDonald, C.J. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inform. Assoc. 2016, 23, 304–310. [Google Scholar] [CrossRef]
Figure 1. Example of a study of one patient. Each study contains (a) a medical report, and (b–d) a variable number of X-ray images from various views: anterior–posterior (AP), posterior–anterior (PA), lateral (LAT) or left-lateral (LL).
Figure 2. GIT-CXR (SV) and GIT-CXR-CLS (SV) architectures, without and with the red path, respectively.
Figure 3. GIT-CXR (MV) architecture with 2 images.
Figure 4. The evolution of mean report length with the number of pathologies present in the report.
Figure 5. The mean number of diseases for each of the bins used by our curriculum learning approach.
Figure 6. The prevalence of diseases across the bins created by report length.
Figure 7. The evolution of METEOR on the validation dataset, across epochs, for the GIT-CXR (MV+C+CL) and GIT-CXR (MV+C) models. Similar curves are obtained for the other NLG metrics. The epoch value is normalized to account for the factor f used by the CL approach.
Figure 8. The evolution of the metric scores with report length.
Figure 9. Distributions for the length of the generated reports (blue) and the length of the target reports (yellow), which have been truncated to 192 tokens (orange vertical line).
Figure 10. METEOR score comparison per pathology for GIT-CXR (MV+C) and GIT-CXR (MV+C+CL).
Figure 11. (Left) METEOR score per pathology and bin for GIT-CXR (MV+C+CL). (Right) The METEOR improvement over GIT-CXR (MV+C). Bins are grouped together to avoid computing metrics on (pathology, bin) pairs with few examples in the test set.
Figure 12. (Left) F1 score per pathology and bin for GIT-CXR (MV+C+CL). (Right) The F1 improvement over GIT-CXR (MV+C). Bins are grouped together to avoid computing metrics on (pathology, bin) pairs with few examples in the test set.
Figure 13. F1 score comparison per pathology for GIT-CXR (MV+C) and GIT-CXR (MV+C+CL).
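Figures 5–7 above refer to the report-length bins and the pacing factor f used by the curriculum learning (CL) schedule. The following Python sketch illustrates one way such a length-based curriculum can be implemented; the number of bins, the quantile-based bin edges and the "one additional bin every f epochs" pacing rule are illustrative assumptions and do not necessarily match the exact configuration used for GIT-CXR.

```python
import numpy as np

def assign_length_bins(report_lengths, n_bins=5):
    """Split reports into equally populated bins by token length (bin 0 = shortest)."""
    edges = np.quantile(report_lengths, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(report_lengths, edges)

def curriculum_indices(bins, epoch, n_bins=5, f=2.0):
    """Training indices available at a given epoch: one extra bin every f epochs (illustrative pacing)."""
    max_bin = min(n_bins - 1, int(epoch / f))
    return np.flatnonzero(bins <= max_bin)

# Toy example: eight reports with token lengths.
lengths = np.array([40, 55, 60, 75, 90, 120, 150, 190])
bins = assign_length_bins(lengths)
for epoch in range(0, 9, 2):
    idx = curriculum_indices(bins, epoch)
    print(f"epoch {epoch}: training on {len(idx)} of {len(lengths)} reports")
```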
Table 1. Results on the full MIMIC-CXR-JPG dataset [9].
Model | BL1 | BL2 | BL3 | BL4 | RGL | M | F1MA | F1MI | F1MI5 | F1EX
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
GIT-CXR (MV+C+CL) | 0.403 | 0.286 | 0.215 | 0.168 | 0.312 | 0.369 | 0.348 | 0.534 | 0.565 | 0.458
GIT-CXR (SV+C+CL) | 0.393 | 0.278 | 0.208 | 0.162 | 0.305 | 0.359 | 0.327 | 0.505 | 0.538 | 0.428
ARR TR [28] | 0.351 | 0.223 | 0.157 | 0.118 | 0.287 | – | – | – | – | –
RGRG [6] | 0.373 | 0.249 | 0.175 | 0.126 | 0.264 | 0.168 | – | – | 0.547 | 0.447
EKAGen [7] | 0.419 | 0.258 | 0.170 | 0.119 | 0.287 | 0.157 | – | 0.499 | – | –
CvT-212Distil [29] | 0.392 | 0.245 | 0.169 | 0.124 | 0.285 | 0.153 | – | 0.384 | – | –
R2GEN [27] | 0.353 | 0.218 | 0.145 | 0.103 | 0.277 | 0.142 | – | 0.276 | – | –
AGA (MV+T+I) [26] † | 0.495 | 0.360 | 0.278 | 0.224 | 0.390 | 0.222 | – | – | – | –
LOVE [25] † | 0.415 | 0.272 | 0.193 | 0.146 | 0.318 | 0.159 | 0.228 | 0.411 | – | –
MMTN [4] † | 0.379 | 0.238 | 0.159 | 0.116 | 0.283 | 0.161 | – | – | – | –
CXR-RePaiR [44] † | – | 0.069 | – | – | – | – | – | – | – | 0.274
M²TR [24] † | – | – | – | 0.133 | – | – | – | – | 0.567 | –
Note: Models marked with '†' do not use the original splits. A dash indicates that the metric is not reported. The best results for each metric are highlighted in bold and the second best are underlined. Apart from ours, ARR TR is the only other end-to-end transformer architecture; all architectures listed below it use additional CNN/LSTM [22] modules. Unless specified otherwise, results are taken from the original papers. All of our results are averaged over three training runs.
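For clarity, the clinical-accuracy columns denote macro-averaged F1 (F1MA), micro-averaged F1 (F1MI), what we read as micro-averaged F1 over a five-finding subset (F1MI5), and example-averaged F1 (F1EX), all computed on CheXbert-style [41] binary label matrices. The snippet below is a minimal sketch of how these variants can be computed with scikit-learn; the random matrices and the five-finding indices are placeholders, and the exact evaluation code used for the paper may differ.

```python
import numpy as np
from sklearn.metrics import f1_score

# y_true / y_pred: (n_reports, 14) binary matrices of CheXbert-style findings (random placeholders here).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 14))
y_pred = rng.integers(0, 2, size=(100, 14))

f1_macro = f1_score(y_true, y_pred, average="macro")      # F1MA
f1_micro = f1_score(y_true, y_pred, average="micro")      # F1MI
f1_example = f1_score(y_true, y_pred, average="samples")  # F1EX (example-averaged)

# F1MI5: micro-averaged F1 over a five-finding subset (the indices below are placeholders).
key_findings = [1, 4, 7, 8, 12]
f1_micro_5 = f1_score(y_true[:, key_findings], y_pred[:, key_findings], average="micro")

print(f"F1MA={f1_macro:.3f}  F1MI={f1_micro:.3f}  F1MI5={f1_micro_5:.3f}  F1EX={f1_example:.3f}")
```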
Table 2. Results of the ablation study on the full MIMIC-CXR-JPG dataset [9] using the original splits.
Model | BL1 | BL2 | BL3 | BL4 | RGL | M | F1MA | F1MI | F1MI5 | F1EX
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
GIT-CXR-CLS (MV+C+CL) | 0.389 | 0.274 | 0.205 | 0.159 | 0.302 | 0.357 | 0.318 | 0.495 | 0.530 | 0.420
GIT-CXR-CLS (SV+C+CL) | 0.386 | 0.273 | 0.204 | 0.159 | 0.301 | 0.355 | 0.312 | 0.486 | 0.513 | 0.411
GIT-CXR (MV+C+CL) | 0.403 | 0.286 | 0.215 | 0.168 | 0.312 | 0.369 | 0.348 | 0.534 | 0.565 | 0.458
GIT-CXR (SV+C+CL) | 0.393 | 0.278 | 0.208 | 0.162 | 0.305 | 0.359 | 0.327 | 0.505 | 0.538 | 0.428
GIT-CXR-CLS (MV+C) | 0.354 | 0.254 | 0.193 | 0.152 | 0.310 | 0.351 | 0.308 | 0.486 | 0.526 | 0.410
GIT-CXR-CLS (SV+C) | 0.352 | 0.252 | 0.189 | 0.149 | 0.307 | 0.348 | 0.313 | 0.487 | 0.516 | 0.412
GIT-CXR (MV+C) | 0.343 | 0.248 | 0.188 | 0.149 | 0.311 | 0.347 | 0.298 | 0.462 | 0.496 | 0.386
GIT-CXR (SV+C) | 0.324 | 0.230 | 0.172 | 0.136 | 0.294 | 0.331 | 0.257 | 0.407 | 0.428 | 0.334
GIT-CXR (MV) | 0.316 | 0.199 | 0.130 | 0.090 | 0.240 | 0.291 | 0.294 | 0.495 | 0.536 | 0.415
GIT-CXR (SV) | 0.299 | 0.187 | 0.122 | 0.084 | 0.235 | 0.282 | 0.262 | 0.452 | 0.500 | 0.376
Note: The best results for each metric are highlighted in bold and the second best are underlined.
Table 4. Examples of reports generated using the GIT-CXR (MV+C+CL) model, alongside their corresponding context, ground truth report and NLG metrics.
Metrics reported for each generated report: BLEU-1–4 (BL1–4), ROUGE-L (RG-L) and METEOR (M).
Example 1
Context: _ year old male with history of metastatic melanoma, now with recurrent seizures and lethargy, comes here to evaluate for pneumonia.
Target: Impression: No acute cardiopulmonary process. Findings: Frontal and lateral radiographs of the chest redemonstrate a round calcified pulmonary nodule in the posterior right lung base, unchanged from multiple priors and consistent with prior granulomatous disease. A known enlarged right hilar lymph node seen on ct of _ likely accounts for the increased opacity at the right hilum. A known right mediastinal lymph node conglomerate accounts for the fullness at the right paratracheal region. No pleural effusion, pneumothorax or focal consolidation is present. The patient is status post median sternotomy and cabg with wires intact. The cardiac silhouette is normal in size. The mediastinal and hilar contours are unchanged from the preceding radiograph.
Generated: Impression: No acute cardiopulmonary process. Findings: Frontal and lateral radiographs of the chest show no focal consolidation, pleural effusion or pneumothorax. The cardiomediastinal and hilar contours are normal. The patient is status post median sternotomy with intact wires.
BLEU-1–4: 0.165 / 0.135 / 0.117 / 0.106; ROUGE-L: 0.349; METEOR: 0.330
Example 2
Context: post procedure, to assess for pneumothorax.
Target: Impression: Findings: In comparison with the study of _, there is no evidence of pneumothorax. continued low lung volumes with substantial mass in the right paratracheal region.
Generated: Impression: Findings: In comparison with the study of _, there is little overall change. Again there are low lung volumes with mild atelectatic changes at the bases. no evidence of acute focal pneumonia or vascular congestion.
BLEU-1–4: 0.571 / 0.501 / 0.455 / 0.412; ROUGE-L: 0.492; METEOR: 0.701
Example 3
Context: picc.
Target: Impression: _ Left picc tip appears to terminate in the distal left brachiocephalic vein. _ Mild pulmonary vascular congestion. _ Interval improvement in aeration of the lung bases with residual streaky opacity likely reflective of atelectasis. Interval resolution of the left pleural effusion. Findings: Left picc tip is seen terminating in the region of the distal left brachiocephalic vein. Tracheostomy tube is in unchanged standard position. the heart is moderately enlarged. marked calcification of the aortic knob is again present. Mild pulmonary vascular congestion is similar. Bibasilar streaky airspace opacities are minimally improved. Previously noted left pleural effusion appears to have resolved. No pneumothorax is identified. Percutaneous gastrostomy tube is seen in the left upper quadrant.
Generated: Impression: Findings: There has been interval removal of a right sided picc. The right sided picc line has been removed. Tracheostomy tube and central venous catheter are again seen. There is a moderate cardiomegaly. There is hazy opacification of the right lung base which is stable. There is no overt pulmonary edema.
BLEU-1–4: 0.148 / 0.068 / 0.029 / 0.000; ROUGE-L: 0.170; METEOR: 0.162
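The per-report scores in Table 4 can be reproduced, at least approximately, with standard NLG-metric implementations. The sketch below computes BLEU-1 through BLEU-4, ROUGE-L and METEOR for a single (target, generated) pair using NLTK and the rouge-score package; the toy report pair, the whitespace tokenization and the smoothing function are assumptions, so the resulting numbers are illustrative rather than an exact reproduction of the evaluation pipeline used in the paper.

```python
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)   # required by METEOR
nltk.download("omw-1.4", quiet=True)

# Toy (target, generated) pair for illustration only.
target = ("impression: no acute cardiopulmonary process. findings: frontal and lateral "
          "radiographs of the chest show no focal consolidation, pleural effusion or pneumothorax.")
generated = ("impression: no acute cardiopulmonary process. findings: the lungs are clear "
             "without focal consolidation, effusion or pneumothorax.")

ref, hyp = target.split(), generated.split()
smooth = SmoothingFunction().method1

# Cumulative BLEU-1..4 scores.
bleu = [sentence_bleu([ref], hyp, weights=tuple(1.0 / n for _ in range(n)), smoothing_function=smooth)
        for n in range(1, 5)]

rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(target, generated)["rougeL"].fmeasure
meteor = meteor_score([ref], hyp)

print("BLEU-1..4:", [round(b, 3) for b in bleu], " ROUGE-L:", round(rouge_l, 3), " METEOR:", round(meteor, 3))
```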