Review

The Current Landscape of Automatic Radiology Report Generation with Deep Learning: A Scoping Review

by
Patricio Meléndez Rojas
1,2,*,
Jaime Jamett Rojas
1,
María Fernanda Villalobos Dellafiori
2,
Pablo R. Moya
3,4 and
Alejandro Veloz Baeza
5
1
Ph.D. Program in Health Sciences and Engineering, Universidad de Valparaíso, Valparaíso 2540064, Chile
2
Faculty of Dentistry, Universidad Andres Bello, Viña del Mar 2520000, Chile
3
Centro Interdisciplinario de Neurociencias de Valparaíso (CINV), Universidad de Valparaíso, Valparaíso 2340000, Chile
4
Institute of Physiology, Faculty of Sciences, Universidad de Valparaíso, Valparaíso 2340000, Chile
5
Faculty of Engineering, Universidad de Valparaiso, Valparaíso 2340000, Chile
*
Author to whom correspondence should be addressed.
AI 2026, 7(1), 8; https://doi.org/10.3390/ai7010008
Submission received: 1 November 2025 / Revised: 6 December 2025 / Accepted: 24 December 2025 / Published: 29 December 2025
(This article belongs to the Section Medical & Healthcare AI)

Abstract

Automatic radiology report generation (ARRG) has emerged as a promising application of deep learning (DL) with the potential to alleviate reporting workload and improve diagnostic consistency. However, despite rapid methodological advances, the field remains technically fragmented and not yet mature for routine clinical adoption. This scoping review maps the current ARRG research landscape by examining DL architectures, multimodal integration strategies, and evaluation practices from 2015 to April 2025. Following the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines, a comprehensive literature search identified 89 eligible studies, revealing a marked predominance of chest radiography datasets (87.6%), primarily driven by their public availability and the accelerated development of automated tools during the COVID-19 pandemic. Most models employed hybrid architectures (73%), particularly CNN–Transformer pairings, reflecting a shift toward systems that combine local feature extraction with global contextual reasoning. Although these approaches have achieved measurable gains in textual and semantic coherence, several challenges persist, including limited anatomical diversity, weak alignment with radiological rationale, and evaluation metrics that insufficiently reflect diagnostic adequacy or clinical impact. Overall, the findings indicate a rapidly evolving but clinically immature field, underscoring the need for validation frameworks that more closely reflect radiological practice and support future deployment in real-world settings.

Graphical Abstract

1. Introduction

The interpretation of medical images and the generation of radiological reports are core components of diagnostic assessment, treatment planning, and ongoing patient monitoring [1,2]. Producing a coherent and clinically meaningful report requires not only the accurate recognition of imaging findings but also their integration into a structured diagnostic narrative. This process demands years of specialized training and contributes to workload pressures in radiology departments [1,2,3]. These constraints, together with the risk of inter-observer variability, have motivated growing interest in computational systems capable of supporting or partially automating the reporting process [1,2,3].
Deep learning (DL) has become the predominant paradigm in automatic radiology report generation (ARRG) [1], commonly implemented through encoder–decoder architectures in which convolutional neural networks (CNNs) extract visual representations and text-based decoders generate the final report [3]. Early works predominantly relied on recurrent neural networks (RNNs), including long short-term memory (LSTM) models [1]. In contrast, recent advances have introduced attention-based mechanisms, such as Transformers [4], and multimodal contrastive frameworks, such as contrastive language-image pretraining (CLIP) [5]. Further developments include domain knowledge-guided strategies [1,6], attention-based architectures [7], reinforcement learning [2], large language models (LLMs) [8,9,10], and hybrid approaches that integrate multiple mechanisms to improve clinical accuracy [1].
However, despite these methodological advances, the current body of evidence remains fragmented, with substantial gaps in clinically aligned validation, semantic faithfulness, and the integration of structured medical knowledge into report generation pipelines [1,2,5]. In this context, most published ARRG models demonstrate promising linguistic performance but remain in a stage of limited clinical readiness, showing insufficient validation for deployment in routine radiological workflows.
The advancement of ARRG relies on specialized datasets comprising medical images paired with textual reports [1,5], with public benchmarks such as MIMIC-CXR and IU-Xray frequently supporting this line of research [1,7]. Yet, evaluating model performance remains challenging due to the heterogeneity of metrics: some focus on textual similarity (e.g., BLEU, ROUGE, BERTScore), while others assess the correctness of clinical findings (e.g., AUC, F1 score), and evaluations often lack standardized clinical validation [1,5,6,7]. Although prior reviews have addressed selected aspects of ARRG or specific architectures such as Transformers and multimodal methods [3,4,5,10,11], the rapid proliferation of DL-based systems has produced a fragmented landscape that complicates the global understanding of methodological progress and its alignment with clinical readiness.
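As an illustration of how these surface-level textual metrics are computed in practice, the following minimal Python sketch scores a hypothetical generated report against a hypothetical reference using the nltk and rouge-score packages; both sentences are invented for illustration and are not drawn from any reviewed dataset.

```python
# Minimal sketch: computing BLEU-1 and ROUGE-L for a generated report against a
# reference report, using the nltk and rouge-score packages.
# The example reports below are invented for illustration only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the heart size is normal and the lungs are clear"       # ground-truth report (hypothetical)
candidate = "heart size is normal lungs are clear without effusion"  # model output (hypothetical)

# BLEU-1: unigram precision, with smoothing because the texts are short.
bleu1 = sentence_bleu(
    [reference.split()], candidate.split(),
    weights=(1.0, 0, 0, 0),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L: longest-common-subsequence F-measure.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rougeL = scorer.score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU-1: {bleu1:.3f}  ROUGE-L: {rougeL:.3f}")
```

As the discussion below notes, such n-gram scores reward lexical overlap rather than diagnostic correctness, which is precisely why they are increasingly complemented by clinically oriented measures.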
Given the heterogeneity and rapid evolution of the literature, this study was conducted as a scoping review to systematically map the available evidence, summarize methodological developments, and identify existing knowledge gaps in ARRG. This review aims to provide a comprehensive overview of DL-based ARRG, outlining its principal characteristics, methodological approaches, data sources, evaluation strategies, and key findings, and evaluating their alignment with the requirements for future clinical implementation.

2. Materials and Methods

This scoping review was conducted following the PRISMA-ScR guidelines. A comprehensive search was conducted in the following electronic databases: IEEE Xplore, ACM Digital Library, PubMed/MEDLINE, Scopus, and Web of Science (WoS) Core Collection. This review included original English-language research articles from peer-reviewed journals published between 2015 and April 2025. Eligible studies were required to address the automatic generation of radiology reports using DL architectures, including but not limited to CNNs, RNNs, Transformers, graph neural networks (GNNs), or hybrid models. Studies also had to employ multimodal input data, typically pairing imaging data with a text-generation component, with or without additional structured information, as a core element of the report-generation pipeline. Exclusion criteria comprised studies outside the radiology domain, non-original publications (such as reviews or editorials), and works lacking sufficient methodological or outcome detail for complete assessment. Grey literature was not considered in this review.
The detailed search strategies applied to each database are presented in Table 1. The initial retrieval of records was performed automatically by executing the specific search strings listed in Table 1 within each database’s search engine. All identified records were imported into Zotero® (Version 7.0.30 64-bit; Corporation for Digital Scholarship, Vienna, VA, USA) for bibliographic management. Following duplicate removal, study selection was conducted using Rayyan QCRI® (web-based application; Qatar Computing Research Institute, Doha, Qatar), in which titles and abstracts were independently screened by the first author (MP) and the second author (JJ) according to predefined eligibility criteria. Full-text versions of potentially relevant publications were retrieved and independently evaluated by MP and JJ for final inclusion. Any disagreements arising at either screening stage were resolved by the third reviewer (VM).
Data were charted using a pre-defined extraction form, as recommended in scoping review methodology. Data extraction focused on key study characteristics, including study objectives, radiological domain, datasets used, input modalities, DL architectures and methodological details, report-generation pipeline characteristics, and evaluation metrics. Bibliometric information (authors, publication year, country, and publication source), reported limitations, and suggested future research directions were also recorded. The extraction form was jointly developed by MP and JJ, who independently completed and iteratively refined the dataset until consensus was reached. Studies were subsequently grouped according to the DL methodology employed, and their main characteristics and findings were synthesized narratively.
As scoping reviews aim to map the available evidence rather than evaluate study quality, no formal risk-of-bias assessment was undertaken. Instead, the completeness and transparency of reporting were qualitatively appraised using the TRIPOD-LLM guideline [12]. Given the exploratory nature of this review, no quantitative assessment of publication bias was performed; however, potential reporting-related sources of bias were considered during the synthesis. The review protocol was registered in PROSPERO (CRD420251044453) to ensure methodological transparency, acknowledging that formal registration is optional for scoping reviews.

3. Results

A PRISMA flow chart of the screening and selection process is presented in Figure 1, and Table 1 summarizes the number of articles retrieved from each database. A total of 89 studies met the inclusion criteria. Research activity in this field has intensified markedly in recent years: one eligible study was published in 2020, followed by five in 2021, nineteen in 2022, fourteen in 2023, and twenty-nine in 2024, with a slight decline to twenty-one in the partial 2025 dataset (Figure 2). Assessment using the TRIPOD-LLM checklist showed that while most studies reported performance metrics in detail, essential elements such as participant description, sample size justification, and external validation were frequently absent, and none of the reviewed articles achieved full compliance.
Regarding the clinical scope, chest radiography overwhelmingly dominated, with 78 of 89 studies (87.6%) using chest X-ray (CXR) datasets, primarily from publicly available repositories. Other anatomical regions, such as brain CT/MRI, abdominal CT, spinal imaging, and oral/maxillofacial radiology, were represented only sporadically, with four or fewer publications each (Figure 2). This distribution shows that although ARRG has advanced rapidly, its development remains restricted mainly to thoracic imaging.
Geographic distribution further reveals that ARRG research is concentrated almost exclusively in a small number of countries. As shown in Figure 3, China accounts for the majority of publications (n = 56), followed by India (n = 9), Pakistan (n = 4), and Brazil (n = 3), while all remaining contributing nations report no more than two studies each. This pattern reflects structural disparities in AI research capacity and emphasizes that most methodological innovation in ARRG is driven by institutions with access to large-scale datasets and high computational resources, predominantly located in high-resource settings.
From a methodological perspective, hybrid DL architectures were the most frequently represented approach. Studies most commonly combined convolutional encoders with Transformer-based language modeling, followed by pure Transformer-based strategies, more complex multi-hybrid approaches, and finally CNN–RNN pipelines, which continue to appear but at a reduced rate. Together, these trends indicate a clear shift toward architectures that integrate localized feature extraction with broader contextual reasoning, reflecting not only technical preferences but also a gradual movement toward models designed to encode both localized saliency and global diagnostic context. To contextualize the frequency with which each family of models appears in the literature, Figure 4 summarizes the relative prevalence of the main deep learning approaches in the included studies. The complete list of articles is presented in Table 2.
To facilitate comparison across methodological strategies, the studies were categorized into four architectural groups according to their prevalence in the included literature: (i) CNN–Transformer hybrid architectures, (ii) purely Transformer-based methods, (iii) multi-hybrid combinations integrating multiple DL paradigms, and (iv) traditional CNN–RNN encoder–decoder pipelines. In addition to architectural prevalence, performance across the included studies was assessed using BLEU-1 as the most frequently reported textual similarity metric. As illustrated in Figure 5, BLEU-1 values show substantial variability across models. These categories structure the subsequent analysis of performance characteristics and evaluation approaches (Figure 4 and Figure 5).

3.1. CNN + Transformers

ARRG methods have increasingly adopted encoder–decoder architectures, leveraging CNNs for visual encoding and Transformer networks for language decoding [14,15,16]. This represents a shift from earlier approaches utilizing RNNs like LSTMs [14,16,17], with Transformers offering superior capacity to model long-range dependencies and process information in parallel, resulting in richer contextual representations [15,18]. Widely cited examples of this framework include the Memory-driven Transformer (R2Gen) [14,18,19,20,21,22,23,24,25,26] and its variant, the Cross-modal memory network (R2GenCMN) [16,18,19,24,25,26,27,28], which enhance information flow and cross-modal alignment. Other approaches, such as RATCHET, incorporate a standard Transformer decoder guided by CNN-derived features [14,29], while additional enhancements include relation memory units and cross-modal memory matrices [30].
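To make this encoder–decoder pattern concrete, the sketch below outlines a generic CNN–Transformer report generator in PyTorch: a ResNet-101 backbone supplies patch-level visual features that a Transformer decoder attends to while emitting report tokens. It is a minimal illustration of the general framework rather than an implementation of R2Gen, RATCHET, or any specific reviewed model; the vocabulary size, dimensions, and layer counts are assumed values.

```python
# Minimal sketch (not R2Gen or any reviewed model): a generic CNN–Transformer
# encoder–decoder for report generation. A ResNet-101 backbone produces patch
# features that a Transformer decoder cross-attends to while generating tokens.
# Vocabulary size, dimensions, and layer counts are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet101

class CnnTransformerReportGenerator(nn.Module):
    def __init__(self, vocab_size=5000, d_model=512, nhead=8, num_layers=3):
        super().__init__()
        backbone = resnet101(weights=None)                     # visual extractor
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Linear(2048, d_model)                   # map CNN channels to d_model
        self.embed = nn.Embedding(vocab_size, d_model)         # report token embeddings
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)          # next-token logits

    def forward(self, images, report_tokens):
        feats = self.cnn(images)                               # [B, 2048, H', W']
        memory = self.proj(feats.flatten(2).transpose(1, 2))   # [B, H'*W', d_model]
        tgt = self.embed(report_tokens)                        # [B, T, d_model]
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal)       # cross-attend to image patches
        return self.lm_head(out)                               # [B, T, vocab_size]

# Example forward pass with random data (illustrative shapes only).
model = CnnTransformerReportGenerator()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 5000, (2, 40)))
print(logits.shape)  # torch.Size([2, 40, 5000])
```

The memory modules and cross-modal matrices described above would sit between the visual features and the decoder in such a pipeline, enriching the context that the decoder attends to.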
A persistent challenge for these models is the inherent data imbalance in medical datasets, where normal findings vastly outnumber abnormal ones [14,20,21,22,30]. To mitigate this, contrastive learning strategies have been introduced, as in the Contrastive Triplet Network (CTN), which improves the representation of rare abnormalities by contrasting visual and semantic embeddings [14]. Integration of medical prior knowledge is another common enhancement, through knowledge graphs [14,19,21], disease tags [19,23,31], or anatomical priors [23], helping to guide generation toward clinically meaningful structures [32]. Additional refinements include organ-aware decoders [31], multi-scale feature fusion [17,33], and adaptive mechanisms that dynamically modulate the contribution of visual and semantic inputs [22]. Ablation studies consistently confirm the value of these specialized components in improving performance [16,33,34,35].
Evaluations on public datasets, notably IU X-ray [14,15,23,25,28,33,36] and MIMIC-CXR [14,15,23,25,28,33,36], demonstrate that CNN–Transformer based methods achieve state-of-the-art (SOTA) results across conventional natural language generation metrics including BLEU, METEOR, ROUGE-L, and CIDEr [14,16,21,28,30,31,33,36]. Reported improvements include better abnormality description, higher sentence coherence, and enhanced medical correctness [14,15,23,24,34,37], with successful applications extending to cranial trauma and polydactyly reporting [18,26]. Nevertheless, these models still struggle to represent uncertainty, severity, and extremely rare pathologies [34], indicating that although their textual fidelity has advanced considerably, their clinical interpretability and deployment readiness remain limited.

3.2. Transformers

Transformer-based architectures have become a potent solution for ARRG, offering significant advancements over traditional CNN–RNN and LSTM approaches [38,39,40,41,42,43]. Their primary advantage lies in the ability to model long-range dependencies, which is critical for radiological reporting, where multiple anatomical and pathological findings must be described coherently within a single narrative [38,40,43,44]. These systems are typically structured as encoder–decoder frameworks [38,39,40,45,46,47] or as pure Transformer-based designs [44]. Many implementations leverage pretrained models such as Vision Transformers (ViT) or Swin Transformer for image encoding, paired with language models like GPT-2 or BERT for generation, improving performance in scenarios constrained by limited medical data [38,42,43,45,46,48,49,50,51,52].
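A minimal sketch of this pretrained-pairing strategy is shown below using the Hugging Face transformers library, which wires a ViT encoder to a GPT-2 decoder through its VisionEncoderDecoderModel wrapper. The checkpoints are generic public examples rather than those used in the reviewed studies, and the newly created cross-attention weights are randomly initialized, so meaningful reports would require fine-tuning on paired image–report data.

```python
# Minimal sketch: pairing a pretrained ViT image encoder with a GPT-2 decoder via
# Hugging Face's VisionEncoderDecoderModel, as a stand-in for the ViT/Swin + GPT-2/BERT
# pairings described above. Checkpoints are generic examples, not the reviewed models.
import torch
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Token ids the wrapper needs for conditional generation with GPT-2.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Dummy tensor standing in for a preprocessed chest radiograph; without fine-tuning
# on image–report pairs the generated text is meaningless (structure shown only).
pixel_values = torch.randn(1, 3, 224, 224)
generated_ids = model.generate(pixel_values, max_new_tokens=40, num_beams=3)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```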
Attention mechanisms play a central role in integrating multimodal features [39,40,42,43,53,54]. Cross-attention modules allow fine-grained alignment between visual and textual embeddings, improving structural coherence and semantic grounding [38,39,45,46,51]. Additional optimization strategies include graph-based fusion [38], multi-feature enhancement modules [54], and specialized mechanisms to prioritize rare or diagnostically relevant content [48]. Memory augmentation [38,39,40,42,51,55,56], knowledge integration through factual priors or graphs [39,51,54,55,56,57], and object-level feature extraction [58] further enhance semantic accuracy. Term-weighting and vocabulary-masking strategies reduce reliance on frequent, standard descriptors, thereby improving recall of abnormal findings [44].
Transformers are typically evaluated using standard natural language generation metrics such as BLEU, ROUGE, and METEOR [38,42,44,45,46,51,52,54,56,59,60], as well as semantic similarity measures including BERTScore and CheXbert [46,49,51,58,60]. Some studies incorporate clinical validation via tools such as RadGraph or expert assessment [41,46,47,60], reflecting a growing emphasis on clinically meaningful correctness. However, despite these advances, most models still rely on surrogate textual similarity metrics and have not yet demonstrated consistent alignment with radiological reasoning in real-world settings, indicating that their clinical maturity remains preliminary.

3.3. Multihybrid

There is a clear trend toward multi-hybrid architectures that combine different DL paradigms and integrate external medical knowledge to enhance semantic alignment and clinical relevance [61,62]. Most approaches continue to employ an encoder–decoder structure in which CNNs (such as ResNet [61,62,63,64,65,66] or VGG19 [61,66,67]) function as visual encoders, while LSTM-based [61,62,63,65,66,68] or hierarchical RNN decoders [67,69] generate the textual output, with Transformers increasingly incorporated to improve long-term dependency modeling [61,62,64,65,66,67,68,69,70,71,72].
A central design objective of these hybrid configurations is to improve cross-modal alignment between image regions and textual concepts. To this end, memory networks are widely used to store or retrieve image–text correspondences [61,62,64,66,68,73,74], align medical terminology with localized visual features [64,71,74], or stabilize representations during generation [61,62,64,66,71,72,74,75]. Contrastive learning is also frequently adopted to reinforce the semantic distinction between positive and negative image–text pairs, improving the detection of clinically relevant abnormalities [63,66,74]. External knowledge sources, including knowledge graphs [66,73,74], curated medical vocabularies [71,74], and retrieval-augmented mechanisms that draw on similar prior reports [62,64,71,72,74], are increasingly used to embed domain expertise into the generation process.
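The contrastive objective underlying several of these hybrids can be illustrated with a short CLIP-style InfoNCE sketch, in which matched image–report embeddings are pulled together and mismatched pairs are pushed apart; the embedding dimension and temperature below are illustrative assumptions rather than values taken from the reviewed studies.

```python
# Minimal sketch of a CLIP-style image–text contrastive (InfoNCE) objective, of the
# kind used to separate matched image–report pairs from mismatched ones.
# Embedding dimension and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize both modalities so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # [B, B] similarity matrix
    targets = torch.arange(logits.size(0))                   # matched pairs lie on the diagonal
    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example with random embeddings standing in for CNN image features and report encodings.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```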
Some models also emulate elements of radiological reasoning by incorporating multi-expert or multi-stage workflows [65] or by implementing strategies that first localize salient regions and then generate narratively coherent text [71]. Others address data imbalance and background noise by refining lesion-level representations through denoising or saliency-aware mechanisms [66,71,76], or by prioritizing diagnostically abnormal content through report-level reordering [69]. The introduction of contextual embeddings derived from large language models such as BERT further improves lexical richness and contextual fidelity [67,69,72].
Performance in this category is typically benchmarked using the IU-Xray [61,62,63,64,66,67,68,69,70,71,72,73,74,75] and MIMIC-CXR [61,62,63,64,65,66,68,69,70,71,72,73,74,75] datasets. Reported gains are consistent across BLEU [61,62,63,64,65,66,67,68,69,70,71,73,74,75,76], METEOR [61,62,63,64,65,66,68,70,72,73,74,76,77], ROUGE-L [61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76], and CIDEr [64,67,69,70,71,73,74,75,76], with qualitative analyses [68,69,72] and ablation studies [66,70,72] confirming the contribution of hybrid components to improved fluency, abnormality description, and semantic correctness [65,68,70]. However, despite these gains, clinical deployment remains limited, as most evaluation frameworks still prioritize textual similarity rather than diagnostic alignment or interpretive reasoning.

3.4. CNN + RNN Architectures

ARRG frequently employs encoder–decoder architectures [77,78,79,80,81,82,83], commonly consisting of a CNN encoder to extract visual features from medical images [84] and an RNN decoder, often an LSTM or Gated Recurrent Unit (GRU), to generate the report text sequentially [73,77,78,80,81,82,83,85,86,87]. This framework seeks to translate visual representations into descriptive narratives [77,78,83,86]. Performance is commonly evaluated using standard NLG metrics such as BLEU [77,78,79,80,81,82,83,84,87,88,89,90], ROUGE (especially ROUGE-L) [77,78,79,80,81,82,83,84,87,89,90], CIDEr [79,82,84,87,88,89], and METEOR [79,81,83,87,89], with some studies supplementing text-based evaluation with diagnostic accuracy pipelines or automated clinical assessment tools such as CheXpert [79,81,90].
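A minimal sketch of this classic pipeline is given below: a ResNet-50 encoder condenses the image into a single vector that initializes a GRU decoder, which then generates the report token by token. It is a schematic illustration of the CNN + RNN family rather than any specific reviewed system; the vocabulary and hidden sizes are assumed values.

```python
# Minimal sketch of the classic CNN–RNN pipeline: a ResNet-50 encoder summarizes the
# image into one vector that initializes a GRU decoder generating the report
# token by token. Vocabulary size and hidden sizes are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class CnnGruReportGenerator(nn.Module):
    def __init__(self, vocab_size=5000, hidden_size=512):
        super().__init__()
        backbone = resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # globally pooled features
        self.init_h = nn.Linear(2048, hidden_size)                  # image vector -> initial hidden state
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, images, report_tokens):
        img_vec = self.cnn(images).flatten(1)                       # [B, 2048]
        h0 = torch.tanh(self.init_h(img_vec)).unsqueeze(0)          # [1, B, hidden]
        emb = self.embed(report_tokens)                             # [B, T, hidden]
        out, _ = self.gru(emb, h0)                                  # sequential decoding
        return self.lm_head(out)                                    # [B, T, vocab_size]

model = CnnGruReportGenerator()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 5000, (2, 30)))
print(logits.shape)  # torch.Size([2, 30, 5000])
```

The strictly sequential hidden-state update in the GRU is the structural bottleneck discussed below: each token depends on the previous step, which limits parallelism and long-range context compared with Transformer decoders.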
Several CNN + RNN-based methods have reported competitive or even superior performance across multiple benchmarks [78,80,81,87,88,90]. For instance, a CNN-LSTM model incorporating attention achieved a BLEU-4 of 0.155 on MIMIC-CXR [78], while the G-CNX network, combining ConvNeXtBase with a GRU decoder, obtained BLEU-1 scores of 0.6544 on IU-Xray and 0.5593 on ROCOv2 [81]. Similarly, the HReMRG-MR method, based on LSTMs with reinforcement learning, demonstrated improvements over several baselines on both IU-Xray and MIMIC-CXR [87]. Additional architectures targeting specialized reporting tasks, such as proximal femur fracture assessment in Dutch, have also reported strong performance [88].
However, despite these promising results, CNN + RNN models are limited by their sequential decoding nature, which restricts long-range contextual reasoning and reduces their ability to handle complex radiological narratives. As a result, while these architectures remain relevant in benchmarking and resource-constrained settings, their suitability for clinically aligned reporting is inherently limited relative to Transformer-based or hybrid approaches.

3.5. Others

Three studies employed architectures that did not fit into the previously defined categories due to distinctive design choices targeting specialized aspects of ARRG. The first approach replaces free-text generation with structured output by learning question-specific representations using tailored CNNs and MobileNets, which are subsequently classified with SVMs rather than decoded through a sequence generator [91]. This method demonstrated superior performance on the ImageCLEF2015 Liver CT annotation task, suggesting that task-adapted feature extraction can outperform shared representations in structured reporting problems. A second framework integrates a ViT encoder with a hierarchical LSTM decoder, augmented by a MIX-MLP module for multi-label classification and a POS-SCAN co-attention mechanism to fuse semantic and visual priors [92]. This hybrid configuration leverages label-aware alignment to improve anomaly identification and text coherence, with ablation studies confirming the contribution of its integrated components. The third approach incorporates a knowledge graph derived from disease label co-occurrence statistics, combining DenseNet-based visual features with a Transformer text encoder and a GNN reasoning layer before final report generation [93], thereby enabling structured medical knowledge to inform the decoding process.
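In the spirit of the third approach, the sketch below shows how a disease-label co-occurrence matrix can be turned into a normalized adjacency graph and used in a single graph-convolution-style propagation step over label embeddings before they condition generation; the labels, counts, and dimensions are toy values, and the reviewed model's actual graph construction and reasoning layer may differ.

```python
# Minimal sketch, loosely in the spirit of the knowledge-graph approach: build a graph
# from disease-label co-occurrence statistics and propagate label embeddings with one
# graph-convolution-style step. All labels and counts below are invented toy values.
import torch
import torch.nn as nn

# Binary label matrix: rows are studies, columns are disease labels (toy data).
labels = torch.tensor([[1, 1, 0, 0],
                       [1, 0, 1, 0],
                       [0, 1, 1, 1],
                       [1, 1, 0, 1]], dtype=torch.float)

co_occurrence = labels.t() @ labels                                          # [L, L] counts
adjacency = co_occurrence / co_occurrence.diag().clamp(min=1).unsqueeze(1)   # row-normalized P(j | i)
adjacency.fill_diagonal_(1.0)                                                # keep self-loops

class LabelGraphLayer(nn.Module):
    """One propagation step: mix each label embedding with its graph neighbours."""
    def __init__(self, dim=256):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, node_feats, adj):
        return torch.relu(self.linear(adj @ node_feats))   # [L, dim]

node_feats = nn.Embedding(4, 256).weight                   # learnable label embeddings
graph_feats = LabelGraphLayer()(node_feats, adjacency)
print(graph_feats.shape)  # torch.Size([4, 256])
```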
Although these methods differ from CNN–Transformer or hybrid pipelines, they collectively illustrate a movement toward architectures that incorporate structured reasoning or external supervision to compensate for dataset and interpretability limitations. However, because their adoption remains technically experimental and narrowly scoped, their translational maturity is still preliminary relative to the more broadly validated families of models.
Table 2. Summary of key characteristics, methodologies, and reported metrics for the included studies.
Year | Refs. | Radiological Domain | Datasets (DS) Used | Deep Learning Method | Architecture | Best BLEU-1 | Best Outcome DS
2020Zeng X. et al. [86]Ultrasound (US) (gallbladder, kidney, liver), Chest X-ray US image dataset (6563 images). IU-Xray (7470 images, 3955 reports)CNN, RNNSFNet (Semantic fusion network). ResNet-50. Faster RCNN. Diagnostic report generation module using LSTM0.65US DATASET
2021Liu G. et al. [94]Chest CTCOVID-19 CT dataset (368 reports, 1104 CT images). CX-CHR dataset (45,598 images, 28,299 reports), 12 million external medical textbooksTransformer, CNNMedical-VLBERT (with DenseNet-121 as backbone)0.70CX CHR CT
2021Alfarghaly O. et al. [49]Chest X-rayIU-Xray Transformer2 CDGPT2 (Conditioned Distil Generative Pre-trained Transformer)0.387IU-Xray
2021Moon J.H. et al. [95]Chest X-rayMIMIC-CXR (377,110 images, 227,835 reports). IU-Xray CNN, TransformerBERT-base* 0.126 (BLEU-4)MIMIC-CXR
2021Loveymi S. et al. [91]Liver CTLiver CT annotation dataset from ImageCLEF 2015 (50 patients)CNNMobileNet-V20.65ZGT HOSPITAL RX
2021Hou D. et al. [79]Chest X-rayMIMIC-CXR. IU-XrayCNN, RNN, GANEncoder with two branches (CNN based on ResNet-152 and MLC). Hierarchical LSTM decoder with multi-level attention and a reward module with two discriminators.* 0.148 (BLEU-4)MIMIC-CXR
2022Paalvast O. et al. [90]Proximal Femur Fractures (X-ray)Primary dataset: 4915 cases with 11,606 images and reports. Language model dataset: 28,329 radiological reportsCNN, RNNDenseNet-169 for classification. Encoder–Decoder for report generation. GloVe for language modeling0.65MAIN DATASET
2022Zhang D. et al. [93]Chest X-rayMIMIC-CXR. IU-XrayGNN, TransformerCustom framework using Transformer for the generation module0.505IU-Xray
2022Kaur. N. et al. [69]Chest X-rayIU-Xray CNN, RNN, TransformerCNN VGG19 network (feature extraction). BERT (language generation). DistilBERT (perform sentiment)0.772IU-Xray
2022Aswiga R. et al. [96]Liver CT and kidney, DBT (Digital Breast Tomosynthesis)ImageNet (25,000 images). CT abdomen and mammography images (750 images). CT and DBT medical images (150 images)RNN, CNNMLTL-LSTM model (Multi-level transfer learning framework with a long short-term-memory model)0.769CT-DBT
2022Xu Z. et al. [87]Chest X-rayMIMIC-CXR. IU-XrayCNN, RNNHReMRG-MR (Hybrid reinforced report generation method with m-linear attention and repetition penalty mechanism)0.4806MIMIC-CXR
2022Nicolson A. et al. [15]Chest X-rayMIMIC-CXR. IU-XrayCNN, Transformer CvT2DistilGPT20.4732IU-Xray
2022Yan S. et al. [64]Chest X-rayMIMIC-CXR. IU-XrayCNN, RNN, TransformerDenseNet (encoder). LSTM or Transformer (decoder). ATAG (Attributed abnormality graph) embeddings. GATE (gating mechanism)** 0.323 (BLEU AVG)IU-Xray
2022Gajbhiye G.O. et al. [8]Chest X-rayIU-Xray CNN, RNNAMLMA (Adaptive multilevel multi-attention)0.471IU-Xray
2022Shang C. et al. [22]Chest X-rayMIMIC-CXR. IU-XrayCNN, TransformerMATNet (Multimodal adaptive transformer)0.518IU-Xray
2022Li H. et al. [77]Chest X-rayIU-Xray CNN, RNN, Attention MechanismRCLN model (combining CNN, LSTM, and multihead attention mechanism). Pre-trained ResNet-152 (image encoder)0.4341IU-Xray
2022Najdenkoska I. et al. [72]Chest X-rayMIMIC-CXR. IU-XrayCNN, RNN, TransformerVTI (Variational topic inference) with LSTM-based and Transformer-based decoders0.503IU-Xray
2022Yang Y. et al. [14]Chest X-rayMIMIC-CXR. IU-XrayCNN, TransformerCTN built on Transformer architecture0.491IU-Xray
2022Kaur N. et al. [80]Chest X-rayIU-Xray CNN, RNN, Reinforcement LearningCADxReport (VGG19, HLSTM with co-attention mechanism and reinforcement learning)0.577IU-Xray
2022Wang J. et al. [27]Chest X-rayMIMIC-CXR. IU-XrayCNN, TransformerCAMANet (Class activation map guided attention network)0.504IU-Xray
2022Kaur N. et al. [84]Chest X-rayIU-Xray CNN, RNN, Attention MechanismCheXPrune (encoder–decoder architecture with VGG19 and hierarchical LSTM)0.5428IU-Xray
2022Yan B. et al. [23]Chest X-rayMIMIC-CXR. IU-XrayCNN, Transformer, VAEPrior Guided Transformer. ResNet101 (visual feature extractor). Vanilla Transformer (baseline)0.482IU-Xray
2022Wang Z. et al. [44]Chest X-rayMIMIC-CXR. IU-XrayTransformerPure Transformer-based Framework (custom architecture)0.496IU-Xray
2022Sirshar M. et al. [78]Chest X-rayMIMIC-CXR. IU-XrayCNN, RNNVGG16 (visual geometry group CNN). LSTM with attention0.580IU-Xray
2022Vendrow & Schonfeld [41]Chest X-rayMIMIC-CXR. CheXpertTransformerMeshed-memory augmented transformer architecture with visual extractor using ImageNet and CheXpert pre-trained weights0.348MIMIC-CXR
2023Wang R. et al. [54]Chest X-rayMIMIC-CXR. IU-XrayTransformerMFOT (Multi-feature optimization transformer)0.517IU-Xray
2023Mohsan M.M. et al. [42]Chest X-rayIU-Xray TransformerTrMRG (Transformer Medical Report Generator) using ViT as encoder, MiniLM as decoder0.5551IU-Xray
2023Xue Y. et al. [97]Chest X-rayMIMIC-CXR. IU-XrayTransformer, CNN, RNNASGMD (Auxiliary signal guidance and memory-driven) network. ResNet-101 and ResNet-152 as visual feature extractors0.489IU-Xray
2023Zhang S. et al. [36]Chest X-rayMIMIC-CXR. IU-XrayCNN, TransformerVisual Prior-based Cross-modal Alignment Network0.489IU-Xray
2023Zhang J. et al. [55]Chest X-ray, CT COVID-19MIMIC-CXR. IU-Xray. COV-CTR (728 images)TransformerICT (Information calibrated transformer)0.768COV-CTR
2023Pan R. et al. [16]Chest X-rayMIMIC-CXR. IU-XrayCNN, Transformer, Self-Supervised LearningS3-Net (Self-supervised dual-stream network)0.499IU-Xray
2023Yang Y. et al. [82]Chest X-rayMIMIC-CXR. IU-XrayCNN, RNNTriNet (custom architecture)0.478IU-Xray
2023Gu Y. et al. [88]Chest X-rayIU-Xray. Chexpert (224,316 images)CNN, RNNResNet50. CVAM + MVSL (Cross-view attention module and Medical visual-semantic LSTMs)0.460IU-Xray
2023Shetty S. et al. [85]Chest X-rayIU-Xray CNN, RNNEncoder–Decoder framework with UM-VES and UM-TES subnetworks and LSTM decoder0.5881IU-Xray
2023Zhao G. et al. [57]Chest X-rayMIMIC-CXR. IU-XrayTransformerResNet101 (visual extractor). 3-layer Transformer structure (encoder–decoder framework). BLIP architecture0.513IU-Xray
2023Hou X. et al. [56]Chest X-rayIU-Xray Transformer, Contrastive LearningMKCL (Medical knowledge with contrastive learning). ResNet-101. Transformer0.490IU-Xray
2023Xu D. et al. [62]Chest X-ray, DermoscopyIU-Xray, NCRC-DS (81 entities, 536 triples)CNN, RNN, TransformerDenseNet-121. ResNet-101. Memory-driven Transformer0.494IU-Xray
2023Zeng X. et al. [98]US (gallbladder, fetal heart), Chest X-rayUS dataset (6563 images and reports). Fetal Heart (FH) dataset (3300 images and reports). MIMIC-CXR. IU-XrayCNN, RNNAERMNet (Attention-Enhanced Relational Memory Network)0.890US DATASET
2023Guo K. et al. [59]Chest X-rayNIH Chest X-ray (112,120 images). MIMIC-CXR. IU-XrayTransformerViT. GNN. Vector Retrieval Library. Multi-label contrastive learning. Multi-task learning0.478IU-Xray
2024Pan Y. et al. [17]Chest X-rayMIMIC-CXR. IU-XrayCNN, TransformerSwin-Transformer0.499IU-Xray
2024Raminedi S. et al. [51]Chest X-rayIU-Xray TransformerViGPT2 model0.571IU-Xray
2024Liu Z. et al. [28]Chest X-rayMIMIC-CXR. IU-XrayCNN, TransformerFMVP (Flexible multi-view paradigm)0.499IU-Xray
2024Veras Magalhaes G. et al. [43]Chest X-rayProposed Dataset (21,970 images). IU-Xray. NIH Chest X-ray TransformerXRaySwinGen (Swin Transformer as image encoder, GPT-2 as textual decoder)0.731PT BR
2024Sharma D. et al. [33]Chest X-rayIU-Xray CNN, TransformerFDT-Dr 2 T (custom framework)0.531IU-Xray
2024Tang Y. et al. [99]Chest X-rayIU-Xray. XRG-COVID-19 (8676 scans, 8676 reports)CNN, TransformerDSA-Transformer with ResNet-101 as the backbone0.552XRG-COVID-19
2024Li H. et al. [32]Chest X-rayMIMIC-CXR. IU-XrayCNN, TransformerDenseNet-121. Transformer encoder. GPT-40.491IU-Xray
2024Gao N. et al. [58]Oral Panoramic X-rayOral panoramic X-ray image-report dataset (562 sets of images and reports). MIMIC-CXRTransformerMLAT (Multi-Level objective Alignment Transformer)0.5011PAN XRAY
2024Yang B. et al. [66]Chest X-rayMIMIC-CXR. IU-XrayCNN, RNN, TransformerMRARGN (Multifocal Region-Assisted Report Generation Network)0.502IU-Xray
2024Liu X. et al. [21]Chest X-rayMIMIC-CXR. IU-XrayCNN, TransformerMemory-driven Transformer (based on standard Transformer architecture with relational memory added to the decoder)0.508IU-Xray
2024Alotaibi F.S. et al. [67]Chest X-rayIU-Xray CNN, RNN, TransformerVGG19 (CNN) pre-trained over the ImageNet dataset. GloVe, fastText, ElMo, and BERT (extract textual features from the ground truth reports). Hierarchical LSTM (generate reports)0.612IU-Xray
2024Sun S. et al. [92]Chest X-rayMIMIC-CXR. IU-XrayRNN, Transformer, MLPTransformer (encoder). MIX-MLP multi-label classification network. CAM (Co-attention mechanism) based on POS-SCAN. Hierarchical LSTM (decoder)0.521IU-Xray
2024Zeiser F.A., et al. [45]Chest X-rayMIMIC-CXR TransformerCheXReport (Swin-B fully transformer)0.354MIMIC-CXR
2024Zhang K. et al. [25]Chest X-rayMIMIC-CXR. IU-XrayCNN, TransformerRAMT (Relation-Aware Mean Teacher). GHFE (Graph-guided hybrid feature encoding) module. DenseNet121 (visual feature extractor). Standard Transformer (decoder)0.482IU-Xray
2024Li S. et al. [31]Chest X-rayMIMIC-CXR. IU-XrayCNN, TransformerResNet-101. Transformer (multilayer encoder and decoder)0.514IU-Xray
2024Zheng Z. et al. [40]Chest X-rayMIMIC-CXR. IU-XrayTransformerTeam Role Interaction Network (TRINet)0.445MIMIC-CXR
2024Vieira P.D.A. et al. [26]Polydactyly X-rayCustom dataset (16,710 images and reports)CNN, TransformerInception-V3 CNN. Transformer Architecture0.516PD XRAY BR
2024Leonardi G. et al. [46]Chest X-rayMIMIC-CXR TransformerViT. GPT-2 (with custom positional encoding and beam search)* 0.095 (BLEU-4)MIMIC-CXR
2024Zhang J. et al. [89]Chest X-ray, Chest CT (COVID-19)MIMIC-CXR. IU-Xray. COV-CTRCNN, RNNHDGAN (Hybrid Discriminator Generative Adversarial Network)0.765COV-CTR
2024Alqahtani F. F. et al. [100]Chest X-rayIU-Xray. Custom dataset (1250 images and reports)CNN–TransformerCNX-B2 (CNN encoder, BioBERT transformer)0.479IU-Xray
2024Shahzadi I. et al. [37]Chest X-rayNIH ChestX-ray. IU-Xray CNN, TransformerCSAMDT (Conditional self-attention memory-driven transformer)0.504IU-Xray
2024Xu L. et al. [53]Ultrasound (gallbladder, kidney, liver), Chest X-ray MIMIC-CXR. IU-Xray. LGK US (6563 images and reports). TransformerCGFTrans (Cross-modal global feature fusion transformer)0.684US DATASET
2024Yi X. et al. [52]Chest X-rayMIMIC-CXR. IU-XrayTransformerTSGET (Two-stage global enhanced transformer)0.500IU-Xray
2024Zhang W. et al. [35]Chest X-rayMIMIC-CXR. IU-XrayCNN, TransformerVCIN (Visual-textual cross-modal interaction network). ACIE (Abundant clinical information embedding). Bert-based Decoder-only Generator0.508IU-Xray
2024Yi X. et al. [75]Chest X-rayMIMIC-CXR. IU-XrayCNN, RNN, TransformerMemory-driven Transformer0.539IU-Xray
2024Zhong Z. et al. [30]Chest X-rayMIMIC-CXR. Chest ImaGenome (237,853 images). Brown-COVID (1021 images). Penn-COVID (2879 images)CNN, TransformerMRANet (Multi-modality regional alignment network)0.504BROWN-COVID
2024Liu A. et al. [34]Chest X-rayMIMIC-CXR. IU-XrayCNN, TransformerResNet-101. Multilayer Transformer (encoder and decoder)0.472IU-Xray
2024Ran R. et al. [65]Chest X-rayMIMIC-CXR. IU-XrayCNN, RNN, TransformerMeFD-Net (proposed multi-expert diagnostic module). ResNet101 (visual encoder). Transformer (text generation module)0.505IU-Xray
2024Zhang K. et al. [19]Chest X-rayMIMIC-CXR. Chest ImaGenome (242,072 scene graphs)CNN, TransformerFaster R-CNN (object detection). GPT-2 Medium (report generation)0.391MIMIC-CXR
2025Zhu D. et al. [76]Chest X-rayMIMIC-CXR. IU-XrayCNN, RNN, TransformerDenoising multi-level cross-attention. Contrastive learning model (with ViTs-B/16 as visual extractor, BERT as text encoder)0.507IU-Xray
2025Huang L. et al. [74]Chest X-rayMIMIC-CXR. IU-XrayCNN, RNN, TransformerKCAP (Knowledge-guided cross-modal alignment and progressive fusion)0.517IU-Xray
2025Mei X. et al. [70]Chest X-rayMIMIC-CXR. IU-XrayCNN, RNN, Transformer, ViTATL-CA (Adaptive topic learning and fine-grained crossmodal alignment)0.487IU-Xray
2025Liu Y. et al. [63]Chest X-rayMIMIC-CXR. IU-XrayCNN, RNN, TransformerADCNet (Anomaly-driven cross-modal contrastive network). ResNet-101 and Transformer encoder–decoder architecture0.493IU-Xray
2025Singh P. et al. [50]Chest X-rayIU-Xray TransformerChestX-Transcribe (combines Swin Transformer and DistilGPT)0.675IU-Xray
2025Bouslimi R. et al. [18]Brain CT and MRI scansRSNA-IHDC dataset (674,258 brain CT images, 19,530 patients)CNN, TransformerAC-BiFPN (Augmented convolutional bi-directional feature pyramid network). Transformer model0.382RSNA IHDC CT
2025Dong Z. et al. [61]Chest X-rayMIMIC-CXR. IU-XrayCNN, RNN, TransformerDCTMN (Dual-channel transmodal memory network)0.506IU-Xray
2025Zhang J. et al. [73]Chest X-rayMIMIC-CXR. IU-XrayCNN, Transformer, Graph Reasoning Network (GRN), Cross-modal Gated Fusion Network (CGFN)ResNet101 (visual feature extraction). GRN. CGFN (Cross-modal gated fusion network). Transformer (decoder)0.514IU-Xray
2025Batool H. et al. [47]Spine CTVerSe20 (300 MDCT spine images)TransformerViT-Base. BioBERT BASE. MiniLM0.7291IU-Xray
2025Zhao J. et al. [48]Chest X-rayIU-Xray TransformerResNet-101 with CBAM (convolutional block attention module). Cross-attention mechanism0.456IU-Xray
2025Fang J. et al. [39]Chest X-rayMIMIC-CXR. IU-XrayTransformerMMG (Multi-modal granularity feature fusion)0.497IU-Xray
2025Ho, X. et al. [24]Chest X-rayMIMIC-CXR. IU-XrayCNN, TransformerRCAN (Recalibrated cross-modal alignment network)0.521IU-Xray
2025Ucan M. et al. [81]Chest X-rayIU-Xray CNN, RNNG-CNX (hybrid encoder–decoder architecture). ConvNeXtBase (encoder side). GRU-based RNN (decoder side)0.6544IU-Xray
2025Yang B. et al. [66]Chest X-rayMIMIC-CXR. IU-XrayCNN, RNN, TransformerDPN (Dynamics priori networks) with components including DGN (Dynamic graph networks), Contrastive learning, and PrKN (Prior knowledge networks). ResNet-152 (image feature extraction). SciBert (report embedding)0.409IU-Xray
2025Yu T. et al. [29]Chest X-ray, Bladder PathologyMIMIC-CXR. IU-Xray. 4253 bladder Pathology images. Transformer, CNNAHP (Adapter-enhanced hierarchical cross-modal pre-training)0.502IU-Xray
2025Liu F. et al. [101]Chest X-rayCOVIDx-CXR-2 (29,986 images). COVID-CXR (more than 900 images). BIMCV-COVID-19 (more than 10,000 images). COV-CTR. MIMIC-CXR. NIH ChestX-rayCNN, TransformerResNet-50 (image encoder). BERT (text encoder). Transformer-based model (with variants using LLAMA-2-7B and Transformer-BASE, decoder)* 0.63 (BLEU-4)COVID-19 DATASETS
2025Liu X. et al. [20]Chest X-rayMIMIC-CXR. IU-XrayCNN, TransformerCECL (Clustering enhanced contrastive learning)0.485IU-Xray
2025Sun S. et al. [68]Chest X-rayMIMIC-CXR. IU-XrayDiffusion Models, RNN, CNN, TransformerDiffusion Model-based architecture. ResNet34. Transformer structure using cross-attention0.422IU-Xray
2025Yang Y. et al. [60]Chest X-rayMIMIC-CXR. IU-XrayTransformerSTREAM (Spatio-temporal and retrieval-augmented modeling). SwinTransformer (Swin-Base) (encoder). TinyLlama-1.1B (decoder). 0.506IU-Xray
2025Tang Y. et al. [71]Chest X-rayMIMIC-CXR. ROCO (over 81,000 images)CNN, RNN, TransformerCAT (Cross-modal augmented transformer)0.491IU-Xray
2025Varol Arısoy M. et al. [38]Chest X-rayIU-Xray. COV-CTRTransformerMedVAG (Medical vision attention generation)0.808COV-CTR
* Indicates the BLEU-4 value, as the study did not report the BLEU-1 metric. ** Indicates an average of the BLEU metrics, as the study did not report the BLEU-1 metric. Citations for external public datasets included in this table are as follows: IU-Xray [102]; MIMIC-CXR [103]; ImageCLEF 2015 Liver CT [104]; ImageNet [105]; COV-CTR [106]; NIH Chest X-ray [107]; Chest ImaGenome [108]; RSNA-IHDC [109]; VerSe 20 [110]; Bladder Pathology [111]; COVIDx-CXR-2 [112]; BIMCV COVID-19 [113]; and ROCO [114]. Datasets not listed here were either custom-built or not publicly available.

4. Discussion

This scoping review offers a comprehensive mapping of methodological advances in ARRG, highlighting the predominant DL architectures, evaluation practices, and existing gaps in clinical alignment. The evolution of ARRG models demonstrates a clear transition from sequential language frameworks to more expressive, attention-based architectures capable of capturing long-range semantic dependencies and generating contextually coherent diagnostic narratives. Early CNN–RNN pipelines demonstrated the feasibility of translating image-derived representations into coherent textual descriptions. Still, their reliance on stepwise decoding limited their ability to capture global contextual dependencies within radiological narratives [115,116]. Subsequent extensions attempted to mitigate these constraints by leveraging multi-task learning and co-attention mechanisms, which improved alignment between visual features and semantic structure but remained fundamentally constrained by the sequential nature of RNN-based decoding [117]. The introduction of Transformer-based models marked a methodological inflection point by enabling parallelized processing, improved feature integration, and richer contextual reasoning [115]. Hybrid architectures further enhanced performance by combining CNNs for localized feature extraction with Transformers for global semantic modeling. At the same time, more recent developments incorporate memory augmentation, medical priors, or retrieval-based alignment mechanisms to compensate for limited contextual cues in public datasets.
More recent research has explored advanced strategies to improve report quality further and strengthen alignment with radiological reasoning. Memory-augmented architectures and models that incorporate structured medical knowledge have shown promising performance, particularly on extensive public benchmarks such as MIMIC-CXR [116]. These systems typically enrich the representation space by integrating pre-built knowledge graphs or retrieval-based mechanisms that draw on similar reports or pathology patterns [49,118,119], moving closer to the way radiologists ground their interpretation in prior clinical context. However, although these mechanisms improve semantic alignment, they do not yet guarantee diagnostic accountability or case-level reasoning, limiting their contribution to actual clinical readiness.
Other innovations include Region-guided Report Generation (RGRG), which enhances explainability by anchoring narrative content to localized anatomical regions [120]; RECAP, which introduces temporal reasoning to capture disease progression and longitudinal consistency [121]; and UAR (“unify, align, and refine”), a framework designed to align visual and textual features across multiple semantic levels [122]. Progressive-generation strategies have also been proposed to refine report outputs iteratively, leading to more stable, coherent, and clinically focused narratives [123]. These advances indicate that recent improvements in ARRG extend beyond raw language-modeling capacity to increasingly incorporate mechanisms that emulate elements of human diagnostic reasoning.
Collectively, this architectural trajectory demonstrates substantial technical maturation, although increased model complexity has not yet translated into consistent improvements in clinically grounded interpretability or translational maturity.
However, evaluating ARRG systems remains one of the most critical challenges in the field. Widely used NLG metrics such as BLEU [124] and ROUGE [125] primarily measure surface-level similarity based on n-gram overlap [125,126,127]. While applicable in some contexts, these metrics often fail to capture deeper semantic equivalence, paraphrastic variation, or clinically relevant word ordering. In fact, studies have shown that BLEU scores correlate poorly with human judgment in image captioning tasks [126,127], and their alignment with expert radiologist evaluations of CXR reports is particularly weak [128]. Furthermore, we observed inconsistent standardization and detailed reporting of computational efficiency metrics (e.g., parameter count, inference latency, and memory footprint) across the included studies. This variability, often exacerbated by heterogeneous experimental settings and hardware specifications, severely hinders the direct comparison of the computational overhead among different model architectures. Addressing this deficit is imperative for future clinical deployment, particularly in resource-constrained settings, where minimal computational demands are paramount.
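For reference, the kind of efficiency reporting called for here can be produced with a few lines of PyTorch, as in the sketch below, which counts trainable parameters and times forward passes; a torchvision ResNet-50 stands in for an ARRG model, and the numbers obtained depend entirely on hardware, batch size, and input resolution.

```python
# Minimal sketch of reporting the efficiency figures discussed above: trainable
# parameter count and mean inference latency. A torchvision ResNet-50 stands in
# here for an ARRG model; absolute timings depend on hardware and input size.
import time
import torch
from torchvision.models import resnet50

model = resnet50(weights=None).eval()
params_m = sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

dummy = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    for _ in range(3):                      # warm-up iterations
        model(dummy)
    runs = 10
    start = time.perf_counter()
    for _ in range(runs):
        model(dummy)
    latency_ms = (time.perf_counter() - start) / runs * 1000

print(f"Parameters: {params_m:.1f} M, mean latency: {latency_ms:.1f} ms per image")
```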
Another limitation lies in the availability of reference reports. It is estimated that around 50 human-written reports per image are needed to achieve a reliable consensus, yet most public datasets provide only five [127]. Overcoming these limitations will require not only more sophisticated model architectures but also the development of large, high-quality datasets, such as MIMIC-CXR, combined with clinically aligned evaluation metrics, such as RadGraph F1 and RadCliQ [128], which better reflect the true diagnostic quality and clinical usefulness of generated reports.
Medical databases often demonstrate a tenuous connection to authentic clinical scenarios. The process of capturing real-world medical data is fraught with challenges, leading to datasets that are limited and biased towards common cases while marginalizing critical abnormalities. Such limitations restrict linguistic diversity and impede the development of varied descriptions, particularly for rare and nuanced cases, which are crucial for precise clinical diagnosis. Moreover, these biases not only constrain linguistic variability but also undermine the generalizability of models across different institutions and clinical contexts, ultimately reducing their applicability in diverse real-world settings. As a result, models dependent on these datasets may experience deficiencies in accuracy and reliability when deployed in clinical practice.
Compared with previous reviews, this work provides a broader, more up-to-date overview of automated radiology report generation. While Kaur et al. [2] focused exclusively on CXR, limiting generalization to other modalities, our review highlights the need to extend ARRG research beyond thoracic imaging. Similarly, the review by Mahmoud et al. [129] emphasized early CNN-RNN approaches but does not cover recent advances such as Transformer-based models and knowledge-enhanced frameworks. Furthermore, although Liao et al. [130] provided a systematic analysis of datasets and evaluation methods, their discussion lacks a strong connection to clinical challenges and real-world applicability. In contrast, this review not only synthesizes current technical trends but also situates them within the broader clinical workflow, emphasizing integration into diagnostic practice, highlighting the limitations of existing evaluation metrics, and proposing future research directions to improve the accuracy and practical utility of ARRG systems.
Several methodological constraints should be acknowledged. First, the absence of a formal risk-of-bias assessment—consistent with the aims of a scoping review—may limit the interpretability and comparability of findings across studies. Second, restricting the search to peer-reviewed English-language literature introduces potential language and publication biases. Third, despite the use of comprehensive database-specific strategies, the heterogeneity of terminology in automated radiology report generation may have led to the omission of studies using non-standard descriptors. Lastly, the evidence base is heavily shaped by the availability of large public chest X-ray datasets, which reinforces modality-specific biases and restricts the generalizability of observed methodological trends to other imaging domains.
This review provides a comprehensive and up-to-date overview of the current state of ARRG using DL, with a particular focus on architectural trends, evaluation practices, and clinical applications. Additionally, by capturing not only technical details but also clinical context, this work helps bridge the gap between algorithmic development and real-world diagnostic needs. As a scoping review, this work did not aim to appraise the quality of the evidence or conduct a statistical synthesis, but rather to map the conceptual and methodological evolution of the field. Nevertheless, some limitations should be acknowledged. The review predominantly reflects research efforts focused on CXR, primarily driven by the accessibility of public datasets such as MIMIC-CXR and the urgency created by the COVID-19 pandemic. Consequently, other anatomical regions and imaging modalities remain underexplored, highlighting the need for broader dataset development and more diverse applications. Furthermore, while the review describes prevailing evaluation metrics, it also reveals ongoing limitations in these measures’ ability to capture actual clinical relevance. Looking forward, future research should prioritize clinically aligned evaluation frameworks, expand model development beyond thoracic imaging, and explore integrating large language models with domain-specific knowledge to improve report quality and diagnostic accuracy. Addressing these gaps is essential to realizing the full potential of automated report generation in supporting radiologists and enhancing healthcare delivery.

5. Conclusions

This scoping review maps and synthesizes the current state of ARRG using DL, highlighting both its methodological evolution and its emerging clinical relevance. The key findings can be summarized as follows:
  • The field remains heavily concentrated on chest radiography, with more than 87% of studies based on CXR datasets. This reflects the availability of public data and the acceleration of thoracic imaging research during the COVID-19 pandemic, but also exposes a lack of anatomical diversity that limits generalizability to other diagnostic domains encountered in routine radiological practice.
  • Hybrid architectures, particularly CNN–Transformer combinations, represent the dominant methodological trend (73% of included studies). By leveraging CNNs for localized visual encoding and Transformer modules for contextual reasoning, these models generate reports with greater coherence and better representation of abnormality, reducing variability and supporting more consistent documentation.
  • The increased use of memory modules, medical knowledge graphs, and cross-modal alignment mechanisms demonstrates a clear shift toward clinically informed modeling. These strategies improve factual grounding by embedding structured domain knowledge into the generation process and aligning outputs more closely with expert reasoning.
  • However, current evaluation frameworks remain poorly aligned with clinical decision-making. Metrics such as BLEU and ROUGE capture surface-level similarity but do not reflect diagnostic adequacy or patient management utility, underscoring the need for evaluation standards that measure whether generated reports truly support radiological interpretation and workflow reliability.
  • Overall, ARRG has achieved meaningful technical progress, yet its translation into real clinical environments remains constrained by limited anatomical coverage, shallow evaluation standards, and insufficient external validation. For these systems to evolve from experimental prototypes into trustworthy decision support tools, future research must prioritize clinically grounded benchmarking, greater dataset diversity, and integration pathways that reflect the realities of radiological practice. As these gaps are progressively addressed, ARRG has the potential to become a scalable, clinically accountable complement to radiological reporting, provided that future developments successfully bridge the remaining clinical-readiness gap. By adopting a scoping review approach, this study offers a descriptive synthesis that can guide future systematic or meta-analytic investigations addressing specific diagnostic or methodological questions.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ai7010008/s1. Table S1: Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) Checklist.

Author Contributions

Conceptualization, P.M.R., J.J.R., M.F.V.D., P.R.M. and A.V.B.; methodology, P.M.R., J.J.R., M.F.V.D., P.R.M. and A.V.B.; validation, P.M.R., J.J.R. and M.F.V.D.; formal analysis, P.M.R., J.J.R., M.F.V.D., P.R.M. and A.V.B.; investigation, P.M.R., J.J.R. and M.F.V.D.; writing—original draft preparation, P.M.R.; writing—review and editing, P.M.R., J.J.R., M.F.V.D., P.R.M. and A.V.B.; project administration, P.M.R.; funding acquisition, P.M.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Agencia Nacional de Investigación y Desarrollo (ANID), Chile, through the “Doctorado en Chile Scholarship Program, Academic Year 2025 (Grant No. 1340/2025)”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no financial or non-financial conflicts of interest that could be perceived as influencing the work reported in this manuscript.

Abbreviations

The following abbreviations are used in this manuscript:
ARRG: Automatic Radiology Report Generation
DL: Deep learning
PRISMA: Preferred reporting items for systematic reviews and meta-analyses
COVID-19: Coronavirus disease 2019
CNNs: Convolutional neural networks
RNNs: Recurrent neural networks
LSTM: Long short-term memory
CLIP: Contrastive language-image pretraining
LLMs: Large language models
MIMIC: Medical information mart for intensive care database
MIMIC-CXR: MIMIC-Chest X-ray
IU-Xray: Indiana University Chest X-ray collection
BLEU: Bilingual evaluation understudy
ROUGE: Recall-oriented understudy for gisting evaluation
BERT: Bidirectional encoder representations from transformers
AUC: Area under the curve
IEEE: Institute of electrical and electronics engineers
ACM: Association for Computing Machinery
WoS: Web of Science
GNNs: Graph neural networks
GAT: Graph attention network
TRIPOD: Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis
CXR: Chest X-ray
CT: Computed tomography
MRI: Magnetic resonance imaging
CTN: Contrastive triplet network
SOTA: State of the art
METEOR: Metric for evaluation of translation with explicit ordering
CIDEr: Consensus-based image description evaluation
ViT: Vision transformers
GPT: Generative pre-trained transformers
NLG: Natural language generation
GRU: Gated recurrent unit
SVMs: Support vector machines
KG: Knowledge graph
DS: Dataset
US: Ultrasound
DBT: Digital breast tomosynthesis
SFNet: Semantic fusion network
CDGPT: Conditioned distil generative pre-trained transformer
MLTL: Multi-level transfer learning
HReMRG: Hybrid reinforced medical report generation method
ATAG: Attributed abnormality graph
AMLMA: Adaptive multilevel multi-attention
VTI: Variational topic inference
CAMANet: Class activation map guided attention network
MFOT: Multi-feature optimization transformer
TrMRG: Transformer medical report generator
ASGMD: Auxiliary signal guidance and memory-driven
ICT: Information-calibrated transformer
CVAM: Cross-view attention module
MVSL: Medical visual-semantic LSTMs
MKCL: Medical knowledge with contrastive learning
AERMNet: Attention-Enhanced Relational Memory Network
FMVP: Flexible multi-view paradigm
RAMT: Relation-aware mean teacher
GHFE: Graph-guided hybrid feature encoding
CSAMDT: Conditional self-attention memory-driven transformer
CGFTrans: Cross-modal global feature fusion transformer
TSGET: Two-stage global enhanced transformer
VCIN: Visual-textual cross-modal interaction network
ACIE: Abundant clinical information embedding
MRANet: Multi-modality regional alignment network
KCAP: Knowledge-guided cross-modal alignment and progressive fusion
ATL-CA: Adaptive topic learning and fine-grained crossmodal alignment
ADCNet: Anomaly-driven cross-modal contrastive network
AC-BiFPN: Augmented convolutional bi-directional feature pyramid network
DCTMN: Dual-channel transmodal memory network
GRN: Graph reasoning network
CGFN: Cross-modal gated fusion network
CBAM: Convolutional block attention module
MMG: Multi-modal granularity feature fusion
RCAN: Recalibrated cross-modal alignment network
DPN: Dynamics priori networks
DGN: Dynamic graph networks
PrKN: Prior knowledge networks
AHP: Adapter-enhanced hierarchical cross-modal pre-training
CECL: Clustering enhanced contrastive learning
STREAM: Spatio-temporal and retrieval-augmented modeling
CAT: Cross-modal augmented transformer
MedVAG: Medical vision attention generation
RGRG: Region-guided report generation
UAR: Unify, align, and refine

References

  1. Ramirez-Alonso, G.; Prieto-Ordaz, O.; López-Santillan, R.; Montes-Y-Gómez, M. Medical report generation through radiology images: An overview. IEEE Lat. Am. Trans. 2022, 20, 986–999. [Google Scholar] [CrossRef]
  2. Kaur, N.; Mittal, A.; Singh, G. Methods for automatic generation of radiological reports of chest radiographs: A comprehensive survey. Multimed. Tools Appl. 2022, 81, 13409–13439. [Google Scholar] [CrossRef]
  3. Pang, T.; Li, P.; Zhao, L. A survey on automatic generation of medical imaging reports based on deep learning. Biomed. Eng. Online 2023, 22, 48. [Google Scholar] [CrossRef]
  4. Azad, R.; Kazerouni, A.; Heidari, M.; Khodapanah Aghdam, E.; Molaei, A.; Jia, Y.; Jose, A.; Roy, R.; Merhof, D. Advances in medical image analysis with vision transformers: A comprehensive review. Med. Image Anal. 2024, 91, 103000. [Google Scholar] [CrossRef]
  5. Sun, Z.; Lin, M.; Zhu, Q.; Xie, Q.; Wang, F.; Lu, Z.; Peng, Y. A scoping review on multimodal deep learning in biomedical images and texts. J. Biomed. Inform. 2023, 146, 104482. [Google Scholar] [CrossRef]
  6. Sloan, P.; Clatworthy, P.L.; Simpson, E.; Mirmehdi, M. Automated radiology report generation: A review of recent advances. IEEE Rev. Biomed. Eng. 2025, 18, 368–387. [Google Scholar] [CrossRef] [PubMed]
  7. Guo, L.; Tahir, A.M.; Zhang, D.; Wang, Z.J.; Ward, R.K. Automatic medical report generation: Methods and applications. APSIPA Trans. Signal Inf. Process. 2024, 13, e24. [Google Scholar] [CrossRef]
  8. Shen, Y.; Xu, Y.; Ma, J.; Rui, W.; Zhao, C.; Heacock, L.; Huang, C. Multi-modal large language models in radiology: Principles, applications, and potential. Abdom. Radiol. 2024, 50, 2745–2757. [Google Scholar] [CrossRef]
  9. Nakaura, T.; Ito, R.; Ueda, D.; Nozaki, T.; Fushimi, Y.; Matsui, Y.; Yanagawa, M.; Yamada, A.; Tsuboyama, T.; Fujima, N.; et al. The impact of large language models on radiology: A guide for radiologists on the latest innovations in AI. Jpn. J. Radiol. 2024, 42, 685–696. [Google Scholar] [CrossRef] [PubMed]
  10. Nerella, S.; Bandyopadhyay, S.; Zhang, J.; Contreras, M.; Siegel, S.; Bumin, A.; Silva, B.; Sena, J.; Shickel, B.; Bihorac, A.; et al. Transformers and large language models in healthcare: A review. Artif. Intell. Med. 2024, 154, 102900. [Google Scholar] [CrossRef]
  11. Ouis, M.Y.; Akhloufi, M.A. Deep learning for report generation on chest X-ray images. Comput. Med. Imaging Graph. 2024, 111, 102320. [Google Scholar] [CrossRef]
  12. Gallifant, J.; Afshar, M.; Ameen, S.; Aphinyanaphongs, Y.; Chen, S.; Cacciamani, G.; Demner-Fushman, D.; Dligach, D.; Daneshjou, R.; Fernandes, C.; et al. The TRIPOD-LLM reporting guideline for studies using large language models. Nat. Med. 2025, 31, 60–69. [Google Scholar] [CrossRef] [PubMed]
  13. Tricco, A.C.; Lillie, E.; Zarin, W.; O’Brien, K.K.; Colquhoun, H.; Levac, D.; Moher, D.; Peters, M.D.J.; Horsley, T.; Weeks, L.; et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): Checklist and Explanation. Ann. Intern. Med. 2018, 169, 467–473. [Google Scholar] [CrossRef] [PubMed]
  14. Yang, Y.; Yu, J.; Jiang, H.; Han, W.; Zhang, J.; Jiang, W. A contrastive triplet network for automatic chest X-ray reporting. Neurocomputing 2022, 502, 71–83. [Google Scholar] [CrossRef]
  15. Nicolson, A.; Dowling, J.; Koopman, B. Improving chest X-ray report generation by leveraging warm starting. Artif. Intell. Med. 2023, 144, 102633. [Google Scholar] [CrossRef]
  16. Pan, R.; Ran, R.; Hu, W.; Zhang, W.; Qin, Q.; Cui, S. S3-Net: A self-supervised dual-stream network for radiology report generation. IEEE J. Biomed. Health Inform. 2024, 28, 1448–1459. [Google Scholar] [CrossRef]
  17. Pan, Y.; Liu, L.J.; Yang, X.B.; Peng, W.; Huang, Q.S. Chest radiology report generation based on cross-modal multi-scale feature fusion. J. Radiat. Res. Appl. Sci. 2024, 17, 100823. [Google Scholar] [CrossRef]
  18. Bouslimi, R.; Trabelsi, H.; Karaa, W.B.A.; Hedhli, H. AI-driven radiology report generation for traumatic brain injuries. J. Imaging Inform. Med. 2025, 38, 2630–2645. [Google Scholar] [CrossRef]
  19. Zhang, K.; Yang, Y.; Yu, J.; Fan, J.; Jiang, H.; Huang, Q.; Han, W. Attribute prototype-guided iterative scene graph for explainable radiology report generation. IEEE Trans. Med. Imaging 2024, 43, 4470–4482. [Google Scholar] [CrossRef]
  20. Liu, X.; Xin, J.; Shen, Q.; Li, C.; Huang, Z.; Wang, Z. End-to-end clustering enhanced contrastive learning for radiology reports generation. IEEE Trans. Emerg. Top. Comput. Intell. 2025, 9, 1780–1794. [Google Scholar] [CrossRef]
  21. Liu, X.; Xin, J.; Dai, B.; Shen, Q.; Huang, Z.; Wang, Z. Label correlated contrastive learning for medical report generation. Comput. Methods Programs Biomed. 2025, 258, 108482. [Google Scholar] [CrossRef]
  22. Shang, C.; Cui, S.; Li, T.; Wang, X.; Li, Y.; Jiang, J. MATNet: Exploiting multi-modal features for radiology report generation. IEEE Signal Process. Lett. 2022, 29, 2692–2696. [Google Scholar] [CrossRef]
  23. Yan, B.; Pei, M.; Zhao, M.; Shan, C.; Tian, Z. Prior guided transformer for accurate radiology reports generation. IEEE J. Biomed. Health Inform. 2022, 26, 5631–5640. [Google Scholar] [CrossRef]
  24. Hou, X.; Li, X.; Liu, Z.; Sang, S.; Lu, M.; Zhang, Y. Recalibrated cross-modal alignment network for radiology report generation with weakly supervised contrastive learning. Expert Syst. Appl. 2025, 269, 126394. [Google Scholar] [CrossRef]
  25. Zhang, K.; Jiang, H.; Zhang, J.; Fan, J.; Yu, J.; Han, W. Semi-supervised medical report generation via graph-guided hybrid feature consistency. IEEE Trans. Multimed. 2024, 26, 904–915. [Google Scholar] [CrossRef]
  26. Vieira, P.D.A.; Mathew, M.J.; Santos Neto, P.D.A.D.; Silva, R.R.V.E. The automated generation of medical reports from polydactyly X-ray images using CNNs and transformers. Appl. Sci. 2024, 14, 6566. [Google Scholar] [CrossRef]
  27. Wang, J.; Bhalerao, A.; Yin, T.; See, S.; He, Y. CAMANet: Class activation map guided attention network for radiology report generation. IEEE J. Biomed. Health Inform. 2024, 28, 2199–2210. [Google Scholar] [CrossRef] [PubMed]
  28. Liu, Z.; Zhu, Z.; Zheng, S.; Zhao, Y.; He, K.; Zhao, Y. From observation to concept: A flexible multi-view paradigm for medical report generation. IEEE Trans. Multimed. 2024, 26, 5987–5995. [Google Scholar] [CrossRef]
  29. Yu, T.; Lu, W.; Yang, Y.; Han, W.; Huang, Q.; Yu, J.; Zhang, K. Adapter-enhanced hierarchical cross-modal pre-training for lightweight medical report generation. IEEE J. Biomed. Health Inform. 2025, 29, 5303–5316. [Google Scholar] [CrossRef]
  30. Zhong, Z.; Li, J.; Sollee, J.; Collins, S.; Bai, H.; Zhang, P.; Healey, T.; Atalay, M.; Gao, X.; Jiao, Z. Multi-modality regional alignment network for COVID X-ray survival prediction and report generation. IEEE J. Biomed. Health Inform. 2024, 29, 3293–3303. [Google Scholar] [CrossRef] [PubMed]
  31. Li, S.; Qiao, P.; Wang, L.; Ning, M.; Yuan, L.; Zheng, Y.; Chen, J. An organ-aware diagnosis framework for radiology report generation. IEEE Trans. Med. Imaging 2024, 43, 4253–4265. [Google Scholar] [CrossRef]
  32. Li, H.; Wang, H.; Sun, X.; He, H.; Feng, J. Context-enhanced framework for medical image report generation using multimodal contexts. Knowl.-Based Syst. 2025, 310, 112913. [Google Scholar] [CrossRef]
  33. Sharma, D.; Dhiman, C.; Kumar, D. FDT–Dr2T: A unified dense radiology report generation transformer framework for X-ray images. Mach. Vis. Appl. 2024, 35, 68. [Google Scholar] [CrossRef]
  34. Liu, A.; Guo, Y.; Yong, J.H.; Xu, F. Multi-grained radiology report generation with sentence-level image-language contrastive learning. IEEE Trans. Med. Imaging 2024, 43, 2657–2669. [Google Scholar] [CrossRef]
  35. Zhang, W.; Cai, B.; Hu, J.; Qin, Q.; Xie, K. Visual-textual cross-modal interaction network for radiology report generation. IEEE Signal Process. Lett. 2024, 31, 984–988. [Google Scholar] [CrossRef]
  36. Zhang, S.; Zhou, C.; Chen, L.; Li, Z.; Gao, Y.; Chen, Y. Visual prior-based cross-modal alignment network for radiology report generation. Comput. Biol. Med. 2023, 166, 107522. [Google Scholar] [CrossRef]
  37. Shahzadi, I.; Madni, T.M.; Janjua, U.I.; Batool, G.; Naz, B.; Ali, M.Q. CSAMDT: Conditional self-attention memory-driven transformers for radiology report generation from chest X-ray. J. Imaging Inform. Med. 2024, 37, 2825–2837. [Google Scholar] [CrossRef] [PubMed]
  38. Varol Arısoy, M.; Arısoy, A.; Uysal, I. A vision-attention-driven language framework for medical report generation. Sci. Rep. 2025, 15, 10704. [Google Scholar] [CrossRef]
  39. Fang, J.; Xing, S.; Li, K.; Guo, Z.; Li, G.; Yu, C. Automated generation of chest X-ray imaging diagnostic reports by multimodal and multi-granularity features fusion. Biomed. Signal Process. Control 2025, 105, 107562. [Google Scholar] [CrossRef]
  40. Zheng, Z.; Zhang, Y.; Liang, E.; Weng, Z.; Chai, J.; Li, J. TRINet: Team role interaction network for automatic radiology report generation. Comput. Biol. Med. 2024, 183, 109275. [Google Scholar] [CrossRef]
  41. Vendrow, E.; Schonfeld, E. Understanding transfer learning for chest radiograph clinical report generation with modified transformer architectures. Heliyon 2023, 9, e17968. [Google Scholar] [CrossRef] [PubMed]
  42. Mohsan, M.M.; Akram, M.U.; Rasool, G.; Alghamdi, N.S.; Baqai, M.A.A.; Abbas, M. Vision transformer and language model-based radiology report generation. IEEE Access 2023, 11, 1814–1824. [Google Scholar] [CrossRef]
  43. Veras Magalhaes, G.; De S. Santos, R.L.; Vogado, L.H.S.; Cardoso De Paiva, A.; De Alcantara Dos Santos Neto, P. XRaySwinGen: Automatic medical reporting for X-ray exams with multimodal model. Heliyon 2024, 10, e27516. [Google Scholar] [CrossRef]
  44. Wang, Z.; Han, H.; Wang, L.; Li, X.; Zhou, L. Automated radiographic report generation purely on transformer: A multicriteria supervised approach. IEEE Trans. Med. Imaging 2022, 41, 2803–2813. [Google Scholar] [CrossRef]
  45. Zeiser, F.A.; Da Costa, C.A.; De Oliveira Ramos, G.; Maier, A.; Da Rosa Righi, R. CheXReport: A transformer-based architecture to generate chest X-ray reports suggestions. Expert Syst. Appl. 2024, 255, 124644. [Google Scholar] [CrossRef]
  46. Leonardi, G.; Portinale, L.; Santomauro, A. Enhancing radiology report generation through pre-trained language models. Prog. Artif. Intell. 2024, 12. [Google Scholar] [CrossRef]
  47. Batool, H.; Mukhtar, A.; Khawaja, S.G.; Alghamdi, N.S.; Khan, A.M.; Qayyum, A.; Adil, R.; Khan, Z.; Akbar, M.U.; Eklund, A. Knowledge distillation and transformer-based framework for automatic spine CT report generation. IEEE Access 2025, 13, 42949–42964. [Google Scholar] [CrossRef]
  48. Zhao, J.; Yao, W.; Sun, L.; Shi, L.; Kuang, Z.; Wu, C.; Han, Q. Automated chest X-ray diagnosis report generation with cross-attention mechanism. Appl. Sci. 2025, 15, 343. [Google Scholar] [CrossRef]
  49. Alfarghaly, O.; Khaled, R.; Elkorany, A.; Helal, M.; Fahmy, A. Automated radiology report generation using conditioned transformers. Inform. Med. Unlocked 2021, 24, 100557. [Google Scholar] [CrossRef]
  50. Singh, P.; Singh, S. ChestX-Transcribe: A multimodal transformer for automated radiology report generation from chest X-rays. Front. Digit. Health 2025, 7, 1535168. [Google Scholar] [CrossRef] [PubMed]
  51. Raminedi, S.; Shridevi, S.; Won, D. Multi-modal transformer architecture for medical image analysis and automated report generation. Sci. Rep. 2024, 14, 19281. [Google Scholar] [CrossRef]
  52. Yi, X.; Fu, Y.; Liu, R.; Hu, Y.; Zhang, H.; Hua, R. TSGET: Two-stage global enhanced transformer for automatic radiology report generation. IEEE J. Biomed. Health Inform. 2024, 28, 2152–2162. [Google Scholar] [CrossRef]
  53. Xu, L.; Tang, Q.; Zheng, B.; Lv, J.; Li, W.; Zeng, X. CGFTrans: Cross-modal global feature fusion transformer for medical report generation. IEEE J. Biomed. Health Inform. 2024, 28, 5600–5612. [Google Scholar] [CrossRef]
  54. Wang, R.; Hua, R. Generating radiology reports via multi-feature optimization transformer. KSII Trans. Internet Inf. Syst. 2023, 17, 2768–2787. [Google Scholar] [CrossRef]
  55. Zhang, J.; Shen, X.; Wan, S.; Goudos, S.K.; Wu, J.; Cheng, M.; Zhang, W. A novel deep learning model for medical report generation by inter-intra information calibration. IEEE J. Biomed. Health Inform. 2023, 27, 5110–5121. [Google Scholar] [CrossRef] [PubMed]
  56. Hou, X.; Liu, Z.; Li, X.; Li, X.; Sang, S.; Zhang, Y. MKCL: Medical knowledge with contrastive learning model for radiology report generation. J. Biomed. Inform. 2023, 146, 104496. [Google Scholar] [CrossRef] [PubMed]
  57. Zhao, G.; Zhao, Z.; Gong, W.; Li, F. Radiology report generation with medical knowledge and multilevel image-report alignment: A new method and its verification. Artif. Intell. Med. 2023, 146, 102714. [Google Scholar] [CrossRef]
  58. Gao, N.; Yao, R.; Liang, R.; Chen, P.; Liu, T.; Dang, Y. Multi-level objective alignment transformer for fine-grained oral panoramic X-ray report generation. IEEE Trans. Multimed. 2024, 26, 7462–7474. [Google Scholar] [CrossRef]
  59. Guo, K.; Zheng, S.; Huang, R.; Gao, R. Multi-task learning for lung disease classification and report generation via prior graph structure and contrastive learning. IEEE Access 2023, 11, 110888–110898. [Google Scholar] [CrossRef]
  60. Yang, Y.; You, X.; Zhang, K.; Fu, Z.; Wang, X.; Ding, J.; Sun, J.; Yu, Z.; Huang, Q.; Han, W.; et al. Spatio-temporal and retrieval-augmented modelling for chest X-ray report generation. IEEE Trans. Med. Imaging 2025, 44, 2892–2905. [Google Scholar] [CrossRef]
  61. Dong, Z.; Lian, J.; Zhang, X.; Zhang, B.; Liu, J.; Zhang, J.; Zhang, H. A chest imaging diagnosis report generation method based on dual-channel transmodal memory network. Biomed. Signal Process. Control 2025, 100, 107021. [Google Scholar] [CrossRef]
  62. Xu, D.; Zhu, H.; Huang, Y.; Jin, Z.; Ding, W.; Li, H.; Ran, M. Vision-knowledge fusion model for multi-domain medical report generation. Inf. Fusion 2023, 97, 101817. [Google Scholar] [CrossRef]
  63. Liu, Y.; Zhang, J.; Liu, K.; Tan, L. ADCNet: Anomaly-driven cross-modal contrastive network for medical report generation. Electronics 2025, 14, 532. [Google Scholar] [CrossRef]
  64. Yan, S.; Cheung, W.K.; Chiu, K.; Tong, T.M.; Cheung, K.C.; See, S. Attributed abnormality graph embedding for clinically accurate X-ray report generation. IEEE Trans. Med. Imaging 2023, 42, 2211–2222. [Google Scholar] [CrossRef] [PubMed]
  65. Ran, R.; Pan, R.; Yang, W.; Deng, Y.; Zhang, W.; Hu, W.; Qing, Q. MeFD-Net: Multi-expert fusion diagnostic network for generating radiology image reports. Appl. Intell. 2024, 54, 11484–11495. [Google Scholar] [CrossRef]
  66. Yang, B.; Lei, H.; Huang, H.; Han, X.; Cai, Y. DPN: Dynamics priori networks for radiology report generation. Tsinghua Sci. Technol. 2025, 30, 600–609. [Google Scholar] [CrossRef]
  67. Alotaibi, F.S.; Kaur, N. Radiological report generation from chest X-ray images using pre-trained word embeddings. Wirel. Pers. Commun. 2023, 133, 2525–2540. [Google Scholar] [CrossRef]
  68. Sun, S.; Su, Z.; Meizhou, J.; Feng, Y.; Hu, Q.; Luo, J.; Hu, K.; Yang, Z. Optimizing medical image report generation through a discrete diffusion framework. J. Supercomput. 2025, 81, 637. [Google Scholar] [CrossRef]
  69. Kaur, N.; Mittal, A. RadioBERT: A deep learning-based system for medical report generation from chest X-ray images using contextual embeddings. J. Biomed. Inform. 2022, 135, 104220. [Google Scholar] [CrossRef]
  70. Mei, X.; Yang, L.; Gao, D.; Cai, X.; Han, J.; Liu, T. Adaptive medical topic learning for enhanced fine-grained cross-modal alignment in medical report generation. IEEE Trans. Multimed. 2025, 27, 5050–5061. [Google Scholar] [CrossRef]
  71. Tang, Y.; Yuan, Y.; Tao, F.; Tang, M. Cross-modal augmented transformer for automated medical report generation. IEEE J. Transl. Eng. Health Med. 2025, 13, 33–48. [Google Scholar] [CrossRef] [PubMed]
  72. Najdenkoska, I.; Zhen, X.; Worring, M.; Shao, L. Uncertainty-aware report generation for chest X-rays by variational topic inference. Med. Image Anal. 2022, 82, 102603. [Google Scholar] [CrossRef]
  73. Zhang, J.; Cheng, M.; Li, X.; Shen, X.; Wan, Y.; Zhu, J. Generating medical report via joint probability graph reasoning. Tsinghua Sci. Technol. 2025, 30, 1685–1699. [Google Scholar] [CrossRef]
  74. Huang, L.; Cao, Y.; Jia, P.; Li, C.; Tang, J.; Li, C. Knowledge-guided cross-modal alignment and progressive fusion for chest X-ray report generation. IEEE Trans. Multimed. 2025, 27, 557–567. [Google Scholar] [CrossRef]
  75. Yi, X.; Fu, Y.; Yu, J.; Liu, R.; Zhang, H.; Hua, R. LHR-RFL: Linear hybrid-reward-based reinforced focal learning for automatic radiology report generation. IEEE Trans. Med. Imaging 2025, 44, 1494–1504. [Google Scholar] [CrossRef]
  76. Zhu, D.; Liu, L.; Yang, X.; Liu, L.; Peng, W. Denoising multi-level cross-attention and contrastive learning for chest radiology report generation. J. Imaging Inform. Med. 2025, 38, 2646–2663. [Google Scholar] [CrossRef]
  77. Li, H.; Liu, X.; Jia, D.; Chen, Y.; Hou, P.; Li, H. Research on chest radiography recognition model based on deep learning. Math. Biosci. Eng. 2022, 19, 11768–11781. [Google Scholar] [CrossRef]
  78. Sirshar, M.; Paracha, M.F.K.; Akram, M.U.; Alghamdi, N.S.; Zaidi, S.Z.Y.; Fatima, T. Attention-based automated radiology report generation using CNN and LSTM. PLoS ONE 2022, 17, e0262209. [Google Scholar] [CrossRef] [PubMed]
  79. Hou, D.; Zhao, Z.; Liu, Y.; Chang, F.; Hu, S. Automatic report generation for chest X-ray images via adversarial reinforcement learning. IEEE Access 2021, 9, 21236–21250. [Google Scholar] [CrossRef]
  80. Kaur, N.; Mittal, A. CADxReport: Chest X-ray report generation using co-attention mechanism and reinforcement learning. Comput. Biol. Med. 2022, 145, 105498. [Google Scholar] [CrossRef]
  81. Ucan, M.; Kaya, B.; Kaya, M. Generating medical reports with a novel deep learning architecture. Int. J. Imaging Syst. Technol. 2025, 35, e70062. [Google Scholar] [CrossRef]
  82. Yang, Y.; Yu, J.; Zhang, J.; Han, W.; Jiang, H.; Huang, Q. Joint embedding of deep visual and semantic features for medical image report generation. IEEE Trans. Multimed. 2023, 25, 167–178. [Google Scholar] [CrossRef]
  83. Gajbhiye, G.O.; Nandedkar, A.V.; Faye, I. Translating medical image to radiological report: Adaptive multilevel multi-attention approach. Comput. Methods Programs Biomed. 2022, 221, 106853. [Google Scholar] [CrossRef]
  84. Kaur, N.; Mittal, A. CheXPrune: Sparse chest X-ray report generation model using multi-attention and one-shot global pruning. J. Ambient Intell. Humaniz. Comput. 2023, 14, 7485–7497. [Google Scholar] [CrossRef] [PubMed]
  85. Shetty, S.; Ananthanarayana, V.S.; Mahale, A. Cross-modal deep learning-based clinical recommendation system for radiology report generation from chest X-rays. Int. J. Eng. 2023, 36, 1569–1577. [Google Scholar] [CrossRef]
  86. Zeng, X.; Wen, L.; Xu, Y.; Ji, C. Generating diagnostic report for medical image by high-middle-level visual information incorporation on double deep learning models. Comput. Methods Programs Biomed. 2020, 197, 105700. [Google Scholar] [CrossRef]
  87. Xu, Z.; Xu, W.; Wang, R.; Chen, J.; Qi, C.; Lukasiewicz, T. Hybrid reinforced medical report generation with M-linear attention and repetition penalty. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 2206–2222. [Google Scholar] [CrossRef]
  88. Gu, Y.; Li, R.; Wang, X.; Zhou, Z. Automatic medical report generation based on cross-view attention and visual-semantic long short-term memories. Bioengineering 2023, 10, 966. [Google Scholar] [CrossRef]
  89. Zhang, J.; Cheng, M.; Cheng, Q.; Shen, X.; Wan, Y.; Zhu, J.; Liu, M. Hierarchical medical image report adversarial generation with hybrid discriminator. Artif. Intell. Med. 2024, 151, 102846. [Google Scholar] [CrossRef]
  90. Paalvast, O.; Nauta, M.; Koelle, M.; Geerdink, J.; Vijlbrief, O.; Hegeman, J.H.; Seifert, C. Radiology report generation for proximal femur fractures using deep classification and language generation models. Artif. Intell. Med. 2022, 128, 102281. [Google Scholar] [CrossRef]
  91. Loveymi, S.; Dezfoulian, M.H.; Mansoorizadeh, M. Automatic generation of structured radiology reports for volumetric computed tomography images using question-specific deep feature extraction and learning. J. Med. Signals Sens. 2021, 11, 194–207. [Google Scholar] [CrossRef] [PubMed]
  92. Sun, S.; Mei, Z.; Li, X.; Tang, T.; Li, Z.; Wu, Y. A label information fused medical image report generation framework. Artif. Intell. Med. 2024, 150, 102823. [Google Scholar] [CrossRef]
  93. Zhang, D.; Ren, A.; Liang, J.; Liu, Q.; Wang, H.; Ma, Y. Improving medical X-ray report generation by using knowledge graph. Appl. Sci. 2022, 12, 11111. [Google Scholar] [CrossRef]
  94. Liu, G.; Liao, Y.; Wang, F.; Zhang, B.; Zhang, L.; Liang, X.; Wan, X.; Li, S.; Li, Z.; Zhang, S.; et al. Medical-VLBERT: Medical visual language BERT for COVID-19 CT report generation with alternate learning. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 3786–3797. [Google Scholar] [CrossRef]
  95. Moon, J.H.; Lee, H.; Shin, W.; Kim, Y.H.; Choi, E. Multi-Modal understanding and generation for medical images and text via Vision-Language Pre-Training. IEEE J. Biomed. Health Inform. 2022, 26, 6070–6080. [Google Scholar] [CrossRef]
  96. Aswiga, R.V.; Shanthi, A.P. A multilevel transfer learning technique and LSTM framework for generating medical captions for limited CT and DBT images. J. Digit. Imaging 2022, 35, 564–580. [Google Scholar] [CrossRef]
  97. Xue, Y.; Tan, Y.; Tan, L.; Qin, J.; Xiang, X. Generating radiology reports via auxiliary signal guidance and a memory-driven network. Expert Syst. Appl. 2024, 237, 121260. [Google Scholar] [CrossRef]
  98. Zeng, X.; Liao, T.; Xu, L.; Wang, Z. AERMNet: Attention-enhanced relational memory network for medical image report generation. Comput. Methods Programs Biomed. 2024, 244, 107979. [Google Scholar] [CrossRef] [PubMed]
  99. Tang, Y.; Wang, D.; Zhang, L.; Yuan, Y. An efficient but effective writer: Diffusion-based semi-autoregressive transformer for automated radiology report generation. Biomed. Signal Process. Control 2024, 88, 10565. [Google Scholar] [CrossRef]
  100. Alqahtani, F.F.; Mohsan, M.M.; Alshamrani, K.; Zeb, J.; Alhamami, S.; Alqarni, D. CNX-B2: A novel CNN-Transformer approach for chest X-Ray medical report generation. IEEE Access 2024, 12, 26626–26635. [Google Scholar] [CrossRef]
  101. Liu, F.; Wu, X.; Huang, J.; Yang, B.; Branson, K.; Schwab, P.; Clifton, L.; Zhang, P.; Luo, J.; Zheng, Y.; et al. Aligning, autoencoding, and prompting large language models for novel disease reporting. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 3332–3343. [Google Scholar] [CrossRef]
  102. Demner-Fushman, D.; Kohli, M.D.; Rosenman, M.B.; Shooshan, S.E.; Rodriguez, L.; Antani, S.; Thoma, G.R.; McDonald, C.J. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inform. Assoc. 2015, 23, 304–310. [Google Scholar] [CrossRef] [PubMed]
  103. Johnson, A.E.W.; Pollard, T.J.; Berkowitz, S.J.; Greenbaum, N.R.; Lungren, M.P.; Deng, C.-Y.; Mark, R.G.; Horng, S. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 2019, 6, 317. [Google Scholar] [CrossRef]
  104. Marvasti, N.; Roldan, M.; Uskudarli, S.; Aldana Montes, J.; Acar, B. Overview of the ImageCLEF 2015 liver CT annotation task. In Proceedings of the ImageCLEF 2015 Evaluation Labs and Workshop, Toulouse, France, 8–11 September 2015. [Google Scholar]
  105. ImageNet: A Large-Scale Hierarchical Image Database for Visual Object Recognition Research. Available online: http://www.image-net.org/ (accessed on 10 April 2025).
  106. Li, M.; Liu, R.; Wang, F.; Chang, X.; Liang, X. Auxiliary signal-guided knowledge encoder-decoder for medical report generation. World Wide Web 2023, 26, 253–270. [Google Scholar] [CrossRef]
  107. Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; Summers, R.M. ChestX-Ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3462–3471. [Google Scholar] [CrossRef]
  108. Wu, J.; Agu, N.; Lourentzou, I.; Sharma, A.; Paguio, J.A.; Yao, J.S.; Dee, E.C.; Mitchell, W.; Kashyap, S.; Giovannini, A.; et al. Chest imagenome dataset for clinical reasoning. In Proceedings of the Annual Conference on Neural Information Processing Systems, Online, 6–14 December 2021. [Google Scholar]
  109. Radiological Society of North America. RSNA Intracranial Hemorrhage Detection [Dataset]. Kaggle 2019. Available online: https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection (accessed on 10 April 2025).
  110. Sekuboyina, A.; Husseini, M.E.; Bayat, A.; Löffler, M.; Liebl, H.; Li, H.; Tetteh, G. VerSe: A Vertebrae labelling and segmentation benchmark for multi-detector CT images. Med. Image Anal. 2021, 73, 102166. [Google Scholar] [CrossRef]
  111. Zhang, Z.; Chen, P.; McGough, M.; Xing, F.; Wang, C.; Bui, M.; Xie, Y.; Sapkota, M.; Cui, L.; Dhillon, J.; et al. Pathologist-level interpretable whole-slide cancer diagnosis with deep learning. Nat. Mach. Intell. 2019, 1, 236–245. [Google Scholar] [CrossRef]
  112. Pavlova, M.; Terhljan, N.; Chung, A.G.; Zhao, A.; Surana, S.; Aboutalebi, H.; Gunraj, H.; Sabri, A.; Alaref, A.; Wong, A. COVID-net CXR-2: An enhanced deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images. Front. Med. 2022, 9, 861680. [Google Scholar] [CrossRef]
  113. de la Iglesia-Vayá, M.; Saborit, J.M.; Montell, J.A.; Pertusa, A.; Bustos, A.; Cazorla, M.; Galant, J.; Barber, X.; Orozco-Beltrán, D.; García-García, F.; et al. BIMCV COVID-19+: A large annotated dataset of RX and CT images from COVID-19 patients. arXiv 2020, arXiv:2006.01174. [Google Scholar]
  114. Pelka, O.; Koitka, S.; Rückert, J.; Nensa, F.; Friedrich, C. Radiology Objects in COntext (ROCO): A multimodal image dataset. In Proceedings of the 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 16 September 2018; Proceedings. pp. 180–189. [Google Scholar] [CrossRef]
  115. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar] [CrossRef]
  116. Chen, Z.; Song, Y.; Chang, T.H.; Wan, X. Generating radiology reports via memory-driven transformer. arXiv 2022, arXiv:2010.16056. Available online: https://github.com/zhjohnchan/R2Gen (accessed on 10 April 2025).
  117. Jing, B.; Xie, P.; Xing, E.P. On the automatic generation of medical imaging reports. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, 31 October–4 November 2018. [Google Scholar]
  118. Liu, F.; Wu, X.; Ge, S.; Fan, W.; Zou, Y. Exploring and distilling posterior and prior knowledge for radiology report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar] [CrossRef]
  119. Yang, S.; Wu, X.; Ge, S.; Zhou, S.K.; Xiao, L. Knowledge matters: Chest radiology report generation with general and specific knowledge. Med. Image Anal. 2022, 80, 102510. [Google Scholar] [CrossRef] [PubMed]
  120. Tanida, T.; Müller, P.; Kaissis, G.; Rueckert, D. Interactive and explainable region-guided radiology report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; Available online: https://github.com/ttanida/rgrg (accessed on 10 April 2025).
  121. Hou, W.; Cheng, Y.; Xu, K.; Li, W.; Liu, J. Recap: Towards precise radiology report generation via dynamic disease progression reasoning. arXiv 2023, arXiv:2310.13864. Available online: https://github.com/wjhou/Recap (accessed on 10 April 2025).
  122. Li, Y.; Yang, B.; Cheng, X.; Zhu, Z.; Li, H.; Zou, Y. Unify, align, and refine: Multi-level semantic alignment for radiology report generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 4–6 October 2023; pp. 2851–2861. [Google Scholar] [CrossRef]
  123. Nooralahzadeh, F.; Perez-Gonzalez, N.; Frauenfelder, T.; Fujimoto, K.; Krauthammer, M. Progressive transformer-based generation of radiology reports. arXiv 2021, arXiv:2102.09777. Available online: https://github.com/uzh-dqbm-cmi/ARGON (accessed on 10 April 2025).
  124. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  125. Lin, C.Y. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. Available online: https://aclanthology.org/W04-1013/ (accessed on 10 April 2025).
  126. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating text generation with BERT. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020; Available online: https://github.com/Tiiiger/bert_score (accessed on 10 April 2025).
  127. Vedantam, R.; Zitnick, C.L.; Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar] [CrossRef]
  128. Yu, F.; Endo, M.; Krishnan, R.; Pan, I.; Tsai, A.; Pontes-Reis, E.; Fonseca, E.K.U.N.; Lee, H.M.H.; Abad, Z.S.H.; Ng, A.Y.; et al. Evaluating progress in automatic chest X-ray radiology report generation. Patterns 2023, 4, 100802. [Google Scholar] [CrossRef]
  129. Mahmoud, M.; Monshi, A.; Poon, J.; Chung, V. Deep learning in generating radiology reports: A survey. Artif. Intell. Med. 2020, 106, 101878. [Google Scholar] [CrossRef] [PubMed]
  130. Liao, Y.; Liu, H.; Spasic, I. Deep learning approaches to automatic radiology report generation: A systematic review. Inform. Med. Unlocked 2023, 39, 101273. [Google Scholar] [CrossRef]
Figure 1. PRISMA flow diagram [13] for the scoping review, showing the search results and the numbers of included and excluded studies. The completed PRISMA-ScR checklist is provided in the Supplementary Materials (Table S1).
Figure 2. Number of publications by field of clinical focus and year.
Figure 3. Geographic distribution of the included studies by country of origin. China contributes the largest share of publications (n = 56), followed by India (n = 9), Pakistan (n = 4), and Brazil (n = 3).
Figure 4. Number and percentage of publications by deep learning method.
Figure 5. Distribution of BLEU-1 scores across the included studies. Each point represents the BLEU-1 value reported by a study (vertical axis); studies that did not report this metric are omitted. Notably, the highest scores were achieved on datasets other than the common IU-Xray and MIMIC-CXR benchmarks.
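For readers unfamiliar with the metric summarized in Figure 5, the sketch below shows one common way to compute a sentence-level BLEU-1 score with NLTK by restricting the weighting to unigrams. The example sentences are invented for illustration; the included studies typically report corpus-level scores, and tokenization and smoothing choices can shift the absolute values.

```python
# Illustrative BLEU-1 computation with NLTK (unigram precision + brevity penalty).
# The reference/candidate sentences are invented; real studies score whole corpora.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the heart size is normal and the lungs are clear".split()
candidate = "heart size is normal lungs are clear".split()

bleu1 = sentence_bleu(
    [reference],                   # list of reference token lists
    candidate,                     # candidate token list
    weights=(1.0, 0.0, 0.0, 0.0),  # unigram-only weighting = BLEU-1
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-1: {bleu1:.3f}")
```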
Table 1. Search strategies adapted by the database.
Database | Field | Search Expression | Results
PubMed | [tiab] | (“Convolutional Neural Network *” OR CNN OR “Recurrent Neural Network *” OR RNN OR LSTM OR GRU OR Transformer OR Transformers OR “Attention Mechanism” OR “Encoder Decoder” OR “Sequence to Sequence” OR “Graph Neural Network *” OR GNN OR GCN OR GAT OR “Deep Learning” OR “Neural Network” OR “Neural Networks”) AND (Radiology OR Radiolog * OR “Medical Imag *” OR “Diagnostic Imag *” OR X-ray OR CT OR MRI OR PET) AND (“Report Generation” OR “Text Generation” OR “Narrative Generation” OR “Automatic Report *” OR “Clinical Report *” OR “Medical Report *”) | 158
Scopus | - | Same expression as PubMed, without specific field restriction | 259
Web of Science | TS= | Same Boolean expression adapted to the TS= field for topic-based search | 217
IEEE Xplore | - | Same Boolean expression adjusted to the syntax requirements of the respective database | 79
ACM DL | - | Same Boolean expression adjusted to the syntax requirements of the respective database | 301
In PubMed, the [tiab] field was used to restrict the search to title and abstract. In WoS, the TS= field was used for topic-based search. Search expressions were syntactically adapted to each database.
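As a reproducibility aid, the short sketch below assembles an abbreviated version of the shared Boolean expression from Table 1 and wraps it in the Web of Science TS= topic field. The variable names and the assembly helper are illustrative assumptions rather than part of the original search workflow, and PubMed's per-term [tiab] tagging is deliberately not reproduced.

```python
# Minimal sketch: assembling an abbreviated version of the shared Boolean
# expression (Table 1) and wrapping it for a Web of Science topic search.
# Database-specific per-term field tagging (e.g., PubMed's [tiab]) is omitted.
METHODS = ('("Deep Learning" OR "Convolutional Neural Network*" OR CNN OR RNN '
           'OR LSTM OR GRU OR Transformer OR "Graph Neural Network*")')
DOMAIN = '(Radiology OR "Medical Imag*" OR "Diagnostic Imag*" OR X-ray OR CT OR MRI OR PET)'
TASK = '("Report Generation" OR "Text Generation" OR "Automatic Report*" OR "Medical Report*")'

CORE = f"{METHODS} AND {DOMAIN} AND {TASK}"

# Web of Science: the whole expression goes inside a TS=( ... ) topic field.
wos_query = f"TS=({CORE})"

# Scopus, IEEE Xplore, and ACM DL reused the same expression with only
# syntactic adjustments (see Table 1); here it is passed through unchanged.
scopus_query = CORE

print(wos_query)
```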