Recent Advances in Large Language Models for Healthcare

: Recent advances in the field of large language models (LLMs) underline their high potential for applications in a variety of sectors. Their use in healthcare, in particular, holds out promising prospects for improving medical practices. As we highlight in this paper, LLMs have demonstrated remarkable capabilities in language understanding and generation that could indeed be put to good use in the medical field. We also present the main architectures of these models, such as GPT, Bloom, or LLaMA, composed of billions of parameters. We then examine recent trends in the medical datasets used to train these models. We classify them according to different criteria, such as size, source, or subject (patient records, scientific articles, etc.). We mention that LLMs could help improve patient care, accelerate medical research, and optimize the efficiency of healthcare systems such as assisted diagnosis. We also highlight several technical and ethical issues that need to be resolved before LLMs can be used extensively in the medical field. Consequently, we propose a discussion of the capabilities offered by new generations of linguistic models and their limitations when deployed in a domain such as healthcare.


Introduction
Recent advances in the field of artificial intelligence (AI) have enabled the development of increasingly powerful linguistic models capable of generating text fluently and coherently.Among these models, "large language models" (LLMs) stand out for their imposing size and their ability to learn enormous amounts of textual data.Models like GPT-3.5 [1] and GPT-4 [2] developed by OpenAI [3], or Bard, created by Google [4], have billions of parameters and have demonstrated impressive comprehension skills and language generation.These fascinating advancements in natural language processing (NLP) have promising implications in many fields, including healthcare [5][6][7].Indeed, they offer new perspectives for improving patient care [8,9], accelerating medical research [10,11], supporting decision-making [12,13], accelerating diagnosis [14,15], and making health systems more efficient [16,17].
These models could also provide valuable assistance to healthcare professionals by helping them interpret complex patient records and develop personalized treatment plans [18] as well as manage the increasing amount of medical literature.
In the clinical domain, these models could help make more accurate diagnoses by analyzing patient medical records [19].They could also serve as virtual assistants to provide personalized health information or even simulate therapeutic conversations [20][21][22].The automatic generation of medical record summaries or examination reports is another promising application [23][24][25].
For biomedical research, the use of extensive linguistic models paves the way for rapid information extraction from huge databases of scientific publications [26,27].They can also generate new research hypotheses by making new connections in the literature.These applications would significantly accelerate the discovery process in biomedicine [28,29].
Additionally, these advanced language models could improve the administrative efficiency of health systems [30].They would be able to extract key information from massive medical databases, automate the production of certain documents, or even help in decision-making for the optimal allocation of resources [31][32][33].
LLMs such as GPT-3.5,GPT-4, Bard, LLaMA, and Bloom have shown impressive results in various activities related to clinical language comprehension.Nevertheless, it is essential to evaluate them in detail in the medical context, which is characterized by its distinct nuances and complexities compared to common texts.
It is therefore essential to continue studies in order to verify the capacity of these innovative linguistic models in real clinical situations, whether for decision-making, diagnostic assistance, or the personalization of treatments.Rigorous evaluation is the key to taking full advantage of this technology and transforming medicine through better mastery and use of clinical language.
Although promising, the use of this AI in health also raises ethical and technological challenges that must be addressed.However, their potential for accelerating medical progress and improving the quality of care seems immense.The coming years will tell us to what extent these revolutionary models are capable of transforming medicine and health research.
The main contributions of this paper are the following: • We analyze major large language model (LLM) architectures such as ChatGPT, Bloom, and LLaMA, which are composed of billions of parameters and have demonstrated impressive capabilities in natural language understanding and generation.

•
We present recent trends in the medical datasets used to train such models.We classify these datasets according to different criteria, such as their size, source (e.g., patient files, scientific articles), and subject matter.

•
We highlight the potential of LLMs to improve patient care through applications like assisted diagnosis, accelerate medical research by analyzing literature at scale, and optimize the efficiency of health systems through automation.

•
We discuss key challenges for practically applying LLMs in medicine, particularly important ethical issues around privacy, confidentiality, and the risk of algorithmic biases negatively impacting patient outcomes or exacerbating health inequities.Addressing these challenges will be critical to ensuring that LLMs can safely and equitably benefit public health.
As depicted in Figure 1, this study is deployed according to a well-defined architecture that aims to enlighten the reader on the different facets of LLMs and their relevance in medical diagnoses.We will begin, in Section 2, our exploration with a brief history of LLMs.Section 3 serves as an introduction to the transformer architecture, laying the foundation for understanding the following sections.Building on this foundation, in Section 4, we will delve deeper into the specific architecture of LLMs.Section 5 illustrates the practical applications of LLMs, with an emphasis on their use in medical diagnosis.Section 6 presents a comprehensive review of medical datasets, segmenting them into three key categories.Section 7 focuses on the major innovation, which is the advent of foundation models in the AI landscape, highlighting their relevance in the clinical domain.Before concluding this article in the final Section 9, Section 8 offers critical reflections, assessing both the benefits of LLMs and not neglecting the challenges inherent in them.

Brief History of Large Language Models
The first language models were n-gram models [34], which estimate the probability of a word based on previous n − 1 words.They began to be used in the 1980s and are still used today.However, they do not capture the semantics of the language well.
During the 2000s, researchers introduced topic models like latent Dirichlet allocation (LDA) [35].These models have the ability to detect themes within large collections of text and are particularly useful for analyzing large amounts of health-related textual data.
Language models based on neural networks, such as word2vec [36] and GloVe [37], began to emerge in the 2010s.They learn vector representations of words that capture semantic and syntactic relationships.They have been used for tasks such as extracting information from clinical texts.
The introduction of the transformer architecture in 2017 brought a revolution in the field of NLP, paving the way for the emergence of large-scale pre-trained models such as BERT [4], ELMo [38], RoBERTa [39], and GPT-3 [3].These models are trained on massive amounts of general domain text and can then be fine-tuned for specific healthcare applications.They have been used for tasks such as named entity recognition, sentiment analysis, and question answering (QA) in clinical text.Indeed, domain-specific versions of BERT such as BioBERT [40] and ClinicalBERT [41] have been developed to address clinical language comprehension tasks.
More recently, LLMs have continued to evolve, demonstrating cutting-edge performance in all fields, including healthcare.They are being applied in new ways in healthcare, such as to facilitate clinical documentation, identify adverse drug reactions, and predict health outcomes from patient notes.However, the specialized clinical vocabularies, acronyms, and abbreviations present in the text remain a challenge.
Over time, LLMs have steadily increased in size and performance, such as GPT-3.5 and GPT-4, as well as Bard.This opened the way to new use cases: more comprehensive virtual assistants [42], integration with patient files [43], diagnostic/therapeutic recommendations [44], etc.
While LLMs have tremendous potential, their development raises major ethical challenges: guaranteeing patient safety, combating bias, verifying and explaining recommenda-tions, respecting privacy, and ensuring complementarity with caregivers.Researchers are working actively on these issues so that AI benefits the healthcare system responsibly.

Transformers and Their Architecture
Transformers, a category of deep learning (DL) models, are predominantly employed for tasks related to NLP.These models were first introduced by Google in 2017 through a paper titled "Attention is All You Need" [53].The fundamental element of transformers is the attention mechanism (see Section 3.1).This mechanism enables the model to comprehend the contextual associations between words (or other elements) within a sentence.
Transformers use multi-head self-attention to analyze the relationship between words in a sentence.Self-attention means that words attend to their relationship with other words in the same sequence, without regard to their relative or absolute position.
The basic architecture of transformers consists of an encoder and a decoder.The encoder helps process the input sequence, while the decoder generates the output sequence.Common transformer architectures include BERT [4], GPT [54], T5 [55], etc. BERT (Bidirectional Encoder Representations from Transformers) introduced bidirectional training, which looks at the context from both left and right.GPT (Generative Pre-trained Transformer) models like GPT-2 [56], GPT-3 [3], and GPT-4 [2] are able to generate new text.T5 (Text-To-Text Transfer Transformer) can perform a wide range of text-based tasks like summarization, QA, and translation [57].
All transformer models follow the same basic structure-embedding, encoding, and decoding.However, they differ in pre-training objectives, model size, number of encoder/decoder stacks, attention types, etc.Later models like T5, GPT-3, and GPT-4 have billions of parameters to handle more complex tasks through transfer learning from massive text corpora.

The Attention Mechanism
The attention mechanism aims to palliate the loss of information transmitted to the decoder, as it is only the hidden state created during the last phase by the encoder that is provided as input to the decoder.
The original work of Larochelle and Hinton [58] introduced this approach to the field of computer vision.By analyzing several regions of an image separately, i.e., by considering different extracts, it is possible for a learning algorithm to gradually accumulate knowledge about the shapes and objects present.By analyzing each segment in turn, the model can build up a global understanding of the image as a whole, which will ultimately enable it to assign a relevant category to it.Authors initially proposed this method, which involves examining the various parts of an image and, after assimilating the details of each, arriving at a precise classification.
According to Equation (1), the attention mechanism, recommended by [59,60], has played a crucial role in improving the performance of machine translation systems.This approach offers the model the ability to focus on essential segments of the input sequence.
The main idea behind attention is to evaluate the relationship between parts of two different sequences.In NLP, especially in a sequence-to-sequence framework, the attention mechanism aims to signal to the model which word in sequence "B" should be privileged in relation to a specific word in sequence "A".
A model with attention differs from a classic seq2seq model in two major respects: Firstly, instead of only transmitting the ultimate hidden state from the encoder to the decoder, it transmits all the hidden states to the decoder, thus enriching the information transmitted.Secondly, before producing its output, the decoder integrates an additional step.It evaluates all the hidden states received from the encoder, assigning them a score via multiplication by their softmax value, to better target the crucial elements of the input.
For each word in the input sequence, a contextualized attention layer (self-attention) generates a vector representative of its importance (attention vector).Although explanations are presented here as vectors for simplicity's sake, calculations actually involve matrix representations (several vectors juxtaposed), with each word associated with a row of a matrix.Thus, attention vectors actually describe the relative influence of each row of a matrix representing the sequence as a whole, allowing contextual interactions to be captured at all levels.
To calculate the attention vector for each encoded input, we need to consider three types of vectors:

•
The query vector, q, represents the encoded sequence up to that point.

•
The key vector, noted k, corresponds to a projection of the encoded entry under consideration.

•
The value vector, v, contains the information relative to this same encoded entry.
Thus, to obtain the attention vector associated with a given encoder input, the mechanism takes into account these three vectors: q, k, and v, respectively, the request vector, key vector, and value vector.
These vectors q, k, and v are indexed by the position of the word in the processed sequence.For example, for the first word, we will have the vectors q 1 , k 1 , and v 1 .They are generated by projecting the x vector representation of each word using three matrices:

•
The Q matrix produces the q query vectors.

•
Matrix K produces key vectors k.

•
Matrix V generates value vectors v.
These three matrices, Q, K, and V are learned during the training process of the transformer model [53] so as to optimize the calculation of contextual attention.The vectors q, k, and v for each word are thus derived by multiplying the vector representation x with these matrices.
d k is the hidden dimensionality for keys.V T is the transpose of a V matrix.

Self-Attention
Self-attention aims to model the contextual relationships existing within a single sequence thanks to the attention mechanism [53].Within a self-attention layer, we seek to determine the interdependence between different elements (words), in order to obtain a vector representation enriched by the global context.Unlike traditional attention between input and output sequences, here, we calculate attention scores between the elements of the single sequence under consideration.This internal self-attention is an essential building block in the transformer architecture, enabling words to be enriched by their linguistic environment within the same sentence.Self-attention is also called intra-attention [61][62][63].

Multi-Head Attention
The multi-head attention architecture is a major advantage of the transformer model.By dividing calculations between several "heads" carrying out their processing in parallel, it considerably speeds up processing, since each head handles part of the data simultaneously.Parallelization also gives the system greater modeling capacity.
Each head can then model contextual relationships from its own angle or scale, enriching the final representation with complementary perspectives.Multi-headed attention thus captures dependencies in a finer, more complex way between the various elements processed.Furthermore, integrating a variety of contexts enables each word to be better represented.
This architecture significantly increases the flexibility and expressive capacity of the transformer model without critically increasing the number of parameters.It also makes the system more robust, as each head can compensate for any local errors of the others.All in all, multi-headed attention makes the most of the possibilities offered by parallel computing to enhance contextual modeling.

Attention Hyperparameters
The dimensions of the data manipulated by the transformer model are configured using three hyperparameters:

•
The embedding size, which corresponds to the dimension of the vectors used to represent the input elements (words and tokens).Present throughout the model, this dimension also defines its capacity, called "model size".

•
The size of the queries (equal to that of the keys and values), i.e., the dimension of the vectors produced by the three linear layers generating the matrices of queries, keys, and values required for attentional calculations.

•
The number of attentional heads, which determines the number of attentional processing blocks operating in parallel.
These three hyperparameters determine the vector formats used at each stage of the transformer, from input to output.They directly condition its expressive capabilities and the volume of data it can process in a sophisticated way.

Positional Encoding
As the transformer model has neither recurrence nor convolution, it has no intrinsic mechanisms for exploiting the order of elements within input sequences.This property is essential for many tasks, such as translation or text comprehension.To remedy this, positional embeddings are added to the initial token plunges.
In concrete terms, each position in the sequence is assigned a vector encoding its relative or absolute place.These positional embeddings, which have the same dimensions as the usual embeddings, can then simply be added to them.In this way, the model has additional information on the position of each token within the sequence.
These positional embeddings can be learned during training as free parameters, or fixed using mathematical functions such as sine/cosine.Placed at the input of the encoder and decoder layers, they enable the latter to exploit the missing sequential dimension, significantly improving performance on many tasks.Their use has proved particularly beneficial in initial transformer models whose design rejects recurrence and convolution [64].

Transformers
The development of the digital world would not have been possible without advancements in automatic NLP.However, for a long time, NLP techniques remained limited due to insufficient progress in AI.
Recurrent neural networks (RNN) [61,63,65], and convolutional neural networks (CNN) [53], widely used in the past for NLP tasks, had the disadvantage of processing text sequentially.This approach proved inefficient with massively parallel computing units such as GPUs, which became essential with the advent of DL on large datasets.
It was against this backdrop that the transformer model emerged in 2017 by Vaswani et al. [53] of Google Brain and Google Research.Based on attention as a basic primitive, it enabled sequences to be processed in a truly parallel way for the first time, thanks to multi-headed calculations.
In terms of performance, transformer proved superior to RNN and CNN on natural language understanding and generation tasks.It was able to learn more efficiently and in parallel, better capturing long-range dependencies.Its excellent evaluation results have made the transformer an essential component of modern NLP.
It has revolutionized the field by enabling language processing on a very large scale, paving the way for major advancements such as human-quality machine translation.
Encoders and decoders are the two essential components of the original transformer model (see Figure 2).Each part is composed of a stack of N = 6 layers that are all identical.The output of one layer is the input of the next layer until the final representation (prediction) is reached [53].Transformer models, as represented in Figure 2, are distinguished by the presence of two fundamental elements: encoders and decoders.These two components form the basis of the architecture.Each of these elements is designed around a set of six identical layers, where the output of each layer feeds the input of the next, enabling a progressive transformation of information until the final prediction is formulated.
The encoder, which forms the left-hand side of this architecture, is structured into two main blocks, both of which are neural networks.
A self-attention layer works to preserve the dependencies between words in a sequence.It analyzes each word in relation to the others, in order to identify its context.A feedforward neural network, which performs complex transformations on the data received from the self-attention layer.Its main task is to transform an input sequence into a series of continuous representations, which then serve as inputs for the decoder.
The decoder, located on the right-hand side of the architecture, is structured in a similar way as the encoder but also incorporates an "Encoder-Decoder Attention" layer.The latter plays a crucial role: it facilitates the attention mechanism between the input sequence (once encoded) and the output sequence (during its decoding phase).The aim is to ensure that each word in the output sequence takes account of all the words in the input sequence.
The decoding process is based on the combination of the encoder output and the output generated by the decoder during the previous time step, enabling the output sequence to be built up progressively.
One of the main strengths of transformer architectures such as LLMs lies in their ability to extract essential intermediate features, such as syntactic structures and semantic integration of words.This ability enables LLMs to represent human language knowledge accurately and richly, making them powerful in a variety of tasks such as QA, sentiment classification, and machine translation.Moreover, LLMs are versatile, as they can be adapted to new challenges by reusing the same pre-trained model, which often puts them head and shoulders above previous methods.Their ability to extract and exploit relevant linguistic information makes LLMs powerful and versatile tools in NLP.
In the following section, we explain the architecture of LLMs in detail, highlighting the key mechanisms that contribute to their interesting results in text understanding and generation.

LLMs' Architecture
LLMs have made significant advancements in recent years.First, the transformer models introduced a novel attention-based neural architecture for NLP.
These advanced models, with their billions of parameters, have become feasible thanks to recent advancements in computational capabilities and model architecture [66], as shown in Figure 3.For example, GPT-3, which relies on extensive data, has around 175 billion parameters [67].Meanwhile, the open-source LLaMA model series ranges between 7 and 70 billion parameters [68,69].Table 1 summarizes the main LLMs, giving their general architecture as well as an indication of the number of parameters in some of the basic versions used in research.This enables a quick comparison of the size and specific features of each model.The initial phase of LLM training, called pre-training, uses a self-supervised method using large amounts of unlabeled data, which include sources such as general web content, Wikipedia, Github repositories, social media, and BooksCorpus [68,70].The main goal of training is to predict the next word in a series, which requires significant resources [71].This involves converting the text into tokens before entering it into the template [72].The outcome of this process is a base model designed primarily for basic language generation but unable to perform complex tasks.
After initial pre-training, the model undergoes a fine-tuning phase to specialize it for certain tasks [72].Here, the model can be trained on narrower domain-specific datasets, like medical records for healthcare applications.
This fine-tuning can be improved through various techniques.A constitutional AI approach embeds predefined rules or principles directly into the model architecture [73].Reward-based training involves human evaluators assessing the quality of multiple model outputs to provide feedback [70,74].Reinforcement learning based on human feedback uses a comparison-based system to optimize responses through iterative human feedback [74].
This fine-tuning step requires less computational power but more human resources compared to pre-training.It customizes the model to perform a specific task, like a chatbot, with controlled, targeted results.The fine-tuned model resulting from this phase is then deployed for flexible applications in its domain of specialization.
The goal is to leverage general pre-training and then fine-tune the model's capabilities for the nuances of its intended use case through various focused training approaches during fine-tuning.The result is a model that is both adapted and versatile for its specialist field of application.
LLMs have an impressive ability to adapt to unfamiliar tasks and demonstrate remarkable reasoning skills [75,76].However, to fully exploit their potential in specialized fields such as medicine, more specific training strategies are essential.These strategies could encompass direct prompting methods such as few-shot learning [3], in which a limited set of task examples during testing guides the model's results, and zero-shot learning [77], in which the model operates without any prior specific examples.In addition, refined techniques such as chain-of-thought prompting [78], which prompts the model to sequentially decompose its reasoning, and self-consistency checks [79], which ask the model to confirm the consistency of its responses, are also crucial.
Reinforcement learning (RL) is also a complementary technique in the fine-tuning process of LLMs, aimed at improving and better aligning pre-trained models.To enhance the performance of these LLMs, optimization methods inspired by RL or directly derived from RL are implemented.Notable examples include RL from human feedback (RLHF), direct preference optimization (DPO) and proximal policy optimization (PPO).
RLHF [2,69] integrates RF techniques with human feedback to refine the language model.DPO, introduced by Rafailov et al. [80], focuses on direct preference optimization for model-generated responses.PPO, initially conceived by Schulman et al. [81] and later adapted by Tunstall et al. [82], employs proximal policy optimization for LLM fine-tuning.
These RL-based approaches have proven effective, particularly when combined with instruction tuning, in improving the relevance and quality of responses generated by LLMs.However, it is important to note that, like instruction tuning, these methods are primarily aimed at improving the quality of responses and their behavioral appropriateness, without necessarily increasing the breadth of knowledge that the model can demonstrate.
Instruction prompt tuning, developed by Lester et al. [83], is a promising technique for efficiently updating model parameters, enhancing its performance on many downstream medical tasks.This approach has advantages over prompt methods in a few examples.In general, these methods enrich the initial model fine-tuning processes, enhancing their suitability for medical problems.A recent example is the Flan-PaLM model [84].Figure 4 compares classical fine-tuning and prompt tuning for adapting pre-trained language models to specific tasks.In classical fine-tuning, each task requires the creation of a dedicated model, leading to multiple versions of the model with different configurations for each task.This approach is complex and requires significant resources.Prompt tuning, on the other hand, simplifies the adaptation process.Instead of creating separate models, we simply provide the original pre-trained model with a "prompt" specific to each task.This "prompt" acts as an instruction that allows the model to handle different tasks without requiring major modifications.The major advantage of prompt tuning lies in its efficiency.For example, with a large T5 model [55], traditional fine-tuning for each task can require around 11 billion parameters, which is equivalent to creating new models for each task.In contrast, prompt tuning only requires about 20,480 parameters per task "prompt", representing a significant reduction of five orders of magnitude.It is therefore possible to use a single adaptable model with minimal modifications to accomplish various tasks.
Consequently, prompt tuning allows for the adaptation of a single pre-trained model to a multitude of tasks, significantly reducing the computational resources required and improving scalability compared to traditional fine-tuning.This breakthrough paves the way for more widespread use of language models in various domains.
Understanding training methods and their ongoing evolution provides essential insights for assessing the current capabilities of models and identifying their potential future applications.In specialized fields such as healthcare, where precision and expertise are essential, an in-depth analysis of model specialization techniques is crucial to assessing their full potential.Among these techniques, retrieval augmented generation (RAG) was introduced by Lewis et al. [85].This method aims to enhance the capabilities of LLMs, especially in knowledge-intensive tasks, by integrating external sources of information.Originally, the application of RAG required task-specific training.However, subsequent research [86] revealed that a pre-trained model incorporating the RAG method could increase its performance without the need for additional training.
The fundamental principle of RAG is based on the use of an auxiliary knowledge base and an input query.The RAG architecture is then used to identify relevant documents in this database that are close to the query.These retrieved documents are then merged with the initial query, providing the model with an enriched, in-depth context on the queried subject.
The success of LLMs is attributed to their ability to generalize from the massive amounts of data that they are exposed to during training.This enables them to perform a wide array of language-related tasks with impressive accuracy, including text completion, language translation, sentiment analysis, and more.LLMs have also demonstrated proficiency in understanding context and generating contextually appropriate responses in conversational settings.Among the instances of LLMs, notable examples include PaLM [87], GPT-3 [3], LLaMA [68], PaLM2 [88] utilized in the BARD chatbot, Bloom [89], and GPT-4 [2].These models are trained on billions of tokens sourced from datasets such as Common Crawl, WebText2, Wikipedia, Stack Exchange, PUBMED, ArXiv, Github, and many other diverse repositories.
Additionally, researchers are exploring, using LLMs, the creation of systems capable of understanding and applying complex instructions from text.In this context, models such as Alpaca [90], StableLM [91], and Dolly [92] stand out.Alpaca, based on Meta's LLaMA 7B model, showed that a lighter model can compete with heavier models like OpenAI's text-DaVinci-003.In addition, Alpaca is more economical and easy to reproduce thanks to its open structure.
This indicates that it is entirely possible for less demanding models to compete in performance while still being transparent and accessible, thanks to their free nature.Sta-bleLM and Dolly are also working in this direction, seeking to better respond to requests formulated in natural language.
LLMs now have trillions of learning data tokens and the number of their parameters has grown rapidly [68].Rather than improvements in architectural design, the amount of data, parameters, and computational resources appear to be a driving factor in the capabilities of LLMs [93].
While LLMs have opened up new possibilities for understanding and generating natural language, they also pose challenges and ethical considerations.The risk of bias in learning data, ethical concerns relating to content generation, and issues of data privacy are some areas that researchers and practitioners continue to address.

ChatGPT
ChatGPT, developed by OpenAI and based on the GPT architecture [54], was officially released in June 2020.This model, which has undergone several updates, has been trained using a vast corpus of textual sources such as books, web content, and articles, giving it the ability to capture the complexities of human language.
ChatGPT is able to produce texts that reflect human articulation in response to specific prompts.This prowess makes it invaluable in a variety of NLP applications, from chatbots to translation to text summarization.Furthermore, its adaptability means that it can be fine-tuned for particular tasks using more concentrated datasets.
To explore its origins, the first GPT-1 [54] used a 40 GB text dataset for training.In contrast, its successors-GPT-2 [56], GPT-3 [3], and GPT-4 [2]-benefited from much larger datasets.This progressive increase in training data propelled ChatGPT's effectiveness, with GPT-4's results becoming practically indistinguishable from human-written text.Table 2 illustrates the evolution of the different versions of GPT, specifically in the context of the healthcare sector.Improved consistency in text generation, but still with limits in terms of precision and specific medical context.

GPT-3 2020 175 Bn
Search summaries, assistance in interpreting medical data, generation of medical QA.
Enhanced ability to understand and generate complex medical texts, useful for literature analysis and health consulting applications.

GPT-4 2023 Unknown
In-depth analysis of clinical cases, integration into assisted diagnosis systems, personalized health advice.
Significant advancement in language understanding and accuracy, and the ability to integrate textual and visual data for more comprehensive analyses.
Each GPT version has marked a significant advancement over the previous one, with improvements in terms of model complexity, language understanding, and polyvalence of application domains.GPT-4's capability to manage multimodal inputs (text and images) is a notable evolution over previous versions [2].
Researchers have conducted investigations to assess the usefulness of ChatGPT and InstructGPT [74] for healthcare, as well as their suitability for specific medical tasks.For example, Luo et al. [94] developed BioGPT, a model based on the GPT-2 framework pretrained on 15 million PUBMED abstracts.BioGPT outperformed other state-of-the-art models in a range of tasks, including QA, relationship extraction, and document classification.Likewise, BioMedLM 2.7B (formerly called PUBMEDGPT [95]), which is pre-trained on both PUBMED abstracts and full text, illustrates ongoing progress in this area.In addition, researchers have used GPT-4 to create multimodal medical LLMs, and early results are promising in this area.
Current research in the biomedical sector focuses mainly on the use of ChatGPT for data assistance.These approaches train models of small size using ChatGPT's distilled or translated knowledge.A notable example is Chatdoctor [32], which marks the first initiative to adapt language and machine learning (LLM) models to the biomedical domain.They fine-tuned the LLaMa model through dialogue simulations generated via ChatGPT.
Another example is DoctorGLM [96], which uses ChatGLM-6B as a base model and fine-tunes it using the Chinese translation of the ChatDoctor dataset, which was obtained using ChatGPT.In addition, Chen et al. [97] developed an improved Chinese and medical linguistic model in their LLM collection.
These works collectively demonstrate the potential of LLMs to be successfully applied in the biomedical domain.They highlight the adaptability of LLMs to meet the specific needs of medical assistance and underline the importance of using synthesized and translated data to form specialized models in this field.

PaLM
The Pathways Language Model (PaLM), introduced in 2019 [87], is a powerful architecture built upon the transformer decoder, a densely connected language model.PaLM stands out from other models as it prioritizes text generation over input processing, resulting in exceptional efficiency for tasks like generating text.Google developed the Pathways approach specifically for training large-scale machine learning models, and PaLM benefits from this methodology.Pathways enables flexible and scalable model training, a critical factor in handling the immense volumes of data necessary to train such models.
PaLM has been trained on a massive dataset consisting of 780 billion tokens.This dataset includes a variety of textual sources, such as web pages, Wikipedia articles, source code, conversations on social networks, press articles, and books.By incorporating these different sources, PaLM is exposed to various languages, knowledge, and text styles.This enables it to acquire a thorough understanding of language, ranging from structured and specialized knowledge to informal expressions and literary narratives.
With 540 billion parameters, PaLM has a significant ability to model complex relationships, which is advantageous for analyzing and synthesizing detailed medical information.It can potentially be used to generate and summarize medical information, help interpret health data, and answer general medical questions [98].One of its technical innovations, the "Chain-of-Thought Prompting" method [78], enables it to process complex queries in several stages, which could prove beneficial for solving complex medical problems or interpreting health data in several phases.
A standout feature of PaLM is its competence in addressing out-of-vocabulary (OOV) words, which are terms not seen during its training phase.Rather than stumbling over these unfamiliar words, PaLM can generate fitting contextual replacements, thereby elevating its overall performance in text generation.
Researchers first adapted the PaLM model to medical QA, leading to the creation of [84].This model established state-of-the-art benchmarks for QA.Based on these results, the Med-PaLM model was introduced using instruction tuning, proving its effectiveness in areas such as clinical expertise, scientific consensus, and medical reasoning processes [99].
This approach has been extended to develop a multimodal medical LLM.These PaLMcentric models highlight the benefits of adapting basic models to the specific needs of medical scenarios.Flan-PaLM is a specially designed version of PaLM for processing instructions [84].The latest version, called Med-PaLM2, was unveiled at Google Health's annual event [98].This version was developed based on PaLM2, which is the fundamental language model used by Google's chatbot, Bard.
Recently, Google developed a multimodal model called Med-PaLM M [100], which follows on from its PaLM-E vision-language model [101], and has the ability to synthesize and communicate information from medical images such as chest X-rays, dermatology images, pathology slides, and other biomedical data to aid the diagnostic process by understanding and discussing these visuals through a natural language dialogue with clinicians.

LLaMA
The LLaMA (Large Language Model Meta AI) collection, was unveiled in 2023 by Touvron et al. [68].It brings together a whole series of foundational models of imposing size, varying from 7 to 70 billion parameters.These giant models could be trained on a massive scale via unsupervised learning techniques such as word masking and nextsentence prediction.The training was done on colossal public datasets, representing trillions of tokens in total.
These resources included Wikipedia, Common Crawl, and OpenWebText.By using only freely accessible data, the creators of LLaMA have shown that it is possible to achieve a state-of-the-art level of performance without resorting to proprietary datasets.This collection thus demonstrates the remarkable progress made in the field, now allowing the deployment of giant linguistic models trained on a very large scale on public resources.
The LLaMA-1 model [68] was developed with efficient causal attention [102] by avoiding storing and calculating masked attention weights and key/query scores.A further optimization was achieved by reducing the number of activations recalculated during backtracking, as described in [103].As part of the work on LLaMA-2 [69], the focus was on improving a specific model called LLaMA-2-Chat, making it safer and more efficient for dialogue generation.The LLaMA-2 pre-trained model has been improved by incorporating 40% more training data.In addition, the context length was extended and a feature called Grouped Query Attention (GQA) was added [104].The model was trained on a massive dataset consisting of 2000 billion tokens.
Gema et al. [105] have introduced Clinical LLaMA-LoRA, a Parameter-Efficient Fine-Tuning (PEFT) adaptation layer based on the LLaMA model.This specific adaptation for the clinical domain is trained using clinical notes from the MIMIC-IV database.By combining these data, a specialized adapter is created to improve the model's performance in the clinical domain.In addition, the authors propose a two-stage PEFT framework that integrates both Clinical LLaMA-LoRA and Downstream LLaMA-LoRA.The latter PEFT adapter is specifically designed for downstream tasks, enabling better adaptation of the model to specific tasks in the clinical domain.
PMC-LLaMA [106] is an open-source language model specially designed for the analysis of biomedical articles.It is fine-tuned from LLaMA on biomedical academic articles, giving it increased specialization and understanding of biomedical language.
Visual Med-Alpaca [107] is a multimodal biomedical model based on the LLaMa-7B architecture.It is specially designed to handle various biomedical tasks by integrating both linguistic and visual data.Its training process involves a collaboration between GPT-3.5-Turbo, a powerful language model, and human experts.

Bloom
Bloom is a massive multilingual LLM with 176 billion parameters, trained on the NVIDIA AI platform using vast text data [89].
While referred to as an LLM, Bloom's core functionality is in text generation through continuation; it is prompted with an initial context and asked to complete and extend the text.The terms generation, continuation, and completion are thus used somewhat interchangeably to describe Bloom's text production abilities.
A key attribute of Bloom is its multilingualism; it can generate text in 46 different human languages as well as 13 programming languages.By leveraging immense computational resources, Bloom was trained on an industrial scale with massive language datasets, resulting in a highly capable generative model with a breadth of linguistic competencies unprecedented for an openly available LLM.
As described in [89], BLOOM's architecture and pre-training are based on the transformer approach, using a causal-only decoder model.BLOOM's modeling details include the use of ALiBi positional embeddings, which directly adjust attention scores according to the distance between keys and queries.This facilitates training and improves performance.In addition, an embedding normalization layer is added after the integration layer to stabilize training, using float16 precision.
In addition to these architectural components, data pre-processing is crucial.This includes steps such as de-duplication and privacy suppression, particularly for high-risk sources.BLOOM also employs prompted fine-tuning with guest datasets, which enhances its zero-shot generalization capabilities and improves performance.

StableLM
StableLM, an open-source LLM, has been launched by Stability AI [91].It is available in versions with 3 billion and 7 billion parameters, with larger versions on the horizon (with 15 billion to 65 billion parameters).This follows the introduction in 2022 of Stable Diffusion, a model for generating images from a sentence of text.StableLM, designed to generate text and code, shows that smaller models can achieve high performance with the right training.The model relies on a large dataset based on the Pile dataset, but three times as large.This large dataset enables the model to excel in conversational and coding tasks, even though its number of parameters is considerably lower than that of models such as GPT-3.
StableLM comes in two main variants: StableLM-3B-4E1T, which focuses on studying the impact of repeated tokens, and StableLM-Alpha v2, which improves on the initial architecture of the Alpha models and uses larger, higher-quality datasets.In addition, experimental fine-tuning has been carried out to develop StableLM-Tuned-Alpha, combining various datasets to enhance its capabilities as a conversational agent.

Applications of Transformer-Based LLMs in Healthcare
When a patient meets a clinician for the first time, it is essential to establish an accurate diagnosis and create a relationship of trust.Unfortunately, time constraints often limit the ability to thoroughly review medical histories and provide tailored care to each patient.LLMs can improve this process by extracting relevant information from electronic health records [108], allowing a concise overview of the patient's medical history to be presented.This helps optimize consultation productivity.
By highlighting past medical conditions, treatments, medications, and results of previous visits, LLMs eliminate the need to review numerous records, allowing for more targeted interactions to address specific patient issues.By focusing on key concerns, LLMs ensure that diagnosis is personalized to each patient.Additionally, these models are constantly improved as health data grow, ensuring their accuracy.By ensuring data transparency and implementing continuous monitoring, the potential of LLMs can be further improved, resulting in faster diagnoses and greater patient satisfaction.
Transformer-based language models, such as LLaMA, ChatGPT [109], and GPT-4 [2] have shown great potential in a variety of fields, including medical practice.These models are particularly efficient at understanding and producing texts that resemble those written by humans, opening up possibilities for supporting healthcare professionals in the diagnostic process, as shown in Figure 5.The rest of this section illustrates how these models can be applied for this purpose.

Text Analysis and Summarization
Electronic medical records (EMRs) have transformed the way healthcare professionals access and use patient data [110], profoundly altering their approach to decision-making.A major advantage of EMRs lies in their ability to summarize clinical observations [18], giving doctors a clear view of potential health hazards and facilitating informed medical choices.This data summarization not only minimizes diagnostic errors but also optimizes therapeutic outcomes thanks to up-to-date, targeted patient information [111].
Nevertheless, manual summarization of clinical reports is laborious and error-prone [112].The abundance and density of data mean that even seasoned professionals can miss crucial details.It is therefore imperative to develop automated synthesis techniques to increase the efficiency and reliability of medical care.
Relying on these automated summarization techniques, medical staff could rapidly extract the essential elements from the clinical notes provided, thereby reducing the risk of errors and enhancing the quality of the care provided.
The rise of LLMs in NLP has opened new possibilities for streamlining automated clinical report summarization.These models, recognized for their capability to align with input directions, benefit from the integration of text prompts [113,114].
Indeed, LLMs, like BERT, BART, LLAMA, Bloom, GPT-3.5, and GPT-4, have demonstrated a remarkable ability to analyze, understand, and summarize large quantities of unstructured text such as clinical notes, patient histories, scientific articles, etc. [115,116].This makes them helpful in extracting meaningful information from textual medical data [117].
Large amounts of medical literature can also be filtered and analyzed by LLMs, which can then dynamically summarize data from many sources for doctors [126,127].Succinctly synthesizing ideas saves crucial time.
LLMs can analyze clinical text and identify clinical concepts, entities, and their interrelationships.This enables them to generate differential diagnoses [128,129], predict specific diagnoses [130,131], and provide answers to physicians' queries [98,132].Moreover, LLMs' multilingual capabilities can facilitate global healthcare [133].
In capturing patient details and symptoms, LLMs offer a promising solution for the extensive documentation that physicians and clinical experts face daily [134].These models can produce detailed clinical summaries and diagnostic reports, effectively alleviating the time-intensive burden of these professionals [135].A concrete example of their potential is an LLM specially designed to condense the results of radiology reports, highlighting its applicability in similar medical fields [136].
To optimize time management during consultations, LLMs can be exploited to generate concise summaries of each patient's medical history.These summaries include information on comorbidities, previous consultations or admissions, medication lists, as well as progress and response to previous treatments [19,137].These synthesized summaries draw the doctor's attention to relevant information about the patient, which is crucial for an accurate diagnosis of the disease.By concisely providing this information, LLMs promote more efficient and comprehensive consultations.This approach allows doctors to devote more time to patient interaction, which in turn increases patient satisfaction.
Joseph et al. [138] recently introduced FACTPICO, an evidence-based reference for plain-language summaries of medical texts describing randomized controlled trials (RCTs).RCTs are essential in evidence-based medicine and play a direct role in patients' treatment decisions.FACTPICO consists of 345 plain-language abstracts generated by three LLMs (GPT-4, Llama-2, and Alpaca).It includes fine-grained evaluation and natural language justifications provided by experts.
As a result, using LLMs to generate concise summaries optimizes time management during consultations, enables physicians to focus on patient interaction, and improves patient care [139].In this way, the summary process helps clinicians to efficiently review patient histories, research studies, and guidelines [140].

Diagnostic Assistance
Using LLMs to aid medical professionals in making diagnoses has shown potential [141].LLMs, trained on a massive medical corpus, have acquired important clinical knowledge.This latent expertise gives them several assets that can support the diagnostic process.
LLMs can synthesize information from a variety of sources, such as medical literature, clinical guidelines, and case histories, to guide decision-making.Using a patient's data, these models can automatically generate differential diagnosis hypotheses, ranked according to their estimated probability, for consideration by the clinician [142][143][144].
Medical doctors (MDs) can quickly acquire pertinent decision assistance by using natural language queries to interact with LLMs through conversational interfaces.In order to identify patterns and create clinical syndromes, related indications and symptoms may be semantically grouped [145,146].
LLMs can also assist in clinical decision-making by suggesting suitable medications [147], recommending relevant imaging services based on symptoms [148], or pinpointing disease causes from diverse clinical documents.When combined with tools like medical imaging, these models can offer a holistic view, helping doctors diagnose and define diseases [149].Moreover, by evaluating cases of individuals with comparable symptoms, LLMs can forecast potential disease outcomes, empowering both doctors and patients to make well-informed treatment choices.
By extracting the most important elements from narrative records, LLMs can simplify the review process for healthcare professionals by creating structured record segments [150,151].In addition, the synthesis of information from exchanges or expert research helps to improve data assimilation.
Supplementary examinations often play a crucial role in medical diagnosis, helping to reinforce clinical hypotheses and confirm diagnoses.In this context, LLMs can act as clinical decision support tools, helping physicians choose the most relevant radiological examinations for given clinical cases [8].This approach has the potential to minimize pressure on limited resources, particularly in public hospitals or resource-constrained areas where imaging equipment and technical support are limited.Furthermore, LLMs can help avoid unnecessary examinations, such as those involving radiation or contrast agents, for patients who do not need them.
The increasing use of DL models to create computer-aided diagnosis (CAD) systems has automated the interpretation of medical images, facilitating the detection and classification of pathologies.However, their integration into clinical practice is sometimes hampered by a lack of transparency in the decisions made by these models.To remedy this, the integration of LLMs into CAD systems can enable clinicians to ask questions about specific images or patient cases, thereby clarifying the CAD system's decision-making process.This human-machine interaction can make models more interpretable and encourage their use in routine diagnostic procedures.In addition, this approach may reveal new insights or biomarkers in disease imaging.
A recent study [31] presents a new paradigm called generalist medical AI (GMAI), in which models are trained on large, diversified medical datasets.This approach allows models to perform a large range of tasks, such as diagnosis, prognosis, and treatment planning.The paper evaluates the performance of GMAI models on different medical tasks and finds that they perform better than conventional specialized medical AI models in several areas, including diagnosis, prognosis, and treatment planning.
In the future, the integration of different data sources, such as images and laboratory tests, could provide more comprehensive diagnostic information [152].Throughout this process, however, it is essential to maintain the interpretability of models and to assign responsibility for results to human practitioners.The integration of multimodal data will pave the way for in-depth diagnostic analysis [153].
Despite the significant progress made by LLMs, their use to assist medical diagnosis has several important limitations.First, these models may lack the clinical precision necessary for specific diagnoses, as they may fail to capture all essential details or misinterpret complex medical information.Additionally, they generally struggle to fully understand the clinical context of a patient, including crucial elements such as medical history or current symptoms [154].
The data used to train these models can also contain biases or inaccuracies, which can translate into erroneous recommendations.It is important to emphasize that LLMs can in no way replace human clinical expertise, especially regarding physical examination and interpretation of medical test results.
Data privacy and security issues are also a major concern, particularly when dealing with sensitive patient information.Furthermore, LLMs are not always effective in interpreting complex medical test results, such as radiological images or laboratory tests.Their use in the medical field therefore requires rigorous clinical validation to ensure their safety and efficacy before deployment in a healthcare setting.
Finally, given the rapidly evolving nature of the medical field, these models need to be constantly updated to incorporate new discoveries and practices, which necessitates continuous maintenance.

Answering Medical Queries
QA is a task that automatically provides an answer to a given question.Indeed, LLMs have greatly progressed the field of QA in NLP, enabling machines to comprehend and respond to natural language queries with precision.These extensive language models find applications in virtual assistants and chatbots, delivering precise and pertinent responses to user questions.Such systems leverage LLMs to grasp the context and meaning of inquiries and formulate suitable replies.For instance, Google Assistant, underpinned by LLMs, adeptly addresses an array of user queries spanning general knowledge, weather updates, directions, and more [57].
Advancements in DL exemplified by "transformers" now open up new possibilities [53].Under their enhanced capacities, LLMs [4,55] have the potential to gain a finergrained understanding of complex relationships within enormous corpora.The recent emergence of transformers and LLMs has breathed new life into research exploring AI's potential to tackle the great challenge [155,156] of medical QA.
For instance, GPT-3.5 achieved an accuracy of 60.2% on the MedQA (USMLE) dataset, demonstrating its ability to provide accurate responses to medical queries [98,166].Flan-PaLM, another powerful LLM, achieved an even higher accuracy of 67.6% on the same dataset.The performance of GPT-4-base [2] was even more impressive, achieving an accuracy of 86.1% on medical QA tasks [167,168].
These advancements in LLMs have been observed within a relatively short time, showcasing their rapid progress and potential for accurate and reliable medical QA.It is important to note that these accuracy figures are specific to the mentioned datasets and models, and the performance may vary depending on the specific task and dataset involved.
The critical question remains whether they can generate responses meeting the demands of the clinical context-reliability, interpretability, and adherence to ethical standards, prerequisites for actual utility at the point of care [169][170][171][172].
One of the major challenges encountered in medical QA systems is the problem of hallucinations leading to incorrect answers [173].An approach commonly used to solve this problem is retrieval augmentation.This approach consists of combining LLMs with a search system such as New Bing for general domains, or Almanac for clinical domains [174].It involves first retrieving relevant documents as support, and then using LLMs to answer a specific question based on these retrieved documents.The underlying idea is that LLMs are able to effectively summarize content, which can reduce hallucinations.It is important to stress, however, that these systems are not error-free [175] and require thorough and systematic evaluation to ensure their quality [176].Further studies are needed to rigorously evaluate the performance of these systems and identify possible sources of error.
A promising approach to solving the hallucination problem is to augment LLMs with additional tools.For example, recent work has explored the use of LLM augmentation by combining data from specific sources [177][178][179][180].A concrete example is the GeneTuring dataset, which contains search questions for information on specific single nucleotide polymorphisms (SNPs).However, it is important to note that autoregressive LLMs do not possess specific knowledge about these SNPs, and most commercial search engines are unable to provide relevant results for such queries.Consequently, the approach of increasing retrieval may prove ineffective in such cases.
In such situations, relevant information is often only accessible via specialized databases such as NCBI dbSNP.For this reason, the approach of enriching LLMs using web database utility APIs, such as those provided by NCBI, offers the potential for solving the problem of hallucinations related to specific entities in biomedical databases [178].
While computational progress hints at promising avenues, rigorous validation in realworld healthcare environments remains essential before these systems can meaningfully impact patient outcomes [176].Overall, advancements in NLP promise to improve the ability of systems to answer medical questions.However, this is only possible if rigorous evaluations are carried out to ensure the safety, transparency, and responsibility of such solutions.Indeed, as healthcare is a crucial area where the smallest failure could have serious consequences, it is essential to ensure through thorough testing that digital medical assistance systems respect the highest standards in terms of ethics, protection of sensitive data, and patient well-being.

Image Captioning
Developing precise and dependable automated report-generation systems for medical images presents various obstacles.These include the analysis of limited medical image datasets using machine learning techniques and the generation of informative captions for images that involve multiple organs.
The generation of image captions has been the subject of considerable work in computer vision.Early models focused on characterizing the objects, attributes, and relations present in the image via visual primitive extraction methods [181][182][183][184].
It has been shown that encoding by object region instead of global image representation increases performance [200].These advancements in image encoders and decoders, together with the development of attention networks, have led to significant improvements in the quality of automatically generated captions.
Captioning medical images is an especially difficult task due to the complexity and variability of images such as fetal ultrasound images.The latter are usually noisy, of low resolution, and dependent on factors such as fetal position, gestational age, or imaging plan.Furthermore, the availability of extensive annotated datasets is essential for tackling this task effectively.In response to these challenges, researchers have put forward DLbased approaches that seamlessly merge visual and textual data, enabling the creation of informative captions for both images and videos.
Alsharid et al. [201] presented a model for image captioning specifically designed for fetal ultrasound images.Alsharid et al. [202,203] have taken their studies further by introducing a programmatic learning approach to train such image captioning models.They have also proposed a captioning framework where an image is first classified, and then one of the multiple-image captioning models is employed to generate the corresponding caption.Each image captioning model is associated with an anatomical structure.
In a separate investigation [204], researchers addressed the difficulties encountered in generating captions for medical images, particularly those involving multiple organs.In response, they propose a solution that combines DL and transfer learning techniques, specifically employing a Multilevel Transfer Learning Technique (MLTL) and LSTM framework.The authors introduce a foundational MLTL framework consisting of three models designed to detect and classify datasets with severe limitations.This is achieved by leveraging knowledge gained from readily available datasets.The first model utilizes non-medical images to acquire generalized features, which are subsequently transferred to the second model, responsible for the intermediate and auxiliary domains related to the target domain.The study concludes by discussing the potential applications of this approach in the medical diagnosis field and its potential to enhance patient care.
A new model for automatic clinical image caption generation has been developed [205].It combines radiological scan analysis with structured patient information to generate comprehensive and detailed radiology reports.The model utilizes two language models, Show-Attend-Tell [206] and GPT-3, to generate captions that contain important information about identified pathologies, their locations, and 2D heatmaps highlighting the pathologies on the scans.This approach improves the interpretation and communication of radiological results, facilitating clinical decision-making.
The application of caption generation is not limited solely to images; it extends to videos as well.As part of another research project [207], the authors introduced an innovative method for generating captions for fetal ultrasound videos, which can be valuable in aiding healthcare professionals in their diagnosis and treatment decision-making processes.This approach leverages a three-way multimodal deep neural network that merges gaze-tracking technology with NLP.The primary objective is to develop a comprehensive understanding of ultrasound images, enabling the generation of detailed captions encompassing nouns, verbs, and adjectives.The potential applications of these DL models include integration into systems designed to assist in the interpretation of ultrasound scan video frames and clips.The article also explores previous studies in the field of image and video captioning and presents quantitative assessment findings.
Li et al. [208] developed a novel end-to-end video understanding system called VideoChat, which focuses on chat interaction.The system incorporates video foundation models and LLMs via a learnable neural interface.Its main feature is its ability to perform spatiotemporal reasoning, locate events in the video, and infer causal relationships between them.To fine-tune the system, the researchers developed a specific dataset focused on video instructions.This comprehensive dataset includes thousands of videos with detailed descriptions and associated conversations.It places particular emphasis on spatiotemporal reasoning and captures the causal relationships between events in the videos.
Although important advancements have been made in the field of generating captions for medical images and videos using LLMs, certain challenges still need to be tackled.Firstly, the limited size of the datasets used in some studies represents one of the main difficulties, as it may compromise the generalizability of the proposed approaches to more expansive datasets.In addition, the evaluation measures used in some research studies do not always provide a comprehensive understanding of the quality of the generated captions.In the future, it would be interesting to explore more sophisticated measures that would enable us to evaluate model performance more effectively.Improving the representativity of the data used and refining the evaluation criteria would be promising avenues for progress in this field.
Another challenge lies in the need for large quantities of annotated data for model training, which can prove complicated in the field of medical imaging where annotation requires expertise.In the future, research could explore new ways of reducing dependence on massive quantities of pre-annotated data, for example by developing methods that enable better generalization of the proposed approaches.The integration of additional contextual information such as clinical metadata or patient medical history into the caption generation process also represents a promising avenue for providing richer context and improving both the relevance and accuracy of the captions produced.This type of multimodal approach would go some way to overcoming the lack of massive annotated data.
Future research in medical image and video captioning can focus on two key areas: advancing transformer-based word embedding models and spatiotemporal visual and textual feature extractors and utilizing larger and more diverse datasets.These advancements aim to improve the accuracy, precision, and usefulness of generated captions in medical applications.By developing more sophisticated models and incorporating richer datasets, the goal is to overcome limitations such as dataset size and lack of diversity, ultimately enhancing the performance of automatic report generation systems in the medical field.

Analysis of Massive Medical Datasets
In this section, as shown in Figure 6, we will present a range of datasets by classifying them into three categories: general datasets, de-identification datasets, and medical QA datasets.

General Datasets
In this section, we present a non-exhaustive overview of several commonly used medical datasets, with a synthetic summary of their main characteristics, as described in Table 3.
This table succinctly summarizes key information on various datasets representative of clinical and biomedical fields, such as their theme, approximate size, and type of content, as well as a brief description of some.
Although this list is not intended to be exhaustive, it does provide an overview of the resources available and frequently exploited in current research in automatic medical language processing.MIMIC-III, known as "Medical Information Mart for Intensive Care III", is a publicly accessible database focused on critical care [209].It is an extensive and comprehensive health database that includes de-identified health-related data from more than 40,000 patients who were admitted to Beth Israel Deaconess Medical Center (BIDMC) in Boston between the years 2001 and 2012.The database encompasses a diverse array of information, including clinical notes, physiological waveforms, laboratory test results, medication details, procedures, diagnoses, and demographics.This extensive dataset holds immense value for medical research and healthcare analytics, as well as the creation and validation of machine learning models and clinical decision support systems.It provides a valuable resource for advancing medical knowledge and enhancing patient care.MIMIC-III is widely utilized by researchers and healthcare professionals for a range of studies, including predictive modeling [210,211], risk stratification [212], treatment outcomes analysis [213], and other medical research investigations [214,215].The database offers valuable insights into patient care [216,217], facilitates the development of advanced healthcare technologies [218,219], and contributes to the enhancement of clinical practices and patient outcomes [220,221].It is essential to emphasize the importance of ethical considerations and strict adherence to data usage policies when accessing and utilizing the MIMIC-III database to ensure proper handling and protection of patient information.

2.
MIMIC-CXR, referred to as "Medical Information Mart for Intensive Care-Chest X-ray", is an expansion of the MIMIC-III database that specifically concentrates on chest X-ray images and related clinical data [222].This publicly available dataset comprises de-identified chest X-ray images alongside their corresponding radiology reports for a substantial number of patients.A total of 377,110 images are available in the MIMIC-CXR dataset.These images are associated with 227,835 radiographic stud-ies conducted at the Beth Israel Deaconess Medical Center in Boston.The contributors to the dataset utilized the ChexPert tool [223] to classify the free-text notes associated with each image into 14 different labels.This process categorized the textual information accompanying the images, providing additional context and information for research and analysis.MIMIC-CXR is a valuable dataset that plays a crucial role in training and evaluating machine learning models and algorithms focused on chest X-ray image analysis, radiology report processing, NLP, and various medical imaging tasks.Researchers and healthcare professionals rely on MIMIC-CXR to develop and validate AI-driven systems designed for automated diagnosis [224,225], disease detection [226,227], and image-captioning applications specifically tailored to chest X-ray images [205,228].

3.
MEDLINE is an extensive and highly regarded bibliographic database that encompasses life sciences and biomedical literature.It is curated and maintained by the National Library of Medicine (NLM) and is a component of the larger PUBMED system.MEDLINE contains more than 29 million references sourced from numerous academic journals, covering a wide range of disciplines including medicine, nursing, dentistry, veterinary medicine, healthcare systems, and preclinical sciences.MEDLINE is a widely utilized resource by researchers, healthcare professionals, and scientists seeking access to an extensive collection of scholarly articles and abstracts.It serves as a vital source of information for academic research, aiding in clinical decision-making [229,230], and keeping individuals informed about the latest developments in the medical and life sciences [231][232][233].The database plays a crucial role in supporting evidence-based practice, enabling professionals to stay updated with the most recent advancements and discoveries in their respective fields.MEDLINE's comprehensive coverage and wealth of information make it an essential tool for professionals and researchers across the medical and life sciences domains.In MEDLINE, the data commonly consist of bibliographic information such as article titles, author names, abstracts, publication sources, publication dates, and other pertinent details.This dataset is an essential and foundational resource for conducting research in the fields of medicine and life sciences, supporting diverse applications like NLP [234], information retrieval [235,236], data analysis [237], and more [238,239].The wealth of information contained within MEDLINE enables researchers to explore and extract valuable insights, contributing to advancements in medical knowledge and facilitating a wide range of research endeavors within the healthcare and life sciences domains [240-242].

4.
ABBREV dataset, proposed by Stevenson et al. in 2009 [243], is a collection of acronyms and their corresponding long forms extracted from MEDLINE abstracts.Originally introduced by Liu et al. in 2001, this dataset has undergone automated reconstruction.The reconstruction process involves identifying the long forms of acronyms in MED-LINE and replacing them with their respective acronyms.The dataset is divided into three subsets, with each subset containing 100, 200, and 300 instances, respectively.

5.
The PUBMED dataset comprises over 36 million citations and abstracts of biomedical literature; while it does not provide full-text journal articles, it typically includes links to the complete texts when they are available from external sources.Maintained by the National Center for Biotechnology Information (NCBI), PUBMED is an openly accessible resource for the public.Serving as a comprehensive search engine, it enables users to explore a vast collection of articles from diverse biomedical and life science journals.PUBMED encompasses a wide range of subjects, spanning medicine, nursing, dentistry, veterinary medicine, biology, biochemistry, and various other fields within the biomedical domain.Within the PUBMED dataset, you can find a wealth of information, such as article titles, author names, abstracts, publication sources, publication dates, keywords, and MeSH terms (Medical Subject Headings).This extensive dataset is extensively used by researchers, healthcare professionals, and individuals within the academic and medical communities.PUBMED serves as a go-to platform for accessing the latest research and information in the vast field of biomedicine.It provides a comprehensive search engine that enables users to explore a diverse range of topics, including medicine, nursing, dentistry, veterinary medicine, biology, biochemistry, and more.By utilizing PUBMED, researchers and professionals can stay updated with the latest scientific literature, conduct literature reviews, and make evidence-based decisions in their respective fields.

6.
The NUBES dataset [244], short for "Negation and Uncertainty annotations in Biomedical texts in Spanish", is a collection of sentences extracted from de-identified health records.These sentences are annotated to identify and mark instances of negation and uncertainty phenomena.To the best of our knowledge, the NUBES corpus is presently the most extensive publicly accessible dataset for studying negation in the Spanish language.It is notable for being the first corpus to include annotations for speculation cues, scopes, and events alongside negation.7.
The NCBI disease corpus comprises a massive compilation of biomedical abstracts curated to facilitate NLP of literature concerning human pathologies [245].Leveraging Medical Subject Headings assigned to articles within PUBMED, the corpus was constructed by annotating over 1.5 million abstracts that discuss one or more diseases according to controlled vocabulary terms.Spanning publications from 1953 to the present day, the breadth of the included abstracts encapsulates a wide spectrum of health conditions.Each is tagged with the relevant disease(s) addressed, permitting focused retrieval.Beyond these annotations, access is also provided to the original PUBMED records and associated metadata, such as publication dates.By manually linking this substantial body of literature excerpts to standardized disease descriptors, the corpus establishes an invaluable resource for information extraction, classification, and QA applications centered around exploring and interpreting biomedical writings covering an immense range of human ailments.The colossal scale and selection of peer-reviewed reports ensure the corpus offers deep pools of content for natural language models addressing real-world biomedical queries or aiding disease investigations through text-based analysis.Overall, it presents a comprehensive, expertly curated foundation for driving advancements in medical text mining.

8.
The CASI (Clinical Abbreviation Sense Inventory) dataset aims to facilitate the disambiguation of medical terminology through the provision of contextual guidance for commonly abbreviated terms.Specifically, it features 440 of the most prevalent abbreviations and acronyms uncovered among the 352,267 dictated clinical notes.For each entry, potential intended definitions, or "senses", are enumerated based on an analysis of the medical contexts within which the shorthand appeared across the vast note compilation.By aggregating examples of appropriate abbreviation implementation in genuine patient encounters, the inventory helps correlate ambiguous clinical jargon with probable denotations.This aids in navigating the challenges inherent to interpreting abbreviated nomenclature pervasive throughout medical documentation, which is often polysemous.The dataset provides utility for natural language systems seeking to map abbreviations to intended clinical senses when processing narratives originating from authentic healthcare records.9.
The MEDNLI dataset was conceived with the goal of advancing natural language inference (NLI) within clinical settings.As in NLI generally, the objective involves predicting the evidential relationship between a hypothesis and premise as true, false, or undetermined.To facilitate model progress, MEDNLI comprises a sizable corpus of 14,049 manually annotated sentence pairs.The data originate from MIMIC-III, necessitating initial access to extract the pairs, while retrieval is dependent on securing permission to access the source's protected patients, MEDNLI balances this ethically with the availability of an extensive set.This supports valuable work on a key clinical NLP challenge: determining the inferential relationship between ideas expressed.10.The MEDICAT dataset comprises a vast repository of medical imaging data, matched materials, and granular annotations [246].It contains over 217,000 images extracted from 131,410 open-access articles in PUBMED Central.Each image is accompanied by a caption and 7507 feature compound structures with delineated subfigures and subcaptions.The reference text comes from the S2ORC database.The collection also includes online citations for around 25,000 images in the ROCO subset.MEDI-CAT introduces a new method for annotating the constituent elements of complex images.It correlates over 2000 composite visualizations with their constituent subfigures and descriptive subcapitals.This refined partitioning at multiple intralimatic scales within individual elements far surpasses previous collections based on unitary image-caption couplings.MedICaT's granular annotations lay the foundations for modeling intra-element semantic congruence, an essential capability for tackling the sophisticated challenges of visual language in biomedical imaging.With its vast scale and careful delineation of image substructures, MEDICAT provides a cuttingedge dataset to advance research at the intersection of vision, language, and clinical media understanding.11.The OPEN-I, the Indiana University Chest X-ray dataset, presents a sizable corpus of 7470 chest radiograph images paired with associated radiology reports [247].Each clinical dictation encompasses discrete segments for impression, findings, tags, comparison, and indication.Notably, rather than using full reports as captions, this dataset strategically concatenates the impression and findings portions for each image, while impressions convey overarching diagnoses, findings provide specifics on visualized abnormalities and lesions.By combining these sections, the designated captions offer concise and comprehensive clinical overviews.With a sample size exceeding 7000 image-report duos, this collection equips researchers with ample training and evaluation materials.Distinct from simpler descriptors, the report snippets incorporated as captions encompass richer diagnostic particulars.This notation format furnishes target descriptions well-suited to nurturing technologies aimed at automating radiographic report generation directly from chest X-ray visuals.12.The MEDICAL ABSTRACTS dataset presents a robust corpus of 14,438 clinical case abstracts describing five categories of patient conditions, an integral resource for the supervised classification of biomedical language [248].Unlike many datasets, each sample benefits from careful human annotation, providing researchers with a fully labeled collection that avoids ambiguity.The abstracts also emanate from real medical scenarios, imbuing the models with authentic representations.The collection divides examples into standardized training and test scores, enabling a rigorous and consistent evaluation of classification performance.At this scale, DL techniques can derive profound meaning from the distribution of terminology across different specialties.Automatic organization of vast quantities of documents according to disease status should lead to improved indexing, keyword assignment, and file routing.This fundamental resource enables investigators to cultivate methods that intelligently classify and route written summaries, streamlining discovery and care.

De-Identification Dataset
I2B2 (Informatics for Integrating Biology and the Bedside) is a non-profit entity with the primary goal of organizing NLP competitions and assembling datasets centered around clinical language.These datasets are tailored for specific tasks like de-identification, extracting relationships, and selecting cohorts for clinical trials.Despite the management transition to Harvard's National Clinical Language Challenges (N2C2), academic research predominantly continues to cite these datasets under the I2B2 name, acknowledging their founding influence and enduring significance in the domain, as shown in Table 4. Hereafter, we will detail all datasets associated with I2B2: The I2B2 2006 dataset serves two main objectives-deidentification of protected health information from medical records [249], and extraction of smoking-related data [250].
Regarding privacy, it focuses on removing personally identifiable information (PII) from clinical documents.This is important for enabling the ethical and lawful use of such records for research purposes by preserving patient anonymity.At the same time, the dataset facilitates the extraction of key smoking-related details from these records.This aspect is crucial for understanding impactful health behavior and risk factors.Researchers leverage this dual-purpose dataset to develop automated systems capable of both tasks through NLP.This significantly advances the application of NLP within healthcare to tackle real-world issues surrounding privacy and extracting meaningful insights from unstructured notes.The dataset is freely available to the research community.Overall, it serves an important role in propelling research at the intersection of privacy, smoking cessation, and NLP technologies for EMR.

2.
The I2B2 2008 Obesity dataset is a specialized collection aimed at facilitating the recognition and extraction of obesity-related information from medical records [251].The dataset challenge centered on identifying various aspects of obesity within clinical notes, such as body mass index, weight, diet, physical activity levels, and related conditions.Researchers leverage this curated corpus to build models and algorithms capable of accurately detecting and extracting obesity-related data from unstructured notes.Key elements identified include BMI, weight measurements, dietary patterns, exercise habits, and associated medical conditions.By developing technologies to systematically organize this clinical information, researchers can gain deeper insights into obesity trends, risk factors, and outcomes.This contribution ultimately advances healthcare research and strategies for obesity treatment and prevention [252][253][254].
The specificity of the I2B2 2008 Obesity dataset in regards to this important public health issue has made it a valuable resource for the medical informatics community to progress solutions.

3.
The I2B2 2009 Medication dataset is a specialized corpus focused on extracting granular medication-related information from clinical notes [255].The dataset challenge centered on precisely identifying and categorizing various facets of medications documented within notes, such as names, dosages, frequencies, administration routes, and intended uses.The corpus is meticulously annotated to demarcate and classify mentions of medications and associated attributes in the unstructured text [256].
Researchers leverage this resource to develop sophisticated NLP models tuned to accurately recognize and extract intricate medication details.Key elements identified and disambiguated include drug names, dosage amounts, dosing schedules, administration methods, and therapeutic purposes.By automatically organizing these clinical aspects at scale, the resulting NLP systems enable vital insights to support health-care decision-making and research [257].The specificity and detailed annotation of the dataset in mining this significant clinical domain has made it instrumental for advancing automatic extraction of medication information [258,259].

4.
The I2B2 2010 Relations dataset was specially curated to facilitate extracting and comprehending associations between medical entities in clinical texts.The dataset challenge centered on accurately identifying and categorizing relationship types expressed in clinical narratives, such as drug-drug interactions or dosage frequency links [260].
Annotations clearly define and classify the annotated relations, supporting the development of models precisely capable of recognizing and categorizing diverse medical connections.Advancements ultimately aim to bolster comprehension of treatment considerations and implications to elevate standards of care based on a complete view of patient context and interconnectivity [261,262].

5.
The I2B2 2011 coreference dataset has been specially designed to facilitate the resolution of coreferences in clinical notes [263].The challenge of I2B2 2011 was to identify and resolve coreferences expressed in narratives, where different terms refer to the same real-world entity.Coreference annotations meticulously indicate these relationships in order to train models to discern and accurately link references to a common entity.Automatic identification of coreferential links improves understanding of complex clinical relationships.Further development of these technologies should enable the medical community to better understand patient histories, presentations, and treatments through the complete resolution of entities in clinical discussions.

6.
The I2B2 2012 Temporal Relations dataset was curated with a specific focus on aiding in the identification and understanding of temporally related elements present in clinical notes [264].The dataset challenge centered on precisely identifying and categorizing the temporal relationships conveyed in narratives, with the goal of understanding when events or actions occurred relative to one another.The notes are meticulously annotated to delineate and classify these time-related connections, supporting the development of models that can accurately discern and organize temporal aspects.Key targets involve discerning whether certain phenomena preceded, followed, or co-occurred based on the documentation.The precise annotations of the dataset and the focus on this significant dimension continue to enable valuable progress on a fundamental but complex clinical modeling task [265].

7.
The I2B2 2014 Deidentification & Heart Disease dataset: This specialized collection addresses two primary objectives: privacy protection through de-identification and the extraction of heart disease information [266].For de-identification, the task involves removing personally identifiable information (PII) from medical records while preserving anonymity.The datasets are annotated to highlight sensitive PII to remove.Researchers utilize this resource to develop and assess de-identification systems, ensuring ethical use of data for research by maintaining privacy [267] regarding heart disease extraction, annotations tag mentions, diagnostic details, treatments, and related clinical aspects within notes.Leveraging these guidelines supports building models accurately deriving insights into cardiovascular presentations and management [268].

8.
The I2B2 2018 dataset is divided into two pivotal tracks: Track 1, focusing on clinical trial cohort selection [269], and Track 2, centered around adverse drug events and medication [270].Track 1 aims to accurately identify eligible patients for clinical trials using EMR data.The annotations mark records matching various eligibility criteria.
Researchers use this resource to develop and evaluate ML models that can efficiently pinpoint suitable trial candidates, expediting the enrollment process.Regarding Track 2, the goal here is to detect and classify mentions of adverse drug reactions and medication information in notes.The annotations highlight such instances in clinical narratives.Researchers apply this dataset to advance NLP techniques specifically for precisely extracting and categorizing adverse events and medication details.This contributes to pharmacovigilance and patient safety.

Medical QA Datasets
In this section, we present an overview of different medical datasets dedicated to the QA task.Table 5 summarizes some of the key features of several of these datasets which, without claiming to be exhaustive, cover the most frequently studied clinical domains.MEDMCQA dataset contains over 194,000 authentic medical entrance exam multiplechoice questions designed to rigorously evaluate language understanding and reasoning abilities [161].Sourced from the All India Institute of Medical Sciences (AIIMS) and National Eligibility cum Entrance Test (NEET) PG exams in India, the questions cover 2400 diverse healthcare topics across 21 medical subjects, with an average complexity mirrored in clinical practice.Each sample presents a question stem alongside the correct answer and plausible distractors, requiring models to comprehend language at a depth beyond simple retrieval.Correct selection demonstrates an understanding of semantics, concepts, and logical reasoning across topics.By spanning more than 10 specific reasoning skills, MEDMCQA comprehensively tests a model's language and thought processes in a manner analogous to how human exam takers must demonstrate their clinical knowledge and judgment in order to gain entrance to postgraduate medical programs in India.The large scale, intricacy, and realism of these authentic medical testing scenarios establish MEDMCQA as a leading benchmark for developing assistive technologies with human-level reading comprehension, critical thinking, and decision-making skills.The dataset poses a major challenge that pushes the boundaries of language model abilities.
As the certification exam required to practice medicine in the United States, the USMLE assesses candidates on an extraordinarily broad and in-depth range of clinical skills.As such, it sets the bar for the level of medical acumen expected of MDs within the U.S. healthcare system.Using authentic samples from this rigorous assessment, MedQA USMILE offers insight into the comprehensive knowledge requirements and analytical abilities assessed.It reflects the extremely high standards of medical training and licensure in the USA.Although it is a preliminary size at present, the use of actual USMLE content gives the dataset validity as a test for medical QA systems looking for capabilities equivalent to those of junior doctors.These capabilities are essential for applications designed to help medical students in the USA.Thanks to its authentic test format and its link to the standards applicable to U.S. doctors, MedQA USMILE provides an important first benchmark, focused solely on the requirements of the U.S. system, and paves the way for more powerful aids for clinical training in the country.

4.
The MQP (Medical Question Pairs) dataset was compiled manually by MDs to provide examples of medical question pairs, whether similar or not [271].From a random sample of 1524 authentic patient questions, MDs performed two labeling tasks.First, for each original question, they composed a reworded version retaining an equivalent underlying intention in order to generate a "similar pair".At the same time, using overlapping terminology, they devised a "dissimilar pair" on a related but ultimately inapparent topic.This double-matching process produced, for each initial question, both a semantically coherent reworking and a variant that was superficially related but whose answer was not congruent.Only similar pairs are then used in MQP.By asking doctors to rewrite queries in different styles while retaining consistent meaning, the selected similar question instances give the models the ability to discern medical semantic substance beyond superficial correspondence.Their inclusion enables more rigorous evaluation and enhancement of a system's comprehension capabilities at a deeper level.

5.
The CLINIQPARA dataset comprises a compendium of patient queries crafted with paraphrasing to progress medical QA from Electronic Health Records [272].It presents 10,578 uniquely reworded questions sorted into 946 semantically distinct clusters anchored to shared clinical intent.Initially harvested from EMRs, the questions were aggregated to cultivate systems that generalize beyond surface forms to grasp the crux of related inquiries couched in diverse linguistic guises.By merging paraphrased variations tied to an identical underlying semantic substance, the inventory arm models have the prowess to discern core significance notwithstanding syntax.Equipped via CLINIQPARA, AI can learn how a single medical consultation may morph in phrasing while retaining essential import, empowering robust interrogation of noisily documented EMR contents.Its expansive scale and clustering by meaning establish it as paramount for advancing healthcare AI's acuity in apprehending intent beneath variances in how patients may pose an identical essential query.

6.
The VQA-RAD dataset comprises 3515 manually crafted question-response pairs pertaining to 315 distinctive radiological images [273].On average, approximately 11 inquiries are posed regarding each individual examination.The questions encapsulate a wide spectrum of clinical constructs and findings that a diagnosing radiologist may want to interrogate or validate within an imaging study.Example themes incorporate anatomical components, anomalies, measurements, diagnoses, and more.By aggregating thousands of question-answer annotations across hundreds of studies, VQA-RAD amasses a sizeable compendium to nurture visual QA (VQA) models.Its scale and granularity lend the resource considerable value to cultivating AI capable of helping radiologists unravel the semantics within medical visuals through an interactive query interface.7.
PATHVQA is a seminal resource, representing the first VQA dataset curated specifically for pathology [274].It contains 32,799 manually formulated questions broadly covering clinical and morphological issues pertinent to the specialty.These questions correlate to 4998 unique digitized tissue slides, averaging approximately 6-7 inquiries per image.PATHVQA aims to mirror the analytical and diagnostic cognition of pathologists when interpreting whole slide images.8.
PUBMEDQA is an essential resource for advancing evidence-based QA in the biomedical field [275].Its objective is to evaluate the ability of models to extract answers from PUBMED abstracts for clinical questions expressed in a yes/no/medium format.The dataset includes 1000 carefully selected questions, associated with expert annotations, which define the correct answer inferred from the literature for questions such as "Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?".In addition, 61,200 real-life but unlabeled questions enable semi-supervised techniques to develop learning.Over 211,000 artificially generated question-answer summaries are incorporated to deepen the learning pool.9.
The VQA-MED-2018 dataset was the first of its kind created specifically for visual QA (VQA) using medical images [276].It was introduced as part of the ImageCLEF 2018 challenge to allow the testing of VQA models in this new healthcare domain.An automatic rule-driven system first extracted captions from images, simplifying sentences and pinpointing response phrases to seed question formulation and candidate ranking.However, purely algorithmic derivation risks semantic inconsistencies or clinical irrelevance.Two annotators with medical experience thoroughly cross-checked each query and response twice.One round ensured semantic logic, while another judged clinical pertinence to associated visuals.This two-step validation by experienced clinicians guaranteed queries and answers not just made sense logically but genuinely applied to practice.As the pioneering dataset for medical VQA, it thus permits exploring how AI can interpret images and language in healthcare contexts.Thanks to its meticulously verified question-response sets grounded in actual medical images and language, VQA-MED-2018 laid the groundwork for advancing and benchmarking systems to help practitioners through image-based insights.10.The VQA-MED-2019 dataset represented the second iteration of questions and answers related to medical images as part of the ImageCLEF 2019 challenge [277].Its design aimed to further the investigation started by its predecessor.Drawing inspiration from the question patterns observed in radiology contexts within VQA-RAD, it concentrated on the top four classes of inquiries typically encountered: imaging modalities, anatomical orientations, affected organs, and abnormalities.Some question types, like modality or plane, allowed for categorized responses and could thus be modeled as classification tasks.However, specifying abnormalities required generative models since possibilities were not fixed.This evaluation of different AI problem types paralleled the diverse thinking involved in analyzing scans.Questions also emulated natural consultations.By targeting the most common radiology question categories and building on lessons from previous work, VQA-MED-2019 enhanced the foundation for assessing and advancing systems meant to expedite diagnostic comprehension through combined vision and language processing.11.The VQA-MED-2020 dataset represented the third iteration of this influential initiative, presented as part of ImageCLEF 2020 to advance the answer to medical visual questions [278].The core corpus consisted of diagnostically relevant images, whose diagnosis was derived directly from visual elements.The questions focused on abnormalities, with 330 frequent conditions selected to ensure minimum prevalence in systematically organized search patterns.This guided combination of visual and linguistic elements established the characteristic VQA assessment framework.In addition, VQA-MED-2020 enriched the landscape by introducing visual question generation, eliciting queries endogenously from radiographic content.Over 1001 associated images provided a basis for 2400 carefully constructed questions, first algorithmically designed and then manually refined.These interdependent tasks aimed to foster more nuanced understanding through contextual synergies between vision and language.
By developing both explanatory and generative faculties, VQA-MED-2020 helped characterize key imaging subtleties while cultivating technologies better prepared to participate in clinical workflows.Its balanced approach to both established and emerging problem spaces iteratively honed assessment of progress towards diagnostically adroit interrogative and descriptive proficiencies-skills paramount to intuitively aiding radiographic decision-making.12.The RADVISDIAL dataset has led the way in the emerging field of visual dialog in medical imaging by introducing the first radiology dialog collection for iterative QA modeling [279].Drawing on annotated cases from the MIMIC-CXR database accompanied by comprehensive reports, the dataset offered both simulated and authentic consultations.A silver set algorithmically generated multi-round interactions from plain text, while a gold standard captured 100 image-based discussions between expert radiologists under rigorous guidelines.Beyond simple question-answer pairs, this sequential format mimicked the richness of clinical dialogue, posing a more ecologically valid test of the systems' contextual capabilities.The comparisons also revealed the importance of historical context for accuracy-an insight with implications for the development of conversational AI assistants.By establishing visual dialogue as a framework for evaluating the progress of models over a succession of queries, RADVISDIAL laid the foundation for the development of technologies that facilitate nuanced diagnostic reasoning through combined visual and linguistic questions and answers.Its dual-level design also provided a benchmark of synthetic and real-world performance as the field continues to advance.RADVISDIAL therefore opened the way to promising work on sequential and contextual modeling for visual dialogue tasks, such as doctor-patient or radiologist discussions.

Foundation Models for Healthcare
Foundation models (FMs) marked a revolution in the world of AI.They are machine learning structures formed from large, often unlabeled datasets [66].They offer a diverse range of applications, extending from text generation to video editing [280] and including such complex fields as protein folding [281] and robotics [282].
One of the leading models to emerge in this area is ChatGPT from OpenAI.Initially conceived as a linguistic model for predicting the next word in a sequence, it has surprised the scientific community with its multifunctional skills.In particular, its applications in the medical field are remarkable, ranging from assisting doctors with licensing examinations to simplifying radiology reports [132,283].
Rapid progress in this field has attracted the attention of healthcare professionals.However, it is difficult to assess the potential benefits and drawbacks of these advancements in a clinical context.To address this, a recent study of 80 distinct clinical FMs developed from EMR data was undertaken with the aim of understanding and addressing these challenges.

Concept of FM
Foundation models are based on three key concepts.The first is pre-training, which involves training the model on large datasets.For example, a corpus consists of anonymized patient records, scientific articles, and clinical trial reports.The transformation model is trained on this corpus to acquire a general understanding of the syntax and semantics of a medical language.Its level of understanding is assessed on language comprehension tasks, such as named entity recognition.
Once pre-trained, the model can be fine-tuned to domain-or task-specific data.This is known as the fine-tuning stage.Let us take the example of a model pre-trained on general medical data.It could be fine-tuned based on the analysis of symptoms reported by patients to suggest possible diagnoses.This fine-tuning would be carried out using a dataset corresponding to the symptoms and diagnoses of anonymous patients.
The concept of learning transfer is also essential.It enables the knowledge already acquired by the model to be used to adapt it to related tasks.For example, the patient symptom analysis model could be retrained on data linking symptoms, diagnoses, and effective treatments.The aim would be to automatically associate the right treatments with the proposed diagnoses.

Clinical FMs
EMR data can be utilized to construct foundation models, which primarily fall into two categories: Clinical Language Models (CLaMs) and Models for EMR (FEMRs).

Clinical Language Models (CLaMs)
CLaMs are a subcategory of large-scale linguistic models specialized in clinical and biomedical texts [284].They are mainly trained to process and generate this type of medical content, such as extracting drug names from notes or summarizing clinical dialogues [99].
By training them on large quantities of internet text, general-purpose LLMs such as ChatGPT, Bloom, and GPT-4 may offer some utility in clinical settings.However, when it comes to specialized tasks, CLaMs generally demonstrate superior performance.

Foundation Models for EMR (FEMRs)
Another important category of clinical FMs is EMR record integration models, called FMs for Electronic Medical Records (FEMRs) [284].Unlike the language models previously described, FEMRs can process the structured set of events and clinical data contained in a patient's medical record.
When trained on EMR data, FEMRs generate a vector representation densely encapsulating information relating to the care pathway of each patient.This representation in the form of a high-dimensional vector, often termed "patient embedding", has the property of compactly encapsulating all the information relating to the patient and can then serve as input to various predictive models dedicated to tasks such as anticipating the risk of hospital readmission.
Remarkably, the performances of these secondary predictive models are often superior to those of traditional learning models that do not benefit from the overview provided by the patient embedding.By holistically integrating the multiple dimensions of the medical record, these FEMRs therefore seem to capture the overall health status of patients in a richer way than traditional approaches.Their ability to synthesize the complete clinical history into a dense representation opens promising perspectives for the development of new predictive clinical applications.

Discussion
LLMs, such as GPT-4 or LLaMA, have demonstrated significant potential in advancing NLP, and their application in medical diagnosis promises to transform this field.
One of the main strengths of LLMs is their ability to quickly analyze clinical data.A doctor could, for example, provide a model like GPT-4 with a complex set of symptoms and, in return, receive suggestions for possible differential diagnoses.This is even more important when considering that these models can access large, up-to-date medical databases, providing valuable information, especially for rare diseases or complex cases.
The potential of LLMs does not stop at large medical institutions.In remote or underserved areas where access to a specialist may be difficult, an LLM-based tool could serve as a first line of diagnostic advice.Of course, this could never replace real medical advice, but it could effectively guide initial care.
Furthermore, the information burden is a constant challenge in medicine.MDs are inundated with data, and LLMs could play a critical role in filtering and prioritizing this information.At the same time, medical students could benefit from these models as interactive learning tools, allowing them to ask questions and receive detailed answers about various medical conditions.
What is also interesting is the ability of LLMs to process natural language.Patients can simply describe their symptoms, and models like BERT could translate those verbal descriptions into precise medical terms.Going further, by combining LLMs with computer vision technologies, we could envision these models in helping interpret medical images.
However, it is essential to remember that although LLMs offer impressive benefits, they should not be used as the final word in diagnosis.They can serve as valuable assistants, but human clinical judgment is important.Additionally, these models would require extensive validation before widespread adoption to ensure their accuracy in a clinical setting.
Although they offer advantages, the use of LLMs in medical diagnosis also presents serious challenges.First, reliability is a major concern.Although an LLM like GPT-4 can provide insights based on a vast amount of data, it is not immune to errors.For example, a model could misinterpret symptoms described by a patient and suggest an incorrect diagnosis, with potentially serious consequences for the patient's health.Additionally, LLMs' ability to generalize from the data they were trained on can sometimes lead them to make mistakes in situations they did not explicitly "see" during their training.
Then, there is the challenge of overlinking.Although LLMs can be useful as support tools, there is a risk that healthcare professionals will begin to rely too heavily on them to the detriment of their clinical judgment.For example, if an LLM suggests a particular diagnosis based on the information provided, an MD might be tempted to follow that suggestion without questioning it, even if their clinical expertise suggests otherwise.
The issue of data privacy is also crucial.LLMs require huge amounts of data to train.If these data come from actual medical records, this could pose privacy concerns, even if the data are anonymized.It is essential to ensure that sensitive patient information remains protected.
Another challenge is that of transparency and interpretability.Decisions made by LLMs can often appear to be a "black box", meaning that doctors and patients may struggle to understand how a particular diagnosis was suggested.Without a clear explanation, it can be difficult to trust the model's recommendations.
Finally, medical ethics are another area of concern.If an incorrect diagnosis is made based on an LLM's suggestions, who is responsible?The doctor, the hospital, or the creator of the model?These questions require careful consideration before the widespread adoption of LLMs in medical diagnosis.
However, although LLMs have the potential to transform medical diagnosis, it is crucial to approach these challenges with caution and diligence to ensure patient safety and well-being.

Techniques for Mitigating Ethical Risks
The ethical analysis of LLM models in healthcare is crucial, given their potential impact on various aspects of healthcare.To attenuate the ethical risks associated with these models, several techniques can be employed.
In this field, differentially private learning is an important technique for protecting the privacy of the data used in LLM training [285,286].It ensures that adding or deleting a single data point in the training dataset does not significantly affect the model's output.This reduces the risk of the model revealing sensitive information about the patients from whom the data were collected.A key challenge of differentially private learning is to find a balance between privacy and model accuracy.As the level of confidentiality (i.e., the quantity of noise added) increases, the accuracy of the model tends to decrease.
Improving the interpretability of LLMs in healthcare is fundamental to enabling professionals to understand the decision-making process [287,288].In a sector where decisions have far-reaching consequences, such as healthcare, it is essential to understand the basis of a model's recommendations or predictions.This contributes to the confidence and adoption of these technologies by healthcare professionals.Interpretability techniques aim to explain why and how a model has reached a specific conclusion [289].This can be done by identifying the characteristics or data that were decisive in the model's decision.
For example, in the case of a diagnosis, the model might indicate which symptoms or test results were most influential in its conclusion.Visual tools such as heat maps make explanations accessible.Improved interpretability contributes to the transparency and reliability of recommendations, encouraging their informed integration without substituting for human judgment [290].It complements expertise for enriched decision-making.Ultimately, this approach ensures that they not only perform well, but are also understandable and trustworthy by professionals, enabling safer use of AI in healthcare.
LLMs in the healthcare field can involuntarily learn from and propagate biases in training data.To overcome this risk, it is crucial to make regular audits to detect and rectify these biases [291].These audits should include performance analyses on a variety of datasets, and focus on identifying any results that may be unfair or discriminatory.
To ensure that these audits are exhaustive and unbiased, they should be carried out by teams made up of members with diverse profiles and from a variety of disciplines.These teams should include not only medical experts, who have an in-depth understanding of the healthcare field but also researchers specializing in ethics, who can provide a critical perspective on the moral implications of model results.This diversity ensures that different points of view and expertise are taken into account when evaluating models.Such an interdisciplinary approach is essential to ensure that LLMs in healthcare are not only technically competent but also fair and unbiased in their applications.This helps to build confidence in the use of AI in medicine and to ensure that the benefits of these technologies accrue to all individuals in an equitable manner.
Fair learning techniques play a vital role in ensuring that LLM models are fair, unbiased, and representative in healthcare.To achieve this objective, several key strategies can be put in place [292].First, it is important to perform a preliminary assessment of the data to detect any potential bias and assess the representativity of demographic groups, medical conditions and socio-economic contexts.Next, datasets should be diversified by integrating information from diverse sources and populations, ensuring that data are collected from underrepresented or marginalized groups to ensure balanced representation.Data rebalancing techniques, such as oversampling minorities or undersampling majority groups, can also be used to balance datasets [293,294].Regularization models can be integrated into the learning process to minimize bias, for example by penalizing predictions that reinforce stereotypes or inequalities [295,296].Multi-task learning, which involves training models on multiple tasks simultaneously, can improve their ability to generalize and reduce single-task-specific biases [297][298][299].It is also essential to make rigorous tests on various samples in order to evaluate performance and detect biases in the model's predictions.Finally, it is important to train users on the strengths and limitations of LLMs, emphasizing the importance of their clinical expertise in interpreting the results.Interdisciplinary collaboration, involving ethics experts, health professionals, lawyers, and representatives of diverse communities, can also contribute to the balanced design and evaluation of models in health care.

Challenges and Future Perspectives
The first major challenge lies in the rigorous management of potential biases contained in the vast medical databases used for training AI algorithms.These data may present biases linked to the gender, age, or ethnic origin of patients; therefore, it is essential to collect information that is diverse and representative of the entire population.Thorough controls must also be put in place to detect any abuse that could unfairly impact certain groups.
The challenge of transparency and interpretability in diagnostic support systems is also crucial.Unlike traditional medical diagnoses requiring a detailed understanding of the reasons leading to conclusions, many current AI algorithms still operate as "black boxes", without the ability to clearly explain their reasoning.However, traceability and complete explainability of the analyses produced are essential in medicine.Significant progress remains to be made in explanatory AI.
The security and safety of AI systems and the highly sensitive medical data they analyze and store are also critical issues.The risks of leaks or malicious attacks aimed at compromising the confidentiality or integrity of this highly private information must be anticipated by deploying solid technical protection measures.
Robust validation of algorithm performance on large and varied populations, as well as their ability to maintain a high level of reliability over time, constitute other important challenges.It will also be necessary to adequately train health professionals in the relevant and safe use of these decision-making tools.
While medical AI shows strong potential, its responsible development requires sustained efforts on these different levels to ensure the trust of patients and practitioners in these technologies.

Conclusions
In this paper, we have provided an analysis of the potential impacts of major LLMs on the healthcare sector.We started by introducing LLM architectures like ChatGPT, Bloom, and LLaMA, which are composed of billions of parameters and have demonstrated impressive capabilities in language understanding and generation.These models have the potential to process and generate natural language, which can be used to improve medical diagnosis and patient care as well as accelerate medical research.
We then reviewed recent trends in medical datasets used to train such models, classifying them according to different criteria such as their size, their source, or their subject (patient files, scientific articles, etc.).Using these datasets can help train large language models to better understand the nuances and complexities of clinical language, which can be used to improve decision-making, diagnostic support, or the personalization of treatments.
However, we also highlighted the challenges related to the practical use of large linguistic models in the medical field.One of the main concerns is the ethical implications of using these models, as training them requires large amounts of sensitive medical data.There are concerns about data privacy and security, as well as the potential for these models to perpetuate bias and inaccuracies in medical data, which can have serious consequences for patient care.
Despite these challenges, significant progress is being made in the area of LLMs.With the deployment of LLMs trained on a very large scale with public resources, these models could revolutionize the way we approach patient care and medical research.However, it is important to carefully consider the ethical implications of using these models and work to address these concerns to ensure they are used responsibly and effectively.

Figure 1 .
Figure 1.Structural diagram presenting the topics covered in our paper.

Figure 4 .
Figure 4. Differences between classic fine-tuning and prompt tuning.

Table 2 .
Evolution of GPT versions in the healthcare sector.

Table 3 .
Summary of general medical datasets.

Table 4 .
Summary of de-identification datasets.

Table 5 .
Summary of medical QA datasets.MEDQUAD dataset consolidates authoritative medical knowledge from across the NIH into a substantial question-answer resource.It contains 47,457 pairs compiled from 12 respected NIH websites focused on health topics (e.g., cancer.gov,niddk.nih.gov,GARD, and MEDLINEPlus Health Topics).These sources include the National Cancer Institute, the National Institute of Diabetes and Digestive and Kidney Diseases, and the Genetic and Rare Diseases Information Center databases.