GPT-Driven Radiology Report Generation with Fine-Tuned Llama 3
Abstract
1. Introduction
2. Materials and Methods
2.1. Dataset
Dataset Preparation
2.2. Method
2.2.1. Transformer Architecture
Multi-Head Self-Attention Mechanism
Position-Wise Feed-Forward Network and Positional Encoding
Key Advantages of the Transformer
2.2.2. Parameter Quantization
2.2.3. LoRA
2.2.4. Llama 3
2.2.5. Evaluation
Traditional Metrics
Human Evaluation
3. Results
4. Discussion
4.1. Comparative Analysis
4.2. Limitations
4.2.1. Dataset Limitations
- Bias in Data. The dataset used for fine-tuning the AI model was derived from reports written by a limited number of radiologists at a single institution. This may have introduced bias, because the language, style, and diagnostic approaches in these reports may not represent the broader radiology community. Such bias could reduce the model’s generalizability to reports written by radiologists from other institutions or with different levels of experience.
- Sample Size. Although the dataset included a substantial number of reports, it may still be insufficient to capture the full diversity of radiological findings and reporting styles. This is particularly relevant for rare pathologies or atypical cases, which may not be well represented in the training data. As a result, the model might underperform in these scenarios.
4.2.2. Evaluation Limitations
- Subjective Evaluation. The Turing-like quiz and rating forms relied on subjective evaluations made by radiologists. Although efforts were made to ensure an unbiased and comprehensive assessment, individual preferences and interpretations could have influenced the results. Different radiologists may have different thresholds for what they consider a “perfect” conclusion, which introduces variability into the ratings and comparisons.
- Limited Scope of Evaluation. The evaluation focused solely on the quality of the AI-generated conclusions. Other critical sections of radiology reports, such as findings, impressions, and recommendations, were not assessed in this study. Therefore, the model’s capability to generate complete and clinically useful radiology reports remains partially unexplored.
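The rater variability described above can be quantified. A minimal sketch follows, assuming two radiologists rate the same conclusions on a 1 to 5 scale and that chance-corrected agreement is measured with Cohen's kappa. The ratings, the scale, and the choice of kappa are illustrative assumptions; this section does not state that the study used this statistic.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty item set")
    n = len(ratings_a)
    # Fraction of items on which the two raters gave the same score.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's score frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[k] * count_b[k] for k in labels) / n**2
    if expected == 1.0:  # both raters always used one identical score
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of eight AI-generated conclusions (not study data).
rater_1 = [5, 4, 4, 3, 5, 2, 4, 5]
rater_2 = [5, 4, 3, 3, 4, 2, 4, 4]
kappa = cohen_kappa(rater_1, rater_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

A kappa near 1 indicates that the raters apply similar thresholds, while a value near 0 indicates agreement no better than chance. For ordinal scales such as this one, a weighted kappa that penalizes near-misses less than large disagreements may be more appropriate.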
4.2.3. Model Limitations
- Lack of Clinical Judgment. Despite its high performance, the AI model lacks the clinical judgment and context awareness that human radiologists possess. Radiologists integrate patient history, prior imaging studies, and physical examination findings into their diagnostic processes. The AI model, trained solely on text reports, cannot access or interpret this broader clinical context, which may lead to less-informed conclusions.
- Handling of Ambiguous Cases. Radiologists often encounter ambiguous or borderline cases that require nuanced interpretation and decision-making. The analyzed AI model may struggle with such cases because it relies on patterns learned from its training data. This could result in either overly cautious or overly confident conclusions, neither of which is ideal in clinical settings.
- Ethical and Legal Considerations. The use of AI in medical diagnosis raises ethical and legal concerns, particularly regarding accountability and patient consent. If an AI-generated conclusion results in a misdiagnosis or adverse outcome, the question of legal responsibility remains unclear. Additionally, patients may be apprehensive about AI involvement in their care, underscoring the need for transparency and informed consent.
- Resource Constraints. While quantization and low-rank adaptation techniques notably reduced the model’s memory footprint, training and deploying large language models still require substantial computational resources. This could limit the accessibility of such models in resource-constrained settings or smaller medical facilities.
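The memory savings mentioned under Resource Constraints can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes a model of about 8 billion parameters with 4096 x 4096 projection matrices, 16-bit versus 4-bit weight storage, and a LoRA rank of 16. All of these values are illustrative assumptions, not settings reported in this section. It counts weight storage only and ignores activations, optimizer state, and quantization metadata.

```python
def weight_memory_gib(n_params: int, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in a low-rank update B (d_out x rank) @ A (rank x d_in)."""
    return rank * (d_out + d_in)


# Assumed values for illustration only.
N_PARAMS = 8_000_000_000   # roughly an 8B-parameter model
HIDDEN = 4096              # assumed projection matrix size
RANK = 16                  # assumed LoRA rank

fp16_gib = weight_memory_gib(N_PARAMS, 16)
int4_gib = weight_memory_gib(N_PARAMS, 4)
full_matrix = HIDDEN * HIDDEN
lora_matrix = lora_trainable_params(HIDDEN, HIDDEN, RANK)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:  {int4_gib:.1f} GiB")
print(f"Trainable parameters per matrix: {full_matrix} full vs {lora_matrix} with LoRA")
```

Even under these assumptions, the 4-bit weights alone occupy several GiB, which is consistent with the point that deployment can remain out of reach for facilities without a suitable GPU.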
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Anatomical Region | Number of Reports |
---|---|
Thorax | 6909 |
Abdomen | 8416 |
Pelvis | 7695 |
Skull | 3989 |
Spine | 373 |
Neck | 686 |
Breasts | 325 |
Pituitary gland | 258 |
Prostate | 396 |
Knee | 1329 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Voinea, Ș.-V.; Mămuleanu, M.; Teică, R.V.; Florescu, L.M.; Selișteanu, D.; Gheonea, I.A. GPT-Driven Radiology Report Generation with Fine-Tuned Llama 3. Bioengineering 2024, 11, 1043. https://doi.org/10.3390/bioengineering11101043