Automatic Speech Recognition in Healthcare in the Post-LLM Era: A Scoping Review
Abstract
1. Introduction
- RQ1 (Applications): In which healthcare application contexts and settings have LLM-based ASR systems been applied and evaluated? Purpose: To map the landscape of healthcare applications and identify underexplored application contexts and settings.
- RQ2 (Technical Architectures): What ASR and language models (including LLMs and PLMs), training datasets, and model adaptation techniques have been used in the target healthcare applications? Purpose: To characterize the technical diversity and identify prevailing methodological approaches.
- RQ3 (Evaluation Methods): What evaluation environments and methods, including performance metrics, are employed to assess LLM-based ASR systems in healthcare contexts? Purpose: To assess current evaluation standards and identify gaps in quality assessment practices.
- RQ4 (Reported Insights and Challenges): What benefits, implementation challenges, and ethical considerations have been reported in the included studies? Purpose: To synthesize reported findings that inform deployment strategies and guide future research directions for LLM-based ASR systems intended for healthcare applications.
2. Background and Related Reviews
2.1. Automatic Speech Recognition: Foundations and Evolution
2.2. ASR Applications in Healthcare
2.3. Large Language Models and the Convergence with ASR
2.4. Related Reviews and Research Gaps
3. Review Methodology
3.1. Eligibility Criteria
- Published between 1 January 2022 and 31 December 2025.
- Original research articles (empirical, experimental, or solution-based studies).
- Open-access full text available in English.
- Focus on ASR in healthcare or health-related contexts (e.g., clinical documentation, diagnosis, therapy, patient communication, medical education, accessibility, administration).
- Investigation of integrated ASR-LLM pipelines for downstream clinical tasks.
3.2. Information Sources and Search Strategy
3.3. Selection of Sources of Evidence
3.4. Data Charting Process
- Bibliographic Metadata: Author(s), country, publication year, source, and study type.
- Study Context: Research motivation, objectives, application context, clinical setting, target population, and supported languages.
- Technical Architecture: LLMs employed, ASR system specifications, dataset characteristics, and adaptation techniques.
- Evaluation and Validation: Methodologies, performance metrics, benchmark datasets, and human-in-the-loop requirements.
- Clinical Outcomes: Time savings, documentation quality, and user satisfaction.
- Ethics and Implementation: Privacy, bias considerations, and regulatory compliance.
- Reproducibility: Availability of code, models, and datasets.
| Extraction Domain | Item to Extract | Corresponding RQ(s) |
|---|---|---|
| Bibliographic Metadata | Author(s), first author’s country, publication year, source, paper type | N/A |
| Study Context | Application context (e.g., diagnosis, admin) | RQ1 |
| | Setting (e.g., hospital, telehealth) | RQ1 |
| | Target population (e.g., clinicians, patients) | RQ1 |
| | Language(s) supported | RQ1 |
| | Paper motivation (intended health problem) | RQ1 |
| Technology and Methods | ASR model(s) used (e.g., Whisper, Google STT) | RQ2 |
| | LLM(s) used (e.g., GPT-4, LLaMA) | RQ2 |
| | Dataset(s) used (nature, size, availability, modality) | RQ2 |
| | Adaptation technique (e.g., fine-tuning, prompting) | RQ2 |
| | System features (input/output; open-source/proprietary) | RQ2 |
| Evaluation and Validation | Validation methods and metrics (e.g., WER, accuracy) | RQ3 |
| | External evaluation methods (e.g., user studies, clinical simulation) | RQ3 |
| | User involvement in testing | RQ3 |
| | Human-in-the-loop required | RQ3 |
| Outcomes Beyond Accuracy | Clinical utility (time saved, satisfaction, workload) | RQ3, RQ4 |
| Ethics and Implementation | Privacy and data governance measures | RQ4 |
| | Equity considerations (accents, low-resource languages) | RQ4 |
| | Adoption factors (integration, cost, barriers) | RQ4 |
| Replication | Availability of replication package (code, data, models) | RQ4 |
4. Results
4.1. Study Characteristics
4.2. RQ1: Clinical Applications and Healthcare Domains

4.3. RQ2: Technical Architectures

4.4. RQ3: Evaluation Methods

4.5. RQ4: Reported Insights and Challenges

5. Discussion
5.1. Insights and Practical Implications
- LLMs serve as the primary component in most pipelines, not ASR. In 13 of the 19 included studies, the LLM held the primary role while ASR served as a secondary transcription input. This distribution suggests that the research community increasingly views these systems as clinical intelligence pipelines in which speech recognition is an input mechanism rather than the core contribution. Practitioners designing such systems should allocate evaluation and adaptation effort accordingly, with particular attention to the LLM component’s capacity for clinical reasoning, summarization, and structured output generation.
- Off-the-shelf deployment is common but may not be sufficient for all contexts. Fifteen of the 19 studies used ASR models without any domain-specific adaptation, and 13 used LLMs in their base configuration with prompting alone. While this low-barrier approach enables rapid prototyping, the four studies that did fine-tune their ASR components did so because general-purpose models proved inadequate for underserved languages [51], atypical speech patterns [59], noisy clinical environments [54], or specialized medical terminology [47]. Organizations considering deployment should evaluate off-the-shelf model performance on their specific clinical populations and acoustic conditions before relying on frozen configurations.
- LLM applications extend well beyond clinical documentation. Although administrative documentation was the most common application context (n = 7), LLMs also performed ASR error correction [46,59], clinical screening and classification [55,56], therapeutic dialogue generation [42,58], synthetic training data generation [42], regulatory compliance checking [51], and semantic speech reconstruction [50]. This breadth suggests that the integration of LLMs with ASR has potential across a wider range of healthcare workflows than documentation alone, though each application introduces distinct evaluation and safety requirements. This pattern extends to subscription-based literature; for instance, VoxRad demonstrated template-guided ASR-LLM radiology reporting with HIPAA-compliant local hosting [62], and stroke rehabilitation models achieved BLEU scores of 30.2 in multimodal audio-video speech tasks [63].
- Early evidence suggests multimodal LLMs may offer a pathway to unify the ASR-LLM pipeline. GPT-4o was used as both the ASR and LLM component in two studies [46,49], bypassing the traditional two-stage architecture. This approach simplifies deployment and reduces integration complexity, but it also concentrates dependency on a single proprietary model, raising concerns about version stability, auditability, and regulatory compliance that merit further investigation.
- On-device processing is emerging as a privacy-preserving deployment strategy. Four studies deployed models locally to avoid transmitting patient audio to external servers [47,53,55,56]. This approach is relevant in jurisdictions with strict data sovereignty requirements and may help address clinician concerns about patient data being processed by third-party cloud services. However, on-device deployment imposes constraints on model size and computational resources, creating a trade-off between privacy and capability that warrants further study. Recent work corroborates the feasibility of local deployment; VoxRad achieved high accuracy in radiology reporting while maintaining HIPAA compliance through locally hosted ASR and LLM processing [62].
- Reproducibility remains limited. Only five of the 19 studies provided a replication package with publicly available code or models. The reliance on closed-source commercial APIs (nine of 19 studies) introduces a critical methodological vulnerability for clinical translation: the complete loss of version control. Because proprietary models are routinely and silently updated by their vendors, a clinical pipeline rigorously validated on a specific API version today may exhibit entirely altered behaviors tomorrow—such as shifting hallucination rates, modified clinical reasoning, or different output formatting. This lack of transparency fundamentally breaks traditional clinical change-management protocols, as researchers and health systems cannot lock the model’s state, audit the underlying parameter updates, or guarantee longitudinal verifiability. Addressing this gap requires a concerted shift toward locally hostable, open-source clinical ASR-LLM pipelines and the development of shared healthcare-specific benchmark datasets to ensure safe and reproducible clinical deployment.
- Equity concerns are acknowledged but rarely evaluated systematically. Fifteen of the 19 studies mentioned at least one equity consideration—such as multilingual support, accent sensitivity, or accessibility for speech-impaired populations—yet none of the included studies made systematic bias evaluation or fairness testing a primary contribution. Moving from acknowledgment to rigorous evaluation will require dedicated benchmarks that capture the diversity of clinical speech, including regional accents, code-switching patterns, and atypical speech.
- Human oversight is widely practiced but inconsistently reported. Fifteen of the 19 studies incorporated some form of human-in-the-loop mechanism, ranging from clinician review of generated notes to manual correction of ASR output to patient satisfaction surveys. However, the nature, rigor, and scope of human oversight varied considerably across studies. Standardized reporting of human oversight mechanisms—including who reviewed the output, what criteria were applied, and whether human corrections were fed back into the system—would improve comparability and support the development of clinical governance frameworks for these systems.
- Evaluation practice focuses on technical accuracy rather than clinical validity. The current evaluation landscape, in which aggregate WER is the dominant ASR metric and human ratings are reported without inter-rater reliability, is insufficient for clinical deployment decisions. We propose two minimum requirements for future studies. First, at least one domain-weighted error metric should accompany aggregate WER: clinically critical tokens such as drug names, dosages, diagnoses, and negations carry disproportionate risk and should be tracked separately, as demonstrated by the DWER and N-DWER variants introduced in our sample [46] and the broader proposal of Medical Concept WER [29]. Second, any study that uses human raters to assess LLM output quality must report inter-rater reliability (ICC, weighted Cohen’s κ, or Krippendorff’s α); without it, human evaluation scores cannot be meaningfully compared across studies or used as a basis for clinical adoption decisions. These two requirements are low-cost, well-established in adjacent fields, and their absence across all 19 included studies represents the single most actionable reporting gap identified by this review.
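The two reporting requirements in the last point above can be illustrated with a minimal Python sketch. It is not the DWER, N-DWER, or Medical Concept WER definition used in [46] or [29]: the function `critical_term_error_rate`, the example sentence, and the set of critical terms are all assumptions introduced here for illustration, and the agreement statistic shown is plain (unweighted) Cohen's kappa for two raters rather than the weighted variant.

```python
from typing import List, Optional, Sequence, Set, Tuple

# (operation, reference token, hypothesis token); operation is match/sub/del/ins
Op = Tuple[str, Optional[str], Optional[str]]


def align(ref: List[str], hyp: List[str]) -> List[Op]:
    """Levenshtein alignment of two token lists, returned as edit operations."""
    n, m = len(ref), len(hyp)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:  # trace back from the bottom-right cell
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            kind = "match" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((kind, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def wer(ref: List[str], hyp: List[str]) -> float:
    """Aggregate word error rate: (S + D + I) / number of reference words."""
    if not ref:
        raise ValueError("reference must contain at least one token")
    return sum(op != "match" for op, _, _ in align(ref, hyp)) / len(ref)


def critical_term_error_rate(ref: List[str], hyp: List[str], critical: Set[str]) -> float:
    """Share of critical reference tokens that were substituted or deleted (illustrative metric)."""
    ops = align(ref, hyp)
    total = sum(1 for _, r, _ in ops if r in critical)
    missed = sum(1 for op, r, _ in ops if r in critical and op != "match")
    return missed / total if total else 0.0


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in set(a) | set(b))
    if expected == 1.0:  # both raters used one identical label; kappa is undefined
        return float("nan")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    ref = "patient denies chest pain start metoprolol 25 mg daily".split()
    hyp = "patient reports chest pain start metoprolol 25 mg daily".split()
    critical = {"denies", "metoprolol", "25", "mg"}  # hypothetical critical-term list
    print("WER:", round(wer(ref, hyp), 3))
    print("critical-term error rate:", critical_term_error_rate(ref, hyp, critical))
    print("kappa:", cohen_kappa(["ok", "ok", "err", "err"], ["ok", "ok", "err", "ok"]))
```

In this toy example a single substitution of a negation yields an aggregate WER of about 0.11 but a critical-term error rate of 0.25, which is the kind of gap that aggregate WER alone would hide.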
5.2. Clinical Safety and Regulatory Considerations
5.3. Toward a Standardized Evaluation Framework for Clinical ASR–LLM Systems
5.4. Limitations
- The inclusion criteria required both ASR and LLM components, excluding studies that advanced either technology in isolation for healthcare applications. This design choice was deliberate, as the review specifically targets the intersection of these technologies to understand how they function together in clinical pipelines. Future reviews could examine standalone ASR or standalone LLM healthcare applications to complement the findings presented here.
- The publication window (2022–2025) resulted in a temporally skewed sample, with 17 of 19 studies from 2025, reflecting the recency of the field. This concentration is expected, given that the widespread integration of LLMs with ASR in healthcare began only after the release of models such as ChatGPT and Whisper in late 2022. As more studies emerge, future reviews should revisit these findings to assess whether the patterns identified here—such as the dominance of frozen configurations and the scarcity of empirical evaluations—persist or evolve.
- As a scoping review, this study prioritized breadth of landscape mapping over formal quality assessment of individual studies. While this approach aligns with PRISMA-ScR methodology [19,41], it limits our ability to differentiate robust from weaker evidence or to weight findings by study quality. Future systematic reviews addressing specific clinical questions would benefit from formal quality appraisal using validated tools such as MI-CLAIM [64] for clinical AI studies or PROBAST [65] for prediction model assessment, enabling evidence grading and clinical recommendation development.
- The rapid pace of ASR and LLM development means that newer models, techniques, and clinical deployments may have emerged since the search was conducted. This is an inherent challenge in reviewing any fast-moving technology area. To partially address this, the search window was extended through December 2025 to capture the most recent available work. Periodic updates to this review would help track the trajectory of the field.
- Restricting inclusion to open-access, peer-reviewed publications excludes two categories of evidence: subscription-journal studies and gray literature (preprints, technical reports). The subscription-access constraint may under-represent high-quality clinical deployment studies and commercial system evaluations often published in related journals. The gray literature exclusion means recent preprints and technical reports are not captured, potentially missing developments too recent for traditional peer review. These were deliberate methodological choices to ensure reproducibility and peer-review fidelity [19,41]. Gray literature, while valuable for timeliness, lacks the quality assurance mechanisms of peer review that are essential for systematic evidence synthesis intended to inform clinical deployment [64]. Our findings thus reflect the peer-reviewed, open-access literature rather than the complete development landscape. Representative subscription-based studies corroborate key findings on documentation applications [62], multimodal integration [63], and clinical performance benchmarks—a VR-based triage system achieved 95.1% task success with 4.61/5 naturalness ratings [66].
6. Conclusions
- Standardized evaluation frameworks for clinical LLM outputs. The field urgently needs consensus metrics including hallucination and omission rates with explicit denominators, harm-grading rubrics for cascading errors, and mandatory reporting of inter-rater reliability (ICC, Cohen’s κ, or Krippendorff’s α) for human evaluation panels. Without these standards, cross-study comparison remains impossible and clinical adoption lacks an evidence base.
- Shared multilingual clinical speech benchmarks. Standardized datasets with subgroup-stratified test sets and clinically weighted error metrics (e.g., MC-WER, DWER) would expose the performance disparities that aggregate WER conceals. These benchmarks should span multiple languages, accents, and clinical contexts to enable systematic fairness evaluation.
- Systematic bias and fairness testing. The field must move from widespread equity acknowledgment (15 of 19 studies) to empirical evaluation (currently absent). Future work should mandate pre-registered subgroup analyses with statistical comparison across demographic groups, unseen-population hold-outs, and explicit disparity thresholds.
- Open-source pipelines and reproducible research infrastructure. Progress on all fronts depends on commitments to open-source models, public datasets, replication packages, and federated evaluation frameworks that overcome the reproducibility barriers imposed by closed-source APIs (9 of 19 studies) and private clinical-speech corpora.
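As a rough illustration of the subgroup-stratified evaluation called for above, the following Python sketch computes WER per subgroup and checks the largest gap against a disparity threshold. The group labels, sample transcripts, the `max_gap` value, and the function names are hypothetical choices made for this example; a real study would also need the statistical comparison across groups that this sketch omits.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j - 1] + (r != h), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def subgroup_wer(samples: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Pooled WER per subgroup from (group, reference, hypothesis) triples."""
    errors: Dict[str, int] = defaultdict(int)
    words: Dict[str, int] = defaultdict(int)
    for group, reference, hypothesis in samples:
        ref_tokens, hyp_tokens = reference.split(), hypothesis.split()
        errors[group] += edit_distance(ref_tokens, hyp_tokens)
        words[group] += len(ref_tokens)
    # Groups without any reference words are skipped to avoid division by zero.
    return {g: errors[g] / words[g] for g in words if words[g]}


def disparity(rates: Dict[str, float], max_gap: float) -> Tuple[float, bool]:
    """Largest absolute WER gap between subgroups and whether it exceeds max_gap."""
    gap = max(rates.values()) - min(rates.values())
    return gap, gap > max_gap


if __name__ == "__main__":
    samples = [  # hypothetical transcripts, for illustration only
        ("accent_a", "take two tablets daily", "take two tablets daily"),
        ("accent_b", "take two tablets daily", "take to tablet daily"),
    ]
    rates = subgroup_wer(samples)
    gap, flagged = disparity(rates, max_gap=0.1)  # 0.1 is an arbitrary threshold
    print(rates, gap, flagged)
```

Pooling errors and words within each group, rather than averaging per-utterance WER, keeps short utterances from dominating a subgroup's score; either choice should be stated explicitly when results are reported.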
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Evaluation Metric Descriptions
| Metric | Description |
|---|---|
| WER | Word Error Rate. Proportion of words incorrectly inserted, deleted, or substituted relative to a reference transcript. |
| CER | Character Error Rate. Character-level analog of WER; particularly relevant for character-based languages such as Mandarin. |
| DWER | Dental Word Error Rate. Domain-specific variant that weights errors on clinical terminology more heavily. |
| N-DWER | Normalized DWER. Length-normalized version of DWER for cross-transcript comparison. |
| uWER | Unweighted WER. Treats all word types equally without frequency weighting. |
| MER | Match Error Rate. Proportion of matched word pairs containing errors, accounting for alignment. |
| WIL | Word Information Lost. Proportion of word-level information lost between reference and hypothesis. |
| WIP | Word Information Preserved. Complement of WIL; proportion of information correctly retained. |
| SRR | Sentence Recognition Rate. Proportion of sentences transcribed entirely without error. |
| RAR | Recognition Accuracy Rate. Proportion of correctly recognized speech units. |

| Metric | Description |
|---|---|
| BERTScore | Semantic similarity between generated and reference texts using contextual BERT embeddings, capturing meaning beyond surface-level overlap. |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation. N-gram overlap between generated and reference texts (ROUGE-1, ROUGE-2, ROUGE-L). |
| BLEU | Bilingual Evaluation Understudy. Precision of n-gram overlap, originally developed for machine translation. |
| BARTScore | Log-likelihood of generated text given the reference using a pre-trained BART model. |

| Metric | Description |
|---|---|
| Accuracy | Proportion of correctly classified instances out of all instances. |
| Precision | Proportion of true positives among all positive predictions. |
| Recall | Proportion of true positive instances correctly identified out of all actual positives. |
| F1-score | Harmonic mean of precision and recall. |
| AUC | Area Under the ROC Curve. Measures discrimination ability across all classification thresholds. |

| Method | Description |
|---|---|
| Manual quality scoring | Domain experts rate outputs on predefined dimensions (e.g., accuracy, relevance, completeness). |
| Thematic analysis | Systematic identification of themes in user feedback through qualitative coding. |
| Baseline vs. augmented | Performance comparison with and without a specific component (e.g., LLM-augmented data). |
| Ablation study | Systematic removal of model components to quantify individual contributions. |
| Chi-square test | Statistical comparison of observed versus expected frequencies in categorical outcomes. |
| Error categorization | Reviewers identify, classify, and rate the severity of errors in outputs. |

| Metric | Description |
|---|---|
| Latency | Time between speech input and system output; critical for real-time applications. |
| Hallucination Rate | Proportion of generated content unsupported by or contradicting the source input. |
| Platform usage | Adoption metrics tracking active users and frequency of system use in deployment. |
| RAG Triad score | Composite score for RAG quality across context relevance, groundedness, and answer relevance. |
| Conceptual Precision | Proportion of semantically meaningful concepts correctly preserved from input to output. |
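
Several of the ratio metrics tabulated in this appendix reduce to simple count-based formulas. The block below writes them out in their commonly used form, as a reference sketch rather than the exact definitions applied in any included study; the symbols H, S, D, I and TP, FP, FN, TN are notation introduced here. CER follows the WER formula with characters in place of words, and study-specific variants such as DWER, N-DWER, and uWER may weight or normalize these counts differently.

```latex
% H = correctly recognized words (hits), S = substitutions,
% D = deletions, I = insertions, taken from the reference-hypothesis alignment.
% TP, FP, FN, TN = true/false positives and negatives of a binary classifier.
\begin{align*}
N_{\mathrm{ref}} &= H + S + D, \qquad N_{\mathrm{hyp}} = H + S + I \\
\mathrm{WER} &= \frac{S + D + I}{N_{\mathrm{ref}}} \\
\mathrm{MER} &= \frac{S + D + I}{H + S + D + I} \\
\mathrm{WIP} &= \frac{H}{N_{\mathrm{ref}}} \cdot \frac{H}{N_{\mathrm{hyp}}}, \qquad
\mathrm{WIL} = 1 - \mathrm{WIP} \\
\mathrm{Precision} &= \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\end{align*}
```

Because insertions count as errors but do not enlarge the denominator, WER can exceed 1, whereas MER, WIL, and WIP are bounded between 0 and 1.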
Appendix B. Database Search Queries
Appendix B.1. PubMed
Appendix B.2. Scopus
Appendix B.3. IEEE Xplore
Appendix B.4. Web of Science
Appendix B.5. Validation Set
References
- Zhang, J.; Wu, J.; Qiu, Y.; Song, A.; Li, W.; Li, X.; Liu, Y. Intelligent speech technologies for transcription, disease diagnosis, and medical equipment interactive control in smart hospitals: A review. Comput. Biol. Med. 2023, 153, 106517.
- Ng, J.J.W.; Wang, E.; Zhou, X.; Zhou, K.X.; Goh, C.X.L.; Sim, G.Z.N.; Tan, H.K.; Goh, S.S.N.; Ng, Q.X. Evaluating the performance of artificial intelligence-based speech recognition for clinical documentation: A systematic review. BMC Med. Inform. Decis. Mak. 2025, 25, 236.
- Mess, S.A.; Mackey, A.J.; Yarowsky, D.E. Artificial Intelligence Scribe and Large Language Model Technology in Healthcare Documentation: Advantages, Limitations, and Recommendations. Plast. Reconstr. Surg.-Glob. Open 2025, 13, e6450.
- Rabiner, L. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286.
- Hinton, G.; Deng, L.; Yu, D.; Dahl, G.E.; Mohamed, A.R.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.N.; et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Process. Mag. 2012, 29, 82–97.
- Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning; ICML ’06; Association for Computing Machinery: New York, NY, USA, 2006; pp. 369–376.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 12449–12460.
- Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; Mcleavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of the 40th International Conference on Machine Learning; Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J., Eds.; PMLR: New York, NY, USA, 2023; Volume 202, pp. 28492–28518.
- Hopkins, B.S.; Dallas, J.; Yu, J.; Briggs, R.G.; Chung, L.K.; Cote, D.J.; Gomez, D.; Shah, I.; Carmichael, J.D.; Liu, J.C.; et al. The use of generative artificial intelligence–based dictation in a neurosurgical practice: A pilot study. Neurosurg. Focus 2025, 59, E8.
- Seyedi, S.; Griner, E.; Corbin, L.; Jiang, Z.; Roberts, K.; Iacobelli, L.; Milloy, A.; Boazak, M.; Bahrami Rad, A.; Abbasi, A.; et al. Using HIPAA (Health Insurance Portability and Accountability Act)–Compliant Transcription Services for Virtual Psychiatric Interviews: Pilot Comparison Study. JMIR Ment. Health 2023, 10, e48517.
- Pondel-Sycz, K.; Bilski, P.; Bobiński, P.; Morzyński, L.; Lewandowski, M.; Kozłowski, E.; Szczepański, G.; Jasiński, M.; Makarewicz, G.; Pietrzak, A.P.; et al. A comparative study of deep End-to-End Automatic Speech Recognition models for doctor-patient conversations in Polish in a real-life acoustic environment. Int. J. Electron. Telecommun. 2025, 71, 1–8.
- Jelassi, M.; Jemai, O.; Demongeot, J. Revolutionizing Radiological Analysis: The Future of French Language Automatic Speech Recognition in Healthcare. Diagnostics 2024, 14, 895.
- OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774.
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971.
- Wang, D.; Zhang, S. Large language models in medical and healthcare fields: Applications, advances, and challenges. Artif. Intell. Rev. 2024, 57, 299.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 1, pp. 4171–4186.
- Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240.
- Tricco, A.C.; Lillie, E.; Zarin, W.; O’Brien, K.K.; Colquhoun, H.; Levac, D.; Moher, D.; Peters, M.D.; Horsley, T.; Weeks, L.; et al. PRISMA extension for scoping reviews (PRISMA-ScR): Checklist and explanation. Ann. Intern. Med. 2018, 169, 467–473.
- Munn, Z.; Peters, M.D.; Stern, C.; Tufanaru, C.; McArthur, A.; Aromataris, E. Systematic review or scoping review? Guidance for authors when choosing between a systematic or scoping review approach. BMC Med. Res. Methodol. 2018, 18, 143.
- Chan, W.; Jaitly, N.; Le, Q.; Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2016; pp. 4960–4964.
- Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented Transformer for Speech Recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 5036–5040.
- Koenecke, A.; Choi, A.S.G.; Mei, K.X.; Schellmann, H.; Sloane, M. Careless Whisper: Speech-to-Text Hallucination Harms. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency; Association for Computing Machinery: New York, NY, USA, 2024.
- Carl, M.; Icht, M. Automated Assessment of Word- and Sentence-Level Speech Intelligibility in Developmental Motor Speech Disorders: A Cross-Linguistic Investigation. Diagnostics 2025, 15, 1892.
- Zhong, Z.; Wang, Q.; Singh, S.; Mendes, C.C.; Hasegawa-Johnson, M.; Abdulla, W.; Reza Shahamiri, S. Convolution-Augmented Transformers for Enhanced Speaker-Independent Dysarthric Speech Recognition. IEEE Trans. Neural Syst. Rehabil. Eng. 2025, 33, 3815–3826.
- Fatehifar, M.; Munro, K.J.; Stone, M.A.; Wong, D.; Cootes, T.; Schlittenlacher, J. Digits-In-Noise Hearing Test Using Text-to-Speech and Automatic Speech Recognition: Proof-of-Concept Study. Trends Hear. 2025, 29, 23312165251367625.
- Zolnoori, M.; Vergez, S.; Xu, Z.; Esmaeili, E.; Zolnour, A.; Anne Briggs, K.; Scroggins, J.K.; Hosseini Ebrahimabad, S.F.; Noble, J.M.; Topaz, M.; et al. Decoding disparities: Evaluating automatic speech recognition system performance in transcribing Black and White patient verbal communication with nurses in home healthcare. JAMIA Open 2024, 7, ooae130.
- DiChristofano, A.; Shuster, H.; Chandra, S.; Patwari, N. Global Performance Disparities Between English-Language Accents in Automatic Speech Recognition. arXiv 2023, arXiv:2208.01157.
- Adedeji, A.; Sanni, M.; Ayodele, E.; Joshi, S.; Olatunji, T. The Multicultural Medical Assistant: Can LLMs Improve Medical ASR Errors Across Borders? arXiv 2025, arXiv:2501.15310.
- Nazi, Z.A.; Peng, W. Large Language Models in Healthcare and Medical Domain: A Review. Informatics 2024, 11, 57.
- Thirunavukarasu, A.J.; Ting, D.S.J.; Elangovan, K.; Gutierrez, L.; Tan, T.F.; Ting, D.S.W. Large language models in medicine. Nat. Med. 2023, 29, 1930–1940.
- Shool, S.; Adimi, S.; Saboori Amleshi, R.; Bitaraf, E.; Golpira, R.; Tara, M. A systematic review of large language model (LLM) evaluations in clinical medicine. BMC Med. Inform. Decis. Mak. 2025, 25, 117.
- Alkalbani, A.M.; Alrawahi, A.S.; Salah, A.; Haghighi, V.; Zhang, Y.; Alkindi, S.; Sheng, Q.Z. A Systematic Review of Large Language Models in Medical Specialties: Applications, Challenges and Future Directions. Information 2025, 16, 489.
- Bedi, S.; Liu, Y.; Orr-Ewing, L.; Dash, D.; Koyejo, S.; Callahan, A.; Fries, J.A.; Wornow, M.; Swaminathan, A.; Lehmann, L.S.; et al. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA 2025, 333, 319–328.
- van Buchem, M.M.; Boosman, H.; Bauer, M.P.; Kant, I.M.J.; Cammel, S.A.; Steyerberg, E.W. The digital scribe in clinical practice: A scoping review and research agenda. NPJ Digit. Med. 2021, 4, 57.
- Jordan, E.; Terrisse, R.; Lucarini, V.; Alrahabi, M.; Krebs, M.O.; Desclés, J.; Lemey, C. Speech Emotion Recognition in Mental Health: Systematic Review of Voice-Based Applications. JMIR Ment. Health 2025, 12, e74260.
- Sasseville, M.; Yousefi, F.; Ouellet, S.; Naye, F.; Stefan, T.; Carnovale, V.; Bergeron, F.; Ling, L.; Gheorghiu, B.; Hagens, S.; et al. The Impact of AI Scribes on Streamlining Clinical Documentation: A Systematic Review. Healthcare 2025, 13, 1447.
- Muthusamy, D.; Muthusamy Chinnan, S.P.; Muthuswamy Prakashpathy, G. From Waves to Words: The Impact of Large Language Models on Automatic Speech Recognition Systems: An Overview. 2025. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5113383 (accessed on 1 April 2026).
- Yang, Z.; Shimizu, S.; Yu, Y.; Chu, C. When large language models meet speech: A survey on integration approaches. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 20298–20315.
- Cui, W.; Yu, D.; Jiao, X.; Meng, Z.; Zhang, G.; Wang, Q.; Guo, S.Y.; King, I. Recent advances in speech language models: A survey. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 13943–13970.
- Arksey, H.; O’Malley, L. Scoping studies: Towards a methodological framework. Int. J. Soc. Res. Methodol. 2005, 8, 19–32.
- Gabor-Siatkowska, K.; Sowański, M.; Rzatkiewicz, R.; Stefaniak, I.; Kozłowski, M.; Janicki, A. AI to Train AI: Using ChatGPT to Improve the Accuracy of a Therapeutic Dialogue System. Electronics 2023, 12, 4694.
- Bang, J.U.; Han, S.H.; Kang, B.O. Alzheimer’s disease recognition from spontaneous speech using large language models. ETRI J. 2024, 46, 96–105.
- Kheddar, H.; Hemis, M.; Himeur, Y. Automatic speech recognition using advanced deep learning approaches: A survey. Inf. Fusion 2024, 109, 102422.
- Lossio-Ventura, J.A.; Frank, S.; Ringlein, G.; Bonson, K.; Olszko, A.; Knobel, A.; Pine, D.S.; Freeman, J.B.; Benito, K.; Jangraw, D.C.; et al. Automated classification of exposure and encourage events in speech data from pediatric OCD treatment. JAMIA Open 2025, 8, ooaf127.
- O’Kane, R.; Stonehouse-Smith, D.; Ota, L.; Patel, R.; Johnson, N.; Slipper, C.; Seehra, J.; Papageorgiou, S.; Cobourne, M. Transcription Accuracy of Automatic Speech Recognition for Orthodontic Clinical Records. J. Dent. Res. 2025.
- Xu, Y.; Jia, H.; Wang, M.; Feng, J.; Xu, X.; Wang, H.; Chen, J.; Zheng, Z.; Yang, X.; Shen, Y.; et al. Enhancing clinical documentation with voice processing and large language models: A study on the LAOS system. NPJ Digit. Med. 2025, 8, 798.
- Balyan, R.; Rivera, A.Y.; Verma, T. Incorporating Language Technologies and LLMs to Support Breast Cancer Education in Hispanic Populations: A Web-Based, Interactive Platform. Appl. Sci. 2025, 15, 11231.
- Busch, F.; Prucker, P.; Komenda, A.; Ziegelmayer, S.; Makowski, M.R.; Bressem, K.K.; Adams, L.C. Multilingual feasibility of GPT-4o for automated Voice-to-Text CT and MRI report transcription. Eur. J. Radiol. 2025, 182, 111827.
- Rafi, R.A.; Ahmed, S.; Venkateshperumal, D.; Khokhar, A.; Arifuzzaman, M.; Azad, A.; Alyami, S.A. Multilingual Low-Latency Emergency VoIP System Using LLM for Speech Reconstruction and Blockchain for Secure Data Archiving. IEEE Access 2025, 13, 150101–150122.
- Emssaad, I.; Ben-Bouazza, F.E.; Tafala, I.; Mezali, M.C.E.; Jioudi, B. Leveraging multilingual RAG for breast cancer RCPs: AI-driven speech transcription and compliance in Darija-French clinical discussions. Comput. Methods Programs Biomed. Update 2025, 8, 100221.
- de Paula, P.A.B.; Severino, J.V.B.; Berger, M.N.; Veiga, M.H.; Ribeiro, K.D.P.; Loures, F.S.; Todeschini, S.A.; Roeder, E.A.; Marques, G.L. Improving documentation quality and patient interaction with AI: A tool for transforming medical records—An experience report. J. Med. Artif. Intell. 2025, 8, 19.
- Zavala-Díaz, J.; Olivares-Rojas, J.C.; Gutierrez-Gnecchi, J.A.; Tellez-Anguiano, A.C.; Reyes-Archundia, E.; Ramos-Díaz, J.G. Towards a Clinical Interface for Speaker Identification and Speech-To-Text Transcription for Recording Medical Consultations in Spanish. Int. J. Comb. Optim. Probl. Inform. 2025, 16, 364–374.
- Chen, C.; Hu, Y.; Cai, W.; Pan, H.; Shen, M.; Zhai, Y.; Wu, S.; Zhou, Q.; Guo, Y. Deep learning-based in-ambulance speech recognition and generation of prehospital emergency diagnostic summaries using LLMs. Int. J. Med. Inform. 2025, 203, 106029.
- Park, C.; Kim, C. A Novel Chain-of-Thought Reasoning Approach for Alzheimer’s Disease Detection Using Large Language and Vision-Language Models. IEEE Trans. Neural Syst. Rehabil. Eng. 2025, 33, 4386–4395.
- Umer, L.; Iqbal, J.; Ayaz, Y.; Imam, H.; Ahmad, A.; Asgher, U. StressSpeak: A Speech-Driven Framework for Real-Time Personalized Stress Detection and Adaptive Psychological Support. Diagnostics 2025, 15, 2871.
- Yu Wu, J.; Chen, Y.J.; Lin, Y.F.; Ching, C.T.S.; Wang, H.M.D.; Chen, Y.C.; Juan, Y.C.; Liao, L.D. Bioengineering an AI-augmented platform for remote mental health interventions. Results Eng. 2025, 27, 105931.
- Zisquit, M.; Shoa, A.; Oliva, R.; Perry, S.; Spanlang, B.; Klomek, A.B.; Slater, M.; Friedman, D. AI-Enhanced Virtual Reality Self-Talk for Psychological Counseling: Formative Qualitative Study. JMIR Form. Res. 2025, 9, e67782.
- He, Y.; Seng, K.P.; Lim, C.S.; Ang, L.M. Robust Dysarthric Speech Recognition with GAN Enhancement and LLM Correction. Adv. Intell. Syst. 2026, 8, e202500873.
- Ding, H.; Xia, W.; Zhou, Y.; Wei, L.; Feng, Y.; Wang, Z.; Song, X.; Li, R.; Mao, Q.; Chen, B.; et al. Evaluation and practical application of prompt-driven ChatGPTs for EMR generation. NPJ Digit. Med. 2025, 8, 77.
- Gasque, R.A.; Zaietta, N.; Mollard, L.; Beltrame, M.C.; Virreira, M.E.L.; Quiñonez, E.G.; Mattera, F.J. HPB SmartNotes: The impact of artificial intelligence on surgeon workload in the outpatient office. EngMedicine 2025, 2, 100101.
- Ankush, A. VoxRad: Building an open-source locally-hosted radiology reporting system. Clin. Imaging 2025, 119, 110414.
- Guo, Y.; Huang, A.; Peng, B.; Li, Y.; Gu, W. MBBo-RPSLD: Training a Multimodal BlenderBot for Rehabilitation in Post-Stroke Language Disorder. IEEE J. Biomed. Health Inform. 2026, 30, 2805–2815.
- Norgeot, B.; Quer, G.; Beaulieu-Jones, B.K.; Torkamani, A.; Dias, R.; Gianfrancesco, M.; Arnaout, R.; Kohane, I.S.; Saria, S.; Topol, E.; et al. Minimum information about clinical artificial intelligence modeling: The MI-CLAIM checklist. Nat. Med. 2020, 26, 1320–1324.
- Wolff, R.F.; Moons, K.G.; Riley, R.D.; Whiting, P.F.; Westwood, M.; Collins, G.S.; Reitsma, J.B.; Kleijnen, J.; Mallett, S. PROBAST: A tool to assess the risk of bias and applicability of prediction model studies. Ann. Intern. Med. 2019, 170, 51–58.
- Yang, H.; Tian, Q.; Gu, X. Toward Industry 5.0: Evaluating Multimodal Virtual Human Interaction for Smart Healthcare in Simulated VR Environments. Internet Technol. Lett. 2026, 9, e70190.

| Review | Year | PRISMA | Healthcare Domain | ASR Focus | LLM Focus | ASR-LLM Integration | Evaluation Analysis | Ethics and Challenges |
|---|---|---|---|---|---|---|---|---|
| Van Buchem et al. [35] | 2021 | ✕ | Digital scribes | ✓ | ✕ | ✕ | ✓ | ✕ |
| Zhang et al. [1] | 2023 | ✕ | Smart hospitals | ✓ | ✕ | ✕ | ✕ | ✕ |
| Jordan et al. [36] | 2025 | ✓ | Mental health (SER *) | ✓ | ✕ | ✕ | ✓ | ✕ |
| Ng et al. [2] | 2025 | ✓ | Clinical documentation | ✓ | ✕ | ✕ | ✓ | ✕ |
| Sasseville et al. [37] | 2025 | ✓ | AI scribes | ✓ | ✕ | ✕ | ✓ | ✕ |
| This Review | 2025 | ✓ | Comprehensive (all domains) | ✓ | ✓ | ✓ | ✓ | ✓ |

| Reference | Their Contribution | What This Review Adds |
|---|---|---|
| Muthusamy et al. (2025) [38] | 60-year historical overview; ASR evolution from HMM to LLMs | Application taxonomy (RQ1); systematic evaluation critique (RQ3); reproducibility quantification (5/19 packages) |
| Yang et al. (2025) [39] | Technical fusion architectures; modality alignment methods | Architectural adaptation patterns in practice (RQ2); deployment challenge synthesis (RQ4); clinical validity framework |
| Cui et al. (2025) [40] | Speech LM advances; training paradigms; benchmark datasets | Real-world application mapping; evaluation fragmentation analysis; translational gap identification |

| Application Context | Description |
|---|---|
| Administrative | Clinical documentation tasks such as generating medical notes, structuring EMRs, transcribing radiology reports, and summarizing consultations. |
| Diagnosis | Systems that analyze speech to support clinical decision-making, including screening for cognitive decline, detecting stress, and automated coding of therapy sessions. |
| Therapy | Applications that support therapeutic processes, including dialogue systems for counseling, teletherapy augmentation, VR-based counseling, and bilingual patient education. |
| Doctor–patient communication | Tools that facilitate or improve spoken interaction between clinicians and patients, including speech reconstruction for degraded signals, EMR generation from consultations, and speech recognition for atypical speech. |

| Application Setting | Description |
|---|---|
| Clinical | Outpatient clinics, primary care facilities, and specialty practices where patients receive scheduled consultations or therapy sessions. |
| Hospital | Inpatient or departmental settings within hospitals, including radiology, ophthalmology, and orthodontics units. |
| Telehealth | Remote care delivered through digital platforms, including video-based therapy, VR-based counseling, and remote screening tools. |
| Homecare | Patient-initiated use in home environments, such as smartphone-based monitoring, self-administered assessments, and web-based health education. |
| Emergency | Prehospital and acute settings, including ambulance-based documentation and emergency communication systems. |

| Task | Description | Studies |
|---|---|---|
| Clinical documentation | Generating clinical notes, EMRs, medical records, or radiology reports from speech input. | [47,49,52,57,60,61] |
| Classification/screening | Classifying speech content for diagnostic purposes, including AD detection and stress detection. | [45,55,56] |
| Emergency/diagnostic summary | Generating structured summaries from prehospital or clinical speech for urgent decision support. | [51,54] |
| ASR error correction | Post-processing ASR transcriptions to correct domain-specific recognition errors. | [46,59] |
| Therapeutic dialogue | Generating counselor-style responses in therapeutic or mental health interactions. | [42,58] |
| Synthetic data generation | Producing artificial training data to augment limited clinical datasets. | [42] |
| Compliance validation | Checking clinical outputs against guideline-based rules for regulatory adherence. | [51] |
| Speech reconstruction | Semantically reconstructing degraded or incomplete speech signals. | [50] |
| Patient education (QA) | Answering patient questions and generating educational content. | [48] |
| Machine translation | Translating clinical speech across languages. | [53] |
| Fluency/opinion evaluation | Assessing speech fluency and generating clinical opinions from transcripts. | [43] |

| Task | Description | Studies |
|---|---|---|
| Text classification | Intent recognition, emotion detection, sentiment analysis, and stress classification. | [42,45,56] |
| Feature extraction | Extracting acoustic features (Wav2Vec 2.0) or textual embeddings (BERT) from speech or transcripts. | [43] |
| Semantic retrieval | Matching patient queries to relevant educational content using embedding similarity. | [48] |
| Data augmentation | Generating augmented dysarthric speech samples via adversarial training (CycleGAN). | [59] |
| Facial expression analysis | Classifying patient facial expressions during teletherapy sessions (Swin Transformer). | [57] |

| Metric Family | Component | Metric | Freq. | Studies |
|---|---|---|---|---|
| Word Error | ASR | WER | 8 | [42,45,46,47,51,55,59,61] |
| | ASR | CER | 5 | [46,51,54,57,61] |
| | ASR | DWER; N-DWER; uWER | 1 ea. | [46] |
| | ASR | MER; WIL; WIP | 1 ea. | [61] |
| | ASR | SRR; RAR | 1 ea. | [42,57] |
| Text Similarity | LLM | BERTScore | 4 | [46,47,49,60] |
| | LLM | ROUGE (variants) | 5 | [47,49,50,53,60] |
| | LLM | BLEU | 2 | [47,50] |
| | LLM | BARTScore | 1 | [46] |
| Classification | LLM | Accuracy | 4 | [45,54,55,56] |
| | LLM | F1; precision; recall | 3 ea. | [53,55,56] |
| | LLM | AUC | 1 | [45] |
| Human Evaluation | LLM | Manual quality scoring | 3 | [54,57,60] |
| | LLM | Qualitative thematic analysis | 1 | [58] |
| | LLM | Baseline vs. augmented | 1 | [42] |
| | LLM | Ablation study | 1 | [43] |
| | LLM | Chi-square test | 1 | [61] |
| | LLM | Manual error categorization | 1 | [49] |
| System-Level | ASR | Latency | 1 | [50] |
| | LLM | Hallucination rate | 1 | [46] |
| | LLM | Platform usage; activation rate | 1 ea. | [52] |
| | LLM | RAG Triad; conceptual prec. | 1 ea. | [50,51] |

| Method | Description | Freq. | Studies |
|---|---|---|---|
| Clinician evaluation | Domain experts assess outputs for accuracy, relevance, and clinical utility. | 8 | [47,49,51,52,53,54,57,60] |
| Satisfaction questionnaire | Structured survey capturing user experience from clinicians, patients, or both. | 6 | [42,47,52,56,57,58] |
| Workflow simulation | System tested in a scenario resembling real clinical use. | 4 | [49,53,54,57] |
| Real-world deployment | System deployed in routine clinical practice at scale. | 1 | [52] |
| Compliance testing | Guardrail testing for toxicity, data leakage, and guideline adherence. | 1 | [51] |
| Not applied | | 7 | [43,45,46,48,50,55,59] |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Alabbad, M.; Alhoshan, W. Automatic Speech Recognition in Healthcare in the Post-LLM Era: A Scoping Review. Healthcare 2026, 14, 1333. https://doi.org/10.3390/healthcare14101333

