Architecting the Orthopedical Clinical AI Pipeline: A Review of Integrating Foundation Models and FHIR for Agentic Clinical Assistants and Digital Twins

Boltaboyeva, Assiya; Baigarayeva, Zhanel; Imanbek, Baglan; Amangeldy, Bibars; Tasmurzayev, Nurdaulet; Ozhikenov, Kassymbek; Alimbayeva, Zhadyra; Alimbayev, Chingiz; Karymsakova, Nurgul

doi:10.3390/a19020099

Open Access Review

by Assiya Boltaboyeva ^1,2,3; Zhanel Baigarayeva ^1,2,3; Baglan Imanbek ^2,*; Bibars Amangeldy ^2; Nurdaulet Tasmurzayev ^2; Kassymbek Ozhikenov ^2,*; Zhadyra Alimbayeva ^1; Chingiz Alimbayev ^1; and Nurgul Karymsakova ^4

^1 Institute of Automation and Information Technology, Satbayev University, Almaty 050013, Kazakhstan

^2 Faculty of Information Technologies and Artificial Intelligence, Al Farabi Kazakh National University, Almaty 050040, Kazakhstan

^3 LLP “Kazakhstan R&D Solutions”, Almaty 050056, Kazakhstan

^4 Department of Automation and Control, ALT University, Almaty 050012, Kazakhstan

^* Authors to whom correspondence should be addressed.

Algorithms 2026, 19(2), 99; https://doi.org/10.3390/a19020099

Submission received: 29 December 2025 / Revised: 22 January 2026 / Accepted: 23 January 2026 / Published: 27 January 2026

(This article belongs to the Special Issue Artificial Intelligence Algorithms for Healthcare: 2nd Edition)

Abstract

The exponential growth of multimodal orthopedic data, ranging from longitudinal Electronic Health Records to high-resolution musculoskeletal imaging, has rendered manual analysis insufficient. This has established Large Language Models (LLMs) as algorithmically necessary for managing healthcare complexity. However, their deployment in high-stakes surgical environments presents a fundamental algorithmic paradox: while generic foundation models possess vast reasoning capabilities, they often lack the precise, protocol-driven domain knowledge required for safe orthopedic decision support. This review provides a structured synthesis of the emerging algorithmic frameworks required to build modern clinical AI assistants. We deconstruct current methodologies into their core components: large-language-model adaptation, multimodal data fusion, and standardized data interoperability pipelines. Rather than proposing a single proprietary architecture, we analyze how recent literature connects specific algorithmic choices such as the trade-offs between full fine-tuning and Low-Rank Adaptation to their computational costs and factual reliability. Furthermore, we examine the theoretical architectures required for ‘agentic’ capabilities, where AI systems integrate outputs from deep convolutional neural networks and biosensors. The review concludes by outlining the unresolved challenges in algorithmic bias, security, and interoperability that must be addressed to transition these technologies from research prototypes to scalable clinical solutions.

Keywords: large language models; AI assistant; clinical decision support systems; RAG; finetuning; multimodal AI; explainable AI; NLP; medical AI

1. Introduction

The exponential growth of multimodal orthopedic data, ranging from longitudinal Electronic Health Records (EHRs) to high-resolution musculoskeletal (MSK) imaging, has rendered manual analysis insufficient. As a result, LLMs have become algorithmically necessary for managing healthcare complexity. By processing large-scale, heterogeneous, and unstructured clinical narratives with minimal preprocessing, LLMs enable the rapid synthesis of operative notes, biomechanical research, and patient communications. This process converts vast volumes of data into actionable clinical knowledge [1,2,3,4]. However, their deployment in high-stakes surgical environments presents a fundamental algorithmic paradox: while generic foundation models possess vast reasoning capabilities, they often lack the precise, protocol-driven domain knowledge required for safe orthopedic decision support [5]. Consequently, the core engineering challenge shifts from model development to specialized domain adaptation. Bridging the semantic gap between generalist pre-training and specialist application without inducing catastrophic forgetting is a prerequisite for deployment. This step enables the development of intelligent software agents designed to augment a surgeon’s cognitive capabilities, manage complex musculoskeletal (MSK) information, and support critical perioperative decisions.

To be truly effective, intelligent assistants must align with the broader shift toward personalized orthopedics. This requires a multimodal approach that extends beyond text. Orthopedics is inherently visual and mechanical; therefore, robust assistants must integrate outputs from Deep Convolutional Neural Networks (CNNs) used for automated image analysis. CNNs have demonstrated expert-level accuracy in detecting fractures, assessing soft-tissue injuries, and grading osteoarthritis [6,7]. By combining these visual insights with patient-specific data, such as bone density profiles and biomechanical loading factors, AI assistants can facilitate individualized treatment planning. These plans range from precision implant sizing in arthroplasty to optimized alignment in osteotomy [8,9,10].

A critical barrier to building these multimodal assistants is the effective ingestion of structured clinical data. Standard interoperability formats, such as HL7 FHIR, provide essential schema definitions but often introduce syntactic clutter that hinders semantic processing. Current state-of-the-art frameworks address this by converting structured resources into natural language narratives or semantic vectors before retrieval. This ensures that Retrieval-Augmented Generation (RAG) mechanisms operate on rich clinical concepts such as specific fracture classifications or prosthetic component specifications rather than rigid code schemas. Furthermore, to overcome the “context-loading latency” inherent in querying distributed FHIR servers, modern architectures are increasingly moving toward asynchronous vector indexing strategies. This allows for the scalable “Data Readiness” required for real-time clinical inference [11,12].
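The conversion of structured resources into narratives described above can be sketched in a few lines. This is a minimal illustration, not the method of any cited framework: it assumes simplified FHIR `Condition` and `Observation` resources held as Python dictionaries, and the function names `resource_to_narrative` and `_code_text` as well as the sample values are hypothetical.

```python
from typing import Any, Dict


def _code_text(concept: Dict[str, Any]) -> str:
    """Return a human-readable label from a FHIR CodeableConcept."""
    if concept.get("text"):
        return concept["text"]
    for coding in concept.get("coding", []):
        if coding.get("display"):
            return coding["display"]
    return "unspecified"


def resource_to_narrative(resource: Dict[str, Any]) -> str:
    """Flatten a simplified FHIR resource into one plain sentence."""
    rtype = resource.get("resourceType")
    if rtype == "Condition":
        label = _code_text(resource.get("code", {}))
        onset = resource.get("onsetDateTime", "an unknown date")
        return f"Condition: {label}, onset {onset}."
    if rtype == "Observation":
        label = _code_text(resource.get("code", {}))
        qty = resource.get("valueQuantity", {})
        value = qty.get("value")
        unit = qty.get("unit", "")
        if value is None:
            return f"Observation: {label}, no numeric value recorded."
        return f"Observation: {label} = {value} {unit}".strip() + "."
    return f"Unsupported resource type: {rtype}."


# Hypothetical sample resources (illustrative values only).
condition = {
    "resourceType": "Condition",
    "code": {"text": "Displaced femoral neck fracture"},
    "onsetDateTime": "2025-03-02",
}
observation = {
    "resourceType": "Observation",
    "code": {"coding": [{"display": "Femoral neck bone mineral density"}]},
    "valueQuantity": {"value": -2.7, "unit": "T-score"},
}

for res in (condition, observation):
    print(resource_to_narrative(res))
```

The resulting sentences, rather than the nested JSON, would then be embedded and indexed, so that retrieval operates on clinical wording instead of schema syntax.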

Despite these technical advancements, the “black-box” nature of deep learning models remains a primary obstacle to clinical adoption. In orthopedic surgery, where decision pathways must be defensible, opacity limits trust. Explainability techniques, such as Gradient-weighted Class Activation Mapping (Grad-CAM) for imaging and SHapley Additive exPlanations (SHAP) for tabular risk models, have become pivotal. These methods elucidate model predictions, ensuring that AI-driven recommendations are interpretable and align with established medical logic [13,14].

The primary contribution of this review is to provide a structured synthesis of the multi-stage algorithmic framework required to build modern clinical AI assistants. Rather than surveying general clinical applications, we deconstruct these systems into their core algorithmic components: (1) large language model adaptation, (2) multimodal data fusion, and (3) standardized data interoperability pipelines. We present an integrated, end-to-end pipeline view that connects specific algorithmic choices to their computational trade-offs. For LLM adaptation, we conduct a comparative analysis of full fine-tuning, Low-Rank Adaptation (LoRA), and RAG, examining how parameter-efficient approaches impact computational cost and factual reliability. In multimodal data fusion, we review architectures that combine radiological images, clinical text, and biosignals to generate unified diagnostic outputs. Finally, we outline unresolved challenges in algorithmic bias, security, and interoperability that must be addressed for scalable deployment, and we offer a blueprint for researchers building the next generation of orthopedic intelligence.

2. Methodology

To ensure a rigorous and transparent synthesis of the emerging algorithmic frameworks for orthopedic clinical AI assistants, we conducted a structured narrative review of literature published between November 2016 and September 2025. We searched major databases, including PubMed, Scopus, IEEE Xplore, and Web of Science, using combinations of keywords such as ‘Large Language Models,’ ‘Orthopedics,’ ‘FHIR,’ ‘Generative AI,’ and ‘Clinical Decision Support.’

We prioritized peer-reviewed articles that bridge theoretical AI architectures with practical surgical and diagnostic workflows, specifically focusing on studies that integrate foundation models with standardized interoperability protocols. Purely theoretical benchmarks without clinical context or non-English publications were excluded. This selection process resulted in the final inclusion of 135 studies (Introduction: 14, Foundational technologies: 64, Application layer: 48, Discussion: 9). The detailed review workflow, including identification, screening, and eligibility assessment, is illustrated in Figure 1.

While this study utilizes the PRISMA framework to ensure transparency in the literature search and selection process, it is designed as a structured narrative review rather than a fully quantitative systematic review. Following the screening phase, we performed a targeted data extraction to identify key technical and clinical dimensions across the included studies. We synthesized information regarding architectural components such as the use of specific foundation models and integration frameworks alongside adaptation strategies, including RAG, LoRA, and fine-tuning. Furthermore, we examined evaluation settings involving clinical scenarios, data sources, and interoperability standards like FHIR, as well as performance metrics related to accuracy, efficiency gains, and safety limitations. This structured extraction approach allowed for a qualitative synthesis of how current AI technologies address clinical workflow requirements in orthopedics.

3. Foundational Technologies for AI Clinical Assistants

Figure 2 presents the conceptual framework that synthesizes how domain-specific adaptation mechanisms integrate with multimodal clinical data processing and controlled inference constraints. This diagram represents an original architectural synthesis derived from integrating multiple recent methodologies into a unified pipeline that addresses a critical deployment challenge. The framework illustrates the core algorithmic innovation of this review, namely the distinction between pretraining generalization and domain-specific specialization. The left pathway demonstrates the challenge of deploying unadapted foundation models in clinical settings, where general biomedical reasoning capabilities may be insufficient for specialized orthopedic decision-making. The right pathway shows how parameter-efficient adaptation techniques such as LoRA and controlled inference constraints enable domain-safe outputs [5,15,16,17].

The novel contribution lies in the explicit architectural decoupling of the semantic reasoning layer, which is handled by the adapted LLM, from the deterministic safety verification layer, which is handled by grammar-constrained decoding (GCD) enforcing FHIR interoperability standards. This distinction is critical because standard fine-tuning approaches optimize for task accuracy but do not guarantee that outputs conform to rigid clinical schema requirements. Figure 2 demonstrates that clinical AI systems must separate these concerns architecturally: they must reason flexibly about clinical problems while simultaneously enforcing deterministic constraints so that outputs are parsable, compliant, and clinically safe. This two-layer approach prevents a common failure mode in which stochastic LLM outputs hallucinate clinically invalid JSON fields or violate interoperability standards, even when the underlying medical reasoning is sound [18].
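The separation of concerns described above can be illustrated with a deliberately small deterministic checker. Note the substitution: this sketch validates model output after generation, which is a simpler stand-in for constraining decoding token by token. The field whitelist is a heavily reduced, hypothetical subset of an Observation-like payload, and `verify_llm_output` is an illustrative name, not part of any FHIR library.

```python
import json
from typing import List, Tuple

# Hypothetical, heavily reduced whitelist for an Observation-like payload.
# A real deployment would validate against the full FHIR profile instead.
REQUIRED_FIELDS = {"resourceType": str, "status": str, "code": dict}
OPTIONAL_FIELDS = {"valueQuantity": dict, "subject": dict}
# Subset of the Observation status codes, kept short for the example.
ALLOWED_STATUS = {"registered", "preliminary", "final", "amended"}


def verify_llm_output(raw_text: str) -> Tuple[bool, List[str]]:
    """Deterministically check a model's JSON output; return (ok, errors)."""
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return False, [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return False, ["top-level value must be a JSON object"]

    errors: List[str] = []
    allowed = {**REQUIRED_FIELDS, **OPTIONAL_FIELDS}
    for name in REQUIRED_FIELDS:
        if name not in payload:
            errors.append(f"missing required field: {name}")
    for name, value in payload.items():
        if name not in allowed:
            # Catches hallucinated fields that the schema does not define.
            errors.append(f"unknown field: {name}")
        elif not isinstance(value, allowed[name]):
            errors.append(f"wrong type for field: {name}")
    if payload.get("resourceType") != "Observation":
        errors.append("resourceType must be 'Observation'")
    status = payload.get("status")
    if isinstance(status, str) and status not in ALLOWED_STATUS:
        errors.append("status is not an allowed code")
    return (not errors), errors


good = '{"resourceType": "Observation", "status": "final", "code": {"text": "x"}}'
bad = ('{"resourceType": "Observation", "status": "final", '
       '"code": {"text": "x"}, "implantTorque": 5}')
print(verify_llm_output(good))
print(verify_llm_output(bad))
```

Because the checker is independent of the model, the reasoning layer can be swapped or re-adapted without touching the rules that decide whether an output is allowed to leave the system.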

3.1. LLMs Adaptation in Orthopedics

Large language models (LLMs) are steadily transforming clinical documentation, decision support, medical education, and patient engagement [19]. While full-parameter fine-tuning theoretically offers maximum expressivity, it is often ill-suited for clinical domains due to the high risk of catastrophic forgetting, in which the model overwrites its general biomedical reasoning capabilities to overfit on small, institution-specific datasets.

Parameter efficiency is crucial for orthopedic-specific models, particularly when addressing computational and privacy constraints. The MedAdapter framework exemplifies this paradigm, fine-tuning only a 110 M-parameter BERT-sized adapter rather than full model weights. This approach achieves 99.35% of supervised fine-tuning performance while using 14.75% of GPU memory, enabling deployment in resource-constrained healthcare environments [20]. Similarly, instruction fine-tuning using orthopedic clinical notes demonstrates particular promise. Vaid et al. fine-tuned LLaMA-7B on musculoskeletal pain characteristics extracted from unstructured clinical notes. This approach resulted in privacy-preserving models capable of parsing patient histories without transmitting protected health information to third-party APIs. This methodology addresses critical regulatory requirements while maintaining clinical utility [21].

In the context of Multimodal LLMs, such as those analyzing orthopedic imagery, architectural decisions regarding adapter placement are critical. Research [22] demonstrates that fine-tuning the vision encoder can lead to ‘feature collapse,’ effectively destroying the robust representations learned during pre-training. Domain-specific fine-tuning has proved its value in multimodal systems that automatically draft diagnostic reports and perform quantitative measurements with expert-level precision.

Retrieval-Augmented Generation (RAG) has emerged as a leading strategy for boosting factual correctness and serves as the dominant adaptation strategy for orthopedic applications. The pipeline in Figure 3 begins with curating a knowledge base of articles, clinical guidelines, and electronic health records. For orthopedics, this typically involves integrating 13–28 authoritative sources, including AAOS Comprehensive Review textbooks [23]. The OrthoWizard evaluation demonstrated the transformative impact of retrieval-augmented generation (RAG). GPT-4 with RAG achieved an accuracy of 73.80% on 1023 orthopedic examination questions, which was statistically equivalent to human orthopedic surgeons (73.97%) and significantly higher than GPT-4 alone (64.91%) [23]. This source-grounding reduces hallucinations while providing traceable citations, critical features for clinical adoption.

Figure 3 depicts the retrieval-augmented generation (RAG) pipeline for orthopedic clinical question answering. It synthesizes well-documented retrieval strategies from recent clinical AI implementations while emphasizing specific architectural innovations that address the clinical precision bottleneck. While RAG itself is an established technique in the literature [24], the clinical instantiation presented here makes explicit three critical technical refinements: (1) a hybrid search strategy that combines dense vector retrieval with Okapi BM25 sparse retrieval to address the lexical gap inherent in clinical text, since dense embeddings capture semantic intent but may fail to match exact terminology, particularly for rigid medical identifiers such as ICD-10 codes and medication dosages; (2) Hierarchical Navigable Small World (HNSW) [25,26] indexing, which scales retrieval to million-scale vector databases while maintaining logarithmic time complexity, a property that is crucial for real-time clinical decision support; and (3) end-to-end tracking of factual grounding from curated authoritative knowledge sources through multimodal encoding to final model inference.
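To make the sparse half of the hybrid strategy concrete, the following is a minimal sketch of Okapi BM25 scoring in plain Python. Several choices are assumptions made for illustration only: whitespace tokenization, the commonly used defaults k1 = 1.5 and b = 0.75, the non-negative IDF variant ln(1 + (N − n + 0.5)/(n + 0.5)), and a toy three-document corpus with an ICD-10-style code.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df: Counter = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # exact lexical match is required
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


corpus = [
    "s72.001a fracture of femoral neck",
    "degenerative disc disease of lumbar spine",
    "total knee arthroplasty implant sizing",
]
print(bm25_scores("S72.001A", corpus))
```

Only the document containing the exact identifier receives a non-zero score, which is the behavior that dense embeddings alone cannot guarantee for rigid codes.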

To implement this high-precision retrieval, the architecture integrates a semantic embedding substrate, a retrieval-augmented generation pathway, and a parameter-efficient multimodal language model to deliver fact-checked answers grounded in heterogeneous evidence. A dedicated embedding service maps both user queries and a corpus of text–image pairs into a shared vector space stored in a high-performance database.

The weighting parameter α ∈ [0, 1] serves as a critical hyperparameter that balances semantic depth (dense retrieval) with lexical precision (sparse retrieval) [27,28]. In the context of clinical applications, this hybrid approach is essential for mitigating the ‘vocabulary mismatch’ problem, where patients and clinicians use disparate terminologies for the same condition [29]. For instance, in orthopedic surgery, a value of α > 0.5 (typically 0.7) is preferred when seeking conceptual matches, such as linking ‘degenerative disc disease’ with ‘vertebral wear’, a task where dense embeddings excel by capturing nuanced semantic relationships [30,31]. Conversely, a lower α is utilized for retrieving specific clinical entities, such as precise implant serial numbers or specific surgical instruments, where exact keyword matching (sparse retrieval via BM25) is non-negotiable for clinical safety [28,32]. Recent meta-analyses in biomedicine demonstrate that such hybrid indexing significantly outperforms single-method retrieval, providing a 35% improvement in accuracy for clinical decision support systems [29].
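The role of α can be illustrated with a small fusion routine. The sketch assumes the common convex combination score = α · dense + (1 − α) · sparse, which is consistent with the description above (larger α favors dense retrieval) but is not quoted from the source. Min-max normalization of each score list and the sample scores are likewise assumptions.

```python
from typing import List


def min_max(values: List[float]) -> List[float]:
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_scores(dense: List[float], sparse: List[float],
                  alpha: float) -> List[float]:
    """Combine per-document scores as alpha * dense + (1 - alpha) * sparse."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(dense) != len(sparse):
        raise ValueError("score lists must have the same length")
    d, s = min_max(dense), min_max(sparse)
    return [alpha * x + (1.0 - alpha) * y for x, y in zip(d, s)]


def rank(scores: List[float]) -> List[int]:
    """Return document indices ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical scores for three documents.
dense_scores = [0.9, 0.2, 0.5]   # e.g., cosine similarities
sparse_scores = [0.0, 3.0, 1.0]  # e.g., BM25 scores

print(rank(hybrid_scores(dense_scores, sparse_scores, alpha=0.7)))
print(rank(hybrid_scores(dense_scores, sparse_scores, alpha=0.2)))
```

With α = 0.7 the semantically closest document (index 0) ranks first, while α = 0.2 promotes the document with the strongest exact-term match (index 1), mirroring the conceptual versus entity-level use cases described above.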

Beyond these technical architectures, prompt engineering enables zero-shot adaptation without model parameter updates. Effective orthopedic prompts incorporate structured clinical templates, imaging findings, and differential diagnosis frameworks. In-context learning allows clinicians to embed few-shot examples within prompts, adapting generalist models to subspecialty contexts (e.g., spine surgery, sports medicine) without retraining [2]. Advanced prompting techniques include chain-of-thought reasoning for complex cases and multimodal prompts combining radiology reports with imaging descriptions. However, prompt-based approaches show performance ceilings, with GPT-4 achieving only 64.91% accuracy on orthopedic examinations without RAG augmentation [23].
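As a rough illustration of in-context adaptation, the sketch below assembles a few-shot prompt from a structured template. The template wording, the function name `build_few_shot_prompt`, and the example question-answer pairs are all hypothetical placeholders. No particular model API is assumed, so the function only returns the prompt string.

```python
from typing import List, Tuple


def build_few_shot_prompt(subspecialty: str,
                          examples: List[Tuple[str, str]],
                          findings: str,
                          question: str) -> str:
    """Assemble a few-shot prompt with a chain-of-thought instruction."""
    lines = [
        f"You are assisting an orthopedic surgeon specializing in {subspecialty}.",
        "Reason step by step before giving a final answer.",
        "",
    ]
    # Few-shot demonstrations adapt a generalist model without retraining.
    for i, (q, a) in enumerate(examples, start=1):
        lines += [f"Example {i}", f"Question: {q}", f"Answer: {a}", ""]
    lines += ["Imaging findings:", findings, "", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


# Placeholder demonstrations; real prompts would use vetted clinical content.
demo_examples = [
    ("Which imaging modality best evaluates suspected lumbar disc herniation?",
     "MRI of the lumbar spine."),
    ("What does severe joint space narrowing on a knee radiograph suggest?",
     "Advanced osteoarthritis."),
]
prompt = build_few_shot_prompt(
    "spine surgery",
    demo_examples,
    "Radiology report text goes here.",
    "What are the differential diagnoses?",
)
print(prompt)
```

Keeping the template in code makes it straightforward to swap the demonstrations per subspecialty while holding the instruction text fixed.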

Ultimately, the success of these adaptations is measured by clinical performance benchmarks, with standardized orthopedic examinations serving as the primary test of LLM competency. Performance varies significantly by model architecture and adaptation method, as presented in Table 1. Standalone models such as GPT-3.5 and Bard reach accuracies of 49.8% to 58%, placing them at or below the level of early-stage residents (PGY-1 to PGY-3) [23,33]. GPT-4 improves substantially, reaching 64.91% accuracy, which corresponds to the level of a PGY-5 senior resident, yet still falls short of the human surgeon benchmark [34]. The most significant finding in these results is the impact of Retrieval-Augmented Generation: when GPT-4 is augmented with RAG, accuracy increases to 73.80% [23], which is comparable to that of human orthopedic surgeons, whose reference accuracy typically ranges from approximately 74.2% to 75.3% [33]. These findings indicate that domain-specific grounding is algorithmically necessary to bridge the gap between general medical knowledge and specialized surgical expertise.

GPT-4 demonstrates superior performance on higher-order and image-associated questions compared to predecessors. However, orthopedic residents consistently outperform unaugmented LLMs, highlighting the gap between general medical knowledge and specialized orthopedic expertise. Diagnostic accuracy across studies ranges from 55 to 93%, with substantial heterogeneity in evaluation methodologies [33].

The transition of LLMs from general-purpose assistants to orthopedic specialists requires selecting adaptation frameworks that balance computational cost with clinical reliability. A key challenge is bridging the gap between broad pre-training and domain-specific reasoning without causing catastrophic forgetting. Parameter-efficient fine-tuning methods such as LoRA, together with retrieval-based grounding through RAG, provide stronger domain grounding, enabling models to approach or match human surgeon benchmarks. Table 2 summarizes these strategies and their trade-offs reported in recent orthopedic AI research.
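The computational appeal of LoRA can be seen from its standard formulation, reproduced here as a worked example. The layer dimensions and rank are illustrative assumptions and are not drawn from the reviewed studies. The scaling factor is written as $\alpha_{\mathrm{LoRA}}$ to distinguish it from the retrieval weight α used earlier.

```latex
% Frozen pretrained weight W_0 plus a trainable low-rank update
W = W_0 + \frac{\alpha_{\mathrm{LoRA}}}{r}\, B A,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)

% Trainable parameters per adapted matrix
N_{\mathrm{LoRA}} = r\,(d + k)
\qquad \text{versus} \qquad
N_{\mathrm{full}} = d\,k

% Illustrative case: d = k = 4096, r = 8
N_{\mathrm{LoRA}} = 8 \times 8192 = 65{,}536,
\qquad
N_{\mathrm{full}} = 4096^2 = 16{,}777{,}216,
\qquad
\frac{N_{\mathrm{LoRA}}}{N_{\mathrm{full}}} = \frac{1}{256} \approx 0.39\%
```

Because $W_0$ stays frozen, the pretrained representations are not overwritten, which is one reason low-rank adaptation is often discussed as a way to limit catastrophic forgetting.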

In terms of practical workflow integration, LLMs demonstrate immediate utility in automating repetitive documentation tasks. Studies show GPT-3 and ChatGPT can generate clinical letters and management plans for common orthopedic scenarios, reducing administrative burden. Applications include preoperative planning, where LLMs assist spinal surgeons by analyzing imaging reports and suggesting surgical approaches. Additional applications include postoperative documentation through the automated generation of operative notes and discharge summaries. However, accuracy limitations persist: 35% of initial lumbar spine report translations contained major omissions, and 6% had major inaccuracies. Enhanced prompting reduced omission rates to 7%, but inaccuracy rates remained stable. Multimodal LLMs represent the next frontier, with GPT-4Vision showing potential to integrate imaging data directly into diagnostic reasoning. Early applications include bone metastasis detection from scintigraphy, achieving an AUROC > 0.8 when developed through LLM-assisted programming. Finally, in orthopedic education, LLMs serve as interactive tools, providing explanations and engaging in Socratic dialog with trainees. Performance on board-style questions suggests utility for knowledge reinforcement, though current models cannot replace structured residency curricula.

While the reviewed literature unanimously positions LLMs as powerful reasoning engines, a significant contradiction persists regarding their deployment architectures. Several studies, such as Wiest et al. [35], highlight the trade-offs of large, cloud-based general-purpose models (e.g., GPT-4) regarding data privacy, whereas others like Labrak et al. [36] demonstrate that smaller, domain-specifically fine-tuned models (e.g., BioMistral) offer superior privacy preservation and reduced computational requirements. A major limitation identified across most architectural studies is the ‘hallucination trade-off’: techniques like RAG significantly reduce factual errors, achieving near 0% hallucination rates in some frameworks [37], but introduce higher computational latency [37,38], rendering them less suitable for real-time intraoperative assistance where sub-second responses are critical. Furthermore, few studies adequately address the ‘catastrophic forgetting’ phenomenon when fine-tuning models on specific medical datasets [39], highlighting a clear gap in continuous learning frameworks for surgical subspecialties.

3.2. Natural Language Understanding for Orthopedic Narratives and Operative Notes

Approximately 80% of healthcare data exists in an unstructured free-text format within electronic health records (EHRs), representing a largely untapped resource for clinical research and quality improvement. In the specialized field of orthopedics, this vast data repository includes operative notes, radiology reports, clinical narratives, and patient communications that document complex musculoskeletal procedures, complications, and outcomes. Natural Language Processing (NLP) has emerged as the pivotal computational layer capable of unlocking this rich narrative content. A systematic mapping of clinical NLP projects reveals a significant methodological shift from brittle rule-based extraction to robust transformer architectures. This transition mirrors the broader trajectory of biomedical informatics, where early pipelines coupled handcrafted lexicons with statistical classifiers before progressing to foundation models adaptable with minimal task-specific data [40].

Unstructured clinical narratives constitute the richest but historically least accessible stratum of the EHR. Comparative studies indicate that fewer than two-thirds of predefined quality indicators are recoverable from structured fields alone, underscoring the critical analytical value locked in free-text notes [41]. Transforming these narratives into computable signals begins with rigorous preprocessing to normalize the highly variable clinical text. Essential preprocessing steps such as tokenization, lemmatization, part-of-speech tagging, and spelling correction are applied to mitigate lexical noise. Domain-specific abbreviation disambiguation, which is critical for orthopedic terminology, along with stop-word removal, further restores implicit semantics prior to downstream modeling [42].
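A few of the preprocessing steps listed above can be sketched with the standard library alone. The abbreviation dictionary and stop-word list are small hypothetical samples. Expansion here is a plain dictionary lookup, which is simpler than true context-dependent disambiguation, and lemmatization, part-of-speech tagging, and spelling correction are omitted.

```python
import re
from typing import Dict, List, Set

# Hypothetical sample dictionary; a production system would need a curated,
# context-aware resource because many clinical abbreviations are ambiguous.
ABBREVIATIONS: Dict[str, str] = {
    "pt": "patient",
    "s/p": "status post",
    "orif": "open reduction internal fixation",
    "tka": "total knee arthroplasty",
    "fx": "fracture",
    "rom": "range of motion",
}
STOP_WORDS: Set[str] = {"the", "a", "an", "of", "with", "and", "was", "is", "to"}

# Keeps tokens such as "s/p" or "s72.001a" together.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:[./-][a-z0-9]+)*")


def preprocess(note: str) -> List[str]:
    """Lowercase, tokenize, expand abbreviations, and drop stop words."""
    output: List[str] = []
    for token in TOKEN_PATTERN.findall(note.lower()):
        if token in ABBREVIATIONS:
            # Expansions are kept whole, including any internal stop words.
            output.extend(ABBREVIATIONS[token].split())
        elif token not in STOP_WORDS:
            output.append(token)
    return output


print(preprocess("Pt s/p ORIF of the distal radius fx with limited ROM."))
```

Expanding abbreviations before removing stop words at the token level preserves multi-word terms such as "range of motion", which would otherwise lose their internal function word.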

This data transformation process is supported by a multi-layer architecture, illustrated in Figure 4, which is designed to support low-latency processing in high-throughput clinical environments. As depicted, the pipeline ingests clinical notes, HL7-formatted message streams, and PDF documents via streaming services such as Kafka for immediate de-identification and section segmentation. Following preprocessing, an ontology mapping module aligns text spans with standardized medical terminology from sources such as UMLS, SNOMED, or LOINC to ensure semantic consistency. The resulting structured outputs are stored in a multi-model persistence layer. However, maintaining consistency across these disparate systems (graph databases such as Neo4j for relationships, vector stores such as FAISS for semantic search, and document stores for logs) presents a significant data engineering challenge.

The choice of Hierarchical Navigable Small World (HNSW) indexing is theoretically grounded in its ability to maintain O(log N) search complexity even as the orthopedic dataset scales to millions of clinical records [25,43]. This efficiency is achieved through a multi-layered graph structure, analogous to a skip-list, where the top layers contain long-range edges for coarse-grained navigation, and the bottom layers provide fine-grained local connectivity [25,44]. Formally, for a dataset of size N, the search process involves a greedy traversal in which the number of distance evaluations is minimized by the small-world property, ensuring sub-linear retrieval times (e.g., <50 ms for 10^6 vectors), which is essential for real-time surgical assistant feedback [45,46].

Simple “plug-and-play” integration is insufficient; robust pipelines require Event-Driven Architectures (e.g., Change Data Capture) so that updates in the EHR are propagated promptly to all indices. Without this rigorous synchronization, the risk of serving stale or conflicting clinical data increases. The pipeline illustrated in Figure 4 represents a logical view of these components, emphasizing the necessity of an orchestration layer to manage the complex data lifecycle from ingestion to inference.
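The skip-list analogy behind HNSW can be made concrete by simulating only its layer-assignment step. In the original HNSW formulation, each inserted element draws a maximum layer l = ⌊−ln(U) · m_L⌋ with U uniform on (0, 1] and m_L = 1/ln(M), so the expected share of elements reaching layer k is M^(−k). The sketch below assumes M = 16 and a fixed random seed for illustration. It does not build the graph or perform search.

```python
import math
import random
from typing import List


def assign_levels(n_nodes: int, m: int = 16, seed: int = 0) -> List[int]:
    """Draw the top layer of each node as floor(-ln(U) * m_L), m_L = 1/ln(M)."""
    rng = random.Random(seed)
    m_l = 1.0 / math.log(m)
    levels = []
    for _ in range(n_nodes):
        u = 1.0 - rng.random()  # lies in (0, 1], which avoids log(0)
        levels.append(int(-math.log(u) * m_l))
    return levels


def layer_sizes(levels: List[int]) -> List[int]:
    """Count nodes present in each layer; a node of level l is in layers 0..l."""
    top = max(levels)
    return [sum(1 for lvl in levels if lvl >= k) for k in range(top + 1)]


levels = assign_levels(10_000)
sizes = layer_sizes(levels)
for k, size in enumerate(sizes):
    print(f"layer {k}: {size} nodes")
```

Every node appears in layer 0, and each higher layer holds a geometrically shrinking subset, so the number of layers grows only logarithmically with the dataset size. This is the structural basis for the coarse-to-fine traversal described above.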

Figure 4 presents the architectural synthesis of the complete NLU data lifecycle required to support clinical AI pipelines in orthopedic settings at scale. The figure integrates well-established NLP preprocessing steps, including tokenization, lemmatization, part-of-speech tagging, and domain-specific abbreviation disambiguation [47,48] that are crucial for orthopedic terminology. The conceptual contribution of Figure 4 lies in explicitly modeling the orchestration layer and change data capture mechanisms required to maintain consistency and prevent stale or conflicting clinical data across interconnected systems. The illustrated pipeline reflects the recognition that low-latency clinical inference at scale depends on more than accurate NLP models. It also requires robust and consistent data flows that prevent clinicians from accessing outdated or conflicting information during critical decision-making processes.

In the domain of orthopedics, the application of such architectures has witnessed exponential growth, with 90% of relevant studies published between 2019 and 2021. Clinical and operative notes currently constitute the largest application domain, accounting for 50% of orthopedic NLP studies. These applications have demonstrated remarkable precision in extracting granular surgical data [49]. In total knee arthroplasty (TKA), rule-based algorithms have successfully extracted key data elements, such as implant constraint type and patellar resurfacing status, from 20,000 operative notes with accuracy exceeding 98%. In addition, implant model identification algorithms have achieved an F1-score of 99.9% [40,50]. Similarly, for Prosthetic Joint Infection (PJI) detection, algorithms processing consultation notes and microbiology results have achieved a sensitivity of 0.887 and a specificity of 0.991, significantly outperforming administrative coding data [51].

Beyond operative documentation, approximately 36% of orthopedic NLP research focuses on extracting information from radiology reports [49]. In this sub-domain, models have demonstrated the ability to identify periprosthetic femur fractures with 100% sensitivity and determine Vancouver classifications with 94.8% specificity [40]. Deep learning approaches have largely supplanted bespoke feature engineering, with transformer-based models proving superior as general-purpose encoders. BioClinicalBERT has been utilized to classify treatment outcomes for proximal humerus fractures with 87% accuracy by analyzing the final 512 tokens of clinical notes, which typically contain the most relevant discharge status information [52]. Scaling to the billion-parameter regime, large models like GatorTron have delivered absolute gains of up to 9.6% on medical question-answering benchmarks [53].

As large language models (LLMs) mature and enter high-stakes clinical settings, orchestration frameworks such as MAI-Dx emerge to coordinate model reasoning and mitigate diagnostic uncertainty through collaborative agents. In this architecture, transformer-powered agents are assigned distinct clinical roles, including hypothesis formulation and checklist validation, acting as a virtual doctor panel engaged in “chain-of-thought” debate. Modern LLMs, such as GPT-4, have demonstrated quality comparable to that of physicians when assessing treatment recommendations based on knee and shoulder MRI reports, though with limitations in evaluating treatment urgency and patient context. Furthermore, the automation of CPT coding using ChatGPT-4 has demonstrated effectiveness in spine operative notes, promising a reduction in healthcare costs and coding errors [54]. The practical utility of such an integrated pipeline is further exemplified in Total Knee Arthroplasty (TKA) planning. In this clinical scenario, the system ingests unstructured data from a patient’s longitudinal EHR, including radiology reports describing ‘severe joint space narrowing’ and operative notes from prior arthroscopic interventions. By utilizing the HNSW-indexed vector store, the NLU module retrieves relevant surgical protocols and implant specifications in sub-50 ms. An agentic assistant then synthesizes this data into a ‘Digital Twin’ of the patient’s knee, highlighting potential risks such as prior hardware interference or bone loss. By integrating foundation models with FHIR-standardized data, the system provides the surgeon with a pre-operative checklist and a tailored surgical plan, reducing manual chart review time and ensuring that specific patient comorbidities are accounted for in the final prosthetic selection.

Despite these achievements, deployment faces challenges regarding data quality, overfitting, and interpretability. The lack of understandability and transparency in AI models leads to inadequate accountability, although attention mechanisms offer the dominant approach to explainability in healthcare. Privacy issues are addressed using tools such as the certified de-identification pipeline Philter V1.0, which removes 17 types of personal data with a re-identification risk of less than 0.025% [55]. Moreover, federated learning enables models to be trained across institutions without transferring the underlying data: patient records remain local, and only model parameters or updates are exchanged for aggregation. However, despite a tenfold increase in publications, fewer than 6% of studies reach routine deployment within the National Health Service, indicating a persistent gap between research achievements and clinical practice [56].
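The aggregation step of federated learning can be sketched as a weighted average of per-site parameters, in the style of federated averaging. In this scheme raw records stay on site and only parameter vectors are combined centrally. The flat parameter lists, site sizes, and function name are hypothetical, and practical systems add secure aggregation and multiple training rounds that are not shown.

```python
from typing import List, Sequence


def federated_average(site_weights: Sequence[Sequence[float]],
                      site_sizes: Sequence[int]) -> List[float]:
    """Average parameter vectors, weighting each site by its sample count."""
    if not site_weights or len(site_weights) != len(site_sizes):
        raise ValueError("need one sample count per site")
    total = sum(site_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    n_params = len(site_weights[0])
    averaged = [0.0] * n_params
    for weights, size in zip(site_weights, site_sizes):
        if len(weights) != n_params:
            raise ValueError("all sites must share the same model shape")
        for i, value in enumerate(weights):
            averaged[i] += value * size / total
    return averaged


# Two hypothetical hospitals with locally trained parameters.
hospital_a = [1.0, 2.0]   # trained on 100 local notes
hospital_b = [3.0, 4.0]   # trained on 300 local notes
global_model = federated_average([hospital_a, hospital_b], [100, 300])
print(global_model)
```

Weighting by sample count lets larger sites contribute proportionally more to the shared model, while no clinical note ever leaves the institution that holds it.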

3.3. Three-Dimensional Geometric Reasoning and Volumetric Intelligence

The orthopedic field is currently undergoing a fundamental paradigm shift from 2D approximation, characterized by X-ray-based templating, to 3D volumetric intelligence, a capability driven by the convergence of geometric deep learning, statistical shape models (SSMs), and automated reasoning systems [57]. This transition is not merely about enhanced visualization; rather, it represents the emergence of “reasoning” systems capable of autonomously solving complex spatial puzzles. These systems can determine how to reduce a comminuted fracture, predict optimal implant fit from partial data, or reconstruct full 3D bone density models from low-dose 2D imaging [58].

Figure 5 illustrates this architecture, which integrates diverse input modalities, including MSK imaging (weight-bearing X-rays, 3D CT reconstructions), clinical text (EHRs, notes), and kinematic time-series data (gait analysis, IMU sensors), each processed through a modality-specific encoder (Vision Transformer, Transformer encoder). These inputs are synthesized in a Multimodal Fusion Layer and processed by a pretrained biomedical LLM to generate actionable outputs such as diagnosis, report generation, treatment recommendations, risk prediction, and longitudinal forecasting. This arrangement extends the outputs from automated 3D reconstruction to longitudinal recovery forecasting, effectively solving spatial surgical problems before operative intervention.

At the core of this transformation are three technologies that function as the “brain” behind the 3D model. Unlike traditional 3D imaging, which is passive, geometric reasoning is active and capable of understanding anatomy. Statistical Shape Models (SSMs) serve as the foundation of this reasoning. By training on thousands of scans, these models learn the “modes of variation” in human anatomy, such as the curvature of a femur, the wear patterns of a glenoid, or the torsion of a tibia [57]. This reasoning capability allows algorithms to “hallucinate” or infer accurate 3D shapes from incomplete data. For instance, when presented with a partial view of a damaged knee, the AI can infer what the healthy anatomy should look like based on population statistics, thereby enabling precise reconstruction for patient-specific implants (PSI) [59]. This technology is currently used to design off-the-shelf implants that fit the vast majority of the population or to generate specific guides for complex cases.

Complementing SSMs is the application of Geometric Deep Learning and Graph Neural Networks (GNNs). GNNs can reason about relationships between anatomical landmarks; in orthognathic or trauma surgery, they can simulate how moving one bone fragment affects the soft tissue and alignment of connected structures, effectively “solving” the geometry of a fracture [60]. This is particularly applicable in the automated segmentation of complex structures, such as pelvic fractures or the temporal bone, where standard pixel-based methods often fail due to noise or metal artifacts.

The third core component is Volumetric Segmentation and Intelligence, which refers to the direct analysis of voxel data using architectures like the 3D U-Net [61]. Instead of merely delineating the boundary of a bone, these systems differentiate between cortical and trabecular bone quality, detect tumors with sub-millimeter precision, and calculate volumes automatically. New platforms are now analyzing the quality of the volume rather than just the shape, for example, analyzing 3D bone density to recommend screw trajectories that maximize hold strength in osteoporotic patients [62,63,64].

The integration of these technologies has established distinct classes of clinical tools, summarized in Table 3, which outlines the transition from 2D approximations to 3D volumetric intelligence across specific clinical domains. In the field of trauma and fracture management, AI systems are now capable of solving the spatial puzzles of complex fractures by matching bone fragments to healthy templates with high precision [60]. For joint replacement, commercialized algorithms can generate full 3D models from standard 2D X-rays, maintaining sub-millimeter accuracy and effectively eliminating the need for more radiation-intensive CT scans [65]. Moreover, in spine surgery, these systems have become the standard of care for advanced navigation, analyzing vertebral geometry to propose optimal pedicle screw paths that maximize bone purchase while protecting delicate neural structures [66]. These applications demonstrate that geometric reasoning has matured from a research concept into a clinical necessity that allows surgeons to solve complex surgical problems before the patient even enters the operating room.

Looking forward, the next frontier for volumetric intelligence is Functional Reasoning. The field is moving from static questions of fit to dynamic questions of movement. Future “4D” systems, combining volumetric data with time, will use geometric reasoning to predict kinematics, simulating how a specific implant alignment will affect gait or wear years post-operation [67]. Furthermore, intraoperative replanning will allow AI to update surgical plans in real time based on live video or digitized landmarks, compensating for bone deflection or soft-tissue tension changes [68]. Ultimately, 3D Geometric Reasoning has matured from a research topic to a clinical necessity. It has bridged the “data gap” in orthopedics, allowing surgeons to obtain CT-quality insights from X-ray-level inputs. For the orthopedic professional, Volumetric Intelligence is no longer just about visualization; it is about algorithmically solving the surgical problem before the patient enters the operating room [69].

3.4. Data Engineering for Clinical AI Pipelines

Integrating Large Language Models with HL7 FHIR exposes two distinct but complementary interoperability challenges at the input and output stages of the clinical AI pipeline. On the input side, direct ingestion of nested FHIR JSON introduces high-frequency syntactic noise that degrades attention and retrieval performance. This limitation is addressed through JSON-to-narrative linearization techniques, which flatten structured resources into semantically coherent natural-language representations prior to embedding [16,67]. Frameworks such as CLEAR demonstrate that this approach improves retrieval density and relevance compared with raw JSON indexing, particularly for guideline-driven clinical queries [70]. However, narrative linearization does not resolve the reciprocal output-side challenge: the inherently probabilistic nature of LLM decoding, which risks producing syntactically invalid or schema-noncompliant FHIR resources. Grammar-Constrained Decoding (GCD) addresses this by restricting generation to tokens permitted by grammars or finite-state machines derived from FHIR schemas, ensuring syntactic validity of generated resources. Critically, while narrative linearization enhances semantic access to clinical data and GCD enforces structural correctness, neither method alone guarantees clinical correctness. GCD functions as an interoperability safeguard rather than a fact-checking mechanism. As a result, it must be integrated with upstream grounding mechanisms, such as retrieval-augmented generation, and downstream verification layers to prevent clinically incorrect yet syntactically valid outputs from entering production EHR systems.
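The output-side constraint described above can be illustrated with a toy decoder. The following is a minimal sketch, not the method of any cited framework: the grammar, the vocabulary, and the `biased_scores` stand-in for model logits are all hypothetical, and the tokens are whole strings rather than the subword tokens a real LLM emits. The grammar covers only `resourceType` and `status`, so it is far from the full FHIR Observation schema.

```python
import json
from typing import Callable, Dict, List

# Hypothetical finite-state grammar: state -> {allowed token: next state}.
# "end" is the accepting state. Real GCD derives this from the FHIR schema.
GRAMMAR: Dict[str, Dict[str, str]] = {
    "start": {'{"resourceType": ': "rtype"},
    "rtype": {'"Observation"': "sep"},
    "sep": {', "status": ': "status"},
    "status": {'"final"': "close", '"preliminary"': "close"},
    "close": {"}": "end"},
}

# Vocabulary = all grammar tokens plus two tokens the grammar never allows.
VOCAB = sorted({t for edges in GRAMMAR.values() for t in edges} | {'"done"', "]"})


def constrained_decode(score_fn: Callable[[List[str]], Dict[str, float]],
                       max_steps: int = 16) -> str:
    """Greedy decoding where each step is restricted to grammar-legal tokens."""
    state, out = "start", []
    for _ in range(max_steps):
        if state == "end":
            break
        allowed = GRAMMAR[state]
        scores = score_fn(out)
        # The mask: tokens outside `allowed` are never considered,
        # regardless of how highly the model scores them.
        best = max(allowed, key=lambda tok: scores.get(tok, float("-inf")))
        out.append(best)
        state = allowed[best]
    if state != "end":
        raise RuntimeError("decoding stopped before reaching an accepting state")
    return "".join(out)


def biased_scores(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for LLM logits that prefers schema-invalid tokens."""
    scores = {tok: 0.0 for tok in VOCAB}
    scores['"done"'] = 5.0          # not a value this grammar permits
    scores["]"] = 4.0               # would break the JSON structure
    scores['"preliminary"'] = 1.0   # best-scoring legal status value
    return scores


resource = json.loads(constrained_decode(biased_scores))
print(resource)
```

Even though the stand-in scorer ranks two invalid tokens highest, the decoded string parses as JSON and stays inside the grammar. Nothing in the sketch checks whether `preliminary` is the clinically right status, which is the distinction the paragraph draws between structural and clinical correctness.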

Beyond structural validity, the algorithmic pipeline must address the “Context-Latency Trade-off” inherent in real-time clinical systems. While FHIR provides a robust graph of patient data, feeding raw, verbose JSON resources directly into an LLM’s context window is computationally inefficient and increases token consumption. To optimize this, the framework incorporates a semantic flattening layer that transforms complex FHIR bundles into dense, token-efficient summaries before ingestion. This ensures that the limited context window of the model is prioritized for high-value clinical reasoning rather than repetitive schema metadata.
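As an illustration of such a flattening layer, the sketch below reduces a small FHIR bundle to one short line per resource. It is an assumption-laden example rather than the layer used in any cited system: the choice of fields, the line templates, and the sample bundle are hypothetical, and only Observation and Condition resources are handled.

```python
import json
from typing import Any, Dict, Optional


def _label(concept: Dict[str, Any]) -> str:
    """Pick a readable label from a FHIR CodeableConcept."""
    if concept.get("text"):
        return concept["text"]
    for coding in concept.get("coding", []):
        if coding.get("display"):
            return coding["display"]
    return "unspecified"


def flatten_resource(res: Dict[str, Any]) -> Optional[str]:
    """Turn one resource into a short line, or None if the type is unsupported."""
    rtype = res.get("resourceType")
    if rtype == "Observation":
        qty = res.get("valueQuantity", {})
        value = f"{qty.get('value', '?')} {qty.get('unit', '')}".strip()
        when = res.get("effectiveDateTime", "date unknown")
        return f"{when}: {_label(res.get('code', {}))} = {value}"
    if rtype == "Condition":
        return f"Condition: {_label(res.get('code', {}))}"
    return None


def flatten_bundle(bundle: Dict[str, Any]) -> str:
    """Drop ids, meta, and URLs; keep one clinical line per supported resource."""
    lines = []
    for entry in bundle.get("entry", []):
        line = flatten_resource(entry.get("resource", {}))
        if line:
            lines.append(line)
    return "\n".join(lines)


# Hypothetical example bundle (identifiers are placeholders).
bundle = {
    "resourceType": "Bundle",
    "type": "collection",
    "entry": [
        {"fullUrl": "urn:uuid:example-1",
         "resource": {"resourceType": "Observation", "id": "example-1",
                      "meta": {"versionId": "1"}, "status": "final",
                      "code": {"coding": [{"display": "Knee flexion range of motion"}]},
                      "valueQuantity": {"value": 95, "unit": "deg"},
                      "effectiveDateTime": "2025-03-01"}},
        {"fullUrl": "urn:uuid:example-2",
         "resource": {"resourceType": "Condition", "id": "example-2",
                      "code": {"text": "Osteoarthritis of knee"}}},
    ],
}

summary = flatten_bundle(bundle)
print(summary)
print(len(json.dumps(bundle)), "->", len(summary), "characters")
```

Character count is used here only as a rough proxy for token consumption; an actual deployment would measure tokens with the target model's tokenizer.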

Furthermore, the inference latency of querying a live FHIR server can become a bottleneck for RAG-based decision support. To mitigate this, our framework suggests an asynchronous indexing strategy, where clinical data is pre-processed and synchronized into a vector database. This decouples the high-latency retrieval of structured records from the low-latency requirements of the inference engine, allowing the model to access historical patient context in milliseconds. By treating the interoperability layer as a high-speed context substrate, the system achieves the responsiveness required for point-of-care clinical applications.
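The decoupling described here can be sketched with a background indexing task and an in-memory index. This is a minimal illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, `VectorIndex` is a hypothetical linear-scan index rather than a production vector database, and the two records are invented examples.

```python
import asyncio
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class VectorIndex:
    """Hypothetical in-memory index queried by the inference path."""

    def __init__(self) -> None:
        self._items: Dict[str, Tuple[Counter, str]] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        self._items[doc_id] = (embed(text), text)

    def search(self, query: str, k: int = 1) -> List[str]:
        q = embed(query)
        ranked = sorted(self._items.values(),
                        key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


async def indexer(queue: asyncio.Queue, index: VectorIndex) -> None:
    """Background task: consume changed records and keep the index current."""
    while True:
        doc_id, text = await queue.get()
        index.upsert(doc_id, text)
        queue.task_done()


async def demo() -> List[str]:
    queue: asyncio.Queue = asyncio.Queue()
    index = VectorIndex()
    worker = asyncio.create_task(indexer(queue, index))
    # In practice these would arrive from FHIR change notifications.
    await queue.put(("obs-1", "Knee flexion range of motion 95 deg"))
    await queue.put(("note-1", "Prior arthroscopic meniscectomy of the left knee"))
    await queue.join()  # wait until the indexer has caught up
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    # The query path touches only the local index, never the FHIR server.
    return index.search("prior arthroscopic surgery")


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The design choice worth noting is that the query path never awaits the source system; staleness is bounded by how quickly the indexer drains its queue.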

3.5. Explainability Frameworks and Clinical Trust

Explainability and interpretability have emerged as foundational pillars for establishing clinical trust in AI-driven CDSSs across healthcare domains, particularly in high-risk fields such as orthopedics. According to [71], interpretability is essential for the application of clinical decision support systems (CDSSs) in healthcare settings. Interpretability is defined as a transparent model structure with clear input–output relationships and explainable AI algorithms, and it critically influences both clinician and patient acceptance of AI-powered recommendations [71]. The absence of explainability in black box AI models presents a significant barrier to adoption, as healthcare professionals require transparent and justifiable reasoning before incorporating AI-generated recommendations into clinical decision-making processes. Ref. [72] further emphasizes that explainability allows clinicians, surgeons, and patients to understand the contributing factors underlying AI-powered predictive models, thereby fostering trust and improving comprehension of reasoning processes that directly influence clinical outcomes in orthopedics [72]. The integration of explainability frameworks also aligns with regulatory requirements such as the General Data Protection Regulation, which mandates that individuals have the right to meaningful explanations of automated decision-making logic [71]. This requirement reinforces the necessity of developing transparent AI systems that can clearly articulate their decision pathways to both healthcare providers and patients seeking to understand orthopedic diagnoses, treatment recommendations, and prognostic assessments.

The practical implementation of explainability in orthopedic artificial intelligence systems requires the use of diverse methodological approaches that range from inherently interpretable ante hoc methods to post hoc explanation techniques applied to complex black box models. Ref. [73] describes several explainable AI approaches, including saliency maps that visually highlight regions in medical images which strongly influence diagnostic decisions. These methods are particularly valuable in orthopedic imaging analysis, where identifying specific features in X-ray, CT, and MRI scans is critical for fracture detection and classification.

Shapley values, a model-agnostic attribution technique, and explainable boosting machines, an inherently interpretable model class, have gained prominence for quantifying the contribution of individual input features to final predictions. These methods enable orthopedic surgeons to assess the reasoning underlying artificial intelligence outputs and to identify potential areas for improvement in preoperative surgical planning [73,74]. Recent evidence from [75] demonstrates that combining clinical explanations with machine learning outputs significantly improves clinician acceptance, trust, satisfaction, and system usability. Specifically, integrating SHAP-based visualizations with natural language explanations outperforms both results-only outputs and traditional SHAP visualizations alone.
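To make the attribution idea concrete, the sketch below computes exact Shapley values by enumerating feature subsets. It is an illustrative example only: the linear `risk_model`, its coefficients, and the input values are hypothetical, and absent features are replaced by a fixed baseline, which is one of several possible value functions. Libraries such as SHAP approximate this computation because exact enumeration grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(predict: Callable[[Sequence[float]], float],
                   x: Sequence[float],
                   baseline: Sequence[float]) -> List[float]:
    """Exact Shapley attribution of predict(x) - predict(baseline)."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            # Weight of any coalition of size r that excludes feature i.
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def risk_model(v: Sequence[float]) -> float:
    """Hypothetical linear risk score over three made-up features."""
    return 0.5 * v[0] + 2.0 * v[1] - 1.0 * v[2]


patient = [4.0, 1.0, 3.0]
reference = [0.0, 0.0, 0.0]
phi = shapley_values(risk_model, patient, reference)
print(phi)
```

For a linear model each attribution reduces to the coefficient times the feature's deviation from the baseline, and the attributions sum to the difference between the patient's score and the baseline score, which gives a quick sanity check on the implementation.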

In the context of orthopedic surgical guidance, explainable AI systems can provide interpretable justifications for recommendations related to implant selection, alignment parameters, and surgical strategy. This supports informed surgical decision making and reduces the risk of complications such as implant loosening and alignment errors that can negatively affect long-term patient outcomes [76].

The advancement of agentic AI in healthcare introduces new challenges for clinical trust that require transparency and accountability. Ref. [77] emphasizes that medical AI agents must provide transparent reasoning for their decisions, particularly when offering diagnostic suggestions or treatment strategies. Clinician understanding of artificial intelligence rationale is essential for informed decision-making and for fostering confidence in autonomous systems. Ref. [74] observes that healthcare professionals often struggle to trust complex machine learning models due to limited evaluation scope and reliance on specialized technical expertise [74].

The successful deployment of explainable AI in orthopedic agentic clinical assistants requires overcoming technical, regulatory, and organizational challenges while demonstrating clear improvements in diagnostic performance and clinician adoption. Ref. [78] illustrates this potential through DeepKneeExplainer, which combines high diagnostic accuracy with interpretable visual explanations for knee osteoarthritis assessment. Contemporary applications in orthopedic surgery demonstrate that AI-based preoperative 3D planning systems achieve superior accuracy in predicting prosthesis size and axial alignment compared to traditional 2D template planning. However, these systems require explainability features to justify why specific implant sizes and positioning angles are recommended for individual patients based on their unique anatomical characteristics [76]. Ref. [72] identifies several implementation barriers, including the trade-off between model interpretability and predictive accuracy. It also highlights challenges in explaining advanced deep neural networks with numerous hyperparameters and complex interactions. In addition, it emphasizes the need for user-centric interface designs that present explanations in formats clinicians can readily understand and trust without adding complexity to already demanding clinical workflows [72]. Addressing these implementation challenges requires collaborative interdisciplinary efforts involving AI practitioners, orthopedic specialists, regulatory entities, and patient advocates. These stakeholders must work together to establish clear standards, guidelines, and best practices for the deployment of explainable AI in orthopedic settings. In parallel, small-scale pilot projects, continuous user feedback collection, and comprehensive clinician training are essential to ensure successful integration into clinical practice.

4. The Application Layer: From Data to Clinical Insights

4.1. From CDSS to Agentic Surgical Assistants: Bridging Generative AI and Robotic Actuation

Large Language Models (LLMs) are increasingly employed in Clinical Decision Support Systems (CDSS) to assist with diagnosis, triage, preoperative templating, and treatment recommendations [24]. They can process clinical literature, interpret radiographic data, and extract guidelines to support evidence-based decision-making [79]. While these approaches show promise in areas such as preoperative patient selection and hospital resource planning, their reliability remains limited by training data biases and the risk of hallucinations. Optimal performance, therefore, depends on effective prompt engineering, particularly through instructive and question–answer formats [79]. However, the transition from passive support to “agentic” surgical assistance in orthopedics requires higher precision. In an attempt to augment traditional workflows, recent studies have tested LLMs within robotic-assisted surgical platforms, such as those used for Total Knee Arthroplasty (TKA) [80]. Unlike generic decision trees, these “agentic” models aim to integrate intraoperative kinematic data with preoperative CT-based plans to suggest optimal implant positioning [81]. Yet, when deployed without domain-specific fine-tuning, generic foundation models often fail to align with the stringent mechanical alignment protocols required in joint replacement surgery [9]. This highlights how a single error in biomechanical reasoning, such as miscalculating the varus/valgus angle, can significantly compromise implant longevity and patient safety. Evaluations of LLMs on complex musculoskeletal tasks reveal that models often struggle with spatial reasoning in 3D reconstruction tasks compared to expert surgeons, underscoring their unsuitability for autonomous robotic actuation without strict “human-in-the-loop” oversight [9].

Despite limitations, LLMs show promise in automating orthopedic-specific documentation, such as operative notes and discharge summaries for arthroplasty patients [82]. They may also reduce alert fatigue by filtering non-actionable CDSS warnings. Transparent, interpretable outputs remain essential for safe clinical use. In the educational domain, ChatGPT and specialized derivatives have been applied to orthopedic residency training to simulate oral board examinations [83]. In a controlled study involving orthopedic trainees, those who utilized AI-driven “Socratic” tutoring for case simulations showed improved retention of differential diagnoses for musculoskeletal pathologies [84]. The model provided personalized evaluations on surgical decision-making steps, such as “implant selection” and “soft tissue balancing,” which residents found highly relevant to real-world practice [84]. These findings support the use of LLMs in developing clinical reasoning when combined with human oversight.

Generative AI models like GPT-4 show potential in real-time diagnosis and treatment planning, recognizing fracture patterns from reports and summarizing complex magnetic resonance imaging (MRI) findings [85]. Although these models are nearing passing scores on medical licensing exams, hallucinations and oversimplified reasoning limit their safety; thus, they should support, not replace, clinical judgment. Furthermore, LLMs are being tested as interfaces for surgical robotics, translating natural language commands into executable robotic trajectories for bone resection [86]. While the model can accurately identify procedural steps, its ability to dynamically adjust to intraoperative complications (e.g., unexpected bone loss) remains inferior to that of experienced surgeons.

To improve accuracy and specificity, some LLMs now use Retrieval-Augmented Generation (RAG) to access authoritative orthopedic knowledge bases, such as AAOS guidelines [87]. This enhances diagnostic and treatment support, particularly for junior clinicians. Advanced frameworks now test LLMs on interpreting longitudinal musculoskeletal data to predict implant failure risks. Presented with hypothetical patient cases involving revision surgery, the model achieved high accuracy in suggesting graft choices and fixation methods when grounded in retrieval-based clinical evidence [88]. The results demonstrate that LLMs can effectively retrieve clinical information when supported by structured inputs, though enhancing model interpretability and ensuring compatibility with robotic surgical workflows remain critical challenges.

4.2. Predictive Analytics in Healthcare Using LLMs

Large Language Models (LLMs) have progressed from experimental chatbots to credible candidates for peri-operative decision support in orthopedic surgery. A recent review indicates that researchers are increasingly adapting text-native foundation models for musculoskeletal applications, such as implant failure risk stratification and postoperative recovery guidance. While early findings are promising, concerns regarding transparency and bias in training data, specifically the underrepresentation of diverse skeletal anatomies, remain prevalent [89].

Against this backdrop, several research groups have demonstrated that conditioning LLMs on pre-operative orthopedic narratives can yield highly discriminative forecasts of surgical outcomes. Recent studies utilizing LLaMA-based architectures have focused on predicting Discharge Disposition (e.g., home vs. skilled nursing facility) following Total Joint Arthroplasty (TJA) [90]. By analyzing unstructured social history and physical therapy notes, these models have achieved predictive performance comparable to or superior to standard ASA-based regression models. This performance highlights the models’ ability to capture subtle social determinants of health that influence discharge planning [90]. Furthermore, fine-tuned “health system-scale” models applied to electronic health records have demonstrated the ability to predict 30-day readmission rates and length of stay (LOS) with an AUC of 0.79, significantly outperforming traditional structured data indices [91].

Complementing single-institution findings, multi-center studies have explored the utility of LLMs in enhancing predictive models. It is critical to note that while LLMs excel at processing unstructured text, they are not inherently designed for precise continuous-variable regression. Consequently, the most robust architectures utilize LLMs as feature extractors rather than direct predictors [92]. In predicting Length of Stay (LOS), an LLM parses complex narrative notes to extract variables (e.g., “patient lives alone,” “history of slow recovery”) which are then fed into gradient-boosted decision trees or specialized regression heads. This neuro-symbolic approach leverages the LLM’s semantic understanding while relying on traditional statistical methods for numerical precision, mitigating the “arithmetic hallucinations” often observed in pure transformer models [92]. This underscores the necessity of integrating real-time intraoperative parameters for continuous-time forecasting.
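The two-stage pattern can be sketched as follows. Everything specific in this example is an assumption made for illustration: `stub_llm` is a keyword-based stand-in for a real model call, the feature names are invented, and the linear scoring head with made-up coefficients replaces the gradient-boosted trees mentioned above so that the sketch stays dependency-free.

```python
import json
from typing import Callable, Dict, Optional

FEATURES = ("lives_alone", "prior_slow_recovery")


def extract_features(note: str, llm: Callable[[str], str]) -> Dict[str, int]:
    """Stage 1: the LLM reads free text and returns structured flags only."""
    prompt = ("Return JSON with boolean keys lives_alone and "
              "prior_slow_recovery for this note:\n" + note)
    raw = json.loads(llm(prompt))
    # Keep only expected keys and coerce to 0/1, so the numeric model
    # never receives free text or unexpected fields.
    return {name: int(bool(raw.get(name, False))) for name in FEATURES}


def predict_los_days(features: Dict[str, int],
                     intercept: float = 2.0,
                     weights: Optional[Dict[str, float]] = None) -> float:
    """Stage 2: a conventional numeric model does the arithmetic.

    The intercept and weights are hypothetical placeholders, not fitted values.
    """
    weights = weights or {"lives_alone": 1.0, "prior_slow_recovery": 1.5}
    return intercept + sum(weights[k] * features[k] for k in FEATURES)


def stub_llm(prompt: str) -> str:
    """Keyword matcher standing in for a real LLM call (illustration only)."""
    note = prompt.split("\n", 1)[1].lower()
    return json.dumps({
        "lives_alone": "lives alone" in note,
        "prior_slow_recovery": "slow recovery" in note,
    })


note = "Patient lives alone; independent with ambulation before surgery."
features = extract_features(note, stub_llm)
print(features, predict_los_days(features))
```

The point of the split is that the language model never produces the number itself: it only fills a fixed schema, and the downstream model, which can be validated and calibrated with standard statistical tools, produces the estimate.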

Beyond aggregate endpoints, LLMs are now being trained to anticipate specific orthopedic complications. Researchers have fine-tuned clinical LLMs to identify Periprosthetic Joint Infection (PJI) by synthesizing longitudinal wound care notes and inflammatory marker trends [90]. These models reported significant gains in accuracy compared to rule-based keyword search methods, particularly in identifying complex infection cases often missed by structured coding [90]. Table 4 provides a comparative view of these LLM-based prediction tasks against established clinical standards.

Retrieval-Augmented Generation (RAG) provides another avenue for improving output fidelity. Recent evaluations show that while standard LLMs may hallucinate treatment plans, systems aligned with American Academy of Orthopaedic Surgeons (AAOS) guidelines achieve concordance rates exceeding 90% for conditions such as hip fractures. This alignment mitigates the risk of inappropriate medication or prophylaxis recommendations [93].

LLM-based analytics are also extending into post-acute musculoskeletal rehabilitation. In a cohort of postoperative patients, NLP algorithms were applied to extract granular physical therapy metrics, such as “Range of Motion” (ROM) degrees and exercise location, from unstructured therapist progress notes [94]. While traditional rule-based methods still excel at simple numeric extraction, LLMs demonstrated superior recall in interpreting complex narrative descriptions of movement quality (e.g., “Backward Plane” motion) [94]. By converting these unstructured narratives into longitudinal structured features, clinicians can now track recovery trajectories with greater precision. Additionally, sentiment analysis of patient-reported outcome measures (PROMs) via LLMs has been shown to correlate strongly with functional recovery scores, providing a psychological dimension to recovery prediction.

Methodologically, these studies converge on the importance of hybrid fusion combining tabular implant registry data with dense vector embeddings of clinical notes [91]. Despite advances, challenges in calibration remain; recent audits of LLM-generated risk scores show a tendency towards overconfidence, necessitating the use of uncertainty quantification methods.
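One common way to audit the overconfidence described here is the expected calibration error (ECE), which compares predicted probabilities with observed event rates within probability bins. The sketch below is a minimal binary-outcome version; the bin count and the example risk scores are hypothetical and are not taken from the audits cited above.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(probs: Sequence[float],
                               labels: Sequence[int],
                               n_bins: int = 5) -> float:
    """Weighted mean gap between predicted risk and observed event rate."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        event_rate = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / len(probs) * abs(event_rate - mean_conf)
    return ece


# Hypothetical overconfident model: predicts 90% risk, event occurs half the time.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
# Hypothetical calibrated model: predicts 50% risk, event occurs half the time.
calibrated = expected_calibration_error([0.5, 0.5], [1, 0])
print(overconfident, calibrated)
```

A large ECE flags a model whose stated risks cannot be taken at face value, which is the situation in which recalibration or explicit uncertainty estimates become necessary.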

Equally important is the governance layer within orthopedic registries. A roadmap for deploying AI in joint replacement registries calls for privacy-preserving fine-tuning strategies to handle sensitive implant performance data. Key safeguards proposed for orthopedic AI are summarized in Table 5 [89]. The governance framework presented in the table establishes the safeguards required for integrating LLM-driven analytics into routine orthopedic practice. To ensure that AI-generated treatment plans adhere to established AAOS or NASS clinical guidelines, the framework proposes reinforcement learning from human feedback as a mechanism for iterative alignment with surgeon expertise [80,85]. To address privacy concerns inherent in multi-center hospital networks and implant registries, federated learning is highlighted as an approach that enables collaborative model training while preserving patient data privacy [91]. The framework also emphasizes the importance of calibration checks [89,90] and uncertainty quantification methods such as Monte Carlo dropout [90] to reduce the risk of overestimating surgical success and to identify ambiguous cases that require human review. Finally, stratified bias audits are mandated to ensure that AI systems do not perpetuate historical disparities in arthroplasty access or pain management associated with patient characteristics such as race or body mass index [89].

The safeguards enumerated in Table 5 underscore that predictive accuracy alone is insufficient; LLM-driven analytics in orthopedics must be accompanied by mechanisms that ensure equity and safety, particularly when guiding decisions on irreversible surgical interventions [89].

4.3. Personalized Medicine and Patient-Facing Applications

LLMs have demonstrated substantial promise in personalizing orthopedic rehabilitation by integrating heterogeneous patient data into a cohesive “Digital Twin” framework. By ingesting structured EHR entries (e.g., surgical protocols, implant details) and unstructured narrative notes, these models generate tailored treatment suggestions that align closely with expert consensus. Recent evaluations indicate that GPT-4o can replicate specialist-designed rehabilitation regimens for knee osteoarthritis with high clinical alignment, achieving agreement rates of roughly 74% with physiotherapist consensus on core exercise parameters [95]. However, the true potential of these systems lies in their integration with longitudinal biomechanical data. Emerging multimodal frameworks propose fusing text-based clinical notes with time-series data from wearable sensors to dynamically adjust rehabilitation intensity, moving beyond static advice to actively manage patient recovery trajectories.

In patient support roles, LLMs act as “Semantic Translators,” bridging the gap between complex surgical terminology and patient health literacy. ChatGPT has achieved high accuracy in responding to Frequently Asked Questions (FAQs) regarding Total Hip Arthroplasty (THA), with studies showing it can answer patient queries with accuracy comparable to or exceeding standard online resources [96]. However, raw outputs from foundation models often exceed recommended reading levels; evaluations of AI-generated advice regarding platelet-rich plasma (PRP) therapy reveal that responses often score above a 16th-grade reading level, rendering them inaccessible to the average patient [97]. To address this, prompt engineering and domain-specific fine-tuning have been shown to significantly reduce the Flesch–Kincaid Grade Level (FKGL) of orthopedic instructions while preserving medical accuracy.
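For reference, the Flesch–Kincaid Grade Level is computed as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below implements that formula with a deliberately crude syllable counter (runs of vowels), which is an assumption of this example; published readability studies typically rely on dictionary-based or validated syllable counts, so absolute scores from this sketch should not be compared with theirs.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic: one syllable per run of vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level of a passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


simple = "The cat sat. The dog ran."
dense = "Postoperative rehabilitation necessitates individualized physiotherapy."
print(round(fkgl(simple), 2), round(fkgl(dense), 2))
```

A score computed this way can serve as an automatic check in a prompt-engineering loop: generated patient instructions that exceed a target grade level are sent back for simplification before release.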

RAG is critical for operationalizing these agents within the hospital ecosystem. Generative AI embedded within patient portals (e.g., OpenNotes) can contextualize patient messages and draft empathetic, clinically relevant responses in real time. Recent trials at major medical centers demonstrate that these drafts are often indistinguishable from provider-authored content [98]. By accessing up-to-date hospital ontologies, these systems reduce hallucinations and ensure that guidance on specific conditions, such as Kienböck’s disease, meets specialist quality thresholds, although readability remains a challenge [99]. Furthermore, utilizing LLMs to rewrite surgical consent forms for spine surgery has been shown to improve readability metrics significantly, potentially enhancing patient comprehension of procedural risks without diluting medico-legal warnings [100].

A comparative overview of these applications within the orthopedic workflow is presented in Table 6. It provides a balanced synthesis of the clinical utility and the persistent obstacles facing LLM applications in the orthopedic workflow. In rehabilitation planning, models can achieve a 74% agreement rate with expert physiotherapists, yet they often lack the granular specificity required for detailed exercise parameters like sets and repetitions [95]. While LLMs excel as semantic translators by simplifying surgical consent forms to a 6th-grade level, their raw outputs for other orthopedic queries often exceed college-level reading requirements, posing a barrier to patient literacy [96,100]. In addition, while patient portal agents can draft empathetic responses to messages, the architecture necessitates a provider-in-the-loop to validate responses for clinical safety [98]. Ultimately, while the models provide accurate summaries for rare conditions like Kienböck’s disease, readability remains a challenge for complex diseases [99], emphasizing that these tools should support rather than replace clinical judgment.

Despite these advances, significant risks remain. Reviews of AI applications in pain medicine have highlighted the potential for algorithmic bias, noting that without careful calibration, these models may perpetuate historical disparities in pain management recommendations for minority groups [101]. Furthermore, while explicit bias in opioid recommendations is not always detected in modern models like GPT-4, the “black box” nature of these systems necessitates rigorous fairness-aware testing before deployment in diverse populations [101]. Performance also degrades on tasks requiring multi-step reasoning, highlighting the urgent need for oversight.

Looking forward, the integration of LLMs into comprehensive orthopedic education platforms promises to strengthen shared decision-making. Studies assessing ChatGPT’s utility for ankle fracture education have found that it provides evidence-based, accurate answers to common patient questions, offering a scalable tool to close pre-operative informational gaps [102]. As these technologies mature, the focus must shift from static chatbots to multimodal “Recovery Agents” that actively monitor, educate, and alert both the patient and the surgeon, ensuring a continuous loop of personalized care.

4.4. The Agentic Interface for Surgical Robotics

The convergence of generative artificial intelligence and surgical robotics represents a fundamental architectural shift in orthopedics: moving from “passive execution” systems to “agentic collaborators.” Traditional robotic platforms, such as those widely used for Total Knee Arthroplasty (TKA) and Total Hip Arthroplasty (THA), rely heavily on rigid preoperative planning based on Computed Tomography (CT) segmentation [103]. While these systems excel at precise bone resection within pre-defined haptic boundaries, they lack semantic understanding of the intraoperative environment. They execute coordinates, not clinical intent. The next generation of “Agentic” robotic interfaces aims to bridge this gap by embedding a multimodal cognitive core capable of processing voice, vision, and kinematics directly into the robotic control loop [103,104].

One of the most immediate applications of Agentic AI is the reduction in cognitive load through natural language processing (NLP). In the current workflow, adjusting a surgical plan requires the surgeon to break the sterile field or instruct a technician to manipulate a touchscreen, creating latency and potential communication errors. Emerging “Text-to-Trajectory” frameworks allow surgeons to control robotic arms using semantic voice commands, such as “Increase the tibial varus cut by 2 degrees” or “Switch to fine-dissection mode” [104]. Large Language Models (LLMs) function here as “translation layers,” converting these unstructured verbal instructions into executable Python 3.10-based robotic code (e.g., ROS 2 commands) [104]. Recent evaluations in surgical environments demonstrate that LLM-driven voice control can achieve command recognition accuracy of 97%, significantly reducing the time required for intraoperative plan adjustments while allowing hands-free device manipulation [105]. This capability is particularly vital in complex revision surgeries, where the surgeon’s hands are occupied with instrument management and soft tissue retraction.
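
To make the “translation layer” idea concrete, the sketch below shows one possible structured target for a voice command. It is illustrative only: a regular expression stands in for the LLM, and the `PlanAdjustment` schema, the command grammar, and the parameter names are assumptions introduced here rather than details taken from the cited frameworks [104].

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlanAdjustment:
    """Structured result of one voice command (hypothetical schema)."""
    parameter: str  # e.g. "tibial_varus_cut"
    delta: float    # signed change requested by the surgeon
    unit: str       # "deg" or "mm"


# Hypothetical command grammar: "<increase|decrease> the <parameter> by <number> <unit>".
_COMMAND = re.compile(
    r"^(?P<verb>increase|decrease) the (?P<param>[a-z ]+?) by "
    r"(?P<value>\d+(?:\.\d+)?) (?P<unit>degrees?|mm)$"
)


def parse_command(utterance: str) -> Optional[PlanAdjustment]:
    """Map a transcribed utterance to a PlanAdjustment, or None if it does not fit."""
    text = utterance.strip().rstrip(".").lower()
    match = _COMMAND.match(text)
    if match is None:
        return None  # unknown phrasing is rejected rather than guessed
    sign = 1.0 if match["verb"] == "increase" else -1.0
    unit = "mm" if match["unit"] == "mm" else "deg"
    parameter = match["param"].replace(" ", "_")
    return PlanAdjustment(parameter, sign * float(match["value"]), unit)
```

Emitting a small, typed structure instead of free-form code keeps the output easy to validate before it reaches any robot-facing component; a command outside the grammar, such as a mode switch, simply returns `None` in this sketch.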

Beyond voice control, the “Agentic” interface enhances the robot’s visual perception. Current navigation systems rely on optical trackers drilled into the bone, which can loosen or suffer from line-of-sight occlusion. Newer computer vision systems integrate endoscopic video feeds with real-time depth sensing to perform semantic segmentation of anatomical structures on the fly [106]. Unlike static CT-based navigation, these agents can identify soft tissues that were not visible in the preoperative scan. For instance, during robotic spine surgery or knee replacement, computer vision algorithms can segment critical risk structures, such as the popliteal artery or exiting nerve roots, and automatically generate dynamic “No-Fly Zones” [106]. If the AI detects that the robotic burr is approaching a segmented nerve, it triggers an active haptic stop, physically preventing the surgeon from advancing. This creates a “dynamic safety envelope” that updates in real-time as the soft tissue deforms, addressing a major limitation of rigid registration systems. However, semantic segmentation accuracy varies substantially with anatomical structure size: while large solid organs achieve Dice scores above 0.88, smaller structures such as nerves achieve only 0.49, highlighting ongoing challenges in detecting critical fine anatomy [107].
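
The sensitivity of the Dice score to structure size can be illustrated with a toy calculation. The sketch below assumes binary masks flattened to 0/1 lists and applies the same five-pixel misalignment to a large and a small structure; the mask sizes are invented for illustration and are not data from [107].

```python
def dice_score(pred, truth):
    """Dice = 2|A and B| / (|A| + |B|) for two binary masks given as flat 0/1 lists."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 1.0 if total == 0 else 2.0 * intersection / total


# Same 5-pixel shift applied to a 100-pixel structure and a 10-pixel structure.
large_truth = [1] * 100 + [0] * 20
large_pred = [0] * 5 + [1] * 100 + [0] * 15
small_truth = [1] * 10 + [0] * 110
small_pred = [0] * 5 + [1] * 10 + [0] * 105

large_dice = dice_score(large_pred, large_truth)  # 95 of 100 pixels overlap
small_dice = dice_score(small_pred, small_truth)  # 5 of 10 pixels overlap
```

In this toy case an identical boundary error costs the large structure about 5 percentage points of Dice but halves the score of the small one, which is one plausible reason why fine anatomy such as nerves scores lower.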

The highest level of agentic capability lies in biomechanical reasoning. In TKA, achieving the perfect balance between flexion and extension gaps is often more art than science. Agentic frameworks are now being trained on thousands of successful operative logs to function as “Intraoperative Consultants” [108]. By analyzing real-time tensioner data from the robotic arm, the AI can predict the functional consequence of a proposed cut before it is made. For example, novel AI algorithms can systematically evaluate thousands of possible implant positioning solutions within 0.1 s, ranking them according to surgeon-defined targets for soft tissue tension [109]. In clinical validation, AI-guided robotic TKA achieved target extension and flexion gaps within ±1.5 mm in 92% of cases, compared to only 52% with manual planning [109,110]. The system analyzes medial and lateral compartment tensions, suggesting precise adjustments: “Medial tension is 20 Newtons higher than lateral. Suggest recutting the distal femur by 1.5 mm to balance. Proceed?” This moves the robot from a passive tool that cuts where told to an intelligent partner that optimizes implant longevity based on AAOS-aligned biomechanical principles.
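
The “evaluate many candidates, then rank” step can be sketched as an exhaustive grid search. This is a deliberately simplified illustration, not the algorithm of [109]: the one-to-one relation between resection offset and gap, the 19 mm target, and the function name `rank_candidates` are all assumptions made here.

```python
from itertools import product


def rank_candidates(ext_gap_mm, flex_gap_mm, target_mm=19.0, tol_mm=1.5,
                    step_mm=0.5, max_adjust_mm=3.0):
    """Return (cost, distal_offset, posterior_offset) tuples, best first."""
    steps = int(round(max_adjust_mm / step_mm))
    offsets = [i * step_mm for i in range(-steps, steps + 1)]
    ranked = []
    for distal, posterior in product(offsets, offsets):
        # Toy assumption: each extra millimetre of planned resection opens the
        # corresponding gap by one millimetre. Real planners use patient-specific models.
        ext = ext_gap_mm + distal
        flex = flex_gap_mm + posterior
        # Keep only plans whose predicted gaps fall inside the tolerance window.
        if abs(ext - target_mm) <= tol_mm and abs(flex - target_mm) <= tol_mm:
            cost = (ext - target_mm) ** 2 + (flex - target_mm) ** 2
            ranked.append((cost, distal, posterior))
    ranked.sort()
    return ranked
```

For predicted gaps of 17 mm in extension and 21 mm in flexion, the top-ranked candidate in this sketch is a 2 mm increase of the planned distal resection and a 2 mm decrease of the planned posterior resection. Offsets are relative to a plan that has not yet been executed, so a negative value means resecting less than planned.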

Despite these advancements, the transition from research labs to the operating theater faces the “Sim-to-Real” gap. Models trained on clean, synthetic bone geometries or virtual reality simulators often struggle to generalize to the chaotic visual environment of actual surgery, where blood, smoke, and osteophytes obscure landmarks [111]. While meta-analyses demonstrate that technical skills acquired through robotic VR simulator training can transfer to the operating room with positive correlation (r = 0.67, 95% CI 0.22–0.88), significant challenges remain in replicating the unpredictable tissue deformations and visual occlusions of live surgery [112]. Furthermore, the stochastic nature of LLMs introduces the risk of “robotic hallucinations”, where the model might misinterpret a command or suggest an unsafe trajectory. To mitigate both this risk and the latency inherent in retrieval-based systems, proposed architectures strictly separate the “cognitive” planning layer from the “reactive” control layer. While RAG-based LLMs are effective for pre-operative querying of guidelines or non-time-critical intraoperative decision support (e.g., verifying implant compatibility), they introduce latencies (often >20 s) that are incompatible with real-time haptic feedback loops. Therefore, the “agentic” interface does not directly drive the robotic actuator. Instead, the LLM functions as a high-level orchestrator that translates semantic intent into parameters for a deterministic “Safety Layer” (Geometric Constraint Solver). This local, low-latency controller operates at the frequency required for haptic stability (>1 kHz), verifying kinematic feasibility and enforcing “No-Fly Zones” before any motion is executed. This hierarchical approach ensures that the creativity of AI is utilized for planning, while the safety of execution is guaranteed by physics-based constraints.
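
A minimal sketch of the deterministic check such a “Safety Layer” might perform is given below. It assumes spherical no-fly zones and straight-line tool motion, both simplifications introduced here; `NoFlyZone`, `is_motion_allowed`, and the 2 mm default margin are hypothetical, and a production controller would run as hard real-time native code rather than Python.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NoFlyZone:
    center: tuple     # (x, y, z) in mm, robot base frame (assumed convention)
    radius_mm: float


def _distance_to_segment(point, start, end):
    """Shortest distance from a point to the line segment start-end."""
    seg = [e - s for s, e in zip(start, end)]
    rel = [p - s for s, p in zip(start, point)]
    length_sq = sum(c * c for c in seg)
    if length_sq == 0:
        return math.dist(point, start)
    # Clamp the projection so the closest point stays on the segment.
    t = max(0.0, min(1.0, sum(r * c for r, c in zip(rel, seg)) / length_sq))
    closest = [s + t * c for s, c in zip(start, seg)]
    return math.dist(point, closest)


def is_motion_allowed(tool_tip, target, zones, margin_mm=2.0):
    """Reject any straight-line move that passes within margin of a zone."""
    return all(
        _distance_to_segment(zone.center, tool_tip, target) >= zone.radius_mm + margin_mm
        for zone in zones
    )
```

The check is purely geometric and has no stochastic component, so the same planner output always yields the same accept or reject decision, which is the property the hierarchical design relies on.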

4.5. Biosignals in Rehabilitation and Challenges in Clinical Integration

Modern orthopedics is undergoing a fundamental architectural transformation, shifting focus from episodic, clinic-based interventions to continuous, home-based monitoring. While the precision of robotic-assisted surgery optimizes the technical execution of procedures such as total knee arthroplasty (TKA) or anterior cruciate ligament (ACL) reconstruction, it does not alone determine clinical success. The ultimate clinical outcome remains heavily dependent on patient compliance and physiological recovery during the rehabilitation phase. Historically, this postoperative period has functioned as a “black box,” where surgeons lack objective visibility into a patient’s daily activity and biomechanical quality. To bridge this gap, the concept of the “Orthopedic Digital Twin” is emerging as a critical framework. In this architecture, the AI Clinical Assistant evolves from a text-based interface into a multimodal “Rehabilitation Agent,” capable of fusing patient-reported outcomes with high-frequency objective biosignals to create a dynamic, virtual replica of the patient’s musculoskeletal status [113].

The backbone of this remote monitoring ecosystem is the interpretation of kinematic data through ubiquitous wearable technology. Traditional gait analysis, while accurate, relies on expensive optical laboratories that are inaccessible for daily tracking. To democratize this capability, emerging AI frameworks utilize consumer-grade Inertial Measurement Units (IMUs) embedded in smartwatches or specialized knee braces. These sensors generate noisy, high-dimensional data streams that are difficult to interpret using classical signal processing. However, recent machine learning models, particularly deep learning models applied to thigh- and shank-mounted IMUs, have shown effective signal denoising capabilities. As a result, 3D knee joint angles can be reconstructed with clinical precision, with mean errors of only 2–4 degrees relative to laboratory standards [114,115]. Furthermore, hybrid approaches combining IMU data with electromyography (EMG) via Transformer-based models can now estimate lower-limb joint moments with significantly reduced RMSE, enabling the monitoring of muscle-joint mechanics during walking [116]. For patients recovering from TKA or ACL reconstruction, these agentic systems actively analyze the quality of movement, differentiating patients from healthy controls by capturing altered stance times and stride asymmetries that correlate with functional impairment and re-injury risk [117].
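
As an illustration of how stance-time asymmetry could be quantified once gait events are available, the sketch below uses one common form of the symmetry index. The choice of index and the function names are assumptions made here, and detecting heel-strike and toe-off events from raw IMU signals is outside the scope of the example.

```python
def mean_stance_time(heel_strikes, toe_offs):
    """Average stance duration (s) from paired heel-strike and toe-off timestamps."""
    if not heel_strikes or len(heel_strikes) != len(toe_offs):
        raise ValueError("need equally many heel-strike and toe-off events")
    durations = [off - strike for strike, off in zip(heel_strikes, toe_offs)]
    return sum(durations) / len(durations)


def symmetry_index(left, right):
    """Percentage difference between limbs; 0 means perfectly symmetric."""
    mean = 0.5 * (left + right)
    if mean == 0:
        raise ValueError("both values are zero")
    return abs(left - right) / mean * 100.0


# Hypothetical event timestamps (s) for the operated (left) and healthy (right) limb.
left_stance = mean_stance_time([0.0, 1.1, 2.2], [0.7, 1.8, 2.9])
right_stance = mean_stance_time([0.55, 1.65], [1.15, 2.25])
asymmetry_pct = symmetry_index(left_stance, right_stance)
```

With these invented timestamps the stance times are about 0.70 s and 0.60 s, giving an asymmetry of roughly 15%; any threshold for clinical concern would have to come from validated studies rather than from this sketch.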

Complementing physical sensors, advancements in markerless computer vision are revolutionizing the assessment of range of motion (ROM). Novel pose-estimation algorithms, such as tailored versions of OpenPose, enable patients to perform rehabilitation exercises in front of a standard smartphone camera while the AI extracts skeletal keypoints in real-time. Validation studies indicate that these markerless systems can measure knee flexion and extension with a margin of error comparable to gold-standard goniometry, demonstrating high intra- and inter-rater reliability [118,119]. The “agentic” capability of these systems is particularly evident in their provision of real-time biofeedback; identifying abnormal movement patterns like dynamic knee valgus during exercises allows for therapist-like guidance in unsupervised home environments [120].
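
Once a pose estimator has returned hip, knee, and ankle keypoints, the knee angle reduces to plane geometry. The sketch below assumes 2D image coordinates from a roughly sagittal camera view, which is a simplification; `knee_flexion_deg` is a hypothetical helper and is not part of OpenPose.

```python
import math


def knee_flexion_deg(hip, knee, ankle):
    """Knee flexion in degrees from three 2D keypoints; 0 means a straight leg."""
    thigh = (hip[0] - knee[0], hip[1] - knee[1])
    shank = (ankle[0] - knee[0], ankle[1] - knee[1])
    norm = math.hypot(*thigh) * math.hypot(*shank)
    if norm == 0:
        raise ValueError("coincident keypoints")
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    cosine = max(-1.0, min(1.0, (thigh[0] * shank[0] + thigh[1] * shank[1]) / norm))
    included_angle = math.degrees(math.acos(cosine))
    return 180.0 - included_angle
```

For example, keypoints at (0, 0), (0, 1), and (0, 2) describe a fully extended leg, while moving the ankle to (1, 1) corresponds to 90 degrees of flexion. Out-of-plane motion such as dynamic knee valgus would need a frontal view or 3D keypoints.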

Furthermore, functional recovery involves not only structural kinematics but also neuromuscular integrity. A persistent challenge following knee surgery is arthrogenic muscle inhibition, where the quadriceps muscle fails to activate correctly. Wearable surface Electromyography (sEMG) sensors integrated into smart textiles are being paired with machine learning to characterize these altered activation patterns [120]. Recent work combining biomechanical modeling with EMG has estimated the mechanical impact of the vastus medialis and vastus lateralis on patellar loading, supporting targeted interventions to optimize activation ratios and prevent patellofemoral pain [121]. AI models trained on these features can detect changes related to muscle fatigue or injury risk, adapting exercise intensity to prevent chronic compensations [120].

The ultimate realization of the Orthopedic Digital Twin lies in the fusion of these disparate data streams. A standalone sensor might detect a limp, but it lacks the context provided by subjective experience. Therefore, advanced multimodal AI models are designed to integrate time-series biosignals with patient-reported outcome measures (PROMs) such as EQ-5D-5L. By processing both biomechanical metrics and subjective symptoms, these models have achieved prediction accuracy for patient satisfaction with an R² > 0.90, far exceeding models using single modalities alone [122]. This convergence transforms rehabilitation from a reactive process into a proactive, data-driven science, where the AI Agent serves as a continuous sentinel, ensuring that the precision achieved in the operating room is maintained throughout the critical months of recovery [113].
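
For reference, the coefficient of determination quoted above is conventionally defined as follows, where the symbols are introduced here for illustration: $y_i$ is the observed satisfaction score of patient $i$, $\hat{y}_i$ the model prediction, and $\bar{y}$ the mean observed score over $n$ patients.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
               {\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```

Under this definition, an R² above 0.90 means that the residual sum of squares is less than 10% of the total variance around the mean, that is, the model explains more than 90% of the variance in the outcome.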

Despite the theoretical robustness of FHIR-based pipelines described in current research [123], a gap remains between syntactic standardization and semantic interoperability. The literature highlights that while multimodal fusion (text + image) enhances diagnostic accuracy, current tokenization methods often fail to capture the full spatial nuance of high-resolution MRI or CT scans [5,124]. Moreover, a recurring limitation in existing agentic workflows is the lack of standardized validation metrics; most studies rely on retrospective accuracy rather than prospective clinical utility [125,126]. This discrepancy suggests that while the ‘plumbing’ of data pipelines is maturing, the ‘reasoning’ layer often struggles to interpret ambiguous clinical data in the same way a human surgeon would.

5. Discussion and Future Outlook

While the architectural pillars of LLM adaptation and multimodal fusion provide the structural foundation for modern orthopedic AI, a cross-study synthesis reveals a critical limitation in the current transformer architecture: the performance gap between semantic classification and continuous regression. The literature demonstrates that LLMs function as expert-level “pattern matchers” but unreliable “calculators.” For instance, architectures like Gemma-2 achieve 98% accuracy in binary classification tasks, such as distinguishing Parkinson’s medication states [127] and identifying postoperative complications in breast reconstruction [128]. However, this performance collapses when the architecture is forced into continuous-variable forecasting. The failure of GPT-4 Turbo to accurately predict hospital length-of-stay (MAE 4.5 days), despite high mortality prediction scores [129], highlights a fundamental architectural bottleneck. Compared to traditional gradient boosting methods, transformers optimized for next-token prediction struggle with the mathematical regression required for longitudinal orthopedic outcomes (e.g., range-of-motion recovery curves). Therefore, future system designs will likely adopt neuro-symbolic hybrids rather than relying on LLMs as end-to-end predictors for numeric data.

The review of interoperability pipelines identifies a severe trade-off between diagnostic depth and real-time responsiveness, particularly in Retrieval-Augmented Generation (RAG). While RAG is essential for grounding LLMs in evidence and reducing hallucinations, the retrieval mechanism introduces latency costs that are prohibitive for real-time applications such as robotic surgery or intraoperative navigation. Empirical data shows that retrieval processes account for over 40% of end-to-end latency, doubling Time-To-First-Token (TTFT) to nearly a second (965 ms) and pushing total response times for complex queries beyond 25 s [130,131]. While acceptable for asynchronous back-office coding, this latency renders standard RAG architectures incompatible with the sub-second loops required in the operating room. Consequently, deployment strategies must shift toward “selective RAG” or asynchronous pre-fetching [118,132], sacrificing some measure of dynamic retrieval for the speed required in acute clinical settings.
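
One way to picture asynchronous pre-fetching combined with a latency budget is sketched below. The class `PrefetchingRetriever`, the stand-in `slow_retrieve` coroutine, and the budget values are hypothetical; the example is only meant to show the control flow, not the behavior of any system in [130,131,132].

```python
import asyncio


async def slow_retrieve(query):
    """Stand-in for a vector-store lookup; sleeps to imitate retrieval latency."""
    await asyncio.sleep(0.05)
    return [f"guideline passage for {query}"]


class PrefetchingRetriever:
    def __init__(self, retrieve):
        self._retrieve = retrieve  # async callable: query -> list of passages
        self._cache = {}

    async def prefetch(self, queries):
        """Run before the time-critical phase, e.g. during pre-operative planning."""
        results = await asyncio.gather(*(self._retrieve(q) for q in queries))
        self._cache.update(zip(queries, results))

    async def lookup(self, query, budget_s):
        """Serve from cache if possible; otherwise retrieve only within the budget."""
        if query in self._cache:
            return self._cache[query], "cache"
        try:
            passages = await asyncio.wait_for(self._retrieve(query), timeout=budget_s)
            return passages, "live"
        except asyncio.TimeoutError:
            return [], "skipped"  # answer without retrieval rather than stall
```

Anticipated queries are answered from the cache with negligible delay, while an unanticipated query is retrieved live only if it fits the caller's budget. This is the trade described above: some dynamic retrieval is given up in exchange for bounded response time.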

The reliance on synthetic data introduces a complex trade-off between privacy and utility. As detailed in the literature, high-utility synthetic data often retains statistical “fingerprints” vulnerable to membership inference attacks, while rigorous differential privacy often degrades clinical precision to unacceptable levels. Consequently, the field is moving away from purely synthetic solutions toward Federated Learning and Trusted Research Environments (TREs). In a federated approach, the model travels to the data rather than the data traveling to the model, allowing institutions to collaborate on training without exposing raw patient records. While synthetic data remains valuable for initial system stress-testing and pipeline validation, scalable clinical deployment will likely depend on these decentralized training architectures to satisfy regulatory requirements like GDPR and HIPAA without compromising diagnostic accuracy.
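
The phrase “the model travels to the data” can be illustrated with a sample-weighted averaging step in the style of FedAvg, which is one common aggregation rule; the review does not prescribe a specific algorithm, and the function name and toy weight vectors below are assumptions.

```python
def federated_average(client_updates):
    """Combine locally trained weights without sharing patient records.

    client_updates: list of (weights, n_samples) pairs, one per institution,
    where weights is a list of floats of equal length across clients.
    """
    total = sum(n for _, n in client_updates)
    if total == 0:
        raise ValueError("no training samples reported")
    dim = len(client_updates[0][0])
    averaged = [0.0] * dim
    for weights, n in client_updates:
        if len(weights) != dim:
            raise ValueError("clients must share one model shape")
        for j, w in enumerate(weights):
            # Each client's contribution is proportional to its sample count.
            averaged[j] += w * n / total
    return averaged


# Two hypothetical hospitals: only weights and sample counts leave each site.
global_weights = federated_average([([1.0, 0.0], 100), ([3.0, 4.0], 300)])
```

Here the larger site contributes three quarters of the result, giving global weights of 2.5 and 3.0. In practice this step is usually combined with secure aggregation or differential privacy, since model updates themselves can leak information.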

Ultimately, the transition from experimental models to clinical utility depends on solving the “last mile” problem of autonomy. The maturity model of medical AI is shifting from passive information retrieval (chatbots) to agentic systems capable of chain-of-thought reasoning and tool use [5]. The current literature points toward a future where AI does not merely summarize notes but autonomously executes workflows, querying FHIR servers for bone density trends, cross-referencing surgical guidelines, and drafting orders [133]. However, this shift from “oracle” to “agent” exacerbates the legal and ethical “moral crumple zones” identified in recent frameworks [134]. As systems become more autonomous, the architectural requirement shifts from pure accuracy to explainability and auditability, ensuring that clinicians remain the “human-in-the-loop” to validate the AI’s reasoning before execution [135]. Success in orthopedic deployment will therefore depend not just on model weights, but on wrapping these models in secure, low-latency, and accountable control layers.
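
The workflow described above (query a FHIR server, summarize a trend, draft an order, wait for clinician approval) can be sketched in a few functions. The search parameters `patient`, `code`, and `_sort` are standard FHIR search parameters for `Observation`; the base URL, the placeholder code `EXAMPLE-CODE`, the function names, and the shape of the draft are assumptions made for illustration, and no network call is performed.

```python
from urllib.parse import urlencode


def build_observation_query(base_url, patient_id, code):
    """Build a FHIR search URL for one patient's observations, oldest first."""
    params = {"patient": patient_id, "code": code, "_sort": "date"}
    return f"{base_url.rstrip('/')}/Observation?{urlencode(params)}"


def value_trend(bundle):
    """Difference between the latest and earliest valueQuantity in a search Bundle."""
    points = []
    for entry in bundle.get("entry", []):
        obs = entry.get("resource", {})
        if obs.get("resourceType") == "Observation" and "valueQuantity" in obs:
            points.append((obs.get("effectiveDateTime", ""), obs["valueQuantity"]["value"]))
    points.sort()  # ISO 8601 dates in one format sort chronologically as strings
    return None if len(points) < 2 else points[-1][1] - points[0][1]


def draft_order(trend, approve):
    """Return a draft only if the clinician callback approves it; never auto-execute."""
    if trend is None:
        return None
    draft = {"status": "draft", "note": f"Bone density value changed by {trend:+.2f}"}
    return draft if approve(draft) else None
```

The `approve` callback is where the “human-in-the-loop” requirement becomes an architectural constraint: the agent can prepare the order, but nothing leaves the function without an explicit clinician decision, and the draft itself is an auditable record of what the agent proposed.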

As orthopedic AI matures, the field is poised to transition from static volumetric analysis to “4D” Functional Reasoning. While current geometric deep learning excels at anatomical reconstruction, the next frontier lies in predicting kinematics over time, simulating how a specific implant alignment will affect gait mechanics or polyethylene wear years post-operation. This evolution will likely drive the adoption of neuro-symbolic hybrids, which combine the semantic reasoning of LLMs with the mathematical precision of physics-based solvers, effectively addressing the regression deficits inherent in pure transformer architectures.

Simultaneously, the model of care will shift from “Oracle” to “Agent”: as noted above, such systems will autonomously execute workflows rather than merely summarize notes [133], which brings the legal and ethical “moral crumple zones” [134] and the need for explainable, auditable, human-in-the-loop control [5,135] to the foreground.

6. Conclusions

The maturation of orthopedic AI from experimental pilots to essential clinical utilities relies on a comprehensive “System of Systems” approach that harmonizes algorithmic rigor with semantic interoperability. This review highlights that the successful deployment of intelligent assistants is not merely a function of model size, but of specialized domain adaptation through techniques like Retrieval-Augmented Generation and parameter-efficient fine-tuning, which effectively balance computational efficiency with clinical accuracy. By bridging the “data gap” through Volumetric Intelligence and the fusion of multimodal biosignals, these systems enable a transition from 2D approximation to precise 3D reasoning and continuous patient monitoring. As these technologies evolve, the focus must shift toward ensuring robust governance, data privacy, and explainability to foster trust among clinicians and patients alike. Ultimately, these agentic systems will not replace the orthopedic surgeon but will serve as powerful cognitive augmentations, “solving” surgical problems algorithmically to ensure that precision in the operating room is matched by optimized, data-driven recovery trajectories.

Author Contributions

Conceptualization, A.B., Z.B., B.I., B.A., N.T. and K.O.; methodology, A.B., Z.B., B.A., B.I., N.T., K.O., Z.A. and C.A.; software, A.B., Z.B., B.I., B.A. and N.T.; validation, B.I., K.O., Z.A., C.A. and N.K.; formal analysis, B.A., N.T., B.I., K.O. and N.K.; investigation, A.B., Z.B., B.A., Z.A., C.A., N.K. and K.O.; resources, A.B., N.T., B.I., K.O., N.K., C.A. and Z.A.; data curation, Z.B., B.A., A.B. and N.T.; writing—original draft preparation, Z.B., B.A., A.B., N.T. and N.K.; writing—review and editing, B.I., K.O., C.A. and Z.A.; visualization, Z.B., B.A., A.B., N.T. and Z.A.; supervision, B.I., K.O., Z.A., C.A. and N.K.; project administration, B.I. and K.O.; funding acquisition, K.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Science and Higher Education of the Republic of Kazakhstan (Grant No. BR24992820).

Data Availability Statement

Not applicable.

Acknowledgments

We would like to express our sincere gratitude to the Research Team from Apex Laboratory for their essential contributions and support, which greatly facilitated the progress of this research.

Conflicts of Interest

Authors Assiya Boltaboyeva and Zhanel Baigarayeva were employed by the company LLP “Kazakhstan R&D Solutions”. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

LLMs	Large Language Models
EHRs	Electronic Health Records
MSK	Musculoskeletal
RAG	Retrieval-Augmented Generation
Grad-CAM	Gradient-weighted Class Activation Mapping
SHAP	SHapley Additive exPlanations
CNNs	Convolutional Neural Networks
FHIR	Fast Healthcare Interoperability Resources
HNSW	Hierarchical Navigable Small World
FAISS	Facebook AI Similarity Search
BERT	Bidirectional Encoder Representations from Transformers
LoRA	Low-Rank Adaptation
AAOS	American Academy of Orthopaedic Surgeons
R²	Coefficient of Determination
IMUs	Inertial Measurement Units
sEMG	Surface Electromyography
ROM	Range of Motion
TKA	Total Knee Arthroplasty
THA	Total Hip Arthroplasty
PJI	Periprosthetic Joint Infection
PROMs	Patient-Reported Outcome Measures
CT	Computed Tomography
MRI	Magnetic Resonance Imaging
AUC	Area Under the Curve
BMI	Body Mass Index
TRIP	Trustworthy, Reliable, Interoperable Pipelines
GPT-4	Generative Pre-trained Transformer 4
ROS	Robot Operating System
MAI-Dx	Medical Artificial Intelligence Diagnostic
CPT	Current Procedural Terminology
API	Application Programming Interface
FME	Finite Model Engineering
GCD	Grammar-Constrained Decoding
CFG	Context-Free Grammar
FSM	Finite State Machine
HCI	Human–Computer Interaction
AI	Artificial Intelligence
GNNs	Graph Neural Networks
VR	Virtual Reality
IRL	Inverse Reinforcement Learning
GPU	Graphics Processing Unit
MAE	Mean Absolute Error
ICD-10	International Classification of Diseases, 10th edition
LOS	Length of Stay
TKR	Total Knee Replacement

References

Koh, E.; Sunil, R.S.; Lam, H.Y.I.; Mutwil, M. Confronting the data deluge: How artificial intelligence can be used in the study of plant stress. Comput. Struct. Biotechnol. J. 2024, 23, 3454–3466.
Maity, S.; Saikia, M.J. Large Language Models in Healthcare and Medical Applications: A Review. Bioengineering 2025, 12, 631.
Haltaufderheide, J.; Ranisch, R. The ethics of ChatGPT in medicine and healthcare: A systematic review on Large Language Models (LLMs). NPJ Digit. Med. 2024, 7, 183.
Li, X.; Chen, S.; Meng, M.; Wang, Z.; Jiang, H.; Hao, Y. Research progress and implications of the application of large language model in shared decision-making in China’s healthcare field. Front. Public Health 2025, 13, 1605212.
Moor, M.; Banerjee, O.; Shakeri Hossein Abad, Z.S.; Krumholz, H.M.; Leskovec, J.; Topol, E.J.; Rajpurkar, P. Foundation Models for Generalist Medical Artificial Intelligence. Nature 2023, 616, 259–265.
Guermazi, A.; Tannoury, C.; Kompel, A.J.; Murakami, A.M.; Ducarouge, A.; Gillibert, A.; Li, X.; Tournier, A.; Lahoud, Y.; Jarraya, M.; et al. Improving Radiographic Fracture Recognition Performance and Efficiency Using Artificial Intelligence. Radiology 2022, 302, 627–636.
Mohamed, A.; Elasad, A.; Fuad, U.; Pengas, I.; Elsayed, A.; Bhamidipati, P.; Salib, P. Artificial Intelligence in Trauma and Orthopaedic Surgery: A Comprehensive Review From Diagnosis to Rehabilitation. Cureus 2025, 17, e92280.
Zheng, Z.; Ryu, B.Y.; Kim, S.E.; Song, D.S.; Kim, S.H.; Park, J.W.; Ro, D.H. Deep learning for automated hip fracture detection and classification: Achieving superior accuracy. Bone Jt. J. 2025, 107-B, 213–220.
Han, F.; Huang, X.; Wang, X.; Chen, Y.F.; Lu, C.; Li, S.; Lu, L.; Zhang, D.W. Artificial Intelligence in Orthopedic Surgery: Current Applications, Challenges, and Future Directions. MedComm 2025, 6, e70260.
Mekki, Y.M.; Luijten, G.; Hagert, E.; Belkhair, S.; Varghese, C.; Qadir, J.; Solaiman, B.; Bilal, M.; Dhanda, J.; Egger, J.; et al. Digital twins for the era of personalized surgery. NPJ Digit. Med. 2025, 8, 283.
Ayaz, M.; Pasha, M.F.; Alahmadi, T.J.; Abdullah, N.N.B.; Alkahtani, H.K. Transforming Healthcare Analytics with FHIR: A Framework for Standardizing and Analyzing Clinical Data. Healthcare 2023, 11, 1729.
Semenov, I.; Osenev, R.; Gerasimov, S.; Kopanitsa, G.; Denisov, D.; Andreychuk, Y. Experience in Developing an FHIR Medical Data Management Platform to Provide Clinical Decision Support. Int. J. Environ. Res. Public Health 2020, 17, 73.
Amann, J.; Blasimme, A.; Vayena, E.; Frey, D.; Madai, V.I. Explainability for Artificial Intelligence in Healthcare: A Multidisciplinary Perspective. BMC Med. Inform. Decis. Mak. 2020, 20, 310.
Daram, S. Explainable AI in Healthcare: Enhancing Trust, Transparency, and Ethical Compliance in Medical AI Systems. Int. J. AI BigData Comput. Manag. Stud. 2025, 6, 11–20.
Yuan, K.; Yoon, C.H.; Gu, Q.; Munby, H.; Walker, S.A.; Zhu, T.; Eyre, W.D. Transformers and large language models are efficient feature extractors for electronic health record studies. Commun. Med. 2025, 5, 83.
Xie, Q.; Chen, Q.; Chen, A.; Peng, C.; Hu, Y.; Lin, F.; Peng, X.; Huang, J.; Zhang, J.; Keloth, V.; et al. Medical foundation large language models for comprehensive text analysis and beyond. NPJ Digit. Med. 2025, 8, 141.
Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685.
Artsi, Y.; Sorin, V.; Glicksberg, B.S.; Korfiatis, P.; Freeman, R.; Nadkarni, G.N.; Klang, E. Challenges of Implementing LLMs in Clinical Practice: Perspectives. J. Clin. Med. 2025, 14, 6169.
Vrdoljak, J.; Boban, Z.; Vilovic, M.; Kumrić, M.; Bozic, J. A Review of Large Language Models in Medical Education, Clinical Decision Support, and Healthcare Administration. Healthcare 2025, 13, 603.
Shi, W.; Xu, R.; Zhuang, Y.; Yu, Y.; Sun, H.; Wu, H.; Yang, C.; Wang, M.D. MedAdapter: Efficient Test-Time Adaptation of Large Language Models Towards Medical Reasoning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), Miami, FL, USA, 12–16 November 2024; pp. 22294–22314.
Vaid, A.; Landi, I.; Nadkarni, G.; Nabeel, I. Using fine-tuned large language models to parse clinical notes in musculoskeletal pain disorders. Lancet Digit. Health 2023, 5, e855–e858.
Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS). arXiv 2024.
Eskenazi, J.; Krishnan, V.; Konarzewski, M.; Constantinescu, D.; Lobaton, G.; Dodds, S. Evaluating retrieval augmented generation and ChatGPT’s accuracy on orthopaedic examination assessment questions. Ann. Jt. 2025, 10, 12.
Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-T.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv 2020, arXiv:2005.11401.
Malkov, Y.A.; Yashunin, D.A. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 824–836.
Lee, H.; Park, T.; Na, Y.; Kim, W.-H. P-HNSW: Crash-Consistent HNSW for Vector Databases on Persistent Memory. Appl. Sci. 2025, 15, 10554.
Zakka, C.; Shad, R.; Chaurasia, A.; Dalal, A.R.; Kim, J.L.; Moor, M.; Fong, R.; Phillips, C.; Alexander, K.; Ashley, E.; et al. Almanac—Retrieval-Augmented Language Models for Clinical Medicine. NEJM AI 2024, 1, 2.
de la Torre, J. Scalable unit harmonization in medical informatics via Bayesian-optimized retrieval and transformer-based re-ranking. Int. J. Med. Inform. 2026, 206, 106180.
Liu, S.; McCoy, A.B.; Wright, A. Improving large language model applications in biomedicine with retrieval-augmented generation: A systematic review, meta-analysis, and clinical development guidelines. J. Am. Med. Inform. Assoc. 2025, 32, 605–615.
Sivarajkumar, S.; Mohammad, H.A.; Oniani, D.; Roberts, K.; Hersh, W.; Liu, H.; He, D.; Visweswaran, S.; Wang, Y. Clinical Information Retrieval: A Literature Review. J. Healthc. Inform. Res. 2024, 8, 313–352.
Luan, Y.; Eisenstein, J.; Toutanova, K.; Collins, M. Sparse, Dense, and Attentional Representations for Text Retrieval. Trans. Assoc. Comput. Linguist. 2021, 9, 329–345.
Matsuo, R.; Ho, T.B. Semantic term weighting for clinical texts. Expert Syst. Appl. 2018, 114, 543–551.
Mo, K.; Lin, R.; Dunn, E.; Girgis, G.; Fang, W.; Walsh, J.; Banyai-Flores, N.; Watson, T.; Lee, D. Systematic Review on Large Language Models in Orthopaedic Surgery. J. Clin. Med. 2025, 14, 5876.
Xu, A.Y.; Singh, M.; Balmaceno-Criss, M.; Oh, A.; Leigh, D.; Daher, M.; Alsoof, D.; McDonald, C.L.; Diebo, B.G.; Daniels, A.H. Comparative performance of artificial intelligence-based large language models on the orthopedic in-training examination. J. Orthop. Surg. 2025, 33, 1.
Wiest, I.C.; Ferber, D.; Zhu, J.; van Treeck, M.; Meyer, S.K.; Juglan, R.; Carrero, Z.I.; Paech, D.; Kleesiek, J.; Ebert, M.P.; et al. Privacy-preserving large language models for structured medical information retrieval. NPJ Digit. Med. 2024, 7, 257.
Labrak, Y.; Bazoge, A.; Morin, E.; Gourraud, P.-A.; Rouvier, M.; Dufour, R. BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains. In Findings of the Association for Computational Linguistics: ACL 2024; Association for Computational Linguistics: Bangkok, Thailand, 2024.
Tung, J.Y.M.; Le, Q.; Yao, J.; Huang, Y.; Lim, D.Y.Z.; Sng, G.G.R.; Lau, R.S.E.; Tan, Y.G.; Chen, K.; Tay, K.J.; et al. Performance of Retrieval-Augmented Generation Large Language Models in Guideline-Concordant Prostate-Specific Antigen Testing: Comparative Study With Junior Clinicians. J. Med. Internet Res. 2025, 27, e78393.
Xu, S.; Yan, Z.; Dai, C.; Wu, F. MEGA-RAG: A retrieval-augmented generation framework with multi-evidence guided answer refinement for mitigating hallucinations of LLMs in public health. Front. Public Health 2025, 13, 1635381.
Li, H.; Ding, L.; Fang, M.; Tao, D. Revisiting catastrophic forgetting in large language model tuning. In Findings of the Association for Computational Linguistics: EMNLP 2024; Association for Computational Linguistics: Miami, FL, USA, 2024.
Wyatt, J.M.; Booth, G.J.; Goldman, A.H. Natural Language Processing and Its Use in Orthopaedic Research. Curr. Rev. Musculoskelet. Med. 2021, 14, 392–396.
Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans. Comput. Healthc. 2021, 3, 15.
Scholte, M.; van Dulmen, S.A.; Neeleman-Van der Steen, C.W.; van der Wees, P.J.; Nijhuis-van der Sanden, M.W.; Braspenning, J. Data extraction from electronic health records (EHRs) for quality measurement of the physical therapy process: Comparison between EHR data and survey data. BMC Med. Inform. Decis. Mak. 2016, 16, 141.
Karanikolas, N.N.; Manga, E.; Samaridi, N.; Stergiopoulos, V.; Tousidou, E.; Vassilakopoulos, M. Strengths and Weaknesses of LLM-Based and Rule-Based NLP Technologies and Their Potential Synergies. Electronics 2025, 14, 3064.
Wang, M.; Xu, X.; Yue, Q.; Wang, Y. A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search. Proc. VLDB Endow. 2021, 14, 1964–1978.
Zhao, J.; Both, J.P.; Rodriguez-R, L.M.; Konstantinidis, K.T. GSearch: Ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs. Nucleic Acids Res. 2024, 52, e74.
Hall, B.W.; Kaushik, M.J. Retrieval augmented docking using hierarchical navigable small-world graphs. J. Chem. Inf. Model. 2024, 64, 7357–7367.
Gupta, D.; Loane, R.; Gayen, S.; Demner-Fushman, D. Medical image retrieval via nearest neighbor search on pre-trained image features. Knowl.-Based Syst. 2023, 278, 110907.
Piasta, K.; Kotas, R. Comparative Analysis of Natural Language Processing Techniques in the Classification of Press Articles. Appl. Sci. 2025, 15, 9559.
Farrow, L.; Raja, A.; Zhong, M.; Anderson, L. A systematic review of natural language processing applications in Trauma & Orthopaedics. Bone Jt. Open 2025, 6, 264–274.
Sagheb, T.E.; Ramazanian, T.; Tafti, A.P.; Fu, S.; Kremers, W.K.; Berry, D.J.; Lewallen, D.G.; Sohn, S.; Maradit Kremers, H. Use of Natural Language Processing Algorithms to Identify Common Data Elements in Operative Notes for Knee Arthroplasty. J. Arthroplast. 2021, 36, 922–926.
Fu, S.; Wyles, C.C.; Osmon, D.R.; Carvour, M.L.; Sagheb, E.; Ramazanian, T.; Kremers, W.K.; Lewallen, D.G.; Berry, D.J.; Sohn, S.; et al. Automated Detection of Periprosthetic Joint Infections and Data Elements Using Natural Language Processing. J. Arthroplast. 2021, 36, 688–692.
Floyd, S.B.; Almeldien, A.G.; Smith, D.H.; Judkins, B.; Krohn, C.E.; Reynolds, Z.C.; Jeray, K.; Obeid, J.S. Using Artificial Intelligence to Develop a Measure of Orthopaedic Treatment Success from Clinical Notes. Front. Digit. Health 2025, 7, 1523953.
Yang, X.; Chen, A.; PourNejatian, N.; Shin, H.C.; Smith, K.E.; Parisien, C.; Compas, C.; Martin, C.; Costa, A.B.; Flores, M.G.; et al. A Large Language Model for Electronic Health Records. NPJ Digit. Med. 2022, 5, 194. [Google Scholar] [CrossRef] [PubMed]
Zaidat, B.; Lahoti, Y.S.; Yu, A.; Mohamed, K.S.; Cho, S.K.; Kim, J.S. Artificially Intelligent Billing in Spine Surgery: An Analysis of a Large Language Model. Glob. Spine J. 2025, 15, 1113–1120. [Google Scholar] [CrossRef]
Radhakrishnan, L.; Schenk, G.; Muenzen, K.; Oskotsky, B.; Ashouri Choshali, H.; Plunkett, T.; Israni, S.; Butte, A.J. A Certified Deidentification System for All Clinical Text Documents for Information Extraction at Scale. JAMIA Open 2023, 6, ooad045. [Google Scholar] [CrossRef]
Peng, L.; Luo, G.; Zhou, S.; Chen, J.; Xu, Z.; Sun, J.; Zhang, R. An In-depth Evaluation of Federated Learning on Biomedical Natural Language Processing for Information Extraction. NPJ Digit. Med. 2024, 7, 127. [Google Scholar] [CrossRef] [PubMed]
Oeding, J.F.; Champagne, A.A.; Hurley, E.T.; Samuelsson, K. Harnessing Deep Learning and Statistical Shape Modelling for Three-Dimensional Evaluation of Joint Bony Morphology. J. Exp. Orthop. 2024, 11, e70070. [Google Scholar] [CrossRef] [PubMed]
Xiao, D.; Lian, C.; Deng, H.; Kuang, T.; Liu, Q.; Ma, L.; Kim, D.; Lang, Y.; Chen, X.; Gateno, J.; et al. Estimating Reference Bony Shape Models for Orthognathic Surgical Planning Using 3D Point-Cloud Deep Learning. IEEE J. Biomed. Health Inform. 2021, 25, 2958–2966. [Google Scholar] [CrossRef]
Betti, V.; Aldieri, A.; Cristofolini, L. A Statistical Shape Analysis for the Assessment of the Main Geometrical Features of the Distal Femoral Medullary Canal. Front. Bioeng. Biotechnol. 2024, 12, 1250095. [Google Scholar] [CrossRef] [PubMed]
Han, R.; Uneri, A.; Vijayan, R.C.; Wu, P.; Vagdargi, P.; Sheth, N.; Vogt, S.; Kleinszig, G.; Osgood, G.M.; Siewerdsen, J.H. Fracture Reduction Planning and Guidance in Orthopaedic Trauma Surgery via Multi-body Image Registration. Med. Image Anal. 2021, 68, 101917. [Google Scholar] [CrossRef]
Abou Elkhier, M.T.; Saber, M.E.; Sweedan, A.O.; Elashwah, A. Evaluation of Accuracy of Three-Dimensional Printing and Three-Dimensional Miniplates in Treatment of Anterior Mandibular Fractures: A Prospective Clinical Study. BMC Oral Health 2025, 25, 649. [Google Scholar] [CrossRef]
Alzubaidi, L.; AL-Dulaimi, K.; Salhi, A.; Alammar, Z.; Fadhel, M.A.; Albahri, A.S.; Alamoodi, A.H.; Albahri, O.S.; Hasan, A.F.; Bai, J.; et al. Comprehensive Review of Deep Learning in Orthopaedics: Applications, Challenges, Trustworthiness, and Fusion. Artif. Intell. Med. 2024, 155, 102935. [Google Scholar] [CrossRef]
Jester, N.; Singh, M.; Lorr, S.; Tommasini, S.; Wiznia, H.; Buono, D. The development of an artificial intelligence auto-segmentation tool for 3D volumetric analysis of vestibular schwannomas. Sci. Rep. 2025, 15, 5918. [Google Scholar] [CrossRef]
Roth, T.A.; Jokeit, M.; Sutter, R.; Vlachopoulos, L.; Fucentese, S.F.; Carrillo, F.; Snedeker, J.G.; Esfandiari, H.; Fürnstahl, P. Deep-learning based 3D reconstruction of lower limb bones from biplanar radiographs for preoperative osteotomy planning. Int. J. Comput. Assist. Radiol. Surg. 2024, 19, 1843–1853. [Google Scholar] [CrossRef] [PubMed]
Zimmer Biomet. ZBEdge. Available online: https://www.zimmerbiomet.com/en/products-and-solutions/zb-edge.html (accessed on 25 December 2025).
Brainlab. Cranial Planning. Available online: https://www.brainlab.com/surgery-products/overview-neurosurgery-products/cranial-planning/ (accessed on 25 December 2025).
Elkohail, A.; Soffar, A.; Khalifa, A.M.; Omar, I.; Mosaad, M.; Abdulaziz, M.; Elsaket, A.; Panhwer, H.S.; Abdelglil, M.; Teama, M.; et al. AI-Enhanced Surgical Decision-Making in Orthopedics: From Preoperative Planning to Intraoperative Guidance and Real-Time Adaptation. Cureus 2025, 17, e92762. [Google Scholar] [CrossRef] [PubMed]
Berhouet, J.; Samargandi, R. Emerging Innovations in Preoperative Planning and Motion Analysis in Orthopedic Surgery. Diagnostics 2024, 14, 1321. [Google Scholar] [CrossRef]
Orthofeed. 5 Game-Changing Orthopedic Startups Revolutionizing Healthcare in 2025. Available online: https://orthofeed.com/2025/01/25/5-game-changing-orthopedic-startups-revolutionizing-healthcare-in-2025/ (accessed on 25 December 2025).
Lopez, I.; Swaminathan, A.; Vedula, K.; Narayanan, S.; Haredasht, F.N.; Ma, S.P.; Liang, A.S.; Tate, S.; Maddali, M.; Gallo, R.J.; et al. Clinical Entity-Augmented Retrieval (CLEAR) for Clinical Information Extraction. NPJ Digit. Med. 2025, 8, 45. [Google Scholar] [CrossRef]
Xu, Q.; Xie, W.; Liao, B.; Hu, C.; Qin, L.; Yang, Z.; Xiong, H.; Lyu, Y.; Zhou, Y.; Luo, A. Interpretability of Clinical Decision Support Systems Based on Artificial Intelligence from Technological and Medical Perspective: A Systematic Review. J. Healthc. Eng. 2023, 2023, 9919269. [Google Scholar] [CrossRef]
Amirian, S.; Carlson, L.; Gong, M.; Lohse, I.; Weiss, K.; Plate, J.; Tafti, A.P. Explainable AI in Orthopedics: Challenges, Opportunities, and Prospects. arXiv 2023, arXiv:2308.04696. [Google Scholar] [CrossRef]
Oettl, F.C.; Oeding, J.F.; Samuelsson, K. Explainable artificial intelligence in orthopedic surgery. J. Exp. Orthop. 2024, 11, e12103. [Google Scholar] [CrossRef]
Stiglic, G.; Kocbek, P.; Fijacko, N.; Zitnik, M.; Verbert, K.; Cilar, L. Interpretability of machine learning-based prediction models in healthcare. WIREs Data Min. Knowl. Discov. 2020, 10, e1379. [Google Scholar] [CrossRef]
Hur, S.; Lee, Y.; Park, J.; Jeon, Y.J.; Cho, J.H.; Cho, D.; Lim, D.; Hwang, W.; Cha, W.C.; Yoo, J. Comparison of SHAP and clinician friendly explanations reveals effects on clinical decision behaviour. NPJ Digit. Med. 2025, 8, 578. [Google Scholar] [CrossRef]
Lan, Q.; Li, S.; Zhang, J.; Guo, H.; Yan, L.; Tang, F. Reliable prediction of implant size and axial alignment in AI-based 3D preoperative planning for total knee arthroplasty. Sci. Rep. 2024, 14, 16971. [Google Scholar] [CrossRef]
Liu, F.; Niu, Y.; Zhang, Q.; Wang, K.; Dong, Z.; Wong, I.N.; Cheng, L.; Li, T.; Duan, L.; Li, K.; et al. A foundational architecture for AI agents in healthcare. Cell Rep. Med. 2025, 6, 102374. [Google Scholar] [CrossRef]
Karim, M.R.; Jiao, J.; Dohmen, T.; Cochez, M.; Beyan, O.; Rebholz-Schuhmann, D.; Decker, S. DeepKneeExplainer: Explainable knee osteoarthritis diagnosis from radiographs and magnetic resonance imaging. IEEE Access 2021, 9, 39757–39780. [Google Scholar] [CrossRef]
Zhang, C.; Liu, S.; Zhou, X.; Zhou, S.; Tian, Y.; Wang, S.; Xu, N.; Li, W. Examining the Role of Large Language Models in Orthopedics: Systematic Review. J. Med. Internet Res. 2024, 26, e59607. [Google Scholar] [CrossRef]
Kim, K.B.; Kim, G.B.; Kim, J.H.; Lee, S.M. Artificial intelligence in total knee arthroplasty: Clinical applications and implications. Knee Surg. Relat. Res. 2025, 37, 44. [Google Scholar] [CrossRef]
Lambrechts, A.; Wirix-Speetjens, R.; Maes, F.; Van der Perre, G. Artificial intelligence based patient-specific preoperative planning for total knee arthroplasty. Front. Robot. AI 2022, 9, 840282. [Google Scholar] [CrossRef]
Sánchez-Rosenberg, G.; Magnéli, M.; Barle, N.; Kontakis, M.G.; Müller, A.M.; Wittauer, M.; Gordon, M.; Brodén, C. ChatGPT-4 generates orthopedic discharge documents faster than humans maintaining comparable quality: A pilot study of 6 cases. Acta Orthop. 2024, 95, 152–156. [Google Scholar] [CrossRef] [PubMed]
Mendiratta, D.; Herzog, I.; Singh, R.; Para, A.; Joshi, T.; Vosbikian, M.; Kaushal, N. Utility of ChatGPT as a preparation tool for the Orthopaedic In-Training Examination. J. Exp. Orthop. 2025, 12, e70135. [Google Scholar] [CrossRef] [PubMed]
Li, T.P.; Slocum, S.; Sahoo, A.; Ochuba, A.; Kolakowski, L.; Henn, R.F., III; Johnson, A.A.; LaPorte, D.M. Socratic Artificial Intelligence Learning (SAIL): The Role of a Virtual Voice Assistant in Learning Orthopedic Knowledge. J. Surg. Educ. 2024, 81, 1655–1666. [Google Scholar] [CrossRef]
Sing, D.C.; Shah, K.S.; Pompliano, M.; Yi, P.H.; Velluto, C.; Bagheri, A.; Eastlack, R.K.; Stephan, S.R.; Mundis, G.M., Jr. Enhancing Magnetic Resonance Imaging (MRI) Report Comprehension in Spinal Trauma: Readability Analysis of AI-Generated Explanations for Thoracolumbar Fractures. JMIR AI 2025, 4, e69654. [Google Scholar] [CrossRef] [PubMed]
Zargarzadeh, S.; Hung, C.M. Multi-modal large language models for robot-assisted surgery: Integrating task reasoning and motion planning. arXiv 2024, arXiv:2408.07806. [Google Scholar] [CrossRef]
Baur, D.; Ansorg, J.; Heyde, C.E.; Voelker, A. Development and Evaluation of a Retrieval-Augmented Generation Chatbot for Orthopedic and Trauma Surgery Patient Education: Mixed-Methods Study. JMIR AI 2025, 4, e75262. [Google Scholar] [CrossRef]
Rosen, J.; Russell, J.; Kartik, P.; Vella-Baldacchino, M. Artificial intelligence algorithms in orthopaedics: A narrative review of methods and clinical applications. J. Exp. Orthop. 2025, 12, e70549. [Google Scholar] [CrossRef]
Mickley, J.P.; Kaji, E.S.; Khosravi, B.; Mulford, K.L.; Taunton, M.J.; Wyles, C.C. Overview of Artificial Intelligence Research Within Hip and Knee Arthroplasty. Arthroplast. Today 2024, 27, 101396. [Google Scholar] [CrossRef]
Van Engen, M.G.; Carender, C.N.; Glass, N.A.; Noiseux, N.O. Outcomes After Successful Debridement, Antibiotic, and Implant Retention Therapy for Periprosthetic Joint Infection in Total Knee Arthroplasty. J. Arthroplast. 2024, 39, 483–489. [Google Scholar] [CrossRef] [PubMed]
Jiang, L.Y.; Liu, X.C.; Nejatian, N.P.; Nasir-Moin, M.; Wang, D.; Abidin, A.; Eaton, K.; Riina, H.A.; Laufer, I.; Punjabi, P.; et al. Health system-scale language models are all-purpose prediction engines. Nature 2023, 618, 85–93. [Google Scholar] [CrossRef]
Porche, K.; Dru, A.; Moor, R.; Kubilis, P.; Vaziri, S.; Hoh, D.J. Preoperative Radiographic Prediction Tool for Early Postoperative Segmental and Lumbar Lordosis Alignment After Transforaminal Lumbar Interbody Fusion. Cureus 2021, 13, e18175. [Google Scholar] [CrossRef]
Mohamed, K.S.; Yu, A.; Schroen, C.A.; Duey, A.; Hong, J.; Yu, R.; Etigunta, S.; Kator, J.; Rhee, H.S.; Hausman, M.R. Comparing AAOS appropriate use criteria with ChatGPT-4o recommendations on treating distal radius fractures. Hand Surg. Rehabil. 2025, 44, 102122. [Google Scholar] [CrossRef] [PubMed]
Sivarajkumar, S.; Gao, F.; Denny, P.; Aldhahwani, B.; Visweswaran, S.; Bove, A.; Wang, Y. Mining Clinical Notes for Physical Rehabilitation Exercise Information: Development and Validation of Natural Language Processing Algorithms. JMIR Med. Inform. 2024, 12, e52289. [Google Scholar] [CrossRef] [PubMed]
Gürses, Ö.A.; Özüdoğru, A.; Tuncay, F.; Kararti, C. The Role of Artificial Intelligence Large Language Models in Personalized Rehabilitation Programs for Knee Osteoarthritis: An Observational Study. J. Med. Syst. 2025, 49, 73. [Google Scholar] [CrossRef]
Wright, B.M.; Bodnar, M.S.; Moore, A.D.; Maseda, M.C.; Kucharik, M.P.; Diaz, C.C.; Schmidt, C.M.; Mir, H.R. Is ChatGPT a trusted source of information for total hip and knee arthroplasty patients? Bone Jt. Open 2024, 5, 139–146. [Google Scholar] [CrossRef]
Fahy, S.; Niemann, M.; Böhm, P.; Winkler, T.; Oehme, S. Assessment of the Quality and Readability of Information Provided by ChatGPT in Relation to the Use of Platelet-Rich Plasma Therapy for Osteoarthritis. J. Pers. Med. 2024, 14, 495. [Google Scholar] [CrossRef]
Kaur, A.; Budko, A.; Liu, K.; Eaton, E.; Steitz, B.; Johnson, K. Automating Responses to Patient Portal Messages Using Generative AI. Appl. Clin. Inform. 2025, 16, 718–731. [Google Scholar] [CrossRef]
Asfuroğlu, Z.M.; Yağar, H.; Gümüşoğlu, E. High accuracy but limited readability of large language model-generated responses to frequently asked questions about Kienböck’s disease. BMC Musculoskelet. Disord. 2024, 25, 879. [Google Scholar] [CrossRef] [PubMed]
Gill, B.; Bonamer, J.; Kuechly, H.; Gupta, R.; Emmert, S.; Kurkowski, S.; Hasselfeld, K.; Grawe, B. ChatGPT is a promising tool to increase readability of orthopedic research consents. J. Orthop. Trauma Rehabil. 2024, 31, 148–152. [Google Scholar] [CrossRef]
Jumreornvong, O.; Perez, A.M.; Malave, B.; Mozawalla, F.; Kia, A.; Nwaneshiudu, C.A. Biases in Artificial Intelligence Application in Pain Medicine. J. Pain Res. 2025, 18, 1123–1135. [Google Scholar] [CrossRef]
Graefe, S.B.; Jeansonne, N.A.; Meister, A.; Juliano, P.; MacDonald, A.; Aynardi, M.; Perry, K. Assessing ChatGPT Responses to Common Patient Questions Regarding Ankle Fractures. Foot Ankle Orthop. 2024, 9, 2473011424S00149. [Google Scholar] [CrossRef]
Fan, X.; Wang, Y.; Zhang, S.; Xing, Y.; Li, J.; Ma, X.; Ma, J. Orthopedic Surgical Robotic Systems in Knee Arthroplasty: A Comprehensive Review. Front. Bioeng. Biotechnol. 2025, 13, 1523631. [Google Scholar] [CrossRef]
Kim, J.W.; Chen, J.T.; Hansen, P.; Shi, L.X.; Goldenberg, A.; Schmidgall, S.; Scheikl, P.M.; Deguet, A.; White, B.M.; Tsai, D.R.; et al. SRT-H: A Hierarchical Framework for Autonomous Surgery via Language-Conditioned Imitation Learning. Sci. Robot. 2025, 10, eadt5254. [Google Scholar] [CrossRef]
Killeen, B.D.; Chaudhary, S.; Osgood, G.; Unberath, M. Take a Shot! Natural Language Control of Intelligent Robotic X-ray Systems in Surgery. Int. J. Comput. Assist. Radiol. Surg. 2024, 19, 1165–1173. [Google Scholar] [CrossRef]
Souipas, S.; Nguyen, A.; Laws, S.G.; Davies, B.L.; Rodriguez y Baena, F. Real-Time Active Constraint Generation and Enforcement for Surgical Tools Using 3D Detection and Localisation Network. Front. Robot. AI 2024, 11, 1365632. [Google Scholar] [CrossRef]
Zhou, R.; Wang, D.; Zhang, H.; Zhu, Y.; Zhang, L.; Chen, T.; Liao, W.; Ye, Z. Vision Techniques for Anatomical Structures in Laparoscopic Surgery: A Comprehensive Review. Front. Surg. 2025, 12, 1557153. [Google Scholar] [CrossRef]
Loftus, T.J.; Tighe, P.J.; Filiberto, A.C.; Efron, P.A.; Brakenridge, S.C.; Mohr, A.M.; Rashidi, P.; Upchurch, G.R., Jr.; Bihorac, A. Artificial Intelligence and Surgical Decision-Making. JAMA Surg. 2020, 155, 148–158. [Google Scholar] [CrossRef]
Ng, M.S.P.; Loke, R.W.K.; Tan, M.K.L.; Ng, H.Y.; Liau, Z.Q.G. Novel Artificial Intelligence Algorithm for Soft Tissue Balancing and Bone Cuts in Robotic Total Knee Arthroplasty Improves Accuracy and Surgical Duration. Arthroplasty 2025, 7, 39. [Google Scholar] [CrossRef] [PubMed]
Hampp, E.L.; Chughtai, M.; Scholl, L.Y.; Sodhi, N.; Bhowmik-Stoker, M.; Jacofsky, D.J.; Mont, M.A. Robotic-Arm Assisted Total Knee Arthroplasty Demonstrated Greater Accuracy and Precision to Plan Compared with Manual Techniques. J. Knee Surg. 2019, 32, 239–250. [Google Scholar] [CrossRef] [PubMed]
Schmidgall, S.; Opfermann, J.D.; Kim, J.W.; Krieger, A. Will Your Next Surgeon Be a Robot? Autonomy and AI in Robotic Surgery. Sci. Robot. 2025, 10, eadt0187. [Google Scholar] [CrossRef]
Schmidt, M.W.; Köppinger, K.F.; Fan, C.; Kowalewski, K.-F.; Schmidt, L.P.; Vey, J.; Proctor, T.; Probst, P.; Bintintan, V.V.; Müller-Stich, B.-P.; et al. Virtual Reality Simulation in Robot-Assisted Surgery: Meta-Analysis of Skill Transfer and Predictability of Performance. BJS Open 2021, 5, zraa066. [Google Scholar] [CrossRef] [PubMed]
Hoyer, G.; Gao, K.T.; Gassert, F.G.; Luitjens, J.; Jiang, F.; Majumdar, S.; Pedoia, V. Foundations of a Knee Joint Digital Twin from qMRI Biomarkers for Osteoarthritis and Knee Replacement. NPJ Digit. Med. 2025, 8, 118. [Google Scholar] [CrossRef]
Ackland, D.C.; Fang, Z.; Senanayake, D. A Machine Learning Approach to Real-Time Calculation of Lower Limb Joint Angles from Inertial Measurement Units. Med. Eng. Phys. 2025, 104, 138. [Google Scholar] [CrossRef]
Wagner, R.E.; Plácido da Silva, H.; Gramann, K. Accuracy of Measuring Knee Flexion after Total Knee Arthroplasty through Wearable Sensors. Sensors 2021, 21, 4485. [Google Scholar] [CrossRef]
Gurchiek, R.D.; Donahue, N.; Fiorentino, N.M.; McGinnis, R.S. Wearables-Only Analysis of Muscle and Joint Mechanics: An EMG-Driven Approach. IEEE Trans. Biomed. Eng. 2022, 69, 580–589. [Google Scholar] [CrossRef]
Kokkotis, C.; Chalatsis, G.; Moustakidis, S.; Siouras, A.; Mitrousias, V.; Tsaopoulos, D.; Patikas, D.; Aggelousis, N.; Hantes, M.; Giakas, G.; et al. Identifying Gait-Related Functional Outcomes in Post-Knee Surgery Patients Using Wearable Sensors and Machine Learning. Int. J. Environ. Res. Public Health 2023, 20, 448. [Google Scholar] [CrossRef]
Laurent, A. Performance of Retrieval-Augmented Generation (RAG) on Pharmaceutical Documents. IntuitionLabs, 18 December 2025. Available online: https://intuitionlabs.ai/articles/rag-performance-pharmaceutical-documents (accessed on 25 December 2025).
Gong, M.F.; Finger, L.E.; Letter, C.; Amirian, S.; Parmanto, B.; O’Malley, M.; Klatt, B.A.; Tafti, A.P.; Plate, J.F. Development and Validation of a Mobile Phone Application for Measuring Knee Range of Motion. J. Knee Surg. 2025, 38, 22–27. [Google Scholar] [CrossRef]
Wu, X.; Zhang, H.; Cui, H.; Pei, W.; Zhao, Y.; Wang, S.; Cao, Z.; Li, W. Surface Electromyography and Gait Features in Patients after Anterior Cruciate Ligament Reconstruction. Orthop. Surg. 2025, 17, 62–70. [Google Scholar] [CrossRef] [PubMed]
Crouzier, M.; Hug, F.; Sheehan, F.T.; Collins, N.J.; Crossley, K.; Tucker, K. Neuromechanical Properties of the Vastus Medialis and Vastus Lateralis in Patellofemoral Pain. Orthop. J. Sports Med. 2023, 11, 23259671231155894. [Google Scholar] [CrossRef] [PubMed]
Yeung, T.; Yang, S.; Yeung, S.; Zaidi, F.; Weaver, S.; Bolam, S.; Lovatt, M.; Besier, T.; Munro, J.; Hanlon, M.; et al. IMU-Augmented Patient-Related Outcome Measure for Knee Arthroplasty Patients. J. Med. Biol. Eng. 2025, 45, 557–565. [Google Scholar] [CrossRef]
Haendel, M.A.; Chute, C.G.; Bennett, T.D.; Eichmann, D.A.; Guinney, J.; Kibbe, W.A.; Payne, P.R.O.; Pfaff, E.R.; Robinson, P.N.; Saltz, J.H.; et al. The National COVID Cohort Collaborative (N3C): Rationale, design, infrastructure, and deployment. J. Am. Med. Inform. Assoc. 2021, 28, 427–443. [Google Scholar] [CrossRef] [PubMed]
Acosta, J.N.; Falcone, G.J.; Rajpurkar, P.; Topol, E.J. Multimodal biomedical AI. Nat. Med. 2022, 28, 1773–1784. [Google Scholar] [CrossRef]
Plana, D.; Shung, D.L.; Grimshaw, A.A.; Saraf, A.; Sung, J.J.Y.; Kann, B.H. Randomized Clinical Trials of Machine Learning Interventions in Health Care: A Systematic Review. JAMA Netw. Open 2022, 5, e2233946. [Google Scholar] [CrossRef]
Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180. [Google Scholar] [CrossRef]
Castelli, M.; Sousa, M.; Vojtech, I.; Single, M.; Amstutz, D.; Maradan-Gachet, M.E.; Magalhães, A.D.; Debove, I.; Rusz, J.; Martinez-Martin, P.; et al. Detecting Neuropsychiatric Fluctuations in Parkinson’s Disease Using Patients’ Own Words: The Potential of Large Language Models. NPJ Park. Dis. 2025, 11, 79. [Google Scholar] [CrossRef]
Zheng, C.; Li, Q.; Lu, G.; Mai, Y.; Hu, Y. Large Language Models in Breast Cancer Reconstruction: A Framework for Patient-Specific Recovery and Predictive Insights. SLAS Technol. 2025, 32, 100285. [Google Scholar] [CrossRef]
Chung, P.; Fong, C.T.; Walters, A.M.; Aghaeepour, N.; Yetisgen, M.; O’Reilly-Shah, V.N. Large Language Model Capabilities in Perioperative Risk Prediction and Prognostication. JAMA Surg. 2024, 159, 928–937. [Google Scholar] [CrossRef]
Prabha, S.; Gomez-Cabello, C.A.; Haider, S.A.; Genovese, A.; Trabilsy, M.; Wood, N.G.; Bagaria, S.; Tao, C.; Forte, A.J. Enhancing Clinical Decision Support with Adaptive Iterative Self-Query Retrieval for Retrieval-Augmented Large Language Models. Bioengineering 2025, 12, 895. [Google Scholar] [CrossRef]
Shen, M.; Umar, M.; Maeng, K.; Suh, G.E.; Gupta, U. Towards Understanding Systems Trade-offs in Retrieval-Augmented Generation Model Inference. arXiv 2024, arXiv:2412.11854. [Google Scholar] [CrossRef]
Niset, A.; Melot, I.; Pireau, M.; Englebert, A.; Scius, N.; Flament, J.; El Hadwe, S.; Al Barajraji, M.; Thonon, H.; Barrit, S. Grounded Large Language Models for Diagnostic Prediction in Real-World Emergency Department Settings. JAMIA Open 2025, 8, ooaf119. [Google Scholar] [CrossRef] [PubMed]
Gou, F.; Liu, J.; Xiao, C.; Wu, J. Research on Artificial-Intelligence-Assisted Medicine: A Survey on Medical AI. Diagnostics 2024, 14, 1472. [Google Scholar] [CrossRef] [PubMed]
Oladokun, P. Posthuman Ethics in Digital Health: Reimagining Autonomy, Consent, and Responsibility in AI-Augmented Care. Int. J. Eng. Technol. Res. Manag. 2025, 9, 4. [Google Scholar]
De Micco, F.; Grassi, S.; Tomassini, L.; Palma, G.D.; Ricchezze, G.; Scendoni, R. Robotics and AI in Healthcare from the Perspective of European Regulation: Who Is Responsible for Medical Malpractice? Front. Med. 2024, 11, 1428504. [Google Scholar] [CrossRef]

Figure 1. PRISMA flow diagram illustrating the literature search strategy, screening process, and final selection criteria utilized for this review on algorithmic architectures for orthopedic clinical AI assistants.

Figure 2. Multimodal AI system with domain adaptation and controlled inference.

Figure 3. Multimodal RAG-based medical question answering.

Figure 4. Biomedical natural language understanding (NLU) pipeline architecture.

Figure 5. Multimodal biomedical AI architecture for clinical decision support.

Table 1. Performance evaluation and benchmarks of orthopedic LLMs.

Model	Accuracy	Performance Level	References
GPT-3.5	52.98%	Below PGY-1	[23]
Bard	49.8–58%	PGY-1 to PGY-3 equivalent	[33]
GPT-4	64.91%	PGY-5 level	[34]
GPT-4 + RAG	73.80%	Human surgeon equivalent	[23]
Orthopedic Residents	74.2–75.3%	Reference standard	[33]

Table 2. Comparison of LLM adaptation methods.

Method	Illustrative Study	Accuracy/Performance	Computational Cost	Clinical Reliability	References
Full Fine-tuning	MedAdapter (BERT-sized)	99.35% of SFT performance	High GPU memory	Risk of overfitting	[20]
LoRA	LLaMA-7B on MSK pain	Preserves privacy, efficient	Low (1% params)	Good for niche tasks	[21]
RAG	GPT-4 + AAOS guidelines	73.80% (surgeon-level)	Medium (latency)	High, traceable	[23]
Prompt Engineering	GPT-4 zero-shot	64.91% (no RAG)	Very low	Limited, variable	[23]
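
The low computational cost listed for LoRA in Table 2 follows from simple parameter counting: instead of updating a full d_out × d_in weight matrix, LoRA trains two low-rank factors with rank × (d_out + d_in) parameters in total. The sketch below works through that arithmetic. The hidden size (4096) and layer count (32) match the published LLaMA-7B configuration; the rank of 8, the choice of two adapted matrices per layer, and the nominal 7 billion base parameters are illustrative assumptions, not settings reported by the cited study.

```python
def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    The frozen weight W is augmented by B @ A, where B is d_out x rank
    and A is rank x d_in, so only rank * (d_out + d_in) values are trained.
    """
    return rank * (d_out + d_in)


def trainable_fraction(hidden: int, n_layers: int, matrices_per_layer: int,
                       rank: int, base_params: int) -> tuple[int, float]:
    """Total LoRA parameters and their share of the frozen base model."""
    added = n_layers * matrices_per_layer * lora_params(hidden, hidden, rank)
    return added, added / base_params


# Assumed setup: rank 8, two square projections adapted in each of 32 layers.
added, fraction = trainable_fraction(
    hidden=4096, n_layers=32, matrices_per_layer=2,
    rank=8, base_params=7_000_000_000,  # nominal size, an assumption
)
print(f"{added:,} trainable parameters ({fraction:.4%} of the base model)")
```

Under these assumptions the trainable share is well below 1%; higher ranks or adapting more matrices per layer raise it toward the figure quoted in the table.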

Table 3. Application of geometric reasoning in medical practice.

Domain	Application of Geometric Reasoning	Current Status	References
Trauma and Fracture	AI “solves” the puzzle of comminuted fractures by matching fragments to a healthy template (often mirrored from the contralateral side).	Rapidly maturing; used in pelvic and complex acetabular fracture planning	[60]
Joint Replacement	Algorithms (e.g., XPlan.ai, X-Atlas) generate full 3D models from standard 2D X-rays, eliminating the need for CT scans while maintaining sub-millimeter accuracy.	Commercialized (Zimmer Biomet, RSIP Vision)	[65]
Spine Surgery	AI analyzes vertebral geometry to propose optimal pedicle screw paths, avoiding nerves and maximizing purchase.	Standard of Care in advanced navigation (Brainlab Elements, Stryker Mako)	[66]
Oncology	Precise calculation of tumor volume over time to track growth rates invisible to the naked eye.	High impact in neuro-oncology and orthopedic oncology	[63]

Table 4. Comparative evaluation of LLM-based predictive analytics in orthopedics.

Prediction Task	Model Used	Performance (AUC/Accuracy)	Compared to the Traditional Model	References
Discharge Disposition	LLaMA-based	Comparable to ASA models	Captures social determinants	[90]
30-day Readmission	Health-system LLM	AUC 0.79	Outperforms structured data	[91]
PJI Detection	Fine-tuned clinical LLM	High sensitivity/specificity	Better than rule-based coding	[90]
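
For readers less familiar with the AUC values reported in Table 4, the metric can be read as the probability that a randomly chosen positive case (for example, a readmitted patient) receives a higher risk score than a randomly chosen negative case. The following minimal sketch computes it by direct pairwise comparison; the function name and the toy scores are hypothetical and are not taken from the cited studies.

```python
def pairwise_auc(scores: list[float], labels: list[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win,
    which matches the Mann-Whitney U formulation of the AUC.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: four patients with hypothetical readmission risk scores.
print(pairwise_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

The quadratic loop is intended for clarity only; production evaluations typically use a rank-based implementation from an established statistics library.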

Table 5. Governance and deployment safeguards proposed for Orthopedic Clinical LLMs.

Safeguard	Rationale	Proposed Mechanism	Reference
RLHF alignment	Ensure adherence to AAOS/NASS clinical guidelines	Iterative feedback from orthopedic surgeons on generated plans	[89,93]
Privacy-preserving fine-tuning	Protect patient data in multi-center Implant Registries	Federated learning across hospital networks	[91]
Calibration checks	Prevent overestimation of surgical success rates	Reliability diagrams for revision surgery risk scores	[89,90]
Uncertainty quantification	Flag ambiguous cases (e.g., borderline infection)	Monte-Carlo dropout for tumor recurrence prediction	[90]
Bias audits	Address disparities in arthroplasty access and outcomes	Stratified performance analysis by race and BMI	[89]
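
Two of the safeguards in Table 5, calibration checks and bias audits, reduce to short computations. The sketch below bins predicted risks to produce the points of a reliability diagram, summarizes them as an expected calibration error, and reports accuracy separately per subgroup. All function names, the bin count, and the toy data are illustrative assumptions; none of them come from the cited works.

```python
from collections import defaultdict


def reliability_bins(probs, labels, n_bins=5):
    """Points of a reliability diagram: (mean predicted risk, observed rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    points = []
    for members in bins:
        if members:  # skip empty bins
            n = len(members)
            mean_p = sum(p for p, _ in members) / n
            observed = sum(y for _, y in members) / n
            points.append((mean_p, observed, n))
    return points


def expected_calibration_error(probs, labels, n_bins=5):
    """Count-weighted average gap between predicted and observed rates."""
    total = len(probs)
    return sum(n / total * abs(mean_p - observed)
               for mean_p, observed, n in reliability_bins(probs, labels, n_bins))


def stratified_accuracy(preds, labels, groups):
    """Accuracy per subgroup, e.g. per BMI band, for a simple bias audit."""
    hits, totals = defaultdict(int), defaultdict(int)
    for pred, y, g in zip(preds, labels, groups):
        totals[g] += 1
        hits[g] += int(pred == y)
    return {g: hits[g] / totals[g] for g in totals}


# Toy data: hypothetical revision-risk scores and outcomes.
print(expected_calibration_error([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1]))
print(stratified_accuracy([1, 0, 1, 1], [1, 0, 0, 1], ["a", "a", "b", "b"]))
```

A large gap between predicted and observed rates in the high-risk bins would indicate the overestimation that the calibration safeguard is meant to catch, while diverging subgroup accuracies would prompt a closer bias review.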

Table 6. Capabilities and limitations of LLMs in healthcare applications.

Aspect	LLM Capabilities	Current Limitations	Reference
Rehabilitation Planning	Generates knee OA rehab plans with ~74% expert agreement	Lacks specificity in detailed exercise parameters (sets/reps)	[95]
Surgical Health Literacy	Answers THA FAQs with high accuracy and consistency	Reading level often exceeds 13th grade (college level)	[96]
Interactive Consent	Simplifies spine surgery consent to a near 6th-grade level	Cannot fully replace surgeon-led legal discussions	[100]
Patient Portal Agents	Drafts empathetic responses to patient portal messages	Requires provider-in-the-loop validation for safety	[98]
Condition-Specific Info	Provides accurate summaries for niche conditions (e.g., Kienböck’s)	Readability remains poor for complex rare diseases	[99]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Boltaboyeva, A.; Baigarayeva, Z.; Imanbek, B.; Amangeldy, B.; Tasmurzayev, N.; Ozhikenov, K.; Alimbayeva, Z.; Alimbayev, C.; Karymsakova, N. Architecting the Orthopedical Clinical AI Pipeline: A Review of Integrating Foundation Models and FHIR for Agentic Clinical Assistants and Digital Twins. Algorithms 2026, 19, 99. https://doi.org/10.3390/a19020099


