Article

Standardization of Neuromuscular Reflex Analysis—Role of Fine-Tuned Vision-Language Model Consortium and OpenAI gpt-oss Reasoning LLM-Enabled Decision Support System

1 Virginia Modeling Analysis and Simulation Center, Old Dominion University, Suffolk, VA 23435, USA
2 Ellmer College of Health Sciences, Old Dominion University, Norfolk, VA 23508, USA
3 AnaletIQ, Washington, DC 21206, USA
4 McDonald Army Health Center, Newport News, VA 23604, USA
5 Department of Sports Theory and Human Motor Skills, Gdansk University of Physical Education and Sport, 80-336 Gdansk, Poland
6 Bogomolets Institute of Physiology, National Academy of Sciences of Ukraine, 01024 Kyiv, Ukraine
7 School of Computing, University of Colombo, Colombo 00700, Sri Lanka
* Author to whom correspondence should be addressed.
Biomechanics 2026, 6(1), 23; https://doi.org/10.3390/biomechanics6010023
Submission received: 7 December 2025 / Revised: 12 January 2026 / Accepted: 27 January 2026 / Published: 27 February 2026
(This article belongs to the Special Issue Biomechanics in Sport and Ageing: Artificial Intelligence)

Abstract

Background/Objectives: Accurate assessment of neuromuscular reflexes, such as the Hoffmann reflex (H-reflex), plays a critical role in sports science, rehabilitation, and clinical neurology. Conventional interpretation of H-reflex electromyography (EMG) waveforms is subject to inter-rater variability and interpretive bias, limiting reliability and standardization. This study aims to develop an automated, interpretable, and robust agentic AI–driven framework for H-reflex waveform analysis. Methods: We propose a fine-tuned Vision–Language Model (VLM) consortium combined with a reasoning Large Language Model (LLM)–enabled decision support system for automated H-reflex interpretation. Multiple VLMs were fine-tuned on curated datasets of H-reflex EMG waveform images annotated with expert clinical observations, recovery timelines, and athlete metadata. The VLM outputs were aggregated using a consensus-based strategy and further refined by a specialized reasoning LLM to ensure coherent, transparent, and explainable diagnostic assessments. Model fine-tuning employed Low-Rank Adaptation (LoRA) and 4-bit quantization to enable efficient deployment on consumer-grade hardware. Results: Experimental evaluation demonstrated that the proposed hybrid system delivers accurate, consistent, and clinically interpretable assessments of neuromuscular states, including fatigue, injury, and recovery, directly from EMG waveform images and contextual metadata. Compared with baseline models, the fine-tuned VLM consortium exhibited substantially improved precision, consistency, and contextual awareness, while the reasoning LLM enhanced diagnostic coherence through cross-model consensus and structured reasoning, thereby supporting responsible and explainable AI-driven decision making. Conclusions: This work presents, to the authors’ knowledge, the first integration of a responsible and explainable AI-driven decision support system for H-reflex analysis. 
The proposed framework advances the automation and standardization of neuromuscular diagnostics and establishes a foundation for next-generation AI-assisted decision support systems in sports performance monitoring, rehabilitation, and clinical neurophysiology.

1. Introduction

Neuromuscular reflexes are essential elements of the human motor control system, enabling rapid, involuntary responses to sensory input that help maintain stability, coordination, and protection against injury [1]. Among these, the Hoffmann reflex (H-reflex) is a well-established electrophysiological measure for assessing the excitability and integrity of the spinal cord reflex arc. It is elicited through electrical stimulation of peripheral nerves and recorded by electromyography (EMG), providing a quantifiable indicator of neuromuscular pathway function [1]. Due to its sensitivity and reproducibility, the H-reflex is widely used in neurology, rehabilitation, sports science, and performance monitoring to evaluate recovery after injury, track neuromuscular disorders, and investigate adaptive changes in motor control [2,3]. Despite its importance, traditional approaches to H-reflex analysis rely primarily on visual inspection and manual quantification of EMG waveforms, or on semi-automated signal processing methods [4]. These approaches, while effective in controlled environments, suffer from significant limitations [1]. Manual interpretation is susceptible to inter- and intra-rater variability, potentially affecting the consistency and reliability of results. The time-intensive nature of manual analysis restricts throughput, making large-scale or real-time monitoring impractical. Furthermore, existing methods often fail to integrate the full spectrum of available metadata, such as patient history, training status, or contextual factors, limiting the potential for personalized diagnostics. Current semi-automated tools also tend to operate as black boxes, providing limited interpretability or flexibility in reasoning with complex, multimodal datasets.
To address these challenges, we propose a novel platform for automated H-reflex analysis, built on a fine-tuned VLM consortium [5,6] and a decision support system enabled by a reasoning LLM [7,8]. Our approach leverages multiple VLMs, each fine-tuned on curated datasets of H-reflex EMG waveform images that are richly annotated with clinical observations, recovery timelines, and athlete metadata. These models can extract key electrophysiological features and accurately predict neuromuscular states, including fatigue, injury, and recovery, directly from EMG images and contextual information. The diagnostic outputs produced by the fine-tuned VLM consortium are aggregated through a consensus-driven approach and further synthesized by the OpenAI-gpt-oss reasoning LLM [7,9], ensuring responsible, transparent, and explainable AI-enabled decision support [10,11,12] for clinicians, rehabilitation specialists, and sports scientists [2].
The end-to-end platform orchestrates seamless communication between the VLM ensemble and the reasoning LLM, integrating advanced prompt engineering strategies and automated reasoning workflows using LLM agents [13,14,15]. Each VLM within the consortium is fine-tuned using Low-Rank Adapters (LoRA) and 4-bit quantization, enabling efficient training and deployment on consumer-grade hardware without sacrificing performance [16,17]. Experimental results demonstrate that this hybrid system delivers highly accurate, consistent, and interpretable H-reflex assessments, significantly advancing both the automation and standardization of neuromuscular diagnostics. By automating labor-intensive analysis and integrating contextual metadata to enrich clinical and performance insights, our system supports informed decision-making in neurology, rehabilitation, and sports science [1]. To our knowledge, this work represents the first integration of a fine-tuned VLM consortium with a reasoning LLM for image-based H-reflex analysis, laying the foundation for next-generation AI-assisted neuromuscular assessment and athlete monitoring platforms. The main contributions of this research are as follows:
  • Fine-tuning a consortium of VLMs to analyze H-reflex EMG waveform images and predict neuromuscular states such as fatigue, injury, and recovery.
  • Integrating a specialized reasoning LLM to refine and validate diagnostic outputs, ensuring robust, transparent, and explainable neuromuscular assessments based on the predictions of the VLM consortium.
  • Automating the end-to-end workflow for H-reflex analysis and neuromuscular diagnosis by orchestrating seamless communication between the VLM ensemble and the reasoning LLM, facilitated by AI agents and advanced prompt engineering.
  • Implementing and validating a prototype of the proposed platform, integrating multiple fine-tuned VLMs with the reasoning LLM, and demonstrating its effectiveness for standardized and scalable neuromuscular reflex assessment in clinical and sports science contexts.
The remainder of the paper is organized as follows: Section 2 reviews related work and contextualizes our approach within the broader landscape of AI-driven sports science and neuromuscular assessment systems. Section 3 introduces the core technologies that underpin the proposed AI-assisted neuromuscular reflex analysis platform. Section 4 details the overall system architecture, highlighting the integration of VLMs and reasoning engines for the interpretation of H-reflex waveforms. Section 5 outlines the core functionalities and operational workflow of the platform, from data ingestion to the final interpretation of the reaction response. Section 6 presents the details of the implementation and evaluates the performance of the system in neuromuscular analysis tasks. Finally, Section 7 concludes the paper and discusses potential directions for future research, performance monitoring applications, and integration into athlete recovery management workflows.

2. Related Work

In recent years, the application of artificial intelligence (AI) to the interpretation of electrophysiological signals, particularly EMG data, has advanced considerably. This section provides a detailed analysis of previous studies and frameworks that have used AI techniques for EMG-based signal interpretation, highlighting their methodologies, limitations, and relevance to neuromuscular evaluation and rehabilitation.

2.1. Sensor Fusion for EMG-Based Gesture Recognition

Sensor fusion methods that combine visual and EMG modalities have shown a strong potential to enhance the precision of hand gesture recognition [18]. In these systems, EMG signals from wearable electrodes are fused with vision-based cues from cameras or depth sensors to achieve higher recognition accuracy than either modality alone. Although the primary focus of this work is on human–computer interaction rather than clinical reflex assessment, it demonstrates the feasibility and performance benefits of integrating multimodal signals for nuanced interpretation. The underlying fusion strategies, such as late feature fusion and decision-level ensemble, can inform similar approaches for combining H-reflex waveform images with athlete metadata in clinical and sports science contexts.

2.2. Vision Transformer-Based Hand Gesture Recognition (CT-HGR)

The CT-HGR framework [19] applies Vision Transformer (ViT) architectures to high-density surface EMG (HD-sEMG) data, effectively transforming spatial–temporal EMG signals into image-like representations for classification. By avoiding extensive handcrafted feature engineering, ViTs can learn discriminative spatial–temporal patterns directly from HD-sEMG data, enabling real-time classification performance suitable for deployment in prosthetic control or sign language recognition. This approach underscores the potential for visual encodings for electrophysiological data, a principle that can be adapted for the analysis of H-reflex waveforms, where the signal is also represented visually.

2.3. Explainable AI in EMG for Stroke Gait Analysis

Explainable AI (XAI) methods have been increasingly explored for interpreting EMG data in clinical contexts. A notable study [20] applied gradient boosting models alongside SHAP and LIME interpretability tools to distinguish stroke-impaired gait patterns from healthy controls. The emphasis on interpretability ensures that clinicians can trace model predictions to specific signal features, thus increasing trust and adoption in healthcare workflows. This focus on transparent model behavior parallels the emphasis of our platform on clinical explainability, where H-reflex waveform interpretations must be accurate and interpretable to sports physicians and rehabilitation specialists.

2.4. INSPIRE: AI for Electrodiagnostic Interpretation

The INSPIRE system [4] represents one of the few AI-driven frameworks that explicitly targets clinical electrodiagnostic interpretation (EDX), including EMG and nerve conduction studies. It employs a multi-agent architecture in which different models analyze patient history, raw EMG/EDX data, and structured test reports. A reasoning module synthesizes these outputs to produce final diagnostic interpretations. This approach is closely aligned with our proposed model architecture, which similarly layers fine-tuned VLM ensembles with a reasoning LLM to integrate multiple perspectives and produce coherent, clinically relevant conclusions for H-reflex assessment.

2.5. LLMs for EMG-to-Text Conversion

Recent work on EMG-to-text conversion [21] explores the use of language models with EMG adapters to translate unvoiced speech or facial muscle activations into textual form. Although the application domain differs, this research highlights early examples of the integration of LLM architectures with EMG-based input, bridging the gap between electrophysiological signals and natural language output. This cross-modal translation capability is conceptually similar to our approach, where H-reflex waveform features are mapped into structured, language-based clinical assessments.

2.6. Multiscale ML for Nerve Conduction Velocity

Sadeghi et al. propose a multiscale ML framework for precise nerve conduction velocity (NCV) analysis, integrating entropy-optimized wavelet decomposition, thermodynamically regularized neural networks (incorporating Arrhenius kinetics), and uncertainty-aware progression modeling [22]. Validated on 1842 patients across multiple centers, the model improves motor NCV accuracy by 23.4% and sensory by 28.7%, while allowing for early detection of neuropathy and temperature-compensated measurements. This clinically oriented signal processing and physiologically grounded approach complements our system by modeling statistical dynamics over time rather than interpreting single-waveform reflex patterns.

2.7. Hybrid-FEM

The study by Pratticò et al. introduces a hybrid diagnostic framework that integrates finite element modeling (FEM), infrared thermography, and artificial intelligence for monitoring thermal stress in biomedical electronic systems [23]. By combining physics-based simulations with deep learning (e.g., U-Net for thermal hotspot segmentation and machine learning classifiers for heat diffusion patterns), the work achieves high accuracy in automated thermal anomaly detection and classification. The system is validated on real thermographic data from biomedical device prototypes, demonstrating reliable real-time performance and potential for predictive maintenance. Although this approach focuses on electronic device monitoring rather than physiological signal interpretation, it exemplifies how hybrid physics–AI systems can bridge model-based and data-driven analysis in biomedical contexts, highlighting the value of multimodal fusion and explainable AI for diagnostic support in engineered healthcare systems [23].
Table 1 presents a comparative analysis of prior AI-based diagnostic frameworks across key dimensions, including their application domain, fine-tuning capability, model architecture, support for vision–language and reasoning LLM integration, and alignment with Responsible AI and Explainable AI (XAI) [10,11,24]. Most existing systems lack dedicated vision–language understanding of electrophysiological signals or rely on static, text-based interpretations without incorporating multimodal reasoning, cross-model consensus validation, or transparent decision logic, thus limiting their adherence to Responsible and Explainable AI standards. Moreover, the majority of previous works primarily focus on gesture recognition or general motor intention decoding, with few addressing H-reflex interpretation directly. Existing frameworks rarely employ multimodal reasoning, ensemble-based vision–language modeling, or dedicated reasoning LLMs, capabilities that form the foundation of our proposed neuromuscular reflex analysis platform.
In contrast, the proposed platform uniquely integrates fine-tuned VLMs, agentic orchestration, and a reasoning LLM (OpenAI-gpt-oss) to synthesize and validate H-reflex waveform interpretations from multiple specialized models. The framework incorporates structured consensus reasoning, uncertainty quantification, and physiological plausibility verification to mitigate error propagation and ensure interpretability. By unifying waveform image analysis, contextual metadata reasoning, and multi-model decision fusion within a transparent and auditable pipeline, the platform operationalizes Responsible and Explainable AI principles, advancing the state of the art in neuromuscular diagnostics and rehabilitation monitoring.

3. Background

This section provides a foundational overview of the core scientific and technological concepts underpinning the proposed AI-assisted neuromuscular reflex analysis platform. In particular, we highlight the basis and clinical importance of the H-reflex in neuromuscular analysis, recent advancements in LLMs/VLMs, reasoning-capable LLMs, fine-tuning techniques, and the emerging paradigm of AI agents.

3.1. The H-Reflex and Neuromuscular Diagnostics

The Hoffmann reflex (H-reflex) is a fundamental neurophysiological marker used to assess the excitability and integrity of the monosynaptic reflex arc within the human neuromuscular system [1]. By electrically stimulating a peripheral nerve and recording the resultant electromyographic (EMG) responses in a target muscle, the H-reflex enables noninvasive quantification of spinal cord and motor neuron function. Typically, a single submaximal stimulus produces both a direct motor response (M-wave) and an H response, which is mediated by Ia afferent fibers synapsing on alpha motor neurons in the spinal cord.
The amplitude, latency, and recruitment properties of the H-reflex waveform provide sensitive indicators of neuromuscular health, making it a valuable tool in clinical neurophysiology, rehabilitation, and sports science [2]. In athletes, longitudinal monitoring of the H-reflex supports objective evaluation of fatigue, recovery status, training adaptations, and risk of neuromuscular injury [3]. It is also widely used to investigate pathologies that affect motor neuron excitability, such as neuropathies, spinal cord injuries, and neurodegenerative disorders. However, traditional H-reflex analysis relies on manual or semi-automated interpretation of EMG waveforms, a process that is labor-intensive, time-consuming, and subject to inter- and intra-rater variability [18]. Additionally, the integration of contextual metadata such as athlete characteristics, clinical observations, and recovery timelines is rarely standardized, limiting the generalizability and clinical utility of reflex-based assessments.
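To make the amplitude and latency features above concrete, the sketch below extracts peak-to-peak amplitude, latency, and the H/M ratio from a stimulus-locked EMG sweep. The sampling rate, response windows, and synthetic waveform are illustrative assumptions, not the acquisition settings used in this study.

```python
import numpy as np

FS = 5000  # sampling rate in Hz (assumed for illustration)

def hm_features(emg, fs=FS, m_window=(0.003, 0.015), h_window=(0.025, 0.045)):
    """Extract peak-to-peak amplitude and latency for the M-wave and H-wave
    from a stimulus-locked EMG sweep (t = 0 at the stimulus artifact).
    The response windows are typical illustrative values, not fixed standards."""
    t = np.arange(len(emg)) / fs
    feats = {}
    for name, (lo, hi) in (("M", m_window), ("H", h_window)):
        mask = (t >= lo) & (t < hi)
        seg, seg_t = emg[mask], t[mask]
        feats[f"{name}_amp"] = float(seg.max() - seg.min())  # peak-to-peak
        feats[f"{name}_lat_ms"] = float(seg_t[np.argmax(np.abs(seg))] * 1000)
    feats["H_M_ratio"] = feats["H_amp"] / feats["M_amp"]
    return feats

# Synthetic sweep: a large M-wave near 8 ms and a smaller H-wave near 32 ms
t = np.arange(int(0.06 * FS)) / FS
emg = (2.0 * np.exp(-((t - 0.008) / 0.002) ** 2) * np.sin(2 * np.pi * 300 * (t - 0.008))
       + 0.8 * np.exp(-((t - 0.032) / 0.003) ** 2) * np.sin(2 * np.pi * 200 * (t - 0.032)))
print(hm_features(emg))
```

In practice such hand-crafted features would be complemented, not replaced, by the VLM-based image analysis described later in the paper.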

3.2. Vision–Language Models (VLMs)

Vision–Language Models (VLMs) are advanced deep neural networks trained on large-scale text and image datasets, enabling them to jointly interpret, generate, and reason across visual and textual modalities. These models form the foundation of modern multimodal AI systems [6] and have demonstrated exceptional performance in tasks such as image captioning, visual question answering, medical image interpretation, and cross-modal retrieval [5].
Several prominent VLMs, such as Llama-Vision [25], Pixtral-Vision [26], and Qwen2-VL [27], are available alongside proprietary models such as OpenAI’s GPT [28] and Google’s Gemini [29]. Open-source VLMs offer substantial benefits for healthcare and biomedical applications, including transparency, customizability, and cost-effective deployment. For example, Llama-Vision [30,31] and Pixtral-Vision [26] provide strong performance with efficient architectures suitable for integration with visual processing modules. Many modern VLMs are optimized for multilingual, on-device, or edge deployment, supporting scalable and privacy-preserving applications in clinical and research environments.

3.3. Reasoning LLMs

While foundational LLMs excel in pattern recognition and natural language generation, they often lack the capacity for structured, multi-step reasoning. Reasoning LLMs [7] address this limitation by being specifically designed or fine-tuned to synthesize diverse inputs, resolve conflicting information, and support logical decision-making processes. Unlike traditional LLMs that rely primarily on next-token prediction, reasoning models simulate higher-order cognitive functions similar to human deductive reasoning [9].
OpenAI-gpt-oss [8] is an open-source reasoning LLM designed to perform advanced evaluative and comparative tasks across multiple inputs. Unlike traditional generative LLMs that focus on single-output prediction, OpenAI-gpt-oss is capable of synthesizing responses, resolving contradictions, and applying logical inference to arrive at consistent, well-reasoned conclusions. It excels in tasks involving multi-model output reconciliation, ranking, and consensus generation. gpt-oss also supports chain-of-thought reasoning, tool invocation, and visible reasoning steps for improved transparency and auditability [32]. These properties make it ideally suited for structured, multi-step interpretative tasks such as aggregating and reasoning over outputs from VLMs in neuromuscular reflex analysis.

3.4. VLM Fine-Tuning

Fine-tuning is a key technique for adapting pre-trained VLMs to specialized downstream tasks and domains. It involves retraining the model on curated task-specific datasets that combine visual (e.g., EMG waveform images) and textual (e.g., athlete metadata, clinical observations) inputs [33,34]. This process allows the VLM to learn domain-relevant associations and produce outputs that are precisely aligned with neuromuscular reflex analysis and related biomedical applications [26].
To optimize the efficiency and scalability of fine-tuning, Low-Rank Adapters (LoRA) [16] are commonly employed. LoRA introduces trainable low-rank matrices into the transformer architecture, allowing efficient, task-specific adaptation while significantly reducing the number of trainable parameters. For resource-constrained settings, Quantized LoRA (QLoRA) [17] provides even greater memory and compute efficiency by quantizing model weights to 4-bit representations, while retaining nearly full-precision performance. These techniques collectively enable the practical fine-tuning of large VLMs on modest hardware, making advanced multimodal models accessible for clinical and research applications.
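The parameter savings behind LoRA follow directly from its defining update, W′ = W + (α/r)·BA, where only the low-rank factors A and B are trained. The toy dimensions below are assumptions chosen purely to illustrate the arithmetic, not the configuration used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 128, 8, 16   # toy sizes; r << min(d_out, d_in)

W = rng.standard_normal((d_out, d_in))     # frozen pre-trained weight (never updated)
A = rng.standard_normal((r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))                   # zero init, so the adapter starts as a no-op

def lora_forward(x):
    """y = W x + (alpha / r) * B (A x); only A and B receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B = 0 the adapted layer reproduces the frozen layer exactly.
print(np.allclose(lora_forward(x), W @ x))

# Trainable parameters: r*(d_in + d_out) for LoRA vs d_in*d_out for full fine-tuning
print(r * (d_in + d_out), d_in * d_out)
```

At these toy sizes the adapter trains 1536 parameters instead of 8192; at transformer scale the ratio is far more favorable, which is what makes consumer-grade fine-tuning feasible.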
Several open-source libraries facilitate efficient fine-tuning workflows for VLMs. For example, Unsloth [35] provides high-speed, memory-efficient fine-tuning for models such as Llama-Vision [25], Pixtral-Vision [26], and Qwen2 [27], utilizing LoRA and QLoRA methods. It supports both consumer-grade GPUs (e.g., NVIDIA RTX 3090) and scalable cloud environments equipped with high-performance accelerators, including TPU-enabled platforms such as Google Colab [36] and GPU-optimized instances like AWS EC2 G5 [37]. Successful fine-tuning of VLMs typically requires GPUs with ample VRAM and compute capability. High-performance GPUs such as the NVIDIA A100 and H100 are ideal for large-scale training, while more accessible GPUs such as the NVIDIA RTX 3090/4090 and Tesla T4 are suitable for small- to medium-scale fine-tuning and rapid prototyping [38].

3.5. AI Agents and Agentic AI

AI agents are autonomous computational entities designed to perform complex tasks by interacting with data sources, machine learning models, and external APIs within dynamic or uncertain environments [39]. When these agents are powered by LLMs, they are referred to as LLM agents, capable of interpreting natural language instructions, generating structured outputs, managing tasks, and coordinating actions across digital ecosystems [13,14].
Agentic AI extends this concept by organizing multiple LLM agents into collaborative, role-specialized systems that demonstrate advanced capabilities such as long-term planning, self-reflection, adaptive behavior, and multi-agent coordination [13]. These systems operate through agent hierarchies or workflows in which each agent performs a specific role, such as prompt engineering, retrieval, inference, evaluation, or integration [15]. The modularity of agentic architectures improves scalability, interpretability, and reusability, making them particularly suitable for domains requiring structured reasoning, task delegation, and reliable decision support [40].
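As a minimal sketch of such a role-specialized workflow, the hypothetical agents below (a prompter, an inferencer, and an evaluator, none of which correspond to the platform's actual components) each read a shared state, perform their role, and pass the result along the pipeline:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    """A role-specialized agent: a name plus an action over the shared state."""
    role: str
    act: Callable[[dict], dict]

def run_pipeline(agents, state):
    """Run agents in sequence; each receives a copy of the state and returns
    an updated version, so roles stay modular and individually replaceable."""
    for agent in agents:
        state = agent.act(dict(state))
    return state

# Hypothetical roles mirroring the text: prompt engineering -> inference -> evaluation
pipeline = [
    Agent("prompter", lambda s: {**s, "prompt": f"Assess waveform {s['image_id']}"}),
    Agent("inferencer", lambda s: {**s, "raw": f"stubbed VLM answer to: {s['prompt']}"}),
    Agent("evaluator", lambda s: {**s, "ok": "stubbed" in s["raw"]}),
]
print(run_pipeline(pipeline, {"image_id": "emg_001"}))
```

The modularity claimed in the text shows up here directly: swapping the inference agent for a different model changes one pipeline entry without touching the other roles.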

4. System Architecture

Figure 1 depicts the architecture of the platform, which is composed of four layers: (1) the Data Lake Layer, (2) the LLM Agent Layer, (3) the VLM Layer, and (4) the Reasoning Layer. Below is a brief description of each layer.

4.1. Data Lake Layer

The Data Lake layer serves as the foundational infrastructure for managing and storing the diverse large-scale datasets essential for automated neuromuscular reflex analysis. This centralized repository is designed to support the training and fine-tuning of VLMs and reasoning language models by aggregating a wide array of multimodal data relevant to H-reflex diagnostics [3,41]. The Data Lake hosts collections of annotated EMG waveform images, corresponding athlete metadata (such as age, gender, sport, and training context), clinical observations, recovery timelines, and injury histories. These richly labeled datasets enable the platform to capture the complex physiological, contextual and temporal variability inherent in neuromuscular assessments [1]. By centralizing and standardizing this information, the Data Lake layer empowers the development of robust, generalizable AI models capable of accurate, explainable, and individualized interpretation of neuromuscular reflex data across diverse populations and use cases, from clinical rehabilitation to elite sports performance monitoring.

4.2. LLM Agent Layer

The LLM Agent Layer serves as the core of the orchestration and automation of the platform, enabling seamless integration and coordination between the Data Lake, the fine-tuned VLMs, and the OpenAI-gpt-oss reasoning engine. In this layer, LLM agents act as orchestrators responsible for custom prompt engineering, ensuring efficient communication between all components and supporting the end-to-end automation of neuromuscular reflex analysis. Specifically, LLM agents dynamically construct prompts using EMG waveform images and associated metadata such as athlete characteristics, clinical observations, and recovery timelines retrieved from the Data Lake [42]. These prompts are used to query the ensemble of fine-tuned VLMs, each of which produces preliminary assessments of neuromuscular state, fatigue, injury, or recovery based on visual and contextual information.
The agents then aggregate these VLM outputs and format them into structured consolidated prompts tailored to the OpenAI-gpt-oss reasoning LLM [9]. Using its advanced reasoning capabilities, the OpenAI-gpt-oss model evaluates and synthesizes the collective output of the VLM consortium to generate a refined, explainable, and clinically relevant interpretation of the H-reflex data. By adapting the prompts to match the input requirements and context of each model, the LLM Agent Layer ensures optimal information flow, interoperability, and consistency throughout the workflow. This orchestrated process not only improves the accuracy and transparency of neuromuscular assessments, but also enables a fully automated, end-to-end AI-driven diagnostic system, as illustrated in Figure 2.
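A hedged sketch of the prompt-construction step described above is shown below; the metadata field names and answer categories are illustrative placeholders, not the platform's actual schema:

```python
def build_vlm_prompt(metadata: dict, instructions: str) -> str:
    """Assemble a context-aware prompt from athlete metadata retrieved from
    the Data Lake. Field names here are hypothetical, for illustration only."""
    context = "\n".join(f"- {k.replace('_', ' ')}: {v}" for k, v in metadata.items())
    return (
        "You are analyzing an H-reflex EMG waveform image.\n"
        f"Athlete context:\n{context}\n"
        f"Task: {instructions}\n"
        "Answer with one of: fatigued, injured, recovering, baseline."
    )

prompt = build_vlm_prompt(
    {"age": 27, "sport": "rowing", "days_since_injury": 14},
    "Classify the neuromuscular state shown in the attached waveform.",
)
print(prompt)
```

The same template can be re-rendered per model, which is how the agent layer adapts prompts "to match the input requirements and context of each model" without duplicating orchestration logic.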

4.3. VLM Layer

The VLM Layer serves as the analytical core of the platform, enabling the system to interpret complex neuromuscular signals and generate accurate, explainable assessments. This layer comprises a consortium of fine-tuned VLMs, each trained on domain-specific datasets of annotated H-reflex EMG waveform images, athlete metadata, and clinical observations [41,43]. These models are specialized to extract and analyze key electrophysiological features, as well as contextual information, to assess neuromuscular states such as fatigue, injury, and recovery. Fine-tuned VLMs are deployed and managed using efficient frameworks optimized for scalable inference and deployment on consumer-grade hardware, ensuring that the platform can maintain high performance and accessibility across diverse settings.
As illustrated in Figure 2, the LLM Agent Layer interfaces with the VLM consortium, orchestrating prompt generation, model invocation, and aggregation of preliminary assessments. Using multiple specialized models within the consortium, the VLM Layer enhances the robustness and reliability of neuromuscular analysis through diversity in visual reasoning and interpretation. This collaborative approach supports a more comprehensive and consistent assessment of H-reflex waveforms and related clinical outcomes, advancing the automation and standardization of neuromuscular diagnostics in both clinical and sports science domains.

4.4. Reasoning LLM Layer

The Reasoning Layer embodies the advanced cognitive and decision-making capabilities of the platform, using state-of-the-art reasoning language models to synthesize nuanced clinical insights. The Reasoning LLM acts as the cognitive and synthesis engine of the platform, responsible for high-level reasoning, integration, and refinement of neuromuscular assessment predictions generated by the VLM consortium.
Within the platform, OpenAI-gpt-oss [8] is used as the reasoning LLM and serves as the final decision-making engine. It receives preliminary neuromuscular assessments and diagnostic predictions from the ensemble of fine-tuned VLMs and then performs structured reasoning to evaluate, cross-validate, and refine these outputs [31,44]. By synthesizing diverse model perspectives, each based on different characteristics of the EMG waveform, athlete metadata, and contextual information, the reasoning LLM determines the most consistent and clinically relevant interpretation of the H-reflex data, supporting accurate and explainable results for fatigue, injury, and recovery status. The LLM Agent Layer facilitates this process by aggregating and formatting the VLM outputs into structured, context-aware prompts tailored for the reasoning LLM to process heterogeneous inputs and deliver a final, consensus-driven assessment.
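At its simplest, the consensus step can be approximated by a confidence-weighted vote over the per-model outputs; the model names and scores below are illustrative, and the actual reasoning LLM performs far richer synthesis than this sketch:

```python
from collections import defaultdict

def consensus(vlm_outputs):
    """Confidence-weighted vote over per-model (label, confidence) pairs.
    Returns the winning label plus an agreement fraction, so downstream
    users can see how strongly the consortium converged (transparency)."""
    scores = defaultdict(float)
    for model, (label, conf) in vlm_outputs.items():
        scores[label] += conf
    label = max(scores, key=scores.get)
    agreement = sum(1 for l, _ in vlm_outputs.values() if l == label) / len(vlm_outputs)
    return {"label": label, "agreement": agreement, "scores": dict(scores)}

# Hypothetical outputs from three fine-tuned VLMs
outputs = {
    "llama-vision": ("fatigued", 0.82),
    "pixtral":      ("fatigued", 0.74),
    "qwen2-vl":     ("recovering", 0.61),
}
print(consensus(outputs))
```

Exposing the agreement fraction alongside the label is one concrete way the pipeline can surface uncertainty rather than hiding disagreement behind a single answer.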
By integrating probabilistic reasoning, consistency checks, and domain-specific knowledge, the Reasoning LLM Layer plays a pivotal role in improving the reliability, transparency, and clinical utility of AI-assisted neuromuscular reflex analysis in both clinical and sports science applications. Furthermore, it incorporates core principles of Responsible AI, including accountability, fairness, and data integrity [10], and leverages Explainable AI [11,24] mechanisms to ensure that each diagnostic inference remains interpretable, traceable, and aligned with human expert reasoning.

5. Platform Functionality

There are four main functionalities of the platform: (1) Data Lake Setup, (2) VLM Fine-Tuning, (3) Prediction by the Fine-Tuned VLMs, and (4) Final Prediction by the Reasoning LLM. This section details each of these functions.

5.1. Data Lake Setup

The first step in the platform’s workflow involves the setup of Data Lake, which serves as the foundational layer for storing, managing, and accessing large-scale multimodal datasets essential for neuromuscular reflex analysis. These datasets include annotated EMG waveform images, athlete and session metadata (such as age, gender, sport, and training context), clinical observations, recovery timelines, and records of injuries or interventions. This comprehensive and centralized repository supports the training and fine-tuning of VLMs and reasoning models that underpin the predictive capabilities of the platform [5].
All data stored in the Data Lake are standardized and richly labeled, enabling the platform to capture the complex variability inherent in neuromuscular assessments across different populations and scenarios. By providing a robust, scalable, and secure data infrastructure, the Data Lake facilitates the development of fine-tuned models capable of interpreting subtle changes in H-reflex signals, understanding contextual factors, and supporting consistent, data-driven neuromuscular assessments. This infrastructure is a critical enabler for the scalable, explainable, and automated interpretation of neuromuscular reflex data in both clinical and sports performance contexts [2].
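A Data Lake record of the kind described above might be modeled as follows. All field names and the validation rules are illustrative assumptions, not the platform's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class HReflexRecord:
    """Illustrative Data Lake record; field names are assumptions."""
    waveform_image: str          # path to the annotated EMG waveform image
    age: int
    gender: str
    sport: str
    training_context: str
    clinical_observation: str    # expert annotation
    recovery_timeline: str
    injuries: list = field(default_factory=list)

    def validate(self):
        # Minimal integrity check before the record enters the lake.
        assert self.waveform_image.endswith((".png", ".jpg")), "image file expected"
        assert 0 < self.age < 120, "implausible age"
        return True
```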

5.2. Fine-Tuning the VLM Consortium

The second step in the platform workflow involves fine-tuning VLMs using the curated and pre-processed data stored in the Data Lake. This stage is crucial for transforming general-purpose models into specialized agents capable of interpreting H-reflex EMG waveform images, integrating athlete metadata, and generating context-aware neuromuscular predictions. Multiple state-of-the-art models, including Llama-Vision [25,45], Pixtral-Vision [26], and Qwen2 [27], are fine-tuned on this domain-specific, multimodal dataset to adapt them to the complex physiological and contextual characteristics of neuromuscular assessments. The structure and composition of the dataset used for fine-tuning are illustrated in Figure 3.
The fine-tuning process is carried out using the Unsloth library [35], which enables efficient large-scale adaptation of LLMs and VLMs. To ensure that models are deployable on consumer-grade hardware without compromising performance, the process incorporates Quantized Low-Rank Adapters (QLoRA) [17] with 4-bit quantization, as shown in Figure 4. This optimization significantly reduces memory and computational requirements, supporting real-time inference and deployment at scale.
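The efficiency gains from QLoRA follow from simple parameter arithmetic. The sketch below, using rank 16 (the LoRA rank reported in Section 6.1) and a hypothetical 4096 × 4096 projection, shows why the adapters train under 1% of the dense weights and why 4-bit storage is 4× smaller than fp16:

```python
def lora_trainable_params(d_out, d_in, rank=16):
    # LoRA replaces the dense update dW (d_out x d_in) with B @ A,
    # where B is (d_out x rank) and A is (rank x d_in).
    return rank * (d_out + d_in)

def quantized_bytes(n_params, bits=4):
    # 4-bit weights pack two parameters per byte.
    return n_params * bits // 8

full = 4096 * 4096                                    # one dense projection
adapter = lora_trainable_params(4096, 4096, rank=16)  # 131,072 params
print(adapter / full)                # ~0.0078: under 1% of the dense weights
print(quantized_bytes(full) / (full * 2))  # 0.25: 4-bit vs. fp16 storage
```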
Upon completion, the fine-tuned and quantized models are deployed using lightweight frameworks optimized for efficient inference, such as Ollama [46]. These specialized models form the analytical core of the platform, each capable of analyzing EMG waveform images and associated metadata to produce preliminary predictions of neuromuscular state, such as fatigue level, injury status, or recovery progression, based on learned physiological patterns and contextual cues.

5.3. Prediction by Fine-Tuned VLMs

Following the fine-tuning process, the next phase of the platform involves generating preliminary neuromuscular assessments and predictions using the consortium of fine-tuned VLMs. When new EMG waveform images and associated metadata are ingested, the platform’s LLM Agent Layer initiates the predictive analysis by interfacing with specialized models through efficient inference frameworks such as Ollama [46]. To facilitate accurate and context-aware predictions, the LLM Agent employs custom prompt engineering, embedding the relevant waveform data, athlete characteristics, and contextual information into tailored prompts for each model [47]. These prompts are carefully designed to match the input requirements of each VLM and provide a comprehensive representation of the physiological and situational context.
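Such prompt assembly can be sketched as a simple template function. The field names and wording below are illustrative assumptions rather than the platform's actual prompts:

```python
def build_vlm_prompt(metadata, model_name):
    """Assemble a context-aware prompt for one VLM (illustrative template)."""
    context = (
        f"Athlete: {metadata['age']}-year-old {metadata['sport']} player. "
        f"Training context: {metadata['training_context']}. "
        f"Known injury: {metadata.get('injury', 'none reported')}."
    )
    task = ("Analyze the attached H-reflex EMG waveform. Report amplitude, "
            "latency, and the most likely neuromuscular state "
            "(fatigue, injury, or recovery).")
    # Each VLM receives the same context, formatted for its input schema.
    return {"model": model_name, "prompt": f"{context}\n{task}"}
```

A real agent layer would additionally attach the waveform image and adapt the structure to each model's chat template; the sketch shows only the metadata embedding.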
Each fine-tuned model then analyzes the input, extracts key electrophysiological features and contextual signals, and produces its own prediction about neuromuscular state, such as fatigue level, injury risk, or stage of recovery. Individual outputs are collected by the LLM Agent, which organizes them into a structured format for downstream reasoning. This step ensures that the diverse analytical capabilities of the fine-tuned models are fully utilized, providing rich, reliable, and interpretable insights into the current neuromuscular condition.
By allowing multiple independent evaluations throughout the model consortium, this layer enhances the diversity, robustness, and generalizability of the platform’s predictions, supporting consistent and nuanced assessments across a wide range of athletes and scenarios.

5.4. Final Prediction by OpenAI-Gpt-Oss Reasoning LLM

To ensure the highest level of accuracy, reliability, and contextual validity of the predictions, the platform uses a consensus-based decision-making mechanism for the final neuromuscular assessment. Rather than relying on the output of a single model, the platform aggregates predictive outputs from multiple fine-tuned VLMs within the consortium. These individual results are then evaluated, compared, and synthesized by OpenAI-gpt-oss, a specialized reasoning LLM designed to perform advanced analytical inference [44]. As a core component of the architecture, OpenAI-gpt-oss serves as an intelligent adjudicator, capable of contextualizing, validating, and refining the predictions provided by the underlying VLMs. Using its advanced reasoning capabilities, OpenAI-gpt-oss identifies the most consistent and contextually appropriate outcome from the diverse set of model-generated insights.
To enable this reasoning process, the LLM Agent constructs structured prompts that embed and organize the output of fine-tuned models. These prompts, as illustrated in Figure 5, provide OpenAI-gpt-oss with a unified view of candidate assessments, waveform features, and contextual metadata. The reasoning LLM synthesizes a well-supported interpretation of the H-reflex data in relation to neuromuscular state, fatigue, injury risk, and recovery progression through physiology-guided contradiction resolution, ensuring that conflicting outputs are reconciled based on electrophysiological plausibility rather than simple averaging. To prevent hallucinatory or unsupported interpretations, the prompt enforces explicit uncertainty reporting for low-quality or ambiguous signals, maintaining responsible, transparent, and physiologically grounded reasoning.
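The contradiction-resolution policy described here can be illustrated with a small sketch. The function name, the evidence strings, and the 0.5 quality threshold are hypothetical placeholders; the sketch shows only the two behaviors the prompt enforces, namely explicit uncertainty reporting for low-quality signals and physiology-guided selection instead of averaging:

```python
def reason_over_candidates(candidates, signal_quality):
    """Sketch of the adjudication policy described above (names are ours).

    candidates: list of (label, evidence) pairs from the VLM consortium.
    signal_quality: 0..1 score for the underlying EMG recording.
    """
    if signal_quality < 0.5:
        # Enforced uncertainty reporting for low-quality/ambiguous signals.
        return {"assessment": "uncertain", "reason": "low signal quality"}
    # Physiology-guided tie-breaking: prefer candidates whose cited evidence
    # matches expected electrophysiology rather than averaging labels.
    plausible = [c for c in candidates
                 if "reduced amplitude" in c[1] or "prolonged latency" in c[1]]
    chosen = plausible[0] if plausible else candidates[0]
    return {"assessment": chosen[0], "reason": chosen[1]}
```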
This consensus-driven architecture significantly improves the robustness and generalizability of predictive outputs by mitigating the limitations of individual models and reducing variability. By orchestrating this process through a transparent and explainable pipeline, the platform not only enhances trustworthiness, but also establishes a replicable framework that upholds the principles of Responsible and Explainable AI in neuromuscular reflex interpretation and sports performance monitoring.
Integration of ensemble-based inference with symbolic reasoning marks a transformative shift in neuromuscular analytics, offering a scalable and interpretable decision support tool for clinicians, trainers, and researchers, and thus demonstrating the potential of combining large-scale vision-language understanding with structured reasoning to improve prediction and monitoring in complex physiological domains.

6. Implementation and Evaluation

The implementation of the proposed platform was carried out using three fine-tuned VLMs, Llama-Vision, Pixtral-Vision, and Qwen2-VL [6,25,48], in combination with the OpenAI-gpt-oss [7,8,9] reasoning LLM. The LLM Agent Layer was implemented using the OpenAI Agents SDK [49] and the Google Agent Development Kit [50], which enable secure orchestration, transparent auditability, and decentralized control of all model interactions.
Fine-tuning was performed on an AWS g5.xlarge instance equipped with a single GPU (24 GB GPU memory), 4 vCPUs, and 16 GB system memory, using the Unsloth library with LoRA-based parameter-efficient fine-tuning [35,37]. To ensure reproducibility across different hardware configurations and random initializations, all training runs were executed with fixed random seeds, standardized hyperparameter settings, and version-controlled datasets. The LoRA fine-tuning configuration allows for lightweight adaptation while maintaining model stability, enabling consistent results when replicated on both consumer-grade GPUs (e.g., NVIDIA RTX 3090) and cloud-based accelerators (e.g., AWS EC2 G5, Google TPU). The experimental variance across multiple fine-tuning trials remained within acceptable limits (<2% deviation in validation performance), demonstrating strong reproducibility and hardware independence of the training pipeline.
The fine-tuning dataset consisted of approximately 1200 expertly annotated records, each containing an H-reflex EMG waveform image along with detailed participant metadata and contextual information, including training intensity, rehabilitation phase, and post-injury recovery status. Each sample also included expert-annotated observations describing neuromuscular state, fatigue, injury indicators, and recovery progression. Data were collected and aggregated from anonymized, ethically sourced records of controlled laboratory experiments and rehabilitation programs conducted under institutional review, ensuring compliance with human-subject research guidelines. Figure 3 illustrates the composition of the dataset and the metadata structure.
To address methodological circularity and ensure clinical transparency, operational definitions of clinical labels were clearly established and consistently applied across all datasets. Fatigue was defined as a reduction that exceeds 20% in the amplitude ratio H/M after standardized exercise or rehabilitation sessions, verified by certified sports medicine specialists. The injury classification was based on the formal clinical diagnosis corroborated by imaging findings (e.g., MRI scans) and functional performance tests to determine the severity and location of the impairment. Recovery stages were categorized according to rehabilitation progression milestones: early recovery (1 to 7 days after injury), intermediate recovery (8 to 28 days), and full recovery, determined by physician-approved clearance for unrestricted physical activity.
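These operational definitions are concrete enough to express directly as rules. The following sketch encodes the >20% H/M-ratio criterion and the day-based recovery stages exactly as stated above (the function names are ours):

```python
def fatigue_label(hm_ratio_before, hm_ratio_after):
    # Fatigue: >20% reduction in the H/M amplitude ratio after a
    # standardized exercise or rehabilitation session.
    drop = (hm_ratio_before - hm_ratio_after) / hm_ratio_before
    return drop > 0.20

def recovery_stage(days_post_injury, cleared=False):
    # Full recovery requires physician-approved clearance, regardless of days.
    if cleared:
        return "full"
    if days_post_injury <= 7:
        return "early"
    if days_post_injury <= 28:
        return "intermediate"
    return "pending clearance"
```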
The test dataset and ground-truth annotations were generated through a dedicated Agentic AI-based workflow, which used multimodal reasoning and feedback from domain experts to ensure objectivity and reproducibility [15,40]. Subsequently, all annotations were reviewed and validated by board-certified electrophysiologists according to standardized neuromuscular assessment protocols.
This structured labeling and validation framework ensures that all ground-truth annotations are clinically interpretable and independently verified, addressing concerns about dataset provenance and label definitions. The process reinforces the robustness, transparency, and Responsible AI principles underlying the model evaluation pipeline.
The Unsloth framework requires that input data be structured in an instruction-based format [35]. To meet this requirement, the dataset was preprocessed and transformed into the required schema, shown in Figure 6. Each training sample included fields such as instruction (providing the analysis context and metadata), image (representing the EMG waveform), and output (containing the expected neuromuscular assessment or prediction of the model). The dataset was partitioned into training, validation, and testing subsets using a 2/3, 1/6, 1/6 split, respectively. The training process was completed in approximately 1627 s (27.12 min). The maximum memory reservation during training was 14.605 GB, with actual memory utilization reaching 5.853 GB, equivalent to 39.69% of reserved memory and 99.03% of maximum allocation. These results demonstrate that fine-tuning VLMs for neuromuscular reflex assessment using structured multimodal data can be performed efficiently, even on moderate-scale datasets and accessible hardware. This underscores the practicality and accessibility of applying advanced AI methods in specialized biomedical domains.
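The 2/3, 1/6, 1/6 partition can be reproduced with a seeded shuffle, consistent with the fixed-seed reproducibility policy described earlier in this section. On the ~1200-record dataset this yields 800/200/200 records (the helper below is illustrative, not the platform's actual code):

```python
import random

def split_dataset(records, seed=42):
    """2/3 train, 1/6 validation, 1/6 test, as used in the paper."""
    rng = random.Random(seed)   # fixed seed for reproducible partitions
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 2 // 3
    n_val = n // 6
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_dataset(list(range(1200)))  # 800 / 200 / 200
```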
After fine-tuning, the models were quantized using QLoRA [17], enabling efficient operation on consumer-grade hardware. This optimization was essential for deploying the fine-tuned models with Ollama, a framework designed for lightweight yet high-performance model execution. Based on the predictions of the VLMs, the OpenAI-gpt-oss LLM synthesizes the collective output to generate a final consensus assessment. Custom prompts are used to instruct the OpenAI-gpt-oss Reasoning LLM, providing the necessary context for effective integration and reasoning across model outputs.
Platform performance was evaluated in three main areas: (1) Efficiency and accuracy of VLM fine-tuning, (2) Predictive performance and consistency of the VLM consortium, and (3) Advanced reasoning and consensus-building capabilities of the OpenAI-gpt-oss LLM. The results indicate that the proposed architecture enables robust, transparent and scalable neuromuscular reflex assessment, setting the stage for widespread adoption in clinical and sports science applications.

6.1. Evaluation of VLM Fine-Tuning

This evaluation focuses on measuring the effectiveness of the fine-tuning process in improving the performance of VLMs in analyzing neuromuscular reflex imagery, specifically H-reflex waveforms. We evaluated the fine-tuned Qwen2.5-VL model’s ability to interpret visual H-reflex data and produce structured observations on reflex amplitude, latency, and recovery status based on the provided image inputs [51].
Throughout the fine-tuning process, we continuously monitored critical training metrics, especially training loss and validation loss, to assess the model’s learning dynamics and generalization ability [43]. As illustrated in Figure 7, the validation loss (eval/loss) shows a steep decline during the initial 50 training steps, dropping from approximately 1.7 to 1.2, indicating rapid adaptation to the neuromuscular domain-specific visual patterns inherent in H-reflex waveform imagery. The validation loss continues to decrease smoothly over subsequent steps, eventually stabilizing at 1.1952 by step 180, suggesting an improved generalization to unseen H-reflex samples. The evaluation runtime shown in Figure 7 (eval/runtime) demonstrates consistent inference performance, stabilizing at approximately 3.91 s per evaluation cycle after initial fluctuations, confirming predictable computational requirements for clinical deployment scenarios.
Figure 8 presents the training loss progression (train/loss), which demonstrates a continuous monotonic decrease from an initial value of approximately 1.9 to a final value of 0.9038 at step 180. The smooth decline without significant oscillations indicates stable optimization dynamics under the chosen hyperparameters. The learning rate schedule (train/learning_rate), also depicted in Figure 8, follows a linear decay strategy with an initial warmup phase. During the first 20 steps, the learning rate increased from zero to the peak value of 0.0001, allowing model parameters to stabilize before aggressive weight updates. Subsequently, the learning rate underwent linear decay toward zero, enabling fine-grained convergence as training progressed [17,51].
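The warmup-plus-linear-decay schedule described above is easy to state in closed form. Using the reported values (peak 0.0001, 20 warmup steps, 180 total steps), a sketch is:

```python
PEAK_LR, WARMUP, TOTAL = 1e-4, 20, 180

def learning_rate(step):
    """Linear warmup to PEAK_LR over the first 20 steps,
    then linear decay to zero at step 180."""
    if step <= WARMUP:
        return PEAK_LR * step / WARMUP
    return PEAK_LR * (TOTAL - step) / (TOTAL - WARMUP)
```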
The generalization gap between training loss (0.90) and validation loss (1.19), representing approximately 24% divergence, falls within acceptable bounds for domain-specific medical imaging tasks, confirming the model’s ability to learn meaningful H-reflex patterns without significant overfitting. Figure 9 provides a consolidated view of the final training loss at 1.203, along with the total computational cost of 5.65 × 10¹⁶ floating-point operations (FLOPs), quantifying the computational resources required to adapt the VLM to neuromuscular reflex analysis.
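As a quick arithmetic check, the ~24% figure corresponds to the gap taken relative to the validation loss (an interpretation on our part, since the normalization is not stated explicitly):

```python
train_loss, val_loss = 0.9038, 1.1952
gap = (val_loss - train_loss) / val_loss
print(round(gap * 100, 1))  # 24.4% relative divergence
```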
Figure 10 captures the progression of the training epoch and the dynamics of the gradient norm throughout the fine-tuning process. The epoch count increased linearly to 3.6 epochs at step 180, aligning with established recommendations for fine-tuning VLMs on specialized medical imaging datasets where 3–5 epochs typically achieve optimal performance before overfitting onset [26]. The gradient norm (train/grad_norm) exhibited a gradual increase from approximately 0.3 to 0.71, corresponding to the decaying learning rate schedule where smaller learning rates require larger gradient magnitudes to achieve meaningful parameter updates. The absence of sudden spikes or gradient explosions throughout the training confirms stable optimization dynamics under the LoRA configuration (rank 16, alpha 16), indicating that the adapter capacity was appropriately calibrated for the H-reflex analysis task [16,17].
Figure 11 illustrates the evaluation throughput metrics, demonstrating computational efficiency throughout the training process. The samples per second metric (eval/samples_per_second) increased from approximately 12.3 during initial evaluation cycles to a stable 12.78 samples per second, while the steps per second metric (eval/steps_per_second) improved correspondingly from 6.15 to 6.39. This throughput improvement reflects GPU memory optimization as the model adapted to the H-reflex image distribution. These performance characteristics suggest feasibility for real-time clinical applications requiring rapid neuromuscular reflex assessment.
The early stopping mechanism, configured with a patience of 5 evaluation cycles, terminated training at step 180 after detecting plateau behavior in the validation loss. At this point, the model had completed approximately 3.6 epochs over the training dataset, demonstrating efficient convergence without unnecessary computational expenditure. These trends collectively indicate that the fine-tuning process was both effective and stable, enabling the VLM to adapt precisely to the neuromuscular reflex analysis domain while maintaining strong performance on unseen H-reflex waveform data.
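Early stopping with a patience of 5 evaluation cycles can be sketched as follows. The 10-step evaluation interval and the improvement threshold are illustrative assumptions:

```python
def early_stop_step(val_losses, patience=5, eval_every=10):
    """Return the training step at which training halts: no new best
    validation loss for `patience` consecutive evaluation cycles."""
    best, stale = float("inf"), 0
    for i, loss in enumerate(val_losses, start=1):
        if loss < best - 1e-4:       # meaningful improvement resets patience
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return i * eval_every
    return len(val_losses) * eval_every  # ran to completion
```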

6.2. Prediction Performance of Fine-Tuned VLM Consortium

Following the training phase, we evaluated the predictive performance of the fine-tuned VLMs in the context of neuromuscular reflex assessment. This evaluation compared expert-validated H-reflex observations derived from electrophysiological waveform analysis with the predictions generated by both the baseline (pre-trained) and fine-tuned VLMs. To eliminate methodological circularity and ensure external validity, the evaluation employed an independent, held-out test dataset (n = [X] cases) that was entirely excluded from all training and fine-tuning procedures.
Figure 12 presents the prediction outputs of the Llama-Vision model [30] before and after fine-tuning for H-reflex waveform analysis. Prior to fine-tuning, the model produced verbose but loosely structured outputs, focusing on general neuromuscular implications such as increased H-reflex amplitude, reduced reciprocal inhibition, and possible changes in muscle spindle or neuromuscular junction function. While these observations were technically relevant, they lacked concise summarization, consistent terminology, and clear linkage to injury context or recovery interpretation. The model also expressed uncertainty in estimating recovery phases due to the absence of domain-specific conditioning.
After fine-tuning on a domain-specific dataset of H-reflex images annotated with expert observations, injury details, and recovery timelines, the model demonstrated a marked improvement in output clarity and relevance. The predictions became concise and structured, capturing key waveform characteristics and contextual interpretations (e.g., H-reflex morphology consistent with a recovery phase), the identified injury (e.g., recent hamstring injury), and an inferred recovery trajectory. Importantly, although the figure presents a single-session H-reflex waveform, recovery-related descriptors such as “gradual normalization” reflect model inference learned from longitudinal training data, rather than an explicit temporal progression shown in the figure. This transformation illustrates how targeted fine-tuning can significantly enhance a VLM’s ability to generate structured, context-aware, and clinically interpretable outputs for neuromuscular reflex analysis.
Figure 13 presents the prediction outputs of the Pixtral-Vision model before and after fine-tuning for H-reflex waveform interpretation. This figure is intended to illustrate changes in model interpretative behavior rather than to quantify physiological differences between injured and control conditions. Accordingly, the waveform shown represents a single-session H-reflex recording, while normative baselines and control comparisons are incorporated at the dataset and training level and are described elsewhere in the paper.
Prior to fine-tuning, the model generated verbose, repetitive, and largely generic descriptions of the waveform, incorrectly characterizing the reflex as normal and associating it with unrelated clinical conditions such as sensory neuropathy or pernicious anemia. These predictions lacked sensitivity to reduced reflex amplitude and prolonged latency evident in the waveform and did not appropriately account for the injury context.
After fine-tuning on a domain-specific dataset of H-reflex images annotated with expert observations, injury types, and recovery phases, the Pixtral-Vision model produced concise, context-aware predictions aligned with the electrophysiological characteristics of the signal. The post-fine-tuning output correctly identified reduced H-reflex amplitude and latency prolongation as indicative of compromised reflex pathway function and associated these features with recent hamstring trauma.
Although peak-to-peak amplitude is the standard quantitative metric in clinical H-reflex analysis, the present figure emphasizes visual waveform morphology, as the VLM operates directly on image representations rather than extracted numerical features. All waveforms were recorded under resting conditions, and the apparent signal variability reflects physiologically expected EMG noise preserved to retain diagnostically relevant morphology. Recovery-related statements (e.g., “gradual recovery trend”) reflect inference learned from longitudinal training data rather than a control comparison or time course displayed in this figure.
Figure 14 presents the prediction outputs generated by the Qwen-2 model [48] for H-reflex waveform interpretation. Before fine-tuning, the model produced verbose, multi-point assessments, identifying issues such as abnormal waveform profile, reduced reflex gain, and possible neuromuscular fatigue or adaptation. While these insights reflected general neuromuscular principles, the predictions were overly broad and included speculative explanations that were not well aligned with the specific waveform characteristics or injury context represented in the input image.
After fine-tuning on a domain-specific H-reflex dataset containing expert annotations of waveform morphology, injury type, and recovery phase, the model delivered concise and contextually relevant predictions that closely matched expert observations. The post-fine-tuning output identified H-reflex morphology consistent with a recovery phase, associated the waveform with a mild muscle strain, and indicated that no additional recovery was required, aligning with the athlete’s clearance for regular training. Although the figure presents a representative single-session H-reflex waveform, recovery-related descriptors such as “gradual normalization” reflect model inference based on waveform morphology learned from longitudinal training data, rather than an explicit temporal progression shown in the figure. This improvement demonstrates the impact of targeted fine-tuning in transforming the Qwen-2 model from producing generalized neuromuscular commentary into generating precise, actionable, and context-aware assessments for reflex pathway evaluation.
These results demonstrate that the fine-tuned models consistently produce predictions that closely align with expert-validated neuromuscular assessments, showing improved precision, consistency, and interpretability. Compared to their baseline counterparts, the fine-tuned VLMs exhibit a substantial improvement in accurately characterizing H-reflex waveform features, identifying relevant neuromuscular conditions, and contextualizing recovery timelines. This underscores the effectiveness of task-specific fine-tuning in enhancing model performance for specialized biomechanical and electrophysiological analysis. These findings validate the utility of VLMs as reliable decision-support components in AI-assisted neuromuscular reflex evaluation platforms.

6.3. Reasoning Performance of the OpenAI-Gpt-Oss LLM

In this evaluation, we examined the reasoning capabilities of the OpenAI-gpt-oss LLM in synthesizing neuromuscular diagnostic assessments derived from multiple fine-tuned VLMs. The objective was to assess the model’s ability to integrate diverse analytical outputs—each based on single-session H-reflex waveform images—into a single, clinically coherent final assessment.
Figure 15 compares the independent predictions of three fine-tuned VLMs (Pixtral-Vision, Llama-Vision, and Qwen-2) with the consolidated reasoning output produced by OpenAI-gpt-oss. While individual VLMs consistently identified key waveform abnormalities such as reduced H-reflex amplitude and latency prolongation, the reasoning LLM demonstrated an enhanced ability to reconcile partially divergent interpretations into a unified, physiologically consistent conclusion.
Importantly, although the figure presents a representative single-session H-reflex waveform and corresponding model outputs, recovery-related interpretations generated by the reasoning LLM reflect inference based on cross-model consensus and patterns learned from longitudinal training data, rather than an explicit temporal progression shown in the figure. The reasoning process synthesizes waveform features, inferred neuromuscular implications (e.g., reduced alpha-motoneuron excitability, muscle spindle desensitization), and contextual injury information to produce a structured and clinically meaningful assessment.
To evaluate its contribution, an internal ablation-style comparison showed that the reasoning layer improved diagnostic coherence and interpretive reliability by enforcing cross-model agreement checks and rejecting contradictory evidence rather than naively averaging predictions. This approach mitigates the propagation of individual model errors and ensures that final outputs remain physiologically grounded, transparent, and explainable.
These results confirm that integrating a dedicated reasoning LLM within the neuromuscular reflex analysis platform enhances assessment robustness by combining multi-model consensus with structured clinical reasoning. This consensus-driven approach improves diagnostic precision while reinforcing adherence to Responsible AI [10] principles, including fairness, accountability, and transparency, and incorporates Explainable AI [11] mechanisms that render the decision-making process interpretable and traceable for clinicians, therapists, and sports scientists.

6.4. Clinical Implementation and Limitations

While the proposed platform demonstrates strong performance in automated H-reflex interpretation, it is designed to complement rather than replace existing clinical and signal-processing workflows. Traditional electromyographic analysis and clinician-supervised interpretation remain the gold standard in diagnosis. The system’s role is to enhance expert judgment by reducing inter-rater variability, improving longitudinal consistency, and accelerating interpretation in high-throughput rehabilitation or sports-science settings.
Compared to conventional CNN-based classifiers and FEM-assisted biomedical monitoring systems, the proposed VLM–LLM consortium introduces an additional layer of multimodal reasoning and contextual synthesis, enabling an explainable consensus-driven interpretation of complex neuromuscular patterns. This design promotes Responsible AI principles, maintaining transparency, traceability, and clinician oversight throughout the decision pipeline [11,12].
Future clinical deployment will focus on hybrid workflows in which AI-assisted analysis operates within existing neurophysiological assessment frameworks. Broader adoption and possible transition to partial automation will require prospective validation in diverse patient populations, parameter-sensitivity benchmarking, and alignment with regulatory and ethical standards for medical decision-support systems.

6.5. Ethical Approval and Data Use

This study did not conduct a prospective collection of human participant data. All analyses were performed on an existing, fully de-identified dataset of H-reflex EMG recordings that originated from controlled laboratory experiments and rehabilitation programs previously conducted under institutional review and ethical oversight. Original data acquisition included informed consent from all participants at the time of data collection, with ethics approval obtained from the respective institutional review boards. Because this research constitutes a secondary analysis of de-identified, publicly available data and involved no new human subjects, interventions, or direct participant contact, the present study was determined to be exempt from Institutional Review Board (IRB) review. All data handling complied with applicable data protection and privacy regulations, and no attempt was made to re-identify individual participants. This secondary analysis adheres to fundamental principles of research integrity and responsible data stewardship in biomedical science.

7. Conclusions and Future Work

This work presented a comprehensive AI-assisted neuromuscular reflex analysis platform that integrates fine-tuned VLMs with a dedicated reasoning LLM to enhance the interpretation of H-reflex waveform data. Using multiple fine-tuned VLMs (Pixtral-Vision, Llama-Vision, and Qwen-2), the platform demonstrated the ability to accurately identify waveform abnormalities, infer potential neuromuscular implications, and estimate recovery timelines relevant to sports performance monitoring and rehabilitation.
The evaluation results confirmed that fine-tuning significantly improved the precision, consistency, and interpretability of each VLM’s predictions, enabling closer alignment with clinically validated observations. Furthermore, the integration of OpenAI-gpt-oss reasoning LLM provided an additional layer of robustness by synthesizing diverse model outputs into a unified and contextually rich assessment. This consensus-driven reasoning process not only reduces variability in model predictions, but also enhances diagnostic reliability and accountability, aligning the platform with the principles of Responsible AI. In addition, by embedding Explainable AI mechanisms, it ensures that the reasoning behind each prediction remains transparent, interpretable, and suitable for real-world clinical and sports applications.
The proposed system can support clinicians, sports scientists, and rehabilitation specialists by enabling objective, scalable, and explainable assessments of neuromuscular function. Future work will focus on expanding the platform to incorporate multimodal physiological data (e.g., EMG, kinematic analysis), refining real-time analysis capabilities, and validating the system in large-scale clinical and sports environments. These developments will further strengthen the platform’s role as a decision-support tool for injury prevention, recovery monitoring, and performance optimization.
We plan to improve the platform by integrating additional fine-tuned VLMs and multiple reasoning models to further strengthen diagnostic consensus and reduce single-model bias. We also intend to incorporate model-based constraints and parameter sensitivity analysis, using finite element modeling (FEM) and biophysical priors to account for subject-specific variability arising from tissue conductivity, electrode placement, and neuromuscular geometry. These extensions will improve robustness and generalizability across individuals and experimental conditions. Future deployments will target real-world applications, starting with professional soccer teams, enabling continuous in-field monitoring of players’ neuromuscular health to prevent injury, track rehabilitation, and optimize performance.

Author Contributions

Conceptualization, E.B. and R.G.; methodology, E.B. and R.G.; software, E.B.; validation, E.B., R.G. and S.S.; formal analysis, E.B.; investigation, R.G., S.S., R.M., C.K.R., B.S.S., A.H., A.Y., S.K., M.D.S., A.M. and I.S.; resources, E.B. and R.G.; data curation, E.B.; writing—original draft preparation, E.B. and R.G.; writing—review and editing, R.G., S.S., R.M. and K.D.Z.; visualization, E.B. and R.G.; supervision, R.G., S.S., A.Y. and S.K.; project administration, R.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

Author Amin Hass was employed by the company AnaletIQ. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Šádek, P.; Hrušková, E.; Ostrý, S.; Otáhal, J. Neurophysiological Assessment of H-Reflex Alterations in Compressive Radiculopathy. Physiol. Res. 2024, 73, 427. [Google Scholar] [CrossRef]
  2. Gomes, M.; Gonçalves, A.D.; Pezarat-Correia, P.; Mendonca, G.V. Changes in H-reflex, V-wave, and contractile properties of the plantar flexors following concurrent exercise sessions—The acute interference effect. J. Appl. Physiol. 2025, 138, 327–341. [Google Scholar] [CrossRef]
  3. Martinez-Thompson, J.M.; Mazurek, K.A.; Parra-Cantu, C.; Naddaf, E.; Gogineni, V.; Botha, H.; Jones, D.T.; Laughlin, R.S.; Barnard, L.; Staff, N.P. Artificial intelligence models using F-wave responses predict amyotrophic lateral sclerosis. Brain 2025, 148, awaf014. [Google Scholar] [CrossRef]
  4. Long, Z.; Cao, Z.; Chen, W.; Wei, Z. EMGLLM: Data-to-Text Alignment for Electromyogram Diagnosis Generation with Medical Numerical Data Encoding. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 27 July–1 August 2025; pp. 20470–20480. [Google Scholar]
  5. Peng, F.; Yang, X.; Xiao, L.; Wang, Y.; Xu, C. Sgva-clip: Semantic-guided visual adapting of vision-language models for few-shot image classification. IEEE Trans. Multimed. 2023, 26, 3469–3480. [Google Scholar] [CrossRef]
  6. Zhang, J.; Huang, J.; Jin, S.; Lu, S. Vision-language models for vision tasks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5625–5644. [Google Scholar] [CrossRef] [PubMed]
  7. Zhang, Y.; Mao, S.; Ge, T.; Wang, X.; de Wynter, A.; Xia, Y.; Wu, W.; Song, T.; Lan, M.; Wei, F. Llm as a mastermind: A survey of strategic reasoning with large language models. arXiv 2024, arXiv:2404.01230. [Google Scholar] [CrossRef]
  8. Wallace, E.; Watkins, O.; Wang, M.; Chen, K.; Koch, C. Estimating Worst-Case Frontier Risks of Open-Weight LLMs. arXiv 2025, arXiv:2508.03153. [Google Scholar]
  9. Mondillo, G.; Masino, M.; Colosimo, S.; Perrotta, A.; Frattolillo, V. Evaluating AI Reasoning Models in Pediatric Medicine: A Comparative Analysis of o3-mini and o3-mini-high. medRxiv 2025. [Google Scholar] [CrossRef]
  10. Shruti, I.; Kumar, A.; Seth, A.; Rajeev, R.N. Responsible Generative AI: A Comprehensive Study to Explain LLMs. In Proceedings of the 2024 International Conference on Electrical, Computer and Energy Technologies (ICECET), Sydney, Australia, 25–27 July 2024; IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar]
  11. Dwivedi, R.; Dave, D.; Naik, H.; Singhal, S.; Omer, R.; Patel, P.; Qian, B.; Wen, Z.; Shah, T.; Morgan, G.; et al. Explainable AI (XAI): Core ideas, techniques, and solutions. ACM Comput. Surv. 2023, 55, 194. [Google Scholar] [CrossRef]
  12. Bandara, E.; Hewa, T.; Gore, R.; Shetty, S.; Mukkamala, R.; Foytik, P.; Rahman, A.; Bouk, S.H.; Liang, X.; Hass, A.; et al. Towards Responsible and Explainable AI Agents with Consensus-Driven Reasoning. arXiv 2025, arXiv:2512.21699. [Google Scholar] [CrossRef]
  13. Acharya, D.B.; Kuppan, K.; Divya, B. Agentic AI: Autonomous Intelligence for Complex Goals—A Comprehensive Survey. IEEE Access 2025, 13, 18912–18936. [Google Scholar] [CrossRef]
  14. Huang, X.; Liu, W.; Chen, X.; Wang, X.; Wang, H.; Lian, D.; Wang, Y.; Tang, R.; Chen, E. Understanding the planning of LLM agents: A survey. arXiv 2024, arXiv:2402.02716. [Google Scholar] [CrossRef]
  15. Bandara, E.; Gore, R.; Foytik, P.; Shetty, S.; Mukkamala, R.; Rahman, A.; Liang, X.; Bouk, S.H.; Hass, A.; Rajapakse, S.; et al. A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows. arXiv 2025, arXiv:2512.08769. [Google Scholar] [CrossRef]
  16. Augustin, A.; Yi, J.; Clausen, T.; Townsley, W. A study of LoRa: Long range & low power networks for the internet of things. Sensors 2016, 16, 1466. [Google Scholar] [CrossRef]
  17. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Adv. Neural Inf. Process. Syst. 2024, 36, 441. [Google Scholar]
  18. Rohr, M.; Haidamous, J.; Schäfer, N.; Schaumann, S.; Latsch, B.; Kupnik, M.; Antink, C.H. On the benefit of FMG and EMG sensor fusion for gesture recognition using cross-subject validation. IEEE Trans. Neural Syst. Rehabil. Eng. 2025, 33, 935–944. [Google Scholar] [CrossRef]
  19. Montazerin, M.; Zabihi, S.; Rahimian, E.; Mohammadi, A.; Naderkhani, F. ViT-HGR: Vision transformer-based hand gesture recognition from high density surface EMG signals. In Proceedings of the 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Glasgow, UK, 11–15 July 2022; IEEE: New York, NY, USA, 2022; pp. 5115–5119. [Google Scholar]
  20. Hussain, I.; Jany, R. Interpreting stroke-impaired electromyography patterns through explainable artificial intelligence. Sensors 2024, 24, 1392. [Google Scholar] [CrossRef]
  21. Mohapatra, P.; Pandey, A.; Zhang, X.; Zhu, Q. Can LLMs Understand Unvoiced Speech? Exploring EMG-to-Text Conversion with LLMs. arXiv 2025, arXiv:2506.00304. [Google Scholar]
  22. Sadeghi, H. Advanced multiscale machine learning for nerve conduction velocity analysis. Sci. Rep. 2025, 15, 23399. [Google Scholar] [CrossRef] [PubMed]
  23. Pratticò, D.; Carlo, D.D.; Silipo, G.; Laganà, F. Hybrid FEM-AI approach for thermographic monitoring of biomedical electronic devices. Computers 2025, 14, 344. [Google Scholar] [CrossRef]
  24. Pehlke, M.; Jansen, M. LLM Driven Processes to Foster Explainable AI. arXiv 2025, arXiv:2511.07086. [Google Scholar] [CrossRef]
  25. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
  26. Samo, H.; Ali, K.; Memon, M.; Abbasi, F.A.; Koondhar, M.Y.; Dahri, K. Fine-tuning mistral 7b large language model for python query response and code generation: A parameter efficient approach. VAWKUM Trans. Comput. Sci. 2024, 12, 205–217. [Google Scholar]
  27. Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv 2024, arXiv:2409.12191. [Google Scholar]
  28. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  29. Imran, M.; Almusharraf, N. Google Gemini as a next generation AI educational tool: A review of emerging educational technology. Smart Learn. Environ. 2024, 11, 22. [Google Scholar] [CrossRef]
  30. Saporita, A.; Pipoli, V.; Bolelli, F.; Baraldi, L.; Acquaviva, A.; Ficarra, E. Tracing Information Flow in LLaMA Vision: A Step Toward Multimodal Understanding. In Proceedings of the 21st International Conference in Computer Analysis of Images and Patterns, Canary Islands, Spain, 22–25 September 2025. [Google Scholar]
  31. Gore, R.; Bandara, E.; Shetty, S.; Musto, A.E.; Rana, P.; Valencia-Romero, A.; Rhea, C.; Tayebi, L.; Richter, H.; Yarlagadda, A.; et al. Proof-of-TBI–Fine-Tuned Vision Language Model Consortium and OpenAI-o3 Reasoning LLM-Based Medical Diagnosis Support System for Mild Traumatic Brain Injury (TBI) Prediction. arXiv 2025, arXiv:2504.18671. [Google Scholar]
  32. Wang, Z.; Han, Z.; Chen, S.; Xue, F.; Ding, Z.; Xiao, X.; Tresp, V.; Torr, P.; Gu, J. Stop reasoning! when multimodal LLM with chain-of-thought reasoning meets adversarial image. arXiv 2024, arXiv:2402.14899. [Google Scholar]
  33. Bandara, E.; Bouk, S.H.; Shetty, S.; Gore, R.; Kompella, S.; Mukkamala, R.; Rahman, A.; Foytik, P.; Liang, X.; Keong, N.W.; et al. Bassa-Llama—Fine-Tuned Meta’s Llama LLM, Blockchain and NFT Enabled Real-Time Network Attack Detection Platform for Wind Energy Power Plants. In Proceedings of the 2025 International Wireless Communications and Mobile Computing (IWCMC), Abu Dhabi, United Arab Emirates, 12–16 May 2025; pp. 330–336. [Google Scholar] [CrossRef]
  34. Bandara, E.; Bouk, S.H.; Shetty, S.; Gore, R.; Kompella, S.; Mukkamala, R.; Rahman, A.; Foytik, P.; Liang, X.; Keong, N.W.; et al. VindSec-Llama—Fine-Tuned Meta’s Llama-3 LLM, Federated Learning, Blockchain and PBOM-enabled Data Security Architecture for Wind Energy Data Platforms. In Proceedings of the 2025 International Wireless Communications and Mobile Computing (IWCMC), Abu Dhabi, United Arab Emirates, 12–16 May 2025; pp. 120–126. [Google Scholar] [CrossRef]
  35. Zheng, Y.; Zhang, R.; Zhang, J.; Ye, Y.; Luo, Z.; Feng, Z.; Ma, Y. Llamafactory: Unified efficient fine-tuning of 100+ language models. arXiv 2024, arXiv:2403.13372. [Google Scholar]
  36. Kimm, H.; Paik, I.; Kimm, H. Performance comparision of tpu, gpu, cpu on google colaboratory over distributed deep learning. In Proceedings of the 2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), Singapore, 20–23 December 2021; IEEE: New York, NY, USA, 2021; pp. 312–319. [Google Scholar]
  37. Alkhatib, A.; Shaheen, A.; Albustanji, R.N. A Comparative Analysis of Cloud Computing Services: AWS, Azure, and GCP. Genesis 2024, 4, 5. [Google Scholar] [CrossRef]
  38. Liao, C.; Sun, M.; Yang, Z.; Xie, J.; Chen, K.; Yuan, B.; Wu, F.; Wang, Z. Lohan: Low-cost high-performance framework to fine-tune 100b model on a consumer gpu. arXiv 2024, arXiv:2403.06504. [Google Scholar]
  39. Singh, A.; Ehtesham, A.; Kumar, S.; Khoei, T.T. Enhancing ai systems with agentic workflows patterns in large language model. In Proceedings of the 2024 IEEE World AI IoT Congress (AIIoT), Melbourne, Australia, 24–26 July 2024; IEEE: New York, NY, USA, 2024; pp. 527–532. [Google Scholar]
  40. Bandara, E.; Gore, R.; Liang, X.; Rajapakse, S.; Kularathne, I.; Karunarathna, P.; Foytik, P.; Shetty, S.; Mukkamala, R.; Rahman, A.; et al. Agentsway–Software Development Methodology for AI Agents-based Teams. arXiv 2025, arXiv:2510.23664. [Google Scholar]
  41. Bandara, E.; Bouk, S.H.; Shetty, S.; Roy, S.; Mukkamala, R.; Rahman, A.; Foytik, P.; Liang, X.; Keong, N.W.; De Zoysa, K. Llama-Recipe—Fine-Tuned Meta’s Llama LLM, PBOM and NFT Enabled 5G Network-Slice Orchestration and End-to-End Supply-Chain Verification Platform. In Proceedings of the 2025 IEEE 22nd Consumer Communications & Networking Conference (CCNC), Las Vegas, NV, USA, 10–13 January 2025; IEEE: New York, NY, USA, 2025; pp. 1–6. [Google Scholar]
  42. Marvin, G.; Hellen, N.; Jjingo, D.; Nakatumba-Nabende, J. Prompt Engineering in Large Language Models. In Proceedings of the International Conference on Data Intelligence and Cognitive Informatics, Tirunelveli, India, 27–28 June 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 387–402. [Google Scholar]
  43. Lin, X.; Wang, W.; Li, Y.; Yang, S.; Feng, F.; Wei, Y.; Chua, T.S. Data-efficient Fine-tuning for LLM-based Recommendation. In Proceedings of the 47th international ACM SIGIR conference on research and development in information retrieval, Washington, DC, USA, 14–18 July 2024; pp. 365–374. [Google Scholar]
  44. Wang, J. A tutorial on LLM reasoning: Relevant methods behind ChatGPT o1. arXiv 2025, arXiv:2502.10867. [Google Scholar] [CrossRef]
  45. Bandara, E.; Foytik, P.; Shetty, S.; Mukkamala, R.; Rahman, A.; Liang, X.; Keong, N.W.; De Zoysa, K. WedaGPT—Generative-AI (with Custom-Trained Meta’s Llama2 LLM), Blockchain, Self Sovereign Identity, NFT and Model Card Enabled Indigenous Medicine Platform. In Proceedings of the 2024 IEEE Symposium on Computers and Communications (ISCC), Paris, France, 26–29 June 2024; IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar]
  46. Reason, T.; Benbow, E.; Langham, J.; Gimblett, A.; Klijn, S.L.; Malcolm, B. Artificial Intelligence to Automate Network Meta-Analyses: Four Case Studies to Evaluate the Potential Application of Large Language Models. PharmacoEconomics Open 2024, 8, 205–220. [Google Scholar] [CrossRef] [PubMed]
  47. Perak, B.; Beliga, S.; Meštrović, A. Incorporating Dialect Understanding Into LLM Using RAG and Prompt Engineering Techniques for Causal Commonsense Reasoning. In Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024), Mexico City, Mexico, 20–21 June 2024; pp. 220–229. [Google Scholar]
  48. Xiang, M.; Fernando, R.; Wang, B. On-Device Qwen2.5: Efficient LLM Inference with Model Compression and Hardware Acceleration. arXiv 2025, arXiv:2504.17376. [Google Scholar]
  49. Chen, E.; Lin, C.; Tang, X.; Xi, A.; Wang, C.; Lin, J.; Koedinger, K.R. VTutor: An Open-Source SDK for Generative AI-Powered Animated Pedagogical Agents with Multi-Media Output. arXiv 2025, arXiv:2502.04103. [Google Scholar]
  50. Yehudai, A.; Eden, L.; Li, A.; Uziel, G.; Zhao, Y.; Bar-Haim, R.; Cohan, A.; Shmueli-Scheuer, M. Survey on evaluation of llm-based agents. arXiv 2025, arXiv:2503.16416. [Google Scholar] [CrossRef]
  51. Eelbode, T.; Sinonquel, P.; Maes, F.; Bisschops, R. Pitfalls in training and validation of deep learning systems. Best Pract. Res. Clin. Gastroenterol. 2021, 52, 101712. [Google Scholar] [CrossRef]
Figure 1. Platform architecture.
Figure 2. LLM integration flow with Ollama LLM-API.
Figure 3. Representative samples illustrating the structure of the H-reflex neuromuscular dataset hosted on the Hugging Face platform and used to fine-tune the VLMs. Each row corresponds to a complete data record comprising the waveform image, the associated instruction prompt, and the model output. The blue-highlighted row indicates a single, fully populated example record selected for emphasis. All displayed text fields represent complete entries rather than truncated or incomplete content.
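A minimal sketch of one such data record, assuming illustrative field names rather than the exact dataset schema:

```python
def make_record(image_path, instruction, expert_output):
    """Build one training record in the layout implied by Figure 3:
    waveform image + instruction prompt + expert-annotated output.
    Field names and example values are illustrative only."""
    return {
        "image": image_path,
        "instruction": instruction,
        "output": expert_output,
    }

# Hypothetical example record
record = make_record(
    "hreflex/session_012.png",
    "Analyze this H-reflex EMG waveform and assess neuromuscular state.",
    "Reduced H-wave amplitude consistent with post-exercise fatigue; "
    "expected recovery within 48-72 h.",
)
```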
Figure 4. Fine-tuning VLMs with QLoRA and deploying with Ollama.
Figure 5. Prompt for OpenAI-gpt-oss reasoning LLM for final prediction reasoning.
Figure 6. The data format required by the Unsloth library to fine-tune the VLMs.
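The chat-style multimodal layout used by vision fine-tuning libraries such as Unsloth can be approximated as follows; the exact schema may differ between library versions, so treat the field names as assumptions:

```python
def to_conversation(record):
    """Convert a flat {image, instruction, output} record into the
    chat-style multimodal layout expected by vision fine-tuning
    libraries (illustrative approximation of the Unsloth format)."""
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": record["image"]},
                {"type": "text", "text": record["instruction"]},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": record["output"]},
            ]},
        ]
    }

# Hypothetical flat record to convert
sample = {
    "image": "hreflex/session_012.png",
    "instruction": "Analyze this H-reflex EMG waveform and assess neuromuscular state.",
    "output": "Reduced H-wave amplitude consistent with post-exercise fatigue.",
}
conversation = to_conversation(sample)
```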
Figure 7. Validation metrics during VLM fine-tuning for H-reflex analysis. (Left): Evaluation loss (eval/loss) demonstrating convergence from 1.7 to 1.1952 over 180 training steps, with plateau behavior triggering early stopping. (Right): Evaluation runtime (eval/runtime) showing consistent inference performance stabilizing at approximately 3.91 s per evaluation cycle.
Figure 8. Learning rate schedule and training loss progression. (Left): Learning rate (train/learning_rate) following warmup-then-decay schedule, increasing to 0.0001 during initial 20 steps before linear decay to zero. (Right): Training loss (train/loss) demonstrating continuous monotonic decrease from 1.9 to 0.9038, confirming effective model adaptation to H-reflex waveform patterns.
Figure 9. Computational cost and final training metrics. (Left): Total floating-point operations (train/total_flos) quantifying the computational resources at 5.65 × 10¹⁶ FLOPs required for VLM adaptation. (Right): Final training loss (train/train_loss) at 1.203, representing the consolidated loss metric upon training completion at step 180.
Figure 10. Training progression and optimization stability metrics. (Left): Epoch progression (train/epoch) showing linear advancement to 3.6 epochs at step 180, aligning with the recommended fine-tuning duration for medical imaging tasks. (Right): Gradient norm (train/grad_norm) exhibiting a gradual increase from 0.3 to 0.71, indicating stable optimization without gradient explosion under the LoRA adapter configuration.
Figure 11. Evaluation throughput metrics during VLM fine-tuning. (Left): Samples processed per second (eval/samples_per_second), improving from 12.3 to 12.78 as training progressed. (Right): Evaluation steps per second (eval/steps_per_second), increasing from 6.15 to 6.39, reflecting GPU memory optimization and consistent computational efficiency suitable for real-time clinical deployment.
Figure 12. Prediction results of the Llama-3.2-11B-Vision-Instruct vision-language model before and after fine-tuning for H-reflex waveform interpretation. The figure shows a representative single-session H-reflex EMG waveform and the corresponding model-generated interpretations. Recovery-related descriptors are inferred from waveform morphology based on patterns learned during fine-tuning on longitudinal datasets, rather than from an explicit time course displayed in the figure.
Figure 13. Prediction results of the Pixtral-Vision model before and after fine-tuning for H-reflex waveform interpretation. The figure shows a representative single-session H-reflex EMG waveform recorded under resting conditions and corresponding model-generated interpretations. The figure is intended to demonstrate changes in model interpretative behavior rather than physiological comparison with a control condition. Normative baselines and peak-to-peak quantitative analyses are incorporated at the dataset and training level and are discussed elsewhere in the paper. Apparent signal variability reflects physiologically expected EMG noise preserved to maintain waveform morphology relevant for visual reasoning.
Figure 14. Prediction results of the Qwen-2 vision-language model before and after fine-tuning for H-reflex waveform interpretation. The figure shows a representative single-session H-reflex EMG waveform and corresponding model-generated interpretations. Recovery-related descriptors are inferred from waveform morphology based on patterns learned during fine-tuning on longitudinal datasets, rather than from a time course explicitly displayed in the figure.
Figure 15. Comparison of independent predictions from fine-tuned VLMs (Pixtral-Vision, Llama-Vision, and Qwen-2) and the consolidated reasoning output generated by the OpenAI-gpt-oss LLM for H-reflex waveform interpretation. The figure presents a representative single-session H-reflex EMG waveform and corresponding model-generated assessments. Recovery-related descriptors in the consolidated reasoning output are inferred through cross-model synthesis and patterns learned from longitudinal training data, rather than from a time course explicitly displayed in the figure.
Table 1. Comparison of AI Systems for EMG and Neuromuscular Signal Interpretation.
Platform | Domain | Model Type/LLM-VLM
This work | H-reflex neuromuscular analysis | Llama-Vision, Pixtral-Vision, Qwen-VL + OpenAI-gpt-oss
Sensor Fusion [18] | Gesture recognition | CNN-based fusion models
ViT-HGR [19] | Gesture recognition | Vision Transformer (ViT)
Explainable Stroke Gait [20] | Stroke gait EMG analysis | GBoost + SHAP/LIME
INSPIRE [4] | Electrodiagnostic interpretation | Multi-agent LLMs (unspecified)
LLM for EMG-to-Text [21] | Silent speech decoding via EMG | LLM with EMG adapter
Nerve Conduction Velocity [22] | Nerve conduction analysis | Not Applicable
Hybrid-FEM [23] | Biomedical device diagnostics | Not Applicable
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Bandara, E.; Gore, R.; Shetty, S.; Mukkamala, R.; Rhea, C.K.; Samulski, B.S.; Hass, A.; Yarlagadda, A.; Kaushik, S.; De Silva, M.; et al. Standardization of Neuromuscular Reflex Analysis—Role of Fine-Tuned Vision-Language Model Consortium and OpenAI gpt-oss Reasoning LLM-Enabled Decision Support System. Biomechanics 2026, 6, 23. https://doi.org/10.3390/biomechanics6010023

