1. Introduction
Parkinson’s disease (PD) is a neurodegenerative disorder characterized by both motor and non-motor disturbances arising from dysfunction of basal ganglia–thalamo–cortical circuits. Deficits in movement initiation, scaling, and automatization give rise to the cardinal motor signs (bradykinesia, rigidity, tremor, postural instability) and to prominent axial manifestations such as reduced trunk–head rotation, impaired segmental coordination, and postural deformities [1,2]. Speech is frequently affected: hypokinetic dysarthria occurs in approximately 80–90% of patients and is characterized by hypophonia (reduced loudness), monopitch and monoloudness, flattened prosody, short rushes of speech, and imprecise articulation [2,3]. Mechanistically, these abnormalities reflect a combination of bradykinesia, peripheral rigidity, impaired amplitude scaling, and disrupted sensory control—including underestimation of self-produced loudness [1]. Endoscopic and imaging studies have documented glottal insufficiency and bowing or atrophy of the vocal folds, while aerodynamic analyses show reduced subglottal pressure and altered loudness strategies (Lombard effect), pointing to combined respiratory–laryngeal dysfunction [4,5].
Motor scaling deficits in PD extend across both orofacial and axial effectors. Axial symptoms include pronounced rigidity, reduced range of trunk–head rotation, en bloc coupling of head and trunk during turning, and an increased risk of falls due to decreased trunk mobility. Postural deformities such as camptocormia, antecollis, and Pisa syndrome further disrupt head orientation. Although head tremor is not a cardinal PD sign, subtle oscillatory or jerky head movements may appear and require differentiation from essential or dystonic tremor. Taken together, simultaneous measurement of speech acoustics and head kinematics has the potential to capture complementary pathophysiological markers: (i) a respiratory–laryngeal component (hypoadduction, reduced subglottal pressure, dysprosody) and (ii) an axial component (restricted range of motion, reduced velocity, decoupling of head–trunk movement, increased stiffness) [1,4]. Clinically and metrologically relevant metrics span head rotation/tilt ranges, angular velocity and jerk, and tremor-related spectral power (3–7 Hz), analyzed alongside acoustic features such as intensity, F0 variability, laryngeal tremor indices, and temporal discontinuities.
These clinical observations motivate the development of digital biomarkers that quantify PD-related changes in speech and movement in a scalable, objective manner. Voice is a particularly mature source of digital markers: classical signal processing and modern representation-learning approaches have demonstrated robust performance for PD detection and monitoring based on sustained phonation, reading, diadochokinetic tasks, and spontaneous speech [6,7]. At the same time, inertial measurement units (IMUs) placed on the trunk, limbs, or head capture kinematic signatures of gait, posture, tremor, and balance that are central to PD pathophysiology but often underrepresented in purely audio-based pipelines [8,9]. From a multimodal perspective, fused analysis of voice and movement is attractive: voice encodes fine-grained motor-speech control, whereas IMU-based kinematics provide context about axial and whole-body motor function. A concise overview of representative smartphone-, smartwatch-, and MR/AR-based digital-biomarker studies in PD is given in Table 1, which highlights both the rapid progress in sensor-based assessment and the relative lack of tightly synchronized voice–head-IMU pipelines in mixed reality. Mixed and augmented reality (MR/AR) head-mounted displays (HMDs) provide a natural platform for such multimodal assessment. Devices like Microsoft HoloLens 2 (HL2) combine standardized, programmable task elicitation with precise on-device sensing (microphones, cameras, IMUs), enabling reproducible measurements under controlled stimuli while preserving ecological validity [8,9,10]. Validation studies have shown that HL2 can yield accurate and reliable kinematic readouts for gait, balance, and upper-limb tasks, supporting its use in clinical assessments and longitudinal follow-up [8,9,11]. The headset-mounted microphone array offers standardized mouth–microphone geometry, while the rigidly attached IMU follows head motion without body–sensor displacement artifacts. This combination makes MR/AR HMDs a promising testbed for synchronized audio–kinematic biomarkers in PD.
Over the past two years our group has systematically explored MR-based assessments in PD. We have shown that (i) speech collected in an MR setting supports PD screening and task-specific analysis of acoustic and linguistic features [6], (ii) MR implementations of gait tests (e.g., Timed Up-and-Go) yield discriminative spatiotemporal characteristics between PD and healthy controls [18], and (iii) MR can be used to deliver multicomponent motor–cognitive and eye-tracking tasks on HoloLens 2 [20,21]. In these earlier studies, however, speech and motion channels were analyzed separately rather than fused within a synchronized MR pipeline, and the potential of concurrent voice–head kinematics for PD assessment remained unexplored.
In this work, we introduce DiagNeuro, an AR system for simultaneous acquisition of voice and head-IMU signals during interactive, standardized tasks with time-locked synchronization and a unified processing stack. We develop and evaluate a set of deep-learning models that operate on short, synchronized audio–IMU episodes to discriminate PD from healthy controls. Our contributions are threefold: (i) an MR acquisition architecture that ensures sample-level alignment between audio and head kinematics under guided protocols; (ii) a multimodal learning pipeline with early, intermediate, and gated fusion strategies complemented by probability calibration; and (iii) an analysis of how such MR-based fusion could translate to mobile deployments (e.g., smartphone-grade microphones and IMUs) while preserving task standardization. We specifically test the hypotheses that (H1) multimodal models fusing voice and head-IMU features outperform voice-only and IMU-only baselines for PD vs. HC discrimination; (H2) the incremental benefit of fusion is task-dependent, with larger gains in tasks that more strongly engage movement; and (H3) synchronized MR acquisition is feasible in a clinical setting and yields multimodal markers that align with established PD pathophysiology.
From a modeling perspective, we cast this study as a supervised binary classification problem. The input is a short, synchronized voice–IMU episode recorded during one of five standardized mixed-reality speech tasks (T02–T06), and the target label is the diagnostic group (PD or healthy control, HC). We analyze both a pooled configuration, where episodes from all tasks are combined to train a single classifier, and task-wise models that are fit and evaluated separately for each task. This formulation links the clinical questions to the technical pipeline: Section 2 details how episodes and clinical variables are stored in the database, and Section 3 describes the preprocessing and multimodal fusion architectures used to learn from these episode-level records.
3. Methodology
3.1. Overall Workflow
Figure 5 summarizes the modeling workflow used in this study. Briefly, we (1) acquire synchronized voice and head-IMU data in mixed reality using the DiagNeuro headset protocol; (2) preprocess and temporally align the signals (log mel-spectrograms for audio, normalized six-axis IMU at 100 Hz) into fixed-length crops; (3) instantiate a set of candidate architectures (audio_only, imu_only, early_concat, mid_xattn, gated_early); (4) train and evaluate these models under stratified, subject-aware cross-validation with probability calibration; and (5) compare pooled and task-wise performance, focusing on whether fusion yields a consistent gain over audio-only baselines, particularly in movement-engaging tasks.
3.2. Signal Preprocessing and Temporal Alignment
Audio waveforms are loaded at 16 kHz, converted to mono, and peak-normalized to unit amplitude. We extract log mel-spectrograms with fixed mel-filterbank and FFT settings and a hop length of 320 samples (≈20 ms at 16 kHz). For numerical stability, we add a small floor to the mel power values prior to taking the logarithm.
For accelerometer and gyroscope streams, we identify the time column (e.g., time, timestamp, ticks) via heuristic matching and select three numeric axes corresponding to {x,y,z}. Missing samples are forward/back-filled, time is re-based to start at zero, and each stream is linearly interpolated and resampled to 100 Hz. The resulting 6-channel IMU vector (ACC x,y,z; GYR x,y,z) is per-channel median-centered and standardized by its standard deviation (with a 10⁻⁶ floor).
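The resampling and normalization steps described above can be sketched as follows. This is a minimal numpy sketch under stated assumptions: the function name and uniform-grid construction are our own, and the forward/back-filling of missing samples is omitted for brevity.

```python
import numpy as np

def preprocess_imu(t, imu, fs_out=100.0, eps=1e-6):
    """Re-base time to zero, resample all IMU channels to fs_out Hz via
    linear interpolation, then median-center and standardize per channel.
    t: (N,) timestamps in seconds; imu: (N, 6) array (ACC xyz + GYR xyz)."""
    t = np.asarray(t, dtype=float) - float(t[0])       # re-base to start at zero
    n_out = int(np.floor(t[-1] * fs_out)) + 1
    t_new = np.arange(n_out) / fs_out                  # uniform 100 Hz grid
    out = np.stack([np.interp(t_new, t, imu[:, c])
                    for c in range(imu.shape[1])], axis=1)
    med = np.median(out, axis=0, keepdims=True)        # per-channel median centering
    std = out.std(axis=0, keepdims=True)
    return (out - med) / np.maximum(std, eps)          # 1e-6 floor for flat channels
```

In practice a 2 s irregularly sampled stream comes out as a 201×6 array on the uniform grid, with each channel median-centered.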
We operate on synchronized audio–IMU windows of 15 s. If a session exceeds 15 s, we sample a random crop during training; at inference we apply test-time augmentation (TTA) with independent random crops. If a session is shorter than 15 s, we use the full duration. Audio and IMU are time-aligned and truncated to the common overlap length T (the minimum of the two modalities). With the chosen parameters, a 15 s crop yields 750 mel frames (at the 20 ms hop) and 1500 IMU steps (at 100 Hz).
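The crop logic above can be illustrated with a short sketch. This is our own hypothetical helper, assuming both streams share t = 0 and that the mel frame rate is 50 fps (20 ms hop); the paper's actual implementation may differ in detail.

```python
import numpy as np

def sample_synced_crop(mel, imu, mel_fps=50.0, imu_fps=100.0, win_s=15.0, rng=None):
    """Sample one synchronized crop of up to win_s seconds from time-aligned
    mel (T_mel, F) and imu (T_imu, 6) streams that share t = 0.
    The crop is limited to the common overlap of the two modalities."""
    if rng is None:
        rng = np.random.default_rng()
    # common overlap in seconds, capped at the window length
    dur = min(mel.shape[0] / mel_fps, imu.shape[0] / imu_fps)
    crop = min(dur, win_s)
    start = rng.uniform(0.0, dur - crop) if dur > crop else 0.0
    m0, i0 = int(start * mel_fps), int(start * imu_fps)
    return (mel[m0:m0 + int(crop * mel_fps)],
            imu[i0:i0 + int(crop * imu_fps)])
```

A 20 s session thus yields a 750-frame mel crop paired with 1500 IMU steps, while a 4 s session is returned in full.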
To avoid implicit imputation artifacts in multimodal learning, sessions with incomplete modality availability are excluded from multimodal training/evaluation (see Table 2). Unimodal baselines (audio_only, imu_only) are trained and evaluated on the corresponding single-modality subsets. In practice, after synchronization the model consumes paired sequences truncated to the shared overlap; hence the effective sequence length T is determined by the minimum available duration across modalities within each crop.
3.3. Model Architecture and Multimodal Fusion
All variants share the same backbone structure: (i) modality-specific encoders producing per-timestep embeddings, (ii) an optional fusion block, (iii) a temporal model, and (iv) a classification head producing a binary logit for the PD vs. HC decision. Importantly, the temporal model and head are matched across variants so that performance differences can be attributed to the fusion mechanism rather than increased classifier capacity.
Let X ∈ ℝ^(T×F) denote the log-mel sequence (F mel bands) and Z ∈ ℝ^(T×6) the IMU sequence (after temporal alignment and truncation to the shared overlap). Both are projected to a common embedding dimension d. For audio, we apply two 1D temporal convolutions (kernel size 5, ReLU) to map X to A ∈ ℝ^(T×d); concretely, the Conv1D stack maps the F input channels to d output channels, with padding chosen to preserve the temporal length T. For IMU, we apply a per-timestep MLP (Linear–ReLU–Linear) mapping Z to M ∈ ℝ^(T×d), applied independently at each timestep. A sinusoidal positional encoding is added to each modality embedding.
Because audio and IMU are sampled at different nominal rates (mel hop of 20 ms, i.e., 750 frames per 15 s; IMU resampled at 100 Hz, i.e., 1500 samples per 15 s), we align modalities on the mel timestamp grid: after resampling IMU to 100 Hz, we linearly interpolate the 6-axis IMU stream onto the mel-frame timestamps, yielding an IMU sequence with the same T as the mel sequence. Thus, for a synchronized 15 s crop, the end-to-end tensor flow is: raw crop with X ∈ ℝ^(750×F) and IMU ∈ ℝ^(1500×6) → interpolation onto mel timestamps (750×6) → encoders (A, M ∈ ℝ^(750×d)) → fusion → BiGRU → mean pooling → MLP head → scalar logit.
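The interpolation onto the mel timestamp grid can be sketched directly with `np.interp`. The helper name and argument defaults are our own; they assume the 20 ms hop and 100 Hz IMU rate stated above.

```python
import numpy as np

def imu_to_mel_grid(imu_100hz, n_mel_frames, hop_s=0.02, imu_fs=100.0):
    """Linearly interpolate a (T_imu, 6) IMU stream sampled at imu_fs Hz
    onto the mel-frame timestamps (n_mel_frames frames, hop_s apart),
    so both modalities share the same temporal length T."""
    t_imu = np.arange(imu_100hz.shape[0]) / imu_fs
    t_mel = np.arange(n_mel_frames) * hop_s
    return np.stack([np.interp(t_mel, t_imu, imu_100hz[:, c])
                     for c in range(imu_100hz.shape[1])], axis=1)
```

With a 100 Hz stream and a 20 ms hop, every mel frame lands on every second IMU sample, so the interpolation reduces to a 2:1 decimation in this special case.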
We evaluate the following fusion strategies: (i) audio_only: the temporal model consumes A alone; (ii) imu_only: the temporal model consumes M alone; (iii) early_concat: per-timestep concatenation F_t = [A_t; M_t]; (iv) gated_early: an early-concatenation representation modulated by a learnable per-timestep gate, where the gate is produced by a small MLP over the concatenated embedding with a sigmoid output; (v) mid_xattn: a mid-level cross-attention block in which audio queries IMU (keys/values) to obtain an IMU-conditioned audio representation, which serves as the fused sequence passed to the shared temporal backbone. All fusion modules are trained end-to-end jointly with the encoders and temporal backbone.
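One plausible instantiation of the gated_early variant can be sketched as follows. The exact gate form is not fully specified in the text, so this numpy sketch is an assumption: a per-timestep sigmoid gate, produced by a Linear–ReLU–Linear MLP over the concatenated embedding, multiplicatively modulates that embedding.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_early_fusion(A, M, W1, b1, W2, b2):
    """Hypothetical gated_early fusion: a per-timestep gate
    g_t = sigmoid(MLP([A_t; M_t])) modulates the concatenated embedding.
    A, M: (T, d) audio / IMU embeddings; the gate MLP is Linear-ReLU-Linear."""
    C = np.concatenate([A, M], axis=1)        # (T, 2d) early concatenation
    h = np.maximum(C @ W1 + b1, 0.0)          # hidden layer with ReLU
    g = sigmoid(h @ W2 + b2)                  # (T, 2d) gate values in (0, 1)
    return g * C                              # gated fused sequence
```

Because the gate is bounded in (0, 1), the fused sequence is an elementwise attenuation of the concatenated representation, which matches the interpretation of the gate as deciding how much each channel contributes at each timestep.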
Given the fused per-timestep sequence (whose dimension depends on the variant), we model temporal dependencies using a bidirectional GRU, followed by mean pooling over time and a two-layer MLP head with dropout producing a single logit. The same BiGRU and head configuration is used for all unimodal and multimodal variants. The BiGRU uses 1 layer with hidden size 128 per direction (hence a 256-dimensional output per timestep), and the MLP head maps the 256-dimensional pooled representation to the scalar logit with ReLU activations and dropout.
3.4. Training, Validation, and Calibration Protocol
All models are trained end-to-end from random initialization. We do not employ external pretrained backbones, nor any fine-tuning strategy in this study; this isolates the effect of the fusion mechanisms and temporal modeling under a controlled-capacity regime.
We use an external 5-fold StratifiedGroupKFold split by participant identity to prevent subject leakage across folds. Within each training fold, we perform GroupShuffleSplit (train/validation = 70/30, grouped by participant) to obtain internal validation sets used only for early stopping and post hoc calibration (not for architectural re-tuning). Test metrics for a fold are obtained by ensembling the three independent repetitions (probabilities averaged after calibration). Thus, hyperparameters are held constant across outer folds; the inner splits are used strictly for early stopping, calibration, and threshold selection rather than per-fold hyperparameter search.
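The key property of the inner split is that no participant contributes episodes to both sides. A minimal pure-Python sketch of this group-aware behavior (mirroring what sklearn's GroupShuffleSplit does; the function name and 70/30 subject-level split are our own simplification):

```python
import random

def group_shuffle_split(episode_subjects, val_frac=0.3, seed=0):
    """Split episode indices into train/validation so that no subject
    appears on both sides (subject-aware, leakage-free splitting).
    episode_subjects: list giving the subject id of each episode."""
    subjects = sorted(set(episode_subjects))
    rng = random.Random(seed)
    rng.shuffle(subjects)                      # randomize subject assignment
    n_val = max(1, round(val_frac * len(subjects)))
    val_subjects = set(subjects[:n_val])       # held-out subjects
    train = [i for i, s in enumerate(episode_subjects) if s not in val_subjects]
    val = [i for i, s in enumerate(episode_subjects) if s in val_subjects]
    return train, val
```

Any fold produced this way keeps the subject sets of the two partitions disjoint, which is the leakage guarantee the protocol relies on.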
We optimize class-weighted cross-entropy with AdamW. Early stopping is based on a smoothed validation AUC computed as the mean of the last three epochs, with a fixed patience. We apply gradient clipping for numerical stability. All fixed training hyperparameters (learning rate, weight decay, batch size 32, maximum epochs 50, and the gradient-clipping norm) are reported explicitly in Table 4.
At test time, we apply TTA with random 15 s crops per sample; logits are averaged across crops prior to calibration.
To obtain well-calibrated probabilities and a stable operating point, we apply temperature scaling on the internal validation set and select an operating threshold τ on the calibrated probabilities. We select τ by maximizing validation F1 on the calibrated probabilities within a guard-rail interval. To prevent degenerate calibration on small validation sets, we enforce guard-rails on both the temperature and the threshold, and for very small validation sets we fall back to conservative default values. This regime is frequently triggered in task-wise evaluation (where validation sets are typically small), and any fallback or clipping is logged as red_flags for auditability.
Calibration statistics are reported for transparency, including the average calibration temperature and operating threshold for each variant (early_concat, gated_early, mid_xattn, audio_only) across pooled (ALL) experiments. For imu_only, calibration frequently saturates at the guard-rails (e.g., the temperature is clipped to its bound), consistent with near-chance validation AUC in the pooled setting; the guard-rails prevent unstable post hoc probability scaling.
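Temperature scaling with guard-rail clipping can be sketched in a few lines. This is an illustrative numpy sketch: the grid-search fit, the bounds `t_min`/`t_max`, and the function name are our assumptions (the paper's exact guard-rail values and optimizer are not reproduced), and threshold selection is omitted.

```python
import numpy as np

def fit_temperature(logits, labels, t_min=0.5, t_max=5.0):
    """Fit a calibration temperature by grid-searching the negative
    log-likelihood of sigmoid(logit / T) on validation data, then clip
    the result to guard-rail bounds [t_min, t_max] (bounds illustrative)."""
    logits = np.asarray(logits, float)
    labels = np.asarray(labels, float)
    grid = np.linspace(t_min, t_max, 181)          # 0.025-spaced candidates

    def nll(T):
        p = 1.0 / (1.0 + np.exp(-logits / T))
        p = np.clip(p, 1e-7, 1.0 - 1e-7)           # numerical floor
        return -np.mean(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))

    T = grid[int(np.argmin([nll(T) for T in grid]))]
    return float(np.clip(T, t_min, t_max))         # guard-rail clipping
```

If the validation logits are overconfident by a known factor, the fitted temperature recovers that factor (up to grid resolution), and the final clip enforces the guard-rail interval.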
3.5. Metrics
The primary metric is ROC AUC computed on calibrated probabilities. We additionally report F1, precision, and recall at the selected threshold, as well as Accuracy at the selected operating point, to complement the threshold-dependent metrics and facilitate comparison with prior studies. For interpretability, we compute per-task metrics by grouping test episodes by task_id. Per-task summaries are reported for AUC (threshold-free) as well as for F1 and Accuracy (threshold-dependent). Fold-level scores are summarized by arithmetic means across outer folds. Unless stated otherwise, the operating threshold is selected on inner validation (per outer fold) and then applied to the corresponding held-out test fold.
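For reference, the threshold-free ROC AUC admits a compact rank-statistic definition. A minimal numpy sketch (our own helper; production pipelines would typically call sklearn's `roc_auc_score`):

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney U identity: the probability that a
    random positive is scored above a random negative, ties counting 1/2."""
    labels = np.asarray(labels).astype(bool)
    pos = np.asarray(scores, float)[labels]
    neg = np.asarray(scores, float)[~labels]
    # all pairwise comparisons; fine at episode-level sample sizes
    wins = ((pos[:, None] > neg[None, :]).sum()
            + 0.5 * (pos[:, None] == neg[None, :]).sum())
    return wins / (len(pos) * len(neg))
```

This makes explicit why AUC is threshold-free: it depends only on the ordering of scores, not on any chosen operating point.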
3.6. Implementation Details
The pipeline is implemented in PyTorch 2.8.0. Audio I/O prioritizes torchaudio and falls back to soundfile or scipy as needed. The hidden size is shared across variants, and for the cross-attention variant we use multi-head attention with 4 heads. Dropout is applied in a component-specific manner (e.g., in the classifier head and in gating/attention modules where applicable). All random draws (crop positions, splits, weight initialization) are seeded for reproducibility. All experiments ensure subject-wise separation between training, validation, and testing partitions via group-aware splitting.
4. Results of Fusion and Interpretation
We first assess whether fusion improves performance on average across heterogeneous tasks, and we quantify the AUC–F1 trade-off after calibration.
Table 5 reports pooled means over five outer folds, including accuracy, precision, and recall. gated_early attains the highest mean AUC (0.875) and improves F1 versus mid_xattn (0.771 vs. 0.763), while audio_only still yields the highest mean F1 overall (0.785). In terms of accuracy, gated_early is also best on average (0.806), closely followed by audio_only (0.802), indicating that fusion yields a modest but consistent gain at the chosen operating point. The imu_only baseline remains near-chance in AUC (0.497) with a recall skew (0.983), underscoring the limited standalone discriminative power of IMU and its role as a modulator when fused. Notably, mid_xattn exhibits higher recall (0.812) at the expense of precision (0.729), suggesting a more liberal decision boundary relative to gated_early (precision 0.829, recall 0.730) under the selected thresholds.
Figure 6 visualizes pooled ROC curves for the main model variants, confirming the ranking observed in Table 5 and illustrating the operating regimes where fusion provides the largest gain (e.g., moderate false-positive rates, where gated_early dominates).
We next analyze task-dependent effects to reveal when IMU contributes the most. As shown in Table 6, gated_early ranks first on T03 and T06, early_concat ranks best on T02, and audio_only ranks best on T04 and T05 (per-task AUC values are given in the table). Two qualitative patterns follow: (i) tasks with strong acoustic signatures (T04, T05) are saturated by audio, leaving limited headroom for IMU; (ii) tasks requiring fine-grained movement context (T03, T06) benefit from injecting IMU at the representation level with an explicit gate, while T02 favors simple concatenation.
Figure 7 provides per-task ROC curves (one panel per task), showing that task heterogeneity is substantial: T04 and T05 exhibit strong separability for audio_only, whereas T03 (and, to a lesser extent, T06) benefits from IMU-conditioned fusion, consistent with the per-task AUC maxima in Table 6.
To complement threshold-free AUC with operating-point behavior, Table 7 and Table 8 report per-task F1 and accuracy, respectively. These results reveal that the method with the best AUC per task does not always maximize F1/accuracy on the same task. For example, on T06 gated_early has the highest AUC but early_concat achieves a substantially higher F1, indicating that the globally selected threshold (optimized on inner validation) may be better aligned with early_concat for that task’s score distribution. Similarly, on T02 early_concat is best in both AUC and F1, whereas audio_only attains the best accuracy, consistent with a threshold-dependent trade-off between precision/recall and overall correctness.
Table 9 summarizes the best method per task (selected by per-task AUC) and reports the corresponding F1/Accuracy/Precision/Recall at the operating point. The shift from early_concat to gated_early on T03 and T06 indicates that learning when and how much to trust IMU, rather than always concatenating, can improve separability while maintaining stable temporal modeling. More broadly, we observe a consistent hierarchy: (a) IMU-only remains insufficient across tasks (low AUC and unstable threshold-dependent behavior); (b) audio-only provides a strong baseline and dominates tasks with predominantly acoustic cues (T04, T05); and (c) fusion offers the largest benefits in tasks where head-motion context disambiguates speech production patterns (T03 and, in ranking terms, T06). This supports the interpretation of IMU as a context signal that regularizes or re-weights acoustic evidence, rather than an independent discriminator.
Table 9 also highlights that the “best” method depends on the evaluation target: optimizing ranking performance (AUC) favors gated_early on T03/T06, whereas optimizing a discrete operating point (F1/Accuracy) may select audio_only or early_concat for specific tasks. This motivates reporting both threshold-free and threshold-dependent metrics, and it suggests that task- or cohort-specific threshold calibration could further improve deployment-level performance without changing the backbone architecture.
IMU in isolation is insufficient but useful in fusion. The near-chance performance of the standalone IMU model is expected, given that participants were seated with instructions to maintain a neutral posture, in contrast to high-performance kinematic benchmarks typically obtained from gait or active motor tasks [8,9]. gated_early yields the highest pooled AUC and ranks first on T03 and T06, where nuanced motion context complements acoustics. audio_only remains a strong and sometimes superior choice on acoustically dominated tasks (T04, T05), consistent with its top pooled F1. The AUC–F1 trade-off between fusion and audio-only likely reflects operating-point selection after calibration; under alternative thresholds (e.g., optimizing the threshold per task), fusion may narrow the F1 gap while retaining ranking gains. These results support IMU as a conditional enhancer of audio, with explicit gating improving when and how IMU influences the representation.
6. Discussion
This study demonstrates the feasibility and benefits of fusing voice and head motion signals for Parkinson’s disease (PD) assessment in a mixed-reality (MR) environment. We developed a HoloLens 2-based system (DiagNeuro) that records synchronized speech and head inertial data during structured tasks, and we evaluated multiple sensor-fusion strategies.
The principal finding is that combining vocal features with head-mounted IMU data modestly improved PD vs. healthy classification performance over speech alone, while voice features remained the dominant discriminative modality. Specifically, an early-fusion model with a learned gating mechanism achieved the highest overall AUC (≈0.875), exceeding the audio_only baseline, and also the best pooled Accuracy (≈0.806; Table 5). Rather than uniformly improving sensitivity, the fusion models exhibit a precision–recall trade-off: mid_xattn yields higher recall (≈0.812) but lower precision (≈0.729), whereas gated_early improves precision (≈0.829) while maintaining competitive recall (≈0.730), confirming that multimodal integration can capture complementary PD markers [41].
This behavior is consistent with the pooled ROC curves (Figure 6), which illustrate the operating regions where fusion provides the clearest separation.
At the same time, the strong voice-only baseline reflected the well-known richness of acoustic biomarkers in PD [41]. In our protocol, the head-IMU channel on its own was only weakly discriminative (near-chance AUC), which we attribute primarily to the task and cohort characteristics rather than to hardware limitations. While IMU sensors typically achieve high diagnostic accuracy in gait or active motor tasks [9,17], all recordings were collected during seated speech with a neutral head posture in patients who were predominantly mild–moderate, so the available kinematic signal reflects subtle micromovements rather than overt axial abnormalities. Under these conditions, short IMU segments are expected to be low-contrast between PD and HC. Nevertheless, the same IMU stream yields a consistent, task-dependent gain when fused with audio, especially for tasks where axial–motor differences are more likely to manifest. This suggests that kinesthetic cues are indeed informative but act mainly as a conditional modulator of the much richer acoustic representation. These results support our hypothesis that subtle head-movement patterns (e.g., reduced range or tremor-like activity) contribute additional cues to PD detection when measured concurrently with speech, even though voice signals carry the bulk of the information. We therefore view the weak IMU-only performance as specific to this speech-centric battery and anticipate stronger standalone effects in future work that includes explicit gait/balance or head-movement tasks and a broader range of disease stages.
We also found that fusion efficacy is task-dependent: for highly voice-centric tasks (e.g., reading or rapid syllable repetition), adding IMU yielded no gain or a slight degradation, whereas in tasks engaging posture or sustained phonation, audio–IMU fusion gave measurable improvements. This is consistent with Table 9: audio_only is best on T04/T05, gated_early on T03/T06 (by AUC), and early_concat on T02. This nuanced outcome suggests that an adaptive fusion approach (as implemented by our gating model) is preferable, enabling the model to rely on motion cues only when they are informative.
In summary, DiagNeuro validates the concept of synchronized multimodal MR assessment for PD, showing that dual-channel analysis is not only technically feasible but can modestly enhance detection of PD-related abnormalities compared to single-modality analysis.
Our successful deployment of a 15–20 min MR task battery in over 160 sessions (Table 2) with minimal missing data demonstrates the practicality of this approach in a patient population. These advantages come alongside evidence that AR headsets can produce clinically valid measurements for motor functions: recent studies have shown HoloLens-derived metrics for gait, balance, and functional tests to concur with gold-standard instruments [6]. Our work extends this validity to the concurrent capture of voice and head movement, suggesting that AR devices can be multi-purpose evaluation tools in neurology.
A key question is how our MR-based findings translate to more ubiquitous platforms like smartphones or wearables. On one hand, the HoloLens 2 provided high-fidelity data in a controlled setting—a best-case scenario for digital biomarker capture. Moving to commodity mobile devices will introduce challenges such as variable hardware, environmental noise, and lack of holographic guidance.
A potential translational strategy would be to preserve the standardized task paradigm developed in MR but deliver it through a smartphone app. For example, the app could display written or audio instructions for a sustained phonation, a reading passage, or a brief spoken monologue, similar to our AR tasks. The phone’s microphone can record the speech, while its motion sensors could capture gross movements (if the phone is held in hand or kept in a shirt pocket to approximate upper-body motion). Although this would not exactly replicate head kinematics, it could pick up related signals (hand tremor, body sway or subtle movements as the person speaks). Recent work shows that such multimodal smartphone assessments are feasible and can detect early PD, lending hope that a “lighter” version of DiagNeuro could run on common devices.
In essence, MR-based research guides the design of mobile tools by indicating what to measure and how to standardize the measurement. Our analysis already touched on this by exploring early, intermediate, and late fusion; a mobile app could employ the same fusion strategies in software, even if the sensors differ.
In summary, while direct deployment of DiagNeuro on smartphones is not plug-and-play, the insights gained are transferable. We emphasize that maintaining task consistency is crucial—whether via MR or mobile—because uncontrolled free behavior might not yield the specific biomarkers (e.g., calibrated speech tasks, intentional head movements or steadiness) that a structured exam elicits. Thus, the ideal path to translation is a hybrid: use the phone’s convenience but enforce a protocol akin to an AR session. This could preserve much of the benefit, enabling broader use of multimodal PD assessment outside specialized labs [6].
Despite its promising results, our study has several limitations. First, the data were collected in a supervised MR setting, which may limit generalizability. Participants were aware of being evaluated via an AR headset and were guided through specific tasks. This controlled context is useful for consistency, but real-world conditions differ. In daily life, speech and movement are spontaneous and may be influenced by distractions or varying emotional states. Future work should evaluate our multimodal markers in more naturalistic environments—for example, by having patients use the system at home over longer periods, or by comparing MR-guided results with passive monitoring (like analyzing free speech during phone calls).
Secondly, our cohort consisted of moderate-stage PD patients and age-matched healthy controls, but we did not explicitly stratify by disease severity or phenotype. It remains unclear how early in the disease these voice and head-motion changes become detectable. Prior research indicates that acoustic changes can precede overt motor signs in PD [6], so a worthwhile next step is to test early-stage or even prodromal individuals with our pipeline. The head-movement differences in early PD might be very subtle (since axial rigidity and postural impairment typically worsen in later stages), potentially yielding a smaller fusion benefit. Longitudinal studies are also needed: can our synchronized metrics track disease progression or responses to therapy? Repeated MR assessments (for example, every few months) could show whether the multimodal score correlates with clinical changes over time.
Additionally, the contribution of the head-IMU signals, while statistically significant in some cases, was relatively modest. Our best fusion model improved AUC by only about 1–2 percentage points over audio-only in pooled analysis, an improvement that might not be clinically meaningful on its own. A detailed error analysis would help clarify when and why the fusion helps. It is possible that only a subset of PD patients (those with prominent axial symptoms) benefit from the motion features, whereas others do not. In future research, personalized or subgroup models could be explored, where the system adapts to patient-specific symptom profiles (e.g., placing more weight on IMU features for patients with higher axial rigidity or tremor).
Moreover, incorporating other modalities could amplify the gains: our platform could readily be extended with, say, a hand tremor task using the controller or an eye-tracking task, as we have prototyped separately. Fusing more than two modalities (e.g., voice, head motion, and eye movement) is an exciting direction made feasible by MR headsets. However, this also raises the challenge of feature overload and the need for efficient fusion algorithms to avoid noise from less informative channels.
The next point worth discussing is that our task battery focused on speech and did not include overt gait or balance tasks. This was by design (to target hypokinetic dysarthria and head micro-movements), but it means that some cardinal PD features were not probed. A comprehensive AR exam could integrate our voice + head tasks with brief motor tasks (like an on-spot stepping or sway test) to capture a wider spectrum of PD signs. Doing so might increase overall diagnostic accuracy, as suggested by multi-domain smartphone studies.
Relatedly, we did not combine the results across tasks for a subject-level diagnosis in this work—each task episode was classified independently. In practice, a clinician would consider the aggregate of a patient’s performance across multiple tasks. We expect that simple voting or averaging of the per-task outputs would improve stability and reduce false alarms. Future implementations of DiagNeuro can incorporate subject-level decision logic, possibly with learned weighting for each task’s contribution.
Lastly, the current system’s hardware (HoloLens 2) may not be easily accessible or tolerable for all patients. The device is relatively heavy and expensive, and wearing it might be uncomfortable for some elderly users or those with neck issues. While our participants generally managed the 15-minute sessions well, a few reported mild fatigue. As AR technology evolves, lighter glasses or even AR contact lenses could alleviate this limitation. In the meantime, careful protocol design (e.g., offering breaks, ensuring proper fit) is important when using such headsets clinically. We also acknowledge that some training is needed for users to get acquainted with MR interaction (such as using gaze or voice commands to navigate tasks), though in our experience the learning curve was short. Going forward, usability studies should be conducted to optimize the patient experience—particularly if assessments are to be carried out at home without technical support.
To provide a final contextualization of our results,
Table 12 summarizes the performance of the DiagNeuro system alongside key unimodal and multimodal benchmarks referenced in this study. This overview confirms our conclusion: although our seated protocol provides lower standalone kinematic discriminability compared to dynamic gait assessments, multimodal fusion effectively fills this gap.
7. Conclusions
This work sets the stage for truly multimodal digital biomarkers in Parkinson’s disease by leveraging an immersive MR platform. We demonstrated that synchronized voice and head-motion analysis is feasible and can modestly enhance detection of PD-related deficits, reinforcing the notion that no single sensor tells the whole story. The AR paradigm not only improves data quality through standardization but also opens new avenues for patient engagement and at-home monitoring.
As technology progresses, the gap between specialized MR systems and everyday mobile devices will continue to narrow. Our approach can be seen as a testbed for next-generation digital exams—one that can be iteratively distilled into more portable formats.
We envision a future where a patient might perform a brief multimodal task routine (speaking, moving, looking at targets) guided either by AR glasses or a smartphone app, and receive an immediate, objective report on their motor and speech health. Realizing this vision will require interdisciplinary efforts, validation in larger and more diverse cohorts, and close collaboration with clinicians to ensure that the digital scores map onto meaningful clinical outcomes.
In conclusion, this study validates DiagNeuro, a novel HoloLens 2-based system for the synchronized, multimodal assessment of Parkinson’s disease (PD) using voice and head-motion signals in a mixed-reality (MR) environment. The key finding is that fusing these two modalities modestly improved PD classification accuracy (reaching a peak AUC of ≈0.875) over the already strong voice-only baseline. This improvement confirms that head kinematics, even when subtle and acquired during seated speech tasks, provide complementary, task-dependent information that, when integrated via an adaptive fusion mechanism, enhances PD detection. This work establishes the technical feasibility and standardization benefits of the MR platform for capturing dual-channel digital biomarkers. Future research should prioritize longitudinal validation of these multimodal markers in early-stage and prodromal patients, explore the inclusion of additional modalities or overt motor tasks to amplify the diagnostic gains, and guide the translation of these standardized assessment paradigms to more scalable, ubiquitous mobile platforms for widespread clinical use.