Abstract
Parkinson’s disease (PD) alters both speech and movement, yet most automated assessments still treat these signals separately. We examined whether combining voice with head motion improves discrimination between PD patients and healthy controls (HC). Synchronous acoustic and inertial signals were recorded with a HoloLens 2 headset from 165 participants (72 PD / 93 HC) following a standardized mixed-reality (MR) protocol. We benchmarked single-modality models against fusion strategies under 5-fold stratified cross-validation. Voice alone was robust (pooled AUC ≈ 0.865), while the inertial channel alone was near chance (AUC ≈ 0.497). Fusion provided a modest but repeatable improvement: gated early fusion achieved the highest AUC (≈0.875), and cross-attention fusion was comparable (≈0.873). Gains were task-dependent: speech-dominated tasks were already well captured by audio alone, whereas tasks that embed movement benefited from complementary inertial data. The proposed MR capture proved feasible within a single session and showed that head motion acts as a conditional improvement factor rather than a standalone predictor. The results outline a practical path toward multimodal PD screening and monitoring that preserves the reliability of acoustic biomarkers while integrating kinematic features where they add value.
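As a purely illustrative sketch (not the study's implementation), the snippet below shows one way a gated early-fusion classifier could combine an acoustic and an inertial feature vector, with a gate conditioned on the audio embedding scaling the inertial contribution, and be scored by pooled AUC under stratified 5-fold cross-validation. All feature dimensions, hyperparameters, and the synthetic data are assumptions made for the example.

```python
# Minimal, hypothetical sketch of gated early fusion + pooled-AUC evaluation.
# Dimensions, hyperparameters, and data below are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

class GatedEarlyFusion(nn.Module):
    def __init__(self, d_audio=64, d_imu=32, d_hidden=32):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(d_audio, d_hidden), nn.ReLU())
        self.imu_enc = nn.Sequential(nn.Linear(d_imu, d_hidden), nn.ReLU())
        # Gate in [0, 1], computed from the audio embedding, controls how much
        # of the inertial embedding enters the fused representation.
        self.gate = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.Sigmoid())
        self.head = nn.Linear(2 * d_hidden, 1)

    def forward(self, x_audio, x_imu):
        a = self.audio_enc(x_audio)
        m = self.imu_enc(x_imu)
        fused = torch.cat([a, self.gate(a) * m], dim=-1)
        return self.head(fused).squeeze(-1)  # logit for PD vs HC

def pooled_cv_auc(Xa, Xm, y, folds=5, epochs=200, seed=0):
    """Pooled AUC over stratified folds (scores from all folds scored together)."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    scores, labels = [], []
    for tr, te in skf.split(Xa, y):
        torch.manual_seed(seed)
        model = GatedEarlyFusion(Xa.shape[1], Xm.shape[1])
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        xa = torch.as_tensor(Xa[tr]).float()
        xm = torch.as_tensor(Xm[tr]).float()
        yt = torch.as_tensor(y[tr]).float()
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.binary_cross_entropy_with_logits(model(xa, xm), yt)
            loss.backward()
            opt.step()
        with torch.no_grad():
            p = torch.sigmoid(model(torch.as_tensor(Xa[te]).float(),
                                    torch.as_tensor(Xm[te]).float()))
        scores.extend(p.numpy())
        labels.extend(y[te])
    return roc_auc_score(labels, scores)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, 165)                          # toy labels, 165 participants
    Xa = rng.normal(size=(165, 64)) + y[:, None] * 0.5   # synthetic "acoustic" features
    Xm = rng.normal(size=(165, 32))                      # synthetic "inertial" features
    print(f"pooled AUC: {pooled_cv_auc(Xa, Xm, y):.3f}")
```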