1. Introduction
Neurodegenerative diseases (NDDs) impose one of the largest and fastest-growing public-health burdens worldwide. Dementia alone affects more than 55 million people globally, and Parkinson’s disease (PD) is estimated to affect approximately 10 million, with prevalence projected to roughly double by 2050 as populations age [1]. Early detection is clinically critical because disease-modifying interventions are most effective when initiated before substantial neuronal loss has occurred [1]. Of particular practical importance is the well-documented fact that motor manifestations, most notably impairments of gait, frequently precede formal diagnosis by several years [2,3,4]. This temporal window between sub-clinical motor change and clinical confirmation defines an attractive opportunity for sensor-based screening support. We emphasise, however, that the cohorts evaluated in this paper consist of patients with established NDD diagnoses (e.g., Hoehn & Yahr stages 2–3) rather than pre-clinical or at-risk subjects, so the present study evaluates gait-based screening support and classification of established NDD-related gait patterns, not pre-symptomatic diagnosis. Demonstrating utility for true early detection would require dedicated pre-clinical cohorts and longitudinal follow-up, which we leave to future work.
Recent advances in low-cost wearable sensing have made quantitative gait analysis a practical screening modality. Force-sensitive insoles instrumented with arrays of piezoresistive transducers and sampled at the centimeter–millisecond scale (typically 100 Hz) produce rich vertical ground-reaction-force (GRF) signals, from which clinically interpretable biomarkers such as stride-to-stride variability, gait-phase timing, inter-limb asymmetry, and stride regularity can be derived [5,6]. Building on such sensor data, machine learning pipelines have progressed from hand-crafted spatial–temporal features classified by linear discriminants [7] and ensemble methods [8], to deep models trained directly on raw GRF time series [9,10]. Although these approaches have steadily improved classification accuracy, they share three limitations that consistently obstruct clinical deployment of sensor-driven screening systems.
CH1: Lack of intrinsic, case-grounded explanations. Clinicians require not only a prediction but a rationale that can be verified against domain expertise [11]. Most sensor-based gait classifiers expose only a scalar decision and cannot be inspected as patient evidence; post hoc methods such as SHAP [12] and LIME [13] approximate an opaque model rather than ground decisions in prior patients with similar gait profiles, leaving clinicians without a transparent footing to endorse or override a prediction.
CH2: Absence of per-prediction reliability assessment. Existing sensor-based screening models report only aggregate metrics (accuracy, AUC, F1) and offer no mechanism for indicating when an individual prediction should not be trusted. In a screening setting, where the costs of a missed early diagnosis and an unnecessary specialist referral are asymmetric, there is no principled way to flag ambiguous cases for expert review or to restrict automated decisions to confident predictions.
CH3: No mechanism for sustained case-base curation as new sensor data accumulate. Sensor-driven clinical pipelines are inherently longitudinal: as new patient recordings arrive, the underlying knowledge should be refined accordingly. Current pipelines either retrain monolithically (incurring substantial cost and risking concept-drift instability) or accept new data uncritically (introducing label noise and self-confirmation bias). Neither strategy provides a principled way to identify low-quality recordings and either rescue them through expert relabeling or retire them.
Case-Based Reasoning (CBR) [14,15] offers a natural framework that addresses CH1–CH3 simultaneously. The R4 cycle, namely Retrieve, Reuse, Revise, and Retain, closely mirrors clinical reasoning: physicians recall similar prior patients, adapt previous assessments, evaluate confidence in their conclusion, and update experiential knowledge as new patients are seen [16,17]. Crucially, every CBR prediction carries an intrinsic explanation in the form of the retrieved cases and the features driving similarity, addressing CH1 directly [18,19]; the Revise stage formally evaluates prediction reliability and routes uncertain cases for expert review, addressing CH2; and the Retain stage curates the case base over time under an explicit quality model, addressing CH3. In practice, however, most deployed healthcare CBR systems have stopped at Retrieve and Reuse [20,21,22]; that is, CBR has typically been used as a similarity-based classifier rather than as a complete reasoning cycle. The Revise and Retain stages, which are precisely the components that would close CH2 and CH3, have remained on the methodological shelf in clinical practice.
A complementary view of completeness is given by the four knowledge containers of Richter [23]: vocabulary (features describing cases), similarity (how cases are compared), case base (the stored experience), and adaptation (how stored solutions are transferred). Because knowledge can be shifted between these containers, a well-curated case base reduces the need for elaborate adaptation knowledge. In clinical screening, where explicit adaptation rules are difficult to elicit, this trade-off motivates investing in similarity (Retrieve) and case-base curation (Retain) while keeping adaptation lightweight, a perspective we adopt explicitly in CogCBR.
To address CH1–CH3, we propose CogCBR (Cognitive CBR), a sensor-driven CBR framework that implements the complete R4 cycle for gait-based NDD screening from wearable force-sensitive insoles. CogCBR is framed throughout by the four knowledge containers and instantiates each stage of the R4 cycle with components well established in the CBR literature: mutual-information (MI) feature selection together with Random Forest (RF)-weighted k-NN retrieval (Retrieve, addressing CH1 by anchoring decisions in retrieved prior patients), inverse-distance weighted voting that yields a continuous probability (Reuse), entropy-based confidence estimation that flags low-certainty predictions for specialist review (Revise, addressing CH2), and competence-based maintenance under an explicit augmentation policy that admits only label-verified cases (Retain, addressing CH3). The novelty lies not in any individual operator but in their principled integration into a single sensor-driven pipeline, closed by a triage and augmentation policy that links confident predictions, expert review, and case storage.
Figure 1 illustrates the complete CogCBR pipeline.
The main contributions of this paper are summarized as follows:
We propose a complete, clinically grounded R4 CBR cycle for wearable-sensor-based NDD screening support. To the best of our knowledge, CogCBR is the first system in this domain to operationalize the Revise and Retain containers jointly with Retrieve and Reuse, closing the loop between automated screening, expert review, and longitudinal case-base curation.
We design an entropy-based confidence-driven triage mechanism that supports a coverage-vs.-accuracy trade-off appropriate for clinical workflows, and pair it with a conservative case-base augmentation policy that admits only label-verified cases (confident self-predictions confirmed at clinical follow-up, or low-confidence cases adjudicated by a specialist), preventing label noise and self-confirmation bias. At the selected confidence threshold, CogCBR retains 68.7% of predictions, with accuracy improving from 80.0% to 87.3% on PhysioNet GaitPDB. We validate the longitudinal behavior of this policy in a deployment-style streaming simulation (Section 4.8), in which the label-verified rule avoids the label-noise accumulation of naive self-training and matches an all-true-label no-curation reference while bounding case-base growth.
We conduct a quantitative comparison against 16 baselines, comprising 14 generic ML/DL methods and two re-implemented domain-specific gait-classification pipelines [7,8], under stratified 10-fold cross-validation on PhysioNet GaitPDB and an independent-cohort reproducibility evaluation on GaitNDD. CogCBR attains AUC = 0.861 on GaitPDB and AUC = 0.902 on GaitNDD, achieving competitive performance on both datasets and remaining statistically on par with the strongest tuned baselines on GaitPDB under a matched-tuning comparison (Section 4.3). Under a stricter cross-modality source-to-target transfer (train on GaitPDB, test on GaitNDD), CogCBR does not exceed the strongest classical baseline; this source-to-target result is reported in Section 4.9.
We profile the computational cost and deployment footprint of CogCBR against representative baselines. With 0.080 ms per-sample inference latency and an 8.4 KB deployment payload, CogCBR is well-suited to wearable-sensor and edge-health scenarios in which storage, latency, and interpretability are simultaneously constrained.
We deliver intrinsic, case-based explanations grounded in clinically interpretable sensor-derived features. Each prediction is accompanied by the retrieved similar patients, the gait features driving similarity, and the direction of clinical deviation.
The remainder of this paper is organized as follows. Section 2 reviews related work on CBR for clinical decision support, CBR for movement disorders, and sensor-based gait screening for PD. Section 3 describes the materials and methods, including the GaitPDB and GaitNDD datasets, the sensor acquisition and signal-processing pipeline, the four knowledge containers of CogCBR, the experimental protocol, and the baselines. Section 4 reports the experimental results on GaitPDB and GaitNDD, the Revise-stage triage analysis, the computational-cost profiling, and the system-component ablation. Section 5 discusses the implications of the findings, statistical caveats, and the limitations of the present cross-sectional study. Finally, Section 6 concludes the paper and outlines directions for future work.
3. Materials and Methods
CogCBR implements the complete CBR cycle [14] for gait-based NDD screening. We describe the design through Richter’s four knowledge containers: vocabulary (Section 3.1 and Section 3.2), similarity (Section 3.3), adaptation (Section 3.4), and the case base together with its quality maintenance (Section 3.6). Section 3.9 then details the experimental protocol used to evaluate the framework.
3.1. Clinical Case Representation (Vocabulary Container)
Sensor Data and Datasets
We use the Gait in Parkinson’s Disease (GaitPDB) database from PhysioNet [6] as the primary dataset, and the Gait in Neurodegenerative Disease (GaitNDD) database, also from PhysioNet, as an independent validation cohort. GaitPDB comprises 66 PD patients (Hoehn & Yahr stages 2–3) and 49 age-matched healthy controls (115 subjects in total). Each subject walked at a self-selected pace on level ground while wearing bilateral force-sensitive insoles; each insole contains eight piezoresistive pressure transducers that record the vertical ground reaction force at 100 Hz. The PhysioNet records also include per-foot total forces (in newtons). We primarily use the left-foot total force, the right-foot total force, and their sum (the bilateral total force) as the basis for gait-event detection and feature extraction, because these channels most stably reflect limb loading, heel strike, and toe-off events during walking. Each subject contributes a single feature vector to the case base, so cross-validation splits are at the subject level by construction.
3.2. Sensor Acquisition and Signal-Processing Pipeline
Figure 2 summarizes the sensor-to-feature pipeline that converts raw force-sensitive-insole signals into the case representation used by CogCBR. The pipeline comprises five stages: signal conditioning, gait-event segmentation, feature extraction, quality control, and case representation.
Signal conditioning. Each raw recording is parsed line by line. Rows with anomalous formatting or insufficient columns are discarded, and recordings with no valid samples or with too few samples to support stride-level analysis are excluded. Let $F_L(t)$ and $F_R(t)$ denote the total vertical GRF under the left and right foot, respectively, and let $F_T(t) = F_L(t) + F_R(t)$ denote the bilateral total force. A bandpass filter is applied to suppress high-frequency sensor noise and low-frequency baseline drift before further processing.
Gait-event segmentation. To detect gait events robustly across subjects with different body weights and walking styles, we adopt an adaptive amplitude threshold derived from each subject’s own signals. A foot is considered to be in contact with the ground when its total force exceeds 20% of its mean force, i.e., $F_L(t) > 0.2\,\bar{F}_L$ and $F_R(t) > 0.2\,\bar{F}_R$, where $\bar{F}_L$ and $\bar{F}_R$ are the mean left- and right-foot total forces over the recording. Rising edges of the resulting binary contact signal are taken as candidate heel-strike events and falling edges as candidate toe-off events. The stride interval for each foot is computed as the time between consecutive heel-strikes of the same foot. To suppress spurious events caused by noise or partial foot lifts, only physiologically plausible stride intervals between 0.4 s and 3.0 s are retained. The double-support phase is estimated as the proportion of samples in which both feet are in contact simultaneously.
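As an illustration only, the thresholding and edge-detection logic described above might be sketched as follows. The function names and the square-wave toy signal are hypothetical and are not taken from the CogCBR implementation; only the 20% threshold, the 0.4–3.0 s plausibility window, and the 100 Hz sampling rate come from the text.

```python
from statistics import mean

FS_HZ = 100  # sampling rate stated in the text


def contact_mask(force, fraction=0.2):
    """Mark samples where one foot's total force exceeds `fraction` of its own mean."""
    threshold = fraction * mean(force)
    return [f > threshold for f in force]


def heel_strikes(mask):
    """Sample indices of rising edges (swing -> contact) in a binary contact signal."""
    return [i for i in range(1, len(mask)) if mask[i] and not mask[i - 1]]


def stride_intervals(strikes, fs=FS_HZ, lo=0.4, hi=3.0):
    """Seconds between consecutive same-foot heel strikes, keeping plausible values only."""
    raw = [(b - a) / fs for a, b in zip(strikes, strikes[1:])]
    return [s for s in raw if lo <= s <= hi]


def double_support_fraction(left_mask, right_mask):
    """Proportion of samples in which both feet are in contact."""
    both = sum(1 for l, r in zip(left_mask, right_mask) if l and r)
    return both / len(left_mask)


def toy_foot(n_samples, period=110, stance=70, offset=0, load=700.0):
    """Hypothetical square-wave loading pattern, used only to exercise the functions."""
    return [load if (i - offset) % period < stance else 0.0 for i in range(n_samples)]


# Ten seconds of synthetic walking: 1.1 s stride, right foot half a cycle out of phase.
left_mask = contact_mask(toy_foot(1000))
right_mask = contact_mask(toy_foot(1000, offset=55))
left_strides = stride_intervals(heel_strikes(left_mask))
double_support = double_support_fraction(left_mask, right_mask)
```

On this toy signal the left foot yields eight stride intervals of 1.1 s each and the double-support fraction is 0.28; real insole data would of course be noisier and would pass through the signal-conditioning stage first.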
Feature extraction. From the segmented signals, we extract 41 features organized in five clinically interpretable categories:
Force statistics (15 features): mean, standard deviation, coefficient of variation (CV), skewness, and kurtosis of the left-foot, right-foot, and total force signals, characterizing plantar load magnitude and its variability.
Asymmetry measures (3 features): force asymmetry index, left–right force ratio, and inter-foot phase synchronization, capturing the gait asymmetry commonly observed in PD.
Stride dynamics (9 features): mean stride interval, stride variability (CV), cadence, stride asymmetry, left and right stride CV, autocorrelation at lag 1, and separate left/right stride regularity measures, reflecting rhythmic stability of walking.
Gait phase and spectral features (13 features): stance percentage, swing percentage, double-support percentage, swing-to-stance ratio, dominant frequency, spectral entropy, harmonic ratio, stride regularity, step regularity, sample entropy, and stride irregularity, describing phase structure, periodicity, and signal complexity.
Walking speed (1 feature): self-selected gait velocity from the clinical metadata.
Each feature has a known physiological interpretation and an established association with NDD pathology [3,4], and each is directly derivable from the wearable-sensor signal stream.
Quality control. Missing feature values caused by short recordings, anomalous segments, or failed event detection are imputed with the per-feature median. Features that are entirely missing or have zero variance after imputation are removed. Because subsequent CBR retrieval depends on distance computation, all retained features are standardized (z-score) before similarity calculation, so that scale differences do not dominate the distance.
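The quality-control step can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions: missing values are encoded as `None`, the population standard deviation is used for the z-score, and the function name is hypothetical. In a cross-validated setting the medians, means, and standard deviations would be estimated on the training partition only.

```python
from statistics import mean, median, pstdev


def impute_and_standardize(rows):
    """Median-impute, drop unusable features, and z-score the remainder.

    rows: list of equal-length feature vectors; a missing entry is None (assumed encoding).
    Returns (standardized_rows, kept_feature_indices).
    """
    n_features = len(rows[0])
    columns, kept = [], []
    for j in range(n_features):
        observed = [r[j] for r in rows if r[j] is not None]
        if not observed:
            continue  # feature is entirely missing: remove it
        fill = median(observed)
        col = [r[j] if r[j] is not None else fill for r in rows]
        mu, sigma = mean(col), pstdev(col)
        if sigma == 0:
            continue  # zero variance after imputation: remove it
        columns.append([(v - mu) / sigma for v in col])
        kept.append(j)
    return [list(r) for r in zip(*columns)], kept


# Toy matrix: feature 0 has one gap, feature 1 is constant, feature 2 is never observed.
rows = [[1.0, 5.0, None], [2.0, 5.0, None], [None, 5.0, None], [4.0, 5.0, None]]
z, kept = impute_and_standardize(rows)
```

Here only feature 0 survives: its missing entry is filled with the median (2.0), and the resulting column is rescaled to zero mean and unit variance.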
Case representation. After this pipeline, the raw force-sensor stream of each subject is reduced to a low-dimensional, clinically interpretable case vector spanning the loading, asymmetry, stride-variability, rhythm, double-support, and signal-complexity changes that characterize NDD gait. Compared with raw high-dimensional time-series input, this representation preserves clinical meaning while remaining well-matched to small-sample CBR retrieval, explanation, and confidence estimation.
Case Structure
Each case $c = (\mathbf{x}, y, m)$ comprises a feature vector $\mathbf{x} \in \mathbb{R}^{n}$ (with $n = 18$ after MI selection), a class label $y \in \{0, 1\}$, and optional clinical metadata $m$. The metadata field $m$ carries contextual information that is excluded from the similarity computation but preserved for clinical interpretation; concretely, $m$ may include age, sex, height, weight, body-mass index, recording site, recording date, walking-trial protocol (overground/treadmill/dual-task), and, where available, disease subtype and Hoehn & Yahr stage. The metadata is displayed alongside retrieved neighbors in the explanation output (Section 3.7) so that clinicians can flag demographic mismatches between the query and its neighbors, but is excluded from the distance computation to prevent retrieval being biased by demographic confounders.
3.3. Retrieve (Similarity Container)
The Retrieve stage identifies relevant prior cases via two phases that operate in different knowledge containers: vocabulary refinement (MI feature selection) and similarity weighting (RF-weighted distance).
3.3.1. Mutual Information Feature Selection
We apply mutual information (MI) [39] as a univariate filter to select the most discriminative gait features. For each feature $x_i$ we compute $I(x_i; y)$, its mutual information with the class label $y$, and retain the top $n$ features by MI score ($n = 18$, determined by grid search; see Section 4.11). This reduces the feature space from 41 to 18 dimensions, mitigating the curse of dimensionality that is particularly acute in small-sample CBR [40]. For the primary experiments, we report a fixed-parameter protocol used throughout the main comparison; the optimism that may arise from selecting the MI feature count on the evaluation folds is acknowledged as a methodological caveat in Section 5.
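To make the filter step concrete, the following sketch ranks features by a simple plug-in MI estimate. It is an illustration under assumptions: the equal-width binning estimator, the bin count, and all names are hypothetical choices made here, and the estimator actually used in the experiments may differ (for example, a nearest-neighbor MI estimator).

```python
from collections import Counter
from math import log


def binned_mutual_information(values, labels, n_bins=4):
    """Plug-in MI estimate (in nats) between one continuous feature and class labels."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # constant feature: everything falls in one bin
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    n = len(values)
    joint = Counter(zip(bins, labels))
    bin_counts = Counter(bins)
    label_counts = Counter(labels)
    mi = 0.0
    for (b, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) p(y)) written with raw counts
        mi += p_xy * log(p_xy * n * n / (bin_counts[b] * label_counts[y]))
    return mi


def select_top_n(feature_columns, labels, n):
    """Rank named feature columns by MI with the label and keep the best n."""
    scores = {name: binned_mutual_information(col, labels)
              for name, col in feature_columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n], scores


# Toy data: walking speed separates the classes, the second column does not.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "walking_speed": [1.30, 1.25, 1.20, 1.28, 0.80, 0.90, 0.85, 0.75],
    "noise": [0.1, 0.9, 0.1, 0.9, 0.1, 0.9, 0.1, 0.9],
}
top, scores = select_top_n(features, labels, n=1)
```

In this toy case the informative feature attains the maximum possible MI for a balanced binary label (ln 2 nats), whereas the uninformative one scores zero, so only `walking_speed` is retained.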
3.3.2. RF-Weighted k-NN Retrieval
Given a query case $q$, we retrieve the $k$ nearest neighbors [41,42] from the case base $\mathcal{CB}$ using a weighted Euclidean distance:
$$d(q, c) = \sqrt{\sum_{i=1}^{n} w_i \,(q_i - c_i)^2} \qquad (1)$$
where $w_i$ is the importance weight of feature $i$, derived from a Random Forest (RF) classifier [43] (100 trees) trained on the case base.
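A minimal sketch of weighted retrieval is given below. The feature weights are passed in as plain numbers; in CogCBR they would come from the Random Forest importances, which are not recomputed here. The case identifiers, weight values, and function names are hypothetical.

```python
from math import sqrt


def weighted_distance(q, x, w):
    """Weighted Euclidean distance between a query q and a stored case x."""
    return sqrt(sum(wi * (qi - xi) ** 2 for wi, qi, xi in zip(w, q, x)))


def retrieve(query, case_base, weights, k):
    """Return the k nearest cases as (distance, case_id, label), closest first.

    case_base: list of (case_id, feature_vector, label) tuples.
    """
    scored = [(weighted_distance(query, x, weights), cid, y) for cid, x, y in case_base]
    return sorted(scored)[:k]


# Toy case base with two standardized features.
case_base = [("A", [0.0, 0.0], 0), ("B", [1.0, 0.0], 1), ("C", [0.0, 3.0], 1)]
query = [0.0, 1.0]

# Placeholder weights standing in for RF importances (feature 0 deemed far more relevant).
weighted_ids = [cid for _, cid, _ in retrieve(query, case_base, [0.9, 0.1], k=2)]
# Plain Euclidean distance for comparison.
unweighted_ids = [cid for _, cid, _ in retrieve(query, case_base, [1.0, 1.0], k=2)]
```

With the placeholder weights the neighbors are A and C, whereas plain Euclidean distance returns A and B: down-weighting the second feature makes a large difference along it count for less than a moderate difference along the first.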
3.3.3. Justification for Random Forest Weights
Wettschereck et al. [44] review feature-weighting methods for instance-based learners along three axes: (i) wrapper vs. filter, (ii) global vs. local, and (iii) supervised vs. unsupervised, and observe that supervised global wrappers tend to outperform filter-only weights when features differ in scale and relevance. Among supervised options (information-theoretic weights, conditional entropy, gain ratio, RELIEF, and ensemble-based importances), we adopt RF-derived importance for three reasons. (1) It is a supervised, global, multivariate estimator that captures feature interactions, complementing the univariate MI filter. (2) It is robust to small samples and to the mixed monotonic/non-monotonic feature–label relationships present in gait data (e.g., walking speed is strongly monotonic with the label, while harmonic ratio interacts with stride dynamics). (3) It is stable under bootstrap resampling, important for the 10-fold protocol in which weights are recomputed per fold. We additionally compare against alternatives (Cosine, Manhattan, plain Euclidean, NCA) in Section 4.11, where RF-weighted Euclidean is the strongest empirically.
3.4. Reuse (Adaptation Container: Distance-Weighted Voting)
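Adaptation in CogCBR is intentionally lightweight: a strong vocabulary, similarity measure, and case base together reduce the need for elaborate adaptation knowledge [23,26]. The Reuse stage adapts the solutions of the retrieved cases via inverse-distance weighted (IDW) voting, in which closer neighbors exert greater influence:
$$P(y = 1 \mid q) = \frac{\sum_{c \in \mathcal{N}_k(q)} \mathbb{1}[y_c = 1] \,/\, \bigl(d(q, c) + \epsilon\bigr)}{\sum_{c \in \mathcal{N}_k(q)} 1 \,/\, \bigl(d(q, c) + \epsilon\bigr)} \qquad (2)$$
where $\mathcal{N}_k(q)$ denotes the $k$ nearest neighbors of $q$, $d(q, c)$ is the weighted distance used in the Retrieve stage, $y_c$ is the label of case $c$, and $\epsilon$ prevents division by zero. IDW voting yields the continuous probability estimates required by the entropy-based confidence in the Revise stage (Section 3.5).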
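The voting rule can be illustrated with a short sketch. The weighting form 1/(d + ε) and the value of ε are assumptions made for this example; the text specifies only that votes are inverse-distance weighted and that a small constant guards against division by zero.

```python
def idw_probability(neighbors, eps=1e-8):
    """Inverse-distance weighted probability of the positive class.

    neighbors: list of (distance, label) pairs, with label 1 = patient, 0 = control.
    eps is an assumed small constant that avoids division by zero.
    """
    weights = [1.0 / (d + eps) for d, _ in neighbors]
    positive = sum(w for w, (_, y) in zip(weights, neighbors) if y == 1)
    return positive / sum(weights)


# One close positive neighbor against two more distant negative neighbors.
neighbors = [(0.5, 1), (1.0, 0), (1.0, 0)]
p_idw = idw_probability(neighbors)
p_majority = sum(y for _, y in neighbors) / len(neighbors)
```

An unweighted vote over these three neighbors gives a positive-class share of 1/3, whereas the distance-weighted estimate is approximately 0.5, because the single positive neighbor is twice as close as each negative one.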
3.5. Revise (Confidence-Based Clinical Triage)
CogCBR quantifies prediction reliability through an entropy-based confidence score. For predicted probability $p$:
$$\mathrm{Conf}(p) = 1 - H(p), \qquad H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p) \qquad (3)$$
Confidence ranges from 0 (maximum uncertainty, at $p = 0.5$) to 1 (complete certainty). Cases falling below a configurable threshold $\tau$ are flagged for specialist review rather than receiving an automated classification, so that confident predictions support screening decisions while uncertain cases are routed to a human expert. We emphasize that $\tau$ is a deployment-time parameter; the values reported in Section 4.4 sweep $\tau$ for descriptive purposes and should not be read as the recommended operating point of a deployed system, which would require independent calibration on a target cohort.
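The triage rule might be sketched as follows, assuming the confidence is one minus the binary Shannon entropy in bits. The threshold value 0.3 in the usage lines and the route names are purely illustrative and are not recommended operating points.

```python
from math import log2


def entropy_confidence(p):
    """One minus the binary entropy (in bits) of a predicted probability p."""
    if p <= 0.0 or p >= 1.0:
        return 1.0  # degenerate probabilities carry zero entropy
    h = -(p * log2(p) + (1 - p) * log2(1 - p))
    return 1.0 - h


def triage(p, threshold):
    """Return (predicted_label, confidence, route) for one prediction."""
    conf = entropy_confidence(p)
    label = 1 if p >= 0.5 else 0
    route = "specialist_review" if conf < threshold else "automated"
    return label, conf, route


# Illustrative threshold only; a deployed value would need calibration on a target cohort.
confident = triage(0.9, threshold=0.3)
ambiguous = triage(0.6, threshold=0.3)
```

A probability of 0.9 gives a confidence of about 0.53 and is handled automatically, while a probability of 0.6 gives a confidence of about 0.03 and is routed to a specialist, even though both would receive the same class label.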
3.6. Retain (Case-Base Container Under an Augmentation Policy)
The Retain stage maintains case-base quality [33,34,35]. Following Smyth and McKenna [32], we compute a competence score for each case $c$:
$$\mathrm{Comp}(c) = \frac{\bigl|\{\, q : c \in \mathcal{N}_k(q) \,\wedge\, \hat{y}_q = y_q \,\}\bigr|}{\bigl|\{\, q : c \in \mathcal{N}_k(q) \,\}\bigr|} \qquad (4)$$
where $\mathcal{N}_k(q)$ is the set of neighbors retrieved for query $q$ and $\hat{y}_q$ and $y_q$ are its predicted and true labels. This is the fraction of correct predictions among all queries for which $c$ was retrieved as a neighbor. Cases with high competence (≥0.7) are confirmed as valuable; low-competence cases (<0.3) become candidates for review or removal, as they may correspond to noisy recordings or borderline subjects. These thresholds (0.7/0.3) define the operational maintenance policy of the Retain stage; in Section 4.7, we additionally report finer descriptive bins (0.8/0.5) to summarize the static competence distribution, but those bins do not constitute a separate Retain policy.
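The competence bookkeeping can be illustrated with a small sketch. The retrieval-log format, the status names, and the treatment of never-retrieved cases (they simply receive no score) are assumptions of this example; only the 0.7 and 0.3 thresholds come from the text.

```python
def competence_scores(retrieval_log):
    """Fraction of correct predictions among the queries that retrieved each case.

    retrieval_log: list of (neighbor_ids, prediction_was_correct), one entry per past query.
    Cases that were never retrieved do not appear in the result.
    """
    retrieved, correct = {}, {}
    for neighbor_ids, was_correct in retrieval_log:
        for cid in neighbor_ids:
            retrieved[cid] = retrieved.get(cid, 0) + 1
            correct[cid] = correct.get(cid, 0) + (1 if was_correct else 0)
    return {cid: correct[cid] / retrieved[cid] for cid in retrieved}


def maintenance_status(score, keep=0.7, review=0.3):
    """Map a competence score to the maintenance action described in the text."""
    if score >= keep:
        return "confirmed"
    if score < review:
        return "flag_for_review"
    return "neutral"


# Hypothetical log of six past queries and the cases each one retrieved.
log = [
    (["A", "B"], True),
    (["A", "B"], True),
    (["A", "C"], True),
    (["C", "B"], False),
    (["C"], False),
    (["C"], False),
]
scores = competence_scores(log)
status = {cid: maintenance_status(s) for cid, s in scores.items()}
```

In this log, case A is always associated with correct predictions and is confirmed, case B (2 of 3) stays in the neutral band, and case C (1 of 4) is flagged for specialist review rather than being removed automatically.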
Case-base augmentation policy. A key design question is when, and under what conditions, the case base is augmented during the cycle. CogCBR adopts a deliberately conservative policy:
Outside individual reasoning cycles, periodic competence maintenance is run over the current case base. Cases with competence <0.3 are flagged for clinical review rather than silently removed, and the final removal/retention decision is approved by a specialist. This separation, with maintenance batched outside the per-query cycle, follows the architectural distinction emphasized by Leake and Wilson [34].
Confident predictions (confidence at or above the threshold, $\mathrm{Conf} \geq \tau$) are not added to the case base on their own. Without ground truth they could only reinforce existing biases (a self-training pathology). Such queries enter the case base only after their label has been confirmed at clinical follow-up.
Low-confidence predictions (confidence below the threshold, $\mathrm{Conf} < \tau$) are routed to a specialist via the Revise stage. The specialist’s adjudication supplies the ground-truth label, after which the case is added to the case base. This is the primary mechanism by which the case base grows in deployment, since Revise actively surfaces precisely those cases that the current case base does not yet cover well.
This policy aligns Retain with the knowledge-container view: the case base is updated only with cases whose label is reliable, and Revise’s triage decides which queries are worth eliciting expert labels for. The dashed feedback arrow in Figure 1 corresponds exactly to this update path. We validate the longitudinal effect of this policy in a deployment-style streaming simulation (Section 4.8) that goes beyond the cross-sectional evaluation reported here; prospective clinical validation remains future work.
3.7. Clinical Explanation Output
Following the goals of explanation in CBR [18,19], CogCBR generates a structured explanation containing (i) the k most similar cases together with their labels and distances; (ii) the top three features contributing most to each similarity, identified by
; and (iii) the direction of deviation (higher/lower). A representative explanation reads: “Patient X has gait patterns similar to previous PD cases Y and Z, with reduced walking speed and increased stride variability.” The explanation is thus grounded in clinically interpretable sensor-derived features and in actual prior patients, rather than in a learned attribution model.
3.8. Algorithmic Summary
Algorithm 1 summarizes the complete CogCBR pipeline.
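Algorithm 1 CogCBR Pipeline.
Require: Query $q$, case base $\mathcal{CB}$, MI feature set $S$, RF weights $\mathbf{w}$, neighborhood size $k$, confidence threshold $\tau$
Ensure: Prediction $\hat{y}$, confidence $\mathrm{Conf}$, explanation $E$, augmentation flag $a$
1: Retrieve. Select MI features: $q \leftarrow q[S]$
2: Compute $d(q, c)$ for all $c \in \mathcal{CB}$ {Equation (1)}
3: $\mathcal{N}_k(q) \leftarrow$ $k$ cases with smallest $d(q, c)$
4: Reuse. $p \leftarrow$ weighted vote over $\mathcal{N}_k(q)$ {Equation (2)}
5: $\hat{y} \leftarrow \mathbb{1}[p \geq 0.5]$ {binary class prediction}
6: Revise. $\mathrm{Conf} \leftarrow 1 - H(p)$ {Equation (3)}
7: if $\mathrm{Conf} < \tau$ then
8: Flag for clinical review; $a \leftarrow$ await_specialist_label
9: else
10: $a \leftarrow$ await_followup_confirmation
11: end if
12: Explain. $E \leftarrow \emptyset$
13: for $c \in \mathcal{N}_k(q)$ do
14: $T_c \leftarrow$ the 3 features contributing most to the similarity between $q$ and $c$ (Section 3.7)
15: $E \leftarrow E \cup \{(c, T_c, \text{direction of deviation})\}$
16: end for
17: Retain (offline batch). Periodically recompute Equation (4); flag low-competence cases for specialist review; admit query $q$ into $\mathcal{CB}$ only once its label has been confirmed (per $a$).
18: return $\hat{y}$, $\mathrm{Conf}$, $E$, $a$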
3.9. Experimental Protocol
3.9.1. Datasets
GaitPDB [6] comprises 115 subjects (66 PD, 49 controls) with ground-reaction force recorded by bilateral force-sensitive insoles at 100 Hz, as described in Section 3.1; the classification task is binary PD vs. Control.
GaitNDD comprises 55 subjects (20 Huntington’s disease, 16 controls, 13 ALS, 6 PD) with stride-interval data from PhysioNet and is used here as an independent reproducibility cohort (within-cohort 10-fold CV; see Section 4.9 for the distinction from a strict source-to-target transfer experiment). The classification task on GaitNDD is binary any-NDD vs. Control (39 vs. 16 subjects), consistent with the screening framing of GaitPDB; multi-class subtype classification (PD/HD/ALS) is not attempted because the per-subtype sample sizes (notably only 6 PD subjects) are too small to support reliable subtype-specific evaluation.
3.9.2. Cross-Validation Protocol
We employ stratified 10-fold cross-validation with a fixed random seed (42) for reproducibility. Each subject contributes a single feature vector and appears in exactly one fold, so within-subject leakage is impossible by construction. On each fold, the training partition (∼103 subjects on GaitPDB) serves as the CogCBR case base, and the RF feature weights are recomputed on the training partition only.
For the primary experiments, we report the fixed-parameter protocol used throughout the main comparison. The optimism that may arise from selecting the MI feature count and neighborhood size on the evaluation folds is acknowledged as a methodological caveat in Section 5; we additionally quantify it directly in a nested cross-validation audit (Section 4.3), which recomputes median imputation, standardization, MI ranking, RF feature-weight estimation, and the selection of n and k strictly inside each training fold.
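For intuition about the fold sizes implied by this protocol, a simple stratified split over the 66 PD and 49 control subjects can be sketched as below. The round-robin assignment is an assumption of this example and need not reproduce the exact partition produced by the library splitter used in the experiments; only the subject counts, the fold count, and the seed come from the text.

```python
import random


def stratified_folds(labels, n_folds=10, seed=42):
    """Assign each subject index to exactly one fold, balancing classes across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    offset = 0  # continue the round-robin across classes so fold sizes stay even
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[(pos + offset) % n_folds].append(i)
        offset += len(idx)
    return folds


labels = [1] * 66 + [0] * 49  # GaitPDB class counts from the text
folds = stratified_folds(labels)
fold_sizes = sorted(len(f) for f in folds)
train_sizes = sorted(len(labels) - len(f) for f in folds)
```

Each test fold holds 11 or 12 subjects (6 or 7 PD and 4 or 5 controls), so the training partition that serves as the case base contains 103 or 104 subjects, in line with the approximate figure quoted above.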
3.9.3. Baselines
We compare CogCBR against three groups of baselines.
Group A: 10 traditional ML methods. Logistic Regression, SVM (RBF and Linear kernels), Random Forest, XGBoost [
45], GBM, Decision Tree,
k-NN (
), MLP, and Naive Bayes (default scikit-learn [
46] hyperparameters; features standardized). We deliberately use defaults to limit researcher degrees of freedom on the baseline side; the implication is a tuning asymmetry against CogCBR (whose k, n, and distance metric are tuned, Section 4.11), which we discuss in Section 5.
Group B: 4 deep learning baselines. Deep MLP (128–64–32, BatchNorm, Dropout, 200 epochs), 1D-CNN (32/64 filters, kernel size 3), LSTM (2-layer, hidden size 64), and TabNet [
47] (
, 100 epochs). All deep models are trained with the Adam optimizer and class-weighted loss on an NVIDIA RTX 4090 GPU. Together with Group A (10 methods) and Group C (2 methods), this gives the 16 baselines referenced throughout the paper.
Note on the LSTM baseline. Because the engineered features are atemporal, an LSTM cannot exploit its sequential inductive bias on them directly. The configuration reported as “LSTM” in
Table 1 therefore treats the 41 feature dimensions as a length-41 input sequence with one channel, providing the most direct comparison with the other tabular baselines. As a sensitivity analysis, we additionally trained an “LSTM-raw” variant on the original force time series (down-sampled to 1 Hz, padded to 600 time steps, per-foot vertical GRF as channels) using the same architecture. LSTM-raw reaches AUC =
on GaitPDB, statistically indistinguishable from the tabular LSTM (Wilcoxon
), suggesting that the small-sample regime is the dominant bottleneck for deep sequence models here. LSTM-raw is reported in
Table 1 as a supplementary analysis and is
not counted among the 16 baselines used for the multiple-comparison correction.
Group C: 2 domain-specific gait baselines. We re-implement two published gait-classification pipelines as direct comparison points: Wahid-2015 [7] (spatial–temporal feature engineering followed by linear/quadratic discriminant classification) and Rehman-2019 [8] (clinically curated gait features with a tuned Random Forest classifier and recursive feature elimination). Both are evaluated under the identical 10-fold protocol. Re-implementations follow the algorithmic descriptions in the original papers; small differences from the original numbers are expected because the underlying datasets and pre-processing pipelines differ.
GaitNDD evaluation. For GaitNDD, the input is stride-interval data only; baselines that require multi-channel force input or that failed to converge on
are reported as “—” in
Table 1. Specifically, SVM (Linear), MLP, Deep MLP, 1D-CNN, LSTM, LSTM-raw, and TabNet either failed to converge stably on GaitNDD or required input modalities not available there, and are therefore omitted rather than reported with degenerate scores.
3.9.4. Metrics and Statistical Tests
We report AUC (primary), accuracy, F1-score, sensitivity (recall), and specificity. Sensitivity is particularly relevant in a clinical screening setting because it directly reflects the false-negative (missed-diagnosis) burden. Statistical significance is assessed via the Wilcoxon signed-rank test (one-sided, $H_1$: CogCBR > baseline) on per-fold AUC scores. We report uncorrected per-baseline p-values throughout. With 16 baselines on GaitPDB, the Bonferroni-corrected significance level is $\alpha = 0.05/16 \approx 0.0031$; in Section 5, we identify which comparisons survive this correction so that readers can calibrate the strength of the per-baseline claims. Because the comparisons are directional and exploratory on small datasets, the p-values are used only to contextualize the AUC ranking rather than as conclusive significance tests.
4. Results
4.1. Gait Feature Differences Between Groups
Figure 3 illustrates the distribution of two discriminative gait features between PD patients and healthy controls. Walking speed shows a large effect size (Cohen’s
), with PD patients exhibiting significantly reduced velocity. Right-foot force CV (RF CV) also differs substantially (
): PD patients show systematically lower RF CV values than controls in our cohort, indicating altered force-variability patterns rather than uniformly higher variability. Both differences are statistically significant (
). The direction of the RF CV effect should not be over-interpreted: gait variability in PD is a multi-dimensional construct, and stride-timing variability (captured separately by stride-CV features) often increases in PD even when plantar-force CV does not. In plain terms, a reader might expect Parkinsonian gait to be more variable on every measure, so a lower right-foot force-CV in patients can seem counter-intuitive. The two quantities measure different things: force-CV captures cycle-to-cycle variation in how hard the foot presses, whereas stride-CV captures variation in step timing. Many PD patients walk with smaller, more uniform (shuffling) plantar-force profiles, which lowers force-CV, while their step timing becomes more irregular, which raises stride-CV. A decrease in one therefore does not contradict the increased gait variability classically associated with PD.
4.2. Screening Performance
Table 1 presents classification performance on GaitPDB and GaitNDD. CogCBR attains AUC = 0.861 on GaitPDB (mean of per-fold AUCs), a competitive, top-ranked result relative to the 16 baselines. The value 0.868 in Figure 4 differs because it is computed on predictions pooled across folds rather than as the mean of per-fold AUCs. We report the per-fold mean (0.861) as the primary number because it pairs naturally with the per-fold Wilcoxon test, and the pooled value (0.868) wherever a single ROC curve or threshold sweep is presented. Per-baseline pairwise Wilcoxon tests yield uncorrected
against 11 of 14 generic baselines and against both re-implemented gait-specific baselines (
Wahid-2015: 0.798,
;
Rehman-2019: 0.821,
). These per-baseline
p-values are uncorrected; both the multiple-comparison correction and the tuning asymmetry between CogCBR (whose
k,
n, and distance metric are tuned) and the Group A defaults are discussed in
Section 5. The intended reading is that CogCBR is
competitively ranked and additionally provides functionality, including confidence-based triage (Revise), case-base maintenance (Retain), and intrinsic case-based explanations, that the baselines do not.
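The difference between the per-fold mean AUC and the pooled AUC can be made concrete with a small sketch. It computes AUC as the fraction of positive/negative pairs ranked correctly (ties counted as one half); the two toy folds are hypothetical and chosen only to show that the two summaries need not agree.

```python
def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical folds: (true labels, predicted PD probabilities).
folds = [
    ([1, 1, 0, 0], [0.9, 0.6, 0.5, 0.2]),
    ([1, 1, 0, 0], [0.5, 0.4, 0.3, 0.1]),
]

# Summary 1: average the AUC computed separately within each fold.
per_fold_mean = sum(auc(y, s) for y, s in folds) / len(folds)

# Summary 2: concatenate all predictions, then compute a single AUC.
pooled_labels = [y for ys, _ in folds for y in ys]
pooled_scores = [s for _, ss in folds for s in ss]
pooled_auc = auc(pooled_labels, pooled_scores)
```

Here each fold ranks its own subjects perfectly, yet pooling lowers the AUC because scores are not on a common scale across folds. The gap can go in either direction, which is why the two numbers are reported for different purposes.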
On GaitNDD, CogCBR attains AUC = 0.902 and achieves the highest mean AUC among the eleven methods that can be evaluated on stride-interval data; however, its margin over the strongest classical baselines (Random Forest 0.887, Rehman-2019 0.872) is not statistically significant. This absence of separation is addressed explicitly in
Section 5: on a 55-subject dataset the per-fold AUC variance (
) is comparable to the absolute AUC differences (∼0.015–0.030) the test would have to detect.
For the screening-relevant sensitivity/specificity trade-off, CogCBR achieves sensitivity = 0.745 and specificity = 0.880 on GaitPDB, and sensitivity = 0.897 and specificity = 0.688 on GaitNDD. These complementary profiles illustrate that the model is conservative on the larger PD-vs-control cohort (favoring few false positives) and more sensitive on the smaller NDD-vs-control cohort (favoring few false negatives), consistent with the screening framing in which uncertain cases are routed to specialist review by the Revise stage rather than acted on as final diagnoses.
Figure 4 shows the ROC curves on GaitPDB.
Figure 5 shows the distribution of predicted PD probabilities for each class. The two distributions are well-separated (Cohen’s
), with PD patients centered around 0.68 and controls around 0.31.
4.3. Parameter-Selection Optimism and Baseline Fairness
This section examines three methodological issues: selection optimism from choosing n and k on the evaluation folds, the fairness of comparing a tuned CogCBR against default-configured baselines, and the strength of the multiple-comparison evidence. These are addressed, respectively, by a nested cross-validation audit, a fully tuned-vs.-tuned comparison, and a corrected-vs.-uncorrected significance summary.
To quantify the optimism introduced by selecting the MI feature count
n and the neighborhood size
k on the evaluation folds, we repeated the full pipeline under a strictly nested protocol [
48,
49]: on each outer training fold an inner 5-fold loop selected
n and
k, and median imputation, standardization, MI ranking, and RF feature-weight estimation were all recomputed inside the training partition only.
Table 2 reports the result. On GaitPDB, the nested AUC is 0.842, only 0.019 below the fixed-parameter value of 0.861, indicating that the selection optimism is small; the nested estimate remains at or above the strongest classical baselines evaluated under their default configuration (for example, GBM
and Rehman-2019
). On GaitNDD, the nested AUC is
; the larger drop and wide interval are consistent with the 55-subject cohort. We retain the fixed-parameter numbers as the primary results, with this audit confirming they are not materially inflated by selection.
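The nested protocol described above can be sketched generically. This is an illustration under stated assumptions, not the study's pipeline: the folds are simple interleaved splits rather than stratified ones, and `fit`, `score`, and `param_grid` are hypothetical placeholders. In a faithful version, `fit` would perform imputation, standardization, MI ranking, and RF weight estimation using only the data it is given.

```python
def k_folds(n_items, n_folds):
    """Interleaved index folds (not stratified; an assumption of this sketch)."""
    return [list(range(i, n_items, n_folds)) for i in range(n_folds)]


def nested_cv(data, param_grid, fit, score, outer=10, inner=5):
    """Return one held-out score per outer fold, selecting parameters
    on the outer training partition only."""
    outer_scores = []
    for test_idx in k_folds(len(data), outer):
        held_out = set(test_idx)
        train = [d for i, d in enumerate(data) if i not in held_out]
        test = [data[i] for i in test_idx]

        def inner_score(params):
            # Mean validation score over the inner folds of the training data.
            total = 0.0
            for val_idx in k_folds(len(train), inner):
                val = set(val_idx)
                fit_part = [d for i, d in enumerate(train) if i not in val]
                val_part = [train[i] for i in val_idx]
                total += score(fit(fit_part, params), val_part)
            return total / inner

        best = max(param_grid, key=inner_score)
        # Refit on the full outer training partition, evaluate once on test.
        outer_scores.append(score(fit(train, best), test))
    return outer_scores


# Toy usage: the "model" is just a threshold taken from the parameter grid.
toy = [(i % 10, int(i % 10 >= 5)) for i in range(20)]
toy_fit = lambda train, threshold: threshold
toy_score = lambda t, part: sum((x >= t) == bool(y) for x, y in part) / len(part)
scores = nested_cv(toy, [2, 5, 8], toy_fit, toy_score, outer=5, inner=2)
```

The key property is that the outer test fold never influences parameter selection, so the mean of `scores` is free of the selection optimism that the audit quantifies.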
To assess baseline fairness, a fully tuned-vs.-tuned comparison (
Table 3 and
Table 4) was conducted: each Group A baseline was given the same inner-CV tuning budget that CogCBR receives. Under matched tuning, CogCBR’s nested AUC (
) remains the highest on GaitPDB, but its margin over the strongest tuned baseline (GBM,
) is not significant (one-sided Wilcoxon
). On the smaller GaitNDD cohort, tuned Random Forest (
) ranks highest and CogCBR (
) ranks fifth; however, the per-fold AUC variance is large (SD up to
on 55 subjects), so the ranking is not statistically separable and no comparison survives Bonferroni correction. CogCBR is therefore competitive with—and statistically indistinguishable from—the strongest tuned classical baselines, with its contribution resting on the integrated Revise, Retain, and explanation functionality that the higher-AUC baselines do not provide.
Finally, to make the multiple-comparison evidence explicit (
Table 5), we summarize, for each of the 16 baselines on GaitPDB, the uncorrected one-sided Wilcoxon
p-value and whether it remains significant at the uncorrected level (
) and at the Bonferroni-corrected level (
). Thirteen of the sixteen comparisons are significant before correction, but only the 1D-CNN comparison (
) survives Bonferroni correction. We therefore present the uncorrected pattern as suggestive rather than confirmatory, consistent with the claims stated in
Section 5.
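The corrected-vs.-uncorrected summary reduces to comparing each p-value against two thresholds. The sketch below assumes a conventional uncorrected level of 0.05, so that the Bonferroni level for 16 comparisons is 0.05/16; the p-values are hypothetical placeholders chosen only to reproduce the qualitative pattern, not the values in Table 5.

```python
def bonferroni_flags(p_values, alpha=0.05):
    """Return the corrected threshold and, per p-value, a pair
    (significant uncorrected, significant after Bonferroni)."""
    threshold = alpha / len(p_values)
    return threshold, [(p < alpha, p < threshold) for p in p_values]


# Hypothetical p-values for 16 baseline comparisons (placeholders only).
hypothetical_p = [0.001] + [0.01] * 12 + [0.2] * 3

threshold, flags = bonferroni_flags(hypothetical_p)
n_uncorrected = sum(u for u, _ in flags)  # comparisons with p < alpha
n_corrected = sum(c for _, c in flags)    # comparisons surviving correction
```

With these placeholder values, thirteen comparisons pass the uncorrected level and one survives correction, which shows how sharply the evidence thins when the threshold is divided by the number of comparisons.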
4.4. Confidence-Based Clinical Triage (Revise)
Table 6 reports the confidence/coverage/accuracy trade-off produced by the Revise stage. At threshold ≥0.1, CogCBR retains 68.7% of predictions with accuracy improving from 80.0% to 87.3% and (pooled) AUC from 0.868 to 0.907. At ≥0.6, the 13.9% most-confident predictions correspond to only ∼16 subjects pooled across folds, on which the system happens to be perfectly accurate; the AUC of 1.000 at this threshold should therefore be read as a property of the easiest-to-classify subset of the cohort rather than as a calibrated estimate of operator behavior. Clinicians can adjust the threshold to balance throughput against safety, but the threshold itself must be calibrated on data independent of any future deployment cohort: an external calibration set is required before any specific value of
can be recommended for clinical use.
Figure 6 further validates the Revise mechanism by comparing correctly classified with misclassified cases. Neighbor label agreement (the proportion of
k neighbors sharing the query label) shows substantial separation, with
for correct predictions versus
for misclassifications (Mann–Whitney
). Classification confidence is 2.7× higher for correct cases (0.32 vs. 0.12).
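The coverage/accuracy trade-off of the Revise stage can be sketched as follows. Equation (3) is not reproduced in this section, so the sketch assumes the entropy confidence is one minus the binary entropy (in bits) of the predicted PD probability; the function names and the toy predictions are hypothetical.

```python
import math


def entropy_confidence(p_pd):
    """Assumed form: 1 minus the binary entropy (bits) of the PD probability."""
    if p_pd in (0.0, 1.0):
        return 1.0
    h = -(p_pd * math.log2(p_pd) + (1.0 - p_pd) * math.log2(1.0 - p_pd))
    return 1.0 - h


def triage(probabilities, labels, threshold):
    """Keep predictions whose confidence reaches the threshold and
    return (coverage, accuracy on the retained predictions)."""
    kept = [(p, y) for p, y in zip(probabilities, labels)
            if entropy_confidence(p) >= threshold]
    coverage = len(kept) / len(labels)
    if not kept:
        return coverage, float("nan")
    accuracy = sum((p >= 0.5) == (y == 1) for p, y in kept) / len(kept)
    return coverage, accuracy


# Hypothetical predictions: two confident and correct, two near 0.5 and wrong.
probs = [0.95, 0.55, 0.10, 0.45]
truth = [1, 0, 0, 1]
coverage, accuracy = triage(probs, truth, threshold=0.1)
```

Under this assumed form the confidence falls to zero at a probability of 0.5, so raising the threshold removes the most ambiguous predictions first; the retained subset is smaller but more accurate, as in Table 6.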
4.5. Calibration of the Confidence Score
We do not claim that the entropy-based confidence score is well-calibrated in the strict probabilistic sense, and a calibration analysis (
Table 7,
Figure 7) confirms this. Read as a probability of correctness, the entropy confidence is systematically under-confident (ECE
): it collapses toward zero near
even though such predictions remain frequently correct, reflecting its role as a triage/separation score (
Section 3.5,
Figure 6) rather than a calibrated probability. The same analysis shows that, when a calibrated probability is required, post hoc calibration of the IDW probability achieves low calibration error (RF-probability + Platt: ECE
; Platt: ECE
) and split conformal prediction attains valid, slightly conservative coverage (93.3% at a 90% target), at a modest accuracy cost on this small cohort. We therefore retain the entropy confidence for triage, where separation rather than calibration is what matters, and recommend post hoc calibration for any deployment that must report calibrated risk estimates. The calibrators compared are Platt scaling [
50], isotonic regression [
51], and split conformal prediction [
52], assessed by the expected calibration error [
53] and the Brier score [
54].
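The expected calibration error used in this comparison can be sketched with equal-width binning. The number of bins is an assumption (ten here), as the binning scheme is not stated in this section, and the toy inputs are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        index = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes to last bin
        bins[index].append((c, ok))

    n_total = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            ece += len(members) / n_total * abs(accuracy - mean_conf)
    return ece


# Hypothetical under-confident score: confidence 0.1, yet 3 of 4 are correct.
ece_under = expected_calibration_error([0.1, 0.1, 0.1, 0.1], [1, 1, 1, 0])
```

The toy case mirrors the behaviour described above: a score that sits near zero while the predictions are mostly correct produces a large ECE even though it may still rank correct above incorrect predictions well.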
4.6. Clinical Explanation Through Case Studies
We present three representative cases from the 10-fold CV evaluation, chosen to illustrate the structure of the explanation output and the safety value of Revise.
Case A: correct PD detection (confidence = 1.0). Subject #4 (PD) is correctly classified with maximum confidence. All seven retrieved neighbors are PD cases (nearest distance = 0.97). The explanation highlights low walking speed (0.68 m/s), reduced stride regularity (0.68), and elevated double-support time (38.4%), all established PD gait biomarkers [
3]. The most discriminative features are mean total force (weight 0.048), stride regularity (0.065), and total-force CV (0.033).
Case B: correct control classification (confidence = 1.0). Subject #81 (Control) is correctly classified, with all seven neighbors being controls (nearest distance = 0.53), distinguished by normal walking speed (1.35 m/s), high stride regularity (0.70), and low double-support time (22.7%).
Case C: Revise preventing a misclassification. Subject #109 (Control) is misclassified as PD (), but Revise assigns only moderate confidence (0.53). Under a threshold of ≥0.6, this case would be flagged for specialist review rather than acted on. Inspection reveals that six of the seven neighbors are PD patients: walking speed is normal (1.37 m/s), but stride regularity is anomalous (0.10) and double-support time is zero, suggesting possible data-quality issues with the sensor recording. At that threshold, Revise thus prevents a false positive that could trigger an unnecessary referral.
4.7. Case Competence Analysis (Retain)
The Retain stage evaluates each case’s classification utility across all 10-fold CV queries. Under the operational policy thresholds defined in
Section 3.6 (≥0.7 keep, 0.3–0.7 monitor, <0.3 flag for specialist review), the 115 cases distribute as summarized in
Table 8. Cases that are never retrieved as a neighbor in the 10-fold evaluation are assigned competence = 0 by convention and so fall in the lowest bin; this convention means low-competence cases comprise both genuinely error-prone retrievals and cases that the current case base has rendered redundant. In either case, the augmentation policy’s specialist-review pathway provides the same audit mechanism. We note in passing that the 68.7% (79/115) of cases in this confirmed-retain group coincides numerically with the 68.7% Revise-stage coverage at threshold ≥0.1 (
Table 6); although numerically identical, the two figures are computed independently from different quantities—case competence (Equation (
4)) here versus entropy confidence (Equation (
3)) there—so their coincidence carries no special meaning.
For completeness, we additionally report a finer descriptive grouping that is used only to summarize the static distribution of case-base quality (and is not a separate maintenance policy): 62 cases (53.9%) have competence ≥0.8 (clearly high quality); 43 cases (37.4%) lie in 0.5–0.8 (moderate); and 10 cases (8.7%) lie below 0.5 (lower quality, meriting closer inspection). The mean competence is (SD = ). The cases with competence <0.3 are a strict subset of the 10 cases below 0.5; lower-quality cases plausibly represent atypical gait patterns, borderline subjects, or sensor-recording artifacts.
This analysis is
descriptive: it characterizes the static case base under the 10-fold protocol and identifies which cases would be candidates for review. It does not, on its own, validate the longitudinal effect of the augmentation policy, which would require a deployment-style simulation in which the case base grows under realistic clinical-feedback rates. This simulation is provided in
Section 4.8.
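The mapping from competence to maintenance action can be sketched as below. Equation (4) is not reproduced in this section, so the sketch assumes competence is the fraction of retrievals in which the stored case's label matched the query's true label; the keep and flag thresholds and the zero-by-convention rule follow the text above, while the function names are hypothetical.

```python
def competence(retrieval_outcomes):
    """Assumed form: share of retrievals in which the case's label agreed
    with the query's true label. Never-retrieved cases get 0 by convention."""
    if not retrieval_outcomes:
        return 0.0
    return sum(retrieval_outcomes) / len(retrieval_outcomes)


def retain_action(score, keep_at=0.7, flag_below=0.3):
    """Map a competence score to the operational policy bins."""
    if score >= keep_at:
        return "keep"
    if score < flag_below:
        return "flag"  # routed to specialist review
    return "monitor"


# Hypothetical retrieval histories for three stored cases.
actions = [
    retain_action(competence([True, True, True, False])),  # competence 0.75
    retain_action(competence([True, False])),              # competence 0.50
    retain_action(competence([])),                         # never retrieved
]
```

The last case shows why the lowest bin mixes two populations: a case that is never retrieved is flagged in the same way as one that is retrieved often but misleads, so the review step has to tell them apart.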
4.8. Longitudinal Retain Simulation
To validate the augmentation policy beyond the static analysis of
Section 4.7, we simulate deployment as a stream [
55]: from a seed case base, GaitPDB subjects arrive sequentially (10 random orderings), CogCBR predicts and triages each, and the case base is updated under the policy of
Section 3.6. We compare the policy (A) against naive self-training that admits confident predictions with their predicted labels (B), a verification ablation following A’s data flow but with unverified confident labels (B2), an all-true-label reference that performs no curation (C), and a static no-growth baseline (D), sweeping the follow-up confirmation rate (the clinical-feedback rate) over
at
(with
as a sensitivity check). As summarized in
Table 9, the label-verified policy A injects zero label noise versus
erroneous labels for B, and beats B on the area under the learning curve across all feedback rates (paired
p down to 0.002). It clearly exceeds the static baseline D (0.801–0.838 versus 0.750) and approaches the all-true-label no-curation reference C in final AUC: it matches C within the per-fold noise at confirmation rates ≥0.7 and marginally exceeds it at full confirmation, since A (unlike C) additionally curates the case base. It does so while maintaining a smaller case base (60.8–80.9 versus 86 cases), and it shows the same qualitative pattern at
. The verification ablation B2 degrades as more unverified labels enter (2.5 to 9.7 erroneous labels); the gap to A is negligible at low feedback rates but widens in A’s favor as the rate rises (A improving while B2 degrades), isolating label verification as the operative mechanism. This remains a resampling-based simulation on a cross-sectional cohort; prospective longitudinal clinical validation is still required.
Figure 8 shows the corresponding held-out AUC as a function of the number of arrived subjects: across all four feedback rates, the label-verified policy A rises above the static baseline D toward the all-true-label no-curation reference C, while naive self-training B falls below D and the verification ablation B2 degrades as the feedback rate increases.
4.9. Independent-Cohort Reproducibility Evaluation on GaitNDD
We further evaluated CogCBR on GaitNDD, a second independent PhysioNet dataset. This experiment uses stratified 10-fold cross-validation
within GaitNDD; following the recommendation that the term “external validation” should be reserved for source-to-target transfer with non-overlapping cohorts and matching label definitions, we describe this experiment as an
independent-cohort reproducibility evaluation. It assesses whether the same CogCBR workflow can be reproduced on an independent cohort with stride-interval features; it is not a strict external test. The GaitNDD feature set is small (nine stride-interval features); MI selection retained eight of them, dropping a single near-zero-variance feature. With
, CogCBR achieves AUC =
(
Table 1), with the highest mean AUC among the eleven evaluable methods. The triage and explanation mechanisms transfer unchanged. Statistical significance against the strongest baselines (Random Forest 0.887, Rehman-2019 0.872) is not reached, and the per-fold variance is comparable to the absolute AUC differences. The honest reading is that CogCBR is competitive but not measurably superior on GaitNDD under within-cohort CV; we revisit this in
Section 5.
4.9.1. Choice of k on GaitNDD
For GaitNDD, we used
rather than the
setting employed on GaitPDB. The motivation is structural rather than tuning-driven: GaitNDD is substantially smaller (55 subjects) and provides only a nine-feature stride-interval representation. In each training fold, the minority control class contains approximately 14–15 subjects; using
would therefore make each neighborhood cover almost half of the available control class and risk over-smoothing local retrieval evidence. A smaller, more local neighborhood (
) preserves the case-based explanation property under the reduced feature space and cohort size. We acknowledge this dataset-specific choice as a limitation in
Section 5 and note that future larger-cohort studies should select
k under a fully nested protocol.
4.9.2. Source-to-Target Transfer
As a stricter cross-modality domain-shift analysis, we trained models on GaitPDB and tested them directly on GaitNDD using only the shared stride-based features. In this setting, CogCBR obtained AUC = 0.737,
below the strongest baseline (Logistic Regression, AUC = 0.832); detailed results are reported in
Table 10. This confirms that source-to-target transfer across different sensor modalities and label definitions is substantially harder than within-dataset validation, and that CogCBR does not exceed classical baselines in this regime. Similarity-based retrieval inherits the limitations of its case base when the deployment distribution differs sharply from it; our main claims therefore rest on GaitPDB cross-validation and the within-cohort reproducibility evaluation on GaitNDD, not on direct cross-modality transfer. We report the transfer numbers transparently rather than omit them.
4.10. Computational Cost and Edge Deployability
Beyond predictive performance and interpretability, practical clinical screening systems must be deployable under resource-constrained settings such as wearable sensors, mobile health terminals, and bedside screening devices. We therefore profile CogCBR and representative baselines in terms of training time, single-sample inference latency, and deployment payload size on the GaitPDB analysis set (115 subjects). Inference latency is averaged over 2000 single-sample runs, matching the per-subject deployment scenario rather than batch throughput. For CogCBR, the deployment payload comprises the case base with 18 MI-selected features, class labels, standardization parameters, selected feature indices, and RF-derived feature weights.
Table 11 reports the results. CogCBR has the lowest single-sample inference latency in the comparison (0.080 ms): more than 30× faster than Random Forest (2.678 ms) and at least 4× faster than XGBoost (0.510 ms), 1D-CNN (0.426 ms), and LSTM (0.369 ms). It also has the smallest deployment payload at 8.4 KB; Random Forest, 1D-CNN, and LSTM require 48.1 KB, 35.0 KB, and 211.7 KB, respectively. Although XGBoost is also compact at 15.0 KB, it provides no case-level explanations, neighbor-retrieval evidence, or confidence estimation. Training cost is comparable to that of traditional ML baselines (all non-deep methods fit within 0.1 s); we do not interpret small fitting-time differences as a primary result, since they fall within wall-clock measurement variability.
CogCBR’s deployment payload grows linearly with the number of stored cases; on the present GaitPDB cohort the full payload is under 10 KB and a single inference takes under 0.1 ms. Even if the case base were extended to several thousand subjects, storage would remain manageable, and the Retain stage bounds case-base growth through competence-based curation. CogCBR is therefore well positioned as a lightweight, interpretable, and continuously maintainable edge screening method for rapid first-line risk assessment of neurodegenerative disease.
4.11. System Component Analysis
We systematically evaluate each CogCBR component. The values of
n and
k reported below were selected at their AUC peaks on the same 10-fold partitions, a protocol asymmetry already declared in
Section 3.9 and acknowledged in the limitations (
Section 5).
4.11.1. MI Feature Selection
MI ranking identifies walking speed as the most discriminative feature (MI = 0.150), followed by force CV (0.091), stride regularity (0.089), and double-support percentage (0.088), all clinically established PD gait markers [
3]. Grid search over
finds that AUC peaks at
n = 18 (0.861) and declines with more features (
: AUC = 0.801), confirming that low-MI features inject noise into the distance computation in small-sample CBR.
4.11.2. Distance Metric
Table 12 compares five distance metrics for case retrieval (MI-18, k = 7). The RF-weighted Euclidean distance outperforms all alternatives (+0.048 AUC over the second-best Cosine), confirming that supervised, multivariate feature importance in the distance function improves retrieval. NCA, a learned metric, performs worst, consistent with overfitting on the small training set.
4.11.3. Neighborhood Size
AUC peaks at k = 7 (0.861) and declines both for smaller values (: AUC = 0.795 with higher variance) and for larger values (: AUC = 0.838 with over-smoothing).
4.11.4. Component Ablation
Table 13 isolates each component’s contribution under identical settings (
, stratified 10-fold CV). The RF-weighted distance, combined with IDW voting, is the single most impactful pairing (+0.074 AUC for Full over MI + Euclidean + Majority, a row that swaps
both the distance and the vote); isolating the distance alone, with IDW voting held fixed (
Table 12), RF-weighted Euclidean still yields +0.076 AUC over plain Euclidean, confirming the distance as the dominant single factor. MI selection adds a further +0.027 (Full vs. No MI). The cumulative contribution of MI selection and the RF-weighted distance is +0.102 AUC over the minimal-component baseline (No MI + Euclidean + Majority). IDW voting is retained over majority voting because it produces the continuous probability estimates required by Revise (Equation (
3)).
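The two components compared in the ablation, a feature-weighted Euclidean distance and inverse-distance-weighted (IDW) voting, can be sketched together. The exact formulas are defined in Section 3 and are not reproduced here, so this sketch assumes the weights multiply the squared feature differences and that each neighbor votes with weight 1/(distance + eps); the function names, the `eps` guard, and the toy case base are hypothetical.

```python
import math


def weighted_distance(x, y, weights):
    """Feature-weighted Euclidean distance (assumed form)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def idw_probability(query, cases, labels, weights, k=7, eps=1e-9):
    """Probability of the positive class from the k nearest cases,
    each voting with weight 1 / (distance + eps)."""
    ranked = sorted(
        (weighted_distance(query, case, weights), label)
        for case, label in zip(cases, labels)
    )[:k]
    votes = [(1.0 / (dist + eps), label) for dist, label in ranked]
    total = sum(v for v, _ in votes)
    return sum(v for v, label in votes if label == 1) / total


# Toy case base: one PD case (label 1) and one control case (label 0).
cases = [[1.0, 0.0], [0.0, 2.0]]
labels = [1, 0]
query = [0.0, 0.0]

p_plain = idw_probability(query, cases, labels, weights=[1.0, 1.0], k=2)
p_weighted = idw_probability(query, cases, labels, weights=[1.0, 0.25], k=2)
```

With uniform weights the PD case is closer and receives two thirds of the vote; down-weighting the second feature pulls the control case to the same distance and the probability falls to one half. Unlike majority voting, the output is continuous, which is what the entropy-based confidence of Revise requires.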
4.12. Class-Imbalance Ablation
Both cohorts are imbalanced toward the disease class (GaitPDB 66 PD/49 control; GaitNDD 39 NDD/16 control), a setting widely studied in machine learning [
56]. We evaluated four balancing strategies against the unbalanced configuration—class-weighted inverse-distance voting, random oversampling, SMOTE [
57], and random undersampling—applied inside each training fold (
Table 14). No balancing strategy meaningfully improved AUC on either cohort: on GaitPDB only class-weighted IDW edged out the unbalanced configuration, and then only within numerical tolerance (0.862 vs. the unbalanced 0.861), while every other strategy fell below it; on GaitNDD, the unbalanced configuration retained the highest AUC (0.902). Critically, every strategy reduced sensitivity—the screening-relevant metric—while raising specificity (sensitivity fell from 0.745 to 0.626–0.700 on GaitPDB and from 0.897 to 0.617–0.792 on GaitNDD), with the GaitNDD class-weighting reduction reaching significance (
). This is expected: because the disease class is the majority in both cohorts, balancing shifts the decision boundary toward the minority control class, trading disease sensitivity for specificity. Since a screening tool prioritizes minimizing missed diagnoses, we report the unbalanced configuration and route low-confidence cases to specialist review (Revise) rather than rebalancing.
5. Discussion
The novelty of CogCBR lies not in any single CBR operator: weighted Euclidean distance, IDW voting, entropy-based confidence, and competence-based maintenance are all established techniques drawn from the CBR literature of the last three decades. The contribution lies in their principled
integration into a complete R⁴ pipeline for a clinical screening domain in which prior CBR work has typically stopped at Retrieve and Reuse [
20,
22], combined with an explicit case-base augmentation policy that closes the loop between confident predictions, expert review, and case storage. Revise enables the system to flag uncertain or potentially unreliable predictions for specialist review; Retain provides a designed mechanism for long-term case-base quality (its longitudinal effect is supported by the streaming simulation of Section 4.8 but remains to be validated prospectively); and the explanation output supports clinician verification.
Framing CogCBR through Richter’s containers [
23] makes a design choice explicit: in domains where adaptation knowledge is hard to elicit, such as clinical gait screening from sensor data, it is rational to invest more in the vocabulary, similarity, and case-base containers. The empirical results agree: the strongest single contributor is similarity (the RF-weighted distance, +0.076 AUC in isolation), followed by vocabulary (MI selection, +0.027 AUC). With these containers strengthened, lightweight inverse-distance voting in the adaptation container suffices.
The Revise stage provides self-assessment of prediction reliability: at threshold ≥0.1, the pooled AUC improves from 0.868 to 0.907 at 68.7% coverage. Subject #109 (
Section 4.6) illustrates the safety value: despite a wrong PD prediction, the moderate confidence (0.53) would, under a stricter triage threshold (≥0.6;
Table 6), trigger specialist referral rather than an automated decision. We do not claim that the entropy-based confidence score reported here is well-calibrated in the strict probabilistic sense; a dedicated calibration analysis (
Section 4.5,
Table 7,
Figure 7) confirms this and shows that post hoc calibration (Platt scaling, isotonic regression, RF-probability + Platt) and split conformal prediction yield well-calibrated or validly covered estimates when a calibrated risk is required, while the entropy score is retained for triage, where separation rather than calibration is what matters.
Beyond predictive metrics, an important limitation is that the present study does not include a clinician-centered evaluation, and the intended human–AI workflow deserves to be made explicit. We envisage CogCBR operating as a first-line triage aid rather than an autonomous diagnostic system. For each incoming recording, the clinician is shown not a bare label but the case-based explanation of
Section 3.7: the
k most similar prior patients, their labels and distances, the gait features driving each match, and the direction of clinical deviation, together with the contextual metadata (age, sex, recording protocol, and, where available, Hoehn & Yahr stage) that is displayed but excluded from the distance. This lets the clinician verify a prediction against patients they can inspect, and flag demographic or protocol mismatches that a scalar score would hide. The confidence threshold
is the operating control of this workflow: lowering
increases automated coverage at the cost of admitting less-certain predictions, while raising it routes more borderline cases to specialist review (
Table 6). As we stress in
Section 3.5 and
Section 4.4,
is a deployment-time parameter that must be calibrated on a target cohort before any specific value is recommended, since the costs of a missed diagnosis and an unnecessary referral are asymmetric and site-specific. Several workflow-integration challenges remain open and are not resolved by the present cross-sectional study: eliciting confirmed labels at clinical follow-up at a realistic rate (the confirmation-rate sweep of
Section 4.8 brackets this), avoiding automation bias when an explanation is persuasive but wrong (Case C in
Section 4.6 is a deliberate example), integration with electronic-health-record and sensor-acquisition systems, and the regulatory requirement for prospective validation as a clinical decision-support tool. A dedicated clinician-in-the-loop usability study—measuring how retrieved cases and confidence flags affect referral decisions, reading time, and trust—is an important direction for future work.
Beyond classification performance, the computational profile reported in
Section 4.10 is directly relevant to the wearable-sensor screening setting that motivates this paper. With a deployment payload of 8.4 KB and a per-subject inference latency below 0.1 ms, CogCBR fits comfortably on the embedded processors typical of force-sensitive insoles, mobile health terminals, and edge gateways. Importantly, this efficiency does not trade off against the clinically useful capabilities of the framework: case-level explanations, neighbor-retrieval evidence, and entropy-based confidence are all preserved, which is not the case for the equally compact XGBoost baseline. We therefore view edge deployability not as an isolated engineering metric but as a structural property of the CBR design that complements its interpretability for sensor-based screening.
Two statistical issues warrant explicit discussion. First, regarding multiple comparisons on GaitPDB, the per-baseline
p-values in
Table 1 are uncorrected pairwise Wilcoxon tests (summarized before and after correction per baseline in
Table 5); under Bonferroni correction at
only the 1D-CNN comparison (
) survives. We therefore do
not claim that CogCBR is significantly better than every baseline after correction. What we do claim is that (i) CogCBR’s mean per-fold AUC ranks at or above every baseline on both datasets, (ii) the direction of the effect is consistent across all 16 baselines, and (iii) the architecture provides confidence-based triage and case-based explanations that none of the baselines, including the higher-AUC ones, provide. The uncorrected per-baseline
pattern is best regarded as suggestive evidence warranting larger-cohort confirmation. Second, regarding variance on GaitNDD, the AUC ranking does not reach statistical significance against the strongest baselines (Random Forest 0.887, Rehman-2019 0.872): with only 55 subjects the per-fold variance (
) is large relative to the absolute AUC differences (∼0.015–0.030). Establishing a stronger empirical separation will require larger multi-center cohorts and a fully nested CV protocol with subtype-stratified analysis.
A third caveat concerns
baseline fairness. Group A baselines use scikit-learn defaults, while CogCBR’s
k,
n, and distance metric are tuned. We deliberately fixed baseline hyperparameters to limit researcher degrees of freedom on the baseline side, but this design choice introduces a tuning asymmetry. This asymmetry is evaluated using the fully tuned-vs.-tuned comparison of
Section 4.3 (
Table 3 and
Table 4), in which each baseline receives the same inner-CV tuning budget that CogCBR receives. Under matched tuning, CogCBR’s nested AUC (0.842) remains the highest on GaitPDB but is statistically indistinguishable from the strongest tuned baseline (GBM, 0.836;
), and on GaitNDD the ranking is not separable within the per-fold noise (tuned Random Forest leads at 0.913). The claim of this paper is therefore not that CogCBR
outperforms all baselines in a strict sense, but rather that it
achieves competitive top-ranked performance while additionally providing confidence-based triage and case-based explanations.
Adding the re-implemented Wahid-2015 [
7] and Rehman-2019 [
8] pipelines as direct baselines (Group C in
Table 1) addresses the gap that these works have frequently been discussed but not directly compared. CogCBR’s mean AUC ranks above both on GaitPDB at uncorrected
while remaining within the noise band on GaitNDD; neither comparison survives Bonferroni correction in isolation. The pattern suggests that any modest gains over published gait pipelines come primarily from the integrated similarity, selection, and triage design rather than from the underlying feature set, which is broadly comparable. Relative to deep learning baselines, CogCBR’s edge on these small cohorts (115 and 55 subjects) reflects the fact that similarity-based reasoning avoids high-dimensional parameter estimation; this should not be read as evidence that deep learning is generally weaker in this domain, since on larger multi-center cohorts the picture is likely to change. The LSTM-raw sensitivity analysis (
Section 3.9) helps disentangle the cause of the deep baselines’ weakness: training the LSTM on the raw force time series rather than the engineered features leaves AUC essentially unchanged (
versus tabular
, Wilcoxon
), indicating that limited sample size, rather than feature representation or sequence-model architecture, is the dominant bottleneck for the deep baselines on these cohorts.
The Retain stage
is designed to maintain case-base quality as new sensor recordings accumulate. The augmentation policy of
Section 3.6, under which only label-verified cases enter the case base, is intended to prevent the self-confirmation bias that simple confidence-based augmentation would introduce. The competence analysis in
Section 4.7 characterizes the static case base under 10-fold CV but does not, by itself, demonstrate that the policy improves performance over time. A deployment-style streaming simulation (
Section 4.8,
Table 9) validates this policy longitudinally: the label-verified policy injects zero label noise versus
erroneous labels for naive self-training, exceeds a static no-growth baseline, and matches an all-true-label no-curation reference in final AUC while bounding case-base growth. This remains a resampling-based simulation on a cross-sectional cohort; prospective longitudinal clinical validation is still required.
Several limitations of the present study should be acknowledged so that readers can calibrate the strength of the reported findings. Sample sizes (55–115 subjects) are typical for clinical gait sensor studies but limit statistical power, particularly on GaitNDD. The RF-weighted distance assumes globally consistent feature importance, which may not hold across NDD subtypes (PD, HD, ALS); subtype-aware local weighting is a natural extension. Beyond these structural limitations, we list five further methodological caveats:
MI selection on the full dataset. MI ranking is computed once on the full dataset, which is common practice for univariate filters [
40] but may introduce optimistic bias relative to a fully nested protocol [
58]. The nested audit of
Section 4.3, which recomputes MI ranking inside each training fold, quantifies this bias directly and finds it small on GaitPDB (
).
Hyperparameter selection on the evaluation folds. The values
and
were selected at the AUC peak of the same 10-fold protocol used for the main results, rather than via an inner-loop nested CV. To quantify the resulting optimism, we conducted a nested cross-validation audit (
Section 4.3), in which median imputation, standardization, MI ranking, RF feature-weight estimation, and the selection of
n and
k are all performed strictly inside the training folds. The audit shows the optimism is small on GaitPDB (
; nested 0.842 versus fixed-parameter 0.861) and larger but within the high-variance regime expected for the 55-subject GaitNDD cohort (nested
). Baseline fairness is further evaluated in the tuned-vs.-tuned comparison (
Table 3 and
Table 4), in which CogCBR remains the numerically top performer on GaitPDB, although the difference from tuned GBM is not significant (
), and is competitive within the noise band on GaitNDD.
Retain validated by simulation, not yet prospectively. The static competence analysis (
Section 4.7) is complemented by a deployment-style streaming simulation (
Section 4.8) in which the case base grows over time under a range of clinical-feedback rates. The simulation confirms that the label-verified augmentation policy avoids the label-noise accumulation of naive self-training (0 versus
injected errors), exceeds a static no-growth baseline, and matches an all-true-label no-curation reference in final AUC while bounding case-base growth. This remains a resampling-based simulation on a cross-sectional cohort; prospective longitudinal clinical validation is still required.
Dataset-specific neighborhood choice on GaitNDD. The GaitNDD analysis uses
rather than
to avoid an excessive neighborhood-to-minority-class ratio under the smaller cohort and reduced feature space (
Section 4.9); a fully nested protocol on a larger external cohort is required to confirm this choice without dataset-specific tuning.
Cohort composition. GaitPDB subjects are diagnosed PD patients (Hoehn & Yahr 2–3) rather than pre-clinical or at-risk subjects, and GaitNDD provides only stride-interval data rather than the full insole pipeline. The present study therefore evaluates gait-based screening support and classification of established NDD-related gait patterns; demonstrating utility for true pre-symptomatic detection would require dedicated pre-clinical cohorts with longitudinal follow-up.
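The nested protocol referred to in the hyperparameter-selection caveat above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the audit of Section 4.3: it covers only median imputation, standardization, and the choice of k for a plain k-nearest-neighbor classifier, it scores by accuracy rather than AUC, it uses unstratified folds, and it omits MI ranking and RF feature weighting. All function names and the synthetic data are hypothetical.

```python
import random
import statistics


def fit_preprocess(rows):
    """Learn per-feature median, mean, and std from training rows only."""
    params = []
    for col in zip(*rows):
        med = statistics.median(v for v in col if v is not None)
        filled = [med if v is None else v for v in col]
        std = statistics.pstdev(filled) or 1.0  # guard against constant columns
        params.append((med, statistics.fmean(filled), std))
    return params


def apply_preprocess(rows, params):
    """Impute and standardize rows with parameters learned elsewhere."""
    return [[((med if v is None else v) - mean) / std
             for v, (med, mean, std) in zip(row, params)] for row in rows]


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training cases (squared Euclidean)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = sum(train_y[i] for i in order[:k])
    return 1 if 2 * votes > k else 0


def fold_accuracy(X, y, train_idx, test_idx, k):
    """Fit preprocessing on train_idx only, then score on test_idx."""
    params = fit_preprocess([X[i] for i in train_idx])
    tr = apply_preprocess([X[i] for i in train_idx], params)
    te = apply_preprocess([X[i] for i in test_idx], params)
    tr_y = [y[i] for i in train_idx]
    hits = sum(knn_predict(tr, tr_y, x, k) == y[i] for x, i in zip(te, test_idx))
    return hits / len(test_idx)


def make_folds(indices, n_folds, rng):
    shuffled = indices[:]
    rng.shuffle(shuffled)
    return [shuffled[f::n_folds] for f in range(n_folds)]


def select_k(X, y, train_idx, k_grid, n_inner, rng):
    """Inner loop: choose k using the outer-training cases only."""
    inner_folds = make_folds(train_idx, n_inner, rng)

    def inner_score(k):
        scores = []
        for val_idx in inner_folds:
            held_out = set(val_idx)
            fit_idx = [i for i in train_idx if i not in held_out]
            scores.append(fold_accuracy(X, y, fit_idx, val_idx, k))
        return statistics.fmean(scores)

    return max(k_grid, key=inner_score)


def nested_cv(X, y, k_grid, n_outer=5, n_inner=3, seed=0):
    """Outer loop: each test fold is never seen during tuning or preprocessing."""
    rng = random.Random(seed)
    scores, chosen = [], []
    for test_idx in make_folds(list(range(len(X))), n_outer, rng):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        best_k = select_k(X, y, train_idx, k_grid, n_inner, rng)
        chosen.append(best_k)
        scores.append(fold_accuracy(X, y, train_idx, test_idx, best_k))
    return statistics.fmean(scores), chosen


# Synthetic two-class data with one missing value (illustrative only).
data_rng = random.Random(1)
y = [i % 2 for i in range(60)]
X = [[data_rng.gauss(1.5 * label if j < 2 else 0.0, 1.0) for j in range(4)]
     for label in y]
X[3][0] = None
k_grid = [1, 3, 5, 7]
nested_score, chosen_k = nested_cv(X, y, k_grid)
print(nested_score, chosen_k)
```

The gap between a score obtained this way and one obtained with k fixed in advance on the same folds is the kind of optimism the nested audit is designed to measure. Any additional data-dependent step, such as feature ranking or feature weighting, would have to be fitted inside `fold_accuracy` in the same manner to keep the held-out fold untouched.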
The present study is therefore best viewed as a methodological validation based on available cross-sectional datasets, rather than a prospective clinical deployment study. Although the cross-validation, independent-cohort evaluation, and streaming simulation support the robustness of CogCBR, the Retain stage and triage threshold still require prospective, clinician-in-the-loop validation before clinical use.
The augmentation policy assumes that specialist labels are available for low-confidence cases; in deployment, this requires integration with the clinical workflow and, at the regulatory level, prospective validation as a clinical decision support tool. Future work should explore multi-center sensor validation, subtype-aware local weighting, prospective clinical evaluation of the triage workflow, a full multi-cohort calibration study extending the single-cohort calibration analysis of
Section 4.5, a clinician-in-the-loop usability evaluation of the explanation-and-triage workflow described in
Section 5, and prospective longitudinal clinical evaluation of the Retain stage extending the deployment-style streaming simulation of
Section 4.8.