5.1. Main Results on 56 VICON Trials
Table 1 summarizes the main comparison under the fixed mean-selected protocol.
Table 1 shows a split performance pattern across the compared update rules. FIBA-like adaptive covariance achieves the lowest mean and median error, reaching 0.168 m and 0.088 m, respectively. Posterior-contact soft-ZUPT instead achieves the lowest upper-tail metrics, with p90 = 0.400 m, p95 = 0.588 m, and CVaR@90 = 0.593 m. Robust soft-ZUPT remains highly competitive, especially on average-case metrics, while the contact-only variant is clearly weaker in the upper tail.
This pattern is consistent with the update mechanisms of the compared methods. FIBA-like adaptive covariance continuously modulates the pseudo-observation strength from the detector statistic, which appears to benefit the more regular trials that dominate mean and median performance. Posterior-contact soft-ZUPT, by contrast, combines the detector-derived prior with an innovation-conditioned posterior correction, so that suspicious candidate updates can be weakened more aggressively when the current residual is inconsistent with stable contact. The resulting behavior is not advantageous on every nominal trial, but it is more effective at suppressing rare overconfident updates, which is reflected in the stronger p90, p95, and CVaR@90 values. The contact-only ablation suggests that the detector-derived prior alone is not sufficient: without the innovation-conditioned posterior correction, tail behavior deteriorates markedly.
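As a reading aid for the summary statistics above, the following minimal sketch computes mean, median, p90, p95, CVaR@90, and maximum from a list of per-trial errors. The percentile convention (linear interpolation between order statistics) and the CVaR definition (mean of all errors at or above the 90th percentile) are assumptions made for illustration; the function names are hypothetical, and the exact conventions used for Table 1 may differ.

```python
import math
import statistics


def percentile(values, q):
    """Percentile with linear interpolation (q in [0, 100]).

    Assumption: this convention is chosen for illustration only.
    """
    xs = sorted(values)
    if not xs:
        raise ValueError("percentile of an empty sequence")
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return xs[lo] * (1.0 - frac) + xs[hi] * frac


def cvar(values, q=90):
    """Mean of the errors at or above the q-th percentile (assumed definition)."""
    threshold = percentile(values, q)
    tail = [v for v in values if v >= threshold]
    return sum(tail) / len(tail)


def summarize(errors):
    """Return the average-case and upper-tail summaries for one method."""
    return {
        "mean": statistics.fmean(errors),
        "median": statistics.median(errors),
        "p90": percentile(errors, 90),
        "p95": percentile(errors, 95),
        "cvar90": cvar(errors, 90),
        "max": max(errors),
    }
```

Applied to the 56 per-trial 2D ARMSE values of each method, summarize would yield one row of summaries per method. Because CVaR@90 averages only the worst trials, it is driven by a handful of trials in a 56-trial benchmark, which is why it separates the methods more sharply than the mean.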
5.2. Trial-Level Pairwise Comparison
Figure 2a,b show Bland–Altman plots comparing posterior-contact soft-ZUPT with its two strongest comparators at the trial level. For each trial, the x-axis gives the average 2D ARMSE of the two compared methods, and the y-axis gives the paired difference, defined as posterior contact minus the comparator. Therefore, negative values indicate trials on which posterior contact has lower error, whereas positive values indicate trials on which the comparator has lower error. The horizontal solid line marks the mean difference, and the dashed lines mark the 95% limits of agreement. Against FIBA-like adaptive covariance, posterior contact yields lower error on 19 of the 56 trials, while FIBA-like is lower on 37. Against robust soft-ZUPT, posterior contact is lower on 14 trials, while robust is lower on 41, with one near-tie.
As illustrated in Figure 2, most low-error trials cluster near zero difference, indicating small pairwise differences on easier trials. In contrast, the largest negative differences appear at larger average-error values, meaning that posterior contact produces its largest improvements on the more difficult trials. For example, posterior contact improves over FIBA-like by 0.420 m on trial 2018-02-22-10-10-29, whereas its largest loss to FIBA-like is 0.152 m on trial 2017-11-27-11-22-22. In both pairwise comparisons, the points outside the lower 95% limit of agreement occur in the negative direction, while no points exceed the upper limit. Overall, this experiment shows that posterior-contact soft-ZUPT mainly improves performance by mitigating a small number of difficult high-error cases.
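The quantities plotted in a Bland–Altman comparison can be reproduced with a short sketch. It assumes the conventional limits of agreement (mean difference plus or minus 1.96 sample standard deviations) and introduces a hypothetical tie_tol tolerance for counting near-ties; the tolerance actually used for the near-tie reported above is not stated in the text.

```python
import statistics


def bland_altman(errors_a, errors_b, tie_tol=1e-3):
    """Paired comparison of two methods over the same trials.

    Differences are defined as A minus B, so negative values mean
    that method A has the lower error on that trial.
    tie_tol is a hypothetical tolerance (in metres) for near-ties.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both methods must cover the same trials")
    averages = [(a + b) / 2.0 for a, b in zip(errors_a, errors_b)]
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)
    a_lower = sum(d < -tie_tol for d in diffs)
    b_lower = sum(d > tie_tol for d in diffs)
    return {
        "averages": averages,          # x-axis values
        "diffs": diffs,                # y-axis values
        "mean_diff": mean_diff,        # solid horizontal line
        "loa_lower": mean_diff - 1.96 * sd_diff,  # dashed lines
        "loa_upper": mean_diff + 1.96 * sd_diff,
        "a_lower": a_lower,
        "b_lower": b_lower,
        "near_ties": len(diffs) - a_lower - b_lower,
    }
```

With posterior contact as method A, the a_lower, b_lower, and near_ties counts correspond to the per-trial win counts quoted above, and points below loa_lower are the trials on which posterior contact improves most.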
5.3. Tail-Risk Analysis
Inspection of Table 1 shows that posterior-contact soft-ZUPT is strongest on the upper-tail summaries, achieving the lowest p95 and CVaR@90 among the compared methods. This pattern suggests that its main advantage may lie in suppressing rare but damaging failures rather than in uniformly improving average-case performance. To examine this point more directly, Figure 3 isolates the upper-tail metrics from the main table and augments them with the maximum trial error.
Figure 3 confirms that posterior contact achieves the lowest values on all three upper-tail summaries, reaching p95 = 0.588 m, CVaR@90 = 0.593 m, and max = 0.791 m. In comparison, FIBA-like reaches p95 = 0.635 m, CVaR@90 = 0.663 m, and max = 0.871 m, while robust soft-ZUPT reaches p95 = 0.843 m, CVaR@90 = 0.942 m, and max = 1.782 m. The contact-only ablation is notably weaker in the upper tail, with CVaR@90 = 1.262 m and max = 3.232 m.
This upper-tail advantage is consistent with the posterior-contact update mechanism. Because the method combines detector-derived contact confidence with an innovation-conditioned posterior correction, it can weaken suspicious candidate updates more aggressively when the residual is incompatible with stable contact. The resulting behavior is most visible not on the typical trials that dominate mean and median performance, but on the small number of difficult trials that determine p95, CVaR@90, and maximum error. These results therefore indicate that the principal benefit of posterior contact lies in difficult-trial mitigation and failure suppression.
5.4. Interpretability and Failure Analysis
Figures 4–11 provide time-aligned diagnostic views of two contrasting trials: one favorable to posterior contact (Figures 4–7) and one favorable to FIBA-like adaptive covariance (Figures 8–11). The detector-statistic panels show when the inertial signal is compatible with a zero-velocity/contact hypothesis; the shaded orange intervals mark the posterior-contact update intervals selected by the algorithm. These shaded intervals are not manual foot-strike or stance-phase annotations, because the public VICON benchmark provides position ground truth but not trial-level gait-event labels. The probability panels show how detector evidence is converted from raw contact confidence to a smoothed prior and then to an innovation-conditioned posterior. The covariance-scale panels show how strongly each method trusts the zero-velocity update: values near one indicate a strong update, whereas larger values weaken the update. The posterior-LLR panels show whether the innovation supports or rejects stable contact, with negative values indicating evidence against applying a confident contact update. Ideally, confident stance-like intervals would retain high contact probability and low covariance scale, while suspicious transition intervals such as heel-strike, toe-off, or unstable contact would reduce the posterior probability and increase the covariance scale.
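To make the relationship between the probability, LLR, and covariance-scale panels concrete, the sketch below shows one plausible way such quantities could be linked. It is an illustrative assumption, not the formulation used in this paper: the posterior is taken to be a log-odds update of the prior by the innovation LLR, and the covariance scale is taken to be the inverse posterior, switched to inactive_scale when the posterior falls below min_prob. The parameter names inactive_scale and min_prob come from the sensitivity analysis; the function names, the functional forms, and the default min_prob = 0.25 are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability, clamped away from 0 and 1."""
    p = min(max(p, 1e-9), 1.0 - 1e-9)
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def posterior_contact(prior, llr):
    """Assumed form: Bayesian log-odds update of the contact prior.

    A negative llr (innovation evidence against stable contact)
    pulls the posterior below the prior.
    """
    return sigmoid(logit(prior) + llr)


def r_scale(posterior, min_prob=0.25, inactive_scale=100.0):
    """Assumed mapping from posterior probability to covariance scale.

    Values near 1 mean a strong zero-velocity update; larger values
    weaken it. Below min_prob the update is treated as inactive.
    """
    if posterior < min_prob:
        return inactive_scale
    return min(1.0 / posterior, inactive_scale)
```

Under these assumptions, a confident prior of 0.8 combined with an LLR of -4 gives a posterior below min_prob, so the scale jumps to inactive_scale. This reproduces, qualitatively, the pattern the panels are meant to reveal: a sharp drop from prior to posterior accompanied by a sharp rise in covariance scale.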
Figures 4–7 show that posterior contact helps when the detector alone is uncertain and the innovation indicates that a confident zero-velocity update would be unsafe. In Figure 4, the SHOE statistic repeatedly moves through ambiguous intervals near the posterior threshold, indicating that the zero-velocity hypothesis is not cleanly separated from non-contact. In Figure 5, these ambiguous intervals are followed by sharp drops from the contact prior to the posterior-contact probability. The posterior LLR in Figure 7 is strongly negative over the same intervals, showing that the innovation is actively rejecting stable-contact updates rather than merely inheriting detector skepticism. This posterior correction is then converted into covariance modulation in Figure 6: the posterior effective r-scale rises sharply and remains elevated over the suspicious segments, while the FIBA scale follows a smoother continuous response. Together, these patterns indicate that posterior contact recognizes risky candidate contact intervals and deliberately weakens those updates. This difference carries through to the end-to-end trial error. In this case, suppressing those locally dangerous updates reduces the 2D trajectory error from 0.871 m for FIBA-like adaptive covariance to 0.451 m for posterior-contact soft-ZUPT.
Figures 8–11 show a FIBA-favorable regime in which the detector statistic evolves in a more regular and repetitive pattern, without the same sequence of strongly suspicious local intervals. The posterior probability in Figure 9 still reacts to the detector and innovation, but its thresholded behavior is comparatively coarse. In Figure 10, FIBA-like adaptation tracks the oscillatory instability pattern more smoothly, whereas posterior contact alternates more abruptly between strong and weak correction. As a result, the smoother continuous modulation of FIBA-like yields the lower 2D trajectory error, with 0.425 m for FIBA-like and 0.577 m for posterior contact. Posterior contact remains competitive with the hard and robust baselines.
Taken together, these two cases show that the difference between posterior contact and FIBA-like is not simply one of average strength, but of update mechanism. Posterior contact is most advantageous when detector ambiguity is followed by innovation evidence that strongly contradicts stable contact, because the posterior correction can aggressively suppress misleading updates. FIBA-like is strongest when the motion pattern is sufficiently regular so that a continuous instability-to-covariance mapping already captures the local reliability structure without the need for thresholded posterior intervention. Additional favorable and unfavorable cases in the appendix follow the same qualitative pattern.
5.5. Sensitivity and Regime Analysis
We next examine how the selected posterior-contact and FIBA-like configurations behave under one-factor-at-a-time parameter perturbations.
Figure 12 summarizes the resulting mean, p95, and CVaR@90 curves, with the posterior-contact sweeps shown in Figure 12a,b, and the FIBA-like sweeps shown in Figure 12c,d. In this paper, a sensitivity sweep means that one parameter is varied across several candidate values while the other selected parameters are kept fixed. Thus, each panel shows how the error changes when a single design choice is made more or less conservative. The x-axis gives the parameter being varied, and the y-axis reports the resulting error metric; lower curves indicate better performance, while flatter curves indicate lower sensitivity to that parameter. Based on this reading, posterior contact shows a comparatively broad operating basin around inactive_scale = 100 and min_prob in the range 0.2–0.3. Shrinking inactive_scale to 10 raises the mean from 0.176 m to 0.231 m and the p95 from 0.588 m to 0.650 m, while increasing it to 300 keeps p95 nearly unchanged at 0.587 m but worsens CVaR@90 from 0.593 m to 0.660 m and increases the maximum error from 0.791 m to 0.962 m. Likewise, setting min_prob too aggressively at 0.5 raises CVaR@90 to 0.930 m and the maximum error to 1.804 m.
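The one-factor-at-a-time procedure described above can be sketched as follows. The evaluate callable is a hypothetical stand-in for running the filter over all trials with a given configuration and returning an error metric; it is not part of this paper's code, and the grid in the usage note only reuses parameter values quoted in the text.

```python
def one_factor_sweep(evaluate, nominal, grid):
    """Vary one parameter at a time around a nominal configuration.

    evaluate: hypothetical callable mapping a config dict to a metric.
    nominal:  dict of selected parameter values.
    grid:     dict mapping a parameter name to the candidate values
              to try while all other parameters stay at nominal.
    Returns a dict keyed by (parameter name, value).
    """
    results = {}
    for name, candidates in grid.items():
        if name not in nominal:
            raise KeyError(f"unknown parameter: {name}")
        for value in candidates:
            config = dict(nominal)   # copy, so nominal is never mutated
            config[name] = value     # perturb exactly one factor
            results[(name, value)] = evaluate(config)
    return results
```

A sweep over the posterior-contact parameters would then look like one_factor_sweep(evaluate, {"inactive_scale": 100, "min_prob": 0.25}, {"inactive_scale": [10, 100, 300], "min_prob": [0.2, 0.3, 0.5]}), where the nominal min_prob of 0.25 is an arbitrary value inside the reported 0.2–0.3 range. Because only one factor moves at a time, interactions between parameters are not captured by this design.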
The FIBA-like sweeps show a different pattern. The fold-selected configuration remains the strongest average-case choice, with the lowest mean at gamma = 1.0 and the nominal sigma_ref values (combined mean 0.168 m). However, more conservative settings improve the upper-tail metrics: gamma = 0.75 reduces p95 from 0.635 m to 0.473 m and CVaR@90 from 0.663 m to 0.558 m, while halving sigma_ref reduces p95 to 0.505 m and CVaR@90 to 0.581 m. This indicates that FIBA-like has more headroom for tail-oriented retuning, whereas the selected posterior-contact configuration already lands at a comparatively reliability-oriented point without changing the selection objective.
We also examined trial subsets to localize where the observed gains arise. Here, a trial subset means a smaller group of trials selected according to a shared property, so that performance can be inspected in a more targeted regime rather than only over all 56 trials. The most informative subset is the hard-tail regime, defined as the top quartile of hard-ZUPT errors with [...] m ([...]). This subset represents the trials where the classical hard-ZUPT baseline has the largest errors, and therefore highlights the difficult cases where failure suppression matters most.
Figure 13a–c report the subset-level mean, p95, and maximum error for the main methods, respectively. In these panels, lower bars indicate better performance, and the three metrics respectively summarize average error, high-percentile error, and worst-case error within the subset.
On the hard-tail subset, posterior contact is the strongest method on all three shown metrics, reaching mean = 0.387 m, p95 = 0.736 m, and max = 0.791 m, slightly ahead of FIBA-like (0.403 m, 0.802 m, 0.871 m) and well ahead of robust soft-ZUPT (0.504 m, 1.395 m, 1.782 m). In practical terms, this means that posterior contact reduces not only the average error in the difficult subset, but also the high-percentile and worst-case errors. This result further supports the view that posterior contact is most valuable on the difficult trials where failure suppression matters most.
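The hard-tail subset construction can be illustrated with a short sketch. It assumes that "top quartile" means the quarter of trials with the largest hard-ZUPT error, selected by rank; the exact threshold rule used in the paper may differ, and the function names and data layout (dicts keyed by trial ID) are hypothetical.

```python
import statistics


def hard_tail_subset(hard_zupt_errors):
    """Return the trial IDs in the top quartile of hard-ZUPT error.

    hard_zupt_errors: dict mapping trial ID to the hard-ZUPT error.
    Assumption: rank-based selection of the worst quarter of trials.
    """
    k = max(1, len(hard_zupt_errors) // 4)
    ranked = sorted(hard_zupt_errors, key=hard_zupt_errors.get, reverse=True)
    return ranked[:k]


def subset_summary(method_errors, trial_ids):
    """Mean and maximum error of one method restricted to a subset."""
    values = [method_errors[t] for t in trial_ids]
    return {"mean": statistics.fmean(values), "max": max(values)}
```

Note that the subset is defined by the baseline's errors, not by the errors of the methods being compared, so the same trials are used for every method and the selection does not favor any of them by construction.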
The remaining subsets show how the other methods behave when each update rule is most naturally matched to the trial. On trials where posterior contact attains the lowest error, FIBA-like remains comparatively close, whereas robust degrades more substantially, especially on p95 and maximum error. On trials where robust is best, posterior contact remains the closest competitor and is clearly stronger than FIBA-like on the upper-tail metrics. On trials where FIBA-like is best, posterior contact again remains competitive in mean and maximum error, but FIBA-like opens a clearer advantage on p95. Taken together, these subsets suggest that posterior contact is neither broadly dominant nor brittle: it is strongest on the difficult trials emphasized by the hard-tail subset, and otherwise tends to remain near the leading method.