1. Introduction
Real-time monitoring of human joint torque provides critical quantitative feedback for intelligent rehabilitation assessment, biomechanical analysis, and exoskeleton control systems [1,2]. However, these systems are typically deployed on embedded platforms with limited computational capacity, stringent power constraints, and demanding real-time requirements. Consequently, high-fidelity deep learning models face substantial deployment bottlenecks [3]. For example, mainstream time–frequency approaches employ the continuous wavelet transform (CWT) to construct two-dimensional representations that improve feature extraction accuracy. The associated preprocessing is computationally intensive, however, and raises single-step inference latency to over 150 ms [4], which limits real-time interaction on wearable devices. Under stringent hardware constraints, edge deployment often removes such preprocessing and reverts the input modality from a two-dimensional time–frequency spectrum to a one-dimensional time-series signal.
This transformation is not merely a linear dimensionality reduction but represents structural degradation of the observation space [5,6]. Under such degradation, local energy distributions and spectral evolution patterns that were explicitly encoded in the time–frequency plane become implicitly coupled due to reduced observability [7,8]. This results in a pronounced observability gap between teacher and student models. The reduced shared representational basis across heterogeneous architectures undermines the feature-manifold isomorphism assumptions underlying conventional knowledge transfer methods [9,10]. Consequently, in the absence of well-aligned intermediate representations, establishing effective mapping relationships between architectures with unequal observability becomes a central challenge for cross-model knowledge distillation.
Owing to the observability gap and the reduced shared representational basis, existing knowledge distillation methods exhibit limited applicability in heterogeneous settings. Mainstream feature-alignment approaches assume the existence of a mappable isomorphic manifold between teacher and student feature spaces [11,12]. Under structural degradation of the input modality, however, this assumption is often violated, and enforced feature projections may produce unstable or non-physical mappings. Response-level distillation is further constrained by the low-rank nature of regression outputs, which limits its ability to convey the high-dimensional inference structure of the teacher model. Such approaches rely primarily on end-to-end numerical fitting and lack structural guidance, making it challenging to correct deviations from physical consistency caused by incomplete observations [13,14]. Traditional physics-based constraints are typically implemented as posterior regularization terms that primarily restrict the solution space boundaries [15]. When key time–frequency information is systematically absent at the input stage, such boundary constraints cannot substitute for the missing observational structure.
Therefore, under heterogeneous settings with incomplete observability, two challenges persist: constructing a representation-independent knowledge transfer pathway and incorporating deterministic exogenous mechanisms to supplement purely statistical inference. In this context, the central difficulty extends beyond specific algorithmic designs to identifying the appropriate level of information that can serve as a stable knowledge carrier. When the input modality degrades from two-dimensional time–frequency representations to one-dimensional time-series signals, the student model loses explicit spectral information and the shared basis required for reliable feature alignment. Consequently, representation-based alignment strategies become insufficient.
Instead of enforcing unstable feature-level mappings, the proposed approach leverages (1) parameter-level structural geometry to inherit the inference topology of the teacher model and (2) physics-based dynamical priors to compensate for dynamical information lost due to dimensionality reduction. By combining structural inheritance with physics-guided compensation, a Physics-Guided Dual-Consistency Knowledge Distillation (PDC-KD) framework is proposed to enable reliable cross-model knowledge transfer under constrained edge-computing resources and simplified input modalities.
The main contributions are summarized as follows:
A transfer strategy based on parameter-manifold alignment is proposed to replace conventional feature-level alignment. To address intermediate feature mismatch across heterogeneous architectures, a shared anchor space is constructed to enable student models to inherit the teacher’s inference topology at the parameter level, thereby reducing reliance on feature-space isomorphism.
A physics-guided exogenous information compensation mechanism is established. Unlike conventional boundary regularization strategies, implicit biomechanical priors are incorporated as an independent exogenous information source. Robust physical operators are used to compensate for the loss of dynamical consistency resulting from input dimensionality reduction.
The effectiveness of the proposed framework for lightweight edge deployment is validated. Experiments on a standard IMU-based dynamic regression task demonstrate that the framework reduces model parameters by approximately 98% while maintaining predictive performance comparable to a high-fidelity teacher model, confirming its practical applicability in resource-constrained engineering scenarios.
The remainder of this paper is organized as follows.
Section 2 reviews related work on wearable dynamics estimation and heterogeneous knowledge distillation, highlighting current limitations.
Section 3 details the PDC-KD framework, including the shared anchor-space parameter-manifold alignment strategy and the physics-guided dynamical compensation mechanism.
Section 4 presents the experimental setup and evaluates prediction accuracy, computational efficiency, and physical consistency on a standard lower-limb movement dataset, followed by ablation and baseline comparisons.
Section 5 discusses the effective boundary of structured knowledge transfer and analyzes the influence of physical priors, clarifying the limitations of the proposed method.
Section 6 concludes the paper and outlines directions for future research.
3. Methodology
Under edge-computing constraints, limited computational resources necessitate a simplification of the input modality from two-dimensional time–frequency spectra to one-dimensional time-series signals. This modality simplification introduces substantial architectural heterogeneity between the teacher and student models. To address this challenge, a Physics-Guided Dual-Consistency Knowledge Distillation (PDC-KD) framework is proposed. The framework establishes a hierarchical optimization strategy that integrates parameter-manifold alignment with physics-guided information compensation (Figure 2). This design enables knowledge transfer and dynamical information reconstruction in lightweight deployment scenarios characterized by incomplete observability.
3.1. Problem Formulation and the PDC-KD Framework
The effectiveness of the PDC-KD framework is grounded in three theoretical premises that define its applicability:
- 1. Task-consistency assumption: The teacher and student models share an identical optimization objective and physical constraints within the same dynamic regression task.
- 2. Parameter-proxy assumption: When heterogeneous architectures hinder feature-space alignment, the structural configuration of weights in projection and adaptation layers is assumed to encode inference logic rather than merely support representation transfer.
- 3. Computational-mediation assumption: The shared anchor space serves as a mathematical reference for quantifying geometric relationships among heterogeneous parameter matrices, independent of intermediate feature representations or latent semantic spaces.
The proposed framework applies to regression tasks characterized by substantial structural differences in input modalities and well-defined dynamical priors. Its primary objective is to mitigate performance degradation in resource-constrained edge-computing environments, rather than to serve as a general-purpose alternative to isomorphic compression or conventional distillation.
The PDC-KD framework consists of two complementary pathways. The first pathway addresses the challenge of feature-level alignment under heterogeneous modality degradation by introducing a parameter-level structural alignment mechanism. Instead of transferring the entire encoder weights or network topology, it anchors the teacher’s projection layer and the student’s dimensional adaptation layer, establishing a geometric correspondence within a shared anchor space. Fisher information weighting is combined with a low-rank subspace projection strategy to minimize structural discrepancies between their parameter manifolds. This mechanism enables the student model to approximate the teacher’s inference-boundary structure during output mapping, thereby reducing reliance on intermediate feature-space isomorphism.
The second pathway introduces physics-guided information compensation. Because input dimensionality reduction results in the loss of high-frequency information, physical priors are incorporated as additional sources of information. The framework constructs implicit inertial parameters and robust differential operators and imposes regularization constraints derived from the Newton–Euler equations. Rather than enforcing exact physical correctness of the outputs, this mechanism introduces data-independent inductive biases that suppress non-physical prediction oscillations induced by modality degradation and enhance robustness under practical engineering constraints.
3.2. Heterogeneous Architectures for Edge Deployment
A heterogeneous teacher–student distillation framework is adopted. The teacher model employs CWT to generate time–frequency representations and integrates convolutional networks with attention mechanisms to extract multi-scale dynamical features [4]. This architecture captures both local time–frequency structures and global correlations, thereby ensuring accurate torque estimation. However, its computational complexity and parameter scale limit deployment on resource-constrained edge devices.
The student model directly processes one-dimensional IMU time-series signals using a lightweight recurrent neural network to capture temporal dependencies, significantly reducing computational overhead. This simplification results in the systematic loss of frequency-domain structural information, thereby reducing observability and limiting the model’s representational capacity.
The structural disparity between the two networks renders their parameter-space geometries directly incomparable. To establish a computable structural correspondence, a linear dimensional adapter is appended to the student network, ensuring output dimensional consistency with the teacher’s projection layer. This design guarantees parameter dimensional compatibility, enables geometric comparability between heterogeneous weight matrices, and provides the basis for subsequent alignment within the shared anchor-space parameter manifolds. The core objective is to develop an effective and physically consistent knowledge transfer mechanism under heterogeneous conditions, rather than relying solely on conventional network-architecture optimization.
3.3. Path I: Parameter-Manifold Alignment via Shared Anchor Space
When the input modality degrades from a two-dimensional time–frequency spectrum to a one-dimensional time-series signal, the feature-generation mechanisms of the teacher and student models become substantially misaligned, rendering conventional feature-alignment distillation strategies ineffective. Accordingly, under the parameter-proxy assumption defined in the theoretical premises, a shared anchor space is constructed, and a parameter-level knowledge transfer pathway is introduced to replace conventional feature-alignment schemes.
3.3.1. Construction of a Shared Parameter Anchor Space
Let $W_T \in \mathbb{R}^{d_a \times d_T}$ and $W_S \in \mathbb{R}^{d_a \times d_S}$ denote the projection weight matrices of the teacher and student models within the shared anchor space, respectively, where $d_a$ is the anchor-space dimension. Owing to the inherent difference in their original parameter-space dimensions ($d_T \neq d_S$), direct numerical approximation between them is not feasible. To address this dimensional inconsistency, the Gram matrix is introduced as a computational tool to characterize the second-order correlation structure of parameters in the anchor space [35], as formulated in Equation (1):

$$G_T = W_T W_T^{\top}, \qquad G_S = W_S W_S^{\top}, \tag{1}$$

where $G_T, G_S \in \mathbb{R}^{d_a \times d_a}$. The Gram matrix transformation projects parameters of different original dimensions into a unified metric space without relying on input-level semantic representations. In this metric space, parameter alignment is formulated as a structural-consistency constraint on the corresponding Gram matrices.
This structural-consistency strategy alleviates the limitation imposed by asymmetric input observability, establishes a structural similarity metric across heterogeneous architectures, and formulates the corresponding alignment loss as shown in Equation (2):

$$\mathcal{L}_{\mathrm{align}} = \lVert G_S - G_T \rVert_F^2, \tag{2}$$

where $\lVert \cdot \rVert_F$ denotes the Frobenius norm.
In Equation (2), the relationship between heterogeneous parameters is reformulated as a geometric correlation structure. By minimizing structural discrepancies between the Gram matrices, the student network is encouraged to approximate the teacher’s topological distribution on the parameter manifold. The parameter-manifold alignment mechanism alleviates the computational challenge posed by dimensional incompatibility between heterogeneous parameters and provides a mathematical basis for subsequent refinement via structural guidance (see Figure 3).
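The Gram-matrix alignment described above can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: the matrix sizes, the `W @ W.T` orientation (rows indexing the shared anchor dimension), the unnormalized squared Frobenius distance, and all function names are assumptions made for the example.

```python
import numpy as np

def gram(weight: np.ndarray) -> np.ndarray:
    """Second-order correlation over the anchor (row) dimension: W @ W.T."""
    return weight @ weight.T

def gram_alignment_loss(w_teacher: np.ndarray, w_student: np.ndarray) -> float:
    """Squared Frobenius distance between the two anchor-space Gram matrices."""
    diff = gram(w_student) - gram(w_teacher)
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(0)
d_anchor, d_teacher, d_student = 8, 64, 16   # hypothetical layer sizes
w_teacher = rng.normal(size=(d_anchor, d_teacher))
w_student = rng.normal(size=(d_anchor, d_student))

# Both Gram matrices are (d_anchor, d_anchor), so they can be compared directly
# even though the weight matrices themselves have different shapes.
loss = gram_alignment_loss(w_teacher, w_student)

# Re-basing the teacher's input side with an orthogonal matrix leaves its
# Gram matrix, and therefore the loss, unchanged.
q, _ = np.linalg.qr(rng.normal(size=(d_teacher, d_teacher)))
loss_rotated = gram_alignment_loss(w_teacher @ q, w_student)
```

The invariance to an orthogonal change of basis on the input side is one way to read the claim that the comparison does not depend on input-level representations: only the correlation structure across anchor dimensions enters the loss.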
3.3.2. Task-Sensitive Alignment via Fisher Geometry
Although the Gram matrix captures second-order correlations among parameters, it does not distinguish the relative contributions of individual feature dimensions to the final prediction. To compensate for the student model’s limited parameter capacity for handling high-dimensional features, Fisher information from the teacher network is introduced as a local sensitivity-weighting mechanism. Fisher information quantifies the second-order sensitivity of the loss function to parameter perturbations within the anchor space, and its approximate formulation is given in Equation (3). The formulation is expressed in terms of the teacher network’s output variable along each dimension k of the anchor space.
Fisher information characterizes the statistical sensitivity of model parameters to the task loss, independent of input-level feature saliency. The Fisher-based weighting mechanism eliminates the need for both input-sample comparability across heterogeneous models and semantic alignment of intermediate-layer features. While the anchor space provides a coordinate representation of parameters, Fisher information assigns a task-relevant local metric to this space, enabling differentiation of the relative importance of parameter directions.
Based on this sensitivity metric, the Fisher-weighted discrepancy between Gram matrices in the anchor space is computed to formulate a local sensitivity alignment loss, as defined in Equation (4).
The local sensitivity alignment loss elevates the distillation objective from purely geometric matching to task-aware structural alignment. The Fisher-weighting mechanism introduces a task-relevant local metric into the parameter manifold, approximating the directional distribution of prediction sensitivity within the parameter space. This mechanism guides size-constrained student models to prioritize alignment along structurally informative directions with higher information density.
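One possible reading of the Fisher-weighted alignment is sketched below. It assumes a diagonal empirical Fisher estimate (the mean squared gradient of the teacher loss with respect to each anchor-space output) and a pairwise weight of the form sqrt(w_i * w_j) on each Gram entry. Both choices, and all names, are illustrative assumptions; the exact weighting used in the paper may differ.

```python
import numpy as np

def diagonal_fisher(grad_anchor: np.ndarray) -> np.ndarray:
    """Empirical diagonal Fisher: mean squared gradient per anchor dimension.

    grad_anchor has shape (n_samples, d_anchor) and holds gradients of the
    teacher loss with respect to the teacher's anchor-space outputs.
    """
    return np.mean(grad_anchor ** 2, axis=0)

def fisher_weighted_gram_loss(g_student, g_teacher, fisher) -> float:
    """Gram discrepancy with entry (i, j) weighted by sqrt(w_i * w_j)."""
    w = fisher / (np.sum(fisher) + 1e-12)          # normalize to sum to one
    pair_weight = np.sqrt(np.outer(w, w))          # assumed pairwise weighting
    return float(np.sum(pair_weight * (g_student - g_teacher) ** 2))

# Toy gradients: anchor dimension 0 is the most sensitive, dimension 3 the least.
grads = np.array([[2.0, 1.0, 1.0, 0.5],
                  [-2.0, -1.0, -1.0, -0.5]])
fisher = diagonal_fisher(grads)                    # [4.0, 1.0, 1.0, 0.25]

g_teacher = np.zeros((4, 4))
g_sensitive = g_teacher.copy()
g_sensitive[0, 0] = 1.0                            # error on a sensitive entry
g_insensitive = g_teacher.copy()
g_insensitive[3, 3] = 1.0                          # same error, insensitive entry

loss_sensitive = fisher_weighted_gram_loss(g_sensitive, g_teacher, fisher)
loss_insensitive = fisher_weighted_gram_loss(g_insensitive, g_teacher, fisher)
```

With identical unit errors, the discrepancy on the high-sensitivity entry is penalized more heavily, which is the behavior the text attributes to task-aware structural alignment.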
3.3.3. Principal Subspace Regularization
Fisher information constraints capture only local parameter sensitivity and do not reflect the global correlation structure within the anchor space. To complement this local perspective, a low-rank subspace alignment strategy is introduced to establish a structural regularization mechanism [36,37]. Specifically, singular value decomposition (SVD) is applied to the teacher’s Gram matrix $G_T$ to extract principal component subspaces and construct the corresponding projection operator $P_k$, as defined in Equation (5):

$$P_k = U_k U_k^{\top}. \tag{5}$$

Here, $U_k$ contains the first k principal basis vectors derived from the teacher’s Gram matrix, forming the principal subspace of the anchor space.
Under this projection operator, the student Gram matrix $G_S$ is projected onto the teacher’s principal subspace. The subspace alignment loss is defined as the structural distance after projection, as formulated in Equation (6), and its geometric interpretation is illustrated in Figure 4.
This strategy effectively constrains the optimization trajectory of the student model parameters, prioritizing convergence along the principal directions defined in the teacher’s anchor space. By restricting the degrees of freedom within the low-rank subspace, this regularization term suppresses minor noise components during parameter optimization, promotes approximation of the teacher’s global correlation structure, and enhances the training stability of the lightweight student model.
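The principal-subspace regularization can be illustrated with a short NumPy sketch. The projector is built from the leading left singular vectors of the teacher's Gram matrix. The two-sided projection of the Gram difference used as the loss, the matrix sizes, and the function names are assumptions for illustration rather than the paper's exact definition.

```python
import numpy as np

def principal_basis(g_teacher: np.ndarray, k: int) -> np.ndarray:
    """First k left singular vectors of the teacher Gram matrix."""
    u, _, _ = np.linalg.svd(g_teacher)   # singular values are sorted descending
    return u[:, :k]

def principal_projector(g_teacher: np.ndarray, k: int) -> np.ndarray:
    """Orthogonal projector onto the teacher's rank-k principal subspace."""
    u_k = principal_basis(g_teacher, k)
    return u_k @ u_k.T

def subspace_alignment_loss(g_student, g_teacher, k: int) -> float:
    """Squared Frobenius norm of the Gram difference after projection."""
    p = principal_projector(g_teacher, k)
    diff = p @ (g_student - g_teacher) @ p
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(1)
w = rng.normal(size=(6, 40))             # hypothetical teacher projection weights
g_teacher = w @ w.T
k = 3
p = principal_projector(g_teacher, k)

u_full, _, _ = np.linalg.svd(g_teacher)
minor = u_full[:, -1]                    # direction outside the principal subspace
major = u_full[:, 0]                     # leading principal direction

# The same perturbation is ignored along a minor direction and penalized
# along a principal one.
loss_minor = subspace_alignment_loss(g_teacher + 5.0 * np.outer(minor, minor), g_teacher, k)
loss_major = subspace_alignment_loss(g_teacher + 5.0 * np.outer(major, major), g_teacher, k)
```

In this sketch a deviation confined to a minor direction contributes essentially nothing to the loss, which matches the stated role of the term in suppressing minor noise components while constraining the principal directions.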
3.4. Path II: Physics-Guided Compensation for Dynamical Consistency
As the input modality degrades from a two-dimensional time–frequency spectrum to a one-dimensional time-series signal, the student model experiences a systematic loss of frequency-domain structural information. This leads to observability degradation and induces high-frequency oscillations in the predictions that deviate from physical principles.
In contrast to Section 3.3, which emphasizes parameter-manifold distillation for inference-logic transfer, this section introduces an independent exogenous physical-compensation pathway. This physics-guided compensation mechanism extends beyond conventional solution-space regularization by incorporating human-body dynamical equations as an independent information source. It actively compensates for missing observational dimensions, restores dynamical-consistency features, and calibrates predictions at the signal level.
3.4.1. Equivalent Inertia Modeling and Robust Operators
In practical wearable applications, calibrating the transformation between the sensor coordinate system and the human anatomical coordinate system remains a significant challenge. An equivalent-parameter modeling strategy is adopted to enhance engineering applicability. Specifically, the unknown coordinate transformation is incorporated into the equivalent inertial tensor, which is treated as a learnable compensation variable and optimized automatically during network training. The resulting physical coupling relationship is formulated in Equation (7) and illustrated in Figure 5. In this relationship, the equivalent inertial tensor is expressed in the sensor coordinate system and is obtained from the rotation matrix of the sensor relative to the limb and the standard inertial tensor of the human body segment.
Based on the equivalent modeling framework, the Newton–Euler equations are used to evaluate the student model’s predictions. Owing to soft-tissue deformation and device-induced micro-motion artifacts, the dynamic equations serve as a physics-based approximation of rigid-body motion. They constrain the trend of joint torque variations rather than providing an exact analytical solution. The resulting dynamic-consistency relationship is formulated in Equation (8). In this relationship, the joint torque (N · m/kg) is related to the angular velocity, the angular acceleration, and a gravity compensation term through the equivalent inertial tensor.
3.4.2. Physics-Consistent Residual Regularization
To ensure stable gradient propagation in the physics-compensation pathway under noisy IMU measurements, both numerical stability and algebraic validity must be considered. As illustrated in Figure 6, a robust dynamical operator is introduced that integrates two core mechanisms.
First, a Savitzky–Golay smoothing differentiator is applied as a signal-preprocessing step to suppress high-frequency noise in the angular-velocity measurements. Second, a Cholesky-based reparameterization of the inertial parameter matrix is employed to enforce the symmetric positive-definite (SPD) property of the equivalent inertial tensor, thereby constraining the learnable parameters to remain within the physically feasible domain.
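The two mechanisms of the robust operator can be sketched in NumPy as follows. The Savitzky–Golay differentiator is written out as a local least-squares polynomial fit, and the SPD constraint uses a lower-triangular factor with a softplus-positive diagonal. The window size, polynomial order, softplus choice, edge padding, and function names are assumptions for illustration; the paper's own settings are given in its Appendix A.

```python
import numpy as np

def savgol_derivative(y, dt, half_window=5, polyorder=2):
    """First derivative from a local least-squares polynomial fit (Savitzky-Golay)."""
    offsets = np.arange(-half_window, half_window + 1)
    vander = np.vander(offsets, polyorder + 1, increasing=True)   # columns: 1, t, t^2, ...
    # Row 1 of the pseudo-inverse maps the window samples to the linear
    # coefficient of the fit, i.e. the slope at the window centre.
    weights = np.linalg.pinv(vander)[1] / dt
    padded = np.pad(y, half_window, mode="edge")   # edge samples are only approximate
    return np.correlate(padded, weights, mode="valid")

def spd_from_params(theta, eps=1e-6):
    """Map 6 unconstrained parameters to a 3x3 symmetric positive-definite matrix."""
    lower = np.zeros((3, 3))
    lower[np.tril_indices(3)] = theta
    diag = np.diag_indices(3)
    lower[diag] = np.log1p(np.exp(lower[diag]))    # softplus keeps the diagonal positive
    return lower @ lower.T + eps * np.eye(3)

# Angular acceleration from a noiseless quadratic angular-velocity profile.
dt = 0.01
t = np.arange(50) * dt
omega = t ** 2
alpha = savgol_derivative(omega, dt)               # close to 2 * t away from the edges

# Any real-valued parameter vector yields a valid inertia-like tensor.
theta = np.array([-3.0, 0.5, -2.0, 1.5, -0.7, 0.1])
inertia = spd_from_params(theta)
```

Because the factor has a strictly positive diagonal, the product with its transpose is positive definite for any parameter values, so gradient updates cannot leave the physically feasible domain.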
Based on the robust dynamical operator, a physics-compensation loss is formulated according to the Newton–Euler equations, as defined in Equation (9).
The physics-compensation loss serves not only as a regularization term but also as a core structural guidance signal. By minimizing the physical residual, the loss encourages the student model to move beyond purely data-driven statistical regularities and actively recover dynamical trend information lost due to reduced perceptual dimensionality.
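A physics residual of this kind can be sketched as follows, assuming the standard rigid-body rotational form (inertia times angular acceleration, plus the gyroscopic term, plus a gravity term) and a mean squared residual. The three-axis torque, the squared penalty, and all names are assumptions for illustration; the paper's loss, derived in its Appendix B, may use a different penalty or a single joint axis.

```python
import numpy as np

def newton_euler_torque(inertia, omega, alpha, tau_gravity):
    """Rigid-body torque I*alpha + omega x (I*omega) + tau_g for each time step.

    omega, alpha, tau_gravity have shape (T, 3); inertia has shape (3, 3).
    """
    i_omega = omega @ inertia.T
    return alpha @ inertia.T + np.cross(omega, i_omega) + tau_gravity

def physics_residual_loss(tau_pred, inertia, omega, alpha, tau_gravity):
    """Mean squared mismatch between predicted and physics-implied torque."""
    residual = tau_pred - newton_euler_torque(inertia, omega, alpha, tau_gravity)
    return float(np.mean(np.sum(residual ** 2, axis=-1)))

inertia = np.diag([1.0, 2.0, 3.0])                 # hypothetical equivalent inertia
t = np.linspace(0.0, 1.0, 100)[:, None]
omega = np.hstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t), np.zeros_like(t)])
alpha = 2 * np.pi * np.hstack([np.cos(2 * np.pi * t), -np.sin(2 * np.pi * t), np.zeros_like(t)])
tau_gravity = np.zeros_like(omega)

tau_consistent = newton_euler_torque(inertia, omega, alpha, tau_gravity)
tau_oscillating = tau_consistent + 0.3 * np.sin(2 * np.pi * 40 * t)   # spurious ripple

loss_consistent = physics_residual_loss(tau_consistent, inertia, omega, alpha, tau_gravity)
loss_oscillating = physics_residual_loss(tau_oscillating, inertia, omega, alpha, tau_gravity)
```

A prediction that follows the rigid-body trend incurs no penalty, whereas the same prediction with a superimposed high-frequency ripple is penalized, which is the sense in which the term discourages non-physical oscillations.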
Implementation details of the robust operator and its stability analysis are provided in Appendix A, while the derivation of the physics-compensation loss is presented in Appendix B.
3.5. Joint Optimization Strategy
The parameter-manifold distillation and physics-compensation mechanisms operate in a coordinated manner to form an end-to-end optimization framework. A joint optimization strategy is adopted, in which a composite loss function simultaneously enforces data-fitting accuracy, distillation consistency, and dynamical consistency.
The total objective function integrates the data-fitting term, the output-response distillation term, the two parameter-structure alignment terms, and the physics-compensation term, as defined in Equation (10).
The definitions and roles of each loss component are described as follows.
The basic supervision term ensures that the model retains fundamental fitting capability with respect to the ground-truth labels. To mitigate the influence of batch size (B) and time-series length (T) on gradient magnitude, a normalized formulation is adopted, as defined in Equation (11).
The output-response distillation term encourages the student model to inherit the regression behavior of the teacher at the output layer by minimizing the discrepancy between the teacher’s prediction and the student’s prediction, as defined in Equation (12).
The two structural terms implement the parameter-manifold alignment mechanism and facilitate transfer of the teacher’s inference structure, whereas the physics-compensation term corresponds to the mechanism described in Section 3.4.
During joint training, the relative gradient contributions of different mechanisms are balanced by four weighting coefficients, which are treated as hyperparameters and tuned independently of the model architecture (selection criteria are provided in Appendix C).
To ensure numerical stability in the multi-objective optimization process, gradient clipping and learning rate warm-up strategies are adopted. These auxiliary strategies suppress early-stage gradient oscillations and promote stable convergence of both physics constraints and parameter-alignment mechanisms. The detailed training procedure and implementation specifics are provided in Appendix C (Algorithm A1).
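The structure of the joint objective and the two stabilization strategies can be sketched as follows. The composite loss is assumed to be the data term plus a weighted sum of the four auxiliary terms. The term names, the assignment of the two structural terms to the Fisher-weighted and subspace losses, the linear warm-up shape, and all numeric values are illustrative assumptions and not the settings of Appendix C.

```python
import numpy as np

def total_loss(terms: dict, weights: dict) -> float:
    """Data term plus weighted auxiliary terms (one weight per auxiliary term)."""
    return terms["data"] + sum(weights[name] * terms[name] for name in weights)

def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear warm-up to base_lr over warmup_steps, constant afterwards."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)

def clip_by_global_norm(grads, max_norm: float):
    """Rescale all gradients jointly so their global L2 norm is at most max_norm."""
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

# Hypothetical per-batch loss values and weighting coefficients.
terms = {"data": 1.0, "kd": 2.0, "fisher": 4.0, "subspace": 8.0, "physics": 16.0}
weights = {"kd": 0.5, "fisher": 0.25, "subspace": 0.125, "physics": 0.0625}
loss = total_loss(terms, weights)

grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]       # global norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))

lr_first = warmup_lr(0, base_lr=1e-3, warmup_steps=10)
lr_late = warmup_lr(100, base_lr=1e-3, warmup_steps=10)
```

Clipping by the global norm preserves the relative direction of the combined update while bounding its size, which is useful early in training when the physics and alignment terms can produce gradients of very different magnitudes.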
5. Discussion
The experimental results indicate that degrading the input modality from an explicit two-dimensional time–frequency representation to a one-dimensional time-series signal does not lead to a complete collapse in task performance. Although geometric alignment between teacher and student feature distributions is weakened, the student model retains the capacity to perform the core dynamical regression task under engineering constraints. This observation suggests that, under structural reduction of input dimensionality, the suitability of feature alignment as a primary knowledge carrier must be carefully reconsidered [10]. Previous approaches, such as FitNets and Relational KD, are primarily predicated on the assumption that the observation spaces of teacher and student models are isomorphic or only weakly heterogeneous. However, under the two-dimensional-to-one-dimensional degradation setting examined in this study, the shared mapping basis in feature space is substantially weakened, thereby reducing observability. Nevertheless, the results demonstrate that even in the absence of fine-grained feature geometry from the teacher model, task-relevant information can still be transferred across heterogeneous architectures.
Experimental observations and ablation analyses indicate that parametric geometric constraints do not require the student model to reconstruct the teacher’s intermediate feature representations. As illustrated in Figure 9, after introducing parameter-geometric constraints, the anchor-space projections of the student model remain relatively discrete and do not converge toward the teacher manifold. Table 3 further demonstrates that M2 improves performance for multiple joints, particularly the knee and ankle, compared with M1. These findings suggest that, under heterogeneous distillation settings, parameter-geometric constraints primarily influence the optimization trajectory and the model’s discriminative structure [42].
This effect may be related to the intrinsic biomechanical coordination of human movement. Although IMU signals are high-dimensional and contain substantial noise, lower-limb motion is inherently constrained by skeletal structure and muscle coordination [43]. The principal task-related directions in the teacher’s parameter space are therefore more likely to encode genuine dynamical patterns rather than high-dimensional noise-induced redundancy. Consequently, even without reconstructing the teacher’s two-dimensional time–frequency manifold at the feature level, the student model can inherit essential structural information by enforcing parametric-geometric consistency (e.g., Gram-structure or principal-subspace alignment).
This structural-surrogate mechanism provides a plausible explanation for the non-uniform performance observed across joints. As shown in Table 1 and Table 3, distillation gains are relatively stable for the hip joint. In contrast, improvements for the knee joint, which is characterized by more substantial impact and more complex transient dynamics, are more limited. In scenarios involving pronounced transient dynamics, the effect of purely structural constraints may diminish. This observation suggests that structural consistency alone may be insufficient to compensate fully for observability degradation, thereby motivating the incorporation of additional physics-guided constraints [44].
Under conditions of incomplete observability, the primary contribution of the physics-guided mechanism is to enhance predictive stability. Specifically, it suppresses high-frequency anomalous oscillations induced by input dimensionality reduction, constrains predictions within physically plausible ranges, and provides trend-level constraints derived from dynamic equations when fine-grained spectral information is unavailable. From an optimization perspective, the physical-consistency loss influences the direction of parameter updates. Compared with training based solely on label supervision, this mechanism imposes additional penalties on update directions that violate dynamic constraints, thereby restricting the model’s feasible solution space [45]. In this study, physical constraints are embedded directly into the training loss function rather than implemented as post hoc correction modules. Experimental results show that reductions in physical-consistency error (PCE) and Peak Error are more pronounced than reductions in average error (RMSE), consistent with improved control over the tails of the error distribution. However, this mechanism cannot fully compensate for information loss induced by representation degradation, and its effectiveness diminishes under conditions dominated by high-frequency shocks or strong nonlinear dynamics.
It is also important to clarify the dynamical applicability boundaries of the physics-guided compensation mechanism. The Newton–Euler formulation adopted here assumes rigid-body kinematics, which provides a reasonable approximation for periodic lower-limb locomotion scenarios. However, for strongly nonlinear or aperiodic motions, such as rapid turning, jumping, or sudden posture changes, the rigid-body assumption may no longer hold. In such cases, soft-tissue deformation and multi-segment coupling effects may cause the physics-compensation term to introduce systematic bias rather than beneficial correction. The effectiveness of the proposed mechanism is therefore expected to diminish as the target motion deviates from the periodic locomotion conditions for which the framework was designed. Future work could explore incorporating musculoskeletal or flexible-body models to broaden applicability beyond the current rigid-body constraint.
Although this study validates the effectiveness of the proposed method under specific experimental conditions, several limitations must be clarified to define its scope of applicability. First, there is an inherent trade-off between training cost and online efficiency. Although the PDC-KD framework substantially reduces online inference latency (1.02 ms), it introduces additional computational overhead during training. Specifically, anchor-space Fisher information estimation, singular value decomposition (SVD), and Cholesky-based reparameterization of physical parameters increase the complexity of offline training. This reflects a deliberate offline-for-online computational trade-off; however, its suitability for scenarios with frequent retraining or edge-side adaptive learning requires further investigation. To quantify the offline training overhead introduced by the PDC-KD framework, wall-clock timing was recorded for the stair-ascent task, covering the per-joint Fisher information precomputation and the SVD of the anchor-space Gram matrix. Each student training run (200 epochs) took approximately 166 s per joint, with a peak GPU memory consumption of 891 MB. The detailed breakdown is provided in Appendix D (Table A3). These results confirm that the additional offline overhead is incurred only once prior to deployment and does not affect online inference latency (1.02 ms, approximately 980 FPS).
Second, the engineering equivalence of the learned physical parameters warrants careful consideration. Obtained through implicit optimization during training, the equivalent inertia tensor represents an engineering surrogate rather than an exact biomechanical parameter. It constrains the numerical trend and feasible region of predictions but does not provide a precise physiological interpretation. Its physical consistency and stability under abnormal movements or extreme operating conditions remain to be systematically validated. Furthermore, the current evaluation is limited to open-loop, offline testing within standard gait cycles, and real-time stability in closed-loop control systems has not yet been verified. Future work will explore more efficient adaptive mechanisms to accommodate evolving teacher models and non-stationary data distributions [25].
To assess the robustness of the framework under sensor measurement uncertainty, a post hoc sensitivity analysis was performed by injecting additive Gaussian noise (
relative to the normalized input scale) into the student model’s IMU inputs at inference time. As reported in
Table A4, RMSE degradation remained below
for
across all three joints, indicating adequate robustness under realistic sensor noise levels. The physical-consistency error exhibited negligible variation (<1%) across all noise conditions, suggesting that the physics-guided compensation mechanism imposes structural constraints that are largely invariant to input-level perturbations. Performance degrades substantially at
, consistent with the expectation that the framework targets realistic rather than adversarial noise conditions.
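A noise-injection sensitivity check of the kind described above can be sketched as follows. The toy model, the data shapes, the constant label offset, and the two noise levels are placeholders chosen for the example; they are not the values used in the reported analysis.

```python
import numpy as np

def rmse(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def rmse_degradation(model, x, y_true, sigma, rng) -> float:
    """Relative RMSE increase (%) when Gaussian noise of scale sigma is added to x."""
    clean = rmse(model(x), y_true)
    noisy_input = x + rng.normal(scale=sigma, size=x.shape)
    noisy = rmse(model(noisy_input), y_true)
    return 100.0 * (noisy - clean) / clean

def toy_model(x: np.ndarray) -> np.ndarray:
    """Stand-in for the student network: averages the IMU channels."""
    return x.mean(axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 50, 6))        # (cycles, time steps, channels), normalized
y_true = toy_model(x) + 0.1              # constant offset gives a nonzero clean RMSE

deg_none = rmse_degradation(toy_model, x, y_true, sigma=0.0, rng=rng)
deg_small = rmse_degradation(toy_model, x, y_true, sigma=0.05, rng=rng)   # placeholder level
deg_large = rmse_degradation(toy_model, x, y_true, sigma=0.5, rng=rng)    # placeholder level
```

Reporting the degradation relative to the clean RMSE, rather than the absolute error, makes results comparable across joints whose baseline errors differ.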
Third, several sources of uncertainty in both input data and model parameters warrant acknowledgment. On the input side, IMU signals are subject to soft-tissue artifacts, sensor-to-segment misalignment, and measurement drift over prolonged use, which introduce uncertainty into the observed kinematics. On the model side, the equivalent inertia tensor functions as an implicitly learned engineering surrogate whose convergence behavior may vary across subjects with substantially different body segment parameters. Furthermore, the Fisher information matrix is precomputed on the training distribution and does not adapt to inter-subject variability or domain shift. These factors suggest that the reported performance metrics reflect aggregate behavior under the evaluated experimental conditions and may not generalize uniformly to all individual deployment scenarios.
A further practical consideration concerns the gap between the offline evaluation protocol and real-world wearable deployment. Experimental evaluation in this work relies on pre-segmented, time-normalized gait cycles, under the implicit assumption that gait-phase segmentation has already been performed upstream. In continuous wearable deployment, the model must process an uninterrupted IMU data stream in which gait-cycle boundaries are not known a priori. Integrating the proposed framework with a real-time segmentation front-end—such as a heel-strike detector or a sliding-window gait-phase classifier—would likely be necessary for practical deployment. However, this integration has not been systematically evaluated, and assessment on continuous, non-segmented sequences under realistic ambulatory conditions constitutes an important direction for future work. Accordingly, the present results should be interpreted within the scope of cycle-level offline evaluation.
Finally, the generalizability of the Fisher information weighting strategy warrants further consideration. In the current framework, the Fisher information matrix is precomputed from the teacher model on the training distribution and remains fixed during student training. When the target data distribution shifts substantially (due to changes in subject demographics, activity type, or sensor configuration), the precomputed weights may no longer accurately reflect task-relevant parameter sensitivity, potentially limiting distillation effectiveness. A possible mitigation could involve periodically re-estimating the Fisher information matrix on a representative subset of the target distribution; alternatively, an incremental updating scheme could be adopted. These extensions are left for future work. As noted in Section 4.2, the t-SNE visualization serves as qualitative evidence rather than causal proof.
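The incremental updating scheme mentioned as a possible extension could, for instance, take the form of an exponential moving average over diagonal Fisher estimates. The sketch below is purely illustrative: it is not part of the evaluated framework, and the diagonal estimate, the momentum value, and the function name are assumptions.

```python
import numpy as np

def update_fisher(fisher_old: np.ndarray, grad_anchor_new: np.ndarray,
                  momentum: float = 0.9) -> np.ndarray:
    """Blend a stored diagonal Fisher estimate with one from new target-domain data.

    grad_anchor_new has shape (n_samples, d_anchor) and holds teacher-loss
    gradients with respect to the anchor-space outputs on the new samples.
    """
    fisher_new = np.mean(grad_anchor_new ** 2, axis=0)
    return momentum * fisher_old + (1.0 - momentum) * fisher_new

fisher_train = np.array([1.0, 1.0])                 # precomputed on training data
grads_target = np.array([[2.0, 0.0], [2.0, 0.0]])   # hypothetical target-domain gradients
fisher_updated = update_fisher(fisher_train, grads_target, momentum=0.5)
```

A higher momentum keeps the weights close to the training-distribution estimate, while a lower one tracks the target distribution more quickly at the cost of noisier weights.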
Despite these limitations, the findings provide practical insights for designing resource-constrained edge-learning systems. In scenarios involving simplified input modalities and constrained computational budgets, incorporating parameter-space structural constraints and physics-consistency mechanisms can enhance prediction stability while preserving model compactness. More broadly, integrating structural consistency with physical rationality facilitates a balanced trade-off between system efficiency and stability in similar tasks.