1. Introduction
Human–robot interaction (HRI) is an interdisciplinary domain concerned with the analysis, design, and assessment of robotic systems that operate alongside or in collaboration with humans [1]. Communication forms the foundation of this interaction, and the manner in which it occurs depends strongly on whether the human and the robot share the same physical environment [1]. In high-noise settings such as construction zones, manufacturing floors, and airport ramps, conventional voice-based or teleoperation-based communication frequently fails because of excessive background noise, limited maneuverability, and safety considerations. Dependence on these conventional channels can therefore undermine both efficiency and user well-being, underscoring the need for alternative, context-sensitive communication strategies. In situations where verbal communication is hindered by noise, mobility limitations, or safety concerns, hand gesture-based interaction provides a subtle and natural means for human–robot collaboration [2]. This form of nonverbal communication aligns with human social behavior and can substantially reduce cognitive load, making it more effective than speech in challenging environments [2]. A classic example in practice is the system of air-marshalling signals: standardized hand gestures that have long enabled pilots and ground crews to coordinate aircraft movement reliably and unambiguously. Recent studies on vision-driven robotic systems have demonstrated effective recognition of ramp hand signals through deep learning and computer vision methods, enhancing the autonomy and operational safety of unmanned vehicles [3,4,5]. These developments collectively establish gesture-based human–robot interaction as an essential framework for natural, accessible, and reliable communication between humans and robots, especially in contexts where conventional verbal communication is infeasible [2,4,5,6,7].
Gesture-based communication is a promising approach in HRI for improving ergonomics, decreasing downtime, raising safety, and boosting resilience: even children, older adults, hearing- and speech-impaired individuals, and non-technical users with limited knowledge of robot operation can interact with robots [2,4,5,8,9,10,11]. However, one of the primary challenges in deploying robotic systems in real-world settings is the phenomenon of domain shift. Domain shift occurs when a model trained on data from one domain (the source) is deployed in another domain (the target) where it must perform the same task, yet the data distributions differ [12,13,14,15]. In gesture-based HRI deployments, such distributional differences can arise from illumination, background clutter, camera viewpoints, or variations in gesture execution style across facilities, operators, and time. These differences severely impair model performance and render straightforward deployment impractical [16,17].
Domain incremental learning (DIL), which trains models to generalize across new domains encountered sequentially while the task remains unchanged, is a natural framework for addressing domain shift [18,19]. This approach allows HRI systems to absorb new domain knowledge, mirroring how a robot must generalize in real-world deployment. A fundamental challenge in DIL is catastrophic forgetting: numerous studies show that when a neural network is trained on new tasks, its performance on previously learned tasks degrades drastically because the parameters critical for earlier knowledge are overwritten by the new learning [20,21,22,23,24,25,26]. This problem is particularly relevant to domain incremental learning, where a system must continuously learn from new users who may perform gestures in different environments and at different speeds, without losing proficiency on previously learned gesture patterns. Through effective retraining protocols and limited sample retention, research on continual gesture learning for HRI has shown great promise for incrementally learning new gesture classes while reducing forgetting [27,28,29,30,31]. However, a notable gap remains in the current literature: most efforts have focused on class-incremental learning, whereas adapting to new domains, such as different users and environments, within a DIL framework is underexplored.
To address these challenges, our work focuses on designing gesture-based HRI systems that work reliably in a wide range of real-world situations. Specifically, we target the open problem of adapting to new users and environments under a domain incremental learning framework while minimizing catastrophic forgetting. Subject-as-domain DIL frames each subject, or human operator, as a distinct domain within a shared task and label space. Under this formulation, gesture patterns vary across users due to individual morphology, motion style, and environmental context, producing sequential domain shifts that challenge model generalization [32]. This formulation falls under the domain incremental learning paradigm, in which the input distribution changes across domains while the label space remains fixed [33]. Our study pursues two complementary objectives: (1) adaptation, to continuously learn new subject domains with high accuracy and robustness under shifting distributions; and (2) retention, to preserve prior-domain performance by minimizing catastrophic forgetting. Together, these objectives enable human-centered incremental learning, where the system remains dependable and inclusive across diverse users. We address the challenge of memory efficiency by investigating techniques that selectively keep and replay the most informative samples from previous domains. In addition, we emphasize the importance of robust evaluation protocols that not only assess recognition accuracy but also capture retention, forgetting, and resource consumption across long sequences of domain shifts. Together, these directions aim to bridge the gap between current class-incremental approaches and the practical requirements of domain-incremental gesture recognition in human–robot interaction. The main contributions of this work can be summarized as follows:
We present a multi-domain air-marshalling hand signal recognition framework under a DIL paradigm, enabling robust adaptation to new users and environments without sacrificing prior knowledge.
We introduce a memory-efficient latent exemplar replay strategy in which latent embeddings of gesture videos, produced by a frozen encoder, are stored and replayed during DIL training, preserving user privacy.
We develop a clustering-based exemplar selection mechanism that identifies and stores the most representative samples from previously learned domains, thereby enhancing generalization in subsequent learning phases.
We adopt extensive evaluation metrics that go beyond accuracy, explicitly quantifying knowledge retention, forgetting, and resource utilization across extended sequences of domain shifts.
The remainder of this paper is organized as follows. Section 2 reviews related work on gesture-based domain incremental learning for HRI. Section 3 formulates the problem in the subject-as-domain DIL setting and states the twin goals of adaptation and retention. Section 4 introduces the proposed method, ReDIaL, and its components: latent embeddings (Section 4.1), the cross-modality model (Section 4.2), exemplar memory (Section 4.3.2), and balanced replay (Section 4.3.3). Section 5 covers the experimental setup for the proposed method. Section 6 reports results across 21 domains with the corresponding evaluation metrics and analysis. Section 7 discusses implications for human–robot collaboration, and Section 8 concludes with future directions.
3. Problem Statement: Subject-As-Domain Incremental Gesture Learning
Our domain-incremental gesture recognition framework is based on clearly defined input spaces, a shared label space, and domain-indexed datasets. Two complementary modalities, (1) RGB frames ($x^{\text{rgb}}$) and (2) posture data ($x^{\text{pose}}$), are used to represent each observation $x$. Both $x^{\text{rgb}}$ and $x^{\text{pose}}$ have dimension $T \times C \times H \times W$, where $T$ is the number of frames, $C$ the number of channels, $H$ the frame height, and $W$ the frame width. The multimodal observation can therefore be expressed as
$$x = (x^{\text{rgb}}, x^{\text{pose}}), \qquad x^{\text{rgb}}, x^{\text{pose}} \in \mathbb{R}^{T \times C \times H \times W}.$$
The detailed description of each modality is given in Section 4.1. The set of gesture classes to be recognized is represented by the shared label space $\mathcal{Y}$, which is common across all domains. Each observation $x_i$ is associated with a distinct label $y_i \in \mathcal{Y}$. Formally, the dataset $\mathcal{D}_t$ corresponding to domain $t$ is given by the following:
$$\mathcal{D}_t = \{(x_i^{(t)}, y_i^{(t)})\}_{i=1}^{N_t}, \qquad y_i^{(t)} \in \mathcal{Y}.$$
As the robot is incrementally exposed to new environments, domains are indexed over time: $\mathcal{D}_0, \mathcal{D}_1, \ldots, \mathcal{D}_T$.
At time step $t = 0$, an initial dataset from the source domain $\mathcal{D}_0$, which may correspond to an internally collected dataset in a controlled environment, is used to train the robot.
For subsequent time steps $t \geq 1$, the robot encounters datasets corresponding to new domains $\mathcal{D}_t$. In our experimental setup, these domains represent distinct participants, each introducing variability in body morphology, environmental conditions, and gesture execution styles, while preserving the same label space $\mathcal{Y}$.
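To make the setting concrete, the following minimal sketch shows one way the subject-as-domain stream could be organized in code. All names and tensor shapes here are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple
import numpy as np

@dataclass
class Domain:
    """One subject-as-domain dataset D_t sharing the common label space Y."""
    subject_id: int
    rgb: np.ndarray     # (N, T, C, H, W) RGB clips
    pose: np.ndarray    # (N, T, C, H, W) skeleton-rendered clips
    labels: np.ndarray  # (N,) gesture class indices from the shared label space

def domain_stream(domains: List[Domain]) -> Iterator[Tuple[int, Domain]]:
    """Yield domains strictly in sequence; the learner never sees future domains."""
    for t, d in enumerate(domains):
        yield t, d  # t = 0 is the source domain; t >= 1 are new subjects
```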
3.1. Domain Shift Formulation
Let $x = (x^{\text{rgb}}, x^{\text{pose}})$ be the multimodal observation and $y \in \mathcal{Y}$ its gesture label (also referred to as its class); throughout this paper, the terms label and class are used interchangeably. The observation distribution in domain $t$ can be written as follows:
$$p_t(x, y) = p(y) \int p(x \mid y, s, a, b)\, p_t(s)\, p_t(a)\, p_t(b)\, \mathrm{d}s\, \mathrm{d}a\, \mathrm{d}b,$$
where
$s$ controls execution speed, affecting the temporal length $T$ of $x$;
$a$ controls subject appearance and morphology, affecting pixel statistics;
$b$ controls background/scene context, affecting visual content.
A domain shift occurs when any of these latent factors has a domain-dependent distribution. For a new user in a new domain $t+1$, the distribution $p_{t+1}(x, y)$ involves execution speed $s$, subject appearance and morphology $a$, and background/scene context $b$ with
$$p_{t+1}(s) \neq p_t(s), \qquad p_{t+1}(a) \neq p_t(a), \qquad p_{t+1}(b) \neq p_t(b);$$
even though the gesture labels remain identical across domains ($p_{t+1}(y) = p_t(y)$) within the shared label space $\mathcal{Y}$, the underlying data distributions differ. Hence, domain shifts in our problem are primarily driven by changes in execution speed, subject appearance and morphology, and background/scene context, each of which modifies a different generative component of $p_t(x, y)$.
This formulation highlights that although the gesture labels remain identical across domains, subject-specific and environmental factors introduce significant distributional shifts in the input space. Addressing these discrepancies requires effective domain adaptation techniques that enable the model to generalize properly while retaining previously acquired knowledge.
3.2. Domain Incremental Learning Objective
Let $\mathcal{D}_0, \mathcal{D}_1, \ldots, \mathcal{D}_T$ be a sequence of domains that share the same gesture label space $\mathcal{Y}$. At time step $t$, the model $f_t$ is trained on data from domain $\mathcal{D}_t$ and achieves an accuracy
$$A_t(f_t) = \Pr_{(x, y) \sim \mathcal{D}_t}\big[f_t(x) = y\big],$$
which measures how well $f_t$ recognizes gestures within the distribution of $\mathcal{D}_t$; here $\Pr$ denotes probability.
When the robot is deployed in a new environment $\mathcal{D}_{t+1}$ at time $t+1$, the data distribution changes as discussed in Section 3.1. A domain incremental learning process adapts to domain $\mathcal{D}_{t+1}$ by fine-tuning the previous model $f_t$ to produce an updated model $f_{t+1}$. The primary objective of this adaptation is twofold:
$$A_{t+1}(f_{t+1}) \text{ is maximized}, \qquad A_k(f_{t+1}) \geq A_k(f_t) \quad \text{for all } k \leq t.$$
Here, '$\geq$' denotes a no-degradation criterion: after adapting to the new domain $\mathcal{D}_{t+1}$, the updated model $f_{t+1}$ should achieve accuracy on each previously learned domain $\mathcal{D}_k$ that is at least the accuracy of the prior model $f_t$ on $\mathcal{D}_k$; i.e., past-domain performance is maintained or improved. If $A_k(f_{t+1}) < A_k(f_t)$, the model has suffered catastrophic forgetting, a well-known problem in continual learning where adaptation to new data leads to a severe drop in accuracy on past domains. Forgetting in a DIL setting can therefore be defined as
$$F_{t+1} = A_t(f_t) - A_t(f_{t+1}).$$
If $F_{t+1} > 0$, catastrophic forgetting has occurred during adaptation to domain $\mathcal{D}_{t+1}$. Thus, two fundamental and complementary goals are at the center of producing the updated model $f_{t+1}$:
Goal 1. Adaptation: Achieve high $A_{t+1}(f_{t+1})$ by effectively learning the new domain distribution. This ensures the model remains reliable and accurate when deployed in novel environments.
Goal 2. Retention: Maintain $A_k(f_{t+1}) \geq A_k(f_t)$ for all previously seen domains $k \leq t$. This prevents catastrophic forgetting and guarantees that knowledge from earlier domains remains useful over time.
The simultaneous achievement of these goals is essential for strong, lifelong learning across domains in HRI. Without retention, adaptation risks catastrophic forgetting, in which knowledge of previous environments is erased, reducing the robot's adaptability and dependability. Conversely, focusing on retention without sufficient adaptation yields poor generalization in unfamiliar settings, compromising real-world usability. Gesture recognition across successive domains thus embodies the basic problem of lifelong learning: ongoing adaptation to changing environments while preserving previously learned competencies.
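To make these objectives measurable, the sketch below computes the per-domain accuracy matrix and the forgetting quantity defined above. It is a minimal illustration: `evaluate` is a hypothetical helper assumed to return a model's accuracy on a domain's test set.

```python
import numpy as np

def accuracy_matrix(models, test_domains, evaluate):
    """A[t, k] = accuracy of model f_t on domain D_k, for k <= t."""
    n = len(test_domains)
    A = np.full((n, n), np.nan)
    for t, model in enumerate(models):
        for k in range(t + 1):
            A[t, k] = evaluate(model, test_domains[k])
    return A

def forgetting(A: np.ndarray, t: int) -> float:
    """F_{t+1} = A_t(f_t) - A_t(f_{t+1}); positive values signal forgetting."""
    return A[t, t] - A[t + 1, t]
```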
4. Methodology
This section details our Replay-based Domain Incremental Learning (ReDIaL) pipeline for gesture recognition in HRI. As illustrated in Figure 1, the pipeline has four stages: (i) latent embedding generation with a frozen encoder $\phi$ that maps RGB and posture streams into a stable, privacy-preserving feature space; (ii) a cross-modality recognition model that aggregates modality-specific representations into a unified embedding; (iii) clustering-based exemplar selection that stores a compact, diverse subset of past domains within a fixed memory budget; and (iv) balanced replay training that interleaves current-domain samples with stored exemplars to control the stability–plasticity trade-off. Freezing $\phi$ stabilizes the feature manifold across subjects and reduces storage, while clustering improves mode coverage so that limited memory remains informative. The following subsections expand each component and present the training algorithm used for sequential adaptation.
4.1. Latent Embedding Generation
As mentioned in Section 3, an observation $x$ contains two modalities, $x^{\text{rgb}}$ and $x^{\text{pose}}$, which can be described as follows:
RGB frames ($x^{\text{rgb}}$): the RGB video input stream, containing visual information such as background, clothing, texture, color, appearance, gesture, and environmental context. This modality carries rich pixel-level information and preserves the contextual cues necessary for differentiating gestures in diverse settings. Let $x^{\text{rgb}}$ denote the RGB video modality defined in the RGB video space $\mathcal{X}^{\text{rgb}}$, such that $x^{\text{rgb}} \in \mathcal{X}^{\text{rgb}}$ and $x^{\text{rgb}} \in \mathbb{R}^{T \times C \times H \times W}$, where $T$ is the number of frames, $C$ the number of channels, $H$ the height, and $W$ the width of the video.
Posture data ($x^{\text{pose}}$): a structured representation of human motion derived from tracked body joints over time, obtained either from the RGB stream using a keypoint detection algorithm [46,47] or from dedicated hardware sensors. Unlike raw RGB frames, it emphasizes geometric and kinematic information such as posture, dynamics, and gesture trajectories, which are less sensitive to environmental variations [48]. For implementation, each frame of $x^{\text{pose}}$ is represented as a skeleton-based image generated from the detected body joints, yielding $x^{\text{pose}} \in \mathbb{R}^{T \times C \times H \times W}$.
Since both modalities of an observation $x$ have the same shape, we can write
$$x = (x^{\text{rgb}}, x^{\text{pose}}) \in \mathbb{R}^{2 \times T \times C \times H \times W}.$$
Here, the first dimension indexes the RGB and posture modalities. Directly storing $x$ for all observations is memory-expensive in a DIL setting, as each modality represents a full video sequence. To address this, we map $x$ into a compact latent representation using a shared frozen encoder $\phi$. The latent embeddings are obtained as follows:
$$z^{\text{rgb}} = \phi(x^{\text{rgb}}) \in \mathbb{R}^{n}, \qquad z^{\text{pose}} = \phi(x^{\text{pose}}) \in \mathbb{R}^{n},$$
where $n$ is the embedding dimension per modality, and $z^{\text{rgb}}$ and $z^{\text{pose}}$ denote the embeddings of $x^{\text{rgb}}$ and $x^{\text{pose}}$, respectively. The final embedding is obtained by concatenating the latent embeddings of the two modalities:
$$z = [z^{\text{rgb}}; z^{\text{pose}}] \in \mathbb{R}^{2n}.$$
This compact latent representation of observation $x$ is utilized for storage and domain incremental learning. It significantly reduces memory requirements while retaining modality-specific information in a joint feature space. Thus, $(z, y)$ represents the latent embedding–label pair corresponding to an observation–label pair $(x, y)$.
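The following sketch illustrates the embedding step, assuming a generic pre-trained video backbone that maps a clip to an $n$-dimensional vector; the backbone, its API, and the output size are assumptions for illustration rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class FrozenLatentExtractor(nn.Module):
    """Embeds both modalities of an observation with a shared frozen encoder."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # frozen: keeps the latent space stable across domains

    @torch.no_grad()
    def forward(self, x_rgb: torch.Tensor, x_pose: torch.Tensor) -> torch.Tensor:
        # x_rgb, x_pose: (B, T, C, H, W); backbone assumed to return (B, n) per clip.
        z_rgb = self.backbone(x_rgb)
        z_pose = self.backbone(x_pose)
        return torch.cat([z_rgb, z_pose], dim=-1)  # (B, 2n) compact embedding for storage
```

Because only these compact embeddings are stored, the replay buffer never retains identifiable video frames, which underpins the privacy argument discussed later.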
4.2. Model Overview
A cross-modality recognition model $f$ is used as the core classifier within our gesture-based HRI framework; its overview is given in Figure 2. Given the latent embeddings $z^{\text{rgb}}$ and $z^{\text{pose}}$ from the RGB and posture modalities, respectively, two parallel transformer encoders $E^{\text{rgb}}$ and $E^{\text{pose}}$ are applied to obtain modality-specific contextual representations:
$$h^{\text{rgb}} = E^{\text{rgb}}(z^{\text{rgb}}), \qquad h^{\text{pose}} = E^{\text{pose}}(z^{\text{pose}}).$$
These outputs are passed through a Query-based Global Attention Pooling (QGAP) mechanism, parameterized by learnable queries $q^{\text{rgb}}$ and $q^{\text{pose}}$, to produce compact embeddings:
$$u^{\text{rgb}} = \mathrm{QGAP}(h^{\text{rgb}}, q^{\text{rgb}}), \qquad u^{\text{pose}} = \mathrm{QGAP}(h^{\text{pose}}, q^{\text{pose}}).$$
Finally, a unified representation $u = [u^{\text{rgb}}; u^{\text{pose}}]$ corresponding to the gesture observation $x$ is obtained by concatenating the two modality embeddings, which is then fed to the classification head $g$ to predict the gesture label $\hat{y}$. This design enables $f$ to jointly reason over complementary information from both modalities while keeping a compact representation suitable for domain incremental learning.
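A compact sketch of this design is given below. It treats each modality's latent representation as a short token sequence of width `dim`; the sequence length, head count, and encoder depth are illustrative assumptions (Figure 2 defines the actual architecture), and `dim` is assumed divisible by the number of attention heads.

```python
import torch
import torch.nn as nn

class QGAP(nn.Module):
    """Query-based global attention pooling: a learnable query attends over tokens."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=heads, batch_first=True)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        q = self.query.expand(h.size(0), -1, -1)  # (B, 1, dim)
        pooled, _ = self.attn(q, h, h)            # query attends over all tokens
        return pooled.squeeze(1)                  # (B, dim)

class CrossModalityModel(nn.Module):
    """Parallel transformer encoders per modality, QGAP pooling, shared classifier."""

    def __init__(self, dim: int, num_classes: int, depth: int = 2):
        super().__init__()
        def encoder():
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=depth)
        self.enc_rgb, self.enc_pose = encoder(), encoder()
        self.pool_rgb, self.pool_pose = QGAP(dim), QGAP(dim)
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, z_rgb: torch.Tensor, z_pose: torch.Tensor) -> torch.Tensor:
        # z_rgb, z_pose: (B, L, dim) latent token sequences from the frozen encoder.
        u_rgb = self.pool_rgb(self.enc_rgb(z_rgb))
        u_pose = self.pool_pose(self.enc_pose(z_pose))
        return self.head(torch.cat([u_rgb, u_pose], dim=-1))  # gesture class logits
```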
4.3. Domain Incremental Learning Strategy
For a well-trained model $f_t$ that has converged on the domains seen so far, learning must proceed on the incoming data of the new domain $\mathcal{D}_{t+1}$ while preserving all previously acquired knowledge. To maintain this balance, we incorporate three key components: (i) a target-domain set $\mathcal{T}_{t+1}$, which stores samples from $\mathcal{D}_{t+1}$; (ii) an exemplar memory $\mathcal{M}$, which retains representative instances from earlier domains; and (iii) a replay mechanism that jointly presents data from $\mathcal{T}_{t+1}$ and $\mathcal{M}$ during each optimization step. A thorough discussion of these components is provided in the following subsections.
4.3.1. Target Domain Set Collection
In the target domain
, the data distribution may change due to lighting, viewpoint, background, or user motion style and speed. To support adaptation, we build a target domain set
where
is the same frozen encoder used for previous domains. Using a fixed encoder ensures that the embeddings from
lie in the
same latent space as the earlier domains.
This consistency serves two purposes: (i) Adaptation: supplies in-domain examples for fine-tuning; (ii) Shift assessment: comparing the feature statistics or empirical distributions of with those of the all previously seen domains quantifies the domain shift and guides the adaptation procedure.
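The paper does not fix a particular measure for the shift-assessment step; one simple possibility, sketched below under the assumption that each domain's latent distribution is summarized by a Gaussian fit, is a Fréchet-style distance between embedding sets.

```python
import numpy as np
from scipy import linalg

def frechet_shift(z_old: np.ndarray, z_new: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two embedding sets of shape (N, d).

    Larger values indicate a stronger shift between previously stored
    domains and the incoming target domain in latent space.
    """
    mu1, mu2 = z_old.mean(axis=0), z_new.mean(axis=0)
    s1 = np.cov(z_old, rowvar=False)
    s2 = np.cov(z_new, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard numerical imaginary residue
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2.0 * covmean))
```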
4.3.2. Clustering-Based Exemplar Memory Selection
Because retaining all past data is impractical, we maintain a fixed-size exemplar memory $\mathcal{M}$ governed by a budget $B$. For domain $\mathcal{D}_t$, the number of exemplars to keep is
$$m_t = \lfloor B / (t + 1) \rfloor,$$
where $\lfloor \cdot \rfloor$ denotes the floor operation. We apply a clustering method $\mathcal{C}$ to the embedded set $Z_t = \{\phi(x_i) \mid (x_i, y_i) \in \mathcal{D}_t\}$, producing clusters $\{C_1, \ldots, C_{m_t}\}$ and centroids $\{c_1, \ldots, c_{m_t}\}$. For each cluster, we select the medoid (the embedded sample closest to its centroid):
$$e_j = \arg\min_{z \in C_j} \lVert z - c_j \rVert_2, \qquad j = 1, \ldots, m_t.$$
The exemplar buffer is then
$$\mathcal{M}_t = \{(e_j, y_{e_j})\}_{j=1}^{m_t}.$$
This clustering-based selection yields broad mode coverage of the domain's embedding distribution, producing a more informative and balanced memory than random or majority-class sampling, and thereby mitigating catastrophic forgetting during subsequent adaptation.
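A minimal sketch of this selection step is shown below, using k-means as a stand-in for the clustering method $\mathcal{C}$; the specific clustering algorithm and function names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_exemplars(z: np.ndarray, y: np.ndarray, m: int, seed: int = 0):
    """Select m medoid exemplars from latent embeddings z (N, d) with labels y (N,).

    Clusters the domain's embeddings into m groups and keeps, per cluster,
    the sample closest to the centroid, giving broad mode coverage of the
    embedding distribution under a fixed memory budget.
    """
    km = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(z)
    idx = []
    for j in range(m):
        members = np.where(km.labels_ == j)[0]
        dists = np.linalg.norm(z[members] - km.cluster_centers_[j], axis=1)
        idx.append(members[np.argmin(dists)])  # medoid of cluster j
    idx = np.asarray(idx)
    return z[idx], y[idx]
```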
4.3.3. Balanced Multi-Domain Replay Training
Training proceeds with mini-batches that contain an equal number of samples from the new domain and the exemplar memory. Let $b$ denote the per-domain batch size, so $2b$ is the total batch size; $B_{\text{new}}$ is the batch selected from $\mathcal{T}_{t+1}$ and $B_{\text{mem}}$ the batch selected from $\mathcal{M}$. Here, $\mathrm{Uniform}$ represents a function that randomly selects $b$ samples from a given set ($\mathcal{T}_{t+1}$ or $\mathcal{M}$). At each step, we draw
$$B_{\text{new}} \sim \mathrm{Uniform}(\mathcal{T}_{t+1}, b), \qquad B_{\text{mem}} \sim \mathrm{Uniform}(\mathcal{M}, b),$$
and form a balanced mini-batch by concatenation:
$$B = B_{\text{new}} \cup B_{\text{mem}}.$$
Model parameters $\theta$ (of $f$) are updated by minimizing the cross-entropy loss over $B$ with learning rate $\eta$:
$$\theta \leftarrow \theta - \eta \nabla_\theta \frac{1}{|B|} \sum_{(z, y) \in B} \mathcal{L}_{\text{CE}}\big(f_\theta(z), y\big).$$
Training continues for a fixed budget of steps or until a validation criterion is met. This replay schedule preserves exposure to past domains while adapting to the target domain, balancing stability and plasticity. We summarize this method as Algorithm 1.
Algorithm 1 Balanced Replay for Domain Incremental Adaptation
Require: Model $f_\theta$, exemplar memory $\mathcal{M}$, new-domain set $\mathcal{T}_{t+1}$, batch size $b$, learning rate $\eta$, optimizer, max steps $S$
Ensure: Updated model $f_\theta$
1: Initialize $\theta$ from the previous model $f_t$
2: for each training step do ▹ or until validation convergence
3: $B_{\text{new}} \leftarrow \mathrm{Uniform}(\mathcal{T}_{t+1}, b)$
4: $B_{\text{mem}} \leftarrow \mathrm{Uniform}(\mathcal{M}, b)$
5: $B \leftarrow B_{\text{new}} \cup B_{\text{mem}}$ ▹ balanced batch of size $2b$
6: Compute loss $\mathcal{L} = \frac{1}{|B|} \sum_{(z, y) \in B} \mathcal{L}_{\text{CE}}(f_\theta(z), y)$
7: Update parameters $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$
8: end for
9: return Updated model $f_\theta$
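The sketch below mirrors Algorithm 1 in code. It assumes the memory and target sets are pre-encoded tensors of latent embeddings with labels and that `model` consumes a stored embedding directly; variable names and the optimizer choice are illustrative.

```python
import torch
import torch.nn.functional as F

def balanced_replay_adapt(model, memory, target, b=32, lr=1e-4, steps=1000):
    """Adapt `model` to a new domain with balanced replay (Algorithm 1).

    memory: (z_mem, y_mem) exemplar embeddings and labels from past domains.
    target: (z_new, y_new) embeddings and labels from the new domain.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    z_mem, y_mem = memory
    z_new, y_new = target
    for _ in range(steps):  # or until a validation criterion is met
        i = torch.randint(len(z_new), (b,))  # B_new ~ Uniform(T_{t+1}, b)
        j = torch.randint(len(z_mem), (b,))  # B_mem ~ Uniform(M, b)
        z = torch.cat([z_new[i], z_mem[j]])  # balanced batch of size 2b
        y = torch.cat([y_new[i], y_mem[j]])
        loss = F.cross_entropy(model(z), y)  # cross-entropy over the mixed batch
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```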
6. Results and Analysis
6.1. Result Analysis: Evidence of Domain Shift and the Advantage of ReDIaL
Table 2 summarizes performance on the source test set, the pooled target test set, and their union. We focus here on two aspects: (i) demonstrating the severity and asymmetry of the domain shift and (ii) showing how ReDIaL achieves strong adaptation to new domains while maintaining high performance on the source, approaching the non-incremental upper bound.
The source-only model attains on the source yet only on the pooled target, a drop of in accuracy. Conversely, the target-only (pooled) model achieves on the pooled target but just on the source, a drop of in performance. These large, cross-directional gaps ( and ) quantify a severe and asymmetric domain shift between the datasets: models trained in one domain generalize poorly to the other, even though the label space, , is identical. The union-set accuracies reinforce this domain shift: yields only overall, while reaches but remains weak on the source domain—again highlighting the underlying distribution difference.
The baseline joint-training model , which assumes simultaneous access to source and all target data, achieves on the source domain, on the target domain, and on the overall domain test set. This setting is unrealistic for real deployments, since in practice, domains are not accessible prior to deployment. However, it is useful as an upper bound on performance achievable without incremental constraints. Fine-tuning without rehearsal on pooled target data () yields perfect target performance () but reduces source accuracy to , indicating substantial drift away from the source distribution. Sequential fine-tuning with no rehearsal () improves target accuracy to but further drops the performance on the source to , leading to only on the union. These results show that naive adaptation strategies can succeed on the new domain while underperforming on the original one, which is a clear indication of a strong domain shift. Standard Experience Replay (ER) substantially narrows the cross-domain gap: (source), (target), and overall.
ReDIaL improves upon ER across the time steps t, achieving on the source (+ over ER), on the target (+), and overall (+). Notably, ReDIaL approaches the non-incremental joint-training model in every scenario (source: vs. , target: vs. 100, union: vs. ), but under incremental constraints. This suggests that ReDIaL's design choices, namely a frozen encoder for a stable latent space, clustering-based exemplar selection for broad mode coverage, and balanced rehearsal, effectively counter the domain shift while retaining prior knowledge.
The large cross-domain performance gaps of and provide direct, quantitative evidence of a significant and asymmetric domain shift. While naive fine-tuning strategies can overfit to the target distribution, they struggle to preserve performance on the source. By contrast, ReDIaL delivers high accuracy on both domains and on their union, nearly matching joint-training performance without violating the incremental learning protocol.
6.2. Overall Incremental Performance
The evaluation in Figure 3 reports accuracy over time for ReDIaL, the incremental fine-tuning model without rehearsal, and ER; all three models are trained sequentially on subject-as-domain streams, and at each step t we report accuracy on the cumulative test set of all domains observed up to t.
The incremental fine-tuning model without rehearsal exhibits the lowest and most variable performance across the sequence. Accuracy drops early from about at to roughly at , recovers partially around mid-sequence (e.g., near ), and achieves near at . This pattern is consistent with strong sensitivity to subject-specific shifts and limited retention of earlier domains when no replay mechanism is available.
ER attains higher accuracy than the no-rehearsal baseline throughout and begins close to ReDIaL (approximately at versus for ReDIaL). The accuracy generally improves over time but shows notable drops around some intermediate domains (e.g., near ) and reaches about by . These variations suggest that a uniform replay buffer helps maintain performance but can be sensitive to domain idiosyncrasies when the stored exemplars are not sufficiently representative of earlier distributions.
ReDIaL starts strong at (), experiences a significant decrease at (), and thereafter increases, reaching around , by , and at . Relative to ER, ReDIaL maintains a higher accuracy level with smaller fluctuations across the sequence. This behavior aligns with the method’s design: a frozen encoder that stabilizes the latent space across subjects, clustering-based exemplar selection that improves coverage of prior domains, and balanced replay that jointly exposes the model to past and current data.
Across sequential domains, the no-rehearsal baseline exhibits substantial variability and the lowest final accuracy, ER mitigates this effect but remains sensitive to domain shifts, and ReDIaL achieves the highest accuracy throughout most of the sequence and at the final step. The temporal trends provide qualitative evidence that rehearsal is necessary for cross-domain stability and that the proposed latent-exemplar strategy is particularly effective in this domain-incremental setting.
6.3. Catastrophic Forgetting Across Domains
Figure 4 reports forgetting at each time step, defined as the change in accuracy on the previous domain immediately after learning the next domain, as given in Equation (17). We compare the proposed replay strategy (ReDIaL) against incremental fine-tuning without rehearsal and a standard experience replay baseline (ER), and we highlight time steps that are particularly susceptible to forgetting.
The no-rehearsal baseline exhibits the most inconsistent trend of forgetting across the sequence. In early steps, forgetting rises sharply (e.g., around ), then flips to a strong negative value at , and again shows a positive rise near . In subsequent time steps, another large positive spike appears around (annotated at approximately in the figure), followed by a deep negative drop close to (around ). This alternating pattern indicates that, in the absence of rehearsal, adaptation to a new subject can both overwrite previous knowledge and occasionally induce accidental improvements when two consecutive domains happen to be similar. Overall, the variability underscores the vulnerability of naive fine-tuning to subject-driven domain shifts.
ER substantially reduces forgetting relative to the no-rehearsal baseline, but its curve still fluctuates across incremental time steps. In particular, large positive rises are visible around and (both near ), while a marked negative value appears at (approximately ). These events suggest that although replay stabilizes learning, the exemplar set may not always store sufficient diversity from earlier domains, leaving the model sensitive to subjects that differ strongly in appearance, background, or execution speed.
ReDIaL keeps forgetting close to zero throughout most of the sequence. The curve shows small early negative values (e.g., about at ) that reflect mild backward transfer, followed by fluctuations of very low magnitude and convergence to zero by . The near-flat profile indicates that the combination of a frozen encoder, clustering-based exemplar selection, and balanced replay effectively constrains parameter drift when adapting to a new subject.
Across methods, several time steps stand out as particularly prone to forgetting: early transitions (e.g., –) and mid-to-late transitions (e.g., , , ). These coincide with subject changes that likely introduce stronger shifts in appearance, background, or execution speed. Among the three models, naive fine-tuning is the most affected, ER mitigates but does not eliminate these spikes, and ReDIaL exhibits the smallest amplitude with a stable end-of-sequence value. Overall, the results indicate that rehearsal is essential in this domain incremental setting and that ReDIaL's latent-exemplar replay provides the most reliable retention of previously learned domains.
6.4. Comparison with the Other Methods
Table 3 reports the time-averaged cumulative accuracy, the harmonic mean across domains, and the average forgetting. ReDIaL achieves the highest time-averaged and harmonic-mean accuracy, slightly outperforming ER and showing a substantial margin over both fine-tuning baselines. The very small gap between the time-averaged and harmonic-mean accuracies for ReDIaL and ER indicates consistently strong performance across domains, whereas fine-tuning exhibits a noticeably lower harmonic mean, suggesting degraded generalization to some domains.
ReDIaL and ER also show near-zero or slightly negative average forgetting, reflecting not only effective retention but also occasional backward transfer. In contrast, the fine-tuning baselines either suffer from severe forgetting or, where the average is close to zero, display high per-step volatility, as seen in the forgetting curves. Together, these results demonstrate that ReDIaL provides the most favorable balance of accuracy and retention, outperforming both naive fine-tuning and standard experience replay in this domain-incremental setting.
Table 3 also isolates the contribution of each key module in the proposed ReDIaL framework by comparing models with progressively added components: latent embedding, clustering, and balanced replay. The results demonstrate that models without replay suffer from severe forgetting, confirming the necessity of a replay mechanism. Incorporating replay in the ER baseline markedly reduces forgetting and improves average performance, isolating the benefit of memory-based rehearsal. Finally, combining clustering for representative exemplar selection with balanced replay for stable adaptation in ReDIaL yields the highest time-averaged and harmonic-mean accuracy, indicating that these modules work together to improve cross-domain generalization and knowledge retention. Overall, this comparison shows how each component contributes to the whole, satisfying the ablation criteria.
7. Discussion
Our findings provide clear evidence of asymmetric domain shift in HRI gesture recognition. As shown in Table 2, a source-only model transfers poorly to the pooled target set, while a target-only model fails on the source, despite an identical label space. This pattern is consistent with the subject- and environment-driven factors identified in Section 3.1 (appearance, background, execution speed) and motivates a domain incremental solution that adapts to new users without erasing prior knowledge.
ReDIaL addresses this need through three complementary components: a frozen encoder that stabilizes the latent space across subjects, clustering-based exemplar selection that improves coverage of earlier domains, and balanced replay that mixes past and current data at every update (Algorithm 1). Combined with our cross-modality design (Section 4.1), these choices yield high and steady accuracy over time (Figure 3), approaching the joint-training upper bound while respecting incremental constraints and outperforming ER on source, target, and their union (Table 2). The forgetting analysis (Figure 4) reinforces this trend: naive incremental fine-tuning exhibits large variations, ER reduces but does not eliminate them, and ReDIaL stays close to zero throughout, an observation confirmed by the time-averaged and harmonic metrics with near-zero average forgetting (Table 3).
Operating in the latent space is also practical: it preserves task-relevant features while sharply reducing storage compared with raw videos, enabling a replay buffer that scales to long domain sequences, and it improves privacy by avoiding the retention of identifiable frames (Section 4.1). Limitations include reliance on a fixed pre-trained encoder (COSMOS) and on keypoint quality for the posture stream; both can constrain downstream adaptation. While latent exemplar replay offers storage efficiency and privacy benefits by avoiding raw visual data, potential vulnerabilities such as latent inversion or membership inference attacks remain open concerns, warranting future exploration of privacy-preserving latent encoding or differentially private replay mechanisms.
Across all runs, performance remained highly stable, with near-saturated accuracy and minimal variance across domains, suggesting that formal statistical testing (e.g., confidence intervals or paired t-tests) would provide limited additional insight. The evaluation already includes standard continual learning (CL) metrics—average accuracy, harmonic mean, and forgetting—which together capture the intuition of BWT and ensure metric completeness. We acknowledge that sequential order may influence adaptation dynamics and plan to examine randomized domain orderings in future work. Finally, the observed performance plateau in later stages likely reflects a small-sample ceiling effect, where high per-domain accuracy limits measurable incremental gains.
While ReDIaL mitigates forgetting in a subject-as-domain setting, several focused steps remain for future work. (i) We plan to extend evaluation to larger gesture vocabularies and datasets, as well as to heterogeneous sensors (e.g., depth, IMU) and more diverse environments; this will help stress-test cross-domain generalization. (ii) For memory selection, we aim to move beyond clustering by comparing uncertainty-based and gradient-based strategies for building the replay buffer; in parallel, we will evaluate latent-space privacy and explore privacy-preserving replay techniques. (iii) On-robot studies are also critical; these will measure latency, energy use, memory footprint, and task success in noisy, real-time deployments, including trials with non-expert operators. (iv) Finally, we plan to extend to longer-horizon continual regimes that mix domain and class increments, include recurring users and varied sequential orderings of domains, and require robust calibration and out-of-distribution handling.
8. Conclusions
We introduced ReDIaL, a memory-efficient domain incremental learning framework for gesture-based HRI. By combining a frozen encoder for latent stability, clustering-based exemplar selection for representative memory, and balanced rehearsal for sequential adaptation, ReDIaL effectively mitigates catastrophic forgetting while enabling robust cross-user adaptation. Across 21 domains (one source and twenty subject-specific targets), ReDIaL achieves state-of-the-art incremental performance: it attains 97.34% accuracy on the combined old and new domains, improving over pooled fine-tuning by 5.47 percentage points, over incremental fine-tuning by 16.42 points, and over ER by 3.14 points (Table 2). Incremental performance analyses show consistently high accuracy over time (Figure 3) and near-zero forgetting across domain transitions (Figure 4); aggregate metrics confirm strong cross-domain robustness (Table 3).
These results demonstrate that latent exemplar replay not only improves accuracy and retention but also supports privacy and scalability, making it well suited to robotic deployments operating under tight compute and memory constraints. ReDIaL's combination of accuracy, retention, privacy, and scalability aligns with the requirements of human–robot collaboration, where systems must remain reliable across continually changing users, sensors, and environments. ReDIaL can front-end contact-rich, human-guided tasks by providing robust cross-user gesture commands under domain shift, while a downstream impedance-learning controller ensures contact stability in unknown environments. This perception–control pairing is consistent with evidence that impedance learning reduces tracking error and operator effort in carving/grinding tasks with real environments [51]. Looking ahead, future work will explore adaptive encoder updates, more informative exemplar selection strategies, and on-robot evaluations to further enhance robustness, privacy, and deployment readiness.