Article

Replay-Based Domain Incremental Learning for Cross-User Gesture Recognition in Robot Task Allocation

1 Department of Electrical and Computer Engineering, Kennesaw State University, Marietta, GA 30060, USA
2 Department of Information Technology, Kennesaw State University, Marietta, GA 30060, USA
* Author to whom correspondence should be addressed.
Electronics 2025, 14(19), 3946; https://doi.org/10.3390/electronics14193946
Submission received: 8 September 2025 / Revised: 29 September 2025 / Accepted: 2 October 2025 / Published: 6 October 2025
(This article belongs to the Special Issue Coordination and Communication of Multi-Robot Systems)

Abstract

Reliable gesture interfaces are essential for coordinating distributed robot teams in the field. However, models trained in a single domain often perform poorly when confronted with new users, different sensors, or unfamiliar environments. To address this challenge, we propose ReDIaL, a memory-efficient replay-based domain incremental learning (DIL) framework that adapts to sequential domain shifts while minimizing catastrophic forgetting. Our approach employs a frozen encoder to create a stable latent space and a clustering-based exemplar replay strategy to retain compact, representative samples from prior domains under strict memory constraints. We evaluate the framework on a multi-domain air-marshalling gesture recognition task, where an in-house dataset serves as the initial training domain and the NATOPS dataset provides 20 cross-user domains for sequential adaptation. During each adaptation step, training data from the current NATOPS subject is interleaved with stored exemplars to retain prior knowledge while accommodating new-domain variability. Across 21 sequential domains, our approach attains 97.34% accuracy in the domain incremental setting, exceeding pooled fine-tuning (91.87%), incremental fine-tuning (80.92%), and Experience Replay (94.20%) by +5.47, +16.42, and +3.14 percentage points, respectively. Performance also approaches the joint-training upper bound (98.18%), which represents the ideal case where data from all domains are available simultaneously. These results demonstrate that memory-efficient latent exemplar replay provides both strong adaptation and robust retention, enabling practical and trustworthy gesture-based human–robot interaction in dynamic real-world deployments.

1. Introduction

Human–robot interaction (HRI) is an interdisciplinary domain concerned with the analysis, design, and assessment of robotic systems that operate alongside or in collaboration with humans [1]. Communication forms the foundation of this interaction, and the manner in which it occurs depends strongly on whether the human and the robot share the same physical environment [1]. In high-noise settings such as construction zones, manufacturing floors, and airport ramps, standard voice-based or teleoperation-based communication frequently fails due to excessive background noise, limited maneuverability, and safety considerations. Consequently, dependence on these conventional communication methods may undermine both efficiency and user well-being, highlighting the need for alternative, context-sensitive communication strategies. In situations where verbal communication is hindered by noise, mobility limitations, or safety issues, hand gesture-based interaction provides a subtle and natural means for human–robot collaboration [2]. This nonverbal communication generally corresponds with human social behavior and can substantially reduce cognitive burden, making it more effective than verbal communication in challenging environments [2]. A classic example in practice is the system of air-marshalling signals: standardized hand gestures that have long enabled pilots and ground crews to coordinate aircraft movement reliably and unambiguously. Recent studies on vision-driven robotic systems have demonstrated effective recognition of ramp hand signals through deep learning and computer vision methods, thereby enhancing the autonomy and operational safety of unmanned vehicles [3,4,5]. These developments collectively establish gesture-based human–robot interaction as an essential framework for promoting natural, accessible, and reliable communication between humans and robots, especially in contexts where conventional verbal communication is unfeasible [2,4,5,6,7].
Gesture-based communication is a promising approach in HRI to improve ergonomics, decrease downtime, enhance safety, and boost resilience, allowing even children, older people, hearing- and speech-impaired individuals, or non-technical users with limited knowledge of robot operation to interact with robots [2,4,5,8,9,10,11]. However, one of the primary challenges in deploying robotic systems in real-world settings lies in the phenomenon of domain shift. Domain shift occurs when a model trained on data from one domain (the source) is deployed in another domain (the target) where it must perform the same task, yet the data distributions differ [12,13,14,15]. In deployed gesture-based HRI systems, such distributional differences can arise from illumination, background clutter, camera viewpoints, or variations in human gesture execution styles across facilities, operators, and time. These distributional differences severely impair model performance and render straightforward deployment impractical [16,17].
Domain incremental learning (DIL), which trains models to generalize across new domains encountered sequentially while keeping the task unchanged, is one method for addressing domain shift [18,19]. This approach helps HRI systems adapt to new domain knowledge, mirroring how a robot must generalize in real-world scenarios. One of the fundamental challenges in DIL is catastrophic forgetting: numerous studies have shown that when a neural network is trained on new tasks, its performance on previously learned tasks degrades drastically because the parameters critical for earlier knowledge are overwritten by the new learning [20,21,22,23,24,25,26]. This problem is particularly relevant to domain incremental learning, where a system must continuously learn from new users who may perform gestures in different environments and at different speeds, without losing proficiency in previously learned gesture patterns. Through effective retraining protocols and limited sample retention, research on continual gesture learning for HRI has shown great promise for incrementally learning new gesture classes while reducing forgetting [27,28,29,30,31]. However, a notable gap in the current literature is that most efforts have focused on class-incremental learning, whereas the challenge of adapting to new domains, such as different users and environments, within a DIL framework remains underexplored.
To address these challenges, our work focuses on designing gesture-based HRI systems that work reliably across a wide range of real-world situations. Specifically, we target the open problem of adapting to new users and environments under a domain incremental learning framework while minimizing catastrophic forgetting. Subject-as-domain DIL frames each subject, or human operator, as a distinct domain within a shared task and label space. Under this formulation, gesture patterns vary across users due to individual morphology, motion style, and environmental context, producing sequential domain shifts that challenge model generalization [32]. This formulation falls under the domain incremental learning paradigm, in which the input distribution changes across domains while the label space remains fixed [33]. Our study pursues two complementary objectives: (1) adaptation, to continuously learn new subject domains with high accuracy and robustness under shifting distributions; and (2) retention, to preserve prior-domain performance by minimizing catastrophic forgetting. Together, these objectives enable human-centered incremental learning, where the system remains dependable and inclusive across diverse users. We address the challenge of memory efficiency by investigating techniques that selectively keep and replay the most informative samples from previous domains. In addition, we emphasize the importance of robust evaluation protocols that not only assess recognition accuracy but also capture retention, forgetting, and resource consumption across long sequences of domain shifts. Together, these directions aim to bridge the gap between current class-incremental approaches and the practical requirements of domain-incremental gesture recognition in human–robot interaction. The main contributions of this work can be summarized as follows:
  • We present a multi-domain air-marshalling hand signal recognition framework under a DIL paradigm, facilitating robust adaptation to new users and environments without sacrificing prior knowledge.
  • We introduce a memory-efficient latent exemplar replay strategy in which latent embeddings of gesture videos, generated by a frozen encoder, are used for DIL training, thereby preserving user privacy.
  • We develop a clustering-based exemplar selection mechanism to identify and store the most representative samples from previously learned domains, thereby enhancing generalization in subsequent learning phases.
  • We investigate extensive evaluation metrics that go beyond accuracy, explicitly quantifying knowledge retention, forgetting, and resource utilization across extended sequences of domain shifts.
The remainder of this paper is organized as follows. Section 2 reviews related work on gesture-based domain incremental learning for HRI. Section 3 formalizes the subject-as-domain DIL setting and the twin goals of adaptation and retention. Section 4 introduces the proposed method, ReDIaL, and its components: latent embeddings (Section 4.1), the cross-modality model (Section 4.2), exemplar memory (Section 4.3.2), and balanced replay (Section 4.3.3). Section 5 covers the experimental setup for ReDIaL. Section 6 reports results across 21 domains with comprehensive evaluation metrics and analysis. Section 7 discusses implications for human–robot collaboration, and Section 8 concludes with future directions.

2. Related Work

2.1. Gesture Recognition for HRI and Multi-Robot Control

Gesture-based interfaces that are easy to use and require no technical knowledge can improve human–robot interaction [34]. This multidisciplinary field, rooted in early voice-gesture research from the 1990s, now spans applications in industrial robots, assistive robots, exoskeletons, and prostheses, and draws on cognitive science, ergonomics, IoT, big data, and virtual reality [35]. Utilizing Microsoft Kinect v2 for multi-robot interaction, Canal et al. (2015) [34] created a real-time gesture recognition system that achieved high identification rates in user tests; the system implemented both static and dynamic gesture recognition using weighted dynamic time warping and skeletal characteristics [34]. In their taxonomy of gesture-based human–swarm interaction, Alonso-Mora et al. (2015) divided techniques into two categories: shape-constrained interaction, which controls robot formations, and free-form interaction, which allows robots to be selected directly [36]. To guarantee single-operator control, Cicirelli et al. (2015) developed an HRI interface that combined quaternion joint-angle features from multiple Kinect cameras, neural network classifiers, and person re-identification [37]. In a more recent study [38], Nguyen et al. (2023) created a wireless system that recognizes four distinct gestures using deep neural networks and XGBoost in conjunction with Vicon motion capture technology, achieving an accuracy of roughly 60% for robot control applications. However, most prior HRI gesture systems assume fixed users and sensing setups and do not handle user- or environment-induced domain shifts or incremental adaptation; we address this gap with a multi-domain, domain incremental air-marshalling hand-signal recognizer that adapts to new users and contexts without sacrificing previously learned competence.

2.2. Continual, Lifelong, and Domain Incremental Learning

Continual learning addresses the challenge of gradually acquiring new information without forgetting previously learned material; deep neural networks are fundamentally limited by catastrophic forgetting [39]. Van de Ven et al. (2022) identified three basic continual learning scenarios: task incremental, domain incremental, and class incremental learning, each of which poses unique difficulties and calls for a different approach [33]. Beyond these basic scenarios, Xie et al. (2022) investigated a more intricate situation in which the distributions of classes and domains vary at the same time; they suggest a domain-aware approach that uses bi-level balanced memory and von Mises–Fisher mixture models to address intra-class domain imbalance [40]. To preserve model compactness while preventing forgetting, Hung et al. (2019) offer a complementary strategy that combines progressive network expansion, crucial weight selection, and model compression. Together, these studies show that effective continual learning requires dedicated architectures and training techniques that balance the stability–plasticity trade-off arising when learning from non-stationary data streams [41]. While continual-learning methods study catastrophic forgetting in general, few works instantiate a true domain-incremental HRI setting with a fixed label space and user-driven distribution shift, or report retention/forgetting over long domain sequences; our approach formalizes this setting and provides an evaluation that jointly measures accuracy, retention, forgetting, and resource usage.

2.3. Memory-Efficient Rehearsal Strategies

For continual learning in resource-constrained situations, such as robotic systems, memory-efficient rehearsal techniques are essential. ExStream was first presented by Hayes et al. (2018), who showed that while complete rehearsal can prevent catastrophic forgetting, their technique achieves similar results with far lower memory and computation needs [42]. By storing intermediate layer activations rather than raw input data, Pellegrini et al. (2019) introduced Latent Replay, which significantly lowers storage and processing requirements while preserving representation stability through regulated learning rates; this method made it possible to apply continual learning on smartphones in almost real time [43]. The gradient-matching coresets created by Balles et al. (2022) for exemplar selection do not require pre-training because they choose rehearsal samples by matching the gradients produced by the coreset to those of the original dataset [44]. Similarly, Yoon et al. (2021) developed Online Coreset Selection (OCS), which works well on standard, unbalanced, and noisy datasets by iteratively choosing representative and informative samples that optimize current-task adaptation while retaining a high affinity for previous tasks [45]. Existing rehearsal strategies often store raw inputs (raising storage/privacy concerns) or task-specific coresets; in contrast, we propose a privacy-preserving latent exemplar replay that stores compact video embeddings from a frozen encoder and uses clustering-based selection to maintain representative memories under tight budgets while supporting robust adaptation.

3. Problem Statement: Subject-As-Domain Incremental Gesture Learning

Our domain-incremental gesture recognition framework is based on clearly defined input spaces, a shared label space, and domain-indexed datasets. Two complementary modalities, (1) RGB frames ($R_t$) and (2) posture data ($L_t$), are used to represent each observation $x_t \in \mathcal{X}$. Both $R_t$ and $L_t$ have dimension $\mathbb{R}^{T \times C \times H \times W}$, where $T$ is the number of frames, $C$ is the number of channels, $H$ is the height of the frame, and $W$ is the width of the frame. The multimodal observation can therefore be expressed as:
$$x_t = \{R_t, L_t\}$$
Each modality is described in detail in Section 4.1. The set of gesture classes to be recognized is represented by the shared label space $\mathcal{Y}$, which is common across all domains. Each observation $x_t$ is associated with a distinct label $y_t \in \mathcal{Y}$. Formally, the dataset $D_t$ corresponding to domain $d_t$ is given by the following:
$$D_t = \{(x_t, y_t)\}$$
As the robot is incrementally exposed to new environments, domains are indexed over time (a concrete sketch of this domain stream follows the list below):
  • At time step $t = 0$, an initial dataset $D_0$ from the source domain $d_0$, which may correspond to an internally collected dataset in a controlled environment, is used to train the robot.
  • For subsequent time steps $t = 1, 2, \ldots$, the robot encounters datasets $D_t$ corresponding to new domains $d_t$. In our experimental setup, these domains represent distinct participants, each introducing variability in body morphology, environmental conditions, and gesture execution styles, while preserving the same label space $\mathcal{Y}$.
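To make the subject-as-domain stream concrete, the following minimal Python sketch builds such a sequence with a shared label space and a per-domain distribution shift; the synthetic generator, feature dimension, and shift parameter are illustrative assumptions rather than the authors' released code.

import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES = 6  # shared label space Y across all domains

def make_domain(n_samples: int, dim: int = 1024, shift: float = 0.0):
    # Synthetic stand-in for one domain: latent features drawn from a
    # shifted distribution to mimic subject-driven domain shift.
    f = rng.normal(loc=shift, scale=1.0, size=(n_samples, dim)).astype(np.float32)
    y = rng.integers(0, NUM_CLASSES, size=n_samples)
    return f, y

# d_0 is the in-house source domain; d_1..d_20 stand in for NATOPS subjects.
stream = [make_domain(2076)] + [make_domain(336, shift=0.1 * t) for t in range(1, 21)]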

3.1. Domain Shift Formulation

Let $x_t = \{R_t, L_t\}$ be the multimodal observation and $y_t \in \mathcal{Y}$ its gesture label (also referred to as class). Throughout this paper, the terms label and class are used interchangeably. The observation distribution in domain $d_t$ can be written as follows:
$$p_t(x_t \mid y_t) = p(x_t \mid v_t, S_t, B_t, y_t)$$
where
  • $v_t$ controls execution speed, affecting the temporal length $T$ of $x_t$;
  • $S_t$ controls subject appearance and morphology, affecting pixel statistics;
  • $B_t$ controls background/scene context, affecting visual content.
A domain shift occurs when any of these latent factors has a domain-dependent distribution. Let us assume that, for a new user in a new domain $d_{t+1}$, the distribution $p_{t+1}$ involves execution speed $v_{t+1}$, subject appearance and morphology $S_{t+1}$, and background/scene context $B_{t+1}$:
$$p_t(v_t, S_t, B_t \mid y_t) \neq p_{t+1}(v_{t+1}, S_{t+1}, B_{t+1} \mid y_{t+1}) \;\Rightarrow\; p_t(x_t \mid y_t) \neq p_{t+1}(x_{t+1} \mid y_{t+1})$$
even though the gesture labels remain identical across domains ($y_t = y_{t+1}$) within the shared label space $\mathcal{Y}$, the underlying data distributions differ. Hence, domain shifts in our problem are primarily driven by the following:
$$\underbrace{p_t(v_t)}_{\text{speed/time-warp } T}\,, \qquad \underbrace{p_t(S_t)}_{\text{appearance/style}}\,, \qquad \underbrace{p_t(B_t)}_{\text{background/context}}$$
each of which modifies a different generative component of $x_t$.
This formulation highlights that although the gesture labels remain identical across domains, subject-specific and environmental factors introduce significant distributional shifts in the input space. Addressing these discrepancies requires effective domain adaptation techniques that enable the model to generalize properly while retaining previously acquired knowledge.

3.2. Domain Incremental Learning Objective

Let $\{d_t\}_{t \geq 0}$ be a sequence of domains that share the same gesture label space $\mathcal{Y}$. At time step $t$, the model $M_t$ is trained on data from domain $d_t$ and achieves an accuracy:
$$A(M_t, d_t) = \Pr\!\left[\, M_t(x_t) = y_t \mid (x_t, y_t) \sim d_t \,\right]$$
which measures how well $M_t$ recognizes gestures within the distribution of $d_t$, where $\Pr(\cdot)$ is the probability.
When the robot is deployed in a new environment $d_{t+1}$ at time $t+1$, the data distribution changes as discussed in Section 3.1. A domain incremental learning process adapts to the domain $d_{t+1}$ by fine-tuning the previous model $M_t$ to produce an updated model $M_{t+1}$. The primary objective of this adaptation is twofold:
$$\text{maximize } A(M_{t+1}, d_{t+1}) \quad \text{while ensuring} \quad A(M_{t+1}, d_t) \geq A(M_t, d_t)$$
Here, '≥' denotes a no-degradation criterion: after adapting to the new domain $d_{t+1}$, the updated model $M_{t+1}$ should achieve accuracy on each previously learned domain $d_t$ that is at least the accuracy of the prior model $M_t$ on $d_t$; i.e., past-domain performance is maintained or improved. If $A(M_{t+1}, d_t) < A(M_t, d_t)$, the model has suffered from catastrophic forgetting, a well-known problem in continual learning where adaptation to new data leads to a severe drop in accuracy on past domains. Forgetting in a DIL setting can thus be defined as
$$F_{t \to t+1} = A(M_t, d_t) - A(M_{t+1}, d_t)$$
If $F_{t \to t+1} > 0$, catastrophic forgetting has occurred during adaptation to domain $d_{t+1}$. Thus, two fundamental and complementary goals are at the center of constructing the updated model $M_{t+1}$:
Goal 1. Adaptation: Achieve high $A(M_{t+1}, d_{t+1})$ by effectively learning the new domain distribution. This ensures the model remains reliable and accurate when deployed in novel environments.
Goal 2. Retention: Maintain $A(M_{t+1}, d_t) \geq A(M_t, d_t)$, or equivalently $F_{t \to t+1} \leq 0$, for all previously seen domains. This prevents catastrophic forgetting and guarantees that knowledge from earlier domains remains useful over time.
The simultaneous achievement of these goals is essential to maintaining strong, lifelong learning of domains in HRI. Without retention, adaptation runs the risk of causing catastrophic forgetting, in which previous environment knowledge is erased, decreasing the robot’s adaptability and dependability. On the other hand, focusing on retention without enough adaptation could result in poor generalization in unfamiliar settings, which would compromise usability in the real world. The basic problem of lifelong learning is thus embodied by gesture recognition across successive domains: allowing for ongoing adaptation to changing environments while preserving previously learned competencies.

4. Methodology

This section details our Replay-based Domain Incremental Learning (ReDIaL) pipeline for gesture recognition in HRI. As illustrated in Figure 1, the pipeline has four stages: (i) latent embedding generation with a frozen encoder E that maps RGB and posture streams into a stable, privacy-preserving feature space; (ii) a cross-modality recognition model that aggregates modality-specific representations into a unified embedding; (iii) clustering-based exemplar selection that stores a compact, diverse subset of past domains within a fixed memory budget; and (iv) balanced replay training that interleaves current-domain samples with stored exemplars to control the stability–plasticity trade-off. Freezing E stabilizes the feature manifold across subjects and reduces storage, while clustering improves mode coverage so that limited memory remains informative. The following subsections expand each component and present the training algorithm used for sequential adaptation.

4.1. Latent Embedding Generation

As described in Section 3, each observation $x_t$ contains two modalities, $R_t$ and $L_t$, which can be expressed as follows:
  • RGB frames ($R_t$): The RGB video input stream that contains visual information such as background, clothing, texture, color, appearance, gesture, and environmental context. This modality represents rich pixel-level information and preserves contextual cues that are necessary for differentiating gestures in diverse settings. Let $R_t$ denote the RGB video modality defined in the RGB video space $\mathcal{R}$, such that $R_t \in \mathcal{R}$ and $R_t \in \mathbb{R}^{T \times C \times H \times W}$, where $T$ represents the number of frames, $C$ denotes the number of channels, $H$ indicates the height, and $W$ specifies the width of the video $R_t$.
  • Posture data ($L_t$): This modality provides a structured representation of human motion derived from tracked body joints over time, which can be obtained either from the RGB stream using a keypoint detection algorithm [46,47] or from dedicated hardware sensors. Unlike raw RGB frames, it emphasizes geometric and kinematic information such as posture, dynamics, and gesture trajectories, which are less sensitive to environmental variations [48]. For implementation, each frame of $L_t$ is represented as a skeleton-based image generated from the detected body joints, yielding $L_t \in \mathbb{R}^{T \times C \times H \times W}$.
Since both modalities of $x_t$ have the same shape, we can write
$$x_t \in \mathbb{R}^{2 \times T \times C \times H \times W}$$
Here, the first dimension indexes the RGB and posture modalities. Directly storing $x_t$ for all observations is memory-expensive in a DIL setting, as each modality represents a full video sequence. To address this, we map $x_t$ into a compact latent representation using a shared encoder $E$. The latent embeddings are obtained as follows:
$$f_t^R = E(R_t) \in \mathbb{R}^{1 \times n}, \qquad f_t^L = E(L_t) \in \mathbb{R}^{1 \times n}$$
where $n$ is the embedding dimension per modality, and $f_t^R$ and $f_t^L$ denote the embeddings of $R_t$ and $L_t$, respectively. The final embedding is obtained by concatenating the latent embeddings of the two modalities:
$$f_t = [\, f_t^R \,\|\, f_t^L \,] \in \mathbb{R}^{1 \times 2n}$$
This is the compact latent representation of observation $x_t$ that is utilized for storage and domain incremental learning. This approach significantly reduces memory requirements while retaining modality-specific information in a joint feature space. Thus, $\{f_t, y_t\}$ represents the latent embedding–label pair corresponding to an observation–label pair $\{x_t, y_t\}$.
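As a hedged illustration of this embedding step, the PyTorch sketch below substitutes a simple frozen projection for the actual encoder (the paper uses COSMOS, whose API is not reproduced here); the shapes, pooling choice, and names are our assumptions.

import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    # Placeholder for the frozen encoder E; parameters are frozen so the
    # latent space stays stable across domains.
    def __init__(self, in_dim: int = 3 * 224 * 224, n: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, n)
        for p in self.parameters():
            p.requires_grad = False

    @torch.no_grad()
    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (T, C, H, W) -> mean-pool over frames, project to R^{1 x n}
        flat = video.flatten(start_dim=1).mean(dim=0, keepdim=True)
        return self.proj(flat)

E = FrozenEncoder()
R_t = torch.randn(16, 3, 224, 224)          # RGB clip with T = 16 frames
L_t = torch.randn(16, 3, 224, 224)          # skeleton-image clip, same shape
f_t = torch.cat([E(R_t), E(L_t)], dim=-1)   # f_t in R^{1 x 2n}; (f_t, y_t) is what gets stored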

4.2. Model Overview

A cross-modality recognition model $M_t$ is used as the core classifier within our gesture-based HRI framework. An overview of $M_t$ is given in Figure 2. Given the latent embeddings $f_t^R$ and $f_t^L$ from the RGB and posture modalities, respectively, two parallel transformer encoders $T^R$ and $T^L$ are applied to obtain modality-specific contextual representations:
$$h_t^R = T^R(f_t^R), \qquad h_t^L = T^L(f_t^L)$$
These outputs are passed through a Query-based Global Attention Pooling (QGAP) mechanism, parameterized by learnable queries $Q^R$ and $Q^L$, to produce compact embeddings:
$$e_t^R = \mathrm{QGAP}(h_t^R, Q^R), \qquad e_t^L = \mathrm{QGAP}(h_t^L, Q^L)$$
Finally, a unified representation $e_t$ corresponding to the gesture observation $x_t$ is obtained by concatenating the two modality embeddings:
$$e_t = [\, e_t^R \,\|\, e_t^L \,]$$
which is then fed to the classification head $G$ to predict the gesture label $\hat{y}_t$. This design enables $M_t$ to jointly reason over complementary information from both modalities while keeping a compact representation suitable for domain incremental learning.
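Because the exact parameterization of QGAP is not spelled out above, the following sketch assumes standard scaled dot-product attention with a single learnable query per modality; the module and tensor names are ours, not the paper's reference implementation.

import torch
import torch.nn as nn

class QGAP(nn.Module):
    # Query-based Global Attention Pooling: a learnable query attends over
    # the transformer's token sequence and returns one pooled embedding.
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, dim))      # learnable query Q

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (seq_len, dim) contextual tokens from a modality-specific encoder
        scores = (self.query @ h.T) / h.shape[-1] ** 0.5    # (1, seq_len)
        return scores.softmax(dim=-1) @ h                   # (1, dim) pooled embedding

pool_R, pool_L = QGAP(512), QGAP(512)
h_R, h_L = torch.randn(16, 512), torch.randn(16, 512)      # transformer outputs
e_t = torch.cat([pool_R(h_R), pool_L(h_L)], dim=-1)        # unified (1, 1024) embedding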

4.3. Domain Incremental Learning Strategy

A well-trained model $M_t$ that has converged on domain $d_t$ must learn from the incoming data of the new domain $d_{t+1}$ while preserving all previously acquired knowledge. To maintain this balance, we incorporate three key components: (i) a target-domain set $\mathcal{N}$, which stores samples from $d_{t+1}$; (ii) an exemplar memory $\mathcal{E}$, which retains representative instances from $d_t$; and (iii) a replay mechanism that jointly presents data from $\mathcal{N}$ and $\mathcal{E}$ during each optimization step. A thorough discussion of these components is provided in the subsequent sections of this paper.

4.3.1. Target Domain Set Collection

In the target domain $d_{t+1}$, the data distribution may change due to lighting, viewpoint, background, or user motion style and speed. To support adaptation, we build a target domain set
$$\mathcal{N} = \left\{ \left( f_{t+1}^{(j)}, y_{t+1}^{(j)} \right) \right\}_{j=1}^{n_{t+1}}, \qquad f_{t+1}^{(j)} = E\!\left( x_{t+1}^{(j)} \right)$$
where $E$ is the same frozen encoder used for previous domains. Using a fixed encoder ensures that the embeddings from $\mathcal{N}$ lie in the same latent space as the earlier domains.
This consistency serves two purposes: (i) Adaptation: $\mathcal{N}$ supplies in-domain examples for fine-tuning; (ii) Shift assessment: comparing the feature statistics or empirical distributions of $\mathcal{N}$ with those of all previously seen domains $\{d_i \mid i \leq t\}$ quantifies the domain shift and guides the adaptation procedure.

4.3.2. Clustering-Based Exemplar Memory Selection

Because retaining all past data is impractical, we maintain a fixed-size exemplar memory governed by a budget $\beta \in (0, 1]$. For domain $d_t$ with dataset $D_t = \{ (f_t^{(i)}, y_t^{(i)}) \}_{i=1}^{n_t}$, the number of exemplars to keep is
$$\alpha = \lfloor \beta \, n_t \rfloor$$
where $\lfloor \cdot \rfloor$ denotes the floor operation. We apply a clustering method $C$ to the set $\{ f_t^{(i)} \}_{i=1}^{n_t}$, producing clusters $\{ C_k \}_{k=1}^{\alpha}$ and centroids $\{ \lambda_k \}_{k=1}^{\alpha}$. For each cluster, we select the medoid (the embedded sample closest to its centroid):
$$i_k = \arg\min_{i \in C_k} \left\| f_t^{(i)} - \lambda_k \right\|_2, \qquad k = 1, \ldots, \alpha$$
The exemplar buffer is then
$$\mathcal{E} = \left\{ \left( f_t^{(i_k)}, y_t^{(i_k)} \right) \right\}_{k=1}^{\alpha}$$
This clustering-based selection yields broad mode coverage of the domain’s embedding distribution, producing a more informative and balanced memory than random or majority-class sampling, and thereby mitigating catastrophic forgetting during subsequent adaptation.
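As a concrete instance of this selection step, the sketch below uses k-means for the clustering method $C$ (an assumption; the formulation above leaves $C$ generic) and keeps one medoid per cluster; the synthetic embeddings and function names are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def select_exemplars(f: np.ndarray, y: np.ndarray, beta: float = 0.2, seed: int = 0):
    # Keep alpha = floor(beta * n_t) medoids: for each k-means cluster,
    # the embedded sample closest to its centroid.
    alpha = int(np.floor(beta * len(f)))
    km = KMeans(n_clusters=alpha, n_init=10, random_state=seed).fit(f)
    keep = []
    for k in range(alpha):
        members = np.flatnonzero(km.labels_ == k)
        dists = np.linalg.norm(f[members] - km.cluster_centers_[k], axis=1)
        keep.append(members[np.argmin(dists)])              # medoid index i_k
    keep = np.asarray(keep)
    return f[keep], y[keep]                                 # exemplar buffer for this domain

# Example: 336 latent embeddings from one domain -> 67 exemplars at beta = 0.2
f_dom = np.random.default_rng(0).normal(size=(336, 1024)).astype(np.float32)
y_dom = np.random.default_rng(1).integers(0, 6, size=336)
f_mem, y_mem = select_exemplars(f_dom, y_dom)

Note that in the experiments of Section 5.1 this budget is applied per class rather than over the whole domain at once.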

4.3.3. Balanced Multi-Domain Replay Training

Training proceeds with mini-batches that contain an equal number of samples from the new domain and the exemplar memory. Let $\mu$ denote the per-domain batch size, $B$ the mini-batch, $B_{\mathcal{E}}$ the batch sampled from $\mathcal{E}$, and $B_{\mathcal{N}}$ the batch sampled from $\mathcal{N}$. Here, $\mathrm{Uniform}$ denotes a function that randomly selects $\mu$ samples from a given set ($\mathcal{E}$ or $\mathcal{N}$). At each step, we draw
$$B_{\mathcal{E}} \sim \mathrm{Uniform}(\mathcal{E}, \mu), \qquad B_{\mathcal{N}} \sim \mathrm{Uniform}(\mathcal{N}, \mu)$$
and form a balanced mini-batch by concatenation:
$$B = \mathrm{concat}\left( B_{\mathcal{E}}, B_{\mathcal{N}} \right), \qquad |B| = 2\mu$$
Model parameters $\theta$ (of $M_{t+1}$) are updated by minimizing the cross-entropy over $B$ with learning rate $\eta$:
$$\mathcal{L}(\theta; B) = \frac{1}{|B|} \sum_{(f, y) \in B} \mathrm{CE}\!\left( M_{t+1}(f), y \right), \qquad \theta \leftarrow \theta - \eta \nabla_{\theta} \mathcal{L}(\theta; B)$$
Training continues for a fixed budget of steps or until a validation criterion is met. This replay schedule preserves exposure to past domains while adapting to the target domain, balancing stability and plasticity. We summarize this method as Algorithm 1.
Algorithm 1 Balanced Replay for Domain Incremental Adaptation
Require: Model $M_t$, exemplar memory $\mathcal{E}$, new-domain set $\mathcal{N}$, batch size $\mu$, learning rate $\eta$, optimizer $W$, max steps $S$
Ensure: Updated model $M_{t+1}$
1: $\theta \leftarrow \theta_t$
2: for each training step $s = 1, 2, \ldots, S$ do    ▹ or until validation convergence
3:   $B_{\mathcal{E}} \leftarrow \mathrm{SampleUniform}(\mathcal{E}, \mu)$
4:   $B_{\mathcal{N}} \leftarrow \mathrm{SampleUniform}(\mathcal{N}, \mu)$
5:   $B \leftarrow \mathrm{concat}(B_{\mathcal{E}}, B_{\mathcal{N}})$    ▹ balanced batch of size $2\mu$
6:   Compute loss $\mathcal{L} = \frac{1}{|B|} \sum_{(f, y) \in B} \mathrm{CE}(M_{t+1}(f), y)$
7:   Update parameters $\theta \leftarrow W(\theta, \nabla_{\theta} \mathcal{L}, \eta)$
8: end for
9: return updated model $M_{t+1}$
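For completeness, a minimal runnable Python/PyTorch rendering of this loop over latent embedding–label tensors is sketched below; the linear head, optimizer choice, and tensor shapes are placeholder assumptions, not the paper's exact configuration.

import torch
import torch.nn.functional as F_nn

def adapt(model, optimizer, exemplars, new_domain, mu=32, steps=200):
    # Balanced replay (Algorithm 1): each batch mixes mu exemplar samples
    # with mu new-domain samples drawn uniformly at random.
    (f_E, y_E), (f_N, y_N) = exemplars, new_domain
    for _ in range(steps):                                  # or until validation convergence
        idx_E = torch.randint(len(f_E), (mu,))              # B_E ~ Uniform(E, mu)
        idx_N = torch.randint(len(f_N), (mu,))              # B_N ~ Uniform(N, mu)
        f = torch.cat([f_E[idx_E], f_N[idx_N]])             # balanced batch, |B| = 2*mu
        y = torch.cat([y_E[idx_E], y_N[idx_N]])
        loss = F_nn.cross_entropy(model(f), y)              # mean CE over the batch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model                                            # updated M_{t+1}

# Toy usage with a linear head standing in for the classifier:
model = torch.nn.Linear(1024, 6)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
E_mem = (torch.randn(414, 1024), torch.randint(0, 6, (414,)))
N_new = (torch.randn(336, 1024), torch.randint(0, 6, (336,)))
model = adapt(model, opt, E_mem, N_new)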

5. Experimental Setup

5.1. Datasets

We evaluate on two datasets comprising the same six air-marshalling hand-signal classes. Dataset 1 is an in-house collection from 8 participants. We treat it as the source domain $d_0$ and use COSMOS [49] as the frozen encoder $E$ to obtain latent embeddings $f_t = E(x_t)$. For training only, we apply four augmentations to the videos: identity (i.e., none), perspective, rotation, and padding, while keeping validation and test sets unaugmented. The split for $d_0$ is 2076 training samples, 121 validation samples, and 210 test samples.
Dataset 2 is the NATOPS dataset [50], from which we use the same six classes performed by 20 participants. We model each participant as a separate domain $d_t$ for $t \in \{1, \ldots, 20\}$. Each subject provides 20 samples per class (120 per subject). Here, each class corresponds to a gesture label, and in the remainder of this section and subsequent analyses, the terms class and label are used interchangeably. Per class, we allocate 14 samples for training, 2 for validation, and 4 for testing; with the four augmentations (including identity) applied to the training set, this yields 14 × 4 = 56 training items per class and 56 × 6 = 336 training items per domain. Validation and test remain unaugmented, with 2 × 6 = 12 and 4 × 6 = 24 samples per domain, respectively.
For the exemplar memory $\mathcal{E}$, we employ a per-class clustering budget $\beta = 0.2$. Applying the budget per class to $d_0$ (346 training samples per class) yields $\lfloor 0.2 \times 346 \rfloor = 69$ exemplars per class, i.e., $69 \times 6 = 414$ exemplars in total. For each NATOPS domain, applying the budget per class selects $\lfloor 0.2 \times 56 \rfloor = 11$ embeddings per class, i.e., $11 \times 6 = 66$ exemplars per domain. Thus, after processing all 20 NATOPS domains, the memory contains $414 + 20 \times 66 = 1734$ exemplars in total.
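This bookkeeping can be verified with a few lines of Python; treating the 2076 source samples as 346 per class follows from the stated per-class budget and the six classes.

import math

beta, num_classes = 0.2, 6

d0_per_class = 2076 // num_classes                          # 346 training samples per class
d0_exemplars = math.floor(beta * d0_per_class) * num_classes          # 69 * 6 = 414

natops_per_class = 56                                       # augmented training items per class
natops_exemplars = math.floor(beta * natops_per_class) * num_classes  # 11 * 6 = 66

total = d0_exemplars + 20 * natops_exemplars                # 414 + 1320 = 1734
print(d0_exemplars, natops_exemplars, total)                # -> 414 66 1734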

5.2. Comparison Techniques

To contextualize the performance of our DIL method (ReDIaL), we compare against several standard baselines that probe domain shift, adaptation, and forgetting. Let $D^{(1)} = \{d_0^{(1)}\}$ denote Dataset 1 (source) and $D^{(2)} = \{d_1^{(2)}, \ldots, d_{20}^{(2)}\}$ denote the 20 participant-specific domains from Dataset 2 (target). We write $T(d_t)$ for the test set of domain $d_t$, and use the pooled target test set $T^{(2)} = \bigcup_{t=1}^{20} T(d_t^{(2)})$. The following baseline models are considered for comparison:
  • Source-only $M_0$: trained on $d_0^{(1)}$ and evaluated on $T(d_0^{(1)})$ and $T^{(2)}$. Purpose: quantifies cross-dataset/domain shift when no target data are used.
  • Target-only (pooled) $\bar{M}_{20}$: trained on all target domains pooled, $\bigcup_{t=1}^{20} d_t^{(2)}$, and evaluated on $T^{(2)}$ and $T(d_0^{(1)})$. Purpose: serves as an oracle upper bound for target performance and gauges reverse shift toward the source domain.
  • Joint-training model $M_J$: trained on the combined data of $D^{(1)}$ and $D^{(2)}$, and tested on $T^{(2)}$ and $T(d_0^{(1)})$. Purpose: serves as a baseline when source and target domain data are available at once. This model provides an upper bound under the assumption that all domain data are available from the start, which is unrealistic in practice because new domain data are continuously introduced as the robot is deployed in various environments over time.
  • Fine-tune (pooled) $M_{FT}$: initialized from $M_0$ (trained on $d_0^{(1)}$), then fine-tuned on the pooled target data $\bigcup_{t=1}^{20} d_t^{(2)}$ without any access to $d_0^{(1)}$ during fine-tuning. Purpose: measures adaptation to the target with potential catastrophic forgetting of the source.
  • Incremental fine-tune (no rehearsal) $M_t^{IFT}$: starting from $M_0$, sequentially fine-tuned on $d_1^{(2)}, \ldots, d_{20}^{(2)}$ with no samples from previously seen domains. Purpose: provides a strong lower bound for retention, highlighting forgetting in a purely sequential setting.
  • Experience Replay (ER) [20]: a state-of-the-art rehearsal baseline. Purpose: compares ReDIaL against ER to assess the effectiveness of ReDIaL in an HRI system.
These baselines are included to (i) quantify domain shift (source-only vs. target-only), (ii) establish upper/lower bounds on target accuracy and retention, and (iii) isolate the benefits of ReDIaL’s rehearsal-based, latent-exemplar DIL training relative to naïve fine-tuning strategies.

5.3. Implementation Details

For posture data, we used MediaPipe [46] for keypoint detection. For each modality-specific encoder $T^R$ and $T^L$ in $M_t$, adaptive average pooling was employed to obtain a 512-dimensional spatial embedding. Each encoder was implemented as a two-layer Transformer with four attention heads per layer and a dropout rate of 0.20.
For all time steps $t \in \{0, \ldots, 20\}$, we use the same training configuration. Identical hyperparameters are applied to ReDIaL and to all baselines ($\bar{M}_{20}$, $M_J$, $M_{FT}$, $M_t^{IFT}$). A reduce-on-plateau scheduler lowers the learning rate when the validation loss does not improve, and early stopping selects the best checkpoint. The complete set of hyperparameters employed across all models and domains is presented in Table 1.

5.4. Evaluation Metrics

To evaluate the performance of the recognition model, we adopt accuracy $A$ as the primary evaluation metric, following prior works [4,5]. For simplicity, $A(M_t, d_t)$ will be denoted as $A_t$, where $t \in \{0, 1, \ldots, 20\}$. To further characterize the stability–plasticity behavior of the model, we compute two aggregate metrics:
$$\bar{A} = \frac{1}{t+1} \sum_{i=0}^{t} A_i, \qquad \bar{H} = \frac{t+1}{\sum_{i=0}^{t} \frac{1}{A_i}}$$
Here, $\bar{A}$ represents the average accuracy, while $\bar{H}$ denotes the harmonic mean accuracy, providing a balanced measure of performance across domains. In addition, we quantify catastrophic forgetting, as discussed in Section 3, as
$$F_t = A(M_t, d_t) - A(M_{t+1}, d_t)$$
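A small Python helper makes these metrics concrete; it assumes per-domain accuracies are stored as plain floats, and the function names are ours.

import numpy as np

def average_and_harmonic(acc):
    # Average accuracy (A-bar) and harmonic-mean accuracy (H-bar) over the
    # domains seen so far.
    acc = np.asarray(acc, dtype=float)
    return acc.mean(), len(acc) / np.sum(1.0 / acc)

def forgetting(acc_before, acc_after):
    # F_t = A(M_t, d_t) - A(M_{t+1}, d_t); positive values indicate
    # catastrophic forgetting on d_t after adapting to d_{t+1}.
    return acc_before - acc_after

a_bar, h_bar = average_and_harmonic([0.9172, 0.8698, 0.9610])   # illustrative values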

6. Results and Analysis

6.1. Result Analysis: Evidence of Domain Shift and the Advantage of ReDIaL

Table 2 summarizes performance on the source test set $T(d_0^{(1)})$, the pooled target test set $T^{(2)}$, and their union $T(d_0^{(1)}) \cup T^{(2)}$. We focus here on two aspects: (i) demonstrating the severity and asymmetry of the domain shift and (ii) showing how ReDIaL achieves strong adaptation to new domains while maintaining high performance on the source, approaching the non-incremental upper bound.
The source-only model $M_0$ attains 86.78% on the source yet only 24.69% on the pooled target, a drop of 62.09 percentage points. Conversely, the target-only (pooled) model $\bar{M}_{20}$ achieves 100% on the pooled target but just 22.31% on the source, a drop of 77.69 points. These large, cross-directional gaps (86.78% → 24.69% and 100% → 22.31%) quantify a severe and asymmetric domain shift between the datasets: models trained in one domain generalize poorly to the other, even though the label space $\mathcal{Y}$ is identical. The union-set accuracies reinforce this domain shift: $M_0$ yields only 37.15% overall, while $\bar{M}_{20}$ reaches 84.41% but remains weak on the source domain, again highlighting the underlying distribution difference.
The baseline joint-training model $M_J$, which assumes simultaneous access to source and all target data, achieves 90.91% on the source domain, 100% on the target domain, and 98.18% on the overall domain test set. This setting is unrealistic for real deployments, since in practice the domains are not accessible prior to deployment; however, it is useful as an upper bound on performance achievable without incremental constraints. Fine-tuning without rehearsal on pooled target data ($M_{FT}$) yields perfect target performance (100%) but reduces source accuracy to 59.50%, indicating substantial drift away from the source distribution. Sequential fine-tuning with no rehearsal ($M_t^{IFT}$) improves target accuracy to 87.76% but further drops source performance to 53.72%, leading to only 80.92% on the union. These results show that naive adaptation strategies can succeed on the new domain while underperforming on the original one, a clear indication of strong domain shift. Standard Experience Replay (ER) substantially narrows the cross-domain gap: 84.30% (source), 96.68% (target), and 94.20% overall.
ReDIaL improves upon ER across the time steps $t$, achieving 88.43% on the source (+4.13 points over ER), 99.58% on the target (+2.90), and 97.34% overall (+3.14). Notably, ReDIaL approaches the non-incremental joint-training model $M_J$ in every scenario (source: 88.43% vs. 90.91%, target: 99.58% vs. 100%, union: 97.34% vs. 98.18%), but under incremental constraints. This suggests that ReDIaL's design, namely a frozen encoder for a stable latent space, clustering-based exemplar selection for broad mode coverage, and balanced rehearsal, effectively counters the domain shift while retaining prior knowledge.
The large cross-domain performance gaps of M 0 and M ¯ 20 provide direct, quantitative evidence of a significant and asymmetric domain shift. While naive fine-tuning strategies can overfit to the target distribution, they struggle to preserve performance on the source. By contrast, ReDIaL delivers high accuracy on both domains and on their union, nearly matching joint-training performance without violating the incremental learning protocol.

6.2. Overall Incremental Performance

Figure 3 reports accuracy over time for ReDIaL, $M_t^{IFT}$, and ER, since all three models are trained sequentially on subject-as-domain streams; at each step $t$, we report accuracy on the cumulative test set of all domains observed up to $t$.
The incremental fine-tuning model without rehearsal, $M_t^{IFT}$, exhibits the lowest and most variable performance across the sequence. Accuracy drops early from about 85.5% at $t_1$ to roughly 74.6% at $t_2$, recovers partially around mid-sequence (e.g., near $t_{13}$), and ends near 80.9% at $t_{20}$. This pattern is consistent with strong sensitivity to subject-specific shifts and limited retention of earlier domains when no replay mechanism is available.
ER attains higher accuracy than $M_t^{IFT}$ throughout and begins close to ReDIaL (approximately 89% at $t_1$ versus 91.7% for ReDIaL). Its accuracy generally improves over time but shows notable drops around some intermediate domains (e.g., near $t_{12}$), reaching about 94.2% by $t_{20}$. These variations suggest that a uniform replay buffer helps maintain performance but can be sensitive to domain idiosyncrasies when the stored exemplars are not sufficiently representative of earlier distributions.
ReDIaL starts strong at $t_1$ (91.72%), experiences a noticeable decrease at $t_2$ (86.98%), and thereafter increases, reaching 96.1% around $t_{12}$, 97.36% by $t_{17}$, and 97.34% at $t_{20}$. Relative to ER, ReDIaL maintains a higher accuracy level with smaller fluctuations across the sequence. This behavior aligns with the method's design: a frozen encoder that stabilizes the latent space across subjects, clustering-based exemplar selection that improves coverage of prior domains, and balanced replay that jointly exposes the model to past and current data.
Across sequential domains, $M_t^{IFT}$ exhibits substantial variability and the lowest final accuracy, ER mitigates this effect but remains sensitive to domain shifts, and ReDIaL achieves the highest accuracy throughout most of the sequence and at the final step. The temporal trends provide qualitative evidence that rehearsal is necessary for cross-domain stability and that the proposed latent-exemplar strategy is particularly effective in this domain-incremental setting.

6.3. Catastrophic Forgetting Across Domains

Figure 4 shows forgetting $F_t$ at each time step, defined as the change in accuracy on the previous domain immediately after learning the next domain, as given in Equation (17). We compare the proposed replay strategy (ReDIaL) against incremental fine-tuning without rehearsal ($M_t^{IFT}$) and a standard experience replay baseline (ER), and we highlight time steps that are particularly susceptible to forgetting.
$M_t^{IFT}$ exhibits the most inconsistent forgetting trend across the sequence. In early steps, forgetting rises sharply (e.g., around $t_2$), then flips to a strongly negative value at $t_3$, and rises again near $t_4$. In subsequent time steps, another large positive spike appears around $t_{15}$ (annotated at approximately +8.45% in the figure), followed by a deep negative drop close to $t_{17}$ (around −7.5%). This alternating pattern indicates that, in the absence of rehearsal, adaptation to a new subject can both overwrite previous knowledge and occasionally induce accidental improvements when two consecutive domains happen to be similar. Overall, the variability underscores the vulnerability of naive fine-tuning to subject-driven domain shifts.
ER substantially reduces forgetting relative to $M_t^{IFT}$, but the curve still fluctuates across incremental time steps. In particular, large positive spikes are visible around $t_{12}$ and $t_{19}$ (both near +7.75%), while a marked negative value appears at $t_{20}$ (approximately −5.52%). These events suggest that although replay stabilizes learning, the exemplar set may not always store sufficient diversity from earlier domains, leaving the model sensitive to subjects that differ strongly in appearance, background, or execution speed.
ReDIaL keeps forgetting close to zero throughout most of the sequence. The curve shows small early negative values (e.g., about −4.12% at $t_1$) that reflect mild backward transfer, followed by fluctuations of very low magnitude and convergence to zero by $t_{20}$. The near-flat profile indicates that the combination of a frozen encoder, clustering-based exemplar selection, and balanced replay effectively constrains parameter drift when adapting to a new subject.
Across methods, several time steps stand out as particularly prone to forgetting: early transitions (e.g., $t_2$ to $t_4$) and mid-to-late transitions (e.g., $t_{12}$, $t_{15}$, $t_{19}$). These coincide with subject changes that likely introduce stronger shifts in appearance, background, or execution speed. Among the three models, $M_t^{IFT}$ is the most affected, ER mitigates but does not eliminate these spikes, and ReDIaL exhibits the smallest amplitude with a stable end-of-sequence value. The results thus indicate that rehearsal is essential in this domain incremental setting and that ReDIaL's latent-exemplar replay provides the most reliable retention of previously learned domains.

6.4. Comparison with the Other Methods

Table 3 reports the time-averaged cumulative accuracy $\bar{A}$, the harmonic mean $\bar{H}$ across domains, and the average forgetting $F_t$. ReDIaL achieves the highest $\bar{A}$ (94.89%) and $\bar{H}$ (94.82%), slightly outperforming ER (93.11% and 93.01%) and showing a substantial margin over both fine-tuning baselines. The very small gap between $\bar{A}$ and $\bar{H}$ for ReDIaL and ER (both < 0.2%) indicates consistently strong performance across domains, whereas $M_{FT}$ exhibits a noticeably lower harmonic mean, suggesting degraded generalization to some domains.
ReDIaL and ER also show near-zero or slightly negative average forgetting, reflecting not only effective retention but occasional backward transfer. In contrast, $M_{FT}$ suffers from severe forgetting (+27.28%), and while $M_t^{IFT}$ averages close to zero, it displays high per-step volatility, as seen in the forgetting curves. Together, these results demonstrate that ReDIaL provides the most favorable balance of accuracy and retention, outperforming both naive fine-tuning and standard experience replay in this domain-incremental setting. Table 3 also isolates the contribution of each key module in the proposed ReDIaL framework by comparing models with progressively added components: latent embedding, clustering, and balanced replay. The results demonstrate that models without replay (e.g., $M_{FT}$ and $M_t^{IFT}$) suffer from severe forgetting ($F_t > 0$), confirming the necessity of a replay mechanism. Incorporating replay in the ER baseline markedly reduces forgetting and improves average performance, isolating the benefit of memory-based rehearsal. Finally, combining clustering for representative exemplar selection with balanced replay for stable adaptation in ReDIaL yields the highest $\bar{A}$ and $\bar{H}$, indicating that these modules work together to improve cross-domain generalization and knowledge retention. Overall, this comparison shows how each component contributes to the whole, serving as an ablation analysis.

7. Discussion

Our findings provide clear evidence of asymmetric domain shift in HRI gesture recognition. As shown in Table 2, a source-only model transfers poorly to the pooled target set, while a target-only model fails on the source—despite an identical label space. This pattern is consistent with the subject- and environment-driven factors identified in Section 3.1 (appearance, background, execution speed) and motivates a domain incremental solution that adapts to new users without erasing prior knowledge.
ReDIaL addresses this need through three complementary components: a frozen encoder that stabilizes the latent space across subjects, clustering-based exemplar selection that improves coverage of earlier domains, and balanced replay that mixes past and current data at every update (Algorithm 1). Combined with our cross-modality design (Section 4.1), these choices yield high and steady accuracy over time (Figure 3), approaching the joint-training upper bound while respecting incremental constraints, and outperforming ER on source, target, and their union (Table 2). The forgetting analysis (Figure 4) reinforces this trend: naive incremental fine-tuning ($M_t^{IFT}$) exhibits large variations, ER reduces but does not eliminate them, and ReDIaL stays close to zero throughout, an observation reinforced by the time-averaged and harmonic metrics with near-zero average forgetting (Table 3).
Operating in the latent space is also practical: it preserves task-relevant features while sharply reducing storage compared with raw videos, enabling a replay buffer that scales to long domain sequences, and it improves privacy by avoiding the retention of identifiable frames (Section 4.1). Limitations include reliance on a fixed pre-trained encoder (COSMOS) and on keypoint quality for the posture stream; both can constrain downstream adaptation. While latent exemplar replay offers storage efficiency and privacy benefits by avoiding raw visual data, potential vulnerabilities such as latent inversion or membership inference attacks remain open concerns, warranting future exploration of privacy-preserving latent encoding or differentially private replay mechanisms.
Across all runs, performance remained highly stable, with near-saturated accuracy and minimal variance across domains, suggesting that formal statistical testing (e.g., confidence intervals or paired t-tests) would provide limited additional insight. The evaluation already includes standard continual learning (CL) metrics—average accuracy, harmonic mean, and forgetting—which together capture the intuition of BWT and ensure metric completeness. We acknowledge that sequential order may influence adaptation dynamics and plan to examine randomized domain orderings in future work. Finally, the observed performance plateau in later stages likely reflects a small-sample ceiling effect, where high per-domain accuracy limits measurable incremental gains.
While ReDIaL mitigates forgetting in a subject-as-domain setting, several focused steps remain for future work. (i) We plan to extend evaluation to larger gesture vocabularies and datasets, as well as to heterogeneous sensors (e.g., depth, IMU) and more diverse environments; this will help stress-test cross-domain generalization. (ii) For memory selection, we aim to move beyond clustering by comparing uncertainty-based and gradient-based strategies for building the replay buffer; in parallel, we will evaluate latent-space privacy and explore privacy-preserving replay techniques. (iii) On-robot studies are also critical: these will measure latency, energy use, memory footprint, and task success in noisy, real-time deployments, including trials with non-expert operators. (iv) Finally, we plan to extend to longer-horizon continual regimes that mix domain and class increments, include recurring users and different sequential orderings of domains, and require robust calibration and out-of-distribution handling.

8. Conclusions

We introduced ReDIaL, a memory-efficient domain incremental learning framework for gesture-based HRI. By combining a frozen encoder for latent stability, clustering-based exemplar selection for representative memory, and balanced rehearsal for sequential adaptation, ReDIaL effectively mitigates catastrophic forgetting while enabling robust cross-user adaptation. Across 21 domains (one source and twenty subject-specific targets), ReDIaL achieves state-of-the-art incremental performance: it attains 97.34% accuracy on the combined old and new domains, improving over pooled fine-tuning by 5.47 percentage points, over incremental fine-tuning by 16.42 points, and over ER by 3.14 points (Table 2). Incremental performance analyses show consistently high accuracy over time (Figure 3) and near-zero forgetting across domain transitions (Figure 4); aggregate metrics confirm strong cross-domain robustness (Table 3).
These results demonstrate that latent exemplar replay not only improves accuracy and retention but also supports privacy and scalability, making it well-suited for robotic deployments operating under efficient compute and memory constraints. ReDIaL’s combination of accuracy, retention, privacy, and scalability aligns well with the requirements of human–robot collaboration, where systems must remain reliable across continually changing users, sensors, and environments. ReDIaL can front-end contact-rich, human-guided tasks by providing robust cross-user gesture commands under domain shift, while a downstream impedance-learning controller ensures contact stability in unknown environments. This perception–control pairing is consistent with evidence that impedance learning reduces tracking error and operator effort in carving/grinding tasks with real environments [51]. Looking ahead, future work will explore adaptive encoder updates, more informative exemplar selection strategies, and on-robot evaluations to further enhance robustness, privacy, and deployment readiness.

Author Contributions

Conceptualization, K.K.P. and J.Z.; methodology, K.K.P. and P.D.; validation, K.K.P. and P.D.; formal analysis, K.K.P.; investigation, K.K.P. and P.D.; data curation, K.K.P.; writing—original draft preparation, K.K.P.; writing—review and editing, K.K.P. and P.D.; visualization, K.K.P. and P.D.; supervision, J.Z.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the NSF under Grant CCSS-2245607.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are restricted because the system and dataset are still under active development and refinement.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Goodrich, M.A.; Schultz, A.C. Human–robot interaction: A survey. Found. Trends Hum.-Comput. Interact. 2008, 1, 203–275.
  2. Wang, X.; Shen, H.; Yu, H.; Guo, J.; Wei, X. Hand and arm gesture-based human-robot interaction: A review. In Proceedings of the 6th International Conference on Algorithms, Computing and Systems, Larissa, Greece, 16–18 September 2022; pp. 1–7.
  3. de Frutos Carro, M.Á.; López-Hernández, F.C.; Granados, J.J.R. Real-time visual recognition of ramp hand signals for UAS ground operations. J. Intell. Robot. Syst. 2023, 107, 44.
  4. Podder, K.K.; Zhang, J.; Wu, Y. IHSR: A Framework Enables Robots to Learn Novel Hand Signals from a Few Samples. In Proceedings of the 2024 IEEE International Conference on Advanced Intelligent Mechatronics (AIM), Boston, MA, USA, 15–19 July 2024; pp. 959–965.
  5. Podder, K.K.; Zhang, J.; Mao, S. Trustworthy Hand Signal Communication Between Smart IoT Agents and Humans. In Proceedings of the 2024 IEEE 10th World Forum on Internet of Things (WF-IoT), Ottawa, ON, Canada, 10–13 November 2024; pp. 595–600.
  6. Beeri, E.B.; Nissinman, E.; Sintov, A. Robust Dynamic Gesture Recognition at Ultra-Long Distances. arXiv 2024, arXiv:2411.18413.
  7. Xie, J.; Xu, Z.; Zeng, J.; Gao, Y.; Hashimoto, K. Human–Robot Interaction Using Dynamic Hand Gesture for Teleoperation of Quadruped Robots with a Robotic Arm. Electronics 2025, 14, 860.
  8. Podder, K.K.; Chowdhury, M.; Mahbub, Z.B.; Kadir, M. Bangla sign language alphabet recognition using transfer learning based convolutional neural network. Bangladesh J. Sci. Res. 2020, 31, 20–26.
  9. Podder, K.K.; Chowdhury, M.E.; Tahir, A.M.; Mahbub, Z.B.; Khandakar, A.; Hossain, M.S.; Kadir, M.A. Bangla sign language (BdSL) alphabets and numerals classification using a deep learning model. Sensors 2022, 22, 574.
  10. Podder, K.K.; Ezeddin, M.; Chowdhury, M.E.; Sumon, M.S.I.; Tahir, A.M.; Ayari, M.A.; Dutta, P.; Khandakar, A.; Mahbub, Z.B.; Kadir, M.A. Signer-independent Arabic sign language recognition system using deep learning model. Sensors 2023, 23, 7156.
  11. Podder, K.K.; Zhang, J.; Wang, L. Universal Sign Language Recognition System Using Gesture Description Generation and Large Language Model. In Proceedings of the International Conference on Wireless Artificial Intelligent Computing Systems and Applications, Qingdao, China, 21–23 June 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 279–289.
  12. Xu, J.; Xiao, L.; López, A.M. Self-supervised domain adaptation for computer vision tasks. IEEE Access 2019, 7, 156694–156706.
  13. Luo, Y.; Zheng, L.; Guan, T.; Yu, J.; Yang, Y. Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2507–2516.
  14. Venkateswara, H.; Chakraborty, S.; Panchanathan, S. Deep-learning systems for domain adaptation in computer vision: Learning transferable feature representations. IEEE Signal Process. Mag. 2017, 34, 117–129.
  15. Tanveer, M.H.; Fatima, Z.; Zardari, S.; Guerra-Zubiaga, D. An in-depth analysis of domain adaptation in computer and robotic vision. Appl. Sci. 2023, 13, 12823.
  16. Kouw, W.M.; Loog, M. An introduction to domain adaptation and transfer learning. arXiv 2018, arXiv:1812.11806.
  17. Baptista, J.; Santos, V.; Silva, F.; Pinho, D. Domain adaptation with contrastive simultaneous multi-loss training for hand gesture recognition. Sensors 2023, 23, 3332.
  18. Shi, H.; Wang, H. A unified approach to domain incremental learning with memory: Theory and algorithm. Adv. Neural Inf. Process. Syst. 2023, 36, 15027–15059.
  19. Lamers, C.; Vidal, R.; Belbachir, N.; van Stein, N.; Bäck, T.; Giampouras, P. Clustering-based domain-incremental learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Paris, France, 2–3 October 2023; pp. 3384–3392.
  20. Rolnick, D.; Ahuja, A.; Schwarz, J.; Lillicrap, T.; Wayne, G. Experience replay for continual learning. Adv. Neural Inf. Process. Syst. 2019, 32, 350–360.
  21. Houyon, J.; Cioppa, A.; Ghunaim, Y.; Alfarra, M.; Halin, A.; Henry, M.; Ghanem, B.; Van Droogenbroeck, M. Online distillation with continual learning for cyclic domain shifts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2437–2446.
  22. Li, S.; Su, T.; Zhang, X.; Wang, Z. Continual Learning with Knowledge Distillation: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 9798–9818.
  23. Kutalev, A.; Lapina, A. Stabilizing Elastic Weight Consolidation method in practical ML tasks and using weight importances for neural network pruning. arXiv 2021, arXiv:2109.10021.
  24. Aich, A. Elastic weight consolidation (EWC): Nuts and bolts. arXiv 2021, arXiv:2105.04093.
  25. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526.
  26. Lee, S.W.; Kim, J.H.; Jun, J.; Ha, J.W.; Zhang, B.T. Overcoming catastrophic forgetting by incremental moment matching. Adv. Neural Inf. Process. Syst. 2017, 30, 4655–4665.
  27. Cucurull, X.; Garrell, A. Continual learning of hand gestures for human-robot interaction. arXiv 2023, arXiv:2304.06319.
  28. Ding, Q.; Liu, D.; Ai, J.; Yin, P.; Wang, F.; Han, S. Comparisons on incremental gesture recognition with different one-class sEMG envelopes. Biomed. Signal Process. Control 2025, 103, 107421.
  29. Calinon, S.; Billard, A. Incremental learning of gestures by imitation in a humanoid robot. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, Arlington, VA, USA, 9–11 March 2007; pp. 255–262.
  30. Aich, S.; Ruiz-Santaquiteria, J.; Lu, Z.; Garg, P.; Joseph, K.; Garcia, A.F.; Balasubramanian, V.N.; Kin, K.; Wan, C.; Camgoz, N.C.; et al. Data-free class-incremental hand gesture recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 20958–20967.
  31. Wang, Z.; She, Q.; Chalasani, T.; Smolic, A. CatNet: Class incremental 3D ConvNets for lifelong egocentric gesture recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 230–231.
  32. Xu, X.; Zou, Q.; Lin, X. Alleviating human-level shift: A robust domain adaptation method for multi-person pose estimation. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2326–2335.
  33. Van de Ven, G.M.; Tuytelaars, T.; Tolias, A.S. Three types of incremental learning. Nat. Mach. Intell. 2022, 4, 1185–1197.
  34. Canal, G.; Angulo, C.; Escalera, S. Gesture based human multi-robot interaction. In Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), Killarney, Ireland, 12–17 July 2015; pp. 1–8.
  35. Su, H.; Qi, W.; Chen, J.; Yang, C.; Sandoval, J.; Laribi, M.A. Recent advancements in multimodal human–robot interaction. Front. Neurorobot. 2023, 17, 1084000.
  36. Alonso-Mora, J.; Lohaus, S.H.; Leemann, P.; Siegwart, R.Y.; Beardsley, P.A. Gesture based human–multi-robot swarm interaction and its application to an interactive display. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 5948–5953.
  37. Cicirelli, G.; Attolico, C.; Guaragnella, C.; D’Orazio, T. A Kinect-based gesture recognition approach for a natural human robot interface. Int. J. Adv. Robot. Syst. 2015, 12, 22.
  38. Nguyen, K.H.; Pham, A.D.; Minh, T.B.; Phan, T.T.T.; Do, X.P. Gesture Recognition Model with Multi-Tracking Capture System for Human-Robot Interaction. In Proceedings of the 2023 International Conference on System Science and Engineering (ICSSE), Ho Chi Minh City, Vietnam, 27–28 July 2023; pp. 6–11.
  39. Parisi, G.I.; Kemker, R.; Part, J.L.; Kanan, C.; Wermter, S. Continual lifelong learning with neural networks: A review. Neural Netw. 2019, 113, 54–71.
  40. Xie, J.; Yan, S.; He, X. General Incremental Learning with Domain-aware Categorical Representations. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 21–24 June 2022; pp. 14331–14340.
  41. Hung, S.C.Y.; Tu, C.H.; Wu, C.E.; Chen, C.H.; Chan, Y.M.; Chen, C.S. Compacting, Picking and Growing for Unforgetting Continual Learning. arXiv 2019, arXiv:1910.06562.
  42. Hayes, T.L.; Cahill, N.D.; Kanan, C. Memory Efficient Experience Replay for Streaming Learning. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 9769–9776.
  43. Pellegrini, L.; Graffieti, G.; Lomonaco, V.; Maltoni, D. Latent Replay for Real-Time Continual Learning. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 10203–10209.
  44. Balles, L.; Zappella, G.; Archambeau, C. Gradient-Matching Coresets for Rehearsal-Based Continual Learning. arXiv 2022, arXiv:2203.14544.
  45. Yoon, J.; Madaan, D.; Yang, E.; Hwang, S.J. Online Coreset Selection for Rehearsal-based Continual Learning. arXiv 2021, arXiv:2106.01085.
  46. Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.L.; Yong, M.G.; Lee, J.; et al. MediaPipe: A framework for building perception pipelines. arXiv 2019, arXiv:1906.08172.
  47. Martínez, G.H. OpenPose: Whole-Body Pose Estimation. Ph.D. Dissertation, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA, 2019.
  48. Wang, Q.; Zhang, K.; Asghar, M.A. Skeleton-based ST-GCN for human action recognition with extended skeleton graph and partitioning strategy. IEEE Access 2022, 10, 41403–41410.
  49. Kim, S.; Xiao, R.; Georgescu, M.I.; Alaniz, S.; Akata, Z. COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training. arXiv 2024, arXiv:2412.01814.
  50. Song, Y.; Demirdjian, D.; Davis, R. Tracking body and hands for gesture recognition: NATOPS aircraft handling signals database. In Proceedings of the 2011 IEEE International Conference on Automatic Face & Gesture Recognition (FG), Santa Barbara, CA, USA, 21–25 March 2011; pp. 500–506.
  51. Xing, X.; Burdet, E.; Si, W.; Yang, C.; Li, Y. Impedance learning for human-guided robots in contact with unknown environments. IEEE Trans. Robot. 2023, 39, 3705–3721.
Figure 1. Overview of the proposed ReDIaL framework. At each step, inputs are encoded into a stable latent space by a frozen encoder, representative exemplars are selected via clustering, and balanced replay combines current and stored samples to adapt to new domains while preserving prior knowledge. π is a frozen action decoder for robot action generation and lies outside the scope of the proposed ReDIaL method.
Figure 2. Cross-modality model $M_t$: RGB and posture embeddings are encoded by modality-specific transformers and pooled via learnable queries, then concatenated and classified to predict the gesture label $\hat{y}_t$. Here, ‘*’ represents query-based self-attention.
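As a hedged illustration of the query-based pooling in Figure 2, the sketch below uses a learnable query that attends over each modality’s token sequence, then concatenates the pooled vectors for classification. Module names, dimensions, and the use of `nn.MultiheadAttention` are assumptions rather than the authors’ exact architecture.

```python
# Sketch (assumed architecture): learnable-query pooling per modality,
# concatenation, and a linear classification head, as depicted in Figure 2.
import torch
import torch.nn as nn

class QueryPool(nn.Module):
    """Pool a (B, T, D) token sequence into (B, D) with one learnable query."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # query attends over tokens
        return pooled.squeeze(1)

class CrossModalityClassifier(nn.Module):
    """Concatenate pooled RGB and posture features and predict gesture logits."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.pool_rgb, self.pool_pose = QueryPool(dim), QueryPool(dim)
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, rgb_tokens, pose_tokens):
        z = torch.cat([self.pool_rgb(rgb_tokens), self.pool_pose(pose_tokens)], dim=-1)
        return self.head(z)
```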
Figure 3. Accuracy comparison of ReDIaL, ER, and $M_t^{\mathrm{IFT}}$.
Figure 4. Forgetting comparison across time steps for ReDIaL, ER, and $M_t^{\mathrm{IFT}}$. Positive values denote forgetting; negative values denote backward transfer.
Table 1. Training hyperparameters used for all models and domains. The learning-rate (LR) scheduler reduces the LR by the stated factor if the validation loss does not improve for the specified patience; early stopping monitors the same criterion.
| Hyperparameter | Value |
|---|---|
| Batch size | 6 |
| Optimizer | Adam |
| Initial learning rate | $1 \times 10^{-4}$ |
| LR scheduler patience (epochs) | 15 |
| LR reduction factor | 0.1 |
| Early stopping patience (epochs) | 35 |
| Max epochs | 100 |
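For concreteness, the following is a minimal PyTorch sketch of the schedule in Table 1. Here `model`, `train_one_epoch`, and `val_loss` are assumed helpers, and the authors’ actual training loop may differ in detail.

```python
# Sketch of the Table 1 training setup: Adam at 1e-4, ReduceLROnPlateau
# (factor 0.1, patience 15), early stopping (patience 35), max 100 epochs.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=15)

best_val, epochs_without_improvement = float("inf"), 0
for epoch in range(100):                              # max epochs
    train_one_epoch(model, train_loader, optimizer)   # batch size 6
    current = val_loss(model)
    scheduler.step(current)          # LR x0.1 after 15 stale epochs
    if current < best_val:
        best_val, epochs_without_improvement = current, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= 35:          # early stopping
            break
```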
Table 2. Accuracy (%) of the models evaluated on different test domains.
| Model | $T(d_0^{(1)})$ | $T^{(2)}$ | $T(d_0^{(1)}) \cup T^{(2)}$ |
|---|---|---|---|
| $M_0$ | 86.78 | 24.69 | 37.15 |
| $\bar{M}_{20}$ | 22.31 | 100 | 84.41 |
| $M_j$ | 90.91 | 100 | 98.18 |
| $M_{FT}$ | 59.50 | 100 | 91.87 |
| $M_t^{\mathrm{IFT}}$ | 53.72 | 87.76 | 80.92 |
| ER | 84.30 | 96.68 | 94.20 |
| ReDIaL | 88.43 | 99.58 | 97.34 |
Table 3. Comparison of different methods based on $\bar{A}$, $\bar{H}$, and $F_t$. A checkmark (✓) indicates the component is included in the model, while a cross (✕) indicates it is not.
| Model | Latent Embedding | Clustering | Balanced Replay | $\bar{A}$ (%) | $\bar{H}$ (%) | $F_t$ (%) |
|---|---|---|---|---|---|---|
| $M_{FT}$ | ✕ | ✕ | ✕ | 79.75 | 74.61 | 27.28 |
| $M_t^{\mathrm{IFT}}$ | ✕ | ✕ | ✕ | 78.93 | 78.71 | 0.25 |
| ER | ✕ | ✕ | ✓ | 93.11 | 93.01 | −0.28 |
| ReDIaL | ✓ | ✓ | ✓ | 94.89 | 94.82 | −0.21 |
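The aggregates in Table 3 can be reproduced from a matrix of per-domain accuracies. The sketch below assumes $\bar{A}$ is the arithmetic mean of the final per-domain accuracies, $\bar{H}$ their harmonic mean, and $F_t$ the standard average forgetting (peak earlier accuracy minus final accuracy); these definitions are inferred from common continual-learning usage, not quoted from the paper.

```python
# Sketch (assumed metric definitions) for the Table 3 aggregates.
import numpy as np

def aggregate_metrics(acc_matrix: np.ndarray):
    """acc_matrix[t, d] = accuracy on domain d evaluated after training step t."""
    final = acc_matrix[-1]                    # accuracies after the last step
    a_bar = final.mean()                      # average accuracy
    h_bar = len(final) / np.sum(1.0 / final)  # harmonic mean
    # Forgetting over every domain seen before the final step: best earlier
    # accuracy minus final accuracy (negative values => backward transfer).
    prev_best = acc_matrix[:-1].max(axis=0)[:-1]
    f_t = np.mean(prev_best - final[:-1])
    return a_bar, h_bar, f_t
```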
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
