1. Introduction
The use of robots, both mobile and stationary, in fields such as industrial automation, assistive technology, telepresence, entertainment, and scientific research has significantly increased in recent years [1]. With the rapid progress of Artificial Intelligence (AI) and computer vision, interactions between humans and robots have become increasingly common [2]. Among emerging methods, Intelligent Human–Robot Interaction (HRI) techniques are gaining attention as alternatives to conventional control systems [3]. Traditional approaches often rely on physical input devices such as joysticks or keyboards [4]. However, unlike these hardware-based interfaces, facial gesture-based HRI allows users to operate robots through intuitive and contactless expressions, making interaction more natural and accessible even for inexperienced users [5].
HRI research aims to create innovative designs and intuitive interfaces, typically grouped into four categories: wearable sensors, speech recognition, gesture-based systems, and user-friendly remote controls [6].
Table 1 summarizes the common advantages and disadvantages of typical HRI systems. In recent years, facial gesture recognition (FGR) systems have attracted growing interest due to the expressive and intuitive nature of facial movements as a means of communication [7,8]. Unlike hand gestures, facial gestures such as head tilts, eye blinks, or mouth movements can be performed with minimal effort, require no additional hardware, and remain effective even when the user’s hands are occupied, making them particularly suitable for contactless robot control [9].
Facial gesture recognition systems utilize various data sources, generally divided into two categories: sensor-based facial gesture recognition (S-FGR) [10] and vision-based facial gesture recognition (V-FGR) [11]. These categories differ primarily in their data acquisition techniques, data types, and training methodologies [12]. Sensor-based FGR typically employs specialized wearable devices or electrodes that capture muscle activity and subtle facial movements through electromyography (EMG) or inertial measurement units (IMUs) [13,14]. Such sensor data tends to be robust against external variations like lighting and background noise, and often requires less computational effort since the signals are directly obtained without complex image processing [15]. In contrast, vision-based FGR analyzes 2D images or video sequences captured by cameras, which makes it more accessible and cost-effective, as no additional hardware beyond a camera is needed [6,16]. Consequently, the majority of recent research has concentrated on V-FGR due to its simplicity and scalability [17].
Vision-based methods generally follow two approaches: extracting hand-crafted features [15] or employing deep learning to automatically learn features [10,18]. Hand-crafted methods typically utilize predefined facial landmarks, geometric features, or texture descriptors to recognize gestures [19]. Although these traditional approaches can be computationally efficient, they often lack adaptability to varying facial expressions and environmental conditions [20,21]. On the other hand, deep learning methods, particularly convolutional neural networks (CNNs), automatically learn hierarchical features from raw data, leading to superior accuracy and robustness [10,22]. However, these methods generally impose higher computational requirements and demand large annotated datasets for effective training [8,12,13].
Facial gestures can also be classified as either static or dynamic [12]. Static gestures involve holding a specific facial expression or pose for a certain duration [23], such as raising eyebrows or blinking, whereas dynamic gestures consist of a sequence of movements over time, like nodding or mouth movements [19]. While deep learning techniques have demonstrated excellent performance in recognizing static gestures due to their consistent visual patterns, dynamic gesture recognition remains more challenging [24]. The temporal nature of dynamic gestures adds complexity, resulting in an increased computational load and often lower accuracy compared to static gestures [25]. To address the temporal dimension of dynamic facial gestures, tracking algorithms combined with deep learning have been employed [9,26]. For instance, some works integrate pose estimation frameworks with tracking methods such as Kalman filters or DeepSORT [27] to extract and maintain consistent facial landmarks over video frames, facilitating temporal gesture classification. While these methods improve recognition robustness, they further increase computational complexity, which can hinder real-time performance [28].
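For illustration, a minimal sketch of this kind of landmark tracking is given below: a constant-velocity Kalman filter smooths one facial landmark coordinate across frames. The landmark stream and noise settings are hypothetical, and the sketch is far simpler than the cited DeepSORT-style pipelines.

```python
import numpy as np

def smooth_landmark(xs, dt=1.0, q=1e-3, r=1e-2):
    """Constant-velocity Kalman filter over a 1D landmark coordinate (illustrative).

    xs : sequence of noisy x-positions of one facial landmark (hypothetical input).
    Returns one filtered position per frame.
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition over (position, velocity)
    H = np.array([[1.0, 0.0]])              # only the position is observed
    Q = q * np.eye(2)                       # process noise
    R = np.array([[r]])                     # measurement noise
    x = np.array([[xs[0]], [0.0]])          # initial state
    P = np.eye(2)                           # initial covariance
    out = []
    for z in xs:
        # predict
        x = F @ x
        P = F @ P @ F.T + Q
        # update with the new measurement
        y = np.array([[z]]) - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ y
        P = (np.eye(2) - K @ H) @ P
        out.append(float(x[0, 0]))
    return out
```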
Despite the advances in facial gesture recognition, many state-of-the-art methods [17,26,29,30] still struggle with balancing accuracy and efficiency. Real-time human–robot interaction systems require lightweight, fast, and reliable models that can operate under resource constraints without sacrificing safety or user experience [31]. To this end, several CNN and 3D CNN architectures [7,30,32] have been proposed, offering improved feature extraction and temporal modeling capabilities. However, these approaches typically entail high computational costs, motivating the search for efficient alternatives that maintain competitive performance [28,33].
To address these gaps, we propose GBA, a lightweight and efficient facial gesture recognition network that leverages the GhostNet feature extractor [34] and incorporates a BiLSTM with an attention mechanism [35]. This system accurately recognizes both static and dynamic facial gestures while maintaining low computational overhead. Furthermore, we developed a 3D robot simulation environment in Unity, enabling smooth and intuitive robot control through socket communication. This setup offers a faster and more user-friendly interface compared to previous approaches.
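For concreteness, a minimal PyTorch sketch of how such a GhostNet-plus-BiLSTM pipeline can be assembled is shown below. It uses the ghostnet_100 backbone from the timm model zoo as the per-frame feature extractor; layer sizes are illustrative rather than the exact configuration reported in this paper, and the attention modules are omitted here for brevity.

```python
import torch
import torch.nn as nn
import timm  # assumption: GhostNet weights taken from the timm model zoo


class GBASketch(nn.Module):
    """Illustrative GhostNet -> BiLSTM -> classifier pipeline (sizes are hypothetical)."""

    def __init__(self, num_classes=13, hidden=256):
        super().__init__()
        # Per-frame spatial features from a GhostNet backbone (one pooled vector per frame).
        self.backbone = timm.create_model("ghostnet_100", pretrained=True, num_classes=0)
        feat_dim = self.backbone.num_features
        # Bidirectional LSTM models the temporal evolution of the gesture.
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, clip):                        # clip: (batch, frames, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1))   # (batch*frames, feat_dim)
        feats = feats.view(b, t, -1)
        seq, _ = self.bilstm(feats)                 # (batch, frames, 2*hidden)
        return self.classifier(seq.mean(dim=1))     # class logits; softmax applied by the loss
```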
Overall, our novel approach leverages a streamlined set of facial gestures combined with a 3D simulation environment for remote robot operation. By integrating socket communication, the system functions as a virtual simulation platform, laying the groundwork for more advanced and contactless control methods. Compared to traditional joystick controllers and sensor-based interaction systems [3,12,21,31], our proposed solution offers improved speed, reliability, and user-friendliness. We summarize our main contributions in this work as follows:
Employed GhostNet-based spatial features, achieving 99.13% accuracy (↑4.3%) at 30 FPS with only 1.1 GFLOPs, which makes the model suitable for real-time use on embedded GPUs.
Incorporated BiLSTM networks to capture sequential dependencies in gesture patterns, improving robustness to temporal variations.
Integrated spatial and temporal attention modules to focus on the most salient facial regions and critical time frames, boosting classification accuracy.
Designed a Unity3D-based virtual environment where recognized facial gestures are converted into robot control commands via socket communication for interactive operation.
Developed a fully trainable architecture that unifies spatial feature extraction, temporal modeling, and attention in a single efficient framework for human–robot interaction.
The rest of this paper is organized as follows. Related work on facial gesture recognition is reviewed in Section 2, followed by a detailed description of our proposed method in Section 3. The experimental setup, results, and analysis are presented in Section 4, with further discussion in Section 5. Finally, Section 6 concludes the paper and outlines future research directions.
Table 1. Human–robot interaction systems can be classified into several types, as outlined in [3,12,21,31].
| HRI Method | Pros | Cons |
|---|---|---|
| Remote Control Devices | Familiar interface; low latency; precise control | Requires physical devices; limited mobility |
| Wearable Tracking Sensors | Accurate motion tracking; enables continuous monitoring | Intrusive; costly hardware; user discomfort |
| Voice-Based Interaction | Hands-free interaction; intuitive for commands | Sensitive to noise; language dependent |
| Facial Gesture Interface | Contactless; intuitive and expressive; no additional hardware needed | Sensitive to lighting; limited gesture vocabulary; computationally intensive |
4. Experimental Evaluation
Experiments for the proposed system were conducted to evaluate the performance of vision-based dynamic facial gesture recognition, computational efficiency, and real-time applicability for interactive control tasks. The evaluation was carried out using the publicly available FaceGest [59] dataset, which provides 13 distinct facial gesture classes recorded under diverse conditions. Our focus was on assessing classification accuracy, robustness across different gesture categories, and inference speed in comparison with existing baseline methods.
4.1. System Setup and Configuration
The proposed system was evaluated using a cross-platform client–server configuration. The Unity-based 3D simulation environment acted as the server and was deployed on a Windows 11 Home workstation (HP Pavilion Gaming Desktop TG01-1xxx, HP Inc., Palo Alto, CA, USA) equipped with an Intel® Core™ i5-10400F CPU at 2.90 GHz, 32 GB of RAM, and a Realtek Gaming GbE network interface. This server maintained a TCP socket connection to receive gesture commands from the remote client, acknowledged each command, and updated the virtual environment in real time, enabling responsive simulation control. The client consisted of the FGR module (GBA), which was trained and executed on an Ubuntu 22.04.5 LTS workstation with a 3.50 GHz Intel Core i9-10920X CPU, dual NVIDIA GeForce RTX 3090 GPUs, and 32 GB of RAM. The model was trained for 350 epochs with early stopping enabled. Categorical cross-entropy was used as the loss function, and the Adam optimizer was employed for optimal weight updates. The tanh activation function was used in the BiLSTM layers, while softmax activation was applied at the final output layer for gesture classification.
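A condensed sketch of this training setup is given below, assuming the GBA model and data loaders already exist; the learning rate and early-stopping patience are illustrative placeholders rather than the exact values used, and PyTorch's CrossEntropyLoss stands in for categorical cross-entropy with the softmax applied internally.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=350, patience=20, lr=1e-3, device="cuda"):
    """Adam + categorical cross-entropy with simple early stopping (illustrative values)."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()            # categorical cross-entropy (softmax inside)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, stale = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for clips, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(clips.to(device)), labels.to(device))
            loss.backward()
            optimizer.step()
        # validation loss drives early stopping
        model.eval()
        with torch.no_grad():
            val_loss = sum(
                criterion(model(c.to(device)), y.to(device)).item()
                for c, y in val_loader
            ) / len(val_loader)
        if val_loss < best_val:
            best_val, stale = val_loss, 0
            torch.save(model.state_dict(), "gba_best.pt")  # keep the best checkpoint
        else:
            stale += 1
            if stale >= patience:
                break                             # early stopping
```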
Fifteen subjects participated in testing the proposed system. The FGR (GBA) module was installed on one machine and the Unity 3D simulator on another, connected via their IP addresses as shown in Figure 4. The communication mechanism is illustrated in Figure 3. To evaluate real-world applicability, the simulation was also replaced with a physical ground robot, interfaced via an ATmega32U4 microcontroller, enabling the direct execution of commands predicted by the FGR system.
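As an illustration of the client side of this link, the sketch below sends a recognized gesture label to the Unity server over TCP and waits for its acknowledgement. The host address, port, and newline-delimited text format are assumptions for the example, not the exact protocol used.

```python
import socket

def send_gesture(label, host="192.168.0.10", port=5005, timeout=1.0):
    """Send one gesture command to the Unity server and return its acknowledgement.

    The host/port values and the newline-terminated text format are hypothetical placeholders.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall((label + "\n").encode("utf-8"))   # forward the predicted gesture
        ack = sock.recv(1024).decode("utf-8").strip()  # server acknowledges each command
    return ack

# Example: forward the latest prediction to the simulator.
# print(send_gesture("nod"))
```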
Averaged test results for each class were recorded and are presented in Table 4 and Table 5. Comparisons with existing methods are reported in Table 6 and Table 7. Each subject performed the gestures listed in Table 4. Unlike systems based on wearable gloves or auxiliary sensors, the vision-based gesture recognition allowed subjects to perform gestures naturally, without maintaining strict distances or requiring extensive training.
4.2. Evaluation of Vision-Based Face Gesture Classification
The proposed GBA pipeline integrates GhostNet for lightweight spatial feature extraction, followed by bidirectional LSTM layers with temporal attention for dynamic gesture sequence modeling. To quantitatively assess performance, we computed classification metrics on the test split of the FaceGest dataset, including precision, recall, F1-score, and overall accuracy (Table 5).
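These metrics can be reproduced from the raw predictions with standard tooling; a minimal scikit-learn sketch is shown below, where y_true and y_pred are placeholders for the test-split labels and the model's predicted classes.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

def evaluate(y_true, y_pred):
    """Per-class precision/recall/F1 plus overall accuracy and the confusion matrix."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=None, zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,        # one value per gesture class
        "recall": recall,
        "f1": f1,
        "confusion": confusion_matrix(y_true, y_pred),
    }
```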
The confusion matrix (Figure 5) demonstrates the ability of our model to distinguish between eye-based, mouth-based, head-based, and combined gestures, with minimal inter-class confusion. Classes with visually subtle differences, such as blink versus double blink, were also accurately recognized due to the temporal attention mechanism highlighting key frames.
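A minimal form of this kind of temporal attention is sketched below: each BiLSTM output frame is scored by a small learned network and the frames are combined by their softmax-normalized weights. The dimensions and module structure are illustrative, not the exact attention design used in GBA.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Additive attention over BiLSTM frame outputs (illustrative sketch)."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, seq):                               # seq: (batch, frames, dim)
        weights = torch.softmax(self.score(seq), dim=1)   # (batch, frames, 1)
        context = (weights * seq).sum(dim=1)              # attention-weighted summary
        return context, weights.squeeze(-1)               # weights show which frames mattered
```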
To further evaluate the training behavior of the proposed model, we analyzed the variation in accuracy and loss over epochs. As shown in Figure 6, the training and validation accuracy (Figure 6a) steadily increases, while the training and validation loss (Figure 6b) decreases consistently, indicating smooth convergence without overfitting. These curves confirm the stability of the model training process and support the effectiveness of the proposed architecture for dynamic facial gesture recognition.
In comparison with baseline dynamic gesture recognition methods from the literature, our approach consistently achieved higher accuracy and significantly reduced the computational cost, making it suitable for deployment on resource-constrained systems. The evaluation confirmed that the integration of attention modules into GhostNet and LSTM layers improved the model’s robustness in varying lighting conditions and across different subjects.
4.3. Evaluation Based on Lap Time
One of the standard performance indicators in gesture-based control systems is the lap time [60,61], which measures the total time taken from the initiation of a gesture to the completion of the corresponding action. The primary objective of the proposed system is to provide a natural, safe, and intuitive interface for controlling robots and UAVs, particularly for users with no prior control experience. Therefore, lap time was selected as a key evaluation metric.
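In practice, lap time can be logged with a simple timer around each command cycle. The sketch below illustrates this under the assumption that recognize() blocks until a gesture is detected and execute() blocks until the commanded action completes; both are hypothetical callables standing in for the FGR module and the simulator interface.

```python
import time

def measure_lap_time(recognize, execute, n_commands):
    """Average time from gesture initiation to completion of the commanded action."""
    laps = []
    for _ in range(n_commands):
        start = time.perf_counter()
        label = recognize()        # hypothetical: returns a gesture label once detected
        execute(label)             # hypothetical: returns when the robot/simulator finishes
        laps.append(time.perf_counter() - start)
    return sum(laps) / len(laps)
```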
A total of 15 participants took part in the experiments, performing the complete set of facial gestures defined in the FaceGest dataset. The system was evaluated in a virtual simulation environment implemented in Unity3D 6.0. For each participant, the time required to execute the control commands using all gestures was measured and averaged to obtain the lap time (shown in Table 7 and Table 8).
The results demonstrated that the proposed GBA architecture, enhanced with spatial and temporal attention mechanisms, enabled faster and more responsive control compared to existing vision-based gesture recognition systems. The lap times achieved by our system were significantly lower, confirming its suitability for non-expert users. Additionally, the consistently low detection time ensured that commands were processed and executed with minimal delay, resulting in a smoother and more intuitive interaction experience.
Overall, these findings validate that the proposed GBA approach not only improves recognition accuracy but also enhances real-time responsiveness, making it highly applicable to hands-free HCI scenarios such as robotics, UAV navigation, and immersive metaverse applications.
5. Discussion
This study presents a vision-based facial gesture recognition framework for the real-time remote control of robotic systems in both virtual and physical environments. The compact gesture set, combined with the system’s low-latency response, enables intuitive operation with minimal cognitive load for users, even those without prior experience. This was achieved by designing a concise set of 10 easily distinguishable gestures, balancing functional coverage with operational simplicity. Such a design ensures reliable control in challenging or time-sensitive scenarios, avoiding the pitfalls of overly complex gesture vocabularies that may overwhelm operators, or overly simplistic ones that limit task execution [68,69].
Compared to existing vision-based interaction systems, the proposed method is characterized by its simplicity, adaptability, and extensibility. While the core command set is intentionally minimal, it can be expanded depending on application-specific needs, which makes the system flexible for deployment across different domains. Furthermore, the integration of socket-based communication enables seamless remote operation within metaverse-style environments, extending the scope of traditional gesture-based control systems. The recognition engine is built on a hybrid GhostNet-BiLSTM backbone, augmented with spatial and temporal attention mechanisms. This architecture combines the efficiency of lightweight convolutional feature extraction with the temporal modeling capacity of recurrent networks, enabling the robust detection of dynamic facial gestures with minimal computational overhead. The model’s responsiveness and temporal pattern recognition capabilities make it readily transferable to other vision-based HCI applications, from robotic teleoperation to immersive VR/AR interaction.
Limitations
Despite its promising results, the system has certain limitations. The current implementation uses a fixed set of 10 base gestures to ensure ease of use and reduce cognitive burden. While this constraint improves user experience, it may limit control granularity in highly complex scenarios, such as multiple robot operations, where additional derived commands would be necessary. Expanding the gesture set, however, must be approached carefully to maintain usability.
Another limitation lies in the scope of evaluation. The reported results are primarily based on the FaceGest dataset and a real-world pilot study with 15 participants. While the high accuracy values and live testing results are encouraging, they may not fully capture the model’s generalization capability across different datasets or unseen subjects. A cross-dataset evaluation was not conducted in this work due to dataset availability and compatibility constraints, and we explicitly recognize this as a direction for future research.
In addition, while several baselines were included for comparison, not all were re-trained under identical experimental conditions. Some results were adopted directly from prior publications, as shown in Table 6. This distinction may affect the fairness of comparisons, although care was taken to ensure that all baseline results are reported accurately and from reputable sources.
Finally, the current implementation assumes stable computational resources and controlled environmental conditions. Deployment in highly dynamic or resource-constrained settings may require further optimization and robustness checks to ensure consistent performance.
6. Conclusions
In this paper, we proposed a lightweight and efficient vision-based facial gesture recognition system for the intuitive remote control of robots in a virtual environment. Leveraging a hybrid GhostNet-BiLSTM architecture with spatial and temporal attention, our method achieves high accuracy and low latency on the FaceGest dataset, demonstrating robust performance across varied lighting and user conditions. The integration with a Unity3D-based simulation environment via socket communication enables seamless, real-time command transmission, offering a novel approach to contactless human–robot interaction.
The compact gesture vocabulary designed in this work strikes a balance between functionality and user cognitive load, making it suitable for both expert and novice users. Experimental evaluations, including gesture classification accuracy and lap time metrics, validate the system’s superiority over existing vision-based control methods in terms of responsiveness and usability.
Future work will focus on extending the gesture set to support more complex control scenarios, including multi-robot coordination. Additionally, we plan to explore adaptive learning techniques to personalize gesture recognition and improve robustness under diverse real-world conditions. Beyond these directions, we also aim to conduct cross-dataset and unseen subject evaluations to better assess the generalization capabilities of the framework. This will help identify potential dataset biases and ensure reliability across different user groups and environments. Another important direction is optimizing the framework for deployment in resource-constrained platforms, such as mobile or embedded systems, where real-time performance and energy efficiency are critical.
Overall, the proposed system paves the way for accessible, immersive, and reliable vision-based interfaces in assistive robotics, teleoperation, and metaverse applications.