1. Introduction
Recently, human motion recognition has gained increasing attention due to its wide applications in healthcare, human–computer interaction, smart environments, and assistive technologies. Among various types of body movement, upper-limb motions—particularly those involving both arms—are essential for interpreting complex gestures and supporting rehabilitation processes. Traditional motion recognition methods often rely on wearable sensors, which can be intrusive and inconvenient for daily use. With the advancement of computer vision and deep learning, vision-based approaches have become a promising alternative for capturing and analyzing human motion in a non-contact and real-time manner [1,2,3,4].
This study aims to develop a vision-based dual-arm motion recognition system that accurately analyzes upper-limb movements from video input. By leveraging MediaPipe Hands [5] for skeletal critical point detection and a stacked long short-term memory (LSTM) architecture for temporal modeling [6,7,8,9], the system recognizes complex, coordinated, or symmetrical dual-arm gestures. The combination of spatial vector encoding and temporal dynamics enables the system to capture subtle variations in motion patterns. The system provides a robust solution for real-time motion analysis, with significant potential in physical rehabilitation, gesture-based control, and interactive technologies.
2. Materials and Methods
2.1. RK OpenVINO AI Box
The RK OpenVINO AI Box (Advantech Co., Ltd., Taipei, Taiwan) serves as the core device for executing advanced computing and image inference tasks in this system. It is equipped with an ARM-based processor and supports the Intel OpenVINO toolkit (Intel Corporation, Santa Clara, CA, USA), enabling efficient execution of various deep learning models such as pose estimation, joint tracking, and object detection. With the integration of the RK OpenVINO AI Box, the system gains edge computing capabilities, allowing it to process large volumes of image data in real time. This significantly reduces reliance on cloud services while improving recognition efficiency and response speed, thereby enabling more intelligent and interactive features.
2.2. Intel D455F Depth Camera
The Intel D455F [10] depth camera (Intel Corporation, Santa Clara, CA, USA) is a product in the Intel RealSense series. It utilizes active stereo vision technology combined with an infrared projector to capture depth information from the scene. With its high resolution and wide field of view, the camera can reliably sense depth even under varying lighting conditions, thereby enhancing the accuracy of pose recognition and environment mapping. Its advanced depth-sensing capabilities make it well-suited for tasks such as human pose estimation, object tracking, and 3D environment reconstruction. The working principle of the camera is based on the disparity between images captured by its dual lenses. When both lenses simultaneously capture the same scene, the system compares the positions of corresponding feature points in the two images to compute the disparity. By combining this disparity with the baseline (the physical distance between the two lenses) and the intrinsic parameters of the camera, the system uses triangulation to calculate the actual distance from each pixel to the camera. This process generates a depth map, which accurately reflects the position and depth of objects in the scene.
2.3. Human Action Recognition System
The developed system integrates an Intel RealSense depth camera (Intel Corporation, Santa Clara, CA, USA) with the MediaPipe framework to extract skeletal representations of the human body, and employs an LSTM neural network for temporal action recognition. The system continuously captures image frames and extracts 3D joint coordinates of the human body in real time. A total of 90 consecutive frames is aggregated to construct a fixed-length temporal sequence dataset. To enhance the discriminative power of the model, the system computes inter-frame joint displacement features, representing the temporal dynamics of skeletal motion. These features are then used as input to a pre-trained LSTM model, which performs action classification and prediction based on the observed motion patterns.
2.3.1. Skeleton Point Extraction and Hand Joint Recognition
In the preprocessing stage of the action recognition pipeline, we utilized the MediaPipe (Google LLC, Mountain View, CA, USA) Hands model to extract 3D hand skeletal landmarks, combining machine learning and computer vision to achieve high accuracy and real-time performance. The model adopts a two-stage architecture: hand region detection from RGB images, followed by pose regression using a convolutional neural network to estimate 21 critical points per hand, each comprising spatial coordinates (x, y, z) and a visibility score. For bimanual gesture recognition, the system extracts landmarks for both hands per frame, yielding 42 skeletal points in total. This full skeletal representation enhances the ability to capture inter-hand coordination and spatial relationships, which are essential for recognizing complex gestures.
Video input is processed frame-by-frame using OpenCV (Intel Corporation, Santa Clara, CA, USA). Each frame is converted to RGB and analyzed to obtain the 42 landmarks. Their four-dimensional attributes are recorded, and the frame-to-frame displacement of each point is computed to capture temporal motion features. This results in a 168-dimensional feature vector per frame. A sequence of 90 consecutive frames forms a complete action sample, which serves as input to the LSTM model for training and inference. This structured preprocessing effectively encodes both spatial posture and temporal dynamics, improving the model’s performance in recognizing diverse bimanual actions.
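The feature construction above can be sketched as follows. The exact composition of the 168 dimensions is not fully specified in the text; this sketch assumes 42 landmarks × 4 attributes (x, y, z, visibility) = 168 per frame, with inter-frame displacements computed from the stacked sequence. All function names are hypothetical.

```python
import numpy as np

NUM_LANDMARKS = 42  # 21 per hand, both hands
ATTRS = 4           # x, y, z, visibility
SEQ_LEN = 90        # frames per action sample

def frame_features(landmarks: np.ndarray) -> np.ndarray:
    """Flatten one frame's (42, 4) landmark array into a 168-dim vector."""
    assert landmarks.shape == (NUM_LANDMARKS, ATTRS)
    return landmarks.reshape(-1)

def build_sequence(frames: list) -> np.ndarray:
    """Stack 90 consecutive frames into the (90, 168) LSTM input."""
    assert len(frames) == SEQ_LEN
    return np.stack([frame_features(f) for f in frames])

def displacements(seq: np.ndarray) -> np.ndarray:
    """Frame-to-frame displacement of each landmark's (x, y, z) position."""
    xyz = seq.reshape(SEQ_LEN, NUM_LANDMARKS, ATTRS)[:, :, :3]
    return np.diff(xyz, axis=0)  # shape (89, 42, 3)
```

A sequence built this way has exactly the (90, 168) shape the model expects, and the displacement tensor captures the temporal dynamics described above.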
2.3.2. Action Recognition Model: Training and Inference Pipeline
We developed an action recognition model based on LSTM networks, a variant of recurrent neural networks (RNN) well-suited for modeling temporal dependencies in sequential data. Owing to its capability to retain long-term contextual information, LSTM is particularly effective for capturing temporal coherence in continuous human motion.
The input to the model is derived from MediaPipe-extracted 3D hand landmarks, with each frame represented by a 168-dimensional feature vector comprising 42 critical points with position and visibility attributes. Temporal dynamics are encoded by computing inter-frame displacements. Each action sequence consists of 90 consecutive frames, resulting in an input shape of (90,168). Action labels are one-hot encoded, and class imbalance is addressed via weighted loss adjustment. The dataset is divided into 70% for training and 30% for testing.
The model employs a three-layer stacked LSTM architecture. The first layer with 128 units captures low-level temporal features, followed by batch normalization and dropout (a rate of 0.5). The second and third layers use 64 and 32 units, respectively, progressively distilling high-level representations. The final LSTM layer outputs a summary vector, which is passed through a dense layer with softmax activation to classify among eight predefined action categories.
Training is optimized using the Adam optimizer with categorical cross-entropy loss. Techniques such as dynamic learning rate scheduling and early stopping are applied to enhance generalization and prevent overfitting.
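The architecture and training configuration described above can be sketched in Keras. Layer sizes, dropout rate, optimizer, and loss follow the text; all other details (where exactly batch normalization and dropout sit relative to the later LSTM layers, and the metric choice) are assumptions, so this is a minimal sketch rather than the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(seq_len: int = 90, feat_dim: int = 168, num_classes: int = 8) -> tf.keras.Model:
    """Three-layer stacked LSTM classifier over (90, 168) feature sequences."""
    model = models.Sequential([
        layers.Input(shape=(seq_len, feat_dim)),
        layers.LSTM(128, return_sequences=True),  # low-level temporal features
        layers.BatchNormalization(),
        layers.Dropout(0.5),
        layers.LSTM(64, return_sequences=True),   # mid-level representations
        layers.LSTM(32),                          # summary vector of the sequence
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Dynamic learning rate scheduling and early stopping, as mentioned above, would be supplied via `tf.keras.callbacks.ReduceLROnPlateau` and `tf.keras.callbacks.EarlyStopping` at `fit()` time; class imbalance can be handled with the `class_weight` argument of `fit()`.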
Figure 1 and
Figure 2 illustrate sample frames captured during the 90-frame joint detection process and the inference result, respectively.
2.3.3. Error Analysis of Skeleton Keypoint Extraction and Joint Recognition
We evaluated the accuracy of the skeleton detection model in extracting critical points, with a particular focus on three critical joints: the shoulder, elbow, and wrist. These critical points serve as the foundation for subsequent error analysis.
In the experiment, image acquisition techniques were used to obtain the coordinate data of the shoulder, elbow, and wrist. The shoulder coordinate was designated as the origin, serving as the reference point for measuring the relative positions of the elbow and wrist. The 3D coordinates derived from the skeleton detection model and depth camera through back-projection were then compared with the actual measured values to calculate the positional errors. This comparison enabled an assessment of the accuracy and precision of the critical point extraction process.
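The per-joint error computation described above amounts to comparing measured and reference 3D positions in the shoulder-centered frame. A minimal sketch, with a hypothetical function name and illustrative coordinates:

```python
import numpy as np

def joint_errors(measured: np.ndarray, ground_truth: np.ndarray) -> dict:
    """Absolute per-axis errors and the Euclidean distance error of a joint
    position (e.g., elbow or wrist), both expressed relative to the shoulder
    origin."""
    diff = measured - ground_truth
    return {
        "X_err": float(abs(diff[0])),
        "Y_err": float(abs(diff[1])),
        "Z_err": float(abs(diff[2])),
        "euclidean": float(np.linalg.norm(diff)),
    }

# Hypothetical elbow position in meters, shoulder at the origin
errs = joint_errors(np.array([0.10, -0.25, 0.31]),
                    np.array([0.12, -0.24, 0.30]))
```

Collecting these values over many frames yields the error distributions summarized in the figures that follow.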
This evaluation verifies the reliability of the system in localizing key body joints, providing a foundation for subsequent motion analysis and imitation control.
Figure 3 illustrates the error analysis results, showing the subject performing an arm-swinging motion; the results are presented through visualized graphics.
Figure 4a summarizes the statistical error data of the elbow relative to the shoulder, including absolute distance errors along the three axes (X_err, Y_err, Z_err) as well as the combined Euclidean distance. All error metrics are computed using absolute values and include the minimum, 25th percentile, median, 75th percentile, maximum, standard deviation, and mean.
Figure 4b presents the statistical error data of the wrist relative to the shoulder.
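The summary statistics reported in Figure 4 can be computed directly from the collected absolute error samples; a sketch with a hypothetical function name:

```python
import numpy as np

def error_summary(errors: np.ndarray) -> dict:
    """Summary statistics over absolute error values, matching the metrics
    reported in Figure 4: minimum, quartiles, median, maximum, standard
    deviation, and mean."""
    e = np.abs(errors)
    return {
        "min": float(e.min()),
        "p25": float(np.percentile(e, 25)),
        "median": float(np.median(e)),
        "p75": float(np.percentile(e, 75)),
        "max": float(e.max()),
        "std": float(e.std()),
        "mean": float(e.mean()),
    }
```

Applying this separately to X_err, Y_err, Z_err, and the Euclidean distance, for the elbow and the wrist, reproduces the eight statistics panels described above.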
3. Results
To assess the robustness and generalization of the model, data were collected across various locations, camera angles, and operators. Seven predefined actions (clapping, arm circles, waving, throwing, handshaking, picking up, and following) were each performed 50 times, for a total of 350 trials. Each 4-second video was processed to extract skeletal keypoints, which were converted into time series data and classified using an LSTM model based on motion patterns. The experimental results are summarized in Table 1, which presents the model’s classification performance for the seven hand actions in terms of precision, recall, and F1-score.
The model performed consistently and accurately across different actions, demonstrating the effectiveness of the LSTM-based system in classifying continuous skeletal movements. To further analyze the distribution of misclassifications among action categories, the confusion matrix is presented in Table 2.
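The per-class precision, recall, and F1-scores in Table 1 follow directly from the confusion matrix in Table 2. A sketch of the computation, assuming the usual convention that rows are true labels and columns are predicted labels (the function name and the example matrix are illustrative):

```python
import numpy as np

def per_class_metrics(cm: np.ndarray) -> dict:
    """Precision, recall, and F1 per class from a confusion matrix
    (rows = true labels, columns = predicted labels). Assumes every
    class has at least one prediction and one true sample."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)  # TP / (TP + FP), per column
    recall = tp / cm.sum(axis=1)     # TP / (TP + FN), per row
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Illustrative two-class matrix, not the paper's actual data
m = per_class_metrics(np.array([[8, 2], [1, 9]]))
```

Off-diagonal entries show which action pairs the model confuses, which is the basis for the misclassification analysis in the Discussion.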
4. Discussion
The experimental results show that the developed real-time dual-arm action recognition system achieves a high level of accuracy and robustness across multiple bimanual gesture categories. Notably, the system maintained a consistent F1-score above 84% across all tested actions, with peak performance observed in waving (F1 = 94.1%) and following (F1 = 92.8%). These results highlight the effectiveness of the spatiotemporal feature encoding approach based on directional vectors between upper-limb joints, combined with the temporal modeling capabilities of the stacked LSTM network.
Error analysis results of key joint extraction validated the reliability of the skeletal detection framework. By anchoring shoulder coordinates as a reference and quantifying the 3D error distribution of elbow and wrist positions, the system demonstrates acceptable precision for motion tracking tasks. The small variance in Euclidean errors supports the claim that the system is suitable for applications requiring fine-grained gesture analysis, such as rehabilitation assessment or interactive control.
Despite the encouraging performance, several limitations remain. First, the model exhibits confusion between semantically similar actions, such as handshaking and throwing, potentially due to overlapping motion trajectories. This suggests the need for integrating higher-level contextual features or augmenting training data diversity. Second, the system relies on a fixed sequence length (90 frames), which may restrict its flexibility in recognizing gestures with varying durations. Future work could explore adaptive temporal windowing or transformer-based architectures to overcome this limitation.
Moreover, while the Intel D455F camera and RK OpenVINO AI Box enable real-time processing with edge-computing capabilities, environmental factors such as lighting variations or occlusions may still impact depth accuracy and joint detection stability. Addressing these challenges through sensor fusion or domain adaptation techniques could further improve robustness in unconstrained settings.
5. Conclusions
We developed an integrated dual-arm motion recognition system capable of real-time gesture classification using skeletal joint vectors and deep temporal modeling. By leveraging MediaPipe for robust pose estimation and employing a three-layer LSTM network trained on structured 168-dimensional temporal features, the system successfully classifies complex upper-limb actions with high accuracy. Inter-frame displacement vectors and full-hand landmark extraction enable the detailed representation of spatial coordination, which proves essential for bimanual gesture recognition. The experimental evaluation—comprising 350 video trials across seven action classes—confirms the model’s generalization ability under varying scenarios, demonstrating practical viability in fields such as physical rehabilitation, gesture-based interface control, and interactive robotics.
Future work will incorporate attention mechanisms for improved temporal focus, expand the action set, and deploy the system in uncontrolled real-world environments. These enhancements aim to broaden the applicability and reliability of the developed system in real-time human–computer interaction applications.
Author Contributions
Conceptualization, Y.-G.L.; methodology, Y.-H.T. and C.-W.H.; software, Y.-H.T. and C.-W.H.; writing—review and editing, Y.-H.T.; writing—original draft preparation, Y.-H.T. and C.-W.H. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by a grant from the National Science and Technology Council (NSTC), project number 112-2221-E-003-006-MY2.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The original contributions presented in this study are included in the article.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Bai, L.; Zhao, T.; Xiu, X. Exploration of computer vision and image processing technology based on OpenCV. In Proceedings of the 2022 International Seminar on Computer Science and Engineering Technology, Indianapolis, IN, USA, 8–9 January 2022; pp. 145–147.
- Cui, H.; Dahnoun, N. Real-time short-range human posture estimation using mmWave radars and neural networks. IEEE Sens. J. 2021, 22, 535–543.
- Palani, P.; Panigrahi, S.; Jammi, S.A.; Thondiyath, A. Real-time Joint Angle Estimation using Mediapipe Framework and Inertial Sensors. In Proceedings of the 2022 IEEE 22nd International Conference on Bioinformatics and Bioengineering, Taichung, Taiwan, 7–9 November 2022; pp. 128–133.
- Hsu, C.-W. Design of an Omnidirectional Interactive Robot Based on Multi-Action Recognition and Arm Imitation Learning. Master’s Thesis, National Taiwan Normal University, Taipei City, Taiwan, 2025.
- Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.-L.; Yong, M.G.; Lee, J.; et al. MediaPipe: A framework for building perception pipelines. arXiv 2019, arXiv:1906.08172.
- Tai, T.-M.; Jhang, Y.-J.; Liao, Z.-W.; Teng, K.-C.; Hwang, W.-J. Sensor-Based Continuous Hand Gesture Recognition by Long Short Term Memory. IEEE Sens. Lett. 2018, 2, 6000704.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; pp. 3104–3112.
- Zhu, G.; Zhang, L.; Shen, P.; Song, J. Multimodal gesture recognition using 3-D convolution and convolutional LSTM. IEEE Access 2017, 5, 4517–4524.
- Rustler, L.; Volprecht, V.; Hoffmann, M. Empirical Comparison of Four Stereoscopic Depth Sensing Cameras for Robotics Applications. IEEE Access 2025, 13, 67564–67577.