Development of a System for Recognising and Classifying Motor Activity to Control an Upper-Limb Exoskeleton
Abstract
1. Introduction
- Development of theoretical foundations and practical tools aimed at combining data from different tracking systems to form a comprehensive dataset of the user’s hand motor activity;
- Development and comparison of machine learning models to optimally solve the problem of predicting the main parameters of motor activity: hand position, rotation angle, motion vector, and motion type classification;
- Development of an algorithm to integrate the trained optimal machine learning model to form a digital shadow of the hand motor activity process for further use in virtual simulators and exoskeleton systems.
- Development and formalisation of the theoretical foundations for processing multi-sensor data on a user’s motor activity within a virtual training environment, including procedures for synchronising data from heterogeneous sources, thereby standardising data collection and processing and enabling reproducible comparison of different sensor-source combinations;
- Investigation of how individual components of the multi-sensor dataset, used to train various machine learning models, affect the final accuracy of regression and classification tasks, which makes it possible to justify replacing some data sources with others that are simpler or more user-friendly within exoskeleton systems;
- Formalisation of an algorithm for integrating trained models into the control loop of an upper-limb exoskeleton embedded in virtual training and rehabilitation systems, which constitutes an important theoretical step toward the subsequent implementation of control systems operating on multi-sensor data.
2. Related Works
3. Materials and Methods
- Synchronisation of data streams from various sources into a single set, taking into account time stamps;
- Division of a large amount of information into time segments (ranges) through automatic or manual tagging. In the latter case, manual tagging tools need to be implemented, and a video stream can be the main source of information for the analyst doing the tagging. By analysing the actions occurring in separate fragments of the video, the analyst indicates the time range of the current fragment, linking it to the corresponding data from other sources;
- Preprocessing of the data using appropriate filters. This stage is most relevant for IMU data, where additional transformations are performed to remove high-frequency noise and baseline drift;
- Extraction of additional attributes from the raw data to obtain derived metrics, such as the power spectral density (PSD) of the EMG signal or the speed and position calculated from IMU data.
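As a minimal illustration of these preprocessing and feature-extraction steps, the sketch below applies a zero-phase Butterworth band-pass filter to a single EMG channel and estimates its power spectral density with Welch's method. The sampling rate and pass band are placeholder values, not parameters taken from the study.

```python
import numpy as np
from scipy.signal import butter, filtfilt, welch

def bandpass_filter(signal, fs, low, high, order=4):
    """Zero-phase Butterworth band-pass filter; low/high must lie below fs / 2."""
    nyquist = fs / 2.0
    b, a = butter(order, [low / nyquist, high / nyquist], btype="band")
    return filtfilt(b, a, signal)

def emg_psd(signal, fs):
    """Power spectral density of a single EMG channel via Welch's method."""
    freqs, psd = welch(signal, fs=fs, nperseg=min(256, len(signal)))
    return freqs, psd

# Illustrative usage with synthetic data (sampling rate and band are assumptions)
fs = 1000.0
emg = np.random.randn(5000)
filtered = bandpass_filter(emg, fs, low=20.0, high=450.0)
freqs, psd = emg_psd(filtered, fs)
```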
- Prediction of the refined positions of human arm segments based on the initial input data using reference data from VR trackers as output data. This will reduce IMU positioning errors through the use of machine learning models. This approach to improving accuracy has been successfully tested in previous studies [32];
- Classification of the user’s hand movements according to several categories based on input data analysis. Within the scope of this task, it is also of great interest to study the influence of the number of sources on movement classification accuracy.
- Development of several alternative machine learning model architectures, which will allow for the objective comparison of different approaches and their effectiveness in the context of the tasks being solved;
- Preparation of an experimental setup, within which it is necessary to simulate as closely as possible the conditions of being in an immersive system based on an upper exoskeleton, and to organise the collection of data from the necessary sources;
- Creation of software tools for synchronising and tagging data from selected data sources.
- Use of processed and refined data on motor activity as a component of the control action to regulate the position of the upper exoskeleton;
- Monitoring and classification of the motor activity of the exoskeleton user in order to log the operations they perform;
- Tracking of the position of the hands to check the safety of the user’s current state and to avoid emergencies.
3.1. Research Methodology
3.2. Formalisation of Data Handling Processes
- Data synchronisation is used to eliminate time offsets between sensors by aligning the initial time stamps of samples from different sensors within each recording session. Before a session begins, the position of the sensors is calibrated and a single reference time is set, which further reduces noise.
- Use of filters to reduce noise. For IMU data, a combination of the Kalman filter and Madgwick's orientation filter is used, which has proven highly effective. For other sources, filters or conversion procedures may be applied as needed.
- Selection of the length of the scanning window as an input dimension, which varies for each data source and is significantly smaller than the duration of each measurement, allowing the action to be classified based on a data fragment. The disadvantage of this approach is that the time sequence corresponding to the movement will not be fully analysed; only a small part of it will, which may negatively affect accuracy. Additionally, the window dimension for each source will differ, as it is directly proportional to the recording frequency.
- Alignment of the input data sizes to a single length by approximation, taking into account the highest sampling frequency among all sources. This approach results in all data being represented as matrices at the same frequency, corresponding to the highest frequency among the sources. The downside is that the amount of data increases significantly (for example, up to 8.3 times for computer vision). On the other hand, there are the following advantages: all data are synchronised, which allows measurements from different sources with the same indices to be compared directly (this can be used in solving problem 1 when predicting refined values), and all data from the current measurement can be combined into a single matrix whose number of rows corresponds to the largest number of records from any source in the current measurement and whose number of columns corresponds to the total number of features from all sources.
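A minimal sketch of this alignment is shown below: each source is linearly interpolated to the length of the densest stream and the results are concatenated column-wise into a single feature matrix. The array shapes and the use of linear interpolation are illustrative assumptions; the study's own approximation scheme may differ.

```python
import numpy as np

def resample_to_length(data, target_len):
    """Linearly interpolate a (samples x features) array to target_len rows."""
    data = np.asarray(data, dtype=float)
    old_idx = np.linspace(0.0, 1.0, num=data.shape[0])
    new_idx = np.linspace(0.0, 1.0, num=target_len)
    return np.column_stack(
        [np.interp(new_idx, old_idx, data[:, j]) for j in range(data.shape[1])]
    )

def build_unified_matrix(sources):
    """Upsample every source to the highest-rate stream and stack features column-wise."""
    target_len = max(s.shape[0] for s in sources)
    aligned = [resample_to_length(s, target_len) for s in sources]
    return np.hstack(aligned)

# Illustrative usage: EMG (6 channels), IMU (9 channels), CV key points (8 values)
emg = np.random.randn(1200, 6)
imu = np.random.randn(500, 9)
cv = np.random.randn(145, 8)
unified = build_unified_matrix([emg, imu, cv])   # shape (1200, 23)
```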
3.3. Development of Machine Learning Models
3.4. Evaluating the Quality of Trained Models
- Cross-entropy loss, used as the loss function in neural network training: $L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{K} y_{i,c}\log\hat{y}_{i,c}$;
- Classification accuracy (proportion of correctly predicted classes): $\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\left[\hat{y}_i = y_i\right]$;
- Average Precision (averaged over the $K$ classes): $\text{Precision} = \frac{1}{K}\sum_{c=1}^{K}\frac{TP_c}{TP_c + FP_c}$;
- Average Recall (averaged over the $K$ classes): $\text{Recall} = \frac{1}{K}\sum_{c=1}^{K}\frac{TP_c}{TP_c + FN_c}$;
- F1-score (harmonic mean of Precision and Recall): $F_1 = \frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision} + \text{Recall}}$, where $N$ is the number of samples, $K$ the number of classes, and $TP_c$, $FP_c$, $FN_c$ the true positives, false positives, and false negatives for class $c$.
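For reference, all of the listed metrics can be computed from predicted class probabilities with scikit-learn. The sketch below assumes macro-averaging over classes, matching the "average" formulation above; variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, log_loss)

def evaluate_classifier(y_true, y_prob):
    """Compute cross-entropy, accuracy, macro precision/recall, and F1-score."""
    y_pred = np.argmax(y_prob, axis=1)
    return {
        "cross_entropy": log_loss(y_true, y_prob, labels=np.arange(y_prob.shape[1])),
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }
```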
3.5. Algorithm for Integrating Trained Models for Upper Exoskeleton Control
Algorithm 1. UpperExoControlCycle

```
INPUT:
  S = {s_EMG, s_IMU, s_CV, s_VR}        // set of sensors
  AllowedClasses = {C1, …, Ck}          // permissible action categories
  Δt                                    // control-loop period
  allT = 5 s                            // window length
  allν = ν_EMG                          // unified sampling rate after resampling
  ML1_abs, ML1_inc                      // regressors: absolute coordinates and increments
  ML2_cls_abs, ML2_cls_rel              // classifiers: using absolute or normalised data
  Thresholds_traj                       // tolerances for trajectory/velocity/angles
STATE:
  Buffer_raw ← ∅                        // circular buffer of packets r ∈ R
  Mode ← NORMAL
LOOP every Δt while SystemActive:
  // 1. Packet acquisition and transfer (φ, ϕ)
  NewPackets ← AcquireFromSensors(S)
  Buffer_raw ← Buffer_raw ∪ Transfer(NewPackets)
  // 2. Preprocessing (γ): synchronisation, filtering, alignment
  E ← Preprocess(Buffer_raw)
  // 3. Window/rate balancing → forming U-arrays
  U1 ← Build_U1(E, allT, allν)          // unified per-measurement data
  U2 ← Extract_IMU_and_Kinematics(U1)   // IMU + computed poses
  U3 ← GlobalAlign_AllSources(U1)       // second alignment across the entire dataset
  U4 ← Diff_IMU_Positions(U2)           // IMU displacements
  U5 ← Normalise_Relative(U3)           // normalisation relative to the first row of the record
  // 4. Choosing the regression scheme
  if UseAbsoluteRegression:
    Pose_hat ← ML1_abs.predict(U2)      // predict Y1
  else:
    ΔPose_hat ← ML1_inc.predict(U4)     // predict Y3
    Pose_hat ← IntegratePose(ΔPose_hat)
  // 5. Feature construction for classification
  if UseRelativeClassification:
    X_cls ← U5
  else:
    X_cls ← U3
  // 6. Classification
  Prob ← ML2_cls.predict_proba(X_cls)
  Action_hat ← argmax(Prob)
  // 7. Safety check
  if Action_hat ∉ AllowedClasses OR TrajectoryDeviation(Pose_hat, Thresholds_traj):
    EmergencyStop()
    Mode ← EMERGENCY
    Log(Action_hat, Pose_hat, "STOP")
    continue LOOP
  // 8. Control command computation
  Cmd ← ComputeCommands(Pose_hat, Action_hat)
  SendToActuators(Cmd)
  Log(Action_hat, Pose_hat, "OK")
  // 9. Latency monitoring and resource degradation
  if LatencyExceeded() OR MissingCriticalSensors():
    ApplyFallback(FallbackModels)
    EmergencyStop()
    Mode ← EMERGENCY
END LOOP
END
```
4. Results
4.1. Experimental Setup and Data Preparation
- (1) Preparation of software for synchronous data recording. The Python programming language and a set of libraries for interfacing with the equipment used were applied. Each source logged the measurement time, which later allowed measurements from different sources with different frequencies to be synchronised.
- (2) Equipment preparation and sensor calibration. A mandatory procedure of attaching EMG sensors, trackers, and IMUs to each participant at predetermined positions was carried out, followed by calibration to minimise error accumulation. The sensor placement diagram and an illustrative excerpt from the experimental study are presented in Figure 5. The following tracking hardware was used for data acquisition: an IMU module based on the MPU-9250 with an ESP32 controller (TDK InvenSense, San Jose, CA, USA), HTC Vive trackers (HTC Corporation, Taoyuan, Taiwan), a Logitech HD Pro C920 camera (Logitech International S.A., Lausanne, Switzerland), and a six-channel EMG sensor (Wuxi Sizhirui Technology Co., Wuxi, China) built on an STM32 microcontroller (STMicroelectronics, Shanghai, China).
- (3) Data collection. Participants performed a set of specified movements, presented in Table 2, including elbow flexion/extension, circular hand movements, and movements along various axes. Each movement was repeated several times to obtain a sufficient amount of data.
- (4) Data annotation. To annotate the data, software developed by the team was used, as shown in Figure 6. The software allows the user to create sessions for recording individual exercises, visually track recorded movements on video, and record the start and end times of movements for subsequent data extraction. During annotation, data are extracted synchronously from all sources, taking into account the start and end times selected by the analyst responsible for annotation. Each fragment is saved as a separate file in CSV format.
- (5) Signal processing. Necessary noise filtering is performed using band-pass filters, after which a single synchronised sample is formed from sources arriving at different frequencies and in different formats. This stage involves combining all data on a single time scale, which is then stretched to a fixed duration of 5 s (the most common movement length), giving a uniform dimension of 1200 rows for every record fed into the models. Thus, the presented software not only provides visual data annotation but also implements an algorithm for the temporal alignment of streams with different sampling rates. All annotated signals are brought to a fixed window length, which makes it possible to form a feature matrix of a unified size for any combination of sensors.
- (6) Model training. Based on the selected features and synchronised data, the model architectures presented above were trained to solve the regression problem (predicting VR tracker position from IMU position) and the classification problem. This stage includes an ablation study of the classification models in order to identify the influence of individual data sources on the accuracy of the solution. Model training was performed on the following hardware: AMD Ryzen 9 7950X, NVIDIA RTX 4070 Ti Super (16 GB), 128 GB RAM, and an SSD drive.
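To illustrate the regression step (mapping IMU-derived features to VR-tracker coordinates), the following sketch trains one of the compared models, a Random Forest regressor, and reports MAE and MSE. The feature dimensionality, data split, and hyperparameters are illustrative assumptions rather than the exact configuration used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Placeholder data: in the real pipeline X holds IMU-derived features and
# y holds the reference VR-tracker coordinates (x, y, z) for each time step.
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 12))
y = rng.normal(size=(2000, 3))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)          # multi-output regression is supported natively

y_pred = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
```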
4.2. Structure of the Collected Data
4.3. Comparing Models When Solving Regression Problems
4.4. Ablation Model Research
4.5. Comparison of Classification Algorithms
4.6. Comparison with Previous Studies
5. Discussion
5.1. Analysis of the Obtained Results
- A formalised protocol for collecting and synchronising data for upper-limb exoskeletons, enabling unified analysis window lengths, sampling rates, and feature–matrix structures, thereby facilitating the comparison of models and sensor sets in current and future studies;
- A universal algorithm for integrating machine learning models into the exoskeleton control loop, accounting for user safety requirements and enabling improved positioning accuracy when using IMUs through a regression model;
- An investigation of the minimally sufficient set of sensors for classifying the user’s motor activity, which will simplify and reduce the cost of the upper-exoskeleton control system without sacrificing tracking quality and monitoring accuracy.
5.2. Limitations of the Study and Directions for Further Work
- Expansion of the set of exercises for classification. The set of movements presented in the classification task should be significantly expanded. In addition, it is necessary to introduce a class of other actions that do not belong to any of the analysed categories. It would also be productive to take the probability of an action into account, since actions are not always unambiguous during movement and should be identified by degree of probability rather than by strict category.
- Expansion of the number of tracking sources. Increasing the amount of source information about motor activity will have a positive effect on the accuracy of classification and the effectiveness of the monitoring system. For example, integrating an external camera with a depth sensor that covers the entire height of a person will allow key points of the entire body model to be recognised in three dimensions, while the use of EEG or ECG sensors will provide additional information about the physiological state of a person during activities, and strain or pressure sensors will provide an additional assessment of hand movements in space.
- Identification of features in the analysed signals. In addition to directly expanding the number of information sources, it is necessary to deepen the analysis and processing of raw data in order to extract additional knowledge from the information already available. For EMG, it is advisable to calculate time–frequency features such as the average and median frequencies of the spectrum, power spectral density, signal entropy, fractal dimension, and so on. Similarly, for IMU data it is possible to calculate linear and angular velocities, and for computer vision systems the velocities of individual body segments, together with recognition of additional body segments and their positions. An expanded feature set can increase the sensitivity of the system and its versatility in situations where processing the raw data with a model is not effective.
- Accounting for signal variability and model adaptability to human parameters. This limitation is particularly relevant for tracking systems such as EMG (and likewise ECG and EEG sensors), since the amplitude, shape, and spectrum of EMG signals vary significantly due to differences in anatomy, fitness level, skin temperature, and electrode placement. Without taking this variability into account, the model tends to overfit and lose accuracy after a long session or when the user changes. It is necessary to implement continuous calibration mechanisms: adaptive scaling of signals over a sliding window, normalisation that takes the current signal level into account, and retraining strategies that allow the network weights to be reconfigured for a new person (a minimal sketch of sliding-window normalisation is given after this list). This study considered normalisation to the initial values of each measurement, but it did not show the desired effectiveness in practice. Adapting models to a specific user is an extremely promising direction for future research.
- Accounting for external factors. In real-world conditions, sensors are affected by various kinds of interference, including electromagnetic noise from the mains and equipment (affecting EMG), metal structures and collisions with them (which may affect IMU and VR sensors), and variable lighting and non-uniform backgrounds (affecting computer vision systems). To increase robustness, additional multi-stage filtering and processing of the raw data are required. For computer vision systems, control of exposure, contrast, and viewing angle is also necessary. In this regard, ensemble decision-making methods trained under various environmental conditions may further be applied, which can increase resistance to unpredictable interference.
- Expansion of the participant sample. The study analysed data from healthy users, and the sample size is rather limited. Although the collected data made it possible to confirm the proposed hypothesis and to train the models effectively, expanding the sample—including participants from different age groups and patients with musculoskeletal disorders—is highly promising. This will allow us to verify the transferability of the models to movement analysis in cases of limited mobility, as well as ensuring the applicability of the hardware–software complex based on a controlled exoskeleton and the trained models to rehabilitation tasks.
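As noted in the point on signal variability above, one simple form of adaptive scaling is z-score normalisation over a trailing window. The sketch below illustrates the idea; the window length is arbitrary and not derived from the study.

```python
import numpy as np

def sliding_window_normalise(signal, window=500):
    """Adaptive z-score normalisation of a 1-D signal over a trailing window."""
    signal = np.asarray(signal, dtype=float)
    out = np.zeros_like(signal)
    for i in range(len(signal)):
        seg = signal[max(0, i - window + 1):i + 1]
        std = seg.std()
        out[i] = (signal[i] - seg.mean()) / std if std > 0 else 0.0
    return out
```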
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
References
- Damaševičius, R.; Sidekerskienė, T. Virtual Worlds for Learning in Metaverse: A Narrative Review. Sustainability 2024, 16, 2032. [Google Scholar] [CrossRef]
- Kim, Y.M.; Rhiu, I.; Yun, M.H. A systematic review of a virtual reality system from the perspective of user experience. Int. J. Hum.–Comput. Interact. 2020, 36, 893–910. [Google Scholar] [CrossRef]
- Dritsas, E.; Trigka, M.; Troussas, C.; Mylonas, P. Multimodal Interaction, Interfaces, and Communication: A Survey. Multimodal Technol. Interact. 2025, 9, 6. [Google Scholar] [CrossRef]
- Vélez-Guerrero, M.A.; Callejas-Cuervo, M.; Mazzoleni, S. Artificial Intelligence-Based Wearable Robotic Exoskeletons for Upper Limb Rehabilitation: A Review. Sensors 2021, 21, 2146. [Google Scholar] [CrossRef] [PubMed]
- Tiboni, M.; Borboni, A.; Vérité, F.; Bregoli, C.; Amici, C. Sensors and Actuation Technologies in Exoskeletons: A Review. Sensors 2022, 22, 884. [Google Scholar] [CrossRef]
- Kyrarini, M.; Lygerakis, F.; Rajavenkatanarayanan, A.; Sevastopoulos, C.; Nambiappan, H.R.; Chaitanya, K.K.; Babu, A.R.; Mathew, J.; Makedon, F. A Survey of Robots in Healthcare. Technologies 2021, 9, 8. [Google Scholar] [CrossRef]
- Vélez-Guerrero, M.A.; Callejas-Cuervo, M.; Mazzoleni, S. Design, Development, and Testing of an Intelligent Wearable Robotic Exoskeleton Prototype for Upper Limb Rehabilitation. Sensors 2021, 21, 5411. [Google Scholar] [CrossRef]
- Islam, M.M.; Nooruddin, S.; Karray, F.; Muhammad, G. Human activity recognition using tools of convolutional neural net-works: A state of the art review, data sets, challenges, and future prospects. Comput. Biol. Med. 2022, 149, 106060. [Google Scholar] [CrossRef]
- Obukhov, A.; Volkov, A.; Pchelintsev, A.; Nazarova, A.; Teselkin, D.; Surkova, E.; Fedorchuk, I. Examination of the Accuracy of Movement Tracking Systems for Monitoring Exercise for Musculoskeletal Rehabilitation. Sensors 2023, 23, 8058. [Google Scholar] [CrossRef]
- Udeozor, C.; Toyoda, R.; Russo Abegão, F.; Glassey, J. Perceptions of the Use of Virtual Reality Games for Chemical Engineering Education and Professional Training. High. Educ. Pedagog. 2021, 6, 175–194. [Google Scholar] [CrossRef]
- Wanying, L.; Shuwei, L.; Qi, S. Robot-Assisted Upper Limb Rehabilitation Training Pose Capture Based on Optical Motion Capture. Int. J. Adv. Manuf. Technol. 2024, 121, 1–12. [Google Scholar] [CrossRef]
- Bhujel, S.; Hasan, S.K. A Comparative Study of End-Effector and Exoskeleton Type Rehabilitation Robots in Human Upper Extremity Rehabilitation. Hum.–Intell. Syst. Integr. 2023, 5, 11–42. [Google Scholar] [CrossRef]
- Mani Bharathi, V.; Manimegalai, P.; George, T.; Pamela, D. A Systematic Review of Techniques and Clinical Evidence to Adopt Virtual Reality in Post-Stroke Upper Limb Rehabilitation. Virtual Real. 2024, 28, 172. [Google Scholar] [CrossRef]
- Ödemiş, E.; Baysal, C.V.; İnci, M. Patient Performance Assessment Methods for Upper Extremity Rehabilitation in Assist-as-Needed Therapy Strategies: A Comprehensive Review. Med. Biol. Eng. Comput. 2025, 63, 1895–1914. [Google Scholar] [CrossRef]
- Cha, K.; Wang, J.; Li, Y.; Shen, L.; Chen, Z.; Long, J. A Novel Upper-Limb Tracking System in a Virtual Environment for Stroke Rehabilitation. J. NeuroEng. Rehabil. 2021, 18, 166. [Google Scholar] [CrossRef]
- Li, R.T.; Kling, S.R.; Salata, M.J.; Cupp, S.A.; Sheehan, J.; Voos, J.E. Wearable Performance Devices in Sports Medicine. Sports Health 2016, 8, 74–78. [Google Scholar] [CrossRef]
- Li, Y.; Zheng, L.; Wang, X. Flexible and Wearable Healthcare Sensors for Visual Reality Health-Monitoring. Virtual Real. Intell. Hardw. 2019, 1, 411–427. [Google Scholar] [CrossRef]
- Tsilomitrou, O.; Gkountas, K.; Evangeliou, N.; Dermatas, E. Wireless Motion Capture System for Upper Limb Rehabilitation. Appl. Syst. Innov. 2021, 4, 14. [Google Scholar] [CrossRef]
- Yang, Y.; Weng, D.; Li, D.; Xun, H. An improved method of pose estimation for lighthouse base station extension. Sensors 2017, 17, 2411. [Google Scholar] [CrossRef]
- Ergun, B.G.; Şahiner, R. Embodiment in Virtual Reality and Augmented Reality Games: An Investigation on User Interface Haptic Controllers. J. Soft Comput. Artif. Intell. 2023, 4, 80–92. [Google Scholar] [CrossRef]
- Franček, P.; Jambrošić, K.; Horvat, M.; Planinec, V. The Performance of Inertial Measurement Unit Sensors on Various Hardware Platforms for Binaural Head-Tracking Applications. Sensors 2023, 23, 872. [Google Scholar] [CrossRef]
- Ghorbani, F.; Ahmadi, A.; Kia, M.; Rahman, Q.; Delrobaei, M. A Decision-Aware Ambient Assisted Living System with IoT Embedded Device for In-Home Monitoring of Older Adults. Sensors 2023, 23, 2673. [Google Scholar] [CrossRef]
- Eliseichev, E.A.; Mikhailov, V.V.; Borovitskiy, I.V.; Zhilin, R.M.; Senatorova, E.O. A Review of Devices for Detection of Muscle Activity by Surface Electromyography. Biomed. Eng. 2022, 56, 69–74. [Google Scholar] [CrossRef]
- Nguyen, V.T.; Lu, T.-F.; Grimshaw, P.; Robertson, W. A Novel Approach for Human Intention Recognition Based on Hall Effect Sensors and Permanent Magnets. Prog. Electromagn. Res. M 2020, 92, 55–65. [Google Scholar] [CrossRef]
- Chung, J.-L.; Ong, L.-Y.; Leow, M.-C. Comparative Analysis of Skeleton-Based Human Pose Estimation. Future Internet 2022, 14, 380. [Google Scholar] [CrossRef]
- Obukhov, A.; Dedov, D.; Volkov, A.; Teselkin, D. Modeling of Nonlinear Dynamic Processes of Human Movement in Virtual Reality Based on Digital Shadows. Computation 2023, 11, 85. [Google Scholar] [CrossRef]
- Zhang, S.; Li, Y.; Zhang, S.; Shahabi, F.; Xia, S.; Deng, Y.; Alshurafa, N. Deep Learning in Human Activity Recognition with Wearable Sensors: A Review on Advances. Sensors 2022, 22, 1476. [Google Scholar] [CrossRef]
- Lin, J.-J.; Hsu, C.-K.; Hsu, W.-L.; Tsao, T.-C.; Wang, F.-C.; Yen, J.-Y. Machine Learning for Human Motion Intention Detection. Sensors 2023, 23, 7203. [Google Scholar] [CrossRef]
- Gonzales-Huisa, O.A.; Oshiro, G.; Abarca, V.E.; Chavez-Echajaya, J.G.; Elias, D.A. EMG and IMU Data Fusion for Locomotion Mode Classification in Transtibial Amputees. Prosthesis 2023, 5, 1232–1256. [Google Scholar] [CrossRef]
- Marcos Mazon, D.; Groefsema, M.; Schomaker, L.R.B.; Carloni, R. IMU-Based Classification of Locomotion Modes, Transitions, and Gait Phases with Convolutional Recurrent Neural Networks. Sensors 2022, 22, 8871. [Google Scholar] [CrossRef]
- Vásconez, J.P.; Barona López, L.I.; Valdivieso Caraguay, Á.L.; Benalcázar, M.E. Hand Gesture Recognition Using EMG-IMU Signals and Deep Q-Networks. Sensors 2022, 22, 9613. [Google Scholar] [CrossRef]
- Obukhov, A.; Dedov, D.; Volkov, A.; Rybachok, M. Technology for Improving the Accuracy of Predicting the Position and Speed of Human Movement Based on Machine Learning Models. Technologies 2025, 13, 101. [Google Scholar] [CrossRef]
- Mishra, R.; Mishra, A.K.; Choudhary, B.S. High-Speed Motion Analysis-Based Machine Learning Models for Prediction and Simulation of Flyrock in Surface Mines. Appl. Sci. 2023, 13, 9906. [Google Scholar] [CrossRef]
- Kwon, B.; Son, H. Accurate Path Loss Prediction Using a Neural Network Ensemble Method. Sensors 2024, 24, 304. [Google Scholar] [CrossRef] [PubMed]
- Lee, J.; Suh, J. A Study on Systematic Improvement of Transformer Models for Object Pose Estimation. Sensors 2025, 25, 1227. [Google Scholar] [CrossRef]
- Sulla-Torres, J.; Calla Gamboa, A.; Avendaño Llanque, C.; Angulo Osorio, J.; Zúñiga Carnero, M. Classification of Motor Competence in Schoolchildren Using Wearable Technology and Machine Learning with Hyperparameter Optimization. Appl. Sci. 2024, 14, 707. [Google Scholar] [CrossRef]
- Stančić, I.; Musić, J.; Grujić, T.; Vasić, M.K.; Bonković, M. Comparison and Evaluation of Machine Learning-Based Classification of Hand Gestures Captured by Inertial Sensors. Computation 2022, 10, 159. [Google Scholar] [CrossRef]
- Samkari, E.; Arif, M.; Alghamdi, M.; Al Ghamdi, M.A. Human Pose Estimation Using Deep Learning: A Systematic Literature Review. Mach. Learn. Knowl. Extr. 2023, 5, 1612–1659. [Google Scholar] [CrossRef]
- Ogundokun, R.O.; Maskeliūnas, R.; Misra, S.; Damasevicius, R. Hybrid InceptionV3-SVM-Based Approach for Human Posture Detection in Health Monitoring Systems. Algorithms 2022, 15, 410. [Google Scholar] [CrossRef]
- Farhadpour, S.; Warner, T.A.; Maxwell, A.E. Selecting and Interpreting Multiclass Loss and Accuracy Assessment Metrics for Classifications with Class Imbalance: Guidance and Best Practices. Remote Sens. 2024, 16, 533. [Google Scholar] [CrossRef]
- Jiang, Y.; Song, L.; Zhang, J.; Song, Y.; Yan, M. Multi-Category Gesture Recognition Modeling Based on sEMG and IMU Signals. Sensors 2022, 22, 5855. [Google Scholar] [CrossRef]
- Bangaru, S.S.; Wang, C.; Aghazadeh, F. Data Quality and Reliability Assessment of Wearable EMG and IMU Sensor for Construction Activity Recognition. Sensors 2020, 20, 5264. [Google Scholar] [CrossRef] [PubMed]
- Valdivieso Caraguay, Á.L.; Vásconez, J.P.; Barona López, L.I.; Benalcázar, M.E. Recognition of Hand Gestures Based on EMG Signals with Deep and Double-Deep Q-Networks. Sensors 2023, 23, 3905. [Google Scholar] [CrossRef] [PubMed]
- Bai, A.; Song, H.; Wu, Y.; Dong, S.; Feng, G.; Jin, H. Sliding-Window CNN + Channel-Time Attention Transformer Network Trained with Inertial Measurement Units and Surface Electromyography Data for the Prediction of Muscle Activation and Motion Dynamics Leveraging IMU-Only Wearables for Home-Based Shoulder Rehabilitation. Sensors 2025, 25, 1275. [Google Scholar] [CrossRef]
- Jeon, H.; Choi, H.; Noh, D.; Kim, T.; Lee, D. Wearable Inertial Sensor-Based Hand-Guiding Gestures Recognition Method Robust to Significant Changes in the Body-Alignment of Subject. Mathematics 2022, 10, 4753. [Google Scholar] [CrossRef]
- Toro-Ossaba, A.; Jaramillo-Tigreros, J.; Tejada, J.C.; Peña, A.; López-González, A.; Castanho, R.A. LSTM Recurrent Neural Network for Hand Gesture Recognition Using EMG Signals. Appl. Sci. 2022, 12, 9700. [Google Scholar] [CrossRef]
- Hassan, N.; Miah, A.S.M.; Shin, J. A Deep Bidirectional LSTM Model Enhanced by Transfer-Learning-Based Feature Extraction for Dynamic Human Activity Recognition. Appl. Sci. 2024, 14, 603. [Google Scholar] [CrossRef]
- Lin, W.-C.; Tu, Y.-C.; Lin, H.-Y.; Tseng, M.-H. A Comparison of Deep Learning Techniques for Pose Recognition in Up-and-Go Pole Walking Exercises Using Skeleton Images and Feature Data. Electronics 2025, 14, 1075. [Google Scholar] [CrossRef]
- Kamali Mohammadzadeh, A.; Alinezhad, E.; Masoud, S. Neural-Network-Driven Intention Recognition for Enhanced Human–Robot Interaction: A Virtual-Reality-Driven Approach. Machines 2025, 13, 414. [Google Scholar] [CrossRef]
- Zhao, Z.; Wang, J.; Wang, S.; Wang, R.; Lu, Y.; Yuan, Y.; Chen, J.; Dai, Y.; Liu, Y.; Wang, X.; et al. Multimodal Sensing in Stroke Motor Rehabilitation. Adv. Sens. Res. 2023, 2, 2200055. [Google Scholar] [CrossRef]
- Lauer-Schmaltz, M.W.; Cash, P.; Hansen, J.P.; Das, N. Human Digital Twins in Rehabilitation: A Case Study on Exoskeleton and Serious-Game-Based Stroke Rehabilitation Using the ETHICA Methodology. IEEE Access 2024, 12, 180968–180991. [Google Scholar] [CrossRef]
- Cellupica, A.; Cirelli, M.; Saggio, G.; Gruppioni, E.; Valentini, P.P. An Interactive Digital-Twin Model for Virtual Reality Environments to Train in the Use of a Sensorized Upper-Limb Prosthesis. Algorithms 2024, 17, 35. [Google Scholar] [CrossRef]
- Fu, J.; Choudhury, R.; Hosseini, S.M.; Simpson, R.; Park, J.-H. Myoelectric Control Systems for Upper Limb Wearable Robotic Exoskeletons and Exosuits—A Systematic Review. Sensors 2022, 22, 8134. [Google Scholar] [CrossRef]
Source | Brief Description of the Study | Main Findings |
---|---|---|
Wanying et al. (2024) [11] | Method for capturing human posture for robotic rehabilitation of upper limbs using optical motion capture. | An optical method based on markers placed on the human body was used to capture movements. The collected data on the user’s hand movements were used to control the manipulator motors, with the trajectories of the human and manipulator movements corresponding to each other with a specified level of accuracy. |
Bhujel et al. (2023) [12] | Comparison of manipulators and exoskeletons for upper-limb rehabilitation of various designs. | Exoskeletons only track the body segments to which they are attached, which makes it difficult to track, for example, fingers. Depth cameras and inertial sensors need to be integrated for more accurate tracking of the entire body. |
Mani Bharathi et al. (2024) [13] | The systematic review of the use of VR in upper-limb rehabilitation after stroke. | The review includes an analysis of the tracking systems used, including specialised gloves, motion sensors, mobile devices, inertial sensors, cameras, and computer vision. |
Ödemiş et al. (2025) [14] | Methods for assessing the effectiveness of upper-limb rehabilitation. | Approaches to assessing the effectiveness of rehabilitation using various methods are considered: positional error analysis, assessment of the strength of interaction with a robot, electromyography, electroencephalography (EEG), exercise performance effectiveness, physiological signals (pulse, galvanic skin response). |
Cha et al. (2021) [15] | Development of a virtual reality rehabilitation system for restoring upper-limb function. | The computer vision-based tracking system is considered, which allows for analysing precise finger movements, the amplitude of the entire arm’s movement, and matching the user’s upper limb movements with those of the avatar’s upper limbs. |
Li et al. (2016) [16] | The use of wearable devices to measure training effectiveness in sports medicine. | Wearable devices in sports medicine (accelerometers, heart rate monitors, GPS devices, and other sensors) are analysed to track exercise performance, analyse movements, and prevent sports injuries. |
Li et al. (2019) [17] | The use of wearable sensors for health monitoring in VR. | The use of various wearable sensors for health monitoring in VR. EEG, heart rate, temperature, ECG, EMG, pressure, and other sensors are used, which highlights the need to generate comprehensive information about the user’s physiological indicators. |
Tsilomitrou et al. (2021) [18] | Wireless capture system for upper-limb rehabilitation. | Combining multiple IMUs to monitor patient progress during rehabilitation exercises. Merging data from sensors to construct a kinematic chain with 7 degrees of freedom. |
Exercise Number | Description |
---|---|
1. | Flexion and extension of the arm at the elbow |
2. | Flexion and extension of the wrist |
3. | Reaching forward with one’s hand |
4. | Spreading the arm out to the side and bringing it back to the chest |
5. | Circular movements of the arm along the body |
6. | Movement of the palm in front of the body along the x-axis |
7. | Movement of the palm in front of the body along the y-axis |
8. | Movement of the palm in front of the body along the z-axis |
9. | Imitation of grasping an object with an outstretched arm and moving it towards the chest |
Model | MAE | MSE | Time |
---|---|---|---|
1.1 Linear Regression | 0.071169 | 0.009196 | 1.45 × 10−4 |
1.2 ElasticNet Regression | 0.092258 | 0.013951 | 8.46 × 10−5 |
1.3 RANSAC Regressor | 0.126859 | 0.032174 | 9.36 × 10−5 |
1.4 Theil–Sen Regressor | 0.072374 | 0.009491 | 0.633186 |
1.5 Decision Tree Regressor | 0.029349 | 0.002572 | 2.00 × 10−4 |
1.6 Random Forest Regressor | 0.002181 | 4.43 × 10−5 | 0.013973 |
1.7 AdaBoost | 0.067673 | 0.008036 | 0.345653 |
1.8 Gradient Boosting Regressor | 0.041770 | 0.003961 | 0.036809 |
1.9 XGBRegressor | 0.027777 | 0.001940 | 5.00 × 10−4 |
1.10 K-Nearest Neighbors Regressor | 0.006458 | 4.00 × 10−4 | 0.009008 |
1.11 Dense NN | 0.028895 | 0.002276 | 0.920696 |
1.12 Transformer | 0.092324 | 0.013956 | 0.433021 |
1.13 CNN–Transformer | 0.022941 | 0.001387 | 2.043763 |
Model | MAE | MSE | Time |
---|---|---|---|
1.1 Linear Regression | 4.75 × 10−4 | 6.96 × 10−6 | 1.56 × 10−4 |
1.2 ElasticNet Regression | 4.16 × 10−4 | 7.10 × 10−6 | 8.99 × 10−5 |
1.3 RANSAC Regressor | 4.13 × 10−4 | 7.10 × 10−6 | 9.21 × 10−5 |
1.4 Theil–Sen Regressor | 4.14 × 10−4 | 7.10 × 10−6 | 0.003424 |
1.5 Decision Tree Regressor | 4.19 × 10−4 | 7.12 × 10−6 | 1.57 × 10−4 |
1.6 Random Forest Regressor | 5.29 × 10−4 | 8.81 × 10−6 | 0.017663 |
1.7 AdaBoost | 0.003944 | 8.34 × 10−5 | 0.023176 |
1.8 Gradient Boosting Regressor | 4.40 × 10−4 | 7.37 × 10−6 | 0.015588 |
1.9 XGBRegressor | 4.85 × 10−4 | 7.98 × 10−6 | 4.18 × 10−4 |
1.10 K-Nearest Neighbors Regressor | 6.53 × 10−4 | 8.25 × 10−6 | 0.009248 |
1.11 Dense NN | 4.22 × 10−4 | 7.10 × 10−6 | 0.073274 |
1.12 Transformer | 4.34 × 10−4 | 7.10 × 10−6 | 0.207461 |
1.13 CNN–Transformer | 4.19 × 10−4 | 7.10 × 10−6 | 0.396306 |
EMG | IMUp | IMUa | CV | VRp | VRa | LoR | NNC | DT | RF | Ada | GNB | XGB | Stack | Vot | DNN | Tr | CTr | CGRU |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
✓ | 0.460 | 0.786 | 0.365 | 0.278 | 0.500 | 0.357 | 0.460 | 0.738 | 0.492 | 0.897 | 0.500 | 0.810 | 0.897 | |||||
✓ | 0.881 | 0.960 | 0.762 | 0.730 | 0.921 | 0.508 | 0.913 | 0.968 | 0.802 | 0.976 | 0.952 | 1.000 | 1.000 | |||||
✓ | 0.659 | 0.897 | 0.786 | 0.722 | 0.897 | 0.540 | 0.881 | 0.952 | 0.921 | 1.000 | 0.952 | 0.984 | 1.000 | |||||
✓ | 0.937 | 0.937 | 0.825 | 0.841 | 0.937 | 0.722 | 0.857 | 0.905 | 0.913 | 0.968 | 0.968 | 0.992 | 0.968 | |||||
✓ | 0.889 | 0.992 | 0.841 | 0.738 | 0.952 | 0.437 | 0.937 | 0.992 | 0.905 | 0.992 | 0.754 | 0.984 | 0.992 | |||||
✓ | 0.921 | 0.968 | 0.929 | 0.889 | 0.968 | 0.619 | 0.921 | 0.960 | 0.929 | 0.992 | 0.706 | 1.000 | 1.000 | |||||
✓ | ✓ | 0.460 | 0.786 | 0.722 | 0.675 | 0.913 | 0.571 | 0.873 | 0.841 | 0.786 | 0.913 | 0.714 | 0.889 | 0.937 | ||||
✓ | ✓ | 0.786 | 0.897 | 0.746 | 0.746 | 0.929 | 0.532 | 0.889 | 0.905 | 0.865 | 0.992 | 0.976 | 1.000 | 1.000 | ||||
✓ | ✓ | 0.921 | 0.968 | 0.873 | 0.865 | 0.976 | 0.643 | 0.905 | 0.984 | 0.929 | 0.984 | 0.817 | 0.992 | 0.992 | ||||
✓ | ✓ | ✓ | 0.698 | 0.825 | 0.746 | 0.762 | 0.921 | 0.595 | 0.873 | 0.841 | 0.802 | 0.976 | 0.921 | 0.984 | 0.976 | |||
✓ | ✓ | ✓ | 0.952 | 0.976 | 0.825 | 0.881 | 0.968 | 0.698 | 0.929 | 0.968 | 0.913 | 0.992 | 0.794 | 1.000 | 1.000 | |||
✓ | ✓ | ✓ | ✓ | 0.698 | 0.825 | 0.841 | 0.849 | 0.952 | 0.722 | 0.889 | 0.905 | 0.865 | 0.968 | 0.944 | 0.976 | 0.992 | ||
✓ | ✓ | ✓ | ✓ | 0.952 | 0.968 | 0.865 | 0.889 | 0.952 | 0.754 | 0.929 | 0.944 | 0.897 | 0.992 | 0.952 | 1.000 | 1.000 | ||
✓ | ✓ | ✓ | ✓ | ✓ | 0.698 | 0.825 | 0.794 | 0.865 | 0.968 | 0.722 | 0.881 | 0.873 | 0.817 | 0.937 | 0.921 | 0.984 | 0.992 | |
✓ | ✓ | ✓ | ✓ | ✓ | 0.802 | 0.857 | 0.857 | 0.810 | 0.952 | 0.770 | 0.897 | 0.889 | 0.857 | 0.992 | 0.937 | 0.984 | 1.000 | |
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 0.802 | 0.857 | 0.865 | 0.881 | 0.976 | 0.810 | 0.913 | 0.905 | 0.857 | 1.000 | 0.937 | 0.984 | 0.984 |
EMG | IMUp | IMUa | CV | VRp | VRa | LoR | NNC | DT | RF | Ada | GNB | XGB | Stack | Vot | DNN | Tr | CTr | CGRU |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
✓ | 0.222 | 0.452 | 0.317 | 0.246 | 0.325 | 0.317 | 0.302 | 0.317 | 0.310 | 0.706 | 0.349 | 0.476 | 0.571 | |||||
✓ | 0.698 | 0.905 | 0.738 | 0.619 | 0.833 | 0.492 | 0.825 | 0.897 | 0.825 | 0.984 | 0.944 | 0.992 | 0.960 | |||||
✓ | 0.468 | 0.873 | 0.849 | 0.794 | 0.889 | 0.460 | 0.873 | 0.881 | 0.881 | 0.937 | 0.952 | 0.968 | 0.976 | |||||
✓ | 0.889 | 0.889 | 0.786 | 0.802 | 0.889 | 0.476 | 0.865 | 0.897 | 0.825 | 0.929 | 0.913 | 0.976 | 0.968 | |||||
✓ | 0.746 | 0.968 | 0.714 | 0.738 | 0.881 | 0.571 | 0.873 | 0.929 | 0.802 | 0.984 | 0.810 | 0.976 | 0.992 | |||||
✓ | 0.817 | 0.960 | 0.833 | 0.810 | 0.929 | 0.595 | 0.913 | 0.960 | 0.913 | 0.984 | 0.897 | 0.968 | 0.992 | |||||
✓ | ✓ | 0.587 | 0.873 | 0.802 | 0.770 | 0.913 | 0.587 | 0.873 | 0.921 | 0.849 | 0.968 | 0.984 | 0.984 | 0.984 | ||||
✓ | ✓ | 0.794 | 0.960 | 0.873 | 0.873 | 0.944 | 0.635 | 0.937 | 0.960 | 0.889 | 0.984 | 0.952 | 0.984 | 0.984 | ||||
✓ | ✓ | 0.881 | 0.960 | 0.865 | 0.921 | 0.968 | 0.698 | 0.952 | 0.952 | 0.937 | 0.984 | 0.960 | 0.992 | 0.992 | ||||
✓ | ✓ | ✓ | 0.841 | 0.913 | 0.833 | 0.881 | 0.968 | 0.706 | 0.937 | 0.921 | 0.929 | 0.968 | 0.960 | 0.984 | 0.984 | |||
✓ | ✓ | ✓ | 0.389 | 0.516 | 0.833 | 0.857 | 0.984 | 0.706 | 0.929 | 0.833 | 0.865 | 0.921 | 0.873 | 0.968 | 0.976 | |||
✓ | ✓ | ✓ | ✓ | 0.389 | 0.516 | 0.810 | 0.889 | 0.968 | 0.786 | 0.929 | 0.857 | 0.873 | 0.913 | 0.841 | 0.976 | 0.968 | ||
✓ | ✓ | ✓ | ✓ | 0.222 | 0.452 | 0.317 | 0.246 | 0.325 | 0.317 | 0.302 | 0.317 | 0.310 | 0.706 | 0.349 | 0.476 | 0.571 | ||
✓ | ✓ | ✓ | ✓ | ✓ | 0.698 | 0.905 | 0.738 | 0.619 | 0.833 | 0.492 | 0.825 | 0.897 | 0.825 | 0.984 | 0.944 | 0.992 | 0.960 | |
✓ | ✓ | ✓ | ✓ | ✓ | 0.468 | 0.873 | 0.849 | 0.794 | 0.889 | 0.460 | 0.873 | 0.881 | 0.881 | 0.937 | 0.952 | 0.968 | 0.976 | |
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 0.889 | 0.889 | 0.786 | 0.802 | 0.889 | 0.476 | 0.865 | 0.897 | 0.825 | 0.929 | 0.913 | 0.976 | 0.968 |
Model | Accuracy | Precision | Recall | F1-Score | Time, s. |
---|---|---|---|---|---|
2.1 Logistic Regression | 0.8016 | 0.8767 | 0.8073 | 0.8148 | 0.7699 |
2.2 Nearest Neighbors Classification | 0.8571 | 0.8940 | 0.8614 | 0.8595 | 1.5360 |
2.3 Decision Tree Classifier | 0.8651 | 0.9108 | 0.8671 | 0.8746 | 0.0225 |
2.4 Random Forest Classifier | 0.8810 | 0.9294 | 0.8816 | 0.8867 | 0.0284 |
2.5 AdaBoost Classifier | 0.9762 | 0.9804 | 0.9766 | 0.9771 | 0.0487 |
2.6 Gaussian Naive Bayes | 0.8095 | 0.8573 | 0.8069 | 0.7936 | 0.1545 |
2.7 XGBClassifier | 0.9127 | 0.9511 | 0.9134 | 0.9218 | 0.4667 |
2.8 Stacking Classifier | 0.9048 | 0.9263 | 0.9053 | 0.9085 | 1.4746 |
2.9 Voting Classifier | 0.8571 | 0.9311 | 0.8584 | 0.8747 | 1.8218 |
2.10 Dense NN | 0.9841 | 0.9840 | 0.9835 | 0.9835 | 0.0928 |
2.11 Transformer | 0.9921 | 0.9926 | 0.9915 | 0.9917 | 0.2688 |
2.12 CNN–Transformer | 0.9444 | 0.9456 | 0.9438 | 0.9436 | 0.3683 |
2.13 CNN–GRU | 0.9762 | 0.9755 | 0.9756 | 0.9753 | 0.5511 |
Model | Accuracy | Precision | Recall | F1-Score | Time, s. |
---|---|---|---|---|---|
2.1 Logistic Regression | 0.7857 | 0.8322 | 0.7884 | 0.7902 | 0.6984 |
2.2 Nearest Neighbors Classification | 0.8968 | 0.9107 | 0.8993 | 0.8986 | 0.6466 |
2.3 Decision Tree Classifier | 0.7460 | 0.8469 | 0.7500 | 0.7569 | 0.0193 |
2.4 Random Forest Classifier | 0.7460 | 0.9089 | 0.7495 | 0.7726 | 0.0248 |
2.5 AdaBoost Classifier | 0.9286 | 0.9517 | 0.9295 | 0.9311 | 0.0457 |
2.6 Gaussian Naive Bayes | 0.5317 | 0.6113 | 0.5230 | 0.5197 | 0.0103 |
2.7 XGBClassifier | 0.8889 | 0.9315 | 0.8886 | 0.8958 | 0.3956 |
2.8 Stacking Classifier | 0.9048 | 0.9432 | 0.9090 | 0.9137 | 0.4264 |
2.9 Voting Classifier | 0.8651 | 0.9198 | 0.8674 | 0.8775 | 0.4592 |
2.10 Dense NN | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9878 |
2.11 Transformer | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.5917 |
2.12 CNN–Transformer | 0.9841 | 0.9847 | 0.9851 | 0.9844 | 0.3468 |
2.13 CNN–GRU | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.5062 |
Source | Data Source | Model Type | Classification Accuracy |
---|---|---|---|
Jiang et al. (2022) [41] | EMG + IMU | LSTM–Res | 99.67% |
Jiang et al. (2022) [41] | EMG + IMU | GRU–Res | 99.49% |
Jiang et al. (2022) [41] | EMG + IMU | Transformer–CNN | 98.96% |
Vásconez et al. (2022) [31] | EMG + IMU | DQN (deep Q-network) | 97.50% ± 1.13% |
Bangaru et al. (2020) [42] | EMG + IMU | Random Forest | 98.13% |
Valdivieso Caraguay et al. (2023) [43] | EMG | DQN | 90.37%
Bai et al. (2025) [44] | EMG + IMU | SWCTNet (CNN + Transformer) | 98% |
Jeon et al. (2022) [45] | IMU | Bi-directional LSTM | 91.7% |
Toro-Ossaba et al. (2022) [46] | EMG | LSTM | 87.29 ± 6.94% |
Hassan et al. (2024) [47] | CV (UCF11 dataset) | Deep BiLSTM | 99.2% |
Lin et al. (2025) [48] | CV (MediaPipe) | Swin Transformer | 99.7% |
Kamali Mohammadzadeh et al. (2025) [49] | VR trackers | CNN–Transformer | 100% |
Our work | IMU | Dense neural network/CNN–Transformer/CNN–GRU | 100% |
Our work | EMG + IMU + CV + VR trackers | Transformer | 99.2%
© 2025 by the authors. Published by MDPI on behalf of the International Institute of Knowledge Innovation and Invention. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).