Automated Control of Rehabilitation Process in Physical Therapy Using a Novel Human Skeleton-Based Balanced Time Warping Algorithm

Seredin, Oleg; Kopylov, Andrey; Surkov, Egor; Mityugov, Nikita; Tokarev, Alexei; Bagchi, Parama; Bhattacharjee, Debotosh

doi:10.3390/s25216696

Open AccessArticle

Automated Control of Rehabilitation Process in Physical Therapy Using a Novel Human Skeleton-Based Balanced Time Warping Algorithm

by

Oleg Seredin

¹

,

Andrey Kopylov

^1,*

,

Egor Surkov

¹

,

Nikita Mityugov

¹

,

Alexei Tokarev

²,

Parama Bagchi

³

and

Debotosh Bhattacharjee

⁴

¹

Laboratory of Cognitive Technologies and Simulation Systems, Tula State University, Lenin Ave. 92, 300012 Tula, Russia

²

Medical Institute, Tula State University, Lenin Ave. 92, 300012 Tula, Russia

³

Department of CSE, RCC Institute of Information Technology, Beliaghata, Kolkata 700015, India

⁴

Department of CSE, Jadavpur University, Kolkata 700032, India

^*

Author to whom correspondence should be addressed.

Sensors 2025, 25(21), 6696; https://doi.org/10.3390/s25216696

Submission received: 12 September 2025 / Revised: 17 October 2025 / Accepted: 28 October 2025 / Published: 2 November 2025

(This article belongs to the Section Sensing and Imaging)

Download

Browse Figures

Review Reports Versions Notes

Abstract

Physical therapy is a critical component of medical rehabilitation, aiding recovery from conditions such as stroke, spinal cord injuries, and musculoskeletal disorders. Effective rehabilitation requires precise monitoring of patient performance to ensure exercises are executed correctly and progress is accurately assessed. This paper presents a novel automated system for controlling the rehabilitation process and evaluating physical therapy exercise quality using computer vision and a customized Human Skeleton-based Balanced Time Warping algorithm. The proposed method quantitatively assesses the similarity between a physiotherapist and patient performance by analyzing skeletal motion data extracted from RGB-D video sequences without requiring pre-alignment or sensor-specific calibration. A motion-dependent, weighted Euclidean distance between 3D skeletal models is used to compute pose dissimilarity, while a modified DTW approach aligns temporal sequences and evaluates dynamic consistency. The total dissimilarity measure is a balanced combination of posture (DP) and dynamics (DT) components. Evaluated on a custom dataset of 136 video recordings from 23 participants performing exercises in sitting and standing positions under varying performance accuracy levels (“good,” “intermediate,” and “bad”), the system demonstrates the strong clustering of accuracy levels. Proposed dissimilarity, together with a fixed reference element (physiotherapist), induces a natural non-strict order on the set of distances between patients and physiotherapists. A high value of Spearman’s rank correlation coefficient between computed dissimilarity and execution accuracy (0.977) indicates that this method is suitable for assessing exercise performance accuracy and for adequately evaluating the patient’s rehabilitation progress. The method enables objective, real-time feedback, reduces therapist workload, and supports remote monitoring, offering a scalable solution for personalized rehabilitation. Future work will involve clinical validation with post-stroke and cardiac patients.

Keywords:

rehabilitation assessment; skeletal models; motion analysis; weighted Euclidean distance; time warping

1. Introduction

Physical therapy plays an important role in medical rehabilitation and is an integral part of it [1]. It helps patients recover from various medical conditions by improving their physical function strength and mobility.

Many injuries or medical conditions can affect the ability to function [2], including

Brain disorders, such as stroke, multiple sclerosis, or cerebral palsy;
Long-term (chronic) pain, including back and neck pain;
Major bone or joint surgery;
Severe arthritis becoming worse over time;
Severe weakness after recovering from a serious illness (such as infection, heart failure, or respiratory failure);
Spinal cord injury or brain injury.

Physical rehabilitation describes the process that a person goes through to reach optimal physical functioning.

Controlling the rehabilitation process involves regular monitoring and adjustments based on patient progress. This ensures that the rehabilitation program is tailored to the individual’s needs, promoting optimal recovery.

Physiotherapists often face various challenges such as resource limitations, complex patient needs, communication barriers, financial constraints, and ethical and legal issues.

Computer vision and artificial intelligence (AI) are revolutionizing the field of medical rehabilitation, particularly in the control and monitoring of exercise. These technologies offer several advantages that enhance the effectiveness and efficiency of rehabilitation programs. Computer vision systems can be used to monitor patients’ movements during exercise therapy sessions. Unlike wearable or directly printed on garments sensors [3], video cameras do not restrict movement and do not require lengthy preparation for a rehabilitation session. By using cameras and depth sensors, these systems can track the movements of patients, ensuring that exercises are performed correctly and safely. This real-time feedback helps in

Correcting Form: Ensuring that patients perform exercises with the correct form to prevent injuries and maximize the benefits of the therapy;
Progress Tracking: Monitoring patient improvements in range of motion, strength, and endurance over time;
Remote Supervision: Allowing therapists to supervise patients remotely, which is particularly useful for those who cannot frequently visit rehabilitation centers;
Reduce the workload of the physiotherapist: Allowing them to manage more patients simultaneously and focus on complex cases.

Although numerous patient-centered systems have been developed for home rehabilitation, there is a notable lack of systems designed to support both the physiotherapist and the patient [4].

In modern physical rehabilitation protocols, patients typically perform exercises with periodic feedback or guidance following initial demonstrations by the physiotherapist. As the patient tries to repeat the exercises after the physiotherapist, how accurately they can do this can serve as an important indicator of the degree of their rehabilitation.

The main objective of this work is to build a system of interactions between a patient and a physiotherapist using a quantitative assessment of the degree of similarity between their performance of exercises based on computer vision methods.

The nature of the exercises allows for the phases of movements of the therapist and the patient to be aligned in order to assess the dissimilarity of poses in the corresponding frames. We use a skeletal description of a human pose to form the basis for measuring such value. In [5], we have proposed the measure of dissimilarity between two skeletons, which can be used here for dual purposes. First, it can serve for time warping to align the frames in video sequences of the therapist and patient, which correspond to the same phases of an exercise. Second, it can measure the degree of dissimilarity between poses in aligned frames. Thus, we have two parts in the overall dissimilarity measure between the videos. One part is responsible for inconsistency in the dynamics of an exercise and another reflects the differences in posture. The final measure is a kind of balance between these two parts. Given the adopted protocol, it is essential not to interfere with the patient’s exercise execution during recording, as any real-time intervention could bias the evaluation. Therefore, the unit of assessment here is the complete the exercise recording and evaluate it against the instructor’s reference performance.

The main contributions of the proposed method can be summarized as follows.

We do not use any pre-alignment or registration techniques before computing the similarity between the two video sequences;
We introduce a novel algorithm called Human Skeleton-based Balanced Time Warping, a modified dynamic time warping (DTW) algorithm-based technique for comparing two video sequences. Using skeleton-based dissimilarity allows us to significantly speed up and reduce the requirements for the computing resources of the warping algorithm;
Our proposed method demonstrated the possibility of accurately measuring the rehabilitation progress of patients by quantitatively assessing how well patients reproduce the exercises performed by instructors;
We collected a new dataset with objective scores of rehabilitation degrees.

The core innovation lies in the fact that, instead of using specific pose features, we adopted a featureless approach, elaborated in our previous works on skeleton standardization [6], fall detection [5], and skeleton distance evaluation [7], based on a balanced dissimilarity measure designed explicitly for skeletal model analysis in rehabilitation systems.

We hope that the integration of computer vision and AI technologies in exercise therapy represents a significant advancement in the field of medical rehabilitation. These technologies not only enhance the precision and effectiveness of rehabilitation programs but also make them more accessible and personalized. As a result, patients can achieve better outcomes and a higher quality of life.

The rest of the paper has the following structure. A brief review of the recent methods of human action analysis based on skeleton descriptions and dynamic time warping is presented in Section 2. Section 3 provides a comprehensive description of our methodology. The experimental evaluation and proof of the method are presented in Section 4.

2. Literature Review

Recent advances in human action recognition (HAR) have increasingly leveraged skeletal data due to its robustness to appearance variations and environmental conditions. In this section, we present a brief literature review of the recent works on HAR to clarify methodological trends, distinguish between general HAR and clinical applications, and highlight the evolution from handcrafted features to self-supervised deep models.

Early approaches to skeleton-based HAR relied heavily on handcrafted features combined with dynamic time warping (DTW) or related alignment strategies to handle temporal variability in motion sequences. Tormene et al. [3]. introduced a DTW variant for matching incomplete time series in post-stroke rehabilitation, establishing a foundational approach for variable-length motion comparison.

Kumar et al. [8] extracted skeletal joint trajectories from Kinect v2 data and applied graph-based time-series matching to recognize actions. Similarly, Vishwakarma and Jain [9] projected 3D joint positions onto a 2D plane to form movement polygons, enabling geometric feature extraction; their method achieved up to 95.7% accuracy across four general HAR benchmark datasets (MSR Action3D, Berkeley MHAD, TST Fall Detection, NTU-RGB+D).

İnce et al. [10] introduced wavelet transform (HWT), which is used to preserve the information of the features before reducing the data dimension. Dimension reduction using an averaging algorithm is also applied to decrease the computational cost, which provides a faster performance. Before the classification, the authors proposed a thresholding method to extract the final feature set. Finally, the K-nearest neighbor (k-NN) algorithm is used to recognize the activity with respect to the given data. In this case, the accuracy of the proposed model with LSTM was 82.2%.

Muralikrishna et al. [11] based his proposed work on human action recognition by combining structural and temporal features. He based his research on four datasets, namely, KTH, UTKinect, and MSR Action3D datasets. The overall accuracy of this method is 90.33%.

Ahad et al. [12] proposed the Linear Joint Position Feature (LJPF) and Angular Joint Position Feature (AJPF) based on 3D linear joint positions and angles between bone segments, which were then combined into two kinematics features for each video frame. These features include the variation in motion in the temporal domain, as each body joint represents kinematic positions and orientation sensors. Then, the extracted KPF feature descriptor was extracted using a low-pass filter, which consists of temporal domain statistical features, which were further segmented using Position-based Statistical Features (PSFs). For classification purposes, a variety of classifiers including Support Vector Machine (Linear), RNN, CNNRNN, and ConvRNN models were used. The highest classification rate was 98.44% using ConvRNN + PSF.

The above works were based mainly on the extraction of features from human action datasets and classification using deep learning models. The exact application of the above methods in the field of biomedical engineering is still to be proposed. More recent works exploit deep architectures to learn spatiotemporal representations directly from raw or minimally processed skeleton sequences.

Wang et al. [13] proposed an algorithm “MEET_JEANIE” where the 3D skeleton sequences whose camera and subjects’ poses can be easily manipulated were extracted and evaluated on skeletal Few-shot Action Recognition (FSAR), where matching the temporal blocks of support–query sequence pairs (by factoring out nuisance variations) is essential due to the limited samples. Given a query sequence, the author’s created several views by simulating several camera locations. For comparison purposes, the smallest distances amongst matching pairs were selected, with different temporal viewpoint warping patterns. The algorithm achieved 65% accuracy over the already existing ones. In the above work by Wang et al. [13], an alignment procedure was also used.

Graph Convolutional Networks (GCNs) have emerged as a powerful tool for modeling skeletal topology. Du et al. [14] had proposed a self-supervised GCN framework for rehabilitation exercise assessment, leveraging regularization to improve generalization without extensive, labeled data. The experiments on the existing HAR benchmark dataset validated that the proposed methods achieved state-of-the-art performance with lower prediction and improved performances.

Park et al. [15] generated motion embedding vectors for the body parts, and the motion variation loss was introduced in order to distinguish similar kinds of motions. The authors also worked on the synthetic dataset to train the model. The entire system was tested on the NTU RGB+D 120 dataset.

Chen et al. [16] target fine-grained activity recognition and prediction in smart manufacturing, aiming to improve productivity and safety. It introduces the two-stage deep learning network combining multi-modal feature extraction (RGB and hand skeleton) with temporal modeling (LSTM and GRU), achieving high accuracy in both trimmed and continuous assembly activity videos.

While many HAR methods focus on generic action classification, a growing body of work tailors techniques to clinical rehabilitation, where precision, interpretability, and patient-specific adaptation are critical.

Anton et al. [17] used “KiReS”, a telerehabilitation system based on Kinect, using 3D models to check how an exercise should be repeated by patients, as performed by the instructor. This feature can help them improve the performance of the exercises performed.

Lam et al. [4] implemented an Automated Rehabilitation System (ARS) in a hip/knee replacement clinic, providing real-time visual feedback by superimposing ideal motion trajectories onto patient performance.

Çubukçu et al. [18] proposed a Kinect-based integrated physiotherapy for shoulder damage. The authors implemented the method on 14 volunteers who were treated with conventional methods and 15 who were treated with the proposed mechanism. Yurtman et al. [19] proposed an automatic system to detect and evaluate physical therapy exercises. The accuracy is 93.46% for exercise classification and 88.65% for simultaneous exercise.

Adolf et al. [20] assessed the quality of the Single Leg Squat Test and Step-Down functional tests. The authors recorded the exercises using forty-six healthy subjects, extracting the motion data, and classifying them for assessment by three independent physiotherapists. The authors calculated the ranges of movement in key point-pair orientations, joint angles, and the relative distances of the monitored segments using machine learning. The results showed that the AdaBoost classifier achieved a specificity of 0.8, a sensitivity of 0.68, and an accuracy of 0.7.

Gal et al. [21] presented a Kinect and fuzzy inference system-based e-rehabilitation system. The Kinect could detect the motion of patients, and the fuzzy inference system could interpret the data by looking through the initial posture and motion changes on a cognitive level. The system is capable of assessing the initial posture and motion ranges of 20 joints. Using angles to describe the motion of the joints, exercise patterns were developed for each patient.

Notably, while the entertainment industry has long known solutions for the Xbox that allow one to assess the degree of similarity of the repetition of dance movements for a typical rhythm game avatar (Just Dance, Dance Central), rehabilitation requires domain-aware similarity metrics that account for the initial posture, movement type, pace, and therapeutic intent—factors often ignored in general-purpose skeleton matching [22].

Liao et al. [23] conducted a study which reviews an important role performed by trained clinicians in assisting patients with performing rehabilitation tasks. Tsiouris et al. [24] conducted a review on virtual coaching designed to initiate healthcare interventions, which combined sensing and system-user interventions. The results focused on the fact that home coaching techniques were better to deal with healthier lifestyles and training patients with IOT devices and sensors. Debnath et al. [25] further underscore the need for clinically validated, sensor-driven coaching systems that close the feedback loop between patients and healthcare providers.

The exercises in physical rehabilitation have their own specific movement patterns, which include the initial positioning of the patient, the predominant types of movements used, and the pace of the exercises. The dissimilarity function proposed in this work, unlike general-purpose skeleton-based measures (e.g., [22]), allows us to take these characteristics into account to improve the accuracy and reliability of assessing the patients’ rehabilitation progress. The difference between the already existing methods and the present method is that, here, we have a specific goal that could be applied to patients who have suffered from motion disorders. For this reason, we have extracted the skeletons of the instructor and the patient and found the degree of similarity using a novel algorithm termed as the “Human Skeleton-based Balanced Time Warping algorithm”.

3. Proposed Methodology

We consider here the degree of rehabilitation of the patient based on his ability to perform certain physical exercises at the instructor’s command. The more accurately he repeats the instructor’s movements, the better the rehabilitation goes.

We will apply our algorithm to the frames of the two video sequences, obtain the alignment path between frames using the distance between skeletons, and, finally, compute the distance between video sequences with the help of the distances between skeletons for each corresponding pair of frames in the alignment path. The general architecture of the system is shown in Figure 1.

The first step of the proposed method is extraction of a skeleton from each frame of a video sequence. Following the convention in [6], all skeletons are normalized and transformed into a standard form. Using the instructor’s reference sequence, we compute per-coordinate attention weights that reflect the relative importance of each spatial dimension of each skeletal point for the specific exercise being performed. Concurrently, we quantify the intensity of motion over time for both the patient and the instructor by analyzing temporal derivatives of joint trajectories.

To account for differences in movement timing, we apply dynamic time warping (DTW) to align the two sequences, ensuring that frames corresponding to the same phase of the exercise are matched. This alignment enables the computation of a dynamics-based dissimilarity component, which captures discrepancies in movement tempo and rhythm.

Subsequently, for each pair of temporally aligned skeletons, we compute a weighted Euclidean distance [7] between the patient’s and instructor’s poses. The weights are derived from the previously estimated attention scores, emphasizing clinically or kinematically relevant joints and dimensions. This yields a pose-based dissimilarity component that reflects deviations in spatial configuration.

The total dissimilarity between the patient and instructor video sequences is defined as a balanced sum of the dynamics-based and pose-based components, providing a comprehensive measure of exercise fidelity that accounts for both temporal execution and postural accuracy.

A more comprehensive and detailed description of the proposed method is provided below.

The proposed method begins by extracting a skeletal representation—referred to as a skeleton—from each frame of a video sequence. Skeleton models are widely used in various human activity analyses and recognition systems, like fall detection, activity classification, human–computer interaction, etc. There are several modern techniques to obtain a skeletal model from a video frame. First, depth sensors, as they became more available, occupy an increasing place in event recognition systems. Many of them can directly produce a skeletal description of the human figure for compact representation of a person’s posture. An alternative way consists of using special neural network-based solutions that build skeleton models from images produced by a conventional RGB camera like YOLO11-pose, Alphapose, Google MediaPipe Pose Landmarker, and others. All these methods produce a skeleton model in different forms (Figure 2).

The method of constructing a skeletal model from a video frame is not the subject of this work. Considering our previous experience, we adopt here the standardized skeleton models obtained by the special procedure described in [6]. Each skeleton model is actually a graph and can be described by a correspondent adjacency matrix. The problem of converting one skeleton model to another could be formally stated as a problem of adjacency matrix transformation. Paper [6] divides all forms of skeletal models into three categories and provides a number of simple equations for each category for conversion. Such a procedure allows them to transform different skeletal models into the standard form (Figure 2), making the proposed degree of rehabilitation assessment independent from the type of sensor. After that, each skeleton comprises 17 vertices (Figure 2), corresponding approximately to major body joints, with each point represented by its 3D coordinates.

In particular, the experimental dataset of videos corresponding to the instructor and the patients who try to successfully execute the exercises demonstrated by the instructors was collected using the Kinect v2 RGBD sensor (Microsoft Corporation, Redmond, WA, USA).

A dissimilarity measure between two skeletal models plays a key role in the development of any activity assessment system. In our case, the specificity of physical exercises for the rehabilitation programs allows us to improve the reliability and robustness of the measure considering the different contributions of the skeleton points to the final movement during a particular exercise, producing motion-dependent dissimilarity measures.

In paper [7], we proposed to use weighted Euclidean distance to calculate the dissimilarity measure, with weights proportional to the standard deviation of the coordinates of the correspondent skeletal point representing the degree of attention to the different coordinates of different skeleton points. The general flowchart of the dissimilarity measure evaluation can be found in Figure 3.

The distance between pairs of skeletons,

I

(instructor) and

P

(patient), could be found by the motion-dependent dissimilarity measure:

d [S (I), S (P)] = \frac{1}{K} \sum_{p = 0}^{K - 1} \sqrt{\sum_{m \in {x, y, z}} w_{p}^{m} {(S {(I)}_{p}^{m} - S {(P)}_{p}^{m})}^{2}}, w_{p}^{m} = \frac{σ_{p}^{m}}{\sum_{m \in {x, y, z}} \sum_{p = 0}^{K - 1} σ_{p}^{m}},

where

K

—number of used points of skeletons,

S {(I)}_{p}^{m}

—

m

-th coordinate of

p

-th point of instructor skeletal model

S (I)

,

S {(P)}_{p}^{m}

—

m

-th coordinate of

p

-th point of patient skeletal model

S (P)

,

w_{p}^{m}

—weights of importance, representing the degree of attention to the

m

-th coordinate of the

p

-th point, and

σ_{p}^{m}

—standard deviation of

m

-th coordinate of the

p

-th point of instructor skeleton.

Our challenge is to measure the distance between two video sequences based on the distance between skeletons. But video sequences have different lengths, and the relative time of the beginning of an exercise is often not well-defined. The common solution is to use a version of dynamic time warping (DTW) to find a match between frames of the first video sequence and frames of the second video sequence to obtain an alignment path between them. The comprehensive review of different approaches to temporal alignment methods can be found in [3,23].

The problem with aligning instructors’ and patients’ records can be approached from two opposing perspectives. On the one hand, a skeletal model with 17 points in three-dimensional space yields 51 curves representing changes in corresponding coordinates over time, and multivariate temporal alignment could be performed by comparing these curves to an equal number of curves generated by the second record being compared. On the other hand, for alignment, it is sufficient to use a distance function between states of the skeletal models at comparable moments in time without directly using changes in the coordinates of the skeletal model points over time. We propose the use of the second scheme for temporal alignment. This makes the algorithm independent of how skeletal models are compared and lets us use motion-dependent dissimilarity measures for both sequence alignment and calculating the person-pose discrepancy in frames that match as a result of alignment.

The classical DTW algorithm cannot handle the time shift between the start of exercises on the compared records, as well as the different end times of the exercises. Although there exist open-ending and open-beginning versions of DTW [3], experimental evaluation shows the lack of stability of these methods while using dissimilarity measures between skeletal models for matching. We attempt to circumvent this issue by formulating the following statement of the temporal alignment task. As a common practice, we will denote one sequence

R^{b} = (R_{i}^{b}, i = 0, \dots, N^{b} - 1)

, specifically the shortest one with length

N^{b}

, as the base and the other

R^{r} = (R_{i}^{r}, i = 0, \dots, N^{r} - 1)

as the reference video sequence with length

N^{r}

. As a result, at each frame

i

in the base sequence, there should be a determined value

v_{i}

indicating the time shift between corresponding points in the base

i

and reference

i + v_{i}

sequences. The result of the analysis is represented by the set

V = (v_{i}, i = 0, \dots, N^{b} - 1)

, which takes values from the finite number of possible shifts

v \in \{- w / 2, - w / 2 + 1, \dots, w / 2\}

, where

w

defines the admissible shifts range. Parameter

w

sets the size of the shift window and specifies a possible delay between the start time of the exercises by the instructor and the patient, as well as between the end time of the exercises.

As well as in the DTW algorithm, we use the optimization approach to the alignment task with the following loss function:

J (V |R^{b}, R^{r}) = \sum_{i = 0}^{N^{b} - 1} d [S (R_{i}^{b}), S (R_{i + v_{i}}^{r})] + \sum_{i = 1}^{N^{b} - 1} γ (v_{i - 1}, v_{i}) .

(1)

The loss function

J (V |R^{b}, R^{r})

consists of two parts, where

d [S (R_{i}^{b}), S (R_{i}^{r})]

is the distance between two skeletons:

S (R_{i}^{b})

on frame number

R_{i}^{b}

of the medical instructor record and

S (R_{i}^{r})

on frame number

R_{i}^{r}

of the patient record, while

γ (v_{i - 1}, v_{i})

represents the smoothness term, which reflects that sequential interframe displacements have to change smoothly over time, and no displacement crossing is allowed based on the physical nature of the process. Here, we take

γ (v_{i - 1}, v_{i})

in the widely used form of a quadratic function with parameter

q

if the transition between displacements is valid and there are no cross references or it provides a big penalty overwise:

γ (v_{i - 1}, v_{i}) = \{\begin{matrix} q \cdot {(v_{i - 1}, v_{i})}^{2}, v_{i} \geq v_{i - 1} - 1, \\ M, v_{i} < v_{i - 1} - 1, \end{matrix}

where

q

is a smoothness penalty;

M

is a sufficiently large constant.

Such kinds of smoothness terms prevent changes in the order of frames in an aligned sequence but also penalize abrupt changes in interframe shifts.

Criterion (1) belongs to the class of pairwise separable functions, since it is a sum of functions of no more than two arguments. A problem of such kind is well-known in computer vision as the graph-based energy minimization problem [26]. Here, we have a rather simple case of a process in discrete time, and criterion (1) corresponds to a chain-like graph of variable adjacency. To solve it, a dynamic programming procedure [27] can be used, which has a linear computational complexity with respect to the length of the base sequence.

\hat{V} = \underset{V}{\arg \min} J (V |R^{b}, R^{r}) .

Having a set of shifts

V = (v_{i}, i = 0, \dots, N^{b} - 1)

, it is easy to obtain a sequence of pairs of corresponding frames

R = (R_{0}^{b}, R_{0}^{r}), \dots, (R_{N^{b} - 1}^{b}, R_{N^{b} - 1}^{r})

, where

(R_{i}^{b}, R_{i}^{r})

represent the indexes of matched frames for reference number

i

,

b

means that the frame is from the base video with length

N^{b}

,

r

means that the frame is from a reference video with length

N^{r}

, and

N

is the total number of references in the alignment path.

Considering that the shortest video is taken as «base», the total distance between two videos should be calculated as

D (R^{b}, R^{r}) = \frac{1}{N^{b}} \{\sum_{i = 0}^{N - 1} d [S (R_{i}^{b}), S (R_{i}^{r})] + α \sum_{i = 1}^{N - 1} |(|R_{i}^{b} - R_{i}^{r}|) - (|R_{i - 1}^{b} - R_{i - 1}^{r}|)|\},

(2)

where

d [S (R_{i}^{b}), S (R_{i}^{r})]

is the distance between two skeletons:

S (R_{i}^{b})

on frame number

R_{i}^{b}

of the medical instructor record and

S (R_{i}^{r})

on frame number

R_{i}^{r}

of the patient record.

The dissimilarity between two skeleton models on two frames is determined by the differences in poses but also by the discrepancy between the anthropometric characteristics of the instructor and the patient. We assume that the minimal dissimilarity value among all frame pairs

M = \min d [S (R_{i}^{b}), S (R_{i}^{r})], i = 0, \dots, N - 1

, considering preliminary normalization of the skeletons, is just determined by the anthropometric discrepancy. Therefore, the final distance should be corrected.

So, the total distance between two videos in a whole should be calculated as

D (R^{b}, R^{r}) = \frac{1}{N^{b}} \sum_{i = 0}^{N - 1} \{d [S (R_{i}^{b}), S (R_{i}^{r})] - M\} + α \frac{1}{N^{b}} \sum_{i = 1}^{N - 1} |(|R_{i}^{b} - R_{i}^{r}|) - (|R_{i - 1}^{b} - R_{i - 1}^{r}|)| .

(3)

The first term in the criterion

D P (R^{b}, R^{r}) = \frac{1}{N^{b}} \sum_{i = 0}^{N - 1} \{d [S (R_{i}^{b}), S (R_{i}^{r})] - M\}

is responsible for the difference between the instructor’s and the patient’s performance in the form of posture, the second term

D T (R^{b}, R^{r}) = \frac{1}{N^{b}} \sum_{i = 1}^{N - 1} |(|R_{i}^{b} - R_{i}^{r}|) - (|R_{i - 1}^{b} - R_{i - 1}^{r}|)|

evaluates the discrepancy in the dynamics of performance over time, and

α

is a balancing coefficient.

While measure

D (R^{b}, R^{r})

represents the dissimilarity between different records, the intensity of movements during the exercise can be an important characteristic of a single record in the context of rehabilitation. Based on this observation, the intensity values can be used as a standalone characteristic, as well as contribute to the dynamic part

D T (R^{b}, R^{r})

in the criterion (3). In the latter case, we should use the inverse values of intensities:

D T (R^{b}, R^{r}) = \frac{1}{N^{b}} \sum_{i = 1}^{N - 1} |(|R_{i}^{b} - R_{i}^{r}|) - (|R_{i - 1}^{b} - R_{i - 1}^{r}|)| + β (E_{\max} - E^{r}),

(4)

where

E^{r}

is an intensity of the patient’s record;

E_{\max}

is the maximum intensity in a dataset.

We use an average interframe dissimilarity of a record as an estimation of intensity:

E^{r} = \frac{1}{N K} \sum_{i = 1}^{N - 1} \sum_{p = 0}^{K - 1} \sqrt{\sum_{m \in {x, y, z}} {(S {(R_{i}^{r})}_{p}^{m} - S {(R_{i - 1}^{r})}_{p}^{m})}^{2}} .

(5)

Therefore, the final metric between instructor and patient records has the form of a linear combination of two components with the coefficient

α

:

D (R^{b}, R^{r}) = D P (R^{b}, R^{r}) + α \cdot D T (R^{b}, R^{r}) .

(6)

We will pay attention to the selection of the value of the

α

parameter in Section 4.3.

4. Experimental Study and Results

4.1. Dataset

The primary target of this study is to provide an objective and reliable measure of a patient’s rehabilitation progress by comparing their exercise performance against a reference recording performed by an instructor. This goal defines the basic requirements for the dataset for experiments.

Existing open-access datasets, such as FineRehab by Li et al. [28] (50 participants, 16 rehabilitation actions), the UCO Physical Rehabilitation Dataset by Aguilar-Ortega et al. [29] (27 participants, eight exercises), IntelliRehabDS by Miron et al. [30] (29 participants, nine rehabilitation gestures), and the University of Liverpool Rehabilitation Exercise Dataset (UL-RED) by Reji et al. [31] (10 participants, 22 actions), primarily contain motion recordings of both healthy individuals and patients. However, these datasets lack rehabilitation scores and thus effectively support only a binary distinction (e.g., healthy vs. impaired), offering limited granularity in assessing rehabilitation quality. In contrast, the KIMORE dataset by Capecci et al. [32] (78 participants, five rehabilitation exercises) stands out by providing expert-rated clinical scores derived from a standardized questionnaire, as well as reference recordings of expert (therapist) performances. While KIMORE is arguably the most suitable dataset for rehabilitation assessment tasks, its clinical ratings, though standardized, remain inherently subjective due to inter-rater variability.

In this work, rather than estimating a rehabilitation score from a patient’s recording, we adopt a different approach: we aim to simulate exercise execution at varying levels of motor impairment. To this end, we introduce a new dataset specifically designed to evaluate the proposed dissimilarity measure. Moreover, the collection of the dataset let us create, debug, and test the necessary components of the real-world automated exercise control system.

Our dataset comprises video recordings of a physiotherapist (instructor) and 22 participants (17 men and 5 women, aged 22–55 years) in a controlled laboratory environment. In our approach, the instructor’s recording was presented to the patient in real time during the exercise execution. Each person repeated the instructor’s exercises three times. The first time, they did it as accurately as possible. The second time, they did it less precisely, simulating a patient with minor motor impairments. The third time, they did it even less accurately, simulating a patient with severe motor impairments. We will denote these three accuracy levels of repetition as “good”, “intermediate”, and “bad”. Thus, the total number of records in the dataset (including the instructor) is equal to 136.

The main limitation of such a protocol of three-level estimations of performance accuracy, while improving reliability, turns the problem into a problem of ranking regression, as the ground truth is represented on the ordinal scale.

All data were captured using the Microsoft Kinect v2 sensor and include 3D coordinates of 17 skeletal joints (corresponding to the standard Kinect v2 skeletal model) for every video frame (see Figure 2).

4.2. Evaluation of the Relationship Between the Level of Accuracy of Exercise Performance and the Intensity of Movements

To evaluate the relationship between the level of accuracy of the exercise performance and the intensity of movements, we calculate the intensity of each record in the dataset by Equation (5). Figure 4 shows the strong relations between the «quality» of exercise (good, intermediate, and bad) in mimicking the instructor’s patterns and the estimation of the intensity of exercises.

4.3. Video Sequence Distances Visualization and Analysis

In this section, we enlist the experimental results as to how efficiently the patients repeat the exercises performed by the instructors.

We deliberately decompose the total distance between two videos (3) into two components: a pose dissimilarity measure

D P

and a dynamics dissimilarity

D T

. This approach allows us to avoid difficulties associated with visualizing isolated data points of metric similarity to the reference and instead allows for the effective visualization and analysis of the experimental results on the collected dataset in a two-dimensional representation. This enhances the interpretability of the patient’s rehabilitation progress, facilitates the construction of individual rehabilitation trajectories, and enables comparison across patients within the dataset.

In the figures, the second characteristic (

D T

) is plotted along the ordinate axis and multiplied by 10 (see Figure 5). It is clear that the closer the patient’s point is to the instructor’s point, the better the patient performs the exercises. It should be noted that individual exercises performed in both sitting and standing positions formally generate two isolated dissimilarity spaces [33].

Decomposing the function

D

into two components,

D P

and

D T

, provides a clearly interpretable and intuitive visualization of each exercise act on a two-dimensional plane. Obviously, the instructor will be displayed at the origin. However, it should be understood that this will result in a separate visualization for each exercise. However, we wanted exercises of different types (in our case, sitting and standing positions) to be projected onto a single display plane. To this end, we performed separate centering and scaling of the points for each exercise. In this case, we obtained projections of the points onto a single visual display field simply by placing the individual centers of each exercise at the origin. We were somewhat pleasantly surprised that no additional alignment of the circles (sitting position) and triangles (standing position) was required.

Connected lines correspond to the same specific patient identified as P_18.

The reference position of the instructor’s exercises is marked in purple in Figure 5. After some time, the instructor repeated the exercises, attempting to perform the movements as accurately as possible in comparison to the first recording. This repetition is represented in the figure by objects with a purple outline. Simulations of “intermediate” and “bad” accuracy levels by the instructor were not conducted.

As can be seen from Figure 5, there is good clustering of points belonging to different accuracy levels (“good,” “intermediate,” and “bad”). We tried to highlight the areas of concentration of rehabilitation levels, and, to avoid drawing it manually, we used a classifier based on a simple neural network. We consider such division on the colorful areas to be useful for the therapist. Classification was carried out using the Multi-layer Perceptron Classifier from the Scikit-learn python library [34] with standard training parameters.

On one hand, in our collected dataset, we assess how closely a patient performs an exercise relative to the instructor using an ordinal scale ranging from “bad” to “intermediate” to “good.” On the other hand, when it comes to rehabilitation outcomes, progress can look very different from one patient to another, both in pace and pattern. As such, directly comparing numerical recovery scores across individuals—especially in terms of overall improvement or specific physiological functions—remains a complex challenge that warrants further investigation.

4.4. Performance Metrics and Hyperparameter Values Estimation

As stated in Section 4.1, each patient record in the experimental dataset is associated with a specific level of accuracy in reproducing the instructor’s movements, which naturally turns the problem into a ranking regression task. At the same time, the primary objective of this work is to produce a quantitative measure of the distance between the recorded executions of the therapeutic exercise by the instructor and the patient. This raises the issue of selecting an appropriate method for evaluating the performance of the proposed distance estimation approach.

Note that the dissimilarity (or distance) function, together with a fixed reference element that the instructor naturally acts as, induces a natural (non-strict) order on the set of distances, because this set is a subset of the real numbers, which are already totally ordered. This induces a pre-order on the set of patient records for a particular exercise:

R_{i}^{r} \underline{≺} R_{j}^{r} \Leftrightarrow D (R^{b}, R_{i}^{r}) \leq D (R^{b}, R_{j}^{r}),

where

R_{i}^{r}

and

R_{j}^{r}

are records of the same exercise of the patient

r

.

In this work, we focused on preserving a meaningful, monotonic relationship between increasing the dissimilarity of movements and decreasing the performance quality. In other words, as the mismatch in pose and dynamics grows, the score should reliably reflect a drop in the accuracy level.

The commonly adopted way to capture this is to use Spearman’s rank correlation coefficient [35]:

r_{s} = 1 - \frac{6 \sum_{i = 1}^{n} d_{i}^{2}}{n (n^{2} - 1)},

(7)

where

n

is the number of records for each person in each position in the dataset;

d

is the difference between the ranks of records. It is assumed that the “good” execution of an exercise corresponds to rank 0, an “intermediate” execution corresponds to rank 1, and a bad execution corresponds to rank 2. In our case, each person performs an exercise three times in the standing position and in the sitting position,

n = 3

. Therefore, Equation (7) takes a simple form:

r_{s} = 1 - \frac{\sum_{i = 1}^{3} d_{i}^{2}}{4} .

(8)

To obtain a measure of performance on the whole dataset, we average the correlation coefficient (8) over records of all people in both positions.

Such a measure lets us choose the optimal parameter

α

in (6) to coordinate the importance of pose (

D P

) and dynamic (

D T

) dissimilarity. Considering the way we adopt visualization, it will be more convenient to represent Equation (6) in a slightly different form:

D = D P \cdot \cos φ + D T \cdot \sin φ .

(9)

Equation (9) defines the projection of each record onto the axis passing through the origin at an angle

φ

to the horizontal. Figure 6 demonstrates the dependency between the projection angle

φ

and the average Spearman’s rank correlation coefficient.

The maximum average Spearman’s rank correlation coefficient is equal to 0.977 and is achieved for a range of

φ

angles; for certainty, we chose the smallest of them with the angle value

φ = 13^{°}

. The correspondent axis is shown in Figure 5 as a blue line. And the distribution of the number of records of each type on that axis is represented in Figure 7.

Results of the experimental evaluation of the proposed measure are shown in Figure 8. Each line corresponds to the records of certain patients (P_patient number) in the dataset, suffix _01 corresponds to the standing position, and _02 corresponds to the sitting position. The color markers (dots for sitting, triangles for standing exercises) reflect the exercise execution levels: the green color for “good”, blue color for “intermediate”, and red color for “bad”. The proper sequence should be green, blue, and red. A different order of colors indicates an error. Distances between markers reflect the distance

D

to the record of instructor execution, assigning zero to the minimal distance.

It can be seen that there are only two records with the wrong order of execution quality, P_06_02 and P_18_02 (reminder: suffix _02 means sitting position), but in both cases, the distances between disordered states are very small.

To evaluate the sensitivity of the optimal angle to variations in the composition of the patient group, we employed an 11-fold cross-validation procedure. The entire dataset, excluding the instructor’s records, was randomly partitioned into 11 subsets, each containing all exercise recordings from two distinct participants. In each iteration of the experiment, one subset (12 records) was held out, and the optimal angle

φ

was selected via exhaustive search over the range in 1° increments using the remaining ten subsets (120 records). Figure 9 presents the plots of the mean Spearman’s rank correlation coefficient as a function of the angle obtained for all cross-validation folds.

The data in Table 1 demonstrates sufficient robustness of the estimated angle

φ

to variations in the patient group.

The angle

φ

yielding the maximum average Spearman’s rank correlation coefficient differed from the previously identified optimal value (

φ = 13^{°}

) in only two out of the eleven cross-validation folds. Even in those cases, the average Spearman’s rank correlation coefficient (see Figure 9) remains very close to the fold-specific optimum, indicating minimal practical impact on performance.

5. Conclusions and Discussion

This study presents a novel, vision-based system for the automated assessment of physical therapy exercise performance, grounded in a Human Skeleton-based Balanced Time Warping (HS-BTW) algorithm. Unlike conventional approaches that rely on predefined activity classes, deep learning models requiring large, annotated datasets or wearable sensors demanding calibration; our method offers a calibration-free, interpretable, and computationally efficient solution using only RGB-D or even RGB video input.

In pursuit of maximizing objectivity, our newly collected dataset eschews reliance on subjective therapist assessments or predefined feature sets, which are commonly employed in other datasets such as KIMORE [32]. Rather than relying on these established methods, we tasked actors with replicating instructional movements with different levels of accuracy, simulating varying degrees of motor impairments. Consequently, this unique form of data collection necessitates the utilization of specialized metrics tailored specifically to capture the ranked characteristics inherent within the evaluative framework. However, this innovative strategy simultaneously excludes compatibility with conventional benchmark datasets.

The core innovation lies in the fact that, instead of using specific pose features, we adopted a featureless approach [36,37] based on a balanced dissimilarity measure designed explicitly for skeletal model analysis. This measure separately quantifies posture deviation (DP) and dynamic inconsistency (DT), then combines them into a single metric optimized via Spearman’s rank correlation to align with rehabilitation quality. Note that DP uses the attention mechanism to the coordinates of points of the skeletal model. Its dual application includes both the temporal alignment of recorded videos against their associated skeletal representations and the direct quantification of deviations between corresponding poses across individual frames.

Evaluated on our dataset of 134 recordings from 22 participants simulating three levels of motor impairment (“good,” “intermediate,” and “bad”), the system achieved a remarkably high correlation of 0.977 between computed dissimilarity and execution quality. Visualization in the DP–DT space revealed clear clustering by accuracy level, demonstrating the method’s sensitivity to clinically meaningful differences in movement fidelity.

Compared to existing rehabilitation assessment systems—many of which depend on subjective clinician scores, fixed exercise templates, or sensor-laden setups—our approach provides objective feedback without requiring pre-alignment, extensive training data, or patient-specific calibration. By decoupling static posture errors from temporal dynamics, it also offers actionable diagnostic insights, enabling therapists to identify whether a patient struggles with form, timing, or both.

While some recent works like [16,20] and some others and the present work combine temporal and dynamics features when making decisions, our work addresses physical therapy monitoring, proposing a non-deep learning approach that uses RGB-D video and a customized balanced time warping algorithm to assess exercise quality. It focuses on comparing the motions of the patient and therapist using 3D skeletons. Here are some additional advantages of our method:

(1): No Need for Extensive Training Data: Our work uses a similarity-based evaluation approach rather than supervised deep learning, reducing the need for large, annotated datasets, which are often expensive and time-consuming to collect.
(2): Sensor-Free and Calibration-Free: Our work relies on RGB-D video and skeletal data without requiring wearable sensors or device-specific calibration, making it easier and more practical to deploy in clinical or home settings.
(3): Quantitative and Interpretable Feedback: Our present method produces interpretable metrics (posture and dynamics scores) that directly reflect the quality of movements, providing meaningful, actionable feedback to patients and therapists.
(4): Clarity in Error Assessment: By separating static posture (DP) and dynamic movement (DT) errors, the system helps identify specific areas where the patient deviates from the correct performance, enabling targeted rehabilitation.
(5): Robust to Individual Variations: The algorithm measures relative similarity to a reference rather than forcing classification into fixed labels, making it more adaptable to individual differences in patient performance.
(6): Supports Remote Monitoring: Non-reliance on heavy computation (e.g., deep neural networks) of our proposed work makes it suitable for tele-rehabilitation, allowing patients to receive feedback outside of clinical environments.
(7): No Predefined Activity Classes Required: Our present proposed work depends on predefined activity labels for classification, thereby evaluating the quality of performance irrespective of activity type, allowing for more flexibility in usage.

These attributes make the proposed system particularly well-suited for tele-rehabilitation and home-based care, where accessibility, scalability, and interpretability are critical. Future work will focus on clinical validation with real post-stroke and cardiac patients, integration into therapist dashboards, and extension to a broader repertoire of therapeutic exercises.

On the whole, we understand that, although our dataset was collected under supervision and with the involvement of medical professionals, it cannot be used for clinical purposes without recordings from actual patients. We currently have an agreement in place with the University Medical Center and the City Clinical Hospital to expand our dataset and evaluate the system on real patients.

Author Contributions

Conceptualization, O.S., A.K. and A.T.; methodology, O.S. and A.K.; software, N.M. and E.S.; validation, A.T. and D.B.; investigation, P.B.; data curation, E.S. and N.M.; writing—original draft preparation, P.B., O.S. and A.K.; writing—review and editing, A.T. and D.B.; visualization, A.K.; supervision, D.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by the Ministry of Science and Higher Education of the Russian Federation within the framework of the state task FEWG-2024-0001.

Institutional Review Board Statement

Conclusion of the Bioethics Committee of the Medical Institute of the Federal State Budgetary Educational Institution of Higher Education “Tula State University”.

Informed Consent Statement

Personal consent was waived due to no personal data being used and individuals could not be identified.

Data Availability Statement

The datasets used in the current study are available from the authors on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Dibben, G.O.; Faulkner, J.; Oldridge, N.; Rees, K.; Thompson, D.R.; Zwisler, A.D.; Taylor, R.S. Exercise-Based Cardiac Rehabilitation for Coronary Heart Disease: A Meta-Analysis. Eur. Heart J. 2023, 44, 452–469. [Google Scholar] [CrossRef]
Physical Medicine and Rehabilitation: MedlinePlus Medical Encyclopedia. Available online: https://medlineplus.gov/ency/article/007448.htm (accessed on 7 July 2025).
Tormene, P.; Giorgino, T.; Quaglini, S.; Stefanelli, M. Matching Incomplete Time Series with Dynamic Time Warping: An Algorithm and an Application to Post-Stroke Rehabilitation. Artif. Intell. Med. 2009, 45, 11–34. [Google Scholar] [CrossRef] [PubMed]
Lam, A.W.K.; Varona-Marin, D.; Li, Y.; Fergenbaum, M.; Kulić, D. Automated Rehabilitation System: Movement Measurement and Feedback for Patients and Physiotherapists in the Rehabilitation Clinic. Hum. Comput. Interact. 2016, 31, 294–334. [Google Scholar] [CrossRef]
Seredin, O.S.S.; Kopylov, A.V.V.; Huang, S.-C.C.; Rodionov, D.S.S. A Skeleton Features-Based Fall Detection Using Microsoft Kinect v2 with One Class-Classifier Outlier Removal. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci.—ISPRS Arch. 2019, 42, 189–195. [Google Scholar] [CrossRef]
Surkov, E.; Seredin, O.; Kopylov, A.; Kushnir, O.; Semenishchev, E. Standardizing Skeletal Models for Fall Detection. In Proceedings of the ICPR 2024 International Workshops and Challenges, Kolkata, India, 1–5 December 2024; pp. 345–362. [Google Scholar] [CrossRef]
Bagchi, P.; Seredin, O.; Kopylov, A.; Surkov, E.; Mityugov, N.; Tokarev, A.; Bhattacharjee, D. Skeletons Distance Evaluation for Automated Control of Exercises in Physical Therapy. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2025, XLVIII-2-W, 19. [Google Scholar] [CrossRef]
Kumar, D.A.; Kumar, E.K.; Suneetha, M.; Rajasekhar, L. Human Action Recognition from Depth Sensor via Skeletal Joint and Shape Trajectories with a Time-Series Graph Matching. AIP Conf. Proc. 2024, 2512, 020029. [Google Scholar] [CrossRef]
Vishwakarma, D.K.; Jain, K. Three-Dimensional Human Activity Recognition by Forming a Movement Polygon Using Posture Skeletal Data from Depth Sensor. ETRI J. 2022, 44, 286–299. [Google Scholar] [CrossRef]
İnce, Ö.F.; Ince, I.F.; Yıldırım, M.E.; Park, J.S.; Song, J.K.; Yoon, B.W. Human Activity Recognition with Analysis of Angles between Skeletal Joints Using a RGB-Depth Sensor. ETRI J. 2020, 42, 78–89. [Google Scholar] [CrossRef]
Muralikrishna, S.N.; Muniyal, B.; Acharya, U.D.; Holla, R. Enhanced Human Action Recognition Using Fusion of Skeletal Joint Dynamics and Structural Features. J. Robot. 2020, 2020, 3096858. [Google Scholar] [CrossRef]
Ahad, M.A.R.; Ahmed, M.; Das Antar, A.; Makihara, Y.; Yagi, Y. Action Recognition Using Kinematics Posture Feature on 3D Skeleton Joint Locations. Pattern Recognit. Lett. 2021, 145, 216–224. [Google Scholar] [CrossRef]
Wang, L.; Liu, J.; Zheng, L.; Gedeon, T.; Koniusz, P. Meet JEANIE: A Similarity Measure for 3D Skeleton Sequences via Temporal-Viewpoint Alignment. Int. J. Comput. Vis. 2024, 132, 4091–4122. [Google Scholar] [CrossRef]
Du, C.; Graham, S.; Depp, C.; Nguyen, T. Assessing Physical Rehabilitation Exercises Using Graph Convolutional Network with Self-Supervised Regularization. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS, Orlando, FL, USA, 15–19 July 2021; pp. 281–285. [Google Scholar] [CrossRef]
Park, J.; Cho, S.; Kim, D.; Bailo, O.; Park, H.; Hong, S.; Park, J. A Body Part Embedding Model with Datasets for Measuring 2D Human Motion Similarity. IEEE Access 2021, 9, 36547–36558. [Google Scholar] [CrossRef]
Chen, H.; Zendehdel, N.; Leu, M.C.; Yin, Z. Fine-Grained Activity Classification in Assembly Based on Multi-Visual Modalities. J. Intell. Manuf. 2024, 35, 2215–2233. [Google Scholar] [CrossRef]
Anton, D.; Goni, A.; Illarramendi, A.; Torres-Unda, J.J.; Seco, J. KiReS: A Kinect-Based Telerehabilitation System. In Proceedings of the 2013 IEEE 15th International Conference on e-Health Networking, Applications and Services, Healthcom 2013, Lisbon, Portugal, 9–12 October 2013; pp. 444–448. [Google Scholar] [CrossRef]
Çubukçu, B.; Yüzgeç, U.; Zılelı, A.; Zılelı, R. Kinect-Based Integrated Physiotherapy Mentor Application for Shoulder Damage. Future Gener. Comput. Syst. 2021, 122, 105–116. [Google Scholar] [CrossRef]
Yurtman, A.; Barshan, B. Automated Evaluation of Physical Therapy Exercises Using Multi-Template Dynamic Time Warping on Wearable Sensor Signals. Comput. Methods Programs Biomed. 2014, 117, 189–207. [Google Scholar] [CrossRef]
Adolf, J.; Segal, Y.; Turna, M.; Nováková, T.; Doležal, J.; Kutílek, P.; Hejda, J.; Hadar, O.; Lhotská, L. Evaluation of Functional Tests Performance Using a Camera-Based and Machine Learning Approach. PLoS ONE 2023, 18, e0288279. [Google Scholar] [CrossRef] [PubMed]
Gal, N.; Andrei, D.; Nemeş, D.I.; Nədəşan, E.; Stoicu-Tivadar, V. A Kinect Based Intelligent E-Rehabilitation System in Physical Therapy. Stud. Health Technol. Inform. 2015, 210, 489–493. [Google Scholar] [CrossRef] [PubMed]
Esmaeeli, R.; Valadan Zoej, M.J.; Safdarinezhad, A.; Ghaderpour, E. Recognition and Scoring Physical Exercises via Temporal and Relative Analysis of Skeleton Nodes Extracted from the Kinect Sensor. Sensors 2024, 24, 6713. [Google Scholar] [CrossRef] [PubMed]
Liao, Y.; Vakanski, A.; Xian, M.; Paul, D.; Baker, R. A Review of Computational Approaches for Evaluation of Rehabilitation Exercises. Comput. Biol. Med. 2020, 119, 103687. [Google Scholar] [CrossRef]
Tsiouris, K.M.; Tsakanikas, V.D.; Gatsios, D.; Fotiadis, D.I. A Review of Virtual Coaching Systems in Healthcare: Closing the Loop With Real-Time Feedback. Front. Digit. Health 2020, 2, 567502. [Google Scholar] [CrossRef]
Debnath, B.; O’Brien, M.; Yamaguchi, M.; Behera, A. A Review of Computer Vision-Based Approaches for Physical Rehabilitation and Assessment; Springer: Berlin/Heidelberg, Germany, 2022; Volume 28, ISBN 0123456789. [Google Scholar]
Poggio, T.; Torre, V.; Koch, C. Computational Vision and Regularization Theory. In Readings in Computer Vision; Morgan Kaufmann: San Francisco, CA, USA, 1987; pp. 638–643. [Google Scholar] [CrossRef]
Mottl, V.V.; Blinov, A.B.; Kopylov, A.V.; Kostin, A.A. Optimization Techniques on Pixel Neighborhood Graphs for Image Processing. In Graph-Based Representations in Pattern Recognition; Jolion, J.-M., Kropatsch, W.G., Eds.; Springer: Wien, Austria, 1998; pp. 135–145. [Google Scholar]
Li, J.; Xue, J.; Cao, R.; Du, X.; Mo, S.; Ran, K.; Zhang, Z. FineRehab: A Multi-Modality and Multi-Task Dataset for Rehabilitation Analysis. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, DC, USA, 17–18 June 2024. [Google Scholar]
Aguilar-Ortega, R.; Berral-Soler, R.; Jiménez-Velasco, I.; Romero-Ramírez, F.J.; García-Marín, M.; Zafra-Palma, J.; Muñoz-Salinas, R.; Medina-Carnicer, R.; Marín-Jiménez, M.J. UCO Physical Rehabilitation: New Dataset and Study of Human Pose Estimation Methods on Physical Rehabilitation Exercises. Sensors 2023, 23, 8862. [Google Scholar] [CrossRef]
Miron, A.; Sadawi, N.; Ismail, W.; Hussain, H.; Grosan, C. IntelliRehabDS (IRDS)—A Dataset of Physical Rehabilitation Movements. Data 2021, 6, 46. [Google Scholar] [CrossRef]
Reji, N.; D’Août, K.; Fichera, S.; Paoletti, P. A Human Motion Data Capture Study The University of Liverpool Rehabilitation Exercise Dataset. Sci. Data 2025, 12, 761. [Google Scholar] [CrossRef] [PubMed]
Capecci, M.; Ceravolo, M.G.; Ferracuti, F.; Iarlori, S.; Monteriu, A.; Romeo, L.; Verdini, F. The KIMORE Dataset: KInematic Assessment of MOvement and Clinical Scores for Remote Monitoring of Physical REhabilitation. IEEE Trans. Neural Syst. Rehabil. Eng. 2019, 27, 1436–1448. [Google Scholar] [CrossRef] [PubMed]
Duin, R.P.W.; Pçkalska, E. The Dissimilarity Space: Bridging Structural and Statistical Pattern Recognition. Pattern Recognit. Lett. 2012, 33, 826–832. [Google Scholar] [CrossRef]
MLPClassifier—Scikit-Learn 1.7.2 Documentation. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html (accessed on 14 October 2025).
Spearman, C. The Proof and Measurement of Association between Two Things. By C. Spearman, 1904. Am. J. Psychol. 1987, 100, 441–471. [Google Scholar] [CrossRef]
Duin, R.P.W.; Pȩkalska, E.; De Ridder, D. Relational Discriminant Analysis. Pattern Recognit. Lett. 1999, 20, 1175–1181. [Google Scholar] [CrossRef]
Mottl, V.; Seredin, O.; Dvoenko, S.; Kulikowski, C.; Muchnik, I. Featureless Pattern Recognition in an Imaginary Hilbert Space. In Proceedings of the 2002 International Conference on Pattern Recognition, 16th International Conference on Pattern Recognition, Quebec City, QC, Canada, 11–15 August 2002; Volume 2, pp. 88–91. [Google Scholar]

Figure 1. Flowchart of total distance between video sequences evaluation.

Figure 2. Standardization of the skeletal model.

Figure 3. Flowchart of evaluation of the motion-dependent dissimilarity measure [7] between the two skeleton models.

Figure 4. Normalized intensities of movements for 22 patients in sitting (top sub-picture) and standing (bottom sub-picture) positions under varying performance accuracy levels (“good,” “intermediate,” and “bad”).

Figure 5. Presentation of data as normalized distances of individual patient exercises from the instructor’s reference exercises. The ordinate axis shows the DT value; the abscissa axis shows the DP value. Circles correspond to exercises in a sitting position, triangles to a standing position. Connected lines correspond to specific patient (P_18). Patients in sitting exercises are dots and in standing exercises are triangles. Green marks are the instructor repeating the exercise and 22 patients who have repeated the exercises well. Blue marks are intermediate patients. And finally, the red ones are the most inaccurate patients.

Figure 6. Dependency between the projection angle

φ

and the average Spearman’s rank correlation coefficient.

Figure 6. Dependency between the projection angle

φ

and the average Spearman’s rank correlation coefficient.

Figure 7. Diagram of projection of records to the axis under the angle of

13^{°}

in the dissimilarity space. Green means “good” execution, blue means “intermediate”, and red means “bad” execution of an exercise.

Figure 7. Diagram of projection of records to the axis under the angle of

13^{°}

in the dissimilarity space. Green means “good” execution, blue means “intermediate”, and red means “bad” execution of an exercise.

Figure 8. Experimental results: projections of all patients coordinate in DP-DT space into the line with the angle

φ = 13^{°}

. The color markers (dots for sitting, triangles for standing exercises) reflect the exercise execution levels: the green color for “good”, blue color for “intermediate”, and red color for “bad”.

Figure 8. Experimental results: projections of all patients coordinate in DP-DT space into the line with the angle

φ = 13^{°}

. The color markers (dots for sitting, triangles for standing exercises) reflect the exercise execution levels: the green color for “good”, blue color for “intermediate”, and red color for “bad”.

Figure 9. Dependency between the projection angle

φ

and the average Spearman’s rank correlation coefficient for 11 data subsets. The excluded patient IDs in each subset are shown in the legend in square brackets.

Figure 9. Dependency between the projection angle

φ

and the average Spearman’s rank correlation coefficient for 11 data subsets. The excluded patient IDs in each subset are shown in the legend in square brackets.

Table 1. Optimal balanced

φ

and corresponding average Spearman’s rank correlation coefficient in cross-validation procedure.

Table 1. Optimal balanced

φ

and corresponding average Spearman’s rank correlation coefficient in cross-validation procedure.

Excluded Patients	$Optimal φ^{°}$	$Average r_{s}$
P_1, P_13	44	0.9875
P_10, P_19	13	0.975
P_4, P_21	13	0.975
P_16, P_18	13	0.9875
P_6, P_7	8	1.0
P_9, P_17	13	0.975
P_3, P_5	13	0.975
P_2, P_15	13	0.975
P_14, P_20	13	0.975
P_8, P_11	13	0.975
P_12, P_22	13	0.975

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Seredin, O.; Kopylov, A.; Surkov, E.; Mityugov, N.; Tokarev, A.; Bagchi, P.; Bhattacharjee, D. Automated Control of Rehabilitation Process in Physical Therapy Using a Novel Human Skeleton-Based Balanced Time Warping Algorithm. Sensors 2025, 25, 6696. https://doi.org/10.3390/s25216696

AMA Style

Seredin O, Kopylov A, Surkov E, Mityugov N, Tokarev A, Bagchi P, Bhattacharjee D. Automated Control of Rehabilitation Process in Physical Therapy Using a Novel Human Skeleton-Based Balanced Time Warping Algorithm. Sensors. 2025; 25(21):6696. https://doi.org/10.3390/s25216696

Chicago/Turabian Style

Seredin, Oleg, Andrey Kopylov, Egor Surkov, Nikita Mityugov, Alexei Tokarev, Parama Bagchi, and Debotosh Bhattacharjee. 2025. "Automated Control of Rehabilitation Process in Physical Therapy Using a Novel Human Skeleton-Based Balanced Time Warping Algorithm" Sensors 25, no. 21: 6696. https://doi.org/10.3390/s25216696

APA Style

Seredin, O., Kopylov, A., Surkov, E., Mityugov, N., Tokarev, A., Bagchi, P., & Bhattacharjee, D. (2025). Automated Control of Rehabilitation Process in Physical Therapy Using a Novel Human Skeleton-Based Balanced Time Warping Algorithm. Sensors, 25(21), 6696. https://doi.org/10.3390/s25216696

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Automated Control of Rehabilitation Process in Physical Therapy Using a Novel Human Skeleton-Based Balanced Time Warping Algorithm

Abstract

1. Introduction

2. Literature Review

3. Proposed Methodology

4. Experimental Study and Results

4.1. Dataset

4.2. Evaluation of the Relationship Between the Level of Accuracy of Exercise Performance and the Intensity of Movements

4.3. Video Sequence Distances Visualization and Analysis

4.4. Performance Metrics and Hyperparameter Values Estimation

5. Conclusions and Discussion

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI