Article

Learning Spatio-Temporal Radon Footprints for Assessment of Parkinson’s Dyskinesia

by Paraskevi Antonia Theofilou *, Georgios Tsatiris and Stefanos Kollias *
School of Electrical and Computer Engineering, National Technical University of Athens, 15773 Athens, Greece
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(3), 635; https://doi.org/10.3390/electronics13030635
Submission received: 24 December 2023 / Revised: 30 January 2024 / Accepted: 31 January 2024 / Published: 2 February 2024

Abstract:
Parkinson’s disease is a severe neurodegenerative disorder that leads to loss of control over various motor and mental functions. Its progression can be limited with medication, particularly through the use of levodopa. However, prolonged administration of levodopa often results in disorders independent of those caused by the disease. The detection of these disorders is based on the clinical examination of patients through different types of activities and tasks, using the Unified Dyskinesia Rating Scale (UDysRS). In the present work, our aim is to develop a state-of-the-art assessment system for levodopa-induced dyskinesia (LID) using the joint coordinates of the human skeleton depicted in videotaped activities related to the UDysRS. For this reason, we combine a robust mathematical method for encoding action sequences, known as Spatio-temporal Radon Footprints (SRF), with a Convolutional Neural Network (CNN), in order to estimate dyskinesia ratings for six body parts. We introduce two different methodological approaches, Global SRF-CNN and Local SRF-CNN, based on the set of skeletal points used in the encoding scheme. A comparison between these approaches reveals that Local SRF-CNN performs better than the Global one. Finally, Local SRF-CNN outperforms the state-of-the-art technique for UDysRS dyskinesia assessment on both tasks, using joint coordinate data of the human body, achieving an overall mean RMSE of 0.6198 for the Drinking task and 0.4885 for the Communication task, compared to 0.6575 and 0.5175, respectively. This illustrates the ability of the proposed machine learning system to successfully assess LID.

1. Introduction

Parkinson’s disease (PD) is a serious progressive degenerative disorder of the nervous system that causes the loss of control over physical movements [1]. It is considered the second most common neurodegenerative disease after Alzheimer’s disease [2] and is expected, according to studies, to become the second most common cause of death after cancer in the future [3]. In individuals with PD, dopamine, a neurotransmitter that plays a crucial role in transmitting signals in the brain, gradually diminishes. The symptoms of the disease cover a wide range of different motor and mental dysfunctions, significantly impacting the daily lives of patients [4]. These include fatigue, loss of strength, low voice, poor writing quality, and even mental disorders. Many are related to impaired movement control, with the most common being stiffness in the limbs, restlessness, decreased balance, sluggishness, instability and freezing episodes. While PD is not curable, its progression can be managed, depending on its stage, through physical and mental exercise, medication or surgery [5]. So, early diagnosis of the disease and accurate assessment of its severity are of high importance. However, long-term administration of medications such as levodopa can often lead to additional disorders [6].
Patients may present with symptoms of dyskinesia that are not related to those of parkinsonism, but rather stem from dopamine-based medication, like levodopa. PD primarily affects the dopaminergic neurons in the brain. Levodopa, which is commonly used in PD treatment, is a precursor to dopamine, a neurotransmitter that is deficient in individuals with PD. While levodopa is effective in managing the motor symptoms of PD, long-term use can lead to the development of further dyskinesia. This form of dyskinesia is known as levodopa-induced dyskinesia (LID) and manifests as involuntary, rapid, irregular and purposeless movements, including chorea, dystonia, and athetosis. These movements significantly diminish patients’ quality of life [7,8]. LID typically emerges as the disease progresses, often requiring an increase in the dosage of levodopa. However, increasing the dosage worsens the dyskinesia, while decreasing it leads to poor disease control. A potential solution might involve adjusting the medication regimen and utilizing a combination of drugs capable of addressing symptoms arising from both PD and LID. However, achieving such a balance is immensely challenging. This underscores the importance of detecting and assessing LID in order to improve patients’ quality of life, and it motivates our focus on this assessment problem.
Regarding diagnosis, there is no precise procedure based on a specific hematological or biochemical test for either PD or LID. Instead, diagnosis is conducted through the evaluation of the patient’s medical history, the clinical examination of their various symptoms and the assessment of their performance in specific diagnostic tests that aim to identify their impairments in daily activities [9]. Clinical diagnostic criteria, introduced by the International Parkinson and Movement Disorder Society and defined by the Unified Parkinson’s Disease Rating Scale (UPDRS) [10,11] and the Unified Dyskinesia Rating Scale (UDysRS) [12], are used to assess and draw conclusions about the severity of the disease. The UPDRS was developed to evaluate various aspects of PD, including non-motor and motor experiences, while the UDysRS was developed to evaluate the involuntary movements related to the effects of prolonged administration of PD medication and LID, excluding from evaluation the impact of parkinsonism and tremor.
Several studies have been conducted regarding the detection and evaluation of PD using machine learning algorithms. However, in the present work, our focus is directed toward the detection and evaluation of LID, a subject that has received limited attention in the literature. The UDysRS is a standardized rating scale specifically designed to assess LID. It quantifies the level of severity and disability of dyskinesias across various aspects of patients’ lives. Within this framework, patients undergo examination while participating in specific activities; their bodily movements are tracked throughout these activities and are assessed according to the UDysRS. During each of these activities, specialists give a rating between 0 and 4 for each body part under assessment, i.e., neck, right arm, left arm, trunk, right leg and left leg, according to the intensity of dyskinesia. Our work aims at the development of a machine learning system for the automatic assessment of LID. Given that LID is strongly tied to the UDysRS, our work focuses on the estimation of UDysRS scores for each of the six body parts under evaluation.
Most of the existing works approach the problem of PD assessment as a classification task, aiming to predict the severity of PD. Their input is skeletal joint data which is continuously tracked over time, generating various signals. These studies calculate the average values of these signals and utilize them as features, thereby enabling the classification of cases into different assessment categories. However, within these frameworks, the extracted features are computed based on the overall activity, without specifically focusing on individual features at each time point, i.e., per frame.
Conversely, we choose a distinct approach to address the problem of LID assessment. Our objective is the estimation of six real values, between zero and four, corresponding to the six body parts under LID assessment according to the UDysRS. These values represent the severity level of LID for each body part. So, in contrast to treating this problem as a classification task, we treat it as a regression one. The data consist of action sequences, wherein each frame may contain important information for the overall assessment of an activity. In our approach, we analyze skeletal data derived from the application of pose estimation techniques on videotaped activities of PD patients, i.e., the joint coordinates of the human body for each frame. Thus, we take into consideration all the features present within the frames of the analyzed videos regarding a body movement—not only their mean values or a summary of these features. In this way, we aim to automatically quantify the severity level of LID for each body part, as the UDysRS indicates.
To achieve this, we introduce a novel methodology centered on encoding the movement sequence by generating 2D sinusoidal representations, the so-called Spatio-temporal Radon Footprints (SRF) [13], which contain all the necessary information, i.e., all the extracted features. Then, we use them for appropriately training Convolutional Neural Networks (CNN) in order to predict the rating values according to the UDysRS metric. This encoding scheme operates on the skeletal joint data per frame, extracted through a pose estimation method [14]. An SRF-CNN pipeline is proposed, through two approaches—the Global and Local ones—and is compared to the state-of-the-art method, using the same dataset [15].
Finally, in order to justify the ability of our methodology to successfully assess LID, we use the Grad-CAM explainability method [16]. Via this visualization technique, we add explainability to our models’ decisions, localizing in SRF-frames those points that played a significant role in the final predictions. In this way, we aim to illustrate that our proposed SRF encoding scheme provides a more representative encoding of the motion information and can lead to better results on the task of LID assessment. It is worth mentioning that related works do not use any visualization technique to demonstrate their methodologies’ ability to detect PD or LID.
The contribution of the paper lies in the following:
  • Presentation of a novel approach for the problem of Parkinson’s dyskinesia analysis, using a robust mathematical transformation for encoding spatio-temporal action sequences into compact representations, i.e., Spatio-temporal Radon Footprints
  • Presentation of two different implementations of the proposed SRF-CNN methodology, the Global and Local ones, which generate SRF representations, respectively, either for the whole set of skeletal joint data, or for a specific subset of them, depending on the part of the body to be assessed
  • Treatment of the LID assessment problem as a regression task, being able to generate continuous values of its severity
  • Extraction of features related to each frame of the action sequence, generating an effective representation across all time points
  • Investigation of whether each joint should be correlated with a large or a small set of other joints in order to provide successful results
  • Application of the Grad-CAM explainability method, as a visualization technique, on a multitude of SRF-frames to demonstrate the ability of our system to correctly detect and predict LID
  • Comparison of the proposed methodology to the state-of-the-art methods using the same datasets, illustrating that the Local SRF-CNN approach is the most accurate for this specific problem.
The rest of the paper is organized as follows: Section 2 presents related work, Section 3 describes the used dataset and Section 4 presents the proposed methodology. Section 5 defines the experimental setup, whilst Section 6 presents the experimental study. Finally, Section 7 describes the conclusions and our future research.

2. Related Work

Machine learning with its various methods contributes significantly to the development of medical applications. Regarding PD, there are a number of publications on the detection and diagnosis of the disease, as well as the estimation of its severity. However, the studies concerning the detection and assessment of LID are limited, resulting in a research gap in this field. Given that both problems follow a similar approach by clinicians via the observation of motor symptoms through specific daily activities, we can take into consideration methods that have been proposed to solve the PD assessment problem.
The contribution of sensors to this effort is important. Wearable sensing has been the most popular technology for PD assessment, using accelerometers, gyroscopes and magnetometers to record movements, detect changes in patients’ posture and gait [17,18], extract appropriate parameters [19], and assess the dyskinesia of the lower limbs [20,21] and, more generally, the symptoms of bradykinesia [22]. Furthermore, systems using this kind of sensor through smartphones [23] or other devices, such as special bracelets [24], have been proposed for the quantification of tremor and bradykinesia of the upper limbs. In addition to gait information, sensors capture voice information to evaluate voice disorders, in which case the created systems are capable of recording specific voice activities in real time based on the UPDRS [25]. According to Anju et al. [26], a smartphone with its sensors and extensive capabilities can be used for the self-assessment of patients, since it is capable of assessing motor and mental disorders.
In contrast to wearable sensors that require physical contact with the body, vision-based methods are completely non-contact and require only a camera for data capture and a computer for processing. For PD assessment, various studies have been published using pose estimation and detection techniques for action recognition and motion representation, such as Kinect [27], Convolutional Pose Machines (CPMs) [28], and OpenPose [29]. These techniques model human action based on the set of joints and the orientation of the limbs of the human body. The coordinates of the body’s joints are extracted and, consequently, their positions are located in space. Thus, changes in the posture of the depicted person are noted and the evolution of the related movement is monitored.
In this work, we are interested in vision-based methods for the disease’s assessment. The first of these studies relied on Kinect, using its 3D depth cameras to calculate the kinetic and spatio-temporal parameters that affect the movement of patients and help to track the gait pattern [30,31] and the hand movement [32], and to assess the related parkinsonism [33] and dyskinesia [34]. Buongiorno et al. [35], using the Microsoft Kinect v2 sensor in combination with Artificial Neural Networks (ANN), implemented a tool capable of detecting subtle symptoms in the movement of patients and estimating the severity of the disease. This kind of method used markers for tracking, extracted trajectories, and calculated handcrafted features.
Then, more recent studies used sophisticated motion recognition methods. More specifically, Li et al. [15,36] proposed a method for assessing parkinsonism and dyskinesia caused by long-term levodopa administration. The system receives as input videos of patients’ specific activities, from which, with the CPM technique, the characteristics of the movement (joints) were extracted and their positions estimated. Then, kinematic features were extracted from each joint trajectory and spectral features from velocity signals. Thus, using Random Forest as a model, experiments were performed to classify and predict scores on the UPDRS and UDysRS rating scales. This method treats the problem of detection as a classification one, while treating that of assessment as a regression one.
Also, Sato et al. [37] aimed to assess the disease by determining the rhythm, length and duration of gait through simple step-by-step videos of healthy individuals and patients with Parkinson’s disease. They used the OpenPose technique and unsupervised classification methods to observe the movement and evaluate it. Similarly, the OpenPose technique has been used to extract key points for rating resting tremor and bradykinesia via video analysis of upper-limb motion, calculating parameters such as amplitude and speed [38]. It has also been used for the overall assessment of Parkinson’s severity via the UPDRS, extracting features from the time series signals of each key point that capture the relevant characteristics of movement [39].
In a similar context, Liu et al. [40], Lu et al. [41], Dadashzadeh et al. [42], Mehta et al. [43], and Rupprechter et al. [44], use other deep learning techniques for pose estimation in order to study upper and lower limb movements. They detect abnormalities in posture and gait and evaluate patients’ bradykinesia according to UPDRS. Furthermore, facial analysis methodologies are often introduced into these systems for disease assessment, since patient expressions are distorted [45]. The majority of the studies mentioned in this section use Random Forest, SVM, and k-NN algorithms as classifiers to make predictions. However, some of them use other models, like 3D CNN, LSTM, and TCN.
It is worth mentioning the KELVIN-PD [46] and Park (Parkinson’s Analysis with Remote Kinetic-tasks) [47] applications, which use a smartphone and a computer, respectively, to record certain movements on video. They have software that processes these data, detects movement disturbances, evaluates them, updates the patients’ files and informs the specialists who monitor them. Motion metrics are measured, and signals and features are extracted from the data in order to classify the severity level of the disease (according to the UPDRS) by observing motor and audio tasks.
Most of the studies observe activities based on UPDRS concerning symptoms in the upper and lower limbs. They tend to focus on specific body parts rather than assessing the entirety of body movements. Furthermore, these studies exclusively approach the problem of PD assessment as a classification task. As a result, they do not predict real values as actual rating scores, but only provide the estimated severity classes. Moreover, all the aforementioned studies compute mean values of the features extracted from the whole signals over time. Consequently, the predictions rely on features that describe a summary of the movement and not on unique features extracted individually from each timestamp of a video, i.e., each video frame.
It is worth emphasizing that the only research that tackles the assessment problem through a regression perspective is the study conducted by Li et al. [15]. This study extracts kinematic features from frames relevant to the task under examination and uses them to make an automatic assessment, i.e., predict real values according to the rating scales rather than severity categories. In addition, it is the only study, among the aforementioned ones, which deals with the problem of LID. This study is the state-of-the-art for LID assessment according to UDysRS and is compared with the proposed system in the experimental study. This comparison arises as both systems address the same task with the same output goal using different methodologies.
The problem of action recognition is a spatio-temporal problem in which data derived from both the spatial and temporal axes have to be combined for further analysis. Most of the studies generate signals from the action sequences under examination and utilize their mean values as features, which serve as input for models like SVM, k-NN, Random Forest, or simple ANNs. Other studies make use of purely spatio-temporal information, using traditional RNN, LSTM, or 3D CNN models. In our study, we try to resolve this problem by robustly encoding the action sequence in order to capture all the necessary information, which is then used as input to a lightweight model capable of making successful predictions. The proposed methodology serves this purpose, as it encodes and represents all the motion information in 2D images through the SRFs. CNNs are the most suitable models for managing and extracting features from images, and so they are the best choice for our proposed system.
Finally, systems using Deep Neural Networks (DNNs) have been developed to detect and diagnose Parkinson’s disease through medical images of the brain in combination with demographic, epidemiological, and clinical data. These types of systems combine Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), which, with the use of supervised and unsupervised learning, as well as transfer learning, can yield significant results in disease prediction through MRI (Magnetic Resonance Imaging) data or DaT (Dopamine Transporter) scans [48,49,50]. These systems significantly support medical practice and contribute to the successful assessment of the severity of the disease using another type of related data.

3. Dataset

Our aim is the development of a machine learning system for the automatic assessment of LID, which provokes involuntary movements characterized by a non-rhythmic motion flowing from one body part to another (chorea) and/or involuntary contractions of opposing muscles causing the twisting of the body into abnormal postures (dystonia) [51]. Given that the UDysRS is the assessment measure of LID, our work focuses on the estimation of UDysRS scores. Its results are based on the evaluation of the patient’s medical history and the observations they make in their daily life regarding this type of dyskinesia, as well as the objective evaluation of certain activities that the patients perform at the time of examination [12]. The assessment of LID is divided into four parts according to the UDysRS. Parts 1 and 2 constitute the historical section, while Parts 3 and 4 constitute the objective one. In the objective section, patients perform four activities: communication, drinking from a cup, dressing, and ambulation, in order to assess their impairment and disability caused by medication. Although historical information related to the disease is very important for the assessment, there is very high interest in results derived from the observation of patients during these activities.
The data that we use in the present work come from the dataset (https://github.com/limi44/Parkinson-s-Pose-Estimation-Dataset (accessed on 17 January 2023)) of the study of Li et al. [15]. These were recorded by the Center for Movement Disorders of Toronto Western Hospital. The study included nine people, five men and four women, with a mean age of 64 years, who were diagnosed with Parkinson’s disease and had problems with levodopa at least 25% of their day. These patients underwent selected activities according to the standard rating scales for parkinsonism and levodopa-induced dyskinesia, UPDRS and UDysRS, respectively. More specifically, the examinees were asked to describe a picture, to talk to the examiner, and to answer mental and mathematical tests (Communication task), as well as to drink a glass of water (Drinking task). These two activities were evaluated by physicians according to the UDysRS. They also underwent leg agility and toe tapping tasks, which were evaluated according to the UPDRS. These actions were recorded with a 480 × 640 or 540 × 960 camera at 30 frames per second, which the examinees faced directly. Their evaluation was made by three specialist neurologists every 15–30 min over a period of 2–4 h, following the completion of a levodopa-infusion protocol. Therefore, the patients underwent simple tasks that were suitable for their automatic evaluation using non-invasive methods. The data have been processed in order to extract appropriate features. In all the videos, the Convolutional Pose Machines (CPMs) technique [28] was applied to locate the joints of the human body in the video frames. As a result, 2D coordinates were extracted from each frame for the head, neck, shoulders, elbows, wrists, hips, knees, and ankles.
In this paper, we use the data of the tasks related to the estimation of the UDysRS score following levodopa’s administration. It is worth mentioning that the dataset introduced by the study of Li et al. [15] constitutes the only publicly available dataset containing videotaped activities for LID assessment according to the UDysRS.
Primarily, for our experiments, we focus on the dataset of the Drinking task. It constitutes a more challenging and difficult task than Communication, as it contains a type of voluntary movement, which must be differentiated from the involuntary ones and those of parkinsonism. In this task, the patients are asked to pick up a cup filled with water with the dominant hand, bring it to their lips, drink, and replace the cup on the table, ignoring bradykinesia or tremor from parkinsonism. This dataset includes 118 videos corresponding to patients with IDs 1, 2, 4, 5, 6, 7, 8, 10, and 11. Each video consists of a number of frames, each of which corresponds to a set of coordinates derived from the points detected by the CPM. Also, each of these videos is associated with an evaluation according to the UDysRS comprising six real values on a scale of 0–4. These six values correspond to the scores assigned by neurologists assessing six significant body parts for dyskinesia: the neck, right and left arm (shoulder–elbow–wrist), torso (right–left shoulder), and right and left leg (hip, knee, ankle).
Secondly, we extend our experiments using the dataset of the Communication task, in order to strengthen the results of the proposed system. In this task, the patients are asked to sit on a chair, opposite the camera, without moving, in order to describe an image, discuss it with the examiner, and respond to relevant mental and mathematical questions. Through this activity, how much the patient is distracted and affected by the involuntary movements due to LID is observed and evaluated. This dataset includes 128 videos corresponding to patients with IDs 1, 2, 4, 5, 6, 7, 8, 10, and 11. Similar to the previous dataset, each video consists of a number of frames, each of which corresponds to a set of coordinates derived from the points detected by the CPM. Also, each of these videos is associated with an evaluation according to the UDysRS comprising six real values on a scale of 0–4. These six values, again, correspond to the scores assigned by neurologists for the six body parts under assessment.

4. The Proposed Methodology

In the present work, we aim to address the problem of automatic assessment of LID by employing a machine learning system capable of extracting features from specific movements and using them to make successful predictions regarding the dyskinesia’s severity. Our data consists of coordinates of human body joints, which have been acquired through pose estimation techniques applied to each frame of the original video data. Our goal is to extract the necessary features from the contained motion information in order to estimate real values between 0 and 4 for each of the six examined body parts. These values, in accordance with the UDysRS, signify the degree of LID severity.
The proposed methodology is based on the application of a robust feature extraction scheme paired with a Convolutional Neural Network (CNN). Specifically, this approach entails the utilization of Spatio-temporal Radon Footprints (SRF), which was initially presented in [13]. We introduce a novel action sequence encoding scheme, which efficiently transforms spatio-temporal information into compact representations, using Mahalanobis distance-based shape features and the Radon transform. These representations are then used as inputs to a lightweight CNN, which is characterized by its simplicity, ease of training, and capacity to learn complex motion patterns. In this way, we manage to develop a robust system that provides a compact action-recognition pipeline that does not require extremely sophisticated hardware.

4.1. Spatio-Temporal Radon Footprints

Our approach relies on the fact that human joints can be treated as probabilistic distributions in space. Consequently, an action sequence can be described by modeling the relationships of salient features [52]. As the relative position between a joint and the human body is crucial for the definition of human motion, we represent this relationship via the Mahalanobis distance. This distance, as an established metric for point-to-distribution distances, is more suitable for this task than more intuitive metrics, such as the Euclidean distance between a point and the distribution centroid. Given a sample vector x from a distribution D with mean μ and covariance matrix S, the Mahalanobis distance between x and D is given by Equation (1):
$\mathrm{mahal}(x, \mu, S) = \sqrt{(x - \mu)^{T} S^{-1} (x - \mu)}$ (1)
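To make Equation (1) concrete, the following is a minimal NumPy sketch (ours, not the paper's code) of the distance between one joint and the distribution formed by the remaining joints of a frame; the joint coordinates are hypothetical.

```python
import numpy as np

def mahalanobis(x, mu, S):
    # Equation (1): sqrt((x - mu)^T S^{-1} (x - mu))
    diff = x - mu
    return float(np.sqrt(diff.T @ np.linalg.inv(S) @ diff))

# Hypothetical 2D joint coordinates for a single frame (rows = joints).
joints = np.array([[0.1, 0.2], [0.4, 0.8], [0.5, 0.3], [0.9, 0.7]])
x = joints[0]                   # the joint under consideration
rest = joints[1:]               # the distribution of the other joints
mu = rest.mean(axis=0)          # distribution mean (centroid)
S = np.cov(rest, rowvar=False)  # distribution covariance matrix
print(mahalanobis(x, mu, S))    # point-to-distribution distance
```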
What this representation lacks, however, is a way to correlate individual frame information through compact, robust features that describe the action sequence from its start up to a specific point in time. For this reason, we use the Radon transform, an effective feature-extraction tool with many applications in human action-recognition tasks [53,54,55]. This transform converts the action’s information of each frame into a signal focusing on its spatio-temporal representation. As a result, it captures patterns and extracts features that combine spatial and temporal information for action recognition. By applying the Radon transform to these motion features, information about the direction and intensity of motion is extracted. Additionally, the Radon transform can be utilized for visualizing and understanding the spatio-temporal dynamics of actions in a compact and interpretable manner. Studies have shown that CNN models trained on Radon-transformed data can give better results.
Given a 2D function f ( x , y ) with x , y R , its Radon transform R f is equal to the integral of all values of f under all lines L, defined by parameters ρ , θ R , where ρ is each line’s distance from the origin and θ is the angle formed between the line and the horizontal axis:
$\mathcal{R}f(\rho, \theta) = \int_{L} f(x, y) \, dL$ (2)
If we substitute the parametrical ρ , θ line equation in Equation (2), we have the following Equation (3):
$\mathcal{R}f(\rho, \theta) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f(x, y) \, \delta(x \cos\theta + y \sin\theta - \rho) \, dx \, dy$ (3)
where δ is the Dirac delta function, ensuring that only the values of f along the line (ρ, θ) are integrated.
The Radon transform plays a significant role in human action-recognition tasks by capturing and representing motion patterns in video sequences [56,57,58]. It involves projecting an action’s motion information from various angles and accumulating these projections to create a parameter space known as a sinogram. In the context of human action recognition, the Radon transform offers distinct advantages. It extracts directional motion features, making it effective in identifying unique motion patterns associated with different actions. These features are invariant to translation, rotation, and scaling, and are effective, simple, and fast to create. So, this approach enhances the recognition of actions performed from various viewpoints, ensuring robustness against changes in orientation. Additionally, the Radon transform’s ability to emphasize specific motion characteristics aids in detecting subtle gestures or dynamic actions that might be challenging to recognize from raw video frames alone. By providing a transformed representation that highlights motion signatures, the Radon transform contributes to more accurate, efficient, and viewpoint-invariant human action recognition systems.
Since the above dataset consists of skeletal data, i.e., the joint coordinates of the depicted human body for each frame, we can consider that the set of these points constitutes a probabilistic distribution in space for each moment in time. We can rely on this distribution to model the relationship among the joints at each timestep. This way, we can express the relationship between the position of one joint and the rest of the body parts of a subject, at a particular point in time, as the Mahalanobis distance between it and the distribution of the other joints. A point in time is a frame in an action sequence or video. So, we calculate the Mahalanobis distance for each joint in a frame and extend this calculation across all the frames of a video, i.e., all the moments of an action sequence, for a specific subject. As a result, we can formulate a Mahalanobis matrix M, whose values for joint j and frame f can be calculated using Equation (4) (the Mahalanobis distance is denoted as $\mathrm{mahal}(\cdot)$ for short):
$M(j, f) = \mathrm{mahal}(x_f^j, \mu_f, S_f)$ (4)
where $x_f^j$ is the coordinate vector of joint j in frame f, defined in $\mathbb{R}^2$ or $\mathbb{R}^3$, $\mu_f$ is the mean of all joint coordinate vectors in frame f (essentially the centroid) and $S_f$ is the covariance matrix of the joint distribution at frame f.
The Mahalanobis matrix is a complete 2D representation of the action sequence in space and time. With J × t resolution, where J is the total number of joints and t is the point in time (number of frames passed) since the beginning of the sequence, each row of the matrix includes the Mahalanobis distances of every joint at the corresponding frame and each column the distances of a specific joint across the frames. Encoding the action sequence at a particular point in time is equivalent to creating this matrix, which contains all the necessary information up to this timestep.
The pipeline then correlates the individual frame information into a compact feature that describes the action sequence from its start up to point t, using the Radon transform of matrix M—treated as a 2D function of time (frames) and joints. This gives us the SRF at point t: a 2D spatio-temporal feature describing the action sequence up to point t in time (Equation (5)).
$SRF_t(\rho, \theta) = \int_{0}^{t-1} \int_{0}^{J-1} M(j, f) \, \delta(j \cos\theta + f \sin\theta - \rho) \, dj \, df$ (5)
It is worth noting that t ≥ 2 is required, both for M to be a 2D function of j and f and for the representation to encode a minimum of temporal information. In practice, this means that some frames may need to pass before SRF calculation can start, because otherwise it is impossible to ascertain the nature of an activity with insufficient joint information.
The process of calculating an SRF at a point in time, from a sequence of frames containing joint information, is presented in Figure 1. More specifically, the proposed encoding scheme takes as input skeleton data derived from the application of pose estimation techniques to video data. For each video frame of an action sequence, which corresponds to a specific point in time, we have the joint coordinates of the human body depicted in it. We calculate the Mahalanobis distance between one joint and the rest of the body parts, and we extend this for all the joints of a frame and for all frames of a video up to a specific point in time. So, we formulate a Mahalanobis matrix, which contains as many rows as frames and as many columns as joints. Then, we apply the Radon transform and, as a result, we obtain a 2D sinusoidal image, i.e., a sinogram, which contains the information of the action up to the corresponding moment in time. This is called the Spatio-temporal Radon Footprint at t ($SRF_t$). This procedure is extended across all the timesteps of an action sequence, so different SRFs are produced as the action progresses.
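A minimal sketch of this encoding scheme is given below, under our own simplifying assumptions (it is not the authors' implementation): the Mahalanobis matrix is built row by row per frame, and scikit-image's radon function then produces the sinogram, i.e., the SRF.

```python
import numpy as np
from skimage.transform import radon

def mahalanobis_row(frame_joints):
    """Equation (4) for one frame: distance of each joint to the
    distribution of the remaining joints; frame_joints: (J, 2) array."""
    J = len(frame_joints)
    row = np.empty(J)
    for j in range(J):
        rest = np.delete(frame_joints, j, axis=0)
        mu = rest.mean(axis=0)
        S_inv = np.linalg.pinv(np.cov(rest, rowvar=False))
        d = frame_joints[j] - mu
        row[j] = np.sqrt(d @ S_inv @ d)
    return row

def srf(sequence, t, thetas=np.arange(180)):
    """Equation (5): Radon transform of the t x J Mahalanobis matrix;
    sequence: (frames, J, 2) joint coordinates over time."""
    assert t >= 2, "at least two frames are needed for a 2D matrix"
    M = np.stack([mahalanobis_row(sequence[f]) for f in range(t)])
    return radon(M, theta=thetas, circle=False)  # the sinogram SRF_t

# Hypothetical usage: 100 frames of 14 joints in 2D.
seq = np.random.rand(100, 14, 2)
print(srf(seq, t=5).shape)  # (projection positions, number of angles)
```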
As a result, we obtain 2D sinogram images for each time point, which encode the action sequence up to the corresponding frame of the video. These 2D sinusoidal representations contain all the necessary information as the movements evolve. So, the sinograms, encoding distinct motion information, exhibit varying shapes, widths, and focal points. These variations depend on the nature of the movements and their chronological occurrence. More detail on the aforementioned sequence encoding scheme can be found in [13]. More specifically, a sinogram is a visual representation derived from the Radon transform, offering a unique perspective on an image’s internal structures and features. Created by projecting the image’s contents along various angles, the sinogram showcases how these structures are observed from different viewpoints. Each row in the sinogram corresponds to a specific projection angle, while the columns represent the accumulation of intensities along the projected lines. By presenting these accumulated values in a 2D grid, the sinogram provides a compact yet insightful representation of an image’s components and their orientations. This graphical portrayal unveils a novel way to analyze the distribution of features within an image, shedding light on patterns that might be less apparent in the original visual representation.
Regarding the use of the Mahalanobis distance and the Radon transform for encoding action sequences, we further explain their role in the proposed system. Skeletal joints can be considered valid spatio-temporal interest points when it comes to human action analysis and, being a more refined and targeted source of human posture information, they can lead to better results. As the relative position of a joint and the human body is crucial in the definition of human motion, we believe that a compact and robust way to model these individual relations should be central to any action-recognition pipeline. The Mahalanobis distance, as an established metric for point-to-distribution distances, is more suitable for this task than more intuitive metrics, such as the Euclidean distance between a point and the distribution centroid. The Radon transform is effectively used as a feature-extraction tool for image analysis. It creates translation- and scale-invariant features to be used as input for Artificial Neural Networks. In this way, we give spatio-temporal prominence to raw joint information using variance-based shape features and, particularly, the Mahalanobis distance. The Radon transform, after that, creates robust salient features, which are crucial for understanding and interpreting the visual information and the interest points of an action. Therefore, the use of the Radon transform is appropriate for feature extraction and data analysis, encoding the information content to create a compact representation of the data for better assessment of an action. It is true that the discrete Radon transform of the Mahalanobis matrix (computed over discrete joint and frame indices) may introduce additional gridding errors; however, the performed experiments show that no severe discretization effect is introduced and that machine learning can produce improved results compared to the state-of-the-art methods.
Consequently, experiments confirm that this pipeline, when based on state-of-the-art human pose-estimation techniques, can provide a robust end-to-end online action recognition scheme, which does not demand extreme computational resources [59,60,61].

4.2. SRF-CNN

It is customary to use Recurrent Neural Network (RNN) or Long Short-Term Memory Network (LSTM) architectures for online pattern recognition tasks involving temporal sequences [62], due to their ability to learn and encode temporal information. However, known issues come with the use of such networks, including the need for large training datasets in most cases, the vanishing and exploding gradient problems, as well as their high computational demands. By encoding online spatio-temporal information in SRFs and using robust architectures, we aim to eliminate the need for such networks and provide an online and compact action recognition pipeline that would not depend on extremely sophisticated hardware. CNNs are the most suitable models for managing and extracting features from images and are a very good choice for the proposed system [63]. So, CNNs are chosen to learn the features derived from SRFs, because of their simplicity, their ease of training and deployment, and their ability to adapt to complex pattern-recognition problems.
The Radon transform significantly enhances the performance of CNNs in human action-recognition tasks [64]. By preprocessing input video sequences through the Radon transform, intricate motion patterns are captured and transformed into directional features. These features provide the CNN with a unique understanding of motion dynamics, making it adept at recognizing various actions even when presented from different viewpoints. This integration empowers the CNN to identify nuanced gestures, discern subtle changes, and extract essential motion-related information that raw frames might not reveal. The Radon transform effectively acts as a specialized feature extractor, complementing the CNN’s spatial analysis prowess. Consequently, this combined approach leads to an action-recognition system that is more accurate, robust to variations, and capable of distinguishing complex actions and patterns across different scenarios.
As mentioned above, our method focuses on the use of an SRF feature-extraction scheme paired with a CNN for LID assessment. SRFs offer compact representations of an action sequence, encoding and visualizing its spatio-temporal patterns. According to this pipeline, SRFs extracted from joint data are shuffled and inserted into a CNN for training and prediction. Then, the CNN learns the corresponding features directly from the SRFs and detects complex motion patterns encoded in these representations in order to make predictions. The SRF-CNN pipeline is presented and explained in Figure 2. More specifically, our aim is the estimation of six real values, between 0 and 4, corresponding to the six body parts under LID assessment according to the UDysRS. These values represent the severity level of LID for each body part. In this way, we intend to automatically quantify the severity level of LID for each body part, as the UDysRS indicates. For this reason, we focus on the implementation of a scheme capable of encoding the incoming information in a form easily manageable by a machine learning system, so as to achieve the best possible results. The proposed system takes as input data derived from the application of pose estimation techniques on videotaped activities of PD patients, i.e., the joint coordinates of the human body for each frame. SRF is introduced as the encoding scheme that transforms the data information into compact features representing, in 2D sinusoidal images, all the necessary information of an action sequence up to a specific point in time. As an action progresses, different SRFs are produced, given that SRFs are calculated across all the timesteps of an action sequence. So, the model, for training and prediction, takes as input these 2D images, i.e., sinograms, which represent the initial data per frame. Finally, SRFs encode the action sequence in a compact, robust and representative form that contains all the necessary and meaningful information, i.e., all the features related to this action. So, the chosen CNN directly learns from the SRFs, i.e., the features related to the action, and estimates the severity level of LID.
The creation of SRF representations constitutes a preprocessing step intended to improve the ability of the CNN to capture the important motion information and predict the desired rating values. SRFs enhance the motion-related aspects of the input data and provide information about motion patterns that might not be captured well by the CNN alone. This preprocessing step, thanks to the Radon transform, leads to a more comprehensive representation of the data and helps the CNN to recognize intricate motion details that might not be apparent in raw frames. Preprocessing with the Radon transform improves the CNN’s robustness, given that the input data become more invariant to changes in orientation, background and noise, emphasizing important motion information. The reduced dimensionality and enhanced features provided by this step can potentially lead to faster training convergence and improved generalization.
This pipeline was first introduced in our previous work [14] and is explored further in this paper. In particular, we introduce two distinct SRF-CNN approaches to address the problem of assessing LID, the Global and the Local one, which are also compared to the current state of the art [15]. The differences between these two approaches are related to the respective pipelines’ input and output. The Global approach creates SRF representations from the entire set of skeletal joints, whereas the Local one creates SRF representations only from the joints of the specific body part. These representations respectively constitute the input of CNNs, which perform the final assessment. In the first approach, the predictions of six UDysRS values are obtained simultaneously, using all the data. In the latter one, the predictions are conducted separately for each individual body part (neck, right and left arm, torso, right and left leg), using only the corresponding data.

4.2.1. Global Approach

The Global SRF-CNN is implemented using the whole set of skeletal joint coordinates available for each frame. This approach endeavors to develop and learn a 2D representation that encodes the entire motion sequence up to the reference time. In this way, the system is designed to collectively predict the six UDysRS values by training a single network. As already mentioned, each of these six values corresponds to a distinct body part (neck, right and left arm, torso, right and left leg).
In this context, we intend to demonstrate the ability of this methodology to encode the entire body movement within a representation that contains all the requisite information for the concurrent prediction of six values, each corresponding to the six body parts under evaluation. We consider that an involuntary movement of one body part might affect another, given that some symptoms of LID, like chorea, flow from one part to another. This is the reason behind our interest in assessing the entirety of body movement. The SRF-CNN pipeline illustrated in Figure 2 aligns with the Global approach by receiving specific input and producing specific output. It takes the whole set of joint data for each video frame as input in order to generate SRFs that encode the entire body activity up to that specific time point. This pipeline simultaneously gives 6 real values ranging from 0 to 4 as output, corresponding to each of the examined body parts.
Regarding the CNN that follows the SRF representation, the optimal architecture has been found through an extensive experimental study. The Global SRF-CNN uses a CNN with two 2D convolutional layers and a relatively small number of neurons (32 and 64, respectively), kernel size 3 × 3 and ReLU activation functions, Max Pooling layers with pool size 2 × 2, as well as a Dense layer with depth equal to 256 and ReLU activation. After each Max Pooling layer, a Dropout layer, with a dropout rate of 0.25, is added. Also, before the final Dense layer, another Dropout layer with rate of 0.5 is placed. The training is carried out with a batch size of 32 and 50 epochs. This model’s architecture is presented in Figure 3.
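A sketch of this architecture in Keras follows; the SRF input resolution and the linear output activation are our assumptions, as they are not specified above, while the loss, metric and optimizer anticipate the training setup described in Section 5.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_global_srf_cnn(input_shape=(128, 128, 1)):  # assumed SRF size
    model = models.Sequential([
        layers.Input(shape=input_shape),           # SRF sinogram image
        layers.Conv2D(32, (3, 3), activation='relu'),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(0.25),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.5),                       # before the final Dense
        layers.Dense(6, activation='linear'),      # one score per body part
    ])
    model.compile(optimizer='adam', loss='mse',
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])
    return model

# model = build_global_srf_cnn()
# model.fit(srf_frames, udysrs_scores, batch_size=32, epochs=50)
```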

4.2.2. Local Approach

The Local SRF-CNN is implemented using the coordinates of joints linked to the anatomical area of a specific body part under assessment. In this context, 2D representations encode the movements of individual body parts. This way, the system independently acquires and learns distinctive features from each body part. So, the focus is on the individual assessment of each body part. As a result, six independent networks are trained to model and evaluate all the body parts. Therefore, we demonstrate that our proposed methodology effectively manages the problem in a segmented manner, concentrating on the relevant data for each specific body part under examination.
As an action progresses over time, many movements from various body parts are observed. Despite the correlation that the movements may have, we consider them to be independent since they are located in different parts of the body and activate different joints. It is interesting to approach the problem by concentrating on the set of those points that play a decisive role in each movement. Thus, we aim to achieve more successful evaluation results by focusing on the movement of individual parts rather than the whole body. The SRF-CNN pipeline illustrated in Figure 2 aligns with the Local approach by receiving specific input and producing specific output. As input, it takes, for each video frame, only the relevant joint data corresponding to a specific body part. As output, it gives a single real value between 0 and 4 for the assessment of that particular body segment. In order to assess all body parts, this pipeline is employed six times, once for each body part being evaluated. The key distinction lies in the input’s joint data, tailored to the respective body part.
We aim to compare these two approaches and arrive at a conclusion regarding the superior one for our problem. Additionally, we intend to demonstrate that the SRF-CNN method can address similar problems from both a generic and a focused aspect.
Similarly to the Global case, the Local SRF-CNN uses a CNN with three 2D convolutional layers and a small number of neurons (16, 32, and 64, respectively), kernel size 3 × 3 and ReLU activation functions, Max Pooling layers with pool size 2 × 2, as well as a Dense layer with depth equal to 256 and ReLU activation. After each Max Pooling layer, a Dropout layer, with a dropout rate of 0.25, is added. Also, before the final Dense layer, another Dropout layer with rate of 0.5 is placed. The training is carried out with a batch size of 64 and 10 epochs.
In this case, each of the six body parts under assessment corresponds to an individually trained network. However, we target the same model architecture for all of them by choosing the optimal one overall, i.e., across the six body parts. This model’s architecture is presented in Figure 4. It is worth noting that the Global SRF-CNN requires more trainable parameters than the Local SRF-CNN, given that the first approach adds more complexity to the problem.
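For completeness, a corresponding sketch of the Local variant is given below (under the same assumptions as the Global sketch); it differs in its three convolutional blocks and its single regression output, with one such network instantiated per body part.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_local_srf_cnn(input_shape=(128, 128, 1)):   # assumed SRF size
    model = models.Sequential([layers.Input(shape=input_shape)])
    for filters in (16, 32, 64):                      # three conv blocks
        model.add(layers.Conv2D(filters, (3, 3), activation='relu'))
        model.add(layers.MaxPooling2D(pool_size=(2, 2)))
        model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(256, activation='relu'))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(1, activation='linear'))   # one body-part score
    model.compile(optimizer='adam', loss='mse',
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])
    return model

# Six independent networks, one per assessed body part, each trained with
# batch_size=64 and 10 epochs on the SRFs of its own joint subset:
parts = ('neck', 'right_arm', 'left_arm', 'trunk', 'right_leg', 'left_leg')
local_models = {part: build_local_srf_cnn() for part in parts}
```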

4.3. Visualization Technique

In order to document the ability of our methodology to successfully assess LID, we introduce and appropriately use a visualization technique capable of explaining the models’ decision-making procedure. This technique is the Grad-CAM explainability method [16], which calculates a CAM-type heatmap based on the derivatives of the feature map elements of the last convolutional layer of each network. In this way, we intend to add explainability to our models’ decisions by localizing the areas that influence the predictions. The method locates, in SRF-frames, those points that played a significant role in the final assessment. Our aim is to show that our proposed SRF encoding scheme is more representative of the motion information, permits the use of Grad-CAM to visualize explanations, and leads to better results on the task of LID assessment.
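The sketch below shows the standard Grad-CAM computation for such a regression output, assuming a Keras model; the layer name 'last_conv' is a hypothetical placeholder for the network's final convolutional layer.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, srf_image, output_index=0, last_conv='last_conv'):
    """CAM-type heatmap over an SRF-frame for one predicted score."""
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv).output, model.output])
    with tf.GradientTape() as tape:
        conv_maps, preds = grad_model(srf_image[np.newaxis, ...])
        score = preds[:, output_index]            # body-part score of interest
    grads = tape.gradient(score, conv_maps)       # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))  # global-average-pooled grads
    cam = tf.einsum('bhwc,bc->bhw', conv_maps, weights)[0]
    cam = tf.nn.relu(cam)                         # keep positive influence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```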

5. Experimental Setup

In the described context, the automatic assessment of LID in PD patients is regarded as a regression problem. The proposed pipelines, Global SRF-CNN and Local SRF-CNN, each focus, in a different way, on predicting real values, namely six values per video based on the UDysRS score. These values correspond to the assessed dyskinesia of the six body parts.
In the dataset under inspection, every subject is recorded performing the Drinking task multiple times. We follow the Leave-One-Person-Out Cross-validation (LOPOCV) protocol, in which one person’s samples are singled out to form the testing subset, while the proposed neural network is trained using the remaining persons’ samples. This process is iterated over all persons. Given that each person performs an action in a unique way, compared to any other, we ensure that the network is never trained and evaluated on the same person’s data. LOPOCV is considered a rigorous cross-validation protocol, suitable for an action recognition and assessment task [13], particularly one in which the dataset is not considerably large. It may be a strict protocol, but it fully reflects the reality and nature of the problem. Consequently, this technique avoids potential overfitting and biased model performance.
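A minimal sketch of this protocol using scikit-learn's LeaveOneGroupOut follows; srf_frames, scores and person_ids are hypothetical arrays in which every SRF-frame carries the ID of the person it was extracted from, and build_global_srf_cnn refers to the sketch in Section 4.2.1.

```python
from sklearn.model_selection import LeaveOneGroupOut

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(srf_frames, scores, groups=person_ids):
    model = build_global_srf_cnn()           # a fresh network for each fold
    model.fit(srf_frames[train_idx], scores[train_idx],
              batch_size=32, epochs=50, shuffle=True)
    fold_predictions = model.predict(srf_frames[test_idx])
    # ... aggregate per-video scores and compute this person's MSE / RMSE
```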
According to the SRF extraction pipeline, every SRF-frame represents the evolution of the subject’s activity up to that moment. The initial SRF-frame is computed using information extracted from the first five frames of the video, encoding the action up to that particular time point to generate the first instance. In this way, a few frames elapse before the first SRF is calculated, ensuring there is sufficient information for encoding. This is consistent with the methodology’s constraint (t ≥ 2) mentioned in the previous section.
It is logical to have the ability to handle frames independently, regardless of the sequence to which they belong, treating each of them as an individual entity. Shuffling the dataset, as well as setting a relatively large batch size, ensures that the network is trained using independent frames, rather than knowledge amassed from one single video. In this way, the batches correspond to independent representations of the data and the network is trained by extracting features from independent frames, resulting in more objective training and better network behavior. The training process also uses the Mean Squared Error (MSE) as the loss function, the Root Mean Squared Error (RMSE) as the metric, and adaptive moment estimation (ADAM) as the optimizer. So, the experiments’ evaluation is based on both MSE and RMSE values.
Every SRF-frame in the processed dataset is annotated with actual UDysRS values of the video it belongs to. This is based on the assumption that the score characterizes the entire video and all its frames. It also enables us to perform evaluation both at the frame and video level. Another assumption that we make is that both MSE and RMSE metrics should be minimized and be as close to 0 as possible [65,66].
The evaluation of the experimental setup is divided into two phases. The initial phase involves per-frame predictions for all frames within the testing set, following the LOPOCV protocol. In the second phase, a final score is computed per video, based on every per-frame prediction for that particular video. This approach is influenced by the fact that clinical dyskinesia assessment for levodopa-prescribed patients is also handled in a similar manner by practitioners: after the patient completes an activity, the practitioner assesses the activity as a whole. Furthermore, since each SRF-frame of the video represents the movement up to the indicated time, it participates equally in the final assessment. So, the per-frame prediction serves as an effective strategy in achieving the best overall prediction, given that each frame contributes to the final decision.
After completing the prediction process for every frame within a video from the testing set, which encompasses the samples of a single individual according to the LOPOCV protocol, the ultimate prediction of the UDysRS values is determined for each video within this subset. Subsequently, the Squared Error (SE) between the final predicted value for each video and the corresponding ground truth is computed. Then, the MSE and RMSE for all videos within this testing subset are calculated. These metrics serve as the evaluation criteria for the ongoing iteration of LOPOCV, i.e., for the particular person under examination. Finally, after every person’s samples have been treated as test samples once, i.e., after all the iterations of LOPOCV protocol, we calculate the Average Mean Square Error (AMSE) and the Average Root Mean Square Error (ARMSE). These metrics involve averaging the MSEs and RMSEs across all individuals examined. So, they serve as the conclusive evaluation indicators for the entire experiment.
Consequently, our system conducts per-frame predictions for every video within the testing set. These individual predictions are averaged to derive an estimated prediction for each video. Subsequently, the SEs for each video are determined. By utilizing the SEs for each person, the MSEs and RMSEs are calculated across all videos belonging to that individual. This results in separate MSEs and RMSEs for each person, corresponding to each body part. Then, the process concludes with the computation of the AMSE and ARMSE. These metrics are attained by averaging the MSEs and RMSEs of each person.
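The two evaluation phases can be summarized in the following sketch (ours; per_person is a hypothetical mapping from each person's ID to that person's lists of per-frame predictions and ground-truth scores):

```python
import numpy as np

def person_errors(video_preds, video_truths):
    """video_preds: list of (frames, 6) per-frame predictions, one per video;
    video_truths: list of (6,) ground-truth UDysRS scores, one per video."""
    ses = [(p.mean(axis=0) - t) ** 2        # SE of the averaged video score
           for p, t in zip(video_preds, video_truths)]
    mse = np.mean(ses, axis=0)              # per body part, for this person
    return mse, np.sqrt(mse)                # this person's MSE and RMSE

# Average over all LOPOCV iterations (persons) to obtain AMSE and ARMSE.
mses, rmses = zip(*(person_errors(p, t) for p, t in per_person.values()))
amse, armse = np.mean(mses, axis=0), np.mean(rmses, axis=0)
```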

6. Experimental Study

6.1. Evaluation Scenarios

Given that each SRF-frame of a video encodes the action sequence's information from its start up to time point t (SRFt), we consider the frames to carry no dependencies between them. It is therefore interesting to examine whether all SRF-frames are actually necessary for the prediction, or whether a subset of them can improve the assessment process.
For this purpose, five different scenarios were selected for the evaluation of Global SRF-CNN on the Drinking task. The choice of these scenarios is indicative and reflects an attempt to determine whether a subset of the frames is more significant towards the ultimate goal. These scenarios rest on the assumption that, as the activity progresses, the frames may capture more information, but noise may also accumulate. Taking the above into account, we investigate whether using a specific subset of the SRF-frames improves the system's performance and supports a generalized conclusion about the importance of frames as the movement evolves.
The scenarios, which refer to using specific subsets of the complete SRF-processed dataset for training and evaluation, are the following (a selection sketch is given after the list):
  • First SRF-frame (encoding the first five frames of the video).
  • First 25% of the SRF-frames of a video.
  • First 50% of the SRF-frames of a video.
  • First 75% of the SRF-frames of a video.
  • All (100%) SRF-frames of a video.
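Below is a minimal sketch of this subset selection; the helper name and the assumption that each video's SRF-frames arrive in chronological order are illustrative.

```python
# Sketch of the frame-subset selection behind the five scenarios.
def scenario_subset(srf_frames, scenario):
    """srf_frames: chronologically ordered SRF-frames of one video."""
    if scenario == 1:                        # first SRF-frame only
        return srf_frames[:1]
    fraction = {2: 0.25, 3: 0.50, 4: 0.75, 5: 1.00}[scenario]
    n = max(1, int(len(srf_frames) * fraction))
    return srf_frames[:n]                    # first 25% / 50% / 75% / all
```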
To evaluate these scenarios, we observe the per-video predictions during all iterations of LOPOCV. Initially, our observations are based on the Squared Errors calculated for each video of the validation subsets and on the Mean Squared Errors for each person across the iterations of the LOPOCV protocol. We then observe the Average Mean Squared Errors across all persons as an indicator of the overall performance of the experimental scenarios. In this way, we try to discover whether a subset of SRF-frames plays a more important role in the predictions, which could lead to a generalization about the importance of certain frames during the evolution of movement.
Observing the results in Table 1, where the AMSEs across all persons, i.e., across all iterations of the LOPOCV protocol, are presented, the 5th scenario shows the best overall performance. The other scenarios demonstrate similar performance, but there is no uniform increase or decrease of their errors across videos, persons, or the assessed body parts. We therefore cannot derive a general rule about whether a certain number of SRF-frames should be involved in the predictions.
Under the assumption that every frame contributes equally to the final prediction, given that every movement contributes to the final assessment, we conclude that scenario 5, which uses the full set of SRF-frames per video, is the best choice for SRF-CNN's evaluation.

6.2. Experimental Results

6.2.1. Global SRF-CNN

Selecting the 5th scenario, the MSE and RMSE values for each person using the Global SRF-CNN in the Drinking task are provided in Table 2. Additionally, the AMSEs and ARMSEs are computed across all persons for each body part, to facilitate a comparison of the overall performance against that of the state-of-the-art methods. It can be seen that both metric values, per person and overall, are very good (i.e., as close to zero as possible). However, some larger values with a bigger divergence from zero are also observed, meaning that the predictions in these cases were less successful. Since this does not occur across all evaluations of a person or of a body part, it is either a random result or reflects the difficulty of assessing that part for the specific data, due to noise, occlusions, or voluntary movements confused with involuntary ones. Overall, the performance of the pipeline, according to the Average MSEs and RMSEs, is shown to be very good.

6.2.2. Local SRF-CNN

For the evaluation of the Local SRF-CNN in the Drinking task, we follow the same steps as before: we use the total number of SRF-frames for the predictions, according to the 5th evaluation scenario, and we calculate the MSE, RMSE, AMSE, and ARMSE values based on the experimental setup.
The metrics' results for each person using the Local SRF-CNN in the Drinking task are provided in Table 3. According to these values, per person, per body part and overall, this technique performs better than the Global one. Comparing the results of the two approaches, we notice that the majority of MSE and RMSE values in the second approach have improved (they are lower than before), so the overall performance of Local SRF-CNN is better than that of the Global SRF-CNN. We attribute this result to the fact that, in the Local approach, the system focuses on the points of the body that affect a specific movement. Therefore, our proposed methodology is more efficient and accurate when each body part is assessed separately, using only the related skeletal joints.

6.2.3. Grad-CAM on Local SRF-CNN

Enriching the data analysis, we visualize our models' decision-making procedure using the Grad-CAM method. We conducted an extensive study of the results of applying Grad-CAM to a multitude of SRF-frames corresponding to different assessment scores for the Local SRF-CNN approach. Observing the produced images, we conclude that, in cases where intense dyskinesia appears, Grad-CAM strongly focuses on specific points of the representations. On the contrary, in cases where there is no dyskinesia, the method provides a more "flat" representation, without emphasizing specific points of the sinograms. This observation is justified by the fact that, when the dyskinesia problem is intense, the characteristics that lead to a high evaluation are significantly more evident. So, as shown in Figures 5 and 6, Grad-CAM focuses on points of the sinograms in cases with high UDysRS scores for LID (Figure 5), in contrast to cases with zero or very low scores (Figure 6).
We can claim that our method is more information-representative, given that Grad-CAM demonstrates its ability to detect and assess LID by localizing the points that lead to the final prediction/evaluation. It is worth mentioning that, as the movement evolves, the representations contain more information, so Grad-CAM appears to emphasize more and more points of interest. Also, comparing the results of the method for corresponding representations in cases of low evaluations confirms that the method, being unable to emphasize certain points, gives a more generalized visualization.
Figures 5 and 6 present samples of the heatmaps and the images produced by applying Grad-CAM on a multitude of different SRF-frames corresponding to different body parts, persons, videos, points in time and assessments. The samples of Figure 5 derive from cases where the evaluated scores are high, i.e., 1.5 or 2, while the samples of Figure 6 relate to cases with low scores, i.e., 0 or 0.5. Observing the results of Grad-CAM in both figures, we can see how this explainability method locates the points of interest in the different cases. In samples with a high level of dyskinesia, Grad-CAM focuses on specific points, coloring them red, orange or yellow. On the contrary, in samples with no dyskinesia, Grad-CAM provides a much smoother representation, without emphasizing specific points.
Consequently, through Grad-CAM we illustrate that our system based on Local SRF-CNN can explain its decisions by focusing on the areas that contain important features and affect the final prediction. In other words, our proposed encoding scheme with the Radon transform permits the use of Grad-CAM (as an explainability method) for visualizing the points of interest in cases of LID; this has not been achieved in any of the methods to which our approach is compared in this paper. A minimal sketch of the Grad-CAM computation follows.
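For reference, the sketch below applies the standard gradient-based Grad-CAM formulation [16] to a regression output in Keras. The layer name and model handle are assumptions, not our exact implementation.

```python
# Minimal Grad-CAM sketch for a regression output (Keras).
# `last_conv_layer` is an assumed layer name.
import numpy as np
import tensorflow as tf

def grad_cam(model, srf_frame, last_conv_layer="conv2d_1"):
    """Heatmap over an SRF sinogram for the predicted dyskinesia score."""
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer).output, model.output])
    x = tf.convert_to_tensor(srf_frame[np.newaxis, ...])   # add batch dim
    with tf.GradientTape() as tape:
        conv_maps, prediction = grad_model(x)
        score = prediction[:, 0]                 # regressed UDysRS subscore
    grads = tape.gradient(score, conv_maps)      # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2)) # global-average-pooled grads
    cam = tf.reduce_sum(weights[:, tf.newaxis, tf.newaxis, :] * conv_maps,
                        axis=-1)
    cam = tf.nn.relu(cam)[0]                     # keep positive evidence only
    cam = cam / (tf.reduce_max(cam) + 1e-8)      # normalize to [0, 1]
    return cam.numpy()                           # upsample & overlay on the SRF
```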

6.3. Comparison with SOTA

As mentioned above, the data used in this study originate from the work of Li et al. [15], which constitutes the state of the art (SOTA) for the assessment of levodopa-induced dyskinesia (LID) in the Drinking and Communication tasks using vision-based methods. The CPM technique for pose estimation was applied to the videotaped activities of the patients, from which 14 skeletal joints and their trajectories during movement were extracted. As a result, the raw data to handle consist of 2D coordinates of skeletal joints.
Li et al. [15] tackled the Drinking and Communication tasks by extracting 32 kinematic and spectral features per joint trajectory. For the prediction of each subscore of the UDysRS rating scale, they used the features of the joints related to the corresponding anatomical region, i.e., body part. In their experiments, they trained a Random Forest using the LOPOCV protocol, addressing binary classification, regression and multi-class classification, and evaluated their system's performance with the RMSE metric. We particularly focus on how they deal with the regression problem, so as to compare the performance of their method with that of our proposed one.
According to the literature, very few vision-based methods have been used for LID assessment, and especially for estimating rating values based on UDysRS. Most studies deal with the problem of PD assessment and treat it as a classification task. Only the method of Li et al. [15] addresses the problem of LID assessment via the extraction of kinematic and spectral features related to the pose estimation of different body parts, managing to estimate the severity level of this type of dyskinesia. Moreover, each of the related studies experiments on a different dataset that is not publicly available for further analysis; only the dataset under consideration, introduced by Li et al. [15], is available. Therefore, the method of Li et al. [15] is the state-of-the-art baseline for this problem.
In our methodology, we initially introduced the Global SRF-CNN, where we utilized the complete set of joint data to generate SRFs encoding the entire motion. This approach differs from the aforementioned SOTA method, as we do not concentrate on the individual joints of the body part under assessment, but on the whole set of them.
In contrast, our Local SRF-CNN approach follows the same setup as Li et al. [15], since it handles the joint data separately, focusing on the joints of the particular anatomical area under assessment. As mentioned, six individual CNN networks were trained, each dedicated to a specific body part, with individual predictions made for each subscore. Consequently, the obtained performance can be directly compared to the SOTA one.
In our comparison, we used the RMSE metric, which was also employed in the original SOTA method for evaluating performance. To be precise, Li et al. [15] exclusively presented their findings on the overall system performance, indicated by the ARMSEs for each body part; these values need to be minimized for predictions to be considered successful [66]. Observing the ARMSE values of our pipelines and those of the SOTA for the Drinking task in Table 4, it becomes evident that the proposed Local SRF-CNN approach achieves the best results, surpassing the SOTA performance across all body parts and persons.
We extended our experiments to strengthen the accuracy of the results and conclusions drawn from this study. For this reason, we trained the proposed Local SRF-CNN model, which has the best performance in the Drinking task, on the available data of the Communication task. In this way, we make a further comparison between our system and the one proposed by Li et al. on the same task, enriching the experimental results and deriving a more generalized conclusion about our proposed methodology. Table 5 presents the ARMSE values of our Local SRF-CNN pipeline and those of the SOTA for the Communication task. Observing these results, we conclude that the proposed Local SRF-CNN approach again achieves better results.
These results demonstrate the ability of our proposed methodology to successfully address the problem of LID assessment through the prediction of rating values based on UDysRS. Moreover, our methodology is not limited to a specific type of task. The application of Global and Local SRF-CNN has shown that our methodology can be adapted to different types of movements derived from individual body parts. This is based on the fact that our methodology offers the ability to associate each joint with both a large and a small set of other joints. As a result, the features captured by this encoding scheme relate only to the action sequence under observation and are extracted automatically from the representation by applying the Radon transform. The comparison between the two approaches of our methodology demonstrates the ability of SRF-CNN to generalize to other tasks and different kinds of movements. In addition, the SRF representations help the CNN improve its predictions and, as a preprocessing step, reduce the overall computational cost by enabling simpler and more lightweight models.
Finally, as illustrated in Section 6.2.3, the proposed encoding scheme based on the Radon transform permits the use of Grad-CAM, so that the Local SRF-CNN system can explain its decisions by focusing on the areas that contain important features and affect the final prediction; this capability is not offered by any of the methods to which our approach is compared.

6.4. Discussion

The application of the SRF methodology holds distinct advantages in assessing LID, a condition with complex and often debilitating motor symptoms. By analyzing video data through the lens of the Radon transform, this methodology excels in capturing and quantifying the intricate and irregular movement patterns characteristic of dyskinesia. The transform's sensitivity to motion nuances enables precise identification of dyskinesia's abnormal involuntary movements. This reduces subjectivity in assessment, providing clinicians with accurate and quantifiable insights into the severity and temporal dynamics of dyskinesia. The methodology's ability to highlight fine-grained motion abnormalities, coupled with its robustness to variations in viewpoint and lighting, offers a comprehensive and consistent evaluation of dyskinesia across different scenarios. Such an approach not only enhances the precision of diagnosing and monitoring LID but also paves the way for tailored treatment strategies that effectively address patients' unique motor symptoms.
More specifically, this methodology enables the remote monitoring of patients’ symptoms and helps clinicians tailor personalized treatment plans. LID’s symptoms can be subtle and might go unnoticed in the early stages. The methodology’s sensitivity to fine-grained motion changes can facilitate early detection, enabling timely intervention and treatment. Its ability to track changes in motion patterns over time allows for continuous monitoring of disease progression, symptoms’ severity and treatment effectiveness. The proposed system, taking advantage of all the aforementioned characteristics of our methodology, can be integrated into telehealth and remote monitoring solutions, allowing healthcare professionals to assess patients’ motor symptoms remotely, increasing accessibility and convenience. Finally, our system can contribute to research by providing a data-driven approach for studying Parkinson’s disease progression, identifying patterns, and investigating new therapeutic interventions.
The use of this methodology, and especially of SRF, creates a system properly adapted to the problem of LID assessment by exploiting the properties of the Radon transform. In this way, the movement analysis process is accelerated, the necessary information is better encoded, critical features are extracted, and the severity level of dyskinesia is more effectively predicted. Similar studies do not employ such preprocessing procedures; instead, they rely on hand-crafted features, typically mean values of various signals obtained by monitoring motion at each instance in time. Consequently, those studies classify dyskinesia based on feature values that do not characterize each movement in detail and fail to identify subtle discrepancies that could contribute to accurate evaluations.
Consequently, our system offers better monitoring of LID's symptoms and thus enhances the effectiveness of medical decisions. It is fast, reliable and accurate, permitting either remote monitoring of patients or their self-assessment. Also, given that each individual time moment is treated as distinct from the preceding and subsequent instances and contributes equally to the ultimate evaluation, an online monitoring system can be developed. Vision-based assessments thus offer a valuable tool for the objective assessment of LID, providing insights into the nature and severity of motor symptoms. Finally, our system can evolve into an easy-to-use mobile application, given that it does not require sophisticated hardware.

7. Conclusions and Future Research

In this work, we proposed a state-of-the-art methodology that combines an action encoding technique, i.e., Spatio-temporal Radon Footprints, with a lightweight Convolutional Neural Network. Our target has been to facilitate the creation of a simple, accessible and easy-to-use vision-based assistive framework for clinical practitioners. For this purpose, we proposed two approaches, the Global SRF-CNN and the Local SRF-CNN, which we compared both with each other and with the state-of-the-art method. The Local approach, in which the SRF representations are derived from the skeletal data of the joints participating in the body part to be assessed, gives the best results and can be effectively used for the automatic assessment of levodopa-induced dyskinesia in Parkinson's patients, through the Drinking and Communication tasks indicated by UDysRS, using joint coordinate data. Therefore, a robust and accurate system was developed, capable of supplementing clinicians' decisions in various aspects concerning Parkinson's Disease.
In the context of our future work, we will explore the use of the Inverse Radon Transform in order to accurately interpret the points detected by Grad-CAM, by matching them to specific joints and frames of the Mahalanobis matrix. In this way, we will also investigate the inherent relationships and similarities between actions. We will additionally apply other visualization techniques, in order to test the ability of such explainability methods to detect the points of interest in the sinograms of SRF representations.
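As a purely illustrative sketch of this direction, filtered back-projection could be applied to a Grad-CAM heatmap that is aligned with the SRF sinogram, mapping salient sinogram regions back towards Mahalanobis-matrix space; the function name and the assumption of a recent scikit-image version are ours.

```python
# Illustrative sketch: back-project a Grad-CAM heatmap with the inverse
# Radon transform so salient sinogram regions map back to matrix space.
import numpy as np
from skimage.transform import iradon

def backproject_heatmap(cam_heatmap, theta=None):
    """cam_heatmap: 2D Grad-CAM map aligned with the SRF sinogram
    (rows: projection positions, columns: projection angles)."""
    if theta is None:
        theta = np.linspace(0.0, 180.0, cam_heatmap.shape[1], endpoint=False)
    # High values in the reconstruction indicate matrix regions (joint
    # pairs over time) whose projections the network attended to.
    return iradon(cam_heatmap, theta=theta, filter_name="ramp")
```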
Our future work also focuses on the application of a more complex and advanced video-based method to further improve the assessment results of Parkinson's dyskinesia. RACNet [67] is a high-performance CNN-RNN network that receives spatio-temporal data as input. In this formulation, the CNN part of the network extracts latent variables from the data at each time point, whilst the RNN part takes into account the sequential order of the data, detects its spatial dependence and extracts those features that play a more important role in decision making.
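A hedged sketch of this CNN-RNN idea, in the spirit of RACNet rather than its actual implementation, could pair a TimeDistributed CNN encoder with a GRU; all shapes and sizes below are assumptions.

```python
# Sketch of a CNN-RNN over sequences of SRF-frames (assumed architecture).
import tensorflow as tf

def build_cnn_rnn(seq_len=None, frame_shape=(128, 128, 1), n_outputs=6):
    frame_encoder = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),   # latent vector per frame
    ])
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(seq_len, *frame_shape)),
        tf.keras.layers.TimeDistributed(frame_encoder),  # CNN per time point
        tf.keras.layers.GRU(64),                         # sequential order
        tf.keras.layers.Dense(n_outputs, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse",
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])
    return model
```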
Another interesting aspect of our future research would be the use of an attention-based network instead of a simple CNN. The use of Vision Transformers (ViT) [68] paired with SRF holds the promise of a more accurate, efficient, and interpretable approach to LID recognition and assessment. By combining motion-focused features from the Radon transform with spatial information from ViT, the system may better track dyskinesia's progression. The model's attention mechanisms provide insights into the regions of interest within sinograms that contribute most to dyskinesia recognition. This transparency can yield valuable information about the specific motion patterns associated with dyskinesia and can aid clinicians in understanding and trusting the model's decisions.

Author Contributions

Conceptualization, P.A.T., G.T. and S.K.; Methodology, P.A.T., G.T. and S.K.; Software, P.A.T. and G.T.; Validation, P.A.T.; Investigation, P.A.T.; Writing—original draft, P.A.T.; Writing—review & editing, S.K.; Visualization, P.A.T.; Supervision, S.K.; Project administration, S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The dataset analyzed during the current study is available in the Parkinson’s Pose Estimation Dataset repository, https://github.com/limi44/Parkinson-s-Pose-Estimation-Dataset (accessed on 17 January 2023). No datasets were generated during the current study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Parkinson, J. An essay on the shaking palsy. 1817. J. Neuropsychiatry Clin. Neurosci. 2002, 14, 223–236; discussion 222. [Google Scholar] [CrossRef]
  2. Nussbaum, R.L.; Ellis, C.E. Alzheimer’s disease and Parkinson’s disease. N. Engl. J. Med. 2003, 348, 1356–1364. [Google Scholar] [CrossRef] [PubMed]
  3. Statistics on Parkinson’s—Parkinson’s Disease Foundation. Available online: https://www.parkinson.org/understanding-parkinsons/statistics (accessed on 17 January 2023).
  4. Jankovic, J. Parkinson’s disease: Clinical features and diagnosis. J. Neurol. Neurosurg. Psychiatry 2008, 79, 368–376. [Google Scholar] [CrossRef] [PubMed]
  5. Parkinson’s Disease: Causes, Symptoms, and Treatments. Available online: https://www.nia.nih.gov/health/parkinsons-disease (accessed on 17 January 2023).
  6. Cotzias, G.C.; Papavasiliou, P.S.; Gellene, R. Modification of Parkinsonism—Chronic treatment with L-dopa. N. Engl. J. Med. 1969, 280, 337–345. [Google Scholar] [CrossRef] [PubMed]
  7. Ahlskog, J.E.; Muenter, M.D. Frequency of levodopa-related dyskinesias and motor fluctuations as estimated from the cumulative literature. Mov. Disord. 2001, 16, 448–458. [Google Scholar] [CrossRef] [PubMed]
  8. Thanvi, B.; Lo, N.; Robinson, T. Levodopa-induced dyskinesia in Parkinson’s disease: Clinical features, pathogenesis, prevention and treatment. Postgrad. Med. J. 2007, 83, 384–388. [Google Scholar] [CrossRef] [PubMed]
  9. Kassubek, J. Diagnostic procedures during the course of Parkinson’s Disease. Basal Ganglia 2014, 4, 15–18. [Google Scholar] [CrossRef]
  10. Movement Disorder Society Task Force on Rating Scales for Parkinson’s Disease. The unified Parkinson’s disease rating scale (UPDRS): Status and recommendations. Mov. Disord. 2003, 18, 738–750. [Google Scholar] [CrossRef]
  11. Goetz, C.G.; Tilley, B.C.; Shaftman, S.R.; Stebbins, G.T.; Fahn, S.; Martinez-Martin, P.; Poewe, W.; Sampaio, C.; Stern, M.B.; Dodel, R.; et al. Movement Disorder Society-sponsored revision of the Unified Parkinson’s Disease Rating Scale (MDS-UPDRS): Scale presentation and clinimetric testing results. Mov. Disord. 2008, 23, 2129–2170. [Google Scholar] [CrossRef]
  12. Goetz, C.G.; Nutt, J.G.; Stebbins, G.T. The Unified Dyskinesia Rating Scale: Presentation and clinimetric profile. Mov. Disord. 2008, 23, 2398–2403. [Google Scholar] [CrossRef]
  13. Tsatiris, G.; Karpouzis, K.; Kollias, S. A Compact Sequence Encoding Scheme for Online Human Activity Recognition in HRI Applications. In Proceedings of the 21st EANN (Engineering Applications of Neural Networks) 2020 Conference, Halkidiki, Greece, 5–7 June 2020; Iliadis, L., Angelov, P.P., Jayne, C., Pimenidis, E., Eds.; Springer: Cham, Switzerland, 2020; pp. 3–14. [Google Scholar]
  14. Theofilou, P.A.; Tsatiris, G.; Kollias, S. Automatic assessment of Parkinson’s patients’ dyskinesia using non-invasive machine learning methods. In Proceedings of the 2022 International Conference on Interactive Media, Smart Systems and Emerging Technologies (IMET), Limassol, Cyprus, 4–7 October 2022; pp. 1–4. [Google Scholar] [CrossRef]
  15. Li, M.H.; Mestre, T.A.; Fox, S.H.; Taati, B. Vision-based assessment of parkinsonism and levodopa-induced dyskinesia with pose estimation. J. Neuroeng. Rehabil. 2018, 15, 97. [Google Scholar] [CrossRef] [PubMed]
  16. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2019, 128, 336–359. [Google Scholar] [CrossRef]
  17. Dong, Z.; Gu, H.; Wan, Y.; Zhuang, W.; Rojas-Cessa, R.; Rabin, E. Wireless body area sensor network for posture and gait monitoring of individuals with Parkinson’s disease. In Proceedings of the 2015 IEEE 12th International Conference on Networking, Sensing and Control, Taipei, Taiwan, 9–11 April 2015. [Google Scholar]
  18. Eskofier, B.; Lee, S.; Baron, M.; Simon, A.; Martindale, C.; Gaßner, H.; Klucken, J. An Overview of Smart Shoes in the Internet of Health Things: Gait and Mobility Assessment in Health Promotion and Disease Monitoring. Appl. Sci. 2017, 7, 986. [Google Scholar] [CrossRef]
  19. Hannink, J.; Kautz, T.; Pasluosta, C.F.; Gasmann, K.G.; Klucken, J.; Eskofier, B.M. Sensor-Based Gait Parameter Extraction with Deep Convolutional Neural Networks. IEEE J. Biomed. Health Inf. 2016, 21, 85–93. [Google Scholar] [CrossRef] [PubMed]
  20. Ramsperger, R.; Meckler, S.; Heger, T.; van Uem, J.; Hucker, S.; Braatz, U.; Graessner, H.; Berg, D.; Manoli, Y.; Serrano, J.A.; et al. Continuous leg dyskinesia assessment in Parkinson’s disease -clinical validity and ecological effect. Park. Relat. Disord. 2016, 26, 41–46. [Google Scholar] [CrossRef] [PubMed]
  21. Delrobaei, M.; Baktash, N.; Gilmore, G.; McIsaac, K.; Jog, M. Using Wearable Technology to Generate Objective Parkinson’s Disease Dyskinesia Severity Score: Possibilities for Home Monitoring. IEEE Trans. Neural Syst. Rehabil. Eng. 2017, 25, 1853–1863. [Google Scholar] [CrossRef] [PubMed]
  22. Eskofier, B.M.; Lee, S.I.; Daneault, J.F.; Golabchi, F.N.; Ferreira-Carvalho, G.; Vergara-Diaz, G.; Sapienza, S.; Costante, G.; Klucken, J.; Kautz, T.; et al. Recent machine learning advancements in sensor-based mobility analysis: Deep learning for Parkinson’s disease assessment. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. 2016, 2016, 655–658. [Google Scholar]
  23. Kostikis, N.; Hristu-Varsakelis, D.; Arnaoutoglou, M.; Kotsavasiloglou, C. A Smartphone-Based Tool for Assessing Parkinsonian Hand Tremor. IEEE J. Biomed. Health Inf. 2015, 19, 1835–1842. [Google Scholar] [CrossRef]
  24. Channa, A.; Ifrim, R.C.; Popescu, D.; Popescu, N. A-WEAR Bracelet for Detection of Hand Tremor and Bradykinesia in Parkinson’s Patients. Sensors 2021, 21, 981. [Google Scholar] [CrossRef]
  25. Celik, E.; Omurca, S.I. Improving Parkinson’s Disease Diagnosis with Machine Learning Methods. In Proceedings of the 2019 Scientific Meeting on Electrical-Electronics and Biomedical Engineering and Computer Science (EBBT), Istanbul, Turkey, 24–26 April 2019; pp. 1–4. [Google Scholar] [CrossRef]
  26. Anju, P.; Varghese, A.; Roy, A.; Suresh, S.; Joy, E.; Sunder, R. Recent Survey on Parkinson Disease Diagnose using Deep Learning Mechanism. In Proceedings of the 2020 2nd International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), Bangalore, India, 5–7 March 2020; pp. 340–343. [Google Scholar] [CrossRef]
  27. Miles, R. Start Here! Learn the Kinect Api; Pearson Education: Upper Saddle River, NJ, USA, 2012. [Google Scholar]
  28. Wei, S.E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional Pose Machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  29. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186. [Google Scholar] [CrossRef]
  30. Procházka, A.; Vyšata, O.; Valis, M.; Ťupa, O.; Schätz, M.; Mařík, V. Use of the image and depth sensors of the Microsoft Kinect for the detection of gait disorders. Neural Comput. Appl. 2015, 26, 1621–1629. [Google Scholar] [CrossRef]
  31. Arango Paredes, J.D.; Muñoz, B.; Agredo, W.; Ariza-Araújo, Y.; Orozco, J.L.; Navarro, A. A reliability assessment software using Kinect to complement the clinical evaluation of Parkinson’s disease. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. 2015, 2015, 6860–6863. [Google Scholar]
  32. Dror, B.; Yanai, E.; Frid, A.; Peleg, N.; Goldenthal, N.; Schlesinger, I.; Hel-Or, H.; Raz, S. Automatic assessment of Parkinson’s Disease from natural hands movements using 3D depth sensor. In Proceedings of the 2014 IEEE 28th Convention of Electrical & Electronics Engineers in Israel (IEEEI), Eilat, Israel, 3–5 December 2014. [Google Scholar]
  33. Rocha, A.P.; Choupina, H.; Fernandes, J.M.; Rosas, M.J.; Vaz, R.; Silva Cunha, J.P. Kinect v2 based system for Parkinson’s disease assessment. In Proceedings of the 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Milan, Italy, 25–29 August 2015; pp. 1279–1282. [Google Scholar] [CrossRef]
  34. Dyshel, M.; Arkadir, D.; Bergman, H.; Weinshall, D. Quantifying Levodopa-Induced Dyskinesia Using Depth Camera. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  35. Buongiorno, D.; Bortone, I.; Cascarano, G.D.; Trotta, G.F.; Brunetti, A.; Bevilacqua, V. A low-cost vision system based on the analysis of motor features for recognition and severity rating of Parkinson’s Disease. BMC Med. Inform. Decis. Mak. 2019, 19, 243. [Google Scholar] [CrossRef]
  36. Li, M.H.; Mestre, T.A.; Fox, S.H.; Taati, B. Automated assessment of levodopa-induced dyskinesia: Evaluating the responsiveness of video-based features. Park. Relat. Disord. 2018, 53, 42–45. [Google Scholar] [CrossRef] [PubMed]
  37. Sato, K.; Nagashima, Y.; Mano, T.; Iwata, A.; Toda, T. Quantifying normal and parkinsonian gait features from home movies: Practical application of a deep learning–based 2D pose estimator. PLoS ONE 2019, 14, e0223549. [Google Scholar] [CrossRef] [PubMed]
  38. Park, K.W.; Lee, E.J.; Lee, J.S.; Jeong, J.; Choi, N.; Jo, S.; Jung, M.; Do, J.Y.; Kang, D.W.; Lee, J.G.; et al. Machine Learning–Based Automatic Rating for Cardinal Symptoms of Parkinson Disease. Neurology 2021, 96, e1761–e1769. [Google Scholar] [CrossRef] [PubMed]
  39. Morinan, G.; Peng, Y.; Rupprechter, S.; Weil, R.S.; Leyland, L.A.; Foltynie, T.; Sibley, K.; Baig, F.; Morgante, F.; Gilron, R.; et al. Computer-vision based method for quantifying rising from chair in Parkinson’s disease patients. Intell.-Based Med. 2022, 6, 100046. [Google Scholar] [CrossRef]
  40. Liu, Y.; Chen, J.; Hu, C.; Ma, Y.; Ge, D.; Miao, S.; Xue, Y.; Li, L. Vision-Based Method for Automatic Quantification of Parkinsonian Bradykinesia. IEEE Trans. Neural Syst. Rehabil. Eng. 2019, 27, 1952–1961. [Google Scholar] [CrossRef] [PubMed]
  41. Lu, M.; Poston, K.; Pfefferbaum, A.; Sullivan, E.V.; Fei-Fei, L.; Pohl, K.M.; Niebles, J.C.; Adeli, E. Vision-Based Estimation of MDS-UPDRS Gait Scores for Assessing Parkinson’s Disease Motor Severity. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2020, Lima, Peru, 4–8 October 2020; Martel, A.L., Abolmaesumi, P., Stoyanov, D., Mateus, D., Zuluaga, M.A., Zhou, S.K., Racoceanu, D., Joskowicz, L., Eds.; Springer: Cham, Switzerland, 2020; pp. 637–647. [Google Scholar]
  42. Dadashzadeh, A.; Whone, A.L.; Rolinski, M.; Mirmehdi, M. Exploring Motion Boundaries in an End-to-End Network for Vision-based Parkinson’s Severity Assessment. arXiv 2020, arXiv:2012.09890. [Google Scholar]
  43. Mehta, D.; Asif, U.; Hao, T.; Bilal, E.; von Cavallar, S.; Harrer, S.; Rogers, J. Towards Automated and Marker-less Parkinson Disease Assessment: Predicting UPDRS Scores using Sit-stand videos. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Los Alamitos, CA, USA, 19–25 June 2021; pp. 3836–3844. [Google Scholar] [CrossRef]
  44. Rupprechter, S.; Morinan, G.; Peng, Y.; Foltynie, T.; Sibley, K.; Weil, R.S.; Leyland, L.A.; Baig, F.; Morgante, F.; Gilron, R.; et al. A Clinically Interpretable Computer-Vision Based Method for Quantifying Gait in Parkinson’s Disease. Sensors 2021, 21, 5437. [Google Scholar] [CrossRef]
  45. Jin, B.; Qu, Y.; Zhang, L.; Gao, Z. Diagnosing Parkinson Disease Through Facial Expression Recognition: Video Analysis. J. Med. Internet Res. 2020, 22, e18697. [Google Scholar] [CrossRef] [PubMed]
  46. Motor Assessment Made Simple and Scalable. Available online: https://machinemedicine.com/ (accessed on 17 January 2023).
  47. Langevin, R.; Ali, M.R.; Sen, T.; Snyder, C.; Myers, T.; Dorsey, E.R.; Hoque, M.E. The PARK Framework for Automated Analysis of Parkinson’s Disease Characteristics. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2019, 3, 1–22. [Google Scholar] [CrossRef]
  48. Kollias, D.; Tagaris, A.; Stafylopatis, A.; Kollias, S.; Tagaris, G. Deep neural architectures for prediction in healthcare. Complex Intell. Syst. 2018, 4, 119–131. [Google Scholar] [CrossRef]
  49. Wingate, J.; Kollia, I.; Bidaut, L.; Kollias, S. Unified deep learning approach for prediction of Parkinson’s disease. IET Image Process. 2020, 14, 1980–1989. [Google Scholar] [CrossRef]
  50. Kollias, D.; Bouas, N.; Vlaxos, Y.; Brillakis, V.; Seferis, M.; Kollia, I.; Sukissian, L.; Wingate, J.; Kollias, S. Deep Transparent Prediction through Latent Representation Analysis. arXiv 2020, arXiv:2009.07044. [Google Scholar]
  51. Zis, P.; Chaudhuri, K.R.; Samuel, M. Phenomenology of Levodopa-Induced Dyskinesia. In Levodopa-Induced Dyskinesia in Parkinson’s Disease; Fox, S.H., Brotchie, J.M., Eds.; Springer: London, UK, 2014; pp. 1–16. [Google Scholar] [CrossRef]
  52. Tsatiris, G.; Karpouzis, K.; Kollias, S. Variance-based shape descriptors for determining the level of expertise of tennis players. In Proceedings of the 2017 9th International Conference on Virtual Worlds and Games for Serious Applications (VS-Games), Athens, Greece, 6–8 September 2017; pp. 169–172. [Google Scholar] [CrossRef]
  53. Beylkin, G. Discrete radon transform. IEEE Trans. Acoust. Speech Signal Process. 1987, 35, 162–172. [Google Scholar] [CrossRef]
  54. Chen, Y.; Wu, Q.; He, X. Human Action Recognition Based on Radon Transform. In Multimedia Analysis, Processing and Communications; Lin, W., Tao, D., Kacprzyk, J., Li, Z., Izquierdo, E., Wang, H., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 369–389. [Google Scholar] [CrossRef]
  55. Deans, S.R. The Radon Transform and Some of Its Applications; Dover Books on Mathematics; Dover Publications: Mineola, NY, USA, 2007. [Google Scholar]
  56. Goudelis, G.; Karpouzis, K.; Kollias, S. Robust human action recognition using history trace templates. In Proceedings of the International Workshop on Image Analysis for Multimedia Interactive Services, Dublin, Ireland, 23–25 May 2012. [Google Scholar]
  57. Goudelis, G.; Karpouzis, K.; Kollias, S. Exploring trace transform for robust human action recognition. Pattern Recognit. 2013, 46, 3238–3248. [Google Scholar] [CrossRef]
  58. Goudelis, G.; Tsatiris, G.; Karpouzis, K.; Kollias, S. 3D cylindrical trace transform based feature extraction for effective human action classification. In Proceedings of the 2017 IEEE Conference on Computational Intelligence and Games (CIG), New York, NY, USA, 22–25 August 2017; pp. 96–103. [Google Scholar] [CrossRef]
  59. Dhiman, C.; Vishwakarma, D.K. A Robust Framework for Abnormal Human Action Recognition Using R -Transform and Zernike Moments in Depth Videos. IEEE Sens. J. 2019, 19, 5195–5203. [Google Scholar] [CrossRef]
  60. Jalal, A.; Kamal, S. Real-time life logging via a depth silhouette-based human activity recognition system for smart home services. In Proceedings of the 2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Seoul, Republic of Korea, 26–29 August 2014; pp. 74–80. [Google Scholar] [CrossRef]
  61. Khan, Z.A.; Sohn, W. Abnormal human activity recognition system based on R-transform and kernel discriminant technique for elderly home care. IEEE Trans. Consum. Electron. 2011, 57, 1843–1850. [Google Scholar] [CrossRef]
  62. Liu, J.; Wang, G.; Duan, L.Y.; Abdiyeva, K.; Kot, A.C. Skeleton-Based Human Action Recognition with Global Context-Aware Attention LSTM Networks. IEEE Trans. Image Process. 2018, 27, 1586–1599. [Google Scholar] [CrossRef]
  63. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53. [Google Scholar] [CrossRef]
  64. Hamdi, D.E.; Elouedi, I.; Fathallah, A.; Nguyen, M.K.; Hamouda, A. Combining Fingerprints and their Radon Transform as Input to Deep Learning for a Fingerprint Classification Task. In Proceedings of the 2018 15th International Conference on Control, Automation, Robotics and Vision (ICARCV), Singapore, 18–21 November 2018; pp. 1448–1453. [Google Scholar] [CrossRef]
  65. Allwright, S. What Is a Good MSE Value? Available online: https://stephenallwright.com/good-mse-value/ (accessed on 17 January 2023).
  66. Allwright, S. What Is a Good RMSE Value? Available online: https://stephenallwright.com/good-rmse-value/ (accessed on 17 January 2023).
  67. Arsenos, A.; Kollias, D.; Kollias, S. A large imaging database and novel deep neural architecture for covid-19 diagnosis. In Proceedings of the 2022 IEEE 14th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), Nafplio, Greece, 26–29 June 2022; pp. 1–5. [Google Scholar]
  68. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Figure 1. Calculating SRFt: the Mahalanobis matrix is formulated from the corresponding Mahalanobis distances for each of the depicted joints as the action sequence progresses and the video frames pass, until a specific time point t; the Radon transform is applied on the matrix, producing a sinogram which constitutes a compact representation of the action’s information up to the corresponding moment in time [13]. This procedure is extended across all the timesteps of an action sequence, so different SRFs are produced as the action progresses.
Figure 2. SRF-CNN Pipeline: SRF representations (frames) are extracted from the joint data of an action sequence (video) and are then shuffled before being fed to the CNN for UDysRS score prediction. For Global SRF-CNN, the entire set of joint data is encoded to predict six rating values simultaneously. For Local SRF-CNN, a subset of joint data related to the specific body part under assessment is utilized to predict its rating value separately. Although there are differences in input and output, the pipeline remains consistent across these two approaches.
Figure 3. Global SRF-CNN model architecture.
Figure 4. Local SRF-CNN model architecture.
Figure 5. Grad-CAM on SRF-frames with score 1.5 or 2 for the Local approach: (a1–a3) original SRF-frame, (b1–b3) heatmap of Grad-CAM derived from the weights of the last convolutional layer, (c1–c3) produced image of Grad-CAM applied on the SRF.
Figure 6. Grad-CAM on SRF-frames with score 0 or 0.5 for the Local approach: (a1–a3) original SRF-frame, (b1–b3) heatmap of Grad-CAM derived from the weights of the last convolutional layer, (c1–c3) produced image of Grad-CAM applied on the SRF.
Table 1. Average Mean Squared Errors (AMSEs) across all persons, per experimental scenario.

Scenarios   Neck      R-Arm     L-Arm     Trunk     R-Leg     L-Leg
1st         0.60782   0.50550   0.38684   0.60823   0.45879   0.57441
2nd         0.62483   0.50471   0.35670   0.56485   0.41357   0.48073
3rd         0.63968   0.51759   0.35065   0.56505   0.40552   0.47496
4th         0.64335   0.52896   0.33882   0.54951   0.39327   0.45017
5th         0.62511   0.53150   0.32938   0.53431   0.38151   0.43012
Table 2. Mean Squared Errors and Root Mean Squared Errors for Global SRF-CNN in Drinking task.

Iterations     Neck          R-Arm         L-Arm         Trunk         R-Leg         L-Leg
of LOPO-CV     MSE    RMSE   MSE    RMSE   MSE    RMSE   MSE    RMSE   MSE    RMSE   MSE    RMSE
Person 1       1.493  1.222  2.010  1.418  0.281  0.530  0.507  0.712  0.358  0.598  0.197  0.444
Person 2       0.547  0.740  0.280  0.529  0.216  0.465  0.803  0.896  0.598  0.773  0.470  0.686
Person 4       0.443  0.666  0.191  0.437  0.206  0.454  0.300  0.548  0.470  0.686  0.231  0.481
Person 5       0.287  0.536  0.275  0.524  0.189  0.435  0.336  0.580  0.174  0.417  0.201  0.448
Person 6       0.556  0.746  0.310  0.557  0.494  0.703  0.334  0.578  0.465  0.682  0.353  0.594
Person 7       0.788  0.888  0.584  0.764  0.285  0.534  0.397  0.630  0.333  0.577  0.429  0.655
Person 8       0.211  0.459  0.286  0.535  0.196  0.443  0.451  0.672  0.221  0.470  0.236  0.486
Person 10      0.598  0.773  0.302  0.550  0.217  0.466  0.426  0.653  0.281  0.530  0.258  0.508
Person 11      0.704  0.839  0.547  0.740  0.882  0.939  1.255  1.120  0.533  0.730  1.496  1.223
Average        0.625  0.763  0.532  0.673  0.329  0.552  0.534  0.710  0.382  0.607  0.430  0.614
Table 3. Mean Squared Errors and Root Mean Squared Errors for Local SRF-CNN in Drinking task.

Iterations     Neck          R-Arm         L-Arm         Trunk         R-Leg         L-Leg
of LOPO-CV     MSE    RMSE   MSE    RMSE   MSE    RMSE   MSE    RMSE   MSE    RMSE   MSE    RMSE
Person 1       2.280  1.510  2.210  1.487  0.415  0.644  0.587  0.766  0.299  0.547  0.129  0.359
Person 2       0.491  0.701  0.239  0.489  0.192  0.438  0.600  0.775  0.564  0.751  0.524  0.724
Person 4       0.504  0.710  0.232  0.482  0.289  0.538  0.321  0.567  0.550  0.742  0.311  0.558
Person 5       0.175  0.418  0.204  0.452  0.242  0.492  0.338  0.581  0.205  0.453  0.208  0.456
Person 6       0.232  0.482  0.156  0.395  0.535  0.731  0.285  0.534  0.612  0.782  0.509  0.713
Person 7       0.413  0.643  0.172  0.415  0.107  0.327  0.440  0.663  0.137  0.370  0.113  0.336
Person 8       0.246  0.496  0.228  0.477  0.117  0.342  0.422  0.650  0.149  0.386  0.187  0.432
Person 10      0.271  0.521  0.219  0.468  0.079  0.281  0.569  0.754  0.163  0.404  0.255  0.505
Person 11      0.834  0.913  0.553  0.744  0.997  0.998  1.023  1.011  0.605  0.778  1.670  1.292
Average        0.605  0.710  0.468  0.601  0.330  0.532  0.549  0.700  0.390  0.579  0.434  0.597
Table 4. Comparison of ARMSE values between SOTA, Global and Local SRF-CNN in Drinking task.

Pipelines           Neck    R-Arm   L-Arm   Trunk   R-Leg   L-Leg   Mean
State-of-the-art    0.724   0.737   0.575   0.701   0.586   0.622   0.6575
Global SRF-CNN      0.763   0.673   0.552   0.710   0.607   0.614   0.6532
Local SRF-CNN       0.710   0.601   0.532   0.700   0.579   0.597   0.6198
Table 5. Comparison of ARMSE values between SOTA and Local SRF-CNN in Communication task.

Pipelines           Neck    R-Arm   L-Arm   Trunk   R-Leg   L-Leg   Mean
State-of-the-art    0.559   0.399   0.465   0.513   0.579   0.590   0.5175
Local SRF-CNN       0.521   0.331   0.432   0.509   0.565   0.573   0.4885