1. Introduction
Parkinson’s disease (PD) is the second most common neurodegenerative disorder worldwide, and its incidence has been steadily increasing in recent years [1]. PD causes motor symptoms such as tremor, rigidity, and bradykinesia, as well as non-motor symptoms such as cognitive impairment, significantly affecting patients’ quality of life [2]. Early diagnosis and timely intervention are therefore critically important for delaying disease progression [3,4,5]. However, owing to insufficient public awareness of the condition, many patients fail to seek prompt medical attention at the onset of symptoms, thereby missing the optimal window for treatment. The development of a PD detection method and a portable device capable of rapid screening and auxiliary diagnosis therefore holds significant practical value.
Common detection methods for PD include clinical assessment [6] and neuroimaging techniques [7]. However, these methods are relatively cumbersome and not conducive to the development of portable devices. Since PD is often accompanied by motor impairment, movement-related information also serves as a critical biomarker for PD diagnosis [8]. Gait, the most common expression of human movement, manifests naturally without professional guidance and effectively distinguishes PD patients from healthy individuals, making it an excellent biomarker [9]. In recent years, the rapid development of artificial intelligence has further expanded the application of data-driven gait analysis to PD detection, bringing new advances to this research field [10].
Gait can be digitized using both sensor-based and video-based methods. For instance, Biase et al. employed sensors to convert gait into kinetic data for gait analysis [11]. Shi et al. proposed an adaptive step-detection algorithm based on sensor data [12], while Lancini et al. developed a biomechanics-based sensor gait recognition system that enables relatively effective gait detection [13]. Zhou et al. analyzed gait features using a single inertial sensor [14], and Shi et al. further introduced a deep learning-based inertial-sensor model for gait detection [15]. Additionally, Alazeb et al. investigated multi-sensor fusion for PD detection [16]. These studies have provided valuable insights into gait mechanics and abnormal gait recognition. However, compared with the high cost and cumbersome operation of sensors, video data have recently gained greater favor among researchers owing to their convenience and ease of acquisition. For example, Guo et al. extracted skeletal joint data from gait videos and employed a two-stream spatial–temporal attention graph convolutional network (2s-ST-AGCN) for PD detection [17], and Zeng et al. combined skeletal joint data with silhouette data in a skeleton–silhouette fusion of a spatial–temporal graph convolutional network (ST-GCN) and a Visual Geometry Group (VGG) network to assess PD gait impairments [18].
However, most video-based data collection methods face three common issues. First, many datasets are limited to a single frontal or lateral view, lacking multi-view data and the research such data would enable. Second, the data collection process is relatively complex, relying on guidance from professionals and requiring substantial manual intervention during data processing, which poses significant obstacles to developing portable devices for non-professionals. Third, the methods used in the aforementioned studies have not adequately considered the temporal characteristics of gait data, leaving room for further improvement.
To address these issues, we propose the following solution. First, our data collection encompasses both frontal and lateral views, providing a richer dataset for subsequent fusion. On this basis, we introduce the Cross-Attention Fusion with Mamba-2 and Self-Attention Network (CMSA-Net). CMSA-Net leverages the efficient time-series processing capabilities of Mamba-2 and self-attention to extract features from the dual-view data, followed by cross-attention to effectively fuse the frontal and lateral streams. Additionally, we apply a Maximum Mean Discrepancy (MMD) loss to encourage similarity between the two feature distributions, thereby enhancing the fusion. To automatically extract single-step gait information from skeletal sequences, we propose a single-step segmentation method based on Savitzky–Golay (SG) filtering and a sliding-window peak selection function. Furthermore, we analyze the impact of frontal- and lateral-view data on PD detection, as well as the influence of prior information such as age and sex on the detection outcomes. Building on these methods, we develop a portable detection device characterized by user-friendly operation and efficient data acquisition and processing. This device offers a practical solution for the early screening and auxiliary diagnosis of PD, with significant potential for clinical translation.
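As a rough illustration of the dual-view fusion idea, the sketch below pairs one self-attention encoder per view with bidirectional cross-attention. The class name `CrossViewFusion`, the layer sizes, and the mean-pooling are our illustrative choices, not the authors' implementation, and the Mamba-2 branch is replaced here by plain self-attention, since Mamba-2 requires an external package.

```python
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        # one self-attention encoder per view (stand-in for the alternating
        # self-attention / Mamba-2 layers described in the text)
        self.front_enc = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.side_enc = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # cross-attention in both directions fuses the two streams
        self.cross_f2s = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_s2f = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(2 * d_model, 2)  # HC / PD logits

    def forward(self, front, side):
        # front, side: (batch, frames, d_model) skeleton-sequence embeddings
        f, _ = self.front_enc(front, front, front)
        s, _ = self.side_enc(side, side, side)
        f2s, _ = self.cross_f2s(f, s, s)  # frontal queries attend to lateral keys
        s2f, _ = self.cross_s2f(s, f, f)
        fused = torch.cat([f2s.mean(dim=1), s2f.mean(dim=1)], dim=-1)
        return self.head(fused)

logits = CrossViewFusion()(torch.randn(8, 50, 64), torch.randn(8, 50, 64))
print(logits.shape)  # torch.Size([8, 2])
```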
The main contributions of this study can be summarized as follows:
To the best of our knowledge, we are the first to construct a full-body synchronized dual-view PD gait video dataset; it comprises 304 HC and 84 PD participants, a relatively large sample size.
Exploiting the dual-view camera data and the temporal continuity of skeletal gait sequences, we propose a single-step segmentation method. We further introduce CMSA-Net, which effectively processes temporal dual-view data and performs efficient information fusion, and we employ the MMD loss to enhance this fusion. In comparative analyses against multiple methods, our method achieved the best results.
Based on the aforementioned methods, we construct a portable device that not only achieves favorable detection performance but also exhibits excellent usability, thereby possessing strong practical significance.
The remainder of this paper is organized as follows: Section 2 reviews related work on PD detection using video-based gait data. Section 3 introduces the proposed method. Section 4 presents the experiments and results, providing both qualitative and quantitative analyses. Section 5 discusses the influence of camera view and prior information on the results, the construction and usage of the device, and a comparative analysis with other studies. Finally, Section 6 concludes this study.
2. Related Work
In research on PD detection using gait videos, studies can be categorized by data acquisition method and classification scheme. First, video views fall into three types: frontal [17,19], lateral [18,20], and combined frontal and lateral [21]. Second, classification schemes can be divided into binary [19,21], three-class [18], and four-class [17,20] setups. In the binary setup, the objective is to distinguish PD from healthy control (HC) cases; in the four-class setup, classification follows the Hoehn–Yahr stage [22] (with stages 3 and 4 merged into a single class); and the three-class setup relies on ratings provided by professional clinicians [18].
Most studies collect data from a single view, which is a limitation: no study has established that any one view is optimal, and single-view data preclude comparative experiments. Both Kaur et al.'s study [21] and our work adopt a combined frontal and lateral view. In Kaur et al.'s research, the videos recorded only the feet and required subjects to walk on a treadmill, whereas our study captures full-body gait videos with minimal equipment dependency, thereby enhancing practical applicability [21].
After processing, gait video data are typically represented as skeletal joint data. Consequently, some studies have adopted graph convolutional network (GCN)-based methods [23]. For instance, Zeng et al. employed ST-GCN [24], and Guo et al. used 2s-ST-AGCN [17]. Convolutional neural networks (CNNs) [19] have also demonstrated their utility on image data; for example, Ciresan et al. reported that a CNN achieved superior performance on a benchmark dataset [25], while Zeng et al. leveraged a VGG [26] architecture (an improved CNN) to process contour data. Furthermore, Transformer [27] architectures, known for their excellent sequence processing capabilities, have been applied to PD detection; for example, Endo et al. employed a Transformer-based method, GaitForeMer, for PD detection [20]. In recent years, a number of other effective temporal data processing methods have emerged, such as iTransformer [28], the Joint Time–Frequency Domain Transformer (JTFT) [29], and Mamba-2 [30], all of which have achieved state-of-the-art (SOTA) results in their respective fields. Based on the current research landscape and the characteristics of our data, we propose the Cross-Attention Fusion with Mamba-2 and Self-Attention Network (CMSA-Net), which demonstrates superior performance compared with other methods in our experiments.
4. Experiments and Results
4.1. Dataset and Evaluation Metrics
After single-step segmentation, the HC single-step dataset comprised 26,999 samples and the PD single-step dataset 11,472 samples, totaling 38,471 samples with a ratio of approximately 0.70:0.30. Each data point was a matrix of size (1, 50, 39), where 50 is the number of frames after linear interpolation and 39 corresponds to 13 skeletal joints, each described by its x-coordinate, y-coordinate, and joint confidence score. The dataset was standardized to zero mean and unit standard deviation.
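The single-step segmentation (SG filtering plus sliding-window peak selection) and the linear resampling to 50 frames can be sketched roughly as follows. The SG window, peak-window length, and the toy ankle trajectory are illustrative assumptions, not the optimized parameters used in the paper.

```python
import numpy as np
from scipy.signal import savgol_filter

def segment_steps(ankle_y, sg_window=11, sg_order=3, peak_window=15):
    """Smooth the ankle trajectory with an SG filter, then select peaks with a
    sliding window: a frame is kept if it is the maximum of its window and
    sufficiently far from the previous peak."""
    smooth = savgol_filter(ankle_y, sg_window, sg_order)
    half = peak_window // 2
    peaks = []
    for i in range(half, len(smooth) - half):
        if smooth[i] == smooth[i - half:i + half + 1].max() and \
                (not peaks or i - peaks[-1] > half):
            peaks.append(i)
    # each pair of consecutive peaks bounds one step
    return list(zip(peaks, peaks[1:]))

def resample_step(step, n_out=50):
    """Linearly interpolate a (frames, 39) step segment to n_out frames."""
    t_in = np.linspace(0.0, 1.0, len(step))
    t_out = np.linspace(0.0, 1.0, n_out)
    return np.stack([np.interp(t_out, t_in, step[:, j])
                     for j in range(step.shape[1])], axis=1)

# toy lateral-view ankle trajectory: three gait cycles plus mild noise
t = np.linspace(0, 3 * 2 * np.pi, 180)
ankle_y = np.sin(t) + 0.05 * np.random.default_rng(0).standard_normal(180)
steps = segment_steps(ankle_y)
a, b = steps[0]
seg = resample_step(np.stack([ankle_y[a:b]] * 39, axis=1))
print(len(steps), seg.shape)
```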
The experiment used five-fold cross-validation (CV), where the total dataset was divided into five equal parts, four of which were used for training and one for validation. The training set and validation set did not contain data from the same individual. Five experiments were conducted using the same parameters and random seed. The results presented are the average values from five-fold CV, unless otherwise specified.
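A subject-level split of this kind can be implemented with scikit-learn's `GroupKFold`, which guarantees that no individual's steps appear in both the training and validation folds; the toy arrays below are illustrative stand-ins for the real dataset.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.random((100, 50, 39))             # 100 single-step samples
y = rng.integers(0, 2, size=100)          # 0 = HC, 1 = PD
subjects = rng.integers(0, 20, size=100)  # which of 20 volunteers each step came from

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups=subjects)):
    # no subject may appear on both sides of the split
    assert not set(subjects[train_idx]) & set(subjects[val_idx])
    print(fold, len(train_idx), len(val_idx))
```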
The five-fold CV provides a comprehensive evaluation of the method’s performance across different subsets of the data. We use the following evaluation metrics:
Accuracy (Acc) measures the proportion of correctly predicted samples: Acc = (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP, and FN denote True Positives, True Negatives, False Positives, and False Negatives, respectively.
Precision (Prec) is the ratio of true positives among all samples predicted as positive: Prec = TP / (TP + FP).
Recall (Rec) evaluates the method’s ability to identify positive instances among all actual positives: Rec = TP / (TP + FN).
F1-score (F1) is the harmonic mean of precision and recall, providing a balanced measure of both: F1 = 2 × Prec × Rec / (Prec + Rec).
Additionally, the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve reflects the overall performance of the method by depicting the relationship between the false positive rate and true positive rate across different thresholds. A higher AUC value indicates stronger classification ability of the method.
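All five metrics can be computed with scikit-learn; a toy example, treating PD as the positive class:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # ground truth (1 = PD)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # thresholded predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]  # predicted PD probability

acc = accuracy_score(y_true, y_pred)    # (TP + TN) / (TP + TN + FP + FN)
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)           # 2 * prec * rec / (prec + rec)
auc = roc_auc_score(y_true, y_score)    # AUC uses scores, not hard labels
print(acc, prec, rec, f1, auc)  # 0.75 0.75 0.75 0.75 0.9375
```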
4.2. Implementation Details
Our network was implemented in Python 3.8 using common deep learning libraries, including PyTorch 2.0.1, Scikit-Learn 1.2.2, and NumPy 1.23.5. The experiments were conducted on a computer equipped with an Intel i9-13900K CPU and an NVIDIA RTX 4090 GPU, with 32 GB of memory. The training batch size for all experiments was 3072, with 1000 epochs.
The training was performed using the AdamW [34] optimizer. Warm-up [35] and learning rate decay strategies were applied: during the first 200 epochs, the learning rate gradually increased from its initial value to its peak; over the subsequent 400 epochs it remained constant; and during the final 400 epochs it gradually decayed to its minimum. The model was saved whenever it achieved the highest F1-score.
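The warm-up / constant / decay schedule described above can be sketched with PyTorch's `LambdaLR`; the base and minimum learning-rate values below are placeholders, since the exact values are not restated here.

```python
import torch

model = torch.nn.Linear(39, 2)
base_lr, floor = 1e-3, 1e-5  # placeholder values, not the paper's
opt = torch.optim.AdamW(model.parameters(), lr=base_lr)

def lr_scale(epoch):
    if epoch < 200:                  # linear warm-up over the first 200 epochs
        return (epoch + 1) / 200
    if epoch < 600:                  # constant plateau for the next 400 epochs
        return 1.0
    frac = min((epoch - 600) / 400, 1.0)  # linear decay over the final 400 epochs
    return 1.0 + frac * (floor / base_lr - 1.0)

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_scale)
for _ in range(1000):
    opt.step()   # the actual training step would go here
    sched.step()
print(round(opt.param_groups[0]["lr"], 8))  # 1e-05
```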
Through testing, we found that a four-layer network with alternating self-attention and Mamba-2 layers achieved the best results. The remaining parameters used throughout the experiments, including data processing, are presented in Table 2; all parameters were optimized.
4.3. Experiment and Analysis
4.3.1. Classification Result
The experimental results obtained from the five-fold CV are presented in Table 3. Our proposed method achieved an Acc of 89.1% and an F1 of 81.1%, with an average AUC of 0.928.
4.3.2. Comparison with Different Methods
To validate the effectiveness of our proposed method, we conducted comparison experiments with several classical methods and advanced time-series- or joint-based methods (see Table 4). Our method outperformed them all, achieving the highest Acc, F1, and AUC. Compared with the advanced time-series method JTFT, our method achieved a 1.0% higher Acc, 1.3% higher F1, and 0.012 higher AUC, along with a 3.3% higher Prec. Our method also compared favorably with common joint-based methods such as GCN, ST-GCN, and 2s-ST-AGCN. Although iTransformer excelled in Rec, its overall performance was inferior to that of our method.
Overall, although our method shows some limitations in recall compared to iTransformer and JTFT, it surpasses the existing methods in terms of Acc, Prec, F1, and AUC, suggesting promising overall performance.
4.3.3. Ablation Experiments
To analyze the impact of each component on the overall results, we conducted ablation experiments, with the results presented in Table 5. Data Parallelism indicates whether dual-channel processing is applied to the data from the bilateral camera sensors. The baseline Attention model achieved an Acc of 87.4% and an F1 of 78.0%. With the introduction of Data Parallelism, Acc increased by 0.6% and F1 by 1.6%. Applying cross-attention for information fusion improved Acc by 0.9% and F1 by 1.2%. Replacing part of the Attention modules with Mamba-2 yielded a further 0.2% in Acc and 0.4% in F1.
Although the improvement from substituting Attention with Mamba-2 is relatively modest, Mamba-2 computes faster than Attention, so the gain comes at negligible cost. Overall, the ablation experiments showed that each modification improved Acc, F1, and AUC.
4.3.4. Loss Parameters Experiments
The weight of the proposed MMD loss was evaluated through comparative experiments, as shown in Table 6. With an appropriately chosen weight, the Acc improved by 0.7% and the F1 by 0.8% compared with not using the MMD loss (weight set to zero). However, when the weight was too large, the constraint became too strict, leading to a decline in overall performance.
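One plausible form of the MMD loss for aligning the frontal- and lateral-view feature distributions is the biased RBF-kernel estimator sketched below; the kernel choice and bandwidth are our assumptions, not necessarily those used in the paper.

```python
import torch

torch.manual_seed(0)

def mmd_rbf(x, y, sigma=4.0):
    """Biased estimate of squared MMD between feature batches x, y of shape (n, d)."""
    def k(a, b):
        # Gaussian (RBF) kernel on pairwise squared distances
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

front_feat = torch.randn(64, 16)
same = mmd_rbf(front_feat, torch.randn(64, 16))         # same distribution
shifted = mmd_rbf(front_feat, torch.randn(64, 16) + 1)  # mean-shifted distribution
print(shifted > same)  # a shifted distribution yields a larger MMD
```

Minimizing such a term during training pulls the two views' feature distributions together, which is the stated purpose of the MMD loss.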
Due to sample imbalance (HC samples ∼ 70%, PD samples ∼ 30%), we attempted to compensate by adjusting the class-weight parameters of the Weighted Cross-Entropy Loss, as shown in Table 7. We first set the compensation ratio to match the inverse of the sample imbalance ratio, but this yielded poor results: although Rec improved, the other metrics declined significantly, leaving an Acc of only 85.6% and an F1 of 77.4%.
Through further optimization, we obtained better-performing weight values. Compared with no imbalance compensation, this adjustment led to a 0.7% increase in Acc and a 0.5% improvement in F1. Overall, the proposed loss function method was shown to be effective, although it requires some parameter optimization.
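Class-weighted cross-entropy of this kind is available directly in PyTorch; the weight values below illustrate inverse-frequency compensation for a roughly 0.70:0.30 HC:PD split and are not the optimized values from Table 7.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5], [0.2, 1.5], [1.0, 1.0]])  # toy network outputs
labels = torch.tensor([0, 1, 1])  # 0 = HC, 1 = PD

plain = nn.CrossEntropyLoss()
# up-weight the minority PD class: inverse-frequency weights for a 0.70:0.30 split
weighted = nn.CrossEntropyLoss(weight=torch.tensor([0.30, 0.70]))
print(plain(logits, labels).item(), weighted(logits, labels).item())
```

With the PD class up-weighted, misclassified PD samples contribute more to the loss, which is what pushes Rec up at the possible expense of the other metrics.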
5. Discussion
5.1. Comparative Study of Frontal and Lateral View
Different studies have collected PD video data in various ways, typically from frontal or lateral views. In contrast, our dataset includes both views of the same individual, captured simultaneously, allowing a fair comparison of their impact on the results. To compare the two views, we performed experiments using the same single-view network; the results are presented in Table 8.
Table 8 shows that the lateral view outperforms the frontal view in Acc, Prec, and F1, with a slightly higher AUC, while the frontal view achieves a higher Rec. For a pure disease-screening method, high recall is crucial because it reduces the risk of missing PD patients. For a home-use portable device, however, high precision matters more, since false positives incur unnecessary costs, such as the time and effort family members spend taking elderly individuals to medical appointments. In terms of the combined metrics of Acc and F1, the lateral view therefore outperforms the frontal view.
From an application standpoint, the lateral view allows single-step segmentation through the motion curve of the foot's skeletal joints, whereas perspective effects make single-step segmentation in the frontal view considerably more complex and less accurate. Therefore, if only one view is used, the lateral view is preferable.
Overall, both in terms of results and application, the lateral view outperforms the frontal view. However, the data from the frontal and lateral views exhibit complementarity, and combining both views leads to improvements in overall performance.
5.2. Fine-Tuning Based on Prior Information
We collected physiological prior information on sex, age, height, weight, BMI, shoe size, and single-step duration for all volunteers, which can be used both to improve the method's performance and to analyze the relationship between these priors and PD. Specifically, all prior information was first Z-score-standardized and then passed through a bias-free Linear layer; the resulting adjustment was added to the original softmax output, followed by a final Softmax layer. During training, the original network was kept frozen, and only the newly added Linear layer was fine-tuned, for a total of 500 epochs. The results are shown in Table 9.
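The prior-information adjustment head, as described above, can be sketched as follows; the layer sizes, prior ordering, and random inputs are illustrative stand-ins for the frozen network and real priors.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_priors = 7  # sex, age, height, weight, BMI, shoe size, single-step duration
prior_head = nn.Linear(n_priors, 2, bias=False)  # bias-free -> weights stay interpretable

base_probs = torch.softmax(torch.randn(4, 2), dim=-1)  # frozen network's softmax output
priors = torch.randn(4, n_priors)                      # already Z-score-standardized
# the prior adjustment nudges the scores, then a final softmax renormalizes
adjusted = torch.softmax(base_probs + prior_head(priors), dim=-1)
print(adjusted.shape, torch.allclose(adjusted.sum(dim=-1), torch.ones(4)))
```

During fine-tuning, only `prior_head` would receive gradients; the base network's parameters stay frozen.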
Table 9 shows that fine-tuning with prior information improved the method by 0.6% in Acc and 0.7% in F1, although the AUC decreased. This is reasonable: the adjustment from the prior information makes the classification boundary relatively fuzzier, which can lower the AUC. Nonetheless, fine-tuning with prior information improved the overall performance of the method.
Because the Linear layer has no bias term, the influence of each prior on the classification result can be read directly from its weights. Figure 6 shows all the priors used for training and the corresponding Linear-layer weights with respect to PD, where positive values (blue in the figure) indicate a positive impact, i.e., a higher likelihood of PD, and negative values (red in the figure) indicate a negative impact. Drawing on the literature, we discuss the practical significance of each weight in Figure 6 below.
The Sex weight of 0.0416 indicates a slightly increased probability of PD in males. This finding aligns with the sex-related conclusions in the review by Zhu et al. [38], while studies by Terrin et al. [39] and Chen et al. [40] further analyze the underlying mechanisms. Overall, most studies agree that males are more susceptible to PD. However, according to Dumas et al. [41], sex itself induces biomechanical differences in movement patterns, and how such differences influence PD gait characteristics warrants further investigation.
The Age weight of 0.1805 indicates a strong positive correlation between age and the probability of developing PD, which is a well-established fact. This has been extensively validated by numerous studies and will not be further elaborated here.
Research on the relationship between PD and height is relatively scarce. An early study by Ragonese et al. [42] found a negative correlation between height and PD in males, while no such association was observed in females. Similarly, a study by Saari et al. [43], based on autopsy analysis of PD samples, suggests that shorter stature may be associated with a lower number of dopaminergic neurons, potentially increasing the risk of PD. These studies provide some support for the Height weight of −0.0524.
Weight loss is a common clinical manifestation of PD [44,45], which can explain the Weight weight of −0.0334.
The relationship between BMI and PD has been extensively studied, with intriguing findings. In a female-exclusive study by Portugal et al. [46], women with higher BMI showed lower PD incidence. Conversely, a study by Kim et al. [47] indicates that individuals with lower BMI have higher PD risk, while overweight status shows no association with PD. Studies by Palacios et al. [48] and Wang et al. [49] found no significant association between BMI and PD prevalence. However, in a male-specific study by Osler et al. [50], higher BMI in men correlated with increased PD risk. As a composite measure of height and weight, BMI's influence appears complex. Our BMI weight of −0.0177 is a relatively small negative weight, which falls within a reasonable range.
There is almost no research on the relationship between shoe size and PD. Male shoe sizes (approximately 40–43) and female shoe sizes (approximately 36–39) differ markedly, so shoe size largely encodes sex. Therefore, the Shoe size weight (0.0502) may be considered an indirect representation of the Sex weight (0.0416).
Although stride length and walking speed are clearly reduced in PD patients, the duration of a single step does not necessarily increase significantly. For example, in the study by Veer et al. [51], the difference in single-step duration between HC and PD was not significant, with PD showing a slightly longer average single-step duration than HC. This is consistent with the slight positive Duration of single step weight (0.0223) that we observed.
Overall, the weights obtained from the bias-free Linear layer based on prior information can be corroborated by existing research, and using prior information effectively improves the method's performance. However, users should not be required to input this prior information in order to use the device; in the practical device, we therefore incorporate prior information as an optional adjustment.
5.3. Portable Device
We developed a portable device based on CMSA-Net, with its core computing unit being the NVIDIA Jetson Orin NX (NVIDIA Corporation, Santa Clara, CA, USA). The device can be configured to operate in either a single-camera (lateral camera only) or dual-camera mode, and it allows the optional entry of prior information about the test subject. Its structure is shown in Figure 7, and the metric results are presented in Table 10.
Different usage modes are designed to suit different environments. For example, in spacious nursing homes or senior activity centers, the device configured with both frontal and lateral camera sensors can be permanently installed in public areas frequented by older adults. After pressing the designated on-screen button, the individual simply walks back and forth within a specified area. When potential cases are detected, staff can assist by entering prior information to obtain more accurate results. For temporary or portable applications, the lateral camera sensor alone is sufficient, making the device comparable to a smartphone that can perform detection when held steadily.
Overall, this portable device allows for flexible deployment in various environments, making it suitable for both institutional and individual use to facilitate PD detection. However, during device usage, participants are still required to walk in a relatively regular and straight manner. In future research, we plan to collect data on other types of trajectories, such as circular, diagonal, and irregular walking, to expand the device’s applicability and reduce its usage limitations.
5.4. Comparison with Present Studies
We compared studies from recent years that employed data and methods similar to ours, with the results presented in Table 11. Owing to variations in data processing methods, sample sizes, and sample proportions across studies, the statistical metrics in the table are not directly comparable, but they still offer some reference value.
Our method achieved the highest Acc. Although its F1 was not the highest, this can be attributed to differences in sample distribution: in He et al.'s study [19], the ratio of HC to PD samples was approximately 1:1, whereas in our study it was around 0.7:0.3, so a slightly lower F1 is expected.
Both Kaur et al. [21] and we utilized frontal- and lateral-view data. However, since Kaur et al. did not consider the fusion of bilateral data and relied solely on a CNN for processing, their results were less optimal. In contrast, our data cover all body joints and require no additional equipment, yielding better performance and practicality.
Our dataset includes the largest total number of participants. Although the number of PD samples is not the highest and there is an imbalance between HC and PD samples, this imbalance better reflects real-world scenarios, where the number of HC individuals is naturally higher than PD cases. Therefore, our results are more representative of practical applications.
The multi-class classification of PD also holds significant clinical value. In our current study, which primarily focuses on developing a user-friendly portable device, we adopted a binary classification method as the main outcome to facilitate user interpretation. In future research, we will utilize this dual-view dataset to explore multi-class classification of PD.
6. Conclusions
We collected a dual-view gait video dataset comprising 304 HC and 84 PD volunteers. Considering the characteristics of this dataset, we proposed a novel gait detection method, CMSA-Net. The method employs a cross-attention mechanism to integrate self-attention features from different views with Mamba-2 block features. Moreover, we introduced the MMD loss to optimize the distribution of features, thereby enhancing the fusion effectiveness. In comparative experiments with multiple methods, CMSA-Net achieved the best performance, demonstrating its effectiveness in PD gait detection tasks.
Furthermore, to facilitate the development of a portable PD detection device, we adopted a single-step segmentation method during data preprocessing. This method discards static states and extracts only effective gait segments, improving overall detection performance while enhancing ease of use. This improvement is particularly significant for a device primarily designed for elderly users. Additionally, we analyzed the impact of video viewpoints and prior information on method performance, providing valuable insights for future research.
Based on these methodologies, we developed a portable PD detection device that supports both single- and dual-view modes and offers optional integration of prior information. This flexibility enables adaptation to various application scenarios, such as home and elder care institutions, highlighting its practical applicability.
In future work, we will develop multi-class classification methods and devices for applications in hospitals and assistive healthcare settings, where more precise grading and assessment are required.