Leveraging Deep Learning for Fine-Grained Categorization of Parkinson’s Disease Progression Levels through Analysis of Vocal Acoustic Patterns

Speech impairments often emerge as one of the primary indicators of Parkinson’s disease (PD), although they are not readily apparent in its early stages. While previous studies focused predominantly on binary PD detection, this research explored the use of deep learning models to automatically classify sustained vowel recordings into healthy controls, mild PD, or severe PD based on motor symptom severity scores. Popular convolutional neural network (CNN) architectures, VGG and ResNet, as well as the Swin vision transformer, were fine-tuned on log mel spectrogram image representations of the segmented voice data. Furthermore, the research investigated the effects of audio segment lengths and specific vowel sounds on the performance of these models. The findings indicated that using longer segments yielded better performance. The models showed strong capability in distinguishing PD from healthy subjects, achieving over 95% precision. However, reliably discriminating between mild and severe PD cases remained challenging. The VGG16 achieved the best overall classification performance with 91.8% accuracy and the largest area under the ROC curve. Moreover, focusing the analysis on the vowel /u/ could improve accuracy further, to 96%. Applying visualization techniques like Grad-CAM also highlighted how CNN models focused on localized spectrogram regions while transformers attended to more widespread patterns. Overall, this work showed the potential of deep learning for non-invasive screening and monitoring of PD progression from voice recordings, but larger multi-class labeled datasets are needed to further improve severity classification.


Introduction
Parkinson's disease (PD) is a progressive neurodegenerative disorder characterized by motor symptoms such as tremors, rigidity, and slowed movement [1][2][3]. However, the pathology underlying PD begins years before the clinical diagnosis, with early manifestations such as hyposmia, speech disorders, depression, constipation, and sleep disturbances frequently overlooked [4,5]. Diagnosing PD during the initial phase and initiating treatment can potentially slow the progression of this degenerative disorder [6].
While neurological examination methods like the Movement Disorder Society Unified Parkinson's Disease Rating Scale (MDS-UPDRS) and brain scans are among the main criteria for diagnosing PD, they have limitations such as cost, accessibility, clinician bias, and difficulty monitoring progression and treatment effectiveness [1][2][3][7,8]. Therefore, there is a need for alternative diagnostic approaches that are more objective, cost-effective, and accessible.
Speech difficulties are often one of the initial and most serious signs of PD, severely affecting how patients communicate and their overall quality of life [9]. Over 80% of PD patients have some vocal dysfunction, including decreased volume, lack of tone, reduced fundamental frequency range, slurred speech, or abnormal rhythms and melodies [10,11]. This can occur up to 5 years before motor symptoms like tremors appear [12,13]. While assessing writing and walking requires specialized devices, voice can be captured and analyzed without special equipment or clinic visits [13]. Therefore, speech analysis provides a promising opportunity for early PD detection and continuous monitoring.
Various acoustic analysis techniques, including measuring fundamental frequency variation, noise parameters, and non-linear dynamics, have been explored for detecting and quantifying vocal symptoms [14,15]. However, recent research has increasingly focused on leveraging advanced machine learning and neural network approaches to automatically detect PD through speech analysis [16]. Significant work has centered on selecting optimal features for shallow classifiers as well as determining ideal architectures for deep learning classifiers.
Mamun et al. tested ten algorithms on 195 vocal recordings, finding that LightGBM, a gradient-boosting method, achieved 95% accuracy in classifying PD [23]. Govindu et al. recently studied early PD detection via telemedicine using ML models on audio data from 30 PD and 30 control subjects. Their RF classifier had the best performance, with 91.83% accuracy and 0.95 sensitivity for detecting PD [20]. Wang et al. implemented 12 machine learning models on the 401 voice biomarkers dataset to classify subjects as PD or not. They also built a custom deep learning model with a classification accuracy of 96.45% [24]. Pramanik et al. achieved high accuracy in PD detection using Naïve Bayes algorithms [28]. Other studies focused on feature selection techniques. Lamba et al. tested combinations of three selection methods (mutual information gain, extra tree, genetic algorithm) and three classifiers (Naïve Bayes, KNN, RF), finding that the genetic algorithm plus RF performed best with 95.58% accuracy [25].
In contrast to the previous approach, which primarily used manual feature engineering and shallow classifiers, the second approach harnesses deep learning to automatically learn features directly from speech data. Various neural network architectures have been designed and tested, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) networks, combinations of these, and more recently, transformer-based models. These models directly learn feature representations from the speech signal or spectrograms, including sustained vowels, continuous speech, and repeating syllables. Deep learning models can alleviate the need for expert-crafted features and have achieved state-of-the-art (SOTA) results on PD detection from speech [8]. Aversano et al. developed LSTM and CNN models to analyze voice recordings segmented into 1 s intervals consisting of vowels, phrases, and sentences. These voice samples were transformed into mel spectrogram representations as input to the models, which achieved an F1 score of 97%. However, a notable limitation of this study was that the researchers did not ensure that the training and validation sets were speaker-independent, which could introduce biases and limit the generalizability of the models' performance [29]. Similarly, Shah et al. employed a CNN-based model that analyzed 1 s speech chunks transformed into log-scaled mel spectrograms (LMS) for detecting PD from vowel phonations of /a/ and /i/, achieving 90.32% accuracy [30]. Another study employed a MobileNet CNN model with various types of spectrograms as input. The findings indicated that speech energy spectrograms and mel spectrograms yielded the highest accuracy rates of 96% and 92%, respectively [31]. A study by Khojasteh et al. evaluated the performance of a CNN model on sustained phonation recordings of the vowel /a/ lasting over 5 s. When tested on 2 s voice samples segmented into 815 ms frames, the CNN achieved a classification accuracy of 75.7%. An interesting aspect of their approach involved data augmentation techniques like flipping (vertically and horizontally) and rotating the frames, which were applied to the training dataset. However, since the inputs were spectrogram-based images representing time-frequency information, such spatial transformations may not have been suitable augmentation techniques [8]. Quan et al. employed an end-to-end model incorporating both 2D and 1D CNNs to achieve 92% accuracy in classifying PD based on speech tasks involving the reading of both simple and complex sentences. Their model operated on a sequence of overlapping segments derived from the LMS representation of the input audio. However, the study did not specify the length of this sequence of overlapping segments [10].
Some researchers have further improved performance by using transfer learning to adapt these speech models, leveraging knowledge already gained on other tasks. Hireš et al. proposed an ensemble approach involving multiple fine-tuned versions of the Xception deep learning model. When applied to a subset of the sustained vowel recordings dataset (PC-GITA), focusing on the vowels /a/, /i/, /o/, /u/, and /e/, this ensemble method achieved an impressive 99% accuracy in classifying the presence of PD based solely on the voice recordings. In their approach, the 1 s voice signal was transformed into a spectrogram, which was then blurred before being processed by the models [13]. In another study, Wodzinski et al. fine-tuned a ResNet architecture model using a subset of the PC-GITA dataset containing only the vowel sound /a/. By transforming the audio recordings into spectrograms, their model achieved an accuracy of over 90% in classifying the presence of PD [11]. More recently, Klempíř et al. found that self-supervised speech models such as wav2vec, pre-trained on 960 h of 16 kHz English speech, generate valuable embeddings for PD detection. These models achieved AUROC (area under the receiver operating characteristic curve) scores ranging from 0.77 to 0.98 across various datasets, which included repeated /pa/ syllables. Notably, this pipeline can be applied directly to raw audio recordings without the need for segmenting [32]. In summary, the deep learning approach shows promise for PD detection from voice, with recent work achieving accuracies over 90% using techniques like CNNs, LSTM models, and self-supervised learning.
Prior studies have focused on binary classification for PD detection from voice recordings, distinguishing between people with PD and healthy controls. However, clinical applications would benefit from more granular subtype classification beyond this binary distinction [33]. In this work, we first explored the use of multi-class classification to detect PD and differentiate between various stages based on MDS-UPDRS III scores. Part III of the MDS-UPDRS assesses motor function in Parkinson's disease patients. We trained models to classify voice recordings into three classes. This paper also compared three DL architectures widely used in computer vision tasks. The models were trained using LMS representations derived from sustained vowel phonations from a publicly accessible dataset. Secondly, the study examined how the length of audio clips and particular vowel sounds impacted the effectiveness of these models. Additionally, previous studies segmented audio recordings before analysis but did not evaluate model performance on full recordings; in this work, we applied an ensemble method across segments to obtain overall classifications for entire recordings after splitting. Finally, we employed visualization techniques such as Grad-CAM [34] and t-SNE [35] to provide possible explanations of the deep learning models' predictions, highlighting discriminative regions in the LMS inputs that influence particular classification decisions.

Materials and Methods
Figure 1 shows the architecture of our speech classification system that categorizes speech signals into one of three classes: healthy, Parkinson's disease mild, or severe. The system captures the audio signal, preprocesses it into segments, and converts the segments into LMSs, visual representations of audio frequency content over time. These spectrograms are input to a deep neural network that extracts informative audio features. A classifier model then categorizes the speech into one of three classes by matching the extracted features to learned patterns. In essence, the system transforms audio into images, extracts features using deep learning, and classifies speech based on those features.

Dataset
The present study used the Italian Parkinson's voice and speech database. The dataset comprises speech recordings in the .wav format obtained from Italian individuals diagnosed with Parkinson's disease, as well as healthy control subjects. This database was collected by Dimauro et al. [36,37]. Building on prior work that found sustained vowels to be more predictive of Parkinson's diagnosis compared to words or sentences [19], this study focused its analysis specifically on short vowels. Concentrating only on short vowel samples reduces the influence of factors such as language and education that could otherwise skew the results.

As outlined in Table 1, the subset includes sustained vowel recordings (vowels /a/, /e/, /i/, /o/, and /u/) from 22 healthy controls (12 female, 10 male) and 28 PD patients (9 female, 19 male). The participants were closely matched by age, with an average of 67.1 years (±5.2 years) in the control group and 67.2 years (±8.7 years) in the PD group. The PD patients were further classified by their score on Part III of the MDS-UPDRS. Figure 2 shows the histogram of audio lengths across three groups: Healthy Controls (HC), Mild Parkinson's Disease (PD_Mild), and Severe Parkinson's Disease (PD_Severe). Notably, HC samples predominantly fall within approximately 5 s, while PD groups exhibit a broader range.

Data Preprocessing
We performed data preprocessing to convert the raw audio data into a format that could be analyzed effectively by deep learning models. Initially, all audio recordings from the database were resampled at 16 kHz to ensure a consistent sampling rate. Subsequently, recordings with excessive background noise were removed from the dataset (2 healthy participants were excluded for this reason). The total number of audio recordings after this step was 475. The audio clips were also trimmed to remove any leading or trailing silence. The raw speech data contained audio recordings of different lengths, as shown in Figure 2. To create manageable training batches with consistent sample sizes, the recordings were segmented into fixed-length clips (1 s and 5 s), with each segment overlapping the previous one by 50%, padding shorter utterances and truncating longer utterances. The original dataset was processed to create two distinct versions for training purposes. In the First Segment (FS) version, only the first segment from each audio recording was utilized. In the All Segments (AS) version, all segments derived from each recording were used. Combining these two segmentation approaches with the two segment durations produced four unique training datasets (FS-1, FS-5, AS-1, and AS-5) from the same raw data (Figure 3). These abbreviations are used throughout the rest of the paper to refer to the particular dataset versions. The details of the modified datasets are provided in Table S1.
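The segmentation step can be illustrated with a minimal sketch. It assumes zero-padding for recordings shorter than one segment and dropping of an incomplete trailing segment; the text states only that shorter utterances were padded and longer ones truncated, so both choices, as well as the function names `segment_signal` and `build_versions`, are illustrative assumptions rather than the authors' implementation.

```python
def segment_signal(samples, sample_rate=16000, segment_seconds=1.0, overlap=0.5):
    """Split a 1-D sequence of samples into fixed-length, overlapping segments.

    Assumptions (not specified in the text): short recordings are zero-padded
    to one full segment, and an incomplete trailing segment is dropped.
    """
    seg_len = int(round(segment_seconds * sample_rate))
    hop = int(round(seg_len * (1.0 - overlap)))  # 50% overlap -> hop is half a segment
    if seg_len <= 0 or hop <= 0:
        raise ValueError("segment length and hop must be positive")

    samples = list(samples)
    if len(samples) <= seg_len:
        # Pad a short recording with zeros up to exactly one segment.
        return [samples + [0.0] * (seg_len - len(samples))]

    segments = []
    start = 0
    while start + seg_len <= len(samples):
        segments.append(samples[start:start + seg_len])
        start += hop
    return segments


def build_versions(samples, sample_rate=16000):
    """Return the four dataset versions (FS-1, FS-5, AS-1, AS-5) for one recording."""
    versions = {}
    for seconds in (1, 5):
        segments = segment_signal(samples, sample_rate, float(seconds))
        versions[f"FS-{seconds}"] = segments[:1]  # first segment only
        versions[f"AS-{seconds}"] = segments      # all segments
    return versions
```

Under these assumptions, a 5 s recording at 16 kHz (80,000 samples) yields nine 1 s segments with a hop of 8000 samples, and a single 5 s segment.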

Since the models used in this study operate on images, the segmented voice recordings needed to be transformed into an image format. All recordings were therefore converted from waveform audio to LMS-based images. The LMS is a representation of an audio signal that accounts for the human auditory perception of frequency and loudness. It is obtained by first computing a spectrogram using the Short-Time Fourier Transform (STFT), which provides the frequency content and amplitude over time, with frequency on a linear Hz scale. The linear frequency axis is then converted to the mel frequency scale using Equation (1):
$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{1}$$
where m is the frequency in mels and f is the frequency in Hz. This conversion results in a mel spectrogram, where the frequency axis is represented in the mel scale, which better approximates the human auditory system's response to sound frequencies. Finally, the logarithm of the amplitude values (in dB) is taken to mimic the human ear's logarithmic perception of loudness. The resulting LMS displays the frequency content in mels on one axis and time on the other, with the amplitude represented by a logarithmically scaled color map [38]. In this research, LMS representations were computed using 128 ms (2048 samples) window lengths and 32 ms (512 samples) hop lengths for the STFT, with examples provided in Figure 4.
Additionally, to reduce overfitting given the small training dataset, the data set was expanded by applying different types of audio augmentation before executing the voice-to-image transformation process. This data expansion aims to improve generalizability. For this purpose, we performed data augmentation using the torchaudio spectrogram augmentation library [39]. Here, various techniques, including time masking, frequency masking, and a combination of them, were applied to each audio and then transformed to the LMS image (Figure 5). Data augmentation was not used for the validation sets, so that these sets would resemble real-world data. Finally, the LMSs were resized to 224 × 224 pixels and converted to 3-channel grayscale images for input into the deep learning models.
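The mel-scale conversion and the STFT settings described in this section can be checked numerically with a short sketch. It assumes the commonly used formula m = 2595 log10(1 + f/700) for the Hz-to-mel mapping and a centre-padded STFT (as in common audio libraries) for the frame count; the names `hz_to_mel`, `mel_to_hz`, and `n_frames` are illustrative and not taken from the paper.

```python
import math

SAMPLE_RATE = 16000  # Hz, the resampling rate stated in the text
N_FFT = 2048         # STFT window length in samples
HOP = 512            # STFT hop length in samples

# Window and hop durations in milliseconds at 16 kHz.
WINDOW_MS = 1000.0 * N_FFT / SAMPLE_RATE  # 128 ms
HOP_MS = 1000.0 * HOP / SAMPLE_RATE       # 32 ms


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (assumed formula, see lead-in)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def n_frames(n_samples, hop=HOP):
    """Number of STFT frames, assuming the signal is centre-padded."""
    return 1 + n_samples // hop


if __name__ == "__main__":
    print(f"window = {WINDOW_MS} ms, hop = {HOP_MS} ms")
    print(f"1000 Hz is about {hz_to_mel(1000.0):.1f} mel")
    print(f"1 s segment: {n_frames(SAMPLE_RATE)} frames")
    print(f"5 s segment: {n_frames(5 * SAMPLE_RATE)} frames")
```

With these settings a 1 s segment gives 32 time frames and a 5 s segment gives 157, which shows how much the spectrograms must be rescaled along the time axis to reach the 224 × 224 input size.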

Training and Deep Learning Models
In this study, we utilized several popular deep learning models for computer vision tasks. Specifically, two popular CNN architectures were employed: ResNet and VGG [40,41]. These CNNs have achieved good performance on benchmark datasets and have become standard models for computer vision. VGG16 and VGG19 are deep convolutional neural network architectures that have 16 and 19 layers, respectively. Both architectures consist of 5 sets of convolutional layers, where each set is followed by a max pooling layer. The main difference between VGG16 and VGG19 is the number of cascaded convolutional layers in each set. The architecture of VGG16 is shown in Figure 6a. ResNet-50, on the other hand, is a residual network architecture that contains 50 layers (49 convolutional layers organized into 16 residual blocks and one final fully connected layer for output). It utilizes skip connections, which let the signal bypass certain convolutional layers so that gradients propagate more easily during backpropagation, alleviating the vanishing gradient problem. ResNet-18 is a shallower variant of the original ResNet architecture for image classification. As shown in Figure 6b, it contains 18 layers in total: 17 convolutional layers organized into eight residual blocks and one final fully connected layer for output [40][41][42].
In recent years, transformers have become the predominant model architecture for natural language processing (NLP) tasks due to their continuously improving efficiency [43]. The capabilities of transformers are not limited to NLP; they have also shown excellent performance in image recognition. Architectures like the Vision Transformer (ViT) [44] demonstrate how transformers can match or even surpass CNNs on computer vision datasets. Building on the concepts of ViT, the Swin Transformer [45] introduces a hierarchical design for greater efficiency and the flexibility to model at a variety of scales [43]. We also employed the Swin Transformer architecture in this study to take advantage of its state-of-the-art capabilities. The Swin Transformer is a pure transformer architecture that is becoming a general-purpose backbone for various tasks. There are four Swin Transformer configurations: Swin_t, Swin_s, Swin_b, and Swin_l [45]. Swin_s and Swin_b were chosen as feature extractors in this study. They have 49.6 M and 87.8 M parameters, respectively, as shown in Table 2. The overall architecture of the Swin Transformer is illustrated in Figure 6c. The Swin_s and Swin_b models differ primarily in the size of the embeddings and the number of attention heads: Swin_b has larger embeddings and more heads than Swin_s. Further details about these models can be found in the original paper [45].
These models have already been trained on a large-scale labeled dataset. The performance metrics of these models on the ImageNet dataset are presented in Table 2. During the training phase, the pre-trained weights (the weights obtained when a model was trained on the ImageNet dataset) were utilized. Transfer learning was applied by fine-tuning the pre-trained layers. The weights learned on ImageNet provide a much better initialization for many computer vision tasks than random weights [46].
The classification layers of the original models were removed and replaced with a new classification head. This new classifier uses a neural network with two dense layers before the final classification layer. The first dense layer has 256 neurons, and the second dense layer has 128 neurons (Figure 6d). After each dense layer, dropout with a probability of 0.5 was applied. This same classification architecture was utilized across all models in the study.
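To give a sense of how small the new classification head is relative to the backbones, the following sketch counts its trainable parameters. The layer widths (256, 128, and three output classes) come from the text; the backbone feature width `n_features` is a hypothetical input, since the text does not state where the head attaches to each backbone. The values 512 and 2048 used below are the pooled feature widths of ResNet-18 and ResNet-50 in common implementations and serve only as examples.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def head_params(n_features, hidden=(256, 128), n_classes=3):
    """Trainable parameters of the head: n_features -> 256 -> 128 -> n_classes.

    Dropout layers add no parameters, so they do not appear here.
    n_features is an assumption that depends on the chosen backbone.
    """
    sizes = [n_features, *hidden, n_classes]
    return sum(dense_params(a, b) for a, b in zip(sizes, sizes[1:]))


if __name__ == "__main__":
    for name, width in (("512-wide features", 512), ("2048-wide features", 2048)):
        print(f"{name}: {head_params(width):,} head parameters")
```

For a 512-wide feature vector the head has 164,611 parameters, and for a 2048-wide vector 557,827, both far below the tens of millions of parameters in the pre-trained backbones.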

Experimental Setups and Evaluation Criteria
Our implementation leveraged various Python libraries such as PyTorch [39] for deep learning model development, Pandas [47] and NumPy [48] for data analysis, and Matplotlib [49] and Scikit-learn [50] for visualization and some analysis tasks.
As detailed in Table 3, key training hyperparameters used during model optimization included learning rate, batch size, and number of epochs. The models were trained using the Adaptive Moment Estimation with Weight Decay (AdamW) optimizer, with cross-entropy loss to measure prediction error. A learning rate of 0.0003 was set initially and adjusted over time by a scheduler. We ran the experiments on a system with an Intel Core i7-11700K CPU @ 3.60 GHz, 128 GB of RAM, and an NVIDIA RTX 3090 GPU with 24 GB of memory. This study followed two approaches for classifying audio samples and training the models. The first approach involved segmenting the audio clips into 1 s and 5 s segments (the AS-1 and AS-5 datasets explained in Section 2.1), thereby increasing the dataset size. The second approach used only the first segment of each audio clip (FS-1 and FS-5). We evaluated whether using all segments improved model accuracy compared to using only the first segment.
This study utilized four main evaluation criteria: Precision, Recall, F1 score, and Overall accuracy.Precision refers to the percentage of positive classifications that were correct.Recall (also called sensitivity) measures the percentage of actual positives that were correctly identified.The F1 score combines precision and sensitivity by taking their harmonic mean.Finally, overall accuracy is simply the percentage of total classifications that were correct out of all classifications made.
To calculate these performance metrics, we determined the numbers of true positives (TP), false positives (FP), and false negatives (FN) per class.A TP represents a correct prediction for a given class.An FP is an incorrect prediction that wrongly predicted that class.An FN is a case that belongs to that class but was incorrectly excluded.
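As a minimal sketch of how the per-class counts translate into the four criteria, the following pure-Python function derives precision, recall, and F1 score per class, plus overall accuracy, from lists of true and predicted labels. The function name and the toy labels are illustrative assumptions, not the code used in the study.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred, labels):
    """Return ({label: (precision, recall, f1)}, overall_accuracy)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for class t
        else:
            fp[p] += 1   # sample wrongly assigned to class p
            fn[t] += 1   # sample of class t that was missed
    metrics = {}
    for c in labels:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        # F1 is the harmonic mean of precision and recall.
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = (precision, recall, f1)
    accuracy = sum(tp.values()) / len(y_true)
    return metrics, accuracy


# Toy example (hypothetical labels, not study data).
labels = ["HC", "PD_Mild", "PD_Severe"]
y_true = ["HC", "HC", "PD_Mild", "PD_Mild", "PD_Severe", "PD_Severe"]
y_pred = ["HC", "HC", "PD_Mild", "PD_Severe", "PD_Severe", "PD_Mild"]
metrics, accuracy = per_class_metrics(y_true, y_pred, labels)
```

In this toy case the HC class is recovered perfectly, while the single mild/severe swap in each direction halves precision and recall for both PD classes.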

Results and Discussion
In this section, we will describe and discuss our results in detail while evaluating the studied models' performance.

Classification Performance
A stratified patient-independent three-fold cross-validation approach was utilized for all experiments, where the data was partitioned into three folds with no patient overlap across folds to avoid data leakage and reduce potential biases in model evaluation. The model was trained on two folds and evaluated on the held-out fold, and this was repeated three times so that each fold served as the evaluation set once. This ensured a rigorous assessment of model performance on unseen data. We decided not to use a separate test set due to the small database size. To mitigate potential issues caused by class imbalance, we applied train-time oversampling to achieve a more balanced class distribution [51,52].
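The patient-independent, stratified partitioning can be illustrated with a short sketch. It assumes that each patient carries a single class label, so stratification can be done at the patient level by distributing the patients of each class across folds in round-robin fashion. The function names and data layout are hypothetical; library utilities such as scikit-learn's `StratifiedGroupKFold` offer comparable behavior, and the source does not state which tool was used.

```python
import random
from collections import defaultdict


def patient_independent_folds(patient_labels, n_folds=3, seed=0):
    """patient_labels: {patient_id: class_label}.

    Returns a list of n_folds disjoint sets of patient ids, with each
    class spread as evenly as possible across the folds.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for pid, label in sorted(patient_labels.items()):
        by_class[label].append(pid)
    folds = [set() for _ in range(n_folds)]
    for label in sorted(by_class):
        pids = by_class[label]
        rng.shuffle(pids)
        # Round-robin assignment keeps the class balanced across folds.
        for i, pid in enumerate(pids):
            folds[i % n_folds].add(pid)
    return folds


def split_recordings(recordings, folds, test_fold):
    """recordings: list of (patient_id, item) pairs.

    All segments of a patient land on the same side of the split,
    so no patient contributes to both training and evaluation.
    """
    test_ids = folds[test_fold]
    train = [r for r in recordings if r[0] not in test_ids]
    test = [r for r in recordings if r[0] in test_ids]
    return train, test
```

Because the split is made on patient ids rather than on individual segments, any oversampling would be applied to the training portion only, after the split.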
The cross-validated performance metrics, including precision, recall, F1 score, and accuracy, for each model are presented in Tables 4 and 5. Additionally, Figure 7 depicts the cross-validated classification accuracy for each model. The performance of two additional recent architectures is compared in Table S2. Boldfaced values indicate the best performance for each metric.
Table 4 highlights that utilizing the first 5 s of each recording results in higher classification accuracy across all models compared to using only the initial 1 s segment. While all models demonstrate strong performance in correctly identifying HC subjects, they face challenges distinguishing between varying degrees of PD severity. The FS-5 dataset exhibited superior performance in classifying the different stages of PD. When considering only the recognition of HC subjects, the Swin_s model slightly outperformed the other models, with the best precision (97.00 ± 4.24), recall (100.00 ± 0), and F1 score (98.33 ± 2.36). However, its performance differed only minimally from that of the other models.
Our findings indicated that the models demonstrated better accuracy when using longer phonation samples as input. As shown in Table 5, models trained on complete audio segments, rather than just the initial segment, exhibit higher average accuracy on 5 s datasets (AS-5). However, this improvement comes at the cost of increased performance variability, as evidenced by larger standard deviations. Notably, the Swin Transformer models demonstrate the largest gain of around 3% when utilizing the AS-5 dataset. In contrast, for the 1 s dataset, there is no improvement when using the AS dataset, particularly for the ResNet models. Among the tested models, VGG19 experiences the most significant boost on the 1 s dataset when trained on all segments compared to just the initial segments.
Overall, utilizing complete audio clips for training tends to improve model accuracy, especially for the longer 5 s datasets, although this benefit is less pronounced on the shorter 1 s dataset (AS-1). In addition, visual inspection of the bar plots in Figure 7 suggests that, for this specific task, the deeper architectures do not demonstrate a substantial improvement in accuracy when compared to their shallower counterparts. Furthermore, the transformer-based model showed noticeable performance gains when trained on the AS dataset. Conversely, the CNN-based models evaluated did not exhibit significant improvements from utilizing the full segmented data.
The proposed models for 5 s datasets were evaluated using cumulative confusion matrices and receiver operating characteristic (ROC) curves across three-fold cross-validation. The confusion matrices aggregated results across folds to showcase the overall model performance. Color bars accompanying the confusion matrices illustrated the proportions of observations within each class that were correctly or incorrectly classified, with values ranging from 0 to 1. The ROC curves plotted the trade-off between the true positive rate and the false positive rate, depicting the diagnostic capability of the models. The ROC curves were constructed using a One versus Rest (OvR) method. The area under the ROC curve (AUC) signified model performance, with higher values indicating better classification ability. Across models, the AUC for the HC class approached 1.00 (Figures 8 and 9), demonstrating strong identification of healthy subjects. For PD classes, VGG16 achieved slightly higher AUCs compared to other models. Furthermore, the analysis revealed an increase in the AUC from the FS to the AS dataset, particularly for the PD_Mild class, with a 4% improvement. This suggests that the models exhibited slightly better discrimination capabilities when utilizing the full-segment dataset. Furthermore, the transformer-based models exhibited higher AUC values when trained on the larger AS-5 dataset, suggesting that these models benefited from the increased data availability for improved classification performance.
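To make the One versus Rest construction concrete, the sketch below computes a per-class AUC by treating one class as positive and all others as negative, using the rank-based (Mann-Whitney) formulation of the AUC: the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as one half. The function names and the toy probabilities are assumptions for illustration only.

```python
def binary_auc(scores, is_positive):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, probs, labels):
    """probs[i][k] is the predicted probability of labels[k] for sample i."""
    result = {}
    for k, c in enumerate(labels):
        scores = [p[k] for p in probs]
        result[c] = binary_auc(scores, [t == c for t in y_true])
    return result


# Toy example (hypothetical probabilities, not study data).
labels = ["HC", "PD_Mild", "PD_Severe"]
y_true = ["HC", "HC", "PD_Mild", "PD_Mild", "PD_Severe", "PD_Severe"]
probs = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.60, 0.30],
    [0.10, 0.30, 0.60],
    [0.05, 0.35, 0.60],
    [0.00, 0.20, 0.80],
]
aucs = one_vs_rest_auc(y_true, probs, labels)
```

In this toy case the HC class is perfectly separable, whereas the two PD classes share overlapping scores and therefore obtain AUC values below 1.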
The analysis of the confusion matrices in Figures 8 and 9 suggests that the models excel at accurately identifying samples from the HC group, exhibiting the highest precision and recall for this class. For the FS-5 dataset, there were no instances where an HC sample was incorrectly predicted as PD_Severe or vice versa. However, some instances labeled PD_Severe were misclassified as PD_Mild, and vice versa, indicating potential challenges in distinguishing between these two classes. To better evaluate the VGG16 model's accuracy for different vowels, we grouped the results by the sustained vowel present in the dataset. The confusion matrices for each vowel are shown in Figure 10. Of the vowels, /u/ had the highest recall for the HC and PD_Severe groups (100%) while having a lower recall value for the PD_Mild group (75%).
Although binary classification was not employed in this study, we combined the results to compare accuracy with previous works that utilized the Italian-speaking Parkinson's speech dataset. Specifically, we categorized HC as negative and all PD cases as positive. The accuracy results of this binary classification are summarized in Table 6.
These results are promising; however, recent studies [53,54] indicated that the models employed for pathological voice detection are typically trained using small-scale data, hindering their ability to perform consistently across diverse datasets. As a result, the performance of these models fluctuates considerably depending on the dataset encountered. This is largely due to the scarcity and variability in the quality of medical voice recordings available for training such systems [54]. This can limit model robustness compared to speech recognition systems trained on ample large-scale datasets. For greater generalizability and diagnostic precision, more consistent and substantial medical voice datasets are required.
In previous studies [11,29] on PD classification using audio recordings, researchers typically segmented the recordings into smaller parts before extracting features and training machine learning models. They assessed the models' performance on the segmented audio excerpts and reported the corresponding results for these segments, but did not provide performance results for complete audio samples. This study employed a simple ensemble method to enable a fair evaluation and comparison of different audio segmentation approaches. Specifically, we passed each segment through the trained model to get a prediction, then took the most common predicted class across all segments as the final prediction for the recording, effectively using majority voting. This allows different segmentation approaches to be compared on equal terms at the level of whole-recording classification. After using this approach, we calculated the cumulative confusion matrix and accuracy, as shown in Figure 11 for the AS-5 dataset. This is a more realistic test scenario, as in real-world applications we would need to make predictions on individual audio recordings. With this approach, the accuracy of the VGG19 model increased by around 1% compared to its results on the AS-5 dataset. Accuracy for the other models did not change significantly or even decreased slightly for this dataset. Although overall performance was lower than without ensembling, the majority-voted AS-5 results still showed slightly higher accuracy than those obtained with the FS dataset, especially for the transformer-based models. This increase is more pronounced for the AS-1 dataset, as shown in Figure S1.
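The majority-voting step described for whole-recording prediction can be sketched in a few lines. The function name and the `(recording_id, predicted_label)` input layout are illustrative assumptions; the tie-breaking rule (the label seen first among tied labels wins, as returned by `Counter.most_common`) is likewise an assumption, since the text does not specify how ties were resolved.

```python
from collections import Counter, defaultdict


def majority_vote(segment_predictions):
    """Aggregate segment-level predictions into one label per recording.

    segment_predictions: iterable of (recording_id, predicted_label).
    Returns {recording_id: most_common_label}.
    """
    votes = defaultdict(list)
    for rec_id, label in segment_predictions:
        votes[rec_id].append(label)
    # most_common(1) returns the label with the highest count; among
    # tied labels, the one encountered first is returned.
    return {rec_id: Counter(labels).most_common(1)[0][0]
            for rec_id, labels in votes.items()}


# Toy example: three segments from one recording, two from another.
segments = [
    ("rec_1", "PD_Mild"),
    ("rec_1", "PD_Severe"),
    ("rec_1", "PD_Mild"),
    ("rec_2", "HC"),
    ("rec_2", "HC"),
]
recording_level = majority_vote(segments)
```

Recording-level accuracy and confusion matrices are then computed on the aggregated labels rather than on individual segments.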

Grad-CAM Feature Visualization
Grad-CAM (Gradient-weighted Class Activation Mapping) is a visual explanation technique for CNNs [34]. Grad-CAM utilizes the gradient information from the final convolutional layer of a CNN to generate a heatmap representing the regions of the input image that are most relevant for the network's prediction. The generated heatmaps highlighted the specific regions in an LMS input image that significantly influenced the model's prediction. A comparison of the visualization results across different columns revealed key differences between the CNN-based and Swin transformer-based architectures. The CNN models demonstrated more localized attention, focusing on specific local areas in the images [56]. In contrast, the visualizations for the Swin transformer network displayed attention that was more scattered and less spatially localized.
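As a rough illustration of the standard Grad-CAM computation, the sketch below takes the feature maps of the final convolutional layer and the gradients of the target class score with respect to those maps, averages each channel's gradients into a weight, forms the weighted sum of the feature maps, and applies a ReLU. Plain nested lists are used so the example stays dependency-free; in practice these tensors would be captured from the network (for example with framework hooks). The function name and the toy values are assumptions, not the study's implementation.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap.

    activations, gradients: nested lists of shape [K][H][W] holding the
    final conv-layer feature maps and the gradients of the class score
    with respect to them. Returns an H x W heatmap scaled to [0, 1].
    """
    n_ch = len(activations)
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weights: global average pooling of the gradients.
    weights = [sum(sum(row) for row in g) / (height * width)
               for g in gradients]
    # Weighted sum of feature maps followed by ReLU.
    cam = [[max(0.0, sum(weights[k] * activations[k][i][j]
                         for k in range(n_ch)))
            for j in range(width)]
           for i in range(height)]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy example: two 2 x 2 feature maps with opposite-signed gradients.
activations = [[[1, 0], [0, 0]],
               [[0, 0], [0, 2]]]
gradients = [[[1, 1], [1, 1]],
             [[-1, -1], [-1, -1]]]
heatmap = grad_cam(activations, gradients)
```

Here the second channel has a negative weight, so its activation is suppressed by the ReLU and only the region supported by the first channel remains highlighted. The resulting low-resolution map is typically upsampled to the input size before being overlaid on the spectrogram.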

The models generally placed less emphasis on the higher frequency components of the LMSs, particularly in the range greater than 1024 Hz, suggesting that these regions were less discriminative for the classification task. However, it was noteworthy that the Swin Transformer models, in addition to their focus on lower frequencies (less than 512 Hz), also exhibited sensitivity to relatively higher frequencies when detecting healthy control subjects. Furthermore, the ResNet18 model for the healthy control class demonstrated primary activation in the high-frequency range.
When examining the temporal patterns for the healthy class, it was evident that CNN models primarily focused on the first half to the middle of the audio clips, while transformer-based models were more consistent across time frames. For the mild class, models generally concentrated on the middle period. For the severe class, VGG16 displayed a distinct pattern compared to the other studied models. This model was activated on the middle frequency range (around 2048 Hz) and the timeframes of the initial segments. Additionally, there was a moderately intense region towards the end of the spectrogram. In contrast, the other models focused more on the second half of the audio clips and lower frequencies.
Additional visualizations showcasing Grad-CAM feature maps are presented in Supplementary Figure S3.
This suggests that the network heavily relies on the spectral patterns in this specific time-frequency region, indicating that the network is also considering some higher-frequency components.

Analyzing Feature Extraction Capability
In the previous section, Grad-CAM visualizations demonstrated qualitative differences between the features extracted by different architectures on our classification FS-5 dataset.
To further analyze these representations, the t-distributed Stochastic Neighbor Embedding (t-SNE) technique can be utilized to project high-dimensional feature spaces into a 2D representation, allowing for visualization and interpretation of the learned representations.
Figure 13 presents 2D scatter plots that visualize the distribution of features extracted from the layer just before the classifier in each model. Each class is represented by a different color, allowing for visual analysis of how well the features separate the classes prior to classification.
The t-SNE visualization clearly shows three distinct clusters corresponding to the Healthy, PD_Mild, and PD_Severe classes across all models. Architectures like VGG16, Swin_s, and ResNet50 exhibit cleaner separations between these class clusters, suggesting their ability to extract more discriminative features from the log mel spectrogram images. Notably, the ResNet50 model forms the most compact clusters, indicating higher feature similarity within each class. However, there is some overlap between the PD_Mild and PD_Severe classes, particularly in the region where their feature points intersect. This overlap suggests that certain mild and severe cases may share similar feature characteristics, making it challenging to distinguish them based solely on the extracted features.
Despite the subtle overlap between PD_Mild and PD_Severe classes, all models successfully separated the Healthy class from the Parkinson's disease classes, demonstrating the effectiveness of using log mel spectrogram images for distinguishing between healthy and Parkinson's voices.

Conclusions
This study explored multi-class classification of Parkinson's disease from speech recordings using deep learning approaches. Several popular CNN and transformer models were trained on log mel spectrogram representations of sustained vowel recordings to categorize samples as healthy controls, mild Parkinson's disease, or severe Parkinson's disease, with labels based on MDS-UPDRS III scores. The models demonstrated strong capabilities to distinguish healthy samples from those with Parkinson's, achieving over 95% precision. However, they struggled to reliably differentiate between mild and severe Parkinson's, with classification precision closer to 85%. The findings revealed that models performed better when utilizing longer speech segments. The Swin transformer architecture attained the best accuracy in terms of binary classification, though its superiority over CNNs was marginal for this task. Considering overall accuracy, VGG16 can be proposed as the best model with 91.8%. Applying ensemble techniques across segments and focusing the analysis on recordings of the vowels /u/ and /o/ further improved accuracy by 1-4%. Moreover, visualization methods highlighted discriminative regions and features learned by the models, showing that transformers identify more widespread patterns while CNNs focus on localized spectrogram areas.
A key limitation of this study was the relatively small dataset size, which may have impacted the models' ability to reliably distinguish between mild and severe cases of Parkinson's disease. The limited availability of large-scale, well-annotated medical datasets can hinder the generalization capabilities of such models for real-world clinical applications.
In conclusion, this work demonstrates the potential of leveraging deep learning techniques on spectrogram inputs derived from voice recordings to enable non-invasive detection and monitoring of different stages of Parkinson's disease progression. However, to further enhance the identification of disease severity from patient voices, our future work will focus on building larger multi-class labeled datasets of Parkinson's cases. Additionally, further research could explore a broader range of state-of-the-art architectures and input representations beyond log mel spectrograms, potentially enhancing the classification accuracy.

Figure 1. The workflow diagram of our classification system.
1 years (±5.2 years) in the control group and 67.2 years (±8.7 years) in the PD group. The PD patients were further classified by their score on Part III of the MDS-UPDRS.
Figure 2 shows the histogram of audio lengths across three groups: Healthy Controls (HC), Mild Parkinson's Disease (PD_Mild), and Severe Parkinson's Disease (PD_Severe). Notably, HC samples predominantly fall within approximately 5 s, while PD groups exhibit a broader range.

Figure 2. The histogram illustrates the distribution of audio lengths across three groups: HC, PD_Mild, and PD_Severe. Most audio samples are around 5 s in length, with a count exceeding 150.

Figure 3. Overview of the process used to construct distinct datasets from the original dataset.

Figure 4. Speech sound examples. The upper panel in each example shows the acoustic waveform. The lower panel shows the corresponding log mel spectrogram representation (128 mel-bands).

Figure 5. The effects of data augmentations on LMSs: (a) the original LMS without any augmentations; (b) the LMS with time masking applied, which masks blocks of time steps and forces the model to rely more on context; (c) the LMS with frequency masking applied, which masks blocks of frequencies; and (d) the combination of these augmentations.

Figure 6. Overview of the architecture of models used in this research.

Table 2. Architectural details of the ResNet, VGG, and Swin Transformer models employed in this study, along with their respective performances on the ImageNet-1K dataset. All these models were designed to process input images with dimensions of 224 × 224 pixels.

Figure 7. Bar chart showcasing the average accuracy of studied models across modified datasets, with error bars representing the standard deviation (SD). For a clear comparison, the accuracy scale begins at 70%.

Figure 8. The cumulative confusion matrices and ROC curves show the performance of each model across three folds of cross-validation on the FS-5 dataset.

Figure 9. The cumulative confusion matrices and ROC curves show the performance of each model across three folds of cross-validation on the AS-5 dataset.

Figure 10. The cumulative confusion matrix for each sustained vowel recording for the VGG16 model. Color bars display the proportion of observations within each class that were correctly or incorrectly classified, with values ranging from 0 to 1.

Figure 11. Cumulative confusion matrix for each model after applying majority voting to predictions on the AS-5 dataset. Color bars display the proportion of observations within each class that were correctly or incorrectly classified, with values ranging from 0 to 1.

Figure 12. Grad-CAM visualizations of features for different models across various classes for the vowel /o/.

Figure 13. Visualization of feature space in 2D using t-SNE for each model.

Table 1. Demographic information, including gender and age ranges, of the dataset.

Table 3. Parameter settings for training models.

Table 4. Cross-validated classification performance (mean ± SD) for each model using the FS datasets. The table compares precision, recall, F1-score, and accuracy across models.

Table 5. Cross-validated classification performance (mean ± SD) for each model using the AS datasets. The table compares precision, recall, F1-score, and accuracy across models.

Table 6. Comparison of accuracy results obtained on the Italian-speaking Parkinson's speech dataset.