Deep Transfer Learning for Parkinson’s Disease Monitoring by Image-Based Representation of Resting-State EEG Using Directional Connectivity

Abstract: Parkinson's disease (PD) is characterized by abnormal brain oscillations that can change rapidly. Tracking neural alterations with high-temporal-resolution electrophysiological monitoring methods such as EEG can yield valuable information about the alterations observed in PD. Concomitantly, there have been advances in achieving high-accuracy performance with deep neural networks (DNNs) using few-patient data. In this study, we propose a method to transform resting-state EEG data into a deep latent space to classify PD subjects against healthy cases. We first used the general orthogonalized partial directed coherence (gOPDC) method to compute directional connectivity (DC) between all pairwise EEG channels in four frequency bands (Theta, Alpha, Beta, and Gamma) and then converted the DC maps into 2D images. We then used the VGG-16 architecture (trained on the ImageNet dataset) as our pre-trained model, took the weights of its convolutional layers as initial weights, and fine-tuned all layer weights with our data. After training, the classification achieved 99.62% accuracy, 100% precision, 99.17% recall, a 0.9958 F1 score, and a 0.9958 AUC, averaged over 10 random repetitions of training/evaluation on the proposed deep transfer learning (DTL) network. Using the latent features learned by the network and employing LASSO regression, we found that the latent features (as opposed to the raw DC values) were significantly correlated with five routinely measured clinical indices: left and right finger tapping, left and right tremor, and body bradykinesia. Our results demonstrate the power of transfer learning and latent-space derivation for the development of oscillatory biomarkers in PD.


Introduction
Parkinson's disease (PD) is the second most common neurodegenerative disorder affecting seniors [1]. The motor features of PD include bradykinesia, postural impairment, and rest tremor, while non-motor complications can include constipation, postural hypotension, depression, and dementia. Pathologically, PD is characterized primarily by the degeneration of dopaminergic cells in the substantia nigra in the midbrain [2], leading to dopamine depletion in the striatum [3]. This neurochemical alteration impairs neuronal processing in the basal ganglia [4], resulting in prominent beta-band (13-30 Hz) oscillations in local field potential recordings.
Non-invasive neuroimaging modalities (e.g., fMRI, MEG, EEG) have assisted in detecting altered brain networks in PD. As fMRI has limited temporal resolution (~sec) [5], it cannot directly detect features such as pathological oscillations seen at higher frequencies. On the other hand, fMRI is capable of assessing changes in functional connectivity between macroscopic brain regions. Here, we explore an MVAR-based DC measure to calculate connection strength between brain channels based on the gOPDC method (see Methods, below). DC is computed from EEG data for four bands: Theta (4-8 Hz), Alpha (8-13 Hz), Beta (13-30 Hz), and Gamma (30-45 Hz). We arrange the DC matrices as 2D image-based representations of the EEG signals and train a deep transfer learning model on these images. Then, we extract latent features of this model and demonstrate that these are more closely aligned with three important clinical indices, namely tremor, finger tap performance, and body bradykinesia, compared to the original DC values, with each clinical feature having a unique DC signature. A general schematic overview of the procedures carried out in this research is depicted in Figure 1.

Participants
The EEG data used in this paper are the same as in [21], where a classical machine learning algorithm, sparse discriminant analysis [22], was used to discriminate between PD and control subjects; more detailed information is given in [21]. In total, 15 PD subjects (7 males; age: 67.3 ± 6.5 years) and 18 HC subjects (9 males; age: 67.6 ± 8.9 years) participated in the experiment. Because we were utilizing the raw data in this study, we were stringent in excluding subjects that had excessive muscle artifact in the EEG; four HC and five PD subjects were excluded. Table 1 shows the characteristics of the subjects. The Hoehn and Yahr stages [23] range from 0 (no disease) through 1 (unilateral involvement only), 2 (bilateral involvement without impairment of balance), 3 (mild to moderate involvement with some postural instability), and 4 (severe disability) to 5 (wheelchair-bound or bedridden). All participants provided written, informed consent. The study procedure was approved by the Clinical Research Ethics Board at the University of British Columbia (UBC).


Study Protocols
The participants were seated in front of a computer screen, continuously gazing at a fixed target while EEG was recorded for 60 s.
The PD participants stopped taking levodopa medication and dopamine agonists for at least 12 h and 18 h, respectively, before the experiments. More detailed information is available in [21].

Directional Connectivity
Functional connectivity algorithms typically use a linear mapping between two connections, leading to symmetrical connectivity matrices between EEG channels [25]. However, there is no guarantee that abnormal connections between two brain regions will be equal in both directions [26]. To compute directional connectivity, we can assume that an asymmetrical linear operator H is applied to zero-mean sources w to form the measured signals. Then, assuming the lack of instantaneous feedback between the received signals, the power spectral matrix of these signals can be decomposed into components of H (a transfer function) and the diagonal covariance matrix Σ_w of w, as shown in Equation (1) [25], where the superscript H denotes the Hermitian operator:

S_y(f) = H(f) Σ_w H^H(f).    (1)

To compute this asymmetrical H matrix, we suppose a multivariate autoregressive (MVAR) model for the received signals, which relates the values of a "cause" source to later values of its "effect" (similar to Granger causality). In this way, a time series y(n) with L samples can be modeled by a causal MVAR of order p:

y(n) = Σ_{r=1}^{p} A_r y(n − r) + w(n),    (2)

where w = [w_1 ⋯ w_M]^T is the assumed white-noise vector with zero mean and a diagonal covariance matrix with nonzero elements E[w w^T] = diag(λ²_kk). The elements of the A_r matrices encode the linear coupling between two channels at a specific delay; for instance, the m-th row, n-th column element, a^r_{mn}, is the r-th-delay coupling between channel m (cause) and channel n (effect) [27]. The model can be time-invariant or time-varying; in the time-varying case, the values of the A_r elements depend on time.
Taking the Z-transform of both sides of Equation (2), the time-varying coefficients A_r can be related to a frequency-domain coefficient matrix by an algebraic equation (Equation (3), where Z[·] denotes the Z-transform of a function or matrix). The frequency components of H can be computed by utilizing the correspondence z = e^{j2πf} between the Fourier and Z transforms:

A(n, f) = I − Σ_{r=1}^{p} A_r(n) e^{−j2πfr}.    (3)

The A(n, f) matrix is related to the power component in each time and frequency bin. Partial directed coherence (PDC) is a technique to compute directional connectivity in the frequency domain:

π_{kl}(n, f) = A_{kl}(n, f) / sqrt( a_l^H(n, f) a_l(n, f) ),    (4)

where a_l(n, f) is the l-th column of A(n, f). Equation (4) implies that the PDC from channel l to k is the fraction of the power in the k-th series that is due to the past of the l-th series. This method, based on a frequency-domain picture of Granger causality, uses directional connections to compute the strength of the causal link between channels. In this manner, the null hypothesis for Granger causality is that there is no connection from the l-th to the k-th region (H_0: π_{kl}(n, f) = 0) [28].
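As a concrete illustration, the plain PDC of Equation (4) can be computed from fitted MVAR coefficients roughly as follows. This is a minimal sketch: the coefficient-ordering convention and the toy coefficient values are assumptions for illustration, and the paper's gOPDC adds the scale-invariance and orthogonalization refinements described below.

```python
import numpy as np

def pdc(A_r, f):
    """Plain partial directed coherence |pi_kl(f)| at normalized frequency f.

    A_r : array of shape (p, M, M) holding the MVAR coefficient matrices
          A_1..A_p. Here A_r[r-1, m, n] couples channel n (source) to
          channel m (target) at lag r; indexing conventions vary between
          toolboxes, so this is an assumption of the sketch.
    """
    p, M, _ = A_r.shape
    # Frequency-domain coefficient matrix: A(f) = I - sum_r A_r e^{-j 2 pi f r}
    A_f = np.eye(M, dtype=complex)
    for r in range(1, p + 1):
        A_f -= A_r[r - 1] * np.exp(-2j * np.pi * f * r)
    # Normalize each column l by its Euclidean norm (outflow of channel l)
    denom = np.sqrt(np.sum(np.abs(A_f) ** 2, axis=0))
    return np.abs(A_f) / denom

# Toy 2-channel MVAR(1) in which channel 0 drives channel 1 but not vice versa
A = np.zeros((1, 2, 2))
A[0] = [[0.5, 0.0],
        [0.4, 0.5]]
P = pdc(A, f=0.1)  # P[1, 0] (flow 0 -> 1) is large; P[0, 1] (1 -> 0) is zero
```

In the toy example, the asymmetry of the coupling matrix is recovered as an asymmetric PDC matrix, which is exactly the property that symmetric functional connectivity measures cannot capture.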
Unfortunately, PDC as typically written has two major disadvantages. First, it is not scale-invariant, i.e., if one of the signals is amplified by a constant value and its spectral representation is unchanged, then the value of PDC is reduced [29]. To overcome this issue, each power component in Equation (4) can be divided by the related element in the cross-covariance matrix to obtain an estimate for general partial directed coherence (gPDC). Another drawback of the PDC is its sensitivity to volume conduction effects. Volume conduction is a serious problem for directional connectivity computations of EEG signals [30] due to the relatively poor spatial resolution of scalp EEG. A single source on the brain, through volume conduction, can reach adjacent channels leading to erroneous inferences about connectivity [18].
In [31], a method was proposed to orthogonalize the power envelopes of the received signals to suppress volume conduction effects. By orthogonalizing the power signals at each channel, adjacent channels do not share the trivial co-variability produced by common sources. Reference [18] proves that the aforementioned orthogonalization procedure can be reflected in the imaginary part of the cross-spectral density between two channels using MVAR modeling of the received signals. This method (using the time-scaling technique) is called general orthogonalized partial directed coherence (gOPDC). If we assume that at most K independent sources affect each EEG channel and that the sources add linearly, such a relationship based on Equation (1) in the frequency domain is:

Y_i(f) = Σ_{k=1}^{K} v_{ik} S_k(f),  i = 1, …, M.    (5)

Equation (5) can be written in matrix form:

Y(f) = V S(f),    (6)

where Y(f) ∈ C^M is a multi-channel EEG datum in the frequency domain, S(f) ∈ C^M is the multivariate source signal in the frequency domain, and V ∈ R^{M×M} contains all source weights.
The cross-spectral density function C_ij(f) between Y_i(f) and Y_j(f) is defined by Equation (8):

C_ij(f) = E[ Y_i(f) Y_j^*(f) ].    (8)

The MVAR model aids in checking the directional connectivity in the cross-spectral density. The innovative idea behind [18] is that, instead of requiring two orthogonal signals, one searches for orthogonal MVAR representations. The imaginary part, Imag(C_ij(f)), in effect subtracts the volume conduction effects from the received signal. It can be shown that this value can be computed by Equation (9) [27], where Σ_w^{-1} is the inverse covariance matrix of w and a_l(n, f) is the l-th column of A(n, f); the inverse covariance value is used to ensure scale invariance.

Analogous to the definition of gPDC, OPDC can be extended to gOPDC, Ψ_kl(n, f), by taking the effect of time-series scaling into consideration (Equation (10)). We computed DC from the EEG signals during rest using the model in Equation (1) and the formula of Equation (10). Each element of the DC matrix is the mean over time components and frequency bins in the band of interest derived from Equation (10). After computation of gOPDC via Equation (10), we carried out the following steps:

1. In order to increase our sample size, for each of the 4 frequency bands we created a 2D matrix (27 × 27) as a heatmap: there were 27 channels, each of which can be directionally connected to the 26 others (connectivity of a channel with itself is ignored). To obtain a square matrix for the heatmap, we set the main diagonal components of the 27 × 27 matrix to zero.

2. We normalized each DC matrix (27 × 27) individually using the min/max approach.

3. We represented the results by computing heat maps, i.e., 2D representations of the values in a data matrix in which color intensity encodes the value. Figure 2 shows two sample heat maps, one from a PD case and one from an HC case. Larger values are represented by lighter pixels and smaller values by darker pixels.

4. We resized the heatmap images to 224 × 224 to fit the input size of the VGG-16 architecture.

After we computed heat maps of all four bands (Theta, Alpha, Beta, and Gamma), we obtained 72 heatmaps for the 18 HC subjects (18 subjects × 4 bands) and 60 heatmaps for the 15 PD subjects (15 subjects × 4 bands). These were then used as compact alternative representations of the EEG dataset for the binary classification task via deep transfer learning.
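The four steps above can be sketched in NumPy as follows. A hypothetical random matrix stands in for a real gOPDC matrix, and nearest-neighbor resizing is used to keep the sketch self-contained; any standard image-resizing routine would serve in practice.

```python
import numpy as np

def dc_to_heatmap(dc, size=224):
    """Turn a 27x27 directional-connectivity matrix into a normalized,
    resized heatmap image, following the pipeline described above."""
    m = dc.astype(float).copy()
    np.fill_diagonal(m, 0.0)                 # step 1: ignore self-connectivity
    m = (m - m.min()) / (m.max() - m.min())  # step 2: per-matrix min/max scaling
    # step 4: nearest-neighbor resize to size x size
    idx = np.arange(size) * m.shape[0] // size
    return m[np.ix_(idx, idx)]

rng = np.random.default_rng(0)
img = dc_to_heatmap(rng.random((27, 27)))  # stand-in for a gOPDC matrix
```

The resulting `img` has all values in [0, 1] and the 224 × 224 shape expected by VGG-16 (step 3's color rendering is left to the plotting library).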

Transfer Learning
Traditionally, the goal of transfer learning is to solve the problem of the domain gap when the knowledge gained on training (source) data needs to be re-used on different test (target) data. This helps with the problem of insufficient training data [32]. The early layers of a DCNN (Deep Convolutional Neural Network) trained on a large dataset can extract general features, as shown by [33,34]. Therefore, by using deep transfer learning from a model pre-trained on a large dataset, we can fine-tune the model on our small dataset and solve the binary PD/HC classification task much more efficiently.

1. We used the VGG-16 architecture [35] trained on the ImageNet dataset as our pre-trained model, and we used only the weights of its convolutional layers.

2. To better fit our classification task, we modified the base VGG-16 model as follows: (A) implementing a fully connected layer (1 × 1 × 512) after the last max-pooling layer (7 × 7 × 512); (B) adding a fully connected layer (1 × 1 × 64) with "ReLU" activation to better map the reduced features to the last layer; (C) applying a fully connected layer (1 × 1 × 2) with "Softmax" activation for the classification task, which consists of two categories, Parkinson's disease and healthy controls.

3. We tried two fine-tuning techniques. The first, which we call the totally-trained model, is represented in Figure 3; the alternative is presented in Appendix A for further information.

4. To implement the first technique, we used the weights of the convolutional layers as initial weights and updated all layers' weights with our data. The network architecture is shown in Figure 3, and Tables 2 and 3 give the details of this network. In addition, Figure A1 shows the best performance of the proposed model over the epochs.
We compared the best performance of both techniques (see Appendix A for the alternative) to choose the best fine-tuning strategy. As shown in Table 4, the totally-trained model surpassed the limited-trained model (expanded upon in Appendix A). Our implementation was written in Python using Keras with the TensorFlow backend and required 43 s of training on Google Colab. The test set comprised a randomly selected 25% of the data.
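The modified architecture and "totally trained" fine-tuning setup of steps 1-4 can be sketched in Keras roughly as follows. Hyperparameters such as the optimizer and loss are illustrative assumptions, not taken from the text, and `weights=None` is used only to keep the sketch download-free; the study initializes from `weights="imagenet"`.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Base VGG-16 without ImageNet's classifier head. The study initializes
# from weights="imagenet"; None here avoids downloading the weights.
base = tf.keras.applications.VGG16(weights=None, include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = True  # "totally trained": all layer weights are updated

model = models.Sequential([
    base,
    layers.Flatten(),                       # last max-pooling output: 7x7x512
    layers.Dense(512),                      # (A) 1x1x512 fully connected layer
    layers.Dense(64, activation="relu"),    # (B) 1x1x64 latent layer with ReLU
    layers.Dense(2, activation="softmax"),  # (C) two classes: PD vs. HC
])
model.compile(optimizer="adam",             # illustrative choice
              loss="categorical_crossentropy",
              metrics=["accuracy"])

probs = model(tf.zeros((1, 224, 224, 3)))   # (1, 2) class probabilities
```

The limited-trained alternative of Appendix A would instead set `base.trainable = False` so that only the newly added dense layers are updated.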

Feature Extraction from the Network
To investigate the clinical relevance of the features learned by the model, we saved the weights of the best model, fed every data sample in, and extracted the features of the second-to-last layer. This resulted in a (1 × 64) vector for each data point for each subject.
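The second-to-last-layer extraction can be sketched as follows. A hypothetical toy backbone stands in for the fine-tuned network so the snippet stays self-contained; only the re-wiring mechanics mirror the procedure above.

```python
import numpy as np
from tensorflow.keras import layers, models

# Stand-in classifier with the same head shape as the fine-tuned network
# (toy backbone for illustration; the real model is the fine-tuned VGG-16).
clf = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation="relu", name="latent"),  # second-to-last layer
    layers.Dense(2, activation="softmax"),
])

# Re-wire a model that stops at the 64-unit latent layer
extractor = models.Model(inputs=clf.input,
                         outputs=clf.get_layer("latent").output)

# One (1 x 64) latent feature vector per input heatmap
features = extractor(np.zeros((1, 224, 224, 3), dtype="float32"))
```

Feeding every subject's heatmaps through such an extractor yields the 64 latent features used in the LASSO analysis below.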
In [21], after the computation of phase-locking value (PLV) as a functional connectivity metric, sparse discriminant analysis [36] was used to find different main connections between HC and PD cases. This reference suggested that PLV variability lowers after the development of PD, and there is a significant negative correlation between PLV variability and disease severity. Here, we used the same EEG data as [21] and acquired new features from the above DC approach by transfer learning. We then used the latent space features learned by the network to determine if the latent space is associated with disease severity.
We used the aforementioned (1 × 64) vector as 64 new latent features for the DC of each subject. The UPDRS of healthy subjects was considered zero to incorporate both HC and PD connectivity values in the correlation computation [37]. To investigate whether these extracted features differed from raw connectivity values in predicting clinical scores, we fitted separate LASSO regression models [38] to both the extracted latent features and the raw DC values [39]. We then computed the correlation between the actual clinical scores and the scores estimated by the fitted models.
For the clinical scores, we checked for 5 sub-scores including both symmetric and asymmetric symptoms to compare latent and non-latent space results: Right and left "Finger tapping performance" and "Tremor" may present asymmetrically, while "Body Bradykinesia" is midline and is not associated with a specific side.
The LASSO regression models were computed over 30 runs, separately for the latent and non-latent cases. In each run, out of 33 observations (33 subjects' data), 25 randomly chosen observations were assigned to the training set and the remaining 8 to the test set. In addition, as there were five statistical tests (the null hypothesis being that the resulting correlation is not significantly different from zero), p-values were corrected for multiple comparisons using the FDR method; corrected p-values below 0.05 were considered significant. After the 30 runs of LASSO, the average of the 30 best beta-coefficient vectors (the coefficients yielding the least MSE in each run) was taken as the coefficients of the final LASSO model, which was then applied to the latent-space and non-latent-space data to estimate each clinical score.
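The FDR correction step can be sketched with a small Benjamini-Hochberg routine, written here in plain NumPy (the p-values in the example are hypothetical; the per-run models themselves would come from an L1-penalized solver such as scikit-learn's `Lasso`):

```python
import numpy as np

def fdr_bh(pvals):
    """Benjamini-Hochberg FDR correction, returning adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downward
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adj, 0.0, 1.0)
    return out

# Hypothetical raw p-values for five tests (one per clinical score)
adj = fdr_bh([0.001, 0.02, 0.03, 0.04, 0.20])
```

After correction, any adjusted p-value below 0.05 would be declared significant, as in the procedure described above.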

PD Classification Performance
To compare the proposed method against the variant in which only the last four layers were fine-tuned (Appendix A) and against a prior DL model on the same data set [8], we evaluated it on randomly selected test data. We tested the model 10 times (on randomly chosen train/test partitions). The training accuracy was consistently 100%, while the test accuracy varied between 91% and 100%.
The average values of the performance metrics are reported in Table 5. For each of the 10 repetitions, we extracted the last layer's scores assigned to each class (HC/PD) as predicted classes, compared them to the actual classes, and computed the True Negative, True Positive, False Negative, and False Positive counts shown in Figure 4.
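The performance metrics follow from the confusion counts by the standard formulas. The counts below are hypothetical, chosen only to illustrate the arithmetic (note how zero false positives yields 100% precision); they are not the study's actual tallies.

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # a.k.a. sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for illustration only
acc, prec, rec, f1 = metrics_from_counts(tp=119, fp=0, tn=145, fn=1)
```

With these illustrative counts, precision is exactly 1.0 and recall is 119/120 ≈ 0.9917, showing how near-perfect counts translate into the kinds of averages reported in Table 5.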
In [15], a CNN model was suggested comprised of four consecutive blocks, each containing a 1D convolutional layer, a max-pooling layer, and a ReLU activation layer, followed by three fully connected layers with a Softmax activation function; the learning rate was 0.001 and the optimizer was Adam. Previous work on the same data set applied a variety of machine learning models, including "KNN", "SVM-L", "SVM-P2", "SVM-P3", "SVM-RBF", and "RF", to 779 extracted time-series features, including autocorrelation, quantiles, and the number of zero-crossings [8]. In addition, univariate statistical analyses (ANOVA) identified significant features at a significance level of 0.05, and the models were then trained on these significant features. The best performance among these models is reported as "SVM-RBF" in Table 5. The proposed approach achieved 99.62% accuracy, 100% precision, 99.17% recall, a 0.9958 F1 score, and a 0.9958 AUC. To quantitatively compare the correlation scores found by the LASSO models in the latent and non-latent spaces, we used t-statistics on the Fisher z-transform of the correlation values to test whether the latent-space correlations across the 30 runs were significantly higher than the non-latent-space correlations.

Clinical Relevance of Deep-Transfer-Learning-Based Classification
The t-test outcomes are shown in Table 6 and demonstrate the superiority of the latent space in predicting clinical aspects of the disease.
Since it is difficult to visualize the heat map that would maximize each feature, we instead identified, for each feature, which subject's DC image (heat map), in which health status (HC or PD) and in which frequency band, produced the highest value for that feature (Table 7). The 10 highest values of each such DC matrix are shown in an arrow-type plot in Figure 6.
To establish the most influential EEG bands in non-latent space, we drew the main neural oscillations related to the right and left finger tap (the two cases where the non-latent data performed most clearly worse than the latent-space data) for the non-latent data (Figure 7).


Discussion
We found that DC measures are useful for discriminating PD subjects from controls. Importantly, we found that the latent space obtained by TL was more closely aligned with clinical indices related to movement, tremor, and body bradykinesia. It is perhaps unsurprising that the latent space, which will likely contain non-linear combinations of DC, would more closely align with clinical performance, as there is presumably a complex non-linear mapping between EEG connectivity measures and crude clinical performance metrics. What is perhaps surprising is that TL from a network trained on natural images (as the VGG-16 network used here is) results in a latent space that is more closely aligned to clinical indices.
We are acutely aware that our study may potentially be under-powered. However, we note that (1) as is typical with patient populations, the number of subjects tends to be quite low as a result of the challenges associated with obtaining recordings in these populations; (2) we have spent particular effort in ensuring that overfitting has not taken place; (3) we are using transfer learning, which is a key aspect of the novelty of the proposed approach, and the whole point of transfer learning is to allow smaller data sets to be employed for informed descriptions of disease states; and (4) we augmented the data by a factor of four by generating independent heat maps for each frequency sub-band.
The goal of using LASSO was to (1) select a sparse set of features and then (2) solve the regression problem to linearly map these sparse features to the target clinical scores. The beta coefficients are found by jointly minimizing the mean square error (MSE) between the original and modeled values and applying an L1-norm constraint on the coefficients to promote sparsity. In Figure 5 we show that the LASSO method using the latent-space values more closely predicts the clinical scores than the DC values used to develop the latent space. This demonstrates that there exists a linear mapping from the extracted latent-space features to the five clinical scores considered in this paper.
It is interesting that the network was agnostic to the particular band used, yet still provided excellent classification performance. This implies that the presence of effective connectivity between pairs of channels may be more informative for disease discrimination than the particular frequencies involved. Note that we were still able, post hoc, to infer which frequency bands were most informative by deducing which particular heat maps were most informative in the final latent space.
It is also notable that many of the DC connections in the latent space appear ipsilateral to the hand in the finger tapping and tremor tasks (Figure 6). While mirror movements have been suggested as a reason for ipsilateral activation in PD subjects [40], it is important to note that the features we isolate best discriminate between PD and HC subjects. PD subjects tend to demonstrate more bilateral activation with unilateral movement than HC subjects [41]. Thus, the connections that best discriminate between PD and HC subjects will tend to be ipsilateral to the movement. We note that most of the discriminating features are in the beta band (Table 7). This is hardly surprising, given the prominent role of altered beta-band oscillations in PD pathophysiology [42].
The proposed method had better results in terms of accuracy, precision, AUC, and F1 score compared to fine-tuning only the last four layers (Appendix A), a prior DL model on the same data set [8], and another deep learning approach for PD diagnosis [36], with approximately 10% of the run time of [8]. This likely reflects the relatively small data set, such that the other DL approaches could not fully capture the discrimination subspace given the data available. The transfer learning approach adopted here would appear to make discrimination tasks on smaller data sets more feasible.
Finally, we note that the maximum values of a feature originated from different connections, different subjects, and different bands. This suggests that the features we found were not based on overfitting a small subset of subjects.

Conclusions
Computing DC in the EEG provided us with an innovative technique to generate image representation of the 1D signal. Transfer learning achieved excellent classification performance and provided a latent space that is more closely related to clinical indices. Our results suggest transfer learning applied to an image classifier incorporating heatmaps may provide a way to assist in the classification of small-sample data sets.

Informed Consent Statement:
The patients/participants provided their written informed consent to participate in this study.

Data Availability Statement:
De-identified data will be made available via written request to the corresponding author. In this paper, we used MATLAB and Python. All code and modules are accessible via a GitHub repository, which also includes a Readme file describing how to run the code step by step to generate the results [43].

Acknowledgments:
The authors would like to thank Ramy Hussain for his invaluable time and efforts in consulting with us about the network design.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
The best performance of the proposed model, classifying PD and HC subjects, is shown in Figure A1.
In an alternative technique, we froze the weights of the convolutional layers of the pre-trained model and updated only the last four layers with our data; this network is shown in Figure A2 (limited-trained network architecture). The details of this network are given in Tables A1 and A2, and its best performance is shown in Figure A3. Figure A5 depicts the mean value of the beta coefficients after 30 runs of LASSO. If an element of the average beta vector is nonzero in at least 70% of the 30 runs, that element is kept in the final set of coefficients; otherwise, it is discarded. LASSO coefficients can vary across runs, which causes unstable results, so we heuristically found that requiring stability over 70% of the runs leads to more reliable outcomes.