This section presents the experimental results derived from training and evaluating the proposed EfficientNetV2-S-based framework on the Parkinson’s Augmented Handwriting Dataset. The outcomes are systematically analyzed in terms of classification performance, robustness across handwriting variations, and comparisons with existing state-of-the-art approaches.
4.1. Experimental Setup
All experiments were conducted in a Python 3.11.13 environment using the TensorFlow–Keras deep learning framework. The key libraries and their versions were PyTorch 2.8.0, Torchvision 0.23.0, OpenCV 4.12.0, NumPy 2.0.2, and Scikit-learn 1.6.1. For Canny edge detection, a central preprocessing step, OpenCV’s default parameters were used with automatic threshold calculation. This precise software environment ensures full reproducibility of our preprocessing and training pipeline. Model development and training were carried out in Google Colab Pro, which provides access to an NVIDIA Tesla T4 GPU with 15 GB of RAM and 100 GB of cloud storage.
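The text does not specify how the automatic Canny thresholds were computed. As one possible reading, the sketch below uses a common median-based heuristic; the function name `auto_canny_thresholds` and the `sigma` value of 0.33 are illustrative assumptions, not parameters reported in this study.

```python
from statistics import median


def auto_canny_thresholds(pixels, sigma=0.33):
    """Return (lower, upper) hysteresis thresholds for 8-bit intensities.

    Assumption: the thresholds sit a fraction `sigma` below and above the
    median intensity. The paper does not state which rule was actually used.
    """
    m = median(pixels)
    lower = int(max(0, (1.0 - sigma) * m))
    upper = int(min(255, (1.0 + sigma) * m))
    return lower, upper


# Flattened intensities of a hypothetical 8-bit grayscale image.
sample = [0, 40, 80, 100, 120, 160, 255]
lower, upper = auto_canny_thresholds(sample)
# The pair would then be passed to OpenCV as cv2.Canny(image, lower, upper).
```

Deriving the thresholds from the image median ties the edge detector to each sample's overall brightness, which may matter for scans with different ink density or paper tone.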
The implementation leveraged pretrained ImageNet weights for EfficientNetV2-S, with fine-tuning performed on the handwriting dataset. Training was configured with a batch size of 16 for a maximum of 10 epochs, employing early stopping to terminate training when the validation loss failed to improve. Optimization was performed using the Adam optimizer with an initial learning rate of 0.001, dynamically adjusted via a ReduceLROnPlateau scheduler. To ensure reproducibility, random seeds were fixed for dataset shuffling and parameter initialization. On-the-fly data augmentation was performed using Keras’ ImageDataGenerator, enabling exposure to new handwriting variations at every epoch. Model performance was evaluated using accuracy, precision, recall, F1-score, ROC curves, and confusion matrices.
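The interaction between early stopping and plateau-based learning-rate reduction can be sketched in a framework-agnostic way. The snippet below replays a validation-loss history through a simplified version of both rules. The initial learning rate (0.001) and the 10-epoch cap come from the text; `lr_patience`, `factor`, and `stop_patience` are illustrative assumptions, since the corresponding values are not reported.

```python
def run_schedule(val_losses, lr=1e-3, lr_patience=2, factor=0.5,
                 stop_patience=4, max_epochs=10):
    """Replay a validation-loss history and return (epochs_run, final_lr).

    lr_patience, factor and stop_patience are assumed values for
    illustration only; they are not taken from the paper.
    """
    best = float("inf")
    since_best = 0    # epochs since the best loss (drives early stopping)
    since_change = 0  # epochs since the best loss or the last LR cut
    epochs_run = 0
    for loss in val_losses[:max_epochs]:
        epochs_run += 1
        if loss < best:
            best = loss
            since_best = 0
            since_change = 0
        else:
            since_best += 1
            since_change += 1
        if since_change >= lr_patience:
            lr *= factor          # plateau detected: shrink the learning rate
            since_change = 0
        if since_best >= stop_patience:
            break                 # no improvement for too long: stop training
    return epochs_run, lr


# Hypothetical loss history that stops improving after the second epoch.
epochs_run, final_lr = run_schedule([0.50, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45])
```

With a shorter patience for the scheduler than for early stopping, the learning rate is reduced at least once before training is abandoned, which is the usual reason for combining the two.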
4.4. Cross-Validation Performance
To further validate the robustness of the proposed framework, 5-fold cross-validation was performed on the combined dataset. The dataset was randomly partitioned into five equal subsets; each subset served as the validation set once, while the remaining four were used for training. Performance was averaged over the five folds to ensure a fair evaluation. The cross-validation results indicated strong generalization ability with minimal performance variance across folds. The accuracy per fold is illustrated in Figure 9.
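The partitioning described above can be sketched as follows. This is a minimal, plain (non-stratified) split written for illustration; the function name `k_fold_indices` and the seed value are assumptions, and the text does not say whether class stratification was applied.

```python
import random


def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for a reproducible split
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# Example with a hypothetical dataset of 23 images.
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(23), start=1):
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} validation")
```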
The model demonstrated consistently high validation accuracy across all folds, ranging from 96.28% to 99.47%, reflecting its strong ability to generalize. Fold 4 showed the weakest performance, with a slightly higher validation loss (0.0968) and lower accuracy (96.28%), indicating some variability in that subset of data. In contrast, Fold 5 achieved the best results, delivering a near-perfect accuracy of 99.47% and the lowest validation loss (0.0342), highlighting excellent discriminative capability. Overall, the validation losses remained low across all folds, underscoring the model’s stability and robustness.
Table 5 shows a summary of the 5-fold validation accuracy, along with the standard deviation, demonstrating the performance of the proposed model.
These results confirm that the EfficientNetV2-S-based framework is not only highly accurate but also stable across different data partitions, thereby mitigating the risk of overfitting or bias toward a specific training–validation split.
4.5. Baseline CNN Performance
To establish a benchmark, a standard CNN was trained using the same preprocessed spiral and wave handwriting datasets. The CNN achieved validation accuracies of 0.7143 on wave images and 0.7632 on spiral images, as shown in Figure 10. While these results indicate that the CNN captured relevant handwriting features for Parkinson’s detection, its performance was substantially lower than that of the proposed EfficientNetV2-S framework. These findings highlight the advantage of deeper architectures with transfer learning and enhanced preprocessing in capturing subtle neuromotor irregularities.
The baseline CNN model showed noticeable variations in performance across both handwriting tasks. For the spiral dataset, the model achieved a high recall of 0.95 but a lower precision of 0.69 for the healthy class, resulting in an F1-score of 0.80. Conversely, for the Parkinson’s class, the model recorded a higher precision of 0.92 but a significantly lower recall of 0.58, leading to an F1-score of 0.71. For the wave dataset, performance declined further, with the model obtaining precision, recall, and F1-score values of 0.61, 0.58, and 0.59, respectively, for the healthy class. The model achieved slightly better results for the Parkinson’s class, with a precision of 0.77, a recall of 0.79, and an F1-score of 0.78. These results indicate that the CNN baseline struggled with balanced detection, particularly in distinguishing healthy samples, reinforcing the need for advanced models to improve sensitivity and robustness across both handwriting tasks. The classification results of the standard CNN model on spiral and wave handwriting tasks are shown in Table 6.
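The reported F1-scores follow from the precision and recall values quoted above, since F1 is their harmonic mean. The short check below recomputes each score from the figures in the text; only the helper name `f1_score` and the dictionary layout are introduced here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as quoted in the text for the baseline CNN.
reported = {
    ("spiral", "healthy"): (0.69, 0.95, 0.80),
    ("spiral", "parkinson"): (0.92, 0.58, 0.71),
    ("wave", "healthy"): (0.61, 0.58, 0.59),
    ("wave", "parkinson"): (0.77, 0.79, 0.78),
}

for (task, label), (p, r, f1) in reported.items():
    print(f"{task:6s} {label:9s} computed={f1_score(p, r):.2f} reported={f1:.2f}")
```

In all four cases the recomputed value agrees with the reported one to two decimal places.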
Figure 11 presents confusion matrices of the baseline CNN model on both the spiral and wave datasets, highlighting class-wise prediction strengths and misclassifications.
Figure 12 shows the ROC curves of the baseline CNN model, illustrating its ability to distinguish between healthy and Parkinson’s classes on both the spiral and wave datasets.
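The area under an ROC curve can be read as the probability that a randomly chosen Parkinson’s sample receives a higher score than a randomly chosen healthy sample. The sketch below computes it directly from that definition; the function name `roc_auc` and the toy scores are illustrative and are not results from this study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = Parkinson's, 0 = healthy; scores are made-up probabilities.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))
```

This pairwise form is quadratic in the number of samples, which is acceptable for small validation sets; a rank-based implementation would be preferable for larger ones.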
4.8. Discussion
This study presented a handwriting-based framework for PD detection, integrating spiral and wave handwriting tasks, advanced preprocessing techniques, and fine-tuning of the EfficientNetV2-S architecture. The strong predictive performance and robustness of the proposed framework validate its effectiveness. A preprocessing pipeline involving histogram equalization, Canny edge detection, and synthetic RGB fusion improved feature extraction compared to using raw grayscale images: it enhanced contrast and highlighted stroke boundaries, enabling the network to learn subtle motor irregularities more effectively. This indicates that preprocessing is a critical factor in improving generalization and sensitivity.
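Two of the preprocessing steps named above, histogram equalization and three-channel fusion, can be sketched in dependency-free form. The snippet operates on images stored as lists of rows and takes the edge map as an input rather than reimplementing Canny. The channel order (grayscale, equalized, edges) and both function names are assumptions made for illustration; the text does not state how the channels were assigned.

```python
def equalize_histogram(gray):
    """Histogram-equalize an 8-bit grayscale image given as a list of rows."""
    flat = [v for row in gray for v in row]
    hist = [0] * 256
    for v in flat:
        hist[v] += 1
    cdf, running = [0] * 256, 0
    for i, count in enumerate(hist):
        running += count
        cdf[i] = running
    cdf_min = next(c for c in cdf if c > 0)
    total = len(flat)
    if total == cdf_min:  # constant image: nothing to stretch
        return [row[:] for row in gray]
    # Map each intensity through the normalized cumulative distribution.
    lut = [round((c - cdf_min) / (total - cdf_min) * 255) if c >= cdf_min else 0
           for c in cdf]
    return [[lut[v] for v in row] for row in gray]


def fuse_channels(ch0, ch1, ch2):
    """Stack three same-sized single-channel planes into an H x W x 3 image."""
    return [[[a, b, c] for a, b, c in zip(r0, r1, r2)]
            for r0, r1, r2 in zip(ch0, ch1, ch2)]


gray = [[50, 50], [100, 200]]        # tiny hypothetical grayscale image
edges = [[0, 0], [255, 255]]         # placeholder standing in for a Canny edge map
fused = fuse_channels(gray, equalize_histogram(gray), edges)
```

Stacking the three views gives an ImageNet-pretrained backbone the three-channel input it expects while exposing intensity, contrast, and contour information side by side.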
Individually, the models trained on spiral and wave inputs achieved accuracies of 98.68% and 98.10%, respectively. When both modalities were fused, a mean validation accuracy of 97.67% ± 1.09% (0.9767 ± 0.0109) was achieved across five folds. These results suggest that spiral drawings capture tremor-induced distortions and curvature irregularities, while wave patterns reflect motor rhythm and smoothness. Together, they create a richer representation of motor function. Fine-tuning EfficientNetV2-S on handwriting data yielded higher accuracy and robustness than conventional CNNs and handcrafted feature-based approaches, and the model delivered consistently high accuracy across folds. Its fused MBConv blocks, SiLU activation, and progressive learning strategy effectively leveraged limited handwriting data while maintaining computational efficiency, making it well-suited for clinical and real-world applications.
The baseline CNN results on the wave and spiral datasets, with accuracies of 71.43% and 76.32%, respectively, provide further evidence that simple convolutional architectures can identify gross handwriting differences between healthy and Parkinson’s subjects. The superior performance of EfficientNetV2-S underscores the importance of deeper feature hierarchies and robust preprocessing pipelines for reliable clinical applications. The proposed approach demonstrates that handwriting analysis combined with modern deep learning techniques can achieve near state-of-the-art diagnostic performance. The key contributions include establishing a reliable preprocessing pipeline to enrich feature representation from simple handwriting samples, demonstrating that combining spiral and wave tasks enhances diagnostic accuracy and stability, and validating EfficientNetV2-S as an efficient yet powerful backbone for handwriting-based biomedical analysis.
Earlier studies primarily relied on handcrafted feature extraction, such as stroke length, curvature, and velocity, combined with classifiers including support vector machines (SVMs) and random forests. Although computationally efficient, these methods exhibited limited generalization capacity and strong dependence on manual feature design. Subsequent research explored applying CNNs directly to spiral or wave handwriting samples. While CNNs successfully captured local pixel-level features, they often failed to generalize across diverse handwriting styles and required substantial preprocessing to mitigate inter-subject variability.
In contrast, the proposed EfficientNetV2-S-based framework integrates a structured preprocessing pipeline incorporating grayscale conversion, histogram equalization, and edge detection, followed by synthetic RGB fusion, enabling the model to learn richer and more discriminative representations of handwriting irregularities. The incorporation of data augmentation further enhances robustness, while transfer learning from ImageNet weights enables efficient convergence despite the limited size of the medical dataset. The proposed approach offers three key advantages. First, it eliminates manual feature engineering, as stroke-level irregularities are learned directly from enriched images. Second, its stable performance across folds (±0.0109) compares favorably with earlier CNN-based studies, which exhibited greater variability due to dataset imbalance and smaller sample sizes. Third, its balanced accuracy and computational efficiency position EfficientNetV2-S as a practical solution for deployment in real-world clinical applications.
Despite promising results, several limitations remain. The risk of overfitting remains non-trivial given the modest underlying dataset, despite our efforts to mitigate it through extensive data augmentation and cross-validation. Although these strategies are effective, they may not fully capture the immense biological and stylistic variability present in the general population. This leads directly to the significant challenge of real-world deployment. The model’s performance could be substantially impacted by handwriting variations influenced by factors completely outside the current dataset’s scope, such as diverse literacy levels, cultural writing conventions, and different digital acquisition hardware. Therefore, while the augmentation strategy improves robustness, it does not eliminate the fundamental need for larger and more demographically heterogeneous datasets to ensure true generalizability.
However, the successful external validation of our framework on an independent dataset provides strong preliminary evidence of its generalizability. The model’s ability to accurately classify samples from an unseen dataset suggests that the learned feature representations of Parkinson’s spiral and wave handwriting tasks are robust and transferable, mitigating the immediate concerns of overfitting to the original dataset’s specific characteristics.
The current approach focuses on static handwriting images. Incorporating temporal dynamics such as stroke velocity, pen pressure, and trajectory could provide richer diagnostic cues. Practical deployment will require validation through clinical trials, explainable artificial intelligence (XAI) tools, and user-friendly interfaces to foster trust among healthcare professionals. The critical need for trust and interpretability could be addressed in subsequent work by integrating XAI techniques such as Gradient-weighted Class Activation Mapping (Grad-CAM). These methods would generate visual explanations by highlighting the specific regions in a spiral or wave drawing that most influenced the model’s classification decision. For instance, we would expect a trustworthy model to focus on areas exhibiting tremor, micrographia, or irregular stroke curvature for Parkinson’s disease samples, while highlighting smooth, continuous trajectories for healthy controls. The development and validation of such explainable interfaces are necessary steps for the clinical adoption of this technology.
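To make the proposed Grad-CAM step concrete, the core computation is sketched below on plain nested lists. It assumes that the feature maps of a chosen convolutional layer and the gradients of the class score with respect to those maps have already been extracted from the network; the function name `grad_cam` and the toy inputs are illustrative and are not part of the present study.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from K feature maps and their class-score gradients.

    Both arguments are lists of K maps, each a list of H rows of W floats.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = global average of that channel's gradient map.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    heatmap = [[0.0] * width for _ in range(height)]
    for w, fmap in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += w * fmap[i][j]
    # ReLU keeps only regions that push the class score up.
    heatmap = [[max(0.0, v) for v in row] for row in heatmap]
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[v / peak for v in row] for row in heatmap]
    return heatmap


# Toy input: two 2x2 feature maps with opposite-signed gradients.
toy_activations = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
toy_gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
toy_heatmap = grad_cam(toy_activations, toy_gradients)
```

In practice the resulting map would be upsampled to the input resolution and overlaid on the spiral or wave drawing for visual inspection.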