5.2. Training and Fitting the Best Machine Learning Models
Figure 11 presents the results of the metrics calculated during the training and tuning of the 60 best-performing machine learning models, selected based on the highest mean test scores and the lowest total time. Each model is identified by its number and the base algorithm it uses. The graphs are organized according to the three dataset sizes and the dimensions of the extracted features. Specifically, Figure 11a displays the metrics results for the 20 datasets containing 500 samples, representing an experimental test duration of 0.5 s. Figure 11b illustrates the metrics results for the 20 datasets with 1000 samples, corresponding to a 1 s experimental test. Meanwhile, Figure 11c shows the metrics results for the 20 datasets consisting of 5000 samples, corresponding to 5 s of experimental testing.
By selecting models with a mean test score greater than 90% during the training and fitting stage, 45 machine learning models were obtained. The distribution of these models by base algorithm is as follows: 33.33% DT, 28.89% SVM, 20.00% KNN, and 17.78% MLP. By number of extracted features, the distribution is 22.22% for 35 features, 24.44% for 82, 26.67% for 170, 13.33% for 435, and 13.33% for 690. The distribution according to experiment time, or number of samples, remains balanced, at 33.33% for each of the three sample sizes. This initial filtering indicates that the nature of the data favors both DT and SVM models, as well as datasets with a moderate number of extracted features; however, the experiment time, or number of samples, does not appear to affect the mean test score of the machine learning models.
Selecting models with a mean test score greater than 95% during the training and fitting stage yielded 37 machine learning models. The distribution of these models by algorithm is as follows: 32.43% DT, 29.73% SVM, 21.62% KNN, and 16.22% MLP. By number of extracted features, the distribution is 10.81% for 35 features, 27.03% for 82, 32.43% for 170, 13.51% for 435, and 16.22% for 690. Finally, by experiment time, or sample size, the distribution is 32.43% for 500 samples, 32.43% for 1000, and 35.14% for 5000. This second round of filtering confirms that the nature of the datasets favors DT and SVM models. In comparison with the 45 models scoring above 90%, the 37 models with scores above 95% show that datasets with 82 and 170 extracted features achieve the highest performance, appearing in about two-thirds of the selected models. Notably, no significant effect on model performance is observed with respect to sampling time or number of samples.
For a mean test score greater than 98% during the training and fitting stage, the selection of machine learning models was drastically reduced to 13. The distribution of these models by algorithm is as follows: 46.15% DT, 7.69% SVM, 23.08% KNN, and 23.08% MLP. By number of extracted features, the distribution is 0% for 35 features, 23.08% for 82, 53.85% for 170, 7.69% for 435, and 15.38% for 690. In terms of experiment time, or sample size, the distribution is 23.08% for 500 samples, 30.77% for 1000, and 46.15% for 5000. This analysis confirms that the nature of the data predominantly supports the use of DT models, whereas SVM models lag significantly behind both KNN and MLP models. Notably, the datasets with 170 extracted features account for more than half of the selected models. Lastly, at this threshold the sampling time, or number of samples, does impact model performance, with datasets of 5000 samples representing twice the percentage of models compared to those with 500.
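As an illustration, this threshold-based filtering reduces to a few lines of analysis code. The following is a minimal sketch, assuming the tuning results were exported to a table whose column names (algorithm, n_features, n_samples, mean_test_score) are hypothetical:

```python
import pandas as pd

# Hypothetical export of the tuning results; file and column names are assumptions.
results = pd.read_csv("tuning_results.csv")

for threshold in (0.90, 0.95, 0.98):
    selected = results[results["mean_test_score"] > threshold]
    print(f"\nmean test score > {threshold:.0%}: {len(selected)} models")
    for column in ("algorithm", "n_features", "n_samples"):
        # Percentage distribution of the surviving models for this column.
        print((selected[column].value_counts(normalize=True) * 100).round(2))
```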
The following analysis presents the minimum and maximum execution times of machine learning models that achieved a mean test score greater than 90% during the training and fitting stages. The results are organized by algorithm type. For the DT-based models, the recorded run times ranged from 0.083665 s to 2.102346 s. The SVM models exhibited a wider range, with execution times from 0.548840 s to 259.081404 s. The KNN models had run times ranging from 0.092199 s to 0.349957 s. In contrast, the MLP models had the longest execution times, ranging from 12.374451 s to 34.133011 s.
For models that achieved a mean test score exceeding 95%, the minimum and maximum execution times remained consistent with the previously reported values, except for the minimum time of the DT-based models, which rose to 0.242134 s. For models that reached a mean test score greater than 98%, the maximum execution times for the DT- and MLP-based models changed, recorded at 1.933996 s and 28.310107 s, respectively, and the minimum run time for the KNN models rose to 0.189169 s. In the case of the SVM-based model, both the minimum and maximum run times were recorded as 27.854260 s, as only one model of this type exceeded that performance threshold. This level of filtering helps identify models with lower computational costs while still maintaining effective classifier performance. Overall, algorithms like KNN and DT demonstrate considerably shorter execution times than SVM and MLP, whose times can be hundreds of times longer.
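The run-time ranges quoted above amount to a min/max aggregation per base algorithm over the filtered models. A sketch under the same hypothetical table layout as before, with an assumed total_time column:

```python
import pandas as pd

results = pd.read_csv("tuning_results.csv")  # same hypothetical table as above
above_90 = results[results["mean_test_score"] > 0.90]
# Minimum and maximum total execution time per base algorithm.
print(above_90.groupby("algorithm")["total_time"].agg(["min", "max"]))
```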
Table 4 displays the model number, base algorithm, and configured hyperparameters for each of the 13 machine learning models that achieved a mean test score exceeding 98% during the training and fitting stages.
Table 5 provides details on the number of samples, the number of extracted features, the mean test score, the test time, and the fit time for the 13 machine learning models selected for their optimal performance during the training and fitting stages.
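The quantities reported in Tables 4 and 5 (per-candidate hyperparameters, mean test score, fit time, and test time) mirror what a grid-search tuner exposes. The sketch below is a minimal illustration assuming scikit-learn's GridSearchCV and stand-in data; the study's actual tooling, class count, and parameter grid are not stated and are assumptions here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data shaped like one of the datasets (1000 samples, 170 features).
X, y = make_classification(n_samples=1000, n_features=170, n_classes=8,
                           n_informative=20, random_state=0)
grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), grid, cv=5).fit(X, y)

# cv_results_ holds, per candidate, the quantities reported in Table 5.
for key in ("mean_test_score", "mean_fit_time", "mean_score_time"):
    print(key, search.cv_results_[key].round(4))
print("best hyperparameters:", search.best_params_)
```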
A performance overview of the 13 machine learning models is presented below. The KNN models achieved the highest overall accuracy, ranging from 99.46% to 99.96%. The best-performing model, M384-KNN, achieved an accuracy of 99.96%, demonstrating both high performance and consistency across the KNN models. These results establish KNN as the most effective of the four algorithms in terms of predictive accuracy. The MLP models also performed well, with accuracies ranging from 98.07% to 99.18%. The top MLP model, M352-MLP, achieved an accuracy of 99.18%, making MLP the second most accurate overall; however, the wider accuracy range observed in the MLP models suggests some variability depending on model configuration or training conditions.
In contrast, the DT models had a lower accuracy range of 98.05% to 98.94%, with the M626-DT model achieving the highest accuracy. Although DT models are quick to train and test, their predictive performance is lower than that of both the KNN and MLP models, which may limit their suitability for tasks requiring high accuracy. Finally, only one SVM model was recorded, achieving an accuracy of 98.61%. This result places the SVM model slightly above the average DT performance but below the top-performing KNN and MLP models; with a single data point, the consistency or variability of SVM performance cannot be assessed. In summary, the KNN models stand out as the top performers in terms of accuracy, while the MLP models offer a strong alternative with slightly lower but still competitive accuracy. Meanwhile, the SVM and DT models lag behind, making them less favorable for tasks where maximum predictive performance is critical.
The following is a comparative analysis of the fitting times for 13 top-performing machine learning models. Among these, the KNN models stand out for their efficiency, with fitting times ranging from approximately 0.013 s to 0.017 s. The fastest model, identified as M384-KNN, takes 0.0128 s, while the slowest model, M290-KNN, takes 0.0167 s. This narrow range suggests consistent performance across the KNN models. In contrast, the DT models exhibit moderate fitting times, which range from 0.24 s to 1.93 s. Though slower than the KNN models, the DT models still train relatively quickly compared to more complex alternatives. The fastest DT model, M230-DT, fits in 0.2375 s, whereas the slowest, M582-DT, takes 1.9274 s. This nearly eight-fold increase indicates variability in the complexity or data handling capabilities of the DT models. The MLP models have significantly higher fitting times, ranging from 12.36 s to 28.29 s. The slowest model, M172-MLP, is more than twice as slow as the fastest model, M352-MLP. This substantial variance is likely attributable to factors such as network depth and hyperparameter settings, making the MLP models more computationally expensive to train than both the KNN and the DT models. Finally, the SVM model, which has only one recorded instance (M376-SVM), has a fitting time of 27.83 s. Although there is no range for the SVM, this fitting time places it among the slowest models, comparable to the slower MLP models. Hence, like the MLP models, the SVM model is not ideal for scenarios requiring rapid training. Overall, the KNN models are the clear winners in terms of training efficiency. In contrast, the MLP models and the SVMs offer complex modeling capabilities at the expense of longer fitting times.
Regarding test times, the DT algorithm demonstrates the fastest overall performance, with testing times ranging from approximately 0.0046 s to 0.0066 s. The quickest model, M230-DT, completed testing in 0.0046 s, whereas the slowest, M582-DT, took 0.0065 s. In contrast, the MLP exhibits a broader range of testing times, from about 0.012 s to 0.23 s, indicating more variability among its models. The fastest MLP model, M308-MLP, achieved a time of 0.0119 s, while the slowest, M172-MLP, took significantly longer at 0.2300 s. The SVM, represented solely by model M376-SVM, records a test time of 0.0223 s, so no range can be reported. The KNN algorithm displayed the slowest testing times, from approximately 0.176 s to 0.333 s: the fastest KNN model, M384-KNN, completed testing in 0.1763 s, while the slowest, M290-KNN, took 0.3333 s.
In summary, DT is the most efficient algorithm in terms of test speed, followed by SVM, MLP, and KNN, with KNN being the slowest. Although MLP can sometimes be faster than SVM, its high variability makes its performance less consistent than that of SVM.
The comparative analysis of scalability and the impact of sample size reveals the following insights: The KNN algorithm shows improved accuracy with an increased sample size while maintaining low training times. However, it suffers from high prediction times due to its instance-based nature. The DT models experience slight benefits from an increased dataset and remain efficient in both training and prediction processes. The MLP demonstrates improved accuracy with more features but faces high training times, particularly when dealing with larger datasets. The SVM, tested with a single instance comprising 5000 samples and 170 features, exhibits decent performance but incurs a high training cost.
The comparative analysis of feature count impact is presented here. The high-performing models, particularly KNN and SVM, consistently utilized 170 features, indicating an optimal balance between input complexity and performance for these algorithms. While the DT models with fewer features—such as 82 or 435—still performed reasonably well, the highest accuracy for DTs was achieved using 690 features, as demonstrated by models like M582-DT and M626-DT. This finding suggests that while DTs can operate effectively with a limited number of features, increasing the richness of the feature set can further enhance their performance. More complex models, such as MLP and SVM, significantly benefit from richer feature sets, allowing them to capture more intricate patterns. However, this also results in increased training time. Overall, having a richer set of features generally improves model accuracy, especially for more sophisticated algorithms like MLP and SVM.
5.3. Validation and Testing of the Best Machine Learning Models
Table 6 presents the results of the metrics calculated during the validation of the best machine learning models for each algorithm used. When comparing the F1-score values obtained in model validation (Table 6) with the mean test score values from the training stage (Table 5), a consistent pattern was observed. For the M186-DT and M376-SVM models, the F1-scores increased by 0.490527% and 0.472488%, respectively. In contrast, the F1-scores for the M334-KNN and M352-MLP models showed slight decreases of 0.252093% and 0.990524%, respectively. These results are a favorable sign of stability across the different models.
Table 7 presents the results of the metrics calculated during the testing of the best machine learning models for each algorithm used. The table indicates consistency between the F1-score values obtained during the validation stage (Table 6) and those obtained during the test stage (Table 7). Analyzing each of the tested models, the M334-KNN model showed a performance increase of 0.3%, and the M186-DT model exhibited an increase of approximately 0.2%. In contrast, the M352-MLP and M376-SVM models declined by about 0.5% and 0.7%, respectively. Despite these variations, all four models are considered stable and effective for detecting the presence of cracks, identifying the cracked blades, and pinpointing the areas where cracks occur in the blades of wind turbines under operating conditions.
Table 8 shows the k-fold scores of the best machine learning models for each algorithm used. The consistency of these scores across folds further validates the models.
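A minimal sketch of such a k-fold consistency check, assuming scikit-learn and stand-in data (the actual pipeline, fold count, and estimator configuration are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; in the study, the estimator would be a Table 4 configuration.
X, y = make_classification(n_samples=5000, n_features=170, n_classes=8,
                           n_informative=20, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
# Per-fold scores close to one another indicate a stable model.
print(scores.round(4), scores.mean().round(4), scores.std().round(4))
```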
Figure 12 presents the confusion matrices resulting from testing the best machine learning models for each algorithm listed in Table 7. Analyzing Figure 12 reveals that the distribution of tested cases has an overall standard deviation of 8.77%, indicating a reasonably balanced dataset. Specifically, Figure 12a illustrates that the M186-DT model experiences classification errors across all cases but demonstrates strong overall performance. In contrast, Figure 12b shows that the M344-KNN model has a lower failure margin, with only 2 cases misclassified, highlighting its exceptional performance. Figure 12c indicates that the M352-MLP model performs well in classifying the various cases; however, it ranks below the previously mentioned models due to a higher number of errors in specific classes, such as case 2 and case 5 (Table 2), which correspond to a cracked tip on the WTB bolted to P1 and a cracked tip on the WTB bolted to P2, respectively. Lastly, Figure 12d reveals that the M376-SVM model has good overall performance but is the weakest of the four selected models: its classification errors are more widely dispersed across classes, which could cause problems in specific categories such as case 2 and case 3 (cracked tip and cracked mid on the WTB bolted to P1).
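For reference, confusion matrices such as those in Figure 12 are typically produced as follows; the sketch assumes scikit-learn and uses stand-in data rather than the WTB datasets.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data with 8 classes, mimicking the multi-case labeling of Table 2.
X, y = make_classification(n_samples=5000, n_features=170, n_classes=8,
                           n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = KNeighborsClassifier().fit(X_tr, y_tr)
cm = confusion_matrix(y_te, clf.predict(X_te))  # rows: true, cols: predicted
ConfusionMatrixDisplay(cm).plot()
plt.show()
```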
To evaluate the safety and cost trade-offs for industrial applications, the misclassification rate (MCR) of the best-performing models was calculated, as shown in Table 9. The MCR is a machine learning metric that measures the proportion of incorrect predictions made by a classification model, calculated as the total number of incorrect predictions divided by the total number of predictions; a lower MCR indicates a better-performing model [37]. The evaluated cases exhibited low MCRs, ranging from 0% to 7.6%. The M186-DT model records a mean MCR of 1.280226% (19 misclassified events out of 1536 evaluated), concentrated primarily in cases 2 and 3. The M344-KNN model shows a mean MCR of 0.127392% (2 misclassified events out of 1536), occurring solely in case 2. The M352-MLP model has a mean MCR of 2.260617% (36 misclassified events out of 1536), with higher incidences in cases 2, 3, 5, and 8. The M376-SVM model records a mean MCR of 1.65785% (25 misclassified events out of 1536), concentrated in cases 2 and 3. Among these models, the M344-KNN model shows the lowest MCR, making it the most reliable option for critical environments such as crack detection in WTBs.
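The MCR computation can be expressed directly from a confusion matrix. The sketch below uses an illustrative 3-class matrix rather than the study's data; the per-class average is also shown, since the reported "mean MCR" values suggest some form of per-case averaging, which is an assumption on our part.

```python
import numpy as np

# Illustrative 3-class confusion matrix (rows: true class, cols: predicted).
cm = np.array([[50,  1,  0],
               [ 2, 47,  1],
               [ 0,  0, 51]])

mcr_overall = 1 - np.trace(cm) / cm.sum()       # incorrect / total predictions
per_class = 1 - cm.diagonal() / cm.sum(axis=1)  # one rate per true class
print(f"overall MCR: {mcr_overall:.4%}  per-class mean: {per_class.mean():.4%}")
```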
Table 10 presents the F1-score metrics in macro, micro, and weighted modes. An analysis of these results reveals that the values of the three metrics are consistent across the four models. This consistency indicates the generalizability of the models, demonstrating solid and uniform performance for each class. Additionally, the results suggest that the models are not biased, despite the slight imbalance in the data.
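The three averaging modes in Table 10 correspond to standard options of the F1 metric; a minimal sketch assuming scikit-learn, with stand-in labels:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2, 2, 3]  # stand-in labels
y_pred = [0, 0, 1, 2, 2, 2, 2, 3]
for mode in ("macro", "micro", "weighted"):
    print(mode, round(f1_score(y_true, y_pred, average=mode), 4))
```

For single-label multiclass problems, the micro-averaged F1 coincides with overall accuracy, while the macro and weighted variants expose per-class bias; agreement among the three therefore supports the conclusion that the models are unbiased.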
Table 11 shows the computational cost of implementing the machine learning models. The M186-DT model is the most efficient, with the lowest consumption of RAM and TFLOPs, making it suitable for moderate datasets. In contrast, the M334-KNN and M376-SVM models have the highest costs: the former because of its dependence on stored instances, and the latter because of the number of support vectors it must evaluate. The M352-MLP model sits in between, with higher memory usage but fewer operations than the KNN- and SVM-based models, which keeps it competitive. These results show that model selection should consider not only accuracy but also the balance between computational cost and expected benefit.
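These trade-offs follow from rough per-prediction operation counts. The sketch below is a back-of-envelope estimate with assumed sizing values (support-vector count, MLP layer widths), not the measured TFLOPs of Table 11:

```python
# Rough per-prediction operation counts; all sizing values below are assumed.
m = 170                    # extracted features per sample
n_train = 5000             # stored training instances (KNN keeps all of them)
n_sv = 3000                # assumed number of support vectors (SVM)
layers = (m, 100, 100, 8)  # assumed MLP widths (input, hidden layers, output)

knn_ops = n_train * m      # one distance term per stored instance
svm_ops = n_sv * m         # one kernel evaluation per support vector
mlp_ops = sum(a * b for a, b in zip(layers, layers[1:]))  # weight multiplies

print(f"KNN ~{knn_ops:,} ops, SVM ~{svm_ops:,} ops, MLP ~{mlp_ops:,} ops")
```

Under these assumptions, the KNN and SVM counts scale with the stored data, while the MLP count depends only on its architecture, which is consistent with the intermediate cost reported for M352-MLP.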