We apply the peak detection methods and evaluation criteria to 67 measurements (cf. Section 2.2).
Table 1 gives an overview of the postprocessing results. After merging the overlapping peaks of the peak lists, the automatic VisualNow and IPHEx methods yield by far the largest number of peaks, between 4000 and 6000. Manual peak picking, local maxima search and the peak model estimation method find a similar number of peaks, “only” about 1500. The number of peak clusters is almost constant across all methods, varying between 40 and 90; an exception is the automated IPHEx peak picker, which finds 420 clusters. Both VisualNow-based methods, manual as well as automated, find a comparably high number of potential peaks relative to a low number of resulting peak clusters. The reason lies in the VisualNow implementation: once a potential peak has been found in one measurement (out of the 67), VisualNow automatically “finds” a peak at this position in all other 66 measurements (presumably with low intensities), even if no actual peak exists. This results in the observed high number of peaks, which are mainly noise and, as we will demonstrate later, may lead to problems within the classification procedure. In contrast, the IPHEx, local maxima and PME approaches assign intensities to peak clusters only for those measurements in which a peak is detectable at the corresponding position.
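To make the merging step concrete, the following is a minimal sketch of merging per-measurement peak lists into peak clusters by coordinate proximity. It is not the VisualNow, IPHEx or PME implementation; the tolerance values and the function name `merge_peaks` are illustrative assumptions.

```python
def merge_peaks(peaks, rt_tol=3.0, irm_tol=0.003):
    """Greedily merge peaks from all measurements into peak clusters.

    peaks: list of (measurement_id, retention_time, inv_reduced_mobility,
    intensity) tuples. Two peaks join the same cluster when both coordinates
    differ by less than the (illustrative) tolerances.
    """
    clusters = []  # each cluster: {"rt": ..., "irm": ..., "members": [...]}
    # Seed clusters with the most intense peaks first.
    for m_id, rt, irm, inten in sorted(peaks, key=lambda p: -p[3]):
        for c in clusters:
            if abs(c["rt"] - rt) <= rt_tol and abs(c["irm"] - irm) <= irm_tol:
                c["members"].append((m_id, rt, irm, inten))
                break
        else:
            clusters.append({"rt": rt, "irm": irm,
                             "members": [(m_id, rt, irm, inten)]})
    return clusters

# A cluster-by-measurement intensity matrix then serves as the feature matrix:
# rows = 67 measurements, columns = peak clusters, entries = peak intensity
# (or 0 where no peak was detected at that position in a measurement).
```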
4.2. Evaluation Using Statistical Learning
The evaluation of the machine learning performance is shown in Table 3 and Table 4.
Table 3 presents the results of the linear support vector machine, indicating that all methods perform almost equally well. The manual, local maxima and automatic VisualNow peak detection methods perform worst, in terms of both AUC and accuracy. The automatic peak detection in IPHEx shows a slightly better AUC and performs best in terms of accuracy (73%). The peak detection method that produced the most informative features for the linear method is the peak model estimation approach, with an AUC of ≈82%.
Table 3.
Classification results of the linear support vector machine. The quality measures are AUC, accuracy (ACC), sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV), all in %.
| Method | AUC | ACC | Sensitivity | Specificity | PPV | NPV |
|---|---|---|---|---|---|---|
| Manual VisualNow | 77.4 | 70.9 | 69.7 | 72.4 | 75.7 | 65.9 |
| Local Maxima Search | 77.0 | 67.8 | 70.6 | 64.4 | 71.0 | 64.0 |
| Automatic VisualNow | 76.6 | 68.3 | 66.8 | 70.1 | 73.4 | 63.1 |
| Automatic IPHEx | 79.8 | 73.0 | 70.5 | 76.0 | 78.4 | 67.6 |
| Peak Model Estimation | 82.2 | 72.2 | 77.2 | 66.1 | 73.7 | 70.1 |
Table 4.
Classification results of the random forest. The quality measures are AUC, accuracy (ACC), sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV), all in %.
| Method | AUC | ACC | Sensitivity | Specificity | PPV | NPV |
|---|---|---|---|---|---|---|
| Manual VisualNow | 86.9 | 76.3 | 78.7 | 73.4 | 78.5 | 73.6 |
| Local Maxima Search | 80.8 | 70.5 | 75.0 | 64.9 | 72.5 | 67.8 |
| Automatic VisualNow | 81.1 | 71.9 | 75.6 | 67.3 | 74.1 | 69.1 |
| Automatic IPHEx | 80.0 | 68.9 | 72.8 | 64.0 | 71.4 | 65.6 |
| Peak Model Estimation | 81.9 | 74.2 | 81.6 | 65.0 | 74.2 | 74.1 |
Table 4 shows the classification results of the random forest method. Again, all methods vary little in their performance. The best set of features for this machine learning method was generated by the gold standard (Manual VisualNow): the manual detection shows an accuracy of ≈76% and an AUC of ≈87% and outperforms the other peak detection methods in most of the quality indices. Among the automated methods, the peak model estimation performs best, with an AUC of ≈82% and an accuracy of ≈74%, as well as the leading values in most of the other measures.
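For reference, the quality measures reported in Tables 3 and 4 follow directly from a binary confusion matrix; below is a minimal sketch in Python with scikit-learn, assuming class labels encoded as 0/1 and a continuous classifier score for the AUC. The function name and percentage scaling are our own conventions.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

def quality_measures(y_true, y_pred, y_score):
    """Compute the measures reported in Tables 3 and 4 (values in %)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "AUC": 100 * roc_auc_score(y_true, y_score),
        "ACC": 100 * (tp + tn) / (tp + tn + fp + fn),
        "Sensitivity": 100 * tp / (tp + fn),   # true positive rate
        "Specificity": 100 * tn / (tn + fp),   # true negative rate
        "PPV": 100 * tp / (tp + fp),           # positive predictive value
        "NPV": 100 * tn / (tn + fn),           # negative predictive value
    }
```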
Data Robustness: Figure 4 shows boxplots of the AUC distributions generated by 100 runs of the ten-fold cross validation. The prediction results of the linear SVM with the manual and automated VisualNow feature sets are the most stable, while the local maxima search shows the highest variation. The PME approach shows reasonable robustness and performs better than the simpler methods in almost all runs. For the random forest, in comparison, the AUC-measured classification performance is most robust for the gold standard and the PME approach; the other automated methods introduce larger variations, in particular IPHEx.
Figure 4.
Boxplots of 100 runs of the ten-fold cross validation for both the linear SVM and the random forest method.
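The following sketch outlines how such an AUC distribution can be produced: one mean AUC per cross-validation run, repeated 100 times with reshuffled folds. It assumes a feature matrix `X` (cluster intensities) and class labels `y`; the scikit-learn classifiers shown are stand-ins for the actual configurations used.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def repeated_cv_auc(clf, X, y, runs=100, folds=10, seed=0):
    """Collect one mean AUC per cross-validation run (100 values in total),
    i.e., the distribution summarized by the boxplots in Figure 4."""
    aucs = []
    for run in range(runs):
        cv = StratifiedKFold(n_splits=folds, shuffle=True,
                             random_state=seed + run)
        scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
        aucs.append(scores.mean())
    return np.array(aucs)

# e.g. repeated_cv_auc(SVC(kernel="linear"), X, y) and
#      repeated_cv_auc(RandomForestClassifier(n_estimators=500), X, y)
```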
Tuning Robustness: Finally, we investigate whether the feature sets and their model performance are susceptible to parameter tuning for the worse-performing classifier: the linear SVM. To this end, we systematically vary the cost and tolerance parameters ({0.1, 1.0, 100, 1000} and {0.01, 0.1, 1}, respectively); in a second run, we randomize the class labels. The result of this analysis is shown in Figure 5, which plots the variance of the AUC for both the original labels (left) and the randomized labels (right). The results of the corresponding robustness analysis for the random forest are shown in Appendix Figure A1.
At first glance, Figure 5 indicates that the performance (AUC) of the manual and automated VisualNow feature sets, as well as the IPHEx peak detection feature set, can be substantially improved by tuning the classifier’s parameters. However, the results for the randomized labels suggest that these three tools generate peak clusters that are prone to overfitting, most likely as a result of the high number of detected potential peaks. We would generally expect a drastic drop in classification quality for the randomized labels compared with the real labels; this drop is clearly observed only for LMS and PME, indicating overfitting for the other methods. In addition to its comparably low susceptibility to overfitting, PME shows quite small variability in AUC, indicating stable classification results.
In contrast to the tuning results of the linear SVM, most data sets show considerably smaller tuning potential for the random forest. Furthermore, for all data sets we observe a drastic drop in classification quality for the randomized labels compared with the original labels. Overall, random forest classification appears to be more robust against overfitting on this dataset.
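A sketch of this tuning and label-randomization analysis is given below, reusing `X` and `y` from above. The parameter grid mirrors the values stated in the text; mapping the tolerance parameter onto scikit-learn’s `tol` argument, and the use of `GridSearchCV`, are assumptions about the setup.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

param_grid = {
    "C": [0.1, 1.0, 100, 1000],  # cost values from the text
    "tol": [0.01, 0.1, 1],       # tolerance values (assumed to map to sklearn's tol)
}

def tuning_aucs(X, y, seed=0):
    """AUC for every parameter combination in a single ten-fold CV run."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    grid = GridSearchCV(SVC(kernel="linear"), param_grid,
                        scoring="roc_auc", cv=cv)
    grid.fit(X, y)
    return grid.cv_results_["mean_test_score"]

rng = np.random.default_rng(0)
aucs_real = tuning_aucs(X, y)                   # original labels
aucs_rand = tuning_aucs(X, rng.permutation(y))  # randomized labels
# A feature set prone to overfitting keeps high AUCs even for randomized labels.
```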
Figure 5.
Boxplots illustrating the variation within the linear SVM tuning results in a single ten-fold cross validation run. The yellow boxes show the results of tuning on the original feature sets; the green boxes show the results of tuning on the randomly labeled feature sets.