5.2. Results and Visualization
Figure 5 shows the distribution of predictions over the 500 MCD runs (Algorithm 1) and the accuracy of the ensemble (Algorithm 2) in one of the 10 repetitions of the DL.
Figure 6 illustrates how the MCDE approach outputs the probabilities for each of the three classes (control, presymptomatic and sick), as given by Algorithm 3, by showing one particular register from each class. The number of saccades included in each box plot is given in the title of the corresponding chart.
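As a minimal sketch of how repeated stochastic passes can be turned into per-class probabilities for one saccade, the Python snippet below averages softmax vectors over several passes and picks the class with the largest mean. The averaging rule, the function names and the toy values are assumptions made for illustration only; Algorithms 1-3 define the actual procedure.

```python
from statistics import mean

CLASSES = ("control", "presymptomatic", "sick")

def ensemble_probabilities(passes):
    """Average per-class probabilities over stochastic forward passes.

    `passes` holds one softmax vector (three floats) per pass.
    """
    return [mean(p[k] for p in passes) for k in range(len(CLASSES))]

def predicted_class(probabilities):
    """Return the class name with the largest averaged probability."""
    best = max(range(len(CLASSES)), key=lambda k: probabilities[k])
    return CLASSES[best]

# Toy softmax outputs for one saccade over 4 passes (illustrative values;
# the experiments in the text use 500 passes).
toy_passes = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.80, 0.10, 0.10],
    [0.50, 0.30, 0.20],
]
avg = ensemble_probabilities(toy_passes)
label = predicted_class(avg)
```

The spread of the individual passes around `avg` is what the box plots in Figure 6 summarize for each class.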
The DT used a subset of features saved from the MCDE approach, as described in Section 4.2, to build a model that is able to classify the registers. The tree shown in Figure 7 is the one that yielded the best test accuracy among the 10 repeated runs of the entire DL-MCDE-DT approach. The interpretability of the DT rules is discussed in the next section.
Figure 8 shows how a presymptomatic register is mistaken for control by another DT.
Table 2 has two sections. The first is dedicated to the saccade classification results on the validation and test sets, as obtained by MCD and its ensemble version, alongside the CNN-LSTM without Monte Carlo uncertainty and a support vector machine (SVM). All results are reported over 10 repeated runs. The second part of the table shows the results obtained by the DL-MCDE-DT model, the CNN-LSTM without MCD and the SVM, when applied to the register test set. For the last two approaches, the label of a register is established as the class of the majority of its saccades. The last 3 rows give the precision, recall and F1-score obtained by the proposed DL-MCDE-DT.
Finally, Figure 9 shows in the left plot the correctly labeled and mislabeled registers over the 10 runs corresponding to the register accuracy results in Table 2, while the right plot shows the ROC curves for one of the 10 runs.
5.3. Discussion
When the class of each register is taken directly as the vote given by the majority of its constituent saccades, the results are not encouraging: the control and sick registers are correctly identified, but most of the presymptomatic registers are misclassified.
Figure 6 depicts the MCDE probabilities for three distinct registers, one representing each class. To dissect the output further, we first focus on the plots in the first row. The register C020 from the control class has 85 saccades, and 83 of these are labeled correctly by the MCDE approach. This is depicted in the first plot, which shows the probabilities for the saccades of register C020 that have the largest value for control. There is only one saccade that has a larger probability for the presymptomatic class (top-center plot) and one saccade where the largest probability targets the sick class (top-right plot). The majority of saccades (83 out of 85, that is 97.6%) are in this case labeled correctly (as control).
The second row of plots in Figure 6 illustrates a presymptomatic register. Here, the MCDE does not classify any of the saccades as presymptomatic: out of 52 saccades, 33 are classified as control (center-left plot) and 19 are labeled as sick (center-right plot). Hence all presymptomatic saccades are misclassified, and the label of the register needs to be established mainly by balancing the control and sick saccades.
Finally, the plots in the third row of Figure 6 show the classification of the 56 saccades in the sick register S008. Of these, 2 are labeled as control, 1 as presymptomatic and 53 as sick. The ones labeled as sick are established by the DL-MCDE with a remarkably high certainty.
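The effect of plain majority voting on these three registers can be checked with a short Python sketch. The saccade counts are the ones quoted above for Figure 6; the function name `majority_label` and the dictionary keys are illustrative choices, not part of the original pipeline.

```python
from collections import Counter

def majority_label(counts):
    """Return the class that labels the most saccades in a register."""
    return Counter(counts).most_common(1)[0][0]

# Saccade counts per predicted class, as quoted in the text for Figure 6.
registers = {
    "control (C020)": {"control": 83, "presymptomatic": 1, "sick": 1},
    "presymptomatic": {"control": 33, "presymptomatic": 0, "sick": 19},
    "sick (S008)": {"control": 2, "presymptomatic": 1, "sick": 53},
}
votes = {name: majority_label(c) for name, c in registers.items()}
```

Here `votes` assigns control to the presymptomatic register, which is the failure mode that motivates a rule-based step on top of the saccade labels.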
Naturally, the three registers from Figure 6 do not necessarily reflect the manner in which the saccades are labeled in all the other registers from the same classes. There are control registers in which all saccades are correctly identified, while there are others where more saccades are mislabeled. We initially attempted to manually establish rules (with thresholds) for balancing the control and sick saccades towards reaching an accurate classification of the validation registers. However, this path was abandoned as the rules became too complex to follow. Consequently, we extracted various statistical features at the register level from the obtained results and fed them to a DT model to extract the rules.
Figure 7 illustrates such a tree with rules obtained by the DT model. The most important attribute, i.e., the one from the root with the highest
value, is the number of saccades that are labeled as control. If fewer than 22 samples in a register are labeled as control by the MCDE model, the class of that register is established as sick.
Figure 2 gives an overview of the number of saccades for each register in the three classes. Most of the registers that have a limited number of saccades (e.g., fewer than 30) belong to the sick class (and the number of control saccades naturally falls below 22). When evaluating such a patient, the physicians decided that no more tests were necessary, since they observed an impaired behavior, and this is accurately identified by the DL-MCDE approach as well. For other registers with more samples, when the number of control saccades was below 22, the label was sick for all validation cases. It can also be observed in Figure 9 that this rule proved to be accurate for the test registers too, since all sick class registers are correctly classified.
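The root split described above can be written as a one-line rule. The sketch below applies it to the control-saccade counts quoted earlier for the three registers of Figure 6; it covers only the root node, and the function name `root_rule` and the `None` return value for undecided registers are assumptions made for illustration.

```python
CONTROL_COUNT_THRESHOLD = 22  # root threshold quoted in the text

def root_rule(n_control, threshold=CONTROL_COUNT_THRESHOLD):
    """Apply only the root split of the tree.

    Returns "sick" when fewer than `threshold` saccades are labeled control,
    otherwise None to signal that deeper nodes must decide between
    control and presymptomatic.
    """
    return "sick" if n_control < threshold else None

# Number of saccades labeled control in each register of Figure 6.
control_counts = {"C020": 83, "presymptomatic register": 33, "S008": 2}
root_decisions = {name: root_rule(n) for name, n in control_counts.items()}
```

Only S008 is settled at the root; the other two registers are passed on to the attributes based on mean probabilities.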
When there are more than 21 samples classified as control in a register, the differentiation is to be made between presymptomatic and control. The next most important attribute is the mean probability of the sick class for the samples that are classified as sick by the MCDE (the rightmost box plots, marked S, in Figure 6). Naturally, this attribute is very important, since it represents the average probability returned by the softmax activation in the CNN-LSTM approach according to which the saccades should be labeled as sick. However, it does not directly decide the class of the register; it leads to a further check on the mean probability of the presymptomatic label, again over the set of saccades for which the sick probability is the highest. In Figure 6, this corresponds to the middle box plot (labeled P) in the same rightmost plots. Finally, the mean probability of the control class in the same set of samples labeled as sick (i.e., cases with the highest probability for sick) is another decision attribute. This corresponds to the box plot labeled C in the same rightmost charts of Figure 6.
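The four attributes discussed above can be derived from the per-saccade class probabilities of a register roughly as follows. This is a sketch under stated assumptions: the input layout (one probability triple per saccade, ordered control, presymptomatic, sick), the feature names and the toy values are hypothetical, and the actual feature set is the one described in Section 4.2.

```python
from statistics import mean

CONTROL, PRESYMPTOMATIC, SICK = 0, 1, 2

def register_features(saccade_probs):
    """Compute register-level attributes from per-saccade probabilities."""
    # Label of each saccade: index of its largest class probability.
    labels = [max(range(3), key=lambda k: p[k]) for p in saccade_probs]
    # Saccades whose highest probability is for the sick class.
    sick_set = [p for p, lab in zip(saccade_probs, labels) if lab == SICK]

    def mean_in_sick_set(k):
        # None when no saccade of the register is labeled sick.
        return mean(p[k] for p in sick_set) if sick_set else None

    return {
        "n_control": labels.count(CONTROL),
        "mean_sick_in_sick_set": mean_in_sick_set(SICK),
        "mean_presymptomatic_in_sick_set": mean_in_sick_set(PRESYMPTOMATIC),
        "mean_control_in_sick_set": mean_in_sick_set(CONTROL),
    }

# Toy register with four saccades (illustrative values only).
toy_register = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.0, 0.9],
]
features = register_features(toy_register)
```

A dictionary of this kind, computed per register, is the sort of input a decision tree can split on.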
Figure 8 illustrates another tree that shares some nodes with the one from Figure 7. This illustration focuses on showing how a test register from the presymptomatic class is categorized as control. Each node shows a histogram with the registers in each class: the horizontal line covers the interval of the current attribute and the black triangle indicates the determined value. The features involved in the DT classification are written in orange at the bottom of the plot and are also indicated with an orange triangle in each node of the tree. The run whose result is outlined in the figure had only one misclassified register in the test set, i.e., the one represented.
One drawback of the MCDE approach is its running time. Applying 500 passes of the MCDE over the validation set takes 24.16 min (2.9 s per iteration), while the same number of passes on the test set takes 11.33 min (1.36 s per iteration). We recall that the test set is smaller than the validation set, as shown in Figure 2. The experiments are performed on a PC with an Intel i7-4770 CPU at 3.40 GHz, 16 GB RAM and a GeForce GTX 1650 GPU. The program is written in Python, uses the TensorFlow library and runs on the GPU.
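The reported totals follow from the per-iteration times, as the short check below shows. The helper name `total_minutes` is illustrative.

```python
def total_minutes(passes, seconds_per_pass):
    """Total running time in minutes for a number of MCDE passes."""
    return passes * seconds_per_pass / 60.0

validation_min = total_minutes(500, 2.9)   # roughly 24.2 min
test_min = total_minutes(500, 1.36)        # roughly 11.3 min
slowdown = validation_min / test_min       # validation vs. test cost
```

Both values agree with the reported 24.16 min and 11.33 min up to rounding, and the validation set is a little over twice as expensive to process as the test set.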
The DT in Figure 7 provided the highest classification accuracy, 94.12%, for the test registers. It misclassified one presymptomatic register as control. It is interesting to note that, besides the first attribute, which refers to the number of saccades labeled by the model as control, all the others use only mean probabilities from the set of samples in the registers that have the highest probability for the sick class (corresponding to the box plots from the rightmost charts in Figure 6). Naturally, not all DT rules from the 10 repeated runs were identical and different attributes were also considered in other cases.
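An accuracy of 94.12% with exactly one misclassified register constrains the size of the test set. The sketch below searches for set sizes compatible with that figure; the resulting size is an inference from the two reported numbers, not a value stated in this section (Figure 2 gives the actual split).

```python
def accuracy_percent(n_registers, n_errors=1):
    """Accuracy in percent when `n_errors` registers are misclassified."""
    return 100.0 * (n_registers - n_errors) / n_registers

REPORTED = 94.12

# Test-set sizes for which a single error rounds to the reported accuracy.
compatible_sizes = [
    n for n in range(2, 201)
    if abs(accuracy_percent(n) - REPORTED) < 0.005
]
```

Under these assumptions, `compatible_sizes` contains a single value, 17, i.e., 16 of 17 test registers correctly classified.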
The results of saccade classification shown in Table 2 indicate the best result for the CNN-LSTM approach, both for the validation and the test sets. This advantage is however not preserved when the classified samples are used to establish the label of the register. Although all control and sick registers are accurately identified by taking the class of the majority of the saccades, the presymptomatic registers are all misclassified as control when the additional DT is not used.
The second part of the table indicates the results obtained for register classification. Besides the accuracy of the DL-MCDE-DT tandem, the CNN-LSTM and the SVM results are also reported, with the majority of samples establishing the label of their register. Although the standard deviation values in the saccade classification for CNN-LSTM indicate some spread, this is not enough to change any label at the level of the register, hence the null value for the standard deviation in the second part of the table for the same classifier. Afterwards, the weighted results for the precision, recall and F1-score are shown. As can be observed in the first plot of Figure 9 with the confusion matrix, all sick registers are correctly identified and no other register is mistaken for sick, as opposed to the output in [3]. The matrix is symmetric, hence there are very close values for the precision, recall and F1-score in Table 2. The values for the three measures in the table are however not identical to those in the figure, because they are computed as averages over all 10 runs and not directly from the confusion matrix results.
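The link between a symmetric confusion matrix and matching precision and recall can be illustrated with a small Python sketch. The matrix below is entirely hypothetical (it is not the one in Figure 9); it only mimics the stated properties: symmetric, all sick registers correct, and no other register labeled as sick.

```python
def per_class_precision_recall(cm):
    """Per-class precision and recall from a square confusion matrix.

    Rows are true classes, columns are predicted classes.
    """
    n = len(cm)
    precision, recall = [], []
    for k in range(n):
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision.append(cm[k][k] / predicted_k if predicted_k else 0.0)
        recall.append(cm[k][k] / actual_k if actual_k else 0.0)
    return precision, recall

def weighted_average(values, cm):
    """Average per-class values weighted by class support (row sums)."""
    support = [sum(row) for row in cm]
    return sum(v * s for v, s in zip(values, support)) / sum(support)

# Hypothetical symmetric matrix; order: control, presymptomatic, sick.
cm = [
    [8, 1, 0],
    [1, 3, 0],
    [0, 0, 4],
]
precision, recall = per_class_precision_recall(cm)
weighted_precision = weighted_average(precision, cm)
weighted_recall = weighted_average(recall, cm)
```

In a symmetric matrix every row sum equals the matching column sum, so precision and recall coincide for each class, and so does the F1-score as their harmonic mean. Averaging the per-run measures over 10 runs, as done for Table 2, need not preserve this exact equality.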
A higher proportion of presymptomatic cases is now correctly identified, as opposed to the results in [3]. This is also visible from the right plot in Figure 9 with the ROC curves, which are calculated for one of the 10 runs. The micro- and macroaverages are also computed. The high value for the microaverage is of special interest, since the classes of the problem are unbalanced (more control registers and fewer presymptomatic ones) and this measure adequately captures the precision in such cases. It would still be useful to have a significantly larger number of registers to train the DT model with more data and make it more robust.