3.1. Learning Monitoring per Fold
While the mini-batch SGD was running, we monitored the evolution of the cross-entropy and the accuracy along epochs. This is the very first check to ensure that our CNNs were properly learning from the data and that our local validation criterion stopped the learning algorithm at the right moment. If our CNNs were correctly implemented, we expected the cross-entropy to be minimized toward values as close as possible to 0, and the accuracy to reach values as close as possible to 1. If this check had given wrong results, there would have been no point in computing subsequent metrics.
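For concreteness, the quantities being monitored can be written in the standard form for a two-class softmax classifier (the notation here is generic and is not necessarily the one used elsewhere in this paper):
\[ H = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c\in\{0,1\}} y_{n,c}\,\log \hat{p}_{n,c}, \qquad \mathrm{accuracy} = \frac{1}{N}\sum_{n=1}^{N}\mathbb{1}\!\left[\arg\max_{c}\hat{p}_{n,c} = \arg\max_{c} y_{n,c}\right], \]
where \(y_{n,c}\) is the one-hot label and \(\hat{p}_{n,c}\) the softmax output of sample \(n\) for class \(c\), and \(N\) is the number of samples in the monitored set.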
Figure 6 shows two representative examples of this check, using data from the H1 and L1 detectors. Both were performed during a single fold of a 10-fold CV experiment, from the first to the last mini-batch SGD epoch. In addition, here we used a time resolution of s, with 2 stacks and 20 kernels in the convolutional layers. Notice that the cross-entropy shows decreasing trends (Figure 6a,c) and the accuracy increasing trends (Figure 6b,d). The total number of epochs for H1 data was 372, and for L1 data it was 513, which means that the CNN has greater difficulty in learning parameters with L1 than with H1 data. When the CNN finishes its learning process, the cross-entropy and accuracy reach values of and , respectively, using H1 data; and and , respectively, using L1 data.
Notice that fluctuations appear in all of the plots shown in Figure 6. This is actually expected since, in the mini-batch SGD algorithm, a random subset of samples is taken at each step, which introduces stochastic noise. In addition, when using data from the L1 detector, some anomalous peaks appear between epochs 350 and 400, but this is not a problem because the CNN simply continues its learning process and the trends in both metrics are not affected. In the end, this resilience is observed because of our validation patience criterion, which is implemented to prevent our CNN algorithm from stopping prematurely and/or to dispense with manually adjusting the total number of epochs for each learning fold.
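As an illustration of this kind of stopping rule, the following minimal sketch trains a toy logistic model with mini-batch SGD and stops once the validation cross-entropy has not improved for a fixed number of epochs; the toy data, model, and patience value are illustrative assumptions, not the actual implementation used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary dataset standing in for the image samples (illustrative only).
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
X_train, y_train, X_val, y_val = X[:800], y[:800], X[800:], y[800:]

w, b = np.zeros(20), 0.0                       # parameters of a toy logistic model
lr, batch_size, patience = 0.1, 32, 10
best_loss, stalled, epoch = np.inf, 0, 0

def val_cross_entropy():
    p = 1.0 / (1.0 + np.exp(-(X_val @ w + b)))
    return -np.mean(y_val * np.log(p + 1e-12) + (1 - y_val) * np.log(1 - p + 1e-12))

while stalled < patience and epoch < 1000:
    idx = rng.permutation(len(X_train))
    for start in range(0, len(idx), batch_size):
        batch = idx[start:start + batch_size]  # random mini-batch, the source of SGD noise
        p = 1.0 / (1.0 + np.exp(-(X_train[batch] @ w + b)))
        grad = p - y_train[batch]
        w -= lr * X_train[batch].T @ grad / len(batch)
        b -= lr * grad.mean()
    epoch += 1
    loss = val_cross_entropy()                 # monitored after every epoch
    if loss < best_loss:
        best_loss, stalled = loss, 0
    else:
        stalled += 1                           # patience counter: no improvement this epoch

print(f"stopped after {epoch} epochs, best validation cross-entropy {best_loss:.4f}")
```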
Still focusing on the SGD fluctuations, the zoomed plots in Figure 6 show their order of magnitude, measured as the highest peak minus the lowest peak. When we work with H1 data, the cross-entropy fluctuations are about (Figure 6a) and the accuracy fluctuations are about (Figure 6b). On the other hand, when we learn from L1 data, both the cross-entropy and accuracy fluctuations are about (Figure 6c,d). It should be stressed that, although the mini-batch SGD perturbations contribute their own uncertainty, when we compute mean accuracies among all folds in Section 3.2 we will see that the magnitude of these perturbations does not fully determine the magnitude of the dispersion present in the distribution of mean accuracies.
3.2. Hyperparameter Adjustments
Our CNN models introduce several hyperparameters, namely the number of stacks, the size and number of kernels, the size of the pooling regions, the number of layers per stack, the stride, and the padding, among many others. Presenting a systematic study of all hyperparameters is beyond the scope of our research. However, given the CNN architectures shown in Table 1, we decided to study and adjust two of them, namely the number of stacks and the number of kernels in the convolutional layers. In addition, although it is not a hyperparameter of the CNN, the time length of the samples is a resolution that also needed to be set to reach optimal performance, so we included it in the following analyses.
A good or bad choice of hyperparameters affects performance, and this choice introduces uncertainty. Nonetheless, once we find a set of hyperparameters defining an optimal setting, and regardless of how sophisticated our adjustment method is, the goal is simply to use this setting for predictions. That is to say, once we find optimal hyperparameters, we remove the randomness that would be introduced into predictions if we ran our CNN algorithms with different hyperparameters. For this reason, we say that the uncertainty present in hyperparameter selection defines a level prior to the intrinsic uncertainties of the setting already chosen; in the Bayesian formalism, this is a prior belief. The latter uncertainties are riskier, because they introduce stochasticity into all predictions of our models, even though we are working with a fixed hyperparameter setting.
In any case, we emphasize that our methodology for hyperparameter selection is robust. Thanks to our white-box approach, we gained a clear understanding of the internal behavior of our CNNs and, based on this understanding, we heuristically proposed a reduced set of hyperparameters on which to perform a transparent statistical exploration of meaningful values, obtaining good results, as will be seen. This methodology deliberately distances itself from the blind brute-force approach of performing unnecessarily large and expensive explorations (many hyperparameter settings are simply not worth the time), and even further from the naive perspective of trivializing model selection as a superficial detail and/or something opaquely given without a clear exploration.
The first hyperparameter adjustment is shown in Figure 7, and it was implemented to find the optimal number of stacks and time length resolution according to the resulting mean accuracies. The left panels (Figure 7a,c) show the distribution of mean accuracy for all 10 repetitions, or runs, of the entire 10-fold CV experiment as a function of . Each of these mean accuracies, which we can denote as , is the average over all fold accuracies of the th run of the 10-fold CV. In addition, the right panels (Figure 7b,d) show the mean of the mean accuracies over all 10 runs, i.e., , as a function of . Inside the right plots, we include small boxplots that, as will be seen next, are useful for studying the dispersion and skewness of the mean accuracy distributions (circles inside the boxplots are the distribution means). The line plots show the contributions of our three CNN architectures, i.e., with 1, 2, and 3 stacks.
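Written out explicitly (with generic notation, since the symbols used in the figure labels are not reproduced here), the two quantities plotted in Figure 7 are
\[ \bar{a}_r = \frac{1}{10}\sum_{k=1}^{10} a_{r,k}, \qquad \langle \bar{a} \rangle = \frac{1}{10}\sum_{r=1}^{10} \bar{a}_r, \]
where \(a_{r,k}\) is the test accuracy of fold \(k\) in run \(r\) of the 10-fold CV, \(\bar{a}_r\) is the mean accuracy of run \(r\) (left panels), and \(\langle \bar{a} \rangle\) is the mean of mean accuracies (right panels).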
Consider the top panels of Figure 7 for H1 data. Notice, from Figure 7b, that, for all CNN architectures, the mean of mean accuracies shows a decreasing trend when s, and this decrease is more pronounced when we work with fewer stacks. In addition, when s, a slight increase appears, even if the local differences are of the order of the SGD fluctuations. In short, given our mean accuracy sample dataset, the highest mean of mean accuracies, about , occurs when s with a CNN of 3 stacks. Subsequently, to decide whether this setting is optimal, we need to explore Figure 7a together with the boxplots inside Figure 7b. From Figure 7a, we see that not only does the mentioned setting give high mean accuracy values, but so do s and s, both with 2 stacks, s with 2 and 3 stacks, and even s with 3 stacks; all of these settings reach a mean accuracy greater than . The setting of s with 3 stacks can be discarded because its maximum mean accuracy is clearly an outlier and, to elucidate which of the remaining settings is optimal, we need to explore the boxplots.
Here, it is crucial to understand that the optimal setting to choose actually depends on what we specifically want. Let us focus on Figure 7b. If the dispersion does not worry us too much and we want a high probability of occurrence for many high values of mean accuracy, the setting of s with 2 stacks is the best, because its distribution has a slightly negative skewness, concentrating most of the probability mass toward the upper mean accuracy values. On the other hand, if we prefer more stable estimates with less dispersion, at the expense of a clear positive skewness (in fact, a high mass concentration in a region that does not reach mean accuracy values as high as the range from the median to the third quartile of the previous setting), the setting of s with 3 stacks is the natural choice. In practice, we would accept greater dispersion if it helped to reach the highest mean accuracy values but, as all of our boxplots have similar maximum values, we decided to maintain our initial choice of s with 3 stacks for H1 data.
From Figure 7a, it can be seen that, regardless of the number of stacks, the data dispersion for s is greater than for s, even if in the former region the dispersion tends to decrease slightly as we increase the number of stacks. This is a clear visual hint that, together with the evident trend of decreasing mean accuracy as increases, motivates discarding all settings with s. However, this hint is not present in the bottom panels, in which a trend of decrease and then increase appears (clearer in the right plot), and the dispersions of the mean accuracy distributions are similar for almost all time resolutions. For this reason, and although the procedure for hyperparameter adjustment is the same as for the upper panels, one should be cautious, in the sense that decisions here are more tentative, especially if we expect to increase the amount of data.
In any case, given our current L1 data and based on the scatter distribution plot in Figure 7c and the mean of mean accuracies line plot in Figure 7d, we find that the best performance(s) should be among the settings of s with 2 stacks, s with 2 and 3 stacks, and s with 3 stacks. Now, exploring the boxplots in Figure 7d, we notice that, even if the settings of s with 3 stacks reach the highest mean accuracy values, their positive skewness toward lower mean accuracies is not desirable. Consequently, the two remaining settings, which, in fact, have negative skewness toward higher mean accuracy values, are the optimal options, and again choosing one or the other will depend on the extent to which we tolerate data dispersion. Unlike the upper panels, here a larger dispersion increases the probability of reaching higher mean accuracy values; therefore, we finally decided to work with the setting of s with 2 stacks for L1 data.
To find the optimal number of kernels in the convolutional layers, we performed the adjustment shown in Figure 8, again separately for the data from each LIGO detector. Considering the information provided by the previous adjustment, we set s, and the number of stacks to 3 for H1 data and 2 for L1 data. Subsequently, once the 10-fold CV was run 10 times as usual, we generated boxplots for CNN configurations with several numbers of kernels, as anticipated in Table 1, marking all mean accuracies for each run (red circles). In addition, the average of each boxplot is included (blue crosses). The random horizontal spreading of the data inside each boxplot was applied to avoid visual overlap of the markers; it does not mean that samples were obtained with a number of kernels different from those specified on the horizontal axis.
Let us concentrate on the kernel adjustment for H1 data in the left panel of Figure 8. From these results, a CNN with 12 kernels gives by far the most stable results, because most of its mean accuracies lie in the smallest dispersion region; discarding outliers, half of the mean accuracies are concentrated in a tiny interquartile region located near . On the other hand, the CNN configuration with 24 kernels is the least suitable setting of all, not because its mean accuracy values are low per se (values from to are actually good), but rather because, unlike the other cases, the nearly zero skewness of its distribution is not prone to push sample values beyond the third quartile, as it might appear. The configuration with 8 kernels has a distribution mean very close to that of the setting with 24 kernels and even reaches two mean accuracy values of about . Nonetheless, given that the settings with 16, 20, 28, and 32 kernels have means of mean accuracies greater than or equal to (and, hence, boxplots located toward relatively higher mean accuracy values), these last four configurations offer the best options. In the end, we decided to work with 32 kernels, because this setting combines a whole set of desirable features: the highest mean of mean accuracies, namely , a relatively low dispersion, and a positive skewness defined by a rather small range from the first quartile to the median.
The kernel adjustment for L1 data is shown in the right panel of Figure 8. Here, the situation is easier to analyze, because the performance differences appear visually clearer than for the H1 data. The settings with 8, 20, and 28 kernels lead to mediocre performances, especially the first one, which has a high dispersion and of its samples below of mean accuracy. Notice that, as in the adjustment for H1 data, the setting with 12 kernels shows the smallest dispersion (discarding an outlier below ), with mean accuracies from to ; again, this option would be suitable if we were mainly interested in stable estimates. We decided to pick the setting with 16 kernels, which has the highest distribution mean, , with of the mean accuracy samples above the distribution mean, an acceptable data dispersion (not counting the clear outlier), and a relatively small region from the minimum to the median.
In summary, based on all of the above adjustments, the best time resolution is s, with a CNN architecture of 3 stacks and 32 kernels when working with data from the H1 detector, and 2 stacks and 16 kernels when working with data from the L1 detector. We use only these hyperparameter settings hereinafter.
Now, let us finish this subsection by reporting an interesting additional result. We can ask to what extent the magnitude of the perturbations from the mini-batch SGD algorithm influences the dispersion of the mean accuracy distributions, as mentioned at the end of Section 3.1. Here, we can compare the order of magnitude of the SGD perturbations with the dispersion present in the boxplots. In the previous subsection, we found that, when s and we work with a CNN architecture of 2 stacks and 20 kernels in the convolutional layers, the order of magnitude of the SGD fluctuations in accuracy is about . Curiously, this value is much greater than the dispersion of the data distribution shown in the left panel of Figure 7, which, in turn, reaches a value of , that is to say, times smaller. These results are good news because, apart from showing that the stochasticity of the mini-batch SGD perturbations does not fully determine the dispersion of the mean accuracy distributions, it seems that our resampling approach actually contributes to smoothing the stochastic effects of the mini-batch SGD perturbations and, hence, to decreasing the uncertainty in the mentioned distributions. This is a very important result that could serve as motivation and a standard guide for future works. Given that very few previous works on ML/DL applied to GW detection have transparently reported their results under a resampling regime, i.e., clearly showing the distributions of their performance metrics (for instance, [29,39]), this motivation is highly relevant. Resampling is a fundamental tool in ML/DL that should be used even if the involved algorithms are deterministic, because there will always be uncertainty given that data are always finite. Moreover, under this regime, it is important to report the distributions of the metrics to understand the probabilistic behavior of our algorithms, beyond mere averages or single values from arbitrarily chosen runs.
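The resampling regime described above can be sketched as follows; the classifier and data are simple stand-ins (our actual models are the CNNs described earlier), and the sketch only illustrates how one mean accuracy per run is collected from 10 repetitions of a 10-fold CV.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 64))                 # stand-in for flattened image samples
y = rng.integers(0, 2, size=400)               # class 0: noise alone, class 1: noise plus GW

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
fold_acc = []
for train_idx, test_idx in cv.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_acc.append(clf.score(X[test_idx], y[test_idx]))

fold_acc = np.array(fold_acc).reshape(10, 10)  # rows: runs, columns: folds
mean_acc_per_run = fold_acc.mean(axis=1)       # one mean accuracy per 10-fold CV run
print(mean_acc_per_run.mean(), mean_acc_per_run.std())
```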
3.3. Confusion Matrices and Standard Metrics
In general, accuracy provides information about the probability of a successful classification, whether we are classifying a noise alone sample ( ) or a noise plus GW sample ( ); that is to say, it is a performance metric with a multi-label focus. However, we would like to elucidate to what extent our CNNs are proficient in detecting samples of each class separately, so it is useful to introduce performance metrics with a single-label focus. A standard tool is the confusion matrix, shown in Figure 9 for the data from each detector. As we are under a resampling regime, each element of the confusion matrices is computed considering the entire set of detections which, in turn, results from concatenating all prediction vectors of dimension output by the 10 runs of the 10-fold CV.
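A minimal sketch of how such a pooled confusion matrix can be computed, assuming the true and predicted labels of every test fold are kept per run (the label arrays below are random placeholders):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# One array of true labels and one of predicted labels per CV run
# (placeholders here; in practice they come from the trained CNNs).
rng = np.random.default_rng(2)
y_true_runs = [rng.integers(0, 2, size=1000) for _ in range(10)]
y_pred_runs = [np.where(rng.random(1000) < 0.9, t, 1 - t) for t in y_true_runs]

# Pool every prediction from the 10 repetitions before counting.
y_true = np.concatenate(y_true_runs)
y_pred = np.concatenate(y_pred_runs)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)  # rows: actual class, columns: predicted class
```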
A first glance at the confusion matrices shown in Figure 9 reveals that our CNNs perform better at detecting noise alone samples than at detecting noise plus GW samples, because (using the notation to represent each element of a confusion matrix) the element is greater than for both matrices. Even so, the number of successful predictions of noise plus GW samples is reasonably good, because it considerably surpasses a totally random performance, as described by successful detections on the order of of the total negative samples.
Moreover, from Figure 9, we observe that, in terms of wrong predictions, the CNNs are more likely to make a type II error than a type I error, because for both confusion matrices. Thinking more carefully, this result leads to an advantage and a disadvantage. The advantage is that our CNN performs a “conservative” detection of noise alone samples, in the sense that a sample will not be classified as belonging to class 1 unless the CNN is very sure, which is to say that the CNN is quite precise in detecting noise samples. Using H1 data, of the samples predicted as belong to this class; and, using L1 data, of the samples predicted as belong to this class. This is an important benefit if, for instance, we wanted to apply our CNNs to remove noise samples from a segment of data with a narrow margin of error, in addition to other detection algorithms and/or analyses focused on generating triggers. Nonetheless, the disadvantage is that a non-negligible number of noise samples is lost by wrongly classifying them as GW event samples. In terms of false negative rates, of the actual noise samples are misclassified with H1 data, and of the actual noise samples are misclassified with L1 data. This would be a serious problem if our CNNs were implemented to decide whether an individual trigger is actually a GW signal and not a noise sample, either Gaussian or non-Gaussian noise.
Keep in mind that, according to statistical decision theory, there will always be a trade-off between type I and type II errors [71]. Hence, given our CNN architecture and datasets, it is not possible to reduce the value of the element without increasing the value of the element. In principle, keeping the total number of training samples, we could generalize the CNN architecture to a multi-label classification that further specifies the noise by including several kinds of glitches, as was implemented in works such as [33,38]. Indeed, starting from our current problem, such a multiclass generalization could be motivated by redistributing the current false negative counts into new elements of a bigger confusion matrix, where several false positive predictions would be converted into new successful detections located along a new, longer diagonal. Nonetheless, it is not clear how to keep the bottom edge of the diagonal of the original binary confusion matrix constant when the number of noise classes is increased; not to mention that this approach could be seen as an entirely different problem instead of a generalization.
With regard to the misclassified GW event samples, even though they are far fewer than the misclassified noise samples, we would like to understand more about them. Consequently, we decided to study the ability of the CNN to detect GW events depending on the values of their expected SNR, values that are provided with the LIGO hardware injections. The results are shown in Figure 10; the upper panel with data from the H1 detector, and the lower panel with data from the L1 detector. Both panels include a blue histogram for the actual (injected) GW events coming from the testing set, a gray histogram for the GW events detected by the CNN, and the bin-by-bin discrepancy between both histograms as scatter points. As a first approach, we defined this bin-by-bin discrepancy as the relative error
\[ \epsilon_i = \frac{D_i - I_i}{I_i}, \]
where \(D_i\) and \(I_i\) are the detected GW count and the injected GW count, respectively, and the index \(i\) represents a bin. Here, we set 29 same-length bins for both histograms, from a lower edge to an upper edge for H1 data, and from to for L1 data, respectively. For the testing histograms, the count of events comes from our predictions given our resampling regime.
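A short sketch of this bin-by-bin comparison, assuming the SNR values of the injected and recovered events are available as arrays (the values and recovery rate below are placeholders):

```python
import numpy as np

rng = np.random.default_rng(3)
snr_injected = rng.uniform(10.0, 150.0, size=6000)        # placeholder injected SNRs
snr_detected = snr_injected[rng.random(6000) < 0.8]       # pretend 80% are recovered

bins = np.linspace(snr_injected.min(), snr_injected.max(), 30)  # 29 same-length bins
injected_counts, _ = np.histogram(snr_injected, bins=bins)
detected_counts, _ = np.histogram(snr_detected, bins=bins)

# Bin-by-bin relative error, guarding against empty bins.
rel_error = np.where(injected_counts > 0,
                     (detected_counts - injected_counts) / np.maximum(injected_counts, 1),
                     np.nan)
print(rel_error)
```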
By comparing most of the bins that appear in both panels of Figure 10, we find that the detected GW count is greater the more actual injections there are in the testing set. In addition, most GW events are concentrated in a region of smaller SNR values. For H1 data, most events are in the first six bins, namely from to , with 6520 actual GW events and 5191 detected GW events, representing approximately and of the total number of actual GW events and detected GW events, respectively. For L1 data, on the other hand, most of the events are in the first seven bins, from to , with 5040 actual GW events and 3277 detected GW events, representing approximately and of the total number of actual GW events and detected GW events, respectively.
The above information regarding counts is relevant, but the most important results come from the relative errors. From these, in both panels there is a clear trend of detecting a greater percentage of the actual GW events as those events have greater SNR values. Moreover, focusing on the upper panel of Figure 10, corresponding to H1 data, the GW count relative errors are greatest in the first four bins, beginning with and ending with . Then, from the fifth bin at , the relative errors stochastically approach zero; indeed, a relative error exactly equal to zero is reached in 10 of the 29 bins. For L1 data, as shown in the lower panel, we observe a similar behavior of the relative error. The greatest relative errors appear in the first six bins, from to . Then, from the seventh bin at , the relative errors stochastically approach zero. Indeed, here a relative error exactly equal to zero is reached for the first time at a smaller SNR value than with H1 data although, once zero values begin to appear, relative errors that are further from zero than with H1 data also appear. This last result is statistically consistent with the fact that, according to Figure 9, the negative predictive value ( ) is smaller in the confusion matrix for H1 data than for L1 data, with and , respectively.
It should be stressed that, in the above paragraph, we refer to trends in the bin-by-bin discrepancies, which stochastically approach zero as the SNR values increase. Although the binning choice could contribute to the fact that the above discrepancies are not monotonic, the main reason behind this behavior is the stochasticity of our CNN algorithms. We are counting the injected and detected GW events, which come from repeated 10-fold CV experiments in which the dataset is stochastically split in each experiment.
For the bin-by-bin discrepancies shown in Figure 10, we include error bars. These are standard deviations, each computed from a distribution of 10 relative errors, since the 10-fold CV experiment is repeated 10 times. From the plots, we observe that the standard deviations do not approach zero as the SNR increases, meaning that the stochasticity introduced into these standard deviations by our resampling does not seem to be smoothed out by selecting certain SNR values.
It is important to reiterate that we applied the CNN algorithm under a realistic approach, in the sense that the GW events are given by the hardware injections provided by LIGO; therefore, all SNR values are fixed in the strain data, with no possibility of directly tuning them in numerical relativity templates before software injections. (In addition to the SNR values, the frequency of occurrence of GW events also represents an important challenge when generating a more realistic dataset emulating a record of astrophysical data, even though this leads us to work with highly imbalanced datasets. Moreover, for a more realistic situation, we could internally describe each histogram bin in Figure 10 as a random sampling in which the counts themselves take random values following a distribution; indeed, this hypothesis is usually assumed when performing systematic statistical comparisons between two histograms.) Additionally, this is consistent with real experimental conditions, in which the SNR values of real recorded GW signals depend solely on the nature of the astrophysical sources and the noise conditions of the detectors, aspects that obviously cannot be controlled during an observation run. If the CNN is able to deal with this limiting scenario beforehand, then it does not learn more than what is strictly necessary, avoiding overoptimistic results or even underperformance. Indeed, because of this, we can transparently conclude that our CNN per se is more sensitive, stochastically detecting GW signals when for H1 data and when for L1 data.
Continuing with our analysis, Table 3 shows a summary of several metrics previously defined in Table 2; again, these metrics were computed by counting the entire set of predictions, given that we repeated the 10-fold CV experiment 10 times. From the table, working with L1 data, the mean value of the recall tells us that of the noise alone samples are retrieved. Given these results, if we want a good chance of recovering most of the noise alone samples of a segment of data, for instance, to increase our catalogues of glitches in the short term or to fully analyze strain data in real-time observation in order to filter them, this CNN might not be the best option, because its sensitivity is not great. The mean recall is slightly better with H1 data, , but not enough to considerably improve the sensitivity. Notice, on the other hand, that the mean precision and mean fall-out show that our CNN is quite precise in classifying noise alone samples because, once it labels a set of samples as such, for L1 data of them are actually noise alone and just are GW signals. For H1 data the results are even better, because the mean precision is and the mean fall-out is . In the end, this disparity between recall and precision is summarized in the F1 score. For H1 data, the F1 score is and, for L1 data, it is . In both cases, the mean F1 score reaches a moderate performance, with numerical values lying between those of the mean recall and the mean precision. In addition, although fall-out plus precision is theoretically exactly 1, here we are considering means over several stochastic realizations of these metrics; hence, the sum differs slightly, by for L1 data and for H1 data.
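For reference, and assuming the usual definition (Table 2 is not reproduced here), the F1 score is the harmonic mean of precision and recall,
\[ \mathrm{F1} = \frac{2\,\mathrm{precision}\cdot\mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \]
which is why its mean values lie between those of the mean recall and the mean precision.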
Because the F1 score has the limitation of leaving out true negative samples, it is advisable to report it together with G mean1. Table 3 also shows the mean values of this metric, namely for H1 data and for L1 data. These two values are low because, by definition, G mean1 is mainly susceptible to the sensitivity of the CNN. In fact, these results highlight a useful feature of G mean1, namely that it works as a warning against overoptimistic performance reports based solely on accuracy. Notice, on the other hand, that the mean of G mean1 shows a slightly better performance for L1 than for H1 data, showing that G mean1 also helps to avoid excessively pessimistic interpretations when accuracy, or other metrics, reach relatively lower values (for an N-label classification, imbalanced datasets, and , accuracy runs a serious risk of becoming a pessimistic metric, and working with single-label focus metrics would be impractical when N is significantly larger, because we would need N metrics to detail the model performance; hence the need to draw on metrics such as the F1 score and G mean1).
Table 3 also shows the dispersion of the metrics. For the data from a given detector (H1 or L1), we observe that the standard deviations of accuracy, precision, recall, and fall-out are of the same order of magnitude. This is expected because these metrics were computed directly from the same resampling of data predictions. In addition, for either H1 or L1 data, the standard deviation of the F1 score is also of the same order of magnitude as the other metrics. However, for G mean1, we observe a slightly smaller dispersion with L1 data than with H1 data, which is consistent with the minimal improvement reported in the mean of G mean1. In any case, this reported improvement is actually marginal, because all other metrics report a better performance of our CNN working with data from the H1 detector. In the next subsection, we give further reasons for this conclusion.
3.4. ROC Comparative Analyses
As mentioned in Section 2.6, all of the performance metrics shown in Table 2 depend on a chosen fixed threshold for assigning a class to each image sample. Until now, the previous analyses used a threshold of by default; however, to generate ROC curves, it is necessary to vary this threshold from 0 to 1, as pointed out in Equation (18). In general, ROC curves visually show to what extent our binary CNN classifier, depending on the threshold, defines the trade-off between recovered noise alone samples and GW event samples wrongly classified as noise alone samples. Moreover, in the context of ML/DL techniques, ROC curves are used to contrast the performance of a model learning from different datasets or, more broadly, to compare the performance of different models. As a case study, and to go beyond only evaluating our CNN probabilistic classifier, here we present two ROC comparative analyses, one using H1 data and the other using L1 data, each contrasting the performance of our CNN with that of two classic ML models, namely Naive Bayes (NB) and Support Vector Machines (SVM). We implemented the NB and SVM classifiers with the MATLAB Statistics and Machine Learning Toolbox [72]; for theoretical details, see [63] and/or [22].
The NB and SVM models need vectors as input, so we apply a reshaping operation: each image sample is flattened into a vector such that all of its columns, from the first to the last, are concatenated into one single long column. For the NB model, we assume that our training set follows a Gaussian distribution, with mean and variance obtained from maximum likelihood estimation (MLE). For the SVM model, on the other hand, we applied a normalization to each component of the flattened vector along all of the training samples, and we used a linear kernel.
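An equivalent sketch of this preprocessing and of the two reference models in Python (the original implementation used the MATLAB toolbox cited above; the array shapes and data below are placeholders):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(4)
images_train = rng.normal(size=(200, 32, 32))   # stand-in image samples
labels_train = rng.integers(0, 2, size=200)

# Flatten each image by stacking its columns, from the first to the last,
# into one long vector (Fortran order).
X_train = np.stack([img.flatten(order="F") for img in images_train])

nb = GaussianNB()                               # Gaussian likelihoods fitted via MLE
nb.fit(X_train, labels_train)

svm = make_pipeline(StandardScaler(),           # per-component normalization
                    SVC(kernel="linear"))       # linear kernel
svm.fit(X_train, labels_train)
```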
Keep in mind that there are no definitive criteria for generating ROC curves under the resampling regime. Consequently, following the same approach taken in Section 3.3 for computing the confusion matrices, we considered the whole set of predictions made by our learning-testing process. In practice, this approach avoids point-by-point averaging and helps to smooth the ROC curves by increasing their number of discrete generative steps. For all ROC curves, we set s for the strain samples and, for the ROC curves describing the performance of our CNN, we used the same hyperparameter adjustments for stacks and kernels that we found to be the best in Section 3.2.
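Under these choices, a pooled ROC curve can be obtained in a single pass over the concatenated predictions; the sketch below uses random placeholder scores and the trapezoidal AUC, in line with the computation described in the following paragraphs.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Pooled true labels and pooled scores for the positive class, concatenated
# over every test fold of every CV run (random placeholders here).
rng = np.random.default_rng(5)
y_true = rng.integers(0, 2, size=20000)
scores = np.clip(0.3 + 0.3 * y_true + 0.25 * rng.standard_normal(20000), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one curve from the pooled predictions
roc_auc = auc(fpr, tpr)                           # trapezoidal area under the curve
print(f"AUC = {roc_auc:.3f}")
```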
Figure 11 shows the results of our comparative analyses. Notice that, in both panels, all ROC curves are, in general, quite distant from the 45-degree diagonal of totally random performance, which is fairly good. Even so, depending on the dataset used, the models show different performances. When the models learn from (and are tested with) H1 data, their performance is better than with L1 data. Now, if we focus on each panel separately, for almost all thresholds the CNN model has the best performance, the NB model the worst, and the SVM model lies in the middle. However, as shown in the zoomed plots, the ROC curves in both panels have some peculiarities. In the left zoomed plot, we observe that our CNN has the best performance only until its ROC curve reaches , after which the NB classifier becomes the best and remains so until the end; in fact, the performance of the SVM model had already been surpassed by that of the NB model from the point . From what happens next, very close to the north-east corner , we should not draw any strong conclusion, because we are very close to totally random performance and, therefore, the results are mainly perturbations. In the right zoomed plot, we observe that only after the point in the ROC space does the NB model become better than the SVM model, while the CNN classifier always has the best performance.
Notice that, on all ROC curves, a specific point has been highlighted. This is called the “optimal operating point” (OOP), and it corresponds to the particular optimal threshold (OT) at which a classifier has the best trade-off between the cost of missing noise alone samples, , and the cost of raising false noise alone detections, . In the ROC space, this trade-off is defined by isocost lines of constant expected cost, where and . Assuming, as a first approach, that , the OOP is simply the point lying on the ROC curve that intersects the 45-degree isocost line closest to the north-west corner of the ROC plot. (If and were different and/or the dataset were imbalanced with respect to classes and , the OOP and OT would lie near one of the extremes. In that situation, the ROC analysis would be more sensitive to statistical fluctuations, making it difficult to take statistically significant decisions with respect to the class with far fewer detections and/or samples. Such a situation would require more data to deal with the imbalance, or alternative analyses such as precision-recall curves or cost curves.) For each ROC curve, the OOP, OT, and expected cost are included in Table 4. Notice from this table that, for the CNN classifier, the OT with H1 data is not closer to the exact fifty-fifty chance value of than the OT with L1 data, which shows that the default threshold of is actually chosen by convention, and not because it is a limit toward which the performance of our classifier improves. The relative difference between the OT and has nothing to do with performance, but rather with the skewness of the classes and/or the costs of misclassification. We also include the optimal expected cost computed with Equation (20), which defines the isocost curve on which the OOP lies. Notice that smaller values of define isocost curves that are closer to in the ROC space.
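Since the expected-cost expression itself is not reproduced above, a standard form of it (following the usual ROC-analysis convention, and not necessarily the exact notation of Equation (20)) is
\[ E[C] = c_{\mathrm{FN}}\,\pi_{+}\,(1-\mathrm{TPR}) + c_{\mathrm{FP}}\,\pi_{-}\,\mathrm{FPR}, \]
where \(\pi_{+}\) and \(\pi_{-}\) are the prior probabilities of the positive and negative classes, and \(c_{\mathrm{FN}}\) and \(c_{\mathrm{FP}}\) are the costs of a false negative and a false positive. Points of equal expected cost lie on straight lines of slope \(c_{\mathrm{FP}}\pi_{-}/(c_{\mathrm{FN}}\pi_{+})\) in the (FPR, TPR) plane; with equal costs and balanced classes this slope is 1, giving the 45-degree isocost lines mentioned above, and the OOP is the ROC point touching the isocost line closest to the north-west corner.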
In general, the relative performance of the models can change depending on whether their ROC curves intersect. Because of this, we would like a metric that summarizes the performance of a model in a single scalar, regardless of thresholds. Here, we used the total area under the ROC curve (AUC) [73]; this is a standard metric that gives the performance of the CNN averaged over all possible trade-offs between TP and FP predictions. Moreover, we can use this metric to make a final choice among all models; the best model corresponds to the highest AUC value. In practice, we computed this metric with a trapezoidal approximation, and the results are also included in Table 4. For both datasets, we have that , which allows us to conclude that, among the three models, the CNN definitely has the best performance, followed by the SVM classifier and, finally, by the NB classifier.
3.5. Shuffling and Output Scoring
Two related analyses were conducted to ensure that the results are statistically significant, as mentioned in Section 2.8. The first one was to run our CNN algorithm including a shuffling of the training samples before each training, with the peculiarity of removing the link between each sample and its known label. A comparison of the distributions of mean accuracies over all runs of the 10-fold CV experiment, with and without shuffling, is shown in Figure 12; remember that each point of the boxplots, i.e., the th mean accuracy or , as defined in Section 3.2, comes from the th run of the whole 10-fold CV. From this plot, we see that shuffling radically affects the results. Whether we work with data from the H1 detector or the L1 detector, if shuffling is present the distribution of mean accuracy moves toward lower values and its dispersion increases. With H1 data, the mean of mean accuracies decreases from to and the standard deviation increases from to ; with L1 data, the mean of mean accuracies falls from to and the standard deviation grows from to .
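A compact sketch of this sanity check, reusing the repeated 10-fold CV of Section 3.2 with a stand-in classifier and with the training labels optionally permuted (all names and data below are illustrative):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 64))
y = (X[:, 0] > 0).astype(int)                    # stand-in learnable labels

def mean_accuracies(X, y, shuffle_labels=False):
    """Mean accuracy per 10-fold CV run; training labels are permuted when requested."""
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=6)
    acc = []
    for train_idx, test_idx in cv.split(X, y):
        y_train = y[train_idx].copy()
        if shuffle_labels:
            y_train = rng.permutation(y_train)   # break the sample-label link
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y_train)
        acc.append(clf.score(X[test_idx], y[test_idx]))
    return np.array(acc).reshape(10, 10).mean(axis=1)

print(mean_accuracies(X, y).mean(), mean_accuracies(X, y, shuffle_labels=True).mean())
```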
Moreover, depending on whether shuffling is absent or present, the boxplot of mean accuracies has positive or negative skewness, respectively. This makes sense because, without shuffling, the higher the mean accuracies, the greater the effort the CNN needs to reach those accuracies; there is no free lunch, and we expect a higher concentration of samples below the median than above it and, therefore, a positive skewness. On the other hand, if we have shuffling, we know from basic probability that adding new points, one by one, to the mean accuracy distribution is actually a stochastically symmetric process around , the theoretical limit if we had an infinite number of points in the distribution, i.e., an infinite number of runs of the k-fold CV experiment. Consequently, given that here we obtained medians slightly below ( with H1 data and with L1 data), a higher concentration of points above the medians is expected and, hence, boxplots with negative skewness, because this acts as a balance to maintain the symmetry of the stochastic occurrences (i.e., the boxplot points) around the mean accuracy value.
Descriptive statistics is a reasonable analysis but, to draw a formal conclusion about the significance of our results, we performed a paired-sample t-test. Therefore, we first define the mean accuracy datasets without and with shuffling, respectively. Subsequently, with the means of each dataset at hand, and , the task is to test the null hypothesis by computing the corresponding value. Then, assuming a significance level (a standard similarity threshold between and ), we have that: (i) if , then we accept , or (ii) if , then we reject . The resulting values are shown in Figure 12, namely with H1 data and with L1 data. These values are much less than ; hence, we reject the null hypothesis and conclude that, for a significance level of (or a confidence level of ), the distribution is significantly different from . This is actually a quite good result.
As a final analysis, we focus on the output scoring of the CNNs. Our CNNs output scores that are probabilities generated by the softmax layer, as explained in Section 2.5. After training, these probabilities are defined by our classes, (noise alone) and (noise plus GW), conditioned on the model parameters within the vector once they have been learned; namely, (with ) for each input image sample. Histograms describing the distribution of these probabilities, considering all of our predictions, are included in Figure 13; all of the histograms were made using 28 same-length bins. Here, we have important results.
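For reference, the softmax layer converts the two pre-activation outputs (logits) into these class probabilities; in generic notation (not necessarily that of Section 2.5),
\[ p(s_c \mid \mathbf{x}, \boldsymbol{\theta}) = \frac{e^{z_c(\mathbf{x},\boldsymbol{\theta})}}{e^{z_0(\mathbf{x},\boldsymbol{\theta})} + e^{z_1(\mathbf{x},\boldsymbol{\theta})}}, \qquad c \in \{0,1\}, \]
so the two scores are non-negative and sum to one for each input image sample \(\mathbf{x}\).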
Firstly, in both panels of Figure 13, the distribution for and the distribution for are multimodal, each with three different modes or maxima. In addition, we observe that both distributions are asymmetric. Given a multimodal distribution, there is no unique definition of its center; it can be its mean, its median, its center of probability mass, among others. Here, we decided to define the center of the distribution as the optimal threshold (OT), because this metric is directly related to our decision criterion for assigning a class to the output score. The closer a probabilistic occurrence lies to the OT, the greater the uncertainty in deciding which class the CNN is actually predicting with that probability. The OT values were already computed and presented in Table 4, and we include them in the panels of Figure 13 as dashed lines.
For the probability , on the left-hand side of the OT there is a low concentration of occurrences up to the left edge bin, , which holds the greatest mode of the whole distribution; this edge bin contains of all occurrences counted over all bins when we work with H1 data, and of all occurrences with L1 data. In contrast, on the right-hand side of the OT, the occurrences are more dispersed around the two remaining modes; with H1 data, one of these remaining modes is located at the right edge bin , whereas with L1 data no mode is located at the right edge bin. The distribution of is similar to that of , except that it is inverted along the horizontal axis, so the highest mode is at the right edge bin and the two remaining modes lie on the left-hand side of the OT, among other differences. For , the fraction of occurrences counted in the right edge bin is likewise the same, with H1 data and with L1 data.
The above results actually mean that, given our datasets, our CNNs are more optimistic when predicting GW samples than when predicting noise alone samples or, equivalently, more pessimistic when predicting noise alone samples than when predicting GW events. Hence, even though our input datasets are exactly class-balanced, the predictive behavior of our CNNs is highly class dependent. Under a frequentist approach, the asymmetric shape of the distributions for and is the statistical reason why the CNNs have a high precision and a non-negligible false negative rate or, said more simply, why the CNNs are more “conservative” when classifying samples as noise alone than when classifying them as GW events; this is coherent with what we interpreted from the confusion matrices shown in Figure 9.
It is also important to notice how, depending on the data, the distributions of occurrences, either for or , change. Remember that, from the previous ROC comparative analyses, we found that working with H1 data yields better performance than working with L1 data and, here, we also observe this improvement from another point of view: the more uncertainty our CNNs have in predicting a specific class, the more the occurrences are concentrated (i.e., skewed) toward the OT. If our network had not learned anything at all, e.g., because of a shuffling such as the one previously applied, then all probabilities would be distributed as a Gaussian centered at the default threshold, , i.e., a totally random performance; although it is not explicitly included here, we checked these random values with our code and visualizations.