5.4. Models’ Comparison
We forecast the blood glucose at three prediction horizons, PH = 30, 60, and 120 min; we take PH = 30 min as the starting short-term prediction horizon and double it progressively to define the medium- and long-term prediction horizons. After a CH ingestion, the BG level starts to rise after 10 to 15 min. Hence, PH = 30 min is the minimum prediction horizon at which corrective actions can be taken. In addition, we find the maximum BG level one hour after the ingestion. Finally, we continue to double the prediction horizon to observe the maximum potential of the NN. We denote the actual BG value at time t as BG(t), the actual future BG value PH minutes ahead of time t as BG(t + PH), and the predicted BG PH minutes ahead of time t as B̂G(t + PH). The predictions are evaluated on a per-patient basis using the most common error metrics, Equations (4)–(10), respectively: mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), R-squared (R²), correlation coefficient (CC), fit (Fit), and mean absolute relative difference (MARD). In them, n is the number of predictions per patient, and the overlined terms denote the mean values of the actual and predicted BG.
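As a reference, these per-patient metrics can be computed with a few lines of NumPy. The sketch below uses the textbook definitions of each metric; details such as whether Fit and MARD are reported as fractions or percentages are assumptions, since Equations (4)–(10) are not reproduced here:

```python
import numpy as np

def glucose_metrics(y, y_hat):
    """Per-patient error metrics between actual BG values y and
    predicted values y_hat (1-D arrays, mg/dL). Textbook definitions;
    Fit and MARD are returned as fractions, not percentages."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    err = y - y_hat
    mse = np.mean(err ** 2)                       # mean squared error
    rmse = np.sqrt(mse)                           # root mean squared error
    mae = np.mean(np.abs(err))                    # mean absolute error
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot                    # R-squared
    cc = np.corrcoef(y, y_hat)[0, 1]              # correlation coefficient
    # Fit: 1 minus the ratio of the prediction error norm to the norm of
    # deviations from the mean; negative when predicting the mean would
    # beat the model (the "predictions further from the targets" case).
    fit = 1.0 - np.linalg.norm(err) / np.linalg.norm(y - y.mean())
    mard = np.mean(np.abs(err) / y)               # mean absolute relative difference
    return dict(MSE=mse, RMSE=rmse, MAE=mae, R2=r2, CC=cc, Fit=fit, MARD=mard)
```

Averaging these dictionaries over patients, together with the standard error of the mean, yields entries of the kind reported in the results tables.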
Table 11, Table 12 and Table 13 show the results of the 10-fold cross-validation over the different prediction horizons. The results are the average of the metrics over the patients, together with the standard error of the mean. Green cells in each column highlight the model with the best performance, whereas the grey-colored cells mark the worst. Two points of view must be differentiated to analyze these metrics: first, that of the predictive artificial intelligence tool, from which we only take into account the numerical results and do not look at the behavior of BG levels; second, the clinical point of view, in which we observe how the predictions affect the patients.
As an example, we present the RMSE values, since the same conclusions can be drawn from the MSE or the MAE. For PH = 30 min, the difference between the models with the lowest (Ensemble MMS) and highest (Idriss, 2019) RMSE30 is small. For PH = 60 min, the difference, between the Sun model and the Aiello one, is slightly higher. Finally, for PH = 120 min, the lowest RMSE corresponds to the Ensemble MMSZ and the highest one to the Idriss model. Hence, these differences can be notable in a numerical analysis of the results but are irrelevant from the clinical point of view.
R² can be interpreted as the explainability of the model, i.e., how much of the data variability can be explained by each of the models. For PH = 30 min, even the worst model explains 85% of the data variability; thus, the predictions of all the models are promising. For PH = 60 min, the explainability of the models remains high enough, explaining between 45% and 69% of the data variability. However, for PH = 120 min, only 32% of the data variability can be explained in the best case.
Regarding CC, we compare predictions versus actual values, so the highest performance corresponds to values near 1. CC30 values are around 0.94, and CC60 values range from 0.72 to 0.84, which are still relevant figures, while the highest CC120 is only 0.57, indicating a poor fit for that prediction horizon. In addition, the Fit values at 30 and 60 min show once again that all the models predict with very similar accuracy. Finally, for PH = 120 min, we find Fit values near zero, or even negative; this means that the predictions deviate from the targets more than the mean value does or, in other words, that the models do not predict correctly.
Finally, MARD is the most common metric used to analyze the accuracy of CGM systems [36]. It measures the relative difference between the actual values and the predicted ones; thus, the lower the MARD is, the more accurate the predictions are. For PH = 30 min, we find values between 0.09 and 0.12, which indicate good agreement of the predictions; for PH = 60 min, the values are greater, as expected, between 0.18 and 0.26; and, for PH = 120 min, MARD values lie between 0.28 and 0.38.
On the other hand, in clinical practice, physicians usually plot predictions versus actual values using the Parkes error grid (PEG) [37]. This graph has five zones (A to E) that bound prediction accuracy. These zones are set by taking into account the treatment that would be applied for the corresponding BG level. Zone A always corresponds to correct treatment, whereas zone E corresponds to treating a hyperglycemia while the patient actually suffers from hypoglycemia, or vice versa.
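The published PEG zone boundaries are piecewise-linear polygons in the (reference, prediction) plane [37]; they are not reproduced here, so the sketch below shows only the counting machinery, with the zone polygons left as placeholders to be replaced by the actual PEG vertices:

```python
def in_polygon(x, y, poly):
    """Ray-casting point-in-polygon test; poly is a list of (x, y) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray at height y
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside

def zone_fractions(reference, predicted, zones):
    """Fraction of (reference BG, predicted BG) pairs falling in each zone.
    `zones` maps a zone name (e.g. "A") to its polygon vertices; the real
    PEG polygons must be supplied by the caller."""
    counts = {name: 0 for name in zones}
    for r, p in zip(reference, predicted):
        for name, poly in zones.items():
            if in_polygon(r, p, poly):
                counts[name] += 1
                break  # zones are assumed disjoint: first match wins
    total = len(reference)
    return {name: c / total for name, c in counts.items()}
```

With the true PEG polygons in place, summing the fractions of zones A and B gives the percentages discussed below.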
Figure 6 illustrates the PEG for the models with the highest and lowest number of predictions in region A at the three prediction horizons. Hence, Figure 6a–f compare (Sun, 2018) versus (Muñoz, 2020). For PH = 30 min, both models concentrate the vast majority of their predictions in zones A and B. For PH = 60 min, the percentage of points inside regions A and B is somewhat lower but still high for both models. Finally, for PH = 120 min, most of the points lie in regions A and B, even in the worst case. From this analysis, we can conclude that at all the PHs, every model predicts well from the clinical point of view, and the differences between the models within a PH are negligible. In essence, all the metrics show consistent values within their error margins. However, the confidence intervals overlap, so we cannot conclude which models are better. From the predictors' point of view, the models predict well for PH = 30 min and PH = 60 min, but for PH = 120 min, the models have no accuracy at all. From the clinical point of view, there is no difference between choosing one model or another: for PH = 30 min and PH = 60 min, the models have very good accuracy and, in contrast with the first point of view, PH = 120 min can still be used.
We cannot extract definitive conclusions about NN performance using only the previous metrics, since the confidence intervals of most of the metrics in the previous tables overlap. This can be explained by the limited amount of data available, which does not cover enough scenarios for the models to learn. To extract such conclusions, we use three comparison methods based on the losses of the predictions, each one applying a different statistical approach, either frequentist or Bayesian. In all three of them, RMSE is the metric of choice to estimate prediction losses. Using these three methods, the models are compared for PH = 30 min, PH = 60 min, and the multi-horizon case; the last is a global evaluation of NN performance in a multi-horizon approach.
Firstly, we compare models using the scmamp method. Figure 7a–c show the probability of each model being the best for 30 min, 60 min, and multi-horizon, respectively. According to these results, Sun's model has the highest probability of being the best one among the non-ensemble models at all prediction horizons (for example, a probability in the range of 0.09 to 0.30 for 30 min). At PH = 30 min, Mirshekarian is the second-best model, clearly separated from the remainder of the models, whereas at PH = 60 min, the second-best model is not so clear, with Zhu, Khadem, and Meijner competing for this rank. Regarding the ensemble models, both of them perform similarly, although MMS, the ensemble with the lowest number of models, has the highest probability of winning in a multi-horizon scenario, with a probability as high as 0.48.
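scmamp itself is an R package that fits Bayesian models (e.g., Plackett–Luce) to algorithm rankings; a rough stand-in for the "probability of being the best" estimate can be obtained by bootstrapping the per-patient RMSE losses. This is a simplification for illustration, not the procedure used to produce Figure 7:

```python
import numpy as np

def prob_best(losses, n_boot=5000, seed=0):
    """losses: dict model_name -> array of per-patient RMSEs (same patients,
    same order for every model). Returns the bootstrap probability that each
    model attains the lowest mean loss."""
    rng = np.random.default_rng(seed)
    names = list(losses)
    mat = np.stack([np.asarray(losses[m], float) for m in names])  # models x patients
    n_pat = mat.shape[1]
    wins = np.zeros(len(names))
    for _ in range(n_boot):
        idx = rng.integers(0, n_pat, n_pat)   # resample patients with replacement
        wins[np.argmin(mat[:, idx].mean(axis=1))] += 1
    return dict(zip(names, wins / n_boot))
```

Overlapping, non-degenerate winning probabilities (as in Figure 7) indicate that no single model dominates across resamples of the patient set.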
The Model Confidence Set (MCS) [38] is a frequentist method whose aim is to determine which models are the best within a collection, with a given level of confidence, analogously to the confidence interval for a parameter. It consists of a series of tests that repeatedly filter the models in the initial set to finally return the set of those with the lowest losses at confidence level 1 − α, which we denote as M̂*. The tests are run on a sample of the models' predictions, typically using bootstrap replications. In particular, in Table 14, we set 1000 bootstrap replications and α = 0.05; that is, the models with a p-value > 0.05 are in the confidence set and, with different probabilities, they will return the best predictions. Table 14 reports the resulting confidence sets for all the prediction horizons as well as the multi-horizon analysis.
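The full MCS procedure uses studentized test statistics and a block bootstrap; the sketch below captures only its elimination logic (drop the worst model while the hypothesis of equal predictive ability is rejected) with a naive bootstrap test, so it should be read as an illustration of the idea rather than Hansen et al.'s exact algorithm:

```python
import numpy as np

def naive_mcs(losses, alpha=0.05, n_boot=2000, seed=0):
    """losses: dict model -> aligned array of per-sample losses.
    Returns the set of surviving models (simplified MCS, not the exact test)."""
    rng = np.random.default_rng(seed)
    surviving = dict(losses)
    while len(surviving) > 1:
        names = list(surviving)
        mat = np.stack([np.asarray(surviving[m], float) for m in names])
        n = mat.shape[1]
        means = mat.mean(axis=1)
        worst = int(np.argmax(means))
        # Is the worst model's mean loss significantly above the best's?
        diff = mat[worst] - mat[int(np.argmin(means))]
        obs = diff.mean()
        centered = diff - obs  # impose the null of equal predictive ability
        boot = np.array([centered[rng.integers(0, n, n)].mean()
                         for _ in range(n_boot)])
        p_value = np.mean(boot >= obs)
        if p_value > alpha:    # cannot reject equal ability: keep the whole set
            break
        del surviving[names[worst]]
    return set(surviving)
```

When the per-model losses are statistically indistinguishable, as the overlapping confidence intervals above suggest, the returned set contains several models.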
The Superior Predictive Ability (SPA) test [39] uses one model's losses as a benchmark, and its null hypothesis is that no model is better than the benchmark. Hence, if the p-value is high, there are no models better than the benchmark. This algorithm returns three p-values: lower, consistent, and upper. They correspond to different re-centerings of the losses, and, normally, the consistent one is the value taken into account [40]. According to Table 15, the best models in this analysis are Meijner, Sun, Mayo, and both ensembles.
The findings can be summarized as follows:
- (1)
The comparison of the models based only on confidence intervals or on the distribution of predictions in the grid's regions is not precise enough to rank the models. Indeed, the difference in RMSE between the best and worst models is of only a few mg·dL⁻¹, which, although notable for a prediction model, is irrelevant to the physicians' practice.
- (2)
At 30 min, the best models are consistently (either using scmamp, MCS, or SPA) the ensemble models, Sun, and Mirshekarian. The ensembles have a higher probability of winning, but their ranges of probability overlap with Sun’s range. Thus, taking into account the complexity of the ensemble models, Sun’s model can be a reasonable choice that combines good predictions with lower complexity.
- (3)
At 60 min, the best models are the ensemble models, Sun, and Zhu. As stated above, if we select the model that combines good predictions with the lowest complexity, Sun seems to be the best option.
- (4)
None of the NN models provide accurate predictions 120 min ahead of time. This is a wide time window during which a high number of events can occur, and the dataset does not contain sufficient information for the models to learn all the possible BG patterns that may arise over such a window.