Glucose Prediction under Variable-Length Time-Stamped Daily Events: A Seasonal Stochastic Local Modeling Framework

Accurate glucose prediction along a long-enough time horizon is a key component for technology to improve type 1 diabetes treatment. Subjects with diabetes might benefit from supervision and control systems that accurately predict risks and trigger corrective actions early enough with improved mitigation. However, large intra-patient variability poses big challenges to glucose prediction. In previous works by the authors, clustering and local modeling techniques with seasonal stochastic models proved to be efficient, allowing for good glucose prediction accuracy for long prediction horizons. Continuous glucose monitoring (CGM) data were partitioned into fixed-length postprandial time subseries and clustered with Fuzzy C-Means to collect similar behaviors, enforcing seasonality at each cluster after subseries concatenation. Then, seasonal stochastic models were identified for each cluster and local predictions were integrated into a global prediction. However, free-living conditions do not support the fixed-length partition of CGM data since daily events duration is variable. In this work, a new algorithm is provided to overcome this constraint, allowing better coping with patient’s variability under variable-length time-stamped daily events in supervision and control applications. Besides predicted glucose, two real-time indices are additionally provided—a crispness index, indicating good representation of current glucose behavior by a single model, and a normality index, allowing for the detection of an abnormal glucose behavior (unusual according to registered historical data). The framework is tested in a proof-of-concept in silico study with ten patients over four month training data and two independent two month validation datasets, with and without abnormal behaviors, from the distribution version of the UVA/Padova simulator extended with diverse sources of intra-patient variability.


Introduction
Diabetes represents one of the major health challenges of the 21st century. According to [1], 463 million people have diabetes worldwide in 2019, and this figure is estimated to rise to 700 million by 2045. Type 1 diabetes (T1D), the most severe kind, affects about 10% of people with diabetes. In T1D, β-cells in the pancreas are destroyed by an autoimmune response. These cells secrete insulin, which is an essential hormone needed to maintain blood glucose (BG) concentrations in a narrow range (70-140 mg/dL). Insulin promotes the transport of glucose into the cell and its lack translates into elevated BG (hyperglycemia). Thus, T1D treatment consists in insulin replacement by means of subcutaneous administration of insulin analogs in order to avoid the long-term complications due to hyperglycemia. This is carried out either by multiple daily injections with insulin pens or continuous infusion through insulin pumps. However, overdosing of insulin can provoke too low BG concentration (hypoglycemia), with severe consequences if untreated, including comma and death. Good BG control is challenging, and requires a lot of effort from patients, who must frequently monitor their BG levels and take dosing decisions to avoid hyperglycemia and hypoglycemia. Nowadays, glucose concentration can be measured in a quasi-continuous way by using continuous glucose monitoring (CGM) devices, which consist of an electrochemical sensor measuring interstitial glucose at the subcutaneous space and estimating BG, which is transmitted periodically (every 5 min) to an external receiver [2].
Accurate glucose prediction along a given time horizon is a key component in current diabetes technology. Patient monitoring systems must raise warnings of risks enough time ahead to successfully prevent extreme hyperglycemic and, especially, hypoglycemic episodes [3]. In sensor-pump integrated systems, predicted glucose levels can trigger actions by the insulin pump, like automatic insulin infusion suspension to mitigate impending hypoglycemia [4]. Glucose prediction is also an important feature in artificial pancreas (AP) systems, that is, closed-loop glucose control systems where a control algorithm governs the insulin pump from CGM data [5,6]. AP research has been intense in the last decade. A first commercial system was launched in late 2017, a second one in early 2020, and many more are on the way [7,8]. Glucose prediction can be an integral part of the control algorithm itself, such as in Model Predictive Control-based AP systems [9]. Therefore, high-reliability glucose prediction models have the potential to significantly improve diabetes management as part of a monitoring system, integrated systems or an AP.
A huge number of alternatives for BG prediction in T1D have appeared in the literature, such as, among others, linear empirical dynamic models, multivariate nonlinear regression techniques, extended Kalman filters, data mining or artificial intelligence approaches [10]. In [11,12], seasonal stochastic local models are introduced for the first time in the field aiming at improved glucose prediction under high intra-patient variability. The rationale lies on the fact that observed behaviors in past similar scenarios can help improving prediction at the present moment. Seasonal stochastic time series models such as SARIMA (Seasonal AutoRegressive Integrated Moving Average), or SARIMAX (SARIMA including eXogenous variables), are meant to capture regular periodical patterns in time series data [13]. Although seasonality is not a priori a characteristic of glucose data in T1D, it can be enforced partitioning historical CGM data into fixed-length time subseries and concatenating them, after a clustering process collecting similar behaviors based on a given similarity measure. Then, seasonal stochastic models can be identified for each cluster and their predictions integrated to provide a global glucose prediction. Seasonal models have exhibited relatively higher postprandial BG prediction accuracy for long prediction horizons [11], including challenging scenarios with variety of meals and exercise sessions [12]. In this latter case, SARIMAX models were used including insulin infusion and energy expenditure from a wearable as exogenous inputs.
In the above works, a concatenation of fixed-length postprandial time subseries driven by mealtime was considered, from controlled studies. Although this can be appropriate to test performance and concepts underlaying the new prediction methodology, its application in free-living conditions do not support fixed-length partition of CGM data, as required to enforce seasonality. Daily events such as mealtime and night periods happen at non regular times and have variable lengths. In this work, algorithms are modified to overcome this constraint, allowing to cope with patient's variability under variable-length timestamped events, and paving the way to its application in supervision and control in T1D. CGM data is partitioned into a collection of "event-to-event" time subseries whose length is regularized (unified) by adding fictitious blank samples where needed to enforce seasonality. This imposes the use of new clustering algorithms able to deal with these blank samples, which correspond to missing (incomplete) data. Main events are mealtimes and night periods, although other events could be considered as long as a timestamp exists, such as reported hypoglycemia treatments or exercise sessions. Besides predicted glucose, two real-time indices are additionally provided indicating crispness of the prediction (degree of representation of current glucose behavior by a single model), and normality of the observed behavior (degree of representation by registered historical data), allowing for the detection of patient's abnormal states (unusual or not seen before in historical data), which might trigger conservativeness in the insulin therapy for the sake of safety.

Data Overview
Three datasets were generated for the ten adults virtual cohort in the educational version of the UVA/Padova simulator [14], which was extended with an exercise model [15], and diverse sources of intra-patient variability: • Training dataset: Four-month data was generated for the ten patient cohort following open loop therapy (basal insulin and ratios provided in the simulator were used). The 15-15 rule was used to treat hypoglycemia (15 g of carbohydrates were administered when glucose went below 70 mg/dL, and repeated if after 15 min hypoglycemia was still present). A nominal day with three meals of 40, 90 and 60 g of carbohydrates at 7:00, 14:00 and 21:00 h was considered. Actual mealtime and carbohydrate intake varied around nominal values following a normal distribution with a standard deviation of 20 min for mealtime, and with a 10% coefficient of variance for meal size. Although no study on dietary habits was performed to set this amount of variability on meal size, a 10% coefficient of variance managed to challenge this in silico study with a wide enough variety of meal sizes, ranging from 20 to 120 g, distributed around the chosen nominal values. Meal absorption dynamics was changed at each meal by randomly selecting one of meal model parameter sets available from the simulator, resulting in faster or slower meal absorptions especially due to nonlinearities in gastric emptying emulating a slow down of carbohydrates absorption produced by fats [14]. Additionally, meal absorption rate (k abs ) varied with a uniform distribution in ±30%, and carbohydrate bioavailability ( f ) in ±10% around selected nominal values. Carb counting errors by the patient were considered with a uniform distribution between −30% and +10%, following results in [16] where a trend to meal underestimation is reported. Insulin absorption pharmacokinetics (k d , k a1 , k a2 ) varied in ±30%, according to the intra-patient variability reported in [17]. Circadian variability of insulin sensitivity (V mx , K p3 ) was considered with variations in ±30% around nominal sensitivity, reproducing changes in basal insulin requirements in the adult population reported in [18]. The reader is referred to [14] for the model and parameters details. Finally, missed boluses were sporadically generated with a minimum separation of 14 days between them in order to better mimic imperfect data collection. These postprandial periods were later excluded to compose the training dataset from collected data. • Validation dataset 1 ("normal"): Two-month data was generated for the 10-patient cohort for the same scenario as data used for training. Thus, events and variability in this validation dataset is expected to be well represented in training data (it contains "normal" data). • Validation dataset 2 ("abnormal"): Two-month data was generated for the 10-patient cohort adding to the above scenario missed boluses (once per week), and three moderate intensity (50% VO2max) exercise sessions per week (on Tuesday, Thursday and Saturday). Nominal duration of the exercise duration was 60 min, and nominal start time at 18:00 h. Actual duration and start time changed at each session following a normal distribution with coefficient of variance of 10% for duration and standard deviation of 20 min for start time. Since training data does not contain missed boluses and exercise, these events are expected not to be well represented in training data (the dataset contains "abnormal" data).
In all the above datasets, besides CGM data, time stamps and labels for the following events were generated: Breakfast, Lunch, Dinner, Night and Hypoglycemia treatment. The Night event was considered to start 6 h after dinner. All rescue carbs administered in a hypo-glycemia treatment were gathered in a single Hypoglycemia treatment event, starting at the time of the first rescue carb intake. This will be represented by the set of time-ordered pairs where t e i is the timestamp for the i-th event, with t e i < t e i+1 , L e i is the label of the i-th event, L e i ∈ {Breakfast, Lunch, Dinner, Night, Hypoglycemia treatment}, and n e is the number of events. In a clinical context, assumption of existence of time stamps for meals and night period is a mild one. Mealtime can be automatically extracted from insulin pump information (bolus calculator), or from the recently marketed smart pens in a multiple daily injections therapy. The start of a night period is defined algorithmically and no input is necessary from the patient. Hypoglycemia occurrence can be detected from CGM data and only confirmation of rescue carbs intake by the patient would be needed. However, this information in not critical since the amount of carbs is not needed. In the absence of this confirmation, assumption of treatment of any hypoglycemia occurrence might be assumed.

Enforcing Seasonality: CGM Data Partitioning
As a first step, seasonality must be enforced prior to building seasonal stochastic local models from the training data set. Provided that the starting time and duration of events in the CGM historical data is variable between event instances, the original CGM time series data is partitioned into a set of "event-to-event" time subseries from the reported timestamps and labels in the set E . A single partition can be considered if event labels are neglected, or multiple partitions can be generated according to some grouping of event labels. This latter option has the advantage of building prediction models more specific to well differentiated events (for instance a night compared to a meal), although it requires events labeling. A compromise solution is adopted here, with consideration of three partitions-meals, night and hypoglycemia treatment. No difference between breakfast, lunch and dinner is needed, requiring only information about mealtime and hypoglycemia occurrence, which can be extracted automatically. Thus, consider Therefore, each CGM time subseries comprises from a given event timestamp, t e i , to the next event timestamp, t e i+1 .
In order to enforce seasonality, time subseries length must be regularized. To this end, a maximum expected length for an event-to-event period must be set. In this work, the period with the maximum duration for each partition Π M , Π N , and Π H for the training data is computed, denoted here as L M , L N , and L H , respectively. Given the time subseries [G] t e i+1 t e i (t) ∈ R l i , then L M := max i∈I M l i , L N := max i∈I N l i , and L H := max i∈I H l i . Time subseries with length l i smaller than the corresponding maximum length (L M , L N , L H ), are expanded with blank values ("not a number"-NaN) until reaching maximum length. As a result, time subseries length is regularized and seasonality can be enforced (see Section 2.4). Remark that most of time subseries will have blanks in the final positions which should be adequately treated as missing data in the clustering and identification processes. As well, regularized length will be different for each partition Π M , Π N , and Π H .
Other events may be present in the historical data besides the considered ones, like for instance snacks and exercise. If these events are announced by the patient or can be automatically extracted from the devices, they could be included in the list of events driving data partition. Alternatively, they can be treated as exogenous inputs, as it was the case of exercise in [12] by means of signals from a wearable. Unannounced and unmeasurable events affecting CGM will be part of the data variability, which will be adequately treated in the clustering step.

Enforcing Similarity: Data Clustering
Once CGM data is partitioned into event-driven glucose time subseries, data in a partition are clusterized to gather similar glucose responses. Fuzzy C-Means (FCM) clustering [19] was successfully applied in previous works to the classification of similar postprandial periods [12]. However, FCM requires data to be complete for the computation of cluster prototypes (centroids) and distance measures. Thus, FCM is not directly applicable to the problem at hand due to the presence of missing data resulting from the seasonality enforcing step described in the previous section. Therefore, the Partial Distance Strategy FCM (PDSFCM) algorithm [20] is used here to overcome CGM incomplete data. Given a data set X = {x 1 , x 2 , . . . , x n }, x i ∈ R L , i = 1, . . . , n, PDSFCM partitions data into c > 1 clusters through the minimization of the objective function where V = (ν 1 , ν 2 , . . . , ν c ) is the vector of sought cluster prototypes ν i ∈ R L , where L is the regularized length in the partition under consideration, that is, L M , L N , or L H , d is a partial distance function and m ∈ [1, ∞) is the fuzziness parameter. The following partial distance is used here [20] d(a, b) : for a, b ∈ R L and B the number of blanks (in one vector, or the other, or both). Remark that for B = 0, the above distance is the standard Euclidean distance. The partition matrix U = [u ij ], i = 1, 2, . . . , c, j = 1, 2, . . . , n, collects the membership value of each data x j to each cluster i, where u ij ∈ [0, 1], ∑ c i=1 u ij = 1, ∀j and 0 < ∑ n j=1 u ij < n ∀i. In our case, the data set X corresponds to each partition Π ∈ {Π M , Π N , Π H } of glucose time subseries [G] t e i+1 t e i (t) after length regularization as described in Section 2.2.
There exists a large variety of indices to determine the optimal number of clusters. In this work, the Fukuyama-Sugeno (FS) index was used [21].

Local SARIMA Models
After data in a partition are clusterized, local seasonal models are trained for each cluster. A SARIMA model, as opposed to its non-seasonal counterpart ARIMA model, includes new seasonal autoregressive (SAR) and seasonal moving-average (SMA) terms introducing dependence of an observation at time t on observations and stochastic errors at lags multiple of the model seasonality period s [22]. Combination with AR and MA components will introduce dependence on observations and stochastic errors previous to t, t − s, t − 2s, . . . up to a given number of past data depending on the corresponding model component orders. Remark that this implies that CGM data before each event time is needed to properly feed the SARIMA model at t = t e i and during the first computed samples depending on the AR order. This will be referred to as pre-sampling data, which length must accommodate to the maximum expected AR order, denoted as P r . Then, a CGM time series can be built from the time-ordered concatenation of regularized-length time subseries [G] t e i+1 t e i (t) ∈ R L in a cluster, preceded by the corresponding pre-sampling data, of length P r (see Figure 1). The resulting concatenated time series G(t) will have seasonality s := L + P r . Remark there will be as many concatenated CGM time series like the one in Figure 1 as number of clusters (c), gathering together similar behaviors. Then, a seasonal stochastic local model can be identified for each cluster. An illustrative example of a concatenated time series for seasonal stochastic local models identification. Each element in a cluster, corresponding in this case to glucose responses to meal events regularized to length L, are concatenated ordered according to the event time stamp, and preceded by pre-sampling data of length P r . As a result, a seasonal time series with seasonality s = L + P r is obtained. Remark that meal events not necessarily are consecutive due to the clustering process.
In the absence of exogenous inputs, a local SARIMA model for a given cluster can be expressed in the form: where G(t) is the glucose concentration at time t, α is a constant term (intercept), w(t) is the disturbance series, ∇ is the backward difference operator, defined by ∇w(t) := w(t) − w(t − 1), d is the non-seasonal integration order of the time series, D is the seasonal integration order of the time series (in practice, d and D are usually small, equal to 0, 1 or 2), ∇ s is the seasonal backward difference operator, defined by ∇ s w(t) The input ε(t) is the stochastic error following a white noise process ε(t) ∼ W N(0, σ 2 ) and φ p (z −1 ), Φ P (z −s ), θ q (z −1 ), and Θ Q (z −s ) are polynomials in the lag (back-shift) operator z −1 of degree p, q, P, and Q, respectively, defined as Model (15) will be expressed in short form as SARIMA (p, d, q)(P, D, Q) s . If the clustering process is successful, the time subseries in a cluster will be similar, as defined by the partial distance measure d. Then, a SARIMA model can be identified for each cluster i, yielding a set of local glucose predictors (one per cluster). Remark that the cost index during identification will neglect blank data in the residuals computation, as well as pre-sampling data added during the concatenation. Box-Jenkins methodology [23] is used in this identification process. Exogenous inputs could also be considered with SARIMAX models [12], although here only glucose and mealtimes will be considered available.
As a summary, Figure 2 shows the data processing steps during the training phase prior to local models identification. Figure 3 shows the identification procedure.
Partition 3 SARIMA/X local models identification Repeat the same process with all the subsets

Model Integration
Once local models are trained for each cluster, a glucose predictor is built from the integration of such local models during real-time operation. Consider a given time t p ≥ t e k , where t e k is the latest event timestamp, from which a prediction is to be computed according to the selected prediction horizon. A total of c local predictions will be available at that time instant,Ĝ i (t|t p ), i = 1, . . . , c produced by the models at each cluster. A global glucose predictionĜ(t|t p ) will be computed as a weighted sum of the c local predictions, with timevarying weights γ i (t p ), as followŝ The computation of the weights γ i (t p ) is carried out in a two-step process in an attempt to discard coarsely non-contributing clusters (very dissimilar behavior of the cluster center compared to measured CGM data), and then refining the weighting among the remaining possibly contributing clusters. Given [G] t 1 t 0 (t) the segment of CGM data from t 0 to t 1 , and [ν i ] t 1 t 0 (t), i = 1, . . . , c, the corresponding segments for cluster centers, the fuzzy membership of the segment [G] t 1 t 0 (t) to the i-th cluster in a given index set I is defined as where n I := |I| is the cardinality of the index set I. Then, weights in (17) are computed as follows: • First, fuzzy membership u i (t p ; t e k ), i ∈ I 1 := {1, . . . , c} is computed. That is, similarity of CGM data and cluster centers from the event onset t e k to current time t p is checked. Given a (possibly time-varying) threshold µ min (t p ), a new index set I 2 := {i ∈ I 1 | u i (t p ; t e k ) ≥ µ min (t p )} ⊆ I 1 is defined, selecting the clusters with high enough similarity according to the selected µ min . A value µ min (t p ) := 0.2 max i u i (t p ; t e k ) was used here, that is, contributions of clusters for which membership is lower than 20% of the maximum membership are neglected. • Then, in a second step, fuzzy membership u i (t p ; t p − W) to each cluster i ∈ I 2 is computed, where W is a to-be-defined (short) time window length. A time window W corresponding to 20 min will be considered here.
Weights γ i (t p ) are then computed as . (19) Remark that ∑ c i=1 γ i (t p ) = 1 and γ i (t p ) can be interpreted as a fuzzy membership value. We will refer to (17)- (19) as Global Seasonal Model (GSM) prediction.
It must be remarked that at the onset of the next event, t e k+1 , the CGM time subseries from t e k to t e k+1 , with regularized length L, together with the corresponding pre-sampling data, must be appended as new data in the concatenated seasonal time series (Figure 1) of a selected cluster. This is needed because local SARIMA models make use of time-ordered previous data in the same cluster for their predictions, according to the SAR component order. The cluster with the highest fuzzy membership u i (t e k+1 ; t e k ), i ∈ I 1 , for the CGM time subseries is selected for this purpose. This should not change representativeness of the cluster prototype, since high similarity of the new time subseries with respect to cluster members is expected, whenever training data represents well the diversity of glucose profiles in a patient. Eventually, online updating of the SARIMA models, or re-clustering and re-training, can be performed.

Real-Time Crispness and Normality Indices
Besides predicted glucose, two real-time indices are additionally provided indicating crispness of the prediction (degree of representation of current glucose behavior by a single cluster), and normality of the observed behavior, allowing for the detection of patient's abnormal states (behaviors too dissimilar to available historical CGM data). It must be remarked that, in the scope of this work, the concept of normality is not primarily related to a physiological meaning: normality does not mean patient normality (normoglycemia), and abnormality does not mean patient abnormality (hyper-or hypoglycemia). On the contrary, abnormality must be understood as a behavior too dissimilar to available historical CGM data of a patient. For example, consider that patient A performed exercise regularly in the past and patient B did not perform exercise at all. This difference will be reflected in the available historical CGM data of patient A and B used during training, and, consequently, the identified models using data for patient A will be able to predict well a glycemic response under exercise, whereas identified models using data for patient B will perform badly when facing a well distinguishable response to exercise. Therefore, an exercise bout leading to the same CGM profile will be considered normal for patient A and abnormal for patient B. Hence, the objective of the normality index will be to inform patient B that something abnormal is happening in his/her CGM profile and that he/she can not rely on the prediction of the proposed system. In the case of patient A, nothing will happen in this case. In order to check the normality index capabilities and performance, in Section 2.1 a fictitious validation dataset 2 ("abnormal") has been defined containing missed boluses and exercise sessions, which are not present in the validation dataset 1 ("normal"). Then, response to missed boluses and exercise sessions are expected to contain abnormal CGM segments (those well distinguishable from other glycemic responses in historical data) in the of the context of this paper, although missed boluses and exercise sessions are normal in the patients' daily life. The crispness and normality indices may therefore provide significant information to the user (and control algorithms) as part of an online monitoring system.
The "crispness index" (CI) aims at providing information about how much the computed glucose predictionĜ(t|t p ) is produced by a single model. Since the glucose predictor is a multi-model system weighted by fuzzy membership values, the index rationale is that predictions based on a single local model in the set, perfectly matching the system's behavior, should be more reliable. This ideal best case is the one with a membership value γ i (t p ) of 1 for one of the local models and 0 for the others. On the contrary, the worst case is that all local models contribute equally to the prediction, with weight 1/c. The crispness index at time t p is defined as the Manhattan distance, normalized to the interval [0, 1], between the vector composed by the membership values γ i (t p ), i = 1, . . . , c, and the equally-distributed-membership vector (1/c, 1/c, . . . , 1/c), that is The "normality index" (NI) aims at determining the degree at which the current behavior is well represented in historical data and forecasting is the result of "interpolation" among local models predictions (normal behavior), or, on the contrary, the current behavior is beyond past behaviors and forecasting results from "extrapolation" by some or all the local models (abnormal behavior). Possibilistic memberships can give some hints in this case. The possibilistic counterpart of (18), denoted as u P i (t 1 ; t 0 ), representing possibilistic membership to the i-th cluster referred to the data window [t 0 , t 1 ] can be computed as where u P i (t 1 ; t 0 ) ∈ [0, 1], and η is a constant. Define now the possibilistic counterpart of (19) as Contrary to fuzzy membership, ∑ c i=1 γ P i (t p ) = 1 does not necessarily hold now. In the presence of an abnormal period, far away and not represented by any cluster, equal memberships to each cluster would be obtained when considering fuzzy membership (18), resulting from similar distances to all prototypes. The same result would be obtained in a normal period just in the middle of the available clusters. Nevertheless, if the possibilistic membership (21) is used, such abnormal period would result in very low membership values for every cluster. Thus, at a given time t p , the sum of all possibilistic memberships γ P i (t p ) provides a measure of the sought concept: the closer to zero, the more abnormal the period is, not being represented enough by any cluster. Thus, the normality index is defined as where normalization by the number of non-zero membership values (cardinality of the index set I 2 ) is carried out so that N I(t p ) ∈ [0, 1]. Abnormal behaviors may be due to hardware failures, extreme hypoglycemia or hyperglycemia, or any other glycemic behavior not represented in the historical time series used for training the local models. When abnormal behaviors are detected, the user should be informed of the extrapolation being performed. This can highlight the necessity of re-clustering and learning new local models. To this end, an abnormal behavior warning can be generated when N I(t p ) is below a given tunable threshold N I thr . Figure 4 summarizes the local models integration procedure to get a glucose prediction, together with the computation of crispness and normality indices.

Validation Procedure
The training data set comprised four-month in silico data for a ten patient cohort, as described in Section 2.1. For each patient, training data was partitioned in Π M , Π N , and Π H sets. Clustering and local model training was then applied to each partition, yielding a set of glucose predictors per patient (one per event group M, N , and H). Real-time operation of the glucose predictor. When a new event is reported, the predictor trained for that event class (meal, night, or hypo treatment in this work) is selected. This predictor will be composed of a number of cluster prototypes and local models (dashed boxes PROTOTYPES and SARIMA/X MODELS). Then, at each sampling time t p , a new glucose measurement G(t p ) is received from the CGM, and a predicted trajectoryĜ(t|t p ) is computed considering a given prediction horizon. To this end, fuzzy memberships to each cluster, γ i (t p ), provide the weighting factors of predicted trajectories by the local models,Ĝ i (t|t p ), following the two-step method described in Section 2.5. Additionally, crispness and normality indices are computed from fuzzy and possibilistic memberships to each cluster, respectively, as described in this section, which are provided as outputs together with the predicted trajectory. When the next event happens, the CGM time series of the last period is added to the cluster with most similar prototype, as well as to the historical data of the local model associated to that cluster.
On the one hand, local model training was performed by separating data in each cluster into training and validation data. The first 80% of the available data for each cluster was used to generate the concatenated time series for model identification (Section 2.4). The remaining 20% was kept as validation time series. For all time subseries in the validation data in a cluster, predictions of the corresponding local model were evaluated, for prediction horizons PH ∈ {15, 30, 60, 120, 180, 240} min, every time step, from starting time t e i up to t e i+1 − PH, in order to be able to compute prediction residuals. If the duration of the time subseries was smaller than PH, then it was discarded. Standard prediction horizons in the literature are 30 to 60 min. One of the purposes of this in silico study was to analyze in which extent these values could be extended, since longer prediction horizons can be beneficial, for instance, in the context of control algorithms. A prediction horizon of 240 min was considered long enough to represent a postprandial period, and longer prediction horizons were considered of no additional value. Standard metrics RMSE (mg/dL) and MAPE (%) were used to measure forecasting accuracy: where n is the number of observations, G(i) denote the ith glucose observation, and e(i) = G(i) −Ĝ(i) is the forecasting error (residual), withĜ(i) the forecast of G(i). Two definitions of residuals were considered, leading to different forecasting accuracy metrics: • Residuals of each predicted trajectory G(t|t p ), t ∈ [t p , t p + PH]. It evaluates how successfully a predicted trajectory fits actual data. Remark that during the identification process glucose trajectories are sought to be fitted by the model, following the same rationale in this metrics definition. In this case, for each prediction, PH observations are available. Metrics were averaged for all computed predictions. Henceforth, this metrics will be referred to as RMSE traj and MAPE traj . This is the metrics used in our previous works [11,12]. • Residuals of the glucose trajectory built from predictions G(t p + PH|t p ). It evaluates how successfully actual glucose trajectory can be predicted PH time instants ahead. In this case, for each time subseries in the validation data, and denoting as l i the length of the time subseries, l i − PH observations are available for the computation of residuals. Metrics were averaged for all time subseries in the validation data. Henceforth, this metrics will be referred to as RMSE last and MAPE last .
On the other hand, glucose predictors performance was evaluated using two independent 2-month validation data sets. A first validation data set included normal behavior, while the second one included new events not present in the training data in order to challenge the predictor (see Section 2.1). For each validation data set, glucose prediction was performed for the whole 2-month period, selecting the corresponding glucose predictor at each new event according to its type (meal, night, hypoglycemia treatment). For all time subseries in the validation data, predictions of the GSM model were computed in the same way as for local models evaluation, described above. As validation data was generated independently from training data, in few cases the duration of a validation time subseries was longer than the prototypes length (regularized length of training data). In this case, last value of the prototypes were held constant up to the validation time subseries length. Metrics RMSE traj , MAPE traj , RMSE last , and MAPE last were computed as described before.

Time Series Building and Data Clustering
Training data, for each of the 10 patients, was comprised of 360 meals and 119 night events, of varying duration. The number of hypoglycemia treatment events was patientdependent, attending to the settings of the open loop therapy provided in the simulator, with a median value of 18 events and 25-75% percentiles of 13 and 36 events, respectively. Mean CGM glucose ranged from 122 mg/dL (patient 10) to 169 mg/dL (patient 7).
Application of the PDSFCM clustering algorithm, in combination with the FS cluster validity index for the determination of the number of clusters, resulted in a varying number of clusters depending on the variability exhibited by the patient, as shown in Table 1. As an illustration, Figure 5 shows the clustering results for patient 6 and data in the meals partition Π M . The number of elements per cluster were 50, 61, 36, 63, 70 and 71 for clusters 1 to 6, respectively. Regularized length of the time subseries was 98. Considering the whole cohort, regularized lengths ranged from 98 to 106 samples for Π M , from 60 to 65 samples for Π N , and from 36 to 76 samples for Π H . After clustering, time subseries were assigned to the cluster with maximum membership. Then, a seasonal time series per cluster was created by concatenating all time subseries assigned to the given cluster representing similar glycemic excursions, preceded by pre-sampling data. A length of five samples was considered for pre-sampling data (P r = 5), given that orders lower than 5 were obtained for the AR and MA processes in previous studies. Thus, a seasonality period of s = L + 5 was enforced, with L the regularized length for that patient and data partition. The resulting seasonalities are presented in Table 2.

Local Models Identification
A model for each cluster was identified using the Box-Jenkins methodology. As stated in Section 2.7, the first 80% of the data in a cluster were used for model identification, and the rest 20% for model evaluation. During the identification, residuals at time instants with missing values or belonging to pre-sampling periods in the concatenated time series were neglected (weighted 0 in the cost index). SARIMA model structure and parameters were identified for each cluster. Table 3 shows, as an example, the resulting SARIMA model structure (p, d, q)(P, D, Q) s for each cluster for patient 6. The orders for the AR component (p) ranges from 1 to 4, and for the MA component (q) from 0 to 4. With regards to seasonal part, SAR component order (P) ranges from 1 to 2, and SMA order (Q) from 0 to 2. Need for time series differentiation (d = 0 or D = 0) only appeared in hypoglycemia treatment cases, where glucose may follow an increasing trend. Similar results were obtained for the rest of patients. Table 3. Example of local models structure (patient 6), represented as (p, d, q)(P, D, Q) s .

Partition
Cluster Accuracy metrics of the trained local models is shown in Table 4, aggregating data for all patients and local models in a partition, as well as overall metrics. For all prediction horizons, the lowest metrics were obtained for the night partition Π N due to the lower variability during nights as compared to meals. Local models for Π H had a poorer performance due to the significantly lower number of events for training and validation, compared to Π M and Π N . Comparing RMSE traj (MAPE traj ) versus RMSE last (MAPE last ), the latter reported higher values, since residuals of the last predicted value in each prediction is expected to be higher than when the whole predicted trajectory is considered. However, the higher the prediction horizon, the lower this difference was. This may be due to the implicit exclusion of residuals in [t e i , t e i + PH) in the computation of RMSE last (MAPE last ), where t e i is the event initial time. When PH is long enough, an important part of a postprandial response may be neglected, which may provoke a loss of monotonicity of the RMSE last (MAPE last ) metrics with respect to the prediction horizon. Overall performance was good, with a maximum average RMSE traj of 11.46 mg/dL (MAPE traj of 6.79%), corresponding to a PH of 240 min. Maximum average RMSE last was 13.69 mg/dL (for a PH of 180 min), and maximum average MAPE last of 8.86% (for a PH of 240 min). During the night predictions, these metrics were reduced approximately by 40%. Table 4. Accuracy metrics of the trained local models aggregating data for all patients and local models in a partition, as well as overall metrics. The number of computed PH-ahead predictions is reported in each case, which depends on the number of validation events and their length.  Figure 6 shows an illustration of the real-time computation of glucose predictions considering 5 local models for a sample validation time subseries in Π M . Four 120-min ahead predictions are shown, separated one hour each, at the samples indicated by the vertical dashed lines starting at mealtime. Upper panel shows CGM data, local models predictions and global glucose prediction obtained from the weighted sum of local predictions, with time-varying weights are shown in the γ i (t p ) panel. Early postprandial response is more variable and, thus, more local models contribute to the glucose prediction computation, compared to late postprandial phase, where LM2 seems to fully describe the response in this case. This is also described by the crispness index, with lower values at time instants where a higher number of local models contribute to the predicted trajectory. The normality index describes to what extent past short-time CGM data at each prediction time t p is represented by the same segment of data from cluster prototypes, and "interpolation" instead of "extrapolation" is being performed. In this case, a drop of the index is observed during the glucose increase until peak value. A too low value would indicate an abnormal behavior, which threshold need to be tuned (see Section 3.4). Table 5 present validation results of the glucose predictor in real-time operation for the 2-month independent validation dataset 1 ("normal") described in Section 2.1. Metrics are expected to deteriorate as compared to the local models performance due to the local predictions integration process, as it is observed in the data presented. For this dataset, overall average RMSE traj ranged from 3.51 mg/dL to 23.90 mg/dL for prediction horizons 15, 30, 60, 120, 180 and 240 min (MAPE traj from 2.35% to 15.19%). Overall average RMSE last ranged from 6.52 mg/dL to 28.96 mg/dL (MAPE last from 4.04% to 20.00%). Table 5. Accuracy metrics of the glucose predictor for validation dataset 1 ("normal").    Table 6 presents validation results for the two month independent validation dataset 2 ("abnormal") described in Section 2.1. Inclusion of abnormal events slightly increased error in 1-2% as comparison of MAPE traj and MAPE last in Tables 5 and 6 reveal. This was expected since missed boluses and exercise were not present in the training. The impact was higher up to a prediction horizon of 120 min, which may be related to the duration of the abnormal event effect on glycemia (postprandial and post-exercise periods). The number of predictions column indicates that abnormal events increased the number of hypoglycemic episodes (due to exercise), inducing different partitioning of data and participation of the glucose predictors for Π M , Π N , and Π H . For the abnormal dataset, overall average RMSE traj ranged from 4.04 mg/dL to 25.36 mg/dL for the PHs considered (MAPE traj from 2.65% to 15.06%). Overall average RMSE last ranged from 7.33 mg/dL to 32.83 mg/dL (MAPE last from 4.66% to 19.82%).

Number of
Other authors have conducted in silico studies for the performance analysis of a variety of prediction methods, although comparison is not straightforward due to the difference in the simulators used, scenarios considered, and different input requirements by the glucose prediction technique. In this sense, the lower the inputs needed to perform a prediction the better, especially when such inputs require patient intervention such as the amount of carbs intake, for instance. These results are summarized in Table 7.
In [24], 8-day glucose and insulin data is used from a virtual cohort of 30 patients from the educational version of the UVA/Padova simulator, comprising 10 adults, 10 adolescents and 10 children. Variability in meal time and quantity was considered. Three models are compared: AR, ARX considering insulin as input, and ANN (artificial neural network), with prediction horizons of 30 min and 45 min. For the adult population, RMSE last values of 14.0 mg/dL, 13.3 mg/dL and 2.8 mg/dL are reported for AR, ARX and ANN with prediction horizon 30 min, respectively. These figures increase to 23.2 mg/dL, 22.8 mg/dL, and 4.0 mg/dL for a prediction horizon of 45 min. ANN outperformed AR and ARX models. Compared to results in Table 5, AR and ARX are outperformed by the methodology here presented, with a RMSE last of 11.32 mg/dL for a prediction horizon of 30 min. This is as well the case for results in Table 6 including abnormal data, where an RMSE last of 13.18 mg/dL was achieved. Metrics obtained for AR and ARX for 45 min are comparable to the metrics reported in Table 5 for a prediction horizon of 120 min (RMSE last = 25.15 mg/dL), and in Table 6 for 60 min (RMSE last = 21.00 mg/dL), respectively. However, this is not the case for ANN, with extremely low RMSE last values reported in [24]. This may be due to the nature of the in silico study, since the simulator used in [24] does not include the extra features of intra-patient variability reported in Section 2.1, besides the length of the data (8 day vs. 2 months), which might have produced overfitting in the ANN case. As well, the ANN used insulin data as external input which may limit applicability to insulin pump users, while in our case, only glucose and mealtime is used for glucose prediction. This latter can be extracted from the pump or smart pen information, or even estimated using meal detection algorithms in standard insulin pen users. Table 6. Accuracy metrics of the glucose predictor for validation dataset 2 ("abnormal"). In [25], a neural network incorporating meal information in parallel with a linear predictor (NN-LPA) is evaluated in silico with 11-day data from 20 subjects of the UVA/Padova simulator. As in the previous case, variability in meal time and quantity was considered. The method is compared with the neural network presented in [30] (NNPG) and an AR(1) model. For a prediction horizon of 30 min, RMSE last values of 9.4 mg/dL, 10.7 mg/dL, and 17.5 mg/dL are reported for NN-LPA, NNPG and AR(1), respectively. Of note, NN-LPA requires a physiological meal model since the glucose rate of appearance at t + PH is one of the inputs of the neural network. In [26], a latent variable-based model (LVX) is evaluated from 7-day data from 10 virtual subjects of the UVA/Padova simulator. The model used meal intake and insulin information as exogenous variables. An average RMSE last of 8.6 mg/dL and 14.0 mg/dL was reported for 30-min and 60-min prediction horizons, respectively, as compared to our results in Table 5 with an RMSE last of 11.32 mg/dL and 17.85 mg/dL, respectively, and Table 6, with 13.18 mg/dL and 21.00 mg/dL, respectively. A slight increase of RMSE last of only approximately 3-4 mg/dL (5-7 mg/dL with abnormal data) is observed in our results, despite the far more challenging intra-patient variability in our work and less input information required, as stated above.

Number of Predictions
More recent in silico studies, in this case using the same extended UVA/Padova simulator as in this work, have been reported. In [27], a physiological model (PM) in combination with CGM signal deconvolution is presented for long-term glucose prediction. The model requires as inputs carbohydrate intake and insulin delivery, as opposed to our case. Two-week data for 10 in silico subjects is used for model training and evaluation. RMSE last values of 10.90 mg/dL, 24.44 mg/dL, 33.50 mg/dL, and 37.63 mg/dL are reported for prediction horizons of 30, 60, 90 and 120 min. These values outperformed ARX and LVX models in head-to-head comparisons. In [28], convolutional recurrent neural networks (CRNN) are evaluated with 1-year data from 10 in silico subjects. Models are trained with 50% of the data and evaluated with the rest 50%. Glucose, meal and insulin information are required as inputs. RMSE last values of 9.38 mg/dL and 18.87 mg/dL for 30-min and 60-min prediction horizons are reported. Above metrics are outperformed in general by results presented in Tables 5 and 6, being especially relevant longer prediction horizons. Finally, in [29] dilated recurrent neural networks (DRNN) are tested with the same 1-year in silico study. A prediction horizon of 30 min is considered, with a reported RMSE last of 7.8 mg/dL, which is better than our result for the same prediction horizon even in the absence of abnormal data (11.32 mg/dL). However, DRNN required information on insulin, meal intake and time of the day, as opposed to this work.
Summing up, with the limitation of a non-head-to-head comparison and variety of simulation studies in the literature, metrics obtained outperformed other methods, or were very close to the reported performance. This was so even when imperfect training was considered introducing abnormal data in the validation dataset, while having the clear advantage of the minimal input information required (CGM and mealtime). This one can ultimately be automatically extracted without patient intervention making the glucose predictor suitable for insulin pump and MDI users.
From the point of view of the real-world application of the methodology, it is worth remarking that, in a similar way to a neural network, the training phase is the one more computationally demanding. In this work the computing cluster Rigel from Universitat Politècnica de València was used, in particular the Dell Power Edge R640 nodes with 2 processors Intel Xeon Gold 6154 with 18 cores, 3 Ghz and 25 Mb cache memory. The clustering phase for the 10 patients, which included 10 clustering problems per patient with increasing number of clusters for later selecting the optimal number of them, using 22 cores and 3 Gb memory per core, was computed in 22 min. The training of the local models, using 30 cores and 3 Gb memory per core, was performed in about 6.5 h. Validation of local models, using the same resources, lasted 53 min. This gives rise to a total of 8 h of computing time approximately for training. Once trained, the real-time evaluation of a glucose prediction is straightforward, requiring the evaluation of equations in Section 2.5 for the glucose prediction (evaluation of n SARIMA models, n fuzzy memberships, and a weighted sum of time series, where n is the number of clusters), and Section 2.6 for crispness and normality indices.

Normality Index
The Normality Index defined in Equation (23) is intended to provide information on abnormal glucose behaviors according to available historical data. Thus, it is expected that low normality values at time t p , N I(t p ), are related to higher prediction errors for glucose trajectories computed at that same time t p . The Normality Index relies on the comparison of CGM values in the recent past (20 min in this work) with cluster prototypes in that same time window. It is expected that this relationship with the prediction error is stronger for short or medium prediction horizons, for which abnormal behavior is still present. A PH of 60 min was considered in the following analysis. All predictions G(t|t p ), t ∈ [t p , t p + PH], carried out during the validation of the glucose predictor for dataset 2 ("abnormal"), amounting to a total of 138,888 predicted trajectories, were grouped into two groups, for N I(t p ) < N I thr and N I(t p ) ≥ N I thr , with the threshold N I thr varying from 0.1 to 0.9 in steps of 0.1. Then, the difference of the median prediction error at t p + PH (absolute value of the residual) among groups was analyzed. Remark that these residuals are the ones considered in the computation of RMSE last for a given validation time subseries. Figure 7 shows these results. All differences were found statistically significant using the Wilcoxon Rank Sum test (p < 0.001). This test was selected since data was non-normal and non-paired.  . Relationship between Normality Index and prediction error. Difference of the median absolute value of the residual at t p + PH between groups for N I(t p ) < N I thr and N I(t p ) ≥ N I thr , with N I thr ranging from 0.1 to 0.9, is shown. All differences were found statistically significant (p < 0.001).
Clinically meaningful differences were obtained for the lowest thresholds, amounting to 19.14 mg/dL for N I thr = 0.1, and 10.29 mg/dL for N I thr = 0.2. As an illustration, Figure 8 shows the relationship for N I thr = 0.2, which reveals an increased presence of higher prediction errors when N I(t p ) < 0.  Finally, Figure 9 illustrates an example of the relationship between NI and abnormal events in the simulated scenario. Postprandial response after a missed bolus (blue shaded area) is clearly detected as an abnormal CGM response. Consider that a minimum delay of 20 min will happen since that is the window considered in the computation of NI. After an initial glucose rise that could be similar to other meals, the value of NI is below the threshold consecutively at each sample during the postprandial period until glucose returns to normoglycemia. Regarding the exercise, including 3-h post-exercise period with altered insulin sensitivity (green area), abnormality is found in the hypoglycemia recovery, due to the increased need of carbohydrates as compared to a non-exercise related hypoglycemia in the training data, as well as the initial response of the meal following exercise. Other segments of CGM data might be also considered abnormal without relation to missed boluses or exercise, such as the one around sample 1.64 × 10 4 . In this case, a meal close to a hypoglycemia recovery happened.
The information provided by the Normality Index can be useful in real-time operation, raising for instance warnings when sufficiently long abnormal periods are detected (as in the case of missed boluses), due to expected decrease of accuracy of predictions. Additionally, remark that its computation is independent of predictions since it relies only on the clustering training phase (no local models are implied). This means that its use in an offline context can also be devised, as a tool for the analysis of CGM data highlighting areas of abnormal response that may deserve special attention by the clinician.

Conclusions
In previous works, seasonal local modeling proved successful in glucose prediction for longer prediction horizons. However, fixed-length postprandial time series were used, which do not apply to realistic scenario. In this work, this limitation is overcome with the introduction of event-to-event CGM time series partitioning, and clustering of variablelength data. A new local models integration method is also proposed. Metrics obtained outperformed other literature methods, or were very close to the reported ones, even when imperfect training was considered introducing abnormal data in the validation dataset, while having the clear advantage of the minimal input information required (CGM and mealtime). This one can ultimately be automatically extracted without patient intervention making the glucose predictor suitable for insulin pump and MDI users. Besides, the normality index can provide useful information about abnormal glycemic responses leading to a more cautious use of the provided predictions, or providing valuable information to clinicians for the inspection of CGM data. This study has the limitation of any in silico study, and evaluation with clinical data should be conducted as the next step in this research. Although this work focused on the adult cohort for the sake of comparison with other literature methods, the methodology could have been applied equally to the adolescent or children virtual cohort, with an adaptation of the number of clusters considered according to the population variability. Data Availability Statement: The simulated datasets used in this study are available on request from the corresponding author.