Long-Term Glucose Forecasting for Open-Source Automated Insulin Delivery Systems: A Machine Learning Study with Real-World Variability Analysis

Glucose forecasting serves as a backbone for several healthcare applications, including real-time insulin dosing in people with diabetes and physical activity optimization. This paper presents a study on the use of machine learning (ML) and deep learning (DL) methods for predicting glucose variability (GV) in individuals with open-source automated insulin delivery (AID) systems. A three-stage experimental framework is employed in this work to systematically implement and evaluate ML/DL methods on a large-scale diabetes dataset collected from individuals with open-source AID. The first stage involves data collection, the second stage involves data preparation and exploratory analysis, and the third stage involves developing, fine-tuning, and evaluating ML/DL models. The performance and resource costs of the models are evaluated alongside relative and proportional errors for 17 GV metrics. Evaluation of fine-tuned ML/DL models shows considerable accuracy in glucose forecasting and variability analysis up to 48 h in advance. The average MAE ranges from 2.50 mg/dL for long short-term memory (LSTM) models to 4.94 mg/dL for autoregressive integrated moving average (ARIMA) models, and the RMSE ranges from 3.7 mg/dL for LSTM to 7.67 mg/dL for ARIMA. Model execution time is proportional to the amount of data used for training, with LSTM models having the lowest execution time but the highest memory consumption compared to other models. This work successfully incorporates the use of appropriate programming frameworks, concurrency-enhancing tools, and resource and storage cost estimators to encourage the sustainable use of ML/DL in real-world AID systems.


Overview of Data-Driven Automated Insulin Delivery Systems
With an ever-increasing number of diabetes technologies that assist individuals living with insulin-requiring diabetes, large amounts of diabetes-related and user-entered behavioral data are generated. Connected insulin pens or insulin pumps deliver insulin, and real-time blood glucose information is obtained using Bluetooth-enabled glucose meters or continuous glucose monitors (CGM). Insulin pumps and CGM can be combined as part of an automated insulin delivery (AID) system, where data from each device flows through an algorithm to determine insulin-delivery rates and automatically adjust them to keep glucose values in a specific range, requiring less work from people with diabetes and also improving quality-of-life outcomes [1]. AID systems further generate rich data about the conditions in which they operate (such as sensor glucose values, user-entered information such as targets or carbohydrates, and current and previous insulin delivery) [2]. Exploring these rich data sources unveils opportunities for scientific discoveries to better understand individual glucose outcomes and improve diabetes technology.
There has been increasing interest in applying machine learning (ML) and deep learning (DL) techniques to improve predictions of glucose levels [3]. Accurate and reliable glucose profile forecasting is essential for a range of data-driven applications and use cases that improve diabetes management (Figure 1). ML models are able to train and automatically capture hidden trends and patterns in large volumes of data with considerable accuracy and efficiency. This enables them to make decisions for various prediction and classification tasks and to learn and improve over time.

Applications of Machine Learning and Deep Learning in AID Systems
Several ML techniques, including K-Nearest Neighbour (KNN), Random Forests (RF), Long Short-Term Memory (LSTM), Support Vector Regressor (SVR), and Gradient Boost (XGBoost), have been used for regression and classification tasks to predict and identify hypoglycemia and hyperglycemia [4][5][6][7][8][9][10][11][12][13][14]. These methods use invasive and non-invasive techniques to collect data, such as continuous glucose monitor data and physiological and demographic features, to train the models and achieve high prediction accuracy. Our in-depth review of ML/DL methods applied to glucose forecasting (Section 2.1) yields a list of challenges and limitations to the practical adoption of these methods in open-source AID systems for glucose profile forecasting, including: (1) the limited prediction horizon (30, 60, or 120 min) of trained models; (2) inconsistency in the reported accuracies and employed model evaluation metrics, which makes it difficult to compare and reproduce the existing work; (3) the unavailability of large-scale, real-world diabetes datasets, which encourages the use of artificial and synthetic data for model training and evaluation; (4) the lack of evaluation and reporting on the computing resource costs of building the models; (5) the lack of implementation details and open-source models that are fine-tuned on diabetes datasets; and (6) the lack of assessment of clinically approved glucose variability metrics (reviewed in Section 2.2) based on predicted glucose profiles.
Historically, due to the non-availability of quality diabetes data, many early datasets used to perform ML-related work were considered "large" if they contained several weeks of data from a dozen individuals. However, with the early adoption of open-source AID systems, which predated the availability of commercial AID systems for several years, users donated their anonymized data for diabetes research [15]. The resulting dataset from the OpenAPS Data Commons contains tens of thousands of days of glucose data points [16] and is employed in this paper.
One unique aspect of open-source AID systems such as OpenAPS is their inherent design to be understandable to users, including the rationale of every decision they make. ML can be seen as a black box, and it may be challenging to substitute an ML-based prediction algorithm wholesale into an open-source AID. However, OpenAPS is uniquely designed to generate predictions based on various scenarios, including whether carbohydrates are fully absorbed, or a meal is consumed but not recorded to the system. These predictions are conditionally blended and heuristically used [17]: for example, to produce an estimate of the lowest predicted glucose value over the timeframe relevant for insulin dosing, and separately the blended average glucose level over the approximate period when the activity of any additional insulin would be peaking, in order to limit contributions to hypoglycemia while also seeking to minimize hyperglycemia. Therefore, OpenAPS is one such system where an ML-based prediction algorithm could be introduced, blended into the current set of predictions, and used alongside the system's backstop of safety rules to achieve the highest possible time in the target glucose range (known as "time in range" or TIR) without much hypoglycemia or hyperglycemia.

Original Contributions
As a result of this opportunity for improvement, this paper sought to assess different ML-based prediction methods for glucose profiles, paying particular attention to the limitations of existing works noted above [18] and to their performance in terms of accuracy and resource consumption (training/inference time and memory consumption), with the intention of integrating them into open-source or future commercial AID solutions.
In this paper, 30 and 60 days of glucose data have been employed from a set of individuals with diverse demographic attributes from the OpenAPS Data Commons to train a set of ML and DL models, including ARIMA, XGBoost, RF, SVR, and LSTM. The fine-tuned models have been further evaluated based on their performance and resource consumption for glucose profile prediction up to 48 h. Finally, a set of clinically validated statistical and glucose variability (GV) metrics has been calculated, and a comparative analysis of the predicted and expected outcomes is presented.
All models have been implemented with the flexibility to train online, and programming scripts are open-sourced for reproducibility and benchmarking [19].

Organisation of the Paper
The rest of this paper is divided into the following sections. Section 2 presents the literature review of tools and technologies for glucose profile assessment and the latest advances in ML-based glucose forecasting methods. Section 3 provides a summary of the dataset and techniques adopted for diabetes data collection, selection and cleaning, followed by a description of the employed ML-based predictive models and the glucose analysis metrics. Section 4 presents the glucose variability assessments and the evaluation results of trained ML models for selected individuals with insulin-requiring diabetes. The section further shows the performance and resource costs of ML-based predictive models and reports the relative and proportional errors resulting from a comparison of GV metrics obtained for predicted and expected glucose profiles. Section 5 presents discussions on the analysed ML model outcomes and the assessment of metrics used for glucose analysis, highlights the lessons learned, and discusses the limitations. Finally, Section 6 concludes the paper and provides a roadmap for future considerations.

Related Work
This section first reviews recent research developments towards ML-enabled glucose predictions and highlights the main limitations and challenges, followed by a review of clinically approved glucose variability metrics.

Review of Machine Learning and Deep Learning Methods and Techniques for Glucose Forecasting
Several machine learning and statistical learning techniques have been employed for regression and classification tasks to predict and identify hypoglycemia and hyperglycemia.
Mordvanyuk et al. [4] employed the K-Nearest Neighbour (KNN) algorithm on machine-simulated data and used meal information along with CGM data to predict out-of-range glucose with 83.64% accuracy. Dave et al. [5] employed 26 features, including gender, the hour of the day, etc., as multivariate input to logistic regression (LR) and random forest (RF) algorithms to predict glucose up to 60 min with sensitivity and specificity over 90%. Another approach uses physiological data, including heart rate and movement recorded by a smartwatch, alongside CGM data in the Gradient Boost algorithm to classify normal blood glucose levels and hypoglycemia with an accuracy of 82.7% [6].
Zhu et al. [7] used the OhioT1DM dataset [20] to train a Long Short-Term Memory (LSTM) network to predict up to 30 and 60 min of glucose data and reported root mean square errors (RMSE) of 19.10 mg/dL and 32.61 mg/dL, respectively. In [8], simulated data from UVA-Padova [21] (360 simulated days of 10 patients) and the OhioT1DM dataset (8 weeks of clinical trials on 6 patients) were employed to train a dilated recurrent neural network (D-RNN) with a prediction RMSE of 20.1 mg/dL. Using data from 12 individuals from OhioT1DM, Yang et al. [9] developed an autonomous channel model using a combination of multiple LSTM models for glucose prediction up to the next 30 and 60 min with RMSEs of 18.9 mg/dL and 31.79 mg/dL, respectively.
Berikov et al. [10] used eight CGM-derived metrics, including glycemic control and glucose variability, from 406 patients in RF, logistic linear regression with lasso regularization, and artificial neural networks (ANN) to predict the next 15 and 30 min of glucose data with considerable accuracy. Duckworth et al. in [11] used explainable ML (trained using CGM data for 153 people with diabetes) to make predictions of hypoglycemia and hyperglycemia up to 60 min. The gradient boost (GB) algorithm yielded strong prediction performance (AUROC of 0.998 and 0.989 for hypoglycemia and hyperglycemia, respectively) in comparison to standard heuristic and logistic regression models. Van et al. [12] employed a portion of the Maastricht Study's dataset (including CGM and accelerometer data) to train multiple ML and DL models (including ARIMA, support vector regressor (SVR), GB, LSTM, and RNN) and predicted the next 15 and 60 min of blood glucose levels with RMSEs of 0.48 mmol/L and 0.9 mmol/L, respectively. In [13], the authors trained a personalized LSTM model (using UVA-Padova simulator data for 100 patients with meals, insulin, and past blood glucose) to predict the next 40 min of blood glucose levels with an RMSE of 7.67 mg/dL.
Allam et al. [14] trained an RNN and SVR using data from 9 individuals to predict blood glucose over 15, 30, and 60 min horizons, with RMSEs (in mmol/L) of 0.14, 0.55, and 1.32 for the RNN and 0.52, 0.89, and 1.37 for the SVR, respectively. In [22], the authors presented an ensemble approach using SVR as a base model together with ARIMA and physiological features (trained on data for 10 individuals with type-1 diabetes) to predict blood glucose levels with RMSEs (in mg/dL) of 19.5 and 35.7 for 30 and 60 min prediction horizons, respectively. A jump neural network (JNN) in [23] is trained on data for 20 T1D individuals to predict 30 min of blood glucose with an RMSE (mean ± standard deviation) of 16.6 ± 3.1 mg/dL.
Pustozerov et al. [24] trained a linear regression model using data from 62 individuals (with 48 pregnant women with gestational diabetes mellitus (GDM) and 14 women with normal glucose tolerance) with food intake as an evaluation parameter. Results show that the RMSE of BG levels for 1 h after food intake is 0.87 mmol/L. The use of smartwatches has seen tremendous growth with improvements in sensor technology motivated by the use of Photoplethysmography (PPG) signals to detect volumetric changes in blood in the peripheral circulation [25]. Data from 9 people (3 males and 6 females) was used to train ada-boost and RF models to provide 90% prediction accuracy for glucose levels [25]. Dave et al. [5] trained an RF model to predict possible hypoglycemia for 30 and 60 min ahead of time with a sensitivity and specificity of 91% and 90%, respectively.
Georga et al. [26] used multivariate data (including glucose profile, plasma insulin concentration, appeared glucose derived from a meal in the blood circulation, and the energy utilized during other physical activities) from 27 people in free-living conditions in an SVR to predict glucose levels for 15, 30, 60, and 120 min with average prediction errors of 5.21, 6.03, 7.14, and 7.62 mg/dL, respectively. Pérez-Gandía et al. [27] trained a neural network using data from 15 individuals to predict glucose in 15, 30 and 45 min horizon with an RMSE of 10, 18, and 27 mg/dL, respectively.

Limitations and Shortcomings
To summarise, multiple ML/DL frameworks and methodologies have been employed to forecast and predict blood glucose for people with diabetes. The limitations and shortcomings of the existing literature are listed below:
• The primary issue across the reported methods is the evaluation of trained models over a limited prediction horizon, i.e., the reported predictions are limited to 30, 60, or at most 120 min.
• The lack of consistency in the reported accuracies makes it difficult to compare the existing work. This further affects the reliability of the trained models for further evaluation and reproducibility.
• Another drawback of the existing literature is the previous lack of large-scale, real-world datasets for individuals with diabetes who use automated insulin delivery systems. Consequently, the majority of the aforementioned models are trained on partially or fully simulated data or limited days of real-world CGM data.
• Multiple model performance and accuracy metrics have been used (including RMSE, specificity, MAE, and F1 score) to evaluate model predictions. However, to the best of our knowledge, none of the existing works has evaluated the impact of glucose predictions by calculating clinically validated glucose variability (GV) metrics.
• There is a lack of implementation details and open-source methods to reproduce the reported results, which makes it difficult to independently evaluate them on additional datasets or to assess their applicability to different modalities of insulin therapy, such as sensor-augmented pump therapy compared with automated insulin delivery therapy.
• Most of the existing works employed a limited number of machine learning models (one or two) for evaluation, which adds inconsistency.
It is critical, however, to evaluate results for multiple machine learning and deep learning models alongside tuned time series analysis frameworks such as ARIMA. Evaluating the results of multiple model types would lay a foundation for benchmarking.

Clinically-Approved Statistical and Variability Metrics for Glucose Analysis
Over 25 clinically approved GV metrics have been adopted by the diabetes research community. Table 1 lists the acronyms and full names of the most important and commonly used metrics for GV assessment.

ADRR
The average daily risk range (ADRR) measures the overall daily variation of glucose within a specified risk range, where risk is defined relative to the target.

CONGA
Continuous overall net glycemic action (CONGA) is conceptually similar to the standard deviation (SD) and measures changes in glucose over a defined period.
CV
Coefficient of variation (CV) is a statistical metric that evaluates the dispersion of glucose data and is commonly subdivided into inter-day and intra-day CV metrics.

GRADE
The glycemic risk assessment diabetes equation (GRADE) score provides a comprehensive assessment of the risk associated with a particular glucose profile.
HBGI
High blood glucose index (HBGI) is a metric that quantifies the possible risk of hyperglycemia; it can be calculated using self-monitoring of blood glucose (SMBG) or continuous glucose monitor (CGM) data.

LBGI
Low blood glucose index (LBGI) is used for hypoglycemic risk management.

MAG
Mean absolute glucose (MAG) is the sum of absolute differences between sequential glucose readings over 24 h, divided by the time (in hours) between the first and last readings.

MAGE
The Mean Amplitude of Glycemic Excursion (MAGE) is defined as the mean of the glucose excursions that exceed one standard deviation of the 24-h mean blood glucose value.

MODD
Mean of daily differences (MODD) evaluates inter-day variability as the average absolute difference between glucose values measured at the same time of day across multiple days.
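Several of these metrics reduce to short computations over a CGM series. The sketch below implements CV, MAG, and MODD directly from the definitions above; it is a minimal illustration in which the exact formula conventions, sampling assumptions, and edge-case handling are our assumptions, and clinical analyses should rely on a validated tool.

```python
import numpy as np

def cv_percent(glucose):
    """Coefficient of variation (%): sample SD relative to the mean."""
    return 100.0 * np.std(glucose, ddof=1) / np.mean(glucose)

def mag(glucose, hours):
    """Mean absolute glucose change: sum of absolute successive
    differences divided by the elapsed time in hours."""
    return np.sum(np.abs(np.diff(glucose))) / hours

def modd(glucose, samples_per_day):
    """Mean of daily differences: average absolute difference between
    readings taken at the same time on consecutive days."""
    day1 = glucose[:-samples_per_day]
    day2 = glucose[samples_per_day:]
    return np.mean(np.abs(day2 - day1))
```

For example, `modd` compares each reading with the reading exactly one day earlier, so a profile that repeats the same daily pattern yields a MODD near zero.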

Materials and Methods
This section presents the experimental workflow and adopted processes and procedures for diabetes data collection, anonymisation, cleaning, processing, modeling, and analysis.

Experimental Workflow and ML Development Pipelines
The experiments are conducted using a standalone Intel-based Core-i7 CPU processor (2 cores, 2 threads) with 8 GB of main memory. Figure 2 illustrates a tri-staged architecture demonstrating the experimental workflow employed in developing and analyzing ML/DL models.

• Stage 1: Data generation and collection includes data provision from the OpenAPS Data Commons [34], which contains data from open-source AID users who have contributed their data via the Open Humans platform [15] (Steps 1 and 2).
• Stage 2: Data preparation and exploratory data analysis (EDA) is composed of four steps: data is exported and prepared using anonymization and cleaning protocols (Step 3); a diverse subset of individuals is selected, and their glucose profiles are analyzed using descriptive statistics and clinically approved GV metrics (Steps 4 and 5); the data is then split into training and testing sets, with models trained on 30 and 60 days of glucose data and individually tested to predict up to 48 h of glucose data points (Step 6).
• Stage 3: ML/DL modeling, evaluation and analysis consists of four steps: ML/DL algorithms are fine-tuned and evaluated for accuracy and resource consumption (Step 7), then analyzed using statistical and glucose variability metrics computed from expected and predicted glucose profiles (Steps 8, 9, and 10).
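As an illustration of the Step 6 split, the sketch below holds out the final 48 h of a glucose series for testing and uses the preceding 30 days for training. The 5-min CGM sampling interval (288 readings per day) is an assumption for illustration, not a statement of the dataset's actual cadence.

```python
import numpy as np

READINGS_PER_DAY = 288  # assumes one CGM reading every 5 minutes

def split_train_test(glucose, train_days=30, horizon_hours=48):
    """Hold out the final `horizon_hours` of readings for testing and
    use the preceding `train_days` of readings for training."""
    horizon = horizon_hours * 60 // 5          # readings in the test horizon
    train_len = train_days * READINGS_PER_DAY  # readings used for fitting
    test = glucose[-horizon:]
    train = glucose[-(horizon + train_len):-horizon]
    return train, test
```

The same helper covers the 60-day configuration by passing `train_days=60`.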

Figure 2.
Tri-staged experimental workflow and ML/DL development pipelines for glucose data analysis. Stage 1 includes data generation and collection, stage 2 involves data preparation and exploratory statistical analysis, and stage 3 consists of ML/DL modeling, evaluation and analysis.

Highlights of Data Collection, Anonymisation, and Cleaning
The OpenAPS Data Commons, collated as a project on the Open Humans platform, is imported as an anonymized diabetes dataset with rich CGM data, insulin delivery information from insulin pumps, user-entered information such as carbohydrate entries or temporary target changes, as well as algorithm-derived information about insulin dosing decisions.
An individual was randomly chosen to test the ML/DL methods described below. After initial tests of the methods and validation of how much data was needed for analysis, an additional 18 individuals were chosen from the dataset based on the diversity of demographic variables such as age, AID system used, geography, etc. Table 2 summarizes the demographics of the resulting n = 19 individuals employed in the dataset for this paper, alongside their gender and geography distributions. Data cleaning methods for timestamps and glucose entries have been reproduced from previous work on glycemic variability [35], and all programming scripts are open-source at [36].

Machine Learning and Deep Learning Algorithms Employed for Glucose Forecasting
Selected ML and DL timeseries forecasting models for glucose include ARIMA [37], XGBoost [38], RF [39], LSTM [40], and SVR [41]. Table 3 provides the model descriptions, their fine-tuned hyperparameters for glucose data, and the Python implementation libraries. Although SVR was initially employed to forecast glucose profiles, it was dropped from further experiments on our dataset due to excessive training and execution time and resource consumption. Model evaluation metrics for performance and resource cost are described in Appendix A.
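For the regression models (XGBoost, RF, SVR), a univariate glucose series must first be framed as supervised lag features, and multi-step forecasts can then be produced recursively by feeding each prediction back into the lag window. The sketch below illustrates this with a random forest on synthetic data; the lag count, estimator settings, and series are illustrative placeholders rather than the fine-tuned hyperparameters of Table 3.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def make_lag_matrix(series, n_lags):
    """Frame a univariate series as (lag features, next value) pairs."""
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y

def recursive_forecast(model, history, n_lags, steps):
    """Predict `steps` values ahead by feeding each prediction back
    into the lag window (recursive multi-step strategy)."""
    window = list(history[-n_lags:])
    preds = []
    for _ in range(steps):
        yhat = model.predict(np.array(window).reshape(1, -1))[0]
        preds.append(yhat)
        window = window[1:] + [yhat]
    return np.array(preds)

# Sketch: fit on a synthetic sinusoidal "glucose" trace around 120 mg/dL.
rng = np.random.default_rng(0)
t = np.arange(2000)
series = 120 + 30 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 2, t.size)
X, y = make_lag_matrix(series, n_lags=12)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
forecast = recursive_forecast(model, series, n_lags=12, steps=24)
```

The recursive strategy is one common choice for extending a one-step regressor to a 48-h horizon; direct multi-output strategies are an alternative with different error-accumulation behavior.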
It is important to note that a three-stage process is utilized for ARIMA model building [37]. The first step involves the identification of the order of differencing (d), the order of autoregression (p), and the order of moving average (q) required to model the data. This step involves analyzing the autocorrelation and partial autocorrelation functions of the time series data to determine the values of p and q and analyzing the time series data to determine the value of d. In the second step, parameters have been estimated using maximum likelihood estimation. Lastly, the adequacy of the ARIMA model is checked. This involves analyzing the residuals of the model, which are the differences between the actual data and the model predictions.
When it comes to predicting time series data, there are several DL algorithms; however, LSTMs are often considered a reasonable choice for univariate time series prediction due to their ability to handle long-term dependencies and capture temporal patterns in the data. LSTM is a type of recurrent neural network (RNN) capable of retaining long-term dependencies in the data, which is particularly useful for time series prediction, where past values can have a strong influence on future values. Unlike traditional RNNs, which can suffer from vanishing or exploding gradients when dealing with long-term dependencies, LSTM has a mechanism to selectively forget or remember information from previous time steps.
Some other conventional DL algorithms were less suitable for our task for a number of reasons, including inefficiency on univariate time series prediction tasks, computational complexity, and complex hyperparameter tuning. For example, Convolutional Neural Networks (CNNs) are often used for image classification, but they can also be applied to time series prediction by treating the time series as a 1D image. However, CNNs may not be suitable for all time series problems, especially if the time series has complex temporal dependencies that cannot be captured by convolutional filters. Similarly, Deep Belief Networks (DBNs) are generative models that consist of multiple layers of Restricted Boltzmann Machines (RBMs) and can be used for unsupervised feature learning. However, they can be computationally expensive to train and may require more data to learn meaningful representations.

Statistical and Variability Metrics for Glucose Analysis
Descriptive statistical metrics are computed for glucose profiles to analyse the spread, variation, and distributions. These metrics include the mean, standard deviation (SD), coefficient of variation (CV), skewness score, and quantile statistics (Table 4). Q1, Q2, and Q3 represent the first, second, and third quartiles, which together evaluate the overall data distribution. CV indicates the variability in the data relative to the mean; the higher the CV, the more dispersed the data. The skewness score is a measure of asymmetric distribution.
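These descriptive statistics can be computed in a few lines. A minimal sketch follows; the use of the sample standard deviation and linear-interpolation quantiles are our assumptions about the conventions behind Table 4.

```python
import numpy as np
from scipy.stats import skew

def describe_glucose(glucose):
    """Spread, variation, and distribution metrics for a glucose profile."""
    g = np.asarray(glucose, dtype=float)
    mean, sd = g.mean(), g.std(ddof=1)            # sample standard deviation
    q1, q2, q3 = np.percentile(g, [25, 50, 75])   # quartiles (Q2 = median)
    return {
        "mean": mean,
        "sd": sd,
        "cv": 100.0 * sd / mean,  # higher CV -> more dispersed data
        "skewness": skew(g),      # asymmetry of the distribution
        "q1": q1, "q2": q2, "q3": q3,
    }
```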
A number of clinically approved GV metrics are computed using the EasyGV tool [31] and compared for measured (using CGM sensors) and predicted (using ML/DL models) glucose profiles. Relative and proportional errors were calculated; the rationale behind using two error metrics is given in Appendix B.
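The exact error definitions are given in Appendix B. For orientation only, one plausible formulation consistent with the reported ranges (percentages for relative error, ratios at or above 1 for proportional error) is sketched below; both functions are our assumptions, not the paper's definitions.

```python
def relative_error(expected, predicted):
    """Percentage deviation relative to the expected value (assumed form).
    Inflates rapidly as the expected value approaches zero, e.g., for
    time below range (TBR) near its ideal value of 0."""
    return 100.0 * abs(expected - predicted) / expected

def proportional_error(expected, predicted):
    """Ratio of the larger to the smaller value (assumed form);
    1.0 indicates exact agreement."""
    low, high = sorted((expected, predicted))
    return high / low
```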

Results
This section presents the results of in-depth statistical and GV analysis followed by evaluation and analysis of trained ML/DL models.

Descriptive Statistics and Glucose Variability Metrics for Selected AID Users
Statistical methods are applied to the complete glucose profiles of the n = 19 individuals to evaluate the characteristics of the timeseries data. Stationarity analysis was performed using the augmented Dickey-Fuller (ADF) and Kwiatkowski-Phillips-Schmidt-Shin (KPSS) tests. A glucose profile is labeled stationary if both tests conclude that the series is stationary, difference stationary if only the ADF test is positive, and trend stationary if only the KPSS test is positive. All the glucose profiles were observed to be stationary, with both ADF and KPSS tests positive. Further analysis evaluated whether each time series is seasonal using auto-correlation; if seasonality were detected, the best period would be identified. Data was labelled seasonal if the autocorrelation exceeded 0.9; however, no evident seasonality or periods were detected for the selected individuals.

Table 4 reports the descriptive statistics for the complete glucose profiles of the n = 19 individuals. AID19 had the minimum number of data points (96 days' worth of glucose data), whereas AID3 had the maximum (1688 days' worth). The variation of a glucose profile is an essential factor in hypoglycemia/hyperglycemia assessment. The minimum mean value across the glucose profiles is 98.42 mg/dL (the per-individual means are given in Table 4). A skewness value greater than ±1 indicates a highly skewed distribution; this applies to AID3, AID7, AID9, AID10, AID15, and AID18. A skewness score between −0.5 and 0.5 (AID1, AID5, AID8, AID11, AID17, and AID19) indicates a symmetrical distribution. The remaining glucose profiles have skewness scores between 0.5 and 1 or −0.5 and −1, demonstrating that they are moderately skewed.

Table 5. Glucose variability outcomes for complete glucose profiles of selected AID users.
Abbreviations: SD ROC, Standard deviation of the glucose rate of change; TBR/TIR/TAR, Time below/in/above range; HBGI/LBGI, High/Low blood glucose index; GMI, Glycemic management index.

Performance and Resource Cost Evaluation and Analysis of Trained ML/DL Algorithms
The ML/DL models are trained by employing 30 and 60 days of data and tested individually for their performance and resource costs to predict glucose up to 48 h. Resource costs are evaluated by measuring execution time and memory consumption, whereas RMSE and MAE are calculated to assess the model's prediction performance. Figure 3 shows the MAE, RMSE, and execution time for models trained on 30 days of glucose data. The results for models trained on 60 days of glucose data are given in Appendix E.
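The evaluation combines standard error metrics with lightweight resource instrumentation. A minimal sketch follows; it does not reproduce the exact measurement protocol of Appendix A, and `tracemalloc` tracks Python-level allocations only, which understates memory used by native libraries.

```python
import time
import tracemalloc
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error in the units of the input (mg/dL here)."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root mean square error; penalizes large deviations more than MAE."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def measure(fit_and_predict, *args):
    """Run a training/prediction callable, returning its result together
    with wall-clock time (s) and peak Python-heap memory (MB)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fit_and_predict(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak / 1024 ** 2
```

`measure` would wrap each model's fit-and-forecast call so that execution time and memory can be compared across models as in Figure 3c and Appendix D.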
The maximum MAE of 8.07 mg/dL is observed for ARIMA, whereas the lowest MAE of 1.295 mg/dL is reported for the random forest model (Figure 3a). Overall, the ARIMA model yields the highest MAE, indicating the weakest prediction performance.
The maximum and minimum recorded RMSE is 10.42 for AID9 and 2.16 for AID11, respectively, both in the case of XGBoost (Figure 3b). No noticeable trend was observed between the RMSE values of reported models trained on 30 days of glucose data when compared with the ones trained on 60 days of glucose data.
ARIMA yields the maximum execution time of 780 s. In comparison, LSTM performs best in terms of execution time, with a minimum of 162 s (Figure 3c). However, LSTMs are the most memory-intensive, with consumption peaking at 1993 MB (Appendix D).

Comparative Analysis of Glucose Variability for Predicted and Expected Glucose Profiles
GV metrics have been calculated from the predicted and expected profiles up to 48 h for the n = 19 individuals, and error scores between each pair of GV metric values are evaluated using relative and proportional errors (defined in Appendix B). Table 6 reports the mean of the minimum, average, and maximum relative and proportional errors for GV metrics among the selected individuals, obtained by comparing ground truths with the values calculated from the glucose profiles predicted by ARIMA, XGBoost, LSTM, and RF, respectively. The models trained on 30 days of data are denoted ARIMA30, XGBoost30, LSTM30, and RF30, respectively. Additional results for the models trained on 60 days of data (ARIMA60, XGBoost60, LSTM60, and RF60) are provided in Appendix F.
Errors are reported as sets of minimum, average, and maximum values. The highest scores for ARIMA30 are obtained for TBR, with relative errors of {0%, 11.78%, 54.55%} and proportional errors of {1, 1.12, 1.55}, respectively. The noticeable problem with the relative error is the inconsistency in its maximum, since it is normalized by the expected value alone. The proportional error can therefore be considered a comparatively more dependable measure.

Discussion
Large-scale diabetes datasets, such as the OpenAPS Data Commons, provide opportunities for researchers to develop innovative ML/DL tools and technologies and improve the functionality of future automated insulin delivery (AID) systems. This work addresses the limitations of existing ML/DL methods (Section 2.1) for predicting glucose profiles by developing models using a dataset of diverse individuals with insulin-requiring diabetes who use open-source AID systems.
ML/DL solutions for diabetes require computing resources, so practical solutions that are fine-tuned and optimized to reduce energy consumption without degrading performance are necessary. This includes using appropriate programming frameworks and tools that enhance concurrency, as well as resource and storage cost estimators and minimizers. Incorporating these strategies ensures the sustainable use of ML technologies and minimizes the environmental impact. In addition to evaluating the accuracy of predictions, it is important to assess the feasibility and sustainability of ML/DL models for use in real-world AID solutions.
The minimum and maximum mean glucose values are likely below average (137.56 mg/dL) due to the use of open-source AID (Table 4). This is confirmed by studies, including a recent RCT [42], which show that open-source AID users typically achieve above-goal glucose metrics. This work also uniquely evaluates data from three open-source AID systems (OpenAPS, AndroidAPS, and Loop). It is worth reflecting that as time below range (TBR) decreases and approaches 0 (which is ideal), the relative error increases accordingly.
Although AID systems significantly improve glucose management, one should also consider infrequent but significant events such as severe hypoglycemia (a "bad low") and its long-lasting effects on glucose variability. However, the current literature on ML/DL-based glucose forecasting only considers prediction horizons of up to 120 min, hindering the understanding of the relationship between glucose variability and such events. The ML/DL models fine-tuned using the OpenAPS Data Commons accurately forecast glucose profiles up to 48 h ahead (see Appendix C for example profiles). The average MAE across all trained models ranges from 2.50 mg/dL (LSTM) to 4.94 mg/dL (ARIMA). LSTMs have the lowest overall MAE (0.99 mg/dL for AID14) when trained with 60 days of glucose data. The average RMSE ranges from 3.7 mg/dL (LSTM) to 7.67 mg/dL (ARIMA) (Figure 3b).
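As one illustration of what a 48 h horizon implies at a 5-min CGM cadence (576 future samples), a sliding-window dataset for multi-step forecasting could be built as sketched below; the 24 h look-back window and the synthetic series are our assumptions for illustration, not choices taken from this work.

```python
import numpy as np

# A 5-min CGM cadence gives 288 readings per day; a 48 h horizon is 576 points.
READINGS_PER_DAY = 288
HORIZON = 2 * READINGS_PER_DAY          # 576 future samples (48 h)
LOOKBACK = READINGS_PER_DAY             # assumed 24 h input window

def make_windows(glucose, lookback=LOOKBACK, horizon=HORIZON):
    """Slice a glucose series into (input, target) pairs for multi-step
    forecasting. Returns arrays of shape (n, lookback) and (n, horizon)."""
    X, y = [], []
    for start in range(len(glucose) - lookback - horizon + 1):
        X.append(glucose[start:start + lookback])
        y.append(glucose[start + lookback:start + lookback + horizon])
    return np.array(X), np.array(y)

# 30 days of synthetic CGM-like data, purely for illustration.
series = 120 + 30 * np.sin(np.linspace(0, 60 * np.pi, 30 * READINGS_PER_DAY))
X, y = make_windows(series)
print(X.shape, y.shape)   # (7777, 288) (7777, 576)
```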
The ML/DL models developed in this work have been evaluated for their computing resource costs. This analysis shows that the execution time of a model is proportional to the amount of data used to train it. For example, models trained on 30 days of data have almost half the execution time of models trained on 60 days of data. LSTMs have the lowest execution time but the highest memory consumption compared to other models. However, since CPU/GPU time contributes the most to energy-consumption costs, LSTMs are the most resource-efficient in our case. LSTMs could run daily during non-critical times to generate daily predictions, similar to how Autotune, a non-ML-based algorithm for recommending setting changes, runs overnight in OpenAPS [43]. Future work should also evaluate cloud computing and its tradeoffs, including both computing power and the safety risk of off-device calculations in the context of AID.

Conclusions
Our study comparing GV metrics calculated from predicted and original glucose profiles shows the accuracy and reliability of extended-horizon forecasts in real-world applications. GV metrics are widely used to understand diabetes management outcomes, above and beyond standard glucose outcome metrics, and should continue to be used to evaluate ML/DL-based glucose forecasting methods. The low error scores in Table 6 show that fine-tuned ML/DL models can accurately estimate glucose variability outcomes up to 48 h into the future, a much longer horizon than has previously been studied with ML/DL methods. Future work should evaluate these methods on different, non-AID diabetes datasets to assess whether ML/DL is "learning" that an AID system will be able to successfully correct according to the forecast; additional work should then also extend this work to assess the utility of such extended forecasts for non-AID users living with diabetes.
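For illustration, three commonly used GV metrics (TBR, TIR, and TAR over the consensus 70–180 mg/dL range) can be computed from any glucose profile as sketched below; the profiles shown are synthetic, and the full set of 17 metrics evaluated in this work is broader.

```python
import numpy as np

def time_in_ranges(glucose, low=70, high=180):
    """Percent of readings below, in, and above the 70-180 mg/dL consensus
    range (TBR, TIR, TAR) -- three of the GV metrics in common use."""
    g = np.asarray(glucose, dtype=float)
    tbr = np.mean(g < low) * 100
    tar = np.mean(g > high) * 100
    tir = 100 - tbr - tar
    return tbr, tir, tar

# Compare metrics from a synthetic expected profile and a predicted one.
expected = [65, 90, 110, 150, 190, 120, 100, 80]
predicted = [70, 95, 105, 145, 185, 125, 105, 85]
print(time_in_ranges(expected))
print(time_in_ranges(predicted))
```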
The applications of ML/DL described in this paper have the potential to form the basis for intelligent recommender systems in future-generation AIDs and other diabetes applications. In particular, these can be applied thoughtfully to enable individuals to target improvements for their most relevant areas. Quality-of-life improvement could be achieved for people with diabetes by further optimizing exercise, minimizing hypoglycemia, or reducing AID system interaction requirements, all of which can be achieved with future research and applications such as the ML/DL-based forecasts described in this work.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A. Model Evaluation Metrics for Performance and Resource Cost
All the aforementioned models are trained using 30 and 60 days of glucose data for each selected individual using closed-loop AID technology, and are evaluated for their accuracy and resource costs in predicting up to 48 h ahead.
The forecasting accuracy of models is evaluated using Root Mean Square Error (RMSE) and Mean Absolute Error (MAE).
RMSE is calculated as the square root of the mean of the squared differences between expected and predicted data samples and is mathematically defined by Equation (A1), where ŷ is the expected value, y is the predicted value, and T denotes the total number of samples.
MAE provides the average absolute difference between expected and predicted values. It helps to estimate the disparity between corresponding actual and predicted observations and is mathematically defined by Equation (A2), where y_i is the expected value, x_i is the predicted value, and n denotes the total number of samples.
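A minimal Python sketch of the two metrics follows; the numeric values are illustrative, and the variable names are ours rather than the paper's notation.

```python
import math

def rmse(expected, predicted):
    # Square root of the mean squared difference (in the form of Equation (A1)).
    return math.sqrt(sum((e - p) ** 2 for e, p in zip(expected, predicted))
                     / len(expected))

def mae(expected, predicted):
    # Mean of the absolute differences (in the form of Equation (A2)).
    return sum(abs(e - p) for e, p in zip(expected, predicted)) / len(expected)

expected = [100, 120, 140, 160]    # illustrative glucose values (mg/dL)
predicted = [102, 118, 145, 150]
print(mae(expected, predicted), rmse(expected, predicted))
```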
Furthermore, to assess the suitability of ML and DL models for online, real-time applications, resource costs are measured using overall execution time and memory consumption.
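A minimal Python sketch of how these two resource costs could be measured for a single call is shown below; the workload is a stand-in, not one of the models from this work, and the measurement approach is our assumption rather than the paper's instrumentation.

```python
import time
import tracemalloc

def measure(fn, *args):
    """Return (result, elapsed_seconds, peak_bytes) for one call --
    the two resource costs reported in this work."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

# Example: a cheap stand-in workload in place of model training/inference.
result, seconds, peak_bytes = measure(lambda n: sum(range(n)), 100_000)
print(result, round(seconds, 4), peak_bytes)
```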

Appendix B. Relative and Proportional Errors
The relative error (r) between a predicted GV metric (p) and the expected (ground truth) GV metric (g) is given by Equation (A3).
Relative error (r) gives a lower score for a profile that underestimates the GV metric than for one that overestimates it, which can skew the interpretation of the results. Therefore, the proportional error is also reported.
The proportional error (µ) for a predicted GV metric (p) with ground truth (g) is the ratio of the maximum of the two values to the minimum of the two values (given by Equation (A4)). A proportional error of 1 indicates no error; a proportional error greater than 1 indicates a discrepancy between the predicted and expected GV metric. Figures A1 and A2 show the comparison of expected and predicted glucose profiles for 48 h (576 data points) using XGBoost and ARIMA, respectively. Figure A3 shows the MAE, RMSE, and execution time for models trained on 60 days of glucose data. The maximum and minimum reported MAE are 6.21 for ARIMA and 0.99 for LSTM, respectively (Figure A3a). Figure A3b shows that ARIMA yields the highest RMSE, 13.7 for AID19, whereas the minimum RMSE, 2.17 for AID14, is obtained with LSTM. Furthermore, LSTM performs best in execution time, with a minimum of 346 s.
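Equations (A3) and (A4) can be sketched directly in Python; the numeric values below are illustrative and show why the proportional error treats under- and over-estimation by the same factor symmetrically while the relative error does not.

```python
def relative_error(p, g):
    # Equation (A3): relative error of predicted GV metric p vs ground truth g.
    return abs(p - g) / g

def proportional_error(p, g):
    # Equation (A4): ratio of the larger value to the smaller; 1 = no error.
    return max(p, g) / min(p, g)

# Under- and over-estimation by a factor of 2: the relative error differs,
# while the proportional error scores both cases identically.
g = 10.0
print(relative_error(5.0, g), relative_error(20.0, g))          # 0.5 vs 1.0
print(proportional_error(5.0, g), proportional_error(20.0, g))  # 2.0 and 2.0
```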

Appendix F. Relative and Proportional Errors for Models Trained on 60 Days of Glucose Data
In the case of ARIMA60