Machine Learning-Based Risk Stratification for Gestational Diabetes Management

Gestational diabetes mellitus (GDM) is often diagnosed during the last trimester of pregnancy, leaving only a short timeframe for intervention. However, appropriate assessment, management, and treatment have been shown to reduce the complications of GDM. This study introduces a machine learning-based stratification system for identifying patients at risk of exhibiting high blood glucose levels, based on daily blood glucose measurements and electronic health record (EHR) data from GDM patients. We internally trained and validated our model on a cohort of 1148 pregnancies at Oxford University Hospitals NHS Foundation Trust (OUH), and performed external validation on 709 patients from Royal Berkshire Hospital NHS Foundation Trust (RBH). We trained linear and non-linear tree-based regression models to predict the proportion of high-readings (readings above the UK’s National Institute for Health and Care Excellence [NICE] guideline) a patient may exhibit in upcoming days, and found that XGBoost achieved the highest performance during internal validation (0.021 [CI 0.019–0.023], 0.482 [0.442–0.516], and 0.112 [0.109–0.116], for MSE, R2, MAE, respectively). The model also performed similarly during external validation, suggesting that our method is generalizable across different cohorts of GDM patients.


Introduction
Gestational diabetes mellitus (GDM) is one of the most common health conditions during pregnancy, with a prevalence of one in six pregnant women worldwide [1]. GDM is defined as glucose intolerance of either first onset or recognition during pregnancy [2]. It is associated with both maternal and fetal complications, including perinatal death, excessive fetal growth (leading to problems during childbirth), preeclampsia, and neonatal hypoglycemia. Additionally, women who develop GDM are at increased risk of developing Type 2 diabetes [3][4][5]. Appropriate assessment, management, and treatment have been shown to reduce the complications of GDM. However, GDM is typically diagnosed during the last trimester of pregnancy, leaving only a short timeframe for intervention (typically around 10-16 weeks) [6]. Once diagnosed, counselling is provided for lifestyle management, including dietary and exercise modifications. To monitor improvement, women are asked to check their glucose levels through finger-stick testing several times a day. If glucose levels remain high, medication is prescribed [3].
The UK's National Institute for Health and Care Excellence (NICE) guidelines state that women with GDM require an increased level of maternal and fetal surveillance, as they may need more interventions during pregnancy [7]. Based on the NICE guidelines, nurses and clinicians should review women with GDM at least once every two weeks, from the time of diagnosis of GDM until delivery. However, women who are at high risk of hyperglycaemia (high blood sugar level), or those who, despite treatment, demonstrate persistent hyperglycaemia, can require more frequent clinical review. Given that many NHS Trusts and hospitals abroad provide care for very large numbers of women with GDM at any one time (often >100 women), this presents a challenge for busy clinicians. Traditionally, glucose control has been assessed based on historical blood glucose readings recorded by the women in paper forms or diaries. The advent of digital monitoring opens up new possibilities for predicting which women are at risk of hyperglycaemia.
To help monitor patients with GDM, the University of Oxford and Oxford University Hospitals NHS Foundation Trust (OUH) developed a Bluetooth-enabled digital blood glucose management system, GDm-Health [6]. This smartphone-based, Bluetooth-controlled blood glucose monitoring system enables remote self-monitoring and bidirectional communication between patients and clinicians. Women who were diagnosed with GDM and managed by OUH were subscribed to the GDm-Health system to monitor their blood glucose levels. While using GDm-Health, women were recommended to measure their blood glucose four to six times a day (recorded as pre-breakfast, one-hour post-breakfast, pre-lunch, one-hour post-lunch, pre-dinner, and one-hour post-dinner), for a minimum of three days of the week [7]. We will refer to these six measurements as "Tags" in this study.
Additionally, GDm-Health has a heuristic alerting system based on clinical care pathways used by OUH during the development of GDm-Health. A red flag is generated if three or more consecutive blood glucose readings are above the designated threshold at the same meal tag. An amber flag is generated if two or more consecutive readings are above the threshold [6]. However, with large numbers of women with GDM managed in each hospital, a significant proportion of women have red or amber flags every week, generating substantial work for clinicians. Thus, there is a need for improved approaches to further stratify patients within these higher risk groups, who need urgent attention.
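The flag rules above can be sketched in a few lines of Python. This is only an illustration and not the GDm-Health implementation: the function names are hypothetical, the threshold is passed in by the caller, and the sketch assumes that the amber rule, like the red rule, is evaluated per meal tag and that any run in the supplied history counts.

```python
from typing import List, Optional


def longest_high_run(readings: List[float], threshold: float) -> int:
    """Length of the longest run of consecutive readings above the threshold."""
    longest = current = 0
    for value in readings:
        current = current + 1 if value > threshold else 0
        longest = max(longest, current)
    return longest


def flag_for_tag(readings: List[float], threshold: float) -> Optional[str]:
    """Return "red", "amber", or None for the readings of a single meal tag.

    Assumption: readings are in chronological order and all belong to one tag.
    """
    run = longest_high_run(readings, threshold)
    if run >= 3:
        return "red"
    if run >= 2:
        return "amber"
    return None
```

For example, `flag_for_tag([5.0, 8.1, 8.4, 9.0], 7.8)` yields a red flag because the last three readings all exceed the threshold.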
In this paper, we developed a model that aims to streamline clinician workflows by automating the identification of patients that need urgent clinical review. In clinical practice, clinicians typically review patients' blood glucose levels every 2-4 weeks in the outpatient antenatal clinic. Thus, with the advent of real-time monitoring using GDm-Health, there is the potential for more frequent review and more responsive medication adjustment. However, this needs to be balanced against workload generation for the clinical team. We propose a novel data-driven model that can bypass the need to manually screen all patients with high blood glucose readings before making decisions on patient-review order, significantly reducing the burden on clinicians. Specifically, we introduce a model for stratifying patients with GDM based on hyperglycaemia risk, and consequently, their need for clinical review. Using machine learning-based regression models, the proposed pipeline quantifies the risk of hyperglycaemia for the three days immediately following a blood glucose measurement. This algorithm can be used as an intelligent add-on module on GDm-Health or as a stand-alone system for any GDM clinic if they have access to patients' daily blood glucose data.

Participant Inclusion and Exclusion
For this study, we used de-identified, linked electronic health record (EHR) data and blood glucose measurements. Women with GDM, managed at the OUH, and subscribed to the GDm-Health system between 30 April 2018 and 4 May 2021 were included in this study (1148 pregnancy cases). Patients with more than one pregnancy during the study period were considered for each pregnancy, independently. Additionally, we only considered patients who had one baby during the pregnancy (i.e., twin pregnancies were not included in the study). Pregnancies with fewer than 36 blood glucose readings were excluded to ensure that models were trained on patients who have established their blood glucose test patterns. This threshold of 36 readings assumed that patients would establish their blood glucose test pattern after 7-10 days of blood glucose monitoring (with 4-6 readings per day). Excluding patients with a small number of blood glucose measurements reduces the risk of distribution shifts between training and prediction samples, removing possible biases, such as patient behavioural changes.
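As a rough illustration of the minimum-readings criterion, the sketch below counts readings per pregnancy and keeps those with at least 36. The input layout (tuples of pregnancy identifier and glucose value) and the function name are assumptions made for the example; the other criteria (study dates, singleton pregnancies) are not modelled.

```python
from collections import Counter

MIN_READINGS = 36  # threshold stated in the text


def eligible_pregnancies(readings):
    """Return the set of pregnancy IDs with at least MIN_READINGS readings.

    readings: iterable of (pregnancy_id, glucose_value) tuples (assumed layout).
    """
    counts = Counter(pregnancy_id for pregnancy_id, _ in readings)
    return {pid for pid, n in counts.items() if n >= MIN_READINGS}
```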
To externally validate the performance of our model, the GDm-Health data of 709 pregnancy cases at the Royal Berkshire Hospital was also included using the same inclusion and exclusion criteria.

Hyperglycaemia Risk Score Definition
The model is designed based on the hypothesis that the predicted three-day mean percentage of high readings, immediately after a three-day observation window, can be used as a proxy for hyperglycaemia risk. This score can then be used to stratify patients in need of clinical review, supporting clinicians in deciding whom to review based on predicted risk.

Data Preprocessing
There were 1148 pregnancies and 272,712 blood glucose readings considered during the model development process. In our study, the measurements used were self-tested and self-reported by patients; thus, the blood glucose testing frequency varied. As shown in Figure 1, the highest numbers of blood glucose measurements were taken pre- and post-breakfast, post-lunch, and post-dinner (in the OUH cohort).
Beyond the difference in sampling frequency, the duration of blood glucose monitoring (between the week of recruitment and the week of giving birth) also varied. Thus, for each pregnancy, we separated blood glucose measurements into multiple three-day windows, using measurements from one three-day window to predict the hyperglycaemia risk for the following three-day window. This windowing method helps minimize any possible errors from outliers or missing measurements over the three-day period. Furthermore, it allows us to use more of the data, as days with some missing tag values can still be considered in training. The risk score being predicted (a score between 0 and 1, with 1 representing the highest risk) is defined as the proportion of blood glucose readings above the NICE advised thresholds [7]. A higher proportion of high-reading alerts indicates a higher risk of blood glucose abnormality. This set-up allows for blood glucose status to be predicted after each blood test, making it easier for clinicians to stratify patients with abnormal blood glucose levels. Additionally, as the monitoring period between clinical reviews for GDM patients is typically at least three days (and possibly up to two weeks), the window size is suitable for current clinical review periods. The ability to prioritize patients with higher hyperglycaemia risk helps enable personalized and continuous GDM blood glucose monitoring.
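The windowing and risk-score definition can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: windows are taken to be consecutive and non-overlapping, days are integer offsets, a pre-meal reading is treated as high above 5.8 mmol/L and a post-meal reading as high at or above 7.8 mmol/L (the NICE limits quoted in this paper, with boundary handling assumed), and all function names are hypothetical.

```python
from collections import defaultdict

WINDOW_DAYS = 3
FASTING_MAX = 5.8       # mmol/L, upper end of the fasting target range
POSTPRANDIAL_MAX = 7.8  # mmol/L, 1 h postprandial limit


def is_high(tag, value):
    """Assumed boundaries: pre-meal high if > 5.8, post-meal high if >= 7.8."""
    if tag.startswith("pre"):
        return value > FASTING_MAX
    return value >= POSTPRANDIAL_MAX


def split_into_windows(readings):
    """Group (day, tag, value) readings into consecutive three-day windows."""
    first_day = min(day for day, _, _ in readings)
    windows = defaultdict(list)
    for day, tag, value in readings:
        windows[(day - first_day) // WINDOW_DAYS].append((tag, value))
    return dict(windows)


def high_fraction(window):
    """Proportion of readings in a window that are above threshold (0 to 1)."""
    return sum(is_high(tag, value) for tag, value in window) / len(window)


def make_samples(readings):
    """Pair each window with the risk score of the window that follows it."""
    windows = split_into_windows(readings)
    samples = []
    for index in sorted(windows):
        if index + 1 in windows:
            samples.append((windows[index], high_fraction(windows[index + 1])))
    return samples
```

Each returned pair holds the observation window and the label for the subsequent three days; a window with no successor contributes no sample.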
In the OUH time-series data, we further removed any readings with miscellaneous values, such as those recorded as NaN or those with invalid time tags (i.e., tags not corresponding to any of the six). We did not filter any extreme blood glucose values outside reasonable ranges, as we wanted to develop a model that would be robust to any errors that may be due to data collection/sensor recordings. Blood glucose readings under the "pre-lunch" and "pre-dinner" tags were also excluded during model development, as these tags had far fewer measurements recorded (less than half of the frequency of the other tags). By removing these, we were able to avoid using any form of data imputation for missing values. Moreover, the NICE guideline advises patients to have four measurements every day; thus, developing a model based on four tag features is well-suited for the task, without overwhelming women with additional blood tests. We then combined the remaining blood glucose readings with their linked, de-identified EHR features. We included three features previously shown to affect blood glucose levels: maternal age, gestational day (duration of pregnancy in days), and medication [8][9][10]. Pregnancies missing any of these features were excluded, leaving 840 pregnancies (collectively contributing 5765 windows) in the final dataset, as shown in Figure 2. Although BMI has been previously found to affect blood glucose levels, there were many missing values present in our dataset, and since we removed any windows with missing values, including BMI as a feature would significantly reduce the size of our training set (from 4573 windows to 2500). Thus, we did not include it as a feature in our primary analyses.
We applied the same pre-processing pipeline to our external validation dataset collected at the RBH, as shown in Figure 2, which resulted in 186 pregnancies (corresponding to 1219 windows) available for testing (screened from 163,376 blood glucose readings from 709 patients).
Beyond the time-series blood glucose readings and EHR features, we also generated two engineered features. The first feature is based on high blood glucose readings, defined as any blood glucose measurement above the NICE advised blood glucose ranges: target measurements between 3.5 mmol/L and 5.8 mmol/L for fasting measurements, and measurements less than 7.8 mmol/L for 1 h postprandial measurements [7]. Using this definition, we calculated the value of this feature as the percentage of high-readings that occur within the three-day observation period (across all tags). The second engineered feature is the average rate of change of blood glucose measurements over the three-day observation window (calculated individually for each time tag). We refer to these as High-readings and Gradients, respectively. The full summary of features (and their respective definitions) included in model development is listed in Table 1. We grouped individual features into corresponding sets based on accessibility, namely, sensor-provided features (Tags), two types of engineered features (Gradients, High-readings), and EHR data (maternal age, gestational day, medication).

Table 1. Feature sets used in model development.

Feature set: Tags
Features: Pre-breakfast reading, Post-breakfast reading, Post-lunch reading, Post-dinner reading
Definition: Tags correspond to mean blood glucose measurements for a given time-point (tag), over the three-day observation period.

Feature set: Gradients
Features: Pre-breakfast gradient, Post-breakfast gradient, Post-lunch gradient, Post-dinner gradient
Definition: Gradients correspond to the rate of change in blood glucose for a given time-point (tag), over the three-day observation period.

Feature set: High-readings
Features: Percentage of high readings
Definition: High-readings is the percentage of high-readings among all blood glucose measurements within the three-day observation period. This feature is also calculated for the subsequent three days (three days following the observation window) and used as the predicted output of the models.

Feature set: EHR
Features: Maternal age, Gestational day, Medication
Definition: Maternal age is the age of the woman when her pregnancy is confirmed. Gestational day is the mean gestational day (day of pregnancy) over each three-day window. Medication is a binary feature indicating whether the woman took metformin or insulin during her pregnancy.
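To make the Tags and Gradients definitions concrete, here is a small sketch of how the two feature sets might be computed for one observation window. It assumes at most one reading per tag per day and at least one reading per tag, and it interprets "average rate of change" as the mean of the day-to-day differences; the paper does not spell out the exact formula, so that choice and all names are illustrative.

```python
from statistics import mean

TAGS = ("pre-breakfast", "post-breakfast", "post-lunch", "post-dinner")


def tag_gradient(points):
    """Mean rate of change (mmol/L per day) between consecutive readings.

    points: list of (day, value) pairs sorted by day. A single reading gives
    no slope, so 0.0 is returned in that case (an assumption of this sketch).
    """
    steps = [
        (v1 - v0) / (d1 - d0)
        for (d0, v0), (d1, v1) in zip(points, points[1:])
    ]
    return mean(steps) if steps else 0.0


def window_features(window):
    """Build Tags and Gradients features for one three-day window.

    window: dict mapping each tag to its list of (day, value) pairs.
    """
    features = {}
    for tag in TAGS:
        points = window[tag]
        features[f"{tag} reading"] = mean(value for _, value in points)
        features[f"{tag} gradient"] = tag_gradient(points)
    return features
```

With readings of 5.0, 6.0, and 8.0 mmol/L on three successive days, the tag reading is their mean and the gradient is the mean of the two daily steps (1.0 and 2.0), i.e. 1.5 mmol/L per day.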
Summary population characteristics of OUH and RBH patient cohorts are reported in Table 2, and summary population statistics of the features used in training can be found in Tables A1 and A2 in Appendix B, for the OUH and RBH cohorts, respectively.

Model Development and Hyperparameter Optimization
For hyperglycaemia risk prediction, we trained both linear and non-linear ensemble models. To predict a continuous risk score, we used multiple linear regression (MLR), Random Forest, and XGBoost regression models. All models can handle tabular data consisting of both continuous and categorical features. MLR is a parametric model that is widely accepted in clinical decision-making, making it an appropriate benchmark for comparison to more complex models. Random Forest is an ensemble method built on decision trees, and XGBoost is an optimized distributed gradient boosting library which has been found to outperform Random Forest and other tree-based models. It is an ensemble model that has achieved state-of-the-art results on many machine learning challenges, especially those involving structured or tabular datasets (as we are using in our study). Another benefit of using tree-based models is that feature importance can be explained using Shapley additive explanations (SHAP).
To predict the impending proportion of high-readings a woman will have in the upcoming three days, we used the features in a three-day window as the input features (representing the features, X, Table 1) and the percentage of high-reading alerts in the subsequent three-day window as the output (corresponding label, y, respectively). Each X and y pairing is then treated as an individual sample during model development. To understand the role of each feature set (Tags, Gradients, EHR, High-readings) in MLR, Random Forest, and XGBoost models, we reported model performances using a stepwise method, adding additional feature sets one at a time for each subsequent model developed, thereby evaluating their relative importance.
To choose the appropriate training settings for the XGBoost regressor, we plotted the model outcome variable (i.e., percentage of high-readings alerts over the subsequent three days), y, to look at its distribution. As shown in Figure 3, this variable is highly skewed, with many zeros. As our model focuses on predicting patients at risk of exhibiting high blood glucose levels, we chose to use a Gamma distribution to represent the distribution of the outcome variable, such that our model is focused on non-zero values for high-readings.
For model development, we split the OUH data 80:20 by individual pregnancy, resulting in 4573 and 1192 windows in the training and test sets, respectively (corresponding to 672 and 168 patients). We used the training set for hyperparameter optimization and model training. For the XGBoost model, we implemented a grid search for different values of the learning rate, number of trees used, maximum tree depth, percentage of samples used per tree, and percentage of features used per tree. Standard five-fold cross-validation was then applied to evaluate which hyperparameter combination performed the best. We used the same number of trees and maximum tree depth in the Random Forest model. Details about the final settings and hyperparameter values used for each model can be found in Table A3. After hyperparameter optimization, we tested the final model on the held-out test set.
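Splitting by pregnancy rather than by window keeps all windows from one pregnancy on the same side of the split. A minimal sketch of such a grouped 80:20 split is shown below; the function name, the seed, and the use of Python's `random` module are illustrative choices, not details from the study.

```python
import random


def split_by_pregnancy(pregnancy_ids, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with no pregnancy in both sets.

    pregnancy_ids: one ID per window, in the same order as the samples.
    """
    unique_ids = sorted(set(pregnancy_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_test = round(len(unique_ids) * test_fraction)
    test_ids = set(unique_ids[:n_test])
    train_idx = [i for i, pid in enumerate(pregnancy_ids) if pid not in test_ids]
    test_idx = [i for i, pid in enumerate(pregnancy_ids) if pid in test_ids]
    return train_idx, test_idx
```

Because pregnancies contribute different numbers of windows, an 80:20 split of pregnancies does not give an exact 80:20 split of windows, which is consistent with the 4573 and 1192 windows reported above.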
We started model training by using only tag features, as a baseline model, and sequentially added additional feature sets. The availability of feature sets can vary depending on the level of data access in different settings; thus, we evaluated all combinations of these sets in model development. Although we did not include BMI in our primary analyses, we tested its influence using the subsequently reduced training set, to demonstrate the potential of including it as a feature in any future analyses. Results for this can be found in Appendix D.
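One way to enumerate the model variants is to keep Tags as a fixed baseline and add every subset of the remaining feature sets. The sketch below assumes that reading of "all combinations"; the constant and function names are hypothetical.

```python
from itertools import combinations

BASELINE = ("Tags",)
OPTIONAL_SETS = ("Gradients", "High-readings", "EHR")


def feature_set_combinations():
    """List every feature-set combination that includes the Tags baseline."""
    combos = []
    for size in range(len(OPTIONAL_SETS) + 1):
        for extra in combinations(OPTIONAL_SETS, size):
            combos.append(BASELINE + extra)
    return combos
```

Under this assumption there are eight variants, from Tags alone to Tags with all three additional sets.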
To compare our models, we reported the mean squared error (MSE), the R2 value (R2), and the mean absolute error (MAE). Because our goal was to stratify patients at risk of having high blood glucose levels, the actual prediction value itself was ancillary to the order in which patients are ranked. Thus, we also considered the accuracy with which a patient is ranked. We determined this by calculating the percentage of patients that were triaged into the correct risk bound. We considered three label-encoded scoring bounds: lower, middle, and upper (scores binned by equally splitting into lower, middle, and upper thirds). To further understand the contribution of individual features to model predictions, we also performed SHAP analysis.
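The evaluation metrics can be written out directly. The sketch below uses plain Python and assumes that the three risk bounds are equal-width thirds of the 0 to 1 score range; the text could also be read as population tertiles, so treat the binning as one possible interpretation. Function names are illustrative.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def risk_bound(score):
    """Label-encode a score: 0 = lower, 1 = middle, 2 = upper third."""
    if score < 1 / 3:
        return 0
    if score < 2 / 3:
        return 1
    return 2


def rank_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted bound matches the true bound."""
    hits = sum(risk_bound(t) == risk_bound(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)
```

For true scores `[0.0, 0.5, 1.0]` and predictions `[0.1, 0.4, 0.6]`, two of the three predictions fall in the correct bound, so the rank accuracy is 2/3 even though the third prediction is off by only 0.4.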

Model Training and Internal Validation
Both tree-based ensemble methods outperformed MLR.
Feature importance determined by SHAP analysis ranked the Pre-breakfast tag as being the most important, followed by Post-breakfast, Post-dinner, and Post-lunch tags. This was similar to the MLR coefficients, which ranked Pre-breakfast as the most important, followed by Post-dinner, Post-breakfast, and Post-lunch tags. Results for the feature ranking of different Tags can be found in Figure A1 and Table A4.
Non-linear, tree-based ensemble regression models achieved higher performance for this task, compared to an MLR model. Thus, as the XGBoost model out-performed both Random Forest and MLR models, especially with respect to the R2 value, we chose to use XGBoost for the development of all subsequent models.
Overall, performance across the different models did not differ substantially from the baseline. When comparing the test set performances of the different models, they do not significantly differ from the model trained solely on Tags (model trained on Tags and High-Readings, p = 0.309 using the Wilcoxon Signed Rank Test; model trained on Tags, High-Readings, and EHR, p = 0.232). This suggests that blood glucose measurements are collectively the most influential features for determining impending blood glucose anomalies. This is further confirmed by SHAP analysis, where Tags and High-Readings are consistently ranked highest in terms of feature importance across all models (SHAP results can be found in Appendix C). The addition of Gradients, EHR, or the combination of both did not appear to improve model performance over the corresponding models without these feature sets (p > 0.05). A full list of p-values comparing models to the baseline can be found in Table A5.
As there were many missing values present for BMI in our dataset, we did not include the BMI in our main analyses. However, we did perform preliminary analyses, including BMI as a feature in model development (using a subsequently reduced dataset). Additionally, we also performed analyses using the same dataset filtered to only include samples with blood glucose measurements between [1,31]. Results for both can be found in Appendix D.

External Validation
To demonstrate the generalizability of our method, we performed external validation on a cohort of women from RBH. The distribution of the outcome variable (Figure 4b) is similar to that of the training set used during model development (Figure 3).
Sensors 2022, 22, 4805
As the model using Tags and High-Readings achieved the best performance compared to other model variations, we used this as the model for external validation. We also compared this to the baseline model trained solely on Tags, as we found that they were not significantly different (p > 0.05).
As shown in Table 4, when applied to the RBH cohort, our models achieved similar scores to those previously achieved from internal validation. The similarity in scores suggests that our model was not overfitted to the training set, and thus, is generalizable across external cohorts of patients. Additionally, as previously shown, the addition of High-Readings achieved better performance, overall, than the baseline model without this feature (Wilcoxon Signed Rank Test, p = 0.004).

Discussion
In this study, we developed a data-driven machine learning model to identify patients at risk of exhibiting high blood glucose levels (hyperglycaemia). This is a crucial task, as it is increasingly appreciated that there exists a spectrum of disease and outcomes, and assigning all women to the same care pathway is not patient-centred and does not necessarily provide the best care for each woman. Furthermore, with an increasing prevalence of GDM and limited healthcare resources, it is important for healthcare providers to tailor care delivery to the women who need it, providing a proportionate response depending on glycaemic control and other risk factors.
This study demonstrates the first machine learning-based stratification system for quantifying hyperglycaemia risk in GDM clinics, and is not limited to the existing GDm-Health platform. The model presented can be used by any GDM clinic if they have access to patients' daily blood glucose data, and such a tool could be used to identify patients who require more urgent clinical review or need an adjustment to their current treatment. In order to translate this tool into clinical practice, future studies can consider converting the model results into a risk score (e.g., thresholding the regression scores into categories, combining regression scores with other clinical features to define the degree of risk).
We found that tree-based ensemble models significantly outperformed a linear model. This may be due to their inherent ability to consider non-linear effects of the features. Additionally, tree-based models are less sensitive to extreme values (e.g., outliers, any data measurement errors which can occur from the data collection process) compared to linear regression. This is further demonstrated by MLR performing better on the filtered dataset, where extreme blood glucose values and sensor errors were removed prior to training (Table A8). In particular, we found that XGBoost achieved the highest overall performance, which may be attributable to the boosting technique it utilizes, where trees are sequentially added and fit to correct for the previous prediction errors made.
There are several limitations to this study. Firstly, the MSE, R2, MAE, and rank accuracy scores suggest that this model can perform moderately accurately for predicting high blood glucose readings, especially those in the upper-bound group. However, the scores achieved require improvement in order to be suitable for clinical practice. This lower performance may be due to the small dataset size used in training, data imbalance, and possible clinical confounders that were not considered during model development. Further development needs to be conducted with clinical experts to both increase the sample size and determine what confounders should be included in the study. In terms of the algorithmic approach, a weighted regression model may help improve model fitting, especially when there is imbalanced data. Additionally, a time-series modelling approach (such as the Autoregressive Integrated Moving Average [ARIMA] model) could be investigated, as it could provide more detailed insight into blood glucose patterns, improving monitoring capacity.
As blood glucose measurements are self-measured, there is variation in the number and length of recordings completed by each patient. Thus, when splitting the data into windows, there was an unequal number of samples contributed from different patients. Ideally, the number of samples available from each patient is balanced; however, in the real world, this is difficult to achieve as women may be diagnosed at different points in their pregnancy and may have very different lifestyles, making it difficult to collect consistent measurements across all patients. If enough data is available, one possible solution can be regularizing the number of samples contributed by each patient (reducing the bias contributed by any individual patient). Similarly, future models can collectively consider multiple windows per patient rather than treating each window separately.
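One simple way to balance per-patient contributions is to cap the number of windows any single patient can contribute, sampling at random when a patient exceeds the cap. This is only one possible reading of the suggestion above (per-sample weighting would be another), and the function name, cap value, and seed are hypothetical.

```python
import random
from collections import defaultdict


def cap_windows_per_patient(samples, max_per_patient, seed=0):
    """Keep at most max_per_patient windows for each patient.

    samples: list of (patient_id, window) pairs.
    """
    by_patient = defaultdict(list)
    for patient_id, window in samples:
        by_patient[patient_id].append(window)
    rng = random.Random(seed)
    capped = []
    for patient_id in sorted(by_patient):
        windows = by_patient[patient_id]
        if len(windows) > max_per_patient:
            # Randomly subsample patients who contribute too many windows.
            windows = rng.sample(windows, max_per_patient)
        capped.extend((patient_id, w) for w in windows)
    return capped
```

The trade-off is that capping discards data, which matters for a dataset that is already small.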
The specification of when a patient started or stopped taking GDM-related medication was not clear in our data. We considered women who had taken any medication for blood glucose control (metformin or insulin) as the medication group and the remainder as the non-medication group. Patients who changed from non-medication to medication during the data collection period of the study were considered as part of the medication group. This is a limitation of this study, as it may have impacted the models' ability to confidently differentiate between groups and accurately predict the high-reading percentages. By comparing model performance results (Table 5) and SHAP values (Figures A1-A8), we found that including the medication feature did not significantly change our results when compared to the baseline model. Thus, future analyses should be performed, with more data, to confirm the effectiveness of including this feature.
Data missingness is another limitation, as we used real-world clinical datasets. After we removed samples with missing values for any of our selected features, the size of data available for training and testing was reduced to 70% and 25% of the original dataset size in the OUH and RBH datasets, respectively. Thus, future analyses would greatly benefit from more data or the application of different data imputation methods. For this study, we did not have enough data to impute missing values in a way that would be biologically accurate.
In general, overall performance did not differ significantly with the addition of non-blood-glucose features. This may be due to the size of our dataset, as a larger number of features often requires a larger sample size for training. Additionally, it is difficult to understand how different behavioral, physiological, and genetic factors independently and collectively affect GDM, and thus, there may be other factors that are better suited to use in the model than the ones we tested. For example, known diabetes risk factors (e.g., high BMI, previously having a macrosomic baby, being from an ethnicity with a higher prevalence of GDM) may be important in blood glucose prediction. Whilst we were not able to test these features in this paper, this reflects the reality of the features directly accessible to most GDM clinics. Although BMI was not included in our main model development (due to many missing values), it was ranked moderately high in terms of feature importance in our preliminary analyses (Table A7). Thus, future models can consider additional EHR features during development, as they may help improve model performance.

Conclusions
With the massive growth of digital sensors and electronic data continuing to saturate healthcare, machine learning will greatly support clinicians in optimizing healthcare utilization and facilitating patient care. This paper presents one of the largest clinical machine learning studies on GDM patient stratification and provides a proof-of-concept demonstration of how personalized patient care can be implemented for GDM patients. As there is currently no mechanism in place to predict those women at risk of hyperglycaemia, our study outlines and demonstrates a straightforward method for implementing proportionate care delivery based on features already available in many GDM clinics. Additionally, our framework has the potential to be extended to and used with many other predictor features and applications. Overall, machine learning in GDM is still a relatively new area; thus, additional model training and external validation are necessary to improve our understanding of GDM, clinical management, and ultimately, overall maternal and fetal health and care.

Tags, EHR, Gradients, High-Readings 0.7 0.5 0.05 3 47 0.9

Table A6. Hyperparameter values used in model development (XGBoost) for additional experiments: (1) Including BMI as a feature, (2) Using a dataset filtered for blood glucose range [1,30].