The Unrealised Potential for Predicting Pregnancy Complications in Women with Gestational Diabetes: A Systematic Review and Critical Appraisal

Gestational diabetes (GDM) increases the risk of pregnancy complications. However, these risks are not the same for all affected women and may be mediated by inter-related factors including ethnicity, body mass index and gestational weight gain. This study was conducted to identify, compare, and critically appraise prognostic prediction models for pregnancy complications in women with GDM. A systematic review of such models was conducted, with critical appraisal using the prediction model risk of bias assessment tool (PROBAST). Five prediction modelling studies were identified, which together developed ten prognostic models primarily intended to predict pregnancy complications related to GDM. While the composition of the pregnancy complications predicted varied, the delivery of a large-for-gestational age neonate was the subject of prediction in four studies, either alone or as a component of a composite outcome. Glycaemic measures and body mass index were selected as predictors in four studies. Model evaluation was limited to internal validation in four studies and not reported in the fifth. Performance was inadequately reported, with no useful measures of calibration and no formal evaluation of clinical usefulness. Critical appraisal using PROBAST revealed that all studies were subject to a high risk of bias overall, driven by methodologic limitations in statistical analysis. This review demonstrates the potential for prediction models to provide an individualised absolute risk of pregnancy complications for women affected by GDM. However, at present, a lack of external validation and high risk of bias limit clinical application. Future model development and validation should utilise the latest methodological advances in prediction modelling to achieve the evolution required to create a useful clinical tool. Such a tool may enhance clinical decision-making and support a risk-stratified approach to the management of GDM. Systematic review registration: PROSPERO CRD42019115223.


Introduction
Gestational diabetes (GDM) affects 7-20% of pregnancies and confers an increased risk of pregnancy complications with health consequences for both mother and baby. These risks are related to elevated glucose in GDM, but the relationship is complex, and an individual's risk is modified by interrelated factors, including maternal weight [1,2], gestational weight gain [3], and ethnicity [4]. Accumulating empirical data suggest this phenotypic heterogeneity may be explained by multiple physiologic defects, demonstrable on sophisticated laboratory testing of insulin secretion and sensitivity [5,6]. As a result of this heterogeneity, there is a continuum and breadth in the risk of pregnancy complications associated with contemporary definitions for this condition [7,8]. Therefore, for GDM, as in much of healthcare, there is a need to move from the current one-size-fits-all approach towards a personalised and risk-stratified model-of-care.
A personalised approach would stratify women with GDM by the estimated risk of pregnancy complications. Those at high risk would maximally benefit from the targeted delivery of evidence-based preventative and therapeutic interventions. Those at low risk would be spared unnecessary treatment and may be offered less intensive intervention. Accurate risk prediction models that work within existing diagnostic definitions and use predictors readily available in routine care could be implemented in clinical care and would be feasible and scalable. From a public health perspective, this could enable a risk-stratified approach and the development of new models of care to better allocate scarce healthcare resources, which is imperative in the context of increasing GDM prevalence [9][10][11][12].
Stratifying affected women by their risk of pregnancy complications requires a method to estimate the absolute risk of future events in an individual based on readily available characteristics, that is, a prediction model. Many fields of medicine have seen rapid growth in the development of prediction models. However, such models are rarely translated to clinical practice [13,14], and hence rarely influence patient care as positively as their creators intended. A systematic review was conducted to establish the existing literature and inform progress towards optimal primary research in prediction modelling [15] and its translation into clinical care.
The aims of this systematic review were to: identify prognostic prediction models for pregnancy complications in women with GDM; describe characteristics of the identified prognostic prediction models qualitatively; compare the performance of identified prognostic prediction models quantitatively, with meta-analysis if appropriate; and critically assess the conduct and reporting of prediction modelling development methods.

Materials and Methods
A detailed description of the methods is available in the published protocol [16]. In summary, a systematic review of prediction modelling studies for pregnancy complications in women with GDM was conducted to identify all prediction models relevant to developing a risk-stratified approach to GDM. A sensitive search strategy was developed combining the Ingui filter for prediction modelling studies [17] as updated by Geersing [18] with keywords and subject headings for gestational diabetes and relevant pregnancy complications (Supplementary Table S1). A search of MEDLINE and Embase from inception to 16 August 2019 was executed. No limits on publication date or language were applied. Study selection, data extraction and critical appraisal were conducted independently by two reviewers. Data extraction was conducted with guidance from the CHARMS checklist (checklist for critical appraisal and data extraction for systematic reviews of prediction modelling studies) [19]. Critical appraisal focusing on the risk of bias and concerns regarding applicability was conducted using the Prediction model Risk Of Bias Assessment Tool (PROBAST) [20]. The systematic review protocol was registered with the International Prospective Register of Systematic Reviews (PROSPERO), number CRD42019115223.

Study Selection
The search returned 12,161 unique records. Following title and abstract screening, the full texts of 63 articles were assessed. Five studies meeting the selection criteria were included in this review (Figure 1) [21][22][23][24][25].

Study Characteristics
The five included studies reported the development of ten prediction models for pregnancy complications in women with GDM (Table 1). No validation studies were identified. The composition of the pregnancy complications predicted varied across the studies. Four studies reported the development of a single prediction model [22][23][24][25]. One study reported the development of six models, one for each of the six outcomes (primary caesarean delivery, birth injury, large-for-gestational-age (LGA), adiposity, hyperinsulinaemia, hypoglycaemia) [21]. In this review, these six models are presented collectively due to the shared model development process and methodological characteristics.

Source of Data and Participants
Four prediction modelling studies had a retrospective study design using routinely collected data, three from a single centre [22,23,25] and one from multiple centres [24] (Table 2). One prediction modelling study used a historical cohort from a single centre of a multi-centre prospective observational study [21].
The study populations varied across studies (Table 3). Two studies included only women with GDM [23,24] while three also included women without GDM [21,22,25]. Diagnostic criteria for GDM varied by region. Uniquely, one study was a post hoc analysis of a historical cohort where participants and clinicians were blinded to oral glucose tolerance test (OGTT) results [21]. Hence, in this study, the 10.5% of participants who would meet the International Association of Diabetes and Pregnancy Study Groups (IADPSG) diagnostic criteria for GDM did not receive any specific treatment for this condition. In the other studies, treatment for GDM followed a standardised institutional protocol [23][24][25] or was not reported [22].
Exclusion criteria were comparable across studies, with the exception of the prospective observational study with blinded OGTT results [21] (Table 3). This study had more extensive exclusion criteria to reduce the likelihood of non-adherence to the study protocol. This prediction modelling study also excluded participants with missing data for predictors and those of non-Caucasian ethnicity.
Table 3 (final row only): The fetal overgrowth index [25]; women with pre-gestational diabetes or GDM; multiple pregnancy, birth <20 weeks gestation; 50.5%; 82% White; universal screening with ACOG approach b; GDM treatment as per standardised institutional protocol.
Abbreviations: GDM, gestational diabetes; HAPO, Hyperglycaemia and Adverse Pregnancy Outcome; IADPSG, International Association of Diabetes and Pregnancy Study Groups; NA, not applicable; OGTT, oral glucose tolerance test; NR, not reported; ACOG, American College of Obstetricians and Gynaecologists. a The HAPO study [26] included all pregnant women unless they had one or more exclusion criteria listed above. b ACOG approach = two-step procedure using a screening 50 g glucose challenge test (GCT) with abnormal ≥ 140 mg/dL (7.8 mmol/L) and a diagnostic 100 g 3 h oral glucose tolerance test (OGTT) with two or more values above the Carpenter-Coustan cut-offs [27].

Outcome(s) to be Predicted
The pregnancy complications for prediction in women with GDM varied across studies (Table 4 and Figure 2). The most common outcome to be predicted was the delivery of an LGA neonate, which was included as an outcome in four of the five prediction modelling studies. In these four studies, LGA was defined as greater than the 90th percentile for gestational age. However, the methods of standardisation varied: one study adjusted for maternal height and weight, ethnicity and parity [25], and the others adjusted for fetal sex and parity [24], fetal sex and ethnicity [22] or fetal sex alone [21]. Two of these studies included LGA within a composite outcome [22,24]; it was predicted as a single outcome in one of several separate models in another study [21], and it was the sole outcome in the fourth [25].
Neonatal hypoglycaemia was selected as an outcome in three studies. It was universally defined as blood glucose < 2.2 mmol/L; however, the timing and mode of measurement varied and were not defined in two studies [21,24]. Neonatal hyperinsulinaemia was selected as an outcome in two studies, defined as an elevated cord C-peptide level in one study [21] and as an elevated neonatal serum insulin measure in the other [24].
Two models predicted a composite outcome [22,24]. The first predicted a composite of adverse outcomes affecting both the mother and neonate [22]. The second was limited to complications affecting the neonate [24]; however, this composite was more extensive, including 11 outcomes. Similar pregnancy complications were the subject of a third prediction modelling study; however, here the outcomes were predicted in six discrete models, each for a single outcome, rather than as a composite [21].
Table 4. Outcome(s) to be predicted and candidate predictors in models for pregnancy complications in women with gestational diabetes.

Candidate Predictors
The number of candidate predictors investigated ranged from seven to 24 (Table 4). A variety of candidate predictors were chosen for model development, with a tendency towards those that are routinely available in clinical practice via patient history or physical examination, or glycaemic measures available routinely via diagnostic testing for GDM. One study considered predictors from investigations that may not be available in routine care in all settings: serum analytes (first-trimester pregnancy-associated plasma protein A, and second-trimester total human chorionic gonadotropin and inhibin-A) and fetal abdominal circumference from an obstetric ultrasound performed between 24 and 30 weeks gestation [25].

Model Development
The presence and handling of missing data were not adequately reported in the four studies utilising routinely collected data [22][23][24][25] (Supplementary Table S2).
Statistical power varied across the studies, from two to 106 events per predictor (Table 4 and Figure 3).
Figure 3. … [24]; B, Park et al. [22]; C, Phaloprakarn and Tangjitgamol [23]; D, Tomlinson et al. [25]. The EPP could not be calculated for McIntyre et al. [21]. An EPP above 10 to 20 is regarded as the minimum sample size for model development [19]. This graphical presentation format was adapted from Ensor and colleagues [30].
Multivariable logistic regression was the most commonly used modelling method, described in four studies [21][22][23][25]. Notably, a tree-based approach, namely the recursive partitioning and amalgamation (RECPAM) method, was used in one study (Supplementary Table S2) [24]. In this study, a binary decision tree that uses answers from a series of yes/no questions about clinical characteristics was developed to predict an individual's likely outcome. Continuous predictors were dichotomised using cut-points fitted to the development data in two studies [24,25]. Methods for the selection of predictors for inclusion in and during multivariable modelling varied across the studies (Supplementary Table S2).

Predictors Selected in the Final Models
Ten predictors were selected for the final models across the five studies (Table 5 and Figure 4). Some measure of glycaemia and BMI were included in the final models of four of the five studies.

Model Evaluation
Model performance was evaluated using the same dataset used to develop the model (apparent validation) in all but one study [25] (Table 5). In this study, the results of internal validation based on resampling of the development dataset using bootstrapping were reported. No studies reported results of external validation as a measure of the transportability of the developed model to new populations.

Model Presentation
Three studies [23][24][25] included an alternative presentation format of the final models designed for ease of use and clinical application (Table 5).

Comparison of Predictive Performance
Model performance was most commonly reported in terms of discrimination, with four of the studies reporting a concordance statistic (c-statistic) for their final models [21][22][23][25]. For these four models predicting a binary outcome, the c-statistic is equal to the area under the receiver operating characteristic (ROC) curve (AUC) and ranged from 0.517 to 0.911 (Table 5). Two studies reported that calibration was adequate, presenting non-significant findings for the Hosmer-Lemeshow goodness-of-fit test [22,23]. A meta-analysis of performance measures was not appropriate because the model development studies were not sufficiently homogeneous with regard to the outcome to be predicted and no validation studies for a common prediction model were identified.
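To make the equivalence between the c-statistic and the AUC concrete, the following is a minimal sketch in Python that computes the c-statistic directly as the proportion of event/non-event pairs in which the event received the higher predicted risk. The function name and the four predicted risks are hypothetical and are not taken from any included study.

```python
from itertools import product


def c_statistic(y_true, y_score):
    """Concordance statistic for a binary outcome.

    Proportion of (event, non-event) pairs in which the event has the
    higher predicted risk; tied pairs count as half.
    """
    events = [s for y, s in zip(y_true, y_score) if y == 1]
    non_events = [s for y, s in zip(y_true, y_score) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for e, n in product(events, non_events):
        if e > n:
            concordant += 1.0
        elif e == n:
            concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Hypothetical outcomes and predicted risks (illustrative only).
outcomes = [0, 0, 1, 1]
risks = [0.10, 0.40, 0.35, 0.80]
print(c_statistic(outcomes, risks))  # 3 of 4 pairs are concordant -> 0.75
```

A value of 0.5 corresponds to ranking no better than chance and 1.0 to perfect separation of events from non-events.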
Four studies reported classification measures: sensitivity and specificity and/or positive and negative predictive values [21][22][23][25]. Cut-points were not determined a priori. In two studies they were determined by selecting a point on the ROC curve closest to the upper left corner, which maximises the Youden index [21,22]. The method for determination was not specified in the other two studies [23,25].
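As an illustration of data-driven cut-point selection, the sketch below computes sensitivity and specificity at each candidate threshold and then picks a cut-point by two criteria: the maximum Youden index (sensitivity + specificity - 1) and the minimum distance to the upper left corner of the ROC plot. The two criteria often select the same point, as in this example, but they are distinct and need not always agree. All names and data are hypothetical.

```python
def roc_points(y_true, y_score):
    """Return (threshold, sensitivity, specificity) for each distinct score,
    classifying an observation as positive when score >= threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one event and one non-event")
    points = []
    for t in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < t)
        points.append((t, tp / n_pos, tn / n_neg))
    return points


def youden_cut_point(points):
    """Cut-point maximising the Youden index J = sensitivity + specificity - 1."""
    return max(points, key=lambda p: p[1] + p[2] - 1)


def closest_to_corner_cut_point(points):
    """Cut-point minimising squared distance to (sensitivity, specificity) = (1, 1)."""
    return min(points, key=lambda p: (1 - p[1]) ** 2 + (1 - p[2]) ** 2)


# Hypothetical outcomes and predicted risks (illustrative only).
outcomes = [0, 0, 0, 0, 1, 0, 1, 1]
risks = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
points = roc_points(outcomes, risks)
print(youden_cut_point(points))             # threshold 0.5
print(closest_to_corner_cut_point(points))  # threshold 0.5
```

Because such cut-points are tuned to the development data, the resulting sensitivity and specificity are likely to be optimistic when the same threshold is applied to new patients.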

Risk of Bias and Concerns Regarding the Applicability of Models
As assessed using PROBAST [20], all models had a high risk of bias driven by the analysis domain (Figure 5 and Supplementary Table S3). There was a high concern regarding the applicability of two studies to the systematic review question [21,22], as only a minority of the participants were affected by GDM. Moreover, in one of these studies [21], the exclusion of non-Caucasian women and women with a history of gestational diabetes requiring pharmacologic treatment may further limit its applicability.
Figure 5. The risk of bias and concern regarding the applicability of the models developed in the five prediction modelling studies for pregnancy complications in women with gestational diabetes using the Prediction model Risk of Bias Assessment Tool (PROBAST). The x-axes display the proportion of studies rated by level of concern (low, high or unclear) for risk of bias or applicability for each domain.

Discussion
This systematic review identified five prediction modelling studies for pregnancy complications in women with GDM. Approaches to prediction varied, but the birth of an LGA neonate was the leading outcome, whether as part of a composite or singularly. Models seeking to predict a single outcome were more discriminatory than those predicting a composite outcome. Three predictors emerged in most models: glycaemic measures, BMI, and maternal age. Predictive performance was generally inadequately reported, and external validation was lacking. All models had a high risk of bias due to methodologic limitations in analysis as assessed by PROBAST.

Models Identified
Ten prediction models were developed by five prediction modelling studies, reflecting five distinct approaches to the clinical problem of quantifying the absolute risk of pregnancy complications in women with GDM. The literature is relatively sparse compared to the related, but distinct, literature on diagnostic prediction models for the development of GDM, with 17 models identified in a recent review [31]. However, interest in prognostic-based approaches to pregnancy risks in GDM is growing, with the first model published ten years ago and the later four within the last three years.
The model developed using a prospective cohort utilised an unselected population of pregnant women, of whom 10.5% would meet current diagnostic criteria for GDM [21]. We acknowledge that the population in this prediction modelling study may not strictly meet the population criteria for eligibility of this review. However, we believe that the omission of this study would limit the value of this review, given its robust prospective study design and unique treatment-naïve study population. Furthermore, recognising that there is a continuum of risk for pregnancy complications related to GDM, this study is valuable because it facilitates feasibility assessment for a prediction model for pregnancy complications in women with hyperglycaemia independent of the consensus-based International Association of Diabetes and Pregnancy Study Groups diagnostic threshold for GDM.
A recent prediction modelling study conducted by Barnes and colleagues [32] featured prominently at the title and abstract screening stage as it seemed to be especially applicable to the review question. This study developed and externally validated a model to predict the need for insulin therapy in women with GDM. Following model development, a post hoc analysis found that the outcome of this model, the need for insulin therapy, was strongly associated with pregnancy complications. As pregnancy complications, the outcome of interest for this review, were not the subject of this prediction modelling study, it was ultimately excluded. We note, however, the close relationship of these outcomes and the relevance of this existing prediction modelling study to the overarching aim of developing a stratified model-of-care for women with GDM.

Outcome(s) to Be Predicted
The delivery of an LGA neonate was the most common outcome predicted, offering three key advantages. Firstly, it reflects the classical maternal hyperglycaemia-fetal hyperinsulinaemia hypothesis, linking maternal hyperglycaemia to the LGA neonate, via transplacental glucose transport causing secondary fetal hyperinsulinaemia [33]. Secondly, although potentially too simplistic, excessive fetal growth is the unifying feature linking GDM to downstream pregnancy complications, such as failure to progress in labour, obstetric intervention, and shoulder dystocia. Thirdly, an LGA neonate with excess neonatal adiposity has poorer long-term metabolic health [34][35][36][37][38], with potential inter-generational implications [39].
Where multiple outcomes are potentially relevant, the prediction of a composite outcome, as in two of the models [22,24], may more accurately quantify multiple risks that concern women and clinicians and may be more translatable into clinical practice than a model predicting a single outcome. However, a poorly constructed composite outcome may be confusing and limit clinical application. Future model development should provide a clear rationale for the use and formulation of composite outcomes. There may be utility in heeding recommendations that the components of composite outcomes: (1) are of similar importance, (2) occur with similar frequency, and (3) are likely to have similar relative risk reductions (or predictive effects moving in the same direction) with similar underlying biology [40].

Model Development
Two studies were inadequately powered, with fewer than 10 events per predictor (EPP) [23,25] (Table 4 and Figure 3), generating significant risks of overfitting and consequently biased predictions [19]. The two studies which predicted a composite outcome [22,24] were adequately powered for model development, with more than twenty EPP [19]. This is an advantage of composite outcomes where event rates are low (Figure 4). The EPP for the six models developed in one study could not be calculated as the candidate predictors were not reported [21]. Although an EPP above 10-20 is traditionally advocated as the minimum sample size for model development [19], future studies should consider the recent proposal that a tailored sample size estimate may be advantageous in certain circumstances [41].
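The EPP rule of thumb discussed here can be expressed as a short calculation. The sketch below assumes the usual convention that the numerator is the number of participants in the smaller outcome category and the denominator is the number of candidate predictor parameters considered, not only those retained in the final model. The function name and the cohort figures are hypothetical.

```python
def events_per_predictor(n_events, n_non_events, n_candidate_parameters):
    """Events per predictor (EPP).

    Uses the smaller of the two outcome groups as the effective number of
    events and divides by the number of candidate predictor parameters.
    """
    if n_candidate_parameters <= 0:
        raise ValueError("at least one candidate predictor parameter is required")
    return min(n_events, n_non_events) / n_candidate_parameters


# Hypothetical development cohort: 1000 women, 120 with the outcome,
# 24 candidate predictor parameters (illustrative figures only).
epp = events_per_predictor(n_events=120, n_non_events=880, n_candidate_parameters=24)
print(epp)        # 5.0
print(epp >= 10)  # False: below the traditional minimum of 10 to 20
```

A composite outcome raises the numerator because any component event counts, which is how composite outcomes can lift a study above the threshold when individual event rates are low.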
The dichotomisation of continuous predictor variables leads to substantial loss of information and is widely discouraged [42][43][44][45]. This is an inherent disadvantage of the tree-based model included in this review [24] and was also notable in a model developed using classical regression methods [25]. In both models, continuous predictor variables were dichotomised using data-driven cut-points, leading to high risks of bias [46]. In two models, continuous variables were dichotomised using cut-points which were pre-defined and independent of the development data [21,23], which leads to the loss of information but minimises the risk of bias. Continuous predictors were only handled optimally in one study [22]. Future model development studies should avoid dichotomising continuous predictors.
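The loss of information from dichotomisation can be seen in a toy example. The sketch below, using entirely hypothetical BMI values and outcomes, compares the c-statistic obtained when BMI is used as a continuous predictor with that obtained after dichotomising at a pre-defined cut-point of 30 kg/m². It illustrates the mechanism only and says nothing about the size of the effect in real data.

```python
def c_statistic(y_true, y_score):
    """Proportion of event/non-event pairs ranked correctly; ties count half."""
    events = [s for y, s in zip(y_true, y_score) if y == 1]
    non_events = [s for y, s in zip(y_true, y_score) if y == 0]
    total = 0.0
    for e in events:
        for n in non_events:
            total += 1.0 if e > n else 0.5 if e == n else 0.0
    return total / (len(events) * len(non_events))


# Hypothetical BMI values (kg/m^2) and outcomes (1 = complication).
bmi = [21, 23, 25, 27, 30, 31, 33, 36]
outcome = [0, 0, 0, 0, 1, 0, 1, 1]

# Dichotomise at a pre-defined cut-point: 1 if BMI >= 30, else 0.
bmi_high = [1 if b >= 30 else 0 for b in bmi]

print(round(c_statistic(outcome, bmi), 3))       # 0.933 using continuous BMI
print(round(c_statistic(outcome, bmi_high), 3))  # 0.9 after dichotomisation
```

After dichotomisation every woman above the cut-point receives the same score, so the model can no longer distinguish a BMI of 31 from a BMI of 36, and the ordering information within each group is discarded.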
Selection of predictors is a key component of model building at two stages: (1) inclusion in modelling and (2) during multivariable modelling. Where selection is based on associations with the outcome in the development dataset, there is a high risk that the developed model will be overfitted to this dataset. This was observed in three studies [22,23,25] and was inadequately reported in the other two [21,24]. In future studies, model generalisability may be improved by a priori selection of predictors for inclusion or selection independent of the predictor-outcome association, such as selection based on clinical expertise [47].

Predictors Selected in the Final Model
The predictors most commonly selected in final models, glycaemic measures (n = 4), BMI (n = 4), and age (n = 2) are routinely available in clinical practice. These predictors should be included in the set of candidate predictors evaluated in future model development studies.

Model Evaluation
Model evaluation addresses two questions pivotal for the application of prediction models into clinical practice: (1) how accurate are its predictions and (2) how generalisable is it [48]. Accuracy relates to a model's internal validity or "reproducibility" [45]. Internal validation techniques include apparent or bootstrap validation. Generalisability considers how well the model is likely to perform in a new but related population. It relates to a model's external validity or transportability, tested by evaluating the model in a new population (external validation) [45]. External validation measures predictive performance and corrects internal validation for the inherent optimism of a model being overfitted to the development dataset. As such, it provides a more realistic measure of predictive performance and quantifies the "transportability" of a model to other populations. No studies identified here reported external validation.
Three studies reported measures of model performance using apparent validation [21][22][23]. Here model performance is evaluated against the sample from which it was developed. The utility is limited, as performance is biased towards overestimating model performance in new populations. This risk of bias is further exaggerated with the small development datasets noted here.
One reviewed study assessed internal validity using simple bootstrapping validation, with the model repeatedly fitted to 1000 bootstrap samples and the average area under the ROC curve calculated as an estimate of the future performance of the model in other populations [25]. This technique for internal validation is preferable to the apparent validation conducted in the other studies, as it gives less biased performance estimates. It is also preferable to split-sample validation, as the data available for development and validation are maximised. Ultimately, however, external validation is essential to establish the confidence in a model required for clinical application.
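For readers unfamiliar with bootstrap internal validation, the following is a minimal sketch of one common variant, the optimism-corrected bootstrap: the model is refitted in each bootstrap sample, its AUC in that sample is compared with its AUC in the original data, and the average difference (the optimism) is subtracted from the apparent AUC. This is not necessarily the exact procedure used in the reviewed study [25]. The data are simulated, the logistic regression is a deliberately simple gradient-ascent fit, and all names and settings (sample size, number of bootstrap samples, learning rate) are assumptions chosen so the example runs quickly.

```python
import math
import random


def auc(y, scores):
    """Concordance (c-statistic): share of event/non-event pairs ranked correctly."""
    pos = [s for yi, s in zip(y, scores) if yi == 1]
    neg = [s for yi, s in zip(y, scores) if yi == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def linear_predictor(w, x):
    """Intercept plus weighted sum of predictors (log-odds scale)."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))


def fit_logistic(X, y, steps=100, lr=0.5):
    """Fit logistic regression by plain gradient ascent on the mean log-likelihood."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(y)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, yi in zip(X, y):
            z = max(-30.0, min(30.0, linear_predictor(w, x)))  # avoid overflow
            err = yi - 1.0 / (1.0 + math.exp(-z))
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w


def optimism_corrected_auc(X, y, n_boot=50, seed=0):
    """Return (apparent AUC, optimism-corrected AUC)."""
    rng = random.Random(seed)
    n = len(y)
    w = fit_logistic(X, y)
    apparent = auc(y, [linear_predictor(w, x) for x in X])
    optimism = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # AUC is undefined without both outcome classes
        wb = fit_logistic(Xb, yb)
        auc_boot = auc(yb, [linear_predictor(wb, x) for x in Xb])
        auc_orig = auc(y, [linear_predictor(wb, x) for x in X])
        optimism.append(auc_boot - auc_orig)
    if not optimism:
        raise ValueError("no usable bootstrap samples")
    return apparent, apparent - sum(optimism) / len(optimism)


# Simulated (not real) data: one informative and one noise predictor.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
y = [1 if rng.random() < 1 / (1 + math.exp(-1.5 * x[0])) else 0 for x in X]
apparent, corrected = optimism_corrected_auc(X, y)
print(f"apparent AUC {apparent:.3f}, optimism-corrected AUC {corrected:.3f}")
```

In practice many more bootstrap samples would be drawn, and every data-driven modelling step, including predictor selection and cut-point choice, would need to be repeated inside each bootstrap sample for the correction to be honest.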
The included studies did not report any formal evaluation of the clinical usefulness of the developed models. This could include the net-benefit realised by using the model's predictions to guide clinical decision-making [49]. Future studies may consider decision curve analysis [50] to quantify the clinical value of developed models as is increasingly reported in the literature [51] and recommended [52][53][54].
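The net benefit that underlies decision curve analysis can be computed directly from predicted risks and outcomes. The sketch below uses the standard definition: true positives per patient minus false positives per patient, with false positives weighted by the odds of the risk threshold. It compares a hypothetical model with the default strategy of treating everyone; treating no one has a net benefit of zero. The function names and data are illustrative assumptions.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold.

    NB = TP/n - (FP/n) * threshold / (1 - threshold)
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of treating every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical outcomes and predicted risks (illustrative only).
outcomes = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
risks = [0.80, 0.60, 0.30, 0.10, 0.20, 0.40, 0.70, 0.10, 0.20, 0.05]

for t in (0.2, 0.3, 0.5):
    print(t, round(net_benefit(outcomes, risks, t), 3),
          round(net_benefit_treat_all(outcomes, t), 3))
```

A decision curve plots these quantities across a range of clinically plausible thresholds; a model is useful at a given threshold only if its net benefit exceeds that of both treating all and treating none.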

Model Presentation
A clinical prediction model is developed with the overarching premise to be applied to clinical care to improve outcomes. Hence, developed models should be presented in a format fit for this purpose [55]. Two studies presented simplified scoring systems [23,25], and one presented a simple decision tree [24], all readily applicable to clinical care. The other studies reported regression coefficients for included predictors but did not present baseline components or regression equations [21,22], and as such do not facilitate the clinical application of the models reported.

Comparison of Predictive Performance
The overall performance of prediction models is traditionally quantified with two essential measures of predictive performance: discrimination and calibration [49]. Consistent with general prediction model literature [48], performance was incompletely evaluated in prediction models identified in this review (Table 5). Where reported, discrimination was evaluated using the c-statistic and was graphically presented with ROC curves. Calibration was evaluated using the Hosmer-Lemeshow goodness of fit test.
Discrimination was generally higher for models predicting a single outcome than for those predicting composite outcomes, as observed in the risk score for pre-eclampsia [23] and the fetal overgrowth index [25]. Models with single outcomes also had minimal loss in performance when adapted into simplified clinical tools. Discrimination was more limited with composite outcomes [22] and was not reported for the model developed using a decision tree for a composite of neonatal complications [24]. In the fifth study, discrimination varied considerably across the six independent models for single pregnancy complications [21].
Calibration was evaluated in two studies using the Hosmer-Lemeshow goodness-of-fit test, suggesting agreement between predicted and observed probabilities [22,23]. However, this method alone does not capture the magnitude or direction of miscalibration, limiting clinical utility. For instance, a model with accurate predictions in the intermediate risk range can consistently overestimate in the low-risk range and underestimate in the high-risk range, limiting clinical usefulness. Another limitation of the Hosmer-Lemeshow test is that it is strongly related to sample size: it is usually non-significant for small data sets and, conversely, significant for large data sets [46]. Future work should ideally report calibration graphically (calibration plot) or in a tabular format, comparing predicted probabilities to observed outcome frequencies and allowing the model's performance to be assessed at clinically relevant risk ranges [46,56].
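A tabular calibration summary of the kind recommended here can be produced with a few lines of code. The sketch below sorts patients by predicted risk, splits them into equally sized groups, and reports the mean predicted risk next to the observed event frequency in each group, which shows both the direction and the magnitude of any miscalibration. The function name, the number of groups, and the data are hypothetical.

```python
def calibration_table(y_true, y_prob, n_groups=10):
    """Return one row per risk group: (group size, mean predicted risk,
    observed event frequency), with groups ordered from lowest to highest risk."""
    pairs = sorted(zip(y_prob, y_true))
    n = len(pairs)
    rows = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue  # more groups requested than observations available
        mean_predicted = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(y for _, y in chunk) / len(chunk)
        rows.append((len(chunk), mean_predicted, observed))
    return rows


# Hypothetical predicted risks and outcomes (illustrative only).
risks = [0.1, 0.1, 0.2, 0.2, 0.6, 0.6, 0.8, 0.8]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]

for size, predicted, observed in calibration_table(outcomes, risks, n_groups=2):
    print(f"n={size}  mean predicted={predicted:.2f}  observed={observed:.2f}")
# n=4  mean predicted=0.15  observed=0.25
# n=4  mean predicted=0.70  observed=0.75
```

Plotting observed frequency against mean predicted risk for each group, with the diagonal as the reference line, gives the corresponding calibration plot.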

Risk of Bias
Collectively, the high risk of bias for prognostic prediction models for women with GDM limits generalisability (Figure 5 and Supplementary Table S3), driven by analysis methods and overfitting of models to development datasets. This is reflective of the rapid evolution of prediction modelling methodology. The risk of bias of future prediction models may be reduced by addressing the findings of this review and referring to relevant guidelines such as the TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) statement [57].

Concerns Regarding Applicability
The applicability of identified models varied based on the participants in the development dataset (Figure 5 and Supplementary Table S3). Two studies sought to develop prediction models for pregnancy complications as a basis for defining alternative diagnostic criteria, and so included women both with and without GDM [21,22]. Their aims therefore differed from the focus of this review, the prediction of pregnancy complications in women with GDM, which limits their applicability.

Strengths and Limitations of this Review
To our knowledge, this is the first systematic review of prediction models for pregnancy complications in GDM. Strengths of this review include rigorous methods and a sensitive search strategy utilising standardised and validated search terms across the entirety of the two leading databases of biomedical literature since their inception. Bias was minimised with prospective registration and peer-reviewed publication of the protocol [16]. The risk of bias and applicability of included models were systematically and objectively evaluated using PROBAST [20], a robust and for-purpose tool.
Limitations included the inability to synthesize the quantitative characteristics of included models due to the heterogeneity of included studies. However, the results of the systematic search support this broad approach. We note that, despite the clinical imperative, there are only five prediction modelling studies for pregnancy complications in GDM that met our eligibility criteria in the indexed medical literature to date.
Finally, given the heterogeneity of this condition and the variety of diagnostic approaches currently used, a developed model cannot be assumed to perform equally well (or poorly) in a new, but related population. Clinical prediction modelling is an iterative multi-stage process [58]. External validation in a range of new but related populations, both geographically and temporally, is crucial but, as this review suggests, frequently neglected. Such external validation may facilitate model updating which, by addressing the characteristics of the local population and clinical practice, is likely to optimise model performance [59].

Conclusions
This review demonstrates the potential for prediction models to provide an individualised absolute risk of pregnancy complications for women affected by GDM. However, limitations in current models have been identified, which emphasises that future model development and validation would benefit from the application of methodologic advances in this rapidly evolving field. External validation, including appropriate reporting of calibration and formal evaluation of clinical usefulness with decision curve analysis, will significantly assist the translation of promising statistical models into a useful clinical tool. Such a tool would be capable of improving outcomes for women with GDM by enhancing clinical decision-making and facilitating the stratification of affected women by their risk of pregnancy complications, thus enabling a personalised model-of-care.
Supplementary Materials: The following are available online at http://www.mdpi.com/1660-4601/17/9/3048/s1, Table S1: Search strategy for MEDLINE, Table S2: Model development characteristics of models for pregnancy complications in women with gestational diabetes, Table S3: The risk of bias and concern regarding the applicability of the models developed in the five prediction modelling studies using the Prediction model Risk of Bias Assessment Tool.
The funding bodies had no role in the study design, the collection, analysis and interpretation of the data, the writing of the report, nor the decision to submit the paper for publication.

Acknowledgments:
The authors thank Anne Young, Medical Librarian at Monash University, and Marie Misso, Head of Evidence Program at the Monash Centre for Health Research and Implementation for their expertise in developing the search strategy and systematic review methodology respectively. Preliminary findings from this work were presented at the 10th International Symposium on Diabetes, Hypertension, Metabolic Syndrome and Pregnancy (DIP 2019).

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

AUC, area under the receiver operating characteristic curve
BMI, body mass index
c-statistic, concordance statistic
CHARMS, checklist for critical appraisal and data extraction for systematic reviews of prediction modelling studies
EPP, events per predictor
GDM, gestational diabetes
GWG, gestational weight gain
HAPO, hyperglycaemia and adverse pregnancy outcomes
LASSO, least absolute shrinkage and selection operator
LGA, large-for-gestational age
OGTT, oral glucose tolerance test
ROC, receiver operating characteristic
PROBAST, prediction model risk of bias assessment tool
RECPAM, recursive partitioning and amalgamation
TRIPOD, transparent reporting of a multivariable prediction model for individual prognosis or diagnosis