Prediction Models of Obstructive Sleep Apnea in Pregnancy: A Systematic Review and Meta-Analysis of Model Performance

Background: Gestational obstructive sleep apnea (OSA) is associated with adverse maternal and fetal outcomes. Timely diagnosis and treatment are crucial to improve pregnancy outcomes. Conventional OSA screening questionnaires are less accurate during pregnancy, and various prediction models have been studied specifically in pregnant women. Methods: A systematic review and meta-analysis were performed of studies that developed or validated multivariable prediction models for the diagnosis of OSA during pregnancy. Results: Of 1262 articles, only 6 studies (3713 participants) met the inclusion criteria and were included in the review. All studies showed a high risk of bias in model construction. The pooled C-statistics (95% CI) for development prediction models were 0.817 (0.783, 0.850), I2 = 97.81, and 0.855 (0.822, 0.887), I2 = 98.06, for the first and second–third trimesters, respectively. Only the multivariable apnea prediction (MVAP) and Facco models were externally validated, with pooled C-statistics (95% CI) of 0.743 (0.688, 0.798), I2 = 95.84, and 0.791 (0.767, 0.815), I2 = 77.34, respectively. The most common predictors in the models were body mass index, age, and snoring; none included hypersomnolence. Conclusions: Prediction models for gestational OSA showed good performance during early and late trimesters. A high level of heterogeneity and few external validations were found, indicating limited generalizability and the need for further studies.


Introduction
Obstructive sleep apnea (OSA) is a common disorder characterized by repetitive upper airway collapse during sleep, specifically apnea and/or hypopnea, leading to oxygen desaturation, arousal, sleep fragmentation, sympathetic activation, and endothelial dysfunction [1][2][3]. Long-term cardiovascular consequences have been shown in both men and women, despite the lower OSA prevalence in women [4]. OSA in women is often less recognized and underdiagnosed because women tend to present with "non-classical" symptoms of OSA [5]. In addition, snoring and witnessed apnea, the hallmarks of OSA symptomatology, were reported less often in premenopausal women than in post-menopausal women [6,7]. Altogether, OSA in premenopausal women is underrecognized and may be problematic when these women become pregnant.
Gestational OSA has been shown to increase adverse maternal and fetal outcomes such as preeclampsia, gestational hypertension/diabetes, and preterm birth [8]. The prevalence of gestational OSA has increased with rising obesity in the population [9]. Despite evidence showing that early diagnosis and treatment of gestational OSA could improve pregnancy outcomes [10,11], diagnosis of gestational OSA is challenging given the difficulty in access

Search Strategy and Study Location
This study was registered in the international prospective register of systematic reviews (PROSPERO) with registration number CRD42021237996. An extensive literature search was performed in 2 major databases, MEDLINE (from 1996 to 28 February 2021) and Scopus (from 1980 to 28 February 2021), as recommended by the PRISMA guideline [22]. Search terms were constructed according to the PICO principles (i.e., participants, intervention, comparator, and outcome): pregnancy (MeSH), "pregnant women", parturient, gestation*, obstetric; "sleep questionnaire", screening, "prediction model", predictors, "prediction tool", "risk score"; "polysomnography", PSG, sleep test, "home sleep test", "Watch-PAT"; and obstructive sleep apnea (MeSH), "obstructive sleep apnea", "sleep apnea", OSA, "sleep-disordered breathing", snoring. The details on search terms and search strategies for each database are described in the supporting information in Appendix A (Tables A1 and A2).

Inclusion and Exclusion Criteria
Any type of observational study (cross-sectional, cohort, or case-control) or randomized controlled trial published in any language was included in the review if it met all of the following inclusion criteria: (1) it studied pregnant women, (2) it developed or externally validated at least one multivariable model for predicting the diagnosis of OSA during pregnancy, and (3) its outcome of interest was OSA diagnosed by objective sleep tests, including PSG or home sleep apnea test (HSAT).
Studies were excluded if they (1) were reviews or case reports, (2) had insufficient data for pooling despite several attempts to contact the authors, or (3) were multiple publications of the same original study.
All identified articles were combined and duplicates were excluded. Studies were independently selected by reviewers (SS, SR) by screening titles and abstracts. If a decision could not be made based on abstracts, full articles were retrieved. Disagreements between reviewers were adjudicated by a third reviewer (VT).

Data Extraction
Two researchers (SS, VT) independently extracted the data. The general characteristics of the studies, including author, publication year, study design, number of subjects, study phases (i.e., development and internal/external validations), and diagnosis of OSA, were extracted. If the prediction model was a development model, specific information about the model construct (i.e., type of statistical model, predictors and their selection, and creation of scores using coefficients or their exponentials) was extracted. Model performance in discrimination, measured by C-statistics along with 95% CI, was extracted by study phase. In addition, model performance in calibration was also extracted. Details on the prediction models and operational definitions are described in Appendix B.

Reference Test
The outcome of interest was diagnosis of OSA during pregnancy based on objective sleep tests, including PSG or any type of HSAT, including Watch-PAT® (Itamar Medical, Israel). The criterion for diagnosis was defined according to the original studies, i.e., either apnea-hypopnea index (AHI) or respiratory disturbance index (RDI) ≥ 5 events/hour.

Risk of Bias Assessment
We appraised the risk of bias (ROB) of the studies developing or externally validating prediction models using the Prediction Model Risk of Bias Assessment Tool (PROBAST) for systematic reviews [23,24]. This tool contains multiple signaling questions in four different domains: participants, predictors, outcome, and analysis. Signaling questions are answered as "yes", "probably yes", "probably no", "no", or "no information", where "yes" indicates low and "no" indicates high risk of bias. Overall ROB is judged as low risk if all domains are considered low risk, and as high risk if at least one of the domains is considered high risk. Two researchers (SS, VT) independently assessed the ROB.
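The overall judgment rule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `overall_rob` and the rating labels are hypothetical, and the "unclear" fallback for mixed low/unclear domains is an assumption that the text above does not spell out.

```python
from typing import Dict

# The four PROBAST domains named in the text.
DOMAINS = ("participants", "predictors", "outcome", "analysis")


def overall_rob(domain_ratings: Dict[str, str]) -> str:
    """Combine per-domain ratings ('low', 'high', 'unclear') into one overall rating."""
    missing = [d for d in DOMAINS if d not in domain_ratings]
    if missing:
        raise ValueError(f"missing domain ratings: {missing}")
    ratings = [domain_ratings[d] for d in DOMAINS]
    # High risk if at least one domain is high risk.
    if any(r == "high" for r in ratings):
        return "high"
    # Low risk only if every domain is low risk.
    if all(r == "low" for r in ratings):
        return "low"
    # Assumption: anything else (no high, at least one unclear) is reported as unclear.
    return "unclear"
```

For example, a study rated low on participants and predictors but high on outcome or analysis would be classified as high ROB overall, which is the pattern reported for the included studies.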

Statistical Analysis
We calculated and reported descriptive statistics to summarize the characteristics of the models. We calculated the median and interquartile range for continuous variables and the respective percentages for categorical variables. For the prediction models that were examined in more than 2 independent datasets, we applied a random-effects meta-analysis to calculate the summary estimates of C-statistics and calibration separately by study phase. We followed a recently published framework for the meta-analysis of prediction models [23,24]. For studies that reported only C-statistics without a measure of dispersion (e.g., standard error (SE) or 95% confidence interval), the SEs were estimated following a formula. Heterogeneity was assessed using Cochran's Q test and its degree was quantified by the I2 statistic. All statistical analyses were performed using STATA version 16.1 (StataCorp, College Station, TX, USA), with a significance threshold of p-value < 0.05 (2-sided).
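To make the pooling steps concrete, the following Python sketch shows one common way to carry them out. It is not the STATA procedure used in this review: the DerSimonian-Laird estimator, pooling on the untransformed C-statistic scale, and the Hanley-McNeil approximation for a missing SE are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence

Z95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def se_from_ci(lower: float, upper: float) -> float:
    """Back-calculate the SE from a symmetric 95% confidence interval."""
    return (upper - lower) / (2 * Z95)


def se_hanley_mcneil(c: float, n_events: int, n_nonevents: int) -> float:
    """Approximate SE of a C-statistic (Hanley-McNeil); an assumed choice of formula."""
    q1 = c / (2 - c)
    q2 = 2 * c ** 2 / (1 + c)
    var = (c * (1 - c)
           + (n_events - 1) * (q1 - c ** 2)
           + (n_nonevents - 1) * (q2 - c ** 2)) / (n_events * n_nonevents)
    return math.sqrt(var)


def pool_random_effects(estimates: Sequence[float],
                        std_errors: Sequence[float]) -> Dict[str, float]:
    """DerSimonian-Laird random-effects pooling with Cochran's Q and I2 (%)."""
    if len(estimates) != len(std_errors) or len(estimates) < 2:
        raise ValueError("need at least two estimates with matching SEs")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")
    w = [1.0 / se ** 2 for se in std_errors]          # fixed-effect weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    df = len(estimates) - 1
    scale = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / scale) if scale > 0 else 0.0  # between-study variance
    w_re = [1.0 / (se ** 2 + tau2) for se in std_errors]     # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {"pooled": pooled, "se": se_pooled,
            "ci_low": pooled - Z95 * se_pooled, "ci_high": pooled + Z95 * se_pooled,
            "q": q, "tau2": tau2, "i2": i2}
```

Published frameworks for meta-analysis of prediction models often pool the C-statistic on the logit scale instead; the untransformed scale is used here only to keep the sketch short.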

Description of the Included Studies in the Systematic Review
A total of 1262 studies were identified, of which 6 studies (3713 participants) met our inclusion criteria and were included in the meta-analysis (see Figure 1) [15][16][17][18][19][20]. The number of participants was largely driven by one cohort study of pregnant women (3264 participants) during the second-third trimesters [17]. Characteristics of these studies are described in Tables 1 and 2. All studies were prospective cohorts of pregnant women, with a total of 29 prediction models [15][16][17][18][19][20]. Two studies involved high risk pregnancy, which one study defined as chronic hypertension, pre-gestational diabetes, obesity, and/or history of preeclampsia [15], and the other defined as extreme obesity (BMI ≥ 40 kg/m2) [18]. All of the studies included a mixture of ethnicities [15][16][17][18][19][20], with a majority of whites (range 20-60.4%), while 2 studies of the same dataset had 75% African Americans [19,20]. One study included only nulliparous participants [17]. Two studies screened for OSA once, during the first and third trimesters, respectively [15,18]. Four longitudinal studies performed OSA screening twice: 3 studies from the first to the second-third trimesters [17,19,20], and one study from the second to the third trimester [16]. One of these studies reported new-onset OSA [17]. OSA was diagnosed by PSG in three studies [16,19,20], HSAT in two studies [17,18], and Watch-PAT in 1 study [15]. The diagnostic criterion for OSA was AHI ≥ 5 events/hour in 5 studies [15,17,18,19,20] and RDI ≥ 5 events/hour in 1 study [16]. All studies were in the development phase; only one study performed internal validation [17], whereas three studies externally validated previous models [16,19,20].
Key findings recovered from the study characteristics tables (Tables 1 and 2):
A four-variable screening tool was developed in high risk pregnancy during the first trimester using an integer-based score from logistic regression (coefficient-based); it yielded good sensitivity (86%) and specificity (74%), better than the Berlin questionnaire and the Epworth sleepiness scale. The model needs further external validation in general and late pregnancy for generalizability.
Two combined prediction models incorporating the Sleep Apnea Symptom Score (SASS) were constructed using the same cohort of healthy pregnant women. Model I had sensitivity and specificity of 77% and 74% for the first trimester, and 77% and 78% for the third trimester. Model II had sensitivity and specificity of 77% and 72% for the first trimester, and 82% and 78% for the third trimester.

Risk of Bias Assessment
Results of PROBAST are shown in Figure 2 and Table 3. The overall ROB was high, owing to the outcome and analysis domains. For the outcome domain, 1 study had workup and spectrum biases because not all participants recruited during the second trimester were included: only a small subset of participants at either end of OSA risk (high vs. low) were followed and underwent PSG during the third trimester [16]. Although PSG is the gold standard diagnostic test for OSA, only 3 studies performed PSG. Three other studies used HSAT with different criteria for hypopneas [15,17,18]. One study used Watch-PAT, a wrist-worn device using peripheral arterial tonometry (PAT), finger plethysmography, and pulse oximetry [25]. Watch-PAT has been validated and has shown good accuracy for the diagnosis of OSA during pregnancy [25]. For the analysis domain, five studies did not have a reasonable number of participants with the outcome, as determined by events per variable (EPV) ≥ 10 for each prediction model (Table 1) [23,24]. All studies selected predictors based on univariate analyses [15][16][17][18][19][20], and only one study properly calibrated model performance by accounting for overfitting, underfitting, and optimism in the model development [17].
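The EPV criterion mentioned above is a simple ratio, sketched below in Python. The function names are hypothetical, and the sketch assumes that "events" means the number of participants with the outcome (OSA) and "variables" means the number of candidate predictors considered for the model.

```python
def events_per_variable(n_events: int, n_candidate_predictors: int) -> float:
    """Events per variable: outcome events divided by candidate predictors."""
    if n_candidate_predictors <= 0:
        raise ValueError("number of candidate predictors must be positive")
    if n_events < 0:
        raise ValueError("number of events cannot be negative")
    return n_events / n_candidate_predictors


def meets_epv_threshold(n_events: int, n_candidate_predictors: int,
                        threshold: float = 10.0) -> bool:
    """Check the EPV >= 10 rule of thumb referred to in the analysis domain."""
    return events_per_variable(n_events, n_candidate_predictors) >= threshold
```

With purely illustrative numbers, a model built on 40 OSA cases and 8 candidate predictors has an EPV of 5 and would not meet the threshold.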

Meta-Analysis of Prediction Models
Among 6 studies, there were 29 prediction models, which were classified according to trimesters, development vs. validation phases, and type of pregnancy (high vs. general). A meta-analysis was applied to pool C-statistics of each stratum if there were at least two models.

Development Models Trimester 1
Four studies [15,17,19,20] developed seven prediction models of OSA based on general pregnancy (six models) [17,19,20] and high risk pregnancy (one model) [15]. The overall pooled C-statistics (95% CI) of prediction models with and without high risk pregnancy were 0.817 (0.783, 0.850; I2 = 97.81) and 0.811 (0.768, 0.853; I2 = 97.58), respectively (see Table 4). Among these models, age and BMI as continuous variables were the common predictors, with discrimination C-statistics ranging from 0.772 to 0.800 [19,20]. Model performance improved markedly when frequent snoring, chronic hypertension, and tongue enlargement were included in the models, with discrimination C-statistics of >0.80 [15,17,19], whereas models with the Sleep Apnea Symptom Score (SASS) or bedpartner-reported information did not drastically improve the C-statistics [20].
Abbreviations: SASS, Sleep Apnea Symptom Score; HT, hypertension; BMI, body mass index; CI, confidence interval. * High risk is defined in the study as those with chronic hypertension (diagnosed prior to pregnancy), pre-gestational diabetes (type 1 or 2), obesity (pre-pregnancy BMI ≥ 30 kg/m2), and/or a prior history of preeclampsia. δ Frequent snoring is defined as self-reported snoring ≥ 3 times/week. † Age and BMI used as continuous variables. § Transformed BMI is defined as (BMI^λ − 1)/g, where g = (geometric mean BMI)^(λ − 1), as a continuous variable. Tongue enlargement is defined as the tongue protruding beyond the teeth or the alveolar ridge in the resting position. γ Bed partner reports were obtained, if participants had a bedpartner, by questions asking the frequency of loud snoring and long pauses between breaths while asleep during the past month.

Trimester 2-3
Four studies constructed 9 prediction models in the second-third trimesters, with an overall pooled C-statistic (95% CI) of 0.855 (0.822, 0.887) and I2 = 98.06 (see Table 4) [16,17,19,20]. Age and BMI as continuous variables were common predictors in all models, with C-statistics ranging from 0.810 to 0.831 [19,20]. Wilson's model yielded the highest discrimination C-statistic; however, it had high ROB from the workup and spectrum biases mentioned above, it used second trimester data to predict third trimester OSA, and BMI was handled as a categorical variable (BMI ≥ 32 kg/m2) [16].

Discussion
We conducted a systematic review and meta-analysis of the performance of prediction models for OSA during pregnancy. There were 6 studies with 29 eligible prediction models involving 3713 pregnant women [15][16][17][18][19][20]. Our findings indicated that the existing prediction models showed good discrimination, with pooled C-statistics of 0.817 (0.783, 0.850) and 0.855 (0.822, 0.887) for first and second-third trimester OSA, respectively. Two models, i.e., the MVAP and Facco models, were externally validated and yielded fair discrimination performance.
From the model development viewpoint, all included prediction models demonstrated a C-statistic > 0.70, which is considered the threshold for good performance [24,27]. All prediction models included the common predictors of age and BMI as continuous variables, except for the Wilson model [16]. When looking at the performance of age and BMI alone as a prediction model in the studies of Izci-Balserak et al., the C-statistics were 0.772-0.800 and 0.831-0.851 for the first and third trimesters, respectively [19,20]. This indicates that BMI and age as continuous variables strongly predicted OSA during pregnancy, as described by previous studies [19,28]. In addition, the discriminative performance of the models that included BMI and age was consistent across all trimesters, including the new-onset OSA model [17]. This may indicate that BMI and age are predisposing risk factors for gestational OSA, which can be precipitated by other physiological changes as pregnancy progresses. Adding predictors such as frequent snoring reports and tongue enlargement to the BMI and age models, as the Facco, Louis, and BATE models did, could increase the discrimination C-statistics by about 6.25-8.75%, whereas adding a symptom score (SASS) and bed partner-reported information did not markedly improve the performance of the models [15,17,19,20]. In data from the Izci-Balserak studies, SASS, a symptomatology-based score, used alone showed lower predictive performance, with C-statistics of 0.72 (0.58, 0.86) and 0.57 (0.43, 0.71) for the first and third trimesters [20]. This is probably because women are less likely to report other symptoms of OSA (apnea and gasping/choking) [19,21]. However, snoring per se may still prove to be the cardinal symptom of OSA during pregnancy, as it demonstrated significant coefficients by itself (i.e., 1.5 in the Facco model; 2.4 in the Wilson model) [15,16] and was included in many OSA prediction models [15][16][17].
The prevalence of snoring increased significantly during pregnancy, particularly in those with preeclampsia, due to narrowing of the upper airway, and snoring was associated with increased adverse maternal and fetal outcomes [29][30][31][32][33][34]. New-onset snoring during pregnancy has also shown an association with adverse pregnancy outcomes [31,35]. However, only the Louis prediction model demonstrated performance for new-onset OSA [17]. Of note, none of the prediction models included excessive daytime sleepiness (EDS). As shown in studies, EDS did not discriminate between pregnant women with and without OSA [15,18,36], likely due to the hypersomnolence and sleep disruption of pregnancy itself [21,37]. Other studies have shown that EDS (Epworth sleepiness scale (ESS) > 10) was not associated with snoring or gestational hypertension/diabetes [21,29,37]. However, a high level of EDS (ESS > 16) was associated with gestational diabetes and other symptoms of OSA (loud snoring, gasping, choking/apnea) [37,38].
Although these prediction models showed good performance in discriminating OSA in pregnancy, they had high ROB for many reasons. The majority lacked calibration and internal validation and had low EPV for model construction [15,16,19,20]. Some had workup and spectrum biases and used predictors from the first/second trimesters to predict third trimester OSA [16,19]. Although the Facco prediction model yielded lower performance in external settings than in the original development setting [15] (C-statistics of 0.784 vs. 0.850), it showed good performance for first and third trimester OSA and for both general and high risk pregnancy [18,19].
As for the MVAP model, the external validation performance was good during the first and third trimesters for general pregnancy [16,19]. There was no validation study for high risk pregnancy.
All meta-analyses showed high heterogeneity. Sources of heterogeneity were therefore explored by performing sensitivity analyses, but no explanation was found. Prediction models constructed based on OSA diagnosed by both PSG and HSAT still showed high discriminative performance but with high heterogeneity. This may be due to numerous variations between studies (i.e., differences in the clinical setting of the study, type of participants, ethnicity, trimester of screening, diagnostic methods and criteria, prevalence of OSA, and the complexity of the statistical analysis used to construct each prediction model). However, the number of prediction model studies was limited and insufficient for further sensitivity analysis. Although many prediction models have been constructed, only a few were externally validated, so the generalizability of these models is still questionable. This indicates the need for more prediction model studies. Moreover, the findings for the few externally validated models were also based on high heterogeneity across studies. Caution must be taken when applying prediction models in clinical settings that differ from that of the original model.

Conclusions
Evidence from a systematic review and meta-analysis of the performance of prediction models for OSA during pregnancy showed good performance during the 1st and 2nd-3rd trimesters. BMI, age, and snoring were the most common predictors among the strong prediction models. The combination of BMI and age, included in the models as continuous variables, showed consistently good results and may be more appropriate across ethnicities. Despite the many prediction models developed, only the Facco and MVAP models were externally validated. Furthermore, there was a high level of heterogeneity, indicating limited generalizability and the need for more studies on both development and validation in different clinical settings.

Acknowledgments:
We would like to express our gratitude to Professor Ammarin Thakkinstian for initiating the project. We thank Associate Professor Sasivimon Rattanasiri for her support on systematic review and meta-analysis methodology.

Appendix A
OR "risk prediction") OR "risk prognostic"))) OR (("clinical decision model") OR "clinical decision rule"))) AND

Appendix B
[...] and tiredness upon awakening in predicting third trimester OSA, with an optimized cut-off threshold at 0.30.
BATE models are prediction models for each trimester (first and third trimesters), plus a first trimester model predicting third trimester OSA. The multivariate logistic regression model consisted of age, BMI, and tongue enlargement. Tongue enlargement is defined as the tongue protruding beyond the teeth or the alveolar ridge in the resting position. The simplified model for the first trimester is score = BMI + age + (15 × tongue enlargement), with a cutoff of 65. The same model and cutoff were used in the first trimester for predicting third trimester OSA. The simplified model for the third trimester is score = BMI + age + (20 × tongue enlargement), with a cutoff of 75.
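The simplified BATE scores described above reduce to simple arithmetic, sketched here in Python. The function names are hypothetical, and several details are assumptions not stated in the text: BMI is taken in kg/m2 and age in years, tongue enlargement is coded 1 if present and 0 if absent, and a score at or above the cutoff is treated as a positive screen.

```python
# Tongue-enlargement weights and cutoffs by trimester, as given in the text.
TONGUE_WEIGHT = {1: 15, 3: 20}
CUTOFF = {1: 65, 3: 75}


def bate_simplified_score(bmi: float, age: float,
                          tongue_enlargement: bool, trimester: int) -> float:
    """Simplified BATE score: BMI + age + weight * tongue enlargement (0/1)."""
    if trimester not in TONGUE_WEIGHT:
        raise ValueError("trimester must be 1 or 3")
    return bmi + age + TONGUE_WEIGHT[trimester] * int(tongue_enlargement)


def bate_screen_positive(bmi: float, age: float,
                         tongue_enlargement: bool, trimester: int) -> bool:
    """Assumption: a score at or above the trimester cutoff counts as positive."""
    score = bate_simplified_score(bmi, age, tongue_enlargement, trimester)
    return score >= CUTOFF[trimester]
```

For instance, a hypothetical 28-year-old woman with a BMI of 30 kg/m2 and tongue enlargement would score 30 + 28 + 15 = 73 in the first trimester, above the cutoff of 65.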
Izci-Balserak model I is a combined prediction model for each trimester (first and third trimesters) consisting of the sleep apnea symptom frequency index (SASS), one of the subscales of MVAP, with BMI and age.
Izci-Balserak model II is a combined prediction model for each trimester (first and third trimesters) consisting of SASS, a subscale of MVAP, with BMI, age, and bedpartner-reported information. Bedpartner-reported information was obtained from two questions of the Pittsburgh Sleep Quality Index (PSQI). These questions are "ask roommate or bed partner how often in the past month you have had (a) loud snoring and (b) long pauses between breaths while asleep." Available answers are "not during the past month, less than once a week, once or twice a week, three or more times a week".
General prediction model for OSA
MVAP is a model developed for predicting OSA using the self-reported symptoms of loud snoring, snorting or gasping, and breathing cessations. The frequency of each of the three symptoms over the past month was scored as follows: 0 = never, 1 = less than once a week, 2 = once or twice per week, 3 = three to four times per week, and 4 = five to seven times per week. A SASS score was calculated as the mean of the three apnea items and ranged from 0 to 4. The MAP index score was developed using multiple logistic regression and incorporates the SASS score along with age, BMI, and gender (female), with scores ranging from 0 to 1.
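As an illustration of the SASS calculation described above, the following Python sketch maps the three symptom frequencies to item scores and averages them. The function name and the exact response strings are hypothetical. The MAP index itself is not computed, because its regression coefficients are not given here.

```python
# Item scoring for symptom frequency, as listed in the text.
FREQUENCY_SCORES = {
    "never": 0,
    "less than once a week": 1,
    "once or twice per week": 2,
    "three to four times per week": 3,
    "five to seven times per week": 4,
}


def sass_score(loud_snoring: str, snorting_or_gasping: str,
               breathing_cessation: str) -> float:
    """Mean of the three apnea symptom item scores; ranges from 0 to 4."""
    answers = (loud_snoring, snorting_or_gasping, breathing_cessation)
    try:
        items = [FREQUENCY_SCORES[a] for a in answers]
    except KeyError as exc:
        raise ValueError(f"unknown frequency response: {exc}") from None
    return sum(items) / len(items)
```

A respondent reporting no loud snoring, snorting or gasping once or twice per week, and breathing cessations five to seven times per week would have item scores of 0, 2, and 4, giving a SASS of 2.0.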