Risk Scoring Systems for Preterm Birth and Their Performance: A Systematic Review

Introduction: Nowadays, the risk stratification and prediction of preterm birth (PTB) remain a challenge. Many risk factors associated with PTB have been identified, and risk scoring systems (RSSs) have been developed to address this challenge. The objectives of this systematic review were to identify RSSs for PTB, the variables they comprise, and their performance. Materials and methods: Two databases were searched, and two authors independently performed the screening and eligibility phases. Records studying an RSS based on specified variables, with an evaluation of its predictive value for PTB, were considered eligible. The reference lists of eligible studies and review articles were also searched. Data from the included studies were extracted. Results: A total of 56 studies were included in this review. The variables most frequently incorporated in the RSSs included in this review were maternal age, weight, history of smoking, history of previous PTB, and cervical length. The performance measures varied widely among the studies, with sensitivity ranging between 4.2% and 92.0% and area under the curve (AUC) between 0.59 and 0.95. Conclusions: Despite recent technological and scientific advances, including a better understanding of the variables related to PTB and the definition of new ultrasonographic parameters and biomarkers associated with PTB, the ability of RSSs to predict PTB remains poor in most situations, which hinders the integration of any single RSS into clinical practice. The development of new RSSs, the identification of new variables associated with PTB, and the creation of a large reference dataset might be a step forward in tackling the problem of PTB.


Introduction
According to the World Health Organization (WHO), a birth that occurs before 37 completed weeks of pregnancy is defined as a preterm birth (PTB). Based on gestational age, preterm births can be categorized as extremely preterm (less than 28 weeks), very preterm (28 to less than 32 weeks), or moderate to late preterm (32 to less than 37 weeks). Moreover, PTB can be classified as "spontaneous" (spontaneous onset of labor or following preterm premature rupture of membranes (PPROM)) or "indicated" (parturition initiated by the caregivers through induction of labor or elective cesarean section for maternal or fetal indications or for other non-medical reasons) [1,2]. An estimated 15 million babies are born preterm each year, meaning that more than 1 in 10 babies are born too early [1].
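The WHO gestational-age categories amount to a simple threshold rule. The sketch below assumes that gestational age is available as an integer number of completed weeks (so 36 weeks and 6 days is passed as 36); the function name is illustrative and is not taken from any of the included studies.

```python
def classify_preterm(completed_weeks: int) -> str:
    """Return the WHO gestational-age category for a birth.

    Assumes gestational age is given in completed weeks (an integer).
    """
    if completed_weeks < 0:
        raise ValueError("gestational age cannot be negative")
    if completed_weeks < 28:
        return "extremely preterm"
    if completed_weeks < 32:
        return "very preterm"
    if completed_weeks < 37:
        return "moderate to late preterm"
    return "not preterm"  # 37 completed weeks or more


for weeks in (25, 30, 36, 37):
    print(weeks, classify_preterm(weeks))
```

The spontaneous versus indicated distinction is independent of these categories and would be recorded as a separate attribute of each birth.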
PTB and its complications account for approximately 1 million child deaths each year, making PTB the leading cause of death in children under 5 years of age [3].

Materials and Methods
Studies were considered eligible if they studied or developed an RSS based on specific variables to predict PTB and help stratify PTB risk, irrespective of the gestational age threshold used to define PTB. Studies that did not study or develop an RSS (univariable or multivariable), did not report the variable(s) used in the RSS, or did not analyze the predictive value of the RSS were excluded. No restrictions were imposed concerning the studies' participants and their pregnancy characteristics. Observational studies, for example, cohort, case-control, and cross-sectional studies, were included. Editorials, clinical case reports, literature reviews, and incomplete publications (e.g., abstracts only) were excluded, as were non-English publications and non-human studies.
Two databases were searched for studies: PubMed and Web of Science. We searched from inception to 12 November 2022, the date on which we ran the final search; no other time restrictions were imposed. We screened the databases using the following search query: [("preterm birth" OR "preterm delivery" OR "preterm labor" OR "preterm labour" OR "premature birth" OR "premature delivery" OR "premature labour" OR "premature labor") AND ("risk" OR "risk factors") AND ("scoring systems" OR "score*" OR "scoring algorithm") AND ("validity" OR "validation" OR "assessment" OR "evaluation")]. This query was built by combining keywords considered pertinent to this review; keywords that added no results, such as "points system", were excluded from the query. The search syntax was adapted for each database. No filters were applied to the searches.
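The query follows the usual Boolean pattern for systematic searches: synonyms for one concept are joined with OR, and the concept groups are joined with AND. The sketch below rebuilds the generic form of the query from four keyword lists; it does not model the database-specific syntax adaptations mentioned above, and the function names are hypothetical. A keyword that adds no results, such as "points system", is simply left out of its list.

```python
def or_group(terms):
    """Join the synonyms of one concept with OR, quoting each term."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(concepts):
    """Join the concept groups with AND, one parenthesised OR-group per concept."""
    return "[" + " AND ".join(or_group(group) for group in concepts) + "]"


CONCEPTS = [
    ["preterm birth", "preterm delivery", "preterm labor", "preterm labour",
     "premature birth", "premature delivery", "premature labour", "premature labor"],
    ["risk", "risk factors"],
    ["scoring systems", "score*", "scoring algorithm"],
    ["validity", "validation", "assessment", "evaluation"],
]

query = build_query(CONCEPTS)
print(query)
```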
The records retrieved from the PubMed and Web of Science searches were exported to a reference manager (EndNote version 20), where duplicates were removed. Two authors independently screened the titles and abstracts of the remaining records. Records not meeting the inclusion criteria or not consistent with the purpose of this review were removed. Divergences between investigators were resolved by consensus.
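The duplicate-detection rules of the reference manager are not described here, so the following is not a reproduction of them. It is a minimal sketch of the general idea, assuming each record is a dict with hypothetical `doi` and `title` keys: records are matched on the DOI when present, otherwise on a normalized title, and the first occurrence is kept.

```python
import re


def dedup_key(record):
    """Hypothetical matching key: the DOI if present, otherwise a normalized title."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    # Lowercase and keep only letters and digits so that case and punctuation differences are ignored
    title = re.sub(r"[^a-z0-9]", "", (record.get("title") or "").lower())
    return ("title", title)


def remove_duplicates(records):
    """Keep the first occurrence of each key, preserving the input order."""
    seen = set()
    unique = []
    for record in records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical exports from the two databases
pubmed = [{"doi": "10.1000/abc", "title": "A risk score for preterm birth"},
          {"doi": "", "title": "Cervical length and PTB"}]
wos = [{"doi": "10.1000/ABC", "title": "A Risk Score for Preterm Birth."},
       {"doi": "", "title": "Cervical Length and PTB"}]
unique = remove_duplicates(pubmed + wos)
print(len(pubmed + wos) - len(unique), "duplicates removed")
```

A record that has a DOI in one database but not in the other would escape this simple key, which is one reason automated deduplication is usually followed by a manual check.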
After the screening phase, the same two authors independently assessed eligibility by reading the full-text articles and applying the predefined inclusion and exclusion criteria. Divergences were also resolved by consensus. The reasons for excluding studies in both the screening and eligibility phases were recorded.
Additionally, the reference lists of the eligible studies, as well as related review articles, were manually searched for studies missed by the database search.
A structured data extraction form was developed to extract the data from the eligible studies. The form was piloted on some of the included studies and refined accordingly. Two authors independently extracted the data using this form, and disagreements were discussed and resolved by consensus. The following variables were collected: (1) study characteristics (year, country, study design, and sample size); (2) participant characteristics (inclusion and exclusion criteria); (3) outcome measure (gestational age criterion used to define PTB and PTB type); and (4) scoring system characteristics (risk factors and variables considered; model used; model outcomes and output; performance analysis, reported as the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV); evaluation method; and gestational age at RSS testing). Ambiguous or missing information was recorded as "not reported".
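The extraction form is essentially a fixed record type with one field per collected variable. The sketch below mirrors groups (1) to (4) as a Python dataclass; the class and field names are hypothetical, and any field left empty is rendered as "not reported", following the convention described above.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional

NOT_REPORTED = "not reported"


@dataclass
class ExtractionRecord:
    """One row of a hypothetical extraction form mirroring groups (1) to (4)."""
    study_id: str
    # (1) study characteristics
    year: Optional[int] = None
    country: Optional[str] = None
    design: Optional[str] = None            # e.g. "cohort", "case-control"
    sample_size: Optional[int] = None
    # (2) participant characteristics
    inclusion_criteria: Optional[str] = None
    exclusion_criteria: Optional[str] = None
    # (3) outcome measure
    ptb_cutoff_weeks: Optional[int] = None  # e.g. 37, 34 or 32
    ptb_type: Optional[str] = None          # "spontaneous", "indicated" or "both"
    # (4) scoring system characteristics
    variables: List[str] = field(default_factory=list)
    model: Optional[str] = None
    output: Optional[str] = None
    auc: Optional[float] = None
    sensitivity: Optional[float] = None
    specificity: Optional[float] = None
    ppv: Optional[float] = None
    npv: Optional[float] = None
    evaluation_method: Optional[str] = None
    ga_at_testing: Optional[str] = None

    def as_row(self) -> dict:
        """Return a flat dict in which every missing field reads 'not reported'."""
        row = asdict(self)
        return {k: (NOT_REPORTED if v is None or v == [] else v) for k, v in row.items()}


rec = ExtractionRecord(study_id="S01", year=2010, design="cohort",
                       ptb_cutoff_weeks=37, ptb_type="spontaneous",
                       variables=["maternal age", "cervical length"], auc=0.80)
row = rec.as_row()
print(row["auc"], "|", row["sensitivity"])
```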
Methodological quality and risk of bias were assessed using the National Institutes of Health (NIH) study quality assessment tool. This tool consists of a set of questions that evaluate a study's internal validity and risk of bias, allowing each study to be classified into one of three predefined categories: poor, fair, or good quality. Different criteria were applied depending on the study design. Case-control and cohort studies were both assessed with regard to the research question, study population, target population, sample size, and statistical analysis. For case-control studies, the inclusion/exclusion criteria, case and control definitions, selection of study participants, and exposure measurement were also assessed. For cohort studies, the timeframe to observe an effect, the levels of exposure, and the measurement and assessment of exposure and outcome were also assessed. Intervention studies were assessed based on randomization, treatment allocation, blinding, similarity of groups at baseline, dropout, adherence, outcome measures, power calculation, and intention-to-treat analysis. Two authors independently rated the methodological quality of each study as good, fair, or poor, and disagreements were resolved by consensus.
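The dual independent rating step can be sketched as a comparison of two rating tables: studies with differing ratings are flagged for a consensus discussion, and the final ratings are then tallied. The study IDs and ratings below are hypothetical; the NIH tool itself relies on reviewer judgment rather than on any numeric rule, so the code does not attempt to derive a rating from the criteria.

```python
RATINGS = {"good", "fair", "poor"}


def check(ratings):
    """Reject any rating outside the three predefined categories."""
    unexpected = set(ratings.values()) - RATINGS
    if unexpected:
        raise ValueError(f"unexpected rating(s): {sorted(unexpected)}")


def disagreements(rater_a, rater_b):
    """Study IDs whose two independent ratings differ and need a consensus discussion."""
    check(rater_a)
    check(rater_b)
    if rater_a.keys() != rater_b.keys():
        raise ValueError("both raters must rate the same set of studies")
    return sorted(s for s in rater_a if rater_a[s] != rater_b[s])


def tally(final_ratings):
    """Count final (post-consensus) ratings per category."""
    counts = dict.fromkeys(("good", "fair", "poor"), 0)
    for rating in final_ratings.values():
        counts[rating] += 1
    return counts


# Hypothetical ratings for four studies
rater_a = {"S01": "good", "S02": "fair", "S03": "fair", "S04": "poor"}
rater_b = {"S01": "good", "S02": "good", "S03": "fair", "S04": "poor"}
to_discuss = disagreements(rater_a, rater_b)
agreement = 1 - len(to_discuss) / len(rater_a)   # simple percent agreement
final = {**rater_a, "S02": "fair"}                # hypothetical consensus outcome for S02
print(to_discuss, agreement, tally(final))
```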

Results
The final search of the two databases retrieved 1226 records: 654 from PubMed and 572 from Web of Science. These records were exported to a reference manager, where 320 duplicates were removed. Of the 906 records screened by title and abstract, 829 did not proceed to the eligibility phase. The vast majority of these exclusions were due to not studying PTB (n = 409) or not studying/developing an RSS (n = 402). Other reasons for exclusion at this stage were the study type (literature reviews, n = 12; study protocols, n = 2), incomplete publications (n = 3), and non-English records (n = 1). Of the 77 articles assessed for eligibility, 34 were excluded because they did not study PTB (n = 5), did not study/develop an RSS (n = 18), did not analyze the predictive power of the RSS for PTB (n = 6), or were unavailable in full text (n = 5); the unavailable papers could not be obtained from the websites of the respective journals. The search of literature reviews and of the reference lists of included articles identified 16 additional relevant records, of which 3 were not eligible because they did not study/develop an RSS (n = 1) or did not analyze the predictive value of the RSS for PTB (n = 2). Therefore, 56 studies were included in this review in total. Figure 1 shows the PRISMA flow diagram of the study selection process.

Regarding study design, the included studies were mainly observational cohort or case-control studies: 46 cohort studies, nine case-control studies, and only one randomized controlled trial. In the methodological quality and risk of bias assessment, 28 studies were rated as good, 27 as fair, and one as poor.
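The counts reported in the study selection flow and in the design and quality breakdowns can be cross-checked with a few lines of arithmetic. The sketch below only re-adds numbers stated in the text; the variable names are illustrative.

```python
# Counts exactly as reported in the text
identified = {"PubMed": 654, "Web of Science": 572}
duplicates = 320
excluded_at_screening = {
    "not PTB": 409, "no RSS": 402, "literature review": 12,
    "study protocol": 2, "incomplete publication": 3, "non-English": 1,
}
excluded_at_eligibility = {
    "not PTB": 5, "no RSS": 18, "no predictive-value analysis": 6,
    "full text unavailable": 5,
}
manual_identified = 16
manual_excluded = {"no RSS": 1, "no predictive-value analysis": 2}

total = sum(identified.values())                                   # records retrieved
screened = total - duplicates                                      # title/abstract screening
assessed = screened - sum(excluded_at_screening.values())          # full-text eligibility
from_databases = assessed - sum(excluded_at_eligibility.values())  # included via databases
included = from_databases + manual_identified - sum(manual_excluded.values())

print(total, screened, assessed, from_databases, included)  # 1226 906 77 43 56

# Each breakdown of the included studies should add up to the same total
design = {"cohort": 46, "case-control": 9, "randomized controlled trial": 1}
quality = {"good": 28, "fair": 27, "poor": 1}
period = {"up to 2000": 14, "21st century": 42}
for name, breakdown in (("design", design), ("quality", quality), ("period", period)):
    assert sum(breakdown.values()) == included, name
```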
The most frequent reasons for studies not being rated as good were the lack of adjustment for confounding variables, failure to define the gestational age at the time of testing, and a poorly defined study population.
The 56 included studies were conducted in 20 different countries across 5 continents. The countries with the most studies meeting the eligibility criteria of this systematic review were the United States of America (USA, n = 20), the United Kingdom (UK, n = 6), and France (n = 5). Europe accounted for 25 of the included studies, from the UK, France, Germany, Italy, Sweden, Croatia, the Netherlands, Spain, Belgium, and Poland. Asia, South America, Oceania, and Africa were the least represented regions, with four, two, two, and one studies, respectively.
With respect to the temporal analysis, only 14 studies were published up to the year 2000, and among those, only three addressed an RSS that included laboratory variables in addition to variables obtained from the clinical history and physical examination. A total of 42 included studies were published in the 21st century, 16 of which were published in the last 5 years.
The study and participant characteristics extracted from each of the 56 included studies, as well as the results of the risk of bias assessment, are outlined in Supplementary Table S1, which contains the following subset of the characteristics defined in the data extraction form: year, country, study design, sample size, inclusion and exclusion criteria, and quality and risk of bias rating.
The variables most frequently incorporated in the RSSs included in this review were maternal age, weight/BMI, history of smoking, history of previous PTB, and cervical length. Figure 2 groups the variables into categories and shows their distribution across the decades. Table 1 lists the variables integrated in the RSSs addressed in the included studies and shows, by decade and in total, the number of studies whose RSS included each variable.
Table 1 (excerpt). Current pregnancy characteristics: qualitative glandular cervical score; quantitative fetal fibronectin assay; serum biomarkers (corticotropin-releasing hormone concentration).

Not all included studies considered the same gestational age at PTB as the outcome. Prediction of PTB before 37 completed weeks of pregnancy was the most frequent outcome, defined as such in approximately 69% of the included studies. While some studies defined a single outcome, others did not restrict the outcome to PTB before 37 completed weeks and studied the prediction of PTB at other gestational ages (GA), such as before 34 and before 32 completed weeks of pregnancy (very PTB). Concerning the classification of PTB as spontaneous or indicated, the included studies did not all address the same type of PTB: 53% focused exclusively on spontaneous PTB, whereas 22% focused on both spontaneous and medically indicated PTB.
The RSSs addressed in the studies were built using different methods: univariate analysis (simple cutoff), multivariate models (linear and logistic regression), and, less frequently, machine learning models (artificial neural networks). Across all included studies and the performance measures they reported, sensitivity ranged from 4.2% to 92.0%, specificity from 41.5% to 99.3%, PPV from 5.9% to 91.0%, NPV from 69.2% to 100%, and AUC from 0.59 to 0.95. Table 2 summarizes, for each eligible study, the outcome measure (PTB gestational age criterion and PTB type) and the scoring system characteristics: the model used to build the RSS, its output and outcome, the risk factors and variables used, the performance analysis (AUC, sensitivity, specificity, PPV, and NPV), the evaluation method, and the gestational age at RSS testing.

Table 2 (excerpt, one row). Model: logistic regression. Output: high risk for probability > 0.5. Variables: number of fetuses, maternal age, gravidity, parity, maternal height, maternal weight, BMI, gestational age at admission, duration of ruptured membranes, method of conception, smoking history, alcohol use, drug use, history of cesarean section, maternal race, and admission indications.
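The performance measures summarized above all derive from comparing RSS classifications (or scores) with observed PTB. The sketch below computes sensitivity, specificity, PPV, and NPV from a 2x2 table and the AUC through its Mann-Whitney formulation (the probability that a randomly chosen PTB case scores higher than a randomly chosen term birth). All counts and scores are hypothetical and are not taken from any included study.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from a 2x2 table.

    "Positive" means the RSS classifies the pregnancy as high risk;
    tp/fn are PTB cases, fp/tn are births at term.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def auc(ptb_scores, term_scores):
    """Area under the empirical ROC curve via the Mann-Whitney formulation.

    Counts the pairs (PTB case, term birth) in which the case scores higher,
    with ties counted as one half.
    """
    concordant = 0.0
    for case in ptb_scores:
        for control in term_scores:
            if case > control:
                concordant += 1.0
            elif case == control:
                concordant += 0.5
    return concordant / (len(ptb_scores) * len(term_scores))


# Hypothetical cohort of 1000 pregnancies, 60 of which end preterm
metrics = diagnostic_metrics(tp=40, fp=90, fn=20, tn=850)
print({name: round(value, 3) for name, value in metrics.items()})
# sensitivity 0.667, specificity 0.904, PPV 0.308, NPV 0.977

# Hypothetical RSS scores for three PTB cases and four term births
print(round(auc([7, 5, 4], [1, 2, 4, 3]), 3))  # 0.958
```

Sensitivity and specificity depend only on the RSS and on which pregnancies are tested, whereas PPV and NPV also depend on how common PTB is in the sample, which is one reason these values are hard to compare across studies.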

Discussion
The present systematic review focused on the identification, characterization, and comparison of RSSs for screening the risk of PTB, with a focus on their performance. The characteristics extracted for each system included those related to study design and sample size; inclusion and exclusion criteria of participants; the predictors, model used, outputs, and performance; and the outcome measure and its applicability. To our knowledge, the only previously published systematic review addressing the performance of RSSs for PTB was published about 20 years ago [33].
An RSS is classically understood as a system including two or more predictors, whose assigned points are summed to produce a final score. However, we did not restrict our review to this interpretation and thus considered any system including one or more predictors, with different types of model output (including probabilities).
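The two kinds of output can be contrasted in a few lines: a classical additive RSS sums integer points for the risk factors present, whereas a regression-based system passes a weighted sum through the logistic function to obtain a probability, which is then compared with a cutoff (one included study, for instance, classified pregnancies as high risk for a probability above 0.5). Every point value, coefficient, and cutoff below is hypothetical and chosen only for illustration.

```python
import math

# Hypothetical point weights (not taken from any included study)
POINTS = {"previous_ptb": 3, "short_cervix": 3, "smoking": 1, "age_under_18": 1}

# Hypothetical logistic-regression intercept and coefficients (log-odds scale)
INTERCEPT = -2.5
COEFS = {"previous_ptb": 1.2, "short_cervix": 1.4, "smoking": 0.4, "age_under_18": 0.3}


def additive_score(patient):
    """Classical RSS: sum the points of the risk factors that are present."""
    return sum(pts for factor, pts in POINTS.items() if patient.get(factor, False))


def predicted_probability(patient):
    """Probability-type output: logistic transform of a linear predictor."""
    z = INTERCEPT + sum(b for factor, b in COEFS.items() if patient.get(factor, False))
    return 1 / (1 + math.exp(-z))


patient = {"previous_ptb": True, "smoking": True}
score = additive_score(patient)         # 3 + 1 = 4
prob = predicted_probability(patient)   # z = -2.5 + 1.2 + 0.4 = -0.9, so about 0.29
high_risk_points = score >= 4           # hypothetical points cutoff
high_risk_prob = prob > 0.5             # probability cutoff of 0.5
print(score, round(prob, 2), high_risk_points, high_risk_prob)
```

With these made-up values the same pregnancy is high risk under the points rule but not under the probability rule, which illustrates why the output type and the chosen cutoff have to be reported alongside any performance measure.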
According to the studies included in this review, the incorporation of laboratory test results into RSSs for PTB began around the 1990s. Since then, with the evolution of ultrasonography, the definition of new cervical ultrasonographic parameters, and the discovery of biomarkers associated with PTB, the combined use of medical and obstetric history, maternal and pregnancy characteristics, ultrasonographic evaluation, and PTB biomarkers to develop an accurate RSS capable of predicting PTB has gradually increased.
Currently, given the rapid technical development of ultrasonography, more accurate and reproducible ultrasound-based screening strategies can be used to predict PTB, and guidelines have been developed to provide recommendations and a consensus-based approach on this matter [91]. Although no biomarker capable of accurately predicting PTB has been found, many have been associated with it, and these biomarkers are thought to be more predictive of PTB when combined in a model than when used alone [27]. However, despite the undeniable progress in these areas and the development of RSSs combining all these components, their predictive value for PTB is still not as good as expected. The range of the RSS performance measures was wide, partly due to differences in the inclusion/exclusion criteria of the studies' participants and in the RSS evaluation frameworks. In particular, evaluating an RSS on the same whole sample used to develop it overestimates its performance compared with cross-validation or internal/external validation. Therefore, it is difficult to single out one RSS for its predictive power and integrate it into clinical practice.
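The optimism of whole-sample evaluation can be illustrated with a deliberately uninformative predictor. In the sketch below, a univariate cutoff RSS is tuned by maximizing the Youden index (sensitivity + specificity - 1), and its apparent performance on the development sample is compared with 5-fold cross-validated performance. The data are simulated noise with a hypothetical 1-in-3 PTB rate, and the Youden-based tuning is an illustrative choice rather than a method used by any particular included study.

```python
import random


def youden(preds, labels):
    """Youden index J = sensitivity + specificity - 1 for binary predictions."""
    tp = sum(p and y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    tn = sum((not p) and (not y) for p, y in zip(preds, labels))
    fp = sum(p and (not y) for p, y in zip(preds, labels))
    return tp / (tp + fn) + tn / (tn + fp) - 1


def fit_cutoff(x, y):
    """Choose the cutoff c (predict PTB if x >= c) that maximizes J on the given data."""
    best_c, best_j = None, -2.0
    for c in sorted(set(x)):
        j = youden([xi >= c for xi in x], y)
        if j > best_j:
            best_c, best_j = c, j
    return best_c


def apparent_and_cv(x, y, k=5):
    """Apparent J (tuned and evaluated on all data) versus k-fold cross-validated J."""
    n = len(x)
    c_all = fit_cutoff(x, y)
    apparent = youden([xi >= c_all for xi in x], y)
    cv_preds = [False] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        c = fit_cutoff([x[i] for i in train], [y[i] for i in train])
        for i in range(fold, n, k):          # held-out indices of this fold
            cv_preds[i] = x[i] >= c
    return apparent, youden(cv_preds, y)


rng = random.Random(42)
n = 300
y = [i % 3 == 0 for i in range(n)]        # hypothetical labels: 1 in 3 "PTB"
x = [rng.gauss(0, 1) for _ in range(n)]   # pure noise: carries no information about y
apparent_j, cv_j = apparent_and_cv(x, y)
print(f"apparent J = {apparent_j:.2f}, cross-validated J = {cv_j:.2f}")
```

Because the true Youden index of a noise predictor is zero, any clearly positive apparent value is optimism from tuning the cutoff on the evaluation data, whereas the cross-validated estimate tends to stay close to zero.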
Comparisons between RSSs should be made carefully. With respect to study design, the identified RSSs were tested in either cohort or case-control studies. The latter usually include a higher prevalence of PTB in the sample than in the population, which may, in some cases, have led to an overestimation of the RSS predictive value in terms of sensitivity and PPV. There were large differences in the inclusion and exclusion criteria between the reported studies; identifying the main groups of such criteria is advisable, such as singleton versus multiple pregnancies, the presence/absence of maternal and/or fetal pathologies, or obstetric history. The gestational age threshold used to define the PTB outcome and the gestational age at which the score was computed are also important factors that influence the performance comparison between RSSs. We did not report interventions during pregnancy, as they were quite heterogeneous, but they may clearly influence the risk of PTB over the course of pregnancy.
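The prevalence effect on PPV follows directly from Bayes' theorem. The sketch below holds a hypothetical RSS's sensitivity and specificity fixed at 0.80 and compares a 1:1 case-control sample (prevalence 0.5) with a population in which about 1 in 10 births is preterm (prevalence 0.1). It isolates the effect of prevalence on predictive values only and does not model any effect of study design on sensitivity itself.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv(sensitivity, specificity, prevalence):
    """Negative predictive value from Bayes' theorem."""
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


# Same hypothetical RSS (sensitivity 0.80, specificity 0.80) in two settings
case_control = ppv(0.80, 0.80, 0.50)   # 1:1 case-control sample
population = ppv(0.80, 0.80, 0.10)     # about 1 in 10 births preterm
print(f"PPV case-control = {case_control:.2f}, PPV population = {population:.2f}")
# PPV case-control = 0.80, PPV population = 0.31
```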
One limitation of this review is the inherent possibility, common to any systematic review, of missing relevant papers. However, in addition to the effort put into developing the search query, we also fully searched the reference lists of all papers in the eligibility phase, which increased the number of papers initially identified in the screening phase by 30%. Another limitation might be that we did not include RSSs that combined PTB with other outcomes. Nevertheless, in our opinion, such composite RSSs do not allow a clear identification of risk factors strictly associated with PTB and prevent the comparison of their performance for PTB.

Conclusions
This systematic review characterizes most of the published RSSs for assessing the risk of PTB, which is currently not only one of the major causes of burden in obstetric care but also associated with short- and long-term complications. This review suggests that the prediction of PTB and risk stratification through RSSs are poor; therefore, there is plenty of room for improvement in this field. Future studies should seek to develop new RSSs with good clinical applicability, based on variables strongly associated with PTB, and should also seek to identify new variables related to PTB, such as biomarkers or ultrasonographic parameters, that can be combined with other variables to build better-performing RSSs. To account for the heterogeneity in PTB etiology and to enable an effective comparison between the available systems, a large reference dataset built through the joint efforts of different centers worldwide would be a step forward in tackling the problem of PTB.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jcm12134360/s1, Table S1: Study and participant characteristics of each included study; quality and risk of bias ratings.
Author Contributions: A.F., H.G. and J.B. conceived and designed the study and developed the search strategy. A.F. and H.G. performed the screening and eligibility phases and extracted the data from all the included articles. The first draft was produced by A.F. and H.G. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data presented in this study are available in this article.