GRADE Use in Evidence Syntheses Published in High-Impact-Factor Gynecology and Obstetrics Journals: A Methodological Survey

Objective: To identify and describe the certainty of evidence of gynecology and obstetrics systematic reviews (SRs) using the Grading of Recommendations, Assessment, Development and Evaluation (GRADE) approach. Method: We searched for SRs using GRADE that were published between 1 January 2016 and 31 December 2020 in the 10 “gynecology and obstetrics” journals with the highest impact factor according to the Journal Citation Reports 2019. SRs were eligible if they used the GRADE approach to determine the certainty of evidence. Results: Out of 952 SRs, ninety-six SRs of randomized controlled trials (RCTs) and/or nonrandomized studies (NRSs) used GRADE. Sixty-seven SRs (7.04%) rated the certainty of evidence for specific outcomes. In total, we identified 946 certainty of evidence outcome ratings (n = 614 RCT ratings), which were very low (42.28%), low (28.44%), moderate (17.65%), or high (11.63%). High and very low certainty of evidence ratings accounted for 2.16% and 71.60% of ratings in the SRs of NRSs, respectively, compared with 16.78% and 26.55% in the SRs of RCTs. In the SRs of both RCTs and NRSs, certainty of evidence was mainly downgraded due to imprecision and risk of bias. Conclusions: More attention needs to be paid to strengthening GRADE acceptance and building knowledge of GRADE methods in gynecology and obstetrics evidence synthesis.


Introduction
Systematic reviews (SRs) are essential parts of evidence-based medicine and serve as the basis for clinical practice guidelines [1], which are widely used in the field of gynecology and obstetrics [2][3][4][5][6]. Recently, several publications have concentrated on the quality and credibility of systematic reviews as the number of such publications has increased [7,8].
The GRADE system is an emerging method for appraising studies and making recommendations for systematic reviews and guidelines [9,10]. As one of the most important methodological achievements of evidence-based medicine (EBM) in the last 30 years, it has been used by over 100 organizations to date [11]. Unlike other appraisal tools (e.g., the Jadad scale and the Newcastle-Ottawa Scale), GRADE separates quality of evidence from strength of recommendation, assesses the quality of evidence for each outcome, and allows observational studies to be "upgraded" if they meet certain criteria [12]. There are five distinct steps in the GRADE method [9,12].
Step 1: Assign an a priori ranking. Randomized controlled trials are assigned a high ranking, and observational studies a low ranking.
Step 2: 'Downgrade' or 'Upgrade' the initial ranking. There are five downgrading domains (risk of bias (RoB), inconsistency, indirectness, imprecision, and publication bias) and three upgrading domains (large consistent effect, dose response, and plausible confounding, which would reduce a demonstrated effect).
Step 3: Assign a final grade. On the basis of the upgrading and downgrading domains in step 2, the final evidence quality is rated "high", "moderate", "low", or "very low" [13]. All three steps above are repeated for each critical outcome.
Step 4: Consider factors affecting recommendation. In addition to evidence quality, recommendations must also take other factors into account, such as cost-effectiveness, patient preference, and balance of desirable and undesirable effects.
Step 5: Combine the above factors to give a final recommendation, strong or weak [14].
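Steps 1 to 3 can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not an official GRADE implementation: the function name `rate_certainty`, the integer encoding of the four levels, and the design labels "RCT" and "NRS" are hypothetical choices made here, and actual GRADE ratings rest on qualitative judgment rather than arithmetic alone.

```python
# Illustrative sketch only: GRADE ratings are expert judgments, not a formula.
LEVELS = ["very low", "low", "moderate", "high"]

DOWNGRADE_DOMAINS = {"risk of bias", "inconsistency", "indirectness",
                     "imprecision", "publication bias"}
UPGRADE_DOMAINS = {"large effect", "dose response", "plausible confounding"}


def rate_certainty(design, downgrades=None, upgrades=None):
    """Return the certainty level for a single outcome.

    design: "RCT" or "NRS" (hypothetical labels used in this sketch).
    downgrades, upgrades: dicts mapping a domain name to the number of
    levels moved for that domain (assumed to be 1 or 2).
    """
    downgrades = downgrades or {}
    upgrades = upgrades or {}
    unknown = (set(downgrades) - DOWNGRADE_DOMAINS) | (set(upgrades) - UPGRADE_DOMAINS)
    if unknown:
        raise ValueError(f"unknown domain(s): {sorted(unknown)}")

    # Step 1: a priori ranking by study design.
    if design == "RCT":
        score = LEVELS.index("high")
    elif design == "NRS":
        score = LEVELS.index("low")
    else:
        raise ValueError(f"unknown design: {design!r}")

    # Step 2: move the ranking down or up by the judged number of levels.
    score += sum(upgrades.values()) - sum(downgrades.values())

    # Step 3: the final grade cannot leave the four-level scale.
    score = max(0, min(score, len(LEVELS) - 1))
    return LEVELS[score]


# Example: an RCT outcome downgraded one level each for RoB and imprecision.
print(rate_certainty("RCT", {"risk of bias": 1, "imprecision": 1}))
```

In this sketch the function is called once per critical outcome, which mirrors the outcome-level rating that GRADE requires. Steps 4 and 5 are deliberately left out because they depend on contextual factors that are not reducible to a score.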
In general, GRADE has a strict procedure for evaluating evidence and considers other factors besides evidence, which makes it well suited to the medical field. Previous studies have explored the use of GRADE in the fields of nutrition, urology, and nephrology [15,16]. In these studies, the number of SRs using GRADE was limited, and the certainty of most evidence was quite low. Considering the merits of GRADE in the evaluation of evidence, there is a need to enhance the acceptance and use of GRADE in these fields. However, no study has evaluated the application of the GRADE approach in SRs published in gynecology and obstetrics journals. Therefore, it seems sensible to explore the current status of the GRADE approach in such SRs. Herein, we take the following two steps: (1) identify and describe all relevant SRs using the GRADE methodology to evaluate the outcome-specific certainty of evidence, published between 2016 and 2020 in the 10 gynecology and obstetrics journals with the highest impact factor according to the JCR 2019, and (2) summarize and present the GRADE-specific information, including the number of outcomes rated, the certainty of evidence ratings, the use of summary of findings tables, and down- and upgrading factors, while also taking the study design (SRs of RCTs vs. NRSs) into account.

Search Strategy
Systematic reviews (SRs) published between 1 January 2016 and 31 December 2020 in the 10 gynecology and obstetrics journals with the highest impact factor (range: 17.18-4.25) according to the JCR 2019 were identified through searches in PubMed (Appendix S1).

Inclusion and Exclusion Criteria
SRs were included if they met the following criteria: (1) SRs published between 1 January 2016 and 31 December 2020 in the top 10 journals according to the JCR 2019 category gynecology and obstetrics, and (2) SRs applying the GRADE approach to rate the certainty of evidence.
SRs were excluded if they met the following criteria: (1) SRs using a modified version of the GRADE approach, in which the authors adapted the tool to self-defined criteria to assess the quality of evidence, and (2) SRs failing to provide detailed GRADE evaluation processes and results.

Selection Process of Sources of Evidence
First, titles and abstracts were screened by two reviewers (Y-ZL, SZ.) to identify articles relevant to GRADE. Second, for all potentially relevant references, full-text publications were obtained and checked for final inclusion by two reviewers (H-JY, H-LX.) independently. Discrepancies were resolved through discussion with a third author (Q-J W.).

Data Extraction
For included SRs, one reviewer (H-JY.) extracted the data and an independent reviewer (H-LX.) cross-checked all data. The following data were extracted at SR level: year of publication, journal name, number of primary studies included, type of studies included (RCTs vs. NRSs [i.e., non-randomized intervention studies, case-control studies, cohort studies, and cross-sectional studies] vs. a combination of RCTs and NRSs), number of participants, description of intervention(s)/exposure(s), number and types of outcome(s) and comparison(s) rated, category of certainty of evidence ratings (high, moderate, low, or very low), meta-analysis conducted (yes vs. no), summary of findings table reported (yes vs. no), number of down- and/or upgrades (count of the respective downgrading/upgrading domain used at the outcome level), and reasons for down- and upgrading. For the quantitative presentation, all downgrading factors listed in the SRs of both RCTs and NRSs were extracted according to the study design, except in the case of two SRs, where a differentiation between study designs was not possible.
Finally, in this methodological survey, we excluded two types of SRs based on minimum criteria proposed by the GRADE working group: (1) SRs in which the authors rated the certainty of evidence of each individual study ("study level"); and (2) SRs in which the authors rated the certainty of the SR's overall body of evidence instead of rating the body of evidence for a given outcome ("outcome level") [15].
Figure S1 shows the flow diagram of the literature search. The database search retrieved 1033 documents. After the removal of duplicate records (n = 2), the exclusion of records at title and abstract screening (n = 79), and analysis of the remaining 952 full-text articles, 93 articles remained. Among them, 3 SRs using a modified version of GRADE [17][18][19] and 12 SRs failing to provide a detailed evaluation process [20][21][22][23][24][25][26][27][28][29][30][31] were excluded from this study. A further 11 SRs using GRADE did not rate the certainty of evidence for specific outcomes [32][33][34][35][36][37][38][39][40][41][42], but instead assessed the certainty of evidence of individual studies or the overall certainty of the body of evidence (Table S1). Finally, only 67 SRs (7.04% of the 952 SRs published) were included in this study.
Table 1 shows the distribution of these SRs according to journal and year. The number of SRs published in the top 10 gynecology and obstetrics journals decreased between 2016 and 2019 and increased in 2020, with the fewest SRs published in 2019 (n = 160) and the most in 2020 (n = 238). The proportion of SRs rating the certainty of evidence with GRADE was 6.12% in 2016 and 6.25% in 2017, compared with 9.2%, 6.25%, and 7.56% in 2018, 2019, and 2020, respectively. More than 80% of all included SRs were published in four journals (Ultrasound in Obstetrics & Gynecology, Human Reproduction Update, British Journal of Obstetrics and Gynecology, and American Journal of Obstetrics and Gynecology).

Certainty of Evidence Ratings
The median of the total number of outcomes rated in an SR was 5 (IQR: 3-8) (Table 5). Overall, there were 946 individual outcome ratings: 42.28% very low, 28.44% low, 17.65% moderate, and 11.63% high certainty of evidence. Among 614 outcomes in the SRs of RCTs, 26.55% were rated very low, 32.74% low, 23.94% moderate, and 16.78% high. Among 324 outcomes in the SRs of NRSs, 71.60% were rated very low, 20.37% low, 5.86% moderate, and 2.16% high. In the SRs of both NRSs and RCTs, a total of 8 outcomes were rated: 5 very low, 2 low, and 1 moderate certainty of evidence. In studies assessing interventions with drug therapy, there were 445 individual outcome ratings: 415 in the SRs of RCTs, 26 in the SRs of NRSs, and 4 in the SRs of both NRSs and RCTs. For these outcomes, the certainty of evidence was rated as very low (33.26%), low (23.37%), moderate (22.70%), and high (20.67%). Details can be seen in Table 6.
Table 2. Summary of the study characteristics and the number and type of (un)rated outcomes of systematic reviews of randomized controlled trials (n = 41) that rated the outcome-specific certainty of evidence with GRADE.
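The overall percentages reported above are weighted combinations of the stratum-level percentages, and this can be checked directly. The sketch below recomputes the pooled share of "very low" and "high" ratings from the stratum sizes and percentages given in the text; the variable and function names are assumptions made for illustration, and the stratum counts are back-calculated from rounded percentages rather than taken from the underlying data.

```python
# Stratum sizes and percentages as reported in the text (rounded values).
strata = {
    "RCT": {"n": 614, "very low": 26.55, "high": 16.78},
    "NRS": {"n": 324, "very low": 71.60, "high": 2.16},
}
# SRs of both RCTs and NRSs: 8 outcomes, reported as counts (none rated high).
mixed = {"n": 8, "very low": 5, "high": 0}


def pooled_percentage(level):
    """Pooled percentage of outcomes at `level` across all three strata."""
    # Back-calculate approximate counts from the stratum percentages.
    count = sum(s["n"] * s[level] / 100 for s in strata.values())
    count += mixed[level]
    total = sum(s["n"] for s in strata.values()) + mixed["n"]
    return round(100 * count / total, 2)


print(pooled_percentage("very low"))
print(pooled_percentage("high"))
```

Under these assumptions the pooled values agree with the overall figures of 42.28% and 11.63% given for the 946 ratings.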
We counted a mean of approximately 1.67 downgrades per outcome in the SRs of RCTs and 1.32 downgrades per outcome in the SRs of NRSs. Compared with the SRs of NRSs, the downgrading frequency (the number of downgrades divided by the number of rated outcomes) in the SRs of RCTs was higher for imprecision (70.68%), RoB (54.40%), indirectness (16.94%), and publication bias (8.96%), and lower for inconsistency (16.29%). In addition, 13.00% of the outcomes rated in the SRs of NRSs were upgraded: 7.10% for large effect, 5.56% for plausible confounding, and 0.31% for dose-response. Of note, no outcome was upgraded for unclear effect (Table 6). The reasons for the authors' choice to downgrade or upgrade the certainty of evidence for outcomes are summarized in Table S2.
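The relation between the domain-level downgrading frequencies and the mean number of downgrades per outcome can be made explicit. The sketch below is illustrative only: the function name `downgrading_frequency` is hypothetical, and the calculation assumes that each downgraded domain is counted once per outcome, which the text does not state explicitly.

```python
def downgrading_frequency(n_downgrades, n_rated_outcomes):
    """Downgrades in one domain as a percentage of rated outcomes."""
    return 100 * n_downgrades / n_rated_outcomes


# Downgrading frequencies (%) reported for the SRs of RCTs.
rct_frequencies = {
    "imprecision": 70.68,
    "risk of bias": 54.40,
    "indirectness": 16.94,
    "inconsistency": 16.29,
    "publication bias": 8.96,
}

# Summing the per-domain frequencies gives downgrades per 100 outcomes,
# so dividing by 100 yields the mean number of downgrades per outcome.
mean_downgrades = round(sum(rct_frequencies.values()) / 100, 2)
print(mean_downgrades)
```

With the reported RCT frequencies, the sum is 167.27 per 100 outcomes, which matches the stated mean of approximately 1.67 downgrades per outcome.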

Principal Findings
To the best of our knowledge, this is the first study to examine the extent to which GRADE has been used in SRs published in the 10 gynecology and obstetrics journals with the highest impact factor according to the JCR 2019. Over the five-year study period, the number of SRs using GRADE to evaluate the certainty of evidence was relatively small, reaching 18 (7.56%) in 2020. Overall, this methodological survey shows that only 67 SRs (7.04%) rated the outcome-specific certainty of evidence with GRADE. Four hundred (42.28%) and 269 (28.44%) individual outcomes were rated as very low and low, respectively. In the SRs of RCTs, the certainty of evidence was downgraded mostly for RoB and imprecision, while in the SRs of NRSs, the certainty of evidence was downgraded mostly for RoB, imprecision, and inconsistency.
An important finding is that only a small proportion of the evidence evaluated using GRADE in gynecology and obstetrics is of high quality. There are several reasons for this. First, several limitations in clinical studies, such as the lack of a clearly randomized allocation sequence, lack of blinding, lack of allocation concealment, and failure to adhere to intention-to-treat analysis, are often unavoidable and can lower the quality of the evidence. In our study, these limitations together accounted for 33.40% of downgrades. Second, in clinical studies, it can be difficult to control the number of participants. If the number of participants is low and the confidence interval is wide, the quality of the evidence will be downgraded due to imprecision. For example, in an RCT conducted by Armstrong et al. [103], the number of women undergoing hysteroscopy after IVF (in vitro fertilization) failure was low, so the quality of evidence was downgraded for imprecision. In addition, indirectness, publication bias, and inconsistency also contribute to such low quality.

Results in the Context of What Is Known
This is the first scoping review on the use of GRADE in gynecology and obstetrics, although other publications have addressed the details, advantages, and challenges of the GRADE approach in the clinical field [110][111][112]. As noted in the findings above, the certainty of evidence from NRSs may be downgraded excessively, which could discourage the application of GRADE in such cases. However, solutions to this challenge may help to promote the use of the GRADE approach. For example, researchers have recommended using ROBINS-I in GRADE. ROBINS-I offers an alternative terminology, referring to NRSs rather than observational studies, which gives researchers a more transparent way of separating studies by design. It places RCTs and NRSs on a common metric for risk of bias and facilitates the comparison of evidence from both types of studies.
Considering that the study design is the starting point of GRADE in determining the level of evidence, our research investigated the use of GRADE in the evaluation of evidence in gynecology and obstetrics, and additionally presented the use of down- and upgrading factors according to the study design reviewed in the SRs. The results of our study were similar to findings reported by Cuello-Garcia and colleagues [113], which showed that most respondents would present pooled data from both RCTs and NRSs separately, either in a single Summary of Findings Table or each in its own table. As our sample size is small, these findings need to be confirmed in a larger sample.

Implications for the Broader Research Field
Its adoption by more than 100 organizations worldwide indicates that GRADE is a promising approach to evaluating the certainty of evidence and determining the strength of a recommendation [11]. However, our experience permits us to put forward a few recommendations on assessing certainty of evidence and strength of recommendations.
First, GRADE has been widely used for rating the level of evidence from both RCTs and NRSs, but using GRADE to evaluate the certainty of evidence in NRSs continues to present some challenges. The study design is the initial and key consideration in GRADE when rating the initial certainty of evidence: RCTs start as "high", whereas NRSs start as "low" due to the risk of confounding and selection bias. Users may then improperly double count the risk of confounding and selection bias, so the certainty of evidence in NRSs may be downgraded excessively. ROBINS-I (risk of bias in non-randomized studies of interventions) presents several opportunities for GRADE [114]. Since ROBINS-I places RCTs and NRSs on a common metric for risk of bias, it may facilitate the comparison of evidence from both types of studies [114]. ROBINS-I does not treat study design (e.g., cohort, case-control, case series, or cross-sectional) as a risk of bias in itself. When ROBINS-I is used to assess risk of bias in NRSs, the GRADE certainty of evidence from a body of NRSs would initially be rated as high, since selection bias and confounding are assessed as integral components of ROBINS-I [114]. In addition, ROBINS-I provides a way to assess whether failure to use randomization in individual studies affects the risk of bias, and it harmonizes GRADE approaches for different types of questions, such as prognosis and test accuracy [114]. To rate the certainty of a body of evidence accurately, we recommend using ROBINS-I in GRADE.
Second, while GRADE provides a framework for a systematic, transparent, and explicit assessment of the certainty of evidence and strength of recommendations, using GRADE commonly involves some subjective judgments and can therefore result in differing assessments [115,116]. For example, if GRADE users are uncertain about the exact ratings for multiple domains, or when the same issue affects multiple domains, they may reach different judgments. In this case, GRADE users should consider the domains together and choose the worst rating considered in one domain and the best rating considered in the other. Further, transparency requires presenting the reasoning for all judgments.
Third, for better recommendations, individual patient conditions, preferences, and values should be considered in addition to the certainty of evidence and the other important factors in the GRADE approach.

Strengths and Limitations
Our study has several strengths. First, we performed rigorous screening and extensive data extraction, covering information such as the number of rated and unrated outcomes and the number of and reasons for down- and upgrading in each domain. Second, this is the first study to examine the extent to which GRADE has been used in SRs published in gynecology and obstetrics journals, which may contribute to improving the process of evidence-informed diagnostic methods in this area.
Limitations of this study are as follows. First, our study focused on SRs published in the top 10 gynecology and obstetrics journals within a time frame limited to 5 years, which resulted in a relatively small sample size (n = 67) and excluded SRs published in other medical journals and in lower-ranked gynecology and obstetrics journals. Nevertheless, SRs published in high-impact medical journals are more likely to be relevant for future research and to provide evidence for clinical practice. Second, prior to this study, no official research protocol about GRADE use had been published; therefore, we reported our findings following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework where applicable to the study design. Third, our study was a descriptive examination of GRADE use in SRs in gynecology and obstetrics and did not determine whether the authors followed the criteria established by the GRADE working group to rate the certainty of the outcomes. Our study indicates that some authors may misconstrue GRADE domains. For example, in some SRs the certainty of evidence was upgraded due to low RoB, narrow confidence intervals, very low P-values, and mild statistical heterogeneity, none of which are upgrading domains. Therefore, future research should give priority to the optimal use of the GRADE approach. Finally, no time trends are addressed or assessed in this report. GRADE system adoption has possibly increased and improved over time.

Conclusions
This methodological survey describes the application of GRADE in gynecology and obstetrics. As an explicit, comprehensive, transparent, and pragmatic evaluation system, GRADE rates the certainty of evidence for each outcome and the strength of recommendations, which makes it important for evaluating the certainty of review evidence in obstetrics and gynecology. Our study shows that relatively few SRs in gynecology and obstetrics rate the certainty of evidence. More attention should be paid to the use of ROBINS-I in GRADE, to the transparency of GRADE in the evaluation of evidence, and to the actual situation of patients in the synthesis of gynecology and obstetrics evidence, in order to provide evidence for the final decisions of clinical researchers and clinicians.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jcm12020446/s1, Figure S1: Flow diagram showing study selection process; Table S1: Overview of SRs that applied GRADE incorrectly* by rating the certainty of evidence for all studies or each individual study; Table S2: Overview of the GRADE domains of systematic reviews that rated the outcome-specific certainty of evidence.

Conflicts of Interest:
The authors declare no conflict of interest.