Ovarian Adnexal Reporting Data System (O-RADS) for Classifying Adnexal Masses: A Systematic Review and Meta-Analysis

Simple Summary
We performed a systematic review and meta-analysis aiming to assess the diagnostic performance of the Ovarian Adnexal Report Data System (O-RADS) using transvaginal ultrasound for classifying adnexal masses. Data from 11 studies comprising 4634 masses showed that the pooled estimated sensitivity, specificity, positive likelihood ratio, negative likelihood ratio, and diagnostic odds ratio of the O-RADS system for classifying adnexal masses were 97% (95% confidence interval (CI) = 94%–98%), 77% (95% CI = 68%–84%), 4.2 (95% CI = 2.9–6.0), 0.04 (95% CI = 0.03–0.07), and 96 (95% CI = 50–185), respectively. We concluded that the O-RADS system has good sensitivity and moderate specificity for classifying adnexal masses.

Abstract
In this systematic review and meta-analysis, we aimed to assess the pooled diagnostic performance of the so-called Ovarian Adnexal Report Data System (O-RADS) for classifying adnexal masses using transvaginal ultrasound, a classification system that was introduced in 2020. We performed a search for studies reporting the use of the O-RADS system for classifying adnexal masses from January 2020 to April 2022 in several databases (Medline (PubMed), Google Scholar, Scopus, Cochrane, and Web of Science). We selected prospective and retrospective cohort studies using the O-RADS system for classifying adnexal masses with histologic diagnosis or conservative management demonstrating spontaneous resolution or persistence in cases of benign-appearing masses after follow-up scan as the reference standard. We excluded studies not related to the topic under review, studies not addressing O-RADS classification, studies addressing MRI O-RADS classification, letters to the editor, commentaries, narrative reviews, consensus documents, and studies where data were not available for constructing a 2 × 2 table. The pooled sensitivity, specificity, positive and negative likelihood ratios, and diagnostic odds ratio (DOR) were calculated. The quality of the studies was evaluated using QUADAS-2. A total of 502 citations were identified. Ultimately, 11 studies comprising 4634 masses were included. The mean prevalence of ovarian malignancy was 32%. The risk of bias was high in eight studies for the “patient selection” domain. The risk of bias was low for the “index test” and “reference test” domains for all studies. Overall, the pooled estimated sensitivity, specificity, positive likelihood ratio, negative likelihood ratio, and DOR of the O-RADS system for classifying adnexal masses were 97% (95% confidence interval (CI) = 94%–98%), 77% (95% CI = 68%–84%), 4.2 (95% CI = 2.9–6.0), 0.04 (95% CI = 0.03–0.07), and 96 (95% CI = 50–185), respectively. Heterogeneity was moderate for sensitivity and high for specificity. In conclusion, the O-RADS system has good sensitivity and moderate specificity for classifying adnexal masses.


Introduction
Accurate discrimination between benign and malignant adnexal masses is essential for adequate management. Adnexal lesions considered as physiologic processes or at low risk of malignancy can be managed expectantly or removed through minimally invasive techniques [1,2]. On the other hand, adnexal masses classified at high risk of malignancy warrant further evaluation and should be eventually referred to gynecologic oncology units for adequate management [3].
Transvaginal sonography (TVS) is still considered the first-line imaging technique for assessing adnexal masses, and no other imaging technique provides better diagnostic performance than TVS [4]. Traditionally, TVS assessment of adnexal masses is based on the subjective impression of an expert examiner using so-called pattern recognition [5]. However, this approach is mainly limited by the need for proper training and experience [6]. For this reason, the use of several approaches such as ultrasound-based scoring systems, mathematical models, or biomarker-based models has been proposed [7–12]. Recently, the Assessment of Different Neoplasias in the Adnexa (ADNEX) model was proven to be the best model for discriminating benign from malignant adnexal masses [13]. An additional problem resides in the reporting of data. There is evidence that this problem may influence patient management [14].
In 2020, the American College of Radiology developed the so-called Ovarian Adnexal Report Data System (O-RADS) for classifying adnexal masses, with the aim of decreasing or eliminating ambiguity related to ultrasound reports [15]. This reporting system classifies adnexal masses into five risk groups (O-RADS 1, risk of malignancy 0%; O-RADS 2, risk of malignancy <1%; O-RADS 3, risk of malignancy 1–9%; O-RADS 4, risk of malignancy 10–49%; O-RADS 5, risk of malignancy ≥50%) on the basis of both the International Ovarian Tumor Analysis group (IOTA) terms and definitions and the ADNEX model. Since this report, several studies have been published addressing the diagnostic performance of this classification system. The aim of the present systematic review and meta-analysis was to synthesize the current evidence on the O-RADS reporting system for classifying adnexal masses.

Protocol and Registration
This meta-analysis was performed following the recommendations of the PRISMA Statement (http://www.prisma-statement.org/, accessed on 24 April 2022), as well as the guidelines from the Synthesizing Evidence from Diagnostic Accuracy Tests (SE-DATE) [16,17]. The protocol was not registered. Inclusion and exclusion criteria for studies to be selected, as well as how data extraction and quality assessment were defined, were established prior to starting the data search.
Institutional Review Board approval from Clinical Universidad de Navarra was waived because of the study's nature and design.

Data Sources and Searches
Three of the authors screened five electronic databases, PubMed/Medline, Scopus, Cochrane, Web of Science, and Google Scholar, to identify potentially eligible studies published between January 2020 and April 2022.
The search terms included and captured the concepts of "adnexal masses", "ovarian cancer", "transvaginal ultrasound", "O-RADS", and/or "Ovarian Adnexal Report Data System". No language limit was set.

Study Selection and Data Collection
Two authors screened the titles and abstracts identified by the search to exclude irrelevant articles. Then, full-text articles were selected to identify potentially eligible studies applying the following three inclusion criteria: (1) Prospective and retrospective cohort study including patients diagnosed as having at least one adnexal mass classified using the O-RADS system after transvaginal/transabdominal ultrasound assessment as the index test. (2) Report of histologic diagnosis of the adnexal mass after surgical removal or conservative management demonstrating spontaneous resolution or persistence in cases of benign-appearing masses after follow-up scan as the reference standard. To avoid including duplicate cohorts from two or more studies reported by the same authors, the study period of each study was examined; if dates overlapped, we contacted the authors to check whether the same patients were included in the different studies reported. We searched for additional papers by reading the reference lists of those papers selected for full-text reading. In cases of insufficient data, we contacted the authors. The Patients, Intervention, Comparator, Outcomes, Study Design (PICOS) criteria used for inclusion and exclusion of studies were recorded.
Two authors independently retrieved the diagnostic accuracy results from the ultimately selected studies. Disagreements arising during the process of study selection and data extraction were resolved by consensus between these two authors.

Risk of Bias in Individual Studies
Quality assessment of studies included in the meta-analysis was conducted using the tool provided by the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) [18]. The QUADAS-2 format includes four domains: (1) patient selection, (2) index test, (3) reference standard, and (4) flow and timing. For each domain, the risk of bias and concerns about applicability (the latter not applying to the domain of flow and timing) were analyzed and rated as low, high, or unclear risk. Quality assessment was used to provide an evaluation of the overall quality of the studies and to investigate potential sources of heterogeneity.
Two authors independently evaluated the methodological quality. Disagreements were resolved by discussion between these authors. For the patient selection domain, we assessed whether the study described its design and its inclusion and exclusion criteria (studies with inadequate exclusions, retrospective studies with examiners not blinded to the reference standard, and studies mixing data from expert and nonexpert examiners were considered at high risk of bias). For the index test domain, we assessed whether the study reported how the index test was performed and interpreted (studies not reporting whether the O-RADS classification was established using IOTA descriptors of the mass or the ADNEX model were considered at high risk). For the reference standard domain, we assessed which reference standard was used (in cases of conservative management of the mass, at least 1 year of follow-up was considered appropriate to identify true negative or false negative cases). For the flow and timing domain, we assessed the description of the time elapsed from index test assessment to the reference standard result (surgery >180 days after diagnosis was considered high risk). Unclear risk was assigned when the corresponding information for a domain was not reported in the study.

Statistical Analysis
We extracted information on the diagnostic performance of the O-RADS system. O-RADS classifies the adnexal masses into five groups (see above), O-RADS 1, O-RADS 2, O-RADS 3, O-RADS 4, and O-RADS 5, on the basis of either IOTA terms for description of the mass or the ADNEX model. We used the following dichotomous classification for constructing the 2 × 2 tables: O-RADS 1-3 cases were considered as benign, and O-RADS 4-5 cases were considered as malignant.
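The dichotomization described above can be sketched in a few lines of Python. This is only an illustration, not the authors' extraction procedure: the function names and the example masses are hypothetical, and each mass is assumed to be available as an (O-RADS category, malignant on reference standard) pair.

```python
def is_test_positive(orads_category):
    """O-RADS 4-5 count as test positive (malignant), O-RADS 1-3 as negative."""
    if orads_category not in (1, 2, 3, 4, 5):
        raise ValueError(f"unexpected O-RADS category: {orads_category}")
    return orads_category >= 4


def build_two_by_two(masses):
    """Tally a 2 x 2 table from (orads_category, is_malignant) pairs."""
    table = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for category, is_malignant in masses:
        positive = is_test_positive(category)
        if positive and is_malignant:
            table["tp"] += 1
        elif positive:
            table["fp"] += 1
        elif is_malignant:
            table["fn"] += 1
        else:
            table["tn"] += 1
    return table


def sensitivity_specificity(table):
    """Per-study sensitivity and specificity from a 2 x 2 table."""
    sensitivity = table["tp"] / (table["tp"] + table["fn"])
    specificity = table["tn"] / (table["tn"] + table["fp"])
    return sensitivity, specificity


# Hypothetical masses, for illustration only (not data from any included study).
example_masses = [
    (5, True), (4, True), (4, False), (3, False),
    (2, False), (3, True), (1, False),
]
example_table = build_two_by_two(example_masses)
print(example_table, sensitivity_specificity(example_table))
```

Under this cutoff, a malignant mass scored O-RADS 3 is counted as a false negative and a benign mass scored O-RADS 4 as a false positive.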
A random effects model was used to estimate the pooled sensitivity, specificity, positive likelihood ratio (LR+), negative likelihood ratio (LR−), and diagnostic odds ratio (DOR). Likelihood ratios were used to characterize the clinical utility of a test and to estimate the post-test probability of disease [19]. In cases where a study reported data from the same cohort analyzed by expert and nonexpert examiners, we chose the data derived from expert examiners.
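To make the idea of random-effects pooling concrete, the sketch below pools a proportion (for example, per-study sensitivity) on the logit scale with the DerSimonian-Laird estimator. This is a deliberately simplified univariate stand-in and not the method used in this study, which relied on the MIDAS and METANDI commands in STATA. The function name, the 0.5 continuity correction, and the example counts are assumptions made for illustration.

```python
import math


def pool_logit_dl(events, totals):
    """DerSimonian-Laird random-effects pooling of proportions on the logit scale.

    events[i] / totals[i] is the proportion in study i (e.g. TP / (TP + FN)).
    Returns (pooled_proportion, tau_squared, cochran_q).
    """
    logits, variances = [], []
    for e, n in zip(events, totals):
        a, b = e + 0.5, n - e + 0.5        # continuity correction (assumed)
        logits.append(math.log(a / b))
        variances.append(1.0 / a + 1.0 / b)

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, logits)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, logits))

    # Between-study variance, truncated at zero.
    df = len(logits) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled_logit = sum(wi * yi for wi, yi in zip(w_re, logits)) / sum(w_re)
    pooled = 1.0 / (1.0 + math.exp(-pooled_logit))
    return pooled, tau2, q


# Hypothetical counts from three studies, for illustration only.
print(pool_logit_dl([95, 80, 99], [100, 100, 100]))
```

A bivariate model additionally accounts for the correlation between sensitivity and specificity across studies, which this univariate sketch ignores.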
Using the mean prevalence of ovarian malignancy (pre-test probability), post-test probabilities were calculated using the positive and negative likelihood ratios and plotted on Fagan's nomogram.
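The arithmetic behind likelihood ratios and Fagan's nomogram is short enough to sketch. The Python below is illustrative only: the function names are hypothetical, and the likelihood ratios are computed from a single sensitivity/specificity pair, whereas pooled estimates from a meta-analytic model need not satisfy these identities exactly.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-, DOR) for a given sensitivity and specificity."""
    if not (0.0 < sensitivity < 1.0 and 0.0 < specificity < 1.0):
        raise ValueError("sensitivity and specificity must lie strictly between 0 and 1")
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    dor = lr_pos / lr_neg
    return lr_pos, lr_neg, dor


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability via odds."""
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Pre-test probability 32% with the pooled LR+ (4.2) and LR- (0.04) from the abstract.
print(post_test_probability(0.32, 4.2))
print(post_test_probability(0.32, 0.04))
```

With a pre-test probability of 32%, this calculation gives a post-test probability of roughly 66% after a positive result (LR+ = 4.2) and roughly 2% after a negative result (LR− = 0.04).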
Heterogeneity for sensitivity and specificity was assessed using Cochran's Q statistic and the I² index [20]. A p-value < 0.1 indicates heterogeneity. I² values of 25%, 50%, and 75% would be considered to indicate low, moderate, and high heterogeneity, respectively [20]. Forest plots of sensitivity and specificity of all studies were plotted. Meta-regression was used if heterogeneity existed for assessing covariates that could explain this heterogeneity. The covariates analyzed were sample size and malignancy prevalence.
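The relationship between Cochran's Q and the I² index can be sketched as follows. The function names are hypothetical, and treating the 25%, 50%, and 75% values as lower bounds of the low, moderate, and high bands is an assumption about how the thresholds are applied.

```python
def i_squared(cochran_q, n_studies):
    """I^2 (%) = max(0, (Q - df) / Q) * 100, with df = number of studies - 1."""
    df = n_studies - 1
    if cochran_q <= 0:
        return 0.0
    return max(0.0, (cochran_q - df) / cochran_q) * 100.0


def heterogeneity_label(i2_percent):
    """Map I^2 to a descriptive band (cutoffs treated as lower bounds: assumption)."""
    if i2_percent >= 75:
        return "high"
    if i2_percent >= 50:
        return "moderate"
    if i2_percent >= 25:
        return "low"
    return "negligible"


for q in (22.2, 214.2):
    value = i_squared(q, 11)
    print(q, round(value, 1), heterogeneity_label(value))
```

Applied to the Q values reported in this paper for 11 studies (22.2 for sensitivity and 214.2 for specificity), the formula gives about 55% and 95.3%, in line with the reported I² values within rounding.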
Summary receiver operating characteristic (sROC) curves were plotted to illustrate the relationship between sensitivity and specificity. Lastly, publication bias was assessed using Deeks' method [21].
All analyses were performed using MIDAS and METANDI commands in STATA version 12.0 for Windows (Stata Corporation, College Station, TX, USA). A p-value < 0.05 was considered statistically significant.

Methodological Quality of Included Studies
The QUADAS-2 assessment of the risk of bias and concerns regarding applicability of the selected studies is shown graphically in Figure 2. The study design was retrospective in 10 studies [22–27,29–32] and prospective in just one study [28]. Eight studies were considered as having high risk regarding the patient selection domain, since inappropriate exclusions (for example, cases with poor image quality or cases with not all data available) were observed [22,23,25,27–32], and one study was unclear since a complete description of the exclusion criteria was lacking [26].
Concerning the domain "index test", all studies adequately described the method of the index test, as well as how it was performed and interpreted. Ten studies used IOTA terminology of the mass features [22,23,25–32], while one used the ADNEX model [24]. In all retrospective studies, examiners were blinded to the reference standard result.
For the domain "reference standard", all studies were considered as low risk, since they correlated correctly with the target condition according to the reference standard, with definitive histology after surgical removal or follow-up with spontaneous resolution of the mass or follow-up for at least 1 year.
Regarding the domain "flow and timing", the time elapsed between the index test and reference standard was unclear in five studies [22,23,28–30]. In the remaining six studies, it was considered as low risk [24–27,31,32].
Concerning applicability, all studies were considered as low risk regarding patient selection (target population: patient with an adnexal mass), index test (ultrasound), and reference standard (surgery or follow-up) domains.

Diagnostic Performance of O-RADS System for Classifying Adnexal Masses
Overall, the pooled estimated sensitivity, specificity, positive likelihood ratio, negative likelihood ratio, and DOR of the O-RADS system for classifying adnexal masses were 97% (95% confidence interval (CI) = 94%–98%), 77% (95% CI = 68%–84%), 4.2 (95% CI = 2.9–6.0), 0.04 (95% CI = 0.03–0.07), and 96 (95% CI = 50–185), respectively.
Observed heterogeneity was moderate for sensitivity (I² = 55.1%; Cochran Q = 22.2; p < 0.001) and high for specificity (I² = 95.3%; Cochran Q = 214.2; p < 0.001). The forest plot is shown in Figure 3. Meta-regression showed that neither sample size nor malignancy prevalence explained the heterogeneity observed.

Summary of Evidence
In the present study, we performed a systematic review and meta-analysis of the O-RADS classification system using transvaginal ultrasound. We found 11 studies with available data comprising more than 4500 patients. The mean prevalence of malignancy was 32%, and malignant tumors were more frequent in postmenopausal than in premenopausal women.
We observed that the pooled sensitivity and specificity of the O-RADS system were 97% and 77%, respectively. The vast majority of the studies reported so far were retrospective, with only one prospective study with a small sample size (n = 50). Additionally, only one study used the O-RADS classification based on the results of the ADNEX model, whereas most used the interpretation of ultrasound features of the adnexal masses according to IOTA criteria.

Limitations and Strengths
The main strength of our study is that this meta-analysis is the first to address this issue. We do believe that the methodology used is correct.
As a limitation of our meta-analysis, we consider that the number of studies included was low. Therefore, we must be cautious when interpreting the results reported herein. On the other hand, most studies used the IOTA classification based on sonographic features of adnexal masses and not the ADNEX model. Therefore, we could not compare whether both approaches offer similar diagnostic performance.

Interpretation of Results
Our data indicate that the O-RADS classification system offers a very high sensitivity and moderate specificity for classifying adnexal masses. However, we observed significant heterogeneity among studies for both sensitivity and specificity, which implies that our results should be interpreted with caution.
Regarding the quality of the studies, we found room for improvement in study design. Clearly, there is a need for more prospective studies with large series of patients.
Notwithstanding, our data can be valuable from the clinical point of view. We observed that the specificity of the O-RADS classification is moderate (pooled false positive rate of 23%). This finding deserves attention, since, in most studies analyzed in this meta-analysis, the examiners involved were expert examiners. According to current evidence, a higher specificity should be expected for expert examiners when using subjective impression [5,33]. Expert examiners need to use clearly defined criteria for describing adnexal masses. However, most of the time, an expert examiner bases their diagnostic judgment on a subjective assessment, which in turn is based on so-called "pattern recognition". In fact, this is how the meta-analysis by Meys analyzed this issue. This meta-analysis showed that the pooled sensitivity and specificity for "subjective assessment" were 93% and 89%, respectively. O-RADS, on the other hand, does not allow a judgment of malignancy risk based on "subjective assessment"; instead, it requires the use of either the ADNEX model or the IOTA lexicon. This is why we think that O-RADS could yield a poorer specificity when used by an expert examiner. However, the pooled sensitivity of O-RADS was high (97%). This is important because few ovarian malignancies would be missed using this classification, which is particularly relevant in nonexpert hands (in spite of a significant false positive rate). For this reason, we think that this system would be more suitable for nonexpert examiners.
Our data must also be interpreted in the context of a comparison with other ultrasound-based and biomarker-based models. The data we obtained for O-RADS classification provided figures, in terms of sensitivity and specificity, similar to those reported for IOTA Simple Rules [33–35], IOTA LR2 [33,34], and the IOTA ADNEX model [36,37]. However, O-RADS classification offered better sensitivity and specificity than the risk of malignancy index (RMI) [11,33,34] and the risk of ovarian malignancy algorithm (ROMA) [11,38].
In addition, a recent consensus paper developed by several societies such as the European Society of Gynecologic Oncology (ESGO), the European Society of Gynecologic Endoscopy (ESGE), the International Society of Ultrasound in Obstetrics and Gynecology (ISUOG), and the IOTA group stated that, in spite of the O-RADS classification system not having been validated, it should be used to classify adnexal masses, and the consensus paper provided management guidelines accordingly [39]. Our results now support this recommendation.

Future Research Agenda
As stated above, there is a need for prospective studies in large series of patients to definitively validate the O-RADS ultrasound classification system.
Although some studies have addressed the issue of reproducibility, there is a need for more studies assessing this important issue, as well as studies assessing how the O-RADS system works in the hands of nonexpert examiners.
Lastly, the O-RADS system was developed for providing guidance in patient management. To date, no study has addressed this issue properly.

Conclusions
In conclusion, our data show that the O-RADS ultrasound system has good diagnostic performance in classifying adnexal masses. However, more and better-designed studies are needed to determine whether this system should be used as standard in the ultrasound evaluation of adnexal masses.