Diagnostic Models Combining Clinical Information, Ultrasound and Biochemical Markers for Ovarian Cancer: Cochrane Systematic Review and Meta-Analysis

Simple Summary Diagnosing ovarian cancer (OC) accurately helps triage patients to receive anticancer treatment and appropriate cancer surgery. We conducted a systematic review and meta-analysis evaluating models combining clinical information, biomarkers, and ultrasound to identify the most accurate test. Our review investigated 58 studies (30,121 patients, 9061 OC cases) and compared the standard of care test in the UK, Risk of Malignancy index I (RMI I) against risk of ovarian malignancy (ROMA) and Assessment of Different NEoplasias in the adnexa (ADNEX). Compared to RMI I, in pre-menopausal women, ROMA and ADNEX identified more cancers correctly (increased sensitivity) but increased the number of women classified as having cancer when they did not have cancer (reduced specificity). In post-menopausal women, ROMA identified more cancers than RMI I with similar specificity, whilst ADNEX identified the most cancers overall but with least specificity. Consideration should be given as to whether RMI I should be replaced by ROMA or ADNEX as the standard of care test for OC. Abstract Background: Ovarian cancer (OC) is a diagnostic challenge, with the majority diagnosed at late stages. Existing systematic reviews of diagnostic models either use inappropriate meta-analytic methods or do not conduct statistical comparisons of models or stratify test performance by menopausal status. Methods: We searched CENTRAL, MEDLINE, EMBASE, CINAHL, CDSR, DARE, Health Technology Assessment Database and SCI Science Citation Index, trials registers, conference proceedings from 1991 to June 2019. Cochrane collaboration review methods included QUADAS-2 quality assessment and meta-analysis using hierarchical modelling. RMI, ROMA or ADNEX at any test positivity threshold were investigated. Histology or clinical follow-up was the reference standard. We excluded screening studies, studies restricted to pregnancy, recurrent or metastatic OC. 2 × 2 diagnostic tables were extracted separately for pre- and post-menopausal women. Results: We included 58 studies (30,121 patients, 9061 cases of ovarian cancer). Prevalence of OC ranged from 16 to 55% in studies. For premenopausal women, ROMA at a threshold of 13.1 (+/−2) and ADNEX at a threshold of 10% demonstrated significantly higher sensitivity compared to RMI I at 200 (p < 0.0001) 77.8 (72.5, 82.4), 94.9 (92.5, 96.6), and 57.1% (50.6 to 63.4) but lower specificity (p < 0.002), 92.5 (90.0, 94.4), 84.3 (81.3, 86.8), and 78.2 (75.8, 80.4). For postmenopausal women, ROMA at a threshold of 27.7 (+/−2) and AdNEX at a threshold of 10% demonstrated significantly higher sensitivity compared to RMI I at a threshold of 200 (p < 0.001) 90.4 (87.4, 92.7), 97.6 (96.2, 98.5), and 78.7 (74.3, 82.5), specificity of ROMA was comparable, whilst ADneX was lower, 85.5 (81.3, 88.9), 81.3 (76.9, 85.0) (p = 0.155), compared to RMI 55.2 (51.2, 59.1) (p < 0.001). Conclusions: In pre-menopausal women, ROMA and ADNEX offer significantly higher sensitivity but significantly decreased specificity. In post-menopausal women, ROMA demonstrates significantly higher sensitivity and comparable specificity to RMI I, ADNEX has the highest sensitivity of all models, but with significantly reduced specificity. RMI I has poor sensitivity compared to ROMA or ADNEX. Choice between ROMA and ADNEX as a replacement test will depend on cost effectiveness and resource implications.


Introduction
This article is based on a Cochrane Review published in the Cochrane Database of Systematic Reviews (CDSR) 2022, Issue 7, DOI:10.1001/14651858. CD011964.pub2 In press (see www.cochranelibrary.com for information, accessed on 12 May 2022). [1] Cochrane Reviews are regularly updated as new evidence emerges and in response to feedback, and the CDSR should be consulted for the most recent version of the review.
Ovarian cancer (OC) has a high case fatality rate, largely attributed to the advanced stage at diagnosis in the majority of patients. Overall survival is 35% at 10 years; however, one year survival is only 51% at stage 4 compared to over 90% at stage 1 [2,3]. Globally, about 14-22% of women with OC will not receive anticancer therapy, probably due to late diagnosis impacting adversely on performance status and the ability to undergo treatment [4][5][6]. Diagnostic testing enables triage at two levels: initially from the community setting into hospital gynaecology clinics and then a further triage of some women to tertiary cancer care. Women classified as 'low risk' are managed by gynaecologists and can have fertility preservation, laparoscopic surgery, or surveillance whilst women classified as 'high-risk' are referred to a specialist cancer centre for surgery by expert gynaecological oncologists. Earlier, more accurate, diagnosis will enable better triage, optimisation of resources, and can improve outcomes.
OC represents a diagnostic challenge: symptoms are non-specific [7] and prevalence in the community is low (the average primary care physician sees one patient with ovarian cancer in 400 patient encounters). In addition, OC comprises heterogenous histology types which undermines the potential accuracy of any single biomarker. Menopausal status is likely to be a significant modifier of test accuracy given the difference in pre-test probability and the differing spectrum of histological subtypes in pre-compared to postmenopausal women. Diagnosis of OC in premenopausal women is particularly challenging; the majority of tumours are benign and only 1 in 1000 symptomatic ovarian cysts is malignant. This increases to 3 in 1000 by age 50 [8]. False positive results are generated by ovarian physiology (cysts) and benign pathology, e.g., endometriosis. Given the above, a combination of tests is more likely to improve diagnostic accuracy over any single test alone. Current guidelines advocate the use of composite models which, in addition to adjustment for age or menopausal status, include multiple biomarkers. These include Risk of Malignancy Algorithm (ROMA) which combines biomarkers CA125 and He4, the Assessment of Different NEoplasias in the adneXa (ADNEX) model from the International Ovarian tumour analysis consortium (IOTA) and the Risk of Malignancy Index I (RMI I) which both combine Ca125 and ultrasound characteristics (ACOG guidelines). ADNEX also offers a polynomial probability of histology. RMI I is recommended by the National Institute for Health and Clinical Excellence [9] and the Royal College of Obstetrics and Gynaecology in the UK [10].
We conducted a systematic review to compare the accuracy of diagnostic models for the diagnosis of OC, distinguishing between pre-and post-menopausal women. Our target population was women presenting with symptoms or signs suspicious for ovarian cancer. We used test accuracy review methods recommended by the Cochrane collaboration.

Methods and Materials
We followed Cochrane Screening and Diagnostic Tests Methods group recommendations [11]. Review registration number: Rai N, Champaneria R, Snell K, et al. Symptoms, ultrasound imaging and biochemical markers alone or in combination for the diagnosis of ovarian cancer in women with symptoms suspicious of ovarian cancer. Cochrane Database Syst Rev. 2015; 2015(12):CD011964. Published 7 December 2015. doi:10.1002/14651858.CD011964.

Criteria for Including Studies
We included studies in women aged 18 years or older undergoing testing in any healthcare setting (primary/secondary/tertiary) for a suspicion of OC. Studies investigated the accuracy of RMI I, ROMA, or ADNEX at any test positivity threshold. We included comparative and non-comparative diagnostic test accuracy studies if they verified index test results against histology or clinical follow-up and with sufficient data to extract 2 × 2 diagnostic tables separately for pre-and post-menopausal women. We included studies of any design, including case-control studies, where controls included benign ovarian pathology.
We excluded studies restricted to pregnant women, history of OC, recurrent, metastatic OC, and screening studies.

Search Strategy
We used sensitive search strategies combining terms for the target condition (OC), biomarkers, symptom indices and ultrasound, and terms to describe diagnostic models and algorithms. We searched Cochrane Central Register of Controlled Trials (CENTRAL), MEDLINE and MEDLINE In Process (Ovid), EMBASE (Ovid), CINAHL (Ebsco), the Cochrane Database of Systematic Reviews (CDSR), Database of Abstracts of Reviews of Effects (DARE), Health Technology Assessment Database (HTA), and SCI Science Citation Index (ISI Web of Knowledge) with a date restriction (1991 onwards) to ensure applicability to current technology. Trials registers, conference abstracts, and proceedings were searched for unpublished studies without date restrictions. Reference lists of systematic reviews and guidelines were searched. No language restrictions were applied. Searches were completed in June 2019 (Supplementary File S1).

Data Extraction and Quality Assessment
Selection, data extraction, and quality assessment were carried out independently and in duplicate by 2 reviewers, with disagreements resolved by a third reviewer. Quality assessment was through a QUADAS-2 checklist tailored to the topic [12]. Tailoring included a question about follow up of index test negatives (>12 months), two questions relevant to multivariable model/composite index test validation studies from a quality checklist for prognostic studies [13], and whether borderline ovarian tumours (BOT) were appropriately handled, i.e., BOT categorised as malignant/index test positive for analysis. Key elements of applicability included symptoms as reason for testing, inclusion of comorbidity, e.g., endometriosis, that might affect estimates of test accuracy, and expertise of clinicians performing ultrasound. The modified QUADAS-2 tool used is available from the authors on request.

Statistical Analysis
Exploratory analyses included plotting estimates of sensitivity and specificity separately in pre-and post-menopausal women and grouped by test threshold on Forest plots and for summary ROC plots. Analyses were conducted in Stata version SE 17.0 [14] and SAS software (version 9.4) [15]. HSROC models were fitted using the NLMIXED procedure in SAS. Bivariate models were fitted using the meqrlogit command in Stata. In cases where both random effects were set to zero, a fixed effect logistic regression was fitted using the blogit command. Absolute differences in sensitivities/specificities and p-values were derived from bivariate models using the nlcom command in Stata. This computes point estimates and standard errors using the delta method.

Estimation of Accuracy for Individual Tests
Where adequate data were available and considered reasonable to pool results, we performed meta-analyses using hierarchical models including random effects [16,17]. To estimate average sensitivity and specificity at fixed thresholds, we performed analysis of each index test version by first restricting to studies that reported thresholds recommended in guidelines and/or used in clinical practice and secondly to commonly reported thresholds. For ROMA, we included studies using thresholds +/−2 units around most commonly reported thresholds. We used random-effects univariate analyses where pooling was considered an appropriate approach, but bivariate models failed to converge. Where appropriate, models were simplified by setting near-zero variance estimates of the random effects to zero [18].

Comparison of Test Accuracy
Since studies reported different thresholds per index test to define disease positivity, hierarchical summary ROC (HSROC) models [19] that included random-effects parameters for variation in accuracy and threshold were fitted to maximise use of all date available. Comparisons between index tests were made by adding a covariate for test type to the accuracy and threshold parameters, while assuming a common underlying shape. A summary ROC curve for each index test across all included thresholds was estimated. Each included study contributed one threshold to the summary ROC curve. Where an individual study reported more than one threshold, the most commonly reported threshold for that index test across all included studies was selected for the meta-analyses. P values for difference in accuracy for each test compared to RMI I (RMI I being recommended test for routine use in the United Kingdom) were computed using Wald tests. The difference in sensitivities at fixed specificities of 80% and 90% for each index test compared to RMI I with 95% CI was also reported.
To illustrate comparative accuracy at specific test operating thresholds, we also undertook a comparison of index tests at the single most commonly reported threshold using bivariate hierarchical models that included a covariate for test type. Absolute differences in sensitivity/specificity and the corresponding p-values for each pairwise test comparison were reported from the model. Where appropriate, models were simplified by assuming no correlation between sensitivity and specificity estimates or by setting near-zero variance estimates of the random effects to zero [18].
We translated summary estimates of sensitivity and specificity into summary estimates of the absolute numbers of true positives, false negatives, false positives, and true negatives. We did this using a hypothetical population of 1000 women using an estimate of disease prevalence (pre-test probability) reflecting the NICE threshold for cancer referral from primary care to hospital settings in the UK of 3% (Supplementary File S2) (NICE 2017) [20].

Investigation of Heterogeneity
We investigated the effect on test accuracy of menopausal status (pre-menopausal or post-menopausal) and classification of BOT as disease positive (grouped with malignant ovarian tumours) or where classification of BOT was unclear or BOT were excluded. Grouping of BOT with OC was considered appropriate (reflecting current surgical practice) whereas exclusion of BOT was considered methodologically inappropriate. Estimation of differences in accuracy were performed using the NLMIXED procedure in Statistical Analysis System [15] by including borderline grouping as covariates in the bivariate model. Differences in accuracy were reported using the ratio of diagnostic odds ratios with 95% CI and associated p values using Wald tests.

Characteristics of Included Studies
Supplementary File S3 presents characteristics of included studies. Studies were conducted in Europe (40), Asia-Pacific (12), North America (5), and South America (1). Eighteen were multicentre. Forty-eight studies were conducted in a hospital (8 mixed secondary and tertiary, 28 tertiary, and 12 secondary) and in 10 healthcare setting was not reported. Clinical pathway to test was not described in any studies. Prevalence of OC ranged between 16% (RMI I, ROMA), and 27% (AdNEX) in pre-menopausal and between 38% (RMI I), 45% (ROMA) and 55% (AdNEX) in post-menopausal women. Four studies specified presence of symptoms whilst 10 ROMA studies reported that an adnexal mass was identified following imaging. Eighteen ROMA and four RMI I studies explicitly restricted inclusion to epithelial ovarian cancer (EOC) and seven ROMA and one RMI I study explicitly excluded BOT. In a further 18 ROMA and 3 RMI I studies occurrence of BOT was not reported. The majority of studies were conducted in specialist gynaecological oncology centres (36/58) in women scheduled for surgery.

Quality (Risk of Bias and Applicability Concerns of Included Studies)
Risk of bias: Figure 2 illustrates quality assessment for 58 included studies (reporting 66 datasets) across participant, index tests, reference standard, and flow and timing domains. For participant selection domain, 15/58 (27%) studies were at high risk of bias and 38/59 (64%) at unclear risk of bias. Only 5 studies were judged at low risk of bias [24][25][26][27][28] where authors explicitly reported consecutive sampling and tumour pathology such that impact on accuracy could be investigated. For index test domain, the majority of ROMA (33/42; 79%), (2/4) ADNEX and 9/20 (45%) of RMI I studies were judged low risk of bias as they were prospective or because of the objective nature of the index test (ROMA). One RMI I study [29] was judged at high risk of bias because RMI I results were interpreted with knowledge of reference standard. Four ROMA studies [30][31][32][33] were judged at high risk of bias because no threshold was pre-defined. For reference standard domain, the majority of studies 30/58 (52%) were judged low risk of bias. In total, 27 of 58 studies (47%) were judged unclear and two at high risk of bias [34,35] either because minimum length of follow up for index negatives was not reported or because there was concern that the reference standard was ascertained with knowledge of the index test result. For flow and timing domain, the majority of studies were judged unclear risk of bias 32/58 (55%), most commonly because of no information regarding the interval between index test and reference standard. In total, 12 of 58 (21%) studies were judged at high risk of bias because not all participants receiving an index test received a reference standard.

Applicability
In total, 53 of 58 studies (91%) were judged high or unclear applicability for participant selection domain because participants did not explicitly represent symptomatic women. For index test domain, applicability was high or unclear for all RMI and ADNEX studies because ultrasound was conducted by specialist sonographers or level of specialism was unclear. For reference standard domain, applicability was judged as high or unclear in 49/58 (85%) studies because BOT had been excluded from analysis or classification of BOT was unclear.

Estimation of Test Accuracy
A consistent difference was observed in sensitivity (higher in post-menopausal women) and specificity (lower in post-menopausal women) across index tests operating at a fixed threshold (Supplementary File S4) and across all versions of all index tests at all thresholds analysed. Subsequently, sensitivity and specificity were estimated in pre-and postmenopausal women separately.

Estimation of Accuracy-Single Tests
Supplementary File S4 summarises accuracy of 58 unique included studies (66 data sets) in pre-and post-menopausal women. ROMA and ADNEX studies included reported multiple test positivity thresholds. For ROMA, no evidence of a significant difference in accuracy at thresholds was identified and we chose the threshold reported by most studies (13.1 (+/−2) and 27.7 (+/−2) in pre-and post-menopausal women, respectively). For ADNEX, the only common threshold used across all 4 included studies was that to achieve a post-test probability of OC of 10%. Figure 3A illustrates comparison of test performance contributed by all studies at all test positivity thresholds in pre-menopausal women. In pre-menopausal women, ADNEX (p = 0.0083) but not ROMA (p = 0.5608) demonstrates superior accuracy compared to RMI I. Figure 3B illustrates that in post-menopausal women ROMA (p = 0.0043) but not ADNEX (p = 0.0522) demonstrates superior overall accuracy compared to RMI I.

Comparison of Test Accuracy: HSROC Analysis
We also compared the sensitivity of tests when fixing specificity at 80% and 90% (Table 1) in keeping with clinical consensus [8,10]. Of note the sensitivity estimate for ADNEX in pre-menopausal women at a fixed specificity of 80% and in pre-and postmenopuasal women at a fixed specificity of 90% is extrapolating beyond the data from included ADNEX studies ( Figure 3B).  In pre-menopausal women at a fixed specificity of 80%, RMI I has sensitivity of 79.8 (69.7 to 87.1). The average difference in sensitivity of ROMA compared to RMI I at a fixed specificity of 80% is compatible with chance (2.4 (−5.8 to 10.5) but a significant increase in average sensitivity is observed with ADNEX: 15.1 (5.7 to 24.5) ( Table 1).
In post-menopausal women at a fixed specificity of 80%, RMI I has a sensitivity of 85.4 (81.1 to 88.9). ROMA and ADNEX both demonstrated a significant increase in average sensitivity compared to RMI of 5.5 (1.7 to 9.4) and 8.8 (2.4 to 15.1), respectively. (Table 1). Tables 2 and 3     Pre-menopausal women: Test performance of ROMA and ADNEX compared to RMI I specific thresholds. Post-menopausal women: Test performance or ROMA and ADNEX compared to RMI I at specific thresholds.

Investigation of Heterogeneity: The Effect of Classification of BOT on Estimates of Test Accuracy
In pre-menopausal women (ROMA n = 38 studies; RMI I 19 studies) and post-menopausal women (ROMA n = 40 studies) there was sufficient data, when utilising all test positivity thresholds at a fixed specificity of 80%, to allow comparison of sensitivity estimated by studies where BOT were classified as positive (grouped with malignant tumours) against studies excluding BOT from analysis or where classification of BOT was unclear. In post-menopausal women, for ROMA, a significant decrease in sensitivity of 6.4% (1.2 to 11.5) was observed for studies grouping BOT with malignant compared to studies where BOT were excluded or where categorisation for analysis was unclear (Supplementary File S5).

Discussion
This systematic review highlights that the performance of test combinations varies considerably between pre-and post-menopausal women. Evaluation of models should include an assessment of their performance separately in these different target populations, including when models include variables to adjust for this characteristic. In pre-menopausal women, both ROMA and ADNEX demonstrated significantly higher sensitivity than RMI, ADNEX demonstrating marginally significantly higher sensitivity compared to ROMA. Both ROMA and ADNEX showed significantly lower specificity than RMI. In post-menopausal women, both ROMA and ADNEX demonstrated significantly higher sensitivity than RMI I. ROMA demonstrated comparable specificity to RMI I, whilst for ADNEX specificity was significantly lower compared to RMI I.
Novel features of this review include systematic investigation of effects of menopausal status, investigation of BOT on estimates of test accuracy and statistical comparison of test accuracy between models. We mitigated heterogeneity by restricting analysis to primary adnexal tumours and where this was not possible or unclear in studies with mixed primary, recurrent, and metastatic disease, this was reflected in downgrading of quality.

Limitations
We acknowledge limitations of the search date, in recognition of the relatively small number of ADNEX studies included we performed a scoping search for primary studies published since our search cut-off date of June 2019. Three studies were found, two from single centres [45,46] and one multicentre study [47].
Only one study [45] reported sensitivity and specificity separately in pre-and postmenopausal women. Sensitivity and specificity were both 83% in premenopausal women and 100% and 76%, respectively, in postmenopausal women at a threshold to achieve a post-test probability of ovarian cancer of 10%. These estimates are in line with studies included in this review. Inclusion of this single eligible additional ADNEX study is unlikely to alter the conclusions of this review.
Deficiencies of included studies included lack of data and poor reporting precluding quality assessment and investigation of potential important sources of heterogeneity. These included clinical setting, target condition, histological subtype and stage; included studies included a varying range of ovarian pathology. A lack of distinction between pre-and post-menopausal women when evaluating test accuracy is a major limitation of research in this area.

Recommendations for Clinical Practice
Guidelines that recommend RMI I for diagnosis of OC in pre-and post-menopausal women should be changed. Whether ROMA or ADNEX should replace RMI I will depend on how health care systems view the trade-off between sensitivity and specificity. Implementing ultrasound models will require training in specialist ultrasound skills and quality assurance processes, similar to those introduced for nuchal scans in early pregnancy. Implementing ultrasound testing through dedicated 'pelvic mass' clinics may represent a method for achieving this and this is currently being investigated within the ROCkeTS study [48]. Implementing testing with ROMA will require investment in laboratory processes.

Recommendations for Research
Most included studies demonstrated a high prevalence of OC (16-55%) in both preand post-menopausal women reflective of tertiary hospitals and a highly preselected population. Research is urgently needed to evaluate tests for diagnosis of ovarian cancer in community settings and lower prevalence. The ROCkeTS study will report on a head-to-head comparison of diagnostic tests in a lower prevalence setting than previously published [48]. Future systematic reviews should present results stratified by menopausal status.

Conclusions
Based on the results of this review RMI I should not be recommended for the diagnosis of OC in pre-or post-menopausal women. The test of choice between ROMA and ADNEX will depend on health care systems' evaluation of the trade-off between sensitivity and specificity.