Radiomics Models for Predicting Microvascular Invasion in Hepatocellular Carcinoma: A Systematic Review and Radiomics Quality Score Assessment

Simple Summary
Microvascular invasion (MVI) is regarded as a sign of early metastasis in liver cancer and can only be diagnosed by histopathological examination of the resected specimen. Preoperative prediction of MVI status may influence patient treatment management, for instance, by prompting an expanded resection margin. Radiomics can identify subtle imaging features in routinely used radiological images that are invisible to the naked eye and has been increasingly adopted to predict MVI. We reviewed the available radiomics models to evaluate their role in the prediction of MVI. The discriminative capacity of the models ranged from 0.69 to 0.94. Even though the studies were preliminary and the methodologic quality was suboptimal, radiomics models hold promise for the accurate and non-invasive prediction of MVI. In accordance with a standardized radiomics workflow, future prospective studies with external validation are expected to provide a reliable and robust prediction tool for clinical implementation.

Abstract
Preoperative prediction of microvascular invasion (MVI) is of importance in hepatocellular carcinoma (HCC) patient treatment management. Plenty of radiomics models for MVI prediction have been proposed. This study aimed to elucidate the role of radiomics models in the prediction of MVI and to evaluate their methodological quality. The methodological quality was assessed by the Radiomics Quality Score (RQS), and the risk of bias was evaluated by the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2). Twenty-two studies using CT, MRI, or PET/CT for MVI prediction were included. All were retrospective studies, and only two had an external validation cohort. The AUC values of the prediction models ranged from 0.69 to 0.94 in the test cohort. Substantial methodological heterogeneity existed, and the methodological quality was low, with an average RQS score of 10 (28% of the total). Most studies demonstrated a low or unclear risk of bias in the domains of QUADAS-2. In conclusion, a radiomics model could be an accurate and effective tool for MVI prediction in HCC patients, although the methodological quality has so far been insufficient. Future prospective studies with an external validation cohort in accordance with a standardized radiomics workflow are expected to supply a reliable model that translates into clinical utilization.


Introduction
Microvascular invasion (MVI) has been recognized as an independent predictor for early recurrence and poor prognosis after liver resection or transplantation in hepatocellular carcinoma (HCC) [1,2]. Its reported incidence ranges from 15% to 57% according to different diagnostic criteria and study populations [3]. The diagnosis of MVI, however, can only be made by postoperative histopathological examination of the resected specimen, by which point it has little or no influence on treatment management. With preoperative knowledge of MVI status, clinicians could optimize the treatment strategy, for example, by expanding the resection margin or adopting an alternative treatment option. To implement personalized medicine, it is of utmost importance to preoperatively identify and stratify patients with MVI. Therefore, a reliable, noninvasive biomarker for preoperative prediction of MVI is urgently needed.
Medical imaging has evolved from a primarily diagnostic tool to an essential role in clinical decision making. Clinically, radiologists use pattern recognition after establishing links between radiological features on CT or MRI images and MVI [4,5], such as arterial peritumoral enhancement, non-smooth tumor margins, and rim arterial enhancement [2]. The Liver Imaging Reporting and Data System (LI-RADS) has recently been developed and has evolved into a comprehensive and standardized diagnostic algorithm for HCC imaging reporting [6]. LI-RADS has been proven to be an effective tool not only for HCC diagnosis but also for outcome prediction after liver resection, radiofrequency ablation, or liver transplantation [6,7,8], exerting an increasing influence on the treatment management of HCC. Previous studies have demonstrated the diagnostic value of LI-RADS in the prediction of MVI [9,10]. However, these qualitative features suffer from subjectivity and high inter-observer variability [11].
Radiomics is an emerging field that can extract high-throughput imaging features from biomedical images and convert them into mineable data for quantitative analysis [12,13]. Its basic assumption is that alterations and heterogeneity of the tumor at the micro scale (e.g., at the cellular or molecular level) are reflected in the images [14]. Therefore, through radiomics analysis, cancerous cell emboli (i.e., MVI) in the hepatic vasculature may be detectable in preoperative images, which holds promise for the preoperative prediction of MVI and personalized treatment. In recent years, a number of radiomics models for MVI prediction have emerged. However, no study has systematically summarized current radiomics research for MVI prediction, and the overall efficacy of the prediction models is still unknown. In addition, as radiomics research is a sophisticated process consisting of several steps, it is important to evaluate the methodological variability to obtain a reliable and reproducible model before translating it to clinical applications. The current systematic review therefore aims (1) to provide an overview of radiomics studies for MVI prediction in HCC patients and assess the efficacy of the prediction models and (2) to evaluate the methodologic quality in the radiomics workflow and the risk of bias in the research.

Materials and Methods
This study is registered at the PROSPERO website (https://www.crd.york.ac.uk/prospero/, No: CRD42021250082, accessed on 20 May 2021) and was conducted under the guidance of the Preferred Reporting Items for a Systematic Review and Meta-analysis of Diagnostic Test Accuracy Studies (PRISMA-DTA) (Table S1).

Literature Research and Study Selection
Publications from the PubMed, Embase, Web of Science, and Cochrane Library databases were systematically retrieved by using the following key terms: "radiomics/texture analysis", "microvascular invasion", and "hepatocellular carcinoma". Detailed search queries for each database can be found in Table S2. The literature search was last updated on 29 May 2021. Records satisfying the following criteria were considered eligible.
Inclusion criteria: (1) retrospective or prospective studies; (2) studies considering patients who were diagnosed with hepatocellular carcinoma by a pathology exam; (3) studies with radiomics features extracted from CT, MRI, or PET/CT images used as predictors for MVI, solely or as a variable in a model; (4) studies where MVI prediction is the main outcome or one of the main outcomes; (5) publications in English.
Exclusion criteria: (1) publications in the form of a letter, conference abstract, editorial, review, or case report; (2) research considering only semantic radiological features used for MVI prediction; (3) research with operator-dependent imaging modalities, such as ultrasound-based studies; (4) deep-learning research not involving any textural features in the model; (5) studies only evaluating the predictive value of a single radiomics feature, without any combination into a multi-feature prediction model; (6) studies with a sample size of less than 30.
Study selection was conducted by two reviewers (Q.W. and C.L.) by screening the title and abstract and then the full text. Any disagreement or uncertainty was resolved by two senior researchers (K.M. and T.B.) to reach a consensus. Reference lists of the enrolled studies as well as a pre-existing systematic review/meta-analysis were also searched manually to recruit any potentially eligible studies.

Data Extraction
A pre-defined table was used to extract the following information from each paper: (1) general study characteristics; (2) patient characteristics; (3) characteristics in development of a radiomics model, including imaging modalities, tumor segmentation, imaging preprocessing and feature extraction, and feature selection and modelling; (4) performance metrics of a radiomics model, including area under the receiver operating characteristic (ROC) curve (AUC), calibration statistics, and decision analysis. A typical radiomics research workflow for MVI prediction is illustrated in Figure 1. If several prediction models were developed in one study, the one with the best performance in the test cohort was selected. For studies from the same medical center with overlapping subjects, the latest study was included if the same imaging modality was adopted; if different modalities, or different contrast media within the same modality, were applied, both studies were enrolled. Supplemental files of included studies were also screened to extract required data, if necessary.
The terms "test cohort" and "validation cohort" were unified in this study to avoid potential misunderstanding and confusion. A "test cohort" is a part of the model development cohort and usually refers to an "internal test cohort". A "validation cohort" is independent from the model development cohort, be it temporal validation (data collected from a later period) or geographic validation (data sampled from another hospital or country) [15], and it is often called an "external validation cohort".

Assessment of Radiomics Quality Score, Risk of Bias, and Research Type
The Radiomics Quality Score (RQS) is a scoring system proposed by Lambin et al. in 2017 [16] and is commonly used for evaluating the methodologic quality of radiomics research [17,18]. The RQS tool contains 16 key items to quantify the quality of the radiomics workflow and the reporting. Most items are assigned 0, 1, or 2 points, according to how well a study addresses the signaling question. Some dimensions are weighted more heavily to highlight their importance; for example, 7 points are given to a prospective validation study, and 5 points are given to a study validated in three or more datasets. The ideal RQS is 36 points, corresponding to 100%. Table S3 provides a detailed description of the RQS items.
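The conversion between RQS points and the percentages quoted in this review can be sketched as follows. This is a minimal illustration that assumes only the 36-point ideal score stated above; the function name is hypothetical and the per-item scoring rules in Table S3 are not reproduced.

```python
RQS_MAX_POINTS = 36  # ideal RQS total stated in the text


def rqs_percentage(points: float, max_points: int = RQS_MAX_POINTS) -> float:
    """Express an RQS total as a percentage of the ideal score."""
    return points / max_points * 100.0


# Values quoted in this review:
print(round(rqs_percentage(10)))  # average score of the included studies
print(round(rqs_percentage(15)))  # highest score among the included studies
print(round(rqs_percentage(7)))   # weight of the "prospective study" item
```

With these inputs, 10 points corresponds to about 28% and 15 points to about 42% of the ideal score, and the 7-point prospective item alone accounts for roughly a fifth of the scale.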
As the radiomics model is also used as a diagnostic tool, the risk of bias and the applicability concerns of the included studies were further assessed by using the revised Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool [19]. QUADAS-2 evaluates the risk of bias of a study in four domains: patient selection, index test, reference standard, and flow and timing. The results of each domain were marked as low, high, or unclear risk. A detailed description of QUADAS-2 is provided in Table S4.
An assessment of the RQS and QUADAS-2 was independently performed and cross-validated by two reviewers (Q.W. and C.L.). When discrepancy occurred, agreement was reached after discussion with two senior researchers (K.M. and T.B.).

Results

Literature Selection
The systematic literature search initially yielded 188 records from the four databases. After removing 82 duplicates, 50 inappropriate types of publications, and 34 ineligible studies, a total of 22 studies were included in this systematic review (Figure 2).

General Characteristics and the Incidence of MVI
The included 22 studies were published between September 2017 and May 2021, with two thirds (15/22) within the last two years. All studies were retrospectively designed and, in total, included 5552 patients with a sample size varying from 69 to 637 patients (median: 174). Most studies (20/22) split the cohort into a training and a test cohort, while only two of them further validated their model using an independent external cohort [25,29]. Nine studies (8/22) focused on solitary HCC, among which five focused on HCC with a diameter of less than 5 cm.
The incidence of MVI ranged from 25.3% to 67.5% for an individual entire cohort, and 25.3% to 56.4% for HCC less than 5 cm. Around two thirds (16/22) of the studies explicitly stated their definition of MVI. Table 1 gives more details about the general characteristics of the reviewed studies.

RQS and Risk of Bias Assessment
The average RQS score of the included studies was 10, accounting for 28% of the total points. The highest RQS score was 15 points (42%), seen in only one study [16]. Around half of the studies were credited between 11 and 14 points, corresponding to 30-40% of total points (Figure 3A). As no research considered the items of "phantom study", "prospective study", "detect and discuss biological correlates", "cost-effectiveness analysis", or "open science and data", these five items were assigned 0 points. Other poorly performed items include "imaging at multiple time points", "cut-off analysis", and "calibration statistics", in which the average points for each item were less than 30% (Figure 3B). A detailed description and a summary of the RQS score are provided in Table S5.
The summary of the risk of bias and the applicability concerns evaluated by the QUADAS-2 tool is presented in Figure 4. Most studies showed a low or unclear risk of bias in each domain. A detailed description of each domain is provided in Table S6.

Study Characteristics
According to the typical radiomics workflow, the study characteristics are summarized as follows.

Imaging Acquisition
CT was used in 10 studies, MRI in 10, both modalities in 1 [40], and PET/CT in 1 [34]. Most studies (16/22) exploited more than one phase/sequence to construct their prediction model. The interval between the preoperative imaging exam and liver resection (for histopathological diagnosis of MVI) varied from 1 week to 3 months (median: 1 month).

Tumor Segmentation
A majority of studies performed 3D segmentation (20/22). In 15 of these studies, 3D segmentation was achieved manually; in 3, segmentation was semi-automatic [21,28,34]; in the remaining 2, it was achieved automatically [20,31]. Two studies manually delineated the tumor on the cross-section slice with the largest tumor diameter [24,28]. Nine studies expanded the segmented tumor by various distances; the most common dilation distance was 10 mm from the tumor margin.
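Expanding a segmented tumor by a fixed distance amounts to a morphological dilation of the tumor mask measured in physical units. The sketch below is a plain-Python illustration under stated assumptions: the mask is stored as a set of (z, y, x) voxel indices, the voxel spacing in millimeters is known, and the function name `dilate_mask` is hypothetical. It does not reproduce the procedure of any reviewed study.

```python
import math
from itertools import product


def dilate_mask(mask, shape, spacing_mm, margin_mm):
    """Grow a voxel mask by margin_mm in every direction.

    mask       : set of (z, y, x) voxel indices inside the tumor (assumed layout)
    shape      : (nz, ny, nx) size of the image grid
    spacing_mm : (sz, sy, sx) voxel spacing in millimeters
    margin_mm  : dilation distance in millimeters
    """
    # Largest index offset along each axis that can still lie within the margin.
    reach = [int(math.floor(margin_mm / s)) for s in spacing_mm]
    # Offsets whose physical length is within the margin (a ball in mm space).
    offsets = [
        off
        for off in product(*(range(-r, r + 1) for r in reach))
        if sum((o * s) ** 2 for o, s in zip(off, spacing_mm)) <= margin_mm ** 2
    ]
    dilated = set()
    for voxel in mask:
        for off in offsets:
            moved = tuple(v + o for v, o in zip(voxel, off))
            # Keep only voxels that fall inside the image grid.
            if all(0 <= moved[i] < shape[i] for i in range(3)):
                dilated.add(moved)
    return dilated


# Hypothetical one-voxel "tumor" on an isotropic 1 mm grid, dilated by 2 mm.
tumor = {(5, 5, 5)}
expanded = dilate_mask(tumor, (11, 11, 11), (1.0, 1.0, 1.0), 2.0)
peritumoral_rim = expanded - tumor  # region outside the tumor but within the margin
```

Working in millimeters rather than voxel counts matters when slices are thicker than the in-plane resolution, because a 10 mm margin then spans fewer voxels along the slice axis than within the slice.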

Imaging Preprocessing and Feature Extraction
As imaging may come from different centers, different manufacturers, and different scanners, imaging preprocessing prior to feature extraction is necessary to increase the reliability of the textural measurements. Six studies (6/22) resampled the images before feature extraction, most often to a voxel size of 1 × 1 × 1 mm³.
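Resampling to a common voxel size can be illustrated with a small plain-Python sketch. It assumes a volume stored as nested lists indexed [z][y][x] and uses nearest-neighbor lookup purely for brevity; the reviewed studies may well have used other interpolators, and the function name is hypothetical.

```python
def resample_nearest(volume, spacing_mm, new_spacing_mm=(1.0, 1.0, 1.0)):
    """Resample a [z][y][x] nested-list volume onto a grid with new_spacing_mm.

    Nearest-neighbor lookup is an assumption of this sketch, chosen for brevity.
    """
    old_shape = (len(volume), len(volume[0]), len(volume[0][0]))
    # The physical extent is preserved: voxel count scales with old/new spacing.
    new_shape = tuple(
        max(1, round(n * s / t))
        for n, s, t in zip(old_shape, spacing_mm, new_spacing_mm)
    )

    def source_index(i, axis):
        # Center of output voxel i in millimeters, mapped back to an input index.
        center_mm = (i + 0.5) * new_spacing_mm[axis]
        return min(old_shape[axis] - 1, int(center_mm / spacing_mm[axis]))

    return [
        [
            [
                volume[source_index(z, 0)][source_index(y, 1)][source_index(x, 2)]
                for x in range(new_shape[2])
            ]
            for y in range(new_shape[1])
        ]
        for z in range(new_shape[0])
    ]


# Hypothetical scan with 3 mm slices and 1 mm in-plane resolution.
scan = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
isotropic = resample_nearest(scan, spacing_mm=(3.0, 1.0, 1.0))
print(len(isotropic), len(isotropic[0]), len(isotropic[0][0]))
```

In this example each 3 mm slice is repeated three times along the slice axis, so the two input slices become six 1 mm slices while the in-plane grid is unchanged.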
The most commonly used software to extract imaging features was PyRadiomics (9/22), followed by MATLAB or MATLAB-based software (6/22). The number of radiomics features extracted from each phase/sequence ranged from 58 to 2932.

Feature Selection and Modelling
To avoid potential overfitting during development of a radiomics model, feature selection and dimensionality reduction are necessary, as the radiomics features often outnumber the samples. The most widely used algorithm (15/22) was Least Absolute Shrinkage and Selection Operator (LASSO) regression, which selects informative variables by introducing L1 regularization. The number of imaging features included in the radiomics model ranged from 2 to 74 (median: 15), and the event/feature ratio ranged from 0.7 to 35.5 (median: 4.2). Nine studies further incorporated clinical risk factors into a combined prediction model. High alpha-fetoprotein (AFP) (9/22) and a large tumor size (4/22) were both frequently identified clinical risk factors for MVI.
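How L1 regularization performs feature selection can be seen in a small coordinate-descent sketch. It is written in plain Python under explicit assumptions: a least-squares LASSO (the reviewed studies may have used a logistic variant for the binary MVI outcome), features and outcome already centered, no intercept, and hypothetical function names and toy data.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1 / 2n) * ||y - Xw||^2 + alpha * ||w||_1 by coordinate descent.

    X is a list of rows (samples x features); y is a list of outcomes.
    Assumes centered columns and outcome, so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_scale = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_scale[j] == 0.0:
                continue  # constant (all-zero) column carries no information
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + w[j] * X[i][j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_scale[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_w
    return w


# Toy data: feature 0 drives the outcome, feature 1 is unrelated to it.
X = [[-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]]
y = [-2.0, 2.0, -2.0, 2.0]
weights = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [j for j, w_j in enumerate(weights) if w_j != 0.0]
```

The soft-thresholding step is what sets uninformative coefficients to exactly zero, so the surviving non-zero weights define the selected feature subset; larger values of `alpha` remove more features.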
It is worth mentioning that the reproducibility evaluation of imaging features can also be used for feature selection. Among the 10 studies that performed intraclass correlation coefficient (ICC) analysis, 4 set a threshold of 0.8 for robust features and selected those for further analysis [32,38,40,41].
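Filtering features by reproducibility can be sketched as follows. The example assumes the two-way random-effects, absolute-agreement, single-measurement form of the intraclass correlation coefficient, ICC(2,1); the reviewed studies did not necessarily use this form, and the function names and toy readings are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table of n subjects (rows) by k readers or repeats (columns)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))
    denominator = ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    if denominator == 0.0:
        return float("nan")  # no variation at all, ICC is undefined
    return (ms_rows - ms_err) / denominator


def select_robust_features(feature_readings, threshold=0.8):
    """Return the names of features whose ICC reaches the threshold."""
    return sorted(
        name
        for name, ratings in feature_readings.items()
        if icc_2_1(ratings) >= threshold
    )


# Hypothetical feature values from two segmentations of three tumors.
readings = {
    "stable_feature": [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    "unstable_feature": [[1.0, 3.0], [2.0, 1.0], [3.0, 2.0]],
}
robust = select_robust_features(readings, threshold=0.8)
```

The threshold of 0.8 mirrors the value used by four of the reviewed studies; it is a parameter here because no single cut-off is universally accepted.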

Performance of the Prediction Model
A majority of studies (20/22) split the subjects into training and test subsets. The median AUC in the test cohort was 0.79, ranging from 0.69 to 0.94. Two studies validated their models using an independent cohort with AUCs of 0.84 and 0.80 [29,31]. Only five studies reported the cut-off value when presenting the performance metrics [21,26,32,34,41]. Nine studies (9/22) evaluated the calibration of their model with a calibration curve and its clinical usefulness with decision curve analysis.
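The two kinds of performance metric mentioned here differ in whether they need a cut-off. The AUC is threshold-free: it equals the probability that a randomly chosen MVI-positive case receives a higher model score than a randomly chosen MVI-negative case. Sensitivity and specificity, by contrast, are only defined once a cut-off is fixed. A minimal plain-Python sketch with hypothetical scores and labels:

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    correct = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in positives
        for q in negatives
    )
    return correct / (len(positives) * len(negatives))


def sensitivity_specificity(scores, labels, cutoff):
    """Classify score >= cutoff as MVI-positive and return (sensitivity, specificity)."""
    tp = sum(1 for s, label in zip(scores, labels) if label == 1 and s >= cutoff)
    fn = sum(1 for s, label in zip(scores, labels) if label == 1 and s < cutoff)
    tn = sum(1 for s, label in zip(scores, labels) if label == 0 and s < cutoff)
    fp = sum(1 for s, label in zip(scores, labels) if label == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical model outputs for three MVI-positive and three MVI-negative cases.
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2]
labels = [1, 1, 1, 0, 0, 0]
print(auc(scores, labels))
print(sensitivity_specificity(scores, labels, cutoff=0.5))
```

Because sensitivity and specificity change with the cut-off while the AUC does not, a reported AUC alone is not enough for another group to reproduce the stated classification results.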
The abovementioned characteristics of the radiomics workflow are provided in detail in Table 2 and Table S7.

Discussion
The present study identified an ever-growing number of studies performing radiomics analysis of HCC for MVI prediction, mostly published in the last two years. The added value of radiomics in imaging modalities used in clinical routines has been explored extensively, with an AUC as high as 0.80-0.84 in two independent validation cohorts, shedding light on the management and prognosis prediction of HCC patients. Although the initial results are promising and encouraging, the methodological variability of the research is considerable, and the reporting quality is insufficient. Before translating the radiomics model to clinical applications, it is urgent to standardize the reporting norms to make the prediction models reproducible and reliable and to validate the models in external cohorts.
Radiomics research is a complex, interdisciplinary, multi-step project, involving image processing, big data handling, algorithm operating, model construction, and validation. Each step in the radiomics workflow can be achieved by several different strategies and approaches, which induces substantial methodological heterogeneity among radiomics studies. The variability started with different imaging modalities, followed by different tumor segmentation strategies and different categories of radiomics features, as well as different algorithms and classifiers used for feature selection and modelling. Moreover, variability existed even in the same imaging modality; e.g., MRI acquisition may vary in terms of the manufacturers, scanning protocols, contrast media, and sequence/phase used, and the various software and tools applied for feature extraction inevitably resulted in radiomics features with different nomenclatures. Therefore, it seems hard to pool data across studies and to enable a robust meta-analysis. Given that the radiomics workflow involves multiple steps, it poses a challenge for other researchers to reproduce findings when the original study does not supply sufficient detail. Instead, improving the reporting quality seems to be a practical approach to validating the findings and translating them into clinical utility. However, the present review has highlighted the insufficient reporting quality of current radiomics HCC-MVI research, which was reflected by an average RQS score of 10 (28% of the total points). This finding is similar to the result of a recent systematic review that evaluated radiomics research quality in the area of HCC, with a mean RQS score of 8.4 [42].
Five items of the RQS on which all included studies performed poorly are "prospective study", "phantom study", "biological correlates", "cost-effectiveness analysis", and "openness of data and code". A well-designed prospective study can minimize potential confounding factors and thus represents a higher level of evidence. Accordingly, prospective studies are given the highest weighting in the RQS tool (7 points, accounting for around 20% of the full scale). However, to date, no prospective radiomics MVI research has been performed. The purpose of a phantom study is to detect potential feature variability among different scanners and manufacturers. This is of great importance, as the evaluated cohorts often involve many scanners or even different medical centers. The phantom study process ensures that only robust features are included in the following radiomics analysis. Biological correlates aim to link imaging findings with gene or molecular signatures. However, none of the reviewed studies evaluated the gene or molecular levels of the tumor samples. Previous studies have detected a 91-gene signature that highly correlates with vascular invasion in HCC [43]. Based on this finding, a contrast-enhanced CT imaging biomarker, i.e., radiogenomic venous invasion (RVI), which includes three imaging features (internal arteries, a hypo-dense halo, and a tumor-liver difference), has been shown to be an accurate predictor of MVI [44]. Future studies are required to explore and verify the correlations between radiomics features and gene expressions. A cost-effectiveness analysis can evaluate a radiomics prediction model in terms of health economics when applied in clinical routines. It assumes that a novel predictor should not be more expensive than currently available predictors when accuracy is comparable. It also compares the health effect of using a radiomics predictor with that of not using one, for example, through a quality-adjusted life year analysis. We consider this evaluation less urgent, given that the methodological standardization and clinical/biological validation of current radiomics models are still lacking. Openness of data and code allows others to reproduce results and findings and to further validate and promote the prediction model in other centers. Though some initiatives have been proposed in an attempt to remove the obstacles in data sharing, other factors, such as legal/privacy issues, culture/language barriers, and insufficient staff/time, still exist [10]. None of the studies shared their code or imaging data publicly.
Regarding the items of "imaging at multiple time points" and "multiple segmentations", both aim to select stable imaging features for modelling considering subjective and temporal variations. However, fewer than half of the studies performed ICC analysis, and they seldom explicitly stated whether imaging features from different phases/sequences were evaluated during that analysis (i.e., test-retest analysis). Furthermore, there is no generally accepted ICC threshold at which radiomics features can be considered robust. Generally, when reporting ICC, values of 0.75-0.90 are regarded as indicating good reliability, and values higher than 0.90 are regarded as excellent [45]. However, among the studies that calculated ICC, the applied threshold varied among 0.75, 0.80, and 0.90. Future studies should determine the proper threshold at which radiomics features can be considered robust enough for modelling. Interestingly, some of the studies reported here did not rule out features with low ICC and constructed their model using the full set of features extracted from their images.
When evaluating the performance and clinical utility of the radiomics model considering the items of "cut-off analysis", "calibration statistics", "comparison with gold standard", "potential clinical utility", and "validation", the included studies again fell short. The performance metrics of a model, such as the sensitivity and specificity, are often determined by a specified cut-off value, and this value can further classify a patient cohort into high and low risk groups for a certain condition. A cut-off value is also one of the prerequisites for reproducing the results of previous research. However, only five studies reported their cut-off values. Regarding calibration analysis, which evaluates the agreement between predictions and the actual events, fewer than half of the studies performed one. Regarding the comparison with the "gold standard", there is currently no surrogate that can serve as a "gold standard" for MVI prediction. As the value of semantic imaging features has been extensively explored for MVI prediction, we defined conventional imaging features as the "gold standard". Among the 10 studies that compared prediction performance between radiomics and radiologist models, all declared that the radiomics models outperformed the radiologists' semantic models (Table S7). However, publication bias should be borne in mind when interpreting these results. Only two studies validated their models using independent external cohorts. However, one of them validated their model in only 18 patients, which is not a sufficiently large validation cohort according to the "10-EPV" principle (at least 10 events per variable) [16,46]. When developing a prediction model, the ratio of events to variables should be maintained at an adequate level to avoid potential overfitting or underfitting. Among the 16 studies with an EPV ratio available, the median EPV (MVI-positive cases/features) ratio was 4.2, indicating a potential risk of overfitting. Therefore, before these models are translated into routine clinical use, several practical issues should be addressed, such as the reproducibility of the radiomics model, the standardization of imaging protocols, model overfitting, and the external validation of the prediction models.
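The events-per-variable check discussed above is simple arithmetic and can be sketched in a few lines. The function names and the example counts are hypothetical; only the definition (MVI-positive cases divided by model features) and the minimum of 10 come from the text.

```python
def events_per_variable(n_events, n_features):
    """EPV: number of MVI-positive cases divided by the number of model features."""
    if n_features <= 0:
        raise ValueError("the model must contain at least one feature")
    return n_events / n_features


def meets_10_epv(n_events, n_features, minimum=10.0):
    """True when the model respects the '10-EPV' rule of thumb."""
    return events_per_variable(n_events, n_features) >= minimum


def max_features_for(n_events, minimum=10):
    """Largest number of features that still satisfies the EPV minimum."""
    return n_events // minimum


# Hypothetical cohort: 60 MVI-positive patients and a 15-feature signature.
print(events_per_variable(60, 15))  # EPV of this hypothetical model
print(meets_10_epv(60, 15))         # whether it satisfies the 10-EPV rule
print(max_features_for(60))         # feature budget for 60 events
```

Under the 10-EPV rule, a cohort with 60 events would support a model of at most 6 features, which illustrates why a median EPV of 4.2 points to a risk of overfitting.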
Though the RQS tool aims for high-quality radiomics research, there are concerns that should be addressed in future revisions. The current RQS is mainly focused on radiomics itself and ignores non-radiomics components of radiomics model/predictor development, such as blinding to outcomes and measurement, intervals between the index test and reference standard (in the case of MVI, the time between imaging and liver resection), and the influence of sample size and enrollment of study subjects. All these factors may also introduce bias. In this context, the QUADAS-2 tool can serve as a vital supplement to the RQS when evaluating the quality of radiomics research.
Most of the studies reported in this systematic search showed a low or unclear risk in the four domains of risk of bias evaluation. The missing or unclear parts observed using the RQS and QUADAS-2 tools were obvious, which implies that these tools might not be so well known or adopted. Future researchers will ideally apply the RQS or QUADAS-2 as a checklist to improve the quality of their reports. In fact, a specified checklist, i.e., CLAIM (Checklist for Artificial Intelligence in Medical Imaging) for artificial intelligence research [47], and a general guideline for diagnostic/prognostic prediction, i.e., TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) [15], have already been proposed.
This systematic review has some limitations. Firstly, high-level evidence from studies with a prospective design and an independent external validation cohort is lacking, so a definitive and convincing conclusion about the efficacy of the radiomics model for MVI prediction cannot be drawn. Secondly, we did not synthesize the performance metrics of the prediction model, given the high methodological heterogeneity of each study. Therefore, model performance comparisons between semantic-feature-based models and radiomics models, between CT-based models and MRI-based models, and between dilated-VOI-based models and non-dilated models could not be performed. Thirdly, we did not evaluate the specific radiomics features shared among different models due to the variability of imaging modalities and the extraction software used.

Conclusions
Even though current studies were preliminary and the methodological quality was insufficient, the radiomics model has the potential to provide an accurate and effective tool to preoperatively predict MVI presence in patients with HCC. Future prospective studies with an external validation cohort in accordance with a standardized radiomics workflow and reporting norms are expected to supply a reliable, reproducible, and accurate radiomics model for clinical implementation.