Utilization of a Machine Learning Algorithm for the Application of Ancillary Features to LI-RADS Categories LR3 and LR4 on Gadoxetate Disodium-Enhanced MRI

Simple Summary: In the Liver Imaging Reporting and Data System (LI-RADS), liver observations are categorized as LR1-LR5 according to the probability of benignity or hepatocellular carcinoma (HCC) on the basis of major features. Subsequent adjustment is allowed using ancillary features (AFs). However, the LI-RADS does not provide specific guidelines for applying them. In this study, we evaluated the utility of a machine-learning-based strategy for applying AFs to LR3/4 observations on MRI. Our decision tree algorithm for applying AFs to LR3/4 observations provided significantly higher AUC, sensitivity, and accuracy than other methods, albeit with reduced specificity. It may therefore be useful in circumstances in which the focus is on the early detection of HCC.

Abstract: Background: This study aimed to identify the important ancillary features (AFs) and evaluate the utility of a machine-learning-based strategy for applying AFs to LI-RADS LR3/4 observations on gadoxetate disodium-enhanced MRI. Methods: We retrospectively analyzed MRI features of LR3/4 observations determined with only major features. Uni- and multivariate analyses and random forest analysis were performed to identify AFs associated with HCC. A decision tree algorithm for applying AFs to LR3/4 observations was compared with alternative strategies using McNemar's test. Results: We evaluated 246 observations from 165 patients. In multivariate analysis, restricted diffusion and mild-moderate T2 hyperintensity showed independent associations with HCC (odds ratios: 12.4 [p < 0.001] and 2.5 [p = 0.02]). In random forest analysis, restricted diffusion was the most important feature for HCC. Our decision tree algorithm showed higher AUC, sensitivity, and accuracy (0.84, 92.0%, and 84.5%) than the criterion of using restricted diffusion alone (0.78, 64.5%, and 76.4%; all p < 0.05); however, it showed lower specificity than that criterion (71.1% vs. 91.3%; p < 0.001). Conclusion: Our decision tree algorithm for applying AFs to LR3/4 observations shows significantly increased AUC, sensitivity, and accuracy but reduced specificity. It appears to be more appropriate in circumstances in which there is an emphasis on the early detection of HCC.


Introduction
The Liver Imaging Reporting and Data System (LI-RADS) was released in 2011 by the American College of Radiology [1] and was continuously updated until 2018 to improve diagnostic accuracy and promote communication between healthcare providers by standardizing the interpretation and categorization of liver observations. In the LI-RADS, liver observations are categorized as LR1 to LR5 according to the probability of benignity or hepatocellular carcinoma (HCC) through an algorithm that includes size and major features (MFs) such as nonrim arterial phase hyperenhancement (APHE), enhancing capsule, nonperipheral washout, and threshold growth. After allocation to a category based on the MFs, adjustment using ancillary features (AFs) is allowed [2]. Observations can be upgraded by one category up to LR4 using AFs favoring malignancy, whereas they can be downgraded by one category using AFs favoring benignity. Based on a recent meta-analysis, the proportion of HCC is 0% in LR1, 13% in LR2, 38% in LR3, 74% in LR4, and 94% in LR5 [3], although the rates in the lower categories may be inflated owing to selection bias toward biopsied lesions. Once an LR category is assigned to each observation, LR3 and LR4 observations are recommended for repeat or alternative diagnostic imaging in 3-6 months, multidisciplinary discussion, or biopsy. However, biopsy is invasive and carries procedural risk, and it may fail because the lesions classified as LR3 or LR4 are mainly small. In addition, there is a risk of missing local treatment opportunities or of a decrease in the number of treatment options due to increased lesion size or vascular invasion during follow-up without treatment. The LI-RADS does not provide specific guidelines for the application of AFs, and the utilization of AFs is at the discretion of the radiologist in each case.
Furthermore, the proportion of observations whose LI-RADS category changed after the application of AFs has varied widely among studies, from 18.1% to 56.4% [4-6]. Studies have shown that some observations remained in the initial category even after the application of AFs; therefore, specific and appropriate instructions for the application of AFs are necessary to improve the accuracy and timeliness of diagnosis. Prior studies have reported widely variable performance of AFs, with sensitivity of 3-62% and specificity of 79-99% [5,7-9], and certain AFs, such as mild-moderate T2 hyperintensity or hepatobiliary hypointensity, showed stronger associations than other AFs. However, some studies were limited to the analysis of hepatic observations already categorized as LR5; thus, these studies involved unavoidably inflated sensitivity and unreliable estimates of the association of AFs with HCC. Several studies have reported various rules for applying AFs to improve the diagnostic performance of LR3 and LR4 observations on gadoxetate disodium-enhanced MRI (using 2-4 or more AFs or specific combinations of independent features identified by multivariate analysis) [4,10-12].
Recently, artificial intelligence has been actively utilized in many complex problems to facilitate the identification of complex patterns and relationships within various parameters. It has the potential to rapidly evolve into an applicable solution in the medical field to improve diagnostic accuracy, treatment strategy, and follow-up outcomes [13,14]. However, to our knowledge, no study has examined the effect of machine learning algorithms on applying AFs in the LI-RADS.
Therefore, this study aimed to identify the important features of AFs and determine the utilization of a machine-learning-based strategy for applying AFs to LR3 and LR4 observations on gadoxetate disodium-enhanced MRI.

Materials and Methods
This retrospective study was conducted at a single center after approval was obtained from the institutional review board. The requirement for informed consent was waived due to the retrospective nature of the study.

Study Subjects
We searched our institution's electronic medical records and identified 523 treatment-naïve patients at risk for HCC who underwent gadoxetate disodium-enhanced MRI between January 2017 and February 2022. We included patients who met the following criteria: (1) age ≥ 18 years; (2) high risk for HCC according to the LI-RADS v2018 (presence of cirrhosis or chronic hepatitis B infection regardless of the presence of cirrhosis); and (3) MRI findings of a focal hepatic solid nodule. We excluded 356 patients based on the following criteria: (1) inadequate final diagnosis such as unknown final diagnosis of malignancy as a result of immediate locoregional therapy or insufficient follow-up (<2 years) for benign lesions to determine size stability (n = 308); (2) poor quality of images for interpretation (n = 3); (3) only observations categorized as LR1, LR2, LR5, LR-TIV,

MRI Techniques
The gadoxetate disodium-enhanced liver MRI examinations were conducted using 3-T MRI scanners (MAGNETOM Vida, Siemens Healthcare, Erlangen, Germany; SIGNA Architect, GE). The imaging protocol included the following sequences: axial T2-weighted single-shot fast spin echo; axial T2-weighted fast spin echo; axial dual-gradient-recalled echo (GRE) T1-weighted sequence (in-phase and opposed-phase); and axial T1-weighted three-dimensional (3D) GRE with fat suppression (liver acquisition with volume acceleration-LAVA, or volumetric interpolated breath-hold examination-VIBE) obtained before and after the intravenous bolus injection of 0.025 mmol/kg gadoxetate disodium at a rate of 1.0 mL/s, followed by a 20 mL saline flush. Postcontrast axial 3D GRE images were obtained during the late hepatic arterial phase (AP; 5 s after peak aortic enhancement determined using 1 mL test bolus injection), portal venous phase (PVP; 50 s), transitional phase (TP; 3 min), and hepatobiliary phase (HBP; 20 min). Diffusion-weighted images were acquired using a maximum b-value of 800 s/mm². Details of the MRI parameters are shown in Table A1 in Appendix A.

Image Analysis
Image analyses were performed by two board-certified radiologists with >9 years of experience in hepatic imaging, who were blinded to any information about clinical history or final diagnosis. Any disagreement was resolved by consensus. Nodule size and the presence or absence of MFs (nonrim APHE, nonperipheral washout, or enhancing capsule) according to the LI-RADS v2018 were analyzed. The following AFs were also assessed based on the LI-RADS v2018: (1) AFs favoring malignancy in general: mild-moderate T2 hyperintensity, corona enhancement, fat sparing in a solid mass, iron sparing in a solid mass, TP hypointensity, HBP hypointensity, and restricted diffusion; and (2) AFs favoring HCC in particular: nonenhancing capsule, nodule-in-nodule, mosaic architecture, blood products in mass, and fat in mass more than that in the adjacent liver. Imaging features regarding interval change in tumor size (threshold growth, subthreshold growth, size stability ≥ 2 years, or size reduction) and visibility as a discrete nodule on ultrasound were not assessed because only focal lesions initially detected on MRI were included, and prior imaging studies for comparison were not provided. AFs favoring benignity were also not analyzed in this study because no observation showed AFs favoring benignity in the preliminary imaging analysis.

Extracting Important Features and Constructing a Machine-Learning-Based Algorithm for Applying AFs
Evaluating the influence of each factor is important for understanding the prediction process. In this study, we evaluated the feature importance using a random forest model. A random forest [18] is an ensemble of decision trees. First, bootstrap datasets were drawn from the training data by sampling with replacement (that is, allowing duplication), and a decision tree was trained on each sampled dataset. Second, the resulting collection of differently structured trees constitutes the random forest. Important features tend to be located near the top of each tree, and the importance of each feature can be determined by aggregating statistics across the trees.
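The resampling idea described above can be illustrated with a minimal, standard-library sketch. It is not the study's pipeline: the toy data, the feature order (hypothetically, column 0 = restricted diffusion and column 1 = mild-moderate T2 hyperintensity), and the helper names are assumptions made for illustration. The sketch draws bootstrap samples and counts how often each binary feature would be chosen for the root split, which is only a crude proxy for importance; a full random forest also subsamples features at each split and typically averages impurity decreases over all nodes.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_c p_c^2 of a list of class labels (0.0 if empty)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def root_feature(rows, labels):
    """Index of the binary feature whose split has the lowest weighted Gini impurity."""
    m = len(labels)

    def cost(k):
        absent = [y for x, y in zip(rows, labels) if x[k] == 0]
        present = [y for x, y in zip(rows, labels) if x[k] == 1]
        return len(absent) / m * gini(absent) + len(present) / m * gini(present)

    return min(range(len(rows[0])), key=cost)


def root_split_frequency(rows, labels, n_trees=200, seed=0):
    """Fraction of bootstrap samples in which each feature wins the root split."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    m = len(labels)
    counts = Counter()
    for _ in range(n_trees):
        # Bootstrap: draw m indices with replacement (duplicates allowed).
        idx = [rng.randrange(m) for _ in range(m)]
        counts[root_feature([rows[i] for i in idx], [labels[i] for i in idx])] += 1
    return [counts[k] / n_trees for k in range(len(rows[0]))]


# Hypothetical toy data (NOT study data): 1 = feature present / HCC, 0 = absent / non-malignant.
base_rows = [[1, 1], [1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1], [0, 0]]
base_labels = [1, 1, 1, 1, 1, 0, 0, 0]
rows, labels = base_rows * 5, base_labels * 5

freq = root_split_frequency(rows, labels)
print(freq)  # feature 0 should win the root split far more often than feature 1
```

In scikit-learn, the analogous quantity is exposed by `RandomForestClassifier` through its impurity-based `feature_importances_` attribute; whether the study used exactly that attribute is not stated in the text.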
After determining the feature importance by random forest, we used a decision tree model for HCC prediction. A decision tree [18] is one of the most widely used machine learning algorithms. Its advantage is that it is intuitive and explainable for both classification and regression; therefore, it is easy to understand why a given result is predicted. In other words, unlike other machine learning algorithms such as k-nearest neighbors (KNN) and support vector machines (SVM), a decision tree offers both the result and the prediction process that led to it. In this study, we used scikit-learn (v1.1.1) [19] in Python for decision tree training. scikit-learn trains decision trees with the classification and regression tree (CART) method. The CART sets a threshold for one input feature, divides the training data into two subsets according to that threshold, and chooses the feature and threshold that minimize the weighted impurity of the two resulting subsets. The cost function of the classification CART is given by Equation (1):

$$J(k, t_k) = \frac{m_{\mathrm{left}}}{m} G_{\mathrm{left}} + \frac{m_{\mathrm{right}}}{m} G_{\mathrm{right}} \tag{1}$$

where $k$ is the index of the input feature, $t_k$ is the threshold of the $k$th feature, $G_{\mathrm{left/right}}$ is the impurity of the left/right subset, $m_{\mathrm{left/right}}$ is the number of samples in the left/right subset, and $m$ is the total number of samples. The CART builds its subtrees by recursively dividing each subset. Impurity can be measured using one of two criteria: Gini or entropy. In this study, we used the Gini criterion. Many ML algorithms are not reproducible by default because parts of the training procedure are random. To address this, most ML libraries, including scikit-learn, allow the random state to be fixed so that the same pseudo-random numbers are generated on every run. In other words, we used this mechanism in scikit-learn to fix the randomness of both the data splits and the hyperparameters.
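As a worked illustration of the CART split criterion, the following standard-library sketch evaluates the weighted Gini cost, (m_left/m)·G_left + (m_right/m)·G_right, for every candidate feature and threshold and returns the cheapest split. The toy observations and the feature order are hypothetical and are not the study data; scikit-learn's actual implementation is more elaborate (recursion, stopping rules, tie handling).

```python
from collections import Counter


def gini(labels):
    """Gini impurity G = 1 - sum_c p_c^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def cart_cost(rows, labels, k, t_k):
    """Weighted impurity of splitting on feature k at threshold t_k."""
    left = [y for x, y in zip(rows, labels) if x[k] <= t_k]
    right = [y for x, y in zip(rows, labels) if x[k] > t_k]
    m = len(labels)
    return len(left) / m * gini(left) + len(right) / m * gini(right)


def best_split(rows, labels):
    """Search all (feature, threshold) pairs and return (cost, feature, threshold)."""
    candidates = []
    for k in range(len(rows[0])):
        values = sorted({x[k] for x in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            candidates.append((cart_cost(rows, labels, k, t), k, t))
    return min(candidates)


# Hypothetical toy data: columns = [feature 0, feature 1], label 1 = HCC, 0 = non-malignant.
rows = [[1, 1], [1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1], [0, 0]]
labels = [1, 1, 1, 1, 1, 0, 0, 0]

cost, feature, threshold = best_split(rows, labels)
print(cost, feature, threshold)  # 0.1875 0 0.5
```

Here splitting on feature 0 leaves one pure subset (all four positives) and one subset with impurity 0.375, giving a cost of 0.5 × 0 + 0.5 × 0.375 = 0.1875, whereas splitting on feature 1 costs 0.4375, so feature 0 is placed at the root.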

Statistical Analysis
All statistical analyses were performed on a per-observation basis. Continuous variables are presented as mean ± standard deviation (SD) or as median and interquartile range (IQR) and were compared between HCC and non-malignant nodules using the Student's t-test or the nonparametric Mann-Whitney U test. Categorical variables, including MFs and AFs, are expressed as numbers and frequencies and were compared using the chi-squared test or Fisher's exact test, as appropriate. To identify significant AFs suggestive of HCCs rather than non-malignant nodules in initial LR3 or LR4 observations (LR3 or LR4 determined with only MFs, regardless of AFs), univariate and multivariate logistic regression analyses were performed. In the multivariate analysis, variables that showed a significant association with HCC (p < 0.05) in the univariate analysis were entered, and backward stepwise elimination was performed. The inter-reader agreement was evaluated using kappa statistics.
The important AFs were identified using random forest analysis, and a decision tree algorithm was constructed for the application of AFs to improve diagnostic performance in the LR3 and LR4 categories.
Sensitivity, specificity, accuracy, area under the receiver operating characteristic (ROC) curve (AUC), positive predictive value, and negative predictive value were calculated to evaluate the diagnostic performance of our diagnostic systems. Statistical significance was set at p < 0.05. Statistical analyses were performed using SPSS version 25.0 (IBM Inc., Armonk, NY, USA) and MedCalc version 19.4.0 (MedCalc Software). The important AFs were identified, and a decision tree algorithm for applying AFs to the LR3 and LR4 categories was constructed using the Python 3.8.13 module scikit-learn (v1.1.1) (Python Software Foundation, Wilmington, DE, USA).
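For reference, the performance measures listed above follow directly from the counts of a 2 × 2 confusion matrix. The sketch below is a generic illustration; the function name and the example counts are hypothetical and are not taken from the study. For a criterion that yields a single yes/no call, the ROC curve has one operating point, and the area under it reduces to the mean of sensitivity and specificity.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard diagnostic performance measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true-positive rate among HCCs
    specificity = tn / (tn + fp)  # true-negative rate among non-malignant nodules
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "ppv": tp / (tp + fp),  # positive predictive value
        "npv": tn / (tn + fn),  # negative predictive value
        # AUC of a single binary operating point (trapezoidal area under the ROC curve).
        "single_point_auc": (sensitivity + specificity) / 2,
    }


# Hypothetical counts for illustration only.
metrics = diagnostic_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

As a consistency check on the single-point formula, a binary criterion with sensitivity 64.5% and specificity 91.3% gives (0.645 + 0.913) / 2 ≈ 0.78, which agrees with the AUC reported for the "restricted diffusion only" criterion.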

Comparison of Imaging Features between HCCs and Non-Malignant Nodules and Important Features for Diagnosing HCC in LR3 and LR4 Observations
Comparative analyses of the imaging features between HCC and non-malignant nodules are summarized in Table 2. The most common AF recorded in HCC was HBP hypointensity (131, 95.6%), followed by TP hypointensity (113, 82.5%), restricted diffusion (88, 64.2%), and mild-moderate T2 hyperintensity (86, 62.8%). Fat sparing and iron sparing in the solid mass were not observed. Univariate analyses demonstrated that restricted diffusion, mild-moderate T2 hyperintensity, TP hypointensity, HBP hypointensity, nonenhancing capsule, and nodule-in-nodule appearance were significantly associated with HCC (all p < 0.05). In the multivariate analyses, restricted diffusion (odds ratio [OR], 12.4; 95% confidence interval [CI], 5.1-30.35; p < 0.001) and mild-moderate T2 hyperintensity (OR, 2.5; 95% CI, 1.1-5.3; p = 0.02) were independent significant features associated with HCC (Table 3). Random forest analysis showed restricted diffusion as the most important feature (feature importance ratio: 0.48), followed by mild-moderate T2 hyperintensity (feature importance ratio: 0.21; Figure 1). Interobserver agreement for the AFs is presented in Table A2. Kappa values for the individual AFs ranged from 0.21 to 0.74. Among the AFs, HBP hypointensity showed the highest kappa value. Restricted diffusion and mild-moderate T2 hyperintensity both showed moderate agreement, with a kappa value of 0.55 for each. Nodule-in-nodule architecture showed a relatively high proportion of agreement (82.7%); however, it showed the lowest kappa value (0.21) due to its low prevalence.
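The effect of prevalence on kappa noted above can be reproduced with a short calculation. The sketch below computes Cohen's kappa from a 2 × 2 table of paired reader calls; the counts are hypothetical and were chosen only so that both tables have the same raw agreement (83%), and they are not the study's reader data.

```python
def cohen_kappa(both_pos, r1_only, r2_only, both_neg):
    """Cohen's kappa for two readers making binary calls on the same observations."""
    n = both_pos + r1_only + r2_only + both_neg
    observed = (both_pos + both_neg) / n      # raw proportion of agreement
    r1_pos = (both_pos + r1_only) / n         # reader 1 positive rate
    r2_pos = (both_pos + r2_only) / n         # reader 2 positive rate
    # Agreement expected by chance if the readers called independently.
    expected = r1_pos * r2_pos + (1 - r1_pos) * (1 - r2_pos)
    return (observed - expected) / (1 - expected)


# Hypothetical tables with identical raw agreement (83 of 100 observations).
kappa_rare = cohen_kappa(both_pos=5, r1_only=9, r2_only=8, both_neg=78)     # rare feature
kappa_common = cohen_kappa(both_pos=42, r1_only=9, r2_only=8, both_neg=41)  # balanced feature

print(round(kappa_rare, 2), round(kappa_common, 2))  # 0.27 0.66
```

When a feature is rare, most of the raw agreement comes from concordant negative calls that would also be expected by chance, so kappa is low even though the proportion of agreement is high.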

Comparison of Diagnostic Performance of Decision Tree Algorithm with Alternative Criteria of Applying AFs
We also established alternative criteria for the application of AFs to LR3 and LR4 for HCC diagnosis. The diagnostic performance of these criteria for the application of AFs (according to the number of AFs and exclusive usage of significant AFs or their combination) is presented in Table 4. Figure A1 in Appendix A shows the diagnostic performance of the application criteria according to the number of AFs favoring malignancy at every cutoff point, that is, the number of AFs ≥1 to ≥6. A cutoff value of ≥3 showed the highest diagnostic performance, with an AUC of 0.75 (95% CI, 0.75-0.76); sensitivity of 77.6% (95% CI, 76.4-78.8); specificity of 72.9% (95% CI, 72.2-73.7); and accuracy of 75.5% (95% CI, 75.0-76.1). We also analyzed the application criteria via various combinations of independently significant AFs (restricted diffusion and mild-moderate T2 hyperintensity) identified from the multivariate and random forest analyses. Among those criteria, the criterion of "restricted diffusion only" yielded the highest AUC (0.78) compared with the criterion of "restricted diffusion or mild-moderate T2 hyperintensity" (AUC, 0.76; p = 0.025); "restricted diffusion and mild-moderate T2 hyperintensity" (AUC, 0.75; p = 0.01); and "mild-moderate T2 hyperintensity only" (AUC, 0.73; p < 0.001; Table A3 in Appendix A). Our decision tree approach had higher AUC, sensitivity, and accuracy than the other criteria (all p ≤ 0.002, Figures 3 and 4); however, it showed a significantly reduced specificity compared with the criterion of "restricted diffusion only" (Table 4).
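The abstract states that the strategies were compared using McNemar's test, which is suited to paired data because two criteria are applied to the same observations. A minimal sketch of the exact (binomial) form of the test is given below; only the discordant pairs enter the calculation. The function name and the counts are hypothetical, and the text does not state whether the exact or the chi-squared form was used.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: observations classified correctly by criterion A but not by criterion B
    c: observations classified correctly by criterion B but not by criterion A
    Under the null hypothesis, each discordant pair falls in b or c with probability 0.5.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Lower-tail probability P(X <= k) for X ~ Binomial(n, 0.5), then doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, e.g. among HCCs when comparing two sensitivities.
p_value = mcnemar_exact(b=10, c=2)
print(round(p_value, 4))  # 0.0386
```

Concordant pairs (both criteria right or both wrong) carry no information about which criterion performs better, which is why they do not appear in the formula.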

Discussion
The current study revealed two AFs favoring malignancy (restricted diffusion and mild-moderate T2 hyperintensity) as significant independent features for the diagnosis of HCC in LR3 and LR4 observations on gadoxetate disodium-enhanced MRI in both multivariate and random forest analyses. Furthermore, we presented a strategy for applying AFs in the diagnosis of HCC from the initial LR3 and LR4 observations, which are categorized using only the MFs of the LI-RADS v2018. We developed a decision tree algorithm to apply AFs to LR3 and LR4 observations. This approach was compared with other alternative approaches using the number of AFs or various combinations of significant AFs identified from multivariate and random forest analyses. Our decision tree approach showed the highest AUC, sensitivity, and accuracy compared with other criteria, albeit with somewhat compromised specificity.
Our results showed that restricted diffusion and mild-moderate T2 hyperintensity were independent and important features for the diagnosis of HCC from the LR3 and LR4 observations in both multivariate and random forest analyses. In particular, restricted diffusion showed the highest odds ratio (12.4; 95% CI, 5.1-30.0) and importance ratio (0.48), which is consistent with the results of previous studies [20-22]. Diffusion restriction is not specific to HCC but is more often used for detecting hepatic lesions and discriminating malignant lesions from benign lesions [23,24]. It is important to understand hepatocarcinogenesis in the early diagnosis of HCC from a premalignant lesion such as a dysplastic nodule. Hepatocarcinogenesis is a multi-step process that starts from a regenerative nodule in cirrhosis or as a dysplastic nodule and progresses to advanced HCC [25]. Given that one of the major histologic differences between dysplastic nodules and early HCC is the degree of cellular density, restricted diffusion reflecting the high cellularity of lesions might help in the better discrimination of HCC from non-malignant lesions [15].
T2 hyperintensity is a typical imaging feature of HCC and helps differentiate hypovascular HCC from dysplastic nodules [26]. We observed that mild-moderate T2 hyperintensity has an OR of 2.5 and an importance ratio of 0.21 and is the second most significant feature after restricted diffusion. According to previous studies, mild-moderate T2 hyperintensity has been proven to be a suggestive feature of progressed HCC rather than early HCC [27,28]. When a focus of HCC develops within a dysplastic nodule, a mildly elevated signal may be observed on T2-weighted images, representing the focus of HCC within the hypointense dysplastic nodule, and has been described as a "nodule-in-nodule" appearance. This is consistent with our results showing that presentations of a nodule-in-nodule appearance were significantly more frequently encountered with HCCs than with non-malignant nodules, although their numbers were small. Thus, the entire change in T2 signal intensity of the observation may reflect the progressive biological characteristics of HCC.
In our study, among AFs favoring malignancy, the most commonly encountered feature was HBP hypointensity. However, this feature was not significantly associated with HCC in the multivariate analysis. This might be because this feature is frequent even in non-malignant nodules, which is consistent with the results of previous studies [11,29]. Because organic anion-transporting polypeptides that mediate hepatic uptake of gadoxetic acid may decrease in expression in the early stage of hepatocarcinogenesis, dysplastic nodules or regenerative nodules, and even hemangiomas or cysts, can present with HBP hypointensity [30,31]. Therefore, applying this characteristic to the LR3 and LR4 categories may raise concerns regarding false positives when diagnosing HCC. Our findings were consistent with this: HBP hypointensity was a significantly frequent finding among AFs in cases misclassified by our decision tree algorithm (Table A4).
In the decision tree algorithm for the application of AFs to LR3 and LR4 observations, the results showed a good ability of the method to diagnose HCC, with an AUC of 0.84 (95% CI, 0.84-0.85); sensitivity of 92.0% (95% CI, 91.6-92.4); specificity of 71.1% (95% CI, 70.9-71.4); and accuracy of 84.5% (95% CI, 84.1-84.8). The decision tree algorithm consists of restricted diffusion, nodule-in-nodule, mild-moderate T2 hyperintensity, blood in mass, TP hypointensity, corona enhancement, HBP hypointensity, and fat in mass. Interestingly, in the decision tree algorithm, nodule-in-nodule, blood in mass, TP hypointensity, corona enhancement, HBP hypointensity, and fat in mass, which failed to demonstrate independent associations in the present study, were correlated with restricted diffusion and mild-moderate T2 hyperintensity. This may indicate that minor AFs still play important roles in the diagnosis of HCC among observations that already exhibit the more heavily weighted features.
We also evaluated alternative application algorithms using a criterion based on the number of AFs and a criterion utilizing independent features identified by multivariate analysis. These approaches have been addressed in previous studies [10-12,32]. Kang et al. reported that a criterion of ≥4 AFs showed a sensitivity of 80.6% and a specificity of 70.0% [12], and Cannella et al. reported that a criterion of ≥2 AFs showed a sensitivity of 72.6% and a specificity of 91.5% [11]. The present study also showed good diagnostic ability for HCC using a criterion of ≥3 AFs, with a sensitivity of 77.6% and a specificity of 72.9%. Direct comparison of results among the studies may not be meaningful because of the differences in the characteristics of the study populations. Nevertheless, a strength of our results is that they avoid the overestimation of sensitivity because LR5 observations were excluded. In addition, this approach showed significantly lower sensitivity and accuracy than the decision tree algorithm.
Cannella et al., Lee et al., and Jeon et al. showed how to incorporate AFs identified as significant independent features in multivariate analysis to enhance diagnostic performance in the LR3 and LR4 categories in the LI-RADS diagnostic table [10,11,32]. In the present study, among the combinations of significant AFs, the highest diagnostic performance for HCC was achieved by applying restricted diffusion alone to the LR3 and LR4 categories. This approach had significantly higher specificity than our decision tree algorithm (91.3% vs. 71.1%, p < 0.001). Nevertheless, our decision tree algorithm showed significantly higher AUC, sensitivity, and accuracy in HCC diagnosis in LR3 and LR4 observations (AUC, 0.84 vs. 0.78, p < 0.001; sensitivity, 92.0% vs. 64.5%, p < 0.001; and accuracy, 84.5% vs. 76.4%, p = 0.032) than the criterion applying restricted diffusion alone to the LR3 and LR4 categories. Indeed, the LI-RADS is designed to promote the specific diagnosis of HCC [1]. Other Western countries also adopt specific diagnostic algorithms to avoid false-positive diagnoses of HCC, because liver transplantation is the only potentially curative treatment in patients with advanced cirrhosis, who predominantly constitute individuals with a high risk of HCC in Western countries [33,34]. Meanwhile, Asian countries prefer sensitive diagnosis of HCC to detect HCC in its early stages and to provide patients with local treatment, such as resection or ablation, as a curative treatment [33,34]. Therefore, despite its significantly reduced specificity, our decision tree algorithm for the application of AFs, with its significantly higher sensitivity, may be better suited to practice in Asian countries.
This decision tree algorithm is a conceptually simple decision-making model and presents the diagnostic process for HCC as an easy-to-understand classification system. Thus, it may be useful in situations in which a decision must be made effectively and reliably. Although the absence of AFs favoring benignity is a limitation of our decision tree algorithm, it may still be useful in daily practice compared with the alternative approaches.
Our study had several limitations. First, there may have been an inevitable selection bias owing to the retrospective nature of this study. Among the initially eligible patients, 308 patients were excluded from the study population due to the lack of a final diagnosis. Among these, the majority had LR4 observations. They tended to be treated with locoregional treatment without pathological confirmation, especially when they exhibited a co-existing LR5 observation. Second, our study analyzed LR3 and LR4 observations together. As LR3 and LR4 may show different distributions of AFs between HCC and non-malignant nodules, separate subgroup analyses of LR3 and LR4 would yield more reliable results. Subgroup analysis could not be performed in this study owing to the insufficient number of LR4 lesions; therefore, a larger study should be conducted in the future.
Lastly, the majority of benign lesions in this study were not confirmed using biopsy but by follow-up imaging. To minimize misdiagnosis, we considered benignity based on long-term stability (≥24 months), whereas HCC was considered based on the presence of LR5 observation on follow-up imaging [5,11,12,16,17,26].

Conclusions
In conclusion, among the LI-RADS v2018 AFs favoring malignancy, restricted diffusion and mild-moderate T2 hyperintensity were strongly associated with HCC in LR3 and LR4 observations. Our decision tree algorithm for applying AFs to LR3 and LR4 observations provides significantly increased AUC, sensitivity, and accuracy but reduced specificity. It appears to be more appropriate for application in circumstances that emphasize the early detection of HCC.

Data Availability Statement:
The data presented in this study are available in this article.

Acknowledgments:
The authors thank the reviewers and editors from the CANCERS journal.