Artificial Intelligence in the Diagnosis of Hepatocellular Carcinoma: A Systematic Review

Hepatocellular carcinoma ranks fifth amongst the most common malignancies and is the third most common cause of cancer-related death globally. Artificial intelligence is a rapidly growing field of interest. Following the PRISMA reporting guidelines, we conducted a systematic review to retrieve articles reporting the application of AI in HCC detection and characterization. A total of 27 articles were included and analyzed with our composite score for the evaluation of the quality of the publications. The contingency table showed a statistically significant constant improvement of the total quality score over the years (p = 0.004). Various AI methods were adopted across the included articles: 19 articles studied CT (41.30%), 20 studied US (43.47%), and 7 studied MRI (15.21%). No article discussed the use of artificial intelligence in PET or X-ray technology. Our systematic approach has shown that previous works in HCC detection and characterization have assessed the comparability of conventional interpretation with machine learning using US, CT, and MRI. The distribution of the imaging techniques in our analysis reflects the usefulness and evolution of medical imaging for the diagnosis of HCC. Moreover, our results highlight an imminent need for data sharing in collaborative data repositories to minimize unnecessary repetition and wastage of resources.


Introduction
Artificial intelligence (AI) is "a field of science and engineering concerned with the computational understanding of what is commonly called intelligent behavior, and with creating artefacts that exhibit such behavior" [1].
Alan Turing first described the use of computers for the simulation of critical thinking and intelligence in 1950. In 1956, John McCarthy coined the definition of AI, the all-encompassing term for computer programs replicating human intelligence. Machine learning is a subset of AI that learns from previous experience and sequentially refines its functioning. Deep learning (DL) is a further subset of machine learning that utilizes multi-layered networks of computing units termed "neurons", which process and validate large training datasets between input and output units, leading to meaningful predictions in multiple spheres of medical research (diagnostic, therapeutic, prognostic, etc.) [2].
Hepatocellular carcinoma (HCC) ranks fifth amongst the most common malignancies and is the third most common cause of cancer-related death globally [3]. Though there have been several breakthroughs in treatment and diagnostic capability, the prognosis of HCC remains dismal due to delayed diagnosis and limited treatment strategies. AI has far-reaching potential in the spheres of (a) risk factor stratification, (b) characterization, and (c) improved prognostication in established cases [2]. HCC is a notorious cancer with multiple, overlapping risk factors across the spectrum of its precursor conditions, including NAFLD (non-alcoholic fatty liver disease), NASH (non-alcoholic steatohepatitis), and subsequent cirrhosis. Several AI modalities have now been modelled to differentiate and predict the risk of incident HCC [2]. The next challenge lies in classifying indeterminate liver lesions requiring histopathological evidence. The use of computed tomography (CT) and magnetic resonance imaging (MRI) based on DL and radiomics, and the success in differentiating between HCC and non-HCC liver nodules with high diagnostic accuracy, serve as an essential impetus for creating universal standardized liver tumor segmentation techniques [4]. The following systematic review will expand on the current role of artificial intelligence in HCC detection and characterization, regardless of the instrumental technique.

Materials and Methods
Following the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) reporting guidelines, we conducted a systematic review. This review reports qualitative data, and because of inconsistent reporting of outcome measures and differences in populations and study design, we did not perform a meta-analysis.

Searches
PubMed, Scopus, and Cochrane were searched using a combination of the following key words: ((Artificial Intelligence) OR (Machine Learning)) AND ((Hepatocellular Carcinomas) OR (HCC) OR (Liver Cancer)) to retrieve articles reporting the application of AI in detecting or diagnosing HCC. Records were included from the time of inception up to and including 5 May 2022. The search terms were modified to fit each database (the terms and their adjustments are found in the Supplementary Materials File S1). Additionally, the reference lists of included articles and relevant reviews were checked manually to identify other papers.

Inclusion and Exclusion Criteria
Only published articles reporting the application of AI in detecting or diagnosing HCC were included, excluding all studies reporting the application of AI outside the diagnosis of HCC, such as risk prediction, prognosis, or treatment. Only diagnoses based on CT, MRI, ultrasound (US), 18F-FDG positron emission tomography (PET), or X-ray were selected, while other methods such as pathology reports or biomarkers were excluded. Reviews, letters, editorials, conference papers, preprints, commentaries, book chapters, and articles in languages other than English were also excluded.

Quality Assessment
Studies were assessed for quality based on three items:

•
The number of images, estimating the risk of bias and overfitting: fewer than 50 (score 0), 50 to 100 (score 1), and more than 100 (score 2) [5]. This item was chosen as the factor most frequently reported in the articles. Where only the number of patients was reported, we assumed at least one image per patient.

•
The use of a completely independent cohort for validation: no cohort (score 0), partition of the cohort into completely separate training and test sets (score 1), external validation cohort (score 2).

•
The year of publication: by 2011, the speed of graphics processing units had increased significantly, making it possible to train convolutional neural networks without layer-by-layer pre-training. With the increased computing speed, deep learning gained significant advantages in terms of efficiency and speed: no date (score 0), before 2011 (score 1), 2011 or after (score 2).
A simple quality score (QS), consisting of the sum of the three items above, was calculated; a maximum possible score of 6 indicated a high-quality study design.
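As an illustration, the three-item score can be sketched as a small function. The function and argument names below are hypothetical (they do not come from the article); the cut-offs are those stated above.

```python
def quality_score(n_images, validation, year):
    """Illustrative sketch of the composite quality score (QS).

    n_images:   number of images (or patients, if only patients were reported)
    validation: "none", "train_test_split", or "external"
    year:       publication year, or None if not reported
    """
    # Item 1: number of images (risk of bias and overfitting)
    if n_images < 50:
        images_score = 0
    elif n_images <= 100:
        images_score = 1
    else:
        images_score = 2

    # Item 2: independence of the validation cohort
    validation_score = {"none": 0, "train_test_split": 1, "external": 2}[validation]

    # Item 3: publication year relative to the 2011 GPU/deep-learning shift
    if year is None:
        year_score = 0
    elif year < 2011:
        year_score = 1
    else:
        year_score = 2

    # QS is the sum of the three items, ranging from 0 to 6
    return images_score + validation_score + year_score
```

For example, a 2015 study with more than 100 images and an external validation cohort would reach the maximum QS of 6.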

Study Selection & Data Extraction
Duplicates were removed using EndNote X9. Titles and/or abstracts of studies identified using our search criteria were screened independently by 2 authors (A.M. & M.A.) to identify all studies meeting our inclusion criteria. Any disagreement was resolved through discussion with a third reviewer (F.G.). A random sample of included articles was used to generate an extraction sheet. Three authors (A.M., M.A., and J.P.S.P.) reviewed the full texts for inclusion and data extraction, and any discrepancies were corrected by consensus. The predefined parameters were extracted from each article; F.G. then reviewed all articles, rechecked the data, and analyzed them using a Microsoft Excel sheet. Statistical calculations were performed with Jamovi software version 2.0.0.0 [6,7].

Searching Results
The study flow diagram is illustrated in Figure 1. Searches identified 3160 records: 1677 from PubMed, 1426 from Scopus, and 57 from Cochrane. A total of 1052 were duplicates automatically excluded using EndNote X9. The remaining 2108 studies were evaluated by title/abstract screening against the eligibility criteria, and 2032 were excluded: 1813 were not related to the topic, 5 did not include HCC, 5 did not include AI, 12 did not discuss diagnosis, 80 were duplicates not detected by the software, 62 were conference papers, 26 were reviews, 5 were book chapters, and 24 were letters. Of the 76 potentially eligible records remaining, 27 articles were included after full-text screening, and 49 were excluded: 6 were not related to the topic, 7 did not include HCC, 3 did not include AI, 11 did not discuss diagnosis, 9 based the diagnosis on methods other than CT/PET/MRI/US/X-ray, 5 were in a language other than English, 4 were reviews, 3 were not available, and 1 was a clinical trial with no published data. After the manual search, 19 further articles were identified. Thus, a total of 46 cited articles were included in this review, published between 1998 and 2022 (Table 1).
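The flow counts above are internally consistent; the short sketch below simply re-derives each total from the figures reported in the text.

```python
# Sanity check of the PRISMA flow counts (all numbers taken from the text).
identified = 1677 + 1426 + 57         # PubMed + Scopus + Cochrane
assert identified == 3160

screened = identified - 1052          # after automatic duplicate removal
assert screened == 2108

excluded_screening = 1813 + 5 + 5 + 12 + 80 + 62 + 26 + 5 + 24
assert excluded_screening == 2032

full_text = screened - excluded_screening  # records assessed in full text
assert full_text == 76

excluded_full_text = 6 + 7 + 3 + 11 + 9 + 5 + 4 + 3 + 1
assert excluded_full_text == 49

included = full_text - excluded_full_text  # articles meeting all criteria
assert included == 27

total_cited = included + 19           # plus articles found by manual search
assert total_cited == 46
```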

Quality Assessments
The mean of the "Number of Images" score was 1.70, identifying 36 articles (78.3%) in which at least 100 images were analyzed (Table 2). The mean of the "Cohort for Validation" score was 0.609; indeed, an external validation cohort was used in only 2 articles (4.3%) (Table 3). The mean of the "Year of Publication" score was 1.87, documenting that most of the works (87.0%) included in this systematic review were published in 2011 or later (Table 4). On average, the Total Quality Score was 4.17, with a median of 4.00 and an SD of 1.04 (Table 5). The contingency table correlating the Total Score with the Year of Publication reports a statistically significant constant improvement of the quality score over the years (p = 0.004) (Table 6). A total of 3 articles (6.5%) scored a QS lower than 3, while 2 (4.3%) received the maximum score. Results from articles with a QS strictly lower than 3 are written in italics in Table 7.
Table 7 lists the total study population, diagnostic method, research question or purpose, AI method, and key findings of the articles included in this systematic review, summarizing how artificial intelligence is used today in diagnosing HCC. Moreover, when the information was available, we reported in Table 7 the background of the studied images, i.e., whether the HCC arose on a cirrhotic or healthy liver and whether other cancerous or benign lesions were present. The entries below are excerpted from Table 7 (modality and AI method, followed by purpose and key findings):

•
Logistic regression, SVM, RF, and KNN: aspects related to perfusion (peak time and wash-in time), the microvascular architecture (spatiotemporal coherence), and the spatial characteristics of contrast enhancement at wash-in (global kurtosis) and peak (GLCM Energy) are particularly relevant to aid FLL diagnosis.

•
US; CNN: to analyse the diagnostic performance of a deep multimodal representation model integrating tumour images, patient background, and blood biomarkers for the differentiation of liver tumours observed using B-mode US. With the integration of patient background information and blood biomarkers in addition to US images, multimodal representation learning outperformed the CNN model that used US images alone.

•
CNN (feature extraction), DWT (signal processing), and LSTM (signal classification): a new liver and brain tumour classification method is proposed using the power of CNN in feature extraction, DWT in signal processing, and LSTM in signal classification. The proposed method achieved a satisfactory accuracy rate in classifying liver and brain tumours.

•
US; ANNs: to introduce a CAD system aimed at the differential diagnosis of FLLs by use of CEUS. The classification accuracies were 84.8% for metastasis, 93.3% for hemangioma, and 98.6% for all HCCs; the accuracies for the histologic differentiation types of HCC were 65.2% for w-HCC, 41.7% for m-HCC, and 80.0% for p-HCC.

•
CT; ANN: to apply an ANN for the differential diagnosis of certain hepatic masses on CT images and to evaluate the effect of the ANN output on radiologist diagnostic performance. The ANN can provide useful output as a second opinion to improve radiologist diagnostic performance in the differential diagnosis of hepatic masses seen on contrast-enhanced CT.

•
CT: to present a CT liver image diagnostic classification system that automatically finds and extracts the CT liver boundary and further classifies liver diseases.

Discussion
Artificial intelligence is a rapidly growing field of interest. It has immense potential to become the standard of care in resource-limited settings, where expert care is less available and the burden of cancer is heavy. However, the use of AI- and ML-based algorithms is limited in current practice owing to their limited generalizability. ML algorithms require large training datasets and GPU-based processing, and they operate on the GIGO (garbage in, garbage out) principle, meaning that the output is only as robust as the input. However, ensuring the robustness and standardization of large datasets, including follow-up evaluation and patient quality of care, is extremely cumbersome and difficult. The incongruity between modelled datasets and real-world data is a fundamental challenge that must be overcome in the future [2].
We have systematically grouped the articles using artificial intelligence in HCC detection and characterization in a single table, helping to plan further research projects. Indeed, for each article, we extracted the scope, the AI method used, and the key findings related to that AI approach, with the idea of providing an index of all projects carried out to date. The significant heterogeneity of the studies reflected the difficulty of extrapolating several variables related to the different radiological techniques and pooling them together (e.g., the gold standard used for the diagnosis of HCC, patient features, radiologist's opinion, dose and type of contrast agent, and follow-up imaging).
In this work, 27 articles were analyzed with our composite score for the evaluation of the quality of the publications, with an overall score of 4.17/6. The "Cohort for Validation" score was the lowest; indeed, an external validation cohort was used in only 2 articles. This phenomenon, although explained by the difficulty of collecting data, limits the generalizability of the conclusions. We observed a statistically significant constant improvement over the years of our composite criterion combining the number of images and the presence of a validation cohort (p = 0.004). This improvement is probably due to the publication of guidelines, dedicated checklists to ensure proper methodology, and technological improvement in the field of AI.
Our results highlight an imminent need for data sharing in collaborative data repositories to minimize unnecessary repetition and wastage of resources. In addition, universal standardized protocols for sharing datasets from clinical trials are essential to help make the available data robust and to fill in the missing data. One such example is the creation of the Human Brain Project and the EBRAINS project by the European Union to handle data related to brain research and its broader usage in the development of AI networks [54]. To help make the datasets uniformly accessible and usable, it is also imperative to diversify the data. Most of the work on AI-based algorithms was done on small-scale datasets, owing to economic and logistic constraints, in high-income countries, with limited to no data from lower-middle- and low-income countries, which leaves the generalizability of these algorithms in question. Significant work needs to be done to increase the transparency and understandability of AI algorithms so that healthcare professionals gain confidence in using them in clinical settings.
Our systematic approach has shown that previous works in HCC detection and characterization have assessed the comparability of conventional interpretation with machine learning using US, CT, and MRI. The distribution of the imaging techniques in our analysis reflects the usefulness and evolution of medical imaging for the diagnosis of HCC. Ultrasound and CT are overrepresented in our analysis, as both are easily available imaging techniques that have long proven their usefulness in the diagnosis of HCC. More recent and limited access to MRI may explain its absence before 2019 and its low representation since then in our analysis. Conversely, no study investigated X-ray or PET techniques. Indeed, even if X-ray has an interesting role in interventional therapeutic procedures, this technique has not been used for diagnostic purposes in this field. Moreover, unlike in other branches of medicine, such as neurology [55], head and neck cancer [56], or lung imaging [57][58][59], artificial intelligence in PET technology has not yet been studied and tested in HCC diagnosis. Even though PET, in combination with CT, is already used in other cancers to characterize undetermined lesions with high sensitivity and precision, the combination of AI and PET technology has not yet been explored in HCC. The most straightforward explanation lies in the difficulty of analyzing structural and morphological characteristics, since hepatocellular cancer lesions have a variable degree of avidity for PET tracers such as 18F-FDG. Indeed, the liver is unique in its capacity to maintain glucose homeostasis, leading to low 18F-FDG uptake in low-grade (i.e., relatively metabolically less active) tumors [60]. It has been reported that only up to two-thirds of the tumors are 18F-FDG avid, although higher standardized uptake values (SUV) indicate a more malignant tumor [61,62].
Using other tracers such as 18F-Choline and 11C-Acetate may be a promising approach to increase the accuracy of results and openness to new AI technologies in combination with PET in diagnosing HCC [63,64].
In the future, DL algorithms combining clinical, radiological, pathological, and molecular information could help identify and better prognosticate patients. In addition, algorithms trained on post-chemotherapy patients could help in the early identification of treatment response and of the time to switch to other therapeutic options. This would enable earlier identification of patients with poor treatment response and pre-emptive therapy adjustment based on molecular signature and imaging [2,4]. Nevertheless, conducting high-quality AI studies with large sets of data remains a real challenge, whatever the medical imaging technique. Supervised and, even more so, unsupervised training-based algorithms need very large sets of data for training but also for validation purposes. A high-quality methodology requires standardized multi-parametric imaging acquisition protocols and solid diagnostic methods, including multiple-reader assessment, follow-up imaging, and/or anatomopathological confirmation. Multi-center AI studies and pooled imaging data could be an effective solution to save time and financial resources.

Limitations and Strengths
The most significant limitation of this review is the wide diversity from one article to another in terms of textural parameters and methods used, which meant that, even for similar subjects, it was challenging to aggregate and compare the articles. Secondly, the scale used to assess the quality of the articles was practical but rather simplistic: it made it possible to evaluate many articles with high reproducibility, at the expense of a thorough analysis of the methods. At the same time, to the best of the authors' knowledge, this is the first systematic review in the scientific literature focusing on the use of AI in radiological HCC detection and characterization, omitting pathology and prognosis. This allowed for a detailed analysis describing all the scientific techniques and efforts studied in this narrow field, providing an overview that can offer points for reflection and guide future research.

Conclusions
Our systematic approach has shown that previous works in HCC detection and characterization have assessed the comparability of conventional interpretation with machine learning using US, CT, and MRI. The distribution of the imaging techniques in our analysis reflects the usefulness and evolution of medical imaging for the diagnosis of HCC. Moreover, our results highlight an imminent need for data sharing in collaborative data repositories to minimize unnecessary repetition and wastage of resources.