Artificial Intelligence for Detecting and Quantifying Fatty Liver in Ultrasound Images: A Systematic Review

Background: Non-alcoholic Fatty Liver Disease (NAFLD) is growing more prevalent worldwide. Although non-invasive diagnostic approaches such as conventional ultrasonography and clinical scoring systems have been proposed as alternatives to liver biopsy, their efficacy has been called into question. Artificial Intelligence (AI) is now combined with traditional diagnostic processes to improve the performance of non-invasive approaches. Objective: This study explores how well various AI methods function and perform on ultrasound (US) images to diagnose and quantify non-alcoholic fatty liver disease. Methodology: A systematic review was conducted to achieve this objective. Five bibliographic databases were searched: PubMed, the Association for Computing Machinery (ACM) Digital Library, the Institute of Electrical and Electronics Engineers (IEEE) Xplore, Scopus, and Google Scholar. Only peer-reviewed English articles, conference papers, theses, and book chapters were included. Data from the studies were synthesized using narrative methodologies per the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) criteria. Results: Forty-nine studies were included in the systematic review. According to the qualitative analysis, AI significantly enhanced the diagnosis of NAFLD, Non-Alcoholic Steatohepatitis (NASH), and liver fibrosis. In addition, modalities, image acquisition, feature extraction and selection, data management, and classifiers were assessed and compared in terms of performance measures (i.e., accuracy, sensitivity, and specificity). Conclusion: AI-supported systems show potential performance increases in detecting and quantifying steatosis, NASH, and liver fibrosis in NAFLD patients. Before real-world implementation, prospective studies with direct comparisons of AI-assisted modalities and conventional techniques are necessary.


Background
Non-alcoholic Fatty Liver Disease (NAFLD) is a group of disorders caused by a buildup of fat in the liver. The disease is most common in overweight or obese people [1]. NAFLD is one of the most common chronic liver diseases in the world, affecting between 25% and 30% of the adult population [1,2]. High liver fat levels are linked to a higher risk of significant health issues such as diabetes, high blood pressure, cirrhosis, renal disease, and heart disease [3]. However, if diagnosed and treated early enough, NAFLD can be prevented from worsening, and the amount of fat in the liver can be reduced. Unfortunately, advanced liver disease and mortality due to NAFLD/NASH are expected to rise in Saudi Arabia (Figure 1), necessitating a strategy to slow the growth of the NAFLD population and minimize the liver disease burden [4]. The primary line of treatment for NAFLD and NASH is lifestyle change, including eating habits and physical activity. Weight loss is a well-established therapy for NAFLD and NASH, and there is an unmistakable dose-response correlation between specific nutrients and fatty liver disease [5]. The progression of NAFLD involves four stages. Steatosis (simple fatty liver) is a harmless build-up of fat in the liver cells. A more severe form of NAFLD called NASH occurs when the liver becomes inflamed. A patient is diagnosed with fibrosis when persistent inflammation creates scar tissue around the liver and adjacent blood vessels, but the liver can still function normally. Cirrhosis, the most severe stage, develops after years of inflammation and causes the liver to shrink, scar, and lump; this damage is irreversible and can lead to liver failure (Figure 2). Typically, doctors divide the severity of a patient's disease into four groups (normal, mild, intermediate, or severe) based on histological characteristics [6]. Figure 2 shows an example NAFLD progression model; liver fibrosis can be divided into stages F0-F4 as follows: F0, no fibrosis; F1, portal fibrosis without septa; F2, portal fibrosis and few septa; F3, numerous septa without cirrhosis; F4, cirrhosis (adapted from [4]).
In most cases, abdominal ultrasonography is used to diagnose NAFLD [7]. Ultrasonography is a low-cost, safe, quick, and uncomplicated procedure in most healthcare settings [8]. To determine the severity of liver disease, non-invasive testing or liver biopsies are used. Hepatic steatosis, inflammation, and fibrosis are all assessed with a liver biopsy. However, a liver biopsy is an intrusive procedure that may result in a hemoperitoneum.
Using ultrasonography with AI to detect and quantify NAFLD is a promising approach that has recently attracted researchers' attention. In a review published in 2021, the domains of liver disease in which AI can be used are briefly discussed [27]. In addition, the use of ML for measuring liver fibrosis, forecasting hepatic decompensation, screening potential liver transplant recipients, and predicting post-transplant survival and complications was reported in a recent systematic review of AI in hepatology [28][29][30]. A comprehensive study also highlights basic technical knowledge about AI, such as traditional ML and deep learning (DL) algorithms, particularly CNNs, and their clinical applications in the medical imaging of liver diseases, such as detecting and evaluating focal liver lesions, facilitating treatment, and predicting liver treatment response [11,31]. Another recent systematic review detailed the use of AI in imaging modalities, digital pathology, and electronic health records to diagnose and stage NAFLD [32]. Furthermore, the performance of AI-assisted systems for the detection of NAFLD, NASH, and liver fibrosis was examined in a recent meta-analysis [33].

Research Problem and Aim
To the researchers' best knowledge, no studies have rigorously analyzed AI models for diagnosing and quantifying NAFLD using ultrasound (US) images or compared models in terms of performance measures. Therefore, this systematic review is the first work of its kind in the literature. Many knowledge gaps have been identified and explored throughout this article. This study aims to examine all prior studies to determine the best accuracy, sensitivity, and specificity for diagnosing and quantifying NAFLD using ML, DL, or a combination of both.

Overview
A preliminary search and idea validation were conducted using the search terms "Artificial Intelligence" and "Ultrasound" and "Fatty liver" in PubMed and Google Scholar. During this step, existing systematic reviews and meta-analyses were found. These sources contained relevant papers to read to gain a deeper understanding of the topic and identify gaps to articulate the research question better. Because the previous reviews had different outcomes and populations, a systematic review of AI-powered ultrasonography for detecting and quantifying NAFLD was conducted. Figure 3 summarizes the method and how it addresses the research purpose.

Protocol and Registration
The protocol was designed by the author and reviewed and approved by the corresponding author. The protocol is registered at PROSPERO.

Search Sources
The bibliographic databases used in this study were PubMed, ACM Digital Library, IEEE Xplore, Scopus, and Google Scholar. The first one hundred Google Scholar results, sorted by relevance to the search topic, were screened. The initial search was broad, based on terms in all fields across all databases. Because the number of keywords allowed in IEEE Xplore and Google Scholar is limited, shorter search strings were used for these databases than for the others. Backward and forward reference list checking identified many additional related studies across databases.

Study Eligibility Criteria
Eligible studies reported patients with hepatic steatosis, including NAFLD, NASH, acute fatty liver of pregnancy (AFLP), and other disease phases. Eligible interventions were those that fell under the umbrella of AI and were related to computer vision (CV) as applied to the assessment of medical images. The type of medical image covered in this review was ultrasound (US), also known as sonography or ultrasonography. Interventions that identified and diagnosed fatty liver disease and its stages were considered eligible outcomes. Accuracy, sensitivity, and specificity were used to assess performance. No limitations such as age, gender, race, or publication date were applied. Table 1 defines the inclusion and exclusion criteria.

Table 1. Inclusion and exclusion criteria (parameters, inclusion criteria, exclusion criteria).
Population - Inclusion: patients with hepatic steatosis (NAFLD) and its developed stages.
Intervention - Inclusion: AI methods that used ultrasound images to detect and quantify hepatic steatosis.

Study Selection
Using PRISMA, studies were selected based on three processes: deleting duplicates, screening and assessing titles, abstracts, and index terms of the retrieved research, and reading the full texts of the studies selected during the previous processes. The study selection was aided by Rayyan, a web-based systematic review tool, which sped up the screening step [34]. Finally, the primary author completed all the processes and conducted data cross-checking among the studies' extracted data to correct any probable errors.

Data Extraction and Synthesis
The primary author designed a data extraction form to collect specific data and parameters from the included studies. The form was a result of revising the parameters gathered in similar reviews and adding extra parameters needed to accomplish this review's aim. The final extraction form was reviewed and approved by the corresponding author. Furthermore, data regarding the study (e.g., first author, year of publication, country of publication, publication type, etc.), population (e.g., age, gender, health status, sample size, etc.), interventions (e.g., imaging modalities, image types, image quality, AI branch, AI methods, validation methods, etc.), datasets (e.g., public/private, training data, testing data, augmented or not, etc.), and performance measures (e.g., accuracy, sensitivity, specificity, and Area Under the Curve) were collected by the author manually. In addition, as previously mentioned, manual data cross-checking among the studies' extracted data was conducted by the author to correct any probable errors.

Risk of Bias in Individual Studies
The modified Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool was considered for evaluating the risk of bias in the studies; however, all studies were retained because of the limited number of studies available. A literature map was created to help review the literature for gaps and points of impact. According to Figure 4, the selected studies are mostly correlated, since very few studies fall outside the citation loop. This result encourages including all the studies even though their quality is questionable.

Data Checking Task
Data were validated through the cross-verification of more than two sources. Using triangulation, the consistency of findings was evaluated, increasing the chance of controlling any threats to result validity [35]. The collected parameters of the included studies were cross-checked against the same parameters in other studies. A form designed by the author was used to record any discrepancies found. The next step was to revisit each paper and correctly extract the parameters that showed discrepancies according to the form.

Search Results
Figure 5 depicts the search process and results. A total of 609 articles were found during the literature search. After eliminating 93 duplicates, 516 titles and abstracts were reviewed, and 461 articles were rejected for the following reasons: irrelevant research (n = 398), incorrect intervention (n = 42), incorrect population (n = 16), and incorrect publication type (n = 5). Following that, 55 full-text articles were reviewed, with 23 being removed for the following reasons: incorrect research design (n = 1), incorrect population (n = 4), incorrect intervention (n = 13), irrelevant studies (n = 2), and articles written in languages other than English (n = 2). Finally, after completing the forward and backward review process, the total number of studies considered was forty-nine.

Description of the Included Studies
The results of the performance measures and assessments of each included study are summarized in Table 5. This table contains basic information about each study, including its reference number, the type of categorization, the number of images, and the classifier, followed by the AUC, sensitivity, accuracy, and specificity. Each study's findings are also summarized.

Study Characteristics
The highest percentage of studies were journal articles (n = 34, ≈69%), while most of the rest were conference proceedings (n = 14, ≈28.5%); one study was a dissertation. As shown in Table 2, studies were published between 1996 and 2021; however, there was a marked increase in published studies, especially DL studies, starting in 2014. When dividing the studies by year of publication, 22.4% were published in 2021, 8.16% in 2020, 10.2% in 2019, and approximately 24.5% between 2015 and 2018. More than a third of the studies were about DL (n = 17, ≈35%), and the rest discussed ML (n = 32, ≈65%). Finally, while India, the USA, China, and Portugal published the most significant number of studies, the USA, Taiwan, and Romania published the largest number of DL studies.

Evaluation of Modalities
The included studies used different modalities, probe frequencies, and settings. Table 3 shows the modalities and frequency ranges used in all the studies.
Feature selection reduces the dimensionality of the inputs fed into a model by retaining only pertinent data and eliminating noise. The choice of suitable features for the AI model depends on the type of problem to be solved. Feature selection methods often used in high-performing studies include locality-sensitive discriminant analysis [36,37], Student's t-test [41,53], Fisher's discrimination ratio with Pearson's correlation coefficient [65,66], sequential forward floating selection [79], marginal Fisher analysis with the Wilcoxon signed-rank test [39], and Welch's test [54].
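To illustrate how one of these univariate methods (Student's t-test) can be used to rank extracted texture features, the following minimal Python sketch assumes a hypothetical feature matrix X (one row per ROI) and binary labels y (0 = normal, 1 = fatty); the data and the cut-off k are illustrative placeholders, not values taken from any included study.

```python
# Minimal sketch: rank candidate texture features by a two-sample t-test and keep the top k.
# X, y, and k are hypothetical placeholders, not data from any included study.
import numpy as np
from scipy.stats import ttest_ind

def select_top_k_features(X: np.ndarray, y: np.ndarray, k: int = 10) -> np.ndarray:
    """Return the column indices of the k features with the smallest p-values."""
    normal, fatty = X[y == 0], X[y == 1]
    p_values = np.array([ttest_ind(normal[:, j], fatty[:, j]).pvalue
                         for j in range(X.shape[1])])
    return np.argsort(p_values)[:k]

# Example with random data standing in for extracted US texture features.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 40))        # 120 ROIs, 40 candidate features
y = rng.integers(0, 2, size=120)      # 0 = normal liver, 1 = fatty liver
print(select_top_k_features(X, y, k=5))
```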
Not all the included studies reported the data split used for training and testing. Often, the percentage of data allocated for validation was also unclear. Table 4 shows approximate data splitting calculations. For most studies, the number of samples used for validation was assumed to be zero, since no numbers were explicitly declared for validation.

Evaluation of Classification Models
An algorithm that performs classification is known as a classifier. The features of images that need to be classified significantly impact classifier performance. Numerous empirical studies have been conducted to compare classifier performance and identify the elements of images that affect classifier performance.

Explanation of the Performance Measure
Machine learning problems may be divided into two categories: regression and classification [13]. Regression models are evaluated with measures such as Mean Absolute Error, Mean Squared Error, Root Mean Squared Error, and R2. Classification models are evaluated with measures such as Accuracy, Precision, Recall, F1-score, and AUC-ROC. It is worth noting that these metrics are distinct from the loss functions used to train a machine learning model, which are typically differentiable in the model's parameters [85].
We include in this review the three key performance indicators reported in the majority of the included studies: sensitivity, specificity, and accuracy. A few studies also reported the F1-score and AUC; however, due to the scarcity of studies utilizing these measures, they are not included in the review. Classification accuracy is defined as how close a collection of predictions is to the actual values; it is calculated by dividing the number of correct predictions by the total number of predictions and multiplying the result by 100. A diagnostic test's sensitivity measures how well it can identify true positives; it is also known as Recall, Hit Rate, and True Positive Rate and is computed by dividing the number of true positives by the total number of positives in the ground truth. Specificity, also known as Selectivity or True Negative Rate, assesses how successfully a test can identify true negatives [86].
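For concreteness, the short sketch below shows how these three measures follow from a binary confusion matrix; the true and predicted labels are invented placeholders rather than results from any included study.

```python
# Minimal sketch: accuracy, sensitivity (recall), and specificity from a confusion matrix.
# y_true and y_pred are hypothetical labels (1 = fatty liver, 0 = normal).
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
sensitivity = tp / (tp + fn)                    # true positive rate (recall)
specificity = tn / (tn + fp)                    # true negative rate

print(f"accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```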
The following summaries are drawn from the per-study findings reported in Table 5.

An automated diagnosis based on RT and DCT coefficients was used to classify normal livers and livers affected by fatty liver disease (FLD). Using only two features, the FS classifier achieved the highest accuracy, sensitivity, and specificity, at 100%. Moreover, using just two elements, the FLDI discriminated between normal and FLD.

[51] To reduce dimensionality and increase DL network speed without raising computational expense, the system in this study used the Inception model. First, the background of the original liver images was removed from the optimized images by stripping the border. When 15% of the background was removed, the findings showed remarkable accuracy.

[84] Using GLDS, RUNL, SGLDM, and FDTA algorithms, this study applied a method created for computer-assisted liver tissue characterization. It was anticipated that distinguishing cirrhotic, fatty, and diffuse disease from normal tissue would be challenging, but the preliminary outcomes were very good.

[53] (normal vs. fatty; 100 images; self-organizing map; AUC, sensitivity, accuracy, and specificity not reported) This study found representative feature vectors using a one-dimensional self-organizing map (SOM). The most distinctive components were "maximum probability" and "uniformity." The superimposed plots for normal and fatty liver images indicate distinct groups with little to no overlap.

In another study, features such as the mean grey level, 10th percentile, contrast, ASM, entropy, correlation, attenuation, and speckle separation produced good results when used as the input of fuzzy logic to build an automated categorization of cirrhotic, fatty, and normal livers. The findings demonstrated the potential benefit of taking fuzzy reasoning into account during the "quantitative tissue characterization" of diffuse liver diseases.

The diagnosis of FLD and heterogeneous liver using textural analysis of liver US images is a unique method presented in another study. First, a WPT was used to examine the ROI, and several statistical features (median, standard deviation, and interquartile range) were collected from each WPT sub-image. Classification was then performed using a "v-linear support vector" classifier. The suggested approach provided an overall accuracy of approximately 95%, demonstrating the system's effectiveness.

[81] In this study, ResNet-50 v2 was trained and evaluated on many images and, as a result, performed relatively well compared to invasive diagnostic techniques for fatty liver. According to the findings, US images are more dependable than CT images for detecting hepatic steatosis. In addition, when ten features from a co-occurrence matrix were loaded into a BLR, it performed well at differentiating between healthy and fatty livers.

[68] The study results showed that the Inception-ResNet-v2 architecture-based model is helpful in classifying medical images and that it performs better than classical methods regarding accuracy and AUC. Another study built an extreme learning machine (ELM) on a single-layer feed-forward neural network; only hidden-to-output weights were trained, and input-to-hidden layer weights were generated randomly to reduce computing costs. As a result, the results were more accurate with fewer features. A further study [58] indicated that the SVM was the most applicable for the discrimination of pathologic tissues in clinical practice, performing better than the kNN and ANN.

[65] In this study, the best textural characteristics for classifying livers were found. A novel classification approach employing information fusion was suggested; it consisted of a linear combination of features weighted according to how well they could separate classes.

[43] This study suggested that an existing learning-based model may perform well by combining US and shear wave features (shear wave attenuation, shear wave absorption, elasticity, dispersion slope, and echo attenuation). Furthermore, it supports the claim that the target tissue may be identified and distinguished from other targets in the high-dimensional space established by the suggested ultrasonic parameter set. In another study [39], GIST descriptors were used to extract features; a marginal Fisher analysis (MFA) data reduction method reduced the many elements to the top seventeen, and the Wilcoxon signed-rank test was used to rank the characteristics and create effective and reliable classifiers. Using eighteen features, the proposed approach identified all normal classes as normal (specificity of 100%). To train the classifiers, 10-fold stratified cross-validation was employed. The PNN classifier produced the highest classification accuracy of 98%, with a sensitivity of 96% and a specificity and PPV of 100%.

[40] In this study, the classification of normal, mild, moderate, and severe liver images was carried out using medical domain knowledge to diagnose the severity of fatty liver. The findings demonstrated that the classification accuracy for a given feature category, such as the run-length matrix (RLM), may be improved by appending feature sets.

[83] (normal, low-grade, moderate-grade, and severe fatty liver; 500 images; convolutional neural network; sensitivity 83%, accuracy 90%, specificity 95%) The study covered the impact of network width on a model and found that correctly expanding the network model's width increased the model's accuracy. "Skip connections" expedite network convergence while preserving the image's original features.
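Several of the classical ML entries above rely on grey-level co-occurrence matrix (GLCM) texture descriptors such as contrast, ASM, and correlation; the sketch below computes a few of them for a hypothetical ROI using scikit-image, an assumed library choice rather than one prescribed by the studies.

```python
# Sketch: grey-level co-occurrence matrix (GLCM) texture features for a single ROI.
# The ROI array is a hypothetical placeholder for a patch cut from a liver US image.
import numpy as np
from skimage.feature import graycomatrix, graycoprops   # scikit-image >= 0.19

roi = (np.random.rand(64, 64) * 255).astype(np.uint8)    # placeholder 8-bit ROI

glcm = graycomatrix(roi, distances=[1], angles=[0], levels=256,
                    symmetric=True, normed=True)

features = {
    "contrast":    graycoprops(glcm, "contrast")[0, 0],
    "ASM":         graycoprops(glcm, "ASM")[0, 0],
    "correlation": graycoprops(glcm, "correlation")[0, 0],
    "homogeneity": graycoprops(glcm, "homogeneity")[0, 0],
}
print(features)
```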

Discussion
Even though the first empirical research on this review topic was published in 1996, it must be acknowledged that few in-depth studies on this topic have had conclusive findings. For instance, Figure 6 shows the differentiation of the studies based on three factors. Regarding AI methods, some studies used ML, while others used DL. Regarding outputs, some studies detected disease while others quantified it. This quantification focused on fatty liver disease alone or incorporated its morbidity stages. If the protocol of this review had been restricted to studies that used DL to quantify fatty liver disease and its later morbidity phases, the number of studies would have been relatively small.
Figure 7 depicts the process steps followed in all the studies. Most studies used the same method to design prediction models using US images of the liver: collecting images, processing images to extract features, processing features in the classifier, and making a prediction. However, other studies used varied approaches, potentially resulting in disparities in performance.
A turning point in the computerized automated detection of fatty liver disease was reached when Michal Byra conducted a study in 2018 [59]. In Byra's work, hepatic feature extraction was performed using the Inception-ResNet-v2 architecture. Quantitative validation was conducted on US scans obtained from fifty-five patients (38 fatty livers, 17 healthy livers). An SVM classifier was used to classify the retrieved features, and the reported mean accuracy was 96.3%. Since 2018, most studies have used Byra's work as a benchmark against which to compare their findings. Five studies reviewed in this work used Byra's dataset [60][61][62][63][64]. Furthermore, it is crucial to emphasize the significance of two recent studies conducted in Taiwan, in which a considerable number of images from various imaging modalities were used [45,81].
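To make this pipeline concrete, the sketch below pairs a pretrained Inception-ResNet-v2 feature extractor with an SVM classifier, in the spirit of Byra's approach; the image array, labels, and split parameters are hypothetical placeholders, and the code is not the original implementation from [59].

```python
# Sketch of a transfer-learning pipeline: pretrained Inception-ResNet-v2 features + SVM.
# `images` (N x 299 x 299 x 3) and `labels` are hypothetical stand-ins for a liver US dataset.
import numpy as np
from tensorflow.keras.applications import InceptionResNetV2
from tensorflow.keras.applications.inception_resnet_v2 import preprocess_input
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Pretrained ImageNet weights, classification head removed, global average pooling output.
extractor = InceptionResNetV2(weights="imagenet", include_top=False, pooling="avg")

images = np.random.rand(55, 299, 299, 3).astype("float32") * 255.0  # placeholder US frames
labels = np.random.randint(0, 2, size=55)                           # 0 = normal, 1 = fatty

features = extractor.predict(preprocess_input(images), verbose=0)   # deep features per image

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0)

clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print("hold-out test accuracy:", clf.score(X_test, y_test))
```

Freezing the pretrained extractor and training only a lightweight classifier on top is a common choice when, as here, only a few dozen labelled scans are available.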
The studies included in this review show that combining AI with US image analysis can reduce human-related mistakes and enhance overall performance. The studies also demonstrate the capacity of AI-integrated approaches to detect early-stage steatosis. The studies show impressive AI-assisted US performance, with high sensitivity, specificity, Positive Predictive Value (PPV), Negative Predictive Value (NPV), and accuracy.
Although most studies collected liver US images using Philips, Siemens, or GE systems, the type of modality does not appear to be related to the performance of the classification model. The same can be said of the modality settings and the probe frequency used in each study. Although some studies contend that utilizing a high frequency, such as 40 MHz, results in greater classification performance (accuracy of 95%) [43], a comparable study using a low frequency of 5 MHz scored the same performance [56]. It is worth mentioning that most studies used a probe frequency of 3.5 MHz.
Even though the studies in this review processed images and chose ROIs in numerous ways, most of them unequivocally agreed on the importance of these two steps. Regarding image pre-processing and ROI selection, no discernible patterns can be reported. Studies consistently standardized image size, removed irrelevant details, attempted to choose ROIs near the images' central lines, and avoided anomalies such as vessels and bile stores. Two studies successfully selected ROIs automatically. Ribeiro et al. based their approach on decomposing liver parenchyma US images into two fields: the speckled image holding textural information and the de-speckled image comprising liver intensity and anatomical data [79]. Owjimehr et al.'s ROI selection process included three steps: first, images were divided into blocks and overlapped repeatedly; second, sections of a specific size were chosen from the middle of each block; finally, a linear support vector classifier was used to select the best ROIs from all those formed [55].
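As a simplified illustration of this block-based ROI step (not the exact procedure of Ribeiro et al. [79] or Owjimehr et al. [55]), the sketch below tiles a frame into fixed-size blocks and retains the patches closest to the image centre; the frame, block size, and number of kept patches are hypothetical, and real pipelines additionally exclude vessels and other anomalies.

```python
# Simplified sketch: tile a US frame into blocks and keep candidate ROIs near the centre.
# `frame`, `block`, and `n_keep` are hypothetical; studies add anomaly exclusion and a
# learned classifier to pick the final ROI.
import numpy as np

def central_roi_candidates(frame: np.ndarray, block: int = 64, n_keep: int = 4):
    h, w = frame.shape
    centre = np.array([h / 2, w / 2])
    candidates = []
    for top in range(0, h - block + 1, block):
        for left in range(0, w - block + 1, block):
            patch_centre = np.array([top + block / 2, left + block / 2])
            dist = np.linalg.norm(patch_centre - centre)
            candidates.append((dist, frame[top:top + block, left:left + block]))
    candidates.sort(key=lambda c: c[0])            # patches closest to the central line first
    return [patch for _, patch in candidates[:n_keep]]

rois = central_roi_candidates(np.random.rand(480, 640))
print(len(rois), rois[0].shape)
```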
Most studies that used supervised ML reported selecting features for classification. While it is well known that the number of features positively affects accuracy, the feature types seem to have the greatest impact on accuracy. For example, a study that used 325 features to detect whether a liver was normal or steatotic scored an overall accuracy of 79.77% [51], whereas another study that used only five features for the same purpose scored 100% accuracy [36]. The same can be seen with 636 features and 85.4% accuracy in [69] versus five features and 87.5% accuracy in [44]. Likewise, one study selected 156 features and scored 85% accuracy [47], while another used only nine features and achieved 90.84% accuracy [75].
Regarding data splitting, different studies used different methodologies. However, using 80 to 90% of the data for training and 10 to 20% for testing was the most common splitting arrangement. Furthermore, many studies used 10-fold cross-validation to validate their models.
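A minimal sketch of the most commonly reported arrangement, an 80/20 hold-out split with 10-fold cross-validation on the training portion, is given below; the feature matrix, labels, and classifier are hypothetical placeholders.

```python
# Sketch: 80/20 train/test split with 10-fold cross-validation on the training set.
# X and y are hypothetical features/labels, not data from any included study.
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SVC(), X_train, y_train, cv=cv)      # 10-fold validation accuracy
print("CV accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))

final_model = SVC().fit(X_train, y_train)
print("hold-out test accuracy:", final_model.score(X_test, y_test))
```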
Studies have proved the efficiency of using DL classifiers to classify images, and medical images are no exception. DL algorithms use data to learn high-level features. This is a distinguishing feature of DL and a significant advancement over classical ML. As a result, DL minimizes the need to create a new feature extractor for each challenge. However, when it comes to US images for NAFLD, studies show that a neural network AI outperforms a non-neural network AI. More research and quantitative analyses are needed to accurately identify a superior algorithm among the ones described in this review.
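To illustrate the point that a DL classifier learns its features directly from the pixels rather than from hand-crafted descriptors, the following minimal CNN for binary normal-versus-fatty classification is provided; the architecture, input size, and training data are illustrative assumptions and are not drawn from any included study.

```python
# Minimal illustrative CNN: learns features from raw US patches, no hand-crafted descriptors.
# Input shape, layer sizes, and training data are hypothetical.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 1)),            # grayscale US patch
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),         # P(fatty liver)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Placeholder data standing in for labelled liver US patches.
X = np.random.rand(64, 128, 128, 1).astype("float32")
y = np.random.randint(0, 2, size=64)
model.fit(X, y, epochs=2, batch_size=16, verbose=0)
```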
Finally, the included studies reported many challenges and opportunities for improvement. The following is a list of the most important obstacles to overcome in future research:
1. to overcome problems that currently exist in some classifiers, such as speckle noise, the semantic gap, computational time, dimensionality reduction, and the accuracy of images retrieved from a large dataset;
2. to examine the effect of every parameter to improve the performance of the model;
3. to use a more extensive dataset acquired by different operators from different patients;
4. to consider a multicenter (multi-hospital) setting;
5. to consider more disease stages;
6. to use more advanced techniques to improve images before analysis;
7. to automate all steps as much as possible;
8. to examine more sophisticated features; and
9. to implement classification models in hardware and transfer the technology to a clinical setting.

Clinical Implications
Although the studies did not elaborate on the clinical implications of suggested solutions and models, several studies attempted to discover the most appropriate models for healthcare settings, even if it meant sacrificing the quality of results. For example, one study attempted to reduce computation time and enhance speed by employing Fourier layers to standardize the modern technology in clinical settings [64]. Furthermore, the results in [58] suggested that SVM was best suited for differentiating diseased tissue in clinical practice, outperforming KNN and ANN.
Despite the above, several studies suggested that having a reliable technology that can be adapted to healthcare settings has clinical implications. The clinical implications were stated as secondary results, a conclusion, or recommended future works. The clinical implications are as follows:

1. US powered by AI can provide an index in place of the H-MRS index or the biopsy method, which are invasive, expensive, scarcely available, and unsettling for patients [36,44,77]. US powered by AI also lessens the workload and the need for biopsy, since it can serve as a preliminary test for selecting patients eligible for biopsy [39,81].
2. In the future, DL might be used to quantify NAFLD with the combined use of pathologic and laboratory tests [72].
3. Given the rising incidence of NAFLD and the potential for permanent hepatic damage, early recognition of NAFLD and cirrhosis is essential for doctors to be able to advise on appropriate therapies to stop the onset of HCC and its associated consequences [37,44].
4. The accuracy of NAFLD detection with ultrasonography can be enhanced with the development of computer-aided diagnostic technology, especially for those less trained or operating in distant locations [50,51].
5. Some methods provide a fully automated solution that will assist in determining the advantages of telehealth [40,56].
6. Future US devices will include functionalities for tissue analysis that are easier to implement in hardware [73].
Potential readers of this work might include healthcare practitioners and computer scientists to promote awareness of the importance of collaboration between the two fields. Healthcare professionals, for example, can supply the necessary dataset for fatty liver to computer scientists, who can then run additional tests and act on clinical validation and feedback.

Strengths
To the best of the researchers' knowledge, this is the first review to investigate all AI strategies used to automate NAFLD detection and quantification. The search was sensitive and accurate, since the most prominent health and information technology databases were searched using a well-developed search query along with backward and forward reference list checking. Because this research does not focus on individual AI branches or stages, it may be considered comprehensive. As a result, the study presents a comprehensive view of AI's function in monitoring fatty liver using US images. On the one hand, the review may be deemed high-quality, since well-recommended criteria were followed during the creation, implementation, and reporting processes. On the other hand, it is possible to expand on this work.

Limitations
Despite data cross-checking between studies being used to fill any gaps in the gathered data, it could be a limitation that only one person carried out the review. Another drawback is that the search conducted for this study covered only English-language studies; as a result, studies written in other languages were omitted. Finally, the analysis presented in this research is qualitative. Therefore, it would be preferable to add further value to the topic by conducting a meta-analysis on some of the included papers.

Conclusions and Future Work
Over time, efforts to detect and classify fatty liver disease and its accompanying clinical stages more accurately than humans have increased. Most of the effort has been devoted to extracting features from processed images and employing these features to complete the task. Using ANNs, whether for extracting features or classifying, represents a significant step in the right direction.
For potential future work, more effort needs to be placed into creating models that tackle these challenges and into performing randomized clinical trials on larger numbers of patients. The findings will help in the future development of explainable AI. Furthermore, more effort must be devoted to processing images and extracting features to determine the most accurate staging of the images, taking into account the structural differences between them. In addition, comparing computational complexity/power and classification accuracy should be considered a strategy for comparing DL methods with ML methods. This will lead to a more advantageous selection of methods to detect and quantify NAFLD using US images.
As AI has received considerable attention regarding its use in the healthcare sector, this study emphasizes the application of AI in fatty liver diagnosis and the challenges ahead. Our findings pave the way for computer scientists to focus on the use of AI in the diagnosis of fatty liver, particularly in the early stages of the illness, which are difficult to identify, especially for junior, non-expert doctors. As a result, AI applications are critical in this domain to overcome challenges and avoid human error.