Recent Applications of Artificial Intelligence from Histopathologic Image-Based Prediction of Microsatellite Instability in Solid Cancers: A Systematic Review

Simple Summary: Although the evaluation of microsatellite instability (MSI) is important for immunotherapy, it is not feasible to test MSI in all cancers because of the additional cost and time. Recently, artificial intelligence (AI)-based models that predict MSI from whole slide images (WSIs) have been developed and have shown promising results. However, these models are still at an early stage, with limited data for validation. This study aimed to assess the current status of AI applications to WSI-based MSI prediction and to suggest a better study design. The performance of the MSI prediction models was promising, but small datasets, a lack of external validation, and the absence of multiethnic population datasets were the major limitations. Combined with high-sensitivity tests such as polymerase chain reaction (PCR) and immunohistochemical (IHC) staining, high-performing AI-based MSI prediction models trained on appropriately large datasets could reduce the cost and time of MSI testing and enhance the immunotherapy treatment process in the near future.

Abstract: Cancers with high microsatellite instability (MSI-H) have a better prognosis and respond well to immunotherapy. However, MSI is not tested in all cancers because of the additional cost and time of diagnosis. Therefore, artificial intelligence (AI)-based models have recently been developed to evaluate MSI from whole slide images (WSIs). Here, we aimed to assess the current state of AI applications for WSI-based MSI prediction in MSI-related cancers and to suggest a better design for future studies. Studies were searched in online databases and screened by reference type, and only the full texts of eligible studies were reviewed. The 14 included studies were published between 2018 and 2021, and most of the publications were from developed countries. The most commonly used dataset was The Cancer Genome Atlas (TCGA) dataset.
Colorectal cancer (CRC) was the most commonly studied cancer type, followed by endometrial, gastric, and ovarian cancers. The AI models have shown the potential to predict MSI, with the highest AUC of 0.93 reported for CRC. The relatively limited scale of the datasets and the lack of external validation were the limitations of most studies. Future studies with larger datasets are required before AI models can be implemented in routine diagnostic practice for MSI prediction.


Introduction
Colorectal cancers (CRCs) with high microsatellite instability (MSI-H) have a better prognosis and respond very well to immunotherapy [1][2][3]. MSI-H cancers generally show certain distinctive clinicopathological features, such as younger age and tumor location in the proximal colon.

Search Strategy
The protocol of this systematic review follows the standard guidelines of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. A systematic search of online databases, including EMBASE, MEDLINE, and Cochrane, was conducted. Articles published in English up to August 2021 were included. The following queries were used in the search: "deep learning", "microsatellite instability", "gene mutation", "prognosis prediction", "solid cancers", "whole slide image", "image analysis", "artificial intelligence", and "machine learning". We also manually searched for eligible studies, and the included studies were managed using EndNote (ver. 20.0.1, Bld. 15043, Thomson Reuters, New York, NY, USA). The protocol of this systematic review is registered with PROSPERO (282422). The Institutional Review Board of the Catholic University of Korea approved the ethical clearance for this study (UC21ZISI0129).

Article Selection and Data Extraction and Analysis
The combined search results from the online databases were retrieved and transferred to EndNote, and duplicates were removed. Original full-text studies on AI-based MSI prediction from WSIs in solid cancers were included. To identify eligible studies, two independent reviewers (MRA and YC) first screened the studies by title and abstract. Finally, the full text of each eligible study was reviewed. Any discrepancy between the authors (MRA and YC) regarding study selection was resolved by consulting a third author (JAG). Case studies, editorials, conference proceedings, letters to the editor, review articles, poster presentations, and articles not written in English were excluded.

Characteristics of Eligible Study
The detailed criteria for selecting and reviewing the articles are shown in Figure 1. The initial search of the online databases yielded 13,049 records, and six additional articles were identified through a hand search. After duplicates were removed, 11,134 records remained. Of these, 3646 records with an irrelevant reference type were removed, leaving 7488 records. Next, 6156 records were excluded by title, leaving 1332 records, and a further 1305 records were excluded by abstract, leaving 27 records for full-text review. Of these, 14 studies met the inclusion criteria and were included in the systematic review.
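The screening flow above is simple successive subtraction; the sketch below reproduces the counts reported in the text as a sanity check (the function name and structure are illustrative, not part of any PRISMA tooling).

```python
# PRISMA-style screening flow: each step removes records and reports the remainder.
def screen(total, removals):
    """Apply successive exclusion counts and return the remainder after each step."""
    remainders = []
    for removed in removals:
        total -= removed
        remainders.append(total)
    return remainders

# Counts from the review: 11,134 records after deduplication, then exclusions
# by reference type, title, and abstract, leaving 27 records for full-text review.
steps = screen(11134, [3646, 6156, 1305])
print(steps)  # [7488, 1332, 27]
```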

Yearly and Country-Wise Trend of Publication
The yearly and country-wise trends of publication are illustrated in Figure 2. AI models for MSI prediction were first reported in 2018, and the number of publications has increased slightly since then. Of the 14 included studies, most were from China (n = 5), followed by Germany (n = 4), the United States (n = 4), and South Korea (n = 1).

Prediction of MSI Status in CRC
The key characteristics of the AI models for CRC are summarized in Table 1. Most of the studies used the TCGA dataset for the training and validation of their AI models. The study by Echle et al. used data from a large-scale international collaboration representing the European population for training, validation, and testing, comprising 6406 patients from the Darmkrebs: Chancen der Verhütung durch Screening (DACHS), Quick and Simple and Reliable (QUASAR), and Netherlands Cohort Study (NLCS) datasets in addition to the TCGA dataset [30]. DACHS is a dataset of stage I-IV CRC patients from the German Cancer Research Center. QUASAR is clinical trial data of CRC patients, mainly with stage II tumors, from the United Kingdom. NLCS is a dataset from the Netherlands that includes patients of any tumor stage. The study by Lee et al. used an in-house dataset along with the TCGA dataset, and the study by Yamashita et al. used only an in-house dataset for the training, validation, and testing of their AI models [48,49]. The studies by Cao et al. and Lee et al. used Asian datasets for external validation, which differ from the population datasets used for training and testing their models [48,49].
The comparison of the AUCs of these models is shown in Figure 4. The AUCs of the AI models ranged from 0.74 to 0.93. The highest AUC of 0.93 was reported by Yamashita et al. using a small dataset, but the study by Echle et al., with a large international dataset, also showed a good AUC of 0.92. Kather et al. and Cao et al. trained and tested their models on frozen section slides (FSSs) and compared the performance with results from formalin-fixed paraffin-embedded (FFPE) slide datasets [29,50]. Their results showed a slightly higher AUC for models trained and tested on FSSs than for those trained and tested on FFPE slides.
A comparison of the sensitivity and specificity of the AI models for CRC is shown in Figure 4. The study by Echle et al., with a large-scale international dataset, showed a good sensitivity of 95.0%, although its specificity was slightly low (67.0%) [30]. The study by Cao et al. showed a good sensitivity and specificity of 91.0% and 77.0%, respectively [50]. The type of AI model used for MSI prediction in each study is listed in Supplementary Table S1. We also compared the AUCs of AI models that used the same dataset (Supplementary Figure S1A,B). Our data showed that the average performance of the ResNet18 model in CRC was better on FSSs (AUC 0.85) than on FFPE slides (AUC 0.79). The next most commonly used AI model for CRC was ShuffleNet, which was used in three studies. However, owing to heterogeneity in their data, we could compare only two of these studies, which showed an average AUC of 0.83. The average AUCs of the ResNet18 and ShuffleNet classifiers were thus very similar.
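For reference, the three performance metrics compared above can all be computed from slide-level classifier scores. The sketch below uses made-up illustrative labels, scores, and threshold, not data from any reviewed study; AUC is computed via the equivalent Mann-Whitney formulation.

```python
# Sensitivity, specificity, and AUC for a binary MSI classifier,
# computed from slide-level scores. All values below are illustrative only.
def sens_spec(labels, scores, threshold=0.5):
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def auc(labels, scores):
    """AUC via the Mann-Whitney U statistic: probability that a random
    MSI-H case scores higher than a random MSS case (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0, 0]            # 1 = MSI-H, 0 = MSS (hypothetical)
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
se, sp = sens_spec(labels, scores)
print(round(se, 2), round(sp, 2), round(auc(labels, scores), 2))  # 0.67 0.75 0.92
```

Note that sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes performance across all thresholds, which is why the reviewed studies report both.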

Prediction of MSI Status in Endometrial, Gastric, and Ovarian Cancers
The key characteristics of the AI model studies on endometrial, gastric, and ovarian cancers are summarized in Table 2. In endometrial cancer, all but one study used only the TCGA dataset for the training, testing, and validation of their models. In addition to the TCGA dataset, Hong et al. used the Clinical Proteomic Tumor Analysis Consortium (CPTAC) dataset for training and testing [57], as well as a New York Hospital dataset for external validation. The AUCs ranged from 0.73 to 0.82. ResNet18 was also a commonly used AI model in endometrial cancer, and a comparison of the AUCs is shown in Supplementary Figure S1C.
All the included gastric cancer studies used only the TCGA dataset for training, testing, and validation. The AUCs ranged from 0.76 to 0.81. Kather et al. reported that their model, trained mainly on Western population data, performed poorly in an external validation test on a Japanese population dataset [29]. ResNet18 was also a commonly used AI model in gastric cancer, and a comparison of the AUCs is shown in Supplementary Figure S1D.
For ovarian cancer, only one study was included; it used the TCGA dataset for the training and testing of its AI model and reported an AUC of 0.91 [58].

Discussion
In this study, we found that AI models for MSI prediction have been increasing recently, focusing mainly on CRC, endometrial, and gastric cancers. The performance of these models is quite promising, but there are some limitations: more qualified data with external validation, including various ethnic groups, should be considered in future studies. Yearly publication trends related to MSI prediction by AI are increasing, and most publications were from developed countries. A recent publication suggested a similar trend in topics related to AI and oncology, showing that the United States is the leading country, followed by South Korea, China, Italy, the UK, and Canada [60]. Publication trends in overall AI research in medicine have also shown exponential growth since 1998, with most papers published between 2008 and 2018 [61]. In another report, the number of publications on AI and machine learning in oncology remained stable until 2014 but increased enormously from 2017 onward [60], which is consistent with our results.
Our data showed that the number of publications on MSI models is higher for CRC than for endometrial, gastric, and ovarian cancers. This may be because CRC is the second most lethal cancer worldwide, and approximately 15% of CRCs exhibit MSI [6-9,62,63]. MSI-high tumors are widely considered to have a large neoantigen burden, making them especially responsive to immune checkpoint inhibitor therapy [64,65]. In recent years, MSI has gained much attention because of its involvement in predicting the response to immunotherapy for many types of tumors [66]. An example of an AI model for CRC is shown in Figure 5.

Figure 5. Overview of the Ensemble Patch Likelihood Aggregation (EPLA) model. A whole slide image (WSI) of each patient was obtained and annotated to highlight the regions of interest (ROIs) containing carcinoma. Next, patches were tiled from the ROIs, and the MSI likelihood of each patch was predicted by ResNet-18, with a heat map visualizing the patch-level predictions. Then, the patch likelihood histogram (PALHI) pipeline and the bag of words (BoW) pipeline each integrated the multiple patch-level MSI likelihoods into a WSI-level MSI prediction. Finally, ensemble learning combined the results of the two pipelines to make the final prediction of the MSI status. Reprinted from Ref. [50].
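The EPLA caption describes a two-stage design: a CNN predicts per-patch MSI likelihoods, which are then aggregated into a single WSI-level call. A minimal sketch of the aggregation stage is given below; the patch likelihoods are made up, and the simple averaged ensemble stands in for the trained PALHI/BoW classifiers of the actual model.

```python
# WSI-level MSI prediction from patch-level likelihoods (aggregation stage only;
# the CNN producing the per-patch likelihoods is out of scope here).
def histogram_features(patch_probs, bins=5):
    """PALHI-style feature: histogram of patch likelihoods over fixed bins."""
    counts = [0] * bins
    for p in patch_probs:
        idx = min(int(p * bins), bins - 1)  # clamp p == 1.0 into the last bin
        counts[idx] += 1
    return [c / len(patch_probs) for c in counts]

def bow_feature(patch_probs, threshold=0.5):
    """BoW-style feature (simplified): fraction of patches called MSI-positive."""
    return sum(p >= threshold for p in patch_probs) / len(patch_probs)

def wsi_prediction(patch_probs):
    """Toy ensemble: average a histogram-derived score and the BoW fraction.
    (The real EPLA model trains separate classifiers on each feature set.)"""
    hist = histogram_features(patch_probs)
    hist_score = sum(((i + 0.5) / len(hist)) * w for i, w in enumerate(hist))
    return 0.5 * (hist_score + bow_feature(patch_probs))

patches = [0.1, 0.2, 0.8, 0.9, 0.7, 0.95, 0.3, 0.85]  # hypothetical likelihoods
print(round(wsi_prediction(patches), 3))  # 0.625
```

Aggregating many weak patch-level signals is what makes such models robust to heterogeneous tumor morphology within a single slide.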
AI models using WSIs have shown great potential for the prediction of MSI in CRC and could serve as a low-cost screening method for these patients. They could also be used as a prescreening tool to select patients with a high MSI-H probability before confirmatory testing with the currently available, costly PCR/IHC methods. However, further validation of these models on large datasets is necessary to improve their performance to a level acceptable for clinical use. Most of the MSI models for CRC were developed on datasets of surgical specimens. Future models should be developed from endoscopic biopsy samples using datasets from various ethnic populations, which could reduce the chance of missing MSI-H cases, particularly in advanced CRC, where resection is not possible. Another limitation of these AI models is that they cannot distinguish between hereditary and sporadic MSI cases. Therefore, to improve the performance of these models, training and validation with large datasets are required in future research.
As immunotherapy and MSI testing gain importance in other solid cancers, such as gastric, endometrial, and ovarian cancers, AI-based MSI prediction models have recently been applied to these cancers as well. They have shown promising results for potential application, although the evidence is still insufficient; larger datasets with external validation should follow in the future.

Performance of AI Models and Their Cost Effectiveness
The sensitivity and specificity of the AI models were comparable to those of routinely used methods such as PCR and IHC. The studies by Echle et al. and Cao et al. showed sensitivities of 91.0-95.0% and specificities of 67.0-77.0% [30,50]. In the literature, IHC sensitivity ranges from 85 to 100% and specificity from 85 to 92% [31,32]; MSI PCR shows 85-100% sensitivity and 85-92% specificity [31]. According to a recent study assessing the cost-effectiveness of these molecular tests and the AI models, the accuracy of MSI prediction models was similar to that of the commonly used PCR and IHC methods [67]. NGS technology is useful for testing many gene mutations; for example, epithelial ovarian cancer patients with a BRCA mutation or HR deficiency might benefit from a therapeutic option of platinum agents and PARP inhibitors, whereas immune checkpoint inhibitors are effective in MSI-H tumors [68].
In that study, the authors estimated the net medical costs of six different clinical scenarios in the United States, combining different MSI testing methods (PCR, IHC, NGS, and AI models) with the corresponding treatments. An overview of the cost-effectiveness comparison from their study is shown in Figure 6. They reported that AI models combined with high-sensitivity PCR or IHC could save up to $400 million annually [67]. As the cancer burden increases, a precise diagnosis of MSI is essential to identify appropriate candidates for immunotherapy and to reduce medical costs.
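The economic logic of AI prescreening can be sketched as simple expected-cost arithmetic: everyone receives the cheap AI test, and only AI-positive cases proceed to confirmatory molecular testing. All costs, prevalence, and operating points below are hypothetical placeholders, not figures from Ref. [67].

```python
# Back-of-the-envelope comparison of two MSI-testing strategies.
# All costs and rates are hypothetical placeholders, not data from Ref. [67].
def universal_pcr_cost(n_patients, pcr_cost):
    """Every patient receives confirmatory PCR testing directly."""
    return n_patients * pcr_cost

def ai_prescreen_cost(n_patients, ai_cost, pcr_cost, msi_prevalence,
                      sensitivity, specificity):
    """Everyone gets the cheap AI test; only AI-positive cases get PCR.
    AI-positive fraction = true positives + false positives."""
    positives = n_patients * (msi_prevalence * sensitivity
                              + (1 - msi_prevalence) * (1 - specificity))
    return n_patients * ai_cost + positives * pcr_cost

n = 100_000
universal = universal_pcr_cost(n, pcr_cost=300)
prescreen = ai_prescreen_cost(n, ai_cost=20, pcr_cost=300,
                              msi_prevalence=0.15, sensitivity=0.95,
                              specificity=0.67)
print(universal, round(prescreen))  # 30000000 14690000
```

Note what the model omits: a prescreen with imperfect sensitivity misses some MSI-H patients entirely, a clinical cost that no dollar figure above captures, which is why the reviewed studies emphasize pairing AI models with high-sensitivity confirmatory tests.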

Data, Image Quality and CNN Architecture
To obtain the best results from any convolutional neural network (CNN) model, a large dataset from various ethnic groups is required for training, testing, and validation. Most studies in this review used relatively small TCGA datasets, which are insufficient for appropriate training and validation. Without large-scale validation, the performance of these AI models cannot be generalized, and routine diagnostic use is not feasible. One study could not perform further subgroup analysis owing to the limited clinical information in the TCGA datasets [49]. Another study raised the potential limitation that the TCGA datasets may not represent the real-world situation [55], and another group noted technical artifacts, such as blurred images, in the TCGA datasets [30]. Although the TCGA dataset includes patients from various institutions and hospitals, most patients come from a similar ethnic background, primarily North American, and only a few studies additionally used local in-house datasets for training or external validation [29,30,48,49]. For high generalizability, datasets from various ethnic groups should be explored further.
On a side note, one study reported poorer performance at 40× magnification than at 20× magnification, which may be due to differences in the image color metrics [49]. Another study reported that color normalization of the images slightly improves the performance of the AI model [30]. Cao et al. recommended using images at 20× magnification or higher for better performance [50]. Interestingly, Krause et al. in 2021 proposed a specialized method for training an AI model when only a limited dataset is available (Figure 7). They synthesized 10,000 histological images with and without MSI using a generative adversarial network trained on 1457 CRC WSIs with MSI information [56]. They reported an increased AUROC after adopting this method and enlarging the training dataset; this synthetic-image approach can be used to generate large datasets for rare molecular features. The choice of CNN also affects the performance of the AI models; commonly used networks such as ResNet18, ShuffleNet, and Inception-V3 appear in most of the studies. The ResNet model has many variations according to the number of layers, such as ResNet18, ResNet34, and ResNet50. ResNet18 is an 18-layer deep residual network; stacking many layers can degrade the output of a plain network, but the residual (skip) connections in ResNet mitigate this degradation [69]. ShuffleNet has a simple architecture that is also optimized for mobile devices [53]; therefore, it can achieve high accuracy with a low training time [53].
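The depth-degradation point can be illustrated with a toy numeric example: a plain stack of weak layers shrinks its input toward zero, while a residual connection (output = layer(x) + x) falls back to the identity and preserves the signal. This is plain Python with a scalar "layer", not an actual ResNet.

```python
# Toy illustration of a residual connection: output = layer(x) + x.
# When a layer contributes little, the residual block approximates the
# identity, so stacking many blocks does not erase the input signal.
def plain_layer(x, weight):
    return weight * x                    # degenerate "layer": simple scaling

def residual_block(x, weight):
    return plain_layer(x, weight) + x    # skip connection adds the input back

x = 1.0
w = 0.1                                  # near-zero weight, mimicking a weak layer
plain = x
skip = x
for _ in range(10):                      # stack ten blocks of each kind
    plain = plain_layer(plain, w)        # signal collapses: 0.1 ** 10
    skip = residual_block(skip, w)       # signal preserved: 1.1 ** 10
print(f"{plain:.3e}", round(skip, 3))    # 1.000e-10 2.594
```

The same intuition explains why very deep residual networks remain trainable where equally deep plain networks stall.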
One study observed that lightweight neural network models performed on par with more complex models [53]. Comparing the performance of several (e.g., three to six) of these models is essential for enhancing the performance of the final model.

External Validation and Multi-Institutional Study
In CRC, six of the 11 studies included an external validation, with AUCs ranging from 0.61 to 0.97. For endometrial and gastric cancers, only one study in each group performed external validation. AI models that are trained and tested on a single dataset may overfit: they perform well on internal datasets but poorly on external ones. Therefore, external validation on different datasets is always necessary to obtain a well-trained AI model.
Studies also suggested that a large sample size, multi-institutional data, and patients from different populations are needed to determine the generalization performance of AI models. An overview of a multicentric study design is shown in Figure 8. AI models trained mainly on data from Western populations performed poorly when validated on Asian populations [29]. Another study suggested that transfer learning for model fine-tuning on different ethnic populations may improve the generalizability of AI models [50]. Previous researchers have argued that multi-institutional and multinational datasets enhance the generalizability of DL models [70,71].

MSI Prediction on Biopsy Samples
Most studies used only WSIs of surgical specimens to develop their AI models. However, MSI prediction on small colonoscopic biopsy samples, if feasible, would be more practically useful in the clinical setting. A recent study observed relatively low performance on biopsy samples with an AI model trained on surgical specimens [30]. Thus, further research on small biopsy samples is required to increase performance.

Establishment of Central Facility
AI technology in medical applications is still growing; a recent study showed an increasing trend in patents related to AI and pathological images [72]. The lack of installed slide scanners in hospitals can hinder the implementation of DL models. WSIs are large files that cannot easily be stored in a routine hospital setting, and whole slide scanners, viewing and archiving systems, and appropriate servers are expensive equipment that cannot be easily established. Establishing central slide-scanning facilities with servers of large data storage capacity could overcome this challenge [45,73].

Future Direction
Originally, AI applications in the pathology field focused on mimicking or replacing human pathologists' tasks, such as segmentation, classification, and grading. The main goal of these studies was to reduce intra- or inter-observer variability in pathologic interpretation to support or augment human ability. AI models trained with small datasets may overfit the target sample, which adversely affects performance. For accurate AI models, factors such as class imbalance and selection bias in the dataset must be considered during development. Since dataset labels are important for the training of AI models, biased or low-quality labels will decrease model performance; therefore, collaborative research between pathologists and AI researchers is needed. Furthermore, most of the studies used the TCGA dataset, which is a collection of representative cases and may not efficiently represent the general population; model performance therefore cannot be generalized, as the dataset may not contain the many rare morphologic types that exist in the general population. For the future, we suggest collecting larger datasets of various ethnic populations, reviewed by experienced pathologists, to minimize selection bias and enhance the generalizability of AI models. In addition, external validation should be performed with representative data from various ethnic populations. Randomized controlled trials are a useful tool for assessing risk and benefit in medical research; randomized or prospective clinical trials of AI models are needed before these models are used in routine clinical practice. Most of the AI models were developed using surgical sample datasets.
Although immunotherapy is the best treatment choice for CRC patients with stage IV tumors, an endoscopic biopsy sample is often the only available tissue from these patients because surgical resection is not possible. Future studies are needed to accurately estimate MSI from biopsy samples, which will aid in the selection of immunotherapy for patients with advanced CRC. Currently available AI models cannot specifically differentiate between Lynch syndrome and sporadic MSI-H cancers; the development of an AI model for detecting Lynch syndrome may help in selecting better therapeutic options for these patients. It is also difficult to understand how AI models arrive at their conclusions, because AI algorithms process data in a "black box". Therefore, AI models should be validated against currently available quality standards to ensure their efficiency.
However, scientists are increasingly focusing on the "superpower" of AI models that can surpass human abilities, such as the prediction of mutations, prognosis, and treatment response in cancer patients. Our research group has already developed an AI model for MSI prediction in CRC, and the results are quite promising [48]. These findings motivated us to initiate a multi-institutional research project on MSI prediction from CRC WSIs. Our first aim is to collect a large image dataset of CRC patients and to have experienced pathologists verify the image quality. Second, we will develop an AI model using this large dataset and test its generalized performance so that it may be feasible for routine practice. At present, we are scanning the H&E slides of CRC patients in collaboration with 14 hospitals/institutions around the country.

Conclusions
This study showed that AI models can become an alternative and effective method for predicting MSI-H from WSIs. Overall, the AI models showed promising results and have the potential to predict MSI-H in a cost-effective manner. However, the lack of large datasets, multiethnic population samples, and external validation were major limitations of the previous studies, and the AI models are not currently approved for clinical use as replacements for routine molecular tests. As the cancer burden increases, a precise diagnostic method is needed to predict MSI-H, identify appropriate candidates for immunotherapy, and reduce medical costs. AI models can also be used as a prescreening tool to select patients with a high MSI-H probability before testing with the currently available, costly PCR/IHC methods. Future studies are needed to accurately estimate MSI from biopsy samples, which will aid in the selection of immunotherapy for patients with advanced-stage CRC. Moreover, currently available AI models cannot specifically differentiate between Lynch syndrome and sporadic MSI-H cancers; the development of an AI model for detecting Lynch syndrome may help in selecting better therapeutic options for these patients. To ensure efficiency, AI models should therefore be tested against currently existing quality standards before being used in clinical practice. Well-designed AI models can improve performance without compromising diagnostic accuracy; training with larger datasets and external validation on new datasets may bring the performance of AI models to an acceptable level.

Table S1: Artificial intelligence models used for microsatellite instability prediction.

Informed Consent Statement: Not applicable.

Data Availability Statement:
The data presented in this study are available upon request from the corresponding author (https://www.researchgate.net/profile/Yosep-Chong (accessed on 17 April 2022)). The data are not publicly available due to institutional policies.