Artificial Intelligence for Predicting Microsatellite Instability Based on Tumor Histomorphology: A Systematic Review

Microsatellite instability (MSI)/defective DNA mismatch repair (dMMR) is receiving increasing attention as a biomarker of eligibility for immune checkpoint inhibitors in advanced disease. However, due to high costs and resource limitations, MSI/dMMR testing is not widely performed. Attempts are in progress to predict MSI/dMMR status from histomorphological features on H&E slides using artificial intelligence (AI). In this study, the potential predictive role of this new methodology was assessed through a systematic review. Studies up to September 2021 were identified through PubMed and Embase searches. The design and results of each study were summarized, and the risk of bias of each study was evaluated. For colorectal cancer, AI-based systems showed excellent performance, with a highest reported AUC of 0.972; for gastric and endometrial cancers they showed relatively lower but satisfactory performance, with highest AUCs of 0.81 and 0.82, respectively. However, when the risk of bias was analyzed, most studies were judged to be at high risk. AI-based systems showed high potential in predicting the MSI/dMMR status of different cancer types, particularly of colorectal cancer. A confirmatory test should therefore be required only for cases that are positive in the AI-based test.


Introduction
Microsatellites, also called short tandem repeats (STR) or simple sequence repeats (SSR), are short, repeated sequences of 1-6 nucleotides present throughout the genome [1]. Their repetitive nature makes them particularly vulnerable to DNA mismatch errors (insertion, deletion, and misincorporation of bases) that occur during DNA replication and recombination. One of the most important DNA repair systems, called mismatch repair (MMR), usually corrects these errors in normal tissues. However, when MMR genes (namely MLH1, PMS2, MSH2, and MSH6) are altered (defective MMR; dMMR), the probability of accumulating mutations in microsatellite regions increases exponentially [2,3]. Cancers with dMMR are thus often hypermutated, with mutations clustering in microsatellite regions. In deep learning, the model's prediction is compared with the ground truth and, if wrong, its filters are adjusted to improve the prediction accuracy [36]. Several studies on MSI/dMMR prediction through deep learning have shown promising results, but a comprehensive evaluation of their achievements is still lacking. Furthermore, most of the studies have been conducted on colorectal cancer, and only a few on other tumor types. To review the research flow of MSI/dMMR prediction across cancer types and to set a guide for future work, a wide and in-depth review of the studies so far is needed. In this study, we systematically review these studies, their reliability, and the potential risks of bias, while also discussing future perspectives.
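The training principle mentioned above (comparing a prediction with the ground truth and adjusting the model's weights when it is wrong) can be sketched in a few lines. This is a generic, illustrative example, not code from any of the reviewed studies; all names and values are hypothetical.

```python
# Minimal illustration of supervised training: a prediction is compared
# with the ground truth and, when wrong, the weight is nudged to reduce
# the error. A CNN adjusts its filter weights in the same spirit.

def train_single_weight(xs, ys, lr=0.1, epochs=100):
    """Fit y = w * x by gradient descent on the squared error."""
    w = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            pred = w * x
            error = pred - y          # compare prediction with ground truth
            w -= lr * error * x       # adjust the weight to reduce the error
    return w

# Learn the mapping y = 2x from a handful of labeled examples.
w = train_single_weight([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In a real network the same update is applied to millions of filter weights at once via backpropagation, but the compare-and-adjust loop is identical.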

Materials and Methods
This systematic review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [37] (Table S1).

Inclusion and Exclusion Criteria
All studies published until 30 September 2021 that used deep learning to predict microsatellite instability on histopathologic slides were included. Exclusion criteria were: (1) not using whole slide images of human tissue slides; (2) not related to microsatellite instability; (3) not published in English; (4) narrative review articles, case reports, and letters to editors.

Data Sources and Literature Search Strategy
Two investigators (J.H.P. and J.I.S.) independently searched PubMed and Embase to identify studies up to 30 September 2021. The search terms used in PubMed were as follows: (Artificial intelligence OR Machine learning OR Deep learning OR Computer-assisted OR Digital image analysis) AND (Microsatellite instability OR MSI OR MMR OR mismatch repair). A similar search was conducted in Embase. Additionally, relevant articles were selected manually by checking the references of key articles.

Study Selection and Data Extraction
After removal of duplicates, two investigators (J.H.P. and B.J.L.) screened the articles independently according to the inclusion/exclusion criteria (Figure 2). Any discord was discussed until an agreement was reached. For each article, information about authors, year of publication, type of AI model, number of patients/images, cohort (training set, validation set), type of organ, performance outcome (area under the curve (AUC) and accuracy), and methods for MSI/dMMR testing (ground truth/reference standard) was extracted. One author (J.H.P.) extracted data from each study and a second, independent author (J.I.S.) validated the extracted data. Finally, all extracted data were reported and summarized in Table 1. The quality of each article was evaluated with the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool by two independent investigators (J.H.P. and B.J.L.) and summarized in Table S2 [38]. Only officially published articles were included in the QUADAS-2 evaluation.

Figure 1.
A simplified version of the convolutional neural network (CNN) workflow in digital pathology. The scanned H&E slide image passes through filters that detect specific features (e.g., lines, edges). Pooling layers summarize the features from the convolution layers. After a series of convolution and pooling layers, fully connected layers (the classification layer) are generated, and the output is produced through this layer.
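The two building blocks the caption describes can be demonstrated concretely. Below is a toy, pure-Python sketch of a convolution filter that responds to vertical edges and a max-pooling layer that summarizes the resulting feature map; real systems use deep learning frameworks, and this example is purely illustrative.

```python
# Toy versions of the CNN operations in Figure 1: convolution (feature
# detection) followed by max pooling (feature summarization).

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as used in CNNs)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keep the strongest response per window."""
    return [[max(fmap[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

# A 6x6 "image" with a bright vertical stripe in columns 2-3.
img = [[1 if 2 <= c <= 3 else 0 for c in range(6)] for r in range(6)]
edge_kernel = [[1, -1], [1, -1]]   # responds to vertical edges (sign = direction)
fmap = conv2d(img, edge_kernel)    # 5x5 feature map: strong values at the edges
pooled = max_pool(fmap)            # 2x2 summary retaining the strongest responses
```

Stacking many such filter/pool stages, with learned rather than hand-written kernels, is what lets a CNN build up from edges to tissue-level patterns.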

Search Results
The search yielded 553 non-duplicated articles. After excluding 538 articles based on title/abstract screening, 15 articles were retrieved for full-text review. Five articles were excluded after the full-text review and three were added by manual reference checking; 13 articles were finally selected for the systematic review (Figure 2) [39-51].

Predicting MSI/dMMR Status by AI-Based/Deep Learning Approaches
No studies used prospectively collected data. In the vast majority of cases (11/13), studies focused on colorectal cancer, alone or in combination with other cancer types (stomach, endometrium) (Table 1). Three studies investigated endometrial adenocarcinoma and four investigated gastric adenocarcinoma. Except for one study by Echle et al. [46], all studies used only The Cancer Genome Atlas (TCGA) data or included TCGA as part of the study cohort. Methods for dividing the cohorts into training and test sets differed across studies. Three studies used K-fold cross-validation and seven randomly split the data into a training set and a test set. One conference abstract did not specify the method [42]. Two studies compared performance using both methods. Regarding the methodology for assessing MSI/dMMR status (the reference standard), nine studies used MSI-based PCR, three used IHC, and three used NGS to establish a ground truth. Three unpublished studies (conference papers, abstracts) did not disclose the specific method of MSI/dMMR assessment. Since each study assembled its cohort from various data groups, the method for setting the reference standard could differ even within a single study. For example, in the study by Kather et al. [41], PCR was used in the TCGA cohort and in the Darmkrebs: Chancen der Verhütung durch Screening (DACHS; German for "colorectal cancer: chances of prevention through screening") cohort, whereas the Kanagawa Cancer Center Hospital (KCCH) data used IHC. Moreover, even when the same methodology was adopted, discrepancies were present among studies: this was the case for MSI-based PCR, since some studies used non-standardized methods. Indeed, the DACHS data group used a 3-plex PCR, and the United Kingdom-based Quick and Simple and Reliable trial (QUASAR) and the Netherlands Cohort Study (NLCS) data groups used a 2-plex PCR, whereas the gold standard calls for at least five markers [4].
The same was true for IHC: the NLCS used an IHC panel with only two antibodies and without PCR confirmation. Nine out of 13 studies measured MSI/dMMR prediction performance using an external validation cohort. As for the form of reporting the performance value, one conference paper reported accuracy, and the remaining studies reported AUC. The study reporting accuracy achieved up to 98.3% for colorectal cancer and up to 94.6% for endometrial carcinoma. As for AUC, studies using colorectal cancer tissue showed higher performance (highest AUC of 0.972) compared to stomach and endometrial tumors (highest AUCs of 0.81 and 0.82, respectively), which is relatively lower but still satisfactory. Individual studies used various deep learning models for prediction (Table 1, Figure 3). Of the 12 studies that revealed the exact model (one did not), five used a ResNet-based model, accounting for the largest number, followed by the ShuffleNet model, used in three studies. Inception-V3, MSInet, and InceptionResNetV1 were used in two, one, and one study, respectively.
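Since AUC is the performance value reported by almost all of the included studies, it may help to recall what that number measures: the probability that a randomly chosen MSI/dMMR case receives a higher predicted score than a randomly chosen MSS/pMMR case. A minimal sketch with made-up per-slide scores (none of these numbers come from the reviewed studies):

```python
def auc(labels, scores):
    """AUC = P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical per-slide MSI probabilities (label 1 = MSI, 0 = MSS).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
example_auc = auc(labels, scores)   # 8 of 9 MSI/MSS pairs correctly ranked
```

Because AUC is rank-based, it does not depend on the score threshold chosen for calling a slide MSI, which is one reason it is the preferred reporting metric across these studies.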
The model with the highest performance for predicting MSI in colorectal cancer was developed by Lee et al. [50]. In their study, Inception-V3 was trained on a cohort composed of images from TCGA and Saint Mary's Hospital (SMH). When this trained model was tested on an internal validation cohort (TCGA), the AUC was 0.892, and the AUC on another internal cohort (SMH dataset) was 0.972, the highest value reported among the included studies. The authors also developed another model trained only on the TCGA dataset. When this model was tested on an external validation cohort (SMH dataset), its performance dropped to 0.787; when tested on the internal cohort (TCGA), the AUC was 0.861.
As for endometrial carcinoma, the strongest model for MSI prediction was developed by Hong et al. [51]. In this study, the authors trained the model using the TCGA and Clinical Proteomic Tumor Analysis Consortium (CPTAC) datasets. When this model was tested on an internal cohort, it achieved an AUC of 0.827; however, when tested on an external cohort, performance dropped to 0.667. The strongest performance outcome for stomach cancer was demonstrated by Valieris et al. [47]. They trained a ResNet-34 using a TCGA dataset, and when tested on an internal cohort the model achieved an AUC of 0.81. A similar result was reported by Kather et al. [41].
Among the individual studies, Echle et al. [46] used the largest amount of data for model development. The MSIDETECT consortium, which consists of TCGA, DACHS, QUASAR, and NLCS, was used as training material. They demonstrated good performance for predicting MSI on both internal and external validation cohorts (AUC of 0.92 for the internal cohort and 0.96 for the external cohort).

The digitalized system is able to separate cancer cells (here colored in blue) from non-cancer cells (red). (C) The immunohistochemistry for mismatch-repair proteins can also be taken into account in this process. This panel shows mismatch-repair protein staining, with loss of the protein in the neoplastic component, while its expression is retained in non-tumor cells (original magnification: 10×, same field as the hematoxylin-eosin). (D) The digitalized system is able to interpret the results of immunohistochemistry, based on a deep learning approach. In this step, the system shows its ability to separate cancer cells (here colored in blue) from non-cancer cells (brown). (E,F) Since the immunohistochemistry for mismatch-repair proteins is a nuclear staining, for finalizing its interpretation the system here shows its ability to detect and analyze only cell nuclei, with tumor cells in blue and non-tumor cells in red (E,F: different resolutions of analysis, which can be adapted based on staining patterns and the difficulty of their interpretation). Scale bar represents 200 µm.

Assessment of the Risk of Bias and Applicability
When the overall risk of bias and applicability was assessed with the QUADAS-2 tool, most studies had one or more high-risk factors (Table S2). For the "patient selection" domain, a high risk of bias was detected in three studies (30%): studies using only TCGA data, without adding other data groups to increase data diversity, were judged to be at high risk. For the "index test" domain, six studies were found to be at high risk of bias, due to the absence of an external validation cohort and to not using K-fold cross-validation in the data split process. For the "reference standard" domain, high risk was assigned to studies that used IHC only for MSI/dMMR assessment; based on this criterion, two studies (20%) were judged to be at high risk. Finally, no study was found to be at high risk of bias in the "flow and timing" domain.

Discussion
Recent studies have shown that AI-based deep learning models can predict molecular alterations from the histopathological features of digital slides [25,44,52]. One of the most important biomarkers under investigation is MSI/dMMR, which is the most widely validated [46]. Current methodologies of MSI/dMMR assessment suffer from some limitations, including the potentially low reproducibility of IHC and the high costs of direct molecular tests (MSI-based PCR and NGS), the latter being available only in tertiary medical institutions [53]. Some attempts to determine MSI/dMMR using clinical data and histopathological features, especially in patients with colorectal cancer, were made prior to the development of AI-based deep learning technologies. Studies focused on colorectal cancer reported that MSI/dMMR tumors have peculiar histomorphological features (e.g., mucinous, signet-ring, or medullary morphology, poor differentiation, tumor-infiltrating lymphocytes, peritumoral lymphoid reaction, Crohn's-like lymphoid reaction) and clinical findings (lower age, right-colonic tumor location) compared to MSS neoplasms [20,22,23,54]. A similar study with overlapping results was also conducted in endometrial adenocarcinoma [24]. MSI status was predicted by integrating information on histological features and clinical data [22,23,55,56]. However, skilled pathologists had to spend a lot of time confirming each histological finding, and interobserver variability may affect the reliability of the results [57-59]. Hence, this approach cannot replace the existing detection systems. Nevertheless, these attempts established the premise of predicting MSI from morphological features, and AI-based deep learning technologies could now be of great help in overcoming the existing issues.
In addition to some of the morphological findings related to MSI revealed in previous human-based studies, subtle morphological features that humans cannot detect can be identified and collectively judged by AI. However, none of the individual studies mentioned the histological characteristics thought to be related to MSI. This is because, due to the nature of deep learning models, researchers can develop models through training data and obtain prediction results on test data, but they cannot know through what reasoning the model reaches its decisions. The performance of a deep learning algorithm is determined by various variables, such as the type of network architecture, sample preparation, size of the cohorts, and the method of defining the ground truth (reference standard). Among them, the use of a large and diverse amount of data has a great impact on performance: the higher the quality of the images used for training, the better the tumor detection and molecular subtyping [46,59]. However, unlike in non-medical fields, these images are tied to legal and ethical regulations as patient personal information, making it difficult to build datasets available for model development [60]. Due to these limitations, there are few available datasets, such as TCGA, which was adopted as a validation cohort in several studies. In recent investigations, generative models were used to create artificial histopathology images that were difficult to distinguish from real images [61-63]. In a study by Krause et al. [49], a cohort containing synthetic images used for model training showed non-inferior performance compared with models trained on real images only. This study indicated a new alternative for overcoming data acquisition limitations.
Multi-national and multi-institutional datasets are essential for developing generalizable models that reflect differences between diverse regions and ethnicities around the world [64,65]. In a recent study by Kather et al. [41], the model trained with TCGA data showed poor performance on a Japanese cohort (KCCH), and similar results were identified in the remaining studies. Cao et al. [45] also showed a low AUC when a model trained on TCGA data was applied to an Asian group, but the inclusion of the Asian group in the training set improved performance. Thus, for deep learning technology, it is necessary to secure large amounts of data (so-called big data) that reflect the actual real-world patient population, so that algorithms can learn from reliable training material. All of these findings and considerations call for taking into account demographic and representative data of different ethnic groups when building robust AI-based systems to be used in clinical practice.
It is also crucial to secure an independent external validation set in model evaluation [66], keeping training groups and validation groups distinct. This should be seen as an important tool for guaranteeing a high level of reliability for AI-based systems, the assumption "garbage in, garbage out" always being valid.
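The separation of training and validation groups can be made concrete: in digital pathology, many tiles are cut from each patient's slide, so the split must be done per patient, otherwise tiles from the same slide leak into both sets and inflate performance. A minimal sketch of such a patient-level split (all field names and data are hypothetical, not from any reviewed study):

```python
import random

def patient_level_split(tiles, test_fraction=0.3, seed=42):
    """Split tiles so that no patient contributes to both sets (no leakage)."""
    patients = sorted({t["patient_id"] for t in tiles})
    rng = random.Random(seed)          # fixed seed for reproducibility
    rng.shuffle(patients)
    n_test = max(1, int(len(patients) * test_fraction))
    test_ids = set(patients[:n_test])
    train = [t for t in tiles if t["patient_id"] not in test_ids]
    test = [t for t in tiles if t["patient_id"] in test_ids]
    return train, test

# Toy data: 10 patients, 5 tiles each, with an MSI (1) / MSS (0) label.
tiles = [{"patient_id": f"P{i}", "tile": j, "label": i % 2}
         for i in range(10) for j in range(5)]
train, test = patient_level_split(tiles)

# Sanity check: no patient appears in both sets.
assert not ({t["patient_id"] for t in train} & {t["patient_id"] for t in test})
```

A truly external validation cohort goes one step further: the held-out patients come from a different institution entirely, which is why the external AUCs reported above are the more trustworthy figures.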
The reference standards of the individual studies also showed some variation. According to current guidelines [4], the suggested methods for MSI/dMMR detection are as follows: (1) IHC alone, but only for tumors strictly belonging to the spectrum of HNPCC (Lynch) syndrome; (2) MSI-based PCR, based on at least five of the Bethesda markers and mandatorily including BAT25 and BAT26; (3) NGS, above all for tumors outside the HNPCC spectrum or in the case of limited neoplastic material (e.g., biopsy, cytological samples). Not only is there variation, but if MSI/dMMR status was determined with a 2-plex PCR, or with IHC outside the HNPCC spectrum, non-negligible doubts may arise about the reliability of the reference standard. Given that correctly labelling the dataset is fundamental for evaluating the performance of a given model, such variability in defining the reference standard poses limitations and concerns in evaluating the overall performance of the models. QUADAS-2 is an assessment tool for systematically evaluating the risk of bias and applicability of individual diagnostic accuracy studies. It was used as the quality assessment tool in this review, as per recent guidelines. Over the past decade, research on AI-based diagnostic tests has been at the center of AI research in the healthcare field. Of note, more than 90% of health-related AI systems that have received regulatory evaluation from the US Food and Drug Administration are related to the diagnostic field [27]. Although AI-based diagnostic research is carried out extensively, the existing QUADAS-2 tool for evaluating diagnostic research may not be sufficient to reflect the specificity of such a heterogeneous field, so it may be helpful to consider modified quality-assessment tools. Notably, an AI-specific tool (QUADAS-AI) is currently under development and is expected to be published soon.
The publication of such a specific tool will allow a closer evaluation of each diagnostic study in the AI field.
Although the diagnostic performance of the individual studies is reported to be excellent, it is unclear whether it is immediately applicable to clinical practice, due to the different risk factors identified through QUADAS-2 and to the variations between individual studies mentioned above.

Conclusions
Given that immunotherapy has been at the center of the cancer treatment paradigm in recent years, AI-based/deep learning technologies able to predict MSI status are expected to have a great impact on the diagnostic workflow of oncology and related areas, playing an important role as a standardized diagnostic tool. The presence of non-negligible limitations, however, calls for direct molecular confirmation (MSI-based PCR or NGS) in those cases that are positive on evaluation by AI-based systems.