Artificial intelligence (AI) systems that exceed expert performance still have shortcomings when they are applied to data outside their training domain. At present, such AI systems lack a form of context awareness that would allow the model to reject data outside its learned feature space. Since medical examinations often include an extensive range of anatomical checks, there is a risk that AI-based automated lesion detectors will be applied outside the target domain. When inexperienced clinicians rely on the algorithm, this may lead to more false positives and false negatives and thereby to diagnostic errors, to the detriment of patients. Assistive tools for automatic lesion detection should therefore be designed for robustness and accuracy with standard clinical practice in mind.
In the field of gastroenterology, a Computer-Aided Detection (CAD) system has been developed for Barrett’s neoplasia detection in white light endoscopic still images [1], achieving expert performance. Yet, this algorithm is restricted to, and validated on, the visual features of a Barrett’s esophagus (BE). In clinical practice, it is common to fully assess the esophagus from the stomach to the healthy squamous region. Therefore, the current model should be restricted to the analysis of the Barrett’s region of the esophagus.
To facilitate continuous analysis of the video signal during the full clinical protocol, a vast pool of new relevant and irrelevant features needs to be taken into account. For example, optical tissue deformation, which can be estimated from consecutive frames, is an inherent cell marker for testing malignant morphological changes, according to Guck et al. [2]. In contrast, when ambiguous frames are introduced, the model could become unstable, according to Van der Putten et al. [3]. To deal with such ambiguity, the model should consider the context of prior frames for robust and reliable decision making. Consecutive frames in an endoscopy procedure do not differ substantially, and therefore information prior to an ambiguous frame can be exploited to make an accurate prediction. Such sequential models could be used to improve position tracking during an endoscopy procedure. Accordingly, since Esophageal Adenocarcinoma (EAC) only occurs in a particular segment of the esophagus (i.e., in BE), frames that are captured outside this segment could be disregarded by an EAC detection algorithm, leading to a reduction in false alarms and increased user confidence in the CAD system.
In practice, different approaches and algorithms have been applied to time-series data, including independent frame analysis, averaging over the temporal domain and hidden Markov models [4]. However, the absence of long-term memory in these models hampers the exploitation of long-distance interactions and correlations, which makes the corresponding algorithms unsuitable for learning the long-distance dependencies typically found in clinical data. Since the existing image-based classification networks are trained on still images in overview, their response to unseen non-informative frames is unknown. This implies that algorithms trained only on still images do not perform well on video signals without algorithm modifications [9].
Recurrent Neural Networks (RNNs) can be used to provide a temporal flow of information. These networks have been widely used to process sequential video data and are capable of dealing with long-term dependencies. In this type of artificial neural network, the connections between units form a directed cycle. This cycle creates an internal state of the network, which allows it to exhibit and model dynamic temporal behavior without computation-intensive 3D convolutional layers. Recently, Yao et al. [10] demonstrated a state-of-the-art method for action recognition, which applies Gated Recurrent Units (GRUs) to the deep spatiotemporal information extracted by a convolutional network. Furthermore, Yue et al. [11] and Donahue et al. [12] have successfully demonstrated the ability of RNNs to recognize activity, based on a stack of input frames. A similar approach could be followed for the classification of tissue in videos, potentially leading to a more temporally stable algorithm, since it is able to exploit information in the temporal domain.
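For reference, the internal state mentioned above can be written out for a GRU in one common convention. This is a general sketch, not the exact formulation used in the cited works: here $x_t$ is assumed to be the feature vector extracted from frame $t$ by a convolutional network, $h_t$ is the hidden state, $\sigma$ is the logistic sigmoid, $\odot$ is the element-wise product, and the weight matrices $W$, $U$ and biases $b$ are learned parameters. Some references swap the roles of $z_t$ and $1 - z_t$ in the last line.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) && \text{(update gate)} \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) && \text{(candidate state)} \\
h_t &= \left(1 - z_t\right) \odot \tilde{h}_t + z_t \odot h_{t-1} && \text{(new hidden state)}
\end{aligned}
```

When $z_t$ is close to one, the previous state $h_{t-1}$ is carried over almost unchanged, which is how information from earlier frames can persist across an ambiguous frame.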
The literature describes a variety of methods to analyze video for classification tasks in endoscopy. The most basic form of video analysis is single-frame-based classification [13]. Other recent work on video analysis in endoscopy focuses on a frame-based analysis approach with additional post-processing to yield some form of temporal cohesion. Byrne et al. [16] describe a frame-based feature extractor, which interpolates a confidence score between consecutive frames, in order to make a more confident prediction for colorectal polyp detection. De Groof et al. [9] implement a voting system for multiple frames on multiple levels. Yu et al. [17] describe a 3D convolutional model, in order to capture inter-frame correlations. Yet, 3D convolutions fail to capture long-term information. Harada et al. [18] propose an unsupervised learning method, which clusters frame-based predictions, in order to improve temporal stability in tissue classification. Yet, a clustering approach is not able to capture the correlation between consecutive frames. Frameworks that do actively learn spatiotemporal information with the implementation of RNNs are described by Owais et al. [19] and Ghatwary et al. [20]. They demonstrate that the implementation of RNNs yields superior classification accuracies in endoscopic videos, but no quantitative results are reported on the stability of the employed models.
In this paper, we address the ambiguity in the classification of tissue in the upper gastrointestinal tract by introducing RNNs, as a first exploratory study to obtain a more robust system for endoscopic lesion detection. Our system is generally applicable to CAD systems in the gastrointestinal tract and can potentially serve as a pre-processing step that reduces the number of false alarms for a wide range of endoscopic CAD systems. We hypothesize that by extending ResNet18 with RNNs, or more specifically, by employing Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) layers, the model is able to actively learn and memorize information seen in earlier frames and thereby make a more accurate prediction of the tissue class than networks without temporal processing.
Our contributions are therefore as follows. First, our work demonstrates that including temporal information in endoscopic video analysis leads to an improved classification performance. Second, we show that exploiting the concepts of LSTM and GRU outperforms conventional Fully Connected (FC) networks. Third, the proposed approach offers a higher stability and robustness in classification performance, so that it paves the way for applying automated detection during the complete clinical endoscopic procedure.
The output stability is measured for four models: FC, FC Avg (n = 5), LSTM and GRU. On average, the networks switch labels the following number of times per video: FC 43.27 (±23.83), FC Avg (n = 5) 18.48 (±9.76), LSTM 10.81 (±5.68) and GRU 11.91 (±6.33). These results demonstrate that the models implemented without RNNs switch labels roughly 2–4 times more often within a single video. This is especially apparent in the videos on which the model performs poorly, see Figure 3. Considering the mean accuracy of the four models over the five tissue classes, the results show that averaging over five consecutive frames yields a performance improvement of 0.8%. The introduction of RNNs into the classification model yields an increase of up to 3.7% in overall accuracy, as seen from the accuracies of 85.9% and 85.6% for LSTM and GRU, respectively, compared to 82.2% for FC. Detailed performances per model for each class are provided in the confusion matrices in Figure 4. The Wilcoxon signed-rank test confirms the statistical significance of the results in all comparisons on accuracy of the FC, LSTM and GRU classifiers.
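The stability measure and the frame-averaging baseline can be illustrated with a minimal sketch. It assumes that a label switch is any change of the predicted class between two consecutive frames, and that FC Avg (n = 5) averages the per-class probabilities over the current and the four preceding frames before taking the argmax; the text does not specify either detail. The function names and the toy probabilities are hypothetical.

```python
from collections import deque


def count_label_switches(labels):
    """Count how often the predicted label changes between consecutive frames."""
    return sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)


def average_then_argmax(frame_probs, n=5):
    """Average class probabilities over the last n frames, then take the argmax.

    Assumption: a causal window (current frame plus up to n - 1 previous frames).
    """
    window = deque(maxlen=n)
    labels = []
    for probs in frame_probs:
        window.append(probs)
        num_classes = len(probs)
        mean = [sum(p[c] for p in window) / len(window) for c in range(num_classes)]
        labels.append(max(range(num_classes), key=lambda c: mean[c]))
    return labels


# Toy two-class video: mostly class 0, with two isolated ambiguous frames.
probs = (
    [[0.9, 0.1]] * 4 + [[0.4, 0.6]]
    + [[0.9, 0.1]] * 3 + [[0.45, 0.55]]
    + [[0.9, 0.1]] * 3
)
raw_labels = [max(range(2), key=lambda c: p[c]) for p in probs]
smoothed_labels = average_then_argmax(probs, n=5)

print("switches without averaging:", count_label_switches(raw_labels))
print("switches with averaging:   ", count_label_switches(smoothed_labels))
```

In this toy sequence, the two isolated ambiguous frames cause four switches in the raw output and none after averaging. Averaging can only smooth over a fixed window, whereas a recurrent model learns how much of the earlier context to retain.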
An important observation is the accuracy of 98.3% on the Barrett’s segment in the esophagus. A good performance on this label is crucial, since this model will be used as an a priori tissue classification step to extend the robustness of lesion detection. High sensitivities are generally preferred in this field, since a false positive will only lead to an extra biopsy, while a false negative can severely harm the patient. The demonstrated accuracy implies that roughly 1.7% of the images are rejected for lesion detection because of a false negative. We consider this acceptable because, during inference, we are able to process up to 180 frames per second for real-time video analysis. Hence, even if some frames are rejected, the time gap between the analyzed frames remains small, so that the Barrett area will still be analyzed effectively.
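The time-gap argument can be made concrete with a short calculation. The sketch below uses the two figures from the text (180 frames per second and 98.3% accuracy on the Barrett label) and assumes that frames arrive at the full processing rate and that rejections are independent between frames. Both are simplifying assumptions: errors in consecutive video frames are likely correlated, so the probabilities are optimistic. All names are hypothetical.

```python
FPS = 180            # processing rate reported in the text
BARRETT_ACC = 0.983  # per-frame accuracy on the Barrett label reported in the text

frame_interval_ms = 1000.0 / FPS
miss_rate = 1.0 - BARRETT_ACC


def gap_ms(k):
    """Time between two analyzed frames when k frames in between are rejected."""
    return (k + 1) * frame_interval_ms


def prob_consecutive_misses(k):
    """Probability of k rejections in a row, ASSUMING independent frame errors."""
    return miss_rate ** k


for k in (1, 2, 3):
    print(f"{k} rejected frame(s): gap {gap_ms(k):.1f} ms, "
          f"probability {prob_consecutive_misses(k):.2e}")
```

Under these assumptions, a single rejected frame widens the gap between analyzed frames from about 5.6 ms to about 11.1 ms.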
5. Discussion and Conclusions
In this work, we have explored the use of Recurrent Neural Networks (RNNs) for true temporal analysis of endoscopic video. In particular, we have evaluated two popular RNN architectures (i.e., Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU)) for tissue classification in endoscopic videos. This is a particularly interesting application, since current CAD systems show a relatively high number of false classifications for video frames captured outside the organ of interest. Reliably detecting the organ that is currently in view can therefore lead to an increased CAD performance. We demonstrate that by exploiting temporal information in latent space, much more stable classification behavior is observed than when simple frame averaging is used. Hence, the results confirm our hypothesis that by leveraging RNNs, we can stabilize the classification output of the model. Moreover, by learning the temporal flow, we have also observed an increase in accuracy for all tissue classes. For the application of Barrett’s cancer detection, the proposed system reliably detected the tissue of interest, i.e., the Barrett label, with an accuracy of 98.3%. These results are a proof of concept, and the presented models therefore do not necessarily yield optimal results. In future work, we will address this limitation by conducting an ablation study to find the optimal parameters.
The classification performance on the stomach and squamous tissue remains relatively poor. This discrepancy can be observed in Table 2 and is mostly caused by the definition of the label correspondence mapping in Table 1. Although the algorithm is able to approximate the tissue type, it often also predicts the neighboring tissue type. This error can be readily understood, as there is no sharply defined transition at the visible border between tissue types, i.e., each view gradually transitions over time into the next one, so that adjacent tissue areas (and labels) exhibit visually similar features (see Appendix A).
To address this transition ambiguity, a score based on the agreement between observers could be introduced. However, in our current training protocol, only one annotated label is available per frame, originating from one of the three observers. By introducing multiple observers per frame, an agreement score can be calculated (e.g., by simple majority voting), which can be used to train a future algorithm. Such an approach would take the ambiguity into account and could potentially also result in an additional score for ambiguity.
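A simple majority-voting agreement score could look as follows. This is a minimal sketch of one possible definition, not a protocol taken from the study: the agreement is the fraction of observers that chose the most frequent label, and one minus that fraction serves as an ambiguity score. The function name and the example annotations are hypothetical, and ties are resolved in favor of the label that appears first.

```python
from collections import Counter


def majority_label_and_agreement(observer_labels):
    """Return the most frequent label and the fraction of observers choosing it."""
    counts = Counter(observer_labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(observer_labels)


# Hypothetical annotations of three frames by three observers each.
frames = [
    ["Barrett", "Barrett", "Barrett"],    # full agreement
    ["Barrett", "Squamous", "Barrett"],   # transition zone, partial agreement
    ["Stomach", "Barrett", "Squamous"],   # no majority
]
for labels in frames:
    label, agreement = majority_label_and_agreement(labels)
    print(f"{label}: agreement {agreement:.2f}, ambiguity {1 - agreement:.2f}")
```

The agreement value could, for example, be used as a soft target or a sample weight during training, so that frames near a tissue transition contribute less than unambiguous frames.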
Another limitation is that the employed data is currently imbalanced. As can be seen in Table 2, the labels Stomach and Squamous are under-represented in the dataset. This imbalance is partly the reason for the poor performance on these classes. To overcome the limitation of available data, future efforts will focus on the collection of data originating from sources other than videos alone.
In conclusion, our work has demonstrated that incorporating temporal information in endoscopic video analysis can lead to an improved classification performance. Exploiting the sequential bias present in endoscopic video (e.g., the order in which the tissue types are captured) yields, in addition to a higher accuracy, a more stable classification behavior over time. Although our approach is directly applicable to EAC detection in BE patients, where it is likely to enhance CAD performance by reliably detecting the Barrett’s tissue, it can also be generalized and easily translated to similar endoscopic video analysis tasks. Future experiments should explore such novel applications and should focus on combining the proposed pre-processing system with several subsequent, already established classification tasks.