Inter/Intra-Observer Agreement in Video-Capsule Endoscopy: Are We Getting It All Wrong? A Systematic Review and Meta-Analysis

Video-capsule endoscopy (VCE) reading is a time- and energy-consuming task. Agreement on findings between different readers, or within the same reader over repeated readings, is crucial for increasing performance and providing valid reports. The aim of this systematic review with meta-analysis is to provide an evaluation of inter/intra-observer agreement in VCE reading. A systematic literature search in PubMed, Embase and Web of Science was performed through September 2022. The degree of observer agreement, expressed with different test statistics, was extracted. As different statistics are not directly comparable, our analyses were stratified by type of test statistic, dividing the results into groups of “None/Poor/Minimal”, “Moderate/Weak/Fair”, “Good/Excellent/Strong” and “Perfect/Almost perfect” and reporting the proportions of each. In total, 60 studies were included in the analysis, with a total of 579 comparisons. The quality of the included studies, assessed with the MINORS score, was sufficient in 52/60 studies. The most common test statistics were the Kappa statistic for categorical outcomes (424 comparisons) and the intra-class correlation coefficient (ICC) for continuous outcomes (73 comparisons). In the overall comparison of inter-observer agreement, only 23% of comparisons were evaluated as “good” or “perfect”; for intra-observer agreement, this was the case in 36%. Sources of the high heterogeneity (I² 81.8–98.1%) were investigated with meta-regressions, showing a possible role of country, capsule type and year of publication in Kappa inter-observer agreement. VCE reading suffers from substantial heterogeneity and sub-optimal agreement in both inter- and intra-observer evaluation. Artificial-intelligence-based tools and the adoption of a unified terminology may progressively enhance levels of agreement in VCE reading.


Introduction
Video-capsule endoscopy (VCE) entered clinical use in 2001 [1]. Since then, several post-market technological advancements have followed, making capsule endoscopes the prime diagnostic choice for several clinical indications, such as obscure gastrointestinal bleeding (OGIB), iron-deficiency anemia (IDA), Crohn's disease (diagnosis and monitoring) and tumor diagnosis. Recently, the European Society of Gastrointestinal Endoscopy (ESGE) endorsed colon capsule endoscopy (CCE) as an alternative diagnostic tool in patients with an incomplete conventional colonoscopy or a contraindication to it, when sufficient expertise in performing CCE is available [2]. Furthermore, the COVID-19 pandemic has bolstered the use of CCE (and double-headed capsules) in clinical practice, as the test can be completed in the patient's home with minimal contact with healthcare professionals and other patients [3,4].
The diagnostic yield of VCE depends on several factors, such as the reader's performance, experience [5] and accumulating fatigue (especially with long studies) [6]. Although credentialing guidelines for VCE exist, there are no formal recommendations and only limited data to guide capsule endoscopists on how to read the many images collected in each VCE [7,8]. Furthermore, there is no guidance on how to increase performance and obtain a consistently high quality of reporting [9]. With accumulating data on inter/intra-observer variability in VCE reading (i.e., the degree of concordance between multiple readers or between multiple reading sessions of the same reader), we embarked on a comprehensive systematic review of the contemporary literature and aimed to estimate the inter- and intra-observer agreement of VCE through a meta-analysis.

Data sources and Search Strategy
We conducted a systematic literature search in PubMed, Embase and Web of Science in order to identify all relevant studies in which inter- and/or intra-observer agreement in VCE reading was evaluated. The primary outcome was the evaluation of inter- and intra-observer agreement in VCE examinations. The last literature search was performed on 26 September 2022. The complete search strings are available in Table S1. This review was registered at the PROSPERO international register of systematic reviews (ID 307267).

Inclusion and Exclusion Criteria
The inclusion criteria were: (i) full-text articles; (ii) articles reporting either inter- or intra-observer agreement values (or both) for VCE reading; (iii) articles in English, Italian, Danish, Spanish or French. The exclusion criteria were article types such as reviews, case reports, conference papers and abstracts.

Screening of References
After exclusion of duplicates, references were independently screened by six authors (P.C.V., U.D., T.B.-M., X.D., P.B.-C., P.E.). Each author screened one fourth of the references (title and abstract) according to the inclusion and exclusion criteria. In case of discrepancy, the reference was included for full-text evaluation. This approach was then repeated on the included references with an evaluation of the full text by three authors (P.C.V., U.D., T.B.-M.). In case of discrepancy in the full-text evaluation, the third author also evaluated the reference and a consensus discussion between all three determined the outcome.

Data Extraction
Data were extracted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [10]. We extracted data on patients' demographics, indication for the procedure, the setting for the intervention, the type of VCE and its completion rate, and the type of test statistics.

Study Assessment and Risk of Bias
Included studies underwent an assessment of methodological quality by three independent reviewers (P.C.V., U.D., T.B.-M.) using the Methodological Index for Non-Randomized Studies (MINORS) assessment tool [11]. Items 7, 9, 10, 11 and 12 were omitted, as they were not applicable to the included studies; therefore, since the ideal global MINORS score for non-comparative studies is at least two thirds of the total score (n = 24), we applied the same proportion to the maximum score obtainable after omitting these items (n = 14), yielding an arbitrary cut-off value of 10.
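As a worked illustration of this cut-off derivation (our own arithmetic, not part of the MINORS tool itself), applying the two-thirds proportion to the reduced maximum score gives

$$\frac{2}{3}\times 14 \approx 9.3 \;\longrightarrow\; \text{cut-off} = 10 \ (\text{rounded up}),$$

compared with $\frac{2}{3}\times 24 = 16$ for the full score.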

Statistics
In the included studies, different test statistics were used when reporting the degree of observer agreement. The most common ones were the Kappa statistic for categorical outcomes and the intra-class correlation coefficient (ICC) for continuous outcomes. Kappa and ICC are not directly comparable, and our analyses were therefore stratified by type of test statistic.
The Kappa statistic estimates the degree of agreement between two or more readers while taking into account the chance agreement that would occur if the readers guessed at random. Cohen's Kappa was introduced to improve on the previously common percent agreement [12].
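For reference, Cohen's Kappa relates the observed proportion of agreement $p_o$ to the proportion of agreement expected by chance $p_e$ (standard definition, not specific to any included study):

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

so that $\kappa = 0$ corresponds to agreement no better than chance and $\kappa = 1$ to perfect agreement.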
The ICC is a measure of the degree of correlation and agreement between measurements. It can be seen as a modification of the Pearson correlation coefficient, which measures the magnitude of correlation between variables (or readers); in addition, the ICC takes the readers' systematic bias into account [13,14].
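In its simplest variance-components form (one of several ICC models; the included studies used different variants), the ICC is the share of the total variance attributable to true differences between subjects rather than to readers or measurement error:

$$\mathrm{ICC} = \frac{\sigma^2_{\text{subjects}}}{\sigma^2_{\text{subjects}} + \sigma^2_{\text{readers}} + \sigma^2_{\text{error}}}.$$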
Less commonly reported were the Spearman rank correlation [15], Kendall's coefficient and the Kolmogorov-Smirnov test. First, we evaluated each comparison using guidelines for the specific test statistic (Table 1) and divided them into groups of "None/Poor/Minimal", "Moderate/Weak/Fair", "Good/Excellent/Strong" and "Perfect/Almost perfect" to report the proportions of each, stratified by inter/intra-observer agreement evaluations. As no guidelines were identified for Kendall's coefficient and the Kolmogorov-Smirnov test, we adopted the guidelines used for Kappa, as the scales were similar. The mean value was estimated stratified by test statistic. The significance level was set at 5%, and 95% confidence intervals (CIs) were calculated. All pooled estimates were calculated in random effects models stratified into four categories: inter-observer Kappa, intra-observer Kappa, inter-observer ICC and intra-observer ICC. To investigate publication bias and small-study effects, Egger's tests were performed and illustrated by funnel plots. Individual study data were extracted and compiled in spreadsheets for pooled analyses. Data management was conducted in SAS (SAS Institute Inc., SAS 9.4, Cary, NC, USA), while analyses and plots were performed in R (R Core Team, Vienna, Austria) using the metafor and tidyverse packages [16,17].
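A minimal sketch of this pooling step using the metafor package named above; the input file and column names (yi, vi, stat, level) are hypothetical placeholders, and pooling the agreement coefficients directly on their original scale is an assumption made for illustration rather than a statement of the exact procedure used:

```r
library(metafor)
library(tidyverse)

# Hypothetical extraction file: one row per comparison, with the reported
# agreement estimate (yi), its variance (vi), the test statistic (stat)
# and whether the comparison is inter- or intra-observer (level).
dat <- read_csv("extracted_comparisons.csv")

# Random effects model within one stratum, e.g., inter-observer Kappa.
kappa_inter <- dat %>% filter(stat == "kappa", level == "inter")
res <- rma(yi = yi, vi = vi, data = kappa_inter, method = "REML")
summary(res)   # pooled estimate, 95% CI and I^2 heterogeneity

# Small-study effects / publication bias: funnel plot and Egger's test.
funnel(res)
regtest(res)
```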

Results
Overall, 483 references were identified from the databases. After the removal of duplicates, 269 were screened, leading to 95 references for full-text reading. One additional reference was retrieved via snowballing. Sixty (n = 60) studies were eventually included, 37 of which reported information on variance for their agreement measures, enabling their inclusion in the pooled estimates (Figure 1). MINORS scores ranged from 7 to 14, with the majority of references scoring 10 or above (n = 52) (Table 2).

The distribution of evaluations, stratified by inter/intra-observer agreement, was analyzed by combining all specific comparisons regardless of the type of statistical model (whenever more than one model was applied to the same outcome, only Kappa was considered, as in 25 inter-observer comparisons): in 479 inter-observer comparisons, a "good" or "perfect" agreement was obtained in only 23% of the cases; in 75 intra-observer comparisons, this was the case in 36% of the cases (Figure 2).

For the pooled random effects models stratified by inter/intra-observer agreement and test statistic, the overall estimates of agreement ranged from 0.46 to 0.84, although a substantial degree of heterogeneity was present in all four models (Figures 3 and 4). The I² statistic ranged from 81.8% to 98.1% (Figure 4). Meta-regressions investigating the possible sources of heterogeneity found no significant moderator for ICC inter-observer agreement, whereas for Kappa inter-observer agreement, country, capsule type and year of publication may have contributed to the heterogeneity.

For the random effects models of the overall inter/intra-observer agreements, Egger's tests resulted in p-values < 0.01 for the inter/intra-observer ICC models, 0.78 for Kappa inter-observer and 0.20 for Kappa intra-observer (Figure 5).
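Continuing the sketch from the Statistics section (which defined kappa_inter), the meta-regression described above could be specified roughly as follows; the moderator column names (country, capsule_type, pub_year) are hypothetical, not the exact variable coding used in our analysis:

```r
# Moderator analysis within the inter-observer Kappa stratum:
# country, capsule type and year of publication as candidate
# sources of heterogeneity (column names are illustrative).
res_mod <- rma(yi = yi, vi = vi, data = kappa_inter,
               mods = ~ country + capsule_type + pub_year,
               method = "REML")
summary(res_mod)   # moderator coefficients and residual heterogeneity
```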

Discussion
Reading VCE videos is a laborious and time-consuming task. Previous work has shown that the inter-observer agreement and the detection rate of significant findings are low, regardless of the reader's experience [5,78]. Moreover, attempts to improve performance through a structured upskilling training program did not significantly impact readers with different experience levels [78]. Fatigue has been blamed as a significant determinant of missed lesions: a recent study demonstrated that reader accuracy declines after reading just one VCE video, and that neither subjective nor objective measures of fatigue were sufficient to predict the onset of its effects [6]. Recently, strides were made in establishing a guide for evaluating the relevance of small-bowel VCE findings [79]. Above all, artificial intelligence (AI)-supported VCE can identify abnormalities in VCE images with higher sensitivity and significantly shorter reading times than conventional analysis by gastroenterologists [80,81]. AI has, of course, no issues with inter-observer agreement and is poised to become an integral part of VCE reading in the years to come. Yet AI is developed on the basis of a human-generated 'ground truth' (usually subjective expert opinion) [82]. So, how do we as human readers get it so wrong?
The results of our study show that the overall pooled estimates for "perfect" or "good" inter- and intra-observer agreement were only 23% and 37%, respectively (Figure 2). Although significant heterogeneity was noted in both Kappa- and ICC-based studies, the overall combined inter/intra-observer agreement for Kappa-evaluated outcomes was weak (0.46 and 0.54, respectively), while for ICC-evaluated outcomes the agreement was good (0.83 and 0.84, respectively).
A possible explanation for this apparent discrepancy is that ICC outcomes are more easily quantifiable, and therefore provide a more unified understanding of how to evaluate them, whereas the categorical outcomes behind Kappa statistics may be prone to a more subjective evaluation; for instance, substantial heterogeneity may be caused by pooling observations without a unified definition of the outcome variables (e.g., cleansing scales, per-segment versus per-patient evaluation, differences in categorical subgroups).
A viable solution to the poor inter-/intra-observer agreement in VCE reading could be offered by AI-based tools. AI offers the opportunity of a standardized, observer-independent evaluation of images and videos, relieving reviewers' workload, but are we ready to rely on a non-human assessment of diagnostic examinations to decide on subsequent investigations or treatments? Several algorithms with reportedly high accuracy have been proposed for VCE analysis. Convolutional neural networks (CNNs) have become the main deep learning architecture for image analysis, as they have shown excellent performance in detecting esophageal, gastric and colonic lesions [83–85]. However, some important shortcomings need to be overcome before CNNs are ready for implementation in clinical practice. The generalization and performance of CNNs in real-life settings are determined by the quality of the data used to train the algorithm. Hence, large amounts of high-quality training data are needed, together with external algorithm validation, which necessitates collaboration between international centers. A high sensitivity from AI should be prioritized, even at the cost of specificity, as AI findings should always be reviewed by human professionals.
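For illustration only, a minimal frame-level lesion/no-lesion CNN classifier might be sketched as follows with the keras package in R; the input size, layer configuration and training regime are hypothetical, and the published VCE algorithms [83–85] are considerably more elaborate:

```r
library(keras)

# Minimal CNN sketch for binary classification of capsule frames
# (lesion vs. normal); all sizes are illustrative only.
model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(224, 224, 3)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)
```

In such a setup, lowering the decision threshold on the sigmoid output would trade specificity for sensitivity, in line with the priority stated above.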
This study has several limitations. As VCE is used for numerous indications and for all parts of the GI tract, an inherent weakness is the natural heterogeneity of the included studies, which is evident in the pooled analyses (I² statistics > 80% in all strata). The meta-regressions indicated that country, capsule type and year of publication may have contributed to the heterogeneity for Kappa inter-observer agreement, whereas no sources were identified in the ICC analyses; furthermore, Egger's tests indicated publication bias in the ICC analyses but not in the Kappa analyses. Therefore, there is a risk that specific pooled estimates may be inaccurate, but the heterogeneity may also be the result of very different ways of interpreting videos or of defining outcomes between sites and trials. Despite these substantial weaknesses in the pooled analyses, the proportions of agreements and the great variance in agreements are clear. In more than 70% of the published comparisons, the agreement between readers is moderate or worse, as it is for intra-observer agreement.
Data regarding the readers' experience were originally extracted but omitted from the final analysis because of the heterogeneity of the terminology and the lack of a unified experience scale. This should not be considered a major problem, as most studies fail to confirm a significant difference in lesion detection rate between experienced and expert readers, or between physician readers and nurses [86,87], while some point to a possible equalization of any difference between novices and experienced readers after only one VCE reading, owing to fatigue [6].
Moreover, we decided not to perform any subgroup analysis based on a possible a priori clustering of findings (e.g., bleeding lesions, ulcers, polyps, etc.); the reason for this choice is, once again, the extreme variability of the definitions encountered and the lack of a uniform terminology.

Conclusions
As of today, the results of our study show that VCE reading suffers from sub-optimal inter/intra-observer agreement.
For future meta-analyses, more studies are needed that enable subgroup strata specific to the outcome and indication, which may limit the heterogeneity. The heterogeneity may also be reduced by stratifying analyses according to the readers' experience level or the number of readers in each comparison, as this will most likely affect the agreement. The progressive implementation of AI-based tools may enhance the agreement in VCE reading between observers, not only reducing the "human bias" but also relieving a significant burden in workload.