Combination of Different Fecal Immunochemical Tests in Colorectal Cancer Screening: Any Gain in Diagnostic Performance?

A variety of fecal immunochemical tests (FITs) are used for colorectal cancer screening. FIT performance could be improved further. It is unclear, whether the combination of different FITs with different analytical characteristics (such as, different antibodies for the detection of fecal hemoglobin) can yield a better diagnostic performance. Fecal samples were obtained from 2042 participants of screening colonoscopy. All participants with advanced neoplasm (AN, colorectal cancer (n = 16) or advanced adenoma (n = 200)) and 300 randomly selected participants without AN were included. Nine quantitative FITs were evaluated simultaneously. Sensitivity and specificity was calculated for single tests (n = 9) and for their pairwise test combinations (n = 36) (requiring either both FITs (P++) or at least one FIT (P+) to be positive for defining a positive test result). Mean age of the participants (n = 516) was 63 (range: 50–79) years and 56% were men. At cutoffs yielding a specificity of 96.7% for single FITs, the median gain in specificity by P++ combination was +1.0%, whereas the median loss in sensitivity for AN was −4.2%. For P+ combination the median gain in sensitivity for AN was +2.8%, at a prize of median loss of −1.0% of specificity. Combinations of different FITs do not yield any relevant gain in diagnostic performance.

Former studies [11,[16][17][18] have assessed whether multiple FIT measurements and their combinations based on the collection of multiple stool samples from different bowel movements can improve diagnostic performance. Even though they found an increase in sensitivity, this went along with a reduced specificity, or vice versa; and the area under the curve for AN was not significantly different in comparison to a one FIT sample regime. However, all these former evaluations were based on combinations of the same FIT brand (OC Sensor, Eiken Chemical, Tokyo, Japan). Meanwhile, a large variety of different FITs from diverse manufactures with different analytical characteristics (such as different antibodies for hemoglobin detection, different analytical reading techniques) have entered the market [15,[19][20][21][22]. Therefore, it is essentially unclear whether a combination of different FITs with different analytical characteristics could yield a relevant gain in diagnostic performance.
We aimed to explore the potential for enhancing diagnostic performance by pairwise combinations of nine different quantitative FITs based on stool samples obtained from a large cohort of participants of the German screening colonoscopy program.

Study Population
A total of 2042 participants of screening colonoscopy (Begleitende Evaluierung innovativer Testverfahren zur Darmkrebsfrüherkennung (BLITZ) study, screening setting), who provided stool samples in 60 mL containers were eligible for this project. After excluding 375 participants due to exclusion criteria shown in Figure 1 (left side), 1667 participants were left to choose from. All eligible participants with CRC (n = 16) or AA (n = 200) and 300 randomly selected individuals without CRC and AA were included. A slight majority of the participants were men (56%) and mean age was 63.2 years.
Due to the low number of screening CRC cases (n = 16), an ancillary group of CRC cases (n = 50) from a clinical setting (Darmkrebs: Chancen der Verhütung durch Screening (DACHS) plus study) were included. From a total of 184 clinical CRC patients, 90 study participants were excluded, due to exclusion criteria shown in Figure 1 (right side). From the remaining 94 individuals, all eligible participants with screen-detected CRC were included (n = 27), and 23 randomly-selected participants, who were detected otherwise, were additionally added. Thirty individuals (60%) were male and mean age was 65.8 years. At original cutoffs, median sensitivity for AN was 34.7% (range: 21.8-46.3%) across all nine FITs and median specificity for participants without AN was 91.3% (range: 85.7-97.7%). Pairwise combinations with both tests to be positive (P++) led to a strong reduction of the sensitivity of the respective tests by a median of −13.7% units (range: −25.5 to −1.4% units), whereas the specificity increased just slightly by a median of 0.7% units (range: 0.0 to 4.7% units). Combinations of tests with at least one test to be positive (P+) resulted in a slight increase of the sensitivity by a median of 1.2% units (range: 0.0 to 5.1% units), at a prize of a stronger median loss of −6.7% units (range: −12.7 to −1.3% units) of the specificity.

Sensitivity and Specificity
At a uniform cutoff, the median sensitivity and specificity for AN was 21.8% (range: 16.2-34.3%) and 96.3% (range: 94.0-98.7%), respectively. For P++ combinations, the sensitivity decreased by a median of −7.0% units (range: −18.5 to −1.9% units) and the specificity increased only marginally by a median of 0.7% units (range: 0.0 to 1.7% units). For P+ combinations, the sensitivity increased only slightly by a median of 1.4% units (range: 0.0 to 4.2% units) and this went along with a similarly strong reduction of the specificity by a median of -2.3% units (range: −4.7 to −0.7% units).
In order to enhance the comparability of diagnostic performance of the single tests with the test combinations, sensitivities were calculated at cutoffs adjusted to yield the same specificity of 96.7% (range of applied cutoffs: 6-30 µg/g) and 93.0% (range of applied cutoffs: 2-13 µg/g), respectively. At 96.7% specificity, the sensitivities were very similar across all nine FITs, ranging from 21.3% to 23.6% (median: 22.7%). For P++ combinations, the specificity increased slightly by a median of 1.0% unit (range: 0.3 to 2.3% units), however, the sensitivity decreased much stronger, with a median reduction of −4.2% units (range: −5.6 to −1.4% units). Whereas for P+ combinations, the specificity declined slightly by a median of −1.0% unit (range: −2.3 to −0.3% units) and the sensitivity increased only marginally by a median of 2.8% units (range: 0.9 to 4.7% units). Similar observations were made a cutoffs adjusted to yield 93% specificity.
In addition, Tables 1 and 2 presents summary results on the sensitivities for CRC (separately for screening (n = 16) and clinical (n = 50) CRC cases) as well as for AA cases (n = 200) and the absolute differences in percent units between the respective test combinations and the single tests with the higher sensitivity and specificity, respectively. In order to enhance the comparability of diagnostic performance of the single tests with the test combinations, sensitivities were calculated at cutoffs adjusted to yield the same specificity of 96.7% (range of applied cutoffs: 6-30 µ g/g) and 93.0% (range of applied cutoffs: 2-13 µ g/g), respectively. At 96.7% specificity, the sensitivities were very similar across all nine FITs, ranging from 21.3% to 23.6% (median: 22.7%). For P++ combinations, the specificity increased slightly by a median of 1.0% unit (range: 0.3 to 2.3% units), however, the sensitivity decreased much stronger, with a median reduction of −4.2% units (range: −5.6 to −1.4% units). Whereas for P+ combinations, the specificity declined slightly by a median of −1.0% unit (range: −2.3 to −0.3% units) and the sensitivity increased only marginally by a median of 2.8% units (range: 0.9 to 4.7% units). Similar observations were made a cutoffs adjusted to yield 93% specificity.
In addition, Tables 1 and 2 presents summary results on the sensitivities for CRC (separately for screening (n = 16) and clinical (n = 50) CRC cases) as well as for AA cases (n = 200) and the absolute differences in percent units between the respective test combinations and the single tests with the higher sensitivity and specificity, respectively.

Area Under the Curve (AUC)
The single AUCs of the FITs for detection of AN in the screening setting ranged from 59.5% (QuikRead go iFOBT, Orion Diagnostica, Espoo, Finland) to 72.1% (IDK Hb ELISA, Immundiagnostik, Bensheim, Germany), with a median AUC of 68.9% (QuantOn Hem, Immundiagnostik, Bensheim, Germany). Pairwise combinations yielded combined AUCs for AN between 63.0% (QuikRead go iFOBT + SENTiFIT-FOB Gold) and 72.8% (IDK Hb ELISA + RIDASCREEN Hb). The absolute gain in AUC for AN between the best single test (IDK Hb ELISA) and the best pairwise test combination (IDK Hb ELISA + RIDASCREEN Hb) was only +0.7% units, revealing no meaningful gain in diagnostic performance by pairwise test combinations.
Even when all nine FITs were combined together, no gain in diagnostic performance was observed. The AUC for AN by combining all nine FITs was 71.6%, which was even below the single AUC value of IDK Hb ELISA.

Cohen's Kappa for Agreement of Test Classification
After adjusting the original cutoffs to yield an equal specificity of 96.7%, the median kappa coefficients across all 36 two-test comparisons increased from the original cutoffs to the adjusted cutoffs from 0.67 (range: 0.56-1.  (Table 4).

Discussion
This is the first study assessing the potential of improving the diagnostic performance of FITs by combining nine different FITs with different analytical characteristics, based on stool samples obtained from a large cohort of average-risk participants of screening colonoscopy.
We found that the nine different FITs correlated strongly across each other, although the tests were different in their analytical characteristics (e.g., different antibodies for detection of human Hb, different analytical reading techniques). The different FITs obtained very similar sensitivities and specificities, and detected in majority the same CRC and AA cases, without any relevant gain in test performance by combining two different FITs. Even the combination of all nine FITs together yielded no gain in diagnostic performance.
Previous colonoscopy-controlled studies [11,[16][17][18] have investigated whether the diagnostic performance could be improved by combining multiple FIT samples, collected from different bowel movements, but these studies were based on using the same FIT (OC Sensor, Eiken Chemical, Tokyo, Japan): Oort et al. [16] evaluated the diagnostic performance of 2 FIT samples and found a slight increase in sensitivity, but this went along with a reduced specificity, or vice versa, when either at least one test or both tests were required to be positive for defining a positive test result. Similarly, Hernandez et al. [17] and Liles et al. [18] combined two FIT samples by using the higher fecal Hb measurement of both FIT samples for test interpretation and observed no relevant gain in test performance in comparison to one FIT sample only. In another study from South Korea, Park et al. [11] combined up to three FIT samples by taking the highest test result into test interpretation, but the improvement in sensitivity went along again with a reduction in specificity, and AUCs for AN were not significantly different. However, because in all these studies the FIT combinations were based on the same FIT (OC Sensor, Eiken Chemical, Tokyo, Japan) with the same analytical characteristics, it was essentially unclear whether a combination of FITs with different analytical characteristics could yield a relevant improvement in diagnostic performance.
In a previous analysis we evaluated and directly compared for the first time, the individual diagnostic performance of these nine quantitative FITs, and found apparent large differences regarding the diagnostic performance at original cutoff values recommended by the manufacturers (range: 2-17 µg Hb/g feces) [22]. However, after adjusting the cutoffs to yield equal specificities (here: 99%, 97% and 93%, respectively), very similar sensitivities and almost identical positivity rates (here: ≈3%, ≈6% and ≈11%, respectively) were observed. However, the single FIT measurements varied widely across the tests and it was necessary to set partly very different cutoff values (range: 18-53 µg Hb/g feces, 6-30 µg Hb/g feces, and 2-12 µg Hb/g feces, respectively) to yield the same specificity at almost identical positivity rates among all nine tests. These variations between the tests reflect the different analytical characteristics across the tests, for example, the different analytical reading techniques, the different antibodies for Hb detection or their different affinities to bind (partly) degraded Hb variants. In this study, we therefore aimed to correlate and combine these nine previously evaluated FITs in order to investigate whether the diagnostic performance of single FITs could be improved by combining different FITs with different analytical characteristics. Interestingly, we found that the different FITs correlate strongly across each other, although they have different analytical characteristics. They detected in majority the same CRC and AA cases, and no relevant gain in diagnostic performance was observed across all test combinations. Furthermore, with cutoff values adjusted to yield the same specificity or overall positivity rate [15,22], very similar sensitivities can be achieved with a variety of different FITs.
Specific strengths of our study are the first time combination of a large number of different quantitative FITs with different analytical characteristics, based on exactly the same stool samples obtained from average-risk participants of screening colonoscopy. The stool samples were collected before starting bowel preparation for colonoscopy. Screening colonoscopy, which is the current diagnostic gold-standard was performed independently of the FIT and was used as the reference standard to evaluate the diagnostic performance. The study design essentially precluded variations, which could have occurred from different study populations (e.g., case numbers; age [23][24][25], sex [23][24][25][26][27], and stage [11,22,[28][29][30] distribution) or from different pre-analytical sample handlings [31] between the FITs: Stool samples were collected in exactly the same manner and additional mixing of the stool samples, before filling the fecal sampling tubes, was performed to account for potential heterogeneity in Hb concentration within the same bowel movement [32] as an additional source of variation of test results between the different FITs. Furthermore, the stool-filled FIT tubes were stored under very similar ambient temperatures (range: 20.0 • C to 24.0 • C) and were evaluated in parallel on the same day to rule out differences in pre-analytical sample handling [31] as another source of test variation.
However, our study has also some limitations. The stool samples were collected by the study participants in stool containers (60 mL) and stored frozen at −80 • C over several years prior to filling the special fecal sampling tubes for test analysis, instead of directly collecting the stool samples with the recommended fecal sampling tubes provided by the manufacturers. However, as it is difficult to imagine that study participants would be willing and able to collect nine FIT samples in parallel using nine different fecal sampling tubes (each with different fecal sampling instructions and FIT-specific peculiarities), this was probably the only way to realize a study with so many simultaneously evaluated FITs. Nevertheless, the original FIT tubes provided by the manufacturers were used to collect a defined amount of stool after thawing the stool containers, and the stool specimens were stored in their respective preservative buffer from sampling until test evaluation. Furthermore, the stool-filled FIT tubes were vortexed so that the stool could move out of the notches of the stick and disperse completely into the buffer of the tube to ensure optimal Hb stabilization during the study [31]. In addition, we found in a previous examination based on one of the nine tests (SENTiFIT-FOB Gold, Sentinel Diagnostics, Milan, Italy), only a small difference in diagnostic performance when the FIT was conducted either based on frozen stool samples (like in this study) or on fecal specimens directly collected with the recommended FIT tube from the whole bowel movement by the study participants [33]. Another limitation is that the analysis is based on a rather low number of CRC cases recruited in the screening setting (n = 16), which, on the other hand, is typical in a screening setting among average-risk participants of screening colonoscopy [15,22]. However, by including a separate group of CRC cases recruited in a clinical setting (n = 50), of whom 27 were also detected via a screening colonoscopy, more precise estimates of sensitivity were possible. We observed very similar sensitivity results for both groups of CRC, which, in addition, go in hand with the published results of other colonoscopy-controlled FIT-studies [15,22]. Furthermore, the estimates of specificity are based on a random selection of "only" 300 individuals out of over 1400 eligible participants without CRC and AA. Nevertheless, our specificity results go again in hand with results of other colonoscopy-controlled FIT-studies [15,22]. Therefore, we believe that this type of study design, with a targeted selection of all AN cases (n = 216), and a random selection of participants without AN (n = 300) seems to be a justified and efficient approach to save resources and capacities.
In conclusion, our study provides important, previously unavailable results regarding the combination of a variety of different FITs for CRC screening. We have shown that even FITs with different analytical characteristic correlate strongly across each other and detect in the majority of cases the same participants with AN (i.e., CRC or AA), and that combinations of different FITs do not lead to a relevant improvement in diagnostic performance. Therefore, FIT-based screening programs should consider selecting a FIT that fits best to their current CRC screening program based on factors like costs of tests, usability, laboratory requirements, simplicity of test analysis or ability of sample stabilization in the respective buffer-filled FIT tubes until test evaluation [31], and choose the positivity cutoff according to a defined target level of specificity (or overall positivity rate, which is highly correlated to specificity [15]) and available resources to follow up positive test results.

Materials and Methods
This article is following the STARD (Standards for Reporting of Diagnostic Accuracy) [34] and the FITTER (Fecal Immunochemical Tests for Hemoglobin Evaluation Reporting) [35] guidelines.

Study Design and Study Population
Details of the study design and study population have been provided in a previous report on the individual diagnostic performance of each of the nine different quantitative FITs [22]. Briefly, study participants from the BLITZ (Begleitende Evaluierung innovativer Testverfahren zur Darmkrebsfrüherkennung) study, an ongoing prospective study among participants of the German screening colonoscopy program were asked to collect a stool sample before starting bowel preparation for colonoscopy. Participants are informed and recruited at a preparatory visit in cooperating gastroenterology practices before screening colonoscopy.
To increase the number of CRC cases, which is typically low among average-risk individuals in true screening settings, an ancillary group of CRC cases from the DACHSplus study, which is an add-on study to the DACHS study (Darmkrebs: Chancen der Verhütung durch Screening), was included in a clinical setting. In the DACHSplus study, participants were asked to collect a stool sample before starting bowel preperation for surgery. Particpants were informed and recruited before initiating therapy.
Both  Figure 1 shows the flow diagram and exclusion criteria of the study participants. Study participants, who were recruited between 2005 and 2010 and provided stool samples, were considered for this project. From the BLITZ study, all eligible 216 individuals with CRC or AA (defined as adenoma with at least one of the following features: ≥1 cm in size, tubulovillous or villous components, or high-grade dysplasia) and 300 randomly selected participants without CRC and AA were included. From the DACHS plus study, a total of 50 CRC cases were included for ancillary analyses.

Sample and Data Collection
Study participants from both studies were asked to fill a 60 mL container with stool from a single bowel movement, without any dietary or medicinal restrictions, before starting bowel preparation for colonoscopy (BLITZ) or surgery (DACHS plus). Participants were asked to store the stool-filled container frozen or, if not possible, refrigerated. On the day of their colonoscopy appointment (BLITZ) or hospital admission (DACHS plus), the participants were asked to bring the frozen stool-filled container in a temperature-isolated manner to their gastroenterology practice or hospital. Upon receipt the containers were kept frozen at -20 • C and shipped on dry ice to the German Cancer Research Center (DKFZ) for final storage at -80 • C.
Screening colonoscopy was performed blinded to the test results. Colonoscopy and histology reports (BLITZ) as well as medical reports after surgery (DACHS plus) were collected from all study participants. Relevant information was extracted by two independent, trained research assistants who were blinded to the test results.

Fecal Immunochemical Test Analysis
Five laboratory-based tests (IDK Hb ELISA, Immundiagnostik, Bensheim, Germany; RIDASCREEN Hb, R-Biopharm, Darmstadt, Germany; CAREprime Hb, Alfresa Pharma, Osaka, Japan; OC Sensor, Eiken Chemical, Tokyo, Japan; and SENTiFIT-FOB Gold, Sentinel Diagnostics, Milan Italy) and four point-of-care tests (QuantOn Hem, Immundiagnostik, Bensheim, Germany; immoCARE-C, CARE diagnostica, Möllersdorf, Austria; Eurolyser FOB test, Eurolyser Diagnostica, Salzburg, Austria; and QuikRead go iFOBT, Orion Diagnostica, Espoo, Finland) were conducted in parallel. Characteristics of the nine different FITs are provided in Table 5. Stool containers, from both studies, were blinded and put in random order before they were thawed for FIT analyses. Afterwards the stool samples were mixed to reduce potential heterogeneity of Hb concentration within the same bowel movement [32] as a potential source of variation of test measurements between the different FITs. A defined amount of stool was extracted using each manufacturer's FIT-specific fecal sampling tube. Each FIT tube was a small vial, containing a notched stick for stool collection. After inserting the collection stick into three different areas of the stool sample, the notches of the stick were visually inspected for complete filling. Then the stick was pushed back into the vial, which is filled with a defined volume (range: 1.5-2.5 mL) of a FIT-specific preservation buffer to slow down any Hb degradation from sampling until test evaluation [31]. The vials had a tight entrance that wiped off excess stool from the stick, when the stick was pushed back into the vial, leaving only a defined mass of stool in the notches of the stick (range: 9.5-20 mg). The only exception was the immoCARE-C vial, where a supplied cardboard fork was used to wipe off excess stool from the stick before putting it back into the vial.
Afterwards the stool-filled FIT tubes were shaken on a vortexer so that the stool could move out of the notches of the stick into the preservation buffer to ensure optimal stabilization from sampling until test evaluation [31]. All tests were evaluated in parallel, automatically by the analytical instrument and in a one-time measurement by trained laboratory staff. Test calibrations and controls were performed on a regular basis according to the manufacturers´instructions. Test results above the upper analytical limit were diluted and re-tested, if possible. Due to limited laboratory space and resources, five of the nine tests were evaluated externally, but under the same pre-analytical and analytical conditions: After vortexing, the FIT tubes were shipped to the following distributing and cooperating companies for blinded test evaluations on the next day: CARE diagnostica GmbH, Voerde, Germany (CAREprime Hb and immoCARE-C) Immundiagnostik AG, Bensheim, Germany (IDK Hb ELISA and QuantOn Hem) R-Biopharm AG, Darmstadt, Germany (RIDASCREEN Hb).
The internal and external FIT tubes were stored at a median temperature of 21.5 • C (range: 20.0-24.0 • C) until blinded test analyses on the next day.

Statistical Analyses
Before starting the statistical analysis, all quantitative FIT measurements were converted to the same and comparable unit of µg Hb/g feces [36]. Sensitivities and specificities were calculated for single tests (n = 9) and for pairwise combinations of tests (n = 36) requiring either both tests (P++) or at least one test (P+) to be positive for defining a positive test result. In order to enhance comparability, these estimates were not only computed at original cutoffs (range: 2-17 µg Hb/g feces) and at a uniform cutoff (15 µg Hb/g feces), but also at cutoffs adjusted to yield the same specificity (96.7% and 93.0%, respectively). Sensitivities were calculated for detection of CRC (separetely for cases from the screening (n = 16) and clinical (n = 50) setting), AA (n = 200) and AN (n = 216, screening setting only), respectively, and specificities were computed for participants without AN. Screening colonoscopy results (gold-standard, BLITZ) and medical reports (DACHSplus) were used as reference standard. For one test (immoCARE-C) the analyses were based on one less AA case (n = 199), because of a missing FIT measurement. Due to the lower analytical limit of another test (QuikRead go iFOBT), the cutoff value could not be adjusted to yield a specificity below 96.7%, this test was excluded from the analysis at 93% specificity.
Median sensitivities and specificities with their range were computed for each single test and each of the various two-test combinations. Furthermore absolute differences in percent units between each test combination and the single test with the higher sensitivity and specificity, respectively, their medians and range were calculated (Tables 1 and 2). In addition, specificities and sensitivities for AN (only particpants of screening colonoscopy, BLITZ) were graphically displayed at original cutoffs (range: 2-17 µg/g), at a uniform (15 µg/g) and at adjusted cutoffs, yielding equal specificities ( Figure 2).
AUCs for detection of AN (only particpants of screening colonoscopy, BLITZ) were calculated for the nine single FITs and for their pairwise combinations, using logistic regression model. In addition, the AUC for AN was calculated by combining all nine FITs together.
Spearman rank correlation coefficients between the quantitative FIT measurements were computed for participants of screening colonscopy with CRC, participants with AA, participants without CRC and AA (No AN), and for all included study participants of screening colonscopy (Total), respectively across all 36 two-test comparisons (Table 3). Similarly, Cohen's kappa for agreement of test classifications as positive or negative according to original cutoffs and to adjusted cutoffs, yielding the same specificity of 96.7% were calculated ( Table 4).
All statistical analyses were conducted using SAS Enterprise Guide, version 6.1 (SAS Institute Inc., Cary, CA, USA).

Conclusions
Different quantitative FITs from diverse manufacturers yield strongly correlated results and are able to detect in the majority of cases the same participants with AN. As a result, no relevant gain in diagnostic performance was found by the combination of different FITs.