Consistent Major Differences in Sex- and Age-Specific Diagnostic Performance among Nine Faecal Immunochemical Tests Used for Colorectal Cancer Screening

Simple Summary: We evaluated the performance of nine faecal immunochemical tests among participants of screening colonoscopy. A total of 216 cases of advanced neoplasia (AN, colorectal cancer or advanced adenoma) and 300 randomly selected participants without AN were included. Diagnostic performance for detection of AN was assessed by sex and age (50–64 vs. 65–79 years), for each of the nine faecal immunochemical tests (FITs) individually and for all FITs combined. Major differences in diagnostic performance by sex and age were consistently seen across nine different FIT brands. Sensitivities were consistently lower, and specificities were consistently higher, for females as compared with males. Positive predictive values were similar between both sexes, but negative predictive values were higher for females. A negative FIT is less reliable in ruling out AN among men than among women and among older than among younger participants.

Abstract: Evidence on diagnostic performance of faecal immunochemical tests (FITs) by sex and age is scarce. We aimed to evaluate FIT performance for detection of advanced colorectal neoplasia (AN) by sex and age across nine different FIT brands in a colonoscopy-controlled setting. Faecal samples were obtained from 2042 participants of colonoscopy screening. All eligible cases with AN (n = 216) and 300 randomly selected participants without AN were included. Diagnostic performance for detection of AN was assessed by sex and age (50–64 vs. 65–79 years) for each of the nine FITs individually and for all FITs combined. Sensitivity was consistently lower, and specificity was consistently higher, for females as compared with males (pooled values at original FIT cutoffs, 25.7% vs. 34.6%, p = 0.12 and 96.2% vs. 90.8%, p < 0.01, respectively). Positive predictive values (PPVs) were similar between both sexes, but negative predictive values (NPVs) were consistently higher for females (pooled values, 91.8% vs. 86.6%, p < 0.01). Sex-specific cutoffs attenuated differences in sensitivities but increased differences in predictive values. According to age, sensitivities and specificities were similar, whereas PPVs were consistently lower and NPVs were consistently higher for the younger participants. A negative FIT is less reliable in ruling out AN among men than among women and among older than among younger participants. Comparisons of measures of diagnostic performance among studies with different sex or age distributions should be interpreted with caution.


Introduction
Worldwide, colorectal cancer (CRC) accounts for approximately 1 million new cases among men and approximately 800,000 new cases among women annually [1]. Faecal immunochemical tests (FITs) are widely recommended [2,3] and used [4,5] for population-wide screening and early detection of CRC and its precancerous lesions. The diagnostic performance of quantitative FITs has been assessed in many studies and has been summarized in meta-analyses [6,7]; however, evidence on FIT performance according to sex and age derived from colonoscopy-controlled studies is scarce and limited to only a few FIT brands [8]. Furthermore, it is unclear whether differences exist for detection of advanced colorectal neoplasia (AN) by sex or by age across different FIT brands.
We aimed to evaluate the diagnostic performance of a large number of different quantitative FITs according to sex and to age, using faecal samples obtained from individuals undergoing colonoscopy screening in Germany.

Materials and Methods
The Standards for the Reporting of Diagnostic Accuracy Studies (STARD) [9] and the standard for Faecal Immunochemical Tests for Haemoglobin Evaluation Reporting (FITTER) [10] were followed.

Study Design and Population
This analysis was carried out following a direct comparison and combination of nine quantitative FITs for detection of AN, details of which have been published previously [11,12]. Briefly, this study is based on the BLITZ study, which has been running since 2005 with the aim of collecting blood and stool samples from average-risk individuals before they undergo colonoscopy screening, for evaluation of novel non-invasive CRC screening tests. Study participants are informed and recruited during their preparatory visit (typically 1 week before colonoscopy) in cooperating gastroenterology practices in southwest Germany.
The study was approved by the Ethics Committee of the University of Heidelberg (178/2005) and those of the state chambers of physicians of Baden-Wuerttemberg (M118-05-f), Rhineland-Palatinate (837.047.06(5145)) and Saarland (217/13). The BLITZ study was registered in the German Clinical Trials Register (DRKS-ID: DRKS00008737). Written informed consent was obtained from each study participant. Further information about the BLITZ study has been provided elsewhere [13–15].

Selection of Study Participants
A total of 2042 participants, who were recruited until 2010 and who provided faecal samples in stool containers (60 mL), were eligible for this project. After excluding participants <50 or ≥80 years of age (n = 52), with inflammatory bowel disease (n = 10), history of previous colorectal neoplasia (n = 39), colonoscopy in the past 5 years (n = 114), stool sample collection not prior to colonoscopy (n = 75), and incomplete colonoscopy (n = 8) or inadequate bowel cleansing (n = 77), 1667 participants fulfilled the inclusion criteria for this analysis. All participants diagnosed with AN, i.e., either CRC or advanced adenoma (AA, defined as adenomas with either size ≥1 cm, villous/tubulovillous components, or high-grade dysplasia) who provided enough faeces for the evaluation of nine FITs were included (n = 216). For one FIT (immoCARE-C), the analyses were based on one fewer AN case (n = 215), because one FIT measurement was missing. To save resources and capacities, 300 participants without CRC and AA were randomly selected and included for specificity calculations.

Data/Sample Collection and Processing
Participants were asked to collect a faecal sample before starting bowel preparation for colonoscopy, to store the sample in a freezer (or, if not possible, in a refrigerator), and to bring it in a thermally insulated bag to the gastroenterology practice on the day of the scheduled colonoscopy. In the practices, the samples were kept at −20 °C and sent on dry ice to a central laboratory, and afterwards to the German Cancer Research Center (DKFZ, Heidelberg) for final storage at −80 °C. Although this preanalytical sample procedure differs from the recommended faecal sampling procedure (i.e., filling the faecal sampling tubes directly with fresh stool), we have found in a recent retrospective analysis that estimates of diagnostic performance of FITs remained fairly stable even after long-term frozen storage and repeated thawing and freezing cycles [16].
The screening colonoscopy was performed by experienced colonoscopists who were unaware of the FIT result. Afterwards, colonoscopy (and histology) reports were collected, and relevant data were extracted by trained medical data officers who were likewise blinded to any FIT result.

Test Analysis
Faecal samples were thawed in 2016 in order to measure different FITs in parallel, as previously described [11,12]. Overall, 516 faecal samples from average-risk participants of screening colonoscopy were measured using nine quantitative FITs from seven manufacturers. All FITs were approved for use in Germany. Detailed test characteristics are shown in Table S1. Before filling the single faecal sample collection tubes, the stool within each container was mixed to reduce heterogeneity in faecal haemoglobin distribution [17]. All nine FITs were evaluated simultaneously under the same preanalytical and analytical conditions: stool specimens were extracted for the nine FITs using the special sampling tubes that had been designed to transfer a defined amount of faeces into the haemoglobin-stabilizing buffer of the tube. Afterwards, the tubes were shaken and kept at ambient temperature (range 20–24 °C) until they were measured in a blinded manner on the next day. Further detailed information on test analysis has been published previously [11,12].

Statistical Analysis
All quantitative faecal haemoglobin measurements were converted to the same, and directly comparable, unit of µg haemoglobin per gram faeces (µg/g) [18].
Sensitivities, specificities, and positive and negative predictive values (PPVs and NPVs) with their 95% confidence intervals (CIs) were calculated for detection of AN (either CRC or AA) by sex and by age (50–64 and 65–79 years). The analyses were conducted at thresholds recommended by the manufacturers (original thresholds, range 2–17 µg/g) and at thresholds yielding an equal overall specificity of 95%. In addition, positivity rates with their 95% CIs were computed. Because participants with AN (all AN cases, n = 216) were overrepresented by design relative to participants without AN (random sample, n = 300), PPVs, NPVs and positivity rates were derived from weighted analyses. Weights were calculated by dividing the original fractions observed among the 1667 eligible study participants by the sampling fractions for inclusion in the FIT analyses. This way, positivity rates, PPVs, and NPVs reflect the prevalence of AN and the sex and age distribution observed in the cohort of eligible participants (n = 1667) who fulfilled the inclusion criteria. Testing for statistical differences by sex and by age was conducted using logistic regression models. For positivity rates, PPVs, and NPVs, a weighted logistic regression model was fitted. p-values and CIs were based on the Wald test.
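The weighting step can be illustrated with a minimal sketch. It assumes one plausible reading of the description above, namely a weight per disease status equal to the fraction in the eligible cohort divided by the fraction in the analysed sample. The cohort and sample sizes are taken from this article; the function names and the FIT result counts (`tp`, `fn`, `fp`, `tn`) are hypothetical, and the additional stratification by sex and age used in the study is ignored here.

```python
def stratum_weight(n_cohort, n_cohort_total, n_sample, n_sample_total):
    """Fraction of a stratum in the eligible cohort divided by its fraction in the sample."""
    return (n_cohort / n_cohort_total) / (n_sample / n_sample_total)


def weighted_measures(tp, fn, fp, tn, w_an, w_no_an):
    """Sensitivity and specificity need no weights; PPV, NPV and positivity do."""
    wtp, wfn = tp * w_an, fn * w_an          # re-weighted participants with AN
    wfp, wtn = fp * w_no_an, tn * w_no_an    # re-weighted participants without AN
    total = wtp + wfn + wfp + wtn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": wtp / (wtp + wfp),
        "npv": wtn / (wtn + wfn),
        "positivity": (wtp + wfp) / total,
        "prevalence": (wtp + wfn) / total,
    }


# Numbers reported in the article.
COHORT_TOTAL = 1667          # eligible participants
COHORT_AN = 230              # participants with AN in the eligible cohort
SAMPLE_AN = 216              # AN cases with enough faeces for all FITs
SAMPLE_NO_AN = 300           # randomly selected participants without AN
SAMPLE_TOTAL = SAMPLE_AN + SAMPLE_NO_AN

w_an = stratum_weight(COHORT_AN, COHORT_TOTAL, SAMPLE_AN, SAMPLE_TOTAL)
w_no_an = stratum_weight(COHORT_TOTAL - COHORT_AN, COHORT_TOTAL,
                         SAMPLE_NO_AN, SAMPLE_TOTAL)

# Hypothetical FIT results at some cutoff (not data from the study).
result = weighted_measures(tp=65, fn=151, fp=15, tn=285,
                           w_an=w_an, w_no_an=w_no_an)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With these weights the re-weighted prevalence equals the cohort prevalence (230/1667), which is why the predictive values refer to the screening cohort rather than to the case-enriched sample.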
Generalized estimating equations (GEE) logistic regression models were used to derive pooled estimates including 95% CIs of the various measures of diagnostic performance by sex and by age across the nine FITs and to test for the associations of age and sex with diagnostic performance, taking FIT effects and dependency of observations within the same individuals into account. Statistical testing by sex and by age was conducted using the Wald test.
To assess the overall diagnostic performance within the clinically relevant segment of ≥80% specificity, partial areas under the curves (AUCs) were calculated in such a way that they became 50% for nondiscriminant and 100% for perfectly discriminating tests.
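A rescaling with these two anchor points is commonly known as the McClish correction; the pROC package offers it through the `partial.auc.correct` option, although the text does not state which option was used. The following sketch shows the arithmetic for a raw partial AUC over a specificity range; the function name and example values are hypothetical.

```python
def standardized_partial_auc(raw_pauc, spec_low=0.80, spec_high=1.00):
    """Rescale a raw partial AUC so that chance gives 0.5 and a perfect test gives 1.0.

    raw_pauc is the area under the ROC curve (sensitivity vs. 1 - specificity)
    restricted to specificities between spec_low and spec_high.
    """
    fpr_low, fpr_high = 1.0 - spec_high, 1.0 - spec_low
    max_area = fpr_high - fpr_low                    # perfect test: sensitivity = 1
    min_area = (fpr_high ** 2 - fpr_low ** 2) / 2.0  # chance line: sensitivity = FPR
    return 0.5 * (1.0 + (raw_pauc - min_area) / (max_area - min_area))


# For the 80-100% specificity segment the raw area ranges from 0.02 (chance) to 0.20 (perfect).
chance = standardized_partial_auc(0.02)
perfect = standardized_partial_auc(0.20)
midway = standardized_partial_auc(0.11)
print(chance, perfect, midway)
```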
Derivation of 95% CIs and testing for statistical significance of differences in partial AUCs were done using 2000 bootstrap replicates.
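The resampling idea can be sketched as follows, assuming a percentile interval from a bootstrap that resamples cases and controls separately. To keep the example short, the full AUC (Mann-Whitney estimate) stands in for the partial AUC, and the haemoglobin values are invented; none of the names below come from the study's analysis code.

```python
import random


def auc(cases, controls):
    """Mann-Whitney estimate of the full AUC; ties count one half."""
    wins = 0.0
    for x in cases:
        for y in controls:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def stratified_bootstrap_ci(cases, controls, statistic, n_boot=2000, alpha=0.05, seed=1):
    """Percentile CI from resampling cases and controls separately, with replacement."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        boot_cases = rng.choices(cases, k=len(cases))
        boot_controls = rng.choices(controls, k=len(controls))
        replicates.append(statistic(boot_cases, boot_controls))
    replicates.sort()
    lower = replicates[round(alpha / 2 * n_boot)]
    upper = replicates[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical faecal haemoglobin concentrations in µg/g (not study data).
cases = [0, 2, 5, 12, 30, 80, 150, 3, 0, 45]
controls = [0, 0, 1, 2, 0, 3, 8, 0, 1, 0, 4, 0]

point_estimate = auc(cases, controls)
ci_low, ci_high = stratified_bootstrap_ci(cases, controls, auc)
print(f"AUC {point_estimate:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```

A test for a difference between two groups can follow the same pattern by bootstrapping the difference of the two statistics instead of a single statistic.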
Two-sided p-values below 0.05 were considered statistically significant. The analyses of partial AUCs were conducted using R (version 3.6.0, R Core Team, Vienna, Austria) with the R package 'pROC' (version 1.16.2), whereas all other analyses and statistical tests were conducted using SAS Enterprise Guide (version 7.1, Cary, NC, USA).

Study Population
The characteristics of all 1667 eligible participants of colonoscopy screening are shown in Table 1. The sample included approximately equal numbers of women and men. The age distribution was similar among both sexes, and most participants were between 50 and 64 years old. Advanced neoplasia was detected in 230 participants. Among these, colorectal cancer and advanced adenoma were the most advanced findings at screening colonoscopy for 16 (1.0%) and 214 (12.8%) participants, respectively. The overall AN prevalence was 13.8% in the total study population. It was higher among men (17.3%) than among women (10.0%) and in the age group 65–79 (17.6%) than in the age group 50–64 (11.3%).

Diagnostic Performance by Sex
Across all FITs and all assessed thresholds, sensitivities were consistently lower and specificities were consistently higher among females than among males (Table 2). At original threshold values, substantial differences in measures of diagnostic performance were observed among the different single FIT brands. However, when threshold values were adjusted to yield identical levels of overall specificity (to enhance the comparability between the FITs), no meaningful differences were observed among the different FIT brands. At original thresholds, pooled sensitivities were 25.7% vs. 34.6% (p = 0.12) and pooled specificities were 96.2% vs. 90.8% (p = 0.005) for females and males, respectively. Similar sex differences were observed at thresholds adjusted to yield equal overall specificities (95%) across the FITs. PPVs were similar between both sexes, but NPVs were consistently higher for females (pooled values 91.8% vs. 86.6%, p < 0.01) (Table 3). Differences in sensitivities diminished when using sex-specific cutoffs, whereas differences in PPVs increased. Pooled sex-specific differences in sensitivity, specificity, PPV, NPV, and positivity at original thresholds are summarized in Figure 1A.
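How a cutoff can be chosen to yield a target specificity, either overall or within one sex, can be sketched in a few lines. This is an illustration only: the function names, the convention that a result at or above the cutoff counts as positive, and the haemoglobin values are all assumptions and not taken from the study.

```python
def specificity(control_values, cutoff):
    """Share of participants without AN who test negative (value below the cutoff)."""
    return sum(v < cutoff for v in control_values) / len(control_values)


def cutoff_for_specificity(control_values, target_specificity=0.95):
    """Smallest observed value that, used as cutoff, reaches the target specificity.

    Returns None if no observed value qualifies (any cutoff above the maximum
    would then give 100% specificity).
    """
    for cutoff in sorted(set(control_values)):
        if specificity(control_values, cutoff) >= target_specificity:
            return cutoff
    return None


# Hypothetical faecal haemoglobin concentrations (µg/g) among participants without AN.
female_controls = [0] * 15 + [1, 2, 3, 5, 11]
male_controls = [0] * 13 + [1, 2, 4, 6, 9, 15, 30]

overall_cutoff = cutoff_for_specificity(female_controls + male_controls)
female_cutoff = cutoff_for_specificity(female_controls)
male_cutoff = cutoff_for_specificity(male_controls)

print("overall cutoff:", overall_cutoff)
print("sex-specific cutoffs:", female_cutoff, male_cutoff)
print("specificity at overall cutoff, females vs. males:",
      specificity(female_controls, overall_cutoff),
      specificity(male_controls, overall_cutoff))
```

In this toy data set a single overall cutoff leaves specificity higher among females than among males, whereas sex-specific cutoffs equalize specificity by using a lower cutoff for females, which is the mechanism by which differences in sensitivity can shrink.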
The overall diagnostic performance, measured by the partial AUC (between 80% and 100% specificity), showed no clinically relevant difference between men and women (Table S2). For four of the nine FITs, partial AUCs were slightly higher for men (up to 3.1%), whereas for the other five FITs, the partial AUCs were slightly higher for women (up to 2.4%), but none of these small differences reached statistical significance.

Diagnostic Performance by Age
Pooled age-specific differences in sensitivity, specificity, PPV, NPV, and positivity are summarized in Figure 1B. Differences in sensitivity and specificity between younger (50–64 years) and older (65–79 years) study participants were generally small (Table 4) and less consistent than differences according to sex. Similar results were observed at thresholds adjusted to yield an overall specificity of 95%, and none of these differences between both age groups was statistically significant.
PPVs were consistently higher among the older age group as compared with the younger age group (Table 5), but differences were not statistically significant. At original thresholds, pooled PPVs were 37.6% and 51.0% (p = 0.16) for the younger and older age groups, respectively. NPVs were consistently lower (by about 5%) for the older age group, and pooled estimates were statistically significantly different (p = 0.02) between both age groups, across all thresholds. Differences in NPVs were slightly larger when age-specific cutoffs were used (about 6%). Partial AUCs were slightly higher (up to 2.7%) for the younger study group for five of the nine FITs, and for the other four FITs these estimates were slightly higher (up to 3.8%) for the older study group, but none of these differences was statistically significant (Table S2).

Discussion
In this study, we assessed the diagnostic performance for detection of AN of nine different quantitative FITs according to sex and age, using stool samples collected from average-risk participants of screening colonoscopy. Even when adjusting FIT cutoffs to yield equal specificity in the entire study population, pooled sensitivities were consistently higher, whereas pooled specificities were statistically significantly lower among males as compared with females. Pooled PPVs were very similar between both sexes. By contrast, pooled NPVs were statistically significantly lower for males, suggesting that a negative FIT is less reliable in ruling out AN among men than among women. When using sex-specific cutoffs with respect to specificities, differences in sensitivities by sex diminished, whereas differences in PPVs and NPVs became greater. According to age, pooled sensitivities and specificities were very similar between both age groups, but pooled PPVs were consistently higher and pooled NPVs were statistically significantly lower for the older age group.
To the best of our knowledge, this is the first study to assess diagnostic performance of several different FIT brands in parallel for detection of AN by sex among participants of colonoscopy screening. There are only a few previous studies that have investigated FIT performance for AN detection with respect to sex [19–22]. Each of these studies assessed only one specific FIT brand and consistently reported higher sensitivity for men than for women, although the magnitude of the difference varied among studies. Specificities were generally lower among men, but sex differences were generally smaller for specificity, varying only by a few percent units. It was unclear, however, to what extent differences in the magnitude of sex- and age-specific variations were due to differences in study populations and age groups or to differences in FIT brands assessed in these studies. Our study demonstrates across several different FIT brands consistently higher sensitivities (by 3–13% units) along with consistently lower specificities (by 2–10% units) for men as compared with women at original cutoffs. Differences persisted when adjusting to equal specificity in the entire study population, but diminished when using equal specificities among women and among men, respectively. It remains to be investigated by future studies if age and sex are similarly, or possibly more strongly, associated with FIT performance among symptomatic patients than among screening participants. Symptomatic patients may comprise a very heterogeneous group and it is conceivable that differences in performance characteristics vary in strength or even direction across heterogeneous symptomatic groups (e.g., those reporting abdominal pain vs. change in bowel habits). In both symptomatic and screening populations, it should be considered that age and sex may interact or be associated with other covariates potentially influencing FIT results. For example, intake of proton pump inhibitors (PPIs) has been suggested to be associated with reduced accuracy of FIT by some [23,24] but not all [25] studies. Furthermore, interactions with sex or intake of other drugs have been suggested [26].
The reasons for the higher sensitivity and lower specificity of FITs among men than among women at equal cutoffs remain to be fully explored. Possible reasons may include the higher proportion of AN located in the distal colon and rectum, which are more frequently detected by FIT than proximal AN [27,28], higher rates of aspirin use for cardiovascular prevention [29], and a shorter colonic transit time [30] that may be associated with less haemoglobin degradation prior to defecation. The higher positivity rate and the lower NPV among men might also be partly explained by the higher prevalence of AN among men than among women (17.3% versus 10.0% in our study population).
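The contribution of prevalence can be made concrete with a back-of-the-envelope calculation using Bayes' theorem. The sketch below plugs in the pooled sensitivities and specificities at original thresholds and the sex-specific AN prevalences reported in this article; the function name is hypothetical, and because the published predictive values come from weighted, GEE-pooled analyses, the output is only expected to approximate them.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PPV, NPV and positivity rate implied by sensitivity, specificity and prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    return {
        "ppv": true_pos / (true_pos + false_pos),
        "npv": true_neg / (true_neg + false_neg),
        "positivity": true_pos + false_pos,
    }


# Pooled values at original thresholds and AN prevalence, as reported in the article.
females = predictive_values(sensitivity=0.257, specificity=0.962, prevalence=0.100)
males = predictive_values(sensitivity=0.346, specificity=0.908, prevalence=0.173)

for label, values in (("females", females), ("males", males)):
    print(label, {name: round(value, 3) for name, value in values.items()})
```

Under these inputs the implied PPVs of the two sexes lie within about one percentage point of each other, whereas the implied NPV is roughly five percentage points lower and the implied positivity rate clearly higher among men, in line with the pattern described above.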
We are aware of only three previous studies that assessed the FIT performance for AN detection among participants of colonoscopy screening according to age [20,22,31]. Again, each of these studies assessed only one specific FIT brand. Furthermore, they were conducted in very different study populations, used different age categorizations and yielded inconsistent results. In our study, no consistent differences in sensitivity and specificity were found across nine different FIT brands evaluated in parallel in the same study population. However, PPVs consistently tended to be higher and NPVs consistently tended to be lower among older as compared with younger age groups. Given the lack of differences in sensitivity and specificity, the differences in PPVs and NPVs are most likely due to differences in the prevalence of AN, which is higher in older than in younger participants of colonoscopy screening (17.6% versus 11.3% in our study population).
Finally, but importantly, eight out of nine FITs yielded very similar overall measures of diagnostic performance, as quantified by partial AUCs. Slightly lower partial AUCs were observed only for QuikRead go iFOBT, but this observation was caused by the lower analytical working limit being unusually high (15 µg/g) for this FIT. Furthermore, partial AUCs were very similar between sexes and age groups for each of the nine FIT brands. Therefore, no clinically relevant differences in overall diagnostic performance by sex and age were observed between FITs from different manufacturers.
A major strength of our study is the parallel measurement of the faecal haemoglobin concentration across nine different quantitative FITs under the same preanalytical and analytical test requirements in a colonoscopy-controlled study setting. The few previous studies that assessed the diagnostic performance according to sex [19–22] and age [20,22] included only a single FIT brand each. A further strength is that stool samples were collected from participants of colonoscopy screening prior to bowel preparation and the samples were stored in the same manner until parallel laboratory test analysis. Furthermore, the results of colonoscopy screening with adequate bowel preparation served as a reference standard to calculate diagnostic performance. In order to enhance comparability of diagnostic performance by sex and age across different FITs, we adjusted the cutoffs to yield equal overall specificities.
Our study also has limitations. Although more than 2000 participants of colonoscopy screening were recruited, the limited overall number of AN cases (n = 216) and randomly selected controls (n = 300) did not allow for in-depth analyses for each sex- and age-specific subset, for example, stratified by adenoma location. Despite the limited numbers, the pooled results of the GEE model revealed statistically significant differences in overall specificity, NPV, and positivity rate. The suggested differences in sensitivities warrant further research with larger case numbers. Future studies should also investigate potential mechanisms by which FIT sensitivity varies across groups of participants, for example, prevalence of anaemia according to age and sex.

Conclusions
In conclusion, we observed consistently higher sensitivities and lower specificities among males than among females with a number of different FITs and a broad range of threshold values. Furthermore, the analyses among men yielded consistently higher positivity rates, comparable PPVs, and lower NPVs than among women. According to age, no major differences in sensitivity and specificity were observed, but positive and negative predictive values differed, probably reflecting differences in AN prevalence between sexes and age groups. A negative FIT is less reliable for ruling out AN among men than among women and among older than among younger individuals. Further studies should address if, and to what extent, the sex- and age-specific differences might be relevant for the design of screening offers and interpretation of FIT results in various groups of screening participants, for example, by using sex- and age-specific cutoffs. These questions could, for example, be addressed in comprehensive modelling studies for which our results provide important input information. The diagnostic performance of FITs should be interpreted and compared with caution among studies with different sex or age distributions.

Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/cancers13143574/s1, Table S1: Test characteristics, Table S2: Partial area under the curve (% (95% CI)) for detection of advanced neoplasia by sex and by age.

Funding: The BLITZ study was partly funded by grants from the German Research Council (DFG, grant number BR1704/16-1). This work was partly funded by the German Federal Ministry of Education and Research (BMBF, grant number 01GL1712). The funder had no role in the study's design, conduct, interpretation and reporting.

Institutional Review Board Statement:
The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Ethics Committee of University of Heidelberg (178/2005) and those of the state chambers of physicians of Baden-Wuerttemberg (M118-05-f), Rhineland-Palatinate (837.047.06(5145)) and Saarland (217/13).

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data supporting reported results can be made available upon reasonable request.

Acknowledgments:
The authors thank Katarina Cuk for her excellent help in planning and conducting the project. They also thank Sabine Eichenherr, Romana Kimmel and Ulrike Schlesselmann for their excellent work in laboratory preparation of the stool samples and Volker Herrmann for his help in preparing the project.

Conflicts of Interest:
The authors declare no conflict of interest. The funder had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.