1.1. Background
In recent years, it has become widely recognized that the human microbiome significantly influences host digestion, immunity, metabolism, and even neural function, making its balance critical for overall health [1]. A central task in microbiome research is differential abundance analysis (DAA), which aims to identify reliable marker features that distinguish groups within microbiome datasets [2,3,4,5,6].
Microbiome datasets are characterized by high dimensionality and often include hundreds or even thousands of features, such as bacterial taxa or genes. Biomarker discovery typically involves testing each feature for statistical associations. Because each test carries a non-zero probability of a false positive (Type I error), conducting thousands of tests creates a high risk of obtaining significant results by chance alone. Consequently, microbial analysis inherently presents a multiple comparison problem, highlighting the need for hypothesis filtration [7,8].
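The scale of the multiple comparison problem can be illustrated with a short calculation. The sketch below is only an idealization: it assumes that all tests are independent and that every null hypothesis is true, neither of which holds exactly for microbiome features. The function names and the significance level of 0.05 are chosen for illustration.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among n_tests
    independent tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false positives under the same assumptions."""
    return n_tests * alpha


# Feature counts on the order of those mentioned in the text.
for n_tests in (1, 10, 100, 1000):
    fwer = familywise_error_rate(n_tests)
    expected = expected_false_positives(n_tests)
    print(f"{n_tests:>5} tests: FWER = {fwer:.4f}, expected false positives = {expected:.1f}")
```

Under these assumptions, the probability of at least one false positive already exceeds 0.99 at 100 tests, which is why uncorrected per-feature testing is not an option for datasets of this size.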
Given this context, the high dimensionality of microbiome data creates ample opportunity for false positives. Nonetheless, the number of identified microbiome markers is often strikingly small compared to the total number of taxonomic features examined [9,10], and the overwhelming majority of microbiome variation (>80%) consistently remains unaccounted for by the studied demographic, clinical, or environmental factors [11,12,13]. This discrepancy can be explained, in considerable part, by intra-person variability and the inherent biological stability of microbiome systems.
Research indicates that human microbiome systems are generally stable and resilient [1]. For instance, longitudinal studies demonstrate high rates of re-identification (over 86%) of unique microbial species upon repeated patient examinations [14]. Similarly, core microbiome species are consistently identified across large populations and explain a substantial proportion of microbiome variation [15]. At the same time, the largest part of explained microbiome variability is attributed to intra-person differences [16], complicating pattern identification. Nonetheless, dysbiosis can arise from known subtle microbial imbalances [17]. Minor shifts in the microbiome can trigger community-wide changes in structure, metabolic activity, or both, ultimately leading to significant phenotypic effects [18,19,20,21,22].
Thus, one of the methodological challenges of microbial DAA is the detection of subtle changes in a microbiome that has large intra-person variation, yet whose overall composition appears to be stable. The challenge of detecting these changes in microbial composition is threefold. First, increases in pathobionts or decreases in key commensals are often minor. Second, pathobionts and key commensals typically represent only a small fraction of the total microbial population. Third, the change in their relative abundance (which is already minimal) can itself be small. After conducting tests and adjusting for multiple comparisons across all taxonomic features, these subtle but biologically important changes can end up statistically non-significant.
The reason these changes are obscured lies in the nature of multiple comparison corrections. Although these hypothesis filtration methods are designed to control false positives (Type I errors), they inherently reduce statistical power [23]. This can be particularly problematic for high-dimensional data, such as microbiome datasets, where most features are not differentially abundant (i.e., “stable”). In this context, standard corrections become overly conservative, disproportionately penalizing the few true signals and leading to false negatives (Type II errors). Given this limitation, alternative hypothesis filtration approaches are needed in microbial DAA.
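The loss of power described above can be made concrete with a toy example. The sketch below implements the Bonferroni and Benjamini-Hochberg adjustments in plain Python and applies them to a made-up vector of p-values: 5 hypothetical true signals with raw p = 0.002 among 995 evenly spaced null p-values. The numbers are illustrative assumptions, not results from any dataset.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values (capped at 1)."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy input: 5 assumed true signals followed by 995 "stable" features.
signal_p = [0.002] * 5
null_p = [(i + 0.5) / 995 for i in range(995)]
pvals = signal_p + null_p

bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
for i in range(5):
    print(f"raw p = {pvals[i]:.3f}, Bonferroni = {bonf[i]:.3f}, BH = {bh[i]:.3f}")
```

In this toy setting all five signals are nominally significant at the 0.05 level before adjustment and none of them remains so afterwards, under either correction.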
Another popular option for ranking hypotheses is filtration by effect size [2]. Most effect size measures (e.g., from the r and d families) quantify the magnitude of an effect and thereby complement the statistical significance indicated by the p-value. Large effect sizes with reasonable sample sizes often reveal more about practical importance than small p-values. However, effect size measures present their own challenges. For example, although tools such as LEfSe (Linear discriminant analysis Effect Size) are widely popular in microbial DAA [2], they can generate or inflate false positives [7]. This vulnerability arises because effect sizes, whether LDA estimates from LEfSe or other common statistics such as Cliff’s δ, are single-point estimates intended to complement or extend the underlying significance test [24]. Consequently, these point estimates remain highly dependent on sample peculiarities.
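The sample dependence of a point estimate can be sketched with Cliff’s δ, computed here as the proportion of pairs in which the first sample is larger minus the proportion in which it is smaller. In the simulation, both groups are drawn from the same hypothetical skewed count generator (an assumption made purely for illustration), so the true δ is zero and any non-zero estimate reflects sampling noise alone.

```python
import random


def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y), estimated over all pairs."""
    greater = sum(1 for a in x for b in y if a > b)
    less = sum(1 for a in x for b in y if a < b)
    return (greater - less) / (len(x) * len(y))


def toy_counts(rng, n):
    """Hypothetical skewed count generator, used only for illustration."""
    return [int(rng.expovariate(0.1)) for _ in range(n)]


rng = random.Random(0)
# 200 replicate comparisons of two groups of 10 drawn from one population.
deltas = [
    cliffs_delta(toy_counts(rng, 10), toy_counts(rng, 10))
    for _ in range(200)
]
print(f"smallest delta: {min(deltas):.2f}, largest delta: {max(deltas):.2f}")
```

The spread between the smallest and largest estimate shows how far a single-point effect size can drift from the true value at small sample sizes.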
One well-known, straightforward way to address this issue is to consider confidence intervals (CIs). CIs for between-group differences can be seen as a better alternative to raw p-values and related point estimate statistics because they provide a range of plausible location shift values [25]. This range simultaneously conveys information about significance, the potential magnitude and direction of the effect, and the certainty of the estimate itself.
A critical question is what constitutes the optimal CI for analyzing microbiological data. Since a desirable CI roughly describes the plausible range for the true shift between the central tendencies of two samples, the first step can be to obtain robust estimates of the location parameters (the central values) of the microbial data itself.
The overdispersed, zero-inflated count data typical of microbiome samples can be adequately modeled using the negative binomial (NB) distribution [26]. This distribution models the number of failures before a specified number of successes in a series of independent Bernoulli trials. A defining feature of the NB distribution is its right-skewed, unimodal shape with a long right tail, which naturally captures the overdispersion (variance exceeding the mean) observed in microbial abundance data. The shape of the distribution, specifically its degree of skewness and the location of its peak, is governed by its parameters. As the number of required successes increases, the distribution becomes less skewed and its shape approaches that of a symmetric bell curve. Consequently, an NB distribution with substantial positive skew is best suited for modeling low-abundance, sparse taxa, while a distribution approaching a symmetric bell shape better describes common, abundant taxa. The open question is: which measure of central tendency would most accurately represent the center of NB-distributed microbial data?
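The relationship between the number of required successes and the skewness can be checked from the closed-form moments of the NB distribution. The sketch below uses the parameterization given in the text (failures before r successes, success probability p); the fixed value p = 0.2 and the function name are arbitrary choices for illustration.

```python
import math


def nb_moments(r, p):
    """Mean, variance, and skewness of the number of failures
    before r successes, with success probability p per trial."""
    mean = r * (1 - p) / p
    variance = r * (1 - p) / p ** 2
    skewness = (2 - p) / math.sqrt(r * (1 - p))
    return mean, variance, skewness


# p = 0.2 is an arbitrary illustrative value.
for r in (1, 5, 50, 500):
    mean, variance, skewness = nb_moments(r, 0.2)
    print(f"r = {r:>3}: mean = {mean:8.1f}, variance = {variance:9.1f}, "
          f"skewness = {skewness:.3f}")
```

For any p below 1, the variance equals the mean divided by p and therefore exceeds it, while the skewness shrinks in proportion to the inverse square root of r, in line with the shift from sparse, strongly skewed taxa to abundant, nearly symmetric ones.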
The mean is the most commonly used measure of central tendency. In the context of the NB distribution, it is directly sensitive to the parameters governing the distribution’s spread and location. It also aligns well with the zero-inflated nature of microbial data, making it a theoretically useful metric for reflecting both taxonomic abundance and prevalence. However, microbiome datasets frequently contain outliers, which in practice renders the mean an unreliable statistic that is unduly influenced by extreme values. Related robust metrics, such as the trimmed and winsorized mean, offer greater resistance to extreme values. Yet these methods introduce a new challenge: determining an optimal and biologically justified threshold for trimming or winsorization [27]. Moreover, even after applying these adjustments, the reliability of the resulting estimate is not guaranteed. The median, defined as the middle value of a ranked vector, is less susceptible to outliers than the mean, providing greater robustness for sample comparisons. However, it may be the least accurate theoretical estimate of the true central parameter of an NB distribution, as it does not account for right skew. Moreover, in the context of zero-inflated microbial data, the median tends to underestimate the true central tendency and can easily yield a zero estimate for samples with low prevalence.
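The trade-offs among these measures can be seen on a small, made-up abundance vector for a low-prevalence taxon with one extreme value. The data, the function names, and the trimming proportions below are illustrative assumptions only.

```python
import statistics


def trimmed_mean(values, proportion):
    """Mean after dropping the given proportion of values from each end."""
    xs = sorted(values)
    k = int(len(xs) * proportion)
    kept = xs[k:len(xs) - k] if k else xs
    return statistics.fmean(kept)


def winsorized_mean(values, proportion):
    """Mean after replacing the given proportion of values at each end
    with the nearest retained value."""
    xs = sorted(values)
    k = int(len(xs) * proportion)
    if k:
        xs = [xs[k]] * k + xs[k:len(xs) - k] + [xs[-k - 1]] * k
    return statistics.fmean(xs)


# Hypothetical counts: six zeros, three small values, one outlier.
counts = [0, 0, 0, 0, 0, 0, 2, 3, 5, 400]

print("mean:               ", statistics.fmean(counts))      # 41.0
print("median:             ", statistics.median(counts))     # 0.0
print("trimmed mean (10%): ", trimmed_mean(counts, 0.1))     # 1.25
print("trimmed mean (20%): ", trimmed_mean(counts, 0.2))
print("winsorized (10%):   ", winsorized_mean(counts, 0.1))  # 1.5
```

The mean is dominated by the single outlier, the median collapses to zero, and the trimmed mean changes noticeably between the 10% and 20% thresholds, which illustrates why the choice of threshold needs justification.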
A robust alternative to traditional measures can be the pseudo-median, calculated via the Hodges–Lehmann estimator (HLE) [28,29]. This estimator is sensitive to the right skew of the NB distribution without being unduly influenced by extreme outliers. In the one-sample case, the HLE is defined as the median of all pairwise averages, also known as Walsh averages. In a symmetric distribution, the HLE equals the median. However, for right-skewed NB data, the HLE lies between the median and the mean, effectively balancing the robustness of the median with the sensitivity of the mean. This balance is achieved by incorporating information about the data’s spread through all pairwise averages (similar to the mean) while remaining resistant to extreme values by taking their median. Consequently, the HLE offers a favorable compromise. The two-sample HLE is defined as the median of all pairwise differences between the observations in two samples. CIs for this estimator can be derived by inverting the Mann–Whitney U count statistic, yielding a non-parametric CI for the difference between the location parameters of two NB samples.
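The one-sample definition can be sketched in a few lines. The version below takes the Walsh averages over all pairs with i ≤ j, so each observation is also paired with itself, which is one common convention. The abundance vector is made up for illustration.

```python
import statistics


def walsh_averages(values):
    """All pairwise averages (x_i + x_j) / 2 for i <= j."""
    n = len(values)
    return [
        (values[i] + values[j]) / 2
        for i in range(n)
        for j in range(i, n)
    ]


def hodges_lehmann_one_sample(values):
    """Pseudo-median: the median of the Walsh averages."""
    return statistics.median(walsh_averages(values))


# Hypothetical counts: six zeros, three small values, one outlier.
counts = [0, 0, 0, 0, 0, 0, 2, 3, 5, 400]

print("median:       ", statistics.median(counts))          # 0.0
print("pseudo-median:", hodges_lehmann_one_sample(counts))  # 1.5
print("mean:         ", statistics.fmean(counts))           # 41.0
```

On this toy vector the pseudo-median falls between the median and the mean: it does not collapse to zero, yet it is barely affected by the value of 400. The number of Walsh averages grows quadratically with sample size, which is manageable for typical cohort sizes but worth keeping in mind for very large ones.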
Additionally, unlike conventional tests that primarily yield a p-value and corresponding test statistic, the two-sample HLE directly yields an intuitive and interpretable measure of the shift in central tendencies between groups (i.e., the median of all observed pairwise differences between two samples). Consequently, the one-sample HLE (pseudo-median) and the two-sample HLE for differences between locations may represent a powerful, assumption-lean, and robust tool for analyzing microbial data, warranting further investigation.
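A minimal sketch of the two-sample estimator and its CI follows. It sorts all pairwise differences and takes order statistics as interval bounds, with the rank obtained from a large-sample normal approximation to the Mann–Whitney U distribution. This approximation is an assumption of the sketch: it ignores ties, which are common in zero-heavy data, and exact quantiles of U would be preferable for small samples. The example values are made up.

```python
import math
from statistics import NormalDist, median


def hodges_lehmann_two_sample(x, y, confidence=0.95):
    """Median of all pairwise differences x_i - y_j, with an
    approximate CI from order statistics of those differences."""
    m, n = len(x), len(y)
    diffs = sorted(a - b for a in x for b in y)
    estimate = median(diffs)

    # Normal approximation to the lower critical value of U (no tie correction).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    k = math.floor(m * n / 2 - z * math.sqrt(m * n * (m + n + 1) / 12))
    k = max(k, 1)  # fall back to the extreme differences for tiny samples

    lower = diffs[k - 1]      # k-th smallest difference
    upper = diffs[m * n - k]  # k-th largest difference
    return estimate, (lower, upper)


# Hypothetical abundances of one taxon in two groups.
group_a = [12, 15, 9, 20, 14, 11, 30, 13]
group_b = [5, 8, 7, 3, 10, 6, 4, 9]

shift, (low, high) = hodges_lehmann_two_sample(group_a, group_b)
print(f"estimated shift: {shift}, approximate 95% CI: [{low}, {high}]")
```

The estimate and both bounds are expressed in the same units as the data, so the interval can be read directly as a plausible range for the shift. An interval that excludes zero corresponds to a significant difference at the matching level.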