1. Introduction
Information theory provides powerful tools for quantifying relationships between variables across scientific disciplines. Among these, mutual information (MI) stands out for its ability to capture both linear and nonlinear dependencies while remaining robust to small sample sizes [1,2,3]. Unlike correlation-based measures, MI is sensitive to the complete dependence structure between variables, making it particularly valuable for complex data analysis [4].
The foundation of MI lies in Shannon’s entropy framework [5], originally defined for discrete variables and later extended to continuous variables as differential entropy. For a continuous random variable X with probability density function f(x), the differential entropy is:

h(X) = -\int f(x) \log f(x) \, dx. (1)

This definition applies to both bounded and unbounded domains, provided the integral converges. Our invariant measure methodology requires that the median of k-nearest neighbor distances stabilizes with increasing sample size, a condition empirically satisfied even for heavy-tailed distributions such as Cauchy and Lévy (Table 2).
However, differential entropy suffers from a critical limitation: it is not invariant under scale transformations. Under a linear transformation Y = aX + b with a \neq 0, the entropy transforms as:

h(Y) = h(X) + \log|a|. (2)
This scale-dependence means that entropy values depend on measurement units rather than reflecting intrinsic properties of the distribution. For instance, measuring temperature in Celsius versus Fahrenheit yields different entropy values, even though the underlying physical system is identical. This limitation extends to mutual information, potentially leading to misleading results when comparing variables with different scales.
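To make this unit-dependence concrete, here is a minimal numeric check (a standalone Python sketch, not code from the paper) using the closed-form entropy of a normal distribution: re-expressing a temperature-like variable in Fahrenheit rescales it by 1.8, shifting the differential entropy by exactly log 1.8.

```python
import math

# differential entropy of N(mu, sigma^2): h = 0.5 * log(2*pi*e*sigma^2)
def h_normal(sigma):
    return 0.5 * math.log(2 * math.pi * math.e * sigma**2)

h_C = h_normal(5.0)        # temperature spread of 5 degrees Celsius
h_F = h_normal(5.0 * 1.8)  # identical data expressed in Fahrenheit (x * 1.8 + 32)
print(round(h_F - h_C, 12), round(math.log(1.8), 12))  # equal: h(aX + b) = h(X) + log|a|
```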
MI is zero when X and Y are independent and increases with the dependence between the variables. This measure is always non-negative and has no upper bound, expressed as I(X;Y) \in [0, \infty). MI can be expressed in terms of differential entropy as:

I(X;Y) = h(X) + h(Y) - h(X,Y), (3)

where h(X,Y) denotes the joint differential entropy. Equivalently, I(X;Y) = D_{KL}(P_{XY} \| P_X \otimes P_Y), measuring the KL divergence from the joint distribution to the product of marginals. While MI shares some metric-like properties (non-negativity, symmetry), it does not satisfy the triangle inequality and is therefore not a true metric in the mathematical sense; we use the term “measure of dependence” throughout.
A significant challenge in many practical applications lies in estimating entropy from finite samples when the underlying probability density function (pdf) is unknown. Techniques for entropy estimation can be broadly classified into two main categories: parametric and nonparametric methods [6]. Parametric methods assume that the form of the pdf is known, reducing the problem to estimating the parameters from the data [1]. Nonparametric methods do not make assumptions about the form of the pdf, making them more versatile and widely applicable. Among these, popular approaches include histogram-based methods [7], kernel density estimators (KDEs) [8], and entropy-based statistical tests [9]. Another nonparametric approach is based on the k-nearest neighbors (kNN) method [3,6], which is computationally efficient and has been shown to be robust even with small sample sizes [10,11].
Despite their popularity, traditional nonparametric methods have significant limitations. Histogram-based methods are highly sensitive to the choice of bin width: too few bins lead to oversmoothing and information loss, while too many bins result in high variance and systematic bias that increases with the number of bins. The optimal bin width is data-dependent and difficult to determine a priori. KDE methods face similar bandwidth selection challenges. While the kNN method addresses some of these issues by using a single, interpretable parameter
k and demonstrating good convergence properties, it remains sensitive to scale transformations. For comprehensive coverage of information-theoretic foundations, we refer readers to [12].
The root of the scale-dependence problem lies in the transition from discrete to continuous entropy. To understand this, consider the Kullback–Leibler (KL) divergence [13], which measures the dissimilarity between distributions P and Q:

D_{KL}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx. (4)

The KL divergence relates to differential entropy through:

D_{KL}(P \| Q) = -h(X) - \int p(x) \log q(x) \, dx. (5)

When Q is a uniform distribution over an interval of length r, we obtain:

D_{KL}(P \| Q) = \log r - h(X). (6)
This reveals that differential entropy implicitly compares the distribution to a uniform reference whose scale affects the entropy value. The scale-dependence arises because the reference scale r changes under transformations.
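The relation between the divergence to a uniform reference and differential entropy is easy to verify numerically. The sketch below (our illustration; the Beta(2,2) shape rescaled to [0, r] is an arbitrary choice) checks by quadrature that the KL divergence to the uniform reference equals log r − h(X).

```python
import numpy as np
from scipy import integrate
from scipy.stats import beta

# P = Beta(2,2) stretched to [0, r]; U = uniform on [0, r] with density 1/r
r = 4.0
p = lambda x: beta.pdf(x / r, 2, 2) / r
# differential entropy h(X) and D_KL(P || U), both by numeric quadrature
h  = integrate.quad(lambda x: -p(x) * np.log(p(x)), 1e-9, r - 1e-9)[0]
kl = integrate.quad(lambda x:  p(x) * np.log(p(x) * r), 1e-9, r - 1e-9)[0]
print(round(kl, 6), round(np.log(r) - h, 6))  # the two sides agree
```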
Jaynes [14] recognized this fundamental issue and proposed addressing it through an “invariant measure” m(x) representing complete ignorance about the probability distribution. This leads to the limiting density of discrete points (LDDP):

H(X) = -\int p(x) \log \frac{p(x)}{m(x)} \, dx. (7)

Comparing Equations (4) and (7), we observe that the LDDP can be interpreted as a KL divergence where the reference distribution is replaced by the invariant measure. However, Jaynes did not provide a concrete method for computing m(x), leaving the central question unanswered: what properties must m(x) satisfy, and how can it be estimated from data?
Recent work by Nagel and coworkers [15] proposed normalizing MI using an invariant measure, but their approach introduces inconsistencies: the marginal entropies vary depending on which variables are paired together, violating the fundamental principle that a variable’s entropy should be intrinsic to that variable alone.
Our contributions: This paper establishes a rigorous connection between the KL divergence framework and Jaynes’s invariant measure concept. We demonstrate that:
The invariant measure can be rigorously defined through specific mathematical properties and computed from data using k-nearest neighbor distances;
This measure naturally emerges from requiring transformation invariance analogous to the KL divergence framework;
The resulting invariant entropy corresponds to the LDDP and is truly scale-invariant;
The approach generalizes naturally to multivariate settings, enabling consistent scale-invariant MI estimation;
The median of nearest neighbor distances provides a robust estimator that avoids negative entropy values and identifies distribution families.
The present manuscript is organized as follows.
Section 2 develops the theoretical framework connecting KL divergence to the invariant measure, proves invariance properties, and presents the computational method.
Section 3 validates the approach through simulations and demonstrates its advantages over traditional methods.
Section 4 discusses connections to information geometry, maximum entropy principles, and broader implications.
Section 5 concludes with future directions.
2. Materials and Methods
2.1. From Kullback–Leibler Divergence to Invariant Measure
To understand the connection between KL divergence and invariant entropy, we examine what happens when comparing a distribution to a uniform reference. We use a bounded interval here for pedagogical clarity; the resulting invariant measure framework applies broadly to distributions with bounded or unbounded support, as demonstrated in Section 3. Consider a uniform distribution U with density u(x) = 1/r for x \in [0, r]. For a distribution P with support contained in [0, r], the KL divergence is:

D_{KL}(P \| U) = \int_0^r p(x) \log\big( p(x)\, r \big) \, dx = \log r - h(X).
Under an affine transformation y = ax + b with a > 0, the uniform reference transforms to U' on [b, ar + b] with density u'(y) = 1/(ar). The KL divergence becomes:

D_{KL}(P' \| U') = \log(ar) - h(Y) = \log r + \log a - \big( h(X) + \log a \big) = \log r - h(X) = D_{KL}(P \| U).
This demonstrates that the KL divergence to a uniform reference is invariant under affine transformations, but differential entropy is not because the log-volume term changes. The key insight is that we need a reference measure that adapts to the data scale in a way that removes this scale-dependence.
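This invariance can be confirmed numerically. In the sketch below (our illustration; P is taken as Beta(2,2) on [0,1] and y = 3x + 2 is a hypothetical affine map), the KL divergence to the matching uniform reference is unchanged by the transformation.

```python
import numpy as np
from scipy import integrate
from scipy.stats import beta

# P = Beta(2,2) on [0,1]; affine map y = a*x + b; U' = uniform on [b, a + b]
a, b = 3.0, 2.0
p  = lambda x: beta.pdf(x, 2, 2)
py = lambda y: p((y - b) / a) / a  # density of Y = a*X + b
# D_KL(P || U(0,1)) and D_KL(P' || U'), both by quadrature
kl  = integrate.quad(lambda x: p(x) * np.log(p(x)), 1e-9, 1 - 1e-9)[0]
kly = integrate.quad(lambda y: py(y) * np.log(py(y) * a), b + 1e-9, a + b - 1e-9)[0]
print(round(kl, 6), round(kly, 6))  # equal: the divergence is unchanged by the affine map
```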
2.2. Definition of the Invariant Measure
We propose that the invariant measure should satisfy properties that ensure scale invariance while remaining practically computable from data.
Proposition 1. Let m(\cdot) be an invariant measure function assigning a positive scalar m(X) to a random variable X. We require:
- (i) Positivity: m(X) > 0 for any random variable X;
- (ii) Scale equivariance: m(aX) = |a|\, m(X) for any a \neq 0;
- (iii) Translation invariance: m(X + b) = m(X) for any b \in \mathbb{R};
- (iv) Consistency with KL divergence: The measure should lead to an entropy-like quantity that behaves as a KL divergence from the data distribution to a reference distribution.
These properties ensure that m(X) captures the intrinsic scale of the distribution independent of affine transformations. We now define the invariant differential entropy as:

h_{\mathrm{inv}}(X) = h\!\left( \frac{X}{m(X)} \right) = h(X) - \log m(X).
Theorem 1 (Invariance of h_{\mathrm{inv}}). Let m satisfy Properties (i)–(iii) of Proposition 1. Then, for any transformation Y = aX + b with a \neq 0:

h_{\mathrm{inv}}(Y) = h_{\mathrm{inv}}(X).

Proof. Let Y = aX + b. Based on Properties (iii) and (ii), we have:

m(Y) = m(aX + b) = m(aX) = |a|\, m(X).

Let \tilde{X} = X / m(X) be the normalized variable. Under the transformation Y = aX + b, we have:

\tilde{Y} = \frac{Y}{m(Y)} = \frac{aX + b}{|a|\, m(X)} = \mathrm{sign}(a)\, \tilde{X} + c,

where c = b / (|a|\, m(X)) is a constant and \mathrm{sign}(a) = a / |a|.

Now, we apply the change-of-variables formula for differential entropy. For a random variable Z with density f_Z and a transformation W = g(Z) where g is differentiable and invertible, the density of W is:

f_W(w) = f_Z\big( g^{-1}(w) \big) \left| \frac{d g^{-1}(w)}{dw} \right|.

For a linear transformation W = \alpha Z + \beta, we have g^{-1}(w) = (w - \beta)/\alpha and |d g^{-1}/dw| = 1/|\alpha|, giving:

h(W) = h(Z) + \log|\alpha|.

Substituting Z = \tilde{X}, \alpha = \mathrm{sign}(a), \beta = c:

h(\tilde{Y}) = h(\tilde{X}) + \log|\mathrm{sign}(a)|.

However, for our normalized variable, \alpha = \mathrm{sign}(a) has |\alpha| = 1, so:

h(\tilde{Y}) = h(\tilde{X}).

Therefore, h_{\mathrm{inv}}(Y) = h(\tilde{Y}) = h(\tilde{X}) = h_{\mathrm{inv}}(X), establishing invariance. □
2.3. Connection to Jaynes’s LDDP
We now show that h_{\mathrm{inv}} is equivalent to Jaynes’s LDDP. Using the change of variables \tilde{x} = x / m(X) for the normalized variable \tilde{X} = X / m(X), we have d\tilde{x} = dx / m(X). The density transforms as:

f_{\tilde{X}}(\tilde{x}) = m(X)\, f_X\big( m(X)\, \tilde{x} \big), \qquad h_{\mathrm{inv}}(X) = -\int f_X(x) \log\big( f_X(x)\, m(X) \big)\, dx.

This can be expressed as:

h_{\mathrm{inv}}(X) = -\int f_X(x) \log \frac{f_X(x)}{m(x)} \, dx, \qquad m(x) = \frac{1}{m(X)},

which is precisely Jaynes’s expression (7) with constant invariant measure m(x) = 1/m(X). Hence, as mentioned in the Introduction, the invariant entropy corresponds to comparing the distribution to a uniform reference over an interval of length m(X), which adapts to the data scale.
2.4. Estimation of the Invariant Measure
To make this framework practical, we need a method to estimate m(X) from data. We propose using the k-nearest neighbor (kNN) approach, which aligns naturally with kNN-based entropy estimation methods.
Given a sample \{x_1, \ldots, x_n\} from X, we compute the distance from each point to its k-th nearest neighbor:

d_i^{(k)} = \min_k \{ |x_i - x_j| : j \neq i \},

where \min_k denotes the k-th smallest value. For simplicity, we focus on k = 1 (nearest neighbor). The vector of nearest neighbor distances (d_1, \ldots, d_n) captures the local density structure. We propose:

m(X) = \mathrm{median}(d_1, \ldots, d_n).
Justification for using the median:
(1) Robustness: The median is robust to outliers, reflecting “complete ignorance” about points far from the data bulk. Adding a single distant outlier to a sample introduces one large nearest-neighbor distance while leaving the others essentially unchanged, so the median of the distance vector barely moves; the mean, by contrast, can shift substantially, demonstrating its inferior robustness.
(2) Scale equivariance: For scaled data aX, nearest-neighbor distances scale as |a| d_i, so \mathrm{median}(|a| d_1, \ldots, |a| d_n) = |a|\, \mathrm{median}(d_1, \ldots, d_n), satisfying Property (ii) of Proposition 1.
(3) Translation invariance: Translating the data by b does not change pairwise distances, so the median remains unchanged, satisfying Property (iii).
(4) Avoids negative entropy: Using the mean can lead to negative entropy. For example, the exponential and normal distributions yield negative values with the mean-based measure (Table 1).
Empirically, the median avoids negative entropy values while preserving the desired invariance properties. This robustness likely stems from the median’s optimality as an \ell_1 center, which is less sensitive to extreme values in the distance distribution than the mean (\ell_2 center). A rigorous proof establishing conditions under which the median-based measure guarantees non-negative entropy for all distribution classes remains an important open theoretical question.
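The robustness argument can be illustrated with a small numeric sketch (the distance values below are hypothetical, chosen for illustration): a single large outlier distance barely moves the median but pulls the mean far away.

```python
import numpy as np

d = np.array([0.8, 0.9, 1.0, 1.1, 1.2])  # hypothetical nearest-neighbor distances
d_out = np.append(d, 50.0)               # one outlier distance added
print(np.median(d), np.median(d_out))    # 1.0 vs 1.05: the median barely moves
print(round(d.mean(), 2), round(d_out.mean(), 2))  # 1.0 vs 9.17: the mean blows up
```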
Theoretical status: Our main results (Theorems 1 and 2) rigorously establish that the median-based invariant measure satisfies scale equivariance and translation invariance. The non-negativity of h_{\mathrm{inv}} has been verified empirically for all distribution families in Table 2, but a general proof guaranteeing h_{\mathrm{inv}}(X) \geq 0 for all distributions remains an important open question.
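The full pipeline can be sketched end to end, assuming the Kozachenko–Leonenko form of the kNN entropy estimator (a standalone Python illustration; the paper’s EntropyInvariant.jl implementation may differ in constants and defaults): estimate m(X) as the median nearest-neighbor distance, normalize, and estimate the entropy of X/m(X). The property checked here is exact invariance under affine rescaling; the absolute value returned depends on the estimator’s conventions and sample size.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def invariant_measure(x, k=1):
    """m(X): median of k-th nearest-neighbor distances (k = 1 by default)."""
    pts = np.asarray(x, float).reshape(-1, 1)
    d, _ = cKDTree(pts).query(pts, k=k + 1)  # column 0 is the point itself
    return np.median(d[:, k])

def invariant_entropy(x, k=1):
    """Kozachenko-Leonenko entropy (nats) of the normalized variable X / m(X)."""
    z = (np.asarray(x, float) / invariant_measure(x, k)).reshape(-1, 1)
    n = len(z)
    eps, _ = cKDTree(z).query(z, k=k + 1)
    # 1D KL estimator; c_1 = 2 is the volume of the unit ball
    return digamma(n) - digamma(k) + np.log(2.0) + np.mean(np.log(eps[:, k]))

rng = np.random.default_rng(0)
x = rng.normal(size=4000)
h_a = invariant_entropy(x)
h_b = invariant_entropy(100.0 * x + 7.0)  # same data in different "units"
print(abs(h_a - h_b) < 1e-9)              # invariant under affine rescaling
```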
2.5. Multivariate and Multidimensional Generalization
We distinguish between the multivariate setting (multiple random variables X_1, \ldots, X_p, each potentially vector-valued) and the multidimensional setting (a single variable X \in \mathbb{R}^d). For a multidimensional variable, m(X) is computed from kNN distances in \mathbb{R}^d. For multiple variables, each m(X_i) is computed from the marginal distribution independently, and the joint measure uses the product form m(X_1) \cdots m(X_p).
The extension to multiple variables follows naturally from the KL divergence perspective. For two random variables X and Y, the joint LDDP is:

H(X, Y) = -\iint p(x, y) \log \frac{p(x, y)}{m(x, y)} \, dx\, dy.

Proposition 2 (Separable invariant measure). For independent scale transformations, the invariant measure for the joint distribution should satisfy:

m(X, Y) = m(X)\, m(Y),

where m(X) and m(Y) are the marginal invariant measures.

Theorem 2 (Joint invariant entropy). Let X and Y be jointly distributed random variables with joint density f_{X,Y}, marginal densities f_X and f_Y, and marginal invariant measures m(X) and m(Y) computed from the respective marginal samples. Define the normalized variables \tilde{X} = X / m(X) and \tilde{Y} = Y / m(Y). The invariant joint entropy is:

h_{\mathrm{inv}}(X, Y) = h(\tilde{X}, \tilde{Y}) = h(X, Y) - \log m(X) - \log m(Y).

This is invariant under independent affine transformations X' = aX + b, Y' = cY + d for any a, c \neq 0 and b, d \in \mathbb{R}.

Geometric interpretation via kNN distances:
The kNN entropy estimator in 2D uses Euclidean distances. For points (x_i, y_i) and (x_j, y_j):

d_{ij} = \sqrt{ (x_i - x_j)^2 + (y_i - y_j)^2 }.

Under a uniform transformation (x, y) \mapsto (ax, ay), distances scale as |a|\, d_{ij}. However, for independent transformations (x, y) \mapsto (ax, cy) with a \neq c:

d'_{ij} = \sqrt{ a^2 (x_i - x_j)^2 + c^2 (y_i - y_j)^2 },

which is not simply proportional to d_{ij}.
By normalizing each coordinate by its invariant measure:

\tilde{d}_{ij} = \sqrt{ \left( \frac{x_i - x_j}{m(X)} \right)^2 + \left( \frac{y_i - y_j}{m(Y)} \right)^2 },

the normalized coordinates create a natural reference frame where distances are invariant under independent scale transformations—precisely what is needed for invariant entropy estimation. The invariant mutual information is then:

I_{\mathrm{inv}}(X; Y) = h_{\mathrm{inv}}(X) + h_{\mathrm{inv}}(Y) - h_{\mathrm{inv}}(X, Y).
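The construction can be sketched with a KSG-style MI estimator (Kraskov et al.’s algorithm 1) applied in the normalized reference frame — our Python illustration, not the paper’s implementation. With independent variables of deliberately disparate scales, echoing the setting of Section 3.3, the estimate stays near zero and is unchanged when one variable is rescaled.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def med_nn(v):
    """Marginal invariant measure: median nearest-neighbor distance."""
    pts = v.reshape(-1, 1)
    d, _ = cKDTree(pts).query(pts, k=2)
    return np.median(d[:, 1])

def invariant_mi(x, y, k=3):
    """KSG MI estimator applied to median-normalized coordinates."""
    x = np.asarray(x, float); y = np.asarray(y, float)
    x = x / med_nn(x); y = y / med_nn(y)  # the invariant reference frame
    n = len(x)
    xy = np.column_stack([x, y])
    eps, _ = cKDTree(xy).query(xy, k=k + 1, p=np.inf)  # Chebyshev in the joint space
    eps = eps[:, k]
    tx, ty = cKDTree(x[:, None]), cKDTree(y[:, None])
    # marginal neighbor counts strictly inside the joint kNN radius
    nx = np.array([len(tx.query_ball_point([x[i]], eps[i] * (1 - 1e-10))) - 1 for i in range(n)])
    ny = np.array([len(ty.query_ball_point([y[i]], eps[i] * (1 - 1e-10))) - 1 for i in range(n)])
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

rng = np.random.default_rng(1)
x = rng.normal(0.0, 0.1, 2000)   # scale ~0.1, as in the text's example
z = rng.normal(0.0, 10.0, 2000)  # scale ~10, independent of x
print(round(invariant_mi(x, z), 3))                          # near 0 for independent data
print(abs(invariant_mi(x, z) - invariant_mi(5 * x, z)))      # unchanged under rescaling
```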
3. Results
3.1. Validation with Standard Distributions
To validate our invariant measure approach, we performed extensive simulations comparing it with traditional kNN and histogram methods. The purpose of Figure 1 is to test the core invariance property: distributions differing only in scale parameters should yield identical invariant entropy. Each column corresponds to a different distribution family (uniform, normal, exponential), with three scale parameter values overlaid. The top panels show convergence with sample size, while the bottom panels show stability across the number of neighbors k.
The key observation is that distributions with the same shape but different scale parameters yield identical invariant entropy values (overlapping curves in
Figure 1). This confirms that the invariant measure successfully removes scale dependence. The convergence is rapid, with the estimation stabilizing at approximately 1500 samples. Moreover, the standard deviation is small and remains consistent across different parameter values, demonstrating the robustness of the approach.
3.2. Distribution Identification via Invariant Entropy
The median-based invariant entropy identifies distribution families independent of their location and scale parameters.
Table 2 presents invariant entropy values for 14 common distribution families, ordered from lowest to highest. The ordering reflects a spectrum from highly predictable local structure (arcsine, 1.008) to highly unpredictable local structure (Lévy, 1.973). Several observations emerge. First, the invariant entropy uniquely characterizes each distribution family: all normal distributions share the same value (1.150), all exponential distributions share the same value (1.227), etc. The parameter-free nature of the invariant entropy is demonstrated in Figure 1, where the curves for different scale parameters overlap completely. Second, the arcsine distribution has the lowest invariant entropy (1.008), even lower than the uniform distribution (1.060). This reflects the arcsine distribution’s distinctive property: its probability density diverges at the boundaries of its support, where probability mass concentrates. Given the typical nearest-neighbor spacing captured by m(X), points near the boundaries are highly predictable relative to the invariant measure scale, resulting in lower entropy. Third, heavy-tailed distributions like Cauchy and Lévy have higher invariant entropy, reflecting their greater unpredictability at the scale of typical nearest-neighbor distances.
3.3. Scale-Invariant Mutual Information
To demonstrate the practical advantage of invariant MI, we simulated three independent variables X, Y, and Z with vastly different scales, ranging from roughly 0.1 for X up to roughly 10 for Z. These differing standard deviations highlight the advantages of the invariant entropy estimation. Since the variables are independent, the theoretical MI between any pair should be zero.
Figure 2 demonstrates striking differences between methods. The histogram method (panels a,d) achieves scale invariance, with all three MI estimates superposed in panel (a). However, convergence toward the theoretical value of zero is extremely slow as sample size increases (panel a), requiring thousands of points to approach the correct value. Furthermore, panel (d) reveals significant systematic bias that varies with the number of bins, making parameter selection critical and problematic. The bias increases substantially with larger bin counts, demonstrating a fundamental limitation of the binning approach.
The kNN method (panels b,e) shows faster convergence than histograms. In panel (b), I(X;Y) and I(Y;Z) are superposed and converge relatively quickly to zero. Panel (e) reveals no bias for small values of k—the range typically preferred in the literature. However, a severe breakdown emerges for I(X;Z): in panel (e), this estimate diverges toward increasingly negative values, a physical impossibility since mutual information is non-negative by definition. This failure is not a minor numerical artifact but a fundamental problem: the divergence persists across all values of k (panel e). The breakdown occurs because the kNN estimator implicitly assumes comparable scales when computing joint nearest-neighbor distances. The 100-fold magnitude difference between X (scale ∼ 0.1) and Z (scale ∼ 10) causes the Z coordinate to completely dominate the Euclidean distance metric, corrupting the joint entropy estimation.
In contrast, the invariant method (panels c,f) demonstrates superior performance across all metrics. In panel (c), all three MI estimates (I(X;Y), I(X;Z), and I(Y;Z)) are perfectly superposed, confirming complete scale invariance regardless of the magnitude differences between variables. The convergence toward zero is faster than both the histogram and kNN methods, achieving accurate estimates with fewer than 2500 samples. Crucially, panel (f) shows no bias regardless of the number of neighbors k, eliminating the need for careful parameter tuning. The estimates remain stable and centered near zero across the entire range of k considered. While finite-sample estimation errors can occasionally produce small negative values near zero due to statistical fluctuations, the invariant method avoids the divergence that afflicts standard kNN estimation.
These simulations establish the superiority of the invariant approach across multiple dimensions. First, it achieves faster convergence than competing methods, requiring fewer samples to reach accurate estimates. Second, it demonstrates complete scale invariance, with all MI pairs collapsing to a single curve independent of scale differences spanning four orders of magnitude. Third, it exhibits no parameter-dependent bias, maintaining stability across a wide range of k values without requiring careful tuning. Fourth, it avoids the catastrophic divergence to negative values that plagues traditional kNN estimation when variables have disparate scales.
3.4. Computational Efficiency
The computational complexity of our method is identical to standard kNN entropy estimation: O(n \log n) for sorting and O(n \log n) for nearest-neighbor search using KD-trees or ball trees. The additional computation of m(X) via the median is O(n), making it negligible compared to the nearest-neighbor search. The Julia package EntropyInvariant.jl provides an efficient implementation with performance comparable to standard kNN methods (see Appendix A for detailed usage examples).
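The O(n) median claim can be illustrated with a selection-based computation (a sketch of the idea, not the package’s code; NumPy’s np.partition uses introselect with average linear time, so no full sort is needed):

```python
import numpy as np

def median_select(d):
    """Median via partial selection (average O(n)), no full sort."""
    d = np.asarray(d, float)
    n = len(d)
    mid = n // 2
    part = np.partition(d, [mid - 1, mid] if n % 2 == 0 else mid)
    return part[mid] if n % 2 else 0.5 * (part[mid - 1] + part[mid])

rng = np.random.default_rng(2)
d = rng.exponential(size=10**5)
print(abs(median_select(d) - np.median(d)) < 1e-12)  # matches the library median
```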
4. Discussion
4.1. Theoretical Contributions
This work establishes a rigorous connection between Kullback–Leibler divergence and Jaynes’s limiting density of discrete points. Our main theoretical contributions are:
Formalization of the invariant measure: We have shown that the invariant measure can be understood through the lens of KL divergence as a data-adaptive reference scale. When differential entropy is expressed relative to a uniform distribution, the implicit scale factor introduces scale-dependence. By comparing to a reference distribution with characteristic scale estimated from the data itself, we remove this implicit scale dependence. This is analogous to using an empirical prior in Bayesian statistics: the reference incorporates information about the typical scale of the phenomenon.
Bridge between discrete and continuous entropy: The KL divergence framework naturally connects Shannon’s discrete entropy to differential entropy. Our invariant measure provides the missing piece for making the continuous case behave like the discrete case with respect to invariance properties. Just as Shannon entropy for discrete variables is invariant to relabeling of outcomes, our invariant entropy for continuous variables is invariant to rescaling of measurement units.
Rigorous change-of-variables proof: Unlike previous informal arguments, our proof of invariance (Theorem 1) explicitly uses the change-of-variables formula for probability densities. This establishes the result on solid mathematical foundations and clarifies the role of the Jacobian in transformation properties.
Multivariate generalization with geometric interpretation: The extension to joint distributions follows naturally from the separability principle in KL divergence (Proposition 2). The geometric interpretation shows that normalizing by marginal invariant measures creates a coordinate system where Euclidean distances are invariant under independent scale transformations, a property essential for multivariate kNN entropy estimation.
4.2. Comparison with Existing Approaches
Traditional histogram methods: As demonstrated in the original work and
Figure 2, histogram methods suffer from systematic bias that increases with the number of bins. There is no principled way to choose the optimal bin width, and the method shows severe scale sensitivity. Our approach eliminates these issues by using a data-adaptive scale.
Standard kNN estimator: The standard kNN estimator [3] provides consistent entropy estimates and is computationally efficient. However, it is not scale-invariant, as evidenced by the negative MI values in Figure 2e. Our approach modifies this by normalizing the data by m(X) before applying the kNN estimator, yielding h_{\mathrm{inv}}(X) = h(X / m(X)). This simple modification preserves all the advantages of kNN methods while adding scale invariance.
Kernel density estimators: KDE methods [8] face similar bandwidth selection challenges as histogram methods. While sophisticated adaptive bandwidth selection procedures exist, they add computational complexity. Our median-based invariant measure provides a simple, robust alternative that requires no tuning beyond the standard k parameter.
Nagel et al.’s approach: Nagel and coworkers [15] proposed normalizing MI by subtracting a scaling factor computed from marginal entropies. However, their normalization has a fundamental flaw: it affects marginal entropies differently depending on which variables are paired together. For example, the normalized entropy of X when computed with Y differs from its normalized entropy when computed with Z. This violates the principle that a variable’s entropy should be an intrinsic property. Our approach provides consistent entropy for each variable regardless of which other variables it is paired with, because each variable has its own invariant measure m(X_i) computed from its marginal distribution.
4.3. Interpretation of Results
Our simulation results demonstrate several key properties:
- 1.
Fast convergence with small samples: The invariant estimator converges faster than traditional methods, particularly when variables have different scales (Figure 2, panels c,f versus panels a,b,d,e). This occurs because normalization by m(X) and m(Y) brings variables to comparable scales before computing the joint entropy. The kNN distance calculations then operate in a balanced space where all dimensions contribute equally.
- 2.
Distribution identification: Table 2 shows that distributions maintain consistent invariant entropy values regardless of location and scale parameters. From the KL divergence perspective, this makes sense: distributions in the same family (e.g., all normal distributions) differ only in the location parameter \mu and scale parameter \sigma. The invariant measure removes precisely these parameters, leaving only the intrinsic “shape” of the distribution. This property enables distribution classification based solely on shape characteristics.
- 3.
Boundary concentration in arcsine distribution: The arcsine distribution has the lowest invariant entropy (1.008), even lower than uniform (1.060). This initially surprising result reflects a deep property of the arcsine distribution. Its density diverges at the boundaries, meaning probability mass concentrates there. When we measure entropy relative to the typical nearest-neighbor spacing m(X), points near the boundaries are highly predictable, and their neighbors must also be near the boundary. This local predictability, captured by the invariant measure, results in lower entropy despite the distribution appearing “spread out” over its support.
4.4. Relationship to Maximum Entropy Principle
The Maximum Entropy (MaxEnt) principle and our invariant entropy framework address different questions. MaxEnt is a distribution selection principle: given constraints (e.g., fixed mean, fixed variance), it selects the distribution that maximizes entropy, yielding the “least biased” distribution compatible with the constraints. Invariant entropy is an entropy measurement principle: given a distribution (known or empirically observed), it computes an entropy value that is invariant to measurement units. These frameworks are complementary rather than competing.
The MaxEnt principle yields different distributions depending on the constraint structure:
Bounded support constraint: Among all distributions with support contained in a fixed interval [a, b], the uniform distribution U(a, b) maximizes differential entropy:

h(X) = \log(b - a).

For the uniform distribution, the variance \sigma^2 = (b - a)^2 / 12 is determined by the interval bounds, not independently specified.
Fixed variance constraint on \mathbb{R}: Among all distributions on \mathbb{R} with fixed variance \sigma^2, the normal distribution \mathcal{N}(\mu, \sigma^2) maximizes differential entropy:

h(X) = \frac{1}{2} \log(2 \pi e \sigma^2).

This is fundamentally different: the domain is unbounded, and the variance is an independent constraint.
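Both closed forms can be checked against SciPy’s built-in differential entropies (a quick verification sketch; the parameter values are arbitrary):

```python
import math
from scipy.stats import norm, uniform

sigma, width = 2.5, 4.0
# closed forms from the two MaxEnt results above
h_norm_closed = 0.5 * math.log(2 * math.pi * math.e * sigma**2)
h_unif_closed = math.log(width)
print(abs(norm.entropy(scale=sigma) - h_norm_closed) < 1e-9)     # N(0, sigma^2)
print(abs(uniform.entropy(scale=width) - h_unif_closed) < 1e-9)  # U(0, width)
```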
From the invariant entropy perspective, all uniform distributions yield h_{\mathrm{inv}} = 1.060 and all normal distributions yield h_{\mathrm{inv}} = 1.150, regardless of their parameters (Table 2). The uniform has relatively low invariant entropy (1.060). This reflects what the invariant measure captures: the median nearest-neighbor distance reflects the typical local spacing of points. For a uniform distribution, this spacing is highly regular—when we normalize by m(X), all points lie within a predictable range relative to their typical spacing. In contrast, heavy-tailed distributions like Cauchy and Lévy exhibit extreme variability in local density: some regions have tightly clustered points while others are sparse. This variability persists after normalization, yielding higher invariant entropy. Distributions with high standard differential entropy (given appropriate constraints) also tend to have high invariant entropy, suggesting consistency between the two frameworks rather than competition.
4.5. Connections to Information Geometry
From the perspective of information geometry [16], our approach can be understood as choosing a coordinate system on the manifold of probability distributions. The Fisher information metric provides a natural Riemannian structure on this manifold, and geodesics correspond to exponential families.
The invariant measure m(X) defines a natural coordinate chart that makes entropy calculations coordinate-independent, analogous to working in canonical coordinates in differential geometry. For a family of distributions \{p_\theta\}, the Fisher metric is:

g_{ij}(\theta) = \mathbb{E}\!\left[ \frac{\partial \log p_\theta(X)}{\partial \theta_i} \frac{\partial \log p_\theta(X)}{\partial \theta_j} \right].

For location-scale families p_{\mu,\sigma}(x) = \frac{1}{\sigma}\, p\!\left( \frac{x - \mu}{\sigma} \right), the invariant measure removes the (\mu, \sigma) dependence, effectively projecting onto the “shape manifold” of distributions. Future work could explore whether there is a canonical connection between our invariant measure and the Fisher–Rao metric, potentially leading to a geometric interpretation of the LDDP as arc length on the shape manifold.
4.6. Limitations and Future Directions
1. Beyond affine transformations: Our current framework handles affine transformations Y = aX + b. Extending to general diffeomorphisms Y = g(X) would require incorporating the local Jacobian factor |g'(x)|, and the invariant measure would need to transform with this factor to preserve invariance. This generalization could connect to the theory of differential forms and volume elements in differential geometry.
2. Theoretical optimality of the median: While our empirical results demonstrate that the median of nearest-neighbor distances avoids negative entropy and satisfies all required invariance properties, a complete mathematical characterization remains to be established. The median is optimal as an \ell_1 center (minimizing \sum_i |d_i - c|), while the mean is optimal as an \ell_2 center (minimizing \sum_i (d_i - c)^2). Different centers induce different geometries on the distance distribution, each preserving the geometric invariance properties but potentially yielding different entropy values. Establishing the minimax optimality of the median among \ell_p centers for minimizing the probability of negative entropy across relevant distribution classes, and characterizing the tail conditions under which h_{\mathrm{inv}}(X) \geq 0 as n \to \infty, would provide a rigorous theoretical foundation. Such an analysis could parallel the development of M-estimators in robust statistics, potentially establishing conditions under which the \ell_1 geometry provides the natural reference scale for entropy measurement.
3. Connection to rate-distortion theory: Rate-distortion theory [12] involves minimizing the mutual information I(X; \hat{X}) subject to a distortion constraint \mathbb{E}[d(X, \hat{X})] \leq D. The invariant MI might provide new insights into scale-invariant coding schemes where the distortion metric itself adapts to the data scale. This could have applications in lossy compression where preservation of “shape” is more important than absolute accuracy.
4. Applications in causality: Invariance under interventions is central to causal inference. Pearl’s do-calculus and the invariance principle of Peters et al. both rely on identifying relationships that remain stable across environments. Our scale-invariant MI might help identify causal relationships that persist across different measurement scales or units, potentially improving causal discovery algorithms when variables are measured inconsistently across datasets.
5. Extensions to discrete–continuous mixtures: Many real-world datasets contain both discrete and continuous variables. The MI between discrete and continuous variables is well-defined, but estimating it is challenging [2]. The invariant measure framework could potentially be extended to mixed data types by defining appropriate reference measures for discrete components.
5. Conclusions
This work establishes a rigorous theoretical foundation for invariant entropy estimation by connecting Jaynes’s limiting density of discrete points to the Kullback–Leibler divergence framework. By defining the invariant measure m(X) as the median of nearest-neighbor distances and proving that the resulting entropy h_{\mathrm{inv}}(X) = h(X / m(X)) is truly scale-invariant, we provide the first practical method for computing Jaynes’s LDDP. The approach extends naturally to multivariate settings and demonstrates superior performance compared to standard methods, particularly avoiding catastrophic failures when variables have disparate scales.
The invariant entropy represents a logical evolution of Shannon’s information theory for continuous variables. Just as Shannon entropy for discrete variables is invariant to relabeling of outcomes, our invariant entropy for continuous variables is invariant to rescaling of measurement units. This makes it a more natural measure of uncertainty for physical quantities that can be measured in different units—temperature in Celsius versus Fahrenheit, distance in meters versus feet, concentration in molarity versus parts-per-million.
Beyond practical utility, the invariant entropy offers theoretical insights.
Table 2 reveals that the arcsine distribution, despite appearing spread over an interval, has the lowest invariant entropy due to boundary concentration effects visible at the nearest-neighbor scale. Conversely, heavy-tailed distributions like Cauchy and Lévy have high invariant entropy, reflecting fundamental unpredictability even at local scales. These observations suggest that invariant entropy captures the intrinsic properties of distribution families independent of parametrization.
By grounding this concept in the well-established KL divergence framework, we provide both theoretical justification and practical tools for scale-invariant information-theoretic analysis. The connection to information geometry suggests deeper links between invariant measures and the Fisher–Rao metric on distribution manifolds, opening avenues for future research.
We hope this work will enable new applications across diverse scientific fields where scale invariance is essential: feature selection in machine learning with mixed-unit data, network inference in systems biology where genes have vastly different expression scales, time series analysis comparing signals with different amplitudes, and causal discovery across heterogeneous datasets. Open-source implementations in Julia (EntropyInvariant.jl) and Python (entropy_invariant) make these methods readily accessible to the research community.