Entropy
  • Article
  • Open Access

12 December 2025

Measuring Statistical Dependence via Characteristic Function IPM

1 Neurotechnology, Laisvės av. 125A, 06118 Vilnius, Lithuania
2 Research Institute of Natural and Technological Sciences, Vytautas Magnus University, Universiteto Str. 10, Akademija, 53361 Kaunas, Lithuania
3 Institute of Data Science and Digital Technologies, Vilnius University, Akademijos Str. 4, 08412 Vilnius, Lithuania
* Author to whom correspondence should be addressed.

Abstract

We study statistical dependence in the frequency domain using the integral probability metric (IPM) framework. We propose the uniform Fourier dependence measure (UFDM) defined as the uniform norm of the difference between the joint and product-marginal characteristic functions. We provide a theoretical analysis, highlighting key properties, such as invariances, monotonicity in linear dimension reduction, and a concentration bound. For the estimation of the UFDM, we propose a gradient-based algorithm with singular value decomposition (SVD) warm-up and show that this warm-up is essential for stable performance. The empirical estimator of UFDM is differentiable, and it can be integrated into modern machine learning pipelines. In experiments with synthetic and real-world data, we compare UFDM with distance correlation (DCOR), Hilbert–Schmidt independence criterion (HSIC), and matrix-based Rényi’s α -entropy functional (MEF) in permutation-based statistical independence testing and supervised feature extraction. Independence test experiments showed the effectiveness of UFDM at detecting some sparse geometric dependencies in a diverse set of patterns that span different linear and nonlinear interactions, including copulas and geometric structures. In feature extraction experiments across 16 OpenML datasets, we conducted 160 pairwise comparisons: UFDM statistically significantly outperformed other baselines in 20 cases and was outperformed in 13.

1. Introduction

The estimation of statistical dependence plays an important role in various statistical and machine learning methods (e.g., hypothesis testing [1], feature selection and extraction [2,3], causal inference [4], self-supervised learning [5], representation learning [6], interpretation of neural models [7], among others). In recent years, various authors (e.g.,  [1,8,9,10,11,12,13,14]) have suggested different approaches to measuring statistical dependence.
In this paper, we focus on the estimation of statistical dependence using characteristic functions (CFs) and the integral probability metric (IPM) framework. We propose and investigate a novel IPM-based statistical dependence measure, defined as the uniform norm of the difference between the joint and product-marginal CFs. After introducing core concepts, we give a short review of previous work (Section 2). In Section 3, we formulate the proposed measure and its empirical estimator and perform their theoretical analysis. Section 4 is devoted to empirical investigation. Finally, in Section 5 we discuss results, limitations, and future work. Appendix A contains technical details, such as mathematical proofs and auxiliary tables. The main contributions of this paper are the following:
  • Theoretical and methodological contributions. We propose a new IPM-based statistical dependence measure (UFDM) and derive its properties. The main theoretical result of this paper is the structural characterisation of UFDM, which includes invariance under linear transformations and augmentation with independent noise, monotonicity under linear dimension reduction, vanishing under independence, and a concentration bound for its empirical estimator. We additionally propose a gradient-based estimation algorithm with an SVD warm-up to ensure numerical stability.
  • Empirical analysis. We conduct an empirical study demonstrating the practical effectiveness of UFDM in permutation-based independence testing across diverse linear, nonlinear, and geometrically structured patterns, as well as in supervised feature-extraction tasks on real datasets.
In addition, we provide the accompanying code repository https://github.com/povidanius/UFDM (accessed on 4 December 2025).

1.1. IPM Framework

In the context of estimation of statistical dependence, the IPM is a class of metrics between two probability distributions $P_{X,Y}$ and $P_X P_Y$, defined for a function class $\mathcal{F}$:
$$\mathrm{IPM}(P_{X,Y}, P_X P_Y \mid \mathcal{F}) = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_U f(U) - \mathbb{E}_V f(V) \right|,$$
where $U \sim P_{X,Y}$ and $V \sim P_X P_Y$ [15].

1.2. Characteristic Functions

Let $X \in \mathbb{R}^{d_X}$, $Y \in \mathbb{R}^{d_Y}$, and $(X^T, Y^T)^T \in \mathbb{R}^{d_X + d_Y}$ be random vectors defined on a common probability space $(\Omega, \mathcal{F}, P)$. Let us recall that their characteristic functions are given by
$$\phi(\alpha) := \mathbb{E}_X e^{i\alpha^T X}, \qquad \phi(\beta) := \mathbb{E}_Y e^{i\beta^T Y}, \qquad \text{and} \qquad \phi(\alpha, \beta) := \mathbb{E}_{X,Y} e^{i(\alpha^T X + \beta^T Y)},$$
where $i^2 = -1$, $\alpha \in \mathbb{R}^{d_X}$, and $\beta \in \mathbb{R}^{d_Y}$. Having $n$ i.i.d. realisations $(x_i, y_i)_{i=1}^n$, the corresponding empirical characteristic functions (ECFs) are given by
$$\phi_n(\alpha) := \frac{1}{n}\sum_{j=1}^n e^{i\alpha^T x_j}, \qquad \phi_n(\beta) := \frac{1}{n}\sum_{j=1}^n e^{i\beta^T y_j}, \qquad \text{and} \qquad \phi_n(\alpha, \beta) := \frac{1}{n}\sum_{j=1}^n e^{i(\alpha^T x_j + \beta^T y_j)}.$$
The uniqueness theorem states that X and Y have the same distribution if and only if their CFs are identical [16]. Therefore, CFs can be considered a description of a distribution. Alternatively, a CF $\phi$ can be represented as a real vector $(\Re\phi, \Im\phi) \in \mathbb{R}^2$, where $\Re$ and $\Im$ denote the real and imaginary components [17]. This viewpoint avoids explicit reliance on the imaginary unit $i$ and makes the geometric structure of CFs more transparent.
For convenience, let us define $\gamma = (\alpha^T, \beta^T)^T$, $\psi(\gamma) = \phi(\alpha)\phi(\beta)$, and let $\psi_n(\gamma) = \phi_n(\alpha)\phi_n(\beta)$ be its empirical counterpart. In our study, we utilise the IPM framework to investigate statistical dependence via
$$\Delta(\gamma) = \phi(\gamma) - \psi(\gamma)$$
and its empirical counterpart
$$\Delta_n(\gamma) = \phi_n(\gamma) - \psi_n(\gamma).$$
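The following minimal NumPy sketch (our illustration, not the authors' reference implementation; the function names `ecf` and `delta_n` are ours) shows how the ECFs and the empirical discrepancy $\Delta_n(\gamma)$ can be evaluated on a set of frequency vectors.

```python
import numpy as np

def ecf(data, freqs):
    """Empirical CF: data is (n, d), freqs is (m, d); returns an (m,) complex array."""
    return np.exp(1j * freqs @ data.T).mean(axis=1)

def delta_n(x, y, alpha, beta):
    """Delta_n(gamma) = phi_n(alpha, beta) - phi_n(alpha) * phi_n(beta)."""
    joint = np.hstack([x, y])                 # (n, d_X + d_Y)
    gamma = np.hstack([alpha, beta])          # (m, d_X + d_Y)
    return ecf(joint, gamma) - ecf(x, alpha) * ecf(y, beta)

# Example: for independent X and Y, |Delta_n| stays close to zero.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(1000, 3)), rng.normal(size=(1000, 2))
alpha, beta = rng.normal(size=(5, 3)), rng.normal(size=(5, 2))
print(np.abs(delta_n(x, y, alpha, beta)))
```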

2. Previous Work

Various theoretical instruments have been employed for statistical dependence estimation: weighted $L_2$ spaces and CFs (e.g., distance correlation [13]), reproducing kernel Hilbert spaces (RKHS) (HSIC [1], DIME [18]), information theory (mutual information [19] and generalisations such as MEF [20,21]), and copula theory [10,22], among others. Since our work is rooted in the CF-based line of research and the IPM framework, and is empirically evaluated on independence testing and representation learning, we focus on DCOR, HSIC, and MEF: these three measures form a compact set of high-performing baselines that span CF-based, IPM-based, and information-theoretic approaches and are widely used in representation learning tasks. (A minimal numerical sketch of the empirical HSIC and MEF estimators is given after the list below.)
  • Distance correlation. DCOR [13] is defined as
$$\mathrm{DCOR}(X, Y) = \frac{\mathrm{DCOV}(X, Y)}{\sqrt{\mathrm{DCOV}(X, X)\,\mathrm{DCOV}(Y, Y)}},$$
where the distance covariance (DCOV) is given by
$$\mathrm{DCOV}^2(X, Y) = \int_{\mathbb{R}^{d_X + d_Y}} |\Delta(\gamma)|^2\, w(\gamma)\, d\gamma,$$
with weighting function $w(\gamma) = w(\alpha, \beta) = \left(c_{d_X} c_{d_Y} \|\alpha\|^{1+d_X}\|\beta\|^{1+d_Y}\right)^{-1}$, where $c_{d_X} = \pi^{(1+d_X)/2}/\Gamma((1+d_X)/2)$, $c_{d_Y} = \pi^{(1+d_Y)/2}/\Gamma((1+d_Y)/2)$, and $\Gamma(\cdot)$ is the gamma function. This weighting function allows one to avoid the direct estimation of the integral, expressing it in terms of the covariance of distances between data points [13]. The later result of [23] generalises distance correlation to multiple random vectors. Given the i.i.d. sample pairs $(x_i, y_i)$, $i = 1, \dots, n$, the empirical unbiased estimator of the squared distance covariance [24] is defined as
$$\mathrm{DCOV}_n^2(X, Y) = \frac{1}{n(n-3)} \sum_{i \neq j} A_{ij} B_{ij},$$
where the matrices $A = (A_{ij})$, $B = (B_{ij})$ are given by
$$A_{ij} = a_{ij} - \frac{1}{n-2}\sum_{k=1}^n a_{ik} - \frac{1}{n-2}\sum_{k=1}^n a_{kj} + \frac{1}{(n-1)(n-2)}\sum_{k,\ell=1}^n a_{k\ell},$$
with Euclidean distances $a_{ij} = \|x_i - x_j\|$. The matrix $B$ is defined analogously using the distances $b_{ij} = \|y_i - y_j\|$. The empirical DCOR is then obtained as follows:
$$\mathrm{DCOR}_n(X, Y) = \frac{\mathrm{DCOV}_n(X, Y)}{\sqrt{\mathrm{DCOV}_n(X, X)\,\mathrm{DCOV}_n(Y, Y)}}.$$
Note that the biased version of the empirical distance-based estimator in Equation (7) is equivalent to the ECF-based estimator of Equation (6) (Theorem 1, [13]). While consistency is established for the biased estimator under the moment condition $\mathbb{E}(\|X\| + \|Y\|) < \infty$ (Theorem 2, [13]), the unbiased estimator in Equation (7) differs only by a finite-sample correction and converges to the same population quantity, Equation (6) [24], implying consistency under the same moment condition.
  • HSIC. For reproducing kernel Hilbert spaces (RKHS) $\mathcal{F}$ and $\mathcal{G}$ with kernels $k$ and $l$, it is defined as
$$\mathrm{HSIC}(X, Y) = \left\| \mathbb{E}_{XY}\left[ k(X, \cdot) \otimes l(Y, \cdot) \right] - \mathbb{E}_X k(X, \cdot) \otimes \mathbb{E}_Y l(Y, \cdot) \right\|_{\mathrm{HS}}^2,$$
where $\|\cdot\|_{\mathrm{HS}}$ denotes the Hilbert–Schmidt norm and $\otimes$ is the tensor product [1]. Taking the product kernel $\kappa((x, y), (x', y')) = k(x, x')\,l(y, y')$, HSIC is equal to the squared maximum mean discrepancy, which is an instance of an IPM with function class $\mathcal{F} = \{f : \|f\|_{\mathcal{H}_\kappa} \le 1\}$, where $\mathcal{H}_\kappa$ is the RKHS generated by $\kappa$ [25]. Having a sample of $n$ paired i.i.d. observations, the empirical estimator is
$$\mathrm{HSIC}_n(X, Y) = \frac{1}{(n-1)^2}\operatorname{tr}(KHLH)$$
with kernel matrices $K_{ij} = k(x_i, x_j)$, $L_{ij} = l(y_i, y_j)$, and centering matrix $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$. When both kernels $k$ and $l$ are translation-invariant (i.e., $k(x, x') = k_0(x - x')$ on $\mathbb{R}^{d_X}$ and $l(y, y') = l_0(y - y')$ on $\mathbb{R}^{d_Y}$, with $k_0, l_0$ positive definite functions such as the Gaussian $k_0(v) = \exp(-\|v\|^2/(2\sigma^2))$ with $\sigma > 0$), the product kernel $\kappa((x, y), (x', y')) = k(x, x')\,l(y, y') = k_0(x - x')\,l_0(y - y')$ is also translation-invariant on $\mathbb{R}^{d_X + d_Y}$. In this case, $\kappa(u, v) = \kappa_0(u - v)$ for some positive definite function $\kappa_0$ on $\mathbb{R}^{d_X + d_Y}$, and HSIC can be expressed in the frequency domain as
$$\mathrm{HSIC}(X, Y) = \int_{\mathbb{R}^{d_X + d_Y}} |\Delta(\gamma)|^2\, \mathcal{F}^{-1}\kappa_0(\gamma)\, d\gamma,$$
where $\gamma = (\alpha^T, \beta^T)^T$ and $\mathcal{F}^{-1}\kappa_0$ denotes the inverse Fourier transform of $\kappa_0$. Therefore, for translation-invariant kernels, HSIC is structurally analogous to distance covariance, since it also corresponds to a weighted squared $L_2$ norm of $\Delta$ (Equation (4)), with the weighting determined by $\kappa$.
  • MEF. Shannon mutual information is defined by $\mathrm{MI}(X, Y) = \mathbb{E}_{X,Y}\log\frac{p(X, Y)}{p(X)p(Y)}$ [19]. The neural estimation of mutual information (MINE, [26]) uses its variational (Donsker–Varadhan) representation $\mathrm{MI}(X, Y) \ge \max_\theta \mathbb{E}_{X,Y} f(x, y \mid \theta) - \log\left(\mathbb{E}_X\mathbb{E}_Y e^{f(x, y \mid \theta)}\right)$, since it allows avoiding density estimation (here $f(x, y \mid \theta)$ is a neural network with parameters $\theta$). In this case, the optimisation is performed over the space of neural network parameters, which often leads to unstable training and biased estimates due to the unboundedness of the objective and the difficulty of balancing the exponential term. The matrix-based Rényi's $\alpha$-order entropy functional (MEF) [20,21,27] provides a kernel version of mutual information that avoids both density estimation and neural optimisation. For random variables $X$ and $Y$ with distributions $P_X$, $P_Y$, and $P_{XY}$, it is defined as
$$\mathrm{MEF}_\alpha(X, Y) = S_\alpha(P_X) + S_\alpha(P_Y) - S_\alpha(P_{XY}),$$
where $S_\alpha(P_X) = \frac{1}{1-\alpha}\log_2\left(\operatorname{tr}(T_X^\alpha)\right)$ and $T_X$ is the normalised kernel integral operator on $L_2(P_X)$ [27]. Given i.i.d. samples $\{(x_i, y_i)\}_{i=1}^n$ with Gram matrices $K_{ij} = k(x_i, x_j)$ and $L_{ij} = l(y_i, y_j)$, the empirical estimator is
$$\mathrm{MEF}_{\alpha, n}(X, Y) = S_{\alpha, n}\!\left(\frac{K}{\operatorname{tr}(K)}\right) + S_{\alpha, n}\!\left(\frac{L}{\operatorname{tr}(L)}\right) - S_{\alpha, n}\!\left(\frac{K \odot L}{\operatorname{tr}(K \odot L)}\right),$$
where $\odot$ denotes the element-wise product, $S_{\alpha, n}(A) = \frac{1}{1-\alpha}\log_2\sum_i \lambda_i(A)^\alpha$, and $\lambda_i$ are the eigenvalues of the $n \times n$ matrix $A$.
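As referenced above, the following Python sketch illustrates, under the assumption of Gaussian kernels with a median-heuristic bandwidth variant, how the empirical HSIC and MEF estimators quoted in this list can be computed; it is our own simplified rendering, not the implementation used in the experiments.

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_gram(x, bandwidth=None):
    """Gaussian Gram matrix; bandwidth defaults to a median-heuristic variant."""
    d2 = cdist(x, x, "sqeuclidean")
    if bandwidth is None:
        bandwidth = np.sqrt(0.5 * np.median(d2[d2 > 0]))
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def hsic_n(x, y):
    """Empirical HSIC: tr(KHLH) / (n - 1)^2 with Gaussian kernels."""
    n = x.shape[0]
    K, L = gaussian_gram(x), gaussian_gram(y)
    H = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def renyi_entropy(gram, alpha=1.01):
    """Matrix-based Renyi alpha-entropy of a trace-normalised Gram matrix."""
    A = gram / np.trace(gram)
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)  # eigenvalues, clipped at 0
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def mef_n(x, y, alpha=1.01):
    """Empirical MEF: S(K) + S(L) - S(K elementwise-product L)."""
    K, L = gaussian_gram(x), gaussian_gram(y)
    return renyi_entropy(K, alpha) + renyi_entropy(L, alpha) - renyi_entropy(K * L, alpha)
```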

Motivation

The motivation of our work stems from the theoretical observation that applying the $L_\infty$ norm to $\Delta$ (Equation (4)) yields a novel, structurally simple IPM with some advantageous properties, such as the ability to detect arbitrary statistical dependencies, invariance under full-rank linear transformations and coordinate augmentation with independent noise, and monotonicity under linear dimension reduction (Theorem 1).
Since the $L_\infty$ norm isolates the most informative frequencies, where dependence concentrates, we hypothesise that its empirical estimator could extract important structure from $\Delta$ that may be diluted by weighted $L_2$ or other global approaches such as DCOV, HSIC, and MEF.

3. Proposed Measure

Given two random vectors $X$ and $Y$ of dimensions $d_X$ and $d_Y$, and assuming a possibly unknown joint distribution $P_{X,Y}$, we define our measure via the IPM with function class $\mathcal{F} = \{f : f(z) = e^{i\gamma^T z};\ \gamma, z \in \mathbb{R}^{d_X + d_Y},\ i^2 = -1\}$, which corresponds to the following.
Definition 1.
Uniform Fourier Dependence Measure.
$$\mathrm{UFDM}(X, Y) = \|\Delta\|_{L_\infty} = \sup_\gamma |\Delta(\gamma)|.$$
Since a CF is the Fourier transform of a probability distribution, and the norm in $L_\infty$ is called the uniform norm, we refer to it as the Uniform Fourier Dependence Measure (UFDM).
Theorem 1.
UFDM has the following properties:
  • $0 \le \mathrm{UFDM}(X, Y) \le 1$.
  • $\mathrm{UFDM}(X, Y) = \mathrm{UFDM}(Y, X)$.
  • $\mathrm{UFDM}(X, Y) = 0$ if and only if $X \perp Y$ ($\perp$ denotes statistical independence).
  • For Gaussian random vectors $X \sim \mathcal{N}(0, \Sigma_X)$, $Y \sim \mathcal{N}(0, \Sigma_Y)$ with cross-covariance matrix $\Sigma_{X,Y}$, we have $\mathrm{UFDM}(X, Y) = \sup_{\alpha, \beta} e^{-\frac{1}{2}(\alpha^T\Sigma_X\alpha + \beta^T\Sigma_Y\beta)}\left|e^{-\alpha^T\Sigma_{X,Y}\beta} - 1\right|$.
  • Invariance under full-rank linear transformation: $\mathrm{UFDM}(AX + a, BY + b) = \mathrm{UFDM}(X, Y)$ for any full-rank matrices $A \in \mathbb{R}^{d_X \times d_X}$, $B \in \mathbb{R}^{d_Y \times d_Y}$ and vectors $a \in \mathbb{R}^{d_X}$, $b \in \mathbb{R}^{d_Y}$.
  • Linear dimension reduction does not increase $\mathrm{UFDM}(X, Y)$.
  • If $X \perp E$ and $E$ has a density, then for any continuous function $f : \mathbb{R}^{d_X} \to \mathbb{R}^{d_Y}$, $\lim_{\lambda \to \infty}\mathrm{UFDM}(X, f(X) + \lambda E) = 0$.
  • If $X$ and $Y$ have densities, then $\mathrm{UFDM}(X, Y) \le \min\{1, \sqrt{2\,\mathrm{MI}(X, Y)}\}$, where $\mathrm{MI}(X, Y)$ is mutual information.
  • Invariance to augmentation with independent noise: let $X, Y, Z$ be random vectors such that $Z \perp (X, Y)$. Then $\mathrm{UFDM}((X, Z), Y) = \mathrm{UFDM}(X, Y)$.
Proof. 
See Appendix A.1.    □
  • Interpretation of UFDM via canonical correlation analysis (CCA). In the Gaussian case, the UFDM objective reduces analytically to CCA via a closed-form expression (Theorem 1, Property 4): after whitening (setting $u = \Sigma_X^{1/2}\alpha$ and $v = \Sigma_Y^{1/2}\beta$), it becomes $\max_{u,v} e^{-\frac{1}{2}(\|u\|^2 + \|v\|^2)}\left(1 - e^{-u^T K v}\right)$, where $K = \Sigma_X^{-1/2}\Sigma_{XY}\Sigma_Y^{-1/2}$. By von Neumann's inequality, the maximisers $(u, v)$ align with the leading singular vectors of $K$, corresponding to the top CCA pair. Note that since Gaussian independence is equivalent to the vanishing of the leading canonical correlation $\rho_1$ (as all remaining correlations $0 \le \rho_j \le \rho_1$, $j > 1$, must then also vanish), UFDM's focus on the leading canonical correlation entails no loss of discriminatory power (a numerical illustration of this closed form is given after this list).
  • Interpretation of UFDM via cumulants. Let us recall that $\gamma = (\alpha^T, \beta^T)^T$, $\phi(\gamma) = \phi(\alpha, \beta)$, and $\psi(\gamma) = \phi(\alpha)\phi(\beta)$. For general distributions, writing $\Delta(\gamma) = \psi(\gamma)\left(\exp(C(\gamma)) - 1\right)$ offers a cumulant-series factorisation, with $C(\gamma) = \log\frac{\phi(\gamma)}{\psi(\gamma)} = \sum_{p,q \ge 1}\frac{i^{p+q}}{p!\,q!}\left\langle \kappa_{p,q},\ \alpha^{\otimes p} \otimes \beta^{\otimes q}\right\rangle$, where $\kappa_{p,q}$ are cross-cumulants and $\alpha^{\otimes p} \otimes \beta^{\otimes q}$ are the $(p+q)$-order tensors formed by the tensor product of $p$ copies of $\alpha$ and $q$ copies of $\beta$. The leading term, corresponding to $p = q = 1$, is $\frac{i^2}{1!\,1!}\left\langle\kappa_{1,1},\ \alpha \otimes \beta\right\rangle = -\alpha^T\Sigma_{XY}\beta$ (with $\kappa_{1,1} = \Sigma_{XY}$ for centered variables), which aligns with the CCA interpretation, while higher-order $\kappa_{p,q}$ terms capture non-Gaussian deviations, interpreting UFDM as a frequency-domain approach that aligns $(\alpha, \beta)$ with cross-cumulant directions under marginal damping by $\psi(\gamma)$.
  • Remark on the representations of CFs. Since $\mathrm{UFDM}(X, Y) = \sup_\gamma \left\|\left(\Re\Delta(\gamma), \Im\Delta(\gamma)\right)\right\|_2$, the UFDM objective naturally operates on the real two-dimensional vector formed by the real and imaginary parts of $\Delta(\gamma)$. This aligns with recent work on real-vector representations of characteristic functions [17] and shows that UFDM does not rely on any special algebraic role of the imaginary unit.
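As a numerical illustration (our own check, not taken from the paper's code), the following sketch compares the Gaussian closed form of Theorem 1 (Property 4) for a bivariate Gaussian with correlation $\rho = 0.6$ against a grid maximisation of the empirical $|\Delta_n|$; the two maxima should agree up to grid and sampling error.

```python
import numpy as np

rho, n = 0.6, 50_000
rng = np.random.default_rng(1)
xy = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)

grid = np.linspace(-3.0, 3.0, 121)
A, B = np.meshgrid(grid, grid, indexing="ij")        # (alpha, beta) grid

# Closed form: sup exp(-(a^2 + b^2)/2) * |exp(-a*b*rho) - 1|
closed = np.exp(-0.5 * (A**2 + B**2)) * np.abs(np.exp(-A * B * rho) - 1.0)

# Empirical counterpart via ECFs evaluated on the same grid.
Ea = np.exp(1j * np.outer(grid, xy[:, 0]))           # (121, n)
Eb = np.exp(1j * np.outer(grid, xy[:, 1]))
phi_joint = (Ea @ Eb.T) / n                          # phi_n(alpha, beta)
emp = np.abs(phi_joint - np.outer(Ea.mean(1), Eb.mean(1)))

print(closed.max(), emp.max())                       # both close to 0.33 for rho = 0.6
```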

3.1. Estimation

Having i.i.d. observations $(X^n, Y^n) = (x_j, y_j) \sim P_{X,Y}$, $j = 1, 2, \dots, n$, we define and discuss empirical estimators of UFDM. Recall (Section 1.2) that $\gamma = (\alpha^T, \beta^T)^T$, and let $\phi(\alpha)$, $\phi(\beta)$, and $\phi(\gamma)$ be the CFs of $X$, $Y$, and $(X, Y)$, respectively ($\alpha \in \mathbb{R}^{d_X}$, $\beta \in \mathbb{R}^{d_Y}$, and $\gamma \in \mathbb{R}^{d_X + d_Y}$). Let us also denote the norms $\|f\|_{L_\infty^t} = \sup_{\|\tau\| < t}|f(\tau)|$ and $\|f\|_{L_\infty} = \sup_\tau |f(\tau)|$, for $t > 0$ and multivariate $\tau$.
  • Empirical estimator. Let us define the empirical estimator of UFDM for a fixed $t > 0$:
$$\mathrm{UFDM}_n^t(X^n, Y^n) = \|\Delta_n\|_{L_\infty^t}.$$

3.2. Estimator Convergence

The ECF is a uniformly consistent estimator of the CF on each bounded subset (i.e., $\lim_{n \to \infty}\sup_{\|\gamma\| < t}|\phi(\gamma) - \phi_n(\gamma)| = 0$ almost surely for any fixed $t > 0$) [28]. By the triangle inequality, this implies the following:
Proposition 1.
For a fixed $t > 0$, $\lim_{n \to \infty}\|\Delta_n - \Delta\|_{L_\infty^t} = 0$ almost surely.
Theorem 2
([29]). If $t_n \to \infty$ and $\frac{\log t_n}{n} \to 0$ as $n \to \infty$, then $\lim_{n \to \infty}\sup_{\|\gamma\| < t_n}|\xi(\gamma) - \xi_n(\gamma)| = 0$ almost surely for any CF $\xi(\gamma)$ and corresponding ECF $\xi_n(\gamma)$.
  • This implies the convergence of the empirical estimator in Equation (12):
Proposition 2.
If $t_n \to \infty$ and $\frac{\log t_n}{n} \to 0$ as $n \to \infty$, then $\lim_{n \to \infty}\|\Delta_n\|_{L_\infty^{t_n}} = \mathrm{UFDM}(X, Y)$ almost surely.
Proof. 
See Appendix A.1.    □
Note that the ECF does not converge to the CF uniformly on the entire space [28,29]. Therefore, to ensure the convergence of the empirical estimator of UFDM, we need to bound the norm over slowly growing balls as in Theorem 2 (for example, $t_n = n$ satisfies the growth condition, since $\log n / n \to 0$). The finite-sample analysis of the convergence of the empirical UFDM (Equation (12)) to its truncated population counterpart ($\mathrm{UFDM}^t(X, Y) = \|\Delta\|_{L_\infty^t}$) yields the following concentration inequality.
Theorem 3.
Let us assume that $\mathbb{E}\|X\|^2 < \infty$ and $\mathbb{E}\|Y\|^2 < \infty$. Let us define $d = d_X + d_Y$, $Z = (X^T, Y^T)^T$, and $W = \|X\| + \|Y\| + \|Z\|$. Then there exists a constant $C$ such that for every fixed $\varepsilon > \frac{1}{n}$ and $t > 0$:
$$\Pr\left(\left|\mathrm{UFDM}_n^t(X^n, Y^n) - \mathrm{UFDM}^t(X, Y)\right| > \varepsilon\right) \le 2\left(\frac{Ct}{\varepsilon}\right)^d \exp\left(-\frac{n}{18}\left(\frac{\varepsilon}{2} - \frac{1}{n}\right)^2\right) + \frac{\sigma^2}{nL^2},$$
where $L = \mathbb{E}W$ and $\sigma^2 = \mathbb{E}(W - L)^2$.
Proof. 
See Appendix A.2.    □

3.3. Estimator Computation

In practice, UFDM can be estimated iteratively using Algorithm 1. Since the result depends on the initial parameters $\alpha$ and $\beta$, the complementary Algorithm 2 provides their data-driven initialisation. In our experience with UFDM applications, Algorithm 2 is essential: without it we often encountered stability issues and initially had to rely on various heuristics, such as normalising the parameters to the unit sphere. In our view, this is because $\Delta_n$ is a highly nonlinear optimisation surface (especially in larger dimensions), which complicates locating the corresponding maxima.
Algorithm 1 UFDM estimation
Require: Number of iterations $N$, batch size $n_b$, initial $\alpha \in \mathbb{R}^{d_X}$, $\beta \in \mathbb{R}^{d_Y}$.
    for $iteration = 1$ to $N$ do
          Sample batch $(X^{n_b}, Y^{n_b}) = (x_i, y_i)_{i=1}^{n_b}$.
          Standardise $(X^{n_b}, Y^{n_b})$ to zero mean and unit variance.
          $\alpha, \beta \leftarrow \mathrm{AdamW}\left([\alpha, \beta],\ |\Delta_{n_b}(\alpha, \beta)|\right)$.
    end for
    return $|\Delta_{n_b}(\alpha, \beta)|, \alpha, \beta$
Algorithm 2 SVD warm-up
Require: Batch size $n_b$.
    Sample batch $(X^{n_b}, Y^{n_b}) = (x_i, y_i)_{i=1}^{n_b}$.
    Compute cross-covariance $C = (X^{n_b})^T Y^{n_b} / n_b$.
    Decompose: $[U, \Sigma, V^H] = \mathrm{SVD}(C)$.
    $\alpha \leftarrow U_{:,1}$, $\beta \leftarrow V^H_{1,:}$.
    return $\alpha, \beta$
The computational complexity of Algorithm 2 consists of the cross-covariance computation and its SVD, giving a complexity of $O\left(n_b d_X d_Y + d_X d_Y \min(d_X, d_Y)\right)$. Given the initialisation of $\alpha$ and $\beta$, the complexity of Algorithm 1 is $O\left(N n_b (d_X + d_Y)\right)$. Hence, the total computational complexity of the sequential application of Algorithm 2 and Algorithm 1 is $O\left(n_b d_X d_Y + d_X d_Y \min(d_X, d_Y) + N n_b (d_X + d_Y)\right)$. Finally, having the optimal $\alpha^*$ and $\beta^*$ computed by Algorithm 1, the evaluation of the empirical UFDM has computational complexity linear in the sample size.
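A minimal PyTorch sketch of Algorithms 1 and 2 is given below (our paraphrase under simplifying assumptions: full-batch optimisation, standardisation before the warm-up, and no explicit truncation of $\|\gamma\|$; the reference implementation is available in the accompanying repository).

```python
import torch

def svd_warmup(x, y):
    """Algorithm 2 (sketch): initialise (alpha, beta) from the leading
    singular pair of the empirical cross-covariance."""
    xc, yc = x - x.mean(0), y - y.mean(0)
    C = xc.T @ yc / x.shape[0]                       # (d_X, d_Y)
    U, S, Vh = torch.linalg.svd(C)
    return U[:, 0].clone(), Vh[0, :].clone()

def abs_delta_n(x, y, alpha, beta):
    """|Delta_n(alpha, beta)| written via real/imaginary parts, so the
    objective stays real-valued and differentiable."""
    px, py, pj = x @ alpha, y @ beta, x @ alpha + y @ beta
    re_j, im_j = torch.cos(pj).mean(), torch.sin(pj).mean()
    re_x, im_x = torch.cos(px).mean(), torch.sin(px).mean()
    re_y, im_y = torch.cos(py).mean(), torch.sin(py).mean()
    re_p = re_x * re_y - im_x * im_y                 # Re(phi_n(alpha) phi_n(beta))
    im_p = re_x * im_y + im_x * re_y                 # Im(phi_n(alpha) phi_n(beta))
    return torch.sqrt((re_j - re_p) ** 2 + (im_j - im_p) ** 2 + 1e-12)

def estimate_ufdm(x, y, n_iter=100, lr=0.025):
    """Algorithm 1 (sketch): AdamW ascent on |Delta_n| after the SVD warm-up."""
    x = (x - x.mean(0)) / x.std(0).clamp_min(1e-8)   # standardise
    y = (y - y.mean(0)) / y.std(0).clamp_min(1e-8)
    a0, b0 = svd_warmup(x, y)
    alpha, beta = a0.requires_grad_(True), b0.requires_grad_(True)
    opt = torch.optim.AdamW([alpha, beta], lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        (-abs_delta_n(x, y, alpha, beta)).backward() # maximise |Delta_n|
        opt.step()
    return abs_delta_n(x, y, alpha, beta).item(), alpha.detach(), beta.detach()
```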

4. Experiments

For UFDM, we used the SVD warm-up (Algorithm 2) for parameter initialisation and fixed the truncation parameter $t$ to 25.0. For the kernel measures, HSIC and MEF, we used Gaussian kernels for both X and Y, with the bandwidth selected using the median heuristic [30]. For the MEF measure, $\alpha$ was set to 1.01, as in [21].

4.1. Permutation Tests

  • Permutation tests with UFDM. We compared UFDM, DCOR, HSIC, and MEF in permutation-based statistical independence testing ($H_0 : X \perp Y$ versus the alternative $H_1 : X \not\perp Y$) using a set of multivariate distributions. We investigated scenarios with a sample size of $n = 750$ and data dimensions $d \in \{5, 15, 25\}$ ($d_X = d_Y = d$). To ensure valid finite-sample calibration, permutation p-values were computed with the Phipson–Smyth correction [31] (a minimal sketch of this p-value computation is given after this list).
  • Hyperparameters. We used 500 permutations per p-value. The number of iterations in UFDM estimation Algorithm 1 was set to 100. The batch size equaled the sample size ( n = 750 ). We used a learning rate of 0.025 . Due to the high computation time (permutation tests took 6.3 days on five machines with Intel i7 CPU, 16GB of RAM, and Nvidia GeForce RTX 2060 12 GB GPU), we relied on 500 p-values for each test in the H 0 scenario and on 100 p-values for each test in the H 1  scenario.
  • Distributions analysed. In the H 0 case, X was sampled from multivariate uniform, Gaussian, and Student t ( 3 ) distributions (corresponding to no-tail, light-tail, and heavy-tail scenarios, respectively), and Y was independently sampled from the same set of distributions. Afterwards, we examined the uniformity of the p-values obtained from permutation tests using different statistical measures, through QQ-plots and Kolmogorov–Smirnov (KS) tests.
In the H 1 case, X and Y were related through statistical dependencies described in Table A2. These dependencies include structured dependence patterns, where X was sampled from the same set of distributions (multivariate uniform, Gaussian, and Student t ( 3 ) ), and Y was generated as Y = f ( X ) + 0.1 ϵ , with  ϵ denoting additive Gaussian noise independent of X. We also examined more complex dependencies (Table A2), where the relationship between X and Y was modeled using copulas, bimodal, circular, and other nonlinear patterns. Using this setup, we evaluated the empirical power of the permutation tests based on the same collection of statistical measures.
  • Results for H 0 . As shown in Figure 1, UFDM, DCOR, HSIC, and MEF exhibited approximately uniform permutation p-values across all distribution pairs and dimensions, with empirical false rejection rates (FRR) remaining close to the nominal 0.05 level. Isolated low KS p-values below 0.05 occurred in only two cases: one for MEF in the Gaussian/Gaussian pair at dimension 5 (p-value of 0.01 ) and one for UFDM in the Gaussian/Student-t pair at dimension 5 (p-value of 0.03 ), suggesting minor sampling variability rather than systematic deviations from uniformity. These results show that UFDM remained comparably stable to DCOR, HSIC and MEF, in terms of type-I error control under H 0 .
    Figure 1. Empirical QQ-plots of p-values under H 0 . The dashed vertical line corresponds to the nominal significance level 0.05 . The empirical FRR and its Wilson confidence interval, p-values of KS test are reported in the legend.
  • Results for H 1 . The empirical power and its 0.95 -Wilson confidence intervals (CIs) are presented in Table 1 and Table 2. These results show that, in most cases, the empirical power of UFDM, DCOR, HSIC, and MEF was approximately equal to 1.00 . However, Table 2 also reveals that for the sparse Circular and Interleaved Moons patterns ( d 15 ), MEF exhibited a noticeable decrease in empirical power. We conjecture that this reduction may stem from MEF’s comparatively higher sensitivity to kernel bandwidth selection in these specific, geometrically structured patterns. On the other hand, UFDM’s robustness in these settings may also be explained by its invariance to augmentation with independent noise (Theorem 1, Property 9), which helps to preserve the detectability of sparse geometric dependencies embedded within high-dimensional noise coordinates.
    Table 1. Empirical power and Wilson CIs for the dependent data (structured dependence patterns) at α = 0.05 .
    Table 2. Empirical power with 95% Wilson confidence intervals for dependent data (complex dependence patterns) at α = 0.05 .
  • Ablation experiment. The necessity of the SVD warm-up (Algorithm 2) is empirically demonstrated in Table A1, where the p-values obtained without SVD warm-up systematically fail to reveal dependence in many nonlinear patterns.
  • Remark on the stability of the estimator. Since the UFDM objective is non-convex, different random initialisations may potentially lead to distinct local optima. To assess the impact of this issue, we investigated the numerical stability of the UFDM estimator. We computed the mean and standard deviation of the statistic across 50 independent runs for each distribution pattern and dimension (Table 1 and Table 2), as well as for the corresponding permuted patterns in which dependence is destroyed, as reported in Table 3. The obtained results align with the permutation test findings. While a slight upward shift is observed under independent (permuted) data, the proposed estimator retained consistent separation between dependent and independent settings and exhibited stable behaviour across random restarts.
    Table 3. UFDM statistic (mean ± std) under true dependence/permuted independence.
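As referenced in the first item of this list, the permutation test can be sketched as follows (our illustration; `statistic` stands for any dependence estimator, e.g., the UFDM routine sketched in Section 3.3). The Phipson–Smyth correction corresponds to the $(b + 1)/(m + 1)$ form of the p-value.

```python
import numpy as np

def permutation_pvalue(x, y, statistic, n_perm=500, rng=None):
    """Permutation p-value with the Phipson-Smyth (b + 1) / (m + 1) correction."""
    rng = rng if rng is not None else np.random.default_rng()
    observed = statistic(x, y)
    exceed = 0
    for _ in range(n_perm):
        y_perm = y[rng.permutation(len(y))]          # break the (x, y) pairing
        exceed += statistic(x, y_perm) >= observed
    return (exceed + 1) / (n_perm + 1)

# Reject H0 (independence) at level 0.05 if the returned p-value is below 0.05.
```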

4.2. Supervised Feature Extraction

Feature construction is often a key initial step in machine learning with tabular data. These methods can be roughly classified into feature selection and feature extraction. Feature selection identifies a subset of relevant inputs, either incrementally (e.g., via univariate filters) or through other strategies, whereas feature extraction transforms inputs into lower-dimensional, informative representations. In our experiments, we used the latter approach because of its computational effectiveness. The total computational time for these experiments was 94.3 h on a single machine with an Intel i7 CPU, 16 GB of RAM, and an Nvidia GeForce RTX 2060 12 GB GPU.
Let $(x_i, y_i)_{i=1}^n$ be a classification dataset consisting of $n$ pairs of $d_X$-dimensional inputs $x_i$ and $d_Y$-dimensional one-hot encoded outputs $y_i$. In our experiments, we used a collection of OpenML classification datasets [32], which cover different domains and input and output dimensionalities. We randomly split the data into training, validation, and test sets using the proportions $(0.5, 0.1, 0.4)$, respectively. We followed the dependence maximisation scheme (e.g., [3,33]) by seeking
$$W^* = \arg\max_W\ \mathrm{DEP}(Wx, y) - \lambda\operatorname{tr}\left((W^T W - I)^T(W^T W - I)\right),$$
where $\mathrm{DEP} \in \{\mathrm{UFDM}, \mathrm{DCOR}, \mathrm{HSIC}, \mathrm{MEF}\}$. To evaluate the obtained features $f(x) = W^* x$, we used the accuracy of logistic regression [34], measured on the test set. For each baseline method, we selected the feature dimension corresponding to the maximal validation accuracy of the investigated method, checking all dimensions starting from 1 with a step of 10% of $d_X$. Similarly, we selected $\lambda \in \{0.1, 1.0, 10.0\}$. The feature extraction loss (Equation (13)) was optimised via Algorithm 1 for 100 epochs, with the learning rate set to 0.025, as in the permutation testing experiments (Section 4.1).
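A minimal PyTorch sketch of this dependence-maximisation scheme is given below (our illustration; we interpret the orthogonality penalty as acting on the Gram matrix of the projection rows, and `dep` stands for any differentiable dependence estimator, e.g., the UFDM objective sketched in Section 3.3).

```python
import torch

def extract_features(x, y_onehot, dep, out_dim, lam=1.0, epochs=100, lr=0.025):
    """Learn W by maximising DEP(Wx, y) with a soft orthogonality penalty,
    as in Equation (13); `dep` must be differentiable in its first argument."""
    d_x = x.shape[1]
    W = torch.randn(out_dim, d_x, requires_grad=True)
    opt = torch.optim.AdamW([W], lr=lr)
    eye = torch.eye(out_dim)
    for _ in range(epochs):
        opt.zero_grad()
        ortho = W @ W.T - eye                       # deviation from orthonormal rows
        loss = -dep(x @ W.T, y_onehot) + lam * (ortho ** 2).sum()
        loss.backward()
        opt.step()
    return W.detach()

# Downstream evaluation: fit logistic regression on the projected features
# x @ W.T of the training split and score it on the test split.
```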
  • Baselines. We compared the following baselines: unmodified inputs (denoted as RAW); and Equation (13) scheme with dependence measures: UFDM, DCOR, MEF, and HSIC. We also included the neighbourhood component analysis (NCA) [35] baseline, which is specially tailored for classification.
  • Evaluation metrics. Let $a_{r,p}(b, b' \mid d) = 1$ if, over $r$ runs on dataset $d$, the average test-set accuracy of baseline $b$ is statistically significantly higher than that of $b'$ at p-value threshold $p$ (and 0 otherwise). For statistical significance assessment, we used Wilcoxon's signed-rank test [36]. We computed the win ranking (WR) and loss ranking (LR) as
$$\mathrm{WR}(b) = \sum_d\sum_{b' \neq b} a_{25, 0.05}(b, b' \mid d) \qquad \text{and} \qquad \mathrm{LR}(b) = \sum_d\sum_{b' \neq b} a_{25, 0.05}(b', b \mid d).$$
  • Based on these metrics, Table 4 includes full information on how many cases each baseline method statistically significantly outperformed the other method.
    Table 4. Pairwise wins matrix: entry ( i , j ) is the number of cases where the method in row i outperformed the method in column j (Wilcoxon’s signed-rank test, 25 runs, p-value threshold 0.05 ).
  • Results. Using 16 datasets, we conducted 80 feature efficiency evaluations (excluding the RAW baseline) and 160 feature efficiency comparisons, of which 97 (∼60%) were statistically different. The results of the feature extraction experiments are presented in Table 4 and Table 5. They reveal that, although MEF showed the best WR, UFDM performed comparably to the other measures: it statistically significantly outperformed them in 6 + 4 + 5 + 5 = 20 cases (listed in Table 6) and was outperformed in 2 + 4 + 2 + 5 = 13 cases (Table 4).
    Table 5. Classification accuracy comparison. n denotes dataset size, d X is input dimensionality, and  n c is the number of classes. Best-performing method that is also statistically significant when compared with all other methods (Wilcoxon’s signed-rank test, 25 runs, p-value threshold 0.05 ) is indicated in bold (otherwise, best-performing method is underlined).
    Table 6. Twenty cases (Measures Outperformed) where UFDM outperformed the other baselines.
In addition to pairwise statistical comparisons using Wilcoxon’s test, we also conducted statistical analysis to clarify whether some method is globally better or worse over multiple datasets using the methodology described in [37]. In this analysis, the Friedman/Iman–Davenport test ( α = 0.05 ) showed a global significant difference between the five methods. The Nemenyi post hoc test ( α = 0.05 , critical difference 1.884 ) revealed that RAW was significantly outperformed by the other methods; however, it also showed the absence of a global best-performing method.

5. Conclusions

  • Results. We proposed and analysed an IPM-based statistical dependence measure, UFDM, defined as the $L_\infty$ norm of the difference between the joint and product-marginal characteristic functions. UFDM applies to pairs of random vectors of possibly different dimensions and can be integrated into modern machine learning pipelines. In contrast to global measures (e.g., DCOR, HSIC, MEF), which aggregate information across the entire frequency domain, UFDM identifies spectrally localised dependencies by highlighting frequencies where the discrepancy is maximised, thereby offering potentially interpretable insights into the structure of dependence. We theoretically established key properties of UFDM, such as invariance under linear transformations and augmentation with independent noise, monotonicity under dimension reduction, and vanishing under independence. We also showed that UFDM's objective aligns with the vectorial representation of CFs. In addition, we investigated the consistency of the empirical estimator and derived a finite-sample concentration bound. For practical estimation, we proposed a gradient-based estimation algorithm with SVD warm-up, and this warm-up was found to be essential for stable convergence.
We evaluated UFDM on simulated and real data in permutation-based independence testing and supervised feature extraction. The permutation test experiments ( n = 750 , d { 5 , 15 , 25 } ) indicated that in this regime UFDM performed comparably to established baseline measures, exhibiting similar empirical power and calibration across diverse dependence structures. Notably, UFDM maintained high power on the Circular and Interleaved Moons datasets, where some other measures displayed reduced sensitivity under these geometrically structured dependencies. These findings suggest that UFDM provides a complementary addition to the family of widely used dependence measures (DCOR, HSIC, and MEF).
Further experiments with real data demonstrated that, in dependence-based supervised feature extraction, UFDM often performed on par with the well-established alternatives (HSIC, DCOR, MEF) and with NCA, which is specifically designed for classification. Across 16 datasets and 160 pairwise comparisons, UFDM statistically significantly outperformed other baselines in 20 cases and was outperformed in 13. To facilitate reproducibility, we provide an open-source repository.
  • Limitations. Computing UFDM requires maximising a highly nonlinear objective, which makes the estimator sensitive to initialisation and optimisation settings. Although the proposed SVD warm-up substantially improves numerical stability, estimation may still become more challenging as dimensionality d increases or sample size n decreases. From the perspective of the effective ( n , d ) , our empirical evaluation covers two different tasks. First, in independence testing with synthetic data and n = 750 and d { 5 , 15 , 25 } , UFDM maintained effectiveness across diverse dependence structures. Our preliminary experiments with n = 375 , d { 5 , 15 , 25 } , and n = 750 , d = 50 indicate a reduction in power for several dependency patterns, whereas DCOR, HSIC, and MEF remained comparatively stable. Nonetheless, UFDM preserved its performance for sparse geometrically structured dependencies (e.g., Interleaved Moons), where alternative measures often show more pronounced loss of sensitivity. Due to the high computational cost of UFDM permutation tests, we omitted systematic exploration of these regimes, leaving it to future work. On the other hand, in supervised feature extraction on real datasets, we examined substantially broader ( n , d ) ranges, including high-dimensional settings such as USPS ( n = 9298 , d = 256 ) , Micro-Mass ( n = 360 , d = 1300 ) , and Scene ( n = 2407 , d = 299 ) . UFDM outperformed one or more baselines on several such datasets (Table 6), suggesting that it may be effective in some larger-dimensional machine learning tasks.
  • Future work and potential applications. Identifying the limit distribution of the empirical UFDM could enable faster alternatives to permutation-based statistical tests, which would also facilitate the systematic analysis of previously mentioned ( n , d ) settings. However, since the empirical UFDM is not a U- or V-statistic like HSIC or distance correlation, this would require a non-trivial analysis of the extrema of empirical processes. Possible extensions of UFDM include multivariate generalisations [23] and weighted or normalised variants to enhance empirical stability. From an application perspective, UFDM may prove useful in causality, regularisation, representation learning, and other areas of modern machine learning where statistical dependence serves as an optimisation criterion.

Author Contributions

Conceptualization, P.D.; methodology, P.D.; software, P.D., S.J., L.K., V.M.; validation, P.D., and V.M.; formal analysis, P.D.; writing—original draft preparation, P.D., and V.M.; writing—review and editing, P.D.; funding acquisition, P.D., and V.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by Vytautas Magnus University and Vilnius University.

Data Availability Statement

All synthetic data were generated as described in the manuscript; real datasets were obtained from OpenML (https://www.openml.org (accessed on 4 December 2025)). Code to reproduce the experiments is available at https://github.com/povidanius/UFDM (accessed on 4 December 2025). No additional unpublished data were used.

Acknowledgments

We sincerely thank Dominik Janzing for pointing out the possible theoretical connection between UFDM and HSIC, Marijus Radavičius for a remark that the convergence of empirical UFDM in the entire space requires special investigation, and Iosif Pinelis for [38]. We also acknowledge Pranas Vaitkus, Mindaugas Bloznelis, Linas Petkevičius, Aleksandras Voicikas, Osvaldas Putkis, and colleagues from Neurotechnology for discussions. We feel grateful to Neurotechnology, Vytautas Magnus University, and the Institute of Data Science and Digital Technologies, Vilnius University, for supporting this research. We also thank the anonymous reviewers for their valuable feedback.

Conflicts of Interest

Povilas Daniušis is an employee of Neurotechnology. The paper reflects the views of the scientists and not the company. The other authors declare no conflicts of interest.

Appendix A

Appendix A.1. Proofs

In the proofs, we interchangeably abbreviate ϕ X ( α ) with ϕ ( α ) , ϕ Y ( β ) with ϕ ( β ) , and  ϕ X , Y ( α , β ) with ϕ ( γ ) , where γ = ( α T , β T ) T .
Proof of Theorem 1.
Property 1. By the Cauchy–Schwarz inequality,
$$|\Delta(\alpha, \beta)|^2 = \left|\mathbb{E}_{X,Y}\left(e^{i\alpha^T X} - \phi_X(\alpha)\right)\left(e^{i\beta^T Y} - \phi_Y(\beta)\right)\right|^2 \le \mathbb{E}_X\left|e^{i\alpha^T X} - \phi_X(\alpha)\right|^2\,\mathbb{E}_Y\left|e^{i\beta^T Y} - \phi_Y(\beta)\right|^2. \quad \text{(A1)}$$
Recall that for complex numbers $z$ and $z'$ we have $|z - z'|^2 = |z|^2 - z\bar{z}' - \bar{z}z' + |z'|^2$, where $\bar{z}$ is the complex conjugate of $z$. Therefore, by plugging in $z = e^{i\alpha^T X}$ and $z' = \phi_X(\alpha)$ and using the definition of the CF, we obtain
$$\mathbb{E}_X\left|e^{i\alpha^T X} - \phi_X(\alpha)\right|^2 = 1 - \phi_X(\alpha)\overline{\phi_X(\alpha)} - \overline{\phi_X(\alpha)}\phi_X(\alpha) + |\phi_X(\alpha)|^2 = 1 - |\phi_X(\alpha)|^2,$$
and similarly $\mathbb{E}_Y\left|e^{i\beta^T Y} - \phi_Y(\beta)\right|^2 = 1 - |\phi_Y(\beta)|^2$. Since the absolute value of a CF is bounded by 1, Equation (A1) is also bounded by 1.
Property 2.
$$\mathrm{UFDM}(X, Y) = \sup_{\alpha, \beta}\left|\mathbb{E}_{X,Y}e^{i(\alpha^T X + \beta^T Y)} - \mathbb{E}_X e^{i\alpha^T X}\,\mathbb{E}_Y e^{i\beta^T Y}\right| = \sup_{\beta, \alpha}\left|\mathbb{E}_{Y,X}e^{i(\beta^T Y + \alpha^T X)} - \mathbb{E}_Y e^{i\beta^T Y}\,\mathbb{E}_X e^{i\alpha^T X}\right| = \mathrm{UFDM}(Y, X).$$
Property 3. Let us assume that $X \perp Y$. Then $\phi_{X,Y}(\alpha, \beta) = \mathbb{E}_{X,Y}e^{i(\alpha^T X + \beta^T Y)} = \mathbb{E}_X\mathbb{E}_Y e^{i(\alpha^T X + \beta^T Y)} = \phi_X(\alpha)\phi_Y(\beta)$. Therefore, $\mathrm{UFDM}(X, Y) = 0$. On the other hand, if $\mathrm{UFDM}(X, Y) = 0$, then $\phi_{X,Y}(\alpha, \beta) = \phi_X(\alpha)\phi_Y(\beta)$ for all $\alpha \in \mathbb{R}^{d_X}$, $\beta \in \mathbb{R}^{d_Y}$. Let $\tilde{X}$ and $\tilde{Y}$ be two independent random vectors having the same distributions as $X$ and $Y$, respectively. Then $\phi_{X,Y}(\alpha, \beta) = \phi_X(\alpha)\phi_Y(\beta) = \phi_{\tilde{X}}(\alpha)\phi_{\tilde{Y}}(\beta) = \phi_{\tilde{X},\tilde{Y}}(\alpha, \beta)$. The uniqueness of the CF [16] implies that the distributions of $(X, Y)$ and $(\tilde{X}, \tilde{Y})$ coincide, from which it directly follows that $X \perp Y$.
Property 4. Let $\Sigma_{X,Y}$ be the cross-covariance matrix of $X$ and $Y$. Since $X$ and $Y$ are Gaussian, we have $\phi_X(\alpha) = e^{-\frac{1}{2}\alpha^T\Sigma_X\alpha}$, $\phi_Y(\beta) = e^{-\frac{1}{2}\beta^T\Sigma_Y\beta}$, and $\phi_{X,Y}(\alpha, \beta) = e^{-\frac{1}{2}(\alpha^T\Sigma_X\alpha + \beta^T\Sigma_Y\beta + 2\alpha^T\Sigma_{X,Y}\beta)}$. Therefore, by Equation (11),
$$\mathrm{UFDM}(X, Y) = \sup_{\alpha, \beta} e^{-\frac{1}{2}(\alpha^T\Sigma_X\alpha + \beta^T\Sigma_Y\beta)}\left|e^{-\alpha^T\Sigma_{X,Y}\beta} - 1\right|.$$
Property 5. Since $\phi_{AX+a, BY+b}(\alpha, \beta) = e^{i\alpha^T a + i\beta^T b}\,\phi_{X,Y}(A^T\alpha, B^T\beta)$, $\phi_{AX+a}(\alpha) = e^{i\alpha^T a}\phi_X(A^T\alpha)$, and $\phi_{BY+b}(\beta) = e^{i\beta^T b}\phi_Y(B^T\beta)$, we have
$$\mathrm{UFDM}(AX + a, BY + b) = \sup_{\alpha, \beta}\left|\phi_{AX+a, BY+b}(\alpha, \beta) - \phi_{AX+a}(\alpha)\phi_{BY+b}(\beta)\right| = \sup_{\alpha, \beta}\left|e^{i\alpha^T a + i\beta^T b}\right|\left|\Delta(A^T\alpha, B^T\beta)\right| = \sup_{\alpha, \beta}\left|\Delta(A^T\alpha, B^T\beta)\right|.$$
Since both $A$ and $B$ are full-rank matrices, and $A \in \mathbb{R}^{d_X \times d_X}$, $B \in \mathbb{R}^{d_Y \times d_Y}$, the maximisation in the last expression is equivalent to the maximisation of $|\Delta(\alpha, \beta)|$, which by definition is $\mathrm{UFDM}(X, Y)$.
Property 6. If $A' \in \mathbb{R}^{d_X' \times d_X}$, $B' \in \mathbb{R}^{d_Y' \times d_Y}$, $a' \in \mathbb{R}^{d_X'}$, $b' \in \mathbb{R}^{d_Y'}$ are parameters of a linear dimension reduction, where $d_X' < d_X$ and $d_Y' < d_Y$, we have
$$\mathrm{UFDM}(A'X + a', B'Y + b') \le \mathrm{UFDM}(AX + a, BY + b)$$
for any $A, B, a, b$ defined as in Property 5, because the maximisation on the left-hand side is conducted over a smaller space than that on the right-hand side. By Property 5, it follows that $\mathrm{UFDM}(AX + a, BY + b) = \mathrm{UFDM}(X, Y)$.
Property 7. The independence of $X$ and $E$ implies that
$$\mathrm{UFDM}(X, f(X) + \lambda E) = \sup_{\alpha, \beta}\left|\mathbb{E}\,e^{i(\alpha^T X + \beta^T f(X))}\,\phi_E(\lambda\beta) - \phi_X(\alpha)\,\phi_{f(X)}(\beta)\,\phi_E(\lambda\beta)\right|,$$
which converges to 0, since by the multivariate Riemann–Lebesgue lemma [39] the common factor $|\phi_E(\lambda\beta)| \to 0$ as $\lambda \to \infty$. The multivariate Riemann–Lebesgue lemma can be applied since $E$ has a density.
Property 8. Recall that the total variation distance between the joint probability measure $P_{X,Y}$ and the product measure $P_X P_Y$ is given by
$$\mathrm{TV}(P_{X,Y}, P_X P_Y) = \frac{1}{2}\int\left|p_{X,Y}(x, y) - p_X(x)p_Y(y)\right|dx\,dy,$$
where $p_{X,Y}(x, y)$ is the joint density and $p_X(x), p_Y(y)$ are the marginal ones. Recall that Pinsker's inequality for total variation states that $\mathrm{TV}(P_{X,Y}, P_X P_Y) \le \sqrt{\frac{1}{2}\mathrm{MI}(X, Y)}$, where $\mathrm{MI}(X, Y)$ is the mutual information between $X$ and $Y$. Therefore,
$$|\Delta(\alpha, \beta)| = \left|\int e^{i(\alpha^T x + \beta^T y)}\left(p_{X,Y}(x, y) - p_X(x)p_Y(y)\right)dx\,dy\right| \le \int\left|p_{X,Y}(x, y) - p_X(x)p_Y(y)\right|dx\,dy = 2\,\mathrm{TV}(P_{X,Y}, P_X P_Y).$$
By taking the supremum, we have $\mathrm{UFDM}(X, Y) \le \min\{1, 2\,\mathrm{TV}(P_{X,Y}, P_X P_Y)\} \le \min\{1, \sqrt{2\,\mathrm{MI}(X, Y)}\}$ by Property 1 and Pinsker's inequality.
Property 9. The independence condition $Z \perp (X, Y)$ gives
$$\Delta_{(X,Z),Y}(\alpha_X, \alpha_Z, \beta) = \phi_Z(\alpha_Z)\,\Delta_{X,Y}(\alpha_X, \beta).$$
Since $|\phi_Z(\alpha_Z)| \le 1$ and $|\phi_Z(0)| = 1$, we have $\sup_{\alpha_X, \alpha_Z, \beta}|\Delta_{(X,Z),Y}(\alpha_X, \alpha_Z, \beta)| = \sup_{\alpha_X, \beta}|\Delta_{X,Y}(\alpha_X, \beta)|$. Therefore, $\mathrm{UFDM}((X, Z), Y) = \mathrm{UFDM}(X, Y)$.    □
Proof of Proposition 2.
Let $\epsilon > 0$. Since an ECF is a CF, and a product of two CFs is also a CF, by Theorem 2 and the triangle inequality we can find a natural number $n_0$ such that for all $n > n_0$:
$$\|\Delta - \Delta_n\|_{L_\infty^{t_n}} = \sup_{\|\gamma\| < t_n}|\Delta(\gamma) - \Delta_n(\gamma)| = \sup_{\|\gamma\| < t_n}|\phi(\gamma) - \psi(\gamma) - \phi_n(\gamma) + \psi_n(\gamma)| \le \sup_{\|\gamma\| < t_n}|\phi(\gamma) - \phi_n(\gamma)| + \sup_{\|\gamma\| < t_n}|\psi(\gamma) - \psi_n(\gamma)| \le \epsilon,$$
almost surely. From the inverse triangle inequality for norms, we have $\left|\|\Delta\|_{L_\infty^{t_n}} - \|\Delta_n\|_{L_\infty^{t_n}}\right| \le \|\Delta - \Delta_n\|_{L_\infty^{t_n}} \le \epsilon$, almost surely. On the other hand, together with $\mathrm{UFDM}(X, Y) = \lim_{n \to \infty}\|\Delta\|_{L_\infty^{t_n}}$, this implies that $\left|\mathrm{UFDM}(X, Y) - \|\Delta_n\|_{L_\infty^{t_n}}\right| \le \left|\mathrm{UFDM}(X, Y) - \|\Delta\|_{L_\infty^{t_n}}\right| + \left|\|\Delta\|_{L_\infty^{t_n}} - \|\Delta_n\|_{L_\infty^{t_n}}\right|$ becomes arbitrarily small almost surely when $n$ is sufficiently large.    □

Appendix A.2. Proof of Theorem 3

Proof of Theorem 3.
Recall that $Z = (X^T, Y^T)^T$, $\gamma = (\alpha^T, \beta^T)^T \in \mathbb{R}^d$ with $d = d_X + d_Y$, and
$$\Delta(\gamma) = \phi(\gamma) - \psi(\gamma), \qquad \Delta_n(\gamma) = \phi_n(\gamma) - \psi_n(\gamma).$$
Step 1: Lipschitz continuity. First, we prove that $\Delta(\gamma)$ and $\Delta_n(\gamma)$ are Lipschitz continuous. For the population version, consider
$$|\Delta(\gamma) - \Delta(\gamma')| \le |\phi(\gamma) - \phi(\gamma')| + |\psi(\gamma) - \psi(\gamma')|.$$
Since $\phi(\gamma) = \mathbb{E}\exp(i\gamma^T Z)$, by the inequality $|e^{ia} - e^{ib}| \le |a - b|$, $a, b \in \mathbb{R}$,
$$|\phi(\gamma) - \phi(\gamma')| \le \mathbb{E}\left|\exp(i\gamma^T Z) - \exp(i\gamma'^T Z)\right| \le \mathbb{E}\left|(\gamma - \gamma')^T Z\right| \le \|\gamma - \gamma'\|\,\mathbb{E}\|Z\|.$$
Similarly,
$$|\psi(\gamma) - \psi(\gamma')| = |\phi(\alpha)\phi(\beta) - \phi(\alpha')\phi(\beta')| = |\phi(\alpha)(\phi(\beta) - \phi(\beta')) + \phi(\beta')(\phi(\alpha) - \phi(\alpha'))| \le |\phi(\alpha) - \phi(\alpha')| + |\phi(\beta) - \phi(\beta')|,$$
since $|\phi(\alpha)| \le 1$, $|\phi(\beta')| \le 1$. Therefore,
$$|\phi(\alpha) - \phi(\alpha')| \le \mathbb{E}\|X\|\,\|\alpha - \alpha'\|, \qquad |\phi(\beta) - \phi(\beta')| \le \mathbb{E}\|Y\|\,\|\beta - \beta'\|.$$
Thus,
$$|\psi(\gamma) - \psi(\gamma')| \le \left(\mathbb{E}\|X\| + \mathbb{E}\|Y\|\right)\|\gamma - \gamma'\|,$$
so $\Delta(\gamma)$ is Lipschitz with constant $L = \mathbb{E}\|Z\| + \mathbb{E}\|X\| + \mathbb{E}\|Y\| < \infty$. For the empirical version,
$$|\Delta_n(\gamma) - \Delta_n(\gamma')| \le |\phi_n(\gamma) - \phi_n(\gamma')| + |\psi_n(\gamma) - \psi_n(\gamma')|,$$
where
$$|\phi_n(\gamma) - \phi_n(\gamma')| \le \frac{1}{n}\sum_{j=1}^n\left|\gamma^T Z_j - \gamma'^T Z_j\right| \le \frac{1}{n}\sum_{j=1}^n\|Z_j\|\,\|\gamma - \gamma'\|,$$
and
$$|\psi_n(\gamma) - \psi_n(\gamma')| \le |\phi_n(\alpha) - \phi_n(\alpha')| + |\phi_n(\beta) - \phi_n(\beta')| \le \frac{1}{n}\left(\sum_{j=1}^n\|X_j\| + \sum_{j=1}^n\|Y_j\|\right)\|\gamma - \gamma'\|.$$
Define $L_n = \frac{1}{n}\sum_{j=1}^n\left(\|Z_j\| + \|X_j\| + \|Y_j\|\right)$, so $\Delta_n(\gamma)$ is Lipschitz with the random constant $L_n$. Recall that $\mathbb{E}L_n = L$ and $\mathbb{E}(L_n - L)^2 = \sigma^2/n$ are finite because of the bounded second-moment assumption. Hence $L_n$ concentrates around $L$, and by Cantelli's inequality we have
$$\Pr(L_n \ge 2L) = \Pr(L_n - L \ge L) \le \frac{1}{1 + n(L/\sigma)^2} \le \frac{\sigma^2}{nL^2}. \quad \text{(A4)}$$
Step 2: Construct a $\delta$-net and bound the deviation on the $\delta$-net. For $B_t = \{\gamma : \|\gamma\| < t\}$, construct a $\delta$-net $\{\gamma_1, \dots, \gamma_{N(t,\delta)}\}$ such that every $\gamma \in B_t$ is within $\delta$ of some $\gamma_k$. The cardinality satisfies $N(t, \delta) \le (3t/\delta)^d$ [40].
For fixed $\gamma_k$, we bound $|\Delta_n(\gamma_k) - \Delta(\gamma_k)|$. Changing one $Z_j$ to $Z_j'$ alters $\phi_n(\gamma_k)$ by at most $2/n$, $\phi_n(\alpha_k)$ and $\phi_n(\beta_k)$ by at most $2/n$ each, and $\psi_n(\gamma_k)$ by at most $4/n$. Thus, $|\Delta_n(\gamma_k) - \Delta_n'(\gamma_k)| \le 6/n$. By McDiarmid's inequality,
$$\Pr\left(|\Delta_n(\gamma_k) - \mathbb{E}\Delta_n(\gamma_k)| > u\right) \le 2\exp\left(-\frac{nu^2}{18}\right).$$
Compute the bias: $\mathbb{E}\phi_n(\gamma_k) = \phi(\gamma_k)$ and $\mathbb{E}\psi_n(\gamma_k) = \frac{1}{n}\phi(\gamma_k) + \left(1 - \frac{1}{n}\right)\psi(\gamma_k)$, so
$$\mathbb{E}\Delta_n(\gamma_k) = \left(1 - \frac{1}{n}\right)\Delta(\gamma_k), \qquad \left|\mathbb{E}\Delta_n(\gamma_k) - \Delta(\gamma_k)\right| \le \frac{1}{n}.$$
Thus,
$$\Pr\left(|\Delta_n(\gamma_k) - \Delta(\gamma_k)| > \varepsilon\right) \le 2\exp\left(-\frac{n}{18}\left(\varepsilon - \frac{1}{n}\right)^2\right), \qquad \varepsilon > \frac{1}{n}.$$
Step 3: Extend to the entire frequency ball. For any $\gamma \in B_t$, choose $\gamma_k$ with $\|\gamma - \gamma_k\| \le \delta$. Then we have
$$|\Delta_n(\gamma) - \Delta(\gamma)| \le |\Delta_n(\gamma) - \Delta_n(\gamma_k)| + |\Delta_n(\gamma_k) - \Delta(\gamma_k)| + |\Delta(\gamma_k) - \Delta(\gamma)| \le L_n\delta + |\Delta_n(\gamma_k) - \Delta(\gamma_k)| + L\delta.$$
Thus, $\sup_{\gamma \in B_t}|\Delta_n(\gamma) - \Delta(\gamma)| \le (L_n + L)\delta + \max_k|\Delta_n(\gamma_k) - \Delta(\gamma_k)|$. Then, by the union bound,
$$\Pr\left(\sup_{\gamma \in B_t}|\Delta_n(\gamma) - \Delta(\gamma)| > \varepsilon\right) \le \Pr\left((L_n + L)\delta > \frac{\varepsilon}{2}\right) + \Pr\left(\max_k|\Delta_n(\gamma_k) - \Delta(\gamma_k)| > \frac{\varepsilon}{2}\right). \quad \text{(A7)}$$
Recall that in Equation (A4) we showed that $\Pr(L_n \ge 2L) \le \frac{\sigma^2}{nL^2}$. Choosing $\delta = \frac{\varepsilon}{6L}$ implies
$$\Pr\left((L_n + L)\delta > \frac{\varepsilon}{2}\right) = \Pr(L_n > 2L) \le \frac{\sigma^2}{nL^2}. \quad \text{(A8)}$$
For the max term, by the union bound,
$$\Pr\left(\max_k|\Delta_n(\gamma_k) - \Delta(\gamma_k)| > \frac{\varepsilon}{2}\right) \le 2N(t, \delta)\exp\left(-\frac{n}{18}\left(\frac{\varepsilon}{2} - \frac{1}{n}\right)^2\right), \quad \text{(A9)}$$
where $N(t, \delta) \le (3t/\delta)^d = \left(\frac{18tL}{\varepsilon}\right)^d$.
Step 4: Final bound. Plugging Equations (A8) and (A9) into Equation (A7), we have
$$\Pr\left(\sup_{\gamma \in B_t}|\Delta_n(\gamma) - \Delta(\gamma)| > \varepsilon\right) \le 2\left(\frac{Ct}{\varepsilon}\right)^d\exp\left(-\frac{n}{18}\left(\frac{\varepsilon}{2} - \frac{1}{n}\right)^2\right) + \frac{\sigma^2}{nL^2}.$$
Finally, the stated bound follows from the inverse triangle inequality for norms.   □

Appendix A.3. Ablation Experiment on SVD Warm-Up

Table A1. p-value means and standard deviations of the analysed dependence patterns in permutation tests for UFDM without SVD warm-up (Algorithm 2), uniformly initialising parameters $\alpha$ and $\beta$ from the $[-1, 1]$ interval. Here $X \sim \mathcal{N}(0, I_d)$.
Distribution of Y | d = 5 | d = 15 | d = 25
Linear (1.0) | 0.002 ± 0.000 | 0.002 ± 0.000 | 0.002 ± 0.000
Linear (0.3) | 0.002 ± 0.000 | 0.002 ± 0.000 | 0.002 ± 0.000
Logarithmic | 0.035 ± 0.097 | 0.192 ± 0.251 | 0.387 ± 0.297
Quadratic | 0.023 ± 0.061 | 0.298 ± 0.291 | 0.285 ± 0.145
Polynomial | 0.002 ± 0.000 | 0.062 ± 0.134 | 0.056 ± 0.078
LRSO (0.05) | 0.002 ± 0.001 | 0.041 ± 0.066 | 0.026 ± 0.040
Heteroscedastic | 0.004 ± 0.006 | 0.002 ± 0.001 | 0.003∗ ± 0.003

Appendix A.4. Dependency Patterns

Table A2. Dependence structures. $Lin[a, b]$ denotes uniform linear spacing over the given interval $[a, b]$, $a < b$; $E \sim \mathcal{N}(0, I)$ is noise independent of $X$; and $d$ is the dimension. Fixed parameters: $k = 6$, $\rho = 0.85$, $\theta = 5.0$. By $\odot$ we denote the element-wise product.
Type | Formula
Structured dependence patterns ($X \sim \{\mathcal{N}(0, I_d),\ U[0, 1]^d,\ \text{Student } t_3(0, I_d)\}$)
Linear(p) | $Y = p\,WX + 0.1E$, $p \in \mathbb{R}$
Logarithmic | $Y = \log(1.0 + WX \odot WX) + 0.1E$
Quadratic | $Y = WX \odot WX + 0.1E$
Cubic | $Y = 0.5\,(WX \odot WX \odot WX) - WX \odot WX + 0.1E$
LRSO(p) | $X_0 \sim P_X$, $Y_0 = \sin(k(w^T X_0))\,\mathbf{1}_d + 0.1E$ (proportion $1 - p$); $X_1 \perp Y_1 \sim \mathcal{N}(0, 25^2 I_d)$ (proportion $p$); $(X, Y) = \text{random-shuffle}(X_0 \cup X_1,\ Y_0 \cup Y_1)$
Heteroscedastic | $Y = (1.0 + E_1) \odot WX + 0.1E$, $E_1 \sim \mathcal{N}(0, I)$
Complex dependence patterns
Bimodal | $S \sim \mathrm{Uniform}(\{-1, 1\})$, $\mu_X = 2\cdot\mathbf{1}_{d_X}$, $\mu_Y = 2\cdot\mathbf{1}_{d_Y}$, $X \sim \mathcal{N}(S\mu_X, I_{d_X})$, $Y \sim \mathcal{N}(S\mu_Y, I_{d_Y})$
Sparse bimodal | $X \sim 0.5\,\mathcal{N}(\mu, I_d) + 0.5\,\mathcal{N}(-\mu, I_d)$, $\mu = (2, 0, \dots, 0)$
Sparse circular | $T \sim Lin[0, 2\pi]$, $R \sim \mathcal{N}(1, 0.2^2)$, $X = (R\cos T,\ R\sin T,\ \eta)$, $\eta \sim \mathcal{N}(0, I_{d-2})$, $Y = (R\cos(T + \delta),\ R\sin(T + \delta),\ \zeta) + 0.1E$, $\delta \sim \mathcal{N}(0, 1)$, $\zeta \sim \mathcal{N}(0, I_{d-2})$
Gaussian copula | Marginals $\mathcal{N}(0,\ \rho\mathbf{1}_{d \times d} + (1 - \rho)I_d)$.
Clayton copula | Parameter $\theta$ and standard normal marginals for each component.
Interleaved Moons | $(X_0, L_X) = \mathrm{make\_moons}()$, $(Y_0, L_Y) = \mathrm{make\_moons}()$. For each sample $i$: $X_i^{(1,2)} = (X_0)_i$, $Y_i^{(1,2)} \sim \mathrm{Uniform}\{(Y_0)_j : (L_Y)_j = (L_X)_i\}$, $X_i^{(3:d)}, Y_i^{(3:d)} \sim \mathcal{N}(0, I_{d-2})$
We used sklearn.datasets.make_moons.

References

  1. Gretton, A.; Bousquet, O.; Smola, A.; Schölkopf, B. Measuring statistical dependence with Hilbert-Schmidt norms. In Proceedings of the 16th International Conference on Algorithmic Learning Theory (ALT), Singapore, 8–11 October 2005. [Google Scholar]
  2. Daniušis, P.; Vaitkus, P.; Petkevičius, L. Hilbert–Schmidt component analysis. Lith. Math. J. 2016, 57, 7–11. [Google Scholar] [CrossRef]
  3. Daniušis, P.; Vaitkus, P. Supervised feature extraction using Hilbert-Schmidt norms. In Proceedings of the 10th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL), Burgos, Spain, 23–26 September 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 25–33. [Google Scholar]
  4. Hoyer, P.; Janzing, D.; Mooij, J.M.; Peters, J.; Schölkopf, B. Nonlinear causal discovery with additive noise models. In Proceedings of the Advances in Neural Information Processing Systems 21 (NeurIPS 2008), Vancouver, BC, Canada, 8–11 December 2008. [Google Scholar]
  5. Li, Y.; Pogodin, R.; Sutherland, D.J.; Gretton, A. Self-Supervised Learning with Kernel Dependence Maximization. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Virtual, 6–14 December 2021. [Google Scholar]
  6. Ragonesi, R.; Volpi, R.; Cavazza, J.; Murino, V. Learning unbiased representations via mutual information backpropagation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Virtual, 19–25 June 2021; pp. 2723–2732. [Google Scholar]
  7. Zhen, X.; Meng, Z.; Chakraborty, R.; Singh, V. On the Versatile Uses of Partial Distance Correlation in Deep Learning. In Proceedings of the 17th European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
  8. Chatterjee, S. A New Coefficient of Correlation. J. Am. Stat. Assoc. 2021, 116, 2009–2022. [Google Scholar] [CrossRef]
  9. Feuerverger, A. A consistent test for bivariate dependence. Int. Stat. Rev. 1993, 61, 419–433. [Google Scholar] [CrossRef]
  10. Póczos, B.; Ghahramani, Z.; Schneider, J.G. Copula-based kernel dependency measures. arXiv 2012, arXiv:1206.4682. [Google Scholar] [CrossRef]
  11. Puccetti, G. Measuring linear correlation between random vectors. Inf. Sci. 2022, 607, 1328–1347. [Google Scholar] [CrossRef]
  12. Shen, C.; Priebe, C.E.; Vogelstein, J.T. From Distance Correlation to Multiscale Graph Correlation. J. Am. Stat. Assoc. 2020, 115, 280–291. [Google Scholar] [CrossRef]
  13. Székely, G.J.; Rizzo, M.L.; Bakirov, N.K. Measuring and testing dependence by correlation of distances. Ann. Stat. 2007, 35, 2769–2794. [Google Scholar] [CrossRef]
  14. Tsur, D.; Goldfeld, Z.; Greenewald, K. Max-Sliced Mutual Information. In Proceedings of the Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023; Curran Associates, Inc.: Red Hook, NY, USA, 2023; pp. 80338–80351. [Google Scholar]
  15. Sriperumbudur, B.K.; Fukumizu, K.; Gretton, A.; Schölkopf, B.; Lanckriet, G.R.G. On the empirical estimation of integral probability metrics. Electron. J. Stat. 2012, 6, 1550–1599. [Google Scholar] [CrossRef]
  16. Jacod, J.; Protter, P. Probability Essentials, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2003. [Google Scholar]
  17. Richter, W.-D. On the vector representation of characteristic functions. Stats 2023, 6, 1072–1081. [Google Scholar] [CrossRef]
  18. Zhang, W.; Gao, W.; Ng, H.K.T. Multivariate tests of independence based on a new class of measures of independence in Reproducing Kernel Hilbert Space. J. Multivar. Anal. 2023, 195, 105144. [Google Scholar] [CrossRef]
  19. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley-Interscience: Hoboken, NJ, USA, 2006. [Google Scholar]
  20. Yu, S.; Giraldo, L.G.S.; Jenssen, R.; Príncipe, J.C. Multivariate Extension of Matrix-Based Rényi’s α-Order Entropy Functional. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2960–2966. [Google Scholar] [CrossRef]
  21. Yu, S.; Alesiani, F.; Yu, X.; Jenssen, R.; Príncipe, J.C. Measuring Dependence with Matrix-based Entropy Functional. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI 2021), Virtual, 2–9 February 2021; pp. 10781–10789. [Google Scholar]
  22. Lopez-Paz, D.; Hennig, P.; Schölkopf, B. The Randomized Dependence Coefficient. In Proceedings of the Advances in Neural Information Processing Systems 26 (NeurIPS 2013), Lake Tahoe, NV, USA, 5–8 December 2013; Curran Associates, Inc.: Red Hook, NY, USA, 2013. [Google Scholar]
  23. Böttcher, B.; Keller-Ressel, M.; Schilling, R. Distance multivariance: New dependence measures for random vectors. arXiv 2018, arXiv:1711.07775. [Google Scholar] [CrossRef]
  24. Székely, G.J.; Rizzo, M.L. Partial distance correlation with methods for dissimilarities. Ann. Stat. 2014, 42, 2382–2412. [Google Scholar] [CrossRef]
  25. Schölkopf, B.; Smola, A.J.; Bach, F. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  26. Belghazi, M.I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Courville, A.; Hjelm, D. Mutual Information Neural Estimation. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 10–15 July 2018; pp. 531–540. [Google Scholar]
  27. Sanchez Giraldo, L.G.; Rao, M.; Principe, J.C. Measures of Entropy From Data Using Infinitely Divisible Kernels. IEEE Trans. Inf. Theory 2015, 61, 535–548. [Google Scholar] [CrossRef]
  28. Ushakov, N.G. Selected Topics in Characteristic Functions; De Gruyter: Berlin, Germany, 2011. [Google Scholar]
  29. Csörgo, S.; Totik, V. On how long interval is the empirical characteristic function uniformly consistent. Acta Sci. Math. (Szeged) 1983, 45, 141–149. [Google Scholar]
  30. Garreau, D.; Jitkrittum, W.; Kanagawa, M. Large sample analysis of the median heuristic. arXiv 2017, arXiv:1707.07269. [Google Scholar]
  31. Phipson, B.; Smyth, G.K. Permutation p-values should never be zero: Calculating exact p-values when permutations are randomly drawn. Stat. Appl. Genet. Mol. Biol. 2010, 9, 39. [Google Scholar] [CrossRef]
  32. Vanschoren, J.; van Rijn, J.N.; Bischl, B.; Torgo, L. OpenML: Networked Science in Machine Learning. SIGKDD Explor. 2013, 15, 49–60. [Google Scholar] [CrossRef]
  33. Zhang, Y.; Zhou, Z.H. Multilabel Dimensionality Reduction via Dependence Maximization. ACM Trans. Knowl. Discov. Data 2010, 4, 14:1–14:21. [Google Scholar] [CrossRef]
  34. McCullagh, P.; Nelder, J.A. Generalized Linear Models, 2nd ed.; Chapman and Hall/CRC Monographs on Statistics and Applied Probability Series, Chapman & Hall; Routledge: Oxfordshire, UK, 1989. [Google Scholar]
  35. Goldberger, J.; Hinton, G.E.; Roweis, S.; Salakhutdinov, R.R. Neighbourhood components analysis. In Proceedings of the Advances in Neural Information Processing Systems 17 (NeurIPS 2004), Vancouver, BC, Canada, 13–18 December 2004. [Google Scholar]
  36. Wilcoxon, F. Individual Comparisons by Ranking Methods. Biom. Bull. 1945, 1, 80–83. [Google Scholar] [CrossRef]
  37. Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
  38. Euclidean Norm of Sub-Exponential Random Vector is Sub-Exponential? MathOverflow. Version: 2025-05-06. Available online: https://mathoverflow.net/q/492045 (accessed on 10 April 2025).
  39. Bochner, S.; Chandrasekharan, K. Fourier Transforms (AM-19); Princeton University Press: Princeton, NJ, USA, 1949. [Google Scholar]
  40. Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science; Cambridge Series in Statistical and Probabilistic Mathematics; Cambridge University Press: Cambridge, UK, 2018. [Google Scholar] [CrossRef]
