Machine Learning with Squared-Loss Mutual Information

Mutual information (MI) is useful for detecting statistical independence between random variables, and it has been successfully applied to solving various machine learning problems. Recently, an alternative to MI called squared-loss MI (SMI) was introduced. While ordinary MI is the Kullback–Leibler divergence from the joint distribution to the product of the marginal distributions, SMI is its Pearson divergence variant. Because both the divergences belong to the ƒ-divergence family, they share similar theoretical properties. However, a notable advantage of SMI is that it can be approximated from data in a computationally more efficient and numerically more stable way than ordinary MI. In this article, we review recent development in SMI approximation based on direct density-ratio estimation and SMI-based machine learning techniques such as independence testing, dimensionality reduction, canonical dependency analysis, independent component analysis, object matching, clustering, and causal inference.


Introduction
Mutual information (MI) [1,2] for random variables X and Y is defined as: MI(X, Y ) := p(x, y) log p(x, y) p(x)p(y) dxdy where p(x, y) is the joint probability density of X and Y , and p(x) and p(y) are the marginal probability densities of X and Y , respectively (Precisely, p(x, y), p(x), and p(y) are different functions and thus should be denoted, e.g., by p X,Y (x, y), p X (x), and p Y (y), respectively.However, we use the simplified notations for the sake of brevity).Statistically, MI can be regarded as the Kullback-Leibler divergence [3] from the joint density p(x, y) to the product of the marginals p(x)p(y), and thus can be regarded as a measure of statistical dependency between X and Y .Estimation of MI from data samples has been one of the major challenges in information science and various approaches have been developed thus far.The most naive approach to approximating MI from data would be to use a non-parametric density estimator such as kernel density estimation (KDE) [4], i.e., the densities p(x, y), p(x), and p(y) included in MI are separately estimated from samples, and the estimated densities are used for approximating MI.However, density estimation is known to be a hard problem [5] and division by estimated densities tends to magnify the estimation error.Therefore, the KDE-based MI approximator may not be reliable in practice.
Another approach uses histogram-based density estimators with data-dependent partitioning.In the context of estimating the Kullback-Leibler divergence [3], histogram-based methods have been studied thoroughly and their consistency has been established [6][7][8].However, the rate of convergence has not been elucidated yet, and such histogram-based methods are strongly influenced by the curse of dimensionality.Thus, these methods may not be reliable in high-dimensional problems.
MI can be expressed in terms of the entropies as: where H(X) denotes the entropy of X: H(X) := − p(x) log p(x)dx Based on this expression, the nearest neighbor distance has been used for approximating MI [9].Such a nearest neighbor approach was shown to perform better than the naive KDE-based approach [10], given that the number k of nearest neighbors is chosen appropriately-a small (large) k yields an estimator with small (large) bias and large (small) variance.However, appropriately determining the value of k so that the bias-variance trade-off is optimally controlled is not straightforward in the context of MI estimation.
A similar nearest-neighbor idea has been applied to Kullback-Leibler divergence estimation [11], whose consistency has been proved for finite k-this is an interesting result since Kullback-Leibler divergence estimation is consistent even when density estimation is not consistent.However, the rate of convergence seems to be still an open research issue.Approximation of the entropies based on the Edgeworth expansion has also been explored in the context of MI estimation [12].Such a method works well when the target density is close to Gaussian.However, if the target density is far from Gaussian, the Edgeworth expansion method is no longer reliable.
More recently, an MI approximator via direct estimation of the density ratio p(x,y) p(x)p(y) has been developed [13], which is based on a Kullback-Leibler divergence approximator via direct density-ratio estimation [14][15][16].The MI approximator is given as the solution of a convex optimization problem, which tends to be sparse [14].A notable advantage of this density-ratio method is that it does not involve separate estimation of densities p(x, y), p(x), and p(y), and it was proved to achieve the optimal non-parametric convergence rate.However, due to the "log" operation included in MI, this MI approximator is computationally rather expensive and susceptible to outliers [17,18].
To cope with these problems, a variant of MI called the squared-loss mutual information (SMI) [19] has been explored recently.SMI for X and Y is defined as: SMI(X, Y ) := 1 2 p(x)p(y) p(x, y) p(x)p(y) − 1 2 dxdy SMI is the Pearson divergence [20] from the joint density p(x, y) to the product of the marginals p(x)p(y).It is always non-negative and it vanishes if and only if X and Y are statistically independent.Note that both the Pearson divergence and the Kullback-Leibler divergence belong to the class of Ali-Silvey-Csiszár divergences (which is also known as f -divergences) [21,22], meaning that they share similar properties.
In a similar way to ordinary MI, SMI can be approximated accurately via direct estimation of the density ratio p(x,y) p(x)p(y) [19], which is based on a Pearson divergence approximator via direct density-ratio estimation [16,23].This SMI approximator has various desirable properties: For example, it was proved to achieve the optimal non-parametric convergence rate [24], its solution can be obtained analytically just by solving a system of linear equations, it has superior numerical properties [25], and it is robust against outliers [17,18].
In particular, the property of the SMI approximator that an analytic solution is available is highly useful in machine learning, because this allows explicit computation of the derivative of the SMI approximator with respect to another parameter.For example, in supervised dimensionality reduction, linear transformation U for input x is optimized so that the transformed input U x has the highest dependency on output y.In this context, the derivative of the SMI estimator between U x and y with respect to transformation U can be exploited for optimizing transformation U .On the other hand, such derivative computation is not straightforward for the MI estimator whose solution is obtained via numerical optimization.
The purpose of this article is to review recent development in SMI approximation based on direct density-ratio estimation and SMI-based machine learning techniques.The remainder of this paper is structured as follows.After reviewing the SMI approximator based on direct density-ratio estimation in Section 2, we illustrate in Section 3 how the SMI approximator can be utilized for solving various machine learning tasks such as: independence testing [26], feature selection [19,27], feature extraction [28,29], canonical dependency analysis [30], independent component analysis [31], object matching [32], clustering [33,34], and causality learning [35].

Definition and Estimation of SMI
In this section, we review the definition of SMI and its approximator based on direct density-ratio estimation.

Definition of SMI
Let us consider two random variables X ∈ X and Y ∈ Y, where X and Y are domains of X and Y , respectively.Let p(x, y) be the joint probability density of X and Y , and p(x) and p(y) be the marginal probability densities of X and Y , respectively.The squared-loss mutual information (SMI) [19] for X and Y is defined as: SMI is always non-negative and it takes zero if and only if X and Y are statistically independent.Hence, SMI can be used for detecting statistical independence between X and Y .Below, we consider the problem of estimating SMI from paired samples {(x i , y i )} n i=1 drawn independently from the joint distribution with density p(x, y).

Least-Squares Estimation of SMI
Here, we review the basic idea and theoretical properties of the SMI approximator called least-squares mutual information (LSMI) [19].

SMI Approximation via Direct Density-Ratio Estimation
The basic idea of LSMI is to directly estimate the following density-ratio function without going through density estimation of p(x, y), p(x), and p(y): Let g(x, y) be a model of the density ratio r(x, y).In LSMI, the model is learned so that the following squared-error J is minimized: where C is a constant defined by:

y)dxdy
Since J contains the expectations over unknown densities p(x)p(y) and p(x, y), the expectations are approximated by empirical averages.Then the LSMI optimization problem is given as follows: where G is a function space from which g is searched.
Finally, the SMI approximator called LSMI is given as: or Equation ( 5) would be the simplest SMI approximator, while Equation ( 6) is suitable for theoretical analysis because this corresponds to the negative of the objective function (4) up to the constant 1/2.These estimators are derived based on the following equivalent expressions of SMI: Equation ( 7) is obtained by expanding the squared term in Equation ( 1), applying Equation ( 2) to the squared density-ratio term once, and showing that the cross-term and the remaining terms are −1 and 1/2, respectively.Equivalence between Equations ( 7) and ( 8) can be confirmed by applying Equation ( 2) to the first term in Equation ( 8) once.Note that Equation ( 8) can also be obtained via the Legendre-Fenchel duality of Equation ( 1), implying that the optimization problem (4) corresponds to approximately maximizing the Legendre-Fenchel lower-bound [15].

Convergence Analysis
Here we briefly review theoretical convergence properties of LSMI.First, let us consider the case where the function class G from which the function g is searched is a parametric model: Suppose that the true density-ratio r is contained in the model G, i.e., there exists θ * (∈ Θ) such that: r = g θ * .Then, it was shown [28] that, under the standard regularity conditions for consistency [for example, see Section 3.2.1 of 36], it holds that: where O p denotes the asymptotic order in probability.This shows that LSMI retains the optimality in terms of the order of convergence in n, because O p (n −1/2 ) is the optimal convergence rate in the parametric setup [37].
Next, we consider non-parametric cases where the function class G is a reproducing kernel Hilbert space [38] on X × Y. Let us consider a non-parametric version of the LSMI optimization problem: where • 2 G denotes the norm in the reproducing kernel Hilbert space G.In the above optimization problem, a regularizer g 2 G is included to avoid overfitting and λ n ≥ 0 is the regularization parameter.Suppose that the true density-ratio function r is contained in the function space G and is bounded from above.Then, it was shown [28] that, if λ n → 0 and λ −1 n = o(n 2/(2+γ) ) where γ (0 < γ < 2) denotes a complexity measure of the function space G based on the bracketing entropy (The larger the value of γ is, the more complex the function space G is) [see p.83 of 36], it holds that: The conditions λ n → 0 and λ −1 n = o(n 2/(2+γ) ) roughly mean that the regularization parameter λ n should be sufficiently small but not too small.Equation (9) means that the convergence rate of the nonparametric version can also be O p (n −1/2 ) for an appropriate choice of λ n , but the non-parametric method requires a milder model assumption.According to [15], the above convergence rate is the minimax optimal rate under some setup.Thus, the convergence property of the above non-parametric method would also be optimal in the same sense.

Practical Implementation of LSMI
We have seen that LSMI has desirable convergence properties.Here we review practical implementation of LSMI.A MATLAB R implementation of LSMI is publicly available [39].

LSMI for Linear-in-Parameter Models
Let us approximate the density ratio Equation (2) using the following linear-in-parameter model: where θ = (θ 1 , . . ., θ b ) are parameters, φ(x, y) = (φ 1 (x, y), . . ., φ b (x, y)) are fixed basis functions, and denotes the transpose.Practical choices of the basis functions will be explained in Section 2.3.2. .For this model, the LSMI optimization problem with an 2 -regularizer is expressed as: where λ ≥ 0 is the regularization parameter that controls the strength of regularization, H is the b × b matrix defined by: and h is the b-dimensional vector defined by: The solution θ can be analytically obtained as: where I b is the b-dimensional identity matrix.Finally, LSMI is also given analytically as: or Some elements of θ may take negative values in the above formulation, which can lead to negative density-ratio values and negative LSMI values.Such negative values may be rounded up to zero if necessary, although this does not happen for sufficiently large n.Another option is to explicitly impose the non-negativity constraint θ 1 , . . ., θ b ≥ 0 on the optimization problem.However, by this modification, the solution can no longer be obtained analytically, but only numerically using a quadratic program solver.(In this case, if the 2 -regularizer is replaced with the 1 -regularizer, the regularization path [40,41]-i.e., solutions for all different regularization parameter values-can be computed efficiently without a quadratic program solver just by solving systems of linear equation [23].)

Design of Basis Functions
The practical accuracy of LSMI depends on the choice of basis functions in the model Equation (10).A typical choice is a non-parametric kernel model, i.e., setting the number of basis function to b = n and the -th basis function to φ (x, y) = K(x, x )L(y, y ): where K(x, x ) and L(y, y ) are kernel functions for x and y, respectively.If n is too large, b may be set to be smaller than n and choose a subset of data points {(x i , y i )} n i=1 as kernel centers.For real vector x ∈ R d , we may practically use the Gaussian kernel for K(x, x ) after element-wise variance normalization: where σ x > 0 is the Gaussian width.When x is a non-vectorial structured object such as a string, a tree, and a graph, we may employ a kernel function defined for such structured data [42].
In the (multi-output) regression scenario where y is a real vector, the Gaussian kernel may also be used for L(y, y ) after element-wise variance normalization: where σ y > 0 is the Gaussian width.In the multi-class classification scenario where y ∈ {1, . . ., c} and c denotes the number of classes, we may use the delta kernel for L(y, y ): Note that, in the classification case with the delta kernel, the LSMI solution can be computed efficiently in a class-wise manner [33].In the multi-label classification scenario where y ∈ {0, 1} c and c denotes the number of labels, we may use the normalized linear kernel function [43] for y: where y = 1 n n i=1 y i is the sample mean.

Model Selection by Cross-Validation
Most of the above kernels include tuning parameters such as the Gaussian width, and the practical performance of LSMI depends on the choice of such kernel parameters and the regularization parameter λ.Model selection of LSMI is possible based on cross-validation with respect to the criterion J defined by Equation (3).
More specifically, the sample set Then the LSMI solution g m (x) is obtained using D\D m (i.e., all samples without D m ), and its J-score for the hold-out samples D m is computed as: where |D m | denotes the number of elements in the set D m .
x,y∈Dm denotes the summation over all combinations of x and y in D m (and thus |D m | 2 terms), while (x,y)∈Dm denotes the summation over all pairs (x, y) in D m (and thus |D m | terms).This procedure is repeated for m = 1, . . ., M , and the average score, Finally, the model (the kernel parameter and the regularization parameter in the current setup) that minimizes J CV is chosen as the most suitable one.

SMI-Based Machine Learning
In this section, we show how the SMI estimator, LSMI, can be used for solving various machine learning tasks.

Independence Testing
First, we show how the SMI estimator can be used for independence testing.

Introduction
Identifying the statistical independence between random variables is one of the fundamental challenges in statistical data analysis.A traditional independence measure between random variables is the Pearson correlation coefficient, which can be used for detecting linear dependency.Recently, kernel-based independence measures have been studied to detect non-linear dependency.The Hilbert-Schmidt independence criterion (HSIC) [44] utilizes cross-covariance operators on universal reproducing kernel Hilbert spaces (RKHSs) [45], which is an infinite-dimensional generalization of covariance matrices.HSIC allows efficient detection of non-linear dependency by making use of the reproducing property of RKHSs [38].However, HSIC has a weakness that its performance depends on the choice of RKHSs and there is no theoretically justified way to determine the RKHS properly thus far.In practice, using the Gaussian RKHS with width set to the median distance between samples is a popular heuristic [46], but this does not always work well.
To overcome the above limitations, an SMI-based independence test called least-squares independence test (LSIT) was proposed [26].Below, we review LSIT.

Independence Testing with SMI
Let x ∈ X be an input feature and y ∈ Y be an output feature, which follow a joint probability distribution with density p(x, y).Suppose that we are given a set of independent and identically distributed (i.i.d.) paired samples {(x i , y i )} n i=1 .The objective of independence testing is to conclude whether x and y are statistically independent or not, based on the samples {(x i , y i )} n i=1 .The SMI-based independence test, where the null hypothesis is that x and y are statistically independent, is based on the permutation test procedure [47].More specifically, LSMI is first run using the original dataset D = {(x i , y i )} n i=1 , and an SMI estimate, LSMI(D), is obtained.Next, {y i } n i=1 are randomly permuted and a shuffled dataset denote permuted samples.Then LSMI is run again using the shuffled dataset D, and an SMI estimate LSMI( D) is obtained.Note that the random permutation eliminates the dependency between x and y (if it exists), and therefore LSMI( D) would take a value close to zero.This random permutation procedure is repeated many times, and the distribution of LSMI( D) under the null-hypothesis that x and y are statistically independent is constructed.Finally, the p-value is approximated by evaluating the relative ranking of LSMI(D) in the distribution of LSMI( D).This procedure is called the least-squares independence test (LSIT) [26].A MATLAB R implementation of LSIT is publicly available [48].

Supervised Feature Selection
Next, we show how the SMI estimator can be used for supervised feature selection.

Introduction
The objective of supervised learning is to learn an input-output relation from input-output paired samples.However, when the dimensionality of input vectors is large, using all input elements could lead to a model interpretability problem.Feature selection is aimed at finding a subset of input elements that is useful for predicting output values [49].
Feature ranking is a simple implementation of feature selection that ranks each feature according to its relevance.In this feature ranking scenario, SMI between a single input variable and an output was shown to be useful [19].However, feature ranking does not take feature interaction into account, and thus it is not useful when each single feature is not capable of predicting outputs, but multiple features are necessary for a valid prediction of outputs (e.g., an XOR problem).Two criteria, relevancy and redundancy, are often used to select multiple features simultaneously: A feature is said to be relevant if it can explain outputs, and features are said to be redundant if they are similar.Ideally, we want to find a subset of features that has high relevance and low redundancy.
Another important issue in feature selection is the computational cost: Naively selecting multiple features causes computational infeasibility because the number of possible feature combinations is exponential with respect to the number of input features.To cope with this problem, a computationally efficient method to handle multiple features called the least absolute shrinkage and selection operator (LASSO) [50] was proposed.In LASSO, a predictor consisting of a weighted sum of each feature is fitted to output values using the least-squares method, while the weight vector is confined in an 1 -ball.The 1 -ball restriction actually provides a notable property that the solution is sparsified, meaning that some of the weight parameters become exactly zero.Thus, LASSO automatically removes irrelevant features from its predictor, which can be achieved through convex optimization in a computationally efficient way [51,52].
However, LASSO can only handle linear predictors and its feature selection characteristic explicitly depends on the squared-loss function used in the least-squares method.To go beyond these limitations, an SMI-based feature selection method called 1 -LSMI was proposed [27].Below, we review 1 -LSMI.

Feature Selection with SMI
The objective of feature selection is, from input feature vector x = (x (1) , . . ., x (d) ) ∈ R d , to choose a subset of its elements that are useful for the prediction of output y ∈ Y. Suppose that we are given n i.i.d.paired samples {(x i , y i )} n i=1 drawn from a joint distribution with density p(x, y).Let w 1 , . . ., w d be feature weights for x (1) , . . ., x (d) , and we learn the weights as: where η ≥ 0 is the regularization parameter that controls the number of features.Because the sign of feature weights is not relevant in feature selection, they are restricted to be non-negative.For non-negative weights, d i=1 w i is reduced to the 1 -norm of the feature weight vector (w 1 , . . ., w d ) .The features having zero weights are regarded as irrelevant in this formulation.
To compute the solution, a simple gradient ascent may be used, where the weight vector is projected onto the positive orthant of the 1 -ball in each iteration to guarantee the feasibility.This can be performed by first projecting the weight vector onto the positive orthant by rounding up negative elements to zero, and then projecting it onto the 1 -ball [54].

Supervised Feature Extraction
While feature selection chooses a subset of features for enhancing model interpretability, feature extraction finds a low-dimensional representation of features that can depend on all features (e.g., through linear combination) for improving the prediction accuracy.Here, we show how the SMI estimator can be used for supervised feature extraction.

Introduction
The goal of sufficient dimension reduction (SDR) is to map input features to low-dimensional expressions while "sufficient" information for predicting output values is maintained [55].Earlier SDR methods developed in the statistics community, such as sliced inverse regression [56], principal Hessian direction [57], and sliced average variance estimation [58], rely on the ellipticity of the data (e.g., Gaussian), but such an assumption may not be fulfilled in practice.To overcome the limitations of these approaches, kernel dimension reduction (KDR) was proposed [59].KDR employs a kernel-based dependence measure that is distribution-free, and thus KDR is flexible.However, it lacks systematic model selection strategies for kernel and regularization parameters.Furthermore, KDR scales poorly to massive datasets and there is no good way to set an initial solution for its gradient-based optimization.In practice, many restarts from different initial solutions may be needed for finding a good local optimum, which makes the entire procedure even slower and the performance of dimension reduction unreliable.
To overcome the above limitations, an SMI-based SDR method called least-squares dimension reduction (LSDR) was proposed [28].An advantage of LSDR is that its tuning parameters can be systematically optimized based on cross-validation.To further address the computational and initialization issues, a heuristic search strategy for LSDR called sufficient component analysis (SCA) was proposed [29].Below, we review LSDR and SCA.

Sufficient Dimension Reduction with SMI
First, we formulate the problem of SDR [55].Let x ∈ R dx be an input vector and y ∈ Y be an output.The goal of SDR is to find a subspace of input domain R dx that contains "sufficient" information about output y.We assume that there exists an orthogonal matrix That is, given the projected feature U * x, the (remaining) feature x is conditionally independent of output y and thus can be discarded without sacrificing the predictability of y.The objective of SDR is to find such U * from n i.i.d.paired samples, {(x i , y i )} n i=1 , drawn from a joint distribution with density p(x, y).We assume that the projection dimensionality d u is known.
SMI can be used for characterizing the optimal projection matrix U * [28].Indeed, it was shown that inequality, holds, and the equality holds if and only if Equation ( 15) holds.Thus, maximizing SMI(U X, Y ) with respect to U leads to U * .In practice, the following optimization problem is solved: This formulation is called least-squares dimension reduction (LSDR) [28].

Gradient-Based Subspace Search
A simple approach to solving the above LSDR optimization problem is the following iterative procedure: The gradient of LSMI({(U x i , y i )} n i=1 ) with respect to U is given by: If the kernel model Equation ( 14) with the Gaussian kernel, ∂U (for , = 1, . . ., n) are given by: The projection of U onto the feasible region specified by U U = I du may be carried out by the Gram-Schmidt process [60], although this is time-consuming.An alternative way to solve the LSDR optimization problem is to perform gradient ascent on the Grassmann manifold [61].In the Euclidean space, the ordinary gradient gives the steepest direction.
However, on a manifold, the natural gradient [62] gives the steepest direction.The natural gradient ∇LSMI(U ) at U is given as follows [63]: Then the geodesic from U to the direction of the natural gradient ∇LSMI(U ) over the Grassmann manifold can be expressed using t ∈ R as: where "exp" for a matrix denotes the matrix exponential, and O dx is the d x × d x zero matrix.Thus, line search along the geodesic in the natural gradient direction is equivalent to finding the maximizer from {U t | t ≥ 0}.For choosing the step size of each gradient update, some approximate line search method such as Armijo's rule [64] or backtracking line search [51] may be used.
A MATLAB R implementation of LSDR is publicly available [65].

Heuristic Subspace Search
Although the natural gradient method is computationally more efficient than the plain gradient method, it is still time consuming.Moreover, many restarts from different initial solutions may be needed for finding a good local optimum.Here, we introduce a heuristic method that can address these issues [29].
A key idea is to use a truncated negative quadratic function called the Epanechnikov kernel [66] as a kernel function for U x: Let I(c) be the indicator function, i.e., I(c) = 1 if c is true and zero otherwise.Then, for the above kernel function, LSMI can be expressed as: where tr(•) denotes the trace of a matrix and D U is the d x × d x matrix defined by: Here, the fact that θ depends on U is explicitly indicated by θ (U ).
If U in D U is replaced by U , where U is a transformation matrix obtained in the previous iteration, the SMI estimator is simplified as: Because D U is independent of U , a maximizer of Equation ( 16) with respect to U can be analytically obtained by , where {u i } du i=1 are the d u principal components of D .The same technique can also be utilized for determining an initial transformation matrix, by computing the above solution for U = I dx (i.e., no dimensionality reduction).
The above heuristic search method for LSDR is called sufficient component analysis (SCA) [29].A MATLAB R implementation of SCA is publicly available [67].

Canonical Dependency Analysis
Next, we show how the SMI estimator can be used for feature extraction from two sets of data.

Introduction
Canonical correlation analysis (CCA) [68] is a classical dimensionality reduction technique for two data sources, and it iteratively finds projection directions with maximum correlation.However, because CCA only captures correlations under linear projections, it is often insufficient to analyze complex real-world data that contain higher-order correlations.To be more flexible, non-linear CCA methods have been explored.A simple approach uses neural networks to handle non-linear projections [69,70], but neural networks are prone to local optima.Another approach first non-linearly transforms data samples into feature spaces and then apply linear CCA [71,72].Given that the non-linear transformation is fixed, this two-step approach allows analytic computation of the global optimal solution via a generalized eigenvalue problem in the same way as linear CCA.This non-linear approach is called kernel CCA (KCCA) because reproducing kernels [38] are used as non-linear transforms.Alternating regression such as the alternating conditional expectation [73] is another possible way to find dependency in a flexible manner, which estimates transformations for two variables alternately by minimizing the squared error between transformed variables.These non-linear variants of CCA are highly flexible, although obtained results are often difficult to interpret due to the non-linearity.
The above non-linear CCA approaches can be regarded as capturing correlations along non-linear projection directions.Another extension of CCA called canonical dependency analysis (CDA) [30] captures higher-order correlations under linear projections.It was shown that KCCA with a universal kernel [45] such as the Gaussian kernel allows efficient detection of higher-order correlations [74].However, the choice of universal kernels affects the practical performance, and there is no systematic method to choose a suitable kernel function.Another approach to higher-order CCA called informational CCA (ICCA) [75] uses mutual information (MI) as a dependency measure, where MI is estimated via kernel density estimation (KDE).Because systematic model selection strategies are available for KDE [76], ICCA could be more practical than the KCCA-based CDA method.In the ICCA method, one-dimensional projection directions are found in an iterative manner.Thus, it would be more powerful if multi-dimensional projection directions (i.e., a subspace) could be directly found in CDA [30].However, ICCA may not be reliable in such a subspace search scenario because it involves the ratio of estimated densities, which tends to produce large estimation error if the dimensionality is not small.
To overcome the above limitation, an SMI-based CDA method called least-squares CDA (LSCDA) was proposed [30].Below, we review LSCDA.

Canonical Dependency Analysis with SMI
Suppose that we are given n i.i.d.paired samples {(x i , y i ) | x i ∈ R dx , y i ∈ R dy } n i=1 drawn from a joint distribution with density p(x, y).CDA is aimed at finding the low-dimensional expressions of x and y that are maximally dependent on each other.Here, we focus on linear dimension reduction, i.e., x and y are transformed as U x and V y, where U ∈ R du×dx and V ∈ R dv×dy are orthogonal matrices with known dimensionalities d u and d v .The objective of CDA is to find the transformation matrices U and V such that the statistical dependency between U x and V y is maximized.Let us use the SMI estimator, LSMI({(U x i , V y i )} n i=1 ), as the dependency measure, i.e., we solve, This formulation is called least-squares CDA (LSCDA) [30].
The above optimization problem can be solved in the same way as LSDR presented in Section 3.3.3.A MATLAB R implementation of LSCDA is publicly available [77].

Independent Component Analysis
Here, we show how the SMI estimator can be used for independent component analysis.

Introduction
Suppose that there exist statistically independent sources of signals, and we observe their mixtures.The purpose of independent component analysis (ICA) [78] is to separate the mixed signals into the original source signals.An approach to ICA is to separate the mixed signals such that statistical independence among separated signals is maximized under some independence measure.
Various methods for evaluating the statistical independence among random variables from samples have been explored so far.A naive approach is to estimate probability densities based on parametric or non-parametric density estimation methods.However, finding an appropriate parametric model is not straightforward without strong prior knowledge and non-parametric estimation is not generally accurate in high-dimensional problems.Thus, this naive approach is not reliable in practice.Another approach is to approximate the entropy based on the Gram-Charlier expansion [79] or the Edgeworth expansion [80].An advantage of this entropy-based approach is that a hard task of density estimation is not directly involved.However, these expansion techniques are based on the assumption that the target density is close to Gaussian, and violation of this assumption can cause large approximation error.
The above approaches are based on the probability densities of signals.Another line of research that does not explicitly involve probability densities employs non-linear correlation-signals are statistically independent if and only if all non-linear correlations among signals vanish.Following this line, computationally efficient algorithms have been developed based on a contrast function [81,82], which is an approximation of the entropy or mutual information.However, non-linearities in the contrast function need to be pre-specified in these methods, and thus they could be inaccurate if the predetermined non-linearities do not match the target distribution.To cope with this problem, the kernel trick has been applied in ICA, which allows computationally efficient evaluation of all non-linear correlations citeJMLR:Bach+Jordan:2002.However, its practical performance depends on the choice of kernels (more specifically, the Gaussian kernel width) and there seems no theoretically justified method to determine the kernel width.This is a critical problem in unsupervised learning tasks such as ICA.
To cope with this problem, an SMI-based ICA algorithm called least-squares independent component analysis (LICA) has been developed [31].Below, we review LICA.

Independent Component Analysis with SMI
Suppose there are d signal sources and let: i=1 be i.i.d.samples drawn from a distribution with density p(x).We assume that elements x (1) , . . ., x (d) are statistically independent of each other, i.e., p(x) is factorized as: We cannot directly observe {x i } n i=1 , but only their linearly mixed samples {y i } n i=1 : where U is a d × d invertible matrix called the mixing matrix.
The goal of ICA is, from the mixed samples {y i } n i=1 , to obtain a demixing matrix V that recovers the original source samples {x i } n i=1 .We denote the demixed samples by {z i } n i=1 : The ideal solution is V = U −1 , but we can only recover the source signals up to permutation and scaling of components of x due to non-identifiability of the ICA setup [78].Let us denote the demixed samples by: i , . . ., z i ) := V y i for i = 1, . . ., n.
A direct approach to ICA is to determine V so that elements of z are as statistically independent as possible.Here, we adopt SMI as the independence measure: We try to find the demixing matrix V that minimizes SMI.In practice, the following optimization problem is solved: where LSMI({V y i } n i=1 ) is given by the same form as Equation ( 12) (or Equation ( 13)), but the matrix H and the vector h are defined in a slightly different way.For the Gaussian kernel, H and h are given by: This formulation is called least-squares independent component analysis (LICA) [31].

Gradient-Based Demixing Matrix Search
Based on the plain gradient technique, an update rule of V is given by: where t (> 0) is the step size.The gradient ∂LSMI ∂V is given by: where In ICA, scaling of components of z can be arbitrary.This implies that the above gradient updating rule can lead to a solution with poor scaling, which is not preferable from a numerical viewpoint.To avoid possible numerical instability, V is normalized at each gradient iteration as:

Natural Gradient Demixing Matrix Search
Suppose that data samples are whitened, i.e., samples {y i } n i=1 are pre-transformed as: where Σ is the sample covariance matrix: Then it can be shown that any demixing matrix that eliminates the second order correlation is an orthogonal matrix [78].Thus, for whitened data, the search space of V can be restricted to the orthogonal group without loss of generality.The natural gradient [62] update rule on the orthogonal group is given by: where "exp" for a matrix denotes the matrix exponential and t (> 0) is the step size.
A MATLAB R implementation of LICA is publicly available [83].

Cross-Domain Object Matching
Next, we show how the SMI estimator can be used for cross-domain object matching.

Introduction
The objective of cross-domain object matching is to match two sets of unpaired objects in different domains.For example, in photo album summarization, we are given a set of photos and a designed photo frame expressed as a set of photo slots in the Cartesian coordinate system, and we want to automatically assign the photos into the designed photo frame.A typical approach of cross-domain object matching is to find a mapping from objects in one domain (photos) to objects in the other domain (frame) so that the pairwise dependency is maximized.In this scenario, accurately evaluating the dependence between objects is a key issue.
Kernelized sorting [84] tries to find the mapping between two domains that maximizes mutual information under the Gaussian assumption.However, because the Gaussian assumption may not be fulfilled in practice, this method tends to perform poorly.To overcome the above limitation, the kernel-based dependence measure called the Hilbert-Schmidt independence criterion (HSIC) [85] was proposed to use in kernelized sorting [86].Because HSIC is distribution-free, HSIC-based kernelized sorting is more flexible than the original method based on the Gaussian assumption.However, HSIC includes a tuning parameter (more specifically, the Gaussian kernel width), and its choice is crucial to obtain better performance [87].
To cope with this problem, an SMI-based cross-domain object matching method called least-squares object matching (LSOM) was developed [32].Below, we review LSOM.

Cross-Domain Object Matching with SMI
The goal of cross-domain object matching is, given two sets of unpaired samples of the same size, , to find a mapping that well "matches" them.Let π be a permutation function over {1, . . ., n}.The optimal permutation, denoted by π * , can be obtained as the maximizer of the dependency between the two sets {x i } n i=1 and {y π(i) } n i=1 .Here, we use the SMI approximator, LSMI({(x i , y π(i) )} n i=1 ), as the dependency measure, i.e., we solve, Let K and L be the n × n kernel matrices defined by K i,j = K(x i , x j ) and L i,j = L(y i , y j ).Then LSMI for {(x i , y π(i) )} n i=1 can be expressed as: where Π is the permutation matrix corresponding to π, i.e., Π is the n × n zero-one matrix such that Π i,j = 1 if i = π(j) for j = 1, . . ., n and Π i,j = 0 otherwise.Θ Π is the diagonal matrix with diagonal elements given by the LSMI solution θ π obtained by paired data {(x i , y π(i) )} n i=1 (see Equation ( 11)).Because maximizing Equation ( 18) with respect to Π is computationally infeasible, greedy update from previous solution Π is used in practice: where 0 < t ≤ 1 is the step size.Maximization of the second term is called a linear assignment problem, which can be solved efficiently by the Hungarian method [88].
The above method is called least-squares object matching (LSOM) [32].A MATLAB R implementation of LSOM is publicly available [89].

Clustering
Here, we show how SMI can be effectively used for clustering.

Introduction
The objective of clustering is to classify data samples into disjoint groups in an unsupervised manner.K-means [90] is a classic but still popular clustering algorithm.However, k-means only produces linearly separated clusters, and thus its usefulness is rather limited in practice.To cope with this problem, various non-linear clustering methods have been developed.Kernel k-means [91] performs k-means in a feature space induced by a reproducing kernel function [46].Spectral clustering [92,93] first unfolds non-linear data manifolds by a spectral embedding method, and then performs k-means in the embedded space.Blurring mean-shift [94,95] uses a non-parametric kernel density estimator for modeling the data-generating probability density, and finds clusters based on the modes of the estimated density.Discriminative clustering learns a discriminative classifier for separating clusters, where class labels are also treated as parameters to be optimized [96,97].Dependence-maximization clustering determines cluster assignments so that their dependence on input data is maximized [34,98,99].
Information-maximization clustering exhibited the state-of-the-art performance [100,101], where probabilistic classifiers such as a kernelized Gaussian classifier [100] and a kernel logistic regression classifier [101] are learned so that mutual information between feature vectors and cluster assignments is maximized in an unsupervised manner.A notable advantage of information-maximization clustering is that classifier training is formulated as continuous optimization, which is substantially simpler than discrete optimization of cluster assignments.Indeed, classifier training can be carried out in computationally efficient manners by a gradient method [100] or a quasi-Newton method [101].Furthermore, a model selection strategy based on the information-maximization principle is also provided [100].Thus, kernel parameters can be systematically optimized in an unsupervised way.However, the optimization problems of these clustering methods are non-convex and finding a good local optimal solution is not straightforward in practice.
To overcome the above limitation, an SMI-based clustering method called SMI clustering (SMIC) was proposed [33].Below, we review SMIC.

Clustering with SMI
Suppose that we are given d-dimensional i.i.d.feature vectors of size n, i=1 , which are drawn independently from a distribution with density p(x).The goal of clustering is to give cluster assignments, {y i | y i ∈ {1, . . ., c}} n i=1 , to the feature vectors {x i } n i=1 , where c denotes the number of clusters.c is assumed to be pre-fixed below.To solve the clustering problem, the information-maximization approach is taken [100,101].That is, clustering is regarded as an unsupervised classification problem, and the class-posterior probability p(y|x) is learned so that "information" between feature vector x and cluster label y is maximized.
As an information measure, SMI Equation ( 1) is adopted, which can expressed as: Suppose that the class-prior probability p(y) is set to a user-specified value π y for y = 1, . . ., c, where π y > 0 and c y=1 π y = 1.Without loss of generality, {π y } c y=1 are assumed to be sorted in the ascending order: y=1 is unknown, the uniform class-prior distribution may be adopted: Substituting π y into p(y), we can express Equation ( 19) as: Let us approximate the class-posterior probability p(y|x) by the following kernel model: where α = (α 1,1 , . . ., α c,n ) is the parameter vector and K(x, x ) denotes a kernel function.A useful example of kernel functions is the local-scaling kernel [102] defined as: where N k (x) denotes the set of k nearest neighbors for x (k is the kernel parameter), σ i is a local scaling factor defined as , and x (k) i is the k-th nearest neighbor of x i .Note that we did not include the normalization term in Equation ( 21) because model outputs will be normalized later (see Equation ( 22)).
Further approximating the expectation with respect to p(x) included in Equation ( 20) by the empirical average of samples {x i } n i=1 , we arrive at the following SMI approximator: where α y := (α y,1 , . . ., α y,n ) and K i,j := K(x i , x j ).
For each cluster y, α y K 2 α y is maximized under α y = 1.Since this is the Rayleigh quotient, the maximizer is given by the normalized principal eigenvector of K [104].To avoid all the solutions {α y } c y=1 to be reduced to the same principal eigenvector, their mutual orthogonality is imposed: Then the solutions are given by the normalized eigenvectors ψ 1 , . . ., ψ c associated with the eigenvalues Since the sign of ψ y is arbitrary, the sign is set as: where sign(•) denotes the sign of a scalar and 1 n denotes the n-dimensional vector with all ones.On the other hand, because and the class-prior probability p(y) was set to π y for y = 1, . . ., c, the following normalization condition is obtained: Furthermore, probability estimates should be non-negative, which can be achieved by rounding up negative outputs to zero.By taking these normalization and non-negativity issues into account, cluster assignment y i for x i is determined as the maximizer of the approximation of p(y|x i ): where the "max" operation for vectors is applied in the element-wise manner and [•] i denotes the i-th element of a vector.Note that K ψ y = λ y ψ y was used in the above derivation.For out-of-sample prediction, cluster assignment y for new sample x may be obtained as: The above method is called SMI-based clustering (SMIC) [33].LSMI can be used for model selection of SMIC, i.e., LSMI is computed as a function of the kernel parameter included in K(x, x ) and the maximizer of LSMI is chosen as the most promising one.A MATLAB R implementation of SMIC is publicly available [103].

Causal Direction Estimation
Finally, we show how the SMI estimator can be used for causal direction estimation.

Introduction
Learning causality from data is one of the important challenges in the artificial intelligence, statistics, and machine learning communities [105].A traditional method of learning causal relationship from observational data is based on the linear-dependence Gaussian-noise model [106].However, the linear-Gaussian assumption is too restrictive and may not be fulfilled in practice.Recently, non-Gaussianity and non-linearity have been shown to be beneficial in causal inference, because it can break symmetry between observed variables [107,108].Since then, much attention has been paid to the discovery of non-linear causal relationship through non-Gaussian noise models [109].
In the framework of non-linear non-Gaussian causal inference, the relation between a cause X and an effect Y is assumed to be described by Y = f (X) + E, where f is a non-linear function and E is non-Gaussian additive noise that is independent of the cause X.Under this additive noise assumption, it was shown [108] that the causal direction between X and Y can be identified based on a hypothesis test of whether the causal model Y = f (X) + E or the alternative model X = f (Y ) + E fits the data well-here, the goodness of fit is measured by independence between inputs and residuals (i.e., estimated noise).In [108], the functions f and f were learned by the Gaussian process (GP) regression [110], and the independence between inputs and residuals was evaluated by the Hilbert-Schmidt independence criterion (HSIC) [85].
However, standard regression methods such as GP are designed to handle Gaussian noise, and thus they may not be suited for discovering causality in the non-Gaussian additive noise formulation.To cope with this problem, an alternative regression method called HSIC regression was proposed [109], which learns a function so that the dependence between inputs and residuals is directly minimized based on HSIC.Through experiments, HSIC regression was shown to outperform the GP-based method [109].However, the choice of the kernel width in HSIC regression heavily affects the sensitivity of the independence measure, and systematic model selection strategies are not available.Another weakness of HSIC regression is that the kernel width of the regression model is fixed to the same value as HSIC.This crucially limits the flexibility of function approximation in HSIC regression.
To overcome the above weaknesses, an SMI-based regression method for causal inference called least-squares independence regression (LSIR) was developed [35].Below, we review LSIR.

Dependence Minimizing Regression with SMI
Suppose random variables X ∈ R and Y ∈ R are connected by the following additive noise model [108]: where f : R → R is some non-linear function and E ∈ R is a zero-mean random variable that is independent of X.The goal of dependence minimizing regression is, from i.i.d.paired samples {(x i , y i )} n i=1 , to obtain a function f such that input X and estimated additive noise Let us employ a linear model for dependence minimizing regression: where m is the number of basis functions, β = (β 1 , . . ., β m ) are regression parameters, and ψ(x) = (ψ 1 (x), . . ., ψ m (x)) are basis functions.In LSMI-based dependence minimization regression, the regression parameters β are learned as: where e i = y i − f β (x i ) is the residual and γ > 0 is the regularization parameter to avoid overfitting.For regression parameter learning, a gradient descent method may be used: where t is the step size.The gradient ∂LSMI ∂β can be approximately expressed as: × (e j − e )ψ(x i ) + (e i − e )ψ(x j ) Note that, in the above derivation, the dependence of β on e i is ignored for simplicity.Although it is possible to exactly compute the derivative in principle, this approximated expression is computationally more efficient with good performance in practice.
By taking into account the assumption that the mean of noise E is zero, the final regressor is obtained as: This method is called least-squares independence regression (LSIR) [35].A MATLAB R implementation of LSIR is publicly available [111].

Causal Direction Inference by LSIR
Our final goal is, given i.i.d.paired samples {(x i , y i )} n i=1 , to determine whether X causes Y or vice versa under the additive noise assumption.To this end, we test whether the causal model Y = f Y (X) + E Y or the alternative model X = f X (Y )+E X fits the data well, where the goodness of fit is measured by independence between inputs and residuals (i.e., estimated noise).Independence of inputs and residuals may be decided in practice based on the permutation test procedure [47].
More specifically, LSIR is first run for {(x i , y i )} n i=1 as usual, and obtain a regression function f .This procedure also provides an SMI estimate, LSMI({(x i , e i )} n i=1 ), where e i = y i − f (x i ).Next, pairs of input and residual {(x i , e i )} n i=1 are randomly permuted as {(x i , e π(i) )} n i=1 , where π(•) is a randomly generated permutation function.Note that the permuted pairs of samples are independent of each other because the random permutation breaks the dependency between X and E (if it exists).Then, an SMI estimate for the permuted data, LSMI({(x i , e π(i) )} n i=1 ), is computed.This random permutation process is repeated many times, and the distribution of LSMI values under the null-hypothesis that X and E are independent is constructed.Finally, the p-value is approximated by evaluating the relative ranking of LSMI computed from the original input-residual data, LSMI({(x i , e i )} n i=1 ), over the distribution of LSMI values for randomly permuted data.
In order to decide the causal direction, the p-values p X→Y and p X←Y for both directions X → Y (i.e., X causes Y ) and X ← Y (i.e., Y causes X) are computed.Then, for a given significance level δ, the causal direction is determined as follows: • If p X→Y > δ and p X←Y ≤ δ, the causal model X → Y is chosen.
• If p X←Y > δ and p X→Y ≤ δ, the causal model X ← Y is selected.
• If p X→Y , p X←Y ≤ δ, perhaps there is no causal relation between X and Y or our modeling assumption is not correct (e.g., an unobserved confounding variable exists).• If p X→Y , p X←Y > δ, perhaps our modeling assumption is not correct or it is not possible to identify a causal direction (i.e., X, Y , and E are Gaussian random variables).
When we have prior knowledge that there exists a causal relation between X and Y but the causal direction is unknown, the values of p X→Y and p X←Y may be simply compared for determining the causal direction as follows: • If p X→Y > p X←Y , we conclude that X causes Y .
• Otherwise, we conclude that Y causes X.This simplified procedure does not include the computational expensive permutation process and thus it is computationally very efficient.

Conclusions
In this article, we reviewed recent development in the estimation of squared-loss mutual information (SMI) and its application to machine learning.The key idea for accurately estimating SMI is to directly estimate the ratio of probability densities without separately estimating each density.A notable advantage of the SMI estimator called least-squares mutual information (LSMI) [19] is that it can be computed analytically in a computationally more efficient and numerically more stable way than ordinary MI.
We have introduced SMI as a measure of statistical independence between random variables.On the other hand, ordinary MI has a rich information-theoretic interpretation via entropies.Thus, it is important to investigate an information-theoretic meaning of SMI, which remains to be an open question currently.
Various methods of direct density-ratio estimation have been explored so far [16,18], and such density ratio estimators were shown to be applicable to an even wider class of machine learning tasks beyond SMI estimation, such as non-stationarity adaptation [112], outlier detection [113], change detection [114,115], class-balance estimation [116], two-sample homogeneity testing [117,118], probabilistic classification [119,120], and conditional density estimation [121].
Improving the accuracy of density ratio estimation contributes to enhancing the performance of the above machine learning solutions.Recent advances in this line of research include dimensionality reduction for density ratio estimation [122][123][124], a unified statistical framework of density ratio estimation [18], and extensions to relative density ratios [125] and density differences [126].Further improving the accuracy and computational efficiency and exploring new application areas are important future directions to pursue.
More program codes are publicly available [127].