Next Article in Journal
The Relation between Granger Causality and Directed Information Theory: A Review
Next Article in Special Issue
Minimum Mutual Information and Non-Gaussianity through the Maximum Entropy Method: Estimation from Finite Samples

Metrics 0

Export Article

Entropy 2013, 15(1), 80-112; doi:10.3390/e15010080

Review
Machine Learning with Squared-Loss Mutual Information
Department of Computer Science, Tokyo Institute of Technology 2-12-1 O-okayama, Meguro-ku,Tokyo 152-8552, Japan
Received: 29 October 2012; in revised form: 7 December 2012 / Accepted: 21 December 2012 / Published: 27 December 2012

Abstract

:
Mutual information (MI) is useful for detecting statistical independence between random variables, and it has been successfully applied to solving various machine learning problems. Recently, an alternative to MI called squared-loss MI (SMI) was introduced. While ordinary MI is the Kullback–Leibler divergence from the joint distribution to the product of the marginal distributions, SMI is its Pearson divergence variant. Because both the divergences belong to the f-divergence family, they share similar theoretical properties. However, a notable advantage of SMI is that it can be approximated from data in a computationally more efficient and numerically more stable way than ordinary MI. In this article, we review recent development in SMI approximation based on direct density-ratio estimation and SMI-based machine learning techniques such as independence testing, dimensionality reduction, canonical dependency analysis, independent component analysis, object matching, clustering, and causal inference.
Keywords:
squared-loss mutual information; Pearson divergence; density-ratio estimation; independence testing; dimensionality reduction; independent component analysis; object matching; clustering; causal inference; machine learning

1. Introduction

Mutual information (MI) [1,2] for random variables X and Y is defined as:
$MI ( X , Y ) : = ∫ ∫ p ( x , y ) log p ( x , y ) p ( x ) p ( y ) d x d y$
where $p ( x , y )$ is the joint probability density of X and Y, and $p ( x )$ and $p ( y )$ are the marginal probability densities of X and Y, respectively (Precisely, $p ( x , y )$, $p ( x )$, and $p ( y )$ are different functions and thus should be denoted, e.g., by $p X , Y ( x , y )$, $p X ( x )$, and $p Y ( y )$, respectively. However, we use the simplified notations for the sake of brevity). Statistically, MI can be regarded as the Kullback–Leibler divergence [3] from the joint density $p ( x , y )$ to the product of the marginals $p ( x ) p ( y )$, and thus can be regarded as a measure of statistical dependency between X and Y. Estimation of MI from data samples has been one of the major challenges in information science and various approaches have been developed thus far.
The most naive approach to approximating MI from data would be to use a non-parametric density estimator such as kernel density estimation (KDE) [4], i.e., the densities $p ( x , y )$, $p ( x )$, and $p ( y )$ included in MI are separately estimated from samples, and the estimated densities are used for approximating MI. However, density estimation is known to be a hard problem [5] and division by estimated densities tends to magnify the estimation error. Therefore, the KDE-based MI approximator may not be reliable in practice.
Another approach uses histogram-based density estimators with data-dependent partitioning. In the context of estimating the Kullback–Leibler divergence [3], histogram-based methods have been studied thoroughly and their consistency has been established [6,7,8]. However, the rate of convergence has not been elucidated yet, and such histogram-based methods are strongly influenced by the curse of dimensionality. Thus, these methods may not be reliable in high-dimensional problems.
MI can be expressed in terms of the entropies as:
$MI ( X , Y ) = H ( X ) + H ( Y ) - H ( X , Y )$
where $H ( X )$ denotes the entropy of X:
$H ( X ) : = - ∫ p ( x ) log p ( x ) d x$
Based on this expression, the nearest neighbor distance has been used for approximating MI [9]. Such a nearest neighbor approach was shown to perform better than the naive KDE-based approach [10], given that the number k of nearest neighbors is chosen appropriately—a small (large) k yields an estimator with small (large) bias and large (small) variance. However, appropriately determining the value of k so that the bias-variance trade-off is optimally controlled is not straightforward in the context of MI estimation. A similar nearest-neighbor idea has been applied to Kullback–Leibler divergence estimation [11], whose consistency has been proved for finite k—this is an interesting result since Kullback–Leibler divergence estimation is consistent even when density estimation is not consistent. However, the rate of convergence seems to be still an open research issue.
Approximation of the entropies based on the Edgeworth expansion has also been explored in the context of MI estimation [12]. Such a method works well when the target density is close to Gaussian. However, if the target density is far from Gaussian, the Edgeworth expansion method is no longer reliable.
More recently, an MI approximator via direct estimation of the density ratio $p ( x , y ) p ( x ) p ( y )$ has been developed [13], which is based on a Kullback–Leibler divergence approximator via direct density-ratio estimation [14,15,16]. The MI approximator is given as the solution of a convex optimization problem, which tends to be sparse [14]. A notable advantage of this density-ratio method is that it does not involve separate estimation of densities $p ( x , y )$, $p ( x )$, and $p ( y )$, and it was proved to achieve the optimal non-parametric convergence rate. However, due to the “log” operation included in MI, this MI approximator is computationally rather expensive and susceptible to outliers [17,18].
To cope with these problems, a variant of MI called the squared-loss mutual information (SMI) [19] has been explored recently. SMI for X and Y is defined as:
$SMI ( X , Y ) : = 1 2 ∫ ∫ p ( x ) p ( y ) p ( x , y ) p ( x ) p ( y ) - 1 2 d x d y$
SMI is the Pearson divergence [20] from the joint density $p ( x , y )$ to the product of the marginals $p ( x ) p ( y )$. It is always non-negative and it vanishes if and only if X and Y are statistically independent. Note that both the Pearson divergence and the Kullback–Leibler divergence belong to the class of Ali–Silvey–Csiszár divergences (which is also known as f-divergences) [21,22], meaning that they share similar properties.
In a similar way to ordinary MI, SMI can be approximated accurately via direct estimation of the density ratio $p ( x , y ) p ( x ) p ( y )$ [19], which is based on a Pearson divergence approximator via direct density-ratio estimation [16,23]. This SMI approximator has various desirable properties: For example, it was proved to achieve the optimal non-parametric convergence rate [24], its solution can be obtained analytically just by solving a system of linear equations, it has superior numerical properties [25], and it is robust against outliers [17,18].
In particular, the property of the SMI approximator that an analytic solution is available is highly useful in machine learning, because this allows explicit computation of the derivative of the SMI approximator with respect to another parameter. For example, in supervised dimensionality reduction, linear transformation U for input x is optimized so that the transformed input $U x$ has the highest dependency on output y. In this context, the derivative of the SMI estimator between $U x$ and y with respect to transformation U can be exploited for optimizing transformation U. On the other hand, such derivative computation is not straightforward for the MI estimator whose solution is obtained via numerical optimization.
The purpose of this article is to review recent development in SMI approximation based on direct density-ratio estimation and SMI-based machine learning techniques. The remainder of this paper is structured as follows. After reviewing the SMI approximator based on direct density-ratio estimation in Section 2, we illustrate in Section 3 how the SMI approximator can be utilized for solving various machine learning tasks such as: independence testing [26], feature selection [19,27], feature extraction [28,29], canonical dependency analysis [30], independent component analysis [31], object matching [32], clustering [33,34], and causality learning [35].

2. Definition and Estimation of SMI

In this section, we review the definition of SMI and its approximator based on direct density-ratio estimation.

2.1. Definition of SMI

Let us consider two random variables $X ∈ X$ and $Y ∈ Y$, where $X$ and $Y$ are domains of X and Y, respectively. Let $p ( x , y )$ be the joint probability density of X and Y, and $p ( x )$ and $p ( y )$ be the marginal probability densities of X and Y, respectively. The squared-loss mutual information (SMI) [19] for X and Y is defined as:
$SMI ( X , Y ) : = 1 2 ∫ ∫ p ( x ) p ( y ) p ( x , y ) p ( x ) p ( y ) - 1 2 d x d y$
SMI is always non-negative and it takes zero if and only if X and Y are statistically independent. Hence, SMI can be used for detecting statistical independence between X and Y.
Below, we consider the problem of estimating SMI from paired samples ${ ( x i , y i ) } i = 1 n$ drawn independently from the joint distribution with density $p ( x , y )$.

2.2. Least-Squares Estimation of SMI

Here, we review the basic idea and theoretical properties of the SMI approximator called least-squares mutual information (LSMI) [19].

2.2.1. SMI Approximation via Direct Density-Ratio Estimation

The basic idea of LSMI is to directly estimate the following density-ratio function without going through density estimation of $p ( x , y )$, $p ( x )$, and $p ( y )$:
$r ( x , y ) : = p ( x , y ) p ( x ) p ( y )$
Let $g ( x , y )$ be a model of the density ratio $r ( x , y )$. In LSMI, the model is learned so that the following squared-error J is minimized:
$J ( g ) : = 1 2 ∫ ∫ g ( x , y ) - r ( x , y ) 2 p ( x ) p ( y ) d x d y (3) : = 1 2 ∫ ∫ g ( x , y ) 2 p ( x ) p ( y ) d x d y - ∫ ∫ g ( x , y ) p ( x , y ) d x d y + C$
where C is a constant defined by:
$C : = 1 2 ∫ ∫ r ( x , y ) p ( x , y ) d x d y$
Since J contains the expectations over unknown densities $p ( x ) p ( y )$ and $p ( x , y )$, the expectations are approximated by empirical averages. Then the LSMI optimization problem is given as follows:
$g ^ : = argmin g ∈ G 1 2 n 2 ∑ i , j = 1 n g ( x i , y j ) 2 - 1 n ∑ i = 1 n g ( x i , y i )$
where $G$ is a function space from which g is searched.
Finally, the SMI approximator called LSMI is given as:
$LSMI ( { ( x i , y i ) } i = 1 n ) : = 1 2 n ∑ i = 1 n g ^ ( x i , y i ) - 1 2$
or
$LSMI ′ ( { ( x i , y i ) } i = 1 n ) : = - 1 2 n 2 ∑ i , j = 1 n g ^ ( x i , y j ) 2 + 1 n ∑ i = 1 n g ^ ( x i , y i ) - 1 2$
Equation (5) would be the simplest SMI approximator, while Equation (6) is suitable for theoretical analysis because this corresponds to the negative of the objective function (4) up to the constant $1 / 2$. These estimators are derived based on the following equivalent expressions of SMI:
$SMI ( X , Y ) = 1 2 ∫ ∫ r ( x , y ) p ( x , y ) d x d y - 1 2 (8) = - 1 2 ∫ ∫ r ( x , y ) 2 p ( x ) p ( y ) d x d y + ∫ ∫ r ( x , y ) p ( x , y ) d x d y - 1 2$
Equation (7) is obtained by expanding the squared term in Equation (1), applying Equation (2) to the squared density-ratio term once, and showing that the cross-term and the remaining terms are $- 1$ and $1 / 2$, respectively. Equivalence between Equations (7) and (8) can be confirmed by applying Equation (2) to the first term in Equation (8) once. Note that Equation (8) can also be obtained via the Legendre–Fenchel duality of Equation (1), implying that the optimization problem (4) corresponds to approximately maximizing the Legendre–Fenchel lower-bound [15].

2.2.2. Convergence Analysis

Here we briefly review theoretical convergence properties of LSMI.
First, let us consider the case where the function class $G$ from which the function g is searched is a parametric model:
$G = { g θ ( x , y ) ∣ θ ∈ Θ ⊂ R b }$
Suppose that the true density-ratio r is contained in the model $G$, i.e., there exists $θ *$ ($∈ Θ$) such that: $r = g θ *$. Then, it was shown [28] that, under the standard regularity conditions for consistency [36], it holds that:
$LSMI ′ ( { ( x i , y i ) } i = 1 n ) - SMI ( X , Y ) = O p ( n - 1 / 2 )$
where $O p$ denotes the asymptotic order in probability. This shows that $LSMI ′$ retains the optimality in terms of the order of convergence in n, because $O p ( n - 1 / 2 )$ is the optimal convergence rate in the parametric setup [37].
Next, we consider non-parametric cases where the function class $G$ is a reproducing kernel Hilbert space [38] on $X × Y$. Let us consider a non-parametric version of the LSMI optimization problem:
$g ^ : = argmin g ∈ G 1 2 n 2 ∑ i , j = 1 n g ( x i , y j ) 2 - 1 n ∑ i = 1 n g ( x i , y i ) + λ n 2 ∥ g ∥ G 2$
where $∥ · ∥ G 2$ denotes the norm in the reproducing kernel Hilbert space $G$. In the above optimization problem, a regularizer $∥ g ∥ G 2$ is included to avoid overfitting and $λ n ≥ 0$ is the regularization parameter.
Suppose that the true density-ratio function r is contained in the function space $G$ and is bounded from above. Then, it was shown [28] that, if $λ n → 0$ and $λ n - 1 = o ( n 2 / ( 2 + γ ) )$ where γ$( 0 < γ < 2 )$ denotes a complexity measure of the function space $G$ based on the bracketing entropy (The larger the value of γ is, the more complex the function space $G$ is) [see p.83 of 36], it holds that:
$LSMI ′ ( { ( x i , y i ) } i = 1 n ) - SMI ( X , Y ) = O p max ( λ n , n - 1 / 2 )$
The conditions $λ n → 0$ and $λ n - 1 = o ( n 2 / ( 2 + γ ) )$ roughly mean that the regularization parameter $λ n$ should be sufficiently small but not too small. Equation (9) means that the convergence rate of the non-parametric version can also be $O p ( n - 1 / 2 )$ for an appropriate choice of $λ n$, but the non-parametric method requires a milder model assumption. According to [15], the above convergence rate is the minimax optimal rate under some setup. Thus, the convergence property of the above non-parametric method would also be optimal in the same sense.

2.3. Practical Implementation of LSMI

We have seen that LSMI has desirable convergence properties. Here we review practical implementation of LSMI. A MATLAB$®$ implementation of LSMI is publicly available [39].

2.3.1. LSMI for Linear-in-Parameter Models

Let us approximate the density ratio Equation (2) using the following linear-in-parameter model:
$g θ ( x , y ) = ∑ ℓ = 1 b θ ℓ ϕ ℓ ( x , y ) = θ ⊤ ϕ ( x , y )$
where $θ = ( θ 1 , … , θ b ) ⊤$ are parameters, $ϕ ( x , y ) = ( ϕ 1 ( x , y ) , … , ϕ b ( x , y ) ) ⊤$ are fixed basis functions, and $⊤$ denotes the transpose. Practical choices of the basis functions will be explained in Section 2.3.2. For this model, the LSMI optimization problem with an $ℓ 2$-regularizer is expressed as:
$θ ^ : = argmin θ ∈ R b 1 2 θ ⊤ H ^ θ - θ ⊤ h ^ + λ 2 θ ⊤ θ$
where $λ ≥ 0$ is the regularization parameter that controls the strength of regularization, $H ^$ is the $b × b$ matrix defined by:
$H ^ : = 1 n 2 ∑ i , j = 1 n ϕ ( x i , y j ) ϕ ( x i , y j ) ⊤$
and $h ^$ is the b-dimensional vector defined by:
$h ^ : = 1 n ∑ i = 1 n ϕ ( x i , y i )$
The solution $θ ^$ can be analytically obtained as:
$θ ^ = ( H ^ + λ I b ) - 1 h ^$
where $I b$ is the b-dimensional identity matrix. Finally, LSMI is also given analytically as:
$LSMI ( { ( x i , y i ) } i = 1 n ) = 1 2 h ^ ⊤ θ ^ - 1 2$
or
$LSMI ′ ( { ( x i , y i ) } i = 1 n ) = - 1 2 θ ^ ⊤ H ^ θ ^ + h ^ ⊤ θ ^ - 1 2$
Some elements of $θ ^$ may take negative values in the above formulation, which can lead to negative density-ratio values and negative LSMI values. Such negative values may be rounded up to zero if necessary, although this does not happen for sufficiently large n. Another option is to explicitly impose the non-negativity constraint $θ 1 , … , θ b ≥ 0$ on the optimization problem. However, by this modification, the solution can no longer be obtained analytically, but only numerically using a quadratic program solver. (In this case, if the $ℓ 2$-regularizer is replaced with the $ℓ 1$-regularizer, the regularization path [40,41]—i.e., solutions for all different regularization parameter values—can be computed efficiently without a quadratic program solver just by solving systems of linear equation [23].)

2.3.2. Design of Basis Functions

The practical accuracy of LSMI depends on the choice of basis functions in the model Equation (10). A typical choice is a non-parametric kernel model, i.e., setting the number of basis function to $b = n$ and the -th basis function to $ϕ ℓ ( x , y ) = K ( x , x ℓ ) L ( y , y ℓ )$:
$g θ ( x , y ) = ∑ ℓ = 1 n θ ℓ K ( x , x ℓ ) L ( y , y ℓ )$
where $K ( x , x ′ )$ and $L ( y , y ′ )$ are kernel functions for x and y, respectively. If n is too large, b may be set to be smaller than n and choose a subset of data points ${ ( x i , y i ) } i = 1 n$ as kernel centers.
For real vector $x ∈ R d$, we may practically use the Gaussian kernel for $K ( x , x ′ )$ after element-wise variance normalization:
$K ( x , x ′ ) = exp - ∥ x - x ′ ∥ 2 2 σ x 2$
where $σ x > 0$ is the Gaussian width. When x is a non-vectorial structured object such as a string, a tree, and a graph, we may employ a kernel function defined for such structured data [42].
In the (multi-output) regression scenario where y is a real vector, the Gaussian kernel may also be used for $L ( y , y ′ )$ after element-wise variance normalization:
$L ( y , y ′ ) = exp - ∥ y - y ′ ∥ 2 2 σ y 2$
where $σ y > 0$ is the Gaussian width. In the multi-class classification scenario where $y ∈ { 1 , … , c }$ and c denotes the number of classes, we may use the delta kernel for $L ( y , y ′ )$:
Note that, in the classification case with the delta kernel, the LSMI solution can be computed efficiently in a class-wise manner [33]. In the multi-label classification scenario where $y ∈ { 0 , 1 } c$ and c denotes the number of labels, we may use the normalized linear kernel function [43] for y:
$L ( y , y ′ ) = ( y - y ¯ ) ⊤ ( y ′ - y ¯ ) ∥ y - y ¯ ∥ ∥ y ′ - y ¯ ′ ∥$
where $y ¯ = 1 n ∑ i = 1 n y i$ is the sample mean.

2.3.3. Model Selection by Cross-Validation

Most of the above kernels include tuning parameters such as the Gaussian width, and the practical performance of LSMI depends on the choice of such kernel parameters and the regularization parameter λ. Model selection of LSMI is possible based on cross-validation with respect to the criterion J defined by Equation (3).
More specifically, the sample set $D = { ( x i , y i ) } i = 1 n$ is divided into M disjoint subsets ${ D m } m = 1 M$. Then the LSMI solution $g ^ m ( x )$ is obtained using $D \ D m$ (i.e., all samples without $D m$), and its J-score for the hold-out samples $D m$ is computed as:
$J ^ m CV : = 1 2 | D m | 2 ∑ x , y ∈ D m g ^ m ( x , y ) 2 - 1 | D m | ∑ ( x , y ) ∈ D m g ^ m ( x , y )$
where $| D m |$ denotes the number of elements in the set $D m$. $∑ x , y ∈ D m$ denotes the summation over all combinations of x and y in $D m$ (and thus $| D m | 2$ terms), while $∑ ( x , y ) ∈ D m$ denotes the summation over all pairs $( x , y )$ in $D m$ (and thus $| D m |$ terms). This procedure is repeated for $m = 1 , … , M$, and the average score,
$J ^ CV : = 1 M ∑ m = 1 M J ^ m CV$
is computed. Finally, the model (the kernel parameter and the regularization parameter in the current setup) that minimizes $J ^ CV$ is chosen as the most suitable one.

3. SMI-Based Machine Learning

In this section, we show how the SMI estimator, LSMI, can be used for solving various machine learning tasks.

3.1. Independence Testing

First, we show how the SMI estimator can be used for independence testing.

3.1.1. Introduction

Identifying the statistical independence between random variables is one of the fundamental challenges in statistical data analysis. A traditional independence measure between random variables is the Pearson correlation coefficient, which can be used for detecting linear dependency. Recently, kernel-based independence measures have been studied to detect non-linear dependency. The Hilbert–Schmidt independence criterion (HSIC) [44] utilizes cross-covariance operators on universal reproducing kernel Hilbert spaces (RKHSs) [45], which is an infinite-dimensional generalization of covariance matrices. HSIC allows efficient detection of non-linear dependency by making use of the reproducing property of RKHSs [38]. However, HSIC has a weakness that its performance depends on the choice of RKHSs and there is no theoretically justified way to determine the RKHS properly thus far. In practice, using the Gaussian RKHS with width set to the median distance between samples is a popular heuristic [46], but this does not always work well.
To overcome the above limitations, an SMI-based independence test called least-squares independence test (LSIT) was proposed [26]. Below, we review LSIT.

3.1.2. Independence Testing with SMI

Let $x ∈ X$ be an input feature and $y ∈ Y$ be an output feature, which follow a joint probability distribution with density $p ( x , y )$. Suppose that we are given a set of independent and identically distributed (i.i.d.) paired samples ${ ( x i , y i ) } i = 1 n$. The objective of independence testing is to conclude whether x and y are statistically independent or not, based on the samples ${ ( x i , y i ) } i = 1 n$.
The SMI-based independence test, where the null hypothesis is that x and y are statistically independent, is based on the permutation test procedure [47]. More specifically, LSMI is first run using the original dataset $D = { ( x i , y i ) } i = 1 n$, and an SMI estimate, $LSMI ( D )$, is obtained. Next, ${ y i } i = 1 n$ are randomly permuted and a shuffled dataset $D ˜ = { ( x i , y ˜ i ) } i = 1 n$ is formed, where ${ y ˜ i } i = 1 n$ denote permuted samples. Then LSMI is run again using the shuffled dataset $D ˜$, and an SMI estimate $LSMI ( D ˜ )$ is obtained. Note that the random permutation eliminates the dependency between x and y (if it exists), and therefore $LSMI ( D ˜ )$ would take a value close to zero. This random permutation procedure is repeated many times, and the distribution of $LSMI ( D ˜ )$ under the null-hypothesis that x and y are statistically independent is constructed. Finally, the p-value is approximated by evaluating the relative ranking of $LSMI ( D )$ in the distribution of $LSMI ( D ˜ )$.
This procedure is called the least-squares independence test (LSIT) [26]. A MATLAB$®$ implementation of LSIT is publicly available [48].

3.2. Supervised Feature Selection

Next, we show how the SMI estimator can be used for supervised feature selection.

3.2.1. Introduction

The objective of supervised learning is to learn an input-output relation from input-output paired samples. However, when the dimensionality of input vectors is large, using all input elements could lead to a model interpretability problem. Feature selection is aimed at finding a subset of input elements that is useful for predicting output values [49].
Feature ranking is a simple implementation of feature selection that ranks each feature according to its relevance. In this feature ranking scenario, SMI between a single input variable and an output was shown to be useful [19]. However, feature ranking does not take feature interaction into account, and thus it is not useful when each single feature is not capable of predicting outputs, but multiple features are necessary for a valid prediction of outputs (e.g., an XOR problem). Two criteria, relevancy and redundancy, are often used to select multiple features simultaneously: A feature is said to be relevant if it can explain outputs, and features are said to be redundant if they are similar. Ideally, we want to find a subset of features that has high relevance and low redundancy.
Another important issue in feature selection is the computational cost: Naively selecting multiple features causes computational infeasibility because the number of possible feature combinations is exponential with respect to the number of input features. To cope with this problem, a computationally efficient method to handle multiple features called the least absolute shrinkage and selection operator (LASSO) [50] was proposed. In LASSO, a predictor consisting of a weighted sum of each feature is fitted to output values using the least-squares method, while the weight vector is confined in an $ℓ 1$-ball. The $ℓ 1$-ball restriction actually provides a notable property that the solution is sparsified, meaning that some of the weight parameters become exactly zero. Thus, LASSO automatically removes irrelevant features from its predictor, which can be achieved through convex optimization in a computationally efficient way [51,52].
However, LASSO can only handle linear predictors and its feature selection characteristic explicitly depends on the squared-loss function used in the least-squares method. To go beyond these limitations, an SMI-based feature selection method called $ℓ 1$-LSMI was proposed [27]. Below, we review $ℓ 1$-LSMI.

3.2.2. Feature Selection with SMI

The objective of feature selection is, from input feature vector $x = ( x ( 1 ) , ⋯ , x ( d ) ) ⊤ ∈ R d$, to choose a subset of its elements that are useful for the prediction of output $y ∈ Y$. Suppose that we are given n i.i.d. paired samples ${ ( x i , y i ) } i = 1 n$ drawn from a joint distribution with density $p ( x , y )$. Let $w 1 , … , w d$ be feature weights for $x ( 1 ) , ⋯ , x ( d )$, and we learn the weights as:
where $η ≥ 0$ is the regularization parameter that controls the number of features. Because the sign of feature weights is not relevant in feature selection, they are restricted to be non-negative. For non-negative weights, $∑ i = 1 d w i$ is reduced to the $ℓ 1$-norm of the feature weight vector $( w 1 , … , w d ) ⊤$. The features having zero weights are regarded as irrelevant in this formulation.
To compute the solution, a simple gradient ascent may be used, where the weight vector is projected onto the positive orthant of the $ℓ 1$-ball in each iteration to guarantee the feasibility. This can be performed by first projecting the weight vector onto the positive orthant by rounding up negative elements to zero, and then projecting it onto the $ℓ 1$-ball [54].
This SMI-based feature selection algorithm is called the $ℓ 1$-LSMI [27]. A MATLAB$®$ implementation of $ℓ 1$-LSMI is publicly available [53].

3.3. Supervised Feature Extraction

While feature selection chooses a subset of features for enhancing model interpretability, feature extraction finds a low-dimensional representation of features that can depend on all features (e.g., through linear combination) for improving the prediction accuracy. Here, we show how the SMI estimator can be used for supervised feature extraction.

3.3.1. Introduction

The goal of sufficient dimension reduction (SDR) is to map input features to low-dimensional expressions while “sufficient” information for predicting output values is maintained [55]. Earlier SDR methods developed in the statistics community, such as sliced inverse regression [56], principal Hessian direction [57], and sliced average variance estimation [58], rely on the ellipticity of the data (e.g., Gaussian), but such an assumption may not be fulfilled in practice. To overcome the limitations of these approaches, kernel dimension reduction (KDR) was proposed [59]. KDR employs a kernel-based dependence measure that is distribution-free, and thus KDR is flexible. However, it lacks systematic model selection strategies for kernel and regularization parameters. Furthermore, KDR scales poorly to massive datasets and there is no good way to set an initial solution for its gradient-based optimization. In practice, many restarts from different initial solutions may be needed for finding a good local optimum, which makes the entire procedure even slower and the performance of dimension reduction unreliable.
To overcome the above limitations, an SMI-based SDR method called least-squares dimension reduction (LSDR) was proposed [28]. An advantage of LSDR is that its tuning parameters can be systematically optimized based on cross-validation. To further address the computational and initialization issues, a heuristic search strategy for LSDR called sufficient component analysis (SCA) was proposed [29]. Below, we review LSDR and SCA.

3.3.2. Sufficient Dimension Reduction with SMI

First, we formulate the problem of SDR [55]. Let $x ∈ R d x$ be an input vector and $y ∈ Y$ be an output. The goal of SDR is to find a subspace of input domain $R d x$ that contains “sufficient” information about output y. We assume that there exists an orthogonal matrix $U * ∈ R d u × d x$ for $d u ≤ d x$ such that
$y ⊥ ⊥ x | U * x$
That is, given the projected feature $U * x$, the (remaining) feature x is conditionally independent of output y and thus can be discarded without sacrificing the predictability of y. The objective of SDR is to find such $U *$ from n i.i.d. paired samples, ${ ( x i , y i ) } i = 1 n$, drawn from a joint distribution with density $p ( x , y )$. We assume that the projection dimensionality $d u$ is known.
SMI can be used for characterizing the optimal projection matrix $U *$ [28]. Indeed, it was shown that inequality,
$SMI ( X , Y ) ≥ SMI ( U X , Y )$
holds, and the equality holds if and only if Equation (15) holds. Thus, maximizing $SMI ( U X , Y )$ with respect to U leads to $U *$. In practice, the following optimization problem is solved:
This formulation is called least-squares dimension reduction (LSDR) [28].

A simple approach to solving the above LSDR optimization problem is the following iterative procedure:
• U is updated to ascend the gradient of $LSMI ( { ( U x i , y i ) } i = 1 n )$ with respect to U.
• U is projected onto the feasible region specified by $U U ⊤ = I d u$.
The gradient of $LSMI ( { ( U x i , y i ) } i = 1 n )$ with respect to U is given by:
$∂ LSMI ∂ U = ∑ ℓ = 1 b θ ^ ℓ ∂ h ^ ℓ ∂ U - 1 2 ∑ ℓ , ℓ ′ = 1 b θ ^ ℓ θ ^ ℓ ′ ∂ H ^ ℓ , ℓ ′ ∂ U$
If the kernel model Equation (14) with the Gaussian kernel,
$K ( U x , U x ′ ) = exp - ∥ U x - U x ′ ∥ 2 2 σ 2$
is used, $∂ h ^ ℓ ∂ U$ and $∂ H ^ ℓ , ℓ ′ ∂ U$ (for $ℓ , ℓ ′ = 1 , … , n$) are given by:
$∂ h ^ ℓ ∂ U = - 1 n σ 2 ∑ i = 1 n ( U x i - U x ℓ ) ( x i - x ℓ ) ⊤ exp - ∥ U x i - U x ℓ ∥ 2 2 σ 2 L ( y i , y ℓ ) , ∂ H ^ ℓ , ℓ ′ ∂ U = [ - 1 n σ 2 ∑ i = 1 n ( U x i - U x ℓ ) ( x i - x ℓ ) ⊤ + ( U x i - U x ℓ ′ ) ( x i - x ℓ ′ ) ⊤ = × exp - ∥ U x i - U x ℓ ∥ 2 + ∥ U x i - U x ℓ ′ ∥ 2 2 σ 2 ] × 1 n ∑ i = 1 n L ( y i , y ℓ ) L ( y i , y ℓ ′ )$
The projection of U onto the feasible region specified by $U U ⊤ = I d u$ may be carried out by the Gram–Schmidt process [60], although this is time-consuming.
An alternative way to solve the LSDR optimization problem is to perform gradient ascent on the Grassmann manifold [61]. In the Euclidean space, the ordinary gradient gives the steepest direction. However, on a manifold, the natural gradient [62] gives the steepest direction. The natural gradient $∇ LSMI ( U )$ at U is given as follows [63]:
$∇ LSMI ( U ) = ∂ LSMI ∂ U - ∂ LSMI ∂ U U ⊤ U = ∂ LSMI ∂ U U ⊥ ⊤ U ⊥$
where $U ⊥$ is any $( d - m ) × d$ matrix such that $[ U ⊤ U ⊥ ⊤ ]$ is orthogonal. Then the geodesic from U to the direction of the natural gradient $∇ LSMI ( U )$ over the Grassmann manifold can be expressed using $t ∈ R$ as:
$U t : = I d x O d x - d u exp t O d u ∂ LSMI ∂ U U ⊥ ⊤ - U ⊥ ∂ LSMI ∂ U ⊤ O d x - d u U U ⊥$
where “exp” for a matrix denotes the matrix exponential, and $O d x$ is the $d x × d x$ zero matrix. Thus, line search along the geodesic in the natural gradient direction is equivalent to finding the maximizer from ${ U t ∣ t ≥ 0 }$. For choosing the step size of each gradient update, some approximate line search method such as Armijo’s rule [64] or backtracking line search [51] may be used.
A $MATLAB ®$ implementation of LSDR is publicly available [65].

3.3.4. Heuristic Subspace Search

Although the natural gradient method is computationally more efficient than the plain gradient method, it is still time consuming. Moreover, many restarts from different initial solutions may be needed for finding a good local optimum. Here, we introduce a heuristic method that can address these issues [29].
A key idea is to use a truncated negative quadratic function called the Epanechnikov kernel [66] as a kernel function for $U x$:
$K ( U x , U x ′ ) = max 0 , 1 - ∥ U x - U x ′ ∥ 2 2 σ z 2$
Let $I ( c )$ be the indicator function, i.e., $I ( c ) = 1$ if c is true and zero otherwise. Then, for the above kernel function, $LSMI$ can be expressed as:
$LSMI = 1 2 tr ( U D U U ⊤ ) - 1 2$
where $tr ( · )$ denotes the trace of a matrix and $D U$ is the $d x × d x$ matrix defined by:
$D U = 1 n ∑ i = 1 n ∑ ℓ = 1 n θ ^ ℓ ( U ) I ∥ U x i - U x ℓ ∥ 2 2 σ z 2 < 1 L ( y i , y ℓ ) 1 d u I d x - 1 2 σ z 2 ( x i - x ℓ ) ( x i - x ℓ ) ⊤$
Here, the fact that $θ ^ ℓ$ depends on U is explicitly indicated by $θ ^ ℓ ( U )$.
If U in $D U$ is replaced by $U ′$, where $U ′$ is a transformation matrix obtained in the previous iteration, the SMI estimator is simplified as:
$1 2 tr U D U ′ U ⊤ - 1 2$
Because $D U ′$ is independent of U, a maximizer of Equation (16) with respect to U can be analytically obtained by $( u 1 | ⋯ | u d u ) ⊤$, where ${ u i } i = 1 d u$ are the $d u$ principal components of $D ′$. The same technique can also be utilized for determining an initial transformation matrix, by computing the above solution for $U ′ = I d x$ (i.e., no dimensionality reduction).
The above heuristic search method for LSDR is called sufficient component analysis (SCA) [29]. A MATLAB$®$ implementation of SCA is publicly available [67].

3.4. Canonical Dependency Analysis

Next, we show how the SMI estimator can be used for feature extraction from two sets of data.

3.4.1. Introduction

Canonical correlation analysis (CCA) [68] is a classical dimensionality reduction technique for two data sources, and it iteratively finds projection directions with maximum correlation. However, because CCA only captures correlations under linear projections, it is often insufficient to analyze complex real-world data that contain higher-order correlations. To be more flexible, non-linear CCA methods have been explored. A simple approach uses neural networks to handle non-linear projections [69,70], but neural networks are prone to local optima. Another approach first non-linearly transforms data samples into feature spaces and then apply linear CCA [71,72]. Given that the non-linear transformation is fixed, this two-step approach allows analytic computation of the global optimal solution via a generalized eigenvalue problem in the same way as linear CCA. This non-linear approach is called kernel CCA (KCCA) because reproducing kernels [38] are used as non-linear transforms. Alternating regression such as the alternating conditional expectation [73] is another possible way to find dependency in a flexible manner, which estimates transformations for two variables alternately by minimizing the squared error between transformed variables. These non-linear variants of CCA are highly flexible, although obtained results are often difficult to interpret due to the non-linearity.
The above non-linear CCA approaches can be regarded as capturing correlations along non-linear projection directions. Another extension of CCA called canonical dependency analysis (CDA) [30] captures higher-order correlations under linear projections. It was shown that KCCA with a universal kernel [45] such as the Gaussian kernel allows efficient detection of higher-order correlations [74]. However, the choice of universal kernels affects the practical performance, and there is no systematic method to choose a suitable kernel function. Another approach to higher-order CCA called informational CCA (ICCA) [75] uses mutual information (MI) as a dependency measure, where MI is estimated via kernel density estimation (KDE). Because systematic model selection strategies are available for KDE [76], ICCA could be more practical than the KCCA-based CDA method. In the ICCA method, one-dimensional projection directions are found in an iterative manner. Thus, it would be more powerful if multi-dimensional projection directions (i.e., a subspace) could be directly found in CDA [30]. However, ICCA may not be reliable in such a subspace search scenario because it involves the ratio of estimated densities, which tends to produce large estimation error if the dimensionality is not small.
To overcome the above limitation, an SMI-based CDA method called least-squares CDA (LSCDA) was proposed [30]. Below, we review LSCDA.

3.4.2. Canonical Dependency Analysis with SMI

Suppose that we are given n i.i.d. paired samples ${ ( x i , y i ) | x i ∈ R d x , y i ∈ R d y } i = 1 n$ drawn from a joint distribution with density $p ( x , y )$. CDA is aimed at finding the low-dimensional expressions of x and y that are maximally dependent on each other. Here, we focus on linear dimension reduction, i.e., x and y are transformed as $U x$ and $V y$, where $U ∈ R d u × d x$ and $V ∈ R d v × d y$ are orthogonal matrices with known dimensionalities $d u$ and $d v$. The objective of CDA is to find the transformation matrices U and V such that the statistical dependency between $U x$ and $V y$ is maximized. Let us use the SMI estimator, $LSMI ( { ( U x i , V y i ) } i = 1 n )$, as the dependency measure, i.e., we solve,
This formulation is called least-squares CDA (LSCDA) [30].
The above optimization problem can be solved in the same way as LSDR presented in Section 3.3.3 A MATLAB$®$ implementation of LSCDA is publicly available [77].

3.5. Independent Component Analysis

Here, we show how the SMI estimator can be used for independent component analysis.

3.5.1. Introduction

Suppose that there exist statistically independent sources of signals, and we observe their mixtures. The purpose of independent component analysis (ICA) [78] is to separate the mixed signals into the original source signals. An approach to ICA is to separate the mixed signals such that statistical independence among separated signals is maximized under some independence measure.
Various methods for evaluating the statistical independence among random variables from samples have been explored so far. A naive approach is to estimate probability densities based on parametric or non-parametric density estimation methods. However, finding an appropriate parametric model is not straightforward without strong prior knowledge and non-parametric estimation is not generally accurate in high-dimensional problems. Thus, this naive approach is not reliable in practice. Another approach is to approximate the entropy based on the Gram–Charlier expansion [79] or the Edgeworth expansion [80]. An advantage of this entropy-based approach is that a hard task of density estimation is not directly involved. However, these expansion techniques are based on the assumption that the target density is close to Gaussian, and violation of this assumption can cause large approximation error.
The above approaches are based on the probability densities of signals. Another line of research that does not explicitly involve probability densities employs non-linear correlation—signals are statistically independent if and only if all non-linear correlations among signals vanish. Following this line, computationally efficient algorithms have been developed based on a contrast function [81,82], which is an approximation of the entropy or mutual information. However, non-linearities in the contrast function need to be pre-specified in these methods, and thus they could be inaccurate if the predetermined non-linearities do not match the target distribution. To cope with this problem, the kernel trick has been applied in ICA, which allows computationally efficient evaluation of all non-linear correlations citeJMLR:Bach+Jordan:2002. However, its practical performance depends on the choice of kernels (more specifically, the Gaussian kernel width) and there seems no theoretically justified method to determine the kernel width. This is a critical problem in unsupervised learning tasks such as ICA.
To cope with this problem, an SMI-based ICA algorithm called least-squares independent component analysis (LICA) has been developed [31]. Below, we review LICA.

3.5.2. Independent Component Analysis with SMI

Suppose there are d signal sources and let: ${ x i | x i = ( x i ( 1 ) , … , x i ( d ) ) ⊤ ∈ R d } i = 1 n$ be i.i.d. samples drawn from a distribution with density $p ( x )$. We assume that elements $x ( 1 ) , … , x ( d )$ are statistically independent of each other, i.e., $p ( x )$ is factorized as:
$p ( x ) = p ( x ( 1 ) ) ⋯ p ( x ( d ) )$
We cannot directly observe ${ x i } i = 1 n$, but only their linearly mixed samples ${ y i } i = 1 n$:
$y i : = U x i$
where U is a $d × d$ invertible matrix called the mixing matrix.
The goal of ICA is, from the mixed samples ${ y i } i = 1 n$, to obtain a demixing matrix V that recovers the original source samples ${ x i } i = 1 n$. We denote the demixed samples by ${ z i } i = 1 n$:
$z i = V y i$
The ideal solution is $V = U - 1$, but we can only recover the source signals up to permutation and scaling of components of x due to non-identifiability of the ICA setup [78]. Let us denote the demixed samples by:
$z i = ( z i ( 1 ) , … , z i ( d ) ) ⊤ : = V y i$
for $i = 1 , … , n$.
A direct approach to ICA is to determine V so that elements of z are as statistically independent as possible. Here, we adopt SMI as the independence measure:
$SMI ( Z ( 1 ) , … , Z ( d ) ) : = 1 2 ∫ ⋯ ∫ p ( z ( 1 ) ) ⋯ p ( z ( d ) ) p ( z ( 1 ) , … , z ( d ) ) p ( z ( 1 ) ) ⋯ p ( z ( d ) ) - 1 2 d z ( 1 ) ⋯ d z ( d )$
We try to find the demixing matrix V that minimizes SMI. In practice, the following optimization problem is solved:
$min V ∈ R d × d LSMI ( { V y i } i = 1 n )$
where $LSMI ( { V y i } i = 1 n )$ is given by the same form as Equation (12) (or Equation (13)), but the matrix $H ^$ and the vector $h ^$ are defined in a slightly different way. For the Gaussian kernel,
$K ( V y , V y ′ ) = exp - ∥ V y - V y ′ ∥ 2 2 σ 2$
$H ^$ and $h ^$ are given by:
$H ^ ℓ , ℓ ′ = 1 n d ∏ m = 1 d ∑ i = 1 n exp - ( z ℓ ( m ) - z i ( m ) ) 2 + ( z ℓ ′ ( m ) - z i ( m ) ) 2 2 σ 2 h ^ ℓ = 1 n ∑ i = 1 n exp - ∥ z i - z ℓ ∥ 2 2 σ 2$
This formulation is called least-squares independent component analysis (LICA) [31].

Based on the plain gradient technique, an update rule of V is given by:
$V ⟵ V - t ∂ LSMI ∂ V$
where t$( > 0 )$ is the step size. The gradient $∂ LSMI ∂ V$ is given by:
$∂ LSMI ∂ V = ∑ ℓ = 1 n θ ^ ℓ ∂ h ^ ℓ ∂ V - 1 2 ∑ ℓ , ℓ ′ = 1 n θ ^ ℓ θ ^ ℓ ′ ∂ H ^ ℓ , ℓ ′ ∂ V$
where
$∂ h ^ ℓ ∂ V k , k ′ = - 1 n σ 2 ∑ i = 1 n ( z i ( k ) - z ℓ ( k ) ) ( y i ( k ′ ) - y ℓ ( k ′ ) ) ⊤ exp - ∥ z i - z k ∥ 2 2 σ 2 ∂ H ^ ℓ , ℓ ′ ∂ V k , k ′ = 1 n d - 1 ∏ m ≠ k ∑ i = 1 n exp - ( z i ( m ) - z ℓ ( m ) ) 2 + ( z i ( m ) - z ℓ ′ ( m ) ) 2 2 σ 2 = × [ - 1 n σ 2 ∑ i = 1 n ( z i ( k ) - z ℓ ( k ) ) ( y i ( k ′ ) - y ℓ ( k ′ ) ) + ( z i ( k ) - z ℓ ′ ( k ) ) ( y i ( k ′ ) - y ℓ ′ ( k ′ ) ) × exp - ( z i ( k ) - v ℓ ( k ) ) 2 + ( z i ( k ) - z ℓ ′ ( k ) ) 2 2 σ 2 ]$
In ICA, scaling of components of z can be arbitrary. This implies that the above gradient updating rule can lead to a solution with poor scaling, which is not preferable from a numerical viewpoint. To avoid possible numerical instability, V is normalized at each gradient iteration as:
$V k , k ′ ⟵ V k , k ′ ∑ m = 1 d V k , m 2$

3.5.4. Natural Gradient Demixing Matrix Search

Suppose that data samples are whitened, i.e., samples ${ y i } i = 1 n$ are pre-transformed as:
$y i ⟵ Σ ^ - 1 2 y i$
where $Σ ^$ is the sample covariance matrix:
$Σ ^ : = 1 n ∑ i = 1 n y i - 1 n ∑ j = 1 n y j y i - 1 n ∑ j = 1 n y j ⊤$
Then it can be shown that any demixing matrix that eliminates the second order correlation is an orthogonal matrix [78]. Thus, for whitened data, the search space of V can be restricted to the orthogonal group without loss of generality. The natural gradient [62] update rule on the orthogonal group is given by:
$V ⟵ V exp - t V ⊤ ∂ LSMI ∂ V - ∂ LSMI ∂ V ⊤ V$
where “exp” for a matrix denotes the matrix exponential and t $( > 0 )$ is the step size.
A MATLAB$®$ implementation of LICA is publicly available [83].

3.6. Cross-Domain Object Matching

Next, we show how the SMI estimator can be used for cross-domain object matching.

3.6.1. Introduction

The objective of cross-domain object matching is to match two sets of unpaired objects in different domains. For example, in photo album summarization, we are given a set of photos and a designed photo frame expressed as a set of photo slots in the Cartesian coordinate system, and we want to automatically assign the photos into the designed photo frame. A typical approach of cross-domain object matching is to find a mapping from objects in one domain (photos) to objects in the other domain (frame) so that the pairwise dependency is maximized. In this scenario, accurately evaluating the dependence between objects is a key issue.
Kernelized sorting [84] tries to find the mapping between two domains that maximizes mutual information under the Gaussian assumption. However, because the Gaussian assumption may not be fulfilled in practice, this method tends to perform poorly. To overcome the above limitation, the kernel-based dependence measure called the Hilbert–Schmidt independence criterion (HSIC) [85] was proposed to use in kernelized sorting [86]. Because HSIC is distribution-free, HSIC-based kernelized sorting is more flexible than the original method based on the Gaussian assumption. However, HSIC includes a tuning parameter (more specifically, the Gaussian kernel width), and its choice is crucial to obtain better performance [87].
To cope with this problem, an SMI-based cross-domain object matching method called least-squares object matching (LSOM) was developed [32]. Below, we review LSOM.

3.6.2. Cross-Domain Object Matching with SMI

The goal of cross-domain object matching is, given two sets of unpaired samples of the same size, ${ x i | x i ∈ X } i = 1 n$ and ${ y i | y i ∈ Y } i = 1 n$, to find a mapping that well “matches” them. Let π be a permutation function over ${ 1 , … , n }$. The optimal permutation, denoted by $π *$, can be obtained as the maximizer of the dependency between the two sets ${ x i } i = 1 n$ and ${ y π ( i ) } i = 1 n$. Here, we use the SMI approximator, $LSMI ( { ( x i , y π ( i ) ) } i = 1 n )$, as the dependency measure, i.e., we solve,
$max π LSMI ( { ( x i , y π ( i ) ) } i = 1 n )$
Let K and L be the $n × n$ kernel matrices defined by $K i , j = K ( x i , x j )$ and $L i , j = L ( y i , y j )$. Then LSMI for ${ ( x i , y π ( i ) ) } i = 1 n$ can be expressed as:
$LSMI ( { ( x i , y π ( i ) ) } i = 1 n ) = 1 2 n tr Π ⊤ L Π Θ ^ Π K - 1 2$
where Π is the permutation matrix corresponding to π, i.e., Π is the $n × n$ zero-one matrix such that $Π i , j = 1$ if $i = π ( j )$ for $j = 1 , … , n$ and $Π i , j = 0$ otherwise. $Θ ^ Π$ is the diagonal matrix with diagonal elements given by the LSMI solution $θ ^ π$ obtained by paired data ${ ( x i , y π ( i ) ) } i = 1 n$ (see Equation (11)).
Because maximizing Equation (18) with respect to Π is computationally infeasible, greedy update from previous solution $Π ′$ is used in practice:
$Π new = ( 1 - t ) Π ′ + t · argmax Π tr Π ⊤ L Π ′ Θ ^ Π ′ K$
where $0 < t ≤ 1$ is the step size. Maximization of the second term is called a linear assignment problem, which can be solved efficiently by the Hungarian method [88].
The above method is called least-squares object matching (LSOM) [32]. A MATLAB$®$ implementation of LSOM is publicly available [89].

3.7. Clustering

Here, we show how SMI can be effectively used for clustering.

3.7.1. Introduction

The objective of clustering is to classify data samples into disjoint groups in an unsupervised manner. K-means [90] is a classic but still popular clustering algorithm. However, k-means only produces linearly separated clusters, and thus its usefulness is rather limited in practice. To cope with this problem, various non-linear clustering methods have been developed. Kernel k-means [91] performs k-means in a feature space induced by a reproducing kernel function [46]. Spectral clustering [92,93] first unfolds non-linear data manifolds by a spectral embedding method, and then performs k-means in the embedded space. Blurring mean-shift [94,95] uses a non-parametric kernel density estimator for modeling the data-generating probability density, and finds clusters based on the modes of the estimated density. Discriminative clustering learns a discriminative classifier for separating clusters, where class labels are also treated as parameters to be optimized [96,97]. Dependence-maximization clustering determines cluster assignments so that their dependence on input data is maximized [34,98,99].
Information-maximization clustering exhibited the state-of-the-art performance [100,101], where probabilistic classifiers such as a kernelized Gaussian classifier [100] and a kernel logistic regression classifier [101] are learned so that mutual information between feature vectors and cluster assignments is maximized in an unsupervised manner. A notable advantage of information-maximization clustering is that classifier training is formulated as continuous optimization, which is substantially simpler than discrete optimization of cluster assignments. Indeed, classifier training can be carried out in computationally efficient manners by a gradient method [100] or a quasi-Newton method [101]. Furthermore, a model selection strategy based on the information-maximization principle is also provided [100]. Thus, kernel parameters can be systematically optimized in an unsupervised way. However, the optimization problems of these clustering methods are non-convex and finding a good local optimal solution is not straightforward in practice.
To overcome the above limitation, an SMI-based clustering method called SMI clustering (SMIC) was proposed [33]. Below, we review SMIC.

3.7.2. Clustering with SMI

Suppose that we are given d-dimensional i.i.d. feature vectors of size n, ${ x i | x i ∈ R d } i = 1 n$, which are drawn independently from a distribution with density $p ( x )$. The goal of clustering is to give cluster assignments, ${ y i | y i ∈ { 1 , … , c } } i = 1 n$, to the feature vectors ${ x i } i = 1 n$, where c denotes the number of clusters. c is assumed to be pre-fixed below. To solve the clustering problem, the information-maximization approach is taken [100,101]. That is, clustering is regarded as an unsupervised classification problem, and the class-posterior probability $p ( y | x )$ is learned so that “information” between feature vector x and cluster label y is maximized.
As an information measure, SMI Equation (1) is adopted, which can expressed as:
$SMI = 1 2 ∫ ∑ y = 1 c p ( y | x ) p ( x ) p ( y | x ) p ( y ) d x - 1 2$
Suppose that the class-prior probability $p ( y )$ is set to a user-specified value $π y$ for $y = 1 , … , c$, where $π y > 0$ and $∑ y = 1 c π y = 1$. Without loss of generality, ${ π y } y = 1 c$ are assumed to be sorted in the ascending order:
$π 1 ≤ ⋯ ≤ π c$
If ${ π y } y = 1 c$ is unknown, the uniform class-prior distribution may be adopted:
$p ( y ) = 1 c f o r y = 1 , … , c$
Substituting $π y$ into $p ( y )$, we can express Equation (19) as:
$1 2 ∫ ∑ y = 1 c 1 π y p ( y | x ) p ( x ) p ( y | x ) d x - 1 2$
Let us approximate the class-posterior probability $p ( y | x )$ by the following kernel model:
$q α ( y | x ) : = ∑ i = 1 n α y , i K ( x , x i ) ,$
where $α = ( α 1 , 1 , … , α c , n ) ⊤$ is the parameter vector and $K ( x , x ′ )$ denotes a kernel function. A useful example of kernel functions is the local-scaling kernel [102] defined as:
where $N k ( x )$ denotes the set of k nearest neighbors for x (k is the kernel parameter), $σ i$ is a local scaling factor defined as $σ i = ∥ x i - x i ( k ) ∥$, and $x i ( k )$ is the k-th nearest neighbor of $x i$. Note that we did not include the normalization term in Equation (21) because model outputs will be normalized later (see Equation (22)).
Further approximating the expectation with respect to $p ( x )$ included in Equation (20) by the empirical average of samples ${ x i } i = 1 n$, we arrive at the following SMI approximator:
$SMI ^ : = 1 2 n ∑ y = 1 c 1 π y α y ⊤ K 2 α y - 1 2$
where $α y : = ( α y , 1 , … , α y , n ) ⊤$ and $K i , j : = K ( x i , x j )$.
For each cluster y, $α y ⊤ K 2 α y$ is maximized under $∥ α y ∥ = 1$. Since this is the Rayleigh quotient, the maximizer is given by the normalized principal eigenvector of K [104]. To avoid all the solutions ${ α y } y = 1 c$ to be reduced to the same principal eigenvector, their mutual orthogonality is imposed:
Then the solutions are given by the normalized eigenvectors $ψ 1 , … , ψ c$ associated with the eigenvalues $λ 1 ≥ ⋯ ≥ λ n ≥ 0$ of K. Since the sign of $ψ y$ is arbitrary, the sign is set as:
$ψ ˜ y = ψ y × sign ( ψ y ⊤ 1 n )$
where $sign ( · )$ denotes the sign of a scalar and $1 n$ denotes the n-dimensional vector with all ones.
On the other hand, because
$p ( y ) = ∫ p ( y | x ) p ( x ) d x ≈ 1 n ∑ i = 1 n q α ( y | x i ) = α y ⊤ K 1 n$
and the class-prior probability $p ( y )$ was set to $π y$ for $y = 1 , … , c$, the following normalization condition is obtained:
$α y ⊤ K 1 n = π y$
Furthermore, probability estimates should be non-negative, which can be achieved by rounding up negative outputs to zero.
By taking these normalization and non-negativity issues into account, cluster assignment $y i$ for $x i$ is determined as the maximizer of the approximation of $p ( y | x i )$:
$y i = argmax y [ max ( 0 n , K ψ ˜ y ) ] i π y - 1 max ( 0 n , K ψ ˜ y ) ⊤ 1 n = argmax y π y [ max ( 0 n , ψ ˜ y ) ] i max ( 0 n , ψ ˜ y ) ⊤ 1 n$
where the “max” operation for vectors is applied in the element-wise manner and $[ · ] i$ denotes the i-th element of a vector. Note that $K ψ ˜ y = λ y ψ ˜ y$ was used in the above derivation. For out-of-sample prediction, cluster assignment $y ′$ for new sample $x ′$ may be obtained as:
$y ′ : = argmax y π y max 0 , ∑ i = 1 n K ( x ′ , x i ) [ ψ ˜ y ] i λ y max ( 0 n , ψ ˜ y ) ⊤ 1 n$
The above method is called SMI-based clustering (SMIC) [33]. LSMI can be used for model selection of SMIC, i.e., LSMI is computed as a function of the kernel parameter included in $K ( x , x ′ )$ and the maximizer of LSMI is chosen as the most promising one. A MATLAB$®$ implementation of SMIC is publicly available [103].

3.8. Causal Direction Estimation

Finally, we show how the SMI estimator can be used for causal direction estimation.

3.8.1. Introduction

Learning causality from data is one of the important challenges in the artificial intelligence, statistics, and machine learning communities [105]. A traditional method of learning causal relationship from observational data is based on the linear-dependence Gaussian-noise model [106]. However, the linear-Gaussian assumption is too restrictive and may not be fulfilled in practice. Recently, non-Gaussianity and non-linearity have been shown to be beneficial in causal inference, because it can break symmetry between observed variables [107,108]. Since then, much attention has been paid to the discovery of non-linear causal relationship through non-Gaussian noise models [109].
In the framework of non-linear non-Gaussian causal inference, the relation between a cause X and an effect Y is assumed to be described by $Y = f ( X ) + E$, where f is a non-linear function and E is non-Gaussian additive noise that is independent of the cause X. Under this additive noise assumption, it was shown [108] that the causal direction between X and Y can be identified based on a hypothesis test of whether the causal model $Y = f ( X ) + E$ or the alternative model $X = f ′ ( Y ) + E ′$ fits the data well—here, the goodness of fit is measured by independence between inputs and residuals (i.e., estimated noise). In [108], the functions f and $f ′$ were learned by the Gaussian process (GP) regression [110], and the independence between inputs and residuals was evaluated by the Hilbert–Schmidt independence criterion (HSIC) [85].
However, standard regression methods such as GP are designed to handle Gaussian noise, and thus they may not be suited for discovering causality in the non-Gaussian additive noise formulation. To cope with this problem, an alternative regression method called HSIC regression was proposed [109], which learns a function so that the dependence between inputs and residuals is directly minimized based on HSIC. Through experiments, HSIC regression was shown to outperform the GP-based method [109]. However, the choice of the kernel width in HSIC regression heavily affects the sensitivity of the independence measure, and systematic model selection strategies are not available. Another weakness of HSIC regression is that the kernel width of the regression model is fixed to the same value as HSIC. This crucially limits the flexibility of function approximation in HSIC regression.
To overcome the above weaknesses, an SMI-based regression method for causal inference called least-squares independence regression (LSIR) was developed [35]. Below, we review LSIR.

3.8.2. Dependence Minimizing Regression with SMI

Suppose random variables $X ∈ R$ and $Y ∈ R$ are connected by the following additive noise model [108]:
$Y = f ( X ) + E$
where $f : R → R$ is some non-linear function and $E ∈ R$ is a zero-mean random variable that is independent of X. The goal of dependence minimizing regression is, from i.i.d. paired samples ${ ( x i , y i ) } i = 1 n$, to obtain a function $f ^$ such that input X and estimated additive noise $E ^ = Y - f ^ ( X )$ are independent.
Let us employ a linear model for dependence minimizing regression:
$f β ( x ) = ∑ l = 1 m β l ψ l ( x ) = β ⊤ ψ ( x )$
where m is the number of basis functions, $β = ( β 1 , … , β m ) ⊤$ are regression parameters, and $ψ ( x ) = ( ψ 1 ( x ) , … , ψ m ( x ) ) ⊤$ are basis functions. In LSMI-based dependence minimization regression, the regression parameters β are learned as:
$min β LSMI { ( x i , e i ) } i = 1 n + γ 2 β ⊤ β$
where $e i = y i - f β ( x i )$ is the residual and $γ > 0$ is the regularization parameter to avoid overfitting.
For regression parameter learning, a gradient descent method may be used:
$β ⟵ β - t ∂ LSMI ∂ β + γ β$
where t is the step size. The gradient $∂ LSMI ∂ β$ can be approximately expressed as:
$∂ LSMI ∂ β = ∑ ℓ = 1 n θ ^ ℓ ∂ h ^ ℓ ∂ β - 1 2 ∑ ℓ , ℓ ′ = 1 n θ ^ ℓ θ ^ ℓ ′ ∂ H ^ ℓ , ℓ ′ ∂ β$
where
$∂ h ^ ℓ ∂ β ≈ - 1 2 n σ 2 ∑ j = 1 n exp - ( x i - x ℓ ) 2 + ( e i - e ℓ ) 2 2 σ 2 ( e i - e ℓ ) ψ ( x i ) , ∂ H ^ ℓ , ℓ ′ ∂ β ≈ - 1 2 n 2 σ 2 ∑ i , j = 1 n exp - ( x i - x ℓ ) 2 + ( e j - e ℓ ) 2 + ( x i - x ℓ ′ ) 2 + ( e j - e ℓ ′ ) 2 2 σ 2 × ( e j - e ℓ ) ψ ( x i ) + ( e i - e ℓ ) ψ ( x j )$
Note that, in the above derivation, the dependence of β on $e i$ is ignored for simplicity. Although it is possible to exactly compute the derivative in principle, this approximated expression is computationally more efficient with good performance in practice.
By taking into account the assumption that the mean of noise E is zero, the final regressor is obtained as:
$f ^ ( x ) = f β ^ ( x ) + 1 n ∑ i = 1 n y i - f β ^ ( x i )$
This method is called least-squares independence regression (LSIR) [35]. A MATLAB® implementation of LSIR is publicly available [111].

3.8.3. Causal Direction Inference by LSIR

Our final goal is, given i.i.d. paired samples ${ ( x i , y i ) } i = 1 n$, to determine whether X causes Y or vice versa under the additive noise assumption. To this end, we test whether the causal model $Y = f Y ( X ) + E Y$ or the alternative model $X = f X ( Y ) + E X$ fits the data well, where the goodness of fit is measured by independence between inputs and residuals (i.e., estimated noise). Independence of inputs and residuals may be decided in practice based on the permutation test procedure [47].
More specifically, LSIR is first run for ${ ( x i , y i ) } i = 1 n$ as usual, and obtain a regression function $f ^$. This procedure also provides an SMI estimate, $LSMI ( { ( x i , e ^ i ) } i = 1 n )$, where $e ^ i = y i - f ^ ( x i )$. Next, pairs of input and residual ${ ( x i , e ^ i ) } i = 1 n$ are randomly permuted as ${ ( x i , e ^ π ( i ) ) } i = 1 n$, where $π ( · )$ is a randomly generated permutation function. Note that the permuted pairs of samples are independent of each other because the random permutation breaks the dependency between X and $E ^$ (if it exists). Then, an SMI estimate for the permuted data, $LSMI ( { ( x i , e ^ π ( i ) ) } i = 1 n )$, is computed. This random permutation process is repeated many times, and the distribution of LSMI values under the null-hypothesis that X and $E ^$ are independent is constructed. Finally, the p-value is approximated by evaluating the relative ranking of LSMI computed from the original input-residual data, $LSMI ( { ( x i , e ^ i ) } i = 1 n )$, over the distribution of LSMI values for randomly permuted data.
In order to decide the causal direction, the p-values $p X → Y$ and $p X ← Y$ for both directions $X → Y$ (i.e., X causes Y) and $X ← Y$ (i.e., Y causes X) are computed. Then, for a given significance level δ, the causal direction is determined as follows:
• If $p X → Y > δ$ and $p X ← Y ≤ δ$, the causal model $X → Y$ is chosen.
• If $p X ← Y > δ$ and $p X → Y ≤ δ$, the causal model $X ← Y$ is selected.
• If $p X → Y , p X ← Y ≤ δ$, perhaps there is no causal relation between X and Y or our modeling assumption is not correct (e.g., an unobserved confounding variable exists).
• If $p X → Y , p X ← Y > δ$, perhaps our modeling assumption is not correct or it is not possible to identify a causal direction (i.e., X, Y, and E are Gaussian random variables).
When we have prior knowledge that there exists a causal relation between X and Y but the causal direction is unknown, the values of $p X → Y$ and $p X ← Y$ may be simply compared for determining the causal direction as follows:
• If $p X → Y > p X ← Y$, we conclude that X causes Y.
• Otherwise, we conclude that Y causes X.
This simplified procedure does not include the computational expensive permutation process and thus it is computationally very efficient.

4. Conclusions

In this article, we reviewed recent development in the estimation of squared-loss mutual information (SMI) and its application to machine learning. The key idea for accurately estimating SMI is to directly estimate the ratio of probability densities without separately estimating each density. A notable advantage of the SMI estimator called least-squares mutual information (LSMI) [19] is that it can be computed analytically in a computationally more efficient and numerically more stable way than ordinary MI.
We have introduced SMI as a measure of statistical independence between random variables. On the other hand, ordinary MI has a rich information-theoretic interpretation via entropies. Thus, it is important to investigate an information-theoretic meaning of SMI, which remains to be an open question currently.
Various methods of direct density-ratio estimation have been explored so far [16,18], and such density ratio estimators were shown to be applicable to an even wider class of machine learning tasks beyond SMI estimation, such as non-stationarity adaptation [112], outlier detection [113], change detection [114,115], class-balance estimation [116], two-sample homogeneity testing [117,118], probabilistic classification [119,120], and conditional density estimation [121].
Improving the accuracy of density ratio estimation contributes to enhancing the performance of the above machine learning solutions. Recent advances in this line of research include dimensionality reduction for density ratio estimation [122,123,124], a unified statistical framework of density ratio estimation [18], and extensions to relative density ratios [125] and density differences [126]. Further improving the accuracy and computational efficiency and exploring new application areas are important future directions to pursue.
More program codes are publicly available [127].

Acknowledgements

The author would like to thank Taiji Suzuki, Makoto Yamada, Wittawat Jitkrittum, Masayuki Karasuyama, Marthinus Christoffel du Plessis, and John Quinn for their valuable comments. This work was supported by the JST PRESTO program, the FIRST program, and AOARD.

References

1. Shannon, C. A mathematical theory of communication. AT&T Tech. J. 1948, 27, 379–423. [Google Scholar]
2. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2006. [Google Scholar]
3. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
4. Fraser, A.M.; Swinney, H.L. Independent coordinates for strange attractors from mutual information. Phys. Rev. A 1986, 33, 1134–1140. [Google Scholar] [CrossRef] [PubMed]
5. Vapnik, V.N. Statistical Learning Theory; Wiley: New York, NY, USA, 1998. [Google Scholar]
6. Darbellay, G.A.; Vajda, I. Estimation of the information by an adaptive partitioning of the observation space. IEEE Trans. Inf. Theory 1999, 45, 1315–1321. [Google Scholar] [CrossRef]
7. Wang, Q.; Kulkarmi, S.R.; Verdú, S. Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Trans. Inf. Theory 2005, 51, 3064–3074. [Google Scholar] [CrossRef]
8. Silva, J.; Narayanan, S. Universal Consistency of Data-Driven Partitions for Divergence Estimation. In Proceedings of IEEE International Symposium on Information Theory, Nice, France, 24–29 June 2007; pp. 2021–2025.
9. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138. [Google Scholar] [CrossRef] [PubMed]
10. Khan, S.; Bandyopadhyay, S.; Ganguly, A.; Saigal, S. Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Phys. Rev. E 2007, 76, 026209. [Google Scholar] [CrossRef] [PubMed]
11. Pérez-Cruz, F. Kullback-Leibler Divergence Estimation of Continuous Distributions. In Proceedings of IEEE International Symposium on Information Theory, Toronto, Canada, 6–11 July 2008; pp. 1666–1670.
12. Van Hulle, M.M. Edgeworth approximation of multivariate differential entropy. Neural Comput. 2005, 17, 1903–1910. [Google Scholar] [CrossRef] [PubMed]
13. Suzuki, T.; Sugiyama, M.; Sese, J.; Kanamori, T. Approximating Mutual Information by Maximum Likelihood Density Ratio Estimation. In Proceedings of ECML-PKDD2008 Workshop on New Challenges for Feature Selection in Data Mining and Knowledge Discovery 2008 (FSDM2008), 2008; Saeys, Y., Liu, H., Inza, I., Wehenkel, L., de Peer, Y.V., Eds.; Volume 4, JMLR Workshop and Conference Proceedings. pp. 5–20.
14. Sugiyama, M.; Suzuki, T.; Nakajima, S.; Kashima, H.; von Bünau, P.; Kawanabe, M. Direct importance estimation for covariate shift adaptation. Ann. I. Stat. Math. 2008, 60, 699–746. [Google Scholar] [CrossRef]
15. Nguyen, X.; Wainwright, M.J.; Jordan, M.I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theory 2010, 56, 5847–5861. [Google Scholar] [CrossRef]
16. Sugiyama, M.; Suzuki, T.; Kanamori, T. Density Ratio Estimation in Machine Learning; Cambridge University Press: Cambridge, UK, 2012. [Google Scholar]
17. Basu, A.; Harris, I.R.; Hjort, N.L.; Jones, M.C. Robust and efficient estimation by minimising a density power divergence. Biometrika 1998, 85, 549–559. [Google Scholar] [CrossRef]
18. Sugiyama, M.; Suzuki, T.; Kanamori, T. Density ratio matching under the bregman divergence: A unified framework of density ratio estimation. Ann. I. Stat. Math. 2012, 64, 1009–1044. [Google Scholar] [CrossRef]
19. Suzuki, T.; Sugiyama, M.; Kanamori, T.; Sese, J. Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinf. 2009, 10, S52:1–S52:12. [Google Scholar] [CrossRef] [PubMed]
20. Pearson, K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philos. Mag. Series 5 1900, 50, 157–175. [Google Scholar] [CrossRef]
21. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Series B 1966, 28, 131–142. [Google Scholar]
22. Csiszár, I. Information-type measures of difference of probability distributions and indirect observation. Stud. Sci. Math. Hung. 1967, 2, 229–318. [Google Scholar]
23. Kanamori, T.; Hido, S.; Sugiyama, M. A least-squares approach to direct importance estimation. J. Mach. Learn. Res. 2009, 10, 1391–1445. [Google Scholar]
24. Kanamori, T.; Suzuki, T.; Sugiyama, M. Statistical Analysis of kernel-based least-squares density-ratio estimation. Mach. Learn. 2012, 86, 335–367. [Google Scholar] [CrossRef]
25. Kanamori, T.; Suzuki, T.; Sugiyama, M. Computational complexity of kernel-based density-ratio estimation: A condition number analysis. 2009; arXiv:0912.2800. [Google Scholar]
26. Sugiyama, M.; Suzuki, T. Least-squares independence test. IEICE T. Inf. Syst. 2011, E94-D, 1333–1336. [Google Scholar] [CrossRef]
27. Jitkrittum, W.; Hachiya, H.; Sugiyama, M. Feature Selection via ℓ1-Penalized Squared-Loss Mutual Information. Technical Report 1210.1960, arXiv. 2012. [Google Scholar]
28. Suzuki, T.; Sugiyama, M. Sufficient dimension reduction via squared-loss mutual information estimation. Available online: sugiyama-www.cs.titech.ac.jp/.../AISTATS2010b.pdf (accessed on 26 December 2012).
29. Yamada, M.; Niu, G.; Takagi, J.; Sugiyama, M. Computationally Efficient Sufficient Dimension Reduction via Squared-Loss Mutual Information. In Proceedings of the Third Asian Conference on Machine Learning (ACML2011); Hsu, C.N., Lee, W.S., Eds.; 2011; Volume 20, JMLR Workshop and Conference Proceedings . pp. 247–262. [Google Scholar]
30. Karasuyama, M.; Sugiyama. Canonical dependency analysis based on squared-loss mutual information. Neural Netw. 2012, 34, 46–55. [Google Scholar] [CrossRef] [PubMed]
31. Suzuki, T.; Sugiyama, M. Least-squares independent component analysis. Neural Comput. 2011, 23, 284–301. [Google Scholar] [CrossRef] [PubMed]
32. Yamada, M.; Sugiyama, M. Cross-Domain Object Matching with Model Selection. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS2011), 2011; Gordon, G., Dunson, D., Dudík, M., Eds.; Volume 15, JMLR Workshop and Conference Proceedings . pp. 807–815.
33. Sugiyama, M.; Yamada, M.; Kimura, M.; Hachiya, H. On Information-Maximization Clustering: Tuning Parameter Selection and Analytic Solution. In Proceedings of 28th International Conference on Machine Learning (ICML2011), 2011; Getoor, L., Scheffer, T., Eds.; pp. 65–72.
34. Kimura, M.; Sugiyama, M. Dependence-maximization clustering with least-squares mutual information. J. Adv. Comput. Intell. Intell. Inf. 2011, 15, 800–805. [Google Scholar]
35. Yamada, M.; Sugiyama, M. Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI2010); The AAAI Press: Atlanta, Georgia, USA, 2010; pp. 643–648. [Google Scholar]
36. Van der Vaart, A.W.; Wellner, J.A. Weak Convergence and Empirical Processes with Applications to Statistics; Springer: New York, NY, USA, 1996. [Google Scholar]
37. Van der Vaart, A.W. Asymptotic Statistics; Cambridge University Press: Cambridge, MA, USA, 2000. [Google Scholar]
38. Aronszajn, N. Theory of reproducing kernels. T. Am. Math. Soc. 1950, 68, 337–404. [Google Scholar] [CrossRef]
39. Least-Squares Mutual Information (LSMI). Available online: http://sugiyama-www.cs.titech.ac.jp/~sugi/software/LSMI/ (accessed on 7 December 2012).
40. Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least angle regression. Ann. Stat. 2004, 32, 407–499. [Google Scholar]
41. Hastie, T.; Rosset, S.; Tibshirani, R.; Zhu, J. The entire regularization path for the support vector machine. J. Mach. Learn. Res. 2004, 5, 1391–1415. [Google Scholar]
42. Gärtner, T. A survey of kernels for structured data. SIGKDD Explor. 2003, 5, S268–S275. [Google Scholar] [CrossRef]
43. Sarwar, B.; Karypis, G.; Konstan, J.; Reidl, J. Item-Based Collaborative Filtering Recommendation Algorithms. In Proceedings of the 10th International Conference on World Wide Web (WWW2001), Hong Kong, China, 1–5 May 2001; pp. 285–295.
44. Gretton, A.; Fukumizu, K.; Teo, C.H.; Song, L.; Schölkopf, B.; Smola, A. A Kernel Statistical Test of Independence. In Advances in Neural Information Processing Systems 20; Platt, J.C., Koller, D., Singer, Y., Roweis, S., Eds.; MIT Press: Cambridge, MA, USA, 2008; pp. 585–592. [Google Scholar]
45. Steinwart, I. On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res. 2001, 2, 67–93. [Google Scholar]
46. Schölkopf, B.; Smola, A.J. Learning with Kernels; MIT Press: Cambridge, MA, USA, 2002. [Google Scholar]
47. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; Chapman & Hall/CRC: New York, NY, USA, 1993. [Google Scholar]
48. Least-Squares Independence Test (LSIT). Available online: http://sugiyama-www.cs.titech.ac.jp/~sugi/software/LSIT/ (accessed on 7 December 2012).
49. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar]
50. Tibshirani, R. Regression shrinkage and subset selection with the lasso. J. R. Stat. Soc. Series B 1996, 58, 267–288. [Google Scholar]
51. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
52. Tomioka, R.; Suzuki, T.; Sugiyama, M. Super-linear convergence of dual augmented Lagrangian algorithm for sparsity regularized estimation. J. Mach. Learn. Res. 2011, 12, 1537–1586. [Google Scholar]
53. 1-Ball. Available online: http://wittawat.com/software/l1lsmi/ (accessed on 7 December).
54. McCallum, A.; Roweis, S. Efficient Projections onto the 1-Ball for Learning in High Dimensions. In Proceedings of the 25th Annual International Conference on Machine Learning (ICML2008), Helsinki, Finland, 5–9 July 2008; pp. 272–279.
55. Cook, R.D. Regression Graphics: Ideas for Studying Regressions through Graphics; Wiley: New York, NY, USA, 1998. [Google Scholar]
56. Li, K. Sliced inverse regression for dimension reduction. J. Am. Stat. Assoc. 1991, 86, 316–342. [Google Scholar] [CrossRef]
57. Li, K. On principal hessian directions for data visualization and dimension reduction: another application of Stein’s lemma. J. Am. Stat. Assoc. 1992, 87, 1025–1039. [Google Scholar] [CrossRef]
58. Cook, R.D. SAVE: A method for dimension reduction and graphics in regression. Commun. Stat. Theory 2000, 29, 2109–2121. [Google Scholar] [CrossRef]
59. Fukumizu, K.; Bach, F.R.; Jordan, M.I. Kernel dimension reduction in regression. Ann. Stat. 2009, 37, 1871–1905. [Google Scholar] [CrossRef]
60. Golub, G.H.; Loan, C.F.V. Matrix Computations, 2nd ed.; Johns Hopkins University Press: Baltimore, MD, USA, 1989. [Google Scholar]
61. Nishimori, Y.; Akaho, S. Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold. Neurocomputing 2005, 67, 106–135. [Google Scholar] [CrossRef]
62. Amari, S. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276. [Google Scholar] [CrossRef]
63. Edelman, A.; Arias, T.A.; Smith, S.T. The geometry of algorithms with orthogonality constraints. SIAM J. Matrix. Anal. A. 1998, 20, 303–353. [Google Scholar] [CrossRef]
64. Patriksson, M. Nonlinear Programming and Variational Inequality Problems; Kluwer Academic: Dordrecht, The Netherlands, 1999. [Google Scholar]
65. Least-Squares Dimensionality Reduction (LSDR). Available online: http://sugiyama-www.cs.titech.ac.jp/~sugi/software/LSDR/ (accessed on 7 December 2012).
66. Epanechnikov, V. Nonparametric estimates of a multivariate probability density. Theor. Probab. Appl. 1969, 14, 153–158. [Google Scholar] [CrossRef]
67. Sufficient Component Analysis (SCA). Available online: http://sugiyama-www.cs.titech.ac.jp/~yamada/sca.html (accessed on 7 December 2012).
68. Hotelling, H. Relations between two sets of variates. Biometrika 1936, 28, 321–377. [Google Scholar] [CrossRef]
69. Becker, S.; Hinton, G.E. A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature 1992, 355, 161–163. [Google Scholar] [CrossRef] [PubMed]
70. Fyfe, C.; Lai, P.L. Kernel and nonlinear canonical correlation analysis. Int. J. Neural Syst. 2000, 10, 365–377. [Google Scholar]
71. Akaho, S. A Kernel Method For Canonical Correlation Analysis. In Proceedings of the International Meeting of the Psychometric Society, Osaka, Japan, 15–19 July 2001.
72. Gestel, T.V.; Suykens, J.; Brabanter, J.D.; Moor, B.D.; Vandewalle, J. Kernel Canonical Correlation Analysis and Least Squares Support Vector Machines. In Proceedings of the International Conference on Artificial Neural Networks; Springer: Berlin/Heidelberg, Germany, 2001; Volume 2130, Lecture Notes in Computer Science . pp. 384–389. [Google Scholar]
73. Breiman, L.; Friedman, J.H. Estimating optimal transformations for multiple regression and correlation. J. Am. Stat. Assoc. 1985, 80, 580–598. [Google Scholar] [CrossRef]
74. Bach, F.; Jordan, M.I. Kernel independent component analysis. J. Mach. Learn. Res. 2002, 3, 1–48. [Google Scholar]
75. Yin, X. Canonical correlation analysis based on information theory. J. Multivariate Anal. 2004, 91, 161–176. [Google Scholar] [CrossRef]
76. Härdle, W.; Müller, M.; Sperlich, S.; Werwatz, A. Nonparametric and Semiparametric Models; Springer: Berlin, Germany, 2004. [Google Scholar]
77. Least-Squares Canonical Dependency Analysis (LSCDA). Available online: http://www.bic.kyoto-u.ac.jp/pathway/krsym/software/LSCDA/index.html (accessed on 7 December 2012).
78. Hyvärinen, A.; Karhunen, J.; Oja, E. Independent Component Analysis; Wiley: New York, NY, USA, 2001. [Google Scholar]
79. Amari, S.; Cichocki, A.; Yang, H.H. A New Learning Algorithm for Blind Signal Separation. Advances in Neural Information Processing Systems 8; Touretzky, D.S., Mozer, M.C., Hasselmo, M.E., Eds.; The MIT Press: Cambridge, MA, USA, 1996; pp. 757–763. [Google Scholar]
80. Van Hulle, M.M. Sequential fixed-point ICA based on mutual information minimization. Neural Comput. 2008, 20, 1344–1365. [Google Scholar] [CrossRef] [PubMed]
81. Jutten, C.; Herault, J. Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Process. 1991, 24, 1–10. [Google Scholar] [CrossRef]
82. Hyvärinen, A. Fast and robust fixed-point algorithms for independent component analysis. IEEE T. Neural Networ. 1999, 10, 626. [Google Scholar] [CrossRef] [PubMed]
83. Least-squares Independent Component Analysis. Available online: http://www.simplex.t.u-tokyo.ac.jp/~s-taiji/software/LICA/index.html (accessed on 7 December 2012).
84. Jebara, T. Kernelized Sorting, Permutation and Alignment for Minimum Volume PCA. In Proceedings of the 17th Annual Conference on Learning Theory (COLT2004), Banff, Canada, 1–4 July 2004; pp. 609–623.
85. Gretton, A.; Bousquet, O.; Smola, A.; Schölkopf, B. Measuring Statistical Dependence with Hilbert-Schmidt Norms. In Algorithmic Learning Theory; Jain, S., Simon, H.U., Tomita, E., Eds.; Springer-Verlag: Berlin, Germany, 2005; Lecture Notes in Artificial Intelligence; pp. 63–77. [Google Scholar]
86. Quadrianto, N.; Smola, A.J.; Song, L.; Tuytelaars, T. Kernelized sorting. IEEE Trans. Patt. Anal. 2010, 32, 1809–1821. [Google Scholar] [CrossRef] [PubMed]
87. Jagarlamudi, J.; Juarez, S.; Daumé, H., III. Kernelized Sorting for Natural Language Processing. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI2010), Atlanta, Georgia, USA, 11–15 July 2010; pp. 1020–1025.
88. Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97. [Google Scholar] [CrossRef]
89. Least-Squares Object Matching (LSOM). Available online: http://sugiyama-www.cs.titech.ac.jp/~yamada/lsom.html (accessed on 7 December 2012).
90. MacQueen, J.B. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1967; Vol. 1, pp. 281–297. [Google Scholar]
91. Girolami, M. Mercer kernel-based clustering in feature space. IEEE Trans. Neural Networ. 2002, 13, 780–784. [Google Scholar] [CrossRef] [PubMed]
92. Shi, J.; Malik, J. Normalized cuts and image segmentation. IEEE Trans. Patt. Anal. 2000, 22, 888–905. [Google Scholar]
93. Ng, A.Y.; Jordan, M.I.; Weiss, Y. On Spectral Clustering: Analysis and An Algorithm. Advances in Neural Information Processing Systems 14; Dietterich, T.G., Becker, S., Ghahramani, Z., Eds.; MIT Press: Cambridge, MA, USA, 2002; pp. 849–856. [Google Scholar]
94. Fukunaga, K.; Hostetler, L.D. The estimation of the gradient of a density function, with application in pattern recognition. IEEE Trans. Inf. Theory 1975, 21, 32–40. [Google Scholar] [CrossRef]
95. Carreira-Perpiñán, M.A. Fast Nonparametric Clustering with Gaussian Blurring Mean-Shift. In Proceedings of 23rd International Conference on Machine Learning (ICML2006), Pittsburgh, Pennsylvania, USA, 25–29 June 2006; Cohen, W., Moore, A., Eds.; pp. 153–160.
96. Xu, L.; Neufeld, J.; Larson, B.; Schuurmans, D. Maximum Margin Clustering. Advances in Neural Information Processing Systems 17; Saul, L.K., Weiss, Y., Bottou, L., Eds.; MIT Press: Cambridge, MA, USA, 2005; pp. 1537–1544. [Google Scholar]
97. Bach, F.; Harchaoui, Z. DIFFRAC: A Discriminative and Flexible Framework for Clustering. In Advances in Neural Information Processing Systems 20; Platt, J.C., Koller, D., Singer, Y., Roweis, S., Eds.; MIT Press: Cambridge, MA, USA, 2008; pp. 49–56. [Google Scholar]
98. Song, L.; Smola, A.; Gretton, A.; Borgwardt, K. A Dependence Maximization View of Clustering. In Proceedings of the 24th Annual International Conference on Machine Learning (ICML2007), Corvallis, Oregon, USA, 20–24 June 2007; Ghahramani, Z., Ed.; pp. 815–822.
99. Faivishevsky, L.; Goldberger, J. A Nonparametric Information Theoretic Clustering Algorithm. In Proceedings of 27th International Conference on Machine Learning (ICML2010), Haifa, Israel, 21–24 June 2010; Joachims, A.T., Fürnkranz, J., Eds.; pp. 351–358.
100. Agakov, F.; Barber, D. Kernelized Infomax Clustering. In Advances in Neural Information Processing Systems 18; Weiss, Y., Schölkopf, B., Platt, J., Eds.; MIT Press: Cambridge, MA, USA, 2006; pp. 17–24. [Google Scholar]
101. Gomes, R.; Krause, A.; Perona, P. Discriminative Clustering by Regularized Information Maximization. In Advances in Neural Information Processing Systems 23; Lafferty, J., Williams, C.K.I., Zemel, R., Shawe-Taylor, J., Culotta, A., Eds.; 2010; pp. 766–774. [Google Scholar]
102. Zelnik-Manor, L.; Perona, P. Self-Tuning Spectral Clustering. In Advances in Neural Information Processing Systems 17; Saul, L.K., Weiss, Y., Bottou, L., Eds.; MIT Press: Cambridge, MA, USA, 2005; pp. 1601–1608. [Google Scholar]
103. SMI-based Clustering (SMIC). Available online: http://sugiyama-www.cs.titech.ac.jp/~sugi/software/SMIC/ (accessed on 7 December 2012).
104. Horn, R.A.; Johnson, C.A. Matrix Analysis; Cambridge University Press: Cambridge, UK, 1985. [Google Scholar]
105. Pearl, J. Causality: Models, Reasoning and Inference; Cambridge University Press: New York, NY, USA, 2000. [Google Scholar]
106. Geiger, D.; Heckerman, D. Learning Gaussian Networks. In Proceedings of the 10th Annual Conference on Uncertainty in Artificial Intelligence (UAI1994), Seattle, Washington, USA, 29–31 July 1994; pp. 235–243.
107. Shimizu, S.; Hoyer, P.O.; Hyvärinen, A.; Kerminen, A.J. A linear non-gaussian acyclic model for causal discovery. J. Mach. Learn. Res. 2006, 7, 2003–2030. [Google Scholar]
108. Hoyer, P.O.; Janzing, D.; Mooij, J.M.; Peters, J.; Schölkopf, B. Nonlinear Causal Discovery with Additive Noise Models. In Advances in Neural Information Processing Systems 21; Koller, D., Schuurmans, D., Bengio, Y., Bottou, L., Eds.; MIT Press: Cambridge, MA, USA, 2009; pp. 689–696. [Google Scholar]
109. Mooij, J.; Janzing, D.; Peters, J.; Schölkopf, B. Regression by Dependence Minimization and Its Application to Causal Inference in Additive Noise Models. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML2009), Montreal, Canada Jun, 14–18, 2009; pp. 745–752.
110. Rasmussen, C.E.; Williams, C.K.I. Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, USA, 2006. [Google Scholar]
111. Least-Squares Independence Regression (LSIR). Availble online: http://sugiyama-www.cs.titech.ac.jp/~yamada/lsir.html (accessed on 7 December 2012).
112. Sugiyama, M.; Kawanabe, M. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation; MIT Press: Cambridge, Massachusetts, USA, 2012. [Google Scholar]
113. Hido, S.; Tsuboi, Y.; Kashima, H.; Sugiyama, M.; Kanamori, T. Statistical outlier detection using direct density ratio estimation. Knowl. Inf. Syst. 2011, 26, 309–336. [Google Scholar] [CrossRef]
114. Kawahara, Y.; Sugiyama, M. Sequential change-point detection based on direct density-ratio estimation. Stat. Anal. Data Min. 2012, 5, 114–127. [Google Scholar] [CrossRef]
115. Liu, S.; Yamada, M.; Collier, N.; Sugiyama, M. Change-Point Detection in Time-Series Data by Relative Density-Ratio Estimation. In Structural, Syntactic, and Statistical Pattern Recognition; Gimel’farb, G., Hancock, E., Imiya, A., Kuijper, A., Kudo, M., Omachi, S., Windeatt, T., Yamada, K., Eds.; Springer: Berlin, Germany, 2012; Volume 7626, Lecture Notes in Computer Science ; pp. 363–372. [Google Scholar]
116. Langford, J.; Pineau, J. Semi-Supervised Learning of Class Balance under Class-Prior Change by Distribution Matching. In Proceedings of 29th International Conference on Machine Learning (ICML2012), Edinburgh, Scotland, 26 June–1 July 2012; pp. 823–830.
117. Sugiyama, M.; Suzuki, T.; Itoh, Y.; Kanamori, T.; Kimura, M. Least-squares two-sample test. Neural Netw. 2011, 24, 735–751. [Google Scholar] [CrossRef] [PubMed]
118. Kanamori, T.; Suzuki, T.; Sugiyama, M. f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models. IEEE Trans. Inf. Theory 2012, 58, 708–720. [Google Scholar] [CrossRef]
119. Sugiyama, M. Superfast-trainable multi-class probabilistic classifier by least-squares posterior fitting. IEICE Trans. Inf. Syst. 2010, E93-D, 2690–2701. [Google Scholar] [CrossRef]
120. Sugiyama, M.; Hachiya, H.; Yamada, M.; Simm, J.; Nam, H. Least-Squares Probabilistic Classifier: A Computationally Efficient Alternative to Kernel Logistic Regression. In Proceedings of International Workshop on Statistical Machine Learning for Speech Processing (IWSML2012), Kyoto, Japan, Mar. 31. 2012; pp. 1–10.
121. Sugiyama, M.; Takeuchi, I.; Suzuki, T.; Kanamori, T.; Hachiya, H.; Okanohara, D. Least-squares conditional density estimation. IEICE Trans. Inf. Syst. 2010, E93-D, 583–594. [Google Scholar] [CrossRef]
122. Sugiyama, M.; Kawanabe, M.; Chui, P.L. Dimensionality reduction for density ratio estimation in high-dimensional spaces. Neural Netw. 2010, 23, 44–59. [Google Scholar] [CrossRef] [PubMed]
123. Sugiyama, M.; Yamada, M.; von Bünau, P.; Suzuki, T.; Kanamori, T.; Kawanabe, M. Direct density-ratio estimation with dimensionality reduction via least-squares hetero-distributional subspace search. Neural Netw. 2011, 24, 183–198. [Google Scholar] [CrossRef] [PubMed]
124. Yamada, M.; Sugiyama, M. Direct Density-Ratio Estimation with Dimensionality Reduction via Hetero-Distributional Subspace Analysis. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence (AAAI2011); The AAAI Press: San Francisco, California, USA, 2011; pp. 549–554. [Google Scholar]
125. Yamada, M.; Suzuki, T.; Kanamori, T.; Hachiya, H.; Sugiyama, M. Relative Density-Ratio Estimation for Robust Distribution Comparison. In Advances in Neural Information Processing Systems 24; Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F.C.N., Weinberger, K.Q., Eds.; 2011; pp. 594–602. [Google Scholar]
126. Sugiyama, M.; Suzuki, T.; Kanamori, T.; Du Plessis, M.C.; Liu, S.; Takeuchi, I. Density-Difference Estimation. Advances in Neural Information Processing Systems 25, 2012. [Google Scholar]
127. Software. Available online: http://sugiyama-www.cs.titech.ac.jp/~sugi/software/ (accessed on 7 December 2012).