Open Access
Algorithms 2019, 12(11), 240; https://doi.org/10.3390/a12110240
Article
Tensor-Based Algorithms for Image Classification
Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany
*
Correspondence: [email protected] (S.K.); [email protected] (P.G.)
^{†}
These authors contributed equally to this work.
Received: 20 October 2019 / Accepted: 7 November 2019 / Published: 9 November 2019
Abstract
Interest in machine learning with tensor networks has been growing rapidly in recent years. We show that tensor-based methods developed for learning the governing equations of dynamical systems from data can, in the same way, be used for supervised learning problems and propose two novel approaches for image classification. One is a kernel-based reformulation of the previously introduced multidimensional approximation of nonlinear dynamics (MANDy), the other an alternating ridge regression in the tensor train format. We apply both methods to the MNIST and fashion MNIST data sets and show that the approaches are competitive with state-of-the-art neural network-based classifiers.
Keywords:
quantum machine learning; image classification; tensor train format; kernel-based methods; ridge regression

MSC:
15A69; 62J07; 65D18; 68Q32

1. Introduction
Tensor-based methods have become a powerful tool for scientific computing over the last years. In addition to many application areas, such as quantum mechanics and computational dynamics, where low-rank tensor approximations have been successfully applied, using tensor networks for supervised learning has gained a lot of attention recently. In particular, the canonical format and the tensor train format have been considered for quantum machine learning (there are different research directions in the field of quantum machine learning; here, we understand it as using quantum computing capabilities for machine learning problems) problems, see, e.g., [1,2,3]. A tensor-based algorithm for image classification using sweeping techniques inspired by the density matrix renormalization group (DMRG) [4] was proposed in [5,6] and further discussed in [7,8]. Interestingly, researchers at Google are also currently developing a tensor-based machine learning framework called “TensorNetwork” (http://github.com/google/TensorNetwork) [9,10]. The goal is to expedite the adoption of such methods by the machine learning community.
Our goal is to show that recently developed methods for recovering the governing equations of dynamical systems can be generalized in such a way that they can also be used for supervised learning tasks, e.g., classification problems. To learn the governing equations from simulation or measurement data, regression methods such as sparse identification of nonlinear dynamics (SINDy) [11,12] and its tensor-based reformulation, the multidimensional approximation of nonlinear dynamics (MANDy) [13], can be applied. The main challenge is often to choose the right function space from which the system representation is learned. While SINDy and MANDy essentially select functions from a potentially large set of basis functions by applying regularized regression methods, other approaches allow nested functions and typically result in nonlinear optimization problems, which are then frequently solved using (stochastic) gradient descent. By constructing a basis comprising tensor products of simple functions (e.g., functions depending only on one variable), extremely high-dimensional feature spaces can be generated.
In this work, we explain how to compute the pseudoinverse required for solving the minimization problem directly in the tensor train (TT) format, i.e., we replace the iterative approach from [5,6] by a direct computation of the least-squares solution and point out similarities with the aforementioned system identification methods. The reformulated algorithm can be regarded as a kernelized variant of MANDy, where the kernel is based on tensor products. This is also related to quantum machine learning ideas: As pointed out in [14], the basic idea of quantum computing is similar to kernel methods in that computations are performed implicitly in otherwise intractably large Hilbert spaces. Although kernel methods were popular in the 1990s, the focus of the machine learning community has shifted to deep neural networks in recent years [14]. We will show that, for simple image classification tasks, kernels based on tensor products are competitive with neural networks.
In addition to the kernel-based approach, we propose another DMRG-inspired method for the construction of TT decompositions of weight matrices containing the coefficients for the selected basis functions. Instead of computing pseudoinverses, a core-wise ridge regression [15] is applied to solve the minimization problem. While the approach introduced in [5,6] only involves tensor contractions corresponding to single images of the training data set, we use TT representations of transformed data tensors, see [13,16], to include the entire training data set at once when constructing low-dimensional systems of linear equations. Combining an efficient computational scheme for the corresponding subproblems with truncated singular value decompositions [17], we call the resulting algorithm alternating ridge regression (ARR) and discuss connections to MANDy and other regularized regression techniques.
Although we describe the classification problems using the example of the iconic MNIST data set [18] and the fashion MNIST data set [19], the derived algorithms can be easily applied to other classification problems. There is a plethora of kernel and deep learning methods for image classification; a list of the most successful methods for the MNIST and fashion MNIST data sets, including nearest-neighbor heuristics, support vector machines, and convolutional neural networks, can be found on the respective websites (http://yann.lecun.com/exdb/mnist/, http://github.com/zalandoresearch/fashionmnist). We will not review these methods in detail, but instead focus on relationships with data-driven methods for analyzing dynamical systems. The main contributions of this paper are as follows.
Extension of MANDy: We show that the efficiency of the pseudoinverse computation in the tensor train format can be improved by eliminating the need to left- and right-orthonormalize the tensor. Although this is a straightforward modification of the original algorithm, it enables us to consider large data sets. The resulting method is closely related to kernel ridge regression.
Alternating ridge regression: We introduce a modified TT representation of transformed data tensors for the development of a tensor-based regression technique which computes low-rank representations of coefficient tensors. We show that it is possible to obtain results which are competitive with those computed by MANDy and, at the same time, reduce the computational costs and the memory consumption significantly.
 Classification of image data: Although originally designed for system identification, we apply these methods to classification problems and visualize the learned classifier, which allows us to interpret features detected in the images.
The remainder of this paper is structured as follows. In Section 2, we describe methods to learn governing equations of dynamical systems from data as well as a tensor-based iterative scheme for image classification and highlight their relationships. In Section 3, we describe how to apply MANDy to classification problems and introduce the ARR approach based on the alternating optimization of TT cores. Numerical results are presented in Section 4, followed by a brief summary and conclusion in Section 5.
2. Prerequisites
We will introduce the original MNIST and the fashion MNIST data set, which will serve as guiding examples. Afterwards, SINDy and MANDy, as well as tensor-based methods for image classification problems, will be briefly discussed. In what follows, we will use the notation summarized in Table 1.
2.1. MNIST and Fashion MNIST
The MNIST data set [18], see Figure 1a, contains grayscale (The methods described below can be easily extended to color images by defining basis functions for each primary color.) images of handwritten digits and the associated labels. The data set is split into 60,000 images for training and 10,000 images for testing. Each image is of size 28 × 28. Let $d=784$ be the number of pixels of one image, and let the images, reshaped as vectors, be denoted by ${x}^{\left(j\right)}\in {\mathbb{R}}^{d}$ and the corresponding labels by ${y}^{\left(j\right)}\in {\mathbb{R}}^{{d}^{\prime}}$, where ${d}^{\prime}=10$ is the number of different classes. Each label encodes a number in $\{0,\dots ,9\}$, and the entries ${y}_{i}^{\left(j\right)}$ of the vector ${y}^{\left(j\right)}$ are given by
$${y}_{i}^{(j)}=\begin{cases}1,& \text{if } {x}^{(j)} \text{ contains the number } i-1,\\ 0,& \text{otherwise},\end{cases}$$
i.e., ${y}^{(j)}={[1,0,0,\dots ,0]}^{\top}$ represents 0, ${y}^{(j)}={[0,1,0,\dots ,0]}^{\top}$ represents 1, etc. This is also called one-hot encoding in machine learning.
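As a small illustration, the one-hot encoding above can be generated in a few lines of NumPy (a minimal sketch; the helper name `one_hot` is ours, not from the paper):

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Map integer labels to one-hot column vectors, as in Equation (1)."""
    Y = np.zeros((num_classes, len(labels)))
    Y[labels, np.arange(len(labels))] = 1.0
    return Y

# The digit 0 becomes [1, 0, ..., 0]^T, the digit 1 becomes [0, 1, 0, ..., 0]^T, etc.
Y = one_hot(np.array([0, 1, 9]))
```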
The fashion MNIST data set [19] can be regarded as a drop-in replacement for the original data set. There are again 60,000 training and 10,000 test images of size $28\times 28$. Some samples are shown in Figure 1b and the corresponding labels in Figure 1c. Given a picture of a clothing item, the goal now is to identify the correct category, which is encoded as described above.
2.2. SINDy
SINDy [11] was originally developed to learn the governing equations of dynamical systems from data. We will show how it can, in the same way, be used for classification problems. Consider an autonomous ordinary differential equation of the form $\dot{x}=f(x)$, with $f:{\mathbb{R}}^{d}\to {\mathbb{R}}^{d}$. Given m measurements of the state of the system, denoted by ${x}^{(j)}$, $j=1,\dots ,m$, and the corresponding time derivatives ${y}^{(j)}:={\dot{x}}^{(j)}$, the goal is to reconstruct the function f from the measurement data. Let $X=[{x}^{(1)},\dots ,{x}^{(m)}]\in {\mathbb{R}}^{d\times m}$ and $Y=[{y}^{(1)},\dots ,{y}^{(m)}]\in {\mathbb{R}}^{d\times m}$. That is, ${d}^{\prime}=d$ in this case. To represent f, we select a vector-valued basis function $\mathsf{\Psi}:{\mathbb{R}}^{d}\to {\mathbb{R}}^{n}$ and define the transformed data matrix ${\mathsf{\Psi}}_{X}\in {\mathbb{R}}^{n\times m}$ by
$${\mathsf{\Psi}}_{X}=[\mathsf{\Psi}({x}^{\left(1\right)})\text{}\dots \text{}\mathsf{\Psi}({x}^{\left(m\right)})].$$
Omitting sparsity constraints, SINDy then boils down to solving
$$\underset{\Xi}{\min}{\parallel Y-{\Xi}^{\top}{\mathsf{\Psi}}_{X}\parallel}_{F},$$
where
$$\Xi =[{\xi}_{1}\,\dots \,{\xi}_{d}]\in {\mathbb{R}}^{n\times d}$$
is the coefficient matrix. Each column vector ${\xi}_{i}$ then represents a function ${f}_{i}$, i.e.,
$${y}_{i}^{(j)}\approx {f}_{i}({x}^{(j)})={\xi}_{i}^{\top}\mathsf{\Psi}({x}^{(j)}).$$
We thus obtain a model of the form $\dot{x}={\Xi}^{\top}\mathsf{\Psi}(x)$, which approximates the possibly unknown dynamics. The solution of the minimization problem (3) with minimal Frobenius norm is given by
$${\Xi}^{\top}=Y{\mathsf{\Psi}}_{X}^{+},$$
where $+$ denotes the pseudoinverse, see [20].
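The minimum-norm solution (6) can be sketched directly in NumPy. The linear system and the monomial basis below are our own illustrative choices, not taken from the paper:

```python
import numpy as np

# Illustrative linear system dx/dt = A x with d = 2
A = np.array([[-0.5, 1.0], [-1.0, -0.5]])
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2, 100))   # m = 100 snapshots
Y = A @ X                               # exact time derivatives y = f(x)

def psi(x):
    """Basis Psi(x) = [1, x1, x2, x1*x2]^T, i.e., n = 4."""
    return np.array([1.0, x[0], x[1], x[0] * x[1]])

Psi_X = np.column_stack([psi(x) for x in X.T])  # n x m transformed data matrix
Xi_T = Y @ np.linalg.pinv(Psi_X)                # minimum-norm solution Xi^T = Y Psi_X^+

# The columns of Xi_T belonging to x1 and x2 recover the matrix A; the
# coefficients of the constant and mixed terms vanish.
```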
2.3. Tensor-Based Learning
We will now briefly introduce the basic concepts of tensor decompositions and tensor formats as well as the tensor-based reformulation of SINDy, called MANDy, proposed in [13]. Additionally, recently introduced methods for supervised learning with tensor networks will be discussed.
2.3.1. Tensor Decompositions
To mitigate the curse of dimensionality when working with tensors $\mathbf{T}\in {\mathbb{R}}^{{n}_{1}\times \dots \times {n}_{p}}$, where ${n}_{\mu}\in \mathbb{N}$, we will exploit low-rank tensor approximations. The simplest approximation of a tensor of order p is a rank-one tensor, i.e., a tensor product of p vectors given by
$$\mathbf{T}={T}^{(1)}\otimes {T}^{(2)}\otimes \cdots \otimes {T}^{(p)},$$
where ${T}^{(\mu)}$, $\mu =1,\dots ,p$, are vectors in ${\mathbb{R}}^{{n}_{\mu}}$. If a tensor is written as the sum of r rank-one tensors, i.e.,
$$\mathbf{T}=\sum _{k=1}^{r}{T}_{:,k}^{(1)}\otimes {T}_{:,k}^{(2)}\otimes \cdots \otimes {T}_{:,k}^{(p)},$$
with ${T}^{(\mu)}\in {\mathbb{R}}^{{n}_{\mu}\times r}$, this results in the so-called canonical format. In fact, any tensor can be expressed in this format, but we are particularly interested in low-rank representations of tensors in order to reduce the storage consumption as well as the computational costs. The same requirement applies to tensors expressed in the tensor train format (TT format), where a high-dimensional tensor is represented by a network of multiple low-dimensional tensors [21,22]. A tensor $\mathbf{T}\in {\mathbb{R}}^{{n}_{1}\times \cdots \times {n}_{p}}$ is said to be in the TT format if
$$\mathbf{T}=\sum _{{k}_{0}=1}^{{r}_{0}}\cdots \sum _{{k}_{p}=1}^{{r}_{p}}{\mathbf{T}}_{{k}_{0},:,{k}_{1}}^{(1)}\otimes \cdots \otimes {\mathbf{T}}_{{k}_{p-1},:,{k}_{p}}^{(p)}.$$
The tensors ${\mathbf{T}}^{(\mu)}\in {\mathbb{R}}^{{r}_{\mu -1}\times {n}_{\mu}\times {r}_{\mu}}$ of order 3 are called TT cores. The numbers ${r}_{\mu}$ are called TT ranks and have a strong influence on the expressivity of a tensor train. It holds that ${r}_{0}={r}_{p}=1$ and ${r}_{\mu}\ge 1$ for $\mu =1,\dots ,p-1$. Figure 2a shows the graphical representation of a tensor train, which is also called Penrose notation, see [23].
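To make the TT format (9) concrete, the following sketch (our own helper, feasible only for small p and ${n}_{\mu}$) contracts a list of TT cores of shape $({r}_{\mu -1},{n}_{\mu},{r}_{\mu})$ into the full tensor:

```python
import numpy as np

def tt_to_full(cores):
    """Contract TT cores of shape (r_{mu-1}, n_mu, r_mu) into the full tensor."""
    T = cores[0]                        # shape (1, n_1, r_1)
    for core in cores[1:]:
        T = np.tensordot(T, core, axes=1)  # sum over the shared rank index
    return T.squeeze(axis=(0, -1))      # remove the boundary ranks r_0 = r_p = 1

rng = np.random.default_rng(1)
cores = [rng.standard_normal((1, 3, 2)),
         rng.standard_normal((2, 4, 2)),
         rng.standard_normal((2, 5, 1))]
T = tt_to_full(cores)                   # full tensor of shape (3, 4, 5)
```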
The left- and right-unfoldings of a TT core ${\mathbf{T}}^{(\mu)}$ are given by the matrices
$${\mathcal{L}}_{\mu}={\mathbf{T}}^{(\mu)}\big|_{{r}_{\mu -1},{n}_{\mu}}^{{r}_{\mu}}\in {\mathbb{R}}^{({r}_{\mu -1}\cdot {n}_{\mu})\times {r}_{\mu}}\quad \text{and}\quad {\mathcal{R}}_{\mu}={\mathbf{T}}^{(\mu)}\big|_{{r}_{\mu -1}}^{{n}_{\mu},{r}_{\mu}}\in {\mathbb{R}}^{{r}_{\mu -1}\times ({n}_{\mu}\cdot {r}_{\mu})},$$
respectively. Here, the indices of two modes of ${\mathbf{T}}^{(\mu)}$ are lumped into a single row or column index, whereas the remaining mode forms the other dimension of the unfolding matrix. We call the TT core ${\mathbf{T}}^{(\mu)}$ left-orthonormal if its left-unfolding is orthonormal with respect to the rows, i.e., ${\mathcal{L}}_{\mu}^{\top}\cdot {\mathcal{L}}_{\mu}=\mathrm{Id}\in {\mathbb{R}}^{{r}_{\mu}\times {r}_{\mu}}$. Correspondingly, a core is called right-orthonormal if its right-unfolding is orthonormal with respect to the columns, i.e., ${\mathcal{R}}_{\mu}\cdot {\mathcal{R}}_{\mu}^{\top}=\mathrm{Id}\in {\mathbb{R}}^{{r}_{\mu -1}\times {r}_{\mu -1}}$. In Penrose notation, orthonormal components are depicted by half-filled circles, cf. Figure 2b, where a tensor train with left-orthonormal cores is shown.
A given TT core can be left- or right-orthonormalized, respectively, by computing a singular value decomposition (SVD) of its unfolding. For instance, the components of an SVD of the form ${\mathcal{L}}_{\mu}=U\cdot \Sigma \cdot {V}^{\top}$ can be interpreted as a left-orthonormalized version of ${\mathbf{T}}^{(\mu)}$ coupled with the matrices $\Sigma $ and ${V}^{\top}$. When we talk about, e.g., left-orthonormalization of the cores of a tensor train, we mean the application of sequential SVDs from left to right (also called HOSVD, cf. [24]), where U builds the updated core, while the non-orthonormal part $\Sigma \cdot {V}^{\top}$ is contracted with the subsequent TT core. As described in [13,16,25], left- and right-orthonormalization can be used to construct pseudoinverses of tensors. The general idea is to construct a global SVD of a given tensor train by left- and right-orthonormalizing its cores. However, in Section 3.2, we will exploit the structure of transformed data tensors, as introduced in [13], to propose a different method for the construction of pseudoinverses, which significantly reduces the computational effort.
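The left-to-right orthonormalization sweep described above can be sketched as follows; we use full SVDs without rank truncation (a truncated variant would additionally discard small singular values):

```python
import numpy as np

def left_orthonormalize(cores):
    """Sweep from left to right: replace each core by the U-factor of an SVD of
    its left-unfolding and contract Sigma V^T into the subsequent core."""
    cores = [c.copy() for c in cores]
    for mu in range(len(cores) - 1):
        r0, n, r1 = cores[mu].shape
        U, s, Vt = np.linalg.svd(cores[mu].reshape(r0 * n, r1),
                                 full_matrices=False)
        cores[mu] = U.reshape(r0, n, U.shape[1])
        # contract the non-orthonormal part into the next core
        cores[mu + 1] = np.tensordot(np.diag(s) @ Vt, cores[mu + 1], axes=1)
    return cores

rng = np.random.default_rng(2)
cores = [rng.standard_normal((1, 3, 2)),
         rng.standard_normal((2, 3, 2)),
         rng.standard_normal((2, 3, 1))]
ortho = left_orthonormalize(cores)  # same tensor, left-orthonormal cores
```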
We also represent TT cores as two-dimensional arrays containing vectors as elements. In this notation, a single core of a tensor train $\mathbf{T}\in {\mathbb{R}}^{{n}_{1}\times \cdots \times {n}_{p}}$ is written as
$$⟦{\mathbf{T}}^{(\mu)}⟧=⟦\begin{array}{ccc}{\mathbf{T}}_{1,:,1}^{(\mu)}& \cdots & {\mathbf{T}}_{1,:,{r}_{\mu}}^{(\mu)}\\ \vdots & \ddots & \vdots \\ {\mathbf{T}}_{{r}_{\mu -1},:,1}^{(\mu)}& \cdots & {\mathbf{T}}_{{r}_{\mu -1},:,{r}_{\mu}}^{(\mu)}\end{array}⟧.$$
2.3.2. MANDy
MANDy [13] is a tensorized version of SINDy and constructs counterparts of the transformed data matrices (2) directly in the TT format. Two different types of decompositions, namely, the coordinate-major and the function-major decomposition, were introduced in [13]. In [16], the technique for the construction of the transformed data tensors was generalized to arbitrary lists of basis functions. This will be explained in more detail in Section 3.1. Given data matrices $X,Y\in {\mathbb{R}}^{d\times m}$ and basis functions ${\psi}_{\mu}:{\mathbb{R}}^{d}\to {\mathbb{R}}^{{n}_{\mu}}$, $\mu =1,\dots ,p$, the tensor-based representation of the corresponding transformed data tensors ${\mathbf{\Psi}}_{X}\in {\mathbb{R}}^{{n}_{1}\times \cdots \times {n}_{p}\times m}$ enables us to solve the reformulated minimization problem
$$\underset{\Xi}{\min}{\parallel Y-{\Xi}^{\top}{\mathbf{\Psi}}_{X}\parallel}_{F}$$
so that the coefficients are given in the form of a tensor train $\Xi \in {\mathbb{R}}^{{n}_{1}\times \cdots \times {n}_{p}\times d}$, cf. Section 2.2. Instead of identifying the governing equations of dynamical systems from data, see [13], we seek to classify images using MANDy. The only difference is that ${\mathbf{\Psi}}_{X}$ now contains the transformed images and Y the corresponding labels. As the matrix Y may have different dimensions than X, i.e., $Y\in {\mathbb{R}}^{{d}^{\prime}\times m}$, the aim is to find the optimal solution of (12) in the form of a tensor train $\Xi \in {\mathbb{R}}^{{n}_{1}\times \cdots \times {n}_{p}\times {d}^{\prime}}$. We will discuss the explicit representation of transformed data tensors and their pseudoinversion in Section 3.
2.3.3. Supervised Learning with Tensor Networks
It has been shown in [5,6] that tensor-based optimization schemes can be adapted to supervised learning problems. A given input vector x is mapped into a higher-dimensional space using a feature map $\mathbf{\Psi}$ before being classified by a decision function $f:{\mathbb{R}}^{d}\to {\mathbb{R}}^{{d}^{\prime}}$ of the form
$$f(x)={\Xi}^{\top}\,\mathbf{\Psi}(x),$$
where $\Xi $ is a coefficient tensor in the TT format. The ith entry of the vector $f(x)$ then represents the likelihood that the image x belongs to the class with label $i-1$. The transformation defined in [5,6] reads as follows,
$$\mathbf{\Psi}(x)=\left[\begin{array}{c}\cos (\alpha \,{x}_{1})\\ \sin (\alpha \,{x}_{1})\end{array}\right]\otimes \left[\begin{array}{c}\cos (\alpha \,{x}_{2})\\ \sin (\alpha \,{x}_{2})\end{array}\right]\otimes \cdots \otimes \left[\begin{array}{c}\cos (\alpha \,{x}_{d})\\ \sin (\alpha \,{x}_{d})\end{array}\right],$$
where $\alpha $ is a parameter. However, the originally proposed choice of $\alpha =\frac{\pi}{2}$ is often not optimal. This will be discussed in more detail below. The function $\mathbf{\Psi}$ assigns each pixel of the image a two-dimensional vector, inspired by the spin vectors encountered in quantum mechanics [6]. It was illustrated in [14] how such a transformation can be implemented as a quantum feature map, where the information is encoded in the amplitudes of qubits. Embedding data into quantum Hilbert spaces might be interesting in cases where the quantum device evaluates kernels faster or where kernels cannot be simulated by classical computers anymore [14].
Due to the tensor structure, $\mathbf{\Psi}\left(x\right)$ is a tensor with ${2}^{d}$ entries, which, for the original MNIST image size, amounts to $n\approx {10}^{236}$ basis functions. In [5,6], the image size is first reduced to $14\times 14$ pixels by averaging groups of four pixels, which then results in “only” $n\approx {10}^{59}$ basis functions. Thus, storing the full coefficient matrix is clearly infeasible since $\Xi \in {\mathbb{R}}^{2\times \cdots \times 2\times {d}^{\prime}}\cong {\mathbb{R}}^{n\times {d}^{\prime}}$. Here, ${d}^{\prime}$ appears as an additional tensor index since the decision function is computed for all ${d}^{\prime}$ labels simultaneously.
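The local feature map (14) is straightforward to implement; since the full rank-one tensor has ${2}^{d}$ entries, the sketch below stores only the d local two-dimensional factors and assembles the full tensor only for a toy example with $d=3$ (α is kept as a parameter, as discussed above):

```python
import numpy as np

def local_features(x, alpha=np.pi / 2):
    """Return the d local factors [cos(alpha*x_i), sin(alpha*x_i)] of Psi(x).

    The full rank-one tensor Psi(x) has 2**d entries and is never formed
    for realistic image sizes.
    """
    return [np.array([np.cos(alpha * xi), np.sin(alpha * xi)]) for xi in x]

x = np.array([0.0, 1.0, 0.5])           # a toy "image" with d = 3 pixels in [0, 1]
factors = local_features(x)

# For small d, the full feature tensor can be assembled explicitly:
Psi = factors[0]
for v in factors[1:]:
    Psi = np.tensordot(Psi, v, axes=0)  # outer product; Psi has shape (2, 2, 2)
```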
To learn the tensor $\Xi $ from training data, a DMRG/ALS-related algorithm (cf. [4,28]) that sweeps back and forth along the cores and iteratively minimizes the cost function
$$\underset{\Xi}{\min}\sum _{j=1}^{m}{\parallel {y}^{(j)}-{\Xi}^{\top}\,\mathbf{\Psi}({x}^{(j)})\parallel}_{2}^{2}$$
is devised. The suggested algorithm varies two neighboring cores at the same time, which allows for adapting the tensor ranks, and computes an update using a gradient descent step. The tensor ranks are reduced by truncated SVDs to control the computational costs. The truncation of the TT ranks can also be interpreted as a form of regularization. For more details, we refer to [5,6].
Different techniques to improve the original algorithm presented in [5] were proposed. In [29], the image data is preprocessed using a discrete cosine transformation and the ordering of the pixels is optimized in order to reduce the ranks. In [10], the DMRGbased sweeping method was replaced by a stochastic gradient descent approach, where the gradient is computed with the aid of automatic differentiation. Furthermore, it was shown that GPUs allow for an efficient solution of such problems.
3. Tensor-Based Classification Algorithms
We will now describe two different tensor-based classification approaches. First, we show how to combine MANDy with kernel-based regression techniques, so as to derive an efficient method for the computation of the pseudoinverse of the transformed data tensor. Then, a classification algorithm based on the alternating optimization of the TT cores of the coefficient tensor is proposed.
3.1. Basis Decomposition
As above, let $x\in {\mathbb{R}}^{d}$ be a vector and ${\psi}_{\mu}:{\mathbb{R}}^{d}\to {\mathbb{R}}^{{n}_{\mu}}$, $\mu =1,\dots ,p$, basis functions. We consider the rankone tensors
$$\mathbf{\Psi}\left(x\right)={\psi}_{1}\left(x\right)\otimes \cdots \otimes {\psi}_{p}\left(x\right)=\left[\begin{array}{c}{\psi}_{1,1}\left(x\right)\\ \vdots \\ {\psi}_{1,{n}_{1}}\left(x\right)\end{array}\right]\otimes \cdots \otimes \left[\begin{array}{c}{\psi}_{p,1}\left(x\right)\\ \vdots \\ {\psi}_{p,{n}_{p}}\left(x\right)\end{array}\right]\in {\mathbb{R}}^{{n}_{1}\times {n}_{2}\times \cdots \times {n}_{p}}.$$
For m different vectors stored in a data matrix $X=[{x}^{\left(1\right)},\dots ,{x}^{\left(m\right)}]\in {\mathbb{R}}^{d\times m}$, we must construct transformed data tensors ${\mathbf{\Psi}}_{X}\in {\mathbb{R}}^{{n}_{1}\times \cdots \times {n}_{p}\times m}$ with ${\left({\mathbf{\Psi}}_{X}\right)}_{:,\dots ,:,j}=\Psi \left({x}^{\left(j\right)}\right)$. In [13,16], this was achieved by multiplying (with the aid of the tensor product) the rankone decompositions given in (16) for all vectors, ${x}^{\left(1\right)},\dots ,{x}^{\left(m\right)}$, by additional unit vectors and subsequently summing them up. The transformed data tensor can then be represented using the following canonical/TT decompositions,
$$\begin{aligned}{\mathbf{\Psi}}_{X}&=\sum _{j=1}^{m}\mathbf{\Psi}({x}^{(j)})\otimes {e}_{j}\\ &=\sum _{j=1}^{m}{\psi}_{1}({x}^{(j)})\otimes \cdots \otimes {\psi}_{p}({x}^{(j)})\otimes {e}_{j}\\ &=⟦{\psi}_{1}({x}^{(1)})\,\cdots \,{\psi}_{1}({x}^{(m)})⟧\otimes ⟦\begin{array}{ccc}{\psi}_{2}({x}^{(1)})& & 0\\ & \ddots & \\ 0& & {\psi}_{2}({x}^{(m)})\end{array}⟧\otimes \cdots \\ &\phantom{=}\cdots \otimes ⟦\begin{array}{ccc}{\psi}_{p}({x}^{(1)})& & 0\\ & \ddots & \\ 0& & {\psi}_{p}({x}^{(m)})\end{array}⟧\otimes ⟦\begin{array}{c}{e}_{1}\\ \vdots \\ {e}_{m}\end{array}⟧\\ &=⟦{\mathbf{\Psi}}_{X}^{(1)}⟧\otimes \cdots \otimes ⟦{\mathbf{\Psi}}_{X}^{(p+1)}⟧,\end{aligned}$$
where ${e}_{j}$, $j=1,\dots ,m$, denote the unit vectors of the standard basis in the m-dimensional Euclidean space. An entry of ${\mathbf{\Psi}}_{X}$ is given by
$${\left({\mathbf{\Psi}}_{X}\right)}_{{i}_{1},\dots ,{i}_{p},j}={\psi}_{1,{i}_{1}}({x}^{(j)})\cdot \dots \cdot {\psi}_{p,{i}_{p}}({x}^{(j)}),$$
for $1\le {i}_{k}\le {n}_{k}$ and $1\le j\le m$. Thus, the matrix-based counterpart of ${\mathbf{\Psi}}_{X}$, see (2), would be given by the mode-p unfolding
$${\mathsf{\Psi}}_{X}={\mathbf{\Psi}}_{X}\big|_{{n}_{1},\dots ,{n}_{p}}^{m}.$$
That is, the modes ${n}_{1},\dots ,{n}_{p}$ represent row indices of the unfolding, and mode m is the column index. However, for the purposes of this paper, we modify the representation of our transformed data tensors. First, note that the last core of the TT representation in (17) can be neglected, as it is only a reshaped identity matrix. The result is then a tensor network with an “open arm”, which can be regarded as a tensor train with an additional column mode located at the last core, see Figure 3a. Second, this additional mode can be shifted to any TT core of the decomposition. This is shown in Figure 3b. We will benefit from these modifications in Section 3.3 when constructing the subproblems for the ALS-inspired approach. Consider the TT decomposition ${\widehat{\mathbf{\Psi}}}_{X}$ given by
$${\widehat{\mathbf{\Psi}}}_{X}=⟦{\mathbf{\Psi}}_{X}^{(1)}⟧\otimes \cdots \otimes ⟦{\mathbf{\Psi}}_{X}^{(p-1)}⟧\otimes ⟦\begin{array}{c}{\psi}_{p}({x}^{(1)})\\ \vdots \\ {\psi}_{p}({x}^{(m)})\end{array}⟧.$$
Note that this tensor is an element of the tensor space ${\mathbb{R}}^{{n}_{1}\times \cdots \times {n}_{p}}$, i.e., ${\widehat{\mathbf{\Psi}}}_{X}$ has no additional column dimension, and it holds that
$${\widehat{\mathbf{\Psi}}}_{X}\big|_{{n}_{1},\dots ,{n}_{p}}={\mathsf{\Psi}}_{X}\cdot {[1,\dots ,1]}^{\top}.$$
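The entrywise formula (18) for the transformed data tensor can be checked numerically on a small example; the basis functions and dimensions below are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 2, 4
X = rng.uniform(0, 1, size=(d, m))      # m = 4 data points in R^2

def psi1(x):
    """psi_1(x) = [1, x_1], so n_1 = 2."""
    return np.array([1.0, x[0]])

def psi2(x):
    """psi_2(x) = [1, x_2], so n_2 = 2."""
    return np.array([1.0, x[1]])

# Psi_X as a sum of rank-one terms Psi(x^(j)) (x) e_j, cf. Equation (17)
Psi_X = np.zeros((2, 2, m))
for j in range(m):
    Psi_X[:, :, j] = np.outer(psi1(X[:, j]), psi2(X[:, j]))
```

Each entry indeed factorizes into a product of local basis evaluations, e.g., ${\left({\mathbf{\Psi}}_{X}\right)}_{2,2,j}={x}_{1}^{(j)}{x}_{2}^{(j)}$ for the basis chosen here.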
Now, we define ${\widehat{\mathbf{\Psi}}}_{X,\mu}\in {\mathbb{R}}^{{n}_{1}\times \cdots \times {n}_{p}\times m}$ to be the tensor derived from ${\widehat{\mathbf{\Psi}}}_{X}$ by replacing the $\mu $th core by
$${\widehat{\mathbf{\Psi}}}_{X,\mu}^{(\mu)}=⟦\begin{array}{ccc}\left[\begin{array}{cccc}& 0& \cdots & 0\\ {\psi}_{\mu}({x}^{(1)})& \vdots & & \vdots \\ & 0& \cdots & 0\end{array}\right]& & 0\\ & \ddots & \\ 0& & \left[\begin{array}{cccc}0& \cdots & 0& \\ \vdots & & \vdots & {\psi}_{\mu}({x}^{(m)})\\ 0& \dots & 0& \end{array}\right]\end{array}⟧\in {\mathbb{R}}^{m\times {n}_{\mu}\times m\times m},$$
where the outer modes correspond to the rank dimensions, whereas the inner modes represent the dimensions of the matrices. Analogously, for the first and the last core of ${\widehat{\mathbf{\Psi}}}_{X,\mu}$, the non-diagonal core structure has to be used. The four-dimensional TT core (22) naturally represents a component of a TT operator. In what follows, we will not need to store the whole TT core given in (22); otherwise, this would mean that we have to save ${m}^{3}\cdot {n}_{\mu}$ scalar entries (not using a sparse format). However, from a theoretical point of view, ${\mathbf{\Psi}}_{X}$ in Figure 3a and ${\widehat{\mathbf{\Psi}}}_{X,\mu}$ in Figure 3b represent the same tensor in ${\mathbb{R}}^{{n}_{1}\times \cdots \times {n}_{p}\times m}$, see Appendix A.
3.2. Kernel-Based MANDy
Given a training set $X\in {\mathbb{R}}^{d\times m}$, the corresponding label matrix $Y\in {\mathbb{R}}^{{d}^{\prime}\times m}$, and a set of basis functions ${\psi}_{\mu}:{\mathbb{R}}^{d}\to {\mathbb{R}}^{{n}_{\mu}}$, $\mu =1,\dots ,p$, we exploit the canonical representation of ${\mathbf{\Psi}}_{X}$ given in (17) for kernel-based MANDy. The aim is to solve the optimization problem (12), i.e., we try to find a coefficient tensor $\Xi \in {\mathbb{R}}^{{n}_{1}\times \cdots \times {n}_{p}\times {d}^{\prime}}$ such that ${\Xi}^{\top}{\mathbf{\Psi}}_{X}$ is as close as possible to the corresponding label matrix $Y\in {\mathbb{R}}^{{d}^{\prime}\times m}$. The solution of (12) with minimal Frobenius norm is given by ${\Xi}^{\top}=Y{\mathbf{\Psi}}_{X}^{+}$, cf. (6). Note that, compared to standard SINDy/MANDy, the matrix Y here does not necessarily have the same dimensions as X. Due to potentially large ranks of the transformed data tensor ${\mathbf{\Psi}}_{X}$, the direct computation of the pseudoinverse using left- and right-orthonormalization, as proposed in [13], would be computationally expensive. However, using the identity ${\mathbf{\Psi}}_{X}^{+}={({\mathbf{\Psi}}_{X}^{\top}{\mathbf{\Psi}}_{X})}^{+}{\mathbf{\Psi}}_{X}^{\top}$, we can rewrite the coefficient tensor as
$${\Xi}^{\top}=Y{\left({\mathbf{\Psi}}_{X}^{\top}{\mathbf{\Psi}}_{X}\right)}^{+}{\mathbf{\Psi}}_{X}^{\top}.$$
The contraction of ${\mathbf{\Psi}}_{X}^{\top}$ and ${\mathbf{\Psi}}_{X}$ yields a Gram matrix $G\in {\mathbb{R}}^{m\times m}$ whose entries are given by the resulting kernel function $k(x,{x}^{\prime})=\langle \mathbf{\Psi}\left(x\right),\text{}\mathbf{\Psi}\left({x}^{\prime}\right)\rangle $, i.e.,
$${G}_{i,j}=k({x}^{(i)},{x}^{(j)})=\langle \mathbf{\Psi}({x}^{(i)}),\,\mathbf{\Psi}({x}^{(j)})\rangle .$$
Note that, due to the tensor structure of ${\mathbf{\Psi}}_{X}$, we obtain
$$k({x}^{(i)},{x}^{(j)})=\prod _{\mu =1}^{p}\langle {\psi}_{\mu}({x}^{(i)}),{\psi}_{\mu}({x}^{(j)})\rangle ,$$
i.e., a product of p local kernels.
Remark 1.
The product structure of the kernel allows us to compute the Gram matrix G as a Hadamard product (denoted by ⊙) of p matrices, that is,
$$G={\Theta}_{1}\odot {\Theta}_{2}\odot \cdots \odot {\Theta}_{p},$$
where ${\Theta}_{\mu}\in {\mathbb{R}}^{m\times m}$ is given by
$${\Theta}_{\mu}={[{\psi}_{\mu}({x}^{(1)}),\dots ,{\psi}_{\mu}({x}^{(m)})]}^{\top}\cdot [{\psi}_{\mu}({x}^{(1)}),\dots ,{\psi}_{\mu}({x}^{(m)})].$$
We now define $Z:=Y\,{G}^{+}\in {\mathbb{R}}^{{d}^{\prime}\times m}$, which can be obtained by solving the system $Z\,G=Y$ (in the least-squares sense if G is singular). The decision function f, cf. (13), is then given by
$$f(x)=\underset{=:{\Xi}^{\top}}{\underbrace{Z{\mathbf{\Psi}}_{X}^{\top}}}\,\mathbf{\Psi}(x)=Z\left[\begin{array}{c}k({x}^{(1)},x)\\ \vdots \\ k({x}^{(m)},x)\end{array}\right]$$
and again only requires kernel evaluations. As above, we can use a sequence of Hadamard products to compute ${\mathbf{\Psi}}_{X}^{\top}\mathbf{\Psi}(x)$. The classification problem can thus be solved as summarized in Algorithm 1.
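Remark 1 can be verified numerically: the Gram matrix of the tensor-product features equals the Hadamard product of the local Gram matrices. A small NumPy check, using the cos/sin feature map (14) with $\alpha =\frac{\pi}{2}$ as local basis (the toy dimensions are our own choice):

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 3, 5
X = rng.uniform(0, 1, size=(d, m))
alpha = np.pi / 2

# Local transformed data matrices [psi_mu(x^(1)), ..., psi_mu(x^(m))], one per pixel
Phi = [np.vstack([np.cos(alpha * X[mu]), np.sin(alpha * X[mu])]) for mu in range(d)]

# Gram matrix as a Hadamard product of the local Gram matrices Theta_mu, cf. (26)
G = np.ones((m, m))
for mu in range(d):
    G *= Phi[mu].T @ Phi[mu]
```

Since each local factor has unit norm, the diagonal of G consists of ones; the off-diagonal entries are products of the p local inner products.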
Algorithm 1 Kernel-based MANDy for classification.
Input: Training set X and label matrix Y, test set $\tilde{X}$, basis functions.
Output: Label matrix $\tilde{Y}$.
We could also replace the pseudoinverse ${G}^{+}$ by the regularized inverse ${(G+\epsilon \,\mathrm{Id})}^{-1}$, where $\epsilon $ is the regularization parameter, which would lead to a slightly different system of linear equations. However, for the numerical experiments in Section 4, we do not use regularization. Algorithm 1 is equivalent to kernel ridge regression (see, e.g., [15]) with a tensor product kernel. This is not surprising, as we are solving simple least-squares problems.
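Putting the pieces together, Algorithm 1 reduces to a few lines. The sketch below is our own minimal implementation on toy data, not the authors' reference code; it uses the regularized inverse ${(G+\epsilon \,\mathrm{Id})}^{-1}$ mentioned above together with the cos/sin kernel:

```python
import numpy as np

def gram(A, B, alpha=np.pi / 2):
    """Gram matrix of the cos/sin feature map (14) via Hadamard products."""
    G = np.ones((A.shape[1], B.shape[1]))
    for mu in range(A.shape[0]):
        fa = np.vstack([np.cos(alpha * A[mu]), np.sin(alpha * A[mu])])
        fb = np.vstack([np.cos(alpha * B[mu]), np.sin(alpha * B[mu])])
        G *= fa.T @ fb
    return G

def classify(X, Y, X_test, eps=1e-6):
    """Kernel-based MANDy sketch: solve Z(G + eps*Id) = Y, then classify."""
    m = X.shape[1]
    Z = np.linalg.solve(gram(X, X) + eps * np.eye(m), Y.T).T  # Z = Y (G + eps*Id)^{-1}
    return np.argmax(Z @ gram(X, X_test), axis=0)             # argmax of f per column

# Toy example: two well-separated classes with one-hot labels
X = np.array([[0.1, 0.2, 0.9, 0.8],
              [0.1, 0.15, 0.85, 0.9]])
Y = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
X_test = np.array([[0.15, 0.85],
                   [0.1, 0.9]])
labels = classify(X, Y, X_test)
```

Note that the training points never enter through an explicit feature tensor; only the $m\times m$ Gram matrix and the kernel vectors of the test points are formed, which is exactly the point of the kernel-based reformulation.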
Remark 2.
Note that the kernel does not necessarily have to be based on tensor products of basis functions for this method to work; we could also simply use, e.g., a Gaussian kernel, which for the MNIST data set leads to slightly lower but similar classification rates. Tensor-based kernels, however, have an exponentially large yet explicit feature-space representation and additional structure that could be exploited to speed up computations. Moreover, the kernel-based algorithm outlined above can in the same way be applied to time-series data to learn governing equations in potentially infinite-dimensional feature spaces.
Compared to the method proposed in [5,6], the advantage of our approach, which can be regarded as a kernel-based formulation of MANDy (or SINDy), is that we can compute a closed-form solution without any iterations or sweeps. However, even though this approach computes an optimal solution of the minimization problem (12), the runtime as well as the memory consumption of the algorithm depend crucially on the size of the training data set (and also on the number of labels), and the resulting coefficient tensor $\Xi$ has no guaranteed low-rank structure. We will now propose an alternating optimization method that circumvents this problem.
3.3. Alternating Ridge Regression
In what follows, we will use the TT representation illustrated in Figure 3b for the transformed data tensor ${\mathbf{\Psi}}_{X}\in {\mathbb{R}}^{{n}_{1}\times \cdots \times {n}_{p}\times m}$. Even though we do not consider a TT operator, the proposed approach is closely related to the DMRG method [4], also called alternating linear scheme (ALS) [28]. As in [5,6], the idea is to compute a low-rank TT approximation of the coefficient tensor $\Xi$ by an alternating scheme, that is, a low-dimensional system of linear equations is solved for each TT core. Our approach is outlined in Algorithm 2.
First, note that instead of solving the minimization problem (12), we can also find separate solutions of
$$\underset{{\Xi}_{i}}{min}{\parallel {Y}_{i,:}-{\Xi}_{i}^{\top}{\mathbf{\Psi}}_{X}\parallel}_{2}$$
for each row of Y. As these systems can be solved independently, Algorithm 2 can be easily parallelized. We then use a DMRG/ALS-inspired scheme to split the optimization problem (30) into p subproblems. The micro-matrix ${M}_{\mu}$ of such a subproblem can be built from three different parts, namely ${\widehat{\mathbf{\Psi}}}_{X,\mu}^{\left(\mu \right)}$, ${P}_{\mu}$, and ${Q}_{\mu}$. The latter two are collected in a left and a right stack to avoid repeated computations. Note that ${P}_{\mu}$ is determined by contracting ${P}_{\mu -1}$ with the $(\mu -1)$th cores of ${\Xi}_{i}$ and ${\widehat{\mathbf{\Psi}}}_{X}$. Analogously, ${Q}_{\mu}$ is built from ${Q}_{\mu +1}$ and the $(\mu +1)$th cores of ${\Xi}_{i}$ and ${\widehat{\mathbf{\Psi}}}_{X}$. During the first half sweep of Algorithm 2, we only have to compute the matrices ${P}_{\mu}$, as the matrices ${Q}_{\mu}$ used there are not based on any updated cores. Afterwards, the matrices ${Q}_{\mu}$ are (re)computed during the second half sweep. See [28] for further details and Figure 4 for a graphical illustration of the construction of the subproblems and the extraction of the optimized core. Note that it is not necessary to store the (sparse) core ${\widehat{\mathbf{\Psi}}}_{X,\mu}^{\left(\mu \right)}$ in its full representation as a 4-dimensional array to construct the matrix ${M}_{\mu}$. By using, e.g., NumPy's einsum, the TT core can be replaced by a (dense) matrix containing the corresponding function evaluations.
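The recursive construction of the left stack described above can be sketched with NumPy's einsum. This is a simplified sketch under the assumption that, due to the sample-diagonal structure of the transformed data tensor, the contraction with $\widehat{\mathbf{\Psi}}_{X}$ reduces to sample-wise products of basis evaluations; the array names and shapes are ours, not those of the reference implementation.

```python
import numpy as np

def left_stack(psi, cores):
    """Left stack for an ALS sweep: P[mu][j, :] contracts the first mu
    TT cores of the coefficient tensor with the basis evaluations of
    sample j. psi[mu] has shape (m, n_mu); cores[mu] has shape
    (r_mu, n_mu, r_{mu+1}) with trivial boundary rank r_0 = 1."""
    m = psi[0].shape[0]
    P = [np.ones((m, 1))]  # trivial rank r_0 = 1
    for F, core in zip(psi, cores):
        # contract the previous stack entry with the next coefficient
        # core and the corresponding basis evaluations, sample-wise
        P.append(np.einsum('ja,jn,anb->jb', P[-1], F, core))
    return P
```

The right stack ${Q}_{\mu}$ is built analogously from the last core backwards; during the first half sweep only the ${P}_{\mu}$ need to be updated.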
Algorithm 2 Alternating ridge regression (ARR) for classification.
Input: training set X and label matrix Y, test set $\tilde{X}$, basis functions, initial guesses.
Output: label matrix $\tilde{Y}$.
(The loop over the rows of Y is parallelizable; the TT cores are optimized during a first and a second half sweep.)
By orthonormalizing the fixed cores of $\Xi$ and using truncated SVDs [17] for solving the subsystems, we can interpret our approach as a core-wise ridge regression approximating the solution obtained by kernel-based MANDy, see Appendix B. After approximating the coefficient tensor
$$\Xi =\sum _{i=1}^{{d}^{\prime}}{\Xi}_{i}\otimes {e}_{i},$$
the decision function f is given by (13). The main difference between our approach and the method introduced in [5,6] is that we do not update the TT cores of $\Xi$ using gradient descent steps. Instead, we solve a low-dimensional system of linear equations corresponding to the entire training data set, whose solution yields the updated core. Moreover, we solve a separate minimization problem for each row of the label matrix Y. Using the modified basis decomposition introduced in Section 3.1, it is possible to significantly reduce the storage consumption of the stacks, see Algorithm 2, Lines 4 and 11. If we only used the fixed representation of ${\mathbf{\Psi}}_{X}$ given in (17), the additional mode would lead to a much higher storage consumption of the right stack. Thus, our method provides an efficient construction of the subproblems.
4. Numerical Results
We apply the tensor-based classification algorithms described in Section 3.2 and Section 3.3 to both the MNIST and fashion MNIST data sets, choosing the basis defined in (14) and setting $\alpha \approx 0.59$. This value was determined empirically for the MNIST data set, but it also leads to better classification rates for the fashion MNIST set. Kernel-based MANDy as well as ARR are available in Scikit-TT (https://github.com/PGelss/scikit_tt). The numerical experiments were performed on a Linux machine with 128 GB RAM and an Intel Xeon processor with a clock speed of 3 GHz and eight cores.
For the first approach, kernel-based MANDy, we do not apply any regularization techniques. For the ARR approach, we set the TT ranks of each solution ${\Xi}_{i}$, see Algorithm 2, to 10 and repeat the sweeping scheme five times. Here, we use regularization, i.e., truncated SVDs with a relative threshold of 10^{−2} are applied to the minimization problems given in Algorithm 2 (Lines 8 and 13). The obtained classification rates for the reduced and full MNIST and fashion MNIST data sets are shown in Figure 5.
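The truncated-SVD regularization of the subproblems can be sketched as follows: a relative threshold of 10^{−2} corresponds to discarding all singular values below $10^{-2}\,\sigma_{max}$ when forming the pseudoinverse.

```python
import numpy as np

def truncated_svd_solve(M, y, rel_threshold=1e-2):
    """Solve M x = y in the least-squares sense, discarding all
    singular values below rel_threshold * sigma_max. This acts as
    a regularization of the subproblem (cf. Appendix B)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    keep = s >= rel_threshold * s[0]     # relative truncation criterion
    return Vt[keep].T @ ((U[:, keep].T @ y) / s[keep])
```

For well-conditioned systems no singular value is discarded and the routine reproduces the ordinary least-squares solution; it is equivalent to applying `np.linalg.pinv` with the corresponding `rcond` cutoff.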
Similarly to [5,6], we first apply the classifiers to the reduced data sets, see Figure 5a. Using MANDy, we obtain classification rates of up to 98.75% for the MNIST and 88.82% for the fashion MNIST data set. Using the ARR approach, the classification rates are not monotonically increasing, which may simply be an effect of the alternating optimization scheme. The highest classification rates we obtain are 98.16% for the MNIST data and 87.55% for the fashion MNIST data. We typically obtain a 100% classification rate on the training data (a consequence of the richness of the feature space). This is not necessarily a desired property, as the learned model might not generalize well to new data, but it seems to have no detrimental effects for the simple MNIST classification problem. As shown in Figure 5b, kernel-based MANDy can still be applied when considering the full data sets without reducing the image size. Here, we obtain classification rates of up to 97.24% for the MNIST and 88.37% for the fashion MNIST data set. That we obtain lower classification rates for the full images than for the reduced ones might be due to the fact that pixel-by-pixel comparisons of images are not expedient; the averaging effect caused by downscaling the images helps to detect coarser features, similar to the effect of convolutional kernels and pooling layers. In principle, ARR can also be used for the classification of the full data sets. So far, however, our numerical experiments produced only classification rates significantly lower than those obtained by MANDy (95.94% for the MNIST and 82.18% for the fashion MNIST data set). This might be due to convergence issues caused by the kernel. The application to higher-order transformed data tensors and potential improvements of ARR will be part of our future research.
Figure 5 also shows a comparison with TensorFlow. We run the code provided as a classification tutorial (www.tensorflow.org/tutorials/keras/basic_classification) ten times and compute the average classification rate. The input layer of the network comprises 784 nodes (one for each pixel; for the reduced data sets, we thus have only 196 input nodes), followed by two dense layers with 128 and 10 nodes. The layer with 10 nodes is the output layer containing the probabilities that a given image belongs to the class represented by the respective neuron. Note that although more sophisticated methods and architectures exist for these problems (see the (fashion) MNIST websites for rankings), the results show that our tensor-based approaches are competitive with state-of-the-art deep-learning techniques.
To understand the numerical results for the MNIST data set (obtained by applying kernel-based MANDy to all 60,000 training images), we analyze the misclassified images, examples of which are displayed in Figure 6a. For misclassified images x, the entries of $f\left(x\right)$, see (29), are often numerically zero, which implies that no image in the training set is similar enough for the kernel to pick up the resemblance. Some of the remaining misclassified digits are hard to recognize even for humans. Histograms demonstrating which categories are misclassified most often are shown in Figure 6b. Here, we simply count the instances where an image with label i was assigned the wrong label j. The digits 2 and 7, as well as 4 and 9, are confused most frequently. Additionally, we wish to visualize what the algorithm detects in the images. To this end, we perform a sensitivity analysis as follows: starting with an image whose pixel values are constant everywhere (zero or any other value smaller than one; we choose 0.5), we set pixel $(i,j)$ to one and compute $y=f\left(x\right)$ for this image. The process is repeated for all pixels. For each label, we then plot a heat map of the values of y, which tells us which pixels contribute most to the classification of the images. The resulting maps are shown in Figure 6c. Except for the digit 1, the results are highly similar to the images obtained by averaging over all images containing a certain digit.
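The sensitivity analysis described above can be sketched in a few lines; `f` is any decision function returning one score per label, and the image size and background value are parameters of the experiment.

```python
import numpy as np

def sensitivity_maps(f, height=14, width=14, background=0.5):
    """Heat maps of the decision function f: for every pixel, start
    from a constant background image, switch that pixel on, and
    record the label scores f(x) (one entry per class)."""
    base = np.full(height * width, background)
    n_labels = len(f(base))
    maps = np.zeros((n_labels, height, width))
    for i in range(height):
        for j in range(width):
            x = base.copy()
            x[i * width + j] = 1.0        # activate a single pixel
            maps[:, i, j] = f(x)
    return maps
```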
Figure 7 shows examples of misclassified images and the corresponding histogram as well as the results of the sensitivity analysis for the fashion MNIST data set. We see that the images of shirts (6) are most difficult to classify (due to the ambiguity in the category definitions), whereas trousers (1) and bags (8) have the lowest misclassification rates (probably due to their distinctive shapes). In contrast to the MNIST data set, the results of the sensitivity analysis differ widely from the average images. The classifier for coats (4), for instance, “looks for” a zipper and coat pockets, which are not visible in the “average coat”, and the classifier for dresses (3) seems to base the decision on the presence of creases, which are also not distinguishable in the “average dress”. The interpretation of other classifiers is less clear, e.g., the ones for sandals (5) and sneakers (7) seem to be contaminated by other classes.
Comparing the runtimes of both approaches applied to the reduced data sets with 60,000 training images, kernel-based MANDy needs approximately one hour for the construction of the decision function (29), whereas ARR needs less than ten minutes to compute the coefficient tensor when Algorithm 2 is parallelized.
5. Conclusions
In this work, we presented two different tensor-based approaches for supervised learning. We showed that a kernel-based extension of MANDy can be utilized for image classification: by extending the method to arbitrary least-squares problems (originally, MANDy was developed to learn the governing equations of dynamical systems) and using sequences of Hadamard products for the computation of the pseudoinverse, we were able to demonstrate the potential of kernel-based MANDy on the MNIST and fashion MNIST data sets. Additionally, we proposed the alternating optimization scheme ARR, which approximates the coefficient tensors by low-rank TT decompositions. Here, we used a mutable tensor representation of the transformed data tensors in order to construct low-dimensional regression problems for optimizing the TT cores of the coefficient tensor.
Both approaches use an exponentially large set of basis functions in combination with least-squares regression techniques on a given set of training images. The results are encouraging and show that methods exploiting tensor products of simple basis functions are able to detect characteristic features in image data. The work presented in this paper constitutes a further step towards tensor-based techniques for machine learning.
The reason why we can handle the extremely high-dimensional feature space spanned by the basis functions is its tensor-product format. Besides the general questions of the choice of basis functions and their expressivity, the rank-one tensor products used in this work can, in principle, be replaced by other structures, which might result in higher classification rates. For instance, the transformation of an image could be given by a TT representation with higher ranks or by hierarchical tensor decompositions (with the aim of detecting features on different levels of abstraction). Furthermore, we could define different basis functions for each pixel, vary the number of basis functions per pixel, or define basis functions for groups of pixels.
Even though kernel-based MANDy computes the minimum-norm solution of the considered regression problems as an exact TT decomposition, the method is likely to suffer from high ranks of the transformed data tensors and might thus not be competitive for large data sets. At the moment, we compute the Gram matrix for the entire training data set. However, a possibility to speed up computations and to lower the memory consumption is to exploit the properties of the kernel: if the kernel almost vanishes whenever two images differ significantly in at least one pixel (as is the case for the specific kernel used in this work, provided that the originally proposed value $\alpha =\frac{\pi}{2}$ is used), the Gram matrix is essentially sparse when entries smaller than a given threshold are set to zero. Using sparse solvers would then allow us to handle much larger data sets. Moreover, the construction of the Gram matrix is highly parallelizable, and GPUs could be used to assemble it more efficiently.
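A possible realization of this idea (our sketch of the proposed speed-up, not part of the presented algorithms) uses SciPy's sparse solver after thresholding the Gram matrix:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def sparse_gram_solve(G, Y, threshold=1e-8):
    """Exploit the (near-)sparsity of the Gram matrix: entries below
    the threshold are dropped and Z G = Y is solved with a sparse
    solver instead of a dense pseudoinverse."""
    G_sp = sparse.csr_matrix(np.where(np.abs(G) < threshold, 0.0, G))
    # Z G = Y is equivalent to G^T Z^T = Y^T
    Z = spsolve(G_sp.T.tocsc(), sparse.csc_matrix(Y.T)).toarray().T
    return Z
```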
Further modifications of ARR such as different regression methods for the subproblems, an optimized ordering of the TT cores, and specific initial coefficient tensors can help to improve the results. We provided an explanation for the stability of ARR, but the properties of alternating regression schemes have to be analyzed in more detail in the future.
Author Contributions
Conceptualization, S.K. and P.G.; methodology, S.K. and P.G.; software, S.K. and P.G.; writing, S.K. and P.G.
Funding
This research has been funded by Deutsche Forschungsgemeinschaft (DFG) through grant CRC 1114 “Scaling Cascades in Complex Systems”. Part of this research was performed while S.K. was visiting the Institute for Pure and Applied Mathematics (IPAM), which is supported by the National Science Foundation (Grant No. DMS1440415).
Acknowledgments
We would like to thank Michael Götte and Alex Goeßmann from the TU Berlin for interesting discussions related to tensor decompositions and system identification. The publication of this article was funded by Freie Universität Berlin.
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A. Representation of Transformed Data Tensors
Proposition A1.
For all $i\in \{1,\dots ,p\}$, it holds that
$${\widehat{\mathbf{\Psi}}}_{X,i}={\mathbf{\Psi}}_{X}.$$
That is, the TT decompositions ${\widehat{\mathbf{\Psi}}}_{X,i}$ and ${\mathbf{\Psi}}_{X}$ represent the same tensor in ${\mathbb{R}}^{{n}_{1}\times \cdots \times {n}_{p}\times m}$.
Proof.
An entry of ${\widehat{\mathbf{\Psi}}}_{X,\mu}$, $1<\mu <p$, is given by
$${\left({\widehat{\mathbf{\Psi}}}_{X,\mu}\right)}_{{i}_{1},\dots ,{i}_{p},j}=\sum _{{k}_{1}=1}^{m}\cdots \sum _{{k}_{p-1}=1}^{m}{\left({\widehat{\mathbf{\Psi}}}_{X,\mu}^{\left(1\right)}\right)}_{1,{i}_{1},{k}_{1}}\cdot \ldots \cdot {\left({\widehat{\mathbf{\Psi}}}_{X,\mu}^{\left(\mu \right)}\right)}_{{k}_{\mu -1},{i}_{\mu},j,{k}_{\mu}}\cdot \ldots \cdot {\left({\widehat{\mathbf{\Psi}}}_{X,\mu}^{\left(p\right)}\right)}_{{k}_{p-1},{i}_{p},1}.$$
By definition,
$${\left({\widehat{\mathbf{\Psi}}}_{X,\mu}^{\left(\mu \right)}\right)}_{{k}_{\mu -1},{i}_{\mu},j,{k}_{\mu}}\ne 0\text{}\iff \text{}{k}_{\mu -1}=j={k}_{\mu}.$$
On the other hand, an entry of ${\widehat{\mathbf{\Psi}}}_{X,\mu}^{\left(\nu \right)}$ with $\nu \ne \mu$ and $1<\nu <p$ is nonzero if and only if ${k}_{\nu -1}={k}_{\nu}$. It follows that
$$\begin{array}{cc}\hfill {\left({\widehat{\mathbf{\Psi}}}_{X,\mu}\right)}_{{i}_{1},\dots ,{i}_{p},j}& ={\left({\widehat{\mathbf{\Psi}}}_{X,\mu}^{\left(1\right)}\right)}_{1,{i}_{1},j}\cdot \ldots \cdot {\left({\widehat{\mathbf{\Psi}}}_{X,\mu}^{\left(\mu \right)}\right)}_{j,{i}_{\mu},j,j}\cdot \ldots \cdot {\left({\widehat{\mathbf{\Psi}}}_{X,\mu}^{\left(p\right)}\right)}_{j,{i}_{p},1}\hfill \\ \hfill & ={\psi}_{1,{i}_{1}}\left({x}_{j}\right)\cdot \ldots \cdot {\psi}_{\mu ,{i}_{\mu}}\left({x}_{j}\right)\cdot \ldots \cdot {\psi}_{p,{i}_{p}}\left({x}_{j}\right).\hfill \end{array}$$
This can be shown in an analogous fashion for $\mu =1$ and $\mu =p$. □
Appendix B. Interpretation of ARR as ALS Ridge Regression
The following reasoning elucidates the relation between ARR, ridge regression, and kernel-based MANDy. We only outline the rough idea without detailed proofs. Let ${R}_{\mu}$ denote the retraction operator, see [28], consisting of the fixed TT cores ${\Xi}^{\left(1\right)},\dots ,{\Xi}^{(\mu -1)}$ and ${\Xi}^{(\mu +1)},\dots ,{\Xi}^{\left(p\right)}$ of the solution $\Xi$ at any iteration step of Algorithm 2. Furthermore, assume that ${\Xi}^{\left(1\right)},\dots ,{\Xi}^{(\mu -1)}$ are left- and ${\Xi}^{(\mu +1)},\dots ,{\Xi}^{\left(p\right)}$ right-orthonormal. In Lines 8 and 13 of Algorithm 2, we consider the system (with a slight abuse of notation)
$$y={M}_{\mu}x=({\mathbf{\Psi}}_{X}^{\top}\xb7{R}_{\mu})x.$$
The application of a truncated SVD to the matricization of ${\mathbf{\Psi}}_{X}^{\top}\cdot {R}_{\mu}$ (as done in Algorithm 2) is then similar to a regularization of the form
$$\underset{x}{min}\{{\parallel y-{M}_{\mu}x\parallel}_{2}^{2}+\epsilon {\parallel x\parallel}_{2}^{2}\}$$
with an appropriate regularization parameter $\epsilon$, i.e., $x\approx {M}_{\mu}^{+}y$ for both approaches, see [17,30]. The formulation (A1) is known as Tikhonov's smoothing functional, ridge regression, or ${\ell}^{2}$ regularization (which, of course, could also be applied directly in Algorithm 2). The solution of (A1) is also the solution of the regularized normal equation
$${M}_{\mu}^{\top}y=({M}_{\mu}^{\top}{M}_{\mu}+\epsilon \mathrm{Id})x,$$
see, e.g., [31]. As ${R}_{\mu}^{\top}{R}_{\mu}=\mathrm{Id}$, it follows that
$$({R}_{\mu}^{\top}{\mathbf{\Psi}}_{X})\,y=({R}_{\mu}^{\top}({\mathbf{\Psi}}_{X}{\mathbf{\Psi}}_{X}^{\top}+\epsilon \mathrm{Id}){R}_{\mu})\,x.$$
In fact, this is a subproblem corresponding to the application of ALS [28] to the tensorbased system
$${\mathbf{\Psi}}_{X}y=({\mathbf{\Psi}}_{X}{\mathbf{\Psi}}_{X}^{\top}+\epsilon \mathrm{Id})\Xi .$$
Note that all requirements for the application of ALS are satisfied since ${\mathbf{\Psi}}_{X}{\mathbf{\Psi}}_{X}^{\top}+\epsilon \mathrm{Id}$ is a symmetric positive definite tensor operator and ${R}_{\mu}$ is orthonormal. The system of linear equations given in (A2) is then equivalent to the minimization problem
$$\underset{\Xi}{min}\{{\parallel y-{\mathbf{\Psi}}_{X}^{\top}\Xi \parallel}_{2}^{2}+\epsilon {\parallel \Xi \parallel}_{2}^{2}\}.$$
For sufficiently small $\epsilon$, it holds that $\Xi \approx {\mathbf{\Psi}}_{X}^{+}y$, see [32], meaning that Algorithm 2 computes an approximation of the coefficient tensor resulting from the application of kernel-based MANDy, see Section 3.2.
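This final approximation statement can be checked numerically: for small $\epsilon$, the solution of the regularized normal equation agrees with the minimum-norm least-squares solution obtained via the pseudoinverse. The matrix below is a random stand-in for the matricized system.

```python
import numpy as np

np.random.seed(0)
M = np.random.rand(20, 5)   # random stand-in for the matricized system
y = np.random.rand(20)

eps = 1e-8
# ridge solution of the regularized normal equation (A1)
x_ridge = np.linalg.solve(M.T @ M + eps * np.eye(5), M.T @ y)
# minimum-norm least-squares solution via the pseudoinverse
x_pinv = np.linalg.pinv(M) @ y
# for small eps the two solutions agree up to O(eps)
```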
References
 Beylkin, G.; Garcke, J.; Mohlenkamp, M.J. Multivariate Regression and Machine Learning with Sums of Separable Functions. SIAM J. Sci. Comput. 2009, 31, 1840–1857. [Google Scholar] [CrossRef]
 Novikov, A.; Podoprikhin, D.; Osokin, A.; Vetrov, D. Tensorizing Neural Networks. In Advances in Neural Information Processing Systems 28 (NIPS); Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; pp. 442–450. [Google Scholar]
 Cohen, N.; Sharir, O.; Shashua, A. On the expressive power of deep learning: A tensor analysis. In Proceedings of the 29th Annual Conference on Learning Theory, New York, NY, USA, 23–26 June 2016; Feldman, V., Rakhlin, A., Shamir, O., Eds.; Proceedings of Machine Learning Research. Columbia University: New York, NY, USA, 2016; Volume 49, pp. 698–728. [Google Scholar]
 White, S.R. Density matrix formulation for quantum renormalization groups. Phys. Rev. Lett. 1992, 69, 2863–2866. [Google Scholar] [CrossRef] [PubMed]
 Stoudenmire, E.M.; Schwab, D.J. Supervised learning with tensor networks. In Advances in Neural Information Processing Systems 29 (NIPS); Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; pp. 4799–4807. [Google Scholar]
 Stoudenmire, E.M.; Schwab, D.J. Supervised learning with quantuminspired tensor networks. arXiv 2016, arXiv:1605.05775. [Google Scholar]
 Stoudenmire, E.M. Learning relevant features of data with multiscale tensor networks. Quantum Sci. Technol. 2018, 3, 034003. [Google Scholar] [CrossRef]
 Huggins, W.; Patil, P.; Mitchell, B.; Whaley, K.B.; Stoudenmire, E.M. Towards quantum machine learning with tensor networks. Quantum Sci. Technol. 2019, 4, 024001. [Google Scholar] [CrossRef]
 Roberts, C.; Milsted, A.; Ganahl, M.; Zalcman, A.; Fontaine, B.; Zou, Y.; Hidary, J.; Vidal, G.; Leichenauer, S. TensorNetwork: A library for physics and machine learning. arXiv 2019, arXiv:1905.01330. [Google Scholar]
 Efthymiou, S.; Hidary, J.; Leichenauer, S. TensorNetwork for Machine Learning. arXiv 2019, arXiv:1906.06329. [Google Scholar]
 Brunton, S.L.; Proctor, J.L.; Kutz, J.N. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. Natl. Acad. Sci. USA 2016, 113, 3932–3937. [Google Scholar] [CrossRef] [PubMed]
 Rudy, S.H.; Brunton, S.L.; Proctor, J.L.; Kutz, J.N. Datadriven discovery of partial differential equations. Sci. Adv. 2017, 3. [Google Scholar] [CrossRef] [PubMed]
 Gelß, P.; Klus, S.; Eisert, J.; Schütte, C. Multidimensional Approximation of Nonlinear Dynamical Systems. J. Comput. Nonlinear Dyn. 2019, 14, 061006. [Google Scholar] [CrossRef]
 Schuld, M.; Killoran, N. Quantum machine learning in feature Hilbert spaces. Phys. Rev. Lett. 2019, 122, 040504. [Google Scholar] [CrossRef] [PubMed]
 ShaweTaylor, J.; Cristianini, N. Kernel Methods for Pattern Analysis; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar] [CrossRef]
 Nüske, F.; Gelß, P.; Klus, S.; Clementi, C. Tensorbased EDMD for the Koopman analysis of highdimensional systems. arXiv 2019, arXiv:1908.04741. [Google Scholar]
 Hansen, P.C. The truncated SVD as a method for regularization. BIT Numer. Math. 1987, 27, 534–553. [Google Scholar] [CrossRef]
 LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
 Xiao, H.; Rasul, K.; Vollgraf, R. FashionMNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv 2017, arXiv:1708.07747. [Google Scholar]
 Golub, G.H.; van Loan, C.F. Matrix Computations, 4th ed.; The Johns Hopkins University Press: Baltimore, MD, USA, 2013. [Google Scholar]
 Oseledets, I.V. A New Tensor Decomposition. Dokl. Math. 2009, 80, 495–496. [Google Scholar] [CrossRef]
 Oseledets, I.V. TensorTrain Decomposition. SIAM J. Sci. Comput. 2011, 33, 2295–2317. [Google Scholar] [CrossRef]
 Penrose, R. Applications of negative dimensional tensors. In Combinatorial Mathematics and Its Applications; Welsh, D.J.A., Ed.; Academic Press Inc.: Cambridge, MA, USA, 1971; pp. 221–244. [Google Scholar]
 Oseledets, I.V.; Tyrtyshnikov, E.E. Breaking the Curse of Dimensionality, Or How to Use SVD in Many Dimensions. SIAM J. Sci. Comput. 2009, 31, 3744–3759. [Google Scholar] [CrossRef]
 Klus, S.; Gelß, P.; Peitz, S.; Schütte, C. Tensorbased dynamic mode decomposition. Nonlinearity 2018, 31. [Google Scholar] [CrossRef]
 Gelß, P.; Matera, S.; Schütte, C. Solving the Master Equation Without Kinetic Monte Carlo. J. Comput. Phys. 2016, 314, 489–502. [Google Scholar] [CrossRef]
 Gelß, P.; Klus, S.; Matera, S.; Schütte, C. Nearestneighbor interaction systems in the tensortrain format. J. Comput. Phys. 2017, 341, 140–162. [Google Scholar] [CrossRef]
 Holtz, S.; Rohwedder, T.; Schneider, R. The Alternating Linear Scheme for Tensor Optimization in the Tensor Train Format. SIAM J. Sci. Comput. 2012, 34, A683–A713. [Google Scholar] [CrossRef]
 Liu, Y.; Zhang, X.; Lewenstein, M.; Ran, S. Entanglementguided architectures of machine learning by quantum tensor network. arXiv 2018, arXiv:1803.09111. [Google Scholar]
 Groetsch, C.W. Inverse Problems in the Mathematical Sciences; Vieweg+Teubner Verlag: Wiesbaden, Germany, 1993. [Google Scholar] [CrossRef]
 Zhdanov, A.I. The method of augmented regularized normal equations. Comput. Math. Math. Phys. 2012, 52, 194–197. [Google Scholar] [CrossRef]
 Barata, J.C.A.; Hussein, M.S. The Moore–Penrose pseudoinverse: A tutorial review of the theory. Braz. J. Phys. 2012, 42, 146–165. [Google Scholar] [CrossRef]
Figure 1.
(a) Samples of the MNIST data set. (b) Samples of the fashion MNIST data set. Each row represents a different item type. (c) Corresponding labels for the fashion MNIST data set.
Figure 2.
Graphical representation of tensor trains: (a) A core is depicted by a circle with different arms indicating the modes of the tensor and the rank indices. The first and the last tensor train (TT) core are regarded as matrices due to the fact that ${r}_{0}={r}_{p}=1$. (b) Left-orthonormalized tensor train obtained by, e.g., sequential singular value decompositions (SVDs). Note that the TT ranks may change due to orthonormalization, e.g., when using (reduced/truncated) SVDs.
Figure 3.
TT representation of transformed data tensors: (a) As in [13], the first p cores (blue circles) are given by (17). The direct contraction of the two last TT cores in (17) can be regarded as an operator-like TT core with a row and column mode (green circle). (b) The additional column mode can be shifted to any of the p TT cores.
Figure 4.
Construction and solution of the subproblem for the $\mu$th core: (a) The 4-dimensional core of ${\widehat{\mathbf{\Psi}}}_{X,\mu}$ (green circle) is contracted with the matrices ${P}_{\mu}$ and ${Q}_{\mu}$ constructed by joining the fixed cores of the coefficient tensor (orange circles) with the corresponding cores of the transformed data tensor. The matricization then defines the matrix ${M}_{\mu}$. (b) The TT core (red circle) obtained by solving the low-dimensional minimization problem is decomposed (e.g., using a QR factorization) into an orthonormal tensor and a triangular matrix. The orthonormal tensor then yields the updated core.
Figure 5.
Results for MNIST and fashion MNIST: (a) Classification rates for the reduced 14 × 14 images. (b) Classification rates for the full 28 × 28 images. Reducing the image size by averaging over groups of pixels improves the performance of the algorithm.
Figure 6.
MNIST classification: (a) Images misclassified by kernel-based MANDy described in Section 3.2. The original image is shown in black, the identified label in red, and the correct label in green. (b) Histograms illustrating which categories are misclassified most often. The rows represent the correct labels of the misclassified images and the columns the detected labels. (c) Visualizations of the learned classifiers showing a heat map of the classification function obtained by applying it to images that differ in one pixel.
Figure 7.
Fashion MNIST classification: (a) Misclassified images. (b) Histogram of misclassified images. (c) Visualizations of the learned classifiers.
Symbol | Description
$X=[{x}^{\left(1\right)},\dots ,{x}^{\left(m\right)}]$ | data matrix in ${\mathbb{R}}^{d\times m}$
$Y=[{y}^{\left(1\right)},\dots ,{y}^{\left(m\right)}]$ | label matrix in ${\mathbb{R}}^{{d}^{\prime}\times m}$
${n}_{1},\dots ,{n}_{p}$ | mode dimensions of tensors
${r}_{0},\dots ,{r}_{p}$ | ranks of tensor trains
${\psi}_{1},\dots ,{\psi}_{p}$ | basis functions ${\psi}_{\mu}:{\mathbb{R}}^{d}\to {\mathbb{R}}^{{n}_{\mu}}$
${\mathsf{\Psi}}_{X}$ / ${\mathbf{\Psi}}_{X}$ | transformed data matrices/tensors
$\Xi$ / $\mathbf{\Xi}$ | coefficient matrices/tensors
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).