Tensor-based methods have become a powerful tool for scientific computing in recent years. In addition to the many application areas, such as quantum mechanics and computational dynamics, where low-rank tensor approximations have been applied successfully, the use of tensor networks for supervised learning has recently gained a lot of attention. In particular, the canonical format and the tensor train format have been considered for quantum machine learning problems (there are different research directions in the field of quantum machine learning; here, we understand it as using quantum computing capabilities for machine learning problems), see, e.g., [1]. A tensor-based algorithm for image classification using sweeping techniques inspired by the density matrix renormalization group (DMRG) [4] was proposed in [5] and further discussed in [7]. Interestingly, researchers at Google are also currently developing a tensor-based machine learning framework called TensorNetwork (http://github.com/google/TensorNetwork). The goal is to expedite the adoption of such methods by the machine learning community.
Our goal is to show that recently developed methods for recovering the governing equations of dynamical systems can be generalized in such a way that they can also be used for supervised learning tasks, e.g., classification problems. To learn the governing equations from simulation or measurement data, regression methods such as sparse identification of nonlinear dynamics (SINDy) [11] and its tensor-based reformulation, multidimensional approximation of nonlinear dynamics (MANDy) [13], can be applied. The main challenge is often to choose the right function space from which the system representation is learned. While SINDy and MANDy essentially select functions from a potentially large set of basis functions by applying regularized regression methods, other approaches allow nested functions and typically result in nonlinear optimization problems, which are then frequently solved using (stochastic) gradient descent. By constructing a basis comprising tensor products of simple functions (e.g., functions depending only on one variable), extremely high-dimensional feature spaces can be generated.
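The factorization that makes such exponentially large feature spaces tractable can be sketched as follows. The per-pixel basis used here, [cos(πx/2), sin(πx/2)], is an assumption for illustration (a common choice in the tensor-network classification literature; the concrete basis used later in the text may differ). The key point is that inner products in the 2^d-dimensional tensor-product feature space factorize into d two-dimensional inner products:

```python
import numpy as np

def local_features(x):
    """Two simple basis functions per pixel (an assumed choice for
    illustration; the basis used in the text may differ)."""
    return np.array([np.cos(np.pi / 2 * x), np.sin(np.pi / 2 * x)])

def product_kernel(x, y):
    """Inner product in the 2^d-dimensional tensor-product feature space,
    computed as a product of d two-dimensional inner products."""
    fx, fy = local_features(x), local_features(y)  # shape (2, d)
    return np.prod(np.sum(fx * fy, axis=0))

# Two toy "images" with d = 4 pixels, values scaled to [0, 1]
x = np.array([0.1, 0.5, 0.9, 0.3])
y = np.array([0.2, 0.4, 0.8, 0.3])
k = product_kernel(x, y)
```

For d = 784 pixels, the explicit feature vector would have 2^784 entries, yet the kernel evaluation above costs only O(d) operations; this is precisely the implicit computation in an intractably large Hilbert space discussed below.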
In this work, we explain how to compute the pseudoinverse required for solving the minimization problem directly in the tensor train (TT) format, i.e., we replace the iterative approach from [5] by a direct computation of the least-squares solution, and point out similarities with the aforementioned system identification methods. The reformulated algorithm can be regarded as a kernelized variant of MANDy, where the kernel is based on tensor products. This is also related to quantum machine learning ideas: as pointed out in [14], the basic idea of quantum computing is similar to that of kernel methods in that computations are performed implicitly in otherwise intractably large Hilbert spaces. Although kernel methods were popular in the 1990s, the focus of the machine learning community has shifted to deep neural networks in recent years [14]. We will show that, for simple image classification tasks, kernels based on tensor products are competitive with neural networks.
In addition to the kernel-based approach, we propose another DMRG-inspired method for the construction of TT decompositions of weight matrices containing the coefficients for the selected basis functions. Instead of computing pseudoinverses, a core-wise ridge regression [15] is applied to solve the minimization problem. While the approach introduced in [5] only involves tensor contractions corresponding to single images of the training data set, we use TT representations of transformed data tensors, see [13], to include the entire training data set at once for constructing low-dimensional systems of linear equations. Combining an efficient computational scheme for the corresponding subproblems with truncated singular value decompositions [17], we call the resulting algorithm alternating ridge regression (ARR) and discuss connections to MANDy and other regularized regression techniques.
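The regularized solve at the heart of each core-wise subproblem can be sketched as follows. This is only the truncated-SVD least-squares step; the TT contractions that assemble the subproblem matrix and right-hand side are omitted, and the variable names are hypothetical:

```python
import numpy as np

def truncated_svd_solve(A, b, rel_threshold=1e-2):
    """Least-squares solution of A w = b where singular values below
    rel_threshold * sigma_max are discarded. This mimics the truncated-SVD
    regularization of the core-wise subproblems; assembling A and b from
    TT contractions is not shown here."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s >= rel_threshold * s[0]  # singular values are sorted descending
    return Vt[keep].T @ ((U[:, keep].T @ b) / s[keep])

# Ill-conditioned toy system: the tiny singular value is discarded,
# preventing the huge-norm component 1e6 from entering the solution.
A = np.array([[1.0, 0.0], [0.0, 1e-6]])
b = np.array([1.0, 1.0])
w = truncated_svd_solve(A, b)
```

Discarding small singular values acts as a regularizer much like a ridge penalty: it bounds the norm of the solution at the price of a small residual increase.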
Although we describe the classification problems using the example of the iconic MNIST data set [18] and the fashion MNIST data set [19], the derived algorithms can easily be applied to other classification problems. There is a plethora of kernel and deep learning methods for image classification; a list of the most successful methods for the MNIST and fashion MNIST data sets, including nearest-neighbor heuristics, support vector machines, and convolutional neural networks, can be found on the respective website (http://yann.lecun.com/exdb/mnist/). We will not review these methods in detail, but instead focus on relationships with data-driven methods for analyzing dynamical systems. The main contributions of this paper are as follows.
Extension of MANDy: We show that the efficiency of the pseudoinverse computation in the tensor train format can be improved by eliminating the need to left- and right-orthonormalize the tensor. Although this is a straightforward modification of the original algorithm, it enables us to consider large data sets. The resulting method is closely related to kernel ridge regression.
Alternating ridge regression: We introduce a modified TT representation of transformed data tensors for the development of a tensor-based regression technique which computes low-rank representations of coefficient tensors. We show that it is possible to obtain results which are competitive with those computed by MANDy and, at the same time, reduce the computational costs and the memory consumption significantly.
Classification of image data: Although the methods were originally designed for system identification, we apply them to classification problems and visualize the learned classifiers, which allows us to interpret the features detected in the images.
The remainder is structured as follows. In Section 2, we describe methods to learn governing equations of dynamical systems from data as well as a tensor-based iterative scheme for image classification and highlight their relationships. In Section 3, we describe how to apply MANDy to classification problems and introduce the ARR approach based on the alternating optimization of TT cores. Numerical results are presented in Section 4, followed by a brief summary and conclusion in Section 5.

4. Numerical Results
We apply the tensor-based classification algorithms described in Sections 3.2 and 3.3 to both the MNIST and fashion MNIST data sets, choosing the basis defined in (14); the corresponding parameter value was determined empirically for the MNIST data set, but also leads to better classification rates for the fashion MNIST set. Kernel-based MANDy as well as ARR are available in Scikit-TT (https://github.com/PGelss/scikit_tt). The numerical experiments were performed on a Linux machine with 128 GB RAM and an Intel Xeon processor with a clock speed of 3 GHz and eight cores.
For the first approach, using kernel-based MANDy, we do not apply any regularization techniques. For the ARR approach, we set the TT rank for each solution, see Algorithm 2 (Line 10), and repeat the scheme five times. Here, we use regularization, i.e., truncated SVDs with a relative threshold of 10⁻² are applied to the minimization problems given in Algorithm 2 (Lines 8 and 13). The obtained classification rates for the reduced and full MNIST and fashion MNIST data are shown in Figure 5.

Similarly to [5], we first apply the classifiers to the reduced data sets, see Figure 5a. Using MANDy, we obtain classification rates of up to 98.75% for the MNIST and 88.82% for the fashion MNIST data set. Using the ARR approach, the classification rates are not monotonically increasing, which may simply be an effect of the alternating optimization scheme. The highest classification rates we obtain are 98.16% for the MNIST data and 87.55% for the fashion MNIST data. We typically obtain a 100% classification rate on the training data (a consequence of the richness of the feature space). This is not necessarily a desired property, as the learned model might not generalize well to new data, but it seems to have no detrimental effects for the simple MNIST classification problem. As shown in Figure 5b, kernel-based MANDy can still be applied when considering the full data sets without reducing the image size. Here, we obtain classification rates of up to 97.24% for the MNIST and 88.37% for the fashion MNIST data set. That we obtain lower classification rates for the full images compared to the reduced ones might be due to the fact that pixel-by-pixel comparisons of images are not expedient; the averaging effect caused by downscaling the images helps to detect coarser features. This is similar to the effect of convolutional kernels and pooling layers. In principle, ARR can also be used for the classification of the full data sets. So far, however, our numerical experiments produced only classification rates significantly lower than those obtained by applying MANDy (95.94% for the MNIST and 82.18% for the fashion MNIST data set). This might be due to convergence issues caused by the kernel. The application to higher-order transformed data tensors and potential improvements of ARR will be part of our future research.
Figure 5 also shows a comparison with TensorFlow. We run the code provided as a classification tutorial (www.tensorflow.org/tutorials/keras/basic_classification) ten times and compute the average classification rate. The input layer of the network comprises 784 nodes (one for each pixel; for the reduced data sets, we thus have only 196 input nodes), followed by two dense layers with 128 and 10 nodes. The layer with 10 nodes is the output layer, containing the probabilities that a given image belongs to the class represented by the respective neuron. Note that more sophisticated methods and architectures for these problems exist, see the (fashion) MNIST website for a ranking; nevertheless, the results show that our tensor-based approaches are competitive with state-of-the-art deep learning techniques.
To understand the numerical results for the MNIST data set (obtained by applying kernel-based MANDy to all 60,000 training images), we analyze the misclassified images, examples of which are displayed in Figure 6a. For misclassified images x, the entries of the decision function, see (29), are often numerically zero, which implies that there is no other image in the training set that is similar enough for the kernel to pick up the resemblance. Some of the remaining misclassified digits are hard to recognize even for humans. Histograms demonstrating which categories are misclassified most often are shown in Figure 6b. Here, we simply count the instances where an image with label i was assigned the wrong label j. The digits 2 and 7, as well as 4 and 9, are confused most frequently. Additionally, we wish to visualize what the algorithm detects in the images. To this end, we perform a sensitivity analysis as follows: starting with an image whose pixel values are constant everywhere (zero or any other value smaller than one; we choose 0.5), we set one pixel to one and compute the classifier output y for this image. The process is repeated for all pixels. For each label, we then plot a heat map of the values of y. This tells us which pixels contribute most to the classification of the images. The resulting maps are shown in Figure 6c. Except for the digit 1, the results are highly similar to the images obtained by averaging over all images containing a certain digit.
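The sensitivity analysis described above can be sketched as follows. Here `decision_function` is a hypothetical stand-in for the learned classifier of the text; any callable mapping an image to a score can be plugged in:

```python
import numpy as np

def sensitivity_map(decision_function, shape=(28, 28), background=0.5):
    """Per-pixel sensitivity heat map: start from a constant image,
    set one pixel to 1, record the classifier output, and repeat
    for every pixel. `decision_function` stands in for the learned
    model from the text."""
    heat = np.empty(shape)
    for i in range(shape[0]):
        for j in range(shape[1]):
            img = np.full(shape, background)
            img[i, j] = 1.0
            heat[i, j] = decision_function(img)
    return heat

# Toy stand-in classifier: responds to the mean intensity of the top half,
# so the heat map should light up exactly in the top half of the image.
toy = lambda img: img[: img.shape[0] // 2].mean()
heat = sensitivity_map(toy, shape=(4, 4))
```

For a model with one output per label, one such map per label yields the visualizations discussed above.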
For the fashion MNIST data set, the corresponding figure shows examples of misclassified images and the associated histogram as well as the results of the sensitivity analysis. We see that the images of shirts (6) are the most difficult to classify (due to the ambiguity in the category definitions), whereas trousers (1) and bags (8) have the lowest misclassification rates (probably due to their distinctive shapes). In contrast to the MNIST data set, the results of the sensitivity analysis differ widely from the average images. The classifier for coats (4), for instance, “looks for” a zipper and coat pockets, which are not visible in the “average coat”, and the classifier for dresses (3) seems to base its decision on the presence of creases, which are also not distinguishable in the “average dress”. The interpretation of other classifiers is less clear; e.g., the ones for sandals (5) and sneakers (7) seem to be contaminated by other classes.
Comparing the runtimes of both approaches on the reduced data sets with 60,000 training images, kernel-based MANDy needs approximately one hour for the construction of the decision function (29), whereas ARR needs less than 10 minutes to compute the coefficient tensor, assuming we parallelize Algorithm 2.
5. Conclusion

In this work, we presented two different tensor-based approaches for supervised learning. We showed that a kernel-based extension of MANDy can be utilized for image classification: by extending the method to arbitrary least-squares problems (originally, MANDy was developed to learn governing equations of dynamical systems) and using sequences of Hadamard products for the computation of the pseudoinverse, we were able to demonstrate the potential of kernel-based MANDy by applying it to the MNIST and fashion MNIST data sets. Additionally, we proposed the alternating optimization scheme ARR, which approximates the coefficient tensors by low-rank TT decompositions. Here, we used a mutable tensor representation of the transformed data tensors in order to construct low-dimensional regression problems for optimizing the TT cores of the coefficient tensor.
Both approaches use an exponentially large set of basis functions in combination with least-squares regression techniques on a given set of training images. The results are encouraging and show that methods exploiting tensor products of simple basis functions are able to detect characteristic features in image data. The work presented in this paper constitutes a further step towards tensor-based techniques for machine learning.
The reason why we can handle the extremely high-dimensional feature space spanned by the basis functions is its tensor product format. Besides the general questions of the choice of basis functions and their expressivity, the rank-one tensor products used in this work can, in principle, be replaced by other structures, which might result in higher classification rates. For instance, the transformation of an image could be given by a TT representation with higher ranks or by hierarchical tensor decompositions (with the aim of detecting features on different levels of abstraction). Furthermore, we could define different basis functions for each pixel, vary the number of basis functions per pixel, or define basis functions for groups of pixels.
Even though kernel-based MANDy computes the minimum-norm solution of the considered regression problems as an exact TT decomposition, the method is likely to suffer from high ranks of the transformed data tensors and might thus not be competitive for large data sets. At the moment, we compute the Gram matrix for the entire training data set. However, one possibility to speed up the computations and to lower the memory consumption is to exploit the properties of the kernel: if the kernel almost vanishes when two images differ significantly in at least one pixel (as is the case for the specific kernel used in this work, provided that the originally proposed value is used), the Gram matrix is essentially sparse when entries smaller than a given threshold are set to zero. Using sparse solvers would then allow us to handle much larger data sets. Moreover, the construction of the Gram matrix is highly parallelizable, and it would be possible to use GPUs to assemble it in a more efficient fashion.
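The thresholding idea can be sketched as follows. The per-pixel basis and the threshold value below are assumptions for illustration, not the exact choices of the text; the point is that for dissimilar images the product of many local inner products decays rapidly, so most entries can be dropped:

```python
import numpy as np
from scipy import sparse

def local_features(x):
    # Hypothetical per-pixel basis for illustration; the basis (14)
    # used in the text may differ.
    return np.array([np.cos(np.pi / 2 * x), np.sin(np.pi / 2 * x)])

def sparse_gram(X, threshold=1e-8):
    """Gram matrix of the tensor-product kernel with entries below
    `threshold` set to zero and stored in sparse format."""
    F = local_features(X)  # shape (2, n, d) for n images with d pixels
    # k(x_i, x_j) = prod over pixels p of <psi(x_i[p]), psi(x_j[p])>
    G = np.prod(np.einsum('kid,kjd->ijd', F, F), axis=2)
    G[np.abs(G) < threshold] = 0.0
    return sparse.csr_matrix(G)

rng = np.random.default_rng(0)
X = rng.random((5, 256))  # 5 random "images" with 256 pixels each
G = sparse_gram(X)
```

For unrelated random images, the off-diagonal kernel values are products of 256 factors below one and thus vanish numerically; in a real data set, similar training images would keep their entries, giving a sparse but informative Gram matrix.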
Further modifications of ARR such as different regression methods for the subproblems, an optimized ordering of the TT cores, and specific initial coefficient tensors can help to improve the results. We provided an explanation for the stability of ARR, but the properties of alternating regression schemes have to be analyzed in more detail in the future.