Sparse Support Tensor Machine with Scaled Kernel Functions

Wang, Shuangyue; Luo, Ziyan

doi:10.3390/math11132829

Open AccessArticle

Sparse Support Tensor Machine with Scaled Kernel Functions

by

Shuangyue Wang

and

Ziyan Luo

^*

School of Mathematics and Statistics, Beijing Jiaotong University, Beijing 100044, China

^*

Author to whom correspondence should be addressed.

Mathematics 2023, 11(13), 2829; https://doi.org/10.3390/math11132829

Submission received: 26 April 2023 / Revised: 12 June 2023 / Accepted: 19 June 2023 / Published: 24 June 2023

(This article belongs to the Special Issue Optimization Theory, Method and Application)

Download

Browse Figures

Versions Notes

Abstract

:

As one of the supervised tensor learning methods, the support tensor machine (STM) for tensorial data classification is receiving increasing attention in machine learning and related applications, including remote sensing imaging, video processing, fault diagnosis, etc. Existing STM approaches lack consideration for support tensors in terms of data reduction. To address this deficiency, we built a novel sparse STM model to control the number of support tensors in the binary classification of tensorial data. The sparsity is imposed on the dual variables in the context of the feature space, which facilitates the nonlinear classification with kernel tricks, such as the widely used Gaussian RBF kernel. To alleviate the local risk associated with the constant width in the tensor Gaussian RBF kernel, we propose a two-stage classification approach; in the second stage, we advocate for a scaling strategy on the kernel function in a data-dependent way, using the information of the support tensors obtained from the first stage. The essential optimization models in both stages share the same type, which is non-convex and discontinuous, due to the sparsity constraint. To resolve the computational challenge, a subspace Newton method is tailored for the sparsity-constrained optimization for effective computation with local convergence. Numerical experiments were conducted on real datasets, and the numerical results demonstrate the effectiveness of our proposed two-stage sparse STM approach in terms of classification accuracy, compared with the state-of-the-art binary classification approaches.

Keywords:

support tensor machine; sparsity; scaled kernel function; subspace Newton method; binary classification

MSC:

62H30; 68T05; 90C26

1. Introduction

Tensors, also known as multi-way arrays, are ubiquitous in the big data era, with applications distributed in brain imaging, video surveillance, hyperspectral images, measurements in social networks, climatology, and geography [1]. Multilinear structures in tensorial data allow for the effective capture of spatial characteristics and intrinsic dimension-reduced properties, instead of working on flattened vectors. Tensor-based thinking has emerged in data mining and machine learning. While the development of theory and algorithms for tensor decompositions [2] has garnered extensive attention due to successful applications in unsupervised learning, supervised tensor learning (STL) methods have also received praise in various fields, including remote sensing imaging [3,4], video processing [5,6], fault diagnosis [7,8], and so on. Pioneering work can be traced back to Tao et al. [9], where the support vector machine (SVM) and minimax probability machine (MPM) were generalized to rank-one tensors for tensorial data classification. Sequential works on binary classification for tensors, namely STM, were delivered in [10,11,12,13,14] and the references therein. Related classification models in tensor format can be found in the one-class STM [15], twin STM [16,17,18], etc. In the literature on tensor classification, a common assumption imposed on the underlying weight tensor, which contributes to the decision function, is low-rankness. It provides advantages in alleviating overfitting issues for small sample-sized problems [19,20] since the dimensions of the weight tensor to be learned are substantially larger than the sample size in practice. The tensor rank notion is far more complicated than the matrix rank, leading to statistical and computational challenges. Another major difficulty in STM is how to build predictive models that can leverage the multi-dimensional structures of tensorial data to facilitate the learning process. Previous works concentrated on linear models, where the data were assumed to be linearly separable in the input space [10,11,13,21,22]. However, this assumption is often violated in practical problems and the linear decision boundaries are not adequate to classify the data appropriately [23]. Inspired by the success of kernel tricks in the conventional SVM, it was witnessed in [24,25,26,27,28] that the classification accuracy can be greatly improved by combining tensor decompositions with kernel-based methods. For example, the dual structure-preserving kernel (DuSK) proposed by He et al. [24] was constructed by feeding the factor vectors from the tensor CANDECOM/PARAFAC (CP) decomposition with the conventional Gaussian RBF kernel. As one of the most commonly used kernels, the Gaussian RBF kernel with a constant kernel width may bring about some local risks and short flexibilities in the distance metric, especially when dealing with unevenness in the pattern space. To address this issue, the scaling technique that conformally modifies the kernels (following the principles of Riemannian geometry [29]) was advocated in the settings of kernelized SVMs in [30,31,32,33,34]. A two-stage method is proposed to improve the classification performance, where a primal kernel is used to find the locations of support vectors in the first stage of learning, and the scaled kernel, based on the prior information on support vectors, is constructed to contribute to a better nonlinear classifier. Although the Gaussian RBF kernel is popular with STMs for obtaining the nonlinear boundaries, there is a lack of consideration regarding the local risk caused by using a constant width for tensorial classification. The enhanced learning process on SVMs has motivated us to use a similar technique in the training process for STM.

In this paper, to develop an efficient classification method, we focus on an STM model with the piecewise quadratic smooth loss, which makes the dual problem somewhat more tractable. To reduce the local risk associated with the constant width of the tensor Gaussian RBF kernel, we scaled the kernel function in the training process. Since the scaling kernel depends on support tensor information, we incorporated a sparsity constraint on the support tensors to reduce redundant information in the training process. The idea of sparsity is analogous to the sparse linearly kernelized SVM proposed by Zhou [35], where the proposed classification method has shown superiority in both computation speed and classification accuracy. To the best of our knowledge, this will be the first attempt that considers the sparsity of support tensors for STM. To conclude, we summarize the main contributions of this paper as follows:

Sparse-kernelized STM model: Taking the sparsity of support tensors into consideration, the sparse-kernelized STM model is proposed, which can flexibly govern the number of support tensors through sparsity constraints. Moreover, the Gaussian RBF kernel in combination with tensor decomposition was utilized to handle nonlinearly separable tensor data.
Scaled kernel function: To alleviate the local risk caused by the constant kernel width, we modified the kernel through conformal transformation by exploiting the structure of Riemannian geometry induced by the kernel function.
Two-stage method: The scaling kernel function can be realized via a two-stage training process. In the first stage, we obtained prior information on support tensors by utilizing a primal kernel. Then, the scaled kernel was used to obtain the final classifier in the second stage. Leveraging the sparse structure, we applied the subspace Newton method to solve the optimization problems in both stages.

To clarify our proposed classification approach, we provide a detailed flowchart in Figure 1. The remaining parts of this paper are organized as follows. In Section 2, related works are reviewed. In Section 3, we introduce some tensor basics. In Section 4, the proposed sparse-kernelized STM model is built and a two-stage training process is designed with scaled tensor kernels. In Section 5, the numerical experiments based on real datasets are constructed to validate the superiority of the proposed algorithm. The conclusions are drawn in Section 6.

2. Related Works

With different predictive models assumed in binary classification problems using the tensor-based large margin classification paradigm, the existing works on STM can be simply divided into two categories: linear STM and nonlinear STM with kernels. With linear models, which assume that samples are linearly separable in the input space, the classification approach initially referred to as the STM with the weight tensor as a rank-one tensor was proposed by Tao et al. [10]. It was solved through the alternating projection optimization method, where the expressive power was limited by the rank-one assumption. To retain more structural information from samples, the rank-one weight tensor was replaced with a rank-R

(R > 1)

tensor based on CP decomposition, as described in [21]. Together with the Tucker decomposition, Kotsia [11] developed a support Tucker machine (STuM), where the weight parameter was represented by the Tucker tensor. Along with the CP and Tucker representations, Chen et al. [13] proposed the support tensor train machine (STTM) based on tensor train (TT) decomposition. Going further into the general tensor low-rank constraint (with no reduction of parameters), Wang et al. [36] proposed the low-rank STM based on the

L_{0 / 1}

soft-margin loss function. Lian [37] employed the multilinear nuclear norm regularization to construct STM and analyzed its oracle properties. This idea of using a surrogate for tensor decomposition is analogous to what is proposed in [22,38]. The aforementioned works mainly focused on linearly separable tensor data, which are impractical for linear decision boundaries to suit real data in numerous practical applications [23].

Recently, the existing approaches have verified that it is useful to exploit the tensor structure with nonlinear kernels. He et al. [24] combined CP decomposition with kernel tricks and proposed the so-called DuSK to exploit discriminative nonlinear relationships among tensorial data. A similar idea was implemented in [25,39]. In particular, the linear support higher-order tensor machine (SHTM) [40] is deemed as a special case of DuSK. The structural-preserving scheme in the CP-based DuSK was extended to that under TT and TR decompositions [26,27,28]. A summarization of the aforementioned works can be found in Table 1, where the “Loss” column presents the loss functions in STMs, “LRD” represents the low-rank decomposition type, “SST” is short for the sparsity of support tensors, and “DR” indicates data reduction.

As one can see from Table 1, among the linear and nonlinear STMs, the hinge loss (i.e., the convex relaxation for the

0 / 1

loss that counts the number of misclassified samples) is employed in most of the STM models, except in [36]. As mentioned in the introduction, the low-rank approximation based on tensor decompositions is used for dimension reduction, with three exceptions in [22,36,38], where the original weight tensor is the decision variable with a tensor rank constraint in the underlying optimization model. The data reduction in the existing two-class STMs is due to the tensor decomposition rather than the sparsity constraint on the support tensors. In all kernelized STMs, the scaled scheme on the kernel function is not considered. All of this motivated us to consider the sparse STM with scaled kernels in this paper. As mentioned by Zhou in [35], the piecewise quadratic smooth loss function

ℓ_{c C} (\cdot)

(see Section 4) makes the dual problem somewhat more tractable to compute. The scaling strategy on the RBF kernel was adopted to relieve the local risk caused by the constant width for nonlinearly separable data. Moreover, for data reduction, we applied the sparsity constraint to govern the number of support tensors.

3. Tensor Basics

An

N^{th}

-order tensor

X \in R^{I_{1} \times I_{2} \times \dots \times I_{N}}

is an N-way array with entries

x_{i_{1} \dots i_{N}}

, where

i_{j}

ranges among

{1, \dots, I_{j}}

for all

j \in [N] : = {1, \dots, N}

. In particular, vectors and matrices are deemed as low-order tensors with

N = 1

and

N = 2

, respectively. For notational convenience, scalars, vectors, matrices, and tensors will be displayed by lowercase letters (e.g., x), boldface lowercase letters (e.g.,

x

), capital letters (e.g., X), and calligraphic letters (e.g.,

X

). Several related tensor fundamentals are reviewed below. More details on tensor multilinear algebra can be found in [2,41].

Definition 1

(Inner product). Given tensors

X

,

Y \in R^{I_{1} \times \dots \times I_{N}}

, the inner product of

X

and

Y

is defined as

〈 X, Y 〉 : = \sum_{i_{1} = 1}^{I_{1}} \sum_{i_{2} = 1}^{I_{2}} \dots \sum_{i_{N} = 1}^{I_{N}} x_{i_{1} i_{2} \dots i_{N}} y_{i_{1} i_{2} \dots i_{N}} .

The induced Frobenius norm of

X

, denoted by

{∥ X ∥}_{F}

, is defined as

{∥ X ∥}_{F} : = \sqrt{〈 X, X 〉}

.

Definition 2

(Outer product). Given tensors

X \in R^{I_{1} \times \dots \times I_{N}}

and

Y \in R^{J_{1} \times \dots \times J_{M}}

, the outer product of

X

and

Y

is an

{(N + M)}^{th}

-order tensor in

R^{I_{1} \times \dots \times I_{N} \times J_{1} \times \dots \times J_{M}}

, defined as

{(X \circ Y)}_{i_{1} \dots i_{N} j_{1} \dots j_{M}} : = x_{i_{1} \dots i_{N}} y_{j_{1} \dots j_{M}} .

Definition 3

(CP decomposition). Given

X \in R^{I_{1} \times \dots \times I_{N}}

, if there exist

x_{r}^{(n)} \in R^{I_{n}}

,

r \in [R]

,

n \in [N]

, such that

\begin{matrix} X = \sum_{r = 1}^{R} x_{r}^{(1)} \circ x_{r}^{(2)} \circ \dots \circ x_{r}^{(N)}, \end{matrix}

(1)

we call (1) a tensor CANDECOMP/PARAFAC (CP) decomposition of

X

. The smallest integer R in a CP decomposition of

X

is called the CP rank of

X

, denoted simply by

r a n k (X)

. Particularly,

X

is called a rank-one tensor when

r a n k (X) = 1

.

For illustrative purposes, a CP decomposition of

X \in R^{I_{1} \times I_{2} \times I_{3}}

is depicted in Figure 2.

For feature selection and/or dimension reduction purposes, tensor low-rank approximations can be produced in terms of the CP decomposition with a small R, and obtained by applying the alternating least squares (ALS) algorithm enhanced with line search strategies [42]. The resulting low-rank CP decomposition format provably facilitates the design of tensor kernel functions. A typical example is the following dual structure-preserving kernel (DuSK) proposed in [24].

Definition 4

(Tensor kernel function). Let ϕ be conventional feature mapping in the vector setting. Given tensors

X

,

Y \in R^{I_{1} \times \dots \times I_{N}}

with CP decompositions

X = \sum_{r = 1}^{R} \prod_{n = 1}^{N} \circ x_{r}^{(n)} a n d Y = \sum_{r = 1}^{R} \prod_{n = 1}^{N} \circ y_{r}^{(n)},

the tensor kernel function is defined by

\begin{matrix} κ (X, Y) = 〈Φ (\sum_{r = 1}^{R} \prod_{n = 1}^{N} \circ x_{r}^{(n)}), Φ (\sum_{r = 1}^{R} \prod_{n = 1}^{N} \circ y_{r}^{(n)})〉 = \sum_{p = 1}^{R} \sum_{q = 1}^{R} \prod_{n = 1}^{N} κ (x_{p}^{(n)}, y_{q}^{(n)}), \end{matrix}

(2)

where the inducing nonlinear feature mapping

Φ : R^{I_{1} \times \dots \times I_{N}} \to R^{H_{1} \times \dots \times H_{N}}

preserves the decomposition structure in the sense that

Φ : \sum_{r = 1}^{R} \prod_{n = 1}^{N} \circ x_{r}^{(n)} \to \sum_{r = 1}^{R} \prod_{n = 1}^{N} \circ ϕ (x_{r}^{(n)})

.

It is worth mentioning that there are several other tensor decompositions that admit low-rank approximations, including the commonly used Tucker decomposition, the tensor train (TT) decomposition, the tensor ring (TR) decomposition, etc. Most of them are also friendly to the design of tensor kernels; see Section 2. Throughout the paper, CP decomposition-based kernels in Definition 4 will be used.

4. Sparse STM with Scaled Kernels

4.1. The Proposed Sparse STM Model

Consider the binary classification problem with M tensorial training data and their labels

{X_{m}, y_{m}}_{m = 1}^{M}

, where

X_{m} \in R^{I_{1} \times \dots \times I_{N}}

and

y_{m} \in {- 1, 1}

. Let

Φ : R^{I_{1} \times \dots \times I_{N}} ⟶ R^{H_{1} \times \dots \times H_{N}}

be a nonlinear feature mapping with

R^{H_{1} \times \dots \times H_{N}}

being a high-dimensional Hilbert space, which is also called the tensor feature space. The purpose of STM is to find an optimal hyperplane that separates the input tensorial data in the feature space, i.e., to find a weight tensor

W \in R^{H_{1} \times \dots \times H_{N}}

along with a bias

b \in R

that solves the following tensor optimization problem:

\begin{matrix} min_{W, b} \frac{1}{2} {∥ W ∥}_{F}^{2} + C_{0} \sum_{i = 1}^{M} ℓ [1 - y_{i} (〈 W, Φ (X_{i}) 〉 + b)] . \end{matrix}

(3)

Here,

ℓ (\cdot)

is a loss function that penalizes the wrong classification in the training process, and

C_{0} > 0

is the penalty parameter. Conventional loss functions include the hinge loss and its variants, the pinball loss and its variants, the Sigmoid loss, etc. Surveys of loss functions include [43,44,45,46]. Throughout the paper, we adopt the following piecewise quadratic smooth loss function due to its superior performance in the setting of SVM [35]:

ℓ_{c C} (t) : = \{\begin{matrix} C t^{2} / 2, if t \geq 0, \\ c t^{2} / 2, otherwise . \end{matrix}

(4)

Here, the tuning parameters c and C satisfy

0 < c < C

. It is obvious that

ℓ_{c C} (t)

is reduced to the squared Hinge loss

{(ℓ_{H} (t))}^{2}

with

ℓ_{H} (t) = max {0, t}

when c = 0. Let

κ (\cdot, \cdot)

be the kernel function defined as Equation (2). Then, the target optimization problem, i.e., Problem (3) in the high-dimensional feature space, can be tackled by solving the following Lagrangian dual problem in the Euclidean space

R^{M}

:

\begin{matrix} min_{α \in R^{M}} & Y (α) : = \frac{1}{2} α^{⊤} K α + \frac{1}{2} 〈 E (α) α, α 〉 - 〈 1, α 〉 \\ s . t . & 〈 α, y 〉 = 0, \end{matrix}

(5)

where

y \in R^{M}

is the label vector,

1 \in R^{M}

is the all-one vector,

K : = K ⊙ ({yy}^{⊤})

with the Gram matrix

\begin{matrix} K : = [\begin{matrix} κ (X_{1}, X_{1}), & \dots & κ (X_{1}, X_{M}) \\ κ (X_{2}, X_{1}), & \dots & κ (X_{2}, X_{M}) \\ ⋮ & ⋮ \\ κ (X_{M}, X_{1}) & \dots & κ (X_{M}, X_{M}) \end{matrix}] \in R^{M \times M}, \end{matrix}

and ⊙ the Hadamard product,

E (α) \in R^{M \times M}

is a diagonal matrix with diagonal entries

\begin{matrix} {(E (α))}_{i i} = \{\begin{matrix} 1 / C, & if α_{i} \geq 0, \\ 1 / c, & otherwise . \end{matrix} \end{matrix}

From the Karush–Kuhn–Tucker (KKT) optimality conditions for the primal and dual programming Problems (3) and (5), see Theorem 2.1 in [35], the optimal solution

(W^{*}, b^{*})

to (3) can be expressed in terms of the optimal solution

α^{*}

to (5), with analytical formulae

W^{*} = \sum_{i \in T^{*}} α_{i}^{*} y_{i} Φ (X_{i}),

b^{*} = \frac{\sum_{i \in T^{*}} [y_{i} - \sum_{j \in T^{*}} α_{j}^{*} (y_{j} κ (X_{i}, X_{j}) + y_{i} {(E (α^{*}))}_{i j})]}{| T^{*} |},

where

T^{*} : = supp (α^{*})

is the support set of

α^{*}

, which collects indices of nonzero components in

α^{*}

. The resulting decision function for binary classification admits the dual representation

f (X) = \sum_{i \in T^{*}} α_{i}^{*} y_{i} κ (X_{i}, X) + b^{*} .

(6)

Here,

{X_{i}}_{i \in T^{*}}

are the so-called support tensors. Evidently, support tensors from the samples play essential roles in the classification with the decision rule given by

sgn (f (X))

, where

sgn (\cdot)

is the sign function defined by

\begin{matrix} sgn (t) = \{\begin{matrix} 1, & t > 0, \\ - 1, & otherwise . \end{matrix} \end{matrix}

Inspired by the emerging sparse SVM scheme, we impose a sparsity constraint on

α

in the dual Problem (5) to control the number of support tensors for data reduction. The resulting sparse STM model is as follows:

\begin{matrix} min_{α \in R^{M}} & Y (α) : = \frac{1}{2} α^{⊤} K α + \frac{1}{2} 〈 E (α) α, α 〉 - 〈 1, α 〉 \\ s . t . & 〈 α, y 〉 = 0, \\ {∥ α ∥}_{0} \leq s, \end{matrix}

(7)

where

{∥ α ∥}_{0}

is the so-called

ℓ_{0}

norm that counts the number of non-zero components in

α

,

s \in [M]

is a prescribed upper bound for the number of support tensors in STM.

4.2. Scaling Tensor Kernel Functions

Based on the low-rank CP decomposition, the tensor kernel function can be generated via (2) with any conventional nonlinear mapping

ϕ

. For simplicity, we adopted the Gaussian RBF kernel throughout the paper, i.e.,

κ (x, y) = exp (\frac{{∥ x - y ∥}^{2}}{- 2 σ^{2}})

for any vectors

x

and

y

of the same dimension, where

σ > 0

is the parameter that controls the kernel width. Then Equation (2) is reformulated to

κ (X, Y) = \sum_{p = 1}^{R} \sum_{q = 1}^{R} \prod_{n = 1}^{N} κ (x_{p}^{(n)}, y_{q}^{(n)}) = \sum_{p = 1}^{R} \sum_{q = 1}^{R} \prod_{n = 1}^{N} \exp (\frac{∥ x_{p}^{(n)} - y_{q}^{(n)} ∥^{2}}{- 2 σ^{2}}) .

It is known that in the conventional vector case, a kernel function induces a Riemannian metric in the original input space via the kernel function; moreover, the magnification factor, defined by the square root of the determinant of the metric matrix, indicates the magnification level on the local area in the feature space. Thus, the classification performance can be improved by scaling the kernel to change the magnification factor appropriately. Thus, we will generalize the magnification factor to the tensor setting and adaptively scale the tensor kernel function during the training process in the sparse STM. For simplicity, we only discuss the case of rank-one tensors.

Given

X = \prod_{n = 1}^{N} \circ x^{(n)} \in R^{I_{1} \times \dots \times I_{N}}

and

Z = Φ (X) = \prod_{n = 1}^{N} \circ ϕ (x^{(n)})

, the differentiation of

Z

, termed as

d Z

, takes the form of

\begin{matrix} d Z & = & d ϕ (x^{(1)}) \circ ϕ (x^{(2)}) \circ \dots \circ ϕ (x^{(N)}) + ϕ (x^{(1)}) \circ d ϕ (x^{(2)}) \circ ϕ (x^{(3)}) \circ \dots \circ ϕ (x^{(N)}) \\ + \dots + ϕ (x^{(1)}) \circ \dots \circ ϕ (x^{(N - 1)}) \circ d ϕ (x^{(N)}) . \end{matrix}

(8)

Lemma 1.

Let

d Z

be defined as in (8). We have

{∥d Z∥}_{F}^{2} = \frac{1}{σ^{2}} \sum_{n = 1}^{N} {∥ d x^{(n)} ∥}^{2}

.

Proof.

Utilizing the expression of the Gaussian RBF function, for any

n \in [N]

, direct manipulations lead to

\{\begin{matrix} ∥ ϕ (x^{(n)}) ∥^{2} = κ (x^{(n)}, x^{(n)}) = 1, \\ ∥ d (ϕ (x^{(n)})) ∥^{2} = \frac{1}{σ^{2}} {∥ d x^{(n)} ∥}^{2}, \\ 〈d (ϕ (x^{(n)})), ϕ (x^{(n)})〉 = \frac{1}{2} d (κ (x^{(n)}, x^{(n)})) = 0 . \end{matrix}

(9)

Combined with the following facts:

∥ x^{(1)} \circ \dots \circ x^{(N)} ∥_{F}^{2} = \prod_{n = 1}^{N} {∥ x^{(n)} ∥}^{2}

(10)

and

\begin{matrix} \begin{matrix} {∥ d Z ∥}_{F}^{2} \\ = & ∥ d (ϕ (x^{(1)})) \circ ϕ (x^{(2)}) \circ \dots \circ ϕ (x^{(N)}) ∥^{2} + ∥ ϕ (x^{(1)}) \circ d (ϕ (x^{(2)})) \circ \dots \circ ϕ (x^{(N)}) ∥^{2} \\ + \dots + ∥ ϕ (x^{(1)}) \circ ϕ (x^{(2)}) \circ \dots \circ d (ϕ (x^{(N)})) ∥^{2} \\ + 2 〈d (ϕ (x^{(1)})) \circ ϕ (x^{(2)}) \circ \dots \circ ϕ (x^{(N)}), ϕ (x^{(1)}) \circ d (ϕ (x^{(2)})) \circ \dots \circ ϕ (x^{(N)})〉 + \dots \\ + 2 〈ϕ (x^{(1)}) \circ \dots \circ d (ϕ (x^{(N - 1)})) \circ ϕ (x^{(N)}), ϕ (x^{(1)}) \circ \dots \circ ϕ (x^{(N - 1)}) \circ d (ϕ (x^{(N)}))〉, \end{matrix} \end{matrix}

we can obtain the desired assertion. This completes the proof. □

Note that

\frac{1}{σ^{2}} {∥ d x^{(n)} ∥}^{2} = \sum_{i_{n}, j_{n} = 1}^{I_{n}} \frac{δ_{i_{n} j_{n}}}{σ^{2}} d x_{i_{n}}^{(n)} d x_{j_{n}}^{(n)} = \sum_{i_{n}, j_{n} = 1}^{I_{n}} \frac{\partial}{\partial x_{i_{n}}^{(n)}} \frac{\partial}{\partial y_{j_{n}}^{(n)}} κ (x^{(n)}, y^{(n)}) |_{y^{(n)} = x^{(n)}} d x_{i_{n}}^{(n)} d x_{j_{n}}^{(n)},

where

δ_{i_{n} j_{n}}

is the Kronecker delta operator defined by

δ_{i_{n} j_{n}} = \{\begin{matrix} 1, & if i_{n} = j_{n}; \\ 0, & otherwise, \end{matrix}

. Denote

G (X) : = \frac{1}{σ^{2}} Diag (I_{I_{1} \times I_{1}}, \dots, I_{I_{N} \times I_{N}}) \in R^{Ξ \times Ξ},

where

Ξ : = I_{1} + \dots + I_{N}

.

From the geometric property of the tensor kernel function, as stated in Lemma 1, it follows that

{∥d Z∥}_{F}^{2} = \sum_{i = 1}^{Ξ} \sum_{j = 1}^{Ξ} G {(X)}_{i j} d v_{i} d v_{j},

where

v_{i}

is the ith element in

{({d x^{(1)}}^{⊤}, \dots, {d x^{(N)}}^{⊤})}^{⊤} \in R^{Ξ}

. We can then generalize the magnification factor in the tensorial case with the definitional expression

ρ (X) : = \sqrt{det (G (X))} .

Apparently, in the Gaussian RBF case, for any rank-one tensor

X \in R^{I_{1} \times \dots \times I_{N}}

, we have

ρ (X) = \frac{1}{σ^{Ξ}} .

This constant magnification factor will cause local risk since it is adverse to adapting to the spatial density change. Inspired by the scaling technique as used in SVM, we modify the tensor kernel function

κ (X, Y)

, such that the magnification factor

ρ (X)

is relatively enlarged around the boundary of

f (X) = 0

, in order to increase the margin or separability of classes.

However, the location of the boundary is not known in advance. To address this issue, we utilize the two-stage procedure suggested in [30]. In the first stage, we perform sparse STM with the primary kernel

κ (\cdot, \cdot)

, as defined in (2), to obtain the first trial of the decision function

f (X)

via support tensors. In the second stage, we perform the sparse STM with the following scaled kernel function:

\begin{matrix} \tilde{κ} (X, Y) : = D (X) κ (X, Y) D (Y) = 〈 D (X) Φ (X), D (Y) Φ (Y) 〉, \end{matrix}

(11)

where

D (X) : = exp (- μ f {(X)}^{2})

with a tunable parameter

μ > 0

. The visualization of the kernel scaling in the learning process is shown in Figure 3, where the flowchart of scaling nonlinear mapping is represented in Figure 4.

For any rank-one tensor

X = \prod_{n = 1}^{N} \circ x^{(n)}

,

\begin{matrix} D (X) = exp \{- μ {[\sum_{i \in T^{*}} α_{i} y_{i} exp \{\frac{\sum_{n = 1}^{N} {∥ x^{(n)} - x_{i}^{(n)} ∥}^{2}}{- 2 σ^{2}}\} + b]}^{2}\} = : Θ (x^{(1)}, \dots, x^{(N)}) . \end{matrix}

Let

\tilde{Z} = D (X) Φ (X) = Θ (x^{(1)}, \dots, x^{(N)}) Φ (X)

. We have the following property.

Lemma 2.

∥ d \tilde{Z} ∥_{F}^{2} = D {(X)}^{2} ({[d (log Θ (x^{(1)}, \dots, x^{(N)}))]}^{2} + {∥ d Z ∥}_{F}^{2})

.

Proof.

By the definitions of

D (X)

and

\tilde{Z}

, we have

d \tilde{Z} = d (Θ (x^{(1)}, \dots, x^{(N)})) Φ (X) + Θ (x^{(1)}, \dots, x^{(N)}) d (Φ (X))

and

d (Θ (x^{(1)}, \dots, x^{(N)})) = D (X) d (log Θ (x^{(1)}, \dots, x^{(N)}))

. Additionally, invoking expressions in (9) and (10), we have

〈Φ (X), Φ (X)〉 = 1 and 〈Φ (X), d (Φ (X))〉 = 0 .

(12)

Thus,

\begin{matrix} \begin{matrix} ∥ d \tilde{Z} ∥_{F}^{2} & = ∥ d (Θ (x^{(1)}, \dots, x^{(N)})) Φ (X) + Θ (x^{(1)}, \dots, x^{(N)}) d (Φ (X)) ∥_{F}^{2} \\ = {(d (Θ (x^{(1)}, \dots, x^{(N)})))}^{2} 〈 Φ (X), Φ (X) 〉 + D {(X)}^{2} 〈 d (Φ (X)), d (Φ (X)) 〉 \\ + 2 D (X) d (Θ (x^{(1)}, \dots, x^{(N)})) 〈 Φ (X), d (Φ (X)) 〉 \\ = {(d (Θ (x^{(1)}, \dots, x^{(N)})))}^{2} + D {(X)}^{2} {∥ d (Z) ∥}_{F}^{2} \\ = D {(X)}^{2} ({[d (log Θ (x^{(1)}, \dots, x^{(N)}))]}^{2} + {∥ d Z ∥}_{F}^{2}) \end{matrix} \end{matrix}

where the third equality follows from (12). This completes the proof. □

Obviously,

∥ d \tilde{Z} ∥_{F}^{2} = {∥ d Z ∥}_{F}^{2}

for any

X

on the boundary

f (X) = 0

. Notice that

\begin{matrix} d (log Θ (x^{(1)}, \dots, x^{(N)})) = D^{[1] ⊤} d x^{(1)} + \dots + D^{[N] ⊤} d x^{(N)}, \end{matrix}

where

D^{[n]} = \frac{\partial log Θ (x^{(1)}, \dots, x^{(N)})}{\partial x^{(n)}}

. Denote

Q_{i, j} = 〈 D^{[i]}, D^{[j]} 〉

, it follows that

\begin{matrix} \begin{matrix} ∥ d (log Θ (x^{(1)}, \dots, x^{(N)})) ∥^{2} \\ = & [d x^{(1) ⊤}, \dots, d x^{(N) ⊤}] [\begin{matrix} Q_{1, 1} & \dots & Q_{1, N} \\ Q_{2, 1} & \dots & Q_{2, N} \\ ⋮ & \dots & ⋮ \\ Q_{N - 1, 1} & \dots & Q_{N - 1, N} \\ Q_{N, 1} & \dots & Q_{N, N} \end{matrix}] [\begin{matrix} d x^{(1)} \\ ⋮ \\ d x^{(N)} \end{matrix}] . \end{matrix} \end{matrix}

(13)

Now, focusing on a neighborhood of support tensors

X_{i}

, there is

\begin{matrix} Θ (x^{(1)}, \dots, x^{(N)}) = exp \{- μ {[α_{i} y_{i} exp \{\frac{\sum_{n = 1}^{N} {∥ x^{(n)} - x_{i}^{(n)} ∥}^{2}}{- 2 σ^{2}}\} + b]}^{2}\} . \end{matrix}

For convenience, we denote

∥ d \tilde{Z} ∥_{F}^{2} = \sum_{i = 1}^{Ξ} \sum_{j = 1}^{Ξ} \tilde{G} {(X)}_{i j} d v_{i} d v_{j}

,

\tilde{ρ} (X) = \sqrt{det (\tilde{G} (X))}

. Then we can state the following theorem:

Theorem 1.

The ratio of the magnification factors

\tilde{ρ} (X)

and

ρ (X)

is

\begin{matrix} \frac{\tilde{ρ} (X)}{ρ (X)} = D {(X)}^{Ξ} \sqrt{1 + σ^{2} {∥ \nabla log Θ (x^{(1)}, \dots, x^{(N)}) ∥}_{F}^{2}} . \end{matrix}

(14)

Proof.

According to Lemma 2 and (13), we have

\begin{matrix} \begin{matrix} ∥ d \tilde{Z} ∥_{F}^{2} & = D {(X)}^{2} ({[d (log Θ (x^{(1)}, \dots, x^{(N)}))]}^{2} + {∥ d Z ∥}_{F}^{2}) \\ = \sum_{i = 1}^{Ξ} \sum_{j = 1}^{Ξ} \tilde{G} {(X)}_{i j} d v_{i} d v_{j}, \end{matrix} \end{matrix}

where

\begin{matrix} \begin{matrix} \tilde{G} (X) = D {(X)}^{2} [\begin{matrix} Q_{1, 1} & \dots & Q_{1, N} \\ Q_{2, 1} & \dots & Q_{2, N} \\ ⋮ & \dots & ⋮ \\ Q_{N, 1} & \dots & Q_{N, N} \end{matrix}] + D {(X)}^{2} G (X) . \end{matrix} \end{matrix}

Since

\tilde{ρ} (X) = \sqrt{det (\tilde{G} (X))}

, there is

\frac{\tilde{ρ} (X)}{ρ (X)} = D {(X)}^{Ξ} \sqrt{1 + σ^{2} {∥ \nabla \log Θ (x^{(1)}, \dots, x^{(N)}) ∥}_{F}^{2}} .

This completes the proof. □

To visualize the performance of the scaling strategy, we follow the approach by [32] to randomly generate

\{x^{(i)} = {(x_{1}^{(i)}, x_{2}^{(i)})}^{⊤} : i = 1, \dots, 200\}

, which follows the i.i.d. uniform distribution in the region

[- 1, 1] \times [- 1, 1]

and can be separated by

x_{2} = cos (π x_{1})

; see Figure 5. The symbols with different type and color represent different categories.

By applying ‘LIBSVM’ with 80% of samples for training and 20% for testing, we test the performance with fixed values of

C = 2^{2}, σ = 0.50

, and

μ = 0.20

. The classification accuracy is

90.00

% in the first stage and

97.50

% in the second stage with the scaled scheme. To visualize the effect of the scaling strategy, we plot the contours of the ratio magnification factors as defined in (14) in Figure 6, along with SVs marked by ‘+’, distributing around the boundary

f (x) = 0

. One can easily find that the magnification factor in the neighborhood that covers these SVs is greater than 1, indicating an enlargement in the magnification factor at these SVs. In contrast, the magnification factors in non-SVs are reduced after scaling. The enlargement-reduction scenario will facilitate the detection of the classification boundary, leading to a reasonable improvement in classification accuracy.

4.3. The Proposed NKSTM Approach

To effectively solve the non-convex and discontinuous problem, i.e., Problem (7), the Newton-type method, denoted as NSSVM, proposed by Zhou [35], is employed. The gradient and the Hessian of the objective function in Problem (7) are

\nabla Y (α) : = H (α) α - 1, H (α) : = K + E (α) .

Since

K

is the Hadamard product of two symmetric positive semidefinite matrices and, hence, positive semidefinite, and

E (α)

is a diagonal positive definite matrix, we can see that

H (α)

is positive definite. For notational convenience, we denote

z : = {(α^{⊤}, b)}^{⊤} \in R^{M + 1}

, and

\begin{matrix} \begin{matrix} h (z) & : = & \nabla Y (α) + y b = H (α) α - 1 + y b, \\ h_{T} (z) & : = & {(h (z))}_{T}, \\ H_{T} (α) & : = & {(H (α))}_{T T}, \end{matrix} \end{matrix}

where

T : = supp (α)

is the support set,

h_{T} (α)

is a sub-vector of the vector

h (z)

indexed by T,

H_{T} (α)

is the sub-principal matrix of the matrix

H (α)

with rows and columns both indexed by T. We denote

\bar{T} : = [M] ∖ T

, and

\begin{matrix} T_{s} (α) : = \{T \subseteq [M] : \begin{matrix} | T | = s, \\ | α_{i} | \geq | α_{j} |, \forall i \in T, \forall j \in \bar{T} \end{matrix}\}, \end{matrix}

where

| T |

is the cardinality of T, which counts the number of elements in T. Following from [35], we also introduce the stationary equation system for Problem (7) in the form of

\begin{matrix} F (z; T) : = [\begin{matrix} h_{T} (z) \\ α_{\bar{T}} \\ 〈 α_{T}, y_{T} 〉 \end{matrix}] = 0 . \end{matrix}

After choosing the index set

T_{k} \in T_{s} (α^{k} - η h (z^{k}))

, the Newton direction

d^{k} \in R^{M + 1}

can be found by solving the linear equations

\nabla F (z^{k}; T_{k}) d^{k} = - F (z^{k}; T_{k})

. Then, the algorithmic framework for solving sparse STM can be summarized as follows.

Equipped with the kernel scaling scheme, the framework of the resulting two-stage training process with NSSVM can be described as in Algorithm 1.

Algorithm 1 (re-NKSTM) NKSTM with the rescaling kernel.

Input: Training dataset

{X_{i}, y_{i}}_{i = 1}^{M}

, parameters

C > c, η, ϵ, K, μ > 0, s \in [M]

, the CP rank R.

Output: The solution

{\tilde{z}}^{k}

.

Step 1: Initialize

z^{0}

, pick

T_{0} \in T_{s} (α^{0} - η h (z^{0}))

, and

k : = 0

;

Step 2: Compute

z^{k + 1}

by NSSVM with the tensorial RBF kernel

κ (\cdot, \cdot)

;

Step 3: If the stopping criterion is satisfied, then go to Step 4; otherwise, set

k = k + 1

and go to Step 2;

Step 4: Compute the modified kernel

\tilde{κ} (\cdot, \cdot)

by (11) to rescale the primal kernel

κ (\cdot, \cdot)

;

Step 5: Compute

{\tilde{z}}^{k + 1}

by NSSVM with the tensorial RBF kernel

\tilde{κ} (\cdot, \cdot)

;

Step 6: If the stopping criterion is satisfied, then stop; otherwise, set

k = k + 1

and go to Step 5.

Convergence and complexity analysis. Note that Algorithm 2 is indeed the first stage in Algorithm 1. It follows from Theorem 3.1 in [35] that Algorithm 2 enjoys the one-step convergence if the initial point is chosen to be close to a local region of a stationary point. The computational complexity in each iteration of Algorithm 2 is analyzed as follows:

The main term involved in computing $H_{T_{k}} (α^{k}) = K_{T_{k} T_{k}} + E {(α)}_{T_{k} T_{k}}$ is $K_{T_{k} T_{k}}$ , which has a complexity of about $O (s^{2} R^{2} \sum_{n = 1}^{N} I_{n})$ since $| T_{k} | = s$ , and computing the inverse of $H_{T_{k}} (α^{k})$ requires at most $O (s^{3})$ ;
To pick $T_{k + 1} \in T_{s} (α^{k} - η h (z^{k}))$ , we need to compute $h (z^{k})$ and pick out the k largest elements of $| α^{k} - η h (z^{k}) |$ . The complexity of computing the former is $O (M^{2})$ and the latter is $O (M + s \ln s)$ .

Algorithm 2 (NKSTM) Newton-kernelized STM.

Input: Training dataset

{X_{i}, y_{i}}_{i = 1}^{M}

, the parameters

C > c, η, ϵ, K > 0, s \in [M]

, the CP rank R.

Output: The solution

z^{k}

.

Step 1: Initialize

z^{0}

, pick

T_{0} \in T_{s} (α^{0} - η h (z^{0}))

, and

k : = 0

;

Step 2: Compute

z^{k + 1}

by NSSVM with the tensorial RBF kernel;

Step 3: If the stopping criterion is satisfied, then stop; otherwise, set

k = k + 1

and go to Step 2.

Overall, the whole computational complexity in each iteration of both Algorithms 1 and 2 is of the following order:

O (s^{2} max \{R^{2} \sum_{n = 1}^{N} I_{n}, s\} + M^{2}) .

Note that the computations of tensorial RBF kernels in both Algorithms 1 and 2 heavily rely on the CP decompositions of tensor samples

{X_{i}}_{i = 1}^{M}

, which are performed by using the ALS algorithm with line search [42]. Since these CP decompositions are performed just once in Algorithms 1 and 2, we omit the complexity analysis here. The storage complexity is

O (M \sum_{n = 1}^{N} I_{n} + s^{2})

.

5. Numerical Experiments

We conducted numerical experiments on real datasets to evaluate the efficiency of the proposed sparse STM approach. All the computational results were obtained using a laptop running on a 64-bit Windows Operating System with an (Intel(R) Core(TM) i7-8565U CPU @1.8 GHz RAM 8.0 G), equipped with Matlab R2018b with the Tensor Toolbox Version 3.1, (https://gitlab.com/tensors/tensortoolbox, accessed on 1 June 2019).

Normalization is applied to each of the raw sample tensors, denoted as

{{\tilde{X}}_{j} : j \in [M]}

, for more efficient data processing. Specifically, for each

j \in [M]

, the normalized tensor

X_{j}

is generated by

{(X_{j})}_{i_{1} i_{2} \dots i_{N}} = \frac{{({\tilde{X}}_{j})}_{i_{1} i_{2} \dots i_{N}} - {\tilde{X}}_{j}^{min}}{{\tilde{X}}_{j}^{max} - {\tilde{X}}_{j}^{min}}, \forall i_{1}, i_{2}, \dots, i_{N},

with

{\tilde{X}}_{j}^{min}

and

{\tilde{X}}_{j}^{max}

being the minimal and the maximal entries of

{\tilde{X}}_{j}

, respectively.

Experimental setup. To construct the tensor kernels among

{X_{j} : j \in [M]}

, we use the grid search on

{1, \dots, 8}

to choose the CP rank R and adopt the ALS algorithm with line search in [42] to obtain the low-rank CP decomposition approximations. For the RBF kernel parameter

σ

and hyperparameter C, we adopt grid search to pick them from

{2^{- 3}, \dots, 2^{5}}

, and

{2^{0}, 2^{1}, \dots, 2^{9}}

, respectively, and set

c = 0.01 C

. For simplicity, the scaling parameter

μ

in the second stage is chosen as

\frac{1}{I_{1} I_{2} \dots I_{N}}

and adaptively changed in high-dimensionality problems. The sparsity level parameter s is chosen based on the testing instances.

For NKSTM, we choose the initialization point

z^{0} = {(α_{0}^{⊤}, b)}^{⊤}

with

α^{0} = 0

and

b^{0} = sgn (〈 y, 1 〉)

, and set the stationarity parameter

η = min (\frac{1}{M}, 10^{- 4})

from [35]. The algorithm will be terminated if one of the following criteria is satisfied: (i) reaching the maximum iteration number

K = 1000

; (ii) meeting the halting conditions:

\begin{matrix} \begin{matrix} ∥ F (z^{k}; T_{k}) ∥ < ϵ, | acc (α^{k}) - max_{j \in [k - 1]} acc (α^{j}) | < 10^{- 4}, \end{matrix} \end{matrix}

where

ϵ = \sqrt{M I_{1} \dots I_{N}} \times 10^{- 6}

,

acc (α)

is the training accuracy defined by

acc (α) = [1 - \frac{1}{M} ∥ sgn (K (α ⊙ y) + b 1) - y ∥_{0}] \times 100 %,

(15)

with acc

(α^{- 1}) = 0

by default. The testing classification accuracy is defined in the same way as in (15) by replacing the training samples with the testing samples.

5.1. Benchmark Methods

We compare the classification accuracy of the proposed approach with the accuracy of the following existing methods.

LIBSVM: SVM with hinge loss was implemented by Lin et al. [47], which is one of the most popular classification methods based on vector space. In this paper, we consider the SVM with a Gaussian kernel.
NSSVM-RBF: NSSVM based on vector space was developed by Zhou [35] for data reduction, which is a linearly kernelized SVM optimization problem with sparsity constraints. We consider the Gaussian RBF kernel for nonlinearly separable data.
DuSK: He et al. [24] proposed a dual structure-preserving kernel based on CP decomposition for supervised tensor learning to boost tensor classification performances. In the experiment, we chose the Gaussian RBF kernel.
TT-MMK: Kour et al. [27] introduced a tensor train multi-way multi-level kernel (TT-MMK), which converts the available TT decomposition into the CP decomposition for kernelized classification. The Gaussian RBF kernel was adopted in the article.

For fair comparisons, the default settings of all other competitor approaches will be adopted, and the rank selections in all rank-related methods will be determined in the same way as in our proposed methods. All numerical results show the average performance with five-fold cross-validation in 5 trials for all tested methods. For convenience, we simply use ‘TRACC1’ (‘TEACC1’) and ‘TRACC2’ (‘TEACC2’) to refer to the training (testing) accuracies of the NKSTM and re-NKSTM approaches, respectively.

5.2. StarPlus fMRI Dataset

We conduct the numerical experiments using the magnetic resonance imaging (fMRI) images in the field of neuroscience. Each is represented in a third-order tensor format, as illustrated in Figure 7 from [24]. We choose images from the StarPlus dataset www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-81/www, accessed on 1 June 2019, which are in the form of a tensor with dimensions of

64 \times 64 \times 8

, containing 25 to 30 anatomically defined regions (denoted as “Regions of Interest", or ROIs). In order to improve the classification accuracy, the following ROIs are taken into consideration: ‘CALC’, ‘LIPL’, ‘LT’, ‘LTRIA’, ‘LOPER’, ‘LIPS’, and ‘LDLPFC’. More details about data preprocessing are explained in [26]. Here, we pick subject 04799 from [26] to test the classification performance. This subject contains 320 fMRI images: half of them were collected when the subject was shown a picture, and the other half were collected when the subject was shown a sentence. In the five-fold CV, 256 samples are used for training and 64 for testing in each classification trial.

To test the effect of the sparsity level s in the proposed model (7), we simply choose the CP rank

R = 2

, and record the training and the testing accuracies with different values of

s \in {33, 49, 97, 145, 193, 256}

in Figure 8. It is not surprising to see in Figure 8 that the re-NKSTM in the second stage shows better performance than NKSTM in the testing accuracy, and achieves its peak testing accuracy in a sparse setting with

s = 97

among the instances. As shown in this figure, the effect of the sparsity constraint imposed on

α

is positive, with a reasonable sacrifice on the training accuracy to alleviate the overfitting.

To test the effect of the CP rank R, we choose

s = 97

, as discussed above, and collect the training and testing accuracies with different values of

R \in {1, 2, 3, \dots, 8}

in Figure 9. The superiority of re-NKSTM in the testing accuracy is confirmed again in these instances. In particular, in the case of CP rank

R = 4

, the accuracy in the second stage is 76.88%, which is 24% higher than 52.87% in the first stage. Moreover, the choice of CP rank

R = 4

for re-NKSTM shows the best performance in both training and testing, which, to some extent, indicates the advantage of imposing the low-rankness assumption on tensor samples for classification.

Table 2 presents the classification accuracies between the proposed approach and several state-of-the-art methods. Here, in the NKSTM and re-NKSTM approaches, we choose

s = 97

, and in all rank-related methods (NKSTM, re-NKSTM, DuSK, TT-MMK), the grid search is used to choose the corresponding tensor rank. As can be seen in Table 2, the testing accuracy of re-NKSTM is about 20% higher than those of other competitors. In particular, the scaled kernel scheme in re-NKSTM achieves a classification accuracy that is 24% higher than that of NKSTM without any scaled kernel function.

5.3. CIFAR-10 Dataset

Each color picture from the CIFAR-10 [48] dataset, as shown in Figure 10, can be deemed as a third-order tensor (pixel–pixel–color) with dimensions of

32 \times 32 \times 3

. We randomly select nine class pairs to perform the binary classification. Without overlap, we choose 100 samples from the dataset in each class to obtain 200 samples for each class pair classification (160 for training and 40 for testing).

Similarly, we test the effect of the sparsity level in the proposed NKSTM and re-NKSTM approaches. The class pair ‘automobile-cat’ is chosen for illustration. The training and the testing accuracies of these two approaches with different values of

s \in {28, 56, 84, 112, 140, 160}

are shown in Figure 11, by fixing the CP rank

R = 1

. One can observe that re-NKSTM outperforms NKSTM in the testing accuracy for all tested instances; for the sparse setting

s = 112

, both the training and testing accuracies approach a high percentage of 85%.

We also test the effect of the CP rank in the proposed NKSTM and re-NKSTM approaches by fixing

s = 112

. The results with the CP rank R varying among

{1, 2, 3, \dots, 8}

are reported in Figure 12. As can be seen in this figure, the NKSTM and re-NKSTM approaches gain the best classification performances at

R = 1

, with accuracies of about

78 %

for NKSTM and

85 %

for re-NKSTM.

Table 3 lists the comparison results for nine class pairs by performing the aforementioned six algorithms. Here, we set

s = 112

in the NKSTM and re-NKSTM approaches, and use the experiment setup as described at the beginning of this section for all algorithms. As one can see from Table 3, re-NKSTM outperforms all other five approaches in classification accuracy for all tested instances, and the superiority is mainly achieved in the scaled kernel scheme when compared with the results of NKSTM. In particular, for the ‘bird & horse’ classification, the accuracy is 90.00% in re-NKSTM, which improved by about 20% compared to 69.40% for NKSTM.

6. Conclusions

In this paper, a sparse STM model based on the dual structure-preserving kernel function was proposed for the binary classification of tensorial data. Scaling kernel functions were embedded to alleviate the local risks caused by the constant Gaussian RBF kernel width in the presence of uneven spatial distributions, and a two-stage approach equipped with the subspace Newton method was designed in the learning process. Numerical experiments conducted on real datasets have confirmed the superiority of the proposed approach in classification accuracy.

However, there are still limitations to the proposed method. For example, the time efficiency of the tensor decomposition in the kernel function evaluation process needs to be improved, especially for high-dimensional and high-order tensor samples. Moreover, the effectiveness of the proposed approach was demonstrated numerically in terms of the classification accuracy, and the analysis of the generalization ability needs to be further studied from a theoretical perspective. In addition, as the 0/1 loss was recognized as the ideal loss function to characterize the number of misclassified samples, the kernelized sparse STM with such a loss will be a promising model, which deserves further study in future research.

Author Contributions

Conceptualizations, S.W. and Z.L.; methodology, S.W. and Z.L.; software, S.W.; validation, Z.L.; writing—original draft preparation, S.W. and Z.L.; writing—review and editing, S.W. and Z.L.; visualization, S.W.; supervision, Z.L.; funding acquisition, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Beijing Natural Science Foundation (grant no. Z190002) and the National Natural Science Foundation of China (grant no. 12271022).

Data Availability Statement

The data used to support this study are included within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

Liu, J.; Zhu, C.; Long, Z.; Liu, Y. Tensor regression. Found. Trends Mach. Learn. 2021, 14, 379–565. [Google Scholar] [CrossRef]
Kolda, T.; Bader, B. Tensor decompositions and applications. SIAM Rev. 2009, 51, 455–500. [Google Scholar] [CrossRef]
Xing, Y.; Wang, M.; Yang, S.; Zhang, K. Pansharpening with multiscale geometric support tensor machine. IEEE Geosci. Remote Sens. Lett. 2018, 56, 2503–2517. [Google Scholar] [CrossRef]
Zhang, L.; Zhang, L.; Tao, D.; Huang, X. A multifeature tensor for remote-sensing target recognition. IEEE Geosci. Remote Sens. Lett. 2011, 8, 374–378. [Google Scholar] [CrossRef]
Zhou, B.; Song, B.; Hassan, M.; Alamri, A. Multilinear rank support tensor machine for crowd density estimation. Eng. Appl. Artif. Intel. 2018, 72, 382–392. [Google Scholar] [CrossRef]
Zhao, Z.; Chow, T. Maximum margin multisurface support tensor machines with application to image classification and segmentation. Expert Syst. Appl. 2012, 39, 849–860. [Google Scholar]
He, Z.; Shao, H.; Cheng, J.; Zhao, X.; Yang, Y. Support tensor machine with dynamic penalty factors and its application to the fault diagnosis of rotating machinery with unbalanced data. Mech. Syst. Signal Process. 2020, 141, 106441. [Google Scholar] [CrossRef]
Hu, C.; He, S.; Wang, Y. A classification method to detect faults in a rotating machinery based on kernelled support tensor machine and multilinear principal component analysis. Appl. Intell. 2021, 51, 2609–2621. [Google Scholar] [CrossRef]
Tao, D.; Li, X.; Hu, W.; Maybank, S.; Wu, X. Supervised tensor learning. In Proceedings of the Fifth IEEE International Conference on Data Mining, Houston, TX, USA, 27–30 November 2005; pp. 450–457. [Google Scholar]
Tao, D.; Li, X.; Wu, X.; Hu, W.; Maybank, S. Supervised tensor learning. Knowl. Inf. Syst. 2007, 13, 1–42. [Google Scholar] [CrossRef]
Kotsia, I.; Patras, I. Support tucker machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Colorado Springs, CO, USA, 20–25 June 2011; pp. 633–640. [Google Scholar]
Khemchandani, R.; Karpatne, A.; Chandra, S. Proximal support tensor machines. Int. J. Mach. Learn. Cyber. 2013, 4, 703–712. [Google Scholar] [CrossRef]
Chen, C.; Batselier, K.; Ko, C.Y.; Wong, N. A support tensor train machine. In Proceedings of the 2019 International Joint Conference on Neural Networks, Budapest, Hungary, 14–19 July 2019; pp. 1–8. [Google Scholar]
Sun, T.; Sun, X. New results on classification modeling of noisy tensor datasets: A fuzzy support tensor machine dual model. IEEE Trans. Syst. Man Cybern. 2021, 99, 1–13. [Google Scholar] [CrossRef]
Chen, Y.; Wang, K.; Zhong, P. One-class support tensor machine. Knowl.-Based Syst. 2016, 96, 14–28. [Google Scholar] [CrossRef] [Green Version]
Zhang, X.; Gao, X.; Wang, Y. Twin support tensor machines for MC detection. J. Electron. 2009, 26, 318–325. [Google Scholar] [CrossRef]
Shi, H.; Zhao, X.; Zhen, L.; Jing, L. Twin bounded support tensor machine for classification. Int. J. Pattern Recogn. 2016, 30, 1650002.1–1650002.20. [Google Scholar] [CrossRef]
Rastogi, R.; Sharma, S. Ternary tree based-structural twin support tensor machine for clustering. Pattern Anal. Appl. 2021, 24, 61–74. [Google Scholar] [CrossRef]
Yan, S.; Xu, D.; Yang, Q.; Zhang, L.; Tang, X.; Zhang, H. Multilinear discriminant analysis for face recognition. IEEE Trans. Image Process. 2007, 16, 212–220. [Google Scholar] [CrossRef] [PubMed]
Lu, H.; Plataniotis, K.; Venetsanopoulos, A. MPCA: Multilinear principal component analysis of tensor objects. IEEE Trans. Neural Netw. 2008, 19, 18–39. [Google Scholar] [PubMed] [Green Version]
Kotsia, I.; Guo, W.; Patras, I. Higher rank support tensor machines for visual recognition. Pattern Recognit. 2012, 45, 4192–4203. [Google Scholar] [CrossRef]
Yang, B. Research and Application of Machine Learning Algorithm Based Tensor Representation; China Agricultural University: Beijing, China, 2017. [Google Scholar]
Rubinov, M.; Knock, S.; Stam, C.; Micheloyannis, S.; Harris, A.; Williams, L.; Breakspear, M. Small-world properties of nonlinear brain activity in schizophrenia. Hum. Brain Mapp. 2009, 30, 403–416. [Google Scholar] [CrossRef] [PubMed]
He, L.; Kong, X.; Yu, P.; Ragin, A.; Hao, Z.; Yang, X. DuSK: A dual structure-preserving kernel for supervised tensor learning with applications to neuroimages. In Proceedings of the 2014 SIAM International Conference on Data Mining SIAM, Philadelphia, PA, USA, 24–26 April 2014; pp. 127–135. [Google Scholar]
He, L.; Lu, C.; Ma, G.; Wang, S.; Shen, L.; Yu, P.; Ragin, A. Kernelized support tensor machines. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1442–1451. [Google Scholar]
Chen, C.; Batselier, K.; Yu, W.; Wong, N. Kernelized support tensor train machines. Pattern Recognit. 2022, 122, 108337. [Google Scholar] [CrossRef]
Kour, K.; Dolgov, S.; Stoll, M.; Benner, P. Efficient structure-preserving support tensor train machine. J. Mach. Learn. Res. 2023, 24, 1–22. [Google Scholar]
Deng, X.; Shi, Y.; Yao, D.; Tang, X.; Mi, C.; Xiao, J.; Zhang, X. A kernelized support tensor-ring machine for high-dimensional data classification. In Proceedings of the International Conference on Electronic Information Technology and Smart Agriculture (ICEITSA), IEEE, Huaihua, China, 10–12 December 2021; pp. 159–165. [Google Scholar]
Scholkopf, B.; Burges, C.; Smola, A. Advances in Kernel Methods; MIT Press: Cambridge, MA, USA, 1999. [Google Scholar]
Amari, S.; Wu, S. Improving support vector machine classifiers by modifying kernel functions. Neural Netw. 1999, 12, 783–789. [Google Scholar] [CrossRef] [PubMed]
Wu, S.; Amari, S. Conformal transformation of kernel functions: A data-dependent way to improve support vector machine classifiers. Neural Process. Lett. 2001, 25, 59–67. [Google Scholar]
Williams, P.; Li, S.; Feng, J.; Wu, S. Scaling the kernel function to improve performance of the support vector machine. In International Symposium on Neural Networks; Springer: Berlin/Heidelberg, Germany, 2005; pp. 831–836. [Google Scholar]
Williams, P.; Wu, S.; Feng, J. Improving the performance of the support vector machine: Two geometrical scaling methods. StudFuzz 2005, 177, 205–218. [Google Scholar]
Chang, Q.; Chen, Q.; Wang, X. Scaling Gaussian RBF kernel width to improve SVM classification. In Proceedings of the International Conference on Neural Networks and Brain, Beijing, China, 13–15 October 2005. [Google Scholar]
Zhou, S. Sparse SVM for sufficient data reduction. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 5560–5571. [Google Scholar] [CrossRef]
Wang, S.; Luo, Z. Low rank support tensor machine based on L_0/1 soft-margin loss function. Oper. Res. Trans. 2021, 25, 160–172. (In Chinese) [Google Scholar]
Lian, H. Learning rate for convex support tensor machines. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 3755–3760. [Google Scholar] [CrossRef]
Shu, T.; Yang, Z.X. Support tensor machine based on nuclear norm of tensor. J. Neijiang Norm. Univ. 2017, 32, 34–39. (In Chinese) [Google Scholar]
He, L.F.; Lu, C.T.; Ding, H.; Wang, S.; Shen, L.L.; Yu, P.; Ragin, A.B. Multi-way multi-level kernel modeling for neuroimaging classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 356–364. [Google Scholar]
Hao, Z.; He, L.; Chen, B.; Yang, X. A linear support higher-order tensor machine for classification. IEEE Trans. Image Process. 2013, 22, 2911–2920. [Google Scholar] [PubMed]
Qi, L.; Luo, Z. Tensor Analysis: Spectral Theory and Special Tensors; SIAM Press: Philadelphia, PA, USA, 2017. [Google Scholar]
Nion, D.; Lathauwer, L. An enhanced line search scheme for complex-valued tensor decompositions. Application in DS-CDMA. Signal Process. 2008, 21, 749–755. [Google Scholar] [CrossRef]
Steinwart, I.; Christmann, A. Support Vector Machines; Springer: New York, NY, USA, 2008. [Google Scholar]
Zhao, L.; Mammadov, M.J.; Yearwood, J. From convex to nonconvex: A loss function analysis for binary classification. In Proceedings of the IEEE International Conference on Data Mining Workshops, Sydney, NSW, Australia, 13 December 2010; pp. 1281–1288. [Google Scholar]
Wang, Q.; Ma, Y.; Zhao, K.; Tian, Y. A comprehensive survey of loss functions in machine learning. Ann. Data. Sci. 2022, 9, 187–212. [Google Scholar] [CrossRef]
Wang, H.; Xiu, N. Analysis of loss functions in support vector machines. Adv. Math. 2021, 50, 801–828. (In Chinese) [Google Scholar]
Chang, C.; Lin, C. LIBSVM: A library for support vector machines. ACM Trans. Intel. Syst. Tec. 2011, 2, 1–27. [Google Scholar] [CrossRef]
Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]

Figure 1. A flowchart of the proposed approach for binary classification.

Figure 2. CP decomposition of a third-order tensor.

Figure 3. A flowchart of the scaling kernel function.

Figure 4. Scaling nonlinear mapping with the CP decomposition.

Figure 5. The binary classification for 200 randomly generated points.

Figure 6. The contours of the ratio of magnification factors (14) for training data.

Figure 7. fMRI images: (a) The third-order tensor format; (b) the visualization.

Figure 8. Effect of the sparsity level s with

R = 2

.

Figure 8. Effect of the sparsity level s with

R = 2

.

Figure 9. Effect of the CP rank with

s = 97

.

Figure 9. Effect of the CP rank with

s = 97

.

Figure 10. CIFAR-10 dataset.

Figure 11. Effect of s with

R = 1

.

Figure 11. Effect of s with

R = 1

.

Figure 12. Effect of CP rank with

s = 112

.

Figure 12. Effect of CP rank with

s = 112

.

Table 1. Some related works on STM.

	Kernel (Scaled)	Loss	LRD	SST	DR
Tao et al. [10]	× (×)	hinge	CP	×	✓
Kotsia et al. [11]	× (×)	hinge	Tucker	×	✓
Kotsia et al. [21]	× (×)	hinge	CP	×	✓
Yang et al. [22]	× (×)	hinge	×	×	×
Chen et al. [13]	× (×)	hinge	TT	×	✓
Shu et al. [38]	×(×)	hinge	×	×	×
Wang et al. [36]	× (×)	0/1 loss	×	×	×
Hao et al. [40]	linear(×)	hinge	CP	×	✓
He et al. [24]	RBF(×)	hinge	CP	×	✓
He et al. [25]	RBF(×)	hinge	CP	×	✓
Chen et al. [26]	RBF(×)	hinge	TT	×	✓
Kour et al. [27]	RBF(×)	hinge	TT	×	✓
Deng et al. [28]	RBF (×)	hinge	TR	×	✓
Our work	RBF (✓)	piecewise quadratic smooth loss	CP	×	×

Table 2. The classification accuracy (%) comparison for subject ‘04799’.

re-NKSTM	NKSTM	LIBSVM	NSSVM-RBF	DuSK	TT-MMK
76.88	52.87	50.00	50.00	56.37	57.44

The best result is highlighted in bold.

Table 3. The classification accuracy (%) comparison for the CIFAR-10 dataset.

Class Pair	re-NKSTM	NKSTM	LIBSVM	NSSVM-RBF	DuSK	TT-MMK
‘bird, deer’	75.40	67.50	62.50	60.70	73.00	73.40
‘deer, horse’	88.30	66.10	54.70	71.10	70.30	71.20
‘bird, horse’	90.00	69.40	71.10	72.90	73.20	76.10
‘truck, ship’	82.90	75.30	76.40	75.80	78.40	78.20
‘automobile, ship’	83.30	74.40	60.30	77.70	78.30	77.30
‘automobile, cat’	84.60	77.90	65.80	77.10	79.10	79.50
‘automobile, horse’	79.20	75.00	72.50	78.60	76.40	75.90
‘air, ship’	73.10	65.90	65.10	69.80	68.70	71.50
‘automobile, truck’	87.10	69.00	58.10	71.90	73.10	72.20

The best result is highlighted in bold.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, S.; Luo, Z. Sparse Support Tensor Machine with Scaled Kernel Functions. Mathematics 2023, 11, 2829. https://doi.org/10.3390/math11132829

AMA Style

Wang S, Luo Z. Sparse Support Tensor Machine with Scaled Kernel Functions. Mathematics. 2023; 11(13):2829. https://doi.org/10.3390/math11132829

Chicago/Turabian Style

Wang, Shuangyue, and Ziyan Luo. 2023. "Sparse Support Tensor Machine with Scaled Kernel Functions" Mathematics 11, no. 13: 2829. https://doi.org/10.3390/math11132829

APA Style

Wang, S., & Luo, Z. (2023). Sparse Support Tensor Machine with Scaled Kernel Functions. Mathematics, 11(13), 2829. https://doi.org/10.3390/math11132829

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Sparse Support Tensor Machine with Scaled Kernel Functions

Abstract

1. Introduction

2. Related Works

3. Tensor Basics

4. Sparse STM with Scaled Kernels

4.1. The Proposed Sparse STM Model

4.2. Scaling Tensor Kernel Functions

4.3. The Proposed NKSTM Approach

5. Numerical Experiments

5.1. Benchmark Methods

5.2. StarPlus fMRI Dataset

5.3. CIFAR-10 Dataset

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI