1. Introduction
Matrix multiplication is a fundamental operation in numerical linear algebra, with applications ranging from machine learning to signal processing. Discovering faster algorithms for this task can therefore have a wide-reaching impact: for very large systems, saving even a few percent of computation time can be worth millions of dollars. The study of fast matrix multiplication (FMM) began with Strassen's breakthrough in 1969 [1], and despite decades of progress, many questions remain open. Recently, new algorithms have been discovered using deep reinforcement learning [2], while an augmented Lagrangian (AL) method has been proposed to obtain more stable and efficient decompositions [3]. Existing optimization-based approaches for discovering FMMs [2,4,5,6,7,8] often require significant computational resources and manual tuning, making them difficult to replicate. In this work, we propose a framework that requires neither large computational resources nor manual tuning, enabling the discovery of FMM algorithms in a more efficient and reproducible manner.
The fast matrix multiplication (FMM) problem aims for faster ways to multiply large matrices by rewriting the bilinear equations of matrix multiplication as a tensor equation:
$$C^\top = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} \langle A, E_{ij} \rangle_F \, \langle B, E_{jk} \rangle_F \, E_{ki}, \qquad (1)$$
where $C = AB$, with $A \in \mathbb{R}^{I \times J}$ and $B \in \mathbb{R}^{J \times K}$, and $\{E_{ij}\}$, for $i = 1, \ldots, I$ and $j = 1, \ldots, J$, is a basis of matrices in $\mathbb{R}^{I \times J}$ such that $(E_{ij})_{mn} = 1$ if $(m, n) = (i, j)$, and zero otherwise (the bases $\{E_{jk}\}$ and $\{E_{ki}\}$ are defined analogously). '$\otimes$' denotes the tensor or outer product, and '$\langle \cdot, \cdot \rangle_F$' indicates the Frobenius inner product. A transpose is added in (1) such that the tensor that is defined has additional interesting properties, such as cyclic symmetry (CS), which will be discussed further in Section 2.3.2.
The rank of matrix multiplication is the minimal integer $R$ such that
$$C^\top = \sum_{r=1}^{R} \langle A, U_r \rangle_F \, \langle B, V_r \rangle_F \, W_r \qquad (2)$$
for some matrices $U_r \in \mathbb{R}^{I \times J}$, $V_r \in \mathbb{R}^{J \times K}$, and $W_r \in \mathbb{R}^{K \times I}$. It can be shown that minimizing $R$ minimizes the computational complexity of matrix multiplication [9] (Proposition 15.1) and [10]. More specifically, the number of arithmetic operations needed to multiply two matrices in $\mathbb{R}^{n \times n}$ is $\mathcal{O}(n^{\omega})$, where $\omega = \log_n R$. Since $R \leq n^3$, it holds that $\omega \leq 3$. Additionally, the exponent is bounded from below by two, since at least $n^2$ operations are needed to compute as many elements.
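For example, Strassen's base algorithm multiplies $2 \times 2$ matrices using $R = 7$ active multiplications instead of $2^3 = 8$, which gives
$$\omega = \log_2 7 \approx 2.807 < 3.$$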
Minimizing $R$ in (2) corresponds to finding a canonical polyadic decomposition (CPD) of the matrix multiplication tensor (MMT) defined implicitly in (1) [1]:
$$\mathcal{T}_{I,J,K} = \sum_{r=1}^{R} \operatorname{vec}(U_r) \otimes \operatorname{vec}(V_r) \otimes \operatorname{vec}(W_r), \qquad (3)$$
where $\mathcal{T}_{I,J,K}$ has dimensions $IJ \times JK \times KI$ and '$\operatorname{vec}(\cdot)$' denotes vectorization along the columns. An MMT is a sparse tensor consisting of $IJK$ ones and zeros elsewhere:
$$\big(\mathcal{T}_{I,J,K}\big)_{i+(j-1)I,\; j+(k-1)J,\; k+(i-1)K} = 1, \qquad (4)$$
for all $i = 1, \ldots, I$, $j = 1, \ldots, J$, and $k = 1, \ldots, K$. An MMT thus only depends on the size of the matrices that are multiplied. A polyadic decomposition (PD) of $\mathcal{T}_{I,J,K}$ of length $R$ decomposes $\mathcal{T}_{I,J,K}$ into $R$ rank-1 tensors:
$$\mathcal{T}_{I,J,K} = \sum_{r=1}^{R} u_r \otimes v_r \otimes w_r, \qquad (5)$$
where $u_r$, $v_r$, and $w_r$ are vectors in $\mathbb{R}^{IJ}$, $\mathbb{R}^{JK}$, and $\mathbb{R}^{KI}$, respectively. These vectors can be collected in three so-called factor matrices: $U = [u_1 \cdots u_R] \in \mathbb{R}^{IJ \times R}$, $V = [v_1 \cdots v_R] \in \mathbb{R}^{JK \times R}$, and $W = [w_1 \cdots w_R] \in \mathbb{R}^{KI \times R}$. The rank $R_{I,J,K}$ or $\operatorname{rank}(\mathcal{T}_{I,J,K})$ denotes the minimal $R$ for which (5) holds; the decomposition is then called canonical (CPD). A PD with $R$ terms is denoted by $\mathrm{PD}_R$, and $\mathrm{CPD}(\mathcal{T})$ denotes a CPD of a certain tensor $\mathcal{T}$. Equations (4) and (5) are also known as the Brent equations [11].
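To make the index convention in (4) concrete, the following Matlab sketch builds $\mathcal{T}_{I,J,K}$ entrywise; the function name mmt is ours, and the indices match the column-major vectorization used above.

    function T = mmt(I, J, K)
    % Build the matrix multiplication tensor T_{I,J,K} of dimensions
    % IJ x JK x KI, with ones exactly at the positions given in (4).
        T = zeros(I*J, J*K, K*I);
        for i = 1:I
            for j = 1:J
                for k = 1:K
                    T(i+(j-1)*I, j+(k-1)*J, k+(i-1)*K) = 1;
                end
            end
        end
    end

As a sanity check, contracting the first two modes of mmt(I,J,K) with $\operatorname{vec}(A)$ and $\operatorname{vec}(B)$ reproduces $\operatorname{vec}(C^\top)$ for random $A$ and $B$, in agreement with (1).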
Decompositions of MMTs yield base algorithms for FMM of the form
$$C^\top = \sum_{r=1}^{R} \big\langle \operatorname{vec}(A), u_r \big\rangle \, \big\langle \operatorname{vec}(B), v_r \big\rangle \, \operatorname{unvec}(w_r), \qquad (6)$$
which is the same equation as (2) with $U_r = \operatorname{unvec}(u_r)$, $V_r = \operatorname{unvec}(v_r)$, and $W_r = \operatorname{unvec}(w_r)$, for $r = 1, \ldots, R$. The total number of operations of a base algorithm is higher than for the standard algorithm. However, the number of active multiplications, i.e., the number of multiplications between (linear combinations of) elements of $A$ and $B$, is reduced from $IJK$ to $R$. Algorithm (6) becomes faster when applied sufficiently many times recursively to multiply large matrices, because the active multiplications determine the asymptotic complexity [12]. In this case, the large matrices, e.g., of size $I^d \times J^d$ and $J^d \times K^d$ for some $d \in \mathbb{N}$, are divided into $I \times J$ and $J \times K$ blocks of equal size, e.g., $I^{d-1} \times J^{d-1}$ and $J^{d-1} \times K^{d-1}$, respectively. All operations in (6) are then performed directly on the blocks. For example, the vectorization operator stacks the blocks in a block vector of length $IJ$ and $JK$, respectively, where each element is again a matrix. The inner products are calculated on these block vectors, where each scalar multiplication becomes a scalar element-wise multiplication on all elements of each block. Each multiplication between the inner products becomes another matrix multiplication of smaller dimension and can be further divided into smaller blocks if possible. A more detailed analysis of the recursive application of FMM algorithms is given in Section 2.2.
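As an illustration of this recursive scheme, the following minimal Matlab sketch applies Strassen's rank-7 base algorithm, i.e., the $R = 7$ instance of (6) for $I = J = K = 2$; it assumes square matrices whose dimension is a power of two, and the crossover size nmin (ours) controls when the recursion falls back to the standard algorithm.

    function C = strassen(A, B, nmin)
    % Recursive FMM with Strassen's base algorithm; the seven
    % recursive calls below are the active multiplications.
        n = size(A, 1);
        if n <= nmin, C = A*B; return; end
        h = n/2; i1 = 1:h; i2 = h+1:n;
        A11 = A(i1,i1); A12 = A(i1,i2); A21 = A(i2,i1); A22 = A(i2,i2);
        B11 = B(i1,i1); B12 = B(i1,i2); B21 = B(i2,i1); B22 = B(i2,i2);
        M1 = strassen(A11+A22, B11+B22, nmin);
        M2 = strassen(A21+A22, B11,     nmin);
        M3 = strassen(A11,     B12-B22, nmin);
        M4 = strassen(A22,     B21-B11, nmin);
        M5 = strassen(A11+A12, B22,     nmin);
        M6 = strassen(A21-A11, B11+B12, nmin);
        M7 = strassen(A12-A22, B21+B22, nmin);
        C  = [M1+M4-M5+M7, M3+M5; M2+M4, M1-M2+M3+M6];
    end

The block additions and subtractions increase the constant in the complexity, which is why a crossover size is used in practice.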
The rank of $\mathcal{T}_{I,J,K}$, denoted by $R_{I,J,K}$, is not known for most problem parameters $I$, $J$, and $K$, which is one of the main difficulties of the FMM problem. One of the exceptions is $\mathcal{T}_{2,2,2}$, of which the rank is known to be seven [13] and the problem is fully understood. We call the rank-7 decomposition discovered by Strassen [1] Strassen's decomposition. Only lower and upper bounds on the rank exist for most other $I$, $J$, and $K$. For example, it is known that $R_{3,3,3}$ lies between 19 [14] and 23 [4,6,7,15]. The upper bound, i.e., the lowest rank of a known PD of $\mathcal{T}_{I,J,K}$, is denoted by $\hat{R}_{I,J,K}$. The most recent overview table of $\hat{R}_{I,J,K}$ is given in [16] (Table 3); however, since then, there has been some further improvement for certain dimensions [17]. Note that permuting $(I, J, K)$ does not change the rank. Recently, there has also been an interest in finding complex-valued decompositions of MMTs [16,18]. However, as the number of standard multiplications to multiply two complex numbers is at least three, the complexity is higher than for real decompositions of the same length. A recent overview of different techniques to obtain base algorithms for different dimensions is given in [19] (Section 1.1).
To find a CPD of $\mathcal{T}_{I,J,K}$, the following nonlinear least squares (NLS) problem in the factor matrices $(U, V, W)$ is formulated [4,5,7,11,20]:
$$\min_{U, V, W} \; f(U, V, W) := \frac{1}{2} \bigg\| \operatorname{vec}\big(\mathcal{T}_{I,J,K}\big) - \sum_{r=1}^{R} w_r \otimes v_r \otimes u_r \bigg\|_2^2, \qquad (7)$$
where '$\otimes$' here denotes the Kronecker product and $\|\cdot\|_2$ the $\ell_2$-norm. Note that $(U, V, W)$ is a PD of $\mathcal{T}_{I,J,K}$ if and only if $f(U, V, W) = 0$. However, state-of-the-art optimization algorithms are not guaranteed to converge to a global optimum of (7). Furthermore, the convergence can be slow, even for second-order methods. One of the reasons is that (7) is non-convex, and PDs of MMTs are not unique and, furthermore, have additional invariances compared with generic PDs [21]. This means that the minima of (7) are non-isolated.
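For reference, the objective in (7) can be evaluated densely as in the following Matlab sketch (the function name brent_cost is ours); practical implementations instead exploit the sparsity of $\mathcal{T}_{I,J,K}$.

    function val = brent_cost(U, V, W, t)
    % Cost function of (7): U, V, W are the IJ x R, JK x R, and KI x R
    % factor matrices and t = vec(T_{I,J,K}); under column-major vec,
    % vec(u o v o w) = kron(w, kron(v, u)).
        res = t;
        for r = 1:size(U, 2)
            res = res - kron(W(:,r), kron(V(:,r), U(:,r)));
        end
        val = 0.5*norm(res)^2;
    end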
In practice, the rank $R_{I,J,K}$ is approximated experimentally by the lowest length $R$ for which a solution $(U, V, W)$ with $f(U, V, W) = 0$ can be obtained using numerical optimization. However, as no globally convergent algorithm for (7) exists, a failure to find such a solution does not prove that a solution of lower rank does not exist, as long as there is a gap with the theoretical lower bound.
Another difficulty is the existence of border rank PDs (BRPDs) [10,12]. BRPDs are ill-conditioned approximate decompositions of $\mathcal{T}_{I,J,K}$ with elements that grow to infinity during convergence. The border rank of a particular tensor can be smaller than its canonical rank. This is closely related to the concept of degeneracy [22,23]. The BRPDs of MMTs are investigated in, e.g., [4,24,25,26,27,28]. To eliminate the convergence to BRPDs, constraints are usually added to (7) [3,4,5,8], or the elements are restricted to a discrete set of values [2,6]. The phenomenon of degeneracy also gives rise to regions of very slow convergence in the optimization landscape, informally known as swamps [29,30,31].
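A standard example (for a generic tensor, not an MMT) illustrates the phenomenon: for linearly independent vectors $a$ and $b$, the tensor on the left-hand side below has rank 3 but border rank 2, since it is the limit of rank-2 tensors whose elements blow up as $\epsilon \to 0$:
$$a \otimes a \otimes b + a \otimes b \otimes a + b \otimes a \otimes a = \lim_{\epsilon \to 0} \frac{1}{\epsilon} \Big[ (a + \epsilon b) \otimes (a + \epsilon b) \otimes (a + \epsilon b) - a \otimes a \otimes a \Big].$$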
Additionally, the number of variables or unknowns grows rapidly with $I$, $J$, and $K$. Therefore, this paper proposes a new structure that can be enforced in PDs of MMTs to reduce the number of variables, which directly reduces the computational complexity of optimization algorithms that solve (7) and shrinks the large search space.
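Concretely, (7) has $(IJ + JK + KI)R$ unknowns: already for $\mathcal{T}_{3,3,3}$ at the best-known rank $R = 23$, this amounts to $(9 + 9 + 9) \cdot 23 = 621$ variables, and the count grows further with the dimensions and the rank.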
To solve (7), the alternating least squares (ALS) method is frequently used [4,7,20,32]. The local convergence of the ALS method for (7) is known under the assumption that the Hessian is positive definite at the solutions modulo the scaling indeterminacies [33]. This is, however, not satisfied for PDs of MMTs, as they have additional invariances [21]. In [4,5], a constrained optimization problem and a corresponding method are proposed to improve the convergence. More specifically, Lagrange multipliers and a quadratic penalty are used for the constraints, and a Levenberg–Marquardt (LM) and an ALS method, respectively, are used to solve the constrained problems. Still, these methods were unable to find certain decompositions that are known to exist. Another constrained optimization problem was proposed recently [3]:
$$\min_{U, V, W} \; f(U, V, W) \quad \text{subject to} \quad h(U, V, W) = 0, \quad l \leq (U, V, W) \leq u, \qquad (8)$$
together with an augmented Lagrangian (AL) method to solve it. The bound constraint ensures that the method does not converge to BRPDs. Different equality constraints $h$ were proposed to obtain PDs with simple coefficients. Compared with adding quadratic penalty terms to (7), the AL method has the advantage that the constraints are satisfied accurately even outside the neighborhood of an optimum. New PDs, including different new CS PDs, were obtained using this method, and the stability of existing algorithms was improved [3]. As inner optimization algorithm, the LM method is used, which is a (second-order) damped Gauss–Newton optimization algorithm that optimizes over all factor matrices simultaneously [34].
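For reference, one outer iteration of a generic AL scheme for the equality constraints in (8) has the textbook form
$$\mathcal{L}_{\rho}(x, \lambda) = f(x) + \lambda^{\top} h(x) + \frac{\rho}{2} \| h(x) \|_2^2, \qquad x^{+} \approx \operatorname*{arg\,min}_{l \leq x \leq u} \mathcal{L}_{\rho}(x, \lambda), \qquad \lambda^{+} = \lambda + \rho\, h(x^{+}),$$
where $x$ collects the entries of $(U, V, W)$ and the inner minimization is performed by the LM method. This is a sketch of the general scheme, not the exact update schedule used in [3].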
We use the constrained problem formulation (8) together with the AL method proposed in [3] to find new PDs of MMTs with the generalized CS structure proposed in this paper. The structure is imposed to find new decompositions more easily by reducing the search space and the computational complexity per iteration; these decompositions can then, in turn, be used to reduce the complexity of matrix multiplication. Although it is not known whether a CPD of $\mathcal{T}_{I,J,K}$ with the proposed structure exists for all $I$, $J$, and $K$, the structure is very flexible, as it is determined by two parameters that can be varied and, for example, be made smaller to be less restrictive. So far, we have not encountered any CPDs of MMTs of small dimensions that do not admit this structure for any combination of these two parameters.
1.1. Contribution
A generalization of the CS structure investigated in [7,20,35] to non-square matrix multiplication is proposed for decompositions of MMTs. The advantage of investigating this structure is threefold.
Firstly, this structure reduces the number of variables, thereby shrinking the large search space and reducing the computation time per iteration when used in an optimization algorithm. We have implemented the structure efficiently in a state-of-the-art all-at-once optimization algorithm for the FMM problem and use a constrained problem formulation to find new practical solutions automatically, without manual tuning. A large number of numerical experiments illustrates that including this structure helps find more practical solutions for a fixed number of starting points.
Secondly, we prove that by exploiting this structure, the cost of multiplying a matrix by itself can be reduced. We give an example with Strassen's decomposition, which shows that the CS structure enables a speed-up over standard matrix multiplication already from a smaller matrix size.
Lastly, 'unique' decompositions of MMTs, based on the well-known invariance transformations [21], are of theoretical interest. Different classes of decompositions have been investigated before [2,3,7,20,36], but not for rectangular fast matrix multiplication. Determining whether two decompositions are equivalent requires solving a non-trivial optimization problem. Using the result from [3], where it was shown that the rank of the Jacobian matrix is also invariant under these transformations, we use the different ranks to distinguish 'unique' practical decompositions.
1.2. Organization
The paper is organized as follows. Section 2 provides the preliminaries related to the FMM problem. In Section 3, the new CS structure is proposed, and its advantages are discussed. Section 4 presents a large number of numerical experiments for different MMTs, illustrating that incorporating this structure helps to find more (unique) practical decompositions per fixed number of starting points.
4. Numerical Experiments
This section presents the results obtained using the AL method [3] to find PDs of MMTs with the generalized CS structure, for various matrix sizes $(I, K, J)$ and different CS parameters $S$ and $T$. The rank is always chosen as the lowest rank for which a solution is known to exist in the literature. For the first two experiments, this rank is also known to be optimal [28,41].
The AL method requires upper and lower bounds on the elements in the decomposition: $l \leq (U, V, W) \leq u$. We set these bounds and generate 50 random starting points of small magnitude using a built-in Matlab random number generator, for each set of problem parameters. The number of inner iterations of the LM method was set to 50, and the number of outer iterations to 15. Tolerances were imposed on the gradient norm and on the constraint violation.
When a numerical solution with a cost function value below this tolerance is found, the constraint from (9), scaled by a factor of 0.1, is added to the optimization problem to convert the numerical solution into a practical one. It is known that not all numerical solutions can be converted into practical ones using inv-transformations [36]. Although we use numerical optimization rather than inv-transformations, we also observed that many numerical solutions do not converge to a discrete solution under this constraint, while others converge very quickly. We verified that many of the non-converging numerical solutions also fail to satisfy the necessary conditions for discretizability from [36].
The rank of the Jacobian matrix of the generalized structure at the discrete solutions is computed. The rank deficiency reflects the dimension of the invariance transformations applicable to the decomposition. Two decompositions with different ranks cannot be transformed into one another using inv-transformations [3]. Thus, the more distinct ranks are found, the more 'unique' solutions are discovered.
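For the unstructured residual in (7), this Jacobian has an explicit Kronecker form, and its rank at a given solution can be computed as in the following Matlab sketch (ours; the Jacobian of the generalized structure is analogous but has fewer columns):

    function rk = jacrank(U, V, W)
    % Rank of the Jacobian of sum_r kron(W(:,r), kron(V(:,r), U(:,r)))
    % with respect to all entries of U, V, and W.
        [n1, R] = size(U); n2 = size(V, 1); n3 = size(W, 1);
        Jac = zeros(n1*n2*n3, (n1+n2+n3)*R);
        for r = 1:R
            Jac(:, (r-1)*n1+(1:n1))           = kron(W(:,r), kron(V(:,r), eye(n1)));
            Jac(:, R*n1+(r-1)*n2+(1:n2))      = kron(W(:,r), kron(eye(n2), U(:,r)));
            Jac(:, R*(n1+n2)+(r-1)*n3+(1:n3)) = kron(eye(n3), kron(V(:,r), U(:,r)));
        end
        rk = rank(Jac);
    end

Two discrete solutions with different values of this rank cannot be transformed into one another using inv-transformations [3].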
The MMTs are ordered according to their best-known rank in Table 1. The following paragraphs present the results for sizes between $\mathcal{T}_{2,2,2}$ and $\mathcal{T}_{3,3,3}$, both corresponding to square matrix multiplication tensors. Larger sizes are possible, but at a certain point, the computation time per experiment becomes too large to perform all 50 experiments per parameter set in a reasonable amount of time on a standard computer. All experiments were performed on a laptop with an AMD Ryzen 7 PRO 7840U processor and 64 GB of RAM, using Tensorlab 4 beta [42] and Matlab R2023b. The results of the experiments, including the new decompositions, as well as the code, are publicly available (Version 1) (https://github.com/CharlotteVermeylen/ALM_FMM_gen_CS (accessed on 18 September 2025)).
4.1. Problem 1
Table 2 presents the results for the first MMT for different combinations of $S$ and $T$. The rightmost column indicates the number of practical solutions found out of 50 random starting points. The bottom row shows the results without imposing any structure. The size, the minimal and maximal rank, and the number of distinct ranks of the Jacobian matrix are provided in the three middle columns. For a complete list of all distinct ranks, we refer to the detailed results available online. As expected, the second dimension of the Jacobian matrix and its rank decrease as more structure is imposed, due to the reduced parameter space. The null-space dimension of the Gauss–Newton Gramian decreases accordingly.
From Table 2, it is evident that significantly more practical solutions are found for moderate values of $S$ and $T$ compared with the unstructured case (bottom row). For some parameter combinations, the number of practical solutions more than doubles, and the number of unique solutions doubles as well. All improvements are highlighted in bold.
4.2. Problem 2
The results for the second MMT are shown in Table 3, again for different values of the CS parameters $S$ and $T$. The largest number of practical solutions, found for one combination of $S$ and $T$, is four times higher than the number found without structure. Also here, it is evident that including the generalized CS structure leads to more practical PDs and also yields a greater variety of Jacobian ranks, indicating more non-inv-equivalent PDs.
4.3. Problem 3
In Table 4, the results for the third MMT at rank 15 are presented. The maximum CS rank obtained is 9. The same conclusions as in the previous sections apply; however, generally fewer practical solutions are found because the problem becomes increasingly difficult as $I$, $K$, and $J$ grow.
One of the decompositions found (available in electronic form) has the same number of nonzeros and the same stability parameters as the algorithm proposed in [35], but with nontrivial CS parameters. In the CS part, Strassen's decomposition, which also satisfies the CS structure, can be recognized. To our knowledge, this is one of the optimal known algorithms for this size, or any permutation of the dimensions.
4.4. Problem 4
Table 5 presents the results for the fourth MMT at rank 18. The maximal CS rank is 9, as was the case in Table 4. Notably, for one combination of $S$ and $T$, ten times more practical solutions are found compared with the unstructured case.
4.5. Problem 5
Lastly, Table 6 presents the results for the fifth MMT at rank 20. Very few practical solutions are found in this case. Electronic versions of the decompositions are available. When using 1000 starting points for one combination of $S$ and $T$, the sparsest practical PD that we found has 144 nonzeros, and its stability factors are slightly higher than those of the decomposition proposed in [8], which has 130 nonzeros.
4.6. Remark: Square Matrix Multiplication
Note that the CS rank can also be chosen smaller than $R$ in the square case $I = J = K$. For example, 11 symmetric rank-1 tensors can be enforced within a PD of length 23 of $\mathcal{T}_{3,3,3}$. The remaining 12 rank-1 tensors do not need to follow any specific structure in this case. To our knowledge, no decomposition with this many symmetric terms has been reported in the literature for $\mathcal{T}_{3,3,3}$. An example decomposition with this structure is available in electronic form.