Article

Exploiting Generalized Cyclic Symmetry to Find Fast Rectangular Matrix Multiplication Algorithms Easier

by Charlotte Vermeylen 1,*, Nico Vervliet 1,2, Lieven De Lathauwer 1,2 and Marc Van Barel 3

1 Department of Electrical Engineering (ESAT), KU Leuven, Kasteelpark Arenberg 10 bus 2246, B-3001 Leuven, Belgium
2 Subfaculty Science and Technology, KU Leuven Kulak, E. Sabbelaan 53, B-8500 Kortrijk, Belgium
3 Department of Computer Science, KU Leuven, Celestijnenlaan 200A, B-3001 Leuven, Belgium
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(19), 3064; https://doi.org/10.3390/math13193064
Submission received: 22 August 2025 / Revised: 17 September 2025 / Accepted: 19 September 2025 / Published: 23 September 2025

Abstract

The quest to multiply two large matrices as fast as possible has intrigued researchers for several decades. However, the ‘optimal’ algorithm for a given problem size is still not known. The fast matrix multiplication (FMM) problem can be formulated as a non-convex optimization problem, more specifically as a challenging tensor decomposition problem. In this work, we build upon a state-of-the-art augmented Lagrangian algorithm, which formulates the FMM problem as a constrained least squares problem, by incorporating a new, generalized cyclic symmetric (CS) structure in the decomposition. This structure decreases the number of variables, thereby reducing the large search space and the computational cost per iteration. The constraints are used to find practical solutions, i.e., decompositions with simple coefficients, which yield fast algorithms when implemented in hardware. For the FMM problem, a very large number of starting points is usually necessary to converge to a solution. Extensive numerical experiments for different problem sizes demonstrate that including this structure yields more ‘unique’ practical decompositions for a fixed number of starting points. Uniqueness is defined relative to the known scale and trace invariance transformations that hold for all FMM decompositions. Making it easier to find practical decompositions may lead to the discovery of faster FMM algorithms when combined with sufficient computational power. Lastly, we show that the CS structure reduces the cost of multiplying a matrix by itself.
MSC:
15-04; 15A69; 15A72; 15A21; 65F30; 65K05; 65K10; 90C26

1. Introduction

Matrix multiplication is a fundamental operation in numerical linear algebra, with applications ranging from machine learning to signal processing. Discovering faster algorithms for this task can therefore have a wide-reaching impact. For very large systems, saving a few percent of computation time can be worth millions of dollars. The study of fast matrix multiplication (FMM) began with Strassen’s breakthrough in 1969 [1], and despite decades of progress, many questions remain open. Recently, new algorithms have been discovered using deep reinforcement learning [2], while an augmented Lagrangian (AL) method has been proposed to obtain more stable and efficient decompositions [3]. Existing optimization-based approaches for discovering FMM algorithms [2,4,5,6,7,8] often require significant computational resources and manual tuning, making them difficult to replicate. In this work, we propose a framework that requires neither large computational resources nor manual tinkering, enabling the discovery of FMM algorithms in a more efficient and reproducible manner.
The fast matrix multiplication (FMM) problem aims for faster ways to multiply large matrices by rewriting the bilinear equations of matrix multiplication as a tensor equation:
$$Z(i,j) = \sum_{k=1}^{K} X(i,k)\,Y(k,j) \;\;\Longleftrightarrow\;\; Z^{\top} = \sum_{i,j,k=1}^{I,J,K} \bigl\langle E_{ik}^{I\times K}, X\bigr\rangle_F\, \bigl\langle E_{kj}^{K\times J}, Y\bigr\rangle_F\, E_{ji}^{J\times I} =: \Bigl(\sum_{i,j,k=1}^{I,J,K} E_{ik}^{I\times K}\circ E_{kj}^{K\times J}\circ E_{ji}^{J\times I}\Bigr)\cdot_F \bigl(X, Y\bigr), \tag{1}$$
where $Z := XY$ and $E_{ij}^{I\times J}$, for $i = 1,\dots,I$ and $j = 1,\dots,J$, is a basis of matrices in $\mathbb{R}^{I\times J}$ such that $E_{ij}^{I\times J}(i_1, j_2) = 1$ if $(i_1, j_2) = (i,j)$ and zero otherwise. ‘$\circ$’ denotes the tensor or outer product, and $\langle\cdot,\cdot\rangle_F$ and ‘$\cdot_F$’ indicate Frobenius inner products. A transpose is added in (1) such that the tensor that is defined has additional interesting properties, such as cyclic symmetry (CS), which will be discussed further in Section 2.3.2.
The rank of matrix multiplication is the minimal integer R such that
$$Z = \sum_{r=1}^{R} \bigl\langle U_r, X\bigr\rangle_F\, \bigl\langle V_r, Y\bigr\rangle_F\, W_r, \tag{2}$$
for some matrices $U_r \in \mathbb{R}^{I\times K}$, $V_r \in \mathbb{R}^{K\times J}$, and $W_r \in \mathbb{R}^{J\times I}$. It can be shown that minimizing R minimizes the computational complexity of matrix multiplication [9] (Proposition 15.1) and [10]. More specifically, the number of arithmetic operations needed to multiply two matrices in $\mathbb{R}^{I\times I}$ is $\mathcal{O}(I^{\omega})$, where $\omega := \log_I R$. Since $R < I^3$, it holds that $\omega < 3$. Additionally, the exponent is bounded from below by two, since at least $I^2$ operations are needed to compute as many output elements.
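For example, with Strassen's base algorithm for $I = 2$ (introduced below), $R = 7$, so that
$$\omega = \log_2 7 \approx 2.807 < 3.$$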
Minimizing R in (2) corresponds to finding a canonical polyadic decomposition (CPD) of the matrix multiplication tensor (MMT) defined implicitly in (1) [1]:
$$\mathcal{T}_{IKJ} := \sum_{i,j,k=1}^{I,J,K} e_{ik}^{IK} \circ e_{kj}^{KJ} \circ e_{ji}^{IJ}, \tag{3}$$
where $\mathcal{T}_{IKJ}$ has dimensions $IK \times KJ \times IJ$ and $e_{ik}^{IK} := \mathrm{vec}\bigl(E_{ik}^{I\times K}\bigr)$, where $\mathrm{vec}(\cdot)$ denotes vectorization along the columns. An MMT is a sparse tensor consisting of $IJK$ ones and zeros elsewhere:
$$\mathcal{T}_{IKJ}\bigl(i_1 + (k_2-1)I,\; k_1 + (j_2-1)K,\; j_1 + (i_2-1)J\bigr) = \begin{cases} 1 & \text{if } i_1 = i_2,\ j_1 = j_2,\ k_1 = k_2,\\ 0 & \text{otherwise}, \end{cases} \tag{4}$$
for all $i_1, i_2 = 1,\dots,I$, $j_1, j_2 = 1,\dots,J$, and $k_1, k_2 = 1,\dots,K$. An MMT thus only depends on the size of the matrices that are multiplied. A polyadic decomposition (PD) of $\mathcal{T}_{IKJ}$ of length R decomposes $\mathcal{T}_{IKJ}$ into R rank-1 tensors:
$$\mathcal{T}_{IKJ} = \sum_{r=1}^{R} u_r \circ v_r \circ w_r, \tag{5}$$
where $u_r$, $v_r$, and $w_r$ are vectors in $\mathbb{R}^{IK}$, $\mathbb{R}^{KJ}$, and $\mathbb{R}^{IJ}$, respectively. These vectors can be collected in three so-called factor matrices: $U := [u_1, \dots, u_R]$, $V := [v_1, \dots, v_R]$, and $W := [w_1, \dots, w_R]$. The rank R or $R(\mathcal{T})$ denotes the minimal R for which (5) holds. The decomposition is then called canonical (CPD). A PD with R terms is denoted by $\mathrm{PD}_R$, and $\mathrm{PD}_R(\mathcal{T})$ denotes a $\mathrm{PD}_R$ of a certain tensor $\mathcal{T}$. Equations (4) and (5) are also known as the Brent equations [11].
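To make the definition above concrete, the following minimal Python/NumPy sketch (the helper name mmt is ours and not taken from the paper's code) builds $\mathcal{T}_{IKJ}$ directly from (4) and checks that it contains exactly $IJK$ ones:

```python
import numpy as np

def mmt(I, K, J):
    """Build the matrix multiplication tensor T_IKJ defined in (4).

    T has dimensions IK x KJ x IJ and contains exactly I*J*K ones.
    Indices below are 0-based, whereas (4) uses 1-based indices.
    """
    T = np.zeros((I * K, K * J, I * J))
    for i in range(I):
        for j in range(J):
            for k in range(K):
                # first mode:  i1 + (k2-1)*I  with i1 = i2 = i and k2 = k
                # second mode: k1 + (j2-1)*K  with k1 = k and j2 = j
                # third mode:  j1 + (i2-1)*J  with j1 = j and i2 = i
                T[i + k * I, k + j * K, j + i * J] = 1.0
    return T

T = mmt(2, 2, 2)
print(T.shape)        # (4, 4, 4)
print(int(T.sum()))   # 8 = I*J*K ones, zeros elsewhere
```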
Decompositions of MMTs yield base algorithms for FMM of the form
$$\mathrm{vec}(Z) = \sum_{r=1}^{R} \bigl\langle u_r, \mathrm{vec}(X)\bigr\rangle\, \bigl\langle v_r, \mathrm{vec}(Y)\bigr\rangle\, w_r, \tag{6}$$
which is the same equation as (2) with $u_r = \mathrm{vec}(U_r)$, $v_r = \mathrm{vec}(V_r)$, and $w_r = \mathrm{vec}(W_r)$, for $r = 1,\dots,R$. The total number of operations of a base algorithm is higher than for the standard algorithm. However, the number of active multiplications, i.e., multiplications between (linear combinations of) elements of X and Y, is reduced from $IJK$ to R. Algorithm (6) becomes faster when applied sufficiently many times recursively to multiply large matrices, because the active multiplications determine the asymptotic complexity [12]. In this case, the large matrices, e.g., of size $I^l\times K^l$ and $K^l\times J^l$ for some $l\in\mathbb{N}$, are divided into $I\times K$ and $K\times J$ blocks of equal size, e.g., $I^{l-1}\times K^{l-1}$ and $K^{l-1}\times J^{l-1}$, respectively. All operations in (6) are then performed directly on the blocks. For example, the vectorization operator stacks the blocks in a block vector of length $IK$ and $KJ$, respectively, where each element is again a matrix. The inner products are calculated on these block vectors, where each scalar multiplication becomes a scalar element-wise multiplication on all elements of a block. Each multiplication between the inner products becomes another matrix multiplication of smaller dimension and can be further divided into smaller blocks if possible. A more detailed analysis of the recursive application of FMM algorithms is given in Section 2.2.
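The recursive scheme is easiest to see for $I = K = J = 2$ with Strassen's classical rank-7 base algorithm [1]. The sketch below (Python/NumPy; it assumes the matrix dimension is a power of two, and the function name is ours) applies the base algorithm recursively on $2\times 2$ block partitions exactly as described above:

```python
import numpy as np

def strassen_square(X, Y):
    """Multiply two 2^l x 2^l matrices with Strassen's rank-7 base algorithm,
    applied recursively on 2 x 2 block partitions."""
    n = X.shape[0]
    if n == 1:
        return X * Y
    h = n // 2
    A11, A12, A21, A22 = X[:h, :h], X[:h, h:], X[h:, :h], X[h:, h:]
    B11, B12, B21, B22 = Y[:h, :h], Y[:h, h:], Y[h:, :h], Y[h:, h:]
    # the 7 active multiplications, each again a smaller matrix product
    M1 = strassen_square(A11 + A22, B11 + B22)
    M2 = strassen_square(A21 + A22, B11)
    M3 = strassen_square(A11, B12 - B22)
    M4 = strassen_square(A22, B21 - B11)
    M5 = strassen_square(A11 + A12, B22)
    M6 = strassen_square(A21 - A11, B11 + B12)
    M7 = strassen_square(A12 - A22, B21 + B22)
    Z = np.empty_like(X)
    Z[:h, :h] = M1 + M4 - M5 + M7
    Z[:h, h:] = M3 + M5
    Z[h:, :h] = M2 + M4
    Z[h:, h:] = M1 - M2 + M3 + M6
    return Z

X, Y = np.random.randn(2, 8, 8)
print(np.allclose(strassen_square(X, Y), X @ Y))   # True
```

Each recursion level replaces one product of size $2^l$ by seven products of size $2^{l-1}$ plus a fixed number of block additions, which is why the active multiplications dominate the asymptotic cost.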
The rank of $\mathcal{T}_{IKJ}$, denoted by $R(\mathcal{T}_{IKJ})$, is not known for most problem parameters I, J, and K, which is one of the main difficulties of the FMM problem. One of the exceptions is $\mathcal{T}_{222}$, whose rank is known to be seven [13]; this problem is fully understood. We call the $\mathrm{PD}_7(\mathcal{T}_{222})$ discovered by Strassen $\mathrm{PD}_{\mathrm{Strassen}}$ [1]. Only lower and upper bounds on the rank exist for most other I, J, and K. For example, it is known that $R(\mathcal{T}_{333})$ lies between 19 [14] and 23 [4,6,7,15]. The upper bound, i.e., the lowest known rank of a PD of $\mathcal{T}_{IKJ}$, is denoted by $\tilde{R}(\mathcal{T}_{IKJ})$. The most recent overview table of $\tilde{R}(\mathcal{T}_{IKJ})$ is given in [16] (Table 3); since then, there has been further improvement for $I = K = J = 4$ [17]. Note that permuting (I, J, K) does not change the rank. Recently, there has also been interest in finding complex-valued decompositions of $\mathcal{T}_{IKJ}$ [16,18]. However, as multiplying two complex numbers requires at least three real multiplications, the complexity is higher than for real decompositions of $\mathcal{T}_{IKJ}$. A recent overview of different techniques to obtain base algorithms for different dimensions is given in [19] (Section 1.1).
To find a $\mathrm{PD}_R$ of $\mathcal{T}_{IKJ}$, the following nonlinear least squares (NLS) problem in $x := \bigl[\mathrm{vec}(U)^{\top}\ \mathrm{vec}(V)^{\top}\ \mathrm{vec}(W)^{\top}\bigr]^{\top}$ is formulated [4,5,7,11,20]:
$$\min_{x}\ \frac{1}{2}\Bigl\|\sum_{r=1}^{R} w_r \otimes v_r \otimes u_r - \mathrm{vec}\bigl(\mathcal{T}_{IKJ}\bigr)\Bigr\|^2 =: f(x), \tag{7}$$
where ‘⊗’ denotes the Kronecker product and $\|\cdot\|$ the $\ell_2$-norm. Note that x is a $\mathrm{PD}_R$ of $\mathcal{T}_{IKJ}$ if and only if $f(x) = 0$. However, state-of-the-art optimization algorithms are not guaranteed to converge to a global optimum of (7). Furthermore, the convergence can be slow, even for second-order methods. One of the reasons is that (7) is non-convex; moreover, PDs of MMTs are not unique and have additional invariances compared with generic PDs [21]. This means that the minima of (7) are non-isolated.
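For reference, the objective in (7) is simple to evaluate once the factor matrices are given. The sketch below (helper names ours) computes $f$ through the full tensor residual, which has the same norm as the vectorized residual in (7):

```python
import numpy as np

def pd_objective(U, V, W, T):
    """Objective f in (7): half the squared norm of the CPD residual."""
    approx = np.einsum('ir,jr,kr->ijk', U, V, W)   # sum_r u_r o v_r o w_r
    return 0.5 * np.linalg.norm(approx - T) ** 2

# T_222 built as in (4); random factor matrices of length R = 7 give f > 0,
# and f vanishes exactly at a PD_7 of T_222
I = K = J = 2
T = np.zeros((I * K, K * J, I * J))
for i in range(I):
    for j in range(J):
        for k in range(K):
            T[i + k * I, k + j * K, j + i * J] = 1.0
R = 7
U, V, W = (np.random.randn(d, R) for d in (I * K, K * J, I * J))
print(pd_objective(U, V, W, T))
```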
In practice, the rank R is approximated experimentally by the lowest length $\tilde{R}$ for which a solution x with $f(x) = 0$ can be obtained using numerical optimization. However, since no globally convergent algorithm for (7) exists, failing to find a decomposition of lower length does not prove that none exists, as long as there is a gap with the theoretical lower bound.
Another difficulty is the existence of border rank PDs (BRPDs) [10,12]. BRPDs are ill-conditioned approximate decompositions of T I K J with elements that grow to infinity during convergence. The border rank of a particular tensor can be smaller than its canonical rank. This is closely related to the concept of degeneracy [22,23]. The BRPDs of MMTs are, e.g., investigated in [4,24,25,26,27,28]. To eliminate the convergence to BRPDs, constraints are usually added to (7) [3,4,5,8] or the elements are restricted to a discrete set of values [2,6]. The phenomenon of degeneracy also gives rise to regions of very slow convergence in the optimization landscape, informally known as swamps [29,30,31].
Additionally, the number of variables or unknowns grows rapidly with I, J, and K. Therefore, this paper proposes a new structure that can be enforced in PDs of MMTs to reduce the number of variables, which directly reduces the computational complexity of optimization algorithms to solve (7) and shrinks the large search space.
To solve (7), the alternating least squares (ALS) method is frequently used [4,7,20,32]. Local convergence of the ALS method for (7) is guaranteed under the assumption that the Hessian is positive definite at the solutions modulo the scaling indeterminacies [33]. This is, however, not satisfied for PDs of MMTs, as they have additional invariances [21]. In [4,5], a constrained optimization problem and corresponding method are proposed to improve the convergence. More specifically, Lagrange multipliers and a quadratic penalty are used for the constraints, and a Levenberg–Marquardt (LM) and an ALS method, respectively, are used to solve the constrained problems. Still, these methods were unable to find $\mathrm{PD}_{49}$s for $4\times 4$ matrix multiplication, although these are known to exist. Another constrained optimization problem was proposed recently [3]:
$$\min_{x}\ f(x) \quad \text{s.t.}\quad h(x) = 0,\quad l \le x \le u, \tag{8}$$
together with an augmented Lagrangian (AL) method to solve it. The bound constraint ensures that the method does not converge to BRPDs. Different equality constraints $h(x)$ were proposed to obtain PDs with simple coefficients. Compared with adding quadratic penalty terms to (7), the AL method has the advantage that the constraints are satisfied accurately even outside the neighborhood of an optimum. New $\mathrm{PD}_{49}(\mathcal{T}_{444})$s and different new CS PDs were obtained with this method, and the stability of existing algorithms was improved [3]. As inner optimization algorithm, the LM method is used, which is a (second-order) damped Gauss–Newton algorithm that optimizes over all factor matrices simultaneously [34].
We use the constrained problem formulation (8) together with the AL method proposed in [3] to find new PDs of T I K J with the generalized CS structure proposed in this paper. The structure is imposed to find new decompositions more easily by reducing the search space and computational complexity per iteration, which can then, in turn, be used to reduce the complexity of matrix multiplication. Although it is not known whether for all I, J, and K, a CPD of T I K J with the proposed structure exists, the structure is very flexible, as it is determined by two parameters that can be varied and, for example, can be made smaller to be less restrictive. So far, we have not encountered any CPDs of T I K J of small dimensions that do not admit this structure for any combination of these two parameters.

1.1. Contribution

A generalization of the CS structure investigated in [7,20,35] to non-square matrix multiplication is proposed for decompositions of MMTs. The advantage of investigating this structure is threefold.
Firstly, this structure reduces the number of variables, thereby shrinking the large search space and speeding up the computation time per iteration when used in an optimization algorithm. We have implemented the structure efficiently in a state-of-the-art all-at-once optimization algorithm for the FMM problem and use a constrained problem formulation to find new practical solutions automatically without manual tinkering. A large number of numerical experiments illustrate that including this structure helps find more practical solutions for a fixed number of starting points.
Secondly, we prove that by exploiting this structure, the cost of multiplying a matrix by itself can be reduced. We give an example with Strassen’s decomposition, which shows that the CS structure enables a speed-up over standard matrix multiplication already for matrices of size $I = J = K = 2^8 = 256$ instead of $2^{10} = 1024$.
Lastly, ‘unique’ decompositions of MMTs, based on the well-known invariance transformations [21], are of theoretical interest. Different classes of decompositions have already been investigated before [2,3,7,20,36], but not for rectangular fast matrix multiplication. To determine if two decompositions are equivalent, a non-trivial optimization problem arises. Using the result from [3], where it was shown that the rank of the Jacobian matrix is also invariant under these transformations, we use the different ranks to distinguish ‘unique’ practical decompositions.

1.2. Organization

The paper is organized as follows. Section 2 provides the preliminaries related to the FMM problem. In Section 3, the new CS structure is proposed, and its advantages are discussed. Section 4 presents a large number of numerical experiments for different MMTs, illustrating that incorporating this structure helps to find more (unique) practical decompositions for a fixed number of starting points. Section 5 concludes the paper.

2. Preliminaries

This section first discusses the notation and afterwards what is meant by practical decompositions and FMM algorithms. Section 2.3 gives a summary of the most important properties of decompositions of MMTs: invariances, cyclic symmetry, and recursive decompositions.

2.1. Notation

Scalars are denoted by lowercase letters, vectors by bold lowercase letters, matrices by bold uppercase letters, and tensors by calligraphic letters, e.g., a, $\mathbf{a}$, $\mathbf{A}$, and $\mathcal{A}$. The ith column of $\mathbf{A}$ is denoted by $\mathbf{a}_i$ and the jth element of $\mathbf{a}$ by $a_j$. ‘$\circ$’ denotes the tensor or outer product, ‘⊗’ the Kronecker product, ‘∗’ the Hadamard or element-wise product, and $\mathrm{vec}(\cdot)$ the vectorization operator. Vectorization of a tensor is performed such that $\mathrm{vec}(\mathcal{T})\bigl(i + (j-1)J + (k-1)JK\bigr) = \mathcal{T}(i,j,k)$. $\mathbf{I}$ denotes the identity matrix of the appropriate size.

2.2. Practical Algorithms

To multiply larger matrices, e.g., of size $I^l\times K^l$ and $K^l\times J^l$, a base algorithm for (I, K, J) matrix multiplication can be applied recursively, which requires $\mathcal{O}(c\,R^l)$ floating-point operations. The value of c depends on how many nonzero elements appear in the decomposition. When c is sufficiently small and l sufficiently large, the computational complexity of standard matrix multiplication, i.e., $\mathcal{O}\bigl((IKJ)^l\bigr)$ operations, can be lowered. The following result is a generalization of the operation count of Strassen’s decomposition in [1] to general FMM decompositions.
Proposition 1 
(cost of fast matrix multiplication). Let U, V, and W be the factor matrices of a base algorithm for (I, K, J) matrix multiplication. The complexity of using this base algorithm l times recursively to multiply two matrices $X\in\mathbb{R}^{I^l\times K^l}$ and $Y\in\mathbb{R}^{K^l\times J^l}$ is
$$\Bigl(1 + \frac{c_U}{R - IK} + \frac{c_V}{R - KJ} + \frac{c_W}{R - IJ}\Bigr) R^l - \frac{c_U\,(IK)^l}{R - IK} - \frac{c_V\,(KJ)^l}{R - KJ} - \frac{c_W\,(IJ)^l}{R - IJ},$$
where $c_U$ and $c_V$ are the numbers of additions and scalar multiplications needed to compute the inner products with the $u_r$ and $v_r$ in (6), respectively, and $c_W$ is the number of additions and scalar multiplications with the $w_r$.
Proof. 
A base algorithm (6) of rank R of $\mathcal{T}_{IKJ}$ can be applied to the $I^{l-1}\times K^{l-1}$ submatrices of X and the $K^{l-1}\times J^{l-1}$ submatrices of Y, where each active multiplication is then again a multiplication of size $(I^{l-1}, K^{l-1}, J^{l-1})$. The cost to multiply X and Y can thus be written as
$$\mathrm{cost}\bigl(I^l, K^l, J^l\bigr) = R\cdot \mathrm{cost}\bigl(I^{l-1}, K^{l-1}, J^{l-1}\bigr) + c_U\,(IK)^{l-1} + c_V\,(KJ)^{l-1} + c_W\,(IJ)^{l-1}.$$
Applying this formula recursively until $\mathrm{cost}(1,1,1) = 1$ leads to
$$\begin{aligned} \mathrm{cost}\bigl(I^l, K^l, J^l\bigr) &= R^l + \sum_{i=1}^{l} R^{i-1}\bigl(c_U\,(IK)^{l-i} + c_V\,(KJ)^{l-i} + c_W\,(IJ)^{l-i}\bigr)\\ &= R^l + c_U\,(IK)^{l-1}\sum_{i=0}^{l-1}\Bigl(\frac{R}{IK}\Bigr)^{i} + c_V\,(KJ)^{l-1}\sum_{i=0}^{l-1}\Bigl(\frac{R}{KJ}\Bigr)^{i} + c_W\,(IJ)^{l-1}\sum_{i=0}^{l-1}\Bigl(\frac{R}{IJ}\Bigr)^{i}\\ &= R^l + c_U\,\frac{R^l - (IK)^l}{R - IK} + c_V\,\frac{R^l - (KJ)^l}{R - KJ} + c_W\,\frac{R^l - (IJ)^l}{R - IJ}. \end{aligned}$$ □
Remark: The constants $c_U$, $c_V$, and $c_W$ depend implicitly on the parameters I, J, and K in the sense that each factor matrix contains at least $IJK$ and at most $IKR$, $KJR$, and $IJR$ nonzero elements, respectively. Furthermore, we can assume that each factor vector contains at least one nonzero, and the rows of U, V, and W contain at least I, J, and K nonzeros each, respectively. This can be derived from the specific locations of the ones in the tensor, as given in (4). The constants $c_U$, $c_V$, and $c_W$ are directly related to the number of nonzeros, but the exact relation depends on the specific distribution of the nonzeros and their values.
The constants c U , c V , and c W can be decreased by using sparse PDs with elements that are powers of 2 because multiplication with a power of 2 is not a costly operation when implemented in hardware. Such PDs are called practical because, without small constants, standard matrix multiplication is only improved asymptotically.
A disadvantage of formulation (7) is that the solutions have floating-point elements. Consequently, if a numerical PD of T I K J is obtained, it has to be transformed to a practical one. The well-known inv-transformations [21] can be used for this purpose [5,8]. However, not all numerical PDs of T I K J can be transformed to a practical one [36]. Furthermore, this is not an easy process and involves a lot of manual tinkering [5,8]. In [36], this process is called discretization. Another possibility is to add constraints to the optimization problem, such as the following simple equality constraint proposed in [3]:
$$h_{\mathrm{discr}}(x) := x \ast (x + 1) \ast (x - 1), \tag{9}$$
where ‘∗’ denotes element-wise multiplication. This constraint forces the elements to the discrete set { 0 , ± 1 } .
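The effect of (9) is easy to see elementwise: it vanishes exactly on $\{0, \pm 1\}$ and is nonzero otherwise, so its norm measures how far a candidate solution is from being practical. A minimal sketch (the augmented Lagrangian bookkeeping of [3] is omitted):

```python
import numpy as np

def h_discr(x):
    """Elementwise discretization constraint (9): zero exactly on {0, -1, +1}."""
    return x * (x + 1.0) * (x - 1.0)

x = np.array([-1.0, 0.0, 1.0, 0.5, 0.99])
print(h_discr(x))                  # [ 0.  0.  0. -0.375 -0.019701]
print(np.abs(h_discr(x)).max())    # overall constraint violation
```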

2.3. Properties of Decompositions of Matrix Multiplication Tensors

This section first briefly discusses the invariances for decompositions of T I K J . Afterwards, other properties such as cyclic symmetry and recursive PDs are discussed in more detail.

2.3.1. Invariance Transformations

Factor matrices are a non-unique representation of a PD. For example, the same tensor is obtained under permutation of the rank-1 tensors. Additionally, the factor vectors can be scaled as $\alpha_r u_r$, $\beta_r v_r$, $\frac{1}{\alpha_r\beta_r} w_r$, for all $r = 1,\dots,R$ and $\alpha_r, \beta_r \in \mathbb{R}_0$.
Decompositions of $\mathcal{T}_{IKJ}$ also have other invariances [21]; see, e.g., [3] for an overview. The combination of the trivial transformations and the specific transformations that hold for decompositions of $\mathcal{T}_{IKJ}$ is called the inv-transformations. They have dimension $I^2 + K^2 + J^2 + 2R - 3$. Two PDs of $\mathcal{T}_{IKJ}$ that can be transformed into one another in this way are called inv-equivalent. All $\mathrm{PD}_7(\mathcal{T}_{222})$s are known to be inv-equivalent [13]. Some decompositions of $\mathcal{T}_{IKJ}$ also have additional invariances; see, e.g., [3].
The Jacobian matrix can be used to investigate the inv-equivalence of decompositions of MMTs and to discover additional invariances [3]. That is why in Section 4, the size and ranks of the Jacobian matrix at solutions with the proposed structure are given.
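The trivial scaling invariance described above is easy to verify numerically. In the following sketch (generic random factors, not an actual PD of an MMT), rescaling the factor vectors leaves the reconstructed tensor unchanged up to rounding errors:

```python
import numpy as np

# scale u_r by alpha_r, v_r by beta_r, and w_r by 1/(alpha_r*beta_r):
# the reconstructed tensor does not change
R, n = 7, 4
U, V, W = (np.random.randn(n, R) for _ in range(3))
alpha = np.random.rand(R) + 0.5
beta = np.random.rand(R) + 0.5

T1 = np.einsum('ir,jr,kr->ijk', U, V, W)
T2 = np.einsum('ir,jr,kr->ijk', U * alpha, V * beta, W / (alpha * beta))
print(np.allclose(T1, T2))   # True
```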

2.3.2. Cyclic Symmetry

Because of (3), $\mathcal{T}_{IKJ}$ is a structured tensor. More specifically, when $I = J = K$, $\mathcal{T}_{III}$ is so-called cyclic symmetric (CS), which means that $\mathcal{T}_{III}(k,i,j) = \mathcal{T}_{III}(j,k,i) = \mathcal{T}_{III}(i,j,k)$, for $i, j, k = 1,\dots,I^2$. This structure can be exploited in the factor matrices. More specifically, the rank-1 tensors can be forced either to occur in CS triples or to be symmetric [35]:
$$\mathcal{T}_{III} = \sum_{s=1}^{S} a_s \circ a_s \circ a_s + \sum_{t=1}^{T} \bigl(b_t \circ c_t \circ d_t + c_t \circ d_t \circ b_t + d_t \circ b_t \circ c_t\bigr),$$
where $a_s$, $b_t$, $c_t$, and $d_t$ are in $\mathbb{R}^{I^2}$, for $s = 1,\dots,S$ and $t = 1,\dots,T$. Here, S and T count the symmetric rank-1 tensors and the CS rank-1 tensors, respectively. The length of a CS PD equals $R = 3T + S$. The matrices A, B, C, and D can be defined as containing the vectors $a_s$, $b_t$, $c_t$, and $d_t$ as columns, and the factor matrices can be parameterized as $U = [A\ B\ C\ D]$, $V = [A\ D\ B\ C]$, and $W = [A\ C\ D\ B]$. The number of variables is only a third of the original value, making this parameterization highly useful for the FMM problem. Furthermore, the cost function can be changed to
$$\begin{aligned} \min_{A,B,C,D}\ \frac{1}{2}\Biggl( &\sum_{i=1}^{I^2}\sum_{j=i}^{I^2}\sum_{k=i+1}^{I^2} \Bigl(\mathcal{T}_{III}(i,j,k) - \sum_{s=1}^{S} a_{is}a_{js}a_{ks} - \sum_{t=1}^{T}\bigl(b_{it}d_{jt}c_{kt} + c_{it}b_{jt}d_{kt} + d_{it}c_{jt}b_{kt}\bigr)\Bigr)^2 \\ &+ \sum_{i=1}^{I^2}\Bigl(\mathcal{T}_{III}(i,i,i) - \sum_{s=1}^{S} a_{is}^3 - 3\sum_{t=1}^{T} b_{it}d_{it}c_{it}\Bigr)^2\Biggr), \end{aligned} \tag{10}$$
where $a_{is} := A(i,s)$, $b_{it} := B(i,t)$, and similarly for $c_{it}$ and $d_{it}$, which reduces the number of rows in the Jacobian matrix used in NLS optimization methods to solve (10) from $I^6$ to $\frac{1}{3}\bigl(I^6 - I^2\bigr) + I^2$. Note that this cost function only includes one element for each CS group of entries because the CS structure already ensures that these elements are equal.
Different practical CS decompositions of rank 7 and 23 of $\mathcal{T}_{222}$ and $\mathcal{T}_{333}$, respectively, are known [7,20]. They were discovered through an ALS method and analyzed using group theory and algebraic geometry. Concerning Strassen’s decomposition, $S = 1$ and $T = 2$. Although all $\mathrm{PD}_7(\mathcal{T}_{222})$s are known to be inv-equivalent, Strassen’s decomposition can be transformed into another practical PD with parameters $(S,T) = (4,1)$. However, this increases the number of nonzeros and is thus not used in practice.
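The parameterization $U = [A\ B\ C\ D]$, $V = [A\ D\ B\ C]$, $W = [A\ C\ D\ B]$ produces a cyclic symmetric tensor for any choice of A, B, C, and D, which can be checked numerically. A minimal sketch (random matrices, so the resulting tensor is CS but of course not an MMT):

```python
import numpy as np

I, S, T = 2, 1, 2                 # Strassen's CS parameters: R = S + 3*T = 7
A = np.random.randn(I * I, S)
B, C, D = (np.random.randn(I * I, T) for _ in range(3))

U = np.hstack([A, B, C, D])
V = np.hstack([A, D, B, C])
W = np.hstack([A, C, D, B])

X = np.einsum('ir,jr,kr->ijk', U, V, W)   # sum of the R rank-1 terms
# cyclically permuting the modes leaves a CS tensor unchanged
print(np.allclose(X, np.transpose(X, (2, 0, 1))))   # True
print(np.allclose(X, np.transpose(X, (1, 2, 0))))   # True
```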

2.4. Decompositions Obtained by Recursion

As mentioned in Section 2.2, an FMM algorithm applies a base algorithm for small I, J, and K to large matrices, e.g., of size $I^l\times K^l$ and $K^l\times J^l$. In the same way, we can also obtain PDs of rank 49 of $\mathcal{T}_{444}$ by using a $\mathrm{PD}_7(\mathcal{T}_{222})$ once recursively. The $\mathrm{PD}_{49}(\mathcal{T}_{444})$ obtained using $\mathrm{PD}_{\mathrm{Strassen}}$ is called $\mathrm{PD}_{\mathrm{Strassen}}^{\mathrm{rec}}$ in the rest of the text. The following proposition gives the formulas for the factor matrices of a recursive PD in terms of the factor matrices of the original PD. This result is well known, e.g., from the Supplementary Material of [2] and from [37], and a proof can be found, e.g., in [38] (Section 2.2.3).
Proposition 2 
(Recursive PD). If U, V, and W are the factor matrices of a $\mathrm{PD}_R(\mathcal{T}_{IKJ})$, then the matrices $U'$, $V'$, and $W'$ with columns
$$u'_{r'} := \mathrm{vec}\bigl(U_{r_1}\otimes U_{r_2}\bigr),\quad v'_{r'} := \mathrm{vec}\bigl(V_{r_1}\otimes V_{r_2}\bigr),\quad w'_{r'} := \mathrm{vec}\bigl(W_{r_1}\otimes W_{r_2}\bigr),\quad r' := r_2 + (r_1 - 1)R,$$
where ‘⊗’ denotes the Kronecker product, for $r_1, r_2 = 1,\dots,R$, are the factor matrices of a $\mathrm{PD}_{R^2}\bigl(\mathcal{T}_{I^2K^2J^2}\bigr)$.
The following corollary gives the CS parameters and factor matrices of a recursive PD as a function of the original PD.
Corollary 3. 
If U, V, and W are CS factor matrices of a $\mathrm{PD}_R(\mathcal{T}_{III})$ with CS parameters S and T such that $U = [A\ B\ C\ D]$, $V = [A\ D\ B\ C]$, and $W = [A\ C\ D\ B]$, where $A\in\mathbb{R}^{I^2\times S}$ and $B, C, D\in\mathbb{R}^{I^2\times T}$, then the factor matrices $U'$, $V'$, and $W'$ of the recursive decomposition $\mathrm{PD}^{\mathrm{rec}}_{R^2}\bigl(\mathcal{T}_{I^2I^2I^2}\bigr)$ are also CS, with parameters $S' = S^2$ and $T' = T(S + R)$:
$$U' := [A'\ B'\ C'\ D'],\qquad V' := [A'\ D'\ B'\ C'],\qquad W' := [A'\ C'\ D'\ B'],$$
where $A'\in\mathbb{R}^{I^4\times S'}$ and $B', C', D'\in\mathbb{R}^{I^4\times T'}$ and, more specifically,
$$A'_{s'} := A_{s_1}\otimes A_{s_2},\qquad s' := s_2 + (s_1 - 1)S,\qquad s_1, s_2 = 1,\dots,S,$$
and
$$B' := [B'_1\ B'_2],\qquad C' := [C'_1\ C'_2],\qquad D' := [D'_1\ D'_2],$$
where
$$B'_{1,t_1} := A_{s_1}\otimes B_t,\quad C'_{1,t_1} := A_{s_1}\otimes C_t,\quad D'_{1,t_1} := A_{s_1}\otimes D_t,\qquad B'_{2,t_2} := B_t\otimes U_r,\quad C'_{2,t_2} := C_t\otimes W_r,\quad D'_{2,t_2} := D_t\otimes V_r,$$
where $t_1 := t + (s_1 - 1)T$ and $t_2 := r + (t - 1)R$, for $t = 1,\dots,T$ and $r = 1,\dots,R$.
Proof. 
From Proposition 2, it follows that $U'_{i} = U_{i_1}\otimes U_{i_2}$, $V'_{i} = V_{i_1}\otimes V_{i_2}$, and $W'_{i} = W_{i_1}\otimes W_{i_2}$, where $i := i_2 + (i_1 - 1)R$. Thus, the symmetric part of the recursive PD must satisfy $U_{i_1} = V_{i_1} = W_{i_1}$ and $U_{i_2} = V_{i_2} = W_{i_2}$. Consequently, $i_1$ and $i_2$ must be smaller than or equal to S, such that $U_{i_1} = V_{i_1} = W_{i_1} = A_{i_1}$ and $U_{i_2} = V_{i_2} = W_{i_2} = A_{i_2}$. For the other values of $i_1$ and $i_2$, the columns still appear in CS triples, and they can be rearranged as in (11) to satisfy the CS structure. □
Note that, since $\mathrm{PD}_{\mathrm{Strassen}}$ satisfies $S = 1$ and $T = 2$, the parameters of $\mathrm{PD}_{\mathrm{Strassen}}^{\mathrm{rec}}$ are $S' = 1$ and $T' = 16$. Another CS $\mathrm{PD}_{49}(\mathcal{T}_{444})$, with $(S', T') = (16, 11)$, is obtained by using a $\mathrm{PD}_7(\mathcal{T}_{222})$ with $(S, T) = (4, 1)$ recursively.
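As a quick check of the formulas in Corollary 3 for these two cases:
$$S' = S^2 = 1,\qquad T' = T(S + R) = 2\,(1 + 7) = 16,\qquad R' = S' + 3T' = 49,$$
and, starting from the $\mathrm{PD}_7(\mathcal{T}_{222})$ with $(S, T) = (4, 1)$, one obtains $S' = 16$, $T' = 1\cdot(4 + 7) = 11$, and again $R' = 16 + 33 = 49$.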

3. Generalization of Cyclic Symmetry

In this section, a generalization of the CS structure discussed in Section 2.3.2 to the case when I, J, and K are not all equal is proposed. We do this to reduce the number of variables and shrink the search space, thereby speeding up the convergence. The basic idea is the observation that an MMT for non-square matrix multiplication is still CS as long as the indices in (4) do not exceed the smallest dimension. However, because of the interaction of these CS indices with the non-symmetric indices in a CPD of the tensor, imposing this structure on all rank-1 tensors would be too restrictive. Hence, some unstructured terms are added as well. In Section 4, we give experimental results for the maximum number of structured rank-1 terms for different problem sizes. Because, e.g., a PD of $\mathcal{T}_{KJI}$ can be obtained from a PD of $\mathcal{T}_{IKJ}$, we assume in this section that $I \le K \le J$. It is known from (4) that
$$\mathcal{T}_{IKJ}\bigl(i_1 + (k_2-1)I,\ k_1 + (j_2-1)K,\ j_1 + (i_2-1)J\bigr) = \sum_{r=1}^{R} U_r(i_1, k_2)\,V_r(k_1, j_2)\,W_r(j_1, i_2) = \begin{cases}1 & \text{if } i_1 = i_2,\ j_1 = j_2,\ k_1 = k_2,\\ 0 & \text{otherwise},\end{cases}$$
for all $i_1, i_2 = 1,\dots,I$, $j_1, j_2 = 1,\dots,J$, and $k_1, k_2 = 1,\dots,K$. Consequently, as long as $i_1, i_2, j_1, j_2, k_1, k_2 \le I$, it holds that
$$\mathcal{T}_{IKJ}\bigl(i_1 + (k_2-1)I,\ k_1 + (j_2-1)K,\ j_1 + (i_2-1)J\bigr) = \mathcal{T}_{IKJ}\bigl(k_1 + (j_2-1)I,\ j_1 + (i_2-1)K,\ i_1 + (k_2-1)J\bigr) = \mathcal{T}_{IKJ}\bigl(j_1 + (i_2-1)I,\ i_1 + (k_2-1)K,\ k_1 + (j_2-1)J\bigr).$$
Therefore,
$$\sum_{r=1}^{R} U_r(i_1, k_2)\,V_r(k_1, j_2)\,W_r(j_1, i_2) = \sum_{r=1}^{R} U_r(k_1, j_2)\,V_r(j_1, i_2)\,W_r(i_1, k_2) = \sum_{r=1}^{R} U_r(j_1, i_2)\,V_r(i_1, k_2)\,W_r(k_1, j_2),$$
which can be used to include a CS structure in a subpart of the factor matrices. An illustration is shown in Figure 1, where the following submatrices are defined as a function of the position in the factor matrices:
$$U_r(i_1, k_2) =: \begin{cases} A_r(i_1, k_2) & \text{if } k_2 \le I,\ r \le S,\\ B_{r-S}(i_1, k_2) & \text{if } k_2 \le I,\ S < r \le S+T,\\ C_{r-S-T}(i_1, k_2) & \text{if } k_2 \le I,\ S+T < r \le S+2T,\\ D_{r-S-2T}(i_1, k_2) & \text{if } k_2 \le I,\ S+2T < r \le R_{\mathrm{CS}},\\ \tilde{U}_r(i_1, k_2 - I) & \text{if } k_2 > I,\ r \le R_{\mathrm{CS}},\\ \dot{U}_{r-R_{\mathrm{CS}}}(i_1, k_2) & \text{if } R_{\mathrm{CS}} < r, \end{cases}$$
where $R_{\mathrm{CS}} := S + 3T$, and for $r = 1,\dots,R$, $i_1 = 1,\dots,I$, and $k_2 = 1,\dots,K$,
$$V_r(k_1, j_2) =: \begin{cases} A_r(k_1, j_2) & \text{if } k_1, j_2 \le I,\ r \le S,\\ D_{r-S}(k_1, j_2) & \text{if } k_1, j_2 \le I,\ S < r \le S+T,\\ B_{r-S-T}(k_1, j_2) & \text{if } k_1, j_2 \le I,\ S+T < r \le S+2T,\\ C_{r-S-2T}(k_1, j_2) & \text{if } k_1, j_2 \le I,\ S+2T < r \le R_{\mathrm{CS}},\\ \hat{V}_r(k_1 - I, j_2) & \text{if } k_1 > I,\ j_2 \le I,\ r \le R_{\mathrm{CS}},\\ \tilde{V}_r(k_1, j_2 - I) & \text{if } j_2 > I,\ r \le R_{\mathrm{CS}},\\ \dot{V}_{r-R_{\mathrm{CS}}}(k_1, j_2) & \text{if } R_{\mathrm{CS}} < r, \end{cases}$$
for $k_1 = 1,\dots,K$ and $j_2 = 1,\dots,J$, and
$$W_r(j_1, i_2) =: \begin{cases} A_r(j_1, i_2) & \text{if } j_1 \le I,\ r \le S,\\ C_{r-S}(j_1, i_2) & \text{if } j_1 \le I,\ S < r \le S+T,\\ D_{r-S-T}(j_1, i_2) & \text{if } j_1 \le I,\ S+T < r \le S+2T,\\ B_{r-S-2T}(j_1, i_2) & \text{if } j_1 \le I,\ S+2T < r \le R_{\mathrm{CS}},\\ \hat{W}_r(j_1 - I, i_2) & \text{if } j_1 > I,\ r \le R_{\mathrm{CS}},\\ \dot{W}_{r-R_{\mathrm{CS}}}(j_1, i_2) & \text{if } R_{\mathrm{CS}} < r, \end{cases}$$
for $j_1 = 1,\dots,J$ and $i_2 = 1,\dots,I$, where $A_s := \mathrm{reshape}(a_s, I\times I)$ for $s = 1,\dots,S$, $B_t := \mathrm{reshape}(b_t, I\times I)$ for $t = 1,\dots,T$, and similarly for $C_t$ and $D_t$, and
$$\tilde{U}_r := \mathrm{reshape}\bigl(\tilde{U}(:,r),\ I\times(K-I)\bigr),\quad \tilde{V}_r := \mathrm{reshape}\bigl(\tilde{V}(:,r),\ K\times(J-I)\bigr),\quad \hat{V}_r := \mathrm{reshape}\bigl(\hat{V}(:,r),\ (K-I)\times I\bigr),\quad \hat{W}_r := \mathrm{reshape}\bigl(\hat{W}(:,r),\ (J-I)\times I\bigr),$$
for $r = 1,\dots,R_{\mathrm{CS}}$, and
$$\dot{U}_r := \mathrm{reshape}\bigl(\dot{U}(:,r),\ I\times K\bigr),\quad \dot{V}_r := \mathrm{reshape}\bigl(\dot{V}(:,r),\ K\times J\bigr),\quad \dot{W}_r := \mathrm{reshape}\bigl(\dot{W}(:,r),\ J\times I\bigr),$$
for $r = 1,\dots,R - R_{\mathrm{CS}}$, and thus $\tilde{U}\in\mathbb{R}^{I(K-I)\times R_{\mathrm{CS}}}$, $\tilde{V}\in\mathbb{R}^{K(J-I)\times R_{\mathrm{CS}}}$, $\hat{V}\in\mathbb{R}^{(K-I)I\times R_{\mathrm{CS}}}$, $\hat{W}\in\mathbb{R}^{(J-I)I\times R_{\mathrm{CS}}}$, $\dot{U}\in\mathbb{R}^{IK\times(R-R_{\mathrm{CS}})}$, $\dot{V}\in\mathbb{R}^{KJ\times(R-R_{\mathrm{CS}})}$, and $\dot{W}\in\mathbb{R}^{IJ\times(R-R_{\mathrm{CS}})}$. The proportions in Figure 1 are those of $\mathcal{T}_{234}$, $(S,T) := (2,2)$, and $R := 20$. We show in the experiments that such PDs indeed exist. Note that, contrary to the case when I, J, and K are equal, the CS rank $R_{\mathrm{CS}}$ does not equal the rank R, because this would be too restrictive. Also, in the experiments that follow, we did not find PDs for which both are equal. This is likely because of the interaction of the CS part with the parts $\tilde{U}$, $\tilde{V}$, $\hat{V}$, and $\hat{W}$. However, including this structure still reduces the number of parameters from $(IK + KJ + IJ)R$ to $(IK + KJ + IJ)R - 2I^2 R_{\mathrm{CS}}$.
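The parameter counts are easy to tabulate. The following sketch (helper name ours) reproduces, for example, the number of Jacobian columns reported for $\mathcal{T}_{234}$ in Section 4:

```python
def n_variables(I, K, J, R, S, T):
    """Free parameters of a PD_R of T_IKJ, without and with the generalized CS structure."""
    R_CS = S + 3 * T
    unstructured = (I * K + K * J + I * J) * R
    structured = unstructured - 2 * I * I * R_CS
    return unstructured, structured

# T_234 with R = 20 and (S, T) = (2, 2), as in Figure 1
print(n_variables(2, 3, 4, 20, 2, 2))   # (520, 456)
```

The first value, 520, matches the column count of the unstructured Jacobian in Table 6, and the second, 456, matches its $R_{\mathrm{CS}} = 8$ row.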
Example 4. 
The optimal CPD (in terms of stability and number of nonzeros) for T 223 is an extension of Strassen’s decomposition [1,39]:
[The explicit decomposition is displayed as an image in the published version.]
The CS part can clearly be recognized. Note that the matrices $\tilde{V}$ and $\hat{W}$ are zero in this case; however, this is not always the case, as shown, for example, in Section 4. Similarly, the optimal $\mathrm{PD}_{14}(\mathcal{T}_{224})$s and $\mathrm{PD}_{18}(\mathcal{T}_{225})$s can also be written with $(S,T) = (1,2)$.

3.1. Computational Complexity Optimization

The most computationally expensive step in a Gauss–Newton-type method, such as the LM method used in Section 4, is solving the linear system for the step, which requires $\mathcal{O}\bigl(\bigl(R(IK + KJ + IJ)\bigr)^3\bigr)$ operations [40]. By incorporating the generalized CS structure, the complexity is directly reduced to $\mathcal{O}\bigl(\bigl(R(IK + KJ + IJ) - 2I^2 R_{\mathrm{CS}}\bigr)^3\bigr)$ operations. For square matrix multiplication and when $R_{\mathrm{CS}} = R$, the complexity is reduced by a factor of $3^3 = 27$.

3.2. Complexity of Multiplying a Matrix with Itself

When multiplying a matrix with itself, i.e., Z : = X X , the CS structure can be exploited to save computations.
Proposition 5. 
Let U, V, and W be the factor matrices of a base algorithm for (I, I, I) matrix multiplication. When U, V, and W are CS with parameters S and T, and $R_{\mathrm{CS}} = R$, and $Z := XX$ with $X\in\mathbb{R}^{I^l\times I^l}$, then the computational cost of applying the base algorithm l times recursively, derived in Proposition 1, is reduced to
$$\mathrm{cost}_{\mathrm{CS}}\bigl(I^l, I^l, I^l\bigr) = 2\Bigl(1 + \frac{c_U}{\delta}\Bigr)R^l - \frac{(2c_U + \delta)\,I^{2l}}{\delta} = \mathrm{cost}\bigl(I^l, I^l, I^l\bigr) - c_U\,\frac{R^l - I^{2l}}{\delta},$$
where $\delta := R - I^2 > 0$ and $c_U$ is the number of additions and scalar multiplications needed to compute the inner products with the $u_r$ in (6).
Proof. 
When $Y = X$ in the base algorithm (6) and the decomposition is CS with $R_{\mathrm{CS}} = R$, the inner products with the $v_r$ are the same as those with the $u_r$, but in a different order:
$$\begin{aligned}\mathrm{vec}(Z) &= \sum_{r=1}^{R}\bigl\langle u_r, \mathrm{vec}(X)\bigr\rangle\,\bigl\langle v_r, \mathrm{vec}(X)\bigr\rangle\, w_r\\ &= \sum_{s=1}^{S}\bigl\langle a_s, \mathrm{vec}(X)\bigr\rangle\,\bigl\langle a_s, \mathrm{vec}(X)\bigr\rangle\, a_s + \sum_{t=1}^{T}\Bigl(\bigl\langle b_t, \mathrm{vec}(X)\bigr\rangle\bigl\langle d_t, \mathrm{vec}(X)\bigr\rangle\, c_t + \bigl\langle c_t, \mathrm{vec}(X)\bigr\rangle\bigl\langle b_t, \mathrm{vec}(X)\bigr\rangle\, d_t + \bigl\langle d_t, \mathrm{vec}(X)\bigr\rangle\bigl\langle c_t, \mathrm{vec}(X)\bigr\rangle\, b_t\Bigr).\end{aligned}$$
Consequently, the recursive formula for the cost is
cost I l , I l , I l = R · cost I l 1 , I l 1 , I l 1 + c U · I 2 l 1 + c W · I 2 l 1 .
Because of the CS, the elements in W are the same as those of U, in a different order. Consequently, $c_W = c_U + R - I^2$. The recursive formula can be expanded in the same way as in the proof of Proposition 1 to obtain
$$\begin{aligned}\mathrm{cost}_{\mathrm{CS}}\bigl(I^l, I^l, I^l\bigr) &= \Bigl(1 + \frac{c_U}{R - I^2} + \frac{c_W}{R - I^2}\Bigr)R^l - \frac{c_U\, I^{2l}}{R - I^2} - \frac{c_W\, I^{2l}}{R - I^2}\\ &= \mathrm{cost}\bigl(I^l, I^l, I^l\bigr) - \frac{c_V\, R^l}{R - I^2} + \frac{c_V\, I^{2l}}{R - I^2} = \mathrm{cost}\bigl(I^l, I^l, I^l\bigr) - c_U\,\frac{R^l - I^{2l}}{R - I^2}\\ &= \Bigl(1 + \frac{2c_U + R - I^2}{R - I^2}\Bigr)R^l - \frac{(2c_U + R - I^2)\, I^{2l}}{R - I^2} = 2\Bigl(1 + \frac{c_U}{R - I^2}\Bigr)R^l - \frac{(2c_U + R - I^2)\, I^{2l}}{R - I^2}.\end{aligned}$$ □
Example 6 
(Complexity of Strassen with and without CS). Strassen’s decomposition for $I = 2$ has rank 7 and 12 nonzeros in each factor matrix. Thus, $c_U = c_V = 12 - R = 5$ and $c_W = 12 - I^2 = 8$. The number of operations required as a function of the number of recursive levels l is shown in Figure 2. When exploiting the CS structure in Strassen’s decomposition, the algorithm improves on standard matrix multiplication for the multiplication of a matrix by itself already after eight recursive levels, thus for matrices of size $2^8 = 256$ or larger. Without CS, there is an improvement over standard matrix multiplication only from size $2^{10} = 1024$, and the improvement is less significant.
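These crossover points can be reproduced from the closed-form expressions in Propositions 1 and 5. In the sketch below, standard multiplication of $n\times n$ matrices is counted as $n^3$ multiplications and $n^3 - n^2$ additions (one common convention; the exact count used for Figure 2 is not restated here), and the default arguments correspond to Strassen's decomposition:

```python
def cost_fmm(l, R=7, I=2, c_U=5, c_V=5, c_W=8):
    """Proposition 1 for a square (I, I, I) base algorithm, applied l times recursively.
    The geometric-series sums are exact integers for the Strassen defaults."""
    d = R - I**2
    return R**l + (c_U + c_V + c_W) * (R**l - I**(2 * l)) // d

def cost_fmm_cs(l, R=7, I=2, c_U=5):
    """Proposition 5: multiplying a matrix by itself with a CS decomposition (R_CS = R)."""
    d = R - I**2
    return cost_fmm(l, R, I, c_U, c_U, c_U + d) - c_U * (R**l - I**(2 * l)) // d

def cost_standard(l, I=2):
    n = I**l
    return 2 * n**3 - n**2

for l in (7, 8, 9, 10):
    print(l, cost_standard(l), cost_fmm(l), cost_fmm_cs(l))
# With CS, squaring a matrix becomes cheaper than the standard algorithm at l = 8
# (size 256); without CS, the crossover is only at l = 10 (size 1024).
```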

4. Numerical Experiments

This section presents the results obtained using the AL method [3] to find PDs of MMTs with the generalized CS structure, for various matrix sizes (I, K, J) and different CS parameters S and T. The rank is always chosen as the lowest rank $\tilde{R}$ for which a solution is known to exist in the literature. For the first two experiments, this is also known to be the optimal one [28,41].
The AL method requires upper and lower bounds on the elements of the decomposition: $l \le x \le u$. We set $u := 1$ and $l := -1$ and generate 50 random starting points of magnitude $10^{-2}$ using the built-in randn function in Matlab, for each set of problem parameters:
$$x_0 := 10^{-2}\cdot\mathrm{randn}\bigl(\bigl((IK + KJ + IJ)R - 2I^2 R_{\mathrm{CS}}\bigr)\times 1\bigr).$$
The number of inner iterations of the LM method was set to 50 and the number of outer iterations to 15. The tolerances on the gradient and the constraint were both set to $10^{-13}$.
When a numerical solution with a cost function value smaller than $10^{-12}$ is found, the constraint $h_{\mathrm{discr}}$ from (9), scaled by a factor of 0.1, is added to the optimization problem to convert the numerical solution into a practical one. It is known that not all numerical solutions can be converted into practical ones using inv-transformations [36]. Although we use numerical optimization rather than inv-transformations, we also observed that many numerical solutions do not converge to a discrete solution under this constraint, while others converge very quickly. We verified that many of the non-converging numerical solutions also fail to satisfy the necessary conditions for discretizability from [36].
The rank of the Jacobian matrix $J_{\mathrm{CS,gen}}$ of the generalized structure is computed at the discrete solutions. The rank deficiency reflects the dimension of the invariance transformations applicable to the decomposition. Two decompositions with different ranks cannot be transformed into one another using inv-transformations [3]. Thus, the more distinct ranks are found, the more ‘unique’ solutions are discovered.
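The paper computes this rank from the analytic Jacobian of the generalized CS parameterization. Purely to illustrate the idea, the following sketch estimates the numerical rank of a finite-difference Jacobian of the unstructured residual in (7) at a given point x (helper names, step size, and tolerance are ours):

```python
import numpy as np

def residual(x, T, R):
    """Residual of (7): CPD approximation minus the target tensor, flattened."""
    n1, n2, n3 = T.shape
    U = x[:n1 * R].reshape(n1, R)
    V = x[n1 * R:(n1 + n2) * R].reshape(n2, R)
    W = x[(n1 + n2) * R:].reshape(n3, R)
    return (np.einsum('ir,jr,kr->ijk', U, V, W) - T).ravel()

def jacobian_rank(x, T, R, h=1e-6, tol=1e-6):
    """Numerical rank of a central finite-difference Jacobian of the residual."""
    cols = []
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        cols.append((residual(x + e, T, R) - residual(x - e, T, R)) / (2 * h))
    return np.linalg.matrix_rank(np.column_stack(cols), tol=tol)

# quick check of the mechanics on a small random CPD (not an MMT)
n, R = 4, 7
U, V, W = (np.random.randn(n, R) for _ in range(3))
T = np.einsum('ir,jr,kr->ijk', U, V, W)
x = np.concatenate([U.ravel(), V.ravel(), W.ravel()])
print(jacobian_rank(x, T, R), x.size)
```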
The MMTs are ordered according to their best-known rank in Table 1. The following paragraphs present the results for sizes between ( 2 , 2 , 2 ) and ( 3 , 3 , 3 ) , both corresponding to square matrix multiplication tensors. Larger sizes are possible, but at a certain point, the computation time per experiment becomes too large to perform all 50 experiments per parameter set in a reasonable amount of time on a standard computer.
All experiments were performed on a laptop with an AMD Ryzen 7 PRO 7840U processor and 64 GB of RAM, using Tensorlab 4 beta [42] and Matlab R2023b. The results of the experiments, including the new decompositions, as well as the code, are publicly available (Version 1) (https://github.com/CharlotteVermeylen/ALM_FMM_gen_CS (accessed on 18 September 2025)).

4.1. Problem 1: $(I, K, J) = (2, 2, 3)$, $R := R(\mathcal{T}_{223}) = 11$

Table 2 presents the results for $\mathrm{PD}_{11}(\mathcal{T}_{223})$s for different combinations of S and T. The rightmost column indicates the number of practical solutions found out of 50 random starting points. The bottom row shows the results without imposing any structure.
The size, the minimal and maximal rank, and the number of distinct ranks of $J_{\mathrm{CS,gen}}$ are provided in the middle columns. For a complete list of all distinct ranks, we refer to the detailed results available online. As expected, the second dimension of the Jacobian matrix and its rank decrease as $R_{\mathrm{CS}}$ increases, due to the reduced parameter space. The null-space dimension of the Gauss–Newton Gramian $J_{\mathrm{CS,gen}}^{\top} J_{\mathrm{CS,gen}}$ also decreases with $R_{\mathrm{CS}}$.
From Table 2, it is evident that significantly more practical solutions are found for moderate values of S and T compared with the unstructured case (bottom row). Specifically, for ( S , T ) = ( 3 , 1 ) and ( S , T ) = ( 1 , 1 ) , the number of practical solutions more than doubles. Additionally, for ( S , T ) = ( 1 , 1 ) , the number of unique solutions also doubles. All improvements are highlighted in bold.

4.2. Problem 2: $(I, K, J) = (2, 2, 4)$, $R := R(\mathcal{T}_{224}) = 14$

The results for $\mathrm{PD}_{14}(\mathcal{T}_{224})$ are shown in Table 3, again for different values of $R_{\mathrm{CS}}$, S, and T. The largest CS rank obtained is $R_{\mathrm{CS}} = 7$. The largest number of practical solutions is found for $(S,T) = (3,1)$, which is four times higher than the number without structure ($(S,T) = (0,0)$). Here too, it is evident that including the generalized CS structure leads to more practical PDs and also yields a greater variety of Jacobian ranks, indicating more non-inv-equivalent PDs.

4.3. Problem 3: $(I, K, J) = (2, 3, 3)$, $R := \tilde{R} = 15$

In Table 4, the results for T 233 at rank 15 are presented. The maximum value of R CS obtained is 9. The same conclusions as in the previous sections apply; however, generally fewer practical solutions are found because the problem becomes increasingly difficult as I, K, and J grow.
The following decomposition has the same number of nonzeros and stability parameters as the algorithm proposed in [35] but with CS parameters ( S , T ) = ( 1 , 2 ) :
[The explicit decomposition is displayed as an image in the published version.]
In the CS part, Strassen’s decomposition, which also satisfies ( S , T ) = ( 1 , 2 ) , can be recognized. To our knowledge, this is one of the optimal known algorithms for ( I , J , K ) = ( 2 , 3 , 3 ) , or any permutation of the dimensions.

4.4. Problem 4: $(I, K, J) = (2, 2, 5)$, $R := \tilde{R} = 18$

Table 5 presents the results for T 225 at rank 18. The maximal CS rank is 9, as was the case in Table 4. Notably, for ( S , T ) = ( 3 , 1 ) , ten times more practical solutions are found compared with the unstructured case.

4.5. Problem 5: $(I, K, J) = (2, 3, 4)$, $R := \tilde{R} = 20$

Lastly, Table 6 presents the results for T 234 at rank 20. Very few practical solutions are found in this case. Electronic versions of the decompositions are available. When using 1000 starting points for S = 4 and T = 1 , the most sparse practical PD that we found has 144 nonzeros and stability factors Q = 16 and E = 57 , which is slightly higher than the decomposition proposed in [8], which has 130 nonzeros and Q = 14 and E = 35 .

4.6. Remark: Square Matrix Multiplication

Note that the CS rank can also be chosen smaller than R when I = J = K . For example, 11 symmetric rank-1 tensors can be enforced within a PD 23 T 333 . The remaining 12 rank-1 tensors do not need to follow any specific structure in this case. To our knowledge, no decomposition with this many symmetric terms has been reported in the literature for T 333 . An example decomposition with this structure is available in electronic form.

5. Conclusions

The proposed generalized cyclic symmetric structure can be used in polyadic decompositions of the fast matrix multiplication (FMM) tensor to reduce the number of parameters. This reduction decreases the size of the search space and shortens the computation time per iteration of an optimization algorithm to find new decompositions. We have shown that the CS structure can be exploited to reduce the computational cost of multiplying a matrix by itself. For several problem parameters—i.e., different matrix multiplication sizes—numerical experiments using a state-of-the-art optimization algorithm demonstrate that incorporating this structure enables significantly more practical solutions to be found for a fixed number of starting points. Although we did not directly improve upon the existing FMM algorithms, the ability to find new decompositions more easily can support further research toward faster FMM algorithms, particularly when combined with sufficient computational resources.

Author Contributions

Conceptualization, L.D.L., C.V. and M.V.B.; methodology, C.V.; software, C.V. and N.V.; validation, C.V.; formal analysis, C.V.; investigation, C.V.; resources, L.D.L. and M.V.B.; data curation, C.V.; writing—original draft preparation, C.V.; writing—review and editing, C.V., N.V. and M.V.B.; visualization, C.V.; supervision, M.V.B.; project administration, L.D.L. and M.V.B.; funding acquisition, L.D.L. and M.V.B. All authors have read and agreed to the published version of the manuscript.

Funding

The following funders supported this work: (1) the Flemish Government, under the AI Research Program. Charlotte Vermeylen and Lieven De Lathauwer are affiliated with Leuven.AI - KU Leuven Institute for AI, B-3000, Leuven, Belgium. (2) KU Leuven Internal Funds: iBOF/23/064 and C14/22/096. Nico Vervliet holds a research grant from the Research Foundation Flanders (FWO) (12ZM223N). Marc Van Barel was supported by the Research Council KU Leuven, C1-project C14/17/073, and by the Fund for Scientific Research–Flanders (Belgium), EOS Project no. 30468160, and project G0B0123N.

Data Availability Statement

The results of the experiments, including the new decompositions, as well as the code of the optimization algorithm, are publicly available at https://github.com/CharlotteVermeylen/ALM_FMM_gen_CS, accessed on 18 September 2025.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
AL      Augmented Lagrangian
BRPD    Border Rank Polyadic Decomposition
(C)PD   (Canonical) Polyadic Decomposition
CS      Cyclic Symmetry
FMM     Fast Matrix Multiplication
LM      Levenberg–Marquardt
MMT     Matrix Multiplication Tensor
NLS     Nonlinear Least Squares

References

  1. Strassen, V. Gaussian elimination is not optimal. Numer. Math. 1969, 13, 354–356. [Google Scholar] [CrossRef]
  2. Fawzi, A.; Balog, M.; Huang, A.; Hubert, T.; Romera-Paredes, B.; Barekatain, M.; Novikov, A.; R. Ruiz, F.J.; Schrittwieser, J.; Swirszcz, G.; et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 2022, 610, 47–53. [Google Scholar] [CrossRef]
  3. Vermeylen, C.; Van Barel, M. Stability improvements for fast matrix multiplication. Numer. Algorithms 2024, 100, 645–683. [Google Scholar] [CrossRef]
  4. Smirnov, A.V. The bilinear complexity and practical algorithms for matrix multiplication. Comput. Math. Math. Phys. 2013, 53, 1781–1795. [Google Scholar] [CrossRef]
  5. Tichavský, P.; Phan, A.H.; Cichocki, A. Numerical CP decomposition of some difficult tensors. J. Comput. Appl. Math. 2017, 317, 362–370. [Google Scholar] [CrossRef]
  6. Heule, M.J.; Kauers, M.; Seidl, M. New ways to multiply 3 × 3-matrices. J. Symb. Comput. 2021, 104, 899–916. [Google Scholar] [CrossRef]
  7. Ballard, G.; Ikenmeyer, C.; Landsberg, J.M.; Ryder, N. The geometry of rank decompositions of matrix multiplication II: 3 × 3 matrices. J. Pure Appl. Algebra 2019, 223, 3205–3224. [Google Scholar] [CrossRef]
  8. Benson, A.R.; Ballard, G. A framework for practical parallel fast matrix multiplication. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, New York, NY, USA, 7–11 February 2015; pp. 42–53. [Google Scholar]
  9. Bürgisser, P.; Clausen, M.; Shokrollahi, M.A. Algebraic Complexity Theory; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 315. [Google Scholar]
  10. Landsberg, J.M. Tensors: Geometry and Applications; American Mathematical Society: Providence, RI, USA, 2011; Volume 128. [Google Scholar]
  11. Brent, R.P. Algorithms for Matrix Multiplication; Technical Report; Stanford University: Stanford, CA, USA, 1970. [Google Scholar]
  12. Landsberg, J.M. Geometry and Complexity Theory; Cambridge Studies in Advanced Mathematics; Cambridge University Press: Cambridge, UK, 2017. [Google Scholar]
  13. de Groote, H.F. On varieties of optimal algorithms for the computation of bilinear mappings II. Optimal algorithms for 2 × 2-matrix multiplication. Theor. Comput. Sci. 1978, 7, 127–148. [Google Scholar] [CrossRef]
  14. Bläser, M. On the complexity of the multiplication of matrices of small formats. J. Complex. 2003, 19, 43–60. [Google Scholar] [CrossRef]
  15. Laderman, J.D. A non commutative algorithm for multiplying 3 × 3 matrices using 23 multiplications. Bull. Am. Math. Soc. 1976, 82, 126–128. [Google Scholar] [CrossRef]
  16. Novikov, A.; Vũ, N.; Eisenberger, M.; Dupont, E.; Huang, P.S.; Wagner, A.Z.; Shirobokov, S.; Kozlovskii, B.; Ruiz, F.J.R.; Mehrabian, A.; et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv 2025, arXiv:cs.AI/2506.13131. [Google Scholar] [CrossRef]
  17. Dumas, J.G.; Pernet, C.; Sedoglavic, A. A non-commutative algorithm for multiplying 4 × 4 matrices using 48 non-complex multiplications. arXiv 2025, arXiv:cs.SC/2506.13242. [Google Scholar]
  18. Kaporin, I. Finding complex-valued solutions of Brent equations using nonlinear least squares. Comput. Math. Math. Phys. 2024, 64, 1881–1891. [Google Scholar] [CrossRef]
  19. Schwartz, O.; Zwecher, E. Towards faster feasible matrix multiplication by trilinear aggregation. arXiv 2025, arXiv:cs.DS/2508.017. [Google Scholar] [CrossRef]
  20. Chiantini, L.; Ikenmeyer, C.; Landsberg, J.; Ottaviani, G. The geometry of rank decompositions of matrix multiplication I: 2 × 2 matrices. Exp. Math. 2019, 28, 322–327. [Google Scholar] [CrossRef]
  21. de Groote, H.F. On varieties of optimal algorithms for the computation of bilinear mappings I. The isotropy group of a bilinear mapping. Theor. Comput. Sci. 1978, 7, 1–24. [Google Scholar] [CrossRef]
  22. Krijnen, W.P.; Dijkstra, T.K.; Stegeman, A. On the non-existence of optimal solutions and the occurrence of “degeneracy” in the CANDECOMP/PARAFAC model. Psychometrika 2008, 73, 431–439. [Google Scholar] [CrossRef]
  23. Paatero, P. Construction and analysis of degenerate PARAFAC models. J. Chemom. 2000, 14, 285–299. [Google Scholar] [CrossRef]
  24. Bini, D.; Capovani, M.; Romani, F.; Lotti, G. O(n2.7799) complexity for n × n approximate matrix multiplication. Inf. Process. Lett. 1979, 8, 234–235. [Google Scholar] [CrossRef]
  25. Landsberg, J.M.; Ottaviani, G. New lower bounds for the border rank of matrix multiplication. Theory Comput. 2015, 11, 285–298. [Google Scholar] [CrossRef]
  26. Landsberg, J.; Michatek, M. On the geometry of border rank algorithms for matrix multiplication and other tensors with symmetry. SIAM Appl. Algebr. Geom. 2017, 1, 2–19. [Google Scholar] [CrossRef]
  27. Schönhage, A. Partial and total matrix multiplication. SIAM J. Comput. 1981, 10, 434–455. [Google Scholar] [CrossRef]
  28. Alekseev, V.B.; Smirnov, A.V. On the exact and approximate bilinear complexities of multiplication of 4 × 2 and 2 × 2 matrices. Proc. Steklov Inst. Math. 2013, 282, 123–139. [Google Scholar] [CrossRef]
  29. Gong, X.; Mohlenkamp, M.J.; Young, T.R. The optimization landscape for fitting a rank-2 tensor with a rank-1 tensor. SIAM J. Appl. Dyn. Syst. 2018, 17, 1432–1477. [Google Scholar] [CrossRef]
  30. Mohlenkamp, M.J. The Dynamics of Swamps in the Canonical Tensor Approximation Problem. SIAM J. Appl. Dyn. Syst. 2019, 18, 1293–1333. [Google Scholar] [CrossRef]
  31. Vermeylen, C.; Vervliet, N.; De Lathauwer, L. Reducing swamp behavior for the canonical polyadic decomposition problem by rank-1 freezing. Numer. Algorithms 2025, 100, 831–859. [Google Scholar] [CrossRef]
  32. Johnson, R.W.; McLoughlin, A.M. Noncommutative bilinear algorithms for 3 × 3 matrix multiplication. SIAM J. Comput. 1986, 15, 595–603. [Google Scholar] [CrossRef]
  33. Uschmajew, A. Local convergence of the alternating least squares algorithm for canonical tensor approximation. SIAM J. Matrix Anal. Appl. 2012, 33, 639–652. [Google Scholar] [CrossRef]
  34. Nocedal, J.; Wright, S.J. Numerical Optimization; Springer: New York, NY, USA, 2006. [Google Scholar]
  35. Ballard, G. Discovering fast matrix multiplication algorithms via tensor decomposition. In Proceedings of the SIAM Conference on Computational Science and Engineering, Atlanta, GA, USA, 27 February–3 March 2017; Volume 30. [Google Scholar]
  36. Berger, G.O.; Absil, P.A.; De Lathauwer, L.; Jungers, R.M.; Van Barel, M. Equivalent polyadic decompositions of matrix multiplication tensors. J. Comput. Appl. Math. 2022, 406, 113941. [Google Scholar] [CrossRef]
  37. Huang, J.; Rice, L.; Matthews, D.A.; van de Geijn, R.A. Generating families of practical fast matrix multiplication algorithms. In Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, Orlando, FL, USA, 29 May–2 June 2017; pp. 656–667. [Google Scholar]
  38. Vermeylen, C. Tensor Methods: Fast Matrix Multiplication and Rank Adaptation. Ph.D. Thesis, KU Leuven, Leuven, Belgium, 29 March 2024. [Google Scholar]
  39. Ballard, G.; Benson, A.R.; Druinsky, A.; Lipshitz, B.; Schwartz, O. Improving the numerical stability of fast matrix multiplication. SIAM J. Matrix Anal. Appl. 2016, 37, 1382–1418. [Google Scholar] [CrossRef]
  40. Sorber, L.; Van Barel, M.; De Lathauwer, L. Optimization-based algorithms for tensor decompositions: Canonical polyadic decomposition, decomposition in rank-(Lr, Lr, 1) terms, and a new generalization. SIAM J. Optim. 2013, 23, 695–720. [Google Scholar] [CrossRef]
  41. Alekseyev, V.B. On the complexity of some algorithms of matrix multiplication. J. Algorithms 1985, 6, 71–85. [Google Scholar] [CrossRef]
  42. Vervliet, N.; Debals, O.; Sorber, L.; Van Barel, M.; De Lathauwer, L. Tensorlab 3.0. 2016. Available online: https://www.tensorlab.net (accessed on 18 September 2025).
Figure 1. Illustration of the generalized CS factor matrices of a $\mathrm{PD}_{20}(\mathcal{T}_{234})$ with $(S,T) := (2,2)$ and $I = 2$, $K = 3$, and $J = 4$.
Figure 2. Relative improvement compared with standard matrix multiplication of the operations required to multiply a matrix with itself, with (blue) and without (red) CS in $\mathrm{PD}_{\mathrm{Strassen}}$.
Table 1. The best-known ranks for different MMT sizes. The problems colored yellow are tested.

(I, K, J)   R̃
(2, 2, 2)    7
(2, 2, 3)   11
(2, 2, 4)   14
(2, 3, 3)   15
(2, 2, 5)   18
(2, 3, 4)   20
(3, 3, 3)   23
    ⋮        ⋮
Table 2. The number of (unique) practical PDs of rank 11 for $\mathcal{T}_{223}$ increases significantly for moderate values of $R_{\mathrm{CS}}$. Improvements relative to the unstructured case (bottom row) are highlighted in bold.

                                      Rank J_CS,gen(x)
R_CS   S   T   Size(J_CS,gen)    min    max    #      # x pract.
8      2   2   144 × 112          92     97    1       1
7      1   2   144 × 120          85     90    4      10
       4   1                      87     91    4      18
6      0   2   144 × 128          91    105    7      13
       3   1                      93    100    8      36
5      2   1   144 × 136          99    117    8      25
4      1   1   144 × 144         105    123   13      30
       4   0                     107    118   10      18
3      0   1   144 × 152         111    128   11      18
       3   0                     113    121    8      29
2      2   0   144 × 160         119    127    8      21
1      1   0   144 × 168         123    131    7      14
0      0   0   144 × 176         125    132    6      15
Table 3. The number of (unique) practical PDs of rank 14 for $\mathcal{T}_{224}$ increases with $R_{\mathrm{CS}}$ for almost all combinations of S and T. In particular, for $(S,T) = (3,1)$, the number of practical decompositions is four times higher. Improvements relative to the unstructured case are indicated in bold.

                                      Rank J_CS,gen(x)
R_CS   S   T   Size(J_CS,gen)    min    max    #      # x pract.
7      1   2   256 × 224         178    182    3       9
       4   1                     180    186    4      14
6      0   2   256 × 232         184    191    5       9
       3   1                     184    190    4      21
5      2   1   256 × 240         188    201    7      16
4      1   1   256 × 248         196    204    7      12
       4   0                     196    210    9      12
3      0   1   256 × 256         202    209    4       4
       3   0                     202    212    6      11
2      2   0   256 × 264         214    220    2       4
1      1   0   256 × 272         217    218    2       4
0      0   0   256 × 280         217    220    3       5
Table 4. The number of (unique) practical PDs of rank 15 of $\mathcal{T}_{233}$ increases with $R_{\mathrm{CS}}$ for all values of S and T. Without the CS structure, no practical decompositions are found. Improvements w.r.t. the unstructured case (bottom row) are indicated in bold.

                                      Rank J_CS,gen(x)
R_CS   S   T   Size(J_CS,gen)    min    max    #      # x pract.
9      6   1   324 × 243         218    218    1       1
8      5   1   324 × 251         226    226    1       2
7      1   2   324 × 259         238    228    1       2
       4   1                     230    234    3       5
       0   2                     324    234    1       1
6      3   1   324 × 267         236    238    2      10
       6   0                     238    238    1       1
5      2   1   324 × 275         242    242    1       7
4      1   1   324 × 283         248    248    1       2
       4   0                     250    254    4       6
3      0   1   324 × 291         254    254    1       2
       3   0                     254    259    4       6
2      2   0   324 × 299         258    261    3       4
0      0   0   324 × 315                                0
Table 5. The number of (unique) practical $\mathrm{PD}_{18}(\mathcal{T}_{225})$s increases with $R_{\mathrm{CS}}$ for almost all combinations of S and T. For $(S,T) = (3,1)$, ten times more practical solutions are found compared with $(S,T) = (0,0)$. The improvements are indicated in bold.

                                      Rank J_CS,gen(x)
R_CS   S   T   Size(J_CS,gen)    min    max    #      # x pract.
9      6   1   400 × 360         288    288    1       1
7      1   2   400 × 376         292    299    4       4
       4   1                     293    301    3       3
6      0   2   400 × 384         297    306    3       3
       3   1                     299    310    8      10
5      2   1   400 × 392         305    312    8       8
4      1   1   400 × 400         313    325    4       4
       4   0                     312    323    5       5
3      0   1   400 × 408         327    327    1       1
       3   0                     322    327    4       5
2      2   0   400 × 416         330    338    2       2
1      1   0   400 × 424         332    332    1       2
0      0   0   400 × 432         336    336    1       1
Table 6. Practical $\mathrm{PD}_{20}(\mathcal{T}_{234})$s are found for four combinations of S and T. Without the CS structure ($(S,T) = (0,0)$), no practical decompositions are found.

R_CS   S   T   Size(J_CS,gen)   Rank J_CS,gen(x)   # x pract.
8      5   1   576 × 456        408                1
7      4   1   576 × 464        410                1
5      2   1   576 × 480        425                1
4      4   0   576 × 488        427                1
0      0   0   576 × 520        –                  0
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
