1. Introduction
Matrix multiplication is a fundamental operation in numerical linear algebra, with applications ranging from machine learning to signal processing. Discovering faster algorithms for this task can therefore have a wide-reaching impact: for very large systems, saving even a few percent of computation time can be worth millions of dollars. The study of fast matrix multiplication (FMM) began with Strassen's breakthrough in 1969 [1], and despite decades of progress, many questions remain open. Recently, new algorithms have been discovered using deep reinforcement learning [2], while an augmented Lagrangian (AL) method has been proposed to obtain more stable and efficient decompositions [3]. Existing optimization-based approaches for discovering FMMs [2,4,5,6,7,8] often require significant computational resources and manual tuning, making them difficult to replicate. In this work, we propose a framework that requires neither large computational resources nor manual tuning, enabling the discovery of FMM algorithms in a more efficient and reproducible manner.
The fast matrix multiplication (FMM) problem aims for faster ways to multiply large matrices by rewriting the bilinear equations of matrix multiplication as a tensor equation:
$$C^\top = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} \langle A, E_{ij} \rangle_F \, \langle B, E_{jk} \rangle_F \, E_{ki}, \qquad (1)$$
where $C = AB$, with $A \in \mathbb{R}^{I \times J}$ and $B \in \mathbb{R}^{J \times K}$, and $\{E_{ij}\}$, for $i = 1, \ldots, I$ and $j = 1, \ldots, J$, is a basis of matrices in $\mathbb{R}^{I \times J}$ such that $(E_{ij})_{mn} = 1$ if $(m, n) = (i, j)$, and zero otherwise (the bases $\{E_{jk}\}$ and $\{E_{ki}\}$ are defined analogously). '$\otimes$' denotes the tensor or outer product, and '$\langle \cdot, \cdot \rangle_F$' indicates the Frobenius inner product. A transpose is added in (1) such that the tensor that is defined has additional interesting properties, such as cyclic symmetry (CS), which will be discussed further in Section 2.3.2.
The rank of matrix multiplication is the minimal integer $R$ such that
$$C^\top = \sum_{r=1}^{R} \langle A, U_r \rangle_F \, \langle B, V_r \rangle_F \, W_r \qquad (2)$$
for some matrices $U_r \in \mathbb{R}^{I \times J}$, $V_r \in \mathbb{R}^{J \times K}$, and $W_r \in \mathbb{R}^{K \times I}$. It can be shown that minimizing $R$ minimizes the computational complexity of matrix multiplication [9] (Proposition 15.1) and [10]. More specifically, the number of arithmetic operations needed to multiply two matrices in $\mathbb{R}^{n \times n}$ is $\mathcal{O}(n^{\omega})$, where $\omega = \log_n R$. Since $R \leq n^3$, it holds that $\omega \leq 3$. Additionally, the exponent is bounded from below by two, since at least $n^2$ operations are needed to compute as many elements.
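For example, Strassen's base algorithm multiplies $2 \times 2$ matrices using $R = 7$ active multiplications instead of $2^3 = 8$, which gives
$$\omega = \log_2 7 \approx 2.807 < 3.$$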
Minimizing $R$ in (2) corresponds to finding a canonical polyadic decomposition (CPD) of the matrix multiplication tensor (MMT) defined implicitly in (1) [1]:
$$\mathcal{T}_{I,J,K} = \sum_{r=1}^{R} \operatorname{vec}(U_r) \otimes \operatorname{vec}(V_r) \otimes \operatorname{vec}(W_r), \qquad (3)$$
where $\mathcal{T}_{I,J,K}$ has dimensions $IJ \times JK \times KI$ and '$\operatorname{vec}(\cdot)$' denotes vectorization along the columns. An MMT is a sparse tensor consisting of $IJK$ ones and zeros elsewhere:
$$\big(\mathcal{T}_{I,J,K}\big)_{i+(j-1)I,\; j+(k-1)J,\; k+(i-1)K} = 1, \qquad (4)$$
for all $i = 1, \ldots, I$, $j = 1, \ldots, J$, and $k = 1, \ldots, K$. An MMT thus only depends on the size of the matrices that are multiplied. A polyadic decomposition (PD) of $\mathcal{T}_{I,J,K}$ of length $R$ decomposes $\mathcal{T}_{I,J,K}$ into $R$ rank-1 tensors:
$$\mathcal{T}_{I,J,K} = \sum_{r=1}^{R} u_r \otimes v_r \otimes w_r, \qquad (5)$$
where $u_r$, $v_r$, and $w_r$ are vectors in $\mathbb{R}^{IJ}$, $\mathbb{R}^{JK}$, and $\mathbb{R}^{KI}$, respectively. These vectors can be collected in three so-called factor matrices: $U = [u_1 \cdots u_R] \in \mathbb{R}^{IJ \times R}$, $V = [v_1 \cdots v_R] \in \mathbb{R}^{JK \times R}$, and $W = [w_1 \cdots w_R] \in \mathbb{R}^{KI \times R}$. The rank $R_{I,J,K}$ or $\operatorname{rank}(\mathcal{T}_{I,J,K})$ denotes the minimal $R$ for which (5) holds; the decomposition is then called canonical (CPD). A PD with $R$ terms is denoted by $\mathrm{PD}_R$, and $\mathrm{CPD}(\mathcal{T})$ denotes a CPD of a certain tensor $\mathcal{T}$. Equations (4) and (5) are also known as the Brent equations [11].
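To make the index convention in (4) concrete, the following Matlab sketch builds $\mathcal{T}_{I,J,K}$ entrywise; the function name mmt is ours, and the indices match the column-major vectorization used above.

    function T = mmt(I, J, K)
    % Build the matrix multiplication tensor T_{I,J,K} of dimensions
    % IJ x JK x KI, with ones exactly at the positions given in (4).
        T = zeros(I*J, J*K, K*I);
        for i = 1:I
            for j = 1:J
                for k = 1:K
                    T(i+(j-1)*I, j+(k-1)*J, k+(i-1)*K) = 1;
                end
            end
        end
    end

As a sanity check, contracting the first two modes of mmt(I,J,K) with $\operatorname{vec}(A)$ and $\operatorname{vec}(B)$ reproduces $\operatorname{vec}(C^\top)$ for random $A$ and $B$, in agreement with (1).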
Decompositions of MMTs yield base algorithms for FMM of the form
$$C^\top = \sum_{r=1}^{R} \big\langle \operatorname{vec}(A), u_r \big\rangle \, \big\langle \operatorname{vec}(B), v_r \big\rangle \, \operatorname{unvec}(w_r), \qquad (6)$$
which is the same equation as (2) with $U_r = \operatorname{unvec}(u_r)$, $V_r = \operatorname{unvec}(v_r)$, and $W_r = \operatorname{unvec}(w_r)$, for $r = 1, \ldots, R$. The total number of operations of a base algorithm is higher than for the standard algorithm. However, the number of active multiplications, i.e., the number of multiplications between (linear combinations of) elements of $A$ and $B$, is reduced from $IJK$ to $R$. Algorithm (6) becomes faster when applied sufficiently many times recursively to multiply large matrices, because the active multiplications determine the asymptotic complexity [12]. In this case, the large matrices, e.g., of size $I^d \times J^d$ and $J^d \times K^d$ for some $d \in \mathbb{N}$, are divided into $I \times J$ and $J \times K$ blocks of equal size, e.g., $I^{d-1} \times J^{d-1}$ and $J^{d-1} \times K^{d-1}$, respectively. All operations in (6) are then performed directly on the blocks. For example, the vectorization operator stacks the blocks in a block vector of length $IJ$ and $JK$, respectively, where each element is again a matrix. The inner products are calculated on these block vectors, where each scalar multiplication becomes a scalar element-wise multiplication on all elements of each block. Each multiplication between the inner products becomes another matrix multiplication of smaller dimension and can be further divided into smaller blocks if possible. A more detailed analysis of the recursive application of FMM algorithms is given in Section 2.2.
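As an illustration of this recursive scheme, the following minimal Matlab sketch applies Strassen's rank-7 base algorithm, i.e., the $R = 7$ instance of (6) for $I = J = K = 2$; it assumes square matrices whose dimension is a power of two, and the crossover size nmin (ours) controls when the recursion falls back to the standard algorithm.

    function C = strassen(A, B, nmin)
    % Recursive FMM with Strassen's base algorithm; the seven
    % recursive calls below are the active multiplications.
        n = size(A, 1);
        if n <= nmin, C = A*B; return; end
        h = n/2; i1 = 1:h; i2 = h+1:n;
        A11 = A(i1,i1); A12 = A(i1,i2); A21 = A(i2,i1); A22 = A(i2,i2);
        B11 = B(i1,i1); B12 = B(i1,i2); B21 = B(i2,i1); B22 = B(i2,i2);
        M1 = strassen(A11+A22, B11+B22, nmin);
        M2 = strassen(A21+A22, B11,     nmin);
        M3 = strassen(A11,     B12-B22, nmin);
        M4 = strassen(A22,     B21-B11, nmin);
        M5 = strassen(A11+A12, B22,     nmin);
        M6 = strassen(A21-A11, B11+B12, nmin);
        M7 = strassen(A12-A22, B21+B22, nmin);
        C  = [M1+M4-M5+M7, M3+M5; M2+M4, M1-M2+M3+M6];
    end

The block additions and subtractions increase the constant in the complexity, which is why a crossover size is used in practice.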
The rank of $\mathcal{T}_{I,J,K}$, denoted by $R_{I,J,K}$, is not known for most problem parameters $I$, $J$, and $K$, which is one of the main difficulties of the FMM problem. One of the exceptions is $\mathcal{T}_{2,2,2}$, of which the rank is known to be seven [13] and the problem is fully understood. We call the rank-7 decomposition discovered by Strassen [1] Strassen's decomposition. Only lower and upper bounds on the rank exist for most other $I$, $J$, and $K$. For example, it is known that $R_{3,3,3}$ lies between 19 [14] and 23 [4,6,7,15]. The upper bound, i.e., the lowest rank of a known PD of $\mathcal{T}_{I,J,K}$, is denoted by $\hat{R}_{I,J,K}$. The most recent overview table of $\hat{R}_{I,J,K}$ is given in [16] (Table 3); however, since then, there has been some further improvement for certain dimensions [17]. Note that permuting $(I, J, K)$ does not change the rank. Recently, there has also been an interest in finding complex-valued decompositions of MMTs [16,18]. However, as the number of standard multiplications to multiply two complex numbers is at least three, the complexity is higher than for real decompositions of the same length. A recent overview of different techniques to obtain base algorithms for different dimensions is given in [19] (Section 1.1).
To find a CPD of $\mathcal{T}_{I,J,K}$, the following nonlinear least squares (NLS) problem in the factor matrices $(U, V, W)$ is formulated [4,5,7,11,20]:
$$\min_{U, V, W} \; f(U, V, W) := \frac{1}{2} \bigg\| \operatorname{vec}\big(\mathcal{T}_{I,J,K}\big) - \sum_{r=1}^{R} w_r \otimes v_r \otimes u_r \bigg\|_2^2, \qquad (7)$$
where '$\otimes$' here denotes the Kronecker product and $\|\cdot\|_2$ the $\ell_2$-norm. Note that $(U, V, W)$ is a PD of $\mathcal{T}_{I,J,K}$ if and only if $f(U, V, W) = 0$. However, state-of-the-art optimization algorithms are not guaranteed to converge to a global optimum of (7). Furthermore, the convergence can be slow, even for second-order methods. One of the reasons is that (7) is non-convex, and PDs of MMTs are not unique and, furthermore, have additional invariances compared with generic PDs [21]. This means that the minima of (7) are non-isolated.
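For reference, the objective in (7) can be evaluated densely as in the following Matlab sketch (the function name brent_cost is ours); practical implementations instead exploit the sparsity of $\mathcal{T}_{I,J,K}$.

    function val = brent_cost(U, V, W, t)
    % Cost function of (7): U, V, W are the IJ x R, JK x R, and KI x R
    % factor matrices and t = vec(T_{I,J,K}); under column-major vec,
    % vec(u o v o w) = kron(w, kron(v, u)).
        res = t;
        for r = 1:size(U, 2)
            res = res - kron(W(:,r), kron(V(:,r), U(:,r)));
        end
        val = 0.5*norm(res)^2;
    end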
In practice, the rank $R_{I,J,K}$ is approximated experimentally by the lowest length $R$ for which a solution $(U, V, W)$ with $f(U, V, W) = 0$ can be obtained using numerical optimization. However, as no globally convergent algorithm for (7) exists, a failure to find such a solution does not prove that a solution of lower rank does not exist, as long as there is a gap with the theoretical lower bound.
Another difficulty is the existence of border rank PDs (BRPDs) [10,12]. BRPDs are ill-conditioned approximate decompositions of $\mathcal{T}_{I,J,K}$ with elements that grow to infinity during convergence. The border rank of a particular tensor can be smaller than its canonical rank. This is closely related to the concept of degeneracy [22,23]. The BRPDs of MMTs are investigated in, e.g., [4,24,25,26,27,28]. To eliminate the convergence to BRPDs, constraints are usually added to (7) [3,4,5,8], or the elements are restricted to a discrete set of values [2,6]. The phenomenon of degeneracy also gives rise to regions of very slow convergence in the optimization landscape, informally known as swamps [29,30,31].
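A standard example (for a generic tensor, not an MMT) illustrates the phenomenon: for linearly independent vectors $a$ and $b$, the tensor on the left-hand side below has rank 3 but border rank 2, since it is the limit of rank-2 tensors whose elements blow up as $\epsilon \to 0$:
$$a \otimes a \otimes b + a \otimes b \otimes a + b \otimes a \otimes a = \lim_{\epsilon \to 0} \frac{1}{\epsilon} \Big[ (a + \epsilon b) \otimes (a + \epsilon b) \otimes (a + \epsilon b) - a \otimes a \otimes a \Big].$$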
Additionally, the number of variables or unknowns grows rapidly with $I$, $J$, and $K$. Therefore, this paper proposes a new structure that can be enforced in PDs of MMTs to reduce the number of variables, which directly reduces the computational complexity of optimization algorithms that solve (7) and shrinks the large search space.
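Concretely, (7) has $(IJ + JK + KI)R$ unknowns: already for $\mathcal{T}_{3,3,3}$ at the best-known rank $R = 23$, this amounts to $(9 + 9 + 9) \cdot 23 = 621$ variables, and the count grows further with the dimensions and the rank.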
To solve (7), the alternating least squares (ALS) method is frequently used [4,7,20,32]. The local convergence of the ALS method for (7) is known under the assumption that the Hessian is positive definite at the solutions modulo the scaling indeterminacies [33]. This is, however, not satisfied for PDs of MMTs, as they have additional invariances [21]. In [4,5], a constrained optimization problem and a corresponding method are proposed to improve the convergence. More specifically, Lagrange multipliers and a quadratic penalty are used for the constraints, and a Levenberg–Marquardt (LM) and an ALS method, respectively, are used to solve the constrained problems. Still, these methods were unable to find certain decompositions that are known to exist. Another constrained optimization problem was proposed recently [3]:
$$\min_{U, V, W} \; f(U, V, W) \quad \text{subject to} \quad h(U, V, W) = 0, \quad l \leq (U, V, W) \leq u, \qquad (8)$$
together with an augmented Lagrangian (AL) method to solve it. The bound constraint ensures that the method does not converge to BRPDs. Different equality constraints $h$ were proposed to obtain PDs with simple coefficients. Compared with adding quadratic penalty terms to (7), the AL method has the advantage that the constraints are satisfied accurately even outside the neighborhood of an optimum. New PDs, including different new CS PDs, were obtained using this method, and the stability of existing algorithms was improved [3]. As inner optimization algorithm, the LM method is used, which is a (second-order) damped Gauss–Newton optimization algorithm that optimizes over all factor matrices simultaneously [34].
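For reference, one outer iteration of a generic AL scheme for the equality constraints in (8) has the textbook form
$$\mathcal{L}_{\rho}(x, \lambda) = f(x) + \lambda^{\top} h(x) + \frac{\rho}{2} \| h(x) \|_2^2, \qquad x^{+} \approx \operatorname*{arg\,min}_{l \leq x \leq u} \mathcal{L}_{\rho}(x, \lambda), \qquad \lambda^{+} = \lambda + \rho\, h(x^{+}),$$
where $x$ collects the entries of $(U, V, W)$ and the inner minimization is performed by the LM method. This is a sketch of the general scheme, not the exact update schedule used in [3].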
We use the constrained problem formulation (8) together with the AL method proposed in [3] to find new PDs of MMTs with the generalized CS structure proposed in this paper. The structure is imposed to find new decompositions more easily by reducing the search space and the computational complexity per iteration; these decompositions can then, in turn, be used to reduce the complexity of matrix multiplication. Although it is not known whether a CPD of $\mathcal{T}_{I,J,K}$ with the proposed structure exists for all $I$, $J$, and $K$, the structure is very flexible, as it is determined by two parameters that can be varied and, for example, be made smaller to be less restrictive. So far, we have not encountered any CPDs of MMTs of small dimensions that do not admit this structure for any combination of these two parameters.
1.1. Contribution
A generalization of the CS structure investigated in [7,20,35] to non-square matrix multiplication is proposed for decompositions of MMTs. The advantage of investigating this structure is threefold.
Firstly, this structure reduces the number of variables, thereby shrinking the large search space and reducing the computation time per iteration when used in an optimization algorithm. We have implemented the structure efficiently in a state-of-the-art all-at-once optimization algorithm for the FMM problem and use a constrained problem formulation to find new practical solutions automatically, without manual tuning. A large number of numerical experiments illustrates that including this structure helps find more practical solutions for a fixed number of starting points.
Secondly, we prove that by exploiting this structure, the cost of multiplying a matrix by itself can be reduced. We give an example with Strassen's decomposition, which shows that the CS structure enables a speed-up over standard matrix multiplication already from a smaller matrix size.
Lastly, 'unique' decompositions of MMTs, based on the well-known invariance transformations [21], are of theoretical interest. Different classes of decompositions have been investigated before [2,3,7,20,36], but not for rectangular fast matrix multiplication. Determining whether two decompositions are equivalent requires solving a non-trivial optimization problem. Using the result from [3], where it was shown that the rank of the Jacobian matrix is also invariant under these transformations, we use the different ranks to distinguish 'unique' practical decompositions.
1.2. Organization
The paper is organized as follows. Section 2 provides the preliminaries related to the FMM problem. In Section 3, the new CS structure is proposed, and its advantages are discussed. Section 4 presents a large number of numerical experiments for different MMTs, illustrating that incorporating this structure helps to find more (unique) practical decompositions per fixed number of starting points.
4. Numerical Experiments
This section presents the results obtained using the AL method [3] to find PDs of MMTs with the generalized CS structure, for various matrix sizes $(I, K, J)$ and different CS parameters $S$ and $T$. The rank is always chosen as the lowest rank for which a solution is known to exist in the literature. For the first two experiments, this rank is also known to be optimal [28,41].
The AL method requires upper and lower bounds on the elements in the decomposition: $l \leq (U, V, W) \leq u$. We set these bounds and generate 50 random starting points of small magnitude using a built-in Matlab random number generator, for each set of problem parameters. The number of inner iterations of the LM method was set to 50, and the number of outer iterations to 15. Tolerances were imposed on the gradient norm and on the constraint violation.
When a numerical solution with a cost function value below this tolerance is found, the constraint from (9), scaled by a factor of 0.1, is added to the optimization problem to convert the numerical solution into a practical one. It is known that not all numerical solutions can be converted into practical ones using inv-transformations [36]. Although we use numerical optimization rather than inv-transformations, we also observed that many numerical solutions do not converge to a discrete solution under this constraint, while others converge very quickly. We verified that many of the non-converging numerical solutions also fail to satisfy the necessary conditions for discretizability from [36].
The rank of the Jacobian matrix of the generalized structure at the discrete solutions is computed. The rank deficiency reflects the dimension of the invariance transformations applicable to the decomposition. Two decompositions with different ranks cannot be transformed into one another using inv-transformations [3]. Thus, the more distinct ranks are found, the more 'unique' solutions are discovered.
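For the unstructured residual in (7), this Jacobian has an explicit Kronecker form, and its rank at a given solution can be computed as in the following Matlab sketch (ours; the Jacobian of the generalized structure is analogous but has fewer columns):

    function rk = jacrank(U, V, W)
    % Rank of the Jacobian of sum_r kron(W(:,r), kron(V(:,r), U(:,r)))
    % with respect to all entries of U, V, and W.
        [n1, R] = size(U); n2 = size(V, 1); n3 = size(W, 1);
        Jac = zeros(n1*n2*n3, (n1+n2+n3)*R);
        for r = 1:R
            Jac(:, (r-1)*n1+(1:n1))           = kron(W(:,r), kron(V(:,r), eye(n1)));
            Jac(:, R*n1+(r-1)*n2+(1:n2))      = kron(W(:,r), kron(eye(n2), U(:,r)));
            Jac(:, R*(n1+n2)+(r-1)*n3+(1:n3)) = kron(eye(n3), kron(V(:,r), U(:,r)));
        end
        rk = rank(Jac);
    end

Two discrete solutions with different values of this rank cannot be transformed into one another using inv-transformations [3].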
The MMTs are ordered according to their best-known rank in Table 1. The following paragraphs present the results for sizes between $\mathcal{T}_{2,2,2}$ and $\mathcal{T}_{3,3,3}$, both corresponding to square matrix multiplication tensors. Larger sizes are possible, but at a certain point, the computation time per experiment becomes too large to perform all 50 experiments per parameter set in a reasonable amount of time on a standard computer. All experiments were performed on a laptop with an AMD Ryzen 7 PRO 7840U processor and 64 GB of RAM, using Tensorlab 4 beta [42] and Matlab R2023b. The results of the experiments, including the new decompositions, as well as the code, are publicly available (Version 1) (https://github.com/CharlotteVermeylen/ALM_FMM_gen_CS (accessed on 18 September 2025)).
4.1. Problem 1
Table 2 presents the results for the first MMT for different combinations of $S$ and $T$. The rightmost column indicates the number of practical solutions found out of 50 random starting points. The bottom row shows the results without imposing any structure. The size, the minimal and maximal rank, and the number of distinct ranks of the Jacobian matrix are provided in the three middle columns. For a complete list of all distinct ranks, we refer to the detailed results available online. As expected, the second dimension of the Jacobian matrix and its rank decrease as more structure is imposed, due to the reduced parameter space. The null-space dimension of the Gauss–Newton Gramian decreases accordingly.
From Table 2, it is evident that significantly more practical solutions are found for moderate values of $S$ and $T$ compared with the unstructured case (bottom row). For some parameter combinations, the number of practical solutions more than doubles, and the number of unique solutions doubles as well. All improvements are highlighted in bold.
4.2. Problem 2
The results for the second MMT are shown in Table 3, again for different values of the CS parameters $S$ and $T$. The largest number of practical solutions, found for one combination of $S$ and $T$, is four times higher than the number found without structure. Also here, it is evident that including the generalized CS structure leads to more practical PDs and also yields a greater variety of Jacobian ranks, indicating more non-inv-equivalent PDs.
4.3. Problem 3
In Table 4, the results for the third MMT at rank 15 are presented. The maximum CS rank obtained is 9. The same conclusions as in the previous sections apply; however, generally fewer practical solutions are found because the problem becomes increasingly difficult as $I$, $K$, and $J$ grow.
One of the decompositions found (available in electronic form) has the same number of nonzeros and the same stability parameters as the algorithm proposed in [35], but with nontrivial CS parameters. In the CS part, Strassen's decomposition, which also satisfies the CS structure, can be recognized. To our knowledge, this is one of the optimal known algorithms for this size, or any permutation of the dimensions.
4.4. Problem 4
Table 5 presents the results for the fourth MMT at rank 18. The maximal CS rank is 9, as was the case in Table 4. Notably, for one combination of $S$ and $T$, ten times more practical solutions are found compared with the unstructured case.
4.5. Problem 5
Lastly, Table 6 presents the results for the fifth MMT at rank 20. Very few practical solutions are found in this case. Electronic versions of the decompositions are available. When using 1000 starting points for one combination of $S$ and $T$, the sparsest practical PD that we found has 144 nonzeros, and its stability factors are slightly higher than those of the decomposition proposed in [8], which has 130 nonzeros.
4.6. Remark: Square Matrix Multiplication
Note that the CS rank can also be chosen smaller than $R$ in the square case $I = J = K$. For example, 11 symmetric rank-1 tensors can be enforced within a PD of length 23 of $\mathcal{T}_{3,3,3}$. The remaining 12 rank-1 tensors do not need to follow any specific structure in this case. To our knowledge, no decomposition with this many symmetric terms has been reported in the literature for $\mathcal{T}_{3,3,3}$. An example decomposition with this structure is available in electronic form.