A Parallel Algorithm for Dividing Octonions

Abstract: The article presents a parallel hardware-oriented algorithm designed to speed up the division of two octonions. The advantage of the proposed algorithm is that the number of real multiplications is halved as compared to the naive method for implementing this operation. In the synthesis of the discussed algorithm, the matrix representation of this operation was used, which allows us to present the division of octonions by means of a vector-matrix product. Taking into account a specific structure of the matrix multiplicand allows for reducing the number of real multiplications necessary for the execution of the


Introduction
The rapid development of data processing science and technology, together with the need to solve ever more complex practical problems, calls for increasingly sophisticated mathematical methods and formalisms. The algebra of hypercomplex numbers [1] is increasingly used for the synthesis, description, and implementation of highly efficient numerical algorithms in digital signal processing, as in [2][3][4][5], where the analysis of complex signals was extended to multidimensional hypercomplex signals and hypercomplex digital signal processing. Frequency-domain adaptive filters, used to derive the most popular algorithms for a wide variety of signals, from real-valued to hypercomplex-valued ones, are presented in [6]. Because of their internal correlation, hypercomplex numbers are a more natural fit for the recognition of multicolor patterns than color vector spaces [7,8]. Similarly, the use of complex numbers with their internal correlation allows for a more complete analysis of images [9,10] and of various watermarking techniques [11]. Hypercomplex numbers are widely used in robotics, for the rotation and movement of robot manipulators [12][13][14], in automation for the analysis of multidimensional spaces [11,15], and in wireless data transmission [16]. Deep neural networks are also a popular area of application for hypercomplex numbers: quaternion-valued neural networks [17,18] and octonion-valued neural networks [19,20].
Another promising area is the application of hypercomplex numbers to big data analysis, e.g., in a parallel training architecture for large-scale convolutional neural networks [21], in parallel extreme learning machines [22][23][24], in cloud computing service systems based on big data [25], in blockchain [26,27], or in IoT [28]. In all of the above cases, the use of octonions in parallel algorithms can provide significant benefits owing to the internal relations of hypercomplex numbers.
When such hypercomplex-valued algorithms are executed, the most time-consuming macro-operations are multiplication and division, since each requires several dozen real multiplications and additions [29,30]. Developers of highly efficient algorithms for processing hypercomplex data have therefore always looked for ways to reduce the number of arithmetic operations in these algorithms. This applies in particular to minimizing the number of real multiplications, which from the very beginning of computing has been the most time-consuming of the basic data processing operations. For the multiplication of hypercomplex numbers the situation is fairly clear: quite a few efficient algorithms have been developed. It would be equally useful to have efficient algorithms for all the other operations defined on these numbers. Fast algorithms for dividing hypercomplex numbers, however, are hardly described anywhere; the only exception is the fast quaternion division algorithm [31]. To fill this gap, the authors propose an algorithm for the fast division of octonions.
Thus, the purpose of this article is to present the original results of the synthesis of a rationalized algorithm for calculating the quotient of two octonions, which requires fewer real multiplications compared to the direct naive calculation method at the cost of some increase in the number of real additions.
The paper is organized as follows. Section 2 gives a short background on dividing two octonions. Section 3, the main part of the paper, presents the synthesis of a fast algorithm for octonion division. The computational complexity of the proposed algorithm is evaluated in Section 4. Section 5 closes the paper with concluding remarks.

Short Background
Consider the problem of dividing two octonions, $d = a/b$, where $a = a_0 + e_1 a_1 + e_2 a_2 + e_3 a_3 + e_4 a_4 + e_5 a_5 + e_6 a_6 + e_7 a_7$ is the octonion-dividend, $b = b_0 + e_1 b_1 + e_2 b_2 + e_3 b_3 + e_4 b_4 + e_5 b_5 + e_6 b_6 + e_7 b_7$ is the octonion-divisor, the coefficients $\{a_i\}$ and $\{b_i\}$ are real numbers, and $e_1, \dots, e_7$ are the imaginary units. The rules for multiplying the octonion imaginary units are presented in Table 1 [1]. The operation of division (2) can be rewritten in matrix-vector form [32][33][34]:

$$Y_{8\times 1} = \eta_8 O_8 X_{8\times 1}, \qquad (3)$$

where $\eta_8 = \eta I_8$ is a scalar matrix, $I_N$ is the identity matrix of order $N$, $\eta = 1/R$ is the inverse of the squared norm $R = b_0^2 + b_1^2 + \dots + b_7^2$ of the octonion-divisor, $Y_{8\times 1} = [y_0, y_1, y_2, y_3, y_4, y_5, y_6, y_7]^{T}$ and $X_{8\times 1}$ are the vector representations of the quotient $d$ and of the octonion-dividend $a$, respectively, and $O_8$ is the left real matrix representation of the conjugate octonion-divisor.

Direct evaluation of the matrix-vector product in Equation (3) requires 72 multipliers, 63 adders, 8 squarers, and one divider of real numbers. The authors propose a set of transformations that reduce the multiplicative complexity of this operation to 38 real multiplications at the price of 56 more real additions.
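The naive method used as the baseline can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the octonion product is built with the Cayley-Dickson doubling rule, whose ordering and signs of the imaginary units may differ from Table 1, and the quotient is taken as the left quotient conj(b)·a/|b|², which is how the phrase "left matrix representation of the conjugate divisor" is read here. The names `conj`, `cd_mul`, and `left_divide` are hypothetical.

```python
def conj(x):
    """Conjugate: keep the real part, negate every imaginary part."""
    return [x[0]] + [-v for v in x[1:]]


def cd_mul(x, y):
    """Cayley-Dickson product of two coefficient lists of equal length 1, 2, 4 or 8.

    Assumed doubling rule: (a, b)(c, d) = (a c - conj(d) b, d a + b conj(c)).
    For length 8 this costs 64 real multiplications and 56 real additions.
    """
    n = len(x)
    if n == 1:
        return [x[0] * y[0]]
    h = n // 2
    a, b, c, d = x[:h], x[h:], y[:h], y[h:]
    first = [p - q for p, q in zip(cd_mul(a, c), cd_mul(conj(d), b))]
    second = [p + q for p, q in zip(cd_mul(d, a), cd_mul(b, conj(c)))]
    return first + second


def left_divide(a, b):
    """Return d = conj(b) * a / |b|^2, so that b * d equals a."""
    r = sum(v * v for v in b)      # 8 squarings and 7 additions
    eta = 1.0 / r                  # 1 division
    # 64 multiplications for the product plus 8 for the scaling by eta
    return [eta * v for v in cd_mul(conj(b), a)]


if __name__ == "__main__":
    a = [1.0, 2.0, -3.0, 0.5, 4.0, -1.5, 2.5, -0.5]
    b = [2.0, -1.0, 0.5, 3.0, -2.0, 1.0, 4.0, -3.0]
    d = left_divide(a, b)
    print(d)
    print(cd_mul(b, d))  # should reproduce a up to rounding
```

Counting the operations in this sketch gives 72 multiplications, 63 additions, 8 squarings, and one division, the same totals as quoted for the direct method.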

Synthesis of a Fast Algorithm for Octonion Division
Let us multiply the first column of the matrix $O_8$ by $(-1)$. This step helps to minimize the computational complexity of the final algorithm. With this manipulation taken into account, the division of one octonion by another can be rewritten in terms of the resulting matrix and then, after suitable notation is introduced, represented as a sum. It is easy to see that the matrix $B_8$ has a symmetric Toeplitz-type structure. A matrix with such a structure can be diagonalized efficiently using the fast discrete Walsh-Hadamard transform [35]. Let us examine this idea more closely.
The matrix $B_8^{(1)}$ has a block-symmetric structure. For matrices with such a structure, a factorization (10) holds [35], in which $W^{(0)}$ is the Hadamard matrix of order 2, $0_N$ is the $N \times N$ null matrix (all of its entries are zero), and the symbols "$\otimes$" and "$\oplus$" denote the tensor product and the direct sum of two matrices, respectively [31,35,36,37]. The next step is to analyze the structures of the matrices $A_4 + B_4$ and $A_4 - B_4$. These matrices also have block-symmetric structures, and a similar factorization holds for them. This means that expression (10) can be rewritten as expression (11), which involves a matrix $D^{(2)}$. Now consider the structures of the matrices $A_2 + B_2$, $A_2 - B_2$, $C_2 + D_2$, and $C_2 - D_2$. These matrices are again block-symmetric, so the same step applies, and expression (11) can be rewritten once more, now with a diagonal matrix $D_8^{(3)}$ in the middle. The triple products of matrices to the left and to the right of the diagonal matrix $D_8^{(3)}$ are in fact factorized representations of the eighth-order Hadamard matrix and describe the fast Walsh-Hadamard transform.
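As a worked illustration of one level of this factorization, the following identity can be checked by direct multiplication. It is a sketch under assumptions: the blocks of $B_8^{(1)}$ are taken to be $A_4$ on the block diagonal and $B_4$ off it, and $H_2$ is used here as shorthand for the order-2 Hadamard matrix; the article's own symbols and its placement of the factor $1/2$ may differ.

```latex
\[
\begin{bmatrix} A_4 & B_4 \\ B_4 & A_4 \end{bmatrix}
= \frac{1}{2}\,(H_2 \otimes I_4)
\bigl[(A_4 + B_4) \oplus (A_4 - B_4)\bigr]
(H_2 \otimes I_4),
\qquad
H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}.
\]
```

Applying the same identity to $A_4 \pm B_4$, and then to the resulting $2 \times 2$ blocks, leaves a diagonal matrix between two products of sparse factors with entries $0$ and $\pm 1$, so the only real multiplications that remain are those by the diagonal entries.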
It is easy to check that the elements of the matrix $D$ can also be calculated using the eight-point fast Walsh-Hadamard transform. Unfortunately, the calculation of the product $B_8 I_8 X_{8\times 1}$ defies any effort at rationalization. Combining the partial computing procedures into a single one yields the final algorithm (16), in which $\Omega_{8\times 16} = 1_{1\times 2} \otimes I_8$, $1_{1\times 2} = [1, 1]$, and $1_{N\times M}$ is the $N \times M$ matrix of ones (every element is equal to one).
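The diagonalization idea can be illustrated with a short Python sketch. It is not the authors' algorithm (16): it assumes a generic 8×8 matrix whose entry (i, j) equals c[i XOR j], which is one family of matrices with the recursive block-symmetric structure discussed above, and it leaves out the sign changes and the remaining matrix term of the octonion divider. The function names `fwht`, `matvec_direct`, and `matvec_fast` are hypothetical.

```python
def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform (Hadamard ordering).

    Uses only additions and subtractions: 24 of them for 8 points.
    """
    v = list(v)
    h = 1
    while h < len(v):
        for start in range(0, len(v), 2 * h):
            for i in range(start, start + h):
                p, q = v[i], v[i + h]
                v[i], v[i + h] = p + q, p - q
        h *= 2
    return v


def matvec_direct(c, x):
    """Multiply the matrix M[i][j] = c[i ^ j] by x directly (n*n multiplications)."""
    n = len(c)
    return [sum(c[i ^ j] * x[j] for j in range(n)) for i in range(n)]


def matvec_fast(c, x):
    """Same product via M = (1/n) * H * diag(H c) * H (n multiplications)."""
    n = len(c)
    diag = fwht(c)                      # diagonal entries, computed by a transform of c
    spectrum = fwht(x)                  # transform of the input vector
    scaled = [d * s for d, s in zip(diag, spectrum)]  # the only real multiplications
    return [t / n for t in fwht(scaled)]              # division by a power of two


if __name__ == "__main__":
    c = [3, -1, 4, 1, -5, 9, 2, -6]
    x = [2, 7, -1, 8, 2, -8, 1, 8]
    print(matvec_direct(c, x))
    print(matvec_fast(c, x))
```

For n = 8 the direct product uses 64 real multiplications, while the transform-based version uses 8. When c is fixed, `fwht(c)` can be computed once and reused, and the final division by 8 is a power-of-two scaling that fixed-point hardware can realize as a shift.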
The signal flow graph of the final algorithm (16) is shown in Figure 1.

Evaluation of Computational Complexity
The approach to computing the quotient of two octonions discussed in the article aims to minimize the number of multipliers required for a fully parallel hardware implementation. The number of multipliers in the proposed solution is nearly half that of the naive method: it is reduced from 72 to 38, a reduction of about 47.2%. This comes at the expense of increasing the number of adders from 63 to 133, which is an increase of approximately 106.3%. It is also necessary to use eight three-bit barrel shifters. The number of dividers and squarers remains the same as in the naive method.
It should be noted that in most digital signal processing tasks one of the two octonions involved in the division is a so-called "constant", which means that its coefficients $\{b_i\}$ are real numbers known in advance. The entries of the matrix $D_{16}$ can then be computed once, in advance, and stored in the read-only memory (ROM) of the computational system. In this case the proposed solution becomes even more effective: of its operating blocks, only 38 are multipliers (about half as many as in the naive method) and 70 are adders. The total number of operating blocks needed to implement a processor unit that calculates the quotient is then 108, compared with 128 for the naive method. In this setting, the proposed solution is preferable to the naive method in terms of the total number of arithmetic operations, or of operating blocks in the case of hardware implementation, necessary to calculate the quotient of octonions.

Conclusions
The paper presented a new parallel algorithm for calculating the quotient of two octonions. The use of this algorithm reduces the multiplicative complexity of octonion division and thus the complexity of its hardware implementation. The proposed algorithm has a simple, regular, and modular structure suitable for VLSI implementation. It is known that, all other things being equal, a conventional hardwired multiplier is a more complex device than a binary adder and occupies a much larger die area. Therefore, reducing the number of embedded multipliers is especially important when developing a specialized fully parallel VLSI-based module such as an octonion divider. Minimizing the number of multipliers also reduces the power dissipation and the cost of such a module. Thus, it can be argued that a decrease in the number of embedded multipliers, even at the cost of a small increase in the number of adders, plays an important role in the hardware implementation of the algorithm.
In addition, it can be seen that the total number of arithmetic operations in the presented algorithm is less than the total number of operations in the compared algorithm. Therefore, the proposed algorithm may in some cases turn out to be better than the naive algorithm even from the point of view of its software implementation on a general-purpose computer. Many applications of the algorithm presented in the paper could be proposed. However, these questions are beyond the scope of this study and will be considered in subsequent articles by the authors.