Singular Value Decomposition in Embedded Systems Based on ARM Cortex-M Architecture

Abstract: Singular value decomposition (SVD) is a central mathematical tool for several emerging applications in embedded systems, such as multiple-input multiple-output (MIMO) systems, data analytics, and sparse representation of signals. Since SVD algorithms reduce to solving an eigenvalue problem, which is computationally expensive, both specific hardware solutions and parallel implementations have been proposed to overcome this bottleneck. However, as those solutions require additional hardware resources that are generally not available in embedded systems, optimized algorithms are needed in this context. The aim of this paper is to present an efficient implementation of the SVD algorithm on ARM Cortex-M. To this end, we proceed to (i) present a comprehensive treatment of the most common algorithms for SVD, providing a fairly complete and deep overview of these algorithms with a common notation, (ii) implement them on an ARM Cortex-M4F microcontroller, in order to develop a library suitable for embedded systems without an operating system, (iii) find, through a comparative study of the implemented SVD algorithms, the implementation best suited to a low-resource bare-metal embedded system, and (iv) show a practical application to Kalman filtering of an inertial measurement unit (IMU), as an example of how SVD can improve the accuracy of existing algorithms and of its usefulness on such a low-resource system. All these contributions can be used as guidelines for embedded system designers. Regarding the second point, the chosen algorithms have been implemented on ARM Cortex-M4F microcontrollers, which have very limited hardware resources compared to more advanced CPUs. Several experiments have been conducted to select the algorithms that guarantee the best performance in terms of speed, accuracy and energy consumption.

Various algorithms have been developed over the last decades for computing the SVD [28]; among these, the Golub-Reinsch [29], Demmel-Kahan [30], Jacobi rotation [31], one-sided Jacobi rotation [32], and divide and conquer [33] algorithms are widely used, since they guarantee good performance in common applications. An overview of the SVD as well as the methods for its computation can be found in [34,35].
For example, optimized libraries such as the Linear Algebra PACKage (LAPACK) and the Basic Linear Algebra Subprograms (BLAS) (in their several implementations), which represent the state-of-the-art solution on computer-class CPUs, use several optimizations to efficiently exploit the hardware resources. An important optimization is reorganizing the operation flow and the data layout to make better use of the processor's cache memory, in order to speed up the transfer of large amounts of data from and to the slower RAM. Another improvement comes from parallelism, that is, splitting the problem into several smaller subsets and exploiting the parallel execution of modern multi-core CPUs. A further form of parallelism is given by single instruction multiple data (SIMD) technology, consisting of single instructions that perform the same operation in parallel on multiple data elements of the same type and size. For example, a CPU that normally adds two 32-bit values can instead perform four parallel additions of 8-bit values in the same amount of time. SIMD therefore makes better use of the available resources in a microprocessor: when performing operations on large amounts of data, parts of the computation units would otherwise sit idle while continuing to consume power.
It is worth noting that all the SVD algorithms exhibit a large amount of data-level parallelism, making them well suited to implementation with SIMD acceleration and multi-core parallelism.
However, none of the cited hardware resources (cache memory, parallel execution units) exist in microcontrollers, so these optimization techniques are not useful on this class of embedded systems. As for SIMD, the Cortex-M4 and newer embedded architectures do offer a primitive form of SIMD instructions, but they are limited to 8- or 16-bit integer operands, so they are not suitable for complex computations that require higher-precision floating-point values for their stability.
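The flavor of this integer SIMD extension can be illustrated with a plain-C emulation of a two-lane 16-bit addition, the operation performed in a single cycle by the Cortex-M4 SADD16 instruction (exposed by the CMSIS intrinsic __SADD16; lanes wrap around rather than saturate, QADD16 being the saturating variant). The helper names below are ours, for illustration only:

```c
#include <stdint.h>

/* Pack two 16-bit lanes into one 32-bit word. */
static uint32_t pack16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Emulation of the SADD16 idea: two independent 16-bit additions
 * carried out on the halves of a single 32-bit word. On the real
 * hardware this is one instruction; here it is spelled out in C. */
static uint32_t emulated_sadd16(uint32_t a, uint32_t b)
{
    uint32_t lo = (uint16_t)((uint16_t)a + (uint16_t)b);
    uint32_t hi = (uint16_t)((uint16_t)(a >> 16) + (uint16_t)(b >> 16));
    return (hi << 16) | lo;
}
```

Note how a carry out of the low lane does not propagate into the high lane, which is exactly what distinguishes the SIMD addition from an ordinary 32-bit add.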
Regarding more complex architectures, there is a rich literature on how to implement mathematical algorithms on Field Programmable Gate Arrays (FPGAs) and other programmable hardware [53][54][55][56]. Those are generally sophisticated and expensive systems, used for high-end applications and exploiting a completely different computation model based on massive parallelism. Conversely, the scope of this paper is affordable, low-power and low-resource embedded systems, generally based on microcontrollers, as universally used in distributed scenarios such as the Internet of Things.
Many embedded systems have a very limited amount of available memory, especially data memory, and may not employ an underlying operating system, to further reduce resource usage. Because of this, the existing optimized libraries intended for computer-class CPUs are not readily usable in embedded systems, nor easily portable. For this reason, specific techniques for implementing SVD algorithms efficiently on embedded processors have seen growing interest in the past few years [59].
Regarding the SVD implementation, a large variety of algorithms exists; thus, from the embedded system designer's point of view, comparing the performance achieved by different algorithms implemented on a high-performance embedded processor is of great interest.
The aim of this paper is to present an efficient implementation of the SVD algorithm on ARM Cortex-M (Arm Ltd., Cambridge, UK). More specifically, as the SVD is a central mathematical tool for several emerging applications in embedded systems and, to our knowledge, no previous implementation of this algorithm exists in the literature for bare-metal microcontrollers, i.e., without an operating system, it is of paramount importance to develop a library suitable for these devices, specifically for the ARM Cortex-M4F architecture [60]. To achieve this goal, we proceeded: (i) to provide a comprehensive treatment of the most common SVD algorithms, rewriting the existing formulations found in the literature in a more compact and uniform manner, suitable to be used as a comprehensive reference; (ii) to implement them on an ARM Cortex-M4F microcontroller, in order to develop a library suitable for embedded systems that work without an operating system. As already mentioned, LAPACK and BLAS represent the state-of-the-art solution on computer-class CPUs, but these optimized libraries are not readily available or easily portable for embedded systems, because of the very limited available memory and the lack of an underlying operating system on these devices. To this end, we implemented five common SVD algorithms (Golub-Reinsch, Demmel-Kahan, Jacobi rotation, one-sided Jacobi rotation, and divide and conquer), adapting them to operate on resource-limited embedded systems without an operating system; (iii) to find the implementation that best fits the low resources of bare-metal embedded systems. To identify the best-performing algorithms for an embedded system, this paper reports a comparative study of their performance in terms of several implementation metrics: speed, accuracy and energy consumption;
(iv) to show a practical real-time application of the SVD in an embedded system, by using the proposed optimized library for ARM Cortex-M4F in the Kalman filtering of an inertial measurement unit (IMU). This example also proves the advantage of implementing the SVD on this class of systems, showing the improvement in the numerical accuracy of Kalman filtering obtained by applying the SVD algorithm with respect to the conventional Kalman approach.
Thus, the paper is a good starting point for researchers and practitioners interested in developing new algorithms to be implemented on low-resource microcontrollers.
This paper is organized as follows. Section 2 is mainly focused on the five most representative SVD algorithms and gives a comprehensive treatment of them, with Appendix A summarizing the basic concepts of matrix algebra related to the SVD transformation. Section 3 reports a comparative study of the performance of the five algorithms. Section 4 presents an application of SVD to Kalman filtering of data from an IMU, showing important improvements with respect to the traditional algorithm implementation. Finally, conclusions end this work.

Algorithms for the Singular Value Decomposition
Let A be a real (m × n) matrix with m ≥ n. It is known [35] that the decomposition

A = U Σ V^T, with U ∈ R^{m×n}, Σ = diag(σ_1, ..., σ_n) ∈ R^{n×n}, V ∈ R^{n×n}, (1)

exists. The matrix U consists of n orthonormal eigenvectors corresponding to the n largest eigenvalues of AA^T, and the matrix V consists of the orthonormal eigenvectors of A^T A. The diagonal elements of Σ are the non-negative square roots of the eigenvalues of A^T A, called singular values. Assuming σ_1 ≥ σ_2 ≥ ... ≥ σ_n ≥ 0, (1) is called the singular value decomposition (SVD) of matrix A.

Golub-Reinsch Algorithm
This method, developed by G. H. Golub and C. Reinsch [29], acts directly on the matrix A, thus avoiding the unnecessary numerical inaccuracy due to the explicit computation of A^T A. The algorithm can be divided into the following consecutive steps: (i) Householder bidiagonalization (see Appendix A.2 for details); (ii) implicit QR method with shift (see Appendix A.6 for details). The pseudo-code of the algorithm is reported in Algorithm 1.

Algorithm 1 Golub-Reinsch
Require: A ∈ R^{m×n} (m ≥ n), ε a small multiple of the unit round-off
Use Algorithm A1 to compute the bidiagonalization.
• Find the largest q and the smallest p such that if

Demmel-Kahan Algorithm
The structure of the algorithm is based on the Golub-Reinsch algorithm previously described. Nevertheless, the Demmel-Kahan algorithm [30] can achieve better accuracy for small singular values. This algorithm consists of the following main consecutive steps: (i) Householder bidiagonalization (see Appendix A.2 for details); (ii) QR iteration with zero shift (see Appendix A.7 for details). The description of this algorithm is shown in Algorithm 2.

Algorithm 2 Demmel-Kahan
Require: A ∈ R^{m×n} (m ≥ n), ε a small multiple of the unit round-off
Use Algorithm A1 to compute the bidiagonalization.

Jacobi Rotation Algorithm
In this case, given the real and symmetric matrix A ∈ R^{n×n}, the algorithm [31] aims to obtain a diagonal matrix B ∈ R^{n×n} through the transformation B = J^T A J, where J represents the product of a sequence of rotation matrices. In particular, the k-th rotation can be written as A_{k+1} = J_k^T A_k J_k, where J_k = J(p, q, θ) is the Jacobi rotation matrix that rotates rows and columns p and q of A_k through the angle θ so that the (p, q) and (q, p) entries are zeroed. The values of p and q are chosen appropriately at each iteration step. With reference to the 2 × 2 submatrix formed by the p, q rows and columns, and writing c = cos θ, s = sin θ, the rotation gives b_pp = c² a_pp − 2cs a_pq + s² a_qq, b_qq = s² a_pp + 2cs a_pq + c² a_qq, and b_pq = b_qp = (c² − s²) a_pq + cs (a_pp − a_qq). The key point of the algorithm is to determine the rotation coefficients c and s in such a way that the off-diagonal terms b_pq and b_qp are zeroed.
The algorithm is reported in Algorithm 3.

Algorithm 3 Jacobi rotation
Require: A ∈ R^{n×n} symmetric, ε a small multiple of the unit round-off
Compute the rotation coefficients s, c such that the (p, q) and (q, p) off-diagonal entries are zeroed.

One-Sided Jacobi Rotation Algorithm
The main idea of this algorithm [32] is to rotate columns i and j of A through the angle θ so that they become orthogonal to each other. In this way, the (i, j) element of A^T A, which equals the scalar product of columns i and j, is implicitly zeroed.
Let J(i, j, θ) be the Givens matrix that, when applied to the matrix A = (a_{:1} ··· a_{:i} ··· a_{:j} ··· a_{:n}), yields B = (b_{:1} ··· b_{:i} ··· b_{:j} ··· b_{:n}) = AJ, where the i, j columns of B are given by b_{:i} = c a_{:i} − s a_{:j} and b_{:j} = s a_{:i} + c a_{:j}. The key point of the algorithm is to determine the rotation coefficients c and s in such a way that the elements (B^T B)_{ij} and (B^T B)_{ji} of the product B = AJ(i, j, θ) are zeroed.
The pseudo-code of the algorithm is reported in Algorithm 4.

Algorithm 4 One-sided Jacobi rotation
Require: A ∈ R^{n×n} symmetric, ε a small multiple of the unit round-off
Compute the rotation coefficients s, c such that columns i and j become orthogonal.

Divide and Conquer Algorithm
For a square symmetric matrix A, a relationship between singular values and eigenvalues, as well as between singular vectors and eigenvectors, exists. Indeed, as A can be diagonalized, we can write

A = Q Λ Q^T, (11)

where Λ is the diagonal matrix of eigenvalues and Q is a unitary matrix (orthogonal in the real case). Since we can always assume that the elements of |Λ| are in decreasing order, to the diagonalization (11) corresponds an SVD such that U = Q sign(Λ), Σ = |Λ| and V^T = Q^T. In words, the singular values are the absolute values of the eigenvalues, and the singular vectors are the eigenvectors (same norm and direction, but not necessarily the same orientation). For a non-symmetric matrix A ∈ R^{M×N}, the singular values and singular vectors are not directly related to its eigenvalues and eigenvectors; instead, there is a strict relationship with the eigenvalues and eigenvectors of the symmetric matrices A^T A ∈ R^{N×N} and AA^T ∈ R^{M×M}. In fact, it can easily be shown that

A^T A = V (Σ^T Σ) V^T,   AA^T = U (Σ Σ^T) U^T, (12)

where A is decomposed as A = UΣV^T. From (12) it follows that the singular values of A are the square roots of the eigenvalues of the matrices A^T A and AA^T (except those equal to zero for the latter). Additionally, the right and left singular vectors are the eigenvectors of A^T A and AA^T respectively, which can differ by a sign (note that the matrix A can be correctly reconstructed provided the correct sign is known). Therefore, on the basis of the previous considerations, the singular value decomposition reduces to the eigenvalue problem of a symmetric matrix. In general, the methods for solving such a problem are iterative and include two stages: (i) in the first stage the matrix A is transformed into a matrix B whose structure makes the computation of the eigenvalues and eigenvectors easier. A typical choice, assumed here, is a tridiagonal form.
(ii) in the second stage an iterative method is applied to determine the eigenvalues and eigenvectors.
With reference to the second stage, the divide and conquer algorithm [33] aims at recursively splitting the problem into smaller subproblems. The algorithm is intended to be applied to a tridiagonal and symmetric matrix T of dimension N × N.
The pseudo-code of the algorithm is reported in Algorithm 5.

Algorithm 5 Divide and conquer
Require: T ∈ R^{n×n} tridiagonal symmetric
Compute the eigenvalues Λ by solving the secular equation.
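The core of the conquer step is solving the secular equation f(λ) = 1 + ρ Σ_i z_i² / (d_i − λ) = 0, whose roots interlace the entries d_i. A deliberately simple bisection sketch follows; production codes use much faster rational interpolation, and the names and fixed size M here are ours:

```c
#include <math.h>

#define M 3

/* Secular function for the rank-one update D + rho * z * z^T. */
static double secular_f(const double d[M], const double z[M],
                        double rho, double lam)
{
    double s = 1.0;
    for (int i = 0; i < M; ++i)
        s += rho * z[i] * z[i] / (d[i] - lam);
    return s;
}

/* Root in (d[k], d[k+1]) for 0 <= k < M-1, d sorted increasing and
 * rho > 0: on that interval f increases monotonically from -inf to
 * +inf, so plain bisection converges to the unique root. */
static double secular_root(const double d[M], const double z[M],
                           double rho, int k)
{
    double lo = d[k] + 1e-12, hi = d[k + 1] - 1e-12;
    for (int it = 0; it < 200; ++it) {
        double mid = 0.5 * (lo + hi);
        if (secular_f(d, z, rho, mid) > 0.0)
            hi = mid;   /* f(mid) > 0: root lies below mid */
        else
            lo = mid;
    }
    return 0.5 * (lo + hi);
}
```

Each of the M eigenvalues is bracketed by consecutive poles d_k (the last one lies above d_{M-1}), which is what makes the roots so cheap to localize.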

Experimental Results
The five algorithms described in the previous section have been implemented both in MATLAB on a desktop computer and on a Cortex-M4F microcontroller.
The algorithms have been tested with several matrices A ∈ R^{m×n}. To ensure that the algorithms worked in the most general cases, several sets of matrices of increasing size have been chosen, each set with a different row/column ratio. Considering the limited memory of the microcontroller, three sets of matrices have been tested, with ratios of m/n = 1, 4/3 and 2, respectively. Further increasing the ratio would have caused the matrices not to fit into the limited memory of the microcontroller without significantly reducing their rank (memory occupation must take into account not only the input and output matrices, but also the intermediate matrices and vectors needed by the various algorithms). Indeed, preference has been given to maintaining comparable ranks between the sets of matrices. The low amount of RAM also limits the absolute size of the matrices, so smaller matrices have been used for the experiments on the microcontroller with respect to the tests in MATLAB.
The biggest matrix A has been randomly generated, and all the other matrices used are the top-left portion of the full matrix A.
For algorithms requiring a symmetric matrix, the input matrix is converted to a symmetric form as follows: first it is converted to bidiagonal form by applying Algorithm A1; then, with B the upper triangular portion of the bidiagonal matrix, the SVD algorithm is applied to BB^T. In this way, the singular values of the original matrix are the square roots of the eigenvalues of the latter tridiagonal matrix.

MATLAB Implementation
The algorithms have initially been implemented in MATLAB to test their correctness, on a 6-core, 12-thread Intel i7 CPU with 32 GB of RAM (Intel, Santa Clara, CA, USA). Figure 1 shows, as a matter of comparison, the timings of the various algorithms as implemented in MATLAB, as functions of the number of columns n for a set of matrices having m/n = 4/3. Table 1 reports the accuracy of a sample of the same tests, with the algorithms implemented in MATLAB, with respect to the MATLAB built-in SVD function. The errors reported are computed as the average of relative errors between matching pairs of singular values, as computed by the MATLAB built-in function and the given routines.
As can be seen, all the algorithms, with the exception of Demmel-Kahan, ensure the same accuracy as the MATLAB built-in SVD function. This could be expected, as the Demmel-Kahan algorithm was specifically designed to deal well with very small singular values, while the randomly generated matrices employed in our experiments have singular values close to unity.
Cortex-M4F Implementation
The STM32F429ZI microcontroller is based on an ARM 32-bit Cortex-M4F [60] CPU clocked at 180 MHz, with 2 MB of flash memory for code and read-only data, and 256 KB of RAM. In addition it has several hardware peripherals that are not relevant to this work.
A main feature of the Cortex-M4F core is the presence of a 32-bit hardware floating-point unit (FPU), as implied by the additional "F" in its name. An FPU is essential for any kind of heavy computational work in the floating-point domain, as is the case for the experiments on SVD performed in this article. The Cortex-M4F FPU is limited to 32 bits (https://developer.arm.com/docs/ddi0439/latest/floating-point-unit/about-the-fpu), so the algorithms have been implemented using single-precision (32-bit) values. Implementing algorithms of this kind on a CPU with no FPU, or with higher precision than that handled by the hardware, would require software floating-point mathematical libraries, which would be prohibitive on an already resource-constrained system.
The algorithms were implemented in the C language. No particular development environment was used: the code was compiled with the GCC toolchain for ARM on a GNU/Linux machine, using a custom makefile and with the aid of the STM32F4xx Standard Peripherals Drivers (STMicroelectronics, Geneva, Switzerland), a set of libraries provided by ST for their specific microcontrollers, covering all aspects of hardware management, from low-level initialization to the use of hardware peripherals. The firmware is of the "bare-metal" kind, so no real-time operating system (RTOS) or other middleware has been added.
The hardware system required no particular design besides what was already provided by the 32F429IDISCOVERY board (STMicroelectronics, Geneva, Switzerland). The device has been clocked at its maximum speed of 180 MHz. The board also integrates all the hardware needed for programming and debugging the microcontroller, namely the ST-LINK/V2 interface (STMicroelectronics, Geneva, Switzerland), offering USB connectivity with the computer. On the computer side, communication with this interface has been established using OpenOCD (http://openocd.org), a free software tool for debugging and programming ARM and other systems. OpenOCD in turn acts as a server for GDB, the standard debugger from the GCC suite, used when needed to transfer the code to the device and examine its memory for the results of the tests.
Regarding input and output, read-only data for the program, such as the larger matrix from which the smaller ones are generated, or the reference vectors of singular values used to compute the accuracy, are stored in the program (flash) memory, which is more abundant than RAM. Once the program is run for a series of tests, the numerical outputs can be examined through the debugger, by interrupting the program at convenient points. The timing of the single routines is computed by the software itself, using the SysTick timer built into the Cortex-M core.
Besides the optimizations performed by the compiler, special care has been exercised in trying to optimize the critical mathematical routines, while keeping a low usage of program memory and needed RAM, in order to speed up the computation as much as possible while fitting in the constrained resources of the system.
A first optimization comes from choosing the best arrangement of data in memory. Two-dimensional matrices can be stored in RAM (which is a linear array of bytes) in two different ways: row-major or column-major order, that is, storing data sequentially by row or by column, respectively (see Figure 2). The former is the most common way in the C language, or anywhere the mathematical convention of having the row as the first index is followed. This has a much more crucial impact on CPUs with cache memory, where the cache is filled with sequential data from RAM, and so it gives a huge speed boost when accessing sequential data compared to scattered accesses. As said, a microcontroller has no cache memory, so this is not directly the case; nevertheless, a non-negligible advantage exists in accessing data sequentially, due to the load/store assembly instructions with auto-increment, that is, instructions that read or write data from memory and increment the address register in a single execution cycle. As a quantitative example of the effect of row-major or column-major data storage, we can consider the one-sided Jacobi rotation algorithm, which differs from the other ones in accessing the input matrix exclusively by column. Table 2 shows the different timings of the algorithm for the biggest set of matrices, both in absolute times and as a percentage of speed increase. As said, the increase in speed is small, even if appreciable. The last column shows the time needed to transpose the matrix, in case it is stored in the opposite way, which is approximately one order of magnitude less than the time improvement obtained. Moreover, the input matrix to the SVD is often generated by previous computations, and so it can be produced directly in the more convenient order.
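The two layouts differ only in their index arithmetic; a trivial C sketch (helper names are ours) makes explicit which traversal touches consecutive addresses:

```c
#include <stddef.h>

/* Row-major storage: element (i, j) of an m x n matrix kept in a flat
 * array lives at index i * n + j, so scanning a ROW touches consecutive
 * addresses -- the pattern that benefits from the auto-increment
 * load/store instructions discussed above. */
static size_t rm_index(size_t i, size_t j, size_t n)
{
    return i * n + j;
}

/* Column-major equivalent: (i, j) -> j * m + i; here scanning a
 * COLUMN is the sequential pattern instead. */
static size_t cm_index(size_t i, size_t j, size_t m)
{
    return j * m + i;
}
```

In row-major order consecutive elements of a row are one address apart, while consecutive elements of a column are n apart, and vice versa for column-major, which is why an algorithm that walks the matrix by column (like one-sided Jacobi) prefers column-major storage.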
Besides accessing data in the convenient order, which yields a modest speed increase, and lacking hardware resources to exploit further, other optimizations must necessarily be obtained by reducing unneeded computations as much as possible. A significant example is matrix multiplication, one of the most computationally expensive operations in matrix algebra. Generally speaking, if C = A · B, the generic element of C is given by

c_{i,j} = Σ_{k=1}^{n} a_{i,k} b_{k,j}, (14)

where n is the number of columns of A. Computing all the elements of C in this way requires a triply nested loop, which is very computationally expensive, especially for large matrices.
The number of operations performed for a matrix multiplication can be reduced by observing the properties of the specific matrices. For example, a recurrent operation in the exposed algorithms is the multiplication of a square matrix by its transpose, as in C = A · A^T. In this case (14) becomes

c_{i,j} = Σ_{k=1}^{n} a_{i,k} a_{j,k}. (15)

First, we can notice that in the inner loop A is always traversed by row, so we gain the advantage of reading data in the most convenient order if the matrix is stored in row-major order. Most importantly, it can easily be seen that A · A^T is a symmetric matrix, so we can actually compute approximately half of its elements, appreciably reducing the number of operations (Figure 3). A further reduction of the number of operations is possible in the case of C = A · A^T where A is an upper bidiagonal matrix, another common case in the given algorithms. Given that a_{i,j} ≠ 0 only for j = i or j = i + 1, it follows from (15) that c_{i,j} ≠ 0 only for j = i − 1, i or i + 1 (indeed the resulting matrix is tridiagonal) and that the only non-zero terms in the sum are those for which a_{i,k} and a_{j,k} are both non-zero. The resulting formula is

c_{i,j} = a_{i,i} a_{j,i} + a_{i,i+1} a_{j,i+1}, (16)

where the (i + 1)-th element may not exist. The matrix being also symmetric, the reduction of the previous case applies as well (Figure 4).
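The symmetry optimization can be sketched as follows: only the lower triangle of C = A · A^T is computed and then mirrored, and the inner loop reads both operands by row, which is the sequential pattern for row-major storage. This is illustrative code of ours, not the paper's routine:

```c
#define N 4

/* C = A * A^T for an N x N matrix, computed only for j <= i and
 * mirrored, roughly halving the scalar multiplications. */
static void aat_symmetric(double a[N][N], double c[N][N])
{
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j <= i; ++j) {
            double s = 0.0;
            for (int k = 0; k < N; ++k)
                s += a[i][k] * a[j][k];   /* both operands read by row */
            c[i][j] = c[j][i] = s;        /* symmetry fills both halves */
        }
    }
}
```

The same structure extends to the bidiagonal case of (16), where the inner loop over k collapses to at most two terms.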
Another special example is the multiplication of a matrix by a Givens matrix, in the form of (A14), to perform a Givens rotation. Let us call G the Givens matrix, to avoid confusion with indices, and let us restrict ourselves for simplicity to the case of left multiplication, as in C = G · A. If we initially set C = A, it is clear from the definition of G that only rows p and q of C are affected by the multiplication. Moreover, only elements p and q of a given column of A are used in the computation (Figure 5). So the only elements of C that need to be updated are those in rows p and q, and their values are c_{p,j} = g_{p,p} a_{p,j} + g_{p,q} a_{q,j} and c_{q,j} = g_{q,p} a_{p,j} + g_{q,q} a_{q,j} for every column j. A similar formula holds for right-side multiplication. The complexity of the previous computations, corresponding to matrix products for different special cases of input matrices, can be compared in terms of the number of scalar multiplications with respect to the size of the input matrices. Results are shown in Table 3, together with the code size of the specific software routines. Table 3. Number of scalar multiplications in matrix products for different types of n × n input matrices.

The impact of such optimizations on computation speed can be measured, in particular those from (15) and (16) and from the property of symmetry (the Givens rotation is never actually implemented as a matrix multiplication, since its matrix is expressly constructed to modify only a few elements of the result). Table 4 shows the speed increase for the set of biggest matrices when using the two algorithms where these optimizations are most relevant. Besides optimizing the mathematical routines, it is important to avoid the problems arising from the limited power and precision of the microcontroller's FPU. In the case of operations that can be carried out in different ways, choosing the right procedure can make a substantial difference. For example, the assembly multiplication instruction (VMUL.F32) takes 1 clock cycle to execute, while the division (VDIV.F32) takes 14 cycles (https://developer.arm.com/docs/ddi0439/b/floating-point-unit/fpu-functional-description/fpu-instruction-set). In some cases the compiler can automatically substitute an expensive operation with a cheaper one during the optimization phase of compilation, for example when dividing by a constant.
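Because only rows p and q change under left multiplication by G, the rotation reduces to a two-row update. A short C sketch of this idea (our helper, using one common sign convention for G, with g_pp = g_qq = c and g_pq = −g_qp = s):

```c
#define NC 4   /* number of columns */

/* Apply a left Givens rotation G(p, q) to a matrix with NC columns:
 * only rows p and q are updated, 4 multiplications per column instead
 * of a full matrix product. */
static void givens_rows(double a[][NC], int p, int q, double c, double s)
{
    for (int j = 0; j < NC; ++j) {
        double ap = a[p][j], aq = a[q][j];
        a[p][j] =  c * ap + s * aq;   /* row p of G * A */
        a[q][j] = -s * ap + c * aq;   /* row q of G * A */
    }
}
```

As a sanity check, choosing c = 3/5, s = 4/5 rotates the column (3, 4)^T onto (5, 0)^T, and every column keeps its Euclidean norm, as expected of an orthogonal transformation.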
Another problem arising from the limits of the FPU is the loss of precision in certain operations in the 32-bit domain. For example, a recurring computation in the given algorithms is 1/√(1 + x²). This operation loses many significant bits of the original value of x when x ≪ 1, since x² becomes negligible with respect to 1 in single precision. It must be verified experimentally when this is tolerable and when it is not; in some cases the loss of precision worsens the final accuracy by an order of magnitude. A possible solution is to switch temporarily to 64-bit precision, then convert back to single precision when the sensitive computation is done. Of course this significantly increases the execution time, as it uses software libraries instead of the hardware FPU. A better solution is to apply the logarithm to the value to be computed, perform the intermediate computations in the logarithmic domain, and finally apply exponentiation. In this case log(1/√(1 + x²)) = −0.5 log(1 + x²), which can take advantage of a special function in the C mathematical library, "log1pf", optimized to compute log(1 + x) with high accuracy even when the value of x is near zero. Tests show that using the logarithm and exponentiation software routines while remaining in the 32-bit domain is faster and gives similar or better results than computing the square root in software in double precision. Figure 6 shows the timings of the full SVD algorithms implemented on the microcontroller for the three sets of matrices of different row/column ratio, with respect to the total matrix size (m × n). The same results are also reported in Table 5, with the three matrix sets listed separately.
From the experimental results it can be seen that the execution time of the algorithms depends roughly on the total matrix size, but with irregularities suggesting that other factors, such as the matrix rank, affect the total time.
On a comparative basis, the one-sided Jacobi rotation algorithm gives the lowest execution time. Table 6 reports the accuracy of the same tests, implemented on the microcontroller, with respect to the MATLAB built-in SVD function. As in Section 3.1, the errors reported are computed as the average of relative errors of matching pairs of singular values, as computed by the MATLAB built-in function and the given routines. As can be seen, the accuracy of the Cortex-M4F implementation is significantly lower than that of the equivalent MATLAB code; this is due to the lower precision (32 bits) of the Cortex-M4F hardware floating-point unit. In this case, both the one-sided Jacobi rotation and the divide and conquer algorithms achieve better accuracy than the others.
Finally, Table 7 reports the energy consumption of the tests relative to one matrix set, measured by sampling the voltage and current with an INA226 from Texas Instruments (https://www.ti.com/product/INA226). The INA226 is a current and voltage monitor with I²C interface, with a maximum gain error of 0.1%, maximum input offset of 10 µV and 16-bit resolution. A 0.1 Ω shunt resistor has been used in series with the microcontroller power supply, and the data have been acquired through an external single-board computer.
As can be seen, the results are consistent with the execution times. As a matter of comparison, Figure 7 shows side by side the current consumption of the five algorithms for one of the matrices. The one-sided Jacobi rotation algorithm, besides being the fastest, clearly has the lowest average current consumption. It is worth noting that the other algorithms exhibit the same pattern at the beginning, corresponding to the Householder bidiagonalization step. This step therefore has a significant impact, both in time and energy, on the algorithms that need it.

Application: Kalman Filtering of Inertial System Data
As a practical application to an embedded system, and as a proof of the advantage of implementing the SVD on this class of systems, we will show how SVD can be used to improve the numerical accuracy of Kalman filtering applied to an inertial measurement unit (IMU). Sensors included in an IMU are typically accelerometers and gyroscopes. Data from such sensors, with proper filtering and data-fusion techniques, allow the computation of the spatial orientation of a moving body (its attitude), namely the relative angles of the body with respect to the spatial axes.
Accelerometers suffer from high-frequency noise added to the signal, together with additional acceleration terms when a force is applied to the body. Gyroscopes measure the angular rate of change around each axis, which must be integrated to obtain angular positions; this leads to a significant drift in the resulting data, because gyroscopes are affected by a bias in the measurement that is non-constant and unpredictable, resulting in a drift that increases with time.
The Kalman filter [61] is a common technique used to estimate the state of such a system (namely the current angles along the chosen axes) given a set of discrete measurements over time that are subject to statistical noise. To that end, the system dynamics must be written in state-space form:

x_{t+1} = A x_t + B u_t + w_t, (18)
y_t = C x_t + v_t, (19)

where x_t is the unknown system state vector to be estimated, the vector u_t is the control input to the system and y_t is the measurement vector. w and v are additive noise on process and measurement data respectively, both assumed to be zero-mean Gaussian processes: w_t ∼ N(0, Q_t), v_t ∼ N(0, R_t). The initial state condition x_0 is also a random variable: x_0 ∼ N(0, Π_0). It is shown in [22] that traditional Kalman filter implementations, like square-root algorithms based on the Cholesky decomposition, are effective on well-conditioned problems but may fail if the equations are ill-conditioned, due to roundoff errors, as those algorithms rely on matrix inversion in several places. On the other hand, [22] shows how an SVD-based Kalman filter variant works much better in ill-conditioned cases.
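For illustration, the predict/update cycle implied by these equations can be sketched in one dimension, i.e., a scalar state with identity dynamics and no control input. All names here are ours; the SVD-based variant discussed in [22] differs precisely in how the covariance (here the scalar p) is propagated:

```c
/* Minimal scalar Kalman filter: state x with identity dynamics,
 * noisy direct measurements y of x. */
typedef struct {
    double x;   /* state estimate          */
    double p;   /* estimate variance       */
    double q;   /* process noise variance  */
    double r;   /* measurement noise var.  */
} kf1d;

static void kf1d_predict(kf1d *f)
{
    /* x_{t+1|t} = x_t; uncertainty grows by the process noise. */
    f->p += f->q;
}

static void kf1d_update(kf1d *f, double y)
{
    double k = f->p / (f->p + f->r);   /* Kalman gain               */
    f->x += k * (y - f->x);            /* correct with innovation   */
    f->p *= (1.0 - k);                 /* variance shrinks          */
}
```

Feeding a constant measurement drives the estimate toward it while the variance settles at a small steady-state value; in the matrix case the same update involves the inversions that become fragile when the problem is ill-conditioned.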
In the IMU example, an ill-conditioned problem arises when multiple sets of the same measurements are available. To illustrate a practical case, we report results obtained with an X-NUCLEO-IKS01A2 (https://www.st.com/en/ecosystems/x-nucleo-iks01a2.html) board, a sensor board from ST equipped with, among other sensors, a combined gyroscope-accelerometer and a combined magnetometer-accelerometer. The board can be used with a microcontroller unit; in our case we used a NUCLEO-F411RE (https://www.st.com/en/evaluation-tools/nucleo-f411re.html), with a Cortex-M4F-based microcontroller similar to the one used in the previous experiments. For this hardware system the firmware has been developed using the STM32CubeIDE development platform, version 1.4.0. An additional software package, X-CUBE-MEMS1, has been installed in the IDE through its integrated software package manager; this package supports communication with the specific sensors of the X-NUCLEO-IKS01A2 inertial system. In this case the microcontroller (STM32F411RE) is clocked at 84 MHz. As in the previous experiment, the firmware is "bare-metal", with no RTOS or other middleware added. Communication with the board for programming and debugging is entirely managed by the IDE.
Having two accelerometers available, it is convenient to use both data streams to obtain a better estimate of the state, given the noisy nature of the devices. Mathematically, the matrices in Equations (18) and (19) can be expanded for the double-accelerometer case. Here the system state includes the estimated roll and pitch angles (φ, θ) along the x and y axes respectively and the estimated gyroscope bias on both angles (b_φ, b_θ); the system inputs include the angular rates from the gyroscope (φ_G, θ_G); ∆t is the period of the acquired data; the system measurements are given by the estimated angles from both accelerometers' data (φ_A1, θ_A1, φ_A2, θ_A2).
The δ factor accounts for the fact that the two series of accelerometer data are numerically very similar, leading to an ill-conditioned problem: the matrix is now nearly singular, due to the redundancy of the system measurement variables. As shown in [22], we can expect the solution of the Kalman filter equations by conventional algorithms to suffer from roundoff errors when δ becomes small enough to be comparable to machine precision, and we can expect the SVD-based algorithm to work better in such cases.
The firmware on the board has been used to export a batch of measurements (gyroscope, accelerometer 1, accelerometer 2) to the computer, so that the data could be examined in Matlab in a reproducible way. The data have a duration of about 90 s, with a ∆t of 10 ms, and have been obtained by moving the device through a set of reference positions, namely 0, 90 and −90 degrees on the three axes.
Then the Kalman filter has been computed in Matlab with both algorithms (conventional and SVD-based), repeating the experiment for various decreasing values of δ, until approaching values comparable to machine precision.
Finally, the accuracy of the estimated state has been evaluated by computing the mean absolute error (in degrees) with respect to the reference intervals of known orientations, for all the different values of δ. Table 8 shows that the two algorithms give very similar results down to a certain value of δ, but at lower values the conventional algorithm fails ("NaN" means Not-a-Number, an error condition indicating an invalid numeric value). The SVD-based algorithm, conversely, keeps working with negligible differences for two further orders of magnitude of δ, starting to degrade only at a much lower value.

To perform a similar experiment within the limited capabilities of an embedded system, the same Kalman filter algorithms have been implemented in a C program on the same set of off-line data, using the custom SVD algorithms discussed in the previous sections and single-precision (32 bit) floating-point numbers, i.e., the same precision as the floating-point hardware of the Cortex-M4F architecture. To focus only on the difference between the conventional and SVD-based Kalman filters, we used a single SVD algorithm, namely the one-sided Jacobi, which proved to be the best under several criteria (Section 3); the other algorithms gave similar results.
Preliminary tests have shown that with limited numeric precision, as in this case, the conventional Kalman algorithm is much more sensitive to variations of the initial conditions. Indeed, the difference between the two Kalman implementations is evident even in the single-accelerometer case, by varying the relative values of the covariance matrices involved.
As a clarifying example, a series of tests is shown with fixed values Π_0 = I_4 and Q = I_4, data from a single accelerometer and varying values of R. Results are shown in Table 9, which highlights a trend similar to the Matlab case: the two algorithms behave very similarly in some cases, but the SVD-based one works uniformly over a wider range of parameters.

Table 9. Single precision C code: mean absolute errors (φ, θ) of estimated angles (degrees).

Regarding the actual performance on a Cortex-M4F microcontroller, the matrices involved in the SVD-based Kalman algorithm are much smaller than those in Table 5, with a maximum size of 8 × 4 in the double-accelerometer case. An extrapolation of the results in Table 5 leads to an estimate of a fraction of a millisecond, so computing the Kalman filter with the SVD-based variant should be feasible in real time in most applications.

Conclusions
This paper presents a comparative study of five of the most common SVD algorithms, namely Golub-Reinsch, Demmel-Kahan, Jacobi rotation, one-sided Jacobi rotation, and divide and conquer. The aim of this comparative study is to find the most suitable algorithm for an embedded system. The chosen algorithms have been investigated first from a theoretical perspective, with a detailed mathematical description that provides crucial hints for the practical implementations. The algorithms were initially written in MATLAB to verify their correctness, then implemented on a Cortex-M4F microcontroller to achieve optimized results for low-resource embedded systems. The comparison of the Cortex-M4F implementations shows that the one-sided Jacobi rotation outperforms the other algorithms in terms of speed, accuracy and energy consumption; moreover, the divide and conquer algorithm shows similar accuracy. Finally, the SVD has been applied to an example case, showing that it can improve the accuracy of algorithms where the data are ill-conditioned, while traditional implementations may fail.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Matrix Algebra Operations
Appendix A.1. Householder Transformation
It is based on an (n × n) symmetric matrix of the form U = I − 2uu^T/(u^T u), where u is the Householder vector. A matrix of this kind is also called a Householder reflection, due to the following specific geometric property. A given vector x can always be represented as x = αu + w, where w is orthogonal to u. By applying the transformation U to x one gets Ux = −αu + w, the reflected vector of x with respect to the hyperplane spanned by w. Thanks to this property the Householder transformation can be used to zero selected components of a vector. To show this, let us choose u = x − ‖x‖e_1, where e_1 = (1, 0, …, 0)^T and ‖·‖ is the Euclidean norm of a vector. By applying the transformation U so obtained to x, it results Ux = ‖x‖e_1, meaning that all the components of x but the first one are zeroed. A common choice to avoid cancellation errors when x_1 ≈ ‖x‖ is to pose u = x + sign(x_1)‖x‖e_1. The method can be generalized to the k-th component as follows. For the generic index k ∈ {1, …, n}, let us define u = (0, …, 0, x_k + sign(x_k)s, x_{k+1}, …, x_n)^T, where s = √(x_k² + … + x_n²). The resulting Householder matrix U_k, when applied to x, gives U_k x = (x_1, …, x_{k−1}, −sign(x_k)s, 0, …, 0)^T; that is, U_k leaves the first k − 1 components unchanged, changes the k-th component and zeroes all the remaining n − k components. It can be easily shown that U_k has a block form in which a submatrix Û acts only on the last n − k + 1 components of x, zeroing all of them but the k-th component. Similarly, a transformation V = I − 2v^T v/(vv^T), where v is a row vector, acts on a row vector x^T by zeroing some of its components. A useful property of U is that it is not necessary for the matrix to be explicitly formed; indeed the transformation can be written in terms of u alone, as Ux = x − (2u^T x/u^T u)u. A mathematical description of the algorithm is shown in Algorithm A1.

Algorithm A1 Householder bidiagonalization
Require: A ∈ R^{m×n}
for k = 1, …, n − 1 do
    • Determine Householder matrix U_k such that

Appendix A.2. Householder's Bidiagonalization
Given the matrix A ∈ R^{n×m} (n > m), a bidiagonal form B exists. The matrix B can be obtained from A by the successive orthogonal transformations B = U_m ⋯ U_1 A V_1 ⋯ V_{m−2}, where the U_k, V_k are Householder matrices. In particular, at the k-th step: (i) a Householder matrix U_k is defined to zero the last n − k components of the k-th column of B; (ii) a Householder matrix V_k is defined to zero the last m − k − 1 components of the k-th row of B.
At the end of the process the bidiagonalization (A11) is achieved with U = U_m ⋯ U_1, V = V_1 ⋯ V_{m−2}. The computational cost of this process is about O(n³).

Appendix A.3. Jacobi Rotation
The Householder transformation is useful for zeroing a number of components of a vector; however, when elements must be zeroed more selectively, the Jacobi (or Givens) rotation can zero a single selected component. It is based on the Jacobi matrix, also called the Givens matrix, denoted by J(p, q, θ): the identity matrix modified so that its (p, p) and (q, q) entries equal c, its (p, q) entry equals s and its (q, p) entry equals −s, where c = cos θ and s = sin θ. Premultiplication of a vector by J(p, q, θ)^T corresponds to a counterclockwise rotation of θ in the (p, q) plane that zeroes the q-th component of the resulting vector y. Indeed, if x ∈ R^n and y = J(p, q, θ)^T x, then y_p = c x_p − s x_q, y_q = s x_p + c x_q, and y_i = x_i for i ≠ p, q. From (A16) it is clear that y_q can be forced to zero by setting c = x_p/√(x_p² + x_q²), s = −x_q/√(x_p² + x_q²). The Jacobi matrix, when applied as a similarity transformation to a symmetric matrix A, rotates rows and columns p and q of A through the angle θ so that the (p, q) and (q, p) entries are zeroed.

Appendix A.4. QR Factorization
This factorization is a fundamental step in QR iteration algorithms. The QR factorization of an (m × n) matrix A is given by A = QR, where Q ∈ R^{m×m} is orthogonal and R ∈ R^{m×n} is upper triangular.

Appendix A.5. QR Iteration
This algorithm is based on the QR factorization and the "power method". Assuming A ∈ R^{n×n} is a symmetric matrix, by the symmetric Schur decomposition there exists a real orthogonal Q such that Q^T A Q = diag(λ_1, …, λ_n). Given a vector q^(0) ∈ R^n such that ‖q^(0)‖ = 1, the power method produces a sequence of vectors q^(k) as follows: z^(k) = A q^(k−1), q^(k) = z^(k)/‖z^(k)‖, λ^(k) = (q^(k))^T A q^(k). The power method states that if q^(0) is not orthogonal to the dominant eigenvector and λ_1 > λ_2 ≥ … ≥ λ_n, then q^(k) converges to an eigenvector and λ^(k) to the corresponding eigenvalue.
This method can be generalized to solve the eigenvalue problem of a symmetric matrix. To this end, let us consider an (n × n) matrix Q_0 with orthonormal columns and a sequence of matrices {Q_k} generated as follows: Q_k R_k = A Q_{k−1}, where the QR factorization is applied at each step to obtain the matrices Q_k and R_k; then (A23) defines the so-called orthogonal iteration. It can be shown [62] that the matrices T_k defined by T_k = Q_k^T A Q_k converge to a diagonal form whose diagonal values converge to {λ_1, …, λ_n}. From the definition of T_{k−1} we have T_{k−1} = Q_{k−1}^T A Q_{k−1} = Q_{k−1}^T Q_k R_k, where the QR factorization of A Q_{k−1} has been applied. Similarly, using (A23) and the orthogonality of Q_{k−1}, one gets T_k = R_k Q_{k−1}^T Q_k. Defining U_k = Q_{k−1}^T Q_k, the algorithm (A23) can be rewritten as T_{k−1} = U_k R_k, T_k = R_k U_k, and T_k converges to the diagonal form (A21). The iteration (A27) establishes the so-called QR iteration algorithm for symmetric matrices.
The main limitation of the QR algorithm is that it is only valid for symmetric matrices, such as A^T A; thus the method cannot be directly applied to the matrix A.
Appendix A.6. Implicit QR Method with Shift

The symmetric QR algorithm (A27) can be made more efficient in two ways: (i) by choosing U_0 such that U_0^T A U_0 = T_0 is tridiagonal. In this way all T_k in (A27) are tridiagonal, and this reduces the complexity of the algorithm to O(n²); once Householder's algorithm is applied to the matrix A, giving the bidiagonal matrix B̄, the tridiagonal form T_0 can be easily obtained as T_0 = B̄^T B̄. (ii) by introducing a shift µ in the iteration (A27), i.e., T_{k−1} − µI = U_k R_k, T_k = R_k U_k + µI, where µ is a good approximate eigenvalue; with this change the convergence to diagonal form proceeds at a cubic rate. This result is based on the following facts: (a) if s ∈ R and T − sI = QR is the shifted version of T, then T⁺ = RQ + sI is also tridiagonal; (b) if s is an eigenvalue of T, s ∈ λ(T), the last column of T⁺ equals s e_n = s(0 … 1)^T, that is T⁺(n, n) = s. An effective choice is to shift by the eigenvalue of the trailing 2 × 2 submatrix

[ a_{n−1}  b_{n−1} ]
[ b_{n−1}  a_n     ]   (A31)

closest to a_n, known as the Wilkinson shift and given by µ = a_n + d − sign(d)√(d² + b²_{n−1}), d = (a_{n−1} − a_n)/2. (A32) If µ is a good approximation of the eigenvalue s, then the term b_{n−1} will be smaller after a QR step with shift µ. It has been shown [63] that with this shift strategy, (A28) is cubically convergent.
A pseudo-code of the algorithm is shown in Algorithm A2.

Algorithm A2 QR iteration with shift
Require: A ∈ R^{m×n}
Apply Algorithm A1 to obtain the bidiagonal B̄
T = B̄^T B̄ tridiagonal
for k = 1, … do

It is possible to execute the transition to T⁺ = RU + µI without explicitly forming the matrix T − µI, thus giving the implicit shift version [62]. This is achieved by a Givens rotation matrix J_1, in which c = cos θ and s = sin θ are chosen so that the first column of J_1^T(T − µI) is a multiple of e_1. However, if we set J_1 = J(1, 2, θ), the product J_1^T T J_1 is tridiagonal except for two nonzero elements "+" appearing outside the tridiagonal band. To "chase" these unwanted elements, we can apply rotations J_2, …, J_{n−1} of the form J_i = J(i, i + 1, θ_i), i = 2, …, n − 1, such that if Z = J_1 J_2 ⋯ J_{n−1} then Z^T T Z is tridiagonal. In such a way, it can be shown that the tridiagonal matrix produced by this implicit shift technique is the same as the tridiagonal matrix obtained by the explicit method. A description of the implicit QR method with shift is reported in Algorithm A3, while the pseudo-code for the Golub-Reinsch algorithm is described in Algorithm 1.

Appendix A.7. QR Iteration with Zero-Shift
This algorithm, called the implicit zero-shift QR algorithm, is a variation of the standard QR method with shift: it corresponds to the standard algorithm when σ = 0, and it computes all the singular values of a bidiagonal matrix with guaranteed high relative accuracy.
To show the algorithm, let us take σ = 0 and refer to a 4 × 4 matrix example. From (A26) one gets tan θ_1 = −b_12/b_11, so that after the first rotation the (1, 2) entry is zero; as this zero propagates through the rest of the algorithm, it is the key to its effectiveness. After the rotation by J_2, the submatrix consisting of rows 1 and 2 and columns 2 and 3 is rank one. Postmultiplication by J_3 to zero out the (1, 3) entry will therefore also zero out the (2, 3) entry. Rotation by J_4 repeats the situation: the submatrix of J_4 J_2 B J_1 J_3 consisting of rows 2 and 3 and columns 3 and 4 is rank one, and rotation by J_5 zeroes out the (3, 4) entry as well as the (2, 4) entry. This pattern repeats itself along the length of the matrix. Thus at each step of the zero-shift algorithm a transformation is applied which takes f and g as input and returns r, cs = cos θ and sn = sin θ such that (cs  sn; −sn  cs)(f  g)^T = (r  0)^T.