RcdMathLib: An Open Source Software Library for Computing on Resource-Limited Devices

We developed an open source library called RcdMathLib for solving multivariate linear and nonlinear systems. RcdMathLib supports on-the-fly computing on low-cost and resource-constrained devices, e.g., microcontrollers. The decentralized processing is a step towards ubiquitous computing enabling the implementation of Internet of Things (IoT) applications. RcdMathLib has a modular, layer-based design, whereby different modules allow for algebraic operations such as vector and matrix operations or decompositions. RcdMathLib also comprises a utilities-module providing sorting and filtering algorithms as well as methods generating random variables. It enables solving linear and nonlinear equations based on efficient decomposition approaches such as the Singular Value Decomposition (SVD) algorithm. The open source library also provides optimization methods such as the Gauss-Newton and Levenberg-Marquardt algorithms for solving regression smoothing and curve-fitting problems. Furthermore, a positioning module permits computing the positions of IoT devices using algorithms such as trilateration. This module also enables the optimization of the position by applying a method that reduces multipath errors on the mobile device. The library is implemented and tested on resource-limited IoT devices as well as on full-fledged operating systems. The open source software library is hosted on a GitLab repository.


Introduction
Algorithms and scientific computing are the workhorses of many numerical libraries that help users solve technical and scientific problems. These libraries rely on mathematics and numerical algebraic computations, which contribute to a growing body of research in engineering and computational science, leading to new disciplines and academic interests. The use of computers has accelerated this trend as well as enhanced the deployment of numerical libraries and approaches in the scientific and engineering communities. Originally, computers were built for numerical and scientific applications. Konrad Zuse built a mechanical computer in 1938 to perform repetitive and cumbersome calculations. A specific problem from the area of structural engineering requires performing tedious calculations to design load-bearing structures by solving systems of linear equations [1]. Howard Aiken independently developed an electro-mechanical computing device that could execute predetermined commands typed on a keyboard in the notation of mathematics and translated into numerical codes. These were stored on punched cards and perforated magnetic tapes or drums [2,3].
The first software libraries of numerical algorithms were developed in the programming language ALGOL 60, including procedures for solving linear systems of equations. The main contributions of this work are the following:
• A module for solving linear equation systems based on efficient algorithms such as the Householder or the Moore-Penrose inverse. The Moore-Penrose inverse is implemented by using the SVD method.
• A module for solving multivariate nonlinear equation systems as well as optimization and curve-fitting problems on a resource-constrained device on the basis of the SVD algorithm.
• A utilities-module providing various algorithms such as the Shell sort algorithm or the Box-Muller method to generate normally distributed random variables.
• A localization module for positioning systems that use distance measurements or DC-pulsed, magnetic signals. This module enables an adaptive, optimized localization of mobile devices.
• A software routine to locally reduce multipath errors on mobile devices.
Guckenheimer perceived that we are conducting increasingly complex computations built upon the assumption that the underlying numerical approaches are complete and reliable, while ignoring numerical analysis when relying on mathematical software packages [15]. Thus, the user must be aware of the limitations of the algorithms and be able to choose appropriate numerical methods, since the use of inappropriate methods can lead to incorrect results. Therefore, we briefly address certain difficulties in using linear algebra algorithms.
The remainder of this article is structured as follows: Firstly, we review related work in Section 2. We present the architecture as well as describe the modules of the software library in Section 3. We introduce the implementation issues in Section 4 and the usage of the RcdMathLib in Section 5. In Section 6, we evaluate the algorithms on a resource-limited device and on a low-cost, single-board computer. Finally, we conclude our article and give an outlook on future work in Section 7.

Related Work
Computing software, especially for linear algebra, is indispensable in science and engineering for the implementation of applications like text classification or speech recognition. To the best of our knowledge, there are few mathematical libraries for resource-limited devices, and most of them are limited to simple algebraic operations.
Libraries for numerical computation such as the GNU Scientific Library (GSL) are suitable for Digital Signal Processors (DSPs) or Linux-based embedded systems. For example, the commercially available Arm Performance Libraries (ARMPL) offer basic linear algebra subprograms, fast Fourier transform routines for real and complex data as well as some mathematical routines such as exponential, power, and logarithmic routines. Nonetheless, these routines do not support resource-constrained devices such as microcontrollers [16].
The C standard mathematical library includes mathematical functions defined in <math.h>. This mathematical library is widely used for microcontrollers, since it is a part of the C compiler. It provides only basic mathematical functions for instance trigonometric functions (e.g., sin, cos) or exponentiation and logarithmic functions (e.g., exp, log) [17].
Our research shows that there are very few attempts to build numerical computation libraries that can run on microcontrollers: Texas Instruments® (Dallas, TX, USA) provides the IQmath and Qmath libraries for its MSP430 and MSP432 devices, which contain a collection of mathematical routines for C programmers. However, this collection is restricted to basic mathematical functions such as trigonometric and logarithmic functions [18].
Libfixmatrix is a matrix computation library for microcontrollers. This library includes basic matrix operations such as multiplication, addition, and transposition. Equation solving and matrix inversion are implemented by the QR decomposition. Libfixmatrix is only suitable for tasks involving small matrices [19].
MicroBLAS is a simple, tiny, and efficient library designed for PC and microcontrollers. It provides basic linear algebra subprograms on vectors and matrices [20].
Other libraries for microcontrollers are the MatrixMath and BasicLinearAlgebra libraries. However, these libraries offer limited functionalities and are restricted to the Arduino platform [21,22].
The Python programming language is becoming widely used in scientific and numeric computing. The Python package NumPy (Numeric Python) is used for manipulating large arrays and matrices of numeric data [23]. Scientific Python (SciPy) extends the functionality of NumPy with numerous mathematical algorithms [24]. Python is widely used on PCs and single-board computers such as the Raspberry Pi. A new programming language largely compatible with Python, called MicroPython, is optimized to run on microcontrollers [25]. MicroPython includes the math and cmath libraries, which are restricted to basic mathematical functions. The mainline kernel of MicroPython supports only the ARM Cortex-M processor family. CircuitPython is a derivative of MicroPython created to support boards like the Gemma M0 [26]. Various mathematical libraries are outlined in Table 1, which reveals their capabilities, limitations, and supported platforms.

Library Architecture and Description
RcdMathLib has a pyramidal and modular architecture as illustrated in Figure 1, whereby each layer rests upon the underlying layers. For example, the linear algebra module layer rests on the basic algebraic module layer. Each module layer is composed of several submodules such as the matrix or vector submodules. The submodules can be built up from the underlying submodules; for example, the pseudo-inverse submodule is based on the SVD, the Householder, and the Givens submodules. For the sake of brevity, Figure 1 presents only the sublayers. The software layers are briefly addressed in Sections 3.1 through 3.3.

Linear Algebra Module Layer
The module layer of linear algebra is composed of the following submodules:
• Basic operations submodule: provides algebraic operations such as addition or multiplication of vectors or matrices. This submodule distinguishes between vector and matrix operations.
• Matrix decomposition submodule: allows for the decomposition of matrices by using algorithms such as Givens, Householder, or the SVD. The SVD method is implemented using the Golub-Kahan-Reinsch algorithm [27,28].
• Pseudo-inverse submodule: enables the computation of the inverse of quadratic as well as of rectangular matrices. The matrix inverse can be calculated by using the Moore-Penrose, Givens, or Householder algorithms [29].
• Linear solve submodule: permits the solution of under-determined and over-determined linear equation systems. We solve the linear equation systems using two matrix decompositions: the SVD and QR factorizations. The first method uses the Moore-Penrose inverse, while the second approach applies the Householder or the Givens algorithms in combination with the back substitution method. We also provide the Gaussian Elimination (GE) with a pivoting algorithm, which is based on the LU decomposition. We use the GE-based method only for testing purposes or for devices with very limited stack memory. We suggest using the SVD- or the QR-based methods due to their numerical stability and the support of non-quadratic matrices [30].
• Utilities submodule: offers filtering algorithms such as median, average, or moving average. Furthermore, it provides the Shell sort algorithm to put elements of a vector in a certain order as well as the Box-Muller method to generate normally distributed random variables [31,32].

Non-Linear Algebra Module Layer
The nonlinear algebra module includes the following submodules:
• Optimization submodule: enables the optimization of an approximate solution by using Nonlinear Least Squares (NLS) methods such as the modified Gauss-Newton (GN) or the LVM algorithms. These methods are iterative and need a start value as an approximate solution. Moreover, the user should give a pointer to an error function and to a Jacobian matrix. The modified GN and the LVM algorithms are briefly described in Sections 3.2.1 and 3.2.2.
• Nonlinear equations submodule: allows for the solution of multivariate nonlinear equation systems by using the Newton-Raphson and damped Newton-Raphson methods [33]. The user must deliver a start value as well as a pointer to the nonlinear equation system to solve, and a pointer to the appropriate Jacobian matrix.

Gauss-Newton Algorithm
The Gauss-Newton algorithm works iteratively to find the solution x that minimizes the sum of the squared errors. During the iteration process, we cache the value of x with the minimal sum of squares to prevent the divergence of the GN algorithm [34]. The solution computed by the GN can be used as a start value for the subsequent LVM algorithm if the start value is unknown or the GN algorithm diverges.

Levenberg-Marquardt Algorithm
The LVM algorithm is also a numerical optimization approach enabling the solution of NLS problems [35-37]. The LVM algorithm can be used for optimization or fitting problems. The LVM method proceeds iteratively as follows:

x^(k+1) = x^(k) + s^(k),    (1)

where x^(k) is the k-th approximation of the searched solution and s^(k) is the k-th error correction vector. The LVM improves the approximate solution in each iteration step by calculating the correction vector s^(k) as follows [38,39]:

(J_f^T J_f + µ^2 I) s^(k) = -J_f^T f(x^(k)),    (2)

where µ is the damping parameter, f is the error function, and J_f is the Jacobian matrix.
The LVM algorithm has the advantage over the GN method that the matrix on the left side of Equation (2) is no longer singular. This is accomplished by the term µ^2 I regularizing the matrix J_f^T J_f. The LVM method is described in Algorithm 1.

Localization Module Layer
Localization of users or IoT devices is indispensable for Localization-Based Services (LBSs) such as tracking in smart buildings, advertising in shopping centers, or routing and navigation in large public buildings [40]. Indoor Localization Systems (ILSs) are used to locate IoT or mobile devices inside environments where the Global Positioning System (GPS) cannot be deployed. Numerous technologies have been evaluated for ILSs, for example, Ultra-Wideband (UWB) [41], Wireless Local Area Network (WLAN) [42], ultrasound [43], magnetic signals [44], or Bluetooth [45].

Introduction and Layer Description
The Localization Module (LM) layer provides algorithms to calculate as well as to optimize a position in an ILS. At the current stage of development, we provide two example applications of the LM as submodules: a distance-based as well as a DC-pulsed, magnetic-based submodule of ILSs. We also offer a common positioning submodule comprising shared algorithms like the trilateration algorithm [46]. Figure 2 illustrates the principle of a distance-based ILS composed of four anchors with known positions and one mobile device. The distances between the anchors and the mobile device can be measured by using UWB or ultrasound sensors. The mobile device performs distance measurements to the four anchors. Furthermore, the collected distances are preprocessed using the median filter from the utilities submodule to remove outliers (see Section 3.1). Finally, the mobile device locally calculates a three-dimensional position using the trilateration algorithm provided by the RcdMathLib. The trilateration algorithm computes the position of an unknown point with the coordinates (x, y, z) from the distances d_i to the reference positions (x_i, y_i, z_i) for i = 1, 2, ..., n. This problem requires the estimation of a vector x = (w, x, y, z), with w = x^2 + y^2 + z^2, such that:

A x = b,

where the matrix A and the vector b have the following forms [46-48]: the i-th row of A is (1, -2x_i, -2y_i, -2z_i) and the i-th entry of b is d_i^2 - x_i^2 - y_i^2 - z_i^2. The solution x is given by:

x = A^+ b,

where A^+ is the pseudo-inverse of the matrix A. The pseudo-inverse matrix A^+ is computed by using the pseudo-inverse submodule of the RcdMathLib (see Section 3.1). The quality q of the calculated position x is given by the norm of the residual:

q = ||A x - b||_2.

Figure 3 illustrates the principle of a magnetic-based ILS composed of various coils with known positions and a mobile device. This system enables the calculation of the position of the mobile device by measuring the magnetic fields of the coils serving as anchors. The magnetic signals are artificially generated from the coils using a pulsed direct current.
The collected measurement data are preprocessed by the utilities-submodule to remove outliers and to calibrate the magnetic data. Finally, the position is calculated on the mobile device. The magnetic field B_i generated by the coil i is equal to [47,49]:

B_i = (K / r_i^3) sqrt(3 sin^2(θ_i) + 1).    (8)

In this setting, K = µ_0 N_t I F / (4π), where N_t describes the number of turns of the wire, I is the current running through the coil, F expresses the base area of the coil, µ_0 is the permeability of free space, r_i is the distance between the mobile device and coil i, and θ_i is the mobile device elevation angle relative to the coil plane. The distance r_i and the elevation angle θ_i are equal to:

r_i = sqrt((x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2),   θ_i = arcsin((z - z_i) / r_i).

Equation (8), evaluated for each coil, is a nonlinear equation system with the unknown coordinates x, y, and z, which can be solved by applying the LVM algorithm from the optimization-submodule of the RcdMathLib, whereby (x_i, y_i, z_i) and (x, y, z) are the coordinates of the i-th coil and the mobile device, respectively.

Multipath Distance Detection and Mitigation and Position Optimization Algorithm
The localization module is not restricted to simple position calculations but also performs complex tasks such as Multipath Distance Detection and Mitigation (MDDM) by using other modules of the RcdMathLib. The MDDM algorithm reduces the effects of multipath fading on the distance measurements from a mobile device to Reference Stations (RSs) with known locations. The MDDM approach is based on the Robust Position Estimation in Ultrasound Systems (RoPEUS) algorithm [50]. The MDDM is adapted for precise distance-based ILSs such as UWB-based localization systems, while it is simultaneously optimized for resource-limited devices [12]. The MDDM algorithm is summarized in Algorithm 2.

Documentation and Example Modules
RcdMathLib includes a module that provides an Application Programming Interface (API) documentation. The API documentation is in Portable Document Format (PDF) and in Hypertext Markup Language (HTML) format. It is generated from the C source code by using the Doxygen tool [51]. The software reference documentation covers the description of the implemented functions as well as their passing parameters. In addition, the example module comprises samples of each module to familiarize the users with the API. The example module has the same structure as the RcdMathLib.

Implementation Issues
Given an (m × n) matrix A and an m-vector b, the fundamental problem of linear algebra is to find an n-vector x such that A x = b. This fundamental problem emerges in various areas of science and engineering such as applied mathematics, physics, or electrical engineering [52]. Associated problems are finding the inverse, the rank, or projections of a matrix A. Attempting to solve the linear algebra problem using common theoretical approaches would face computational difficulties. For example, solving a (20 × 20) linear system with Cramer's rule, which is a significant theoretical algorithm, might take more than a million years even by using fast computers [52].
We use the Givens, the Householder, and the SVD matrix decomposition algorithms. These decomposition methods enable the solution of various problems such as the computation of the inverse of a matrix, the solution of linear equations, or the rank of a matrix. We do not use the Cholesky decomposition (A A^T), since it can become unstable due to rounding errors that grow with κ(A)^2 instead of κ(A). Although Gaussian Elimination (GE) is an efficient algorithm to implement the LU factorization, we do not use it, since GE generally requires pivoting and is limited to square matrices. Furthermore, GE can be unstable in certain contrived cases; nonetheless, it performs well for the majority of practical problems [53,54]. We do not use the Classical Gram-Schmidt (CGS) and the Modified Gram-Schmidt (MGS) algorithms to implement the QR decomposition due to their numerical instability. Instead, we use the Householder and the Givens algorithms. Even though the Householder method is more efficient than the Givens algorithm, the Givens method is easier to parallelize. The Givens and the Householder methods have guaranteed stability but fail if the matrix is nearly rank-deficient or singular. The SVD algorithm can be used to avoid the rank-deficiency problem. This algorithm cannot be computed reliably by determining the eigenvalues of the symmetric matrix A^T A due to the round-off errors in the calculation of the matrix A^T A. Therefore, we implement the SVD by using the Golub-Kahan-Reinsch (GKR) algorithm, which is described in Section 4.1.
We calculate the pseudo-inverse matrix by using the Moore-Penrose method based on the SVD or the QR decomposition using the Householder or the Givens algorithms. The QR-based pseudo-inverse A^+ is computed as follows:

A^+ = R^(-1) Q^T,

where R^(-1) is the inverse of an upper triangular matrix, and Q^T is the transpose of an orthogonal matrix. We calculate the R^(-1) matrix by using Algorithm 3 [55]. In general, we avoid the explicit calculation of matrix multiplications such as the construction of Householder matrices (H_i A) or of Givens matrices (J_i A). The calculated triangular matrix R is stored over the matrix A. We also provide functions to avoid the explicit computation and storage of the transpose matrix, such as the function that implicitly calculates the matrix-transpose-vector multiplication (A^T x). We use the SVD algorithm to overcome the rank-deficiency problem, for example, in the modified GN, the Newton-Raphson, and the damped Newton-Raphson methods. We use the Householder instead of the SVD method in the LVM algorithm to save computing time and stack memory. This optimization is possible because of the robustness of the LVM algorithm (see Section 3.2.2). We provide the iterative Shell sort algorithm, which is suitable for resource-limited devices with a limited stack size. We use the Shell sort algorithm to implement the median filter.

Singular Value Decomposition
The SVD method has become a powerful tool for solving a wide range of problems in different application domains such as biomedical engineering, control systems, or signal processing [52]. We implemented the SVD approach based on the Golub-Kahan-Reinsch algorithm that works in two phases: a bidiagonalization of the matrix A and the reduction of the calculated bidiagonal matrix to a diagonal form.

First phase (bidiagonalization)
An (m × n) matrix A is transformed to an upper bidiagonal matrix B by using the Householder bidiagonalization, where m ≥ n. The matrix A is transformed as follows:

A = U_B [B; 0] V_B^T,

where U_B and V_B are orthogonal matrices, the n × n block B is stacked over an (m − n) × n zero block, and B is a bidiagonal matrix of the form B = bidiag(q_1, ..., q_n; e_2, ..., e_n), with the diagonal elements q_i and the super-diagonal elements e_i.

Second phase (reduction to the diagonal form)
The bidiagonal matrix B is further reduced to a diagonal matrix Σ by using orthogonal equivalence transformations as follows:

Σ = U_1^T B V_1,

where Σ is the matrix of the singular values σ_i and the matrices U_1 and V_1 are orthogonal. The singular vector matrices of A can then be computed as follows:

U = U_B diag(U_1, I_(m−n))   and   V = V_B V_1.

We implemented the first and the second phase by the Golub-Kahan bidiagonal procedure and the Golub-Reinsch algorithm, respectively. Both algorithms are described in detail in Sections 4.1.1 and 4.1.2, where we also mention some implementation issues.

Golub-Kahan Bi-Diagonal Procedure
The reduction of the matrix A to the upper bidiagonal matrix B is accomplished by using a sequence of Householder reflections, where the matrix B has the same set of singular values as the matrix A [56]. First, a Householder transformation U_01 is applied to zero out the sub-diagonal elements of the first column of the matrix A. Next, a Householder transformation V_01 is used to zero out the last (n − 2) elements of the first row by postmultiplying the matrix U_01 A: U_01 A V_01. Repeating these steps a total of n times, the matrix A is transformed to the bidiagonal form:

U_B^T A V_B = [B; 0],   with U_B = U_01 U_02 ... U_0n and V_B = V_01 V_02 ... V_0(n−2).

Golub-Reinsch Algorithm
The Golub-Reinsch algorithm is a variant of the QR iteration [28]. At each iteration i, the implicit symmetric QR algorithm is applied with the Wilkinson shift without forming the product B_i^T B_i. The algorithm has guaranteed convergence with a quite fast rate [52]. Starting from the bidiagonalization of the matrix A obtained from the previous Golub-Kahan bidiagonal procedure, the algorithm creates a sequence of bidiagonal matrices {B_i}, each with possibly smaller off-diagonals than the previous one. For simplicity, we write B = B_i and denote the diagonal elements by α_1, ..., α_n and the super-diagonal elements by β_2, ..., β_n. We calculate the Wilkinson shift σ, which is the eigenvalue λ of the trailing (2 × 2) sub-matrix of the matrix C = B^T B,

( c_(n−1,n−1)  c_(n−1,n) ; c_(n−1,n)  c_(n,n) ) = ( α_(n−1)^2 + β_(n−1)^2   α_(n−1) β_n ; α_(n−1) β_n   α_n^2 + β_n^2 ),

that is closer to c_(n,n) = α_n^2 + β_n^2. G. H. Golub and C. F. Van Loan suggest calculating the Wilkinson shift as follows [28]:

δ = (c_(n−1,n−1) − c_(n,n)) / 2,
σ = c_(n,n) − c_(n−1,n)^2 / (δ + sign(δ) sqrt(δ^2 + c_(n−1,n)^2)).

We calculate c_1 and s_1 such that

( c_1  s_1 ; −s_1  c_1 )^T ( α_1^2 − σ ; α_1 β_2 ) = ( * ; 0 )

and form the Givens rotation V_1 from them.
We apply the Givens rotation V_1 to the right of the matrix B. The bidiagonal form is destroyed by an unwanted non-zero element (bulge) below the diagonal. Therefore, we apply the Givens rotations U_1, V_2, U_2, ..., V_(n−1), and U_(n−1) to chase the bulge.
We apply a Givens transformation U_1 to the left of the matrix B V_1 to eliminate the unwanted sub-diagonal element. This reintroduces a bulge in the first row to the right of the super-diagonal element. We apply the Givens rotation V_2 to remove the bulge in the matrix U_1 B V_1. This introduces a new bulge into the sub-diagonal of the third row, which is eliminated by the Givens rotation U_2. The matrix pair (V_3, U_3) terminates the chasing process and delivers a new bidiagonal matrix B~. In general, the chasing process creates a new bidiagonal matrix B~ that is related to the matrix B as follows [28]:

B~ = U~^T B V~,    (29)

where U~ and V~ are orthogonal. During the chasing process, we distinguish between the splitting, the cancellation, and the negligibility steps [57]. At the i-th iteration, we assume that the matrix B~ has the diagonal elements q_1, ..., q_n and the super-diagonal elements e_2, ..., e_n.
Splitting: If the matrix entry e_i is equal to zero, we split the matrix B~ into two block-diagonal matrices B~_1 and B~_2 whose singular values can be computed independently; in this case, we compute the matrix B~_2 first. If the split occurs at i equal to n, then the matrix B~_2 is equal to q_n and q_n is a singular value.
Cancellation: If the matrix entry q_i is equal to zero, we split the matrix B~ by using Givens rotations from the left to zero out row i, whereby the bulge b is removed by using the Givens rotations for k = i + 2, ..., n.
Since the matrix element e i+1 is equal to zero, the matrix splits again into two block diagonal sub-matrices (see the splitting step).

Negligibility:
The values of the matrix elements e_i or q_i will be small but not exactly zero due to the finite-precision arithmetic used by digital processors. Therefore, we require a threshold to decide when the elements e_i or q_i can be considered zero. Golub and Reinsch [58] recommend the following threshold rule:

|e_i| ≤ ε max_j (|q_j| + |e_j|),

where ε is the machine precision. Björck [27] suggests the following approach:

|e_i| ≤ 0.5 ε (|q_(i−1)| + |q_i|),    (35)
|q_i| ≤ 0.5 ε (|e_i| + |e_(i+1)|).    (36)

Linpack [5] uses a variant of Björck's approach that omits the factor 0.5 in Equations (35) and (36).

Usage of the RcdMathLib
RcdMathLib can be used on PCs, on resource-constrained devices such as microcontrollers, or on small single-board computers like Raspberry Pi devices. It is free software, available under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, version 2.1 (LGPLv2.1) [59]. The RcdMathLib software is written in the C programming language by using the GNU Compiler Collection (GCC) for full-fledged devices and embedded toolchains for resource-limited devices, for instance, the GNU ARM Embedded Toolchain. RcdMathLib can also be used on top of an Operating System (OS) for resource-constrained devices with minimal effort due to the modular architecture of the library. We support RIOT-OS, which is an open source IoT OS [60]. RcdMathLib is interfaced with RIOT-OS using the GNU Make utility, whereby the user only needs to choose the modules needed by setting the USE_MODULE macro. We automatically calculate the dependencies of the modules needed, and the user can choose between single-precision and double-precision floating point depending on the available stack memory. An OS for resource-limited devices is recommended for the use of the RcdMathLib, but it is not required. A minimum stack size of 2560 bytes is needed to compute with floating-point numbers. The printf() function needs extra stack memory; therefore, a minimum stack size of 4608 bytes is required to work with double-precision floating-point arithmetic. We recommend a stack size of 8192 bytes.
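In a RIOT-OS application Makefile, selecting library functionality then reduces to a few USE_MODULE lines. The module names below are illustrative placeholders chosen for this sketch, not necessarily RcdMathLib's actual module identifiers:

```makefile
# Application Makefile fragment (RIOT-OS style build).
# Module names are hypothetical placeholders, for illustration only.
USE_MODULE += basic_operations       # vector and matrix arithmetic
USE_MODULE += matrix_decompositions  # Givens, Householder, SVD
USE_MODULE += solve_linear_equations

# Dependencies between the selected modules are resolved automatically
# by the library's build integration, as described above.
```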
The Linaro toolchain can be used on Linux or Windows to build applications for a target platform [61]. The OpenOCD can be used for flashing the code to the target (chip) as well as for low level or source level debugging [62]. The source code as well as the documentation (APIs) of the RcdMathLib can be downloaded from the GitLab repository [63]. The wikis are available on the homepage of the library to get started with the RcdMathLib [9].

Simple Example
We present a simple example to demonstrate how to use the RcdMathLib by defining two (3, 4) matrices in Listing 1. We create a matrix with values equal to π on the main diagonal by calling the "matrix_get_diag_mat()" function. In the next step, we calculate the transpose of the second matrix by invoking the "matrix_get_transpose()" function. Finally, we multiply the first matrix with the transpose of the second matrix by executing the "matrix_mul()" function. In all three cases, the user should deliver the dimensions of the matrices as well as a reference to the resulting matrix. The outputs of the calculated results are presented in Listing 2.

Evaluation of the Algorithms
We evaluated the linear as well as the nonlinear algebra modules on an STM32F407 Microcontroller Unit (MCU) based on the ARM Cortex-M4 core, operating at 168 MHz and having a memory capacity of 192 KB of RAM. In order to demonstrate the scalability of the algorithms implemented, we also evaluated the same algorithms on a Raspberry Pi 3, which has more computing capacity (quad-core 1.2 GHz and 1 GB of RAM) than the STM32F4 MCU.

Evaluation of the Linear Algebra Module
We evaluated the linear algebra module by using an (m × n) matrix A and a vector b with uniformly distributed random numbers. The aim is to calculate and measure the mean execution time of the methods for solving linear equation systems. We evaluated the SVD-, QR-, and LU-based algorithms for solving linear equation systems described in Section 3.1. The determined and the over-determined linear equation systems can be represented by the colon notation as follows:

A(1 : i, 1 : n) x = b(1 : i),

where 2 ≤ i ≤ n for the determined case and n + 1 ≤ i ≤ m for the over-determined case, and A(1 : i, 1 : n) is the sub-matrix of A with rows 1 up to i and columns 1 up to n. We use the same format as the corresponding colon notation in MATLAB. The determined and the over-determined linear equation systems are illustrated in matrix form in Equations (39) and (40), respectively.
Version February 15, 2021, submitted to Sensors.
Table 2. Complexity of the algorithms evaluated.

Algorithm | Complexity [flops]
Matrix multiplication A_(m,n) × B_(n,p) | mp(2n − 1)
QR-Householder | 2n^2 (m − n/3)

The horizontal and vertical dotted lines enclose the square and rectangular submatrices in Equations (39) and (40). We set the maximal row (m) and column (n) numbers to 10 and 5. We also measured the mean execution time of the matrix multiplication by using the same matrix A m,n and a matrix B (n,p) initialized with uniformly distributed random numbers. The row and column number of the matrix B are equal to 5 and 10. The row number (m) of the matrix A varies from 1 to 10. The linear equation systems are solved by using three different decomposition algorithms: the Golub-Kahan-Reinsch, Householder, and GE with pivoting. We measured the execution time of matrix multiplications as well as of the methods for solving linear equation systems on the STM32F4 MCU and the Raspberry Pi 3.
For solving linear equation systems, Figure 4 compares the execution time of the following algorithms: GE with pivoting, the Householder, and the Golub-Kahan-Reinsch. Furthermore, Figure 4 represents the execution time of the matrix multiplications. Figure 4a,b illustrate the execution time of these algorithms as a function of the row number of the matrix on the STM32F4 MCU and the Raspberry Pi 3. The Golub-Kahan-Reinsch-based algorithm has the largest execution time, since it is more expensive than the other algorithms (see Table 2). Table 3 summarizes the mean execution time measured when calculating an A_(10,5) × B_(5,7) matrix multiplication or solving a (7, 5) linear equation system. These execution times were measured on the STM32F4 MCU and the Raspberry Pi 3. The Raspberry Pi 3 outperforms the STM32F4 MCU, as expected, due to the limited computing capacity of the STM32F4 MCU. However, the execution time for finding a solution applying the Golub-Kahan-Reinsch algorithm remains in the microsecond range on the STM32F4 MCU, which is sufficient for many IoT applications.

Evaluation of the Non-Linear Algebra Module
The nonlinear algebra module is evaluated using exponential and sinusoidal data [64]. Least-squares optimization problems are solved using the modified Gauss-Newton (GN) and Levenberg-Marquardt (LVM) methods described in Sections 3.2.1 and 3.2.2.

Evaluation with Exponential Data
Given the model function g(x, t), the aim is to find the parameters (x₁, x₂) that most accurately match the model by minimizing the sum of squares of the error functions fᵢ, where fᵢ computes the i-th residual value. The error functions are collected in the vector f, from which the Jacobian matrix is derived. The initial square residual ‖f(x₀)‖₂² is equal to 127.309. Using the GN algorithm, we obtain the solution x₃ = [7.000093, 0.262078]ᵀ after three iterations; the corresponding square residual ‖f(x₃)‖₂² is equal to 6.013, which indicates the improvement of the model. Using the LVM algorithm, we obtain the solution x₃ = [7.000090, 0.262078]ᵀ after three iterations. The LVM algorithm thus shows nearly the same behavior as the modified GN method, which is confirmed by Figure 5. Table 4 summarizes the average time per iteration required by the GN and LVM algorithms on the STM32F4 MCU and the Raspberry Pi 3.
Table 4. Mean execution time of the Gauss-Newton and Levenberg-Marquardt methods (per iteration) using exponential data.

Evaluation with Sinusoidal Data
The model function is a sinusoidal function of the parameters x and the time t.
Thus, the Jacobian matrix is calculated using the partial derivatives in Equation (44). The initial square residual ‖f(x₀)‖₂² is equal to 40.048. After one iteration, the LVM algorithm (‖f(x₁)‖₂² = 13.805) is slightly more efficient than the GN method (‖f(x₁)‖₂² = 13.810). Both algorithms show the same behavior after two iterations. Figure 6 shows the sinusoidal model after three iterations using the GN and LVM algorithms. Table 5 summarizes the average time per iteration required by the GN and LVM approaches on the STM32F4 MCU and the Raspberry Pi 3.
Table 5. Mean execution time of the Gauss-Newton and Levenberg-Marquardt methods (per iteration) using sinusoidal data.

Conclusions and Outlook
We presented an open source library for linear and nonlinear algebra as well as an application module for localization that is suitable for resource-limited, mobile, and embedded systems. This library permits the solution of linear equations and matrix operations such as matrix decomposition, the calculation of the inverse, or the rank of a matrix. It provides various utility algorithms, such as sorting and filtering. The software also enables solving nonlinear problems like curve fitting or nonlinear equations. RcdMathLib allows for the localization of mobile devices using localization algorithms such as trilateration. The position estimate can be further refined by an adaptive optimization algorithm based on the SVD method. The localization module also supports localizing a mobile device in NLoS scenarios by means of a multipath distance detection and mitigation algorithm. RcdMathLib can serve as a basis for artificial intelligence techniques for mobile technologies in the IoT, or as a tool in research and industry. Therefore, we intend to extend RcdMathLib with machine learning or digital signal processing algorithms. We also aim to extend the package with an additional algorithm for solving nonlinear problems, the Landweber method [65].
Author Contributions: Z.K. conceived the research, designed the software architecture, and implemented the software components of the RcdMathLib. Furthermore, he integrated the software components of the RcdMathLib in RIOT-OS, performed the experiments and data evaluation, and wrote all parts of the article. A.N. contributed valuable tests and evaluations of the algorithms. He helped with the design of the localization algorithms. He wrote the "related work" part and contributed to the writing of the article part "Library Architecture and Description". M.G. gave valuable suggestions and offered plenty of comments and useful discussions on the paper. J.S. gave valuable suggestions on the paper and reviewed the article. C.M. gave suggestions on the article and reviewed it. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement: The source code can be downloaded from the GitLab repository in Reference [63].