Least Squares Optimization: from Theory to Practice

Nowadays, Non-Linear Least-Squares embodies the foundation of many Robotics and Computer Vision systems. The research community deeply investigated this topic in the last years, and this resulted in the development of several open-source solvers to approach constantly increasing classes of problems. In this work, we propose a unified methodology to design and develop efficient Least-Squares Optimization algorithms, focusing on the structures and patterns of each specific domain. Furthermore, we present a novel open-source optimization system, that addresses transparently problems with a different structure and designed to be easy to extend. The system is written in modern C++ and can run efficiently on embedded systems. Source code: https://srrg.gitlab.io/srrg2-solver.html. We validated our approach by conducting comparative experiments on several problems using standard datasets. The results show that our system achieves state-of-the-art performances in all tested scenarios.


I. INTRODUCTION
Iterative Least-Squares (ILS) solvers are core building blocks of many robotic applications, systems and subsystems [1]. This technique has been traditionally used for calibration [2]- [4], registration [5]- [7] and global optimization [8]- [11]. In particular, modern Simultaneous Localization and Mapping (SLAM) systems typically employ multiple ILS solvers at different levels: in computing the incremental ego motion of the sensor, in refining the localization of a robot upon loop closure and -most notably -to obtain a globally consistent map. Similarly, in several computer vision systems, ILS is used to compute/refine camera parameters, estimating the structure of a scene, the position of the camera or both. Many inference problems in robotics are effectively described by a factor graph [12], which is a graphical model expressing the joint likelihood of the known measurements with respect to a set of unknown conditional variables. Solving a factor graph requires to find the values of the variables that maximize the joint likelihood of the measurements. If the noise affecting the sensor data is Gaussian the solution of a factor graph can be computed by an ILS solver implementing variants of the well known Gauss-Newton (GN) algorithm.
The relevance of the topic has been addressed by several works such as GTSAM [9], g 2 o [8], SLAM++ [13], or the Ceres solver [10] by Google. These systems have grown over time to include comprehensive libraries of factors and variables, that can tackle a large variety of problems, and in most cases these systems can be used as black boxes. Since they typically consist of an extended codebase, entailing them to a particular application/architecture to achieve the maximum performances is a non-trivial task. In contrast, extending these systems to approach new problems is typically easier than customizing: in this case the developer has to implement some additional functionalities/classes according to the API of the system. Also in this case, however, an optimal implementation might require a reasonable knowledge of the solver internals.
We believe that at the current times, a researcher working in robotics should possess the knowledge on how to design factor graph solvers for specific problems. Having this skill enables to both effectively extend existing systems and realize custom software that utilizes the hardware at its maximum. Accordingly, the primary goal of this paper is to provide the reader with a methodology on how to mathematically define such a solver for a problem. To this extent in Sec. IV, we start by revising the nonlinear least squares by highlighting the connections between inference on conditional Gaussian distributions and ILS. In the same section, we introduce the ⊞ method introduced by Hertzberg et al. [14] to deal with non-Euclidean domains. Furthermore, we discuss on how to cope with outliers in the measurements through robust cost functions and we outline the effects of sparsity in factor graphs. We conclude the section by presenting a general methodology on how to design factors and variables that describe a problem. In Sec. V we bake up this methodology by providing examples that approach four prominent problems in Robotics: Iterative Closest Point (ICP), projective registration, Bundle Adjustment (BA), Pose-Graph Optimization (PGO).
When it comes to the implementation of a solver, several choices have to be made in the light of the problem structure, the compute architecture and the operating conditions (online or batch). In this work we characterize ILS problems, distinguishing between dense and sparse, batch and incremental, stationary and non-stationary based on their structure and application domain. In Sec. II we provide a more detailed description of these characteristics, while in Sec III we discuss how ILS has been used in the literature to approach various problems in Robotics and by highlighting how addressing a problem according to its traits leads to effective solutions.
The second orthogonal goal of this work is to propose a unifying system that deals with dense/sparse, static/dynamic, batch problems, with no apparent performance loss compared to ad-hoc solutions. We build on the ideas that are at the base of the g 2 o optimizer [8], to address some requirements arising from users and developers, namely: fast convergence, small runtime per iteration, rapid prototyping, trade-off between implementation effort and performances, and, finally, code compactness. In Sec. VI we highlight from the general algorithm outlined Sec. IV, a set of functionalities that results in a modular, decoupled and minimal design. This analysis ultimately leads to a modern compact and efficient C++ library released under BSD3 license for ILS of Factor Graphs that relies on a component model, presented in Sec VII that effectively runs on both on x86-64 and ARM platforms. To ease prototyping we offer an interactive environment to graphically configure the solver (Fig. 1a). The core library of our solver consists of no more than 6000 lines of C++ code, whereas the companion libraries implementing a large set of factors and variables for approaching problems -e.g. 2D/3D ICP, projective registration, BA, 2D/3D PGO and Pose-Landmark Graph Optimization (PLGO) and many others -is at the time of this writing below 4000 lines. Our system relies on our visual component framework, image processing and visualization libraries that contain no optimization code and consists of approximately 20000 lines. To validate our claims, we conducted extensive comparative experiments on publicly avalilable datasets (Fig. 1b) -in dense and sparse scenarios. We compared our solver with sparse approaches such as GTSAM, g 2 o and Ceres, and with dense ones, such as the well-known PCL library [15]. The experiments presented in Sec. VIII confirm that our system has performances that are on par with other state-of-the-art frameworks. Summarizing, the contribution of this work is twofold: -We present a methodology on how to design a solver for a generic class of problems, and we exemplify such a methodology by showing how it can be used to approach a relevant subset of problems in Robotics. -We propose an open-source, component-based ILS system that aims to coherently address problems having different structure, while providing state-of-the-art performances.

II. TAXONOMY OF ILS PROBLEMS
Whereas the theory on ILS is well-known, the effectiveness of an implementation greatly depends on the structure of the problem being addressed and on the operating conditions. We qualitatively distinguish between dense and sparse problems, by discriminating on the connectivity of the factor graph. A dense problem is characterized by many measurements affected by relatively few variables. This occurs in typical registration problems, where the likelihoods of the measurements (e.g. the intensities of image pixels) depend on a single variable expressing the sensor position. In contrast, sparse problems are characterized by measurements that depend only on a small subset of variables. Examples of sparse problems include PGO or BA.
A further orthogonal classification of the problems divides them in stationary and non-stationary. A problem is stationary when the measurements do not change during the iterations of the optimization. This occurs when the data-association is known a priori with sufficient certainty. Conversely, non-stationary problems admit measurements that might change during the optimization, as a result of a modification of the variables being estimated. A typical case of non-stationary problem is point registration [16], when the associations between the points in the model and in the reference are computed at each iteration based on an heuristic that depends on the current estimate of their displacement.
Finally, the problem might be extended over time by adding new variables and measurements. Several Graph-Based SLAM systems exploit this intrinsic characteristic in on-line applications, reusing the computation done while solving the original problem to determine the solution for the augmented one. We refer to a solver with this capability as an incremental solver, in contrast to batch solvers that carry on all the computation from scratch once the factor graph is augmented.
In this taxonomy we left out other crucial aspects that affect the convergence basin of the solver such as the linearity of the measurement function, or the domain of variables and measurements. Exploiting the structure of these domains has shown to provide even more efficient solutions [17], with the obvious shortcoming that they are restricted to the specific problem they are designed to address.
Using a sparse stationary solver on a dense non-stationary problem results in carrying on useless computations that hinders the usability of the system. Using a dense dynamic solver to approach a sparse stationary problem presents similar issues. State-of-the-art open-source solvers like the ones mentioned in Sec I focus on sparse stationary or incremental problems. Dense solvers are usually within the application/library using it and tightly coupled to it. On the one hand, this allows to reduce the time per iteration, while on the other hand it results in avoidable code replication when multiple systems are integrated. This might result in potential inconsistencies among program parts and consequent bugs.

III. RELATED WORK
In this section, we revise the use of ILS in approaching several problems in robotics, to highlight the structure and the peculiarities that each problem presents to the solver according to the taxonomy presented in Sec. II. Furthermore, we provide an overview of generic sparse solvers that are commonly used nowadays for factor graph optimization.

A. ILS in Robotics
In calibration, ILS has been used extensively since the first works appeared until these days [18]- [20]. Common works in batch calibration involve relatively small state spaces covering only the parameters to be estimated. Since these parameters condition directly or indirectly all measurements, these class of problems typically requires a dense stationary solver. When temporal calibration is required, however, the changing time offset might result in considering different  data chunks at each iterations thus requiring a dense, nonstationary solver, such as the one presented in [4].
Among the first works on pairwise shape registration relying on ILS, we find the ICP proposed by Besl and McKay [16], while Chen and Medioni [21] proposed the first ILS method for the incremental reconstruction of a 3D model from multiple range images. These methods constitute the foundation of many registration algorithms appearing during the subsequent years. In particular Lu and Milios [22] specialized ICP to operate on 2D laser scans. All these works employed dense non-stationary solvers, to estimate the robot pose that better explain the point measurements. The nonstationary aspect arises from the heuristic used to estimate the data association is based on the current pose estimate.
In the context of ICP, Censi [23] proposed an alternative metric to compute the distance between two points and an approach to estimate the information matrix from the set of correspondences [24]. Subsequently, Segal et al. [25] proposed the use of covariance matrices that better reflect the structure of the surface in computing the error. Registration has been addressed by Bieber et al. [26] for 2D scans and subsequently Magnusson et al. [27] for 3D point clouds by using a pure Newton's method relying on a Gaussian approximation of the point clouds to be registered, called Normal Distributed Transform (NDT). Serafin et. al [7] approached the problem of point cloud registration using a 6D error function encoding also the normal difference in the error vector. All the approaches mentioned so far leverage on a dense ILS solver, with the notable exception of NDT that is a second-order approach that specializes the Newton's algorithm.
In the context of Computer Vision, the p2p algorithm [28], [29] allows to find the camera transformation that minimizes the reprojection error between a set of 3D points in the scene and their 2D projections in the image. The first stage of p2p is usually conducted according to a consensus scheme that relies on an ad-hoc minimal solver requiring only 3 corre-spondences. The final stage, however typically uses a dense and stationary ILS approach, since the correspondences do not change during the iterations. When the initial guess of the camera is known with sufficient accuracy, like in Visual Odometry (VO), only the latter stage is typically used. In contrast to these feature-based solvers, Engels et. al [30] approach VO by minimizing the reprojection error between two images through dense and non-stationary ILS. Using this method requires the system to possess a reasonably good estimate of the depth for a subset of the point in the scene. Such initialization is usually obtained by estimating the transformation between two images using a combination of RANSAC and direct solvers, and then computing the depth through triangulation between the stereo pair. Della Corte et. al [31] developed a registration algorithm called MPR, which was built on this idea. As a result, MPR is able to operate on depth images capturing different cues and obtained with arbitrary projection functions. To operate online, all the registration works mentioned so far rely on adhoc dense and non-stationary ILS solvers that leverage on the specific problem's structure to reduce the computation.
The scan based ICP algorithm [22] has been subsequently employed by the same authors [32] as a building block for a system that estimates a globally consistent map. The core is to determine the relative transforms between pairwise scans through ICP. These transformations are known as constraints, and a global map is obtained by finding the position of all the scans that better explain the constraints. The process can be visualized as a graph, whose nodes are the scan positions and whose edges are the constraints, hence this problem is called PGO. Constraints can exist only between spatially close nodes, due to the limited sensor range. Hence, PGO is inherently sparse. Additionally, in the on-line case the graph is incrementally augmented as new measurements become available, rendering it incremental. We are unaware on these two aspects being exploited in the design of the underlying solver in [32]. The work of Lu and Milios inspired Borrman et al. [33] to produce an effective 3D extension.
For several years after the introduction of Graph-Based SLAM [32], the community put aside ILS approaches in favor of filtering methods relying on Gaussian [34]- [39] or Particle [40]- [43] representation of the posterior. Filtering approaches were preferred since they were regarded as more suitable to be run on-line on a moving robot with the available computational resources of the era, and the sparsity of the problem had not yet been fully exploited.
In a Graph-Based SLAM problem, it is common to have a number of variables in the order of hundreds or thousands. Such a high number of variables results in a large optimization problem that represented a challenge for the computers of the time, rendering global optimization a bottleneck of Graph-Based SLAM systems. In the remainder of this document we will refer to the global optimization module in Graph-SLAM as the back-end, in contrast to the frontend which is responsible to construct the factor graph based on the sensor measurements. Gutmann and Konolidge [44] addressed the problem of incrementally building a map, by finding topological relations and loop closures. This work was one of the first on-line implementations of Graph-Based SLAM. The core idea to reduce the computation in the back-end was to restrict the optimization to the portions of the graph having the larger errors. This insight has inspired several subsequent works [13], [45].

B. Stand-Alone ILS solvers
Whereas dense solvers are typically embedded in the specific application for performance reasons, sparse solvers are complex enough to motivate the design of generic libraries forILS. The first work to explicitly consider the sparsity of SLAM in conjunction with a direct method to solve the linear system was √ SAM , developed by Dallaert et al. [46]. Kaess et al. [45] exploited this aspect of the problem in iSAM, the second iteration of √ SAM . Here when a new edge is added to the graph, the system computes a new solution reusing part of the previous one and selectively updating the vertices. In the third iteration of the system -iSAM2 -Kaess et al. [47] exploited the Bayes Tree to solve the optimization problem without explicitly constructing the linear system. This solution is in contrast with the general trend of decoupling linearization of the problem and solution of the linear system and highlights the connections between the elimination algorithms used in the solution of a linear system and inference on graphical models. This selfcontained engine allows deal very efficiently with dynamic graph that grows during time and Gaussian densities, two typical features of the SLAM problem. The final iteration of the system, called GTSAM [9], embeds all this concepts in a single framework.
Meanwhile, Hertzberg with his thesis [48] introduced the ⊞ method to systematically deal with non-Euclidean spaces and sparse problems. This work has been at the root of the framework of Kümmerle et al. [8] -called g 2 o. This system introduces a layered architecture that allows to easily exchange sub-modules of the systeme.g. the linear solver or optimization algorithm. A further paper of Hertzberg et al. [14] extends the ⊞ method ILS to filtering. Agarwal et al. proposed in their Ceres Solver [10] a generalized framework to perform non-linear optimization. Ceres embeds state-of-the-art methodologies that take advantages of modern CPUse.g. SIMD instructions and multithreading -resulting in a complete and fast framework. One of its most relevant feature is represented by the efficient use of Automatic Differentiation (AD) [49], that consists in the algorithmic computation of derivatives starting from the error function. Further information on the topic of AD can be found in [50], [51].
In several contexts knowing the optimal value of a solution is not sufficient, and also the covariance is required. In SLAM, knowing the marginal covariances relative to a variable is fundamental to approach data-association. To this extent Kaess et. al [52] outlined the use of the elimination tree. Subsequently, Ila et al. [13] designed SLAM++, an optimization framework to estimate mean and covariance of the state by performing incremental Cholesky updates. This work takes advantage of the incremental aspect of the problem to selectively update the approximated Hessian matrix by using parallel computation.
ILS algorithms have several known drawbacks. Perhaps the most investigated aspect is the sensitivity of the solution to the initial guess, that is reflected by the convergence basin. A wrong initial guess might lead a non-linear solver to converge to an inconsistent local minimum. Convex optimization [53] is one of the possible strategies to overcome this problem, however its use is highly domain dependent. Rosen et al. [17] explored this topic, proposing a system to perform optimization of generic SE(d) factor graphs. In their system, called SE-Sync, they use Riemannian Truncated-Newton Trust-Region method to certifiably compute the global optimum in a two step optimization (rotation and translation). Briales et al. [54] extended this approach to jointly optimize rotation and translation using the same concepts. Bai et al. [55] provided a formulation of the SLAM problem based on constrained optimization, where constraints are represented by loop-closure cycles. Still, those approaches are bounded to SE(d) sparse optimization. In contrast, Ni et al. [56] and Grisetti et al. [57] exploited respectively nested dissection and hierarchical local sub-graphs devising divide and conquer strategies to both increase the convergence basin and speed up the computation.

IV. LEAST SQUARES MINIMIZATION
This section describes the foundations of ILS minimization. We first present a formulation of the problem, that highlights its probabilistic aspects (Section IV-A). In Section IV-B we review some basic rules for manipulating the Normal distribution and we apply these rules to the definition presented in Section IV-A, leading to the initial definition of linear Least-Squares (LS). In Section IV-C we discuss the effects of non-linear observation model, assuming that both the state space and the measurements space are Euclidean. Subsequently, we relax this assumption on the structure of state and measurement spaces, proposing a solution that uses smooth manifold encapsulation. In Section IV-D we introduce effects of outliers in the optimization and we show commonly used methodologies to reject them. Finally in Section IV-E, we address the case of large, sparse problem characterized by measurement functions depending only on small subsets of the state. Classical problems such as SLAM or BA fall in this category and are characterized by a rather sparse structure.

A. Problem Formulation
Let W be a stationary system whose non-observable state variable is represented by x and let z be a measurement, i.e. a perception of the environment. The state is distributed according to a prior distribution p(x), while the conditional distribution of the measurement given the state p(z|x) is known. p(z|x) is commonly referred to as the observation model. Our goal is to find the most likely distribution of states, given the measurementsi.e. p(x|z). A straightforward application of the Bayes rule results in the following: The proportionality is a consequence of the normalization factor p(z), which is constant.
In the reminder of this work, we will considers two key assumptions: -the prior about the states is uniform, i.e.
-the observation model is Gaussian, i.e. Eq.
(2) expresses the uniform prior about the states using the canonical parameterization of the Gaussian. Alternatively, the moment parameterization characterizes the Gaussian by the information matrixi.e. the inverse of the covariance matrix Ω x = Σ −1 x -and the information vector ν x = Ω x µ x . The canonical parameterization is better suited to represent non-informative prior, since Ω x = 0 does not lead to numerical instabilities while implementing the algorithm. In contrast, the moment parameterization can express in a stable manner situations of absolute certainty by setting Σ x = 0. In the remainder, we will use both representations upon convenience, being their relation clear.
In Eq. (3) the mean µ z|x of the predicted measurement distribution is controlled by a generic non-linear function of the state h(x), commonly referred to as measurement function. In the next section we will derive the solution for Eq. (1), imposing that the measurement function is an affine transformation of the state -i.e. h(x) = Ax + billustrated in Fig. 2. In Section IV-C we address the more general non-linear case.

B. Linear Measurement Function
In case of linear measurement function, expressed as h(x) = A(x − µ x ) +ẑ, the prediction model has the following form: withẑ constant and µ x being the mean of the prior. For convenience, we express the stochastic variable ∆x = x−µ x as Switching between ∆x and x is achieved by summing or subtracting the mean. To retrieve a solution for Eq. (1) we first compute the joint probability of p(z, ∆x) using the chain rule and, subsequently, we condition this joint distribution with respect to the known measurement z. For further details on the Gaussian manipulation, we refer the reader to [58]. a) CHAIN RULE: Under the Gaussian assumptions made in the previous section, the parameters of the joint distribution over states and measurements p(∆x, z) have the following block structure: The value of the terms in Eq. (7) are obtained by applying the chain rule to multivariate Gaussians to Eq. (4) and Eq. (5), according to [58], and they result in the following: Since we assumed the prior to be non-informative, we can set Ω x = 0. As a result, the information vector ν ∆x,z of the joint distribution is computed as: Fig. 3 shows visually the outcome distribution. b) CONDITIONING: Integrating the known measurement z in the joint distribution p(∆x, z) of Eq. (6) results in a new distribution p(∆x|z). This can be done by conditioning in the Gaussian domain. Once again we refer the reader to [58] for the proofs, while we report here the results that the conditioning has on the Gaussian parameters: where The conditioned mean µ ∆x|z is retrieved from the information matrix and the information vector as: Remembering that ∆x = x − µ x , the Gaussian distribution over the conditioned states has the same information matrix, while the mean is obtained by summing the increment's mean µ ∆x|z as An important result in this derivation is that the matrix H = Ω ∆x|z is the information matrix of the the estimate, therefore, we can estimate not only the optimal solution µ x|z , but also its uncertainty Σ x|z = Ω −1 ∆x|z . Fig. 4 illustrates visually the conditioning of a bi-variate Gaussian distribution. c) INTEGRATING MULTIPLE MEASUREMENTS: Integrating multiple independent measurements z 1:K requires to stack them in a single vector. As a result, the observation model becomes .
Hence, matrix H and vector b are composed by the sum of each measurement's contribution; setting e k =ẑ k − z k , we compute them as follows:

C. Non-Linear Measurement Function
Equations (12) (13) and (15) allow us to find the exact mean of the conditional distribution, under the assumptions that i) the measurement noise is Gaussian, ii) the measurement function is an affine transform of the state and iii) both measurement and state spaces are Euclidean. In this section we first relax the assumption on the affinity of the measurement function, leading to the common derivation of the GN algorithm. Subsequently, we address the case of non-Euclidean state and measurement spaces.
If the measurement model mean µ z|x is controlled by a non-linear but smooth function h(x), and that the prior mean µ x =x is reasonably close to the optimum, we can approximate the behavior of µ x|z through the first-order Taylor expansion of h(x) around the mean, namely: The Taylor expansion reduces conditional mean to an affine transform in ∆x. Whereas the conditional distribution will not be in general Gaussian, the parameters of a Gaussian approximation can still be obtained around the optimum through Eq. (11) and Eq. (13). Thus, we can use the same algorithm described in Sec. IV-B, but we have to compute the linearization at each step. Summarizing, at each iteration, the GN algorithm: -processes each measurement z k by evaluating error e k (x) = h k (x) − z k and Jacobian J k at the current solutionx: -builds a coefficient matrix and coefficient vector for the linear system in Eq. (12), and computes the optimal perturbation ∆x by solving a linear system: -applies the computed perturbation to the current state as in Eq. (13) to get an improved estimatȇ A smooth prediction function has lower-magnitude higher order terms in its Taylor expansion. The smaller these terms are, the better its linear approximation will be. This leads to the situations close to the ideal affine case. In general, the smoother the measurement function h(x) is and the closer the initial guess is to the optimum, the better the convergence properties of the problem. a) NON-EUCLIDEAN SPACES: The previous formulation of GN algorithm uses vector addition and subtraction to compute the error e k in Eq. (17) and to apply the increments in Eq. (20). However, these two operations are only defined in Euclidean spaces. When this assumption is violated -as it usually happens in Robotics and Computer Vision applications -the straightforward implementation does not generally provide satisfactory results. Rotation matrices or angles cannot be directly added or subtracted without performing subsequent non-trivial normalization. Still, typical continuous states involving rotational or similarity transformations are known to lie on a smooth manifold [59].
A smooth manifold M is a space that, albeit not homeomorphic to R n , admits a locally Euclidean parameterization around each element M of the domain, commonly referred to as chart -as illustrated in Fig. 5. A chart computed in a manifold point M is a function from R n to a new point M ′ on the manifold: Intuitively, M ′ is obtained by "walking" along the perturbation ∆m on the chart, starting from the origin.   are analogous to the sum and subtraction. Those operators, referred to as ⊞ and ⊟, are, thence, defined as: This notation -firstly introduced by Smith et al. [60] and then generalized by Hertzberg et al. [14], [61] -allows us to straightforwardly adapt the Euclidean version of ILS to operate on manifold spaces. The dimension of the chart is chosen to be the minimal needed to represent a generic perturbation on tha manifold. On the contrary, the manifold representation can be chosen arbitrarily.
A typical example of smooth manifold is the SO(3) domain of 3D rotations. We represent an element SO(3) on the manifold as a rotation matrix R. In contrast, the for perturbation, we pick a minimal representation consisting on the three Euler angles ∆r = (∆φ, ∆θ, ∆ψ) ⊤ . Accordingly, the operators become: The function fromVector(·) computes a rotation matrix as the composition of the rotation matrices relative to each Euler angle. In formulae: The function toVector(·) does the opposite by computing the value of each Euler angle starting from the matrix R. It operates by equating each of its element to the corresponding one in the matrix product R x (∆φ) R y (∆θ) R z (∆ψ), and by solving the resulting set of trigonometric equations. As a result, this operation is quite articulated. Around the origin the chart constructed in this manner is immune to singularities. Once defined proper ⊞ and ⊟ operators, we can reformulate our minimization problem in the manifold domain. To this extent we can simply replace the + with a ⊞ in the computation of the Taylor expansion of Eq. (16). Since we will compute an increment on the chart, we need to compute the expansion on the chart ∆x at the local optimum, that is at the origin of the chart itself ∆x = 0, in formulae: The same holds when applying the increments in Eq. (20), leading to:X Here we denoted with capital letters the manifold representation of the state X, and with ∆x the Euclidean perturbation.
Since the optimization within one iteration is conducted on the chart, the origin of the chartX on the manifold stays constant during this iteration. If the measurements lie on a manifold too, a local ⊟ operator is required to compute the error, namely: To apply the previously defined optimization algorithm we should linearize the error around the current estimate through its first-order Taylor expansion. Posingȇ k = e k (X), we have the following relation: The reader might notice that in Eq. (30) the error space may differ from the increments one, due to the ⊟ operator. As reported in [62], having a different parametrization might enhance the convergence properties of the optimization in specific scenarios. Still, to avoid any inconsistencies, the information matrix Ω k should be expressed on a chart around the current measurement Z k .

D. Handling Outliers: Robust Cost Functions
In Sec. IV-C we described a methodology to compute the parameters of the Gaussian distribution over the state x which minimizes the Omega-norm of the error between prediction and observation. More concisely, we compute the optimal state x ⋆ such that: The mean of our estimate µ x|z = x * is the local optimum of the GN algorithm, and the information matrix Ω * x|z = H * , is he coefficient matrix of the system at convergence. The procedure reported in the previous section assumes all measurements correct, albeit affected by noise. Still, in many real cases this is not the case. This is mainly due to aspects that are hard to modeli.e. multi-path phenomena or incorrect data associations. These wrong measurements are commonly referred to as outliers. On the contrary, inliers represent the good measurements.
A common assumption made by several techniques to reject outliers is that the inliers tend to agree towards a common solution, while outliers do not. This fact is at the root of consensus schemes such as RANSAC [29]. In the context of ILS, the quadratic nature of the error terms leads to over-accounting for measurement whose error is large, albeit those measurements are typically outliers. However, there are circumstances where all errors are quite large even if no outliers are present. A typical case occurs when we start the optimization from an initial guess which is far from the optimum.
A possible solution to this issue consists in carrying on the optimization under a cost function that grows subquadratically with the error. Indicating with u k (x) the L1 Omega-norm of the error term in Eq. (17), its derivatives with respect to the state variable x can be computed as follows: We can generalize Eq. (31) by introducing a scalar function ρ(u) that computes a new error term as a function of the L1-norm. Eq. (31) is a special case for ρ(u) = 1 2 u 2 . Thence, our new problem will consist in minimizing the following function: Going more in detail and analyzing the gradients of Eq. (34) we have the following relation: The robustifier function ρ(·) acts on the gradient, modulating the magnitude of the error term through a scalar function γ k (x). Still, we can also compute the gradient of the Eq. (31) as follows: We notice that Eq. (37) and Eq. (35) differ by a scalar term γ(x) that depends on x. By absorbing this scalar term at each iteration in a new information matrixΩ k (x) = γ k (x)Ω k , we can rely on the iterative algorithm illustrated in the previous sections to implement a robust estimator. In this sense, at each iteration we compute γ k (x) based on the result of the previous iteration. Note that, upon convergence the Eq. (37) and Eq. (35) will be the same, therefore, they lead to the same optimum. This formalization of the problem is called Iterative Reweighed Least-Squares (IRLS). The use of robust cost functions biases the information matrix of the system H. Accordingly, if we want to recover an estimate of the solution uncertainty when using robust cost functions, we need to "undo" the effect of function ρ(·). This can be easily achieved recomputing H after convergence considering only inliers and disabling the robustifieri.e. setting ρ(u) = 1 2 u 2 . Fig. 6 illustrates some of the most common cost function used in Robotics and Computer Vision. Further information on modern robust cost function can be found in the work of MacTavish et al. [63].
(a) H matrix before reordering.
(c) H matrix after reordering.
(d) Cholesky decomposition after reordering. Fig. 7: Effects of AMD variable reordering on the fill-in of the Cholesky decomposition of matrix H. Black pixels indicate non-zero blocks. As illustrated in Fig. 7b and Fig. 7d, variable reordering dramatically reduces the fill-in of the decomposed matrix.

E. Sparsity
Minimization algorithms like GN or Levenberg-Marquardt (LM) lead to the repeated construction and solution of the linear system H∆x = −b. In many cases, each measurement z k only involves a small subset of state variables, namely: (38) Therefore, the Jacobian for the error term k has the following structure: According to this, the contribution H k = J T k Ω k J k of the k th measurement to the system matrix H exhibits the following pattern: Each measurement introduces a finite number of nondiagonal components that depends quadratically on the number of variables that influence the measurement. Therefore, in the many cases when the number of measurements is proportional to the number of variables such as SLAM or BA, the system matrix H is sparse, symmetric and positive semi-definite by construction. Exploiting these intrinsic properties, we can efficiently solve the linear system in Eq. (12).
In fact, the literature provides many solutions to this kind of problem, which can be arranged in two main groups: (i) iterative methods and (ii) direct methods. The former computes an iterative solution to the linear system by following the gradient of the quadratic form. These techniques often use a pre-conditioner e.g. Preconditioned Conjugate Gradient (PCG), which scales the matrix to achieve quicker convergence to take steps along the steepest directions. Further informations about these approaches can be found in [64]. Iterative methods might require a quadratic time in computing the exact solution of a linear system, but they might be the only option when the dimension of the problem is very large due to their limited memory requirements. Direct methods, instead, compute return the exact solution of the linear system, usually leveraging on some matrix decomposition followed by backsubstitution. Typical methods include: the Cholesky factorization H = LL T or the QR-decomposition. A crucial parameter controlling the performances of a sparse direct linear solver is the fill-in, that is the number of new non-zero elements introduced by the specific factorization. A key aspect in reducing the fill in is the ordering of the variables. Since computing the optimal reordering is NP-hard, approximated techniques [65]- [67] are generally employed. Fig. 7 shows the effect of different variable ordering on the same system matrix. A larger fill-in results in more demanding computations. We refer to [68] and [69] for a more detailed analysis about this topic.

F. A unifying formalism: Factor Graphs
In this section, we introduce a formalism to represent a super-class of the minimization problems discussed so far: factor graphs. We recall that i) our state x = {x 1:N } is composed by N variables, ii) the conditional probabilities p(z k |x) might depend only by a subset of the state variables x k = {x k1 , x k2 , . . . , x kq } ∈ x and iii) we have no prior about the state p(x) = U(x). Given this, we can expand Eq. (1) as follows: (40) Eq. (40) expresses the likelihood of the measurements as a product of factors. This concept is elegantly captured in the factor graph formalism, which provides a graphical representation for this kind of problems. A factor graph is a bipartite graph where each node represents either a variable x i ∈ x, or a factor p(z k |x k ). Fig. 8 illustrates an example of factor graph. Edges connect a factor p(z k |x k ) with each of its variables x k = {x k1 , x k2 , . . . , x kq }. In the remainder of this document, we will stick to the factor graph notation and we will refer to the measurement likelihoods p(z k |x k ) as factors. Note that, the main difference between Eq. (40) and Eq. (14) is that the former highlights the subset of state variables x k from which the observation depends, while the latter considers all state variables -also those that have a null contribution.
The aim of this section is to use the factor graph formulation to formalize the ILS minimization exposed so far. In this sense, Alg. 1 reports a step-by-step expansion of the vanilla GN algorithm exploiting the factor graph formalism -supposing that both states and measurement belong to a smooth manifold. In the remainder of this document, we indicate with bold uppercase symbols elements lying on a manifold spacee.g. X ∈ SE(3); lowercase bold symbols specify their corresponding vector perturbationse.g. ∆x ∈ R n . Going more in detail, at each iteration the algorithm reinitializes its workspace to store the current linear system (lines 6-7). Subsequently, it processes each measurement Z k (line 8), computing i) the prediction (line 9), ii) the error vector (line 10) and iii) the coefficients to apply the robustifier (lines 13-15). While processing a measurement, it also computes the blocks J k,i of the Jacobians with respect to the variables X i ∈ X k involved in the factor. We denote with H i,j the block i, j of the H matrix corresponding to the variables X i and X j ; similarly we indicate with b i the block of the coefficient vector for the variable X i . This operation is carried on in the lines 18-20. The contribution of each measurement to the linear system is added in a block fashion. Further efficiency can be achieved exploiting the symmetry of the system matrix H, computing only its lower triangular part. Finally, once the linear system H∆x = −b has been built, it is solved using a general sparse linear solver (line 21) and the perturbation ∆x is applied to the current state in a block-wise fashion (line 23). The algorithm proceeds until convergence is reached, namely when the delta of the costfunction F between two consecutive iteration is lower than a threshold ǫ -line 3.
Summarizing, instantiating Alg. 1 on a specific problem requires to: -Define for each type of variable x i ∈ x i) an extended parametrization X i , ii) a vector perturbation ∆x i and iii) a ⊞ operator that computes a new point on the manifold X ′ i = X i ⊞ ∆x i . If the variable is Euclidean, Algorithm 1 Gauss-Newton minimization algorithm for manifold measurements and state spaces k Ω k e k 12: 18: for all X j ∈ {X k1 ... X kq } and j <= i do 19: ∆x ← solve(H∆x = −b) 22: for all X i ∈ X do 23:X i ←X i ⊞ ∆x i 24: returnX the extended and the increment parametrization match and, thus, ⊞ degenerates to vector addition.
-For each type of factor p(z k |x k ), specify i) an extended parametrization Z i , ii) an Euclidean representation ∆z i for the error vector and ii) a ⊟ operator such that, given two points on the manifold Z i and Z ′ i , ∆z i = Z ′ i ⊟ Z i represents the motion on the chart that moves Z i onto Z ′ i . If the measurement is Euclidean, the extended and perturbation parametrizations match and, thus, ⊟ becomes a simple vector difference. Finally, it is necessary to define the measurement function h k (X k ), that given a subset of state variables X k , computes the expected measurementẐ k .
-Choose a robustifier function ρ k (u) for each type of factor. The non-robust case is captured by choosing ρ k (u) = 1 2 u 2 . Note that, depending on the choices and on the application, not all these steps indicated here are required. Furthermore, the value of some variables might be known beforehande.g. the initial position of the robot in SLAM is typically set at the origin. Hence, these variables do not need to be estimated in the optimization process, since they are constants in this context. In Alg. 1, fixed variables can be handled in the solution stepi.e. line 21 -suppressing all block rows and columns in the linear system that arise from these special nodes. In the next section, we present how to formalize several common SLAM problems through the factor graph formalization introduced so far.

V. EXAMPLES
In this section we present examples on how to apply the methodology illustrated in Sec. IV-F to typical problems in robotics, namely: Point-Cloud Registration, Projective Registration, BA and PGO.

A. ICP
ICP represents a family of algorithms used to compute a transform that maximizes the overlap between two point clouds. Let P f be the cloud that stays fixed and P m be the one that is moved to maximize the overlap. ICP algorithms achieve this goal progressively refining an initial guess of the target transformation X by alternating two phases: data-association, and optimization. The aim of the dataassociation is to find a point p f i ∈ P f that is likely to be the same as the point p m j ∈ P m being transformed according to X, see Fig. 9. Note that, several heuristics to determine the data association have been proposed by the research community, depending on the context of the problem. The most common one are either geometry basedi.e. nearest neighbor, normal shooting, projective association -or rely on appearance-based evaluations. Discussing dataassociation strategies is out of the scope fo this work, still, we can generalize the outcome of data association by a selector function j(k) ∈ {1, . . . , |P f |} that maps a point index k in the moving cloud to an index j in the fixed cloud. In this way, we indicate a pair of corresponding points as p m k , p f j(k) . In contrast to data-association, the optimization step is naturally described as an ILS problem. The variable to be estimated is a transform X whose domain depends on the specific scenarioe.g. SE(2), SE(3) or even a Similarity if the two clouds are at different scales. In the remaining of this section, we will use X ∈ SE(3) to instantiate our factor-graph-based ILS problem.

1) Variables:
Since the transformation we should estimate is a 3D Isometry X ∈ SE(3), our state lies on a smooth manifold. Therefore we should define all the entities specified in Sec. IV-F, namely: -Extended Parameterization: we conveniently define a transformation X = [R | t] ∈ SE(3) as a rotation matrix R and a translation vector t. Using this notation, the following relations hold: -Perturbation Vector: a commonly used vector where ∆t ∈ R 3 represents a translation, while ∆a ∈ R 3 is a minimal representation for the rotation. The latter might use Euler angles, unit-quaternion or the logarithm of the rotation matrix. -X ⊞ ∆x Operator: this is straightforwardly implemented by first computing the transformation ∆X = [∆R ∆t] ∈ SE(3) from the perturbation, and then applying such a perturbation to the previous transform.
In formulae: where v2t(∆x) computes a transform ∆X from a perturbation vector ∆x. Its implementation depends on the parameters chosen for the rotation part ∆a. Note that, the perturbation might be applied to the left or to the right of the initial transformation. In this document we will consistently apply it to the left. Finally, we define also the inverse function ∆x = t2v(∆X), that computes perturbation vector from the transformation matrix such that ∆x = t2v(v2t(∆x)).

2) Factors:
In this problem, we have just one type of factor, which depends on the relative position between a pair of corresponding points, after applying the current transformation X to the moving cloud. Given a set of associations { p m s , p f j(s) , . . . , p m K , p f j(K) }, each fixed point p f j(k) constitutes a measurement z k -since its value does not change during optimization. On the contrary, each moving point p m k will be used to generate the predictionẑ. Note that, the measurement space is Euclidean in ths scenarioi.e. R 3 . Therefore, we only need to define the following entities: -Measurement Function: it computes the position of a point p m k that corresponds to the point p f j(k) in fixed scene by applying the transformation X, namely: -Error Function: since both prediction and measurement are Euclidean, the ⊟ operator boils down to simple vector difference. The error, thus, is a 3-dimensional vector computed as: The Jacobians can be computed analytically very straightforwardly from Eq. (45) as: With this in place, we can now fully instantiate Alg. 1. For completeness, in the appendix we report the functions v2t(·) and t2v(·) for SE(3) objects, together with the analytical derivation of the Jacobians. Since in this case the measurement is Euclidean, the Jacobians of error function and measurement function are the same.

B. Projective Registration
Projective Registration consists in determining the pose X of a camera in a known 3D scene from a set of 2D projections of these points on the image plane. In this case our fixed point cloud P f will be consisting of the image projections, while the moving one P m will be composed by the known location of the 3D points, see Fig. 10. We use the notation for data-association defined in Sec. V-A, in which the function j(i) retrieves the index of a 2D measurement on the image that corresponds to the 3D point p m i ∈ P m . Also in this scenario, the only variable to estimate is the transformation X ∈ SE(3), therefore we will simply re-use the entities defined in Sec. V-A.1 and focus only on the factors. 1) Factors: Given a set of 2D-3D associations { p m s , p f j(s) , . . . , p m K , p f j(K) }, each fixed point p m k ∈ R 2 will represent a measurement z k , each moving point will contribute to the predictionẑ k . Therefore we can define: -Measurement Function: it is the projection on the image plane of a scene point p m k , assuming the camera is at X. Such a prediction is obtained by first mapping the point in the camera reference frame to get a new point p icp , and then projecting this point on the image plane, in formluae: Note that, p cam is the point in homogeneous image coordinates, while p img is the 2D point in pixel coordinates obtained normalizing the point through homogeneous division. Finally, the complete measurement function is defined as: Note that, we can exploit the work done in Sec. V-A.2 to easily compute Jacobians using the chain-rule, namely:

C. Structure from Motion and Bundle Adjustment
Structure from Motion (SfM) is the problem of determining the pose of N cameras and the position of M 3D points on a scene, from their image projections. The scenario is shown in Fig. 11. This problem is highly non-convex, and tackling it with ILS requires to start from an initial guess not too far from the optimum. Such a guess is usually obtained by using Projective Geometry techniques to determine an initial layout of the camera poses. Subsequently, the points are triangulated to initialize all variables. A final step of the algorithm consists in performing a non-linear refinement of such an initial guess -known as BA -which is traditionally approached as an ILS problem. Since typically each camera observes only a subset of points, and a point projection depends only on the relative pose between the observed point and the observing camers, BA is a good example of a sparse problem.
1) Variables: We want to estimate the pose of each camera X c 1:N , and the position of each point x p 1:M . The state will thus be a data structure X = X c 1:N , x p 1:M storing all camera poses and all points. Given this, for the camera poses X c 1:N , the definitions in Sec. V-A.1 will be used again. As for the points, we do not need a specific extended parametrization, since they lie on ℜ 3 . Therefore we should define only: -Perturbation Vector: the total perturbation vector is defined as Otherwise speaking, it is a 6N + 3M vector obtained by stacking the individual perturbations. -X ⊞ ∆x Operator: the operator will use the same machinery introduced in Sec. V-A.1 for the poses and the standard Euclidean addition for the point positions. 2) Factors: Similar to Projective Registration, in BA a measurement z k is a projection of a point on the 2D image plane. However, in this specific scenario, such a projection depends not only on the estimate of a camera pose but also on the estimate of the point. Note that, this information was known in the case of Projective Registration, while now it becomes part of the state. For consistency with Alg. 1, if the k th measurement arises from observing the point x p m with the camera X c n , we will denote these indices with two selector functions n = n(k) and m = m(k), that map the factor index k respectively to the indices of the observing camera and the observed point. For the k th factor, the camera and point variables will then be x p m(k) and X c n(k) . Note that, also in this case, a measurement z k lies in an Euclidean spacei.e. R 2 . Given this, to instantiate a factor we define: -Measurement Function: the predictionẑ k can be easily obtained from Eq. (50), namely: -Error Function: it is the Euclidean difference between prediction and measurement: In this context, the Jacobian J ba k (X) will be consisting of two blocks, corresponding to the perturbation of the camera pose and to the perturbation of the point position, in formulae: where . (57) Again, the measurement domain is Euclidean, thus the Jacobians of the error function and the measurement function match. For completeness, in the Appendix of this document we report a more in-depth derivation of the Jacobians.
Still, since all measurements are relative, given a particular solution X ⋆ all solutions X ′ = TX ⋆ obtained by applying a transformation T ∈ SE(3) to all the variables in X ⋆ have the same residual χ 2 and, thus, are equivalent. Furthermore, all solutions X ′ = sX ⋆ obtained by scaling all poses and landmarks by a constant s are equivalent too. This reflects the fact that observing an object that is twice as big from twice the distance results in the same projection. Thence, the problem of BA is under-determined by 7 Degrees-of-Freedom (DoF) and, thus, the vanilla GN algorithm requires to fix at least 7 DoF -typically a camera pose (6 DoF), and the distance between two points or two camera poses (1 DoF).

D. Pose Graphs
A pose graph is a factor graph whose variables X r 1:N are poses and whose measurements Z 1:K are relative measurements between pairs of poses. Optimizing a pose graph means determining the configuration of poses that is maximally consistent with the measurements. PGO is very common in the SLAM community, and several ad-hoc approaches have been proposed. Similar to BA, PGO is highly non-convex, and its solution with ILS requires a reasonably good initial guess.
1) Variables: Also in PGO, each variable X r k lies on the smooth manifold SE(3). Once again, we will make use of the formulation used in Sec. V-A.1 to characterize the state X = X r 1:N and the perturbation vector ∆x rT = (∆x rT 1 . . . ∆x rT N ). 2) Factors: Using the same index notation in Sec. V-C.2, let Z k be the k th relative pose measurement expressing the pose X m in the reference frame of the pose X n . We denote the pair of poses as X n = X n(k) , and X m = X m(k) using the two selector functions n(k) and m(k). In this scenario, a measurement Z k expresses a relative pose between two variables and, consequently, also Z k lies on the smooth manifold SE(3). Considering this, we define the following entities: -Measurement Function: this is straightforwardly obtained by expressing the observed pose X r m(k) in the reference frame of the observing pose X r n(k) , namely: -Error Function: in this case, since the measurements are non-Euclidean too, we are required to specify a suitable parametrization for the error vector e k . In literature, many error vectorization are available [62], each one with different properties. Still, in this document, we will make use of the same 6-dimensional parametrization used for the incrementsi.e. e ⊤ = (e xyz⊤ e rpy⊤ ). Furthermore, we need to define a proper ⊟ operator that expresses on a chart the relative pose between two SE(3) objects ∆z =Ẑ ⊟ Z. To achieve this goal, we i) expressẐ in the reference system of Z obtaining the relative transformation ∆Z and then ii) compute the chart coordinates of ∆Z around the origin using the t2v(·) function. In formluae: Note that, since ∆Z k expresses a relative motion between prediction and measurement, its rotational component will by away from singularities. With this in place, the error vector is computed as the pose of the predictionẐ k = h pgo k (X) on a chart centered in Z k ; namely: Similar to the BA case, in PGO the Jacobian J pgo k (X) will be consisting of two blocks, corresponding to the perturbation of observed and the observing poses. The measurements in this case are non-Euclidean, and, thus, we need to compute the Jacobians on the error function -as specified in Eq. (60): Analogous to the BA case, also in PGO all measurements are relative, and, hence, all solutions that are related by a single transformation are equivalent. The scale invariance, however, does not apply in this context. As a result, PGO is under-determined by 6 DoF and using GN requires to fix at least one of the poses.

E. Considerations
In general, one can carry on the estimation by using an arbitrary number of heterogeneous factors and variables. As an instance, if we want to augment a BA problem with odometry, we can model the additional measurements with PGO factors connecting subsequent poses. Similarly, if we want to solve a Projective Registration problem where the world is observed with two cameras, and we have guess of the orientation from an inertial sensor, we can extend the approach presented in Sec. V-B by conducting the optimization on a common origin of the rigid sensor system, instead of the camera position. We will have three types of factors, one for each camera, and one modeling the inertial measurements.
As a final remark, common presentations of ICP, Projective Registration and BA conduct the optimization by estimating world-to-sensor frame, rather than the sensorto-world, as we have done in this document. This leads to a more compact formulation. This avoids inverting the transform to compute the prediction, and results in Jacobians are easier to compute in close form. We preferred to provide the solution for sensor-to-world to be consistent with the PGO formulation.

VI. A GENERIC SPARSE/DENSE MODULAR LEAST SQUARES SOLVER
The methodology presented in Sec. IV-F outlines a straight path to the design of an ILS optimization algorithm. Robotic applications often require to run the system on-line, and, thus, they need efficient implementations. When extreme performances are needed, the ultimate strategy is to overfit the solution to the specific problem. This can be done both at an algorithmic level and at an implementation level. To improve the algorithm, one can leverage on the a-priori known structure of the problem, by removing parts of the algorithm that are non needed or by exploiting domainspecific knowledge. A typical example is when the structure of the linear system is known in advancee.g. in BA -where it is common to use specialized methods to solve the linear system [70].
Focusing on the implementation, we reported two main bottlenecks: the computation of the linear system H∆x = b and its solution. Dense problems such as ICP, Sensor Calibration or Projective Registration, are typically characterized by a small state space and many factors of the same type. In ICP, for instance, the state contains just a single SE(3) object i.e. the robot pose. Still, this variable might be connected to hundreds of thousands of factors, one for each point correspondence. Between iterations, the ICP mechanism results in these factors to change, depending on the current status of the data association. As a consequence, these systems spend most of their time in constructing the linear system, while the time required solve it is negligible. Notably, applications such as Position Tracking or VO require the system to run at the sensor frame-rate, and each new frame might take several ILS iterations to perform the registration. On the contrary, sparse problems like , PGO or large scale BA are characterized by thousands of variables, and a number of factors which is typical in the same order of magnitude. In this context, a factor is connected to very few variables. As an example, in case of PGO, a single measurement depends only two variables that express mutually observable robot poses, whereas the complete problem might contain a number of variables proportional to the length of the trajectory. This results in a large-scale linear system, albeit most of its coefficients are null. In these scenarios, the time spent to solve the linear system dominates over the time required to build it.
A typical aspect that hinders the implementation of a ILS algorithm by a person approaching this task for the first time is the calculation of the Jacobians. The labor-intensive solution is to compute them analytically, potentially with the aid of some symbolic-manipulation package. An alternative solution is to evaluate them numerically, by calculating the Jacobians column-by-column with repeated evaluation of the error function around the linearization point. Whereas this practice might work in many situations, numerical issues can arise when the derivation interval is not properly chosen. A third solution is to delegate the task of evaluating the analytic solution directly to the program, starting from the error function. This approach is called AD and Ceres Solver [10] is the most representative system to embed this feature -later also adopted by other optimization frameworks.
In the remainder of this section, we first revisit and generalize Alg. 1 to support multiple solution strategies. Subsequently, we outline some design requirements that will finally lead to the presentation of the overall design of our approach -proposed in Sec. VII.

A. Revisiting the Algorithm
In the previous section, we presented the implementation of a vanilla GN algorithm for generic factor graphs. This simplistic scheme suffers under high non linearities, or when the cost function is under-determined. Over time, alternatives to GN have been proposed, to address these issues, such as LM or Trusted-Region Method (TRM) [71]. All these algorithms present some common aspects or patterns that can be exploited when designing an optimization system. Therefore, in this section, we reformulate Alg. 1 to isolate different independent sub-modules. Finally we present both the GN and the LM algorithms rewritten by using these submodules.
In Alg. 2 we isolate the operations needed to compute the scaling factor γ k for the information matrix Ω k , knowing the current χ 2 k . Alg. 3 performs the calculation of the error e k and the Jacobian J k for a factor Z k , Ω k at the current linearization point. Alg. 4 applies the robustifier to a factor, and updates the linear system. Alg. 5 applies the perturbation ∆x to the current solutionX to obtain an updated estimate. Finally, in Alg. 6 we present a revised version of Alg. 1 that relies on the modules described so far. In Alg. 7, we provide an implementation of the LM algorithm that makes use of the same core sub-algorithms used in Alg. 6. The LM algorithm solves a damped version of the system, namely (H + λ · diag(H))∆x = b. The magnitude of the damping factor λ is adjusted depending on the current variation of the χ 2 . If the χ 2 increases upon an iteration, λ increases too. In contrast, if the solution improves, λ is decreased. Variants of these two algorithmse.g. damped GN, that solves (H+λI)∆x = bcan be straightforwardly implemented by slight modification to the algorithm presented here. Algorithm 2 robustify(χ 2 k ) -computes the robustification coefficient γ k Require: Current χ 2 k . Ensure: γ k computed from the actual error, 1: Algorithm 3 linearize(X k ,Z k ) -computes the error e k and the Jacobians J k at the current linearization pointX Require: Initial guessX k ; Current measurementZ k ; Ensure: Error: e k ; JacobiansJ k ; system with a factor current linearization pointX, and returns the χ 2 k of the factor Require: Initial guessX k ; Coefficients of the linear system H and b; Measurement Z k , Ω k Ensure: Coefficients of the linear system after the update H and b; Value of the cost function for this factor χ 2 k 1: < e k , J k >= linearize(X k , Z k ) 2: χ 2 k ← e T k Ω k e k 3: γ k = robustify(χ 2 k ) 4:Ω k = γ k Ω k 5: for all X i ∈ {X k1 ... X kq } do 6: for all X j ∈ {X k1 ... X kq } and j <= i do 7: Algorithm 5 updateSolution(X, ∆x) -applies a perturbation to the current system solution Require: Current solutionX; Perturbation ∆x Ensure: New solutionX, moved according to ∆x 1: for all X i ∈ X do 2:X i ←X i ⊞ ∆x i 3: returnX

22:
t ← t + 1 23: returnX work. Although most of these requirements indicate good practices to be followed in potentially any new development, we highlight here their role in the context of a solver design. a) EASY TO USE AND SYMMETRIC API: As users we want to configure, instantiate and run a solver in the same manner, regardless to the specific problem to which is applied. Ideally, we do not want the user to care if the problem is dense or sparse. Furthermore, in several practical scenarios, one wants to change aspects of the solver while it runse.g. the minimization algorithm chosen, the robust kernel or the termination criterion. Finally we want to save/retrieve the configuration of a solver and all of its submodules to/from disk. Thence, the expected usage pattern should be: i) load the specific solver configuration from disk ed eventually tune it, ii) assign a problem to the solver or load it from file, iii) compute a solution and eventually iv) provide statistics about the evolution of optimization. Note that, many current state-of-the-art ILS solver allow to easily perform the last 3 steps of this process, however, they do not provide the ability of permanently write/read their configuration on/from disk -as our system does.
b) ISOLATING PARAMETERS, WORKING VARIABLES AND PROBLEM DESCRIPTION/SOLUTION: When a user is presented to a new potentially large code-base, having a clear distinction between what the variables represent and how they are used, substantially reduces the learning curve. In particular we distinguish between parameters, working variables and the input/output. Parameters are those objects controlling the behavior of the algorithm, such as number of iterations or the thresholds in a robustifier. Parameters might include also processing sub-modules, such as the algorithm to use or the algebraic solver of the linear system. Summarizing, parameters characterize the behavior of the optimizer, independently from the input, and represent the configuration that can be stored/retrieved from disk.
In contrast to parameters, working variables are altered during the computation, and are not directly accessible to the end user. Finally, we have the description of the problem i.e. the factor graph, where the factors and the variables expose an interface agnostic to the approach that will be used to solve the problem. c) TRADE-OFF DEVELOPMENT EFFORT / PERFOR-MANCE: Quickly developing a proof of concept is a valuable feature to have while designing a novel system. At the same time, once a way to approach the problem has been found, it becomes perfectly reasonable to invest more effort to enhance its efficiency.
Upon instantiation, the system should provide an off-the shelf generic and fair configuration. Obviously, this might be tweaked later for enhancing the performances on the specific class of problems. A possible way to enhance performances is by exploiting the special structure of a specific class of problems, overriding the general APIs to perform adhoc computations. This results in layered APIs, where the functionalities of a level rely only on those of the level below. As an example, in our architecture the user can either specify the error function and let the system to compute the Jacobians using AD or provide the analytical expression of the Jacobians if more performances are needed. Finally, the user might intervene at a lower level, providing directly the contribution to the linear system H k = J T k Ω k J k given by the factor. In a certain class of problems also computing this product represents a performance penalty. An additional benefit provided by this design is the direct support for Newton's method, which can be straightforwardly achieved by substituting the approximated Hessian H k = J T k Ω k J k with the analytic one H k = ∂ 2 e k ∂∆x 2 k . This feature captures second order approaches such as NDT [72] in the language of our API. d) MINIMIZE CODEBASE: The likelihood of bugs in the implementation grows with the size of the code-base. For small teams characterized by a high turnover -like the ones found in academic environments -maintaining the code becomes an issue. In this context we choose to favor the code reuse in spite of a small performance gain. The same class used to implement an algorithm, a variable type or a robustifier should be used in all circumstances -namely sparse and dense problems -where it is needed.

VII. IMPLEMENTATION
As support material for this tutorial, we offer an own implementation of a modular ILS optimization framework, that has been designed around the methodology illustrated in Section IV-F. Our system is written in modern C++17 and provides static type checking, AD and a straightforward interface. The core of our framework fits in less than 6000 lines of code, while the companion libraries to support the most common problemse.g. 2D and 3D SLAM, ICP, Projective and Dense Registration, Sensor Calibration -are contained in 6500 lines of code. Albeit originally designed as a tool for rapid prototyping, our system achieves a high degree of customization and competes with other state-ofthe-art systems in terms of performances.
Based on the requirements outlined in Sec. VI-B, we designed a component model, where the processing objects (named Configurable) can possess parameters, support dynamic loading and can be transparently serialized. Our framework relies on a custom-built serialization infrastructure that supports format independent serialization of arbitrary data structures, named Basic Object Serialization System (BOSS). Furthermore, thanks to this foundation, we can provide both a graphical configurator -that allows to assemble and easily tune the modules of a solver -and a command-line utility to edit and run configurations on the go.
The goal of this section is to provide the reader with a quick overview of the proposed system, focusing on how the user interacts with it. Given the class-diagram illustrated in Fig. 12, in the remainder we will first analyze the core modules of the solver and then provide two practical examples on how to use it.

A. Solver Core Classes
Our framework has been designed to satisfy the requirements stated in Sec. VI, embedding unified APIs to cover both dense and sparse problems symmetrically. Furthermore, thanks to the BOSS serialization library, the user can generate permanent configuration of the solver, to be later read and reused on the go. The configuration of a solver generally embeds the following parameters: Fig. 12: UML class diagram of our architecture. In green we show the type-independent classes, in pink the type-dependent classes and in pale blue we outline potential specializations. The template class names end with an underscore and the arguments are highlighted above the name, between angular brackets. Arrow lines denote inheritance, and diamonds aggregation/ownership.
-Optimization Algorithm: the algorithm that performs the minimization; currently, only GN and LM are supported, still, we plan to add also TRM approaches -Linear Solver: the algebraic solver that computes the solution of the linear system H∆x = −b; we embed a naive AMD-based [65] linear solver together with other approaches based on well-known highly-optimized linear algebra librariese.g. SuiteSparse 1 -Robustifier: the robust kernel function applied to a specific factor; we provide several commonly used instances of robustifier, together with a modular mechanism to assign specific robustifier to different types of factor -called robustifier policy -Termination Criterion: a simple modules that, based on optimization statistics, checks whether convergence has been reached. Note that, in our architecture there is a clear separation between solver and problem. In the next section we will describe how we formalized the latter. In the remaining of this section, instead, we will focus on the solver classes, which are in charge of computing the problem solution.
The class Solver implements a unified interface for our optimization framework. It presents itself to the user with an unified data-structure to configure, control, run and monitor the optimization. This class allows to select the type of algorithm used within one iteration, which algorithm to use to solve the linear system, or which termination criterion to use. This mechanism is achieved by delegating the execution of these functions to specific interfaces. More in detail, the linear system is stored in a sparse-block-matrix structure, that effectively separates the solution of the linear system from the rest of the optimization machinery. Furthermore, our solver supports incremental updates, and can provide an estimate of partial covariance blocks. Finally, our system supports hierarchical approaches. In this sense the problem can be represented at different resolutions (levels), by using different factors. When the solution at a coarse level is computed, the optimization starts from the next denser level, enabling a new set of factors. In the new step, the initial guess is computed from the solution of the coarser level.
The IterationAlgorithmBase class defines an interface for the outer optimization algorithmi.e. GN or LM. To carry on its operations, it relies on the interface exposed by the Solver class. The latter is in charge to invoke the IterationAlgorithmBase, which will run a single iteration of its algorithm.
Class RobustifierBase defines an interface for using arbitrary ρ(u) functions -as illustrated in Sec. IV-D. Robust kernels can be directly assigned to factors or, alternatively, the user might define a policy, that based on the status of the actual factor decides which robustifier to use. The definition of a policy is done by implementing the RobustifierPolicyBase interface.
Finally, TerminationCriterionBase defines an interface for a predicate that, exploiting the optimization statistics, detects whether the system has converged to a solution or a fatal error has occurred.

B. Factor Graph Classes
In this section we provide an overview of the top-level classes constituting a factor graphi.e. the optimization problem -in our framework. In specifying new variables or factors, the user can interact with the system through a layered interface. More specifically, factors can be defined using AD and, thus, contained in few lines of code for rapid prototyping or the user can directly provide how to compute analytic Jacobians if more speed is required. Furthermore, to achieve extreme efficiency, the user can choose to compute its own routines to update the quadratic form directly, consistently in line with our design requirement of more-work/more-performance. Note that, we observed in our experiments that in large sparse problems the time required to linearize the system is marginal compared to the time required to solve it. Therefore, in most of these cases AD can be used without significant performance losses.
1) Variables: The VariableBase implements a base abstract interface the variables in a factor graph, whereas Variable_<PerturbationDim,EstimateType> specializes the base interface on a specific type. The definition of a new variable extending the Variable_ template requires the user to specify i) the type EstimateType used to store the value of the variable X i , ii) the dimension PerturbationDim of the perturbation ∆x i and iii) the ⊞ operator. This is coherent with the methodology provided in Sec. IV-F. In addition to these fields, variable has an integer key, to be uniquely identified within a factor graph. Furthermore, a variable can be in either one of these three states: -Active: the variable will be estimated -Fixed: the variable stays constant through the optimization -Disabled: the variable is ignored and all factors that depend on it are ignored as well. To provide roll-back operations -such as those required by LM -a variable also stores a stack of values.
To support AD, we introduce the ADVariable_ template, that is instantiated on a variable without AD. Instantiating a variable with AD requires to define the ⊞ operator by using the AD scalar type instead of the usual float or double. This mechanism allows us to mix in a problem factors that require AD with factors that do not.
2) Factors: The base level of the hierarchy is the FactorBase. It defines a common interface for this type of graph objects. It is responsible of i) computing the errorand, thus, the χ 2 -ii) updating the quadratic form H and the right-hand side vector b and iii) invoking the robustifier function (if required). When an update is requested, a factor is provided with a structure on which to write the outcome of the operation. A factor can be enabled or disabled. In the latter case, it will be ignored during the computation. Besides, upon update a factor might become invalid, if the result of the computation is meaningless. This occurs for instance in BA, when a a point is projected outside the image plane.
The Factor_<VariableTupleType> class implements a typed interface for the factor class. The user willing to extend the class at this level is responsible of implementing the entire FactorBase interface, relying on functions for typed access to the blocks of the system matrix H and of the coefficient vector b. In this case, the block size is determined from the dimension of the perturbation vector of the variables in the template argument list. We extended the factors at this level to implement approaches such as dense multi-cue registration [31]. Special structures in the Jacobians can be exploited to speed up the calculation of H k whose computation has a non negligible cost.
The ErrorFactor_<ErrorDim, VariableTypes...> class specializes a typed interface for the factor class, where the user has to implement both the error function e k and the Jacobian blocks J k,i . The calculation of the H and the b blocks is done through loops unrolled at compile time since the types and the dimensions of the variables/errors are part of the type.
The ADErrorFactor_<Dim, VariableTypes...> class further specializes the ErrorFactor_. Extending the class at this level only required to specify only the error function. The Jacobians are computed through AD, and the updates of H and the b are done according to the base class.
Finally, the FactorCorrespondenceDriven_<FactorType> implements a mechanism that allows the solver to iterate over multiple factors of the same type and connecting the same set of variables, without the need of explicitly storing them in the graph. A FactorCorrespondenceDriven_ is instantiated on a base type of factor, and it is specialized by defining which actions should be carried on as a consequence of the selection of the "next" factor in the pool by the solver. The solver sees this type of factor as multiple ones, albeit a FactorCorrespondenceDriven_ is stored just once in memory. Each time a FactorCorrespondenceDriven_ is accessed by the solver a callback changing the internal parameters is called. In its basic implementation this class takes a container of corresponding indices, and two data containers: Fixed and Moving. Each time a new factor within the FactorCorrespondenceDriven_ is requested, the factor is configured by: selecting the next pair of corresponding indices from the set, and by picking the elements in Fixed and Moving at those indices. As an instance, to use our solver within an ICP algorithm, the user has to configure the factor by setting the Fixed and Moving point clouds. The correspondence vector can be changed anytime to reflect a new data association. This results in different correspondences to be considered at each iteration.
3) FactorGraph: To carry on an iteration, the solver has to iterate over the factors and, hence, it requires to randomly access the variables. Restricting the solver to access a graph through an interface of random access iterators enables us to decouple the way the graph is accessed from the way it is stored. This would allow us to support transparent off-core storage that can be useful on very large problems.
A FactorGraphInterface defines the way to access a graph. In our case we use integer values as key for variables and factors. The solver accesses a graph only through the FactorGraphInterface and, thence, it can read/write the value of state variables, read the factors, but it is not allowed to modify the graph structure.
A heap-based concrete implementation of a factor graph is provided by the FactorGraph class, that specializes the interface. The FactorGraph supports transparent serialization/deserialization. Our framework makes use of the opensource math library Eigen [73], which provides fast and easy matrix operation. The serialization/deserialization of variable and factors that are constructed on Eigen types is automatically handled by our BOSS library.
In sparse optimization it is common to operate on a local portion of the entire problem. Instrumenting the solver with methods to specify the local portions would bloat the implementation. Alternatively we rely on the concept of FactorGraphView that exposes an interface on a local portion of a FactorGraph -or of any other object implementing the FactorGraphInterface.

VIII. EXPERIMENTS
In this section we propose several comparisons between our framework and other state-of-the-art optimization system. The aim of these experiments is to support the claims on the performance of our framework and, thus, we focused on the accuracy of the computed solution and the time required to achieve it. Experiments have been performed both on dense scenarios -such as ICP -and sparse onese.g. PGO and PLGO.

A. Dense Problems
Many well-known SLAM problems related to the frontend can be solved exploiting the ILS formulation introduced before. In such scenariose.g.point-clouds registrationthe number of variables is small compared to the observations' one. Furthermore, at each registration step, the dataassociation is usually recomputed to take advantage of the new estimate. In this sense, one has to build the factor graph associated to the problem from scratch at each step. In such contexts, the most time consuming part of the process is represented by the construction of linear system in Eq. (19) and not its solution.
To perform dense experiments, we choose a well-known instance of this kind of problems: ICP. We conducted multiple tests, comparing our framework to the current state-ofthe-art PCL library [15] on the standard registration datasets summarized in Tab. I. In all the cases, we setup a controlled benchmarking environment, to have a precise groundtruth. In the ETH-Hauptgebaude, ETH-Apartment and Stanford-Bunny, the raw data consists in a series of DATASET SENSOR VARIABLES FACTORS ICL-NUIM [74] RGB-D 1 307200 ETH-Hauptgebaude [75] Laser-scanner 1 189202 ETH-Apartment [75] Laser-scanner 1 370276 Stanford-Bunny [76] 3D digitalizer 1 35947 range scans. Therefore, in such cases, we constructed the ICP problem as follows: -reading of the first raw scan and generate a point cloud -transformation of the point cloud according to a known isometry T GT -generation of perfect association between the two clouds -registration starting from T init = I. Since our focus is on the ILS optimization, we used the same set of data-associations and the same initial guess for all approaches. This methodology renders the comparison fair and unbiased. As for the ICL-NUIM dataset, since obtained the raw point cloud unprojecting the range image of the first reading of the lr-0 scene. After this initial preprocessing, the benchmark flow is the same described before.
In this context, we compared i) the accuracy of the solution obtained computing the translational and rotational error of the estimate and ii) the time required to achieve that solution. We compared the recommended PCL registration suite -that uses the Horn formulas -against our framework with and without AD. Furthermore, we also provide results obtained using PCL implementation of the LM optimization algorithm.
As reported in Tab. II, the final registration error is almost negligible in all cases. Instead, in Fig. 13 we document the speed of each solver. When using the full potential of our frameworki.e. using analytic Jacobians -it is able to achieve results in general equal or better than the offthe-shelf PCL registration algorithm. Using AD has a great impact on the iteration time, however, our system is able to be faster than the PCL implementation of LM also in this case.

B. Sparse Problems
Sparse problems are mostly represented by generic global optimization scenarios, in which the graph has a large number of variables while each factor connects a very small subset of those (typically two). In this kind of problems, the graph remains unchanged during the iterations, therefore, the most time-consuming part of the optimization is the solution of the linear system not its construction. PGO and PLGO are two instances of this problem that are very common in the SLAM context and, therefore, we selected these two to perform comparative benchmarks.
1) Pose-Graph Optimization: PGO represents the backbone of SLAM systems and it has been well investigated by the research community. For these experiments, we employed standard 3D PGO benchmark datasets -all publicly available [62]. We added to the factors Additive White Gaussian Noise (AWGN) and we initialized the graph using the             and standard deviation over all noise trials. As expected, the result obtained are in line with all other methods. Fig. 14, instead, reports a detailed timing analysis. The time to perform a complete LM iteration is always among the smallest, with a very narrow standard deviation. Furthermore, since the specific implementation of LM is slightly different in each framework, we reported also the total time to perform the full optimization, while the number of LM iteration elapsed are shown in Tab. V. Also in this case, our system is able to achieve state-of-the-art performances that are better or equal to the other approaches.
2) Pose-Landmark Graph Optimization: PLGO is another common global optimization task in SLAM. In this case, the variables contain both robot (or camera) poses and landmarks' position in the world. Factors, instead, embody spatial constraints between either two poses or between a pose and a landmark. As a result, this kind of factor graphs are the perfect representative of the SLAM problem, since they contain the robot trajectory and the map of the environment. To perform the benchmarks we used two datasets: Victoria Park [77] and KITTI-00 [78]. We obtained the last one running ProSLAM [79] on the stereo data and saving the full output graph. We super-imposed to the factors specific AWGN and we generated the initial guess through the breadth-first initialization technique. Tab. VI summarizes the specification of the datasets used in these experiments. Also in this case, we sampled multiple noise trials (5 samples) and reported mean and standard deviation of the results obtained. The configuration of the framework is the same as the one used in PGO experimentsi.e. 100 LM iterations at most, with termination criterion active.
As reported in Tab. VII the ATE (RMSE) that we obtain is compatible with the one of the other frameworks. The higher error in the kitti-00-full dataset is mainly due to the slow convergence of LM, that triggers too early the termination criterion, as shown in Tab. VIII. In such case, the use of GN leads to better results, however, in order to not bias the evaluation, we choose to not report results obtained with different ILS algorithms. As for the wall times to perform the optimization, the results are illustrated in Fig. 15. In PLGO scenarios, given the fact that there are two types of factors, the linear system in Eq. (19) is can be rearranged as follows: A linear system with this structure can be solved more efficiently through the Schur complement of the Hessian matrix [80], namely: Ceres-Solver and g 2 o can make use of the Schur complement to solve this kind of special problem, therefore, we reported also the wall times of the optimization when this technique is used. Obviously, using the Schur complement leads to a major improvement in the efficiency of the linear solver, leading to very low iteration times. For completeness, we reported the results of GTSAM with two different linear solvers: cholesky multifrontal and cholesky sequential. Our framework does not provide at the moment any implementation of a Schurcomplement-based linear solver, still, the performance achieved are in line with all the non-Schur methods, confirming our conjectures.

IX. CONCLUSIONS
In this work, we propose a generic overview on ILS optimization for factor graphs in the fields of robotics and computer vision. Our primary contribute is providing a unified and complete methodology to design efficient solution to generic problems. This paper analyzes in a probabilistic flavor the mathematical fundamentals of ILS, addressing also many important collateral aspects of the problem such as dealing with non-Euclidean spaces and with outliers, exploiting the sparsity or the density. Then, we propose a set of common use-cases that exploit the theoretic reasoning previously done.
In the second half of the work, we investigate how to design an efficient system that is able to work in all the possible scenarios depicted before. This analysis led us to the development of a novel ILS solver, focused on efficiency, flexibility and compactness. The system is developed in modern C++ and almost entirely self-contained in less than 6000 lines of code. Our system can seamlessly deal with sparse/dense, static/dynamic problems with a unified consistent interface. Furthermore, thanks to specific implementation designs, it allows to easily prototype new factors and variables or to intervene at low level when performances are critical. Finally, we provide an extensive evaluation of     the system's performances, both in densee.g. ICP -and sparsee.g. batch global optimization -scenarios. The evaluation shows that the performances achieved are in line with contemporary state-of-the-art frameworks, both in terms of accuracy and speed.

APPENDIX I SE(3) MAPPINGS
In this section, we will assume that a SE(3) variable X is composed as follows: A possible minimal representation for this object could be using 3 Cartesian coordinates for the position and the 3 Euler angles for the orientation, namely ∆x = ∆x ∆y ∆z ∆φ ∆γ ∆ψ ⊤ = ∆t ∆θ ⊤ (68) To pass from one representation to the other, we should define suitable mapping functions. In this case, we use the following notation: More in detail, the function v2t computes the SE(3) isometry reported in Eq. (67) from the 6-dimensional vector ∆x in Eq. (68). While the translational component of X can be recovered straightforwardly from ∆x, the rotational part requires to compose the rotation matrices around each axis, leading to the following relation: R(∆θ) = R(∆φ, ∆γ, ∆ψ) = R x (∆φ) R y (∆γ) R z (∆ψ). .
Other minimal parametrizations of SE(3) can be used, and they typically differ on how they represent the rotational component of the SE(3) object. Common alternatives to euler angles are unit quaternions and axis-angle. Clearly, changing the minimal parametrization will affect the jacobians too, and thus the entire optimization process.