Pose Normalization of Indoor Mapping Datasets Partially Compliant to the Manhattan World Assumption

In this paper, we present a novel pose normalization method for indoor mapping point clouds and triangle meshes that is robust against large fractions of the indoor mapping geometries deviating from an ideal Manhattan World structure. In the case of building structures that contain multiple Manhattan World systems, the dominant Manhattan World structure supported by the largest fraction of geometries is determined and used for alignment. In a first step, a vertical alignment orienting a chosen axis to be orthogonal to horizontal floor and ceiling surfaces is conducted. Subsequently, a rotation around the resulting vertical axis is determined that aligns the dataset horizontally with the coordinate axes. The proposed method is evaluated quantitatively against several publicly available indoor mapping datasets. Our implementation of the proposed procedure along with code for reproducing the evaluation will be made available to the public upon acceptance for publication.


Introduction
The importance of digital models of building environments has been steadily increasing in recent years [86,38]. Nowadays, many building projects are planned digitally in 3D using Building Information Modelling (BIM) techniques [8]. Thus, a valid digital three-dimensional model arises along with the construction of the respective building which can be profitably used during all the stages of the life cycle of a facility, i.e. usage and maintenance e.g. in the context of facility management, changes and modifications on the building and eventually dismantling [2,5,61,27]. However, in the case of older, already existing buildings, three dimensional digital models often do not exist and two dimensional plans are often faulty or outdated. Manually reconstructing digital models (as-is BIM models) for suchlike buildings is a tedious and time consuming process [72,6].
On the other hand, there currently exists a broad range of sensor systems that can be deployed to the task of accurately mapping indoor environments [48,15,64,60]. Terrestrial Laser Scanners (TLS) for instance can provide a high geometric accuracy of acquisition depending on the respective conditions e.g. in terms of surface characteristics and scanning geometry [80,89]. In order to achieve a complete capture of an environment however, multiple scans have to be conducted from varying positions. Especially in the case of mapping the interior of building structures, this can be quite cumbersome as the device needs to be set up at numerous positions while the resulting scans subsequently need to be aligned.
Mobile mapping systems on the other hand alleviate these restrictions by continuously tracking their own position and orientation with respect to an initial pose. Indoor mapping geometries acquired over time can thus be projected successively into a common coordinate system while the operator can achieve a complete scene capture by walking through the scene. Mobile mapping systems encompass e.g. trolley-based (like NavVis 1 ) or backpack-mounted sensors [65,7,87,40] or even UAV-based systems [30] as well as hand-carried (e.g. Leica BLK2GO 2 ) or head-worn devices (e.g. Microsoft HoloLens 3 ). The latter, actually being an Augmented Reality (AR) system, offers the additional advantage of directly visualizing the already captured geometries within the view of the operator, facilitating the complete coverage of an indoor environment.
While conventional TLS or mobile laser scanning systems provide indoor mapping data in the form of point clouds, some consumer-grade systems such as the aforementioned Microsoft HoloLens or the Matterport system 4 can provide indoor mapping data in the form of preprocessed, condensed triangle meshes. Such triangle meshes, being a product derived from the primary point-based measurements, were found to still provide sufficient accuracy for a wide range of applications [3,90] while being significantly more compact in terms of data size and thus more efficient in terms of required processing time.
This broad range of available indoor mapping systems can provide an ample basis of data for the digital, three-dimensional reconstruction of built indoor environments. Instead of having to take individual distance measurements in the respective building or having to bridge the mental gap between conventional, two-dimensional floor plans and the three-dimensional modeling environment, indoor mapping data representing existing buildings can be loaded directly into the modeling environment. However, the manual digital reconstruction on the basis of indoor mapping data can still be a time-consuming endeavor. Hence, automating this process has become the focus of a currently quite active field of research [55,39,71].
While recent approaches in the field of automated indoor reconstruction are becoming more flexible regarding the building structure represented by the indoor mapping data [66,94,63,85,92,33], restricting assumptions about the building structure are still oftentimes applied. A frequently applied simplification in this context is the Manhattan World assumption which is for instance relied upon in the indoor reconstruction approaches presented in [25,26,68,78].
The Manhattan World assumption, as first proposed by Coughlan and Yuille [17,18], presupposes all surfaces to be perpendicular to one of the three coordinate axes. Applied to the context of building structures, this assumption thus prohibits curved room surfaces as well as surfaces being oriented diagonally with respect to the main building structure, i.e. diagonal walls or slanted ceilings. The Manhattan World assumption has later been extended to the Atlanta World assumption by Schindler and Dellaert [76], which weakens the Manhattan World assumption by permitting vertical surfaces to have arbitrary angles around a common vertical coordinate axis while horizontal surfaces are still expected to be perpendicular to the vertical axis. Thus, an Atlanta World structure can be regarded as a composition of multiple Manhattan World structures varying by a rotation around a common (vertical) coordinate axis. Besides the context of indoor reconstruction, the Manhattan World assumption as well as the weaker Atlanta World assumption have been used in a range of other application fields such as point cloud segmentation [83,46,82], the extraction of road structures from low-scale airborne images [22] or stabilization and drift reduction in the context of Visual Odometry (VO) [75,81,82] and Simultaneous Localization and Mapping (SLAM) [70,95,49,53].
The fact that a given indoor reconstruction approach relies on the Manhattan World assumption does not only imply that the building structure to be reconstructed itself must be compliant to the Manhattan World assumption. Rather, this also implies that the geometric representation of the respective building in the indoor mapping data must be correctly aligned with the coordinate axes in accordance with the definition of the Manhattan World assumption, i.e. that the surfaces pertaining to the three main directions (or six when considering oriented directions) are aligned with the three axes of the coordinate system.
In the context of indoor mapping however, the pose of the captured building structure with respect to the coordinate system does not necessarily fulfill this requirement. Frequently, the coordinate system is determined by the initial pose of the indoor mapping system at the beginning of the mapping process. Thus, the orientation of the indoor mapping data can deviate from the Manhattan World assumption by a rotation around the vertical coordinate axis even if the mapped building structure itself is totally compliant with the Manhattan World assumption. Moreover, the orientation of the vertical axis itself can also deviate from its optimal orientation according to the Manhattan World assumption, i.e. being perpendicular to horizontal ceiling and floor surfaces. This is generally not the case when a leveled mounting of the respective indoor mapping sensor is used, e.g. in the case of tripod-mounted systems like TLS or trolley-based systems. In the case of hand-held or head-worn indoor mapping systems where a perfectly leveled orientation at the start of the indoor mapping process cannot be guaranteed, an eventual misalignment of the indoor mapping data with respect to the vertical coordinate axis needs to be taken into account.
Aligning an indoor mapping dataset with the coordinate axes (horizontally and, depending on the indoor mapping system used, also vertically) is thus a necessary preprocessing step for automated indoor reconstruction approaches that rely on the Manhattan World assumption. Moreover, such an alignment process, also known as pose normalization, can still be a reasonable choice even if the respective indoor reconstruction method does not presuppose a Manhattan World compliant building structure. This is for instance the case when an indoor reconstruction approach makes use of a voxel grid or octree representation of the input data [23,29,16,36]. Even if a voxel-based indoor reconstruction approach is able to handle building structures deviating from the Manhattan World assumption, having room surfaces aligned with the coordinate axes and thus with the voxel grid will result in a cleaner and visually more appealing reconstruction in voxel space. Furthermore, spatially discretizing data which is not aligned with the coordinate axes can lead to aliasing effects that can impede a successful reconstruction process [58,59,93]. Besides, pose normalization often (but not necessarily always, depending on the respective building structure) results in a minimal axis-aligned bounding box circumscribing the indoor mapping data and thus in a reduced memory size of the voxel grid structure.
Lastly, pose normalization of indoor mapping data can also be of benefit in the context of the co-registration of multiple datasets representing the same indoor environment that are to be aligned with each other [91,14,32]. The respective datasets to be aligned can be acquired by different sensor systems or at different times, e.g. in the context of change detection [4,47,56]. While pose normalization with respect to a Manhattan World structure does not entirely solve this problem, as an ambiguity of rotations by multiples of 90° around the vertical axis remains, it nonetheless can be reasonable to apply pose normalization when co-registering indoor mapping datasets as it reduces the problem to finding the correct one of only four possible states per dataset.
The same arguments speaking in favor of pose normalization, even if an indoor reconstruction approach does not necessarily depend on it, also hold for the case of building structures that are only partly compliant to the Manhattan World assumption. Thus, a pose normalization approach should be robust against a substantial amount of the given indoor mapping geometries deviating from the Manhattan World structure of the building. Particularly in the case of building environments that contain multiple Manhattan World structures (i.e. Atlanta World), the dominant Manhattan World structure (e.g. in terms of the largest fraction of supporting geometries) should be used for alignment with the coordinate axes. In situations where multiple Manhattan World structures have about the same support, it might be reasonable to detect them all and create multiple solutions for a valid pose normalization.
In a more general context, a range of pose normalization approaches have been presented that aim at aligning arbitrary three-dimensional objects with the coordinate axes. These objects do not necessarily represent building structures [41,69,12,24,50,13,51,77]. These approaches are mainly motivated by the need to design rotation invariant shape descriptors in the context of shape retrieval, i.e. finding all three-dimensional objects similar to a given query shape in a large database of 3D objects [97,84].
In this context, variations of the Principal Component Analysis (PCA) algorithm [37] are often made use of [41,69,12,13]. Also, symmetries in the geometry of the respective object are often exploited as well [12,24,13]. Other approaches rely on the geometric property of rectilinearity [50,51] or aim to minimize the size of a surface-oriented bounding box circumscribing the target object [77].
More specifically concerning building structures, a recent pose normalization approach makes use of Point Density Histograms, discretizing and aggregating the points of an indoor mapping point cloud along the direction of one of the horizontal coordinate axes [58,59]. The optimal horizontal alignment of the point cloud is determined by maximizing the size and distinctness of peaks in this histogram varying with the rotation around the vertical axis.
Other approaches, including the one proposed in this paper, do not discretize the data with respect to their position but with respect to their orientation [67,42,20,19]. This is conducted on the Extended Gaussian Image [31] which consists of the normal vectors of the individual indoor mapping geometries projected on the unit sphere. Besides its application in the context of pose normalization, the Extended Gaussian Image is also frequently applied to the segmentation of point clouds [88,83,79,82,98] or plane detection [52], particularly with regard to building structures.
In a straight-forward approach for instance, the points in the Extended Gaussian Image are subjected to a k-Means clustering [57,54] to determine three clusters corresponding to the main directions of the Manhattan World structure while disregarding the absolute orientation of the normal vectors (i.e. projecting them all in the same hemisphere) [42,20]. This however is not robust to deviations of the indoor mapping point cloud from an ideal Manhattan World structure. In contrast, using DBSCAN [21] for clustering on the Extended Gaussian Image has been proposed [19] which is more robust as it does not fix the number of clusters to exactly three. This allows for the presence of surfaces deviating from an ideal Manhattan World system. The proposed approach however only aims at detecting dominant planes to remove them from the point cloud and does not assemble the detected orientation clusters to Manhattan World structures. In another approach, dominant horizontal directions are detected by projecting the normal vectors to the horizontal plane and binning the resulting angles to a horizontal reference coordinate axis in a similar manner to the approach presented in this paper [67].
All of the approaches mentioned above only concern themselves with determining an orientation around the vertical axis to achieve an alignment of the Manhattan World structure of an indoor mapping dataset with the coordinate axes. To the best of our knowledge, no approach on pose normalization of indoor mapping point clouds or triangle meshes has yet been proposed that aims at determining an optimal alignment with respect to the orientation of the vertical axis as well. Furthermore, the presented approaches do not address the topic of robustness to deviations of the respective building structure from an ideal Manhattan World scenario or the presence of multiple Manhattan World structures in the same building.
In this work, we present a novel pose normalization method for indoor mapping point clouds and triangle meshes that is robust to the represented building structures being only partly compliant to the Manhattan World assumption. In case there are multiple major Manhattan World structures present in the data, the dominant one is detected and used for alignment. Besides the horizontal alignment of the Manhattan World structure with the coordinate system axes, vertical alignment is also supported for cases where the deployed indoor mapping system is not leveled and the resulting dataset is thus misaligned with respect to the vertical coordinate axis. In this context, we presume that the indoor mapping dataset is coarsely leveled to within ±30° of the optimal vertical direction, which can usually be expected to be the case for hand-carried or head-worn mobile indoor mapping systems. We furthermore presuppose the individual indoor mapping geometries to have normal vectors. These do not need to be consistently oriented and can thus easily be determined in a preprocessing step for point clouds, while triangle meshes already have normal vectors inherent in the geometry of the individual triangles. Our implementation of the proposed pose normalization approach along with the code for the presented quantitative evaluation on publicly available indoor mapping datasets will be made available to the public upon acceptance for publication.
The presented approach for pose normalization is described in Section 2 along with a method to resolve the ambiguity of a rotation by multiples of 90° around the vertical axis and the procedure applied for quantitative evaluation. The results of this evaluation procedure applied to several publicly available indoor mapping point clouds and triangle meshes are subsequently presented in Section 3 and discussed in further detail in Section 4. Finally, we close in Section 5 with concluding remarks and an outlook on future research.

Materials and Methods
In this section, we present a novel method for automatic pose normalization of indoor mapping point clouds or triangle meshes which represent building structures that are at least partially compliant to the Manhattan World assumption. The presented method aims at rotating the given indoor mapping geometries to a pose with respect to the surrounding coordinate system for which the largest possible fraction of normal vectors is aligned with the three Cartesian coordinate axes. This comprises an optional leveling step to orient horizontal surfaces like floors and ceilings to be orthogonal to a chosen vertical axis if this is not already achieved by the data acquisition process (e.g. by using leveled tripod or trolley mounted acquisition systems). Subsequently, a second step determines the optimal rotation angle around this vertical axis in order to align the largest possible fraction of the building surfaces with the horizontal pair of orthogonal coordinate axes.
The presented method is applicable to all kinds of indoor mapping point clouds and triangle meshes. However, we assume the individual geometric primitives comprising the input data to have normal vectors. While these are intrinsically given for the individual triangles comprising a triangle mesh, the individual points of indoor mapping point clouds do not generally have normal vectors. These can however be easily determined by means of established methods like [62,9,96,74], which we assume in this work as a necessary preprocessing step. Note that these normal vectors need not be oriented, i.e. only their direction is of importance. Furthermore, we assume the input data to be at least coarsely leveled, i.e. we assume the represented building structures to be coarsely aligned with the vertical axis within the range of ±30°.
In the following, $n_i$ denotes the $i$-th normal vector of $N$ given input geometries (i.e. points or triangles) while $\langle \cdot, \cdot \rangle$ denotes the dot product of two 3D vectors. Furthermore, the vector determining the vertical axis is denoted by $z$. This vector need not necessarily equal $(0\ 0\ 1)^T$; it can be chosen freely in accordance with the intended coordinate system. However, it must coincide within ±30° with the current vertical orientation of the input data. Similarly, a horizontal axis $x$ orthogonal to the configured $z$-axis is to be chosen. Lastly, the second horizontal axis completing the Cartesian coordinate system need not be explicitly stated but can be determined as

$$y = z \times x \tag{1}$$

Again, note that the horizontal axes need not necessarily equal $(1\ 0\ 0)^T$ and $(0\ 1\ 0)^T$.
In the following, Section 2.1 first presents the proposed method for determining an optimal rotation around the vertical axis in order to horizontally align the indoor mapping data with the coordinate system in case the dataset is already vertically aligned in relation to the vertical axis. A suitable method for ensuring this vertical alignment that can be applied as a preprocessing step to datasets that are only coarsely aligned with the vertical direction (±30°) is subsequently presented in Section 2.2. As the proposed method for determining the rotation around the vertical axis is ambiguous with regard to multiples of 90°, Section 2.3 presents an approach to solve this ambiguity. Lastly, Section 2.4 presents the evaluation methodology applied in this study.

Figure 1: Exemplary triangle mesh of a building with multiple Manhattan World systems (dataset 'mJXqzFtmKg4' from Matterport3D [11]). The green bounding box in the top-down view on the right-hand side illustrates the alignment along the dominant Manhattan World structure considered as ground truth pose while the red bounding box illustrates the pose rotated by 30° around the vertical axis as exemplarily used in Section 2.1.
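As an illustration of how normal vectors and area weights arise for triangle meshes, the following minimal sketch derives both from the three vertices of a triangle. The function name `triangle_normal_and_area` is hypothetical and not taken from the implementation accompanying this paper; the sign of the returned normal depends on the vertex order, which is unproblematic here since the normals need not be oriented.

```python
import math

def triangle_normal_and_area(a, b, c):
    """Return the unit normal and the area of the triangle with vertices a, b, c."""
    u = tuple(b[i] - a[i] for i in range(3))
    v = tuple(c[i] - a[i] for i in range(3))
    # The cross product of two edge vectors is perpendicular to the triangle
    # and its length equals twice the triangle area.
    cr = (u[1] * v[2] - u[2] * v[1],
          u[2] * v[0] - u[0] * v[2],
          u[0] * v[1] - u[1] * v[0])
    length = math.sqrt(sum(k * k for k in cr))
    if length == 0.0:
        raise ValueError("degenerate triangle has no normal")
    return tuple(k / length for k in cr), 0.5 * length

normal, area = triangle_normal_and_area((0, 0, 0), (2, 0, 0), (0, 2, 0))
print(normal, area)
```

The area returned here can serve as the per-triangle weight, whereas points of a point cloud would simply receive a constant weight.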

Rotation around the vertical axis
In this section, we preliminarily assume that the given indoor mapping data (comprised of triangles or points) is already leveled with regard to a chosen vertical axis $z$ (that does not necessarily need to equal $(0\ 0\ 1)^T$). Thus, only one rotation angle around this vertical axis is to be determined in order to align the two horizontal axes of the coordinate system with the horizontal directions of the dominant Manhattan World structure underlying the respective building represented by the input data.
In case the given input data is not entirely compliant to the Manhattan World assumption, a best-possible solution in terms of the alignment of all normal vectors with the horizontal coordinate axes is to be found. Even indoor mapping data that represents building structures entirely compliant to the Manhattan World assumption can have a significant amount of normal vector directions deviating from the directions of the respective Manhattan World system. These deviating normal vector directions can be caused by actual unevenness of walls, by noise inherent in data acquisition and normal determination as well as by clutter like furniture objects being present in the indoor mapping data additionally to the building structure itself.
Besides being robust against these restrictions, the presented method is also applicable to building structures that are only partially Manhattan World conform. Building structures with multiple Manhattan World systems like the one depicted in Figure 1 are aligned according to the respective Manhattan World system supported by the largest fraction of normal vector directions. The coordinate axes are visualized in red for x, green for y and blue for the vertical axis z.
Thus, the task at hand is to determine an angle of rotation around the vertical axis $z$ that leads to the largest possible fraction of normal vectors being aligned with the horizontal axes $x$ and $y$. To this aim, we first filter the normal vectors that can be considered coarsely horizontal with respect to the vertical axis $z$. For this, we consider all $N_h$ normal vectors $n_i^h$ that are within the range of ±45° of a horizontal orientation, thus

$$n_i^h \in \left\{\, n_i \;\middle|\; 45^\circ \le \angle(n_i, z) \le 135^\circ \,\right\}$$

where $\angle(\cdot, \cdot)$ denotes the smallest angle between two 3D vectors with respect to any rotation axis. For the indoor mapping mesh depicted in Figure 1, the corresponding horizontal normal vectors $n_i^h$ are depicted in the form of an Extended Gaussian Image in Figure 2. In this example, the triangle mesh of Figure 1 is rotated by 30° around the vertical axis relative to the ground truth pose aligned to the dominant Manhattan World structure.
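The filtering of coarsely horizontal normal vectors can be sketched as follows. This is a minimal illustration under the assumption that normals are given as 3-tuples; the helper names `angle_deg` and `horizontal_normals` are hypothetical.

```python
import math

def angle_deg(a, b):
    """Smallest angle in degrees between two 3D vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def horizontal_normals(normals, z=(0.0, 0.0, 1.0), tol_deg=45.0):
    """Keep the normals whose angle to z lies within 90 deg +/- tol_deg."""
    return [n for n in normals if abs(angle_deg(n, z) - 90.0) <= tol_deg]

normals = [(1, 0, 0), (0, 0, 1), (0, 1, 0.5), (0.1, 0, -1)]
print(horizontal_normals(normals))
```

Because the test is symmetric around 90°, a normal and its opposite are treated identically, so unoriented normals pose no problem.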
These horizontal normal vectors $n_i^h$ can subsequently be projected into the horizontal plane formed by the horizontal axes $x$ and $y$ by

$$\hat{n}_i^h = n_i^h - \langle n_i^h, z \rangle\, z$$

where their respective angles to the reference direction of $x$ around $z$ as axis of rotation can be determined as

$$\gamma_i = \operatorname{atan2}\!\left(\langle \hat{n}_i^h, y \rangle,\ \langle \hat{n}_i^h, x \rangle\right)$$

The problem at hand can be formulated as determining the rotation angle $\gamma \in [0^\circ, 90^\circ)$ around the vertical axis that minimizes the sum of angular distances of each horizontal normal vector to the respectively nearest horizontal coordinate axis, i.e.:

$$\gamma = \underset{\gamma' \in [0^\circ, 90^\circ)}{\arg\min} \sum_{i=1}^{N_h} w_i \min_{k \in \mathbb{Z}} \left| \gamma_i - \gamma' - k \cdot 90^\circ \right| \tag{5}$$

Here, the angular distance of each angle $\gamma_i$ to the nearest horizontal axis is weighted by a factor $w_i$. This factor can be constantly set to 1 for the points of an indoor mapping point cloud. In the case of triangle meshes however, it allows the individual triangles to be weighted by their respective area, as larger triangles imply a larger quantity of points in a corresponding point cloud representation. Equation 5 is not analytically solvable. It can however be solved numerically by derivative-free minimization methods like e.g. Brent minimization [10]. This, however, does not scale well with the size of the input data, as all the angles derived from the horizontal normal vectors need to be iterated over in each step of the respective numeric method. Particularly in the case of indoor mapping point clouds, the number of geometric primitives and thus of angles to be processed can become very large.
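To make the cost of the direct formulation tangible, the following sketch computes the horizontal angle of a normal vector and evaluates the weighted sum of angular distances to the nearest horizontal axis for a candidate rotation angle. It assumes unit-length, mutually orthogonal axes `x` and `z`; all function names are hypothetical, and the exhaustive scan over candidate angles merely stands in for a proper derivative-free minimizer such as Brent's method. Every evaluation of the cost touches all angles, which is why this route scales poorly.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def horizontal_angle_deg(n, x, z):
    """Angle of n, projected into the x/y plane, measured from x around z."""
    y = cross(z, x)  # second horizontal axis of a right-handed system
    # dot(n, x) and dot(n, y) are the coordinates of the projected normal.
    return math.degrees(math.atan2(dot(n, y), dot(n, x)))

def alignment_cost(gamma_deg, angles_deg, weights):
    """Weighted sum of angular distances to the nearest axis for a rotation gamma."""
    total = 0.0
    for g, w in zip(angles_deg, weights):
        d = (g - gamma_deg) % 90.0
        total += w * min(d, 90.0 - d)
    return total

def brute_force_gamma(angles_deg, weights, step_deg=0.1):
    """Scan candidate angles in [0, 90); every candidate iterates over all angles."""
    candidates = [k * step_deg for k in range(int(round(90.0 / step_deg)))]
    return min(candidates, key=lambda g: alignment_cost(g, angles_deg, weights))

angles = [30.0, 120.0, -60.0, 31.0]
print(brute_force_gamma(angles, [1.0] * len(angles)))
```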
Thus, in this work, we propose an approach that discretizes the input data into a one-dimensional grid of fixed resolution by means of which the angle of rotation for aligning the input data with the horizontal coordinate system can be determined non-iteratively in one step. In this context, a resolution of 1° proved to be suited for a coarse initial determination of the rotation angle for horizontal alignment that can subsequently be refined. For each angle $\gamma_i$, the respective grid cell is determined and incremented by the respective weight $w_i$, which again is constantly 1 for points of point clouds but in the case of triangle meshes weights the respective angle by the area of the corresponding triangle. Figure 3 visualizes such a one-dimensional grid representation of the horizontal angles $\gamma_i$ over the full circle of 360° for the mesh presented in Figure 1. The peaks in the summed weights per grid cell correspond to the eight horizontal main directions of the two Manhattan World systems present in the dataset depicted in Figure 1. To decide on the dominant of the two Manhattan World systems involved and to determine the corresponding rotation angle for an alignment of the input data with it, the weights of the involved grid cells need to be summed over all peaks pertaining to the same Manhattan World system. To this end, the peaks belonging to the same Manhattan World system and thus having an angular difference of a multiple of 90° between each other need to be identified and associated. Thus, we map the angles $\gamma_i \in [-180^\circ, 180^\circ)$ to

$$\tilde{\gamma}_i = \gamma_i \bmod 90^\circ$$

with a non-negative remainder. The discretized grid representation of the angles $\tilde{\gamma}_i \in [0^\circ, 90^\circ)$ thus needs only a quarter of the size in comparison to discretizing the angles $\gamma_i \in [-180^\circ, 180^\circ)$ with the same resolution. Furthermore, the resulting grid as visualized in Figure 4 enables the coarse initial determination of the rotation angle $\gamma$.
To this end, the weight sums per grid cell are thresholded with a threshold value of 0.75 times the maximal weight sum of the whole grid and subsequently clustered. While clustering, the fact that clusters can extend over the discontinuity between 0° and 90° must be taken into account.
Finally, the grid cell cluster with the largest weight summed over the contained cells is selected and $\gamma$ is determined as the weighted average of the angle values corresponding to the cluster cells (with 1° resolution) weighted by their respective weight sum values (cf. Figure 5). The resulting value for $\gamma$ can subsequently be further refined by determining the weighted median over all $\tilde{\gamma}_i$ within a certain angular distance of the initial value for $\gamma$ while applying the weights $w_i$. A threshold of 5° was found to be suitable for this task.
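The accumulation of weights into the folded one-dimensional grid can be sketched as follows. This is a minimal illustration assuming angles given in degrees; the function name `folded_histogram` is hypothetical.

```python
def folded_histogram(angles_deg, weights, resolution_deg=1.0):
    """Accumulate weights into a grid over [0, 90) after folding angles modulo 90 deg."""
    n_cells = int(round(90.0 / resolution_deg))
    grid = [0.0] * n_cells
    for g, w in zip(angles_deg, weights):
        folded = g % 90.0  # Python's % yields a non-negative remainder here
        # The extra modulo guards against a folded value rounding up to 90.0.
        cell = int(folded / resolution_deg) % n_cells
        grid[cell] += w
    return grid

# Four directions of one Manhattan World system plus one unrelated direction.
grid = folded_histogram([30.2, 120.7, -59.5, -149.9, 75.0], [1, 2, 1, 1, 3])
print(grid[30], grid[75])
```

A single pass over the angles suffices, after which all further steps operate on the 90 grid cells only, independently of the size of the input data.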
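The thresholding and clustering of grid cells, including clusters that wrap around the discontinuity, can be sketched as follows. The function name `cluster_peaks` is hypothetical, and grouping directly adjacent cells is only one possible clustering strategy assumed here for illustration.

```python
def cluster_peaks(grid, rel_threshold=0.75):
    """Group adjacent cells reaching rel_threshold * max(grid) on a circular grid."""
    n = len(grid)
    limit = rel_threshold * max(grid)
    above = [v >= limit for v in grid]
    if all(above):
        return [list(range(n))]
    # Start scanning directly after a cell below the threshold so that a
    # cluster spanning the wrap-around is visited in one contiguous run.
    start = next(i for i in range(n) if not above[i])
    clusters, current = [], []
    for k in range(1, n + 1):
        i = (start + k) % n
        if above[i]:
            current.append(i)
        elif current:
            clusters.append(current)
            current = []
    return clusters

grid = [0.0] * 90
grid[89], grid[0], grid[1] = 8.0, 10.0, 9.0  # one peak across the discontinuity
grid[45], grid[46] = 9.0, 2.0                # a second, narrower peak
print(cluster_peaks(grid))
```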
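The coarse estimate and its refinement can be sketched as follows. Both function names are hypothetical. The sketch assumes that a cluster is given as a contiguous list of cell indices in scan order, and it represents each cell by its center angle, which is an assumption made here rather than a detail stated above.

```python
def cluster_mean_angle(grid, cluster, resolution_deg=1.0):
    """Weighted average of the cell-center angles of a contiguous circular cluster."""
    n = len(grid)
    first = cluster[0]
    num = den = 0.0
    for c in cluster:
        steps = (c - first) % n  # unwrap cells that continue past the 90 deg boundary
        centre = (first + steps + 0.5) * resolution_deg
        num += grid[c] * centre
        den += grid[c]
    return (num / den) % 90.0

def refine_weighted_median(gamma0, folded_angles, weights, max_dist_deg=5.0):
    """Weighted median of all folded angles within max_dist_deg of gamma0."""
    pairs = []
    for g, w in zip(folded_angles, weights):
        # Signed circular offset from gamma0 in [-45, 45).
        off = ((g - gamma0 + 45.0) % 90.0) - 45.0
        if abs(off) <= max_dist_deg:
            pairs.append((off, w))
    if not pairs:
        return gamma0 % 90.0
    pairs.sort()
    half = 0.5 * sum(w for _, w in pairs)
    acc = 0.0
    for off, w in pairs:
        acc += w
        if acc >= half:
            return (gamma0 + off) % 90.0

grid = [0.0] * 90
grid[89], grid[0], grid[1] = 1.0, 2.0, 1.0
print(cluster_mean_angle(grid, [89, 0, 1]))
print(refine_weighted_median(30.0, [29.0, 30.4, 120.4, 31.0, 60.0], [1, 1, 1, 1, 5]))
```

Working with offsets relative to the initial estimate keeps the median well defined even when the contributing angles straddle the discontinuity between 0° and 90°.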
Finally, the indoor mapping data can be rotated by the thus refined angle γ around the vertical axis to achieve the alignment of the building geometry with the horizontal coordinate axes. In the case of a triangle mesh, it is sufficient to rotate the vertices of the triangles as the respective normal vectors of the rotated triangles can be calculated on the basis of the triangle geometry. In the case of point clouds however, the respective normal vectors of the points need to be explicitly updated along with the coordinates of the points.
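Applying the rotation to points and normal vectors can be sketched with Rodrigues' rotation formula. The function name `rotate_about_axis` is hypothetical, the axis is assumed to have unit length, and the sign convention is an assumption of this sketch: a direction found at angle $\gamma$ from $x$ is brought onto $x$ by rotating by $-\gamma$.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def rotate_about_axis(v, axis, angle_deg):
    """Rotate v about a unit-length axis by angle_deg (right-hand rule)."""
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    kxv = cross(axis, v)
    kdv = dot(axis, v)
    # Rodrigues' formula: v*cos + (k x v)*sin + k*(k . v)*(1 - cos)
    return tuple(v[i] * c + kxv[i] * s + axis[i] * kdv * (1.0 - c) for i in range(3))

z = (0.0, 0.0, 1.0)
gamma = 30.0
normal = (math.cos(math.radians(gamma)), math.sin(math.radians(gamma)), 0.0)
point = (2.0, 1.0, 0.5)
# For a point cloud, points and normals are rotated with the same rotation.
print(rotate_about_axis(normal, z, -gamma))
print(rotate_about_axis(point, z, -gamma))
```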

Orientation of the vertical axis
In the preceding Section 2.1, the rotation around the vertical axis was determined under the assumption that the vertical axis is perfectly leveled with respect to the building structure, i.e. that it is orthogonal to horizontal floor and ceiling surfaces. In the case of tripod mounted indoor mapping systems like terrestrial laser scanners, this assumption is justified as these devices are typically leveled before usage. However, in the case of mobile indoor mapping systems like hand-carried or head-worn devices, this is generally not the case. In these cases, the coordinate system of the indoor mapping data is often defined by the initial pose of the mobile mapping device when starting the data acquisition process. In consideration of typical usage postures of such mobile systems, it can be assumed that the respective vertical axis of the coordinate system is still roughly pointing upwards ±30°. If this is not the case, a coarse leveling within this range can easily be conducted manually.
To justify the assumption made in the previous section, this section presents an approach for automatically leveling indoor mapping point clouds or triangle meshes where a chosen vertical axis z corresponds coarsely within ±30°with the actual upwards direction of the building structure standing orthogonally on horizontal floor surfaces. As in the preceding section, the input data for conducting this alignment of the input mapping data with the coordinate system are again the N normal vectors n i of the individual geometric primitives comprising the indoor mapping data (i.e. points or triangles).
Analogous to Equation 5, we can formulate the task of vertically aligning the indoor mapping geometries with the coordinate system axis z as where n v i are the N v normal vectors that are vertically oriented within the range (9) and w i again is a weighting factor being constant for points of a point cloud but corresponding to the respective triangle area for the faces of a triangle mesh. Furthermore, R(α, β) denotes a 3 × 3 rotation matrix determined by two rotation angles α and β around the two horizontal coordinate axes x and y respectively. Thus, the aim of Equation 8 is two find the optimal vertical axis z * as a vector in the initially given coordinate system that has a minimal sum of angles to the vertical normals n v i . This optimal vertical axis z * as well as the initial vertical axis z are exemplarily depicted in Figure 6 for a building with slanted ceilings only coarsely aligned with the actual vertical direction.
As it already was the case with Equation 5 in Section 2.1, Equation 8 is not analytically solvable and solving it numerically is all the more inefficient as this time, a two-dimensional minimization is concerned. Thus, as in the case of determining the rotation angle around the vertical axis in Section 2.1, we again seek to formulate the problem at hand as the task of searching a maximum peak within a discrete grid representation of the relevant input elements. Figure 6: Exemplary triangle mesh of a building with partially slanted ceiling (dataset 'Attic' from [33]). The green line visualizes the reference orientation of the vertical axis considered as ground truth while the red line visualizes the vertical axis rotated −25°around the horizontal x axis and 15°around the horizontal y axis as examplarily used in Section 2.2.
The relevant input elements in this case are the three-dimensional vertical normal vectors n_i^v. However, the problem at hand is actually two-dimensional, as a rotation around the two horizontal axes x and y by the rotation angles α and β is sufficient for aligning the vertical axis z with the optimal vertical direction z*.
In an alternative formulation, this can also be considered as the task of finding the position of the optimal vertical direction z* on the surface of a unit sphere, i.e. within the Extended Gaussian Image. The orientation of a normal vector with respect to the coordinate system can be expressed via the polar angles azimuth and inclination, which indicate the position of the respective normal vector n_i on the unit sphere. The definition of azimuth and inclination with respect to the coordinate system is further illustrated in Figure 7. This representation allows us to construct a two-dimensional azimuth/inclination grid analogous to the approach presented in Section 2.1, whose cells are weighted by the summarized weights w_i of the contained normal vectors n_i. Such a grid with a resolution of 1° extending over the whole unit sphere is depicted in Figure 8 for the exemplary case presented in Figure 6.
As before in Section 2.1, we want to transform this grid over the full range of the sphere surface to a smaller grid in which the weights of cells pertaining to opposing normal vectors get accumulated. This is achieved by the transformation of Equation 13 and Equation 14 to the transformed polar angles (φ̃, θ̃), while restricting the extension of the grid in the dimension of the inclination to the range of [0°, 40°] and thus only considering the vertical normal vectors n_i^v. A schematic visualization of this transformation is depicted in Figure 9(a), while Figure 10 shows the resulting two-dimensional azimuth/inclination grid corresponding to the dataset presented in Figure 6.
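The construction of such a reduced grid can be sketched as follows. This is only an illustration: the folding rules used here (azimuth modulo 90°, inclination mirrored at 90°) are assumptions inferred from the stated properties of the grid and need not coincide with Equation 13 and Equation 14, and the function name `folded_grid` as well as measuring the inclination from the z axis are likewise assumptions.

```python
import numpy as np


def folded_grid(normals, weights, resolution_deg=1.0, max_inclination_deg=40.0):
    """Accumulate weights in a folded azimuth/inclination grid.

    Assumed folding: azimuth modulo 90 degrees, inclination mirrored at
    90 degrees, so that opposing normals fall into the same cell.
    """
    n = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    azimuth = np.degrees(np.arctan2(n[:, 1], n[:, 0])) % 360.0
    # Inclination is measured from the z axis (assumption).
    inclination = np.degrees(np.arccos(np.clip(n[:, 2], -1.0, 1.0)))

    azimuth_f = azimuth % 90.0
    inclination_f = np.minimum(inclination, 180.0 - inclination)

    # Keep only normals that are vertical within the inclination range.
    keep = inclination_f <= max_inclination_deg
    n_az = int(round(90.0 / resolution_deg))
    n_inc = int(round(max_inclination_deg / resolution_deg)) + 1

    az_idx = np.minimum((azimuth_f[keep] / resolution_deg).astype(int), n_az - 1)
    inc_idx = np.minimum((inclination_f[keep] / resolution_deg).astype(int), n_inc - 1)

    grid = np.zeros((n_az, n_inc))
    np.add.at(grid, (az_idx, inc_idx), weights[keep])  # handles repeated cells
    return grid
```

With the default parameters, the result is a 90 × 41 array whose largest entries mark candidate directions for the vertical axis.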
Subsequently, peaks with grid cell weights above a threshold of 75 % of the highest weight value are again clustered as in the case of the one-dimensional grid of Section 2.1. In doing so, however, not only the azimuth discontinuity between 0° and 90° needs to be considered, but also the pole point at 0° inclination, where all azimuth values merge into one and the same grid cell.
While in the case of the one-dimensional grid of Section 2.1 grid cell indices could be directly mapped to angles by multiplication with the grid resolution, here it is not possible to infer the direction of the optimal vertical axis from grid cell indices, as the transformed azimuth values φ̃ are ambiguous by multiples of 90°. This ambiguity also existed in Section 2.1. There, however, it did not affect the correctness of the resulting horizontal alignment, whereas here it does.
Thus, to be able to deduce correct directions from peaks in the two-dimensional grid, the respective normal vectors n_i^v need to be hashed per grid cell. The direction of the vertical axis can then be initialized by a weighted average of all hashed normal directions of the cluster with the largest summarized weight, each weighted by its respective w_i value. In doing so, normal vectors pointing downwards need to be inverted to point upwards when calculating the weighted average vector. As in Section 2.1, the initial result is further refined by a weighted median of all normal vectors within ±5° of the coarsely determined vertical axis.
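The initialization and refinement described above can be sketched as follows. The function names are hypothetical, and the component-wise weighted median used for the refinement is an assumption of this sketch, since the text does not specify how the median of a set of vectors is formed.

```python
import numpy as np


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    order = np.argsort(values)
    cumulative = np.cumsum(weights[order])
    return values[order][np.searchsorted(cumulative, 0.5 * cumulative[-1])]


def initial_axis(cluster_normals, cluster_weights):
    """Weighted average direction; downward normals are flipped upwards first."""
    n = cluster_normals / np.linalg.norm(cluster_normals, axis=1, keepdims=True)
    n = np.where(n[:, 2:3] < 0.0, -n, n)
    mean = (cluster_weights[:, None] * n).sum(axis=0)
    return mean / np.linalg.norm(mean)


def refine_axis(axis, normals, weights, tolerance_deg=5.0):
    """Refine a coarse axis using all normals within the given tolerance."""
    n = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    # Orient every normal into the hemisphere of the coarse axis.
    n = np.where((n @ axis)[:, None] < 0.0, -n, n)
    angles = np.degrees(np.arccos(np.clip(n @ axis, -1.0, 1.0)))
    close = angles <= tolerance_deg
    if not np.any(close):
        return axis
    # Component-wise weighted median (assumption), renormalized afterwards.
    median = np.array([weighted_median(n[close, k], weights[close]) for k in range(3)])
    return median / np.linalg.norm(median)
```

A median-based refinement is less sensitive than a mean to the few strongly deviating normals that remain inside the ±5° window.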
Besides the need to deduce the correct direction from the detected maximum peak grid cells, there is a second reason to hash normal directions per grid cell. As illustrated in Figure 9(b), two normal vectors that are inclined by the same angle to the vertical axis z, such that z is the angle bisector between both normals, get projected to the same (φ̃, θ̃) grid cell by Equation 13 and Equation 14. On the one hand, this can distort the weight sums of the individual grid cells that are used for peak detection. On the other hand, the presence of normal vectors whose orientations deviate beyond the ambiguity of ±180° between opposing surfaces can severely distort the initial determination of the vertical direction from the largest peak in the grid.
For this reason, a cluster analysis is conducted among the hashed normal vectors per grid cell. In doing so, all the normal vectors in a grid cell are assigned to clusters. A normal vector can be assigned to an existing cluster if its direction coincides within ±2° with the average direction of the cluster (with consideration of an ambiguity of ±180°). Otherwise, the respective normal vector initializes a new cluster. Finally, for each grid cell, only the largest cluster of normals is retained while the others are discarded. The grid cell weights and the hashed normal vectors are adapted accordingly.
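The per-cell cluster analysis can be sketched as a simple greedy procedure. The function name `largest_cluster` is hypothetical, and ranking clusters by their summed weight (rather than by their number of members) is an assumption; the sketch also assumes a non-empty grid cell.

```python
import numpy as np


def largest_cluster(normals, weights, tolerance_deg=2.0):
    """Greedily cluster the normals of one grid cell and keep the heaviest cluster.

    Returns the indices of the retained normals and their summed weight.
    """
    cos_tol = np.cos(np.radians(tolerance_deg))
    sums = []     # weighted direction sum per cluster
    members = []  # indices of the normals per cluster
    for i, (n, w) in enumerate(zip(normals, weights)):
        n = n / np.linalg.norm(n)
        for k, s in enumerate(sums):
            d = float(n @ (s / np.linalg.norm(s)))
            if abs(d) >= cos_tol:  # abs() accounts for the 180 degree ambiguity
                sums[k] = s + w * (n if d >= 0.0 else -n)
                members[k].append(i)
                break
        else:
            # No existing cluster matches: start a new one.
            sums.append(w * n)
            members.append([i])
    totals = [weights[m].sum() for m in members]
    best = int(np.argmax(totals))
    return np.array(members[best]), float(totals[best])
```

For two normals mirrored at the z axis and one normal opposing the first, the opposing pair forms one cluster while the mirrored normal is separated and, if lighter, discarded.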

Unambiguousness of the rotation around the vertical axis
The alignment of indoor mapping point clouds or triangle meshes along the coordinate axes as described in the preceding Sections 2.1 and 2.2 is ambiguous with respect to a rotation around the vertical axis by multiples of 90°. This is not a problem per se, as the aim of the presented approach is to align the indoor mapping data with respect to its Manhattan World structure, which inherently implies this ambiguity with respect to four possible rotations around the vertical axis, i.e. all four possible result poses are equally valid with respect to the stated aim.
However, in some situations, it can be desirable to derive an unambiguous pose of the indoor mapping data. For instance, this can be the case when multiple indoor mapping results of the same building environment are to be aligned by the proposed method. These multiple datasets of the same building can e.g. be obtained by different indoor mapping systems or be acquired at different times in the context of change detection.
For this reason, we present a simple method for resolving the ambiguity in the rotation around the vertical axis by reproducibly choosing one of the four possible horizontal orientations. The proposed method presents a straightforward solution that does not require any semantic interpretation of the indoor mapping data or any elaborate analysis. It can, however, fail in cases of highly symmetric building layouts with respect to their four inherent Manhattan World directions. We furthermore presuppose that two datasets to be aligned unambiguously by this method cover approximately the same section of an indoor environment. If this is not the case, an approach that incorporates semantic knowledge of the represented indoor environment would be more promising.
[Figure 9(b): In case the vertical axis z is the angle bisector between the directions of two normal vectors (same angle δ to the z axis), these get transformed to the same point even if they are not opposed. This needs to be dealt with by means of a cluster analysis per (φ̃, θ̃) grid cell.]
[Figure 10: Two-dimensional azimuth/inclination grid for the dataset presented in Figure 6. The grid cells contain the summarized weights w_i of the contained vertical normal vectors n_i^v at polar angles (φ̃_i, θ̃_i), with value colorization ranging from blue for low values over green and yellow to red for large values. The larger peak corresponds to the floor and the horizontal part of the ceiling, while the minor peak corresponds to one of the slanted ceiling surfaces.]
Currently, however, we propose to resolve the ambiguity between the four possible horizontal orientations by first aligning the chosen reference axis x with that one of the two horizontal Manhattan World directions along which the bounding box of the respective dataset has the larger extent, i.e. the longer horizontal edges of the bounding box should be parallel to the x axis. This is quite straightforward but can fail in cases where the bounding box is nearly square.
The ambiguity is now reduced to a rotation of 180°. To resolve this, we propose to consider the weighted count of indoor mapping geometries in the two 10 % sections at either end of the bounding box in x direction and to choose the rotation for which the end section pointing towards the positive x axis has the higher weight sum. In this context, the indoor mapping geometries are again weighted by a constant in the case of points of point clouds and by triangle area in the case of triangle mesh faces. This approach fails when the amount of mapped indoor structures in both end sections of the bounding box along the x axis is about equal.
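The two-step disambiguation can be sketched as follows for a dataset that has already been aligned with its Manhattan World structure. The function names are hypothetical, and representing each triangle by a single position (e.g. its centroid) is an assumption of this sketch.

```python
import numpy as np


def rot_z(angle_deg):
    """Rotation matrix for a rotation around the vertical axis z."""
    g = np.radians(angle_deg)
    c, s = np.cos(g), np.sin(g)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def resolve_ambiguity(points, weights, end_fraction=0.10):
    """Return the rotation around z (0, 90, 180 or 270 degrees) to apply.

    points:  (N, 3) positions of the geometries (triangle centroids assumed).
    weights: (N,) constant for points, triangle areas for mesh faces.
    """
    angle = 0.0
    extent = points.max(axis=0) - points.min(axis=0)
    if extent[1] > extent[0]:
        angle = 90.0  # make the longer bounding box edge parallel to x

    p = points @ rot_z(angle).T
    x_min, x_max = p[:, 0].min(), p[:, 0].max()
    margin = end_fraction * (x_max - x_min)
    w_low = weights[p[:, 0] <= x_min + margin].sum()
    w_high = weights[p[:, 0] >= x_max - margin].sum()
    if w_low > w_high:
        angle += 180.0  # heavier end section should point towards positive x
    return angle
```

The returned angle would then be applied as `points @ rot_z(angle).T`; ties in either step leave the orientation unchanged, which mirrors the failure cases named in the text.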

Evaluation method
Quantitatively evaluating the proposed method is fortunately quite straightforward, as ground truth data can be easily obtained. If an indoor mapping dataset is not already correctly aligned with the coordinate system axes in the sense of the aim of this study, it can be aligned manually without great effort. A dataset aligned in this way can then be rotated to an arbitrary pose within the defined range applicable for the presented method. For this purpose, a 3 × 3 ground truth rotation matrix R_GT(α, β, γ) is created, determined by the rotation angles α, β ∈ [−30°, 30°] around the horizontal axes x and y respectively and an arbitrary rotation γ ∈ [−180°, 180°) around the vertical axis z. For creating R_GT, the rotation γ around the vertical axis is applied first, followed successively by β and α around their respective horizontal axes.
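A minimal sketch of how such a ground truth rotation could be composed and sampled is given below. It assumes column vectors and rotations around the fixed coordinate axes, so that applying γ first, then β, then α corresponds to the product R_x(α) R_y(β) R_z(γ); the function names are hypothetical.

```python
import numpy as np


def rot_x(deg):
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])


def rot_y(deg):
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def rot_z(deg):
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def ground_truth_rotation(alpha_deg, beta_deg, gamma_deg):
    """gamma (around z) is applied first, then beta (y), then alpha (x)."""
    return rot_x(alpha_deg) @ rot_y(beta_deg) @ rot_z(gamma_deg)


def random_ground_truth_rotation(rng):
    """Sample alpha, beta from [-30, 30] and gamma from [-180, 180) degrees."""
    alpha, beta = rng.uniform(-30.0, 30.0, size=2)
    gamma = rng.uniform(-180.0, 180.0)
    return ground_truth_rotation(alpha, beta, gamma)
```

For example, `random_ground_truth_rotation(np.random.default_rng(0))` yields one reproducible test rotation that can be applied to the normals and positions of an aligned dataset.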
Finally, the method presented in Section 2.1 and Section 2.2 is applied to the rotated dataset, which should return the rotated dataset to its aligned state. The resulting 3 × 3 rotation matrix R_Test is composed of the two rotations R_Test,vertical and R_Test,horizontal, where first R_Test,vertical is determined by aligning the rotated dataset vertically with the vertical axis as described in Section 2.2, and subsequently the rotation R_Test,horizontal around the vertical axis is determined as described in Section 2.1.
As an evaluation metric, the angular difference δ_v between the vector of the ground truth axis z and the resulting vertical axis is determined, as well as the analogous angular difference δ_h for the horizontal axis x. In the case of the horizontal deviation δ_h, the ambiguity of valid rotations around the vertical axis by multiples of 90° needs to be considered. To this end, a correction is applied iteratively to δ_h.
The proposed evaluation metrics δ_v and δ_h can be determined for multiple randomly chosen rotations within the mentioned ranges of [−30°, 30°] for the horizontal axes and [−180°, 180°) for the vertical axis in sufficient quantity to allow for a statistical analysis.
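The two metrics can be sketched as follows. This is an illustration under assumptions and not the exact definition used for the reported results: the residual rotation is taken as R_Test R_GT, and the 90° ambiguity of δ_h is removed by repeatedly folding the angle into the range [0°, 45°]. The function names are hypothetical.

```python
import numpy as np


def rot_z(deg):
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def angle_between_deg(u, v):
    """Angle between two vectors in degrees."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return float(np.degrees(np.arccos(np.clip(u @ v, -1.0, 1.0))))


def evaluation_metrics(r_gt, r_test):
    """Return (delta_v, delta_h) in degrees for one test rotation."""
    # Ideally the identity, up to a multiple of 90 degrees around z (assumption).
    residual = r_test @ r_gt
    z = np.array([0.0, 0.0, 1.0])
    x = np.array([1.0, 0.0, 0.0])
    delta_v = angle_between_deg(residual @ z, z)
    delta_h = angle_between_deg(residual @ x, x)
    # Fold away the ambiguity by multiples of 90 degrees (assumed rule).
    while delta_h > 45.0:
        delta_h = abs(delta_h - 90.0)
    return delta_v, delta_h
```

With this folding, a result that is rotated by exactly 90° or 180° around the vertical axis relative to the ground truth pose yields δ_h = 0°, as all four horizontal orientations are considered equally valid.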

Results
In order to quantitatively evaluate the approach presented in Section 2.1 and Section 2.2, the evaluation procedure proposed in Section 2.4 was applied to a range of different indoor mapping datasets. Firstly, the four triangle meshes of the dataset presented in [33] were used for evaluation. These triangle meshes are depicted in Figure 11 along with 3D bounding boxes indicating their respective ground truth pose. They were acquired by means of the augmented reality headset Microsoft HoloLens providing coarse triangle meshes of its indoor environment. In studies evaluating this device for the use case of indoor mapping, its triangle meshes were found to be accurate in the range of few centimeters in comparison to ground truth data acquired by a terrestrial laser scanner [44,35,34].
The alignment with the coordinate axes of the HoloLens triangle meshes as presented in [33] was found to be inaccurate. Actually, in [33], the presented datasets have been automatically aligned with the coordinate axes by means of an early, inferior version of the approach presented in this paper. To enable a reasonable evaluation of the proposed approach on these triangle meshes, ground truth poses were determined by manually aligning the datasets with the coordinate axes. The newly aligned datasets along with our implementation of the proposed approach and the evaluation procedure will be made available upon acceptance for publication to allow for reproducibility of the presented evaluation results.  All four represented indoor environments show a clearly defined Manhattan World structure. While the dataset 'Office' has mostly horizontal ceiling surfaces with the exception of the stairwell, the datasets 'Attic' and 'Residential House' have slanted ceiling surfaces. The dataset 'Basement' on the other hand shows a range of different barrel-shaped ceilings.
Furthermore, the six indoor mapping point clouds of the ISPRS Indoor Modelling Benchmark dataset presented in [43,45] were used for evaluation purposes. These point clouds as visualized in Figure 12 were acquired by means of different indoor mapping systems with a broad variety of sensor characteristics regarding accuracy and noise. Furthermore, the represented indoor environments are characterized by varying amounts of clutter.
While the other five datasets mostly adhere to the Manhattan World assumption, the dataset 'Case Study 6' has a high amount of horizontally curved wall surfaces and rooms oriented diagonally with respect to the dominant Manhattan World structure defined by three rooms. Furthermore, the point cloud includes a part of the surrounding outdoor terrain with uneven topography and vegetation. As the dataset 'Case Study 6' is quite challenging with respect to the aim of this work, it is depicted in more detail in Figure 13.
[Figure 13: The dataset 'Case Study 6' from [45], also depicted in Figure 12(f). The vertical axis is visualized in blue while the two horizontal axes aligned with the dominant Manhattan World structure of the building are depicted in red and green.]
The point clouds of the ISPRS benchmark dataset as they are published are already aligned with the coordinate axes in accordance with the aim of this work. Thus, the poses of the point clouds could directly be used as ground truth poses without any manual adjustment. Contrary to triangle meshes however, point clouds do not intrinsically provide normal vectors per point. This is also the case with the point clouds of the ISPRS Indoor Modelling Benchmark. We thus computed normal vectors for the points after subsampling the point clouds with a resolution of 2 cm using CloudCompare 2.10-alpha [28].
Lastly, we also consider some triangle meshes from the Matterport3D dataset [11]. Matterport3D includes 90 triangle meshes of various kinds of indoor environments acquired with the trolley-mounted Matterport indoor mapping system consisting of multiple RGBD cameras. Among the represented indoor environments are some for which the proposed alignment approach is not applicable, as they are not subject to any clearly identifiable Manhattan World structure. Many others do have a clearly identifiable Manhattan World structure but are to a large extent comparable to general building layouts already covered by the HoloLens triangle meshes or ISPRS point clouds used in the scope of this evaluation.
We thus selected 14 triangle meshes from the Matterport3D dataset that we deem particularly interesting and challenging in the context of this work. This, for instance, comprises triangle meshes representing indoor environments that contain more than one underlying Manhattan World system like the one already presented in Figure 1. In these cases, the presented alignment method is supposed to align the triangle mesh with the most dominant of the Manhattan World structures at hand being supported by the largest fraction of geometries. The 14 selected triangle meshes from the Matterport3D dataset are depicted in Figure 14.
As with the ISPRS benchmark point clouds, we again treat the poses of the triangle meshes as they are published as ground truth alignments without any manual adjustments. To which extent this decision is justified will be discussed in the subsequent Section 4.
The different datasets used in the scope of this evaluation are listed in Table 1 along with the respective number of points or triangles comprising them and the respective evaluation results. For conducting the evaluation, the evaluation procedure described in Section 2.4 was applied to the individual datasets. In doing so, each dataset was rotated 50 times while each time, the respective rotation consists of a randomly determined rotation angle γ ∈ [−180°, 180°) around the vertical axis and two random rotations α, β ∈ [−30°, 30°] around the respective horizontal coordinate axes.
For each of the 50 random input rotations, the alignment procedure described in Section 2.1 and Section 2.2 was applied and the resulting vertical and horizontal angular deviations δ_v and δ_h as defined in Section 2.4 were determined. Table 1 lists mean values and standard deviations for these evaluation metrics aggregated over all 50 samples per dataset. Furthermore, mean values and standard deviations for the processing time are given as well. The stated values refer to a system with an i7-8550U CPU and 24 GB RAM and do not include data import and export. The implementation, which will be released upon acceptance for publication, is CPU-parallelized.
As can be seen in Table 1, the resulting averaged vertical and horizontal angular deviations are largely below 1°, with the corresponding standard deviations being in a similar range. Some outliers marked in red will be discussed in further detail in the subsequent Section 4.

Discussion
Taking a closer look at the evaluation results presented in Table 1, the low values for the horizontal and vertical angular deviations δ_h and δ_v, together with equally low standard deviations, indicate that the proposed alignment method works well for a large range of different indoor mapping point clouds and triangle meshes with randomly varying input rotations within the defined bounds. The consistently larger δ_v and δ_h values for the triangle meshes acquired with the Microsoft HoloLens may be attributable to them being less accurate and more affected by noise. Triangles pertaining to an actually smooth planar room surface show a considerable variation in normal vector direction. However, the reported δ_v and δ_h values for these datasets are still mostly well below 1°.
[Figure 15: Histogram of the δ_v values underlying Table 1 for the triangle mesh 'Attic' depicted in Figure 11(c). Without the 5 outliers around 31°, mean δ_v results in 0.50° ± 0.13°.]
Some datasets, however, show significantly higher averaged values for δ_v or δ_h, sometimes with the corresponding standard deviation being significantly raised as well. These outliers are marked red in Table 1 and will be discussed in more detail in the following paragraphs. To analyze these cases, we will take a closer look at the distribution of the individual 50 deviations constituting the respective mean value and standard deviation.
In the case of the HoloLens triangle mesh 'Attic' for instance, the histogram of δ_v values depicted in Figure 15 indicates that the heightened mean and standard deviation values for the angular deviation in the vertical alignment are not caused by a large variability in the resulting vertical alignment. Rather, the vertical orientations resulting from the evaluated alignment method fluctuate between two clearly defined states. One is the correct vertical orientation according to the ground truth pose at around 0° angular deviation δ_v of the vertical axis, supported by 45 of the 50 measurements. The other state is a vertical orientation with an angular deviation of about 30°, occurring in the remaining five measurements. As visualized by the red box in Figure 16, this corresponds to an alignment where the vertical axis is oriented orthogonally to one of the slanted ceiling surfaces.
This is the only case among all the datasets used in the evaluation where the vertical alignment did not work satisfactorily in all 50 samples. We suspect that the misalignments occurring sporadically on this dataset can be ascribed to the noisy surfaces of the HoloLens triangle meshes. The triangles comprising the large horizontal floor surface for instance differ significantly in the direction of their normal vectors. Thus, only a fraction of the triangles comprising the floor actually corresponds to the proper vertical direction with respect to the applied resolution of 1°. The dataset at hand represents only the attic story, so its slanted ceiling surfaces have a considerable area in comparison to the horizontal surfaces. Depending on the input rotation, such a slanted ceiling surface may thus induce a larger peak and consequently a misalignment. In cases like this, applying a coarser angular resolution of more than 1° may be better suited to prevent such misalignments.
Besides the discussed outlier in the vertical alignment, some outliers in the horizontal alignment do exist. The distributions of the δ_h values for the triangle meshes 'mJXqzFtmKg4' and 'PuKPg4mmafe' are again depicted in Figure 17 and Figure 18 respectively. As in the case before, it is apparent that the alignment results fluctuate between two states depending on the input rotation in both cases, while each time one peak at 0° corresponds to the correct horizontal alignment according to the respective ground truth pose. As can be seen in Figure 19 and Figure 20, the respective second peak corresponds in both cases to a valid second Manhattan World structure present in the respective indoor environment.
[Figure 17: Histogram of the δ_h values underlying Table 1 for the triangle mesh 'mJXqzFtmKg4' depicted in Figure 14(f).]
In the case of the dataset 'mJXqzFtmKg4', this seems immediately plausible, as both Manhattan World structures present in the indoor environment are supported by a comparable amount of geometries, as was already demonstrated in Figure 4 and Figure 5. Thus, different input rotations may result in slightly different discretizations within the grid of 1° resolution, sometimes favoring one and sometimes the other Manhattan World structure as having the largest peak of summarized geometry weights.
In the case of the dataset 'PuKPg4mmafe' however, the two Manhattan World structures present in the indoor environment do not seem to be supported by an approximately equal fraction of geometries. Rather, the upper right section in Figure 20 constituting the one Manhattan World structure seems to be far smaller than the section on the lower left constituting the other Manhattan World structure. In this case, the ground truth pose of the triangle mesh as published in [11] is aligned with the apparently smaller Manhattan World structure. It is thus not surprising that in the evaluation, a majority of measurements results in high δ_h deviations, as the evaluated alignment method favors the larger Manhattan World structure. However, it is surprising that a not insignificant fraction of 17 of the 50 randomly chosen input rotations results in a horizontal alignment along the apparently significantly smaller Manhattan World structure.
This situation may be explainable by taking a closer look at the walls constituting the respective Manhattan World structures. As can be seen in Figure 21, the smaller Manhattan World section on the right hand side consists of wall surfaces that are generally smooth and completely covered with geometries. The larger section on the left however has a large fraction of open wall surface where there are no geometries, because the walls there are actually openings or glass surfaces that cannot be captured by the Matterport system used for the acquisition of this dataset. Furthermore, large parts of the actually represented wall surfaces are covered with curtains or other structures resulting in inhomogeneous normal vector directions. In consideration of this, it seems plausible that the actual support for both Manhattan World structures present in the building could be approximately equal and the applied alignment method could thus be prone to fluctuate between both Manhattan World systems with varying input rotations.
[Figure 18: Histogram of the δ_h values underlying Table 1 for the triangle mesh 'PuKPg4mmafe' depicted in Figure 14(h).]
Besides the two cases discussed so far, there are two further datasets with high average horizontal angular alignment deviations in the evaluation results reported in Table 1. These are the triangle meshes 'ULsKaCPVFJR' and 'ur6pFq6Qu1A', which are also part of the Matterport3D dataset. Unlike the cases discussed before, however, these only show heightened mean values for δ_h, while the respective standard deviations are low, in a range comparable to the other Matterport3D triangle meshes where the evaluated alignment method proved to be consistently successful.
This suggests that the proposed method consistently results in the same horizontal orientation for all 50 input rotations for both datasets. The respective resulting alignment however deviates from the assumed ground truth pose in the rotation around the vertical axis. This is further illustrated by Figure 22 and Figure 23 where it is easily discernible that the depicted buildings again respectively contain two Manhattan World structures and that the evaluated alignment method consistently chooses the respective other Manhattan World structure that does not coincide with the ground truth pose.
Arguably, it is disputable which of the two Manhattan World structures respectively present in the datasets is the 'correct' one, as again in these two examples both seem to encompass more or less the same fraction of the represented building environment and it is not readily discernible which is the dominant one. Nevertheless, our proposed method proves to find a reasonable alignment with high accuracy in almost all cases, with the only exception being the vertical alignment of the HoloLens triangle mesh 'Attic'. In all other cases where the resulting pose deviates from the ground truth pose, the resulting alignment is still reasonable in the sense that it corresponds to another Manhattan World structure inherent in the respective dataset that is readily identifiable by a human observer, even if it differs from the Manhattan World structure underlying the given ground truth pose.
Besides aligning an indoor mapping dataset with the dominant Manhattan World structure supported by the largest fraction of geometries, the proposed method can easily be augmented to identify all major Manhattan World structures along with the respective sets of associated geometries. Among other possible fields of application that will be briefly discussed in the following Section 5, this allows for providing multiple possible alternatives for alignment to the user to choose from in cases where multiple major Manhattan World structures are present in the dataset at hand and it is not readily apparent which among these to use for alignment.
[Figure 19, referring to Figure 17: The green bounding box corresponds to the peak at δ_h ≈ 0° while the red bounding box corresponds to the minor peak at δ_h ≈ 45°.]
[Figure 20, referring to Figure 18: The green bounding box corresponds to the peak at δ_v ≈ 0° while the red bounding box corresponds to the peak at δ_v ≈ 23°.]
[Figure 21, referring to Figure 14(h) and Figure 20: Note that in the case of the larger part of the building structure determining the Manhattan World system visualized by the red bounding box in Figure 20, large parts of the wall surfaces are missing as wall openings or constituted by curtains or other structures with inhomogeneous normal direction. The smaller part of the building structure on the right hand side, which determines the Manhattan World system visualized by the green bounding box in Figure 20, however has largely closed, smooth wall surfaces.]
[Figure 22 and Figure 23 (same caption text): One bounding box represents the pose from [11] that is used as ground truth pose for the evaluation results presented in Table 1. The red bounding box on the other hand represents the horizontal alignment resulting from our presented approach.]

Conclusions
In this work, we present a novel method for the automated pose normalization of indoor mapping data like point clouds and triangle meshes. The aim of the proposed method is to align an indoor mapping point cloud or triangle mesh along the coordinate axes in a way that a chosen vertical axis points upwards with respect to the represented building structure, i.e. the chosen vertical axis is expected to be orthogonal to horizontal floor and ceiling surfaces. Furthermore, a rotation around this vertical axis is to be determined in a way that aligns the two horizontal coordinate axes with the main direction of the dominant Manhattan World structure of the respective building geometry. In case multiple Manhattan World systems are present in the data, the dominant structure supported by the largest fraction of geometries should determine the horizontal alignment.
For both fundamental steps of the proposed method, i.e. determining the correct orientation of the vertical axis and subsequently the correct horizontal rotation around this resulting vertical axis, a theoretical solution is presented. As the proposed formulation of the problem at hand cannot be solved efficiently, an efficient approximate solution for a practical implementation is presented. This encompasses discretizing the input data into a grid with fixed resolution while transforming it in a way that enables the problem to be solved by determining the largest peak within this grid of fixed size, and finally refining the coarse result by resorting to the original input data in the vicinity of the detected peak. A CPU-parallelized implementation of the proposed method along with the code for the automated evaluation procedure will be made available to the public upon acceptance for publication.
The proposed method is quantitatively evaluated on a range of different indoor mapping point clouds and triangle meshes that are publicly available. The presented results show that the approach is overall able to consistently produce correct poses with high accuracy for the considered datasets under different input rotations. Furthermore, cases where high deviations with respect to the given ground truth pose occur are presented and discussed.
Concerning potential for future research, it has already been mentioned that the proposed method offers the possibility to not only identify the dominant Manhattan World structure along with the associated geometries in an indoor mapping dataset, but also to detect multiple Manhattan World structures that are sufficiently supported by geometries. Besides making it possible to present multiple reasonable alternatives for alignment to choose from, this could potentially also be used in the context of automated indoor reconstruction. In particular, knowing the major Manhattan World structures and their associated geometries could be beneficial for abstracting and idealizing indoor surfaces, i.e. reconstructing suitable surfaces as planes that perfectly conform to the Manhattan World assumption. In addition, automatically detecting the involved Manhattan World structures in a building may also be of interest in the context of automatically analyzing the architectural structure of buildings [1,73].
Furthermore, the presented methodology could possibly also be used in the context of Simultaneous Localization and Mapping (SLAM) in indoor environments and indoor mapping in general. Here, identifying Manhattan World structures during the mapping process (or in post-processing, if the individual indoor mapping geometries have associated timestamps to reconstruct the sequence of acquisition) could potentially be used to correct or reduce drift effects by applying the assumption that building structures that seem to deviate only slightly from an ideal Manhattan World system are to be corrected according to the Manhattan World assumption [70,75,81,82,95,53].