INV-Flow2PoseNet: Light-Resistant Rigid Object Pose from Optical Flow of RGB-D Images Using Images, Normals and Vertices

This paper presents a novel architecture for the simultaneous estimation of highly accurate optical flows and rigid scene transformations in difficult scenarios where the brightness assumption is violated by strong shading changes. In the case of rotating objects or moving light sources, such as those encountered when driving cars in the dark, the scene appearance often changes significantly from one view to the next. Unfortunately, standard methods for calculating optical flows or poses rely on the expectation that the appearance of features in the scene remains constant between views, and they frequently fail in the investigated cases. The presented method fuses texture and geometry information by combining image, vertex and normal data to compute an illumination-invariant optical flow. By using a coarse-to-fine strategy, globally anchored optical flows are learned, reducing the impact of erroneous shading-based pseudo-correspondences. Based on the learned optical flows, a second architecture is proposed that predicts robust rigid transformations from the warped vertex and normal maps. Particular attention is paid to situations with strong rotations, which often cause such shading changes. Therefore, a 3-step procedure is proposed that profitably exploits correlations between the normals and vertices. The method has been evaluated on a newly created dataset containing both synthetic and real data with strong rotations and shading effects. These data represent the typical use case in 3D reconstruction, where the object often rotates in large steps between the partial reconstructions. Additionally, we apply the method to the well-known Kitti Odometry dataset. Although this dataset fulfills the brightness assumption and is therefore not the typical use case of the method, it establishes the applicability to standard situations and the relation to other methods.


Introduction
Three-dimensional reconstructions of objects and depth information of scenes play an increasingly important role in industry. Whether it is quality control in production or the recognition of the environment in autonomous driving, the number of applications is continuously increasing. Due to their ease of use, depth cameras are increasingly used alongside flexible 3D scanners, and the availability of depth data for a wide variety of applications is steadily growing. At the same time, the demand for scene understanding methods, represented by optical flow estimation, is constantly increasing, especially in the field of automation. Since more and more information is available in addition to the images alone, the demand for higher quality scene understanding also increases.
For the vast majority of applications, rigid scenes can be assumed and taken into account. In addition, even for dynamic scenes, the optical flow can be approximated by rigid models if no overly large motions of the camera or the environment are expected. This rigidity assumption can even guide the estimation of optical flow, whose accuracy can benefit from it. The proposed method therefore extracts the rigid transformation between two views simultaneously with the optical flow.

Figure 1. Sketch of the proposed methodology: In a first step, the pixel-wise optical flow is predicted from all available input (images, normals and vertices). In a second step, normal and vertex maps are warped to the reference frame using the predicted flow field. The stacked, warped normal and vertex maps are subsequently processed by another sub-network in order to predict a rigid transformation that aligns the underlying geometry.

Motivation: Flow-Based Alignment
In order to compute the alignment of two subsequently reconstructed frames, robust and transformation-invariant features (SIFT, KAZE, ...) are usually detected and matched between the frames. Robust and outlier-resistant methods such as RANSAC-based PnP solvers are subsequently used to compute the rigid transformation between the views [1]. It is commonly known that this approach, applied with only a few good features, results in considerably better alignments than jointly using many inferior features. Modern deep learning approaches adopt this scheme and deliver competitive results on a wide range of data in real time.
The basis of all common feature methods is the brightness assumption, which expects that the appearance of the object does not change significantly from one frame to another. This is fulfilled for many applications, especially when the camera moves smoothly through a scene or an object undergoes slow motion. If, on the contrary, the direction of the light incidence changes, the shading of the scene also differs dramatically and the brightness assumption is strongly violated. This leads to a very probable failure of the standard methods based on this requirement, especially in the following situations:
• Outdoor scenes where lighting conditions can change suddenly. This can occur from direct sun light, as well as indirect light reflections from other objects.
• Moving objects, especially rotating ones, inevitably change the direction of light incidence. This leads in particular to considerable difficulties in the application area of 3D reconstruction, where the object is often rotated in order to capture it successively from all sides.
• Driving cars in the dark may cause strong shading differences in the captured images of the environment. Visible elements in the scene are illuminated by the car's headlights. These light sources move together with the car through the scene, which may yield strong variation of the direction of light incidence.
In order to illustrate and investigate this problem, a setup with a static direct light source, a static camera and a rotated object is considered. Figure 2a shows how the standard approach based on SIFT matches fails due to different light incidence. Figure 2b shows the scene in a different color coding that maps the grayscale values to a color scale that is closer to human perception, which makes the different shading obvious. While the features in the scene change in appearance, it can still be assumed that a significant portion of the scene overlaps in the different views. In the important case of object rotation in 3D reconstruction, our research shows that, in the vast majority of cases, a typical rotation of 45° still yields more than 80% overlap of the scene. Figure 2c visualizes the overlapping areas of the two views. Optical flow methods can benefit from this in turn, as they view and match the motion as a whole, using pyramidal approaches. Finally, Figure 2d shows correspondences determined using an optical flow method, as introduced in the following. The correspondences do contain noise and smaller errors, especially in feature-poor regions. They are nevertheless capable of predicting stable orientations of the object, significantly more stable ones than feature-based methods.

Figure 2. (b) shows the scene, which has been illuminated by a strong spot light, in a different color space that is closer to human perception. This more clearly visualizes the different shadings of the object, which is the reason for the failure of the common method based on SIFT features. (c) shows overlapping regions of subsequent scans. Even a rotation of approx. 45° yields a large overlap of more than 80%.

Related Work
Optical flow estimation is a well-known problem in applied machine vision and has widespread use cases in industrial applications such as robotics, autonomous driving, and quality control. The task is to determine dense motion at the pixel level between image pairs as accurately as possible. Starting with the method of Horn and Schunck [2], variational methods were the state of the art for a long time. Since the problem itself is ill-posed, further assumptions have to be made on the flow field, which led to a multitude of different methods that use the most diverse regularization procedures to make the problem solvable according to the specific application. In recent years, the problem of optical flow estimation increasingly expanded to the problem of scene flow estimation, which deals with the 3D motion of scene points in space, whereas optical flow is limited to 2D point motion on the image plane. Based on the variational approaches for optical flow, a number of variational scene flow methods have been developed. Most of them use rectified stereo image pairs as input and estimate scene flow with different regularization methods or partial rigidity assumptions ( [3][4][5][6][7][8][9][10]). At the same time, methods were developed to determine the scene flow directly from RGB-D data. With an increasing number of depth sensors becoming available, this approach is well justified. Several variants of methods handle this case ( [11][12][13][14]).
The appearance of FlowNet [15] revolutionized the field of optical flow estimation. It became possible to treat the problem in real time with the help of convolutional neural networks (CNNs), whereas the variational methods were extremely time consuming and computationally expensive. A higher accuracy, at the expense of a much larger network, was subsequently achieved with FlowNet2 [16]. This was followed by the release of PWC-Net [17], which uses warping layers at different levels of an image pyramid; it represents the current state of the art and is, in addition, much smaller than the previously released FlowNet2. Based on PWC-Net, Saxena et al. have presented a method for estimating scene flow from rectified stereo image pairs. In addition, they handle occlusions within the forward pass, whereas previous methods required at least one forward and one backward warping to stably detect occlusions ( [18][19][20]). Other approaches tackle the task iteratively, such as [21]. In addition, a large amount of research currently focuses either on making networks lighter ( [22,23]) or on training networks without ground truth through un- or self-supervision ( [24][25][26][27]). A survey on variational as well as CNN-based optical flow methods can be found in [28].
Similar to the development in the variational setting, methods that extract scene flow directly from RGB-D data also evolved over time. Qiao et al. showed how scene flow based on FlowNet can be improved by fusion with features of depth data extracted in an extra network pass. Based on PWC-Net, Rishav et al. [29] use depth data from a Lidar sensor to determine scene flow. In doing so, they account for the lower resolution of Lidar data using appropriate reliability weights from [30]. In general, scene flow networks based on RGB-D data show poor performance for outdoor scenes, due to the range limitations of the sensors. A number of approaches attempt to address this issue ( [20,[31][32][33]). Since omitting active components removes the range limitations, but is accompanied by a loss of quality of the depth information, we nevertheless restrict ourselves to this limited case. We are content with the scene flow within the sensor limits, since it is sufficient for an overwhelming number of practical applications, where the limits of the sensor can be planned for accordingly. In order to predict the pose of an object, RANSAC approaches using explicit pose estimates based on the singular value decomposition were the standard for a long time. In recent years, the first deep learning approaches predicted the pose directly using neural networks. Kendall et al. [34] use in their PoseNet several convolutional layers, followed by linear layers, to directly predict rotation and translation from RGB images. This way, they were the first to solve the problem of camera re-localization in static scenes with a deep learning approach. A few years later, Vijayanarasimhan et al. [35] extended this principle in SfM-Net in order to simultaneously predict the rigid transformations and the depth of the scene. They basically adapt the principles of the famous Structure-from-Motion pipeline to a deep learning framework. In parallel, Zhou et al. [36] developed a related model and showed how to train it in an unsupervised manner.
Finally, there has been a series of methods for direct point cloud registration with deep learning. Some of them replace parts of the standard pipeline by deep learning methods and some try to replace the full pipeline. A large number of different approaches, correspondence-based and correspondence-free, are reviewed in [37,38].
Related to the presented work, Ref. [39] and recently Ref. [40] introduced variational and CNN-based methods for flow-aided pose estimation that rely on a fulfilled brightness assumption. Nevertheless, an automatic and light-resistant flow-based pose estimation method that works correspondence-free and takes geometrical, textural and coherent scene motion into account has not been addressed before.

Light-Resistant Optical Flow
The optical flow between two images is understood as the displacement of the individual pixels from one image to the other. Determining the optical flow between images of a scene often serves the purpose of scene understanding, as it directly allows the analysis of a large amount of scene information:
• The optical flow between calibrated camera images from different perspectives of the same static scene theoretically allows dense point correspondences and accompanying depth data.
• The optical flow between static camera images of a moving scene theoretically allows the analysis of scene motion and object tracking. If depth data are additionally available, the scene flow, i.e., the spatial movement of the points in the scene, can be calculated.
In the estimation of the optical flow between two consecutive images I 0 and I 1, a horizontal and a vertical displacement field (F 01 x, F 01 y) are calculated, mapping each pixel in image I 0 to its corresponding pixel in image I 1. The usual basis of the estimation is the brightness assumption, which assumes that corresponding pixels have the same appearance in the different images:

I_0(x, y) = I_1(x + F^{01}_x(x, y), y + F^{01}_y(x, y)).    (1)

Figure 3 shows image I 0 and image I 1, which has been warped by the optical flow F 01. Since the used optical flow has been computed from real data, the flow field is semi-dense and contains some masked pixels. Such errors will be addressed later on, where we will also show how to adapt filters to sparse, semi-sparse and mixed data. Instead of looking for exactly the same values between I 0 and I 1, filtered values are considered in a regional context in order to robustify the matching. Deep neural networks have proven to be extremely effective for this purpose. The current state of the art is given by PWC-Net, which will be briefly introduced in the following to serve as a basis for the subsequently presented light-resistant method.

Figure 3. Image I 0 in comparison to image I 1 that has been warped by optical flow F 01. Assuming consistent brightness, these should be identical (ignoring masked pixels due to the semi-dense optical flow from real data). In case of strong rotations of the object, the shading changes dramatically, which violates this assumption.
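As a concrete illustration of the brightness assumption, warping I 1 back to the frame of I 0 with a given flow field can be sketched as follows (a minimal numpy sketch with nearest-neighbour sampling; the function name and the NaN-masking convention are our own choices, not part of the original method):

```python
import numpy as np

def warp_by_flow(img1, flow_x, flow_y):
    """Warp image I1 back to the frame of I0 using the flow field (F01_x, F01_y).

    Under the brightness assumption, the result should equal I0 wherever the
    flow is valid. Nearest-neighbour sampling; pixels whose target falls
    outside the image are masked out (left as NaN).
    """
    h, w = img1.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.rint(xs + flow_x).astype(int)
    yt = np.rint(ys + flow_y).astype(int)
    valid = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)
    warped = np.full(img1.shape, np.nan)
    warped[valid] = img1[yt[valid], xt[valid]]
    return warped, valid
```

For a constant one-pixel shift to the right, the warped image reproduces I 0 everywhere except at the right border, where the flow points outside the image.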

PWC-Net
PWC-Net combines classical techniques such as a pyramidal approach, warping and correlation to create a highly effective network for optical flow estimation. The input images are passed through a pyramid of convolutions which extract rotation- and translation-invariant features at different levels of the receptive field. The number of pyramid levels should be adapted to the image resolution: by successively halving the resolution in each step, the filter of the last stage should cover almost the entire scene. From the lowest level, cost volumes based on the extracted features are established, from which the optical flow is effectively predicted. These flows are refined upwards with each level, incorporating new features of the current level as well as the flows and more global features from previous levels. By warping the data using the previous flow, the search space is significantly reduced, and even large displacements can be treated and predicted with this comparatively small network. Figure 4 depicts the architecture of the network. Each prediction block consists of a cost volume for flow prediction and is fed with the corresponding layer in a U-Net structure, in order to predict a flow field in full resolution. Note that the standard network presented by Sun et al. in [17] predicts the optical flow up to the second to last level and afterwards refines the resulting flow by a context network as post-processing. This results in a final optical flow whose resolution is only 1/16 of the input images' resolution. Instead of up-sampling by variational methods, we add two additional texture-guided up-sampling steps within the network, in order to provide full resolution optical flows within a single training routine. The input is convolved through multiple layers and the optical flow is predicted starting from the lowest level upwards in a U-Net structure.
In each level, the feature layers of I 1 are warped towards those of I 0 using the initial flows from the previous, lower levels. With this pyramidal approach, large flows are also predictable with quite small filter kernels.
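The core correlation step of such a network can be sketched as follows (a simplified numpy version of a PWC-Net-style cost volume; the function name, the per-channel normalization and the zero padding are our own choices):

```python
import numpy as np

def cost_volume(f0, f1, d):
    """Correlation cost volume between feature maps f0 and f1 of shape (c, h, w).

    For every pixel, the feature vector of f0 is correlated with the feature
    vectors of f1 inside a (2d+1) x (2d+1) search window, as in PWC-Net-style
    flow prediction. Returns shape ((2d+1)**2, h, w); samples reaching outside
    the image contribute zero (zero padding).
    """
    c, h, w = f0.shape
    vol = np.zeros(((2 * d + 1) ** 2, h, w))
    f1p = np.pad(f1, ((0, 0), (d, d), (d, d)))
    i = 0
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            shifted = f1p[:, d + dy:d + dy + h, d + dx:d + dx + w]
            vol[i] = (f0 * shifted).sum(axis=0) / c
            i += 1
    return vol
```

For a feature map that is a pure one-pixel shift of the reference, the maximum of the cost volume at an interior pixel indicates exactly that displacement, which is what the subsequent flow-prediction layers exploit.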

INV-Net Using Images, Normals and Vertices
Classical PWC-Net uses texture images only. Unfortunately, for the investigated use case, these texture images may be disturbed by shading changes resulting from rotations of the object or light changes, which makes the network likely to fail due to a violated brightness assumption (see Figure 2). In many situations where depth data are available, a lot of additional information can be provided to the network that is invariant under the shading effects related to light changes or object rotations:
• Texture images I 0 and I 1 that are subject to shading effects. Nevertheless, they provide full and dense data, which can deliver local context.
• Depth maps D 0 and D 1 that store the relative geometrical information of the scene, light- and shading-invariant, with respect to the camera center. Due to measuring errors, there may be outliers or data-less pixels, resulting in semi-dense depth maps.
• Vertex maps V 0 and V 1 that store the spatial information of the scene, light- and shading-invariant, in three channels of a map in image resolution. They are computed from the depth maps and the available camera calibration in order to store the geometrical information independent of the calibration. Therefore, similar to the depth maps, they are semi-dense maps with masked pixels. Moreover, they are structured representations of point clouds that allow for performing neighboring operations on 3D data in 2D space, which yields large advantages in the following approach.
• Normal maps N 0 and N 1 that store spatial information of the surfaces in the scene. They are related to partial derivatives of the 3D vertices and are not subject to scale or translation bias. They lie in a specific range and are responsible for a large amount of the shading features of a scene (from which standard methods based on a fulfilled brightness assumption obtain a large amount of information), without being disturbed by the light changes.
They can be directly computed from the vertex maps, using the topological information given by the image grid (see [41]). Unfortunately, they thus also inherit the semi-density from the underlying vertex maps. Figure 5 sketches the basic problem of finding a light-resistant pose estimation from all the available input. The first task is to find a light-resistant, high-quality optical flow from this large amount of input data. Both depth maps and vertex maps store the spatial information of reconstructed surface points. Since they are largely interchangeable, we use the vertex maps only. This way, the method becomes independent of the intrinsic calibration, at the cost of a higher amount of data that needs to be processed. Figure 9 (left part) sketches the basic network that takes features from images (textural features), normal maps (shading features) and vertex maps (geometrical features). Thereby, we follow the basic principle of PWC-Net but run the different inputs through separate feature pipelines and set up independent cost volumes that contribute to the flow prediction. All features are processed as in [17] and fed to the flow prediction in each layer. This way, the network learns to treat each feature type appropriately and to benefit from all of them. Figure 6 depicts the prediction procedure in each layer, except the first one, where only the cost volumes are used for initialization of the flow.

Normalized Convolutions
In order to take into account the semi-density of the vertex maps and the normal maps, the convolutions leading to the first layer are replaced by normalized convolutions as introduced by Eldesokey et al. in [42]. Using the following slightly changed convolution procedure, the known masks can be used to ensure that data-less pixels do not contribute to the convolution with respect to neighboring pixels. Suppose we are given a signal A to be convolved with a filter kernel K. Further assume that the measurements of the signal A are of varying quality, with a confidence measure W of the same size as A having values between 0 and 1 to describe these uncertainties. It is desired to use the confidence measure as a weighting of the entries of A during convolution, to ensure that reliable measurements have a higher influence on the convolution signal than inferior measurements or missing data at certain points. For this purpose, each summand within the convolution is weighted accordingly and divided by the sum of the weights to ensure the normalized character of the convolution. In detail, the normalized convolution of signal A, convolved with kernel K and weighted by confidence W around data point [n], is given by

((W ⊙ A) ∗ K)[n] / ((W ∗ K)[n] + ε),    (2)

where ⊙ denotes the element-wise Hadamard product and ε is a small constant avoiding division by zero. In order to avoid the influence of missing pixels, a binary mask that contains zeros in case of missing data, and ones otherwise, can be fed to the convolutions as confidence W.
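A one-dimensional sketch of such a normalized convolution (our own minimal numpy version for illustration; [42] additionally propagates the confidences through the network, which is omitted here):

```python
import numpy as np

def normalized_conv(a, w, k, eps=1e-8):
    """Normalized convolution of signal a with kernel k under confidence w.

    Computes ((w * a) conv k) / ((w conv k) + eps), so samples with w = 0
    (missing data) do not contribute and the remaining weights are
    renormalized. 1-D with zero padding; the 2-D case is analogous.
    """
    num = np.convolve(w * a, k, mode="same")
    den = np.convolve(w, k, mode="same") + eps
    return num / den
```

With a box kernel, a missing sample is filled from its valid neighbours instead of dragging the local average towards zero, which is exactly the behaviour desired for the semi-dense vertex and normal maps.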

Consistency Assumptions
Similar to the brightness assumption given in Equation (1), the following consistency assumptions hold true for normals and vertices of rigid scenes:

N_0(x, y) = R · N_1(x + F^{01}_x(x, y), y + F^{01}_y(x, y)),    (3)

V_0(x, y) = R · V_1(x + F^{01}_x(x, y), y + F^{01}_y(x, y)) + t.    (4)

Figure 7 visualizes the consistency relations for normals and vertices. While the pixels of the warped normal map coincide with the reference normals up to a rotation matrix R, the vertices coincide up to a rotation R and a translation vector t. These relations will be essential later on, in order to extract the rigid pose from the given optical flow. A very important result of our research is that features computed from filtered normal and vertex maps allow for the computation of accurate optical flows. This means that the standard approach for feature extraction from images (as used in PWC-Net) is suitable to compute rotation- and translation-invariant features from normal and vertex maps as well.
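The consistency relations imply that, given a good flow, corresponding vertices fully determine the rigid pose. For reference, the classical closed-form SVD-based estimate (the long-standing standard mentioned in the related work; here a minimal numpy sketch with our own function name) recovers R and t from such correspondences:

```python
import numpy as np

def rigid_from_correspondences(v1, v0):
    """Closed-form rigid alignment (Kabsch): find R, t with v0[i] ≈ R @ v1[i] + t.

    v1, v0: (n, 3) arrays of corresponding vertices, e.g. the valid pixels of
    the warped vertex map V1 and the reference vertex map V0.
    """
    c1, c0 = v1.mean(axis=0), v0.mean(axis=0)
    h = (v1 - c1).T @ (v0 - c0)          # cross-covariance of centered points
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))  # guard against reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = c0 - r @ c1
    return r, t
```

This baseline is exact for noise-free correspondences; the learned pose prediction described below replaces it in order to cope with noisy, semi-dense flow fields.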

Pose from Warped Normals and Vertices
Several research works have already shown that it is possible to predict the relative pose of two views of a scene using neural networks. Usually, features are detected and matched, outliers are rejected, and the results are then passed through a series of layers in order to obtain representative feature vectors. Finally, as introduced in [34], relative translation and rotation are predicted jointly using at least two subsequent fully connected layers.
In the previous section, a light-resistant optical flow has been computed by INV-Net. Based on this, it is not necessary to search for matches in the entire image. Considering images, normal maps and vertex maps from the first view and the ones from the second view that have been warped towards the first one with the computed optical flow, the data at each pixel-position theoretically match densely. Of course, there are also many erroneous and inaccurate regions in the flow field, especially in feature-poor areas, where the flow is mainly interpolated. Previous work has shown that, in general, more accurate poses are estimated when only a few good features are used for the calculation, instead of many less good ones. This can also be achieved by feature extraction from the warped normal and vertex maps. It should be noted that, in areas where good features for the pose estimation can be found, good optical flows are also available. In a way, both the optical flow and the subsequently calculated pose are based on these same good features. Nevertheless, in the case of low quality features, as is the case with texture-poor and smooth surfaces, or even many false features due to light changes, we benefit from the more general information of the dense flow field.
In order to obtain the best poses from the warped vertex and normal maps, we investigated two different approaches (1 Step Method and 2 Step Method), as well as a 3 Step Method that combines both.

1 Step Method
This approach uses the concatenated warped vertex maps to jointly extract a rotation matrix R and a translation vector t that align the vertex maps rigidly. The relation is based on consistency assumption (4). Note that, after warping, the matching vertices are theoretically placed at the same location in the concatenated input. Due to the convolutional layers, the network is able to extract reliable locations, where a more accurate optical flow has been provided. The basic structure is shown in Figure 9 at branch 0 on the right.

2 Step Method
This approach uses two steps to predict rotation and translation individually by two separate networks. Following the consistency property of Equation (3), the warped normal map N 1 and the reference normal map N 0 are related by a rotation matrix R only. In a first step, this relative rotation R is predicted by stacking N 0 and the warped N 1 and processing them through several convolutional layers, followed by two fully connected layers in order to predict optimal rotation with respect to the normals.
Based on the third consistency property of rigid transformations, given in Equation (4), the translation t is predicted from the warped vertex map V 1 that has been rotated by matrix R and the reference vertex map V 0 . Rotation matrix R, from the previous step, has been applied in order to obtain dependency on the translation vector t for this inference step only. The structure is again shown in Figure 9 at branches 1. and 2. on the right.

3 Step Method
Rotation and translation are two fundamentally different operations that have a strong influence on each other. The smaller a rotation, the better it can be approximated linearly. Unfortunately, the joint extraction as in 1 Step Method may yield inaccuracies in case of large rotations. In these situations, it may be beneficial to extract them separately like in the 2 Step Method. Nevertheless, small rotational errors, from the first step of this approach, influence the predicted translation from the second step. The idea of the 3 Step Method is to first apply the 2 Step Method to pre-align the vertex maps.
In a third step, a correctional rotation matrix R̂ and a correctional translation vector t̂ are jointly predicted from the warped and pre-transformed vertex map R V_1 + t and the reference vertex map V_0. The final pose P = (R̄, t̄), as depicted in Figure 8, is then given by:

R̄ = R̂ R,    t̄ = R̂ t + t̂.

For extracting this correctional transformation, the 1 Step Method is used. This is beneficial, since the correctional rotations are usually small, which makes it possible to predict the rotation and the translation jointly and to avoid the weaknesses of successive prediction as in the 2 Step Method. The structure is again depicted in Figure 9 at branches 1., 2. and 3. on the right. Our experiments show that the combined 3 Step Method performs best, as it compensates for the respective weaknesses of both methods.

Figure 9. Architecture of Flow2PoseNet. The left part of the network aims to predict accurate optical flow from images, normal and vertex maps, using textural features from images, shading features from normals and geometrical features from vertices, in order to predict accurate and light-resistant flow fields. The pose of the rigid scene is computed in three steps from the warped normal and vertex maps. The first step predicts the rotation from the warped normal maps. The second step predicts the translation from the warped and rotated vertex maps. The third step predicts a correction transformation to refine the predicted rotation and translation incrementally.
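The composition of the pre-alignment (R, t) with the correctional transformation (R̂, t̂) into a single rigid pose can be written out and verified numerically (a small numpy sketch; the function name is ours):

```python
import numpy as np

def compose_pose(R_hat, t_hat, R, t):
    """Compose two rigid transforms as in the 3 Step Method.

    Applying (R, t) first and the correction (R_hat, t_hat) afterwards equals
    the single rigid transform (R_hat @ R, R_hat @ t + t_hat), since
    R_hat @ (R @ v + t) + t_hat = (R_hat @ R) @ v + (R_hat @ t + t_hat).
    """
    return R_hat @ R, R_hat @ t + t_hat
```

The identity in the docstring follows directly from expanding the nested transform, which is why the final pose can be stored as a single rotation and translation.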

Data Sets and Data-Processing
There are already a number of public datasets, both for optical flow estimation (Flying Chairs, Sintel, Kitti, Flying Things3D) and for pose estimation and odometry (Kitti Odometry, 3D Match, ModelNet14, ShapeNet). Unfortunately, only datasets that provide both images and depth data are suitable for the proposed investigations. Given the depth map and the camera calibration, the required normal maps can be approximated by practical methods such as [41] and are thus not a prerequisite. Therefore, for the evaluation of the estimated flow fields and the inferred poses in comparison to state-of-the-art techniques, the established Kitti Odometry dataset will be used later on. Even if it does not reflect the main application area for the development of the method, since it involves quite small rotations that barely show shading differences (due to movement of the camera instead of the scene), it allows for comparison with previously existing methods.
Nevertheless, for the task of rotating objects, ground truth data of both optical flow and scene pose are required for training the presented network. In addition, it is advantageous to be able to use absolutely correct normal, depth and calibration data to avoid the influence of computational errors on the training. To the best of our knowledge, no such dataset exists. In addition, a general dataset for object orientation in the context of 3D reconstruction is not available to our knowledge. Therefore, several datasets are published together with this publication (https://www.dfki.uni-kl.de/~fetzer/flow2pose.html (accessed on 19 September 2022)). Among them are two synthetic datasets with rendered images, normals, depth maps and ground truths of camera calibration, optical flow and camera positions. One of them contains scenes with consistent scene illumination (ConsistentLight) of both camera views. The other one contains scenes with inconsistent illumination (InConsistentLight), where the position of the light source changes significantly between the views. This simulates the difficult case where, for example, the object rotates, which may dramatically change the angle of incidence of the light (violated brightness assumption). The scenes of the synthetic datasets were created and rendered using Unity [43]. To avoid dependencies on the background, 75 spherical backgrounds were added to the scenes randomly. The grayscale images, depth maps, normal maps and optical flows were rendered for random scenes, each from two random camera perspectives. The calibration information, the camera positions and the position of the illuminating point light are also provided. For both synthetic datasets, a training subset and a test subset were created. The training sets contain 20,000 random scenes in which objects were randomly placed in the scene. The test sets contain 1000 random scenes in which other objects, not used in the training sets, were chosen.
The 22 models used for the training sets are shown in Figure 10a and the eight models used for the test sets are shown in Figure 10b. Figure 11a-d shows the rendered data for an exemplary scene.
In a similar format, a real dataset (BuddhaBirdRealData) is delivered, which consists of captured data from five different objects, shown in Figure 10c. The images are captured by monochrome cameras. The depth data have been reconstructed by a structured light approach using a setup with a controlled environment. Thereby, the reconstructions have been performed within an approximately 1 m³ working volume with a negligible ambient light component. Background effects were avoided by using the darkest possible background color. Camera as well as projector calibration information is provided along with the dataset. The normal maps are computed from the geometry, defined by the depth data and the calibration information. After manually aligning the data, the semi-dense flow fields have been computed and stored. The scenes were illuminated by a projector that has been calibrated jointly with the cameras and thus also delivers the light position in the scenes. Each model has been captured from eight positions with two different cameras each. Flow and pose data are available for each of the camera combinations of adjacent positions, which yields ground truth data for 40 combinations per object. This results in 200 ground truth scenes of the real data that can be used for testing the models in real scenarios. Thereby, the first 40 pairs represent the scans within one scan head (consistent light) with eight reconstructions per object. The last 160 pairs represent the inconsistent light case with combinations of camera views between adjacent scans (that use different projectors). Similar to the synthetic case, Figure 11e-h shows the captured and estimated data for an exemplary real scene.
Each scene of the datasets, whether real or synthetic, consists of the following data parts:
• image0 and image1 contain the 8-bit integer grayscale images of the two camera views.
• data0 and data1 are .json files that contain the intrinsic calibration matrices K, camera rotation R and translation t, the minimal and maximal depth values minDepth and maxDepth, the minimal and maximal values of the horizontal and vertical optical flows minFlowX, maxFlowX, minFlowY and maxFlowY, and the coordinates of the light source lightPos.
• depth0 and depth1 are 16-bit integer grayscale images that need to be rescaled after loading using the minimal and maximal depth values from the data files: d = minDepth + (d_raw / 65535) · (maxDepth − minDepth).
• normal0 and normal1 are 24-bit integer RGB images in tangent space that can be re-transformed to spatial space per channel by: n = 2 · (n_raw / 255) − 1.
Note that missing/masked pixels for which no depth information is available contain zeros in the depth, flow and normal files. After re-scaling and shifting these files, the mask should be applied again to keep the masking information with values of zero.
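The loading conventions above can be sketched in numpy. This is a minimal illustration, not the example code shipped with the dataset; the function names are ours, and the linear 16-bit scaling (divisor 65535) and the [0, 255] → [−1, 1] normal mapping are assumptions based on the stated bit depths.

```python
import numpy as np

def rescale_depth(depth_raw, min_depth, max_depth):
    """Map a 16-bit depth image back to metric values; keep zero pixels masked."""
    mask = depth_raw > 0
    depth = min_depth + (depth_raw.astype(np.float64) / 65535.0) * (max_depth - min_depth)
    return np.where(mask, depth, 0.0)

def rescale_normals(normals_raw):
    """Map 8-bit-per-channel tangent-space normals from [0, 255] to [-1, 1]."""
    mask = np.any(normals_raw > 0, axis=-1, keepdims=True)  # zeros mark masked pixels
    normals = 2.0 * (normals_raw.astype(np.float64) / 255.0) - 1.0
    return np.where(mask, normals, 0.0)
```

As noted in the text, the mask is re-applied after rescaling so that invalid pixels stay exactly zero.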
The presented network uses vertex maps instead of depth maps. These can be computed from depth data and the given calibration by applying the following operation to each image pixel (x, y): V(x, y) = d(x, y) · K⁻¹ · (x, y, 1)ᵀ. (6)
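This back-projection can be written compactly for a whole image. The following numpy sketch (function name ours) applies d(x, y) · K⁻¹ · (x, y, 1)ᵀ to every pixel at once:

```python
import numpy as np

def depth_to_vertex_map(depth, K):
    """Back-project every pixel (x, y) with depth d to the 3D point d * K^-1 (x, y, 1)^T."""
    h, w = depth.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)  # (h, w, 3)
    rays = pix @ np.linalg.inv(K).T   # apply K^-1 to each homogeneous pixel
    return rays * depth[..., None]    # scale each viewing ray by its depth
```

Masked pixels with depth zero simply map to the origin and can be filtered by the mask afterwards.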

Camera Pose and Scene Pose
The given depth, vertex and normal maps are independent of any camera pose, as the poses are usually not available beforehand and need to be computed by the procedure. In order to use them to triangulate point clouds with respect to a given pose, the vertex maps (or point clouds) and normal maps can be transformed in the following way. Given a camera pose P = (R, t), the 3D point with respect to the complete camera matrix P = K[R|t] is given by X = Rᵀ(V(x, y) − t), and the normals of the respective 3D points are given by N′(x, y) = Rᵀ N(x, y). For completeness, remember that the camera itself is located at −Rᵀt. In the usual case of unknown camera poses, only the relative transformation between two vertex maps/point clouds can be estimated from the given data by a procedure as introduced in the previous section. In order to use the provided data to deliver relative ground truth transformations between two views, the absolute poses need to be converted to relative ones. If we are given the camera extrinsics of two views, R₀, t₀ and R₁, t₁, the relative pose between vertex map V₀ and vertex map V₁ is given by R_rel = R₁R₀ᵀ and t_rel = t₁ − R₁R₀ᵀt₀, where vertex map V₀ is mapped to vertex map V₁ by applying the transformation as V₁ = R_rel V₀ + t_rel. Example code for reading, transforming and visualizing the data can be found with the datasets.
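The conversion from two absolute extrinsics to the relative ground truth transformation can be sketched as follows (a minimal numpy illustration, assuming the usual camera model x_cam = R X + t; the function name is ours):

```python
import numpy as np

def relative_pose(R0, t0, R1, t1):
    """Relative transform mapping view-0 vertices into view 1: x1 = R_rel @ x0 + t_rel."""
    R_rel = R1 @ R0.T
    t_rel = t1 - R_rel @ t0
    return R_rel, t_rel
```

Substituting x0 = R0 X + t0 confirms that R_rel x0 + t_rel = R1 X + t1 = x1 for any world point X.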

Pre- and Post-Processing of Data
Point clouds that need to be aligned may theoretically be of an arbitrary scale. Neural network based approaches, like the presented one, need to extract meaningful features within the given vertex maps to find corresponding points from which the desired transformation can be predicted. For this purpose, the network is adapted to the specific task with fixed weights that have been optimally determined during data-based training. For different absolute scales of the 3D positions, it is not possible to extract meaningful features from the vertex maps with the same fixed weights. In particular, learned activation thresholds within the network may not be applicable.
A practical way around this is to scale and move the point clouds, or equivalently the 3D data in the vertex maps, approximately towards the unit cube, which is located at the world origin. Within this working volume, the neural network can work effectively and perform the alignment. The calculated pose is then combined with the previous transformation towards the unit cube and thus provides the desired operation on the raw data.
In a first step, the point clouds are moved to the origin by subtracting their centroids. In a second step, the point clouds are scaled to fit approximately into the unit cube. Note that the presented method assumes the point clouds to be of similar scale, as is usual for depth data originating from the same sensor. Therefore, the scaling factor s towards the unit cube should also be chosen similarly for both processed point clouds.
Let us be given two point clouds X₀ = {x₁, ..., x_N} and X₁ = {x′₁, ..., x′_M} that need to be aligned, with centroids µ₀ and µ₁. The centered point clouds at the origin are given by X₀ − µ₀ and X₁ − µ₁. These are then scaled jointly and robustly in order to ensure that 90% of the points map into the according subspace of the unit cube ([−0.45, 0.45]³ ⊂ R³) located at the origin. This robustifies the scaling and dramatically reduces the negative effect of outliers. Note that, in general, it can be assumed that at least 90% of a point cloud contains usable data. Let Y = {y₁, ..., y_{M+N}} be the set of maximal absolute coordinates of the points of both centered point sets. Having sorted the values y_n ∈ Y in ascending order y₁ ≤ ... ≤ y_{M+N}, the scaling factor that ensures that 90% of both point clouds are mapped into the cube defined above is given by s = 1/y_{⌊0.45(M+N)⌋}, where ⌊·⌋ denotes floor rounding to integer values. The scaled, centered point clouds that are ready to be fed to the network are finally given by X̃ᵢ = s(Xᵢ − µᵢ), i = 0, 1. Having computed a pose P̂ = (R̂, t̂) using the neural network that aligns the scaled point clouds by R̂X̃₀ + t̂, the final transformation P = (R, t) that aligns the raw point clouds X₀ and X₁ is given by R = R̂ and t = µ₁ + t̂/s − R̂µ₀.
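The normalization and the composition of the final pose can be sketched as follows. This is our own numpy reading of the procedure (function names and the exact percentile indexing are taken literally from the text above and should be checked against the published example code):

```python
import numpy as np

def normalize_pair(X0, X1):
    """Center both clouds and scale them jointly so that ~90% fits the unit cube."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    C0, C1 = X0 - mu0, X1 - mu1
    # maximal absolute coordinate per point, pooled over both clouds, sorted ascending
    y = np.sort(np.concatenate([np.abs(C0).max(axis=1), np.abs(C1).max(axis=1)]))
    s = 1.0 / y[int(np.floor(0.45 * len(y)))]
    return s * C0, s * C1, s, mu0, mu1

def compose_pose(R_hat, t_hat, s, mu0, mu1):
    """Pose aligning the raw clouds, from the network pose on the normalized clouds."""
    R = R_hat
    t = mu1 + t_hat / s - R_hat @ mu0
    return R, t
```

The composition follows from R̂ · s(x₀ − µ₀) + t̂ = s(x₁ − µ₁): dividing by s and rearranging gives x₁ = R̂x₀ + (µ₁ + t̂/s − R̂µ₀).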

Coherent Learning of INV-Flow2PoseNet
The goal is to train the network to estimate the best possible optical flow that enables stable extraction of the pose. Therefore, to obtain an end-to-end trainable network, we define a joint loss function that penalizes both deviations from the ground truth flow and errors in the pose extracted under the given flow.
The PWC-Net structure predicts flows F^(l) at different levels l = 0, ..., L. The Flow2PoseNet moreover uses these flows in order to predict the relative rotation R and translation t. Let the corresponding ground truth F^(l)_GT, R_GT and t_GT be given.

Multiscale Endpoint Error
The multiscale endpoint error (EPE) penalizes the flow errors at the different levels with different strengths, given by the respective weighting parameters α_l: L_EPE = Σ_{l=0}^{L} α_l ‖F^(l) − F^(l)_GT‖_F, with suitable level weights α_l, l = 0, ..., L, and the Frobenius matrix norm ‖·‖_F. In the case of sparse ground truth data, the differences inside the norm are masked in order to take the sparsity into account.
Note that the higher levels, which describe the rather coarse flow, are more important than the lower levels, which receive the higher levels' output as input. However, since the higher levels have a lower resolution, their flow errors are smaller in absolute numbers than those of the lower levels. As a rule of thumb, because of the pooling between the levels, the weighting should be at least halved from level to level to account for the resolution discrepancy. The weights used for the proposed network are {α₀, ..., α₆} = {0.001, 0.0025, 0.005, 0.01, 0.02, 0.08, 0.32}.
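The weighted multiscale EPE with the stated weights can be sketched as follows (a numpy illustration with names of our choosing; a real training loop would use an autodiff framework):

```python
import numpy as np

ALPHAS = [0.001, 0.0025, 0.005, 0.01, 0.02, 0.08, 0.32]  # alpha_0 .. alpha_6 from the paper

def multiscale_epe(flows, flows_gt, masks=None):
    """Weighted sum over levels of the Frobenius norm of (predicted - GT) flow."""
    loss = 0.0
    for l, (f, f_gt) in enumerate(zip(flows, flows_gt)):
        diff = f - f_gt
        if masks is not None:
            diff = diff * masks[l][..., None]  # zero out pixels with no (sparse) GT flow
        loss += ALPHAS[l] * np.sqrt((diff ** 2).sum())
    return loss
```

Each element of `flows` is an (H_l, W_l, 2) flow map of the corresponding pyramid level; the optional per-level masks implement the sparsity handling mentioned above.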

Alignment Error
A measure that treats rotation and translation together is the well-known alignment error. It models the mean Euclidean distance over all point correspondences given by the ground truth flow: L_align = (1/|Ω|) Σ_{(x,y)∈Ω} ‖R V₀(x, y) + t − Ṽ₁(x, y)‖₂, where Ṽ₁ denotes V₁ warped by the ground truth flow and Ω the set of valid pixels. This measure best describes the problem to be solved. It has the advantage that it weights the impact of the rotation against the translation. Note that it is important to mask out pixels that are invalid in either V₀ or the warped V₁, in order to ensure that only locations are taken into account where matching vertices are available in both views.
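A masked alignment error over vertex maps might look as follows (a numpy sketch with our own function name; `V1_warped` is assumed to be V₁ already warped by the ground truth flow):

```python
import numpy as np

def alignment_error(R, t, V0, V1_warped, valid):
    """Mean Euclidean distance between transformed V0 vertices and warped V1 vertices.

    valid: boolean mask marking pixels where both vertex maps carry data.
    """
    diff = (V0[valid] @ R.T + t) - V1_warped[valid]   # apply (R, t) to each valid vertex
    return np.linalg.norm(diff, axis=-1).mean()
```

Since the mask restricts the mean to jointly valid pixels, invalid (zero-filled) vertices cannot bias the loss.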
Note that this error alone might erroneously interchange rotation and translation effects in order to achieve a minimal alignment error. Such interchanges can be prevented by adding direct translational and rotational error terms to the overall loss function. These additional terms act as a regularization that enforces a better decomposition into translation and rotation.

Translational and Rotational Errors
The error of the predicted translation is given directly by the Euclidean distance to the ground truth translation: L_t = ‖t − t_GT‖₂. Special attention is required for the rotation error. A suitable differentiable error between two rotation matrices R and R_GT is given by the angular error, defined as the absolute value of the rotation angle θ of the relative rotation R_rel = R R_GTᵀ. Looking at the conversion to the axis-angle representation, there are basically two ways to compute the rotation angle. The first relation is given by the trace of the rotation matrix: Tr(R_rel) = 1 + 2 cos(θ). (18) Another way is to calculate the rotation angle from the length of the extracted rotation axis. For an explicit rotation matrix, the rotation axis u is given by u = (R_rel,32 − R_rel,23, R_rel,13 − R_rel,31, R_rel,21 − R_rel,12)ᵀ, (19) and the rotation angle θ is related to the length of u by ‖u‖ = 2 sin(θ). (20) A direct computation of θ from Equation (18) or (20) requires one of the inverse trigonometric functions arcsine or arccosine. These cause numerical problems due to singularities for angles close to ±π/2 or ±π, which is unsuitable for a general loss function that has to be differentiable. A more stable way to obtain θ is to use the two-argument arctangent atan2 with both quantities: θ = atan2(‖u‖, Tr(R_rel) − 1). (21)
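The singularity-free atan2 formulation can be sketched in a few lines of numpy (function names ours): since ‖u‖ = 2 sin(θ) and Tr(R_rel) − 1 = 2 cos(θ), atan2 of the two recovers θ directly.

```python
import numpy as np

def rotation_angle(R_rel):
    """Rotation angle of a rotation matrix via the stable two-argument arctangent."""
    u = np.array([R_rel[2, 1] - R_rel[1, 2],
                  R_rel[0, 2] - R_rel[2, 0],
                  R_rel[1, 0] - R_rel[0, 1]])       # rotation axis, length 2*sin(theta)
    return np.arctan2(np.linalg.norm(u), np.trace(R_rel) - 1.0)

def angular_error(R, R_gt):
    """Angular error between a predicted and a ground truth rotation."""
    return rotation_angle(R @ R_gt.T)
```

Unlike arccos of the trace, this form has well-behaved gradients over the whole angle range except at exactly θ = 0 and θ = π.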

Joint Training Loss
The joint loss function is subsequently given by the weighted sum of the individual terms, L = L_EPE + λ_align L_align + λ_t L_t + λ_θ L_θ, with weights λ balancing the flow, alignment, translation and rotation errors. At the beginning of the training, the gradients of the computed optical flow are detached before back-propagating the alignment errors.

Representation of Rotation
In order to ensure that the predicted rotation matrix is a proper rotation, a minimal parameterization by Euler angles is chosen. Therefore, three values (θ, ρ, φ) are predicted by the network, defining the rotation angles around the x, y and z axes by the rotation matrices R_x(θ), R_y(ρ) and R_z(φ). The total rotation is given by the consecutive execution of these rotations: R = R_x R_y R_z. Vice versa, the respective Euler angles can be extracted from a given rotation matrix R by θ = atan2(−R₂₃, R₃₃), ρ = atan2(R₁₃, √(R₁₁² + R₁₂²)) and φ = atan2(−R₁₂, R₁₁). This conversion is especially used to compute the Euler angles of the refined rotation matrix R̂ in the 3 Step Method of Section 4.
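Both directions of the conversion can be sketched as follows (a numpy illustration under the R = R_x R_y R_z convention stated above; function names are ours, and the extraction is only valid away from the gimbal-lock case ρ = ±π/2):

```python
import numpy as np

def euler_to_matrix(theta, rho, phi):
    """Compose R = R_x(theta) @ R_y(rho) @ R_z(phi) from predicted Euler angles."""
    ct, st = np.cos(theta), np.sin(theta)
    cr, sr = np.cos(rho), np.sin(rho)
    cp, sp = np.cos(phi), np.sin(phi)
    Rx = np.array([[1, 0, 0], [0, ct, -st], [0, st, ct]])
    Ry = np.array([[cr, 0, sr], [0, 1, 0], [-sr, 0, cr]])
    Rz = np.array([[cp, -sp, 0], [sp, cp, 0], [0, 0, 1]])
    return Rx @ Ry @ Rz

def matrix_to_euler(R):
    """Recover (theta, rho, phi) from R = R_x R_y R_z, away from rho = +-pi/2."""
    theta = np.arctan2(-R[1, 2], R[2, 2])
    rho = np.arctan2(R[0, 2], np.hypot(R[0, 0], R[0, 1]))
    phi = np.arctan2(-R[0, 1], R[0, 0])
    return theta, rho, phi
```

The extraction follows from the explicit product: R₁₃ = sin ρ, R₂₃ = −sin θ cos ρ, R₃₃ = cos θ cos ρ, R₁₂ = −cos ρ sin φ and R₁₁ = cos ρ cos φ.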

Evaluation
For evaluation, we compare the calculated optical flows and registrations qualitatively on different synthetic and real datasets. The highly accurate results demonstrate good generalization, without fine-tuning, from synthetic training data to the difficult real test scenes.
Therefore, multiple positions of the real scene are shown in Figure 12. It visualizes the performance of the method applied to eight partial scans of the Buddha scene (from the BuddhaBirdReal dataset), as typically produced by 3D scanners. Using the alignment given by the neural network, a few iterations of Iterative Closest Points (ICP) for refinement yield impressive results on the overall aligned point cloud of the object. Figures 13 and 14 show the results for exemplary objects from the training (top 3 rows) and test datasets (bottom 3 rows) for the consistent and inconsistent light (moving light source) cases. The first column shows the input data consisting of images, normals and depth maps (converted to vertex maps using the calibration information, as in Equation (6)). The second column shows the resulting optical flow in comparison to the semi-dense ground truth optical flow in column 3. Columns 4 and 5 finally show the initial and the registered point clouds using the proposed neural networks. Special attention should be given to row 6 of Figure 14, which shows the performance of the neural network on a real test scene without fine-tuning.
In particular, for comparability with other methods, we also consider a network trained on the popular training sequences of Kitti Odometry and evaluate it on the test data, as shown in Figure 15. As the Kitti dataset has weaker rotations and fewer shading changes, it is not the typical use case for the proposed method. Nevertheless, the proposed method works reliably for these easier situations as well, while additionally handling the noise resulting from the lidar depth measurements in the Kitti data. The network generalizes well from the known training data to the unknown test data.

Quantitative Evaluation
For quantitative evaluation, we first compare the different architectures (1 Step and 3 Step) on the datasets published together with this work. Table 1 shows the results on the full subsets with consistent light and inconsistent light. In both cases, the 3 Step method yields superior results compared to the standard procedure that directly predicts rotation and translation jointly. In particular, the resulting rotation is much more accurate, yielding an alignment error up to three times smaller than that of the popular standard prediction method. For completeness, we also trained the proposed architecture on the famous Kitti Odometry dataset. As mentioned, these data do not reflect the situations where the strengths of the proposed architecture come into play. In addition, there are many procedures that are specifically tuned to this common dataset. Nevertheless, our architecture delivers results that place within the ranking. Table 2 shows the methods that are also based on point clouds and are therefore comparable to the presented method. Our method would place around rank #100, which shows its applicability to other tasks as well. Table 2. Extract of the ranking of the Kitti Odometry dataset showing point cloud based methods. The proposed method would be placed within the ranking, although rather at the end. Nevertheless, this shows the additional applicability of the method to other highly studied tasks.

Predicted Dense Optical Flow
A special feature of the proposed method is its coarse-to-fine pyramidal optical flow base, combined with the rigid pose extraction. Therefore, one can assume that the flow-predicting sub-network learns rigidity relations from the extractability of the rigid pose from the dense optical flow. As shown in Figure 16, the ground truth optical flow (column 2) that has been used for training and evaluating the networks is sparse, as it only contains the flow of points that are visible in both views. As the data are created synthetically, it is possible to also render dense ground truth optical flows (column 4) that contain the flow of points occluded in one of the views, which therefore may not be computable at all by the network. As can be seen, the predicted optical flow (column 3) is dense. It also predicts flow values for points that are not visible in both views. These values result from the context of other points, where the flow can be estimated stably. The network learns how the flow behaves for rigid objects and transfers this knowledge to interpolated pixels. This works for objects known from the training set (rows 1 and 3), test objects never used for training (rows 2 and 4), the ConsistentLight case (rows 1 and 2) and the InconsistentLight case (rows 3 and 4). Table 3 moreover shows that the resulting Endpoint Errors (EPE) do not dramatically increase for the invisible points, which indicates that the network learns to predict flows for the invisible points from context, according to the behavior of rigid objects. Table 3. Quantitative results for the visible and invisible points in the evaluated scenes. The resulting Endpoint Errors (EPE) do not heavily increase. The network is still able to predict accurate flows from the context of visible points and generalizes to the test data as well as to the consistent and inconsistent data.

Conclusions
In this paper, a method has been presented that combines optical flow estimation of rigid scenes with a posterior pose estimation. In this way, through several contributions, a method has been developed that allows scenes with difficult lighting conditions to be registered in a stable way.
Optical flow is thereby estimated accurately using geometric, shading and texture features. The variety of different feature types allows the system to be trained to be illumination resistant (using geometric and normal features) without having to completely sacrifice potentially important texture features.
The pose is then stably estimated from the warped normal and vertex maps using a new 3-step procedure. Compared to typical approaches that directly infer the pose, this has significant advantages, especially in cases with strong rotations, which often cause the considered shading changes.
The combination of optical flow and rigid pose estimation allows the pose to benefit from the features of different levels of the underlying coarse-to-fine flow approach, which means that the method is not dependent on highly accurate features and can also align smooth scenes with weak features. In turn, the optical flow sub-network learns a typical flow behavior of rigid scenes from the posterior estimability of the pose. This allows accurate dense estimates to be achieved, even for occluded areas based on context and overall learned behavior.

Conflicts of Interest:
The authors declare no conflict of interest.