Line-Constrained Camera Location Estimation in Multi-Image Stereomatching

Stereomatching is an effective way of acquiring dense depth information from a scene when active measurements are not possible. So-called lightfield methods take a snapshot from many camera locations along a defined trajectory (usually uniformly linear or on a regular grid—we will assume a linear trajectory) and use this information to compute accurate depth estimates. However, they require the locations for each of the snapshots to be known: the disparity of an object between images is related to both the distance of the camera to the object and the distance between the camera positions for both images. Existing solutions use sparse feature matching for camera location estimation. In this paper, we propose a novel method that uses dense correspondences to do the same, leveraging an existing depth estimation framework to also yield the camera locations along the line. We illustrate the effectiveness of the proposed technique for camera location estimation both visually for the rectification of epipolar plane images and quantitatively with its effect on the resulting depth estimation. Our proposed approach yields a valid alternative for sparse techniques, while still being executed in a reasonable time on a graphics card due to its highly parallelizable nature.


Introduction
Two-image stereomatching has received much attention among passive depth estimation methods, in large part because of the practicality of the set-up. However, multi-image depth estimation has access to more data for estimating the disparity and is hence expected to deliver better results.

The Benefits of Multiple Views
Stereomatching performs depth estimation by detecting objects in two images from a different location. The disparity between these locations is inversely proportional to the distance of the object from the baseline, and directly proportional to the width of the baseline. Obviously, objects that only appear in one of the two views are troublesome: these occlusions are an important problem encountered by two-image stereomatching. Evidence of the expansive existing literature on that topic: it is one of the main discriminating features between the many stereomatching techniques.
Initial methods ignored the problem of occlusions. Many methods, including the seminal work by Horn and Schunk [1], assume that the optical flow field (called the disparity field in stereomatching) is smooth, which means that there are no occlusions [2][3][4][5]. Some methods explicitly detect occlusions (typically based on an intermediate disparity estimate) and handle these regions separately [6][7][8], while others define the problem as an optimization that is more robust to the impact of occlusions through well-chosen cost functions [9,10]. Estimating the disparity field is done by either treating the disparity as a continuous data field [1- 3,5,6,8,11], or by evaluating a discrete set of hypotheses for each pixel [10,12].

The Importance of Camera Location Estimation
Current multi-image methods assume that the locations of the cameras are accurately known [14][15][16]. For static set-ups with many cameras (i.e., plenoptic cameras), this is indeed the case. However, for a single moving camera (i.e., on a camera dolly), it is unlikely that the location at any given time is known accurately. We usually do not know these locations accurately because of the non-uniform movement of the camera (see Figure 1). We focus specifically on the scenario of a camera dolly: the camera's location is physically restricted to a single line, but its velocity is not necessarily well known or even constant.
Methods exist for estimating unconstrained camera positions in video. The method used for estimating camera positions by Kim et al. [15], for example, is the VOODOO camera tracker (VISCODA Gmbh, Hannover, Germany) [21]. It uses sparse image feature correspondences to estimate the camera parameters. Furthermore, it does not explicitly restrict the camera locations to a line, which we know is the case for our camera dolly.
In known rotations, camera resectioning [22][23][24], the location of cameras is estimated under the assumption that their orientations are perfectly known. Our restricted version of the problem, with a constant camera orientation and movement only along the camera x-axis, is a more specific version of the known-rotations camera estimation problem, which requires the solution to a large-dimensional homogeneous system.
In this paper, we present a novel approach to estimate the camera locations. The proposed approach fits within the stereomatching mindset: rather than using only a small subset of the input data (sparse image features), we use the entire image as a dense data field for camera location estimation. By exploiting the stereomatching framework, we enjoy the benefit of dense matching (much more measurement points to base the estimation on) without the drawback of having to explicitly take the discreteness of feature points into account. For example, the aperture problem in the optical flow formulation of [1] implies that the cost function contribution of homogeneous areas is nearly constant. For occlusions, the cost function contribution is also relatively flat, but we show that explicitly taking occlusions into account can be done in a straightforward manner. The drawback is, of course, the increased number of computations-we solve this practically by doing all calculations on a graphics card with the aide of the Quasar language [25].

Outline
In [16], we showed how the optical flow approach to dense correspondences [3,5] can be extended to apply to a linear sequence of cameras. We now leverage the same framework to estimate the locations of the cameras involved. Local optima turn out to be an issue, which we mitigate by enforcing monotonicity on the camera locations: the camera only ever moves in one direction. In camera dollies, this is clearly the case, as the camera locations along the movement line follow the image order.
The complete work flow for disparity estimation is shown in Figure 2. First, a rough disparity estimate is computed from two views: the central view and the reference view (respectively defining positions 0 and 1 along the line). This initial disparity is used to estimate all of the camera locations θ k , which are required for the full disparity estimation. Using the camera locations θ k , the disparity estimate can be refined using all of the inputs' views. The novelty of this paper lies in the camera location estimation. The full disparity estimation work flow. A rough disparity estimate based on two views is used to estimate the camera locations for all views. Afterwards, the camera locations and all input views can be used to estimate a more accurate disparity map. The novel method proposed in this paper concerns the camera location estimation.
In the (largely hypothetical) scenario that the depth is accurately known, the camera location estimation can be performed directly on the input views and the groundtruth disparity field. In the more likely case that the disparity field is not known in advance, one could envision an iterative process between optimizing the camera locations and refining the estimate of the disparity field-using the refined disparity estimate to improve the location estimates. However, our experiments have shown that there is little to no added value to this workflow, presumably because both optimizations use the same optical flow formulation and hence are already well tuned to each other. When the initial disparity is wrong, this is because of ambiguities in the optical flow formulation (e.g., due to the aperture problem). These ambiguous areas will therefore contribute very little to the cost function of the camera positions (besides adding a nearly constant value).
In Methods, we recap the Multi-image Stereomatching framework from [16] that we adapt and extend in this paper to perform Camera Location Estimation. These are the blocks "disparity estimation" and "camera location estimation" from the diagram in Figure 2, respectively. Results are given, both for known input sequences and for image sequences with non-uniform camera locations. We compare both to the commercially available VOODOO tracker, and the known-rotations solution using the feature points tracked by the VOODOO tracker.

Multi-Image Stereomatching
We start with a quick recap of the method in [16], which estimates the disparity fields under the assumption that the camera locations are known. In multi-image stereomatching, we estimate the depth field of the central camera: the disparity of an object between two camera views is inversely proportional to the depth of the object and directly proportional to the distance between the cameras. Formally, we need to estimate the (scalar) disparity field u k (x, y), which maps each pixel in the image domain Ω in the center view I C to its corresponding pixel in any given view I k . As the inputs are rectified, this corresponding pixel lies on the same scanline, i.e., there is only a shift in the horizontal direction. We say that the center view is located at location θ C = 0, and therefore u k (x, y) will be a scaled version of a global disparity field: u k (x, y) = θ k u(x, y). Lighting conditions, vignetting, etc. can cause serious problems for the color constancy assumption. This is why we enforce the correspondence relationship only for a transformed version of the images, F I C and F I k : find u(x, y) such that ∀(x, y) ∈ Ω unoccluded and ∀k : (1) The image transformation F should preserve the geometry of the input views while mitigating lighting effects; generally speaking, it is a local transform of the pixel values. Good choices are histogram equalisation, a census-transform [26,27], a correlation transform [3], or a conversion to the CIEL * a * b * colorspace with removal of the luminosity component [7,28].
Finally, the global disparity field u(x, y) is estimated by minimizing a data term E d : the sum of squared errors over all views. In order to account for poorly conditioned (i.e., homogeneous) areas or occlusions, a regularisation term E s (u) is included in the objective function: We use the bilateral total variation from [3] as the regularisation term, which has proven to be effective at regularizing the estimate while respecting image borders. For additional details on disparity estimation from known camera locations, please refer to [16].

Camera Location Estimation
The main contribution of this paper is an approach to estimate the camera locations along the baseline based on the data term E d (u, θ) from Equation (2). We assume that an estimate for the disparity field is known, and we minimize E d (u, θ) w.r.t. θ rather than u.
Without any regularisation or additional constraints on the camera positions, the resulting optimization problem separates into a set of relatively straightforward 1D optimization problems, as the K problems become independent. Given an initial estimate θ k,0 , we can optimize a small change ∆θ k by linearisation around θ k,0 (so that θ k = θ k,0 + ∆θ k ): We call F I k (x + θ k,0 u(x, y), y) the warped version of F I k x, y with the current position estimate; it is calculated by linear interpolation of the values F I k x, y at the locations (x + θ k,0 u(x, y), y).
The complete optimization method then consists of a series of such linearisations until convergence.
Sadly, there turn out to be many local optima for the separate per-camera objective functions (see Figure 3) as well as a relatively small well of attraction near the correct minimum. However, whenever the initial estimates for θ k,0 are somewhat close to their correct values, this approach convergences to the correct optimum. However, cameras for which the initial estimate is too far away from the correct value are likely to get caught in local minima. In order to not get caught in these local minima, we employ a twofold approach. First of all, we enforce the monotonicity of the camera locations: the order of the cameras along the line is the same as the order of the input images: θ k < θ k+1 . Secondly, we work in a coarse-to-fine manner. In lower resolutions, the global minimum is much wider and local minima are rarer. By enforcing the monotonicity in the lower resolutions, for which the computations are not as expensive, we can mitigate the effect of local minima in the higher resolutions: we can, in the higher resolutions, assume that the estimates are in the vicinity of the correct optimum, and we no longer need the monotonicity constraints.

Monotonicity Constraint
After applying the linearisation, the problem takes the form where Note that the values of α k and β k depend on the current location estimates θ k,0 and need to be recalculated for each linearization iteration. They are sums over the input image pixels. We take occlusion of the views into account by leveraging the same occlusion detection from [16]; this rsesults in a weight for each pixel, which is 0 whenever the pixel is estimated to be occluded given the current disparity field/location estimate and 1 when it is visible.
As mentioned earlier, without any additional constraints this falls apart in K separate problems. Instead, we link all of these problems using monotonicity constraints: Note that this is a quadratic programming problem for which solvers exist. However, we keep in mind that this problem is only an approximation of the actual problem, which is only valid for small ∆θ k , as it arises from the first order Taylor series expansion in Equation (3). Therefore, accurately solving Equation (6) is of little relevance, and an approximate solution will do. We opt for an iterative approach based on Lagrangian duality. According to the theory of Lagrangian optimisation, this problem is equivalent to minimizing the following Lagrangian w.r.t. θ and λ: and the resulting Lagrangian dual is given by the optimal value of θ k in function of λ: Note that this solution θ k ( λ) is similar to the optimum of Equation (4), but corrected with a term to enforce monotonicity: (λ k − λ k−1 )/β k . Substituting the expression for θ k ( λ) in Equation (7) results in a quadratic programming problem, which is easier to solve than the quadratic problem from Equation (6) because of the simpler form of the constraints. However, due to the positivity constraints on λ k , a closed-form solution is not straightforward. We opt instead for iterative Euclidean projection: we calculate the optimal values without the positivity constraints, then project all λ k on their feasible set (R + ). We have found that applying iterative Euclidean projection directly on Equation (6) is not as efficient, and applying it on the dual variables yields better results. The resulting update rules are given by the solution of the system The coefficient matrix of this system is a symmetric tridiagonal matrix, which can be efficiently inverted to solve the system [29,30]. The derivation of the Lagrangian dual in Equation (8) and the system from Equation (9) are given in Appendix A. The Lagrangian optimisation alternates between improving steps in the dual variables λ using Equation (9) and updates for the locations θ using Equation (8).

Hierarchical Approach
We use a hierarchical approach in order to widen the wells of attraction of the cost function in the initial optimization steps. We subsequently downscale the input images by a factor two until the image size is smaller than 100 × 100. The camera locations are first estimated in the lowest resolution and iteratively refined on the higher resolution versions of the inputs. The aforementioned monotonicity constraints are only applied on the lower levels; higher levels simply use the closed form solution to Equation (4). This speeds up the optimization as the computations in the lower resolutions are not as expensive. We can assume that for the higher resolutions the estimates are in the vicinity of the correct optimum and no longer need the monotonicity constraints. This strikes a good balance: on the lower levels, where the initial location estimates are likely very poor, we employ the monotonicity constraints to force the location estimates in the right direction. On higher levels, where the initial values θ k,0 are closer to their true value and where calculations are more costly, we adopt the unconstrained estimation from Equation (4).
In traditional optical flow methods, the hierarchical approach is used because the estimation hinges on local first order Taylor series approximations, which means that the update steps should stay smaller. For flow estimation (or stereomatching), the size of the variables scales linearly with the resolution of the input image so that this coarse-to-fine approach helps decrease the step magnitudes. In our proposed location estimation, this is not the reason for the hierarchical approach; however, on lower resolutions, the cost functions are smoother, which results in a better behaviour of iterative optimization techniques.

Overview of the Entire Work Flow
The pseudo code for the full work flow is shown in Algorithm 1. For brevity, all loops over the number of views K are omitted, and we simply state the update step for the kth entry.
Algorithm 1: Overview of the proposed approach u ← disparity estimate from two views for from 1 to L do λ k ← the solution of the system Equation (9) (see Appendix A) end for θ k ← θ k,0 + α k +λ k −λ k−1 β k end for end for u ← disparity estimate from all views as described in [16].
First, we estimate the disparity for the center view from the center view and the reference view (fixed at, respectively, locations 0 and 1). Then, for each level t of the pyramid, we perform W iterations of the linearisation around the current estimates of θ k . Each such iteration is called a warp because it entails warping the images with the current estimate of the disparity field θ k u(x, y): F I k,t (x + θ k u t (x, y), y). Per warp, we require the computation of the new values for λ and those of θ. The values for λ follow from L iterations of block coordinate descent, while the optimal values for θ are given in closed form by Equation (8).
Finally, the estimated camera locations are used in a multi-image stereomatching algorithm to arrive at a disparity estimate of the scene.
The parameters used for the reported results are: • T = 5 levels in the pyramid, with a scaling factor of f = 2; • W = 10 warps per level; • L = 10 block-coordinate descent iterations per warp for optimizing the dual variables.
The bottleneck of the algorithm is by far the calculation of α k and β k in each warp because of the summation over all of the image pixels. The method was implemented and executed on the Graphic Processing Unit (GPU) using the Quasar programming language [25], speeding up these calculations drastically compared to a Central Processing Unit (CPU) implementation.

Results
Our evaluation of the proposed method is fourfold. First of all, we take a look at the monotonicity constraint and illustrate its effectiveness at estimating camera positions far from the reference camera position (the one we define to be at location 0). Secondly, we quantify the accuracy of the proposed method in terms of estimated camera positions. We compare on both existing datasets from Kim et al. [15] and new datasets captured in the context of this paper. In the third subsection, we illustrate the method's effectiveness at estimating the camera parameters in a practical setting, plus how-and why-it can cope with typical stereomatching artifacts and errors. Finally, we show visually how the proposed method transforms non-uniformly sampled epipolar plane images into rectified versions consisting only of straight lines.

Monotonicity Constraint
In our first experiment, we estimate the camera positions twice: once with and once without the monotonicity constraints on the locations. To this end, we use some datasets from [15]: the statue and couch data sets. Kim et al. provide a large set of rectified views for both of these sets, as well as estimated disparity maps.
For a more challenging scenario, we use details of these images. This means that, for large displacements of the camera, the scene will change noticeably. We do this by extracting the 640 × 480 patches with the top-left corner at respectively (900, 800) for statue and (800, 2000) for couch. Cropping results in relatively larger homogeneous areas and more occlusions. This is illustrated in Figure 4.
(statue) (couch) Figure 4. Example input to the algorithm. Top: cropping the images results in more occlusions and a more challenging location estimation. Bottom: the graphs show the unconstrained Solution (4) (in blue) and the constrained Solution (6) (in red) using the high quality disparity maps from [15] as input, for the cropped statue (left) and cropped couch (right) datasets.
We perform the camera location estimation using the input disparity maps estimated in [15]. When the unconstrained location estimate is correct, we see in Figure 4 that the constrained estimate simply adopts the same values. When the unconstrained optimization converges poorly, the monotonicity constraints guide the camera locations and we arrive at a correct estimate (see the graphs in Figure 4).

Accuracy of the Proposed Method
In order to quantify the accuracy of the proposed method, we compare its performance with that of the VOODOO camera tracker. We do this both on the Disney datasets and three additional datasets captured in the context of this publication: map, office and collection. These datasets were recorded with a camera mounted on a rail. The movement speed of the cameras was hence not uniform. The rail was annotated with a ruler, and we halted the camera for a while after every 5 mm (map and collection) or 1 cm (office). We automatically extracted the frames with the least inter-frame changes near those locations and visually inspected the videos for correctness.
For the baseline, we evaluate three distinct methods. The first one, VOODOO projected, is the one implemented by the VOODOO tracker. It tracks feature points through the sequence, and then performs unconstrained structure-from-motion estimation. The resulting camera position estimates are regressed onto a single line. The second one, VOODOO constrained, uses the same tracked features (already filtered from outliers and with correspondence information), and passes them to a known-rotations problem solver. As we know that the orientation is fixed and that the cameras only move along their own x-axis, we only need estimate a single value for each camera and a 3D location for each distinct point. The result is a linear homogeneous system that is solved by a singular value decomposition. Additionally, we also compare to our proposed method with the ground truth disparity as input. Finally, a note on the location reference systems is provided. While our proposed approach assumes the central camera to be at position 0, and an arbitrarily chosen reference camera at position 1, this is not the case for the VOODOO methods. They have an unknown reference system for their camera positions, which we do not know and which makes quantitative validation hard. For this reason, both of them are subjected to an additional optimization over the origin and the step size of this axis (see the Appendix for more details). The displayed values are those for the best-case reference system.
As is visible in Figure 5, the proposed method matches the VOODOO tracker whether it starts from disparity maps estimated by Kim et al., or disparity maps estimated by our earlier method [16].  [16]. The box plot shows the average camera location error expressed as a fraction of the inter-camera distance (which is the same for all adjacent cameras), as well as the quantiles.
Clearly, the case of known disparity maps is an atypical case and we need to evaluate the method when it uses estimated disparity maps as well. For our own test scenarios (see Figure 6), no high-quality disparity maps are available. Therefore, we first perform the disparity estimation on one pair of images, and then use this estimate of the disparity for the location estimation. We regress VOODOO's locations estimates to a single line and project all camera locations on that line to arrive at a set of linear locations estimates for the cameras. In Figure 7, it is visible that here, too, our proposed method fares better than the VOODOO tracker. Figure 6. The extreme camera positions for three datasets captured in our lab (from left to right: the map, collection and office datasets). The datasets were recorded using a rail-mounted camera (using a crank and chain system). For the map and collection datasets, the distance to the scene is roughly 20-30 cm and the inter-camera distance is 5 mm between all views. For the office dataset, the distance to the scene is roughly 2-5 m and the inter-camera distance is 1 cm between all views. As our set-up has the camera moving vertically, the algorithms described here are applied on the transposed images, and all images are transposed again for visualization. Figure 7. Comparison of the proposed method and the VOODOO tracker, on the three datasets recorded in our lab. The used disparity maps were all estimated as in [16]. The box plot shows the average camera location error expressed as a fraction of the inter-camera distance (between the center and the reference camera), as well as the quantiles.

Estimation in a Practical Setting
In order to test the proposed approach in a practical set-up, we have used a rail-mounted camera (propelled manually) to capture a static scene. The camera's motion is linear and perpendicular to its look direction, such that the prerequisites of the method are fulfilled. Consider an example input frame and its two-view depth estimate from stereomatching as given in Figure 8. This is the input depth map that was used to estimate the locations of the cameras and rectify the epipolar plane images as given in Figure 9 (i.e., resample them by simulating uniformly distributed camera positions). It is illustrated that, while the entire left portion of the image suffers from a poor disparity estimate, the correct right half still results in correctly estimated camera locations. Additionally, the monotonicity constraints force the location estimates to enter the neighbourhood of the local minima. We note that, while the disparity estimate appears completely off, it was estimated using the method on which our location estimation is based [16]. That is to say, the areas where the estimate is off are those where stereomatching itself is ill-defined (i.e., occlusions or homogeneous areas). In essence, this means that, for that area, it is hard to discriminate between various depth estimates and, by extension, those areas will hold little sway over the camera position estimation. Hence, our approach is robust to such errors in the input map. We also show the impact of a correct location estimation. Figure 10a shows the estimated disparity for the manually recorded sequence, using all input views. The left image shows the estimated disparity field based on a faulty location hypothesis: uniform camera speed. The right image shows the disparity estimate when using our proposed approach to estimate the camera locations.
One can see that the estimation exploiting multiple views (with the correct camera locations) is able to avoid the issues in the left half of the input frame that were present in the two-view scenario in Figure 8. Moreover, one can also see that Figure 10a contains errors that are not present in Figure 10b, e.g., the depth of the elephant, or near the edges of the statue. Correct estimation of the camera positions is paramount for the depth estimation.  Figure 9 shows the images before and after rectification. Rectification consists of the vertical resampling of the epipolar plane image, given the estimated camera locations, which are used as the actual row locations-we resample to uniform row locations: a perfectly rectified epipolar plane image contains only straight lines and constant areas. In effect, we linearily resample to virtual camera locations uniformly over the camera movement line (between both extreme camera positions). These rectified versions can be used by methods that assume uniform camera movement, such as the estimation of the slopes of the lines by [15].

Conclusions
In this paper, we propose a novel technique to estimate camera locations in a multi-view stereomatching set-up. The technique is based on the optical flow formulation of the stereomatching problem, which allows us to express the location estimation as the minimization of a second-order polynomial. After experimental motivation that local minima can be an issue in this optimization problem, we introduce monotonicity constraints to mitigate them.
Estimation of the camera locations is done using a primal-dual approach based on the theory of Lagrangian duality. We perform estimation in a coarse-to-fine manner in order to only have the cost of the expensive primal-dual optimization in the lower resolutions. In higher resolutions, we can assume that we are already in the vicinity of the global optimum, and we do not require monotonicity constraints. The proposed method uses an input disparity estimate for the scene, but it is motivated and illustrated to be robust against poor input disparity estimates because of the shared optical flow formulation: ambiguous areas for stereomatching are also ambiguous for the camera position estimation, and hence influence the final estimate only a very little amount. Finally, we illustrate the impact and importance of the accuracy of the camera location estimates, by performing disparity estimation on a manually recorded scene, both with poor camera location guesses and with the estimated camera locations. Quantitatively, the methods delivers results that match and improve those from existing sparse feature-based trackers.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Deriving the Update Rules
We now derive the Lagrangian dual from Equation (8) and the update rule for the dual variables λ. Recall that the optimization problem is given by: The corresponding Lagrangian dual problem is given by where λ k ≥ 0, ∀k ∈ [1, K − 1] and λ K = 0. Deriving with respect to θ k yields Solving the system ∂ θ Λ θ, λ = 0 results in the Lagrangian dual We are thus adjusting the regularization-free estimation (θ k,0 + α k /β k ) with the regularization term (λ k−1 − λ k )/β k . Plugging this Lagrangian dual back into the Lagrangian function yields the dual function Λ θ ( λ), λ . This is a quadratic programming problem; it lacks a closed form solution because of the positivity constraints on λ k . Because of the low complexity of the dual function, block coordinate descent is an adequate optimization technique. Omitting the dependency of θ on λ, the dual function's gradient is given by This simplifies to By equating all partial derivatives to zero, we end up with the following system: