Three-Dimensional Reconstruction of Indoor and Outdoor Environments Using a Stereo Catadioptric System

Abstract: In this work, we present a panoramic 3D stereo reconstruction system composed of two catadioptric cameras, each consisting of a CCD camera and a convex parabolic mirror that allows the acquisition of catadioptric images. We describe the calibration approach and propose improving existing deep feature matching methods with epipolar constraints. We show that the improved matching algorithm covers more of the scene than classic feature detectors, yielding broader and denser reconstructions of outdoor environments. Our system can also generate accurate measurements in the wild without the large amounts of training data required by deep learning-based systems. We demonstrate the system’s feasibility and effectiveness as a practical stereo sensor with real experiments in indoor and outdoor environments.


Introduction
Conventional stereoscopic reconstruction systems based on perspective cameras have a limited field-of-view that restricts them in applications that involve wide areas such as large-scale reconstruction or self-driving cars. To compensate for the narrow field-of-view, these systems acquire multiple images and combine them to cover the desired scene [1]. A different way to increase the field-of-view is to use fish-eye lenses [2] or mirrors in conjunction with lenses to constitute a panoramic catadioptric system [3,4].
To perform 3D reconstruction with a stereo system, we need to find matching points on the set of cameras. Typically, there are two methods for matching image pairs in passive stereo systems: dense methods based on finding a disparity map by pixel-wise comparison of rectified images along the image rows, and sparse methods using matching features based on pixel region descriptors.
One challenge for dense methods with panoramic images is that the rectified images are heavily distorted, and their appearance depends on the relative configuration of the two cameras. As a result, many pixels in some image regions are interpolated, affecting pixel-mapping precision and the 3D reconstruction. Another limitation of dense omnidirectional stereo matching is its running time, which makes it unsuitable for real-time applications [25].
On the other hand, sparse methods with panoramic images provide a limited number of matching features for triangulation [26,27] due to the reduced spatial resolution, wasted image-plane pixels, and exposure control [28]. Because of this, it is common practice to manually select matching points for calibration or to validate the measurements [5,29]. Another prevalent limitation of these systems is that they are validated in indoor environments only. For example, Mao et al. [26] present a reconstruction algorithm for multiple spherical images; they use a combination of manually selected points and SIFT matching to estimate the fundamental matrix for each image pair and present sparse reconstruction results in indoor environments. Fiala et al. [5] present a bi-lobed panoramic sensor whose matching features are detected using the Panoramic Hough Transform; the results show sparse points of scene polygons in indoor environments. The work presented in [29], similar to our proposal, uses a sensor with the same number of components but with hyperbolic mirrors. Their 3D reconstruction was performed on manually selected points to avoid possible errors from automatic matching. To validate the method, they selected four corners of the ceiling to compute the size of the room, with an estimated error between 1 and 6 cm. Zhou et al. [30] propose an omnidirectional stereo vision sensor composed of one CCD camera and two pyramid mirrors, which they evaluated on a calibration plane. Although this sensor cannot achieve a full panoramic view, it nevertheless provides a useful vision measurement approach for indoor scenes. Jang et al. [31] propose a catadioptric stereo camera system that uses a single camera and a single lens with conic mirrors; stereo matching was conducted with a window-based correlation search.
The disparity map and the depth map show good reconstruction accuracy indoors, except in ambiguous regions such as repetitive wall patterns, where intensity-based matching does not work. Ragot et al. [32] propose a dense matching algorithm based on the interpolation of sparse points. The matching points are processed during calibration, and the results are stored in look-up tables. The reconstruction algorithms are validated on synthetic images and real indoor images. Chen et al. [4] present a catadioptric multi-stereo system composed of a single camera and five spherical mirrors. They implemented dense stereo matching and multi-point-cloud fusion. Their results in indoor scenes improved by 15% to 30% when combining the stereo results.
Alternatively, deep learning techniques have recently obtained remarkable results in estimating depth from a single perspective image in indoor and outdoor environments. Depth estimation can be defined as pixel-level regression, and the models usually use Convolutional Neural Network (CNN) architectures. Xu et al. [33] proposed predicting depth maps from RGB inputs in a multi-scale framework combining CNNs with Conditional Random Fields (CRFs). Later, in [34], they proposed a multi-task architecture called PAD-Net capable of simultaneously performing depth estimation and scene parsing. Fu et al. [35] introduced a depth estimation network that uses dilated convolutions and a full-image encoder to directly obtain a high-resolution depth map, improving training time through depth discretization and an ordinal regression loss. More closely related to our work is that of Won et al. [36], where an end-to-end deep learning model estimates depth from multiple fisheye cameras. Their network learns global context information and reconstructs accurate omnidirectional depth estimates. However, all these methods require large training sets of RGB-depth pairs, which can be expensive to obtain, and the quality of the results is limited by the data used for training.
Deep learning techniques have also been applied to feature matching and optical flow [37,38]. Sarlin et al. [39] proposed a graph neural network with an attention-based context aggregation mechanism that allows their model to reason about 3D information and feature assignments. However, the performance is bounded by the training data used and the augmentations performed during training.
To combine the best of both approaches (panoramic imaging and deep learning), in this work we present a stereo catadioptric system that uses a deep learning-inspired feature matching algorithm called DeepMatching [40], augmented with stereo epipolar constraints. With this, our system can produce wide field-of-view 3D reconstructions of indoor and outdoor scenes without the need for specific training data.
The main contributions of this work are twofold. First, we propose a catadioptric stereo system capable of generating semi-dense reconstructions using deep learning matching methods such as DeepMatching [40] combined with stereo epipolar constraints. Second, the system can produce 3D reconstructions of indoor or outdoor environments at a higher framerate than dense methods, without new training data.
The rest of the paper is organized as follows: Section 2 illustrates the catadioptric vision system, Section 3 describes the methodology, Section 4 discusses the results, and Section 5 presents the conclusion.

Catadioptric Vision System
This section presents the catadioptric vision system's experimental arrangement, which consists of two omnidirectional cameras, each comprising a CCD camera and a parabolic mirror. Figure 1 shows a schematic overview of the experimental setup for the proposed panoramic 3D reconstruction system, which consists of two catadioptric cameras: an upper camera and a lower camera, each composed of a 0-360 parabolic mirror and a Marlin F-080 CCD camera in a back-to-back configuration. We assembled the catadioptric cameras to capture the full environment reflected in the mirrors. For this purpose, we aligned a mirror with the lower CCD camera to compose the lower parabolic mirror (PML). Then we placed another mirror back-to-back with the PML and aligned it with the upper CCD camera to produce the upper parabolic mirror (PMU). Using a chessboard calibration pattern, we calibrate both catadioptric cameras to obtain their intrinsic and extrinsic parameters, as explained in the next section.

Methodology
This section presents the methodology to perform 3D reconstruction of an object or scene using the proposed catadioptric stereo system. Section 3.1 describes the calibration procedure, Section 3.2 explains the epipolar geometry for panoramic cameras, and Section 3.3 describes the stereo reconstruction.

Catadioptric Camera Calibration
To perform 3D reconstructions with metric dimensions, it is necessary to know the cameras' intrinsic parameters, such as the focal length, optical center, and mirror curvature, as well as the extrinsic parameters, composed of the rotation and translation between the cameras. These parameters are obtained through camera calibration. This section briefly describes the process to calibrate the catadioptric stereo vision system based on the work presented in [41].
To calibrate the catadioptric cameras, we used the geometric model proposed in [42]; this model uses a checkerboard pattern (shown in Figure 2a) to obtain the intrinsic parameters of each camera and the extrinsic parameters between the reference system of the cameras and that of the pattern. The pixel coordinates of a point in the image are represented by (u, v). Equation (1) shows the relation between a 3D point X_W, projected onto the mirror as x_PM = [x, y, z]^T, and a 2D point in the image.
Since the mirror is rotationally symmetric, the function f(u, v) only depends on the distance ρ = √(u² + v²), leading to Equation (2).
where f(ρ) is a fourth-degree polynomial that defines the curvature of the parabolic mirror; the coefficients a_0, a_1, a_2, a_3, a_4 correspond to the intrinsic parameters of the mirror.
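To make the model concrete, the back-projection of a pixel to a viewing direction under this polynomial model can be sketched as follows. This is a minimal illustration; the function name and the coefficient values are hypothetical, not the calibrated parameters reported later:

```python
import numpy as np

def back_project(u, v, coeffs, center):
    """Map an image pixel (u, v) to a unit viewing direction using the
    polynomial model f(rho) = a0 + a1*rho + a2*rho^2 + a3*rho^3 + a4*rho^4.
    `coeffs` = [a0, a1, a2, a3, a4]; `center` = (xc, yc) is the optical center."""
    x = u - center[0]
    y = v - center[1]
    rho = np.hypot(x, y)                  # distance from the optical center
    z = np.polyval(coeffs[::-1], rho)     # np.polyval expects highest degree first
    ray = np.array([x, y, z])
    return ray / np.linalg.norm(ray)      # unit viewing direction

# Hypothetical coefficients for illustration only (not the calibrated values)
a = [-250.0, 0.0, 1.2e-3, 0.0, 1.0e-9]
ray_dir = back_project(700, 520, a, center=(696, 519))
```

Note the reversal of the coefficient list: NumPy's `polyval` takes coefficients from the highest degree down, while the model above lists them from a_0 upward.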
To calibrate the catadioptric camera, we placed the checkerboard pattern shown in Figure 2a at different positions around the catadioptric stereo system. Figure 2b shows images of the pattern acquired with the upper camera, and Figure 2c shows the images acquired with the lower camera.

Epipolar Geometry for Panoramic Cameras
Epipolar geometry describes the relationship between positions of corresponding points in a pair of images acquired by different cameras [41,43,44]. Given a feature point in one image, its matching view in the other image must lie along the corresponding epipolar curve. This fact is known as the epipolar constraint. Once we know the stereo system's epipolar geometry, the epipolar constraint allows us to reject feature points that could otherwise lead to incorrect correspondences.
In our system, the images are acquired by the catadioptric system formed by the upper camera-mirror (PMU) and the lower camera-mirror (PML), as seen in Figure 3. The projections of a 3D point X_W onto the mirrors PMU and PML are denoted by x_i^PMU and x_i^PML, respectively. These vectors and the baseline between the mirrors are coplanar, which is expressed with the cross product ×, where R and T are the rotation matrix and translation vector between the upper and lower mirrors. The coplanarity restriction is simplified in Equation (5), where E is the essential matrix and the skew-symmetric matrix S is given by Equation (7). For every point u_1 in one catadioptric image, an epipolar conic is uniquely assigned in the other image; the search space for the corresponding point u_2 is limited by Equation (8).
In the general case, the matrix A_2(E, u_1) is a non-linear function of the essential matrix E, the point u_1, and the calibration parameters of the catadioptric camera.
The vector T_PMU→PML defines the epipolar plane Π. Equation (9) describes the vector normal to the plane Π in the camera coordinate system.
The normal vector n_1 can be expressed in the second camera coordinate system using the essential matrix (Equation (10)); Equation (10) is then rewritten as Equation (11). The equation of the plane Π described in the second camera coordinate system is given by Equation (12).
To derive A_2(E, u_1) of Equation (8) for a parabolic mirror, z is substituted into Equation (12), resulting in Equation (13), which allows the calculation of the epipolar conics for this type of mirror, where b_2 is the mirror parameter and p, q, s are the components of the normal vector n_2 computed in Equation (10).
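As an illustration of the geometry above, the essential matrix and the coplanarity residual can be sketched as follows. The extrinsics R and T below are hypothetical placeholders, not the calibrated values of the system:

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix S of Equation (7), so that S @ x == np.cross(t, x)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential(R, T):
    """Essential matrix E = S R encoding the coplanarity constraint
    x_PMU^T E x_PML = 0 between corresponding mirror points (Equation (5))."""
    return skew(T) @ R

# Illustrative extrinsics: a pure vertical baseline between the two mirrors
R = np.eye(3)
T = np.array([0.0, 0.0, 0.15])   # hypothetical 15 cm baseline
E = essential(R, T)

# Any pair of mirror points lying in the same epipolar plane (here y = 0)
# satisfies the constraint up to numerical precision
x_pmu = np.array([1.0, 0.0, 0.5])
x_pml = np.array([2.0, 0.0, 0.85])
residual = x_pmu @ E @ x_pml
```

A residual close to zero indicates that the two points are consistent with the epipolar geometry; in practice, the filtering in Section 3.3.1 uses distance to the epipolar conic rather than this algebraic residual.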

Stereo Reconstruction
Once we have the system calibrated, the next step is to acquire images and reconstruct a scene using the following procedure (see Figure 4):

1. Capture one image of the environment from each camera of the calibrated system.
2. Transform the catadioptric images to panoramic images using Algorithm 1.
3. Extract the features and descriptors from the panoramic images using a feature point detector such as SIFT, SURF, or KAZE, a corner detector such as Harris, or more advanced feature detectors such as DeepMatching [40], CPM [45], or SuperGlue [39]. Match the points between the upper and lower camera features as described in Section 3.3.1.
4. Filter the wrong matches using epipolar constraints as described in Section 3.2.
5. Map the matching point coordinates from panoramic back to catadioptric image coordinates using Algorithm 2.
6. Transform the catadioptric image points to the corresponding mirrors PMU and PML.
7. Triangulate the mirror points to obtain the 3D reconstruction, as described in Section 3.3.2.
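The unwrapping of the annular catadioptric image into a panoramic image (step 2 of the procedure) can be sketched with nearest-neighbour polar sampling. This is a minimal illustration, not the paper's Algorithm 1 itself; the optical center and the usable inner/outer mirror radii are assumed known, and all names and values are illustrative:

```python
import numpy as np

def unwrap_to_panorama(img, center, r_min, r_max, width=1440):
    """Sample the catadioptric (donut-shaped) image along rays from the
    optical center to build a panorama (rows = radius, cols = angle).
    Nearest-neighbour sampling; real implementations interpolate."""
    height = int(r_max - r_min)
    pano = np.zeros((height, width) + img.shape[2:], dtype=img.dtype)
    thetas = np.linspace(0, 2 * np.pi, width, endpoint=False)
    for col, theta in enumerate(thetas):
        for row in range(height):
            r = r_max - row                  # outer mirror edge maps to the top row
            u = int(round(center[0] + r * np.cos(theta)))
            v = int(round(center[1] + r * np.sin(theta)))
            if 0 <= v < img.shape[0] and 0 <= u < img.shape[1]:
                pano[row, col] = img[v, u]
    return pano
```

The inverse mapping used in step 5 (Algorithm 2) follows the same polar relations in reverse, converting a panoramic coordinate (row, col) back to a catadioptric pixel (u, v).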

Feature Detection and Matching
A feature point in an image is a small patch with enough information to be found again in another image. Feature points can be corners, which are image patches with significant intensity changes in both directions and are rotation-invariant [46,47]. They can be keypoints that encode texture information in vectors called descriptors, which, besides being rotation-invariant, are also scale-invariant [48-52]. Additionally, they can be deep learning-based features, which can also be invariant to repetitive textures and non-rigid deformations [39,40].
Feature matching entails finding the same feature point in both images. For this, we compared corner detectors such as Harris [46] and Shi-Tomasi [47] along with feature point detectors such as SIFT [48], SURF [49], BRISK [50], FAST [51], and KAZE [52], a state-of-the-art deep learning method called SuperGlue [39], and DeepMatching [40]. In the case of a corner detector, we use the epipolar geometry described in Section 3.2 to restrict the search to the epipolar curve and match the points using image-patch correlation.
For the case of a feature point detector, we compute the pairwise distance between the descriptor vectors. Two descriptor vectors match when the distance between them is less than a defined threshold t; t = 10.0 for BRISK and t = 1.0 for SIFT, SURF, KAZE, and FAST. For SuperGlue, we used the outdoor trained model provided by the authors for the outdoor scene and the indoor trained model for the indoor scene. For DeepMatching, after finding the matches using the original implementation, we filter out the wrong matches using epipolar constraints: we measure the Euclidean distance to the epipolar curve and keep the corresponding point in the other image only if it lies within d pixels of the curve. Figure 5c shows a selected feature as a red mark in the upper panoramic image, and Figure 5d shows the matched point in the lower panoramic image as a red mark, along with the epipolar line on which the point should lie.
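The epipolar filtering step can be sketched as follows. Here the epipolar conic is assumed to be available as a set of sampled curve points (e.g., evaluated from Equation (13)); `epipolar_fn` is a hypothetical helper, and `matches` pairs pixel coordinates from the upper and lower images:

```python
import numpy as np

def filter_by_epipolar(matches, epipolar_fn, d=30.0):
    """Keep the matches whose point in the lower image lies within d pixels of
    the epipolar curve induced by its counterpart in the upper image.
    `epipolar_fn(p_upper)` returns an (N, 2) array of sampled curve points."""
    kept = []
    for p_up, p_low in matches:
        curve = epipolar_fn(np.asarray(p_up))
        dists = np.linalg.norm(curve - np.asarray(p_low), axis=1)
        if dists.min() <= d:               # distance to the nearest curve sample
            kept.append((p_up, p_low))
    return kept
```

Sampling the conic and taking the nearest-sample distance is an approximation of the point-to-curve distance; it is sufficient for the pixel-level thresholds (d between 15 and 30) used in the experiments.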

3D Reconstruction
Once we have the feature points between the two catadioptric cameras, we convert those coordinates to mirror coordinates to perform the 3D reconstruction.
Given point pairs x_i^PMU and x_i^PML on the mirrors' surfaces and the coordinates [x, y, z] of a point x_i^PMU, we obtain Equation (14) [41,43].
From the distance |D| between the two points x_i^PMU and x_i^PML, we determine the coordinates of the point X using Equation (15).
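Equations (14) and (15) are not reproduced here, but the underlying computation can be illustrated with the standard midpoint triangulation of two viewing rays; this is a generic sketch under that assumption, not the paper's exact closed form:

```python
import numpy as np

def triangulate_midpoint(c1, d1, c2, d2):
    """Closest-point (midpoint) triangulation of two rays c + t*d.
    c1, c2: ray origins (e.g., the mirror focal points); d1, d2: directions."""
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    # Solve for the parameters minimizing ||(c1 + t1 d1) - (c2 + t2 d2)||
    A = np.array([[d1 @ d1, -d1 @ d2],
                  [d1 @ d2, -d2 @ d2]])
    b = np.array([(c2 - c1) @ d1, (c2 - c1) @ d2])
    t1, t2 = np.linalg.solve(A, b)
    p1 = c1 + t1 * d1                      # closest point on ray 1
    p2 = c2 + t2 * d2                      # closest point on ray 2
    return (p1 + p2) / 2.0                 # reconstructed 3D point
```

For skew rays (which is the usual case with noisy matches), the returned point is the midpoint of the shortest segment connecting the two rays, and the segment length plays the role of the distance |D| above.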

Results
This section presents the results for the system calibration, catadioptric epipolar geometry, feature matching, and 3D reconstruction. Table 1 shows the intrinsic parameters of the upper and lower cameras. As described in Section 3.1, a_0, a_1, a_2, a_3, a_4 are the coefficients of the fourth-degree polynomial that defines the curvature of the parabolic mirror and represent the intrinsic parameters of the mirror; x_c and y_c define the optical center of the catadioptric cameras, whose resolution is 1392 × 1038. We obtain the extrinsic parameters of the upper and lower catadioptric cameras with respect to each calibration plane (Figure 2a), that is, six extrinsic parameters (rotation and translation) for each of the 32 images acquired during calibration. The calibration errors of each catadioptric camera are shown in Table 2 and are within the ranges typical of panoramic cameras [53]. Table 2. Calibration error of each catadioptric camera.

Epipolar Geometry Results
In this section, we show how the epipolar curves can be used to filter wrong matching points. Figure 5a shows a feature point (red mark) on the upper catadioptric image, and Figure 5b shows the corresponding point on the lower catadioptric image along with the epipolar curve shown in blue. Figures 5c,d show the same information but for the unwrapped panoramic images. For a corresponding point in the lower camera to be correct, it must lie along the epipolar curve, as shown in the images.

Features Matching Results
We compared feature detection and matching using multiple methods. Table 3 shows the number of feature matches found with each method, in descending order, as well as the running time. We use MATLAB 2019b built-in functions for all the methods except SIFT, DeepMatching, and SuperGlue. For SIFT, we use the Machine Vision Toolbox from [54], and for DeepMatching and SuperGlue we use the original implementations [55,56]. For these last two methods, we report the inference time on an Intel Xeon E5-1630 CPU at 3.7 GHz and on a GTX 1080 GPU. Although these two methods are slow on CPU due to the computationally expensive nature of deep neural networks, the parallel GPU implementation can achieve running times comparable to the other CPU-based methods. Figures 6 and 7 show the feature matches for the first six methods in Table 3. From these images, we can see that DeepMatching has a significantly broader coverage and density at the expense of a higher computational cost. The second-best feature detector in terms of the number of features was the Harris corner detector; however, its features cover only the trees' borders rather than the entire image. KAZE and SuperGlue matches have more image coverage than Harris, but the features are sparse compared to DeepMatching. Table 3. Number of matches and running times obtained with each method. For the deep learning-based methods, we also report the running times obtained on a GTX 1080.

As described in Section 3.3.1, we filtered the DeepMatching results using epipolar constraints by keeping the matches whose distance from the epipolar curve is less than a defined threshold d. Figure 8 shows the number of DeepMatching features obtained with different filtering levels. The larger we make d, the more matches we get, but also the more error we allow into the reconstruction. Empirically, we found that a value of d between 20 and 30 pixels from the epipolar curve gives the best compromise between feature quantity and quality. The effects of d on the reconstruction are described in the next section.

3D Reconstruction Results
Once we have the matching points in both catadioptric cameras, the next step is to transform those points onto the mirrors. Figure 9a shows the Harris corners (the runner-up method in terms of the number of features) on the upper parabolic mirror (PMU) and the lower parabolic mirror (PML). Similarly, Figure 9b shows the DeepMatching points on each of the mirrors. As described in Section 4.3, DeepMatching provides broader coverage and density than all the other methods in Table 3 (see Figures 6 and 7), resulting in a more complete reconstruction. The third columns of Figures 6 and 7 show the reconstruction of an outdoor scenario with features obtained with the methods presented in Table 3. For this challenging outdoor environment, the reconstructions generated from the feature matches of all these methods are poor due to the lack of feature density and coverage, except for DeepMatching, whose features cover most of the image, as shown in Figure 7.
Although DeepMatching returns many feature matches, not all of them are correct. To fix this, we use the epipolar constraints described in Section 3.2. Figure 7a,b show the unfiltered DeepMatching results, with the corresponding 3D reconstruction in Figure 7c. Figure 7d,e show the DeepMatching results filtered with d = 30 pixels, and Figure 7f shows the reconstruction with these filtered feature points. Figure 10 shows the point clouds reconstructed using the deep feature matches at the different filtering levels. As the images show, when we relax the filter (when d is larger), we get more features but also more errors in the background. To quantify the reconstruction error, we reconstruct an object with known dimensions, in this case a rectangular pattern of size 50 cm × 230 cm. Figure 11 shows the results obtained with DeepMatching: Figure 11a,b show the unfiltered DeepMatching features, and Figure 11c shows the 3D reconstruction; Figure 11d,e show the matches filtered with d = 15, and Figure 11f shows the corresponding 3D reconstruction. From Figure 11, we see that filtering the features using epipolar constraints produces a cleaner reconstruction without compromising density. Using the rectangular pattern with known dimensions, we calculate the reconstruction error at the center of the pattern and at the extremes. The larger error at the right extreme correlates with the larger distortion at the periphery of the mirror. Table 4 shows the mean reconstruction error in millimeters and the standard deviation. To evaluate the 3D reconstructions qualitatively, we reconstructed three more objects, shown in Figure 12: a square box, a hat, and a clay pot. For the square box shown in Figure 12a, we computed the angles between the normal vectors of adjacent planes and compared them with 90°. The results are shown in Table 5, where we compare the angle errors with [4], achieving slightly better results with a single stereo pair.

Conclusions
We introduced a stereo catadioptric 3D reconstruction system capable of generating semi-dense reconstructions based on epipolar-constrained DeepMatching. The proposed method generates accurate 3D reconstructions of indoor and outdoor environments with significantly more matches than sparse methods, producing broader and denser 3D reconstructions while gracefully removing the incorrect correspondences produced by the DeepMatching algorithm. Our system's current hardware limitations are its large size and fragility, which make it unsuitable for field deployment. In terms of the method, although DeepMatching provides significantly more feature points than corner or feature point detectors, it is still relatively sparse compared to dense deep learning 3D reconstruction techniques, in exchange for faster and more accurate measurements. In future work, we plan to increase the reconstruction's density by combining the current approach with dense 3D reconstruction methods and to improve the system's size and robustness.