HOPIS: Hybrid Omnidirectional and Perspective Imaging System for Mobile Robots

In this paper, we present a framework for the hybrid omnidirectional and perspective robot vision system. Based on the hybrid imaging geometry, a generalized stereo approach is developed via the construction of virtual cameras. It is then used to rectify the hybrid image pair using the perspective projection model. The proposed method not only simplifies the computation of epipolar geometry for the hybrid imaging system, but also facilitates the stereo matching between the heterogeneous image formation. Experimental results for both the synthetic data and real scene images have demonstrated the feasibility of our approach.


Introduction
In the past few decades, the imaging geometry for perspective cameras has been well studied, first by the photogrammetry and then by the computer vision community. Multiple view relations between perspective cameras have also been established based on projective geometry [1]. Thus, recent research on geometric image formation is gradually moving toward the construction of catadioptric imaging models, which combine lenses and mirrors to increase the field of view [2,3]. The implementation with the so-called "omnidirectional camera" quickly gained popularity in mobile robot applications, mainly because of its 360° field of view [4,5]. Epipolar geometry and camera calibration for various types of catadioptric imaging models have also been extensively investigated [6,7,8]. Although an omnidirectional camera is capable of capturing images of extremely large scenes, it suffers from low and non-uniform image resolution.
To increase the applicability of omnidirectional cameras, many researchers have proposed camera networks consisting of catadioptric and perspective sensing devices [9]. These approaches combine the 360° field of view of omnidirectional cameras with the high-resolution imaging of conventional cameras [10]. They are thus widely adopted in visual servoing [11,12] and in mobile robot navigation [13,14] and localization [15,16] applications. This type of hybrid camera configuration can provide global and local vision simultaneously, but it also poses challenges for the imaging geometry and camera system calibration. One major issue is whether these two classes of cameras can be represented by the same imaging model, say the perspective projection model, which would greatly facilitate the succeeding tasks, such as correspondence matching and 3D scene reconstruction. Unfortunately, there is still no single camera projection model that can be used to represent these heterogeneous imaging systems.
Due to the reflective nature of the omnidirectional cameras, the central catadioptric projection has to be processed by a two-step mapping via a sphere. It is thus not possible to have a unique and undistorted perspective image to represent the scene captured by an omnidirectional camera. Consequently, a concise computational model for multiple view geometry is not available for the hybrid omnidirectional and perspective configuration. The important geometric properties, such as the fundamental matrix, cannot be directly calculated in the 2D image space. In other words, the hybrid fundamental matrix for the perspective and omnidirectional camera pair cannot be directly estimated, even if the two cameras are fully calibrated.
To deal with the problem of mixing catadioptric and perspective cameras, Sturm analyzed the relationship between the multi-view images captured by para-catadioptric, perspective or affine cameras [17]. The fundamental matrices and planar homographies were derived in the lifted surface [18]. Chen et al. presented a three-step approach for hybrid camera network calibration [19]. The catadioptric camera was first calibrated using vanishing points; the perspective camera calibration was then carried out based on several derived 3D points. Cagnoni et al. developed a hybrid omnidirectional-pinhole sensor and derived the relationship between the two cameras using a surrounding calibration pattern box [20]. However, no theoretic imaging formation of the mixed camera model was addressed in their work. Chen and Yang proposed a homography-based image registration technique for the perspective and omnidirectional views [10]. Under the planar surface assumption, the image correspondences can be obtained without camera calibration. A similar technique was also developed by Adorni et al. using the inverse perspective transform between the omnidirectional and perspective images [21].
In this work, we are interested in the development of a hybrid omnidirectional and perspective imaging system (HOPIS) for mobile robot applications. In addition to the construction of the hybrid imaging geometry, we also present a unifying model to reduce the computational complexity of the hybrid camera system. Our objective is not to formulate the fundamental matrix of mixed view pairs from catadioptric and perspective cameras, but to study the generalized stereo matching for mixtures of different central projection systems. Consequently, stereo matching can be carried out using available techniques with rectified standard stereo image pairs.
The concept of the virtual image plane is introduced to simplify the imaging relations between the conventional and omnidirectional cameras. Unlike previous work, such as [22,23], where the virtual images were generated via a straightforward image warping technique without taking the camera parameters into account, we propose virtual imaging formation based on the perspective projection model. The 3D reconstruction is thus feasible using the hybrid image pair. The proposed method not only establishes the epipolar geometry of the hybrid imaging system, but also facilitates the stereo matching between the heterogeneous image formation. We show that, for the non-degenerate cases, image rectification for any given perspective viewpoint is always possible from the catadioptric images. Thus, a generalized stereo image pair can be constructed from the hybrid of omnidirectional and perspective imaging.
The rest of this paper is organized as follows. Section 2 describes the imaging models of the perspective and catadioptric cameras. The generalized stereo of the hybrid imaging system via the virtual image plane is described in Section 3. In Section 4, we present the calibration technique for the perspective and catadioptric cameras. Experimental results are provided in Section 5, followed by the performance analysis of the system in Section 6. Finally, Section 7 concludes the paper and discusses some possible directions for future work.

Hybrid Omnidirectional and Perspective Imaging System
The proposed hybrid imaging system (HOPIS) consists of a conventional camera and a catadioptric camera with a hyperboloidal mirror, as shown in Figure 1.
These two types of cameras possess the property of single viewpoint projection, which is an essential condition for a unifying geometric representation. In this section, we describe the imaging model and calibration of both cameras, followed by the point correspondence relation between the two cameras.

Perspective Camera Model
For a perspective or pinhole camera, the relationship between a 3D point $\tilde{X} = (X, Y, Z)^\top$ and the corresponding 2D image point $\tilde{x} = (x, y)^\top$ can be written as:
$$x = P X \tag{1}$$
where $X$ and $x$ are the 3D and image points represented by a homogeneous four-vector and three-vector, respectively. The $3 \times 4$ homogeneous matrix $P$, which is unique up to a scale factor, is called the perspective projection matrix of the camera. For the purpose of camera dissection and calibration, the perspective projection matrix can be further decomposed into the intrinsic camera parameter matrix and the relative pose of the camera, i.e.,
$$P = K \, [\, R \mid t \,] \tag{2}$$
The $3 \times 3$ matrix $R$ and the $3 \times 1$ vector $t$ are the relative orientation and translation with respect to the world coordinate system, respectively. The intrinsic parameter matrix $K$ of the camera is a $3 \times 3$ matrix and is usually modeled as:
$$K = \begin{bmatrix} f_x & \gamma & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \tag{3}$$
where $(u_0, v_0)$ is the principal point of the camera (the intersection of the optical axis with the image plane), $\gamma$ is a skew parameter related to the characteristic of the CCD array and $f_x$ and $f_y$ are scale factors in the $x$ and $y$ directions of the image sensor.
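As a minimal numerical sketch of the perspective model above, the following Python snippet builds the projection matrix from an intrinsic matrix and a pose and projects one 3D point. The intrinsic values (scale factors of 500, principal point (160, 120), zero skew) and the function names are illustrative assumptions, not the calibrated parameters of the system.

```python
import numpy as np

def projection_matrix(K, R, t):
    """Build the 3x4 projection matrix P = K [R | t]."""
    return K @ np.hstack([R, np.asarray(t, dtype=float).reshape(3, 1)])

def project(P, X):
    """Project an inhomogeneous 3D point X to inhomogeneous pixel coordinates."""
    x_h = P @ np.append(np.asarray(X, dtype=float), 1.0)  # homogeneous three-vector
    return x_h[:2] / x_h[2]                               # remove the scale factor

# Hypothetical intrinsics: f_x = f_y = 500, zero skew, principal point (160, 120).
K = np.array([[500.0, 0.0, 160.0],
              [0.0, 500.0, 120.0],
              [0.0, 0.0, 1.0]])

# Camera frame aligned with the world frame (R = I, t = 0).
P = projection_matrix(K, np.eye(3), np.zeros(3))
x = project(P, [0.2, -0.1, 2.0])
print(x)
```

Because the matrix is only defined up to scale, multiplying `P` by any non-zero constant leaves the result of `project` unchanged.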

Omnidirectional Camera Model
The imaging models and calibration techniques for the omnidirectional cameras have been extensively investigated since panoramic image formation was introduced [24]. For central catadioptric cameras with a single viewpoint, the most popular imaging geometry is represented by a two-step projection via a unit sphere [25]. Based on this projection model, Barreto and Araujo use an image with at least three lines to calibrate the catadioptric cameras [26]. Ying and Hu present a calibration method using geometric invariants of lines and spheres [7]. Mei and Rives take the lens distortion into account and calibrate the omnidirectional camera using planar grids [27].
Using the unit sphere projection model, a 3D scene point $X$ is first projected to a point $X_s$ on the unit sphere located at the origin $O$ of the unifying catadioptric projection model. The point $X_s$ is then projected perspectively via a projection center $O_c$ located inside the unit sphere to the image plane of the lens camera. For a general catadioptric imaging system, the optical axis of the lens camera is aligned with the line determined by the two centers of projection, $O$ and $O_c$. Let the point $X_s$ be represented by $(X_s, Y_s, Z_s, 1)^\top$ in homogeneous coordinates; then the projection of the 3D scene point $X$ on the image plane is given, in homogeneous coordinates, by:
$$m = (X_s,\; Y_s,\; Z_s + \xi)^\top$$
where $\xi \in [0, 1]$ is the distance between the two projection centers, $O$ and $O_c$. The image point $x$ of the 3D scene point $X$ can finally be obtained by incorporating the internal camera projection matrix $K_c$, as in Equation (3), i.e.,
$$x = K_c \, m$$
In this unifying catadioptric projection model, the camera projection center and coordinate system are defined in terms of the unit sphere. The extrinsic parameters of the omnidirectional camera are the rotation matrix $R$ and the translation vector $t$ with respect to the world coordinate system. The intrinsic parameters include the effective focal length, image center and skew factor of the perspective projection. The implementation of omnidirectional camera calibration with both the intrinsic and extrinsic parameters can be found, for example, in [27,28]. The latter is used for our HOPIS calibration, as described in Section 4.
Figure 1 illustrates the point correspondence relation between the omnidirectional and perspective images. Suppose the centers of projection of the catadioptric and perspective cameras are $O_1$ and $O_2$, respectively. Given a 3D point $X$ viewable to both cameras, its projections to the image planes $\pi_1$ and $\pi_2$ are fully described by the intrinsic and extrinsic parameters of the hybrid imaging system.
Suppose a light ray from the 3D point $X$ reflected by the hyperbolic mirror of the catadioptric camera intersects the image plane $\pi_1$ at point $x$. The point correspondence $x'$ on the image plane $\pi_2$ can be modeled by a coordinate transformation $(R, t)$ from $O_1$ to $O_2$. If the focal lengths or sensor resolutions of the two cameras differ, then an additional internal transformation for the perspective camera needs to be carried out. Thus, the 3D point $X$ can be uniquely determined by the point correspondences $x$ and $x'$.
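The two-step unit sphere projection described in this section can be sketched as follows. This is a minimal illustration, assuming the scene point is already expressed in the sphere-centered frame and that the second projection center lies at (0, 0, -ξ) on the optical axis; the intrinsic values and the function name are hypothetical.

```python
import numpy as np

def project_catadioptric(X, K_c, xi):
    """Project a 3D point X (sphere-centered frame) to omnidirectional pixel coordinates."""
    X = np.asarray(X, dtype=float)
    X_s = X / np.linalg.norm(X)                     # step 1: onto the unit sphere
    m = np.array([X_s[0], X_s[1], X_s[2] + xi])     # step 2: as seen from O_c = (0, 0, -xi)
    x_h = K_c @ m                                   # apply the intrinsic matrix K_c
    return x_h[:2] / x_h[2]

# Hypothetical intrinsics for a 1024 x 768 omnidirectional image.
K_c = np.array([[300.0, 0.0, 512.0],
                [0.0, 300.0, 384.0],
                [0.0, 0.0, 1.0]])

x_axis = project_catadioptric([0.0, 0.0, 5.0], K_c, xi=1.0)  # point on the optical axis
x_off = project_catadioptric([3.0, 0.0, 4.0], K_c, xi=1.0)   # point off the axis
print(x_axis, x_off)
```

A point on the optical axis lands on the principal point regardless of ξ, which is a quick sanity check for any implementation of this model.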

Generalized Stereo Model via Virtual Image Plane
To simplify the 3D reconstruction formulation for a pair of hybrid omnidirectional and perspective images, a generalized stereo model using the concept of the virtual image plane is proposed. The objective is to rectify the hybrid image pairs to form the fronto-parallel stereo ones, which possess the property of parallel epipolar lines. For the proposed HOPIS configuration, as shown in Figure 2, a virtual camera is constructed associated with the omnidirectional camera, such that the optical axis is perpendicular to the baseline between the catadioptric and perspective cameras. By warping and transforming the omnidirectional and perspective images to the common virtual image plane, a rectified stereo image pair can be derived.

Construction of Virtual Cameras
For a single-viewpoint catadioptric imaging system, the effective viewpoint is located at the focus of the quadric surface behind the reflection mirror. Let $O$ be the sphere center of the unifying catadioptric projection model and $O_2$ be the projection center of the perspective camera. To rectify the hybrid omnidirectional and perspective image pair for stereo matching, the virtual cameras with the same effective viewpoints as the hybrid imaging system can be constructed with the stereo baseline $OO_2$ as follows.
Let the coordinate system $O$ be the common reference frame of the hybrid camera system, and let the orientation and translation of the perspective camera $O_2$ be $R$ and $t$, respectively. Rotate the perspective camera, such that the image scanlines (i.e., along the y-axis) are parallel to the translation vector $t$. The associated rotation matrix $R'$ is then applied to the effective catadioptric viewpoint $O$ to create a virtual camera with the same focal length and orientation as the perspective camera $O_2$. To summarize, the image rectification for the hybrid imaging system involves the rotation transformations $R'$ and $R^{-1} R'$ for the viewpoints $O$ and $O_2$, respectively. Finally, the common focal length of the cameras can be adjusted to increase the overlapping scene of the rectified image pair.
It should be noted that some configurations are not physically realizable due to the non-overlapping scenes captured by the perspective and omnidirectional cameras. For example, the combination of an upward perspective camera and a downward omnidirectional camera with coincident optical axes does not have a common field of view. If we consider the special case that $O_2$ lies on the line determined by $O$ and $O_1$, both rotation transformations $R$ and $R'$ are identity matrices. In this case, only the virtual perspective image for the catadioptric camera has to be constructed, and no additional image rectification has to be carried out.
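One way to obtain a rotation that aligns the image y-axis with the translation vector is sketched below. This is a common rectification construction and only an assumption about how the rotation could be built; the function name and the choice of keeping the new optical axis as close as possible to a reference axis are illustrative, not taken from the method above.

```python
import numpy as np

def rectifying_rotation(t, z_ref=(0.0, 0.0, 1.0)):
    """Return a rotation whose rows are the new camera axes, with the y-axis along t."""
    t = np.asarray(t, dtype=float)
    e_y = t / np.linalg.norm(t)                         # new y-axis: baseline direction
    e_x = np.cross(e_y, np.asarray(z_ref, dtype=float))
    n = np.linalg.norm(e_x)
    if n < 1e-9:
        # Baseline parallel to the reference optical axis: this construction
        # is undefined and the case has to be handled separately.
        raise ValueError("baseline is parallel to the reference optical axis")
    e_x /= n                                            # new x-axis, orthogonal to the baseline
    e_z = np.cross(e_x, e_y)                            # new optical axis, orthogonal to the baseline
    return np.vstack([e_x, e_y, e_z])

t = np.array([0.1, 1.0, 0.2])          # hypothetical baseline vector
R_prime = rectifying_rotation(t)
print(R_prime @ (t / np.linalg.norm(t)))
```

With this rotation the new optical axis is perpendicular to the baseline, which is the condition required for a rectified stereo pair.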

Virtual Image Generation
Image rectification for the perspective camera can be performed with a straightforward linear warping technique provided that the intrinsic and extrinsic parameters of the camera are available [29]. The generation of virtual perspective images from an omnidirectional image, however, is not a simple one-to-one linear mapping. As described in Section 2.2, the catadioptric image formation can be modeled by a unifying projection with a two-step linear mapping via a unit sphere. Thus, the virtual images can be synthesized by back-projecting the rays from the omnidirectional image.
As shown in Figure 3, the center of the unit sphere $O$ is the effective viewpoint of the catadioptric camera. For image rectification with the rotation matrix $R'$, the projection matrix of the virtual camera is:
$$P_v = K_v R' \, [\, I \mid 0 \,]$$
where $K_v$ is the intrinsic parameter matrix. The projection of a 3D scene point $X$ to the image point $x$ on the virtual image plane is given by:
$$x = K_v R' \, [\, I \mid 0 \,] \, X$$
or:
$$x = K_v R' \tilde{X}$$
where $\tilde{X}$ is the inhomogeneous representation of $X$. Thus, up to a scale factor $\lambda$, we have:
$$\tilde{X} = \lambda \, R'^{\top} K_v^{-1} x$$
and:
$$\tilde{X}_s = \frac{\tilde{X}}{\|\tilde{X}\|}$$
where $\tilde{X}_s$ can also be represented by:
$$\tilde{X}_s = \frac{R'^{\top} K_v^{-1} x}{\| R'^{\top} K_v^{-1} x \|}$$
Now, the mapping of $\tilde{X}_s$ onto the omnidirectional image plane via the projection center $O_c$ is given by:
$$x_c = A_c R_c \, (\tilde{X}_s - O_c) \tag{12}$$
where $A_c$ and $R_c$ are the camera matrix and the extrinsic orientation of the perspective projection, respectively. Since both $A_c$ and $R_c$ are the intrinsic parameters of the catadioptric camera, Equation (12) can be rewritten as:
$$x_c = K_c \, (\tilde{X}_s - O_c)$$
or:
$$x_c = K_c \left( \frac{R'^{\top} K_v^{-1} x}{\| R'^{\top} K_v^{-1} x \|} + \xi e_3 \right) \tag{13}$$
where $e_3 = (0, 0, 1)^\top$, $K_c = A_c R_c$ and $O_c = (0, 0, -\xi)^\top$. Equation (13) establishes the one-to-one correspondence between $x$ and $x_c$. Thus, the virtual image with given camera matrix $K_v$ and orientation $R'$ can be synthesized from the omnidirectional image. Furthermore, the rectified image can then be used with the perspective camera $O_2$, as described in Section 2.3, for stereo matching and 3D reconstruction.
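The virtual image synthesis can be sketched as a backward mapping: each virtual pixel is back-projected through the virtual camera, normalized onto the unit sphere, shifted by ξ along the optical axis and mapped by the catadioptric intrinsic matrix to a pixel of the omnidirectional image. The sketch below assumes this form of the mapping, uses nearest-neighbor sampling for brevity, and all function names and numerical values are hypothetical.

```python
import numpy as np

E3 = np.array([0.0, 0.0, 1.0])

def virtual_to_omni(x, K_v, R_prime, K_c, xi):
    """Map a virtual-image pixel (u, v) to omnidirectional pixel coordinates, or None."""
    ray = R_prime.T @ np.linalg.solve(K_v, np.array([x[0], x[1], 1.0]))
    X_s = ray / np.linalg.norm(ray)          # point on the unit sphere
    x_h = K_c @ (X_s + xi * E3)              # view from O_c = (0, 0, -xi), then intrinsics
    if abs(x_h[2]) < 1e-12:                  # ray has no finite image
        return None
    return x_h[:2] / x_h[2]

def synthesize_virtual_image(omni, K_v, R_prime, K_c, xi, shape):
    """Fill a virtual image of the given (height, width) by sampling the omnidirectional image."""
    h, w = shape
    out = np.zeros((h, w) + omni.shape[2:], dtype=omni.dtype)
    for v in range(h):
        for u in range(w):
            p = virtual_to_omni((u, v), K_v, R_prime, K_c, xi)
            if p is None:
                continue
            j, i = int(round(p[0])), int(round(p[1]))   # column, row (nearest neighbor)
            if 0 <= i < omni.shape[0] and 0 <= j < omni.shape[1]:
                out[v, u] = omni[i, j]
    return out

# Tiny synthetic example with hypothetical parameters.
omni = np.arange(400, dtype=np.int32).reshape(20, 20)
K_c = np.array([[5.0, 0.0, 10.0], [0.0, 5.0, 10.0], [0.0, 0.0, 1.0]])
K_v = np.array([[4.0, 0.0, 2.0], [0.0, 4.0, 2.0], [0.0, 0.0, 1.0]])
virtual = synthesize_virtual_image(omni, K_v, np.eye(3), K_c, xi=0.8, shape=(5, 5))
```

Backward mapping is chosen here because every virtual pixel receives a value; in practice bilinear interpolation would usually replace the nearest-neighbor lookup.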

Triangulation from the Hybrid Image Pair
The image formation of the hybrid camera system consists of the projections of a 3D scene point to the conventional and omnidirectional cameras. Suppose that the extrinsic parameters of the perspective camera and the virtual camera generated from the catadioptric camera in the world coordinate system are $(R_p, t_p)$ and $(R_v, t_v)$, respectively. Then, the transformation $(R, t)$ between the two camera coordinate frames is given by:
$$R = R_v R_p^{\top}, \qquad t = t_p - R^{\top} t_v$$
so that a point with coordinates $X_v$ in the virtual camera frame has coordinates $R^{\top} X_v + t$ in the perspective camera frame. The projection relations of 3D scene points to both cameras can be derived as follows. Let $x_p$ and $x_v$ be the projections of a 3D point $X$ on the image planes of the perspective camera and the virtual camera associated with the catadioptric system, respectively. Because of projection errors, the two back-projected rays generally do not intersect, so $X$ is taken as the midpoint of the line segment perpendicular to both rays, which are back-projected from the two camera centers through the image points $x_p$ and $x_v$, respectively. Given the rotation and translation between the perspective and omnidirectional cameras, these two rays are represented by $a x_p$ and $t + b R^{\top} x_v$, where $a, b \in \mathbb{R}$, and the 3D scene point $X$ can be derived by solving a system of linear equations [30].
Note that the above triangulation is not based on image rectification and can be performed with any relative orientation and translation $(R, t)$ between the two cameras. Suppose that a correspondence $(x_p, x_o)$ between the perspective and omnidirectional images is identified; it is then possible to choose a suitable rotation matrix $R_v$ for the construction of a virtual image from the catadioptric camera. Thus, a vergence stereo configuration can be easily achieved to increase the overlapping region observed from the perspective and virtual cameras.
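The midpoint triangulation can be sketched as a small least-squares problem. The snippet assumes that the image points are given in normalized coordinates (the intrinsic matrices already removed), that the extrinsic parameters map world coordinates to camera coordinates, and that a virtual-frame point X_v has perspective-frame coordinates Rᵀ X_v + t; the function names and numerical values are hypothetical.

```python
import numpy as np

def relative_pose(R_p, t_p, R_v, t_v):
    """Relative pose (R, t) such that X_p = R.T @ X_v + t (assumed convention)."""
    R = R_v @ R_p.T
    t = t_p - R.T @ t_v
    return R, t

def triangulate_midpoint(x_p, x_v, R, t):
    """Midpoint of the shortest segment between the rays a*x_p and t + b*R.T@x_v."""
    d1 = np.asarray(x_p, dtype=float)              # ray direction from the perspective center
    d2 = R.T @ np.asarray(x_v, dtype=float)        # ray direction from the virtual center
    A = np.column_stack([d1, -d2])                 # a*d1 - b*d2 = t in the least-squares sense
    (a, b), *_ = np.linalg.lstsq(A, t, rcond=None)
    return 0.5 * (a * d1 + (t + b * d2))

# Synthetic check with hypothetical extrinsic parameters.
c, s = np.cos(0.1), np.sin(0.1)
R_p, t_p = np.eye(3), np.zeros(3)
R_v = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t_v = np.array([0.0, -1.0, 0.1])

X_w = np.array([0.5, 0.2, 3.0])
X_p = R_p @ X_w + t_p
X_v = R_v @ X_w + t_v
x_p, x_v = X_p / X_p[2], X_v / X_v[2]              # normalized image points

R, t = relative_pose(R_p, t_p, R_v, t_v)
X_hat = triangulate_midpoint(x_p, x_v, R, t)
print(X_hat)
```

With noise-free correspondences the two rays intersect and the midpoint coincides with the true point; with noisy correspondences it returns the point halfway between the closest points on the two rays.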

HOPIS Calibration
Camera calibration for the proposed hybrid omnidirectional and perspective imaging system derives the intrinsic parameters of both cameras and the relative pose between the cameras. For the initial system calibration, a checkerboard pattern is placed at a location viewable to both cameras. Tsai's method is carried out to estimate the camera matrix $K_p$ and extrinsic parameters $(R_p, t_p)$ of the perspective camera [31]. The orientation and position of the camera are calculated relative to the world coordinate frame set on the calibration pattern.
To calibrate the omnidirectional camera, the technique presented by Mei and Rives is adopted [27]. Several omnidirectional images captured with different positions and orientations of the checkerboard pattern are used to estimate the intrinsic parameter matrix $K_c$ and the relative poses with respect to the different pattern coordinate frames. The exterior orientation and translation of the camera coordinate system, $(R_o, t_o)$, is the one relative to the same calibration pattern used for the perspective camera. Note that the origin of the omnidirectional camera coordinate system is the sphere center of the unifying projection model.

Rotation and Translation between the Cameras
In the initial system calibration, the rotation and translation between the omnidirectional and perspective cameras can be obtained through the common world coordinate frame. This relative orientation and position within the hybrid camera system, however, might not be constant over time for some applications. An active HOPIS configuration with a PTZ camera can be used for surveillance, robot navigation and human computer interaction, etc. In these cases, auto-calibration for orientation update has practical uses and is highly desirable.
From the decomposition of the essential matrix:
$$E = [t]_{\times} R \tag{14}$$
where $[t]_{\times}$ denotes the skew-symmetric matrix of $t$, the relative orientation and the direction of translation between a pair of perspective cameras can be derived [32]. Thus, if both coordinate frames of the constructed virtual camera and the associated catadioptric camera are aligned, the rotation and translation $(R, t)$ can be obtained by computing the essential matrix. Now, suppose the intrinsic parameter matrices $K_v$ and $K_p$ of the virtual and perspective cameras are available from the initial calibration. Then, from the relation:
$$E = K_p^{\top} F K_v \tag{15}$$
where $F$ is the fundamental matrix of the stereo image pair, the essential matrix can be derived from eight image point correspondences [33]. From Equations (14) and (15), the orientation and translation can be derived by:
$$[t]_{\times} R = K_p^{\top} F K_v$$
using the point correspondences of the two perspective images. In the hybrid stereo image pair, however, only the correspondences between the omnidirectional and perspective images are directly accessible. Suppose a point correspondence $x_c \leftrightarrow x'$ is identified, where $x_c$ and $x'$ belong to the omnidirectional and perspective images, respectively. The corresponding point $x$ on the virtual camera can be derived from Equation (13) with $R' = I$, i.e.,
$$x_c = K_c \left( \frac{K_v^{-1} x}{\| K_v^{-1} x \|} + \xi e_3 \right)$$
Thus, the rotation and translation $(R, t)$ are obtained up to a scale, provided that all intrinsic camera matrices are calibrated. The unknown scale factor for the translation $t$ can be determined using a fixed distance between two 3D scene points.
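The pose update from the essential matrix can be sketched with the standard SVD-based decomposition. The snippet assumes the convention x'ᵀ F x = 0 with x on the virtual image and x' on the perspective image, so that E = K_pᵀ F K_v; it returns the four candidate poses, and the cheirality test that selects the physically valid one (points in front of both cameras) is omitted. Function names and numerical values are hypothetical.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x such that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_from_fundamental(F, K_p, K_v):
    """E = K_p^T F K_v under the assumed convention x'^T F x = 0."""
    return K_p.T @ F @ K_v

def decompose_essential(E):
    """Return the four candidate (R, t) pairs with E ~ [t]_x R and unit-norm t."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:      # E is defined up to sign, so flipping is allowed
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]                   # translation direction, known only up to sign
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

# Synthetic check: build E from a known pose and recover it.
c, s = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
t_true = np.array([1.0, 0.2, -0.1])
t_true /= np.linalg.norm(t_true)
E = skew(t_true) @ R_true
candidates = decompose_essential(E)
```

Only the direction of the translation is recovered, which matches the remark above that the scale has to be fixed from a known distance between two scene points.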

Feature Matching for the Hybrid Image Pair
To find the point correspondences between the omnidirectional and perspective images, the SIFT descriptor is adopted for feature matching [34]. Since unwarping the omnidirectional image to a panoramic form generally increases the searching range, the correspondence matching is carried out on the original hybrid stereo image pair. The search region is further reduced with the knowledge of an approximate orientation of the perspective camera. For example, the searching range on the omnidirectional image can generally be restricted to less than a quarter of the image, depending on the field-of-view of the perspective camera. An example of the feature correspondence matching between the omnidirectional and perspective images is illustrated in Figure 4. The result shows that the feature matching on the virtual-perspective image pair is better than the direct matching on the omnidirectional-perspective image pair.
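The restriction of the search region on the omnidirectional image can be sketched as an angular sector test around the image center. This is only an illustration of the idea, assuming the approximate heading of the perspective camera is known as an azimuth angle in the omnidirectional image; the function name, the image center and the sector width are hypothetical.

```python
import numpy as np

def in_search_sector(points, center, heading, half_width):
    """Mask of keypoints whose azimuth about the image center is within heading +/- half_width (radians)."""
    d = np.asarray(points, dtype=float) - np.asarray(center, dtype=float)
    azimuth = np.arctan2(d[:, 1], d[:, 0])
    diff = np.angle(np.exp(1j * (azimuth - heading)))   # wrap the difference to (-pi, pi]
    return np.abs(diff) <= half_width

# Four hypothetical keypoints at azimuths 0, 90, 180 and -90 degrees.
center = (512.0, 384.0)
keypoints = np.array([[612.0, 384.0],
                      [512.0, 484.0],
                      [412.0, 384.0],
                      [512.0, 284.0]])

# A half-width of pi/4 keeps a quarter of the omnidirectional image.
mask = in_search_sector(keypoints, center, heading=0.0, half_width=np.pi / 4)
print(mask)
```

Only the keypoints selected by the mask would then be passed to the descriptor matching step, which reduces both the number of candidate matches and the computation time.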

Experiments
We have performed a number of experiments to assess the effectiveness of the proposed generalized stereo model for the hybrid imaging system. The hybrid camera system, which consists of a Watec-221s analog camera and a SONY DFW-X710 digital camera with an attached hyperbolic mirror, is mounted on a mobile robot, as shown in Figure 5. The image resolutions of the cameras are 320 × 240 and 1024 × 768, respectively. All images were acquired with white-balancing and auto-exposure. The perspective camera equipped with a Tamron 12VM612T lens provided a field of view of 30.4° × 23.1°. The intrinsic and extrinsic parameters of the camera system are obtained as described in Section 4. Figure 6 shows the captured omnidirectional and perspective images used for system calibration. The experimental environment and the acquired hybrid stereo image pair are shown in Figures 7 and 8, respectively.
In developing the generalized stereo model by constructing a virtual image from the omnidirectional image, the algorithm described in Section 3 is carried out. Since the field of view of the virtual camera can be set arbitrarily for any fixed focal length, the virtual image can be created with different image sensor sizes. As shown in Figure 5, the perspective camera is placed above the catadioptric camera in our HOPIS configuration. Thus, the resolution of the virtual image is set as 320 × 480, with the extension in the vertical direction to increase the overlap with the perspective image. Figure 9 shows the virtual image generated from the omnidirectional image (see Figure 8b) with the same viewing direction as the perspective image (see Figure 8a).
Figure 9. The virtual image generated from the omnidirectional image, Figure 8b. The image resolution is set as 320 × 480 for large overlap with the perspective image, Figure 8a.
Figure 10 illustrates the epipolar geometry of the hybrid omnidirectional and perspective imaging system. Consider the case that the virtual camera is constructed to have the same orientation as the perspective camera. Figure 10a illustrates the correspondence matching between the perspective image and the virtual image derived from the omnidirectional image. The rotation between the two cameras estimated by these point correspondences is very close to the calibration result. If their optical axes are not perpendicular to the stereo baseline, then the epipoles will lie on the image planes, as illustrated in Figure 10b. Although the epipolar lines can be easily derived from the epipolar geometry of a stereo rig, they correspond to the quadric curves in the omnidirectional image, as shown in Figure 10c. Thus, the stereo matching is greatly simplified on the virtual and perspective image pair, even if they are not rectified to the fronto-parallel configuration.
Figure 10. The epipolar geometry of the hybrid omnidirectional and perspective image pair. Figure 10a illustrates the SIFT correspondences between the perspective and virtual images. The epipolar line pairs between the perspective and virtual images are shown in Figure 10b. The corresponding epipolar lines on the omnidirectional image are illustrated in Figure 10c.
Figure 11 illustrates the matching results on the perspective and omnidirectional image pair. A specific region of interest captured by the perspective camera (as shown on the top) is used for stereo matching and depth reconstruction. The depth map is then transferred to the common region in the omnidirectional view for global depth perception. As shown on the bottom of Figure 11, the misalignment on the region of interest is due to the parallax between the omnidirectional and perspective cameras. Figure 12 shows another experiment with the input hybrid stereo image pair, epipolar geometry, image rectification and disparity maps (see the captions for more details).

Evaluation on Feature Matching
Similar to the conventional stereo vision systems, it is important to understand the performance of correspondence matching with respect to various camera parameter settings of the proposed hybrid imaging system; more specifically, given a fixed translation and orientation between the omnidirectional and perspective cameras, how to achieve better feature detection and matching results by adjusting their focal lengths and image resolutions. In this work, the performance evaluation on correspondence matching is based on the detection of SIFT features in both cameras. While the intrinsic parameters can be directly changed for the perspective camera, those associated with the omnidirectional camera are only accessible in terms of the virtual images.
To evaluate the effect of focal length on the correspondence matching, synthetic images are generated with various focal lengths for both the perspective and virtual cameras. The detection and matching of the SIFT features for different focal lengths are tabulated in Table 1. Table 2 tabulates the correspondence matching of the SIFT features for various focal lengths and image resolutions of the virtual camera. The same focal length is used for both the perspective and virtual cameras. For each image resolution and focal length, the number of correspondences is calculated by averaging the results from ten captured and generated stereo image pairs. It is clear that the number of detected features and the processing time increase with the image resolution. However, as shown in the table, higher resolution does not guarantee better correspondence matching results. This is mainly due to the severe image distortion of high resolution warping from the omnidirectional image.
For the perspective camera, the number of detected features increases with the focal length, mainly due to the zoom-in effect on textured patterns. The virtual images, on the other hand, have steady feature extraction results for all focal lengths, since they are all synthesized from the same omnidirectional image. For the correspondence matching between the perspective and virtual images, the same focal length settings for both cameras generally provide the best matching results, as shown on the diagonal of the table.
Figure 13. Some test images used for performance comparison, as shown in Table 3.
It is well known that the feature matching between two heterogeneous image pairs is a difficult task, even using scale-invariant feature descriptors. Thus, it is interesting to investigate the improvement on the feature matching results if the virtual image plane is introduced. We have examined the averages of matched features, mismatched features, error matching rate and computation time of the omnidirectional-perspective and the proposed virtual-perspective image pairs. The performance comparison of SIFT matching results using five indoor and two outdoor scenes (three of them are shown in Figure 13) is presented in Table 3. It can be seen that the error matching rate and computation time of the proposed technique are both much lower than those of the conventional full range search between the perspective and omnidirectional image pair.
Table 3. Comparison of the SIFT feature correspondences between the direct matching of the omnidirectional-perspective and the proposed virtual-perspective image pairs. The execution time is in seconds.

Conclusion
In this work, we have presented a generalized stereo approach for the hybrid imaging system consisting of a conventional and an omnidirectional camera. The proposed technique provides the robot vision system with the capability of omnidirectional surveillance and 3D reconstruction. It can be used for mobile robot applications, such as obstacle detection, with the derived 3D information and vision-guided navigation using the omnidirectional images [35]. The imaging formation of the hybrid camera system is formulated using a unifying projection model. The epipolar geometry of the hybrid omnidirectional and perspective image pair is simplified by the mapping via a virtual camera. With image rectification and reprojection, stereo matching between the heterogeneous images can be carried out using available techniques for standard image pairs. Thus, our approach is suited for depth recovery using the hybrid omnidirectional and perspective camera system. It is also possible to replace the conventional camera with a PTZ (pan-tilt-zoom) camera, making the region for depth recovery more flexible [36]. The experimental results are presented for both the simulated data and real scene images.