Extrinsic Calibration of Camera Networks Using a Sphere

In this paper, we propose a novel extrinsic calibration method for camera networks using a sphere as the calibration object. First of all, we propose an easy and accurate method to estimate the 3D positions of the sphere center w.r.t. the local camera coordinate system. Then, we propose to use orthogonal procrustes analysis to pairwise estimate the initial camera relative extrinsic parameters based on the aforementioned estimation of 3D positions. Finally, an optimization routine is applied to jointly refine the extrinsic parameters for all cameras. Compared to existing sphere-based 3D position estimators which need to trace and analyse the outline of the sphere projection in the image, the proposed method requires only very simple image processing: estimating the area and the center of mass of the sphere projection. Our results demonstrate that we can get a more accurate estimate of the extrinsic parameters compared to other sphere-based methods. While existing state-of-the-art calibration methods use point like features and epipolar geometry, the proposed method uses the sphere-based 3D position estimate. This results in simpler computations and a more flexible and accurate calibration method. Experimental results show that the proposed approach is accurate, robust, flexible and easy to use.


Introduction
Camera networks with common views have many applications in computer vision such as people tracking and 3D reconstruction. Extrinsic and intrinsic calibration is essential in these applications to ensure the geometric accuracy of 3D scene analysis. The intrinsic calibration parameters define the imaging geometry and the optical characteristics of each camera individually. The extrinsic parameters relate the orientation and the position of the cameras w.r.t. each other. While intrinsic parameters usually need to be estimated once for a given camera (unless the camera has a variable focal length), the extrinsic parameters must be recomputed whenever cameras are moved or reoriented (on purpose or accidentally). Also, intrinsic calibration can be performed in the lab, before deploying the cameras, whereas extrinsic calibration is only possible after deployment. Especially in the case of temporary installations, e.g., to capture public events, the available time for calibration at deployment time is often limited and an efficient and reliable procedure is needed.
In this paper, we assume that the intrinsic parameters are known and we focus on extrinsic calibration. Most existing extrinsic calibration methods first find distances to calibration objects by locating image features from different cameras. These features correspond to the same physical location on an object. Then the distances to the camera of these object points are computed using epipolar geometry from the correspondences. Finally an optimization routine simultaneously refines these estimates and computes the best matching calibration parameters. The most popular methods for the optimization step are generic non-linear optimisation techniques, or algebraic techniques based on the rank-4 factorization of an "observation matrices" [1]. All of these methods tend to get stuck in local optima unless they are initialised with good estimates for the distances to the cameras.
A general problem is that robustly finding correspondences is not trivial because features may be confused with other features, resulting in outliers; some features may not have correspondences due to occlusions. In practice, occlusions of points on objects are difficult to avoid, e.g., a checker board pattern cannot be seen well from directions parallel to the checker board (or from behind the checker board).
In this paper, we propose a novel extrinsic calibration method using a luminescent sphere as the calibration object. Spheres have several advantages over the typical point-like calibration objects: • Spheres have consistent appearance when observed from an arbitrary direction. Therefore the center of the sphere can be used as a feature point, rather than specific points on the calibration object. This means use of spheres can improve locating image features which will benefit classical point-like features-based method. While the sphere center is not visible directly, it can be estimated from an image, and correspondences between camera views can be found even if the cameras observe different sides of the sphere. In principle this is even possible for partially occluded spheres (but we will not treat that case in this paper). • As the sphere is a large and easy recognizable object, finding corresponding spheres in various views is quite easy, even in images of low resolution. As the sphere is luminescent, simple intensity thresholding suffices to locate the spheres in the images. • Assuming the sphere size is known, it is possible to estimate the distance to a specific camera, solely based on the observed image, i.e., without using information from other cameras. This means we can avoid using epipolar geometry to determine these distances, which are needed in the optimization method.
As we can estimate the 3D position of the sphere center w.r.t. the local camera coordinate system, we propose to use a less known method [2] which is based on orthogonal procrustes analysis to get the initial estimate of pairwise extrinsic parameters. Then, we propose to apply an optimization method which was developed by Bouguet [3], to jointly refine the extrinsic parameters for all cameras. The optimization step minimizes the total reprojection error (see Section 6) of calibration samples over all the extrinsic parameters.
The first contribution of the paper is that we propose a simple and accurate method to estimate the 3D position of the sphere center w.r.t. the local camera coordinate system. Our method requires only the estimate of the center of the projected sphere and its area. It is simpler and more robust against image segmentation error than other sphere-based calibration methods which process the sphere images by tracing the contour of the elliptical projection of the sphere.
In Section 7.2, we show that our method estimates the 3D positions of sphere centers w.r.t. the camera coordinate system more accurately than the sphere based method [4], which is one of the few methods designed for a similar purpose. Specifically the distance error of our method is smaller than the error of the sphere based method [4].
While the sphere locations are only used as intermediate results, a higher accuracy in these sphere locations still leads to a more accurate and stable estimation of extrinsic parameters. Specifically in Section 7.2 we will show that our method provides more accurate scale information. What is more, the projection, the triangulation and the reprojection error of our method are smaller than the error of the sphere-based method [4].
A second contribution is that we experimentally show that the orthogonal procrustes analysis achieves more accurate and stable calibration compared to the rank-4 factorization method. Specifically, in Section 7.3 we will show that the projection, the triangulation and the reprojection error of orthogonal procrustes are always smaller than the rank-4 factorization method. Moreover, the rank-4 factorization method requires at least 4 non-coplanar training samples, while orthogonal procrustes approach just needs 3 non-collinear training samples.
The final contribution is that we compare our algorithm with a popular reference calibration method of Svoboda et al. [5] and show that our method outperforms. Their method firstly estimates the projective depth using the method of Sturm et al. [1], which exploits epipolar geometry. Then, it computes projective structures via rank-4 factorization and does Euclidean stratification based on the concept of the absolute conic [6,7]. Finally, it applies the bundle adjustment [8] to refine the calibration. Their method requires at least 8 training samples (non-coplanar) and 3 cameras, which makes it not applicable to calibration of a camera pair. Moreover, they cannot provide scale information without metric measurements.
In Section 7.3, we will show that the projection and the triangulation error of our method are always smaller than the errors in the method of Svoboda. When considering reprojection error, the error of our method is slightly larger (3%) than the error of Svoboda's method. Moreover, we will also show that the image of three sphere locations suffices for accurate calibration. While the method of Svoboda et al. requires at least eight distinct non-coplannar positions. An additional benefit of our approach is that it can extrinsically calibrate two cameras at a time, whereas the method of Svoboda requires at least three cameras.
The remainder of this paper is organized as follows. Section 2 gives a survey of works in the literature for extrinsic calibration. Section 3 provides details about intrinsic and extrinsic parameters. Section 4 explains how to estimate 3D positions of sphere centers w.r.t. the camera coordinate system. Section 5 discusses the actual extrinsic calibration procedure based on the aforementioned 3D positions' estimation. In Section 6, we explain three criteria for accuracy evaluation of camera calibration techniques. Section 7 shows the experimental results. Finally, Section 8 concludes the paper.

Related Work
In recent years, extrinsic calibration of camera networks was well studied both in the photogrammetric and the computer vision community.
Many calibration methods have been proposed to do both intrinsic and extrinsic calibration using a calibration object with known world coordinates [9][10][11][12]. Such a calibration object consists of several easily detectable non-coplanar feature points with known relative 3D positions. Point correspondences between 3D world points and image points would be established by fixing the world coordinate system in the calibration object. Then, these algorithm estimate both intrinsic and extrinsic parameters (w.r.t. the fixed world coordinate system) which best map picture coordinates to corresponding 3D world coordinates. Calibration of a single camera can be done very efficiently for these methods. But for calibration of a camera network, it is tedious and cumbersome to calibrate all the cameras simultaneously using these reference objects, as it is often extremely difficult to make all the points on the calibration object simultaneously visible in all views. Moreover, it involves the design and use of some highly accurate tailor-made calibration patterns, which are often difficult and expensive to manufacture.
Zhang [13] proposed a calibration algorithm which uses a planar grid pattern as the calibration object. This method is mainly for intrinsic calibration. It can also do the extrinsic calibration for a stereo case with short baseline as soon as the planar pattern can be simultaneously visible in both cameras. Ueshiba et al. [14] proposed a similar method which also relies on the planar pattern. Patterns required for these methods [13,14] are easy and cheap to manufacture, which makes them flexible. However, for extrinsic calibration of a camera network or stereo case with wide baseline, these algorithms also encounter the problem of simultaneous visibility.
One way to avoid the problem of occluded features of a calibration object is to create a virtual calibration object by simply moving a detectable point through the working volume. Svoboda et al.
proposed an approach with less user interaction for intrinsic and extrinsic calibration of camera networks [5]. They use a moving laser pointer emitting a bright spot. By waving the bright spot in front of the cameras a large number of correspondences of the virtual calibration object can be detected. As the laser pointer needs to be moved many times, it is not easy to avoid occlusions in all camera views simultaneously. In contrast, sphere-based methods can use a much smaller number of measurements, which makes it practical to avoid occlusions by moving away from the sphere. As the laser pointer is moved, the method of Svoboda et al. also requires accurate time synchronization between cameras to avoid erroneously associating feature points corresponding to changes of laser pointer positions.
Aslan et al. developed a similar approach to automatically calibrate the extrinsic parameters of multiple cameras [15]. Instead of a bright spot, they detected people walking through the room and used a reference point on top of the person's head as the calibration feature. The relative pose of every camera pair is estimated using the corresponding feature points, and with this, the complete camera network is built up using a global error minimization technique. As there is no unique detectable point on top of a person's head, different cameras will always detect different feature points, which causes problem for points correspondences. In contrast, we propose to use a sphere which has consistent appearance when observed from an arbitrary direction.
Sinha et al. used dynamic silhouettes of a moving person to fully calibrate a synchronized multi-camera system [16]. Their algorithm first uses silhouettes to estimate a set of fundamental matrices, which are then applied to compute a projective calibration. The projective calibration is upgraded to a Euclidean one using the self-calibration technique. Finally they apply the bundle adjustment to improve the solution. As their calibration approach relies on silhouettes, it requires a good silhouette extraction method for calibration data capturing. Moreover, the silhouette of a person changes when the person moves, which will cause correspondence problem.
Spheres have been considered as calibration objects since they have consistent appearance when observed from an arbitrary direction. Many sphere-based camera calibration approaches have been proposed. Some of them are only for intrinsic calibration [17][18][19][20][21].
Agrawal et al. [4] and Zhang et al. [22] proposed to do both intrinsic and extrinsic calibration by putting a sphere at three or more different places. For extrinsic calibration, both methods first estimate the 3D positions of the sphere centers in each of the camera coordinate systems based on known intrinsic parameters and conics of the ellipse (projection of a sphere in the image). Then, they obtained the relative rotation and translation between two cameras by 3D point registration of the estimated sphere centers. The accuracy of both camera calibration methods depends highly on the accuracy of the conic matrices which can be obtained by ellipse fitting of the projection of the sphere in the image. Thus, the quality of ellipse detection strongly affects the accuracy of calibration results. In contrast, our method requires only the estimate of the center of the projected sphere and its area. These are simpler to compute and more robust against errors in the projected sphere's estimated contours.

Preliminaries
Extrinsic parameters are expressed with respect to a reference coordinate system, which is also called the world coordinates system. Since we have multiple cameras, we will associate a distinct world coordinate system in which a point is denoted as r w = (X W , Y W , Z W ) T , where the superscript T denotes matrix transposition. In the camera coordinate system, a point is denoted as r (k) = (X (k) , Y (k) , Z (k) ) T with each camera k. The world coordinates r w are w.r.t. a single reference coordinate system. Without loss of generality, we choose the coordinate system of the first camera as the world coordinate system: r w = r (1) . The camera coordinates r (k) are related to the world coordinates r w by r (k) = R (k) r + c (k) , where c (k) are the coordinates of the origin of the global world coordinate system within the local coordinate system of camera k and R (k) is a 3 × 3 rotation matrix. The purpose of this paper is to find c (k) and R (k) for each camera.
Intrinsic parameters define the imaging geometry of the camera. Since we deal with only one camera for intrinsic parameters, we will drop the superscripts k. In the camera coordinate system, a point is denoted as r = (X, Y, Z) T . We assume that the camera is modelled by the usual pinhole, its coordinate system has its origin in the optical center of the camera, its Z axis (optical axis) perpendicular to the image plane. Therefore, the image plane is given by Z = f , where f is the focal length of the camera in physical units (e.g., centimeter). Moreover in each camera image the projection of a 3D point can be characterized by its normalized image coordinates x = (x, y) which have the same physical units as the camera coordinates r. Alternatively, a projected point can also be identified by its integer pixel coordinates u = (u, v) T , i.e., a column and row number in the image.
Assuming a zero-skew camera, the pixel and normalised image coordinates are related by in which s x , s y is the pixel dimensions in physical units. We assume that intrinsic calibration has already been performed. This means we know f x , f y , and the optical center position (u 0 , v 0 ). Moreover, f x = f /s x , f y = f /s y are then related to the focal distance f . As such three specific coordinate systems are associated with each camera observation of a point. The first set of coordinates r = (X, Y, Z) T is expressed in physical units and indicate the 3D position of a point relative to the camera. The second set and third sets of coordinates (x, y) are the normalized image coordinates expressed in physical units and the coordinates (u, v), expressed in pixels, respectively. We also have the relationship: Combining Equations (1) and (2) we get So we can estimate the 3D position of a point w.r.t. the camera coordinate system if we know Z , (u, v) and all 4 intrinsic parameters. In Section 4, we will show how to estimate Z based on the projection of a sphere in the image.

Projection of a Sphere on the Camera Plane
In this section, we assume that f = 1 for simplifying the derivation. Consider a sphere with radius R s with its center at position r s , where r s = (X s , Y s , Z s ) T is w.r.t. the camera coordinate system. The center is located at a distance d s from the origin (optical center of the camera). We also assume that the size of the sphere is small compared to the distance to the camera plane. Another assumption is that the center of the sphere projects to the center of the ellipse. Figure 1 shows that the sphere is projected as an ellipse on the camera plane. In fact the ellipse outline is the projection of a very specific circle (indicated in green) on the sphere. The circle is also the intersection of a cone (also shown in green) tangent to the sphere and the sphere itself. The cone has an opening angle β. The center of the circle is r c = (X c , Y c , Z c ) T equals d c e n and is at a distance d c from the origin. In general, the radius R c of the circle does not equal the radius R s of the sphere and d c = d s .
We rotate the camera coordinate system along the Y axis by an angle of −θ, then along the Z axis by an angle of 1.5π + φ. We will get three orthogonal unit vectors e 1 , e 2 and e n which define a right-handed coordinate system. e n is the unit vector pointing in the direction of the sphere: We define q as the slant height of the cone. Taking into account that the cone is tangent to the sphere, we have sin β = R s /d s and q 2 = d 2 s − R 2 s . Taking into account that the circle is perpendicular to the axis of the cone we have tan β = R c /d c , d c = q cos β and R c = q sin β.
Therefore we can get R c and d c in terms of R s and β: If we introduce a parameter α, then the 3D position of points on that specific circle is given by In the camera plane, we have normalized image coordinates (x, y). In these coordinates, by using Equation (2) with f = 1, the projected ellipse of that circle is given by In this equation, α varies from 0 to 2π, but all other variables are fixed. We now rotate the image coordinate system to get new coordinates We change once again to a new coordinate system, but this time using a translation instead of a rotation. The new coordinates are (x 1 , y 2 ) with y 2 = y 1 + ∆, in which Then we obtain the ellipse equation which only has terms in x 2 1 , y 2 2 and of course the constant term 1, i.e., it is of the form From this we can deduce that the ellipse is centered at (x 1 , y 2 ) = (0, 0), which corresponds to x 1 = 0 and In fact, y 1 is also the distance of the center of the ellipse to the center of the camera image plane, expressed in the original (non-translated, non-rotated) image coordinates (x, y). By replacing R c and d c using Equations (4) and (5), we can get ∆, λ −2 x , λ −2 y .
Then we introduce the variable = R s /Z s . This variable is typically small because the size of the sphere is assumed small compared to the distance to the camera plane. Note that also = sin β/ cos θ. Using the fact that is small, we can employ Taylor series approximation underneath the square roots and find that − ∆ = 2 tan θ + tan θ ≈ tan θ and We know that the area of the ellipse is A = πλ x λ y , we also know that = R s /Z s and = sin β/ cos θ. Combining these equations with Equations (14)- (16), we can deduce that Z s = R s / = R s π/(A cos θ)). This means if we know area A and θ, we can get the estimated Z s with known R s . In Section 4.2, we will give the procedure about how to estimate A and θ from an image of a sphere and known intrinsic parameters.

Estimation of the Sphere Position from an Image
Based on the aforementioned derivation, the practical procedure to estimate the center r s of the sphere from the camera picture works as follows: 1. Segment the ellipse (projection of the sphere) within the image. This yields a binary image which resembles only an ellipse. In this process, we do not need to estimate the exact parameter of the ellipse, as we only need to estimate the the center and area of the projected sphere. In our experiments, we propose to use a lighted sphere, and slightly darkened the room, simple thresholding suffices to segment the ellipse blobs from the background.
2. Since we assume that the center of the sphere projects to the center of the ellipse, from the binary image we estimate the pixel coordinates u s and v s of the center of gravity of the binary blob which represents the ellipse in the segmented image. After that we can get x s = (u s − u 0 )/f x , and the number of pixels in the binary blob representing the segmented ellipse. This area also equals A = πλ x λ y ≈ π 2 / cos θ. So Z s = R s / = R s π/(A cos θ)).
4. From x s , y s and Z s compute X s = x s Z s and Y s = y s Z s .

Multi-Camera Calibration
As stated in Section 3, a given point can be expressed in world coordinates r w = (X W , Y W , Z W ) or in camera-k specific camera coordinates r (k) .
The coordinate systems are related as r (k) = (X (k) , Y (k) , Z (k) ) = R (k) r w + c (k) , where c (k) are the coordinates of the origin of the global world coordinate system within the local coordinate system of camera k and R (k) is a 3 × 3 rotation matrix. Note that R (k) is an orthogonal matrix, so R (k) T R (k) = I, with I the identity matrix. The goal is to find c (k) and R (k) for each camera k. Using the equations in Section 4, we can estimate the camera coordinates of the sphere center from the image of camera k. In our implementation, we select the coordinate frame of camera 1 as the world coordinate frame, which means r w = r (1) , R (1) = I and c (1) = 0. The remaining problem is then to estimate R (k) and c (k) for all other cameras. This problem is equivalent to computing the 3D rigid body transformation which optimally aligns two sets of points for which the correspondence is known. Eggert et al. [23] presented a comparative analysis of four popular and efficient algorithms, each of which computes the translational and rotational components of the transform in closed form, as the solution to a least squares formulation of the problem. Eggert et al. [23] proved that the method [2] provides the best overall accuracy and stability. The first solution [2] was developed by Arun et al. and is based on computing the singular value decomposition of a matrix derived from the standard representation. So we propose to use the method of Arun et al. [2], which investigates the solution of orthogonal procrustes analysis [24]. For comparison with the commonly used rank-4 method, we also implement the rank-4 factorization method of Sturm et al. [1].

Extrinsic Calibration based on Rank-4 Factorization
In the following we switch to homogeneous coordinates to perform the multi-camera calibration. We usex to denote the augmented vector by adding 1 as the last element: T . In this case, the relationship between the two camera coordinates can be written compactly as where Suppose the calibration sphere is moved to N different positions in the room. At each position, all cameras capture images of the sphere and we get the estimated camera coordinates r As the rank of F (k) is at most 4 and this provides a means to compute the matrices P k using the rank-4 factorization method [1]. The calibration proceeds as follows: 1. Let F (k) = U SV T be the SVD decomposition of F (k) ; extract U 1 , the first 4 columns of the matrix U .

At this point, we define
. Combining A, D and Equation (18), we know that AD must equal F (1) , from which we can compute A.
As rank-4 factorization is applied here, it requires N > 3 and the sphere positions to be non-coplanar. Due to noise in the data, the estimated matrix R (k) in P k does not generally satisfy the properties of a rotation matrix. Therefore, we use the orthogonal procrustes solution presented in Section 5.2 to make the estimated rotation matrix orthogonal, then we recalculate the translation parameters.

Extrinsic Calibration based on Orthogonal Procrustes
Arun et al. [2] proposed to estimate R (k) based on orthogonal procrustes analysis. The orthogonal procrustes problem is the following problem in matrix algebra: given matrices A and B, find a square orthogonal matrix Q such that QA is as close as possible to B: specifically Q is defined as arg min Q Q A − B , where . is the Frobenius matrix norm. In other words, the minimization is performed in the sense of the total least squares. This problem was originally solved by Peter Schonemann in 1964 and the solution was later published [24]. This problem is equivalent to finding the nearest orthogonal matrix Q to a given matrix M = A T B. In terms of the singular value M = U SV T of M , this is given by Q = U V T .
As we know that r i + c (k) , with i = 1, 2 . . . N . In order to solve this problem using orthogonal procrustes, we need to decouple the translation and rotation. We calculate the centroid of r Let H (1) be the matrix with columns r (1) i − r (1) , i = 1, 2 . . . N . and H (k) be the matrix with columns r As pointed out by Arun et al., there will be three possibilities for the solution of R (k) from geometrical considerations.

{r (k)
i } are not coplanar. In this case, the rotation solution is unique, R (k) = V 2 U T 2 is the desired solution.

{r
(k) i } are coplanar but not collinear. There is a unique rotation as well as a unique reflection. In this case, we just need to check the value of det(V 2 U T 2 ): where V 2 is obtained by changing the sign of the last column of matrix V 2 .

{r (k)
i } are collinear. There will be no unique solution for R (k) .
Once we obtain the rotation matrix R (k) , the translation vector c (k) is then given by

Refinement through Gradient Descent
The equations in Sections 5.1 and 5.2 provide a quite good estimate of the extrinsic calibration parameters, but their accuracy is limited by the approximation we made during estimating the sphere position w.r.t. the local camera coordinate system in Section 4. Moreover, they are not jointly optimised since we do pairwise calibration to relate each camera to the first camera. To jointly refine the extrinsic parameters for all cameras, we apply an optimization method which was developed by Bouguet [3] to optimize the results. The optimization step minimizes the total reprojection error of training samples (see Section 6) over all the extrinsic parameters. The objective function to be minimized is the mean-square discrepancy between the observed ellipse center in the image and their image reprojections computed using the estimated extrinsic calibration matrices. The optimization is implemented using an iterative gradient descent procedure.

Alignment with a World Coordinate System
The aforementioned calibration yields the external camera parameters in the coordinate frame of the first camera. In practical applications, it is often desirable to have all parameters in some user specified world coordinate system. For example, for camera networks which are intended for indoor people tracking, we would like to have the Z = 0 plane to coincide with the ground floor. For that, and only for that, we need ground truth measurements of at least three sphere centers w.r.t. the user-specified world coordinate system. Then, we use the procrustes approach of Section 5.2 to compute the transform between the user-specified coordinate system and camera-1's coordinate system.

Performance Measures
There exist several evaluation methods and accuracy measures to compare calibration methods. In the following, we assume that we have acquired m images of n additional spheres which are used only as test samples, but not to estimate R (k) and c (k) . We also assume that the ground truth coordinates of these test samples are known. In practice, we measure them using a tape measure according to the user specified world coordinate system. We denote by d(a, b) the Euclidean distance between two points a and b in 3D space.
1. δr w . The triangulation error is a measurement of how the calibration matrices influence the accuracy of multi-camera triangulation. Let r w i be the ground truth position of the ith test sample, andr w i its position estimated using triangulation [25] based on estimated extrinsic parameters and image positions of the sample. This is also the classical method for 3D reconstruction, it represents how well we can measure the 3D world with estimated extrinsic parameters. The estimated 3D positionsr w i will differ from the true positions r w i not only because of measurement errors but also due to any inaccuracies in the calibration matrices and the process of triangulation. The triangulation error δr w pools all of these errors. As such it is not an absolute measure of calibration accuracy. However, as we use the same image feature points and the same triangulation method to compare various calibration methods, δr w is a good quality measure for the extrinsic calibration. The error is expressed in in physical units (e.g., centimeter) and is defined as 2. δu pr . The projection error is a measurement of how the calibration matrices influence the accuracy of projections of 3D points on image planes. Let u ij be the observed pixel coordinates of the ith sample in the jth camera's image, whileû pr ij is the estimated position through projection. When image feature points are projected from 3D points (with known position), the estimated 2D image positionsû pr ij will differ from the true positions u ij not only because of measurement errors, but also due to any inaccuracy in the calibration matrices. The projection error δu pr takes all these errors into account. So it is not an absolute measurement of the calibration accuracy. However, since the same image feature points are used to compare different calibration methods, δu pr is another good quality measurement for the extrinsic calibration. The error is expressed in pixels and is defined as 3. δu repr . The reprojection error is a measurement of self-consistency of the calibration matrices. Different from the projection error, the 3D points are firstly obtained from triangulation based on estimated extrinsic parameters and image points. Then, image feature points are projected from these 3D points. The estimated 2D image positionsû repr ij will differ from the true positions u ij not only because of measurement errors, but also due to any inaccuracy in the calibration matrices and the process of triangulation. The reprojection error δu repr takes all these errors into account. So it is not an absolute measurement of the calibration accuracy. However, since the same image feature points are used to compare different calibration methods, δu repr is another good quality measure for the extrinsic calibration. The error is expressed in pixels and is defined as

Dataset for Evaluation
For evaluation, we calibrated a multi-camera tracking system composed of four side view cameras using a 25 cm diameter "globe lamp" sold as a garden light. The cameras were mounted at a height of about 3 m in each corner of a room (8.6 m by 4.8 m) with a resolution of 780 by 580 pixel. These cameras were firstly intrinsically calibrated using the method of Zhang [13].
The practical calibration proceeds as follows: 1. Put the sphere in the common overlapping area at minimum 3 distinct non-collinear locations and let each camera capture an image at each location. The spheres are distributed evenly over working volume of the multi-camera system.
2. As we use a lighted sphere, and slightly darkened the room, simple thresholding suffices to segment the sphere blobs from the background. Figure 2a is an example of captured images of the sphere. Figure 2b is the image which only shows the ellipse after simple thresholding.  To assess the robustness against the number and distribution of training samples, we will show results for various numbers of calibration (training) samples. These are randomly selected from a set of 45 training samples (i.e., four images of the sphere in 45 different positions). Some of the performance criteria in Section 6 require ground truth expressed in the user specified world coordinate system. For this purpose we measured the positions of 10 sphere centers using a tape measure. As it is not possible to know where exactly the center of the sphere is, we put the sphere on top of a cube box whose edge length is the same as the diameter of the sphere. So the position of the sphere center can be obtained by combining the position of the box corner with the diameter of the sphere. To test the accuracy of calibration, we also captured 40 more test samples. In this case, the sphere was placed in different positions than for the training samples. These test samples are never used to compute the calibration matrices, but only to estimate the accuracy of the matrices and the robustness of the proposed approach.

Comparison with Conics-based Sphere Calibration
In order to assess the accuracy of estimating sphere positions w.r.t. the camera coordinate system, we compared our method with the sphere-based method of Agrawal et al. [4]. This method performs both intrinsic and extrinsic calibration. For extrinsic calibration, they first estimate the positions of the sphere centers in each of the camera coordinate systems based on the intrinsic parameters and conics of the sphere projection. Then they obtain the relative rotation and translation between two cameras by registering the estimated two sets of sphere centers. For fair comparison, both methods use the same intrinsic parameters obtained using the method of Zhang [13].
As it is difficult to measure the ground truth position of a sphere w.r.t. the camera coordinate system, we propose to use the distance between the sphere center and the camera center as the evaluation criteria. For that we measure the position of each camera w.r.t. the predefined world coordinate system, we also have the positions of spheres w.r.t. the same world coordinate system, then we can calculate the euclidean distance. Meanwhile, we estimate the 3D positions of the sphere centers using our proposed approach, as well as Agrawal's method. Then, we obtain an estimate of the distance between the sphere center and the camera center. We use all 40 test samples in this comparison. Figure 3 demonstrates that our approach provides more accurate and stable estimation of 3D positions of sphere centers compared to the conic-based method of Agrawal et al., since the distance error of our method is 9.7 cm, while the average error of Agrawal's method is 24.5 cm. As both methods can estimate the distance between the spheres and the cameras, we can estimate the scale information (distance) between two 3D points which we define as s. Then we introduce another performance criterion: the scale error δs. For two test samples with positions r w 1 and r w 2 in the user-specified world coordinate system, we calculate the ground truth Euclidean distance between these two samples as s w = d(r w 1 , r w 2 ). After we obtain the extrinsic parameters of other cameras w.r.t. the coordinate system of the first camera, we can obtain the 3D positions (w.r.t. the coordinate system of the first camera) of the same two samples using triangulation based on the corresponding 2D image coordinates and estimated extrinsic parameters. Then we have s c = d(r 2 ). This distance is computed in the camera coordinate system, but of course it should equal the ground truth distance in the absence of calibration and image processing errors. Then, the scale error is defined as δs 12 = |s c − s w |/s w . Since we have multiple test samples, we calculate the scale error for each combination of two samples, and calculate the average of all scale errors. Figure 4 shows that our approach allows more accurate and stable estimation of scale information compared to the conic based method of Agrawal et al. The scale error of our method is 2.4%, while the error of Agrawal's method is 10.3%.
For those comparisons which require comparison to ground truth in the user-specified coordinate system, we use the same orthogonal procrustes analysis approach described in Section 5.2 to compute the transform to user-specified coordinates. To assess the robustness to the precise choice of training samples and the number N of training samples, we repeat the calibration procedure 100 times for each N , for various sets of training samples, randomly selected from the available ones.
We can conclude that our method outperforms the method of Agrawal et al. for estimating extrinsic parameters since the projection error, triangulation error and reprojection error of our method are nearly half of the errors of the method of Agrawal et al., as illustrated in Table 1.

Comparison with Epipolar Geometry Based Calibration
In order to show the feasibility and robustness of the proposed method, we also make a comparison to the calibration method of Svoboda et al. [5] using the same training and test samples. Tables 2-4 show the comparison in terms of the projection error, the triangulation error and the reprojection error, respectively. It can be observed from these three tables that the accuracy of all methods improves by increasing the number of training samples. We can obtain that the orthogonal procrustes method is always more accurate and stable than the rank-4 method. We can also conclude that the refinement can improve the accuracy and stability of the orthogonal procrustes method. Compared to the method of Svoboda et al. in terms of the projection and the triangulation error, our refinement method is more accurate and stable, especially when the number of training samples is below 18. The reprojection error of our method is slightly larger (3%) than the error of Svoboda's method. However, our refinement method can still produce reasonable result (the projection, the triangulation and the reprojection errors are 5.3/1.4, 4.5/0.7, 3.3/1.6, respectively), when we only have three training samples, wherever the method of Svoboda is not applicable.
In terms of practical aspect, it is required to crawl on the floor with a laser pointer for the method of Svoboda's method in order to fill up the whole working volume. On the contrary, in our method we just need to put the sphere on the ground or on top of a box, which makes it more easy and pleasant to capture the calibration data. Table 2. Projection error (pixel). The first column represent the number of training samples, and the first row gives the methods used for comparison. The refinement method uses the result of Orthogonal procrustes as initial guess, and then applies the proposed refinement method to improve the accuracy. The numbers are listed as "a/s" where a is the average over 100 experiments and s is the standard deviation  Table 3. Triangulation error (cm). The first column represents the number of training samples, and the first row gives the methods used for comparison. The refinement method uses the result of Orthogonal procrustes as initial guess, and then applies the proposed refinement method to improve the accuracy. The numbers are listed as "a/s" where a is the average over 100 experiments and s is the standard deviation.   Table 4. Reprojection error (pixel). The first column represents the number of training samples, and the first row gives the methods used for comparison. The refinement method uses the result of Orthogonal procrustes as initial guess, and then applies the proposed refinement method to improve the accuracy. The numbers are listed as "a/s" where a is the average over 100 experiments and s is the standard deviation.  We also performed some experiments on the calibration of a camera pair, i.e., two cameras spaced far apart (not in stereo configuration). Methods such as the one of Svoboda et al. [5] cannot be applied in this case as they need a minimum of three cameras. Table 5 shows the result for camera 1 and camera 2. We calibrated using only nine training samples and evaluated based on the 40 test samples. The calibration accuracy after refinement is shown in Table 5, from which we can conclude that our method can still provide accurate calibration for camera networks with two cameras. Table 5. Calibration accuracy of the network with only two cameras. The second row and the third row represent the mean and the standard deviation over all 40 test samples, respectively.

Conclusions
In this paper, we present a simple and robust method to compute the 3D positions of sphere centers w.r.t. the camera coordinate system by estimating only the center of the projected sphere and its area in an image. Then we propose to use orthogonal procrustes analysis to do pairwise calibration. Finally an optimization routine jointly refines the extrinsic parameters for all cameras.
Compared to other sphere based method which process the sphere images by tracing the contour of the elliptical projection of the sphere, our method provides simpler and more accurate estimation of sphere 3D positions w.r.t. local camera coordinate system. While the sphere positions are only used as intermediate results, a higher accuracy in these sphere positions still leads to a more accurate and stable estimation of extrinsic parameters.
Compared to the calibration method of Svoboda et al. [5], our method can provide more accurate estimation of extrinsic parameters with less training samples. We can also provide scale information assuming known sphere size. An additional benefit of our approach is that it can extrinsically calibrate two cameras at a time, whereas the method of Svoboda et al. requires at least three cameras. Moreover, the data capturing procedure is more simple and convenient for our method.
In the proposed algorithm and experiments, we assumed that all cameras in the networks share common volume, we chose one camera's coordinate system as the world coordinate system and related all other cameras to it by pairwise calibration. Our algorithm can be very easily extended to the calibration of camera networks that do not share a common volume. In that case, only pairwise overlap is required. We will explore this in our future work.