Automatic Multi-Camera Extrinsic Parameter Calibration Based on Pedestrian Torsors †

Extrinsic camera calibration is essential for any computer vision task in a camera network. Typically, researchers place a calibration object in the scene to calibrate all the cameras in a camera network. However, when installing cameras in the field, this approach can be costly and impractical, especially when recalibration is needed. This paper proposes a novel, accurate and fully automatic extrinsic calibration framework for camera networks with partially overlapping views. The proposed method considers the pedestrians in the observed scene as the calibration objects and analyzes the pedestrian tracks to obtain extrinsic parameters. Compared to the state of the art, the new method is fully automatic and robust in various environments. Our method detect human poses in the camera images and then models walking persons as vertical sticks. We apply a brute-force method to determines the correspondence between persons in multiple camera images. This information along with 3D estimated locations of the top and the bottom of the pedestrians are then used to compute the extrinsic calibration matrices. We also propose a novel method to calibrate the camera network by only using the top and centerline of the person when the bottom of the person is not available in heavily occluded scenes. We verified the robustness of the method in different camera setups and for both single and multiple walking people. The results show that the triangulation error of a few centimeters can be obtained. Typically, it requires less than one minute of observing the walking people to reach this accuracy in controlled environments. It also just takes a few minutes to collect enough data for the calibration in uncontrolled environments. Our proposed method can perform well in various situations such as multi-person, occlusions, or even at real intersections on the street.


Introduction
Extrinsic camera calibration provides the coordinate system transformations from 3D world coordinates to 3D camera coordinates for all the cameras in the network. This information is essential for many machine vision applications such as tracking, augmented reality, free view image synthesis, 3D reconstruction [1][2][3], or transferring the well-trained recognition models to different camera setups [4].
The classical methods [5][6][7] require a sufficient number of point correspondences of calibration objects to estimate the extrinsic parameters accurately. In addition, the calibration objects also have to be well-observed among all cameras. Moreover, calibrating cameras without any mistakes by using classical methods requires a certain level of skill while sending skilled technicians onsite to recalibrate cameras is costly and time-consuming. It also would be even worse because the cameras also need to be recalibrated after the cameras are adjusted or moved. Additionally, this traditional approach does not work for historic multi-camera video sequences in which no calibration objects were recorded.
Hartley et al. [8] proposed an auto-calibration method based on scene reconstruction from arbitrary features. Due to the interactive fashion as well as a large number of parameters has to be estimated, this method is slow and not always be able to achieve reliable results. Analyzing human data in images and video is the main concern of many machine vision applications. Thus, to leverage the information which also is extracted to serve other high-level tasks, many autocalibration methods [9][10][11][12][13][14] based on pedestrians are proposed. However, they are sensitive to noise as well as impractical for several situations in practice.
The proposed method relies on finding humans in images and estimating their centerline. For this purpose, we use OpenPose [15,16]. Human pose estimation is also an important task for various machine vision applications, such as action recognition, motion capture, sports, etc. Many real-time human estimation methods [15][16][17][18][19][20][21] have been proposed in recent years. Therefore, human pose estimation is fast enough for use in extrinsic calibration. Moreover, pose estimation is useful in itself for video analytics. Therefore, if an application already includes pose estimation, we can reuse this "for free" in calibration. Note that, in this paper, we assume the intrinsic camera parameters have been estimated before the extrinsic camera calibration.
The first contribution of our paper is that we replace the ellipse based detection of the top and the bottom of pedestrians by an approach based on modern human pose estimators. Thus, it provides more robust detection of people and a more accurate estimation of their centerlines. Then, we analyze the amount of video of walking people it needs to reach the desired accuracy. We also combine the proposed method calibration method with a random sampling strategy to deal with the noise and outliers in the estimated human pose data. It shortens the required time of data collection, and also makes our method robust to outliers and noise.
The second contribution is that we propose an automatic method to also handle the case of multiple pedestrians simultaneously in the scene. In [14], this case was handled by manual annotation. In the paper, we propose a brute-force, but still fast, method to effectively find the correspondences and also eliminate correspondences that have poor estimated human pose. We show that it produces accurate results and more complicated methods are not needed.
The third contribution, we proposed a novel extrinsic calibration method based on just the information of the top and the centerlines of the pedestrian. Thus, it helps the proposed calibration system can work well even without the information of the bottom of the pedestrians (which usually happens in heavily occluded scenes). Therefore, the proposed method can be applied to a wide range of usage scenarios, including indoor and outdoor scenes. The experimental results show that the proposed method can achieve very precise accuracy in many challenging scenarios including real intersection, heavily occluded, and multiple people scenes.
The rest of the paper is organized as follows. We discuss the related work in Section 1. In Section 2, we describe the architecture of our calibration method for a pair of cameras in detail. In Section 3, we explain the way to extend the proposed method for a camera network as well as the novel extrinsic calibration based on just the top and centerline of the walking people. We present the obtained results and the detailed analysis of our experiments in Section 4. Finally, we discuss the conclusion and future work in Section 5.

Related Work
Lv et al. [12,22] detect and select the walking human from video sequences by the transition of foreground object shapes. They represent the pedestrians as vertical "walking sticks" of the same height in the 3D environment. Then, they compute the vertical vanishing point and the horizon line based on the vertical "walking sticks". Li et al. [11] proposed a single view camera calibration method that directly estimates the focal length, the tilting angle, and the camera height by using a nonlinear regression model from the observed head and feet points of a walking human.
In [9], Liu et al. proposed a fully automatic calibration method for monocular stationary cameras. They leverage relative 3D pedestrian height distribution to eliminate false pedestrian detections in moderately crowded scenes. In [10], Liu et al. extended their earlier work to camera network calibration. Iteratively, they incorporate robust matching with a partial direct linear transform. Due to the reliance on vanishing point (intersection of near-parallel lines) estimation, a small error of head (or feet) detection could lead to a big error of extrinsic parameters. On the other hand, our method estimates the extrinsic parameters based on estimated 3D positions of the head and feet which is much less sensitive to noise. Moreover, their methods cannot work if there is no feet information in the heavily occluded scene while the proposed method still performs well in that situation. In [23], Lucas et al. proposed a method for urban areas by combining the information from the pedestrians and structures that have parallel and orthogonal lines, such as buildings and road lines. Method [23] addresses the information of pedestrians in a similar manner as [10], and it is only applicable to single view applications.
Most methods [9,10,12] assume that moving pedestrians walk on a planar, horizontal surface. Possegger et al. [13] proposed an unsupervised extrinsic self-calibration method for a network of static cameras and pan-tilt-zoom cameras solely based on correspondences between tracks of a walking human. Then, they eliminate the outliers of feet and head detection by estimating pairwise homographies between the camera views based on the detected locations of feet and head. Finally, they compute the extrinsic parameters of the cameras by solving a non-linear optimization problem to minimize the reprojection error. Therefore, it tends to get stuck in local optima without a good initialization which was not presented in their work. In contrast, our method can have a precise estimated 3D position of the head and feet based on a robust human pose detector for the extrinsic calibration. Our method also does not require the person to walk on a plane surface.
Hödlmoser et al. [24] proposed a novel method to estimate the essential matrix based on the locations of the feet and head of the single pedestrian from the video sequences. After that, the extrinsic parameters were extracted by essential matrix decomposition. To find out the unique and proper solution for the rotation parameters and translation parameters, they needed to apply the chirality check [25,26]. Therefore, their method is quite sensitive to erroneous of the estimated locations between head (or foot) correspondences in different camera views. If the pedestrian walks along a straight line which occurs quite often in practice, all head and foot positions lie in the same plane. This leads to the degenerate case for the essential matrix estimation [27]. Therefore, they can only obtain the homography matrix between two views. However, in this case, their method cannot obtain the unique extrinsic parameters by decomposing the homography matrix [28,29]. Thus, it cannot find the extrinsic parameters with reasonable accuracy in this case.
Our paper is based on the work of Guan et al. [14]; our method does not require that the pedestrian walks on a plane surface (e.g., walking on steps and stairs), as long as the posture of the pedestrian remains the same while walking. The method obtains the extrinsic parameters by computing the 3D rigid body transformation that optimally aligns two sets of points corresponding to top and bottom of the pedestrian. The correspondence of these points between camera views is assumed to be known. In practice, this method, therefore, requires manual annotations to differentiate between multiple people and is not fully automatic. In [14], the top and bottom detection was implemented based on change detection and is not very robust w.r.t. noise and occlusion. In contrast, our propose a method is fully automated and uses a more robust human pose detector. In [30], Lettry et al. proposed a method to solve correspondences for camera calibration based on multiple pedestrians. However, their method produces incorrect correspondences which degrade the accuracy of the calibration. Figure 1 shows the block diagram of the proposed framework of multi-camera calibration based on walking pedestrians. First, the proposed method calibrates all the cameras in the network in a pairwise fashion. Then, if the ground truth measurements are available (at least 3 points), we can apply the refinement and alignment to further optimize the extrinsic parameters. If the bottom of the pedestrians cannot be observed, we apply the novel method to estimate the extrinsic parameters based on just the top and the centerline of the walking person (Section 3.5). However, if the bottom of the pedestrians can be observed properly, we apply the extrinsic calibration method based on the top and the bottom that extracted from OpenPose [16] to have more accurate extrinsic parameters.

Extract the Positions of the Top and the Bottom of the the Observed Pedestrian in Image Coordinates
First, we assume that the frame synchronization, as well as the intrinsic calibration for all cameras in the network, have been done before the extrinsic calibration. To obtain the position of a walking person in the image, we apply the human pose estimation method in [16]. This produces a skeleton model of all major body joints. Because the locations of the head joints are not stable enough in these skeletons, we use the neck joint locations instead as the top of the observed pedestrian.
In this work, we compute the bottom of the observed pedestrian in two different ways. The first one is the midpoint of the left ankle joint and the right ankle joint as the bottom position of a walking person. However, the ankle of the walking person does not always appear in the mildly occluded scene. However, the hip's joints of the walking person could be well observed in the mildly occluded scene. Thus, we also investigate to extract the bottom part of the body by the middle point of the left hip joint and the right hip joint.
In the case that the bottom of the pedestrians cannot be observed, to determine the image positions of the top and the centerline of a walking person, we propose to detect the bounding boxes of the walking person (are extracted by YOLO [31]) in the first step. We estimate the centerline of a pedestrian by the line from the center of the top edge to the center bottom edge of the bounding box.

Extrinsic Camera Calibration Based on a Pedestrian
Consider a camera network with K cameras C 1 , C 2 , ..., C K . Let r (w) = (X w , Y w , Z w ) T be a point in a 3D world coordinate system that is visible to the camera, where the superscript T denotes a matrix transposition. In our work, each camera in the camera network has its own distinct camera coordinate system. We denote a 3D point in the coordinate system of camera k as r (k) = X (k) , Y (k) , Z (k) T . Without loss of generality, we choose the coordinate system of the camera C 1 as the world coordinate system as follows: r (w) = r (1) .
Thus, we can present the transformation between the camera coordinates r (k) and the world coordinates r (w) as follows: where c (k) are the coordinates of the origin of the global world coordinate system with regards to the local coordinate system of camera k. R (k) is the rotation matrix which is a 3 × 3 matrix.
To obtain the extrinsic parameters for all cameras in the camera network, we first calibrate the camera network in a pairwise fashion. Thus, let us consider a camera system (which is composed of two cameras) where a person moving between N different locations while keeping a fixed posture (the bottom and the top of the person can be observed from both cameras). Letũ      top (t) for camera k by applying the proposed method in [14]. Suppose that person walks upright and has heigh h where h is measured from the top of the pedestrian to the bottom of the pedestrian. Let r where e (k) z is unit vector of the person within camera k. Fromx top (t) which is perpendicular to the unique vertical plane containing the origin of camera k, x top (t). At a given time instant, the intersection of all of those planes is a line along the vertical direction. We could also cancel h in m (k) (t) Therefore, e . As explained in [14], once e (k) z is determined, we can compute the 3D locations of the of the bottom and top w.r.t. by the least-squares solutions up to a constant factor h. Finally, we apply the orthogonal Procrustes analysis [32] to estimate the rigid body transformation (relative camera pose in 3D space) between two sets of 3D points (top and bottom).

Robust Extrinsic Calibration
The proposed calibration method can obtain the appropriate extrinsic parameters when the number of inliers is large enough to compensate for the bad effect of the outliers. However, in practice, we do not always have enough samples to have a precise calibration, e.g., calibrate camera network from videos that were recorded a long time ago. The accuracy of the calibration does not only depend on the number of the samples but also the distribution of locations. For example, using more random locations from different spots of the room helps the proposed method improve accuracy ( Figure 2). However, the result of calibration from 20 s (approximately 300 locations) in Table 1 is worse than the result of the calibration from 20 random locations of the room (which is selected randomly from different spots of the scene) in Table 2. It is easy to understand because the person just slowly walks in the room, most of the locations collected in 20 s are so close to each other, and different cameras may see different points of the top or the bottom of the pedestrians. This means many of those locations provide similar information which is not so useful to improve the result of the calibration. Moreover, the outliers of the detection make the information around some spots of the room become inconsistence.  Table 2. Comparison between our method, the method of Guan et al. [14], and the method of Hödlmoser et al. [24] on Camera Network 1 (CN1), Camera Network 2 (CN2). We randomly select 20 locations of the pedestrians in the scene to calibrate the CN1, and CN2. δr (w) , δu (p) , δu (r) , and δu (rr) denotes the triangulation error, projection error, reprojection error, and relative reprojection error, respectively. The bold numbers are the best results among the mentioned methods. In [14], Guan et al. proposed a method based on RANSAC [33] to obtain a subset of top (or bottom) locations which are likely to be observed similarly in different views. However, the estimated 3D points which satisfy this criterion, do not always agree with the same unit vector e (k) z . Furthermore, the pose of the walking people can slightly change during the video sequences which also makes the unit vector of the centerlines slightly changed. So, including all locations that are likely to be observed similarly cannot effectively improve the calibration results. It can also produce bad e (k) z estimation which leads to a bad extrinsic calibration. Instead, we need to find an optimal sparse subset of collected locations without the outliers to improve the performance of the proposed method on short video sequences (short amount of time to collect data). Therefore, we propose a random sample scheme as Algorithm 1 to obtain the sparse subset of collected locations which produces the most stable extrinsic calibration (low reprojection error).   Step 1. For each frame index i in L (ab) , we randomly select a location at frame t that is neighbor of it (t − i < time ). In our experiment, we chose the time = 10. Then, we compute extrinsic parameters using the method that we presented in Section 3.2.
Step 2. Count the number of pairs agreeing with the extrinsic parameters (inliers). A pair is considered to agree with the extrinsic parameters if the reprojected error of that pair is smaller than the threshold error : Step 3. Repeat Steps 1 and 2 until the number of inliers reaches a certain threshold or the number of repetitions is greater than M.
Step 4. Choose the extrinsic parameters that has the highest number of the inliers (the most stable) based on the method that we presented in Section 3.2.
assign the frame index of the last key location k = t return frame indices of the key locations L (ab) .

Automatically Estimates Extrinsic Parameters Based on Multiple Pedestrians
In [14], Guan et al. had to manually annotate the correspondences in the scene to calibrate the camera network. However, it is not really convenient for the customers to annotate the data, especially, if the number of cameras in the network is huge. Thus, we propose a simple and fast method to find the correspondences between different cameras. We use Openpose [16] to estimate 2D skeleton models of humans in the images. In practice, the estimated locations of the necks and feet (or hip) joints thus obtained are sometimes inconsistent between views (e.g., a different physical point is indicated the feet joints or hip joints in two views). The proposed calibration method is insensitive to this problem, as long as the number of observed skeletons is large enough. Otherwise, the results will be poor if people are observed in an insufficient number of locations (e.g., the method will fail if only a single person, always in the same position, is observed).
In controlled environments, these conditions can be easily enforced by providing instructions to the walking people. However, in uncontrolled environments with multiple pedestrians, we cannot order the people to walk by our instruction. In addition, people tend to pass the scene in a short amount of time such as walking along a straight line (insufficient number of locations). Hence, it is difficult to gather data in different locations of the scene for precise calibration.
To handle this problem, we propose an easy and robust brute-force method to solve the association problem (which pedestrian in one camera corresponds to an observation in another camera). First, we apply a simple object matching algorithm based on feature matching to track the pedestrians for each top (n))} be the set of all locations of the top and bottom (feet or hip) of person i from frame m to frame n in camera k with k ∈ {a, b}.
q } be a set of locations of the top and bottom of all pedestrians in the scene of camera k with q is the number of pedestrians in this scene. We compute all possible correspondences C (ab) between camera a and camera b by generating all pairs of elements from H (a) and H (b) . Then, we calibrate the pair of cameras with each generated correspondences to estimate the matching rate as presented in the Algorithm 3. The matching rate represents the percentage of good reprojected results of the top and bottom positions of pedestrians where a good reprojected result means having the relative reprojection error less than threshold (equals 0.05 in our experiments). Finally, the top highest matching rate correspondences are selected to calibrate the pair of cameras. Thus, the difficulty of gathering data in uncontrolled environments also be solved by combining the sample locations of the highest matching rate correspondences from different spots of the scene. To avoid a combinatorial increase in the number of computations, the frames with too many pedestrians are removed. Moreover, the frames with too many pedestrians also have much worse pose estimation results as well as poor human tracking results. This makes it is not useful already to use these frames. Also, using the frames that have a very high number of people does not improve the accuracy of the calibration. Therefore, in practice, we can calibrate the camera network from the parts of video sequences which has a low enough number of pedestrians. In our experiments, we choose 5 pedestrians as the threshold, therefore the number of combination of correspondences for each frame is always less than 5! which does not take too much time to verify all possibilities. The brute-force approach can definitely be replaced by a random sampling scheme to make it runs faster. However, it may skip many appropriate correspondences for the calibration. Therefore, we decide to keep the brute-force approach to maintain the accuracy of the proposed method which is much more important for many applications in practice.

Extrinsic Calibration When Feet and Hip Joints Are Not Available
In Section 3.  top (t) and the centerline using the technique described above for camera k. We denote the projection of the centerline in the image as y = a k (t)x + b k (t).
Select two different points on that line and denote the normalized homogeneous image coordinates as (x (k) , which is the normal of the plane spanned by the camera center and the center line of the pedestrian. It is obvious that e (k) z is on that plane, so that m (k) (t) and e (k) z are orthogonal as Equation (4).
Therefore, e (k) z can be determined in the same fashion as presented in Section 3.2 by using SVD decomposition of (m (k) (t)) T : e (k) z is the singular vector which is corresponding to the lowest singular value. e (k) z is also the normal of the plane composed by all the top positions when the person walks upright on a flat horizontal surface. Since all top positions of a single pedestrian lie on a plane when the person walks on a horizontal surface, there is the homography between any two views of the plane. Various methods [28,29,34] have been proposed to estimate the homography between two views given at least 4 non-collinear corresponding points and solve the structures from motion problem by decomposing the homography matrix. Weng et al. [28] produces two solutions of the extrinsic parameters by decomposing the homography matrix. However, the unique and proper solution cannot be obtained without additional knowledge of the scene. Here we propose a method to obtain the unique extrinsic parameters by decomposing the homography matrix with the estimated e (k) z . Pairwise Planar Homography: as we select the coordinate frame of camera 1 as the world coordinate frame, we have r In the coordinate system of camera 1, the top of the pedestrian lies on the plane n (1) T r (1) with d (1) is the Euclidean distance between the center of camera 1 and the plane, and n (1) is the normal vector of the plane w.r.t the camera coordinate system of camera C 1 . Substituting Equation (7) in Equation (6) gives Since r where which is the homography matrix. Since the vectors from both sides of Equation (11) are parallel, their cross product is zero, which leads tox with × representing the cross product.
Notice that H (k) depends only upon eight independent coefficients, i.e., the three rotation coefficients, the three coordinates of the translation, and the two parameters representing the orientation of the plane. Each point imposes two independent constraints on H (k) , so we need at least four corresponding points (top positions) to solve uniquely for H (k) . The four points should be in a general configuration in the plane (no three of them are collinear). Notice that H (k) can be recovered up to a scale factor, so we get the homography matrix in the form of

Decomposition of the Homography Matrix:
Once H (k) e is obtained, we now discuss how to get extrinsic parameters by decomposing the homography matrix. It has been proven in the literature [28,29] that decomposing the homography matrix will give two candidate solutions for the extrinsic parameters. We will give detail about how to get the unique extrinsic parameters by using additional information.
Ma et al. [35] gave four solutions of the extrinsic parameters by SVD of the homography matrix, only two of which satisfy the positive depth constraint (the Z coordinate of the normal vector of the plane need to be positive, since the camera can see only points that are in front of it). Decomposition of the homography matrix proceeds as follows: 1. Normalization of the homography matrix. The normalized homography matrix is computed as where σ 2 H (k) e is the second largest singular value of H (k) e .

Compute the SVD of H (k) T H (k) as
Define v 3 as the three column vectors of V (k) , and σ (k) 3 as the eigenvalues of H (k) . Compute two more unit vectors by and u (k) 3. Define matrices U (k) Finally the four possible solutions of extrinsic parameters and normal vector of the plane are calculated as Notice that only two solutions of the above four solutions satisfy the positive depth constraint (i.e., the third coordinate of the normal vector for the plane should be positive since the camera can only see points that are in front of it). So either Equation (19) or Equation (21) is the possible solution. Similarly, either Equation (20) or Equation (22) is the possible solution. Once two possible solutions are obtained, we obtain the correct one as follows. From Equation (4), we estimate e (1) z , which is the normal of the plane composed by all the top positions of the pedestrian. e (1) z should be equal to the estimated n (1) if there is no noise in the data. In practice, it rarely happens due to all kinds of noise.
Thus we propose to use the angle between e (1) z and n (1) as a criterion to get the correct n (1) and the corresponding rotation and translation parameters. The correct solution should have a smaller angle to e (1) z . Finally, to calibrate the camera network with the torsors of multiple pedestrians, we choose the highest matching rate correspondence for the calibration. In this case, we use the detected bounding boxes of the pedestrians which are extracted by YOLO [31] to determine the top and bottom of the pedestrians. We choose the center of the top edge and the center bottom edge of the bounding boxes as top and bottom of the pedestrians.

Joint Extrinsic Refinement for All Cameras in the Network
In Sections 3.2-3.4, we present the extrinsic calibration for the camera network in pairwise fashion. However, the extrinsic parameters of the cameras are obtained based on algebraic distance minimization without taking the property of the projective geometry of the cameras into account. Thus, it increases the triangulation error and projection error when we combine the information from all available cameras. To solve this problem, we jointly refine the extrinsic parameters of all cameras by minimizes the total reprojection error based on the method proposed in [14,36]). The objective function defines by the mean-squared discrepancy between the observed image positions of bottom and top of the pedestrian, and their reprojections. Finally, we optimize the extrinsic parameters by an iterative gradient descent procedure.

Performance Measures
In order to evaluate the performance of our method with ground truth points, we compute the triangulation error (δr (w) ), projection error (δu (p) ), and reprojection error (δu (r) ) [14]. In practice, the ground truth points are not always available to measure the performance of the calibration. Thus, we can only measure the calibration by computing reprojection error based on the top (or the bottom) positions of detected pedestrians. However, different cameras have different resolutions. Moreover, the height of the pedestrians at different locations in an image is different. Let N be the number of ground truth 3D sample points and K is the number of cameras in the network. We define the relative reprojection error as follows:

Calibration with Controlled Environment
Calibration with single person. We evaluate our method with a multi-camera tracking system composed of four side-view cameras. For simplicity, we call it Camera Network 1. The cameras were mounted at a height of about 3 m at each corner of a room (8.6 m by 4.8 m). The resolutions of the all videos are 780 by 580 pixel (Figure 3). We obtain the intrinsic parameters by [7]. We compare our method to the calibration method of Hödlmoser et al. [24] and Guan et al. [14]. Figure 3 shows an example of the detected bottom and top positions of the pedestrians of the person in a scene. For single person case, we apply the refinement method which proposed in [14] to obtain the final extrinsic parameters ( Table 2). Table 2 shows that our method has slightly more accurate results than state-of-the-art methods. The differences in accuracy among different methods are small because this case is the simplest situation. Therefore, most of the existing methods can achieve very high accuracy on these video sequences. In addition, the proposed method also requires a very short amount of time to collect the sample data for the calibration. As shown in Table 3, to collect enough data for the calibration, the person only needs to walk around the room with the total accumulated moving distance is around 20 m (by normal walking speed). This is the distance of walking around the room 2-3 times. Thus, the proposed method is fast and convenient for users to calibrate camera networks. Calibration based on a single person in a mildly occluded scene. In order to show that our method works in a complex real-life environment room setup, we also evaluate our method on a setup with three cameras in a kitchen (Figure 4). The cameras were mounted at a height of about 2 m at different corners of a room. The resolutions of the videos are 640 by 480 pixels. We call it Camera Network 2 for simplicity. Table 2 shows that our method outperforms the method proposed by Guan et al. [14]. Note that we also recorded a scene where the person was cleaning the kitchen floor. Despite the movement and occlusions while cleaning the floor, the proposed method still produces a reasonable accuracy (the triangulation error is less than 10 cm) as shown in Tables 4 and 5. It shows the stability and robustness of our method to the occlusion.   Robust extrinsic calibration. In Table 2, we present the calibration results based on different locations in the scene to evaluate the performance of the proposed method in a general sense. However, in practice, we have to collect the data consecutively rather than select arbitrary locations in space.
Thus, we also conduct another experiment of camera calibration based on a segment with regards to the amount of time that we use to collect data. Table 6 shows the successful rate to have accurate extrinsic parameters with the proposed method with all samples and a random-sample scheme. We apply the refinement method which proposed in [14] to obtain the final extrinsic parameters in Table 6. While cleaning the floor, we cannot keep one posture from the beginning until the end. Typically, we have to lean forward or to kneel for cleaning different places in the room. Thus, the result of our method by using all available locations on CN2-cleaning the floor sequences (Table 4) are degraded dramatically. By applying the random sample scheme, the proposed method can eliminate the locations produced from improper postures. Therefore, this improves both the successful percentage and the accuracy of the calibration. Tables 1, 5 and 6 also show that the random sample scheme improves the successful percentage of the calibration and the accuracy of the calibration in the case that the pedestrian was walking with the same posture. Table 6. Success percentages (the triangulation error is below 15 cm) within 1000 experiments of the proposed method which is conducted with the Camera Network 1 (CN1), and Camera Network 2 (CN2). Calibration with multiple walking people. We evaluate our calibration method on the EPFL-Terrace sequences [37], which is a public multi-camera pedestrian video dataset ( Figure 5). This dataset includes two sequences and 7 subjects, which were shot outside our building on a terrace with four DV cameras. In this paper, we call it Camera Network 3 for simplicity. It only takes approximately 270 s and 210 s on EPFL-Terrace dataset and CN3 to solve the correspondences and obtain the extrinsic parameters, respectively (we implemented the code to run on Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz with Python 3). Table 7 also shows very high accuracy results for very challenging multiple people sequences. It proves the propsed scheme to correspondences in the video sequences can deal with outliers and failed tracking results to provide very appropriate pairs of correspondences. Table 7. The calibration results on EPFL-Terrace (CN3) and Camera Network 5 (CN5). For EPFL-Terrace (CN3) and Camera Network 5 (CN5), we apply the proposed method on the first half of the video to obtain the extrinsic parameters. δr (w) , δu (p) , δu (r) are not available for these sequences. Thus, the δu (rr) denotes relative reprojection error is the only one available for these sequences. Comparison between calibration based on single walking person and calibration based on multiple walking. We evaluate the proposed camera calibration based on multiple pedestrians by calibrating the Camera Network 1 with an extra sequence to compare with the result of the calibration based on a single person. This sequence was shot with 3 subjects walking at the same time in the empty room ( Figure 6). We calibrated Camera Network 4-CN4 (Figures 7 and 8) to compare the performance of the calibration in both single person and multiple people situations. This network has five cameras, three of which are mounted at a height of about 3 m, and two cameras were mounted at a height of around 2 m. The resolution of all cameras is 780 by 580 pixels. In the experiments, we also change the orientation and locations of the cameras in the Camera Network 4 to verify the performance of the proposed method from different viewpoints.   Table 2 and Table 8 show the results of the proposed method in the case that there is only one subject in the scene of Camera Network 1 and Camera Network 4, respectively. Table 9 shows the results of the proposed method in multiple walking people case which is reasonably close to the calibration results in the single person case. It shows that the proposed method works well in both cases.  Table 9. Calibration results of camera calibration based on multiple walking people of the Camera Network 1 and the Camera Network 4. Note that, for Camera Network 1, we also apply the refinement method which proposed in [14] to obtain the final extrinsic parameters. δr (w) , δu (p) , δu (r) , and δu (rr) denotes the triangulation error, projection error, reprojection error, and relative reprojection error, respectively. In general, using the feet to calibrate the camera network produce much better results in single-person cases because the he (k) z vector is much longer which makes centerline vector estimation becomes more consistent. In the single person case, we can collect the data of the walking person continuously. Even in the case of occlusion, we can easily associate the data because they belong to only one person. However, in multiple pedestrian case, the data of one person normally are scattered and labeled with different indices due to the occlusion and tracking failures. Thus, the number of useful samples that we could extract in the same amount of time can be reduced. However, the proposed extrinsic calibration method based on multiple walking people in Section 3.4 still obtains the accuracy which is close to the single person case (Tables 8 and 9). Finally, using hip joints also improves the calibration results where the bottom parts of the pedestrians are hard to observe (e.g., S1 and S5 in Table 9).

Calibration with Uncontrolled Environment
Calibration at intersections. To show that our method can be applied to a real-life situation, we also recorded several video sequences at an intersection in Ghent to evaluate the proposed method ( Figure 9). The pedestrians in this scene are quite small (about 60 pixels height). We call it Camera Network 5 for simplicity.   Table 2 shows our method has reasonably low error among different circumstances. However, in the intersection case, the pedestrians appear in some regions that are too small to detect by the human pose estimation. In addition, when the trajectories of the pedestrians are too short, the estimated extrinsic parameters normally have high relative reprojection error. Thus, the matching rates of them are too small to be selected by the proposed method in Section 3.4. Hence, the proposed method also could not find the samples at some regions of the scene, which lead which leads to a higher relative reprojection error for those regions. However, the errors in these regions are still small enough for applications like multi-camera tracking for the intersection scenes.

Calibration with Crowded Scene
In this work, we also apply the proposed method on PETS2009-S2L1 dataset (CN6), which is a crowded scene with multiple pedestrians [38]. Due to the small resolution of the pedestrians in the scenes, the results of OpenPose [15,16] are not stable enough to calibrate the camera network. Thus, we only use the detected bounding boxes of the pedestrians which are extracted by YOLO [31]. We choose the center of the top edge and the center bottom edge of the bounding boxes as top and bottom of the pedestrians. We select the coordinate system of the camera 001 as the world coordinate system. To make the method work on this dataset, we also remove the limitation of the number of pedestrians that appear in the scene. Because of the orientation of the camera view 004, the tracking algorithm has a poor performance on this scene. Note that, the camera view 002 of the dataset is not available. Therefore, the result of the proposed method is not available for this scene. Even though the proposed method only works best for the scene that does not have so many pedestrians, the relative reprojection errors showed in Table 10 are still reasonably low. Figure 10 shows the example of the detected top and bottom of PETS2009-S2L1 dataset.

Calibration with without Feet and Hip Joints
We evaluated the propsed method in Section 3.5 with Camera Network 7-a heavily occluded scene in an office ( Figure 11). Since the scene is heavily occluded, the feet information is not enough for the extrinsic calibration based on feet joints. The extrinsic calibration based on hip joints is also not always available for all camera views. Table 11 shows that the proposed extrinsic calibration based on just the position of the top and the centerline can still obtain the accuracy that is very close to the extrinsic calibration based on hip joints. In the case of camera pair 003-004, the extrinsic calibration based on hip joints cannot collect enough data to estimate extrinsic parameters on single person sequences. On the other hand, the proposed method in Section 3.5 is still able to produce the extrinsic parameters with low relative reprojection errors. Figure 11 shows the example of the detected top and bottom of Camera Network 7.

Conclusions
In this paper, we present a simple and robust method to leverage the human pose estimation for the computation of 3D positions of the top and bottom of the pedestrians. To handle the case where multiple pedestrians are in the scene, we also developed a brute-force method to select appropriate top and bottom locations for the extrinsic camera calibration. For indoor camera networks which are intended for people surveillance, the feet of pedestrians are usually occluded by the furniture. This is the degenerate for most of the current state-of-the-art calibration methods due to the coplanarity of all the positions of the top of a single pedestrian. To the best of our knowledge, no work exists to deal with camera network calibration for this specific scenario. We proposed the extrinsic calibration by using a walking human as the calibration object, assuming only the top of the pedestrian and centerline of the person are visible. Thus, the proposed method can be very useful for many of the existing indoor multi-camera visual surveillance systems. The proposed method can work well in various environments as well as robust against occlusion compared to state-of-the-art methods. More importantly, the proposed method can work completely automatically without manually selecting proper input data for the calibration method. In the future, we will investigate a regional selection method to handle the case where the walking trajectory is too short.

Conflicts of Interest:
The authors declare no conflict of interest.