2.1. Camera Motion Trajectory Analysis and Reconstruction Process Design
As discussed above, the collected soil volume can be obtained by comparing topographic reconstruction results, and one of the key processes is completely and precisely reconstructing the collection area. The camera position and attitude considerably influence the results of multiview reconstruction. Unlike in a ground test, a camera fixed on the mechanical arm of a lander cannot be controlled to obtain the best position and angle according to the research needs, and these factors are limited by the inherent trajectory of the acquisition task. In a soil collection task, there are four common types of camera trajectories, as shown in
Figure 1.
Due to terrain undulation, the images taken with trajectories (b) and (d) may contain many occlusions that create gaps in the reconstruction results; therefore, these trajectories are not suitable for the subsequent point cloud comparison and volume calculation. For types (c) and (d), the intersection angle for the area directly below the camera is too small, which results in low reconstruction accuracy. Type (a) is therefore the most suitable for collection area reconstruction. In the actual image acquisition process, the camera cannot be made to follow the ideal trajectory exactly, so this factor should be considered in the subsequent method design, and inappropriate images should be removed from the image sequence.
The basic 3D reconstruction process of the soil collection area is designed as shown in
Figure 2. Because mature algorithms are already available for feature matching and dense reconstruction, they are not the emphasis of this paper. The high-precision orientation of the camera at each moment is one of the most important aspects of multiview reconstruction, and it directly determines the accuracy of the final result. Therefore, this section focuses on the high-precision orientation of the camera for reconstructing the collection area, and all images used for reconstruction should meet the trajectory requirements discussed above. An incremental method is designed according to the characteristics of the acquired image sequence analyzed above, so that the associated collection amount can subsequently be calculated.
The internal parameters of the fixed-focus camera are calibrated in advance. A sequence of images of the area is obtained from multiple angles, and these partially overlapped images are then utilized. Because many images are obtained, it is necessary to design a reasonable process based on image registration to ensure the reliability and efficiency of the algorithm [
5]. To perform registration between images, features must be extracted from each image; specifically, the features of two images must be matched, and RANSAC (Random Sample Consensus) [
6] is used to eliminate mismatches. Considering the specific trajectory of the camera, there may be a scaling relationship between adjacent images in the overlapping area. Therefore, SIFT (Scale-invariant Feature Transform) [
7] is selected as the feature detector because it is invariant to image scaling. By analyzing the matching relationships between feature points in the images, the relevant matching links can be obtained. Each matching link consists of the feature points that correspond to the same point in space. With these matching links, the relations among all feature points can be described, and the relative orientation information of the cameras at all moments can be obtained according to the following steps.
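As a concrete illustration of the feature extraction and mismatch removal just described, the following is a minimal sketch using OpenCV's SIFT detector and a RANSAC filter on the fundamental matrix; the function and variable names are illustrative, not the paper's implementation.

```python
# Sketch of the feature-matching stage: SIFT features, Lowe's ratio test,
# and RANSAC on the fundamental matrix to eliminate mismatches.
import cv2
import numpy as np

def match_pair(img_a, img_b, ratio=0.75):
    """Return the RANSAC-consistent matched points between two grayscale images."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)

    # Nearest-neighbour matching with Lowe's ratio test.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = []
    for pair in matcher.knnMatch(des_a, des_b, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])

    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])

    # RANSAC removes matches that violate the epipolar geometry of the pair.
    _, inlier_mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, 1.0, 0.999)
    keep = inlier_mask.ravel() == 1
    return pts_a[keep], pts_b[keep]
```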
2.2. Initial Image Pair Selection and Initial Point Cloud Acquisition
First, two initial images are selected for relative orientation and forward intersection to obtain the initial point cloud of feature points. The initial point cloud is then used in a resection step with the other images to determine the camera position and orientation at each moment, and each newly oriented camera is included in the forward intersection calculation to expand the point cloud. By iterating this process, the camera position and orientation can be obtained for the entire image sequence, and the positions of all feature points can be determined in the coordinate system of the initial point cloud. The initial image pair determines the initial point cloud, and the initial point cloud is the basis for the subsequent camera orientation calculations. Therefore, in the camera orientation process for reconstructing the collection area, the selection of the initial image pair is very important.
The positions of the cameras at each moment in the lunar soil collection task are distributed almost along a straight line. In this case, the selection of the initial image pair greatly affects the orientation accuracy of all cameras. The original method directly selects the two images that share the most matching pairs. However, if the camera positions corresponding to the initial image pair are not closely related to the camera positions at the other moments, as shown in Figure 3a, cumulative errors will be introduced in the process of sequentially orienting the cameras. This error can notably reduce the accuracy of all related calculations.
Most of the methods used to solve the above problem introduce the concept of the matching-link length [8]. If $K$ represents a matching link in the image sequence, its length is the number of feature points that correspond to the same spatial point $p_k$; this value is denoted $L_k$. A large $L_k$ means that the spatial point is observed in many images. In the previous literature [8], each pair of images is analyzed, and if the number of matching pairs between the two images exceeds a specified threshold, the sum of the matching-link lengths over the two images is computed. All results are then sorted, and the two images with the largest sum form the initial image pair. This method can achieve satisfactory results; however, subsequent studies have found that the resulting pair may not be optimal in some cases, as shown in Figure 3b, in which the overlapping area of the two images is small and the number of points in the initial point cloud obtained by intersection is not sufficiently large. This approach is therefore not conducive to improving the orientation precision of the other cameras.
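The matching links themselves can be assembled from the pairwise matches before this selection. Below is a minimal union-find sketch; the (image id, feature id) observation keys and helper names are illustrative assumptions, not the paper's data structures.

```python
# Build matching links (feature tracks) from pairwise matches: every set of
# observations connected by matches corresponds to one spatial point, and the
# set size is the link length L_k defined above.
from collections import defaultdict

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def build_matching_links(pairwise_matches):
    """pairwise_matches: iterable of ((img_p, feat_i), (img_q, feat_j)) tuples."""
    uf = UnionFind()
    for obs_a, obs_b in pairwise_matches:
        uf.union(obs_a, obs_b)
    links = defaultdict(list)
    for obs in list(uf.parent):
        links[uf.find(obs)].append(obs)
    return list(links.values())  # each list is one matching link; its len() is L_k
```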
To address the deficiency identified above, this paper improves the initial image pair selection method. Let $U_i$ represent the average matching-link length over all feature points in image $i$:

$$U_i = \frac{1}{M_i}\sum_{k=1}^{M_i} L_k,$$

where $M_i$ is the number of feature points in image $i$ and $L_k$ is the matching-link length of the $k$-th feature point of that image. For two images $p$ and $q$, the feature point pair relationships are used to determine their relative orientation. The intersection angle of each matching pair is then calculated, and matching pairs whose angle is below the intersection threshold (generally set to 2°) are excluded. The number of remaining matching pairs is recorded as $F_{p,q}$. The initial image pair is then determined with the following weighted score:

$$J_{p,q} = \omega_1 \left( U_p + U_q \right) + \omega_2 F_{p,q},$$

where $\omega_1$ and $\omega_2$ are weighting coefficients.
The first term on the right side of the above equation describes the degree of association between the image pair ($p$, $q$) and the other images in the sequence, and the second term describes the degree of association between images $p$ and $q$; $J_{p,q}$ is the weighted fusion of the two terms. $J$ is calculated for each pair of images, the results are sorted, and the two images with the maximum value are selected as the initial image pair. This method not only ensures that the image pair is closely related to the other images, but also ensures that the two images share many common feature points, as shown in Figure 3c.
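As an illustration, the improved pair selection can be sketched as follows; the dictionary layout and the weighting values are placeholders, since only the general weighted form of the score is specified above.

```python
# Sketch of the initial-pair selection: combine the average matching-link
# lengths of both images (association with the rest of the sequence) with the
# number of angle-screened matches between them (association with each other).
def select_initial_pair(avg_link_len, filtered_matches, w1=0.5, w2=0.5):
    """avg_link_len: dict image_id -> U_i.
    filtered_matches: dict (p, q) -> F_pq, matches remaining after the
    2-degree intersection-angle screening."""
    best_pair, best_score = None, float("-inf")
    for (p, q), f_pq in filtered_matches.items():
        score = w1 * (avg_link_len[p] + avg_link_len[q]) + w2 * f_pq
        if score > best_score:
            best_pair, best_score = (p, q), score
    return best_pair
```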
After determining the initial image pair, the relative position and orientation information must be obtained, namely the rotation matrix $R$ and the translation vector $T$ between the two cameras. Many relative orientation methods [9,10,11] can be used in this step. For convenience, the two images and the corresponding cameras are hereafter referred to as the left and right images and the left and right cameras. The relative position and attitude of the left and right cameras are determined as follows [12]:
$$f_i = \mathbf{b}^{T}\left(\mathbf{x}_{l,i} \times R\,\mathbf{x}_{r,i}\right) = \mathbf{b}^{T}\left[\mathbf{x}_{l,i} \times \left(x_{r,i}\,\mathbf{r}_1 + y_{r,i}\,\mathbf{r}_2 + \mathbf{r}_3\right)\right] = 0$$

In the formula, $\mathbf{b}$ is the direction vector from the right camera's optical center to that of the left camera, $\mathbf{x}_{l,i}$ and $\mathbf{x}_{r,i} = (x_{r,i}, y_{r,i}, 1)^{T}$ are the normalized coordinates of the $i$-th matched feature points in the left and right images, and $\mathbf{r}_1$, $\mathbf{r}_2$, $\mathbf{r}_3$ are the column vectors of the rotation matrix $R$ between the left and right cameras. Each feature matching pair provides one constraint equation $f_i$, so a system of equations can be formed from multiple matching pairs; the three rotation angles of $R$ and the vector $\mathbf{b}$ are the parameters to be estimated. Since $\mathbf{b}$ is a direction vector, it contains only two unknown parameters, so there are five unknowns in total, and five feature matching pairs are theoretically sufficient to obtain an analytic solution of the system [12]. In practice, there are far more than five matching pairs. To obtain accurate relative orientation information, the constraint equations $f_i$ are linearized with a first-order Taylor expansion, and the optimal solution is obtained with the Gauss–Newton or Levenberg–Marquardt iterative optimization method by minimizing the objective function $\sum_i f_i^2$. The coordinate system of the left camera is set as the reference system, and the position and orientation of the right camera are determined in this reference system. This step yields the relative orientation of the initial image pair.
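For reference, the same relative orientation can also be recovered with OpenCV's five-point essential-matrix estimator followed by a cheirality check. The sketch below is one standard alternative to the coplanarity-equation adjustment described above, not the paper's implementation; it assumes the calibrated intrinsic matrix K and matched pixel coordinates pts_l, pts_r.

```python
# Relative orientation of the initial pair via the five-point algorithm:
# RANSAC on the essential matrix, then decomposition into R and a unit-length
# translation direction t (the five parameters noted above).
import cv2
import numpy as np

def relative_orientation(pts_l, pts_r, K):
    E, mask = cv2.findEssentialMat(pts_l, pts_r, K,
                                   method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, pose_mask = cv2.recoverPose(E, pts_l, pts_r, K, mask=mask)
    return R, t, pose_mask  # t is defined only up to scale
```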
Because the internal and external parameters of the first two cameras are known based on the above steps, the initial point cloud can be constructed using the forward intersection method to determine the spatial coordinates of the points corresponding to the feature matching pairs. The precision of the initial point cloud influences whether the subsequent algorithm can be effectively executed. Inaccurate calculations may cause failure during the resection step. Therefore, it is necessary to remove the points with large error from the point cloud. In the process of binocular intersection, an inappropriate intersection angle (too large or too small) will greatly reduce the precision of calculating the spatial positions of intersection points. The solution involves a screening procedure in which the intersection angle of each point is calculated, and if the angle is less than the threshold θ° or larger than (180 − θ)°, the point is removed. Experiments have shown that setting θ to 2 can meet the accuracy requirements of most situations. If the number of feature points is sufficiently large, the threshold value can be increased to obtain higher precision.
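A minimal sketch of this forward intersection and angle screening follows; the projection matrices P_l, P_r and the optical centres C_l, C_r are assumed to come from the orientation step above, and the variable names are illustrative.

```python
# Triangulate matched points from the two oriented cameras and drop points
# whose ray intersection angle is below theta degrees or above (180 - theta).
import cv2
import numpy as np

def triangulate_and_filter(pts_l, pts_r, P_l, P_r, C_l, C_r, theta_deg=2.0):
    X_h = cv2.triangulatePoints(P_l, P_r, pts_l.T, pts_r.T)   # 4xN homogeneous
    X = (X_h[:3] / X_h[3]).T                                   # Nx3 Euclidean

    ray_l = X - C_l                                            # rays from each optical centre
    ray_r = X - C_r
    cos_a = np.sum(ray_l * ray_r, axis=1) / (
        np.linalg.norm(ray_l, axis=1) * np.linalg.norm(ray_r, axis=1))
    angle = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))

    keep = (angle > theta_deg) & (angle < 180.0 - theta_deg)
    return X[keep]
```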
2.3. Camera Orientation at Each Moment and Dense Point Cloud Computation for the Collection Area
After the calculation of the initial point cloud, the positions and orientations of the other cameras can be determined by the resection method. The other images are ranked by the number of feature points they share with the initial point cloud, and the image with the most points is selected for resection. This is only an initial value calculation, and the position and orientation are subsequently optimized; therefore, the PnP monocular pose estimation algorithm can meet the accuracy requirements of this approach [13,14]. The algorithm requires at least 6 points [15], and in most cases the number of feature points is far greater than this. As long as adjacent images in the sequence share certain overlapping regions, the conditions for position and orientation estimation are satisfied. After determining the relative position and attitude relationship between the camera at the current moment and all other oriented cameras, it is necessary to assess whether the corresponding image meets the reconstruction requirements discussed at the end of Section 2.1. If the position and orientation conditions of a camera are not satisfied, the corresponding image must be deleted from the sequence.
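One way to realize this resection step is a RANSAC-based PnP solver; the sketch below uses OpenCV's solvePnPRansac as an illustrative choice, with placeholder arrays obj_pts (3D points from the cloud) and img_pts (their observations in the new image).

```python
# Resection of a new camera: 2D features that already have 3D counterparts in
# the point cloud give the initial pose, later refined by bundle adjustment.
import cv2
import numpy as np

def resect_camera(obj_pts, img_pts, K):
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj_pts.astype(np.float64), img_pts.astype(np.float64), K, None,
        flags=cv2.SOLVEPNP_EPNP, reprojectionError=3.0)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> rotation matrix
    return R, tvec
```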
After resection is complete, forward intersection between the newly oriented camera and the other oriented cameras is used to expand the point cloud. To control the cumulative error caused by the propagation of computational errors, the position and orientation parameters of every camera and the positions of the points in the point cloud should be optimized. Here, bundle adjustment is used to perform this optimization: the sum of the reprojection errors of each point in each image is taken as the objective function, and the position of each point in the point cloud and the camera orientation at each moment are the parameters to be optimized. At this point, the third image has been oriented and the sparse point cloud has been expanded. The next image is then selected and oriented, and the above steps are repeated until the camera at every moment has been oriented in the reference system.
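A condensed sketch of the bundle-adjustment objective is given below, using SciPy's least_squares as one possible optimizer; the parameter packing (6 pose values per camera, 3 coordinates per point) is an illustrative assumption rather than the paper's exact implementation.

```python
# Bundle adjustment: the residual vector stacks the reprojection error of every
# observed point in every image, and camera poses and point positions are
# refined jointly.
import numpy as np
import cv2
from scipy.optimize import least_squares

def reprojection_residuals(params, n_cams, n_pts, K, cam_idx, pt_idx, obs_2d):
    poses = params[:n_cams * 6].reshape(n_cams, 6)      # rvec (3) + tvec (3) per camera
    points = params[n_cams * 6:].reshape(n_pts, 3)
    residuals = []
    for c, p, uv in zip(cam_idx, pt_idx, obs_2d):
        rvec, tvec = poses[c, :3], poses[c, 3:]
        proj, _ = cv2.projectPoints(points[p:p + 1], rvec, tvec, K, None)
        residuals.append(proj.ravel() - uv)
    return np.concatenate(residuals)

def bundle_adjust(x0, n_cams, n_pts, K, cam_idx, pt_idx, obs_2d):
    res = least_squares(reprojection_residuals, x0, method="trf",
                        args=(n_cams, n_pts, K, cam_idx, pt_idx, obs_2d))
    return res.x   # refined poses and point positions, packed as in x0
```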
Based on the steps above, the camera at each moment has now been accurately oriented. Using these orientation results, the dense point cloud of the collection area, which contains the 3D information of the surface, can be obtained with a stereovision method [16]. The basic concept is that, for any pixel in any image, the corresponding position in adjacent images can be located by patch matching, and the spatial position of the pixel is then obtained by intersection. Processing all the pixels in the image sequence and fusing the corresponding spatial points yields the final dense point cloud. Many mature methods have been designed based on this concept, and they differ slightly in application and actual effect. Here, Zhu's robust, high-precision, occlusion-tolerant multiview stereovision method [17] is selected to reconstruct the dense point cloud of the soil collection area because of its high performance and its ability to overcome occlusion effects. As can be seen from Figure 4a, there are slight occlusions at the bottom of some small stones at some angles, which is unavoidable; many multiview dense reconstruction algorithms cannot reconstruct these parts completely, whereas Zhu's method obtains better results. Based on the accurate camera orientation data obtained above, the desired dense point cloud can be effectively reconstructed; a sample of the results is shown in Figure 4b.
The scale factor is not determined in the camera orientation process, so the reconstructed result differs from the actual area by a scale factor. To obtain the reconstruction at the actual scale, a usable scale datum should be established in the collection area, and the scale factor can then be calculated from the ratio between the reconstructed distance and the actual distance between reference points.
In soil collection tasks, it is unnecessary to place a physical scale datum in the collection area. In fact, the motion information of the acquisition arm is provided by motion sensors installed on the arm, and the position of the camera's optical center relative to the arm is fixed. As long as the relative positional relationship between the camera's optical center and the acquisition arm has been calibrated on Earth with a hand-eye calibration method [18], the actual distance moved by the camera's optical center between any two image acquisition moments can be obtained. Because the position of the optical center in the reference system has already been accurately determined during orientation, the movement distance of the optical center between two moments can be regarded as a virtual scale datum and used to calculate the scale factor. Based on the calculated scale factor, the reconstructed dense point cloud of the collection area can be scaled to the actual size.
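The scale recovery can be summarized in a few lines; the sketch below assumes the two optical-centre positions in the reconstruction frame and the corresponding physical baseline from the arm sensors, with illustrative variable names.

```python
# Scale the reconstruction to metric units: the known physical distance the
# optical centre moved between two exposures divided by the same distance in
# the reconstruction's arbitrary units gives the scale factor.
import numpy as np

def scale_to_metric(cloud, C_i_recon, C_j_recon, baseline_metric):
    """cloud: Nx3 reconstructed points; C_i_recon, C_j_recon: optical-centre
    positions of two moments in the reconstruction frame; baseline_metric:
    the actual distance between those optical centres."""
    baseline_recon = np.linalg.norm(C_j_recon - C_i_recon)
    scale = baseline_metric / baseline_recon
    return cloud * scale
```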