2.1. Vibration Effects on a Multi-Frequency Phase-Shifting Sequence
The basic components of a multi-frequency, phase-shifting 3D sensor are one projector and two cameras, one on either side of it. The measurement process is the same for each camera; its essence is the continuous acquisition of a sequence of images. Each camera generates its own unwrapped phase map, and the two maps are then combined into 3D point cloud data based on the binocular vision principle. The analysis of the vibration effects therefore only needs to be performed on one camera.
For one camera, when a sequence of images is affected by vibrations, the spatial constraints within the sequence are destroyed. By analyzing the structure of the sequence, we can determine how the sequence is affected and then detect and compensate for the effects of vibrations.
In phase-shifting methods, a series of sinusoidal fringes along the horizontal axis of the projector image frame, with a constant phase shift, is projected onto a target object and two cameras synchronously capture the phase-encoded fringe images [15]. In particular, the captured images of the cameras can be expressed as:

$$I_{i,j}(x,y) = A(x,y) + B(x,y)\cos\left(\varphi_j(x,y) + i\delta\right), \qquad (1)$$

where $(x,y)$ denotes the pixel coordinates, which will be omitted in the following expressions; $I_{i,j}$ is the recorded intensity of the $i$th frame at the $j$th frequency; $A$ is the average intensity; $B$ is the modulation intensity; $\delta$ is the constant phase shift; and $\varphi_j$ is the desired phase information of the $j$th frequency. If $N$ is the number of frames within a single frequency, we have:

$$\varphi_j = \arctan\left(\frac{-\sum_{i=0}^{N-1} I_{i,j}\sin(i\delta)}{\sum_{i=0}^{N-1} I_{i,j}\cos(i\delta)}\right), \qquad \delta = \frac{2\pi}{N}. \qquad (2)$$
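For illustration, the following is a minimal NumPy sketch of the phase recovery in Equation (2); the function name and the frame-stacking convention are ours and not part of the original formulation.

```python
import numpy as np

def wrapped_phase(frames):
    """N-step phase recovery: frames[i] = A + B*cos(phi + i*delta), delta = 2*pi/N.
    Returns the wrapped phase phi in (-pi, pi]."""
    frames = np.asarray(frames, dtype=np.float64)             # shape (N, H, W)
    n = frames.shape[0]
    deltas = 2.0 * np.pi * np.arange(n) / n
    num = np.tensordot(np.sin(deltas), frames, axes=(0, 0))   # sum_i I_i * sin(i*delta)
    den = np.tensordot(np.cos(deltas), frames, axes=(0, 0))   # sum_i I_i * cos(i*delta)
    return np.arctan2(-num, den)                              # Equation (2)
```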
The calculation of $\varphi_j$ is called phase recovery. However, $\varphi_j$ is wrapped due to the periodicity of trigonometric functions. In order to unwrap $\varphi_j$ to an absolute phase $\Phi_j$, the values of $\varphi_j$ for different frequencies $j$ are needed to perform a heterodyne [9], which means using a combination of multiple frequencies to produce a frequency lower than any of these frequencies. Considering the unidirectionality of the fringe, $\varphi_j(x,y)$ and $\Phi_j(x,y)$ can be simplified to $\varphi_j(x)$ and $\Phi_j(x)$, respectively. As shown in Figure 1, the frequencies of the phase functions $\varphi_1$ and $\varphi_2$ have to be chosen in a way that the resulting beat function $\varphi_{12}$ is unambiguous over the field of view. For the situation of $j = 1, 2$, $\lambda_1$ and $\lambda_2$ are the corresponding wavelengths of $\varphi_1$ and $\varphi_2$, respectively, and the heterodyne wavelength $\lambda_{12}$ can be solved according to the following equation:

$$\lambda_{12} = \frac{\lambda_1\lambda_2}{\lambda_2 - \lambda_1}. \qquad (3)$$

If $\Phi_1$ and $\Phi_2$ are the unwrapped phases of $\varphi_1$ and $\varphi_2$, respectively, it is easy to get:

$$\Phi_1\lambda_1 = \Phi_2\lambda_2, \qquad (4)$$

$$\Phi_1 - \Phi_2 = \varphi_{12}, \qquad (5)$$

where $\varphi_{12} = (\varphi_1 - \varphi_2) \bmod 2\pi$ is the beat phase with wavelength $\lambda_{12}$. From Equations (4) and (5), we have:

$$\Phi_1 = \frac{\lambda_{12}}{\lambda_1}\,\varphi_{12}, \qquad (6)$$

or

$$\Phi_1 = \varphi_1 + 2\pi\,\mathrm{Round}\!\left(\frac{(\lambda_{12}/\lambda_1)\,\varphi_{12} - \varphi_1}{2\pi}\right), \qquad (7)$$

in which $\mathrm{Round}(\cdot)$ is the rounding function. Therefore, we get a lower frequency $\varphi_{12}$ through $\varphi_1$ and $\varphi_2$. Similarly, we can continue the multi-frequency heterodyne if we have more frequencies, until the final beat phase has only one cycle in the entire field of view.
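A minimal sketch of the two-frequency heterodyne in Equations (3)–(7), assuming the beat wavelength already spans the whole field of view (symbol and function names are ours):

```python
import numpy as np

def heterodyne_unwrap(phi1, phi2, lam1, lam2):
    """Unwrap phi1 (wavelength lam1) using phi2 (wavelength lam2 > lam1).
    Assumes the beat wavelength lam12 has only one cycle over the field of view."""
    lam12 = lam1 * lam2 / (lam2 - lam1)                            # Equation (3)
    phi12 = np.mod(phi1 - phi2, 2.0 * np.pi)                       # beat (lower-frequency) phase
    k1 = np.round((lam12 / lam1 * phi12 - phi1) / (2.0 * np.pi))   # fringe order, Equation (7)
    return phi1 + 2.0 * np.pi * k1                                 # unwrapped phase Phi_1
```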
From the above, the multi-frequency phase-shifting method can be divided into two parts: the phase recovery and the multi-frequency heterodyne. Both depend on a pixel correspondence between the frames, which means that the same pixel coordinates $(x,y)$ in different frames from the same camera must represent the same point on the object. In the quiescent state, the frames are completely coincident in space and the scenes contained therein are identical, but when a sequence is affected by a vibration, the coincidence between the frames is destroyed and there is a certain degree of displacement between them.
In a multi-frequency, phase-shifting sequence of 
m frequencies with 
n phases for each frequency, the motion between the frequencies can be regarded as the motion between the 
ith frame of one frequency and the 
ith frame of the neighboring frequency. Likewise, the motion between phases can be regarded as the motion between a frame and its neighboring frame within the same frequency. It is easy to understand that the displacement between frequencies is 
n times that between the phase-shifting frames. For example, in a two-frequency, three-step phase-shifting sequence affected by linear motion, there are six frames (
Figure 2). Frames 1–3 belong to the first frequency and frames 4–6 belong to the second. The displacement between neighboring frames within a frequency is ∆. Taking the second frame of each frequency as the reference, the displacement between the two frequencies is 3∆, which is three times that within the three-step phase-shifting subsequence. Based on this analysis, it can be considered that the multi-frequency heterodyne process is more susceptible to vibration than the phase recovery process.
To quantitatively clarify this concept, we designed the following experiment. We considered a three-frequency, four-step phase-shifting image sequence of a piece of paper; a piece of paper was selected as the target in order to obtain a relatively linear unwrapped phase for easy comparison. We projected the 12 images of a three-frequency, four-step phase-shifting sequence onto the flat white paper and captured them synchronously with the camera. We then moved each image ∆ pixels to the right relative to the previous frame, in the direction perpendicular to the fringes, which means that the ith image is moved (i − 1)∆ pixels from its origin. Figure 3 shows the wrapped and unwrapped phases (obtained using the aforementioned methods) with different ∆s, before and after the movement. The thick red lines are from the original sequence, the thin blue lines are from the moved sequence, and the green line is the phase error. In each graph, the abscissa is a pixel interval with a width of 140 pixels (unit: pixel), and the ordinate is the phase value (unit: rad). From the experiment, we found that as ∆ rose from 0.03 to 0.08, the unwrapped phase error kept growing, while the wrapped phase error remained close to 0 with no significant change. Not until ∆ reached the considerable value of 0.25 did the wrapped phase error become easy to identify, and by then the unwrapped phase error was even more significant. This result means that, as the vibration intensity gradually increases, the multi-frequency heterodyne process is affected before the phase recovery process. In other words, below a certain vibration intensity, the phase-shifting subsequence is unaffected, but the constraints between multiple frequencies are destroyed, which is exactly the case of the relatively low frequency vibration discussed in this paper.
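The synthetic motion used above can be reproduced with a simple sub-pixel shift; the sketch below assumes SciPy's ndimage is available and that the fringes run vertically, so the perpendicular direction is the image x axis.

```python
import numpy as np
from scipy.ndimage import shift as subpixel_shift

def simulate_linear_motion(frames, delta_px):
    """Move the i-th frame (i-1)*delta_px pixels to the right of its origin,
    emulating linear motion perpendicular to the fringes during acquisition."""
    moved = []
    for i, frame in enumerate(frames):
        dx = i * delta_px                                      # the first frame stays put
        moved.append(subpixel_shift(frame.astype(np.float64),
                                    shift=(0.0, dx),           # (rows, cols)
                                    order=1, mode='nearest'))  # bilinear interpolation
    return np.asarray(moved)
```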
In the case of relatively low frequency vibrations, the motion error within the phase-shifting sequence of a certain frequency is small enough to be omitted or is easily removed using temporal phase-recovery algorithms, but the pixel correspondence between frequencies might be damaged. In these situations, there will be wrong phase unwrapping results and destroyed 3D reconstructions. As 
Figure 4 shows, the 3D reconstruction from a multi-frequency, phase-shifting sequence in motion will have “broken” surfaces, which is the main form of motion error in a multi-frequency phase-shifting sequence. As seen in 
Figure 3 when ∆ = 0.03 and ∆ = 0.08, the 3D data affected by the vibration may be intact in a localized region, but have a “broken” surface globally. This is different from a vibration-affected phase-shifting subsequence, where there will be global ripples, outliers, etc.
  2.2. Vibration Detection and Motion Compensation
In the multi-frequency phase-shifting method, the information is redundant if the ambient light image, $A$, can be regarded as a constant [16]. For the $N$-step phase-shifting pattern sequence, knowing that $\sum_{i=0}^{N-1}\cos\left(\varphi + i\delta\right) = 0$ with $\delta = 2\pi/N$, we can easily obtain:

$$\sum_{i=0}^{N-1} I_i = N A, \qquad (8)$$

which means that the linear superposition of the fringe images will eliminate the streak component. If multiple reflections are ignored, the resulting image is no different from a uniformly illuminated image. In fact, in the measurement of non-highly-reflective objects, the superposition image is very close to the uniformly illuminated image; the absolute difference is almost negligible. In a multi-frequency phase-shifting sequence, the superposition of each frequency results in a uniformly illuminated image (which is known as a virtual frame), and the whole sequence can be fused into a series of uniformly illuminated virtual frames.
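The virtual frame of Equation (8) is simply the superposition (here, the mean) of one phase-shifting subsequence; a minimal sketch with a synthetic check:

```python
import numpy as np

def virtual_frame(frames):
    """Fuse an N-step phase-shifting subsequence into a virtual frame.
    The cosine terms cancel, so the mean approaches the ambient image A."""
    return np.asarray(frames, dtype=np.float64).mean(axis=0)

# quick synthetic check: four-step fringes over one row of pixels
x = np.linspace(0.0, 8.0 * np.pi, 640)
A, B = 128.0, 100.0
frames = [A + B * np.cos(x + 2.0 * np.pi * i / 4) for i in range(4)]
print(np.abs(virtual_frame(frames) - A).max())  # ~1e-13, i.e. the streaks vanish
```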
According to phase-shifting motion compensation studies, if the image sequence is affected by vibration or motion, an additional phase shift is introduced [11]. Under these circumstances, the linear superposition will still contain a streak component. With an additional phase shift $\epsilon_i$ in the $i$th frame and defining $C = \sum_{i=0}^{N-1}\cos(i\delta + \epsilon_i)$ and $S = \sum_{i=0}^{N-1}\sin(i\delta + \epsilon_i)$, we have:

$$\sum_{i=0}^{N-1} I_i = N A + B\sqrt{C^2 + S^2}\,\cos\!\left(\varphi + \arctan\frac{S}{C}\right). \qquad (9)$$

Obviously, the magnitude of the additional phase shift determines the strength of the streak component. In other words, the streak intensity indicates the magnitude of the vibration or motion. By extracting the ROI (region of interest) from the Fourier transform map of the fringe image and applying it to the superposition image, we can extract the peak of the streak frequency in the superposition image and compare it with the corresponding peak in the fringe image, as Figure 5 shows. It should be noted that Figure 5 is only a schematic diagram drawn according to the Fourier transform maps, and the data in the graph are not strictly accurate. There will be visible streaks if the vibration or motion is strong enough, but in relatively low frequency vibration situations, the additional phase shift $\epsilon_i$ may be too small for detection. In these situations, another criterion is proposed to quantitatively evaluate the motion intensity.
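One way to implement this check is to locate the fringe-frequency peak in the Fourier spectrum of a single fringe image and compare the virtual frame's spectrum at the same location; the ratio below is a hedged sketch of that idea, not the exact procedure of the original implementation.

```python
import numpy as np

def streak_ratio(fringe_img, virtual_img):
    """Ratio of the virtual frame's spectral magnitude to the fringe image's
    magnitude at the fringe (carrier) frequency; larger values indicate a
    stronger residual streak component, i.e. stronger motion."""
    F_fringe = np.abs(np.fft.fftshift(np.fft.fft2(np.asarray(fringe_img, dtype=np.float64))))
    F_virtual = np.abs(np.fft.fftshift(np.fft.fft2(np.asarray(virtual_img, dtype=np.float64))))
    h, w = F_fringe.shape
    mask = np.ones_like(F_fringe, dtype=bool)
    mask[h // 2 - 5:h // 2 + 6, w // 2 - 5:w // 2 + 6] = False   # suppress the DC region
    peak = np.unravel_index(np.argmax(F_fringe * mask), F_fringe.shape)
    return F_virtual[peak] / F_fringe[peak]
```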
If the streak intensity is lower than a certain threshold, the streak component of the virtual frame can be omitted, and grayscale feature point detection can be applied to the virtual frame. As mentioned in Section 2.1, in a motion or vibration situation, the same point on the object will have different pixel coordinates in neighboring frames. The difference between the feature point arrays of a virtual frame pair indicates the motion intensity between the two virtual frames. The L1-norm can be used as an indicator for this difference, which is positively correlated with the vibration intensity. Supposing that $P = \{p_1, \dots, p_n\}$ represents the feature point array in the first virtual frame and $P' = \{p'_1, \dots, p'_n\}$ in the second, and $n$ represents the number of points in $P$ (which is also the number of points in $P'$), we have a metric $D$ for the relative movement between the virtual frames:

$$D = \frac{1}{n}\sum_{i=1}^{n}\left\| p_i - p'_i \right\|_2. \qquad (10)$$
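A direct NumPy sketch of the metric $D$ in Equation (10) (the function name is ours; both arrays hold matched points in the same order):

```python
import numpy as np

def metric_D(P, P_prime):
    """Mean per-pair Euclidean distance between two matched feature point
    arrays of shape (n, 2): the normalized L1 norm of the distance vector."""
    P = np.asarray(P, dtype=np.float64)
    P_prime = np.asarray(P_prime, dtype=np.float64)
    return np.linalg.norm(P - P_prime, axis=1).sum() / len(P)
```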
Furthermore, a homography matrix can be calculated from $P$ and $P'$. The homography matrix is usually used to describe the transformation between two images of the same plane. Under vibration, there is relative motion between the target and the camera. If we regard the target position as fixed, then the camera pose differs between two consecutive frames. For points that are out of the plane, a homography may not be appropriate for images taken in the two camera poses. As Figure 6 shows, in the camera coordinate systems $C$ and $C'$ of the two poses, $x$ and $x'$ are the image points of $X$, which is a point out of the plane $\pi$, and $\hat{x}'$ is the mapped point of $x$ using the homography $H$. It is easily found that for a point $X$, the difference between $x'$ and $\hat{x}'$ represents the error of the homography. Additionally, we define the following: $b$ is the baseline between the two camera poses, $e$ and $e'$ are the epipoles, and $l$ and $l'$ are the epipolar lines. The difference between $x'$ and $\hat{x}'$ is called the parallax. It should be noted that in the illustration herein, the two coordinate systems $C$ and $C'$ represent different poses of the same camera, rather than the two cameras in stereo vision, whose parallax far exceeds the description range of the homography matrix.
It is easy to prove that the parallax depends on the distance of $X$ from the plane $\pi$ and on the norm of the camera translation $t$ (or the length of the baseline $b$, because $\|t\| = b$). If the parallax between two images is low enough, or there is only a rotation of the camera pose between two images, in other words the translation $t$ between $C$ and $C'$ is small enough relative to the scene depth, the homography matrix will be accurate enough to describe the correspondence of all points in the two images even when they are not on the same plane [17]. If the vibration-caused camera motion is small enough relative to the scene depth, the parallax between the two virtual frames is low and the homography matrix is sufficiently accurate for global pixel mapping. For virtual frame sequences that pass the frequency check, the corresponding feature points of the virtual frames can be extracted, and then the homography mapping between two virtual frames can be found.
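A sketch of the homography estimation from the matched feature points, assuming OpenCV is available (the RANSAC threshold and helper names are our own choices):

```python
import numpy as np
import cv2

def fit_homography(P, P_prime, ransac_thresh=3.0):
    """Estimate the homography mapping the points of the second virtual frame
    (P_prime) onto the first (P); returns the 3x3 matrix and the inlier mask."""
    src = np.asarray(P_prime, dtype=np.float32).reshape(-1, 1, 2)
    dst = np.asarray(P, dtype=np.float32).reshape(-1, 1, 2)
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, ransac_thresh)
    return H, inliers

def map_points(H, pts):
    """Apply a 3x3 homography to an (n, 2) array of points."""
    pts = np.asarray(pts, dtype=np.float64)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = (H @ homog.T).T
    return mapped[:, :2] / mapped[:, 2:3]
```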
Furthermore, as long as there are more than eight non-coplanar feature points in a pair, the accuracy of this correspondence can be evaluated by calculating $D$ again after the corresponding points have been mapped by the homography matrix. In this sense, the method itself limits its scope of use, and excessive vibration or motion of the camera will be discovered during the repeated L1-norm calculations, avoiding meaningless or mistaken compensation. When the homography matrix is used to map one image to another, there is interpolation and pixel rounding in the process, since digital images have integer pixel coordinates and gray values while feature points have sub-pixel coordinates. Considering this, we introduced the average pixel displacement $E$ to indicate the consistency of the feature points after the virtual frame was mapped by the homography matrix. This can be expressed as:

$$E = \left\| \frac{1}{n}\sum_{i=1}^{n} p_i - \frac{1}{n}\sum_{i=1}^{n} p'_i \right\|_2. \qquad (11)$$

Supposing that $p_i = (x_i, y_i)$, $p'_i = (x'_i, y'_i)$, and $i$ is the order of the feature points, we have $D = \frac{1}{n}\sum_{i=1}^{n}\sqrt{(x_i - x'_i)^2 + (y_i - y'_i)^2}$ and $E = \sqrt{(\bar{x} - \bar{x}')^2 + (\bar{y} - \bar{y}')^2}$, which means that $D$ depends on the Euclidean distance between each pair of feature points, and $E$ depends on the Euclidean distance between the mean values of all feature points in the two arrays.
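The metric $E$ of Equation (11) and the re-evaluation of $D$ after mapping can be sketched as follows, reusing the hypothetical metric_D and map_points helpers from the sketches above:

```python
import numpy as np

def metric_E(P, P_prime):
    """Average pixel displacement: Euclidean distance between the mean
    positions of the two matched point arrays (a directional residual)."""
    P = np.asarray(P, dtype=np.float64)
    P_prime = np.asarray(P_prime, dtype=np.float64)
    return np.linalg.norm(P.mean(axis=0) - P_prime.mean(axis=0))

# After compensation, both indicators can be recomputed on the mapped points:
#   D_after = metric_D(P, map_points(H, P_prime))
#   E_after = metric_E(P, map_points(H, P_prime))
# Large residuals reveal excessive vibration, so the compensation is rejected.
```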
The difference between $D$ and $E$ is that $D$ indicates the absolute difference between two point arrays, which is non-directional, whereas $E$ is directional, as shown in Figure 7. This means that if a point array is evenly radially distributed relative to another point array, it will have a small $E$ while $D$ is big. Conversely, if one set of points is unidirectionally distributed relative to the other, $E$ can remain significant even when $D$ is small. The introduction of $E$ is to indicate the situation where, after compensation, $D$ decreases but the compensated virtual frame still has a unidirectional displacement relative to the reference virtual frame. According to the experiment in Section 2.1, such a unidirectional displacement is critical to phase unwrapping.
In the multi-frequency, phase-shifting fringe projection methods, the wrapped phase is calculated using the images of the same frequency and the unwrapped phase is obtained using the heterodyne from different wrapped phase maps [9]. From the above, we know that the homography matrix can be used to map two images affected by vibration in a low-parallax situation. If the homography matrix is suitable for mapping the virtual frames superimposed from the phase-shifting subsequences, it can also be applied to the wrapped phase maps. The heterodyne algorithm recovers the absolute phase information based on the phase values of the same pixel in different wrapped phase maps. Noise and errors in the unwrapped phase come from the destruction of this pixel correspondence. As the homography matrix obtained from the feature points is sufficiently accurate for global pixel mapping in the vibration situation, it can be applied to correct the pixel correspondence between the wrapped phase maps. The operation flow is shown in Figure 8. For the sake of simplicity, we use the reconstruction process of a two-frequency, three-step phase-shifting sequence as an example. Among the six images captured by the left camera, the three images belonging to the same frequency A can generate a wrapped phase map, Wrapped A. In our method, they can simultaneously synthesize a virtual frame, V-Frame A. Similarly, there are Wrapped B and V-Frame B. Due to the influence of vibration, the pixel correspondence between Wrapped A and Wrapped B is destroyed. In our method, the SIFT (scale-invariant feature transform) method is used to detect feature points in V-Frame A and V-Frame B. The reason for choosing SIFT is that it is invariant under perspective transformation and is not sensitive to grayscale changes. The two arrays of feature points are then matched by the FLANN (fast library for approximate nearest neighbors) matcher and, according to the average Euclidean distance, the mismatched pairs with excessive distance are screened out. We then obtain a homography matrix from the two arrays of feature points that maps V-Frame B to V-Frame A. The same homography matrix is then applied to map Wrapped B to Wrapped A, so that the pixel correspondence between them is corrected. Using Wrapped A and the corrected Wrapped B, we can obtain the unwrapped phase map of the left camera; the same procedure is applied to the right camera. In the end, the correct 3D reconstruction results are generated.
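A condensed sketch of this pipeline for one camera is given below, assuming an OpenCV build with SIFT support; note that Lowe's ratio test is used here as a stand-in for the distance-based screening of mismatches described above, and nearest-neighbour warping is chosen to avoid blending across the 2π jumps of the wrapped phase.

```python
import numpy as np
import cv2

def compensate_wrapped_phase(vframe_a, vframe_b, wrapped_b):
    """Align frequency B to frequency A: estimate a homography between the two
    virtual frames and apply it to the wrapped phase map of frequency B.

    vframe_a, vframe_b: 8-bit virtual frames; wrapped_b: float32 wrapped phase map.
    Returns Wrapped B corrected into the pixel frame of Wrapped A."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(vframe_a, None)
    kp_b, des_b = sift.detectAndCompute(vframe_b, None)

    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
    matches = flann.knnMatch(des_b, des_a, k=2)
    good = [m for m, n in matches if m.distance < 0.7 * n.distance]  # screen mismatches

    src = np.float32([kp_b[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_a[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)  # V-Frame B -> V-Frame A

    h, w = wrapped_b.shape
    # nearest-neighbour interpolation avoids averaging across 2*pi phase jumps
    return cv2.warpPerspective(wrapped_b, H, (w, h), flags=cv2.INTER_NEAREST)
```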