4.1. Implementation Details
4.1.1. Capturing Studio Specification
All the evaluations and experiments in this paper were performed using 18 Intel RealSense D455 RGB-D cameras. All cameras were connected to external triggers (KOTRON TG-16C and KOTRON TG-4C) for synchronization. In our setup, the triggers were installed hierarchically: one main trigger (KOTRON TG-16C) generated periodic signals, and three subtriggers (KOTRON TG-4C) were set to bypass mode, each covering 6 RealSense cameras. Six desktops running the Microsoft Windows 10 operating system were each connected to three cameras in order to cover the high bandwidth requirements of the cameras. Among the 6 desktops, one was selected as the master host. A schematic diagram of the hardware installation is presented in Figure 6b.
The resolution and frames per second (FPS) of the RGB-D cameras were specified in pairs, and we selected the depth and RGB streams according to the experimental situation. The RGB-D cameras were configured to cover 360° of a target object at 60° intervals and 3 different heights, as depicted in Figure 6a. The heights of the installed cameras were 50 cm, 100 cm, and 150 cm from the ground, respectively.
4.1.2. External Trigger Synchronization
To synchronize multiple RealSense RGB-D cameras, an external trigger was required to generate 1.8 V pulse-per-second (PPS) signals. The RealSense RGB-D cameras could receive these signals through a 9-pin connector hidden inside a small latch. The sync signal was connected to pin 5, whereas pin 9 was the ground, as shown in Figure 7a.
In GenLock mode, the cameras do not start capturing a scene without the signal from an external trigger, even when capturing is requested by the SDK. In other words, the camera's exposure starts only when the signal from the external trigger is received. Due to this characteristic, all cameras in a multiview system can start capturing the scene simultaneously in GenLock mode. To properly use GenLock mode, the external trigger should be turned off first. Then, the start request is sent to all cameras by the SDK, and finally, the trigger is turned on and sends a signal that matches the frame rate.
Conversely, RealSense’s Slave and Full Slave modes start capturing the scene regardless of the signal’s existence. With the signal from the trigger, the cameras capture sequences synchronously. In contrast, without the signal, the cameras are unsynchronized and capture frames at a designated frame rate. Therefore, in order to synchronize the cameras in Slave and Full Slave modes, the external trigger is turned on beforehand and periodically sends the signal at a frequency that matches the frame rate. While signals are periodically sent, we can obtain the synchronized sequences by requesting cameras to start and stop capturing of the scene using the SDK.
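As a concrete illustration of these procedures, the sketch below configures the sync mode through pyrealsense2, the Python wrapper of the RealSense SDK, and follows the start order described above for Slave mode. The inter_cam_sync_mode values follow Intel's multi-camera documentation, the stream profiles are examples, and trigger_on() is a hypothetical placeholder for the KOTRON trigger, which is controlled out of band; this is a minimal sketch, not the authors' capture software.

```python
import pyrealsense2 as rs

# Hypothetical placeholder: the KOTRON trigger is controlled out of band
# (e.g., via its own interface); replace with the real control call.
def trigger_on():
    pass

# inter_cam_sync_mode values per Intel's multi-camera documentation:
# 0 = default, 1 = master, 2 = slave, 3 = full slave
# (GenLock modes occupy a higher value range).
SLAVE = 2

ctx = rs.context()
pipelines = []

# In Slave / Full Slave modes, the trigger is turned on beforehand and
# pulses at the frame rate; the cameras then follow its signal.
trigger_on()

for dev in ctx.query_devices():
    # The sync mode is an option of the depth (stereo) sensor.
    dev.first_depth_sensor().set_option(rs.option.inter_cam_sync_mode, SLAVE)

    cfg = rs.config()
    cfg.enable_device(dev.get_info(rs.camera_info.serial_number))
    cfg.enable_stream(rs.stream.depth, 848, 480, rs.format.z16, 30)
    cfg.enable_stream(rs.stream.color, 1280, 720, rs.format.rgb8, 30)

    pipe = rs.pipeline(ctx)
    pipe.start(cfg)
    pipelines.append(pipe)
```

For GenLock mode, the same option would instead be set to a GenLock value while the trigger is still off, and trigger_on() would be called only after every pipeline has been started.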
Despite successful synchronization by GenLock, Slave, and Full Slave modes, the synchronization of the cameras was prone to be unstable for the first 5 s in the RealSense camera system. Figure 7b shows the time interval of the global timestamps with synchronization during 10 s of recording. From Figure 7b, we can observe that the synchronization stabilized after 5 s, after which the accurate time interval of the frame rate was obtained. Hence, in order to obtain data with stable synchronization, we started recording 5 s after the start of capturing.
4.1.3. Data Gathering
Each frame was captured by every camera with timestamps. Although the frames can be captured synchronously via the signal from an external trigger, the host does not receive image frames simultaneously from the RealSense cameras. It was even more difficult to determine whether the images were captured simultaneously from multiple hosts. The global or sensor timestamps were used to find the simultaneously captured frames according to the proposed synchronization method. Assuming that the system times of the hosts were accurately synchronized, frames simultaneously arriving from different hosts had the same global timestamps. Thus, we could obtain the simultaneously captured frames by gathering, from all cameras, the frames that had the same global timestamps. In practice, however, the timestamps differed among the cameras. Therefore, we gathered frames whose timestamp differences were within a predefined threshold $\tau$; a value of less than a half-frame interval was sufficient for the threshold, i.e., $\tau < \frac{1}{2\,\mathrm{FPS}}$. The simultaneous frame-gathering algorithm from the timestamp set $T$ and image set $I$ is presented in Algorithm 3. In Slave and Full Slave modes, the global timestamp set was used to gather the frames.
Algorithm 3: Simultaneous frame-gathering algorithm.
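The body of Algorithm 3 is not reproduced here. As a minimal sketch (not the authors' exact pseudocode), the following implements the gathering rule described above, assuming each camera's stream is a time-sorted list of (timestamp, image) pairs and the threshold is half the frame interval:

```python
import numpy as np

def gather_simultaneous_frames(streams, fps=30.0):
    """Group frames from all cameras whose global timestamps differ by
    less than the threshold (half of the frame interval, in ms).

    streams: list of per-camera lists of (timestamp_ms, image) pairs,
             each sorted by timestamp.
    Returns a list of frame groups, one image per camera.
    """
    threshold = 1000.0 / fps / 2.0  # half-frame interval in milliseconds
    cursors = [0] * len(streams)
    groups = []

    while all(c < len(s) for c, s in zip(cursors, streams)):
        stamps = [s[c][0] for c, s in zip(cursors, streams)]
        if max(stamps) - min(stamps) < threshold:
            # All current frames fall within the threshold: emit a group.
            groups.append([s[c][1] for c, s in zip(cursors, streams)])
            cursors = [c + 1 for c in cursors]
        else:
            # Drop the earliest frame: it has no partner within the threshold.
            cursors[int(np.argmin(stamps))] += 1
    return groups
```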
On the other hand, all the cameras are able to start capturing sequences at the same time in GenLock mode. This enabled us to use the sensor timestamps instead of global timestamps, allowing highly accurate synchronization of multiple cameras. In other words, although the devices had different sensor timestamps, the timestamps could be aligned with each other by subtracting each device’s timestamps by its first timestamp. Therefore, in GenLock mode, the aligned sensor timestamps of each camera were used to gather the frames. The aligned frames by GenLock mode were used to quantitatively evaluate if our method accurately synchronized the cameras.
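A minimal sketch of this alignment, assuming each camera's sensor timestamps are collected into an array:

```python
import numpy as np

def align_sensor_timestamps(sensor_ts):
    """Subtract each camera's first sensor timestamp from its sequence so
    that all cameras start at zero. Valid in GenLock mode, where all
    cameras start capturing at the same time."""
    return [np.asarray(ts) - ts[0] for ts in sensor_ts]
```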
4.2. Quantitative Evaluation
Figure 8a shows the global timestamps from the 18 RealSense cameras with 6 hosts. It clearly shows that the cameras connected to the same host could be correctly synchronized via the global timestamps, but large variations existed among the timestamps captured on different hosts. Figure 8b depicts the global timestamps of sequences captured with the proposed synchronization method. The proposed synchronization method significantly decreased the variations between the hosts' timestamps. The gathering method in Algorithm 3 matched the closest timestamps within the threshold across the cameras' sequences. The lower variations in Figure 8 made it possible to gather more reliable synchronized frames.
However, the lower variations did not confirm that the cameras were synchronized correctly because the global timestamp was measured using the system clock of each host. To correctly evaluate the global timestamps of multiple hosts, the timestamps needed to be aligned with the absolute time, i.e., the ground-truth time, and then compared to each other. Therefore, in order to quantitatively evaluate the performance of our method, the ground-truth for synchronization was required. The ground-truth for synchronization was obtained by performing GenLock mode for each test.
Even though GenLock mode captures a sequence at half of the specified frame rate and synchronizes only depth cameras, it allows accurate synchronization by specifying the start time of the capturing sequences. The sensor timestamps of each synchronized camera could be normalized by subtracting the first sensor timestamp of the sequence, and with these normalized sensor timestamps, the frames captured by the cameras could be correctly ordered. Based on the correct ordering of the frames, the accuracy of our synchronization method could be evaluated.
We defined a metric, the average maximum delay (AMD), to quantitatively evaluate the synchronization of multiple RealSense cameras as

$$\mathrm{AMD} = \frac{1}{m}\sum_{i=1}^{m}\Big(\max_{j}\, t_i^{\,j} - \min_{j}\, t_i^{\,j}\Big),$$

where $m$ is the number of frames and $t_i^{\,j}$ is the global timestamp of the $i$th frame captured by the $j$th camera. The AMD measures the average difference between the maximum and minimum global timestamps at each frame and thus reflects the accuracy of the synchronization at each frame. Because the AMD measures the biggest difference between the cameras' global timestamps at a certain frame, it can be considered the multiview camera system's average maximum error. Therefore, a multiview camera system with a maximum error less than the interval between frames (i.e., 33 ms for 30 FPS) can be considered synchronized. Even though the AMD was useful in evaluating the performance of the multiview camera system as a whole, it was difficult to evaluate the correctness of the synchronization between the cameras with it.
Thus, we defined another metric, the root mean delay variation (RMDV), to evaluate the synchronization as

$$\mathrm{RMDV} = \sqrt{\frac{1}{m\,n}\sum_{i=1}^{m}\sum_{j=1}^{n}\big(t_i^{\,j} - \tilde{t}_i\big)^2},$$

where $n$ is the number of sensors, $t_i^{\,j}$ is the global timestamp of the $j$th sensor on the $i$th frame, and $\tilde{t}_i$ is the median of the global timestamps at the $i$th frame. The RMDV computes the mean variation of the delay between the cameras at each frame. In order to minimize the effect of outliers, the median of the cameras' global timestamps at each frame was used as the reference instead of the mean. The RMDV computes the differences between the median global timestamp and the other cameras' global timestamps and uses them to accurately measure the correctness of the synchronization.
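Given the reconstructed definitions above, both metrics can be computed from an m-by-n matrix of global timestamps (rows: frames; columns: cameras); a sketch:

```python
import numpy as np

def amd(T):
    """Average maximum delay: mean over frames of the spread between the
    latest and earliest camera timestamps. T is an (m, n) array."""
    return np.mean(T.max(axis=1) - T.min(axis=1))

def rmdv(T):
    """Root mean delay variation: RMS deviation of each camera's timestamp
    from the per-frame median, pooled over all frames and cameras."""
    med = np.median(T, axis=1, keepdims=True)
    return np.sqrt(np.mean((T - med) ** 2))
```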
Table 2 and Table 3 show the average quantitative results of the AMD and RMDV over the frames; the best result for each measurement is shown in bold. Furthermore, the quantitative measurements for each frame are presented in Figure 9 and Figure 10. The results showed no apparent variation in the error over time, regardless of the synchronization method. This implies that the synchronization within a single host was efficiently handled by the external trigger and the global timestamps, and that a major cause of the synchronization errors was the synchronization error between the hosts. Our synchronization method, which considers both the time synchronization delay and the network synchronization delay, showed the best performance by efficiently tackling the host synchronization error, followed by applying only the network synchronization delay, only the time synchronization delay, and neither of them, in that order. From this result, we can observe that the effect of these two delays was notable.
When testing the synchronization without considering the delays, the master host was configured as a local time synchronization server, and the hosts' system clocks were synchronized through NTP using the time synchronization service provided by the operating system. Without considering the two delays, the synchronization errors increased significantly. When comparing the results obtained by applying only the network synchronization delay or only the time synchronization delay, the test that accounted only for the time synchronization delay produced larger delays between the cameras than the opposite test. This was because sending and receiving data over the network took more time than calling the operating system's get-time and set-time functions.
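The paper's exact compensation terms are not reproduced here; the sketch below shows the standard NTP-style exchange underlying such compensation, where the network synchronization delay is approximated as half of the measured round trip (request_server_time() is a hypothetical RPC to the master host):

```python
import time

def estimate_offset(request_server_time):
    """Classic NTP-style clock offset estimation against the master host.

    t0, t3: slave clock readings at send and receive;
    t1:     master clock reading returned by the RPC.
    The one-way network synchronization delay is approximated as half of
    the round-trip time.
    """
    t0 = time.time()
    t1 = request_server_time()   # hypothetical call to the master host
    t3 = time.time()
    network_delay = (t3 - t0) / 2.0
    # Offset to add to the slave clock so it matches the master's,
    # with the network synchronization delay compensated.
    return t1 + network_delay - t3
```

The time synchronization delay would additionally account for the cost of the get-time and set-time system calls themselves, which, as noted above, is smaller than the network delay.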
Table 3 shows similar results, but all tests achieved lower synchronization errors than the corresponding tests in Table 2. Because the time interval between two frames becomes smaller as the frame rate increases, the delays between the cameras also decreased correspondingly.
4.4. Evaluation on 3D Reconstruction
Three-dimensional reconstruction was the ultimate goal of the multiview camera system. Therefore, we evaluated the improvements of the proposed method on the reconstruction accuracy when capturing objects using multiple cameras with multiple hosts. The average reprojection error to the depth maps [32] was used to quantitatively evaluate the 3D objects reconstructed from multiple views.
For the 3D reconstruction from the multiview depths, we first calibrated the depth sensors with the standard multicamera calibration method using a checkerboard [33]. Then, using the extrinsically calibrated depth cameras, 3D points in the local camera coordinates could be integrated into the global coordinates. The Poisson reconstruction method [34] was applied to the integrated point set to construct a 3D mesh of the target object.
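As an illustration of this pipeline (not the authors' implementation), the integration and Poisson steps can be sketched with Open3D, assuming the checkerboard calibration yields a 4x4 camera-to-global transformation per camera:

```python
import open3d as o3d

def reconstruct_mesh(point_sets, extrinsics, depth=9):
    """Integrate per-camera 3D points into the global coordinates and run
    Poisson surface reconstruction.

    point_sets: list of (N_k, 3) arrays in each camera's local coordinates.
    extrinsics: list of 4x4 camera-to-global transformation matrices
                obtained from the checkerboard calibration.
    """
    merged = o3d.geometry.PointCloud()
    for pts, T in zip(point_sets, extrinsics):
        pcd = o3d.geometry.PointCloud()
        pcd.points = o3d.utility.Vector3dVector(pts)
        pcd.transform(T)          # local camera -> global coordinates
        merged += pcd
    merged.estimate_normals()     # Poisson reconstruction requires normals
    mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
        merged, depth=depth)
    return mesh
```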
The reprojection error was computed by projecting the vertices of the mesh onto every depth map to match the corresponding points of the depth map to the 3D vertices. When calculating the reprojection error, the 3D vertices invisible to each depth camera were excluded, which could be determined by the Z-buffer test [35]. Once the correspondences between the mesh vertices and depth points were matched, the distance errors in the global coordinates were calculated and averaged.
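A minimal sketch of this evaluation for a single depth camera, assuming a pinhole intrinsic matrix K, a 4x4 global-to-camera extrinsic, and a depth map in meters; the visibility check here is a simplified stand-in for the Z-buffer test:

```python
import numpy as np

def reprojection_error(vertices, K, world_to_cam, depth_map, z_tol=0.01):
    """Average 3D distance between mesh vertices and their corresponding
    back-projected depth points, over vertices visible to this camera."""
    # Transform mesh vertices from global into camera coordinates.
    v_cam = (world_to_cam[:3, :3] @ vertices.T + world_to_cam[:3, 3:4]).T
    z = v_cam[:, 2]
    front = z > 0
    v_cam, z = v_cam[front], z[front]

    # Pinhole projection to pixel coordinates.
    uv = (K @ v_cam.T).T
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)

    h, w = depth_map.shape
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    v_cam, z, u, v = v_cam[inside], z[inside], u[inside], v[inside]

    # Visibility: the vertex depth must agree with the depth map within a
    # tolerance (a simplified stand-in for the Z-buffer test).
    d = depth_map[v, u]
    visible = (d > 0) & (np.abs(d - z) < z_tol)

    # Back-project matched depth pixels to 3D in camera coordinates; rigid
    # transforms preserve distances, so these errors equal the errors
    # measured in the global coordinates.
    pix = np.stack([u, v, np.ones_like(u)], axis=1).astype(float)
    p_cam = (np.linalg.inv(K) @ pix.T).T * d[:, None]
    return np.linalg.norm(p_cam[visible] - v_cam[visible], axis=1).mean()
```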
Table 4 summarizes the reprojection errors with and without the proposed synchronization method when capturing the objects in Slave and Full Slave modes. When capturing the sequences without the proposed synchronization, the standard NTP synchronization scheme was used. The results demonstrate that the proposed method was significantly beneficial in obtaining accurate and reliable reconstruction results from the multiview camera system. The qualitative comparison is shown in Figure 1. The mesh reconstructed using the synchronization scheme without the proposed method showed notable artifacts arising from the misalignment among the multiview depths; this misalignment was caused by inaccurate synchronization between the hosts. In contrast, the proposed method enabled robust reconstruction from multiple cameras with multiple hosts by efficiently addressing the synchronization problem.
4.5. Discussion
The proposed method resolved the problems in the Slave and Full Slave synchronization modes of the RealSense devices by synchronizing multiple hosts' times. In our experiments, the RealSense cameras were connected to the motherboard USB 3.0 interfaces of the hosts for communication, without an additional PCI-E extension card. It has been reported that the number of RealSense sensors a host can support is up to three [18] without an extension card; thus, we used six hosts to support eighteen cameras in our experiments. However, when using extension cards for the USB interface, a host can support a wider bandwidth and stably connect to more cameras, as long as the number of cameras connected to a host does not exceed the hardware limitation. In that case, the multiview camera system can be accurately synchronized using a single host without the proposed method.
Furthermore, although the GenLock synchronization mode of the RealSense cameras drops the frame rate by half and cannot synchronize the RGB cameras simultaneously with the depth cameras, a GenLock mode on other cameras can be used to accurately synchronize multiple RGB-D cameras without the frame rate loss, provided all cameras start capturing at the same time. In other words, other depth cameras can easily be synchronized without the proposed synchronization method if a GenLock mode is used.
However, in many practical applications, it is difficult to specify the starting point of capturing for each camera. As a result, the typical GenLock synchronization scheme cannot be used, but the proposed method can still be used to accurately synchronize multiple cameras with GenLock mode. In addition, our method does not limit the number of devices that can be synchronized, since the number of hosts can be increased linearly, making the multiview camera system highly scalable.