#### 4.2.1. Stereo Tracking Process with Multithread Gaze Control

In the stereo tracking process, the left- and right-view subprocesses for multithread gaze control are alternately switched at a short interval $\Delta t$. The left-view subprocess runs for ${t}_{2k-1}-{\tau}_{m}\le t<{t}_{2k}-{\tau}_{m}$, and the right-view subprocess runs for ${t}_{2k}-{\tau}_{m}\le t<{t}_{2k+1}-{\tau}_{m}$, as the time-division thread executes with a temporal granularity of $\Delta t$. Here, ${t}_{k}={t}_{0}+k\Delta t$ ($k$: integer) indicates the image-capturing time of the high-speed vision system, and ${\tau}_{m}$ is the settling time for controlling the mirror angles of the pan-tilt mirror system.
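The time-division schedule above can be sketched as follows; the function name, its arguments, and the use of millisecond units for $\Delta t$ and ${\tau}_{m}$ are illustrative assumptions, not part of the original system description:

```python
def capture_schedule(t0, dt, tau_m, n_frames):
    """Sketch of the time-division thread in Sec. 4.2.1 (assumed
    structure): for each frame k, the mirror starts moving at
    t_k - tau_m so that it has settled by the capture time
    t_k = t0 + k*dt. Odd-numbered frames serve the virtual left
    pan-tilt camera, even-numbered frames the virtual right one."""
    schedule = []
    for k in range(1, n_frames + 1):
        t_k = t0 + k * dt
        view = "left" if k % 2 == 1 else "right"
        # (view, time the mirror move begins, image-capturing time)
        schedule.append((view, t_k - tau_m, t_k))
    return schedule
```

For example, with $\Delta t = 2$ ms and ${\tau}_{m} = 0.5$ ms, the left view is captured at 2 ms, 6 ms, ... and the right view at 4 ms, 8 ms, ..., with each mirror move starting 0.5 ms before its capture.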

[Left-view subprocess]

(L-1) Switching to the left viewpoint

For time ${t}_{2k-1}-{\tau}_{m}$ to ${t}_{2k-1}$, the pan and tilt angles of the pan-tilt mirror system are controlled to their desired values $\widehat{\mathbf{\theta}}({t}_{2k-1};{t}_{2k-3})=({\widehat{\theta}}_{1}({t}_{2k-1};{t}_{2k-3}),{\widehat{\theta}}_{2}({t}_{2k-1};{t}_{2k-3}))$ at time ${t}_{2k-1}$, which are estimated at time ${t}_{2k-3}$ when capturing the left-view image in the previous frame.

(L-2) Left-view image capturing

The left-view image $I\left({t}_{2k-1}\right)$ is captured at time ${t}_{2k-1}$; $I\left(t\right)$ indicates the input image of the high-speed vision system at time t.

(L-3) Target detection in left-view image

The target object with a specific color is localized by detecting its center position $\mathit{u}\left(t\right)=(u(t),v(t))$ in the image $I\left(t\right)$ at time $t$. Assuming in this study that the color of the target object to be tracked differs from that of its background, $\mathit{u}\left({t}_{2k-1}\right)$ is calculated as the moment centroid of a binary image $C\left({t}_{2k-1}\right)=C(u,v,{t}_{2k-1})$ for the target object as follows:

$$\mathit{u}\left({t}_{2k-1}\right)=\frac{{\sum}_{(u,v)}\,(u,v)\,C(u,v,{t}_{2k-1})}{{\sum}_{(u,v)}\,C(u,v,{t}_{2k-1})},$$

where the binary image $C\left(t\right)$ is obtained at time $t$ by thresholding the HSV (Hue, Saturation, Value) images as follows:

$$C(u,v,t)=\begin{cases}1 & ({H}_{l}\le H(u,v,t)\le {H}_{h},\ S(u,v,t)\ge {S}_{l},\ V(u,v,t)\ge {V}_{l})\\ 0 & (\mathrm{otherwise}),\end{cases}$$

where $H$, $S$, and $V$ are the hue, saturation, and value images of $I\left(t\right)$, respectively, and ${H}_{l}$, ${H}_{h}$, ${S}_{l}$, and ${V}_{l}$ are the parameters for HSV color thresholding.
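Step L-3 (HSV thresholding followed by a moment centroid) can be sketched as below; the function name and the array-based interface, in which the H, S, and V channel images are passed in directly, are assumptions for illustration:

```python
import numpy as np

def target_centroid(H, S, V, H_l, H_h, S_l, V_l):
    """Sketch of step L-3, assuming H, S, V are given as 2D arrays
    (the hue, saturation, and value images of I(t)).
    Returns the moment centroid u(t) = (u, v) of the binary image
    C(t), or None if no pixel passes the threshold."""
    # C(u, v, t) = 1 where H_l <= H <= H_h, S >= S_l and V >= V_l, else 0
    C = (H >= H_l) & (H <= H_h) & (S >= S_l) & (V >= V_l)
    m00 = C.sum()  # zeroth moment: number of target pixels
    if m00 == 0:
        return None  # target not visible in this frame
    vs, us = np.nonzero(C)  # row index -> v, column index -> u
    return us.mean(), vs.mean()
```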

(L-4) Determination of mirror angles at the next left-view frame

Assuming that the $u$- and $v$-directions in the image correspond to the pan and tilt directions of the pan-tilt mirror system, respectively, the pan and tilt angles at time ${t}_{2k+1}$, when the left-view image is captured at the next frame, are determined so as to reduce the error between the position of the target object and its desired position ${\mathit{u}}_{L}^{d}$ in the left-view image with proportional control as follows:

$$\widehat{\mathbf{\theta}}({t}_{2k+1};{t}_{2k-1})=\mathbf{\theta}\left({t}_{2k-1}\right)+K\left(\mathit{u}\left({t}_{2k-1}\right)-{\mathit{u}}_{L}^{d}\right),$$

where $\mathbf{\theta}\left(t\right)=({\theta}_{1}\left(t\right),{\theta}_{2}\left(t\right))$ collectively denotes the measured values of the pan and tilt angles at time $t$, and $K$ is the gain parameter for tracking control.
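The proportional update of step L-4 can be sketched as follows; the scalar gain and the sign of the image-to-angle mapping are assumptions, since both depend on the mirror hardware and calibration:

```python
def next_mirror_angles(theta, u, u_desired, K):
    """Sketch of step L-4 (proportional tracking control).
    theta     : (pan, tilt) angles measured at the current frame
    u         : detected target centroid (u, v) in the image
    u_desired : desired target position u_L^d in the left view
    K         : tracking gain (assumed scalar; sign and scale of the
                image-to-angle mapping are hardware dependent)
    Returns the desired (pan, tilt) for the next left-view frame."""
    pan, tilt = theta
    du = u[0] - u_desired[0]   # u-error drives the pan axis
    dv = u[1] - u_desired[1]   # v-error drives the tilt axis
    return (pan + K * du, tilt + K * dv)
```

The same update serves step R-4 with the right-view desired position substituted for ${\mathit{u}}_{L}^{d}$.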

[Right-view subprocess]

(R-1) Switching to right viewpoint

For time ${t}_{2k}-{\tau}_{m}$ to ${t}_{2k}$, the pan and tilt angles are controlled to $\widehat{\mathbf{\theta}}({t}_{2k};{t}_{2k-2})$, which is estimated at time ${t}_{2k-2}$ when capturing the right-view image in the previous frame.

(R-2) Right-view image capturing

The right-view image $I\left({t}_{2k}\right)$ is captured at time ${t}_{2k}$.

(R-3) Target detection in right-view image

$\mathit{u}\left({t}_{2k}\right)=(u\left({t}_{2k}\right),v\left({t}_{2k}\right))$ is obtained as the center position of the target object in the right-view image at time ${t}_{2k}$ by calculating the moment centroid of $C\left({t}_{2k}\right)$, the binary image obtained by color-thresholding the right-view image $I\left({t}_{2k}\right)$ at time ${t}_{2k}$, in a similar manner as described in L-3.

(R-4) Determination of mirror angles in the next right-view frame

Similarly to the process described in L-4, the pan and tilt angles at time ${t}_{2k+2}$, when the right-view image is captured in the next frame, are determined as follows:

$$\widehat{\mathbf{\theta}}({t}_{2k+2};{t}_{2k})=\mathbf{\theta}\left({t}_{2k}\right)+K\left(\mathit{u}\left({t}_{2k}\right)-{\mathit{u}}_{R}^{d}\right),$$

where ${\mathit{u}}_{R}^{d}$ is the desired position of the target object in the right-view image.

The input images and mirror angles obtained in the stereo tracking process are stored as the left-view images ${I}_{L}\left({t}_{2k-1}\right)=I\left({t}_{2k-1}\right)$ and pan and tilt angles ${\mathbf{\theta}}_{L}\left({t}_{2k-1}\right)=\mathbf{\theta}\left({t}_{2k-1}\right)$ at time ${t}_{2k-1}$ for the virtual left pan-tilt camera at the odd-numbered frames, and as the right-view images ${I}_{R}\left({t}_{2k}\right)=I\left({t}_{2k}\right)$ and pan and tilt angles ${\mathbf{\theta}}_{R}\left({t}_{2k}\right)=\mathbf{\theta}\left({t}_{2k}\right)$ at time ${t}_{2k}$ for the virtual right pan-tilt camera at the even-numbered frames.

#### 4.2.2. 3D Image Estimation with Virtually Synchronized Images

Left and right-view images in catadioptric stereo tracking are captured at different timings, and the synchronization errors in stereo measurement increase as the target object’s movement increases. To reduce such errors, this study introduces a frame interpolation technique for virtual synchronization between the virtual left and right pan-tilt cameras, and 3D images are estimated with stereo processing on the virtually synchronized left and right-view images. Frame interpolation is a well-known video processing technique in which intermediate frames are generated between existing frames by means of interpolation using space-time tracking [79,80,81,82], view morphing [83,84,85], and optical flow [86,87]; it has been used in many applications, such as frame rate conversion, temporal upsampling for fluid slow-motion video, and image morphing.

(S-1) Virtual Synchronization with Frame Interpolation

Considering the right-view image ${I}_{R}\left({t}_{2k}\right)$ captured at time ${t}_{2k}$ as the standard image for virtual synchronization, the left-view image virtually synchronized at time ${t}_{2k}$, ${\tilde{I}}_{L}\left({t}_{2k}\right)$, is estimated with frame interpolation using the two temporally neighboring left-view images ${I}_{L}\left({t}_{2k-1}\right)$ at time ${t}_{2k-1}$ and ${I}_{L}\left({t}_{2k+1}\right)$ at time ${t}_{2k+1}$ as follows:

$${\tilde{I}}_{L}\left({t}_{2k}\right)={f}_{FI}\left({I}_{L}\left({t}_{2k-1}\right),{I}_{L}\left({t}_{2k+1}\right)\right),$$

where ${f}_{FI}({I}_{1},{I}_{2})$ indicates the frame interpolation function using the two images ${I}_{1}$ and ${I}_{2}$. We used Meyer’s phase-based method [88] as the frame interpolation technique in this study.

In a similar manner, the pan and tilt angles of the virtual left pan-tilt camera virtually synchronized with those of the right pan-tilt camera at time ${t}_{2k}$, ${\tilde{\mathbf{\theta}}}_{L}\left({t}_{2k}\right)$, are estimated using the temporally neighboring mirror angles ${\mathbf{\theta}}_{L}\left({t}_{2k-1}\right)$ at time ${t}_{2k-1}$ and ${\mathbf{\theta}}_{L}\left({t}_{2k+1}\right)$ at time ${t}_{2k+1}$ as follows:

$${\tilde{\mathbf{\theta}}}_{L}\left({t}_{2k}\right)=\frac{1}{2}\left({\mathbf{\theta}}_{L}\left({t}_{2k-1}\right)+{\mathbf{\theta}}_{L}\left({t}_{2k+1}\right)\right),$$

where it is assumed that the mirror angles of the virtual left pan-tilt camera vary linearly over the interval $2\Delta t$ between times ${t}_{2k-1}$ and ${t}_{2k+1}$.
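A minimal sketch of the virtual synchronization in S-1 is given below. The paper uses Meyer’s phase-based method for ${f}_{FI}$; here a plain linear blend of the two neighboring left-view frames stands in for it, purely for illustration, while the mirror angles are midpoint-interpolated in line with the linear-variation assumption:

```python
import numpy as np

def virtually_sync_left(I_prev, I_next, theta_prev, theta_next):
    """Sketch of step S-1. I_prev, I_next are the left-view images
    at t_{2k-1} and t_{2k+1}; theta_prev, theta_next the matching
    (pan, tilt) mirror angles. A linear blend stands in for the
    phase-based frame interpolation f_FI used in the paper."""
    # stand-in for f_FI: average of the two neighboring frames
    I_sync = 0.5 * (I_prev.astype(np.float64) + I_next.astype(np.float64))
    # midpoint of the mirror angles (linear variation over 2*dt)
    theta_sync = tuple(0.5 * (a + b) for a, b in zip(theta_prev, theta_next))
    return I_sync, theta_sync
```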

(S-2) Triangulation Using Virtually Synchronized Images

The virtually synchronized left and right-view images at time ${t}_{2k}$, ${\tilde{I}}_{L}\left({t}_{2k}\right)$ and ${I}_{R}\left({t}_{2k}\right)$, are used to compute the 3D image of the tracked object in a similar way as in standard stereo methodologies for multiple synchronized cameras. Assuming that the camera parameters of the virtual pan-tilt camera at arbitrary pan and tilt angles $\mathbf{\theta}$ are given in advance as the 3 × 4 camera calibration matrix $\mathit{P}\left(\mathbf{\theta}\right)$, the 3D image $\mathbf{Z}\left({t}_{2k}\right)$ can be estimated at time ${t}_{2k}$ as a disparity map as follows:

$$\mathbf{Z}\left({t}_{2k}\right)={f}_{dm}\left({\tilde{I}}_{L}\left({t}_{2k}\right),{I}_{R}\left({t}_{2k}\right);\mathit{P}({\tilde{\mathbf{\theta}}}_{L}\left({t}_{2k}\right)),\mathit{P}({\mathbf{\theta}}_{R}\left({t}_{2k}\right))\right),$$

where ${f}_{dm}({I}_{L},{I}_{R};{\mathit{P}}_{L},{\mathit{P}}_{R})$ indicates the stereo matching function using a pair of left and right-view images ${I}_{L}$ and ${I}_{R}$ when the 3 × 4 camera calibration matrices of the left and right cameras are given as ${\mathit{P}}_{L}$ and ${\mathit{P}}_{R}$, respectively. We used the rSGM method [89] as the stereo matching algorithm in this study.
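As a rough illustration of ${f}_{dm}$, the sketch below computes a winner-takes-all block-matching disparity map over rectified grayscale images. This is a simple stand-in for the rSGM matcher used in the study, not an implementation of it, and the rectification implied by the calibration matrices is assumed to have been done beforehand:

```python
import numpy as np

def disparity_map(I_L, I_R, max_disp, win=3):
    """Stand-in for f_dm on rectified grayscale images: for each pixel,
    pick the disparity d minimizing the sum-of-absolute-differences
    over a win x win window, where a pixel in I_L at column u matches
    I_R at column u - d."""
    h, w = I_L.shape
    r = win // 2
    best_cost = np.full((h, w), np.inf)
    disp = np.zeros((h, w), dtype=np.int32)
    for d in range(max_disp + 1):
        # per-pixel absolute difference at disparity d
        diff = np.full((h, w), np.inf)
        diff[:, d:] = np.abs(I_L[:, d:] - I_R[:, : w - d])
        # box-filter the cost over the matching window (SAD)
        pad = np.pad(diff, r, mode="edge")
        cost = sum(
            pad[i : i + h, j : j + w]
            for i in range(win)
            for j in range(win)
        )
        # keep the lowest-cost disparity seen so far (winner takes all)
        better = cost < best_cost
        best_cost[better] = cost[better]
        disp[better] = d
    return disp
```

Production systems would replace this with a semi-global matcher such as rSGM, which adds smoothness penalties along multiple scan directions rather than deciding each pixel independently.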