Gaze Point Tracking Based on a Robotic Body–Head–Eye Coordination Method

When the magnitude of a gaze shift is too large, human beings change the orientation of their head or body to assist their eyes in tracking targets, because saccade alone is insufficient to keep a target in the center region of the retina. To make a robot gaze at targets rapidly and stably (as a human does), it is necessary to design a body–head–eye coordinated motion control strategy. A robot system equipped with eyes and a head is designed in this paper. The gaze point tracking problem is divided into two sub-problems: in situ gaze point tracking and approaching gaze point tracking. In the in situ gaze point tracking state, the desired positions of the eyes, head and body are calculated on the basis of minimizing resource consumption and maximizing stability. In the approaching gaze point tracking state, the robot is expected to approach the object at a zero angle. In the process of tracking, the three-dimensional (3D) coordinates of the object are obtained by the bionic eye and then converted to the head coordinate system and the mobile robot coordinate system. The desired positions of the head, eyes and body are obtained according to the object's 3D coordinates. Then, using motor position control, the head, eyes and body are driven to their desired positions. This method avoids the complex process of adjusting control parameters and does not require the design of complex control algorithms. Based on this strategy, in situ gaze point tracking and approaching gaze point tracking experiments are performed by the robot. The experimental results show that body–head–eye coordinated gaze point tracking based on the 3D coordinates of an object is feasible. This paper provides a new method that differs from the traditional two-dimensional image-based methods for robotic body–head–eye gaze point tracking.


Introduction
When the magnitude of a gaze shift is too large, human beings change the orientation of their head or body to assist their eyes in tracking targets because saccade alone is insufficient to keep a target in the center region of the retina. Studies on body-head-eye coordinated gaze point tracking are still rare because the body-head-eye coordination mechanism of humans is prohibitively complex. Multiple researchers have investigated the eye-head coordination mechanism, the binocular coordination mechanism and bionic eye movement control, and eye-head coordination models have been validated on eye-head systems. This work is significant for the development of intelligent robots for human-robot interaction. However, most of these methods are based on principles from neurology, and their further development and application may be limited by our incomplete understanding of the underlying human processes. In contrast, binocular coordination based on the 3D coordinates of an object is simple and practical, as verified by our previous paper [1], and it avoids expensive investment in hardware when used in robotics for 3D perception. Namavari A et al. [20] presented an automatic system for the gauging and digitalization of 3D indoor environments. The configuration consisted of an autonomous mobile robot, a reliable 3D laser rangefinder and three elaborated software modules.
The main forms of motion of bionic eyes include saccade [1], smooth pursuit, vergence [21], the vestibulo-ocular reflex (VOR) [22] and the optokinetic reflex (OKR) [23]. Saccade and smooth pursuit are the two most important functions of the human eye. Saccade is used to move the eyes voluntarily from one point to another by rapid jumping, while smooth pursuit is applied to track moving targets. In addition, binocular coordination and eye-head coordination are of high importance for realizing object tracking and gaze control.
It is of great significance for robots to be able to change their fixation point quickly. In control models, the saccade control system should be implemented as a position servo controller to shift the gaze and keep the target in the center region of the retina with minimum time consumption. Researchers have been studying the implementation of saccade on robots for over twenty years. For example, in 1997, Bruske et al. [24] incorporated saccadic control into a binocular vision system by using the feedback error learning (FEL) strategy. In 2013, Wang et al. [25] designed an active vision system that can imitate saccade and other eye movements. The saccadic movements were implemented with an open-loop controller, which ensures faster saccadic eye movements than a closed-loop controller can provide. In 2015, Antonelli et al. [26] achieved saccadic movements on a robot head by using a model called the recurrent architecture (RA). In this model, the cerebellum is regarded as an adaptive element used to learn an internal model, while the brainstem is regarded as a fixed inverse model. The experimental results on the robot showed that this model is more accurate and less sensitive to the choice of the inverse model than the FEL model. The smooth pursuit system acts as a velocity servo controller to rotate the eyes at the same angular rate as the target while keeping them oriented toward the desired position or region. In Robinson's model of smooth pursuit [27], the input is the velocity of the target's image across the retina. The velocity deviation is taken as the major stimulus for pursuit and is transformed into an eye velocity command. Based on Robinson's model, Brown [28] added a smooth predictor to accommodate time delays. Deno et al. [29] applied a dynamic neural network, which unified two apparently disparate models of smooth pursuit and dynamic element organization, to the smooth pursuit system. The dynamic neural network can compensate for delays from the sensory input to the motor response. Lunghi et al. [30] introduced a neural adaptive predictor that was previously trained to accomplish smooth pursuit. This model can explain a human's ability to compensate for the 130 ms physiological delay when following external targets with the eyes. Lee et al. [31] applied a bilateral OCS model on a robot head and established rudimentary prediction mechanisms for both slow and fast phases. Avni et al. [32] presented a framework for visual scanning and target tracking with a set of independent pan-tilt cameras based on model predictive control (MPC). In another study [33], the authors implemented smooth pursuit eye movement with prediction and learning in addition to solving the problem of time delays in the visual pathways. In addition, some saccade and smooth pursuit models have been validated on bionic eye systems [34][35][36][37]. Santini F et al. [34] showed that the oculomotor strategies by which humans scan visual scenes produce parallaxes that provide an accurate estimation of distance. Other studies have realized the coordinated control of eye and arm movements through configuration and training [35]. Song Y et al. [36] proposed a binocular control model, derived from a neural pathway, for smooth pursuit. In their smooth pursuit experiments, the maximum retinal error was less than 2.2°, which is sufficient to keep a target in the field of view accurately.
An autonomous mobile manipulation system was developed in the form of a modified image-based visual servo (IBVS) controller in a study [37].
The above-mentioned work is significant for the development of intelligent robots. However, there are some shortcomings. First, most of the existing methods are based on principles from neurology, and their further development and application may be limited by our incomplete understanding of the underlying human processes. Second, only two-dimensional (2D) image information is used when gaze shifts to targets are implemented, while 3D information is ignored. Third, the studies of smooth pursuit [16], eye-head coordination [17], gaze shift and approach are independent and have not been integrated. Fourth, bionic eyes differ from human eyes; for example, some systems have two eyes that are fixed or move with only 1 DOF, whereas others use special cameras or a single camera. Fifth, the movements of the bionic eyes and the head are performed separately, without coordination.
To overcome these shortcomings to a certain extent, a novel control method that implements the gaze shift and approach of a robot according to 3D coordinates is proposed in this paper. A robot system consisting of bionic eyes, a head and a mobile robot is designed to help nurses deliver medicine in hospitals. In this system, the head and each eye have 2 DOF (namely, tilt and pan [38]), and the mobile robot can rotate and move forward over the ground. When the robot's gaze shifts to the target, the 3D coordinates of the target are acquired by the bionic eyes and transformed into the eye coordinate system, head coordinate system and robot coordinate system. The desired positions of the eyes, head and robot are calculated based on the 3D information of the target. Then, the eyes, head and mobile robot are driven to the desired positions. When the robot approaches the target, the eyes, head and mobile robot first rotate toward the target and then move to the target. This method allows the robot to achieve the above-mentioned functions with minimal resource consumption and decouples the control of the eyes, head and mobile robot, which can improve the interactions between robots, human beings and the environment.
The rest of the paper is organized as follows. In Section 2, the robot system platform is introduced, and the control system is presented. In Section 3, the desired position is discussed and calculated. Robot pose control is described in Section 4. The experimental results are given and discussed in Section 5; finally, conclusions are drawn in Section 6.

Platform and Control System
To study the gaze point tracking of the robot, this paper designs a robot experiment platform including the eye-head subsystem and the mobile robot subsystem.

Robot Platform
The physical robot is shown in Figure 1. With the mobile robot as a carrier, a head with two degrees of freedom is fixed on the mobile robot, and the horizontal and vertical rotations of the head are controlled by M hu and M hd , respectively. The bionic eye system is fixed to the head. The mobile robot is driven by two wheels, each of which is individually controlled by a servo motor. The rotation angle and displacement of the robot platform can be determined by controlling the distance and speed of each wheel's movement. The output shaft of each stepper motor of the head and eyes is equipped with a rotary encoder to detect the position of the motor. Using the frequency multiplication technique, the resolution of each rotary encoder is 0.036°. The purpose of using rotary encoders is to prevent lost motor steps from affecting the 3D coordinate calculations. The movement of each motor is limited by a limit switch. The initial positioning of the eye system is based on the visual positioning plate [39].
The robot system includes two eyes and one mobile robot. To simulate the eyes and the head, six DOFs are designed in this system. The left eye's pan and tilt are controlled by motors M lu and M ld , respectively. The right eye's pan and tilt are controlled by motors M ru and M rd , respectively. The head's pan and tilt are controlled by motors M hu and M hd , respectively. The mobile robot has two driving wheels and can perform rotation and forward movement. When the mobile robot needs to rotate, two wheels are set to turn the same amount in different directions. When the mobile robot needs to go forward, two wheels are set to turn the same amount in the same direction.
A diagram of the robot system's organization is shown in Figure 2. The host computer communicates with the mobile robot motion controller, the head motion controller and the eye motion controller through serial ports. For satisfactory communication quality and stability, the baud rate of the serial communication is 9600 bps. The cameras communicate with the host computer via a GigE (Gigabit Ethernet) connection. The cameras' native resolution is 1600 × 1200 pixels. To increase the calculation speed, the system uses images downsampled to 400 × 300 pixels.

Control System
The tracking and approaching motion control problem based on the target's 3D coordinates [1] is equivalent to solving the minimization problem for the index J in Equation (1), where J is the indicator function, f i is the current state vector of the joint poses of the eyes, head and mobile robot, and f q is the desired state vector.
Figure 3 shows the control block diagram of the gaze point tracking of the mobile robot. First, based on binocular stereo-vision perception, the binocular pose and the left and right images are used to calculate the 3D coordinates of the target [40], and the coordinates of the target in the eye coordinate system are converted to the head and mobile robot coordinate systems. Then, the desired poses of the eyes, head and mobile robot are calculated according to the 3D coordinates of the target. Finally, according to the desired pose, the motors are controlled to move to their desired positions, and the change in the position of each motor is converted into a change in the pose of the eyes, head and mobile robot.

Figure 4a shows the definition of each coordinate system of the robot. The coordinate system of the eye is O e X e Y e Z e , which coincides with the left motion module's base coordinate system at the initial position. The head coordinate system is O h X h Y h Z h , and the coordinates P h (x h , y h , z h ) of a point P in the head coordinate system can be calculated from its coordinates P e (x e , y e , z e ) in the eye coordinate system. The definitions of d x and d y are shown in Figure 4b. The robot coordinate system O w X w Y w Z w coincides with the head coordinate system at the initial position. In the bionic eye system, the rotation axis of the robot approximately coincides with Y w .
Figure 4b,c show the definitions of the system parameters. l θ p and l θ t are the pan and tilt of the left eye, respectively; r θ p and r θ t are the pan and tilt of the right eye; and h θ p and h θ t are the pan and tilt of the head. The angle through which the robot rotates around the Y w axis is w θ p . The robot can not only rotate around Y w but can also translate in the X w O w Z w plane. When the robot moves, the robot coordinate system at time i is taken as the base coordinate system, and the position of the robot at time i + 1 relative to this base coordinate system is w P m ( w x m , w z m ).
When the robot performs gaze point tracking or approaches the target, the 3D coordinates of the target are first calculated at time i, and the desired pose f q of each part of the robot at time i + 1 is then calculated from these coordinates. When the current pose f i of the robot system is equal to the desired pose, the robot maintains the current pose; otherwise, the system controls the various parts of the robot to move to the desired pose. The current pose vector of the robot system is f i = ( w x mi , w z mi , w θ pi , h θ pi , h θ ti , l θ pi , l θ ti , r θ pi , r θ ti ), and the desired pose is f q = ( w x mq , w z mq , w θ pq , h θ pq , h θ tq , l θ pq , l θ tq , r θ pq , r θ tq ). When performing in situ gaze point tracking, the robot performs only pure rotation and does not move forward. When the robot approaches the target, it first turns toward the target and then moves straight toward it. Therefore, the definition of f q differs between the two tasks. Let g f q be the desired pose for in situ gaze point tracking and a f q be the desired pose of the robot when approaching the target.
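To make this bookkeeping concrete, the following minimal Python sketch shows one way to represent the current and desired pose vectors and the resulting control decision. The component ordering follows the f i and f q definitions above; the scalar deviation index, the tolerance and all names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

# Component order follows the definitions above:
# f = (w_xm, w_zm, w_theta_p, h_theta_p, h_theta_t,
#      l_theta_p, l_theta_t, r_theta_p, r_theta_t)

def pose_deviation(f_i: np.ndarray, f_q: np.ndarray) -> float:
    """One plausible scalar index: the Euclidean distance between the poses."""
    return float(np.linalg.norm(f_q - f_i))

def control_step(f_i: np.ndarray, f_q: np.ndarray, tol: float = 1e-3) -> np.ndarray:
    """Hold the current pose if it already matches the desired pose;
    otherwise command every joint (and the wheels) toward the desired pose."""
    if pose_deviation(f_i, f_q) < tol:
        return f_i          # maintain the current pose
    return f_q              # move each part of the robot to its desired position
```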
After analyzing the control system, we found that the most important step in solving this control problem is to determine the desired pose.

Desired Pose Calculation
When performing in situ gaze point tracking, the robot performs only pure rotation and does not move forward. When the robot approaches the target, it first turns to the target and then moves straight toward the target. Therefore, the calculation of the desired pose can be divided into two sub-problems: (1) desired pose calculation for in situ gaze point tracking and (2) desired pose calculation for approaching gaze point tracking.
The optimal observation position is used for the accurate acquisition of 3D coordinates. The 3D coordinate accuracy is related to the baseline, the time difference and image distortion. In the bionic eye platform, the baseline changes as the cameras' positions change because the optical center of each camera is not coincident with its center of rotation. The 3D coordinate error of the target is smaller when the baseline of the two cameras is longer; therefore, it is necessary to keep the baseline from shortening during movement. On the other hand, there is a time difference caused by imperfect synchronization between image acquisition and camera position acquisition. In addition, it is necessary to keep the target in the center regions of the two camera images to obtain accurate 3D coordinates of the target.

Optimal Observation Position of Eyes
In the desired pose of the robot, the most important aspect is the expected pose of the bionic eye [40]. Following the definition of this parameter, the calculation of the desired pose of the robot system is greatly simplified; thus, we present an engineering definition here of the desired pose of the bionic eye.
As shown in Figure 5, l m i ( l u i , l v i ) and r m i ( r u i , r v i ) are the image coordinates of point e P in the left and right cameras at time i. l m o and r m o are the image centers of the left and right cameras, respectively. l P is the perpendicular foot of e P on the line l O c l Z c , and r P is the perpendicular foot of e P on the line r O c r Z c . l ∆m is the distance between l m i and l m o , and r ∆m is the distance between r m i and r m o . D b is the baseline length. The pan angles of the left and right cameras in the optimal observation position are l θ p and r θ p , respectively. The tilt angles of the left and right cameras in the optimal observation position are l θ t and r θ t , respectively. P ob ( l θ p , l θ t , r θ p , r θ t ) is the optimal observation position.

When the two eyeballs of the bionic eye move relative to each other, the 3D coordinates of the target obtained by the bionic eye produce a large error. To characterize this error, we give a detailed analysis of its origins in Appendix A. From this analysis, we obtain the following conclusions for reducing the measurement error of the bionic eye: (1) make the baseline D b long enough, and maintain as much of its length as possible during movement; (2) observe the target from as close as possible so that the depth error is as small as possible; (3) during the movement of the bionic eye, control the two cameras so that they move at the same angular velocity; and (4) keep the target as symmetric as possible in the left and right camera images, making l ∆m and r ∆m as equal as possible.
Based on these four methods, the motion strategy of the motor is designed, and the measurement accuracy of the target's 3D information can be effectively improved.
Based on these conclusions, we can define the optimal observation pose of the bionic eye so as to reduce the measurement error.
The conditions that the optimal observation position needs to satisfy are listed in Equation (2). When the target is very close to the eyes, the optimal observation position cannot be achieved because the image of the target cannot be kept in the center regions of both cameras simultaneously. It is challenging to obtain the optimal observation position directly from these conditions. However, a suboptimal solution can be obtained by using a simplified calculation method. First, l θ t and r θ t are calculated in the case that l θ p and r θ p are equal to zero; then, l θ p and r θ p are calculated while l θ t and r θ t are kept equal to the calculated values. Once the suboptimal solution is obtained, a trial-and-error method can be used to obtain the optimal solution.
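The simplified two-stage calculation can be sketched as follows in Python. The helpers delta_u and delta_v are hypothetical placeholders that would return the image-center offsets (∆u l , ∆u r ) and (∆v l , ∆v r ) for a candidate pan/tilt pair; the grid search is only one simple way to realize the described procedure, not the paper's exact method.

```python
import numpy as np

def suboptimal_observation_pose(delta_u, delta_v, theta_p_max, theta_t_max, n_grid=721):
    """Two-stage search for the (suboptimal) observation pose described above."""
    # Stage 1: with the pan angles held at zero, choose the tilt that makes the
    # vertical image offsets symmetric (delta_v_l ~= -delta_v_r).
    tilts = np.linspace(-theta_t_max, theta_t_max, n_grid)
    theta_t = min(tilts, key=lambda t: abs(sum(delta_v(0.0, t))))

    # Stage 2: with that tilt fixed, choose the pan that makes the horizontal
    # image offsets symmetric (delta_u_l ~= -delta_u_r).
    pans = np.linspace(-theta_p_max, theta_p_max, n_grid)
    theta_p = min(pans, key=lambda p: abs(sum(delta_u(p, theta_t))))

    # A local trial-and-error refinement around (theta_p, theta_t) can then be
    # used to reduce the residual centering error further.
    return theta_p, theta_t
```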

Desired Pose Calculation for In Situ Gaze Point Tracking
When the range of target motion is large and the desired posture of the eyeball exceeds its reachable posture, the head and the mobile robot move to keep the target in the center region of the image. In robotic systems, eye movements tend to consume the least amount of resources and do not have much impact on the stability of the head and the mobile robot during motion. Head rotation consumes more resources than eyeball rotation but fewer than trunk rotation; at the same time, the rotation of the head affects the stability of the eyeballs but does not have much impact on the stability of the trunk. Mobile robot rotation consumes the most resources and has a large impact on the stability of the head and eyeballs. When tracking the target, one needs only to keep the target in the center region of the binocular images. Therefore, when performing gaze point tracking, the movement mechanism of the head, eyes and mobile robot is designed on the principle of minimal resource consumption and maximum system stability. When the eyeball can perceive the 3D coordinates of the target in a reachable and optimal viewing posture, only the eyes are rotated; otherwise, the head is rotated. The head also has an attainable range of poses; when the desired pose exceeds this range, the mobile robot needs to be turned so that the bionic eye always perceives the 3D coordinates of the target in the optimal viewing position.
Let h γ p and h γ t be the angles between the head and the gaze point in the horizontal and vertical directions, respectively. For the convenience of calculation, the allowable ranges of these angles are designated as [− h γ pmax , h γ pmax ] and [− h γ tmax , h γ tmax ], respectively. When the angle between the head and the target exceeds the set threshold, the head needs to be rotated to the h θ p and h θ t positions in the horizontal and vertical directions, respectively. When h θ p exceeds the angle that the head can attain, the angle that the mobile robot needs to compensate is w θ p . In the in situ gaze point tracking task, the mobile robot does not need to translate in the X w O w Z w plane, so w x m = 0 and w z m = 0. Furthermore, according to the definition of the optimal observation pose of the bionic eye, the conditions that g f q should satisfy are: The desired pose needs to be calculated based on the 3D coordinates of the target. Therefore, to obtain the desired pose, it is necessary to acquire the 3D coordinates of the target according to the current pose of the robot.
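A minimal sketch of this "cheapest actuator first" policy in the horizontal direction is given below, assuming a simple saturation cascade; the limit values and function names are illustrative, and the paper additionally reserves a margin for the eyes via e θ pmax .

```python
import math

def allocate_horizontal_rotation(h_gamma_p, eye_limit, head_limit):
    """Split the required horizontal gaze angle among eye, head and mobile robot.

    h_gamma_p : angle between the head's Z axis and the target (rad).
    eye_limit, head_limit : reachable pan ranges of the eyes and the head (rad).
    Returns (eye_pan, head_pan, robot_pan).
    """
    eye = max(-eye_limit, min(eye_limit, h_gamma_p))     # the eyes absorb what they can
    remaining = h_gamma_p - eye
    head = max(-head_limit, min(head_limit, remaining))  # the head absorbs the rest
    robot = remaining - head                             # the robot compensates the overflow
    return eye, head, robot

# Example: a 60 degree gaze shift with a 25 degree eye range and a 30 degree head range
eye, head, robot = allocate_horizontal_rotation(
    math.radians(60), math.radians(25), math.radians(30))
```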

Three-Dimensional Coordinate Calculation
The mechanical structure and coordinate settings of the system are shown in Figure 6a. The principle of binocular stereoscopic 3D perception is shown in Figure 6b. E is the eye coordinate system, E l is the left motion module's end coordinate system, E r is the right motion module's end coordinate system, B l is the left motion module's base coordinate system, B r is the right motion module's base coordinate system, C l is the left camera coordinate system and C r is the right camera coordinate system. In the initial position, E l coincides with B l , and E r overlaps with B r . When the binocular system moves, the base coordinate systems do not change. l T represents the transformation matrix from the eye coordinate system E to the left motion module's base coordinate system B l , r T represents the transformation matrix from E to B r , l T e represents the transformation matrix from B l to E l , r T e represents the transformation matrix from B r to E r , l T m represents the transformation matrix from the left motion module's end coordinate system to the left camera coordinate system and r T m represents the transformation matrix from the right motion module's end coordinate system to the right camera coordinate system. l T r represents the transformation matrix from the right camera coordinate system to the left camera coordinate system at the initial position.
The origin l O c of C l lies at the optical center of the left camera, the l Z c axis points in the direction of the object parallel to the optical axis of the camera, the l X c axis points horizontally to the right along the image plane and the l Y c axis points vertically downward along the image plane. The origin r O c of C r lies at the optical center of the right camera, r Z c is aligned with the direction of the object parallel to the optical axis of the camera, r X c points horizontally to the right along the image plane and r Y c points vertically downward along the image plane. E l 's origin l O e is set at the intersection of the two rotation axes of the left motion module, l Z e is perpendicular to the two rotation axes and points to the front of the platform, l X e coincides with the vertical rotation axis and l Y e coincides with the horizontal rotation axis. Similarly, the origin r O e of the coordinate system E r is set at the intersection of the two rotation axes of the right motion module, r Z e is perpendicular to the two rotation axes and points toward the front of the platform, r X e coincides with the vertical rotation axis and r Y e coincides with the horizontal rotation axis.
The left motion module's base coordinate system B l coincides with the eye coordinate system E; thus, l T is an identity matrix. To calculate the 3D coordinates of feature points in real time from the camera poses, it is necessary to calculate r T. At the initial position of the system, the external parameters l T r of the left and right cameras are calibrated offline, as are the hand-eye parameters from the left and right motion modules to their camera coordinate systems.
When the system is in its initial configuration, the coordinates of point P in the eye coordinate system are P e (x e , y e , z e ). Its coordinates in B l are l P e ( l x e , l y e , l z e ), and its coordinates l P c ( l x c , l y c , l z c ) in C l are: The coordinates r P e ( r x e , r y e , r z e ) of point P in B r are r P e = r T P e . The coordinates r P c ( r x c , r y c , r z c ) of point P in C r are: The point in C r is transformed into C l : Based on Equations (6) and (9), r T is available:
During the movement of the system, when the left motion module rotates by l θ p and l θ t in the horizontal and vertical directions, respectively, the transformation relationship between B l and E l is: The coordinates of point P in C l are: Assume that: The point l P 1c ( l x 1c , l y 1c ) at which line P l O c intersects the plane l Z c = 1 is
l x 1c = ( l n x x e + l o x y e + l a x z e + l p x ) / ( l n z x e + l o z y e + l a z z e + l p z )
l y 1c = ( l n y x e + l o y y e + l a y z e + l p y ) / ( l n z x e + l o z y e + l a z z e + l p z )
The image coordinates of l P 1c in the left camera are m l (u l , v l ); ( l x 1c , l y 1c ) and (u l , v l ) can be converted into each other using the parameters of the camera. According to the camera's internal parameter model, the following can be obtained: where l M in is the internal parameter matrix of the left camera. The value of ( l x 1c , l y 1c ) can be obtained from the image coordinates of l P 1c and the parameters of the left camera, and the following can be obtained by substituting (15) into (14):
During the motion of the system, when the right motion module rotates through r θ p and r θ t in the horizontal and vertical directions, respectively, the transformation relationship between B r and E r is: The coordinates of point P in C r are: Assume that: The point r P 1c ( r x 1c , r y 1c ) at which line P r O c intersects the plane r Z c = 1 is
r x 1c = ( r n x x e + r o x y e + r a x z e + r p x ) / ( r n z x e + r o z y e + r a z z e + r p z )
r y 1c = ( r n y x e + r o y y e + r a y z e + r p y ) / ( r n z x e + r o z y e + r a z z e + r p z )
The image coordinates of r P 1c in the right camera are m r (u r , v r ); ( r x 1c , r y 1c ) and (u r , v r ) can be converted using the parameters of the camera. According to the camera's internal parameter model, the following can be obtained: where r M in is the internal parameter matrix of the right camera. The value of ( r x 1c , r y 1c ) can be obtained from the image coordinates of r P 1c and the parameters of the right camera, and the following can be obtained by substituting (21) into (20):
Four equations in x e , y e and z e can be obtained from Equations (16) and (22), and the 3D coordinates of point P e can be calculated by the least squares method.
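The final least-squares step can be illustrated with the following sketch. It assumes that T_l and T_r are the 4 × 4 transforms mapping eye-frame points into the left and right camera frames (the products of the matrices defined above) and that the focal lengths and pixel offsets are known; the exact row construction of Equations (16) and (22) is not reproduced in the text, so this is a generic DLT-style formulation under those assumptions.

```python
import numpy as np

def triangulate_eye_frame(T_l, T_r, du_l, dv_l, du_r, dv_r, f_l, f_r):
    """Estimate (x_e, y_e, z_e) from two camera observations by least squares.

    T_l, T_r : 4x4 transforms from the eye frame to the left/right camera frames.
    du_*, dv_*: pixel offsets of the target from the image centers.
    f_l, f_r : (fx, fy) focal lengths in pixels for each camera.
    """
    rows, rhs = [], []
    for T, du, dv, (fx, fy) in ((T_l, du_l, dv_l, f_l), (T_r, du_r, dv_r, f_r)):
        n, o, a, p = T[:3, 0], T[:3, 1], T[:3, 2], T[:3, 3]   # columns of the transform
        # du = fx * x_c / z_c  ->  (fx*n_x - du*n_z) x_e + ... = du*p_z - fx*p_x
        rows.append([fx * n[0] - du * n[2], fx * o[0] - du * o[2], fx * a[0] - du * a[2]])
        rhs.append(du * p[2] - fx * p[0])
        # dv = fy * y_c / z_c  ->  analogous second row per camera
        rows.append([fy * n[1] - dv * n[2], fy * o[1] - dv * o[2], fy * a[1] - dv * a[2]])
        rhs.append(dv * p[2] - fy * p[1])
    A, b = np.asarray(rows, dtype=float), np.asarray(rhs, dtype=float)
    p_e, *_ = np.linalg.lstsq(A, b, rcond=None)               # least-squares solution
    return p_e                                                # (x_e, y_e, z_e)
```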
The 3D coordinates P h (x h , y h , z h ) in the head coordinate system can be obtained by Equation (23). d x and d y are illustrated in Figure 4.
Let h θ pi and h θ ti be the angles by which the head has rotated relative to the initial position at the current moment; the coordinates of the target in the robot coordinate system are then: According to the 3D coordinates of the target in the head coordinate system, the angles between the target and Z h in the horizontal and vertical directions can be obtained as follows: When h γ p and h γ t exceed the set thresholds, the head needs to rotate. To leave a certain margin for the rotation of the eyeball and for the convenience of calculation, the angles required for the head to rotate in the horizontal and vertical directions are calculated by the principles shown in Figure 7a,b, respectively. Figure 7a shows the calculation principle for the horizontal direction when the target's x coordinate in the head coordinate system is greater than zero. After the head is rotated to h θ p , the target point is on the l Z e axis of the left motion module's end coordinate system, and the left motion module reaches the maximum rotatable threshold e θ pmax . Figure 7b shows the calculation principle for the vertical direction when the target's y coordinate in the head coordinate system is greater than d y . After the head is rotated to h θ t , the target point is on the Z e axis of the eye coordinate system, and the eye reaches the maximum rotatable threshold e θ tmax .
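The displayed expressions for these angles are not reproduced above; one plausible form, stated here as an assumption rather than the paper's exact equations, is

$$ {}^{h}\gamma_{p} = \arctan\frac{x_h}{z_h}, \qquad {}^{h}\gamma_{t} = \arctan\frac{y_h}{z_h} $$

Depending on how the offsets d x and d y of Figure 4b are defined, they may additionally enter the vertical term.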

Horizontal Rotation Angle Calculation
Let the current angle of the head in the horizontal direction be h θ pi . When the head is rotated in the horizontal direction to h θ p , the 3D coordinates of the target in the new head coordinate system are Therefore, The coordinates of the target in the new eye coordinate system are After turning, the left motion module reaches the maximum threshold e θ pmax that can be rotated, so that Simplifying Equation (30), we have Assume that According to the triangular relationship, The solution of Equation (33) is Therefore, Equation (35) has two solutions; therefore, we choose the solution in which the deviation e of Equation (36) is minimized: When the obtained h θ p is outside of the range [− h θ pmax , h θ pmax ], the value of h θ pq is Finally, one can obtain the w θ pq value: Based on the same principle, when the x coordinate of the target in the head coordinate system is less than 0, the coordinates of the target in the right motion module base coordinate system after the rotation are After turning, the right motion module reaches − e θ pmax , and the following can be obtained: We simplify Equation (40) as follows: The same two solutions are available: Select the solution in which the deviation e of Equation (44) is minimized: Using Equations (37) and (38), h θ pq and w θ pq can be obtained.
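Once the simplified head equation has been reduced to the form a sin θ + b cos θ = c (the exact coefficients come from the intermediate equations, which are not reproduced here), the two candidate solutions and the residual-based selection can be sketched as follows; the residual callable stands in for the deviation e of Equation (36).

```python
import math

def solve_sin_cos(a, b, c, residual):
    """Solve a*sin(theta) + b*cos(theta) = c and keep the candidate with the
    smaller residual deviation (the role played by e in Equation (36))."""
    r = math.hypot(a, b)
    phi = math.atan2(b, a)                  # a*sin(t) + b*cos(t) = r*sin(t + phi)
    s = max(-1.0, min(1.0, c / r))          # clamp for numerical safety
    base = math.asin(s)
    candidates = (base - phi, math.pi - base - phi)
    return min(candidates, key=residual)

# Usage (hypothetical): theta = solve_sin_cos(a, b, c, residual=lambda t: abs(e_of(t)))
```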

Vertical Rotation Angle Calculation
When the target's y coordinate in the head coordinate system is greater than d y , let the current angle of the head in the vertical direction be h θ ti ; when the head is rotated in the vertical direction to h θ t , the 3D coordinates of the target in the new head coordinate system are: Therefore: Using Equation (29), the coordinates of the target in the eye coordinate system after the rotation can be calculated: After the rotation, the left and right motion modules reach the maximum rotatable value e θ tmax in the vertical direction, so that: Simplifying Equation (48), we obtain: Equation (51) has two solutions; therefore, we choose the solution in which the deviation e of Equation (52) is minimized: Similarly, when the target's y coordinate in the head coordinate system is less than d y , we have: When the obtained h θ t is outside of the range [− h θ tmax , h θ tmax ], the value of h θ tq is: After obtaining h θ pq , h θ tq and w θ pq , P e (x e , y e , z e ) are the coordinates of the target in the eye coordinate system after the mobile robot and the head have been rotated: The desired observation pose of the eye, characterized by l θ tq , l θ pq , r θ tq and r θ pq , can be obtained using the method described in the following section.

Calculation of the Desired Observation Poses of the Eye
According to Formula (2), l θ tq = r θ tq = θ t , and l θ pq = r θ pq = θ p . The inverse of the hand-eye matrix between the left camera and the left motion module's end coordinate system is: The coordinates l P c ( l x c , l y c , l z c ) of P e (x e , y e , z e ) in the left camera coordinate system satisfy the following relationship: According to the pinhole imaging model, the imaging coordinates of the point P e (x e , y e , z e ) in the left camera are: Substituting Equation (61) into Equation (2), we obtain: Based on the same principle, the coordinates r P c ( r x c , r y c , r z c ) of P e (x e , y e , z e ) in the right camera coordinate system are: The imaging coordinates of point P e (x e , y e , z e ) in the right camera are: By Equations (2), (62) and (65), two equations related to θ t and θ p (see Appendix C for the complete equations) can be obtained. It is challenging to calculate the values of θ t and θ p directly from these two equations, however. To obtain a solution, we first compute a suboptimal observation pose and use it as the initial value; then, we use the trial-and-error method to obtain the optimal observation pose. When θ t is calculated, let θ p = 0; the solution of θ t can then be obtained from ∆v l = −∆v r . When θ p is calculated, the solution of θ p is obtained from ∆u l = −∆u r . The resulting solution P ob (θ t , θ t , θ p , θ p ) is a suboptimal observation pose. Based on the suboptimal observation pose, the trial-and-error method can be used to obtain the optimal solution with the smallest error. The range of θ t is [−θ tmax , θ tmax ], and the range of θ p is [−θ pmax , θ pmax ].
According to Equations (60) and (63), let θ p be equal to 0 to obtain: The following result is also available: The base coordinate system of the left motion module is the world coordinate system; therefore, l T w is a unit matrix. To simplify the calculation, we have: According to the calculation principle of Section 3.2.1, we have the following:
∆u l = l f x [( l o x y e + l a x z e ) cos θ t + ( l o x z e − l a x y e ) sin θ t + ( l p x + l n x x e )] / [( l o z y e + l a z z e ) cos θ t + ( l o z z e − l a z y e ) sin θ t + ( l p z + l n z x e )]
∆v l = l f y [( l o y y e + l a y z e ) cos θ t + ( l o y z e − l a y y e ) sin θ t + ( l p y + l n y x e )] / [( l o z y e + l a z z e ) cos θ t + ( l o z z e − l a z y e ) sin θ t + ( l p z + l n z x e )] (69)
∆u r = r f x [( r o x r y e + r a x r z e ) cos θ t + ( r o x r z e − r a x r y e ) sin θ t + ( r p x + r n x r x e )] / [( r o z r y e + r a z r z e ) cos θ t + ( r o z r z e − r a z r y e ) sin θ t + ( r p z + r n z r x e )]
∆v r = r f y [( r o y r y e + r a y r z e ) cos θ t + ( r o y r z e − r a y r y e ) sin θ t + ( r p y + r n y r x e )] / [( r o z r y e + r a z r z e ) cos θ t + ( r o z r z e − r a z r y e ) sin θ t + ( r p z + r n z r x e )] (70)
Assume the following: The solution for θ t that keeps the target at the centers of the two images must satisfy the following conditions: Substituting the second equations of Equations (69) and (70) into Equation (72) and solving the equation, we have
k 1 cos² θ t + k 2 sin² θ t + k 3 sin θ t cos θ t + k 4 cos θ t + k 5 sin θ t + k 6 = 0 (73)
where the coefficients k 1 to k 6 are
k 1 = l f y ( l o y y e + l a y z e )( r o z r y e + r a z r z e ) + r f y ( l o z y e + l a z z e )( r o y r y e + r a y r z e )
k 2 = l f y ( l o y z e − l a y y e )( r o z r z e − r a z r y e ) + r f y ( l o z z e − l a z y e )( r o y r z e − r a y r y e )
k 3 = l f y ( l o y y e + l a y z e )( r o z r z e − r a z r y e ) + l f y ( l o y z e − l a y y e )( r o z r y e + r a z r z e ) + r f y ( l o z y e + l a z z e )( r o y r z e − r a y r y e ) + r f y ( l o z z e − l a z y e )( r o y r y e + r a y r z e )
k 4 = l f y ( l o y y e + l a y z e )( r p z + r n z r x e ) + l f y ( l p y + l n y x e )( r o z r y e + r a z r z e ) + r f y ( l o z y e + l a z z e )( r p y + r n y r x e ) + r f y ( l p z + l n z x e )( r o y r y e + r a y r z e ) (77)
k 5 = l f y ( l o y z e − l a y y e )( r p z + r n z r x e ) + l f y ( l p y + l n y x e )( r o z r z e − r a z r y e ) + r f y ( l o z z e − l a z y e )( r p y + r n y r x e ) + r f y ( l p z + l n z x e )( r o y r z e − r a y r y e ) (78)
k 6 = l f y ( l p y + l n y x e )( r p z + r n z r x e ) + r f y ( l p z + l n z x e )( r p y + r n y r x e )
According to the triangle relationship, we have: Replacing cos θ t in Equation (73) with sin θ t , we obtain the following: where the coefficients of Equation (81) are: Four solutions can be obtained using Equation (81). The optimal solution is a real number, and the most suitable solution can be selected by the condition of Equation (72).
After θ t is obtained, θ p can be solved based on the obtained θ t . According to Equations (60) and (63), θ t is the solution obtained in Section 3.2.2, so that: The following result is also available: Since θ t is known, for convenience of calculation, we set: The following results are obtained:
∆u l = l f x [( l n x x e + l a x z e ) cos θ p + ( l a x x e − l n x z e ) sin θ p + ( l p x + l o x y e )] / [( l n z x e + l a z z e ) cos θ p + ( l a z x e − l n z z e ) sin θ p + ( l p z + l o z y e )]
∆v l = l f y [( l n y x e + l a y z e ) cos θ p + ( l a y x e − l n y z e ) sin θ p + ( l p y + l o y y e )] / [( l n z x e + l a z z e ) cos θ p + ( l a z x e − l n z z e ) sin θ p + ( l p z + l o z y e )] (91)
∆u r = r f x [( r n x r x e + r a x r z e ) cos θ p + ( r a x r x e − r n x r z e ) sin θ p + ( r p x + r o x r y e )] / [( r n z r x e + r a z r z e ) cos θ p + ( r a z r x e − r n z r z e ) sin θ p + ( r p z + r o z r y e )]
∆v r = r f y [( r n y r x e + r a y r z e ) cos θ p + ( r a y r x e − r n y r z e ) sin θ p + ( r p y + r o y r y e )] / [( r n z r x e + r a z r z e ) cos θ p + ( r a z r x e − r n z r z e ) sin θ p + ( r p z + r o z r y e )] (92)
Assume that: The solution for θ p that keeps the target at the centers of the two images must satisfy the following conditions: Substituting the second equations of Equations (91) and (92) into Equation (94) and solving the resulting equation, we obtain
k 1 cos² θ p + k 2 sin² θ p + k 3 sin θ p cos θ p + k 4 cos θ p + k 5 sin θ p + k 6 = 0
where
k 1 = l f x ( l n x x e + l a x z e )( r n z r x e + r a z r z e ) + r f x ( r n x r x e + r a x r z e )( l n z x e + l a z z e ) (96)
k 2 = l f x ( l n x z e − l a x x e )( r n z r z e − r a z r x e ) + r f x ( r n x r z e − r a x r x e )( l n z z e − l a z x e ) (97)
k 3 = l f x ( l n x x e + l a x z e )( r n z r z e − r a z r x e ) + l f x ( l n x z e − l a x x e )( r n z r x e + r a z r z e ) + r f x ( r n x r z e − r a x r x e )( l n z x e + l a z z e ) + r f x ( r n x r x e + r a x r z e )( l n z z e − l a z x e ) (98)
k 4 = l f x ( l n x x e + l a x z e )( r o z r y e + r p z ) + l f x ( l o x y e + l p x )( r n z r x e + r a z r z e ) + r f x ( r o x r y e + r p x )( l n z x e + l a z z e ) + r f x ( r n x r x e + r a x r z e )( l o z y e + l p z ) (99)
k 5 = … r y e + r p x )( l n z z e − l a z x e ) + r f x ( r n x r z e − r a x r x e )( l o z y e + l p z ) (100)
Replacing cos θ p in the above equation with sin θ p , we obtain: Four solutions can be obtained using Equation (102). The optimal solution must be a real number, and the most suitable solution can be selected using the condition of Equation (94). If none of the four solutions satisfies Equation (94), the position of the target is beyond the reach of the bionic eye, and compensation through the head or torso is required; the θ t and θ p obtained at this time are suboptimal solutions close to the optimal ones. Otherwise, the obtained θ t and θ p are the optimal solutions.
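Instead of expanding the trigonometric polynomial into a quartic, it can also be solved numerically, which mirrors the trial-and-error selection described above. The sketch below assumes the coefficients k 1 to k 6 and a centering-error measure (the condition of Equation (72) or (94)) are available as Python values/callables; it is an illustration, not the paper's implementation.

```python
import numpy as np

def trig_poly(theta, k):
    """k = (k1, k2, k3, k4, k5, k6) from the expressions above."""
    k1, k2, k3, k4, k5, k6 = k
    s, c = np.sin(theta), np.cos(theta)
    return k1 * c**2 + k2 * s**2 + k3 * s * c + k4 * c + k5 * s + k6

def solve_trig_poly(k, theta_max, centering_error, n_grid=2001, n_bisect=60):
    """Bracket the roots of the trigonometric polynomial on [-theta_max, theta_max],
    refine them by bisection and keep the one with the smallest centering error."""
    grid = np.linspace(-theta_max, theta_max, n_grid)
    roots = []
    for lo, hi in zip(grid[:-1], grid[1:]):
        f_lo, f_hi = trig_poly(lo, k), trig_poly(hi, k)
        if f_lo == 0.0:
            roots.append(float(lo))
            continue
        if f_lo * f_hi > 0.0:
            continue
        for _ in range(n_bisect):                     # simple bisection refinement
            mid = 0.5 * (lo + hi)
            if trig_poly(lo, k) * trig_poly(mid, k) <= 0.0:
                hi = mid
            else:
                lo = mid
        roots.append(0.5 * float(lo + hi))
    return min(roots, key=centering_error) if roots else None
```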
Through the above steps, the desired observation pose can be calculated. The calculation steps of g f q can be summarized by the flow chart shown in Figure 8.

Desired Pose Calculation for Approaching Gaze Point Tracking
The mobile robot approaches the target in two steps: first, the robot and the head rotate in the horizontal direction until they face the target; second, the robot moves straight toward the target. The desired position of the approaching motion should satisfy the following conditions: (1) the target should be on the Z axes of the robot and head coordinate systems, (2) the distance between the target and the robot should be less than the set threshold D T and (3) the eyes should be in the optimal observation position. a f q can be defined as: The desired rotation angle w θ pq of the mobile robot is equal to the angle b γ p between the robot and the target and can be obtained by: h θ tq can be obtained using the method described in Section 3.2. The optimal observation pose described in Section 3.2.4 can be used to obtain l θ tq , l θ pq , r θ tq and r θ pq .
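A minimal sketch of this two-step approach behaviour is given below, with the target position (x w , z w ) expressed in the robot coordinate system; the angular tolerance and function names are assumptions.

```python
import math

def approach_step(x_w, z_w, d_threshold, angle_tol=math.radians(1.0)):
    """Return the next action for the approaching task."""
    bearing = math.atan2(x_w, z_w)     # angle between the robot's Z axis and the target
    distance = math.hypot(x_w, z_w)
    if abs(bearing) > angle_tol:
        return "rotate", bearing       # step 1: turn the robot (and head) toward the target
    if distance > d_threshold:
        return "drive", distance - d_threshold   # step 2: move straight toward the target
    return "stop", 0.0                 # the target is within the threshold D_T
```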

Robot Pose Control
After obtaining the desired pose of the robot system, the control block diagram shown in Figure 9 is used to control the robot to move to the desired pose.

The desired pose is converted into desired motor positions. ∆θ lt , ∆θ lp , ∆θ rt , ∆θ rp , ∆θ ht and ∆θ hp are the deviations of the desired angles from the current angles of motors M lu , M ld , M ru , M rd , M hu and M hd , respectively. l θ m and r θ m are the angles through which the two wheels of the mobile robot need to be rotated. During in situ gaze point tracking, the mobile robot performs only rotation in place, and the angle of the robot movement can be calculated according to the desired angle of the robot. When the robot rotates, the two wheels move in opposite directions at the same speed. Let the distance between the two wheels of the mobile robot be D r ; when the robot rotates through an angle w θ pq , the distance that each wheel needs to move is
S = w θ pq D r / 2 (110)
The diameter of each wheel is d w , and the angle of rotation of each wheel is (where counterclockwise is positive): In the process of approaching the target, the mobile robot follows a straight line, and the angle of rotation of each wheel is: The movement of the mobile robot is achieved by controlling the rotation of each wheel. Each wheel is driven by a DC brushless motor, and a DSP2000 controller is used to control the motion of the DC brushless motors. Position servo control is implemented in the DSP2000 controller.
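The wheel commands implied by Equation (110) and the subsequent relations can be sketched as follows (angles in radians, with the sign convention assumed here; variable names are illustrative):

```python
def wheel_angles_for_rotation(w_theta_pq, d_r, d_w):
    """In-place rotation: each wheel travels S = w_theta_pq * d_r / 2 (Eq. (110)),
    in opposite directions; d_r is the wheel separation, d_w the wheel diameter."""
    s = w_theta_pq * d_r / 2.0
    wheel_angle = 2.0 * s / d_w          # arc length divided by the wheel radius
    return +wheel_angle, -wheel_angle    # the two wheels counter-rotate

def wheel_angles_for_straight(distance, d_w):
    """Straight approach: both wheels rotate the same way by 2*distance/d_w radians."""
    wheel_angle = 2.0 * distance / d_w
    return wheel_angle, wheel_angle
```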
In the robot system, the weight of the camera and lens is approximately 80 g, the weight of the camera and the fixed mechanical parts is approximately 50 g and the motor that controls the vertical rotation of the camera (rotating around the horizontal axis of The desired pose is converted to the desired position of the motor. ∆θ lt , ∆θ lp , ∆θ rt , ∆θ rp , ∆θ ht and ∆θ hp are deviations of the desired angle from the current angle of motor M lu , motor M ld , motor M ru , motor M rd , motor M hu and motor M hd , respectively. l θ m and r θ m are the angles at which each wheel of the moving robot needs to be rotated. During the in situ gaze point tracking process, the moving robot performs only the rotation of the original position, and the angle of the robot movement can be calculated according to the desired angle of the robot. When the robot rotates, the two wheels move in opposite directions at the same speed. Let the distance between the two wheels of the moving robot be D r ; when the robot rotates around an angle w θ pq , the distance that each wheel needs to move is S = w θ pq D r 2 (110) The diameter of each wheel is d w , and the angle of rotation of each wheel is (where counterclockwise is positive) In the process of approaching the target, the moving robot follows a straight line, and the angle of rotation of each wheel is The movement of the moving robot is achieved by controlling the rotation of each wheel. Each wheel is equipped with a DC brushless motor, and a DSP2000 controller is used to control the movement of the DC brushless motor. Position servo control is implemented in the DSP2000 controller.
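To make the wheel-angle computation concrete, the sketch below follows Equation (110): for an in-place rotation each wheel travels $S = {}^{w}\theta_{pq} D_r/2$, and a travelled arc length $S$ corresponds to a wheel rotation of $2S/d_w$ radians. The function and variable names are illustrative, and the assignment of the positive sign to the left wheel is an assumption about the sign convention.

```python
def in_place_rotation_wheel_angles(theta_pq, wheel_base, wheel_diameter):
    """Wheel rotation angles (rad) for an in-place robot rotation theta_pq (rad).

    Each wheel travels S = theta_pq * D_r / 2 (Equation (110)); a travelled
    distance S corresponds to a wheel rotation of 2*S / d_w radians.
    The two wheels turn in opposite directions (counterclockwise positive).
    """
    s = theta_pq * wheel_base / 2.0
    wheel_angle = 2.0 * s / wheel_diameter
    return wheel_angle, -wheel_angle   # (left wheel, right wheel); sign convention assumed

def straight_line_wheel_angle(distance, wheel_diameter):
    """Wheel rotation angle (rad) for a straight-line move of `distance` metres;
    both wheels rotate by the same amount in the same direction."""
    return 2.0 * distance / wheel_diameter
```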
In the robot system, the camera and lens weigh approximately 80 g, the mechanical parts fixing the camera weigh approximately 50 g and the motor that controls the vertical rotation of the camera (rotation about the horizontal axis) together with its encoder weighs approximately 250 g. The mechanical parts fixing the vertical rotation motor and encoder weigh approximately 100 g. The radius of rotation of the camera in the vertical direction is approximately 1 cm, and the rotation in the horizontal direction (rotation about the vertical axis) has a radius of approximately 2 cm. Therefore, with a gravitational acceleration of 9.8 m/s², the torque required by the vertical rotation motor is approximately 0.013 N·m, and the torque required by the horizontal rotation motor is approximately 0.043 N·m. The vertical rotation motor is a 28BYG5401 stepping motor with a holding torque of 0.1 N·m and a positioning torque of 0.008 N·m; its driver is an HSM20403A. The horizontal rotation motor is a 57BYGH301 stepping motor with a holding torque of 1.5 N·m, a positioning torque of 0.07 N·m and an HSM20504A driver. The four stepping motors of the eyes have a step angle of 1.8° and are all driven with 25× micro-stepping, so the actual step angle of each motor is 0.072°, and the minimum pulse width that the driver can accept is 2.5 µs. Each stepper motor has a maximum angular velocity of 200°/s. The head vertical rotation motor is a 57BYGH401 stepper motor with a holding torque of 2.2 N·m, a positioning torque of 0.098 N·m and an HSM20504A driver. The head horizontal rotation motor is an 86BYG350B three-phase stepping motor with a holding torque of 5 N·m, a positioning torque of 0.3 N·m and an HSM30860M driver. The step angle of the head motors after micro-stepping is also 0.072°. The head vertical motor carries a load of approximately 5 kg with a radius of rotation of less than 1 cm, and the head horizontal rotation motor carries a load of approximately 9.5 kg with a radius of rotation of approximately 5 cm. In the experiments, we found that the maximum pulse frequency that the head horizontal rotation motor can accept is 0.6 kpps, so its maximum angular velocity is 43.2°/s.
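The specifications above determine how an angular deviation is turned into step pulses. The following sketch only restates the quoted numbers (0.072° effective step angle after 25× micro-stepping; 0.6 kpps maximum pulse rate for the head pan motor, i.e. 43.2°/s); the function names are illustrative.

```python
def steps_for_angle(delta_deg, step_angle_deg=0.072):
    """Number of step pulses for an angular deviation, at 0.072 deg per step
    (1.8 deg full step subdivided by 25)."""
    return round(delta_deg / step_angle_deg)

def min_time_for_angle(delta_deg, max_pulse_rate_hz=600.0, step_angle_deg=0.072):
    """Minimum execution time given the driver's maximum pulse rate
    (0.6 kpps for the head pan motor, i.e. 600 * 0.072 = 43.2 deg/s)."""
    return abs(steps_for_angle(delta_deg, step_angle_deg)) / max_pulse_rate_hz

# Example: a 30 deg head pan needs about 417 pulses and at least about 0.69 s.
```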

Experiments and Discussion
Using the robot platform introduced in Section 2, experiments on in situ gaze point tracking and approaching gaze point tracking were performed. Each camera has a resolution of 400 × 300 pixels.
The experimental in situ gaze point tracking scene is shown in Figure 10, with a checkerboard target used as the target. For in situ gaze point tracking, the target is held by a person. In the approaching target gaze tracking experiment, the target is fixed in front of the robot.

In Situ Gaze Point Tracking Experiment
In the in situ gaze experiment, the target moves at a low speed within a certain range, and the robot combines the movements of the eyes, the head and the mobile robot so that binocular vision can always perceive the 3D coordinates of the target in the optimal observation posture. This experiment prompts the robot to find the target and gaze at it. In the gaze point tracking process, binocular stereo vision is used to calculate the 3D coordinates of the target in the eye coordinate system in real time. Through the positional relationship between the eyes and the head, the coordinates of the target in the eye coordinate system can be converted to the head coordinate system; similarly, the 3D coordinates of the target in the robot coordinate system can be obtained. From the 3D coordinates, the desired poses of the eyes, head and mobile robot are calculated according to the method proposed in this paper. Then, the cameras are driven to the desired positions by the stepping motors; after the desired positions are reached, the images and the motor position information are collected again, and the 3D coordinates of the target are recalculated.
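The chain of coordinate conversions described above (target measured in the eye frame, converted to the head frame and then to the mobile-robot frame) can be written with homogeneous transforms. The 4×4 matrices below are placeholders for the platform's calibrated extrinsic transforms; the code only illustrates the conversion itself.

```python
import numpy as np

def to_homogeneous(p):
    """3D point -> homogeneous column vector."""
    return np.array([p[0], p[1], p[2], 1.0])

def transform_target(p_eye, T_head_eye, T_robot_head):
    """Convert a target point from the eye frame to the head and robot frames.

    T_head_eye  : 4x4 homogeneous transform, eye frame -> head frame
    T_robot_head: 4x4 homogeneous transform, head frame -> robot frame
    (both obtained from the kinematics/calibration of the platform).
    """
    p_head = T_head_eye @ to_homogeneous(p_eye)
    p_robot = T_robot_head @ p_head
    return p_head[:3], p_robot[:3]
```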
In the experiment, the thresholds of the angle between the head and the target, ${}^{h}\gamma_{t\max}$ and ${}^{h}\gamma_{p\max}$, are each 30°. The method described in Section 3 is used to calculate the desired pose of each joint of the robot from the 3D coordinates of the target. During the experiment, the actual and desired coordinates of the target in the binocular image space, the actual and desired positions of the eye and head motors, the angles between the head and the target and between the robot and the target, and the coordinates of the target in the robot coordinate system are stored. Figure 11 shows the results of the in situ gaze point tracking experiment. Figure 11j shows the angle deviation and rotation: T-h is the angle between the head and the target, T-r is the angle between the robot and the target, R-r is the angle of the robot's rotation from the origin location and T-o is the angle of the target to the origin location. Figure 11k shows the coordinates (${}^{w}x$, ${}^{w}z$) of the target in the world coordinate system. Figure 11l shows the coordinates (${}^{o}x$, ${}^{o}z$) of the target in the world coordinate system of the origin location. In subfigures (k) and (l), "+" represents the position of the target and "☆" represents the position of the robot.

As shown in Figure 11, the image coordinates of the target remain substantially within ±40 pixels of the central region of the left and right images in the x direction and within ±10 pixels of the central region in the y direction. Throughout the experiment, the target was rotated approximately 200° around the robot; the robot moved approximately 140°, the head rotated 30° and the target was kept in the center region of the binocular images. The motor position curves show that the operating position of each motor tracks the desired position very well. The angle variation curves show that the changes in the angles between the target and the head and between the target and the robot are consistent with the robot's turning angles. The coordinates of the target in the robot coordinate system and in the world coordinate system of the initial position, shown in Figure 11, closely follow the actual change in the target's position.
Through the above analysis, we can determine the following: (1) It is feasible to realize gaze point tracking of a robot based on 3D coordinates. (2) With the coordinated movement of the eyes, head and mobile robot used in this paper, gaze point tracking of the target can be achieved while ensuring minimum resource consumption.


Approaching Gaze Point Tracking Experiment
The approaching gaze point tracking experimental scene is shown in Figure 12. The robot approaches the target without obstacles and reaches the area in which it can operate on the target, where the target can be grasped or observed in detail. In the approaching gaze experiment, the target is fixed at a position 2.2 m from the robot; the robot stops when the distance between the target and the robot drops to 0.6 m, and the maximum speed of the mobile robot is 1 m/s. The approaching movement is realized in two steps: first, the head, the eyes and the mobile robot chassis rotate so that the head and the mobile robot face the target, with the head observing the target in the optimal observation pose; second, the robot moves linearly toward the target. During the movement, the angles of the head and the eyes are fine-tuned, and the 3D coordinates of the target are detected in real time until the z coordinate of the target in the robot coordinate system is less than the threshold set to stop the motion. Figure 13 shows the results of the approaching gaze point tracking experiment. Figure 13i shows the positions of the pan motor ($M_{hd}$) of the head. Figure 13j shows the angle deviation and rotation: T-h is the angle between the head and the target, T-r is the angle between the robot and the target, R-r is the angle of the robot's rotation from the origin location and T-o is the angle of the target to the origin location. Figure 13k shows the coordinates (${}^{w}x$, ${}^{w}z$) of the target in the world coordinate system. Figure 13l shows the robot's forward distance and the distance between the target and the robot.
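The two-step approaching motion described above can be summarized as a simple control loop. The sketch below uses placeholder functions (measure_target_in_robot_frame, rotate_robot_and_head, drive_forward) standing in for the stereo measurement and the motor-control layer described earlier; the stop threshold (0.6 m) and speed cap (1 m/s) are the values used in the experiment.

```python
import math

STOP_DISTANCE = 0.6   # m, distance at which the approach stops
MAX_SPEED = 1.0       # m/s, maximum speed of the mobile robot

def approach_target(measure_target_in_robot_frame, rotate_robot_and_head,
                    drive_forward, cycle_time=0.1):
    """Two-step approaching gaze point tracking (sketch with placeholder I/O).

    Step 1: rotate the robot and head so the target lies on the Z axis.
    Step 2: drive straight toward the target, re-measuring its 3D coordinates
    every control cycle, until its z coordinate falls below the threshold.
    """
    # Step 1: face the target.
    x, _, z = measure_target_in_robot_frame()
    rotate_robot_and_head(math.atan2(x, z))

    # Step 2: straight-line approach with per-cycle re-measurement.
    while True:
        _, _, z = measure_target_in_robot_frame()
        if z < STOP_DISTANCE:
            break
        drive_forward(min(MAX_SPEED * cycle_time, z - STOP_DISTANCE))
```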
The image coordinate curves indicate that the coordinates of the target in the left and right images move from the initial position to the central region of the image and stabilize there during the approach. In the first step, while turning toward the target, the target coordinates in the image fluctuate because the head motor rotates through a large angle and is accompanied by a certain vibration during the rotation; this could be avoided by using a system with better stability. The motor position curves in Figure 13 show that the motors track the desired pose well; since prediction of the 3D coordinates is not used during the tracking process, the tracking is accompanied by a lag of one control cycle. The angle curves in Figure 13 show that the robot system accomplishes the task of steering toward the target within the first few control cycles and then moves toward the target at a stable angle. Figure 13a shows the change in the coordinates of the target in the robot coordinate system; when the robot rotates, fluctuations arise in the measured x coordinate, mainly because of the measurement error caused by the shaking of the system. The experimental results in Figure 13b show that the robot's movement toward the target is very consistent. During the approach, the target is kept within ±50 pixels of the desired position in the horizontal direction of the image and within ±20 pixels in the vertical direction. The eye motors achieve fast tracking of the target within 1.5 s. The angle between the target and the head is reduced from 20° to 0°, and the angle between the target and the robot is reduced from 35° to 0°; the robot rotates approximately 34° in total, and the angle of the target relative to the initial position changes by approximately 34°. From the above analysis, it can be seen that, with the combination of the head, the eyes and the trunk used in the present method, the approach toward the target can be achieved while ensuring that the robot keeps gazing at the target.

Conclusions
This study achieved gaze point tracking based on the 3D coordinates of the target. First, a robot experimental platform was designed: on the basis of the bionic eye experimental platform, a head with two degrees of freedom was added, with a mobile robot serving as the carrier.
Based on the characteristics of the robot platform, this paper proposed a gaze point tracking method. To achieve in situ gaze point tracking, the combination of the eyes, head and trunk is designed according to the principles of minimum resource consumption and maximum system stability. Eye rotation consumes the fewest resources and has minimal impact on the stability of the overall system during motion. Head rotation consumes more resources than eye rotation but fewer than trunk rotation; at the same time, the rotation of the head affects the stability of the eyes but only minimally affects the stability of the entire robotic system. Trunk rotation generally consumes the most resources and tends to affect the stability of both the head and the eyes. Therefore, when the eyes can observe the target in the optimal observation posture, only the eyes are rotated; otherwise, the head is rotated, and when the angle through which the head would need to move exceeds its threshold, the mobile robot rotates. When approaching gaze point tracking is performed, the robot and head first turn to face the target and then move straight toward the vicinity of the target. On the basis of the proposed gaze point tracking method, this paper provides a desired pose calculation method for the horizontal and vertical rotation angles.
Based on the experimental robot platform, a series of experiments was performed, and the effectiveness of the gaze point tracking method was verified. In future work, we will apply the method to a practical medicine-delivery task in a hospital and carry out more detailed comparative experiments and discussions with respect to other similar studies.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Measurement Error Analysis of Binocular Stereo Vision
It is assumed that the angle at which the left motion module rotates in the horizontal direction with respect to the initial position is ${}^{l}\theta_p$ and that the angle at which the right motion module rotates in the horizontal direction with respect to the initial position is ${}^{r}\theta_p$. When using the bionic eye platform for 3D coordinate calculation, we find that the measurement is more accurate when ${}^{l}\theta_p = {}^{r}\theta_p$, as shown in Figure A1a,b, than when ${}^{l}\theta_p > 0$ and ${}^{r}\theta_p < 0$, as shown in Figure A1c. This appendix analyzes the measurement error of the three-dimensional coordinates, explains the reason for this phenomenon and proposes a method to improve the accuracy of three-dimensional coordinate measurement.
For the convenience of calculation, according to the characteristics of the bionic eye platform, the binocular stereo vision measurement model shown in Figure A2 is used to analyze the error of the three-dimensional coordinate measurement. Two cameras are used to imitate the human eyes, and the principle of the vision system of the bionic eyes is shown in Figure A2. We suppose that the optical axes of the two cameras are coplanar. As shown in Figure A2, $O_w X_w Z_w$ is the world coordinate system, the $X_w$ axis is along the baseline of the two cameras and the $Z_w$ axis lies in the plane formed by the two cameras' optical axes. ${}^{l}O_c\,{}^{l}X_c\,{}^{l}Z_c$ is the coordinate system of the left camera, where ${}^{l}Z_c$ is along the optical axis of the left camera and the ${}^{l}X_c$ axis lies in the same plane. ${}^{r}O_c\,{}^{r}X_c\,{}^{r}Z_c$ is the coordinate system of the right camera, where ${}^{r}Z_c$ is along the optical axis of the right camera and the ${}^{r}X_c$ axis lies in the same plane. The two cameras move cooperatively to imitate the movement of the human eyes.

Appendix A.2. Depth Measurement Model
In Figure A2, ${}^{l}R_w$ is the rotation transformation matrix from the world coordinate system to the left camera coordinate system; it is determined by the horizontal and vertical rotation angles ${}^{l}\theta_p$ and ${}^{l}\theta_t$ of the left camera.
Expanding the transformation of the target's world coordinates into the left camera coordinate system gives

$$
\begin{bmatrix}
{}^{l}n_x x_w\cos{}^{l}\theta_p - {}^{l}n_x z_w\sin{}^{l}\theta_p + {}^{l}o_x x_w\sin{}^{l}\theta_p\sin{}^{l}\theta_t + {}^{l}o_x y_w\cos{}^{l}\theta_t + {}^{l}o_x z_w\cos{}^{l}\theta_p\sin{}^{l}\theta_t + {}^{l}a_x x_w\sin{}^{l}\theta_p\cos{}^{l}\theta_t - {}^{l}a_x y_w\sin{}^{l}\theta_t + {}^{l}a_x z_w\cos{}^{l}\theta_p\cos{}^{l}\theta_t + {}^{l}p_x\\[1mm]
{}^{l}n_y x_w\cos{}^{l}\theta_p - {}^{l}n_y z_w\sin{}^{l}\theta_p + {}^{l}o_y x_w\sin{}^{l}\theta_p\sin{}^{l}\theta_t + {}^{l}o_y y_w\cos{}^{l}\theta_t + {}^{l}o_y z_w\cos{}^{l}\theta_p\sin{}^{l}\theta_t + {}^{l}a_y x_w\sin{}^{l}\theta_p\cos{}^{l}\theta_t - {}^{l}a_y y_w\sin{}^{l}\theta_t + {}^{l}a_y z_w\cos{}^{l}\theta_p\cos{}^{l}\theta_t + {}^{l}p_y\\[1mm]
{}^{l}n_z x_w\cos{}^{l}\theta_p - {}^{l}n_z z_w\sin{}^{l}\theta_p + {}^{l}o_z x_w\sin{}^{l}\theta_p\sin{}^{l}\theta_t + {}^{l}o_z y_w\cos{}^{l}\theta_t + {}^{l}o_z z_w\cos{}^{l}\theta_p\sin{}^{l}\theta_t + {}^{l}a_z x_w\sin{}^{l}\theta_p\cos{}^{l}\theta_t - {}^{l}a_z y_w\sin{}^{l}\theta_t + {}^{l}a_z z_w\cos{}^{l}\theta_p\cos{}^{l}\theta_t + {}^{l}p_z
\end{bmatrix}
\tag{A39}
$$

For brevity, denote the three rows of Equation (A39) by ${}^{l}E_x$, ${}^{l}E_y$ and ${}^{l}E_z$, and let ${}^{r}E_x$, ${}^{r}E_y$ and ${}^{r}E_z$ denote the analogous expressions for the right camera, obtained by replacing the superscript $l$ with $r$ and $(x_w, y_w, z_w)$ with $({}^{r}x_w, {}^{r}y_w, {}^{r}z_w)$. Substituting Equation (A39) into Equation (62), we obtain

$$
\Delta m_l =
\begin{bmatrix}
\dfrac{{}^{l}k_x\,{}^{l}E_x}{{}^{l}E_z}\\[3mm]
\dfrac{{}^{l}k_y\,{}^{l}E_y}{{}^{l}E_z}
\end{bmatrix}
\tag{A40}
$$

Based on the same principle, substituting into each matrix and factoring the value of $\Delta m_r$, we obtain

$$
\Delta m_r =
\begin{bmatrix}
\dfrac{{}^{r}k_x\,{}^{r}E_x}{{}^{r}E_z}\\[3mm]
\dfrac{{}^{r}k_y\,{}^{r}E_y}{{}^{r}E_z}
\end{bmatrix}
\tag{A41}
$$

By Equations (2), (A40) and (A41), Equation (A42), which relates $\theta_t$ and $\theta_p$, can be obtained. With the common angles $\theta_t$ and $\theta_p$ substituted for the individual camera angles, it reads

$$
\begin{cases}
{}^{l}k_x\,{}^{l}E_x\,{}^{r}E_z = {}^{r}k_x\,{}^{r}E_x\,{}^{l}E_z\\[1mm]
{}^{l}k_y\,{}^{l}E_y\,{}^{r}E_z = {}^{r}k_y\,{}^{r}E_y\,{}^{l}E_z
\end{cases}
\tag{A42}
$$

It can be seen from Equation (A42) that both $\theta_t$ and $\theta_p$ appear only inside trigonometric functions, so it is difficult to obtain their values directly from these two equations. To obtain a solution that is usable in practice, we first compute a sub-optimal observation pose, use it as the initial value and then refine it by trial and error to obtain the optimal observation pose.
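One straightforward numerical realization of this "sub-optimal initial value plus trial and error" strategy is to treat the two equalities of Equation (A42) as residuals and refine $(\theta_t, \theta_p)$ with a local least-squares solver. The sketch below assumes a user-supplied function a42_residuals(theta_t, theta_p) that evaluates those residuals from the calibrated camera parameters; the use of least squares is an implementation choice, not the paper's prescribed procedure.

```python
import numpy as np
from scipy.optimize import least_squares

def refine_observation_pose(a42_residuals, theta_t0, theta_p0):
    """Refine a sub-optimal observation pose (theta_t0, theta_p0) so that the
    two equalities of Equation (A42) hold as closely as possible.

    a42_residuals(theta_t, theta_p) must return the two residuals
    (left-hand side minus right-hand side of each equality in (A42)).
    """
    result = least_squares(
        lambda x: a42_residuals(x[0], x[1]),
        x0=np.array([theta_t0, theta_p0]),
    )
    return result.x  # refined (theta_t, theta_p)
```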