Say What You Are Looking at: An Attention-Based Interactive System for Autistic Children

Abstract: Gaze-following is an effective approach to intention understanding in human–robot interaction; it follows a person's gaze to estimate which object is being observed. Most existing methods require the person and the object to appear in the same image. Because the camera's field of view is limited, these methods are not applicable in practice. To address this problem, we propose a gaze-following method that utilizes a geometric map for better estimation. With the help of the map, this method is competitive for cross-frame estimation. On the basis of this method, we propose a novel gaze-based image caption system, a task that is studied here for the first time. Our experiments demonstrate that the system follows the gaze and describes objects accurately. We believe that this system is suitable for autistic children's rehabilitation training, elderly-care service robots, and other applications.


Introduction
Humans are very good at understanding the intentions of others by following their gaze. We can infer that a child is interested in a ball if they keep staring at it. We can also find crucial clues from a suspect's attention at the scene of a crime. This ability lets us obtain obscure but essential information. If robots also had the capability of gaze-following, they would be competent for many human-robot interaction tasks, including helping doctors with rehabilitation training for autism [1,2]. This is the goal that we set out to achieve. Autistic children are usually interested in unusual objects, such as bottle caps or door handles, instead of the toys that non-autistic children like.
Appl. Sci. 2021, 11, 7426
Autistic children may pay attention to an object they are interested in for a long time and ignore the doctor's instructions in rehabilitation training. The robot therefore needs to understand what the children are interested in, and it also needs to talk to them to gain their attention and guide them to participate in the doctor's task. This is challenging for a robot. As shown in Figure 1, the general image caption focuses on describing what the robot sees, while the gaze-based image caption focuses on describing what the child is looking at. The latter is more suitable for interaction scenarios. In this paper, we propose a gaze-based image caption system. We mounted a depth camera (RealSense D435) on the robot's chest to collect images. This camera captures a color image and the corresponding depth image. The depth image measures the distance of every pixel, which is used to reconstruct the geometric relations between the gaze and the map. Given a video sequence of depth images and the corresponding color images, the system builds a point cloud map for locating what the children are interested in and generates a caption to interact with them.
Figure 1. The difference between general image caption and gaze-based image caption. Panels: General Image Caption; Gaze-based Image Caption. Example captions shown in the figure: "The kid is looking at a banana." and "There is a banana on the table, can you see that?"

Gaze-Following
Gaze-following is the problem of predicting where people are looking in a given image or video. Many previous works have used wearable devices, such as glasses with a camera, to track the gaze direction and locate the target in a 3D map [3,4]. However, wearable devices are usually unavailable in practical applications. A more promising way is to predict the gaze from images directly. Fathi et al. [5] and Marin-Jimenez et al. [6] addressed this problem by assuming that people are only looking at each other. They used the posture and position of the head as clues to locate subjects. Inspired by the study of electric fields, Park et al. [7] used "social charges" to represent latent quantities that drive the primary gaze behavior of members of a social group. Recasens et al. [8] made a major contribution by being the first to publish a gaze-following dataset and by proposing a two-pathway model (a gaze pathway and a saliency pathway). However, Lian et al. [9] and Parks et al. [10] considered gaze and saliency to be internally connected rather than completely independent. For example, the gaze point is always located at a salient place along the gaze direction. Drawing on this idea, they proposed new models with better performance. Chong et al. [11,12] proposed a new deep architecture that models the gaze over time. It directly learns the gaze-relevant scene regions from face features instead of being given the head posture. Unfortunately, the above methods require that the people and the objects appear in the same image. For example, suppose a child is watching TV, but the TV is not in the view of the camera. These methods cannot keep following the gaze when the camera turns toward the TV. Recasens et al. [13] named this problem cross-frame gaze-following. They proposed a model that adds a transformation pathway to their previous work [8]. Given two images from different views, this model keeps following the gaze by estimating the transformation between the two frames.
However, the deep learning method for geometric estimation is inaccurate and increases the uncertainty of the result. Therefore, we propose a different gaze-following method. We take a sequence of depth and color images as the input and build a three-dimensional (3D) map for predicting the gaze point. Our method provides a new solution for cross-frame gaze-following. The main differences between the above methods and ours are that (a) we build and use 3D maps online for cross-frame gaze-following, and (b) compared with other geometric methods, we propose an occlusion detection mechanism that minimizes erroneous predictions.

Attention-Based Image Caption
Computers are expected to describe the world from a human perspective. An image contains a great quantity of information. How do computers choose the valuable information? The answer is the attention mechanism. Most of the existing works focus on describing the whole image in as human-like a way as possible. Xu et al. [14] and Lu et al. [15] generated contextual attention with a recurrent neural network (RNN) [16] and adjusted the weight of the context vector to generate a sentence. Cornia et al. [17,18] considered attention as visually salient information that is prominent and easy to notice, such as high-contrast objects. Moreover, Liu et al. [19] and Sugano et al. [20] modeled attention as gaze data from the subjects' eye movements while watching the video. Some extended image caption methods were studied to describe the details of the image. Johnson et al. [21] used the region proposal network (RPN) [22] to generate multiple local bounding boxes to extract features. Each region feature is used to generate a sentence. Subsequently, Yang et al. [23] improved the model using joint inference and context fusion. Wang et al. [24] proposed a multilayer dense attention model to minimize the interference due to non-salient information.
However, all these attention measurement methods are quite subjective, and the results vary with people's personalities, age, and emotions. These methods may not be suitable for some people, such as autistic children. To obtain a more objective result, we ensured that the attention is reflected only by the gaze rather than by other factors. We propose a novel method that attaches weights to the regions where people are looking. We note that our gaze data are measured from the people in the video, while the gaze data of methods [19,20] are collected from the people who are watching the video. In this paper, a gaze-based image caption system is proposed on the basis of the work of [21]. We utilized the result of gaze-following to select the description candidates and to describe the attention area. Similar to [25], our system was mainly designed for autistic children, but it can also be applied to the general population.
We have three main contributions in this paper: (1) A novel gaze-following method is proposed on the basis of spatial geometry. It predicts attention regions from the spatial relationship between the map and the sightline. (2) An image caption method guided by the gaze is proposed. It describes the region that users concentrate on according to attention heatmaps. (3) For the first time, we study the problem of describing the region where people are looking and combine image caption with gaze-following.

Methods
We propose an object description system on the basis of third-person visual attention. The system predicts a person's area of interest and describes this area through an object description algorithm. Figure 2 shows the framework of the whole system. It consists of visual attention prediction and object description. With the support of SLAM (simultaneous localization and mapping) and gaze tracking techniques, we construct a 3D map of the environment online and track the gaze of the person in real time. We utilize the geometric relationship between the map and the sightline to assign weights in the map that indicate the degree of interest. Finally, the object description algorithm is used to preferentially describe objects with high attention. There are four main challenges in this system: (1) object occlusion detection: some occluded objects in the map are invisible, and the algorithm has to determine the obscured point clouds in the sparse map; (2) map update: since moving objects are significantly destructive to the structure of the map, the point clouds of these objects should be updated in real time to avoid residual traces of previous ones; (3) sightline error: the results of gaze tracking are often affected by a number of factors, resulting in violent shaking, and it is therefore essential to improve the stability of gaze tracking; (4) region description: the system has to describe specific areas according to the actual attention of the human eye. We describe our algorithm in detail, along with the solutions to these challenges, in Sections 2.1 and 2.2.
Figure 2. The system takes a video as input. It builds a map and estimates the gaze simultaneously. Then, an attention heat map is predicted. Finally, the system selects the attention region to generate a sentence from the image.

Visual Attention Prediction
Before carrying out visual attention prediction, we need to process the results of gaze tracking. Since gaze estimation is strongly affected by noise, we also adopt head posture estimation and weight the two estimates according to their confidence levels, which generates more stable and accurate results. The sightline is calculated by Equations (1) and (2).
Parameters p_l and p_r represent the left and right eye positions, respectively. The gaze directions of the left and right eyes are d_l and d_r, respectively. The parameter d_h represents the head direction, and α_e and α_h are standardized confidence levels. Equation (1) determines the equation of the sightline in 3D space, where p is the starting point of the sightline and d is the gaze direction. We then transform the line equation from the camera coordinate system to the world coordinate system and represent the sightline with Equation (3). Notably, the sightline has a direction, and therefore the value of the parameter t is greater than 0.
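The fusion described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the exact form of Equations (1) to (3): it assumes the starting point p is the midpoint of the two eye positions, the direction d is the confidence-weighted sum of the mean eye gaze and the head direction, and the camera-to-world transform is a rigid motion given by a rotation `rot` and a translation `trans`. The function names, `rot`, and `trans` are hypothetical, and plain tuples are used to keep the example dependency-free.

```python
import math


def normalize(v):
    """Scale a 3D vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length direction")
    return tuple(c / n for c in v)


def fuse_sightline(p_l, p_r, d_l, d_r, d_h, alpha_e, alpha_h):
    """Assumed fusion: midpoint of the eyes, confidence-weighted direction."""
    p = tuple((a + b) / 2.0 for a, b in zip(p_l, p_r))
    d_eye = normalize(tuple((a + b) / 2.0 for a, b in zip(d_l, d_r)))
    d_head = normalize(d_h)
    d = tuple(alpha_e * e + alpha_h * h for e, h in zip(d_eye, d_head))
    return p, normalize(d)


def to_world(p, d, rot, trans):
    """Map the sightline from camera to world coordinates.

    rot is a 3x3 camera-to-world rotation (row-major); trans is the camera
    position in the world frame. Both are assumed to come from the SLAM pose.
    """
    p_w = tuple(sum(rot[i][k] * p[k] for k in range(3)) + trans[i] for i in range(3))
    d_w = tuple(sum(rot[i][k] * d[k] for k in range(3)) for i in range(3))
    return p_w, d_w


def point_on_sightline(p_w, d_w, t):
    """Point p_w + t * d_w on the directed sightline; t must be positive."""
    if t <= 0:
        raise ValueError("t must be greater than 0")
    return tuple(a + t * b for a, b in zip(p_w, d_w))
```

Normalizing the fused direction keeps later distance computations along the sightline in metric units, whatever the scale of the confidence weights.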
We then obtain the visible point cloud within the map, which is marked as S, as shown in Figure 3. First, we construct a truncated cone (frustum) whose main axis is the sightline and whose inclination angle is θ. We specify that the points within this frustum form the visible point cloud S, which reflects the potential regions that the person is focusing on. For a point in the map, we first obtain its perpendicular foot on the sightline. After that, we calculate the distance from this point to the perpendicular foot and the distance from the perpendicular foot to the eye. By Equation (4), we determine whether this point is in the visible point cloud S. In Equation (4), p_i is a point in the map, p_f is the perpendicular foot of p_i on the sightline, θ is the inclination angle, r is the radius of the top circle of the frustum, and dist(·,·) is the distance between two points.
The visible point cloud S determines the area that a person focuses on. However, some of the points in S are occluded and therefore not actually visible. Assigning weights to them would distort the attention estimate, and we therefore propose a method to address this problem. For each visible point, we construct a cone. The direction of the central axis of the cone is identical to the direction of the sightline. We specify that the point cloud within the cone is the occluded point cloud R, as shown in Equation (5). In Equation (5), p_i is a point in the visible point cloud S, p_f is the perpendicular foot of p_i, p_j is a point that needs to be checked, and θ is the inclination angle. We eliminate all the points in R so that the remaining point cloud reflects the actual attention region of the person.
We assign weights to the visible point cloud by constructing a Gaussian model, which is guided by the distance from each point to the sightline. The weight of each point is inversely correlated with the distance, which implies that closer points are allocated larger weights. For convenience, we specify that the weight lies in the interval from 0 to 1. On the basis of the weighted 3D map, we use a camera model to project this map onto an image plane. Subsequently, the discrete weighted points are transformed into a continuous probability map using image processing techniques, such as dilation and filtering. We then multiply this probability map by the saliency detection result to obtain the final attention prediction. Algorithm 1 shows the detailed algorithm flow.
When the person and the target do not appear in the same field of view, we first observe the face and estimate the sightline of that person, then move the camera along the sightline until it reaches the map and finds the visible point cloud. When the face moves out of view, its gaze direction is assumed to be unchanged, which corresponds to human behavior. The whole process is the same as that discussed above, except for changing the field of view and tracking the sightline.
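The visibility test, occlusion removal, and Gaussian weighting can be sketched as below. Since Equations (4) and (5) are not reproduced in this text, the inequalities are assumptions: a point is taken as visible if its distance to the sightline is at most r plus the axial distance times tan θ, and a point p_j is taken as occluded by p_i if it lies behind p_i along the gaze direction and inside the cone with apex p_i and half-angle θ. The parameter `sigma` and all function names are hypothetical.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def axial_and_radial(point, origin, d):
    """Return (signed distance of the perpendicular foot from origin along d,
    distance from point to that foot). d must be a unit vector."""
    v = sub(point, origin)
    s = dot(v, d)
    radial_sq = max(dot(v, v) - s * s, 0.0)
    return s, math.sqrt(radial_sq)


def visible_points(points, p, d, theta, r):
    """Assumed frustum test: keep points in front of the eye and near the sightline."""
    tan_t = math.tan(theta)
    visible = []
    for q in points:
        s, radial = axial_and_radial(q, p, d)
        if s > 0 and radial <= r + s * tan_t:
            visible.append(q)
    return visible


def remove_occluded(visible, d, theta):
    """Drop every point that falls inside the cone cast behind another point."""
    tan_t = math.tan(theta)
    occluded = set()
    for i, p_i in enumerate(visible):
        for j, p_j in enumerate(visible):
            if i == j:
                continue
            s, radial = axial_and_radial(p_j, p_i, d)
            if s > 0 and radial <= s * tan_t:
                occluded.add(j)
    return [q for j, q in enumerate(visible) if j not in occluded]


def gaussian_weights(points, p, d, sigma):
    """Weight in (0, 1] that decreases with distance to the sightline."""
    weights = []
    for q in points:
        _, radial = axial_and_radial(q, p, d)
        weights.append(math.exp(-(radial ** 2) / (2.0 * sigma ** 2)))
    return weights
```

The double loop is quadratic in the size of S, which is acceptable for a sparse map; a denser map would likely need a spatial index.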
Algorithm 1 (fragment).
    Calculate the distance between p_f and p_i
    Calculate the distance between p_f and p
    if p_i satisfies Equation (4) then
        Add p_i to S
    end if
end for
for each p_i ∈ S do
    Generate a line l with direction d starting at p_i
    for each p_j ∈ S do
        Find the foot point p_f of p_j along the line l
        Calculate the distance between p_f and p_i
        Calculate the distance between p_f and p_j
        if p_j satisfies Equation (5)
Since the scene often contains moving objects, e.g., people and animals, whose motion traces are recorded by SLAM in the map, the original structure of the map is often disrupted, which results in a large error in visual attention prediction. Therefore, real-time map updating is required. We employ a simple and effective strategy to ensure the real-time capabilities of the system.
Before loading a new frame of the point cloud, we erase the camera-observable point cloud to ensure that the map stores the latest frame.
The camera-observable point cloud is defined by Equation (6), where 0 ≤ u ≤ W and 0 ≤ v ≤ H; (x, y, z) is the position of each point in the world; (u, v) is its position in the image; f_x, f_y, u_0, and v_0 are the intrinsic parameters of the camera; W is the image width; and H is the image height. We specify that a point can be updated when its projection is within the image range.
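The update strategy can be sketched as follows. This is a minimal sketch that assumes the standard pinhole projection u = f_x·x/z + u_0, v = f_y·y/z + v_0, which matches the parameters listed above but is not confirmed as the exact form of Equation (6). It also assumes that map points have already been transformed into the current camera frame and that points with non-positive depth are never observable. The function names are hypothetical.

```python
def project(point_c, fx, fy, u0, v0):
    """Pinhole projection of a point given in the current camera frame."""
    x, y, z = point_c
    return fx * x / z + u0, fy * y / z + v0


def is_observable(point_c, fx, fy, u0, v0, width, height):
    """True if the point is in front of the camera and projects inside the image."""
    if point_c[2] <= 0:
        return False
    u, v = project(point_c, fx, fy, u0, v0)
    return 0 <= u <= width and 0 <= v <= height


def update_map(map_points_c, new_points_c, intrinsics, width, height):
    """Erase camera-observable map points, then load the new frame's points."""
    fx, fy, u0, v0 = intrinsics
    kept = [
        q for q in map_points_c
        if not is_observable(q, fx, fy, u0, v0, width, height)
    ]
    return kept + list(new_points_c)
```

Erasing everything the camera can currently see means a moved object leaves no stale points behind, at the cost of rebuilding the visible part of the map on every frame.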

Object Description
The dense captioning task was first proposed by Johnson et al. [21]. Yang et al. [23] introduced a method that adds joint inference and visual context to [21] for improved performance, and [25] proposed applying this task to rehabilitation robots. We trained the model using Visual Genome [26]. Figure 4 shows the overall network architecture. Given an image and its probability map created by gaze prediction, we first utilize a CNN (convolutional neural network) [27] and a region proposal network to generate a series of region features, then select the one with the highest weight density and pass it to the captioning model to output a sentence. Specifically, the captioning model consists of a recognition network and an LSTM (long short-term memory) [28]. The recognition network is a fully connected network that takes the selected region feature as input and produces a string of visual text code. The RNN (recurrent neural network) utilizes the LSTM to propagate the hidden state and recurrently sample the most likely next words. The framework of the whole network is similar to DenseCap [21], except that we add a box selecting module to focus on the attention region. The optimization objective of the dense captioning model is to minimize its loss function L(I, S), which is given in Equation (7).
In Equation (7), I is the input image, and S = (s_1, ..., s_N) is a true sentence describing this image. We initialize the weights of the CNN with a model pre-trained on ImageNet [29] and add a regularization term when the gradient is updated to avoid overfitting, as shown in Equation (8). This helps compared with training from an uninitialized model and without regularization terms. In Equation (8), w_i is the current weight in the network, and w_{i+1} is the weight to be updated.
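The box selecting module can be sketched as below. This is a minimal sketch that assumes "weight density" means the sum of the attention probability map inside a region proposal divided by the box area; the source does not spell out the definition. Boxes are assumed to be `(x0, y0, x1, y1)` pixel coordinates with exclusive upper bounds, and the function names are hypothetical.

```python
def weight_density(heatmap, box):
    """Mean attention value inside a box; heatmap is a list of rows."""
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0)
    if area <= 0:
        return 0.0
    total = sum(sum(row[x0:x1]) for row in heatmap[y0:y1])
    return total / area


def select_box(heatmap, boxes):
    """Index of the region proposal with the highest weight density."""
    if not boxes:
        raise ValueError("no region proposals")
    return max(range(len(boxes)), key=lambda i: weight_density(heatmap, boxes[i]))
```

Dividing by the area favors a tight box around the attended object over a large box that merely contains it, which fits the goal of describing only what the person is looking at.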

Experimental Evaluation
In this section, we describe the attention prediction dataset for a rehabilitation training scenario for children with autism and evaluate our algorithm in this scenario, in terms of both visual attention prediction and object description. The advanced attention prediction algorithms of [9] and [12] achieve accurate results in predicting the attention of people in an image. They use the head posture as a clue to track people's attention with a neural network, while we focus on the geometric information provided by the 3D point cloud map and predict attention through the geometric relationship between the sightline and the map. This also helps to achieve cross-frame prediction. We compare our method with [9] and [12] in visual attention prediction. Additionally, in object description, we compare with advanced general image caption methods [21,30,31] to show that our gaze-based method is more suitable for human-robot interaction.

Dataset
We used the Visual Genome dataset [26] for object description model training. It provides region proposals for each image, which may contain important information. Before training, we removed repetitive descriptions and used the YOLOv3 object detector [32] to crop images, because we focus on the specific objects that attract people and there is no need to involve all objects in the scene. Finally, we used 103,521 images including 688,143 regions for training and 4556 images with more than 70 descriptions including 40,989 regions for testing. For evaluation, we simulated rehabilitation scenes and captured 24 RGB-D videos from the perspective of the rehabilitation robot. Five of them were used to evaluate the performance of the cross-frame gaze estimation method. In actual scenes, the autistic child sits in front of a table and is instructed by a trainer to identify objects on the table, including daily-life tools, fruits, and animals. We required subjects to imitate an autistic child gazing at objects or playing with them, and we then provided a few sentences based on the objects as evaluation templates. We ran our system on TianXP. During the test, the frame rate of visual attention prediction reached about 15 fps, and that of the object description was about 1 fps.

Visual Attention Evaluation
The gaze-following, mapping, and saliency detection in our system are implemented with OpenFace [33], ORB-SLAM2 [34], and an OpenCV module [35], respectively. We used our dataset to evaluate visual attention prediction and compare it with [9] and [12]. In order to keep the input of each algorithm the same, we provided the face bounding box input for both [9] and [12] using OpenFace. Moreover, we divided our visual attention prediction into three schemes, Gaze, Gaze + HiddenDetect, and Gaze + HiddenDetect + Saliency, to show the effect of occlusion detection and saliency detection. To measure visual attention prediction, we employed the widely used metrics ROC (receiver operating characteristic), PR (precision-recall), and AUC (area under the curve). The ROC curve reflects the relationship between the true-positive rate and the false-positive rate. The PR curve visualizes the accuracy of the model. Therefore, a large AUC for both curves demonstrates the superiority of a model.
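The two area measures can be sketched as follows for a predicted attention map and a binary ground-truth mask. This is a minimal sketch, not the evaluation code used in the experiments: it assumes every pixel is one sample, computes the ROC AUC through the rank-sum (Mann-Whitney) identity, and summarizes the PR curve by average precision, which is one common way to approximate its area. The function names are hypothetical.

```python
def flatten(grid):
    """Turn a 2D list (rows of pixels) into a flat list."""
    return [value for row in grid for value in row]


def roc_auc(scores, labels):
    """Area under the ROC curve via average ranks (ties share a rank)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both positive and negative samples")
    pairs = sorted(zip(scores, labels))
    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def average_precision(scores, labels):
    """Mean of the precision values at each true positive (ties not merged)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    true_pos = 0
    total = 0.0
    for rank, k in enumerate(order, start=1):
        if labels[k]:
            true_pos += 1
            total += true_pos / rank
    return total / n_pos
```

A prediction that covers a large area tends to score well on ROC AUC while losing precision, which is why reporting both measures is informative for attention maps.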
The test results are provided in Figure 5 and Table 1. The PR plot shows that the accuracy of our algorithm is higher than that of the others. The ROC curve of our algorithm lies below that of [9]; however, the accuracy of our method is better than that of either baseline. The prediction of [9] has the best coverage area, but it also contains a relatively high level of false-positive cases. As for [12], it is highly accurate in some cases but is susceptible to complex backgrounds. The prediction of [12] is more biased towards the region near the hand, and thus its overall performance is poor. Moreover, among the three schemes of our method, Gaze had larger errors and more false positives. The reason is that the algorithm incorrectly assigns more weight to occluded objects. Gaze + HiddenDetect shows that occlusion detection significantly improves the accuracy of our system. On this basis, Gaze + HiddenDetect + Saliency further optimizes the details of the attention regions, and Figure 6 shows its predictions. When the participant is looking at the green pepper on the left, only Gaze + HiddenDetect + Saliency accurately locates it.
Figure 7 illustrates the effect of our attention prediction algorithm in the cross-frame gaze estimation experiments. The target is not in the camera's field of view at first. On the basis of the 3D point cloud map, the algorithm estimates the person's sightline and follows its direction until the point cloud falls into the cone of sight. Finally, we assign weights to these points and project them into the image. This procedure shows that the 3D point cloud map plays an important role in cross-frame estimation.
Table 1 (partial). [12]: 0.824, 0.335. Dongze Lian [9]: 0.949, 0.378.
Figure 6. Application of the robot in autism rehabilitation training.

Object Description Evaluation
To assess the performance of our proposed method, we compared it with other state-of-the-art methods. Reference [30] proposes an adaptive attention algorithm, [21] combines object detection with image captioning to achieve improved performance, and [31] combines object detection with an adaptive attention algorithm to obtain more specific descriptions. The dataset was collected from the perspective of a robot. Figure 8 compares the performance of our method with that of [31]. Our method describes the area the person is looking at rather than all objects in the scene. Because it takes the perspective of the person, its descriptions are more specific, which is currently not possible with other image caption algorithms. We apply widely used evaluation metrics: BLEU (bilingual evaluation understudy), CIDEr (consensus-based image description evaluation), METEOR (metric for evaluation of translation with explicit ordering), ROUGE (recall-oriented understudy for gisting evaluation), and SPICE (semantic propositional image caption evaluation). Furthermore, because of the matching problems of these metrics, we also include BERT (bidirectional encoder representations from transformers) [36] for a better assessment of sentence semantics. Meanwhile, refs. [21,30] generate object descriptions and use the evaluation metrics mentioned above. By comparing the sentence evaluation scores of the description results in Table 2, we find that our method outperforms the other algorithms, with the highest evaluation scores and the best descriptions.
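To illustrate the matching problem mentioned above, the following minimal sketch computes the clipped n-gram precision that underlies BLEU. It is only one component of BLEU (no brevity penalty, no averaging over n-gram orders), and the sentences and function names are invented for this example. It shows that a reasonable paraphrase can receive a low score when it shares few exact n-grams with the reference, which is why an embedding-based measure is a useful complement.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate: str, reference: str, n: int) -> float:
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram count is clipped to its count in the reference,
    as in the modified precision used by BLEU.
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "a green pepper on the table"
exact = "a green pepper on the table"
paraphrase = "a bell pepper lying on a desk"

p1_exact = clipped_precision(exact, reference, 1)
p1_para = clipped_precision(paraphrase, reference, 1)
p2_para = clipped_precision(paraphrase, reference, 2)
```

Here the paraphrase describes the same object, yet it matches only three of its seven unigrams and none of its bigrams.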

Conclusions and Future Work
This paper integrates third-person visual attention into object description and proposes a human-robot interaction system for the rehabilitation of children with autism. The system is composed of two main parts: attention prediction and object description. The first part uses the spatial relationship between the sightline and the map to predict the area of interest. The second part uses the attention probability map to describe objects. The experiments demonstrate that our method effectively predicts the objects that children are interested in. However, our work has some limitations. The accuracy of the existing image description is insufficient. Moreover, the efficiency of the gaze tracking and SLAM algorithms needs further optimization. Apart from this, there are several desirable extensions. For example, robots could track the areas a person is looking at in real time to hold conversations with people, helping empty-nest elderly people, white-collar workers, and others to relieve their worries, prevent depression, and so on. In terms of robot applications, our system provides new ideas for human-robot interaction. In the future, we will focus on improving recognition accuracy and enriching the forms of sentences.
Funding: This research was funded by the Shenzhen Science and Technology Innovation Commission, grant numbers JCYJ20170410172100520 and RCYX20200714114736115. This research was also funded by the Open Program of Neusoft Corporation, item number SKLSAOP1702, as well as the Shenzhen Institute of Artificial Intelligence and Robotics for Society, grant numbers 2019-INT020 and AC01202101014. The APC was funded by the Shenzhen Science and Technology Innovation Commission.
Institutional Review Board Statement: Ethical review and approval were waived for this study because all the participants were authors of this study (Y.Z. and F.R.D.) and not patients.

Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.