1. Introduction
Visual grounding aims to recognize and localize the referred target in natural images based on interaction information provided by humans. It is a novel interaction method that offers augmented reality (AR)/virtual reality (VR) devices the advantages of being user-friendly, efficient, and more intelligent. In recent years, visual grounding has become a research hotspot in the field of human-computer interaction (HCI) [1,2,3] and has played a key role in unmanned systems control tasks such as emergency rescue and disaster relief, elderly care, and disability assistance. However, existing vision language grounding (VLG) mainly uses language information to express human intentions, and the grounding accuracy is poor when there are multiple similar visual targets in the image. In practice, humans tend to gaze at a target when referring to it, and the gaze trace can provide positional information for visual grounding. Gaze interaction is also an important interaction mode in AR/VR devices, and it can effectively alleviate the inaccurate grounding problem mentioned above. In this paper, we propose to apply gaze interaction to vision language grounding and design a multi-modal grounding framework with language and gaze. As shown in
Figure 1, the red bounding box represents the conventional vision language grounding result, while the green bounding box represents the proposed multi-modal grounding result fused with gaze intention.
To conduct multi-modal visual grounding research, two stages are essential. First, a multi-modal dataset containing gaze, image, and language must be constructed. Then, a feature fusion model must be established to realize cross-modal understanding of human intention.
In the studies of VLG, there are already several available public datasets, such as RefCOCO [
4], RefCOCO+ [
4], and RefCOCOg [
5], among which RefCOCOg has longer text queries, a more comprehensive description, and wider content coverage. Based on the above datasets, several studies [
6,
7,
8] proposed two-stage visual grounding architectures, which mainly rely on off-the-shelf object detection algorithms to extract the visual features of the candidate targets. The pre-training process of the detection algorithms limits the categories of grounding candidates, and the independently extracted object features lack global context. To avoid this drawback, Kamath et al. [
9] proposed a one-stage modulated grounding model, using a transformer-based architecture to fuse the image and language at an early stage of the model without the object detection algorithms. Based on transformer [
10], Yang et al. [
11] unified the text and box outputs of the grounded vision language model, achieving joint pre-training on multiple datasets from different vision-language tasks and alleviating the problem of insufficient data. However, when the linguistic description is not strictly logical or there are multiple similar objects in the image, it is hard for a VLG model to precisely identify the referred target.
In studies of gaze interaction, gaze-based virtual target localization has been extensively studied with the popularization of AR/VR devices [
12,
13]. The common pipeline of gaze interaction mainly contains two steps: first obtaining gaze coordinates, and then performing target localization [
12] and important target extraction tasks [
13]. In the field of gaze-based object detection, existing studies [
14,
15,
16] usually adopted the gaze point in each frame to predict the target in videos, but a single gaze point can easily be disturbed by various factors, leading to incorrect detection. The above methods all perform visual feature selection based on gaze interaction alone. Because a single interaction mode carries limited characteristic information, the grounding accuracy is unsatisfactory. Furthermore, the gaze trace is influenced by several factors, including the appearance of the target, the observer’s personal habits, and the specific gaze acquisition device used.
In this paper, we explore the vision language grounding task fused with gaze intention (VLG-Gaze) and conduct comparison experiments between VLG-Gaze and VLG. An AR device is used to collect real gaze coordinate sequences, and combined with the proposed data augmentation method, a dataset containing more than 80,000 gaze annotations is constructed. Based on the attention-based transformer, a multi-modal feature extraction model for gaze, vision, and language is designed, realizing the VLG-Gaze task through global feature integration. Finally, a group of experiments is carried out comparing the performance of various multi-modal combinations. The results indicate that, compared with the existing state-of-the-art VLG model, the accuracy of the proposed scheme is improved by 5.3%, which fully proves the effectiveness and advantages of the multi-modal grounding framework with language and gaze. The contributions of this paper are summarized as follows:
A multi-modal grounding dataset containing gaze traces is proposed for visual grounding. An AR device is used to collect the real gaze trace of the referred subjects, and each gaze sample corresponds to a natural image and its language description, as well as the target bounding box.
Taking gaze and language as the expression of interactive human intention, we explore the VLG-Gaze task. In terms of dataset construction, the optimal design for improving model performance is determined through groups of experiments.
Aiming at the multiple inputs of vision, language, and gaze, a multi-modal grounding framework is established. Experimental results show that our proposed scheme outperforms the existing vision language grounding and the gaze-based visual grounding method, fully proving the effectiveness of gaze fusion.
In the following parts of this paper,
Section 2 introduces the research status of visual grounding and the relevant gaze-based HCI in computer vision. In
Section 3, the acquisition and construction of a multi-modal grounding dataset are described in detail. Subsequently, the multi-modal grounding framework with language and gaze is proposed, and the training details are described in
Section 4. Through comparative experiments,
Section 5 verifies and discusses the effectiveness of gaze fusion, and we analyze the proposed multi-modal grounding dataset.
Section 6 discusses how the proposed solution can be applied to develop existing models and describes the limitations and future work. Finally, the entire paper is summarized in
Section 7.
2. Related Work
Visual grounding. The common line of visual grounding research mainly focuses on vision-language fusion to achieve more accurate grounding, using only language to express human intention. Conventional vision-language feature fusion methods can be categorized into three forms: joint embedding approaches, modular-based models, and graph-based reasoning methods [
6,
7,
8,
17]. Zhuang et al. [
6] proposed the Parallel Attention Network, which employs two attention mechanisms to match language and visual features locally and globally, recurrently matching the description with region proposals. However, it ignores the relative positions of the visual objects. Yu et al. [
7] proposed a modular-based method that decomposes the text description into three components, including the target appearance, location, and relationship to other objects, and calculates the matching score between the language parts and visual features through three attention-based modules. Yang et al. [
8] proposed a dynamic graph attention network that uses graph-based reasoning to represent the image and the text as graphs, matching elements by appearance and by the relationships from objects to the subject, and selecting the proposal with the highest matching score. To enhance grounding performance, Vasudevan et al. [
18] proposed a two-stage model using facial images, language, video, optical flow, and a depth map. However, the gaze information extracted from facial images is uncertain, and the multi-modal fusion is carried out locally and globally only by a long short-term memory (LSTM) model, both of which result in unsatisfactory grounding performance.
However, it is difficult for the above approaches to learn a wide range of feature representations from large-scale datasets. In the last few years, there have been studies [
9,
19,
20] to establish cross-modal pre-training models based on vision-language datasets, which not only improve grounding performance but can also be applied to a variety of downstream tasks. Based on bidirectional encoder representations from transformers (BERT) [21], Lu et al. [19] proposed a two-stream feature fusion model, vision-and-language BERT (ViLBERT), while Su et al. [
20] adopted the single-stream method and proposed a visual-linguistic BERT (VLBERT), which is pre-trained with large-scale vision-language datasets. All the above two-stage methods rely on pre-trained detection models, such as Faster R-CNN [
22], to extract the target proposals and their visual features. These detectors are pre-trained on a fixed vocabulary of categories, which makes it difficult for the cross-modal model to understand contextual visual information. One-stage approaches generally use the attention-based transformer, which can perform pixel-level cross-modal attention in parallel, facilitating the deep integration of different modalities.
Gaze-based HCI in computer vision. With the development of gaze estimation and tracking [
1,
23,
24], gaze has become one of the main interaction modes in HCI systems [
3,
25,
26]. At the same time, gaze is more widely applied to cross-modal computer vision tasks. In order to extract significant targets, Karthikeyan et al. [
13] utilized gaze to extract mainstream intention tracks, identifying and segmenting significant targets among multiple candidate objects in videos. Chen et al. [
27] and Sood et al. [
28] collected small-scale gaze datasets for visual question-answering (VQA) tasks in order to understand the human reasoning process. The former study [
27] analyzed the reasoning attention of machines and humans, proposing a supervision method to optimize the reasoning process, while the latter [
28] collected a dataset of human gaze on both images and questions for VQA to analyze the similarity between human attention and the neural attentive strategies learned by VQA models. The above studies utilize gaze to improve model performance or to compare machine and human reasoning mechanisms, but the accuracy and stability of the gaze sequences are not satisfactory. In conclusion, constrained by the quality of gaze-tracking devices and the scale of training data, existing research rarely uses gaze intention in the task of visual grounding.
3. RefCOCOg-Gaze Dataset
In this section, AR glasses are utilized to collect human gaze traces. Combined with data augmentation methods, the RefCOCOg-Gaze dataset is proposed, which is the first vision language grounding dataset containing gaze trace data.
RefCOCOg-Gaze includes gaze coordinate sequences, images, and text sentences. To ensure the diversity of interactive scenes, the image and text data are sourced from the RefCOCOg dataset. The interactive gaze information is then recorded to build a multi-modal grounding dataset containing gaze traces, which is called RefCOCOg-Gaze in this paper. In terms of the gaze traces, an AR device is utilized to record the real gaze coordinates of 20 participants. Combined with the data augmentation methods, the new dataset with more than 80,000 multi-modal samples is systematically constructed.
3.1. Gaze Data Collection
The AR device used is the Nreal Light (made by the Nreal Corporation), which has two OLED screens of 3840 × 1080 pixels and is equipped with two off-axis infrared cameras with resolutions of 640 × 480 pixels that acquire eye images simultaneously at a sampling frequency of 60 Hz. In addition, combined with the gaze-tracking algorithm of the device, the gaze coordinates can be obtained in real time. According to the official statistics, the gaze estimation error is less than 1.5° under fully calibrated conditions.
3.1.1. Gaze Recording System
Based on Unity (version 2019.4.30f1c1), a gaze-recording Android application is developed for data collection. The specific working procedure of the software is shown in
Figure 2, which includes two stages and five steps. The participants’ fixation coordinate (the green dot in
Figure 2) is obtained by projecting the 3-dimensional sight vector onto the 2-dimensional screen. The screen is a virtual square plane of 5 m × 5 m, and the distance between the screen and the human eyes is 10 m.
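As a rough illustration of this projection step, the following sketch intersects a gaze direction vector with a plane placed 10 m in front of the eyes and maps the hit point to pixel coordinates. The eye-centered coordinate frame, the linear pixel mapping, and all function and parameter names are assumptions made for this sketch, not details taken from the Nreal SDK or our recording software.

```python
import numpy as np

# Minimal sketch of projecting a 3-D sight vector onto the 2-D virtual plane.
# Assumptions (not from the paper): the gaze ray starts at the eye midpoint,
# the 5 m x 5 m plane is perpendicular to the viewing (z) axis at z = 10 m,
# and the displayed image is mapped linearly onto the plane at image_size pixels.
def project_gaze(direction, plane_depth=10.0, plane_size=5.0, image_size=(640, 480)):
    d = np.asarray(direction, dtype=float)
    if d[2] <= 0:
        return None                       # gaze ray does not point toward the plane
    t = plane_depth / d[2]                # scale so the ray reaches z = plane_depth
    x, y = d[0] * t, d[1] * t             # intersection point on the plane (metres)
    # Convert plane coordinates (centred at the plane origin) to pixel coordinates.
    u = (x / plane_size + 0.5) * image_size[0]
    v = (0.5 - y / plane_size) * image_size[1]
    return u, v

print(project_gaze([0.05, -0.02, 1.0]))   # a slightly right/down gaze direction
```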
3.1.2. Data Collection Procedure
Initialization and calibration. Participants are first asked to wear the AR glasses correctly and then complete the instrument calibration (
Figure 2a). Subsequently, the system calculates the value of the calibration error (
Figure 2b). If the error is greater than 1.5°, the calibration process needs to be repeated. To reduce the influence of other unexpected factors, each participant sits in an assigned seat and keeps the AR glasses still on their face throughout the data collection process.
Gaze recording. At the beginning of gaze recording, AR glasses display a text sequence that describes a referred target (
Figure 2c). After fully understanding the description, the participant clicks the confirm button, and the corresponding picture appears on the screen. The referred target in the image then needs to be gazed at for three seconds. The moment when the picture starts to be displayed is set to 0 s, and the system collects the gaze trace within 0 to 3 s in the first recording stage (Figure 2d). To avoid linguistic misunderstandings, we indicate the correct referred target during the second part of the recording. From 3 to 6 s, a red bounding box is annotated in the image to show the ground truth. During this time, the participant continues to look at the target, and the system collects the gaze trace for another three seconds in the second recording stage (
Figure 2e). The gaze recording process for each sample is completed in six seconds.
In summary, 360 frames of gaze coordinates are collected for each data sample. Before data collection, we conducted a set of experiments and found that participants do not feel too tired when focusing on a fixed target for six seconds. To ensure the recording quality and a sufficient length of data, we designed the recording process with two stages of three seconds each, for a total duration of six seconds. To further prevent participant fatigue, 6 sets of 50 data samples are collected for each participant. The recording duration of each set is about 10–15 min, and the participant takes a 5 min rest before each recording set.
3.1.3. Participant Recruitment and Quality Assurance
All participants are adult college students who are able to fully understand English descriptions. In the preparation phase, we use interactive games with the AR glasses to train the candidates, including gaze interaction, language interaction, and other operational training courses. Only candidates who perform well in the preparation phase are selected to participate in the subsequent recording stages. In addition, all participants understood the collection procedure and signed informed consent forms before the gaze recording, and they were properly compensated by the research group. The experiment was approved by the Ethics Committee of Tianjin University (TJUE-2021-138).
To ensure data quality, each participant’s recordings were checked by the HCI experts via 15 random samples. If a gaze annotation error occurred, the participant would be replaced, and the samples would be re-annotated. In total, 2589 real gaze samples from 20 participants were collected.
3.2. Gaze Augmentation
Due to the time-consuming collection process with the AR device, the amount of manually annotated data is limited, and it is difficult to adequately train a model on such a small-scale dataset. To avoid the long-term consumption of human and material resources, we use the real human gaze to generate simulated gaze sequences in two ways, i.e., LSTM-based augmentation and average-error-based augmentation. Through these augmentation methods, more than 70,000 gaze samples are generated. The effectiveness of the augmentation is verified by the grounding experiments in
Section 5.3.
3.2.1. LSTM-Based Augmentation
The LSTM-based augmentation adopts a single-layer LSTM model, as shown in
Figure 3. The pre-trained ResNet [
29], RoBERTa, and multilayer perceptron (MLP) are utilized to extract features from the input image, text, and bounding box, respectively. As a reference, Nie et al. [
30] generated time series by segmenting them into subseries-level patches instead of generating them point-wise. Patch-wise generation is more efficient and can capture the connections between gaze points; thus, the proposed augmentation model generates ten gaze coordinates at each time step. The feature representation of the first ten gaze coordinates is generated from the input image, text, and bounding box. In the subsequent time steps, the gaze features are generated based on the previously predicted gaze coordinates. Finally, the output features are transformed into two-dimensional gaze sequence predictions through a fully connected layer.
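To make the patch-wise generation concrete, the following PyTorch sketch shows one way such a generator could be assembled. The layer dimensions, the feedback of each predicted patch into the next step, and all names (GazeLSTMGenerator, ctx_dim, etc.) are illustrative assumptions rather than the paper's actual implementation; the image, text, and box features are assumed to be pre-extracted by ResNet, RoBERTa, and an MLP as described above.

```python
import torch
import torch.nn as nn

# Minimal sketch of an LSTM-based gaze augmentation model (dimensions are
# assumptions). Image, text, and bounding-box features are pre-extracted.
class GazeLSTMGenerator(nn.Module):
    def __init__(self, ctx_dim=2048 + 768 + 64, hidden_dim=512, patch_len=10):
        super().__init__()
        self.patch_len = patch_len
        self.init_proj = nn.Linear(ctx_dim, hidden_dim)          # context -> first input
        self.patch_proj = nn.Linear(patch_len * 2, hidden_dim)   # previous 10 (x, y) points
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden_dim, patch_len * 2)         # predict next 10 (x, y) points

    def forward(self, img_feat, txt_feat, box_feat, num_steps=36):
        ctx = torch.cat([img_feat, txt_feat, box_feat], dim=-1)
        inp = self.init_proj(ctx).unsqueeze(1)                   # (B, 1, H)
        state, outputs = None, []
        for _ in range(num_steps):                               # 36 x 10 = 360 coordinates
            out, state = self.lstm(inp, state)
            patch = self.head(out)                               # (B, 1, 20)
            outputs.append(patch.view(-1, self.patch_len, 2))
            inp = self.patch_proj(patch)                         # feed prediction back in
        return torch.cat(outputs, dim=1)                         # (B, 360, 2)

model = GazeLSTMGenerator()
gaze = model(torch.randn(2, 2048), torch.randn(2, 768), torch.randn(2, 64))
print(gaze.shape)  # torch.Size([2, 360, 2])
```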
3.2.2. Average-Error-Based Augmentation
On the one hand, the gaze interaction process contains a certain error with respect to the referred target; on the other hand, the coordinate sequence is correlated with the time mark. In order to simulate the gaze trace accurately, we first calculate the difference between each real gaze trace $(x_i^{(n)}, y_i^{(n)})$ and the center of its referred target $(x_c^{(n)}, y_c^{(n)})$ as follows:

$$\left(\Delta x_i^{(n)},\, \Delta y_i^{(n)}\right) = \left(x_i^{(n)} - x_c^{(n)},\; y_i^{(n)} - y_c^{(n)}\right), \quad i = 1, \ldots, M. \tag{1}$$

Then, the mean error of the N real gaze samples is calculated:

$$\left(\Delta \bar{x}_i,\, \Delta \bar{y}_i\right) = \frac{1}{N} \sum_{n=1}^{N} \left(\Delta x_i^{(n)},\, \Delta y_i^{(n)}\right). \tag{2}$$

In order to augment the dataset, we simulate a number of gaze coordinate sequences, and each coordinate needs to be associated with the real target. Based on the mean error of the real gaze samples and the center of the new referred target $(x_c', y_c')$, the simulated gaze trace $(\hat{x}_i, \hat{y}_i)$ is calculated as follows:

$$\left(\hat{x}_i,\, \hat{y}_i\right) = \left(x_c' + \Delta \bar{x}_i,\; y_c' + \Delta \bar{y}_i\right), \quad i = 1, \ldots, M. \tag{3}$$

In our augmentation implementation, M is the number of frames in the gaze sample, which is 360 according to the recording setup (60 frames per second and a total of six seconds). $(x_c', y_c')$ is the bounding box center of the new data sample, and $\Delta \bar{x}_i$ and $\Delta \bar{y}_i$ are the mean errors of the i-th gaze frame along the abscissa and ordinate axes, respectively; they are calculated from the real gaze data via Equation (2).
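The following NumPy sketch implements Equations (1)–(3) as reconstructed above, assuming the real gaze traces and their target centers are already stored as arrays; the array shapes, variable names, and the toy data at the end are illustrative assumptions only.

```python
import numpy as np

# Minimal sketch of the average-error-based augmentation (Equations (1)-(3)).
# `real_gazes` holds N real traces of shape (N, M, 2); `real_centers` holds the
# matching target centers (N, 2). Names and shapes are illustrative assumptions.
def mean_gaze_error(real_gazes, real_centers):
    # Eq. (1): per-frame difference between each gaze trace and its target center.
    diffs = real_gazes - real_centers[:, None, :]        # (N, M, 2)
    # Eq. (2): mean error of the N real gaze samples, per frame.
    return diffs.mean(axis=0)                            # (M, 2)

def simulate_gaze(mean_error, new_center):
    # Eq. (3): shift the new target center by the frame-wise mean error.
    return new_center[None, :] + mean_error              # (M, 2)

rng = np.random.default_rng(0)
real_gazes = rng.normal(0.0, 5.0, size=(20, 360, 2)) + np.array([320.0, 240.0])
real_centers = np.tile(np.array([320.0, 240.0]), (20, 1))
mean_err = mean_gaze_error(real_gazes, real_centers)
simulated = simulate_gaze(mean_err, new_center=np.array([150.0, 200.0]))
print(simulated.shape)  # (360, 2)
```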
5. Experiments and Results
In order to verify the effectiveness of the RefCOCOg-Gaze dataset and the VLG-Gaze baseline model, this section takes the gaze-based visual grounding method and the VLG method as comparisons, carrying out a variety of qualitative and quantitative experimental analyses. At the same time, the average-error-based augmentation method and the LSTM-based augmentation method are experimentally verified, and the optimal input frame range of the gaze sequence is explored.
5.1. Experimental Setup and Evaluation Metric
The proposed architecture was implemented with the PyTorch framework. Two GPUs were used for distributed training. The pre-training parameter settings are the same as those of MDETR [
9]. In the fine-tuning process, five epochs are iterated in each experiment, and the Adam optimizer with a learning rate of 5 × 10⁻⁵ is used to optimize the model parameters.
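For reference, a minimal fine-tuning skeleton matching the stated settings (Adam, learning rate 5 × 10⁻⁵, five epochs) could look as follows; the model interface and batch layout are assumptions for this sketch, not the actual training code.

```python
import torch

# Illustrative fine-tuning skeleton: Adam optimizer, lr = 5e-5, 5 epochs.
# The batch keys and the model call signature are assumptions for the sketch.
def finetune(model, train_loader, device="cuda", epochs=5, lr=5e-5):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for epoch in range(epochs):
        for batch in train_loader:
            images = batch["image"].to(device)
            gaze = batch["gaze"].to(device)                  # (B, T, 2) gaze coordinates
            captions = batch["caption"]                      # list of referring expressions
            targets = batch["box"].to(device)                # ground-truth bounding boxes
            loss = model(images, captions, gaze, targets)    # assumed to return the loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```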
The model’s performance is evaluated by accuracy. First, the intersection over union (IoU) between the predicted bounding box and the ground truth bounding box is calculated as the ratio of the area of their intersection to the area of their union. A prediction with an IoU > 0.5 is regarded as correct. The accuracy on the test set is then calculated as the ratio of correctly predicted samples to the total number of samples.
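A minimal sketch of this metric, assuming boxes are given in [x1, y1, x2, y2] format, is shown below.

```python
import numpy as np

# IoU between predicted and ground truth boxes, and accuracy at IoU > 0.5.
def iou(box_a, box_b):
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def accuracy(pred_boxes, gt_boxes, threshold=0.5):
    hits = [iou(p, g) > threshold for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean(hits))

print(accuracy([[10, 10, 50, 50]], [[12, 12, 52, 52]]))  # 1.0 (IoU ≈ 0.82 > 0.5)
```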
5.2. Analysis of VLG-Gaze Baseline
In terms of the visual grounding task, we conduct contrast experiments to verify the effectiveness of gaze and language, respectively.
5.2.1. Quantitative Analysis
The performance comparison of gaze-based visual grounding, VLG, and the proposed scheme is shown in
Table 1. The gaze inputs of all methods are selected from 0–6 s for a fair comparison. To explore the effectiveness of gaze fusion, the VLG method adopts the model and parameters of MDETR for referring expression comprehension. In addition, since the visual backbone of the VLG-Gaze baseline is ResNet-101, MDETR-R101 is used for comparison. For the gaze-based visual grounding method, we fuse the gaze coordinates with Faster R-CNN [
22], as illustrated in
Figure 5 in detail. Faster R-CNN is a pre-trained object detection model capable of detecting target bounding boxes at any position, scale, and aspect ratio. It is widely used in object detection tasks and has been shown to perform well in gaze-based object detection [
The mean value of the gaze coordinate sequence within the six seconds is calculated to obtain the mean gaze point. Meanwhile, the candidate objects in the picture are detected by Faster R-CNN. We calculate the distance between the center of each candidate and the mean gaze point, and the target with the shortest distance is selected as the grounding result. The Faster R-CNN in this grounding model has been pre-trained on the MS COCO dataset.
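A minimal sketch of this gaze-based baseline, assuming the detector outputs are available as [x1, y1, x2, y2] boxes and the gaze trace is a sequence of (x, y) pixel coordinates, is shown below.

```python
import numpy as np

# Gaze-based grounding baseline: average the 6 s gaze sequence and pick the
# detected candidate whose center is closest to it. `detections` stands in for
# Faster R-CNN outputs as [x1, y1, x2, y2] boxes (an assumption of the sketch).
def gaze_based_grounding(gaze_sequence, detections):
    mean_gaze = np.asarray(gaze_sequence, dtype=float).mean(axis=0)      # (2,)
    boxes = np.asarray(detections, dtype=float)
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)        # (K, 2)
    distances = np.linalg.norm(centers - mean_gaze, axis=1)
    return boxes[np.argmin(distances)]                                   # selected box

gaze = np.random.normal([300, 220], 8, size=(360, 2))
candidates = [[100, 100, 200, 200], [260, 180, 340, 260], [400, 50, 500, 150]]
print(gaze_based_grounding(gaze, candidates))   # expected: the second candidate
```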
It can be seen from
Table 1 that our proposed scheme outperforms the other visual grounding methods. The VLG-Gaze baseline improves accuracy by 3.2% compared with the VLG method and by 37.1% compared with gaze-based visual grounding. Among all the visual grounding methods, the performance of the gaze-based method is the worst because it is mainly applicable to simple scenes with few targets, while the scenes in RefCOCOg-Gaze are more complex and contain more candidate objects; gaze interaction therefore requires language and other information to assist target recognition. For the VLG task, the grounding accuracy is effectively improved after integrating gaze, indicating that the gaze information provides auxiliary positional cues for target localization and leads to higher grounding precision, which proves the effectiveness of the proposed scheme.
5.2.2. Qualitative Analysis
Figure 6 demonstrates the cases of grounding results using models with different input modes, and it corresponds to the experimental results in
Table 1. The predicted bounding boxes of the VLG method, the gaze-based visual grounding (Gaze-VG) method, and our proposed VLG-Gaze baseline are shown in the figure row by row. In
Figure 6, the blue rectangle represents the ground truth bounding box, the red rectangle represents the predicted bounding box of the corresponding method, and the green dot visualizes the gaze position, which is the average value of the gaze coordinates from 0 to 360 frames. The referring language descriptions are annotated at the top of each column.
The leftmost two columns showcase the error correction of the proposed scheme compared to gaze-based visual grounding. The gaze trace directly reflects the position of the interactive target that the human is referring to. However, when multiple overlapping objects appear in the image, the target is difficult to recognize accurately from gaze information alone. For example, in the first column of images, the ground truth target is a table, but the gaze coordinates are located in the center of the desktop and closer to the plate, so Gaze-VG can only predict “plate”. By contrast, the proposed scheme identifies the target correctly by fusing the language information.
The two columns in the center showcase the error correction of the proposed scheme compared to the VLG method. In the third column, the referred target is a catcher, but because the visual appearance of the object on the left side is similar to that of the correct target, the VLG model cannot directly distinguish the target, resulting in grounding errors. The proposed scheme provides a spatial indication of the target position by incorporating gaze intention, avoiding the ambiguity of language understanding.
The rightmost two columns further demonstrate the advantages of the proposed approach. In the fifth column, due to the low color differentiation of the objects, both the VLG model and the gaze-based grounding model make errors. Additionally, the images in the sixth column contain many similar objects, which makes both VLG and Gaze-VG ineffective. Our proposed scheme combines the advantages of the gaze, language, and vision modalities to achieve higher interactive grounding precision.
5.3. Analysis of the RefCOCOg-Gaze Dataset
We analyze the RefCOCOg-Gaze dataset quantitatively and qualitatively, exploring the dataset configuration that yields the most accurate model performance and visualizing the gaze traces. In terms of the real gaze data, 1301 samples are used as the training set, and 1288 samples are used as the test set. In addition, the augmentation methods are applied only to the training set, and the gaze traces in the test set remain unchanged.
5.3.1. Quantitative Analysis
The best way to verify the proposed dataset is through practical model performance. To this end, we first train the VLG-Gaze baseline with three kinds of dataset settings, including RefCOCOg-Gaze without augmentation, RefCOCOg-Gaze with LSTM-based augmentation (RefCOCOg-Gaze LSTM), and RefCOCOg-Gaze with average-error-based augmentation (RefCOCOg-Gaze avg). As shown in
Table 2, the best result appears with the RefCOCOg-Gaze avg training set, whose grounding accuracy is 86.2%. Specifically, it achieves higher accuracy than the model trained on the RefCOCOg-Gaze LSTM dataset and improves the optimal accuracy by 14.2% compared with the un-augmented RefCOCOg-Gaze dataset. The LSTM-based augmentation also improves performance to a certain extent. The results of
Table 2 are analyzed by statistical tests in the box plot of
Figure 7; each group represents the performance of the proposed model trained on the same dataset over the different gaze ranges. As shown in
Figure 7, the mean accuracy of the model trained on RefCOCOg-Gaze avg is higher than that of the model trained on RefCOCOg-Gaze (t(4) = −7.64, p < 0.001, one-sided paired-sample t-test) and that of the model trained on RefCOCOg-Gaze LSTM (t(4) = −3.50, p < 0.05, one-sided paired-sample t-test). It can be concluded that the average-error-based augmentation method is more suitable and beneficial for improving the performance of the proposed model.
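For reference, such a one-sided paired-sample t-test can be reproduced with SciPy as sketched below; the accuracy values in the snippet are made-up placeholders, not the measured results of Table 2.

```python
from scipy import stats

# Illustrative one-sided paired-sample t-test in the style of the Figure 7
# analysis. The accuracy values below are invented for demonstration only.
acc_base = [0.70, 0.72, 0.74, 0.73, 0.71]   # e.g., RefCOCOg-Gaze (no augmentation)
acc_avg  = [0.84, 0.85, 0.86, 0.85, 0.84]   # e.g., RefCOCOg-Gaze avg

# H1: the baseline accuracies are lower than the augmented ones (one-sided).
t_stat, p_value = stats.ttest_rel(acc_base, acc_avg, alternative="less")
print(f"t({len(acc_base) - 1}) = {t_stat:.2f}, p = {p_value:.4f}")
```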
The gaze recording frequency of the AR glasses is 60 Hz, so each sample of the dataset contains 360 frames of gaze coordinates. In order to explore the influence of different gaze lengths on the model’s performance, we designed comparative experiments as shown in
Table 2. The results show that the VLG-Gaze baseline performs worse with gaze samples from 0–3 s, because the participants would occasionally look for the wrong target based only on the textual description during this period, which leads to inaccurate gaze data in the range of 0–180 frames. In addition, because the red bounding box displayed at 3–6 s avoids misunderstandings and ensures the quality of the gaze recording, the model accuracy with gaze samples from 3–6 s is generally higher than that with samples from 0–3 s. Therefore, the following experiments mainly focus on the gaze analysis at 3–6 s, i.e., frames 180–360.
According to
Table 2, we find that the model performs better when the gaze input range is set to 240–360 frames, and the best accuracy reaches 86.2%, which is 5.3% higher than that of VLG. However, when the gaze input is shorter, i.e., 300–360 frames, the model performance deteriorates due to the presence of random errors in the gaze input; increasing the gaze length appropriately can reduce the influence of these errors. Consistent with the design of the gaze recording, the gaze error at 0–3 s is larger, which decreases the accuracy of the model trained with 0–360 gaze frames and matches the error distribution of the gaze sequences. On the other hand, for the same frame selection of the gaze sequence, the augmented datasets achieve better performance, which fully proves the effectiveness of the data augmentation methods proposed in this paper.
5.3.2. Qualitative Analysis
The similarity between the simulated gaze and the real gaze is compared and analyzed by the sequence error, which is demonstrated in
Figure 8. We calculate the average sequence errors of the augmented gaze trace and the real gaze trace on the abscissa/ordinate axis, respectively. The data samples for evaluation are all from the test set. The average sequence errors are the mean differences between all the gaze samples and the center point of the ground truth bounding box. In
Figure 8, the blue dots represent the per-frame average sequence errors of the real gaze traces over the 0–360 frames; the green dots represent the average sequence errors of the gaze simulated by the LSTM-based augmentation; and the red dots represent the average sequence errors of the gaze simulated by the average-error-based augmentation. As can be seen, from the moment of 3 s, and especially after frame 200, the errors of the simulated gaze from the average-error-based augmentation and of the real gaze are all distributed within 50 pixels, whereas the gaze simulated by the LSTM-based augmentation maintains a nearly constant sequence error after frame 50. This indicates that the gaze trace simulated by the average-error-based augmentation is closer to the real gaze data.
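A minimal sketch of how such per-frame average sequence errors could be computed is given below; taking the absolute difference per axis and the array shapes are assumptions made for this sketch.

```python
import numpy as np

# Per-frame average sequence error in the style of the Figure 8 comparison:
# the mean difference between gaze traces and the ground truth box center,
# computed per frame and per axis (absolute difference is an assumption).
def average_sequence_error(gazes, centers):
    # gazes: (N, M, 2) gaze traces; centers: (N, 2) ground-truth box centers.
    diffs = np.abs(gazes - centers[:, None, :])   # per-sample, per-frame error
    return diffs.mean(axis=0)                     # (M, 2): mean error per frame

real = np.random.normal([320, 240], 10, size=(100, 360, 2))
centers = np.tile([320.0, 240.0], (100, 1))
print(average_sequence_error(real, centers).shape)  # (360, 2)
```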
Figure 9 demonstrates the visual comparison between the real gaze sequences and the simulated gaze sequences. The blue bounding boxes denote the ground truth targets. We represent the gaze coordinates of all the frames with 360 dots, in which the yellow dots represent the gaze sequence of 0–3 s and the green dots represent the gaze coordinates of 3–6 s. It can be seen that the distribution and variation trends of the simulated gaze sequences from RefCOCOg-Gaze avg are similar to those of the real gaze sequences, indicating the effectiveness of the average-error-based augmentation method.
7. Conclusions
In this paper, a multi-modal grounding dataset containing gaze traces is constructed, and a novel vision language grounding framework fused with gaze intention is proposed. The real gaze traces of human participants are collected with an AR device, and data augmentation methods are designed to expand the scale of the training set. On this basis, a novel dataset, RefCOCOg-Gaze, including gaze, image, and language, is constructed. Using this dataset, the VLG-Gaze baseline model is proposed, which uses gaze and language interaction information to represent human intention and achieves the complementary advantages of multiple modalities. The experimental results show that the proposed VLG-Gaze scheme significantly enhances the visual grounding accuracy to 86.2%, and the model performs better with the training set augmented by the average-error-based method. In future work, we will further optimize the model for the VLG-Gaze task and explore better multi-modal fusion methods to promote the development of visual grounding.