1. Introduction
Visual grounding aims to recognize and localize the referred target in natural images based on interaction information provided by humans. It is a novel interaction method that offers augmented reality (AR)/virtual reality (VR) devices the advantages of being user-friendly, efficient, and more intelligent. In recent years, visual grounding has become a research hotspot in the field of human-computer interaction (HCI) [1,2,3] and has played a key role in unmanned systems control tasks such as emergency rescue and disaster relief, elderly care, and disability assistance. However, existing vision language grounding (VLG) mainly uses language information to express human intentions, and the grounding accuracy is poor when there are multiple similar visual targets in the image. In practice, humans tend to gaze at a target when referring to it, and the gaze trace can provide positional information for visual grounding. Gaze interaction is also an important interaction mode in AR/VR devices, and it can effectively alleviate the inaccurate grounding problem mentioned above. In this paper, we propose to apply gaze interaction to vision language grounding and design a multi-modal grounding framework with language and gaze. As shown in
Figure 1, the red bounding box represents the conventional vision language grounding result, while the green bounding box represents the proposed multi-modal grounding result fused with gaze intention.
To conduct multi-modal visual grounding research, two stages are essential. First, a multi-modal dataset containing gaze, image, and language must be constructed. Then, a feature fusion model must be established to realize cross-modal understanding of human intention.
In the studies of VLG, there are already several available public datasets, such as RefCOCO [
4], RefCOCO+ [
4], and RefCOCOg [
5], among which RefCOCOg has longer text queries, a more comprehensive description, and wider content coverage. Based on the above datasets, several studies [
6,
7,
8] proposed two-stage visual grounding architectures, which mainly rely on off-the-shelf object detection algorithms to extract the visual features of the candidate targets. The pre-training process of the detection algorithms limits the categories of grounding candidates, and the independently extracted object features lack global context. To avoid this drawback, Kamath et al. [
9] proposed a one-stage modulated grounding model, using a transformer-based architecture to fuse the image and language at an early stage of the model without the object detection algorithms. Based on transformer [
10], Yang et al. [
11] unified the text and box outputs of the grounded vision language model, achieving joint pre-training on multiple datasets from different vision-language tasks and alleviating the problem of insufficient data. However, when the linguistic description is not strictly logical or there are multiple similar objects in the image, it is hard for a VLG model to precisely identify the referred target.
In studies of gaze interaction, gaze-based virtual target localization has been extensively studied with the popularization of AR/VR devices [
12,
13]. The common pipeline of gaze interaction mainly contains two steps: first obtaining gaze coordinates, and then performing target localization [
12] and important target extraction tasks [
13]. In the field of gaze-based object detection, existing studies [
14,
15,
16] usually adopted the gaze point in each frame to predict the target in videos, but a single gaze point can easily be disturbed by various factors, leading to incorrect detection. The above methods all perform visual feature selection based on gaze interaction alone. Because a single interaction mode carries limited characteristic information, the grounding accuracy is unsatisfactory. Furthermore, the gaze trace is influenced by several factors, including the appearance of the target, the observer’s personal habits, and the specific gaze acquisition device used.
In this paper, we explore the vision language grounding task fused with gaze intention (VLG-Gaze) and conduct comparison experiments between VLG-Gaze and VLG. An AR device is used to collect real gaze coordinate sequences, and combined with the proposed data augmentation method, a dataset containing more than 80,000 gaze annotations is constructed. Based on the attention-based transformer, a multi-modal feature extraction model for gaze, vision, and language is designed, realizing the VLG-Gaze task through global feature integration. Finally, a group of experiments is carried out comparing the performance of various multi-modal combinations. The results indicate that, compared with the existing state-of-the-art VLG model, the accuracy of the proposed scheme is improved by 5.3%, which fully proves the effectiveness and advantages of the multi-modal grounding framework with language and gaze. The contributions of this paper are summarized as follows:
A multi-modal grounding dataset containing gaze traces is proposed for visual grounding. An AR device is used to collect the real gaze trace of the referred subjects, and each gaze sample corresponds to a natural image and its language description, as well as the target bounding box.
Taking gaze and language as the expression of interactive human intention, we explore the VLG-Gaze task. In terms of dataset construction, the optimal design for improving model performance is determined through groups of experiments.
Aiming at the multiple inputs of vision, language, and gaze, a multi-modal grounding framework is established. Experimental results show that our proposed scheme outperforms the existing vision language grounding and the gaze-based visual grounding method, fully proving the effectiveness of gaze fusion.
In the following parts of this paper,
Section 2 introduces the research status of visual grounding and the relevant gaze-based HCI in computer vision. In
Section 3, the acquisition and construction of a multi-modal grounding dataset are described in detail. Subsequently, the multi-modal grounding framework with language and gaze is proposed, and the training details are described in
Section 4. Through comparative experiments,
Section 5 verifies and discusses the effectiveness of gaze fusion, and we analyze the proposed multi-modal grounding dataset.
Section 6 discusses how the proposed solution can be applied to develop existing models and describes the limitations and future work. Finally, the entire paper is summarized in
Section 7.
2. Related Work
Visual grounding. The common line of visual grounding research mainly focuses on vision-language fusion to achieve more accurate grounding, using only language to express human intention. Conventional vision-language feature fusion methods can be categorized into three forms: joint embedding approaches, modular-based models, and graph-based reasoning methods [
6,
7,
8,
17]. Zhuang et al. [
6] proposed the Parallel Attention Network, which employs two attention mechanisms to match language and visual features locally and globally, recurrently matching the description with region proposals. However, it ignores the relative positions of the visual objects. Yu et al. [
7] proposed a modular-based method that decomposes the text description into three components, including the target appearance, location, and relationship to other objects, and calculates the matching score between the language parts and visual features through three attention-based modules. Yang et al. [
8] proposed a dynamic graph attention network that uses graph-based reasoning to represent the image and the text as graphs, matching elements by appearance and by the relationships from objects to the subject, and selecting the proposal with the highest matching score. To enhance grounding performance, Vasudevan et al. [
18] proposed a two-stage model using facial images, language, video, optical flow, and a depth map. However, the gaze information extracted from facial images is uncertain, and the multi-modal fusion is carried out locally and globally only by a long short-term memory (LSTM) model, both of which result in unsatisfactory grounding performance.
However, it is difficult for the above approaches to learn a wide range of feature representations from large-scale datasets. In the last few years, there have been studies [
9,
19,
20] to establish cross-modal pre-training models based on vision-language datasets, which not only improve grounding performance but can also be applied to a variety of downstream tasks. Based on bidirectional encoder representations from transformers (BERT) [21], Lu et al. [19] proposed a two-stream feature fusion model, vision-and-language BERT (ViLBERT), while Su et al. [
20] adopted the single-stream method and proposed a visual-linguistic BERT (VLBERT), which is pre-trained with large-scale vision-language datasets. All the above two-stage methods rely on pre-trained detection models, such as Faster R-CNN [
22], to extract the target proposals and their visual features. These detectors are pre-trained on a fixed vocabulary of categories, which makes it difficult for the cross-modal model to understand contextual visual information. One-stage approaches generally use the attention-based transformer, which can perform pixel-level cross-modal attention in parallel, facilitating the deep integration of different modalities.
Gaze-based HCI in computer vision. With the development of gaze estimation and tracking [
1,
23,
24], gaze has become one of the main interaction modes in HCI systems [
3,
25,
26]. At the same time, gaze is more widely applied to cross-modal computer vision tasks. In order to extract significant targets, Karthikeyan et al. [
13] utilized gaze to extract mainstream intention tracks, identifying and segmenting significant targets among multiple candidate objects in videos. Chen et al. [
27] and Sood et al. [
28] collected small-scale gaze datasets for visual question-answering (VQA) tasks in order to understand the human reasoning process. The former study [
27] analyzed the reasoning attention of machines and humans, proposing a supervision method to optimize the reasoning process, while the latter [
28] collected a dataset of human gaze on both images and questions for VQA to analyze the similarity between human attention and the neural attentive strategies learned by VQA models. The above studies utilize gaze to improve model performance or to compare machine and human reasoning mechanisms, but the accuracy and stability of the gaze sequences are not satisfactory. In conclusion, constrained by the quality of gaze-tracking devices and the scale of training data, existing research rarely uses gaze intention in the task of visual grounding.
3. RefCOCOg-Gaze Dataset
In this section, AR glasses are utilized to collect human gaze traces. Combined with data augmentation methods, the RefCOCOg-Gaze dataset is proposed, which is the first vision language grounding dataset containing gaze trace data.
RefCOCOg-Gaze includes gaze coordinate sequences, images, and text sentences. To ensure the diversity of interactive scenes, the image and text data are sourced from the RefCOCOg dataset. The interactive gaze information is then recorded to build a multi-modal grounding dataset containing gaze traces, which is called RefCOCOg-Gaze in this paper. In terms of the gaze traces, an AR device is utilized to record the real gaze coordinates of 20 participants. Combined with the data augmentation methods, the new dataset with more than 80,000 multi-modal samples is systematically constructed.
3.1. Gaze Data Collection
The AR device used is the Nreal Light (made by the Nreal Corporation), which has two OLED screens of 3840 × 1080 pixels and is equipped with two off-axis infrared cameras with resolutions of 640 × 480 pixels that acquire eye images simultaneously at a sampling frequency of 60 Hz. In addition, combined with the gaze-tracking algorithm of the device, the gaze coordinates can be obtained in real time. According to the official statistics, the gaze estimation error is less than 1.5° under fully calibrated conditions.
3.1.1. Gaze Recording System
Based on Unity (version 2019.4.30f1c1), a gaze-recording Android application is developed for data collection. The specific working procedure of the software is shown in
Figure 2, which includes two stages and five steps. The participants’ fixation coordinate (the green dot in
Figure 2) is obtained by projecting the 3-dimensional sight vector onto the 2-dimensional screen. The screen is a virtual square plane of 5 m × 5 m, and the distance between the screen and the human eyes is 10 m.
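As a rough illustration of this projection step, the following sketch intersects a gaze direction vector with a plane placed 10 m in front of the eyes and maps the hit point to pixel coordinates. The eye-centered coordinate frame, the linear pixel mapping, and all function and parameter names are assumptions made for this sketch, not details taken from the Nreal SDK or our recording software.

```python
import numpy as np

# Minimal sketch of projecting a 3-D sight vector onto the 2-D virtual plane.
# Assumptions (not from the paper): the gaze ray starts at the eye midpoint,
# the 5 m x 5 m plane is perpendicular to the viewing (z) axis at z = 10 m,
# and the displayed image is mapped linearly onto the plane at image_size pixels.
def project_gaze(direction, plane_depth=10.0, plane_size=5.0, image_size=(640, 480)):
    d = np.asarray(direction, dtype=float)
    if d[2] <= 0:
        return None                       # gaze ray does not point toward the plane
    t = plane_depth / d[2]                # scale so the ray reaches z = plane_depth
    x, y = d[0] * t, d[1] * t             # intersection point on the plane (metres)
    # Convert plane coordinates (centred at the plane origin) to pixel coordinates.
    u = (x / plane_size + 0.5) * image_size[0]
    v = (0.5 - y / plane_size) * image_size[1]
    return u, v

print(project_gaze([0.05, -0.02, 1.0]))   # a slightly right/down gaze direction
```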
3.1.2. Data Collection Procedure
Initialization and calibration. Participants are first asked to wear the AR glasses correctly and then complete the instrument calibration (
Figure 2a). Subsequently, the system calculates the value of the calibration error (
Figure 2b). If the error is greater than 1.5°, the calibration process needs to be repeated. To reduce the influence of other unexpected factors, each participant sits in an assigned seat and keeps the AR glasses still on their face throughout the data collection process.
Gaze recording. At the beginning of gaze recording, AR glasses display a text sequence that describes a referred target (
Figure 2c). After fully understanding the description, the participant clicks the confirm button, and the corresponding picture appears on the screen. The referred target in the image then needs to be gazed at for three seconds. The moment when the picture starts to be displayed is set to 0 s, and the system collects the gaze trace within 0 to 3 s in the first recording stage (Figure 2d). To avoid linguistic misunderstandings, we indicate the correct referred target during the second part of the recording. From 3 to 6 s, a red bounding box is annotated in the image to show the ground truth. During this time, the participant continues to look at the target, and the system collects the gaze trace for another three seconds in the second recording stage (
Figure 2e). The gaze recording process for each sample is completed in six seconds.
In summary, 360 frames of gaze coordinates are collected for each data sample. Before data collection, we conducted a set of experiments and found that participants do not feel too tired when focusing on a fixed target for six seconds. To ensure the recording quality and a sufficient length of data, we designed the recording process with two stages of three seconds each, for a total duration of six seconds. To further prevent participant fatigue, 6 sets of 50 data samples are collected for each participant. The recording duration of each set is about 10–15 min, and the participant takes a 5 min rest before each recording set.
3.1.3. Participant Recruitment and Quality Assurance
All participants are adult college students who are able to fully understand English descriptions. In the preparation phase, we use interactive games with the AR glasses to train the candidates, including gaze interaction, language interaction, and other operational training courses. Only candidates who perform well in the preparation phase are selected to participate in the subsequent recording stages. In addition, all participants understood the collection procedure and signed informed consent forms before the gaze recording, and they were properly compensated by the research group. The experiment was approved by the Ethics Committee of Tianjin University (TJUE-2021-138).
To ensure data quality, each participant’s recordings were checked by the HCI experts via 15 random samples. If a gaze annotation error occurred, the participant would be replaced, and the samples would be re-annotated. In total, 2589 real gaze samples from 20 participants were collected.
3.2. Gaze Augmentation
Due to the time-consuming collection process with the AR device, the amount of manually annotated data is limited, and it is difficult to adequately train a model on such a small-scale dataset. To avoid the long-term consumption of human and material resources, we use the real human gaze to generate simulated gaze sequences in two ways, i.e., LSTM-based augmentation and average-error-based augmentation. Through these augmentation methods, more than 70,000 gaze samples are generated. The effectiveness of the augmentation is verified by the grounding experiments in
Section 5.3.
3.2.1. LSTM-Based Augmentation
The LSTM-based augmentation adopts a single-layer LSTM model, as shown in
Figure 3. The pre-trained ResNet [
29], RoBERTa, and multilayer perceptron (MLP) are utilized to extract features from the input image, text, and bounding box, respectively. As a reference, Nie et al. [
30] generated time series by segmenting them into subseries-level patches instead of generating them point-wise. Patch-wise generation is more efficient and can capture the connections between gaze points; thus, the proposed augmentation model generates ten gaze coordinates at each time step. The feature representation of the first ten gaze coordinates is generated from the input image, text, and bounding box. In the subsequent time steps, the gaze features are generated based on the previously predicted gaze coordinates. Finally, the output features are transformed into two-dimensional gaze sequence predictions through a fully connected layer.
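To make the patch-wise generation concrete, the following PyTorch sketch shows one way such a generator could be assembled. The layer dimensions, the feedback of each predicted patch into the next step, and all names (GazeLSTMGenerator, ctx_dim, etc.) are illustrative assumptions rather than the paper's actual implementation; the image, text, and box features are assumed to be pre-extracted by ResNet, RoBERTa, and an MLP as described above.

```python
import torch
import torch.nn as nn

# Minimal sketch of an LSTM-based gaze augmentation model (dimensions are
# assumptions). Image, text, and bounding-box features are pre-extracted.
class GazeLSTMGenerator(nn.Module):
    def __init__(self, ctx_dim=2048 + 768 + 64, hidden_dim=512, patch_len=10):
        super().__init__()
        self.patch_len = patch_len
        self.init_proj = nn.Linear(ctx_dim, hidden_dim)          # context -> first input
        self.patch_proj = nn.Linear(patch_len * 2, hidden_dim)   # previous 10 (x, y) points
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden_dim, patch_len * 2)         # predict next 10 (x, y) points

    def forward(self, img_feat, txt_feat, box_feat, num_steps=36):
        ctx = torch.cat([img_feat, txt_feat, box_feat], dim=-1)
        inp = self.init_proj(ctx).unsqueeze(1)                   # (B, 1, H)
        state, outputs = None, []
        for _ in range(num_steps):                               # 36 x 10 = 360 coordinates
            out, state = self.lstm(inp, state)
            patch = self.head(out)                               # (B, 1, 20)
            outputs.append(patch.view(-1, self.patch_len, 2))
            inp = self.patch_proj(patch)                         # feed prediction back in
        return torch.cat(outputs, dim=1)                         # (B, 360, 2)

model = GazeLSTMGenerator()
gaze = model(torch.randn(2, 2048), torch.randn(2, 768), torch.randn(2, 64))
print(gaze.shape)  # torch.Size([2, 360, 2])
```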
3.2.2. Average-Error-Based Augmentation
On the one hand, the gaze interaction process contains a certain error with respect to the referred target; on the other hand, the coordinate sequence is correlated with the time mark. In order to simulate the gaze trace accurately, we first calculate the difference between each real gaze trace $(x_i^{(n)}, y_i^{(n)})$ and the center of its referred target $(x_c^{(n)}, y_c^{(n)})$ as follows:

$$\left(\Delta x_i^{(n)},\, \Delta y_i^{(n)}\right) = \left(x_i^{(n)} - x_c^{(n)},\; y_i^{(n)} - y_c^{(n)}\right), \quad i = 1, \ldots, M. \tag{1}$$

Then, the mean error of the N real gaze samples is calculated:

$$\left(\Delta \bar{x}_i,\, \Delta \bar{y}_i\right) = \frac{1}{N} \sum_{n=1}^{N} \left(\Delta x_i^{(n)},\, \Delta y_i^{(n)}\right). \tag{2}$$

In order to augment the dataset, we simulate a number of gaze coordinate sequences, and each coordinate needs to be associated with the real target. Based on the mean error of the real gaze samples and the center of the new referred target $(x_c', y_c')$, the simulated gaze trace $(\hat{x}_i, \hat{y}_i)$ is calculated as follows:

$$\left(\hat{x}_i,\, \hat{y}_i\right) = \left(x_c' + \Delta \bar{x}_i,\; y_c' + \Delta \bar{y}_i\right), \quad i = 1, \ldots, M. \tag{3}$$

In our augmentation implementation, M is the number of frames in the gaze sample, which is 360 according to the recording setup (60 frames per second and a total of six seconds). $(x_c', y_c')$ is the bounding box center of the new data sample, and $\Delta \bar{x}_i$ and $\Delta \bar{y}_i$ are the mean errors of the i-th gaze frame along the abscissa and ordinate axes, respectively; they are calculated from the real gaze data via Equation (2).
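The following NumPy sketch implements Equations (1)–(3) as reconstructed above, assuming the real gaze traces and their target centers are already stored as arrays; the array shapes, variable names, and the toy data at the end are illustrative assumptions only.

```python
import numpy as np

# Minimal sketch of the average-error-based augmentation (Equations (1)-(3)).
# `real_gazes` holds N real traces of shape (N, M, 2); `real_centers` holds the
# matching target centers (N, 2). Names and shapes are illustrative assumptions.
def mean_gaze_error(real_gazes, real_centers):
    # Eq. (1): per-frame difference between each gaze trace and its target center.
    diffs = real_gazes - real_centers[:, None, :]        # (N, M, 2)
    # Eq. (2): mean error of the N real gaze samples, per frame.
    return diffs.mean(axis=0)                            # (M, 2)

def simulate_gaze(mean_error, new_center):
    # Eq. (3): shift the new target center by the frame-wise mean error.
    return new_center[None, :] + mean_error              # (M, 2)

rng = np.random.default_rng(0)
real_gazes = rng.normal(0.0, 5.0, size=(20, 360, 2)) + np.array([320.0, 240.0])
real_centers = np.tile(np.array([320.0, 240.0]), (20, 1))
mean_err = mean_gaze_error(real_gazes, real_centers)
simulated = simulate_gaze(mean_err, new_center=np.array([150.0, 200.0]))
print(simulated.shape)  # (360, 2)
```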
5. Experiments and Results
In order to verify the effectiveness of the RefCOCOg-Gaze dataset and the VLG-Gaze baseline model, this section takes the gaze-based visual grounding method and the VLG method as comparisons, carrying out a variety of qualitative and quantitative experimental analyses. At the same time, the average-error-based augmentation method and the LSTM-based augmentation method are experimentally verified, and the optimal input frame range of the gaze sequence is explored.
5.1. Experimental Setup and Evaluation Metric
The proposed architecture was implemented with the PyTorch framework. Two GPUs were used for distributed training. The pre-training parameter settings are the same as those of MDETR [
9]. In the fine-tuning process, five epochs are iterated in each experiment, and the Adam optimizer with a learning rate of 5 × 10⁻⁵ is used to optimize the model parameters.
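For reference, a minimal fine-tuning skeleton matching the stated settings (Adam, learning rate 5 × 10⁻⁵, five epochs) could look as follows; the model interface and batch layout are assumptions for this sketch, not the actual training code.

```python
import torch

# Illustrative fine-tuning skeleton: Adam optimizer, lr = 5e-5, 5 epochs.
# The batch keys and the model call signature are assumptions for the sketch.
def finetune(model, train_loader, device="cuda", epochs=5, lr=5e-5):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for epoch in range(epochs):
        for batch in train_loader:
            images = batch["image"].to(device)
            gaze = batch["gaze"].to(device)                  # (B, T, 2) gaze coordinates
            captions = batch["caption"]                      # list of referring expressions
            targets = batch["box"].to(device)                # ground-truth bounding boxes
            loss = model(images, captions, gaze, targets)    # assumed to return the loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```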
The model’s performance is evaluated by accuracy. First, the intersection over union (IoU) between the predicted bounding box and the ground truth bounding box is calculated as the ratio of the area of their intersection to the area of their union. A prediction with an IoU > 0.5 is regarded as correct. The accuracy on the test set is then calculated as the ratio of correctly predicted samples to the total number of samples.
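A minimal sketch of this metric, assuming boxes are given in [x1, y1, x2, y2] format, is shown below.

```python
import numpy as np

# IoU between predicted and ground truth boxes, and accuracy at IoU > 0.5.
def iou(box_a, box_b):
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def accuracy(pred_boxes, gt_boxes, threshold=0.5):
    hits = [iou(p, g) > threshold for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean(hits))

print(accuracy([[10, 10, 50, 50]], [[12, 12, 52, 52]]))  # 1.0 (IoU ≈ 0.82 > 0.5)
```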
5.2. Analysis of VLG-Gaze Baseline
In terms of the visual grounding task, we conduct contrast experiments to verify the effectiveness of gaze and language, respectively.
5.2.1. Quantitative Analysis
The performance comparison of gaze-based visual grounding, VLG, and the proposed scheme is shown in
Table 1. The gaze inputs of all methods are selected from 0–6 s for a fair comparison. To explore the effectiveness of gaze fusion, the VLG method adopts the model and parameters of MDETR for referring expression comprehension. In addition, since the visual backbone of the VLG-Gaze baseline is ResNet-101, MDETR-R101 is used for comparison. For the gaze-based visual grounding method, we fuse the gaze coordinates with Faster R-CNN [
22], as illustrated in
Figure 5 in detail. Faster R-CNN is a pre-trained object detection model capable of detecting target bounding boxes at any position, scale, and aspect ratio. It is widely used in object detection tasks and has been shown to perform well in gaze-based object detection [
The mean value of the gaze coordinate sequence within the six seconds is calculated to obtain the mean gaze point. Meanwhile, the candidate objects in the picture are detected by Faster R-CNN. We calculate the distance between the center of each candidate and the mean gaze point, and the target with the shortest distance is selected as the grounding result. The Faster R-CNN in this grounding model has been pre-trained on the MS COCO dataset.
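A minimal sketch of this gaze-based baseline, assuming the detector outputs are available as [x1, y1, x2, y2] boxes and the gaze trace is a sequence of (x, y) pixel coordinates, is shown below.

```python
import numpy as np

# Gaze-based grounding baseline: average the 6 s gaze sequence and pick the
# detected candidate whose center is closest to it. `detections` stands in for
# Faster R-CNN outputs as [x1, y1, x2, y2] boxes (an assumption of the sketch).
def gaze_based_grounding(gaze_sequence, detections):
    mean_gaze = np.asarray(gaze_sequence, dtype=float).mean(axis=0)      # (2,)
    boxes = np.asarray(detections, dtype=float)
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)        # (K, 2)
    distances = np.linalg.norm(centers - mean_gaze, axis=1)
    return boxes[np.argmin(distances)]                                   # selected box

gaze = np.random.normal([300, 220], 8, size=(360, 2))
candidates = [[100, 100, 200, 200], [260, 180, 340, 260], [400, 50, 500, 150]]
print(gaze_based_grounding(gaze, candidates))   # expected: the second candidate
```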
It can be seen from
Table 1 that our proposed scheme outperforms the other visual grounding methods. The VLG-Gaze baseline improves accuracy by 3.2% compared with the VLG method and by 37.1% compared with gaze-based visual grounding. Among all the visual grounding methods, the performance of the gaze-based method is the worst because it is mainly applicable to simple scenes with few targets, while the scenes in RefCOCOg-Gaze are more complex and contain more candidate objects; gaze interaction therefore requires language and other information to assist target recognition. For the VLG task, the grounding accuracy is effectively improved after integrating gaze, indicating that the gaze information provides auxiliary positional cues for target localization and leads to higher grounding precision, which proves the effectiveness of the proposed scheme.
5.2.2. Qualitative Analysis
Figure 6 demonstrates the cases of grounding results using models with different input modes, and it corresponds to the experimental results in
Table 1. The predicted bounding boxes of the VLG method, the gaze-based visual grounding (Gaze-VG) method, and our proposed VLG-Gaze baseline are shown in the figure row by row. In
Figure 6, the blue rectangle represents the ground truth bounding box, the red rectangle represents the predicted bounding box of the corresponding method, and the green dot visualizes the gaze position, which is the average value of the gaze coordinates from 0 to 360 frames. The referring language descriptions are annotated at the top of each column.
The leftmost two columns showcase the error correction of the proposed scheme compared to gaze-based visual grounding. The gaze trace directly reflects the position of the interactive target that the human is referring to. However, when multiple overlapping objects appear in the image, the target is difficult to recognize accurately from gaze information alone. For example, in the first column of images, the ground truth target is a table, but the gaze coordinates are located in the center of the desktop and closer to the plate, so Gaze-VG can only predict “plate”. By contrast, the proposed scheme identifies the target correctly by fusing the language information.
The two columns in the center showcase the error correction of the proposed scheme compared to the VLG method. In the third column, the referred target is a catcher, but because the visual appearance of the object on the left side is similar to that of the correct target, the VLG model cannot directly distinguish the target, resulting in grounding errors. The proposed scheme provides a spatial indication of the target position by incorporating gaze intention, avoiding the ambiguity of language understanding.
The rightmost two columns further demonstrate the advantages of the proposed approach. In the fifth column, due to the low color differentiation of the objects, both the VLG model and the gaze-based grounding model make errors. Additionally, the images in the sixth column contain many similar objects, which makes both VLG and Gaze-VG ineffective. Our proposed scheme combines the advantages of the gaze, language, and vision modalities to achieve higher interactive grounding precision.
5.3. Analysis of the RefCOCOg-Gaze Dataset
We analyze the RefCOCOg-Gaze dataset quantitatively and qualitatively, exploring the dataset configuration that yields the most accurate model performance and visualizing the gaze traces. In terms of the real gaze data, 1301 samples are used as the training set, and 1288 samples are used as the test set. In addition, the augmentation methods are applied only to the training set, and the gaze traces in the test set remain unchanged.
5.3.1. Quantitative Analysis
The best way to verify the proposed dataset is through practical model performance. To this end, we first train the VLG-Gaze baseline with three kinds of dataset settings, including RefCOCOg-Gaze without augmentation, RefCOCOg-Gaze with LSTM-based augmentation (RefCOCOg-Gaze LSTM), and RefCOCOg-Gaze with average-error-based augmentation (RefCOCOg-Gaze avg). As shown in
Table 2, the best result appears with the RefCOCOg-Gaze avg training set, whose grounding accuracy is 86.2%. Specifically, it achieves higher accuracy than the model trained on the RefCOCOg-Gaze LSTM dataset and improves the optimal accuracy by 14.2% compared with the un-augmented RefCOCOg-Gaze dataset. The LSTM-based augmentation also improves performance to a certain extent. The results of
Table 2 are analyzed by statistical tests in the box plot of
Figure 7; each group represents the performance of the proposed model trained on the same dataset over the different gaze ranges. As shown in
Figure 7, the mean accuracy of the model trained on RefCOCOg-Gaze avg is higher than that of the model trained on RefCOCOg-Gaze (t(4) = −7.64, p < 0.001, one-sided paired-sample t-test) and that of the model trained on RefCOCOg-Gaze LSTM (t(4) = −3.50, p < 0.05, one-sided paired-sample t-test). It can be concluded that the average-error-based augmentation method is more suitable and beneficial for improving the performance of the proposed model.
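For reference, such a one-sided paired-sample t-test can be reproduced with SciPy as sketched below; the accuracy values in the snippet are made-up placeholders, not the measured results of Table 2.

```python
from scipy import stats

# Illustrative one-sided paired-sample t-test in the style of the Figure 7
# analysis. The accuracy values below are invented for demonstration only.
acc_base = [0.70, 0.72, 0.74, 0.73, 0.71]   # e.g., RefCOCOg-Gaze (no augmentation)
acc_avg  = [0.84, 0.85, 0.86, 0.85, 0.84]   # e.g., RefCOCOg-Gaze avg

# H1: the baseline accuracies are lower than the augmented ones (one-sided).
t_stat, p_value = stats.ttest_rel(acc_base, acc_avg, alternative="less")
print(f"t({len(acc_base) - 1}) = {t_stat:.2f}, p = {p_value:.4f}")
```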
The gaze recording frequency of the AR glasses is 60 Hz, so each sample of the dataset contains 360 frames of gaze coordinates. In order to explore the influence of different gaze lengths on the model’s performance, we designed comparative experiments as shown in
Table 2. The results show that the VLG-Gaze baseline performs worse with gaze samples from 0–3 s, because the participants would occasionally look for the wrong target based only on the textual description during this period, which leads to inaccurate gaze data in the range of 0–180 frames. In addition, because the red bounding box displayed at 3–6 s avoids misunderstandings and ensures the quality of the gaze recording, the model accuracy with gaze samples from 3–6 s is generally higher than that with samples from 0–3 s. Therefore, the following experiments mainly focus on the gaze analysis at 3–6 s, i.e., frames 180–360.
According to
Table 2, we find that the model performs better when the gaze input range is set to 240–360 frames, and the best accuracy reaches 86.2%, which is 5.3% higher than that of VLG. However, when the gaze input is shorter, i.e., 300–360 frames, the model performance deteriorates due to the presence of random errors in the gaze input; increasing the gaze length appropriately can reduce the influence of these errors. Consistent with the design of the gaze recording, the gaze error at 0–3 s is larger, which decreases the accuracy of the model trained with 0–360 gaze frames and matches the error distribution of the gaze sequences. On the other hand, for the same frame selection of the gaze sequence, the augmented datasets achieve better performance, which fully proves the effectiveness of the data augmentation methods proposed in this paper.
5.3.2. Qualitative Analysis
The similarity between the simulated gaze and the real gaze is compared and analyzed by the sequence error, which is demonstrated in
Figure 8. We calculate the average sequence errors of the augmented gaze trace and the real gaze trace on the abscissa/ordinate axis, respectively. The data samples for evaluation are all from the test set. The average sequence errors are the mean differences between all the gaze samples and the center point of the ground truth bounding box. In
Figure 8, the blue dots represent the per-frame average sequence errors of the real gaze traces over the 0–360 frames; the green dots represent the average sequence errors of the gaze simulated by the LSTM-based augmentation; and the red dots represent the average sequence errors of the gaze simulated by the average-error-based augmentation. As can be seen, from the moment of 3 s, and especially after frame 200, the errors of the simulated gaze from the average-error-based augmentation and of the real gaze are all distributed within 50 pixels, whereas the gaze simulated by the LSTM-based augmentation maintains a nearly constant sequence error after frame 50. This indicates that the gaze trace simulated by the average-error-based augmentation is closer to the real gaze data.
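A minimal sketch of how such per-frame average sequence errors could be computed is given below; taking the absolute difference per axis and the array shapes are assumptions made for this sketch.

```python
import numpy as np

# Per-frame average sequence error in the style of the Figure 8 comparison:
# the mean difference between gaze traces and the ground truth box center,
# computed per frame and per axis (absolute difference is an assumption).
def average_sequence_error(gazes, centers):
    # gazes: (N, M, 2) gaze traces; centers: (N, 2) ground-truth box centers.
    diffs = np.abs(gazes - centers[:, None, :])   # per-sample, per-frame error
    return diffs.mean(axis=0)                     # (M, 2): mean error per frame

real = np.random.normal([320, 240], 10, size=(100, 360, 2))
centers = np.tile([320.0, 240.0], (100, 1))
print(average_sequence_error(real, centers).shape)  # (360, 2)
```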
Figure 9 demonstrates the visual comparison between the real gaze sequences and the simulated gaze sequences. The blue bounding boxes denote the ground truth targets. We represent the gaze coordinates of all the frames with 360 dots, in which the yellow dots represent the gaze sequence of 0–3 s and the green dots represent the gaze coordinates of 3–6 s. It can be seen that the distribution and variation trends of the simulated gaze sequences from RefCOCOg-Gaze avg are similar to those of the real gaze sequences, indicating the effectiveness of the average-error-based augmentation method.
7. Conclusions
In this paper, a multi-modal grounding dataset containing gaze traces is constructed, and a novel vision language grounding framework fused with gaze intention is proposed. The real gaze traces of human participants are collected with an AR device, and data augmentation methods are designed to expand the scale of the training set. On this basis, a novel dataset, RefCOCOg-Gaze, including gaze, image, and language, is constructed. Using this dataset, the VLG-Gaze baseline model is proposed, which uses gaze and language interaction information to represent human intention and achieves the complementary advantages of multiple modalities. The experimental results show that the proposed VLG-Gaze scheme significantly enhances the visual grounding accuracy to 86.2%, and the model performs better with the training set augmented by the average-error-based method. In future work, we will further optimize the model for the VLG-Gaze task and explore better multi-modal fusion methods to promote the development of visual grounding.