Gaze-Dependent Image Re-Ranking Technique for Enhancing Content-Based Image Retrieval

Abstract: Content-based image retrieval (CBIR) aims to find images similar to the one input by the user, and it is extensively used in the real world. Conventional CBIR methods do not consider user preferences, since they determine retrieval results solely from the visual similarity between the query and candidate images. As a result, a "semantic gap" appears: the model may not accurately understand the intention a user has embedded in the query image. In this article, we propose a re-ranking method for CBIR that uses a user's gaze trace as interactive information to help the model predict the user's inherent attention. The proposed method treats the gaze traces corresponding to the images obtained from the initial retrieval as the user's preference information. We introduce image captioning to effectively express the relationship between images and gaze information by generating image captions conditioned on the gaze trace. As a result, we can transform the coordinate data into a text format and explicitly express the semantic information of the images. Finally, image retrieval is performed again using the generated gaze-dependent image captions to obtain images that align more closely with the user's preferences. Experimental results on an open image dataset with corresponding gaze traces and human-generated descriptions demonstrate the efficacy of the proposed method. Our method treats visual information as user feedback to achieve user-oriented image retrieval.


Introduction
With the popularization of personal electronic terminals such as smartphones and tablets, the amount of visual data available on the internet has grown rapidly in recent years [1]. Correspondingly, the performance of related image retrieval technologies is also improving and becoming more sophisticated [2]. Content-based image retrieval (CBIR) is a search technology that aims to find images similar to the image input by users (the query image), and it is extensively applied in the real world [3]. For example, image search engines (e.g., Google (http://images.google.it, accessed on 23 February 2023)) and similar-product search in online shopping (e.g., Amazon (https://www.amazon.co.jp/, accessed on 23 February 2023) and eBay (https://www.ebay.com/, accessed on 23 February 2023)) are examples of such applications. Conventional CBIR methods have exhibited remarkable precision in retrieving images similar to the query image, as documented in several studies [4][5][6][7]. It is feasible to efficiently and effectively retrieve associated images from a vast database using a single input image. In the last decade, extensive research efforts have been devoted to developing novel theories and models for CBIR, establishing several practical algorithms [8]. However, as the volume of visual data on the web continues to increase, there is a growing imperative to consider users' subjective preferences during image retrieval to enhance the value of the retrieved data and satisfy the increasing demands of users.
For users, whether a retrieval result is appropriate cannot be judged solely by image content; it also depends on the user's preferences. This problem is a crucial challenge known as the semantic gap [9], which CBIR has faced for a long time. Since the query image input by a user is rich in detail and information, it can be challenging for the CBIR model to accurately localize which specific part of the image the user intends to retrieve similar results for [10]. Conventional CBIR methods compare the features of the query image with the features of images in the dataset (candidate images) and rank the candidate images according to their similarity to the query image. Fadaei et al. [11] proposed an approach to extract dominant color descriptors for CBIR; it uses a weighting mechanism that assigns more importance to the informative pixels of an image using suitable masks. However, such methods may fail to consistently rank images fitting the user's preferences highly. To achieve user-oriented image retrieval, one possible strategy is to re-order the images of the initial retrieval so that those satisfying the user's preferences obtain a higher position.
Re-ranking is an approach to re-order the results of initial retrieval using reliable information and typically plays a role in the post-processing step of image retrieval tasks [12]. Re-ranking methods can be classified into two categories based on the information source employed: self-re-ranking and interactive re-ranking [13]. Self-re-ranking approaches aim to improve the accuracy of initial retrieval results by identifying relevant visual patterns in them and re-ordering them based on external information, such as text labels or class information. For example, Zhang et al. [14] proposed a new method based on a global model for extracting features from the entire image and a local model for extracting features from individual regions of interest in the image. The results from these two models are then combined using a re-ranking approach, which significantly improves the accuracy of the retrieval results. Conversely, interactive re-ranking approaches use user feedback on the initial retrieval results as preference information for re-ordering [13]. Therefore, interactive re-ranking is expected to achieve higher user satisfaction by utilizing feedback reflecting users' preferences.
Recent studies [15][16][17] have shown that gaze information, which consists of human eye movements, plays a critical role in visual recognition in daily life and reflects shifts of attention. Gaze information is integral to non-verbal communication in interactions between humans and the natural world and is closely related to a user's preferences when viewing images [18]. Therefore, by introducing the user's gaze information as feedback for re-ranking the initial retrieval results, users can be expected to find their desired images at a higher rank. However, directly using gaze trace data in a coordinate format may not accurately capture the relationships between objects in an image, which could be problematic.
To tackle the abovementioned issue, we propose a novel CBIR method utilizing image captioning as a gaze-dependent image re-ranking method. Figure 1 illustrates the underlying concept of our proposed image re-ranking method. Our method leverages the interdisciplinary technology of image captioning, which enables machines to automatically produce natural language descriptions for any given image [19]. The proposed method entails developing a neural network that integrates images and gaze information to generate image captions controlled by gaze traces. The transformer is used in the proposed method because of its characteristics: in contrast to convolutional neural networks (CNNs), which rely on local connectivity, the transformer achieves a global representation through shallow layers and preserves spatial information [20]. Given the expectation that the transformer model is more closely aligned with human visual characteristics than CNNs [21], the proposed method trains a connected caption and gaze trace (CGT) model [22] on the basis of the transformer architecture. This enables the model to learn the intricate relationship between images, captions, and gaze traces. Our method attempts to present images that match the user's preferences at a higher rank using the gaze-dependent image caption.
With the introduction of image captioning, which connects image features and gaze information, the proposed method can explicitly express semantic information in images (e.g., relationships between objects that users focus on) to realize re-ranking that accurately reflects the user's preferences. Specifically, our method uses the Contrastive Language-Image Pre-Training (CLIP) model [23], extensively acknowledged as one of the most sophisticated cross-modal embedding techniques currently available. The primary objective of CLIP is to create a common latent space for computing the similarity between images and text. Therefore, we can compare the gaze-dependent image captions and the candidate images by embedding them in the latent space and ranking the images that reflect the user's preferences higher in the re-ordering process. Extensive experimentation demonstrates the remarkable performance of the proposed method in re-ranking for image retrieval on a publicly available dataset [24] of annotated images from the MSCOCO dataset [25].
In conclusion, the key contributions of our study are as follows.
• We propose a novel gaze-dependent re-ranking method for CBIR to tackle the "semantic gap" challenge of conventional CBIR methods.
• We introduce the gaze-dependent image caption to convert coordinate-format visual information into text-format semantic information, thereby realizing re-ranking according to the users' preferences.
Note that we have previously presented some preliminary results of the current work in a prior study [22], where we demonstrated the effectiveness of incorporating gaze-dependent image captioning for achieving personalized image retrieval. In this study, we build upon our previous work and extend it in the following ways. First, we expand on our previous study by utilizing gaze-dependent image captioning as auxiliary information to achieve user-oriented image re-ranking. Second, we evaluate the effectiveness of previous cross-modal retrieval methods and interactive re-ranking methods to validate the robustness of our proposed method. Finally, in our ablation study, we verify the novelty of our method by comparing re-ranking that incorporates gaze data directly in coordinate form against re-ranking that transforms the data into a text format, thereby effectively bridging the semantic gap in CBIR.
The remainder of this paper is structured as follows. Section 2 briefly overviews the related works. Section 3 presents a detailed description of our proposed gaze-dependent image re-ranking method for CBIR. The experimental results are presented in Section 4, where we provide qualitative and quantitative results of the proposed method. Section 5 discusses the implications of our findings and the limitations associated with our study. Finally, we conclude with a summary of our contributions in Section 6.

Semantic Gap in CBIR
The semantic gap is a significant challenge for CBIR models. It arises from the disparity between the low-level visual features extracted from images and the high-level semantic concepts commonly used to describe the content of images. In other words, because images are the input in CBIR, the ambiguity arising from the rich and complex content of a query image makes it challenging for the model to comprehend the inherent query intention embedded within it. For example, when retrieving similar images, discrepancies can appear between the similarity the user envisioned and the similarity the model perceived.
Recent research breakthroughs in deep learning [26][27][28][29] have opened up the possibility of solving the semantic gap problem in CBIR. One solution involves extracting subpatches from various regions of the query image and characterizing them using the deep features proposed by Razavian et al. [30]. By compressing the deep features of subpatches, a more representative patch-based similarity can be computed to bridge the semantic gap. Wang et al. [31] proposed a two-stage approach for CBIR that combines sparse representation and feature fusion techniques to enhance retrieval accuracy. In [32], the authors proposed a novel technique for CBIR that aggregates deep local features and generates compact global descriptors. Gong et al. [33] employed CNNs to extract deep characteristics from patches at different scales and areas, along with orderless pooling strategies. CNN-based approaches are typically preferred over classic SIFT-based or GIST-based approaches [34,35] for feature extraction in image retrieval because they capture features closely related to the semantic attributes of images. Although CNNs are effective at extracting features for CBIR, they typically analyze the entire content of an image, which can result in inaccuracies when representing and identifying the underlying query intention within the query image.
Despite significant progress in recent years, the semantic gap remains a challenging problem for CBIR. Ongoing study is required to improve retrieval performance and develop more effective techniques for connecting low-level visual features with high-level semantic concepts.

Image Caption with Visual Information
To improve retrieval performance, many studies [24,[36][37][38] have explored the introduction of various auxiliary information to generate image captions that are more in line with human habits and preferences, and visual information is one of them. He et al. [24] were dedicated to exploring the relevance between human attention and the discourse used in perception and text formation processes. They examined the mechanisms of attention allocation in the top-down soft attention method and proved the effectiveness of visual saliency for image captioning. In [38], a gazing sub-network was presented to estimate the gazing sequence from entities in a caption annotated by humans and then adopt the pointer network that automatically produces a similar description sequence imitating human visual habits.
Overall, the study of image captioning with visual information continues to evolve rapidly, and there is a growing interest in developing more sophisticated models that can generate captions that are not only accurate but also informative, diverse, and engaging.

Re-Ranking Method
In the domain of image retrieval, re-ranking is an essential component that entails the precise re-ordering of the initial retrieval results. Re-ranking is a crucial aspect of various retrieval tasks, including object retrieval, person re-identification, and CBIR, due to its ability to improve the accuracy and relevance of the retrieved results. Two distinct types of re-ranking approaches exist, namely self-re-ranking and interactive re-ranking.
Self-re-ranking is an automated technique that re-orders the initial retrieval results using data derived from the top-ranked images in the retrieval results. Certain retrieval approaches [39][40][41] employ text labels associated with top-ranked images as supplementary data to calculate the correlation between the query image and candidate images, thereby facilitating the re-ranking process. In contrast, Mandal et al. [42] utilized the class information of the top-ranked images during the re-ranking stage. These techniques can enhance retrieval efficacy without requiring input from users. However, in the absence of supplemental information to bridge the gap in comprehension between users and models, these methods still struggle to disambiguate inherently vague queries. In contrast to the above techniques, our method bridges the semantic gap by incorporating gaze information to anticipate the user's preference.
Interactive re-ranking is a technique that re-orders the initial retrieval results by interacting with users, typically through feedback mechanisms such as relevance feedback or users' preferences. Traditional interactive re-ranking techniques [43-47] entail users selecting images associated with the desired image from the initial retrieval results, after which the retrieved results are re-ordered based on the feedback obtained from users. In the fashion industry, novel approaches [48,49] have been proposed as alternatives for obtaining feedback, employing natural language to assess the top-ranked images. These approaches deviate from conventional feedback mechanisms and provide a fresh outlook on evaluating fashion items. Chen et al. [50] proposed a method that utilizes text feedback to improve image retrieval accuracy by incorporating multi-grained uncertainty regularization to handle the complex relationship between image and text features. Re-ranking can be performed using reinforcement learning or deep metric learning, depending on the feedback received in a natural language format. Tan et al. [51] introduced an additional technique for efficiently handling multiple iterations of user-provided natural language queries to reorganize the retrieved results. Generally, interactive re-ranking uses users' diverse feedback to obtain re-ranking information. Furthermore, to better explore user preferences and bridge the semantic gap in CBIR, what type of information is suitable as feedback while interacting with users remains a topic worthy of in-depth exploration and long-term research.

Gaze-Dependent Image Re-Ranking
The proposed method comprises a set of sequential procedures: content-based image retrieval, gaze-dependent image captioning, and re-ranking. Figure 2 provides an overview of our method. First, we rank the candidate images using a conventional CBIR method based on the query image. Then, gaze-dependent image captions are generated using the gaze traces corresponding to the images obtained in the initial retrieval as the user's preference information. The query image feature is also retained as a supplement to avoid accuracy degradation caused by missing information about the query image. Finally, cross-modal image retrieval (CMIR) is performed with the generated captions to obtain images that better fit the user's preference in the re-ranking step.

Initial Retrieval
The proposed method takes the query image I_q as the input to the conventional CBIR model and ranks candidate images I_n (n = 1, . . ., N; N represents the total number of candidate images) in the first step. As the proposed method necessitates the use of textual information to process captioning information during the re-ranking part, we expect the CBIR model to handle both image and text modalities naturally. Therefore, the proposed method focuses on cross-modal retrieval methods to perform the initial retrieval. Specifically, we introduce the CLIP model [23], which has been widely employed in contemporary image retrieval research and has exhibited significant retrieval precision.
The CLIP model employs a transformer-based encoder to create a shared latent space that facilitates the comparison of feature vectors obtained from both the query image and the candidate images. Furthermore, the initial results can be obtained by calculating their similarity s_{q,n} as shown in the following equation:

s_{q,n} = (v_q · v_n) / (‖v_q‖‖v_n‖),

where v_q and v_n represent the feature vectors of the query image and candidate images, respectively. The operator "·" denotes the dot product, which takes two vectors and returns a scalar value; it is defined as the product of the two vectors' magnitudes and the cosine of the angle between them, so s_{q,n} is the cosine similarity of the two feature vectors.
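As an illustration, the initial retrieval step can be sketched as follows, assuming the CLIP feature vectors have already been extracted; the function names are ours, not part of the original implementation.

```python
import numpy as np

def cosine_similarity(v_q: np.ndarray, v_n: np.ndarray) -> float:
    # s_{q,n}: dot product normalized by the two vectors' magnitudes.
    return float(np.dot(v_q, v_n) / (np.linalg.norm(v_q) * np.linalg.norm(v_n)))

def initial_ranking(v_q: np.ndarray, candidates: np.ndarray) -> list:
    # Return candidate indices sorted by descending similarity to the query.
    scores = [cosine_similarity(v_q, v_n) for v_n in candidates]
    return sorted(range(len(candidates)), key=lambda n: scores[n], reverse=True)
```

Because the similarity is normalized, the ranking depends only on the direction of the feature vectors, not their magnitudes.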

Gaze-Dependent Image Caption
To model image, captioning, and gaze information, we have constructed the CGT model, employing previous methods [22] to acquire knowledge of their interrelations. This model comprises three modules: an image encoder, a caption encoder-decoder, and a gaze trace encoder-decoder. As shown in Figure 3, each module is built on a transformer and receives c_v, c_w, and c_r as input, denoting the features of the image, caption, and gaze trace, respectively. The image encoder T_v is defined as follows:

T_v(c_v) = FFN(MHA(c_v, c_v, c_v)),

where FFN(·) refers to the feed-forward network with two ReLU-based linear transformation layers, based on a previous study [52]. Each module also employs multi-head attention (MHA) with query Q, value V, and key K to deal with the multi-modal input during training, defined as follows:

MHA(Q, K, V) = Concat(head_1, . . ., head_h) P_o,
head_i = Attention(Q P^Q_i, K P^K_i, V P^V_i),

where the parameter matrices P_o, P^Q_i, P^K_i, and P^V_i are the projections following the definition in [52]. As illustrated in Figure 3, it is worth noting that the outputs of the image encoder T_v have a significant impact on both the caption encoder-decoder T_w and the gaze trace encoder-decoder T_r. The CGT model is designed to effectively capture the interdependence between the image, caption, and gaze trace. To achieve this, our method employs a symmetrical structure to generate captions and gaze traces. The caption encoder-decoder T_w and the trace encoder-decoder T_r are defined as follows:

T_w(c_w) = FFN(MHA(c_w, T_v(c_v), T_v(c_v))),
T_r(c_r) = FFN(MHA(c_r, T_v(c_v), T_v(c_v))).

To train the CGT model, the total loss function L_all is defined as follows:

L_all = λ_1 L_r + λ_2 L_w + λ_3 L_sum + λ_4 L_rw.

In this total function, L_r evaluates the difference between the bounding boxes of the predicted and actual gaze traces, which affects the performance of gaze trace generation. Conversely, L_w is a cross-entropy loss that measures the dissimilarity between the human-generated and predicted captions. L_sum is computed as the sum of L_r and L_w. Furthermore, the model utilizes L_rw, a cycle-consistency loss that compares the output caption and trace. The weighting coefficients for these loss functions are represented as λ_*, and their values were selected based on the method proposed in a previous study [53].
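The multi-head attention block at the heart of each module can be sketched in minimal form as follows. This is an illustrative NumPy version of the standard formulation in [52], not the actual CGT training code; the per-head projection matrices are passed in explicitly.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, P_Q, P_K, P_V, P_o):
    # P_Q, P_K, P_V: lists of per-head projection matrices (d_model x d_k);
    # P_o: output projection (h * d_k x d_model), following the definitions in [52].
    heads = []
    for Pq, Pk, Pv in zip(P_Q, P_K, P_V):
        q, k, v = Q @ Pq, K @ Pk, V @ Pv
        attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # scaled dot-product attention
        heads.append(attn @ v)
    return np.concatenate(heads, axis=-1) @ P_o
```

Self-attention in the image encoder corresponds to calling this with Q = K = V = c_v, while the caption and trace decoders attend to the image encoder's output as K and V.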
To summarize, using a transformer-based architecture in our gaze-dependent image caption model offers several benefits. Since it is region-based, we can facilitate the extraction of even the slightest movement of the human gaze by dividing the image into smaller regions. Furthermore, by training on the interdependent relationship between gaze information and visual/captioning features, our method can generate comprehensive captions controlled by the gaze trace.

Re-Ranking Based on Vision Information
After the above processing, the proposed method converts coordinate-format gaze trace data into text-format semantic information using image captions, which is then used for re-ranking that matches user preferences. When selecting the feedback information for re-ranking, obtaining more feedback from users helps reflect their preferences. Conversely, too much feedback increases the burden on the user and, from a practical perspective, the computational complexity of the model. Considering these factors, the proposed method uses the gaze traces corresponding to a specific number of images (the top a images) from the initial retrieval results as feedback from the user. The re-ranking process is as follows. First, the top a images from the initial retrieval results and their corresponding gaze traces are input into the CGT model to generate the same number of gaze-dependent image captions. By inputting the generated captions into the CLIP model, we obtain the feature vectors v_1, v_2, v_3, . . ., v_a representing the user's preference information. Moreover, we consider the query image feature during the re-ranking process to prevent accuracy degradation caused by the absence of its information. Finally, the vector v_r used for re-ranking is obtained by averaging the feature vector v_q of the query image with the a feature vectors (v_1, v_2, v_3, . . ., v_a) extracted from the gaze-dependent image captions, as in the following equation:

v_r = (v_q + Σ_{i=1}^{a} v_i) / (a + 1).

Because the major components of the feature vector v_r represent user preference information based on the gaze information, using v_r for re-ranking makes it possible to rank images that reflect the user's preferences higher.
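The averaging and re-ordering steps described above can be sketched as follows, assuming the caption and candidate features have already been extracted with CLIP; the function names are ours.

```python
import numpy as np

def rerank_vector(v_q: np.ndarray, caption_feats: np.ndarray) -> np.ndarray:
    # Average the query feature with the a gaze-dependent caption features.
    return (v_q + caption_feats.sum(axis=0)) / (1 + len(caption_feats))

def rerank(v_r: np.ndarray, candidate_feats: np.ndarray) -> list:
    # Re-order candidate indices by descending cosine similarity to v_r.
    sims = candidate_feats @ v_r / (
        np.linalg.norm(candidate_feats, axis=1) * np.linalg.norm(v_r))
    return np.argsort(-sims).tolist()
```

Because the caption features dominate the average when a > 1, the re-ranked order is pulled toward candidates matching the user's gaze-derived preferences while still anchored to the query image.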

Dataset
In the experiments, we used pairs of images and human visual information from the HAIC dataset presented in [24]. The dataset consists of 4000 images randomly chosen from the MSCOCO [25] and Pascal-50S [54] datasets. Each image has been annotated with a human-generated caption and a corresponding gaze trace indicating the regions of the image that the annotators focused on. By training on this dataset, the CGT model can accurately learn the deep relationship between gazes, images, and captions, connecting image features and the gaze trace through captioning to explicitly express semantic information such as object relationships.
In our experiments, we divided the images in the HAIC dataset into training and test sets of 2000 images each. In the test phase, the 2000 test images are randomly divided into 1500 candidate images and 500 query images. Because each image in the HAIC dataset has a corresponding gaze trace and human-generated caption, we use the paired data of the same image separately during the test phase: the gaze traces of the candidate images are used as feedback information for re-ranking, whereas the human-generated captions of the query images are used as the ground truth to retrieve the desired images and to evaluate, by calculating the metrics, both the initial retrieval results and the re-ranking results.
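The split described above can be sketched as follows. This is a hypothetical helper; beyond the set sizes, the exact random partition used in the experiments is not specified.

```python
import random

def split_haic(image_ids, seed=0):
    # 4000 HAIC ids -> 2000 training / 2000 test, with the test set
    # further divided into 1500 candidate images and 500 query images.
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    train, test = ids[:2000], ids[2000:]
    candidates, queries = test[:1500], test[1500:]
    return train, candidates, queries
```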
Considering the balance between introducing sufficient gaze information and the calculation time of the network, we set a = 3 in this experiment, using the gaze traces of the top three images from the initial retrieval results to generate the gaze-dependent image captions as the feedback information for re-ranking. The values of λ_* in the loss function depend on the particular experiment and dataset. In this experiment, we set λ_1 = 0.5, λ_2 = 0.3, λ_3 = 0.1, and λ_4 = 0.1.
To assess the effectiveness of our method, we applied recent CMIR and re-ranking approaches in the evaluation. Specifically, we compared our method against other existing techniques, such as [45,46,55,56], to assess its adaptability. The comparative methods were implemented using the open-source code provided by their respective authors. The time cost is approximately 100 h for training and 70 s per retrieval; thus, reducing the computation cost of the retrieval phase remains a necessary consideration.

Evaluation Metrics
To accurately evaluate the proposed method, we adopt two extensively used evaluation metrics, namely Recall@k and NDCG@k. Specifically, we set k to 1, 5, 10, 50, and 100 to measure the performance at different retrieval depths. Higher values of Recall@k and NDCG@k indicate superior performance. Recall@k is an extensively used evaluation metric in machine learning and information retrieval. It quantifies the capacity of a model to recognize all pertinent instances in a given dataset. Recall@k is defined as follows:

Recall@k = n_q / N_d,

where n_q represents the number of desired images that appear in the top k retrieval results and N_d denotes the total number of desired images.
The desired images refer to the retrieval results obtained using the human-generated caption corresponding to the query images in our experiment. NDCG@k (normalized discounted cumulative gain) is an evaluation metric commonly used in image retrieval to measure the effectiveness of a ranked list of items. It measures the quality of the recommendations by assigning a score between 0 and 1 based on how well the recommended items match the user's preferences. NDCG considers each recommended item's relevance and position in the list of recommendations [57]. The relevance score of each item is usually determined by the user's feedback (e.g., ratings, clicks, purchases) on the items. NDCG is calculated by taking the discounted cumulative gain (DCG) of the ranked list and normalizing it by the ideal DCG (IDCG). Specifically, NDCG@k is defined as follows:

NDCG@k = DCG@k / IDCG@k,

where DCG is calculated by summing the relevance scores of the ranked items, discounted by their position in the list, and IDCG is the maximum possible DCG for the list, calculated by assuming that the items are ranked in order of decreasing relevance. Specifically, DCG and IDCG are defined as follows:

DCG@k = Σ_{i=1}^{k} rel_i / log_2(i + 1),
IDCG@k = Σ_{i=1}^{k} rel*_i / log_2(i + 1),

where rel_i denotes the relevance score of the ith item in the desired image list and rel*_i denotes the relevance scores sorted in decreasing order.
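Both metrics can be sketched as follows. These are the standard definitions, given as a minimal illustration rather than the exact evaluation script used in the experiments.

```python
import math

def recall_at_k(ranked, desired, k):
    # Fraction of desired items that appear in the top-k results.
    n_q = len(set(ranked[:k]) & set(desired))
    return n_q / len(desired)

def dcg_at_k(rels, k):
    # Relevance at 1-indexed position i is discounted by log2(i + 1).
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    # Normalize by the DCG of the ideal (decreasing-relevance) ordering.
    idcg = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / idcg if idcg > 0 else 0.0
```

For instance, a ranking that places a relevant item below an irrelevant one gets an NDCG strictly between 0 and 1, while a perfectly ordered list scores 1.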

Comparison with State-of-the-Art Methods
To establish the adaptability of the proposed method, we conducted a comparative analysis with current state-of-the-art techniques such as those presented in [55,56]. The evaluation was performed using the HAIC dataset for both initial retrieval and re-ranking, and we present the corresponding results. Specifically, Tables 1 and 2 list the assessment results for Recall@k and NDCG@k, respectively. In the evaluation process, we first calculated the initial results on the basis of a CMIR model (e.g., CLIP), and then the proposed method was employed to rearrange the initial retrieval results. The results in Tables 1 and 2 illustrate that, after re-ordering using the proposed method, all re-ranking results outperform the initial retrieval results obtained from their respective baseline (BL) methods. This means that the proposed re-ranking method can improve the performance of the initial retrieval results calculated by the BL methods in CMIR. These findings demonstrate the efficacy of the proposed re-ranking method in enhancing CMIR performance. Examples of retrieval results are depicted in Figure 4a,b. To better evaluate the retrieval results qualitatively, instead of using the whole desired image list as the ground truth as in the quantitative evaluation, we set the first image of the desired image list as the relevant image and observed the rank of this image in the re-ranking results. In Figure 4a, the proposed method re-ordered an image from the third position in the initial retrieval to the first position, the same rank as the ground truth of the same image. Furthermore, in Figure 4b, we find that the ground-truth image was ranked low in the initial retrieval (not in the top 3) but was re-ordered to a higher rank after being processed by the proposed method. Finally, we present some generated gaze-dependent image captions together with the original images overlaid with the corresponding gaze traces in Figure 5. As shown in Figure 5, the gaze controls the description order of the generated caption. For example, in Figure 5a, the gaze goes through the sky, forest, and grass, and the caption follows this order to describe the image. Based on this qualitative assessment, it is verified that the proposed method can enhance the overall retrieval performance beyond its initial state. Furthermore, the efficacy of the re-ranking technique that incorporates eye gaze data is substantiated.
Table 1. Comparison of Recall@k between the proposed method and state-of-the-art methods. By applying the proposed method to multiple advanced cross-modal image retrieval methods, the general applicability of the proposed method for improving retrieval performance can be analyzed. Bold numbers indicate the best results.

Comparison with Re-Ranking Methods
In this subsection, we perform an empirical analysis to assess the effectiveness of the proposed method compared with conventional re-ranking methods. Note that the proposed method uses a unique interactive re-ranking strategy, which converts coordinate-based gaze trace data into semantic information represented in a textual format using image captions. This distinguishes it from conventional interactive re-ranking methods. Considering this situation, it can be challenging to directly compare our proposed and conventional methods, as reported in [58]. We therefore evaluate our method against the methods presented in [10,45,46], which can be assessed under the same conditions. Conventional techniques that obtain relevance feedback from users concerning the highest-ranked images are employed for the evaluation. Initially, we computed the initial retrieval based on the conventional CMIR model, CLIP [23]. Subsequently, we re-ordered the retrieval results using both the proposed method and the conventional re-ranking methods. Finally, we quantitatively compared the re-ranking results. Note that we utilized the hyper-parameters of the conventional interactive re-ranking methods that maximize the average retrieval efficacy.
Tables 3 and 4 show the experimental results. The BL column of Tables 3 and 4 presents the retrieval performance of the conventional CMIR approaches. These tables show that the proposed method outperforms the conventional re-ranking methods in retrieval effectiveness. Note that the conventional methods cannot significantly improve on the BL retrieval performance, likely because they fail to capture the inherent query intention inside the query images. In contrast, our method uses gaze-information feedback to perform re-ranking that accurately reflects user preferences. These results confirm the effectiveness of our method. Table 3. Experimental results of Recall@k for interactive re-ranking methods. Comparing the evaluation metrics of the proposed method and multiple re-ranking methods quantifies the improvement in retrieval performance on the same task. The bold number in the table is the best result.
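For reference, the two metrics reported in these tables can be computed per query as follows. This is a generic sketch of the standard definitions with binary relevance, using hypothetical helper names; it is not the paper's evaluation code.

```python
import numpy as np

def recall_at_k(ranked, relevant, k):
    """Recall@k for one query: 1 if any relevant item is in the top-k, else 0."""
    return float(any(item in relevant for item in ranked[:k]))

def ndcg_at_k(ranked, relevant, k):
    """NDCG@k for one query with binary relevance (gain 1 for relevant items)."""
    gains = [1.0 if item in relevant else 0.0 for item in ranked[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    # Ideal DCG: all relevant items placed at the top of the list.
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0
```

Averaging these per-query values over the test set yields the Recall@k and NDCG@k figures of the kind reported in the tables.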

Ablation Study
In this subsection, we examine the influence of transforming the gaze trace data. The experiments above demonstrate the versatility of our method; however, its central novelty still requires justification, namely, introducing the gaze trace as re-ranking feedback and transforming it from coordinate data into a semantic format through image captioning. This requires a comparative method that uses only the raw gaze data in coordinate form for re-ranking. However, few relevant studies exist, making an effective comparison difficult. Therefore, as depicted in Figure 6, the ablation method directly segments the regions of the images where the gaze dwelled, according to the gaze data of the most focused regions (approximately five regions), and calculates the average feature vectors of these slices for re-ranking. Tables 5 and 6 show the experimental results. According to these tables, our method outperforms the ablation methods. Generating gaze-dependent image captions is thus a practical way to realize re-ranking that accurately reflects user preferences.

Discussion
The proposed gaze-dependent re-ranking model improves the ability to bridge the semantic gap. In this section, we address the constraints of the current model and potential directions for future research.

Limitation
The images that match the user's preferences are ranked higher in the re-ranking results produced by the proposed method. However, the image at the top of the re-ranked list is still not always the desired image. Furthermore, in this study, the experiments were conducted only under the condition a = 3, and the efficacy of the proposed method under a broader range of settings was not investigated. Lastly, as the proposed method establishes the relationship between images, gazes, and captions via image captioning, the dataset used in the experiments must contain data for all three modalities. The current public datasets of gaze-image pairs suitable for this purpose are small, and the lack of training data leaves room for further improvement in model performance. Some indications of this limitation can be found in the results shown in Figure 5. Although the generated captions describe the content of the user's attention in detail, they can suffer from low text quality, such as repeated use of the same sentence pattern (as in Figure 5c) or simple stacking of nouns (as in Figure 5d), causing redundancy that may affect the re-ranking performance.

Future Works
In future research, we intend to enhance the robustness of the proposed method, raising the positions of desired images in the re-ranking results and ensuring the accuracy of the top-ranked results. Moreover, further experiments are needed to find the proper value of a that balances the adequate introduction of gaze information against the efficiency of the model. We also need to expand the dataset, for example by predicting gaze traces for ego-vision images or videos with human annotations, or by applying data augmentation [59] to existing saliency datasets. With sufficient training data, better performance in complex and changing environments is expected. Furthermore, while the primary focus of this paper is a novel retrieval framework using gaze-dependent captioning, evaluating the generated captions is an essential component of this application. Because caption quality directly impacts retrieval performance, quantitative evaluation of the captioning should be conducted in future research to further enhance the model's performance. In practical applications, a user may also perform retrieval repeatedly, so multi-time re-ranking with the proposed method will be part of our future work. Finally, it is important to consider cases where images are distorted or have low contrast; in such cases, image restoration techniques may be beneficial as a pre-processing step to ensure image clarity and obtain accurate gaze information.

Conclusions
In this article, we have proposed a novel re-ranking method for CBIR that addresses the semantic gap problem. We introduced gaze trace information as user feedback to predict user preferences during image retrieval. Moreover, we transform the gaze data into semantic information using an image captioning method to help the model better comprehend the inherent query intention inside the query image. Specifically, the proposed method first generates gaze-dependent image captions based on the user's gaze traces over the images obtained in the initial retrieval, treating them as the user's preference information. Next, image retrieval is performed again using the generated captions to obtain results that better reflect the user's preferences. The experimental results show that the proposed method effectively improves image retrieval accuracy by re-searching images using gaze trace information.

Figure 1. Concept of our method for improving conventional CBIR methods. Conventional methods face the problem of accurately localizing which specific part of the image the user intends to retrieve. Our method attempts to rank images that match the user's preference higher using the gaze-dependent image caption.

Figure 2. An overview of our method. We use gaze traces associated with the initial retrieval results as feedback to generate gaze-dependent image captions. Moreover, we comprehensively use feature information from the gaze-dependent image captions and the query images for re-ranking.

Figure 3. An overview of our connected caption and gaze trace (CGT) model. The model includes three transformer-based modules to learn the relevance of image, caption, and gaze information. Note that the outputs of the image module T_v affect both the caption module T_w and the gaze trace module T_r to deliver the image features.

Figure 4. Retrieval results achieved using the proposed method. For the qualitative evaluation, we set the first image in the ground truth as the relevant image and marked it with a red box. (a,b) are the re-ranking results for "kitchen" and "traffic lights", respectively. We observed the rank of these images in the re-ranking results and highlighted them with a yellow frame.

Figure 5. Examples of gaze-dependent image captions generated by the proposed method.

Figure 6. Concept diagram of the ablation study.

Table 2. Comparison of NDCG@k between the proposed method and state-of-the-art methods. The bold numbers in the table are the best results.

Table 4. Experimental results of NDCG@k for interactive re-ranking methods. The bold numbers in the table are the best results.

Table 5. Experimental results of Recall@k for verifying the novelty of our method, which transforms the gaze trace from coordinate data into semantic information. Note that "Ablation" denotes re-ranking based directly on the regions of the images where the gaze dwelled, according to the corresponding gaze data and without image captions. The bold number in the table is the best result.

Table 6. Experimental results of NDCG@k for verifying the novelty of our method. The bold numbers in the table are the best results.