Article

Enhancing Retrieval-Oriented Twin-Tower Models with Advanced Interaction and Ranking-Optimized Loss Functions

School of Economics and Management, Xi’an University of Technology, Xi’an 710054, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(9), 1796; https://doi.org/10.3390/electronics14091796
Submission received: 25 March 2025 / Revised: 24 April 2025 / Accepted: 25 April 2025 / Published: 28 April 2025
(This article belongs to the Section Computer Science & Engineering)

Abstract

This paper presents an optimized twin-tower model for text retrieval that addresses limitations in traditional models through improved feature interaction and loss function design. We introduce an early interaction layer using cross-attention mechanisms and a ranking-optimized loss function. These innovations enable earlier feature interactions between queries and documents, enhance semantic relationship understanding, and optimize relative similarity rankings while reducing overfitting risk. Our experiments on NQ, TQA, and WQ datasets show substantial Top-K accuracy improvements over benchmark models like BM25, DPR, ANCE, and ColBERT. For example, our model achieves a 20.3% relative improvement in Top-20 accuracy on NQ compared to BM25, with only 17 ms retrieval latency. Ablation studies confirm the effectiveness of our improvements. This research demonstrates that enhancing feature interaction and optimizing loss functions significantly improves twin-tower model performance, providing valuable methodological insights for efficient semantic retrieval while maintaining computational efficiency.

1. Introduction

Open-domain question answering systems rely on effective retrieval components to identify relevant text spans from large corpora for answer extraction. In such systems, given a factual question (e.g., “Who was the first voice of Elsa in Frozen?”), the retrieval module must efficiently filter a massive corpus (split into M text segments) to return a small subset C’ containing potential answer spans. This extractive Q&A paradigm underscores the critical need for retrieval models that balance accuracy with computational efficiency.
Recent advancements in dense retrieval models, particularly dual-tower architectures like DPR [1] and ANCE [2], have achieved success by mapping queries and documents into low-dimensional semantic spaces. However, three fundamental limitations persist. First, these models compute similarity scores through separate query/document encodings, failing to leverage early semantic alignment critical for resolving syntactic variations or contextual ambiguities [3]. Second, traditional loss functions prioritize maximizing absolute similarity scores for positive pairs, causing overfitting to specific distance metrics and poor generalization on noisy data. Third, while late-interaction models like ColBERT [4] improve relationship modeling through token-level matching, their excessive computational demands hinder real-time deployment. Hybrid approaches like MORES [5] partially address efficiency but lack mechanisms to inject query-specific signals during document encoding.
To address these limitations, we introduce an improved twin-tower architecture featuring two significant innovations: (1) a lightweight module enabling query–document feature fusion at lower computational layers, enhancing semantic alignment while preserving efficiency (17 ms latency for 1k candidates); and (2) a temperature-scaled objective prioritizing relative similarity rankings over absolute values, reducing metric dependency and improving noise robustness. Our model is evaluated in an extractive Q&A framework where retrieval accuracy and speed directly impact end-task performance. As shown in Figure 1, the architecture integrates cross-attention mechanisms during document encoding using document-derived pseudo-queries, capturing nuanced query–document relationships earlier than traditional dual-tower models. Unlike ColBERT’s costly token matching, our design maintains scalability for large corpora.
Extensive experiments on NQ, TQA, and WQ benchmarks demonstrate significant Top-K accuracy improvements over BM25, DPR, ANCE, and ColBERT. Ablation studies confirm the synergistic effects of our components: early interaction improves relationship modeling by 9.2% MRR, while the ranking loss enhances robustness to label noise by 14.6% in F1. This work advances retrieval systems by reconciling three traditionally conflicting objectives: semantic richness (surpassing DPR/ANCE), computational efficiency (outperforming ColBERT), and generalization capability. The code is available at the following link: https://github.com/shengxia-web/twin-tower-model (accessed on 17 April 2025).

2. Related Work

2.1. Traditional Retrieval Methods

Text retrieval has traditionally relied on statistical models such as Term Frequency-Inverse Document Frequency (TF-IDF) and Okapi BM25. These models are grounded in the vector space model (VSM) and Boolean model, representing documents and queries as high-dimensional sparse vectors. TF-IDF, building on the IDF measure introduced by Spärck Jones, quantifies the importance of a term in a document relative to a corpus. The TF component measures the frequency of a term within a document, while IDF adjusts for terms that are common across the corpus. The TF component for term t in document d is computed as follows:
$$TF_{t,d} = \frac{n_{t,d}}{\sum_{t' \in d} n_{t',d}}$$
where $n_{t,d}$ denotes the number of occurrences of the word t in document d, and $\sum_{t' \in d} n_{t',d}$ denotes the total number of words in document d.
IDF is the inverse of a term’s frequency of occurrence across the document collection: it is calculated by dividing the total number of documents by the number of documents containing the term and then taking the logarithm. The purpose of IDF is to reduce the weight of common words and increase the weight of distinctive words in order to better differentiate between documents. The formula for calculating IDF is as follows:
$$IDF_{t,D} = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$
where $|D|$ denotes the total number of documents in the document collection, and $|\{d \in D : t \in d\}|$ denotes the number of documents that contain the word t.
Finally, the TF-IDF value of the word in the current document is computed by multiplying the TF of the word with its IDF. The higher the TF-IDF value, the higher the importance of the word in the current document, and the higher the uniqueness of the word in the whole document set. By calculating the TF-IDF values of all words, the TF-IDF vector of the document can be obtained to measure the relevance of the document to the query.
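To make the computation concrete, the following is a minimal Python sketch of the TF, IDF, and TF-IDF formulas above, assuming documents are already tokenized into lists of words (tokenization details are omitted):

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # TF: occurrences of `term` divided by the total number of tokens in the document
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    # IDF: log of (total documents / documents containing `term`);
    # returns 0 for terms absent from the corpus to avoid division by zero
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing) if n_containing else 0.0

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [["frozen", "elsa", "voice"], ["climate", "change"], ["elsa", "frozen", "film"]]
print(tf_idf("voice", corpus[0], corpus))  # "voice" is rare in the corpus, so it scores high
```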
The BM25 model is an improvement on the TF-IDF algorithm. When computing term frequency, BM25 takes into account the document length relative to the average document length, since term occurrences in short documents contribute differently to relevance. In addition, BM25 introduces the hyperparameters k and b, which control term-frequency saturation and length normalization, respectively. For the inverse document frequency, BM25 employs a smoothing method: the number of documents containing the term is subtracted from the total document count in the numerator, and 0.5 is added to both the numerator and the denominator.
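The smoothed BM25 score can be sketched in a few lines of Python; the saturation parameter k = 1.5 and length-normalization parameter b = 0.75 are common defaults used here only for illustration:

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, corpus, k=1.5, b=0.75):
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    freqs = Counter(doc_tokens)
    score = 0.0
    for t in query_tokens:
        n_t = sum(1 for d in corpus if t in d)
        # Smoothed IDF: subtract the containing-document count from the total
        # and add 0.5 to numerator and denominator, as described above.
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)
        f = freqs[t]
        # k saturates term frequency; b penalizes documents longer than average.
        score += idf * f * (k + 1) / (f + k * (1 - b + b * len(doc_tokens) / avgdl))
    return score
```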

2.2. Sparse Feature Methods

Traditional sparse retrieval models like BM25 face two fundamental limitations: reliance on exact term matching and the unrealistic term independence assumption. While these methods remain computationally efficient, their inability to capture semantic relationships restricts performance in contextual understanding tasks. Recent advancements address these issues through four key methodological enhancements.
A critical innovation involves incorporating spatial relationships between query terms. By weighing terms based on their positional proximity within documents, models can better infer semantic relatedness. For instance, adjacent occurrences of “climate change” and “carbon emissions” receive higher relevance scores than distant mentions, even with identical term frequencies. This spatial-aware scoring refines traditional TF-IDF frameworks by encoding implicit contextual signals.
Another enhancement comes from smoothing techniques, which aim to mitigate the term independence assumption prevalent in traditional models. The Dirichlet Prior Language Model exemplifies this approach by integrating a smoothing mechanism that combines the term distribution within a document with the background distribution from the entire corpus. This method adjusts the relevance score calculation by considering both the document-specific term frequencies and the corpus-wide term probabilities, thus providing a more balanced and contextually aware measure of term importance. A smoothing parameter allows fine-tuned control over the weight given to each component. The relevance score between query q and document d is computed as follows:
$$\mathrm{Relevance}(q,d) = \sum_{t \in q} \log\left(\frac{|d| \cdot P(t \mid d) + \mu \cdot P(t \mid C)}{|d| + \mu}\right)$$
where $P(t \mid d)$ is the term’s relative frequency in document d, $P(t \mid C)$ is the term’s relative frequency in the corpus, $|d|$ is the document length, and $\mu$ is a smoothing parameter.
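A minimal sketch of this Dirichlet-smoothed score, assuming precomputed corpus-wide term counts and that every query term occurs at least once in the corpus (otherwise the logarithm is undefined):

```python
import math
from collections import Counter

def dirichlet_relevance(query_tokens, doc_tokens, corpus_counts, corpus_len, mu=2000.0):
    # Interpolates document term frequency with corpus statistics:
    # (|d| * P(t|d) + mu * P(t|C)) / (|d| + mu)
    d_len = len(doc_tokens)
    doc_counts = Counter(doc_tokens)
    score = 0.0
    for t in query_tokens:
        p_td = doc_counts[t] / d_len          # P(t|d)
        p_tc = corpus_counts[t] / corpus_len  # P(t|C), assumed nonzero
        score += math.log((d_len * p_td + mu * p_tc) / (d_len + mu))
    return score

corpus = [["frozen", "elsa", "voice"], ["climate", "change", "emissions"]]
corpus_counts = Counter(t for doc in corpus for t in doc)
corpus_len = sum(len(doc) for doc in corpus)
print(dirichlet_relevance(["elsa"], corpus[0], corpus_counts, corpus_len))
```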
Relevance feedback mechanisms enable iterative query–document interaction. Rocchio’s algorithm adaptively shifts query vectors toward relevant documents and away from non-relevant ones in the term space. Pseudo-relevance feedback extends this by automatically expanding queries with salient terms from top-ranked documents (e.g., adding “animated film” to the original query “Frozen Elsa voice actor”). Such expansion captures latent semantic connections absent in initial queries.
Hybrid architectures synergize multiple sparse methods to overcome individual weaknesses. A BM25 + Dirichlet ensemble, for instance, combines BM25’s precision on high-frequency terms with the Dirichlet model’s robustness for low-frequency terms. Weighted aggregation of their scores yields more stable rankings across diverse document types, as evidenced by 12.7% higher MAP scores on TREC Robust04 compared to standalone models.
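Reusing the bm25_score and dirichlet_relevance helpers sketched above, such an ensemble reduces to a weighted combination of the two scores. The mixing weight alpha is hypothetical and would be tuned on held-out data; in practice the two scores should also be normalized to a common scale first:

```python
def hybrid_score(query_tokens, doc_tokens, corpus, corpus_counts, corpus_len, alpha=0.6):
    # Convex combination of BM25 (strong on high-frequency terms) and the
    # Dirichlet LM (robust on low-frequency terms); alpha is illustrative.
    s_bm25 = bm25_score(query_tokens, doc_tokens, corpus)
    s_dir = dirichlet_relevance(query_tokens, doc_tokens, corpus_counts, corpus_len)
    return alpha * s_bm25 + (1 - alpha) * s_dir
```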
These innovations collectively push sparse methods beyond lexical matching: proximity modeling injects positional semantics, smoothing handles term dependencies, dynamic refinement bridges vocabulary gaps, and ensembles ensure cross-method complementarity. While still lacking deep semantic understanding, modern sparse approaches achieve 89% of neural model accuracy on factoid retrieval tasks with 230× faster throughput, making them indispensable for latency-sensitive applications.

2.3. Dense Retrieval Methods

With the development of artificial intelligence and natural language processing technology, dense retrieval has gradually become a hot spot for research and application. Compared with sparse vectors, dense vectors have fewer dimensions, but each dimension is continuous and dense, and relevance information is obtained by calculating the similarity between these vectors. In the field of dense retrieval, there are several main model architectures: dual-encoder, cross-encoder, and late-interaction encoder, as shown in Figure 2.

2.3.1. Dual-Encoder

The dual-encoder architecture is a state-of-the-art information retrieval technique that uses two pre-trained language models to process the query and the document separately, generating their respective representation vectors. The core advantage of this approach is that it allows the system to process a large number of documents in an offline, highly parallelized manner: once documents are encoded as vectors, they can be quickly compared with any query vector. The relevance measure is usually achieved by computing the dot product (or another similarity measure) between the query and document vectors, a simple and efficient approach that makes the dual-encoder architecture particularly suitable for application scenarios requiring fast retrieval over large-scale datasets. DPR (Dense Passage Retrieval) [6] is a classic example of the dual-encoder architecture; it was one of the first to be proposed and achieved significant results in open-domain question answering tasks.
Researchers have proposed several extensions to the dual-encoder architecture that aim to enhance it through more complete training objectives, richer datasets, and more sophisticated modeling techniques. For example, RocketQA [7] recognizes that existing datasets contain a large number of annotation errors and noise, which, if left unaddressed, will seriously affect the learning and retrieval performance of the model. RocketQA therefore employs several measures to improve the training data and optimize the training process. First, RocketQA significantly increases the number of negative samples per query during training by introducing cross-batch negatives. Second, RocketQA denoises hard negative samples, obtaining more reliable training signals by identifying and excluding highly ranked passages that are likely mislabeled as negatives. Finally, RocketQA employs a data augmentation strategy that applies a high-accuracy cross-encoder to label large-scale unsupervised data, thus taking full advantage of the richness of the unlabeled corpus and generating a larger number of training samples.

2.3.2. Cross-Encoder

Cross-encoder models [8] address the interaction limitations of dual-encoder models by jointly processing queries and documents. This allows for deep contextual understanding through self-attention mechanisms. The cross-encoder architecture, exemplified by models like BERT, fine-tunes pre-trained language models to predict the relevance between a query and a document. The relevance score is typically computed as follows:
$$\mathrm{Relevance}(q,d) = \mathrm{BERT}_{\mathrm{CLS}}([q; d])$$
where $[q; d]$ denotes the concatenation of query q and document d, and $\mathrm{BERT}_{\mathrm{CLS}}$ represents the output of the [CLS] token in the BERT model. While cross-encoders achieve high precision, their computational complexity makes them impractical for large-scale retrieval tasks. A hybrid approach, where sparse retrieval is used for initial candidate selection followed by cross-encoder re-ranking, has emerged as a practical solution.
Nogueira et al. first applied cross-encoders to the MS MARCO corpus via BERT fine-tuning, demonstrating their power in capturing query–document relationships. Yet cross-encoders suffer from high computational complexity: in large corpora, per-document forward propagation makes them infeasible for real-world use [9]. A hybrid retrieval strategy is therefore widely adopted: BM25-like methods first retrieve a candidate set, and cross-encoder re-ranking then refines the results. This approach combines the speed of sparse retrieval with the semantic precision of cross-encoders, optimizing both scalability and semantic fidelity and resolving the speed–depth trade-off in large-scale text retrieval.
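The re-ranking stage of this hybrid strategy can be sketched with the Hugging Face transformers API; the checkpoint name below is a placeholder, since in practice one would load a sequence-classification model fine-tuned for ranking (e.g., on MS MARCO):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint: a real system would use a ranking model
# fine-tuned on relevance data rather than the raw pre-trained weights.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)

def rerank(query, candidates):
    # Jointly encode each (query, document) pair; the CLS representation
    # feeds a regression head that yields one relevance score per pair.
    inputs = tokenizer([query] * len(candidates), candidates,
                       padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        scores = model(**inputs).logits.squeeze(-1)
    return [candidates[i] for i in scores.argsort(descending=True).tolist()]
```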

2.3.3. Late-Encoder

Unlike conventional dual encoders and cross-encoders, ColBERT [10] introduces a late-interaction paradigm for estimating the relevance between a query and a document. In this paradigm, queries and documents are encoded as sets of contextualized token embeddings, and the relevance between the two sets is evaluated by a cheap, easily optimized operation that avoids a detailed evaluation of each potential candidate document. Compared to existing BERT-based models, ColBERT is more than 170 times faster and outperforms all non-BERT-based models [11]. On a server equipped with four GPUs, ColBERT can index more than 8 million passages of the MS MARCO dataset in about 3 h, while requiring only a few tens of gigabytes of storage space [12]. The similarity score between query q and document d is computed as follows:
$$\mathrm{ColBERT}(q,d) = \sum_{i=1}^{|q|} \max_{j=1}^{|d|} \mathrm{sim}(q_i, d_j)$$
where $\mathrm{sim}(q_i, d_j)$ is the similarity between the embedding of the i-th query token and that of the j-th document token.
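The MaxSim operator above is straightforward to express in PyTorch; a sketch assuming L2-normalized token embeddings, so that dot products equal cosine similarities:

```python
import torch
import torch.nn.functional as F

def colbert_maxsim(q_emb, d_emb):
    # q_emb: (|q|, dim) query token embeddings; d_emb: (|d|, dim) document
    # token embeddings. For each query token take its best-matching document
    # token, then sum over query tokens.
    sim = q_emb @ d_emb.T               # (|q|, |d|) token-level similarities
    return sim.max(dim=1).values.sum()

q = F.normalize(torch.randn(5, 128), dim=-1)
d = F.normalize(torch.randn(40, 128), dim=-1)
print(colbert_maxsim(q, d))
```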
MORES, introduced by MacAvaney et al. [11], decouples document representation into offline and online components to enhance efficiency. It precomputes term representations offline and computes relevance scores online using attention-based similarity scoring. MORES [13] consists of three Transformer modules: the document representation module, the query representation module, and the interaction module. The document representation module generates embedding vectors for each document token using self-attention; the query representation module accepts the query tokens as input and outputs representation vectors for each query token; and the interaction module shifts attention from the query representation to the document representation, generating matching signals and predicting relevance by aggregating the query token representations through self-attention [14]. MORES enhances retrieval efficiency by decoupling the Transformer architecture into two specialized stages: offline document representation (precomputing embeddings for corpus documents) and online query–document interaction (computing relevance scores during inference) [15]. This decomposition strategically shifts computationally intensive representation tasks to offline preprocessing, enabling scalable deployment while reserving lightweight interaction operations (e.g., attention-based similarity scoring) for real-time query processing. Unlike exhaustive retrieval systems, MORES operates on a candidate subset—typically the top 100,000 documents retrieved by BM25—thereby reducing computational overhead. By prioritizing neural re-ranking for these high-potential candidates rather than the entire corpus, the framework balances precision and scalability, leveraging both sparse retrieval speed and cross-encoders’ semantic depth.

2.4. Retrieval-Augmented Generation (RAG)

Large-scale pretrained language models struggle with knowledge-intensive tasks due to two inherent limitations: prohibitive computational costs for knowledge materialization, and poor explainability/controllability in generation [5]. Retrieval-Augmented Generation (RAG) addresses these challenges by synergizing external knowledge retrieval with conditional text generation.
To address these challenges, RAG retrieves relevant information from external knowledge bases through a combination of retrieval and generation and incorporates this information into the answer-generation process to improve the accuracy and richness of the answers. The development of RAG technology can be summarized in three phases: ordinary RAG, advanced RAG, and modular RAG [16]. The ordinary RAG phase defines the basic RAG process, including index construction, similarity retrieval, and answer generation based on retrieval results. Modular RAG further extends the functionality of RAG by introducing modules such as query search engines and multi-answer fusion, which make the RAG schema more sophisticated through technology fusion and modular design.
RAG has emerged as a promising approach that combines retrieval and generation. Drawing from the research by Tay et al. [17], our model is positioned as a hybrid approach that integrates learnable patterns through cross-attention with the fixed patterns of twin-tower encoding. This design combines the strengths of both approaches, achieving a balance between semantic richness and computational efficiency. In terms of efficiency trade-offs, our early interaction layer reduces computational complexity by minimizing reliance on specific distance metrics. Energy-efficient scheduling algorithms, such as those leveraging genetic algorithms for real-time task allocation in edge-computing systems [18], offer critical insights into optimizing computational workflows. These methods dynamically balance latency and energy consumption—a principle analogous to RAG’s dual objectives of semantic fidelity and retrieval efficiency.
RAG technology is an advanced framework that integrates retrieval and generative models to enhance the efficiency and quality of information processing and content creation. RAG technology is engineered to process diverse multimodal inputs—including text, code, images, audio, video, 3D models, and scientific knowledge—to address the complexity of user-initiated tasks spanning heterogeneous scenarios. The versatility of RAG arises from its capacity to adapt to cross-domain demands, where user-submitted tasks often involve dynamic combinations of these modalities, necessitating robust frameworks for semantic alignment and contextual integration. By prioritizing adaptability over rigid modality-specific constraints, RAG ensures scalability in handling the multifaceted requirements of real-world applications. It converts and fuses different modalities to prepare data for subsequent retrieval and generation processes. The retriever encodes input data into a suitable format for retrieval, mapping it to a feature space where similar data have similar representations. It employs various retrieval techniques, like sparse and dense retrieval. Sparse retrieval focuses on keyword matching for text data, while dense retrieval uses deep learning models to capture semantic features for complex data like images and audio. The generator incorporates multiple model architectures, including Transformer, LSTM, Diffusion models, and GANs. These models are chosen and combined based on task requirements. For example, Transformer excels in text generation, while Diffusion models and GANs are effective for image generation. The generator produces outputs in multiple modalities, including text, scientific knowledge, audio, code, video, 3D models, images, and knowledge. It combines retrieved data and generative capabilities to deliver accurate and user-friendly results. The task flow of the RAG technology is shown in Figure 3.
Ghali et al. [13] proposed GTR, which combines generative AI and semantic search to handle data with high accuracy and minimal fine-tuning. However, GTR has high computational requirements. Our model offers competitive performance with lower computational overhead, enabling more scalable real-time applications. Our early interaction layer also provides a more nuanced semantic understanding, and our loss function reduces overfitting risks. Yu et al. [19] introduced MvCR, a multi-view contrastive learning framework. MvCR improves sample discrimination but demands substantial computational resources for data augmentation and multi-view processing. Our model achieves comparable or better results with less computational cost. Moreover, our early interaction layer captures deeper semantic relationships, and our loss function gives it stronger generalization across datasets. Zheng et al. [20] presented BERT-QE for query expansion. It selects text chunks to enhance BERT-based ranking. However, it focuses mainly on query expansion and lacks continuous query–document interaction. Our model’s architecture enables such interaction for a more integrated semantic understanding. Also, BERT-QE is resource intensive, while our model is more efficient, with faster retrieval and lower memory use. In terms of Top-K accuracy, a crucial metric for search applications, our model outperforms BERT-QE. Table 1 presents a comparative analysis summarizing the advantages of our model.

3. Optimization Algorithms for Twin-Tower Model

Due to the shortcomings of the two-tower model [21,22] in query–document feature interaction and loss function design, the optimization algorithm based on the two-tower model proposed in this study effectively improves the model’s performance in text retrieval tasks by introducing an advanced interaction layer and optimizing the loss function [23]. The model comprises two parallel neural networks (towers) that independently encode queries and documents into dense vectors. Unlike traditional dual-encoders, our model introduces early interaction via cross-attention to align query–document semantics at lower computational layers.
Twin-Tower Model: a dual-encoder architecture where queries and documents are encoded independently into dense vectors. Unlike traditional models, our variant introduces early interaction through cross-attention to align semantic features at lower computational layers.
Cross-Attention Mechanism: a dynamic feature interaction module that computes attention weights between document tokens and query-aware pseudo-queries, enabling context-aware document representations.
Ranking-Optimized Loss: a contrastive learning objective prioritizing relative similarity rankings over absolute scores, reducing metric dependency and improving noise robustness.

3.1. Advanced Interaction of the Twin-Tower Model

Traditional two-tower models [24] interact with features only after the query and document are encoded, which means the model cannot utilize early information, limiting its understanding of semantic relationships. To address this problem, this section proposes an early interaction layer that introduces a lightweight query-learning module on the document side to enable the interaction of query and document features at an early stage of the model. Initially, the document text is transformed into word vectors using a pre-trained language model (e.g., BERT) [25]; these vectors contain semantic information and form the basis for subsequent feature interactions. Subsequently, the word vectors are fed into a Transformer model to obtain contextual information [26], allowing the self-attention mechanism to capture semantic relationships between words and generate a document representation. Next, hidden states and document word vectors from the Transformer model are used to generate a simulated query representation through cross-attention, which is based on document content and serves as query features for subsequent interactions. Finally, the simulated query representation is concatenated with the document word vectors to create a query-aware document vector that integrates document content and simulated query information, offering a more comprehensive expression of document semantic features.
The introduction of simulated query representations enriches document feature descriptions and enhances the model’s understanding of query-document relationships [27]. By enabling early interaction between query and document features, the advanced interaction layer can effectively capture semantic relationships using early document information. This approach improves the accuracy of the model in text retrieval tasks. The structure of the specific model is shown in Figure 4.

3.1.1. Embedding Layer

The embedding layer is the basis of the twin-tower model and serves to convert natural language text into a machine-processable numerical form, i.e., word vectors. This step is crucial because machines cannot understand text directly and need it converted into vector form to perform subsequent text processing tasks. The embedding layer used in this section is based on the pre-trained language model BERT [28]. The BERT model learns a distributed representation of words by training a neural network so that semantically similar words are closer together in the vector space, allowing better capture of the relationships and meanings between words.
Chinese text processing requires explicit segmentation techniques to decompose sentences into discrete words or phrases, as the absence of inherent word delimiters (e.g., spaces) complicates tokenization. To address this, a lexer-based segmentation module is applied as a preprocessing step, parsing raw document text into word-level sequences compatible with BERT-style models. This tokenization ensures structural alignment with the subword units used during model training, enabling robust semantic representation despite the linguistic challenges of unsegmented input. Assuming the document text is D, word segmentation yields the sequence $D = \{w_1, w_2, \ldots, w_n\}$.
Then, the word sequence is fed into the BERT model, which generates a high-dimensional vector, i.e., a word vector, for each word. These word vectors contain semantic information about the words, such as lexical properties, lexical meaning, and context. Let the output of the BERT model be V, where $V(w_i)$ denotes the vector representation of the word $w_i$; the BERT model generates a high-dimensional vector $V(w_i) \in \mathbb{R}^d$ for each word.
The document vector, synthesized by aggregating word embeddings across the text, encapsulates the semantic content of the document and serves as the foundational representation for downstream feature interaction tasks. This vectorization process combines individual word-level embeddings into a unified, high-dimensional encoding, preserving contextual and compositional meaning while enabling efficient intra-document relational analysis. All the word vectors in the document are concatenated to obtain the vector representation of the document $D_{vec} \in \mathbb{R}^{n \times d}$:
$$D_{vec} = [V(w_1), V(w_2), \ldots, V(w_n)]^T$$
where d denotes the dimension of the word vectors and $(\cdot)^T$ denotes the matrix transpose.
The output V of the BERT model is a function that maps the input text D to the word vector space:

$$V: \mathbb{R}^{|vocab| \times T} \rightarrow \mathbb{R}^{d \times T}$$

where $|vocab|$ denotes the size of the vocabulary and T denotes the number of words in the sentence.
$$E_{pp} = \mathrm{Encoder}(doc)$$

$$E_{pq} = \mathrm{Encoder}(query)$$

where $E_{pp}$ serves as the embedded representation of the document and $E_{pq}$ serves as the embedded representation of the query.

3.1.2. Query Learning Layer

The query learning layer in the advanced interaction module of the twin-tower model is pivotal. It transforms document features into query-like representations, enabling early feature interaction. This process introduces query features at an early stage, enhancing the model’s capacity to capture subtle semantic relationships between queries and documents. The query learning layer mainly consists of a multi-layer Transformer structure, whose input is the pre-processed document embedding vector $E_{pp}$ and whose output is a vector feature $H_n$ that simulates the query. The internal mechanism of the query learning layer is shown in Figure 5.
The query learning layer adopts a multi-layer Transformer structure, in which the encoder and decoder are responsible for encoding the document content and decoding to generate simulated queries, respectively. The encoder receives as input the document embedding vector $E_{pp}$ and outputs a series of hidden states $H_1, H_2, \ldots, H_n$. The decoder likewise contains multi-layer Transformer blocks with a structure similar to the encoder’s.
$$H_i = \mathrm{Transformer}(H_{i-1}, E_{pp})$$

where Transformer denotes a Transformer block containing multi-head self-attention, a feed-forward network, and residual connections.
The working mechanism of this query constructor can be described in detail by the following equation:
$$H_n = \mathrm{softmax}\left(\frac{W_n^Q H_{n-1} \left(W_n^K E_{pp}\right)^T}{\sqrt{d_{model}}}\right) W_n^V E_{pp}$$

where $H_n$ represents the output query vector features, computed via cross-attention from the hidden state $H_{n-1}$ of the previous layer and the document features $E_{pp}$. $H_0$ represents the initial hidden state, which is usually initialized using special tokens. $W_n^*$ denotes the learnable parameters of each layer, and $\sqrt{d_{model}}$ is a scaling factor intended to temper the large values that can occur during the inner-product computation, which is especially important when the embedding dimensionality is high. At each layer, the query reconstructor can generate a simulated query based on the content of the document without direct knowledge of the query. This not only demonstrates the model’s ability to understand the meaning of documents but also shows the potential of deep learning models for information retrieval. At the same time, the module is designed to be lightweight, aiming to achieve efficient query reconstruction with minimal computational overhead, thus optimizing the performance and resource utilization of the whole system.
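A minimal PyTorch sketch of one such layer, using nn.MultiheadAttention for the cross-attention in the equation above. Tensor shapes are batch-first, and the head count is an assumption, since the paper specifies six heads only for the early interaction module:

```python
import torch
import torch.nn as nn

class QueryReconstructorLayer(nn.Module):
    # One decoder layer of the query learning module: queries come from the
    # previous hidden state H_{n-1}; keys/values come from the document features E_pp.
    def __init__(self, d_model=768, n_heads=6):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h_prev, e_pp):
        # h_prev: (B, Lq, d); e_pp: (B, Ld, d)
        attn_out, _ = self.cross_attn(h_prev, e_pp, e_pp)
        return self.norm(h_prev + attn_out)  # residual connection

layer = QueryReconstructorLayer()
h0 = torch.zeros(2, 8, 768)      # H_0: special-token initialization in the paper
e_pp = torch.randn(2, 128, 768)  # document features from the encoder
h1 = layer(h0, e_pp)             # simulated query features
```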

3.1.3. Early Interaction Layer

Traditional dual-tower models interact with features only after encoding queries and documents separately, which limits their ability to capture semantic relationships due to the lack of early information utilization. To address this, we introduce a lightweight query-learning module on the document side. This module generates pseudo-query representations based on document content, enabling early interaction between query and document features. Our cross-attention mechanism is built upon the multi-head attention framework proposed by Vaswani et al. [29], utilizing scaled dot-product attention as its foundation. Specifically, the cross-attention layer takes both document embedding vectors and pseudo-query vectors as inputs. It calculates attention weights between document features and query features, which are then used to adjust the document features to align with the query context. This approach allows the model to capture semantic relationships at an earlier stage, thereby enhancing retrieval accuracy. Moreover, our lightweight query-learning module extends the original self-attention design, enabling early query-document interaction while maintaining computational efficiency. By introducing query awareness at an earlier point in the pipeline, the model can more effectively prioritize semantically relevant document features concerning the query, potentially improving retrieval accuracy and relevance ranking in information retrieval systems. It is likewise a multi-layer Transformer structure whose inputs are the document vector $E_{pp}$ and the pseudo-query vector $H_n$, and whose output is the query-aware document vector $K_{pp}$:
$$K_{pp} = \mathrm{concat}(E_{pp}, H_n)$$

where concat denotes the vector concatenation operation that merges two vectors into one extended vector, $E_{pp}$ denotes the embedding vector of the document, and $H_n$ denotes the pseudo-query vector generated by the query learning layer.
Ultimately, the output of the advanced interactor is a query-aware document vector. This vector contains the semantic information of the document itself and also incorporates the feature information of the query. It can be used in subsequent retrieval and generation processes, e.g., for similarity computation with the query vector, or as input for generating responses. The query-aware document vector K p p generated by the advanced interactor can be used in subsequent retrieval and generation processes:
$$S(q,p) = \mathrm{sim}(E_{pq}, K_{pp})$$

where $S(q,p)$ denotes the similarity score between query q and document p, $E_{pq}$ denotes the embedding vector of the query, $K_{pp}$ denotes the query-aware document vector, and sim denotes the similarity computation function, e.g., cosine similarity.
To address the need for deeper technical specificity, we elaborate on the cross-attention mechanism’s architectural configuration. The early interaction layer employs a multi-head cross-attention module (MHA) with six attention heads, each operating on 64-dimensional query/key/value vectors, consistent with the BERT base model’s hidden size (768 dimensions, split into 12 heads in standard BERT; our design adapts this to six heads for computational efficiency while retaining semantic richness). The module follows the standard Transformer block structure, incorporating residual connections and layer normalization to stabilize training.
The query-aware document vector is derived by first projecting the document embedding E p p and the pseudo-query vector H n into query (Q), key (K), and value (V) matrices:
$$Q = W_Q \cdot E_{pp}$$

$$K = W_K \cdot H_n$$

$$V = W_V \cdot H_n$$

where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$ are learnable projection matrices, and $d_k = 64$ is the per-head dimension.
These attention weights then modulate the document vectors to produce an enhanced representation that incorporates query semantics. The resulting output is a query-aware document vector D q , which integrates both the original document features and the query-specific information captured through the cross-attention process. The cross-attention mechanism is formalized as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where Q is the query matrix, K is the key matrix, and V is the value matrix. The dimension of the keys and queries is $d_k$. The attention weights are computed by the dot product of the query and key matrices, scaled by $\sqrt{d_k}$, and then applied to the value matrix to produce the final output. This operation captures fine-grained semantic alignments between query and document tokens.
The output of cross-attention is concatenated with the original document embedding to preserve low-level features, followed by layer normalization and a feed-forward network with ReLU activation:
$$\mathrm{FFN}(x) = \mathrm{ReLU}(xW_1 + b_1)W_2 + b_2$$

where $W_1 \in \mathbb{R}^{d \times 256}$ and $W_2 \in \mathbb{R}^{256 \times d}$ are feed-forward weights. The final query-aware document vector is as follows:
$$K_{pp} = \mathrm{LayerNorm}(\mathrm{Attention}(Q, K, V) + E_{pp})$$
The query-aware document vector $D_q$ is computed as follows:

$$D_q = \mathrm{Concat}(D, \mathrm{Attention}(Q, K_D, V_D))$$

where D is the original document vector, $K_D$ and $V_D$ are the key and value matrices derived from the document vector, and Q is the pseudo-query vector generated by the query learning layer.
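Putting the attention, feed-forward, and residual-normalization equations together, the following is a sketch of the early interaction layer in PyTorch. Dimensions follow this subsection (d = 768, six heads, a 256-dimensional feed-forward layer); since the text describes both a concatenation and a residual form of the output, this sketch follows the residual form of $K_{pp}$, and other details are assumptions:

```python
import torch
import torch.nn as nn

class EarlyInteractionLayer(nn.Module):
    # Cross-attention from document tokens to pseudo-query features, followed
    # by residual + LayerNorm and a ReLU feed-forward block.
    def __init__(self, d_model=768, n_heads=6, d_ff=256):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, e_pp, h_n):
        # e_pp: (B, Ld, d) document embeddings; h_n: (B, Lq, d) pseudo-query.
        # Queries are projected from e_pp, keys/values from h_n, as in the
        # equations above.
        attn_out, _ = self.cross_attn(e_pp, h_n, h_n)
        k_pp = self.norm1(attn_out + e_pp)        # K_pp = LayerNorm(Attn + E_pp)
        return self.norm2(k_pp + self.ffn(k_pp))  # query-aware document vectors
```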

3.2. Optimization of Loss Function

Traditional dual-tower models in text retrieval often employ simplistic loss functions such as cross-entropy, which can lead to overfitting and excessive dependence on specific distance metrics. To address these limitations, we propose a novel loss function that focuses on optimizing the relative similarity ranking between positive and negative sample pairs. This design is grounded in the principles of contrastive learning, emphasizing relative similarity rankings over absolute values [17]. Our loss function ensures that positive sample pairs exhibit significantly higher similarity than negative sample pairs by minimizing the log probability of incorrect rankings. This not only improves retrieval accuracy and efficiency but also reduces the risk of overfitting by decreasing the model’s reliance on specific distance metrics. Furthermore, our loss function design is consistent with the “low-rank methods” discussed in [17], enhancing the model’s generalization and robustness by mitigating overfitting risks.
The construction of this loss function involves several key steps. First, query and document word vectors are formulated as positive and negative sample pairs. The similarity between these pairs is then computed using either the dot product or cosine similarity. An exponential function transforms these similarity scores, which are combined with a loss function coefficient; this coefficient regulates the loss function’s sensitivity to the relative distance between positive and negative samples. The primary objective of this loss function is to ensure that positive sample pairs exhibit significantly higher similarity than negative sample pairs, thereby enhancing retrieval accuracy and efficiency. By focusing on optimizing the relative similarity ranking rather than merely pursuing absolute similarity values, the risk of overfitting is reduced, and the model’s dependence on specific distance metrics is diminished. The optimization process of the proposed loss function is visually represented in Figure 6. In this figure, green squares denote positive sample pairs whose similarity should be maximized, while white squares represent negative sample pairs whose similarity should be minimized. White squares with slashes indicate self-similarities that are excluded from the calculation. This visual representation aids in understanding how the loss function operates to improve model performance across diverse text data types, ultimately enhancing the robustness of the retrieval system.
The ranking loss function is defined as follows:
$$L_{rank} = -\log\left(\frac{e^{S_{Pos}/T}}{e^{S_{Pos}/T} + \sum_{i=1}^{N} e^{S_{Neg,i}/T}}\right)$$

where $S_{Pos}$ is the similarity score of the positive sample pair, $S_{Neg,i}$ are the similarity scores of the N negative sample pairs, and $T = 0.5$ is a temperature parameter that controls the steepness of the softmax function. This formulation is a variant of the softmax-based contrastive loss, prioritizing the ranking of positive samples above negatives by minimizing the negative log-probability assigned to the positive pair. This loss function ensures that positive sample pairs exhibit significantly higher similarity than negative sample pairs, thereby enhancing retrieval accuracy and efficiency.
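Because $L_{rank}$ is a temperature-scaled softmax over one positive and N negatives, it reduces to a single cross-entropy call in PyTorch; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def ranking_loss(s_pos, s_neg, temperature=0.5):
    # s_pos: (B,) similarity of each positive pair;
    # s_neg: (B, N) similarities of the N negative pairs per query.
    # Placing the positive at index 0 makes -log softmax[0] exactly L_rank.
    logits = torch.cat([s_pos.unsqueeze(1), s_neg], dim=1) / temperature
    targets = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)

s_pos = torch.randn(4)
s_neg = torch.randn(4, 15)
print(ranking_loss(s_pos, s_neg))
```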
For the query learning layer, since its optimization goal is different from the other modules, it needs to output vectors that are as similar as possible to the actual query, rather than only similar to the feature vectors of the query. Therefore, this module also needs an independent loss function to guide parameter learning.
The query learning layer’s loss function is defined as follows:
$$L_{query} = \mathrm{MSE}(Q_{sim}, Q_{real})$$

where $Q_{sim}$ is the simulated query vector generated by the query learning layer, and $Q_{real}$ is the actual query vector. The mean squared error (MSE) loss ensures that the simulated query vector closely approximates the real query vector.
In addition, because the simulated query should match the actual query at the token level rather than only in feature space, the query learning layer can also be guided by a token-level reconstruction loss:

$$L_r = -\sum_{w_t \in q} y_{w_t} \log\left(W \cdot E_p(t)\right)$$

where $|q|$ is the length of the query sequence, W is the learnable parameter that maps the reconstructor output $E_p(\cdot)$ to the dimension of the vocabulary, and $y_{w_t}$ represents the one-hot value of the t-th token of the real sequence. The final loss over all sample pairs is then defined as follows:
$$L_{final} = \sum_{(a,b) \in P_{negative}} \log\left(1 + \exp\left(S_{a,b}/\gamma\right)\right) + \sum_{(c,d) \in P_{positive}} \log\left(1 + \exp\left(-S_{c,d}/\gamma\right)\right)$$

where $L_{final}$ denotes the final loss function, $P_{negative}$ denotes the set of all negative sample pairs, $P_{positive}$ denotes the set of all positive sample pairs, $S_{a,b}$ and $S_{c,d}$ denote the similarity scores of negative and positive sample pairs, respectively, and $\gamma$ is a hyper-parameter used to control the weight between the two loss terms.
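One plausible reading of how these objectives are combined in training, using the λ = 0.8 weighting reported in Section 4.3 (the exact combination scheme is an assumption) and reusing ranking_loss from the sketch above:

```python
import torch.nn.functional as F

def combined_loss(s_pos, s_neg, q_sim, q_real, lam=0.8):
    # lam weights the ranking objective against the query-reconstruction
    # objective; lam = 0.8 was selected by grid search in Section 4.3.
    l_rank = ranking_loss(s_pos, s_neg)
    l_query = F.mse_loss(q_sim, q_real)
    return lam * l_rank + (1 - lam) * l_query
```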

4. Experiment

4.1. Datasets

Three datasets are used for the evaluation in this research: the NQ dataset [30], the TQA dataset [31], and the WQ dataset [32]. The NQ (Natural Questions) dataset, published by Google, is used in information retrieval and question answering tasks. It contains a series of natural language queries and documents related to these queries; the NQ training set contains more than 100,000 query–document pairs. TQA (TriviaQA) is a widely used dataset for Q&A tasks, covering general knowledge questions and corresponding answers in multiple domains. The questions and answers are derived from real textual sources, including encyclopedias, books, and websites. The TQA dataset is distinctive for the diversity of its question types and the wide range of its answer sources, spanning domains from history to science to culture. The WQ (WebQuestions) dataset is used for question answering tasks. Its questions come from real web search queries typically asked by human users in search engines and cover a variety of domains and topics. Each question has one or more possible answers, which are usually entities, entity attributes, or short descriptive information. MS MARCO Passage Ranking is a benchmark dataset developed by Microsoft for information retrieval and machine reading comprehension, containing millions of real-world user queries and web passages to train and evaluate ranking models; it is used here to validate the range of applications of our model. The dataset information is shown in Table 2.

4.2. Evaluation Metrics

TOP-K accuracy is one of the key metrics for measuring the retrieval effectiveness of a model, and it is used to evaluate how often the model can find at least one relevant document inside the first K search results it returns for all queries. This metric is particularly important because it directly reflects the quality of experience a user is likely to obtain when using a search engine or recommender system. In practice, users often expect the system to provide relevant information quickly and accurately, so the ability to find relevant information in the first K results becomes a core indicator of the model’s performance. TOP-K accuracy is calculated as the percentage of queries in which the retrieval system successfully retrieves at least one relevant document in the first K results among all queries. Specifically, it can be expressed as follows:
$$\mathrm{TOP\text{-}K\ Accuracy} = \frac{Q_{K\text{-}related}}{Q_{total}}$$

where $Q_{K\text{-}related}$ denotes the number of queries for which at least one relevant document appears among the first K results, and $Q_{total}$ denotes the total number of queries. This metric prioritizes the retrieval of at least one relevant document within top-ranked results, reflecting user-centric efficiency in scenarios where rapid access to specific answers or information is critical. By minimizing the need to sift through extraneous content, the metric aligns with real-world applications such as question answering or targeted fact retrieval, where users prioritize precision and immediacy over exhaustive result sets. Such optimization ensures that systems reduce cognitive load and latency, catering to practical demands for expedient, actionable outcomes.
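The metric itself is a few lines of Python; a sketch assuming a ranked result list and a set of relevant document IDs per query:

```python
def top_k_accuracy(ranked_results, relevant_sets, k=20):
    # Fraction of queries with at least one relevant document in the top K.
    hits = sum(1 for ranked, relevant in zip(ranked_results, relevant_sets)
               if any(doc in relevant for doc in ranked[:k]))
    return hits / len(ranked_results)

ranked = [["d3", "d7", "d1"], ["d2", "d9", "d4"]]
relevant = [{"d7"}, {"d5"}]
print(top_k_accuracy(ranked, relevant, k=2))  # 0.5: only the first query hits
```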

4.3. Training Setups

All the experiments in this research were conducted in a tightly controlled setup. The hardware environment uses a server with an NVIDIA RTX 4090 GPU with 24 GB of video memory. The server is also equipped with a 10-core Intel Core i5-12600KF processor and 128 GB of RAM. This hardware configuration ensures computational efficiency when working with large-scale datasets. On the software side, we chose Ubuntu 20.04 LTS as the operating system to ensure stability and extensive support. All experiments were conducted in a Python 3.8 environment, utilizing PyTorch 1.12.0 as the deep learning framework for its dynamic computation graphs and ease of debugging. We also used the CUDA 11.0 and cuDNN 8.0.4 libraries to optimize GPU computational performance. The experiments use the BERT base model as well as the RoBERTa model as the initial encoder, which is compared against the state-of-the-art baseline models DPR, ANCE, and ColBERT, as well as the traditional BM25.
Ranking-Optimized Loss ( L r a n k ): prioritizes relative similarity rankings between positive and negative pairs (Equation (21)). Query Reconstruction Loss ( L q u e r y ): ensures alignment between pseudo-queries and real queries via MSE (Equation (22)). λ = 0.8 ensures ranking optimization dominates while preserving query–document alignment. This balance was determined through grid search on validation sets, where λ ∈ {0.6, 0.7, 0.8, 0.9} yielded optimal performance at λ = 0.8.
The model architecture was initialized using pre-trained BERT weights, leveraging the rich semantic understanding captured during pre-training. This initialization strategy significantly reduced the amount of training data required and accelerated convergence. The training process is controlled by the following key parameters as shown in Table 3.

4.4. Comparative Study

Empirical evaluation across three datasets demonstrates the superior performance of the proposed algorithm over benchmark methods, as evidenced by TOP-K accuracy metrics (Table 4, Figure 7). Key observations include the following: (1) the improved algorithm consistently achieves the highest accuracy across all K values, highlighting its robustness in prioritizing relevant results; (2) performance gaps widen with smaller K, underscoring enhanced precision in critical top-ranked retrievals; and (3) gains persist across heterogeneous datasets, validating generalizability. These results confirm the algorithm’s efficacy in balancing recall and precision, with visual trends in Figure 7 further illustrating its dominance in early-ranking scenarios.
Among all the baseline algorithms, BM25 has the worst results on all the datasets, which indicates the importance of deep learning in text retrieval. DPR, as the pioneer of dense retrieval, still performs well on these metrics, but it holds no advantage over the other dense retrieval models that improve upon it. Our model, on the other hand, shows an advantage over the baseline BM25 in all cases. On the Top-20 metric, our model outperforms BM25 by 20.3% on the NQ dataset, 11.5% on TQA, and 25.9% on WQ. This shows that our model has a significant advantage over traditional algorithms in understanding natural language queries.
It is worth noting that ColBERT, a late-interaction model, achieves an overall advantage over both DPR and ANCE, which are purely two-tower models. This confirms that richer query–document interaction lets the model learn more information and, thus, improves retrieval accuracy. However, ColBERT’s drawback is also obvious: it cannot be directly adapted to existing retrieval frameworks, as reflected in Table 5. Among the deep models, our model is highly competitive, demonstrating excellent performance on every dataset and evaluation metric compared to DPR and ANCE, which also use a two-tower architecture.
In comparing our model with ColBERT on the NQ dataset, we observe a slight performance edge in favor of ColBERT, which can be attributed to its late interaction mechanism that allows for fine-grained term matching. However, this advantage comes at the cost of significantly higher computational overhead, with ColBERT requiring 50 ms for retrieving 1k candidate sets and consuming 137 GB of memory. In contrast, our model achieves competitive accuracy with a retrieval latency of just 17 ms for 1k candidates and a memory footprint of 25.6 GB, underscoring its suitability for real-time applications. While ColBERT excels in capturing detailed term interactions, our model’s early interaction layers enable a more holistic semantic understanding by integrating query context into document representations at an earlier stage. This design choice not only enhances retrieval accuracy but also maintains efficiency, making our approach a more balanced solution for scenarios where both speed and precision are critical. Looking forward, we propose exploring hybrid architectures that combine the strengths of both models, potentially integrating late interaction elements with our early interaction framework to further refine performance on datasets like NQ.
Furthermore, our model demonstrates competitive performance against ColBERT, outperforming it on all datasets except NQ, where it slightly underperforms; this result tentatively suggests that the modules we have added and the improvements we have made to the loss function are indeed effective. Overall, these results show that our model provides consistent and significant performance improvements over the pure two-tower model across different types of datasets, proving that the algorithm proposed in this paper is reasonable and effective.
According to the results shown in Table 5 and Figure 8, there are notable differences in retrieval speed and resource usage among the various retrievers. Specifically, our model and DPR demonstrate similar performance regarding retrieval latency and resource consumption. Both achieve retrieval times of 19 ms for 1k candidate sets and 25 ms for 100k candidate sets, with a resource usage of 25.6 GB. This similarity is expected because our two-tower model architecture is largely based on DPR, with improvements focused solely on the document side. While the added modules in our model may slightly slow down document encoding compared to a pure two-tower model, this process can be efficiently handled offline. The encoded document representations are subsequently stored within a specialized vector database. During the online serving phase, the system requires only the encoding of the incoming query followed by a nearest-neighbor search operation between the resulting query vector and the pre-encoded document vectors maintained in the database. These procedural steps mirror those employed in the DPR framework, thereby preserving the fundamental advantage inherent to the two-tower architectural paradigm: the capacity to perform document indexing offline in a manner that does not incur additional latency during the retrieval phase. This design ensures minimal impact on system performance while maintaining the efficiency benefits of decoupled encoding processes.
In contrast, ColBERT’s retrieval latency is 50 ms for 1k candidate sets and 83 ms for 100k candidate sets, with a resource footprint of 137 GB. This is significantly higher than that of our model and DPR. ColBERT’s inability to directly leverage pre-existing indexing frameworks results in slower retrieval speeds and higher resource consumption. ColBERT may outperform the other two-tower models in terms of accuracy for some retrieval tasks, but its inefficiency and high resource demands are evident. The cross-encoder model exhibits a substantial increase in retrieval latency as the number of candidate sets grows. At 1k candidates, the retrieval latency is 2.4 s, which escalates to 4 min at 100k candidates. This is consistent with the expectation that the cross-encoder must encode each candidate document individually, leading to a linear increase in processing time with the number of candidates. While the cross-encoder’s performance may be acceptable for small-scale retrieval tasks, its efficiency degrades significantly as the retrieval size increases. This degradation is primarily due to the increased computational complexity associated with larger retrieval sets. In summary, our model achieves comparable retrieval speed and resource consumption to DPR while maintaining strong performance on the TOP-K metric. Our improvements on the document side do not compromise the efficiency of the tower architecture, allowing us to retain the benefits of offline indexing and fast online retrieval.
The comparative results reveal critical insights into the efficiency-accuracy trade-offs among retrieval paradigms. While ColBERT achieves marginally higher Top-20 accuracy on NQ, its computational overhead far exceeds our model’s. This efficiency stems from the early interaction layer, which integrates query-aware semantics during document encoding via cross-attention, eliminating ColBERT’s costly token-level late interactions. By pre-aligning document representations with potential query intents, our model mimics human-like retrieval intuition, where documents are indexed with anticipated relevance signals. This design avoids the need for exhaustive token matching at inference time, preserving efficiency without sacrificing semantic depth.
The ranking-optimized loss further differentiates our model from DPR and ANCE. On WQ, the full model reaches 76.8% Top-20 accuracy, highlighting its ability to generalize in low-resource settings. Traditional losses prioritize absolute similarity scores and thus overfit specific distance metrics; ours instead emphasizes the relative ranking of positive and negative pairs, guided by contrastive learning principles [28]. This reduces sensitivity to noise and label sparsity, which is particularly beneficial for ambiguous queries in TQA and WQ. The temperature parameter, for instance, sharpens the similarity distribution and forces the model to distinguish hard negatives.
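The following sketch shows one standard way to realize such a temperature-scaled, ranking-oriented objective, an InfoNCE-style formulation; the paper's exact loss may differ in detail, so treat this as an illustration of the principle rather than the authors' definition.

```python
# A hedged sketch of a temperature-scaled contrastive ranking loss: the
# softmax cross-entropy rewards the *relative* ordering of the positive
# over the negatives, not the absolute similarity of the positive pair.
import torch
import torch.nn.functional as F

def ranking_loss(q, d_pos, d_negs, tau=0.05):
    """q: (B, dim) queries; d_pos: (B, dim) positives; d_negs: (B, N, dim)."""
    pos = (q * d_pos).sum(-1, keepdim=True)            # (B, 1) positive scores
    neg = torch.einsum("bd,bnd->bn", q, d_negs)        # (B, N) negative scores
    logits = torch.cat([pos, neg], dim=1) / tau        # small tau sharpens the
                                                       # distribution over hard negatives
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive at index 0
    return F.cross_entropy(logits, labels)

loss = ranking_loss(torch.randn(8, 768), torch.randn(8, 768), torch.randn(8, 4, 768))
print(loss.item())
```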

4.5. Ablation Study

In this section, we examine the effectiveness of our two proposed algorithmic improvements through an ablation experiment designed to verify how each improvement contributes to performance, individually and in combination. Specifically, we compare different model configurations on two established datasets, NQ and WQ. The experimental results are shown in Table 6 and Figure 9, demonstrating the impact of each component. The DPR model serves as the baseline, representing retrieval performance under the original setup. Our improved variants target different directions: Ours1 applies only the optimized loss function, aiming to guide the model toward a finer-grained capture of the query–document relationship, while Ours2 adds only the early interaction module, strengthening the model's sensitivity to and discrimination of features.
Compared with the baseline DPR model, Ours1 improves retrieval accuracy at both Top-20 and Top-100, with the gains most evident on the NQ dataset. This suggests that the improved loss function optimizes the learning process and strengthens the model's ability to capture relevant information. Ours2 delivers a larger improvement than Ours1 across all evaluation metrics on both datasets. In particular, on the WQ dataset, Ours2's gain in Top-100 accuracy over DPR is the most prominent, indicating that the early interaction module provides the model with more effective information-processing capability.
The ablation study demonstrates the complementary roles of the early interaction layer and the ranking-optimized loss. Ours1 improves Top-20 accuracy on NQ by 0.8 points over DPR, validating the loss function's capacity to improve ranking robustness; without early interaction, however, gains are bounded by insufficient semantic alignment. Ours2 achieves a larger improvement because the cross-attention mechanism enriches document representations with query context, yet its continued reliance on absolute similarity metrics risks overfitting, as reflected in the remaining gap to the full model. The full model combines both components, reaching 80.2% Top-20 on NQ: the early interaction layer generates query-aware document vectors, and the loss function ensures those vectors are discriminative in ranking. The synergy is particularly pronounced on WQ, where shorter queries demand deeper semantic alignment. In a query like "8th Dalai Lama birthplace", for example, the interaction layer captures contextual cues while the loss function prioritizes those cues over superficial lexical matches. Early interaction reduces information decoupling between the towers, and ranking optimization resembles curriculum learning in gradually focusing on hard negatives. Together they balance semantic fidelity and computational efficiency, addressing key limitations of prior work.
To further demonstrate the generalizability of our optimized twin-tower model, we conducted comprehensive experiments on the MS MARCO Passage Ranking dataset, which serves as a large-scale benchmark for document retrieval tasks. This dataset is composed of approximately 8.8 million passages and 560,000 queries, all derived from real-world Bing search logs. The diversity and complexity of user queries within this dataset make it an ideal benchmark for evaluating a model’s capability to handle the nuanced and varied nature of real-world search scenarios. The queries in this dataset encompass a wide range of topics and intents, from specific factual questions to more ambiguous and context-dependent inquiries, thus providing a rigorous test of a model’s robustness and adaptability.
In evaluating our model’s performance on this task, we employed two standard metrics: Mean Reciprocal Rank at 10 (MRR@10) and Recall@50. MRR@10 averages, over all queries, the reciprocal rank of the first relevant document appearing within the top 10 results; it is particularly sensitive to the model's ability to place highly relevant documents near the top of the ranked list, making it a stringent measure of early precision. Recall@50 measures the proportion of queries for which at least one relevant document is retrieved within the top 50 results, providing a broader view of how consistently the model surfaces relevant information across a wider range of results.
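Both metrics are straightforward to compute from per-query relevance lists, as the short sketch below illustrates; the helper functions are ours, not part of the paper.

```python
# Direct implementations of the two metrics as defined above, assuming each
# query's results are given as a ranked list of booleans (True = relevant).
def mrr_at_10(ranked_relevance):
    total = 0.0
    for ranks in ranked_relevance:
        for i, rel in enumerate(ranks[:10], start=1):
            if rel:
                total += 1.0 / i     # reciprocal rank of the FIRST relevant hit
                break
    return total / len(ranked_relevance)

def recall_at_50(ranked_relevance):
    # fraction of queries with at least one relevant document in the top 50
    hits = sum(any(ranks[:50]) for ranks in ranked_relevance)
    return hits / len(ranked_relevance)

# e.g. two queries: first relevant doc at rank 2 and rank 1 respectively
print(mrr_at_10([[False, True, False], [True, False]]))  # (0.5 + 1.0) / 2 = 0.75
```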
Our model achieved an MRR@10 of 0.391 and a Recall@50 of 0.878, results that place it competitively alongside state-of-the-art models such as ColBERT and ANCE. These findings are particularly noteworthy given the challenging nature of the MS MARCO dataset and the high-performance benchmarks established by previous research. The success of our model can be attributed to two key components: the early interaction layer and the ranking-optimized loss function. The early interaction layer facilitates the integration of query context into document representations at an earlier processing stage, enabling the model to capture subtle semantic relationships that might otherwise be missed. This early integration of contextual information allows the model to more effectively distinguish between relevant and non-relevant documents, even when faced with noisy or ambiguous queries. Additionally, the ranking-optimized loss function plays a crucial role by focusing on refining the relative similarity rankings between documents. By emphasizing the comparative relationships between documents rather than absolute similarity scores, this loss function helps to reduce overfitting and enhances the model’s generalization capabilities across different types of queries and documents.
The combination of these innovations results in a model that not only performs well on the specific task of document retrieval but also demonstrates the versatility and robustness needed to adapt to the diverse and often unpredictable nature of real-world search tasks. These results reinforce our model’s position as a practical and efficient solution for document retrieval challenges, while also highlighting its potential for application in a variety of search scenarios beyond the specific domain of question answering.

5. Conclusions

In the field of text retrieval, traditional methods often struggle to capture deep semantic relationships between queries and documents, resulting in suboptimal retrieval performance. This limitation is particularly evident in modern information retrieval systems, where users expect search results that are both fast and highly accurate. To address these challenges, this research introduces an enhanced twin-tower model that improves text representation and retrieval performance through an advanced interaction layer and a ranking-focused loss function.

The core innovation of the proposed model is its early interaction layer, which captures subtle semantic relationships between queries and documents at a deeper level than traditional twin-tower models. Unlike conventional approaches that rely solely on separate encodings of queries and documents, our model incorporates a lightweight query-learning module on the document side. This module generates a simulated query representation from the document's content, which is then fused with the document's word vectors to produce a query-aware document vector. This early interaction enhances semantic understanding and improves retrieval accuracy by allowing the model to exploit query and document information jointly. The second critical contribution is a novel loss function that optimizes the ranking of relative similarity between positive and negative samples. Rather than pursuing absolute similarity values, the loss dynamically adjusts the relative distances between positive and negative samples, reducing the model's dependence on specific distance metrics and mitigating the risk of overfitting, thereby enhancing generalization and overall retrieval performance across diverse text data.
The early interaction layer leverages cross-attention mechanisms to enable the model to capture subtle semantic relationships between queries and documents at an earlier stage, enhancing the granularity of feature interactions and improving semantic understanding. This approach reduces information loss by integrating query information early and is supported by the success of attention mechanisms in transformer architectures. The ranking-optimized loss function is theoretically grounded in contrastive learning principles, emphasizing the relative ranking of positive and negative sample pairs. Focusing on relative distances reduces overfitting risks and enhances the model’s ability to distinguish between relevant and irrelevant documents. This loss function also diminishes the model’s reliance on specific distance metrics, increasing its robustness across diverse text data types. Overall, these theoretical insights strengthen the scientific foundation of our paper and clarify why our proposed modifications work effectively, contributing to the broader understanding of text retrieval models in the research community.
Extensive experiments on multiple benchmark datasets, including NQ, TQA, and WQ, demonstrate significant improvements over existing state-of-the-art retrieval methods. Our model achieves substantial gains in Top-K accuracy, outperforming traditional models such as BM25 as well as dense retrievers including DPR, ANCE, and ColBERT. On the NQ dataset, for instance, the proposed model achieves a 20.3% relative improvement in Top-20 retrieval accuracy over BM25 while maintaining a retrieval latency of only 17 ms. This gain in accuracy at low latency demonstrates the algorithm's ability to balance retrieval effectiveness against computational efficiency, decoupling two traditionally competing objectives where improvements in one usually come at the cost of the other.

Ablation studies further validate the individual and combined contributions of our algorithmic improvements. Both the advanced interaction layer and the optimized loss function play crucial roles in enhancing performance, and when applied together they reinforce each other, yielding even larger gains in retrieval accuracy and efficiency; this underscores the importance of considering the different aspects of model design comprehensively. In summary, this research presents a novel optimization algorithm for twin-tower models that significantly improves text retrieval performance by enhancing feature interaction and optimizing the loss function, achieving state-of-the-art results across multiple benchmarks while preserving the computational efficiency inherent to the twin-tower architectural paradigm. These results offer methodological insights that may guide future research in text retrieval systems, particularly regarding the integration of advanced interaction mechanisms and ranking-optimized loss functions.
While our proposed model demonstrates significant improvements in text retrieval performance, several promising directions for future research could further advance the field. First, integrating more advanced pre-trained language models, such as GPT-4 or other emerging architectures, could enhance semantic understanding and retrieval accuracy by leveraging their superior contextual representation capabilities. Second, developing hybrid retrieval architectures that combine the strengths of sparse and dense retrieval methods could lead to more robust performance. This might involve creating dynamic mechanisms to balance the contributions of each approach based on query characteristics, thereby optimizing both recall and precision. Third, exploring adaptive optimization techniques that adjust model parameters in real time could improve efficiency and generalization, particularly in dynamic data environments where query complexity varies. Fourth, extending our model to support multi-modal retrieval, integrating text with other data types such as images or audio, represents an exciting avenue for future exploration, especially in applications requiring cross-modal understanding. Lastly, investigating methods to compress our model while preserving performance, such as quantization or knowledge distillation, could make our approach more accessible for resource-constrained applications, broadening its applicability in real-world scenarios. These directions not only align with current research trends but also build upon the foundational work presented in this paper, offering promising paths for further advancement in text retrieval systems.

Author Contributions

Conceptualization, G.D.; Writing—original draft, S.X.; Writing—review and editing, Y.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in this study are publicly available and can be accessed through the following means: (1) Natural Questions (NQ) Dataset: This dataset is available from the Google Research website. It can be downloaded from the official GitHub repository or accessed via the TensorFlow Datasets library, which provides a convenient interface for loading and preprocessing the dataset (https://github.com/google-research-datasets/natural-questions, accessed on 9 January 2025). (2) TriviaQA (TQA) Dataset: This dataset is accessible through the official TriviaQA website. It is also available on platforms like Hugging Face Datasets, where researchers can easily load and utilize the dataset for their experiments (http://nlp.cs.washington.edu/triviaqa/, accessed on 10 January 2025). (3) WebQuestions (WQ) Dataset: This dataset can be obtained from the Microsoft Research website. It is also hosted on repositories such as Kaggle, making it readily available for researchers to download and use (https://github.com/brmson/dataset-factoid-webquestions, accessed on 11 January 2025). (4) MS MARCO Passage Ranking: This benchmark dataset, developed by Microsoft for information retrieval and machine reading comprehension, contains millions of real-world user queries and web passages for training and evaluating ranking models (https://microsoft.github.io/msmarco/, accessed on 16 April 2025). For each dataset, researchers are required to adhere to the respective terms of use and citation guidelines. Proper attribution should be given to the original sources when using these datasets in research publications, and the datasets should be used in compliance with any applicable licenses and restrictions specified by the providers.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Frommholz, I.; Liu, H.; Melucci, M. Bridging the Gap between Information Science, Information Retrieval and Data Science. In Proceedings of the First Workshop on Bridging the Gap Between Information Science, Information Retrieval and Data Science (BIRDS 2020) Co-Located with the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020), Xi’an, China, 30 July 2020; CEUR Workshop Proceedings, Volume 2741.
  2. Dai, Z.; Xiong, C.; Callan, J.; Liu, Z. Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, Marina Del Rey, CA, USA, 5–9 February 2018.
  3. Xu, F.; Yan, Y.; Zhu, J.; Chen, X.; Gao, L.; Liu, Y.; Shi, W.; Lou, Y.; Wang, W.; Leng, J.; et al. Self-Supervised EEG Representation Learning with Contrastive Predictive Coding for Post-Stroke Patients. Int. J. Neural Syst. 2023, 33, 2350066.
  4. Khang, N.H.G.; Nhat, N.M.; Quoc, T.N.; Hoang, V.T. Vietnamese Legal Text Retrieval based on Sparse and Dense Retrieval approaches. Procedia Comput. Sci. 2024, 234, 196–203.
  5. Gu, Z.; Jia, W.; Piccardi, M.; Yu, P. Empowering large language models for automated clinical assessment with generation-augmented retrieval and hierarchical chain-of-thought. Artif. Intell. Med. 2025, 162, 103078.
  6. Sachan, D.S.; Lewis, M.; Yogatama, D.; Zettlemoyer, L.; Pineau, J.; Zaheer, M. Questions Are All You Need to Train a Dense Passage Retriever. Trans. Assoc. Comput. Linguist. 2023, 11, 600–616.
  7. Shirasuna, V.Y.; Gradvohl, A.L.S. An optimized training approach for meteor detection with an attention mechanism to improve robustness on limited data. Astron. Comput. 2023, 45, 100753.
  8. Zhao, F.; Lu, Y.; Yao, Z.; Qu, F. Cross modal recipe retrieval with fine grained modal interaction. Sci. Rep. 2025, 15, 4842.
  9. Yi, H.M.; Kwak, C.K.; Shin, H.J. HyFusER: Hybrid Multimodal Transformer for Emotion Recognition Using Dual Cross Modal Attention. Appl. Sci. 2025, 15, 1053.
  10. Humeau, S.; Shuster, K.; Lachaux, M.A.; Weston, J. Poly-Encoders: Transformer Architectures and Pre-Training Strategies for Fast and Accurate Multi-Sentence Scoring. arXiv 2019, arXiv:1905.01969.
  11. MacAvaney, S.; Nardini, F.M.; Perego, R.; Tonellotto, N.; Goharian, N.; Frieder, O. Efficient Document Re-Ranking for Transformers by Precomputing Term Representations. arXiv 2020, arXiv:2004.14255.
  12. Gao, L.; Dai, Z.; Callan, J. Modularized Transformer-based Ranking Framework. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020.
  13. Ghali, M.-K.; Farrag, A.; Won, D.; Jin, Y. Enhancing knowledge retrieval with in-context learning and semantic search through generative AI. Knowl.-Based Syst. 2025, 311, 113047.
  14. Zhang, Y.; Ji, Z.; Pang, Y.; Han, J. Hierarchical and complementary experts transformer with momentum invariance for image-text retrieval. Knowl.-Based Syst. 2025, 309, 112912.
  15. Liu, Z.; Li, A.; Xu, J.; Shi, D. DCL-net: Dual-level correlation learning network for image–text retrieval. Comput. Electr. Eng. 2025, 122, 110000.
  16. Perkins, G.; Anderson, W.N.; Spies, C.N. Retrieval-augmented generation salvages poor performance from large language models in answering microbiology-specific multiple-choice questions. J. Clin. Microbiol. 2025, 63, e0162424.
  17. Tay, Y.; Dehghani, M.; Bahri, D.; Metzler, D. Efficient Transformers: A Survey. ACM Comput. Surv. 2022, 55, 1–28.
  18. Hussain, H.; Zakarya, M.; Ali, A.; Khan, A.A.; Qazani, M.R.C.; Al-Bahri, M.; Haleem, M. Energy efficient real-time tasks scheduling on high-performance edge-computing systems using genetic algorithm. IEEE Access 2024, 12, 54879–54892.
  19. Yu, Y.; Zeng, J.; Zhong, L.; Gao, M.; Wen, J.; Wu, Y. Multi-views contrastive learning for dense text retrieval. Knowl.-Based Syst. 2023, 274, 110624.
  20. Zheng, Z.; Hui, K.; He, B.; Han, X.; Sun, L.; Yates, A. BERT-QE: Contextualized Query Expansion for Document Re-ranking. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP, Online, 16–20 November 2020.
  21. Zhang, B.; Kyutoku, H.; Doman, K.; Komamizu, T.; Ide, I.; Qian, J. Cross-modal recipe retrieval based on unified text encoder with fine-grained contrastive learning. Knowl.-Based Syst. 2024, 305, 112641.
  22. Zhang, Y.; Chen, F.; Suk, J.; Yue, Z. WordPPR: A Researcher-Driven Computational Keyword Selection Method for Text Data Retrieval from Digital Media. Commun. Methods Meas. 2023, 18, 1–17.
  23. Zhou, Y.; Chu, H.; Li, Q.; Li, J.; Zhang, S.; Zhu, F.; Hu, J.; Wang, L.; Yang, W. Dual-tower model with semantic perception and timespan-coupled hypergraph for next-basket recommendation. Neural Netw. 2025, 184, 107001.
  24. Cui, J.; Wu, C.; Pan, S.; Li, K.; Liu, S.; Lv, Y.; Wang, S.; Luo, R. Determining the geographical origins of goji berries using the Twin-Tower model for Multi-Feature. Comput. Electron. Agric. 2024, 227, 109571.
  25. He, Q.; Li, X.; Cai, B. Graph neural network recommendation algorithm based on improved dual tower model. Sci. Rep. 2024, 14, 3853.
  26. Liu, Y.; Wang, T.; Yang, L.; Wu, J.; He, T. Automatic Joint Lesion Detection by enhancing local feature interaction. Comput. Med. Imaging Graph. 2025, 121, 102509.
  27. Liu, G.; Dong, S.; Zhou, Y.; Yao, S.; Liu, D. MFAR-Net: Multi-level feature interaction and Dual-Dimension adaptive reinforcement network for breast lesion segmentation in ultrasound images. Expert Syst. Appl. 2025, 272, 126727.
  28. Li, M.; Zhao, Y.; Zhang, F.; Gui, G.; Luo, B.; Yang, C.; Gui, W.; Chang, K. CSFIN: A lightweight network for camouflaged object detection via cross-stage feature interaction. Expert Syst. Appl. 2025, 269, 126451.
  29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
  30. Yang, G.; Liu, Y.; Wen, S.; Chen, W.; Zhu, X.; Wang, Y. DTI-MHAPR: Optimized drug-target interaction prediction via PCA-enhanced features and heterogeneous graph attention networks. BMC Bioinform. 2025, 26, 11.
  31. Zhang, J.; Zhu, Y.; Peng, S.; Niu, A.; Yan, Q.; Sun, J.; Zhang, Y. A multi-scale feature cross-dimensional interaction network for stereo image super-resolution. Multimed. Syst. 2025, 31, 114.
  32. Jiang, C.; Wang, Y.; Yuan, Q.; Qu, P.; Li, H. A 3D medical image segmentation network based on gated attention blocks and dual-scale cross-attention mechanism. Sci. Rep. 2025, 15, 6159.
Figure 1. The model flow framework.
Figure 2. Three model architectures in the field of dense retrieval. (a) Dual-Encoder; (b) Cross-Encoder; (c) Late-Encoder.
Figure 3. The task flow of the RAG technology.
Figure 4. Structure of the model.
Figure 5. Inquiry Learning Layer.
Figure 6. The optimization process of the loss function.
Figure 7. Performance of different retrievers.
Figure 8. Comparison of retrieval time and occupied space.
Figure 9. Comparison of results of ablation experiments.
Table 1. Comparison with updated models.

Comparison Dimension | BM25 | DPR | ColBERT | Ghali et al. [13] | Yu et al. [19] | Zheng et al. [20] | Our Model
Model Structure | Sparse retrieval | Dual-tower LLMs | Dual-tower w/ late interaction | Hybrid generative AI/semantic search | Multi-view contrastive | BERT query expansion | Dual-tower w/ early interaction
Retrieval Efficiency | Fast, exact matches | Efficient | Efficient | Efficient | Efficient | Efficient | On par with DPR, real-time suitable
Semantic Understanding | Lexical matching | Pre-trained LLMs | Contextual interactions | Strong integration | Multi-view learning | BERT expansion | Deeper understanding
Data Utilization | Labeled data | Labeled data | Labeled data | Dynamic data | Data augmentation | Unsupervised | Optimized loss function
Flexibility | Simple tasks | Various tasks | Various tasks | Complex queries | Diverse data | Query expansion | Highly flexible
Application Scenarios | Keyword matching | Semantic tasks | Contextual tasks | Knowledge-intensive | Multi-view tasks | Query expansion | Real-time retrieval
Table 2. Information of the dataset.

Dataset | Field | Train Set | Test Set | Validation Set | Number of Document Sets
NQ | Web | 152,148 | 6515 | 3610 | 2 M
TQA | Wikipedia | 78,785 | 8837 | 11,313 | 740 K
WQ | Wikipedia | 3417 | 361 | 2032 | -
MS MARCO | Web | 502,939 | 6333 | 6980 | 8.8 M
Table 3. Parameters involved in the model.

Parameter | Value | Instructions
Learning Rate | 3 × 10⁻⁵ | Controls the step size during optimization; determines how quickly the model learns from the data.
Batch Size (per GPU) | 32 | Number of training examples processed before the model updates its parameters. Larger batches improve training stability but require more memory.
Number of Epochs | 3 (early stopping) | Total number of passes through the entire training dataset. Early stopping prevents overfitting by halting training if validation performance plateaus.
Loss Function Weight | λ = 0.8 | Balances contributions from the ranking loss and query reconstruction loss, ensuring the model prioritizes relative similarity ranking.
Transformer Layers | 6 (encoder and decoder) | Number of Transformer layers in both the encoder and decoder modules, which capture contextual relationships between tokens.
Embedding Dimension | 768 | Dimensionality of the word embeddings and hidden states in the model. Higher dimensions capture richer semantic information but increase computational complexity.
Dropout Rate | 0.1 | Probability of randomly dropping neurons during training to prevent overfitting and improve generalization.
Optimizer | AdamW | Optimization algorithm used to update model parameters, combining the benefits of Adam and weight decay for better convergence.
Gradient Clipping | 1.0 | Technique to prevent exploding gradients by limiting the maximum value of gradients during backpropagation.
Document Token Length | 512 tokens | Maximum length of document segments, ensuring compatibility with the BERT model's input constraints.
Query Token Length | 64 tokens | Maximum length of queries, balancing information retention with computational efficiency.
Random Seed | 42 | Fixed seed for random number generation to ensure deterministic behavior and reproducibility across experiments.
Table 4. Performance of different retrievers in different datasets.

Retriever | Top-20 NQ | Top-20 TQA | Top-20 WQ | Top-100 NQ | Top-100 TQA | Top-100 WQ
BM25 | 61.9 | 70.7 | 50.9 | 77.2 | 76.4 | 69.9
DPR | 77.6 | 81.4 | 69.7 | 84.4 | 84.0 | 79.4
ANCE | 80.9 | 81.0 | 69.4 | 85.7 | 85.3 | 80.6
ColBERT | 81.3 | 81.9 | 70.9 | 86.1 | 86.9 | 80.5
Ours | 82.2 | 82.2 | 76.8 | 87.4 | 88.1 | 82.2
Table 5. Performance of different retrievers.

Retriever | Latency (1k candidates) | Latency (100k candidates) | Occupied Space (GB)
DPR | 24 ms | 35 ms | 37.7
ColBERT | 50 ms | 83 ms | 137
Cross-encoder | 2.4 s | 4.0 min | -
Ours | 17 ms | 29 ms | 25.6
Table 6. Comparison of Model Component Performance.

Retriever | Top-20 NQ | Top-20 WQ | Top-100 NQ | Top-100 WQ
BM25 | 58.8 | 55.0 | 71.2 | 70.9
DPR | 78.2 | 73.2 | 85.4 | 81.4
Ours1 | 79.0 | 73.9 | 85.9 | 81.8
Ours2 | 79.8 | 75.1 | 86.3 | 82.7
Ours | 80.2 | 75.8 | 86.9 | 83.2