Article

Query-Adaptive Hybrid Search

by Pavel Posokhov 1,2, Stepan Skrylnikov 2, Sergei Masliukhin 1,2, Alina Zavgorodniaia 2, Olesia Koroteeva 1,* and Yuri Matveev 1,2
1 Information Technologies and Programming Faculty, ITMO University, 197101 Saint Petersburg, Russia
2 STC-Innovations Ltd., 194044 Saint Petersburg, Russia
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2026, 8(4), 91; https://doi.org/10.3390/make8040091
Submission received: 27 February 2026 / Revised: 29 March 2026 / Accepted: 2 April 2026 / Published: 5 April 2026
(This article belongs to the Special Issue Trustworthy AI: Integrating Knowledge, Retrieval, and Reasoning)

Abstract

The modern information retrieval field increasingly relies on hybrid search systems that combine sparse retrieval with dense neural models. However, most existing hybrid frameworks employ static mixing coefficients and independent component training, failing to account for the specific needs of individual queries and for corpus heterogeneity. In this paper, we introduce an adaptive hybrid retrieval framework featuring query-driven alpha prediction, which dynamically calibrates the mixing weights based on query latent representations. The predictor is instantiated in two variants, a lightweight low-latency configuration and a full-capacity encoder-scale configuration, enabling flexible trade-offs between computational efficiency and retrieval accuracy without relying on resource-inefficient LLM-based online evaluation. Furthermore, we propose antagonist negative sampling, a novel training paradigm that optimizes the dense encoder to resolve the systematic failures of the lexical retriever, prioritizing hard negatives where BM25 exhibits high uncertainty. Empirical evaluations on large-scale multilingual benchmarks (MLDR and MIRACL) indicate that our approach achieves superior average performance compared to state-of-the-art models such as BGE-M3 and mGTE, reaching an nDCG@10 of 74.3 on long-document retrieval. Notably, our framework recovers up to 92.5% of the theoretical oracle performance and yields significant improvements in nDCG@10 across 16 languages, particularly in challenging long-context scenarios.

1. Introduction

The rapid expansion of large-scale multilingual corpora and retrieval-augmented generation (RAG) systems, which are extensively utilized in chatbot and knowledge-assistant applications [1,2], has renewed interest in hybrid retrieval architectures. Traditional sparse retrieval methods, such as BM25 [3], remain highly competitive due to their efficiency, interpretability [4], and strong performance on long documents and domain-specific corpora. In parallel, dense neural retrievers based on dual-encoder architectures have demonstrated superior generalization and semantic matching capabilities, particularly for short queries and paraphrased content [5]. Recent work [6] has shown that neither paradigm is universally optimal across tasks, languages, or documents. Consequently, hybrid systems that fuse sparse and dense scores have become the state-of-the-art solution for both academic benchmarks and industrial RAG pipelines and dialogue agents [7,8].
The retrieval task studied in this paper can be described as follows. Given a query q, a system is required to find, in a large corpus of candidates C, the candidate c best suited to answer it. Formally, the objective of a hybrid retrieval system is to rank a candidate list $R = (c_1, c_2, \ldots, c_n)$, where each $c_i \in C$, by the relevance score $s_{hybrid}$:
$s_{hybrid}(q, c, \alpha) = \alpha \cdot s_{dense}(q, c) + (1 - \alpha) \cdot s_{sparse}(q, c)$
where $s_{hybrid}$ denotes the hybrid relevance score, and $s_{dense}$ and $s_{sparse}$ represent the dense and sparse scores between the query q and a candidate c, respectively. The parameter $\alpha \in [0, 1]$ is an interpolation coefficient that controls the relative contribution of the two scoring functions. The fundamental challenge in hybrid retrieval lies in determining the optimal value of $\alpha$.
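In code, this fusion is a single convex combination (a minimal sketch; both scores are assumed to be already normalized to a comparable scale, as discussed in Section 3.2):

```python
def hybrid_score(s_dense: float, s_sparse: float, alpha: float) -> float:
    """Convex combination of dense and sparse relevance scores.

    alpha = 1.0 relies purely on the dense (semantic) signal,
    alpha = 0.0 purely on the sparse (lexical) signal.
    """
    assert 0.0 <= alpha <= 1.0
    return alpha * s_dense + (1.0 - alpha) * s_sparse

# A query where lexical matching dominates benefits from a small alpha:
score = hybrid_score(s_dense=0.62, s_sparse=0.91, alpha=0.3)
```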
Most industrial and academic standards, including BGE-M3 [9] and mGTE [10], rely on a static mixing coefficient. This implicitly assumes that all queries share the same optimal trade-off between lexical and semantic signals. However, the empirical results of these studies on multilingual and cross-domain benchmarks show that the optimal fusion weight varies not only across domains and corpora but also across individual queries depending on their structure and language. A fixed coefficient introduces a structural bottleneck that can substantially degrade retrieval quality by overemphasizing the wrong modality for a given query.
To bridge this gap, recent approaches utilize large language models (LLMs) to select α on a per-query basis by estimating the relative quality of lexical and dense outputs [11]. While yielding strong empirical performance, incorporating an LLM into the online retrieval loop introduces prohibitive computational latency, memory overhead, and significant GPU requirements. These limitations make LLM-based dynamic weighting highly impractical. Furthermore, in existing hybrid pipelines, the sparse and dense components are typically trained in isolation. Consequently, the combined system often fails to achieve true synergy as the dense model is not explicitly conditioned to correct the systematic failures of the sparse model.
In this work, we address these limitations by proposing a query-adaptive hybrid retrieval framework that explicitly optimizes for component complementarity and query-specific fusion without the overhead of generative LLMs. Our primary contribution is the query-driven alpha prediction (QDAP) module, which dynamically infers the optimal fusion weight α directly from the latent representation of the input query to balance lexical and semantic signals. By operating purely on the query embedding, QDAP offers the accuracy benefits of dynamic weighting while maintaining the low-latency requirements of standard retrieval pipelines. Our second contribution is antagonist negative sampling, a hybrid-aware contrastive training paradigm designed to further ensure that the hybrid components actively complement one another. Rather than optimizing the neural encoder in isolation, this strategy explicitly forces it to learn semantic representations that resolve the systematic failures (such as vocabulary mismatch) of the sparse retriever. The practical value and efficiency of these two contributions are validated through extensive empirical evaluations on the large-scale MLDR [9] and MIRACL [12] benchmarks. Our framework achieves superior average performance across 16 languages compared to state-of-the-art models like BGE-M3 and mGTE, particularly excelling in challenging long-context scenarios.

2. Related Works

Sparse Retrieval. Traditional lexical models such as BM25 [3] remain the industry standard in information retrieval due to their computational efficiency, interpretability, and strong performance on long documents or exact-match queries. However, because they rely on exact token overlap, classical sparse models inherently struggle with the “vocabulary mismatch” problem, failing to retrieve relevant documents if the query uses different terminology. To address this limitation, learned sparse retrieval has been proposed as an alternative paradigm that retains explicit lexical structure while leveraging transformer-based representation learning. Models such as SPLADE [13,14], DeepImpact [15], and uniCOIL [16] employ transformer-based encoders to predict term-level importance weights, effectively performing neural expansion of the vocabulary to mitigate mismatch issues.
Dense Retrieval. To further overcome the limitations of exact lexical matching, dense retrieval shifts the paradigm from term matching to semantic similarity. Popularized by dense passage retrieval (DPR) [6], this approach encodes both queries and documents into a shared continuous latent space using dual-encoder architectures [17] powered by pre-trained language models (e.g., BERT [18], RoBERTa [19], or SBERT [20]). By computing relevance scores via simple geometric operations like cosine similarity or dot product, dense retrievers excel at capturing deep semantic meaning, synonyms, and paraphrased content. While highly effective for short queries and conversational contexts, standalone dense retrievers can struggle with out-of-domain generalization and identifying highly specific keywords within long documents, highlighting the need to pair them with sparse methods.
Hybrid Retrieval Systems. Recognizing the complementary nature of lexical (precise matching) and semantic (generalization) signals, recent systems increasingly adopt hybrid retrieval. Early empirical studies, such as those on BioASQ [21], demonstrated that hybrid models consistently outperform individual components in domain-specific settings. Extending this line of work, [22] introduced the concept of “Blended RAG,” which integrates BM25, dense vector retrieval, and sparse encoders through advanced query strategies. These findings suggest that substantial improvements in retrieval arise from developing effective mechanisms for paradigm interaction.
More recent state-of-the-art multilingual systems, such as BGE-M3 [9] and mGTE [10], have adopted a multigranularity approach. They jointly train dense embeddings, learned sparse weights, and multi-vector representations (ColBERT-style [23,24]) within a single framework. Although these unified architectures achieve state-of-the-art performance, they impose significant training complexity and storage overhead. In contrast, our framework demonstrates that a fixed classical BM25 component—when paired with a dense model and trained via antagonist-aware sampling—can achieve superior hybrid performance without the computational burden of complex learned sparse components.
Dynamic Fusion Methods. A central challenge in hybrid systems is the fusion of relevance scores derived from heterogeneous representation spaces. Reciprocal rank fusion (RRF) [25] is a widely used non-parametric method that aggregates ranked lists based on positional information. Despite its empirical robustness, RRF violates Lipschitz continuity and discards crucial distributional information regarding relevance scores [26]. Consequently, weighted linear combinations (convex combination) with proper score normalization (e.g., min–max or Z-score) have emerged as a more theoretically grounded alternative. Empirical studies, such as the replication of DPR [27], confirmed that a simple linear combination of BM25 and dense scores yields statistically significant improvements, a finding standardized in industrial pipelines like Azure AI Search [28].
Despite these advances, most hybrid systems continue to rely on a static mixing coefficient (α) to balance sparse and dense signals. This assumption overlooks the fact that the relative importance of lexical and semantic signals varies significantly across different queries. While recent work in machine learning has explored regularization-based approaches to quantify and constrain dependencies between model variables during training [29], explicitly constraining dependencies in retrieval settings may limit the model’s ability to adapt to query variability.
Recently, the authors of DAT [11] introduced a dynamic fusion framework in which α is selected on a per-query basis using a large language model (LLM) to estimate the relative quality of retrieval outputs. Although this yields strong performance, incorporating an LLM into the retrieval loop introduces substantial computational overhead [30] and inference latency. These limitations are prohibitive in long-context retrieval (e.g., MLDR [9]), large-scale benchmarks (e.g., MIRACL [12]), and conversational systems [31]. Our work addresses this gap by introducing a lightweight query-dependent predictor that infers the fusion weight directly from the query embedding, eliminating the need for costly LLM-based relevance assessments.
Negative Sampling Strategies. The efficiency of contrastive learning in training the dense component of a hybrid system is fundamentally determined by the quality of negative samples [32]. Early dense retrieval approaches relied on random in-batch negatives [6], which are often too easy to provide meaningful gradient updates. To improve discriminative quality, advanced hard-negative mining strategies were introduced. For instance, ANCE [33] utilizes asynchronous mining to dynamically select negatives based on the model’s evolving index; SimANS [34] samples ambiguous negatives to avoid overly penalizing false negatives; and TAS-B [35] incorporates topic-aware sampling with a balanced margin loss. Additionally, RocketQA [36,37] utilizes cross-encoder distillation to denoise false negatives in large batches.
However, a critical limitation of techniques like ANCE, SimANS, and TAS-B is that they optimize the dense retriever in isolation, selecting hard negatives based solely on the dense model’s own similarity scores. In a hybrid retrieval setting, a negative example that is challenging for a dense model might be trivially filtered by the sparse component due to a lack of lexical overlap.
Our proposed antagonist negative sampling differs fundamentally from these existing hard-negative mining techniques by being explicitly hybrid-aware. Instead of mining negatives based purely on the dense encoder’s confusion, our strategy specifically targets candidates that are highly ranked by the “antagonistic” model (the sparse retriever) or by both components simultaneously. By doing so, we force the dense encoder to learn semantic patterns specifically tailored to compensate for the systematic failure modes of lexical retrieval (such as vocabulary mismatch), ensuring the two components are truly complementary.
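The mining priority just described can be illustrated with a short sketch (a simplified illustration with hypothetical cutoffs; the paper's exact mining thresholds are not reproduced here). Candidates that the sparse retriever ranks highly but that are not annotated as relevant become the preferred negatives for the dense encoder:

```python
def antagonist_negatives(bm25_ranking, dense_ranking,
                         relevant_ids, num_negatives=15):
    """Pick hard negatives that the sparse 'antagonist' (BM25) ranks highly.

    bm25_ranking / dense_ranking: candidate ids sorted by descending score.
    Candidates ranked highly by BOTH retrievers are taken first, then
    BM25-only confusions, forcing the dense model to learn to veto them.
    """
    bm25_top = [c for c in bm25_ranking[:100] if c not in relevant_ids]
    dense_top = set(dense_ranking[:100]) - set(relevant_ids)

    # Priority 1: confusions shared by both components.
    shared = [c for c in bm25_top if c in dense_top]
    # Priority 2: sparse-only confusions (lexical overlap, wrong semantics).
    sparse_only = [c for c in bm25_top if c not in dense_top]

    return (shared + sparse_only)[:num_negatives]
```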

3. Materials and Methods

In this section, we present the methodology underlying our adaptive hybrid retrieval framework, which is built around two core contributions: query-driven dynamic fusion of lexical and semantic signals and antagonist-aware training of the dense encoder aimed at compensating for systematic failures of sparse retrieval. Together, these components define an integrated hybrid search pipeline in which weight prediction and representation learning are optimized jointly for complementary behavior.
To properly implement this framework, we leverage architectural elements and data-processing pipelines inspired by prior multilingual retrieval systems, such as M3-Embedding [9] and mGTE [10], and adapt an open-source implementation of BGE-M3 to support predictive fusion control and specialized negative sampling. This design allows us to isolate and analyze the interaction between fixed lexical retrieval and trainable dense representations while preserving compatibility with established large-scale benchmarks and evaluation protocols.

3.1. Datasets

For training and evaluation, we employ datasets that have become standard benchmarks for multilingual and long-context retrieval. To ensure robust multilingual and cross-domain generalization, our models were trained jointly across all datasets and languages (MLDR [9] and MIRACL [12]), following the joint training paradigm established by state-of-the-art models like BGE-M3 [9]. We did not fine-tune separate models for individual datasets. The most widely adopted datasets for this task in prior work are MLDR, MIRACL, and MKQA, all of which are part of the MTEB benchmark used for evaluating multilingual retrieval models.
MLDR (Multilingual Long-Document Retrieval). This benchmark, introduced by [9], is designed for retrieval over long documents exceeding 8000 words. It comprises data in 13 languages collected from Wikipedia, Wudao, and mC4 corpora. The empirical results indicate that sparse (lexical) retrieval methods often outperform dense approaches on MLDR, yielding an advantage of approximately 10 points in terms of nDCG@10. This behavior can be attributed to the characteristics of long-form texts and large document sizes, where exact term matches provide a more reliable signal of relevance.
MIRACL (Multilingual Retrieval Across a Continuum of Languages). The MIRACL benchmark [12] covers 18 languages, contains segmented Wikipedia pages and is designed to evaluate ad hoc retrieval in a monolingual setting, where both queries and passages are expressed in the same language. A key characteristic of this dataset is the scale of its candidate corpora; for instance, the English corpus contains more than 30 million candidate passages. In contrast to MLDR, dense (semantic) retrieval methods typically outperform lexical approaches on this benchmark. MIRACL serves as a primary benchmark for assessing hybrid retrieval systems, including BGE-M3 and mGTE models. In our experiments, we consider 16 out of the 18 available languages as the remaining two are designated as “hidden” and are used exclusively for testing purposes. These languages do not provide training or validation splits, rendering comparative evaluation of our method on them non-representative.
MKQA (Multilingual Knowledge Questions and Answers). MKQA [38] is a large-scale benchmark for multilingual question answering and retrieval, consisting of parallel question sets in multiple languages paired with answers grounded in an English-language knowledge base. The dataset is designed to evaluate cross-lingual generalization, where queries in different languages are mapped to a shared English corpus.
Despite its adoption in prior work, we intentionally exclude MKQA from our evaluation due to a fundamental mismatch between its design and the objectives of this study: because queries in multiple languages are issued against an English-only corpus, BM25 becomes ineffective, as there is no lexical overlap between the query and document languages. This design eliminates the complementary interaction between lexical and semantic components that the proposed method builds upon. Consequently, evaluating on MKQA would not provide meaningful insights into the effectiveness of query-adaptive fusion or antagonist training. Nevertheless, extending the proposed framework to cross-lingual retrieval remains a promising direction for future work.
The statistical characteristics of the datasets used in this study are summarized in Table 1. All the MIRACL length statistics are reported in tokens. For space-delimited languages, tokens correspond to whitespace-delimited words. For Chinese, Japanese, Thai, and Korean, tokens are computed using the BGE-M3 multilingual subword tokenizer. Length statistics for MLDR follow the representation provided in the Hugging Face dataset card.
To clarify how data was utilized during model development, we adhered to standard practices established in prior works [10] for these specific benchmarks. For MLDR, we utilized the official training, validation, and test splits exactly as provided by the benchmark for their respective purposes. For MIRACL, the official test set is hidden; therefore, following standard community practice [9], we repurposed the official development (validation) split to serve as our final test set. To perform model validation and hyperparameter tuning during training, we constructed a custom validation set by randomly sampling a subset from the MIRACL training split, ensuring its size exactly matched that of the original development set. The scale of the data and the linguistic diversity ensure sufficient coverage of the training corpus and enable the model to generalize effectively to both short queries and long-document retrieval tasks.
Evaluation Metrics. To evaluate retriever performance, we adopt normalized discounted cumulative gain at rank k (nDCG@k) as the primary evaluation metric. This metric assesses ranking quality by accounting for both the relevance of retrieved documents and their positions in the ranked list. Formally, nDCG@k is defined as the ratio of the discounted cumulative gain (DCG@k) to the ideal discounted cumulative gain (IDCG@k):
$nDCG@k = \frac{DCG@k}{IDCG@k}$
Here, DCG@k denotes the discounted cumulative gain for the top-k retrieved results, computed as
$DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$
where $rel_i$ represents the relevance score of the document at rank i. The logarithmic denominator introduces a position-based discount, reducing the contribution of documents appearing lower in the ranking. The ideal discounted cumulative gain (IDCG@k) corresponds to the maximum achievable value of DCG@k under an ideal ranking in which documents are ordered by decreasing relevance:
$IDCG@k = \sum_{i=1}^{|REL_k|} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$
where $REL_k$ denotes the set of relevant documents up to rank k, sorted in descending order of relevance. The transformation $2^{rel_i} - 1$ is commonly used in graded relevance settings to emphasize highly relevant documents.
In our experiments, we set k = 10, thereby reporting nDCG@10. This choice follows standard practice in multilingual retrieval benchmarks and prior work and reflects the practical importance of accurately ranking documents within the top-10 results. Such a focus is particularly relevant for downstream retrieval-augmented generation (RAG) systems, where only a limited number of retrieved documents are utilized.
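The metric can be computed directly from a list of graded relevance labels (a minimal sketch using the graded gain $2^{rel} - 1$; for brevity, the ideal ranking is taken over the retrieved list only, rather than over all relevant documents in the corpus):

```python
import math

def dcg_at_k(relevances, k):
    """DCG@k with the graded gain 2^rel - 1 and log2(i + 1) position discount."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by the DCG of the ideal one."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered result list yields nDCG@k = 1.0; any inversion of a more relevant document below a less relevant one lowers the score.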

3.2. Hybrid Search

In this section, we describe the architecture of our hybrid text retriever system, which integrates semantic (dense) and lexical (sparse) search. Unlike models such as BGE-M3, which employ trainable term weights for sparse retrieval, our approach deliberately relies on the classic BM25 algorithm. This design choice allows us to keep one of the antagonist models fixed and focus on analyzing their complementarity under dynamic weight control (Section 3.4).
Dense Retrieval Component. For semantic retrieval, we adopt an architecture analogous to mGTE. The encoder is a transformer augmented with rotary position embeddings (RoPE) [39] and an unpadding mechanism to improve computational efficiency. Query representations $e_q$ and candidate representations $e_c$ are derived from the normalized hidden state of the [CLS] token. The dense relevance score $s_{dense}$ is computed as either the cosine similarity or the dot product between the corresponding embeddings.
Sparse Retrieval Component. As the lexical component, we employ BM25 with optimized hyperparameters. Our deliberate decision to utilize the classical BM25 algorithm rather than modern trainable sparse models (such as SPLADE or BGE-M3) is primarily motivated by the need to simplify the training process and tightly isolate the optimization of the dense encoder. Specifically, employing a fixed lexical baseline eliminates the confounding variables that would inevitably arise from a simultaneously updating sparse model, allowing us to rigorously evaluate our core hypothesis regarding the efficacy of antagonist negative sampling. Furthermore, this approach significantly improves overall training stability. Jointly training dense and sparse components introduces complex optimization dynamics; by freezing the sparse weights, we ensure that the dense encoder can reliably converge while learning to complement a strictly stationary “antagonist” target. From a broader methodological perspective, this configuration establishes a highly versatile modular framework. Because the proposed system—encompassing both antagonist-aware training and query-adaptive weighting—is inherently model-agnostic, isolating and verifying its core mechanics with a static baseline provides a generalized methodology. Consequently, this design ensures that the architecture can readily integrate fully learned sparse retrievers, enabling the system to seamlessly harness their advanced capabilities to achieve even stronger synergistic performance in complex deployment environments. Key implementation details that differ from a standard BM25 baseline include:
  • Language-specific tokenization: Each language is processed using a dedicated tokenizer that accounts for its morphological characteristics;
  • Stop-word handling: We use customized stop-word lists tailored for information retrieval tasks;
  • Static fixation: The use of classical BM25 allows us to treat lexical retrieval as a fixed baseline, supporting our hypothesis that, in hybrid retrieval, the complementarity of components is more critical than their individual accuracy, provided that the priority model is correctly selected on a per-query basis.
Formally, the lexical relevance score assigned to a candidate c for a query term j in corpus C is computed using the standard BM25 formulation:
$w_j(c, C) = \frac{(k_1 + 1)\, tf_j}{k_1 \left( (1 - b) + b \frac{cl}{avcl} \right) + tf_j} \cdot \log \frac{N - cf_j + 0.5}{cf_j + 0.5}$
The overall BM25 score for a candidate is obtained by summing $w_j(c, C)$ over all query terms j, where:
  • $tf_j$ denotes the term frequency of token j in candidate c;
  • $cf_j$ is the candidate frequency of j in the corpus, i.e., the number of candidates containing j;
  • $cl$ and $avcl$ represent the candidate length and average candidate length, respectively;
  • $N$ is the total number of candidates in C;
  • $k_1$ modulates the saturation rate of term frequency, delaying saturation to allow term frequency to have a higher effect in longer candidates;
  • $b$ adjusts length normalization, with higher values leading to increased penalization of longer candidates.
We optimize these hyperparameters for different candidate length characteristics. These values are selected based on validation experiments and prior BM25 tuning practices for retrieval. For the short-text MIRACL dataset, we use k1 = 0.9 and b = 0.4, which reduces length normalization and allows quick term frequency saturation appropriate for concise candidates. For the long-text MLDR dataset, we employ k1 = 1.2 and b = 0.75, increasing both saturation control and length normalization to account for wider variations in candidate length and term frequencies. We use the natural logarithm; however, the choice of base does not affect ranking. Additionally, we adjust tokenization strategies to optimize scoring behavior. For languages that do not rely on whitespace to delimit words, such as Hindi, Chinese, Japanese, Korean, Bengali and Telugu, we employ trigram-based tokenization provided by Weaviate. Trigram tokenization is applied consistently to both queries and candidates, with BM25 statistics (tfj and cfj) computed over trigram tokens rather than word-level terms.
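The per-term weight above translates directly into code (a minimal sketch; the parameter defaults shown are the long-text MLDR settings, and the natural logarithm is used as in our implementation):

```python
import math

def bm25_term_weight(tf, cf, N, cl, avcl, k1=1.2, b=0.75):
    """BM25 weight of one query term for one candidate.

    tf: term frequency in the candidate; cf: number of candidates
    containing the term; N: corpus size; cl / avcl: candidate length
    and average candidate length. k1=1.2, b=0.75 are the long-text
    (MLDR) settings; the short-text MIRACL settings are k1=0.9, b=0.4.
    """
    idf = math.log((N - cf + 0.5) / (cf + 0.5))
    norm = k1 * ((1 - b) + b * cl / avcl) + tf
    return (k1 + 1) * tf / norm * idf

# The candidate's BM25 score is the sum of these weights over query terms.
```

Note how a higher term frequency increases the weight with diminishing returns (saturation controlled by k1), while a candidate longer than average is penalized in proportion to b.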
Score Normalization and Fusion. Relevance scores produced by semantic dense retrievers (e.g., cosine similarity) and lexical sparse retrievers (e.g., unbounded BM25 scores) inhabit heterogeneous numerical spaces, posing a non-trivial challenge for score-level integration. To address this challenge, we consider and evaluate two alternative fusion paradigms based on the score normalization method: rank-based reciprocal rank fusion (RRF) and score-based min–max fusion. The first approach provides score-agnostic integration by leveraging only the ordinal positions of documents. Under this paradigm, the transformed score $s'_m$ for each retriever m is defined as
$s'_m = \frac{1}{k + \text{rank}_m(c, q)}$
where $\text{rank}_m(c, q)$ is the position of document c in the results of retriever m, and k is a smoothing constant. We set k = 60 following the standard empirical heuristic established in the original RRF literature [25]. A defining characteristic of RRF is that the original score distributions are discarded in favor of ranking information, completely losing the semantic distance between candidates.
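RRF can be implemented in a few lines (a minimal sketch operating on ranked candidate-id lists, best first):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: sum 1/(k + rank) over each retriever's list.

    rankings: iterable of ranked candidate-id lists (best first).
    k = 60 follows the standard heuristic from the original RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, cand in enumerate(ranking, start=1):
            scores[cand] = scores.get(cand, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```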
In contrast, the second approach, min–max fusion, aims to rescale the raw scores to a common range [0, 1] while preserving the original shape of the score distribution. Unlike RRF, this method maintains the relative relevance margins between candidates, simply mapping them to a unified scale. For each model $m \in \{sparse, dense\}$, the normalized score $s'_m$ is calculated as
$s'_m = \frac{s_m - s_{m,\min}}{s_{m,\max} - s_{m,\min}}$
where $s_m$ is the raw score, and $s_{m,\max}$ and $s_{m,\min}$ represent the maximum and minimum scores within the top-100 retrieved candidates.
Finally, the integrated hybrid relevance score is computed as a weighted sum of the processed semantic and lexical signals:
$s_{hybrid} = \alpha \cdot s'_{dense} + (1 - \alpha) \cdot s'_{sparse}$
We evaluate $s_{hybrid}$ for both paradigms using this unified notation: in the case of RRF, α is traditionally set to 0.5, whereas, for min–max fusion, it is dynamically predicted by our proposed query-driven alpha prediction module.
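Min–max fusion is equally compact (a minimal sketch; both score lists are assumed to cover the same candidates in the same order, e.g., the top-100 pool):

```python
def minmax_normalize(scores):
    """Rescale the raw scores of a candidate pool to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse_minmax(dense_scores, sparse_scores, alpha):
    """Weighted sum of min-max-normalized dense and sparse scores."""
    d = minmax_normalize(dense_scores)
    s = minmax_normalize(sparse_scores)
    return [alpha * di + (1 - alpha) * si for di, si in zip(d, s)]
```

Unlike RRF, the relative margins between raw scores survive normalization, which is what allows a per-query α to meaningfully shift weight between the two signals.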

3.3. Query-Driven Alpha Prediction

To enable a fully functional hybrid retrieval framework, a mechanism for dynamic weight control is required. The central hypothesis of our work is that queries contain sufficient semantic and structural information to adapt the retrieval strategy in real time [40]. To operationalize this principle across heterogeneous deployment regimes, we propose two configurations of the query-driven alpha prediction (QDAP) module to address the trade-off between computational efficiency and predictive accuracy: a lightweight adapter-based predictor for low-latency inference and a full-capacity encoder-scale predictor that maximizes fusion accuracy while remaining substantially smaller and more resource-efficient than generative large language models. While existing models such as BGE-M3 and mGTE rely on fixed mixing coefficients, our approach performs automatic selection of the dominant retrieval modality based on the embedding of the input query. The complete hybrid retrieval pipeline is illustrated in Figure 1 and Figure 2.
The proposed query-driven alpha prediction (QDAP) module is integrated directly into the hybrid search system. The input query is passed to both the dense and sparse retrieval models. Their respective relevance scores produced by the encoder-based retriever and BM25 are combined into a final hybrid score using the weighting coefficient α predicted by QDAP.
We introduce two predictor configurations, QDAP-S (Figure 1) and QDAP-L (Figure 2), allowing a trade-off between low latency and maximum retrieval accuracy. Both variants share the same architectural principles. The output layer of the predictor is a classification head consisting of 101 neurons, each corresponding to a discrete value of α in the range [0, 1] with a step of 0.01.
This specific number of bins is an empirically selected value that provides sufficient granularity for precise mixing without overloading the model with an excessive number of trainable weights, thereby preserving training and inference speed. By using 101 classes, the model estimates a full probability distribution over the discrete α space. To ensure prediction stability and account for correlations between neighboring weight values, we apply a learnable 1D convolution with a kernel size of 7 for smoothing. The predictor outputs are normalized using a SoftMax layer prior to loss computation.
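The head described above can be sketched in NumPy (an illustrative, untrained sketch: the hidden size of 768 and the uniform smoothing-kernel initialization are assumptions, and in the actual model both the projection and the kernel are learned):

```python
import numpy as np

rng = np.random.default_rng(0)

class QDAPHead:
    """Minimal sketch of the QDAP classification head.

    Maps a query embedding to a probability distribution over 101
    discrete alpha bins (0.00, 0.01, ..., 1.00). A length-7 convolution
    kernel smooths neighboring bins before the softmax.
    """
    def __init__(self, hidden_size=768, num_bins=101):
        self.W = rng.normal(0, 0.02, (hidden_size, num_bins))
        self.b = np.zeros(num_bins)
        self.kernel = np.ones(7) / 7.0   # learnable in the real model

    def __call__(self, query_embedding):
        logits = query_embedding @ self.W + self.b
        logits = np.convolve(logits, self.kernel, mode="same")  # smoothing
        z = np.exp(logits - logits.max())
        return z / z.sum()               # softmax over alpha bins

def expected_alpha(probs):
    """One simple way to collapse the distribution to a scalar alpha
    (an assumption for illustration; the argmax bin is an alternative)."""
    return float(probs @ np.linspace(0.0, 1.0, len(probs)))
```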
The primary difference between the two configurations is their size and the computational regime they target. QDAP-S is designed for low-latency memory-efficient deployment in large-scale or resource-constrained retrieval systems by operating on fixed embeddings. This architecture consists of a single linear layer followed by convolution; therefore, the computational overhead and inference latency it introduces are practically negligible. Importantly, despite its simplicity, the empirical results demonstrate that QDAP-S still achieves better average performance across the evaluated benchmarks compared to state-of-the-art baselines like BGE-M3 and mGTE.
QDAP-L is initialized from the dense retriever to achieve higher-fidelity estimation of the optimal fusion coefficient. This architecture employs a full encoder backbone (approximately 300 M parameters). Importantly, this full-capacity configuration remains within the parameter and memory footprint of standard retrieval encoders (e.g., mGTE-base) and is substantially smaller than contemporary generative LLMs. While the absolute execution speed of QDAP-L is hardware-dependent, processing a query through its 300 M-parameter encoder requires only a single forward pass, which makes it roughly 20 to 30 times faster in terms of inference latency than dynamic fusion approaches relying on 7B+ parameter LLMs.
QDAP-S (Figure 1) employs a lightweight adapter model (e.g., TMP Adapter [40]) that operates on fixed embeddings produced by the dense retriever, thereby avoiding direct fine-tuning of the main encoder. QDAP-L (Figure 2), in contrast, duplicates the encoder of the primary dense retriever, initializes it with the retriever's weights, and combines it with the QDAP-S classification head. While this design increases GPU memory consumption, it does not negatively impact the overall joint search latency of the system because, in real-world applications, QDAP-L processes the query in parallel with the primary dense retriever. Consequently, QDAP-S is preferable in real-world deployment scenarios with strict GPU memory limitations, where hosting an additional encoder duplicate is physically prohibitive. For all other scenarios with sufficient VRAM, the recommended architecture is QDAP-L. Ultimately, this configuration significantly improves the accuracy of α prediction.
Following metric-driven optimization principles (e.g., Choppy [41]), the predictor is trained to directly optimize the target evaluation metric rather than surrogate losses. Furthermore, training is performed exclusively on the train split of each dataset, with validation and test partitions strictly held out to prevent information leakage. For predictor training, the target variable is defined as the nDCG@10 score computed for each possible value of α (from 0 to 1 with a step of 0.01) for a given query from the training datasets described in Section 3.1. The distribution of nDCG@10 across different α values is often non-monotonic; therefore, we do not consider a regression setting with a single linear neuron for this task. Empirical observations indicate that the performance landscape can exhibit multiple peaks with different optimal mixing values for a single query. A standard regression model tends to average these peaks, failing to capture the true multimodality of the distribution.
We propose to optimize the model using a composite loss function that combines a probabilistic divergence with a geometry-aware discrepancy measure. The proposed loss, denoted as L , integrates classical cross-entropy and a one-dimensional Wasserstein distance (WD) to encourage both accurate mode prediction and alignment between the predicted and target distributions:
$$\mathcal{L} = \lambda\,\mathcal{L}_{CE}(\hat{y}, y) + (1 - \lambda)\,\mathcal{L}_{WD}(\hat{y}, y)$$
where $\lambda \in (0, 1)$ is a weighting coefficient, $y$ denotes the target distribution of nDCG@10 values over the discretized α space, and $\hat{y}$ represents the corresponding distribution predicted by the model. In our experiments we set $\lambda = 0.62$, empirically selected to balance optimization stability and distributional alignment.
The cross-entropy loss (LCE) is defined as
$$\mathcal{L}_{CE}(\hat{y}, y) = -\sum_{i=1}^{n} y_i^{2} \log(\hat{y}_i)$$
where both $y$ and $\hat{y}$ are normalized using the SoftMax function. $n$ denotes the total number of discrete bins in the output space (corresponding to the quantized α values), and $i$ represents the index of each individual bin. The squared target term acts as a weighting mechanism that emphasizes bins with higher ground-truth probability mass while suppressing contributions from low-probability targets. This increases the influence of dominant or more certain target components in the loss, biasing optimization toward accurately modeling the most salient regions of the target distribution. Unlike standard cross-entropy, where each target component contributes linearly, this formulation introduces a nonlinear emphasis on confident ground-truth assignments, improving robustness in scenarios where the target distribution is sharply peaked or misalignment in high-probability bins is particularly costly. Cross-entropy provides smooth and well-behaved gradients, making it well suited for backpropagation and stochastic optimization methods. However, it still treats each bin independently and does not account for the geometric structure of the output space.
To incorporate information about the global structure of the distributions, we additionally include a one-dimensional Wasserstein distance (WD) term. In the discrete one-dimensional setting, WD admits the closed-form expression
$$\mathcal{L}_{WD}(\hat{y}, y) = \sum_{i=1}^{n} \left| \sum_{j=1}^{i} y_j - \sum_{j=1}^{i} \hat{y}_j \right|$$
where the inner sum over the index j calculates the accumulated probability up to the current bin i. This formulation measures the cumulative mismatch between the predicted and target distributions, explicitly accounting for the ordering of bins in the discretized α-space. As a result, probability mass shifted over small distances incurs a smaller penalty than mass displaced over larger distances, which is consistent with the underlying geometry of the problem.
The motivation for combining cross-entropy and Wasserstein distance follows from their complementary properties. To illustrate this intuitively, if the true optimal weight for a query is α = 0.8, predicting α = 0.79 is a minor error, whereas predicting α = 0.1 is a severe misallocation that completely alters the retrieval balance. Cross-entropy alone is insufficient because it treats the output space as purely categorical; predicting 0.79 incurs the exact same penalty as predicting 0.1, completely ignoring the geometric proximity of the bins. While the Wasserstein distance preserves the geometric structure of the output space and provides a physically meaningful notion of discrepancy between distributions, it is generally non-smooth and depends on an optimal transport formulation. In discrete settings, this often leads to piecewise linear loss landscapes and unstable or noisy gradients when optimized directly. Conversely, cross-entropy is smooth and yields stable dense gradients but ignores the geometry of the output space and penalizes all misallocations of probability mass equally regardless of their distance. By deliberately combining these two terms, the proposed loss function simultaneously enforces geometric alignment between distributions and ensures favorable optimization properties. This hybrid formulation allows the model to retain essential structural information while remaining amenable to efficient gradient-based training. The complete training procedure is formalized in Algorithm A1 provided in Appendix A.
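The complementary behavior of the two loss terms can be sketched with a toy implementation. The distributions below are illustrative five-bin examples (not the paper's 101-bin targets); λ = 0.62 follows the value reported above, while the helper names are my own.

```python
import math

# Toy sketch of the composite loss L = lambda * L_CE + (1 - lambda) * L_WD,
# with the squared-target cross-entropy and the closed-form 1D Wasserstein
# distance over cumulative sums. Distributions here are illustrative.

def cross_entropy_sq(y_hat, y):
    """Cross-entropy with squared target weights: -sum(y_i^2 * log(y_hat_i))."""
    eps = 1e-12
    return -sum((yi ** 2) * math.log(max(yh, eps)) for yi, yh in zip(y, y_hat))

def wasserstein_1d(y_hat, y):
    """Closed-form 1D WD: sum over bins of |CDF(y) - CDF(y_hat)|."""
    total, cy, cyh = 0.0, 0.0, 0.0
    for yi, yh in zip(y, y_hat):
        cy += yi
        cyh += yh
        total += abs(cy - cyh)
    return total

def composite_loss(y_hat, y, lam=0.62):  # lambda = 0.62 as in the paper
    return lam * cross_entropy_sq(y_hat, y) + (1 - lam) * wasserstein_1d(y_hat, y)

# Target peaked at bin 2; a "near miss" vs a "far miss" prediction.
y = [0.05, 0.10, 0.70, 0.10, 0.05]
near = [0.05, 0.70, 0.10, 0.10, 0.05]  # mass shifted one bin left
far = [0.70, 0.05, 0.10, 0.10, 0.05]   # mass shifted two bins left

# Unlike plain CE on the peak bin, WD penalizes the distant misallocation more.
assert wasserstein_1d(far, y) > wasserstein_1d(near, y)
```

The assertion at the end mirrors the argument in the text: shifting probability mass two bins away from the optimum incurs a strictly larger transport cost than shifting it one bin, which plain cross-entropy cannot distinguish.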

3.4. Antagonist Negative Sampling

The effectiveness of hybrid retrieval is determined less by the standalone accuracy of its individual components (dense and sparse) than by the quality of their mutual complementarity. In this work, we hypothesize that optimizing a hybrid system requires training the dense model to focus specifically on cases where lexical retrieval (BM25) fails. To this end, we introduce antagonist negative sampling, a method designed to select not merely “hard” examples but those that are critical for the joint performance of the dense model and its sparse antagonist. We decompose the construction of the training data into two stages: filtering positive training pairs and collecting antagonistic negative samples. The process is shown in Figure 3.
For the first stage we select training examples based on the effectiveness of lexical retrieval. Specifically, we assume that the dense model should be primarily trained on queries q for which the sparse component fails to retrieve the relevant candidate c + from the original dataset. We select ( q , c + ) pairs whose value of the target metric (nDCG@10) falls below a predefined threshold σ using sparse retrieval alone (BM25).
$$D_{train} = \{ (q, c^{+}) \in D \mid \mathrm{nDCG@10}_{sparse}(q) < \sigma \}$$
Empirically, we find that this filtering strategy enables the model to more effectively correct the antagonist’s predictions by concentrating the gradient on semantically complex relationships that are inaccessible to simple keyword matching.
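The first stage reduces to a simple filter over precomputed sparse-only metrics. The sketch below uses toy query scores and a hypothetical threshold value; neither reflects the paper's actual data or the tuned σ.

```python
# Illustrative sketch of the first stage: keep only (query, positive) pairs
# whose sparse-only nDCG@10 falls below a threshold sigma. The scores and
# threshold here are toy values, not the paper's data.

SIGMA = 0.5  # hypothetical threshold on sparse nDCG@10

# (query_id, positive_id, sparse-only nDCG@10 for the query)
dataset = [
    ("q1", "doc_a", 0.92),  # BM25 already solves this query well -> dropped
    ("q2", "doc_b", 0.31),  # lexical retrieval fails -> kept for training
    ("q3", "doc_c", 0.10),  # lexical retrieval fails -> kept for training
    ("q4", "doc_d", 0.55),  # above threshold -> dropped
]

# Only queries where the sparse antagonist struggles remain in D_train.
train_pairs = [(q, c) for q, c, ndcg_sparse in dataset if ndcg_sparse < SIGMA]
```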
At the second stage, we collect negative examples for contrastive learning. Traditional hard-negative mining methods (e.g., Naive Top-K) select candidates with the highest similarity within the space of a single model. Our method adapts this process to the hybrid setting: let $s_{dense}(q, c)$ and $s_{sparse}(q, c)$ denote the normalized values of the raw relevance scores for a query $q$ and a candidate $c$. In this case the hybrid score is defined as a linear combination:
$$s_{hybrid}(q, c, \alpha) = \alpha \cdot s_{dense}(q, c) + (1 - \alpha) \cdot s_{sparse}(q, c)$$
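The normalization and linear fusion can be sketched as follows. Min-max normalization is the scheme used in Algorithm A1 (Appendix A); the score values below are toy numbers chosen only to show the mechanics.

```python
# Minimal sketch of score normalization and the hybrid linear combination
# s_hybrid = alpha * s_dense + (1 - alpha) * s_sparse. The input scores
# are illustrative toy values, not real retriever outputs.

def min_max(scores):
    """Rescale a score list to [0, 1]; constant lists map to 0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid(dense, sparse, alpha):
    """Fuse normalized dense and sparse scores with mixing weight alpha."""
    d, s = min_max(dense), min_max(sparse)
    return [alpha * di + (1 - alpha) * si for di, si in zip(d, s)]

dense_scores = [0.2, 0.9, 0.5]    # e.g., cosine similarities
sparse_scores = [12.0, 3.0, 7.5]  # e.g., raw BM25 scores

# A dense-leaning mixing weight; QDAP would predict this per query.
fused = hybrid(dense_scores, sparse_scores, alpha=0.7)
```

Normalizing both score lists to a common [0, 1] range is what makes a single scalar α meaningful across retrievers with incompatible score scales.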
As antagonistic negative examples $c \in C_{ant}$, we select candidates from the top-100 results of both models that dominate the positive candidate $c^{+}$ for every possible value of the weight.
From a mathematical perspective, this condition is equivalent to the simultaneous satisfaction of
$$s_{dense}(q, c) > s_{dense}(q, c^{+}) \quad \text{and} \quad s_{sparse}(q, c) > s_{sparse}(q, c^{+}).$$
The use of such examples forces the dense model not merely to minimize contrastive error but to explicitly reshape the embedding space in order to compensate for failures of the sparse antagonist in scenarios where hybrid retrieval would inevitably produce false positives.
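Because the hybrid score is linear in α, a candidate that outscores the positive under both retrievers dominates it for every α in [0, 1], which is exactly the dominance condition above. The following sketch filters such candidates; the identifiers and scores are hypothetical.

```python
# Sketch of antagonist negative selection: from the top candidates of both
# retrievers, keep those that outscore the positive under BOTH models, so
# they dominate c+ for every alpha in [0, 1]. Candidate scores are toy values.

def antagonist_negatives(candidates, pos_dense, pos_sparse):
    """candidates: iterable of (cand_id, dense_score, sparse_score)."""
    return [cid for cid, sd, ss in candidates
            if sd > pos_dense and ss > pos_sparse]

# Scores of the positive candidate c+ under each retriever.
pos_dense, pos_sparse = 0.60, 0.40

top_candidates = [
    ("n1", 0.75, 0.55),  # dominates c+ in both -> antagonist negative
    ("n2", 0.80, 0.20),  # only dense beats c+ -> excluded
    ("n3", 0.30, 0.70),  # only sparse beats c+ -> excluded
]

negatives = antagonist_negatives(top_candidates, pos_dense, pos_sparse)
```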
Training of the dense encoder with antagonistic negatives is performed by minimizing the InfoNCE loss [42]:
$$\mathcal{L}_{dense} = -\log \frac{\exp(s_{dense}(q, c^{+}) / \tau)}{\exp(s_{dense}(q, c^{+}) / \tau) + \sum_{c^{-} \in C_{ant}} \exp(s_{dense}(q, c^{-}) / \tau)}$$
where τ, q, and c denote the temperature parameter, the query, and a candidate, respectively. The positive $c^{+}$ is the candidate relevant to q, and the remaining irrelevant candidates serve as negatives, either antagonist hard negatives or in-batch negatives (candidates from other instances in the same batch). $s(q, c)$ is the relevance score of q and c, measured by the cosine similarity between their respective representations. According to our experimental results (Section 4.2), the combination of positive-pair filtering and sampling from the joint antagonist pool (Sparse + Dense negatives) provides the best convergence behavior and final system quality. The complete data construction and training procedure is formalized in Algorithm A2 provided in Appendix B.
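A toy computation of this objective illustrates why antagonist negatives concentrate the gradient: negatives whose similarity approaches that of the positive dominate the denominator. The scores and temperature below are illustrative, not tuned values from the paper.

```python
import math

# Toy sketch of the InfoNCE objective used to train the dense encoder:
# -log( exp(s+/tau) / (exp(s+/tau) + sum_c- exp(s-/tau)) ).
# Similarity values and the temperature are illustrative.

def info_nce(pos_score, neg_scores, tau=0.05):
    """InfoNCE loss for one query given positive and negative similarities."""
    pos = math.exp(pos_score / tau)
    denom = pos + sum(math.exp(s / tau) for s in neg_scores)
    return -math.log(pos / denom)

# Cosine similarities: antagonist negatives sit close to the positive.
loss_hard = info_nce(0.70, [0.68, 0.65, 0.10])  # hard (antagonist) negatives
loss_easy = info_nce(0.70, [0.10, 0.05, 0.00])  # easy negatives

# Hard negatives yield a much larger loss, hence stronger training signal.
assert loss_hard > loss_easy
```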

4. Results

In this section, we present the evaluation results of the proposed hybrid text retriever (HTR) model on major multilingual retrieval benchmarks. The primary performance metric is nDCG@10. We compare the proposed method against current state-of-the-art approaches, including BGE-M3 and mGTE-TRM. We utilize the official checkpoints of these baselines, which were originally trained on the MLDR and MIRACL datasets by their respective authors. Since our method utilizes the same training data sources, this ensures a fair comparison of model capabilities.
We report nDCG@10 as the primary evaluation metric and observe that the proposed HTR model consistently outperforms the baseline methods across multiple languages and domains, with improvements that hold across runs with different random seeds. To account for stochasticity in training and inference, we conduct 10 independent runs with different random seeds affecting model initialization and training batch composition. While this multi-run setup allows us to verify the consistency of the results, we report mean performance values in accordance with common practice in the literature to ensure direct comparability with prior work.

4.1. Evaluation

Results on the MLDR Dataset. The MLDR benchmark targets multilingual retrieval over long documents, a setting in which effective lexical matching remains crucial due to sparse yet informative term overlaps. To disentangle the contributions of different retrieval signals, we evaluate both standalone and hybrid retrieval models. We first consider two internal baselines: STR, a dense semantic text retriever that relies on vector representations, and LTR, a sparse lexical text retriever based on the BM25 implementation provided by Weaviate. As shown in Table 2, LTR attains a high average nDCG@10 of 67.3, outperforming both the baseline BM25 (53.6) and the trainable sparse model in BGE-M3 (62.2), demonstrating the strength of lexical signals in long-context retrieval. In contrast, STR achieves a lower average nDCG@10 of 61.8, indicating that semantic representations provide complementary gains but remain sensitive to long-document noise.
When STR and LTR are combined in the proposed hybrid text retriever (HTR), the model attains the best overall performance of 74.3 nDCG@10, substantially surpassing the hybrid variants of mGTE-TRM (71.3) and BGE-M3 (65.0), both of which included this benchmark in their original training. These results confirm the effectiveness of dynamic weight selection in long-context scenarios and demonstrate that adaptively weighting lexical (LTR) and semantic (STR) signals is crucial for effective retrieval over long documents.
Results on the MIRACL Dataset. On the MIRACL benchmark, where semantic retrieval typically dominates, our HTR model also demonstrates strong performance relative to most baseline configurations. The results are shown in Table 3. While multi-vector architectures such as BGE-M3 (Dense + Sparse + Multi-vec) achieve strong performance on MIRACL, these models rely on late-interaction mechanisms that effectively operate as lightweight rerankers, incurring substantially higher computational and memory overhead due to per-token or per-vector similarity computation at retrieval time. In contrast, the proposed HTR model maintains a single-vector dense representation paired with a fixed lexical component and a lightweight query-adaptive fusion module.
Although the lexical component BM25 (LTR) performs substantially worse on this dataset (31.4 nDCG@10), the HTR with the query-driven alpha predictor (QDAP) (Section 3.3) effectively compensates for this gap, achieving a final score of 67.1 nDCG@10 while delivering consistent gains in long-context regimes (MLDR) and preserving significantly lower system complexity. This positions the method as a practical alternative for large-scale or latency-sensitive deployments where multi-vector or reranking-style architectures may be infeasible.
This performance is comparable to specialized dense-only models and confirms that the system can dynamically prioritize semantic signals when lexical retrieval is ineffective.
As shown in Table 3, while our model HTR (Dense + Sparse) demonstrates strong performance across most languages, it is slightly outperformed by BGE-M3 (Dense + Sparse + Multi-vec), which is likely attributable to differences in model capacity and embedding dimensionality. Specifically, BGE-M3 has a larger model size (569 M parameters) and higher-dimensional embeddings (1024) compared to the HTR’s 305 M parameters and smaller 768-dimensional embeddings. Larger models with higher-dimensional embeddings generally have greater capacity to model complex relationships, leading to improved retrieval performance, particularly in multilingual settings. Importantly, despite this gap, the HTR maintains competitive results, indicating that its architecture and combination of dense and sparse representations remain effective.
Comparison With the Theoretical Upper Bound (Oracle). To assess the quality of the α predictor, we introduce two additional evaluation settings:
  • HTR (Optimal). Hybrid search model used with a fixed value of α that yields the best performance over the entire dataset.
  • HTR (Oracle). An idealized setting when the optimal α is selected individually for each query, representing the upper bound on achievable performance.
According to our results (Table 4), the HTR reaches 92.54% of the oracle performance, which is approximately 4% higher than the score obtained using a single globally optimal constant.
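The oracle and fixed-α settings differ only in where the maximization happens: per query versus over the dataset average. The sketch below makes this explicit on a toy 3-query, 3-α grid; the numbers are illustrative and unrelated to the 92.54% figure reported above.

```python
# Sketch of the oracle vs. fixed-alpha comparison: given per-query nDCG@10
# over a grid of alpha values, the oracle picks the best alpha per query,
# while HTR (Optimal) uses one global alpha. Numbers below are toy values.

# rows = queries, columns = alpha grid (only 3 alphas here for brevity)
ndcg_per_query = [
    [0.50, 0.80, 0.60],  # query 1 prefers the middle alpha
    [0.90, 0.70, 0.40],  # query 2 prefers a sparse-leaning alpha
    [0.30, 0.50, 0.85],  # query 3 prefers a dense-leaning alpha
]

n_queries = len(ndcg_per_query)
n_alphas = len(ndcg_per_query[0])

# Oracle: best alpha chosen individually per query, then averaged.
oracle = sum(max(row) for row in ndcg_per_query) / n_queries

# Optimal fixed alpha: the single column maximizing the dataset average.
fixed = max(
    sum(row[j] for row in ndcg_per_query) / n_queries
    for j in range(n_alphas)
)

# Fraction of oracle performance recovered by the best global constant.
recovery = fixed / oracle
assert fixed <= oracle  # a global constant can never beat per-query choice
```

When queries disagree on their preferred α, as in this toy grid, the fixed-α average falls well short of the oracle, which is exactly the gap an accurate per-query predictor can close.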
Overall, the proposed adaptive fusion and antagonistic training strategies allow our method to outperform the baseline models in terms of average performance across all the evaluated languages.

4.2. Ablation Study

In this section, we conduct an ablation study to assess the contribution of the key technical components of our system: the weight predictor architecture (QDAP) and the antagonist negative sampling method.
Impact of QDAP Architecture and Loss Functions. We compared two predictor configurations: QDAP-L, which employs the full encoding model (analogous to the dense retriever), and QDAP-S, a compact adapter model operating on fixed embeddings without fine-tuning the main encoder. We also evaluated various combinations of loss functions: cross-entropy (CE), Wasserstein distance (WD), and their weighted combination.
As shown in Table 5, the full-scale QDAP-L consistently outperforms the adapter QDAP-S by an average of 4.23%. Regarding loss functions, the best performance is achieved using the combination of Wasserstein distance + cross-entropy, which is on average 6.9% more effective than WD alone and 2.8% better than CE alone. These results indicate that incorporating the physical proximity of the predicted α coefficient to the optimum (via WD) alongside classical classification (CE) ensures the most stable convergence of the predictor. They also validate the intended design trade-off: QDAP-L defines the upper bound on adaptive fusion accuracy, while QDAP-S provides a computationally efficient approximation that retains the majority of performance gains under significantly reduced latency and memory constraints.

To optimize the synergy between lexical and semantic components, we evaluate various data preparation strategies, including positive-pair filtering and negative-sample selection, using oracle metrics to eliminate the mixing coefficient prediction error, as detailed in Table 6. The experimental setup compares three filtering regimes (no filtering, sparse filter, and dense-sparse filter) across four negative sampling techniques: random negatives, sparse negatives, dense negatives, and the proposed sparse-dense negatives.
Positive sampling. The results in Table 6 indicate that the proposed sparse filtering strategy, which prioritizes training instances where lexical retrieval performance falls below an average threshold, consistently yields the highest retrieval performance. On average, sparse filtering improves performance by 7.1% compared to dense-sparse filtering and by 3.69% compared to the unfiltered baseline. In contrast, dense-sparse filtering exhibits markedly lower effectiveness. This degradation is likely attributable to an excessive reduction in both the volume and diversity of the training set when constraints are simultaneously imposed by both retrieval modalities, thereby limiting the model’s exposure to informative positive examples.
Negative sampling. The negative sampling results provided in Table 6 indicate that the proposed sparse-dense negative strategy consistently outperforms the strongest baseline, dense negatives, by an average margin of 3.66%. This improvement suggests that sampling hard negatives from the intersection of the top-ranked results of both lexical and dense retrievers effectively targets regions of joint uncertainty within the hybrid retrieval system. Consequently, the dense encoder is encouraged to learn discriminative representations that compensate for failure cases in which both lexical overlap and semantic similarity provide insufficient signals for reliable retrieval.

To empirically investigate the complementary nature of lexical and semantic signals, we analyze the sensitivity of hybrid retrieval performance to the mixing coefficient α, as illustrated in Figure 4 and Figure 5.
The experimental setup tracks the variation in retrieval quality, where the X axis represents the interpolation weight α ranging from 0 (purely sparse BM25 retrieval) to 1 (purely dense retrieval) and the Y axis denotes the nDCG@10 score. Figure 4 depicts the performance trajectories for the MIRACL benchmark, where the quality curves predominantly exhibit stability or improvement as α approaches 1, thereby confirming the dominance of dense semantic representations in short-passage retrieval scenarios. In contrast, Figure 5, which illustrates the results for the MLDR dataset, demonstrates a marked shift in the optimal performance region toward lower α values, reflecting the system’s reliance on precise lexical matching when processing long documents. This observed divergence in optimal mixing ratios across different retrieval regimes substantiates the limitations of static fusion parameters and empirically validates the necessity for the proposed query-driven alpha prediction (QDAP) module to dynamically optimize the lexical–semantic trade-off on a per-query basis.
We examine the stability and efficiency of the proposed optimization paradigm by illustrating hybrid retrieval performance throughout the training of the dense encoder, as depicted in Figure 6 and Figure 7. This analysis is designed to assess whether antagonistic negative sampling preserves the standalone quality of the semantic retriever while improving hybrid effectiveness. The horizontal axis denotes the number of training epochs, and the vertical axis reports the average nDCG@10 score. Distinct colored curves correspond to performance under the theoretical oracle condition (i.e., an ideal mixing coefficient) and under several fixed values of the α coefficient.
The results on the MIRACL benchmark in Figure 6 indicate rapid convergence and high performance stability, suggesting efficient adaptation to short-passage semantic matching. In contrast, Figure 7 exhibits more gradual improvement and clear stratification across α values, reflecting the increased difficulty of modeling semantic dependencies in long-context retrieval.
Across both benchmarks, the dense encoder demonstrates stable and monotonic performance gains, including consistent improvements in the oracle setting. These trends indicate that training the dense model as a corrective antagonist to BM25 enhances the overall hybrid retrieval capacity without inducing mode collapse or compromising the retriever’s ability to capture deep semantic features.

5. Discussion

The results of our study demonstrate the superiority of the proposed HTR method over current state-of-the-art approaches, including BGE-M3 and mGTE, which were originally trained on the same major multilingual benchmarks. Our analysis highlights several factors that are critical to the system’s effectiveness.
Dynamic Weighting. Unlike traditional hybrid models (e.g., BGE-M3) that rely on fixed mixing coefficients (e.g., w2 = 0.3 for the lexical component), our approach employs a QDAP predictor to adapt the retrieval strategy to each query. The experiments show that the HTR achieves 92.54% of the theoretical maximum (oracle), indicating that the query embedding provides sufficient information to optimally balance semantic and lexical contributions.
Component Synergy Over Individual Accuracy. In hybrid systems, the ability of components to complement each other outweighs their standalone performance. We deliberately used the classic BM25 as the lexical component even though modern trainable sparse models (as in BGE-M3) may perform better in isolation. By treating BM25 as a “static antagonist” and training the dense model to correct its errors, our hybrid system outperforms more complex alternatives on the MLDR dataset. It is important to note that the strongest MIRACL baseline, BGE-M3 (Dense+Sparse+Multi-vec), employs a late-interaction multi-vector design inspired by ColBERT-style retrieval, which computes fine-grained token-level or multi-embedding similarity during inference. While this strategy yields high effectiveness in short-passage semantic matching, it substantially increases both memory footprint and retrieval-time computation, effectively shifting the system closer to a reranking paradigm rather than a pure first-stage retriever.
In contrast, our approach deliberately constrains the dense component to a single-vector representation and focuses on adaptive fusion and antagonist-aware training. This design prioritizes scalability and universality across retrieval regimes, enabling strong average performance without relying on multi-vector expansion or reranking-style interaction mechanisms.
Antagonistic Training. The proposed antagonist negative sampling strategy contrasts with the knowledge self-distillation approach in BGE-M3. Instead of training the model to match relevance scores, the dense encoder learns to solve queries that lexical search misses (positive sampling based on BM25 errors) and to distinguish semantically similar but irrelevant documents (antagonist hard negatives). Ablation studies confirm a 3.72% performance gain compared to training without antagonist-aware sampling.
Context Length and Data-Type Considerations. Consistent with prior findings from BGE-M3 and mGTE, long documents (MLDR) benefit from prioritizing lexical search, while short passages (MIRACL) require semantic focus. The adaptive weight predictor automatically adjusts α, favoring semantics for MIRACL and lexical signals for MLDR, enhancing the system’s universality.
Generalization and Fine-Tuning. Regarding the implications of training the retriever on specific domains versus utilizing a pre-trained retriever without further fine-tuning, our framework prioritizes broad generalization. Following a fine-tuning procedure analogous to that of baseline models such as BGE-M3 and mGTE, we train the dense encoder weights and the QDAP alpha adjustment on a combined corpus containing all the evaluated languages and datasets (MLDR and MIRACL) simultaneously. Rather than fine-tuning on isolated domains or specific datasets, this joint training strategy ensures the development of a robust universal model that is capable of generalizing across diverse linguistic and structural contexts without requiring domain-specific adaptations.
Limitations and Future Directions. Despite its strong performance, the current system deliberately relies on classic BM25 as a fixed lexical antagonist to isolate the effects of adaptive weighting and antagonistic training. This design choice simplifies the analysis but limits the expressiveness of the sparse component. In future work, we plan to replace BM25 with a trainable sparse retriever, including SPLADE-style models (e.g., BGE-M3), and jointly optimize the sparse and dense encoders under the guidance of the adaptive weight predictor. While the proposed hybrid framework demonstrates strong multilingual capabilities, its extension to cross-lingual retrieval tasks presents unique challenges. In a purely cross-lingual setting, the classical sparse search component would be largely ineffective due to the lack of exact vocabulary overlap. However, the adaptive weighting mechanism could behave differently and prove highly beneficial in mixed-language scenarios, such as when a query integrates multiple languages (e.g., English technical terms embedded in a non-English query). In such cases, QDAP can dynamically adjust to leverage the available sparse signals alongside semantic representations. Evaluating the proposed method on dedicated cross-lingual datasets is a key objective for our future work, with this study serving as its foundation.

6. Conclusions

In this work, we presented an adaptive hybrid retrieval framework that dynamically balances lexical and semantic retrieval signals on a per-query basis. By introducing the query-driven alpha prediction (QDAP) module, we replace static fusion strategies with a lightweight embedding-based predictor capable of selecting an optimal mixing coefficient in real time, without incurring the computational costs associated with large language model evaluators; this demonstrates that query embeddings alone contain sufficient structural and semantic information for per-query calibration. Our adaptive approach achieved 92.54% of the theoretical oracle performance, significantly narrowing the gap between traditional fixed-weight systems and idealized per-query optimization. Furthermore, our findings emphasize that the effectiveness of a hybrid system is defined by the synergy between its components rather than their standalone accuracy.
We further proposed antagonist negative sampling, a training paradigm that conditions the dense encoder to explicitly address the failure modes of BM25, focusing the dense model’s learning capacity on the specific weaknesses of the lexical component. This antagonist-aware design emphasizes complementarity over standalone component accuracy, resulting in a hybrid system that is more robust across heterogeneous retrieval regimes.
Experiments on multilingual benchmarks demonstrate that the proposed method consistently outperforms state-of-the-art hybrid models, achieving strong gains in nDCG@10 and reaching over 92% of the theoretical oracle upper bound. The results confirm that long-document retrieval benefits from prioritizing lexical signals, while short-passage and semantic benchmarks favor dense representations—an imbalance that can be effectively resolved through adaptive weighting.
While our current framework utilizes a fixed BM25 component to maintain a stable “antagonist” during training, future work could explore extending this framework to jointly train both dense and trainable sparse components under the guidance of the adaptive predictor, with the goal of further narrowing the gap to oracle performance and further enhancing the universality of hybrid retrieval across diverse linguistic and structural contexts. More broadly, the proposed methodology provides a general blueprint for designing retrieval systems that are both computationally efficient and universally effective across languages, domains, and document scales.

Author Contributions

Conceptualization, P.P.; methodology, P.P.; software, S.S.; validation, S.S. and A.Z.; formal analysis, S.S.; investigation, S.S. and A.Z.; resources, S.M. and A.Z.; data curation, O.K. and Y.M.; writing—original draft preparation, P.P., S.S., S.M. and A.Z.; writing—review and editing, P.P., S.S. and S.M.; visualization, S.M.; supervision, O.K. and Y.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

All the datasets used in this study (MLDR and MIRACL) are publicly available in their official repositories: the MLDR dataset can be accessed at https://huggingface.co/datasets/Shitao/MLDR (DOI: 10.18653/v1/2024.findings-acl.137) (accessed on 12 March 2026), and the MIRACL dataset is available at https://huggingface.co/datasets/miracl/miracl (DOI: 10.1162/tacl_a_00595) (accessed on 12 March 2026).

Conflicts of Interest

Authors Pavel Posokhov, Stepan Skrylnikov, Sergei Masliukhin, Alina Zavgorodniaia and Yuri Matveev were employed by the company STC-Innovations Ltd. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
QDAP	Query-Driven Alpha Prediction
LLM	Large Language Model
MLDR	Multilingual Long-Document Retrieval Dataset
MIRACL	Multilingual Information Retrieval Across a Continuum of Languages
BM25	Best Matching 25
BGE-M3	BAAI General Embedding: Multilinguality, Multifunctionality, Multigranularity
mGTE	Multilingual General Text Embeddings Model
nDCG	Normalized Discounted Cumulative Gain
RAG	Retrieval-Augmented Generation
DPR	Dense Passage Retrieval
BERT	Bidirectional Encoder Representations from Transformers
RoBERTa	Robustly Optimized BERT Approach
SBERT	Sentence-BERT
ColBERT	Contextualized Late Interaction over BERT
ANCE	Approximate Nearest Neighbor Negative Contrastive Estimation
SPLADE	Sparse Lexical and Expansion Model
uniCOIL	Universal Contextualized Inverted List
SimANS	Simple Ambiguous Negatives Sampling
TAS-B	Balanced Topic-Aware Sampling
RRF	Reciprocal Rank Fusion
BioASQ	Biomedical Semantic Indexing and Question Answering
MKQA	Multilingual Knowledge Questions and Answers
RoPE	Rotary Position Embedding
WD	Wasserstein Distance
CE	Cross-Entropy
InfoNCE	Information Noise-Contrastive Estimation
HTR	Hybrid Text Retriever
LTR	Lexical Text Retriever
STR	Semantic Text Retriever
GPU	Graphics Processing Unit

Appendix A. Training of the Query-Driven Alpha Prediction (QDAP) Module

To formalize the training procedure of the query-driven alpha prediction module, we outline the end-to-end optimization process in Algorithm A1. The algorithm details the preparation of the target distributions based on retrieval metrics, the structural flow through the predictor architectures, and the composite loss optimization step.
Algorithm A1 Pseudocode for Query-Driven Alpha Prediction (QDAP) Module Training

Require: Training set D = {(q, s_S, s_L, y_rel)} with queries q, pre-computed raw dense/sparse scores s_S, s_L, and relevance labels y_rel
1:  for each (q, s_S, s_L, y_rel) ∈ D do
    Stage 1: Data Preparation
2:      s_S ← MinMax(s_S)    {semantic (dense) score normalization}
3:      s_L ← MinMax(s_L)    {lexical score normalization}
4:      for α_i ∈ {0.00, 0.01, …, 1.00} do
5:          s_hybrid ← α_i · s_S + (1 − α_i) · s_L    {hybrid combination of scores}
6:          v_i ← nDCG@10(s_hybrid, y_rel)
7:      end for
8:      y ← SoftMax([v_1, …, v_101])    {target metric distribution}
    Stage 2: Forward Pass
9:      if using QDAP-S then
10:         e_q ← FixedEncoder(q)
11:     else if using QDAP-L then
12:         e_q ← TrainableEncoder(q)
13:     end if
14:     z ← Linear(e_q)    {classification head over 101 bins}
15:     z_smooth ← Conv1D(z, kernel_size = 7)
16:     ŷ ← SoftMax(z_smooth)    {predicted probability distribution}
    Stage 3: Composite Loss Optimization
17:     L_CE ← −Σ_{i=1}^{101} y_i² · log(ŷ_i)    {target-squared cross-entropy}
18:     L_WD ← Σ_{i=1}^{101} | Σ_{j=1}^{i} y_j − Σ_{j=1}^{i} ŷ_j |    {1D Wasserstein distance}
19:     L ← λ · L_CE + (1 − λ) · L_WD
20:     Θ ← Θ − η ∇_Θ L    {update trainable parameters Θ}
21: end for
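As an illustration, the target-construction and loss steps of Algorithm A1 (Stages 1 and 3) can be sketched in NumPy. This is a minimal sketch, not the authors' implementation: the function names, the epsilon guard in the cross-entropy, and the handling of constant score vectors in `minmax` are our assumptions; in practice the loss would be computed on framework tensors with autograd.

```python
import numpy as np

def ndcg_at_10(scores, relevance):
    """nDCG@10 for a single query from candidate scores and relevance labels."""
    order = np.argsort(-scores)[:10]
    discounts = 1.0 / np.log2(np.arange(2, 2 + len(order)))
    dcg = float(np.sum(relevance[order] * discounts))
    ideal = np.sort(relevance)[::-1][: len(order)]
    idcg = float(np.sum(ideal * discounts))
    return dcg / idcg if idcg > 0 else 0.0

def minmax(x):
    """Min-max normalization; returns zeros for a constant vector (assumption)."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def qdap_target(s_dense, s_lexical, relevance, n_bins=101):
    """Stage 1: sweep alpha over 101 bins, score each hybrid mix with nDCG@10,
    and softmax the metric profile into the target distribution y."""
    s_d, s_l = minmax(s_dense), minmax(s_lexical)
    v = np.array([ndcg_at_10(a * s_d + (1.0 - a) * s_l, relevance)
                  for a in np.linspace(0.0, 1.0, n_bins)])
    e = np.exp(v - v.max())  # numerically stable softmax
    return e / e.sum()

def composite_loss(y, y_hat, lam=0.5, eps=1e-9):
    """Stage 3: target-squared cross-entropy blended with the 1D Wasserstein
    distance between cumulative target and predicted distributions."""
    l_ce = -np.sum(y ** 2 * np.log(y_hat + eps))
    l_wd = np.sum(np.abs(np.cumsum(y) - np.cumsum(y_hat)))
    return lam * l_ce + (1.0 - lam) * l_wd
```

The Wasserstein term vanishes when the predicted distribution matches the target exactly, so only the cross-entropy component remains; λ = 0.5 here is an illustrative default, not the paper's tuned value.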

Appendix B. Antagonist Negative Sampling Algorithm

To formalize the data construction procedure for antagonist negative sampling, we outline the two-stage preparation process in Algorithm A2. The algorithm details the filtering of positive training pairs based on lexical retrieval effectiveness, the normalization of relevance scores, and the condition-based selection of antagonistic negative samples.
Specifically, Stage 1 isolates queries for which the standalone sparse retriever (BM25) fails to meet a predefined accuracy threshold σ, ensuring that training focuses on lexical failure modes. Stage 2 then constructs the antagonist negative pool by evaluating the top-100 candidates retrieved by both models from the global corpus. By selecting only the negative candidates that strictly dominate the positive relevant document in both normalized dense and sparse score spaces, the resulting dataset D_train explicitly forces the dense encoder to learn representations that compensate for instances where hybrid retrieval would otherwise produce false positives.
Algorithm A2 Pseudocode for Antagonist Negative Sampling Data Preparation

Require: Original dataset D = {(q, c⁺)} with queries q and relevant candidates c⁺, document corpus C, predefined threshold σ
Ensure: Filtered training dataset with antagonistic negative samples D_train
    Stage 1: Positive Training Pairs Filtering
1:  D_filtered ← ∅    {initialize filtered positive pairs}
2:  for each (q, c⁺) ∈ D do
3:      v_sparse ← nDCG@10(q, C)_sparse    {evaluate lexical retrieval on corpus C}
4:      if v_sparse < σ then
5:          D_filtered ← D_filtered ∪ {(q, c⁺)}    {retain query where BM25 fails}
6:      end if
7:  end for
    Stage 2: Antagonistic Negative Sample Collecting
8:  D_train ← ∅    {initialize final training dataset}
9:  for each (q, c⁺) ∈ D_filtered do
10:     C_top100 ← Top100(q, C)_dense ∪ Top100(q, C)_sparse    {retrieve candidates from corpus}
11:     s_dense ← MinMax(s_dense)    {dense score normalization on C_top100}
12:     s_sparse ← MinMax(s_sparse)    {sparse score normalization on C_top100}
13:     C_ant ← ∅    {initialize antagonistic negative pool}
14:     for each c ∈ C_top100 \ {c⁺} do
15:         if s_dense(q, c) > s_dense(q, c⁺) and s_sparse(q, c) > s_sparse(q, c⁺) then
16:             C_ant ← C_ant ∪ {c}    {collect antagonistic negatives}
17:         end if
18:     end for
19:     if C_ant ≠ ∅ then
20:         D_train ← D_train ∪ {(q, c⁺, C_ant)}    {form training instance}
21:     end if
22: end for
23: return D_train
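The Stage 2 dominance test of Algorithm A2 can be sketched for a single query as follows. This is an illustrative sketch under our own assumptions: the function names are hypothetical, the candidate lists are assumed to be already merged into one top-100 pool, and constant score vectors are normalized to zeros.

```python
import numpy as np

def minmax(x):
    """Min-max normalization over the merged candidate pool."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def antagonist_negatives(cand_ids, dense_scores, sparse_scores, pos_id):
    """Keep every candidate whose normalized dense AND sparse scores
    strictly exceed those of the positive document c+ (Algorithm A2, line 15)."""
    d = minmax(np.asarray(dense_scores, dtype=float))
    s = minmax(np.asarray(sparse_scores, dtype=float))
    p = cand_ids.index(pos_id)
    return [c for i, c in enumerate(cand_ids)
            if c != pos_id and d[i] > d[p] and s[i] > s[p]]

# Toy pool: only "a" beats the positive in BOTH normalized score spaces.
ids = ["pos", "a", "b", "c"]
print(antagonist_negatives(ids, [0.5, 0.9, 0.6, 0.1],
                           [0.4, 0.8, 0.2, 0.9], "pos"))  # → ['a']
```

Queries whose pool comes back empty are dropped (line 19 of the algorithm), so every retained training instance carries at least one antagonistic negative.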

References

  1. Bora, A.; Cuayáhuitl, H. Systematic Analysis of Retrieval-Augmented Generation-Based LLMs for Medical Chatbot Applications. Mach. Learn. Knowl. Extr. 2024, 6, 2355–2374. [Google Scholar] [CrossRef]
  2. Lakatos, R.; Pollner, P.; Hajdu, A.; Joó, T. Investigating the Performance of Retrieval-Augmented Generation and Domain-Specific Fine-Tuning for the Development of AI-Driven Knowledge-Based Systems. Mach. Learn. Knowl. Extr. 2025, 7, 15. [Google Scholar] [CrossRef]
  3. Robertson, S.; Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond. Found. Trends Inf. Retr. 2009, 3, 333–389. [Google Scholar] [CrossRef]
  4. Arabzadeh, N.; Yan, X.; Clarke, C. Predicting Efficiency/Effectiveness Trade-offs for Dense vs. Sparse Retrieval Strategy Selection. In Proceedings of the CIKM ’21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Gold Coast, QLD, Australia, 1–5 November 2021; pp. 2862–2866. [Google Scholar] [CrossRef]
  5. Posokhov, P.; Matveeva, A.; Makhnytkina, O.; Matveev, A.; Matveev, Y. Personalizing Retrieval-Based Dialogue Agents. In Proceedings of the Speech and Computer; Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S., Eds.; Springer: Cham, Switzerland, 2022; pp. 554–566. [Google Scholar]
  6. Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.t. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6769–6781. [Google Scholar] [CrossRef]
  7. Posokhov, P.; Skrylnikov, S.; Makhnytkina, O.; Matveev, Y. Hybrid Approach to the Personification of Dialogue Agents. In Proceedings of the Artificial Intelligence and Speech Technology; Dev, A., Sharma, A., Agrawal, S.S., Rani, R., Eds.; Springer: Cham, Switzerland, 2025; pp. 102–115. [Google Scholar]
  8. Matveev, Y.; Makhnytkina, O.; Posokhov, P.; Matveev, A.; Skrylnikov, S. Personalizing Hybrid-Based Dialogue Agents. Mathematics 2022, 10, 4657. [Google Scholar] [CrossRef]
  9. Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 2318–2335. [Google Scholar] [CrossRef]
  10. Zhang, X.; Zhang, Y.; Long, D.; Xie, W.; Dai, Z.; Tang, J.; Lin, H.; Yang, B.; Xie, P.; Huang, F.; et al. mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, Miami, FL, USA, 12–16 November 2024; pp. 1393–1412. [Google Scholar] [CrossRef]
  11. Hsu, H.L.; Tzeng, J. DAT: Dynamic Alpha Tuning for Hybrid Retrieval in Retrieval-Augmented Generation. arXiv 2025, arXiv:2503.23013. [Google Scholar] [CrossRef]
  12. Zhang, X.; Thakur, N.; Ogundepo, O.; Kamalloo, E.; Alfonso-Hermelo, D.; Li, X.; Liu, Q.; Rezagholizadeh, M.; Lin, J. MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages. Trans. Assoc. Comput. Linguist. 2023, 11, 1114–1131. [Google Scholar] [CrossRef]
  13. Formal, T.; Lassance, C.; Piwowarski, B.; Clinchant, S. SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. arXiv 2021, arXiv:2109.10086. [Google Scholar] [CrossRef]
  14. Formal, T.; Piwowarski, B.; Clinchant, S. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 11–15 July 2021; pp. 2288–2292. [Google Scholar] [CrossRef]
  15. Mallia, A.; Khattab, O.; Suel, T.; Tonellotto, N. Learning Passage Impacts for Inverted Indexes. In Proceedings of the SIGIR ’21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, 11–15 July 2021; pp. 1723–1727. [Google Scholar] [CrossRef]
  16. Lin, J.J.; Ma, X. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques. arXiv 2021, arXiv:2106.14807. [Google Scholar] [CrossRef]
  17. Humeau, S.; Shuster, K.; Lachaux, M.; Weston, J. Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. In Proceedings of the 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  18. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  19. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar] [CrossRef]
  20. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084. [Google Scholar] [CrossRef]
  21. Ma, J.; Korotkov, I.; Hall, K.B.; McDonald, R.T. Hybrid First-stage Retrieval Models for Biomedical Literature. In Proceedings of the Working Notes of CLEF 2020—Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, 22–25 September 2020; CEUR Workshop Proceedings; Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A., Eds.; CEUR-WS.org: Aachen, Germany, 2020; Volume 2696. [Google Scholar]
  22. Sawarkar, K.; Mangal, A.; Solanki, S.R. Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers. In Proceedings of the 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), San Jose, CA, USA, 7–9 August 2024; pp. 155–161. [Google Scholar] [CrossRef]
  23. Khattab, O.; Zaharia, M. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 25–30 July 2020; pp. 39–48. [Google Scholar] [CrossRef]
  24. Santhanam, K.; Khattab, O.; Saad-Falcon, J.; Potts, C.; Zaharia, M. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 3715–3734. [Google Scholar] [CrossRef]
  25. Cormack, G.V.; Clarke, C.L.A.; Buettcher, S. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 19–23 July 2009; pp. 758–759. [Google Scholar] [CrossRef]
  26. Bruch, S.; Gai, S.; Ingber, A. An Analysis of Fusion Functions for Hybrid Retrieval. ACM Trans. Inf. Syst. 2023, 42, 1–35. [Google Scholar] [CrossRef]
  27. Ma, X.; Sun, K.; Pradeep, R.; Lin, J. A Replication Study of Dense Passage Retriever. arXiv 2021, arXiv:2104.05740. [Google Scholar] [CrossRef]
  28. Bernston, A. Azure AI Search: Outperforming Vector Search with Hybrid Retrieval and Reranking. 2023. Available online: https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/azure-ai-search-outperforming-vector-search-with-hybrid-retrieval-and-reranking/3929167 (accessed on 6 March 2026).
  29. Incremona, A.; Pozzi, A.; Guiscardi, A.; Tessera, D. A differentiable and uncertainty-aware mutual information regularizer for bias mitigation. Neurocomputing 2026, 669, 132498. [Google Scholar] [CrossRef]
  30. Shaik, H.; Villuri, G.; Doboli, A. An Overview of Large Language Models and a Novel, Large Language Model-Based Cognitive Architecture for Solving Open-Ended Problems. Mach. Learn. Knowl. Extr. 2025, 7, 134. [Google Scholar] [CrossRef]
  31. Matveev, A.; Makhnytkina, O.; Matveev, Y.; Svischev, A.; Korobova, P.; Rybin, A.; Akulov, A. Virtual Dialogue Assistant for Remote Exams. Mathematics 2021, 9, 2229. [Google Scholar] [CrossRef]
  32. Masliukhin, S.; Posokhov, P.; Skrylnikov, S.; Makhnytkina, O.; Ivanovskaya, T. Prompt-based multi-task learning for robust text retrieval. Sci. Tech. J. Inf. Technol. Mech. Opt. 2024, 24, 1016–1023. [Google Scholar] [CrossRef]
  33. Xiong, L.; Xiong, C.; Li, Y.; Tang, K.; Liu, J.; Bennett, P.N.; Ahmed, J.; Overwijk, A. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. arXiv 2020, arXiv:2007.00808. [Google Scholar] [CrossRef]
  34. Zhou, K.; Gong, Y.; Liu, X.; Zhao, W.X.; Shen, Y.; Dong, A.; Lu, J.; Majumder, R.; Wen, J.r.; Duan, N. SimANS: Simple Ambiguous Negatives Sampling for Dense Text Retrieval. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 548–559. [Google Scholar] [CrossRef]
  35. Hofstätter, S.; Lin, S.C.; Yang, J.H.; Lin, J.; Hanbury, A. Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 11–15 July 2021; pp. 113–122. [Google Scholar] [CrossRef]
  36. Qu, Y.; Ding, Y.; Liu, J.; Liu, K.; Ren, R.; Zhao, W.X.; Dong, D.; Wu, H.; Wang, H. RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 5835–5847. [Google Scholar] [CrossRef]
  37. Ren, R.; Qu, Y.; Liu, J.; Zhao, W.X.; She, Q.; Wu, H.; Wang, H.; Wen, J.R. RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 2825–2835. [Google Scholar] [CrossRef]
  38. Longpre, S.; Lu, Y.; Daiber, J. MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering. Trans. Assoc. Comput. Linguist. 2021, 9, 1389–1406. [Google Scholar] [CrossRef]
  39. Su, J.; Lu, Y.; Pan, S.; Wen, B.; Liu, Y. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv 2021, arXiv:2104.09864. [Google Scholar] [CrossRef]
  40. Posokhov, P.; Masliukhin, S.; Stepan, S.; Tirskikh, D.; Makhnytkina, O. Relevance Scores Calibration for Ranked List Truncation via TMP Adapter. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 1–27 July 2025; pp. 7728–7734. [Google Scholar] [CrossRef]
  41. Bahri, D.; Tay, Y.; Zheng, C.; Metzler, D.; Tomkins, A. Choppy: Cut Transformer for Ranked List Truncation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 25–30 July 2020; pp. 1513–1516. [Google Scholar] [CrossRef]
  42. van den Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
Figure 1. Hybrid retrieval pipeline with small prediction module QDAP-S operating on dense encoder embedding.
Figure 2. Hybrid retrieval pipeline with encoder-scale prediction module represented by full encoding model QDAP-L.
Figure 3. Two-stage antagonist negative sampling.
Figure 4. Analysis of hybrid retrieval behavior on MIRACL dataset.
Figure 5. Analysis of hybrid retrieval behavior on MLDR dataset.
Figure 6. Dense encoder performance on MIRACL dataset across training epochs.
Figure 7. Dense encoder performance on MLDR dataset across training epochs.
Table 1. Characteristics of test datasets.
Dataset | Language | Train | Val | Test | Corpus | Avg. Doc Length
MLDR | Arabic (ar) | 1817 | 200 | 200 | 7607 | 9428
MLDR | German (de) | 1847 | 200 | 200 | 10,000 | 9039
MLDR | English (en) | 10,000 | 200 | 800 | 200,000 | 3308
MLDR | Spanish (es) | 2254 | 200 | 200 | 9551 | 8771
MLDR | French (fr) | 1608 | 200 | 200 | 10,000 | 9659
MLDR | Hindi (hi) | 1618 | 200 | 200 | 3806 | 5555
MLDR | Italian (it) | 2151 | 200 | 200 | 10,000 | 9195
MLDR | Japanese (ja) | 2262 | 200 | 200 | 10,000 | 9297
MLDR | Korean (ko) | 2198 | 200 | 200 | 6176 | 7832
MLDR | Portuguese (pt) | 1845 | 200 | 200 | 6569 | 7922
MLDR | Russian (ru) | 1864 | 200 | 200 | 10,000 | 9723
MLDR | Thai (th) | 197 | 200 | 200 | 10,000 | 8089
MLDR | Chinese (zh) | 10,000 | 200 | 800 | 200,000 | 4249
MIRACL | Arabic (ar) | 3295 | 200 | 2896 | 2,061,414 | 53
MIRACL | Bengali (bn) | 1431 | 200 | 411 | 297,265 | 55
MIRACL | English (en) | 2663 | 200 | 799 | 32,893,221 | 64
MIRACL | Spanish (es) | 1962 | 200 | 648 | 10,373,953 | 65
MIRACL | Persian (fa) | 1907 | 200 | 632 | 2,207,172 | 48
MIRACL | Finnish (fi) | 2697 | 200 | 1271 | 1,883,509 | 40
MIRACL | French (fr) | 943 | 200 | 343 | 14,636,953 | 54
MIRACL | Hindi (hi) | 969 | 200 | 350 | 506,264 | 68
MIRACL | Indonesian (id) | 3871 | 200 | 960 | 1,446,315 | 47
MIRACL | Japanese (ja) | 3277 | 200 | 860 | 6,953,614 | 95
MIRACL | Korean (ko) | 868 | 200 | 213 | 1,486,752 | 101
MIRACL | Russian (ru) | 4483 | 200 | 1252 | 9,543,918 | 44
MIRACL | Swahili (sw) | 1701 | 200 | 482 | 131,924 | 35
MIRACL | Telugu (te) | 3252 | 200 | 828 | 518,079 | 50
MIRACL | Thai (th) | 2772 | 200 | 733 | 542,166 | 107
MIRACL | Chinese (zh) | 1112 | 200 | 393 | 4,934,368 | 88
Table 2. Multilingual retrieval performance on the MLDR dev set (measured by nDCG@10). Bold values indicate the best performance in each column.
Model | Avg | ar | de | en | es | fr | hi | it | ja | ko | pt | ru | th | zh
BM25 | 53.6 | 45.1 | 52.6 | 57.0 | 78.0 | 75.7 | 43.7 | 70.9 | 36.2 | 25.7 | 82.6 | 61.3 | 33.6 | 34.6
BGE-M3 (Dense) | 52.5 | 47.6 | 46.1 | 48.9 | 74.8 | 73.8 | 40.7 | 62.7 | 50.9 | 42.9 | 74.4 | 59.5 | 33.6 | 26.0
BGE-M3 (Sparse) | 62.2 | 58.7 | 53.0 | 62.1 | 87.4 | 82.7 | 49.6 | 74.7 | 53.9 | 47.9 | 85.2 | 72.9 | 40.3 | 40.5
BGE-M3 (Multi-vec) | 57.6 | 56.6 | 50.4 | 55.8 | 79.5 | 77.2 | 46.6 | 66.8 | 52.8 | 48.8 | 77.5 | 64.2 | 39.4 | 32.7
BGE-M3 (Dense + Sparse) | 64.8 | 63.0 | 56.4 | 64.2 | 88.7 | 84.2 | 52.3 | 75.8 | 58.5 | 53.1 | 86.0 | 75.6 | 42.9 | 42.0
BGE-M3 (Dense + Sparse + Multi-vec) | 65.0 | 64.7 | 57.9 | 63.8 | 86.8 | 83.9 | 52.2 | 75.5 | 60.1 | 55.7 | 85.4 | 73.8 | 44.7 | 40.0
mGTE-TRM (Dense) | 56.5 | 55.0 | 54.9 | 51.0 | 81.2 | 76.2 | 45.2 | 66.7 | 52.1 | 46.7 | 79.1 | 64.2 | 35.3 | 27.4
mGTE-TRM (Sparse) | 71.0 | 74.3 | 66.2 | 66.4 | 93.6 | 88.4 | 61.0 | 82.2 | 66.2 | 64.2 | 89.9 | 82.0 | 47.4 | 41.8
mGTE-TRM (Dense + Sparse) | 71.3 | 74.6 | 66.6 | 66.5 | 93.6 | 88.6 | 61.6 | 83.0 | 66.7 | 64.6 | 89.8 | 82.1 | 47.7 | 41.4
STR (Dense) | 61.8 | 63.5 | 59.7 | 48.9 | 86.3 | 82.1 | 51.0 | 73.2 | 60.4 | 55.1 | 82.3 | 74.2 | 39.6 | 27.2
LTR (Sparse) | 67.3 | 70.5 | 61.0 | 64.4 | 87.5 | 82.9 | 51.0 | 79.8 | 66.7 | 64.7 | 83.7 | 74.1 | 37.5 | 51.3
HTR (Dense + Sparse) | 74.3 | 80.0 | 69.6 | 67.7 | 92.7 | 91.2 | 62.8 | 87.4 | 69.6 | 72.5 | 88.9 | 85.0 | 45.4 | 53.2
Table 3. Multilingual retrieval performance on the MIRACL dev set (measured by nDCG@10). Bold values indicate the best performance in each column.
Model | Avg | ar | bn | en | es | fa | fi | fr | hi | id | ja | ko | ru | sw | te | th | zh
BM25 | 31.7 | 39.5 | 48.2 | 26.7 | 7.7 | 28.7 | 45.8 | 11.5 | 35.0 | 29.7 | 31.2 | 37.1 | 25.6 | 35.1 | 38.3 | 49.1 | 17.5
BGE-M3 (Dense) | 68.9 | 78.4 | 80.0 | 56.9 | 55.5 | 57.7 | 78.6 | 57.8 | 59.3 | 56.0 | 72.8 | 69.9 | 70.1 | 78.6 | 86.2 | 82.6 | 61.7
BGE-M3 (Sparse) | 54.2 | 67.1 | 68.7 | 43.7 | 38.8 | 45.2 | 65.3 | 35.5 | 48.2 | 48.9 | 56.3 | 61.5 | 44.5 | 57.9 | 79.0 | 70.9 | 36.3
BGE-M3 (Multi-vec) | 70.3 | 79.6 | 81.1 | 59.4 | 57.2 | 58.8 | 80.1 | 59.0 | 61.4 | 58.2 | 74.5 | 71.2 | 71.2 | 79.0 | 87.9 | 83.0 | 62.7
BGE-M3 (Dense + Sparse) | 70.0 | 79.6 | 80.7 | 58.8 | 57.5 | 59.2 | 79.7 | 57.6 | 62.8 | 58.3 | 73.9 | 71.3 | 69.8 | 78.5 | 87.2 | 83.1 | 62.5
BGE-M3 (Dense + Sparse + Multi-vec) | 71.2 | 80.2 | 81.5 | 59.8 | 59.2 | 60.3 | 80.4 | 60.7 | 63.2 | 59.1 | 75.2 | 72.2 | 71.7 | 79.6 | 88.2 | 83.8 | 63.9
mGTE-TRM (Dense) | 63.1 | 71.4 | 72.7 | 54.1 | 51.4 | 51.2 | 73.5 | 53.9 | 51.6 | 50.3 | 65.8 | 62.7 | 63.2 | 69.9 | 83.0 | 74.0 | 60.8
mGTE-TRM (Sparse) | 55.8 | 66.5 | 70.4 | 35.6 | 46.2 | 40.0 | 47.6 | 66.5 | 39.8 | 48.9 | 47.9 | 59.3 | 64.3 | 47.1 | 59.4 | 83.0 | 70.5
mGTE-TRM (Dense + Sparse) | 64.7 | 73.4 | 75.1 | 49.9 | 57.6 | 62.7 | 52.0 | 74.7 | 53.5 | 56.4 | 52.8 | 67.1 | 66.7 | 63.5 | 69.5 | 85.2 | 75.8
STR (Dense) | 66.0 | 75.4 | 75.2 | 55.9 | 54.1 | 57.0 | 76.3 | 55.1 | 56.0 | 53.2 | 69.2 | 65.9 | 67.0 | 73.1 | 84.2 | 77.6 | 61.4
LTR (Sparse) | 31.4 | 39.0 | 42.2 | 31.0 | 24.2 | 30.0 | 45.6 | 14.7 | 32.8 | 36.3 | 21.7 | 35.3 | 20.2 | 45.5 | 29.4 | 38.1 | 15.6
HTR (Dense + Sparse) | 67.1 | 76.4 | 76.0 | 58.8 | 55.5 | 58.2 | 77.4 | 55.5 | 57.5 | 56.9 | 69.5 | 66.7 | 66.6 | 75.2 | 84.7 | 78.0 | 60.2
Table 4. Summary results for MIRACL and MLDR (nDCG@10). Bold values indicate the best performance in each column.
Model | Avg | MLDR | MIRACL
previous works
BM25 | 42.6 | 53.6 | 31.7
BGE-M3 (Dense) | 60.7 | 52.5 | 68.9
BGE-M3 (Sparse) | 58.2 | 62.2 | 54.2
BGE-M3 (Multi-vec) | 63.9 | 57.6 | 70.3
BGE-M3 (Dense + Sparse) | 67.4 | 64.8 | 70.0
BGE-M3 (Dense + Sparse + Multi-vec) | 68.1 | 65.0 | 71.2
mGTE-TRM (Dense) | 59.8 | 56.5 | 63.1
mGTE-TRM (Sparse) | 63.4 | 71.0 | 55.8
mGTE-TRM (Dense + Sparse) | 68.0 | 71.3 | 64.7
our work
STR (Dense) | 63.9 | 61.8 | 66.0
LTR (Sparse) | 49.3 | 67.3 | 31.4
rank fusion
HTR (Optimal) | 60.7 | 62.7 | 58.8
HTR (Dense + Sparse) | 67.5 | 70.6 | 64.4
HTR (Oracle) | 75.4 | 78.1 | 72.6
minmax fusion
HTR (Optimal) | 66.7 | 68.1 | 65.3
HTR (Dense + Sparse) | 70.7 | 74.3 | 67.1
HTR (Oracle) | 76.4 | 79.1 | 73.8
Table 5. Comparison of QDAP architectures and loss functions (nDCG@10). Bold values indicate the best performance in each column.
Model Type | Loss Type | Avg | MLDR | MIRACL
QDAP-L (Full model) | WD + CE | 70.7 | 74.3 | 67.1
QDAP-L (Full model) | WD | 67.2 | 70.5 | 63.8
QDAP-L (Full model) | CE | 69.1 | 72.6 | 65.6
QDAP-S (Adapter) | WD + CE | 68.6 | 72.0 | 65.3
QDAP-S (Adapter) | WD | 63.3 | 66.9 | 59.7
QDAP-S (Adapter) | CE | 66.5 | 70.2 | 62.9
Table 6. Analysis of antagonistic sampling strategies (oracle nDCG@10). Bold values indicate the best performance in each column.
Positives | Negatives | Avg | MLDR | MIRACL
No Filtering | Random | 64.3 | 67.0 | 61.5
No Filtering | Sparse | 70.3 | 74.9 | 65.7
No Filtering | Dense | 71.5 | 74.2 | 68.8
No Filtering | Sparse + Dense | 73.1 | 75.7 | 70.6
Dense + Sparse Filtering | Random | 62.9 | 66.3 | 59.6
Dense + Sparse Filtering | Sparse | 67.2 | 68.8 | 65.7
Dense + Sparse Filtering | Dense | 68.5 | 69.6 | 67.4
Dense + Sparse Filtering | Sparse + Dense | 71.7 | 73.9 | 69.6
Sparse Filtering | Random | 66.9 | 69.8 | 64.1
Sparse Filtering | Sparse | 72.8 | 76.4 | 69.1
Sparse Filtering | Dense | 73.4 | 76.5 | 70.2
Sparse Filtering | Sparse + Dense | 76.4 | 79.1 | 73.8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Posokhov, P.; Skrylnikov, S.; Masliukhin, S.; Zavgorodniaia, A.; Koroteeva, O.; Matveev, Y. Query-Adaptive Hybrid Search. Mach. Learn. Knowl. Extr. 2026, 8, 91. https://doi.org/10.3390/make8040091

