4.1. Experimental Setup
- (1)
Experiment platform.
We conduct all experiments on a server equipped with an NVIDIA A100 80 GB GPU and a 96-core CPU. The system runs Ubuntu 24.04 with CUDA 12.1. Our implementation is based on PyTorch 2.4 [41] and FlashInfer [42].
- (2)
Baselines and configurations.
We compare BTHA with representative top-k attention baselines, including Loki [43] and Quest [11]. Loki accelerates top-k attention through low-rank approximation, while Quest performs query-aware block-level KV selection. We further compare BTHA with MagicPIG [44], which accelerates top-k attention using LSH [15]. LSH is a randomized hashing method that typically relies on random projections to generate hash codes. Unlike learning-to-hash methods, LSH usually requires a large number of hash bits to maintain retrieval accuracy.
In addition to top-k attention baselines, we compare BTHA with several KV cache compression methods, including StreamingLLM [9], H2O [8], and SnapKV [10]. For all baselines, we adopt the recommended configurations from the original papers, including hyperparameters such as channel numbers, block sizes, and KV cache budgets when applicable. For BTHA, we set the number of hash bits to
and the block selection ratio to
, which provide a robust configuration across most tasks. Following Quest [11], we use vanilla full attention in the first two layers, since these layers are commonly observed to be outlier layers in top-k attention methods. We also include the vanilla Transformer with full attention, denoted as dense, as a reference baseline to evaluate both the effectiveness and efficiency of BTHA.
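To make the LSH baseline description above concrete, the following is a minimal sketch of sign random projection hashing, one common random-projection LSH scheme. It is not MagicPIG's implementation; the function names, the 64-bit code length, and the 16-dimensional toy vectors are all illustrative assumptions.

```python
import random

def make_projections(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes of dimension dim (untrained)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def lsh_code(vec, projections):
    """One bit per hyperplane: 1 if vec lies on the non-negative side, else 0."""
    return [1 if sum(p * v for p, v in zip(row, vec)) >= 0 else 0
            for row in projections]

def hamming(a, b):
    """Number of positions at which two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

# Toy data (assumed): a query, a slightly perturbed copy, and its negation.
dim = 16
planes = make_projections(num_bits=64, dim=dim)
rng = random.Random(1)
query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
near = [q + 0.05 * rng.gauss(0.0, 1.0) for q in query]
opposite = [-q for q in query]

code_q = lsh_code(query, planes)
print("near vector:    ", hamming(code_q, lsh_code(near, planes)), "bits differ")
print("opposite vector:", hamming(code_q, lsh_code(opposite, planes)), "bits differ")
```

Because the hyperplanes are drawn at random rather than fitted to query-key statistics, each bit carries little information on its own, which is one intuition for why such schemes tend to need many bits.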
- (3)
Models and datasets.
We evaluate BTHA on two mainstream large language models, Llama-2 [45] and Llama-3.1 [46]. The evaluation is conducted on two widely used long-context benchmarks, LongBench-E [47] and RULER [48]. LongBench-E is a multi-task benchmark covering question answering, document summarization, few-shot learning, synthetic tasks, and code understanding, which provides a comprehensive evaluation of long-context language modeling ability. RULER focuses on controlled long-context evaluation, especially retrieval-oriented tasks over extremely long contexts. These benchmarks allow us to evaluate both the task-level accuracy and long-range retrieval capability of BTHA under different context lengths.
4.2. Accuracy Evaluation
- (1)
Evaluation on LongBench-E.
We first evaluate all methods on LongBench-E, which covers question answering, document summarization, synthetic tasks, few-shot learning, and code understanding. As shown in Table 2, BTHA achieves strong accuracy under the same 512 KV position budget. On Llama-2-7B-32K-Instruct, BTHA obtains an average score of
, slightly surpassing the dense full-attention baseline with
and outperforming all compared sparse attention and KV cache compression baselines. On Llama-3.1-8B-Instruct, BTHA achieves an average score of
, which is very close to the dense full-attention baseline of
, with only a
point gap. Compared with non-dense baselines, BTHA also achieves the best average score, outperforming Loki, Quest, MagicPIG, StreamingLLM, H2O, and SnapKV. These results indicate that BTHA can effectively preserve task-level accuracy while attending to only a limited number of KV positions.
- (2)
Evaluation on RULER.
We further evaluate all methods on RULER, a controlled long-context benchmark designed to test retrieval, tracing, aggregation, and question-answering capabilities under extended context lengths. As shown in Table 3, BTHA consistently maintains accuracy close to the dense full-attention baseline on both evaluated models. On Llama-2-7B-32K-Instruct, BTHA achieves an average score of
, only
points lower than dense full attention, while clearly outperforming all non-dense baselines. On Llama-3.1-8B-Instruct, BTHA obtains an average score of
, again only
points below dense full attention and substantially higher than Loki, Quest, MagicPIG, StreamingLLM, H2O, and SnapKV. These results demonstrate that BTHA can preserve long-range retrieval ability through block-level coarse routing and hash-based fine-grained retrieval, providing a favorable accuracy-efficiency trade-off for long-context inference.
- (3)
Task-level observations.
Although BTHA achieves the best or near-best average accuracy across the evaluated benchmarks, its gains are not uniform across all tasks. On LongBench-E, BTHA obtains larger improvements on tasks where the useful evidence is relatively sparse or concentrated, such as LCC, RepoBench, TQA, MultiFieldQA, and MultiNews. These tasks benefit from the Block-then-Hash design because block-level routing first narrows the search space to relevant context regions, while hash-based retrieval further selects fine-grained KV entries within the routed candidates. On RULER, BTHA performs particularly well on retrieval-oriented tasks such as needle search and multi-query retrieval, indicating that the learned hash codes can effectively preserve salient long-range evidence under a strict KV budget. Meanwhile, BTHA still shows small gaps compared with dense attention on several tasks, especially those requiring broad-context aggregation, highly diffuse evidence, or fine-grained counting and verification. For example, on LongBench-E, the gaps are more visible on tasks such as Passage Count and GovReport, while on RULER, performance drops are observed on some aggregation-oriented or full-context matching tasks such as FWE, NMK2, and NMV. This suggests that when the answer depends on many weakly relevant tokens distributed across the context, a fixed KV selection budget may discard information that is individually low-ranked but collectively useful. Overall, these task-level results show that BTHA is most effective when the attention mass is concentrated on a small subset of informative tokens, while further improvement is still possible for tasks requiring more exhaustive context coverage.
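The contrast between concentrated and diffuse evidence can be illustrated with a small numerical sketch. The attention logits below are synthetic and chosen only for illustration (they are not measured from any model), and `topk_mass` is a hypothetical helper that reports how much softmax attention mass a fixed budget of k positions can cover.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def topk_mass(scores, k):
    """Fraction of softmax attention mass held by the k highest-scoring positions."""
    probs = sorted(softmax(scores), reverse=True)
    return sum(probs[:k])

n, k = 1024, 16  # context length and KV budget (illustrative values)

# Concentrated evidence: 8 strongly relevant positions, the rest background.
concentrated = [8.0 if i < 8 else 0.0 for i in range(n)]
# Diffuse evidence: 256 weakly relevant positions, the rest background.
diffuse = [1.0 if i < 256 else 0.0 for i in range(n)]

print("concentrated:", round(topk_mass(concentrated, k), 3))
print("diffuse:     ", round(topk_mass(diffuse, k), 3))
```

In the concentrated case the budget captures most of the attention mass, whereas in the diffuse case the same budget covers only a small fraction, even when the selection itself is perfect.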
- (4)
End-to-end efficiency.
We further evaluate the end-to-end inference efficiency under the same configuration, with a prefill length of 36K tokens and a decode length of 3.6K tokens. In addition to the decoding latency, we also report the prefill time to provide a comprehensive comparison of the overall inference cost. As shown in Figure 3, BTHA consistently reduces the decoding time compared with dense full attention. On Llama-2-7B-32K-Instruct, BTHA achieves the lowest decoding cost among the compared methods, reducing the decode time from about 189 s for dense attention to about 95 s. On Llama-3.1-8B-Instruct, BTHA also substantially decreases the decoding cost, reducing the decode time from about 292 s to about 196 s. The prefill overhead of BTHA remains comparable to other methods, indicating that the proposed Block-then-Hash selection mainly accelerates the decoding stage without introducing significant prefill overhead. These results demonstrate that BTHA improves end-to-end inference efficiency while maintaining competitive accuracy under a fixed token selection budget.
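The decode times quoted above can be turned into speedup factors with a few lines of arithmetic. The sketch below uses only the approximate figures from the text ("about" 189 s, 95 s, 292 s, and 196 s), so the derived ratios are approximate as well; the helper names are illustrative.

```python
def speedup(dense_s, sparse_s):
    """Decode-time speedup of the sparse method over dense attention."""
    return dense_s / sparse_s

def reduction_pct(dense_s, sparse_s):
    """Percentage of dense decode time that is saved."""
    return 100.0 * (dense_s - sparse_s) / dense_s

# (dense decode time, BTHA decode time) in seconds, as quoted in the text.
decode_times = {
    "Llama-2-7B-32K-Instruct": (189.0, 95.0),
    "Llama-3.1-8B-Instruct": (292.0, 196.0),
}

for model, (dense_s, btha_s) in decode_times.items():
    print(f"{model}: {speedup(dense_s, btha_s):.2f}x speedup, "
          f"{reduction_pct(dense_s, btha_s):.1f}% less decode time")
```

Under these figures, the decode-stage speedup is roughly 2.0x on Llama-2-7B-32K-Instruct and roughly 1.5x on Llama-3.1-8B-Instruct.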
- (5)
Layer-wise latency under different sequence lengths.
To further analyze the scalability of BTHA, we compare the latency of a single Transformer layer under different sequence lengths. As shown in Figure 4, the latency of dense attention increases rapidly as the sequence length grows, since it needs to access and compute attention scores over all cached KV positions. In contrast, BTHA maintains a much slower latency growth under the same 1.56% token selection ratio. For Llama-2, BTHA achieves the lowest latency across long sequence lengths and shows a clear advantage over Dense, Loki, and Quest when the sequence length increases to 128K and 256K. For Llama-3.1 with GQA, BTHA also keeps the latency nearly stable and remains substantially faster than Dense and Loki at long contexts. These results indicate that the Block-then-Hash selection mechanism effectively reduces the per-layer decoding cost and scales better with increasing context length.
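As a rough guide to what a 1.56% selection ratio means in absolute terms, the sketch below converts it into a number of selected KV positions per sequence length. It assumes that 1.56% is the rounded value of 1/64 and that "K" denotes 1024 tokens; neither is stated explicitly in the text.

```python
def selected_tokens(seq_len, ratio=1 / 64):
    """Number of KV positions kept under a fixed selection ratio (assumed 1/64)."""
    return int(seq_len * ratio)

for seq_len in (32 * 1024, 128 * 1024, 256 * 1024):
    print(f"{seq_len:>7} tokens -> {selected_tokens(seq_len)} selected KV positions")
```

Under these assumptions, a 256K context corresponds to 4096 selected positions, which is consistent with the 4096-token budget used for the 256K RULER setting in the model-scale analysis.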
- (6)
Model-scale generalization analysis.
To further examine whether the sparse retrieval mechanism remains effective across different model scales and context settings, we additionally evaluate BTHA on Qwen2.5-14B-Instruct-1M and Qwen2.5-32B-Instruct. As shown in Table 4, Table 5 and Table 6, BTHA achieves accuracy close to dense attention under strict sparse token budgets. On LongBench-E, BTHA obtains comparable average performance to dense attention on both the 14B and 32B models with only 512 selected KV tokens, with average gaps of 0.37 and 0.28 points, respectively. On RULER with 256K context length, BTHA nearly matches dense attention under a 4096-token budget, with only a 0.04-point average gap, while outperforming the token-level top-k baseline. These results suggest that the proposed sparse retrieval strategy can generalize to larger models and longer contexts.
4.3. Ablation Study
We conduct ablation experiments on Llama-3.1-8B-Instruct to examine the effectiveness of the main design choices in BTHA. The evaluation focuses on the block selection ratio, the necessity of the Block-then-Hash retrieval pipeline, and the contribution of offline hash training. Unless otherwise specified, all experiments are conducted on LongBench-E under the same KV position budget.
- (1)
Effect of block selection ratio.
The block selection ratio controls how many KV blocks are routed into the candidate set before hash-based token-level retrieval. A smaller ratio reduces the candidate search space and improves efficiency, but it may discard relevant blocks before fine-grained retrieval. In contrast, a larger ratio preserves more historical information but introduces more redundant candidates, increasing the overhead of hash comparison and sparse KV gathering.
Figure 5 and Table 7 show the LCC results under different values of the block selection ratio. The results indicate that increasing the ratio does not monotonically improve performance. When the ratio is too small, the routed candidate set may miss useful KV positions; when it is too large, excessive irrelevant tokens are passed to the hash retrieval stage, which weakens fine-grained selection. BTHA achieves the best LCC score of 68.62 when
, suggesting that a moderate block selection ratio provides a better balance between candidate recall and redundancy reduction.
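The role of the block selection ratio can be sketched as follows. This is a hypothetical routing rule, not BTHA's actual one, which this section does not specify: each block is summarized by the mean of its keys and scored by its dot product with the query. The function name `route_blocks` and the toy data are likewise assumptions.

```python
def route_blocks(query, keys, block_size, ratio):
    """Return the KV positions inside the top-scoring blocks.

    Assumed scoring rule: dot product between the query and the mean key
    of each block. `ratio` is the fraction of blocks that are kept.
    """
    dim = len(query)
    num_blocks = (len(keys) + block_size - 1) // block_size
    scores = []
    for b in range(num_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(k[d] for k in block) / len(block) for d in range(dim)]
        scores.append(sum(q * m for q, m in zip(query, mean_key)))
    keep = max(1, round(ratio * num_blocks))
    top = sorted(range(num_blocks), key=lambda b: scores[b], reverse=True)[:keep]
    positions = []
    for b in sorted(top):
        positions.extend(range(b * block_size, min((b + 1) * block_size, len(keys))))
    return positions

# Toy cache: 8 blocks of 4 keys; only block 5 (positions 20..23) matches the query.
keys = [[0.1, 0.0] for _ in range(32)]
for pos in range(20, 24):
    keys[pos] = [1.0, 0.0]
query = [1.0, 0.0]

candidates = route_blocks(query, keys, block_size=4, ratio=0.25)
print(candidates)
```

A larger `ratio` enlarges the returned candidate list, which raises the chance of covering relevant positions but also passes more irrelevant ones to the later token-level stage.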
- (2)
Effect of the Block-then-Hash retrieval pipeline.
We further evaluate whether both retrieval stages are necessary. To this end, we compare BTHA with two simplified variants. The first variant removes block-level routing and directly performs hash retrieval, denoted as Hash Only. The second variant removes hash-based fine-grained retrieval and only uses block-level routing, denoted as Block Only.
As shown in Table 8, Hash Only achieves an LCC score of 67.46. Although it preserves token-level retrieval, it lacks the coarse block routing stage and therefore suffers from a larger search space. Block Only obtains a lower LCC score of 66.34, indicating that coarse block selection alone is insufficient because all tokens inside the selected blocks are treated uniformly. By combining block-level routing and hash-based token retrieval, BTHA achieves the highest LCC score of 68.62. These results show that the two stages play different and complementary roles. Block routing acts as a coarse-grained recall stage: by selecting query-relevant blocks, it increases the chance of covering dense-attention top-k KV positions while reducing the global search space. Hash retrieval then acts as a fine-grained filtering stage, removing redundant tokens inside the routed blocks and selecting more relevant token-level KV entries. Consequently, the full BTHA pipeline achieves the best LCC score by combining coarse top-k candidate recall with fine-grained token selection.
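The fine-grained stage described above can be sketched as a Hamming-distance top-k search restricted to the routed candidates. The sketch assumes that hash codes are stored as Python integers and that candidates arrive as a list of KV positions; the names and the 4-bit toy codes are illustrative and not taken from the BTHA implementation.

```python
def hamming(a, b):
    """Hamming distance between two hash codes stored as integers."""
    return bin(a ^ b).count("1")

def hash_topk(query_code, key_codes, candidates, k):
    """Keep the k candidate positions whose codes are closest to the query code."""
    ranked = sorted(candidates, key=lambda pos: hamming(query_code, key_codes[pos]))
    return sorted(ranked[:k])

# Toy 4-bit codes for six KV positions (assumed values).
key_codes = [0b1011, 0b0100, 0b1010, 0b1011, 0b0000, 0b1111]
query_code = 0b1011

# Positions 0 and 5 were not routed by the block stage, so they are never scored.
candidates = [1, 2, 3, 4]
selected = hash_topk(query_code, key_codes, candidates, k=2)
print(selected)  # [2, 3]
```

In this picture, a Block Only variant would attend to all four candidates, while a Hash Only variant would have to rank every position in the cache.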
- (3)
Effect of offline hash training.
We also evaluate the effect of offline training on the hash network. Before training, the hash projection is randomly initialized and behaves similarly to random projection-based hashing. Although this provides a simple approximation for similarity search, it is not optimized for the query–key relevance pattern of LLM attention. After offline training, the hash network learns to approximate the KV position ranking induced by full attention, making the learned Hamming distance more consistent with the attention-based top-k positions.
Table 9 reports the Top-k IoU before and after hash training. On Llama-3.1-8B-Instruct, the Top-k IoU improves from 41.32% before training to 59.74% after training. This significant improvement shows that offline training effectively aligns the hash space with full-attention retrieval targets, thereby improving the quality of fine-grained KV position selection.
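One plausible way to compute a Top-k IoU is sketched below. The exact definition used for Table 9 is not given in this section, so the sketch assumes the standard intersection-over-union between the top-k positions under full attention and the top-k positions under Hamming distance. The scores are toy values.

```python
def topk_indices(scores, k, largest=True):
    """Indices of the k largest (or smallest) entries of scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=largest)
    return set(order[:k])

def topk_iou(attn_scores, hamming_dists, k):
    """IoU between attention top-k and hash top-k (smallest distance) position sets."""
    attn_top = topk_indices(attn_scores, k, largest=True)
    hash_top = topk_indices(hamming_dists, k, largest=False)
    return len(attn_top & hash_top) / len(attn_top | hash_top)

attn_scores = [0.05, 0.40, 0.10, 0.30, 0.15]   # toy attention weights
hamming_dists = [5, 1, 2, 4, 3]                # toy Hamming distances to the query

# Attention picks {1, 3}, hashing picks {1, 2}: one shared position out of three.
print(topk_iou(attn_scores, hamming_dists, k=2))
```

An IoU of 1.0 would mean that the hash ranking recovers exactly the positions that full attention ranks highest.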
- (4)
End-to-end efficiency.
The above ablation results show that BTHA benefits from both coarse-grained and fine-grained retrieval. Compared with Hash Only, BTHA avoids global hash comparison over the full KV cache by first routing a compact set of candidate blocks. Compared with Block Only, BTHA further removes redundant tokens inside the selected blocks through learned hash retrieval. Moreover, the trained hash network provides a more attention-aligned retrieval space than random hashing. These results demonstrate that the block routing stage, the hash retrieval stage, and offline hash training jointly contribute to the final accuracy–efficiency trade-off of BTHA.
- (5)
Token budget ablation.
First, we investigate the impact of the token budget on the accuracy of BTHA. As shown in Figure 6, BTHA consistently achieves higher accuracy than Quest and Loki across different token-budget settings. More importantly, BTHA exhibits only mild performance degradation as the token budget decreases. Even under an extremely constrained setting where only about 0.4% of tokens are retained, BTHA still maintains competitive model accuracy. These results indicate that the learned hash-based retrieval mechanism enables BTHA to effectively preserve critical contextual information under strict KV selection budgets, highlighting the potential of hash learning for efficient long-context decoding.
We further investigate the effect of hash length on BTHA by varying the number of hash bits from 32 to 256. As shown in Figure 7, using too few hash bits leads to noticeable accuracy degradation, especially under the 32-bit setting. This is because overly compact hash codes provide limited discrimination capability for distinguishing important KV entries. As the number of hash bits increases, the accuracy improves rapidly and becomes stable when the hash length reaches 128 bits. On both LCC and GovReport, BTHA with 128 or 256 hash bits achieves performance close to the dense-attention baseline, and in some cases even slightly surpasses it. These results indicate that BTHA does not require excessively long hash codes to maintain retrieval quality; a moderate hash length is sufficient to preserve critical contextual information while keeping the hash representation compact and efficient.
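To give a sense of how compact these codes are, the sketch below compares the size of one hash code with the size of one key vector. It assumes fp16 keys and a head dimension of 128 for the comparison; the storage layout BTHA actually uses is not described in this section.

```python
def code_bytes(num_bits):
    """Bytes needed to store one hash code of num_bits bits."""
    return num_bits // 8

def fp16_key_bytes(head_dim=128):
    """Bytes of one key vector per head in fp16 (head_dim is an assumption)."""
    return head_dim * 2

for bits in (32, 64, 128, 256):
    ratio = fp16_key_bytes() / code_bytes(bits)
    print(f"{bits:>3}-bit code: {code_bytes(bits):>2} bytes, "
          f"{ratio:.0f}x smaller than one fp16 key")
```

Under these assumptions, a 128-bit code occupies 16 bytes, one sixteenth of the 256 bytes of a single fp16 key vector, so scanning codes touches far less memory than scanning keys.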