Article

BTHA: Block-Then-Hash Attention for Efficient Long Context

School of Information and Communication Engineering, Hainan University, Haikou 570228, China
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(12), 2635; https://doi.org/10.3390/electronics15122635 (registering DOI)
Submission received: 10 May 2026 / Revised: 25 May 2026 / Accepted: 28 May 2026 / Published: 15 June 2026
(This article belongs to the Special Issue Advanced Computer Science and Intelligent Systems Innovations)

Abstract

Long-context large language models incur substantial computational overhead during autoregressive decoding. Existing sparse attention methods can improve inference efficiency, but they typically rely on fixed sparse patterns, historical attention statistics, or coarse-grained proxy representations to estimate important KV positions, making it difficult to accurately capture query-dependent fine-grained relevance for dynamic KV retrieval. In this paper, we propose Block-then-Hash Attention (BTHA), a two-stage KV retrieval method: it first performs block-level routing with mean key representations to rapidly reduce the candidate search space, and then applies a learnable orthogonal hash network within the routed KV candidates for fine-grained token-level position retrieval. The hash network is trained offline to learn the hash mapping between queries and keys, with a low training cost: on Llama-3.1-8B-Instruct, training can be completed in approximately two hours using a single NVIDIA A100 GPU. During inference, BTHA implements block-level routing, hash-based retrieval, and sparse attention computation with dedicated operators, and further employs CPU–GPU collaborative scheduling to reduce memory access, synchronization, and candidate selection overhead, thereby achieving end-to-end decoding acceleration. Extensive experiments on LongBench-E show that BTHA consistently outperforms state-of-the-art top-K attention methods in both accuracy and efficiency; under a 512-position budget, it achieves the best average accuracy on both Llama-2-7B-32K-Instruct and Llama-3.1-8B-Instruct, while delivering up to 7.0× speedup over vanilla full attention.

1. Introduction

Large language models (LLMs) have shown strong capabilities in long-context understanding, including multi-document question answering, long-document summarization, code understanding, and retrieval-augmented generation [1]. As the context length increases, however, autoregressive decoding becomes increasingly expensive due to the growing key–value (KV) cache. At each decoding step, the current query needs to access historical cached keys and values. For a sequence with length $S$ and head dimension $d$, the decoding cost of full attention for each query head is $O(Sd)$. This cost becomes a major bottleneck in long-context inference, especially when the context length reaches tens or hundreds of thousands of tokens. Recent fault-tolerant intelligent control studies have also emphasized the importance of robustness and deployment reliability under actuator faults and performance constraints [2].
A large body of work has attempted to reduce the cost of long-context attention. Sparse attention methods reduce the quadratic complexity of full attention by imposing local windows, global tokens, random connections, low-rank structures, or routing-based sparse patterns [3,4,5,6,7]. These methods have demonstrated that long-context modeling does not always require dense attention over all positions. However, many sparse attention patterns are either predefined or mainly designed for training-time attention computation, while long-context LLM inference requires efficient and query-dependent KV cache access during decoding.
Another line of research focuses on KV cache compression and selection. Methods such as H2O [8], StreamingLLM [9], SnapKV [10], Quest [11], and related approaches reduce the memory and bandwidth cost by preserving only important cached positions. These methods exploit attention sinks, heavy hitters, observation windows, or query-aware KV selection to reduce the number of cached entries used during inference. Although effective, many of them rely on accumulated attention statistics, heuristic cache policies, or coarse-grained selection rules. Such designs may be insufficient for fine-grained, query-dependent retrieval when the relevant information is sparsely distributed across a long context.
Block-level sparse attention further exploits the structural regularity of long-context attention. By partitioning the sequence into contiguous blocks, block-sparse methods can reduce the retrieval or attention space at a coarse granularity [12,13,14]. Block-level selection is attractive because it is computationally cheaper than token-level selection. Nevertheless, block-level routing alone may retain many irrelevant positions inside selected blocks, since all positions within a selected block are usually treated uniformly. Therefore, a purely block-level method may still introduce redundant KV access, while an overly aggressive block selection may discard important fine-grained positions.
Hash-based attention and retrieval provide another efficient mechanism for approximate similarity search. For example, LSH-based attention methods use randomized projections to assign similar representations into the same buckets [5,15,16]. However, random-projection-based hashing usually requires many hash bits or multiple hash rounds to maintain retrieval accuracy. Moreover, if hash retrieval is applied globally to all cached positions, it still needs to compute Hamming distances over the full KV cache, which limits the practical speedup in long-context decoding. This motivates a coarse-to-fine design: first reduce the candidate search space at the block level, and then apply fine-grained hash retrieval only inside the routed candidates.
We propose BTHA, a Block-then-Hash Attention method for efficient long-context decoding. As illustrated in Figure 1, BTHA decomposes KV position selection into two consecutive stages. First, BTHA partitions the KV cache into fixed-size blocks and scores each block according to the similarity between the current query and the block-level mean key representation, thereby performing coarse-grained block routing. This stage rapidly identifies historical regions that are likely to be relevant to the current query and reduces the search space from the entire KV cache to a compact candidate set. Second, BTHA applies a learnable orthogonal hashing network within the selected candidate blocks to conduct fine-grained KV position retrieval. Unlike random LSH, the hashing projection matrices in BTHA are learned from data and regularized with orthogonality constraints, encouraging non-redundant hash dimensions and preserving the geometric structure of query–key representations before binarization. Finally, BTHA computes standard dot-product softmax attention only over the selected key–value entries. Therefore, approximation is introduced only in the KV retrieval stage, while the final attention aggregation remains exact.
Unlike recent trainable retrieval or neural routing methods [16,17], BTHA does not learn a new attention mechanism, fine-tune the backbone LLM, or directly learn an end-to-end token/block selection policy. Its learnable component is restricted to an offline-trained orthogonal hash mapping, whose objective is to make query–key pairs with high dense-attention relevance have smaller distances in the hash space. In this way, the hash distance better approximates the dense-attention top-k query–key relevance. Meanwhile, the orthogonality constraint encourages non-redundant hash dimensions and helps preserve the geometric structure of the original query–key representation space before binarization, avoiding excessive distortion of the representation space. Therefore, BTHA learns a similarity-preserving KV retrieval space rather than a new language-modeling representation, an attention-weight predictor, or an end-to-end sparse-attention policy. Experimental results show that BTHA achieves a superior accuracy–speed trade-off and delivers end-to-end decoding acceleration, as shown in Figure 2.
The main contributions of this work are summarized as follows:
  • We propose Block-then-Hash Attention, an efficient sparse attention mechanism tailored for long-context decoding. By combining coarse-grained block routing with fine-grained hash-based retrieval, BTHA substantially reduces the KV search space while preserving high attention fidelity.
  • We design a learnable orthogonal hashing network for KV cache retrieval. The network is trained offline to learn query–key hashing mappings, improving fine-grained position retrieval within candidate KV blocks compared with random or purely heuristic hashing schemes.
  • We implement an end-to-end acceleration framework for long-context decoding. Compared with existing top-K attention methods, BTHA achieves better accuracy and efficiency, and obtains up to 7.0× end-to-end decoding speedup over standard full attention.

2. Related Work

2.1. Long-Context Sparse Attention

The quadratic complexity of full self-attention limits the scalability of Transformer models to long-context scenarios [18]. Early long-context architectures explored recurrence, compression, locality, and sparsity to reduce the attention cost. Transformer-XL introduces segment-level recurrence to reuse hidden states across segments and extend the effective context length [19]. Compressive Transformer further compresses old memories into a smaller representation, enabling longer-range dependency modeling under limited memory [20]. Reformer replaces full attention with locality-sensitive hashing attention and reversible layers, reducing the memory and computation cost for long sequences [5]. Longformer adopts sliding-window attention with task-motivated global attention, making attention cost scale linearly with sequence length for long-document tasks [3]. BigBird combines local, global, and random attention patterns and provides theoretical support for sparse attention while maintaining strong performance on long-sequence tasks [4]. Performer approximates softmax attention with positive random features and offers a linear-attention alternative for long sequences [6]. The Routing Transformer clusters tokens and performs content-based sparse attention to avoid attending to all positions [7]. ETC designs global–local sparse attention for long and structured inputs [21]. Nyströmformer approximates self-attention using the Nyström method to reduce complexity while preserving long-range modeling ability [22]. Long Range Arena provides a unified benchmark for evaluating efficient Transformer variants under long-context tasks [23]. Memorizing Transformers extend context access by retrieving key–value memories from a non-differentiable memory bank [24]. These studies demonstrate that sparse or approximate attention can substantially reduce long-context computation. 
Unlike these methods, our work focuses on decoding-time KV position selection: block-level routing first narrows the historical search space, and a learnable hash module then retrieves fine-grained KV positions under an explicit budget, while the final attention aggregation remains exact.

2.2. KV Cache Compression and Selection

KV cache has become a major memory and bandwidth bottleneck in long-context LLM inference. Recent studies exploit the sparsity and structural regularity of attention to compress, evict, or selectively load cached key–value entries. H2O observes that a small set of heavy-hitter tokens contributes most of the attention mass and formulates KV eviction as a dynamic submodular optimization problem [8]. Scissorhands exploits the persistence of token importance and maintains a fixed-size KV cache by preferentially preserving pivotal tokens [25]. StreamingLLM identifies attention sinks and shows that preserving initial sink positions together with recent positions enables stable streaming inference over very long sequences [9]. FastGen profiles attention heads and applies adaptive KV cache policies according to their structural patterns [26]. SnapKV uses an observation window to identify head-specific important prompt positions and compresses the KV cache without fine-tuning [10]. Quest introduces query-aware sparsity and selects critical KV pages according to the current query, reducing memory movement during long-context inference [11]. SparQ Attention selectively fetches cached history to improve bandwidth efficiency during LLM inference [27]. CacheGen compresses and streams KV caches using quantization and entropy coding to reduce serving bandwidth [28]. InfiniGen dynamically manages KV cache offloading and prefetching, fetching only essential KV entries from CPU memory [29]. LazyLLM dynamically prunes prompt tokens and selectively computes KV states for positions relevant to the next-token prediction [30]. DuoAttention separates retrieval heads from streaming heads and applies different KV cache strategies to reduce memory and latency [31]. Ada-KV further studies adaptive head-wise budget allocation for KV cache eviction [32]. 
Compared with these KV cache compression and selection methods, our method does not rely solely on accumulated attention scores, fixed windows, or head-type heuristics. Instead, it combines block-level candidate routing with learnable hash-based fine retrieval, where the block stage is controlled by a block selection ratio and the hash stage is controlled by an absolute KV position budget.

2.3. Block-Sparse Attention

Block-sparse attention exploits the observation that long-context attention often exhibits structured sparsity at the level of contiguous token blocks. Sparse Transformer is an early attempt to reduce attention complexity using structured sparse attention patterns [33]. Longformer and BigBird use local windows, global tokens, and random connections, which can be interpreted as structured sparse attention patterns over long sequences [3,4]. ETC also relies on global–local sparse attention to encode long and structured inputs [21]. The Routing Transformer performs content-based sparse routing, grouping similar tokens so that attention is computed within routed subsets [7]. The Sinkhorn Transformer learns differentiable sorting-based sparse attention patterns for efficient sequence modeling [34]. Nyströmformer reduces attention cost by approximating the global attention matrix with landmark points [22]. The Blockwise Parallel Transformer performs blockwise computation to support large-context training with reduced memory usage [35]. Landmark Attention represents each block with a landmark token and retrieves relevant blocks through the attention mechanism itself [36]. XAttention introduces antidiagonal scoring as an efficient proxy for block importance and prunes unimportant attention blocks during long-context inference [12]. Native Sparse Attention combines coarse-grained token compression, selective token attention, and local attention in a hardware-aligned sparse architecture [13]. MoBA applies mixture-of-experts-style routing to block attention, allowing the model to dynamically select relevant historical blocks [14]. These block-sparse methods mainly reduce attention computation by selecting or approximating attention blocks. In contrast, our method uses block selection only as a coarse routing stage. 
The selected blocks are expanded into candidate KV positions, and a learnable orthogonal hash network further selects fine-grained positions under an absolute budget before exact softmax attention is computed.

3. Method

3.1. Problem Formulation

Consider a decoder-only Transformer at layer $l$. Let the current query of head $h$ be denoted as
$$q_{l,h} \in \mathbb{R}^{d},$$
where $d$ is the head dimension. The cached keys and values are denoted as
$$K_l = \{k_{l,t,g}\}_{t=1}^{S}, \qquad V_l = \{v_{l,t,g}\}_{t=1}^{S},$$
where $S$ is the current KV cache length, $t$ denotes the cached position index, and $g$ is the key–value head index. For models using grouped-query attention (GQA), multiple query heads share one key–value head. We denote the key–value head corresponding to query head $h$ as $g(h)$.
In standard full attention, the output of head $h$ is computed over all cached positions:
$$o_{l,h} = \sum_{t=1}^{S} \alpha_{l,h,t}\, v_{l,t,g(h)},$$
where
$$\alpha_{l,h,t} = \frac{\exp\left(q_{l,h}^{\top} k_{l,t,g(h)} / \sqrt{d}\right)}{\sum_{j=1}^{S} \exp\left(q_{l,h}^{\top} k_{l,j,g(h)} / \sqrt{d}\right)}.$$
When $S$ is large, full attention introduces substantial memory access and computational overhead. Therefore, our goal is to construct a compact subset of cached KV positions
$$\mathcal{T}_{l,h} \subseteq \{1, \dots, S\},$$
and compute exact attention only over the selected positions. Each selected position corresponds to one cached key–value entry. The keys are used for routing and retrieval, while the values are used only for final attention aggregation.

3.2. Block-Level Coarse Routing

Under the squared reconstruction error, the mean vector is the optimal single-vector representative for a given block. Specifically, for a block containing key vectors $\{k_i\}_{i=1}^{B}$, the vector $c$ that minimizes $\sum_{i=1}^{B} \|k_i - c\|_2^2$ is the block mean $c = \frac{1}{B} \sum_{i=1}^{B} k_i$. Motivated by this property, we partition the KV cache along the position dimension into fixed-size blocks and use the mean key of each block as its coarse-grained representation.
Let the block size be $B$ and the number of blocks be
$$N = \lceil S / B \rceil.$$
The position index set of the $b$-th block is denoted as $\mathcal{I}_b$. For each block and key–value head, we compute a mean key representation:
$$\bar{k}_{l,b,g} = \frac{1}{|\mathcal{I}_b|} \sum_{t \in \mathcal{I}_b} k_{l,t,g}.$$
For the last incomplete block, the mean is computed using the actual number of valid positions, avoiding the dilution caused by padding.
The relevance score between query head $h$ and block $b$ is computed by the scaled dot product between the query and the block-level mean key:
$$s_{l,h,b} = \frac{q_{l,h}^{\top} \bar{k}_{l,b,g(h)}}{\sqrt{d}}.$$
For each query head, we select the top-ranked blocks according to a block selection ratio $\rho_b$:
$$\mathcal{B}_{l,h} = \operatorname{TopK}_{b}\left(s_{l,h,b}, \lceil \rho_b N \rceil\right),$$
where $\rho_b \in (0, 1]$ controls the fraction of selected blocks. The selected blocks are expanded into a candidate KV position set:
$$\mathcal{C}_{l,h} = \bigcup_{b \in \mathcal{B}_{l,h}} \mathcal{I}_b \cup \mathcal{S}_{\text{sink}},$$
where $\mathcal{S}_{\text{sink}}$ denotes the sink positions that are always retained for stable access to global context.
The block-level routing stage serves as a coarse retrieval module. It reduces the search space before fine-grained retrieval while keeping the selection process aligned with the key vectors in the KV cache.
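The routing stage described above can be illustrated with a minimal pure-Python sketch for a single query head. It is not the paper's CUDA operator: the function name `block_route`, the 0-indexed positions, and the ceiling used to turn the block selection ratio into a block count are assumptions made for illustration, and the block means are recomputed here rather than read from an incrementally maintained cache.

```python
import math


def block_route(query, keys, block_size, ratio, sink_positions=()):
    """Return sorted candidate positions chosen by mean-key block routing.

    Positions are 0-indexed here, whereas the text indexes them from 1.
    """
    d = len(query)
    num_keys = len(keys)
    num_blocks = math.ceil(num_keys / block_size)

    scores = []
    for b in range(num_blocks):
        block = keys[b * block_size:(b + 1) * block_size]  # tail block may be short
        # Average over the valid positions only, so a short tail block is not diluted.
        mean_key = [sum(column) / len(block) for column in zip(*block)]
        scores.append(sum(q * m for q, m in zip(query, mean_key)) / math.sqrt(d))

    # Assumption: the ratio is rounded up, so at least one block is always kept.
    num_selected = math.ceil(ratio * num_blocks)
    top_blocks = sorted(range(num_blocks), key=lambda b: scores[b], reverse=True)
    top_blocks = top_blocks[:num_selected]

    # Expand the selected blocks into positions and always keep the sink positions.
    candidates = set(sink_positions)
    for b in top_blocks:
        candidates.update(range(b * block_size, min((b + 1) * block_size, num_keys)))
    return sorted(candidates)
```

With a ratio of 1.0 every block is selected, so the candidate set covers the whole cache and the later stages alone decide which positions are kept.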

3.3. Shared Learnable Orthogonal Hash Network

After block-level routing, BTHA applies a shared learnable orthogonal hash network to perform fine-grained position-level retrieval within the candidate KV position set. Unlike asymmetric query–key hashing, the query and key are projected by the same hash matrix. For layer $l$ and query head $h$, the continuous hash projections are defined as
$$u^{q}_{l,h} = W_{l,h}\, q_{l,h}, \qquad u^{k}_{l,t,h} = W_{l,h}\, k_{l,t,g(h)},$$
where $W_{l,h} \in \mathbb{R}^{r \times d}$ is a learnable hash projection matrix, $r$ is the number of hash bits, and $d$ is the head dimension.
When $r = d$, the hash matrix is constrained as a square orthogonal matrix:
$$W_{l,h}^{\top} W_{l,h} = I.$$
When $r < d$, the hash matrix is constrained to have orthonormal rows:
$$W_{l,h} W_{l,h}^{\top} = I_r.$$
This constraint encourages different hash dimensions to capture non-redundant directions in the representation space.
During inference, binary hash codes are obtained by the sign function:
$$b^{q}_{l,h} = \operatorname{sign}\left(W_{l,h}\, q_{l,h}\right), \qquad b^{k}_{l,t,h} = \operatorname{sign}\left(W_{l,h}\, k_{l,t,g(h)}\right),$$
where each element of the binary code belongs to $\{-1, +1\}$. The Hamming distance between the query code and the cached key code is defined as
$$d_H\left(q_{l,h}, k_{l,t,g(h)}\right) = \sum_{i=1}^{r} \mathbb{I}\left[b^{q}_{l,h,i} \neq b^{k}_{l,t,h,i}\right].$$
A smaller Hamming distance indicates that the cached key at position $t$ is more similar to the current query in the learned hash space.
During decoding, the key hash codes are maintained in a key hash cache. When a new key is appended to the KV cache, only the hash code of the newly appended key needs to be computed and cached. Similarly, the block mean cache is updated incrementally: completed blocks are reused, while only the current tail block is updated.
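As a rough illustration of the sign hashing and the incrementally maintained key hash cache, the sketch below packs each code into a Python integer and computes the Hamming distance with XOR followed by a bit count. The names `sign_bits`, `hamming`, and `KeyHashCache` are hypothetical, the convention that a zero projection maps to +1 is an assumption, and a GPU kernel would use fixed-width machine words with a hardware population count instead of arbitrary-precision integers.

```python
def sign_bits(projection, vector):
    """Pack sign(W x) into one integer; bit i is set when row i gives a value >= 0."""
    code = 0
    for i, row in enumerate(projection):
        # Assumption: a projection of exactly zero is treated as +1.
        if sum(w * x for w, x in zip(row, vector)) >= 0.0:
            code |= 1 << i
    return code


def hamming(code_a, code_b):
    """Hamming distance: XOR marks the mismatching bits, then count the set bits."""
    return bin(code_a ^ code_b).count("1")


class KeyHashCache:
    """Stores one packed hash code per cached key, appended as decoding proceeds."""

    def __init__(self, projection):
        self.projection = projection
        self.codes = []

    def append(self, key):
        # Only the newly appended key is hashed; earlier codes are reused as-is.
        self.codes.append(sign_bits(self.projection, key))

    def distances(self, query):
        query_code = sign_bits(self.projection, query)
        return [hamming(query_code, code) for code in self.codes]
```

Because each append hashes only the new key, the per-step cost of maintaining the cache does not depend on the current context length.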

3.4. Orthogonality Analysis

The orthogonal constraint is introduced to preserve the geometry of hidden representations before binarization. We first consider the square case where $W \in \mathbb{R}^{d \times d}$ and
$$W^{\top} W = I.$$
For any vector $x \in \mathbb{R}^{d}$, the projected representation is
$$u = W x.$$
Then its squared norm is
$$\|u\|_2^2 = (W x)^{\top} (W x) = x^{\top} W^{\top} W x = x^{\top} x = \|x\|_2^2.$$
Therefore, an orthogonal projection preserves the vector norm.
Similarly, for any two vectors $x, y \in \mathbb{R}^{d}$, we have
$$(W x)^{\top} (W y) = x^{\top} W^{\top} W y = x^{\top} y.$$
Thus, the dot product between two vectors is also preserved. Since cosine similarity is defined by the normalized dot product, we further have
$$\cos(W x, W y) = \frac{(W x)^{\top} (W y)}{\|W x\|_2 \, \|W y\|_2} = \frac{x^{\top} y}{\|x\|_2 \, \|y\|_2} = \cos(x, y).$$
Therefore, before binarization, the shared orthogonal hash projection preserves the norm, dot product, and angular relation of the original query–key representations. This property is important because attention relevance is originally measured by the scaled dot product between the query and cached keys.
For the rectangular case $W \in \mathbb{R}^{r \times d}$ with $r < d$, we constrain the rows of $W$ to be orthonormal:
$$W W^{\top} = I_r.$$
Although the full norm and dot product in $\mathbb{R}^{d}$ are not exactly preserved after dimensionality reduction, the orthogonality constraint still ensures that the projected hash dimensions are non-redundant and have equal scale. This prevents the learned hash matrix from collapsing into correlated or degenerate projection directions.
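The two cases can be checked numerically with a small sketch. A Householder reflection stands in for the learned matrix here purely because it is an easy way to build an exactly orthogonal matrix. This is an illustrative choice and not the paper's parameterization, and the helper names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def householder(v):
    """Return H = I - 2 v v^T / (v^T v), which is orthogonal for any nonzero v."""
    n = len(v)
    norm_sq = dot(v, v)
    return [
        [(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq for j in range(n)]
        for i in range(n)
    ]


W = householder([1.0, 2.0, 3.0, 4.0])
x = [0.5, -1.0, 2.0, 0.25]
y = [1.5, 0.0, -0.5, 3.0]

# Square case (r = d): the dot product, and hence norm and cosine, are preserved.
square_gap = abs(dot(matvec(W, x), matvec(W, y)) - dot(x, y))

# Rectangular case (r < d): keep two rows. They stay orthonormal, but the
# projection can only shrink the squared norm, so geometry is no longer exact.
W_r = W[:2]
row_gram = [[dot(a, b) for b in W_r] for a in W_r]
projected_norm_sq = dot(matvec(W_r, x), matvec(W_r, x))
```

Here `square_gap` is zero up to floating-point rounding, `row_gram` is the 2 by 2 identity up to rounding, and `projected_norm_sq` never exceeds the squared norm of `x`.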

3.5. Training Objective

The hash network is trained to approximate the KV position ranking induced by full attention. For each query, we first compute the full attention logits over all cached positions:
$$a_{l,h,t} = \frac{q_{l,h}^{\top} k_{l,t,g(h)}}{\sqrt{d}}.$$
The positions with the largest attention logits are treated as positive cached positions:
$$\mathcal{P}_{l,h} = \operatorname{TopK}_{t}\left(a_{l,h,t}, K_{\text{oracle}}\right),$$
where $K_{\text{oracle}}$ is the oracle positive-position budget. Negative positions are sampled from the remaining cached positions:
$$\mathcal{N}_{l,h} \subseteq \{1, \dots, S\} \setminus \mathcal{P}_{l,h}.$$
Since the sign function is non-differentiable, directly optimizing the binary hash codes is difficult. Therefore, during training, we replace the hard sign function with a smooth continuous relaxation:
$$\tilde{b} = 2 \cdot \operatorname{Sigmoid}(\gamma u) - 1,$$
where $\gamma$ is a temperature coefficient. As $\gamma$ grows, the relaxed output approaches the hard binary code produced by the sign function:
$$\lim_{\gamma \to \infty} \left(2 \cdot \operatorname{Sigmoid}(\gamma u) - 1\right) = \operatorname{sign}(u).$$
We define the relaxed hash-space similarity between a query and a cached key as
$$\tilde{s}_H(q, k) = \frac{1}{r}\, \tilde{b}_q^{\top} \tilde{b}_k.$$
This relaxed similarity is differentiable and is consistent with Hamming distance in the binary case. Specifically, when $b_q, b_k \in \{-1, +1\}^{r}$, each matching bit contributes $+1$ to the inner product and each mismatching bit contributes $-1$, so $b_q^{\top} b_k = r - 2\, d_H(b_q, b_k)$ and the Hamming distance satisfies
$$d_H(b_q, b_k) = \frac{1}{2}\left(r - b_q^{\top} b_k\right).$$
Therefore, maximizing the binary inner product is equivalent to minimizing the Hamming distance.
The pairwise ranking objective encourages positive cached positions to have larger relaxed hash similarity than negative cached positions:
$$\mathcal{L}_{\text{rank}} = \frac{1}{|\mathcal{P}|\,|\mathcal{N}|} \sum_{p \in \mathcal{P}} \sum_{n \in \mathcal{N}} \max\left(0,\; m - \tilde{s}_H(q, k_p) + \tilde{s}_H(q, k_n)\right),$$
where $m$ is the ranking margin.
To prevent hash bit collapse, we introduce a bit balance loss:
$$\mathcal{L}_{\text{bal}} = \left\| \frac{1}{M} \sum_{i=1}^{M} \tilde{b}_i \right\|_2^2,$$
where $M$ is the number of relaxed hash representations in a mini-batch.
Furthermore, to explicitly encourage the learned hash matrix to be orthogonal, we introduce an orthogonality regularization term:
$$\mathcal{L}_{\text{orth}} = \left\| W_{l,h} W_{l,h}^{\top} - I_r \right\|_F^2.$$
The final training objective is
$$\mathcal{L} = \mathcal{L}_{\text{rank}} + \lambda_{\text{bal}} \mathcal{L}_{\text{bal}} + \lambda_{\text{orth}} \mathcal{L}_{\text{orth}},$$
where $\lambda_{\text{bal}}$ and $\lambda_{\text{orth}}$ control the strengths of the balance and orthogonality regularization terms, respectively.
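To make the three loss terms concrete, the following sketch evaluates them for one query in plain Python. It computes forward values only, with no gradients or optimizer. The function name `hash_losses` and the default values of `gamma`, `margin`, `lam_bal`, and `lam_orth` are placeholders rather than the paper's settings, and the balance term is read as the squared norm of the batch-mean relaxed code, which is an interpretation of the bit balance loss.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relaxed_code(projection, vector, gamma):
    # 2 * sigmoid(z) - 1 equals tanh(z / 2); tanh does not overflow for large gamma.
    return [math.tanh(0.5 * gamma * dot(row, vector)) for row in projection]


def hash_losses(projection, query, positives, negatives,
                gamma=10.0, margin=0.5, lam_bal=0.1, lam_orth=0.1):
    """Evaluate the ranking, balance, and orthogonality terms for one query."""
    r = len(projection)
    q_code = relaxed_code(projection, query, gamma)
    pos_codes = [relaxed_code(projection, k, gamma) for k in positives]
    neg_codes = [relaxed_code(projection, k, gamma) for k in negatives]

    def similarity(code):
        return dot(q_code, code) / r  # relaxed hash-space similarity

    # Pairwise hinge: every positive should beat every negative by the margin.
    rank = sum(max(0.0, margin - similarity(p) + similarity(n))
               for p in pos_codes for n in neg_codes)
    rank /= len(pos_codes) * len(neg_codes)

    # Balance: squared norm of the mean relaxed code over all codes in the batch.
    codes = [q_code] + pos_codes + neg_codes
    mean_code = [sum(c[i] for c in codes) / len(codes) for i in range(r)]
    bal = dot(mean_code, mean_code)

    # Orthogonality: squared Frobenius norm of W W^T - I_r.
    orth = sum((dot(projection[i], projection[j]) - (1.0 if i == j else 0.0)) ** 2
               for i in range(r) for j in range(r))

    total = rank + lam_bal * bal + lam_orth * orth
    return {"rank": rank, "bal": bal, "orth": orth, "total": total}
```

In an actual training loop these terms would be written in an automatic-differentiation framework so that the projection can be updated by gradient descent.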

3.6. Block-Then-Hash Inference

During inference, BTHA first performs block-level routing and obtains the candidate KV position set $\mathcal{C}_{l,h}$. The block stage is controlled by the block selection ratio $\rho_b$, while the hash stage is controlled by an absolute KV position budget $K_{\text{hash}}$. The final selected position set is defined as
$$\mathcal{T}_{l,h} = \begin{cases} \mathcal{C}_{l,h}, & |\mathcal{C}_{l,h}| \le K_{\text{hash}}, \\ \operatorname{TopK}^{\text{smallest}}_{t \in \mathcal{C}_{l,h}}\left(d_H\left(q_{l,h}, k_{l,t,g(h)}\right), K_{\text{hash}}\right), & |\mathcal{C}_{l,h}| > K_{\text{hash}}. \end{cases}$$
After the final KV position subset $\mathcal{T}_{l,h}$ is obtained, the output is computed using exact dot-product attention. Specifically, for $t \in \mathcal{T}_{l,h}$,
$$\alpha_{l,h,t} = \frac{\exp\left(q_{l,h}^{\top} k_{l,t,g(h)} / \sqrt{d}\right)}{\sum_{j \in \mathcal{T}_{l,h}} \exp\left(q_{l,h}^{\top} k_{l,j,g(h)} / \sqrt{d}\right)}.$$
The attention output is then
$$o_{l,h} = \sum_{t \in \mathcal{T}_{l,h}} \alpha_{l,h,t}\, v_{l,t,g(h)}.$$
Therefore, BTHA approximates only the KV position retrieval stage, while preserving exact softmax attention over the selected key–value entries.
The overall Block-then-Hash inference procedure is summarized in Algorithm 1.
Algorithm 1 Block-then-Hash Attention (BTHA)
Require: Query $q_{l,h}$, key cache $K_l$, value cache $V_l$, block mean cache $\bar{K}_l$, key hash cache $B_l^{k}$, shared hash projection $W_{l,h}$, block size $B$, block selection ratio $\rho_b$, hash budget $K_{\text{hash}}$, sink positions $\mathcal{S}_{\text{sink}}$
Ensure: Attention output $o_{l,h}$
1: $g \leftarrow g(h)$    ▹ Map query head $h$ to its KV head under GQA
2: $S \leftarrow$ current KV cache length
3: $N \leftarrow \lceil S / B \rceil$
Stage 1: Block-level coarse routing
4: for $b = 1$ to $N$ do
5:     $\mathcal{I}_b \leftarrow \{(b-1)B + 1, \dots, \min(bB, S)\}$
6:     $s_b \leftarrow q_{l,h}^{\top} \bar{k}_{l,b,g} / \sqrt{d}$
7: end for
8: $M_b \leftarrow \lceil \rho_b N \rceil$
9: $\mathcal{B}_{l,h} \leftarrow \operatorname{TopK}_{b}(s_b, M_b)$
10: $\mathcal{C}_{l,h} \leftarrow \bigcup_{b \in \mathcal{B}_{l,h}} \mathcal{I}_b \cup \mathcal{S}_{\text{sink}}$
Stage 2: Hash retrieval
11: if $|\mathcal{C}_{l,h}| \le K_{\text{hash}}$ then
12:     $\mathcal{T}_{l,h} \leftarrow \mathcal{C}_{l,h}$
13: else
14:     $b^{q}_{l,h} \leftarrow \operatorname{sign}(W_{l,h}\, q_{l,h})$
15:     for all $t \in \mathcal{C}_{l,h}$ do
16:         $d_t \leftarrow d_H\left(b^{q}_{l,h}, b^{k}_{l,t,h}\right)$
17:     end for
18:     $\mathcal{T}_{l,h} \leftarrow \operatorname{TopK}^{\text{smallest}}_{t \in \mathcal{C}_{l,h}}(d_t, K_{\text{hash}})$
19: end if
Stage 3: Exact attention over selected positions
20: for all $t \in \mathcal{T}_{l,h}$ do
21:     $a_t \leftarrow q_{l,h}^{\top} k_{l,t,g} / \sqrt{d}$
22: end for
23: $\alpha_t \leftarrow \operatorname{Softmax}_{t \in \mathcal{T}_{l,h}}(a_t)$
24: $o_{l,h} \leftarrow \sum_{t \in \mathcal{T}_{l,h}} \alpha_t\, v_{l,t,g}$
25: return $o_{l,h}$
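Algorithm 1 can be traced on toy data with the single-head sketch below. It follows the three stages literally but is only an illustration: `btha_step` is a hypothetical name, positions are 0-indexed, block means and key hash codes are recomputed on every call instead of being read from caches, ties in the Hamming ranking are broken by position order, and the ceiling used for the block count is an assumption.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hash_code(projection, vector):
    # Bit i is set when row i of the projection gives a non-negative value.
    return sum(1 << i for i, row in enumerate(projection) if dot(row, vector) >= 0.0)


def btha_step(query, keys, values, projection, block_size, ratio, budget, sinks=(0,)):
    """One decoding step for one head; returns (selected positions, output vector)."""
    d, s = len(query), len(keys)
    scale = math.sqrt(d)

    # Stage 1: score each block by its mean key and keep the top-ranked blocks.
    num_blocks = math.ceil(s / block_size)

    def block_score(b):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(column) / len(block) for column in zip(*block)]
        return dot(query, mean_key) / scale

    ranked = sorted(range(num_blocks), key=block_score, reverse=True)
    selected = set(sinks)
    for b in ranked[:math.ceil(ratio * num_blocks)]:
        selected.update(range(b * block_size, min((b + 1) * block_size, s)))
    selected = sorted(selected)

    # Stage 2: if the candidates exceed the budget, keep the closest hash codes.
    if len(selected) > budget:
        query_code = hash_code(projection, query)
        distance = {t: bin(query_code ^ hash_code(projection, keys[t])).count("1")
                    for t in selected}
        selected = sorted(sorted(selected, key=lambda t: distance[t])[:budget])

    # Stage 3: exact softmax attention restricted to the selected positions.
    logits = [dot(query, keys[t]) / scale for t in selected]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(a - peak) for a in logits]
    normalizer = sum(weights)
    output = [sum(w * values[t][i] for w, t in zip(weights, selected)) / normalizer
              for i in range(len(values[0]))]
    return selected, output
```

Note that, as in the selection rule, sink positions enter the candidate set but still compete in the hash stage once the budget is exceeded.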

3.7. Complexity Analysis

We analyze the per-step decoding complexity of BTHA for one layer and one query head. Let $S$ be the current KV cache length, $d$ be the head dimension, $B$ be the block size, and
$$N = \lceil S / B \rceil$$
be the number of KV blocks. Let $\rho_b$ denote the block selection ratio, $r$ denote the number of hash bits, and $K_{\text{hash}}$ denote the hash retrieval budget. The number of routed candidate positions is approximately
$$|\mathcal{C}_{l,h}| \approx \rho_b S + |\mathcal{S}_{\text{sink}}|.$$
For standard full attention, the dominant decoding cost per head is
$$O(S d),$$
which comes from computing query–key scores over all cached positions and aggregating the corresponding values.
In BTHA, the block-level routing stage uses cached block mean keys. Since each block is represented by one mean key vector, the block scoring cost is
$$O(N d) = O\left(\frac{S}{B}\, d\right).$$
After the top-ranked blocks are selected, hash-based retrieval is performed only inside the routed candidate set. The query hash computation costs
$$O(r d),$$
and the dense Hamming comparison cost is
$$O\left(|\mathcal{C}_{l,h}|\, r\right).$$
When binary hash codes are packed into machine words of width $w$, the Hamming distance computation is reduced to
$$O\left(|\mathcal{C}_{l,h}|\, \frac{r}{w}\right),$$
where bitwise XOR produces the mismatch mask and population-count operations compute the Hamming distance.
Finally, BTHA computes exact dot-product attention only over the selected KV positions. Let
$$K_{\text{eff}} = \min\left(|\mathcal{C}_{l,h}|, K_{\text{hash}}\right)$$
be the effective number of selected positions. The final exact attention cost is
$$O(K_{\text{eff}}\, d).$$
Therefore, the overall per-step decoding complexity is
$$O\left(\frac{S}{B}\, d + r d + |\mathcal{C}_{l,h}|\, \frac{r}{w} + K_{\text{eff}}\, d\right).$$
Since $B > 1$, $\rho_b < 1$, and $K_{\text{eff}} \ll S$ in long-context decoding, BTHA substantially reduces both KV memory access and exact attention computation compared with full attention.
BTHA is further optimized at the system level. We implement the hash encoding, candidate filtering, XOR-based Hamming computation, and sparse KV gathering with low-level C/CUDA kernels [37,38]. These kernels operate directly on cached block summaries and packed hash codes, enabling BTHA to be integrated into existing decoding pipelines without changing model weights or the high-level Transformer architecture [39]. The block mean cache and key hash cache are maintained incrementally: when a new key is appended, only the current tail block mean and the hash code of the new key are updated.
The additional storage overhead is small. The block mean cache requires
$$O\left(\frac{S}{B}\, d\right)$$
storage per key–value head, while the bit-packed key hash cache requires
$$O\left(\frac{S r}{w}\right)$$
storage. In our implementation, the packed hash cache accounts for only about 5% of the original KV cache size. Thus, BTHA avoids full-cache recomputation while introducing limited auxiliary memory overhead.
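The complexity terms above can be tabulated for a concrete setting with a short helper. The sketch counts each term up to constant factors and says nothing about measured latency, which also depends on memory traffic and kernel efficiency. The function name `btha_cost_terms`, the 64-bit word width, and the example block size of 16 are assumptions for illustration. The ratio of 0.5, the 128 hash bits, and the 512-position budget mirror the configuration reported in the experiments.

```python
import math


def btha_cost_terms(seq_len, head_dim, block_size, ratio, hash_bits, budget,
                    word_bits=64, num_sinks=0):
    """Per-head, per-step operation counts (up to constants) for each stage."""
    num_blocks = math.ceil(seq_len / block_size)
    # Routed candidates: selected blocks expanded to positions, plus sinks.
    candidates = min(seq_len, math.ceil(ratio * num_blocks) * block_size + num_sinks)
    k_eff = min(candidates, budget)
    words_per_code = math.ceil(hash_bits / word_bits)
    return {
        "full_attention": seq_len * head_dim,
        "block_scoring": num_blocks * head_dim,
        "query_hash": hash_bits * head_dim,
        "hamming": candidates * words_per_code,
        "exact_attention": k_eff * head_dim,
    }


# Example: a 32K-token cache with head dimension 128 and a hypothetical block size.
terms = btha_cost_terms(seq_len=32768, head_dim=128, block_size=16,
                        ratio=0.5, hash_bits=128, budget=512)
btha_total = sum(v for name, v in terms.items() if name != "full_attention")
```

In this setting the block scoring term dominates the BTHA total, which suggests that the block size is the main lever once the position budget is fixed.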

3.8. Hash Training Data and Configuration

We train the orthogonal hash projection using query and key representations sampled from real-world long-context data. The training data contains 100 sampled sequences, evenly drawn from the Book and ArXiv subsets of RedPajama [40]. For each sampled sequence, we first perform a dense prefill pass with the backbone LLM and collect the corresponding query and key representations from each layer and attention head. To construct a training instance, we randomly sample a query position from the second half of the sequence so that the selected query has access to a sufficiently long historical context. Following the causal constraint of decoder-only attention, only keys whose positions are not later than the sampled query position are used as candidate keys.
For a sampled query at position $i$, we compute its dense query–key relevance scores against all causally visible keys using the standard scaled dot product. The visible keys are then ranked according to these dense attention scores. In our implementation, the oracle positive budget is set to $K_{\text{oracle}} = 0.1\,i$, corresponding to the top 10% highest-scoring visible keys. These keys are treated as positive samples, while negative samples are drawn from the remaining visible keys. Each training unit is represented as a labeled query–key triplet $(q, k, y)$, where $y = 1$ denotes a high-relevance key and $y = 0$ denotes a low-relevance key. The triplets are randomly shuffled across mini-batches during optimization. This construction directly aligns the hash learning objective with dense-attention-induced token relevance, allowing the learned Hamming distance to better approximate the dense top-k retrieval order.
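The labeling procedure can be sketched as follows for one sampled query. The function `build_training_pairs` is hypothetical and returns key positions with labels rather than full triplets. The number of positives is taken as 10% of the visible keys, rounded down with a minimum of one. The default of drawing as many negatives as positives and the fixed random seed are assumptions, since the negative sampling ratio is not specified above.

```python
import math
import random


def build_training_pairs(query, keys, query_pos, positive_fraction=0.1,
                         num_negatives=None, seed=0):
    """Label causally visible keys for one query; returns (position, label) pairs."""
    # Causal constraint: only keys at positions up to the query position (0-indexed).
    visible = list(range(query_pos + 1))
    scale = math.sqrt(len(query))
    logits = {t: sum(q * k for q, k in zip(query, keys[t])) / scale for t in visible}

    # Rank the visible keys by dense attention logit, highest first.
    ranked = sorted(visible, key=lambda t: logits[t], reverse=True)
    num_pos = max(1, int(positive_fraction * len(visible)))
    positives, rest = ranked[:num_pos], ranked[num_pos:]

    # Assumption: sample as many negatives as positives unless told otherwise.
    rng = random.Random(seed)
    wanted = len(positives) if num_negatives is None else num_negatives
    negatives = rng.sample(rest, min(wanted, len(rest)))

    return [(t, 1) for t in positives] + [(t, 0) for t in negatives]
```

Pairs produced for many queries, layers, and heads would then be shuffled together before being fed to the optimizer.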
The hash projection is optimized with stochastic gradient descent (SGD). We use momentum and weight decay for stable optimization and train the hash network with repeated representation sampling to improve robustness across different query–key distributions. The main training configuration is summarized in Table 1.

4. Experiments

4.1. Experimental Setup

(1) Experiment platform.
We conduct all experiments on a server equipped with an NVIDIA A100 80 GB GPU and a 96-core CPU. The system runs Ubuntu 24.04 with CUDA 12.1. Our implementation is based on PyTorch 2.4 [41] and FlashInfer [42].
(2) Baselines and configurations.
We compare BTHA with representative top-k attention baselines, including Loki [43] and Quest [11]. Loki accelerates top-k attention through low-rank approximation, while Quest performs query-aware block-level KV selection. We further compare BTHA with MagicPIG [44], which accelerates top-k attention using LSH [15]. LSH is a randomized hashing method that typically relies on random projections to generate hash codes. Unlike learning-to-hash methods, LSH usually requires a large number of hash bits to maintain retrieval accuracy.
In addition to top-k attention baselines, we compare BTHA with several KV cache compression methods, including StreamingLLM [9], H2O [8], and SnapKV [10]. For all baselines, we adopt the recommended configurations from the original papers, including hyperparameters such as channel numbers, block sizes, and KV cache budgets when applicable. For BTHA, we set the number of hash bits to $r_{\text{bit}} = 128$ and the block selection ratio to $\rho_b = 0.5$, which provide a robust configuration across most tasks. Following Quest [11], we use vanilla full attention in the first two layers, since these layers are commonly observed to be outlier layers in top-k attention methods. We also include the vanilla Transformer with full attention, denoted as dense, as a reference baseline to evaluate both the effectiveness and efficiency of BTHA.
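To make the contrast with random-projection LSH concrete, the following sketch shows sign-based random hyperplane hashing with 128 bits. It illustrates only the generic idea of random projections producing binary codes; it is not MagicPIG's sampling scheme and not BTHA's learned orthogonal projection, and the helper names `make_hyperplanes`, `sign_hash`, and `hamming` are hypothetical.

```python
import random

def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random Gaussian hyperplanes (one per hash bit)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def sign_hash(vec, planes):
    """Pack the sign of each projection into a Python int, one bit per plane."""
    code = 0
    for w in planes:
        dot = sum(a * b for a, b in zip(w, vec))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code

def hamming(a, b):
    """Hamming distance between two packed codes via XOR and popcount."""
    return bin(a ^ b).count("1")

planes = make_hyperplanes(n_bits=128, dim=8)
v = [0.3, -1.2, 0.5, 2.0, -0.7, 0.1, 0.9, -0.4]
code_v = sign_hash(v, planes)
code_scaled = sign_hash([2.0 * x for x in v], planes)   # same direction
code_flipped = sign_hash([-x for x in v], planes)       # opposite direction
```

A learned projection would replace the random `planes` with trained weights; the packing and XOR-based distance would stay the same.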
(3) Models and datasets.
We evaluate BTHA on two mainstream large language models, Llama-2 [45] and Llama-3.1 [46]. The evaluation is conducted on two widely used long-context benchmarks, LongBench-E [47] and RULER [48]. LongBench-E is a multi-task benchmark covering question answering, document summarization, few-shot learning, synthetic tasks, and code understanding, which provides a comprehensive evaluation of long-context language modeling ability. RULER focuses on controlled long-context evaluation, especially retrieval-oriented tasks over extremely long contexts. These benchmarks allow us to evaluate both the task-level accuracy and long-range retrieval capability of BTHA under different context lengths.

4.2. Accuracy Evaluation

(1) Evaluation on LongBench-E.
We first evaluate all methods on LongBench-E, which covers question answering, document summarization, synthetic tasks, few-shot learning, and code understanding. As shown in Table 2, BTHA achieves strong accuracy under the same 512 KV position budget. On Llama-2-7B-32K-Instruct, BTHA obtains an average score of 34.57, slightly surpassing the dense full-attention baseline (34.47) and outperforming all compared sparse attention and KV cache compression baselines. On Llama-3.1-8B-Instruct, BTHA achieves an average score of 54.02, which is very close to the dense full-attention baseline of 54.10, with only a 0.08-point gap. Among the non-dense methods, BTHA also achieves the best average score, outperforming Loki, Quest, MagicPIG, StreamingLLM, H2O, and SnapKV. These results indicate that BTHA can effectively preserve task-level accuracy while attending to only a limited number of KV positions.
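The reported averages can be checked directly from the per-task scores. The short sketch below recomputes the Llama-2-7B-32K-Instruct averages from the Dense and BTHA rows of Table 2, assuming the AVG column is the unweighted mean of the 13 task scores; the variable names are illustrative.

```python
from statistics import mean

# Per-task LongBench-E scores for Llama-2-7B-32K-Instruct, copied from Table 2
# (LCC, PRetr, HQA, TQA, Repo, Sam, Trec, MQA, 2Wiki, Gov, PCnt, MltN, Qaspr).
dense = [67.53, 11.89, 15.30, 85.03, 55.03, 39.32, 69.00,
         22.44, 13.13, 32.01, 1.17, 24.51, 11.76]
btha = [68.50, 9.73, 14.96, 86.30, 56.34, 39.26, 69.14,
        23.10, 13.34, 31.49, 0.44, 24.91, 11.85]

dense_avg = round(mean(dense), 2)      # unweighted mean over 13 tasks
btha_avg = round(mean(btha), 2)
gap = round(btha_avg - dense_avg, 2)   # positive means BTHA is ahead

# Number of tasks on which BTHA scores strictly higher than dense attention.
wins = sum(b > d for b, d in zip(btha, dense))
```

The recomputed means reproduce the 34.47 and 34.57 averages, and BTHA is ahead on 8 of the 13 tasks, so the 0.10-point average lead does not come from a single task.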
(2) Evaluation on RULER.
We further evaluate all methods on RULER, a controlled long-context benchmark designed to test retrieval, tracing, aggregation, and question-answering capabilities under extended context lengths. As shown in Table 3, BTHA consistently maintains accuracy close to the dense full-attention baseline on both evaluated models. On Llama-2-7B-32K-Instruct, BTHA achieves an average score of 63.94, only 1.10 points lower than dense full attention, while clearly outperforming all non-dense baselines. On Llama-3.1-8B-Instruct, BTHA obtains an average score of 81.62, again only 1.10 points below dense full attention and substantially higher than Loki, Quest, MagicPIG, StreamingLLM, H2O, and SnapKV. These results demonstrate that BTHA can preserve long-range retrieval ability through block-level coarse routing and hash-based fine-grained retrieval, providing a favorable accuracy-efficiency trade-off for long-context inference.
(3) Task-level observations.
Although BTHA achieves the best or near-best average accuracy across the evaluated benchmarks, its gains are not uniform across all tasks. On LongBench-E, BTHA obtains larger improvements on tasks where the useful evidence is relatively sparse or concentrated, such as LCC, RepoBench, TQA, MultiFieldQA, and MultiNews. These tasks benefit from the Block-then-Hash design because block-level routing first narrows the search space to relevant context regions, while hash-based retrieval further selects fine-grained KV entries within the routed candidates. On RULER, BTHA performs particularly well on retrieval-oriented tasks such as needle search and multi-query retrieval, indicating that the learned hash codes can effectively preserve salient long-range evidence under a strict KV budget. Meanwhile, BTHA still shows small gaps compared with dense attention on several tasks, especially those requiring broad-context aggregation, highly diffuse evidence, or fine-grained counting and verification. For example, on LongBench-E, the gaps are more visible on tasks such as Passage Count and GovReport, while on RULER, performance drops are observed on some aggregation-oriented or full-context matching tasks such as FWE, NMK2, and NMV. This suggests that when the answer depends on many weakly relevant tokens distributed across the context, a fixed KV selection budget may discard information that is individually low-ranked but collectively useful. Overall, these task-level results show that BTHA is most effective when the attention mass is concentrated on a small subset of informative tokens, while further improvement is still possible for tasks requiring more exhaustive context coverage.
(4) End-to-end efficiency.
We further evaluate the end-to-end inference efficiency under the same configuration, with a prefill length of 36K tokens and a decode length of 3.6K tokens. In addition to the decoding latency, we also report the prefill time to provide a comprehensive comparison of the overall inference cost. As shown in Figure 3, BTHA consistently reduces the decoding time compared with dense full attention. On Llama-2-7B-32K-Instruct, BTHA achieves the lowest decoding cost among the compared methods, reducing the decode time from about 189 s for dense attention to about 95 s. On Llama-3.1-8B-Instruct, BTHA also substantially decreases the decoding cost, reducing the decode time from about 292 s to about 196 s. The prefill overhead of BTHA remains comparable to other methods, indicating that the proposed Block-then-Hash selection mainly accelerates the decoding stage without introducing significant prefill overhead. These results demonstrate that BTHA improves end-to-end inference efficiency while maintaining competitive accuracy under a fixed token selection budget.
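For reference, the decode-time figures quoted above translate into speedup ratios as follows. This is plain arithmetic on the approximate values given in the text; reading "3.6K" as 3600 decoded tokens is an assumption, and the ratios apply only to this 36K-prefill setting.

```python
def speedup(baseline_seconds, method_seconds):
    """Ratio of baseline time to method time (values above 1 mean faster)."""
    return baseline_seconds / method_seconds

decode_tokens = 3600  # assumption: "3.6K" read as 3600 tokens

llama2_speedup = speedup(189, 95)     # dense vs. BTHA, Llama-2-7B-32K-Instruct
llama31_speedup = speedup(292, 196)   # dense vs. BTHA, Llama-3.1-8B-Instruct

# Approximate decode throughput of BTHA on Llama-2 in tokens per second.
llama2_tokens_per_s = decode_tokens / 95
```

Under these numbers, decoding is roughly 2.0× faster on Llama-2-7B-32K-Instruct and roughly 1.5× faster on Llama-3.1-8B-Instruct in this configuration.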
(5) Layer-wise latency under different sequence lengths.
To further analyze the scalability of BTHA, we compare the latency of a single Transformer layer under different sequence lengths. As shown in Figure 4, the latency of dense attention increases rapidly as the sequence length grows, since it needs to access and compute attention scores over all cached KV positions. In contrast, BTHA maintains a much slower latency growth under the same 1.56% token selection ratio. For Llama-2, BTHA achieves the lowest latency across long sequence lengths and shows a clear advantage over Dense, Loki, and Quest when the sequence length increases to 128K and 256K. For Llama-3.1 with GQA, BTHA also keeps the latency nearly stable and remains substantially faster than Dense and Loki at long contexts. These results indicate that the Block-then-Hash selection mechanism effectively reduces the per-layer decoding cost and scales better with increasing context length.
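To relate the 1.56% selection ratio to absolute KV budgets, the sketch below assumes that 1.56% is the rounded form of 1/64 and that "K" denotes 1024 tokens. Both are assumptions, not statements from the text, and `kv_budget` is a hypothetical helper.

```python
def kv_budget(context_len, ratio=1 / 64):
    """Number of KV positions retained at a given selection ratio."""
    return int(context_len * ratio)

K = 1024  # assumption: "32K" means 32 * 1024 tokens
budgets = {n: kv_budget(n * K) for n in (32, 64, 128, 256)}
```

Under these assumptions, a 32K context corresponds to 512 retained positions and a 256K context to 4096, which is consistent with the 512-position and 4096-token budgets used elsewhere in the evaluation.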
(6) Model-scale generalization analysis.
To further examine whether the sparse retrieval mechanism remains effective across different model scales and context settings, we additionally evaluate BTHA on Qwen2.5-14B-Instruct-1M and Qwen2.5-32B-Instruct. As shown in Table 4, Table 5 and Table 6, BTHA achieves accuracy close to dense attention under strict sparse token budgets. On LongBench-E, BTHA obtains comparable average performance to dense attention on both the 14B and 32B models with only 512 selected KV tokens, with average gaps of 0.37 and 0.28 points, respectively. On RULER with 256K context length, BTHA nearly matches dense attention under a 4096-token budget, with only a 0.04-point average gap, while outperforming the token-level top-k baseline. These results suggest that the proposed sparse retrieval strategy can generalize to larger models and longer contexts.

4.3. Ablation Study

We conduct ablation experiments on Llama-3.1-8B-Instruct to examine the effectiveness of the main design choices in BTHA. The evaluation focuses on the block selection ratio, the necessity of the Block-then-Hash retrieval pipeline, and the contribution of offline hash training. Unless otherwise specified, all experiments are conducted on LongBench-E under the same KV position budget.
(1) Effect of block selection ratio $\rho_b$.
The block selection ratio $\rho_b$ controls how many KV blocks are routed into the candidate set before hash-based token-level retrieval. A smaller $\rho_b$ reduces the candidate search space and improves efficiency, but it may discard relevant blocks before fine-grained retrieval. In contrast, a larger $\rho_b$ preserves more historical information but introduces more redundant candidates, increasing the overhead of hash comparison and sparse KV gathering.
Figure 5 and Table 7 show the LCC results under different values of $\rho_b$. The results indicate that increasing $\rho_b$ does not monotonically improve performance. When $\rho_b$ is too small, the routed candidate set may miss useful KV positions; when $\rho_b$ is too large, excessive irrelevant tokens are passed to the hash retrieval stage, which weakens fine-grained selection. BTHA achieves the best LCC score of 68.62 when $\rho_b = 0.5$, suggesting that a moderate block selection ratio provides a better balance between candidate recall and redundancy reduction.
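The role of the block selection ratio can be sketched as follows. This is a simplified pure-Python illustration of mean-key block routing, not the paper's CUDA operator: the function names, the dot-product block score, and the ceiling rounding of the number of kept blocks are assumptions.

```python
import math

def block_means(keys, block_size):
    """Mean key vector of each consecutive block of `block_size` keys."""
    means = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        dim = len(block[0])
        means.append([sum(k[c] for k in block) / len(block) for c in range(dim)])
    return means

def route_blocks(query, keys, block_size, rho_b):
    """Return token indices inside the top rho_b fraction of blocks."""
    means = block_means(keys, block_size)
    scores = [sum(q * m for q, m in zip(query, mean_key)) for mean_key in means]
    n_keep = max(1, math.ceil(rho_b * len(means)))  # assumed rounding rule
    top = sorted(range(len(means)), key=lambda b: scores[b], reverse=True)[:n_keep]
    candidates = []
    for b in sorted(top):
        candidates.extend(range(b * block_size, min((b + 1) * block_size, len(keys))))
    return candidates

# Toy usage: 8 keys in 4 blocks of size 2; block scores are 1, 0, 3, -1.
keys = [[1, 0], [1, 0], [0, 1], [0, 1], [3, 0], [3, 0], [-1, 0], [-1, 0]]
candidates = route_blocks(query=[1, 0], keys=keys, block_size=2, rho_b=0.5)
```

With $\rho_b = 0.5$, the two highest-scoring blocks survive and their four tokens form the candidate set; a smaller ratio would shrink this set further, while $\rho_b = 1$ would pass every token on to the hash stage.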
(2) Effect of the Block-then-Hash retrieval pipeline.
We further evaluate whether both retrieval stages are necessary. To this end, we compare BTHA with two simplified variants. The first variant removes block-level routing and directly performs hash retrieval, denoted as Hash Only. The second variant removes hash-based fine-grained retrieval and only uses block-level routing, denoted as Block Only.
As shown in Table 8, Hash Only achieves an LCC score of 67.46. Although it preserves token-level retrieval, it lacks the coarse block routing stage and therefore suffers from a larger search space. Block Only obtains a lower LCC score of 66.34, indicating that coarse block selection alone is insufficient because all tokens inside the selected blocks are treated uniformly. By combining block-level routing and hash-based token retrieval, BTHA achieves the highest LCC score of 68.62. These results show that the two stages play different and complementary roles. Block routing acts as a coarse-grained recall stage: by selecting query-relevant blocks, it increases the chance of covering dense-attention top-k KV positions while reducing the global search space. Hash retrieval then acts as a fine-grained filtering stage, removing redundant tokens inside the routed blocks and selecting more relevant token-level KV entries. Consequently, the full BTHA pipeline achieves the best LCC score by combining coarse top-k candidate recall with fine-grained token selection.
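The difference between the full pipeline and the Hash Only variant can be illustrated with a small sketch of the fine-grained stage. It ranks candidate positions by Hamming distance between packed binary codes; the 4-bit toy codes and the helper names `hamming` and `hash_topk` are hypothetical, and ties are broken by position purely for determinism.

```python
def hamming(a, b):
    """Hamming distance between two packed binary codes (XOR + popcount)."""
    return bin(a ^ b).count("1")

def hash_topk(query_code, key_codes, candidates, budget):
    """Keep the `budget` candidates whose key codes are closest to the query code."""
    ranked = sorted(candidates, key=lambda j: (hamming(query_code, key_codes[j]), j))
    return sorted(ranked[:budget])

# Toy 4-bit codes for six cached keys.
key_codes = [0b1111, 0b0000, 0b1110, 0b0001, 0b1111, 0b1000]
query_code = 0b1111

# Full pipeline: only positions routed by the block stage are compared.
routed = [0, 1, 4, 5]
block_then_hash = hash_topk(query_code, key_codes, routed, budget=2)

# Hash Only variant: every cached position is compared.
hash_only = hash_topk(query_code, key_codes, range(len(key_codes)), budget=3)
```

In the first call only four Hamming distances are computed, whereas the Hash Only call scans all six codes, which mirrors the larger search space noted above.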
(3) Effect of offline hash training.
We also evaluate the effect of offline training on the hash network. Before training, the hash projection is randomly initialized and behaves similarly to random projection-based hashing. Although this provides a simple approximation for similarity search, it is not optimized for the query–key relevance pattern of LLM attention. After offline training, the hash network learns to approximate the KV position ranking induced by full attention, making the learned Hamming distance more consistent with the attention-based top-k positions.
Table 9 reports the Top-k IoU before and after hash training. On Llama-3.1-8B-Instruct, the Top-k IoU improves from 41.32% before training to 59.74% after training. This significant improvement shows that offline training effectively aligns the hash space with full-attention retrieval targets, thereby improving the quality of fine-grained KV position selection.
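The exact definition of Top-k IoU is not spelled out in this section, so the sketch below assumes the standard intersection-over-union between the set of positions selected by hashing and the set of dense-attention top-k positions. The function name `topk_iou` is illustrative.

```python
def topk_iou(selected, oracle):
    """Intersection over union of two sets of KV position indices."""
    s, o = set(selected), set(oracle)
    if not s and not o:
        return 1.0  # convention for two empty sets
    return len(s & o) / len(s | o)

# Toy usage: two of four selected positions agree with the oracle top-4.
iou = topk_iou(selected=[1, 2, 3, 4], oracle=[3, 4, 5, 6])
```

Under this definition, two equally sized sets that share half of their positions have an IoU of one third, so IoU is a stricter measure than the overlap fraction.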
(4) End-to-end efficiency.
The above ablation results show that BTHA benefits from both coarse-grained and fine-grained retrieval. Compared with Hash Only, BTHA avoids global hash comparison over the full KV cache by first routing a compact set of candidate blocks. Compared with Block Only, BTHA further removes redundant tokens inside the selected blocks through learned hash retrieval. Moreover, the trained hash network provides a more attention-aligned retrieval space than random hashing. These results demonstrate that the block routing stage, the hash retrieval stage, and offline hash training jointly contribute to the final accuracy–efficiency trade-off of BTHA.
(5) Token budget ablation.
We first investigate the impact of the token budget on the inference performance of BTHA. As shown in Figure 6, BTHA consistently achieves higher accuracy than Quest and Loki across different token-budget settings. More importantly, BTHA exhibits only mild performance degradation as the token budget decreases. Even under an extremely constrained setting where only about 0.4% of tokens are retained, BTHA still maintains competitive model accuracy. These results indicate that the learned hash-based retrieval mechanism enables BTHA to effectively preserve critical contextual information under strict KV selection budgets, highlighting the potential of hash learning for efficient long-context decoding.
We further investigate the effect of hash length on BTHA by varying the number of hash bits from 32 to 256. As shown in Figure 7, using too few hash bits leads to noticeable accuracy degradation, especially under the 32-bit setting. This is because overly compact hash codes provide limited discrimination capability for distinguishing important KV entries. As the number of hash bits increases, the accuracy improves rapidly and becomes stable when the hash length reaches 128 bits. On both LCC and GovReport, BTHA with 128 or 256 hash bits achieves performance close to the dense-attention baseline, and in some cases even slightly surpasses it. These results indicate that BTHA does not require excessively long hash codes to maintain retrieval quality; a moderate hash length is sufficient to preserve critical contextual information while keeping the hash representation compact and efficient.
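To give a sense of how compact these codes are, the following sketch compares the storage of one packed hash code with that of one cached key vector. The head dimension of 128 and 2-byte (fp16) elements are assumptions about the evaluated Llama models, not values stated in this section.

```python
def code_bytes(n_bits):
    """Bytes needed to store one packed binary hash code."""
    return n_bits // 8

def key_bytes(head_dim=128, bytes_per_elem=2):
    """Bytes of one cached key vector (assumed: head_dim 128, fp16)."""
    return head_dim * bytes_per_elem

# Compression ratio of key vector size to hash code size per bit length.
ratios = {bits: key_bytes() / code_bytes(bits) for bits in (32, 64, 128, 256)}
```

Under these assumptions, a 128-bit code occupies 16 bytes against 256 bytes for the key itself, so scanning codes reads about 16× less memory per position than scanning full keys.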

5. Conclusions

In this paper, we presented BTHA, a budget-aware Block-then-Hash Attention framework for efficient long-context inference. BTHA decomposes KV position selection into two consecutive stages: block-level coarse routing and hash-based fine-grained retrieval. The block-level stage first identifies query-relevant historical regions using cached block mean keys, thereby reducing the search space before token-level selection. The hash stage then performs efficient position-level retrieval within the routed candidate blocks using shared learnable orthogonal hash projections and cached binary key codes. Finally, exact dot-product attention is computed only over the selected KV positions, preserving the standard attention computation on the retained tokens.
The proposed design offers both algorithmic and system-level benefits. Algorithmically, the two-stage selection mechanism combines coarse-grained context localization with fine-grained token retrieval, enabling BTHA to retain important KV positions under a fixed position budget. System-wise, BTHA maintains block mean caches and key hash caches incrementally, and uses low-level C/CUDA kernels for hash encoding, XOR-based Hamming computation, candidate filtering, and sparse KV gathering. These optimizations allow BTHA to be integrated into existing Transformer decoding pipelines without modifying model weights or the high-level model architecture.
Extensive experiments on LongBench-E and RULER demonstrate that BTHA consistently preserves accuracy close to dense full attention while substantially reducing the number of attended KV positions. Compared with representative sparse attention and KV cache compression baselines, BTHA achieves better average accuracy under the same KV position budget across both Llama-2-7B-32K-Instruct and Llama-3.1-8B-Instruct. These results indicate that hierarchical KV retrieval is an effective direction for scalable long-context inference. Future work may explore more expressive block representations and adaptive budget allocation across layers and heads.

Author Contributions

Conceptualization, R.L. and M.H.; methodology, R.L.; software, R.L. and L.L.; validation, R.L. and L.L.; formal analysis, R.L. and L.L.; investigation, R.L. and L.L.; resources, M.H.; data curation, R.L. and L.L.; writing—original draft preparation, R.L. and L.L.; writing—review and editing, R.L., L.L. and M.H.; visualization, R.L. and L.L.; supervision, M.H.; project administration, M.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by Hainan University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available. LongBench-E and RULER can be accessed from their official repositories. The experimental results generated during the current study are available from the corresponding author upon reasonable request. The code is available at https://github.com/Rekio-lll/BTHA (accessed on 9 May 2026).

Acknowledgments

The authors would like to thank the developers of PyTorch, FlashInfer, LongBench-E, and RULER for providing the open-source tools and benchmarks used in this study. During the preparation of the initial draft, the authors used OpenAI’s ChatGPT-5.5 to assist with English language editing, sentence refinement, and improving the clarity of presentation. The authors carefully reviewed, revised, and verified all generated content and take full responsibility for the final content of the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
| Abbreviation | Definition |
|---|---|
| LLM | Large Language Model |
| BTHA | Block-then-Hash Attention |
| KV | Key–Value |
| GQA | Grouped-Query Attention |
| LSH | Locality-Sensitive Hashing |
| MHA | Multi-Head Attention |
| IoU | Intersection over Union |
| GPU | Graphics Processing Unit |
| CPU | Central Processing Unit |
| CUDA | Compute Unified Device Architecture |
| APC | Article Processing Charge |

References

  1. Wan, W.; Zhang, C.; Huang, L. Efficient Headline Generation with Hybrid Attention for Long Texts. Electronics 2024, 13, 3558.
  2. Zhao, H.; Wang, B.; Fu, Y.; Li, N.; Gao, Z. Fixed-time adaptive fault-tolerant control of a multi-mode VTOL UAV with variable prescribed performance boundaries under random disturbances. ISA Trans. 2026, in press.
  3. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The long-document transformer. arXiv 2020, arXiv:2004.05150.
  4. Zaheer, M.; Guruganesh, G.; Dubey, K.A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big bird: Transformers for longer sequences. Adv. Neural Inf. Process. Syst. 2020, 33, 17283–17297.
  5. Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451.
  6. Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking attention with performers. arXiv 2020, arXiv:2009.14794.
  7. Roy, A.; Saffar, M.; Vaswani, A.; Grangier, D. Efficient content-based sparse attention with routing transformers. Trans. Assoc. Comput. Linguist. 2021, 9, 53–68.
  8. Zhang, Z.; Sheng, Y.; Zhou, T.; Chen, T.; Zheng, L.; Cai, R.; Song, Z.; Tian, Y.; Ré, C.; Barrett, C.; et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Adv. Neural Inf. Process. Syst. 2023, 36, 34661–34710.
  9. Xiao, G.; Tian, Y.; Chen, B.; Han, S.; Lewis, M. Efficient streaming language models with attention sinks. arXiv 2023, arXiv:2309.17453.
  10. Li, Y.; Huang, Y.; Yang, B.; Venkitesh, B.; Locatelli, A.; Ye, H.; Cai, T.; Lewis, P.; Chen, D. Snapkv: Llm knows what you are looking for before generation. Adv. Neural Inf. Process. Syst. 2024, 37, 22947–22970.
  11. Tang, J.; Zhao, Y.; Zhu, K.; Xiao, G.; Kasikci, B.; Han, S. Quest: Query-aware sparsity for efficient long-context llm inference. arXiv 2024, arXiv:2406.10774.
  12. Xu, R.; Xiao, G.; Huang, H.; Guo, J.; Han, S. Xattention: Block sparse attention with antidiagonal scoring. arXiv 2025, arXiv:2503.16428.
  13. Yuan, J.; Gao, H.; Dai, D.; Luo, J.; Zhao, L.; Zhang, Z.; Xie, Z.; Wei, Y.; Wang, L.; Xiao, Z.; et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July 2025; pp. 23078–23097.
  14. Lu, E.; Jiang, Z.; Liu, J.; Du, Y.; Jiang, T.; Hong, C.; Liu, S.; He, W.; Yuan, E.; Wang, Y.; et al. Moba: Mixture of block attention for long-context llms. Adv. Neural Inf. Process. Syst. 2026, 38, 17790–17815.
  15. Gionis, A.; Indyk, P.; Motwani, R. Similarity search in high dimensions via hashing. In Proceedings of the VLDB, Edinburgh, UK, 7–10 September 1999; Volume 99, pp. 518–529.
  16. Gong, P.; Yi, J.; Wang, S.; Zhang, J.; Jin, Z.; Zhou, O.; Liu, R.; Xu, G.; Bai, Y.; Ye, B.; et al. HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 27 July–1 August 2025; pp. 24856–24871.
  17. Li, W.; Zhang, Y.; Luo, G.; Wan, H.; Gong, Z.; Chao, F.; Ji, R. Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval. In Proceedings of the Thirty-Ninth Annual Conference on Neural Information Processing Systems, San Diego, CA, USA, 2–7 December 2025.
  18. Duan, G.; Chen, J.; Zhou, Y.; Zheng, X.; Zhu, Y. Large Language Model Inference Acceleration Based on Hybrid Model Branch Prediction. Electronics 2024, 13, 1376.
  19. Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.G.; Le, Q.; Salakhutdinov, R. Transformer-xl: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 2978–2988.
  20. Rae, J.W.; Potapenko, A.; Jayakumar, S.M.; Lillicrap, T.P. Compressive transformers for long-range sequence modelling. arXiv 2019, arXiv:1911.05507.
  21. Ainslie, J.; Ontanon, S.; Alberti, C.; Cvicek, V.; Fisher, Z.; Pham, P.; Ravula, A.; Sanghai, S.; Wang, Q.; Yang, L. ETC: Encoding long and structured inputs in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 268–284.
  22. Xiong, Y.; Zeng, Z.; Chakraborty, R.; Tan, M.; Fung, G.; Li, Y.; Singh, V. Nyströmformer: A nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 14138–14148.
  23. Tay, Y.; Dehghani, M.; Abnar, S.; Shen, Y.; Bahri, D.; Pham, P.; Rao, J.; Yang, L.; Ruder, S.; Metzler, D. Long range arena: A benchmark for efficient transformers. arXiv 2020, arXiv:2011.04006.
  24. Wu, Y.; Rabe, M.N.; Hutchins, D.; Szegedy, C. Memorizing transformers. arXiv 2022, arXiv:2203.08913.
  25. Liu, Z.; Desai, A.; Liao, F.; Wang, W.; Xie, V.; Xu, Z.; Kyrillidis, A.; Shrivastava, A. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Adv. Neural Inf. Process. Syst. 2023, 36, 52342–52364.
  26. Ge, S.; Zhang, Y.; Liu, L.; Zhang, M.; Han, J.; Gao, J. Model tells you what to discard: Adaptive kv cache compression for llms. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7 May 2024.
  27. Ribar, L.; Chelombiev, I.; Hudlass-Galley, L.; Blake, C.; Luschi, C.; Orr, D. Sparq attention: Bandwidth-efficient llm inference. arXiv 2023, arXiv:2312.04985.
  28. Liu, Y.; Li, H.; Cheng, Y.; Ray, S.; Huang, Y.; Zhang, Q.; Du, K.; Yao, J.; Lu, S.; Ananthanarayanan, G.; et al. Cachegen: Kv cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference, Sydney, Australia, 4–8 August 2024; pp. 38–56.
  29. Lee, W.; Lee, J.; Seo, J.; Sim, J. InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. In Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), Santa Clara, CA, USA, 10–12 July 2024; pp. 155–172.
  30. Fu, Q.; Cho, M.; Merth, T.; Mehta, S.; Rastegari, M.; Najibi, M. Lazyllm: Dynamic token pruning for efficient long context llm inference. arXiv 2024, arXiv:2407.14057.
  31. Xiao, G.; Tang, J.; Zuo, J.; Guo, J.; Yang, S.; Tang, H.; Fu, Y.; Han, S. Duoattention: Efficient long-context llm inference with retrieval and streaming heads. In Proceedings of the International Conference on Learning Representations, Singapore, 24 April 2025.
  32. Feng, Y.; Lv, J.; Cao, Y.; Xie, X.; Zhou, S.K. Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference. Adv. Neural Inf. Process. Syst. 2026, 38, 113152–113188.
  33. Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv 2019, arXiv:1904.10509.
  34. Tay, Y.; Bahri, D.; Yang, L.; Metzler, D.; Juan, D.C. Sparse sinkhorn attention. In Proceedings of the International Conference on Machine Learning, PMLR, Online, 13–18 July 2020; pp. 9438–9447.
  35. Liu, H.; Abbeel, P. Blockwise parallel transformers for large context models. Adv. Neural Inf. Process. Syst. 2023, 36, 8828–8844.
  36. Mohtashami, A.; Jaggi, M. Landmark attention: Random-access infinite context length for transformers. arXiv 2023, arXiv:2305.16300.
  37. Yuan, X.; Kong, W.; Luo, Z.; Xu, M. Efficient Inference Offloading for Mixture-of-Experts Large Language Models in Internet of Medical Things. Electronics 2024, 13, 2077.
  38. Fuad, K.A.A.; Chen, L. A Survey on Sparsity Exploration in Transformer-Based Accelerators. Electronics 2023, 12, 2299.
  39. Han, S.; Wang, M.; Zhang, J.; Li, D.; Duan, J. A Review of Large Language Models: Fundamental Architectures, Key Technological Evolutions, Interdisciplinary Technologies Integration, Optimization and Compression Techniques, Applications, and Challenges. Electronics 2024, 13, 5040.
  40. Weber, M.; Fu, D.Y.; Anthony, Q.; Oren, Y.; Adams, S.; Alexandrov, A.; Lyu, X.; Nguyen, H.; Yao, X.; Adams, V.; et al. Redpajama: An open dataset for training large language models. Adv. Neural Inf. Process. Syst. 2024, 37, 116462–116492.
  41. Ansel, J.; Yang, E.; He, H.; Gimelshein, N.; Jain, A.; Voznesensky, M.; Bao, B.; Bell, P.; Berard, D.; Burovski, E.; et al. Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, La Jolla, CA, USA, 27 April–1 May 2024; Volume 2, pp. 929–947.
  42. Ye, Z.; Chen, L.; Lai, R.; Lin, W.; Zhang, Y.; Wang, S.; Chen, T.; Kasikci, B.; Grover, V.; Krishnamurthy, A.; et al. FlashInfer: Efficient and customizable attention engine for LLM inference serving. Proc. Mach. Learn. Syst. 2025, 7.
  43. Singhania, P.; Singh, S.; He, S.; Feizi, S.; Bhatele, A. Loki: Low-rank keys for efficient sparse attention. Adv. Neural Inf. Process. Syst. 2024, 37, 16692–16723.
  44. Chen, Z.; Sadhukhan, R.; Ye, Z.; Zhou, Y.; Zhang, J.; Nolte, N.; Tian, Y.; Douze, M.; Bottou, L.; Jia, Z.; et al. Magicpig: Lsh sampling for efficient llm generation. In Proceedings of the International Conference on Learning Representations, Singapore, 24 April 2025.
  45. Together. Llama-2-7B-32K-Instruct. 2023. Available online: https://huggingface.co/togethercomputer/Llama-2-7B-32K-Instruct (accessed on 20 February 2025).
  46. Meta AI. Introducing Llama 3.1: Our Most Capable Models to Date. 2024. Available online: https://ai.meta.com/blog/meta-llama-3-1/ (accessed on 21 February 2025).
  47. Bai, Y.; Lv, X.; Zhang, J.; Lyu, H.; Tang, J.; Huang, Z.; Du, Z.; Liu, X.; Zeng, A.; Hou, L.; et al. Longbench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 3119–3137.
  48. Hsieh, C.P.; Sun, S.; Kriman, S.; Acharya, S.; Rekesh, D.; Jia, F.; Zhang, Y.; Ginsburg, B. RULER: What’s the real context size of your long-context language models? arXiv 2024, arXiv:2404.06654.
  49. Qwen Team. Qwen2.5: A Party of Foundation Models. 2024. Available online: https://qwenlm.github.io/blog/qwen2.5/ (accessed on 20 February 2025).
  50. Qwen Team. Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens. 2025. Available online: https://qwenlm.github.io/blog/qwen2.5-1m/ (accessed on 21 February 2025).
Figure 1. Overview of the proposed Block-then-Hash Attention.
Figure 2. Comparison of accuracy and token generation speed. For detailed analysis, refer to Section 4.
Figure 3. End-to-end performance comparison of LLM inference under 1.56% token selection. We report both prefill and decode time costs under the same sequence length.
Figure 4. Performance comparison of a single Transformer layer under 1.56% token selection.
Figure 5. Effect of the block selection ratio $\rho_b$ on LCC using Llama-3.1-8B-Instruct.
Figure 6. Ablation study on the token budget. (a) Accuracy comparison on LCC using Llama-2 under different token budgets. (b) Accuracy comparison on FWE using Llama-3.1 under different token budgets.
Figure 7. Ablation study on the number of hash bits. (a) Accuracy comparison on LCC under different numbers of hash bits. (b) Accuracy comparison on GovReport under different numbers of hash bits. The dashed lines denote the dense-attention baselines for Llama-2 and Llama-3.1, respectively.
Table 1. Training configuration of the BTHA orthogonal hash projection.

| Hyperparameter | Value |
|---|---|
| Sampled sequences | 100 |
| Data source | Book and ArXiv [40] |
| Positive-key ratio | Top 10% |
| Optimizer | SGD |
| Training epochs | 20 |
| Training iterations per epoch | 20 |
| Repeated representation-sampling iterations | 10 |
| Learning rate | 0.08 |
| Momentum | 0.9 |
| Weight decay | 10^-6 |
| Balance loss weight λ_bal | 0.5 |
| Orthogonality loss weight λ_orth | 1 |
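Table 1 lists weights for a balance loss and an orthogonality loss but does not spell out their form here. The sketch below shows one common way such regularizers are written, as an illustration only: the function names, the squared Frobenius form of the orthogonality term, and the per-bit mean form of the balance term are assumptions, not the authors' definitions. Only the two weights are taken from Table 1.

```python
from typing import List

Matrix = List[List[float]]

# Weights copied from Table 1; everything else below is an illustrative assumption.
LAMBDA_BAL = 0.5
LAMBDA_ORTH = 1.0


def orthogonality_penalty(w: Matrix) -> float:
    """Squared Frobenius norm of (W W^T - I), with rows of W as hash directions."""
    n = len(w)
    total = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(w[i], w[j]))
            target = 1.0 if i == j else 0.0
            total += (dot - target) ** 2
    return total


def balance_penalty(codes: List[List[int]]) -> float:
    """Mean squared per-bit average of +1/-1 codes; zero when every bit splits evenly."""
    n_bits = len(codes[0])
    means = [sum(c[b] for c in codes) / len(codes) for b in range(n_bits)]
    return sum(m * m for m in means) / n_bits


def regularizer(w: Matrix, codes: List[List[int]]) -> float:
    """Weighted sum of the two hypothetical penalties."""
    return LAMBDA_BAL * balance_penalty(codes) + LAMBDA_ORTH * orthogonality_penalty(w)
```

Under these assumed definitions, an orthonormal projection with evenly split bits incurs no penalty, while duplicated hash directions or constant bits are penalized.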
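Table 2. Transposed accuracy results on LongBench-E with 512 KV position budget. MP denotes MagicPIG, SL denotes StreamingLLM, and S-KV denotes SnapKV. The best result in each column is shown in bold.

| Model | Method | LCC | PRetr | HQA | TQA | Repo | Sam | Trec | MQA | 2Wiki | Gov | PCnt | MltN | Qaspr | AVG. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-2-7B-32K-Instruct | Dense | 67.53 | 11.89 | 15.30 | 85.03 | 55.03 | 39.32 | 69.00 | 22.44 | 13.13 | 32.01 | 1.17 | 24.51 | 11.76 | 34.47 |
| Llama-2-7B-32K-Instruct | Loki | 58.68 | 11.97 | 14.91 | 85.30 | 44.41 | 38.95 | 69.00 | 22.11 | 13.09 | 30.51 | 0.52 | 23.82 | 12.82 | 32.78 |
| Llama-2-7B-32K-Instruct | Quest | 65.14 | 15.53 | 13.64 | 85.18 | 52.57 | 39.24 | 67.57 | 19.33 | 12.51 | 24.83 | 1.20 | 16.61 | 10.93 | 32.64 |
| Llama-2-7B-32K-Instruct | MP | 66.43 | 10.01 | 14.69 | 86.17 | 55.81 | 38.94 | 69.00 | 21.70 | 13.29 | 31.29 | 1.08 | 23.74 | 11.06 | 34.09 |
| Llama-2-7B-32K-Instruct | SL | 46.91 | 4.84 | 8.22 | 60.67 | 35.43 | 19.50 | 28.00 | 15.51 | 7.54 | 19.39 | 0.00 | 18.09 | 5.43 | 20.73 |
| Llama-2-7B-32K-Instruct | H2O | 27.42 | 2.41 | 4.19 | 20.07 | 16.33 | 6.74 | 24.33 | 3.06 | 2.19 | 9.61 | 0.00 | 7.83 | 0.23 | 9.57 |
| Llama-2-7B-32K-Instruct | S-KV | 52.74 | 9.44 | 11.78 | 66.90 | 45.13 | 37.56 | 37.00 | 17.44 | 12.65 | 13.54 | 0.08 | 12.98 | 7.24 | 24.96 |
| Llama-2-7B-32K-Instruct | BTHA | 68.50 | 9.73 | 14.96 | 86.30 | 56.34 | 39.26 | 69.14 | 23.10 | 13.34 | 31.49 | 0.44 | 24.91 | 11.85 | 34.57 |
| Llama-3.1-8B-Instruct | Dense | 67.24 | 99.67 | 60.21 | 91.64 | 52.36 | 42.55 | 71.66 | 54.82 | 44.08 | 35.03 | 13.19 | 26.19 | 44.68 | 54.10 |
| Llama-3.1-8B-Instruct | Loki | 61.29 | 99.67 | 59.48 | 91.45 | 48.47 | 41.99 | 72.33 | 54.47 | 44.33 | 34.74 | 12.74 | 25.85 | 45.15 | 53.23 |
| Llama-3.1-8B-Instruct | Quest | 58.81 | 99.67 | 60.03 | 90.79 | 46.72 | 39.75 | 71.33 | 51.50 | 43.90 | 33.64 | 13.16 | 25.69 | 43.52 | 52.19 |
| Llama-3.1-8B-Instruct | MP | 53.39 | 98.83 | 56.28 | 77.90 | 42.35 | 34.28 | 63.67 | 49.10 | 37.84 | 32.58 | 9.96 | 24.57 | 38.20 | 47.61 |
| Llama-3.1-8B-Instruct | SL | 64.90 | 94.33 | 48.52 | 79.24 | 45.58 | 40.10 | 51.67 | 34.13 | 37.81 | 22.73 | 13.15 | 21.51 | 24.91 | 44.51 |
| Llama-3.1-8B-Instruct | H2O | 64.99 | 94.00 | 57.89 | 92.06 | 46.70 | 41.23 | 65.00 | 45.12 | 40.91 | 29.40 | 12.82 | 24.64 | 33.75 | 49.89 |
| Llama-3.1-8B-Instruct | S-KV | 66.49 | 99.67 | 59.93 | 91.98 | 48.60 | 40.58 | 60.33 | 52.82 | 43.63 | 26.29 | 13.04 | 23.20 | 36.42 | 51.00 |
| Llama-3.1-8B-Instruct | BTHA | 68.52 | 99.67 | 60.12 | 91.96 | 52.51 | 42.23 | 71.67 | 54.63 | 43.26 | 34.36 | 12.48 | 25.78 | 45.03 | 54.02 |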
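Table 3. Transposed accuracy results on RULER. MP denotes MagicPIG, SL denotes StreamingLLM, and S-KV denotes SnapKV. The best result in each column is shown in bold.

| Model | Method | NS1 | NS2 | NS3 | NMK1 | NMK2 | NMV | NMQ | VT | FWE | QA1 | QA2 | AVG. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-2-7B-32K-Instruct | Dense | 93.75 | 100.00 | 91.67 | 93.75 | 81.25 | 66.67 | 52.08 | 21.04 | 48.61 | 30.21 | 36.46 | 65.04 |
| Llama-2-7B-32K-Instruct | Loki | 25.00 | 2.08 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.04 | 19.44 | 14.58 | 16.67 | 7.16 |
| Llama-2-7B-32K-Instruct | Quest | 100.00 | 95.83 | 52.08 | 87.50 | 54.17 | 52.34 | 56.51 | 26.87 | 36.36 | 23.96 | 34.38 | 56.37 |
| Llama-2-7B-32K-Instruct | MP | 97.92 | 93.75 | 54.17 | 83.33 | 71.88 | 59.38 | 43.49 | 16.04 | 51.39 | 30.21 | 35.42 | 57.91 |
| Llama-2-7B-32K-Instruct | SL | 1.04 | 5.21 | 1.04 | 6.25 | 1.04 | 6.25 | 4.17 | 0.21 | 54.86 | 22.92 | 27.08 | 11.82 |
| Llama-2-7B-32K-Instruct | H2O | 0.00 | 0.00 | 0.00 | 3.12 | 0.00 | 0.78 | 0.00 | 0.00 | 20.49 | 15.62 | 16.67 | 5.15 |
| Llama-2-7B-32K-Instruct | S-KV | 25.00 | 1.04 | 0.00 | 2.08 | 0.00 | 0.26 | 0.00 | 3.96 | 20.49 | 22.92 | 26.04 | 9.25 |
| Llama-2-7B-32K-Instruct | BTHA | 99.42 | 99.31 | 82.76 | 94.18 | 78.69 | 64.97 | 54.83 | 20.64 | 42.88 | 28.74 | 36.91 | 63.94 |
| Llama-3.1-8B-Instruct | Dense | 100.00 | 98.96 | 100.00 | 97.92 | 77.08 | 94.27 | 96.09 | 51.04 | 75.35 | 78.12 | 40.62 | 82.68 |
| Llama-3.1-8B-Instruct | Loki | 98.96 | 97.92 | 96.88 | 96.88 | 59.38 | 91.67 | 94.79 | 50.00 | 57.99 | 76.04 | 35.29 | 77.80 |
| Llama-3.1-8B-Instruct | Quest | 100.00 | 93.75 | 47.92 | 97.35 | 53.12 | 78.91 | 90.10 | 61.25 | 63.19 | 73.96 | 38.54 | 72.55 |
| Llama-3.1-8B-Instruct | MP | 94.79 | 69.79 | 51.04 | 65.62 | 20.83 | 44.79 | 57.81 | 41.67 | 57.99 | 67.71 | 38.54 | 55.51 |
| Llama-3.1-8B-Instruct | SL | 1.04 | 4.17 | 3.12 | 5.21 | 3.12 | 4.69 | 5.73 | 0.62 | 67.70 | 51.04 | 34.38 | 16.44 |
| Llama-3.1-8B-Instruct | H2O | 36.46 | 2.08 | 3.12 | 4.17 | 2.08 | 1.82 | 1.56 | 23.33 | 48.61 | 48.96 | 33.33 | 18.68 |
| Llama-3.1-8B-Instruct | S-KV | 98.96 | 88.54 | 8.33 | 89.58 | 3.12 | 13.28 | 52.34 | 52.29 | 38.89 | 78.12 | 39.58 | 51.18 |
| Llama-3.1-8B-Instruct | BTHA | 100.00 | 98.92 | 100.00 | 98.08 | 70.99 | 90.26 | 96.81 | 51.41 | 72.38 | 77.24 | 41.73 | 81.62 |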
Table 4. Accuracy results on LongBench-E for Qwen2.5-14B-Instruct-1M [49] with sparse token budget = 512.

| Methods | LCC | PRetr | HQA | TQA | Repo | Sam | Trec | MQA | 2Wiki | Gov | PCnt | MltN | Qaspr | AVG. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense | 44.32 | 100.00 | 65.96 | 88.41 | 36.25 | 45.52 | 76.34 | 53.73 | 60.68 | 31.93 | 22.83 | 22.14 | 41.41 | 53.04 |
| BTHA | 44.46 | 100.00 | 64.90 | 88.20 | 36.41 | 45.33 | 76.37 | 53.55 | 59.70 | 31.56 | 20.72 | 22.12 | 41.37 | 52.67 |
Table 5. Accuracy results on LongBench-E for Qwen2.5-32B-Instruct [49] with sparse token budget = 512.

| Methods | LCC | PRetr | HQA | TQA | Repo | Sam | Trec | MQA | 2Wiki | Gov | PCnt | MltN | Qaspr | AVG. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense | 54.04 | 99.83 | 69.27 | 86.26 | 36.03 | 43.60 | 75.67 | 52.28 | 60.69 | 30.14 | 22.00 | 21.91 | 44.08 | 53.52 |
| BTHA | 53.84 | 99.83 | 68.88 | 86.32 | 35.82 | 43.14 | 75.27 | 52.09 | 60.21 | 29.02 | 21.90 | 21.84 | 43.90 | 53.24 |
Table 6. Accuracy results on RULER (256K) for Qwen2.5-14B-Instruct-1M [50] with sparse token budget = 4096 (1.56%).

| Methods | NS1 | NS2 | NS3 | NMK1 | NMK2 | NMV | NMQ | VT | FWE | QA1 | QA2 | AVG. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense | 100.00 | 100.00 | 100.00 | 100.00 | 90.00 | 85.00 | 97.50 | 100.00 | 95.00 | 60.00 | 40.00 | 87.95 |
| top-k | 100.00 | 100.00 | 100.00 | 100.00 | 90.00 | 81.25 | 98.75 | 100.00 | 88.33 | 60.00 | 40.00 | 87.12 |
| BTHA | 98.00 | 100.00 | 99.00 | 100.00 | 96.50 | 88.00 | 96.50 | 95.00 | 86.50 | 61.00 | 46.50 | 87.91 |
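The 1.56% figure in the caption and the AVG. column can be checked with a few lines of arithmetic. The sketch below assumes that "256K" means 256 × 1024 = 262,144 tokens and that AVG. is the unweighted mean of the eleven task scores; both are assumptions, although they reproduce the reported numbers.

```python
CONTEXT_TOKENS = 256 * 1024  # "256K" read as 262,144 tokens (assumption)
BUDGET_TOKENS = 4096         # sparse token budget from the Table 6 caption


def budget_fraction(budget: int, context: int) -> float:
    """Fraction of the context that the sparse budget covers."""
    return budget / context


# Per-task scores from Table 6 (NS1 ... QA2), without the AVG. column.
TABLE6 = {
    "Dense": [100.00, 100.00, 100.00, 100.00, 90.00, 85.00, 97.50, 100.00, 95.00, 60.00, 40.00],
    "top-k": [100.00, 100.00, 100.00, 100.00, 90.00, 81.25, 98.75, 100.00, 88.33, 60.00, 40.00],
    "BTHA": [98.00, 100.00, 99.00, 100.00, 96.50, 88.00, 96.50, 95.00, 86.50, 61.00, 46.50],
}


def row_average(scores: list) -> float:
    """Unweighted mean of the task scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


if __name__ == "__main__":
    print(f"budget fraction: {budget_fraction(BUDGET_TOKENS, CONTEXT_TOKENS):.4%}")
    for method, scores in TABLE6.items():
        print(method, row_average(scores))
```

With these assumptions, 4096 / 262,144 = 1.5625%, which the caption rounds to 1.56%, and the row means match the AVG. column (87.95, 87.12, 87.91).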
Table 7. LCC results under different block selection ratios ρ_b on Llama-3.1-8B-Instruct.

| ρ_b | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|
| LCC | 67.85 | 67.77 | 67.80 | 68.62 | 67.92 | 68.21 | 68.04 | 67.78 |
Table 8. Component ablation of block-level routing and hash-based retrieval on Llama-3.1-8B-Instruct.

| Method | Block Routing | Hash Retrieval | LCC |
|---|---|---|---|
| Hash Only | No | Yes | 67.46 |
| Block Only | Yes | No | 66.34 |
| BTHA | Yes | Yes | 68.62 |
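The two components ablated in Table 8 can be illustrated with a toy, pure-Python sketch of the two-stage idea from the abstract: route to blocks using mean key representations, then rank the surviving positions by hash similarity. This is not the paper's operator implementation. The function names, the sign-based hash, the Hamming-distance ranking, the rule `ceil(rho_b * n_blocks)` for the number of kept blocks, and the identity projection standing in for the learned orthogonal hash are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def sign_code(x: Vector, proj: Sequence[Vector]) -> List[int]:
    """One bit per projection row, set when the projection is non-negative."""
    return [1 if dot(x, row) >= 0 else 0 for row in proj]


def hamming(a: List[int], b: List[int]) -> int:
    return sum(p != q for p, q in zip(a, b))


def block_then_hash(query, keys, proj, block_size, rho_b, budget):
    # Stage 1: score each block by the query's dot product with the block's mean key.
    n_blocks = math.ceil(len(keys) / block_size)
    block_scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(k[d] for k in block) / len(block) for d in range(len(query))]
        block_scores.append((dot(query, mean_key), b))
    n_keep = max(1, math.ceil(rho_b * n_blocks))  # assumed meaning of rho_b
    kept = sorted(block_scores, reverse=True)[:n_keep]
    candidates = [i for _, b in kept
                  for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    # Stage 2: rank candidates by Hamming distance between query and key hash codes.
    q_code = sign_code(query, proj)
    ranked = sorted(candidates, key=lambda i: (hamming(q_code, sign_code(keys[i], proj)), i))
    return sorted(ranked[:budget])


# Toy data: 8 two-dimensional keys in 4 blocks of 2; identity stands in for the learned hash.
keys = [[1, 1], [1, -1], [-1, -1], [-1, 1], [2, 0.5], [0.5, 2], [-2, -2], [-1, -2]]
proj = [[1, 0], [0, 1]]
selected = block_then_hash([1, 1], keys, proj, block_size=2, rho_b=0.5, budget=3)
print(selected)  # [0, 4, 5]
```

In this toy run, stage 1 keeps blocks 2 and 0, and stage 2 drops position 1 because its code differs from the query's code in one bit. Setting `rho_b=1.0` would correspond loosely to the "Hash Only" row, and skipping stage 2 to the "Block Only" row.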
Table 9. Effect of offline hash training on Top-k IoU using Llama-3.1-8B-Instruct.

| Hash Variant | Training Status | Top-k IoU |
|---|---|---|
| Random Hash | Before Training | 41.32% |
| Learned Hash | After Training | 59.74% |
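Top-k IoU can be read as the overlap between the positions a retrieval method selects and the positions with the highest true attention scores. The sketch below assumes the usual intersection-over-union definition on two top-k index sets; the function names and the tie-breaking rule are illustrative, and the exact definition used for Table 9 may differ.

```python
from typing import Sequence, Set


def topk_indices(scores: Sequence[float], k: int) -> Set[int]:
    """Indices of the k largest scores; ties are broken by the lower index."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(order[:k])


def topk_iou(true_scores: Sequence[float], approx_scores: Sequence[float], k: int) -> float:
    """Intersection over union of the two top-k index sets."""
    a = topk_indices(true_scores, k)
    b = topk_indices(approx_scores, k)
    return len(a & b) / len(a | b)


# Toy example: exact attention scores versus a hash-based similarity proxy.
true_scores = [0.9, 0.1, 0.8, 0.3, 0.7]
approx_scores = [5, 4, 1, 0, 3]
print(topk_iou(true_scores, approx_scores, k=3))  # 0.5
```

Here the true top-3 set is {0, 2, 4} and the approximate one is {0, 1, 4}, so two shared positions out of four distinct ones give an IoU of 0.5.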
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, R.; Liu, L.; Huang, M. BTHA: Block-Then-Hash Attention for Efficient Long Context. Electronics 2026, 15, 2635. https://doi.org/10.3390/electronics15122635

Chicago/Turabian Style

Liu, Runqian, Lianjun Liu, and Mengxing Huang. 2026. "BTHA: Block-Then-Hash Attention for Efficient Long Context" Electronics 15, no. 12: 2635. https://doi.org/10.3390/electronics15122635

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
