Article

Distance-Based Compression Method for Large Language Models

School of Information Science and Technology, Hangzhou Normal University, Hangzhou 311121, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(17), 9482; https://doi.org/10.3390/app15179482
Submission received: 11 July 2025 / Revised: 19 August 2025 / Accepted: 25 August 2025 / Published: 29 August 2025

Abstract

The computational cost of the Transformer architecture is highly dependent on the length of the input sequence, with a computational complexity of $O(n^2)$ due to the self-attention mechanism. As a result, Transformer-based models, such as Large Language Models, incur significant computational and storage overhead when processing tasks involving long input sequences. To mitigate these challenges, we propose a compression method that allows users to manually adjust the trade-off between compression efficiency and model performance. The method employs a trainable model to minimize information loss, so that the impact on accuracy remains small. On LongBench v2, the method keeps accuracy degradation within acceptable limits.

1. Introduction

Recent advancements in Large Language Models (LLMs) can be attributed not only to the vast number of parameters and extensive training data but also to the inherent advantages of the Transformer architecture [1]. Unlike Recurrent Neural Networks (RNNs), which suffer from issues such as gradient vanishing or gradient explosion [2], the Transformer can capture dependencies between tokens at any position within the input sequence via its self-attention mechanism. The representation of each token is thus dynamically adjusted based on the global context, which lets the Transformer model long-range dependencies more effectively than RNNs.
However, the self-attention mechanism of the Transformer requires global computation for each token, resulting in a computational complexity of $O(n^2)$, where $n$ represents the length of the input sequence. As the sequence length increases, the computational and memory demands grow significantly, requiring considerable resources for both model training and inference. Consequently, the Transformer’s efficiency is sensitive to the length of the input sequence. Nevertheless, the ability of the model to handle long sequences is critical for tasks involving ultra-long text in LLMs, long videos in Large Vision-Language Models (LVLMs) [3], and applications such as agent-based models and Chain-of-Thought (CoT) inference [4]. The need for long-context models in these domains has been well-documented, especially in tasks requiring an understanding of context that extends over large spans of data.
To mitigate this inefficiency, sparse attention mechanisms [5,6,7,8,9] reduce complexity to $O(n \log_2 n)$ or even $O(n)$ by restricting token interactions. However, these methods suffer not only from limited expressiveness but also from predefined biases introduced by manually crafted attention patterns.
Long-context models like Transformer-XL [10], Compressive Transformer [11], Infini-attention [12], and InfLLM [13] extend context retention through recurrence, memory compression, or external memory. Transformer-XL introduced a recurrence mechanism and relative position encoding, enabling it to capture longer dependencies when processing long texts. Compressive Transformer introduced a compression mechanism to reduce memory usage. Infini-attention adopts a dynamic memory adjustment strategy. Recently, InfLLM introduced a training-free memory mechanism to enhance long-context processing. Despite progress, these approaches still struggle to balance memory savings and information preservation in long-range dependencies.
We propose a distance-based cache compression method for token-level memory. This method adheres to the following principles:
  • It employs a trainable model such as an MLP, which we call the merge model, to merge two token-specific cache units with the same compression ratio into a new cache unit.
  • It applies distance-based compression to past memory during inference: the compression ratio progressively increases with the distance from the next token, halting once it reaches a predefined threshold.
  • The compression strategy in our method is deterministic, with the specific positions of compression being predetermined, thus eliminating the need for additional computation during inference.
Our method builds on Compressive Transformer and introduces a hyperparameter L to control compression. It enables smooth transitions between compressed and uncompressed memory while maintaining implementation simplicity. The code is available at https://github.com/Weinen343/DCM (accessed on 19 August 2025).

2. Related Work

2.1. Long-Context Processing

Long-context processing techniques are crucial for tasks requiring extensive contextual information [14,15,16,17,18,19,20,21,22]. By leveraging sparse attention, sliding windows, and hierarchical memory structures [23,24,25], models can scale beyond traditional limitations to better handle long documents, dialogues, and complex sequences.
Transformer-XL [10] introduces a segment-level recurrence mechanism and novel positional encoding to capture long-range dependencies beyond standard Transformers.
Compressive Transformer [11] retains extended context by compressing past hidden states into compact forms, enabling efficient long-sequence modeling.
Infini-attention [12] preserves complete contextual information through dynamic memory adjustments during sequence processing.
InfLLM [13] enhances long-sequence processing without additional training, storing distant context in external memory and efficiently retrieving token-specific information.

2.2. Long-Context Pre-Training

Long-context pre-training refers to the process of training language models to handle and effectively process longer text sequences or contextual windows during pre-training.
The use of relative positional encoding and rotary positional encoding [26] significantly improves the ability of models to handle long-range dependencies. These encoding schemes offer more flexible and efficient positional representations compared to traditional absolute positional encodings, which struggle with long sequences. By incorporating these methods, models are able to scale effectively to longer input lengths while maintaining their performance on tasks that require understanding of distant context.
Memory-efficient strategies [27] compress past hidden states into smaller, more compact representations, thus reducing memory overhead while retaining important contextual information. These strategies enable the model to process longer sequences by efficiently storing and retrieving relevant information without overwhelming the available memory. This approach allows models to capture long-term dependencies and improve their performance on long-context tasks, such as document-level language modeling or multi-turn dialogue systems.
Positional interpolation methods [28] extend the effective context window of models by dynamically interpolating positional encodings. This approach enables the model to handle much longer sequences by adapting the positional encoding across a larger input space, improving its ability to process extended contexts. The ability to interpolate positional information efficiently allows the model to maintain long-range dependencies while reducing the computational cost typically associated with larger context windows.

3. Method

Given the redundancy between tokens and their corresponding KV-cache entries, our method compresses the KV-cache so that a single token can represent information that would otherwise require multiple tokens, thereby effectively reducing the memory demands of the key–value states. Furthermore, since tokens farther from the current position have a smaller impact on current inference, we adopt a hierarchical compression strategy: KV-cache entries closer to the current position are compressed at a lower ratio to preserve as much critical information as possible, while those farther away are compressed at a higher ratio. Even if some information is lost in the latter, its impact on inference performance remains limited.
To implement this, we propose a distance-based compression method. Specifically, we determine the upper limit of each cache unit's compression ratio from its distance to the token generation location and the hyperparameter $L = 2^k$. The upper limit for each cache unit is illustrated in Figure 1, which shows the cache groups and their corresponding compression limits. Ordered by proximity to the next token, the first group of cache units has an upper limit of 1 and contains $L$ units. The second group has an upper limit of 2 and contains $L/2$ units. This pattern continues, with each subsequent group doubling the compression ratio limit and halving the number of units compared to the previous group, until the number of units reaches 1, at which point the compression ratio limit is $L$. Based on this, our method dynamically adjusts and merges the cache during the inference process.
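The group structure described above can be written out as a short sketch. The function name `compression_limits` is illustrative and does not come from the paper's code; it only assumes that $L$ is a power of two, as stated in the text.

```python
def compression_limits(L):
    """Upper compression-ratio limit per cache position, nearest first.

    Group g (g = 0..k, with L = 2**k) holds L // 2**g units,
    each allowed a compression ratio of at most 2**g.
    """
    if L < 1 or L & (L - 1):
        raise ValueError("L must be a power of two")
    limits = []
    ratio, count = 1, L
    while count >= 1:
        limits.extend([ratio] * count)
        ratio, count = ratio * 2, count // 2
    return limits


limits = compression_limits(4)
print(limits)       # [1, 1, 1, 1, 2, 2, 4]
print(len(limits))  # 7, i.e. 2 * L - 1 positions
print(sum(limits))  # 12, i.e. L * (1 + log2(L)) tokens at full compression
```

For $L = 4$ this gives four uncompressed units, two units at ratio 2, and one unit at ratio 4.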

3.1. Merge Planning

In the inference process, effective cache management and compression are critical for ensuring optimal performance. We adopt a priority strategy based on the compression ratio r: cache units with the lowest r are compressed first, to avoid excessive compression of higher-r units that could cause severe information loss. When multiple cache units share the same compression ratio, we prioritize the one farthest from the next predicted token.
To rigorously characterize the compression and merge process, we define the input sequence length as $n$, the number of Transformer layers as $L_{\text{layer}}$, the number of attention heads per layer as $H$, and the key/value dimensions as $d_k$ and $d_v$, respectively. In the $l$-th layer and $h$-th head at time step $t$, the key and value are denoted by $K_t^{(l,h)} \in \mathbb{R}^{d_k}$ and $V_t^{(l,h)} \in \mathbb{R}^{d_v}$. A cache unit $U_t$ is defined as the set of all $(K, V)$ pairs at the same time step across all layers and heads:
$$U_t \triangleq \left\{ \left(K_t^{(l,h)}, V_t^{(l,h)}\right) \,\middle|\, l = 1, \dots, L_{\text{layer}},\ h = 1, \dots, H \right\}.$$
The compression ratio $r \in \mathbb{N}$ denotes the number of original tokens represented by a single cache unit. If a compressed unit $\tilde{U}$ represents $r$ consecutive tokens, then its compression ratio is $r$.
For a single-head attention with query $q$, using an uncompressed history $\mathcal{H}_i = \{(K_t, V_t)\}_{t \le i}$, the output is
$$\mathrm{Attn}(q; \mathcal{H}_i) \triangleq \sum_{t \le i} \alpha_t(q)\, V_t, \qquad \alpha_t(q) = \frac{\exp\left(q^\top K_t / \sqrt{d_k}\right)}{\sum_{s \le i} \exp\left(q^\top K_s / \sqrt{d_k}\right)}.$$
After compression, with the representative set $\tilde{\mathcal{H}}_i = \{(\tilde{K}_u, \tilde{V}_u, r_u)\}_u$ where $r_u$ denotes the number of tokens represented, the attention output becomes
$$\widetilde{\mathrm{Attn}}(q; \tilde{\mathcal{H}}_i) \triangleq \sum_{u} \tilde{\alpha}_u(q)\, \tilde{V}_u, \qquad \tilde{\alpha}_u(q) = \frac{\exp\left(q^\top \tilde{K}_u / \sqrt{d_k}\right)}{\sum_{v} \exp\left(q^\top \tilde{K}_v / \sqrt{d_k}\right)}.$$
The information loss is measured as
$$\mathcal{L}_{\mathrm{IL}} \triangleq \mathbb{E}_q \left\| \mathrm{Attn}(q; \mathcal{H}_i) - \widetilde{\mathrm{Attn}}(q; \tilde{\mathcal{H}}_i) \right\|_2^2,$$
representing the expected squared deviation between the attention outputs with full and compressed histories.
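As a rough illustration of this loss, the sketch below estimates it on toy data for a single head. The averaging function `mean_merge` is only a placeholder for the trainable merge model, and the keys, values, and queries are made-up numbers, not data from the paper.

```python
import math


def attention(q, keys, values):
    """Single-head softmax attention output for one query (plain lists)."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]


def mean_merge(a, b):
    """Placeholder merge (assumption): element-wise average of two vectors."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def squared_error(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
values = [[1.0, 2.0], [1.1, 2.1], [5.0, -1.0]]
queries = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Compress the two oldest entries into one representative (ratio 2).
keys_c = [mean_merge(keys[0], keys[1]), keys[2]]
values_c = [mean_merge(values[0], values[1]), values[2]]

# Monte Carlo estimate of the expectation over queries.
loss = sum(
    squared_error(attention(q, keys, values), attention(q, keys_c, values_c))
    for q in queries
) / len(queries)
print(f"estimated information loss: {loss:.6f}")
```

Even when the two merged entries are nearly identical, the loss is non-zero here because the softmax now distributes mass over two representatives instead of three entries.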
To manage caches, we define two queues: the far queue $Q_1$ and the near queue $Q_2$. $Q_1$ stores cache units that no longer require compression and has no length restriction, allowing long-term storage. $Q_2$ stores cache units that require compression, with a maximum length constraint
$$|Q_2| \le L_{\max} = 2 \times L_{\text{group}} - 1,$$
where $L_{\text{group}}$ is a grouping parameter independent of the number of Transformer layers $L_{\text{layer}}$. During token prediction, the merged queue
$$Q = Q_1 + Q_2$$
forms the historical context $\mathcal{H}_i$ used in attention computation.
As new caches are generated, if $Q_2$ reaches its capacity, a systematic merging process is performed. Starting from the head of $Q_2$, we locate the first cache unit whose compression ratio $r$ has not reached its upper limit. We then find the cache unit closest to the tail of $Q_2$ with the same compression ratio and merge it with its preceding cache unit. If all cache units in $Q_2$ have reached their compression limits, the cache unit at the tail of $Q_2$ is moved to $Q_1$. After either merging or transferring, the newly generated cache unit is appended to $Q_2$.
For each new token, the following process is carried out:
  • If $|Q_2| < 2 \times L_{\text{group}} - 1$, enqueue the new $U$;
  • Otherwise, apply $f(p)$ for the earliest unsatisfied $p \in \mathcal{P}$;
  • If all positions meet their limits, the tail of $Q_2$ is moved to $Q_1$, and the process restarts from $f(1)$.
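The per-token procedure can be simulated on compression ratios alone. The sketch below is a simplified reading of the priority rule stated earlier in this section (lowest ratio first, ties broken in favour of the farthest unit), not the authors' precomputed schedule; the function names and the farthest-first list layout are assumptions made for illustration.

```python
def limits_farthest_first(L):
    """Per-position ratio limits, farthest first: [L, L/2, L/2, ..., 1, ..., 1]."""
    limits, ratio, count = [], 1, L
    while count >= 1:
        limits.extend([ratio] * count)
        ratio, count = ratio * 2, count // 2
    return limits[::-1]


def simulate(L, num_tokens):
    """Track only compression ratios; q1 and q2 are ordered farthest first."""
    limits = limits_farthest_first(L)
    capacity = 2 * L - 1
    q1, q2 = [], []
    for _ in range(num_tokens):
        if len(q2) == capacity:
            # Candidates: units still below the limit of their position.
            below = [i for i, r in enumerate(q2) if r < limits[i]]
            if below:
                # Lowest ratio first; ties go to the farthest unit.
                i = min(below, key=lambda idx: (q2[idx], idx))
                assert q2[i] == q2[i + 1], "only equal-ratio neighbours merge"
                q2[i:i + 2] = [q2[i] * 2]
            else:
                # Everything is at its limit: retire the farthest unit to Q1.
                q1.append(q2.pop(0))
        q2.append(1)  # the new, uncompressed cache unit
    return q1, q2


print(simulate(4, 12))  # ([], [4, 2, 2, 1, 1, 1, 1])
```

With $L_{\text{group}} = 4$, the near queue first reaches the fully compressed state after 12 tokens, and every unit later retired to the far queue carries ratio 4.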
During the extensive inference process of large models, the continuous computation of merge positions introduces a significant temporal overhead. However, upon closer examination, a pattern in the compression scheme becomes apparent, revealing that the real-time calculation of merge positions is, in fact, unnecessary during inference. Instead, it is possible to precompute all merge positions before the actual inference begins, thereby eliminating the need for ongoing calculations during runtime.
The queue of merge positions can be divided into two segments, each corresponding to a phase of the overall merge pattern. The first segment covers the period before any cache unit has been moved to $Q_1$. The second segment begins once all cache units have reached their maximum compression ratios: the cache unit at the tail of the queue is moved to $Q_1$, and merging continues until every cache unit has again reached the upper limit of its compression ratio. All subsequent inference steps rely exclusively on this second segment, so once the initial phase is complete the remaining merge positions follow a predictable, repeating pattern. Because the pattern is cyclic, the entire merge schedule can be computed before inference commences.
As shown in Figure 2 for the case $L_{\text{group}} = 16$, certain positions within the inference process are of particular importance, referred to as key positions. These are located at the rightmost boundary of groups whose compression upper limit is greater than 1. When a merge operation is executed at a specific key position, the resulting leftward shift implies that all subsequent key positions to the right of the current merge position will also require merge operations to maintain their compression upper limits.
Let the positions in $Q_2$ be indexed from nearest to farthest as $1, 2, \dots, 2 \times L_{\text{group}} - 1$. An algorithm precomputes the sequence of key positions:
$$\mathcal{P} = \{p_1, p_2, \dots, p_m\}, \qquad p_i = 2^i - 1, \qquad p_i \le 2 \times L_{\text{group}} - 1.$$
The merge-scheduling operator $f(p)$ for a key position $p$ is defined recursively:
  • First, invoke $f(2 \times p + 1)$, ensuring merges at more distant key positions are performed before handling $p$;
  • At position $p$, perform a merge and record it via recordPosition($p$);
  • Invoke $f(2 \times p + 1)$ again to ensure that, after processing $p$, all further positions still satisfy the compression schedule.
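The recursion above can be expanded into an explicit list of positions. The sketch below assumes a base case that the text does not state explicitly, namely that the recursion stops once $p$ exceeds the queue capacity $2 \times L_{\text{group}} - 1$; the name `merge_schedule` and the use of a plain list in place of recordPosition are illustrative.

```python
def merge_schedule(p, capacity, record):
    """In-order expansion of f(p): f(2p+1), merge at p, f(2p+1) again."""
    if p > capacity:  # assumed base case: no key position beyond the queue
        return
    merge_schedule(2 * p + 1, capacity, record)
    record.append(p)  # stands in for recordPosition(p)
    merge_schedule(2 * p + 1, capacity, record)


L_group = 4
capacity = 2 * L_group - 1

# Key positions p_i = 2**i - 1 that fit inside the queue.
key_positions = []
p = 1
while p <= capacity:
    key_positions.append(p)
    p = 2 * p + 1
print(key_positions)  # [1, 3, 7]

order = []
merge_schedule(1, capacity, order)
print(order)  # [7, 3, 7, 1, 7, 3, 7]
```

Since the resulting order depends only on $L_{\text{group}}$, it can be computed once and reused, which is the property the precomputation argument relies on.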
The computation of merge positions in the first part can be implemented as Algorithm 1:
Algorithm 1 Merge scheduling for first segment
Input: Queue $Q_2$, $L_{\text{group}}$
$i \leftarrow L_{\text{group}}$
while $i > 0$ do
    for $j = 1$ to $i - 1$ do
        recordPosition($j$)
        $f(2 \times j + 1)$
    end for
    $i \leftarrow i / 2$
end while
For the second part, when all units have reached their upper limits, the tail element of $Q_2$ is moved into $Q_1$, and the process restarts from $f(1)$.
In a steady state, the $(2 L_{\text{group}} - 1)$ representative units in $Q_2$ collectively cover approximately
$$T_{Q_2}(L_{\text{group}}) = \sum_{g=0}^{k} \frac{L_{\text{group}}}{2^g} \cdot 2^g = L_{\text{group}} (k + 1) = L_{\text{group}} \left(1 + \log_2 L_{\text{group}}\right)$$
tokens. Given a total context length $n$, the remaining distant tokens are summarized into $Q_1$ at ratio $L_{\text{group}}$, yielding an effective number of units per step of
$$N_{\text{eff}}(n, L_{\text{group}}) \approx (2 L_{\text{group}} - 1) + \max\left\{0, \frac{n - T_{Q_2}(L_{\text{group}})}{L_{\text{group}}}\right\}.$$
This determines the retrieval cost in attention computation and the KV read bandwidth, scaling with slope $1 / L_{\text{group}}$ for $n \gg L_{\text{group}} \log L_{\text{group}}$.
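The two quantities above are straightforward to evaluate numerically. The helper names below are illustrative; the example values ($L_{\text{group}} = 512$, $n = 10^5$) match the setting used later in the experiments.

```python
def tokens_covered_by_q2(L):
    """Sum over groups g = 0..k of (L / 2**g units) * (ratio 2**g)."""
    total, ratio, count = 0, 1, L
    while count >= 1:
        total += count * ratio
        ratio, count = ratio * 2, count // 2
    return total


def effective_units(n, L):
    """Steady-state unit count: a full Q2 plus ratio-L units in Q1."""
    return (2 * L - 1) + max(0.0, (n - tokens_covered_by_q2(L)) / L)


print(tokens_covered_by_q2(512))      # 5120
print(effective_units(100_000, 512))  # 1208.3125
```

For contexts shorter than the coverage of $Q_2$, the second term vanishes and the unit count is bounded by $2 L_{\text{group}} - 1$.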

3.2. Merge Model

The core of our method lies in a merge model designed to minimize information loss during compression of cache units. Simple approaches such as linear averaging may discard high-order semantic interactions. To address this, we introduce a trainable multilayer perceptron (MLP) as the merge operator,
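$$M_\theta : (U_a, U_b) \mapsto U_{ab},$$
which performs nonlinear fusion of adjacent cache units $U_a$ and $U_b$ corresponding to segments $S_a$ and $S_b$. The objective is
$$\forall q : \quad \mathrm{Attn}\left(q; \mathcal{H} \cup \{U_a, U_b\}\right) \approx \mathrm{Attn}\left(q; \mathcal{H} \cup \{M_\theta(U_a, U_b)\}\right).$$
The training objective combines
$$\mathcal{L}_{\mathrm{IL}}(\theta) = \mathbb{E}_q \left\| \mathrm{Attn}\left(q; \mathcal{H} \cup \{U_a, U_b\}\right) - \mathrm{Attn}\left(q; \mathcal{H} \cup \{M_\theta(U_a, U_b)\}\right) \right\|_2^2,$$
$$\mathcal{L}_{\text{align}}(\theta) = \mathbb{E} \left\| h_{i+1}^{\text{full}} - h_{i+1}^{\text{comp}} \right\|_2^2 + \lambda \cdot \mathrm{KL}\left( p_{\text{full}}(\cdot \mid x) \,\|\, p_{\text{comp}}(\cdot \mid x) \right),$$
along with regularization to constrain magnitude and preserve scale invariance of $K$.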
The choice of an MLP is motivated by both efficiency and theoretical considerations: as a universal approximator, it can model any continuous function on a compact domain [29], capturing complex nonlinear interactions between cache units while remaining computationally lightweight. More expressive architectures, such as self-attention-based models, could be used but incur substantial computational and memory costs, especially given the high frequency of merge operations.
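To make the shape of the merge operator concrete, the following toy sketch builds an untrained two-layer MLP that maps a pair of vectors to one vector of the same size. The layer sizes, the ReLU activation, and the random initialization are assumptions for illustration only; the paper's actual architecture and training setup may differ.

```python
import random


def make_merge_mlp(dim, hidden, seed=0):
    """Build a toy two-layer MLP that maps two dim-vectors to one."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 0.1) for _ in range(2 * dim)] for _ in range(hidden)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]

    def merge(u_a, u_b):
        x = list(u_a) + list(u_b)  # concatenate the pair
        # Hidden layer with ReLU activation.
        h = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
        # Linear output layer back to the original dimension.
        return [sum(w * v for w, v in zip(row, h)) for row in w2]

    return merge


merge = make_merge_mlp(dim=4, hidden=8)
u_ab = merge([0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1])
print(len(u_ab))  # 4: one merged vector replaces two inputs
```

In a real setting the same operator would be applied to the key and value vectors of every layer and head in the two cache units, with weights learned from the objectives above.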
To mitigate potential semantic drift—i.e., the mismatch between merged cache representations and the distribution that the LLM was trained on—we suggest a two-stage training procedure. First, the merge model is trained to produce semantically faithful cache representations. Optionally, the LLM can be fine-tuned to better adapt to these compressed representations, thereby improving robustness and inference quality.
During training, to maintain efficiency, we adopt a chunk-based merge strategy: multiple tokens are grouped into a single compression unit, and the merge operation is applied per chunk rather than token by token. This contrasts with stepwise merging during inference and significantly improves training throughput. Figure 3 illustrates the simplified compression structure used in training. The overall training pipeline, depicted in Figure 4, consists of first training the merge model and then optionally fine-tuning the LLM to align with the merged representations.

4. Experiment

4.1. Datasets

LongBench v2 [20] serves as a comprehensive benchmark designed to evaluate the ability of LLMs in understanding and processing extended texts. LongBench v2 is structured into six primary categories, comprising twenty distinct tasks, which encompass critical long-text application scenarios such as Single-Doc QA, Multi-Doc QA, Long ICL, Dialogue History, Code Repo, and Structured Data.

4.2. Hyperparameter Selection

In the hyperparameter selection phase, the primary objective is to evaluate the model’s performance across a range of hyperparameter configurations in a systematic and efficient manner. This process aims to identify the optimal set of hyperparameters that maximizes model performance while minimizing computational overhead. By exploring various combinations of hyperparameters, the goal is to balance trade-offs between accuracy, generalization, and resource utilization, ultimately improving the model’s robustness and efficiency.

4.2.1. MLP Layer Depths

We utilize a standard MLP to integrate distinct cache layers, driven by the hypothesis that two cache instances can be effectively combined to form a unified cache that the LLMs can interpret. This hypothesis appears to be intuitive, as utilizing MLPs for extracting and integrating information is a common practice in the field of deep learning. However, the challenge lies in whether caches with different layers and compression ratios can indeed be merged without significant loss of performance. To explore this, we implemented MLPs with varying layer configurations and monitored the LLMs’ performance after cache compression. This allowed us to assess the optimal MLP configuration for cache integration. If the results suggest suboptimal performance, it may indicate the necessity for a separate MLP for each cache layer. Moreover, even a simple MLP might not suffice for this task, highlighting the potential complexity involved in cache merging.
The experimental results, as shown in Table 1, are promising. They demonstrate that a single, standard MLP is capable of successfully merging caches for some layer configurations, which is consistent with our hypothesis. This outcome suggests that, contrary to initial concerns, the integration process can be achieved efficiently with a single MLP, without the need for additional layer-specific models.

4.2.2. Group Length

The impact of $L_1$ and $L_2$ on accuracy is summarized in Table 2. In our experiments, increasing $L_1$ generally does not degrade accuracy, although it does not always lead to improvements. In contrast, when $L_2 > L_1$, accuracy drops significantly. Within the range $L_2 \le L_1$, accuracy first declines and then gradually recovers as $L_2$ increases. The observed accuracy drop manifests not only as incorrect answers but occasionally as the generation of meaningless content, suggesting that LLMs may struggle to fully interpret caches compressed via MLPs. Possible contributing factors include limited MLP capacity, insufficient training, or the potential need for model fine-tuning to accommodate compressed caches.
Interestingly, simply setting $L_1 = L_2$ does not guarantee improved accuracy. Larger $L_1$ values sometimes lead to higher accuracy, particularly when $L_2$ is also large. This indicates that models trained with higher $L_1$ can better adapt to higher compression ratios. If this hypothesis holds, it implies that a single general-purpose model trained with a sufficiently large $L$ can accommodate a range of $L$ values at inference, without the need for separate models for each configuration.
We conduct a detailed sensitivity study by varying $L$ across a wide range of values. We evaluate model accuracy relative to the reference setting $L = 512$ (denoted as $\Delta$) and simultaneously measure the information loss $\mathcal{L}_{\mathrm{IL}}$. A positive $\Delta$ indicates accuracy degradation compared with the baseline $L = 512$.
Overall, the curves in Figure 5 indicate that smaller $L$ values generally reduce $\mathcal{L}_{\mathrm{IL}}$ and improve task accuracy. In contrast, when $L$ becomes larger, the risk of over-compression effects and reduced alignment with the trained merge distribution leads to increased information loss ($\mathcal{L}_{\mathrm{IL}}$) and lower accuracy. Nevertheless, the overall trend tends to flatten as $L$ continues to increase.
Empirically, when L is small, our method behaves similarly to a Compressive Transformer: the compression ratio is low, resulting in minimal information loss but also limited performance gains. The uncompressed window in this regime is smaller than that of a standard Compressive Transformer. When L is large, distant caches are highly compressed, while the low-compression region expands, effectively acting as a sliding window. Compared to a conventional sliding window, our approach retains more information in the low-compression region and preserves distant information in a highly compressed form. The former improves memory efficiency, while the latter can be advantageous or detrimental depending on whether the model can extract meaningful signals from highly compressed caches.
The steady-state effective unit count can be approximated as
$$N_{\text{eff}}(n, L) \approx (2L - 1) + \max\left\{0, \frac{n - L(1 + \log_2 L)}{L}\right\}.$$
If the design objective is to minimize a combination of error and cost,
$$\min_{L} \; E(L) + \mu \cdot N_{\text{eff}}(n, L),$$
where $E(L)$ decreases with $L$ and can be modeled as $E(L) \approx c_1 / \log L$ (reflecting that the effective coverage $T_{Q_2} \propto L \log L$ grows with $L$), and the dominant cost term is $c_2 L + c_3 n / L$. The first-order condition yields
$$L^{\star} \propto \sqrt{n},$$
neglecting the slowly varying log L factor. This provides a theoretical rationale for the empirical rule that L should scale with the square root of the context length.
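The square-root scaling can be checked with a short worked derivation. The sketch below keeps only the dominant cost term given above and treats the contribution of $E'(L)$ as negligible, which is a simplifying assumption rather than a step spelled out in the text.

```latex
\begin{aligned}
J(L) &= E(L) + \mu \left( c_2 L + \frac{c_3\, n}{L} \right), \\
\frac{\mathrm{d}J}{\mathrm{d}L} &= E'(L) + \mu \left( c_2 - \frac{c_3\, n}{L^2} \right) = 0, \\
E'(L) \approx 0 \;&\Longrightarrow\; L^{\star} = \sqrt{\frac{c_3\, n}{c_2}} \;\propto\; \sqrt{n}.
\end{aligned}
```

The constants $c_2$ and $c_3$ only shift the proportionality factor, so the growth rate in $n$ is unaffected by their values.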
In practice, we recommend setting
$$L = \sqrt{\text{sentence\_length} + \text{max\_token}}$$
as a heuristic choice for the hyperparameter. This balances compression efficiency and accuracy, allowing the model to retain sufficient context while keeping memory usage manageable.
Future work may explore the adaptive selection of L based on the semantic importance of tokens or cache units, potentially improving performance without increasing computational cost.

4.3. Quality, Stability, and Efficiency of Hierarchical Cache Compression

We evaluate the quality and stability of cache compression using three metrics: information loss, semantic drift, and boundary discontinuity. The information loss metric is defined as
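$$\mathcal{L}_{\mathrm{IL}} = \mathbb{E}_q \left\| \mathrm{Attn}(q; \mathcal{H}_i) - \widetilde{\mathrm{Attn}}(q; \tilde{\mathcal{H}}_i) \right\|_2^2,$$
computed over a sampled set of queries $q$ at each decoding step and averaged across layers. Semantic drift is quantified using the Semantic Drift Index (SDI):
$$\mathrm{SDI} \triangleq \alpha \cdot \left(1 - \cos(h_{\text{full}}, h_{\text{comp}})\right) + (1 - \alpha) \cdot \mathrm{JS}\left(\alpha_{\text{full}}, \alpha_{\text{comp}}\right),$$
where $\alpha_{\text{full}}$ and $\alpha_{\text{comp}}$ are attention weight distributions, and JS denotes Jensen–Shannon divergence. Lower SDI values indicate higher semantic stability. Boundary Semantic Discontinuity (BSD) measures discontinuities at block boundaries:
$$\mathrm{BSD} \triangleq \frac{1}{Z} \sum_{b \,\in\, \text{boundaries}} \left\| \alpha_{b - \epsilon} - \alpha_{b + \epsilon} \right\|_1,$$
where $\epsilon$ is a small step radius and $Z$ is a normalization factor.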
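A minimal sketch of the SDI computation is given below. It assumes that the two attention distributions have already been aligned to a common support (for example, by summing the full attention weights over the tokens that each compressed unit represents); that alignment step, the default mixing weight of 0.5, and the function names are assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def js(p, q):
    """Jensen-Shannon divergence of two distributions over the same support."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def sdi(h_full, h_comp, attn_full, attn_comp, alpha=0.5):
    """Mix of hidden-state cosine distance and attention JS divergence."""
    return alpha * (1 - cosine(h_full, h_comp)) + (1 - alpha) * js(attn_full, attn_comp)


h_full, h_comp = [1.0, 0.0, 2.0], [0.9, 0.1, 2.1]
attn_full, attn_comp = [0.7, 0.2, 0.1], [0.6, 0.3, 0.1]
print(f"SDI = {sdi(h_full, h_comp, attn_full, attn_comp):.6f}")
```

With natural logarithms the JS term lies between 0 and $\ln 2$, so the two components of the index are on comparable but not identical scales.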
During training, compression is applied per block, while inference uses multi-level hierarchical merging. To reduce distribution shift and BSD, we propose the following: (A) Train-As-Inference (TAI), simulating the inference-time binary merge schedule within each block; (B) boundary regularization, adding a penalty γ · BSD to encourage continuity across block edges.
We evaluate $\mathcal{L}_{\mathrm{IL}}$, SDI, and BSD on Single-Doc QA (9-layer MLP, $L = 512$, block length 128). Table 3 summarizes baseline results and the improvements from Train-As-Inference (TAI) and boundary regularization.
For $L = 512$, TAI reduces BSD from 0.065 to 0.048 and $\mathcal{L}_{\mathrm{IL}}$ from 0.13 to 0.12; combining both techniques further reduces BSD to 0.041 and $\mathcal{L}_{\mathrm{IL}}$ to 0.11.
Efficiency is measured via per-step latency $t_{\text{step}}$, peak memory $M_{\text{peak}}$, energy consumption $E$, KV bytes per unit $B_{\text{KV}}$, and effective unit count $N_{\text{eff}}(n, L)$. Memory compression ratio is computed as
$$C_M(n, L) \triangleq \frac{n}{(2L - 1) + \max\left\{0, \left(n - L(1 + \log_2 L)\right) / L\right\}}.$$
Table 4 presents representative values for $n = 10^5$ tokens on an A100 80GB GPU in BF16 precision.
Compared to uncompressed inference, $L = 512$ achieves approximately 82.8× memory compression, approximately 10× faster per-step latency, and approximately 8× lower energy consumption. Memory growth follows $O((2L - 1) + n / L)$, with constant per-step merge cost due to precomputed positions. Our method is deterministic, in contrast to other architectures such as Compressive Transformer, Infini-attention, or InfLLM, which have varying degrees of determinism and memory growth patterns.
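The reported memory compression figure follows directly from the formula for $C_M$. Substituting $n = 10^5$ and $L = 512$ (so that $\log_2 L = 9$) gives the worked calculation below.

```latex
\begin{aligned}
L(1 + \log_2 L) &= 512 \times 10 = 5120, \\
\frac{n - L(1 + \log_2 L)}{L} &= \frac{100000 - 5120}{512} \approx 185.3, \\
C_M(10^5, 512) &= \frac{100000}{1023 + 185.3} \approx \frac{100000}{1208.3} \approx 82.8 .
\end{aligned}
```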
To account for differences in backbone strength, we normalize scores
$$S_{\text{norm}} = \frac{S_{\text{method}}}{S_{\text{base}}},$$
yielding 1.02, 1.05, and 1.01 for Single-Doc QA, Multi-Doc QA, and Few-Shot tasks, respectively.

4.4. Prediction Accuracy

To evaluate the accuracy of our method on long-context tasks, we conducted experiments on the LongBench v2 dataset and compared the results with mainstream LLMs such as GLM-4-9B-Chat and Qwen2.5-72B-Inst [30]. As shown in Table 5, our model achieves accuracy close to GLM-4-9B-Chat with the hyperparameter setting of $L = 512$. However, considering that our base model Llama-3.1-8B-Instruct already has a stronger capability in handling long contexts, the results are not particularly surprising.

5. Conclusions

We propose a distance-based compression method that builds on the idea of the Compressive Transformer. It offers more flexible control of the compression ratio with a simple compression scheme and reduces computational cost, while keeping accuracy close to that of the original model. However, our method still has limitations. It heavily relies on two components: a merge model that can combine KV-caches with minimal information loss, and an LLM capable of effectively interpreting the compressed KV-cache. Both components currently lack theoretical foundations and efficient training methodologies.

Author Contributions

Conceptualization, H.S.; methodology, H.S.; software, H.S.; validation, H.S.; writing—original draft preparation, H.S.; writing—review and editing, H.S.; visualization, H.S.; supervision, B.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Vaswani, A. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  2. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166. [CrossRef] [PubMed]
  3. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  4. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837.
  5. Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv 2019, arXiv:1904.10509.
  6. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The long-document transformer. arXiv 2020, arXiv:2004.05150.
  7. Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451.
  8. Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768.
  9. Zaheer, M.; Guruganesh, G.; Dubey, K.A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big bird: Transformers for longer sequences. Adv. Neural Inf. Process. Syst. 2020, 33, 17283–17297.
  10. Dai, Z. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv 2019, arXiv:1901.02860.
  11. Rae, J.W.; Potapenko, A.; Jayakumar, S.M.; Lillicrap, T.P. Compressive transformers for long-range sequence modelling. arXiv 2019, arXiv:1911.05507.
  12. Munkhdalai, T.; Faruqui, M.; Gopal, S. Leave no context behind: Efficient infinite context transformers with infini-attention. arXiv 2024, arXiv:2404.07143.
  13. Xiao, C.; Zhang, P.; Han, X.; Xiao, G.; Lin, Y.; Zhang, Z.; Liu, Z.; Han, S.; Sun, M. Infllm: Unveiling the intrinsic capacity of llms for understanding extremely long sequences with training-free memory. arXiv 2024, arXiv:2402.04617.
  14. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
  15. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
  16. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186.
  17. Rajpurkar, P. Squad: 100,000+ questions for machine comprehension of text. arXiv 2016, arXiv:1606.05250.
  18. Zhang, Y. Dialogpt: Large-Scale generative pre-training for conversational response generation. arXiv 2019, arXiv:1911.00536.
  19. Tay, Y.; Dehghani, M.; Abnar, S.; Shen, Y.; Bahri, D.; Pham, P.; Rao, J.; Yang, L.; Ruder, S.; Metzler, D. Long range arena: A benchmark for efficient transformers. arXiv 2020, arXiv:2011.04006.
  20. Bai, Y.; Tu, S.; Zhang, J.; Peng, H.; Wang, X.; Lv, X.; Cao, S.; Xu, J.; Hou, L.; Dong, Y.; et al. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv 2024, arXiv:2412.15204.
  21. Li, T.; Zhang, G.; Do, Q.D.; Yue, X.; Chen, W. Long-context llms struggle with long in-context learning. arXiv 2024, arXiv:2404.02060.
  22. Zhang, X.; Chen, Y.; Hu, S.; Xu, Z.; Chen, J.; Hao, M.K.; Han, X.; Thai, Z.L.; Wang, S.; Liu, Z.; et al. ∞Bench: Extending Long Context Evaluation Beyond 100K Tokens. arXiv 2024, arXiv:2402.13718.
  23. Wu, Y.; Rabe, M.N.; Hutchins, D.; Szegedy, C. Memorizing transformers. arXiv 2022, arXiv:2203.08913.
  24. Chandar, S.; Ahn, S.; Larochelle, H.; Vincent, P.; Tesauro, G.; Bengio, Y. Hierarchical memory networks. arXiv 2016, arXiv:1605.07427.
  25. Wu, Y.; Zhao, Y.; Hu, B.; Minervini, P.; Stenetorp, P.; Riedel, S. An efficient memory-augmented transformer for knowledge-intensive nlp tasks. arXiv 2022, arXiv:2210.16773.
  26. Su, J.; Ahmed, M.; Lu, Y.; Pan, S.; Bo, W.; Liu, Y. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 2024, 568, 127063.
  27. Xiong, W.; Liu, J.; Molybog, I.; Zhang, H.; Bhargava, P.; Hou, R.; Martin, L.; Rungta, R.; Sankararaman, K.A.; Oguz, B.; et al. Effective long-context scaling of foundation models. arXiv 2023, arXiv:2309.16039.
  28. Chen, S.; Wong, S.; Chen, L.; Tian, Y. Extending context window of large language models via positional interpolation. arXiv 2023, arXiv:2306.15595.
  29. Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989, 2, 359–366.
  30. Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 technical report. arXiv 2025, arXiv:2505.09388.
Figure 1. Upper limit of the compression ratio for each cache unit, with different colors representing different groups. The closer a cache unit is to the next token, the lower its compression ratio. To simplify inference, each group’s upper limit is double that of the previous group and its number of units is half, so the product of the upper limit and the number of units remains constant at the hyperparameter $L$. The light gray region on the far left represents the $m$ cache units moved into $Q_1$.
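The grouping rule described in the Figure 1 caption can be illustrated with a short sketch. It assumes, purely for illustration, that the group nearest the next token has an upper limit of 1 (no compression) and holds L units, and that L is a power of two; the function name `group_schedule` is hypothetical and does not come from the paper.

```python
def group_schedule(L: int) -> list[tuple[int, int]]:
    """Return (upper_limit, num_units) for each group, nearest group first.

    Assumption (not stated in the caption): the first group has an upper
    limit of 1 and contains L units, and L is a power of two.
    """
    if L < 1 or L & (L - 1):
        raise ValueError("this sketch assumes L is a power of two")
    schedule = []
    limit, units = 1, L
    while units >= 1:
        # Every group satisfies limit * units == L.
        schedule.append((limit, units))
        limit *= 2   # upper limit doubles for the next, more distant group
        units //= 2  # number of units halves
    return schedule


for limit, units in group_schedule(8):
    print(f"upper limit {limit:>2}, units {units:>2}, product {limit * units}")
```

Under these assumptions, each group can cover up to L original tokens while storing progressively fewer cache units the farther it lies from the next token.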
Figure 2. Illustration of the two distinct cases that occur during the compression process. (a) The first case occurs before any cache is moved into $Q_1$, where the length of $Q_2$ is maintained at $2 \times L_1$ through constant compression while new caches are added. (b) The second case occurs after a cache from $Q_2$ has been moved into $Q_1$, where some caches in $Q_2$ have not yet reached the upper limit of the compression ratio and still need to be compressed following the move.
Figure 3. Illustration of the compression method during the training phase, which requires only pairwise compression of the cache units on the right side.
Figure 4. Illustration of the training process. The first stage is to train the merge model so that it can retain as much information as possible when compressing the cache. The second stage is to fine-tune the LLM and merge model so that the LLM correctly understands the compressed cache.
Figure 5. Sensitivity curves under varying $L$. (a) Relative change $\Delta$: positive values indicate performance degradation. (b) Information loss $L_{\mathrm{IL}}$: lower values indicate a closer approximation to the full-attention model.
Table 1. Predicted performance of the Llama-3.1-8B-Instruct model as a base model, with merge models of varying MLP layer depths, on the Qasper, MultiFieldQA-en, and 2WikiMultihopQA datasets.
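| Layer | Qasper | MultiFieldQA-en | 2WikiMultihopQA |
|---|---|---|---|
| 3 | 38.5 | 28.5 | 42.0 |
| 6 | 44.0 | 34.0 | 48.5 |
| 9 | 46.5 | 35.5 | 50.0 |
| 12 | 47.0 | 37.0 | 51.5 |
| 15 | 48.5 | 38.5 | 53.0 |
| 18 | 47.0 | 39.0 | 52.0 |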
Table 2. Predicted accuracy with Llama-3.1-8B-Instruct as the base model and a 9-layer MLP merge model on the Single-Doc QA, Multi-Doc QA, and Few-Shot Learning tasks, using hyperparameter $L = L_1$ during training and $L = L_2$ during testing.
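| $L_1$ \ $L_2$ | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| 4 | 32.8 | 29.3 | 30.7 | 25.6 | 20.3 | 15.8 | 13.5 | 10.2 |
| 8 | 31.7 | 29.9 | 27.1 | 24.5 | 22.1 | 16.9 | 12.8 | 9.8 |
| 16 | 33.0 | 32.0 | 28.3 | 22.0 | 21.4 | 15.2 | 11.4 | 9.0 |
| 32 | 34.6 | 31.8 | 31.8 | 28.2 | 18.0 | 13.3 | 10.0 | 7.6 |
| 64 | 30.3 | 31.3 | 32.4 | 29.6 | 24.4 | 14.5 | 12.3 | 8.7 |
| 128 | 29.6 | 30.8 | 32.9 | 26.7 | 25.1 | 19.8 | 11.0 | 9.3 |
| 256 | 29.5 | 29.9 | 31.7 | 27.1 | 26.3 | 20.0 | 14.2 | 10.8 |
| 512 | 34.0 | 31.5 | 29.0 | 26.0 | 28.0 | 30.0 | 32.0 | 33.8 |

(a) Predicted accuracy on Single-Doc QA.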
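| $L_1$ \ $L_2$ | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| 4 | 24.1 | 18.5 | 18.7 | 15.3 | 14.6 | 10.7 | 9.8 | 8.1 |
| 8 | 23.0 | 19.8 | 19.4 | 15.9 | 15.1 | 11.5 | 8.9 | 6.5 |
| 16 | 23.2 | 22.1 | 20.6 | 17.3 | 16.2 | 12.4 | 10.1 | 7.2 |
| 32 | 25.1 | 23.7 | 23.5 | 18.8 | 17.5 | 13.3 | 9.5 | 6.8 |
| 64 | 24.6 | 23.9 | 24.4 | 20.0 | 18.8 | 14.3 | 11.6 | 8.0 |
| 128 | 24.8 | 24.2 | 25.7 | 21.5 | 19.5 | 16.2 | 12.0 | 9.4 |
| 256 | 24.9 | 24.3 | 25.9 | 22.0 | 20.3 | 16.8 | 13.5 | 9.5 |
| 512 | 25.2 | 24.0 | 23.2 | 21.0 | 21.5 | 23.0 | 23.8 | 24.1 |

(b) Predicted accuracy on Multi-Doc QA.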
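| $L_1$ \ $L_2$ | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| 4 | 60.4 | 54.0 | 54.8 | 48.6 | 36.6 | 20.5 | 20.2 | 15.0 |
| 8 | 65.2 | 55.7 | 54.1 | 48.5 | 39.0 | 26.5 | 15.4 | 10.2 |
| 16 | 64.4 | 60.9 | 56.0 | 46.8 | 36.8 | 22.6 | 17.0 | 12.1 |
| 32 | 63.0 | 61.5 | 62.4 | 52.0 | 36.6 | 20.5 | 14.8 | 11.0 |
| 64 | 62.4 | 62.2 | 62.5 | 55.0 | 47.0 | 26.0 | 21.0 | 15.6 |
| 128 | 61.8 | 62.4 | 63.7 | 54.7 | 46.0 | 36.8 | 20.2 | 18.5 |
| 256 | 63.0 | 60.8 | 62.4 | 55.7 | 45.8 | 37.7 | 23.3 | 20.4 |
| 512 | 65.2 | 62.0 | 60.0 | 54.0 | 55.0 | 58.5 | 61.0 | 65.4 |

(c) Predicted accuracy on Few-Shot Learning.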
Table 3. Baseline metrics for Single-Doc QA and improvements from TAI and boundary regularization.
| Method | $L_{\mathrm{IL}}$ | SDI | BSD |
|---|---|---|---|
| Baseline | 0.13 | 0.10 | 0.065 |
| TAI only | 0.12 | 0.10 | 0.048 |
| TAI + Boundary Reg. | 0.11 | 0.08 | 0.041 |
Table 4. Computational efficiency and memory usage for different L values.
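| Configuration | $t_{\mathrm{step}}$ (ms) | $M_{\mathrm{peak}}$ (GB) | $E$ (J/1k tokens) | $N_{\mathrm{eff}}$ |
|---|---|---|---|---|
| Uncompressed | 38.5 | 46 | 118 | $10^5$ |
| $L = 512$ | 3.9 | 6.1 | 15 | 1208 |
| $L = 256$ | 5.6 | 8.7 | 21 | 1357 |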
Table 5. Predicted accuracy with Llama-3.1-8B-Instruct as the base model and a 9-layer MLP merge model, evaluated across six tasks: I. Single-Doc QA, II. Multi-Doc QA, III. Long ICL, IV. Dialogue History, V. Code Repo, and VI. Structured Data, with hyperparameter $L = 512$. The best results are highlighted in bold.
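| Model | Avg | I | II | III | IV | V | VI |
|---|---|---|---|---|---|---|---|
| GLM-4-9B-Chat | 30.2 | 30.9 | 27.2 | 33.3 | 38.5 | 28.0 | 24.2 |
| GLM-4-Plus | 44.3 | 41.7 | 42.4 | 46.9 | 51.3 | 46.0 | 48.5 |
| Qwen2.5-72B-Inst. | 39.4 | 40.6 | 35.2 | 42.0 | 25.6 | 50.0 | 42.4 |
| GPT-4o | 50.1 | 48.6 | 44.0 | 58.0 | 46.2 | 56.0 | 51.5 |
| Llama-3.1-8B-Inst. | 30.0 | 34.9 | 30.4 | 23.5 | 17.9 | 32.0 | 30.3 |
| Our method (Llama-3.1-8B-Inst.) | 35.6 | 35.8 | 36.1 | 38.2 | 37.4 | 39.3 | 28.0 |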
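As a reading aid for Table 5, the per-task difference between the proposed method and its uncompressed base model (Llama-3.1-8B-Inst.) can be computed directly from the two rows. The sketch below copies the task scores from the table; the variable and function names are illustrative only. The Avg column is left out because the table does not state how it is aggregated.

```python
# Task scores copied from the Llama-3.1-8B-Inst. and "Our method" rows of Table 5.
TASKS = ["I", "II", "III", "IV", "V", "VI"]
base = [34.9, 30.4, 23.5, 17.9, 32.0, 30.3]
ours = [35.8, 36.1, 38.2, 37.4, 39.3, 28.0]


def point_differences(method: list[float], baseline: list[float]) -> list[float]:
    """Accuracy difference in points (method minus baseline), rounded to 0.1."""
    return [round(m - b, 1) for m, b in zip(method, baseline)]


diff = dict(zip(TASKS, point_differences(ours, base)))
for task, delta in diff.items():
    print(f"Task {task}: {delta:+.1f} points")
```

Read this way, the largest differences fall on tasks III and IV, and task VI is the only one where the compressed model scores below the base model.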


Shen, H.; Hu, B. Distance-Based Compression Method for Large Language Models. Appl. Sci. 2025, 15, 9482. https://doi.org/10.3390/app15179482
