Article

V-PRUNE: Semantic-Aware Patch Pruning Before Tokenization in Vision–Language Model Inference

Department of Computer Science, Hanyang University, Seoul 04763, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(17), 9463; https://doi.org/10.3390/app15179463
Submission received: 1 August 2025 / Revised: 25 August 2025 / Accepted: 27 August 2025 / Published: 28 August 2025

Abstract

Recent vision–language models (VLMs) achieve strong performance across multimodal benchmarks but suffer from high inference costs due to the large number of visual tokens. Prior studies have shown that many image tokens receive consistently low attention scores during inference, indicating that a substantial portion of visual content contributes little to final predictions. These observations raise questions about the efficiency of conventional token pruning strategies, which are typically applied after all attention operations and depend on late-emerging attention scores. To address this, we propose V-PRUNE, a semantic-aware patch-level pruning framework for vision–language models that removes redundant content before tokenization. By evaluating local similarity via color and histogram statistics, our method enables lightweight and interpretable pruning without architectural changes. Applied to CLIP-based models, our approach reduces FLOPs and inference time across vision–language understanding tasks, while maintaining or improving accuracy. Qualitative results further confirm that essential regions are preserved and the pruning behavior is human-aligned, making our method a practical solution for efficient VLM inference.

1. Introduction

Vision–language models (VLMs) have shown strong performance on tasks like image captioning [1], visual question answering [2], and image–text retrieval [3]. However, their high inference cost poses challenges for real-world deployment. As VLMs grow in size and complexity, improving inference efficiency without compromising performance is increasingly crucial.
One popular approach to alleviating this issue is token pruning [4,5,6], which discards less important visual tokens based on attention scores. While effective, this approach often overlooks semantically important regions, as it relies solely on attention distributions that may not align with the true structure of the image. Moreover, attention-based pruning requires full attention computation over all tokens before any can be removed, leading to unnecessary overhead. This raises the question of whether such post-attention pruning is truly the most efficient strategy.
Based on these observations, this work explores the following research questions: (1) Is post-attention token pruning the most effective strategy for achieving efficiency in VLMs? (2) Can pruning visual regions at the image level, before tokenization, provide greater efficiency while maintaining accuracy?
Motivated by this, we propose V-PRUNE, which shifts pruning decisions to an earlier stage: directly at the image level, before tokenization and attention calculations. By evaluating each patch based on its visual characteristics and selectively removing redundant ones, our approach enables semantically grounded pruning decisions that better reflect the inherent meaning of the image, as illustrated in Figure 1. It allows for more aggressive removal of less informative background regions, thereby concentrating computational resources on semantically meaningful areas.
Our contributions are summarized as follows:
  • We introduce a patch-level pruning strategy that eliminates redundant visual regions prior to token embedding.
  • Our approach leverages semantic cues to guide pruning, rather than relying solely on attention weights.
  • The proposed method significantly improves inference efficiency while preserving task performance across vision–language benchmarks.

2. Related Work

2.1. Vision–Language Model Optimization

As vision–language models (VLMs) have become increasingly large and powerful, a growing body of research has focused on improving their efficiency, particularly during inference. Approaches such as model distillation [7,8], parameter-efficient adaptation [9], and token reduction [10,11] have been explored to reduce computational cost while maintaining performance. Among these, token pruning has emerged as a particularly effective technique for optimizing transformer-based VLMs.

2.2. Token Pruning Methods

Token pruning techniques aim to reduce the number of tokens processed in the attention mechanism by discarding tokens deemed less important. Early methods, such as DynamicViT [10], use a gating network to dynamically prune tokens at different transformer layers based on learned importance scores. The gating network is jointly trained with the main model and enables the model to adaptively retain only the most informative tokens based on the input content. Specifically targeting vision–language models, FastV [5] accelerates inference by pruning visual tokens after cross-modal attention layers. FastV identifies visual tokens that contribute minimally to the multimodal interaction and dynamically prunes them based on attention-based importance estimates. Although FastV effectively reduces visual redundancy in VLMs, it operates at the token level and relies on post-embedding attention computations.
Despite the effectiveness of these methods, they operate at the token level after embedding, relying on learned predictors or attention scores without explicitly considering the intrinsic visual structure of the original image. As a result, pruning decisions may not fully capture the high-level semantic relevance of different regions in the input.

2.3. Patch-Based Approaches

Patch-based processing has been a standard approach in vision tasks since the introduction of Vision Transformers (ViTs) [12], which divide input images into fixed-size patches and treat them as token sequences for transformer models. Variants such as DeiT [13] and Swin Transformer [14] further refined patch-based designs to improve training efficiency and performance. Building on the patch-based representation, recent works have explored reducing the number of patches to enhance inference efficiency. For instance, PatchSlimming [15] estimates patch importance with a lightweight predictor and prunes less informative patches during inference. However, it is tailored for vision-only tasks and is not directly applicable to vision–language models.
In contrast, we extend the concept of patch-based pruning to the VLM setting, enabling pruning decisions to be guided by the visual characteristics of the raw input while preserving essential semantic content needed for multimodal understanding.

3. Methods

We propose a patch-level pruning pipeline that reduces visual redundancy before tokenization and image encoding. The architecture of our method is illustrated in Figure 2. It consists of two key stages: (1) the pruning decision and (2) patch reorganization and tokenization.

3.1. Problem Setup

We propose a patch-based pruning framework that operates prior to image embedding. By examining raw image patches directly, we can exploit the semantic cues present in the visual content itself to identify redundant or less meaningful regions before they are converted into tokens. This enables more aggressive elimination of redundant or background regions, which often carry repetitive semantics, while preserving critical content necessary for multimodal understanding.

3.2. Patch Division

Before the pruning decision, the image is partitioned into a grid of non-overlapping square patches. We define a group as P = {p_1, p_2, p_3, p_4}, consisting of four spatially adjacent patches. Grouping neighboring patches into a 2 × 2 block is particularly beneficial, as it allows redundancy to be identified across spatially related regions rather than treating each patch in isolation. We empirically validate the choice of 2 × 2 groups in Section 4.2.5. Within each group, the importance of individual patches is evaluated, and less informative patches can be selectively removed, while the remaining ones proceed to the embedding stage of the vision encoder. This preprocessing step enables patch-level operations such as pruning and selection to be applied efficiently before tokenization.
We adopt a patch-based approach rather than a pixel-based analysis for three main reasons. First, computational efficiency: analyzing every pixel individually would require processing hundreds of thousands of elements per image, leading to prohibitive overhead. In contrast, patch-based partitioning aggregates pixels into fixed-size units, significantly reducing the search space for redundancy removal. Second, semantic representation: patches naturally preserve local context, such as textures or object boundaries, which cannot be captured by isolated pixel values. This makes patch-level analysis more aligned with the semantic structure of the image. Third, model performance: since modern vision encoders (e.g., CLIP) already operate on patch embeddings, pruning at the patch level provides a seamless interface with the model architecture, ensuring that efficiency gains are achieved without disrupting downstream inference.
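To make this preprocessing concrete, the following sketch (a minimal NumPy illustration, not the authors' released code) divides a 336 × 336 input into the 24 × 24 grid of 14 × 14 patches described in Section 4.1 and iterates over 2 × 2 groups; the function names are ours.

```python
# Minimal sketch of patch division and 2x2 grouping (illustrative, not the released code).
import numpy as np

def divide_into_patches(image: np.ndarray, patch: int = 14) -> np.ndarray:
    """Split an (H, W, 3) image into a (Gh, Gw, patch, patch, 3) grid of patches."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch              # 336 / 14 = 24 -> 24 x 24 grid, 576 patches
    return (image[:gh * patch, :gw * patch]
            .reshape(gh, patch, gw, patch, c)
            .transpose(0, 2, 1, 3, 4))

def group_2x2(patches: np.ndarray):
    """Yield 2x2 groups of spatially adjacent patches and their grid coordinates."""
    gh, gw = patches.shape[:2]
    for r in range(0, gh, 2):
        for col in range(0, gw, 2):
            group = patches[r:r + 2, col:col + 2].reshape(4, *patches.shape[2:])
            coords = [(r, col), (r, col + 1), (r + 1, col), (r + 1, col + 1)]
            yield group, coords

image = np.random.randint(0, 256, (336, 336, 3), dtype=np.uint8)   # stand-in input
patches = divide_into_patches(image)
print(patches.shape)   # (24, 24, 14, 14, 3): 576 patches forming 144 groups of 2 x 2
```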

3.3. Pruning Decision

After partitioning the input image into patches, V-PRUNE evaluates the semantic relevance of each patch through a two-stage pruning process—structural similarity filtering followed by color similarity filtering—as illustrated in Figure 3. This process effectively removes visually redundant or uninformative regions while preserving key semantic content. Patches identified as less important are discarded, and only the remaining patches P_keep are tokenized and passed into the VLM for inference.
For each patch, we first evaluate structural consistency using cosine similarity, which measures coarse appearance resemblance based on patch-level color vectors. The cosine similarity is computed with respect to the mean vector of the corresponding group and is defined as follows:
$$\cos(\theta_i) = \frac{p_i \cdot \bar{p}}{\lVert p_i \rVert \, \lVert \bar{p} \rVert}, \qquad \bar{p} = \frac{1}{4} \sum_{i=1}^{4} p_i. \tag{1}$$
The group satisfies the structural similarity condition if cos(θ_i) ≥ τ_cos for all i, where τ_cos is the cosine similarity threshold. A cosine similarity above the threshold indicates that the group shares similar visual semantics, making it a candidate for patch removal.
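A minimal sketch of this structural-similarity test, continuing the snippet above. Following the paper's later description of "pixel mean values", each patch is summarized here by its mean RGB vector; this exact representation of p_i is an assumption made for illustration.

```python
# Structural similarity check for one group of four patches (Eq. (1) sketch).
import numpy as np

def cosine_check(group: np.ndarray, tau_cos: float = 0.98):
    """Return (condition satisfied, per-patch cosine similarity to the group mean vector)."""
    p = group.reshape(4, -1, 3).mean(axis=1)        # (4, 3) mean-color vectors p_i
    p_bar = p.mean(axis=0)                          # group mean vector
    cos = (p @ p_bar) / (np.linalg.norm(p, axis=1) * np.linalg.norm(p_bar) + 1e-8)
    return bool(np.all(cos >= tau_cos)), cos
```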
Next, we evaluate color similarity by computing the chi-square distance [16] between the color histograms h_i of the patches within the group P. A group is considered to satisfy the color similarity condition if the following criterion is met:
$$\sum_{1 \le i < j \le 4} \chi^2(h_i, h_j) \le \tau_{\mathrm{chi}}, \tag{2}$$
where χ²(h_i, h_j) denotes the chi-square distance between the histograms of p_i and p_j, and τ_chi is the threshold. A sum of chi-square distances below the threshold indicates that the patches within the group have similar color distributions, also making the group a candidate for patch removal.
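A corresponding sketch of the color-similarity test. The histogram construction used here (eight bins per RGB channel, raw counts) and the 0.5-weighted form of the chi-square distance are illustrative assumptions; the paper does not specify these details.

```python
# Color similarity check for one group of four patches (Eq. (2) sketch).
import numpy as np
from itertools import combinations

def color_histogram(patch: np.ndarray, bins: int = 8) -> np.ndarray:
    """Concatenated per-channel color histogram (raw counts) of one patch."""
    return np.concatenate([np.histogram(patch[..., ch], bins=bins, range=(0, 256))[0]
                           for ch in range(3)]).astype(np.float64)

def chi_square(h1: np.ndarray, h2: np.ndarray) -> float:
    """One common form of the chi-square distance between two histograms."""
    return float(0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-8)))

def chi_square_check(group: np.ndarray, tau_chi: float = 5.0):
    """Sum pairwise chi-square distances over the four patches and compare to the threshold."""
    hists = [color_histogram(p) for p in group]
    total = sum(chi_square(hists[i], hists[j]) for i, j in combinations(range(4), 2))
    return total <= tau_chi, total
```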
Finally, groups that satisfy both conditions in Equations (1) and (2) are selected for pruning. Within each selected group, the three patches with lower cosine similarity are masked by setting their pixel values to zero, retaining only the patch with the highest cosine similarity. This two-step evaluation process ensures that the retained patch is both visually consistent and semantically representative, effectively reducing redundancy among patches.
Pruning is performed independently for each 2 × 2 group with a fixed amount of computation, resulting in linear complexity O(N), where N is the number of patches. This keeps the pruning overhead minimal. Figure 4 visualizes the patches before and after pruning.
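Putting the two tests together, the sketch below (reusing the helpers from the previous snippets) makes the per-group decision described above: a group that passes both tests keeps only the patch closest to the group mean, and the other three are dropped (equivalently, zero-masked).

```python
# Per-group pruning decision combining the structural and color similarity tests.
import numpy as np

def prune_group(group: np.ndarray, tau_cos: float = 0.98, tau_chi: float = 5.0) -> np.ndarray:
    """Return a boolean keep-mask of length 4 for one 2x2 group."""
    passes_cos, cos = cosine_check(group, tau_cos)
    passes_chi, _ = chi_square_check(group, tau_chi)
    keep = np.ones(4, dtype=bool)
    if passes_cos and passes_chi:             # redundant group: keep one representative patch
        keep[:] = False
        keep[int(np.argmax(cos))] = True      # the patch most similar to the group mean survives
    return keep

keep_masks = [prune_group(g) for g, _ in group_2x2(patches)]   # one constant-time decision per group
```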

3.4. Patch Reorganization and Tokenization

After the pruning step, the remaining patches are spatially reorganized into a compact layout by sequentially arranging them without preserving the original two-dimensional grid structure, forming a reduced-size image for tokenization as shown in the right box of Figure 2. Despite the spatial reordering, we maintain the original positional semantics by assigning positional embeddings based on each patch’s original location in the full image. This strategy preserves the model’s understanding of the original spatial structure while enabling a reduction in the computational cost of visual processing. Furthermore, reducing the number of patches before visual tokenization effectively shortens the input length to the visual encoder. As a result, if the token count is reduced by a ratio R, the FLOPs of the attention modules decrease approximately by a factor of (1 − R)².
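Below is a sketch of how the kept patches could be packed into a compact sequence while each token retains the position id of its original grid location, followed by the resulting (1 − R)² estimate for the attention FLOPs. It continues the snippets above; the variable and function names are illustrative rather than taken from a released implementation.

```python
# Reorganize kept patches into a compact sequence while preserving original position ids.
import numpy as np

def reorganize(patches: np.ndarray, keep_masks) -> tuple:
    kept, pos_ids = [], []
    gh, gw = patches.shape[:2]
    for (group, coords), mask in zip(group_2x2(patches), keep_masks):
        for patch, (r, c), keep in zip(group, coords, mask):
            if keep:
                kept.append(patch)
                pos_ids.append(r * gw + c)       # index of the patch in the original 24x24 grid
    return np.stack(kept), np.array(pos_ids)

kept_patches, pos_ids = reorganize(patches, keep_masks)
R = 1 - len(kept_patches) / (patches.shape[0] * patches.shape[1])   # token reduction ratio
print(f"reduction R = {R:.2f}, attention FLOPs scale by about {(1 - R) ** 2:.2f}x")
```

For instance, a 30% token reduction (R = 0.3) leaves roughly 0.7² ≈ 49% of the attention FLOPs.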

4. Results and Analysis

In this section, we evaluate the effectiveness of our proposed patch-based visual pruning framework in reducing inference cost while preserving performance across various vision–language tasks. We evaluate the framework on several multimodal benchmarks using LLaVA v1.5-7B [17] and LLaVA-NeXT [18].

4.1. Experiment Settings

All input images are resized to a fixed resolution of 336 × 336 pixels and subsequently divided into a 24 × 24 grid of non-overlapping patches, each of size 14 × 14. To align with the CLIP [19] encoder’s requirements, we apply this patch division consistently across all evaluations. Our pruning framework operates exclusively at inference time; the underlying model weights and architecture remain unchanged. This ensures that our approach can be applied as a plug-and-play acceleration technique without requiring model retraining or fine-tuning. All experiments are conducted on a system equipped with NVIDIA RTX 3090 GPUs.
For all experiments, we set the pruning thresholds to τ_chi = 5 and τ_cos = 0.98, determined through preliminary validation. These values are kept fixed across all datasets and benchmarks for consistency, and the related ablation studies can be found in Section 5.3.1 and Section 5.3.2.
We evaluate our method using the following metrics to comprehensively assess both the performance and efficiency of the proposed pruning framework:
  • Accuracy: The average accuracy across tasks in each benchmark, reflecting overall performance after pruning.
  • Inference Time: The total time required to perform inference on all images in a dataset, used to evaluate the overall latency improvement achieved by pruning.
  • Visual Token Reduction: The reduction in the number of visual tokens processed across the dataset, measured to assess the efficiency gains from pruning. To provide a hardware-independent view of computational savings, we report the reduction as a FLOPs ratio.
We focus exclusively on the inference-time computation cost, specifically the FLOPs required for the Multi-Head Attention (MHA) and Feed-Forward Network (FFN) modules within each transformer layer. The FLOPs can be expressed as $L \times (4nd^2 + 2n^2 d + 2ndm)$, where L is the total number of transformer layers, n is the number of tokens, d is the hidden dimension, and m is the intermediate size of the FFN. The FLOPs ratio can then be approximated as follows:
$$\text{FLOPs ratio} = \frac{\text{FLOPs}_{\text{model}}}{\text{FLOPs}_{\text{baseline}}}.$$
This formulation reflects how pruning affects different stages of the model and emphasizes the quadratic benefit of token reduction in transformer architectures. Together, these metrics allow us to evaluate the effectiveness of our pruning strategy in reducing computational overhead while preserving the model’s performance on vision–language tasks.
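As a worked instance of this accounting, the snippet below evaluates the per-layer formula for an illustrative configuration. The layer count, hidden dimension, and FFN size are the standard LLaMA-7B values, and the token counts consider only the visual tokens; both choices are assumptions for illustration rather than the exact setting used to produce the reported ratios.

```python
# Worked example of the FLOPs formula L * (4*n*d^2 + 2*n^2*d + 2*n*d*m).
def transformer_flops(n: int, L: int = 32, d: int = 4096, m: int = 11008) -> float:
    return L * (4 * n * d ** 2 + 2 * n ** 2 * d + 2 * n * d * m)

baseline_tokens = 576                     # 24 x 24 visual tokens before pruning
pruned_tokens = 400                       # e.g., roughly 30% of the patches removed
ratio = transformer_flops(pruned_tokens) / transformer_flops(baseline_tokens)
print(f"FLOPs ratio ~ {ratio:.1%}")       # ~68.8%, slightly below the linear 400/576 ~ 69.4%,
                                          # because the quadratic attention term shrinks faster
```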

4.2. Experimental Results

To validate the effectiveness of our patch-based pruning framework, we evaluate it across multiple vision-and-language benchmarks: CRPE [20], AI2D [21], A-Bench [22], AesBench [23], HallusionBench [24], POPE [25], A-OKVQA [26], and TaskMeAnything [27]. To enhance readability, we refer to some benchmark datasets using abbreviations in the remainder of this paper (e.g., TaskMeAnything→TMA; HallusionBench→H-Bench). We compare our method against the baseline and recent pruning-based approaches [5,28] in terms of inference time, inference accuracy, and computational efficiency (FLOPs).

4.2.1. Inference Time, Inference Accuracy, and FLOPs

Table 1 presents a comprehensive comparison of inference accuracy and computational efficiency on LLaVA v1.5-7B across multiple benchmarks. Our method maintains strong performance, achieving over 95% accuracy retention on most datasets and even surpassing the baseline in some cases (e.g., AI2D; AesBench). Despite this, it significantly reduces computation, with average FLOPs reduced by over 25% and up to 35% on datasets like HallusionBench and TaskMeAnything. Notably, our approach achieves competitive or the fastest runtimes (e.g., 00:26 on AesBench; 01:50 on A-Bench), even when FLOPs reduction is limited—demonstrating real-world efficiency. For fair comparison, the patch pruning ratio of FastV and VisionZip [28] was adjusted to align with the dynamic, image-adaptive pruning level of our method. Unlike these existing approaches, our method employs a fundamentally different pruning strategy that still delivers comparable performance. This suggests that our approach not only stands strong on its own but also holds the potential to complement existing methods, potentially leading to further performance gains when combined.

4.2.2. Task Dependency of Pruning Effectiveness

Our method performs best on datasets that contain high visual redundancy, such as TaskMeAnything and HallusionBench, where a large portion of patches can be safely discarded. In contrast, tasks with visually dense layouts or small objects dispersed across the image are less amenable to aggressive pruning. This highlights the importance of dataset-aware pruning strategies and suggests that incorporating task-specific visual priors (e.g., question-relevant regions in VQA or text areas in OCR) may further enhance performance.
While our method achieves substantial FLOPs reductions in many benchmarks, the actual end-to-end inference time is not always linearly proportional to the FLOPs savings. This discrepancy arises from the architectural constraints of transformer models—where token processing is highly parallelized and overhead is introduced by the patch scoring and masking operations. In datasets where pruning is minimal (e.g., AesBench or CRPE), the additional processing can sometimes outweigh the benefits of token reduction.

4.2.3. Visual Comparison of Pruned Patches

To further validate the effectiveness of our patch pruning strategy, we compared the pruned patches obtained by our semantic-based approach (applied prior to tokenization) with those from the attention-based FastV method. As illustrated in Figure 5, our method more accurately preserves semantically important regions such as key objects and structural layouts, demonstrating a better understanding of visual content. This selective pruning reduces redundant patches before they reach the tokenization stage, leading to fewer unnecessary attention computations. As a result, our method achieves comparable accuracy while significantly reducing inference time, as shown in Table 1.
A notable advantage of our approach is its interpretability. Because pruning decisions are made directly at the patch level based on observable image features (mean color and distributional similarity), the resulting pruned inputs are human-understandable and align with intuitive notions of visual relevance. This contrasts with many attention-based token reduction methods, which often obscure the internal decision process.

4.2.4. Result on LLaVA-Next

To verify the effectiveness of our method in high-resolution settings, we conduct experiments on LLaVA-NeXT [18], which increases visual token count via AnyRes-based image splitting. As shown in Table 2, our method consistently reduces inference time across all tasks while maintaining comparable or even better accuracy. For instance, on AI2D, the inference time was reduced from 24:12 to 16:25, with only a minor drop in accuracy (62.85→62.17). In contrast, on POPE, our method even improves accuracy (41.11→44.16) while also reducing latency (16:37→15:50). This demonstrates that our pruning strategy effectively removes redundant visual information without degrading performance.

4.2.5. Analysis of Design Choices

In this study, redundancy was set as the criterion for removing unnecessary patches, and we considered that a minimum local context is required to capture it. From this perspective, we adopted the 2 × 2 grouping strategy, based on the following reasons.
  • Consistency with VLM patch grids: Modern VLMs (e.g., CLIP) split images into even×even patch grids (e.g., 14 × 14 ; 16 × 16 ). A 2 × 2 group aligns seamlessly with such grids, avoiding padding or boundary correction.
  • Minimum contextual unit: A 1 × 1 patch lacks surrounding context, while larger groups dilute fine-grained details. A 2 × 2 group is the smallest unit that preserves patch-level granularity with sufficient spatial cues.
  • Computational efficiency: With n patches per group, redundancy detection requires n(n − 1)/2 pairwise comparisons, i.e., O(n²). Thus, larger groups rapidly increase cost, whereas n = 4 achieves the best balance.
In addition, to empirically validate the effect of group size, we compared 2 × 2 , 3 × 3 , and 4 × 4 groupings. Since the threshold range for redundancy detection changes with the number of patches in a group and can consequently influence accuracy, we evaluated inference efficiency only to isolate the effect of group size itself. As shown in Table 3, the 2 × 2 grouping consistently achieved the lowest inference time, whereas larger groups such as 3 × 3 and 4 × 4 introduced additional computational overhead due to the increased redundancy search space. These results experimentally demonstrate that grouping four adjacent patches provides the most balanced trade-off between simplicity and efficiency, thereby strengthening the validity of the proposed 2 × 2 grouping strategy.
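The comparison counts behind the third bullet above are easy to check; the numbers produced below are simple arithmetic for a 24 × 24 patch grid (which is divisible by 2, 3, and 4), not values reported in the paper.

```python
# Pairwise redundancy checks per group vs. total checks per 24x24 grid for each group size.
from math import comb

for g in (2, 3, 4):
    n = g * g                                # patches per group
    pairs = comb(n, 2)                       # pairwise comparisons inside one group
    groups = (24 // g) ** 2                  # groups tiling the 24x24 grid
    print(f"{g}x{g}: {pairs} pairs/group x {groups} groups = {pairs * groups} comparisons")
# 2x2: 864, 3x3: 2304, 4x4: 4320 -> larger groups mean more work per image
```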

5. Discussion

Our proposed patch-level pruning framework demonstrates that early-stage semantic filtering can significantly reduce computational cost in vision–language model (VLM) inference without compromising task performance. The quantitative results confirm that pruning redundant patches prior to tokenization reduces FLOPs and runtime across a wide range of benchmarks, while the qualitative analysis shows that the model retains semantically meaningful content and maintains response accuracy. Nevertheless, we observe several important dynamics and trade-offs that merit further discussion.

5.1. Comparison of Random vs. Proposed Patch Pruning

We compare our method to random pruning under the same patch reduction ratio (Figure 6). The bar chart shows that our method consistently achieves higher accuracy across all datasets compared to random patch selection, indicating that preserving semantically meaningful patches leads to better task performance. The performance gap is clearly visualized by the difference between the dark blue (V-PRUNE) and light blue (random) bars.
Meanwhile, the line plots show that inference times for both methods follow a similar trend, as indicated by the closely aligned curves. This demonstrates that despite having similar computational costs, our method yields substantially better accuracy. These results highlight the advantage of guided pruning over naive random selection in both effectiveness and efficiency.

5.2. Time–FLOPs–Accuracy Trade-Off

Analyzing the trade-off between computational cost, runtime, and task performance reveals that our method is particularly effective on datasets like TaskMeAnything and HallusionBench, where high pruning ratios (over 30%) indicate substantial visual redundancy. In such cases, we achieve significant FLOPs reductions and runtime gains with minimal performance drop.
In contrast, datasets like CRPE and AesBench, which contain dense visual content (e.g., diagrams, charts, or multi-object scenes), result in lower pruning ratios and limited FLOPs savings, partly due to scoring and reorganization overhead. Nonetheless, runtime still improves, showing that FLOPs alone do not fully determine efficiency. Overall, these results confirm that our patch-based method aligns well with tasks where spatial redundancy is prevalent and that runtime reduction is governed by both FLOPs and processing overhead.
To further support our quantitative findings, we include qualitative visualizations in Figure 7. These examples illustrate how our method removes visually redundant patches such as large uniform backgrounds, while preserving task-relevant regions, confirming the semantic fidelity and interpretability of our pruning approach.

5.3. Effect of Patch Pruning Threshold

In the patch pruning process, threshold values serve as criteria for determining similarity, directly influencing both the number of retained tokens and the overall computational cost. In this section, we analyze the effects of two thresholds—namely the chi-square distance threshold based on color histograms and the cosine similarity threshold based on pixel intensities—on pruning performance. We also discuss the correlation and complementary roles of these two criteria.

5.3.1. Effect of Chi-Square Distance Threshold

The chi-square distance threshold τ_chi serves as a criterion for measuring color distribution similarity within patch groups. As shown in Figure 8, applying a higher threshold causes more patch groups to be identified as similar and aggressively pruned, thereby reducing the number of processed tokens and yielding greater FLOPs savings. However, this also increases the risk of removing informative visual cues, resulting in a slight accuracy drop as a trade-off. Moreover, FLOPs reduction does not always directly translate into shorter inference time; in some cases, pruning overhead offsets the computational savings.
To further illustrate this effect, Table 4 shows that as τ_chi increases, computational cost progressively decreases due to the quadratic relationship between token length and the self-attention mechanism in transformer-based encoders. Importantly, task performance is largely preserved across different thresholds, confirming both the effectiveness and controllability of our early-stage, group-based pruning design. Overall, these results suggest that moderate thresholds strike the best balance between efficiency and accuracy by removing redundant regions while retaining essential semantic content.

5.3.2. Effect of Cosine Similarity Threshold

The cosine similarity threshold τ_cos evaluates color similarity based on pixel mean values. As illustrated in Figure 9, lowering the threshold preserves more patches, thereby increasing the token count, FLOPs, and inference time. Nonetheless, accuracy remains nearly constant across different thresholds, indicating that the additional tokens retained at lower thresholds contribute little to performance. In this study, we set the threshold to 0.98, which effectively removes redundant patches while maintaining high accuracy.

5.3.3. Relationship Between the Two Thresholds

The chi-square distance and cosine similarity thresholds operate on different aspects of patch similarity—color distribution and average color, respectively—and thus play complementary roles. While cosine similarity effectively captures global color averages, it is limited in distinguishing structural differences within patches. Conversely, the chi-square distance is more sensitive to such differences, enabling it to separate patches with similar averages but distinct internal patterns (e.g., sky vs. clouds). Experimental results show that combining both thresholds yields more stable and refined pruning compared to using either criterion alone, supporting the effectiveness of our group-based patch pruning design.

6. Conclusions

We presented V-PRUNE, a patch-level pruning framework that improves the inference efficiency of vision–language models (VLMs) by removing semantically redundant visual information before tokenization. Unlike prior approaches that rely on attention-based token pruning after visual features have been embedded, our method operates directly on raw image patches, enabling earlier and more interpretable pruning decisions by evaluating local similarity using lightweight statistical measures such as structural and color-based features. Our method efficiently filters out less informative patches, significantly reducing the burden of subsequent attention calculations.
Through extensive experiments across eight benchmarks, we demonstrated that our method achieves substantial reductions in both FLOPs and inference time, while maintaining competitive or even superior accuracy compared to state-of-the-art pruning techniques. Notably, our approach exhibits particularly strong performance on tasks characterized by high visual redundancy, where many background regions can be safely discarded without harming semantic understanding. Furthermore, the method remains robust in visually complex or cluttered scenes, consistently preserving task-relevant information.
Qualitative analyses further support our findings, showing that our pruning framework selectively retains semantically meaningful image regions, resulting in compact yet interpretable input representations. Overall, our findings highlight the practical potential of semantic-aware, pre-embedding pruning as a scalable and effective solution for optimizing the performance of modern vision–language models.

Directions for Future Work

Our method operates independently of the model’s attention mechanism or downstream decoder behavior. While this allows seamless integration with existing VLM architectures, jointly optimizing pruning decisions with model-specific attention or decoding strategies could further enhance performance.

Author Contributions

Conceptualization, H.S.; methodology, H.S. and Y.S.C.; software, H.S.; validation, H.S. and Y.S.C.; formal analysis, H.S. and Y.S.C.; investigation, H.S.; resources, H.S. and Y.S.C.; data curation, H.S.; writing—original draft preparation, H.S. and Y.S.C.; writing—review and editing, H.S. and Y.S.C.; visualization, H.S.; supervision, Y.S.C.; project administration, Y.S.C.; funding acquisition, Y.S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information and Communications Technology Planning and Evaluation (IITP) grants (No. RS-2025-25422680 and No. RS-2020-II201373) and by the National Research Foundation of Korea (NRF) grant (No. RS-2025-00520618), funded by the Korean Government (MSIT).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in CRPE [20], AI2D [21], A-Bench [22], AesBench [23], HallusionBench [24], POPE [25], A-OKVQA [26], and TaskMeAnything [27].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yu, J.; Wang, Z.; Vasudevan, V.; Yeung, L.; Seyedhosseini, M.; Wu, Y. CoCa: Contrastive Captioners are Image-Text Foundation Models. Trans. Mach. Learn. Res. 2022.
  2. Chen, B.; Xu, Z.; Kirmani, S.; Ichter, B.; Sadigh, D.; Guibas, L.; Xia, F. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 14455–14465.
  3. Karthik, S.; Roth, K.; Mancini, M.; Akata, Z. Vision-by-Language for Training-Free Compositional Image Retrieval. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
  4. Fayyaz, M.; Koohpayegani, S.A.; Jafari, F.R.; Sengupta, S.; Joze, H.R.V.; Sommerlade, E.; Pirsiavash, H.; Gall, J. Adaptive token sampling for efficient vision transformers. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 396–414.
  5. Chen, L.; Zhao, H.; Liu, T.; Bai, S.; Lin, J.; Zhou, C.; Chang, B. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 19–35.
  6. Zhang, Q.; Cheng, A.; Lu, M.; Zhuo, Z.; Wang, M.; Cao, J.; Guo, S.; She, Q.; Zhang, S. [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster. arXiv 2024, arXiv:2412.01818.
  7. Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7 May 2024.
  8. Zhou, L.; Palangi, H.; Zhang, L.; Hu, H.; Corso, J.; Gao, J. Unified Vision-Language Pre-Training for Image Captioning and VQA. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13041–13049.
  9. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations, Virtual, 25 April 2022.
  10. Rao, Y.; Zhao, W.; Liu, B.; Lu, J.; Zhou, J.; Hsieh, C.J. DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021.
  11. Bolya, D.; Fu, C.Y.; Dai, X.; Zhang, P.; Feichtenhofer, C.; Hoffman, J. Token Merging: Your ViT but faster. arXiv 2022, arXiv:2210.09461.
  12. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021.
  13. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10347–10357.
  14. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
  15. Tang, Y.; Han, K.; Wang, Y.; Xu, C.; Guo, J.; Xu, C.; Tao, D. Patch slimming for efficient vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12165–12174.
  16. McHugh, M.L. The chi-square test of independence. Biochem. Med. 2013, 23, 143–149.
  17. Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26296–26306.
  18. Liu, H.; Li, C.; Li, Y.; Li, B.; Zhang, Y.; Shen, S.; Lee, Y.J. LLaVA-NeXT: Improved Reasoning, OCR, and World Knowledge. 2024. Available online: https://llava-vl.github.io/blog/2024-01-30-llava-next/ (accessed on 30 January 2024).
  19. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763.
  20. Wang, W.; Ren, Y.; Luo, H.; Li, T.; Yan, C.; Chen, Z.; Wang, W.; Li, Q.; Lu, L.; Zhu, X.; et al. The All-Seeing Project V2: Towards general relation comprehension of the open world. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 471–490.
  21. Kembhavi, A.; Salvato, M.; Kolve, E.; Seo, M.; Hajishirzi, H.; Farhadi, A. A diagram is worth a dozen images. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part IV; Springer: Berlin/Heidelberg, Germany, 2016; pp. 235–251.
  22. Zhang, Z.; Wu, H.; Li, C.; Zhou, Y.; Sun, W.; Min, X.; Chen, Z.; Liu, X.; Lin, W.; Zhai, G. A-Bench: Are LMMs Masters at Evaluating AI-generated Images? In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025.
  23. Huang, Y.; Yuan, Q.; Sheng, X.; Yang, Z.; Wu, H.; Chen, P.; Yang, Y.; Li, L.; Lin, W. AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception. arXiv 2024, arXiv:2401.08276.
  24. Guan, T.; Liu, F.; Wu, X.; Xian, R.; Li, Z.; Liu, X.; Wang, X.; Chen, L.; Huang, F.; Yacoob, Y.; et al. HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 14375–14385.
  25. Li, Y.; Du, Y.; Zhou, K.; Wang, J.; Zhao, X.; Wen, J.R. Evaluating Object Hallucination in Large Vision-Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023.
  26. Schwenk, D.; Khandelwal, A.; Clark, C.; Marino, K.; Mottaghi, R. A-OKVQA: A benchmark for visual question answering using world knowledge. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 146–162.
  27. Zhang, J.; Huang, W.; Ma, Z.; Michel, O.; He, D.; Gupta, T.; Ma, W.C.; Farhadi, A.; Kembhavi, A.; Krishna, R. Task Me Anything. In Proceedings of the Thirty-Eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, Vancouver, BC, Canada, 10–15 December 2024.
  28. Yang, S.; Chen, Y.; Tian, Z.; Wang, C.; Li, J.; Yu, B.; Jia, J. VisionZip: Longer is better but not necessary in vision language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 19792–19802.
Figure 1. Conceptual difference between V-PRUNE and previous token pruning method. Unlike previous methods that prune tokens after image and positional embeddings during the attention stage, V-PRUNE performs patch pruning directly at the image level, before visual tokenization and embedding.
Figure 2. The architecture of V-PRUNE which is the proposed patch-based visual pruning method. Our framework evaluates patch groups based on structural and color similarity, retains only the most representative patches, and constructs a compact image for inference while preserving original positional semantics.
Figure 3. Illustration of the patch decision process. Each 2 × 2 group is evaluated based on cosine similarity of average group mean and simplified color distribution similarity. The most representative patch is retained, and others are masked. For simplicity, histogram blocks are abstracted as icons to represent RGB distribution comparison.
Figure 4. Example of proposed patch pruning method. Redundant patches are removed while semantic regions are preserved.
Figure 5. Visual comparison of pruned patches between the attention-based FastV method and our semantics-based approach. Our method more effectively retains meaningful regions such as object boundaries and structural layouts, while discarding redundant background patches. This leads to reduced attention computation and faster inference with minimal accuracy loss.
Figure 6. Comparison of pruning performance between random patch selection and our proposed method.
Figure 7. Qualitative examples on four representative benchmarks. Each row shows (from left to right) the following: the original image; the pruned image (removed patches shown in black). Our method successfully preserves semantically important content (e.g., central objects, labels, and textual regions), while removing visually redundant background areas. In most cases, the output remains consistent or even improves, demonstrating the practical benefits of semantic-aware patch pruning.
Figure 8. FLOPs comparison across benchmarks.
Figure 9. Qualitative examples on four representative benchmarks. Each row shows (from left to right) the following: the original image; the pruned image (removed patches shown in black). Our method successfully preserves semantically important content (e.g., central objects, labels, and textual regions), while removing visually redundant background areas. In most cases, the output remains consistent or even improves, demonstrating the practical benefits of semantic-aware patch pruning. Here, ‘token #’ indicates the number of image tokens remaining after pruning during benchmark evaluation.
Table 1. Comparison of inference accuracy, inference time, and computational efficiency across multiple benchmarks on LLaVA v1.5-7B. Our method maintains accuracy comparable to the baselines while significantly reducing FLOPs. Time is reported in minutes:seconds for processing the full dataset; percentages in the Accuracy rows denote accuracy retention relative to the LLaVA baseline.
| Metric | Model | CRPE | AI2D | A-Bench | AesBench | H-Bench | POPE | A-OKVQA | TMA |
|---|---|---|---|---|---|---|---|---|---|
| Accuracy | LLaVA | 88.27 (100%) | 50.93 (100%) | 65.50 (100%) | 42.50 (100%) | 35.54 (100%) | 80.02 (100%) | 78.60 (100%) | 45.71 (100%) |
| | FastV | 87.46 (99%) | 51.84 (101%) | 62.91 (96%) | 51.11 (120%) | 37.22 (104%) | 80.48 (100%) | 78.42 (99%) | 45.47 (99%) |
| | VisionZip | 86.48 (97%) | 51.97 (102%) | 63.19 (96%) | 52.77 (124%) | 32.17 (90%) | 76.69 (95%) | 77.90 (99%) | 43.96 (96%) |
| | V-PRUNE | 86.43 (97%) | 52.42 (102%) | 63.19 (96%) | 51.94 (122%) | 37.22 (104%) | 76.66 (95%) | 75.10 (95%) | 42.80 (93%) |
| Time | LLaVA | 23:03 | 03:49 | 02:12 | 01:06 | 06:30 | 10:21 | 01:21 | 14:55 |
| | FastV | 21:58 | 03:42 | 01:56 | 00:28 | 06:25 | 10:15 | 01:18 | 12:24 |
| | VisionZip | 20:03 | 03:19 | 01:50 | 00:27 | 10:28 | 10:16 | 01:14 | 06:49 |
| | V-PRUNE | 21:28 | 03:25 | 01:50 | 00:26 | 06:23 | 10:08 | 01:15 | 07:42 |
| FLOPs ratio | LLaVA | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| | FastV | 20% | 30% | 13% | 20% | 30% | 20% | 20% | 40% |
| | VisionZip | 20% | 30% | 13% | 20% | 30% | 20% | 20% | 40% |
| | V-PRUNE | 22.70% | 29.86% | 13.29% | 19.96% | 33.71% | 21.71% | 21.71% | 36.72% |
Table 2. Performance on LLaVA-NeXT using our method.
| Metric | Model | AI2D | A-Bench | A-OKVQA | POPE |
|---|---|---|---|---|---|
| Acc. | LLaVA-NeXT | 62.85 | 50.00 | 51.52 | 41.11 |
| | V-PRUNE | 62.17 | 47.51 | 51.11 | 44.16 |
| Time | LLaVA-NeXT | 24:12 | 26:18 | 02:58 | 16:37 |
| | V-PRUNE | 16:25 | 25:13 | 02:44 | 15:50 |
Table 3. Inference time comparison across different group sizes ( 2 × 2 , 3 × 3 , and 4 × 4 ), measured on LLaVA-1.5-7B using two NVIDIA RTX 3090 GPUs.
| Metric | Group | AI2D | A-Bench | A-OKVQA | POPE |
|---|---|---|---|---|---|
| Time | 2 × 2 | 23:15 | 12:00 | 09:21 | 31:11 |
| | 3 × 3 | 23:41 | 12:32 | 09:58 | 33:20 |
| | 4 × 4 | 23:44 | 12:22 | 09:45 | 32:56 |
Table 4. Impact of chi-square distance threshold used for patch pruning. We report model accuracy, inference time, and FLOPs to evaluate how the threshold impacts pruning aggressiveness and overall efficiency.
| Metric | τ_chi | CRPE | AI2D | A-Bench | AesBench | H-Bench | POPE | A-OKVQA | TMA |
|---|---|---|---|---|---|---|---|---|---|
| Time | ≤5 | 37:19 | 05:59 | 03:36 | 01:06 | 27:45 | 10:08 | 02:19 | 12:24 |
| | ≤10 | 36:53 | 05:38 | 03:29 | 01:02 | 26:37 | 10:18 | 02:16 | 11:59 |
| | ≤20 | 36:33 | 05:37 | 03:25 | 01:02 | 26:43 | 10:05 | 02:14 | 11:37 |
| Acc. | ≤5 | 0.8643 | 0.5242 | 0.6319 | 0.5194 | 37.2240 | 77.1373 | 0.7510 | 0.4280 |
| | ≤10 | 0.8619 | 0.5187 | 0.6333 | 0.5166 | 37.0137 | 77.9240 | 0.7493 | 0.4250 |
| | ≤20 | 0.8601 | 0.5191 | 0.6361 | 0.5055 | 37.0137 | 76.6600 | 0.7414 | 0.4163 |
| FLOPs | ≤5 | 22.70% | 29.86% | 13.29% | 19.96% | 33.71% | 22.06% | 21.71% | 36.72% |
| | ≤10 | 24.88% | 30.74% | 17.15% | 21.97% | 35.62% | 23.35% | 23.94% | 43.61% |
| | ≤20 | 27.80% | 32.48% | 18.38% | 25.92% | 37.81% | 26.15% | 27.00% | 49.01% |

