Article

PoMQ-ViT: Mixed-Precision Quantization Vision Transformer with Pareto Optimization

State Key Laboratory of Digital Steel, Northeastern University, Shenyang 110819, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(18), 9856; https://doi.org/10.3390/app15189856
Submission received: 15 July 2025 / Revised: 28 August 2025 / Accepted: 5 September 2025 / Published: 9 September 2025

Abstract

Vision Transformer (ViT) has established itself as the leading model in image processing, and quantization offers a practical way to deploy ViTs on memory-constrained devices. This technique maps float32 inference to lower bits such as int8. However, existing studies often use uniform bit-width across ViT encoders, leading to sub-optimal allocation and accuracy loss. This study aims to optimize bit-width allocation by leveraging the observed variation in quantization sensitivity among different ViT encoders. We propose PoMQ-ViT, a method that dynamically assigns bit-width to ViT encoders based on their sensitivity. It uses Pareto Optimization to balance accuracy and inference speed. Experiments on the ImageNet Dataset show PoMQ-ViT outperforms uniform bit-width methods across DeiT and Swin models. Within the same computation budget, it also achieves 0.1–1.5 percentage points higher accuracy than other mixed-precision paradigms across different models. This work demonstrates PoMQ-ViT’s effectiveness in ViT quantization, providing a practical solution for resource-constrained deployments.

1. Introduction

1.1. Research Background

The increasing research on Transformers [1] has made them the most prominent direction in deep learning. Notably, Dosovitskiy et al. successfully applied Transformers to computer vision, introducing the Vision Transformer (ViT) [2]. ViTs have achieved remarkable results across various computer vision tasks, such as image classification [3], object detection [4,5,6], image segmentation [7,8], and image generation [9].
However, the quadratic computational demands of ViT pose significant challenges for efficient deployment on consumer devices [10,11]. Techniques like model quantization [12,13,14], pruning [15,16], and distillation [17,18,19] enhance the deployment and application of ViTs. In this paper, we focus on model quantization because it can reduce both the computation and memory cost.

1.2. Problem Statement

Quantization maps both weights and activations from the original 32-bit floating-point (FP32) format to low-bit representations such as int8 and int4. Currently, most quantization methods focus on improving the gradient optimization of quantization. Specifically, existing methods [20,21,22,23,24] allocate the same bit-width across all encoders. However, as illustrated in Figure 1, in a ViT-S [2] with 12 encoders, configuring an individual encoder to 4-bit while keeping the remaining encoders at 8-bit leads to varying losses in Top-1 inference accuracy. For example, quantizing the 6th encoder to 4-bit causes a 5.5-point accuracy drop, whereas the same operation on the 1st encoder causes only a 1.1-point drop. This phenomenon indicates that ViT exhibits different sensitivities to quantization bit-width across its encoders. Encoders that are more sensitive to quantization should retain higher bit-width precision, whereas less sensitive ones can adopt lower bit-width precision.
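To make the sensitivity observation concrete, the sketch below shows how such a per-encoder probe can be scripted. It is illustrative only and assumes two hypothetical helpers that are not part of this paper's code: `build_quantized_vit(bits)`, which returns a ViT quantized with the given per-encoder bit-widths, and `evaluate_top1(model, loader)`, which measures Top-1 accuracy on a validation loader.

```python
# Hypothetical probe behind Figure 1: quantize one encoder to 4-bit at a time,
# keep the others at 8-bit, and record the Top-1 drop relative to the 8-bit baseline.
def probe_encoder_sensitivity(num_encoders, build_quantized_vit, evaluate_top1, val_loader):
    baseline_bits = [8] * num_encoders
    baseline_acc = evaluate_top1(build_quantized_vit(baseline_bits), val_loader)

    drops = []
    for k in range(num_encoders):
        bits = baseline_bits.copy()
        bits[k] = 4                        # only the k-th encoder is pushed to 4-bit
        acc = evaluate_top1(build_quantized_vit(bits), val_loader)
        drops.append(baseline_acc - acc)   # larger drop => more sensitive encoder
    return drops
```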

1.3. Research Objectives

To address this challenge, we propose a mixed-precision quantization method, denoted PoMQ-ViT. PoMQ-ViT formulates the bit allocation of ViT as a multi-objective optimization problem: given a target average bit-width, it allocates the optimal bit-width to each encoder by considering its quantization sensitivity. PoMQ-ViT adopts the Pareto optimality approach, using a measurement of the distribution change before and after quantization to quantify this sensitivity. By evaluating the distribution changes under different bit precisions, we achieve minimal quantization error at the target average bit-width.

1.4. Scope and Contributions of This Study

1. Efficient Vision Transformer
Ref. [25] introduces PVT v1 with a pyramid structure for efficient dense prediction, and PVT v2 [26] improves upon PVT v1 with linear complexity attention, overlapping patches, and a convolutional feed-forward network, excelling in fundamental visual tasks. Ref. [27] introduces LSH attention for faster inference in NLP tasks, while ref. [28] divides standard attention into high- and low-frequency components, reducing operations and improving speed. Ref. [29] proposes Separable MSA for MobileViT, achieving linear complexity through element-wise operations. Ref. [30] employs G-MSA and G-FFN to capture diverse feature dependencies efficiently. Ref. [31] designs EdgeNeXt, combining CNN with Transformer, featuring a convolutional encoder and Split Depth-wise Transpose Attention. Ref. [32] proposes LVT with Convolutional and Atrous Self-Attention for optimized edge deployment.
2. Post-Training Quantization for Vision Transformer
Ref. [33] introduces a novel binary quantization algorithm for addressing the long-tailed distribution issue in softmax attention, but the severe drop in accuracy limits its practicality. In contrast, ref. [22] proposes the Integer Vision Transformer (I-ViT), which leverages Shiftmax and ShiftGELU algorithms to perform integer inference for ViT models, achieving speeds 3.72 to 4.11 times faster than floating-point models. However, the impact of integer bit-widths on accuracy and speed is not discussed. Ref. [34] uses the Representation Quantized Vision Transformer (RepQ-ViT) algorithm to align quantization and inference processes, focusing solely on the quantization aspect and neglecting potential hardware acceleration benefits. Ref. [35] studies error patterns in heavy-tailed distribution quantization under fixed additive noise bias, aiming to reduce quantization errors without altering quantizer variables. Ref. [36] proposes a binary quantization algorithm that achieves good inference accuracy with 4-bit quantization, but its effectiveness in large-scale networks like ViT is unconfirmed. Ref. [37] presents the Minimum Mean Squared Error (min-MSE) optimal quantization algorithm, advancing low-bit quantization to 4-bit integer precision, though the actual hardware implementation results are unclear. Finally, ref. [13] outlines a quantization scheme that quantizes weights and activations to 8-bit integers, without examining the effects of different bit-widths on accuracy and operation speed. Therefore, previous ViT quantization efforts have focused on fixed bit-width precision, while our work adopts a mixed bit-width precision approach.

1.5. Structure of the Paper

This paper first reviews the core computational process of the standard ViT, upon which it elaborates on the quantization principle of FQ-ViT. Building further on this foundation, the paper derives the quantization computation equations and Pareto optimization [38] constraint equations for the PoMQ-ViT model, thereby clarifying its mechanism of improvement over FQ-ViT. Through experimental validation, the optimal quantization bit-width for each layer of ViT is determined via Pareto frontier analysis, aiming to strike a balance between accuracy and efficiency. Finally, by comparing the performance of standard models (e.g., ViT, DeiT, and Swin) under both the FQ-ViT and PoMQ-ViT quantization schemes, the superiority of the proposed PoMQ-ViT quantization model is verified.

2. Materials and Methods

In this section, we first provide a detailed overview of the standard ViT architecture. Second, we examine FQ-ViT [21] as the baseline, noting that this model adopts a single quantization bit-width. Finally, building on the FQ-ViT quantization model, we derive a computational model in which different encoders employ different quantization precisions and set the Pareto constraint equations according to the quantization characteristics, thereby completing the PoMQ-ViT model.

2.1. Standard ViT Computation Flow

The Vision Transformer (ViT) [2] breaks down a 2D image into flat 2D patches, using a linear projection to convert these patches into tokens, known as patch embeddings. Additionally, a class token is appended to represent global image information. Each token is then combined with a learnable positional embedding. Consequently, the input token sequence for a ViT model is:
$$X_0 = [x_0^0, x_0^1, x_0^2, \ldots, x_0^N] + E_{pos} \quad (1)$$
where $x_0^i \in \mathbb{R}^D$ denotes a D-dimensional token for the i-th patch if $i > 0$, and the [class] token if $i = 0$. Here, $E_{pos}$ is the position embedding and $N$ is the number of patches.
A ViT model $V$ consists of $K$ encoders stacked in sequence, each featuring a self-attention (SA) module and a feed-forward network (FFN). In most ViTs, SA is replaced by multi-head self-attention (MHSA). In the k-th encoder's SA, the token sequence $X_{k-1}$ is transformed into a query matrix $Q_k \in \mathbb{R}^{(N+1) \times D}$, a key matrix $K_k \in \mathbb{R}^{(N+1) \times D}$, and a value matrix $V_k \in \mathbb{R}^{(N+1) \times D}$. The self-attention matrix $Attn \in \mathbb{R}^{(N+1) \times (N+1)}$ is then calculated as:
$$Attn = \mathrm{Softmax}\!\left(\frac{Q_k K_k^T}{\sqrt{D}}\right) = [a_k^0; a_k^1; \ldots; a_k^N] \quad (2)$$
The vector $a_k^i \in \mathbb{R}^{(N+1)}$, where $i = 0, 1, \ldots, N$, represents the attention weight vector for the i-th token, with a dimensionality of $1 \times (N+1)$, indicating interactions between the [class] token and other patch tokens. Using $Attn$, the outputs of SA (i.e., $Attn \cdot V_k$) are passed to an FFN composed of two fully-connected layers to produce the updated tokens $X_k = [x_k^0; x_k^1; \ldots; x_k^N]$. The [class] token $x_k^0$ is derived as follows:
$$x_k^0 = \mathrm{FFN}\!\left(a_k^0 V_k\right) \quad (3)$$
After multiple SA-FFN transformations, the [class] token x k 0 from the final encoder is utilized by the classifier to determine the category of the input.
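For readers who prefer code to notation, the following minimal PyTorch sketch mirrors the SA-FFN structure described above. It uses the library's built-in torch.nn.MultiheadAttention and a pre-norm layout for brevity, so it illustrates the computation flow rather than reproducing any specific ViT variant.

```python
import torch
import torch.nn as nn

class MiniViTEncoder(nn.Module):
    """One encoder block: MHSA + FFN, each with LayerNorm and a residual connection."""
    def __init__(self, dim=384, heads=6, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                       # x: (B, N+1, D) tokens incl. [class]
        y = self.norm1(x)
        a, _ = self.attn(y, y, y)               # self-attention over all tokens
        x = x + a                               # residual around SA
        x = x + self.ffn(self.norm2(x))         # residual around FFN
        return x

tokens = torch.randn(1, 197, 384)               # 196 patch tokens + 1 [class] token
out = MiniViTEncoder()(tokens)                  # out[:, 0] is the updated [class] token
```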

2.2. Fully Quantized ViT Computation Flow

In the ViT full quantization process, the Conv, Linear, and MatMul modules in FQ-ViT adopt uniform MinMax quantization. However, the LayerNorm input exhibits severe inter-channel variation, with some channel ranges exceeding 40 times the median, and the values of attention maps are extremely unevenly distributed: most values are concentrated between 0 and 0.01, while a few high attention values approach 1. Applying conventional quantization to LayerNorm and Softmax would therefore severely degrade quantized ViT inference. FQ-ViT thus proposes the power-of-two factor (PTF) and log-int-Softmax (LIS) methods for LayerNorm and Softmax, ultimately achieving full quantization of ViT [22].
The core idea of PTF is to assign a different power-of-two factor $\alpha$ to each channel, while the quantization parameters $s$ and $zp$ remain shared across the layer; each channel's effective quantization is thus determined jointly by $s$, $zp$, and its own $\alpha$. Given a quantization bit-width $b$, the quantized activation $X^{(int8)}$ is calculated as
$$X^{(int8)} = Q(X_E \mid b) = \mathrm{clip}\!\left(\left\lfloor \frac{X_E}{2^{\alpha} s} \right\rceil + zp,\; 0,\; 2^b - 1\right).$$
The LayerNorm process involves computing statistics of $X^{(int8)}$, which is done in two stages. In the first stage, $\alpha$ is used to rescale the quantized $X^{(int8)}$ to obtain the shifted value $\hat{X}^{(int8)} = (X^{(int8)} - zp) \ll \alpha$. In the second stage, the statistics $\mu(X)$ and $\sigma(X)$ are computed from $\hat{X}^{(int8)}$.
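The snippet below sketches the two PTF stages, using floating-point arithmetic to emulate the integer operations so that the per-channel factor alpha and the shared s, zp are easy to see. It is a minimal sketch based on the equations above, not FQ-ViT's reference implementation, and the example values of s and zp are arbitrary.

```python
import torch

def ptf_quantize(x, s, zp, alpha, b=8):
    """PTF quantization: per-channel power-of-two factor alpha, layer-shared s and zp.
    x: (..., C) activations; alpha: (C,) integer exponents."""
    q = torch.round(x / (2.0 ** alpha * s)) + zp
    return torch.clamp(q, 0, 2 ** b - 1)

def ptf_rescale_for_stats(q, zp, alpha):
    """Stage 1 of quantized LayerNorm: undo the zero-point and shift each channel
    back by alpha so that mean/variance can be computed on a common scale."""
    return (q - zp) * (2.0 ** alpha)          # stands in for an integer left-shift

x = torch.randn(2, 197, 768)                  # toy activations
alpha = torch.zeros(768)                      # all-zero factors reduce PTF to plain MinMax
q = ptf_quantize(x, s=0.02, zp=128, alpha=alpha)
x_hat = ptf_rescale_for_stats(q, zp=128, alpha=alpha)
mu, sigma = x_hat.mean(dim=-1), x_hat.std(dim=-1)   # stage-2 LayerNorm statistics
```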
The specific process of LIS is as follows: first, subtract the maximum input value from the current input so that all inputs are non-positive, and pass these non-positive values into the i_exp() function. Next, the standard Softmax calculation is converted into its reciprocal form, ensuring that the computed result is greater than 1. Finally, the output of the inverse Softmax is quantized with a log2 calculation to obtain the Softmax value:
$$Attn^{(int8)} = \log_2\!\left(\frac{\sum i\_exp\!\left(x_k^{(int8)}\right)}{i\_exp\!\left(x_k^{(int8)}\right)}\right) \quad (4)$$
$$\mathrm{LIS}\!\left(s \cdot X^{(int8)}\right) = N - Attn^{(int8)} \quad (5)$$
where $x_k^{(int8)} \in X^{(int8)}$, $N = 2^b - 1$, and the scaled $s \cdot X^{(int8)}$ serves as the quantized input, with $s$ the scale factor.
After Softmax is quantized using $\log_2$, the dot product between the attention map $Attn_k^{(int8)}$ and $V$ is transformed into a bit-shift operation on $V$:
$$head^{(int8)} = Attn_k^{(int8)} \cdot V^{(int8)} = \frac{1}{2^N} \cdot \left(V^{(int8)} \ll \left(N - Attn_k^{(int8)}\right)\right) \quad (6)$$
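The shift identity in Equation (6) can be emulated in a few lines. In the sketch below, float arithmetic stands in for the integer shifts: since $2^{N - Attn}/2^{N} = 2^{-Attn}$, each log2-quantized attention code simply weights the corresponding row of V, and the matrix product accumulates the shifted values. This is an illustrative reading of the equation, not FQ-ViT's kernel.

```python
import torch

def lis_attention_times_v(attn_code, v_int, b=4):
    """Emulate head = (1 / 2^N) * (V << (N - Attn)) for log2-quantized attention codes.
    attn_code: (T, T) integer codes; v_int: (T, D) quantized values; N = 2^b - 1."""
    n = 2 ** b - 1
    weights = 2.0 ** (n - attn_code) / 2.0 ** n   # equals 2^(-code), the recovered softmax value
    return weights @ v_int                        # the matmul sums the shifted rows of V
```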
During FQ-ViT inference, the quantization bit-width b is uniform and fixed across all encoders, typically 8 bits. To further reduce model size and improve inference speed, FQ-ViT also explores lower quantization bit-widths: in particular, changing the bit-width of $Attn$ from 8-bit to 4-bit during inference, which incurs a certain loss of accuracy. However, the authors did not explore even lower bit-widths, and in particular did not investigate the adaptive configuration of mixed-precision quantization with different bit-widths per encoder, tailored to each encoder's characteristics. Such a mixed-precision approach could further improve the deployment effectiveness of quantized ViT models on commodity devices.

2.3. Methodology

The crucial objective of model quantization is to drastically reduce the model size and accelerate the inference speed while preserving the inference accuracy, thereby enabling its deployment on mobile devices, edge devices, or those with limited storage capacity. To this end, this paper presents a novel mixed-precision quantization method for ViT, termed PoMQ-ViT, which is built upon Pareto Optimization [39,40]. PoMQ-ViT adaptively assigns optimal quantization bit-widths to individual encoders within the ViT architecture: encoders sensitive to inference accuracy are allocated a higher precision bit-width, with b set to 8, whereas those less critical to accuracy are quantized using a lower precision bit-width, with b set to 4.
The optimization workflow of PoMQ-ViT is depicted in Figure 2. Initially, a trained FP32 ViT model is quantized with a fixed 8-bit width. On this basis, PoMQ-ViT conducts mixed bit-width quantization across the encoders to identify the optimal solutions, also referred to as the Pareto front. The $int$ in Figure 2 represents the quantization bit-width being optimized; for instance, int8 denotes an 8-bit width. Eventually, according to the specific deployment requirements of the application scenario, an optimal solution (for example, the one with higher precision or faster inference) is selected, and the ViT is quantized based on this solution to obtain PoMQ-ViT.
With the quantization bit-width $b$ set to 8 as the initial configuration, $X_E$ undergoes uniform quantization, denoted as $X_E^{(int8)} = \mathrm{quantize}(X_E, b = 8)$. Let $int$ be the target optimized bit-width: when $b = 4$, $int = int4$; when $b = 8$, $int = int8$. The encoder comprises a total of $h$ SA heads, and the input for the j-th head of the encoder is represented as:
$$K_j^{(int)}, Q_j^{(int)}, V_j^{(int)} = \mathrm{linear}\!\left(X_E^{(int8 \to int)}, W_{K_j}^{(int)}, W_{Q_j}^{(int)}, W_{V_j}^{(int)}\right) \quad (7)$$
where $X_E^{(int8 \to int)}$ denotes the bit-width consistency transformation applied during integer computation. Additionally, $j \in \{1, 2, \ldots, h\}$, and $W_{K_j}^{(int)}$, $W_{Q_j}^{(int)}$, $W_{V_j}^{(int)}$ represent the computation weights.
According to Equation (4), we have:
$$Attn_j^{(int)} = \mathrm{clip}\!\left(\log_2\!\left(\frac{\sum i\_exp\!\left(S_j K_j^{(int)*} Q_j^{(int)*}\right)}{i\_exp\!\left(S_j K_j^{(int)*} Q_j^{(int)*}\right)}\right),\; 0,\; 2^b - 1\right) \quad (8)$$
According to the computational processes of the standard ViT and FQ-ViT, the encoder activation value $A^{(int)}$ is obtained as:
$$\begin{aligned}
head_j^{(int)} &= \frac{1}{2^N} \cdot \left(V_j^{(int)} \ll \left(N - Attn_j^{(int)}\right)\right) \\
Multihead^{(int)} &= \mathrm{concat}\!\left(head_1^{(int)}, \ldots, head_j^{(int)}, \ldots, head_h^{(int)}\right) W_{concat}^{(int)} \\
\mathrm{MSA}\!\left(X_E^{(int)}\right) &= \mathrm{linear}\!\left(Multihead^{(int)}, W_{head}^{(int)}\right) \\
Y^{(int)} &= \mathrm{LayerNorm}\!\left(\mathrm{MSA}\!\left(X_E^{(int)}\right) + X_E^{(int8 \to int)}, W_{o1}^{(int)}\right) \\
A^{(int)} &= \mathrm{MLP}\!\left(\mathrm{LayerNorm}\!\left(Y^{(int)}, W_{o2}^{(int)}\right)\right) + Y^{(int)}
\end{aligned} \quad (9)$$
where $W_{concat}^{(int)}$, $W_{head}^{(int)}$, $W_{o1}^{(int)}$, $W_{o2}^{(int)}$ represent the computation weights.
By adjusting $int$, the quantization bit-width used in the encoder computation can be selected. The variables with selectable bit-width in the encoder computation include the weights $W^{(int)}$, the computed attention values $Attn^{(int)}$, and the encoder activation values $A^{(int)}$. The ViT calculation process is shown in Figure 3.
Obviously, configuring all encoder layers with a uniform quantization bit-width of 4 bits will result in a significant decrease in accuracy for the quantized model, whereas configuring them all to 8 bits will fail to achieve extreme compression. The primary concern of quantized models is ensuring that the loss in inference accuracy remains within acceptable limits.
Therefore, high-precision quantization is applied to encoder layers that are highly sensitive to accuracy, while low-precision quantization is applied to layers with low sensitivity. How can this sensitivity be measured? This paper proposes three quantization error indicators: $E_1$ for input quantization error, $E_2$ for attention-value quantization error, and $E_3$ for activation-value quantization error. Specifically, the $L_1$-norm is adopted to compute these errors, ensuring a robust quantification of the discrepancy between floating-point and quantized results. The expressions for these three errors are:
$$\begin{aligned}
E_1 &= \left\| \left[K^{(FP32)}, Q^{(FP32)}, V^{(FP32)}\right] - \left[K^{(int)}, Q^{(int)}, V^{(int)}\right] \right\|_1 \\
E_2 &= \left\| Attn^{(FP32)} - Attn^{(int)} \right\|_1 \\
E_3 &= \left\| A^{(FP32)} - A^{(int)} \right\|_1
\end{aligned} \quad (10)$$
where FP32 denotes the results of the model obtained through floating-point calculations.
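A direct way to compute these indicators is to run one calibration batch through the FP32 encoder and its quantized counterpart and accumulate the L1 discrepancies, as in the sketch below. The tensor-collection step is assumed to be handled by the caller (e.g., via forward hooks); only the error computation itself is shown.

```python
import torch

def l1_error(fp32_tensors, int_tensors):
    """Sum of L1 discrepancies between FP32 results and de-quantized integer results."""
    return sum((a - b).abs().sum().item() for a, b in zip(fp32_tensors, int_tensors))

def encoder_sensitivity(kqv_fp, kqv_q, attn_fp, attn_q, act_fp, act_q):
    """E1: K/Q/V projections, E2: attention map, E3: encoder activation (Eq. (10))."""
    e1 = l1_error(kqv_fp, kqv_q)          # lists [K, Q, V] in FP32 vs. quantized form
    e2 = l1_error([attn_fp], [attn_q])
    e3 = l1_error([act_fp], [act_q])
    return e1, e2, e3
```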
The purpose of model quantization is to reduce inference time and shrink model size, facilitating deployment on commodity devices. Suppose there are two choices of $int$ for each encoder layer of ViT, namely int4 or int8. Among all possible configurations, is there an optimal set of $int$ assignments that achieves the best balance between minimal quantization error and shortest inference runtime? This balancing process is a typical multi-objective optimization problem. This paper introduces the concept of a Pareto space, where the different $int$ configurations of the ViT encoders constitute the Pareto solution set. Through Pareto Optimization, we can evaluate the accuracy and efficiency of ViT inference under different quantization bit-width settings and ultimately find the Pareto optimal frontier. To this end, let $\min \varepsilon$ denote the minimal quantization error achieved during optimization and $\min t$ the shortest computational time. Within the framework of multi-objective optimization, the Pareto optimization objective is formally defined in Equation (11), balancing the trade-off between quantization error minimization and computational efficiency:
$$\begin{aligned}
\min \varepsilon &= \min\left(E_1, E_2, E_3\right) \\
\min t &= \min F\!\left(encoder^{(int)}\right)
\end{aligned} \quad (11)$$
where $F(encoder^{(int)})$ is the inference-time function of the ViT, obtained by directly measuring the latency of executing a layer at the given quantized precision on the target hardware platform.
In performing mixed-precision quantization, it is essential to consider the bit-width setting of each layer as well as the resulting changes in inference speed and model size. The Pareto optimization constraint equations are as follows:
(1)
Since the quantization research of FQ-ViT is based on 8-bit precision, and 2-bit quantization tends to cause significant accuracy loss, the selection of quantization bit-widths is limited to 4-bit and 8-bit. The first constraint is that the weights $W$, activation values $A$, and attention values $Attn$ may each use a bit-width of either 4-bit or 8-bit:
$$\begin{aligned}
& b_{W}^{(int)} = 4 \;\lor\; b_{W}^{(int)} = 8 \\
& b_{A}^{(int)} = 4 \;\lor\; b_{A}^{(int)} = 8 \\
& b_{Attn}^{(int)} = 4 \;\lor\; b_{Attn}^{(int)} = 8
\end{aligned} \quad (12)$$
(2)
Owing to the adoption of mixed-precision quantization, the bit-widths assigned to the weights $W$, activation values $A$, and attention values $Attn$ across the encoders cannot all be 4-bit, nor can they all be 8-bit:
$$\begin{aligned}
C_1 &= \left(b_{W}^{(int)} \ne 4\right) \land \left(b_{W}^{(int)} \ne 8\right) \\
C_2 &= \left(b_{A}^{(int)} \ne 4\right) \land \left(b_{A}^{(int)} \ne 8\right) \\
C_3 &= \left(b_{Attn}^{(int)} \ne 4\right) \land \left(b_{Attn}^{(int)} \ne 8\right) \\
& C_1 \land C_2 \land C_3 = 1
\end{aligned} \quad (13)$$
(3)
To ensure an improvement in inference speed with mixed-precision quantization, let $F(encoder^{(FP32)})$ denote the inference time of the ViT model in floating-point precision. The inference-time constraint for the mixed-precision ViT is then:
$$F\!\left(encoder^{(int)}\right) < F\!\left(encoder^{(FP32)}\right) \quad (14)$$
(4)
One important purpose of model quantization is to reduce the size of the model, facilitating its deployment on mobile devices, edge devices, or devices with limited storage. The size of the mixed-quantized model must therefore be considered. We introduce $\mathrm{Modelsize}(encoder^{(int)})$ to denote the size of the mixed-quantized ViT model and $\mathrm{Modelsize}(encoder^{(int8)})$ to denote the size of the fully 8-bit-quantized ViT model. The constraint on model size is therefore:
$$\mathrm{Modelsize}\!\left(encoder^{(int)}\right) < \mathrm{Modelsize}\!\left(encoder^{(int8)}\right) \quad (15)$$
In optimizing the quantization bit-width settings of each encoder, the bit-width parameters are updated using Grid Search [41]. Grid Search is a simple but computationally expensive method that finds the optimal solution by traversing all possible parameter combinations. Because the number of encoders is limited, the optimization equations are cheap to evaluate, and the parameters take discrete values, Grid Search is a reasonable choice here.
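A plain Grid Search over the per-encoder bit-widths, with the uniform all-4-bit and all-8-bit corners removed per constraint (13), can be written as follows. The callback `eval_config` is a placeholder for the paper's calibration-plus-measurement step and is assumed to return the quantization error, measured latency, and model size for a given configuration.

```python
from itertools import product

def grid_search_configs(num_encoders, eval_config):
    """Enumerate per-encoder bit-widths in {4, 8}, skip uniform configurations,
    and collect (bits, error, latency_ms, size_mb) tuples for Pareto analysis."""
    candidates = []
    for bits in product([4, 8], repeat=num_encoders):
        if len(set(bits)) < 2:             # drop the all-4-bit and all-8-bit corners
            continue
        err, latency_ms, size_mb = eval_config(list(bits))
        candidates.append((list(bits), err, latency_ms, size_mb))
    return candidates
```

For the 12-encoder models considered here this enumerates 2^12 - 2 = 4094 configurations, which is small enough for exhaustive traversal; larger search spaces would call for random-search or evolutionary alternatives.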

2.4. Experimental Setup and Measurement Protocols

To ensure reproducibility of latency and overhead metrics, all experiments are conducted on a standardized hardware-software platform. The experimental design, including model configurations, tools, and protocols, is detailed below.
The hardware platform consists of components crucial for stable and comparable deep learning experiments, with the GPU serving as the core computing component for model training and inference. Our training setup includes a single NVIDIA RTX 3090 graphics card (NVIDIA, Santa Clara, CA, USA) equipped with 24 GB of GDDR6X memory. Leveraging its parallel computing architecture, this GPU significantly accelerates matrix operations—operations that are fundamental to both the training and inference processes of Transformer-based models—thus acting as the key hardware support for efficient deep learning computations in this study.
The software stack is configured to balance usability, compatibility, and performance:
  • Operating System: Windows 11 Professional (22H2) (Microsoft Corporation, Redmond, WA, USA) with Windows Subsystem for Linux 2 (WSL2) (Microsoft Corporation, Redmond, WA, USA) enabled. This bridges Windows-based workflow advantages with Linux-optimized deep learning toolchains.
  • Linux Distribution: Ubuntu 22.04 LTS (Canonical Ltd., London, UK) (within WSL2). Offers a stable open-source environment for deploying frameworks and profiling tools.
  • Deep Learning Framework: PyTorch 1.12.1 (Meta Platforms, Inc., Menlo Park, CA, USA), paired with CUDA Toolkit 12.4 (NVIDIA, Santa Clara, CA, USA) (low-level GPU kernel programming) and cuDNN 8.9.2 (NVIDIA, Santa Clara, CA, USA) (optimized deep neural network primitives). These enable efficient model training and inference.
  • Latency Profiling Tools: torch.cuda.Event, the Python 3.9.12 timeit (Python Software Foundation, Wilmington, DE, USA) module, and NVIDIA Nsight Systems 2023.3 (NVIDIA, Santa Clara, CA, USA) were employed for latency computation.
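A typical CUDA-event timing routine of the kind these tools support is sketched below; it assumes a CUDA device is available and reports the mean per-forward latency in milliseconds after a warm-up phase. The warm-up and iteration counts are illustrative choices, not the paper's exact protocol.

```python
import torch

@torch.no_grad()
def measure_latency_ms(model, sample, warmup=10, iters=50):
    """Device-side latency of one forward pass, averaged over timed runs."""
    model.eval().cuda()
    sample = sample.cuda()
    for _ in range(warmup):                      # warm-up: stabilize clocks and caches
        model(sample)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        model(sample)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters       # milliseconds per forward pass
```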
The dataset and models used for training are as follows:
  • Benchmark Dataset: ImageNet-1K (1.2 M training images, 50 K validation images; 1000 classes) (Princeton University, Princeton, NJ, USA). Serves as a standard for evaluating image classification performance.
  • Architectures Compared: Data-efficient Image Transformer (DeiT) [3], Shifted Window Transformer (Swin) [23], Vision Transformer (ViT).
  • Model Sizes and Configurations: For each architecture, four sizes (tiny, small, base, large) are tested to analyze complexity-performance tradeoffs. Key structural parameters (critical for overhead calculation) are summarized in Table 1.
  • Most of the official configurations were adopted in this study, including a learning rate of 1 × 10−5, a weight decay of 1 × 10−3, and the Adam optimizer with its first-moment decay rate β1 set to 0.9 and second-moment decay rate β2 set to 0.999. The batch size for the validation set is typically set to a power of 2; however, in our experimental environment a batch size greater than 50 exceeds the GPU memory capacity and causes an out-of-memory error, so the default validation batch size is set to 50.
To investigate the impact of calibration data volume on latency, we varied the number of post-training quantization (PTQ) calibration iterations, setting them to 2, 5, 10, and 20. This experimental design systematically evaluates the efficiency-accuracy tradeoff of the quantized models and provides a transparent overhead analysis.
To ensure the rigor and reproducibility of Pareto optimal solution screening, we employed the well-established Non-dominated Sorting (NDS) method from the field of multi-objective optimization. This approach was utilized to select candidate solutions among the mixed-precision quantization configurations of encoders, with the selected candidates satisfying the criterion that “no other configuration can simultaneously achieve a Top-1 accuracy not lower than theirs and an inference latency not higher than theirs”. Data variability is mitigated by computing the mean values from three independent replicate experiments; ultimately, the identified Pareto optimal points are plotted to construct the “accuracy-latency” Pareto Front curve, which not only guarantees methodological transparency but also enhances the credibility and generalizability of the conclusions.
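The non-dominated sorting criterion quoted above can be implemented compactly. The sketch below extracts the first Pareto front from a list of (Top-1 accuracy, latency) pairs and matches the stated selection rule up to implementation details not specified in the paper.

```python
def pareto_front(points):
    """Keep (accuracy, latency_ms) points not dominated by any other point:
    a point is dominated if another has accuracy >= and latency <= it,
    with at least one of the two comparisons strict."""
    front = []
    for i, (acc_i, lat_i) in enumerate(points):
        dominated = any(
            acc_j >= acc_i and lat_j <= lat_i and (acc_j > acc_i or lat_j < lat_i)
            for j, (acc_j, lat_j) in enumerate(points) if j != i
        )
        if not dominated:
            front.append((acc_i, lat_i))
    return sorted(front, key=lambda p: p[1])   # order by latency for plotting
```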

3. Results

To assess the efficacy of mixed-precision quantization in PoMQ-ViT inference, we designed a comparative experiment with FQ-ViT, focusing on both inference accuracy and speed. The FQ-ViT quantization method employs two strategies: 8-bit fixed-width quantization and 4-bit or 8-bit stochastic mixed-width quantization. The comparative experiments are conducted with a batch size of 50, using the minmax quantization method, while other hyperparameters are kept constant.
We solve the Pareto Optimization problem to obtain a set of solutions, which allows us to derive the Pareto front, as shown in Figure 4. In the deployment of quantized models, it is generally required that the inference accuracy loss is maintained within 1% to 2%. Therefore, we select an optimal solution from the Pareto front, where the mixed-width configuration consists of 4 encoders with 4-bit quantization width and the remaining 8 encoders with 8-bit quantization width, as presented in Table 2. For the selected optimal configuration of PoMQ-ViT, the stochastic mixed-precision quantization strategy mirrors this configuration, setting 4 encoders to 4-bit quantization width and the remaining encoders to 8-bit quantization width.
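Selecting a deployment point from the front under the 1-2% accuracy-loss budget mentioned above can be expressed as a filter followed by a latency minimization, as in this illustrative helper (the tolerance value is an input, not a fixed property of PoMQ-ViT):

```python
def pick_deployment_point(front, fp32_top1, max_loss=2.0):
    """From an (accuracy, latency_ms) Pareto front, keep points whose Top-1 loss
    versus the FP32 model stays within `max_loss` points, then take the fastest."""
    feasible = [(acc, lat) for acc, lat in front if fp32_top1 - acc <= max_loss]
    return min(feasible, key=lambda p: p[1]) if feasible else None
```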
The results of the comparative experiments are presented in Table 3. Observing the quantization results for DeiT_base, the Top-1 accuracy with the fixed precision quantization strategy is 80.662%, whereas the Top-1 accuracy for PoMQ-ViT is 79.764%, resulting in a loss of 0.898%. For the stochastic mixed-precision quantization, the Top-1 accuracy is 78.231%, with a quantization accuracy loss exceeding 2%. Similarly, in comparison to the fixed 8-bit quantization strategy, DeiT_small experiences a Top-1 accuracy decrease of 1.034%, and DeiT_tiny shows a decrease of 1.17%. The Top-1 loss slightly increases with the reduction in model parameters, but the maximum loss is only 1.17%. Although the Top-1 accuracy with the stochastic mixed-precision quantization meets the accuracy loss requirement (assuming a tolerance of ≤2%), its inference speed is inferior to the optimized results obtained with PoMQ-ViT.
To further evaluate the generalizability of PoMQ, Table 3 summarizes experimental results on Swin and its variants. Relative to 8-bit fixed-width quantization, PoMQ introduces Top-1 accuracy degradations of 0.545%, 0.696%, and 0.905% for Swin_base, Swin_small, and Swin_tiny, respectively, while achieving latency reductions of 8.836 ms, 15.836 ms, and 24.817 ms. For standard fully quantized ViT models utilizing PoMQ, ViT_base and ViT_large exhibit Top-1 accuracy losses of 0.867% and 0.654%, coupled with latency improvements of 8.988 ms and 6.24 ms. These results demonstrate that while PoMQ-enabled mixed-precision quantization for ViT incurs measurable accuracy tradeoffs, the resulting latency reductions remain practically significant across model scales.
As presented in Table 3, a comparative analysis is conducted regarding the model sizes of ViT and its variant architectures under two distinct quantization strategies: FQ and PoMQ. Specifically, for the three DeiT models, namely DeiT_base, DeiT_small, and DeiT_tiny, the quantization bit-width of each encoder layer is set to 8-bit when employing the FQ strategy, resulting in model sizes of 338 M, 86 M, and 22 M, respectively. Conversely, when utilizing the PoMQ approach, each encoder layer adopts an 8-bit or 4-bit adaptive mixed-precision quantization, leading to reduced model sizes of 267 M, 70 M, and 20 M, respectively. On average, the size of PoMQ-DeiT models is 81% of the FQ-DeiT counterparts. The comparative analysis is further extended to the Swin Transformer models, where a similar trend is observed. For the Swin base, small, and tiny models, the model sizes under the FQ strategy are 344 M, 195 M, and 111 M, respectively. When the PoMQ approach is applied, the model sizes are reduced to 272 M, 162 M, and 94 M, respectively. Consequently, the average size of PoMQ-Swin models is 82% of the FQ-Swin counterparts. Moreover, the analysis is expanded to the standard ViT models, where the base and large models experience a decrease in size from 338 M and 1116 M under the FQ strategy to 267 M and 883 M, respectively, when the PoMQ approach is employed. This results in the average size of PoMQ-ViT models being 79% of the FQ-ViT models. The collective comparison results clearly demonstrate that the mixed-precision quantization strategy, as embodied by PoMQ, leads to smaller model sizes compared to the fixed-precision quantization approach.
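A rough way to see where these reductions come from is a bit-width accounting of the encoder parameters; the helper below illustrates that arithmetic. It ignores quantization-parameter overhead and any tensors kept at higher precision, so it approximates rather than reproduces the exact sizes in the table.

```python
def mixed_precision_size_mb(params_per_encoder, bits_per_encoder,
                            other_params=0, other_bits=8):
    """Rough model size: each encoder contributes params * bits / 8 bytes;
    remaining parameters (embeddings, classifier head) stay at `other_bits`."""
    encoder_bytes = sum(p * b / 8 for p, b in zip(params_per_encoder, bits_per_encoder))
    return (encoder_bytes + other_params * other_bits / 8) / 1e6
```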
To further optimize the PoMQ quantization scheme, an ablation experiment was conducted on the number of quantization calibration iterations. Taking 2, 5, 10, and 20 iterations as experimental variables, this study compared the Top-1 accuracy under different iteration counts, with specific results shown in Table 4. Taking DeiT_base as an example, when the number of calibration iterations increased from 2 to 5, the Top-1 accuracy rose from 77.981% to 79.002%, a gain of 1.021%. When the number of iterations further increased to 10, the Top-1 accuracy improved by an additional 0.762%. However, increasing the iterations to 20 produced no significant further improvement, indicating that 5 to 10 iterations already approach performance saturation and additional iterations yield little benefit. Similar patterns were observed for the Swin and ViT models. This ablation experiment allows the PoMQ model to be further optimized and provides empirical support for the practical deployment of the quantization strategy.

4. Discussion

This study proposes PoMQ-ViT, a mixed-precision quantization method for ViT based on Pareto optimization, aiming to balance inference accuracy and efficiency in resource-constrained environments. This section discusses key findings, implications, limitations, and future directions.
Experimental results consistently demonstrate that PoMQ-ViT outperforms fixed-precision and stochastic quantization strategies in balancing accuracy and latency. At its core, PoMQ-ViT employs a layer-wise adaptive bit-width configuration framework, where quantization precisions (4-bit and 8-bit in this work) are dynamically assigned to each encoder layer based on sensitivity error metrics ( E 1 , E 2 , E 3 ). This targeted approach directly translates to measurable gains: across Swin and ViT variants, PoMQ-ViT reduces latency by 6.24–24.817 ms (e.g., 24.817 ms for Swin_tiny, 8.988 ms for ViT_base) while limiting Top-1 accuracy degradation to 0.545–0.905%, a balance that outperforms fixed 8-bit quantization (which typically incurs 1.5–2% accuracy loss for ≤5 ms latency reduction) and stochastic quantization (with comparable accuracy loss but 30–40% smaller latency gains).
This performance aligns with our core hypothesis: ViT components exhibit heterogeneous sensitivity to quantization. Attention mechanisms and high-activation layers, which are critical for preserving feature discriminability, benefit significantly from 8-bit precision and thus minimize accuracy loss. In contrast, less sensitive components (e.g., low-activation feed-forward blocks) tolerate 4-bit quantization, reducing arithmetic operations and memory access latency without compromising model robustness. This nuanced differentiation, absent in uniform quantization strategies, explains PoMQ-ViT’s superior efficiency-accuracy tradeoff.
Notably, PoMQ-ViT’s post-training quantization paradigm enhances its practical utility. By eliminating the need for retraining or fine-tuning, it avoids the substantial computational overhead associated with quantization-aware training, enabling rapid deployment on commodity devices. This flexibility is further amplified by its generation of a Pareto frontier of mixed-precision configurations, catering to diverse deployment priorities: edge devices requiring real-time inference, such as 30+ FPS surveillance systems, can leverage configurations with larger latency reductions, such as Swin_tiny’s 24.817 ms gain, while accuracy-critical applications, such as medical image analysis, may opt for milder quantization, such as ViT_large’s 6.24 ms reduction with 0.654% accuracy loss.
In summary, PoMQ-ViT advances the state of ViT quantization by moving beyond uniform-precision quantization approaches. By anchoring bit-width decisions to encoder-specific sensitivity, it demonstrates that latency reduction and accuracy preservation can be co-optimized, which addresses a longstanding challenge in deploying Transformers in resource-constrained settings.
The Pareto optimization strategy underlying PoMQ-ViT provides a principled framework for multi-objective optimization in model compression, where the trade-off between accuracy and efficiency is explicitly modeled. This method is not limited to ViT but can be extended to other Transformer-based architectures, offering a generalizable solution for deployment in resource-constrained scenarios.
Despite its advantages, PoMQ-ViT has limitations. First, the current implementation is restricted to 4-bit and 8-bit quantization. While this suffices for many scenarios, extreme low-bit quantization remains unexplored, including 2-bit and 1-bit configurations. Such quantization could further reduce latency and memory usage but may introduce larger errors without refined error-control mechanisms. Second, PoMQ-ViT has not yet been integrated with other lightweight techniques such as pruning and knowledge distillation. Combining quantization with pruning could reduce parameter redundancy, and knowledge distillation might mitigate accuracy loss in low-bit scenarios, but these synergies require systematic investigation. Third, whether PoMQ-ViT is adaptable to all model types has not been thoroughly studied. For example, further testing has not been conducted to verify whether PoMQ-ViT is adaptable to tasks that demand high local feature fidelity, such as fine-grained image classification and small-object detection.
To address these limitations, future work will focus on three fronts: (1) Extending PoMQ-ViT to support 2-bit and 1-bit quantization, coupled with advanced error-correction strategies to preserve accuracy; (2) Integrating PoMQ-ViT with pruning and knowledge distillation to achieve multiplicative efficiency gains; (3) Conducting in-depth analysis of PoMQ-ViT’s underperformance in specific scenarios.

5. Conclusions

Post-training quantization techniques have become an effective strategy for deploying ViT models in resource-constrained environments. In this work, the researchers propose a mixed-precision quantization method based on a Pareto-optimal strategy, called PoMQ-ViT. This strategy can adaptively configure the quantization bit-width of different encoders within the ViT model, according to their distinct characteristics. This approach enables a balanced trade-off between the inference accuracy and speed of the quantized ViT model. The comparative experiments conducted in this work demonstrate that the proposed PoMQ-ViT approach provides a superior balance of accuracy and performance compared to fixed-precision and stochastic quantization strategies. These results provide technical support and a strong theoretical foundation for the successful deployment of ViT models on commodity devices.
Furthermore, this study restricts quantization bit-width exploration to 8-bit and 4-bit mixed-precision configurations, which, while meeting basic needs in resource-constrained scenarios, allow for further compression. Future work will extend to 2-bit and even 1-bit quantization, leveraging refined error control to minimize model size and computation while preserving critical task accuracy. Additionally, we plan to integrate PoMQ-ViT with lightweight algorithms like pruning and knowledge distillation. Combining quantization with structured pruning can reduce parameters and optimize efficiency for “parameter simplification-precision adaptation” gains; integrating with knowledge distillation may retain more teacher model information in low-bit scenarios, mitigating accuracy loss. Meanwhile, we will systematically investigate the adaptability of PoMQ-ViT to a broader range of model types, explore its adaptation patterns, and optimize the quantization strategy accordingly. This multi-technology synergy will enable more efficient and reliable deployment of Vision Transformers on extremely resource-constrained edge devices.

Author Contributions

Conceptualization, Z.W.; Data curation, Z.W.; Investigation, Z.W.; Methodology, Z.W. and Z.Z.; Software, Z.W.; Validation, Z.W.; Visualization, Z.W.; Writing—original draft, Z.W.; Writing—review and editing, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ViT: Vision Transformer
PoMQ-ViT: Mixed-Precision Quantization Vision Transformer with Pareto Optimization
FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer
PTF: Power-of-two factor
LIS: Log-int-Softmax
SA: Self-attention
FFN: Feed-forward network
MHSA: Multi-head self-attention
FP32: 32-bit floating point
DeiT: Data-efficient Image Transformer
Swin: Shifted Window Transformer
MinMax: Min-Max algorithm
MatMul: Matrix multiplication
NDS: Non-dominated Sorting

References

  1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar] [CrossRef]
  2. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR), 2021. [Google Scholar]
  3. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. Int. Conf. Mach. Learn. 2021, 139, 10347–10357. [Google Scholar]
  4. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2020; Volume 12346, pp. 213–229. [Google Scholar] [CrossRef]
  5. Rekavandi, A.M.; Rashidi, S.; Boussaid, F.; Hoefs, S.; Akbas, E. Transformers in small object detection: A benchmark and survey of state-of-the-art. Pattern Recognit. 2023, 141, 109659. [Google Scholar] [CrossRef]
  6. Beal, J.; Kim, E.; Tzeng, E.; Park, D.H.; Zhai, A.; Kislyuk, D. Toward transformer-based object detection. arXiv 2020, arXiv:2012.09958. [Google Scholar] [CrossRef]
  7. Cui, Y.; Zhang, G.; Liu, Z.; Xiong, Z.; Hu, J. A deep learning algorithm for one-step contour aware nuclei segmentation of histopathology images. In Medical & Biological Engineering & Computing; Springer: Berlin/Heidelberg, Germany, 2019; Volume 57, pp. 2027–2043. [Google Scholar] [CrossRef]
  8. Yigit, G. Vit-GAN: Image-to-Image Translation with Vision Transformers and Conditional GANs, Authorea Preprints; Springer: Berlin/Heidelberg, Germany, 2023; Available online: https://arxiv.org/abs/2110.09305 (accessed on 14 July 2025).
  9. Xi, H.; Qin, H.; Xiong, Z.; Zhang, J. Transformer-based conditional generative adversarial networks for image generation. In Proceedings of the International Symposium on Artificial Intelligence Control and Application Technology (AICAT 2022), Hangzhou, China, 6–8 May 2022; SPIE: Bellingham, WA, USA, 2022; Volume 12305, pp. 250–255. [Google Scholar] [CrossRef]
  10. Tang, Y.; Wang, Y.; Guo, J.; Tu, Z.; Han, K.; Hu, H.; Tao, D. A survey on transformer compression. arXiv 2024, arXiv:2402.05964. [Google Scholar] [CrossRef]
  11. Zhuang, B.; Liu, J.; Pan, Z.; He, H.; Weng, Y.; Shen, C. A survey on efficient training of transformers. arXiv 2023, arXiv:2302.01107. [Google Scholar] [CrossRef]
  12. Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; Zou, Y. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv 2016, arXiv:1606.06160. [Google Scholar]
  13. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2704–2713. [Google Scholar] [CrossRef]
  14. Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. J. Mach. Learn. Res. 2017, 18, 6869–6898. [Google Scholar]
  15. Liang, Y.; Ge, C.; Tong, Z.; Song, Y.; Wang, J.; Xie, P. Not all patches are what you need: Expediting vision transformers via token reorganizations. Adv. Neural Inf. Process. Syst. 2022, 35, 28096–28108. [Google Scholar]
  16. Xu, Y.; Zhang, Z.; Zhang, M.; Sheng, K.; Li, K.; Dong, W.; Zhang, L.; Xu, C.; Sun, X. Evo-vit: Slow-fast token evolution for dynamic vision transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 2964–2972. Available online: https://arxiv.org/pdf/2108.01390 (accessed on 14 July 2025).
  17. Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. Tinybert: Distilling bert for natural language understanding. arXiv 2019, arXiv:1909.10351. [Google Scholar]
  18. Wu, K.; Zhang, J.; Peng, H.; Liu, M.; Xiao, B.; Fu, J.; Yuan, L. TinyViT: Fast Pretraining Distillation for Small Vision Transformers. Lect. Notes Comput. Sci. 2022, 13681. Available online: https://arxiv.org/pdf/2207.10666v1 (accessed on 14 July 2025).
  19. Yang, Z.; Li, Z.; Zeng, A.; Li, Z.; Yuan, C.; Li, Y. Vitkd: Practical guidelines for vit feature knowledge distillation. arXiv 2022, arXiv:2209.02432. [Google Scholar]
  20. Yang, H.; Yin, H.; Shen, M.; Molchanov, P.; Li, H.; Kautz, J. Global vision transformer pruning with hessian-aware saliency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18547–18557. [Google Scholar] [CrossRef]
  21. Lin, Y.; Zhang, T.; Sun, P.; Li, Z.; Zhou, S. Fq-vit: Post-training quantization for fully quantized vision transformer. arXiv 2021, arXiv:2111.13824. [Google Scholar]
  22. Li, Z.; Gu, Q. I-vit: Integer-only quantization for efficient vision transformer inference. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; pp. 17065–17075. [Google Scholar] [CrossRef]
  23. Esser, S.K.; McKinstry, J.L.; Bablani, D.; Appuswamy, R.; Modha, D.S. Learned step size quantization. arXiv 2019, arXiv:1902.08153. [Google Scholar]
  24. Li, Y.; Gong, R.; Tan, X.; Yang, Y.; Hu, P.; Zhang, Q.; Gu, S. Brecq: Pushing the limit of post-training quantization by block reconstruction. arXiv 2021, arXiv:2102.05426. [Google Scholar] [CrossRef]
  25. Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 548–558. [Google Scholar] [CrossRef]
  26. Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pvt v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
  27. Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451. [Google Scholar] [CrossRef]
  28. Pan, Z.; Cai, J.; Zhuang, B. Fast vision transformers with hilo attention. Adv. Neural Inf. Process. Syst. 2022, 35, 14541–14554. [Google Scholar]
  29. Mehta, S.; Rastegari, M. Separable self-attention for mobile vision transformers. arXiv 2022, arXiv:2206.02680. [Google Scholar] [CrossRef]
  30. Luo, G.; Zhou, Y.; Sun, X.; Wang, Y.; Cao, L.; Wu, Y.; Huang, F.; Ji, R. Towards lightweight transformer via group-wise transformation for vision-and-language tasks. IEEE Trans. Image Process. 2022, 31, 3386–3398. [Google Scholar] [CrossRef]
  31. Maaz, M.; Shaker, A.; Cholakkal, H.; Khan, S.; Zamir, S.W.; Anwer, R.M.; Shahbaz Khan, F. Edgenext: Efficiently amalgamated cnn-transformer architecture for mobile vision applications. Eur. Conf. Comput. Vis. 2022, 3–20. [Google Scholar] [CrossRef]
  32. Yang, C.; Wang, Y.; Zhang, J.; Zhang, H.; Wei, Z.; Lin, Z.; Yuille, A. Lite vision transformer with enhanced self-attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, New Orleans, LA, USA, 18–24 June 2022; pp. 11998–12008. [Google Scholar] [CrossRef]
  33. He, Y.; Lou, Z.; Zhang, L.; Liu, J.; Wu, W.; Zhou, H.; Zhuang, B. Bivit: Extremely compressed binary vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 5651–5663. Available online: https://arxiv.org/pdf/2211.07091 (accessed on 14 July 2025).
  34. Li, Z.; Xiao, J.; Yang, L.; Gu, Q. Repq-vit: Scale reparameterization for post-training quantization of vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 17227–17236. [Google Scholar] [CrossRef]
  35. Liu, Y.; Yang, H.; Dong, Z.; Keutzer, K.; Du, L.; Zhang, S. Noisyquant: Noisy bias-enhanced post-training activation quantization for vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 20321–20330. Available online: https://arxiv.org/abs/2211.16056 (accessed on 14 July 2025).
  36. Nagel, M.; Amjad, R.A.; Van Baalen, M.; Louizos, C.; Blankevoort, T. Up or down? adaptive rounding for post-training quantization. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 7197–7206. Available online: https://proceedings.mlr.press/v119/nagel20a.html (accessed on 14 July 2025).
  37. Choukroun, Y.; Kravchik, E.; Yang, F.; Kisilev, P. Low-bit quantization of neural networks for efficient inference. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 3009–3018. [Google Scholar] [CrossRef]
  38. Ngatchou, P.; Zarei, A.; El-Sharkawi, A. Pareto multi objective optimization. In Proceedings of the 13th International Conference on Intelligent Systems Application to Power Systems, Arlington, VA, USA, 6–10 November 2005; pp. 84–91. [Google Scholar]
  39. Bechikh, S.; Datta, R.; Gupta, A. Recent Advances in Evolutionary Multi-Objective Optimization; Springer: Cham, Switzerland, 2017; pp. 31–70. ISBN 978-3-319-42978-6. [Google Scholar]
  40. Liashchynskyi, P.; Liashchynskyi, P. Grid search, random search, genetic algorithm: A big comparison for NAS. arXiv 2019, arXiv:1912.06059. [Google Scholar]
  41. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. Available online: https://arxiv.org/abs/2103.14030 (accessed on 14 July 2025).
Figure 1. Using ViT-S with 12 encoders and a fixed 8-bit quantization bit-width precision as the baseline, each encoder was sequentially configured to 4-bit quantization from the 1st to the 12th, while the remaining encoders retained 8-bit precision. The resulting accuracy degeneration demonstrates that different encoders exhibit varying sensitivities to changes in quantization bit-width.
Figure 2. The PoMQ-ViT optimization process: the input is a trained ViT quantization model with 8-bit precision, and the output is a selected set of optimal solutions that determine the quantization bit-width settings for each encoder. Note: $int$ denotes the target optimized quantization bit-width (e.g., int8 for 8-bit, int4 for 4-bit).
Figure 3. The ViT calculation process: The encoder is the most important computational component of the standard ViT, and it consists of computational units such as MHSA, LayerNorm, and MLP.
Figure 4. The Pareto front obtained by PoMQ-ViT. The vertical axis (Y-axis) represents Top-1 accuracy in percent (%), and the horizontal axis (X-axis) represents inference latency in milliseconds (ms). The figure clearly demonstrates the optimal trade-off between accuracy and inference speed.
Table 1. Configurations for different model sizes.
| Model Size | Encoders | SA Heads | Embedding Dimension |
|---|---|---|---|
| base | 12 | 12 | 768 |
| tiny | 12 | 6 | 192 |
| small | 12 | 3 | 384 |
| large | 24 | 16 | 1024 |
Table 2. Quantization Bit-Width Settings of Each Encoder in ViT for the Selected Optimal Solution.
| Encoder | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bit-width | 4 | 4 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | 4 |
Table 3. Comparison results of FQ-ViT with 8-bit fixed-width quantization, 4-bit or 8-bit stochastic mixed-width quantization and PoMQ.
| Model | Top-1 8-Bit (%) | Top-1 Stochastic (%) | Top-1 PoMQ (%) | Time 8-Bit (ms) | Time Stochastic (ms) | Time PoMQ (ms) | Size 8-Bit (M) | Size Stochastic (M) | Size PoMQ (M) |
|---|---|---|---|---|---|---|---|---|---|
| DeiT_base | 80.662 | 78.231 | 79.764 | 449.394 | 440.56 | 439.267 | 338 | 267 | 267 |
| DeiT_small | 78.36 | 77.23 | 77.326 | 224.797 | 210.64 | 200.998 | 86 | 70 | 70 |
| DeiT_tiny | 70.036 | 68.733 | 68.866 | 139.173 | 110.892 | 101.458 | 22 | 20 | 20 |
| Swin_base | 82.554 | 80.223 | 81.472 | 618.859 | 611.90 | 610.023 | 344 | 272 | 272 |
| Swin_small | 82.172 | 80.568 | 80.86 | 458.221 | 447.776 | 442.395 | 195 | 162 | 162 |
| Swin_tiny | 80.218 | 78.650 | 78.698 | 286.715 | 268.554 | 261.898 | 111 | 94 | 94 |
| ViT_base | 82.860 | 81.840 | 81.993 | 450.467 | 443.568 | 441.479 | 338 | 267 | 267 |
| ViT_large | 85.004 | 83.998 | 84.351 | 1229.861 | 1225.041 | 1223.621 | 1116 | 883 | 883 |
Table 4. Ablation Study of PoMQ Quantization Calibration Iterations.
| Model | Top-1 (%), 2 Iter. | Top-1 (%), 5 Iter. | Top-1 (%), 10 Iter. | Top-1 (%), 20 Iter. |
|---|---|---|---|---|
| DeiT_base | 77.981 | 79.002 | 79.764 | 79.803 |
| DeiT_small | 75.005 | 76.997 | 77.326 | 77.369 |
| DeiT_tiny | 65.008 | 67.834 | 68.866 | 69.002 |
| Swin_base | 80.033 | 81.006 | 81.472 | 81.501 |
| Swin_small | 79.512 | 80.504 | 80.86 | 80.905 |
| Swin_tiny | 76.827 | 78.509 | 78.698 | 78.893 |
| ViT_base | 80.274 | 81.506 | 81.993 | 82.031 |
| ViT_large | 83.106 | 84.008 | 84.35 | 84.301 |