Rethinking Query, Key, and Value Embedding in Vision Transformer under Tiny Model Constraints

A vision transformer (ViT) is a dominant model in the computer vision field. Despite numerous studies that mainly focus on dealing with inductive bias and complexity, the problem of finding better transformer networks remains. For example, conventional transformer-based models usually use a projection layer for each query (Q), key (K), and value (V) embedding before multi-head self-attention. Insufficient consideration of semantic $Q$, $K$, and $V$ embedding may lead to a performance drop. In this paper, we propose three types of structures for $Q$, $K$, and $V$ embedding. The first structure utilizes two layers with ReLU, which is a non-linear embedding for $Q$, $K$, and $V$. The second shares one of the non-linear layers to share knowledge among $Q$, $K$, and $V$. The third proposed structure shares all non-linear layers with code parameters. The codes are trainable, and their values determine which embedding process is performed among $Q$, $K$, and $V$. We demonstrate the superior image classification performance of the proposed approaches in experiments compared to several state-of-the-art approaches. The proposed method achieved $71.4\%$ with few parameters ($3.1M$) on the ImageNet-1k dataset, compared to $69.9\%$ for the original XCiT-N12 transformer model. Additionally, the method achieved $93.3\%$ with only $2.9M$ parameters in transfer learning on average over the CIFAR-10, CIFAR-100, Stanford Cars, and STL-10 datasets, which is better than the accuracy of $92.2\%$ obtained via the original XCiT-N12 model.


Introduction
Currently, transformers have emerged as dominant models in deep learning because of their superior performance, especially in terms of long-range dependency [7,30]. Transformers first appeared in natural language processing (NLP) [30] and have since become a widely used backbone network of state-of-the-art models [1,20,21]. A transformer mainly adopts a self-attention mechanism, followed by a feedforward network, to capture the sequence of input tokens. The self-attention mechanism can deal with the global information of input tokens, and this feature allows such models to achieve state-of-the-art performance [1,17].

Figure 1. Conceptual diagram of conventional linear mapping, our non-linear mapping, and our non-linear mapping with a shared layer. Compared to separate mapping, shared mapping can find a new combination of Q, K, and V in a shared manifold.
Although the use of transformers has been increasing in NLP, convolutional neural networks (CNNs) have remained mainstream in computer vision (CV) [10,27,28,39]. CNNs have representative features such as translation equivariance and spatial locality. These features enable the dominant performance of CNNs in CV tasks [12,23,24,28] such as image classification, object detection, and segmentation. Recently, however, the use of transformers has also increased in CV because vision transformers (ViT) [8] offer competitive performance. As current research shows [2,8,29], transformer-based networks have now become state-of-the-art models, replacing CNN-based networks.
ViT shows superior performance by removing inductive bias, which is a representative feature of CNNs, and by reinforcing long-range dependency. Standard vision transformers flatten 2-dimensional images into 1-dimensional sequences of tokens and then apply a global self-attention mechanism to extract the relations among tokens. This procedure can consider the long-range relationship among tokens, in contrast to CNNs, which consider the local relationship of image pixels. However, it results in extensive computation that increases quadratically with the image resolution, and in the need for weights [8,25] pre-trained on large-scale datasets such as JFT-300M [26] and ImageNet-21k [6]. To alleviate these inefficiencies, several researchers have considered adding inductive bias to the transformer explicitly [31,35,37,38] and have attempted to reduce the computational cost by modifying the multi-head self-attention (MHSA) method [9]. However, there are few considerations regarding new embedding techniques for the query (Q), key (K), and value (V) [31].
In this study, we consider the design of the structure of embedding Q, K, and V . First, we present a two-layer embedding structure model with a rectified linear unit activation function (ReLU). In contrast to the original embedding, the structure can transform the input data into a non-linear space. It has the potential to improve the performance of the ViT because more non-linearity can solve more complicated problems. The second structure is a one-layer shared structure, which is a variant of the first structure. The third structure is a two-layer shared structure with code parameters to improve the original self-attention mechanism used in conventional transformers. Finally, experimental results demonstrate that the proposed method outperforms the state-of-the-art (SOTA) model in the ImageNet [6] classification task and transfer learning on multiple datasets (i.e., CIFAR-10, CIFAR-100 [16], Stanford Cars [15], and STL-10 [4]) for evaluating various image classification tasks.

Related Works
ViT was first proposed by Dosovitskiy et al. [8]. This pioneering work directly applies a plain transformer encoder, which is commonly used in NLP. They split images into patches to embed them as tokens and feed them to the transformer encoder with additional positional embeddings. The transformer encoder consists of repeated multi-head self-attention and multilayer perceptron (MLP) layers. In contrast to CNN structures, the transformer can handle long-range information by applying global self-attention among the tokens. However, global self-attention causes a lack of inductive bias, which makes large-scale pre-training necessary, and leads to quadratically increasing computation. Although DeiT [29] suggests distilling knowledge using a CNN model as a teacher to train the transformer without inductive bias, it requires an additional teacher model that causes extra computation.
Therefore, current research mainly focuses on applying explicit inductive bias to vision transformers [5,18,33,37] and on reducing the computational complexity from quadratic to linear [9,13]. XCiT [9] is one study that focused on reducing the computational complexity. It introduces cross-covariance attention (XCA) instead of standard self-attention. XCA effectively reduces computation without a large performance gap compared to the baseline. However, it does not consider the Q, K, and V embedding, which directly affects the attention operation. In contrast to these works, we introduce Q, K, and V embedding techniques to improve the performance of image recognition.
CvT [31] is the work most similar to ours. CvT utilizes convolutional projection for the Q, K, and V embedding of the transformer encoder to take advantage of both the CNN and the transformer. However, it still uses separate projections and linear operations, which may lead to a constrained combination of Q, K, and V. Although CvT-13 achieved 81.6% in ImageNet classification using 20M parameters, and larger models achieved higher scores, they did not consider tiny model constraints.

Attention Mechanism in Vision Transformer
ViT [8] constructs the network following the original NLP transformer encoder. It splits images into sets of patches and uses them as tokens, which are embedded with the positional embedding. The embedded vector is used for the following repeated self-attention and feed-forward networks. In this section, we introduce two representative approaches to the self-attention mechanism, namely self-attention and cross-covariance attention.

Self-attention
Conventional transformers adopt self-attention as the core operation of the network [8,30]. In the transformer, the input token X is projected by the linear projection layers $W_q$, $W_k$, and $W_v$ to embed the Q, K, and V vectors, respectively. Then, V is multiplied by the attention matrix, which is obtained by the scaled dot product of Q and K. This self-attention operation can be represented as:

$$\mathrm{SA}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V, \tag{1}$$

where $\mathrm{SA}(Q, K, V)$ is the output of self-attention and $Q$, $K$, $V$ are processed by the following operations:

$$Q = XW_q, \quad K = XW_k, \quad V = XW_v, \tag{2}$$

where $X \in \mathbb{R}^{N \times d}$, $N$ is the number of tokens, and $d$ is the token dimension. Self-attention is usually conducted in a multi-headed manner. When the number of heads is $h$, each head applies Eq. (1) to a $d/h$-dimensional slice of $Q$, $K$, and $V$, and the $h$ outputs are concatenated:

$$\mathrm{MHSA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h), \quad \mathrm{head}_i = \mathrm{SA}(Q_i, K_i, V_i). \tag{3}$$
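As a concrete illustration, the scaled dot-product self-attention above can be sketched in plain, framework-free Python (a toy sketch with made-up dimensions, not the paper's implementation):

```python
import math
import random

def matmul(A, B):
    # (n x m) @ (m x p) -> (n x p), on lists of lists
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X, Wq, Wk, Wv):
    # linear projections: Q = X Wq, K = X Wk, V = X Wv
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(Q[0])
    # attention matrix: softmax(Q K^T / sqrt(d)), then the dot product with V
    scores = matmul(Q, transpose(K))
    A = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(A, V)

random.seed(0)
rand = lambda n, m: [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]
N, d = 4, 8  # toy sizes: 4 tokens of dimension 8
X = rand(N, d)
out = self_attention(X, rand(d, d), rand(d, d), rand(d, d))
```

The intermediate N x N attention matrix is what makes the cost grow quadratically with the number of tokens N.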

Cross-covariance Attention
Cross-covariance attention (XCA) is a modified self-attention mechanism that can reduce the computational complexity from $O(N^2 d)$ to $O(N d^2)$. XCiT [9], a variant of ViT, suggested the use of XCA instead of the standard self-attention operation that is widely used in transformer networks, demonstrating SOTA performance in CV tasks, for example, ImageNet classification, self-supervised learning, object detection, and semantic segmentation. They conducted feature-dimension self-attention instead of the token-dimension self-attention used in standard transformer networks. By simply transposing Q, K, V and reversing the order of the dot product, the computational complexity is reduced from quadratic to linear in the number of tokens, $N$. This transposed feature-dimension self-attention can be represented as:

$$\mathrm{XCA}(Q, K, V) = V \cdot \mathrm{softmax}\!\left(\frac{\hat{K}^\top \hat{Q}}{\tau}\right),$$

where $\hat{Q}$, $\hat{K}$ are the $\ell_2$-normalized $Q$, $K$ vectors, and $\tau$ is the temperature scaling parameter. This is orthogonal to our research, which proposes to modify the Q, K, V embedding procedure. We adopted XCiT as a baseline model to verify the performance of the proposed method.
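The feature-dimension attention can be sketched the same way (again a toy, framework-free sketch; the token-axis l2-normalization and the fixed tau = 1 are simplifying assumptions based on the description above):

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def xca(Q, K, V, tau=1.0):
    # l2-normalize each feature column of Q and K along the token axis
    def norm_cols(M):
        cols = [[v / (math.sqrt(sum(x * x for x in c)) + 1e-8) for v in c]
                for c in transpose(M)]
        return transpose(cols)
    Qh, Kh = norm_cols(Q), norm_cols(K)
    # d x d attention over feature dimensions: softmax(K^T Q / tau)
    S = matmul(transpose(Kh), Qh)
    A = [softmax([v / tau for v in row]) for row in S]
    return matmul(V, A)  # (N x d) @ (d x d): linear in the token count N

random.seed(0)
rand = lambda n, m: [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]
N, d = 4, 8  # toy sizes
out = xca(rand(N, d), rand(N, d), rand(N, d))
```

Because the attention matrix here is d x d rather than N x N, the cost scales linearly with the number of tokens.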

Proposed method
As mentioned in Eq. 2, the conventional embedding method uses a linear layer for each of Q, K, and V. The layer is not shared across the embedding spaces of Q, K, and V. In this paper, we propose three types of new embedding techniques to improve the performance of ViT, as described in the following subsections.

Separate Non-linear Embedding (SNE)
In contrast to the original ViT-based models, we apply non-linear transformations to extract Q, K, and V, respectively, as follows:

$$Q = \sigma(XW^{(1)}_q)W^{(2)}_q, \quad K = \sigma(XW^{(1)}_k)W^{(2)}_k, \quad V = \sigma(XW^{(1)}_v)W^{(2)}_v, \tag{4}$$

where $W^{(1)}_q \in \mathbb{R}^{d \times d_q}$ and $W^{(2)}_q \in \mathbb{R}^{d_q \times d}$ represent the weight parameters of the first and second fully connected layers, respectively, the layers encode the input token as Q (and analogously for K and V), and $\sigma$ is an activation function. Based on the obtained Q, K, and V, Eq. 3 is computed for self-attention.
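The SNE embedding above can be sketched minimally as follows (toy dimensions; `sne_embed` is a hypothetical helper name, and ReLU stands in for the activation, as stated in the text):

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def sne_embed(X, W1, W2):
    # two-layer non-linear mapping: ReLU(X W1) W2
    H = [[max(0.0, v) for v in row] for row in matmul(X, W1)]
    return matmul(H, W2)

random.seed(0)
rand = lambda n, m: [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]
N, d, dq = 4, 8, 8  # toy sizes; dq is the hidden width d_q
X = rand(N, d)
# separate two-layer mappings for each of Q, K, and V
Q = sne_embed(X, rand(d, dq), rand(dq, d))
K = sne_embed(X, rand(d, dq), rand(dq, d))
V = sne_embed(X, rand(d, dq), rand(dq, d))
```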
The layers for the SNE consist of two fully connected layers with an activation function (ReLU) to conduct a non-linear transformation of the input tokens. The non-linear embedding approach may have some advantages. It increases the total number of non-linearities in the model; under a limited number of parameters, the increased non-linearity could have a positive effect on generalization. Furthermore, it can expand the search space for finding new combinations of Q, K, and V.

Partially-Shared Non-linear Embedding (P-SNE)
P-SNE shares one of the two fully connected layers in the SNE model. There are two options for which layer to share (i.e., the first or the second layer). Sharing the first layer is similar to the linear embedding originally used in the ViT model: a shared first layer produces the same output for Q, K, and V because the input values are also shared. Consequently, we chose to share the second layer among Q, K, and V.
The shared layer linearly transforms each activation value extracted from the first non-linear layer. With the shared layer, Q, K, and V can share knowledge on how to build each token. In addition, the separate layers of the original or SNE structures might have a potential issue that the shared layer can prevent. With separate layers, even if one of the layers responsible for Q, K, or V extraction does not learn well, the training loss can still be minimized. This means that the network may appear to train well even if one of the three separate layers for Q, K, and V is barely updated.
Finally, Q, K, and V in the P-SNE are extracted as follows:

$$Q = \sigma(XW^{(1)}_q)W^{(2)}_s, \quad K = \sigma(XW^{(1)}_k)W^{(2)}_s, \quad V = \sigma(XW^{(1)}_v)W^{(2)}_s, \tag{5}$$

where $W^{(2)}_s \in \mathbb{R}^{d_s \times d}$ denotes the weight parameters of the shared second layer, with $d_s = d_q = d_k = d_v$. That is, we replace the separate second-layer weights $W^{(2)}_q$, $W^{(2)}_k$, and $W^{(2)}_v$ of the SNE with the single shared $W^{(2)}_s$.
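A minimal sketch of P-SNE, differing from SNE only in that the second-layer weight is one shared matrix (toy dimensions, not the paper's implementation):

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def relu(M):
    return [[max(0.0, v) for v in row] for row in M]

random.seed(0)
rand = lambda n, m: [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]
N, d, ds = 4, 8, 8  # toy sizes; ds is the shared hidden width d_s
X = rand(N, d)
# separate first layers keep Q, K, and V distinct ...
W1q, W1k, W1v = rand(d, ds), rand(d, ds), rand(d, ds)
# ... while a single second layer is shared by all three streams
W2s = rand(ds, d)
Q = matmul(relu(matmul(X, W1q)), W2s)
K = matmul(relu(matmul(X, W1k)), W2s)
V = matmul(relu(matmul(X, W1v)), W2s)
```

Sharing the second layer drops two of the three second-layer weight matrices, which is where the parameter savings over SNE come from.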

Fully-Shared Non-linear Embedding (F-SNE)
F-SNE shares all the non-linear embedding layers among Q, K, and V. Instead of Q, K, V projection layers that separately transform the input embedding token X into the Q, K, V vectors, the shared projection layers transform the input embedding token X into Q, K, and V using shared weights. This can guide the projection of Q, K, and V towards the same manifold.
However, we could only conduct a single transformation using the shared projection layers, which makes it impossible to embed different vectors simultaneously. To handle this problem, we add codes $C_q$, $C_k$, $C_v$ to reflect the separate Q, K, and V embeddings. $C_q$, $C_k$, $C_v$ are trainable vectors that are concatenated to the input token X before passing through the shared projection layers $W_s$. Additionally, the same $C_q$, $C_k$, $C_v$ are shared among all the encoders in the transformer. Through this sharing, the codes will converge to the optimal semantic representations of Q, K, and V that exist consistently regardless of the encoder.
The Q, K, V embedding by the proposed shared projection layers with codes $C_q$, $C_k$, $C_v$ can be comprehensively represented as follows:

$$Q = \sigma((X \oplus C_q)W^{(1)}_s)W^{(2)}_s, \quad K = \sigma((X \oplus C_k)W^{(1)}_s)W^{(2)}_s, \quad V = \sigma((X \oplus C_v)W^{(1)}_s)W^{(2)}_s, \tag{6}$$

on $X \in \mathbb{R}^{N \times d}$ and $C_q, C_k, C_v \in \mathbb{R}^{N \times c}$, where $W^{(1)}_s \in \mathbb{R}^{(d+c) \times d_s}$ and $W^{(2)}_s \in \mathbb{R}^{d_s \times d}$ are the shared layers, $\oplus$ denotes vector concatenation, and $c$ is an arbitrary code size. Here, $C_q, C_k, C_v \in \mathbb{R}^{1 \times c}$, but they are repeated $N$ times to match the dimension. The codes $C_q$, $C_k$, and $C_v$ are computed to minimize the loss function as follows:

$$\theta^*, C^* = \arg\min_{\theta, C} \sum_{(x, y) \in \mathcal{D}} \mathcal{L}\big(f_\theta(x; C_q, C_k, C_v), y\big) + \lambda R(\theta),$$

where $\mathcal{L}$ is a loss function used in the transformer (i.e., cross-entropy loss), $\theta$ denotes the total parameters of the transformer network, and $\mathcal{D}$ is the training set. $R$ represents a regularizer (i.e., weight decay), and $\lambda$ is a parameter that controls the strength of the regularizer.
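The code-conditioned shared embedding can be sketched as follows (a toy forward pass only; in the actual model the codes would be trained by back-propagation, and all dimensions here are made up):

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def relu(M):
    return [[max(0.0, v) for v in row] for row in M]

random.seed(0)
rand = lambda n, m: [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]
N, d, ds, c = 4, 8, 8, 4  # toy sizes; c is the code size
X = rand(N, d)
# trainable 1 x c codes, shared across all encoders
Cq = [random.gauss(0, 1) for _ in range(c)]
Ck = [random.gauss(0, 1) for _ in range(c)]
Cv = [random.gauss(0, 1) for _ in range(c)]
# both layers are shared; the first takes the widened (d + c) input
W1s, W2s = rand(d + c, ds), rand(ds, d)

def fsne_embed(X, code):
    # X ⊕ C: repeat the code for every token, then apply the shared layers
    Z = [row + code for row in X]
    return matmul(relu(matmul(Z, W1s)), W2s)

Q, K, V = fsne_embed(X, Cq), fsne_embed(X, Ck), fsne_embed(X, Cv)
```

A single pair of shared weight matrices produces all three streams; only the small c-dimensional codes distinguish the Q, K, and V roles.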

Experiments and Results
We evaluated the effect of the new Q, K, and V embedding methods on image classification tasks. We first evaluated the performance of the proposed structures with the ImageNet-1k dataset and then used those models to transfer to the other datasets (i.e., CIFAR-10, CIFAR-100, Stanford Cars, and STL-10). Additionally, we conducted brief experiments using distillation when we evaluated the performance of the models with the ImageNet-1k dataset.
All the experiments for our methods were conducted using the XCiT-Nano and XCiT-Tiny models introduced in [9] as the baseline models. In the following sections, we use the brief notations XCiT-N12 and XCiT-T12 to describe the XCiT-Nano and XCiT-Tiny models with 12 repeated encoders. The input image size of the models was fixed at 224×224, and we trained the models using a batch size of 4,096 for the XCiT-N12-based models and 2,816 for the XCiT-T12-based models. For ImageNet training, we trained the models for 400 epochs with an initial learning rate of $5 \times 10^{-4}$. For transfer learning, we used an initial learning rate of $5 \times 10^{-5}$ for 1,000 epochs of training. We followed the other experimental setups of the original XCiT. All experiments were conducted on an NVIDIA DGX A100 (8 GPUs).
For the Q, K, and V projection layers, we used the output dimensions depicted in Table 1 according to the variants. Additionally, in the case of the XCiT-N12 (F-SNE, c) model, the shared projection layers have an input dimension of (128 + c) to take the input concatenated with the code, and the XCiT-T12 (F-SNE, c) model has shared projection layers with an input dimension of (192 + c). The code is a parameter vector of arbitrary size (e.g., c = 8, 16, 32, 64) defined outside the encoder to be shared among the encoders. For a fair comparison, we compare the number of parameters as well as the performance in the following sections. In the case of the F-SNE models, the codes were also included in the model parameters.

Table 2 shows the ImageNet classification results comparing other models with our proposed method. Top-1 accuracies of the other models noted in the table come directly from each model's publication. In this section, we used ImageNet-1k, which consists of 1.28M training images and 50,000 validation images with 1,000 labels. To demonstrate the performance of the proposed method, we selected XCiT as the baseline model and modified the Q, K, and V projection layers corresponding to each model variant.
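To make the parameter accounting of the variants concrete, here is a hedged back-of-the-envelope count for a single encoder's embedding layers. The token dimension d = 128 follows from the (128 + c) input dimension quoted above for XCiT-N12; the hidden width d_s = d and the omission of biases and of the 3c shared code parameters are simplifying assumptions, since Table 1 is not reproduced here:

```python
def embed_params(d, ds, c=0, variant="SNE"):
    # weight-parameter counts for one encoder's Q/K/V embedding layers (biases ignored)
    if variant == "linear":   # original ViT/XCiT: three separate d x d projections
        return 3 * d * d
    if variant == "SNE":      # three separate two-layer mappings
        return 3 * (d * ds + ds * d)
    if variant == "P-SNE":    # separate first layers, one shared second layer
        return 3 * d * ds + ds * d
    if variant == "F-SNE":    # both layers shared; input widened by the code size c
        return (d + c) * ds + ds * d
    raise ValueError(variant)

d = 128  # token dimension implied by the (128 + c) input of XCiT-N12 (F-SNE, c)
counts = {v: embed_params(d, d, c=8, variant=v)
          for v in ("linear", "SNE", "P-SNE", "F-SNE")}
```

Under these assumptions, SNE doubles the per-encoder embedding weights of the linear baseline, P-SNE recovers a third of that increase, and F-SNE drops below the baseline, which is consistent with the parameter budget freed up for the starred (*) models discussed below.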

ImageNet Classification
As shown in Table 2, the P-SNE model achieved the best score under the below-4M-parameter constraint, improving the accuracy by 1.5% over the baseline. Even under the below-10M-parameter constraint, the P-SNE model surpassed the previous SOTA model, which recorded 77.3% top-1 accuracy, improving the accuracy by 0.5% over it. The starred (*) models are additional experiments to compare the performance of our method using the same number of parameters as the original XCiT model. They show that performance can be improved by adding parameters to the embedding layers up to the capacity saved by fully sharing the embedding layers. For example, the XCiT-T12 (F-SNE, 8*) model scored 77.7%, which is close to the best accuracy of 77.8%. Moreover, the non-linear embedding method (SNE) also improves the classification rates on the ImageNet dataset compared to the baseline model, which uses the linear embedding method.

Table 2. Evaluation on ImageNet classification task with <10M parameters. In Type, 'C' denotes CNN architecture, and 'T' denotes the transformer architecture. Gray color represents our methods.

Distillation
We evaluated the performance on the ImageNet classification task using the distillation technique as well. In this section, we used RegNetY-16GF [22] as a teacher model to conduct hard distillation, as proposed in [29]. Similar to the previous experimental results in [9,29], distillation could improve the performance of each model. Additionally, the shared non-linear embedding methods improve the performance of the baseline model, although the separate non-linear embedding method decreases it. This again highlights the advantage of the shared embedding methods, consistent with the results shown in the previous section. These results are organized in Table 3.

Transfer Learning
To demonstrate the generalization performance of our method, we conducted transfer learning experiments on CIFAR-10, CIFAR-100 [16], Stanford Cars [15], and STL-10 [4] datasets as shown in Table 4. CIFAR-10 and CIFAR-100 consist of 50,000 training images and 10,000 test images with 10 and 100 classes, respectively. The Stanford Cars dataset consists of 8,144 training images and 8,041 test images of 196 classes. Lastly, we used the STL-10 dataset, which contains 5,000 training images and 8,000 test images from 10 classes.
As shown in the average scores of Table 4, our method generally improves the performance of the baseline model. In particular, XCiT-N12 (F-SNE, 32) and XCiT-T12 (F-SNE, 8) achieved the best scores under each parameter constraint (below 4M and 10M, respectively), despite having fewer parameters. Including these best models, the F-SNE models showed mostly better transfer performance on average compared to the other variants. It can be assumed that fully shared embedding may provide better generalization capability than the other variants. However, on the Cars dataset, there were performance drops when transferring the XCiT-T12-based models, as opposed to the XCiT-N12-based models. The embedding dimension of XCiT-T12 is larger than that of XCiT-N12, which means that XCiT-T12 has extra capacity to find a complex mapping function. This may decrease the contribution of non-linear embeddings. Nevertheless, it is still worth using non-linear embedding methods in small models.

Ablation Study and Analysis
We performed several ablation studies to analyze the proposed embedding structures, such as variation in the number of non-linear layers, comparison of shared and unshared code, code visualization of F-SNE, and code size search.

Impact of the Number of Layers
All the proposed methods utilize the two-layer model for non-linear embedding, which is the smallest unit that expresses non-linearity. To analyze the impact of the number of layers, we performed an analysis according to the number of layers. Fig. 3 shows the results based on the XCiT-N12 (SNE) model using the ImageNet dataset. The accuracies with three or four layers decreased, and the best accuracy was achieved when two layers were used. Based on these results, we conclude that more than a single non-linearity in the embedding can hinder the embedding performance in terms of accuracy.

Sharing or Unsharing Codes in F-SNE
Owing to the codes, the input token can be identified in the F-SNE for separately extracting Q, K, and V. In the F-SNE model, the codes $C_q$, $C_k$, and $C_v$ are shared across all embedding modules in the transformer model. This is effective for slightly decreasing the total number of parameters. In addition, as shown in Table 5, the performance can be improved. Without sharing, the codes might have different values, which could hinder finding the optimal code vectors for Q, K, and V. Sharing might be good prior knowledge for reaching the optimal solution.

$C_q$, $C_k$, and $C_v$ in Different Tasks

The optimal code values were initially found on the ImageNet dataset, and the computed code values were then used as the initial codes for the downstream tasks. Even if the pre-trained codes are used, the code values might vary according to the tasks because the codes are updated via back-propagation using each downstream task dataset. Figure 4 shows correlation values of the codes extracted from each downstream task. Interestingly, the correlation matrices are very similar even when the task changes. The diagonal elements give information on the similarity between the same codes, and the non-diagonal elements depict the similarity between different codes. The non-diagonal elements are close to zero, and each code of $C_q$, $C_k$, and $C_v$ tends to have the property of orthogonality. These results can be interpreted to mean that $C_q$, $C_k$, and $C_v$ learn inherent features to be used as Q, K, and V, regardless of the task. Furthermore, the values of the l2-norm for each dataset can be obtained, as presented in Table 6. The codes of ImageNet, Cars, and STL-10 have similar l2-norm values, but CIFAR-10 and CIFAR-100 have different l2-norm values.

Table 4. Evaluation on transfer learning. All models were pre-trained using the ImageNet-1k dataset. † indicates the results are obtained from the paper. ‡ indicates that the experiment could not be performed due to the lack of source code. Gray color represents our methods.

Do Better ImageNet Models Transfer Better?
In general, better ImageNet models transfer better [14]. An experiment was conducted to verify whether this also applies to our cases. As shown in Fig. 5, we could observe a correlation between performance on ImageNet and on the downstream tasks. The STL dataset is similar to the ImageNet task, so it shows a high correlation between the two tasks. However, on the Stanford Cars dataset, it is difficult to see such a correlation.

Code Size
It is important to select an appropriate size for the codes $C_q$, $C_k$, and $C_v$ in the F-SNE model. We conducted experiments with code sizes from 8 to 64 to find an appropriate size that leads to better performance. As shown in Figure 6, a code size of 8 achieved the best accuracy among the nano models, and 16 achieved the best accuracy among the tiny models. The tiny models have an embedding dimension about 1.5 times that of the nano models, as shown in Table 1. This can be interpreted to mean that a larger code size is required when the embedding dimension of the model is larger.

Limitations
The limitations of our work are considered in two aspects: performance and constraint. In terms of performance, our approaches are degraded on the Stanford Cars dataset with the XCiT-T12 model, as shown in Table 4. For the constraint, we started with the constraints of small models. However, our approach can also be applied to other large models, and we believe that small modifications can make our structures easily applicable to large models.

Figure 6. ImageNet classification top-1 accuracy of nano (Nano-ImageNet) and tiny (Tiny-ImageNet) models with respect to the code size from 8 to 64. The nano models correspond to the left axis, and the tiny models correspond to the right axis. Best viewed in color.

Conclusion
In this paper, we proposed three new types of Q, K, and V vector embedding structures for ViT. The first embedding structure utilizes two non-linear layers to embed the input token into separate non-linear spaces for Q, K, and V. The second structure shares one of the two layers among Q, K, and V. Based on experiments, we observed that sharing a single layer is effective for ImageNet classification. The third structure shares both layers, together with the Q, K, and V codes. The codes are trained via back-propagation to minimize the loss of the ViT. This structure is helpful for improving the classification rates in several downstream tasks, such as CIFAR-100 and STL-10. We could easily improve the XCiT model, which is a representative ViT model, using the proposed structures. In particular, our structures performed well with a small number of parameters (approximately 3M or 6M), but we could not prove the effectiveness of the structures in large models. Some modifications may be required to apply the proposed structures to large models; however, we believe that our research is valuable because it can be a starting point for future studies in this direction.