EEViT: Efficient Enhanced Vision Transformer Architectures with Information Propagation and Improved Inductive Bias

Mahmood, Rigel; Patel, Sarosh; Elleithy, Khaled

doi:10.3390/ai6090233

Open AccessArticle

EEViT: Efficient Enhanced Vision Transformer Architectures with Information Propagation and Improved Inductive Bias

by

Rigel Mahmood

,

Sarosh Patel

^*

and

Khaled Elleithy

Department of Computer Science and Engineering, University of Bridgeport, Bridgeport, CT 06604, USA

^*

Author to whom correspondence should be addressed.

AI 2025, 6(9), 233; https://doi.org/10.3390/ai6090233

Submission received: 4 August 2025 / Revised: 13 September 2025 / Accepted: 15 September 2025 / Published: 17 September 2025

Download

Browse Figures

Versions Notes

Abstract

The Transformer architecture has been the foundational cornerstone of the recent AI revolution, serving as the backbone of Large Language Models, which have demonstrated impressive language understanding and reasoning capabilities. When pretrained on large amounts of data, Transformers have also shown to be highly effective in image classification via the advent of the Vision Transformer. However, they still lag in vision application performance compared to Convolutional Neural Networks (CNNs), which offer translational invariance, whereas Transformers lack inductive bias. Further, the Transformer relies on the attention mechanism, which despite increasing the receptive field, makes it computationally inefficient due to its quadratic time complexity. In this paper, we enhance the Transformer architecture, focusing on its above two shortcomings. We propose two efficient Vision Transformer architectures that significantly reduce the computational complexity without sacrificing classification performance. Our first enhanced architecture is the EEViT-PAR, which combines features from two recently proposed designs of PerceiverAR and CaiT. This enhancement leads to our second architecture, EEViT-IP, which provides implicit windowing capabilities akin to the SWIN Transformer and implicitly improves the inductive bias, while being extremely memory and computationally efficient. We perform detailed experiments on multiple image datasets to show the effectiveness of our architectures. Our best performing EEViT outperforms existing SOTA ViT models in terms of execution efficiency and surpasses or provides competitive classification accuracy on different benchmarks.

Keywords:

computer vision; vision transformer; efficient attention; perceiver

1. Introduction

The field of computer vision has seen remarkable advancements in the last couple of decades. The early computer vision systems relied on feature engineering techniques such as edge detection, shape descriptors, and dimensionality reduction among others, before feeding the enhanced information to a machine learning classifier. While these worked to some degree, their recognition accuracy lagged far behind human perception. Even though Convolutional Neural Networks (CNNs) were proposed more than two decades ago, the refinements in their design have been revolutionary in the field of computer vision, often matching or surpassing human recognition.

CNNs learn to identify relevant patterns and implicit details in images, such as edges, textures, and shapes, as they are trained on supervised data. They are capable of learning hierarchical representations as they process visual data in layers; beginning layers focus on learning simple features such as edges, while later layers focus on learning more complex patterns and visual relationships. In each layer of the CNN, multiple learnable convolution kernels are applied to an input image in a sliding manner. The resulting image after the application of a convolution kernel is referred to as a feature map. The feature map operation is followed by a pooling layer that aggregates the information in neighboring blocks. After passing the image through a number of layers involving feature maps and pooling operations, a classifier in the form of a simple neural network is applied to the linearized form of feature maps in the last layer to make the final classification decision. This learning process provides the inductive bias to CNNs in generalizing to new, unseen images, recognizing objects and patterns regardless of variations in lighting, angles, or backgrounds.

While CNNs have demonstrated remarkable success in computer vision applications, their performance when used in Natural Language Processing (NLP) is unimpressive. One of the reasons for the CNNs to be less effective in NLP is their limited receptive field. This is because of the convolution operation which typically uses a kernel much smaller than the data it needs to process. Although the sliding of the kernel and the subsequent pooling operation are able to aggregate more information about the data, this limited receptive field causes loss of information in language modeling. This is akin to trying to understand a paragraph by breaking it into a few words at a time. Thus, a different architecture is desirable for effective language modeling.

The Transformer architecture as proposed in the pioneering work of Vaswani et al. [1] has been monumental in the NLP field. The recent success of Large Language models such as ChatGPT [2], Gemini [3], Llama [4], DeepSeek [5], Claude [6], Qwen [7] etc. with their language comprehension and human-like conversation capabilities corroborates the transformer architecture’s capacity for capturing complex language representations in NLP. One of the main reasons for the success of the Transformer is that it uses the attention mechanism, which potentially has a very large receptive field and can examine the entire text document at once. Attention measures the pairwise similarity between the words (or tokens) of the entire input sequence (converted to tokens) in order to comprehend it.

Similar to a CNN, the Transformer also uses many layers, with each layer doing multiple attentions (referred to as attention heads). The output from these attention heads is aggregated in a learnable manner by a linear network. After processing the tokens through many layers, the last layer feeds the learned representations to a classifier to decide which word would be produced next from the Transformer. The produced word becomes part of the input that is then used to generate the following word. This concept where output is generated one word at a time, and the previously generated output is appended to the input in an iterative manner is referred to as “autoregressive” generation. During the training of the Transformer for language modeling, the target output sequence is incrementally masked so that the Transformer learns to predict the next masked token.

For text classification, e.g., sentiment analysis, the masking of future tokens during training and autoregressive generation of one token at a time is not needed. Instead, an extra token referred to as a “CLS”, or class token, is appended in the beginning of the input data. This token accumulates the class knowledge as the information is refined in each layer for each token by looking at its attention with respect to all other tokens. The classifier at the end of all layers then processes the information in the CLS token to determine the class of the input text.

The immense success of the Transformer in NLP and its potentially large receptive field, raises the question: can the Transformer be equally or more effective than a CNN at image comprehension and classification? The answer to this question was partially given in the seminal work of the Vision Transformer paper [8]. In this paper, the authors adapted the NLP Transformer for classification by converting the input image into small patches and treating the sequence of these patches as sequence of input tokens like in an NLP Transformer. Two important conclusions were given about the effectiveness of image classification using CNNs and Transformers.

1.: CNNs have an inductive bias for image classification and therefore perform better than Transformers, especially when the training data is relatively small.
2.: Transformers can match the classification accuracy of CNNs if pretrained on large datasets. The authors used the JFT training dataset with 300 million images, and demonstrated impressive image classification capabilities of Transformers matching the best CNN architectures for it.

Since the pioneering work of the Vision Transformer, many research ideas have been proposed to enhance the effectiveness of the Transformer in the computer vision field. Why do we strive to make the Transformer better in the image domain when we know that CNN has an inductive bias for vision? There are two reasons for it. One is the expressive power of the attention mechanism in a Transformer as evidenced in LLMs. If we can improve the lack of inductive bias for vision in a Transformer, then it can be better at image comprehension than a CNN. The second reason is the rise of Vision Language Models (VLMs).

Vision language models are multimodal models that are trained on both image and text data. VLMs can understand images, answer questions about images, do visual question answering, image captioning, capture spatial properties, and segment images among other applications. Since VLMs need a unified architecture for both NLP and image domains, the Transformer is a natural choice. The Transformer’s core principle in understanding the relationships in the input is the attention mechanism. Since the attention computes a pairwise similarity between all input tokens, it has a time complexity of

O (n^{2})

if n is the input sequence length. For LLMs and VLMs, the reduction in the computation of attention without loss in performance is an important goal in a Transformer architecture.

In this paper, we focus on enhancing the image classification capabilities of the Vision Transformer (ViT). We first analyze the most popular SOTA Transformer architectures and highlight the key architectural enhancements introduced in these models. We focus on reducing the complexity of the attention mechanism while maintaining or even improving the classification accuracy as compared to the existing ViTs. The important contributions of our work can be summarized as follows:

1.: We enhance the popular PerceiverAR [9] architecture for computer vision which is more efficient than a ViT. Our enhanced Perceiver has approximately one fourth the attention computations as compared to the standard ViT, but has much better classification accuracy than the PerceiverAR.
2.: We adapt a recent segment level PerceiverAR based attention architecture [9] originally proposed for language modeling, for computer vision. We further enhance this attention design to improve the inductive bias for vision Transformer.
3.: We do a detailed comparison of recent SOTA ViT architectures on different popular datasets under different dataset augmentations to provide a proper comparison framework between different designs. We believe such an evaluation environment is important for the research community to fairly compare new designs against published models.

The rest of the paper is organized as follows. We review the important works in the computer vision field focusing on the Transformer based architectures in the next section entitled “Related Work”. Then in Section 3, we cover some preliminaries in terms of formal computations involved in a ViT and the PerceiverAR design, upon which most of our enhancements take place. In Section 4, we present our two enhanced efficient ViT designs. The results section (Section 5) provides a detailed comparison of our enhanced architectures with the popular SOTA ViTs. In Section 6, we discuss the insights gained from our enhanced architectures. Finally, we provide conclusions and future work in Section 7.

2. Related Work

The seminal work, “An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale” [8], first introduced the Vision Transformer (ViT) by applying transformer architectures—originally developed for natural language processing—directly to image classification tasks. This method challenged the then-traditional reliance on convolutional neural networks (CNNs) in computer vision by demonstrating that transformers can achieve competitive, if not superior, performance when trained on sufficiently large datasets. To process images, ViTs operate by dividing each image into fixed-size patches (e.g., 16 × 16 pixels). These patches are then flattened and linearly projected into embeddings, treating them analogously to word tokens in NLP models. This sequence of patch embeddings, along with a special classification token and the corresponding positional embeddings, is then fed into a standard transformer encoder. This setup allows the model to capture global relationships across the entire image. The output corresponding to the classification token is finally passed through a multilayer perceptron (MLP) head to produce the final class predictions. ViT models require significantly less computational resources to train compared to CNNs when pre-trained on large datasets. However, they tend to overfit on smaller datasets due to the lack of inductive biases that are present in CNNs, such as locality and translation equivariance.

CaiT (Class-Attention in Image Transformers) [10] is a novel architecture that enhances Vision Transformers (ViTs) by enabling deeper models to train more effectively through two key contributions. One, CaiT introduces a class token that is updated separately via special class-attention layers, thus decoupling class token processing from patch token processing. Two, CaiT uses a technique called LayerScale, which applies small initial values to residual branch weights, analogous to techniques like residual scaling in ResNets. With these techniques, CaiT achieves superior performance on ImageNet without needing the use of external data or distillation, surpassing both CNNs and previous ViTs.

Bridging the gap between transformers and CNNs in vision, the Swin Transformer [11] introduces a hierarchical vision transformer that computes self-attention within local windows and shifts the windows across layers. This makes it efficient and scalable for high-resolution vision tasks (like detection and segmentation), while maintaining global modeling capabilities. Rather than computing global attention (which is costly), Swin performs self-attention within non-overlapping local windows. It then shifts the windows by a small offset in the next layer, thus enabling cross-window interaction without large compute cost. Similar to CNNs, Swin processes images in stages, gradually reducing resolution and increasing feature dimensions. This allows Swin to produce multi-scale feature maps, which are useful for downstream tasks like object detection and semantic segmentation. Unlike ViT, which scales quadratically, Swin scales linearly with image size due to window-based attention. While Swin has produced impressive image classification results, its drawback is high GPU memory usage due to increased depth and feature map resolution in earlier stages [12].

Twins [13] is a vision transformer architecture that improves spatial modeling by combining local and global attention mechanisms in a hierarchical design. The aim is to better capture both fine-grained local patterns and long-range dependencies, which are key for dense prediction tasks like segmentation and detection. Twins offers two types of attention: Locally-grouped Self-Attention (LSA), which operates within small non-overlapping local windows to efficiently model local dependencies, and Global Sub-sampled Attention (GSA), which captures global context by applying attention on a sub-sampled version of tokens, reducing computational cost. Like the Swin Transformer, Twins builds multi-scale feature maps by gradually reducing spatial resolution and increasing depth, resulting in a model that achieves improved accuracy and efficiency, as well as demonstrates the need for both local and global attention. Since Twins uses both local and global attention, its training and inference computations and memory usage are relatively high.

Data-efficient image Transformers (DeiT) [14] show that with the right training recipe and a novel distillation token, Vision Transformers can be trained from scratch on ImageNet-1K without any extra data, matching or even surpassing CNNs of similar size. DeiT achieves this by adding a learnable distillation token alongside the class token. During training, this token attends to a CNN teacher’s soft labels, thus enabling the ViT student to distill knowledge via its own attention mechanism. Further, DeiT leverages strong regularization and augmentations, stochastic depth, and an optimized learning-rate schedule. Overall, this integrated knowledge distillation into the transformer architecture allows transformer-based vision models to be accessible and accurate without massive pretraining.

The Tokens-to-Token (T2T) Vision Transformer [15] addresses a key limitation of vanilla Vision Transformers (ViTs): they lack local structure modeling in early layers, making them data-hungry and inefficient for training on smaller datasets like ImageNet from scratch. To combat this, instead of directly splitting an image into fixed-size patches, T2T progressively aggregates neighboring tokens, preserving local structures. This mimics the hierarchical nature of CNNs and allows better low-level feature extraction. Progressive tokenization is achieved through passing the input image through a series of soft splits, followed by token transformation and then by re-tokenization steps. This in turn creates a smaller, more informative token set with improved spatial representation. By combining the T2T module with standard Transformer blocks to create T2T-ViT models, these models are more parameter-efficient. They can be trained from scratch on ImageNet without external data or heavy augmentations, while achieving competitive or better accuracy than standard ViT and CNN models, while using fewer parameters and no pretraining. While T2T has a better patch tokenization learning process, its attention backbone is same as a ViT having quadratic attention complexity.

To overcome the lack of inductive bias in Transformers, some hybrid CNN-Transformer architectures have been proposed, e.g., CeiT [16], FasterViT [17], Uniformer [18], CoatNet [19]. CeiT uses a beginning CNN layer to extract initial low-level features from the image. The smaller feature maps obtained are then converted to patches which contain more meaningful spatial information. The CeiT also uses convolutions in the feed forward component of the Transformer block to improve inductive bias. FasterViT also uses initial CNN layers to extract better visual features. It also introduces Hierarchical Attention (HAT) that decomposes global self-attention into a multi-level attention mechanism for better efficiency. Uniformer [18] integrates the strengths of 3D convolutions and spatio-temporal self-attention in a unified Transformer format. It provides efficient and effective representation learning in both image and video understanding. CoatNet [19] provides a principled vertical stacking of convolution and attention layers. It incorporates a simple “relative position bias” into the self-attention mechanism, allowing the model to implicitly learn both input-adaptive relationships and spatially fixed, translation-equivariant patterns like convolutions. Overall, due to the more complex hybrid designs involving CNNs and Transformers, and CNNs not being effective in NLP, such architectures are less favorable in VLMs where language understanding is also important.

3. Preliminaries

For completeness, we summarize the design of the canonical ViT [8] in Appendix A. Our enhanced efficient architectures are inspired by the ideas of the PerceiverAR [9] design. Before we delve into our enhancements, we briefly describe the attention mechanism of the PerceiverAR architecture. Since our proposed work is primarily based on enhancements of the PerceiverAR [9] architecture, we highlight the important architectural details of this design in the following subsection.

PerceiverAR Architecture

In order to reduce the computational complexity of attention, PerceiverAR [9] splits the input sequence into two components of context and the latent in the first layer of the transformer. The first layer computes the attention between the latent and the entire input, but produces the output equivalent to only the size of the latent. All succeeding layers operate like a normal Transformer layer, computing the attention on the input that is the size of the latent and producing the same size output. If n is the sequence length of input

x

, then after the patch embedding, the relationship between the context and the latent is a simple concatenation, described as

x \in R^{n \times d} = x_{context} ∥ x_{latent}

(1)

where

∥

indicates the concatenation of two components. If the context length is c, the latent length is l, and d is the embedding dimensionality, then the context

\in R^{c \times d}

and the latent

\in R^{l \times d}

. In the first layer of PerceiverAR, query Q is computed only on the latent part. So that the latent can absorb the information from the context, the key and values are computed on the entire sequence length of size n. The attention computation in the first layer is described by the following equations:

Q = W_{q} x_{latent} \in R^{l \times d}

(2)

K = W_{k} x \in R^{n \times d}

(3)

V = W_{v} x \in R^{n \times d}

(4)

Output = [Softmax ({QK}^{⊤})] V = AV \in R^{l \times d}

(5)

After the first layer of the Transformer, the remaining layers receive an input

\in R^{l \times d}

, do a normal attention on inputs of size l, without splitting the input into two parts, and produce an output

\in R^{l \times d}

. The computation in the second layer onwards in the PerceiverAR is described below.

Q = W_{q} x \in R^{l \times d}

(6)

K = W_{k} x \in R^{l \times d}

(7)

V = W_{v} x \in R^{l \times d}

(8)

Output = [Softmax ({QK}^{⊤})] V = AV \in R^{l \times d}

(9)

For language modeling, a masking is applied to the attention matrix to hide the future tokens, but for classification purposes, no masking of attention is needed. The attention computation of PerceiverAR is depicted in Figure 1.

Each layer in the PerceiverAR performs layer normalization, followed by the feed forward network similar to the standard Transformer. Thus, the main difference between the standard Transformer and the PerceiverAR is in the Attention module. The attention complexity of PerceiverAR in the first layer of transformer is

O (l \times n)

, while the remaining layers have a complexity of

O (l^{2})

. If many layers are used and if

l ≪ n

, then computation is significantly reduced as compared to the standard transformer.

Even though the PerceiverAR has shown promising results on both NLP and Vision benchmarks, it has the compromise of context being implicitly compressed into the latent output after the first layer. Since the context in remaining layers is not explicitly refined as in a normal transformer, this may have an impact on the information comprehension of the Transformer. We improve upon this aspect and describe our enhanced architectures in the next section.

4. Proposed Enhanced Efficient ViT (EEViT) Architectures

We use the PerceiverAR [9] design as the baseline for devising EEViT architectures. PerceiverAR reduces the attention computation complexity significantly, albeit at the cost of some loss of information, as the output from first layer is condensing the context information into the latent. We overcome this drawback by proposing two enhancements to the PerceiverAR baseline design to improve its classification performance while keeping the computation cost low. The first enhancement is a simple extension of the PerceiverAR design and is detailed in the next subsection.

4.1. Enhanced Efficient ViT Based on Perceiver Attention (EEViT-PAR)

Since the loss of information occurs in the PerceiverAR baseline because of absorption of the context into the latent component, we improve upon this by having the first k layers refine and propagate the context to the next layer. This is accomplished by doing a separate attention on the context component. Thus, in the first k layers, each layer performs two attentions–one is the same as the PerceiverAR baseline, i.e., a cross attention between the latent and the entire input, while the second attention operates on the context only. The first k layers generate two outputs that are concatenated to become a single output. After the first k layers, the output is reduced to the size of the latent achieving a high reduction in attention computation. The following equations describe the computations in the EEViT-PAR.

For the first k layers, the input x is split into a context and latent component as

x = x_{context} ∥ x_{latent} (x \in R^{n \times d}, x_{context} \in R^{c \times d}, x_{latent} \in R^{l \times d})

(10)

Q_{l} = W_{ql} x_{latent} \in R^{l \times d}

(11)

Q_{c} = W_{qc} x_{context} \in R^{c \times d}

(12)

K_{c} = W_{kc} x_{context} \in R^{c \times d}

(13)

V_{c} = W_{vc} x_{context} \in R^{c \times d}

(14)

K = W_{k} x \in R^{n \times d}

(15)

V = W_{v} x \in R^{n \times d}

(16)

{Out}_{1} = [Softmax (Q_{c} K_{c}^{⊤})] V_{c} = [A_{c} V_{c}] \in R^{c \times d}

(17)

{Out}_{2} = [Softmax (Q_{l} K^{⊤})] V = [A_{l} V] \in R^{l \times d}

(18)

Output = {O u t}_{1} ∥ {O u t}_{2} \in R^{n \times d}

(19)

For layers

(k + 1)

to

(m - p)

– (if m is the total number of layers):

Q = W_{q} x \in R^{l \times d}

(20)

K = W_{k} x \in R^{l \times d}

(21)

V = W_{v} x \in R^{l \times d}

(22)

Output = [Softmax ({QK}^{⊤})] V = [AV] \in R^{l \times d}

(23)

For the last p layers, the CLS token is appended to the (

m - p

)th layer:

x \in R^{(l + 1) \times d}

(24)

Q = W_{q} x \in R^{(l + 1) \times d}

(25)

K = W_{k} x \in R^{(l + 1) \times d}

(26)

V = W_{v} x \in R^{(l + 1) \times d}

(27)

Output = [Softmax ({QK}^{⊤})] V = [AV] \in R^{(l + 1) \times d}

(28)

The classification head is attached to the last token’s embedding. We visually depict the operation of EEViT-PAR in Figure 2.

As can be seen from Figure 2, after the first k layers, the size of the attention to be computed is the size of the latent. Even in the first k layers, the attention is split into separate computations of

O (l n)

instead of the usual full attention in ViT, which is

O (n^{2})

. If the number of layers

m > > k

, then the time complexity of the attention becomes

O (l^{2})

, which is quite a lot less than a normal ViT. To enhance the performance further, we borrow the introduction of the CLS token in later stages from the CaiT [10] architecture. Only the last p layers process the CLS token. The motivation for this is to have the first many layers focus on representation learning, and the last few layers then use the learned knowledge to focus on the classification task.

To improve the computational efficiency of the above EEViT-PAR architecture, we realize that if the length of the sequence on which the attention is to be performed can be reduced, the computational efficiency can be improved. One approach was followed in the Linformer [20] design, which performed projection to a lower dimension along the sequence length. While this greatly reduces the attention computation, the compression of information causes a loss of performance. Other alternatives that do not use compression along the sequence dimension have been proposed, e.g., Longformer [21]. Longformer uses a few variations of the sliding window attention to reduce the complexity of attention. In the sliding window approach, attention is calculated between the token representations in a small neighborhood of the current token. While this reduces the computation, the information from tokens farther apart is unavailable, resulting in the loss of contextual information, thus degrading the classification performance. We devise an elegant information propagating attention mechanism that has the efficiency of the sliding window approach, but makes the full context available down the layers of the Transformer. The design of our information propagating ViT is detailed in the next subsection.

4.2. Enhanced Efficient ViT Based on Information Propagation (EEViT-IP)

The key idea in EEViT-IP is to divide the input sequence into small disjoint segments with each segment of size s. The attention is then computed in each segment within the token representations of the segment. To allow for information propagation between segments, we overlap each segment by half the length of the segment, i.e.,

s / 2

. Rather than computing the attention on each segment of size s, which would require

O (s^{2})

operations in each segment, we use the PerceiverAR concept to split each segment further into

s / 2

components. The Q is computed on the last

s / 2

part of the segment, while K and V are computed on the entire segment length s. Thus the attention computation in each segment becomes

O ((s / 2) \times s) = O (s^{2} / 2)

. The idea of dividing the attention computation into overlapping segments has been previously presented in LongLoRA [22]. It has also been used in creating efficient segmented attention for Language Modeling [23], which we adapt in this paper for Vision classification.

LongLoRA introduces a shifted sparse attention mechanism (

S^{2}

-Attn), where the input sequence is partitioned into groups, and attention is computed within each group independently. To enable cross-group information flow, attention heads are split into two halves: one half attends within unshifted groups, while the other applies a shift of half the group size, allowing overlap and communication between adjacent segments. This design facilitates efficient context extension in large language models (LLMs) by applying Low-Rank Adaptation (LoRA) to the attention layers. For instance, LongLoRA can extend an LLM’s context length from 8K tokens to 32K by fine-tuning on 32K-token inputs, divided into 8K chunks. In this setup,

S^{2}

-Attn splits attention heads, with the shifted group attending 4096 tokens ahead. As attention propagates through transformer layers, information gradually diffuses across all chunks, approximating full attention coverage.

In our EEViT-IP design, the input sequence is divided into overlapping segments. On consecutive half segments, we apply the PerceiverAR operation with Q being computed on the later half of the segment, and K, V on the entire segment. This allows propagation of information from the previous half segment to the current half segment. The architecture of EEViT-IP is detailed in Figure 3.

In Figure 3, the

A 0

block is different from other IP blocks as it performs regular attention on the first

s / 2

size of the input. After the first

s / 2

part of the input sequence, a PerceiverAR operation is performed in each IP block. If s is the segment size, then the attention equations governing the ATTN block are described below.

A 0

Block:

Q_{0} = W_{q 0} x_{0 : s / 2} \in R^{\frac{s}{2} \times d}

(29)

K_{0} = W_{k 0} x_{0 : s / 2} \in R^{\frac{s}{2} \times d}

(30)

V_{0} = W_{v 0} x_{0 : s / 2} \in R^{\frac{s}{2} \times d}

(31)

{Output}_{0} = [Softmax (Q_{0} K_{0}^{⊤}) V_{0}] = [A_{0} V_{0}] \in R^{(s / 2) \times d}

(32)

The IP block performs the PerceiverAR operation on two consecutive half segments. The operations of the ith IP block are described below (i varies from 1 to (

2 \times

number of segments

- 1

).

IP Block:

Q_{{ip}_{i}} = W_{q_{i} p_{i}} x_{(s / 2) \cdot i : (s / 2) \cdot (i + 1)} \in R^{\frac{s}{2} \times d}

(33)

K_{{ip}_{i}} = W_{k_{i} p_{i}} x_{(s / 2) \cdot (i - 1) : (s / 2) \cdot (i + 1)} \in R^{s \times d}

(34)

V_{{ip}_{i}} = W_{v_{i} p_{i}} x_{(s / 2) \cdot (i - 1) : (s / 2) \cdot (i + 1)} \in R^{s \times d}

(35)

{Output}_{{ip}_{i}} = [Softmax (Q_{{ip}_{i}} K_{{ip}_{i}}^{⊤}) V_{{ip}_{i}}] = [A_{{ip}_{i}} V_{{ip}_{i}}] \in R^{\frac{s}{2} \times d}

(36)

The output from each of the IP blocks is equal to half of the segment size i.e.,

\in R^{\frac{s}{2} \times d}

. All blocks use Q

\in R^{\frac{s}{2} \times d}

while K, V

\in R^{s \times d}

are computed on the full segment size i.e., on the current half segment and the previous half segment. All blocks in a layer output data of

\frac{s}{2} \times d

size. These are then concatenated to form an output of size

n \times d

. All layers in the EEViT-IP architecture are identical as shown in Figure 3.

The advantage of our IP architecture is that it is highly efficient as attention is calculated on a segment of size s which is much smaller than the total sequence length. Figure 4 shows the attention pattern as it is computed in the first layer. Because of the overlap in segments, the information implicitly propagates to left neighbors (as shown in Figure 5). Thus, each subsequent layer increases the attention receptive field (because of overlap of the half segments in the PerceiverAR blocks). After enough layers, the information from all previous segments is available to an IP block as shown in Figure 6.

Our approach, where we use overlapping segments, accomplishes the same goal as LongLoRA [22] but is more efficient. This is due to the Perceiver mechanism being used in information propagation, which further reduces the attention computation in each segment. Further, in LongLoRA, the number of segments is limited by the number of heads, whereas in our EEViT-IP, there is no restriction on the number of segments to use in partitioning the input sequence.

5. Results

We perform detailed experiments comparing our proposed models with recent ViT architectures. The datasets used to compare ViT designs are shown in Table 1. These span varying train/test and image sizes. Some of the datasets have a relatively higher number of classes, e.g., CIFAR-100, Tiny ImageNet, and ImageNet-1K, and therefore provide more challenges in classification accuracy, especially when no pretrained models are used, and only basic data augmentation is applied. All models are trained using the same learning rate schedules (cosine annealing) and Adam optimizer for 200 epochs. The starting learning rate is 1 × 10⁻⁴ with a batch size of 64 for all models and datasets, except for ImageNet. Note that only on the ImageNet-1K dataset do we use a batch size of 128 and train the models for 100 epochs. This is because of the longer training times needed for this dataset (1.28 million images) on the RTX 4090 GPU system available to us. All transformer models being compared use the exact same data augmentations and are similar in size around 15 million parameters (using 8 layers with 6 heads per layer and an embedding dimension of 384), except for Swin-T which uses 27 million parameters. The model sizes were chosen to provide a good accuracy without overfitting or underfitting on the datasets tested.

We first compare our EEViT-PAR, which improves upon the PerceiverAR [9] design by continuing to propagate the context for initial k layers. Further, it introduces the CLS token a few layers before the last layer (similar to the CaiT [10] design). Table 2 compares the top-1 accuracy for our EEViT-PAR model with the baselines of PerceiverAR [9] and the Vision Transformer [8]. All models in Table 2 use 8 layers with 6 heads in each layer with an embedding dimension of 384. The EEViT-PAR propagates the context to two layers (

k = 2

) and uses the CLS token in the last two layers (

p = 2

). The choice of

k = 2

and

p = 2

was determined empirically to provide a good balance between reduced model complexity and performance. We provide more detail on this in Section 6. During training, all models in Table 2 use the basic data augmentations of random horizontal flip and random cropping after padding the image. All models use 4 segments, except for Tiny ImageNet which uses 8 segments, and ImageNet-1K which uses 16 segments. All results (except for the ImageNet) are based on the average of accuracies for three runs at the end of 200 epochs. ImageNet training takes about 4 days to complete on the compute resources available to us, and thus we provide results for a single run.

If more augmentations are applied to the datasets such as random shear, translation, solarization (but without cutmix [24]), the performance of all models improves as shown in Table 3. These augmentations encourage the model to learn from more diverse and difficult examples. The relative performance of the three models in Table 2 and Table 3 is similar. CIFAR-100 and Tiny Imagenet are more difficult datasets for achieving a higher classification accuracy, as the ratio of training data to the number of classes is relatively low and Transformer models require more training to achieve a higher accuracy.

Figure 7, Figure 8 and Figure 9 show the accuracy results of the different models as training proceeds for the first 100 epochs. From the results of Table 2 and Table 3 and Figure 7, Figure 8 and Figure 9, it can be noted that our enhanced architecture EEViT-PAR significantly improves upon the baseline PerceiverAR [9] design. Even though the top-1 classification accuracy of EEViT-PAR is similar to the ViT [8], it is important to note that EEViT-PAR has approximately one fourth the computation complexity of the standard ViT. This is because after the first k layers (usually two or three), only the latent sized data is processed in the remaining layers of EEViT-PAR. It is remarkable that EEViT-PAR achieves similar performance as a ViT with one fourth the computational cost.

Figure 10 compares the performance of EEViT-PAR with the baseline models of PerceiverAR and ViT on the ImageNet-1K dataset. Here also, the EEViT-PAR is significantly better than the Perceiver-AR but fairly similar to the ViT. We emphasize again that the computation cost of EEViT-PAR is significantly less than the ViT.

Our second proposed enhanced architecture, EEViT-IP, further improves the execution efficiency as well as provides improved visual comprehension through its overlapped segmented PerceiverAR design.

Table 4 presents the top-1 accuracy results with other SOTA ViT models on the different benchmarks, using the basic augmentation of random crop and horizontal flips only.

Table 5 presents the top-1 accuracy results with other SOTA ViT models on the different benchmarks, using both basic augmentations (random crop and horizontal flip) as well as the random shear, translation, and solarization augmentations.

The best accuracy results in Table 4 and Table 5 are indicated in bold. As can be seen from Table 4 and Table 5, the Swin architecture and our proposed design, EEViT-IP, achieve the best results. It should be noted that the Swin architecture in Table 4 and Table 5 is the Swin-T (tiny) version with approximately 27 million parameters, as opposed to our EEViT-IP which is much smaller with eight layers and six heads only using approximately 14 million parameters. Our EEViT-IP design has better inductive bias due to the overlapping segments that compute the PerceiverAR-style attention. This is also empirically evident from the early adaptation of the model, obtaining relatively higher accuracies as shown in Figure 11, Figure 12 and Figure 13. We show the comparison of different models in their learning, depicting the test accuracy after each epoch on different datasets. The execution efficiency of EEViT-IP is approximately 12.5% of a standard ViT with similar number of layers and heads in each layer. With respect to the Swin Transformer, the EEViT-IP has 50% less computations. We analyze the efficiency of our models in the next section.

Compared to all other ViT-based architectures, the EEViT-IP extracts more visual representation information early in the training. Figure 14 compares the performance of EEViT-IP with the other SOTA models of CaiT, Twins, and Swin on the ImageNet-1K dataset with basic augmentations only. Since our focus is on the architecture of the Transformer, we restrict the comparison to basic augmentations as Transformer’s performance is sensitive to more variations in data. Even though Swin-T performs slightly better than our EEViT-IP architecture, it is extremely memory and computationally inefficient in training [12]. We benchmark the per-epoch execution time and the GPU RAM consumed for different batch sizes during ImageNet-1K training in Table 6.

As compared to Swin, where layers in Swin differ in resolution, channel size, windowing, our EEViT-IP design is structurally identical in each layer. Each layer does a Perceiver-AR style attention on overlapping neighboring segments only. This results in not only a highly efficient attention computation but also a uniform memory access pattern.

As can be seen from Figure 14, EEViT-IP extracts visual information very quickly due to its information propagation design by performing the PerceiverAR operation on overlapping segments. Thus, our EEViT-IP design is not only extremely computationally efficient, but it is also a more informative design that can learn visual features early on.

For our EEViT-IP design, to determine the best number of segments, we have done empirical studies and determined that depending upon the dataset, usually 8 or 16 segments result in higher performance (8 for smaller datasets, and 16 for larger image sizes). More segments result in smaller execution time, with a small drop in accuracy.

Table 7 shows how the segment size affects the classification accuracy in our EEViT-IP attention. For CIFAR-100, the EEViT model uses six layers with eight heads per layer and an embedding dimension of 384. The image size being fed to the model is 32 × 32 with a patch size of 4 × 4 resulting in a total of 64 tokens. For relatively smaller images, between four and eight segments usually yield the best classification accuracy. The Tiny ImageNet results in Table 7 use the FasterViT-EEViT-IP model where the attention in FasterViT is replaced by our EEViT-IP attention. Since FasterViT works on 224 × 224 size images, we rescale the Tiny ImageNet images to 224 × 224 before feeding it to the FasterViT-EEViT-IP model. A patch size of 14 × 14 is used resulting in total of 256 patches. As can be seen from Table 7, here if 16 segments are used, a higher classification accuracy is obtained.

In general, since the EEViT-IP operates on the consecutive pair of segments in each layer, and as shown in Figure 4, Figure 5 and Figure 6, the effective attention implicitly increases by one in each layer, in order to obtain implicit full attention across all patches, the number of segments need to be less or approximately equal to the number of layers. In Table 7, the FasterViT that we use has 13 layers, so in this case 16 segments produce a better result. Note that more segments result in a more efficient execution model, as the attention computation is done on smaller segments. Thus, there is a tradeoff in accuracy versus efficiency with respect to number of segments to use in our EEViT-IP design.

Using a batch size of 256, the EEViT-IP consumes a GPU RAM of 16.5 GB, with a per-epoch training time of 35 min for a model with 26.5 million parameters while training on the ImageNet-1K dataset. We show the comparison of different models in their learning, depicting the test accuracy after each epoch on different datasets. The comparison to some other recent designs, e.g., T2T, is not applicable, as it uses the standard ViT attention but a much better patch tokenization than a standard ViT. If we replace T2T’s attention with our IP attention, we obtain a more efficient T2T. Similarly, the designs in the Super Vision Transformer [25], the EViT [26], ConvNext [27], or that of DeiT [14] are not meaningful in our comparison, as they are either token reorganization approaches, hybrid CNN-Transformer architectures, or distillation approaches. Our work is related to the design of a better attention mechanism, and thus we compare our approach to recent Transformer-based designs with novel attention mechanisms.

To reduce the attention complexity, the work in [28] replaces the quadratic attention with linear attention mechanisms, such as Performer [29], Nyströmformer [30], and Linformer [20], to reduce its GPU usage. The inductive prior for image data is provided by convolutional sub-layers. While the efficiency gains are achieved, the accuracy results reported on CIFAR-10, CIFAR-100, and the Tiny ImageNet datasets are lacking those of recent ViT models.

One of the recent important works in the ViT domain is the FasterViT [17]. It uses a hierarchical attention mechanism along with carrier tokens. These dedicated tokens reside per window and facilitate both local and global receptive capability, enabling efficient cross-window communication. FasterViT replaces patch tokenization with a convolution-based patch embedding stem, i.e., a stack of convolutional layers that progressively reduce spatial resolution while increasing channel dimension. To demonstrate the effectiveness of our EEViT-IP attention mechanism, we can replace the FasterViT’s attention module with the EEViT-IP attention. Figure 15 and Figure 16 demonstrate the boosted performance on the CIFAR-10 and CIFAR-100 datasets.

6. Discussion

Even though the Transformer architecture has been amazingly successful in the NLP and vision domains due to its attention mechanism, the

O (n^{2})

complexity of attention is a concern, especially as the sequence length n becomes longer. Thus, an active area of research is to reduce this complexity without causing a drop in its predictive performance. Towards this goal, we have proposed two enhanced architectures in this paper: EEViT-PAR and EEViT-IP. The EEViT-PAR improves the predictive performance of a previously proposed PerceiverAR [9] design. While the PerceiverAR has 1/4th the computational efficiency of the standard ViT Transformer, it does experience a drop in accuracy when used in the image domain. We slightly increase the low computational cost of PerceiverAR but are able to achieve the same image classification accuracy as a standard ViT.

In the EEViT-PAR, we divide the set of input tokens into a context and a latent component like the original PerceiverAR, but propagate both the processed context and the latent to the next

k - 1

layers. The original PerceiverAR stops the propagation of context after the first layer, thereby incurring a loss in performance as the context is absorbed into the latent after the first layer. Empirically, we have determined that only a k of 2 or 3 is needed to achieve the equivalent classification performance of a ViT. Similar to the observations presented in Cait [10], the use of the CLS token in the last 2–3 layers provides majority of the gains in classification accuracy. We further incorporate the CLS token in only the last p layers like the CaiT [10] to improve the performance of our model. This does not cause any noticeable change in overall computation efficiency. The exact computation equation for the attention component of EEViT-PAR is given in Equation (37).

A = [(k - 1) \times (c^{2}) + l \times (c + l)] + l^{2} (m - k)

(37)

where c is the size of the context, l is the size of the latent, and m is the total number of layers in the Transformer. For example, if we use

m = 12

(12 layers), a sequence length of

n = 256

, and if

c = l = n / 2

, then the EEViT-PAR has approximately 29% of the attention computations as compared to a standard ViT with

n^{2}

attention computations in each layer. Note that the PerceiverAR has approximately 25% of the computations as compared to a standard ViT. Thus, with approximately a 5% increase in computations with respect to the baseline PerceiverAR, we are able to achieve comparable classification accuracy to a standard ViT, as demonstrated by our empirical results on different datasets.

Efficiency Analysis

With the goal of surpassing the classification accuracy of the ViT with even less computations than the PerceiverAR, our EEViT-IP uses the PerceiverAR concept on much smaller segments. By overlapping the segments, the learnt visual attention propagates from one segment to another in the layers of the Transformer. The exact computation in the attention component of the EEViT-IP is given by Equation (38):

A = [{(\frac{s}{2})}^{2} + (\frac{s}{2} \times s) \times \frac{2 n}{s}] \times m

(38)

where s is the segment size that the input sequence of length n is divided into, and m is the number of layers in the Transformer. As an example, if the input image is divided into 256 patches and the segment size is 32, then the EEViT-IP Transformer with 12 layers has only 13% of the attention computations as compared to a standard ViT. In addition, our EEViT-IP is 60% more efficient than the PerceiverAR.

As can be seen from Table 4 and Table 5, our EEViT-IP surpasses most SOTA ViT designs in top-1 accuracy on the different benchmarks. The Swin Transformer matches the top-1 performance of our design, but it should be noted that our design is more computation and memory efficient than the Swin Transformer. In a Swin Transformer, the image is first split into non-overlapping 4 × 4 patches, which are further grouped into 7 × 7 non-overlapping windows. The attention is calculated only locally within each 7 × 7 window, consisting of 49 tokens. In comparison, we also divide the input image into patches, but depending upon the image size, the patch size can be larger. We further divide the embedded patches into segments and perform the attention within the segment only. Because of the overlap in segments, we effectively achieve the same result as the shifted windows as in a Swin Transformer. Since our segment size can be made smaller, we can achieve higher computational efficiency than a Swin Transformer. Further, our segmentation scheme is much more flexible in terms of the requirements for image size than the Swin Transformer.

The PerceiverAR style computation on overlapped segments causes the attention information to flow from each segment to the neighboring segment in each succeeding layer, eventually reaching all segments after a few layers. We visually depict this feature of EEViT-IP in Figure 17. The receptive attention field continues to grow towards the right of each segment as computation proceeds through different layers. The attention computation is between neighboring half segments only which also achieves similar effect as in a window attention in the Swin Transformer.

While the Swin Transformer obtains a very good image recognition accuracy, it has high GPU memory requirements during training. While the windowed attention mechanism of Swin theoretically reduces the quadratic complexity of attention to linear, the intermediate attention maps and other activations within each window still need to be stored for the backward pass, resulting in high GPU memory usage. To address this issue, the Swin V2 [12] proposed “activation checkpointing” and “ZeRO” optimizer to save GPU memory. In our study, we observed that the GPU memory required for training ImageNet-1K on Swin-T with a batch size of 128 exceeds 24 GB. In comparison, our EEViT-IP requires only 7 GB of GPU RAM with a batch size of 128. Twins [13] is another popular architecture similar in design to the Swin Transformer. It separates global and local attention into two distinct modules: Locally-grouped Attention (LgA) and Global Subsampled Attention (GSA). While being efficient and producing slightly better accuracy than Swin, its drawback is its difficulty in adapting to different tasks due to the complex integration of LgA and GSA. In comparison to Twins, our EEViT-IP architecture is much simpler in its attention processing and uses the simple mechanism of computing attention over overlapping segments, making it a better candidate for VLMs. We also achieve better classification accuracy than Twins in addition to computationally being more efficient.

While the EEViT-PAR architecture that we developed is an incremental improvement over the PerceiverAR baseline combining ideas from CaiT, our EEViT-IP model is an elegant extension of the ideas of PerceiverAR leading to a highly efficient ViT without loss of contextual information. The efficiency analysis, as described in the previous section, indicates that our EEViT-IP uses approximately 13% of the attention computations as compared to a standard ViT. Further, this reduction in computation does not cause any loss in performance, and in fact provides better inductive bias as the model learns the classification with less information in earlier layers. This is further supported by the empirical results on different datasets.

Our EEViT-IP architecture is extremely suitable for use in VLMs, as it is entirely an attention-based Transformer, which is highly efficient in its attention computation without loss of context for long input patches/token sequences. We list some of the important reasons why a Transformer-only based architecture is important for use in VLMs.

The unified attention mechanism of a Transformer can seamlessly process both visual and textual modalities, whereas convolutional backbones like ConvNeXt [27] are optimized primarily for spatial hierarchies in images and lack native mechanisms for cross-modal alignment. While ConvNeXt-ViT [27] hybrids leverage convolutional inductive biases for local feature extraction, Transformer-only designs offer a more unified and modality-consistent architecture, which is critical for cross-modal alignment in VLMs.
The attention mechanism in Transformers enables explicit comprehension of long-range dependencies, which is essential for capturing semantic correspondences between image regions and textual tokens. Convolutional networks, by contrast, are inherently local and require deeper stacks or added modules to approximate this capability.
State-of-the-art VLMs (e.g., CLIP [31], BLIP [32], LLaVA [33], SigLIP [34], SigLIP 2 [35]) use Transformer-based designs, demonstrating that transformers offer superior scalability and generalization across tasks such as retrieval, captioning, and multimodal reasoning.
The scaling laws of Transformers have shown predictable improvements with larger model sizes and training data, a property that has not been consistently observed with convolutional-hybrid backbones in multimodal learning [36].
The homogeneity of a Transformer stack simplifies optimization, transfer learning, and downstream adaptation (e.g., fine-tuning with LoRA or adapters), whereas hybrid ConvNeXt-ViT models often require more complex architectural tuning.

7. Conclusions

The Transformer architecture has been instrumental in the success of LLMs that can understand and generate human text, automating tasks and perform reasoning in answering complex multistep questions. There is increased focus on enhancing the multimodal capabilities of LLMs including understanding and generating images and video data. While CNNs are more effective than Transformers in the vision domain, for creating Visual Language Models (VLMs), it is beneficial to have more efficient and high performance Vision Transformers. To address this, we have developed two efficient architectures that improve upon the performance of current SOTA ViTs.

Our first design, EEViT-PAR, improves upon the PerceiverAR by increasing its image classification accuracy significantly while keeping the computation cost nearly the same (only 30% of the execution cost of a ViT). Inspired by the information propagation capabilities of the PerceiverAR from its context component to the latent component, we further enhance our model and propose information propagating, i.e., EEViT-IP design. The key concept in this design is to create a highly efficient ViT by dividing the input patches tokens into small overlapping segments. The PerceiverAR operation is then applied by treating each segment’s first half as the context and the second half as the latent component. Since the self attention is carried out independently on each segment, this reduces the computation complexity significantly if the segment size is relatively small. The half segment overlap between segments creates the equivalent effect of full sequence attention as the information gets propagated from segment to the next overlapped segment in each succeeding layer of the transformer. The overlap segment effect also enhances the latent representation, resulting in higher recognition accuracy matching or surpassing equivalent size SOTA vision Transformers. Since our segmentation of the input sequence is completely flexible, it allows us to handle any size images without having to resize them to a particular size as needed in the Swin Transformer.

Our future work involves applying EEViT-IP to the creation of VLMs, and in enhancing the design of SOTA ViTs by replacing their attention module with our EEViT-IP attention. Some of the recent ViT architectures that can benefit from our EEViT-IP attention include Tokens-to-Token (T2T) [15] and FasterViT [17]. T2T uses full attention as in the canonical ViT and therefore, replacing its attention module with our EEViT-IP attention can improve its efficiency and performance. Similarly, FasterViT uses hierarchical windowed attention with a few global tokens. Replacing it with our EEViT-IP attention has the potential of expanding its versatility for different sized images, as well as potentially increasing its performance.

Author Contributions

Conceptualization, R.M. and K.E.; methodology, R.M.; software, R.M.; validation, S.P. and R.M.; formal analysis, R.M.; resources, R.M.; data curation, R.M.; visualization, R.M.; writing—original draft preparation, S.P., R.M. and K.E.; writing—review and editing, S.P., R.M. and K.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Source code of our work is publicly accessible via our GitHub repository: https://github.com/rigel-mahmood/EEViT (accessed on 14 September 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Canonical Vision Transformer (ViT)

The ViT [8] is an adaptation of the Transformer architecture for image classification. In the ViT, an image is first divided into fixed size patches. These patches are assembled row-wise in a linear sequence. An extra patch is added in the beginning to act as the class (CLS) token. This idea is borrowed from the BERT [37] NLP transformer where the CLS token learns the class information as it is refined through the different layers of the Transformer. The patches are then transformed to embedding vectors by a patch embedding network (similar to a token embedding network in an NLP transformer). The output from the patch embedding network

\in R^{(n + 1) \times d}

if n is the number of patches, and d is the embedding dimensionality of the Transformer.

The output from the patch embedding is added with position vectors, which then are fed to the input layer of the transformer. This is then fed into a Transformer which is identical in its architecture to an NLP Transformer. Figure A1 and Figure A2 show the architecture of the Transformer for classification purposes, as applicable to ViT. The input vectors are transformed by three learnable transformations Q, K, and V. To be able to refine the learning in each layer (similar to feature maps in a CNN), the transformer divides the attention calculation into parallel heads with each head computing the attention on a section of the embedding dimension. The output in each head is computed by further multiplying the attention A with V. The canonical transformer’s operation can be summarized by the following equations. The output produced by the i-th head

H_{i} \in R^{(n + 1) \times d_{k}}

is given as

H_{i} = Softmax (\frac{Q_{i} K_{i}^{⊤}}{\sqrt{d_{k}}}) V_{i} = A_{i} V_{i}

(A1)

where

d_{k} = \frac{d}{h}

is the dimensionality of each head,

Q_{i}, K_{i}, V_{i}

are learnable transformations of the input, each

\in R^{(n + 1) \times d_{k}}

, and the attention

A_{i} \in R^{(n + 1) \times (n + 1)}

in each head computes the similarity of each patch’s learnt representation with every other patch’s representation via an inner product. When

A_{i}

is multiplied by

V_{i}

to produce the output in an attention head, each patch’s representation gets adjusted by the similarity with all other patches.

Figure A1. Architecture of each Encoder Layer in a ViT. The Attention block is further divided into Heads.

Figure A2. Architecture of Vision Transformer. Input x contains an extra patch for Classification purposes.

The output in each transformer layer is obtained by concatenating the output of all heads and transformed further by a projection matrix

W^{o} \in R^{d \times d}

as

{layer}_{i} = Concatenate (H_{0}, H_{1}, \dots, H_{h - 1}) W^{o}

(A2)

The output from each layer thus has the same dimensionality as the input, i.e., the output

\in R^{(n + 1) \times d}

is a learnt transformation of the input with each input vector transformed by the relationship to other vectors in the input sequence via the attention mechanism. A classification head is added to the last layer, which uses the first vector corresponding to the CLS token to perform the classification decision.

out = classifier [{layer}_{p - 1} ({layer}_{p - 2} (\dots {layer}_{0} (embedding (x) + P E (x)))) [0]]

(A3)

Skip connections and layer normalization are also used in each layer to stabilize the training of the transformer as shown in Figure A1. As can be seen, the ViT is very similar to the NLP Transformer for classification purposes, i.e., BERT. The attention complexity of the ViT is

O (n^{2})

.

Appendix B. Pseudocode for EEViT-IP Attention

Figure A3. Pseudocode for EEViT-IP Attention.

References

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. (NeurIPS) 2017, 30, 5998–6008. [Google Scholar]
Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training Language Models to Follow Instructions with Human Feedback. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 27730–27744. [Google Scholar]
Gemini Team; Anil, R.; Borgeaud, S.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K.; et al. Gemini: A Family of Highly Capable Multimodal Models. arXiv 2023, arXiv:2312.11805. [Google Scholar] [CrossRef]
Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
Lu, H.; Liu, W.; Zhang, B.; Wang, B.; Dong, K.; Liu, B.; Sun, J.; Ren, T.; Li, Z.; Yang, H.; et al. DeepSeek-VL: Towards Real-World Vision-Language Understanding. arXiv 2024, arXiv:2403.05525. [Google Scholar]
Anthropic. System Card: Claude Opus4 & Claude Sonnet4; Anthropic Technical Report; Anthropic: San Francisco, CA, USA, 2025. [Google Scholar]
Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen Technical Report. arXiv 2023, arXiv:2309.16609. [Google Scholar] [CrossRef]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021. [Google Scholar]
Hawthorne, C.; Jaegle, A.; Cangea, C.; Borgeaud, S.; Nash, C.; Malinowski, M.; Dieleman, S.; Vinyals, O.; Botvinick, M.; Simon, I.; et al. General-purpose, long-context autoregressive modeling with Perceiver AR. In Proceedings of the 39th International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., Sabato, S., Eds.; Proceedings of Machine Learning Research. Volume 162, pp. 8535–8558. [Google Scholar]
Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve, G.; Joulin, A. Going Deeper with Image Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 32–42. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
Liu, Z.; Hu, H.; Lin, Y.; Xie, Z.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer V2: Scaling Up Capacity and Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12009–12019. [Google Scholar]
Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the Design of Spatial Attention in Vision Transformers. Adv. Neural Inf. Process. Syst. (NeurIPS) 2021, 34, 9355–9366. [Google Scholar]
Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training Data-Efficient Image Transformers & Distillation through Attention. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021. [Google Scholar]
Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Tay, F.E.H.; Feng, J.; Yan, S. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 558–567. [Google Scholar]
Xie, Z.; Lin, Y.; Zhang, Z.; Cao, Y.; Lin, S.; Hu, H. CeiT: Cross-Attention Image Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10794–10803. [Google Scholar]
Chen, X.; Hsieh, C.J.; Dai, X.; Chen, D.; Chen, Y.; Yuan, L.; Liu, Z.; Vasconcelos, N. FasterViT: Fast Vision Transformers with Hierarchical Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 16122–16132. [Google Scholar]
Wu, Y.; Gao, Y.; Liu, Z.; Hu, H.; Lin, S. UniFormer: Unified Transformer for Efficient Vision Modeling. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 271–288. [Google Scholar]
Dai, Z.; Liu, H.; Zhou, H.; Le, Q.V. CoAtNet: Marrying Convolution and Attention for All Data Sizes. arXiv 2021, arXiv:2106.04803. [Google Scholar]
Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-Attention with Linear Complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar] [CrossRef]
Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar] [CrossRef]
Chen, Y.; Qian, S.; Tang, H.; Lai, X.; Liu, Z.; Han, S.; Jia, J. LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models. In Proceedings of the 12th International Conference on Learning Representations (ICLR), Vienna, Austria, 18–22 May 2024. [Google Scholar]
Mahmood, K.; Huang, S. Enhanced Computationally Efficient Long LoRA Inspired Perceiver Architectures for Auto-Regressive Language Modeling. arXiv 2024, arXiv:2412.06106. [Google Scholar]
Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6023–6032. [Google Scholar]
Lin, M.; Chen, M.; Zhang, Y.; Shen, C.; Ji, R.; Cao, L. Super Vision Transformer. Int. J. Comput. Vis. 2023, 131, 3136–3151. [Google Scholar] [CrossRef]
Liang, Y. EViT: Expediting Vision Transformers via Token Reorganizations. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
Jeevan, P.; Sethi, A. Convolutional Xformers for Vision. arXiv 2022, arXiv:2201.10271. [Google Scholar] [CrossRef]
Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, Ł.; et al. Rethinking Attention with Performers. arXiv 2020, arXiv:2009.14794. [Google Scholar]
Xiong, Y.; Zeng, Z.; Chakraborty, R.; Tan, M.; Fung, G.; Li, Y.; Singh, V. Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention. arXiv 2021, arXiv:2102.03902. [Google Scholar] [CrossRef]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In Proceedings of the 39th International Conference on Machine Learning (ICML), PMLR, Baltimore, MD, USA, 17–23 July 2022; Volume 162, pp. 12888–12900. [Google Scholar]
Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. LLaVA: Large Language and Vision Assistant via Visual Instruction Tuning. arXiv 2023, arXiv:2304.08485. [Google Scholar]
Zhai, X.; Mustafa, B.; Kolesnikov, A.; Beyer, L. SigLIP: Sigmoid Loss for Language-Image Pre-Training. arXiv 2023, arXiv:2303.15343. [Google Scholar]
Tschannen, M.; Gritsenko, A.; Wang, X.; Naeem, M.F.; Alabdulmohsin, I.; Parthasarathy, N.; Evans, T.; Beyer, L.; Xia, Y.; Mustafa, B.; et al. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. arXiv 2025, arXiv:2502.14786. [Google Scholar]
Henighan, T.; Kaplan, J.; Katz, M.; Chen, M.; Hesse, C.; Jackson, J.; Jun, H.; Brown, T.B.; Dhariwal, P.; Gray, S.; et al. Scaling laws for autoregressive generative modeling. arXiv 2020, arXiv:2010.14701. [Google Scholar] [CrossRef]
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]

Figure 1. Architecture of the PerceiverAR [9], used as baseline.

Figure 2. EEViT-PAR architecture.

Figure 3. EEViT-IP architecture.

Figure 4. The green rectangle indicates the normal attention block, The pink boundaries indicate the (IP Block) where PerceiverAR attention is applied between two overlapping half segments. As an example, PerceiverAR attention is computed between the [S3–S4] and [S4–S5] segments (emphasized by the dark blue rectangle in rows 4 and 5).

Figure 5. The increase in the attention receptive field indicated by the purple colored rectangles in the second layer of EEViT-IP architecture. The purple blocks are not calculated but implicitly contain information because of the IP block in the previous layer. The attention in segments [S4–S5] now includes information from S3, as S4 from the previous layer already has information from S3.

Figure 6. After enough layers, the entire information from all previous segments is implicitly available in the attention in the IP blocks (indicated by purple blocks). The attention calculation is done only on the two blocks near the diagonal (indicated by the pink outer box). Continuing from the previous example, after enough layers, the S5 segment now contains implicitly information from all previous segments.

Figure 7. Model accuracies for ViT, PerceiverAR, and EEViT-PAR (ours) for up to 100 epochs on CIFAR-10.

Figure 8. Model accuracies for ViT, PerceiverAR, and EEViT-PAR (ours) for up to 100 epochs on CIFAR-100.

Figure 9. Model accuracies for ViT, PerceiverAR, and EEViT-PAR (ours) for up to 100 epochs on Tiny ImageNet.

Figure 10. Model accuracies for ViT, PerceiverAR, and EEViT-PAR (ours) for up to 50 epochs on ImageNet, in intervals of 10 epochs. Only basic augmentations of random crop and horizontal flip are used.

Figure 11. Model accuracies for CaiT, Swin, Twins, and EEViT-IP (ours) for up to 200 epochs on CIFAR-10.

Figure 12. Model accuracies for CaiT, Swin, Twins, and EEViT-IP (ours) for up to 200 epochs on CIFAR-100.

Figure 13. Model accuracies for CaiT, PerceiverAR, Twins, and EEViT-IP (ours) for up to 100 epochs on ImageNet.

Figure 14. Model accuracies for CaiT, Twins, and EEViT-IP (ours) for up to 50 epochs on ImageNet, in intervals of 10 epochs.

Figure 15. Model accuracies for FasterViT and FasterViT-EEViT-IP (ours) for up to 200 epochs on CIFAR-10.

Figure 16. Model accuracies for FasterViT and FasterViT-EEViT-IP (ours) for up to 200 epochs on CIFAR-100.

Figure 17. Information propagation in EEViT-PAR as PerceiverAR style Attention is computed on pair of overlapping segments. The Information in

S 0_{A}

eventually reaches

S 3_{B}

after five layers, achieving the effect of full attention (attention information propagation from

S 0_{A}

is shown in the highlighted purple color). Different color boxes at the bottom indicate the pair of half segments between which the attention is being computed.

Figure 17. Information propagation in EEViT-PAR as PerceiverAR style Attention is computed on pair of overlapping segments. The Information in

S 0_{A}

eventually reaches

S 3_{B}

after five layers, achieving the effect of full attention (attention information propagation from

S 0_{A}

is shown in the highlighted purple color). Different color boxes at the bottom indicate the pair of half segments between which the attention is being computed.

Table 1. Characteristics of different datasets used in our evaluation of different ViT architectures. All datasets use color images.

Dataset	Image Size	Number of Training/Test Images	Number of Classes
CIFAR-10	3 × 32 × 32	50,000/10,000	10
CIFAR-100	3 × 32 × 32	50,000/10,000	100
SVHN	3 × 32 × 32	73,257/26,032	10
Tiny ImageNet	3 × 64 × 64	100,000/10,000	200
ImageNet-1K	3 × 224 × 224	1,281,167/50,000	1000

Table 2. Comparison of Top-1 Accuracy between EEViT, PerceiverAR and ViT models on different datasets using basic augmentations (random crop and horizontal flip).

Dataset	PerceiverAR (Baseline)	ViT	EEViT-PAR (Ours)
CIFAR-10	74.64	82.92	82.33
CIFAR-100	47.19	55.55	55.88
SVHN	94.74	95.90	95.82
Tiny ImageNet	31.59	39.58	41.07
ImageNet-1K	47.72	59.82	61.40

Table 3. Comparison of Top-1 Accuracy between EEViT, PerceiverAR and ViT models on different datasets using basic as well as random shear, translation, solarization augmentations.

Dataset	PerceiverAR (Baseline)	ViT	EEViT-PAR (Ours)
CIFAR-10	79.99	87.37	87.22
CIFAR-100	53.00	62.09	61.76
SVHN	96.59	97.62	97.38
Tiny ImageNet	38.53	46.85	45.72
ImageNet-1K	56.34	64.42	63.92

Table 4. Comparison of Top-1 Accuracy between our proposed architecture and other SOTA models on different datasets using basic augmentations (random crop and horizontal flip).

Dataset	PerceiverAR (Baseline)	ViT	EEViT-PAR (Ours)	CaiT	Twins	Swin	EEViT-IP (Ours)
CIFAR-10	74.64	82.92	82.33	81.51	81.28	89.19	88.05
CIFAR-100	47.19	55.55	55.88	56.41	48.73	60.62	61.46
SVHN	94.75	95.82	95.93	95.97	94.89	96.24	96.36
Tiny ImageNet	25.78	40.89	41.07	28.48	29.65	46.64	45.11

Table 5. Comparison of Top-1 Accuracy between our proposed architecture and other SOTA models on different datasets using both basic augmentations (random crop and horizonal flip) and random shear, translation, solarization augmentations.

Dataset	PerceiverAR (Baseline)	ViT	EEViT-PAR (Ours)	CaiT	Twins	Swin	EEViT-IP (Ours)
CIFAR-10	80.23	87.48	87.24	83.86	85.12	89.19	91.73
CIFAR-100	52.78	62.72	62.89	59.90	56.00	60.62	66.04
SVHN	96.59	97.61	97.13	96.24	96.74	97.21	97.74
Tiny ImageNet	28.23	46.85	45.72	33.80	35.78	52.14	51.09

Table 6. GPU RAM and execution time requirements comparing Swin-T (27 M parameters) with our EEViT-IP (26.5 M parameters) on an RTX 4090 GPU system with 24 GB GPU RAM.

	Swin-T (Batch = 16)	Swin-T (Batch = 32)	EEViT-IP (Batch = 16)	EEViT-IP (Batch = 32)
GPU RAM	10.48 GB	19.57 GB	2.47 GB	3.52 GB
Time/Epoch	4.3 h	4.1 h	1.7 h	0.8 h

Table 7. Comparison of Top-1 Accuracy for different number of segments in our EEViT-IP design using both basic augmentations (random crop and horizontal flip) and random shear, translation, solarization augmentations.

Number of Segments	CIFAR-100 Accuracy (%)	Tiny ImageNet Accuracy (%)
4	66.04	59.6
8	67.9	60.8
16	63.2	61.2

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Mahmood, R.; Patel, S.; Elleithy, K. EEViT: Efficient Enhanced Vision Transformer Architectures with Information Propagation and Improved Inductive Bias. AI 2025, 6, 233. https://doi.org/10.3390/ai6090233

AMA Style

Mahmood R, Patel S, Elleithy K. EEViT: Efficient Enhanced Vision Transformer Architectures with Information Propagation and Improved Inductive Bias. AI. 2025; 6(9):233. https://doi.org/10.3390/ai6090233

Chicago/Turabian Style

Mahmood, Rigel, Sarosh Patel, and Khaled Elleithy. 2025. "EEViT: Efficient Enhanced Vision Transformer Architectures with Information Propagation and Improved Inductive Bias" AI 6, no. 9: 233. https://doi.org/10.3390/ai6090233

APA Style

Mahmood, R., Patel, S., & Elleithy, K. (2025). EEViT: Efficient Enhanced Vision Transformer Architectures with Information Propagation and Improved Inductive Bias. AI, 6(9), 233. https://doi.org/10.3390/ai6090233

Article Menu

EEViT: Efficient Enhanced Vision Transformer Architectures with Information Propagation and Improved Inductive Bias

Abstract

1. Introduction

2. Related Work

3. Preliminaries

PerceiverAR Architecture

4. Proposed Enhanced Efficient ViT (EEViT) Architectures

4.1. Enhanced Efficient ViT Based on Perceiver Attention (EEViT-PAR)

4.2. Enhanced Efficient ViT Based on Information Propagation (EEViT-IP)

5. Results

6. Discussion

Efficiency Analysis

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Appendix A. Canonical Vision Transformer (ViT)

Appendix B. Pseudocode for EEViT-IP Attention

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI