1. Introduction
With the immense success of the Transformer architecture in Large Language Models (LLMs), Vision Transformers have emerged as a powerful alternative to Convolutional Neural Networks (CNNs) for image recognition. The Vision Transformer (ViT) was proposed in the seminal work in [1]. There, an input image is divided into equal-sized patches, which are then fed to a Transformer as a sequence of tokens, much like the word tokens in a Natural Language Processing (NLP) Transformer. The work in [1] demonstrated that, when pretrained on large image datasets, a ViT can match or exceed the performance of a CNN for image recognition. While ViTs lack the inductive biases for vision that are built into CNNs, their ability to model long-range dependencies through self-attention is advantageous. In addition, Transformers possess larger capacity than CNNs [1,2]. The recent interest in Vision Language Models (VLMs) also favors their use, since it allows a unified architecture for both text and vision.
Despite these advantages, the transition from local convolutions in a CNN to global token processing in a ViT introduces scalability constraints. While treating images as sequences provides the flexibility required for high image recognition performance, it entails substantial computational and memory overhead. This is primarily caused by the self-attention mechanism, which scales quadratically with the number of tokens. Consequently, as some vision tasks involve higher image resolutions, fine-grained spatial representations, or longer video sequences, the reliance on global attention poses a fundamental challenge. This limits efficient and effective structured visual modeling, necessitating more optimized ViT architectural approaches.
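To make the quadratic growth concrete, the following minimal sketch counts the tokens and the entries of a single attention map at several image resolutions. The patch size of 16 and the helper name attention_cost are illustrative assumptions, not settings taken from this paper.

```python
def attention_cost(image_size: int, patch_size: int = 16) -> tuple[int, int]:
    """Return (token count n, entries in one n x n attention map)."""
    n = (image_size // patch_size) ** 2   # non-overlapping square patches
    return n, n * n


for size in (224, 448, 896):
    n, entries = attention_cost(size)
    print(f"{size}x{size} image: {n} tokens, {entries:,} attention entries per head")
```

Doubling the image side multiplies the token count by 4 and the attention map by 16, which is the scaling behavior that motivates the efficient designs discussed next.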
A significant focus in recent research has been on the development of efficient attention mechanisms and Transformer architectures that reduce the O(n²) attention complexity, where n is the number of tokens, to O(n) or near-linear complexity. Notable works in the NLP domain include Linformer [
3], Longformer [
4], Performer [
5], and linear attention [
6], among others. Linformer projects the keys and values along the sequence dimension to reduce their length. While this achieves linear time complexity in theory, the compression of the context degrades performance. Longformer combines sliding-window attention with a small number of global attention tokens to improve efficiency, but still falls short of full global attention in accuracy. Performer introduced a kernel-based approximation of attention. While it theoretically achieves linear complexity, the FAVOR+ algorithm that implements its kernel method adds considerable overhead. Linear attention also takes the kernel approach, but uses learnable kernel feature maps to approximate attention. It often underperforms standard full attention because the feature map used to approximate attention is typically low-rank, which hinders the modeling of complex spatial or semantic information.
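As a rough illustration of the sequence-dimension projection described above, the sketch below contrasts full attention with a Linformer-style variant in NumPy. It is a single-head simplification under our own assumptions: the projection matrices E and F are random rather than learned, and the sizes n, d, and k are arbitrary illustrative values.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def full_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # (n, n) score matrix
    return softmax(scores) @ V


def linformer_attention(Q, K, V, E, F):
    # E and F have shape (k, n): they compress keys and values from n rows to k rows.
    K_proj, V_proj = E @ K, F @ V                  # (k, d) each
    scores = Q @ K_proj.T / np.sqrt(Q.shape[-1])   # (n, k) instead of (n, n)
    return softmax(scores) @ V_proj


rng = np.random.default_rng(0)
n, d, k = 256, 64, 64                              # illustrative sizes only
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))
out = linformer_attention(Q, K, V, E, F)
print(out.shape)
```

The score matrix shrinks from n × n to n × k, but every query now attends to k mixtures of tokens rather than to the individual tokens, which is the context compression mentioned above.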
The architectural approaches developed to mitigate the quadratic computational complexity of full attention in vision tasks have employed hierarchical, multi-scale designs. These achieve efficiency by focusing on local or spatially reduced attention, which also introduces locality biases similar to CNNs. This leads to improved performance as well as better scalability for high-resolution images. Recent effective approaches use hierarchical representations and progressive tokenization strategies that gradually aggregate local information into higher level tokens. Tokens-to-Token (T2T) [
7] is an important work that recursively merges neighboring tokens, enabling the model to capture local structure while reducing token counts. Despite its effectiveness, the T2T design uses standard self-attention in its last stages, which is less efficient, and further lacks explicit inductive bias for directional or structured information flow. As a result, such models may underutilize the sequential nature of token aggregation and fail to fully exploit structured propagation patterns that could improve representational efficiency.
Recent advances in structured attention mechanisms have demonstrated that constraining information flow can lead to more efficient and effective long-range dependency modeling. Examples of architectures that utilize directional or propagative attention include Swin [
8], CrossViT [
9], ECP [
10] and EEViT [
11]. Such designs replace unrestricted global interactions with carefully orchestrated information propagation across tokens. These approaches suggest that global context need not be modeled through dense all-to-all attention, but can instead emerge from progressive, structured communication patterns that are both computationally efficient and semantically meaningful.
Inspired by some of the recent work in efficient attention, we propose a dual-stream segmented information propagation architecture with significant improvements over the recent EEViT design. Specifically, attention heads are divided into two complementary streams: one performing left-to-right segmented propagation and the other performing right-to-left propagation. This design enables efficient bidirectional context aggregation in a Perceiver-style manner, allowing long-range dependencies to be captured through structured information flow rather than full global attention. Further, inspired by the FasterViT’s [
12] idea of carrier tokens, we inject summary, or class, tokens in each segment. Thus, instead of relying on a single global class token, each segment is associated with its own class token, allowing local semantic summaries to be formed at intermediate stages. Second-level cross-attention is performed between image tokens and these segment-level class tokens, enabling hierarchical semantic fusion and gradual comprehension of the global context. This design encourages the model to reason over structured semantic units rather than flattened token sequences.
Extensive experimental results demonstrate that our proposed dual-stream attention achieves superior performance compared to conventional Vision Transformer designs, while significantly improving the structure and efficiency of information flow. Our design can also act as a drop-in replacement for the attention module in current SOTA ViTs. We demonstrate this by replacing the full attention modules in the T2T and FasterViT architectures, which yields higher compute efficiency while maintaining equivalent or slightly improved classification accuracy. Thus, by combining segmented bidirectional information propagation, segment-level summary (class) tokens, and both self-attention and segment-level cross-attention, our approach provides an efficient linear-complexity attention design. While our technique can be applied to video processing, we currently focus on efficient image recognition. Our main contributions are summarized as follows:
The Attention Module: We propose a dual-stream information propagating attention module that splits attention heads into parallel left-to-right and right-to-left segmented streams. This approach achieves both enhanced context without loss of information and efficient long-range dependency modeling with linear complexity.
Segment-Level Class Token Strategy: We introduce a hierarchical classification framework that assigns a dedicated class token to each spatial segment. This approach replaces the traditional single-class-token bottleneck, allowing the model to capture more granular semantic information across regions in the image.
Hierarchical Semantic Fusion: We implement a structured aggregation mechanism that utilizes cross-segment attention to fuse segment-level class tokens. This enhances global representation and improves inductive bias for hierarchical visual representation learning.
Architectural Versatility and Performance: We demonstrate that the proposed attention module can be used as a versatile “plug-and-play” replacement for standard attention in established architectures such as Tokens-to-Token (T2T), FasterViT and PVT [13]. Extensive experiments on image datasets show that it achieves competitive or superior performance with significantly improved computational efficiency.
2. Related Work
The pioneering work of Dosovitskiy et al. introduced the Vision Transformer (ViT) [1] by adapting a language-domain Transformer to the vision domain. It simply changed the input token embeddings of the Transformer so that they are obtained from image patches rather than text tokens. Since the primary goal of the ViT is classification, [1] adds an extra learnable, randomly initialized token to the input sequence to act as the class (CLS) token. This idea was borrowed from BERT [14], an NLP Transformer that uses a CLS token for sequence-level classification tasks. The ViT showed impressive results when pretrained on large image datasets, matching state-of-the-art CNNs.
To improve the performance of the attention mechanism in the ViT, several ideas have been proposed, ranging from architectural improvements to better structural decomposition of the input image into smaller, carefully crafted windows. Important works on architectural enhancements include the Perceiver class of designs [10,15,16,17] and the CaiT [18] Transformer. The fundamental idea in a Perceiver is to divide the input sequence into two parts, referred to as the context and the latent. In the first layer of the Transformer, attention is computed between the latent component (as queries) and the entire input sequence (as keys and values), resulting in an output that is only the size of the latent. All remaining layers perform normal self-attention, with inputs and outputs the size of the latent. This greatly reduces the attention computation, especially when the Transformer is deep.
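The first-layer operation described above can be sketched as follows. This is a single-head NumPy illustration under our own assumptions: the latent is taken to be the last latent_len tokens of the sequence (as in PerceiverAR), no causal mask is applied, and all sizes and weight matrices are arbitrary.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def perceiver_first_layer(X, latent_len, Wq, Wk, Wv):
    """Queries come from the latent part only; keys and values from the whole input."""
    latent = X[-latent_len:]                   # assumption: latent = trailing tokens
    Q = latent @ Wq                            # (latent_len, d)
    K, V = X @ Wk, X @ Wv                      # (n, d) each
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (latent_len, n)
    return softmax(scores) @ V                 # output has only latent_len rows


rng = np.random.default_rng(0)
n, latent_len, d = 196, 49, 64                 # illustrative sizes only
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
Z = perceiver_first_layer(X, latent_len, Wq, Wk, Wv)
print(Z.shape)
```

All later layers would then operate on the latent_len × d output only, which is where the savings for deep models come from.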
CaiT [
18] processes image patches without a class token in the first several layers, then adds a class token with the class attention in the last few layers to aggregate global information. This allows the Transformer to learn rich local and mid-level features in early layers without being distracted by the classification goal. The later layers can then better focus on the classification. While CaiT improves classification accuracy as compared to a standard ViT, it does not improve the compute efficiency. To improve the efficiency as well as the image classification performance, important works that focus on two-dimensional structural decomposition of the input image into smaller sizes include Swin [
8], Swin-V2 [
19], Twins [
20], Tokens-to-Token [
7], FasterViT [
12] and PVT [
13] designs, among others.
Swin introduces a hierarchical Vision Transformer that performs self-attention within local windows, while enabling information exchange across windows via a shifted window mechanism. While Swin produces very good classification results, its reliance on fixed window sizes can limit how well it adapts to different image sizes. Further, the shifted window strategy adds implementation complexity and increases memory usage during training. Swin-V2 [
19] extends Swin by improving training stability and scalability, particularly for large models and high-resolution inputs. It replaces the dot-product attention with cosine similarity attention. It also introduces log-spaced continuous relative position bias, enabling better generalization to varying image sizes. Despite these advantages, Swin-V2 inherits the windowing locality constraints of Swin, and remains computationally expensive. Another similar approach in the Twins Transformer [
20] alternates local self-attention with reduced-cost global attention to capture both fine details and overall context. Its global component uses spatially reduced attention, where keys and values are down-projected to keep computation low. Unlike Swin, Twins injects explicit global context early and avoids window shifting, leading to a simpler design and better global modeling. The slight drawback of the Twins design is that the global attention (even when reduced) is still less efficient than windowed attention, and there may be loss of detail for high-resolution images due to compression of global context.
The Tokens-to-Token (T2T) [
7] design focuses on improving the tokenization process itself by progressively aggregating neighboring patches into more informative tokens before applying global self-attention. This hierarchical token construction allows the model to better capture local structures and spatial relationships that are often lost in standard patch embeddings in a ViT. While T2T improves representation quality and performance on classification benchmarks, the use of standard full attention in the last stage layers is inefficient. FasterViT [
12] aims to improve both throughput and accuracy by introducing hierarchical attention (HAT), where attention is computed within local windows and selectively across windows using lightweight global tokens. This approach balances local feature extraction with efficient global context modeling while reducing memory access and attention cost. FasterViT demonstrates strong performance; however, its hierarchical attention mechanism is more complex than standard attention, and performance gains may be less pronounced for low-resolution or smaller images.
Pyramid Vision Transformer (PVT) [
13] introduces a hierarchical architecture that progressively reduces spatial resolution while increasing feature dimensionality, closely resembling CNN pyramids. It employs spatially reduced attention to lower the computational cost of attention, enabling efficient processing of large images. PVT effectively bridges the gap between CNNs and Transformers for dense vision tasks. Its main limitation lies in the aggressive spatial reduction, which can degrade fine-grained spatial information and affect performance on tasks requiring precise localization. PVT-V2 [
21] introduces linear spatial-reduction attention (linear SRA), which uses average pooling to reduce the feature map to a fixed size, making the computational cost linear with respect to image resolution. It also uses overlapping patches in the input embedding to better preserve local spatial information. While linear SRA is fast, fixed-size average pooling is inherently lossier than the original SRA design. Further, the design also uses depthwise convolutions, making it a hybrid CNN–Transformer architecture rather than a primarily Transformer-based design.
While our focus in this paper is on Transformer-based designs for vision, we briefly mention a few other related designs, including hybrid CNN–Transformer models. For example, LeViT [
22] is designed for faster inference by combining convolutional layers with attention mechanisms in a hierarchical architecture. It employs convolutions and downsampling in early stages for feature extraction, and lightweight attention blocks in later stages to model global interactions efficiently. Another popular design, DeiT [
23], focuses on improving the data and training efficiency of Vision Transformers, demonstrating that competitive performance can be achieved without pretraining on large datasets. It introduces a distillation strategy that leverages a CNN teacher to guide the Transformer training through an additional distillation token. While DeiT significantly reduces training data requirements and improves convergence, it does not address the inherent quadratic complexity of self-attention or introduce structural modifications for efficient high-resolution image processing.
In order to reduce the complexity of self-attention and enable VLMs to process both text and images in a uniform manner, a highly efficient approach has been recently proposed in [
10,
11]. Here, the input sequence is divided into small overlapping segments where a Perceiver operation is performed on the consecutive pair of segments. The Perceiver operation propagates the information from the left segment (context) in the pair to the right segment (latent) in a very effective manner. This information flow in the EEViT-IP [
11] design demonstrates that, although attention is only computed locally, the model implicitly accumulates the equivalent of full attention after a few Transformer layers.
In this paper, we enhance this information propagation by introducing dual-stream information flow by employing forward as well as backward Perceiver-style attention in pairs of overlapping segments. The information from the dual streams is combined in a learnable way using projections. Further, we introduce a summary token with each segment, and perform second-level cross-attention to gradually accumulate better global context. Our attention design is extremely uniform and does not rely on carefully crafted windows. Our design is highly efficient and increases inductive bias for vision as it gradually builds the local structure into a global context. For completeness, before we present our architecture, we briefly describe some of the preliminary background needed in
Section 3.
4. Proposed Attention
The key ideas in our attention build on several recent SOTA architectures. The goal is to achieve a highly efficient attention design without any information loss, while potentially enhancing the contextual information. To reduce the O(n²) complexity of attention, it is important that the sequence of length n be processed in smaller partitions and in a structured manner. One such successful approach was developed in [
10] for NLP, and recently adapted for Vision Transformers in [
11]. Here, the input sequence (patch embeddings from an image) is divided into disjoint segments, each of size s. To allow for information propagation between segments, PerceiverAR (PAR) attention is carried out on overlapping pairs of consecutive segments. The query Q is computed on the second segment in a pair, while the key K and the value V are computed on the entire segment pair, whose length is 2s. Thus, the attention computation for each pair of segments costs O(s · 2s) = O(2s²).
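The pairwise scheme can be sketched in a few lines of NumPy. This is a minimal single-head illustration under our own assumptions, not the reference implementation: Q, K, and V are assumed to be already projected, the sequence length is assumed to be divisible by the segment size, and the function names are hypothetical.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def lr_segment_attention(Q, K, V, s):
    """Left-to-right segmented attention; Q, K, V have shape (n, d), n divisible by s."""
    n, d = Q.shape
    out = np.empty_like(V)
    for start in range(0, n, s):
        ctx = max(start - s, 0)                       # previous segment, if there is one
        q = Q[start:start + s]                        # queries: current segment only
        k, v = K[ctx:start + s], V[ctx:start + s]     # keys/values: the segment pair
        out[start:start + s] = softmax(q @ k.T / np.sqrt(d)) @ v
    return out


def pair_score_entries(n, s):
    """Score entries: s*s for the first segment, s*2s for each later segment."""
    return s * s + (n // s - 1) * s * 2 * s


rng = np.random.default_rng(0)
n, d, s = 16, 8, 4                                    # illustrative sizes only
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = lr_segment_attention(Q, K, V, s)
print(out.shape, pair_score_entries(196, 14), 196 * 196)
```

For a hypothetical sequence of 196 tokens with segments of 14, the pairwise scheme evaluates 5292 score entries per head instead of 38416, and the count grows linearly in n for a fixed s.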
The idea of dividing the attention computation into overlapping segments has also been proposed for a different NLP objective in LongLoRA [
24]. LongLoRA terms its approach shifted sparse attention (S²-Attn). This technique partitions the sequence into distinct groups that compute attention independently. To ensure information flow between these groups, the attention heads are split into two halves; in the second half, tokens are shifted by half the group size, which allows context to be shared between neighboring groups. As the information flows through successive layers of the Transformer, the information exchange expands to all the segments, implicitly approximating full attention. LongLoRA combines this attention with Low-Rank Adaptation (LoRA) fine-tuning to extend the context length of pretrained Large Language Models (LLMs) efficiently.
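The head-wise shift can be illustrated with a toy example that tracks token indices instead of features. The sketch below is our own simplification (the function name, the tensor layout, and the tiny sizes are assumptions); it only shows how the groups seen by the two halves of the heads differ.

```python
import numpy as np


def shift_half_heads(x, group_size):
    """x: (n_tokens, n_heads, head_dim). Roll the second half of the heads
    by half a group along the token axis; the first half is left unchanged."""
    shifted = x.copy()
    half = x.shape[1] // 2
    shifted[:, half:] = np.roll(x[:, half:], -(group_size // 2), axis=0)
    return shifted


n, h, group_size = 8, 2, 4                                 # toy sizes
ids = np.tile(np.arange(n)[:, None, None], (1, h, 1))      # token index as the "feature"
shifted = shift_half_heads(ids, group_size)
groups_head0 = shifted[:, 0, 0].reshape(-1, group_size)    # groups seen by head 0
groups_head1 = shifted[:, 1, 0].reshape(-1, group_size)    # groups seen by head 1
print(groups_head0.tolist())
print(groups_head1.tolist())
```

Head 0 attends within the groups {0..3} and {4..7}, while head 1 attends within {2..5} and {6, 7, 0, 1}, so each shifted group straddles a boundary of the unshifted partition. After attention, the shifted heads are rolled back to restore the original token order.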
In our
attention, we combine and further enhance the concepts from both LongLoRA and EEViT-IP. We use overlapping segmented attention, with the two halves of the heads carrying information in opposite directions to extract better context. In half of the heads, PerceiverAR attention is applied to overlapping pairs of segments such that information propagates from left to right. The remaining half of the heads reverse the PerceiverAR operation such that the query
Q is computed on the first segment in the pair, and the key
K and the value
V are computed on the entire pair. This allows the information to flow from right to left.
Figures 3 and 4 depict this dual-stream PerceiverAR-based processing in our attention mechanism.
The attention computations taking place in the two streams are given by Equations (17)–(32). In the left-to-right stream, the computation in the first segment differs from that in the other segments: since there is no preceding segment to complete a pair, it performs regular self-attention over its own tokens. The computation in all other segments is a PerceiverAR operation, with Q computed on the second segment in the pair and K and V computed on the entire pair. If s is the segment size, then the attention equations governing the first segment in the left-to-right stream are described below.
4.1. Left-to-Right Attention Stream (LR)
4.1.1. Segment Attention
4.1.2. Segments – Attention
The PerceiverAR operation is performed on two consecutive segments. The operations producing the output corresponding to segment
i in the left-to-right stream are defined as follows, where
.
4.2. Right-to-Left Attention Stream (RL)
In the right-to-left stream, the computation in the final segment differs from the others because there is no subsequent segment to complete the PerceiverAR pair. Therefore, the final segment performs standard self-attention. All other segments perform Perceiver-style attention where Q is computed on the left segment of the pair while K and V are computed on the entire pair, enabling information propagation from right to left.
4.2.1. Segment Attention
4.2.2. Segments – Attention
The PerceiverAR operation is performed on two consecutive segments. The operations producing the output corresponding to segment
i in the right-to-left stream are defined below, where
.
The output length of each attention operation is equal to the segment size s. For all overlapping segment pairs in the right-to-left stream, the query matrix is computed on the left segment of the pair, while the key and value matrices are computed on the entire pair.
Since the patches representing the segments on the right of a given segment are different from those on the left, we combine the visual context obtained in the left-to-right (LR) and right-to-left (RL) attention streams across corresponding halves of the attention heads in a learnable manner.
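A compact sketch of the two streams and their combination is given below. It is an illustration under stated assumptions rather than the exact formulation of the equations: we assume that the per-head outputs of both streams are concatenated and mixed by a single output projection Wo (one plausible form of learnable combination), and all names, shapes, and sizes are hypothetical.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def segment_stream(Q, K, V, s, direction):
    """Segmented attention for one head; direction is 'lr' or 'rl'."""
    n, d = Q.shape
    out = np.empty_like(V)
    for start in range(0, n, s):
        end = start + s
        if direction == "lr":                 # pair = (previous segment, current segment)
            lo, hi = max(start - s, 0), end
        else:                                 # pair = (current segment, next segment)
            lo, hi = start, min(end + s, n)
        scores = Q[start:end] @ K[lo:hi].T / np.sqrt(d)
        out[start:end] = softmax(scores) @ V[lo:hi]
    return out


def dual_stream_attention(X, Wq, Wk, Wv, Wo, s):
    """Wq, Wk, Wv: (heads, D, dh). First half of the heads run LR, second half RL."""
    h = Wq.shape[0]
    heads = []
    for i in range(h):
        direction = "lr" if i < h // 2 else "rl"
        heads.append(segment_stream(X @ Wq[i], X @ Wk[i], X @ Wv[i], s, direction))
    return np.concatenate(heads, axis=-1) @ Wo    # assumed learnable mixing of both streams


rng = np.random.default_rng(0)
n, D, h, s = 16, 32, 4, 4                         # illustrative sizes only
dh = D // h
X = rng.standard_normal((n, D))
Wq, Wk, Wv = (rng.standard_normal((h, D, dh)) / np.sqrt(D) for _ in range(3))
Wo = rng.standard_normal((D, D)) / np.sqrt(D)
Y = dual_stream_attention(X, Wq, Wk, Wv, Wo, s)
print(Y.shape)
```

In this sketch, the first segment of an LR head and the last segment of an RL head fall back to self-attention within the segment, matching the boundary cases described above.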
Figure 5 illustrates the proposed architecture. For simplicity, only the computations within a single head from each of the LR and RL streams are shown. All layers in the architecture share the same structure as depicted in Figure 5.
To further enhance the
architecture, we introduce
segment summary tokens. These tokens provide an additional cross-attention pathway between the full input sequence of size
n and the segment summary tokens (denoted by
). The summary segment, denoted as
, has tokens equal to the number of segments, i.e.,
. The computations are described as follows.
The above concept is somewhat similar to the idea of carrier tokens as employed in FasterViT [
12]. It helps in accumulating the global context in a more effective manner without causing a loss in inductive bias that is obtained by the gradual context-building in the two attention streams.
After the model has learned local relationships between patches, we further enhance global context modeling by performing a reverse cross-attention operation from the summary segment to the image tokens. This enables image tokens to absorb global contextual information aggregated at the segment level. The reverse cross-attention formulation is defined in Equations (37)–(40).
Since the summary segment contains only as many tokens as there are segments, this reverse cross-attention introduces negligible additional computational overhead. Moreover, it is applied only in the later stages of training to preserve the inductive bias of Vision Transformers, which benefit from a gradual transition from local feature aggregation to global context modeling.
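The two cross-attention pathways can be sketched as follows. This is a hedged single-head illustration: we assume that the summary tokens act as queries over the image tokens in the first pathway and that the roles are swapped in the reverse pathway; the weight matrices, the sizes (196 tokens, segments of 14), and the function name are hypothetical.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def cross_attention(X_q, X_kv, Wq, Wk, Wv):
    """Queries from X_q, keys and values from X_kv; output has len(X_q) rows."""
    Q, K, V = X_q @ Wq, X_kv @ Wk, X_kv @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V


rng = np.random.default_rng(0)
n, s, d = 196, 14, 64                  # illustrative sizes only
m = n // s                             # one summary token per segment
X = rng.standard_normal((n, d))        # image tokens
S = rng.standard_normal((m, d))        # segment summary tokens
W = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(6)]

S_new = cross_attention(S, X, *W[:3])        # summaries gather from all image tokens
X_new = cross_attention(X, S_new, *W[3:])    # reverse: image tokens read the summaries
score_entries = n * m                        # per direction, versus n * n for full attention
print(S_new.shape, X_new.shape, score_entries, n * n)
```

With these sizes, each direction evaluates 2744 score entries compared with 38416 for full self-attention, which is consistent with the small overhead noted above.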
Our
attention design is highly efficient, as attention is computed only between paired segments rather than across the full token sequence. The dual information-propagating streams, together with learnable stream aggregation and summary-segment cross-attention, enable enhanced contextual understanding and more effective long-range feature interaction. We perform detailed experiments and compare our design with other SOTA designs in Section 5.
5. Experimental Results
We tested our architecture on different image classification datasets.
Table 1 shows the characteristics of the datasets on which models were trained and tested.
For ImageNet-100 and ImageNet-1K datasets, we resized the images to
. Since the purpose of our experiments was to compare the relative effectiveness of different architectures, we used standard data augmentations from RandAugment [
25]. This method randomly samples
N transformations from a pool of operations such as AutoContrast, cutout, affine transformations (shear, translate, rotate), color enhancements (contrast, brightness, sharpness), and pixel-level operations (solarize, equalize, posterize). Following standard practices for Vision Transformers, the magnitude of these transformations was scaled according to the dataset resolution to maintain consistent regularization strength across different experimental scales.
To ensure a fair comparison between the underlying Transformer architectures, we deliberately excluded complex composite augmentations such as CutMix and Mixup. Since these techniques are known to act as strong regularizers that can mask architectural deficiencies or inflate performance in data-hungry models, omitting these allows for a clearer evaluation of the model’s inherent inductive biases and attention performance.
Our training hyperparameters for each dataset are summarized in
Table 2. We follow standard training practices commonly used for Vision Transformers, employing a warmup period of five epochs together with a linear-then-cosine learning rate schedule. Specifically, the learning rate is initialized at
, increased linearly during the warmup stage to reduce gradient instability, and subsequently decayed using a half-period cosine schedule to encourage stable convergence during later stages of optimization. All models are trained using the AdamW optimizer with a weight decay of 0.05, a dropout rate of 0.08, and gradient clipping with a threshold of 0.3. ImageNet-1K and ImageNet-100 models are trained for 200 epochs using a batch size of 256, while CIFAR-100 and Tiny-ImageNet are trained for 100 and 70 epochs, respectively, using batch sizes of 128 due to their smaller training set sizes. Label smoothing is applied only for the ImageNet-based experiments. All models are evaluated under identical data augmentation and training/testing protocols to ensure fair comparison. We conduct each training run three times and report the average top-one accuracy of these three runs.
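The warmup-then-cosine schedule described above can be sketched as a simple per-epoch function. The peak, initial, and minimum learning rates used here are placeholders chosen by us for illustration (the actual values are those listed in Table 2), and the function name is hypothetical.

```python
import math


def lr_at_epoch(epoch, peak_lr, total_epochs, warmup_epochs=5, init_lr=0.0, min_lr=0.0):
    """Linear warmup from init_lr to peak_lr, then half-period cosine decay to min_lr."""
    if epoch < warmup_epochs:
        return init_lr + (peak_lr - init_lr) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder values: a 200-epoch run with a hypothetical peak learning rate of 1e-3.
schedule = [lr_at_epoch(e, peak_lr=1e-3, total_epochs=200) for e in range(200)]
print(schedule[0], schedule[5], schedule[-1])
```

The schedule reaches its peak at the end of the five warmup epochs and then decreases monotonically until the final epoch.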
Extended ablation studies examining the contribution of individual architectural components, segmentation granularity, and training dynamics are provided in Appendix A (Tables A1 and A2 and Figures A1 and A2).
The architectural configurations for all models, including embedding dimensions (dims), number of attention heads (heads), and stage depths (depths), are summarized in
Table 3. All models are trained from scratch without using pretrained weights. All models are of comparable size, ranging from approximately 15 to 23 million parameters.
Table 4 presents the classification results comparing our
-based Transformer with popular state-of-the-art (SOTA) Vision Transformers across different datasets. For comparison with a linear attention design, we implement the Linformer [
3] for a vision task. Transformer models including ViT, PerceiverAR, Linformer, and EEViT-IP use an embedding dimension of 384, with six heads per layer and eight layers. The Linformer implementation uses 224 × 224-sized images with a patch size of 14 × 14, resulting in a sequence length of 256 tokens. Our Linformer implementation uses a compression of the sequence length to k = 64.
We present results for two variations of our attention-based Transformer models. The first uses standard non-overlapping input patch tokenization. Since many current SOTA models use hierarchical input patch tokenization, our -CP model employs CNN-based input patch tokenization. For images, we use a kernel with strides of two and five in successive layers to downsample the image from to , resulting in 484 tokens with 128 channels. Thus, the input to our -CP Transformer consists of 484 tokens with an initial embedding dimension of 128.
To further improve efficiency, we introduce controllable patch merging between stages in our -CP variant. Our multi-stage design progressively integrates local spatial features with global context. At stage transitions, the model employs a convolutional downsampling module to simultaneously reduce spatial resolution and expand feature dimensionality. Unlike standard linear patch merging, this module incorporates a depthwise-separable convolution block (a depthwise convolution followed by a pointwise convolution) to provide an explicit local inductive bias, followed by a strided convolution (stride = 2) to condense the token sequence.
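To give a sense of why a depthwise-separable block is inexpensive, the sketch below compares its weight count with that of a standard convolution. Only the 128 input channels come from the text; the 3 × 3 kernel, the 256 output channels, and the helper name are assumptions made for illustration, and biases are ignored.

```python
def conv_params(c_in, c_out, kernel, groups=1):
    """Weight count of a 2D convolution with square kernels (biases ignored)."""
    return (c_in // groups) * c_out * kernel * kernel


c_in, c_out, k = 128, 256, 3                             # c_out and k are hypothetical
standard = conv_params(c_in, c_out, k)                   # one dense k x k convolution
depthwise = conv_params(c_in, c_in, k, groups=c_in)      # one k x k filter per channel
pointwise = conv_params(c_in, c_out, 1)                  # 1 x 1 channel mixing
print(standard, depthwise + pointwise)
```

Under these assumed sizes, the separable block uses 33,920 weights versus 294,912 for the dense convolution, so the local inductive bias is added at a small fraction of the parameter cost.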
This downsampling mechanism is strategically applied to specific stages (e.g., Stage 3) to manage the computational complexity of subsequent dual-stream attention layers. By transforming flattened token sequences back into a 2D spatial grid for these operations, the architecture maintains overlapping local context and structural integrity as the feature map transitions from high-resolution local processing to low-resolution global abstraction.
To further analyze the effect of stage-wise design decisions on training dynamics, we investigate when to introduce global context during optimization; these ablation experiments are reported in Appendix A, Table A3.
As can be seen from
Table 4, our
-CP model not only provides the most efficient attention but also outperforms other SOTA models in terms of top-one accuracy. From the results in
Table 4, it can also be observed that patch tokenization and patch merging contribute positively to model performance, both in terms of classification accuracy and computational efficiency, in addition to the attention mechanism. This is why incorporating CNN-based patch tokenization and patch merging into our
attention mechanism significantly enhances model performance. While linear attention models such as Linformer [
3] and Performer [
5] perform well on NLP tasks, current linear attention approaches either suffer from performance degradation or introduce additional computation overhead [
26]. In the next subsection, we incorporate some of the best patch tokenization approaches from SOTA Vision Transformer models into our
attention.
Enhancing T2T and FasterViT with Our Attention
Two examples of Vision Transformer (ViT) models that achieve high performance with a focus on the patching process are T2T and FasterViT. In both models, the patching process moves away from the non-overlapping patches of the original ViT to better capture local structural information and spatial correlations.
T2T utilizes a recursive “Tokens-to-Token” module that restructures the token sequence. It unfolds the flattened tokens back into a 2D spatial grid, applies an overlapping sliding window to group neighboring tokens, and then re-projects each group into a single token. This progressively reduces the sequence length while aggregating local neighborhood details. However, T2T uses standard full attention after its patching modules. Since our attention is not only highly efficient but also improves the inductive bias by implicitly aggregating information from different segments across layers, we replace the T2T full attention with our attention. T2T uses 196 tokens in its last stage (i.e., 14 × 14). Since many layers are used in the last stage (12 in our implementation), this replacement benefits both computational efficiency and classification accuracy.
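The token count of the last T2T stage can be checked with the usual sliding-window output-size formula. The sketch assumes a 224 × 224 input and the window settings commonly used for T2T-ViT (kernel, stride, padding of (7, 4, 2), then (3, 2, 1) twice); these settings are our assumption and are not stated in this section.

```python
def window_out(size, kernel, stride, padding):
    """Output side length of a sliding-window (unfold/convolution) operation."""
    return (size + 2 * padding - kernel) // stride + 1


def t2t_grid(image_size=224, stages=((7, 4, 2), (3, 2, 1), (3, 2, 1))):
    """Side length of the token grid after the overlapping T2T unfold stages."""
    size = image_size
    for kernel, stride, padding in stages:
        size = window_out(size, kernel, stride, padding)
    return size


side = t2t_grid()
print(side, side * side)
```

The grid shrinks from 224 to 56, 28, and finally 14 per side, which gives the 196 tokens processed by the last-stage attention layers.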
Similarly, FasterViT relies on a hierarchical morphological approach, utilizing overlapping convolutions or patch-embedding layers to create a multi-scale representation. By using overlapping patches, both architectures ensure that pixels at patch boundaries are not isolated, effectively preserving local continuity and fine-grained features that are often lost in standard ViT patching. To process 224 × 224 images, a FasterViT design typically uses four stages, with the last two stages employing full attention over 14 × 14 and 7 × 7 tokens (i.e., 196 and 49 tokens). We replace the attention in the last two stages of FasterViT with our attention.
Table 5 presents the top-1 classification accuracy results on three datasets, comparing the original T2T and FasterViT models with their
-enhanced versions. For comparison, results for SwinV2 [
19], PVT-V2 [
21], and Twins-SVT [
20] are also included. In the upper section of
Table 5, the ImageNet-1K training is carried out for 200 epochs using the same augmentations as the DeiT training protocol, except for Mixup. We omit Mixup in this set of results in order to highlight the architectural differences between models.
However, to conform to protocols in the reported literature, we also follow the normal DeiT training protocol (except for repeated augmentations) and report these results in the lower section of
Table 5. Apart from repeated augmentation, which we omit, this training protocol employs RandAugment, random erasing, Mixup, and CutMix regularization strategies. Mixup and CutMix are applied with coefficients of 0.8 and 1.0, respectively, while label smoothing is set to 0.1. Training additionally utilizes gradient clipping with a maximum norm of 1.0 and an exponential moving average (EMA) of model weights with a decay factor of 0.99996 to improve training stability and evaluation performance.
The reason for omitting repeated augmentation sampling is that it is primarily designed for distributed multi-GPU training, where different augmented views of the same sample can be processed simultaneously across workers to improve sample diversity and regularization efficiency. Since our ImageNet-1K experiments were conducted on a single GPU, repeated augmentation sampling was not employed, as its benefits are substantially reduced in non-distributed training settings while introducing additional computational overhead.
Figure 6 and
Figure 7 depict the relative performance gains over the baselines when using our
attention in the respective models.
From
Table 5,
Figure 6, and
Figure 7, it can be observed that while the classification accuracy gains over the T2T and FasterViT baselines are relatively small when replacing full attention with
attention, the computational efficiency improves noticeably. For example, when training on the ImageNet-1K dataset with a batch size of 256, FasterViT takes approximately 21 minutes to complete one epoch on an NVIDIA RTX 4090 GPU. Replacing its attention with
attention reduces the per-epoch time to about 16 minutes, a reduction of roughly 24%.
While we have demonstrated improvements in two state-of-the-art ViT architectures by replacing their attention mechanisms with our attention, the approach is broadly applicable. In principle, any ViT architecture that relies on full attention can benefit from incorporating our efficient and context-enhancing attention mechanism.
6. Discussion
By breaking the input sequence of tokens (i.e., patch embeddings) into small segments and performing a PerceiverAR attention computation on pairs of overlapping segments, we achieve linear complexity in our attention computation. In a PerceiverAR computation, if the Q operation is performed on the right segment in a pair, and K and V are computed on the entire pair, then information is propagated from the left segment to the right segment. Similarly, if the Q operation is performed on the left segment in a pair, information propagates from the right segment to the left segment.
This is the key idea in our
attention, where half of the heads perform left-to-right information propagation, and the other half perform right-to-left context propagation. We visually depict this phenomenon in
Figure 8. A more detailed visualization of the attention computation pattern and its scaling behavior is provided in
Appendix B (
Figure A3,
Figure A4 and
Figure A5).
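The pairing described above can be sketched as index ranges: for each query segment, which key/value positions are visible in a left-to-right head and in a right-to-left head. This is a minimal sketch under stated assumptions, not the authors' implementation; the function names are hypothetical, and the treatment of a segment with no neighbor on the required side (it attends only to itself) is an assumption.

```python
def segment_bounds(n, s):
    """Split token indices 0..n-1 into consecutive segments of size s."""
    if n % s != 0:
        raise ValueError("n must be divisible by s")
    return [(start, start + s) for start in range(0, n, s)]


def key_range(seg_idx, bounds, direction):
    """Half-open range of key/value positions seen by queries in segment seg_idx.

    "l2r": queries come from the right segment of a pair, keys span the pair.
    "r2l": queries come from the left segment of a pair, keys span the pair.
    A segment lacking the required neighbor attends only to itself (assumption).
    """
    if direction not in ("l2r", "r2l"):
        raise ValueError("direction must be 'l2r' or 'r2l'")
    lo, hi = bounds[seg_idx]
    if direction == "l2r" and seg_idx > 0:
        lo = bounds[seg_idx - 1][0]  # extend keys to the left neighbor
    if direction == "r2l" and seg_idx < len(bounds) - 1:
        hi = bounds[seg_idx + 1][1]  # extend keys to the right neighbor
    return lo, hi


bounds = segment_bounds(16, 4)
print([key_range(i, bounds, "l2r") for i in range(4)])
# [(0, 4), (0, 8), (4, 12), (8, 16)]
print([key_range(i, bounds, "r2l") for i in range(4)])
# [(0, 8), (4, 12), (8, 16), (12, 16)]
```

Each query therefore sees at most 2s keys regardless of the total sequence length, which is the source of the linear scaling.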
As shown in
Figure 8, we illustrate the progressive expansion of the effective receptive field in
attention across layers. Due to the overlapping segmented attention mechanism, information is gradually propagated between neighboring segments at each layer. After
layers (where
is the number of segments), information originating from the leftmost segment can reach the rightmost segment through successive local exchanges. Symmetrically, information from the rightmost segment can propagate to the leftmost segment over the same depth. At intermediate segment positions, information from both spatial extremes becomes accessible after approximately
layers, as illustrated for the case of eight segments in
Figure 8.
Overall,
Figure 8 provides an illustrative depiction of how local interactions progressively give rise to broader contextual mixing over multiple layers. The combined effect of overlapping segments, bidirectional propagation, and summary-token cross-attention enables efficient long-range dependency modeling without explicitly computing full quadratic attention.
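The layer-by-layer expansion of the receptive field can be checked with a small reachability simulation. The sketch below is an idealized model, assuming that in every layer each segment absorbs whatever its immediate left and right neighbors already hold; it ignores the summary-token cross-attention, which would shorten these paths. The function name is hypothetical.

```python
def layers_until(num_segments, done):
    """Count layers until done(reach) holds.

    reach[i] is the set of source segments whose information has arrived
    at segment i after the layers applied so far.
    """
    reach = [{i} for i in range(num_segments)]
    for layers in range(num_segments + 1):
        if done(reach):
            return layers
        updated = []
        for i in range(num_segments):
            merged = set(reach[i])
            if i > 0:
                merged |= reach[i - 1]  # left-to-right heads
            if i < num_segments - 1:
                merged |= reach[i + 1]  # right-to-left heads
            updated.append(merged)
        reach = updated
    raise RuntimeError("condition not reached")


m = 8
# Leftmost segment's information reaching the rightmost segment.
print(layers_until(m, lambda r: 0 in r[m - 1]))  # 7
# A middle segment receiving information from both extremes.
print(layers_until(m, lambda r: {0, m - 1} <= r[m // 2]))  # 4
```

Under this model, end-to-end propagation takes one layer fewer than the number of segments, while a segment near the middle sees both extremes after about half as many layers.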
The computational complexity of
attention is given by
Here, n is the sequence length, which is divided into segments of size s. The number of overlapping segment pairs is , h is the number of attention heads, and l is the number of layers in the Transformer. The term corresponds to the regular attention performed in the first and last segments, as they do not form overlapping pairs. The term represents the PerceiverAR computation over each pair of segments. Since s, h, and l are constants, the overall complexity of attention remains linear in n, i.e., O(n).
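To make the linear scaling concrete, the sketch below counts query-key score entries under a simplified cost model. This model is an assumption made for illustration and is not the complexity expression of this work: in each head, a segment with a neighbor scores its s queries against the 2s keys of its pair, and the single boundary segment without a neighbor scores s queries against s keys. The function names are hypothetical.

```python
def segmented_scores(n, s, h=1, l=1):
    """Query-key score entries for segment-pair attention (assumed cost model)."""
    if n % s != 0:
        raise ValueError("n must be divisible by s")
    pairs = n // s - 1                         # adjacent segment pairs
    per_head = pairs * (s * 2 * s) + s * s     # paired segments + one boundary segment
    return h * l * per_head                    # equals h * l * (2*n*s - s*s)


def full_scores(n, h=1, l=1):
    """Query-key score entries for standard full attention."""
    return h * l * n * n


for n in (196, 392, 784):
    print(n, segmented_scores(n, 14), full_scores(n))
# 196 5292 38416
# 392 10780 153664
# 784 21756 614656
```

With s fixed, doubling n roughly doubles the segmented count while the full-attention count quadruples.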
A detailed breakdown of this computation pattern is provided in
Appendix B, where we further illustrate the segment-wise attention mechanism and its computational structure.
Another important contribution of
attention is that it implicitly improves the inductive bias for vision. The model gradually builds context from smaller segments, eventually forming a global representation, as depicted in
Figure 8. This progressive aggregation of information is one of the primary reasons that
-based Transformers achieve better classification accuracy than full-attention ViTs.
We further enhance classification performance by introducing a summary segment whose size equals the number of segments, effectively assigning one summary token per segment. By performing cross-attention between the full input sequence and this summary segment, the model learns a global representation in a structured and efficient manner. This idea is analogous to the use of carrier tokens in FasterViT.
While our attention is highly efficient and improves inductive bias, it becomes even more effective when combined with enhanced patch tokenization schemes such as those used in T2T and FasterViT. We demonstrate this by integrating the T2T and FasterViT patch tokenization front ends with our attention to construct highly efficient Vision Transformers. The attention design is also highly scalable. For very-high-resolution images, if GPU memory constraints limit attention computation, smaller segment sizes can be used to reduce memory requirements. Additionally, unlike Swin, Twins, and FasterViT, our attention does not require input image sizes to conform to specific window divisibility constraints. The only requirement is that the input sequence be divisible into segments of equal size. If this condition is not met, a small number of tokens can be padded to ensure divisibility without affecting performance.
For ViTs, FLOPs (Floating Point Operations) serve as the primary proxy for computational complexity, directly determining the workload a hardware device must perform to process an image. While parameters dictate the model’s memory footprint on the GPU, FLOPs are the bottleneck for inference latency and throughput. This is especially important for real-time applications such as autonomous vehicles and robots.
Table 6 presents the GigaFLOP counts for some of the SOTA models examined in this work.
FLOPs in
Table 6 were measured using the
fvcore library [
27]. As shown, the models that incorporate our
attention are more efficient in terms of inference computational cost compared to other models. Our models also achieve higher classification accuracy, as indicated by the results in
Table 4 and
Table 5. Additional implementation details and extended analyses are provided in
Appendix A and
Appendix B.
Limitations
While our attention demonstrates strong performance across multiple Vision Transformer backbones and image classification benchmarks, one limitation is that the benefit of segmented information propagation may diminish for very small image resolutions or extremely short token sequences. In such regimes, the additional segmentation structure may provide limited practical advantage over standard self-attention and may even be more compute-intensive.
Additionally, the proposed segmented attention mechanism assumes that the token sequence can be partitioned into uniform segments. When the sequence length is not divisible by the segment size s, zero padding is applied to the final segment to maintain consistent segment dimensions during attention computation. Although the computational overhead introduced by this padding is minimal, excessive padding in highly irregular sequence configurations may slightly reduce efficiency.
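The padding requirement amounts to rounding the sequence length up to the next multiple of the segment size. A minimal sketch follows, with a hypothetical helper name and example lengths chosen only for illustration.

```python
def pad_to_segments(n, s):
    """Return (pad, segments): padding tokens needed and the resulting segment count."""
    if n <= 0 or s <= 0:
        raise ValueError("n and s must be positive")
    pad = (-n) % s  # tokens appended to the final segment
    return pad, (n + pad) // s


print(pad_to_segments(196, 14))  # (0, 14)
print(pad_to_segments(200, 14))  # (10, 15)
```

The padding is always smaller than one segment, so its relative cost shrinks as the sequence grows.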
7. Conclusions
The growing reasoning capability of recent LLMs increasingly requires them to understand, process, and generate image and video data. Thus, a unified Transformer-based architecture becomes a natural choice for VLMs. The key challenges in a Transformer architecture for vision include efficient attention computation and the lack of natural inductive bias for vision. We present an effective solution to these problems in this work. We devise a highly efficient dual-stream information-propagating attention design, termed attention, that breaks the input sequence into small overlapping segments. By performing a PerceiverAR-style computation between a pair of overlapping segments, we propagate information from left to right across the input segments as processing proceeds through the layers of the Transformer. Since the patch tokens to the left of an image patch differ from those to its right, we use the reverse PerceiverAR computation in half the heads of the Transformer to achieve right-to-left information flow between segments. By combining the two streams in a learnable way, we achieve the equivalent of full global attention after only a few layers. This not only results in a very efficient attention mechanism, but also improves the inductive bias, as context is gradually built from individual segments into a larger receptive field.
To improve the classification performance, we use an extra segment to act as the summary of all segments. In each layer of the Transformer, we apply cross-attention from all other segments to this summary segment to build the global context. Our attention-based ViT achieves some of the best classification results for a pure Transformer-only design with simple patching input. To further demonstrate the usefulness of our work, we replace the attention modules in popular SOTA designs of T2T and FasterViT with our attention, resulting in highly efficient architectures with slightly improved classification performance. One of the important contributions of our work is its scalability to very long sequences because of its linear complexity.
Our future work involves applying the attention architecture to video comprehension. The sequence of frames in a video can be converted to a sequence of overlapping segments to propagate information from one frame to another. The summary segment used in our design can then be used to accumulate the information through cross-attention between frame segments and the summary segment to provide an effective visual understanding.