Article

DS2 Attention: Dual-Stream Segmented Information Propagating Linear Attention for Vision Transformers

Department of Computer Science and Engineering, University of Bridgeport, Bridgeport, CT 06604, USA
*
Author to whom correspondence should be addressed.
AI 2026, 7(6), 188; https://doi.org/10.3390/ai7060188
Submission received: 6 April 2026 / Revised: 13 May 2026 / Accepted: 18 May 2026 / Published: 24 May 2026

Abstract

While Vision Transformers (ViTs) have achieved state-of-the-art (SOTA) results in visual recognition, their scalability remains fundamentally constrained by the quadratic complexity of global self-attention. To address this, we present a linear complexity attention design employing dual-stream information propagation to enhance representational efficiency and structured feature aggregation. Our proposed $DS^2$ attention acts as a versatile replacement for standard attention in various SOTA designs, such as Tokens-to-Token (T2T) and FasterViT. In our design, half of the attention heads perform left-to-right segmented information propagation in a Perceiver-style manner, while the remaining half of the heads perform right-to-left propagation. This bidirectional structured attention enables efficient long-range dependency modeling without the overhead of full global attention. To improve classification performance, we introduce a segment-level classification strategy in which each segment is associated with a summary token. The final prediction is produced via cross-attention between image tokens and these summary tokens, enabling hierarchical semantic comprehension. Extensive experiments demonstrate that the proposed attention design achieves on average 0.3% higher accuracy on the ImageNet-1K dataset, while offering improved information flow and higher efficiency across SOTA Vision Transformer designs.

1. Introduction

With the immense success of the Transformer architecture in Large Language Models (LLMs), Vision Transformers have emerged as a powerful alternative to Convolutional Neural Networks (CNNs) for image recognition. The Vision Transformer (ViT) was proposed in the seminal work in [1]. In that work, an input image is divided into equal-sized patches, which are then fed to a Transformer as a sequence of tokens, as in a Natural Language Processing (NLP) Transformer. The work in [1] demonstrated that when pretrained on large image datasets, a ViT can match or exceed the performance of a CNN for image recognition. While ViTs lack the natural inductive bias of CNNs for vision, their ability to model long-range dependencies through self-attention is advantageous. In addition, Transformers possess larger capacity than CNNs [1,2]. The recent interest in Vision Language Models (VLMs) also favors their use in order to achieve a unified architecture for both text and vision.
Despite these advantages, the transition from local convolutions in a CNN to global token processing in a ViT introduces scalability constraints. While treating images as sequences provides the flexibility required for high image recognition performance, it entails substantial computational and memory overhead. This is primarily caused by the self-attention mechanism, which scales quadratically with the number of tokens. Consequently, as some vision tasks involve higher image resolutions, fine-grained spatial representations, or longer video sequences, the reliance on global attention poses a fundamental challenge. This limits efficient and effective structured visual modeling, necessitating more optimized ViT architectural approaches.
A significant focus in recent research has been on the development of efficient attention mechanisms as well as Transformer architectures to reduce the $O(n^2)$ attention complexity to $O(n)$ or near-linear complexity. Notable works in the NLP domain include Linformer [3], Longformer [4], Performer [5], and linear attention [6], among others. Linformer uses projections along the sequence dimension to reduce its length. While it theoretically achieves linear time complexity, it sacrifices performance because the projection effectively compresses the context. Longformer uses a sliding window with some interleaved global attention to improve efficiency, but still falls short of the performance of full global attention. Performer introduced the idea of using a kernel-based approximation to attention. While it theoretically achieves linear complexity for attention, the overhead of the FAVOR+ algorithm that Performer uses to implement its kernel methods is considerably high. Linear attention also uses the kernel methods approach but instead uses learnable kernel feature maps to approximate attention. It often results in lower performance compared to standard full attention because the feature map used to approximate attention is typically low-rank, hindering the modeling of complex spatial or semantic information.
The architectural approaches developed to mitigate the quadratic computational complexity of full attention in vision tasks have employed hierarchical, multi-scale designs. These achieve efficiency by focusing on local or spatially reduced attention, which also introduces locality biases similar to CNNs. This leads to improved performance as well as better scalability for high-resolution images. Recent effective approaches use hierarchical representations and progressive tokenization strategies that gradually aggregate local information into higher level tokens. Tokens-to-Token (T2T) [7] is an important work that recursively merges neighboring tokens, enabling the model to capture local structure while reducing token counts. Despite its effectiveness, the T2T design uses standard self-attention in its last stages, which is less efficient, and further lacks explicit inductive bias for directional or structured information flow. As a result, such models may underutilize the sequential nature of token aggregation and fail to fully exploit structured propagation patterns that could improve representational efficiency.
Recent advances in structured attention mechanisms have demonstrated that constraining information flow can lead to more efficient and effective long-range dependency modeling. Examples of architectures that utilize directional or propagative attention include Swin [8], CrossViT [9], ECP [10] and EEViT [11]. Such designs replace unrestricted global interactions with carefully orchestrated information propagation across tokens. These approaches suggest that global context need not be modeled through dense all-to-all attention, but can instead emerge from progressive, structured communication patterns that are both computationally efficient and semantically meaningful.
Inspired by some of the recent work in efficient attention, we propose a dual-stream segmented information propagation architecture with significant improvements over the recent EEViT design. Specifically, attention heads are divided into two complementary streams: one performing left-to-right segmented propagation and the other performing right-to-left propagation. This design enables efficient bidirectional context aggregation in a Perceiver-style manner, allowing long-range dependencies to be captured through structured information flow rather than full global attention. Further, inspired by the carrier tokens of FasterViT [12], we inject summary, or class, tokens in each segment. Thus, instead of relying on a single global class token, each segment is associated with its own class token, allowing local semantic summaries to be formed at intermediate stages. Second-level cross-attention is performed between image tokens and these segment-level class tokens, enabling hierarchical semantic fusion and gradual comprehension of the global context. This design encourages the model to reason over structured semantic units rather than flattened token sequences.
Extensive experimental results demonstrate that our proposed dual-stream attention, termed $DS^2$, achieves superior performance compared to conventional Vision Transformer designs, while significantly improving the structure and efficiency of information flow. Our design can also act as a drop-in replacement for the attention module in current SOTA ViTs. We demonstrate this by replacing the full attention modules in T2T and FasterViT architectures. This results in higher compute efficiency while maintaining equivalent or slightly improved classification accuracy. Thus, by combining segmented bidirectional information propagation, segment-level summary or class tokens and self and segment-level cross-attention, our approach provides an efficient linear complexity attention design. While our technique can be applied to video processing, we currently focus on enhanced and efficient image recognition. Our main contributions are summarized as follows:
  • The $DS^2$ Attention Module: We propose a dual-stream information propagating attention module that splits attention heads into parallel left-to-right and right-to-left segmented streams. This approach achieves both enhanced context without loss of information and efficient long-range dependency modeling with linear complexity.
  • Segment-Level Class Token Strategy: We introduce a hierarchical classification framework that assigns a dedicated class token to each spatial segment. This approach replaces the traditional single-class-token bottleneck, allowing the model to capture more granular semantic information across regions in the image.
  • Hierarchical Semantic Fusion: We implement a structured aggregation mechanism that utilizes cross-segment attention to fuse segment-level class tokens. This enhances global representation and improves inductive bias for hierarchical visual representation learning.
  • Architectural Versatility and Performance: We demonstrate that the $DS^2$ module can be used as a versatile “plug-and-play” replacement for standard attention in established architectures like Tokens-to-Token (T2T), FasterViT and PVT [13]. Extensive experiments on image datasets show that $DS^2$ achieves competitive or superior performance with significantly improved computational efficiency.

2. Related Work

The pioneering work of Dosovitskiy et al. introduced the Vision Transformer (ViT) [1] by adapting a language-domain Transformer to the vision domain. It simply changed the input token embeddings of the Transformer so that they are obtained from image patches rather than text tokens. Since the goal in a ViT is primarily classification, the ViT [1] introduced an extra randomly initialized token in the input sequence to act as the class (CLS) token. This idea was borrowed from the BERT [14] NLP Transformer that is primarily used for text classification. The ViT showed impressive results when pretrained on large image datasets, matching the state-of-the-art CNNs.
To improve the performance of the attention mechanism in the ViT, several ideas have been proposed, ranging from improving the architecture of the ViT to better structural decomposition of the input image into smaller, carefully crafted windows. The important works in the area of architecture enhancements include the Perceiver class of designs [10,15,16,17] and the CaiT [18] Transformer. The fundamental idea in a Perceiver is to divide the input sequence into two parts, referred to as the context and the latent. In the first layer of the Transformer, an attention operation is performed between the latent component and the entire input sequence, resulting in an output of the size of only the latent. All remaining layers perform normal self-attention and have inputs and outputs the size of the latent. This results in greatly reduced attention computations, especially if the depth of the Transformer is large.
CaiT [18] processes image patches without a class token in the first several layers, then adds a class token with the class attention in the last few layers to aggregate global information. This allows the Transformer to learn rich local and mid-level features in early layers without being distracted by the classification goal. The later layers can then better focus on the classification. While CaiT improves classification accuracy as compared to a standard ViT, it does not improve the compute efficiency. To improve the efficiency as well as the image classification performance, important works that focus on two-dimensional structural decomposition of the input image into smaller sizes include Swin [8], Swin-V2 [19], Twins [20], Tokens-to-Token [7], FasterViT [12] and PVT [13] designs, among others.
Swin introduces a hierarchical Vision Transformer that performs self-attention within local windows, while enabling information exchange across windows via a shifted window mechanism. While Swin produces very good classification results, its reliance on fixed window sizes can limit the modeling for different image sizes. Further, the shifted window strategy introduces additional implementation complexity which requires high memory usage during training. Swin-V2 [19] extends Swin by improving training stability and scalability, particularly for large models and high-resolution inputs. It replaces the dot-product attention with cosine similarity attention. It also introduces log-spaced continuous relative position bias, enabling better generalization to varying image sizes. Despite these advantages, Swin-V2 inherits the windowing locality constraints of Swin, and remains computationally expensive. Another similar approach in the Twins Transformer [20] alternates local self-attention with reduced-cost global attention to capture both fine details and overall context. Its global component uses spatially reduced attention, where keys and values are down-projected to keep computation low. Unlike Swin, Twins injects explicit global context early and avoids window shifting, leading to a simpler design and better global modeling. The slight drawback of the Twins design is that the global attention (even when reduced) is still less efficient than windowed attention, and there may be loss of detail for high-resolution images due to compression of global context.
The Tokens-to-Token (T2T) [7] design focuses on improving the tokenization process itself by progressively aggregating neighboring patches into more informative tokens before applying global self-attention. This hierarchical token construction allows the model to better capture local structures and spatial relationships that are often lost in standard patch embeddings in a ViT. While T2T improves representation quality and performance on classification benchmarks, the use of standard full attention in the last stage layers is inefficient. FasterViT [12] aims to improve both throughput and accuracy by introducing hierarchical attention (HAT), where attention is computed within local windows and selectively across windows using lightweight global tokens. This approach balances local feature extraction with efficient global context modeling while reducing memory access and attention cost. FasterViT demonstrates strong performance; however, its hierarchical attention mechanism is more complex than standard attention, and performance gains may be less pronounced for low-resolution or smaller images.
Pyramid Vision Transformer (PVT) [13] introduces a hierarchical architecture that progressively reduces spatial resolution while increasing feature dimensionality, closely resembling CNN pyramids. It employs spatially reduced attention to lower the computational cost of attention, enabling efficient processing of large images. PVT effectively bridges the gap between CNNs and Transformers for dense vision tasks. Its main limitation lies in the aggressive spatial reduction, which can degrade fine-grained spatial information and affect performance on tasks requiring precise localization. PVT-V2 [21] introduces Linear Spatial Reduced Attention (SRA) using average pooling to reduce the feature map to a fixed size, making the computational cost linear with respect to image resolution. It also uses overlapping patches in the input embedding to improve local spatial information. While the linear SRA is fast, the use of fixed-size average pooling is inherently lossier than the original design. Further, the design also uses depthwise convolutions, making it a hybrid CNN–Transformer architecture rather than a primarily Transformer-based design.
While our focus in this paper is on Transformer-based designs for vision, we briefly mention a few other hybrid designs. For example, LeViT [22] is designed for faster inference by combining convolutional layers with attention mechanisms in a hierarchical architecture. It employs convolutions and downsampling in early stages for feature extraction, and lightweight attention blocks in later stages to model global interactions efficiently. Another popular design, DeiT [23], focuses on improving the data and training efficiency of Vision Transformers, demonstrating that competitive performance can be achieved without pretraining on large datasets. It introduces a distillation strategy that leverages a CNN teacher to guide the Transformer training through an additional distillation token. While DeiT significantly reduces training data requirements and improves convergence, it does not address the inherent quadratic complexity of self-attention or introduce structural modifications for efficient high-resolution image processing.
In order to reduce the complexity of self-attention and enable VLMs to process both text and images in a uniform manner, a highly efficient approach has been recently proposed in [10,11]. Here, the input sequence is divided into small overlapping segments where a Perceiver operation is performed on the consecutive pair of segments. The Perceiver operation propagates the information from the left segment (context) in the pair to the right segment (latent) in a very effective manner. This information flow in the EEViT-IP [11] design demonstrates that while attention is only computed locally, it implicitly accumulates equivalent full attention after a few layers in the Transformer.
In this paper, we enhance this information propagation by introducing dual-stream information flow by employing forward as well as backward Perceiver-style attention in pairs of overlapping segments. The information from the dual streams is combined in a learnable way using projections. Further, we introduce a summary token with each segment, and perform second-level cross-attention to gradually accumulate better global context. Our attention design is extremely uniform and does not rely on carefully crafted windows. Our design is highly efficient and increases inductive bias for vision as it gradually builds the local structure into a global context. For completeness, before we present our architecture, we briefly describe some of the preliminary background needed in Section 3.

3. Preliminaries

3.1. Vision Transformer

The Vision Transformer (ViT) [1] uses an encoder-only architecture as shown in Figure 1, where the input to the Transformer is obtained by converting the image into fixed-size patches. If an image is divided into $n$ patches of size $p \times p$, and $d$ denotes the embedding dimension of each patch, then the positionally encoded input to the Transformer is given by
$$X_p = X + P$$
where $X, P \in \mathbb{R}^{(n+1) \times d}$ and $P$ represents the positional encoding associated with each image patch. An additional randomly initialized class token is prepended to the sequence, which accounts for the $n + 1$ rows. During training, this token learns to encode the global representation required for image classification as it propagates through the encoder layers.
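The construction of $X_p$ can be sketched in a few lines of NumPy. This is only an illustration under assumed values: the patch size `p = 16`, embedding dimension `d = 64`, the linear patch projection `W_embed`, and the random stand-in for the positional encoding are hypothetical and are not taken from the paper.

```python
import numpy as np

def patchify(image, p):
    """Split an (H, W, C) image into non-overlapping p x p patches (row-major order)."""
    H, W, C = image.shape
    assert H % p == 0 and W % p == 0, "image sides must be multiples of p"
    patches = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)          # (n, p*p*C)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
p, d = 16, 64                                      # hypothetical patch size / embedding dim

flat = patchify(image, p)                          # n = 14 * 14 = 196 patches
W_embed = rng.standard_normal((flat.shape[1], d)) * 0.02   # assumed linear patch projection
cls = rng.standard_normal((1, d)) * 0.02           # randomly initialized class token
X = np.concatenate([cls, flat @ W_embed], axis=0)  # class token first: (n + 1, d)
P = rng.standard_normal(X.shape) * 0.02            # stand-in for the positional encoding
X_p = X + P
print(X_p.shape)                                   # (197, 64)
```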
The computation in each attention head $i$ is performed using query, key, and value projections parameterized by weight matrices $W_{q,i}$, $W_{k,i}$, and $W_{v,i}$. These matrices map the input embeddings into a lower-dimensional subspace of dimension $d_h$, where
$$d_h = \frac{d}{n_h}$$
and $n_h$ denotes the number of attention heads.
The query, key, and value matrices for attention head $i$ are computed as
$$K_i = x_i W_{k,i}$$
$$Q_i = x_i W_{q,i}$$
$$V_i = x_i W_{v,i}$$
where $W_{q,i}, W_{k,i}, W_{v,i} \in \mathbb{R}^{d_h \times d_h}$. The self-attention output for head $i$ is computed as
$$A_i = \text{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_h}}\right) V_i$$
The outputs from all attention heads are concatenated to form the Multi-Head Self-Attention (MHSA) representation. This combined representation is then passed through a linear projection layer parameterized by $W_o$:
$$\text{MHSA}_{out} = [A_0 \,\Vert\, A_1 \,\Vert\, \cdots \,\Vert\, A_{n_h - 1}]\, W_o$$
where $\Vert$ denotes concatenation across the attention heads.
As shown in Figure 1, the positionally encoded input X p first passes through a normalization layer before entering the MHSA block. The Transformer encoder also incorporates residual (ResNet-style) skip connections around both the attention and feed-forward modules to stabilize training in deeper architectures.
The attention matrix computed in each head, $\text{softmax}(Q_i K_i^T) \in \mathbb{R}^{n \times n}$, requires computing pairwise interactions between all tokens. Consequently, the computational complexity of self-attention is $O(n^2)$, where $n$ is the number of image patches. This quadratic complexity is the dominant computational cost in standard Transformer architectures.
To reduce the attention computations in a Transformer, a simple approach has been proposed in the PerceiverAR [17] design. Since we use this concept in the development of our $DS^2$ attention, we briefly describe the key ideas in the PerceiverAR design in Section 3.2.
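The multi-head computation and its quadratic score matrix can be made concrete with a minimal NumPy sketch. It assumes that $x_i$ is the $d_h$-wide slice of the embedding belonging to head $i$ (one reading of the $d_h \times d_h$ weight shape); the sizes and random weights are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mhsa(x, Wq, Wk, Wv, Wo, n_h):
    """x: (n, d); Wq, Wk, Wv: (n_h, d_h, d_h); Wo: (d, d)."""
    n, d = x.shape
    d_h = d // n_h
    heads = []
    for i in range(n_h):
        x_i = x[:, i * d_h:(i + 1) * d_h]          # assumed per-head slice of the embedding
        Q, K, V = x_i @ Wq[i], x_i @ Wk[i], x_i @ Wv[i]
        scores = softmax(Q @ K.T / np.sqrt(d_h))   # (n, n): the quadratic term
        heads.append(scores @ V)                   # A_i, shape (n, d_h)
    return np.concatenate(heads, axis=-1) @ Wo     # concatenate heads, project with W_o

rng = np.random.default_rng(0)
n, d, n_h = 197, 64, 4                             # hypothetical sizes
x = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((n_h, d // n_h, d // n_h)) * 0.1 for _ in range(3))
Wo = rng.standard_normal((d, d)) * 0.1
out = mhsa(x, Wq, Wk, Wv, Wo, n_h)
print(out.shape, "score entries per head:", n * n)
```

Doubling `n` quadruples the number of score entries per head, which is the scaling behaviour the remainder of the paper aims to avoid.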

3.2. PerceiverAR Transformer

The PerceiverAR [17] architecture, depicted in Figure 2, divides the input sequence into two components: the context and the latent. The first layer performs cross-attention between the latent tokens and the entire input sequence. After the first layer, the output is reduced to the size of the latent sequence, significantly reducing attention computations in the remaining layers. For simplicity, the division into multiple attention heads is omitted in the following equations.
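The input sequence $x$ is divided into context and latent components as
$$x = x_{\text{context}} \,\Vert\, x_{\text{latent}},$$
where
$$x \in \mathbb{R}^{n \times d}, \quad x_{\text{context}} \in \mathbb{R}^{c \times d}, \quad x_{\text{latent}} \in \mathbb{R}^{l \times d},$$
and $n = c + l$. The symbol $\Vert$ denotes concatenation.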

3.3. Layer 0

In the first layer, cross-attention is computed between the latent tokens and the full input sequence.
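$$Q_l = x_{\text{latent}} W_q^{l} \in \mathbb{R}^{l \times d}$$
$$K = x W_k \in \mathbb{R}^{n \times d}$$
$$V = x W_v \in \mathbb{R}^{n \times d}$$
The attention output for the first layer is
$$\text{Out}_{\text{layer } 0} = \text{Softmax}(Q_l K^T)\, V = A V \in \mathbb{R}^{l \times d}$$
where $A = \text{Softmax}(Q_l K^T)$ denotes the attention matrix.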

3.4. Layers $i$ ($i > 0$)

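After the first layer, the sequence length is reduced to the latent length $l$. Subsequent layers perform standard self-attention only on the latent sequence, so $x$ below denotes the $l \times d$ output of the previous layer.
$$Q = x W_q \in \mathbb{R}^{l \times d}$$
$$K = x W_k \in \mathbb{R}^{l \times d}$$
$$V = x W_v \in \mathbb{R}^{l \times d}$$
The output of layer $i$ is therefore
$$\text{Out}_{\text{layer } i} = \text{Softmax}(Q K^T)\, V = A V \in \mathbb{R}^{l \times d}$$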
The final classification head is applied to the representation produced by the last encoder layer. This is typically obtained either from the embedding of the final token or from the mean pooling of all token embeddings in the final latent sequence.
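The two kinds of layers described in Sections 3.3 and 3.4 can be sketched as follows. This is a simplified single-head illustration: the sizes `c`, `l`, `d` and the random weights are hypothetical, the $1/\sqrt{d}$ scaling is an added assumption (the simplified equations omit it), and any masking used by the actual PerceiverAR model is left out.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q_in, kv_in, Wq, Wk, Wv):
    """Queries come from q_in, keys and values from kv_in."""
    Q, K, V = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # scaling is an assumption
    return A @ V

rng = np.random.default_rng(0)
c, l, d = 12, 4, 8                                 # hypothetical context/latent/embedding sizes
x = rng.standard_normal((c + l, d))                # x = x_context || x_latent
x_latent = x[c:]

def rand_w():
    return rng.standard_normal((d, d)) * 0.1

# Layer 0: latent queries attend to the full sequence; output shrinks to (l, d).
out0 = attend(x_latent, x, rand_w(), rand_w(), rand_w())
# Layers i > 0: ordinary self-attention on the latent-sized sequence.
out1 = attend(out0, out0, rand_w(), rand_w(), rand_w())
print(out0.shape, out1.shape)                      # (4, 8) (4, 8)
```

Only layer 0 touches all $n = c + l$ tokens; every later layer works on $l$ tokens, which is where the savings come from when the network is deep.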
We build on two properties of PerceiverAR: the division of the input sequence into context and latent components, and the absorption of the context information into the latent after the first layer. Together they form the basis of the efficient, information-enhancing architecture that we describe in Section 4.

4. Proposed $DS^2$ Attention

The key ideas in $DS^2$ attention build on several recent SOTA architectures. The goal is to achieve a highly efficient attention design without any information loss, while potentially enhancing the contextual information. To reduce the $O(n^2)$ complexity of attention, the sequence of length $n$ must be processed in smaller partitions and in a structured manner. One such successful approach was developed in [10] for NLP, and recently adapted for Vision Transformers in [11]. Here, the input sequence (patch embeddings from an image) is divided into disjoint segments, each of size $s$. To allow information to propagate between segments, a PerceiverAR (PAR) attention is carried out on overlapping pairs of consecutive segments. The query $Q$ is computed on the second segment in a pair, while the key $K$ and the value $V$ are computed on the whole pair, whose length is $2s$. Thus, the attention computation for each pair of segments costs $s \times 2s = 2s^2$.
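As a rough worked count (an illustration derived from the description above, not a figure reported in the paper), let $n_s = n/s$ be the number of segments and assume that the boundary segment of a stream, which has no partner, attends only to itself. The number of attention-score entries per head is then linear in $n$ for a fixed segment size $s$:

```latex
\begin{aligned}
C_{\text{stream}} &= \underbrace{s^2}_{\text{boundary segment}}
  + \underbrace{(n_s - 1)\cdot 2s^2}_{\text{paired segments}}
  = 2ns - s^2, \\
C_{\text{full}} &= n^2 .
\end{aligned}
```

For a hypothetical $n = 196$ tokens and $s = 14$, this gives $2 \cdot 196 \cdot 14 - 14^2 = 5292$ entries, compared with $196^2 = 38416$ for full attention, roughly a sevenfold reduction.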
The idea of dividing the attention computation into overlapping segments has also been proposed for a different NLP objective in LongLoRA [24]. LongLoRA terms its attention approach shifted sparse attention (S2-Attn). This technique partitions the sequence into distinct groups that calculate attention independently. To ensure information flow between these groups, the attention heads are split into two halves; in the second half, tokens are shifted by half the group size, which allows for context sharing. As the information flows through the layers of the Transformer, the information exchange expands to all the segments, implicitly approximating full attention. The LongLoRA approach is primarily used to apply Low-Rank Adaptation (LoRA) to attention layers, extending the context length of pretrained Large Language Models (LLMs).
In our $DS^2$ attention, we combine and further enhance the concepts from both LongLoRA and EEViT-IP. We use overlapping segmented attention, with half the heads incorporating opposing information flow to extract better context. PerceiverAR attention is applied to overlapping pairs of segments in half the heads such that information propagation occurs from left to right. The remaining half of the heads reverse the PerceiverAR operation such that the query $Q$ is computed on the first segment in the pair, and the key $K$ and the value $V$ are computed on the entire pair. This allows the information to flow from right to left. Figure 3 and Figure 4 depict this dual-stream PerceiverAR-based processing in our $DS^2$ attention mechanism.
The attention computations taking place in the two streams are given by Equations (17)–(32). In the left-to-right stream, the computation in the first segment differs from that in the other segments: it performs regular full attention within the segment, because there is no preceding segment to complete the pair. The computation in all other segments is a PerceiverAR operation, with $Q$ computed on the second segment in the pair and $K$, $V$ on the entire pair. If $s$ is the segment size, then the attention equations governing the first segment in the left-to-right stream are described below.
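The head-splitting and shifting idea can be sketched in NumPy as below. This is a simplified illustration of the mechanism as described in the paragraph above, not the LongLoRA reference implementation; the function name `s2_attn`, the wrap-around behaviour of the shift, and all sizes are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def s2_attn(q, k, v, G):
    """q, k, v: (n, H, d_h). Attention within groups of G tokens;
    the second half of the heads sees the sequence shifted by G // 2."""
    n, H, d_h = q.shape
    assert n % G == 0 and H % 2 == 0

    def shift(t, by):
        t = t.copy()
        t[:, H // 2:] = np.roll(t[:, H // 2:], by, axis=0)  # circular shift (assumed)
        return t

    def group(t):                                   # (n, H, d_h) -> (n/G, H, G, d_h)
        return t.reshape(n // G, G, H, d_h).transpose(0, 2, 1, 3)

    qg, kg, vg = (group(shift(t, -(G // 2))) for t in (q, k, v))
    A = softmax(qg @ kg.transpose(0, 1, 3, 2) / np.sqrt(d_h))   # (n/G, H, G, G)
    out = (A @ vg).transpose(0, 2, 1, 3).reshape(n, H, d_h)
    return shift(out, G // 2)                       # undo the shift

rng = np.random.default_rng(0)
n, H, d_h, G = 16, 4, 8, 4                          # hypothetical sizes
q, k, v = (rng.standard_normal((n, H, d_h)) for _ in range(3))
out = s2_attn(q, k, v, G)
print(out.shape)                                    # (16, 4, 8)
```

In the unshifted heads token 0 only sees tokens 0 to 3, whereas in the shifted heads it shares a group with tokens on the other side of the group boundary, which is what lets context spread across groups over successive layers.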

4.1. Left-to-Right Attention Stream (LR)

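4.1.1. Segment $S_0$ Attention

$$Q_0^{LR} = x_{(0:s)} W_{q_0}^{LR} \in \mathbb{R}^{s \times d} \tag{17}$$
$$K_0^{LR} = x_{(0:s)} W_{k_0}^{LR} \in \mathbb{R}^{s \times d} \tag{18}$$
$$V_0^{LR} = x_{(0:s)} W_{v_0}^{LR} \in \mathbb{R}^{s \times d} \tag{19}$$
$$\text{Out}_{S_0}^{LR} = \text{Softmax}\left(Q_0^{LR} {K_0^{LR}}^{T}\right) V_0^{LR} = A_0^{LR} V_0^{LR} \in \mathbb{R}^{s \times d}. \tag{20}$$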

4.1.2. Segments $S_1, \ldots, S_{n-1}$ Attention

The PerceiverAR operation is performed on two consecutive segments. The operations producing the output corresponding to segment $i$ in the left-to-right stream are defined as follows, where $i = 1, \ldots, n-1$ and $n$ here denotes the number of segments.
$$Q_i^{LR} = x_{(si : s(i+1))} W_{q_i}^{LR} \in \mathbb{R}^{s \times d} \tag{21}$$
$$K_i^{LR} = x_{(s(i-1) : s(i+1))} W_{k_i}^{LR} \in \mathbb{R}^{2s \times d} \tag{22}$$
$$V_i^{LR} = x_{(s(i-1) : s(i+1))} W_{v_i}^{LR} \in \mathbb{R}^{2s \times d} \tag{23}$$
$$\text{Out}_{S_i}^{LR} = \text{Softmax}\left(Q_i^{LR} {K_i^{LR}}^{T}\right) V_i^{LR} = A_i^{LR} V_i^{LR} \in \mathbb{R}^{s \times d} \tag{24}$$
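The left-to-right stream can be sketched for a single head as follows. This is a minimal illustration rather than the authors' implementation: it assumes the projection weights are shared across segments, adds a $1/\sqrt{d}$ scaling that the equations above omit, and uses hypothetical sizes.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lr_stream(x, Wq, Wk, Wv, s):
    """Left-to-right segmented attention for one head. x: (n_tokens, d)."""
    n_tokens, d = x.shape
    assert n_tokens % s == 0
    outs = []
    for i in range(n_tokens // s):
        lo = max(i - 1, 0) * s                 # segment 0 has no left neighbour
        q_in = x[i * s:(i + 1) * s]            # queries: segment i only
        kv_in = x[lo:(i + 1) * s]              # keys/values: segments i-1 and i
        Q, K, V = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
        outs.append(softmax(Q @ K.T / np.sqrt(d)) @ V)   # (s, d)
    return np.concatenate(outs, axis=0)

rng = np.random.default_rng(0)
s, d, n_seg = 3, 8, 4                          # hypothetical sizes
x = rng.standard_normal((s * n_seg, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.3 for _ in range(3))  # shared weights (assumed)
out = lr_stream(x, Wq, Wk, Wv, s)
print(out.shape)                               # (12, 8)
```

Within one layer, a change in segment 0 reaches segment 1 but not segment 2; stacking layers lets it travel one further segment to the right per layer.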

4.2. Right-to-Left Attention Stream (RL)

In the right-to-left stream, the computation in the final segment differs from the others because there is no subsequent segment to complete the PerceiverAR pair. Therefore, the final segment performs standard self-attention. All other segments perform Perceiver-style attention where Q is computed on the left segment of the pair while K and V are computed on the entire pair, enabling information propagation from right to left.

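4.2.1. Segment $S_{n-1}$ Attention

$$Q_{n-1}^{RL} = x_{(s(n-1) : sn)} W_{q_{n-1}}^{RL} \in \mathbb{R}^{s \times d} \tag{25}$$
$$K_{n-1}^{RL} = x_{(s(n-1) : sn)} W_{k_{n-1}}^{RL} \in \mathbb{R}^{s \times d} \tag{26}$$
$$V_{n-1}^{RL} = x_{(s(n-1) : sn)} W_{v_{n-1}}^{RL} \in \mathbb{R}^{s \times d} \tag{27}$$
$$\text{Out}_{S_{n-1}}^{RL} = \text{Softmax}\left(Q_{n-1}^{RL} {K_{n-1}^{RL}}^{T}\right) V_{n-1}^{RL} = A_{n-1}^{RL} V_{n-1}^{RL} \in \mathbb{R}^{s \times d} \tag{28}$$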

4.2.2. Segments $S_0, \ldots, S_{n-2}$ Attention

The PerceiverAR operation is performed on two consecutive segments. The operations producing the output corresponding to segment $i$ in the right-to-left stream are defined below, where $i = 0, \ldots, n-2$.
$$Q_i^{RL} = x_{(si : s(i+1))} W_{q_i}^{RL} \in \mathbb{R}^{s \times d} \tag{29}$$
$$K_i^{RL} = x_{(si : s(i+2))} W_{k_i}^{RL} \in \mathbb{R}^{2s \times d} \tag{30}$$
$$V_i^{RL} = x_{(si : s(i+2))} W_{v_i}^{RL} \in \mathbb{R}^{2s \times d} \tag{31}$$
$$\text{Out}_{S_i}^{RL} = \text{Softmax}\left(Q_i^{RL} {K_i^{RL}}^{T}\right) V_i^{RL} = A_i^{RL} V_i^{RL} \in \mathbb{R}^{s \times d} \tag{32}$$
The output of each attention operation has the size of one segment, i.e., $\text{Out} \in \mathbb{R}^{s \times d}$. For all overlapping segment pairs in the right-to-left stream, the query matrix $Q \in \mathbb{R}^{s \times d}$ is computed on the left segment of the pair, while the key and value matrices $K, V \in \mathbb{R}^{2s \times d}$ are computed on the entire pair.
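The right-to-left stream mirrors the left-to-right one, with each segment querying itself and its right neighbour. The sketch below is a single-head illustration with assumed shared weights across segments, an added $1/\sqrt{d}$ scaling, and hypothetical sizes.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rl_stream(x, Wq, Wk, Wv, s):
    """Right-to-left segmented attention for one head. x: (n_tokens, d)."""
    n_tokens, d = x.shape
    assert n_tokens % s == 0
    n_seg = n_tokens // s
    outs = []
    for i in range(n_seg):
        hi = min(i + 2, n_seg) * s             # last segment has no right neighbour
        q_in = x[i * s:(i + 1) * s]            # queries: segment i only
        kv_in = x[i * s:hi]                    # keys/values: segments i and i+1
        Q, K, V = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
        outs.append(softmax(Q @ K.T / np.sqrt(d)) @ V)   # (s, d)
    return np.concatenate(outs, axis=0)

rng = np.random.default_rng(1)
s, d, n_seg = 3, 8, 4                          # hypothetical sizes
x = rng.standard_normal((s * n_seg, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.3 for _ in range(3))  # shared weights (assumed)
out = rl_stream(x, Wq, Wk, Wv, s)
print(out.shape)                               # (12, 8)
```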
Since the patches representing the segments on the right of a given segment are different from those on the left, we combine the visual context obtained in the left-to-right (LR) and right-to-left (RL) attention streams across corresponding halves of the attention heads in a learnable manner. Figure 5 illustrates the proposed $DS^2$ architecture. For simplicity, the computations within a single head from the LR and RL streams are shown. All layers in the $DS^2$ architecture share the same structure as depicted in Figure 5.
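One plausible way to realize the learnable combination is to concatenate the LR and RL head outputs along the feature axis and apply an output projection, as in standard multi-head attention. The sketch below makes exactly that assumption; the function names, the shared per-head weights across segments, and the sizes are hypothetical, and the fusion actually used in the paper may differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def stream_head(x, Wq, Wk, Wv, s, direction):
    """One head of segmented Perceiver-style attention; direction is 'LR' or 'RL'."""
    n_seg = x.shape[0] // s
    outs = []
    for i in range(n_seg):
        if direction == "LR":
            lo, hi = max(i - 1, 0), i + 1          # previous segment + own segment
        else:
            lo, hi = i, min(i + 2, n_seg)          # own segment + next segment
        q_in, kv_in = x[i * s:(i + 1) * s], x[lo * s:hi * s]
        Q, K, V = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
        outs.append(softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V)
    return np.concatenate(outs, axis=0)            # (n_tokens, d_h)

def ds2_layer(x, W, Wo, s):
    """W: list of (Wq, Wk, Wv) per head; first half of the heads LR, second half RL."""
    H = len(W)
    heads = [stream_head(x, *W[h], s, "LR" if h < H // 2 else "RL") for h in range(H)]
    return np.concatenate(heads, axis=-1) @ Wo     # assumed learnable fusion via W_o

rng = np.random.default_rng(0)
s, n_seg, d, H = 3, 4, 8, 2                        # hypothetical sizes
d_h = d // H
x = rng.standard_normal((s * n_seg, d))
W = [tuple(rng.standard_normal((d, d_h)) * 0.3 for _ in range(3)) for _ in range(H)]
Wo = rng.standard_normal((d, d)) * 0.3
out = ds2_layer(x, W, Wo, s)
print(out.shape)                                   # (12, 8)
```

After a single layer, a change in segment 2 affects the outputs of segments 1, 2 and 3 but not segment 0, showing that the two streams together give every segment access to both of its neighbours.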
To further enhance the $DS^2$ architecture, we introduce segment summary tokens. These tokens provide an additional cross-attention pathway between the full input sequence of size $n$ and the summary segment, denoted by $s_s$. The summary segment contains one token per segment, i.e., $n_s$ tokens. The computations are described as follows.
$$Q_{\text{summary}} = s_s W_q^{\text{summary}} \in \mathbb{R}^{n_s \times d} \tag{33}$$
$$K_n = x W_k \in \mathbb{R}^{n \times d} \tag{34}$$
$$V_n = x W_v \in \mathbb{R}^{n \times d} \tag{35}$$
$$\text{Out}_{\text{summary}} = \text{Softmax}\left(Q_{\text{summary}} K_n^T\right) V_n = A_{n, ss} V_n \in \mathbb{R}^{n_s \times d} \tag{36}$$
The above concept is somewhat similar to the idea of carrier tokens as employed in FasterViT [12]. It helps in accumulating the global context in a more effective manner without causing a loss in inductive bias that is obtained by the gradual context-building in the two attention streams.
After the model has learned local relationships between patches, we further enhance global context modeling with a reverse cross-attention operation in which the image tokens attend to the summary segment. This enables image tokens to absorb global contextual information aggregated at the segment level. The reverse cross-attention formulation is defined in Equations (37)–(40).
$$Q_n = x W_q^{n} \in \mathbb{R}^{n \times d} \tag{37}$$
$$K_{ss} = s_s W_k^{ss} \in \mathbb{R}^{n_s \times d} \tag{38}$$
$$V_{ss} = s_s W_v^{ss} \in \mathbb{R}^{n_s \times d} \tag{39}$$
$$\text{Out}_n = \text{Softmax}\left(Q_n K_{ss}^T\right) V_{ss} = A_{ss, n} V_{ss} \in \mathbb{R}^{n \times d} \tag{40}$$
Since the summary segment contains only n s tokens (equal to the number of segments), this reverse cross-attention introduces negligible additional computational overhead. Moreover, it is applied only in later stages of training to preserve the inductive bias of Vision Transformers, which benefits from a gradual transition from local feature aggregation to global context modeling.
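Both summary pathways, summary tokens attending to the image tokens and image tokens attending back to the summaries, can be sketched with one helper. The sizes, the random weights, the $1/\sqrt{d}$ scaling, and the choice to feed the updated summaries into the reverse step are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attend(q_in, kv_in, Wq, Wk, Wv):
    """Queries from q_in, keys/values from kv_in; returns the output and the score matrix."""
    Q, K, V = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # scaling is an assumption
    return A @ V, A

rng = np.random.default_rng(0)
n, s, d = 196, 14, 32                              # hypothetical sizes
n_s = n // s                                       # one summary token per segment
x = rng.standard_normal((n, d))                    # image tokens
ss = rng.standard_normal((n_s, d)) * 0.02          # summary segment s_s

def rand_w():
    return rng.standard_normal((d, d)) * 0.1

# Forward pathway: summaries query the full sequence.
out_summary, A_fwd = cross_attend(ss, x, rand_w(), rand_w(), rand_w())
# Reverse pathway: image tokens query the (updated) summaries.
out_n, A_rev = cross_attend(x, out_summary, rand_w(), rand_w(), rand_w())

print(out_summary.shape, A_fwd.shape)              # (14, 32) (14, 196)
print(out_n.shape, A_rev.shape)                    # (196, 32) (196, 14)
```

Each score matrix has $n \cdot n_s$ entries (2744 in this example) rather than $n^2$ (38416), which is why the extra pathways add little cost when $n_s \ll n$.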
Our $DS^2$ attention design is highly efficient, as attention is computed only between paired segments rather than across the full token sequence. The dual information-propagating streams, together with learnable stream aggregation and summary-segment cross-attention, enable enhanced contextual understanding and more effective long-range feature interaction. We perform detailed experiments and compare our design with other SOTA designs in Section 5.

5. Experimental Results

We tested our architecture on different image classification datasets. Table 1 shows the characteristics of the datasets on which models were trained and tested.
For ImageNet-100 and ImageNet-1K datasets, we resized the images to 224 × 224 . Since the purpose of our experiments was to compare the relative effectiveness of different architectures, we used standard data augmentations from RandAugment [25]. This method randomly samples N transformations from a pool of operations such as AutoContrast, cutout, affine transformations (shear, translate, rotate), color enhancements (contrast, brightness, sharpness), and pixel-level operations (solarize, equalize, posterize). Following standard practices for Vision Transformers, the magnitude of these transformations was scaled according to the dataset resolution to maintain consistent regularization strength across different experimental scales.
To ensure a fair comparison between the underlying Transformer architectures, we deliberately excluded complex composite augmentations such as CutMix and Mixup. Since these techniques are known to act as strong regularizers that can mask architectural deficiencies or inflate performance in data-hungry models, omitting these allows for a clearer evaluation of the model’s inherent inductive biases and attention performance.
Our training hyperparameters for each dataset are summarized in Table 2. We follow standard training practices commonly used for Vision Transformers, employing a warmup period of five epochs together with a linear-then-cosine learning rate schedule. Specifically, the learning rate is initialized at $3 \times 10^{-4}$, increased linearly during the warmup stage to reduce gradient instability, and subsequently decayed using a half-period cosine schedule to encourage stable convergence during later stages of optimization. All models are trained using the AdamW optimizer with a weight decay of 0.05, a dropout rate of 0.08, and gradient clipping with a threshold of 0.3. ImageNet-1K and ImageNet-100 models are trained for 200 epochs using a batch size of 256, while CIFAR-100 and Tiny-ImageNet are trained for 100 and 70 epochs, respectively, using batch sizes of 128 due to their smaller training set sizes. Label smoothing is applied only for the ImageNet-based experiments. All models are evaluated under identical data augmentation and training/testing protocols to ensure fair comparison. We conduct each training run three times and report the average top-one accuracy of these three runs.
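The warmup-then-cosine schedule described above can be sketched as a small function. This is one plausible reading, not the authors' code: it assumes that $3 \times 10^{-4}$ is the peak rate reached at the end of the five warmup epochs, that warmup starts from a fraction of that peak, and that the cosine decay ends near zero. The function and parameter names are hypothetical.

```python
import math

def learning_rate(epoch, total_epochs=200, warmup_epochs=5, peak_lr=3e-4):
    """Per-epoch learning rate: linear warmup, then half-period cosine decay."""
    if epoch < warmup_epochs:
        # Linear ramp that reaches peak_lr at the last warmup epoch.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warmup phase that has elapsed, in [0, 1).
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

schedule = [learning_rate(e) for e in range(200)]
print(schedule[0], schedule[4], schedule[100], schedule[199])
```

Under these assumptions the rate rises over epochs 0 to 4, holds its peak at the start of epoch 5, and then decreases monotonically until the end of training.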
Extended ablation studies examining the contribution of individual architectural components, segmentation granularity, and training dynamics are provided in Appendix A (Tables A1 and A2; Figures A1 and A2).
The architectural configurations for all models, including embedding dimensions (dims), number of attention heads (heads), and stage depths (depths), are summarized in Table 3. All models are trained from scratch without using pretrained weights. All models are of comparable size, ranging from approximately 15 to 23 million parameters.
Table 4 presents the classification results comparing our D S 2 -based Transformer with popular state-of-the-art (SOTA) Vision Transformers across different datasets. For comparison with a linear attention design, we implement the Linformer [3] for a vision task. Transformer models including ViT, PerceiverAR, Linformer, and EEViT-IP use an embedding dimension of 384, with six heads per layer and eight layers. The Linformer implementation uses 224 × 224-sized images with a patch size of 14 × 14, resulting in a sequence length of 256 tokens. Our Linformer implementation uses a compression of the sequence length to k = 64.
We present results for two variations of our D S 2 attention-based Transformer models. The first uses standard non-overlapping input patch tokenization. Since many current SOTA models use hierarchical input patch tokenization, our D S 2 -CP model employs CNN-based input patch tokenization. For 224 × 224 images, we use a 7 × 7 kernel with strides of two and five in successive layers to downsample the image from 224 × 224 to 22 × 22 , resulting in 484 tokens with 128 channels. Thus, the input to our D S 2 -CP Transformer consists of 484 tokens with an initial embedding dimension of 128.
To further improve efficiency, we introduce controllable patch merging between stages in our D S 2 -CP variant. Our multi-stage design progressively integrates local spatial features with global context. At stage transitions, the model employs a convolutional downsampling module to simultaneously reduce spatial resolution and expand feature dimensionality. Unlike standard linear patch merging, this module incorporates a depthwise-separable convolution block (a 3 × 3 depthwise convolution followed by a 1 × 1 pointwise convolution) to provide an explicit local inductive bias, followed by a strided convolution (stride = 2) to condense the token sequence.
This downsampling mechanism is strategically applied to specific stages (e.g., Stage 3) to manage the computational complexity of subsequent dual-stream attention layers. By transforming flattened token sequences back into a 2D spatial grid for these operations, the architecture maintains overlapping local context and structural integrity as the feature map transitions from high-resolution local processing to low-resolution global abstraction.
To further analyze the effect of stage-wise design decisions on training dynamics, we also investigate when to introduce global context during optimization; the corresponding ablation experiments are reported in Appendix A, Table A3.
As can be seen from Table 4, our D S 2 -CP model not only provides the most efficient attention but also outperforms other SOTA models in terms of top-one accuracy. From the results in Table 4, it can also be observed that patch tokenization and patch merging contribute positively to model performance, both in terms of classification accuracy and computational efficiency, in addition to the attention mechanism. This is why incorporating CNN-based patch tokenization and patch merging into our D S 2 attention mechanism significantly enhances model performance. While linear attention models such as Linformer [3] and Performer [5] perform well on NLP tasks, current linear attention approaches either suffer from performance degradation or introduce additional computation overhead [26]. In the next subsection, we incorporate some of the best patch tokenization approaches from SOTA Vision Transformer models into our D S 2 attention.

Enhancing T2T and FasterViT with Our DS2 Attention

Two examples of Vision Transformer (ViT) models that achieve high performance with a focus on the patching process are T2T and FasterViT. In both models, the patching process moves away from the non-overlapping patches of the original ViT to better capture local structural information and spatial correlations.
T2T utilizes a recursive “Tokens-to-Token” module that performs a restructuring of the token sequence. It unfolds the flattened tokens back into a 2D spatial grid, applies an overlapping sliding window to group neighboring tokens, and then re-projects them into a single token. This progressively reduces the sequence length while aggregating local neighborhood details. However, T2T uses standard full attention after its patching modules. Since our D S 2 attention is not only highly efficient but also improves the inductive bias by implicitly aggregating information from different segments across layers, we replace the T2T full attention with our D S 2 attention. T2T uses 196 tokens in its last stage (i.e., 14 × 14 ). Since many layers are used in the last stage (12 in our implementation), this replacement benefits both computational efficiency and classification accuracy.
Similarly, FasterViT relies on a hierarchical morphological approach, utilizing overlapping convolutions or patch-embedding layers to create a multi-scale representation. By using overlapping patches, both architectures ensure that pixels at patch boundaries are not isolated, effectively preserving local continuity and fine-grained features that are often lost in standard ViT patching. To process 224 × 224 images, a FasterViT design typically uses four stages, with the last two stages employing full attention between 14 × 14 and 7 × 7 tokens (i.e., 196 and 49 tokens). We replace the attention in the last two stages of FasterViT with our D S 2 attention.
Table 5 presents the top-one classification accuracy results on three datasets, comparing the original T2T and FasterViT models with their DS2-enhanced versions. For comparison, results for SwinV2 [19], PVT-V2 [21], and Twins-SVT [20] are also included. In the upper section of Table 5, the ImageNet-1K training is carried out for 200 epochs using the same augmentations as the DeiT training protocol, except for Mixup. We do not use Mixup in this set of results in order to highlight the architectural differences between models.
However, to conform to protocols in the reported literature, we also follow the standard DeiT training protocol, with the exception of repeated augmentation, and report these results in the lower section of Table 5. The DeiT protocol combines repeated augmentation (omitted here), RandAugment, random erasing, Mixup, and CutMix regularization strategies. Mixup and CutMix are applied with coefficients of 0.8 and 1.0, respectively, while label smoothing is set to 0.1. Training additionally utilizes gradient clipping with a maximum norm of 1.0 and an exponential moving average (EMA) of model weights with a decay factor of 0.99996 to improve training stability and evaluation performance.
The reason for omitting repeated augmentation sampling is that it is primarily designed for distributed multi-GPU training, where different augmented views of the same sample can be processed simultaneously across workers to improve sample diversity and regularization efficiency. Since our ImageNet-1K experiments were conducted on a single GPU, repeated augmentation sampling was not employed, as its benefits are substantially reduced in non-distributed training settings while introducing additional computational overhead.
Figures 6 and 7 depict the relative performance gains over the baselines when using our DS2 attention in the respective models.
From Table 5 and Figures 6 and 7, it can be observed that while the classification accuracy gains over the T2T and FasterViT baselines are relatively small when replacing full attention with DS2 attention, the computational efficiency improves noticeably. For example, when training on the ImageNet-1K dataset with a batch size of 256, FasterViT takes approximately 21 minutes to complete one epoch on an NVIDIA RTX 4090 GPU. Replacing its attention with DS2 attention reduces the per-epoch time to about 16 minutes.
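As a quick worked check of these per-epoch timings, the relative saving and the equivalent speedup follow directly from the two reported numbers. The 200-epoch totals are only an extrapolation, assuming the approximate per-epoch times hold for an entire run:

```latex
\[
\frac{21 - 16}{21} \approx 0.238 \;(\text{about } 24\% \text{ less time per epoch}),
\qquad
\frac{21}{16} \approx 1.31 \;(\text{speedup factor})
\]
\[
200 \times 21\ \text{min} = 70\ \text{h},
\qquad
200 \times 16\ \text{min} \approx 53.3\ \text{h}
\]
```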
While we have demonstrated improvements in two state-of-the-art ViT architectures by replacing their attention mechanisms with DS2 attention, this approach is broadly applicable. In principle, any ViT architecture that relies on full attention can adopt our efficient and context-enhancing DS2 attention mechanism as a replacement.

6. Discussion

By breaking the input sequence of tokens (i.e., patch embeddings) into small segments and performing a PerceiverAR attention computation on pairs of overlapping segments, we achieve linear complexity in our attention computation. In a PerceiverAR computation, if the Q operation is performed on the right segment in a pair, and K , V are computed on the entire pair, then information is propagated from the left segment to the right segment. Similarly, if the Q operation is performed on the left segment in a pair, information propagates from the right segment to the left segment.
This is the key idea in our D S 2 attention, where half of the heads perform left-to-right information propagation, and the other half perform right-to-left context propagation. We visually depict this phenomenon in Figure 8. A more detailed visualization of the attention computation pattern and its scaling behavior is provided in Appendix B (Figure A3, Figure A4 and Figure A5).
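The two propagation directions can be illustrated with a small NumPy sketch of one attention head per stream. This is a simplified reading of the description above, not the released implementation: learned projections are omitted (queries, keys, and values are the raw tokens), the conventional $1/\sqrt{d}$ scaling is assumed, and the function name `pair_attention` and its `direction` argument are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pair_attention(x, s, direction):
    """One head of segmented pair attention over tokens x of shape (n, d).

    direction="l2r": queries from segment i, keys/values from segments i-1 and i.
    direction="r2l": queries from segment i, keys/values from segments i and i+1.
    The boundary segment without a neighbor falls back to plain self-attention.
    """
    n, d = x.shape
    assert n % s == 0, "sequence length must be a multiple of the segment size"
    n_seg = n // s
    out = np.empty_like(x)
    for i in range(n_seg):
        q = x[i * s:(i + 1) * s]
        if direction == "l2r":
            lo, hi = max(i - 1, 0) * s, (i + 1) * s
        else:
            lo, hi = i * s, min(i + 2, n_seg) * s
        kv = x[lo:hi]                                # s or 2s tokens
        attn = softmax(q @ kv.T / np.sqrt(d))        # (s, s) or (s, 2s)
        out[i * s:(i + 1) * s] = attn @ kv
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))        # 16 tokens, 4 segments of size 4
y_l2r = pair_attention(x, 4, "l2r")
y_r2l = pair_attention(x, 4, "r2l")
print(y_l2r.shape, y_r2l.shape)
```

Perturbing a token in the first segment changes the left-to-right output of the second segment after one layer but leaves later segments untouched, which is the one-segment-per-layer propagation that the discussion relies on.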
As shown in Figure 8, we illustrate the progressive expansion of the effective receptive field in DS2 attention across layers. Due to the overlapping segmented attention mechanism, information is gradually propagated between neighboring segments at each layer. After $n_s - 1$ layers (where $n_s$ is the number of segments), information originating from the leftmost segment can reach the rightmost segment through successive local exchanges. Symmetrically, information from the rightmost segment can propagate to the leftmost segment over the same depth. At intermediate segment positions, information from both spatial extremes becomes accessible after approximately $(n_s - 1)/2$ layers, as illustrated for the case of eight segments in Figure 8.
Overall, Figure 8 provides an illustrative depiction of how local interactions progressively give rise to broader contextual mixing over multiple layers. The combined effect of overlapping segments, bidirectional propagation, and summary-token cross-attention enables efficient long-range dependency modeling without explicitly computing full quadratic attention.
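The layer counts quoted above can be reproduced with a small set-based simulation. It is a sketch under a simplifying assumption: after every layer the two streams are merged, so each segment inherits everything its left and right neighbors had already seen. The function name is hypothetical, and the summary-token pathway is deliberately ignored.

```python
def layers_to_full_coverage(n_seg):
    """For each segment, the number of layers after which it has seen all segments."""
    reach = [{i} for i in range(n_seg)]   # segments visible to segment i so far
    needed = [None] * n_seg
    layer = 0
    while True:
        for i in range(n_seg):
            if needed[i] is None and len(reach[i]) == n_seg:
                needed[i] = layer
        if None not in needed:
            return needed
        # One layer: merge what the left neighbor (left-to-right stream) and the
        # right neighbor (right-to-left stream) had seen before this layer.
        reach = [
            reach[i]
            | (reach[i - 1] if i > 0 else set())
            | (reach[i + 1] if i < n_seg - 1 else set())
            for i in range(n_seg)
        ]
        layer += 1

print(layers_to_full_coverage(8))
```

For eight segments, the two end segments need seven layers ($n_s - 1$) to see the opposite end, while the two middle segments need four, in line with the approximate $(n_s - 1)/2$ figure.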
The computational complexity of DS2 attention is given by
$$A = 2 \times \left[ \left( s^2 + (s \times 2s) \right) \times \left( \frac{n}{s} - 1 \right) \times \frac{h}{2} \right] \times d_h \times l = O(n) \tag{41}$$
Here, $n$ is the sequence length, which is divided into $n/s$ segments of size $s$. The number of overlapping segment pairs is $(n/s - 1)$, $h$ is the number of attention heads, $d_h$ is the per-head embedding dimension, and $l$ is the number of layers in the Transformer. The $s^2$ term corresponds to the regular attention performed in the first and last segments, as they do not form overlapping pairs. The $(s \times 2s)$ term represents the PerceiverAR computation over each pair of segments. Since $s$, $h$, $d_h$, and $l$ are constants, the overall complexity of DS2 attention remains linear in $n$, i.e., $O(n)$.
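To make the linear growth concrete, the following sketch evaluates the cost expression above exactly as printed and compares it with a full-attention cost of $n^2 \times h \times d_h \times l$. That comparison formula is an assumption chosen to use the same units. The values s = 64, h = 6, d_h = 64, and l = 8 are illustrative, loosely following the six-head, eight-layer, 384-dimensional configuration mentioned in Section 5.

```python
def ds2_attention_cost(n, s, h, d_h, l):
    """Cost expression for segmented dual-stream attention, as printed above."""
    assert n % s == 0, "n must be a multiple of the segment size s"
    pairs = n // s - 1                      # number of overlapping segment pairs
    return 2 * ((s**2 + s * 2 * s) * pairs * h / 2) * d_h * l

def full_attention_cost(n, h, d_h, l):
    """Assumed reference cost of full self-attention in the same units."""
    return n**2 * h * d_h * l

s, h, d_h, l = 64, 6, 64, 8                 # illustrative constants
for n in (1024, 2048, 4096):
    ds2 = ds2_attention_cost(n, s, h, d_h, l)
    full = full_attention_cost(n, h, d_h, l)
    print(n, ds2, full, round(full / ds2, 2))
```

Each additional block of tokens adds the same amount to the segmented cost, whereas the full-attention cost quadruples whenever n doubles, so the ratio between the two widens as the sequence grows.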
A detailed breakdown of this computation pattern is provided in Appendix B, where we further illustrate the segment-wise attention mechanism and its computational structure.
Another important contribution of D S 2 attention is that it implicitly improves the inductive bias for vision. The model gradually builds context from smaller segments, eventually forming a global representation, as depicted in Figure 8. This progressive aggregation of information is one of the primary reasons that D S 2 -based Transformers achieve better classification accuracy than full-attention ViTs.
We further enhance classification performance by introducing a summary segment whose size equals the number of segments, effectively assigning one summary token per segment. By performing cross-attention between the full input sequence and this summary segment, the model learns a global representation in a structured and efficient manner. This idea is analogous to the use of carrier tokens in FasterViT.
While D S 2 attention is highly efficient and improves inductive bias, it becomes even more effective when combined with enhanced patch tokenization schemes such as those used in T2T and FasterViT. We demonstrate this by integrating T2T and FasterViT patch tokenization front ends with D S 2 attention to construct highly efficient Vision Transformers. The D S 2 attention design is also highly scalable. For very-high-resolution images, if GPU memory constraints limit attention computation, smaller segment sizes can be used to reduce memory requirements. Additionally, unlike Swin, Twins, and FasterViT, D S 2 does not require input image sizes to conform to specific window divisibility constraints. The only requirement is that the input sequence be divisible into segments of equal size. If this condition is not met, a small number of tokens can be padded to ensure divisibility without affecting performance.
For ViTs, FLOPs (Floating Point Operations) serve as the primary proxy for computational complexity, directly determining the workload a hardware device must perform to process an image. While parameters dictate the model’s memory footprint on the GPU, FLOPs are the bottleneck for inference latency and throughput. This is especially important for real-time applications such as autonomous vehicles and robots. Table 6 presents the GigaFLOP counts for some of the SOTA models examined in this work.
FLOPs in Table 6 were measured using the fvcore library [27]. As shown, the models that incorporate our D S 2 attention are more efficient in terms of inference computational cost compared to other models. Our models also achieve higher classification accuracy, as indicated by the results in Table 4 and Table 5. Additional implementation details and extended analyses are provided in Appendix A and Appendix B.

Limitations

While DS2 demonstrates strong performance across multiple Vision Transformer backbones and image classification benchmarks, one limitation is that the benefit of segmented information propagation may diminish for very small image resolutions or extremely short token sequences. In such regimes, the additional segmentation structure may provide limited practical advantage over standard self-attention and may even be more compute-intensive.
Additionally, the proposed segmented attention mechanism assumes that the token sequence can be partitioned into uniform segments. When the sequence length is not divisible by the segment size $s$, zero padding is applied to the final segment to maintain consistent segment dimensions during attention computation. Although the computational overhead introduced by this padding is minimal, excessive padding in highly irregular sequence configurations may slightly reduce efficiency.
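The padding step can be sketched in plain Python as follows. The helper name `pad_tokens` is hypothetical, and the example pairs the 484-token sequence of the DS2-CP front end with an assumed segment size of 64, a combination chosen only to show a case that needs padding.

```python
def pad_tokens(tokens, s):
    """Append zero tokens so that the sequence length is a multiple of s.

    tokens: list of equal-length feature vectors (lists of floats).
    Returns the padded list and the number of padding tokens added.
    """
    d = len(tokens[0])
    pad = (-len(tokens)) % s                 # 0 when already divisible
    padded = tokens + [[0.0] * d for _ in range(pad)]
    return padded, pad

tokens = [[1.0] * 4 for _ in range(484)]     # 22 x 22 tokens, toy dimension 4
padded, pad = pad_tokens(tokens, 64)
print(len(padded), pad, len(padded) // 64)
```

In this example 28 zero tokens are appended, giving 512 tokens, or eight segments of 64. In an actual model the padded positions would normally also be masked out of the softmax, which this sketch does not cover.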

7. Conclusions

As the reasoning capability of recent LLMs grows, they are increasingly expected to understand, process, and generate image and video data. A unified Transformer-based architecture is therefore a natural choice for VLMs. The key challenges in a Transformer architecture for vision include efficient attention computation and the lack of a natural inductive bias for vision. We present an effective solution to these problems in this work. We devise a highly efficient dual-stream information-propagating attention design, termed DS2 attention, that breaks the input sequence into small overlapping segments. By performing a PerceiverAR-style computation between a pair of overlapping segments, we can propagate information from left to right across the input segments as processing proceeds through the layers of the Transformer. Since the patch tokens to the left of an image patch are different from those to the right of it, we use a reverse PerceiverAR computation in half the heads of the Transformer to achieve right-to-left information flow between segments. By combining the two streams in a learnable way, we are able to achieve the equivalent of full global attention after only a few layers. This not only results in a very efficient attention mechanism, but also improves the inductive bias, as context is gradually built up from individual segments to a larger receptive field.
To improve the classification performance, we use an extra segment to act as the summary of all segments. In each layer of the Transformer, we apply cross-attention from all other segments to this summary segment to build the global context. Our D S 2 attention-based ViT achieves some of the best classification results for a pure Transformer-only design with simple patching input. To further demonstrate the usefulness of our work, we replace the attention modules in popular SOTA designs of T2T and FasterViT with our D S 2 attention, resulting in highly efficient architectures with slightly improved classification performance. One of the important contributions of our work is its scalability to very long sequences because of its linear complexity.
Our future work involves applying the D S 2 attention architecture to video comprehension. The sequence of frames in a video can be converted to a sequence of overlapping segments to propagate information from one frame to another. The summary segment used in our design can then be used to accumulate the information through cross-attention between frame segments and the summary segment to provide an effective visual understanding.

Author Contributions

Conceptualization, R.M. and K.E.; methodology, R.M., S.P. and K.E.; software, R.M.; validation, R.M.; formal analysis, R.M.; investigation, R.M.; writing—review and editing, K.E. and S.P.; visualization, R.M.; supervision, K.E. and S.P.; project administration, K.E. and S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code presented in this study is openly available on GitHub at: https://github.com/rigel-mahmood/DS2Attention (accessed on 5 April 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Ablation Study of DS2 Attention Enhancements

Our D S 2 attention design presents four significant enhancements to the standard full attention used in Vision Transformers (ViTs). These include:
1. Segmented input context processing using PerceiverAR-style information propagation.
2. Further enhancement of information flow using left-to-right and right-to-left dual-stream processing.
3. Addition of an extra summary segment to propagate global context via cross-attention between the input context segments and the summary segment.
4. Introduction of reverse cross-attention from the summary segment to the input context to further enhance global context. This is introduced later in training.
While all of the above contribute positively to classification performance, we conduct an ablation study to determine the effect of each individual enhancement. Table A1 presents the top-one accuracy results for different enhancements on two datasets for our D S 2 -base model with standard non-overlapping input patching.
Table A1. Top-one accuracy (%) for different architectural enhancements of the attention mechanism in DS2-base.
Architecture Variant | CIFAR-100 | ImageNet-100
Left-to-Right Segmented Stream | 67.64 | 76.00
Right-to-Left Segmented Stream | 67.23 | 76.18
Dual-Stream Attention | 71.09 | 78.32
Dual Stream + Summary Attention | 72.13 | 78.74
Dual Stream + Summary + Global Context | 72.40 | 79.16
As observed from the ablation results in Table A1, dual-stream processing contributes the most significant improvement in classification accuracy. The left-to-right segmented stream behaves similarly to the right-to-left propagation stream, with the primary difference arising from the contextual information accumulated in the early layers. As a result, the performance difference between the two directional streams remains below 0.5%. However, both propagation directions remain important because each stream aggregates contextual information from a different spatial perspective. Specifically, left-to-right propagation progressively incorporates information from preceding segments, while right-to-left propagation accumulates information from subsequent segments. Although both streams theoretically achieve equivalent receptive coverage after $n_s - 1$ layers, the intermediate context formation and feature interactions differ throughout the propagation process. This complementary bidirectional aggregation enables richer contextual modeling than either directional stream alone. The accumulation of global information via the additional summary segment, along with cross-attention between the input context and the summary segment, further improves performance in a noticeable manner. The reverse cross-attention between the summary segment and the input context, introduced later in training, provides an additional performance gain.
For image recognition, it is important that local features are learned first in order to build a strong inductive bias. By prioritizing these local relationships early, the model is guided toward meaningful feature hierarchies instead of relying on global correlations, which are less effective for vision tasks. Therefore, in our D S 2 attention design, global context is enhanced through reverse cross-attention between the summary segment and the input sequence later in training.
Table A2 presents the effect of varying the number of segments on the classification performance of the proposed D S 2 -base architecture. The results demonstrate that segmentation granularity significantly influences model effectiveness, as overly coarse or overly fine partitioning can reduce representational quality and long-range information propagation efficiency. For smaller-resolution datasets such as CIFAR-100, the best performance is achieved using 8 segments, while higher-resolution 224 × 224 ImageNet-100 images benefit from a finer partitioning strategy with 16 segments. These results suggest that the optimal number of segments should scale with image resolution in order to balance local feature aggregation and global contextual modeling.
Table A2. Effect of number of segments on top-one accuracy (%) for our DS2-base model. Values marked with * indicate the highest top-one accuracy achieved for each dataset under different numbers of segments.
Number of Segments | CIFAR-100 Segment Size | CIFAR-100 Accuracy | ImageNet-100 Segment Size | ImageNet-100 Accuracy
4 | 16 | 72.19 | 64 | 78.67
8 | 8 | 72.40* | 32 | 78.84
16 | 4 | 71.83 | 16 | 79.16*
32 | N/A | N/A | 8 | 78.21
Figure A1 shows the effect on training and validation losses when global context is introduced after 60 epochs during training of the D S 2 -Faster model, where the FasterViT attention modules are replaced with our D S 2 attention.
From Figure A2, it can be observed that prior to the introduction of the enhanced global context, the slope of the validation accuracy is relatively flat. After the enhanced global context is introduced via reverse cross-attention between the summary segment and the input context, the slope of the validation accuracy increases noticeably.
To determine when the enhanced global context should be introduced during training, we empirically find that the optimal point is approximately halfway through the total number of training epochs. Table A3 shows the epoch at which the enhanced global context is introduced, along with the corresponding accuracy of the D S 2 -Faster model during training on the ImageNet-100 dataset.
Table A3. Effect of introducing global context (GC) after n epochs on top-one accuracy (%) on the ImageNet-100 dataset during training of the DS2-Faster model for 150 epochs.
Setting | No GC | GC @ 30 | GC @ 60 | GC @ 100 | GC @ 120
Top-1 Accuracy | 82.86 | 83.78 | 84.58 | 84.44 | 84.21
Figure A1. Effect of introducing global context after 60 epochs on training and validation loss for the D S 2 -Faster model.
Figure A2. Training and validation accuracies during training of the D S 2 -Faster model on the ImageNet-100 dataset. Enhancement of global context is introduced after epoch 60.

Appendix B. Computation Cost Analysis in DS2 Attention

We perform left-to-right and right-to-left Perceiver-style attention on a pair of segments to achieve linear attention complexity while increasing inductive bias. Figure A3 illustrates the attention computation pattern used in the proposed DS2 attention mechanism. The input sequence is divided into segments of size $s$, denoted as $s_0, s_1, s_2, s_3$. Instead of computing full global attention, attention is performed over overlapping segment pairs.
For each pair of adjacent segments, the query ($Q$) is computed on one segment, while the keys ($K$) and values ($V$) are computed over both segments in the pair. This results in an attention computation of size $s \times 2s$ for each overlapping pair. The first and last segments additionally include a local self-attention computation of size $s^2$. This overlapping design enables progressive information propagation across the sequence while significantly reducing the computational complexity compared to full attention. The total attention cost in both streams is given by Equation (41).
Figure A3. Dual-stream DS2 attention illustrating left-to-right (a) and right-to-left (b) information propagation across overlapping segments. Queries ($Q$) are computed on a target segment, while keys and values ($K$, $V$) are computed over paired segments, enabling bidirectional context flow with efficient $s^2$ and $s \cdot 2s$ attention computations.
Figure A4 shows the total attention computations as a function of sequence length $n$, while keeping the segment size $s$ fixed. As the sequence length increases, the number of overlapping segment pairs increases linearly. Since each pair contributes a fixed computation cost proportional to $s \times 2s$, the overall attention complexity grows linearly with $n$, i.e., $O(n)$. This is in contrast to standard self-attention, which scales quadratically as $O(n^2)$. This result validates the scalability of the proposed DS2 attention mechanism for long sequences.
Figure A4. Attention computations for varying sequence lengths n with fixed segment size s = 64 .
Figure A5 presents the attention computations as the segment size $s$ is varied while keeping the sequence length fixed. As $s$ increases, the cost of each overlapping attention computation ($s \times 2s$) increases quadratically with respect to $s$. This highlights an important trade-off in the DS2 design:
  • Smaller segment sizes reduce computation and improve efficiency.
  • Larger segment sizes increase the receptive field per layer but incur higher computational cost.
For comparison, the attention cost of a standard Vision Transformer (ViT) is also shown.
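The trade-off can be tabulated with a short script. The count used here is a simplified reading of the description in this appendix, not a reproduction of the plotted curves: for one head in one stream it adds a single $s \times s$ block for the boundary segment to one $s \times 2s$ block per overlapping pair, and it compares the total with the $n^2$ scores of full attention at n = 1024. The function name is hypothetical.

```python
def stream_scores(n, s):
    """Attention scores computed by one head of one stream for sequence length n."""
    assert n % s == 0, "n must be a multiple of the segment size s"
    pairs = n // s - 1                   # overlapping segment pairs
    return s * s + pairs * (s * 2 * s)   # boundary block + one block per pair

n = 1024
for s in (16, 32, 64, 128, 256, 512, 1024):
    scores = stream_scores(n, s)
    print(s, scores, round(n * n / scores, 2))
```

Under this count the total simplifies to $2ns - s^2$, so the cost rises steadily with the segment size and coincides with full attention when a single segment covers the whole sequence (s = n).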
Figure A5. Attention computations for varying segment size s with fixed sequence length n = 1024 . D S 2 attention provides a controllable trade-off between efficiency and receptive field, while remaining significantly more efficient than standard ViT attention.

References

  1. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
  2. Dehghani, M.; Gritsenko, A.; Arnab, A.; Minderer, M.; Tay, Y.; Kolesnikov, A.; Beyer, L. Scaling Vision Transformers to 22 Billion Parameters. In Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023.
  3. Wang, S.; Li, B.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-Attention with Linear Complexity. arXiv 2020, arXiv:2006.04768.
  4. Beltagy, I.; Peters, M.; Cohan, A. Longformer: The Long-Document Transformer. arXiv 2020, arXiv:2004.05150.
  5. Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking Attention with Performers. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 4 May 2021.
  6. Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers Are RNNs: Fast Autoregressive Transformers with Linear Attention. In Proceedings of the International Conference on Machine Learning (ICML), Virtual Event, 13–18 July 2020.
  7. Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.; Tay, F.; Feng, J.; Yan, S. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
  8. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
  9. Chen, C.F.R.; Fan, Q.; Panda, R. CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
  10. Mahmood, K.; Huang, S. Efficient Context Propagating Perceiver Architectures for Auto-Regressive Language Modeling. In Proceedings of the 28th European Conference on Artificial Intelligence (ECAI 2025), Including the 14th Conference on Prestigious Applications of Intelligent Systems (PAIS 2025), Bologna, Italy, 25–30 October 2025; Lynce, I., Murano, N., Vallati, M., Villata, S., Chesani, F., Milano, M., Omicini, A., Dastani, M., Eds.; IOS Press: Amsterdam, The Netherlands, 2025; Volume 413, pp. 4428–4435.
  11. Mahmood, R.; Patel, S.; Elleithy, K. EEViT: Efficient Enhanced Vision Transformer Architectures with Information Propagation and Improved Inductive Bias. AI 2025, 6, 233.
  12. Hatamizadeh, A.; Yin, H.; Molchanov, P.; Kautz, J. FasterViT: Fast Vision Transformers with Hierarchical Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023.
  13. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
  14. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Minneapolis, MN, USA, 3–5 June 2019.
  15. Jaegle, A.; Gimeno, F.; Brock, A.; Zisserman, A.; Vinyals, O.; Carreira, J. Perceiver: General Perception with Iterative Attention. In Proceedings of the International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021.
  16. Jaegle, A.; Borgeaud, S.; Alayrac, J.B.; Doersch, C.; Ionescu, C.; Ding, D.; Koppula, S.; Zisserman, A.; Vinyals, O.; Carreira, J. Perceiver IO: A General Architecture for Structured Inputs and Outputs. In Proceedings of the International Conference on Learning Representations (ICLR), Online, 25–29 April 2022.
  17. Hawthorne, C.; Jaegle, A.; Borgeaud, S.; Brock, A.; Bornschein, J.; Vinyals, O.; Carreira, J. General-Purpose, Long-Context Autoregressive Modeling with Perceiver AR. In Proceedings of the International Conference on Machine Learning (ICML), Virtual Event, 17–23 July 2022.
  18. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Going Deeper with Image Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
  19. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin Transformer V2: Scaling Up Capacity and Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022.
  20. Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting Spatial Attention Design in Vision Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 9355–9366.
  21. Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Luo, P.; Shao, L. PVT v2: Improved Baselines with Pyramid Vision Transformer. Comput. Vis. Media 2022, 8, 415–424.
  22. Graham, B.; El-Nouby, A.; Touvron, H.; Stock, P.; Joulin, A.; Jégou, H.; Douze, M. LeViT: A Vision Transformer in ConvNet’s Clothing for Faster Inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
  23. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training Data-Efficient Image Transformers and Distillation through Attention. In Proceedings of the International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021.
  24. Chen, Y.; Qian, S.; Tang, H.; Lai, X.; Liu, Z.; Han, S.; Jia, J. LongLoRA: Efficient Fine-Tuning of Long-Context Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024.
  25. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. RandAugment: Practical Automated Data Augmentation with a Reduced Search Space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 3008–3017.
  26. Han, D.; Pan, X.; Han, Y.; Song, S.; Huang, G. FLatten Transformer: Vision Transformer using Focused Linear Attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 5961–5971.
  27. Facebook AI Research. Fvcore: Common Library Components for PyTorch-Based Computer Vision Research; GitHub Repository. 2019. Available online: https://github.com/facebookresearch/fvcore (accessed on 5 April 2026).
Figure 1. Architecture of the Vision Transformer.
Figure 2. Architecture of PerceiverAR [17]. For brevity, attention heads are not shown.
Figure 3. PerceiverAR operation applied to a pair of overlapping segments, resulting in left-to-right information flow as processing proceeds through successive layers of the Transformer. During the PerceiverAR operation, Q is computed on the right segment of the pair. This computation pattern is employed in the first half of the heads.
Figure 4. PerceiverAR operation applied to a pair of overlapping segments, resulting in right-to-left information flow as processing proceeds through successive layers of the Transformer. During the PerceiverAR operation, Q is computed on the left segment of the pair. This computation pattern is employed in the second half of the heads.
Figure 5. Architecture of our $DS^2$ attention.
Figure 6. Comparison of T2T and $DS^2$-T2T across CIFAR-100, ImageNet-100, and ImageNet-1K datasets.
Figure 7. Comparison of FasterViT and $DS^2$-Faster across CIFAR-100, ImageNet-100, and ImageNet-1K datasets.
Figure 8. Illustrative depiction of progressive receptive-field expansion in $DS^2$ attention across layers. The figure is intended to provide intuition for local-to-global context propagation through dual-stream overlapping segmented attention.
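The receptive-field intuition behind Figure 8 can be illustrated with a toy simulation at the segment level. This is a sketch under stated assumptions rather than the authors' method: in every layer the left-to-right heads are assumed to let segment $i$ read segment $i - 1$, the right-to-left heads are assumed to let it read segment $i + 1$, and the head outputs are assumed to be mixed before the next layer. The function name and the choice of 16 segments are hypothetical.

```python
def receptive_fields(num_segments: int, num_layers: int) -> list[set[int]]:
    """Return, for each segment, the set of input segments that can influence it."""
    fields = [{i} for i in range(num_segments)]  # layer 0: each segment sees itself
    for _ in range(num_layers):
        updated = []
        for i in range(num_segments):
            merged = set(fields[i])
            if i > 0:
                merged |= fields[i - 1]  # assumed left-to-right stream
            if i < num_segments - 1:
                merged |= fields[i + 1]  # assumed right-to-left stream
            updated.append(merged)
        fields = updated
    return fields


if __name__ == "__main__":
    # 16 segments, e.g. n = 1024 tokens with s = 64 (illustrative values)
    for depth in (1, 2, 4, 8):
        middle = receptive_fields(16, depth)[8]
        print(depth, len(middle))
```

In this toy model the receptive field of a central segment widens by one segment in each direction per layer, covering 3, 5, 9, and then all 16 segments after 1, 2, 4, and 8 layers.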
Table 1. Datasets used in our experiments.

| Dataset | Number of Classes | Image Size | Train Images | Test Images |
|---|---|---|---|---|
| CIFAR-10 | 10 | 3 × 32 × 32 | 50,000 | 10,000 |
| CIFAR-100 | 100 | 3 × 32 × 32 | 50,000 | 10,000 |
| Tiny ImageNet | 200 | 3 × 64 × 64 | 100,000 | 10,000 |
| ImageNet-100 | 100 | 3 × 224 × 224 | 130,000 | 50,000 |
| ImageNet-1K | 1000 | 3 × 224 × 224 | 1,281,167 | 100,000 |
Table 2. Training hyperparameters used for each dataset.

| Parameter | ImageNet-1K | ImageNet-100 | CIFAR-10 | CIFAR-100 | Tiny-ImageNet |
|---|---|---|---|---|---|
| Training Epochs | 200 | 200 | 200 | 100 | 70 |
| Batch Size | 256 | 256 | 64 | 128 | 128 |
| Initial Learning Rate | $3 \times 10^{-4}$ | $3 \times 10^{-4}$ | $3 \times 10^{-4}$ | $3 \times 10^{-4}$ | $3 \times 10^{-4}$ |
| Optimizer | AdamW | AdamW | AdamW | AdamW | AdamW |
| Weight Decay | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 |
| Warmup Epochs | 5 | 5 | 5 | 5 | 5 |
| Gradient Clipping | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 |
| Dropout Rate | 0.08 | 0.08 | 0.08 | 0.08 | 0.08 |
| Label Smoothing | Yes | Yes | No | No | No |
Table 3. Model configurations: Embedding dimensions (dims), number of attention heads (heads), and stage depths (depths).

| Model | Dims | Heads | Depths |
|---|---|---|---|
| $DS^2$ (base) | [128, 192, 256, 512] | [4, 6, 8, 16] | [2, 2, 6, 2] |
| $DS^2$-CP (ours) | [128, 192, 256, 512] | [4, 6, 8, 16] | [2, 2, 6, 2] |
| SwinV2 | [96, 192, 384, 768] | [3, 6, 12, 24] | [2, 2, 6, 2] |
| Twins-SVT | [64, 128, 256, 512] | [2, 4, 8, 16] | [2, 2, 10, 4] |
| PVT-V2 | [64, 128, 320, 384] | [1, 2, 5, 8] | [3, 4, 6, 3] |
| T2T | [128, 192, 384] | [4, 6, 6] | [2, 2, 8] |
Table 4. Top-one accuracy for different models on image classification benchmarks. Bold indicates the best accuracy while blue indicates the second-best accuracy.

| Model | CIFAR-10 | CIFAR-100 | Tiny-ImageNet | ImageNet-100 |
|---|---|---|---|---|
| ViT | 89.80 | 64.43 | 45.04 | 69.96 |
| Perceiver-AR | 77.27 | 49.14 | 44.28 | 67.41 |
| EEViT-IP | 92.08 | 67.64 | 52.06 | 76.00 |
| T2T | 94.08 | 71.86 | 57.96 | 78.52 |
| Twins-SVT | 94.71 | 73.48 | 58.92 | 81.04 |
| Swin-V2 | 93.89 | 72.16 | 54.04 | 81.66 |
| PVT-V2 | 94.28 | 73.91 | 54.12 | 83.11 |
| Linformer | 85.53 | 62.56 | 48.91 | 74.63 |
| $DS^2$-base (ours) | 94.01 | 72.40 | 54.16 | 79.16 |
| $DS^2$-CP (ours) | 95.11 | 75.12 | 59.42 | 83.24 |
Table 5. Top-one accuracy results for different models. Bold indicates the best accuracy, and blue indicates the second-best accuracy for each dataset.

| Model | CIFAR-100 | ImageNet-100 | ImageNet-1K |
|---|---|---|---|
| Swin-V2 | 72.16 | 81.66 | 75.64 |
| Twins-SVT | 73.48 | 81.04 | 77.41 |
| PVT-V2 | 73.91 | 83.11 | 78.34 |
| T2T | 71.86 | 78.52 | 74.27 |
| $DS^2$-T2T (ours) | 74.82 | 79.13 | 74.81 |
| FasterViT | 75.92 | 84.73 | 78.53 |
| $DS^2$-Faster (ours) | 76.10 | 84.88 | 78.87 |

ImageNet-1K with DeiT Training Protocol (300 Epochs + Mixup)

| Model | ImageNet-1K |
|---|---|
| FasterViT | 80.71 |
| $DS^2$-Faster (ours) | 80.84 |
Table 6. Model size and inference GigaFLOPs for different Vision Transformer architectures. Lower GigaFLOP values indicate greater computational efficiency. Bold values indicate the more efficient model within directly comparable architecture pairs.

| Model | Layers (Stage Depth) | Model Size (M) | GigaFLOPs |
|---|---|---|---|
| Baseline Vision Transformer Models | | | |
| ViT | 10 | 13.17 | 4.36 |
| EEViT-IP | 10 | 15.08 | 3.68 |
| $DS^2$-base (ours) | 12 [2,2,6,2] | 11.34 | 2.61 |
| $DS^2$-CP (ours) | 12 [2,2,6,2] | 13.17 | 2.91 |
| Hierarchical and Efficient Vision Transformers | | | |
| Swin-V2 | 12 [2,2,6,2] | 27.66 | 4.71 |
| PVT-V2 | 16 [3,4,6,3] | 19.56 | 3.02 |
| Twins-SVT | 18 [2,2,10,4] | 23.59 | 2.82 |
| T2T | 12 [2,2,8] | 22.38 | 1.89 |
| $DS^2$-T2T (ours) | 12 [2,2,8] | 19.03 | 1.54 |
| FasterViT-Based Architectures | | | |
| FasterViT | 17 [2,3,6,6] | 30.94 | 4.03 |
| $DS^2$-Faster (ours) | 17 [2,3,6,6] | 23.61 | 3.19 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Mahmood, R.; Patel, S.; Elleithy, K. DS2 Attention: Dual-Stream Segmented Information Propagating Linear Attention for Vision Transformers. AI 2026, 7, 188. https://doi.org/10.3390/ai7060188
