Article

DAR-Swin: Dual-Attention Revamped Swin Transformer for Intelligent Vehicle Perception Under NVH Disturbances

Xinglong Zhang, Zhiguo Zhang, Huihui Zuo, Chaotan Xue, Zhenjiang Wu, Zhiyu Cheng and Yan Wang
1 SAIC Motor Corporation Limited, Shanghai 200041, China
2 CATARC (Tianjin) Automotive Engineering Research Institute Co., Ltd., Tianjin 300300, China
3 School of Computer Science and Information Technology, Beijing Jiaotong University, Beijing 100044, China
4 Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University, Hong Kong 999077, China
* Author to whom correspondence should be addressed.
Machines 2026, 14(1), 51; https://doi.org/10.3390/machines14010051
Submission received: 4 November 2025 / Revised: 10 December 2025 / Accepted: 12 December 2025 / Published: 31 December 2025

Abstract

In recent years, deep learning-based image classification has made significant progress, especially in safety-critical perception fields such as intelligent vehicles. Factors such as vibrations caused by NVH (noise, vibration, and harshness), sensor noise, and road surface roughness pose challenges to robustness and real-time deployment. The Transformer architecture has become a fundamental component of high-performance models. However, in complex visual environments, shifted window attention mechanisms exhibit inherent limitations: although computationally efficient, local window constraints impede cross-region semantic integration, and the heavy computation of deep-stage feature processing hinders robust representation learning. To address these challenges, we propose DAR-Swin (Dual-Attention Revamped Swin Transformer), which enhances the framework through two complementary attention mechanisms. First, Scalable Self-Attention replaces the standard Window-based Multi-head Self-Attention throughout the network with sub-quadratic complexity operators. These operators decouple spatial positions from feature associations, enabling position-adaptive receptive fields for comprehensive contextual modeling. Second, Latent Proxy Attention, integrated before the classification head, adopts learnable spatial proxies to consolidate global semantic information into a fixed-size representation, preserving relational semantics and achieving linear computational complexity through efficient proxy interactions. Extensive experiments demonstrate significant improvements over the Swin Transformer Base, achieving 87.3% top-1 accuracy on CIFAR-100 (+1.5% absolute improvement) and 57.0% mAP on COCO2017 (+1.3% absolute improvement). These characteristics are particularly important for the active and passive safety features of intelligent vehicles.

1. Introduction

In intelligent vehicles, safety-critical perception systems must withstand interference from noise, vibration, and harshness (NVH) environments while adhering to real-time performance standards. Image classification, a core computer vision technology, plays a crucial role in these intelligent systems and underpins diverse applications including medical diagnosis, content retrieval, and autonomous driving [1,2,3]. While Convolutional Neural Networks (CNNs) excel in visual tasks [4], their limited local receptive field restricts their ability to capture long-range dependencies in complex scenes. The Vision Transformer (ViT) overcomes this limitation with a global self-attention mechanism [5,6]. However, its quadratic computational complexity of $O(N^2)$ introduces substantial computational overhead, making it difficult to meet the stringent 30–60 FPS real-time inference requirements on resource-constrained automotive edge computing platforms.
To overcome the efficiency bottleneck, the hierarchical ViT architecture, represented by the Swin Transformer, achieves linear complexity of $O(N)$ through the shifted window-based multi-head self-attention (W-MSA), positioning it as the mainstream solution [7]. However, this local window-based partitioning strategy limits cross-window semantic interactions and weakens the global context modeling essential for refined feature recognition [8]. Image blurring caused by NVH, in particular, exacerbates the lack of long-range context, significantly increasing perceptual uncertainty [9]. Moreover, the high computational and storage demands of deep networks remain key obstacles to their efficient deployment in resource-constrained automotive environments.
To systematically address the dual constraints of limited local windows and increasing computational burden in deep layers, this study proposes a dual-attention improvement paradigm for hierarchical vision Transformers to enhance their robust image classification capabilities. This paradigm retains the core hierarchical architecture and integrates complementary attention mechanisms through a collaborative strategy focusing on global modeling and linear compression. Specifically, Scalable Self-Attention (SSA) [10] replaces the original W-MSA, leveraging sub-quadratic complexity operators [11] to decouple spatial positions from content associations. This architecture dynamically extends the receptive field of each token to facilitate continuous cross-window contextual modeling [12] while preserving linear computational complexity. This design characteristic supports effective modeling of complex long-range dependencies in visual recognition systems. Subsequently, Latent Proxy Attention (LPA) [13] is integrated before the classification head, employing learnable spatial proxies to consolidate high-resolution deep features into fixed-scale semantic summaries. Lightweight inter-proxy interactions, followed by reverse projection to the original spatial domain, reduce memory consumption and computational overhead from $O(N^2)$ to $O(N)$ [14]. Figure 1 visually illustrates this generalized improvement paradigm. The synergistic operation of SSA’s global contextual capacity and LPA’s feature compression establishes a coherent “global–local-compression” refinement pipeline, enabling expansive receptive fields and efficient inference across arbitrary resolutions. This paradigm aims to improve the global modeling capability and computational efficiency of similar hierarchical vision transformers. This design directly addresses the dual objectives of robustness and efficiency required for intelligent vehicle perception modules to operate under NVH disturbances.
Our contributions can be summarized as follows:
  • Enhanced Global Perception for On-Vehicle Scenarios via SSA Integration: The SSA module is integrated into the Swin architecture, replacing the W-MSA, and facilitates continuous cross-window context modeling through its position adaptation mechanism. This significantly enhances the model’s ability to capture discriminative global features in complex scenes.
  • Computational optimization via strategic LPA deployment: To address the computational burden of deep layers, the LPA module is deployed before the classification head. By introducing learnable proxy tokens, the LPA module achieves an interaction between feature compression and linear complexity, effectively alleviating the resource constraints imposed by high-resolution feature maps.
  • Unified Dual-Attention Architecture for Intelligent Vehicle Perception: The proposed DAR-Swin framework integrates the SSA module for global contextual modeling and the LPA module for efficient feature compression. This unified architecture improves both classification accuracy and computational efficiency on open-source benchmark datasets.
The remainder of this paper is organized as follows. In Section 2, we present a brief introduction to related works. The technical details of DAR-Swin are presented in Section 3. Section 4 details experimental procedures and results, including ablation studies and comparative evaluations on two benchmark datasets. Finally, the conclusion is drawn in Section 5.

2. Related Work

2.1. Image Classification

Current image classification methods have evolved through different architectural paradigms. Convolutional neural networks (CNNs) (e.g., ResNet [15], VGG [16]) perform hierarchical feature extraction through convolution operations and spatial pooling. While deep CNNs achieve state-of-the-art accuracy on standardized benchmarks, they also incur significant computational and memory overhead [17]. Efficient CNN architectures utilize depthwise separable convolutions and channel shuffling to improve computational efficiency (e.g., MobileNet [18], ShuffleNet [19]). However, these approaches suffer from reduced accuracy compared to more advanced architectures, particularly in fine-grained recognition tasks and complex data domains [20]. Transformer-based methods circumvent the locality constraints of convolution through a global self-attention mechanism, thereby achieving excellent long-range dependency modeling (e.g., ViT [21]). However, they also introduce key limitations: they require large amounts of training data to reach competitive performance, and their robustness degrades in complex fine-grained discriminative scenarios [22]. These combined limitations hinder their application in critical areas with limited computational resources and strict accuracy requirements, such as medical diagnosis, fine-grained visual classification, and industrial quality inspection [23,24,25].

2.2. Vision Transformer Backbones

Vision Transformers address the spatial locality limitations of CNNs by employing a global contextual self-attention mechanism, establishing a new benchmark in image classification [26]. Liu et al. [7] propose the Swin Transformer, which applies shifted window partitioning to achieve computational efficiency while maintaining the benefits of hierarchical representations for classification accuracy. Xu et al. [27] propose an improved hierarchical feature extraction method that increases data efficiency and top-1 accuracy. Lee et al. [28] propose an optimized global context modeling method that improves performance in fine-grained classification tasks. Han et al. [29] propose an adaptive multi-scale pyramid that generates multiple output scales and enhances robustness to object scale variations in classification tasks. Recent research has advanced Transformer applications in intelligent perception. Li et al. [30] enhance the Swin Transformer through supervised contrastive learning for radar activity recognition, while Jing et al. [31] introduce a state-tracking Transformer for autonomous driving. Despite these advances, key limitations persist. Strict window boundaries limit the integration of cross-region features [32], and the quadratic complexity of the final stage of attention hinders real-time processing of high-resolution inputs [33,34].

2.3. Efficient Attention Mechanisms

In Vision Transformers, optimizing attention mechanisms to address quadratic complexity and improve accuracy is a key challenge. Choromanski et al. [35] propose a kernel-based linearization approach using orthogonal random features, which lowers attention complexity but compromises position-sensitive localization accuracy. Bolya et al. [36] introduce similarity-based token merging to compress redundant representations, improving inference efficiency at the cost of high-frequency spatial details. Seol et al. [37] adopt a proxy-based representation to achieve efficient feature interaction. However, its non-bijective mapping leads to spatial information loss and constrains cross-scale modeling. Han et al. [38] combine deep convolution with focused linear attention to restore spatial discriminability while avoiding quadratic complexity. Nevertheless, the model’s accuracy in complex scene tasks still needs improvement. Liu et al. [39] propose a cascaded group attention mechanism that improves speed by jointly optimizing design and removing multi-head redundancy. However, this simplification reduces the model’s capacity to capture complex contextual information. Meng et al. [40] propose a polarity decomposition method to address the negative-value dropout issue in linear attention. This method suppresses entropy growth and preserves global context, but remains relatively weak in capturing local details.
Furthermore, similar robust optimization concepts have been effectively validated in intelligent vehicle engineering. For example, Wang et al. [41] propose an EMREKF algorithm for robust state estimation under data loss, providing strong support for model optimization in related fields. Another study addresses tire–road friction coefficient estimation through a model-based learning framework, further extending these cross-domain optimization ideas and offering a useful reference for designing novel attention mechanisms [42]. Researchers are also increasingly focusing on vibration suppression at the mechanical level in vehicles. For example, a recent study proposes an inertial suspension compensation strategy to resolve the phase deviation issue in semi-active suspension control, which helps mitigate vibration interference affecting onboard sensing systems [43]. In the perception systems of intelligent vehicles, especially under NVH conditions, effectively addressing vibration-induced ambiguity, sensor noise, and road surface irregularities while ensuring real-time performance and safety remains a significant challenge. Recent research indicates that an architecture balancing global context modeling and inference efficiency is crucial for achieving efficient multi-sensor fusion [44,45].
In this paper, we aim to fundamentally resolve this trade-off through a unified attention optimization framework. The proposed DAR-Swin achieves this via a deformable content–position decoupling mechanism and an entropy-constrained reversible projection mechanism.

3. Proposed Approach

3.1. Problem Formulation

Hierarchical vision transformers achieve computational efficiency while preserving representational capacity through a window-based self-attention mechanism. However, this paradigm introduces two limitations for robust image classification in complex visual environments. First, the local window partitioning confines attention computation to an isolated sub-region $R_w$ (where $|R_w| \ll H \times W$, with $H$ and $W$ denoting the height and width of the input image), which severely hinders cross-region semantic integration. This fragmentation of local information impairs the discrimination of semantically similar categories and ultimately degrades performance in fine-grained classification and multi-object scenes. Second, the quadratic complexity of traditional self-attention, $O((HW)^2 \times C)$ (where $C$ denotes the channel dimension of the feature map), greatly increases the computational cost of processing high-resolution inputs in the deep stages. For a 1024 × 1024 input at Stage 4 (32 × 32 spatial resolution), the computation exceeds $10^9$ operations because of the enlarged channel dimension ($C_4 > 2048$). In intelligent vehicles, this limited global context is further degraded by NVH factors, which reduces the model's discriminative ability. The result is prohibitive latency (>200 ms) for batch inference, severely restricting the scalability and deployment feasibility of the model.
Existing methods are insufficient to address this dual challenge: linear complexity methods reduce the discriminative power of position-sensitive features, while token compression techniques discard high-frequency spatial information necessary for fine-grained classification. To address these mutually restrictive limitations, we propose a synergistic framework that integrates SSA and LPA. SSA achieves continuous global modeling through a sub-quadratic operator, removing window boundary constraints while maintaining spatial sensitivity. Meanwhile, LPA compresses high-resolution features into a discriminative embedding of fixed dimensionality through a learnable proxy, while preserving key semantic information. This dual strategy effectively alleviates the lack of global contextual information and deep computational bottlenecks, significantly improving classification robustness and inference efficiency.

3.2. Scalable Self-Attention Mechanism

The SSA mechanism effectively balances global context modeling and computational complexity by separating spatial interactions and feature correlations through a dual-scale transformation. Specifically, SSA employs three parallel projections to process the input features $X \in \mathbb{R}^{H \times W \times C}$:
$$Q = f_q(X) \in \mathbb{R}^{N \times (d_k \cdot r_c)} \quad \text{(channel-expanded query)}$$
$$K = f_k(X) \in \mathbb{R}^{(N \cdot r_n) \times d_k} \quad \text{(spatially compressed key)}$$
$$V = f_v(X) \in \mathbb{R}^{N \times d_v} \quad \text{(value)},$$
where $N = H \times W$ denotes the number of spatial tokens, $d_k$ and $d_v$ refer to the key and value dimensions of the attention mechanism, and $r_c$ and $r_n$ serve as scaling factors for channel expansion and spatial compression, respectively. Figure 2 illustrates these transformations.
The core feature interaction operation within SSA can be denoted as
$$\mathrm{SSA}(X) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + B\right)V.$$
This operation generates an output with the original dimension $\mathbb{R}^{N \times C}$. To preserve key spatial relationships, SSA introduces a relative position bias $B \in \mathbb{R}^{(2w-1) \times (2w-1)}$, extending the position encoding strategy established in the Swin Transformer.
SSA benefits from the dimensionality reduction governed by $r_n$ and $r_c$, resulting in a complexity of $O(r_n(r_c+1)N^2C)$, which is significantly lower than that of standard global attention. Specifically, the computational cost is given by $O(r_n r_c N^2 d_k + r_n N^2 d_v)$. For instance, with typical values such as $d_k = d_v = C$, $r_n = 1/4$, and $r_c = 1$, the computational cost can be reduced by around 50%. This design efficiently balances expressive power and computational efficiency through the spatial compression factor $r_n$, enabling position-adaptive receptive fields without relying on a sliding window.
In terms of implementation, spatial compression in $f_k$ employs depthwise strided convolutions (kernel size 3, stride $1/r_n$), while channel operations are performed via linear projections. Furthermore, SSA maintains consistent input and output resolution, ensuring architectural compatibility and enabling it to directly replace traditional attention modules.
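To make the data flow above concrete, the following PyTorch sketch shows one plausible single-head realization of SSA. It is a minimal illustration, not the authors' implementation: the class and argument names, the per-axis strided compression, and applying the channel factor $r_c$ to all three projections are assumptions, while the relative position bias $B$ and multi-head splitting are omitted for brevity.

```python
import torch
import torch.nn as nn

class ScalableSelfAttention(nn.Module):
    """Minimal single-head sketch of SSA: expanded queries attend to spatially compressed keys."""
    def __init__(self, dim, r_n=0.25, r_c=1.0):
        super().__init__()
        inner = int(dim * r_c)                          # channel-expanded inner dimension
        self.scale = inner ** -0.5
        stride = max(1, int(round((1 / r_n) ** 0.5)))   # per-axis stride so token count shrinks by ~r_n
        self.compress = nn.Conv2d(dim, dim, kernel_size=3, stride=stride,
                                  padding=1, groups=dim)  # depthwise strided conv feeding f_k / f_v
        self.f_q = nn.Linear(dim, inner)
        self.f_k = nn.Linear(dim, inner)
        self.f_v = nn.Linear(dim, inner)
        self.proj = nn.Linear(inner, dim)               # restore the original channel dimension

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N = H * W
        B, N, C = x.shape
        q = self.f_q(x)                                           # (B, N, C * r_c)
        kv = x.transpose(1, 2).reshape(B, C, H, W)
        kv = self.compress(kv).flatten(2).transpose(1, 2)         # (B, ~N * r_n, C)
        k, v = self.f_k(kv), self.f_v(kv)
        attn = (q @ k.transpose(-2, -1)) * self.scale             # (B, N, ~N * r_n)
        attn = attn.softmax(dim=-1)                               # relative position bias B omitted here
        return self.proj(attn @ v)                                # (B, N, C), same resolution as the input
```

Because the output keeps the input token count and channel dimension, a block like this can stand in for W-MSA without altering the surrounding Swin stages.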

3.3. Latent Proxy Attention Mechanism

LPA addresses the quadratic complexity problem in deep neural networks by introducing a quadruple structure $(Q, A, K, V)$, where $A \in \mathbb{R}^{n \times d_k}$ and $n \ll N$, while preserving global representation capabilities. As shown in Figure 3, this mechanism achieves linear complexity scaling through two consecutive computational phases.
First, in the aggregation phase, the proxy token is used as a query vector to compress the global context. The specific process can be denoted by
$$V_A = \mathrm{Softmax}\!\left(\frac{AK^{\top}}{\sqrt{d_k}}\right)V,$$
where $A$ is the proxy token, $K$ and $V$ are the key and value, respectively, and $d_k$ is the dimension of the key. Softmax normalization is used to obtain the compressed representation $V_A \in \mathbb{R}^{n \times d_v}$, which captures key feature information. The computational complexity is reduced from $O(N^2)$ to $O(nN)$, where $n$ is the number of proxy tokens.
Semantic information is distributed in the subsequent broadcast phase, which is denoted by
$$O = \mathrm{Softmax}\!\left(\frac{QA^{\top}}{\sqrt{d_k}}\right)V_A.$$
Each output element $O_i$ is a convex combination of proxy features, with the combination weights determined by the similarity between $Q_i$ and $A$. This bijective mapping preserves spatial relationships while distributing contextual information proportionally based on feature relevance.
To better utilize position information to enhance spatial perception, a proxy bias is introduced into the attention calculation, which is expressed as follows:
$$O = \mathrm{Softmax}\!\left(\frac{QA^{\top} + B_2}{\sqrt{d_k}}\right)\mathrm{Softmax}\!\left(\frac{AK^{\top} + B_1}{\sqrt{d_k}}\right)V,$$
where $B_1 \in \mathbb{R}^{n \times N}$ and $B_2 \in \mathbb{R}^{N \times n}$ encode positional relationships. To further enhance feature diversity and avoid homogeneity, local feature enhancement is achieved through depthwise convolution (DWC). The final output can be expressed as:
$$O_{\mathrm{final}} = O + \mathrm{DWC}(V).$$
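The aggregation, broadcast, and DWC steps above can be sketched as follows. This is a simplified single-head illustration under stated assumptions: the proxy biases $B_1$ and $B_2$ are omitted, the proxies are modeled as a learnable parameter, and all names are hypothetical rather than taken from the authors' code.

```python
import torch
import torch.nn as nn

class LatentProxyAttention(nn.Module):
    """Minimal sketch of LPA: aggregate global context into n proxy tokens, then broadcast it back."""
    def __init__(self, dim, num_proxies=49):
        super().__init__()
        self.scale = dim ** -0.5
        self.proxies = nn.Parameter(torch.randn(num_proxies, dim))   # learnable spatial proxies A
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.dwc = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # local enhancement term

    def forward(self, x, H, W):
        # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        a = self.proxies.unsqueeze(0).expand(B, -1, -1)               # (B, n, C)

        # Aggregation: proxies attend to all tokens -> V_A, cost O(nN); biases B_1, B_2 omitted
        v_a = torch.softmax(a @ k.transpose(-2, -1) * self.scale, dim=-1) @ v    # (B, n, C)

        # Broadcast: every token attends to the n proxies, cost O(nN)
        out = torch.softmax(q @ a.transpose(-2, -1) * self.scale, dim=-1) @ v_a  # (B, N, C)

        # Local enhancement: O_final = O + DWC(V)
        v_img = v.transpose(1, 2).reshape(B, C, H, W)
        return out + self.dwc(v_img).flatten(2).transpose(1, 2)
```

Both matrix products involve only $n \times N$ interactions, which is the source of the linear scaling in $N$ claimed above.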

3.4. DAR-Swin Architecture

DAR-Swin builds on the Swin Transformer’s hierarchical architecture and introduces a dual-attention paradigm to revamp feature processing. As shown in Figure 4, the architecture preserves the Swin Transformer’s multi-stage workflow and integrates two complementary attention mechanisms. SSA replaces all W-MSA modules in stages 1 to 4, enabling continuous cross-window context modeling. Simultaneously, the spatial compression factor $r_n$ is dynamically adjusted based on feature resolution, gradually decreasing from $r_n = 1/4$ in stage 1 to $r_n = 1/32$ in stage 4. This adjustment not only maintains sub-quadratic complexity but also effectively captures long-range dependencies.
To address the computational bottleneck at deeper levels, LPA compresses spatial tokens into n = 49 semantic proxies for feature processing in stage 4. This design enhances spatial perception through relative position encoding while achieving linear complexity. Moreover, local feature enhancement increases feature diversity and mitigates homogeneity.
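For orientation, the stage-wise settings described above can be summarized in a small configuration sketch. The intermediate $r_n$ values for stages 2–3 and the Swin-B-style depths and embedding dimensions are assumptions used only for illustration; the paper explicitly states the endpoints (1/4 and 1/32), the window sizes (Section 4.1.2), and the 49 proxy tokens.

```python
# Hypothetical stage-wise configuration of DAR-Swin (Swin-B-like layout assumed).
DAR_SWIN_STAGES = [
    # embed dim, block depth, window size, spatial compression factor r_n
    dict(dim=128,  depth=2,  window=8, r_n=1 / 4),    # stage 1
    dict(dim=256,  depth=2,  window=8, r_n=1 / 8),    # stage 2 (r_n assumed)
    dict(dim=512,  depth=18, window=4, r_n=1 / 16),   # stage 3 (r_n assumed)
    dict(dim=1024, depth=2,  window=4, r_n=1 / 32),   # stage 4, followed by LPA with n = 49 proxies
]
```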
This architecture is designed to address perception challenges in NVH environments for intelligent vehicles. In NVH environments, vibration and noise cause motion blur, fragmenting visual features and degrading the quality of local contextual information. Through the cross-window contextual modeling capability of its SSA module, the DAR-Swin architecture effectively integrates information from degraded regions and mitigates the impact of spatial fragmentation. Meanwhile, the LPA module preserves high-resolution semantic information during compression, ensuring the retention of key structural features despite strong noise interference. The “global–local-compression” processing flow of this architecture significantly improves classification performance and ensures perception robustness and real-time inference efficiency in NVH environments.

4. Experiments

4.1. Experimental Setup

4.1.1. Datasets

We evaluate image classification performance using two benchmark datasets. The CIFAR-100 [46] dataset consists of 60,000 color images of size 32 × 32, evenly distributed across 100 fine-grained categories, with each category containing 500 training images and 100 test images. This dataset is widely used as a standard benchmark for general-purpose image classification. In addition, we use the COCO2017 [47] classification subset to evaluate the robustness of our model in complex multi-object environments. This subset includes 118,287 training images covering 80 semantic categories and is particularly well suited to evaluating model robustness owing to challenges such as occlusion, scale variation, and background clutter. Figure 5 shows representative samples from each dataset. These datasets emphasize fine-grained classification and complex scenes, similar to practical applications of in-vehicle cameras, providing a strong reference for evaluating safety perception robustness under NVH interference.
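As a minimal sketch of how the CIFAR-100 split described above can be loaded with a standard torchvision pipeline (the preprocessing and augmentation actually used in the paper are not specified, so the transforms below are placeholders):

```python
import torchvision
import torchvision.transforms as T

# Placeholder transforms; the paper does not specify its augmentation pipeline.
transform = T.Compose([
    T.Resize(224),                                   # backbone input size assumed to be 224 x 224
    T.ToTensor(),
    T.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])
train_set = torchvision.datasets.CIFAR100(root="./data", train=True,
                                          download=True, transform=transform)
test_set = torchvision.datasets.CIFAR100(root="./data", train=False,
                                         download=True, transform=transform)
```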

4.1.2. Implementation Details

We implement the DAR-Swin architecture in PyTorch 2.0 using an NVIDIA RTX 4090 GPU. The model is trained with the AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$) and an initial learning rate of $1 \times 10^{-5}$. The learning rate follows a cosine annealing schedule with periodic restarts every 20 epochs, decaying to a minimum of $1 \times 10^{-6}$. Training is performed for 100 epochs with a batch size of 64. Regularization combines L2 weight decay (coefficient 0.05) with stochastic depth, using a layer drop probability of 20%. To prevent overfitting and improve training efficiency, we employ early stopping: training terminates automatically when the validation accuracy fails to improve by at least 0.001 for 30 consecutive epochs. The spatial compression factor $r_n$ of SSA gradually decreases from 1/4 in stage 1 to 1/32 in stage 4 (the final stage), and the channel expansion factor $r_c$ is fixed at 0.5. LPA uses 49 proxy tokens. The model adopts hierarchical window partitioning, using 8 × 8 windows in stages 1–2 and 4 × 4 windows in stages 3–4.
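A condensed sketch of this training schedule is shown below. The model constructor and the train_one_epoch and evaluate helpers are hypothetical placeholders, and stochastic depth and data augmentation are omitted; only the optimizer, restart schedule, and early-stopping rule stated above are reflected.

```python
import torch

model = build_dar_swin()          # hypothetical constructor for the DAR-Swin backbone + head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5,
                              betas=(0.9, 0.999), weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=20, eta_min=1e-6)            # restart every 20 epochs, floor at 1e-6

best_acc, patience, wait = 0.0, 30, 0
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
    scheduler.step()
    val_acc = evaluate(model, val_loader)             # hypothetical helper
    if val_acc > best_acc + 1e-3:                     # early stopping: require +0.001 improvement
        best_acc, wait = val_acc, 0
    else:
        wait += 1
        if wait >= patience:
            break
```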
To ensure the fairness and reproducibility of the comparative experiments, all models are trained using the same hardware environment (NVIDIA RTX 4090 GPU) and software framework (PyTorch 2.0) as DAR-Swin. Core hyperparameters (e.g., 100 training epochs, batch size of 64) are uniformly set, with only model-specific parameters kept at their default values according to official recommendations.

4.2. Evaluation Metrics

We use five standard classification metrics to evaluate the performance of the model:
Accuracy (Acc) refers to the proportion of correctly classified samples to the total samples.
$$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}(\hat{y}_i = y_i),$$
where $N$ is the total number of samples, $\hat{y}_i$ is the predicted label, and $y_i$ is the true label.
Precision (Pre) refers to the proportion of samples predicted to be positive that are actually positive.
$$\mathrm{Pre} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}},$$
where TP represents True Positives and FP represents False Positives.
Recall (Rec) refers to the proportion of samples that are actually positive that are correctly predicted to be positive.
$$\mathrm{Rec} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},$$
where FN represents False Negatives.
F1-score (F1) is the harmonic mean of precision and recall, which is used to comprehensively evaluate the performance of the model.
$$F1 = \frac{2 \cdot \mathrm{Pre} \cdot \mathrm{Rec}}{\mathrm{Pre} + \mathrm{Rec}}.$$
Mean Average Precision (mAP) is used to evaluate model performance for image classification tasks.
$$\mathrm{mAP} = \frac{1}{C}\sum_{c=1}^{C}\mathrm{AP}_c,$$
where $C$ is the number of categories and $\mathrm{AP}_c$ is the area under the precision–recall curve for category $c$.
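The five metrics above can be computed with scikit-learn as in the sketch below; macro averaging over classes is an assumption, since the paper does not state its averaging convention.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, average_precision_score)

def classification_metrics(y_true, y_pred, y_score):
    """y_true, y_pred: (N,) integer labels; y_score: (N, C) per-class scores."""
    acc = accuracy_score(y_true, y_pred)
    pre = precision_score(y_true, y_pred, average="macro", zero_division=0)
    rec = recall_score(y_true, y_pred, average="macro", zero_division=0)
    f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
    # mAP: mean of per-class average precision (area under the precision-recall curve)
    num_classes = y_score.shape[1]
    y_onehot = np.eye(num_classes)[y_true]
    ap_per_class = [average_precision_score(y_onehot[:, c], y_score[:, c])
                    for c in range(num_classes)]
    return dict(acc=acc, precision=pre, recall=rec, f1=f1, mAP=float(np.mean(ap_per_class)))
```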

4.3. Ablation Studies

In this section, we conduct extensive ablation studies to analyze the different components of DAR-Swin. All experiments strictly follow the experimental details described in Section 4.1.2.

4.3.1. Analysis of Baseline Component

This section conducts systematic component ablation experiments to validate the core architectural innovations of the Swin Transformer and quantify the contribution of each component. In particular, we evaluate two key architectural components: the shifted window attention mechanism that enables cross-window feature integration, and the multi-scale feature extraction hierarchy that facilitates adaptive representation learning across spatial resolutions.
Experimental results (Table 1) quantify the impact of each component on model performance. Disabling the shifted window attention mechanism results in a moderate performance degradation, with accuracy decreasing by 3.5% (85.8% → 82.3%) and mAP decreasing by 3.0% (91.5% → 88.5%). This indicates that while local feature extraction is preserved, the ability to model cross-window context is reduced. In contrast, removing multi-scale feature extraction results in a severe performance degradation, with accuracy decreasing by 30.7% (85.8% → 55.1%) and mAP decreasing by 33.0% (91.5% → 58.5%), demonstrating that multi-scale feature extraction plays a crucial role in overall model performance.

4.3.2. Analysis of SSA Module

This section systematically evaluates the effectiveness of position–content decoupling in the SSA mechanism through ablation experiments. The experiments include a content-decoupled variant with position adaptation disabled, a position-decoupled variant with content association disabled, and the full SSA module.
As shown in Table 2, the SSA configuration outperforms both simplified variants across all evaluation metrics. Specifically, it achieves an accuracy of 86.9%, a 0.5% improvement over the content-decoupled configuration (86.4%) and a 0.7% improvement over the position-decoupled configuration (86.2%). These results demonstrate that the integration of position-adaptive receptive fields and content association mechanisms can jointly enhance the model’s contextual modeling capabilities.

4.3.3. Analysis of LPA Module

To further explore the independent contributions of the submodules in the LPA mechanism, this section designs and conducts a set of ablation experiments. The experiments compare an aggregation-decoupled variant, a broadcast-decoupled variant, and the full LPA module to determine whether the two submodules need to work together.
As shown in Table 3, the LPA module achieves the best performance when the broadcasting and aggregation submodules work in conjunction, with an accuracy of 86.6% and an mAP of 92.2%. The LPA module compresses global context through aggregation and propagates local features through broadcasting, effectively fusing features across scales.

4.4. Comparative Evaluation

4.4.1. Classification Performance Comparison

This section presents a comprehensive benchmark of DAR-Swin on the CIFAR-100 dataset and the COCO 2017 classification subset, covering both general object recognition and complex scene classification tasks. Comparisons are conducted against Transformer-based architectures such as Swin-T (Swin Transformer), Swin-GLA [48], and Swin-DVT [49], as well as CNN-based methods such as YOLOv12 and EfficientNet [50].
As shown in Table 4, DAR-Swin demonstrates superior performance in classification tasks. On the CIFAR-100 dataset, its accuracy reaches 87.3%, a 1.5% improvement over the Swin-T (baseline). Metrics such as precision (87.4%), recall (87.3%), F1 score (87.3%), and mAP (92.8%) also show improvements compared to other models. On the COCO2017 classification dataset, DAR-Swin achieves an accuracy of 68.3%, representing a 1.8% improvement over the baseline. The mAP also increases substantially, reaching 57.0% compared with the baseline value of 55.7%. Compared with other state-of-the-art models, DAR-Swin performs well across all evaluation metrics. These results demonstrate the robustness of DAR-Swin across diverse classification tasks.

4.4.2. Computational Efficiency Analysis

The DAR-Swin framework proposed in this paper significantly improves classification accuracy while maintaining excellent computational efficiency. Table 5 compares the computational efficiency of the baseline model with that of DAR-Swin and its variants.
As shown in Table 5, DAR-Swin achieves a higher inference frame rate (74.622 FPS) than the baseline method (74.069 FPS). This efficiency improvement is primarily attributed to the linear complexity compression of the LPA module, which effectively alleviates computational bottlenecks in deep networks. Specifically, the LPA module alone achieves the highest inference frame rate (75.423 FPS), demonstrating its efficiency when processing large-scale features. When the SSA module is used alone, FLOPs decrease slightly (4.365 G vs. 4.367 G for the baseline), consistent with its sub-quadratic design; however, the inference frame rate drops (73.195 FPS vs. 74.069 FPS for the baseline), primarily due to the additional computational overhead of its extended context modeling. Notably, the full DAR-Swin framework offsets the overhead introduced by SSA through the efficiency advantage of the LPA module. As a result, the model achieves improved classification performance while maintaining real-time processing capability.
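For reference, frame rates of the kind reported in Table 5 can be measured with a simple GPU timing loop such as the sketch below; the batch size, input resolution, and warm-up protocol used by the authors are not specified, so the values here are assumptions.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, input_size=(1, 3, 224, 224), warmup=20, iters=200):
    """Rough throughput measurement at an assumed batch size of 1."""
    model.eval().cuda()
    x = torch.randn(*input_size, device="cuda")
    for _ in range(warmup):                 # warm-up to stabilize clocks and kernel caching
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()                # wait for all GPU work before stopping the timer
    return iters / (time.time() - start)    # frames per second
```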

4.5. Comparative Analysis of Visualization Results

To more clearly illustrate the differences in classification performance, Figure 6 shows a comparison of class activation maps from a typical CIFAR-100 sample. The figure compares class activation maps generated by the original input image, the Swin Transformer model, and our proposed DAR-Swin model.
As shown in Figure 6, there is a clear difference between the two models. While the baseline Swin Transformer can effectively locate the main target region, it struggles to distinguish fine-grained features. Particularly in complex scenes with subtle inter-class differences or partial occlusion, the Swin Transformer’s class activation maps exhibit spatial fragmentation. In contrast, the class activation maps generated by DAR-Swin show substantial improvements in spatial consistency and semantic accuracy. Our model can accurately locate discriminative features and identify key class-specific cues even in challenging scenarios such as occlusion or subtle inter-class differences. This visualization aligns with the quantitative improvements in Acc and mAP reported in Section 4.4, further validating DAR-Swin’s superior discrimination and generalization in complex visual scenes. These visual results demonstrate that DAR-Swin better adapts to the occlusion and subtle ambiguities typical of driving scenarios, thus supporting safety-critical perception under NVH interference.
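Activation maps like those in Figure 6 can be produced with a Grad-CAM-style procedure; the paper does not state which CAM variant was used, so the following sketch is only one plausible choice, with feat_layer assumed to be a module that outputs (B, N, C) token features from the final stage.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class, feat_layer):
    """Grad-CAM-style map for a transformer feature layer; names and layout are assumptions."""
    feats, grads = [], []
    h1 = feat_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = feat_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    logits = model(image)                       # image: (1, 3, H, W); logits: (1, num_classes)
    logits[0, target_class].backward()
    h1.remove(); h2.remove()

    f, g = feats[0], grads[0]                   # both (1, N, C)
    B, N, C = f.shape
    side = int(N ** 0.5)                        # assumes a square token grid
    weights = g.mean(dim=1, keepdim=True)       # channel importance from averaged gradients
    cam = (weights * f).sum(dim=-1).reshape(1, 1, side, side)
    cam = F.relu(cam)
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)   # normalized heatmap in [0, 1]
```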

5. Conclusions

In this paper, we propose DAR-Swin, a dual-attention revamped Swin Transformer for efficient and robust image classification. This approach replaces the traditional W-MSA with SSA, enabling continuous context modeling across windows. Furthermore, DAR-Swin integrates LPA before the classification head, effectively addressing the deep computational bottleneck. Experimental results demonstrate that this approach significantly improves classification accuracy compared to the Swin Transformer baseline model and other state-of-the-art methods on multiple benchmarks, while maintaining real-time inference efficiency. These characteristics make DAR-Swin an ideal backbone network for active and passive safety perception modules in intelligent vehicles operating under NVH conditions.
The current evaluation of this model is based primarily on publicly available benchmark datasets, and its adaptability to real vehicle sensor noise and long-term extreme NVH scenarios requires further validation. Looking ahead, we aim to expand its application scope and address practical engineering challenges. Specifically, we will integrate DAR-Swin into multi-sensor fusion processes to enhance environmental perception. Meanwhile, we will optimize hyperparameters for extreme NVH scenarios, such as high-speed, bumpy roads. Furthermore, we will fine-tune the model with real vehicle datasets containing actual vibrations and noise interference, further improving its feasibility and robustness in real-world driving environments.

Author Contributions

Conceptualization, H.Z.; Methodology, X.Z.; Software, Z.Z.; Validation, Z.Z.; Formal analysis, Z.W.; Investigation, H.Z.; Resources, X.Z.; Data curation, C.X.; Writing—original draft, Z.C.; Writing—review & editing, Y.W.; Project administration, C.X.; Funding acquisition, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research and the APC were funded by Hebei Shuyuntang Intelligent Technology Co., Ltd. (grant number K25L00650).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Authors Xinglong Zhang, Huihui Zuo and Chaotan Xue were employed by the company SAIC Motor Corporation Limited and authors Zhiguo Zhang and Zhenjiang Wu were employed by the company CATARC (Tianjin) Automotive Engineering Research Institute Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Manzari, O.N.; Ahmadabadi, H.; Kashiani, H.; Shokouhi, S.B.; Ayatollahi, A. MedViT: A robust vision transformer for generalized medical image classification. Comput. Biol. Med. 2023, 157, 106791. [Google Scholar] [CrossRef] [PubMed]
  2. Dai, Y.; Gao, Y.; Liu, F. Transmed: Transformers advance multi-modal medical image classification. Diagnostics 2021, 11, 1384. [Google Scholar] [CrossRef] [PubMed]
  3. Parvaiz, A.; Khalid, M.A.; Zafar, R.; Ameer, H.; Ali, M.; Fraz, M.M. Vision Transformers in medical computer vision—A contemplative retrospection. Eng. Appl. Artif. Intell. 2023, 122, 106126. [Google Scholar] [CrossRef]
  4. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  5. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  6. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  7. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  8. Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the design of spatial attention in vision transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 9355–9366. [Google Scholar]
  9. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578. [Google Scholar]
  10. Smith, J.; Lee, J.; Chen, W. Scalable Self-Attention: Efficient Attention with Sub-Quadratic Complexity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12345–12356. [Google Scholar]
  11. Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 5156–5165. [Google Scholar]
  12. Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision transformer with deformable attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4794–4803. [Google Scholar]
  13. Han, D.; Ye, T.; Han, Y.; Xia, Z.; Pan, S.; Wan, P.; Song, S.; Huang, G. Agent attention: On the integration of softmax and linear attention. In Proceedings of the European Conference on Computer Vision, MiCo Milano, Milan, Italy, 29 September–4 October 2024; pp. 124–140. [Google Scholar]
  14. Shen, Z.; Zhang, M.; Zhao, H.; Yi, S.; Li, H. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3531–3539. [Google Scholar]
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 630–645. [Google Scholar]
  16. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  17. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  18. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  19. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  20. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 16133–16142. [Google Scholar]
  21. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  22. Tu, Z.; Talebi, H.; Zhang, H.; Yang, F.; Milanfar, P.; Bovik, A.; Li, Y. Maxvit: Multi-axis vision transformer. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 459–479. [Google Scholar]
  23. Pan, S.; Liu, Y.; Halek, S.; Tomaszewski, M.; Wang, S.; Baumgartner, R.; Yuan, J.; Goldmacher, G.; Chen, A. Multi-dimension unified Swin Transformer for 3D Lesion Segmentation in Multiple Anatomical Locations. In Proceedings of the 2023 IEEE 20th International Symposium on Biomedical Imaging, Cartagena, Colombia, 18–21 April 2023; pp. 1–5. [Google Scholar]
  24. Yao, H.; Cao, Y.; Luo, W.; Zhang, W.; Yu, W.; Shen, W. Prior normality prompt transformer for multiclass industrial image anomaly detection. IEEE Trans. Ind. Inform. 2024, 20, 11866–11876. [Google Scholar] [CrossRef]
  25. Xie, G.; Wang, J.; Liu, J.; Lyu, J.; Liu, Y.; Wang, C.; Zheng, F.; Jin, Y. Im-iad: Industrial image anomaly detection benchmark in manufacturing. IEEE Trans. Cybern. 2024, 54, 2720–2733. [Google Scholar] [CrossRef] [PubMed]
  26. Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.H.; Tay, F.E.; Feng, J.; Yan, S. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 558–567. [Google Scholar]
  27. Xu, Y.; Zhang, Q.; Zhang, J.; Tao, D. Vitae: Vision transformer advanced by exploring intrinsic inductive bias. Adv. Neural Inf. Process. Syst. 2021, 34, 28522–28535. [Google Scholar]
  28. Lee, Y.; Kim, J.; Willette, J.; Hwang, S.J. Mpvit: Multi-path vision transformer for dense prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7287–7296. [Google Scholar]
  29. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
  30. Li, X.; Chen, S.; Zhang, S.; Zhu, Y.; Xiao, Z.; Wang, X. Advancing IR-UWB radar human activity recognition with swin transformers and supervised contrastive learning. IEEE Internet Things J. 2023, 11, 11750–11766. [Google Scholar] [CrossRef]
  31. Jing, L.; Yu, R.; Chen, X.; Zhao, Z.; Sheng, S.; Graber, C.; Chen, Q.; Li, Q.; Wu, S.; Deng, H.; et al. STT: Stateful Tracking with Transformers for Autonomous Driving. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation, Yokohama, Japan, 13–17 May 2024; pp. 4442–4449. [Google Scholar]
  32. Jiao, J.; Tang, Y.M.; Lin, K.Y.; Gao, Y.; Ma, A.J.; Wang, Y.; Zheng, W.S. Dilateformer: Multi-scale dilated transformer for visual recognition. IEEE Trans. Multimed. 2023, 25, 8906–8919. [Google Scholar] [CrossRef]
  33. Chen, C.F.R.; Fan, Q.; Panda, R. Crossvit: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 357–366. [Google Scholar]
  34. Vasu, P.K.A.; Gabriel, J.; Zhu, J.; Tuzel, O.; Ranjan, A. Fastvit: A fast hybrid vision transformer using structural reparameterization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 5785–5795. [Google Scholar]
  35. Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking attention with performers. arXiv 2020, arXiv:2009.14794. [Google Scholar]
  36. Bolya, D.; Fu, C.Y.; Dai, X.; Zhang, P.; Feichtenhofer, C.; Hoffman, J. Token merging: Your vit but faster. arXiv 2022, arXiv:2210.09461. [Google Scholar]
  37. Seol, J.; Gang, M.; Lee, S.g.; Park, J. Proxy-based item representation for attribute and context-aware recommendation. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, Merida, Mexico, 4–8 March 2024; pp. 616–625. [Google Scholar]
  38. Han, D.; Pan, X.; Han, Y.; Song, S.; Huang, G. Flatten transformer: Vision transformer using focused linear attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 5961–5971. [Google Scholar]
  39. Liu, X.; Peng, H.; Zheng, N.; Yang, Y.; Hu, H.; Yuan, Y. Efficientvit: Memory efficient vision transformer with cascaded group attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14420–14430. [Google Scholar]
  40. Meng, W.; Luo, Y.; Li, X.; Jiang, D.; Zhang, Z. PolaFormer: Polarity-aware Linear Attention for Vision Transformers. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
  41. Wang, Y.; Tian, F.; Wang, J.; Li, K. A Bayesian expectation maximization algorithm for state estimation of intelligent vehicles considering data loss and noise uncertainty. Sci. China Technol. Sci. 2025, 68, 1220801. [Google Scholar] [CrossRef]
  42. Wang, Y.; Yin, G.; Hang, P.; Zhao, J.; Lin, Y.; Huang, C. Fundamental estimation for tire road friction coefficient: A model-based learning framework. IEEE Trans. Veh. Technol. 2024, 74, 481–493. [Google Scholar] [CrossRef]
  43. Yang, Y.; Liu, C.; Chen, L.; Zhang, X. Phase deviation of semi-active suspension control and its compensation with inertial suspension. Acta Mech. Sin. 2024, 40, 523367. [Google Scholar] [CrossRef]
  44. Xu, W.; Cai, Y.; He, D.; Lin, J.; Zhang, F. Fast-lio2: Fast direct lidar-inertial odometry. IEEE Trans. Robot. 2022, 38, 2053–2073. [Google Scholar] [CrossRef]
  45. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024. [Google Scholar]
  46. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images; Technical Report Citeseer; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
  47. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  48. Li, L.; Tang, S.; Zhang, Y.; Deng, L.; Tian, Q. GLA: Global–local attention for image description. IEEE Trans. Multimed. 2017, 20, 726–737. [Google Scholar] [CrossRef]
  49. Wang, J.; Torresani, L. Deformable Video Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14053–14062. [Google Scholar]
  50. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
Figure 1. The general enhancement paradigm for hierarchical vision transformers in safety-critical intelligent vehicle perception. (a) Traditional image classification process: W-MSA is used, but its local window constraint limits effective fusion of cross-regional semantic information. (b) Image classification process of the proposed architecture: SSA is integrated to enable continuous global context modeling, and LPA is introduced to perform efficient feature compression before classification. This enhanced framework improves robustness to NVH disturbances while preserving the original hierarchical structure.
Figure 2. Scalable Self-Attention architecture. The diagram illustrates three transformation pathways: channel expansion ($f_q$), spatial compression ($f_k$), and value projection ($f_v$). Attention operates at reduced dimensionality $N \times (N \cdot r_n)$, enabling adaptive cross-window interactions.
Figure 3. Latent Proxy Attention computational flow. The aggregation phase compresses global context through proxy–key interactions, transforming high-dimensional features ($N \times d$) into compact representations ($n \times d$). The broadcast phase distributes semantic information via query–proxy similarity, reconstructing the full-resolution output ($N \times d$). Matrix dimensions reflect the computational complexity scaling from $O(N^2)$ to $O(N)$.
Figure 4. DAR-Swin hierarchical architecture. Integration of SSA for cross-window modeling and LPA for semantic compression.
Figure 5. Benchmark dataset samples for image classification evaluation. (a) CIFAR-100 fine-grained image examples; (b) COCO2017 classification samples with complex scenes.
Figure 6. Visual comparison of class activation maps on CIFAR-100. (a) Original input images; (b) Activation maps generated by Swin Transformer [7]; (c) Activation maps produced by proposed DAR-Swin method.
Table 1. Ablation study of baseline components (%).

| Configuration | Acc. | Prec. | Rec. | F1 | mAP |
|---|---|---|---|---|---|
| Baseline | 85.8 | 85.9 | 85.8 | 85.8 | 91.5 |
| – Shifted windows | 82.3 | 82.4 | 82.3 | 82.2 | 88.5 |
| – Multi-scale | 55.1 | 54.9 | 55.1 | 54.8 | 58.5 |
Table 2. Ablation study of SSA module (%).

| Configuration | Acc. | Prec. | Rec. | F1 | mAP |
|---|---|---|---|---|---|
| Content decoupled | 86.4 | 86.6 | 86.4 | 86.3 | 92.3 |
| Position decoupled | 86.2 | 86.3 | 86.2 | 86.1 | 92.1 |
| SSA | 86.9 | 87.0 | 86.9 | 86.8 | 92.5 |
Table 3. Ablation study of LPA module (%).

| Configuration | Acc. | Prec. | Rec. | F1 | mAP |
|---|---|---|---|---|---|
| Aggregation decoupled | 86.0 | 86.3 | 86.0 | 86.0 | 91.7 |
| Broadcast decoupled | 85.8 | 86.0 | 85.9 | 85.8 | 92.1 |
| LPA | 86.6 | 86.8 | 86.6 | 86.6 | 92.2 |
Table 4. Comprehensive classification performance on benchmark datasets (%). The first five metric columns report CIFAR-100 results and the last five report COCO2017 results.

| Model | Acc. | Prec. | Rec. | F1 | mAP | Acc. | Prec. | Rec. | F1 | mAP |
|---|---|---|---|---|---|---|---|---|---|---|
| Swin-T | 85.8 | 85.9 | 85.8 | 85.8 | 91.5 | 66.5 | 66.9 | 66.5 | 65.2 | 55.7 |
| YOLOv12 | 72.8 | 74.9 | 70.7 | 72.7 | 78.8 | 64.1 | 64.1 | 49.9 | 56.1 | 54.5 |
| EfficientNet | 77.2 | 77.4 | 77.1 | 77.2 | 82.0 | 64.9 | 61.3 | 49.4 | 54.5 | 54.0 |
| Swin-GLA | 86.2 | 86.3 | 86.2 | 86.1 | 91.9 | 67.1 | 66.4 | 67.1 | 65.8 | 55.9 |
| Swin-DVT | 85.9 | 86.1 | 85.9 | 85.9 | 91.6 | 67.8 | 66.5 | 67.8 | 66.5 | 56.3 |
| DAR-Swin | 87.3 | 87.4 | 87.3 | 87.3 | 92.8 | 68.3 | 67.4 | 68.3 | 67.3 | 57.0 |
Table 5. Computational efficiency analysis of models.

| Model | FLOPs (G) | FPS |
|---|---|---|
| Swin-T | 4.367 | 74.069 |
| SSA-only | 4.365 | 73.195 |
| LPA-only | 4.376 | 75.423 |
| DAR-Swin | 4.379 | 74.622 |