DAR-Swin: Dual-Attention Revamped Swin Transformer for Intelligent Vehicle Perception Under NVH Disturbances
Abstract
1. Introduction
- Enhanced Global Perception for On-Vehicle Scenarios via SSA Integration: The scalable self-attention (SSA) module is integrated into the Swin architecture in place of window-based multi-head self-attention (W-MSA), enabling continuous cross-window context modeling through its position adaptation mechanism. This significantly strengthens the model's ability to capture discriminative global features in complex scenes (see the first sketch after this list).
- Computational Optimization via Strategic LPA Deployment: To relieve the computational burden of the deep layers, the latent proxy attention (LPA) module is deployed before the classification head. By introducing learnable proxy tokens that first aggregate the feature map and then broadcast the pooled context back to it, LPA achieves feature compression at linear complexity, effectively alleviating the resource constraints imposed by high-resolution feature maps (see the second sketch after this list).
- Unified Dual-Attention Architecture for Intelligent Vehicle Perception: The proposed DAR-Swin framework integrates the SSA module for global contextual modeling and the LPA module for efficient feature compression. This unified architecture improves both classification accuracy and computational efficiency on open-source benchmark datasets.
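
To make the SSA contribution concrete, below is a minimal sketch of global self-attention with decoupled content and position terms, the two branches probed in the ablation of Section 4.3.2. The MLP-based continuous relative-position bias and all names (`ScalableSelfAttention`, `pos_mlp`) are illustrative assumptions, not the authors' implementation, and the dense attention shown here does not reproduce SSA's efficiency-oriented scaling.

```python
# Sketch: global attention whose logits are the sum of a content term and a
# continuously generated position term, so context is not reset at window borders.
# This is an assumption-laden illustration, not the paper's SSA code.
import torch
import torch.nn as nn

class ScalableSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Position adaptation (assumed form): a bias generated from continuous
        # relative offsets (dy, dx), shared across the batch.
        self.pos_mlp = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, num_heads))

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, c = x.shape  # n == h * w tokens from the whole feature map
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        # Content term: standard scaled dot-product over all tokens.
        logits = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        # Position term: continuous relative-coordinate bias, (heads, n, n).
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        coords = torch.stack([ys, xs], dim=-1).reshape(n, 2).float()
        rel = (coords[:, None, :] - coords[None, :, :]) / max(h, w)  # normalized offsets
        bias = self.pos_mlp(rel).permute(2, 0, 1)
        out = (logits + bias).softmax(dim=-1) @ v
        return self.proj(out.transpose(1, 2).reshape(b, n, c))

x = torch.randn(2, 7 * 7, 96)                       # a 7x7 feature map, C = 96
print(ScalableSelfAttention(96)(x, 7, 7).shape)     # torch.Size([2, 49, 96])
```

Likewise, a minimal sketch of a latent-proxy block under the description above: learnable proxy tokens aggregate the token sequence, then broadcast the pooled context back, the two stages decoupled in the ablation of Section 4.3.3. Class and parameter names (`LatentProxyAttention`, `num_proxies`) are assumptions for illustration, not the paper's code.

```python
# Sketch: latent proxy attention with an aggregation step (proxies query tokens)
# and a broadcast step (tokens query proxies). Illustrative, not the authors' code.
import torch
import torch.nn as nn

class LatentProxyAttention(nn.Module):
    def __init__(self, dim: int, num_proxies: int = 49, num_heads: int = 4):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(1, num_proxies, dim) * 0.02)
        # Aggregation: m learnable proxies attend to all N tokens.
        self.aggregate = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Broadcast: the N tokens attend to the m compressed proxies.
        self.broadcast = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C). Each stage costs O(N * m), so the block is linear in N.
        p = self.proxies.expand(x.size(0), -1, -1)
        p, _ = self.aggregate(query=p, key=x, value=x)    # compress N tokens -> m proxies
        out, _ = self.broadcast(query=x, key=p, value=p)  # redistribute context to tokens
        return out

tokens = torch.randn(2, 196, 96)                    # a 14x14 feature map, C = 96
print(LatentProxyAttention(96)(tokens).shape)       # torch.Size([2, 196, 96])
```

Because each stage attends between N tokens and a fixed set of m proxies, the cost grows as O(N·m) rather than O(N²), which is why deploying such a block before the classification head relieves the deep-layer burden.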
2. Related Work
2.1. Image Classification
2.2. Vision Transformer Backbones
2.3. Efficient Attention Mechanisms
3. Proposed Approach
3.1. Problem Formulation
3.2. Scalable Self-Attention Mechanism
3.3. Latent Proxy Attention Mechanism
3.4. DAR-Swin Architecture
4. Experiments
4.1. Experimental Setup
4.1.1. Datasets
4.1.2. Implementation Details
4.2. Evaluation Metrics
4.3. Ablation Studies
4.3.1. Analysis of Baseline Component
4.3.2. Analysis of SSA Module
4.3.3. Analysis of LPA Module
4.4. Comparative Evaluation
4.4.1. Classification Performance Comparison
4.4.2. Computational Efficiency Analysis
4.5. Comparative Analysis of Visualization Results
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Ablation of baseline components (CIFAR-100; all values in %).

| Configuration | Acc. | Prec. | Rec. | F1 | mAP |
|---|---|---|---|---|---|
| Baseline | 85.8 | 85.9 | 85.8 | 85.8 | 91.5 |
| w/o shifted windows | 82.3 | 82.4 | 82.3 | 82.2 | 88.5 |
| w/o multi-scale | 55.1 | 54.9 | 55.1 | 54.8 | 58.5 |
Ablation of the SSA module (CIFAR-100; all values in %).

| Configuration | Acc. | Prec. | Rec. | F1 | mAP |
|---|---|---|---|---|---|
| Content decoupled | 86.4 | 86.6 | 86.4 | 86.3 | 92.3 |
| Position decoupled | 86.2 | 86.3 | 86.2 | 86.1 | 92.1 |
| SSA | 86.9 | 87.0 | 86.9 | 86.8 | 92.5 |
Ablation of the LPA module (CIFAR-100; all values in %).

| Configuration | Acc. | Prec. | Rec. | F1 | mAP |
|---|---|---|---|---|---|
| Aggregation decoupled | 86.0 | 86.3 | 86.0 | 86.0 | 91.7 |
| Broadcast decoupled | 85.8 | 86.0 | 85.9 | 85.8 | 92.1 |
| LPA | 86.6 | 86.8 | 86.6 | 86.6 | 92.2 |
Classification performance comparison on CIFAR-100 and COCO2017 (all values in %).

| Model | CIFAR-100 Acc. | CIFAR-100 Prec. | CIFAR-100 Rec. | CIFAR-100 F1 | CIFAR-100 mAP | COCO2017 Acc. | COCO2017 Prec. | COCO2017 Rec. | COCO2017 F1 | COCO2017 mAP |
|---|---|---|---|---|---|---|---|---|---|---|
| Swin-T | 85.8 | 85.9 | 85.8 | 85.8 | 91.5 | 66.5 | 66.9 | 66.5 | 65.2 | 55.7 |
| YOLOv12 | 72.8 | 74.9 | 70.7 | 72.7 | 78.8 | 64.1 | 64.1 | 49.9 | 56.1 | 54.5 |
| EfficientNet | 77.2 | 77.4 | 77.1 | 77.2 | 82.0 | 64.9 | 61.3 | 49.4 | 54.5 | 54.0 |
| Swin-GLA | 86.2 | 86.3 | 86.2 | 86.1 | 91.9 | 67.1 | 66.4 | 67.1 | 65.8 | 55.9 |
| Swin-DVT | 85.9 | 86.1 | 85.9 | 85.9 | 91.6 | 67.8 | 66.5 | 67.8 | 66.5 | 56.3 |
| DAR-Swin | 87.3 | 87.4 | 87.3 | 87.3 | 92.8 | 68.3 | 67.4 | 68.3 | 67.3 | 57.0 |
Computational efficiency comparison.

| Model | FLOPs (G) | FPS |
|---|---|---|
| Swin-T | 4.367 | 74.069 |
| SSA-only | 4.365 | 73.195 |
| LPA-only | 4.376 | 75.423 |
| DAR-Swin | 4.379 | 74.622 |
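
For reproducibility of FPS figures like those above, below is a minimal sketch of a throughput measurement; the batch size, input resolution, and warm-up schedule are assumptions, not the paper's protocol.

```python
# Sketch: time repeated forward passes to estimate images/second (FPS).
# Warm-up and CUDA synchronization avoid counting startup and queued kernels.
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module, iters: int = 100, warmup: int = 20,
                batch: int = 1, res: int = 224) -> float:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.eval().to(device)
    x = torch.randn(batch, 3, res, res, device=device)
    for _ in range(warmup):        # warm-up passes stabilize clocks and caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()   # ensure queued kernels finish before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters * batch / (time.perf_counter() - start)

# Usage (hypothetical model object): fps = measure_fps(dar_swin_model)
```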
