A Survey on Visual Mamba
Abstract
1. Introduction
- This survey is the first attempt to offer an in-depth analysis of the Mamba technique in the vision domain, explicitly concentrating on analyzing the proposed strategies.
- Expanding upon the Naive-based Mamba visual framework, we investigate how Mamba’s capabilities can be enhanced and combined with other architectures to achieve superior performance.
- We organize the literature by application task, establish a taxonomy, identify advancements specific to each task, and offer insights on overcoming challenges.
- To keep up with the rapid development in this field, we will regularly update this review with the latest relevant papers and develop an open-source implementation at https://github.com/ziyangwang007/Awesome-Visual-Mamba (accessed on 25 June 2024).
2. Formulation of Mamba
2.1. State Space Model
2.1.1. Discretization
2.1.2. Architectures
2.1.3. Selective SSM
2.2. Other Key Concepts in Mamba
2.2.1. Selection Mechanism
2.2.2. Scan
2.2.3. Discussion
3. Mamba for Vision
3.1. Visual Mamba Block
3.1.1. ViM
3.1.2. VSS
3.2. Pure Mamba
3.2.1. ViM-Based
3.2.2. VSS-Based
3.2.3. Visual Data as Multi-Dimensional Data
3.2.4. Summary of 2D Scanning Mechanisms
3.3. Mamba with Other Architectures
| Scanning Mechanism | Method |
|---|---|
| BiDirectional Scan [26] | Vision Mamba [26], Motion Mamba [34], HARMamba [40], MMA [41], VL-Mamba [42], Video Mamba Suite [43], Point Mamba [44], LMa-UNet [45], Motion-Guided Dual-Camera Tracker [46] |
| Cross-Scan [27] | VMamba [27], VL-Mamba [42], VMRNN [47], RES-VMAMBA [48], Sigma [49], ReMamber [50], Mamba-UNet [51], Semi-Mamba-UNet [52], VMambaMorph [53], ChangeMamba [54], H-vmunet [55], MambaMIR [56], MambaIR [57], Serpent [58], Mamba-HUNet [59], TM-UNet [60], Swin-UMamba [61], UltraLight VM-UNet [62], VM-UNet [63], VM-UNET-V2 [64], MedMamba [65], MIM-ISTD [66], RS3Mamba [67] |
| Continuous 2D Scanning [28] | PlainMamba [28] |
| Local Scan [29] | LocalMamba [29], FreqMamba [68] |
| Efficient 2D Scanning (ES2D) [30] | EfficientVMamba [30] |
| Zigzag Scan [31] | ZigMa [31] |
| Omnidirectional Selective Scan [32] | VmambaIR [32], RS-Mamba [69] |
| 3D BiDirectional Scan [33] | VideoMamba [33] |
| Hierarchical Scan [34] | Motion Mamba [34] |
| Spatiotemporal Selective Scan [35] | Vivim [35] |
| Multi-Path Scan [36] | RSMamba [36] |
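To make a few of the scan names in the table above more concrete, the following is a minimal sketch in plain Python of how a 2D patch grid might be flattened into 1D token orderings. It assumes patches are indexed row-major as `r * w + c`; the function names are hypothetical and are not taken from any of the cited codebases. The published methods additionally run a selective SSM along each ordering and merge the outputs, which is omitted here, and their exact path sets may differ from this illustration.

```python
def bidirectional_scan(h, w):
    """Forward and backward row-major orderings of an h x w patch grid."""
    forward = [r * w + c for r in range(h) for c in range(w)]
    return [forward, forward[::-1]]


def cross_scan(h, w):
    """Four orderings: row-major, column-major, and the reverse of each."""
    row_major = [r * w + c for r in range(h) for c in range(w)]
    col_major = [r * w + c for c in range(w) for r in range(h)]
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


def serpentine_scan(h, w):
    """Row-wise back-and-forth ordering, so that consecutive tokens
    are always spatially adjacent (illustrative of continuous/zigzag scans)."""
    order = []
    for r in range(h):
        cols = range(w) if r % 2 == 0 else range(w - 1, -1, -1)
        order.extend(r * w + c for c in cols)
    return order
```

For a 2 x 3 grid, `cross_scan(2, 3)` yields the row-major path `[0, 1, 2, 3, 4, 5]`, the column-major path `[0, 3, 1, 4, 2, 5]`, and their reverses, while `serpentine_scan(2, 3)` yields `[0, 1, 2, 5, 4, 3]`.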
3.3.1. Mamba with Convolution
| Other Architecture | Mamba Method | Capability |
|---|---|---|
| Convolution | RES-VMAMBA [48] | Food vision tasks |
| Convolution | MedMamba [65] | Medical image classification tasks |
| Convolution | HSIMamba [72] | Hyperspectral image classification tasks |
| Convolution | MambaMIR [56], MambaMIR-GAN [56] | Medical image reconstruction tasks |
| Convolution | MambaIR [57] | Image restoration tasks |
| Convolution | VMambaMorph [53] | 3D image registration tasks |
| Convolution | FreqMamba [68] | Image deraining tasks |
| Convolution | Pan-Mamba [73] | Pan-sharpening tasks |
| Convolution | MambaTalk [74] | Gesture synthesis tasks |
| Convolution | Samba [75] | Image semantic segmentation tasks |
| Convolution | Semi-Mamba-UNet [52], Swin-UMamba [61], H-vmunet [55], UltraLight VM-UNet [62], Weak-Mamba-UNet [76], LMa-UNet [45], SegMamba [71], T-Mamba [77], Vivim [35], nnMamba [70], ProMamba [78] | Medical image segmentation tasks |
| Recurrence | VMRNN [47] | Video prediction tasks |
| Recurrence | VMambaMorph [53] | 3D image registration tasks |
| Attention | SSM-ViT [79] | Event camera-based tasks |
| Attention | MMA [41] | Image super-resolution tasks |
| Attention | ViS4mer [80] | Long movie clip classification tasks |
| Attention | FDVM-Net [81] | Image exposure correction tasks |
| Attention | CMViM [82] | 3D multi-modal representation tasks |
| Attention | Motion-Guided Dual-Camera Tracker [46] | Endoscopy skill evaluation tasks |
| Attention | MambaIR [57] | Image restoration tasks |
| Attention | FreqMamba [68] | Image deraining tasks |
| Attention | 3DMambaComplete [83] | Point cloud completion tasks |
| Attention | VM-UNET-V2 [64], Weak-Mamba-UNet [76], UltraLight VM-UNet [62], ProMamba [78] | Medical image segmentation tasks |
| U-Net | U-Mamba [84], UVM-Net [85], Mamba-UNet [51], TM-UNet [60], Semi-Mamba-UNet [52], Swin-UMamba [61], Weak-Mamba-UNet [76], LMa-UNet [45], LightM-UNet [86], UltraLight VM-UNet [62], VM-UNET-V2 [64], H-vmunet [55], Mamba-HUNet [59], VM-UNet [63] | Medical image tasks |
| U-Net | MambaMIR-GAN [56] | Medical image reconstruction tasks |
| U-Net | VmambaIR [32] | Image restoration tasks |
| U-Net | Motion Mamba [34] | Generation tasks |
| U-Net | MambaMorph [87] | Multi-modality registration tasks |
| U-Net | FreqMamba [68] | Image deraining tasks |
| U-Net | RS-Mamba [69] | Dense image prediction tasks |
| Diffusion | DiS [88], ZigMa [31], Motion Mamba [34], SSM-based diffusion model [89] | Generation tasks |
| Diffusion | MD-Dose [90] | Radiation dose prediction tasks |
3.3.2. Mamba with Recurrence
3.3.3. Mamba with Attention
3.3.4. Others
3.4. Comparison of Mamba Models and Other State-of-the-Art Models
3.4.1. Analysis and Comparison in Image Classification Tasks
3.4.2. Analysis and Comparison in Object Detection and Instance Segmentation Tasks
| Model | Backbone | Image Size | Params (M) | FLOPs (G) | Top-1 ACC (%) |
|---|---|---|---|---|---|
| CNN | ResNet-50 [98] | | 25.5 | 4.1 | 76.50 |
| CNN | ResNet-50-D [99] | | 25.0 | 4.3 | 77.16 |
| CNN | ResNet-101 [98] | | 44.6 | 7.8 | 77.4 |
| CNN | ResNet-152 [98] | | 60.2 | 11.6 | 78.3 |
| CNN | ResNeXt-50-d [100] | | 25 | 4.1 | 77.8 |
| CNN | ResNeXt-101-d [100] | | 44 | 7.8 | 78.8 |
| CNN | RegNetY-4G [94] | | 21 | 4.0 | 80.0 |
| CNN | RegNetY-8G [94] | | 39 | 8.0 | 81.7 |
| Transformer | ViT-B/16 [95] | | 86 | 55.4 | 77.9 |
| Transformer | ViT-L/16 [95] | | 307 | 190.7 | 76.5 |
| Transformer | DeiT-S [101] | | 22 | 4.6 | 79.8 |
| Transformer | DeiT-B [101] | | 86 | 17.6 | 81.8 |
| Transformer | DeiT-B [101] | | 86 | 55.4 | 83.1 |
| Transformer | Swin-T [91] | | 29 | 4.5 | 81.3 |
| Transformer | Swin-S [91] | | 50 | 8.7 | 83.0 |
| Transformer | Swin-B [91] | | 88 | 15.4 | 83.5 |
| Transformer | Swin-B [91] | | 88 | 47.0 | 84.5 |
| Transformer | ViL-Small-APE [96] | | 24.6 | 4.9 | 82.0 |
| Transformer | ViL-Small-RPB [96] | | 24.6 | 4.9 | 82.4 |
| Transformer | ViL-Medium-APE [96] | | 39.7 | 8.7 | 83.3 |
| Transformer | ViL-Medium-RPB [96] | | 39.7 | 8.7 | 83.5 |
| Transformer | ViL-Base-APE [96] | | 55.7 | 13.4 | 83.2 |
| Transformer | ViL-Base-RPB [96] | | 55.7 | 13.4 | 83.7 |
| Transformer | Focal-Tiny [97] | | 29.1 | 4.9 | 82.2 |
| Transformer | Focal-Small [97] | | 51.1 | 9.1 | 83.5 |
| Transformer | Focal-Base [97] | | 89.8 | 16.0 | 83.8 |
| Mamba | Vim-Ti [26] | | 7 | - | 76.1 |
| Mamba | Vim-S [26] | | 26 | - | 80.5 |
| Mamba | VMamba-T [27] | | 22 | 4.5 | 82.2 |
| Mamba | VMamba-S [27] | | 44 | 9.1 | 83.5 |
| Mamba | VMamba-B [27] | | 75 | 15.2 | 83.2 |
| Mamba | PlainMamba-L1 [28] | | 7 | 3.0 | 77.9 |
| Mamba | PlainMamba-L2 [28] | | 25 | 8.1 | 81.6 |
| Mamba | PlainMamba-L3 [28] | | 50 | 14.4 | 82.3 |
| Mamba | LocalVim-T [29] | | 8 | 1.5 | 76.2 |
| Mamba | LocalVim-S [29] | | 28 | 4.8 | 81.2 |
| Mamba | LocalVMamba-T [29] | | 26 | 5.7 | 82.7 |
| Mamba | LocalVMamba-S [29] | | 50 | 11.4 | 83.7 |
| Mamba | EfficientVMamba-T [30] | | 6 | 0.8 | 76.5 |
| Mamba | EfficientVMamba-S [30] | | 11 | 1.3 | 78.7 |
| Mamba | EfficientVMamba-B [30] | | 33 | 4.0 | 81.8 |
| Mamba | Mamba-2D-S [38] | | 24 | - | 81.7 |
| Mamba | Mamba-2D-B [38] | | 92 | - | 83.0 |
| Mamba | SiMBA-S (Monarch) [39] | | 18.5 | 3.6 | 81.1 |
| Mamba | SiMBA-S (EinFFT) [39] | | 15.3 | 2.4 | 81.7 |
| Mamba | SiMBA-S (MLP) [39] | | 26.5 | 5.0 | 84.0 |
| Mamba | SiMBA-B (Monarch) [39] | | 26.9 | 5.5 | 82.6 |
| Mamba | SiMBA-B (EinFFT) [39] | | 22.8 | 4.2 | 83.0 |
| Mamba | SiMBA-B (MLP) [39] | | 40.0 | 9.0 | 84.7 |
| Mamba | SiMBA-L (Monarch) [39] | | 42 | 8.7 | 83.8 |
| Mamba | SiMBA-L (EinFFT) [39] | | 36.6 | 7.6 | 83.9 |
| Model | Backbone | AP^b | AP^b_50 | AP^b_75 | AP^m | AP^m_50 | AP^m_75 | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|---|---|---|---|
| CNN | ResNet-50 [98] | 38.2 | 58.8 | 41.4 | 34.7 | 55.7 | 37.2 | 44 | 260 |
| CNN | ResNet-101 [98] | 38.2 | 58.8 | 41.4 | 34.7 | 55.7 | 37.2 | 63 | 336 |
| CNN | ResNeXt101-d [100] | 41.9 | - | - | 37.5 | - | - | 63 | 340 |
| CNN | ResNeXt101-d [100] | 42.8 | - | - | 38.4 | - | - | 102 | 493 |
| Transformer | ViT-Adapter-T [102] | 41.1 | 62.5 | 44.3 | 37.5 | 59.7 | 39.9 | 28.1 | - |
| Transformer | ViT-Adapter-S [102] | 44.7 | 65.8 | 48.3 | 39.9 | 62.5 | 42.8 | 47.8 | - |
| Transformer | ViT-Adapter-B [102] | 47.0 | 68.2 | 51.4 | 41.8 | 65.1 | 44.9 | 120.2 | - |
| Transformer | Swin-Tiny [91] | 42.2 | - | - | 39.1 | - | - | 48 | 264 |
| Transformer | Swin-Small [91] | 44.8 | - | - | 40.9 | - | - | 69 | 354 |
| Transformer | PVT-Tiny [103] | 36.7 | 59.2 | 39.3 | 35.1 | 56.7 | 37.3 | 32.9 | - |
| Transformer | PVT-Small [103] | 40.4 | 62.9 | 43.8 | 37.8 | 60.1 | 40.3 | 44.1 | - |
| Transformer | PVT-Medium [103] | 42.0 | 64.4 | 45.6 | 39.0 | 61.6 | 42.1 | 63.9 | - |
| Transformer | PVT-Large [103] | 42.9 | 65.0 | 46.6 | 39.5 | 61.9 | 42.5 | 81.0 | - |
| Mamba | VMamba-T [27] | 46.5 | 68.5 | 50.7 | 42.1 | 65.5 | 45.3 | 42 | 262 |
| Mamba | VMamba-S [27] | 48.2 | 69.7 | 52.5 | 43.0 | 66.6 | 46.4 | 64 | 357 |
| Mamba | VMamba-B [27] | 48.5 | 69.6 | 53.0 | 43.1 | 67.0 | 46.4 | 96 | 482 |
| Mamba | PlainMamba-Adapter-L1 [28] | 44.1 | 64.8 | 47.9 | 39.1 | 61.6 | 41.9 | 31 | 388 |
| Mamba | PlainMamba-Adapter-L2 [28] | 46.0 | 66.9 | 50.1 | 40.6 | 63.8 | 43.6 | 53 | 542 |
| Mamba | PlainMamba-Adapter-L3 [28] | 46.8 | 68.0 | 51.1 | 41.2 | 64.7 | 43.9 | 79 | 696 |
| Mamba | EfficientVMamba-T [30] | 35.6 | 57.7 | 38.0 | 33.2 | 54.4 | 35.1 | 11 | 60 |
| Mamba | EfficientVMamba-S [30] | 39.3 | 61.8 | 42.6 | 36.7 | 58.9 | 39.2 | 31 | 197 |
| Mamba | EfficientVMamba-B [30] | 43.7 | 66.2 | 47.9 | 40.2 | 63.3 | 42.9 | 53 | 252 |
| Mamba | LocalVMamba-T [29] | 46.7 | 68.7 | 50.8 | 42.2 | 65.7 | 45.5 | 45 | 291 |
| Mamba | LocalVMamba-S [29] | 48.4 | 69.9 | 52.7 | 43.2 | 66.7 | 46.5 | 69 | 414 |
| Mamba | SiMBA-S [39] | 46.9 | 68.6 | 51.7 | 42.6 | 65.9 | 45.8 | 60 | 382 |
| Model | Backbone | AP^b | AP^b_50 | AP^b_75 | AP^m | AP^m_50 | AP^m_75 | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|---|---|---|---|
| CNN | ConvNeXt-T [104] | 46.2 | 67.9 | 50.8 | 41.7 | 65.0 | 44.9 | 48 | 262 |
| Transformer | Swin-T [91] | 50.5 | 69.3 | 54.9 | 43.7 | 66.6 | 47.1 | 86 | 745 |
| Transformer | Swin-S [91] | 51.8 | 70.4 | 56.3 | 44.7 | 67.9 | 48.5 | 107 | 838 |
| Transformer | Swin-B [91] | 51.9 | 70.9 | 56.5 | 45.0 | 68.4 | 48.7 | 145 | 982 |
| Transformer | ViT-Adapter-T [102] | 46.0 | 67.6 | 50.4 | 41.0 | 64.4 | 44.1 | 28.1 | - |
| Transformer | ViT-Adapter-S [102] | 48.2 | 69.7 | 52.5 | 42.8 | 66.4 | 45.9 | 47.8 | - |
| Transformer | ViT-Adapter-B [102] | 49.6 | 70.6 | 54.0 | 43.6 | 67.7 | 46.9 | 120.2 | - |
| Transformer | PVT-Tiny [103] | 39.8 | 62.2 | 43.0 | 37.4 | 59.3 | 39.9 | 32.9 | - |
| Transformer | PVT-Small [103] | 43.0 | 65.3 | 46.9 | 39.9 | 62.5 | 42.8 | 44.1 | - |
| Transformer | PVT-Medium [103] | 44.2 | 66.0 | 48.2 | 40.5 | 63.1 | 43.5 | 63.9 | - |
| Transformer | PVT-Large [103] | 44.5 | 66.0 | 48.3 | 40.7 | 63.4 | 43.7 | 81.0 | - |
| Transformer | ViL-Tiny-RPB [96] | 44.2 | 66.4 | 48.2 | 40.6 | 63.2 | 44.0 | 26.9 | 199 |
| Transformer | ViL-Small-RPB [96] | 47.1 | 68.7 | 51.5 | 42.7 | 65.9 | 46.2 | 45.0 | 277 |
| Transformer | ViL-Medium-RPB [96] | 48.9 | 70.3 | 54.0 | 44.2 | 67.9 | 47.7 | 60.1 | 352 |
| Transformer | ViL-Base-RPB [96] | 49.6 | 70.7 | 54.6 | 44.5 | 68.3 | 48.0 | 76.1 | 439 |
| Transformer | Focal-Tiny [97] | 47.2 | 69.4 | 51.9 | 42.7 | 66.5 | 45.9 | 48.8 | 291 |
| Transformer | Focal-Small [97] | 48.8 | 70.5 | 53.6 | 43.8 | 67.7 | 47.2 | 71.2 | 401 |
| Transformer | Focal-Base [97] | 49.0 | 70.1 | 53.6 | 43.7 | 67.6 | 47.0 | 110.0 | 533 |
| Mamba | VMamba-T [27] | 48.5 | 69.9 | 52.9 | 43.2 | 66.8 | 46.3 | 42 | 262 |
| Mamba | VMamba-S [27] | 49.7 | 70.4 | 54.2 | 44.0 | 67.6 | 47.3 | 64 | 357 |
| Mamba | LocalVMamba-T [29] | 48.7 | 70.1 | 53.0 | 43.4 | 67.0 | 46.4 | 45 | 291 |
| Mamba | LocalVMamba-S [29] | 49.9 | 70.5 | 54.4 | 44.1 | 67.8 | 47.4 | 69 | 414 |
3.4.3. Analysis and Comparison in Semantic Segmentation Tasks
| Model | Backbone | Image Size | Params (M) | FLOPs (G) | mIoU (SS) | mIoU (MS) |
|---|---|---|---|---|---|---|
| CNN | ResNet-50 [98] | | 67 | 953 | 42.1 | 42.8 |
| CNN | ResNet-101 [98] | | 85 | 1030 | 42.9 | 44.0 |
| CNN | ConvNeXt-T [104] | | 60 | 939 | 46.0 | 46.7 |
| CNN | ConvNeXt-S [104] | | 82 | 1027 | 48.7 | 49.6 |
| CNN | ConvNeXt-B [104] | | 122 | 1170 | 49.1 | 49.9 |
| Transformer | Swin-T [91] | | 60 | 945 | 44.4 | 45.8 |
| Transformer | Swin-S [91] | | 81 | 1039 | 47.6 | 49.5 |
| Transformer | Swin-B [91] | | 121 | 1188 | 48.1 | 49.7 |
| Transformer | Focal-T [97] | | 62 | 998 | 45.8 | 47.0 |
| Transformer | Focal-S [97] | | 85 | 1130 | 48.0 | 50.0 |
| Transformer | Focal-B [97] | | 126 | 1354 | 49.0 | 50.5 |
| Transformer | DeiT-S + MLN [105] | | 58 | 1217 | 43.8 | 45.1 |
| Transformer | DeiT-B + MLN [105] | | 144 | 2007 | 45.5 | 47.2 |
| Mamba | Vim-Ti [26] | | 13 | - | 41.0 | - |
| Mamba | Vim-S [26] | | 46 | - | 44.9 | - |
| Mamba | VMamba-T [27] | | 55 | 939 | 47.3 | 48.3 |
| Mamba | VMamba-S [27] | | 76 | 1037 | 49.5 | 50.5 |
| Mamba | VMamba-B [27] | | 110 | 1167 | 50.0 | 51.3 |
| Mamba | VMamba-S [27] | | 76 | 1620 | 50.8 | 50.8 |
| Mamba | PlainMamba-L1 [28] | | 35 | 174 | 44.1 | - |
| Mamba | PlainMamba-L2 [28] | | 55 | 285 | 46.8 | - |
| Mamba | PlainMamba-L3 [28] | | 81 | 419 | 49.1 | - |
| Mamba | LocalVim-T [29] | | 36 | 181 | 43.4 | 44.4 |
| Mamba | LocalVim-S [29] | | 58 | 297 | 46.4 | 47.5 |
| Mamba | LocalVMamba-T [29] | | 57 | 970 | 47.9 | 49.1 |
| Mamba | LocalVMamba-S [29] | | 81 | 1095 | 50.0 | 51.0 |
| Mamba | EfficientVMamba-T [30] | | 14 | 230 | 38.9 | 39.3 |
| Mamba | EfficientVMamba-S [30] | | 29 | 505 | 41.5 | 42.1 |
| Mamba | EfficientVMamba-B [30] | | 65 | 930 | 46.5 | 47.3 |
| Mamba | SiMBA-S [39] | | 62 | 1040 | 49.0 | 49.6 |
4. Visual Mamba in Application Fields
4.1. General Visual Mamba
4.1.1. High/Mid-Level Vision
4.1.2. Low-Level Vision
4.2. Medical Visual Mamba
4.2.1. Two-Dimensional Medical Images
4.2.2. Three-Dimensional Medical Images
4.2.3. Challenge
4.3. Remote Sensing Image
5. Conclusions
5.1. Challenges and Limitations
5.2. Future Directions
Author Contributions
Funding
Conflicts of Interest
References
- Rosenblatt, F. The Perceptron, a Perceiving and Recognizing Automaton Project Para; Cornell Aeronautical Laboratory: Buffalo, NY, USA, 1957. [Google Scholar]
- Rosenblatt, F. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms; Spartan Books: Washington, DC, USA, 1962; Volume 55. [Google Scholar]
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 84–90. [Google Scholar] [CrossRef]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
- Parikh, A.P.; Täckström, O.; Das, D.; Uszkoreit, J. A decomposable attention model for natural language inference. arXiv 2016, arXiv:1606.01933. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
- Lieber, O.; Lenz, B.; Bata, H.; Cohen, G.; Osin, J.; Dalmedigos, I.; Safahi, E.; Meirom, S.; Belinkov, Y.; Shalev-Shwartz, S.; et al. Jamba: A Hybrid Transformer-Mamba Language Model. arXiv 2024, arXiv:2403.19887. [Google Scholar]
- Pióro, M.; Ciebiera, K.; Król, K.; Ludziejewski, J.; Jaszczur, S. Moe-mamba: Efficient selective state space models with mixture of experts. arXiv 2024, arXiv:2401.04081. [Google Scholar]
- Anthony, Q.; Tokpanov, Y.; Glorioso, P.; Millidge, B. BlackMamba: Mixture of Experts for State-Space Models. arXiv 2024, arXiv:2402.01771. [Google Scholar]
- Fu, D.Y.; Dao, T.; Saab, K.K.; Thomas, A.W.; Rudra, A.; Ré, C. Hungry hungry hippos: Towards language modeling with state space models. arXiv 2022, arXiv:2212.14052. [Google Scholar]
- Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
- Ramachandran, P.; Zoph, B.; Le, Q.V. Swish: A Self-Gated Activation Function. arXiv 2017, arXiv:1710.05941. [Google Scholar]
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
- Sun, Y.; Dong, L.; Huang, S.; Ma, S.; Xia, Y.; Xue, J.; Wang, J.; Wei, F. Retentive network: A Successor to Transformer for Large Language Models. arXiv 2023, arXiv:2307.08621. [Google Scholar]
- Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 13–18 July 2020; pp. 5156–5165. [Google Scholar]
- Poli, M.; Massaroli, S.; Nguyen, E.; Fu, D.Y.; Dao, T.; Baccus, S.; Bengio, Y.; Ermon, S.; Ré, C. Hyena hierarchy: Towards larger convolutional language models. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 28043–28078. [Google Scholar]
- Romero, D.W.; Kuzina, A.; Bekkers, E.J.; Tomczak, J.M.; Hoogendoorn, M. Ckconv: Continuous kernel convolution for sequential data. arXiv 2021, arXiv:2102.02611. [Google Scholar]
- Zhai, S.; Talbott, W.; Srivastava, N.; Huang, C.; Goh, H.; Zhang, R.; Susskind, J. An attention free transformer. arXiv 2021, arXiv:2105.14103. [Google Scholar]
- Peng, B.; Alcaide, E.; Anthony, Q.; Albalak, A.; Arcadinho, S.; Cao, H.; Cheng, X.; Chung, M.; Grella, M.; GV, K.K.; et al. Rwkv: Reinventing rnns for the transformer era. arXiv 2023, arXiv:2305.13048. [Google Scholar]
- Tallec, C.; Ollivier, Y. Can recurrent neural networks warp time? arXiv 2018, arXiv:1804.11188. [Google Scholar]
- Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar]
- Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. Vmamba: Visual state space model. arXiv 2024, arXiv:2401.10166. [Google Scholar]
- Yang, C.; Chen, Z.; Espinosa, M.; Ericsson, L.; Wang, Z.; Liu, J.; Crowley, E.J. PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition. arXiv 2024, arXiv:2403.17695. [Google Scholar]
- Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. LocalMamba: Visual State Space Model with Windowed Selective Scan. arXiv 2024, arXiv:2403.09338. [Google Scholar]
- Pei, X.; Huang, T.; Xu, C. EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba. arXiv 2024, arXiv:2403.09977. [Google Scholar]
- Hu, V.T.; Baumann, S.A.; Gui, M.; Grebenkova, O.; Ma, P.; Fischer, J.; Ommer, B. Zigma: Zigzag mamba diffusion model. arXiv 2024, arXiv:2403.13802. [Google Scholar]
- Shi, Y.; Xia, B.; Jin, X.; Wang, X.; Zhao, T.; Xia, X.; Xiao, X.; Yang, W. VmambaIR: Visual State Space Model for Image Restoration. arXiv 2024, arXiv:2403.11423. [Google Scholar]
- Li, K.; Li, X.; Wang, Y.; He, Y.; Wang, Y.; Wang, L.; Qiao, Y. Videomamba: State space model for efficient video understanding. arXiv 2024, arXiv:2403.06977. [Google Scholar]
- Zhang, Z.; Liu, A.; Reid, I.; Hartley, R.; Zhuang, B.; Tang, H. Motion mamba: Efficient and long sequence motion generation with hierarchical and bidirectional selective ssm. arXiv 2024, arXiv:2403.07487. [Google Scholar]
- Yang, Y.; Xing, Z.; Zhu, L. Vivim: A video vision mamba for medical video object segmentation. arXiv 2024, arXiv:2401.14168. [Google Scholar]
- Chen, K.; Chen, B.; Liu, C.; Li, W.; Zou, Z.; Shi, Z. Rsmamba: Remote sensing image classification with state space model. arXiv 2024, arXiv:2403.19654. [Google Scholar] [CrossRef]
- Behrouz, A.; Santacatterina, M.; Zabih, R. MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection. arXiv 2024, arXiv:2403.19888. [Google Scholar]
- Li, S.; Singh, H.; Grover, A. Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data. arXiv 2024, arXiv:2402.05892. [Google Scholar]
- Patro, B.N.; Agneeswaran, V.S. SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series. arXiv 2024, arXiv:2403.15360. [Google Scholar]
- Li, S.; Zhu, T.; Duan, F.; Chen, L.; Ning, H.; Wan, Y. HARMamba: Efficient Wearable Sensor Human Activity Recognition Based on Bidirectional Selective SSM. arXiv 2024, arXiv:2403.20183. [Google Scholar]
- Cheng, C.; Wang, H.; Sun, H. Activating Wider Areas in Image Super-Resolution. arXiv 2024, arXiv:2403.08330. [Google Scholar]
- Qiao, Y.; Yu, Z.; Guo, L.; Chen, S.; Zhao, Z.; Sun, M.; Wu, Q.; Liu, J. VL-Mamba: Exploring State Space Models for Multimodal Learning. arXiv 2024, arXiv:2403.13600. [Google Scholar]
- Chen, G.; Huang, Y.; Xu, J.; Pei, B.; Chen, Z.; Li, Z.; Wang, J.; Li, K.; Lu, T.; Wang, L. Video mamba suite: State space model as a versatile alternative for video understanding. arXiv 2024, arXiv:2403.09626. [Google Scholar]
- Liu, J.; Yu, R.; Wang, Y.; Zheng, Y.; Deng, T.; Ye, W.; Wang, H. Point mamba: A novel point cloud backbone based on state space model with octree-based ordering strategy. arXiv 2024, arXiv:2403.06467. [Google Scholar]
- Wang, J.; Chen, J.; Chen, D.; Wu, J. Large Window-based Mamba UNet for Medical Image Segmentation: Beyond Convolution and Self-attention. arXiv 2024, arXiv:2403.07332. [Google Scholar]
- Zhang, Y.; Yan, W.; Yan, K.; Lam, C.P.; Qiu, Y.; Zheng, P.; Tang, R.S.Y.; Cheng, S.S. Motion-Guided Dual-Camera Tracker for Low-Cost Skill Evaluation of Gastric Endoscopy. arXiv 2024, arXiv:2403.05146. [Google Scholar]
- Tang, Y.; Dong, P.; Tang, Z.; Chu, X.; Liang, J. VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting. arXiv 2024, arXiv:2403.16536. [Google Scholar]
- Chen, C.S.; Chen, G.Y.; Zhou, D.; Jiang, D.; Chen, D.S. Res-VMamba: Fine-Grained Food Category Visual Classification Using Selective State Space Models with Deep Residual Learning. arXiv 2024, arXiv:2402.15761. [Google Scholar]
- Wan, Z.; Wang, Y.; Yong, S.; Zhang, P.; Stepputtis, S.; Sycara, K.; Xie, Y. Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation. arXiv 2024, arXiv:2404.04256. [Google Scholar]
- Yang, Y.; Ma, C.; Yao, J.; Zhong, Z.; Zhang, Y.; Wang, Y. ReMamber: Referring Image Segmentation with Mamba Twister. arXiv 2024, arXiv:2403.17839. [Google Scholar]
- Wang, Z.; Zheng, J.Q.; Zhang, Y.; Cui, G.; Li, L. Mamba-unet: Unet-like pure visual mamba for medical image segmentation. arXiv 2024, arXiv:2402.05079. [Google Scholar]
- Ma, C.; Wang, Z. Semi-Mamba-UNet: Pixel-Level Contrastive and Pixel-Level Cross-Supervised Visual Mamba-based UNet for Semi-Supervised Medical Image Segmentation. arXiv 2024, arXiv:2402.07245. [Google Scholar]
- Wang, Z.; Zheng, J.Q.; Ma, C.; Guo, T. VMambaMorph: A Visual Mamba-based Framework with Cross-Scan Module for Deformable 3D Image Registration. arXiv 2024, arXiv:2404.05105. [Google Scholar]
- Chen, H.; Song, J.; Han, C.; Xia, J.; Yokoya, N. ChangeMamba: Remote Sensing Change Detection with Spatio-Temporal State Space Model. arXiv 2024, arXiv:2404.03425. [Google Scholar] [CrossRef]
- Wu, R.; Liu, Y.; Liang, P.; Chang, Q. H-vmunet: High-order Vision Mamba UNet for Medical Image Segmentation. arXiv 2024, arXiv:2403.13642. [Google Scholar]
- Huang, J.; Yang, L.; Wang, F.; Wu, Y.; Nan, Y.; Aviles-Rivero, A.I.; Schönlieb, C.B.; Zhang, D.; Yang, G. MambaMIR: An Arbitrary-Masked Mamba for Joint Medical Image Reconstruction and Uncertainty Estimation. arXiv 2024, arXiv:2402.18451. [Google Scholar]
- Guo, H.; Li, J.; Dai, T.; Ouyang, Z.; Ren, X.; Xia, S.T. MambaIR: A Simple Baseline for Image Restoration with State-Space Model. arXiv 2024, arXiv:2402.15648. [Google Scholar]
- Shahab Sepehri, M.; Fabian, Z.; Soltanolkotabi, M. Serpent: Scalable and Efficient Image Restoration via Multi-scale Structured State Space Models. arXiv 2024, arXiv:2403.17902. [Google Scholar]
- Sanjid, K.S.; Hossain, M.T.; Junayed, M.S.S.; Uddin, D.M.M. Integrating Mamba Sequence Model and Hierarchical Upsampling Network for Accurate Semantic Segmentation of Multiple Sclerosis Legion. arXiv 2024, arXiv:2403.17432. [Google Scholar]
- Tang, H.; Cheng, L.; Huang, G.; Tan, Z.; Lu, J.; Wu, K. Rotate to Scan: UNet-like Mamba with Triplet SSM Module for Medical Image Segmentation. arXiv 2024, arXiv:2403.17701. [Google Scholar]
- Liu, J.; Yang, H.; Zhou, H.Y.; Xi, Y.; Yu, L.; Yu, Y.; Liang, Y.; Shi, G.; Zhang, S.; Zheng, H.; et al. Swin-umamba: Mamba-based unet with imagenet-based pretraining. arXiv 2024, arXiv:2402.03302. [Google Scholar]
- Wu, R.; Liu, Y.; Liang, P.; Chang, Q. UltraLight VM-UNet: Parallel Vision Mamba Significantly Reduces Parameters for Skin Lesion Segmentation. arXiv 2024, arXiv:2403.20035. [Google Scholar]
- Ruan, J.; Xiang, S. Vm-unet: Vision mamba unet for medical image segmentation. arXiv 2024, arXiv:2402.02491. [Google Scholar]
- Zhang, M.; Yu, Y.; Gu, L.; Lin, T.; Tao, X. VM-UNET-V2 Rethinking Vision Mamba UNet for Medical Image Segmentation. arXiv 2024, arXiv:2403.09157. [Google Scholar]
- Yue, Y.; Li, Z. MedMamba: Vision Mamba for Medical Image Classification. arXiv 2024, arXiv:2403.03849. [Google Scholar]
- Chen, T.; Tan, Z.; Gong, T.; Chu, Q.; Wu, Y.; Liu, B.; Ye, J.; Yu, N. MiM-ISTD: Mamba-in-Mamba for Efficient Infrared Small Target Detection. arXiv 2024, arXiv:2403.02148. [Google Scholar]
- Ma, X.; Zhang, X.; Pun, M.O. RS3Mamba: Visual State Space Model for Remote Sensing Images Semantic Segmentation. arXiv 2024, arXiv:2404.02457. [Google Scholar] [CrossRef]
- Zhen, Z.; Hu, Y.; Feng, Z. FreqMamba: Viewing Mamba from a Frequency Perspective for Image Deraining. arXiv 2024, arXiv:2404.09476. [Google Scholar]
- Zhao, S.; Chen, H.; Zhang, X.; Xiao, P.; Bai, L.; Ouyang, W. RS-Mamba for Large Remote Sensing Image Dense Prediction. arXiv 2024, arXiv:2404.02668. [Google Scholar]
- Gong, H.; Kang, L.; Wang, Y.; Wan, X.; Li, H. nnmamba: 3d biomedical image segmentation, classification and landmark detection with state space model. arXiv 2024, arXiv:2402.03526. [Google Scholar]
- Xing, Z.; Ye, T.; Yang, Y.; Liu, G.; Zhu, L. Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation. arXiv 2024, arXiv:2401.13560. [Google Scholar]
- Yang, J.X.; Zhou, J.; Wang, J.; Tian, H.; Liew, A.W.C. Hsimamba: Hyperpsectral imaging efficient feature learning with bidirectional state space for classification. arXiv 2024, arXiv:2404.00272. [Google Scholar]
- He, X.; Cao, K.; Yan, K.; Li, R.; Xie, C.; Zhang, J.; Zhou, M. Pan-Mamba: Effective pan-sharpening with State Space Model. arXiv 2024, arXiv:2402.12192. [Google Scholar]
- Xu, Z.; Lin, Y.; Han, H.; Yang, S.; Li, R.; Zhang, Y.; Li, X. MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models. arXiv 2024, arXiv:2403.09471. [Google Scholar]
- Zhu, Q.; Cai, Y.; Fang, Y.; Yang, Y.; Chen, C.; Fan, L.; Nguyen, A. Samba: Semantic Segmentation of Remotely Sensed Images with State Space Model. arXiv 2024, arXiv:2404.01705. [Google Scholar]
- Wang, Z.; Ma, C. Weak-Mamba-UNet: Visual Mamba Makes CNN and ViT Work Better for Scribble-based Medical Image Segmentation. arXiv 2024, arXiv:2402.10887. [Google Scholar]
- Hao, J.; He, L.; Hung, K.F. T-Mamba: Frequency-Enhanced Gated Long-Range Dependency for Tooth 3D CBCT Segmentation. arXiv 2024, arXiv:2404.01065. [Google Scholar]
- Xie, J.; Liao, R.; Zhang, Z.; Yi, S.; Zhu, Y.; Luo, G. ProMamba: Prompt-Mamba for polyp segmentation. arXiv 2024, arXiv:2403.13660. [Google Scholar]
- Zubić, N.; Gehrig, M.; Scaramuzza, D. State Space Models for Event Cameras. arXiv 2024, arXiv:2402.15584. [Google Scholar]
- Islam, M.M.; Bertasius, G. Long movie clip classification with state-space video models. In Proceedings of the European Conference on Computer Vision, Springer, Tel Aviv, Israel, 23–27 October 2022; pp. 87–104. [Google Scholar]
- Zheng, Z.; Zhang, J. FD-Vision Mamba for Endoscopic Exposure Correction. arXiv 2024, arXiv:2402.06378. [Google Scholar]
- Yang, G.; Du, K.; Yang, Z.; Du, Y.; Zheng, Y.; Wang, S. CMViM: Contrastive Masked Vim Autoencoder for 3D Multi-modal Representation Learning for AD classification. arXiv 2024, arXiv:2403.16520. [Google Scholar]
- Li, Y.; Yang, W.; Fei, B. 3DMambaComplete: Exploring Structured State Space Model for Point Cloud Completion. arXiv 2024, arXiv:2404.07106. [Google Scholar]
- Ma, J.; Li, F.; Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar]
- Zheng, Z.; Wu, C. U-shaped Vision Mamba for Single Image Dehazing. arXiv 2024, arXiv:2402.04139. [Google Scholar]
- Liao, W.; Zhu, Y.; Wang, X.; Pan, C.; Wang, Y.; Ma, L. Lightm-unet: Mamba assists in lightweight unet for medical image segmentation. arXiv 2024, arXiv:2403.05246. [Google Scholar]
- Guo, T.; Wang, Y.; Meng, C. Mambamorph: A mamba-based backbone with contrastive feature learning for deformable mr-ct registration. arXiv 2024, arXiv:2401.13934. [Google Scholar]
- Fei, Z.; Fan, M.; Yu, C.; Huang, J. Scalable Diffusion Models with State Space Backbone. arXiv 2024, arXiv:2402.05608. [Google Scholar]
- Oshima, Y.; Taniguchi, S.; Suzuki, M.; Matsuo, Y. SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces. arXiv 2024, arXiv:2403.07711. [Google Scholar]
- Fu, L.; Li, X.; Cai, X.; Wang, Y.; Wang, X.; Shen, Y.; Yao, Y. MD-Dose: A Diffusion Model based on the Mamba for Radiotherapy Dose Prediction. arXiv 2024, arXiv:2403.08479. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
- Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.c. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Adv. Neural Inf. Process. Syst. 2015, 28, 802–810. [Google Scholar]
- Li, W.; Hong, X.; Fan, X. SpikeMba: Multi-Modal Spiking Saliency Mamba for Temporal Video Grounding. arXiv 2024, arXiv:2404.01174. [Google Scholar]
- Radosavovic, I.; Kosaraju, R.P.; Girshick, R.; He, K.; Dollár, P. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10428–10436. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Zhang, P.; Dai, X.; Yang, J.; Xiao, B.; Yuan, L.; Zhang, L.; Gao, J. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2998–3008. [Google Scholar]
- Yang, J.; Li, C.; Zhang, P.; Dai, X.; Xiao, B.; Yuan, L.; Gao, J. Focal self-attention for local-global interactions in vision transformers. arXiv 2021, arXiv:2107.00641. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- He, T.; Zhang, Z.; Zhang, H.; Zhang, Z.; Xie, J.; Li, M. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 558–567. [Google Scholar]
- Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 10347–10357.
- Chen, Z.; Duan, Y.; Wang, W.; He, J.; Lu, T.; Dai, J.; Qiao, Y. Vision transformer adapter for dense predictions. arXiv 2022, arXiv:2205.08534.
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 568–578.
- Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986.
- Touvron, H.; Cord, M.; Jégou, H. Deit iii: Revenge of the vit. In Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022, Proceedings, Part XXIV; Springer: Cham, Switzerland, 2022; pp. 516–533.
- Zhao, H.; Zhang, M.; Zhao, W.; Ding, P.; Huang, S.; Wang, D. Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference. arXiv 2024, arXiv:2403.14520.
- Gao, H.; Dang, D. Aggregating Local and Global Features via Selective State Spaces Model for Efficient Image Deblurring. arXiv 2024, arXiv:2403.20106.
- Zhou, Q.; Yang, W.; Fei, B.; Xu, J.; Zhang, R.; Liu, K.; Luo, Y.; He, Y. 3DMambaIPF: A State Space Model for Iterative Point Cloud Filtering via Differentiable Rendering. arXiv 2024, arXiv:2404.05522.
- Zhang, T.; Li, X.; Yuan, H.; Ji, S.; Yan, S. Point Cloud Mamba: Point Cloud Learning via State Space Model. arXiv 2024, arXiv:2403.00762.
- Liang, D.; Zhou, X.; Wang, X.; Zhu, X.; Xu, W.; Zou, Z.; Ye, X.; Bai, X. PointMamba: A Simple State Space Model for Point Cloud Analysis. arXiv 2024, arXiv:2402.10739.
- Shen, Q.; Yi, X.; Wu, Z.; Zhou, P.; Zhang, H.; Yan, S.; Wang, X. Gamba: Marry Gaussian Splatting with Mamba for single view 3D reconstruction. arXiv 2024, arXiv:2403.18795.
- Seeram, E. Digital Radiography: Physical Principles and Quality Control; Springer: Singapore, 2019.
- Lui, R.N.; Wong, S.H.; Sánchez-Luna, S.A.; Pellino, G.; Bollipo, S.; Wong, M.Y.; Chiu, P.W.; Sung, J.J. Overview of guidance for endoscopy during the coronavirus disease 2019 pandemic. J. Gastroenterol. Hepatol. 2020, 35, 749–759.
- Withers, P.J.; Bouman, C.; Carmignato, S.; Cnudde, V.; Grimaldi, D.; Hagen, C.K.; Maire, E.; Manley, M.; Du Plessis, A.; Stock, S.R. X-ray computed tomography. Nat. Rev. Methods Prim. 2021, 1, 18.
- Christensen-Jeffries, K.; Couture, O.; Dayton, P.A.; Eldar, Y.C.; Hynynen, K.; Kiessling, F.; O’Reilly, M.; Pinton, G.F.; Schmitz, G.; Tang, M.X.; et al. Super-resolution ultrasound imaging. Ultrasound Med. Biol. 2020, 46, 865–891.
- Tiwari, A.; Srivastava, S.; Pant, M. Brain tumor segmentation and classification from magnetic resonance images: Review of selected methods from 2014 to 2019. Pattern Recognit. Lett. 2020, 131, 244–260.
- Ye, Z.; Chen, T. P-Mamba: Marrying Perona Malik Diffusion with Mamba for Efficient Pediatric Echocardiographic Left Ventricular Segmentation. arXiv 2024, arXiv:2402.08506.
- Yang, S.; Wang, Y.; Chen, H. MambaMIL: Enhancing Long Sequence Modeling with Sequence Reordering in Computational Pathology. arXiv 2024, arXiv:2403.06800.







| Category | Sub-Category | Method | Efficiency | Code |
|---|---|---|---|---|
| Backbone | Visual Mamba | Vision Mamba [26] | Params Vim-Ti: 7, Vim-S: 26 | ✓ |
| | | VMamba [27] | FLOPs Base: 15.2, Small: 9.1, Tiny: 4.5 | ✓ |
| | | PlainMamba [28] | FLOPs PlainMamba-L1: 3.0, PlainMamba-L2: 8.1, PlainMamba-L3: 14.4 | ✓ |
| | | LocalMamba [29] | FLOPs LocalVMamba-T: 5.7, LocalVMamba-S: 11.4 | ✓ |
| | | Mamba-ND [38] | Params Mamba-2D: 24, Mamba-3D: 36 | ✓ |
| | | SiMBA [39] | - | ✓ |
| | | RES-VMAMBA [48] | - | ✓ |
| | Efficient Mamba | EfficientVMamba [30] | - | ✓ |
| | | MambaMixer [37] | - | ✓ |
| High/Mid-level vision | Object detection | SSM-ViT [79] | Params 17.5 | ✗ |
| | Segmentation | ReMamber [50] | - | ✗ |
| | | Sigma [49] | - | ✓ |
| | Video classification | ViS4mer [80] | Memory 5273.6 | ✓ |
| | Video understanding | Video Mamba Suite [43] | - | ✓ |
| | | VideoMamba [33] | FLOPs VideoMamba-Ti: 7.1, VideoMamba-S: 28, VideoMamba-M: 83.1 | ✓ |
| | | SpikeMba [93] | - | ✗ |
| | Multi-Modal understanding | Cobra [106] | - | ✓ |
| | | ReMamber [50] | - | ✗ |
| | | VL-Mamba [42] | - | ✗ |
| | Video prediction | VMRNN [47] | Params 2.6, FLOPs 0.9 | ✓ |
| | | HARMamba [40] | FLOPs PAMAP2: 279.21, UCI: 237.83, UNIMIB HAR: 238.36, WISDM: 256.52 | ✗ |
| Low-level vision | Image super-resolution | MMA [41] | - | ✗ |
| | Image restoration | MambaIR [57] | Params 16.7 | ✓ |
| | | SERPENT [58] | - | ✗ |
| | | VmambaIR [32] | Params 10.50, FLOPs 20.5 | ✓ |
| | Image dehazing | UVM-Net [85] | Params 19.25 | ✓ |
| | Image derain | FreqMamba [68] | Params 14.52 | ✗ |
| | Image deblurring | ALGNet [107] | FLOPs 17 | ✗ |
| | Visual generation | MambaTalk [74] | - | ✗ |
| | | Motion Mamba [34] | - | ✓ |
| | | DiS [88] | - | ✓ |
| | | ZigMa [31] | - | ✓ |
| | Point cloud | 3DMambaComplete [83] | Params 34.06, FLOPs 7.12 | ✗ |
| | | 3DMambaIPF [108] | - | ✗ |
| | | Point Cloud Mamba [109] | Params 34.2, FLOPs 45.0 | ✗ |
| | | POINT MAMBA [44] | Memory 8550 | ✓ |
| | | SSPointMamba [110] | Params 12.3, FLOPs 3.6 | ✓ |
| | 3D reconstruction | GAMBA [111] | - | ✗ |
| | Video generation | SSM-based diffusion model [89] | - | ✓ |

| Category | Sub-Category | Method | Efficiency | Code |
|---|---|---|---|---|
| 2D | Segmentation | Mamba-UNet [51] | - | ✓ |
| | | H-vmunet [55] | Memory 0.676, Params 8.97 | ✓ |
| | | Mamba-HUNet [59] | - | ✗ |
| | | P-Mamba [117] | Inference speed 23.49, Memory 12.22, Params 183.37, FLOPs 71.81× | ✗ |
| | | ProMamba [78] | Params 102 | ✗ |
| | | TM-UNet [60] | Params 14.86, Total Params 8.41, FLOPs 3.42 | ✗ |
| | | Semi-Mamba-UNet [52] | - | ✓ |
| | | Swin-UMamba [61] | Params 28, FLOPs 18.9 | ✓ |
| | | UltraLight VM-UNet [62] | Params 0.049, GFLOPs 0.060 | ✓ |
| | | U-Mamba [84] | - | ✓ |
| | | VM-UNet [63] | Params 34.62, FLOPs 7.56, FPS 20.612 | ✓ |
| | | VM-UNET-V2 [64] | Params 17.91, FLOPs 4.40, FPS 32.58 | ✓ |
| | | Weak-Mamba-UNet [76] | - | ✓ |
| | Radiation dose prediction | MD-Dose [90] | Inference speed 18, Params 30.47 | ✓ |
| | Classification | MedMamba [65] | - | ✓ |
| | | MambaMIL [118] | - | ✓ |
| | Image reconstruction | MambaMIR/MambaMIR-GAN [56] | - | ✓ |
| | Exposure correction | FDVM-Net [81] | Inference speed 22.95 | ✓ |
| 3D | Segmentation | LMa-UNet [45] | - | ✓ |
| | | LightM-UNet [86] | Params 1.87, FLOPs 457.62× | ✓ |
| | | SegMamba [71] | Inference speed 151 | ✓ |
| | | T-Mamba [77] | - | ✓ |
| | | Vivim [35] | FPS 35.33 | ✓ |
| | Classification | CMViM [82] | Params 50 | ✓ |
| | Motion tracking | Motion-Guided Dual-Camera Tracker [46] | - | ✓ |
| | Backbone | nnMamba [70] | Params 15.55, FLOPs 141.14 | ✗ |
| | Image registration | VMambaMorph [53] | Inference speed 19, Memory 3.93, Params 9.64 | ✓ |
| | | MambaMorph [87] | Inference speed 27, Memory 7.60, Params 7.59 | ✓ |

| Category | Method | Highlight | Efficiency | Code |
|---|---|---|---|---|
| Pan-sharpening | Pan-Mamba [73] | channel swapping Mamba; cross-modal Mamba | Params 0.1827, FLOPs 3.0088 | ✓ |
| Infrared Small Target Detection | MIM-ISTD [66] | Mamba-in-Mamba architecture | Params 1.16, FLOPs 1.01, Inference speed 30, Memory 1774 | ✓ |
| Classification | RSMamba [36] | multi-path activation | - | ✓ |
| | HSIMamba [72] | process data bidirectionally | Memory 136.53 | ✓ |
| Image dense prediction | RS-Mamba [69] | omnidirectional selective scan | - | ✓ |
| Change detection | ChangeMamba [54] | cross-scan mechanism | - | ✓ |
| Semantic segmentation | RS3Mamba [67] | dual-branch network | FLOPs 31.65, Params 43.32, Memory 2332 | ✓ |
| | Samba [75] | encoder-decoder architecture | Params 51.9 | ✓ |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, H.; Zhu, Y.; Wang, D.; Zhang, L.; Chen, T.; Wang, Z.; Ye, Z. A Survey on Visual Mamba. Appl. Sci. 2024, 14, 5683. https://doi.org/10.3390/app14135683