Swin-MFA: A Multi-Modal Fusion Attention Network Based on Swin-Transformer for Low-Light Image Human Segmentation
Abstract
1. Introduction
2. Related Work
- Human bodies in low-illumination scenes are segmented in a multi-modal, multi-feature manner, using the fused information of the depth image and the RGB image as the basis for segmentation.
- A multi-modal, end-to-end segmentation network based on Swin-Transformer is proposed. It realizes end-to-end RGB and depth feature-fusion attention by combining Swin-Transformer features that are shown to be stable under changing lighting conditions. It can fully replace the traditional convolutional neural network and improves segmentation accuracy.
- To address the shortcomings of traditional image segmentation under low illumination, a modified and pre-processed human-body semantic segmentation dataset for low-light scenes (LLHS) with fine annotation is proposed. It is much larger in scale and scene variety than the previous dataset and fills the gap in semantic segmentation datasets for low-illuminance conditions.
3. Materials and Methods
3.1. Low Light Human Segmentation Dataset
- (1) The RGB camera and the depth camera sit at different physical locations, so their images lie in different spatial coordinate systems and do not match pixel for pixel. The RGB image and the depth image therefore have to be registered. The internal and external parameter matrices for the different scenes are obtained by calibrating the RGB camera and the depth camera separately. The transformation between the two coordinate systems is then calculated by Equation (1), where $p_{ir}$ is the coordinate of a pixel in the depth image before processing, $H_{ir}$ is the internal parameter matrix of the depth camera, and $H_{rgb}$ is the internal parameter matrix of the RGB camera. $R$ and $T$ are the rotation matrix and the shift vector, respectively, derived from the external parameter matrices; $R_{ir}$ ($R_{rgb}$) and $T_{ir}$ ($T_{rgb}$) are the rotation matrix and shift vector of the depth camera (RGB camera) in its external parameter matrix.
- (2) Because of the camera shooting angle and occlusion by objects, black gaps appear in the depth image and corrupt the edge information, so they have to be filled. The depth camera of the RealSense device is mounted on the left side, and the imaging algorithm takes the left camera as its reference. The five pixels above and the five pixels below, on the left side adjacent to a black gap, can therefore be used as the processing neighborhood for filling the gap. To preserve image-edge information, the filled pixels should carry background information rather than foreground information. The black gap is therefore filled with the value of the farthest point in the neighborhood, that is, the pixel with the largest value. The calculation is expressed in Equation (4), where $P(i,j)$ is the pixel value in the $i$-th row and $j$-th column of the filling kernel, and $P'(i,j)$ is the corresponding pixel value in the $i$-th row and $j$-th column of the processed image.
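The two preprocessing steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation. It assumes that each external parameter matrix maps world points to camera points ($P_{cam} = R_{cam} P_{world} + T_{cam}$), which gives the standard relative pose $R = R_{rgb} R_{ir}^{-1}$ and $T = T_{rgb} - R\,T_{ir}$, that the depth value is expressed in the same unit as the shift vectors, that a gap is a pixel with value 0, and that gaps are filled by scanning from left to right so that wide gaps inherit already-filled values. The function names `relative_pose`, `register_depth_pixel` and `fill_depth_gaps` are hypothetical.

```python
import numpy as np


def relative_pose(R_ir, T_ir, R_rgb, T_rgb):
    """Rotation R and shift T that take depth-camera coordinates to RGB-camera coordinates.

    Assumption: each extrinsic maps world to camera, P_cam = R_cam @ P_world + T_cam.
    """
    R = R_rgb @ np.linalg.inv(R_ir)
    T = T_rgb - R @ T_ir
    return R, T


def register_depth_pixel(u, v, z, H_ir, H_rgb, R, T):
    """Map depth-image pixel (u, v) with depth z to RGB-image pixel coordinates."""
    p_ir = np.array([u, v, 1.0]) * z      # homogeneous pixel scaled by its depth
    P_ir = np.linalg.inv(H_ir) @ p_ir     # 3-D point in the depth-camera frame
    P_rgb = R @ P_ir + T                  # same point in the RGB-camera frame
    p_rgb = H_rgb @ P_rgb                 # project with the RGB intrinsics
    return p_rgb[:2] / p_rgb[2]           # divide out the depth


def fill_depth_gaps(depth, half_height=5):
    """Fill zero-valued gaps with the largest value in the column to the left.

    The neighborhood is the left-adjacent column, from `half_height` rows above
    to `half_height` rows below the gap pixel (clipped at the image border).
    The largest value is the farthest point, so the fill carries background.
    """
    out = depth.copy()
    rows, cols = out.shape
    for i in range(rows):
        for j in range(1, cols):          # column 0 has no left neighbor
            if out[i, j] == 0:
                top = max(0, i - half_height)
                bottom = min(rows, i + half_height + 1)
                out[i, j] = out[top:bottom, j - 1].max()
    return out
```

As a quick check under these assumptions: with identity rotations, $T_{ir} = 0$, $T_{rgb} = (0.1, 0, 0)$, and both intrinsics set to focal length 500 with principal point (320, 240), the depth pixel (320, 240) at depth 2 maps to approximately (345, 240) in the RGB image.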
3.2. Swin-MFA
3.2.1. Swin-Transformer Base Backbone Network
3.2.2. Self-Attention Mechanism
3.2.3. Feature-Fusion Attention Mechanism
3.3. Loss Function
4. Results
4.1. Network Fusion Mechanism Experiments
4.2. Network Connections between Encoder and Decoder Experiments
4.3. Network Comparative Experiments
4.4. Experiment of the Combining Datasets of Different Light Intensities
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- [1] Liu, X.; Deng, Z.; Yang, Y. Recent progress in semantic image segmentation. Artif. Intell. Rev. 2019, 52, 1089–1106.
- [2] Wang, W.; Zhou, T.; Yu, F.; Dai, J.; Konukoglu, E.; Van Gool, L. Exploring cross-image pixel contrast for semantic segmentation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 7283–7293.
- [3] Otsu, N. A Threshold Selection Method from Gray-Level Histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66.
- [4] Yang, L.; Wu, X.Y.; Zhao, D.W.; Li, H.; Zhai, J. An improved Prewitt algorithm for edge detection based on noised image. In Proceedings of the 2011 4th International Congress on Image and Signal Processing, Shanghai, China, 15–17 October 2011; pp. 1197–1200.
- [5] Coates, A.; Ng, A.Y. Learning feature representations with K-means. Lect. Notes Comput. Sci. 2012, 7700, 561–580.
- [6] Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Susstrunk, S. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2274–2282.
- [7] Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651.
- [8] Wu, X.Y.; Wu, Z.Y.; Guo, H.; Ju, L.L.; Wang, S. DANNet: A One-Stage Domain Adaptation Network for Unsupervised Nighttime Semantic Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 15764–15773.
- [9] Sun, Y.X.; Zuo, W.X.; Liu, M. RTFNet: RGB-Thermal Fusion Network for Semantic Segmentation of Urban Scenes. IEEE Robot. Autom. Lett. 2019, 4, 2576–2583.
- [10] Chen, X.; Lin, K.Y.; Wang, J.; Wu, W.; Qian, C.; Li, H.; Zeng, G. Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 561–577.
- [11] Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241.
- [12] Zhao, H.S.; Qi, X.J.; Shen, X.Y.; Shi, J.; Jia, J. ICNet for real-time semantic segmentation on high-resolution images. In Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 418–434.
- [13] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
- [14] Lin, T.Y.; Wang, Y.X.; Liu, X.Y.; Qiu, X. A Survey of Transformers. arXiv 2021, arXiv:2106.04554.
- [15] Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
- [16] Liu, Z.; Lin, Y.T.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002.
- [17] He, X.; Chen, Y.S.; Lin, Z.H. Spatial-Spectral Transformer for Hyperspectral Image Classification. Remote Sens. 2021, 13, 498.
- [18] Lu, Y.T.; Fu, J.; Li, X.; Zhou, W.; Liu, S.; Zhang, X.; Jia, C.; Liu, Y.; Chen, Z. RTN: Reinforced Transformer Network for Coronary CT Angiography Vessel-level Image Quality Assessment. arXiv 2022, arXiv:2207.06177.
- [19] Zheng, S.X.; Lu, J.C.; Zhao, H.S.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6877–6886.
- [20] Schneider, L.; Jasch, M.; Frohlich, B.; Weber, T.; Franke, U.; Pollefeys, M.; Ratsch, M. Multimodal neural networks: RGB-D for semantic segmentation and object detection. Lect. Notes Comput. Sci. 2017, 10269, 98–109.
- [21] Hung, S.W.; Lo, S.Y.; Hang, H.M. Incorporating Luminance, Depth and Color Information by a Fusion-based Network for Semantic Segmentation. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, China, 22–25 September 2019; pp. 2374–2378.
- [22] Qi, X.J.; Liao, R.J.; Jia, J.Y.; Fidler, S.; Urtasun, R. 3D Graph Neural Networks for RGBD Semantic Segmentation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5209–5218.
- [23] Hu, X.; Yang, K.; Fei, L.; Wang, K. ACNet: Attention based network to exploit complementary features for RGBD semantic segmentation. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1440–1444.
- [24] Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2020, 128, 336–359.
- [25] Li, X.; Wang, W.H.; Hu, X.L.; Yang, J. Selective Kernel Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 510–519.
- [26] Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 1314–1324.
- [27] Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
- [28] Sun, K.; Zhao, Y.; Jiang, B.R.; Cheng, T.; Xiao, B.; Liu, D.; Mu, Y.; Wang, X.; Liu, W.; Wang, J. High-resolution representations for labeling pixels and regions. arXiv 2019, arXiv:1904.04514.
- [29] Chen, J.N.; Lu, Y.Y.; Yu, Q.H.; Luo, X.D.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y.Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306.
- [30] Cao, H.; Wang, Y.Y.; Chen, J.; Jiang, D.S.; Zhang, X.P.; Tian, Q.; Wang, M.N. Swin-Unet: Unet-like pure transformer for medical image segmentation. arXiv 2021, arXiv:2105.05537.
- [31] Sun, L.; Yang, K.; Hu, X.; Hu, W.; Wang, K. Real-Time Fusion Network for RGB-D Semantic Segmentation Incorporating Unexpected Obstacle Detection for Road-Driving Images. IEEE Robot. Autom. Lett. 2020, 5, 5558–5565.
- [32] Seichter, D.; Kohler, M.; Lewandowski, B.; Wengefeld, T.; Gross, H.-M. Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13525–13531.
- [33] Hazirbas, C.; Ma, L.; Domokos, C.; Cremers, D. FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture. In Proceedings of the 13th Asian Conference on Computer Vision (ACCV), Taipei, Taiwan, 20–24 November 2016; pp. 213–228.
- [34] Jiang, J.D.; Zheng, L.N.; Luo, F.; Zhang, Z. RedNet: Residual Encoder-Decoder Network for indoor RGB-D Semantic Segmentation. arXiv 2018, arXiv:1806.01054.

| Fusion Method | Global acc | mIoU |
|---|---|---|
| Add | 93.4 | 80.7 |
| Concat | 92.8 | 80.8 |
| Ours | 93.6 | 81.0 |

| Number of Connections | Global acc | mIoU |
|---|---|---|
| 0 | 89.2 | 73.7 |
| 1 | 91.8 | 76.6 |
| 2 | 92.3 | 78.8 |
| 3 | 93.4 | 81.0 |

| Method | Global acc | mIoU |
|---|---|---|
| Lraspp [26] | 73.5 | 63.4 |
| Deeplabv3 [27] | 84.1 | 54.3 |
| Unet [11] | 81.0 | 71.7 |
| HRNet [28] | 46.7 | 59.1 |
| TransUnet [29] | 88.0 | 75.5 |
| SwinUnet [30] | 87.5 | 69.8 |
| ACNet [23] | 92.7 | 75.7 |
| RFNet [31] | 82.4 | 72.3 |
| 3DGNN [22] | 92.2 | 77.6 |
| ESANet [32] | 92.2 | 80.4 |
| FuseNet [33] | 84.3 | 75.3 |
| RedNet [34] | 88.5 | 75.0 |
| LDFNet [21] | 89.9 | 78.3 |
| Swin-MFA (Ours) | 93.4 | 81.0 |

| Method | 10% mIoU | 10% acc | 15% mIoU | 15% acc | 20% mIoU | 20% acc | 25% mIoU | 25% acc | 30% mIoU | 30% acc |
|---|---|---|---|---|---|---|---|---|---|---|
| ACNet | 75.6 | 92.6 | 75.0 | 92.4 | 75.2 | 92.0 | 75.1 | 92.5 | 75.1 | 92.1 |
| RFNet | 72.3 | 82.7 | 71.5 | 82.3 | 71.5 | 81.8 | 71.3 | 82.4 | 72.1 | 82.5 |
| 3DGNN | 77.7 | 92.5 | 77.4 | 92.5 | 77.1 | 92.0 | 77.3 | 92.1 | 76.9 | 91.1 |
| ESANet | 80.3 | 92.2 | 80.0 | 92.1 | 79.8 | 91.7 | 80.6 | 92.2 | 79.9 | 91.8 |
| FuseNet | 75.1 | 84.2 | 74.0 | 84.4 | 74.2 | 84.0 | 75.5 | 86.4 | 74.6 | 84.7 |
| RedNet | 74.8 | 88.3 | 71.8 | 88.4 | 72.9 | 87.6 | 70.6 | 88.5 | 73.4 | 87.4 |
| Swin-MFA | 80.8 | 92.9 | 80.6 | 92.9 | 80.1 | 92.4 | 80.5 | 92.6 | 80.2 | 92.9 |
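The tables above report two metrics, Global acc and mIoU. The text here does not spell out how they are computed, so the sketch below assumes the standard definitions: global accuracy is the fraction of correctly labeled pixels, and mIoU is the per-class intersection over union averaged over the classes that occur. The function names are hypothetical, and the labels are assumed to be non-negative integer class indices.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    idx = y_true.ravel() * num_classes + y_pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def global_accuracy(cm):
    """Correctly labeled pixels divided by all pixels."""
    return np.trace(cm) / cm.sum()


def mean_iou(cm):
    """Mean of TP / (TP + FP + FN) over classes that appear at least once."""
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    valid = union > 0
    return (tp[valid] / union[valid]).mean()


# Toy 2x2 label maps: 0 = background, 1 = human.
y_true = np.array([[0, 0], [1, 1]])
y_pred = np.array([[0, 1], [1, 1]])
cm = confusion_matrix(y_true, y_pred, num_classes=2)
acc = global_accuracy(cm)   # 3 of 4 pixels are correct
miou = mean_iou(cm)         # (1/2 + 2/3) / 2
```

In the toy example one background pixel is predicted as human, so the accuracy is 0.75 while the mIoU is 7/12, which shows how mIoU penalizes errors on each class separately. The tables list both metrics as percentages.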

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Yi, X.; Zhang, H.; Wang, Y.; Guo, S.; Wu, J.; Fan, C. Swin-MFA: A Multi-Modal Fusion Attention Network Based on Swin-Transformer for Low-Light Image Human Segmentation. Sensors 2022, 22, 6229. https://doi.org/10.3390/s22166229