MsIFT: Multi-Source Image Fusion Transformer
Abstract
1. Introduction
- A multi-source image fusion method with a global receptive field is proposed. The non-locality of the transformer helps overcome the feature semantic bias caused by semantic misalignment between multi-source images.
- Different feature extractor and task predictor networks are proposed and unified for three classification-based downstream tasks, so that MsIFT can be applied uniformly to pixel-wise classification, image-wise classification and semantic segmentation (a minimal sketch of the fusion idea follows this list).
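The key point of the highlights above is that attention gives every spatial position in one source a global receptive field over the other source, so small spatial or semantic misalignments between sources do not have to be resolved by purely local convolutions. The snippet below is a minimal PyTorch sketch of such a cross-source attention fusion block; it is not the authors' released implementation, and the module name, feature dimensions, single-block depth, and omission of positional encodings are simplifying assumptions.

```python
# Minimal sketch (not the authors' code): cross-attention fusion of two CNN feature
# maps, where every query position of source A attends to all positions of source B.
import torch
import torch.nn as nn

class CrossSourceFusionBlock(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (B, C, H, W) feature maps from two source-specific extractors.
        b, c, h, w = feat_a.shape
        q = feat_a.flatten(2).transpose(1, 2)   # (B, H*W, C) queries from source A
        kv = feat_b.flatten(2).transpose(1, 2)  # (B, H*W, C) keys/values from source B
        fused, _ = self.attn(self.norm_q(q), self.norm_kv(kv), self.norm_kv(kv))
        x = q + fused                # residual connection around cross-attention
        x = x + self.ffn(x)          # position-wise feed-forward with residual
        return x.transpose(1, 2).reshape(b, c, h, w)

# Usage: fuse optical and SAR features before a task-specific predictor head.
opt_feat = torch.randn(2, 256, 32, 32)
sar_feat = torch.randn(2, 256, 32, 32)
fused = CrossSourceFusionBlock()(opt_feat, sar_feat)  # (2, 256, 32, 32)
```

In the full pipeline described in Section 3.2, a fused feature of this kind would be passed to a task-specific predictor (a classifier head or a segmentation decoder), which is what allows one fusion backbone to serve all three downstream tasks.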
2. Related Works
2.1. DNN-Based Multi-Source Image Classification
2.2. Transformer in Computer Vision
3. Method
3.1. Problem Formulation
3.2. MsIFT Architecture
3.2.1. CNN Feature Extractor
3.2.2. Feature Fusion Transformer
3.2.3. Task Predictor
3.3. MsIFT Loss
4. Experiments
4.1. Data Description
4.2. Implementation Details
4.3. Evaluation Metrics
4.4. Quantitative Analysis
4.4.1. Image-Wise Classification
4.4.2. Pixel-Wise Classification
4.4.3. Semantic Segmentation
4.4.4. Ablation Study
4.5. Qualitative Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Xu, X.; Li, W.; Ran, Q.; Du, Q.; Gao, L.; Zhang, B. Multisource remote sensing data classification based on convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2017, 56, 937–949. [Google Scholar] [CrossRef]
- Qiu, X.; Li, M.; Dong, L.; Deng, G.; Zhang, L. Dual-band maritime imagery ship classification based on multilayer convolutional feature fusion. J. Sens. 2020, 2020, 8891018. [Google Scholar] [CrossRef]
- Hu, W.S.; Li, H.C.; Pan, L.; Li, W.; Tao, R.; Du, Q. Spatial–spectral feature extraction via deep ConvLSTM neural networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4237–4250. [Google Scholar] [CrossRef]
- Li, H.C.; Wang, W.Y.; Pan, L.; Li, W.; Du, Q.; Tao, R. Robust capsule network based on maximum correntropy criterion for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 738–751. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
- Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
- Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135. [Google Scholar]
- Zhang, M.M.; Choi, J.; Daniilidis, K.; Wolf, M.T.; Kanan, C. VAIS: A dataset for recognizing maritime imagery in the visible and infrared spectrums. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June 2015; pp. 10–16. [Google Scholar]
- Shermeyer, J.; Hogan, D.; Brown, J.; Van Etten, A.; Weir, N.; Pacifici, F.; Hansch, R.; Bastidas, A.; Soenen, S.; Bacastow, T.; et al. SpaceNet 6: Multi-sensor all weather mapping dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 196–197. [Google Scholar]
- Pacifici, F.; Du, Q.; Prasad, S. Report on the 2013 IEEE GRSS data fusion contest: Fusion of hyperspectral and LiDAR data [technical committees]. IEEE Geosci. Remote Sens. Mag. 2013, 1, 36–38. [Google Scholar] [CrossRef]
- Aziz, K.; Bouchara, F. Multimodal deep learning for robust recognizing maritime imagery in the visible and infrared spectrums. In Proceedings of the International Conference Image Analysis and Recognition, Póvoa de Varzim, Portugal, 27–29 June 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 235–244. [Google Scholar]
- Santos, C.E.; Bhanu, B. Dyfusion: Dynamic IR/RGB fusion for maritime vessel recognition. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 1328–1332. [Google Scholar]
- Hong, D.; Gao, L.; Yokoya, N.; Yao, J.; Chanussot, J.; Du, Q.; Zhang, B. More diverse means better: Multimodal deep learning meets remote-sensing imagery classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4340–4354. [Google Scholar] [CrossRef]
- Zhu, H.; Ma, M.; Ma, W.; Jiao, L.; Hong, S.; Shen, J.; Hou, B. A spatial-channel progressive fusion ResNet for remote sensing classification. Inf. Fusion 2021, 70, 72–87. [Google Scholar] [CrossRef]
- Khodadadzadeh, M.; Li, J.; Prasad, S.; Plaza, A. Fusion of hyperspectral and LiDAR remote sensing data using multiple feature learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 2971–2983. [Google Scholar] [CrossRef]
- Li, W.; Gao, Y.; Zhang, M.; Tao, R.; Du, Q. Asymmetric Feature Fusion Network for Hyperspectral and SAR Image Classification. IEEE Trans. Neural Netw. Learn. Syst. 2022, 1–14. [Google Scholar] [CrossRef]
- Zhang, M.; Li, W.; Tao, R.; Li, H.; Du, Q. Information Fusion for Classification of Hyperspectral and LiDAR Data Using IP-CNN. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
- Mohla, S.; Pande, S.; Banerjee, B.; Chaudhuri, S. FusAtNet: Dual attention based spectrospatial multimodal fusion network for hyperspectral and LiDAR classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 92–93. [Google Scholar]
- Peng, Y.; Li, W.; Luo, X.; Du, J.; Gan, Y.; Gao, X. Integrated fusion framework based on semicoupled sparse tensor factorization for spatio-temporal–spectral fusion of remote sensing images. Inf. Fusion 2021, 65, 21–36. [Google Scholar] [CrossRef]
- Li, H.C.; Hu, W.S.; Li, W.; Li, J.; Plaza, A. A3CLNN: Spatial, Spectral and Multiscale Attention ConvLSTM Neural Network for Multisource Remote Sensing Data Classification. IEEE Trans. Neural Netw. Learn. Syst. 2020, 33, 1–15. [Google Scholar]
- Hong, D.; Yokoya, N.; Xia, G.S.; Chanussot, J.; Zhu, X.X. X-ModalNet: A semi-supervised deep cross-modal network for classification of remote sensing data. ISPRS J. Photogramm. Remote Sens. 2020, 167, 12–23. [Google Scholar] [CrossRef] [PubMed]
- Zhang, M.; Li, W.; Du, Q.; Gao, L.; Zhang, B. Feature Extraction for Classification of Hyperspectral and LiDAR Data Using Patch-to-Patch CNN. IEEE Trans. Cybern. 2020, 50, 100–111. [Google Scholar] [CrossRef]
- Huang, Z.; Cheng, G.; Wang, H.; Li, H.; Shi, L.; Pan, C. Building extraction from multi-source remote sensing images via deep deconvolution neural networks. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 1835–1838. [Google Scholar]
- Liao, W.; Bellens, R.; Pižurica, A.; Gautama, S.; Philips, W. Combining feature fusion and decision fusion for classification of hyperspectral and LiDAR data. In Proceedings of the 2014 IEEE Geoscience and Remote Sensing Symposium, Quebec City, QC, Canada, 13–18 July 2014; pp. 1241–1244. [Google Scholar]
- Hang, R.; Li, Z.; Ghamisi, P.; Hong, D.; Xia, G.; Liu, Q. Classification of hyperspectral and LiDAR data using coupled CNNs. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4939–4950. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–7 December 2017; pp. 5998–6008. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12299–12310. [Google Scholar]
- Hu, J.; Cao, L.; Lu, Y.; Zhang, S.; Wang, Y.; Li, K.; Huang, F.; Shao, L.; Ji, R. ISTR: End-to-End Instance Segmentation with Transformers. arXiv 2021, arXiv:2105.00637. [Google Scholar]
- Chen, H.; Qi, Z.; Shi, Z. Remote Sensing Image Change Detection With Transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14. [Google Scholar] [CrossRef]
- Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
- MMClassification Contributors. OpenMMLab’s Image Classification Toolbox and Benchmark. 2020. Available online: https://github.com/open-mmlab/mmclassification (accessed on 17 July 2022).
- MMSegmentation Contributors. MMSegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark. 2020. Available online: https://github.com/open-mmlab/mmsegmentation (accessed on 17 July 2022).
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
- Shi, Q.; Li, W.; Tao, R.; Sun, X.; Gao, L. Ship classification based on multifeature ensemble with convolutional neural network. Remote Sens. 2019, 11, 419. [Google Scholar] [CrossRef]
- Huang, L.; Li, W.; Chen, C.; Zhang, F.; Lang, H. Multiple features learning for ship classification in optical imagery. Multimed. Tools Appl. 2018, 77, 13363–13389. [Google Scholar] [CrossRef]
- Zhang, E.; Wang, K.; Lin, G. Classification of marine vessels with multi-feature structure fusion. Appl. Sci. 2019, 9, 2153. [Google Scholar] [CrossRef]
- Li, W.; Chen, C.; Su, H.; Du, Q. Local binary patterns and extreme learning machine for hyperspectral imagery classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 3681–3693. [Google Scholar] [CrossRef]
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
- Yuan, Y.; Chen, X.; Wang, J. Object-contextual representations for semantic segmentation. In Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Part VI; Springer: Berlin/Heidelberg, Germany, 2020; pp. 173–190. [Google Scholar]
- Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
No. | Class Name | Train | Test |
---|---|---|---|
1 | Cargo ship | 83 | 63 |
2 | Medium ship | 62 | 76 |
3 | Passenger ship | 58 | 59 |
4 | Sailing ship | 148 | 136 |
5 | Small boat | 158 | 195 |
6 | Tug boat | 30 | 20 |
| Total | 539 | 549 |
No. | Class Name | Train | Test |
---|---|---|---|
1 | Healthy grass | 198 | 1053 |
2 | Stressed grass | 190 | 1064 |
3 | Synthetic grass | 192 | 505 |
4 | Trees | 188 | 1056 |
5 | Soil | 186 | 1056 |
6 | Water | 182 | 143 |
7 | Residential | 196 | 1072 |
8 | Commercial | 191 | 1053 |
9 | Road | 193 | 1059 |
10 | Highway | 191 | 1036 |
11 | Railway | 181 | 1054 |
12 | Parking lot 1 | 192 | 1041 |
13 | Parking lot 2 | 184 | 285 |
14 | Tennis court | 181 | 247 |
15 | Running track | 187 | 473 |
| Total | 2832 | 12,197 |
Hyperparameter | Pixel-Wise Classification | Image-Wise Classification | Semantic Segmentation |
---|---|---|---|
Batch size | 48 | 10 | 4 |
Optimizer | SGD | SGD | SGD |
Initial learning rate | 0.001 | 0.001 | 0.01 |
Learning rate decay | Cosine annealing | Cosine annealing | Poly schedule |
Momentum | 0.9 | 0.9 | 0.9 |
Weight decay | 0.0001 | 0.0001 | 0.0005 |
Epochs | 600 | 200 | 30 |
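For reference, the hyperparameters in the table above map onto standard PyTorch optimizer and scheduler objects roughly as sketched below. This is an illustrative sketch rather than the authors' training script (which builds on the OpenMMLab toolboxes [34,35]); the placeholder model, the iteration count for the poly schedule, and the poly power of 0.9 are assumptions.

```python
# Sketch of the training settings from the table above, in plain PyTorch.
import torch

model = torch.nn.Linear(256, 15)   # placeholder for the MsIFT network
max_epochs = 600                   # pixel-wise classification setting

# Pixel-wise / image-wise classification: SGD with cosine annealing decay.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0001)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_epochs)

# Semantic segmentation: SGD (lr 0.01, weight decay 0.0005) with a poly schedule,
# lr_t = lr_0 * (1 - t / T)^0.9 (power 0.9 is a common default, assumed here).
seg_optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=0.0005)
seg_iters = 30 * 1000              # 30 epochs; iterations per epoch assumed
poly = torch.optim.lr_scheduler.LambdaLR(
    seg_optimizer, lambda it: (1 - it / seg_iters) ** 0.9)

for epoch in range(max_epochs):
    # ... one training epoch over the fused multi-source batches ...
    scheduler.step()               # decay the classification learning rate per epoch
```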
Method | VIS | IR | VIS + IR |
---|---|---|---|
CNN [9] | 81.9 | 54.0 | 82.1 |
Gnostic field [9] | 82.4 | 58.7 | 82.4 |
CNN + gnostic field [9] | 81.0 | 56.8 | 87.4 |
ME-CNN [37] | 87.3 | - | - |
MFL (feature-level) + ELM [38] | 87.6 | - | - |
CNN + Gabor + MS-CLBP [38] | 88.0 | - | - |
Multimodal CNN [12] | - | - | 86.7 |
DyFusion [13] | - | - | 88.4 |
SF-SRDA [39] | 87.6 | 74.7 | 88.0 |
MCFF Combination 3-SUM (C2C5F6) [2] | 87.5 | 71.1 | 89.6 |
MCFF Combination 3-SUM (C3C5F6) [2] | 87.7 | 71.4 | 89.6 |
MCFF Combination 2-CON (C3C5F6) [2] | 87.9 | 71.9 | 89.9 |
MCFF Combination 3-CON (C2C5F6) [2] | 87.5 | 71.1 | 89.9 |
MsIFT (ours) | 87.1 | 72.1 | 92.3 |
Method | HSI | LiDAR | HSI + LiDAR |
---|---|---|---|
SVM [40] | 78.79 | - | 80.15 [+1.36] |
ELM [40] | 79.52 | - | 80.76 [+1.24] |
Two-Branch CNN [1] | 77.79 | - | 83.75 [+5.96] |
Dual-Channel CapsNet [4] | 81.53 | - | 86.61 [+5.08] |
SSCL3DNN [3] | 82.72 | - | 86.01 [+3.29] |
A3CLNN [21] | 87.00 | - | 90.55 [+3.55] |
MDL + Early [14] | 82.05 | 67.35 | 83.07 [+1.02] |
MDL + Middle [14] | 82.05 | 67.35 | 89.55 [+7.5] |
MDL + Late [14] | 82.05 | 67.35 | 87.98 [+5.93] |
MDL + EnDe [14] | 82.05 | 67.35 | 90.71 [+8.66] |
MDL + Cross [14] | 82.05 | 67.35 | 91.99 [+9.94] |
MsIFT (ours) | 82.05 | 67.35 | 93.02 [+10.97] |
Method | Backbone | OPT mIoU | OPT Accuracy | SAR mIoU | SAR Accuracy |
---|---|---|---|---|---|
Deeplabv3+ [43] | ResNet-50 | 63.36 | 67.18 | 56.46 | 61.87 |
OCRNet [44] | HRNetV2-W18 | 65.74 | 68.58 | 54.56 | 59.65 |
CCNet [45] | ResNet-50 | 65.49 | 68.82 | 55.36 | 60.61 |
PSPNet [41] | ResNet-50 | 65.82 | 68.85 | 55.32 | 60.54 |
DANet [42] | ResNet-50 | 66.90 | 70.17 | 57.45 | 62.64 |

Fusion Method | Backbone | OPT + SAR mIoU | OPT + SAR Accuracy |
---|---|---|---|
PSPNet [41]: | | | |
DeconvNet-Fusion (Minimum) [24] | ResNet-50 | 58.25 | 59.65 |
DeconvNet-Fusion (AM) [24] | ResNet-50 | 64.06 | 68.83 |
DeconvNet-Fusion (GM) [24] | ResNet-50 | 54.92 | 66.28 |
MsIFT (Ours) | ResNet-50 | 67.51 | 70.49 |
DANet [42]: | | | |
DeconvNet-Fusion (Minimum) [24] | ResNet-50 | 60.03 | 61.47 |
DeconvNet-Fusion (AM) [24] | ResNet-50 | 65.09 | 70.01 |
DeconvNet-Fusion (GM) [24] | ResNet-50 | 56.91 | 66.2 |
MsIFT (Ours) | ResNet-50 | 67.94 | 70.82 |
OPT | SAR | Concat | CNN | FTE | FTD | AL | mIoU |
---|---|---|---|---|---|---|---|
✓ | 64.07 | ||||||
✓ | 54.28 | ||||||
✓ | ✓ | ✓ | 50.21 | ||||
✓ | ✓ | ✓ | 48.53 | ||||
✓ | ✓ | ✓ | ✓ | 65.95 | |||
✓ | ✓ | ✓ | ✓ | 66.08 | |||
✓ | ✓ | ✓ | ✓ | 65.06 | |||
✓ | ✓ | ✓ | ✓ | ✓ | 66.38 |