Multi-Modality and Multi-Scale Attention Fusion Network for Land Cover Classification from VHR Remote Sensing Images
Abstract
1. Introduction
- VHR remote sensing images usually contain large, complex scenes in which many spectral features are easily confused; it is therefore difficult for a CNN to obtain sufficiently discriminative features from spectral data alone. For example, in Figure 1, segmentation models are often confused by factors such as shadows and occlusion.
- Simple multi-scale extraction modules cannot adapt to complex VHR remote sensing images, because ground targets vary widely in scale.
- We design a multi-modality fusion module that fuses richer features from IRRG and nDSM images while avoiding redundant ones. This addresses the low accuracy of land cover classification from VHR remote sensing images that is caused by the weak feature representation ability of single-modality data.
- We present a novel multi-scale spatial context enhancement module that combines the advantages of ASPP (Atrous Spatial Pyramid Pooling) and a non-local block to improve image feature fusion, addressing the large differences in target scale in VHR remote sensing images.
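
The pairing of atrous pyramid branches with a non-local block, named in the last contribution, can be illustrated on a toy one-dimensional signal. The following is a minimal sketch under stated assumptions, not the authors' module: it uses plain Python, a fixed averaging kernel in place of learned weights, averages the branches rather than concatenating them, and omits the learned projections of a real non-local block. All function names and the rates (1, 2, 4) are hypothetical choices made for illustration.

```python
import math


def dilated_conv1d(x, kernel, rate):
    """Zero-padded 1-D atrous convolution: kernel taps are spaced `rate` apart."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * rate
            if 0 <= j < len(x):  # positions outside the signal count as zero
                acc += w * x[j]
        out.append(acc)
    return out


def aspp_1d(x, kernel, rates):
    """Average several dilated branches (a stand-in for ASPP's concat + 1x1 conv)."""
    branches = [dilated_conv1d(x, kernel, r) for r in rates]
    return [sum(vals) / len(branches) for vals in zip(*branches)]


def non_local_1d(x):
    """Simplified non-local block on scalar features.

    y_i = x_i + sum_j softmax_j(x_i * x_j) * x_j, with no learned projections.
    """
    out = []
    for xi in x:
        scores = [xi * xj for xj in x]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        attended = sum(w * xj for w, xj in zip(weights, x)) / total
        out.append(xi + attended)  # residual connection
    return out


def multi_scale_context(x, kernel=(1 / 3, 1 / 3, 1 / 3), rates=(1, 2, 4)):
    """Local multi-scale aggregation followed by global (all-pairs) aggregation."""
    return non_local_1d(aspp_1d(x, kernel, rates))


if __name__ == "__main__":
    signal = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
    print(multi_scale_context(signal))
```

The design point of the sketch is the division of labour: the dilated branches widen the receptive field at a few fixed scales, while the non-local step lets every position attend to every other position regardless of distance.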
2. Related Work
2.1. Traditional Methods
2.2. Popular Segmentation Networks
2.3. Multi-Modality Data Fusion
2.4. Multi-Scale Feature Extraction Methods
3. Proposed Method
3.1. Overall Architecture
3.2. Multi-Modality Fusion Module
3.3. Multi-Scale Spatial Context Enhancement Module
3.4. Residual Skip Connection Strategy
4. Experimental Results and Analysis
4.1. Dataset Description
4.2. Training Details
4.3. Metrics
4.4. Results and Analysis
4.4.1. Results on the Potsdam Dataset
4.4.2. Results on the Vaihingen Dataset
4.4.3. Ablation Experiment
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Xu, S.; Pan, X.; Li, E.; Wu, B. Automatic Building Rooftop Extraction from Aerial Images via Hierarchical RGB-D Priors. IEEE Trans. Geosci. Remote Sens. 2018, 56, 7369–7387.
- Xu, Y.; Xie, Z.; Feng, Y.; Chen, Z. Road Extraction from High-Resolution Remote Sensing Imagery Using Deep Learning. Remote Sens. 2018, 10, 1461.
- Lei, T.; Zhang, Y.; Lv, Z.; Li, S.; Nandi, A.K. Landslide inventory mapping from bi-temporal images using a deep convolution network. IEEE Geosci. Remote Sens. Lett. 2019, 16, 163–170.
- Lv, Z.; Li, G.; Jin, Z. Iterative Training Sample Expansion to Increase and Balance the Accuracy of Land Classification from VHR Imagery. IEEE Trans. Geosci. Remote Sens. 2021, 59, 139–150.
- Schmidhuber, J. Deep Learning in Neural Networks: An Overview. Neural Netw. 2015, 61, 85–117.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Proc. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
- Redmon, J.; Divvala, S.; Girshick, R. You Only Look Once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Lei, T.; Wang, R.; Zhang, Y.; Wan, Y.; Liu, C.; Nandi, A.K. DefED-Net: Deformable encoder-decoder network for liver and liver tumor segmentation. IEEE Trans. Radiat. Plasma Med. Sci. 2021.
- Le Hegarat-Mascle, S.; Bloch, I.; Vidal-Madjar, D. Application of Dempster-Shafer evidence theory to unsupervised classification in multisource remote sensing. IEEE Trans. Geosci. Remote Sens. 1997, 35, 1018–1031.
- Su, L.; Yuan, Y.; Huang, F.; Cao, J.; Li, L.; Zhou, S. Spectrum reconstruction method for airborne temporally–spatially modulated Fourier transform imaging spectrometers. IEEE Trans. Geosci. Remote Sens. 2013, 52, 3720–3728.
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241.
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 22–25 July 2017; pp. 2881–2890.
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
- Chen, L.C.; Papandreou, G.; Shroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv Prepr. 2017, arXiv:1706.05587.
- Chen, L.C.; Papandreou, G.; Shroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
- Yu, C.; Wang, J.; Gao, C.; Yu, G.; Shen, C.; Sang, N. Context prior for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 12416–12425.
- Qin, R.; Fang, W. A hierarchical building detection method for very high resolution remotely sensed images combined with DSM using graph cut optimization. Photogramm. Eng. Remote Sens. 2014, 80, 873–883.
- Mou, L.; Hua, Y.; Zhu, X.X. Relation Matters: Relational Context-Aware Fully Convolutional Network for Semantic Segmentation of High-Resolution Aerial Images. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7557–7569.
- Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Su, J.; Wang, L.; Atkinson, P.M. Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021.
- Han, B.; Yin, J.; Luo, X.; Jia, X. Multibranch Spatial-Channel Attention for Semantic Labeling of Very High-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2020.
- Tarabalka, Y.; Chanussot, J.; Benediktsson, J.A. Segmentation and classification of hyperspectral images using watershed transformation. Pattern Recognit. 2010, 7, 2367–2379.
- Fan, J.; Han, M.; Wang, J. Single point iterative weighted fuzzy C-means clustering algorithm for remote sensing image segmentation. Pattern Recognit. 2009, 11, 2527–2540.
- Lei, T.; Jia, X.; Zhang, Y.; Liu, S.; Meng, H.; Nandi, A.K. Superpixel-based Fast fuzzy C-means clustering algorithms for color image segmentation. IEEE Trans. Fuzzy Syst. 2019, 27, 1753–1766.
- Yu, H.; Gao, L.; Li, J.; Li, S.; Zhang, B.; Benediktsson, J. Spectral-Spatial Hyperspectral Image Classification Using Subspace-Based Support Vector Machines and Adaptive Markov Random Fields. Remote Sens. 2016, 8, 355.
- Meher, S.K. Knowledge-Encoded Granular Neural Networks for Hyperspectral Remote Sensing Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 2349–2446.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 630–645.
- Sherrah, J. Fully convolutional networks for dense semantic labelling of high-resolution aerial imagery. arXiv Prepr. 2016, arXiv:1606.02585.
- Cao, Z. End-to-End DSM Fusion Networks for Semantic Segmentation in High-Resolution Aerial Images. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1766–1770.
- Lu, X.; Li, X.; Mou, L. Semi-Supervised Multitask Learning for Scene Recognition. IEEE Trans. Cybern. 2015, 9, 1967–1976.
- Liu, C.; Zeng, D.; Wu, H.; Wang, Y.; Jia, S.; Xin, L. Urban Land Cover Classification of High-Resolution Aerial Imagery Using a Relation-Enhanced Multiscale Convolutional Network. Remote Sens. 2020, 12, 311.
- Huang, G.; Liu, Z.; Maaten, L.v.d.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 22–25 July 2017; pp. 2261–2269.
- Audebert, N.; Le Saux, B.; Lefèvre, S. Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks. ISPRS J. Photogramm. Remote Sens. 2018, 140, 20–32.
- Peng, C.; Li, Y.; Jiao, L.; Chen, Y.; Shang, R. Densely Based Multi-scale and Multi-Modal Fully Convolutional Networks for High-Resolution Remote-Sensing Image Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 2612–2626.
- Marmanis, D.; Wegner, J.D.; Galliani, S.; Schindler, K.; Datcu, M.; Stilla, U. Semantic segmentation of aerial images with an ensemble of CNNs. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2016, 3, 473–480.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv Prepr. 2014, arXiv:1409.1556.
- Yu, B.; Yang, L.; Chen, F. Semantic Segmentation for High Spatial Resolution Remote Sensing Images Based on Convolution Neural Network and Pyramid Pooling Module. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 9, 3252–3261.
- Wang, Y.; Liang, B.; Ding, M.; Li, J. Dense Semantic Labeling with Atrous Spatial Pyramid Pooling and Decoder for High-Resolution Remote Sensing Imagery. Remote Sens. 2018, 1, 20.
- Shang, R.; Zhang, J.; Jiao, L.; Li, Y.; Marturi, N.; Stolkin, R. Multi-scale Adaptive Feature Fusion Network for Semantic Segmentation in Remote Sensing Images. Remote Sens. 2020, 12, 872.
- Xiang, S.; Xie, Q.; Wang, M. Semantic Segmentation for Remote Sensing Images Based on Adaptive Feature Selection Network. IEEE Geosci. Remote Sens. Lett. 2021.
- He, X.; Zhou, Y.; Zhao, J.; Zhang, M.; Yao, R.; Liu, B.; Li, H. Semantic Segmentation of Remote-Sensing Images Based on Multiscale Feature Fusion and Attention Refinement. IEEE Geosci. Remote Sens. Lett. 2021.
- Zhao, Q.; Liu, J.; Li, Y.; Zhang, H. Semantic Segmentation with Attention Mechanism for Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021.
- He, K. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Hazirbas, C.; Ma, L.; Domokos, C.; Cremers, D. FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-based CNN Architecture. In Proceedings of the Asian Conference on Computer Vision (ACCV), Taipei, Taiwan, 20–24 November 2016.
- Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7174.
- Buades, A.; Coll, B.; Morel, J. A non-local algorithm for image denoising. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; pp. 60–65.
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803.
- Gerke, M.; Rottensteiner, F.; Wegner, J.D.; Sohn, G. ISPRS semantic labeling contest. In Proceedings of the Photogrammetric Computer Vision (PCV), Zurich, Switzerland, 5–7 September 2014.
- ISPRS Potsdam 2D Semantic Labeling Dataset. Available online: http://www2.isprs.org/commissions/comm3/wg4/2d-sem-labelpotsdam.html (accessed on 30 January 2021).
- Xu, Q.; Yuan, X.; Ouyang, C.; Zeng, Y. Attention-Based Pyramid Network for Segmentation and Classification of High-Resolution and Hyperspectral Remote Sensing Images. Remote Sens. 2020, 12, 3501.
- ISPRS Vaihingen 2D Semantic Labeling Dataset. Available online: http://www2.isprs.org/commissions/comm3/wg4/2d-semlabel-vaihingen.html (accessed on 30 January 2021).
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.F. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255.
Method | Imp. Surf. | Building | Low Veg. | Tree | Car | Mean F1 | OA |
---|---|---|---|---|---|---|---|
DeepLab v3+ [17] | 89.88 | 93.78 | 83.23 | 81.66 | 93.50 | 88.41 | 87.72 |
MANet [40] | 91.33 | 95.91 | 85.88 | 87.01 | 91.46 | 90.32 | 89.19 |
DSMFNet [30] | 93.03 | 95.75 | 86.33 | 86.46 | 94.88 | 91.29 | 90.36 |
DP-DCN [35] | 92.53 | 95.36 | 87.21 | 86.32 | 95.42 | 91.37 | 90.45 |
REMSNet [32] | 93.48 | 96.17 | 87.52 | 87.97 | 95.03 | 92.03 | 90.79 |
MMAFNet | 93.61 | 96.26 | 87.87 | 88.65 | 95.32 | 92.34 | 91.04 |

Method | Imp. Surf. | Building | Low Veg. | Tree | Car | Mean F1 | OA |
---|---|---|---|---|---|---|---|
DeepLab v3+ [17] | 87.67 | 93.95 | 79.17 | 86.26 | 80.34 | 85.48 | 87.22 |
MANet [40] | 90.12 | 94.08 | 81.01 | 87.21 | 81.16 | 86.72 | 88.17 |
DP-DCN [35] | 91.47 | 94.55 | 80.13 | 88.02 | 80.25 | 86.89 | 89.32 |
DSMFNet [30] | 91.47 | 95.08 | 82.11 | 88.61 | 81.01 | 87.66 | 89.80 |
REMSNet [32] | 92.01 | 95.67 | 82.35 | 89.73 | 81.26 | 88.20 | 90.08 |
MMAFNet | 92.06 | 96.12 | 82.71 | 90.01 | 82.13 | 88.61 | 90.27 |

Models | Imp. Surf. | Building | Low Veg. | Tree | Car | Mean F1 | OA |
---|---|---|---|---|---|---|---|
Res50 | 86.94 | 89.67 | 75.83 | 84.42 | 77.40 | 82.85 | 84.98 |
Res50+MFM | 88.15 | 93.84 | 76.49 | 86.48 | 78.02 | 84.60 | 86.66 |
Res50+MSCEM | 88.79 | 93.09 | 79.79 | 85.55 | 80.38 | 85.52 | 87.35 |
Res50+RSC | 90.11 | 92.97 | 80.24 | 86.04 | 81.14 | 86.10 | 87.82 |
Res50+MFM+MSCEM+RSC(MMAFNet) | 92.06 | 96.12 | 82.71 | 90.01 | 82.13 | 88.61 | 90.27 |
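
The Mean F1 column in the tables above appears to be the unweighted average of the five per-class F1 scores. The short check below is a sketch that recomputes it for three rows copied from the tables; the helper names and the 0.01 rounding tolerance are assumptions introduced here, not part of the paper.

```python
# Per-class F1 scores (%) in column order: Imp. Surf., Building, Low Veg., Tree, Car.
# Each entry pairs the per-class values with the Mean F1 reported in the same row.
ROWS = {
    "MMAFNet (first comparison table)": ([93.61, 96.26, 87.87, 88.65, 95.32], 92.34),
    "MMAFNet (second comparison table)": ([92.06, 96.12, 82.71, 90.01, 82.13], 88.61),
    "Res50 (ablation table)": ([86.94, 89.67, 75.83, 84.42, 77.40], 82.85),
}


def mean_f1(per_class):
    """Unweighted (macro) average of per-class F1 scores."""
    return sum(per_class) / len(per_class)


def matches_reported(per_class, reported, tol=0.01):
    """True if the recomputed mean agrees with the reported value up to rounding."""
    return abs(mean_f1(per_class) - reported) <= tol


if __name__ == "__main__":
    for name, (per_class, reported) in ROWS.items():
        status = "ok" if matches_reported(per_class, reported) else "mismatch"
        print(f"{name}: recomputed {mean_f1(per_class):.2f}, reported {reported:.2f} ({status})")
```

Because the average is unweighted, a small class such as Car influences Mean F1 as much as a large class such as Building, which is one reason the tables report OA alongside it.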
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Lei, T.; Li, L.; Lv, Z.; Zhu, M.; Du, X.; Nandi, A.K. Multi-Modality and Multi-Scale Attention Fusion Network for Land Cover Classification from VHR Remote Sensing Images. Remote Sens. 2021, 13, 3771. https://doi.org/10.3390/rs13183771