A Spatio-Temporal Encoding Neural Network for Semantic Segmentation of Satellite Image Time Series
Abstract
1. Introduction
- We propose a novel backbone network for semantic segmentation, tailored to pixel-level classification tasks that are sensitive to spatial information. The network abandons down-sampling, which can cause unrecoverable loss of spatial information, and instead combines dilated convolutions with dense connections to rapidly expand the receptive field and extract multi-scale features.
- For the first time, we utilize a Transformer encoder to extract temporal features for each pixel. To emphasize the annual cyclic patterns of crops, the positional encoding is computed from the position of the acquisition date within its year (a minimal sketch of such an encoding follows this list).
- We provide an open-source PyTorch implementation of the model on GitHub at https://github.com/ThinkPak/stenn-pytorch (accessed on 25 October 2023).
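Concretely, a date-aware positional encoding can be obtained by substituting the day-of-year index for the sequence position in the sinusoidal scheme of Vaswani et al. [13]. The following is a minimal sketch under that assumption only; the function name and the `d_model` value are our own illustration, not the paper's code:

```python
import torch

def day_of_year_positional_encoding(days: torch.Tensor, d_model: int) -> torch.Tensor:
    """Sinusoidal encoding of acquisition dates, indexed by day of year (1-366).

    days: (T,) integer tensor of day-of-year positions.
    Returns: (T, d_model) encoding, as in Vaswani et al., with the
    day-of-year index replacing the sequence position.
    """
    position = days.float().unsqueeze(1)                       # (T, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float()
        * (-torch.log(torch.tensor(10000.0)) / d_model)
    )                                                          # (d_model/2,)
    pe = torch.zeros(len(days), d_model)
    pe[:, 0::2] = torch.sin(position * div_term)               # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)               # odd dimensions
    return pe

# Example: encodings for acquisitions on 12 February (day 43) and 30 August (day 242).
pe = day_of_year_positional_encoding(torch.tensor([43, 242]), d_model=144)
```

Because the index is the day of year rather than the frame index, two images acquired on nearby calendar dates in different years receive similar encodings, which is what lets the model exploit annual crop cycles.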
2. Study Area and Dataset
2.1. Study Area
2.2. Dataset
3. Methodology
3.1. Related Work
3.2. STENN Architecture
- First, the input undergoes spatial encoding. Multi-level dilated convolutional layers with shared weights process every frame of the SITS simultaneously, generating feature-map time series with 16, 32, 32, and 64 channels.
- Next comes temporal encoding. The spatial-encoding outputs are concatenated along the channel dimension, yielding a feature-map time series with 144 channels. This sequence is reshaped into a time series for each pixel and, after positional encoding, fed into the Transformer encoder. The encoded results are then averaged along the time dimension, producing a single feature map.
- Finally, the semantic segmentation head maps the feature map containing spatio-temporal information to the segmentation results (a PyTorch sketch of the full pipeline follows this list).
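To make the three stages concrete, here is a minimal PyTorch sketch. It is our own simplification: a single convolution stands in for the dilated dense block of Section 3.2.1, positional encoding is omitted, and only the channel widths (144 features, 20 classes) and the data flow follow the description above; the full implementation is in the repository cited in Section 1.

```python
import torch
import torch.nn as nn

class STENNSketch(nn.Module):
    """Illustrative pipeline: shared per-frame spatial encoding,
    per-pixel temporal Transformer encoding, mean fusion, segmentation head."""

    def __init__(self, in_ch=10, feat_ch=144, n_classes=20, n_heads=8):
        super().__init__()
        # Stand-in for the dilated dense block; the paper concatenates
        # intermediate outputs of 16 + 32 + 32 + 64 = 144 channels.
        self.spatial = nn.Conv2d(in_ch, feat_ch, kernel_size=3, padding=1)
        layer = nn.TransformerEncoderLayer(d_model=feat_ch, nhead=n_heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Conv2d(feat_ch, n_classes, kernel_size=1)

    def forward(self, x):                        # x: (B, T, C, H, W)
        B, T, C, H, W = x.shape
        f = self.spatial(x.flatten(0, 1))        # shared weights over all frames
        f = f.view(B, T, -1, H, W)               # (B, T, 144, H, W)
        # Reshape so each pixel becomes a length-T sequence of 144-d tokens.
        seq = f.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, -1)
        seq = self.temporal(seq)                 # positional encoding omitted here
        fused = seq.mean(dim=1).view(B, H, W, -1).permute(0, 3, 1, 2)
        return self.head(fused)                  # (B, n_classes, H, W)

logits = STENNSketch()(torch.randn(2, 6, 10, 32, 32))   # -> (2, 20, 32, 32)
```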
3.2.1. Spatial Encoding with Dilated Convolution
3.2.2. Temporal Encoding with Transformer Encoder
4. Experiments and Analysis
4.1. Evaluation Metric
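The experiments below report overall accuracy (OA) and mean intersection over union (mIoU). As a reference for these metrics, here is a minimal sketch of their standard definitions computed from a confusion matrix; this is our own illustration, not the paper's evaluation code:

```python
import torch

def oa_and_miou(conf: torch.Tensor):
    """conf: (K, K) confusion matrix, conf[i, j] = pixels of class i predicted as j."""
    oa = conf.diag().sum() / conf.sum()                  # overall accuracy
    inter = conf.diag().float()                          # per-class true positives
    union = conf.sum(0) + conf.sum(1) - conf.diag()      # TP + FP + FN per class
    miou = (inter / union.clamp(min=1)).mean()           # mean IoU over classes
    return oa.item(), miou.item()

# Example: a 2-class confusion matrix gives OA = 0.85, mIoU = (50/65 + 35/50) / 2.
print(oa_and_miou(torch.tensor([[50, 10], [5, 35]])))
```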
4.2. Experimental Details
- The backbone network of U-TAE [1] is a U-Net. At the lowest resolution, its attention-based temporal encoder generates a set of temporal attention masks for each pixel; after spatial interpolation, these masks fuse the feature-map time series at all resolutions into a single feature map (a minimal sketch of this mask-based fusion follows this list). Group normalization with four groups is used in the encoding branch, and batch normalization in the decoding branch. For temporal encoding, an L-TAE with 16 heads and a key-query space of dimension four is used.
- U-ConvLSTM [40] and U-BiConvLSTM [39] likewise use a U-Net backbone, replacing the L-TAE with a ConvLSTM [37] or a bidirectional ConvLSTM in the same position. Compared with the original methods, batch normalization in the encoder is replaced with group normalization.
- The encoding space of 3D U-Net [40] is three-dimensional, allowing it to process the spatial and temporal dimensions simultaneously, followed by mean fusion over the temporal dimension. The network comprises five consecutive 3D convolutional blocks, with spatial down-sampling after the second and fourth blocks. Each block doubles the channel count of the processed feature maps, and the innermost feature map has 128 channels. Leaky ReLU and 3D batch normalization are employed within the convolutional blocks.
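The following is a minimal sketch of the mask-based temporal fusion described for U-TAE, written from the description above rather than from Garnot and Landrieu's code: per-pixel temporal attention masks computed at low resolution are spatially upsampled and used as weights to collapse a feature-map time series into one map.

```python
import torch
import torch.nn.functional as F

def mask_fusion(feats: torch.Tensor, attn: torch.Tensor) -> torch.Tensor:
    """feats: (B, T, C, H, W) feature-map time series at one resolution.
    attn:  (B, T, h, w) temporal attention masks from the low-resolution encoder.
    """
    B, T, C, H, W = feats.shape
    # Spatially interpolate the masks to the feature resolution, then
    # normalize over time so each pixel's weights sum to 1.
    attn = F.interpolate(attn, size=(H, W), mode="bilinear", align_corners=False)
    attn = torch.softmax(attn, dim=1)                    # (B, T, H, W)
    return (feats * attn.unsqueeze(2)).sum(dim=1)        # (B, C, H, W)
```

U-TAE uses multiple attention heads, each modulating a group of channels; this sketch collapses that to a single mask per pixel for clarity.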
4.3. Results
- VGG Backbone: the spatial encoding is replaced with the first 10 layers of VGG-16, with the network width adjusted to keep a similar parameter count. The feature maps at each stage are up-sampled to the original image resolution and, after channel-wise concatenation, undergo temporal encoding.
- No Dense Connection: the dense connections are removed from the spatial encoding, so the input of each layer is only the output of the previous layer.
- No Transformer encoder: the Transformer encoder is removed, and temporal mean fusion is applied directly to the feature-map time series obtained after spatial encoding; the final segmentation result is then produced by convolution (a short sketch follows this list).
- Single Date (August): a single image from the SITS, acquired in August, is used for training, and the model produces segmentation results directly after spatial encoding.
- Single Date (May): as above, but with a single image acquired in May.
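As a concrete reference for the "No Transformer encoder" variant, temporal mean fusion amounts to a few lines; the tensor shapes here are our assumption, matching the channel counts in the architecture table of Section 3.2:

```python
import torch

feats = torch.randn(2, 6, 144, 32, 32)       # (B, T, C, H, W) after spatial encoding
fused = feats.mean(dim=1)                     # temporal mean fusion replaces the Transformer
logits = torch.nn.Conv2d(144, 20, kernel_size=1)(fused)   # final convolution to 20 classes
```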
Model | Params (×10³) | OA (%) | mIoU (%) | Inference Time (ms)
---|---|---|---|---
STENN (Ours) | 447 | 79.37 | 55.8 | 322
U-TAE | 1087 | 79.31 | 47.82 | 88
ConvLSTM | 1009 | 77.47 | 54.06 | 163
ConvGRU | 956 | 70.31 | 36.93 | 141
U-ConvLSTM | 1521 | 75.27 | 36.19 | 110
U-BiConvLSTM | 1210 | 76.14 | 42.32 | 104
3D U-Net | 1554 | 76.92 | 47.21 | 96
VGG Backbone | 401 | 78.18 | 55.26 | 193
No dense connection | 403 | 77.26 | 53.16 | 245
No Transformer encoder | 289 | 73.41 | 31.91 | 153
Single Date (August) | 289 | 58.75 | 31.12 | 151
Single Date (May) | 289 | 55.16 | 30.43 | 151
5. Discussion
6. Conclusions
- Further optimization of time-series models: improving the Transformer encoder, or exploring other architectures suited to time-series modeling, could reduce the computational and memory demands.
- Integration across diverse data sources: incorporating additional data sources into the framework could further enhance segmentation accuracy; for instance, meteorological data, land-use data, and other sources may provide valuable information for interpreting land-cover phenology.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Garnot, V.S.F.; Landrieu, L. Panoptic segmentation of satellite image time series with convolutional temporal attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 4872–4881.
2. Abad, M.S.J.; Abkar, A.A.; Mojaradi, B. Effect of the temporal gradient of vegetation indices on early-season wheat classification using the random forest classifier. Appl. Sci. 2018, 8, 1216.
3. Chen, Y.; Li, M.; Zhang, Z. Does the Rural Land Transfer Promote the Non-Grain Production of Cultivated Land in China? Land 2023, 12, 688.
4. Pluto-Kossakowska, J. Review on multitemporal classification methods of satellite images for crop and arable land recognition. Agriculture 2021, 11, 999.
5. Pandey, P.C.; Koutsias, N.; Petropoulos, G.P.; Srivastava, P.K.; Ben Dor, E. Land use/land cover in view of earth observation: Data sources, input dimensions, and classifiers—A review of the state of the art. Geocarto Int. 2021, 36, 957–988.
6. Yuan, X.; Shi, J.; Gu, L. A review of deep learning methods for semantic segmentation of remote sensing imagery. Expert Syst. Appl. 2021, 169, 114417.
7. Wang, L.; Yan, J.; Mu, L.; Huang, L. Knowledge discovery from remote sensing images: A review. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2020, 10, e1371.
8. Wang, Y.; Zhang, D.; Dai, G. Classification of high resolution satellite images using improved U-Net. Int. J. Appl. Math. Comput. Sci. 2020, 30, 399–413.
9. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241.
10. Xu, Z.; Zhang, W.; Zhang, T.; Li, J. HRCNet: High-resolution context extraction network for semantic segmentation of remote sensing images. Remote Sens. 2020, 13, 71.
11. Ding, L.; Tang, H.; Bruzzone, L. LANet: Local attention embedding to improve the semantic segmentation of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 426–435.
12. Yu, M.; Qin, F. Research on the Applicability of Transformer Model in Remote-Sensing Image Segmentation. Appl. Sci. 2023, 13, 2261.
13. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
14. Wang, H.; Chen, X.; Zhang, T.; Xu, Z.; Li, J. CCTNet: Coupled CNN and transformer network for crop segmentation of remote sensing images. Remote Sens. 2022, 14, 1956.
15. Zhang, C.; Jiang, W.; Zhang, Y.; Wang, W.; Zhao, Q.; Wang, C. Transformer and CNN hybrid deep neural network for semantic segmentation of very-high-resolution remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–20.
16. Li, Y.; Cheng, Z.; Wang, C.; Zhao, J.; Huang, L. RCCT-ASPPNet: Dual-Encoder Remote Image Segmentation Based on Transformer and ASPP. Remote Sens. 2023, 15, 379.
17. Tian, Q.; Zhao, F.; Zhang, Z.; Qu, H. GLFFNet: A Global and Local Features Fusion Network with Biencoder for Remote Sensing Image Segmentation. Appl. Sci. 2023, 13, 8725.
18. He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin transformer embedding UNet for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15.
19. Bolton, D.K.; Friedl, M.A. Forecasting crop yield using remotely sensed vegetation indices and crop phenology metrics. Agric. For. Meteorol. 2013, 173, 74–84.
20. Pan, L.; Xia, H.; Zhao, X.; Guo, Y.; Qin, Y. Mapping winter crops using a phenology algorithm, time-series Sentinel-2 and Landsat-7/8 images, and Google Earth Engine. Remote Sens. 2021, 13, 2510.
21. Meroni, M.; d'Andrimont, R.; Vrieling, A.; Fasbender, D.; Lemoine, G.; Rembold, F.; Seguini, L.; Verhegghen, A. Comparing land surface phenology of major European crops as derived from SAR and multispectral data of Sentinel-1 and -2. Remote Sens. Environ. 2021, 253, 112232.
22. Moskolaï, W.R.; Abdou, W.; Dipanda, A.; Kolyang. Application of deep learning architectures for satellite image time series prediction: A review. Remote Sens. 2021, 13, 4822.
23. Sakamoto, T.; Yokozawa, M.; Toritani, H.; Shibayama, M.; Ishitsuka, N.; Ohno, H. A crop phenology detection method using time-series MODIS data. Remote Sens. Environ. 2005, 96, 366–374.
24. Sun, C.; Bian, Y.; Zhou, T.; Pan, J. Using of multi-source and multi-temporal remote sensing data improves crop-type mapping in the subtropical agriculture region. Sensors 2019, 19, 2401.
25. Garnot, V.S.F.; Landrieu, L.; Giordano, S.; Chehata, N. Satellite image time series classification with pixel-set encoders and temporal self-attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12325–12334.
26. Chen, G.; Li, C.; Wei, W.; Jing, W.; Woźniak, M.; Blažauskas, T.; Damaševičius, R. Fully convolutional neural network with augmented atrous spatial pyramid pool and fully connected fusion path for high resolution remote sensing image segmentation. Appl. Sci. 2019, 9, 1816.
27. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
28. Diakogiannis, F.I.; Waldner, F.; Caccetta, P.; Wu, C. ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data. ISPRS J. Photogramm. Remote Sens. 2020, 162, 94–114.
29. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
30. Kattenborn, T.; Leitloff, J.; Schiefer, F.; Hinz, S. Review on Convolutional Neural Networks (CNN) in vegetation remote sensing. ISPRS J. Photogramm. Remote Sens. 2021, 173, 24–49.
31. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv 2014, arXiv:1412.7062.
32. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
33. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
34. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
35. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
36. Dou, G.; Zhao, K.; Guo, M.; Mou, J. Memristor-based LSTM network for text classification. Fractals 2023, 31, 2340040.
37. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.-Y.; Wong, W.-K.; Woo, W.-C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Adv. Neural Inf. Process. Syst. 2015, 28.
38. Rußwurm, M.; Körner, M. Convolutional LSTMs for cloud-robust segmentation of remote sensing imagery. arXiv 2018, arXiv:1811.02471.
39. Ballas, N.; Yao, L.; Pal, C.; Courville, A. Delving deeper into convolutional networks for learning video representations. arXiv 2015, arXiv:1511.06432.
40. Rustowicz, R.M.; Cheong, R.; Wang, L.; Ermon, S.; Burke, M.; Lobell, D. Semantic segmentation of crop type in Africa: A novel dataset and analysis of deep learning methods. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 75–82.
Training STENN for SITS Semantic Segmentation
Input: SITS data X; ground-truth labels Y
Output: semantic segmentation R
1. Load the training dataset X, configure the model parameters, and initialize the weights of the STENN model.
2. While step ≤ Epoch:
3. Iterate through all the data in X and use the STENN model to generate the segmentation result R.
4. Calculate the loss between R and Y using the specified loss function, and update the model's parameters from the computed loss.
5. End the training loop once the defined number of epochs has been reached.
6. Validate and test the trained STENN model, and save all the results.
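A hedged PyTorch rendering of this loop follows. The random data, batch size, learning rate, optimizer, and loss function are illustrative stand-ins, and `STENNSketch` refers to the sketch in Section 3.2; the repository cited in Section 1 contains the actual training code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-ins: random SITS batches and labels for 20 classes.
X = torch.randn(8, 6, 10, 32, 32)                  # (N, T, C, H, W)
Y = torch.randint(0, 20, (8, 32, 32))              # (N, H, W)
loader = DataLoader(TensorDataset(X, Y), batch_size=2, shuffle=True)

model = STENNSketch()                               # step 1: initialize the model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(10):                             # step 2: loop until Epoch is reached
    for x, y in loader:                             # step 3: iterate through X
        logits = model(x)                           # segmentation result R
        loss = criterion(logits, y)                 # step 4: loss between R and Y
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
# Step 6 (validation, testing, and saving results) is omitted from this sketch.
```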
Layer Name | Kernel Size | Dilation Rate | Input Channels | Output Channels
---|---|---|---|---
Conv Layer-1 | | {1, 2, 3} | 10 | 16
Conv Layer-2 | | {2, 4, 6} | 26 | 32
Conv Layer-3 | | {2, 4, 6} | 58 | 32
Conv Layer-4 | | {2, 4, 6} | 90 | 64
Concat (dim = channel) & Reshape | | | |
Transformer encoder (input = 144, heads = 8) | | | |
Mean (dim = time) & Reshape | | | |
Conv Layer-5 | {1, 1} | | 144 | 20
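Below is a sketch of the dense dilated block this table describes, under explicit assumptions: we read the braced values as per-layer dilation rates, realize them as parallel 3 × 3 convolution branches whose outputs are summed, and take only the dense channel bookkeeping (each layer sees the original input concatenated with all previous outputs) directly from the table; the kernel size and the branch-combination rule are our guesses, not the paper's specification.

```python
import torch
import torch.nn as nn

class DilatedDenseBlock(nn.Module):
    """Dense stack of multi-dilation conv layers matching the channel
    bookkeeping of the table: 10 -> 16/32/32/64, concatenated to 144."""

    def __init__(self):
        super().__init__()
        cfg = [(10, 16, (1, 2, 3)),   # Conv Layer-1
               (26, 32, (2, 4, 6)),   # Conv Layer-2: 10 + 16 input channels
               (58, 32, (2, 4, 6)),   # Conv Layer-3: 10 + 16 + 32
               (90, 64, (2, 4, 6))]   # Conv Layer-4: 10 + 16 + 32 + 32
        self.layers = nn.ModuleList(
            nn.ModuleList(
                nn.Conv2d(cin, cout, 3, padding=d, dilation=d) for d in rates
            )
            for cin, cout, rates in cfg
        )

    def forward(self, x):                           # x: (N, 10, H, W)
        feats = [x]
        outs = []
        for branches in self.layers:
            inp = torch.cat(feats, dim=1)           # dense connection
            out = torch.relu(sum(b(inp) for b in branches))  # summed dilation branches
            feats.append(out)
            outs.append(out)
        return torch.cat(outs, dim=1)               # (N, 16+32+32+64 = 144, H, W)

assert DilatedDenseBlock()(torch.randn(1, 10, 32, 32)).shape[1] == 144
```

Summing the dilated branches keeps the listed output widths (16 is not divisible by 3, so channel-wise concatenation of branches would not match the table), which is why the sketch combines them by addition.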