RiceStageSeg: A Multimodal Benchmark Dataset for Semantic Segmentation of Rice Growth Stages
Abstract
1. Introduction
- We construct and publicly release a high-quality multimodal semantic segmentation dataset, RiceStageSeg, which covers multiple key growth stages of rice and includes both RGB and multispectral imagery.
- We propose and evaluate multiple multimodal fusion strategies based on state-of-the-art semantic segmentation models, and provide reproducible benchmark experiments (a generic input-level fusion example is sketched after this list).
- We explore the complementary nature of information across different remote sensing modalities and analyze fusion mechanisms, offering insights for multimodal modeling in agricultural remote sensing.
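One of the simplest possible fusion strategies is input-level (early) fusion, in which the RGB and multispectral channels are stacked before entering a single segmentation network. The sketch below illustrates that idea in PyTorch; it is a generic example, and the backbone, channel widths, and patch size are illustrative assumptions rather than the exact architectures benchmarked in Section 4.

```python
import torch
import torch.nn as nn

class EarlyFusionSegmenter(nn.Module):
    """Minimal early-fusion sketch: RGB (3 channels) and multispectral
    (10 channels) patches are concatenated along the channel axis and fed
    to any backbone that accepts 13 input channels. The backbone here is a
    placeholder, not one of the benchmarked models."""

    def __init__(self, backbone: nn.Module, num_classes: int = 7):
        super().__init__()
        self.backbone = backbone                     # maps (B, 13, H, W) -> (B, 64, H, W)
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, rgb: torch.Tensor, ms: torch.Tensor) -> torch.Tensor:
        # The MS tensor is assumed to be resampled to the RGB grid beforehand.
        x = torch.cat([rgb, ms], dim=1)              # (B, 3 + 10, H, W)
        return self.classifier(self.backbone(x))

# Usage with a trivial stand-in backbone (illustrative only):
backbone = nn.Sequential(nn.Conv2d(13, 64, 3, padding=1), nn.ReLU())
model = EarlyFusionSegmenter(backbone, num_classes=7)
out = model(torch.randn(2, 3, 256, 256), torch.randn(2, 10, 256, 256))
print(out.shape)  # torch.Size([2, 7, 256, 256])
```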
2. Related Work
2.1. Crop Growth Stage Monitoring Methods
2.2. Crop Growth Stage Dataset
2.3. Remote Sensing Semantic Segmentation
3. RiceStageSeg Dataset
3.1. Description of the Study Area
3.2. Data Acquisition and Preprocessing
3.3. Data Annotation
4. Baseline Experiments
4.1. Experimental Settings
4.2. Unimodal Segmentation
4.2.1. RGB-Only Segmentation
4.2.2. Multispectral-Only Segmentation
4.3. RGB + MS Fusion Segmentation
4.4. Results Summary
5. Discussion
5.1. Consistency of Phenological Stage Classification Across Temporal Variations
5.2. Analysis of Multispectral Modality Performance and Its Potential
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
RGB Image ID | Channels | Location | Growth Stage | Cell Size (m) | Resolution (Pixels)
---|---|---|---|---|---
1 | 3 | Kaizhou | Heading | (0.02, 0.02) | (24,503, 19,576) |
2 | 3 | Kaizhoudade | Green-up, Jointing | (0.02, 0.02) | (22,869, 22,160) |
3 | 3 | Nanchuan Fushou | Jointing | (0.02, 0.02) | (20,154, 26,666) |
4 | 3 | Nanchuan Fushou | Milky | (0.02, 0.02) | (19,198, 26,955) |
5 | 3 | Nanchuan Nongji | Jointing | (0.02, 0.02) | (31,413, 24,432) |
6 | 3 | Nanchuan Nongji | Green-up, Tillering | (0.02, 0.02) | (30,606, 24,328) |
7 | 3 | Nanchuan Nongji | Milky, Maturity | (0.02, 0.02) | (28,047, 24,769) |
8 | 3 | Nanchuan Quyu | Green-up, Tillering | (0.02, 0.02) | (39,667, 14,387) |
9 | 3 | Tongnan Chongcan | Heading | (0.02, 0.02) | (27,467, 22,930) |
10 | 3 | Tongnan Chongcan | Milky, Maturity | (0.02, 0.02) | (18,702, 23,152) |
11 | 3 | Tongnan Zitong | Jointing | (0.02, 0.02) | (32,051, 20,554) |
12 | 3 | Tongnan Zitong | Milky, Maturity | (0.02, 0.02) | (26,292, 22,744) |
13 | 3 | Yongchuan Laishu | Maturity | (0.02, 0.02) | (22,190, 24,719) |
14 | 3 | Yongchuan Laishu | Heading | (0.02, 0.02) | (20,569, 25,733) |
15 | 3 | Yongchuan Laishu | Green-up | (0.02, 0.02) | (24,345, 27,095) |
16 | 3 | Youyang Guanba | Heading | (0.02, 0.02) | (35,245, 15,946) |
Multispectral Image ID | Channels | Location | Growth Stage | Cell Size (m) | Resolution (Pixels)
---|---|---|---|---|---
1 | 10 | Kaizhou | Heading | (0.04, 0.04) | (12,122, 10,323) |
2 | 10 | Kaizhoudade | Green-up, Jointing | (0.05, 0.05) | (11,011, 13,018) |
3 | 10 | Nanchuan Fushou | Jointing | (0.04, 0.04) | (9204, 14,935) |
4 | 10 | Nanchuan Fushou | Milky | (0.04, 0.04) | (9001, 15,524) |
5 | 10 | Nanchuan Nongji | Jointing | (0.04, 0.04) | (14,233, 12,994) |
6 | 10 | Nanchuan Nongji | Green-up, Tillering | (0.04, 0.04) | (13,967, 13,956) |
7 | 10 | Nanchuan Nongji | Milky, Maturity | (0.04, 0.04) | (13,411, 14,528) |
8 | 10 | Nanchuan Quyu | Green-up, Tillering | (0.03, 0.03) | (19,140, 8106) |
9 | 10 | Tongnan Chongcan | Heading | (0.04, 0.04) | (12,853, 12,472) |
10 | 10 | Tongnan Chongcan | Milky, Maturity | (0.04, 0.04) | (9105, 14,021) |
11 | 10 | Tongnan Zitong | Jointing | (0.04, 0.04) | (14,494, 11,297) |
12 | 10 | Tongnan Zitong | Milky, Maturity | (0.04, 0.04) | (12,136, 13,055) |
13 | 10 | Yongchuan Laishu | Maturity | (0.05, 0.05) | (10,291, 14,054) |
14 | 10 | Yongchuan Laishu | Heading | (0.05, 0.05) | (12,193, 15,204) |
15 | 10 | Yongchuan Laishu | Green-up | (0.05, 0.05) | (11,139, 14,578) |
16 | 10 | Youyang Guanba | Heading | (0.03, 0.03) | (19,209, 10,078) |
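Because the RGB orthomosaics (0.02 m cell size) and the multispectral orthomosaics (0.03–0.05 m) are gridded differently, pixel-level fusion requires resampling one modality onto the other's grid. The snippet below shows one plausible way to do this with rasterio; the file names and the choice of bilinear resampling are assumptions for illustration, not the dataset's documented preprocessing pipeline.

```python
import numpy as np
import rasterio
from rasterio.warp import reproject, Resampling

# Hypothetical file names; the released dataset's actual layout may differ.
RGB_PATH = "image_05_rgb.tif"   # 3 bands, 0.02 m cell size
MS_PATH = "image_05_ms.tif"     # 10 bands, 0.04 m cell size

with rasterio.open(RGB_PATH) as rgb_src, rasterio.open(MS_PATH) as ms_src:
    rgb = rgb_src.read().astype(np.float32)                      # (3, H, W)
    # Allocate the multispectral stack on the RGB grid so the two
    # modalities can be fused pixel by pixel.
    ms_on_rgb_grid = np.zeros((ms_src.count, rgb_src.height, rgb_src.width),
                              dtype=np.float32)
    reproject(
        source=ms_src.read().astype(np.float32),
        destination=ms_on_rgb_grid,
        src_transform=ms_src.transform,
        src_crs=ms_src.crs,
        dst_transform=rgb_src.transform,
        dst_crs=rgb_src.crs,
        resampling=Resampling.bilinear,   # upsample from ~0.04 m to 0.02 m
    )

fused = np.concatenate([rgb, ms_on_rgb_grid], axis=0)            # (13, H, W)
print(fused.shape)
```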
Image ID | Total Pixels | Background | Green-Up | Tillering | Heading | Jointing | Milky | Maturity
---|---|---|---|---|---|---|---|---
1 | 479,670,728 | 362,061,556 | 0 | 0 | 117,609,172 | 0 | 0 | 0 |
2 | 506,777,040 | 456,092,671 | 1,303,488 | 0 | 0 | 49,380,881 | 0 | 0 |
3 | 537,426,564 | 451,466,886 | 0 | 0 | 0 | 85,959,678 | 0 | 0 |
4 | 517,482,090 | 431,287,450 | 0 | 0 | 0 | 0 | 86,194,640 | 0 |
5 | 767,482,416 | 542,829,249 | 0 | 0 | 0 | 224,653,167 | 0 | 0 |
6 | 744,582,768 | 536,323,332 | 192,904,520 | 15,354,916 | 0 | 0 | 0 | 0 |
7 | 694,696,143 | 521,917,432 | 0 | 0 | 0 | 0 | 172,154,907 | 623,804 |
8 | 570,689,129 | 361,757,052 | 47,439,505 | 161,492,572 | 0 | 0 | 0 | 0 |
9 | 629,818,310 | 395,165,722 | 0 | 0 | 234,652,588 | 0 | 0 | 0 |
10 | 432,988,704 | 241,591,291 | 0 | 0 | 0 | 0 | 123,610,515 | 67,786,898 |
11 | 658,776,254 | 535,556,236 | 0 | 0 | 0 | 123,220,018 | 0 | 0 |
12 | 597,985,248 | 445,467,792 | 0 | 0 | 0 | 0 | 144,206,002 | 8,311,454 |
13 | 548,514,610 | 418,338,026 | 0 | 0 | 0 | 0 | 0 | 130,176,584 |
14 | 529,302,077 | 412,100,804 | 0 | 0 | 117,201,273 | 0 | 0 | 0 |
15 | 659,627,775 | 515,968,905 | 143,658,870 | 0 | 0 | 0 | 0 | 0 |
16 | 562,016,770 | 310,407,105 | 0 | 0 | 251,609,665 | 0 | 0 | 0 |
Total | 9,437,836,626 | 6,938,331,509 | 385,306,383 | 176,847,488 | 721,072,698 | 483,213,744 | 526,166,064 | 206,898,740 |
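The values above are per-class pixel counts on the RGB grid (for example, image 1's total of 479,670,728 equals its 24,503 × 19,576 resolution), and they show a strong imbalance toward the background class. If class balancing were desired, inverse-frequency loss weights could be derived directly from the "Total" row as sketched below; the baseline protocol in Section 4 is not stated to use such weighting, so this is an optional recipe rather than part of the benchmark.

```python
import numpy as np

# Per-class pixel counts from the "Total" row above
# (order: background, green-up, tillering, heading, jointing, milky, maturity).
pixel_counts = np.array([
    6_938_331_509, 385_306_383, 176_847_488, 721_072_698,
    483_213_744, 526_166_064, 206_898_740,
], dtype=np.float64)

freq = pixel_counts / pixel_counts.sum()
weights = 1.0 / freq
weights /= weights.mean()   # normalize so the average class weight is 1.0

names = ["background", "green-up", "tillering", "heading",
         "jointing", "milky", "maturity"]
for name, w in zip(names, weights):
    print(f"{name:10s} weight = {w:6.2f}")

# The resulting vector could be passed to, e.g.,
# torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32)).
```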
Configuration | Parameter
---|---
Programming language | Python 3.8.20
Libraries | PyTorch 2.0.0, MMCV 2.2.0
CPU | AMD EPYC 7542 32-Core Processor |
GPU | NVIDIA GeForce RTX 4090 |
Operating system | Ubuntu 22.04 |
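Since the baselines are implemented on top of MMCV/MMSegmentation, a configuration for one of them (UPerNet with a ResNetV1c backbone, as listed in the result tables) might look like the sketch below. Only the component types and the seven-class output come from this paper's tables; the depth, channel widths, and loss weights are generic MMSegmentation-style values used here as assumptions.

```python
# Illustrative MMSegmentation-style config for the UPerNet + ResNetV1c baseline.
norm_cfg = dict(type='SyncBN', requires_grad=True)

model = dict(
    type='EncoderDecoder',
    backbone=dict(
        type='ResNetV1c',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        norm_cfg=norm_cfg,
    ),
    decode_head=dict(
        type='UPerHead',
        in_channels=[256, 512, 1024, 2048],
        in_index=[0, 1, 2, 3],
        channels=512,
        num_classes=7,   # background + six growth-stage classes
        norm_cfg=norm_cfg,
        loss_decode=dict(type='CrossEntropyLoss', loss_weight=1.0),
    ),
    auxiliary_head=dict(
        type='FCNHead',
        in_channels=1024,
        in_index=2,
        channels=256,
        num_classes=7,
        norm_cfg=norm_cfg,
        loss_decode=dict(type='CrossEntropyLoss', loss_weight=0.4),
    ),
)
```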
Method | Model Settings | IoU (%) Class 0 | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | MIoU (%) | Average F1 (%)
---|---|---|---|---|---|---|---|---|---|---
Swin | backbone: Swin Transformer decode_head: UperHead auxiliary_head: FCNHead | 79.76 | 70.77 | 70.44 | 86.63 | 85.60 | 82.09 | 76.17 | 78.56 | 87.86 |
PSPNet | backbone: ResNetV1c decode_head: PSPHead auxiliary_head: FCNHead | 86.32 | 84.14 | 87.01 | 92.45 | 84.02 | 86.01 | 84.25 | 85.31 | 92.63 |
UPerNet | backbone: ResNetV1c decode_head: UperHead auxiliary_head: FCNHead | 80.76 | 82.35 | 82.06 | 85.28 | 85.42 | 86.32 | 80.43 | 83.23 | 90.15 |
HRNet | backbone: HRNet decode_head: FCNHead | 78.19 | 82.12 | 84.01 | 86.33 | 86.79 | 84.46 | 74.06 | 82.28 | 89.13 |
U-Net | backbone: U-Net decode_head: FCNHead | 81.15 | 68.07 | 73.97 | 90.16 | 89.93 | 85.13 | 76.38 | 75.74 | 85.03 |
ViT | backbone: Vision Transformer decode_head: UperHead auxiliary_head: FCNHead | 80.72 | 76.65 | 75.84 | 89.54 | 86.96 | 83.51 | 75.55 | 81.25 | 89.57 |
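For reference, the per-class IoU, MIoU, and average F1 reported in these tables follow the standard confusion-matrix definitions. The snippet below is a minimal NumPy re-implementation of those metrics, independent of the authors' evaluation code, shown on a toy three-class confusion matrix.

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """Per-class IoU, mean IoU, and macro-averaged F1 from a KxK confusion
    matrix (rows: ground-truth classes, columns: predicted classes)."""
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)   # per-class F1 (equivalently, the Dice score)
    return iou, iou.mean(), f1.mean()

# Toy 3-class example:
conf = np.array([[50, 2, 1],
                 [3, 40, 2],
                 [0, 1, 30]])
iou, miou, avg_f1 = segmentation_metrics(conf)
print(np.round(100 * iou, 2), round(100 * miou, 2), round(100 * avg_f1, 2))
```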
Method | Model Settings | IoU (%) Class 0 | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | MIoU (%) | Average F1 (%)
---|---|---|---|---|---|---|---|---|---|---
Swin | backbone: Swin Transformer decode_head: UperHead auxiliary_head: FCNHead | 75.71 | 73.16 | 78.66 | 85.22 | 83.05 | 85.15 | 73.34 | 79.18 | 88.30 |
PSPNet | backbone: ResNetV1c decode_head: PSPHead auxiliary_head: FCNHead | 79.98 | 83.9 | 89.18 | 88.74 | 89.66 | 88.45 | 80.46 | 85.77 | 92.29 |
UPerNet | backbone: ResNetV1c decode_head: UperHead auxiliary_head: FCNHead | 77.21 | 80.26 | 85.8 | 87.67 | 87.01 | 87.96 | 79.81 | 83.67 | 91.06 |
HRNet | backbone: HRNet decode_head: FCNHead | 78.2 | 82.82 | 87.84 | 85.59 | 88.03 | 86.84 | 76.71 | 83.72 | 91.08 |
U-Net | backbone: U-Net decode_head: FCNHead | 69.9 | 62.44 | 70.83 | 72.41 | 68.66 | 68.39 | 56.89 | 67.07 | 80.18 |
ViT | backbone: Vision Transformer decode_head: UperHead auxiliary_head: FCNHead | 74.74 | 68.62 | 80.26 | 77.12 | 71.72 | 68.14 | 57.53 | 71.16 | 82.96 |
Method | Model Settings | IoU (%) Class 0 | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | MIoU (%) | Average F1 (%)
---|---|---|---|---|---|---|---|---|---|---
Swin | backbone: Swin Transformer decode_head: UperHead auxiliary_head: FCNHead | 85.09 | 84.24 | 85.53 | 90.1 | 86.92 | 89.58 | 82.65 | 86.30 | 92.63 |
PSPNet | backbone: ResNetV1c decode_head: PSPHead auxiliary_head: FCNHead | 87.51 | 89.27 | 89.77 | 93.85 | 92.56 | 90.42 | 84.05 | 89.63 | 94.51 |
UPerNet | backbone: ResNetV1c decode_head: UperHead auxiliary_head: FCNHead | 86.74 | 88.17 | 90.81 | 94.38 | 92.51 | 93.59 | 91.01 | 91.03 | 95.29 |
HRNet | backbone: HRNet decode_head: FCNHead | 86.99 | 87.96 | 91.37 | 92.86 | 90.24 | 89.43 | 83.89 | 88.97 | 94.14 |
U-Net | backbone: U-Net decode_head: FCNHead | 81.69 | 75.7 | 75.09 | 85.1 | 82.95 | 79.7 | 73.41 | 79.09 | 88.27 |
ViT | backbone: Vision Transformer decode_head: UperHead auxiliary_head: FCNHead | 82.59 | 81.35 | 85.02 | 89.19 | 85.27 | 86.51 | 78.15 | 84.01 | 91.27 |
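Beyond input-level concatenation, a common alternative in RGB + MS fusion is feature-level (mid) fusion, where each modality is encoded separately and the feature maps are merged before decoding. The block below is a generic sketch of that idea, not the specific fusion module used for this benchmark; the channel widths and the 1×1-convolution mixing are assumptions.

```python
import torch
import torch.nn as nn

class FeatureFusionBlock(nn.Module):
    """Generic feature-level fusion: per-modality feature maps are spatially
    aligned, concatenated, and mixed with a 1x1 convolution before being
    passed to a shared decoder."""

    def __init__(self, rgb_dim: int, ms_dim: int, out_dim: int):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(rgb_dim + ms_dim, out_dim, kernel_size=1),
            nn.BatchNorm2d(out_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, rgb_feat: torch.Tensor, ms_feat: torch.Tensor) -> torch.Tensor:
        if ms_feat.shape[-2:] != rgb_feat.shape[-2:]:
            # Align spatial sizes when the MS branch runs at a coarser resolution.
            ms_feat = nn.functional.interpolate(
                ms_feat, size=rgb_feat.shape[-2:], mode="bilinear", align_corners=False)
        return self.mix(torch.cat([rgb_feat, ms_feat], dim=1))

# Example: fuse 256-channel RGB features with 256-channel MS features.
fusion = FeatureFusionBlock(256, 256, 256)
fused = fusion(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 32, 32))
print(fused.shape)  # torch.Size([1, 256, 64, 64])
```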
Method | Model Settings | Image 5: IoU Class 0 (%) | Image 5: IoU Class 1 (%) | Image 5: MIoU (%) | Image 5: Average F1 (%) | Image 11: IoU Class 0 (%) | Image 11: IoU Class 1 (%) | Image 11: MIoU (%) | Image 11: Average F1 (%)
---|---|---|---|---|---|---|---|---|---
Swin | backbone: Swin Transformer decode_head: UperHead auxiliary_head: FCNHead | 79.76 | 86.63 | 81.77 | 85.26 | 81.24 | 87.53 | 82.50 | 87.20 |
PSPNet | backbone: ResNetV1c decode_head: PSPHead auxiliary_head: FCNHead | 86.32 | 92.45 | 88.11 | 92.10 | 86.73 | 90.01 | 87.39 | 91.13 |
UPerNet | backbone: ResNetV1c decode_head: UperHead auxiliary_head: FCNHead | 80.76 | 85.28 | 82.08 | 85.48 | 79.22 | 88.12 | 81.00 | 84.79 |
HRNet | backbone: HRNet decode_head: FCNHead | 78.19 | 86.33 | 80.57 | 84.91 | 78.98 | 88.10 | 80.80 | 84.81 |
U-Net | backbone: U-Net decode_head: FCNHead | 81.15 | 90.16 | 83.79 | 87.66 | 81.11 | 88.33 | 82.55 | 86.99 |
ViT | backbone: Vision Transformer decode_head: UperHead auxiliary_head: FCNHead | 80.72 | 89.54 | 83.30 | 86.72 | 77.87 | 90.41 | 80.38 | 83.73 |
Method | Model Settings | Image 5: IoU Class 0 (%) | Image 5: IoU Class 1 (%) | Image 5: MIoU (%) | Image 5: Average F1 (%) | Image 11: IoU Class 0 (%) | Image 11: IoU Class 1 (%) | Image 11: MIoU (%) | Image 11: Average F1 (%)
---|---|---|---|---|---|---|---|---|---
Swin | backbone: Swin Transformer decode_head: UperHead auxiliary_head: FCNHead | 75.71 | 85.22 | 78.49 | 82.9 | 78.43 | 87.71 | 80.29 | 84.03 |
PSPNet | backbone: ResNetV1c decode_head: PSPHead auxiliary_head: FCNHead | 79.98 | 88.74 | 82.54 | 85.56 | 82.55 | 88.31 | 83.70 | 88.63 |
UPerNet | backbone: ResNetV1c decode_head: UperHead auxiliary_head: FCNHead | 77.21 | 87.67 | 80.27 | 84.72 | 79.33 | 86.44 | 80.75 | 84.52 |
HRNet | backbone: HRNet decode_head: FCNHead | 78.20 | 85.59 | 80.36 | 84.64 | 77.12 | 83.61 | 78.40 | 82.51 |
U-Net | backbone: U-Net decode_head: FCNHead | 69.91 | 72.41 | 70.63 | 75.03 | 71.08 | 72.83 | 71.43 | 74.62 |
ViT | backbone: Vision Transformer decode_head: UperHead auxiliary_head: FCNHead | 74.74 | 77.12 | 75.44 | 79.36 | 77.68 | 74.96 | 77.14 | 81.18 |
Method | Model Settings | Image 5: IoU Class 0 (%) | Image 5: IoU Class 1 (%) | Image 5: MIoU (%) | Image 5: Average F1 (%) | Image 11: IoU Class 0 (%) | Image 11: IoU Class 1 (%) | Image 11: MIoU (%) | Image 11: Average F1 (%)
---|---|---|---|---|---|---|---|---|---
Swin | backbone: Swin Transformer decode_head: UperHead auxiliary_head: FCNHead | 85.09 | 90.10 | 86.56 | 90.88 | 86.53 | 91.28 | 87.48 | 91.88 |
PSPNet | backbone: ResNetV1c decode_head: PSPHead auxiliary_head: FCNHead | 87.51 | 93.85 | 89.37 | 92.91 | 86.27 | 95.71 | 88.16 | 92.78 |
UPerNet | backbone: ResNetV1c decode_head: UperHead auxiliary_head: FCNHead | 86.74 | 94.38 | 88.98 | 93.28 | 89.22 | 94.45 | 90.27 | 94.27 |
HRNet | backbone: HRNet decode_head: FCNHead | 86.99 | 92.86 | 88.71 | 92.91 | 87.89 | 94.07 | 89.13 | 93.72 |
U-Net | backbone: U-Net decode_head: FCNHead | 81.69 | 85.10 | 82.69 | 87.03 | 80.72 | 84.35 | 81.45 | 84.64 |
ViT | backbone: Vision Transformer decode_head: UperHead auxiliary_head: FCNHead | 82.59 | 89.19 | 84.52 | 88.39 | 79.81 | 88.98 | 81.64 | 85.73 |
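The per-image comparisons above evaluate each model on individual orthomosaics (Images 5 and 11) rather than on the pooled test set. A single-image prediction of this kind can be obtained with MMSegmentation's high-level inference API as sketched below; the config path, checkpoint path, and tile file name are placeholders rather than files shipped with RiceStageSeg.

```python
import numpy as np
from mmseg.apis import init_model, inference_model

# Placeholder paths; substitute the actual baseline config and trained weights.
CONFIG = "configs/upernet_r50_ricestageseg.py"
CHECKPOINT = "work_dirs/upernet_r50_ricestageseg/latest.pth"

model = init_model(CONFIG, CHECKPOINT, device="cuda:0")

# Run inference on one tile cropped from, e.g., Image 5 or Image 11.
result = inference_model(model, "tiles/image_05/patch_0001.png")
pred = result.pred_sem_seg.data.squeeze().cpu().numpy()   # (H, W) class indices 0-6

print("predicted classes present:", np.unique(pred))
```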
Method | Model Settings | IoU (%) Class 0 | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | MIoU (%) | Average F1 (%)
---|---|---|---|---|---|---|---|---|---|---
HyperSIGMA | backbone: HyperSIGMA(base) decode_head: UperHead | 85.31 | 86.46 | 90.23 | 91.07 | 88.67 | 87.71 | 82.27 | 87.99 | 92.95 |
SpectralGPT | backbone: SpectralGPT(base) decode_head: UperHead | 84.14 | 83.17 | 84.36 | 88.95 | 85.54 | 88.39 | 81.36 | 85.33 | 91.57 |
SatMAE | backbone: SatMAE(base) decode_head: UperHead | 81.71 | 80.44 | 83.92 | 87.63 | 83.87 | 84.95 | 76.82 | 82.71 | 89.70 |