Multi-View Edge Attention Network for Fine-Grained Food Image Segmentation
Abstract
1. Introduction
2. Materials and Methods
2.1. Datasets
2.2. Equipment and Experimental Setup
2.3. Method
2.3.1. STViT Backbone
2.3.2. Feature Extraction
2.3.3. Detail Enhancement Decoder
3. Results
3.1. Comparative Experimental Results
3.2. Ablation Study
3.3. Qualitative Evaluation
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Min, W.; Jiang, S.; Liu, L.; Rui, Y.; Jain, R. A survey on food computing. ACM Comput. Surv. (CSUR) 2019, 52, 1–36. [Google Scholar] [CrossRef]
- Wu, X.; Yu, S.; Lim, E.P.; Ngo, C.W. Ovfoodseg: Elevating open-vocabulary food image segmentation via image-informed textual representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4144–4153. [Google Scholar]
- Min, W.; Wang, Z.; Liu, Y.; Luo, M.; Kang, L.; Wei, X.; Wei, X.; Jiang, S. Large scale visual food recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9932–9949. [Google Scholar] [CrossRef] [PubMed]
- Li, W.; Li, J.; Ma, M.; Hong, X.; Fan, X. Multi-scale spiking pyramid wireless communication framework for food recognition. IEEE Trans. Multimed. 2024, 27, 2734–2746. [Google Scholar] [CrossRef]
- Sun, K.; Zhang, Y.J.; Tong, S.Y.; Tang, M.D.; Wang, C.B. Study on rice grain mildewed region recognition based on microscopic computer vision and YOLO-v5 model. Foods 2022, 11, 4031. [Google Scholar] [CrossRef] [PubMed]
- Liang, S.; Gu, Y. A Coarse-to-Fine Feature Aggregation Neural Network with a Boundary-Aware Module for Accurate Food Recognition. Foods 2025, 14, 383. [Google Scholar] [CrossRef]
- Chen, Z.; Wang, J.; Wang, Y. Enhancing Food Image Recognition by Multi-Level Fusion and the Attention Mechanism. Foods 2025, 14, 461. [Google Scholar] [CrossRef]
- Shao, W.; Min, W.; Hou, S.; Luo, M.; Li, T.; Zheng, Y.; Jiang, S. Vision-based food nutrition estimation via RGB-D fusion network. Food Chem. 2023, 424, 136309. [Google Scholar] [CrossRef]
- Yang, X.; Ho, C.T.; Gao, X.; Chen, N.; Chen, F.; Zhu, Y.; Zhang, X. Machine learning: An effective tool for monitoring and ensuring food safety, quality, and nutrition. Food Chem. 2025, 477, 143391. [Google Scholar] [CrossRef]
- Shao, W.; Hou, S.; Jia, W.; Zheng, Y. Rapid non-destructive analysis of food nutrient content using swin-nutrition. Foods 2022, 11, 3429. [Google Scholar] [CrossRef]
- Li, T.; Wei, W.; Xing, S.; Min, W.; Zhang, C.; Jiang, S. Deep learning-based near-infrared hyperspectral imaging for food nutrition estimation. Foods 2023, 12, 3145. [Google Scholar] [CrossRef] [PubMed]
- Naseem, S.; Rizwan, M. The role of artificial intelligence in advancing food safety: A strategic path to zero contamination. Food Control 2025, 175, 111292. [Google Scholar] [CrossRef]
- Panahi, O. The Future of Healthcare: AI. Public Health Digit. Revolut. Med. Clin. Case Rep. J. 2025, 3, 763–766. [Google Scholar]
- Panahi, O. The role of artificial intelligence in shaping future health planning. Int. J. Health Policy Plan. 2025, 4, 1–5. [Google Scholar]
- Wang, W.; Min, W.; Li, T.; Dong, X.; Li, H.; Jiang, S. A review on vision-based analysis for automatic dietary assessment. Trends Food Sci. Technol. 2022, 122, 223–237. [Google Scholar] [CrossRef]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Liang-Chieh, C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A. Semantic image segmentation with deep convolutional nets and fully connected crfs. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
- Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 603–612. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
- Wang, Q.; Dong, X.; Wang, R.; Sun, H. Swin transformer based pyramid pooling network for food segmentation. In Proceedings of the 2022 IEEE 2nd International Conference on Software Engineering and Artificial Intelligence (SEAI), Xiamen, China, 10–12 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 64–68. [Google Scholar]
- Alahmari, S.S.; Gardner, M.; Salem, T. Segment Anything in Food Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3715–3720. [Google Scholar]
- Lan, X.; Lyu, J.; Jiang, H.; Dong, K.; Niu, Z.; Zhang, Y.; Xue, J. Foodsam: Any food segmentation. IEEE Trans. Multimed. 2023, 27, 2795–2808. [Google Scholar] [CrossRef]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4015–4026. [Google Scholar]
- Wu, X.; Fu, X.; Liu, Y.; Lim, E.P.; Hoi, S.C.; Sun, Q. A large-scale benchmark for food image segmentation. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 506–515. [Google Scholar]
- Jaswanthi, R.; Amruthatulasi, E.; Bhavyasree, C.; Satapathy, A. A hybrid network based on GAN and CNN for food segmentation and calorie estimation. In Proceedings of the 2022 International Conference on Sustainable Computing and Data Communication Systems (ICSCDS), Erode, India, 7–9 April 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 436–441. [Google Scholar]
- Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
- Muñoz, B.; Martínez-Arroyo, A.; Acevedo, C.; Aguilar, E. Lightweight DeepLabv3+ for Semantic Food Segmentation. Foods 2025, 14, 1306. [Google Scholar] [CrossRef] [PubMed]
- Liang, X.; Jia, X.; Huang, W.; He, X.; Li, L.; Fan, S.; Li, J.; Zhao, C.; Zhang, C. Real-time grading of defect apples using semantic segmentation combination with a pruned YOLO V4 network. Foods 2022, 11, 3150. [Google Scholar] [CrossRef]
- Verk, J.; Hernavs, J.; Klančnik, S. Using a Region-Based Convolutional Neural Network (R-CNN) for Potato Segmentation in a Sorting Process. Foods 2025, 14, 1131. [Google Scholar] [CrossRef]
- Rodríguez-de Vera, J.M.; Villacorta, P.; Estepa, I.G.; Bolaños, M.; Sarasúa, I.; Nagarajan, B.; Radeva, P. Dining on details: Llm-guided expert networks for fine-grained food recognition. In Proceedings of the 8th International Workshop on Multimedia Assisted Dietary Management, Ottawa, ON, Canada, 29 October 2023; pp. 43–52. [Google Scholar]
- Ponte, D.; Aguilar, E.; Ribera, M.; Radeva, P. Multi-task visual food recognition by integrating an ontology supported with LLM. J. Vis. Commun. Image Represent. 2025, 10, 104484. [Google Scholar] [CrossRef]
- Yu, Q.; Zhao, X.; Pang, Y.; Zhang, L.; Lu, H. Multi-view aggregation network for dichotomous image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3921–3930. [Google Scholar]
- Ke, L.; Ye, M.; Danelljan, M.; Tai, Y.W.; Tang, C.K.; Yu, F. Segment anything in high quality. Adv. Neural Inf. Process. Syst. 2023, 36, 29914–29934. [Google Scholar]
- Salvador, A.; Hynes, N.; Aytar, Y.; Marin, J.; Ofli, F.; Weber, I.; Torralba, A. Learning cross-modal embeddings for cooking recipes and food images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3020–3028. [Google Scholar]
- Okamoto, K.; Yanai, K. UEC-FoodPix Complete: A large-scale food image segmentation dataset. In Proceedings of the ICPR International Workshops and Challenges, Virtual Event, 10–15 January 2021; Part V; Springer: Berlin/Heidelberg, Germany, 2021; pp. 647–659. [Google Scholar]
- Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in Pytorch. 2017. Available online: https://openreview.net/pdf?id=BJJsrmfCZ (accessed on 20 August 2025).
- Huang, H.; Zhou, X.; Cao, J.; He, R.; Tan, T. Vision transformer with super token sampling. arXiv 2022, arXiv:2211.11167. [Google Scholar]
- Wei, J.; Wang, S.; Huang, Q. F3Net: Fusion, feedback and focus for salient object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12321–12328. [Google Scholar]
- Chen, Z.; Xu, Q.; Cong, R.; Huang, Q. Global context-aware progressive aggregation network for salient object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10599–10606. [Google Scholar]
- Mei, H.; Ji, G.P.; Wei, Z.; Yang, X.; Wei, X.; Fan, D.P. Camouflaged object segmentation with distraction mining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8772–8781. [Google Scholar]
- Zhu, H.; Li, P.; Xie, H.; Yan, X.; Liang, D.; Chen, D.; Wei, M.; Qin, J. I can find you! boundary-guided separated attention network for camouflaged object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 3608–3616. [Google Scholar]
- Hu, H.; Chen, Y.; Xu, J.; Borse, S.; Cai, H.; Porikli, F.; Wang, X. Learning implicit feature alignment function for semantic segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 487–505. [Google Scholar]
- Xie, C.; Xia, C.; Ma, M.; Zhao, Z.; Chen, X.; Li, J. Pyramid grafting network for one-stage high resolution saliency detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11717–11726. [Google Scholar]
- Qin, X.; Dai, H.; Hu, X.; Fan, D.P.; Shao, L.; Van Gool, L. Highly accurate dichotomous image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 38–56. [Google Scholar]
- Pei, J.; Zhou, Z.; Jin, Y.; Tang, H.; Heng, P.A. Unite-divide-unite: Joint boosting trunk and structure for high-accuracy dichotomous image segmentation. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 3 November 2023; pp. 2139–2147. [Google Scholar]
- Zhou, Y.; Dong, B.; Wu, Y.; Zhu, W.; Chen, G.; Zhang, Y. Dichotomous image segmentation with frequency priors. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, Macao, China, 19–25 August 2023. IJCAI ’23. [Google Scholar] [CrossRef]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
- Zhang, T.; Li, L.; Zhou, Y.; Liu, W.; Qian, C.; Hwang, J.N.; Ji, X. Cas-vit: Convolutional additive self-attention vision transformers for efficient mobile applications. arXiv 2024, arXiv:2408.03703. [Google Scholar]
Method | Evaluation Metrics | |||||
---|---|---|---|---|---|---|
F3Net [42] | 0.281 | 0.779 | 0.598 | 0.681 | 0.639 | 0.572 |
GCPANet [43] | 0.194 | 0.807 | 0.626 | 0.729 | 0.662 | 0.627 |
PFNet [44] | 0.259 | 0.776 | 0.483 | 0.618 | 0.624 | 0.317 |
BSANet [45] | 0.295 | 0.759 | 0.535 | 0.661 | 0.607 | 0.437 |
IFA [46] | 0.321 | 0.726 | 0.503 | 0.636 | 0.585 | 0.420 |
PGNet [47] | 0.158 | 0.819 | 0.674 | 0.738 | 0.653 | 0.631 |
ISNet [48] | 0.264 | 0.784 | 0.612 | 0.706 | 0.651 | 0.604 |
UDUN [49] | 0.163 | 0.822 | 0.663 | 0.747 | 0.685 | 0.647 |
FP-DIS [50] | 0.163 | 0.832 | 0.653 | 0.741 | 0.668 | 0.628 |
MVANet [36] | 0.153 | 0.867 | 0.679 | 0.778 | 0.708 | 0.652 |
Ours | 0.131 | 0.886 | 0.699 | 0.796 | 0.743 | 0.668 |
Method | Evaluation Metrics | |||||
---|---|---|---|---|---|---|
F3Net [42] | 0.268 | 0.698 | 0.559 | 0.653 | 0.519 | 0.526 |
GCPANet [43] | 0.249 | 0.718 | 0.598 | 0.681 | 0.557 | 0.556 |
PFNet [44] | 0.303 | 0.769 | 0.497 | 0.643 | 0.503 | 0.465 |
BSANet [45] | 0.270 | 0.689 | 0.538 | 0.649 | 0.508 | 0.486 |
IFA [46] | 0.279 | 0.668 | 0.528 | 0.638 | 0.498 | 0.478 |
PGNet [47] | 0.237 | 0.721 | 0.628 | 0.698 | 0.565 | 0.574 |
ISNet [48] | 0.253 | 0.712 | 0.562 | 0.677 | 0.532 | 0.529 |
UDUN [49] | 0.201 | 0.749 | 0.653 | 0.712 | 0.581 | 0.607 |
FP-DIS [50] | 0.201 | 0.756 | 0.664 | 0.702 | 0.564 | 0.585 |
MVANet [36] | 0.187 | 0.763 | 0.676 | 0.754 | 0.619 | 0.647 |
Ours | 0.158 | 0.786 | 0.722 | 0.758 | 0.718 | 0.693 |
Method | FPS | UEC-FoodPix Complete | FoodSeg103 |
---|---|---|---|
Swin-Transformer [51] | 5.8 | 0.681 | 0.703 |
SAM-Encoder [27] | 4.9 | 0.610 | 0.629 |
CAS-ViT [52] | 8.6 | 0.532 | 0.496 |
STViT [41] | 6.3 | 0.668 | 0.693 |
Multi-View | MCLM | MCRM | HQ-Token | UEC-FoodPix Complete | |
---|---|---|---|---|---|
✓ | ✓ | ✓ | 0.179 | 0.594 | |
✓ | ✓ | 0.171 | 0.641 | ||
✓ | ✓ | ✓ | 0.173 | 0.633 | |
✓ | ✓ | ✓ | 0.169 | 0.652 | |
✓ | ✓ | ✓ | 0.163 | 0.667 | |
✓ | ✓ | ✓ | ✓ | 0.158 | 0.693 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Liu, C.; Sheng, G.; Min, W.; Wu, X.; Jiang, S. Multi-View Edge Attention Network for Fine-Grained Food Image Segmentation. Foods 2025, 14, 3016. https://doi.org/10.3390/foods14173016