IGSMNet: Ingredient-Guided Semantic Modeling Network for Food Nutrition Estimation
Abstract
1. Introduction
- We propose a novel Ingredient-Guided Semantic Modeling Network (IGSMNet) for food nutrition estimation, which jointly exploits RGB-D visual features and ingredient semantics to improve estimation accuracy.
- We develop an ingredient-guided fusion module that uses ingredient information to steer visual feature learning, enabling the network to focus on nutritionally relevant regions and enhancing its discriminative ability.
- We introduce an internal semantic modeling strategy composed of dynamic position encoding and fine-grained semantic modeling, which together strengthen contextual feature representation.
- Extensive experiments on the Nutrition5k dataset show that the proposed IGSMNet achieves promising results.
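The ingredient-guided fusion in the second contribution is realized most effectively as cross-attention according to the integration-strategy comparison in Section 4.4.1. The following NumPy sketch illustrates such a step, with visual tokens as queries and ingredient embeddings as keys/values; all shapes, projection weights, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ingredient_guided_fusion(visual, ingredients, d_k=64, seed=0):
    """Cross-attention sketch: visual tokens (queries) attend to
    ingredient embeddings (keys/values), so ingredient semantics can
    re-weight nutritionally relevant visual regions.

    visual:      (N, d) flattened RGB-D feature tokens
    ingredients: (M, d) ingredient word/phrase embeddings
    Returns:     (N, d_k) ingredient-guided visual features
    """
    rng = np.random.default_rng(seed)  # random projections stand in for learned weights
    d = visual.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(3))
    Q, K, V = visual @ Wq, ingredients @ Wk, ingredients @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (N, M): each token's weights over ingredients
    return attn @ V
```

In a trained network the three projections would be learned jointly with the backbone; here they are random stand-ins so the mechanics of the attention step can be run in isolation.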
2. Related Work
2.1. RGB Image-Based Methods
2.2. RGB-D Image-Based Methods
2.3. Ingredient-Guided Methods
3. Method
3.1. Overview
3.2. Ingredient-Guided Module
3.3. Internal Semantic Modeling
3.3.1. Dynamic Position Encoding
3.3.2. Fine-Grained Modeling
3.4. Training Objective
3.5. Evaluation Metrics
4. Experiments
4.1. Experimental Setup
4.2. Experimental Results and Analysis
4.2.1. Comparison with RGB Image-Based Methods
4.2.2. Comparison with RGB-D-Based Methods
4.3. Ablation Study
4.3.1. Effectiveness of the Ingredient-Guided Module
4.3.2. Effectiveness of the Fine-Grained Modeling Scheme
4.3.3. Effectiveness of the Dynamic Position Encoding
4.4. Further Analysis
4.4.1. Comparison of Ingredient-Guided Integration Strategies
4.4.2. Modeling Order of IG and ISM
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Kaushal, S.; Tammineni, D.K.; Rana, P.; Sharma, M.; Sridhar, K.; Chen, H.H. Computer vision and deep learning-based approaches for detection of food nutrients/nutrition: New insights and advances. Trends Food Sci. Technol. 2024, 146, 104408. [Google Scholar] [CrossRef]
- Jacobs, D.R.; Tapsell, L.C. Food, not nutrients, is the fundamental unit in nutrition. Nutr. Rev. 2007, 65, 439–450. [Google Scholar] [CrossRef]
- Gargano, D.; Appanna, R.; Santonicola, A.; De Bartolomeis, F.; Stellato, C.; Cianferoni, A.; Casolaro, V.; Iovino, P. Food allergy and intolerance: A narrative review on nutritional concerns. Nutrients 2021, 13, 1638. [Google Scholar] [CrossRef]
- Zhou, L.; Zhang, C.; Liu, F.; Qiu, Z.; He, Y. Application of deep learning in food: A review. Compr. Rev. Food Sci. Food Saf. 2019, 18, 1793–1811. [Google Scholar] [CrossRef]
- Subar, A.F.; Kirkpatrick, S.I.; Mittl, B.; Zimmerman, T.P.; Thompson, F.E.; Bingley, C.; Willis, G.; Islam, N.G.; Baranowski, T.; McNutt, S.; et al. The automated self-administered 24-h dietary recall (ASA24): A resource for researchers, clinicians and educators from the National Cancer Institute. J. Acad. Nutr. Diet. 2012, 112, 1134. [Google Scholar] [CrossRef]
- Bianco, R.; Coluccia, S.; Marinoni, M.; Falcon, A.; Fiori, F.; Serra, G.; Ferraroni, M.; Edefonti, V.; Parpinel, M. 2D Prediction of the Nutritional Composition of Dishes from Food Images: Deep Learning Algorithm Selection and Data Curation Beyond the Nutrition5k Project. Nutrients 2025, 17, 2196. [Google Scholar] [CrossRef] [PubMed]
- Yin, Y.; Qi, H.; Zhu, B.; Chen, J.; Jiang, Y.G.; Ngo, C.W. FoodLMM: A versatile food assistant using large multi-modal model. IEEE Trans. Multimed. 2025. [Google Scholar] [CrossRef]
- Ma, B.; Zhang, D.; Wu, X.J. Food nutrition estimation with RGB-D fusion module and bidirectional feature pyramid network. Multimed. Syst. 2025, 31, 1–11. [Google Scholar] [CrossRef]
- Saad, A.M.; Rahi, M.R.H.; Islam, M.M.; Rabbani, G. Diet engine: A real-time food nutrition assistant system for personalized dietary guidance. Food Chem. Adv. 2025, 7, 100978. [Google Scholar] [CrossRef]
- Feng, Z.; Xiong, H.; Min, W.; Hou, S.; Duan, H.; Liu, Z.; Jiang, S. Ingredient-Guided RGB-D Fusion Network for Nutritional Assessment. IEEE Trans. Agrifood Electron. 2025, 3, 156–166. [Google Scholar] [CrossRef]
- Shao, W.; Min, W.; Hou, S.; Luo, M.; Li, T.; Zheng, Y.; Jiang, S. Vision-based food nutrition estimation via RGB-D fusion network. Food Chem. 2023, 424, 136309. [Google Scholar] [CrossRef] [PubMed]
- Jovanovic, L.; Bacanin, N.; Petrovic, A.; Zivkovic, M.; Antonijevic, M.; Gajic, V.; Elsayed, M.M.; Abouhawwash, M. Exploring artificial intelligence potential in solar energy production forecasting: Methodology based on modified PSO optimized attention augmented recurrent networks. Sustain. Comput. Inform. Syst. 2025, 47, 101174. [Google Scholar] [CrossRef]
- Chen, J.J.; Ngo, C.W.; Chua, T.S. Cross-modal recipe retrieval with rich food attributes. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1771–1779. [Google Scholar]
- Ming, Z.Y.; Chen, J.; Cao, Y.; Forde, C.; Ngo, C.W.; Chua, T.S. Food photo recognition for dietary tracking: System and experiment. In Proceedings of the International Conference on Multimedia Modeling, Bangkok, Thailand, 5–7 February 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 129–141. [Google Scholar]
- Sosa-Holwerda, A.; Park, O.H.; Albracht-Schulte, K.; Niraula, S.; Thompson, L.; Oldewage-Theron, W. The role of artificial intelligence in nutrition research: A scoping review. Nutrients 2024, 16, 2066. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Haq, M.A. CNN based automated weed detection system using UAV imagery. Comput. Syst. Sci. Eng. 2022, 42, 2. [Google Scholar] [CrossRef]
- Bidyalakshmi, T.; Jyoti, B.; Mansuri, S.M.; Srivastava, A.; Mohapatra, D.; Kalnar, Y.B.; Narsaiah, K.; Indore, N. Application of artificial intelligence in food processing: Current status and future prospects. Food Eng. Rev. 2025, 17, 27–54. [Google Scholar] [CrossRef]
- Zhang, D.; Ma, B.; Wu, X.J. Adaptive Feature Fusion and Enhancement Network for Food Nutrition Estimation. IEEE Trans. Agrifood Electron. 2025. [Google Scholar] [CrossRef]
- Zhang, F.; Yin, J.; Wu, N.; Hu, X.; Sun, S.; Wang, Y. A dual-path model merging CNN and RNN with attention mechanism for crop classification. Eur. J. Agron. 2024, 159, 127273. [Google Scholar] [CrossRef]
- Chang, J.; Wang, H.; Su, W.; He, X.; Tan, M. Artificial intelligence in food bioactive peptides screening: Recent advances and future prospects. Trends Food Sci. Technol. 2025, 156, 104845. [Google Scholar] [CrossRef]
- Zhang, D.; Wu, X.J.; Yu, J. Label consistent flexible matrix factorization hashing for efficient cross-modal retrieval. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2021, 17, 1–18. [Google Scholar] [CrossRef]
- Zhang, D.; Wu, X.J.; Yu, J. Discrete bidirectional matrix factorization hashing for zero-shot cross-media retrieval. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Zhuhai, China, 29 October–1 November 2021; pp. 524–536. [Google Scholar]
- García-Infante, M.; Castro-Valdecantos, P.; Delgado-Pertinez, M.; Teixeira, A.; Guzmán, J.L.; Horcada, A. Effectiveness of machine learning algorithms as a tool to meat traceability system. A case study to classify Spanish Mediterranean lamb carcasses. Food Control 2024, 164, 110604. [Google Scholar] [CrossRef]
- Shao, W.; Hou, S.; Jia, W.; Zheng, Y. Rapid non-destructive analysis of food nutrient content using swin-nutrition. Foods 2022, 11, 3429. [Google Scholar] [CrossRef]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
- Jiao, P.; Wu, X.; Zhu, B.; Chen, J.; Ngo, C.W.; Jiang, Y. RoDE: Linear rectified mixture of diverse experts for food large multi-modal models. arXiv 2024, arXiv:2407.12730. [Google Scholar]
- Thames, Q.; Karpur, A.; Norris, W.; Xia, F.; Panait, L.; Weyand, T.; Sim, J. Nutrition5k: Towards automatic nutritional understanding of generic food. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8903–8911. [Google Scholar]
- Bi, H.; Wu, R.; Liu, Z.; Zhu, H.; Zhang, C.; Xiang, T.Z. Cross-modal hierarchical interaction network for RGB-D salient object detection. Pattern Recognit. 2023, 136, 109194. [Google Scholar] [CrossRef]
- NR, D.; GK, D.S.; Kumar Pareek, D.P. A Framework for Food recognition and predicting its Nutritional value through Convolution neural network. In Proceedings of the International Conference on Innovative Computing & Communication (ICICC), Delhi, India, 19–20 February 2022. [Google Scholar]
- Ege, T.; Yanai, K. Simultaneous estimation of dish locations and calories with multi-task learning. IEICE Trans. Inf. Syst. 2019, 102, 1240–1246. [Google Scholar] [CrossRef]
- Fang, S.; Shao, Z.; Kerr, D.A.; Boushey, C.J.; Zhu, F. An end-to-end image-based automatic food energy estimation technique based on learned energy distribution images: Protocol and methodology. Nutrients 2019, 11, 877. [Google Scholar] [CrossRef]
- Wang, B.; Bu, T.; Hu, Z.; Yang, L.; Zhao, Y.; Li, X. Coarse-to-fine nutrition prediction. IEEE Trans. Multimed. 2023, 26, 3651–3662. [Google Scholar] [CrossRef]
- Meyers, A.; Johnston, N.; Rathod, V.; Korattikara, A.; Gorban, A.; Silberman, N.; Guadarrama, S.; Papandreou, G.; Huang, J.; Murphy, K.P. Im2Calories: Towards an automated mobile vision food diary. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1233–1241. [Google Scholar]
- Han, Y.; Cheng, Q.; Wu, W.; Huang, Z. Dpf-nutrition: Food nutrition estimation via depth prediction and fusion. Foods 2023, 12, 4293. [Google Scholar] [CrossRef]
- Nian, F.; Hu, Y.; Gu, Y.; Wu, Z.; Yang, S.; Shu, J. Ingredient-guided multi-modal interaction and refinement network for RGB-D food nutrition assessment. Digit. Signal Process. 2024, 153, 104664. [Google Scholar] [CrossRef]
- Wang, W.; Chen, W.; Qiu, Q.; Chen, L.; Wu, B.; Lin, B.; He, X.; Liu, W. Crossformer++: A versatile vision transformer hinging on cross-scale attention. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 3123–3136. [Google Scholar] [CrossRef]
- Wang, W.; Guo, Z.; Jiang, W.; Lan, Y.; Ma, W. CrossHash: Cross-scale Vision Transformer Hashing for Image Retrieval. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
- Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar] [CrossRef]
- Shao, Z.; Vinod, G.; He, J.; Zhu, F. An end-to-end food portion estimation framework based on shape reconstruction from monocular image. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 942–947. [Google Scholar]
- Zhang, J.; Liu, H.; Yang, K.; Hu, X.; Liu, R.; Stiefelhagen, R. CMX: Cross-modal fusion for RGB-X semantic segmentation with transformers. IEEE Trans. Intell. Transp. Syst. 2023, 24, 14679–14694. [Google Scholar] [CrossRef]
- Zhang, C.; Cong, R.; Lin, Q.; Ma, L.; Li, F.; Zhao, Y.; Kwong, S. Cross-modality discrepant interaction network for RGB-D salient object detection. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, China, 20–24 October 2021; pp. 2094–2102. [Google Scholar]
- Zhou, W.; Pan, Y.; Lei, J.; Ye, L.; Yu, L. DEFNet: Dual-branch enhanced feature fusion network for RGB-T crowd counting. IEEE Trans. Intell. Transp. Syst. 2022, 23, 24540–24549. [Google Scholar] [CrossRef]
- Zhang, J.; Liu, R.; Shi, H.; Yang, K.; Reiß, S.; Peng, K.; Fu, H.; Wang, K.; Stiefelhagen, R. Delivering arbitrary-modal semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1136–1147. [Google Scholar]



All values are PMAE (%).

| Method Type | Methods | Calories | Mass | Fat | Carb | Protein | Mean |
|---|---|---|---|---|---|---|---|
| RGB images | Google-Nutrition-rgb [28] | 26.1 | 18.8 | 34.2 | 31.9 | 29.5 | 29.1 |
|  | Coarse-to-Fine Nutrition [33] | 24.1 | 19.4 | 36.0 | 32.1 | 33.5 | 29.0 |
|  | Swin-nutrition [25] | 16.2 | 13.7 | 24.9 | 21.8 | 25.4 | 20.4 |
|  | Portion-Nutrition [41] | 15.8 | - | - | - | - | - |
|  | RoDE [27] | 52.4 | 38.4 | 67.1 | 47.8 | 53.9 | 51.9 |
|  | DPF-Nutrition [35] | 14.7 | 10.6 | 22.6 | 20.7 | 20.2 | 17.8 |
| RGB-D images | CMX [42] | 21.8 | 20.7 | 34.8 | 37.0 | 33.2 | 29.5 |
|  | HINet [29] | 24.5 | 25.2 | 43.4 | 39.9 | 38.8 | 34.3 |
|  | CDINet [43] | 21.1 | 20.4 | 37.1 | 37.1 | 32.8 | 29.7 |
|  | DEFNet [44] | 32.7 | 34.2 | 48.9 | 40.3 | 43.8 | 39.9 |
|  | TriTransNet [26] | 22.1 | 20.1 | 37.5 | 34.8 | 38.0 | 30.5 |
|  | Deliver [45] | 29.5 | 25.9 | 48.3 | 47.7 | 46.1 | 39.5 |
|  | Google-Nutrition-rgbd [28] | 18.8 | 18.9 | 18.1 | 23.8 | 20.9 | 20.1 |
|  | IMIR-Net [36] | 14.7 | 11.4 | 23.3 | 20.9 | 21.6 | 18.4 |
|  | Feng et al. [10] | 13.7 | 9.8 | 19.2 | 19.3 | 17.6 | 15.9 |
|  | IGSMNet | 12.2 | 9.4 | 19.1 | 18.3 | 16.0 | 15.0 |
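The tables above report percentage mean absolute error (PMAE). As a reference for readers, here is a minimal sketch of one common definition used on Nutrition5k-style data (MAE normalized by the mean ground-truth value, in percent); the paper's exact normalization is not reproduced here and may differ.

```python
import numpy as np

def pmae(pred, target):
    """Percentage MAE: mean absolute error divided by the mean
    ground-truth value, reported in percent (one common definition;
    the paper's exact formula is assumed, not quoted)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return 100.0 * np.mean(np.abs(pred - target)) / np.mean(target)

# Example: predictions of 90 and 110 kcal against 100 kcal ground
# truth give a PMAE of 10.0.
```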
| Baseline | IG | FGM | DPE | Calories | Mass | Fat | Carb | Protein | Mean |
|---|---|---|---|---|---|---|---|---|---|
| ✓ |  |  |  | 13.7 | 9.4 | 22.6 | 19.4 | 19.6 | 16.9 |
| ✓ | ✓ |  |  | 13.3 | 10.2 | 19.5 | 18.5 | 16.6 | 15.6 |
| ✓ | ✓ | ✓ |  | 12.6 | 9.4 | 19.5 | 18.3 | 16.2 | 15.2 |
| ✓ | ✓ | ✓ | ✓ | 12.2 | 9.4 | 19.1 | 18.3 | 16.0 | 15.0 |
| IG Integration Strategy | Calories | Mass | Fat | Carb | Protein | Mean | 
|---|---|---|---|---|---|---|
| w/o IG | 13.7 | 9.4 | 22.6 | 19.4 | 19.6 | 16.9 | 
| Add | 13.5 | 9.7 | 21.8 | 19.2 | 18.5 | 16.5 | 
| MLP | 13.4 | 9.3 | 22.0 | 19.8 | 18.8 | 16.6 | 
| Cross-Attention | 12.2 | 9.4 | 19.1 | 18.3 | 16.0 | 15.0 | 
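The simpler "Add" and "MLP" baselines compared in this table can be sketched as follows; the pooling choice, layer sizes, and weight matrices are illustrative assumptions rather than the authors' code (the cross-attention variant is the one the table favors).

```python
import numpy as np

def fuse_add(visual, ingredients, W):
    """'Add' baseline sketch: pool ingredient embeddings, apply a
    linear projection, and add the result to every visual token."""
    return visual + ingredients.mean(axis=0) @ W           # (N, d)

def fuse_mlp(visual, ingredients, W1, W2):
    """'MLP' baseline sketch: concatenate the pooled ingredient vector
    with each visual token, then apply a two-layer ReLU MLP."""
    n = visual.shape[0]
    pooled = np.tile(ingredients.mean(axis=0), (n, 1))     # (N, d)
    h = np.maximum(np.concatenate([visual, pooled], axis=1) @ W1, 0.0)
    return h @ W2                                          # (N, d_out)
```

Both baselines inject the same pooled ingredient signal into every token, which is why neither can re-weight individual visual regions the way per-token cross-attention does.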
| Configuration | Calories | Mass | Fat | Carb | Protein | Mean | 
|---|---|---|---|---|---|---|
| w/o IG | 13.7 | 9.4 | 22.6 | 19.4 | 19.6 | 16.9 | 
| IG only | 13.3 | 10.2 | 19.5 | 18.5 | 16.6 | 15.6 | 
| ISM → IG | 12.9 | 9.7 | 18.3 | 19.1 | 15.5 | 15.1 | 
| IG → ISM | 12.2 | 9.4 | 19.1 | 18.3 | 16.0 | 15.0 | 
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zhang, D.; Shi, W.; Ma, B.; Min, W.; Wu, X.-J. IGSMNet: Ingredient-Guided Semantic Modeling Network for Food Nutrition Estimation. Foods 2025, 14, 3697. https://doi.org/10.3390/foods14213697