Automated Food Weight and Content Estimation Using Computer Vision and AI Algorithms: Phase 2
Abstract
1. Introduction
2. Materials and Methods
2.1. RGB Setup
2.2. Monocular Depth Estimation Setup
1. Training with highly accurate synthetic datasets instead of noisy real-world annotations.
2. Scaling up the teacher network using a DINOv2-G backbone for better pseudo-label generation.
3. Refining student networks with massive collections of real-world unlabeled images via distillation.
2.3. Application of the Depth Map
- Isolating the relevant region of interest (ROI) based on the detected tray.
- Aligning the segmentation masks with the corresponding pixels in the depth map.
- Applying geometric operations to correct perspective-induced distortion, ensuring that all measurements are made relative to a consistent reference plane.
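The first two steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the depth map and the segmentation mask share the same pixel grid, and the names `crop`, `masked_depths`, and the toy values are hypothetical. The perspective correction step is not covered here.

```python
from typing import List, Tuple

# (x0, y0, x1, y1) in pixel coordinates, end-exclusive (assumed convention)
Box = Tuple[int, int, int, int]


def crop(grid: List[list], box: Box) -> List[list]:
    """Crop a row-major 2D grid to the given box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in grid[y0:y1]]


def masked_depths(depth: List[List[float]],
                  mask: List[List[bool]],
                  roi: Box) -> List[float]:
    """Return the depth values inside the ROI that the mask marks as food."""
    depth_roi = crop(depth, roi)
    mask_roi = crop(mask, roi)  # same box, so the two grids stay pixel-aligned
    return [d
            for d_row, m_row in zip(depth_roi, mask_roi)
            for d, m in zip(d_row, m_row)
            if m]


# Toy 4x4 example with hypothetical values
depth = [[0.0, 0.0, 0.0, 0.0],
         [0.0, 5.0, 6.0, 0.0],
         [0.0, 7.0, 8.0, 0.0],
         [0.0, 0.0, 0.0, 0.0]]
mask = [[False, False, False, False],
        [False, True, True, False],
        [False, True, False, False],
        [False, False, False, False]]
tray_roi = (1, 1, 3, 3)  # hypothetical tray bounding box

food_depths = masked_depths(depth, mask, tray_roi)
print(food_depths)
```

Cropping both grids with the same box keeps the mask and the depth values aligned pixel by pixel, which is the property the second step relies on.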
2.4. Food Detection and Cropping
1. Background removal: eliminates irrelevant visual elements that could negatively affect the performance of the monocular depth estimation model.
2. Context standardization: spatially centers the food region, improving consistency in the subsequent depth maps.
3. Segmentation guidance: provides bounding box prompts for later instance segmentation of individual food components.
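As a rough sketch of how a detected bounding box can serve both the cropping and the prompting roles listed above, the snippet below pads a box by a margin, clamps it to the image, and re-expresses a food box in the coordinates of the crop. The margin, the image size, the box values, and the helper names `expand_box` and `to_crop_coords` are all assumptions made for illustration.

```python
from typing import Tuple

# (x0, y0, x1, y1) in pixel coordinates, end-exclusive (assumed convention)
Box = Tuple[int, int, int, int]


def expand_box(box: Box, margin: int, width: int, height: int) -> Box:
    """Pad a box by `margin` pixels on each side, clamped to the image bounds."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - margin), max(0, y0 - margin),
            min(width, x1 + margin), min(height, y1 + margin))


def to_crop_coords(box: Box, crop_box: Box) -> Box:
    """Translate a box from full-image coordinates into crop coordinates."""
    cx0, cy0, _, _ = crop_box
    x0, y0, x1, y1 = box
    return (x0 - cx0, y0 - cy0, x1 - cx0, y1 - cy0)


# Hypothetical detections on a 640x480 frame
tray_box = (100, 50, 500, 400)
food_box = (150, 120, 300, 260)

crop_box = expand_box(tray_box, margin=20, width=640, height=480)
prompt_box = to_crop_coords(food_box, crop_box)
print(crop_box, prompt_box)
```

The translated `prompt_box` is what a box-prompted segmenter would receive if it runs on the cropped image rather than on the full frame.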
2.5. Segmentation Process
1. Enhanced Boundary Precision: The ability of SAM to capture fine-grained details allows for accurate segmentation of curved or irregular food borders, which is crucial for precise volume estimation.
2. Improved Generalization: The zero-shot capabilities of SAM enable it to perform well across various food types and presentations without the need for task-specific training.
3. Streamlined Workflow: Replacing the YOLO-based segmentation model used in our previous work [9] with SAM simplifies the segmentation process, reducing the need for extensive labeled datasets.
2.6. Base Plane Correction
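One common way to obtain a consistent reference plane is a least-squares plane fit through points known to lie on the base, for example pixels on the plate border. The sketch below illustrates that idea only; it is not confirmed to be the authors' procedure. It assumes a plane of the form z = a·x + b·y + c and that larger depth values mean higher above the base; `fit_plane` and `height_above_plane` are hypothetical names.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z): pixel position and depth value


def det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_plane(points: List[Point]) -> Tuple[float, float, float]:
    """Least-squares fit of z = a*x + b*y + c via the normal equations."""
    n = float(len(points))
    sx = sum(x for x, _, _ in points)
    sy = sum(y for _, y, _ in points)
    sz = sum(z for _, _, z in points)
    sxx = sum(x * x for x, _, _ in points)
    syy = sum(y * y for _, y, _ in points)
    sxy = sum(x * y for x, y, _ in points)
    sxz = sum(x * z for x, _, z in points)
    syz = sum(y * z for _, y, z in points)

    a_mat = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxz, syz, sz]
    d = det3(a_mat)
    if abs(d) < 1e-12:
        raise ValueError("points are degenerate (e.g. collinear); cannot fit a plane")

    # Cramer's rule: replace one column at a time with the right-hand side
    coeffs = []
    for col in range(3):
        m = [row[:] for row in a_mat]
        for r in range(3):
            m[r][col] = rhs[r]
        coeffs.append(det3(m) / d)
    return coeffs[0], coeffs[1], coeffs[2]


def height_above_plane(x: float, y: float, z: float,
                       plane: Tuple[float, float, float]) -> float:
    """Depth value measured relative to the fitted base plane."""
    a, b, c = plane
    return z - (a * x + b * y + c)


# Hypothetical border samples lying on the tilted plane z = 0.5x - 0.25y + 2
border = [(0, 0, 2.0), (4, 0, 4.0), (0, 4, 1.0), (4, 4, 3.0), (2, 1, 2.75)]
plane = fit_plane(border)
print(plane, height_above_plane(2, 2, 5.0, plane))
```

Subtracting the fitted plane removes the tilt, so heights taken at different image positions refer to the same base.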
2.7. Volume Estimation
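A simple way to turn a plane-corrected depth map into a volume figure is to sum the per-pixel heights inside the food mask. The sketch below assumes exactly that and nothing more: `estimate_volume` and `pixel_area` are hypothetical names, and with a relative (non-metric) depth map the result is in arbitrary units, matching the [a.u.] volumes reported in the results tables.

```python
from typing import List


def estimate_volume(height_map: List[List[float]],
                    mask: List[List[bool]],
                    pixel_area: float = 1.0) -> float:
    """Sum positive heights over masked pixels, scaled by the area of one pixel.

    Heights at or below zero are treated as base-plane noise and ignored.
    """
    total = 0.0
    for h_row, m_row in zip(height_map, mask):
        for h, m in zip(h_row, m_row):
            if m and h > 0:
                total += h
    return total * pixel_area


# Toy example with hypothetical heights above the base plane
heights = [[0.0, 1.0, 2.0],
           [3.0, -1.0, 4.0]]
food_mask = [[False, True, True],
             [True, True, False]]

volume_au = estimate_volume(heights, food_mask)
print(volume_au)
```

Converting such a value to grams would need a calibration against reference weights, which this sketch does not attempt.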
2.8. Runtime Performance and Computational Cost
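The per-stage times reported in the runtime table can be aggregated to see where the processing time goes. The snippet below uses the table's values as given (the "Other operations" entry is approximate in the source); the variable names are illustrative.

```python
# Stage times in seconds, copied from the runtime table
stage_seconds = {
    "Food segmentation (SAM)": 6.97,
    "Plate border segmentation (SAM)": 5.57,
    "Monocular depth estimation (Depth Anything V2)": 1.15,
    "Other operations": 0.05,  # reported as approximately 0.05 s
}

total_seconds = sum(stage_seconds.values())

# Combined share of the two SAM passes
sam_seconds = sum(t for name, t in stage_seconds.items() if "SAM" in name)
sam_share = 100.0 * sam_seconds / total_seconds

print(f"total: {total_seconds:.2f} s, SAM share: {sam_share:.1f} %")
```

The two SAM passes together account for roughly nine tenths of the total, so segmentation rather than depth estimation is the stage where a speed-up would matter most.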
3. Results
3.1. Robustness of Food Volume Estimation Under Rotational Variability
- Mean volume: average of all volume estimations for that rotation
- Standard deviation (SD): dispersion of frame-level volumes during that pass
- Coefficient of variation (CV): relative variability, defined as $\mathrm{CV} = \dfrac{\mathrm{SD}}{\text{mean volume}} \times 100\%$
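The three statistics above can be computed per rotation as in the following sketch. It assumes the sample standard deviation (n − 1 in the denominator); the source does not say whether the sample or the population form was used. The frame volumes are hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def rotation_stats(frame_volumes: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, SD, CV in percent) for the frame-level volumes of one rotation."""
    if len(frame_volumes) < 2:
        raise ValueError("at least two frames are needed to compute a standard deviation")
    mean = statistics.fmean(frame_volumes)
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    sd = statistics.stdev(frame_volumes)  # sample SD (assumption)
    return mean, sd, 100.0 * sd / mean


# Hypothetical frame-level volumes for one rotation, in arbitrary units
mean_v, sd_v, cv_v = rotation_stats([90.0, 100.0, 110.0])
print(mean_v, sd_v, cv_v)
```

Because the CV is normalized by the mean, it allows rotations with different mean volumes to be compared on the same scale.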
3.2. Visualization of Inter-Rotation Consistency
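Inter-rotation consistency can also be summarized numerically from the per-rotation means. The sketch below takes the eight mean volumes from the first rotation table and computes their spread; using the sample standard deviation and flagging the rotation farthest from the grand mean are illustrative choices, not the authors' stated analysis.

```python
import statistics

# Mean volume per rotation [a.u.], copied from the first rotation table
rotation_means = {
    1: 2_448_803,
    2: 2_827_374,
    3: 1_884_739,
    4: 2_480_449,
    5: 2_781_133,
    6: 2_561_689,
    7: 2_780_916,
    8: 2_934_666,
}

values = list(rotation_means.values())
grand_mean = statistics.fmean(values)

# Variability between rotations, as a percentage of the grand mean
inter_rotation_cv = 100.0 * statistics.stdev(values) / grand_mean

# Rotation whose mean deviates most from the grand mean
farthest = max(rotation_means, key=lambda r: abs(rotation_means[r] - grand_mean))

print(f"grand mean: {grand_mean:.0f} a.u., inter-rotation CV: {inter_rotation_cv:.1f} %, "
      f"farthest rotation: {farthest}")
```

A single rotation far from the others, as flagged here, inflates the inter-rotation CV and is worth inspecting frame by frame.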
3.3. Additional Generalization Test on a Different Food Item
3.4. Frame-Level Variability Analysis
4. Conclusions and Future Works
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Correction Statement
References
1. He, Y.; Xu, C.; Khanna, N.; Boushey, C.J.; Delp, E.J. Food Image Analysis: Segmentation, Identification and Weight Estimation. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), San Jose, CA, USA, 15–19 July 2013; IEEE: San Jose, CA, USA, 2013; pp. 1–6.
2. Ando, Y.; Ege, T.; Cho, J.; Yanai, K. DepthCalorieCam: A Mobile Application for Volume-Based Food Calorie Estimation Using Depth Cameras. In Proceedings of the 5th International Workshop on Multimedia Assisted Dietary Management (MADiMa’19), Nice, France, 21 October 2019; pp. 76–81.
3. Salim, N.; Ahmad, R.; Mubin, A.; Yusuf, H. Study for Food Recognition System Using Deep Learning. J. Phys. Conf. Ser. 2021, 1962, 012014.
4. Zunair, H.; Hamza, A.B. PEEKABOO: Hiding Parts of an Image for Unsupervised Object Localization. 2024. Available online: https://github.com/hasibzunair/peekaboo (accessed on 17 December 2025).
5. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
6. Yang, L.; Kang, B.; Huang, Z.; Zhao, Z.; Xu, X.; Feng, J.; Zhao, H. Depth Anything V2. Available online: https://depth-anything-v2.github.io (accessed on 5 May 2025).
7. Lan, X.; Lyu, J.; Jiang, H.; Dong, K.; Niu, Z.; Zhang, Y.; Xue, J. FoodSAM: Any Food Segmentation. arXiv 2023, arXiv:2308.05938.
8. Alahmari, S.S.; Gardner, M.; Salem, T. Segment Anything in Food Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–21 June 2024; pp. 3715–3720.
9. Gonzalez, B.; Garcia, G.; Velastin, S.A.; GholamHosseini, H.; Tejeda, L.; Farias, G. Automated Food Weight and Content Estimation Using Computer Vision and AI Algorithms. Sensors 2024, 24, 7660.
10. Ultralytics. YOLOv8 Documentation. Available online: https://docs.ultralytics.com/ (accessed on 22 May 2024).
11. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, N.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. arXiv 2023, arXiv:2304.02643.
12. Ranftl, R.; Bochkovskiy, A.; Koltun, V. Vision Transformers for Dense Prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 12179–12189.
13. Gonzalez, B.; Garcia, G.; Gecele, O.; Ramirez, J.; Velastin, S.A.; Farias, G. Preliminary Results on Food Weight Estimation with RGB-D Images. In Proceedings of the 14th International Conference on Pattern Recognition Systems (ICPRS), London, UK, 15–18 July 2024.
14. Hikvision. DS-2CD2786G2-IZS Product Datasheet. Available online: https://www.hikvision.com/es-la/products/IP-Products/Network-Cameras/Pro-Series-EasyIP-/ds-2cd2786g2-izs/ (accessed on 5 May 2025).
15. Ke, B.; Obukhov, A.; Huang, S.; Metzger, N.; Daudt, R.C.; Schindler, K. Marigold: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation. arXiv 2024, arXiv:2505.09358.
16. Fu, X.; Yin, W.; Hu, M.; Wang, K.; Ma, Y.; Tan, P.; Shen, S.; Lin, D.; Long, X. GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image. arXiv 2024, arXiv:2403.12013.
17. NVIDIA. GeForce RTX 2060 User Guide; NVIDIA Corporation: Santa Clara, CA, USA, 2019.







| Parameter | Value | Description |
|---|---|---|
| Sensor type | 1/1.8” Progressive Scan CMOS | High-quality imaging sensor |
| Max. resolution | 3840 × 2160 pixels | 8 MP resolution for detailed capture |
| Main stream resolutions | 2688 × 1520, 1920 × 1080, etc. | Frame rate: 30 FPS |
| Video compression | H.265/H.265+/H.264 | High-efficiency formats supported |
| Lens type | 2.8–12 mm motorized varifocal | Adjustable FOV from 108° to 46° (H) |
| Aperture | F1.4 | Low-light sensitivity enhancement |
| WDR | 120 dB | Wide dynamic range for backlight scenes |
| IR illumination | Up to 40 m | Night vision via IR LEDs |
| Protection | IP66/IK10 | Water, dust, and vandal resistance |

| Model | NYU AbsRel ↓ | KITTI AbsRel ↓ |
|---|---|---|
| Depth Anything V2 (ViT-L) [6] | 0.056 | 0.045 |
| DPT-Hybrid [12] | 0.110 | 0.062 |
| Marigold [15] | 0.055 | 0.099 |
| GeoWizard [16] | 0.052 | 0.097 |

| Stage | Time (s) | Percentage (%) |
|---|---|---|
| Food segmentation (SAM) | 6.97 | 50.7 |
| Plate border segmentation (SAM) | 5.57 | 40.6 |
| Monocular depth estimation (Depth Anything V2) | 1.15 | 8.4 |
| Other operations | ∼0.05 | 0.3 |
| Total (reported) | 13.74 | 100 |

| Rotation | Mean Volume [a.u.] | SD (%) |
|---|---|---|
| Rotation 1 | 2,448,803 | 12.5 |
| Rotation 2 | 2,827,374 | 10.6 |
| Rotation 3 | 1,884,739 | 13.7 |
| Rotation 4 | 2,480,449 | 9.5 |
| Rotation 5 | 2,781,133 | 10.5 |
| Rotation 6 | 2,561,689 | 9.3 |
| Rotation 7 | 2,780,916 | 7.6 |
| Rotation 8 | 2,934,666 | 7.2 |

| Rotation | Mean Volume [a.u.] | CV (%) |
|---|---|---|
| Rotation 1 | 7,244,201 | 11.5 |
| Rotation 2 | 6,364,047 | 10.7 |
| Rotation 3 | 6,761,096 | 6.4 |
| Rotation 4 | 7,586,902 | 8.3 |
| Rotation 5 | 6,524,396 | 8.3 |
| Rotation 6 | 7,026,012 | 8.9 |
| Rotation 7 | 7,029,018 | 9.2 |
| Rotation 8 | 6,866,476 | 8.2 |

| Participant | CV Across All Frames (%) |
|---|---|
| Person 1 (242 g) | 24.9 |
| Person 2 (257 g) | 15.2 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Gonzalez, B.; Garcia, G.; Velastin, S.A.; GholamHosseini, H.; Tejeda, L.; Ramirez, H.; Farias, G. Automated Food Weight and Content Estimation Using Computer Vision and AI Algorithms: Phase 2. Sensors 2026, 26, 76. https://doi.org/10.3390/s26010076

