An Improved Diffusion Model for Generating Images of a Single Category of Food on a Small Dataset
Abstract
1. Introduction
1. Data-Efficient Framework: We propose a novel Ingredient-Aware Diffusion Model that integrates linear interpolation into MaskDiT. This approach stabilizes training dynamics on small-scale datasets, enabling high-fidelity generation from limited samples (see the interpolation sketch after this list).
2. Adaptive Semantic Modeling: We design plug-and-play LIE and CA modules that explicitly align visual features with ingredient semantics. These provide the flexibility to handle heterogeneous data, switching adaptively between recipe-driven and label-driven conditioning (see the conditioning sketch after this list).
3. Practical Utility Validation: We demonstrate the value for food informatics by verifying effectiveness in data augmentation: using our synthetic images improved downstream food classification accuracy, offering a viable remedy for data scarcity in smart food systems.
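As a concrete reading of contribution 1, the sketch below shows what a linear-interpolation training target can look like under the rectified-flow/flow-matching formulation [26,39]: latents are moved along the straight path x_t = (1 - t) * x_0 + t * eps, and the network regresses the constant velocity eps - x_0. This is a minimal sketch under those assumptions; `model` is a hypothetical stand-in for a MaskDiT-style backbone, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

def linear_interpolation_pair(x0: torch.Tensor, t: torch.Tensor):
    """Interpolate data latents x0 toward Gaussian noise along a straight path.

    Rectified-flow style target [26,39] (assumed formulation):
        x_t = (1 - t) * x0 + t * eps,  velocity target v = eps - x0.
    x0: (B, C, H, W) latents; t: (B,) timesteps in [0, 1].
    """
    eps = torch.randn_like(x0)
    t_ = t.view(-1, 1, 1, 1)
    x_t = (1.0 - t_) * x0 + t_ * eps
    return x_t, eps - x0

def training_loss(model, x0, cond):
    """One training step: the backbone predicts the velocity field."""
    t = torch.rand(x0.shape[0], device=x0.device)
    x_t, v_target = linear_interpolation_pair(x0, t)
    v_pred = model(x_t, t, cond)  # hypothetical MaskDiT-style backbone call
    return F.mse_loss(v_pred, v_target)
```

The straight-line path is what makes this interpolant attractive on small datasets: the regression target has constant magnitude across timesteps, which tends to stabilize training.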
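Contribution 2's recipe-driven/label-driven switching can be pictured as a conditioning front end that falls back to a class-label embedding when no ingredient text is available. The names below (`AdaptiveConditioner`, the dimensions, and the T5-style text features [31]) are illustrative assumptions, not the authors' LIE/CA internals:

```python
from typing import Optional

import torch
import torch.nn as nn

class AdaptiveConditioner(nn.Module):
    """Sketch of switching between recipe-driven and label-driven modes."""

    def __init__(self, num_classes: int, text_dim: int, cond_dim: int):
        super().__init__()
        self.label_embed = nn.Embedding(num_classes, cond_dim)
        self.text_proj = nn.Linear(text_dim, cond_dim)  # projects ingredient text features

    def forward(self, labels: torch.Tensor,
                ingredient_emb: Optional[torch.Tensor] = None) -> torch.Tensor:
        if ingredient_emb is not None:        # recipe-driven mode
            return self.text_proj(ingredient_emb)
        return self.label_embed(labels)       # label-driven fallback
```

A batch that carries ingredient embeddings uses the recipe-driven branch; label-only batches fall back to the class embedding, so heterogeneous data can share one model.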
2. Method
2.1. Input and Timestep Construction
2.2. Label and Ingredients Encoding
2.3. Network Architecture
2.4. Loss Function
3. Experiments and Discussion
3.1. Experimental Setup
3.2. Quantitative Evaluation
3.3. Validation of Recipe-Consistent Synthesis
1. Ingredient Removal: In the first column, removing “Crushed hot and dry chili” from the text prompt yields a dish in which the chili texture is absent while the glossy, gelatinous texture of the braised pork is preserved. This indicates that the model successfully disentangles specific ingredient features from the global dish appearance (see the prompt-editing sketch after this list).
2. Ingredient Addition: In the second column, adding “Spiced corned egg” generates a geometrically consistent egg that integrates naturally with the pork chunks.
3. Complex Modification: The third column shows simultaneous ingredient removal, demonstrating the model’s robustness in handling complex recipe alterations.
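All three manipulations amount to editing the ingredient list in the conditioning prompt while holding the sampling seed fixed, so that any visual change is prompt-driven. A minimal sketch under that assumption; the `sampler` wrapper and its `encode_ingredients`/`sample` methods are hypothetical names, and the ingredient list is illustrative:

```python
import torch

# Illustrative base recipe for the braised pork example.
base = ["pork belly", "soy sauce", "rock sugar", "crushed hot and dry chili"]

# Column 1 (removal): drop the chili from the prompt.
removed = [x for x in base if x != "crushed hot and dry chili"]
# Column 2 (addition): append the egg to the prompt.
added = base + ["spiced corned egg"]

def edit_and_sample(sampler, ingredients, seed=0):
    """Fix the noise seed so differences between outputs reflect only the
    edited ingredient prompt, then decode an image."""
    torch.manual_seed(seed)
    cond = sampler.encode_ingredients(", ".join(ingredients))
    return sampler.sample(cond)
```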
3.4. Impact of Using Generated Images on Food Classification
1. Training Set (Few-Shot Baseline): A limited set of 200 real images (100 per class).
2. Augmented Training Sets: The baseline set augmented with 400 synthetic images (200 per class) generated by DiT, Finetuned Latent Diffusion, and our proposed method, respectively (see the data-loading sketch after this list).
3. Test Set: A large-scale held-out set of 1600 real images (800 per class) to ensure statistical reliability.
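A minimal sketch of this augmentation protocol, as referenced in item 2: pool the few-shot real set with one model's synthetic set and train a standard classifier. The directory paths, the two-class setup, and the ResNet-18 backbone [38] are illustrative assumptions, not the paper's exact training recipe.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, models, transforms

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

# 200 real few-shot images (100 per class) + 400 synthetic (200 per class);
# folder paths are hypothetical placeholders.
real_train = datasets.ImageFolder("data/real_train", transform=tfm)
synthetic = datasets.ImageFolder("data/synthetic_ours", transform=tfm)
loader = DataLoader(ConcatDataset([real_train, synthetic]),
                    batch_size=32, shuffle=True)

model = models.resnet18(num_classes=2)        # ResNet-style classifier [38]
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

model.train()
for images, labels in loader:                 # one epoch shown for brevity
    opt.zero_grad()
    loss_fn(model(images), labels).backward()
    opt.step()
```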
3.5. Ablation Studies
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Panayotova, G.G. Artificial Intelligence in Nutrition and Dietetics: A Comprehensive Review of Current Research. Healthcare 2025, 13, 2579.
- Jung, H. Creating a Smartphone Application for Image-Assisted Dietary Assessment Among Older Adults with Type 2 Diabetes. Ph.D. Thesis, University of Washington, Seattle, WA, USA, 2017.
- Guo, Z. Applications of Artificial Intelligence in Food Industry. Foods 2025, 14, 1241.
- Chaturvedi, A.; Waghade, C.; Mehta, S.; Ghugare, S.; Dandekar, A. Food Recognition and Nutrition Estimation Using Deep Learning. Int. J. Res. Eng. Sci. Manag. 2020, 3, 506–510.
- Ma, P.; Lau, C.P.; Yu, N.; Li, A.; Liu, P.; Wang, Q.; Sheng, J. Image-based nutrient estimation for Chinese dishes using deep learning. Food Res. Int. 2021, 147, 110437.
- Min, W.; Jiang, S.; Liu, L.; Rui, Y.; Jain, R. A Survey on Food Computing. ACM Comput. Surv. 2019, 52, 1–36.
- Han, F.; Hao, G.; Guerrero, R.; Pavlovic, V. MPG: A multi-ingredient pizza image generator with conditional StyleGANs. arXiv 2020, arXiv:2012.02821.
- Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 8110–8119.
- Han, F.; Guerrero, R.; Pavlovic, V. CookGAN: Meal image synthesis from ingredients. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1450–1458.
- Fu, W.; Han, Y.; He, J.; Baireddy, S.; Gupta, M.; Zhu, F. Conditional synthetic food image generation. arXiv 2023, arXiv:2303.09005.
- Zhu, T.; Chen, J.; Zhu, R.; Gupta, G. StyleGAN3: Generative networks for improving the equivariance of translation and rotation. arXiv 2023, arXiv:2307.03898.
- Brock, A.; Donahue, J.; Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv 2018, arXiv:1809.11096.
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
- Karras, T.; Aittala, M.; Aila, T.; Laine, S. Elucidating the design space of diffusion-based generative models. Adv. Neural Inf. Process. Syst. 2022, 35, 26565–26577.
- Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8162–8171.
- Hyvärinen, A.; Dayan, P. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 2005, 6, 685–709.
- Song, Y.; Ermon, S. Generative modeling by estimating gradients of the data distribution. Adv. Neural Inf. Process. Syst. 2019, 32.
- Dosovitskiy, A. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Peebles, W.; Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 4195–4205.
- Zheng, H.; Nie, W.; Vahdat, A.; Anandkumar, A. Fast training of diffusion models with masked transformers. arXiv 2023, arXiv:2306.09305.
- Gao, S.; Zhou, P.; Cheng, M.M.; Yan, S. MDTv2: Masked diffusion transformer is a strong image synthesizer. arXiv 2023, arXiv:2303.14389.
- He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009.
- Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30.
- Markham, O.; Chen, Y.; Tai, C.e.A.; Wong, A. FoodFusion: A latent diffusion model for realistic food image generation. arXiv 2023, arXiv:2312.03540.
- Han, Y.; He, J.; Gupta, M.; Delp, E.J.; Zhu, F. Diffusion Model with Clustering-based Conditioning for Food Image Generation. In Proceedings of the 8th International Workshop on Multimedia Assisted Dietary Management, Ottawa, ON, Canada, 29 October 2023.
- Liu, X.; Gong, C.; Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv 2022, arXiv:2209.03003.
- Bossard, L.; Guillaumin, M.; Van Gool, L. Food-101 – Mining discriminative components with random forests. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 446–461.
- Chen, J.; Ngo, C.W. Deep-based ingredient recognition for cooking recipe retrieval. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 32–41.
- Ma, N.; Goldstein, M.; Albergo, M.S.; Boffi, N.M.; Vanden-Eijnden, E.; Xie, S. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 23–40.
- Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv 2013, arXiv:1312.6114.
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Gao, P.; Han, J.; Zhang, R.; Lin, Z.; Geng, S.; Zhou, A.; Zhang, W.; Lu, P.; He, C.; Yue, X.; et al. LLaMA-Adapter V2: Parameter-efficient visual instruction model. arXiv 2023, arXiv:2304.15010.
- Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 3836–3847.
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695.
- Xu, M.; Wang, J.; Tao, M.; Bao, B.K.; Xu, C. CookGALIP: Recipe controllable generative adversarial CLIPs with sequential ingredient prompts for food image generation. IEEE Trans. Multimed. 2024, 27, 2772–2782.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Lipman, Y.; Chen, R.T.; Ben-Hamu, H.; Nickel, M.; Le, M. Flow matching for generative modeling. arXiv 2022, arXiv:2210.02747.
- Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 1–48.
- Frid-Adar, M.; Diamant, I.; Klang, E.; Amitai, M.; Goldberger, J.; Greenspan, H. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing 2018, 321, 321–331.
(a) Results on Food101 Dataset

| Method | Hamburger FID↓ | Chicken Wings FID↓ |
|---|---|---|
| DiT [19] | 23.82 | 25.80 |
| Finetuned Latent Diffusion [36] | 35.27 | 22.36 |
| Ours | 10.93 | 19.81 |

(b) Results on Food172 Dataset

| Method | Braised Pork FID↓ | Sweet Mung Bean Soup FID↓ |
|---|---|---|
| CookGALIP [37] | 37.15 | 44.98 |
| Finetuned Latent Diffusion [36] | 29.58 | 35.70 |
| Ours | 17.26 | 24.14 |
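For reference, the FID scores above compare Inception-feature statistics of real and generated image sets [23]. The sketch below uses torchmetrics as one possible implementation (an assumption; the paper does not name its FID toolchain, and the `torch-fidelity` package must be installed for this metric):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # 2048-d Inception-v3 pool features

# Placeholder uint8 batches; in practice, load the real and generated images.
real = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)

fid.update(real, real=True)
fid.update(fake, real=False)
print(f"FID: {fid.compute().item():.2f}")     # lower is better
```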
| Method | Specificity (%) | Accuracy (%) |
|---|---|---|
| Real (Baseline) | 95.76 (±1.11) | 95.65 (±1.28) |
| DiT | 95.31 (±0.90) | 95.09 (±1.08) |
| Finetuned LDM | 95.53 (±1.80) | 95.39 (±1.04) |
| Ours | 96.28 (±0.45) | 96.20 (±0.48) |
| Method (Training Steps) | FID↓ |
|---|---|
| MaskDiT [20] (100K steps) | 62.45 |
| Ours (100K steps) | 10.93 |
| Architecture | FID↓ |
|---|---|
| Baseline | 22.16 |
| Baseline + LIE | 17.63 |
| Baseline + LIE + CA | 17.26 |
| LIE (without W) | 23.82 |