Article

NutritionVerse3D2D: Large 3D Object and 2D Image Food Dataset for Dietary Intake Estimation

1 Department of Systems Design Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
2 National Research Council Canada, Ottawa, ON K1A 0R6, Canada
* Author to whom correspondence should be addressed.
Data 2025, 10(11), 180; https://doi.org/10.3390/data10110180
Submission received: 18 July 2025 / Revised: 26 October 2025 / Accepted: 27 October 2025 / Published: 4 November 2025

Abstract

Elderly populations often face significant challenges with dietary intake tracking, frequently exacerbated by health complications. Unfortunately, conventional diet assessment techniques such as food frequency questionnaires, food diaries, and 24 h recall are subject to substantial bias. Recent advancements in machine learning and computer vision show promise for automated nutrition tracking from food images, but such methods require a large, high-quality dataset to accurately identify the nutrients from the food on the plate. However, manual creation of large-scale datasets with such diversity is time-consuming and hard to scale. Synthesized 3D food models, on the other hand, enable view augmentation to generate countless photorealistic 2D renderings from any viewpoint, reducing imbalance across camera angles. In this paper, we present a process to collect a large image dataset of food scenes spanning diverse viewpoints and highlight its usage in dietary intake estimation. We first collect quality 3D objects of food items (NV-3D), use them to generate photorealistic synthetic 2D food images (NV-Synth), and then manually collect a validation 2D food image dataset (NV-Real). We benchmark various intake estimation approaches on these datasets and present NutritionVerse3D2D, a collection of datasets containing 3D objects and 2D images, along with models that estimate intake from the 2D food images. We release all the datasets along with the developed models to accelerate machine learning research on dietary sensing.

1. Introduction

The desire to age in place has grown immensely in the past decade, with 77% of adults over 50 wanting to stay at home in 2021 [1]. However, one of the main challenges with aging in place is ensuring adequate food and nutritional intake. Elderly populations often face significant challenges with dietary intake tracking, frequently exacerbated by health complications [2]. Accurate dietary intake estimation is critical for informing policies and programs to support healthy eating, as malnutrition has been directly linked to decreased quality of life [3]. It has been reported that one in four adults aged 65 years or older is malnourished [4]. Unfortunately, conventional diet assessment techniques such as food frequency questionnaires, food diaries, and 24 h recall [5] are subject to substantial bias [6,7,8]. Additionally, the task of documenting daily food intake can be burdensome, requiring individuals to search for individual food items in databases and track overall nutrient consumption. Previous studies [6,7] have shown that self-tracking methods suffer from significant bias.
Emerging alternative approaches for diet assessment, including mobile applications [9,10], digital photography [11], and personal assistants [12], incur high time costs and may necessitate trained personnel. Fortunately, recent promising methods combine these alternative approaches with computer vision and machine learning algorithms to automatically estimate nutritional information from food images [13,14]. Consequently, there is a growing need to automate dietary intake tracking and alleviate the burden of manual alternatives. Leveraging smartphone cameras for dietary intake estimation presents a promising solution, with deep neural networks offering a potential avenue for automation.
For such dietary intake estimation systems to be effective, diverse high-quality training data capturing multiple angles and modalities are required. However, manual creation of large-scale datasets with such diversity is time-consuming and hard to scale. On the other hand, synthesized 3D food models enable view augmentation to generate countless photorealistic 2D renderings from any viewpoint, reducing imbalance across camera angles. Leveraging 3D assets facilitates the creation of rich multi-modal datasets (e.g., RGB, depth) with photorealistic images, perfect annotations, and dietary metadata through algorithmic scene composition. Compared to existing datasets that are focused solely on quantity, our contributions also address the gap in the quality of the data by procedurally generating scenes that span a huge diversity of food items, placements, and camera angles.
In this paper, we present a process to collect a large image dataset of food scenes that span diverse viewpoints and highlight its usage in dietary intake estimation. First, we develop a methodology for collecting quality 3D objects of food items with a particular focus on speed and consistency, and introduce NutritionVerse-3D (NV-3D), a large-scale, high-quality, high-resolution dataset of 105 3D food models, in conjunction with their associated weight, food name, and nutritional value. We leverage these high-quality photorealistic 3D food models and introduce NutritionVerse-Synth (NV-Synth), a dataset of 84,984 high-resolution 2D food images algorithmically rendered from 7081 unique scenes, along with associated diet information derived from the 3D models. To evaluate realism, we also collect the NutritionVerse-Real (NV-Real) dataset of 889 manually captured images across 251 distinct dishes. We benchmark various intake estimation approaches on these datasets and present NutritionVerse3D2D, a collection of datasets that contain 3D objects and 2D images along with models that estimate intake from the 2D food images. We release all the datasets along with the developed models at https://bit.ly/genai4good (accessed on 29 October 2025) to accelerate machine learning research on dietary sensing.
Notably, this paper builds upon previous work that empirically studied various dietary intake estimation approaches [15], but details information about the 3D object collection, expands on the methodology for view synthesis, and conducts an analysis of leveraging different deep-learning model architectures. This paper presents several contributions as follows:
  • Collection of quality 3D objects for food items, in conjunction with their associated weight, food name, and nutritional value (NV-3D).
  • Methodology for view synthesis using NV-3D to create an associated 2D food image dataset, namely NutritionVerse-Synth (NV-Synth).
  • Introduction of a 2D food image validation dataset, NutritionVerse-Real (NV-Real), enriched with both diet information and segmentation masks.
  • Exploration of the benefits of incorporating depth information in food estimation tasks, accompanied by comprehensive experimental results.
  • Valuable insights into the synergistic utilization of synthetic and real data to enhance the accuracy of diet estimation methods.
  • Analysis of leveraging different deep-learning model architectures to improve dietary intake estimation.

2. Related Work

2.1. Food Datasets

Existing literature [14,16,17,18,19] collects images of real scenes to train models that achieve high accuracy. However, these techniques operate on fixed modalities and viewpoints, hindering systematic comparison due to data limitations. For example, the model in Ref. [16] is trained and evaluated only on RGB images of the top view of a food scene. Furthermore, current food recognition and intake estimation methods face several key limitations: restricted output variables (e.g., only calories or mass), a lack of diverse viewpoints or incomplete food annotations in datasets, and biases from predefined camera angles during data capture.
The lack of a comprehensive, high-quality image dataset hinders the accuracy and realism of systems based on machine learning and computer vision. Recent advancements in machine learning and computer vision show promise for automated nutrition tracking from food images [13,14], but such methods require a large, high-quality dataset to accurately identify the nutrients from the food on the plate. Unfortunately, existing datasets such as FoodSeg [20] and Nutrition5k [19] do not have per-item nutrition details. Existing datasets also comprise 2D images with fixed or randomly selected camera views that are discretely sampled [13,21,22,23,24,25]. These fixed views introduce bias and are not representative of the angles and photo quality produced by older adults in realistic scenarios, which affects model training and accuracy.
Recently released quality food image datasets such as the Central Asian Food Scenes Dataset (CAFSD) [26], Central Asian Food Dataset [27], ISIA Food-500 [28], UEC Food 100 [22], FoodX-251 [21], FoodNExTDB [29], and Food2K [23] contain a significant number of food images with diverse food items. Unfortunately, the dietary information linked to these 2D images is not made available, posing a challenge in utilizing these datasets to estimate energy, macronutrient, and micronutrient intake. In addition, as noted above, these datasets comprise 2D images with fixed or randomly selected camera views that are discretely sampled [13,21,22,23,24,25], introducing bias in terms of how individuals take images with their camera and thereby affecting model training and accuracy. Recipe-related datasets, like Recipe1M [30,31], are extensively utilized in food recognition and recipe generation studies. However, these datasets lack crucial components, such as food segmentation and ingredient labels, which makes the task of estimating nutritional information very difficult. Chen and Ngo investigated a deep learning-based ingredient recognition system for cooking recipe retrieval and coupled the problem of food categorization with ingredient recognition by simultaneously training the model with both tasks [32]. Notably, their model does not examine the accuracy of food intake estimation, and the images in their dataset only had an average of three recognizable ingredients, which is unrealistic for real-world scenarios [32].
Though the generation of 2D images from 3D object assets is a promising way to mitigate the laborious and time-consuming task of creating a 2D food image dataset, there is a lack of realistic 3D food object assets. Three-dimensional object asset datasets such as ShapeNet [33] and OmniObject3D [34] have recently risen in popularity. However, these datasets contain limited food object assets, especially ones that look more realistic than synthetic, and do not include the associated mass or nutritional facts for the food objects. On the other hand, NV-3D provides the associated weight and nutritional value for each 3D food asset, allowing the generated 2D images to also carry the corresponding nutritional values along with other annotation data that can be obtained from the 3D food asset. Table 1 provides a general overview of existing dietary intake estimation datasets and methods. As seen, the NV-Synth and NV-Real datasets are the only ones that are publicly available and have both annotation data (e.g., segmentation masks) and dietary information.

2.2. Dietary Intake Estimation Methods

Bolaños and Radeva [36] proposed a method using a modified GoogLeNet architecture to simultaneously recognize and localize foods in images, but did not estimate food volume or dietary information. DepthCalorieCam [14] utilized visual-inertial odometry on a smartphone to estimate food volume and derive caloric content by multiplying the calorie density of the food’s category with the estimated size of the food. However, their contribution was demonstrated on only three food types and does not aim to predict other nutritional information. Menu-Match [16] provides an automated computer vision system for food logging and tracking of calories, but focuses only on the restaurant scenario and includes only 646 images in its dataset. Comparable studies [17,35,37] focus on recognizing meal contents and estimating calories from individual meals. However, the methodologies in [17] are also primarily tailored to restaurant scenarios, and there is limited testing conducted in settings outside of restaurants. On the other hand, the dataset and methodologies in [35] are not publicly available and are limited to only 8 food categories. Furthermore, all these works [14,16,17,35] are constrained to calories and do not predict other dietary components. In [37], the system estimates macronutrients in addition to calories, such as carbohydrates, fat, and proteins, but only focuses on food segmentation and volume estimation to predict macronutrients from an image of food.
The computer vision-based calorie estimation proposed in [18] leverages depth information to predict nutritional information, but also only explores calculating caloric content and is limited to individual food items rather than full meals. Methods such as SimFoodLoc [36] and NutriNet [38] focus on classifying food items present in a meal, but do not predict nutritional information from these images. Geolocation-based approaches, such as Menu-Match [16] and Im2Calories [17], utilize location and image inputs to match food to menu items in nearby restaurants. However, their applicability is limited to restaurant food and relies on information about a user’s location. Additionally, the accuracy of nutritional content determination is not detailed.
Nutrition5k [19] presents a promising development in image-based recognition systems. However, a major limitation of the dataset is that the models are trained on images captured from only four specific viewpoints [19]. This narrow range of viewpoints does not accurately reflect the diverse angles from which individuals typically capture meal images, limiting the model’s ability to generalize to various real-life scenarios. Liang and Li [18] also present a promising computer vision-based food calorie estimation dataset and method, but their dataset is limited to only calories and includes only 2978 images [18]. Furthermore, they require that images be taken with a specific calibration reference to ensure accurate calorie estimation, which is infeasible for real-world usage [18].

3. Materials and Methods

3.1. Data Collection of 3D Objects

The two primary factors considered in the design of the data collection pipeline are speed and consistency. Speed is important to maximize the number of food models that can be collected in a feasible amount of time for a large-scale dataset. Likewise, consistency is critical to minimize human interaction and variation during data collection, thereby maximizing the number of high-quality food models obtained.
Though it is now feasible to use automated wearable cameras, these devices have been found to be highly intrusive [39] and pose significant ethical ramifications [40]. Given that the main goal is convenient nutritional intake tracking for older individuals, recent advances in mobile phone applications [9,10] demonstrate that nutritional intake tracking through mobile devices would be more convenient and better accepted by older individuals. Subsequently, mobile devices were chosen for collecting images, and specifically, the iPhone [41] was chosen as the primary image capturing device due to its popularity and quality camera resolution (though any phone with a suitable camera could be used as well).
To generate a quality 3D model of a food item, various 3D scanner applications were compared based on their review rating, exporting capabilities, and ease of use. In addition to having a high review rating and a variety of model export formats, Polycam [42] also has a web interface with a shareable account for easy input of images captured from multiple devices [43]. Hence, leveraging the Polycam app, 3D models of food items are generated from 2D images taken by the iPhone. However, using the Polycam app imposes three main restrictions. First, at least 70% overlap between the photos is needed to produce quality 3D food models without holes or blurs. Second, a variety of angles of the food needs to be captured to render a full model. Third, a maximum limit of 250 images is allowed for each food item.
To address the first restriction, an electric turntable with the default rotation speed of a full rotation in 24 s and a custom image-taking script implemented using the built-in Shortcuts iPhone application is used to automatically collect consistent images of each food item in a short period of time whilst allowing for at least 70% overlap between the photos.
However, to meet the second limitation, a variety of angles need to be obtained for each food item. To ensure consistency between item captures, the camera angles and food 6D-poses collected for each food item should be the same. In experimenting with the number of camera angles, faster and more consistent performance is obtained using two iPhones set at two different angles compared to only one iPhone. Unfortunately, using two iPhones causes shadow interference in the image captures for each iPhone due to the lighting conditions in the room. In particular, the room has sparse fluorescent ceiling lights that are about 1 m apart from each other. Therefore, we experimented with a variety of tripod layouts to discern the setup with the least amount of shadows on the turntable. As seen in Figure 1, the setup for the data collection process has two iPhones on adjacent tripods with very specific tripod distances for each iPhone and low shadow interference on the exemplar sushi piece on the turntable.
In terms of the third main Polycam limitation, coordination between the number of photos taken and the combinations of food item 6D-pose and camera angle had to be determined. With a limit of 250 photos, the ideal scenario for data collection is to position the food in four different ways with two different camera angles. As such, the photo limit and the number of combinations lead to roughly 30 photos per food 6D pose–camera angle combination for a total of 240 images. Hence, as seen in the custom Shortcuts app in Figure 2, the iPhones are configured to automatically take 30 consecutive photos. After taking 60 photos of the food item in one 6D-pose (30 photos per iPhone-tripod), the food item is rotated to another 6D-pose and the custom Shortcuts app is started again. Occasionally, due to the shape of some food items, capturing four different food 6D-poses is infeasible. For example, the egg and cheese bite could not stand on its side without rolling when the turntable rotated. Thus, to ensure consistency between image captures, the number of camera angles is increased to compensate for the lower number of possible food 6D-poses, as seen in Table 2.
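To make the photo budget concrete, the short sketch below works out the per-combination allocation under the 250-image Polycam cap and the turntable timing; the helper function and the assumption that each 30-photo burst spans roughly one turntable rotation are ours, for illustration only.

```python
# Illustrative sketch of the capture budget under Polycam's 250-image cap.
# The pose/angle counts follow the setup described above; the helper name and
# the assumption that one 30-photo burst spans about one rotation are ours.

PHOTO_CAP = 250          # maximum images Polycam accepts per food item
TURNTABLE_PERIOD_S = 24  # seconds for one full turntable rotation

def capture_plan(num_poses: int, num_angles: int, photos_per_burst: int = 30):
    """Return total photos, angular spacing, and shot interval per burst."""
    total = num_poses * num_angles * photos_per_burst
    if total > PHOTO_CAP:
        raise ValueError(f"{total} photos exceeds the {PHOTO_CAP}-image cap")
    angular_spacing_deg = 360 / photos_per_burst      # ~12 degrees between shots
    shot_interval_s = TURNTABLE_PERIOD_S / photos_per_burst
    return total, angular_spacing_deg, shot_interval_s

# Standard item: 4 poses x 2 camera angles x 30 photos = 240 images.
print(capture_plan(num_poses=4, num_angles=2))   # (240, 12.0, 0.8)
# Items that cannot hold 4 poses: fewer poses, more camera angles (cf. Table 2).
print(capture_plan(num_poses=2, num_angles=4))   # (240, 12.0, 0.8)
```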
Though the setup led to successful 3D model renderings, these models often had pieces of their background included in the model itself. To address this problem, the object masking feature in the Polycam app is used to remove the background from the images and render only the food item. After conducting several experiments using plates with different textures and colors, it was determined that placing the food item on a white plate with low reflectivity and having a black tablecloth on top of the table rendered the most consistent quality of 3D models. Though the turntable is white, the food item is not placed directly on it, as cleaning the turntable is risky and may result in irreparable damage.
The overall process to generate the 3D models of food items is shown in Figure 3 with an example of a successful 3D model rendering displayed in Figure 4. The total weight and protein weight of each food item are weighed using a food scale, and the food name is recorded for each food item. The nutritional value is obtained from the food packaging or from the Canada Nutrient File posted on the Government of Canada website [44] for non-packaged food items such as apples. An example of a collected nutritional label on the food packaging is shown in Figure 5 and an example of the data available on the Government of Canada website for an example food item is shown in Figure 6.

Item-Specific Challenges

In the collection of various food items, we quickly discovered that it is easier to render 3D models of certain types of food compared to others. Specifically, models for textureless food such as cheese, thin foods such as chips, and small items such as grapes often failed to render or rendered in an unrecognizable fashion. For example, as seen in Figure 7, the bottoms of chip models have significant holes. On the other hand, it is easier to generate 3D models of larger items with more texture, such as chicken strips or a chicken wing. Yet, irrespective of texture or size, items that fall apart (have high fragility) throughout the entire data collection process also led to poor model renderings. Such an instance is the tuna rice ball. Though the 3D model for one tuna rice ball was successfully created, most of the tuna rice balls failed to capture, as the tuna would slip or change shape when the sushi was flipped, which resulted in a poor 3D model rendering, as seen in Figure 7. Thus, extra care had to be taken during data collection for fragile food items to ensure that a high-quality model could be captured. A generalized summary of properties that contribute to the success of a 3D model rendering, along with examples, is displayed in Table 3.
One hundred and five food models comprising 20 unique types were created successfully using the proposed pipeline described in Section 3.1 and are saved in the OBJ and PLY file formats, two of the most widely used file formats for 3D models [45]. Provided along with the models are their associated weight, food name, and nutritional value. The total number of food models per category is shown in Figure 8, with mixed protein referring to food items (e.g., tuna rice ball) that contain almost equal amounts of protein and other categories such as carbohydrates.
Along with each model is the nutritional information for the food item. A sample of the metadata file is shown in Table 4.

3.2. Synthesis Dataset (NV-Synth)

These photorealistic 3D food models allow for view synthesis. Using the 3D meshes from the open access NutritionVerse-3D dataset [46], Nvidia’s Omniverse Isaac Sim simulation framework [47] was used to generate synthetic scenes of meals. Nvidia Omniverse is a platform for digital twin simulation. For each scene, up to seven ingredients were sampled and then procedurally dropped onto a plate to simulate realistic food scenes. Using more than seven often leads to items falling off the plate due to simulation physics. To maximize the realism and capture diverse plating conditions (including scenarios where the ingredients are highly disordered), the internal physics engine was leveraged to simulate physics-based interactions between ingredients of different masses and densities. Specifically, we compute sectorized 2D drop positions by dividing the plate into N equal angular regions, then converting to evenly spaced Cartesian coordinates centered between the plate edge and center to minimize overlap.
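A minimal sketch of the sectorized drop-position computation described above is shown below; the plate radius, jitter amount, and function name are illustrative assumptions rather than the exact Isaac Sim implementation.

```python
import math
import random

def sectorized_drop_positions(num_items: int, plate_radius: float = 0.12,
                              jitter: float = 0.01):
    """Divide the plate into equal angular sectors and place one drop point per
    sector, radially centered between the plate center and its edge."""
    positions = []
    for i in range(num_items):
        # One equal angular region per item.
        theta = (i + 0.5) * (2 * math.pi / num_items)
        # Radially halfway between the plate center and edge to minimize overlap.
        r = 0.5 * plate_radius
        # Small random jitter so repeated scenes are not identical.
        x = r * math.cos(theta) + random.uniform(-jitter, jitter)
        y = r * math.sin(theta) + random.uniform(-jitter, jitter)
        positions.append((x, y))
    return positions

# Example: 2D drop positions (in meters) for a 5-ingredient scene.
for x, y in sectorized_drop_positions(5):
    print(f"({x:.3f}, {y:.3f})")
```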
To increase the variance of our scene generation, we run a simulation in which items are released from varying heights above the plate, so the settling sequence depends on physical characteristics of the ingredient, like geometry and mass. Any items failing to settle within the maximum iterations or ejected outside the plate are discarded. This procedural initialization helps minimize extreme collisions associated with wholly random placement while still yielding diversity.
Next, we capture 12 distinct camera viewpoints per scene distributed on a modified Fibonacci sphere to provide uniform coverage with minimal occlusion. This geometry maximizes distinct angles with low redundancy. The downstream dataset generation process then randomly samples 4 of the 12 views for each scene to reduce observational bias. Finally, we randomly vary the brightness of each item’s texture between 1–2×, replicating effects such as changes in indoor lighting. The resulting renders exhibit natural shading and occlusion effects, defined by the materials in the original 3D assets.
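One common way to realize the 12-view placement and view subsampling described above is sketched here; restricting the cameras to the upper hemisphere and the chosen radius are our assumptions about the "modified" Fibonacci sphere, not details taken from the original pipeline.

```python
import math
import random

def fibonacci_hemisphere(n_views: int = 12, radius: float = 0.5):
    """Place n_views camera positions on the upper hemisphere of a Fibonacci
    sphere centered over the plate (hemisphere restriction and radius are
    assumptions about the 'modified' sphere)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    cameras = []
    for i in range(n_views):
        # Spread heights uniformly over (0, 1] so every camera looks down at the plate.
        z = (i + 0.5) / n_views
        r = math.sqrt(max(0.0, 1.0 - z * z))
        theta = golden_angle * i
        cameras.append((radius * r * math.cos(theta),
                        radius * r * math.sin(theta),
                        radius * z))
    return cameras

views = fibonacci_hemisphere(12)
sampled = random.sample(views, k=4)          # keep 4 of the 12 views per scene
brightness = random.uniform(1.0, 2.0)        # per-item texture brightness in [1x, 2x]
```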
Thus, realistic images were captured (e.g., some parts of the dish are out of focus or occluded by other items) by using a variety of diverse and realistic camera perspectives and lighting conditions. Each scene is captured from 12 viewpoints and includes perfect ground truth annotations such as RGB, depth, semantic, instance, and amodal segmentation masks, bounding boxes, and detailed nutritional information per food item. Nutritional metadata for each synthetic scene is derived by aggregating data from the instantiated 3D food assets, based on the NutritionVerse-3D model library [46], and the generated segmentation masks. An illustration of this automatic synthetic meal generation process is shown in Figure 9.
Examples of leveraging view synthesis with a 3D food model are shown in Figure 10 for an apple, an egg and cheese bite, a chicken leg, and a shrimp salad roll. View synthesis is utilized in these figures, as the postprocessed sample of generated 2D images includes angles of the food that were not captured in the initial data collection process. As a result, 2D images obtained by postprocessing the 3D food models extend beyond the fixed camera angles used during data collection, reducing imbalance or bias towards a certain viewing angle. An example of two random camera angles for a food scene from NV-Synth is shown in Figure 11. The nutritional metadata for the synthetic scenes was then calculated based on the metadata available in the NutritionVerse-3D dataset [46] and the output annotation metadata from Omniverse.
NV-Synth is a collection of 84,984 2D images of 7082 distinct dishes with associated dietary metadata, including mass, calories, carbohydrates, fats, and protein contents, as well as ingredient labels indicating which food items are in each dish. A total of 105 individual food items are represented in the dataset (45 unique semantic types, as shown in Figure 12), and the mean number of times each food item appears in a food scene is 369.59. An average of 5.62 food items is present in each dish, and the mean dietary content of each food scene is 602.1 kcal, 315.9 g, 55.1 g, 34.2 g, and 30.0 g for calories, mass, protein, carbohydrate, and fat content, respectively. Figure 13 details the mean nutritional contents across scenes. A subset of this dataset (28,328 images) was used for model development and was created by randomly selecting 4 different viewpoints (from the 12 available angles) for each food scene to reduce bias. We use a 60%/20%/20% training/validation/testing split of the scenes for the experiments and ensure all images from the same scene are kept in the same split.
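Because every scene contributes multiple rendered views, the split is performed at the scene level rather than the image level; a minimal sketch is given below, where the 'scene_id' field name is a hypothetical convention for illustration.

```python
import random
from collections import defaultdict

def split_by_scene(image_records, seed: int = 0):
    """60%/20%/20% train/val/test split over scenes so that all images from a
    given scene land in the same split."""
    scenes = sorted({rec["scene_id"] for rec in image_records})
    rng = random.Random(seed)
    rng.shuffle(scenes)
    n = len(scenes)
    train_scenes = set(scenes[: int(0.6 * n)])
    val_scenes = set(scenes[int(0.6 * n): int(0.8 * n)])
    splits = defaultdict(list)
    for rec in image_records:
        if rec["scene_id"] in train_scenes:
            splits["train"].append(rec)
        elif rec["scene_id"] in val_scenes:
            splits["val"].append(rec)
        else:
            splits["test"].append(rec)
    return splits

# Example with toy records: two views of scene 0 and one view of scene 1.
records = [{"scene_id": 0, "view": 0}, {"scene_id": 0, "view": 1},
           {"scene_id": 1, "view": 0}]
print({k: len(v) for k, v in split_by_scene(records).items()})
```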

3.3. Validation Dataset (NV-Real)

The NV-Real dataset was created by manually collecting images of food scenes in real life. The food items in the dishes were limited to those available in NutritionVerse-3D [46] to ensure appropriate verification of the approach. We used an iPhone 13 Pro Max [41] to collect 10 images at random camera angles for each food dish. An example of two random camera angles for a food scene from NV-Real is shown in Figure 11. To determine the dietary content of the dish, we measured the weight of every ingredient using a food scale. We then gathered the food composition information either from the packaging of the ingredients or from the Canada Nutrient File available on the Government of Canada website [44] in cases where packaging did not contain the dietary data. The weight and average nutritional value were then used to compute the nutritional values (e.g., protein, calories, fat, carbohydrate) of the ingredients, which were summed to get the total nutritional value of the dish. Segmentation masks were then generated through human labeling of the images using Roboflow [48]. For feasibility, four randomly selected images per dish were included in the annotation set to be manually labeled. Any images found with labeling inconsistencies were subsequently removed. We spent a total of 60 h collecting images and 40 h annotating the images. Examples of the segmentation mask for scenes labeled using Roboflow in the NutritionVerse-Real dataset are shown in Figure 14.
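The dish-level nutrition labels follow directly from the measured ingredient weights and the per-100 g composition values taken from packaging or the Canada Nutrient File; the sketch below illustrates that bookkeeping with made-up composition numbers, not actual dataset entries.

```python
# Per-100 g composition values would come from packaging or the Canada Nutrient
# File; the numbers below are illustrative placeholders only.
COMPOSITION_PER_100G = {
    "chicken_wing": {"calories": 203.0, "protein": 30.5, "carbs": 0.0, "fat": 8.1},
    "apple":        {"calories": 52.0,  "protein": 0.3,  "carbs": 13.8, "fat": 0.2},
}

def dish_nutrition(ingredient_weights_g: dict) -> dict:
    """Scale per-100 g values by each ingredient's measured weight and sum."""
    totals = {"mass": 0.0, "calories": 0.0, "protein": 0.0, "carbs": 0.0, "fat": 0.0}
    for name, weight in ingredient_weights_g.items():
        composition = COMPOSITION_PER_100G[name]
        totals["mass"] += weight
        for nutrient, per_100g in composition.items():
            totals[nutrient] += per_100g * weight / 100.0
    return totals

# Example dish: 85 g of chicken wing and 150 g of apple.
print(dish_nutrition({"chicken_wing": 85.0, "apple": 150.0}))
```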
NV-Real includes 889 2D images of 251 distinct dishes composed of the real food items corresponding to those used to generate the synthetic images. The metadata associated with the real-world dataset includes the type of food for each item on the plate, with 45 unique food types present in the dataset. Each food item appears at least once, with an average of 18.29 appearances across dishes. As seen in Figure 15, each dish has between 1 and 7 ingredients, with 26% of dishes containing seven ingredients.
The mean values represented in the scenes comprising the real-world dataset for calories, mass, protein, carbohydrate, and fat content are 830.0 kcal, 406.3 g, 59.9 g, 38.2 g, and 64.0 g, respectively. As depicted in Figure 16, the distribution of the total dish nutritional values within the dataset as a percentage of the daily value is fairly well distributed for calories (female) and fat, but skewed towards the lower range of the nutrient category for calories (male) and carbohydrates. This skewed distribution is not merely a limitation but rather reflects the realistic challenges often encountered with dataset distributions in the real world.
We use a 70%/30% training/testing split for the experiments and ensure all images from the same scene are kept in the same split. No validation data was required for the experiments, as we used the same model hyperparameters from the synthetic experimental model runs for comparison parity between the synthetic and real training results.

4. Results and Discussion

For direct prediction, various model architectures have been extensively studied in the literature [14,16,17,18,19], with the latest state of the art being the Nutrition5k architecture [19], which estimates all five dietary intake tasks. Motivated by Nutrition5k [19], which comprises an Inception-ResNet backbone encoder [51] and a head module with four fully connected layers, we examine two deep learning weight initializations to estimate the dietary information directly from the raw RGB image. For preprocessing, the RGB channels of the images were normalized based on their mean and standard deviation. We implemented the model architecture and hyperparameters used in the Nutrition5k experimental setup [19] and used them to develop an accurate dietary intake estimation model.
We fine-tuned this architecture using two sets of pre-trained weights for the Inception-ResNet backbone encoder: (1) weights trained on the ImageNet dataset [51] and (2) weights trained on the Nutrition5k dataset. These models were trained with 50 epochs and no early stopping criteria. ReLU was also removed from the last model layer, as its removal was found to improve performance.
Two model variations of each method were trained using 3-channel RGB input and 4-channel RGB-depth input, respectively. The RGB channels were normalized based on their mean and standard deviation, and the depth channel was min-max normalized.
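A minimal preprocessing sketch for the two input variants is shown below, assuming PyTorch tensors; the per-channel mean and standard deviation would be computed from the training images, and the values used here are placeholders.

```python
from typing import Optional

import torch

def preprocess(rgb: torch.Tensor, depth: Optional[torch.Tensor],
               mean: torch.Tensor, std: torch.Tensor) -> torch.Tensor:
    """rgb: (3, H, W) float tensor; depth: (1, H, W) tensor or None.
    RGB channels are standardized with dataset statistics; the depth channel
    is min-max normalized and concatenated for the 4-channel variant."""
    rgb = (rgb - mean[:, None, None]) / std[:, None, None]
    if depth is None:
        return rgb                              # 3-channel RGB input
    d_min, d_max = depth.min(), depth.max()
    depth = (depth - d_min) / (d_max - d_min + 1e-8)
    return torch.cat([rgb, depth], dim=0)       # 4-channel RGB-D input

# Example usage with illustrative statistics.
x = preprocess(torch.rand(3, 224, 224), torch.rand(1, 224, 224),
               mean=torch.tensor([0.5, 0.5, 0.5]),
               std=torch.tensor([0.25, 0.25, 0.25]))
print(x.shape)  # torch.Size([4, 224, 224])
```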
Two weight initializations were considered for the backbone in the Nutrition5k direct prediction model architecture: weights from training on ImageNet [51] and weights from training on the Nutrition5k dataset [19]. The ImageNet weights were selected due to their widespread usage, while the Nutrition5k weights were used because Nutrition5k is the state of the art in food intake estimation. We report the performance for these two approaches as initial weights with ImageNet and initial weights with Nutrition5k. Notably, the initial weights with Nutrition5k were obtained without label normalization, as removing the normalization was found to improve performance.
A systematic approach was taken to analyze three main dimensions to improve model performance:
  • using depth information
  • incorporating synthetic data
  • leveraging different deep-learning model architectures

4.1. Using Depth Information

Table 5 shows the NV-Synth test set results for the model architectures trained on the NV-Synth train set, with the lowest mean absolute error (MAE), described in Equation (1), for each nutrient bolded and indicated with an *. In Equation (1), $N$ represents the number of data points, $y_i$ denotes the true value of the $i$-th label, and $\hat{y}_i$ represents the predicted value for the $i$-th label. The MAE is calculated by taking the absolute difference between each predicted value and its corresponding true value, summing these absolute differences over all labels, and then averaging them.

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| \quad (1)$$
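Equivalently, the metric can be computed per nutrient with a few lines of code, as in this small sketch with illustrative numbers.

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error over N labels, as in Equation (1)."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Example: true vs. predicted calories for three dishes (illustrative values).
print(mae(np.array([602.1, 830.0, 415.0]), np.array([580.0, 790.5, 450.2])))
```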
Models trained using depth leverage the RGBD images, whereas models trained without depth use the RGB images. As seen in Table 5, using depth leads to generally worse MAE values than using the pure RGB images. Hence, it appears that depth information does not improve model performance. This finding is congruent with [19], who observed a decline in their direct model performance when using depth images. The first row (ImageNet initial weights and No depth) was used as the baseline for the paired two-tailed t-test.

4.2. Incorporating Synthetic Data

We investigate this question by comparing the performance on the NV-Real test set for three scenarios: (A) Using models trained only on NV-Synth, (B) Fine-tuning models trained on NV-Synth using NV-Real, and (C) Training models only on NV-Real. The first row (ImageNet initial weights trained under scenario A) was used as the baseline for the paired two-tailed t-test.
Through the comparisons, the best model across the five diet components (determined by the lowest MAE on the NV-Real test set) is generally the model with ImageNet initial weights trained only on the NV-Real train set (as seen in Table 6).

4.3. Leveraging Different Deep-Learning Model Architectures

We use a deep learning architecture to predict calorie, mass, carbohydrate, fat, and protein content, comprising a feature extractor base, fully connected layers shared between tasks, and task-specific fully connected layers. We adopt the architectural framework outlined above, which is built upon the foundation established by [19]. The first row (Inception-ResNet base with no compression) was used as the baseline for the paired two-tailed t-test.
Within this structure, seen in the Full Architecture part of Figure 17, the outputs of the feature extractor cascade into fully connected layers. We begin by using an Inception-ResNet [51] base feature extractor pre-trained on ImageNet [54] and later explore using a ViT [52] and an M-AutoE [53] as the feature extractor layer.
The outputs of the last hidden layer of the base architecture are passed into a series of two fully connected layers whose weights are shared across all five tasks. After, a task-specific fully connected layer followed by a linear output is used to gather the predicted content for each regression task—calories, mass, protein, fat, and carbohydrates.
Given the results from using synthetic data, we leverage the NV-Real dataset to train various models on RGB images with no normalization applied to the color channels. During each experiment, we used an RMSProp optimizer with a learning rate of 0.0001, epsilon of 1.0, weight decay of 0.9, and momentum of 0.9. Both the ViT and M-AutoE models comprised 12 attention heads and 12 hidden layers, and the M-AutoE decoder consisted of 16 attention heads and 8 hidden layers. For ViT, we used a patch size of 16 × 16, an input resolution of 224 × 224, and leveraged the pretrained google/vit-base-patch16-224 model. For M-AutoE, we leveraged the pretrained facebook/vit-mae-base model with a decoder depth of 8 blocks and pixels as the reconstruction target. All models were trained with a batch size of 32, and we employed an MAE loss function.
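To make the head structure concrete, here is a minimal PyTorch sketch of the full and compressed variants described in this section, using a stand-in backbone that outputs a flat feature vector; the layer widths, class name, and stand-in backbone are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

TASKS = ["calories", "mass", "protein", "fat", "carbs"]

class MultiTaskRegressor(nn.Module):
    """Full variant: two shared FC layers, then a task-specific FC layer and a
    linear output per task. Compressed variant: one shared FC layer feeding
    separate regression heads. Widths are illustrative assumptions."""

    def __init__(self, backbone: nn.Module, feat_dim: int,
                 hidden: int = 256, compressed: bool = False):
        super().__init__()
        self.backbone = backbone  # e.g., Inception-ResNet, ViT, or M-AutoE encoder
        if compressed:
            self.shared = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
            self.heads = nn.ModuleDict({t: nn.Linear(hidden, 1) for t in TASKS})
        else:
            self.shared = nn.Sequential(
                nn.Linear(feat_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            # Task-specific FC layer followed by a linear output (no final ReLU).
            self.heads = nn.ModuleDict({
                t: nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
                for t in TASKS
            })

    def forward(self, x: torch.Tensor) -> dict:
        feats = self.backbone(x)            # (B, feat_dim) flat feature vector
        shared = self.shared(feats)
        return {t: head(shared).squeeze(-1) for t, head in self.heads.items()}

# Example with a tiny stand-in backbone; training would use an MAE loss and the
# RMSProp settings described above.
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
model = MultiTaskRegressor(backbone, feat_dim=64, compressed=False)
out = model(torch.randn(2, 3, 224, 224))
print({k: tuple(v.shape) for k, v in out.items()})
```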
The investigation into different architectures aims to optimize model performance by minimizing the combined MAE of all nutrient tasks. By comparing the efficacy of various feature extractors and fully connected layer configurations, we seek to identify architectures that strike a balance between accuracy and resource utilization, facilitating the widespread adoption of our solution for dietary tracking applications.
To determine the architecture, different changes were made to the original base architecture, and their performance was compared across models for each regression task. First, we investigated changing the fully connected layers appended to our base architecture, then explored changing the base feature extractor with a vision image transformer and masked autoencoder architecture.
Firstly, we introduce a modification to the fully connected layer section of our architecture, employing a single shared fully connected layer that feeds into separate regression heads, as depicted in the Compressed Architecture part of Figure 17. This adjustment, referred to as the compressed architecture, aims to maximize the utilization of the pre-trained weights from the ImageNet [54] dataset within the base feature extractor architecture, thereby predicting each nutrient task without the need to train additional fully connected layers. As illustrated in Table 7, the implementation of the non-compressed architecture resulted in improved performance, indicated by a lower combined MAE across the five tasks.
While the full architecture, constructed upon the Inception-ResNet base feature extractor, exhibited a lower combined MAE across all tasks compared to the compressed fully connected layers, we note that the compressed model showcased a notable reduction in MAE specifically for the calorie prediction task. This observation, coupled with the marginal disparity in the combined MAE between both predictors, prompted further exploration of the compressed architecture in subsequent experiments.
Next, we address improving model performance by replacing the base feature extractor layers with a ViT architecture [52], and a M-AutoE [53] to explore different architectures’ abilities to capture food image features that contribute to nutritional information. Each of these architectures was also pre-trained on ImageNet [54] and used to replace the feature extractor in the full and compressed fully connected layer set-ups shown in Figure 17.
As shown in Table 7, the best-performing feature extractor overall was the ViT, where either model trained with this as the base layer achieved a combined MAE score lower than all other models tested, suggesting this architecture is the most capable of capturing hidden features of food images contributing to nutritional information. Notably, the combined MAE for ViT was 412.6, an improvement of 25.5% compared to the Inception-ResNet model.
The low performance of the M-AutoE model can be attributed to the relatively large hidden feature outputs of the model, which limited the downstream layers’ capacity to effectively leverage the extracted features. This may have resulted in some of the rich feature representations not having been preserved in the predictor layers, diminishing the model’s ability to accurately capture and utilize crucial information for making predictions.

5. Conclusions

In this paper, we present a process to collect a large image dataset of food scenes that span diverse viewpoints and highlight its usage in dietary intake estimation. We first introduce NutritionVerse-3D, a large-scale high-quality, high-resolution dataset of 105 3D food models in conjunction with their associated weight, food name, and nutritional value. The methodology to collect this dataset was also presented along with the encountered challenges to develop the pipeline.
Leveraging the 3D models in the dataset, 3D food scenes were generated and coupled with automated view synthesis algorithms to create a virtually unlimited number of 2D images from any angle, yielding NutritionVerse-Synth (NV-Synth), a novel large-scale synthetic dataset to advance food image analysis and dietary assessment. NV-Synth represents the largest and most comprehensive synthetic food dataset to date. It also contains a comprehensive set of labels not found together in other datasets, including depth images, instance masks, and semantic masks.
We conducted experiments on various dietary intake estimation approaches and attempted to verify our findings using the manually collected 2D image dataset, NV-Real. Interestingly, using depth information and incorporating synthetic data generally did not help model performance. Surprisingly, it was also more advantageous to leverage weights trained on the ImageNet dataset rather than weights trained on the Nutrition5k dataset. In addition, the ViT architecture with two fully connected layers shared between tasks, a task-specific fully connected layer, and a prediction head with no removed fully connected layers performed the best compared to Inception-ResNet and M-AutoE. Compared to the next best base model (Inception-ResNet), ViT achieved a combined MAE that was 25.5% better.
Notably, our study is limited in its generalizability because the dataset comprises only 45 food types and lacks significant cultural diversity, restricting its broader applicability to varied populations. Specifically, 88% of the NutritionVerse-3D models belong to Euro-Canadian/American diets. Scalability to other cuisines such as Mediterranean, Asian, or Indigenous diets would require collecting more 3D food models to account for stews and fermented items. Future work also involves iterating on the synthetic dataset to more closely mirror images collected in real life by increasing the diversity of images and viewpoints per scene and applying these models on an external food dataset to validate their generalization to different situations. To further optimize the performance and efficiency of the dietary prediction model, future work includes addressing magnitude differences between tasks by using a different loss function or normalizing task labels. By releasing the dataset and simulation pipeline publicly, we hope to provide an essential resource to accelerate nutrition-focused research and applications.

Author Contributions

Conceptualization, C.-e.A.T. and Y.C.; methodology, C.-e.A.T., M.K., S.N., Y.C., Y.W., O.M. and K.P.; software, C.-e.A.T., M.K., S.N., Y.C., Y.W., O.M. and K.P.; validation, C.-e.A.T., M.K., Y.C. and P.X.; formal analysis, C.-e.A.T., M.K., S.N., Y.C., Y.W., O.M. and K.P.; investigation, C.-e.A.T., M.K., S.N., Y.C., Y.W., O.M. and K.P.; resources, A.W.; data curation, C.-e.A.T. and M.K.; writing—original draft preparation, C.-e.A.T., M.K., S.N., Y.C. and Y.W.; writing—review and editing, C.-e.A.T., Y.C. and A.W.; visualization, C.-e.A.T., Y.C. and A.W.; supervision, P.X. and A.W.; project administration, P.X. and A.W.; funding acquisition, P.X. and A.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Council Canada (NRC) through the Aging in Place (AiP) Challenge Program, project number AiP-006.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available at https://bit.ly/genai4good (accessed on 29 October 2025).

Acknowledgments

The authors thank their partners in the Kinesiology and Health Sciences department, Heather Keller and Sharon Kirkpatrick, and their graduate student partner, Meagan Jackson. The authors also thank undergraduate research assistants Tanisha Nigam, Komal Vachhani, and Cosmo Zhao. This article is a revised and expanded version of a paper entitled “NutritionVerse: Empirical Study of Various Dietary Intake Estimation Approaches” [15], which was presented at the 8th International Workshop on Multimedia Assisted Dietary Management.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Davis, M.R. Despite Pandemic, Percentage of Older Adults Who Want to Age in Place Stays Steady. 2021. (Accessed on 11 October 2022).
  2. Ahmed, T.; Haboubi, N. Assessment and management of nutrition in older people and its importance to health. Clin. Interv. Aging 2010, 5, 207–216. [Google Scholar] [CrossRef]
  3. Keller, H.H.; Østbye, T.; Richard, G. Nutritional risk predicts quality of life in elderly community-living Canadians. J. Gerontol. Ser. A 2004, 59, M68–M74. [Google Scholar] [CrossRef]
  4. Kaiser, M.J.; Bauer, J.M.; Rämsch, C.; Uter, W.; Guigoz, Y.; Cederholm, T.; Thomas, D.R.; Anthony, P.S.; Charlton, K.E.; Maggio, M.; et al. Frequency of Malnutrition in Older Adults: A Multinational Perspective Using the Mini Nutritional Assessment. J. Am. Geriatr. Soc. 2010, 58, 1734–1738. [Google Scholar] [CrossRef]
  5. Subar, A.F.; Kirkpatrick, S.I.; Mittl, B.; Zimmerman, T.P.; Thompson, F.E.; Bingley, C.; Willis, G.; Islam, N.G.; Baranowski, T.; McNutt, S.; et al. The Automated Self-Administered 24-Hour Dietary Recall (ASA24): A Resource for Researchers, Clinicians, and Educators from the National Cancer Institute. J. Acad. Nutr. Diet. 2012, 112, 1134–1137. [Google Scholar] [CrossRef]
  6. Kipnis, V.; Subar, A.F.; Midthune, D.; Freedman, L.S.; Ballard-Barbash, R.; Troiano, R.P.; Bingham, S.; Schoeller, D.A.; Schatzkin, A.; Carroll, R.J. Structure of dietary measurement error: Results of the OPEN biomarker study. Am. J. Epidemiol. 2003, 158, 14–21. [Google Scholar] [CrossRef]
  7. Freedman, L.S.; Commins, J.M.; Moler, J.E.; Arab, L.; Baer, D.J.; Kipnis, V.; Midthune, D.; Moshfegh, A.J.; Neuhouser, M.L.; Prentice, R.L.; et al. Pooled results from 5 validation studies of dietary self-report instruments using recovery biomarkers for energy and protein intake. Am. J. Epidemiol. 2014, 180, 172–188. [Google Scholar] [CrossRef]
  8. Freedman, L.S.; Commins, J.M.; Moler, J.E.; Willett, W.; Tinker, L.F.; Subar, A.F.; Spiegelman, D.; Rhodes, D.; Potischman, N.; Neuhouser, M.L.; et al. Pooled results from 5 validation studies of dietary self-report instruments using recovery biomarkers for potassium and sodium intake. Am. J. Epidemiol. 2015, 181, 473–487. [Google Scholar] [CrossRef]
  9. Elbert, S.P.; Dijkstra, A.; Oenema, A. A Mobile Phone App Intervention Targeting Fruit and Vegetable Consumption: The Efficacy of Textual and Auditory Tailored Health Information Tested in a Randomized Controlled Trial. J. Med. Internet Res. 2016, 18, e147. [Google Scholar] [CrossRef]
  10. Zhang, W.; Yu, Q.; Siddiquie, B.; Divakaran, A.; Sawhney, H. “Snap-n-Eat”: Food Recognition and Nutrition Estimation on a Smartphone. J. Diabetes Sci. Technol. 2015, 9, 525–533. [Google Scholar] [CrossRef] [PubMed]
  11. Williamson, D.A.; Allen, H.R.; Martin, P.D.; Alfonso, A.J.; Gerald, B.; Hunt, A. Comparison of digital photography to weighed and visual estimation of portion sizes. J. Am. Diet. Assoc. 2003, 103, 1139–1145. [Google Scholar] [CrossRef]
  12. Rusu, A.; Randriambelonoro, M.; Perrin, C.; Valk, C.; Álvarez, B.; Schwarze, A.K. Aspects Influencing Food Intake and Approaches towards Personalising Nutrition in the Elderly. J. Popul. Ageing 2020, 13, 239–256. [Google Scholar] [CrossRef]
  13. Ciocca, G.; Napoletano, P.; Schettini, R. Food Recognition: A New Dataset, Experiments, and Results. IEEE J. Biomed. Health Inform. 2017, 21, 588–598. [Google Scholar] [CrossRef]
  14. Ando, Y.; Ege, T.; Cho, J.; Yanai, K. DepthCalorieCam: A Mobile Application for Volume-Based FoodCalorie Estimation Using Depth Cameras. In Proceedings of the 5th International Workshop on Multimedia Assisted Dietary Management, MADiMa ’19, Nice, France, 21 October 2019; pp. 76–81. [Google Scholar] [CrossRef]
  15. Tai, C.e.A.; Keller, M.; Nair, S.; Chen, Y.; Wu, Y.; Markham, O.; Parmar, K.; Xi, P.; Keller, H.; Kirkpatrick, S.; et al. NutritionVerse: Empirical Study of Various Dietary Intake Estimation Approaches. In Proceedings of the 8th International Workshop on Multimedia Assisted Dietary Management, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 11–19. [Google Scholar]
  16. Beijbom, O.; Joshi, N.; Morris, D.; Saponas, S.; Khullar, S. Menu-Match: Restaurant-Specific Food Logging from Images. In Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2015; pp. 844–851. [Google Scholar] [CrossRef]
  17. Myers, A.; Johnston, N.; Rathod, V.; Korattikara, A.; Gorban, A.; Silberman, N.; Guadarrama, S.; Papandreou, G.; Huang, J.; Murphy, K. Im2Calories: Towards an Automated Mobile Vision Food Diary. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1233–1241. [Google Scholar] [CrossRef]
  18. Liang, Y.; Li, J. Computer vision-based food calorie estimation: Dataset, method, and experiment. arXiv 2017, arXiv:1705.07632. [Google Scholar] [CrossRef]
  19. Thames, Q.; Karpur, A.; Norris, W.; Xia, F.; Panait, L.; Weyand, T.; Sim, J. Nutrition5k: Towards Automatic Nutritional Understanding of Generic Food. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8903–8911. [Google Scholar]
  20. Wu, X.; Fu, X.; Liu, Y.; Lim, E.P.; Hoi, S.C.; Sun, Q. A Large-Scale Benchmark for Food Image Segmentation. In Proceedings of the ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 506–515. [Google Scholar]
  21. Kaur, P.; Sikka, K.; Wang, W.; Belongie, S.J.; Divakaran, A. FoodX-251: A Dataset for Fine-grained Food Classification. arXiv 2019, arXiv:1907.06167. [Google Scholar] [CrossRef]
  22. Matsuda, Y.; Hoashi, H.; Yanai, K. Recognition of multiple-food images by detecting candidate regions. In Proceedings of the 2012 IEEE International Conference on Multimedia and Expo, Melbourne, VIC, Australia, 9–13 July 2012; pp. 25–30. [Google Scholar]
  23. Min, W.; Wang, Z.; Liu, Y.; Luo, M.; Kang, L.; Wei, X.; Wei, X.; Jiang, S. Large Scale Visual Food Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9932–9949. [Google Scholar] [CrossRef]
  24. Chen, X.; Zhu, Y.; Zhou, H.; Diao, L.; Wang, D. Chinesefoodnet: A large-scale image dataset for chinese food recognition. arXiv 2017, arXiv:1705.02743. [Google Scholar]
  25. Bossard, L.; Guillaumin, M.; Van Gool, L. Food-101—Mining Discriminative Components with Random Forests. In Proceedings of the European Conference on Computer Vision—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014; pp. 446–461. [Google Scholar]
  26. Karabay, A.; Varol, H.A.; Chan, M.Y. Improved food image recognition by leveraging deep learning and data-driven methods with an application to Central Asian Food Scene. Sci. Rep. 2025, 15, 14043. [Google Scholar] [CrossRef]
  27. Karabay, A.; Bolatov, A.; Varol, H.A.; Chan, M.Y. A central asian food dataset for personalized dietary interventions. Nutrients 2023, 15, 1728. [Google Scholar] [CrossRef]
  28. Min, W.; Liu, L.; Wang, Z.; Luo, Z.; Wei, X.; Wei, X.; Jiang, S. Isia food-500: A dataset for large-scale food recognition via stacked global-local attention network. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 393–401. [Google Scholar]
  29. Romero-Tapiador, S.; Tolosana, R.; Lacruz-Pleguezuelos, B.; Marcos-Zambrano, L.J.; Bazán, G.X.; Espinosa-Salinas, I.; Fierrez, J.; Ortega-Garcia, J.; de Santa Pau, E.C.; Morales, A. Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 430–439. [Google Scholar]
  30. Marin, J.; Biswas, A.; Ofli, F.; Hynes, N.; Salvador, A.; Aytar, Y.; Weber, I.; Torralba, A. Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 187–203. [Google Scholar] [CrossRef]
  31. Salvador, A.; Hynes, N.; Aytar, Y.; Marin, J.; Ofli, F.; Weber, I.; Torralba, A. Learning Cross-Modal Embeddings for Cooking Recipes and Food Images. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3068–3076. [Google Scholar]
  32. Chen, J.; Ngo, C.W. Deep-Based Ingredient Recognition for Cooking Recipe Retrieval. In Proceedings of the 24th ACM International Conference on Multimedia, MM ’16, Amsterdam, The Netherlands, 15–19 October 2016; pp. 32–41. [Google Scholar] [CrossRef]
  33. Chang, A.X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. ShapeNet: An Information-Rich 3D Model Repository. arXiv 2015, arXiv:1512.03012. [Google Scholar]
  34. Wu, T.; Zhang, J.; Fu, X.; Wang, Y.; Ren, J.; Pan, L.; Wu, W.; Yang, L.; Wang, J.; Qian, C.; et al. Omniobject3d: Large-vocabulary 3D object dataset for realistic perception, reconstruction and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 803–814. [Google Scholar]
  35. Pouladzadeh, P.; Shirmohammadi, S.; Al-Maghrabi, R. Measuring Calorie and Nutrition From Food Image. IEEE Trans. Instrum. Meas. 2014, 63, 1947–1956. [Google Scholar] [CrossRef]
  36. Bolaños, M.; Radeva, P. Simultaneous food localization and recognition. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 3140–3145. [Google Scholar] [CrossRef]
  37. Konstantakopoulos, F.S.; Georga, E.I.; Fotiadis, D.I. A review of image-based food recognition and volume estimation artificial intelligence systems. IEEE Rev. Biomed. Eng. 2023, 17, 136–152. [Google Scholar] [CrossRef] [PubMed]
  38. Mezgec, S.; Seljak, B. NutriNet: A Deep Learning Food and Drink Image Recognition System for Dietary Assessment. Nutrients 2017, 9, 657. [Google Scholar] [CrossRef]
  39. Kelly, P.; Marshall, S.J.; Badland, H.; Kerr, J.; Oliver, M.; Doherty, A.R.; Foster, C. An Ethical Framework for Automated, Wearable Cameras in Health Behavior Research. Am. J. Prev. Med. 2013, 44, 314–319. [Google Scholar] [CrossRef]
  40. Mok, T.M.; Cornish, F.; Tarr, J. Too Much Information: Visual Research Ethics in the Age of Wearable Cameras. Integr. Psychol. Behav. Sci. 2015, 49, 309–322. [Google Scholar] [CrossRef]
  41. Apple. Apple. 2022. Available online: https://www.apple.com/ca/iphone/ (accessed on 25 October 2022).
  42. Polycam. Polycam—LiDAR & 3D Scanner for iPhone & Android. 2022. Available online: https://poly.cam/ (accessed on 25 October 2022).
  43. Chambers, J.; Hullette, T.; Gharge, P. The Best 3D Scanner Apps of 2022 (iPhone & Android). 2022. Available online: https://all3dp.com/2/best-3d-scanner-app-iphone-android-photogrammetry/ (accessed on 25 October 2022).
  44. Government of Canada. Canadian Nutrient File (CNF)—Search by Food. 2022. Available online: https://food-nutrition.canada.ca/cnf-fce/ (accessed on 25 February 2023).
  45. McHenry, K.; Bajcsy, P. An overview of 3D data content, file formats and viewers. Natl. Cent. Supercomput. Appl. 2008, 1205, 22. [Google Scholar]
  46. Tai, C.e.A.; Keller, M.; Kerrigan, M.; Chen, Y.; Nair, S.; Xi, P.; Wong, A. NutritionVerse-3D: A 3D Food Model Dataset for Nutritional Intake Estimation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023. Women in Computer Vision (WiCV). [Google Scholar]
  47. NVIDIA. NVIDIA Isaac Sim. 2023. Available online: https://developer.nvidia.com/isaac-sim (accessed on 21 July 2023).
  48. Roboflow, Version 1.0; Computer Vision: 2022. Available online: https://research.roboflow.com/citations (accessed on 29 September 2023).
  49. Government of Canada. Percent Daily Value. 2019. Available online: https://www.canada.ca/en/health-canada/services/understanding-food-labels/percent-daily-value.html (accessed on 29 September 2023).
  50. Osilla, E.V.; Safadi, A.O.; Sharma, S. Calories; StatPearls Publishing: Treasure Island, FL, USA, 2018. [Google Scholar]
  51. Szegedy, C.; Ioffe, S.; Vanhoucke, V. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. arXiv 2016, arXiv:1602.07261. [Google Scholar] [CrossRef]
  52. Wu, B.; Xu, C.; Dai, X.; Wan, A.; Zhang, P.; Tomizuka, M.; Keutzer, K.; Vajda, P. Visual Transformers: Token-based Image Representation and Processing for Computer Vision. arXiv 2020, arXiv:2006.03677. [Google Scholar] [CrossRef]
  53. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. arXiv 2021, arXiv:2111.06377. [Google Scholar] [CrossRef]
  54. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.F. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
Figure 1. Setup for the data collection process for an exemplar sushi piece.
Figure 2. Custom Shortcuts app used to take photos on the iPhones for data collection.
Figure 3. Overall process to generate 3D models of food items in NutritionVerse-3D.
Figure 4. Example of a successful 3D model Polycam rendering.
Figure 5. Sample nutritional label for a food item used to record nutritional information.
Figure 6. Sample nutritional information from the Canada Nutrient File for a food item [44].
Figure 7. Problematic renders of food objects showcasing challenges.
Figure 8. Count of 3D food models in each food category in the NutritionVerse-3D dataset.
Figure 9. Automatic synthetic meal generation of various 3D food scenes with multi-modal image data (e.g., RGB and depth data) and annotation metadata (e.g., object detection annotations and segmentation annotations) using the NutritionVerse-3D dataset. NVIDIA Omniverse is used to easily capture camera perspectives and lighting conditions that are both diverse and realistic, including scenes where large parts of the dish may be out of focus (e.g., Scene 1, View 2) or occluded by other items (e.g., Scene 2, View 2).
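As a rough, library-agnostic illustration of the view randomization depicted in Figure 9, the sketch below samples camera poses on a hemisphere above the plate and records per-view lighting metadata. It is not the authors' NVIDIA Omniverse/Isaac Sim pipeline [47]; the names (SceneRecord, sample_camera_pose, render_views) and the sampling ranges are illustrative assumptions.

```python
# Library-agnostic sketch of the kind of view/lighting randomization a synthetic
# food-scene generator performs; the real NV-Synth pipeline uses NVIDIA
# Omniverse/Isaac Sim [47]. All names and ranges here are hypothetical.
import math
import random
from dataclasses import dataclass, field

@dataclass
class SceneRecord:
    scene_id: int
    food_items: list                       # item IDs placed in the scene
    views: list = field(default_factory=list)

def sample_camera_pose(radius: float = 0.6) -> dict:
    """Sample a camera position on an upper hemisphere around the plate."""
    azimuth = random.uniform(0.0, 2.0 * math.pi)
    elevation = random.uniform(math.radians(20), math.radians(80))
    return {
        "x": radius * math.cos(elevation) * math.cos(azimuth),
        "y": radius * math.cos(elevation) * math.sin(azimuth),
        "z": radius * math.sin(elevation),
        "look_at": (0.0, 0.0, 0.0),
    }

def render_views(scene: SceneRecord, num_views: int = 12) -> SceneRecord:
    """Attach randomized camera/lighting metadata for each rendered view."""
    for view_idx in range(num_views):
        scene.views.append({
            "view_id": view_idx,
            "camera": sample_camera_pose(),
            "light_intensity": random.uniform(500, 2000),  # arbitrary units
            # In a real renderer, RGB, depth, and segmentation maps would be
            # written out here alongside this metadata.
        })
    return scene

example = render_views(SceneRecord(scene_id=1, food_items=["id-11-red-apple-145g"]))
print(len(example.views), "views generated for scene", example.scene_id)
```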
Figure 10. Example of 2D images obtained from 3D models of an apple, an egg and cheese bite, a chicken leg, and a shrimp salad roll.
Figure 11. Example food scenes from NV-Synth and NV-Real with two different camera angles.
Figure 12. Frequency distribution of the various semantic class types present across generated NV-Synth scenes, along with the average mass of each food item type.
Figure 13. Distribution of various nutritional factor amounts, mass, and ingredient counts in the generated NV-Synth food scenes. Recommended daily value is also shown in red for the nutrients.
Figure 14. Examples of segmentation masks for scenes labeled using Roboflow in the NutritionVerse-Real dataset.
Figure 15. Distribution of the number of ingredients per dish.
Figure 16. Distribution of the dataset across various macronutrients as a percent of the daily value (DV) obtained from [49,50].
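For reference, the percent daily value plotted in Figure 16 follows the usual definition below; the specific reference amounts come from [49,50], and the fat DV of 75 g in the worked example is an assumption for illustration only.

```latex
% Percent daily value (%DV) for a single nutrient in a dish.
\%\mathrm{DV} = 100 \times \frac{\text{amount of the nutrient in the dish}}
                                {\text{daily value for that nutrient}}
% Example with an assumed DV of 75 g for total fat:
% a dish containing 15 g of fat sits at 100 * 15 / 75 = 20% DV.
```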
Figure 17. Architecture used for nutrient prediction. (a) Full Architecture. (b) Compressed Architecture with fewer fully connected layers.
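The following is a minimal PyTorch-style sketch of the backbone-plus-fully-connected-head design that Figure 17 illustrates, with a flag switching between a deeper head (Full) and a shallower one (Compressed). The layer widths, the stand-in backbone, and the class name NutrientRegressor are assumptions for illustration, not the exact configuration used in the paper.

```python
# Sketch of a backbone + fully connected regression head for nutrient prediction
# (calories, mass, protein, fat, carbohydrates). Layer widths are illustrative
# assumptions, not the paper's reported configuration.
import torch
import torch.nn as nn

class NutrientRegressor(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, compressed: bool = False):
        super().__init__()
        self.backbone = backbone  # e.g., an Inception-ResNet or ViT feature extractor
        if compressed:
            # "Compressed" variant: fewer fully connected layers (Figure 17b).
            head_layers = [nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 5)]
        else:
            # "Full" variant: deeper stack of fully connected layers (Figure 17a).
            head_layers = [
                nn.Linear(feat_dim, 1024), nn.ReLU(),
                nn.Linear(1024, 256), nn.ReLU(),
                nn.Linear(256, 5),
            ]
        self.head = nn.Sequential(*head_layers)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        features = self.backbone(images)   # (batch, feat_dim)
        return self.head(features)         # (batch, 5) nutrient estimates

# Usage with a stand-in backbone that simply flattens and projects the image.
dummy_backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(512))
model = NutrientRegressor(dummy_backbone, feat_dim=512, compressed=True)
out = model(torch.randn(2, 3, 224, 224))
print(out.shape)  # torch.Size([2, 5])
```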
Table 1. Overview of existing dietary intake estimation datasets compared to ours, where Mixed indicates whether multiple food item types are present in a single image.
Work | # Images | # Items | Real | Mixed | # Angles
[14] | 183 | – | Y | N | 1
[16] | 646 | 41 | Y | Y | 1
[17] | 50,374 | 201 | Y | Y | 1
[18] | 2978 | 160 | Y | N | 2
[19] | 5006 | 555 | Y | Y | 4
[35] | 3000 | 8 | Y | Y | 2
NV-Real | 889 | 45 | Y | Y | 4
NV-Synth | 84,984 | 45 | N | Y | 12
Table 2. Combinations of food 6D-poses and camera angles used during data collection; the first and last rows each yield a total of 240 images, and the second row yields a total of 180 images.
Num of Food 6D-Poses | Num of Camera Angles
2 | 4
3 | 2
4 | 2
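The totals in the Table 2 caption follow directly from the per-item image counts of each combination, as the short check below makes explicit. The factor of 30 food items per configuration is inferred from the stated totals (240 / 8 = 180 / 6 = 30), not given in the table itself.

```python
# Quick check of the image counts implied by Table 2: each (poses x angles)
# combination yields poses * angles images per food item. The 30-items-per-
# configuration factor is inferred from the stated totals, not from the table.
combinations = [(2, 4, 240), (3, 2, 180), (4, 2, 240)]
for poses, angles, total in combinations:
    per_item = poses * angles
    print(f"{poses} poses x {angles} angles = {per_item} images/item "
          f"-> {total} total implies {total // per_item} items")
```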
Table 3. Quantifiers and examples of various properties that contribute to a good quality (green) and poor quality (red) model rendering.
Property | Quantifier | Example
Texture | Low | Cheese Block
Texture | High | Granola Bar
Volume | Low | Grape
Volume | High | Apple
Thickness | Low | Potato Chip
Thickness | High | Salad Chicken Strip
Fragility | Low | Chicken Wing
Fragility | High | Tuna Rice Ball
Table 4. Sample portion of the metadata file (iron, magnesium, potassium, sodium, vitamin D, and vitamin B12 values are also available) with food_weight_grams indicated by fwg for brevity.
Item_Id | fwg | Calories | Fat | Carbohydrates | Protein | Calcium
id-11-red-apple-145g | 145 | 85.55 | 0.29 | 20.39 | 0.39 | 0.01
id-12-carrot-9g | 9 | 3.69 | 0.02 | 0.86 | 0.08 | 0.00
id-13-salad-beef-strip-1g | 1 | 2.15 | 0.08 | 0.00 | 0.32 | 0.00
id-14-salad-beef-strip-7g | 7 | 15.05 | 0.58 | 0.00 | 2.26 | 0.00
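As a sketch of how such a metadata file might be consumed, the snippet below sums per-item nutrient values to obtain dish-level totals. The file name nv3d_metadata.csv and the lowercase column spellings are assumptions for illustration, not the dataset's actual schema.

```python
# Hedged sketch: aggregating dish-level nutrition from a per-item metadata file
# with the columns shown in Table 4. The file name and exact column spellings
# are assumptions for illustration.
import csv
from collections import defaultdict

def dish_totals(metadata_path: str, item_ids: list) -> dict:
    """Sum calories/fat/carbohydrates/protein over the items composing a dish."""
    wanted = set(item_ids)
    totals = defaultdict(float)
    with open(metadata_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["item_id"] in wanted:
                for key in ("calories", "fat", "carbohydrates", "protein"):
                    totals[key] += float(row[key])
    return dict(totals)

# Example: a dish made of the apple and the 7 g beef strip from Table 4.
# print(dish_totals("nv3d_metadata.csv",
#                   ["id-11-red-apple-145g", "id-14-salad-beef-strip-7g"]))
```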
Table 5. MAE evaluation of model architectures using NV-Synth (RGB images for No Depth and RGBD for Yes Depth) with the lowest MAE value in each column shown in bold, and significance indicated by *** for p < 0.001 relative to the baseline.
Initial Weights | Depth | Calories (kcal) | Mass (g) | Protein (g) | Fat (g) | Carb (g)
ImageNet | No | 161.9 | 84.6 | 17.0 | 8.7 | 19.9
Nutrition5k | No | 134.2 *** | 72.7 *** | 17.9 *** | 9.1 *** | 22.0 ***
ImageNet | Yes | 249.5 *** | 103.3 *** | 25.4 *** | 14.6 *** | 21.2 ***
Nutrition5k | Yes | 214.4 *** | 82.3 | 24.5 *** | 13.5 *** | 19.8
Table 6. MAE comparison of results for scenario A (models trained only on NV-Synth), scenario B (models trained on NV-Synth and fine-tuned on NV-Real), and scenario C (models trained only on NV-Real) evaluated on the NV-Real dataset, with the lowest MAE value in each column shown in bold, and significance indicated by * for p < 0.05, ** for p < 0.01, and *** for p < 0.001 relative to the baseline.
Initial Weights | Scenario | Calories (kcal) | Mass (g) | Protein (g) | Fat (g) | Carb (g)
ImageNet | A | 485.5 | 175.2 | 39.6 | 26.2 | 55.7
Nutrition5k | A | 1083.0 *** | 443.9 *** | 96.3 *** | 55.5 *** | 64.8 ***
ImageNet | B | 471.5 | 139.7 *** | 33.2 *** | 23.8 ** | 51.4 ***
Nutrition5k | B | 497.7 | 170.5 | 36.0 * | 25.9 | 52.9 **
ImageNet | C | 290.1 *** | 117.9 *** | 25.2 *** | 15.6 *** | 26.4 ***
Nutrition5k | C | 489.7 | 188.2 | 32.7 ** | 24.6 | 59.6
Table 7. MAE model performance using compressed architecture vs full architecture with the lowest MAE value in each column shown in bold, and significance indicated by * for p < 0.05, ** for p < 0.01, and *** for p < 0.001 relative to the baseline. Note that Comp. refers to Compressed.
Base | Comp. | Calorie (kcal) | Mass (g) | Protein (g) | Fat (g) | Carb (g) | Combined
Inception-ResNet | No | 290.1 | 117.85 | 25.2 | 15.6 | 26.4 | 475.1
Inception-ResNet | Yes | 305.5 * | 162.1 *** | 35.2 ** | 17.4 * | 51.3 *** | 571.5
ViT | No | 253.7 | 98.4 | 22.1 | 14.3 | 24.2 | 412.6
ViT | Yes | 311.9 * | 149.5 ** | 27.3 | 16.2 | 46.8 *** | 551.7
M-AutoE | No | 463.3 *** | 144.3 * | 32.0 * | 24.4 *** | 53.5 *** | 717.5
M-AutoE | Yes | 476.2 *** | 152.0 ** | 30.6 * | 22.9 *** | 56.6 *** | 737.3
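The MAE values in Tables 5–7 are mean absolute errors computed independently for calories, mass, protein, fat, and carbohydrates over the evaluation set. A minimal NumPy sketch of that metric is shown below; the array layout and toy values are illustrative only.

```python
# Per-nutrient mean absolute error (MAE), as reported in Tables 5-7: the mean of
# |prediction - ground truth| over all dishes, computed separately per target.
import numpy as np

def per_nutrient_mae(predictions: np.ndarray, targets: np.ndarray) -> dict:
    """predictions, targets: (num_dishes, 5) arrays ordered as
    [calories, mass, protein, fat, carbohydrates]."""
    names = ["calories", "mass", "protein", "fat", "carbohydrates"]
    mae = np.abs(predictions - targets).mean(axis=0)
    return dict(zip(names, mae.round(1)))

# Toy example with two dishes.
pred = np.array([[520.0, 310.0, 22.0, 18.0, 60.0],
                 [250.0, 180.0, 12.0,  9.0, 30.0]])
true = np.array([[480.0, 300.0, 25.0, 15.0, 55.0],
                 [260.0, 170.0, 10.0, 10.0, 33.0]])
print(per_nutrient_mae(pred, true))
```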
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
