Article

RDINet: A Deep Learning Model Integrating RGB-D and Ingredient Features for Food Nutrition Estimation

1 College of Computer Science and Technology, Changchun University, 6543 Weixing Road, Changchun 130022, China
2 Jilin Provincial Key Laboratory of Human Health Status Identification Function & Enhancement, 6543 Weixing Road, Changchun 130022, China
3 Key Laboratory of Intelligent Rehabilitation and Barrier-Free for the Disabled, Changchun University, Ministry of Education, 6543 Weixing Road, Changchun 130022, China
4 College of Artificial Intelligence, Nankai University, 38 Tongyan Road, Jinnan District, Tianjin 300350, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(1), 454; https://doi.org/10.3390/app16010454
Submission received: 28 November 2025 / Revised: 24 December 2025 / Accepted: 30 December 2025 / Published: 1 January 2026

Abstract

With growing public health awareness, accurate food nutrition estimation plays an increasingly important role in dietary management and disease prevention. The main bottleneck lies in effectively integrating multi-source heterogeneous information. We propose RDINet, a multimodal network that fuses RGB appearance, depth geometry, and ingredient semantics for food nutrition estimation. It comprises two core modules. The RGB-D fusion module integrates the textural appearance of RGB images and the 3D shape information conveyed by depth images through a channel–spatial attention mechanism, achieving a joint understanding of food appearance and geometric morphology without explicit 3D reconstruction. The ingredient fusion module embeds ingredient information into visual features via attention mechanisms, enabling the model to fully leverage components that are visually difficult to discern or prone to confusion; this activates the corresponding nutritional reasoning pathways and achieves cross-modal inference from explicit observations to latent attributes. Experimental results on the Nutrition5k dataset show that RDINet achieves percentage mean absolute errors (PMAE) of 14.9%, 11.2%, 19.7%, 18.9%, and 19.5% for estimating calories, mass, fat, carbohydrates, and protein, respectively, with a mean PMAE of 16.8% across all metrics, outperforming existing mainstream methods. These results demonstrate the effectiveness of the proposed appearance–geometry–semantics fusion framework.

1. Introduction

In contemporary health management, nutritional assessment is a fundamental and critical component and is widely applied in disease prevention, clinical treatment, personalized dietary planning, and public health policy formulation, among other areas. Accurate nutritional intake assessment not only helps identify an individual’s nutritional imbalances but also provides a scientific basis for interventions in prevalent diet-related health conditions [1]. The World Health Organization has also explicitly stated that promoting healthy dietary patterns depends on the accurate perception and calculation of food nutritional information. Consequently, developing efficient and accurate nutritional assessment methods is crucial for providing scientific dietary guidance, monitoring daily intake, and maintaining long-term health.
With the advancement of artificial intelligence, particularly in deep learning and computer vision, automated dietary assessment has become a research hotspot. Although existing methods have made progress in image recognition, volume estimation, and multimodal fusion, they still have significant limitations: most methods lack effective modeling of the three-dimensional structure of food, leading to inaccurate volume and mass estimation; meanwhile, they struggle to handle visually similar foods with different nutritional components, as well as hidden ingredients that are crucial to nutrition but difficult to identify visually, thereby causing estimation errors.
To address these limitations, we introduce RDINet, a model designed for food nutrient estimation that jointly leverages RGB appearance, depth geometry, and ingredient semantic information. By employing a cross-image feature fusion mechanism together with an ingredient-guided strategy, RDINet enables unified modeling of food appearance, shape, and composition. This integration effectively combines the diverse sources of information essential for accurate nutrient estimation and helps mitigate errors in food nutrient prediction. In the feature extraction stage, we separately process RGB images and depth images to extract feature representations at different levels. Subsequently, attention mechanisms are applied to integrate the color and depth modalities at each corresponding scale, producing multiple sets of fused representations. To further integrate cross-layer information, we perform unified modeling on the multi-level features, strengthening the representation of the overall food structure and ultimately yielding an RGB-D joint feature embedding that captures both appearance and geometric cues. On this basis, RDINet introduces an ingredient fusion module that maps the multi-hot encoded ingredient list into semantic embedding vectors and dynamically fuses them into the RGB-D visual features via attention mechanisms. This module leverages the external semantic knowledge of known ingredients, enabling the model to effectively utilize ingredient information that is visually difficult to discern or easily confused with other ingredients yet crucial for nutrition. In this way, it activates the corresponding nutritional reasoning pathways, achieves cross-modal inference from explicit visual observations to latent nutritional attributes, and reduces nutrition estimation error. Experimental results on the Nutrition5k dataset show that RDINet outperforms existing methods across five nutritional metrics: calories, mass, fat, carbohydrates, and protein. Specifically, the PMAE reaches 14.9%, 11.2%, 19.7%, 18.9%, and 19.5%, respectively. Compared to DSDGF-Nutri, proposed by Hou et al. [2], the mean PMAE across the five nutrients is reduced by 2.7%, validating the superiority of RDINet. Our work makes three key contributions:
  • We design an RGB-D fusion module that integrates appearance features from RGB images and geometric priors from depth images, achieving joint understanding of food appearance and geometric morphology without explicit 3D reconstruction, thereby effectively reducing nutritional estimation errors.
  • We propose an ingredient fusion module centered on cross-attention, which leverages cross-attention mechanisms to dynamically modulate visual feature responses using ingredient priors. This design not only effectively alleviates the challenge of distinguishing visually similar ingredients with vastly different nutritional profiles but also compensates for the missing information caused by nutritionally critical components that are difficult to recognize visually. By doing so, the model can reliably infer latent nutritional attributes from explicit visual observations, significantly reducing nutrient estimation errors. This constitutes the core methodological innovation of our work.
  • We evaluated RDINet on the public dataset Nutrition5k, and the results demonstrate its excellent performance across multiple nutrient estimation tasks, outperforming existing methods.

2. Related Work

2.1. Traditional Methods

Traditional nutritional assessment methods primarily include the Food Frequency Questionnaire (FFQ) [3] and the 24-Hour Dietary Recall (24HR) [4]. These methods rely on participants’ recall of what and how much they ate, which can lead to memory bias. Furthermore, they impose a substantial reporting burden on participants. These inherent limitations restrict their applicability and accuracy in large-scale populations.

2.2. Image-Based Methods

With the advancement of artificial intelligence, achieving automated and reliable dietary assessment has become a feasible goal. Numerous studies and review articles have pointed out that the application of AI technologies, particularly deep learning and computer vision, has improved the accuracy and efficiency of dietary assessment [5,6,7]. Mariappan et al. [8] introduced an image-based method that estimates intake by analyzing food images captured before and after eating. Ege and Yanai [9] developed a multi-task model that detects food while simultaneously estimating its calories. Situju et al. [10] presented a Convolutional Neural Network (CNN) to estimate saltiness and calories. Fang et al. [11] proposed a deep generative model that produces calorie spatial heatmaps from food images. This method divides the image into multiple regions, estimates the calorie density for each region separately, and addresses the issue of inaccurate overall energy estimation in mixed-food scenarios. Papathanail et al. [12] developed an AI-based mobile application for recognizing food categories and estimating portion sizes from a single meal photo. Keller et al. [13] used a Vision Transformer (ViT) for nutritional estimation of food images. Although this method does not directly identify specific ingredients, it mitigates nutrient allocation errors caused by visual similarities to some extent by independently estimating calories for different regions of the image. However, a common limitation of most of the aforementioned methods is the lack of effective modeling of the three-dimensional structure of food, leading to inaccurate volume and mass estimation [14]. To address this issue, researchers have begun to introduce depth sensing and volume modeling techniques.

2.3. Volume Modeling Approaches

Puri et al. [15] estimated volume by performing 3D modeling of food images, thereby deriving nutritional information. Chen et al. [16] used regular geometric shape templates to approximate food shapes for volume calculation. Myers et al. [17] proposed the Im2Calories system. This system integrates a segmentation network with a deep regression network, not only identifying different food regions but also using volume modeling methods to estimate the calories and other nutritional components of each food. Lu et al. [18] proposed integrating RGB and depth images, using depth information to perceive spatial structure and construct three-dimensional shapes for volume estimation. Thames et al. [19] introduced an RGB-D multi-task model on the Nutrition5k dataset, effectively improving the accuracy of nutrition estimation. However, these methods all failed to incorporate semantic or compositional information of ingredients and thus cannot effectively distinguish between visually similar foods with different nutritional components. This limitation leads to certain estimation errors [20].

2.4. Ingredient Fusion Methods

Shroff et al. [21] proposed combining image recognition results with supplementary information provided by users to achieve preliminary nutrition estimation. Chotwanvirat et al. [22] constructed a Thai food database and estimated food mass and carbohydrate content by combining images with reference objects for scale estimation. This study also manually annotated ingredients for some food samples, providing supervisory signals for the model. Ruede et al. [23] explored joint modeling of images and recipe texts to infer ingredients and their quantities from the text. Kusuma et al. [24] verified the accuracy of database-based nutrition assessment methods. Nguyen et al. [25] and Folson et al. [26] used the FRANI system, introducing a user feedback mechanism to correct model results, thereby improving accuracy. This approach attempted to incorporate human–computer interaction but lacked a systematic ingredient guidance mechanism, making it difficult to generalize to complex dishes and cross-cultural diets. Lee et al. [27] used natural language processing techniques to extract nutritional elements from recipes and integrated them with image recognition results. These methods, to varying degrees, attempted to introduce and utilize ingredient semantic information, but their final effectiveness was still not ideal.
Although existing methods have made progress in incorporating depth perception and ingredient semantics, they lack a systematic multimodal complementary mechanism capable of simultaneously modeling 3D structural cues and reasoning about implicit nutritional components. The proposed RDINet achieves deep synergy among appearance, geometry, and semantic information, thereby reducing nutrient estimation errors.

3. Materials and Methods

3.1. Methods

The framework diagram of this paper is shown in Figure 1. Key modules include a ViT-based feature extractor, an RGB-D fusion module, an ingredient fusion module, and a nutrition regressor.
The ViT-based feature extractor employs a model based on the ViT architecture as its backbone to model global dependencies in images. During the feature extraction process, we obtain feature representations from the shallow, intermediate, and deep layers, denoting the RGB features as { R s , R i , R d } and the depth features as { D s , D i , D d }, where the subscripts s, i, and d represent shallow, intermediate, and deep layers, respectively. The shallow RGB feature  R s  conveys perceptual attributes like color patterns, surface texture, and edge structures, while  D s  provides the unique 3D geometric structure of the depth map.  R d  and  D d  contain more abstract high-level semantic information. In this way, the ViT-based feature extractor provides effective multi-level representations for the subsequent RGB-D fusion.
The RGB-D fusion module enables deep integration between visual features from RGB images and the geometric structure from depth images. RGB images provide rich color and texture details, which help with visual representation modeling of food. Depth images provide 3D structural cues such as object size, shape, and relative position, which are beneficial for estimating food volume and morphology. By fusing appearance features and geometric structural information, this module enhances the model’s joint perception capability of food appearance and 3D structure, providing a reliable perceptual foundation for nutritional estimation. The channel–spatial attention fusion block in this module is used to fuse RGB and depth features at the same level. The feature integration block aggregates multi-level RGB-D fused features, producing the final representation  F R G B D .
The ingredient fusion module introduces external ingredient composition information as a high-level semantic prior. In nutrition estimation, many food ingredients are either visually imperceptible or easily confused due to appearance similarity, and relying solely on visual features may lead to estimation bias. To address this, the module embeds ingredient information into visual features via attention mechanisms, enabling the model to effectively leverage these ambiguous yet nutritionally critical cues, thereby activating the corresponding nutritional reasoning pathways, achieving cross-modal inference from explicit visual observations to latent nutritional attributes, and reducing nutrition estimation error.
The nutrition regressor is a regression module responsible for mapping multimodal fused features to quantitative predictions of various nutrients, and serves as the final prediction component of the entire nutritional estimation system.
Specifically, the input includes the RGB image, depth image, and ingredient category information of the food. The ViT-based feature extractor separately captures shallow, intermediate, and deep representations from both RGB and depth modalities, yielding RGB features $\{R_s, R_i, R_d\}$ and depth features $\{D_s, D_i, D_d\}$. Subsequently, the RGB-D fusion module is used to fuse RGB and depth features, in which the channel–spatial attention fusion block fuses features from the same level of RGB and depth images, resulting in $\{RD_s, RD_i, RD_d\}$. The feature integration block further aggregates these multi-level features to obtain the unified multimodal feature $F_{RGBD}$. The ingredient information is represented as a multi-hot encoded vector $F_{Ing} \in \{0,1\}^C$, where $C$ is the cardinality of the ingredient vocabulary, and each dimension corresponds to a specific ingredient, taking the value 1 if it is present and 0 otherwise. The ingredient fusion module fuses $F_{Ing}$ and $F_{RGBD}$, and generates dedicated features for five nutrients (calorie, mass, fat, carb, and protein), denoted as $\{F_{RDI}^{Cal}, F_{RDI}^{Mass}, F_{RDI}^{Fat}, F_{RDI}^{Carb}, F_{RDI}^{Prot}\}$. Finally, each feature vector is passed through the nutrition regressor to output the corresponding nutritional content prediction.
The algorithm is presented below (Algorithm 1):
Algorithm 1: Algorithm of RDINet
Input: RGB images RGBImg; depth images DepthImg; ingredients F_Ing; ground-truth values y_Cal, y_Mass, y_Fat, y_Carb, y_Prot; training epochs E; batch size B; model M
Output: ŷ_Cal, ŷ_Mass, ŷ_Fat, ŷ_Carb, ŷ_Prot
Initialize all weights
for i ← 1, …, E do
    Divide {RGBImg, DepthImg, F_Ing, y_Cal, y_Mass, y_Fat, y_Carb, y_Prot} into minibatches of size B
    for j ∈ minibatches do
        R_s, R_i, R_d ← FeatureExtractor(RGBImg)
        D_s, D_i, D_d ← FeatureExtractor(DepthImg)
        for l ∈ {s, i, d} do
            RD_l ← ChannelSpatialAttentionFusionBlock(R_l, D_l)
        F_RGBD ← FeatureIntegrationBlock(RD_s, RD_i, RD_d)
        for Nutrition ∈ {Cal, Mass, Fat, Carb, Prot} do
            F_RDI^Nutrition ← IngredientFusionModule(F_RGBD, F_Ing)
        for Nutrition ∈ {Cal, Mass, Fat, Carb, Prot} do
            ŷ_Nutrition ← NutritionRegressor(F_RDI^Nutrition)
        Loss ← F_Loss(ŷ_Cal, ŷ_Mass, ŷ_Fat, ŷ_Carb, ŷ_Prot, y_Cal, y_Mass, y_Fat, y_Carb, y_Prot)
        backward(Loss)
        update all weights
    end
end
return ŷ_Cal, ŷ_Mass, ŷ_Fat, ŷ_Carb, ŷ_Prot

3.2. Dataset

We have summarized the currently available datasets for nutritional content estimation, as shown in Table 1. The table summarizes the number of food categories, sample sizes, data types (images or videos), whether depth data is included, whether ingredient data is included, and the nutrients that can be estimated such as mass, calories, protein, fat, and carbohydrates. Currently, there are not many publicly available datasets suitable for nutritional estimation. Among them, ECUSTFD lacks depth maps and ingredient information. MetaFood3D lacks ingredient information. Our proposed method requires RGB images, depth images, and ingredient data; therefore, only the Nutrition5k dataset meets our requirements. Specifically, we used 3.5k RGB images, depth images, and ingredient information from this dataset to evaluate our method. Figure 2 shows examples from the Nutrition5k dataset. Figure 2a is an RGB image showing a top-down view of the food. Figure 2b shows the corresponding depth map, with proximity to the camera encoded by color: blue denotes nearer regions and red indicates greater distance. Figure 2c presents ingredient-related information, specifying the constituent items and their associated nutrient profiles. The dataset contains some unsuitable samples, as shown in Figure 3, including images of empty plates, partially captured dishes, or overlapping food items, which can negatively impact model performance. To mitigate this, we first excluded such samples and then randomly split the remaining dataset into training, validation, and test subsets with proportions of 70%, 10%, and 20%, respectively. Although the numerical distributions of different nutrients in Nutrition5K exhibit some degree of imbalance, the dataset encompasses a rich diversity of dish types and ingredient combinations, and the overall sample size is sufficiently large. In practice, this splitting strategy approximately maintains consistent nutrient distributions across subsets. Moreover, it ensures an adequate amount of training data while providing independent validation and test sets for model selection and final evaluation.

3.3. ViT-Based Feature Extractor

The RGB image features of food help in identifying food types, thereby providing necessary information for subsequent nutrient estimation. For example, by analyzing the shape, color, and texture of foods in images, the model can infer the type of food. Chicken breast and tofu, although both white and block-shaped with similar appearances, have different nutritional compositions: the former is rich in animal protein, while the latter is a plant-based protein source with higher fat content. If the image feature extraction is inaccurate, leading to errors in food type recognition, it will result in deviations in nutrient estimation, thus affecting the accuracy of nutritional analysis [30]. To more effectively handle complex food images, we adopted ViT. ViT employs self-attention mechanisms to dynamically capture dependencies among image patches, thereby modeling both global structure and local details. When identifying visually similar foods, ViT can distinguish them more accurately because it not only focuses on local features but also captures the context of the entire image. For instance, when identifying chicken breast and tofu, ViT can recognize the muscle fiber structure of chicken breast and the uniform texture of tofu through local image patches. It can also leverage self-attention to capture global contextual information, such as the types of surrounding ingredients and overall color distribution, thereby making more accurate judgments. Therefore, ViT has an advantage in recognizing visually similar foods. Additionally, ViT can effectively model long-range dependencies within images, meaning it can accurately identify target foods even in complex backgrounds or scenes with multiple objects present [31].
Depth images also help estimate the nutritional content of food. Through depth images, models can obtain spatial information about food, which is often not fully captured by RGB images alone. For example, depth images can be used to accurately measure the volume and shape of food, thereby enabling more precise estimation of nutritional content, such as calories, fat, and protein [32]. ViT captures correlations between distant regions in an image through the self-attention mechanism, which helps in understanding complex plate layouts. When applied to depth images, ViT can not only identify food contours but also precisely model their height, volume, and spatial relationship with the container, thus achieving more accurate portion size estimation [33]. For instance, in the nutritional assessment of a bowl of rice, an RGB image may struggle to determine whether the bowl is full or half-full, but a depth image can clearly reflect the height of the rice mound. ViT integrates depth information from the bowl’s edges and the top of the food to globally infer the corresponding portion size, and then combines this with food recognition results to estimate the corresponding nutrient content. This integration reduces errors in nutritional estimation.
In this study, we adopted the DINOv2 [34] model based on the ViT architecture as the feature extraction backbone network, separately extracting visual and geometric features from RGB images and depth images. This choice is mainly based on its excellent feature quality and strong downstream task transferability demonstrated after pre-training on large-scale data. DINOv2 features perform excellently in dense prediction tasks that require precise spatial localization, and the features it outputs exhibit strong spatial alignment and local discriminability. Therefore, when applied to food images, DINOv2 can accurately identify food boundaries and distinguish adjacent dishes. Although DINOv2 was originally trained on RGB images, its generated features have been verified to be usable for signal processing in other modalities. For example, DINOv2 shows good performance in depth estimation tasks on the KITTI and NYUd datasets, indicating that its feature space contains sufficient geometric prior knowledge to effectively support 3D structure understanding. Therefore, we apply DINOv2 to depth images to model the raised height, stacking layers, and relative volume of food, thereby reducing estimation errors.
Specifically, we choose to use the DINOv2 model to separately extract features from RGB and depth inputs. By leveraging its shallow, intermediate, and deep layer outputs to construct multi-level feature representations, the model can gradually evolve from capturing local visual structures (such as color, edges, and texture in RGB images, as well as surface undulations in depth images) to modeling high-level semantic information (such as food categories, stacking structures, and spatial relationships among foods). When processing RGB images, this multi-level feature extraction approach helps preserve local details like color and texture while enhancing the discriminative ability for food categories. When processing depth images, it helps better interpret the stacking height of ingredients and height differences between regions. This hierarchical feature extraction strategy provides effective geometric and semantic support for subsequent food nutrient content estimation.
The input image $Img \in \mathbb{R}^{H \times W \times C}$ is processed by the DINOv2 backbone network, which comprises $L = 12$ Transformer blocks. The feature map at layer $l \in \{0, 1, \ldots, L-1\}$ is computed as:
$$F_l = \mathrm{DINOv2}_l(Img) \in \mathbb{R}^{B \times N \times D}$$
where $B$ is the batch size, $N = 256$ is the sequence length, and $D = 768$ is the feature dimension.
From this, the features at three representative levels—shallow ( s ), intermediate ( i ), and deep ( d )—for the RGB image are extracted as:
$$\{R_s, R_i, R_d\} = \{F_s, F_i, F_d\}_{RGBImg}, \quad s < i < d$$
Similarly, for the depth image, the corresponding levels are:
$$\{D_s, D_i, D_d\} = \{F_s, F_i, F_d\}_{DepthImg}, \quad s < i < d$$
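As a concrete illustration of this extraction step, the sketch below (not the authors' released code) pulls token features from three Transformer blocks of a DINOv2 backbone using standard PyTorch forward hooks. The dinov2_vitb14 checkpoint from torch.hub, the layer indices {0, 6, 11} (borrowed from the ablation in Section 4.3), a 224 × 224 input (yielding N = 256 patch tokens with D = 768), and replicating the single-channel depth map to three channels are all illustrative assumptions.

```python
# Minimal sketch: multi-level feature extraction with forward hooks.
# Assumes the backbone exposes its Transformer blocks as `backbone.blocks`.
import torch

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()

LEVELS = {"shallow": 0, "intermediate": 6, "deep": 11}
_cache = {}

def _make_hook(name):
    def hook(module, inputs, output):
        # Drop the [CLS] token; keep the 256 patch tokens (B, N=256, D=768).
        _cache[name] = output[:, 1:, :]
    return hook

for name, idx in LEVELS.items():
    backbone.blocks[idx].register_forward_hook(_make_hook(name))

@torch.no_grad()
def extract_levels(img):
    """img: (B, 3, 224, 224); an RGB image, or a depth map repeated to 3 channels."""
    _cache.clear()
    backbone(img)
    return _cache["shallow"], _cache["intermediate"], _cache["deep"]

rgb = torch.randn(2, 3, 224, 224)
depth = torch.randn(2, 1, 224, 224).repeat(1, 3, 1, 1)  # assumed depth handling
R_s, R_i, R_d = extract_levels(rgb)
D_s, D_i, D_d = extract_levels(depth)
print(R_s.shape)  # torch.Size([2, 256, 768])
```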

3.4. RGB-D Fusion Module

The RGB-D fusion module aims to solve the problem of fusing cross-modal visual information. This module integrates the RGB and depth views of a single food sample: the former provides rich color, texture, and appearance information, while the latter contains precise three-dimensional structure and spatial geometric information. By jointly analyzing the visual appearance of RGB images and the spatial geometric layout of depth images, this module can generate more comprehensive and discriminative feature representations. This multimodal feature fusion enhances the model’s understanding of the food’s category and three-dimensional form, thereby providing a more reliable foundation for subsequent food nutrition estimation [35,36].
At the same network levels, we use the channel–spatial attention fusion block to fuse the corresponding RGB and depth features, with its architecture shown in Figure 4. This mechanism can explicitly model the complementarity between RGB and depth inputs: the former conveys rich chromatic and textural cues, while the latter encodes object geometry and spatial arrangement. The channel attention mechanism enables automatic learning of the importance distribution across feature channels for each input type, thereby supporting adaptive weighting fusion between modalities. Spatial attention allows the network to attend to the most informative food regions in the image, strengthens the response to key structures, and further enhances the capability of multimodal feature collaboration [37].
Specifically, we reshape the  R l  and  D l , and then concatenate them. The formulas are as follows:
$$F_l^{RGBIn} = \mathrm{Reshape}(R_l) \in \mathbb{R}^{B \times D \times H \times W}$$
$$F_l^{DepthIn} = \mathrm{Reshape}(D_l) \in \mathbb{R}^{B \times D \times H \times W}$$
$$F_{Con}^{l} = \mathrm{Concat}(F_l^{RGBIn}, F_l^{DepthIn}) \in \mathbb{R}^{B \times 2D \times H \times W}$$
where  H  and  W  denote the spatial dimensions of the feature maps, both set to 16.
We process $F_{Con}^{l}$ using a sequence consisting of global average pooling (GAP), a convolutional layer (Conv), ReLU, another Conv, and Sigmoid to obtain the channel attention weights $CA$. Specifically, two 1 × 1 convolutional layers are employed: the first reduces the channel dimension from $2D$ to $D/4$, and the second restores it to $2D$. The formula is as follows:
$$CA = \mathrm{Sigmoid}(\mathrm{Conv}(\mathrm{ReLU}(\mathrm{Conv}(\mathrm{GAP}(F_{Con}^{l}))))) \in \mathbb{R}^{B \times 2D \times 1 \times 1}$$
The channel attention $CA$ is split evenly into two parts, serving, respectively, as the RGB channel attention $CA_{RGB}$ and the depth channel attention $CA_{D}$. The formula is as follows:
$$CA_{RGB},\; CA_{D} = \mathrm{Split}(CA) \in \mathbb{R}^{B \times D \times 1 \times 1}$$
A sequence consisting of a convolutional layer followed by Sigmoid is applied to  F C o n l  to generate the spatial attention map  S A . Here, the input and output channel numbers of the convolutional layer are  2 D  and 1, respectively, and the kernel size is 7 × 7. The formula is as follows:
$$SA = \mathrm{Sigmoid}(\mathrm{Conv}(F_{Con}^{l})) \in \mathbb{R}^{B \times 1 \times H \times W}$$
The features  F l R G B I n  and  F l D e p t h I n  are multiplied by their corresponding channel attention weights, respectively, and then multiplied by the spatial attention weights, yielding  F l R G B O u t  and  F l D e p t h O u t . The two outputs are concatenated to yield the RGB-D fused feature  R D l  at this level. The formulas are as follows:
$$F_l^{RGBOut} = F_l^{RGBIn} \times CA_{RGB} \times SA \in \mathbb{R}^{B \times D \times H \times W}$$
$$F_l^{DepthOut} = F_l^{DepthIn} \times CA_{D} \times SA \in \mathbb{R}^{B \times D \times H \times W}$$
$$RD_l = \mathrm{Concat}(F_l^{RGBOut}, F_l^{DepthOut}) \in \mathbb{R}^{B \times 2D \times H \times W}$$
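The channel–spatial attention fusion block defined by the formulas above can be sketched in PyTorch as follows. The padding of 3 for the 7 × 7 spatial-attention convolution (chosen so the 16 × 16 spatial size is preserved) and the class and variable names are assumptions made for illustration.

```python
# Minimal sketch of the channel-spatial attention fusion block (D = 768, H = W = 16).
import torch
import torch.nn as nn

class ChannelSpatialAttentionFusion(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        # Channel attention: GAP -> 1x1 Conv (2D -> D/4) -> ReLU -> 1x1 Conv (D/4 -> 2D) -> Sigmoid
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * dim, dim // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // 4, 2 * dim, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: 7x7 Conv (2D -> 1) -> Sigmoid (padding assumed to keep H x W)
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2 * dim, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, R_l, D_l):
        # R_l, D_l: (B, N, D) token sequences from the same level; N = H * W = 256.
        B, N, D = R_l.shape
        H = W = int(N ** 0.5)
        f_rgb = R_l.transpose(1, 2).reshape(B, D, H, W)   # F_l^RGBIn
        f_dep = D_l.transpose(1, 2).reshape(B, D, H, W)   # F_l^DepthIn
        f_con = torch.cat([f_rgb, f_dep], dim=1)          # (B, 2D, H, W)

        ca = self.channel_att(f_con)                      # (B, 2D, 1, 1)
        ca_rgb, ca_dep = torch.split(ca, D, dim=1)        # each (B, D, 1, 1)
        sa = self.spatial_att(f_con)                      # (B, 1, H, W)

        f_rgb_out = f_rgb * ca_rgb * sa
        f_dep_out = f_dep * ca_dep * sa
        return torch.cat([f_rgb_out, f_dep_out], dim=1)   # RD_l: (B, 2D, H, W)

fusion = ChannelSpatialAttentionFusion(dim=768)
rd = fusion(torch.randn(2, 256, 768), torch.randn(2, 256, 768))
print(rd.shape)  # torch.Size([2, 1536, 16, 16])
```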
The feature integration block is used to fuse multi-level RGB-D fused features  R D l  into the final RGB-D feature, illustrated in Figure 5. This module integrates multi-level features, enabling the model not only to capture the color and texture of food but also to better understand its spatial layout. It enhances the perception of food shape and structure, thereby obtaining a more comprehensive feature representation [38], which helps the model estimate nutritional content more accurately.
Specifically, the input multi-level RGB-D fused features  R D l  are separately processed by convolutional layers, with their outputs concatenated to obtain  F C o n R D . The convolutional layers have  2 D  and  D  input and output channels, respectively, and the kernel size is 1 × 1. The formula is as follows:
$$F_{Con}^{RD} = \mathrm{Concat}(\mathrm{Conv}(RD_s), \mathrm{Conv}(RD_i), \mathrm{Conv}(RD_d)) \in \mathbb{R}^{B \times 3D \times H \times W}$$
A sequence comprising a convolutional layer followed by batch normalization and ReLU is used to process  F C o n R D , obtaining the final RGB-D feature  F R G B D . The input and output channel numbers of the convolutional layer are  3 D  and  D , respectively, the kernel size is 1 × 1, and the padding is 1. The formula is as follows:
$$F_{RGBD} = \mathrm{ReLU}(\mathrm{BatchNorm}(\mathrm{Conv}(F_{Con}^{RD}))) \in \mathbb{R}^{B \times D \times H \times W}$$
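A minimal sketch of the feature integration block is given below. Since a 1 × 1 kernel with padding 1 would enlarge the 16 × 16 feature map, the sketch keeps the stated output shape by using the 1 × 1 kernels without padding; this is an interpretation rather than the authors' exact configuration.

```python
# Minimal sketch of the feature integration block (per-level 1x1 convs + fusion conv).
import torch
import torch.nn as nn

class FeatureIntegrationBlock(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        # One 1x1 conv per level: 2D -> D channels.
        self.reduce = nn.ModuleList(
            [nn.Conv2d(2 * dim, dim, kernel_size=1) for _ in range(3)]
        )
        # Final 1x1 conv (3D -> D) + BatchNorm + ReLU.
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * dim, dim, kernel_size=1),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, rd_s, rd_i, rd_d):
        # rd_*: (B, 2D, H, W) fused features from the shallow/intermediate/deep levels.
        reduced = [conv(x) for conv, x in zip(self.reduce, (rd_s, rd_i, rd_d))]
        f_con_rd = torch.cat(reduced, dim=1)   # (B, 3D, H, W)
        return self.fuse(f_con_rd)             # F_RGBD: (B, D, H, W)

block = FeatureIntegrationBlock(dim=768)
out = block(*[torch.randn(2, 1536, 16, 16) for _ in range(3)])
print(out.shape)  # torch.Size([2, 768, 16, 16])
```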

3.5. Ingredient Fusion Module

The ingredient fusion module aims to fuse visual features with external semantic knowledge (i.e., the ingredient list) to achieve semantic complementarity of heterogeneous information. Although RGB-D image-based methods can effectively recognize food categories and estimate portion sizes, they still struggle to capture key latent factors that affect nutrition. For example, in “pumpkin soup,” whether milk or sugar is added has minimal impact on its appearance, making it unreliable to determine the presence of these ingredients based on vision alone. However, milk increases fat and protein content, while sugar substantially raises carbohydrate and calorie levels. Additionally, chicken breast or pork belly, after being minced and cooked, both appear as light-colored meat granules, with their texture and color differences greatly diminished, making them often visually indistinguishable. Yet chicken breast is low in fat and rich in protein, whereas pork belly is rich in fat and higher in calories. Relying solely on visual features, the model finds it difficult to accurately identify the specific type of ingredient used, leading to nutritional estimation bias. By introducing the ingredient list, the model can activate corresponding nutritional reasoning pathways, analyze the overall nutritional composition, and thereby estimate nutrient content more accurately [39].
As shown in Figure 6, the module primarily relies on attention mechanisms to achieve multi-modal feature fusion. Attention mechanisms can perform adaptive filtering of input features through dynamic weight allocation, allowing the network to emphasize more discriminative information for the task, making it suitable for multi-modal information fusion scenarios [40].
Cross-attention mechanisms are used to fuse RGB-D features  F R G B D  with the ingredient semantic representations obtained through a Multilayer Perceptron (MLP). The RGB-D image features include observable visual cues such as color, texture, and spatial structure, which are used as queries (Q) in the cross-attention mechanism. By encoding ingredient information, the model can implicitly learn combinatorial patterns among ingredients during training. For example, milk and sugar frequently co-occur in sweet porridges such as pumpkin soup and purple sweet potato puree, forming a typical high-fat, high-sugar combination that leaves almost no visual trace in the final dish. Although chicken breast and pork belly appear visually similar, they tend to appear in distinct dietary contexts—such as fitness meals versus home-style braised dishes—resulting in markedly different nutritional profiles. Such ingredient combinations, which lack salient visual cues but exhibit stable semantic associations, can be effectively modeled through semantic encoding, thereby enhancing the understanding of the overall dish structure. The encoded semantic ingredient features are fed into the cross-attention mechanism as keys (K) and values (V). Through this cross-attention mechanism, features at each spatial location in the image can be dynamically modulated according to the semantic patterns of the overall ingredient composition, enabling cross-modal reasoning from explicit visual observations to latent food attributes [41].
Self-attention mechanisms are employed to process the joint representation resulting from the fusion of image and ingredient features, modeling long-range dependencies among the fused features across different spatial locations. This further enhances the response to regions related to key ingredients and integrates global contextual information. Consequently, the model can not only capture visual-semantic matches but also establish structural associations across regions, thereby improving its overall understanding of complex dishes [42].
Specifically, the ingredient information  F I n g  is processed.  F I n g  is fed into a two-layer MLP, where the first layer maps the input dimension  D I n g  to a hidden dimension  D O u t = 256  using ReLU, and the second layer preserves the dimensionality through a linear transformation only. Subsequently, a dimension of length 1 is inserted before the sequence dimension of this feature, forming a tensor of shape  R B × 1 × D O u t , which is then replicated  N = 256  times along that dimension, resulting in a repeated feature sequence  F C o n I n g  of length  N . This facilitates alignment with visual features and supports the subsequent fusion process. The formulas are as follows:
$$F_{Att}^{Ing} = \mathrm{Reshape}(\mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(F_{Ing})))) \in \mathbb{R}^{B \times 1 \times D_{Out}}$$
$$F_{Con}^{Ing} = \mathrm{Expand}(F_{Att}^{Ing}) \in \mathbb{R}^{B \times N \times D_{Out}}$$
where $F_{Ing} \in \mathbb{R}^{B \times D_{Ing}}$ is a multi-hot encoded vector of dimension $D_{Ing} = 555$, constructed from a vocabulary of 555 common ingredients. For each dish, the dimensions corresponding to present ingredients are set to 1, and the rest are set to 0. $N$ is set to 256.
The RGB-D feature  F R G B D  is flattened, reshaped, and linearly transformed to match the subsequent fusion process. The reshaped vector has a shape of  R B × N × D , and the linear transformation maps from  D  to  D O u t , where  D = 768 . The formula is as follows:
$$F_{Att}^{RGBD} = F_{Con}^{RGBD} = \mathrm{Linear}(\mathrm{Reshape}(\mathrm{Flat}(F_{RGBD}))) \in \mathbb{R}^{B \times N \times D_{Out}}$$
The concatenated features of  F C o n I n g  and  F C o n R G B D  are fed into a two-layer MLP. The first layer maps the concatenated dimension  2 D O u t  to 512 with a ReLU activation, and the second layer projects the dimension from 512 back to  D O u t . We define:
$$F_{Con}^{RDI} = \mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(\mathrm{Concat}(F_{Con}^{Ing}, F_{Con}^{RGBD})))) \in \mathbb{R}^{B \times N \times D_{Out}}$$
Feature fusion is performed on  F A t t I n g  and  F A t t R G B D  using a cross-attention mechanism. Specifically,  F A t t R G B D  after linear projection is used as the query (Q), and this projection does not change the feature dimension.  F A t t I n g  is separately projected via linear transformations to serve as the key (K) and value (V), with the projection process also preserving the dimensionality unchanged. By computing the similarity between image features and ingredient semantics, content in the image that is related to specific ingredients is identified. Subsequently, these similarities are used to perform weighted aggregation on the value vectors, thereby injecting ingredient contextual information into the image features. Finally, an enhanced image feature  F A t t R D I  that incorporates ingredient semantics is generated. The formulas are as follows:
$$Q = \mathrm{Linear}(F_{Att}^{RGBD}) \in \mathbb{R}^{B \times N \times D_{Out}}$$
$$K = \mathrm{Linear}(F_{Att}^{Ing}) \in \mathbb{R}^{B \times 1 \times D_{Out}}$$
$$V = \mathrm{Linear}(F_{Att}^{Ing}) \in \mathbb{R}^{B \times 1 \times D_{Out}}$$
$$F_{Att}^{RDI} = \mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^T}{\sqrt{D_{Out}}}\right)V \in \mathbb{R}^{B \times N \times D_{Out}}$$
where $QK^T$ measures the similarity between $Q$ and $K$, and $\sqrt{D_{Out}}$ scales the dot products to stabilize training. The Softmax function normalizes the similarities into attention weights. Finally, multiplication with $V$ achieves semantic-based feature weighting.
After adding $F_{Att}^{RDI}$ and $F_{Con}^{RDI}$, a self-attention mechanism is used to enhance regions related to ingredients. The TransformerEncoderLayer in PyTorch captures global dependencies through multi-head self-attention, allowing it to dynamically attend to the most informative regions in both the image and ingredient modalities. The parameters are set as $d\_model = D_{Out}$ and $nhead = 8$. We define:
$$F_{RDI} = \mathrm{TransformerEncoderLayer}(F_{Att}^{RDI} + F_{Con}^{RDI}) \in \mathbb{R}^{B \times N \times D_{Out}}$$
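The following sketch assembles the ingredient fusion module from the formulas above, with D_Ing = 555, D_Out = 256, N = 256, and nhead = 8. The batch_first=True layout of the TransformerEncoderLayer and all module and variable names are implementation assumptions, not the authors' code.

```python
# Minimal sketch of the ingredient fusion module (Section 3.5).
import math
import torch
import torch.nn as nn

class IngredientFusionModule(nn.Module):
    def __init__(self, d_img=768, d_ing=555, d_out=256, n_tokens=256, nhead=8):
        super().__init__()
        self.n_tokens = n_tokens
        # Ingredient encoder: 555 -> 256 (ReLU) -> 256.
        self.ing_mlp = nn.Sequential(
            nn.Linear(d_ing, d_out), nn.ReLU(inplace=True), nn.Linear(d_out, d_out)
        )
        # Visual projection: 768 -> 256.
        self.img_proj = nn.Linear(d_img, d_out)
        # Concatenation branch: 512 -> 512 (ReLU) -> 256.
        self.concat_mlp = nn.Sequential(
            nn.Linear(2 * d_out, 512), nn.ReLU(inplace=True), nn.Linear(512, d_out)
        )
        # Cross-attention projections (dimension-preserving).
        self.q_proj = nn.Linear(d_out, d_out)
        self.k_proj = nn.Linear(d_out, d_out)
        self.v_proj = nn.Linear(d_out, d_out)
        # Self-attention over the fused sequence (batch_first is an assumption).
        self.encoder = nn.TransformerEncoderLayer(d_model=d_out, nhead=nhead,
                                                  batch_first=True)

    def forward(self, f_rgbd, f_ing):
        # f_rgbd: (B, D, H, W) unified RGB-D feature; f_ing: (B, 555) multi-hot vector.
        f_att_rgbd = self.img_proj(f_rgbd.flatten(2).transpose(1, 2))   # (B, N, 256)
        f_att_ing = self.ing_mlp(f_ing).unsqueeze(1)                    # (B, 1, 256)
        f_con_ing = f_att_ing.expand(-1, self.n_tokens, -1)             # (B, N, 256)

        f_con_rdi = self.concat_mlp(torch.cat([f_con_ing, f_att_rgbd], dim=-1))

        # Cross-attention: visual tokens query the ingredient embedding.
        q, k, v = self.q_proj(f_att_rgbd), self.k_proj(f_att_ing), self.v_proj(f_att_ing)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(q.size(-1)), dim=-1)
        f_att_rdi = attn @ v                                            # (B, N, 256)

        return self.encoder(f_att_rdi + f_con_rdi)                      # F_RDI: (B, N, 256)

module = IngredientFusionModule()
f_rdi = module(torch.randn(2, 768, 16, 16), torch.randint(0, 2, (2, 555)).float())
print(f_rdi.shape)  # torch.Size([2, 256, 256])
```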
In RDINet, five independent instances of the ingredient fusion module are used to fuse RGB-D image features and ingredient information separately, yielding five distinct fused features. Each fused feature is dedicated to its corresponding nutrient content estimation task. This design ensures that the prediction for each nutrient is based on a specially learned feature representation, rather than reusing shared features generated by a single fusion module, thereby effectively avoiding mutual interference among tasks, enabling each task to independently optimize its own dedicated feature representation, and ultimately reducing the bias in nutrient content estimation.

3.6. Nutrition Regressor

The architecture of the nutrition regressor is shown in Figure 7. It first applies global average pooling to  F R D I , followed by a four-layer MLP. Specifically: The first linear layer maps from  D O u t  to 2048 and uses ReLU. The second and third linear layers each map 2048 to 2048 and use ReLU. The fourth linear layer maps from 2048 to 1 without activation and produces the final nutrient content prediction.
The overall computation is given by the following formula:
$$\hat{y}_{Nutrition} = \mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(\mathrm{GAP}(F_{RDI})))))))) \in \mathbb{R}^{B \times 1}$$
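A minimal sketch of the nutrition regressor, following the layer sizes stated above (global average pooling over the token dimension, then 256 → 2048 → 2048 → 2048 → 1), is shown below; the class name is illustrative.

```python
# Minimal sketch of the nutrition regressor.
import torch
import torch.nn as nn

class NutritionRegressor(nn.Module):
    def __init__(self, d_out=256, hidden=2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_out, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),      # no activation on the output
        )

    def forward(self, f_rdi):
        # f_rdi: (B, N, D_Out); pool over the token dimension, then regress.
        return self.mlp(f_rdi.mean(dim=1))   # (B, 1)

regressor = NutritionRegressor()
print(regressor(torch.randn(2, 256, 256)).shape)  # torch.Size([2, 1])
```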

3.7. Loss Function

We use the sum of the percentage mean absolute error (PMAE) across the five nutrients as the loss function.
First, we compute the mean absolute error (MAE) for each nutrient over the current batch:
$$MAE_{Nutrition} = \frac{1}{B}\sum_{i=1}^{B}\left|\hat{y}_{Nutrition}^{i} - y_{Nutrition}^{i}\right|$$
Then, the PMAE for each nutrient is calculated as:
$$PMAE_{Nutrition} = \frac{MAE_{Nutrition}}{\frac{1}{B}\sum_{i=1}^{B} y_{Nutrition}^{i}}$$
Finally, the total loss aggregates the PMAE values for all five nutrients.
$$Loss = PMAE_{Cal} + PMAE_{Mass} + PMAE_{Fat} + PMAE_{Carb} + PMAE_{Prot}$$
Here, $B$ denotes the batch size, and $\hat{y}_{Nutrition}^{i}$ and $y_{Nutrition}^{i}$ denote the prediction and ground truth, respectively, for a specific nutrient of sample $i$ in the current batch.
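A minimal sketch of this loss computation is shown below; the small epsilon guarding against division by zero is an added safeguard not mentioned in the paper, and the dictionary-based interface is purely illustrative.

```python
# Minimal sketch: sum of per-nutrient PMAE values over the current batch.
import torch

NUTRIENTS = ("cal", "mass", "fat", "carb", "prot")

def pmae(pred, target, eps=1e-8):
    # pred, target: (B,) tensors for a single nutrient.
    return (pred - target).abs().mean() / (target.mean() + eps)

def rdinet_loss(preds, targets):
    # preds, targets: dicts mapping nutrient name -> (B,) tensor.
    return sum(pmae(preds[n], targets[n]) for n in NUTRIENTS)

preds = {n: torch.rand(32) * 100 for n in NUTRIENTS}
targets = {n: torch.rand(32) * 100 for n in NUTRIENTS}
print(rdinet_loss(preds, targets))
```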

3.8. Evaluation Metrics

We adopt PMAE as the primary evaluation metric. In nutrient estimation tasks, the magnitudes of different nutrients vary significantly—for instance, certain foods contain more than 390 kcal of energy but only around 12 g of protein. Since the MAE is highly dependent on the unit and scale of the target variable, it is difficult to fairly compare model performance across different nutrients. In contrast, PMAE introduces a relative error normalization, providing a more equitable and interpretable evaluation metric, making it better suited for real-world applications. The metric is computed as follows:
$$PMAE_{Nutrition} = \frac{\frac{1}{N_{Nutrition}}\sum_{i=1}^{N_{Nutrition}}\left|\hat{y}_{Nutrition}^{i} - y_{Nutrition}^{i}\right|}{\frac{1}{N_{Nutrition}}\sum_{i=1}^{N_{Nutrition}} y_{Nutrition}^{i}}$$
where $N_{Nutrition}$ is the total number of samples, and $\hat{y}_{Nutrition}^{i}$ and $y_{Nutrition}^{i}$ denote the predicted and ground-truth values, respectively, for a specific nutrient of sample $i$.

4. Results

4.1. Implementation Details

Experiments were run under the following configuration: CPU is Intel(R) Xeon(R) Gold 6226R, GPU is NVIDIA GeForce RTX 3090, operating system is Ubuntu 22.04, Python version is 3.9.19, PyTorch version is 2.0.0, and CUDA version is 12.4.
To enhance the model’s robustness to variations in lighting and shooting angles while preserving the spatial alignment between RGB and depth images, we apply only brightness adjustment to RGB images during training. Random horizontal flipping is applied synchronously to both RGB and depth images. Ingredient vectors are not augmented in any way to maintain the accuracy of semantic supervision. All data augmentation is disabled during validation and testing.
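A sketch of this augmentation policy is shown below; the flip probability of 0.5 and the brightness-jitter range are assumptions, as the paper does not specify them.

```python
# Minimal sketch: brightness jitter on RGB only, synchronized horizontal flip on RGB + depth.
import random
import torch
import torchvision.transforms.functional as TF

def augment(rgb, depth, flip_p=0.5, brightness_range=(0.8, 1.2)):
    # rgb: (3, H, W) in [0, 1]; depth: (1, H, W). Ingredient vectors are left untouched.
    rgb = TF.adjust_brightness(rgb, random.uniform(*brightness_range))
    if random.random() < flip_p:
        rgb = TF.hflip(rgb)
        depth = TF.hflip(depth)   # keep RGB and depth spatially aligned
    return rgb, depth

rgb_aug, depth_aug = augment(torch.rand(3, 224, 224), torch.rand(1, 224, 224))
print(rgb_aug.shape, depth_aug.shape)
```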
During training, we use a learning rate of 0.00001, the Adam optimizer [43], a batch size of 32, and train for 70 epochs.
At inference time, the input consists of an RGB image, its corresponding depth image, and an ingredient vector for a single dish. A ViT-based feature extractor separately extracts shallow, intermediate, and deep features from the RGB and depth images. The RGB-D fusion module then fuses the RGB and depth features: within each corresponding level, the channel–spatial attention fusion block integrates the paired features to produce multi-level RGB-D fused representations. Subsequently, the feature integration block aggregates these multi-level features into a unified RGB-D multimodal representation. This representation is then fed into five parallel ingredient fusion modules, each of which fuses the RGB-D features with the ingredient vector to generate a task-specific feature representation. Finally, each task-specific feature is passed through an independent nutrition regressor to predict the corresponding nutrient value.

4.2. Experimental Results

We evaluate methods on the Nutrition5k dataset. All runs use the same hardware and software environment to ensure consistent experimental conditions and comparable results.
Results are presented in Table 2. We compare our method against representative approaches: Google-Nutrition-rgb, a classic RGB-only method for nutrient estimation; ViT and the more recent Swin-Nutrition, both representative models based on the Transformer architecture; DPF-Nutrition, a recent RGB-D approach; and DSDGF-Nutri, one of the current state-of-the-art multimodal methods that leverages RGB images, depth maps, and ingredient information for nutrient estimation.
Compared to Google-Nutrition-rgb, our method reduces PMAE by 15.3%, 13.4%, 21.1%, 18.1%, 16.9%, and 17.7% on calories, mass, fat, carbohydrates, protein, and average PMAE, respectively. Compared to ViT, it reduces PMAE by 10.7%, 10.2%, 17.9%, 14.3%, 15.4%, and 13.7%, respectively. Compared to Swin-Nutrition, it reduces PMAE by 8.8%, 8.6%, 12.7%, 10.5%, 12.2%, and 10.6%, respectively. In the comparison with multimodal methods, compared to DPF-Nutrition, it reduces PMAE by 4.7%, 4.5%, 7.3%, 7.2%, 5.4%, and 5.9%, respectively. Compared to DSDGF-Nutri, it reduces PMAE by 1.3%, 0.6%, 4.8%, 3.5%, 3.4%, and 2.8%, respectively. These results demonstrate that RDINet not only shows advantages over traditional methods but also maintains a leading position when compared with recent Transformer-based single-modal models and advanced multimodal models, validating the effectiveness and competitiveness of the proposed method in the current nutrient estimation task.
Figure 8 shows the training loss and validation PMAE curves of RDINet. Figure 8a presents the training loss curve, while Figure 8b displays the validation PMAE curves for calories, mass, fat, carbohydrates, protein, and their average. The horizontal axis in both subfigures represents the training epoch.
As shown in Figure 8, both the training loss and validation PMAE decrease rapidly during the first 40 epochs. Between epochs 40 and 60, the decline slows down and exhibits minor fluctuations. After approximately epoch 60, all curves stabilize, indicating that RDINet achieves stable convergence within 70 epochs on the Nutrition5k dataset.
Moreover, after convergence, the validation PMAE does not show any noticeable increase; the validation curves for all nutrient components remain steady in the later training stages without a sustained upward trend. This suggests that the model does not suffer from significant overfitting during training. The consistent behavior between the training loss and validation PMAE curves during the convergence phase further demonstrates the stability of the proposed method throughout the optimization process.
To evaluate the model’s generalization capability, we conduct experiments on the ECUSTFD dataset. All experiments are performed under identical conditions to ensure consistency in setup and comparability of results. The training, validation, and test sets follow the official dataset split. Since ECUSTFD provides only RGB images—without depth maps or ingredient information—we adapt RDINet accordingly: the RGB-D fusion module and ingredient fusion module are removed, and only the three-level feature integration component is retained, resulting in a degraded version of RDINet, referred to as RDINet (RGB-only).
Table 3 presents the mass estimation results of different methods on the ECUSTFD dataset. Under RGB-only input, RDINet (RGB-only) achieves a PMAE of 18.2%, which represents a 12.8% absolute reduction compared to Google-Nutrition-rgb. Its performance is also comparable to that of Swin-Nutrition, with only a 1.3% higher PMAE. This demonstrates that, even in the absence of depth information and ingredient annotations, the proposed model architecture is still able to effectively extract discriminative features relevant to mass estimation and maintain stable predictive performance across different datasets.
It is important to emphasize that the version of RDINet evaluated on ECUSTFD is the RGB-only variant, where the RGB-D fusion and ingredient fusion modules are disabled. This experiment is designed to assess the model’s generalization capability and robustness under modality-constrained conditions, rather than to showcase the peak performance of its full multimodal configuration. Combined with the multimodal results on the Nutrition5k dataset, these findings indicate that RDINet exhibits strong adaptability across varying input modalities and dataset settings. Overall, RDINet consistently demonstrates robust generalization, further validating the effectiveness of its architectural design.

4.3. Ablation Experiments

First, we evaluate the contribution of the main components of our model to food nutrient estimation on the Nutrition5k dataset. In Table 4, “RGB” denotes nutrient estimation using only RGB images, while “Depth” denotes estimation using only depth images. The “RGB-D Fusion Module” includes both channel attention and spatial attention mechanisms, which are used to fuse RGB and depth features. The “Ingredient Fusion Module” incorporates cross-attention and self-attention mechanisms to integrate visual features with ingredient information. Finally, “RGB + Depth + Ingredient (RGB-D Fusion Module + cross-attention + self-attention, RGB-D Fusion Module + Ingredient Fusion Module)” corresponds to RDINet proposed in this paper.
As shown in Table 4, RDINet achieves better performance than approaches using only RGB images or only depth images. Compared to the RGB-only estimation method, the ‘RGB + Depth (channel–spatial attention, RGB-D Fusion Module)’ approach—which fuses RGB and depth images—reduces PMAE by 6.8%, 6.4%, 13.0%, 10.8%, 10.5%, and 9.5% on calories, mass, fat, carbohydrates, protein, and average PMAE, respectively. The ‘RGB + Depth + Ingredient (RGB-D Fusion Module + cross-attention + self-attention, RGB-D Fusion Module + Ingredient Fusion Module)’ method, which further integrates ingredient information, achieves even greater improvements, reducing PMAE by 13.4%, 12.5%, 19.1%, 20.2%, 17.7%, and 16.6% on the same metrics, respectively.
RGB images provide appearance information such as color and texture, while depth images convey 3D shape and size cues of the food. By fusing RGB and depth images using the RGB-D fusion module, the model gains a more comprehensive understanding of the food’s category, morphology, and structure, thereby reducing estimation errors. Compared to RGB + Depth (spatial attention), RGB + Depth (channel attention) achieves a 1.0% lower average PMAE, indicating that modality-specific feature recalibration is more critical than spatial refinement for nutrient estimation. Combining both channel and spatial attention further reduces the error.
The ingredient fusion module has an even more critical impact on final performance, as it enables the model to surpass the performance ceiling of purely visual approaches. Although the RGB-D fusion module enhances robustness by incorporating depth information, it still cannot handle food components that are visually ambiguous or difficult to discern yet nutritionally significant. In contrast, the ingredient fusion module effectively bridges the gap between visual appearance and latent nutritional attributes by introducing semantic priors of ingredients. Specifically, this module leverages ingredient semantic cues to activate corresponding nutritional reasoning pathways and achieves cross-modal fusion between vision and semantics through attention mechanisms. Building upon the ‘RGB + Depth (channel–spatial attention, RGB-D Fusion Module)’ baseline, introducing only cross-attention reduces the error by 5.3%, demonstrating that explicitly aligning visual regions with ingredient semantics is both necessary and highly effective. In contrast, self-attention alone yields a more modest reduction of 2.7%. This indicates that cross-attention—by treating RGB-D features as queries and ingredient information as keys and values—more effectively models the relationship between visual cues and semantic ingredient knowledge, explicitly aligning key regions across the two modalities and thereby optimizing multimodal fusion. In comparison, self-attention operates on a concatenated sequence of RGB-D and ingredient features, enabling intra-sequence interactions that capture internal dependencies but lacking precise modeling of cross-modal relationships. This limitation results in inferior performance compared to cross-attention. When both mechanisms are combined, the error is further reduced, confirming their complementary benefits.
Therefore, these two modules integrate multi-source information from the perspectives of “2D appearance + 3D geometry” and “visual features + ingredient semantics,” working together to reduce nutrient estimation errors.
Second, we conduct a sensitivity analysis on the number of fusion levels in the feature extraction stage to validate the rationale behind our chosen three-level structure (shallow, intermediate, and deep). We compare multiple fusion strategies based on the DINOv2 backbone: using a single layer (layer 11), two layers (layers 6 and 11), three layers (layers 0, 6, and 11—the configuration adopted by RDINet), and four layers (layers 0, 4, 8, and 11).
As shown in Table 5, the three-layer fusion strategy achieves the best performance. Compared to using only the deepest layer, the two-layer scheme (intermediate + deep) reduces the average PMAE by 2.8%. Further incorporating the shallow layer—resulting in the three-layer (shallow + intermediate + deep) configuration—yields an additional 3.1% reduction in average PMAE over the two-layer variant. These results demonstrate that the model is highly sensitive to the inclusion of complementary features from multiple levels, confirming the contribution of multi-level representations to nutrient estimation. Specifically, shallow features preserve fine-grained textures and high-frequency details of food (e.g., surface granules, gloss variations); intermediate features capture local compositional elements and their spatial arrangements (e.g., meat chunks, vegetable slices, sauce regions); while deep features encode holistic semantic cues (e.g., dish category and cooking style). This hierarchical fusion enables the model to comprehensively understand food appearance and structure—from pixel-level details to high-level semantics.
However, when a fourth layer is added to the fusion, performance gains plateau or even slightly degrade across all metrics compared to the three-layer setup, with PMAE changes ranging from −0.3% to +0.4%. This suggests that the model becomes insensitive to additional feature layers beyond the three-level configuration; excessive feature aggregation introduces redundancy and information overload without delivering meaningful discriminative benefits.
Therefore, the proposed three-level (shallow/intermediate/deep) feature fusion scheme strikes an effective balance between representational completeness and feature efficiency.
Third, we evaluated the performance of different feature extractors on the task of food nutrient content estimation using the Nutrition5k dataset. Among these, VGG16 [47], InceptionV3 [48], ResNet-18 and ResNet-50 [49] are CNN-based models, while PVT [50] and DINOv2 are ViT-based models.
As shown in Table 6, DINOv2 outperforms VGG16 by reducing PMAE in calories, mass, fat, carbohydrates, protein, and mean PMAE by 7.1%, 6.6%, 8.7%, 11.3%, 5.6%, and 7.9%, respectively. Compared to ResNet-50, the proposed method reduces the above metrics by 6.0%, 5.2%, 6.0%, 7.9%, 4.1%, and 5.9%, respectively. Compared to PVT, it achieves reductions of 1.3%, 2.0%, 3.6%, 5.7%, 2.9%, and 3.1% in these metrics, respectively.
In the task of nutritional content estimation, recognizing food and accurately modeling its three-dimensional structure are crucial. The experimental results indicate that ViT models excel in this aspect due to fundamental architectural differences. CNNs, with their convolutional operations, tend to capture local texture features, whereas ViTs are more sensitive to global macro-structures and spatial relationships. Therefore, ViTs can more effectively recognize food and understand the overall contour, stacking patterns, and relative positions between different food components, providing a more robust feature foundation for subsequent regression tasks.
Moreover, DINOv2 outperforms PVT, primarily due to its superior feature quality and inherent geometric priors. DINOv2 is a Vision Transformer pre-trained on large-scale data through self-supervised learning, and it extracts features with strong generalization capabilities and sensitivity to geometric structures. Since its pre-trained weights serve as the backbone, the model can effectively utilize these generic visual representations, thereby achieving excellent performance in nutrient estimation tasks. Additionally, the self-supervised pre-training objectives of DINOv2 enable it to implicitly learn rich structural and geometric cues during pre-training, thus endowing its feature space with some degree of three-dimensional perception ability. Previous studies have confirmed that DINOv2 performs exceptionally well in tasks such as depth estimation. Therefore, when processing depth images, DINOv2 can more effectively decode and utilize the geometric information within them, accurately capturing the height, volume, and spatial relationships of food, leading to more precise nutrient content estimation.
Fourth, we evaluated, on the Nutrition5k dataset, the impact of using a separate ingredient fusion module for each nutrient estimation task versus a single module shared across all tasks.
As shown in Table 7, the separate ingredient fusion strategy outperforms the shared counterpart, reducing the PMAE for calories, mass, fat, carbohydrates, and protein by 2.2%, 1.9%, 5.6%, 5.8%, and 4.1%, respectively, and the mean PMAE by 4.0%.
This performance gap indicates that different nutrients exhibit distinct dependency patterns on ingredient information, and a single shared fusion representation is insufficient to simultaneously satisfy the modeling requirements of all nutrient types. By introducing an independent ingredient fusion module for each nutrient, the model can learn more tailored fusion features, thereby more effectively capturing the relationship between ingredient composition and specific nutritional attributes. This design enables precise extraction of relevant ingredient cues while avoiding interference and redundancy inherent in shared modules, ultimately leading to lower nutrient estimation errors.
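The separate-versus-shared distinction can be sketched as below, where each nutrient owns its own fusion block and regression head via an nn.ModuleDict; the module names, dimensions, attention configuration, and mean pooling are illustrative simplifications rather than the full ingredient fusion module.

```python
import torch
import torch.nn as nn

NUTRIENTS = ("calories", "mass", "fat", "carb", "protein")

class PerNutrientFusion(nn.Module):
    """Illustrative sketch: per-nutrient (or shared) ingredient fusion plus regressors."""
    def __init__(self, dim: int = 384, shared: bool = False):
        super().__init__()
        # With shared=True, every nutrient reuses one fusion block (the weaker setting in Table 7).
        keys = ("shared",) if shared else NUTRIENTS
        self.shared = shared
        self.fusion = nn.ModuleDict(
            {k: nn.MultiheadAttention(dim, num_heads=8, batch_first=True) for k in keys})
        self.heads = nn.ModuleDict(
            {k: nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1))
             for k in NUTRIENTS})

    def forward(self, visual_tokens, ingredient_tokens):
        out = {}
        for k in NUTRIENTS:
            attn = self.fusion["shared" if self.shared else k]
            # Visual tokens query the ingredient embeddings (independently per nutrient if not shared).
            fused, _ = attn(visual_tokens, ingredient_tokens, ingredient_tokens)
            out[k] = self.heads[k](fused.mean(dim=1)).squeeze(-1)
        return out

preds = PerNutrientFusion()(torch.randn(2, 256, 384), torch.randn(2, 10, 384))
print({k: tuple(v.shape) for k, v in preds.items()})  # each nutrient: (2,)
```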

4.4. Visualization Analysis

To provide an intuitive view of RDINet's behavior, we generated feature heatmaps, as shown in Figure 9, where red denotes high-attention regions and blue denotes low-attention regions. Taking Dish_1562085185 as an example, when estimating calorie and fat content, the model primarily attends to the scrambled eggs and hash browns; when estimating carbohydrates, it focuses mainly on the hash browns; and when estimating protein, it concentrates on the scrambled eggs. These attended regions correspond precisely to the primary ingredients rich in the respective nutrients, indicating that the model adaptively focuses on relevant food regions when evaluating different nutritional components.
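For readers reproducing such visualizations, the sketch below overlays a low-resolution attention/activation map on the input image with a jet colormap (red = high, blue = low); the min-max normalization and nearest-neighbour upsampling are generic choices and not necessarily those used to produce Figure 9.

```python
import numpy as np
import matplotlib.pyplot as plt

def overlay_heatmap(rgb: np.ndarray, attn: np.ndarray, alpha: float = 0.5) -> None:
    """Overlay an hxw attention map on an HxWx3 RGB image in [0, 1] (illustrative)."""
    h, w = rgb.shape[:2]
    # Normalize to [0, 1], then upsample to image resolution by block replication.
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
    attn_up = np.kron(attn, np.ones((h // attn.shape[0], w // attn.shape[1])))
    plt.imshow(rgb)
    plt.imshow(attn_up, cmap="jet", alpha=alpha)  # red = high attention, blue = low
    plt.axis("off")
    plt.show()

# Dummy data; replace with a real image and a model-derived attention map.
overlay_heatmap(np.random.rand(224, 224, 3), np.random.rand(16, 16))
```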

5. Discussion

The core value of this study lies in advancing dietary management toward automation. The proposed method enables individuals to obtain real-time nutritional information about their food in daily life with minimal burden, reducing the cognitive load and operational barriers associated with traditional dietary recording methods (e.g., manual weighing or food diaries). By providing immediate and objective nutritional feedback, the method helps users gain a clearer understanding of their dietary patterns and promptly adjust their intake behaviors, thereby promoting balanced diets, controlling caloric intake, and supporting health goals such as chronic disease prevention and weight management.
RDINet achieves an average inference time of 21.19 ms, demonstrating excellent real-time performance that fully meets the practical demands of daily dietary logging, instant feedback, and health management—highlighting its strong deployment potential. Future work will explore mobile-optimized variants to further facilitate widespread adoption on edge devices.
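Such latency figures are typically measured as sketched below; the warm-up and iteration counts are arbitrary, and the stand-in model must be replaced by the trained RDINet with real RGB-D inputs, so this snippet does not reproduce the 21.19 ms number itself.

```python
import time
import torch

@torch.no_grad()
def mean_latency_ms(model, inputs, warmup: int = 10, iters: int = 100) -> float:
    """Average inference latency in milliseconds (hardware-dependent sketch)."""
    model.eval()
    for _ in range(warmup):                      # warm up kernels and caches
        model(*inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(*inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()                 # wait for async GPU work before stopping the clock
    return (time.perf_counter() - start) * 1000.0 / iters

# Example with a stand-in model; substitute the trained RDINet and its RGB-D inputs.
model = torch.nn.Linear(384, 5)
print(f"{mean_latency_ms(model, (torch.randn(1, 384),)):.2f} ms")
```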
Although it demonstrates strong potential in everyday health management scenarios, the method still faces challenges in clinical nutrition management. Clinical settings demand extremely high absolute accuracy, where even minor deviations may impact patient outcomes. The current performance level of the model is insufficient to meet such clinical-grade precision requirements. This limitation stems partly from the limited scale and diversity of the training data. At present, large-scale publicly available datasets that simultaneously include high-fidelity RGB images, depth information, and ingredient-level annotations remain scarce, which hinders accurate interpretation of complex and specialized clinical meals. Therefore, an important future direction is to construct a larger-scale and more representative multimodal food dataset, thereby overcoming the performance bottleneck of the model in clinical scenarios through improved data quality.
Figure 10 visualizes the attention maps of different methods on the sample Dish_1562085185. Specifically, RDINet (RGB-only) is a variant of RDINet that removes both the RGB-D fusion module and the ingredient fusion module, retaining only the three-layer fusion of RGB features. Swin-Nutrition is currently the best-performing method for nutrition estimation that uses only RGB images. RDINet (RGB + Depth, RGB-D fusion module) is another RDINet variant that excludes the ingredient fusion module but employs the RGB-D fusion module to integrate RGB and depth inputs. DPF-Nutrition is the current best-performing method that utilizes both RGB and depth images for nutrition estimation. RDINet (RGB + Depth + Ingredient, RGB-D fusion module + cross-attention) is an RDINet variant that uses the RGB-D fusion module to combine RGB and depth data, while incorporating only the cross-attention mechanism from the ingredient fusion module to fuse ingredient information. RDINet denotes the full proposed model, which leverages both the RGB-D fusion module and the complete ingredient fusion module (including both cross-attention and self-attention) to jointly fuse RGB, depth, and ingredient information.
The requirement for paired RGB-D images poses a major challenge for deploying RDINet at scale in real-world applications; before turning to this limitation, we first summarize what the RGB-D fusion module contributes. Its channel–spatial attention mechanism adaptively weights and integrates complementary features from RGB (color and texture) and depth (3D geometry), yielding more discriminative representations. Channel attention prioritizes modality-specific channel importance, while spatial attention focuses on discriminative food regions, enhancing responses to key structural components and thereby improving multimodal synergy.
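A simplified, CBAM-style sketch of channel-then-spatial attention over concatenated RGB and depth features is given below; the reduction ratio, pooling choices, and output projection are assumptions, and the paper's RGB-D fusion module may differ in detail.

```python
import torch
import torch.nn as nn

class ChannelSpatialFusion(nn.Module):
    """CBAM-style channel then spatial attention over concatenated RGB-D features (sketch)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(2 * channels, 2 * channels // reduction), nn.ReLU(),
            nn.Linear(2 * channels // reduction, 2 * channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.out = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat, depth_feat):           # each: [B, C, H, W]
        x = torch.cat([rgb_feat, depth_feat], dim=1)   # [B, 2C, H, W]
        # Channel attention: weight modality-specific channels.
        ca = torch.sigmoid(self.channel_mlp(x.mean(dim=(2, 3))))[:, :, None, None]
        x = x * ca
        # Spatial attention: highlight discriminative food regions.
        sa = torch.sigmoid(self.spatial_conv(
            torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)))
        return self.out(x * sa)

fused = ChannelSpatialFusion(384)(torch.randn(2, 384, 16, 16), torch.randn(2, 384, 16, 16))
print(fused.shape)  # torch.Size([2, 384, 16, 16])
```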
As shown in Figure 10, compared to Swin-Nutrition, RDINet (RGB-only) fails to attend to the correct region for calorie estimation and exhibits smaller activation areas for fat, carbohydrates, and protein. Compared to DPF-Nutrition, RDINet (RGB-only) shows weaker and more diffuse activations across all nutrients—predominantly low-response regions—particularly for carbohydrates and protein, indicating an inability to effectively focus on the central food area. This highlights the critical role of the RGB-D fusion module in leveraging depth information.
However, since RDINet requires aligned RGB-D image pairs as input, it is prone to failure on photos captured by standard mobile devices, which typically lack dedicated depth sensors and cannot produce high-quality depth maps. To address this limitation, a promising future direction is to integrate a monocular depth estimation module that predicts depth maps directly from a single RGB image. Recent advances in monocular depth estimation, such as Marigold and Depth Anything, have demonstrated high-accuracy depth prediction even in complex scenes, which could significantly improve the accessibility of our system on consumer-grade devices. By coupling such a depth estimator with RDINet, users would only need to take a regular RGB photo; the system could then generate a pseudo-depth map and perform accurate nutrition estimation, eliminating the dependence on specialized depth hardware and enabling broader deployment in everyday mobile scenarios.
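As a sketch of this direction, the Hugging Face transformers depth-estimation pipeline can produce a pseudo-depth map from a single RGB photo. The checkpoint name and the input file below are assumptions (any Depth Anything or comparable monocular model hosted on the Hub would serve), and the output would still need alignment and normalization before being fed to RDINet.

```python
from PIL import Image
from transformers import pipeline

# Hypothetical checkpoint; swap in whichever monocular depth model is preferred.
depth_estimator = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")

image = Image.open("meal_photo.jpg")      # a regular smartphone RGB photo (placeholder path)
result = depth_estimator(image)
pseudo_depth = result["depth"]            # PIL image; normalize/resize before feeding RDINet
pseudo_depth.save("meal_pseudo_depth.png")
```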
Similarly, the lack of ingredient annotations presents another major barrier to large-scale practical adoption; again, we first recap the role of the ingredient fusion module. Cross-attention, its key component, treats RGB-D visual features as queries and ingredient semantics as keys and values, computes cross-modal similarity, and dynamically injects semantic context. This enables explicit alignment between visual appearance and nutritional factors, enhancing the model's understanding of complex dishes and reducing estimation bias, especially when visual cues are ambiguous. Self-attention, in contrast, operates on the fused joint representation to model long-range dependencies, further strengthening responses to key ingredient regions and integrating global context, thus improving overall robustness in nutrient prediction.
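A minimal sketch of this cross-attention-then-self-attention pattern with nn.MultiheadAttention follows; the residual and LayerNorm arrangement and the dimensions are illustrative simplifications rather than the exact ingredient fusion module.

```python
import torch
import torch.nn as nn

class IngredientFusionSketch(nn.Module):
    """Cross-attention (visual queries, ingredient keys/values) followed by self-attention."""
    def __init__(self, dim: int = 384, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, visual_tokens, ingredient_emb):
        # Cross-attention: each visual token gathers semantic context from the ingredient embeddings.
        ctx, _ = self.cross_attn(visual_tokens, ingredient_emb, ingredient_emb)
        x = self.norm1(visual_tokens + ctx)
        # Self-attention: model long-range dependencies within the fused representation.
        sa, _ = self.self_attn(x, x, x)
        return self.norm2(x + sa)

fused = IngredientFusionSketch()(torch.randn(2, 256, 384), torch.randn(2, 12, 384))
print(fused.shape)  # torch.Size([2, 256, 384])
```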
As shown in Figure 10, RDINet (RGB + Depth, RGB-D fusion module) and DPF-Nutrition exhibit similar activation patterns, both suffering from misalignment-induced biases. When cross-attention is introduced—as in RDINet (RGB + Depth + Ingredient, RGB-D fusion module + cross-attention)—activations become stronger and more spatially concentrated, particularly for carbohydrates and protein. After further incorporating self-attention, RDINet exhibits increased activation intensity and more comprehensive coverage across all nutrients, demonstrating that the full module significantly improves focus accuracy and reduces bias from hidden factors. This underscores the importance of integrating ingredient information via the proposed fusion module.
However, in real-world scenarios, users often do not know the exact ingredient composition of their meals—especially for complex or mixed dishes—and model performance degrades substantially in the absence of accurate ingredient labels. To overcome this, a crucial future direction is to integrate an automatic food ingredient recognition module that takes a single RGB image as input and outputs a list of primary ingredients. Thanks to rapid progress in large vision-language models such as CLIP and Florence-2, precise identification of multiple ingredients from complex food images has become increasingly feasible. Integrating such a recognition module with RDINet would enable the system to generate ingredient lists directly from user-captured photos, achieving end-to-end nutrition estimation without external annotation or database dependency—thereby greatly enhancing practicality, user-friendliness, and scalability for real-world deployment.
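As a sketch of such a front end, CLIP can score a candidate ingredient vocabulary against a photo in zero-shot fashion; the candidate list, prompt template, probability threshold, and image path below are illustrative, and a softmax over candidates is a simplification of proper multi-label tagging.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical candidate vocabulary; in practice this would cover the dataset's ingredient list.
candidates = ["scrambled eggs", "hash browns", "bacon", "rice", "broccoli", "tofu"]
prompts = [f"a photo of {c}" for c in candidates]
image = Image.open("meal_photo.jpg")  # placeholder path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

# Keep candidates above an illustrative threshold as the predicted ingredient list.
predicted = [c for c, p in zip(candidates, probs.tolist()) if p > 0.15]
print(predicted)  # candidate ingredients to feed the ingredient fusion module
```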
Despite the aforementioned challenges, this study lays a solid foundation for convenient and automated dietary assessment and outlines a clear evolutionary pathway. With the continuous expansion of high-quality datasets and the ongoing enhancement of front-end perception capabilities, the proposed core framework holds great promise to evolve from a research prototype into a highly automated and reliable technical platform, enabling large-scale deployment across broad domains such as health management, remote nutritional intervention, and even clinical decision support—thereby delivering tangible societal and health-related value.

6. Conclusions

To accurately estimate dietary nutrition, this paper proposes a multimodal fusion network—RDINet. This method constructs a more comprehensive and robust food representation by jointly leveraging the appearance information from RGB images, the 3D spatial structure information provided by depth images, and ingredient semantic knowledge.
Specifically, RGB images can effectively capture the color, texture, and surface details of food, while depth images supplement geometric shape, stacking relationships, and volume information. The fusion of these two modalities enhances the model’s joint perception capability in terms of food category, shape, and 3D morphology. Building upon this, we embed ingredient composition into visual features, enabling the model to not only rely on appearance for judgment but also leverage prior knowledge to identify ingredients that are invisible after cooking or visually highly similar yet nutritionally distinct. This deep integration of vision and semantics effectively mitigates the nutritional estimation bias caused by relying solely on images.
We evaluated RDINet on the public Nutrition5k dataset. Compared to current mainstream nutritional estimation methods, RDINet achieves better performance in estimating the five nutritional attributes, verifying the effectiveness and necessity of the multimodal information fusion strategy.

Author Contributions

Conceptualization, Z.K. and D.S.; methodology, J.Y. and H.G.; software, D.S. and H.G.; validation, J.Z. and H.G.; formal analysis, J.Z. and H.G.; investigation, L.S. and H.G.; resources, J.Y. and H.G.; data curation, D.S. and H.G.; writing—original draft preparation, Z.K. and H.G.; writing—review and editing, L.S. and H.G.; visualization, D.S. and H.G.; supervision, J.Z. and H.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Jilin Provincial Department of Science and Technology under Grants YDZJ202501ZYTS589 and 20230401092YY.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Our code is available at https://github.com/yongerstar/RDINet (accessed on 24 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Black, R.E.; Victora, C.G.; Walker, S.P.; Bhutta, Z.A.; Christian, P.; De Onis, M.; Ezzati, M.; Grantham-McGregor, S.; Katz, J.; Martorell, R.; et al. Maternal and Child Undernutrition and Overweight in Low-Income and Middle-Income Countries. Lancet 2013, 382, 427–451. [Google Scholar] [CrossRef]
  2. Hou, S.; Feng, Z.; Xiong, H.; Min, W.; Li, P.; Jiang, S. DSDGF-Nutri: A Decoupled Self-Distillation Network with Gating Fusion For Food Nutritional Assessment. In Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, 27–31 October 2025; pp. 5218–5227. [Google Scholar]
  3. Willett, W.C.; Sampson, L.; Stampfer, M.J.; Rosner, B.; Bain, C.; Witschi, J.; Hennekens, C.H.; Speizer, F.E. Reproducibility and validity of a semiquantitative food frequency questionnaire. Am. J. Epidemiol. 1985, 122, 51–65. [Google Scholar] [CrossRef] [PubMed]
  4. Subar, A.F.; Kirkpatrick, S.I.; Mittl, B.; Zimmerman, T.P.; Thompson, F.E.; Bingley, C.; Willis, G.; Islam, N.G.; Baranowski, T.; McNutt, S.; et al. The Automated Self-Administered 24-Hour Dietary Recall (ASA24): A Resource for Researchers, Clinicians, and Educators from the National Cancer Institute. J. Acad. Nutr. Diet. 2012, 112, 1134–1137. [Google Scholar] [CrossRef]
  5. Chotwanvirat, P.; Prachansuwan, A.; Sridonpai, P.; Kriengsinyos, W. Advancements in Using AI for Dietary Assessment Based on Food Images: Scoping Review. J. Med. Internet Res. 2024, 26, e51432. [Google Scholar] [CrossRef] [PubMed]
  6. Albaladejo, L.; Giai, J.; Deronne, C.; Baude, R.; Bosson, J.-L.; Bétry, C. Assessing Real-Life Food Consumption in Hospital with an Automatic Image Recognition Device: A Pilot Study. Clin. Nutr. ESPEN 2025, 68, 319–325. [Google Scholar] [CrossRef] [PubMed]
  7. Cofre, S.; Sanchez, C.; Quezada-Figueroa, G.; López-Cortés, X.A. Validity and Accuracy of Artificial Intelligence-Based Dietary Intake Assessment Methods: A Systematic Review. Br. J. Nutr. 2025, 133, 1241–1253. [Google Scholar] [CrossRef]
  8. Mariappan, A.; Bosch, M.; Zhu, F.; Boushey, C.J.; Kerr, D.A.; Ebert, D.S.; Delp, E.J. Personal Dietary Assessment Using Mobile Devices; Bouman, C.A., Miller, E.L., Pollak, I., Eds.; SPIE: San Jose, CA, USA, 2009; p. 72460Z. [Google Scholar]
  9. Ege, T.; Yanai, K. Multi-Task Learning of Dish Detection and Calorie Estimation. In Proceedings of the Joint Workshop on Multimedia for Cooking and Eating Activities and Multimedia Assisted Dietary Management, Stockholm, Sweden, 15 July 2018; pp. 53–58. [Google Scholar]
  10. Situju, S.F.; Takimoto, H.; Sato, S.; Yamauchi, H.; Kanagawa, A.; Lawi, A. Food Constituent Estimation for Lifestyle Disease Prevention by Multi-Task CNN. Appl. Artif. Intell. 2019, 33, 732–746. [Google Scholar] [CrossRef]
  11. Fang, S.; Shao, Z.; Kerr, D.A.; Boushey, C.J.; Zhu, F. An End-to-End Image-Based Automatic Food Energy Estimation Technique Based on Learned Energy Distribution Images: Protocol and Methodology. Nutrients 2019, 11, 877. [Google Scholar] [CrossRef]
  12. Papathanail, I.; Vasiloglou, M.F.; Stathopoulou, T.; Ghosh, A.; Baumann, M.; Faeh, D.; Mougiakakou, S. A Feasibility Study to Assess Mediterranean Diet Adherence Using an AI-Powered System. Sci. Rep. 2022, 12, 17008. [Google Scholar] [CrossRef]
  13. Keller, M.; Tai, C.A.; Chen, Y.; Xi, P.; Wong, A. NutritionVerse-Direct: Exploring Deep Neural Networks for Multitask Nutrition Prediction from Food Images. arXiv 2024, arXiv:2405.07814. [Google Scholar]
  14. Xu, C.; He, Y.; Khanna, N.; Boushey, C.J.; Delp, E.J. Model-Based Food Volume Estimation Using 3D Pose. In Proceedings of the 2013 IEEE International Conference on Image Processing, Melbourne, Australia, 15–18 September 2013; pp. 2534–2538. [Google Scholar]
  15. Puri, M.; Zhu, Z.; Yu, Q.; Divakaran, A.; Sawhney, H. Recognition and Volume Estimation of Food Intake Using a Mobile Device. In Proceedings of the 2009 Workshop on Applications of Computer Vision (WACV), Snowbird, UT, USA, 7–8 December 2009; pp. 1–8. [Google Scholar]
  16. Chen, H.-C.; Jia, W.; Yue, Y.; Li, Z.; Sun, Y.-N.; Fernstrom, J.D.; Sun, M. Model-Based Measurement of Food Portion Size for Image-Based Dietary Assessment Using 3D/2D Registration. Meas. Sci. Technol. 2013, 24, 105701. [Google Scholar] [CrossRef] [PubMed]
  17. Myers, A.; Johnston, N.; Rathod, V.; Korattikara, A.; Gorban, A.; Silberman, N.; Guadarrama, S.; Papandreou, G.; Huang, J.; Murphy, K. Im2Calories: Towards an Automated Mobile Vision Food Diary. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1233–1241. [Google Scholar]
  18. Lu, Y.; Stathopoulou, T.; Vasiloglou, M.F.; Christodoulidis, S.; Stanga, Z.; Mougiakakou, S. An Artificial Intelligence-Based System to Assess Nutrient Intake for Hospitalised Patients. IEEE Trans. Multimed. 2021, 23, 1136–1147. [Google Scholar] [CrossRef]
  19. Thames, Q.; Karpur, A.; Norris, W.; Xia, F.; Panait, L.; Weyand, T.; Sim, J. Nutrition5k: Towards Automatic Nutritional Understanding of Generic Food. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 8899–8907. [Google Scholar]
  20. Tahir, G.A.; Loo, C.K. A Comprehensive Survey of Image-Based Food Recognition and Volume Estimation Methods for Dietary Assessment. Healthcare 2021, 9, 1676. [Google Scholar] [CrossRef]
  21. Shroff, G.; Smailagic, A.; Siewiorek, D.P. Wearable Context-Aware Food Recognition for Calorie Monitoring. In Proceedings of the 2008 12th IEEE International Symposium on Wearable Computers, Pittsburgh, PA, USA, 28 September–1 October 2008; pp. 119–120. [Google Scholar]
  22. Chotwanvirat, P.; Hnoohom, N.; Rojroongwasinkul, N.; Kriengsinyos, W. Feasibility Study of an Automated Carbohydrate Estimation System Using Thai Food Images in Comparison With Estimation by Dietitians. Front. Nutr. 2021, 8, 732449. [Google Scholar] [CrossRef]
  23. Ruede, R.; Heusser, V.; Frank, L.; Roitberg, A.; Haurilet, M.; Stiefelhagen, R. Multi-Task Learning for Calorie Prediction on a Novel Large-Scale Recipe Dataset Enriched with Nutritional Information. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 4001–4008. [Google Scholar]
  24. Kusuma, J.D.; Yang, H.-L.; Yang, Y.-L.; Chen, Z.-F.; Shiao, S.-Y.P.K. Validating Accuracy of a Mobile Application against Food Frequency Questionnaire on Key Nutrients with Modern Diets for mHealth Era. Nutrients 2022, 14, 537. [Google Scholar] [CrossRef] [PubMed]
  25. Nguyen, P.H.; Tran, L.M.; Hoang, N.T.; Trương, D.T.T.; Tran, T.H.T.; Huynh, P.N.; Koch, B.; McCloskey, P.; Gangupantulu, R.; Folson, G.; et al. Relative Validity of a Mobile AI-Technology–Assisted Dietary Assessment in Adolescent Females in Vietnam. Am. J. Clin. Nutr. 2022, 116, 992–1001. [Google Scholar] [CrossRef]
  26. Folson, G.K.; Bannerman, B.; Atadze, V.; Ador, G.; Kolt, B.; McCloskey, P.; Gangupantulu, R.; Arrieta, A.; Braga, B.C.; Arsenault, J.; et al. Validation of Mobile Artificial Intelligence Technology–Assisted Dietary Assessment Tool Against Weighed Records and 24-Hour Recall in Adolescent Females in Ghana. J. Nutr. 2023, 153, 2328–2338. [Google Scholar] [CrossRef]
  27. Lee, H.-A.; Huang, T.-T.; Yen, L.-H.; Wu, P.-H.; Chen, K.-W.; Kung, H.-H.; Liu, C.-Y.; Hsu, C.-Y. Precision Nutrient Management Using Artificial Intelligence Based on Digital Data Collection Framework. Appl. Sci. 2022, 12, 4167. [Google Scholar] [CrossRef]
  28. Liang, Y.; Li, J. Computer Vision-Based Food Calorie Estimation: Dataset, Method, and Experiment. arXiv 2017, arXiv:1705.07632. [Google Scholar] [CrossRef]
  29. Chen, Y.; He, J.; Vinod, G.; Raghavan, S.; Czarnecki, C.; Ma, J.; Mahmud, T.I.; Coburn, B.; Mao, D.; Nair, S.; et al. MetaFood3D: 3D Food Dataset with Nutrition Values. arXiv 2024, arXiv:2409.01966. [Google Scholar]
  30. Banerjee, S.; Palsani, D.; Mondal, A.C. Nutritional Content Detection Using Vision Transformers- An Intelligent Approach. Int. J. Innov. Res. Eng. Manag. 2024, 11, 21–27. [Google Scholar] [CrossRef]
  31. Xiao, Z.; Gao, X.; Wang, X.; Deng, Z. Visual Transformers for Food Image Recognition: A Comprehensive Review. arXiv 2024, arXiv:2503.18997. [Google Scholar]
  32. Ando, Y.; Ege, T.; Cho, J.; Yanai, K. DepthCalorieCam: A Mobile Application for Volume-Based FoodCalorie Estimation Using Depth Cameras. In Proceedings of the 5th International Workshop on Multimedia Assisted Dietary Management, Nice, France, 21 October 2019; pp. 76–81. [Google Scholar]
  33. Kwan, Z.; Zhang, W.; Wang, Z.; Ng, A.B.; See, S. Nutrition Estimation for Dietary Management: A Transformer Approach with Depth Sensing. IEEE Trans. Multimed. 2025, 27, 6047–6058. [Google Scholar] [CrossRef]
  34. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2024, arXiv:2304.07193. [Google Scholar] [CrossRef]
  35. Han, Y.; Cheng, Q.; Wu, W.; Huang, Z. DPF-Nutrition: Food Nutrition Estimation via Depth Prediction and Fusion. Foods 2023, 12, 4293. [Google Scholar] [CrossRef] [PubMed]
  36. Shao, W.; Min, W.; Hou, S.; Luo, M.; Li, T.; Zheng, Y.; Jiang, S. Vision-Based Food Nutrition Estimation via RGB-D Fusion Network. Food Chem. 2023, 424, 136309. [Google Scholar] [CrossRef]
  37. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Computer Vision—ECCV 2018, Proceedings of the 15th European Conference, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; Volume 11211, pp. 3–19. ISBN 978-3-030-01233-5. [Google Scholar]
  38. Chen, Z.; Wang, J.; Wang, Y. Enhancing Food Image Recognition by Multi-Level Fusion and the Attention Mechanism. Foods 2025, 14, 461. [Google Scholar] [CrossRef]
  39. Al-Saffar, M.; Baiee, W.R. Nutrition Information Estimation from Food Photos Using Machine Learning Based on Multiple Datasets. Bull. Electr. Eng. Inform. 2022, 11, 2922–2929. [Google Scholar] [CrossRef]
  40. Zhao, F.; Zhang, C.; Geng, B. Deep Multimodal Data Fusion. ACM Comput. Surv. 2024, 56, 1–36. [Google Scholar] [CrossRef]
  41. Wei, X.; Zhang, T.; Li, Y.; Zhang, Y.; Wu, F. Multi-Modality Cross Attention Network for Image and Sentence Matching. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10938–10947. [Google Scholar]
  42. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  43. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  44. Wang, B.; Bu, T.; Hu, Z.; Yang, L.; Zhao, Y.; Li, X. Coarse-to-Fine Nutrition Prediction. IEEE Trans. Multimed. 2024, 26, 3651–3662. [Google Scholar] [CrossRef]
  45. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar] [CrossRef]
  46. Shao, W.; Hou, S.; Jia, W.; Zheng, Y. Rapid Non-Destructive Analysis of Food Nutrient Content Using Swin-Nutrition. Foods 2022, 11, 3429. [Google Scholar] [CrossRef]
  47. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar] [CrossRef]
  48. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  49. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  50. Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021. [Google Scholar]
Figure 1. Overall framework of RDINet for food nutrition estimation.
Figure 2. Illustrative samples from Nutrition5k. (a) RGB images. (b) Corresponding depth maps. (c) Associated nutrient labels.
Figure 3. Examples of unsuitable images: (a) contains non-food content; (b) captures only a partial view of the meal; (c) exhibits overlapping food items.
Figure 4. The architecture of the channel–spatial attention fusion block.
Figure 5. The architecture of the feature integration block.
Figure 6. The architecture of the ingredient fusion module.
Figure 7. The architecture of the nutrition regressor.
Figure 8. (a) Training loss and (b) validation PMAE curves for RDINet on Nutrition5k.
Figure 9. The visualization results. (a) Heatmaps generated by the model for estimating various nutritional contents. (b) The corresponding detailed nutritional composition.
Figure 10. Comparison of feature visualizations for different methods on Dish_1562085185. Red indicates high attention areas, blue indicates low attention areas.
Table 1. Comparison of commonly used food nutrition estimation datasets.

| Dataset | Categories | Samples | Image/Video | Depth | Ingredients | Nutrition |
|---|---|---|---|---|---|---|
| ECUSTFD [28] | 19 | 2978 | Image | N | N | Mass |
| MetaFood3D [29] | 108 | 637 | Video | Y | N | Calories, Mass, Fat, Carbs and Protein |
| Nutrition5k [19] | 250 | 5006 | Image and Video | Y | Y | Calories, Mass, Fat, Carbs and Protein |
Table 2. Performance comparison of mainstream nutrition estimation methods on Nutrition5k. Lower PMAE (%) indicates better performance. The best results are highlighted in bold.

| Multimodal Data | Methods | Calorie PMAE (%) | Mass PMAE (%) | Fat PMAE (%) | Carb. PMAE (%) | Protein PMAE (%) | Mean PMAE (%) |
|---|---|---|---|---|---|---|---|
| RGB | Google-Nutrition-rgb [19] | 30.2 | 24.6 | 40.8 | 37.0 | 36.4 | 34.5 |
| RGB | Coarse-to-Fine Nutrition [44] | 29.4 | 25.7 | 42.2 | 38.3 | 39.8 | 35.1 |
| RGB | ViT [45] | 25.6 | 21.4 | 37.6 | 33.2 | 34.9 | 30.5 |
| RGB | Swin-Nutrition [46] | 23.7 | 19.8 | 32.4 | 29.4 | 31.7 | 27.4 |
| RGB + Depth | Google-Nutrition-rgbd [19] | 23.4 | 23.7 | 25.1 | 30.4 | 25.2 | 25.6 |
| RGB + Depth | DPF-Nutrition [35] | 19.6 | 15.7 | 27.0 | 26.1 | 24.9 | 22.7 |
| RGB + Depth | RGB-D Net [36] | 20.2 | 15.4 | 26.2 | 27.8 | 25.8 | 23.1 |
| RGB + Depth + Ingredient | DSDGF-Nutri [2] | 16.2 | 11.8 | 24.5 | 22.4 | 22.9 | 19.6 |
| RGB + Depth + Ingredient | RDINet | **14.9** | **11.2** | **19.7** | **18.9** | **19.5** | **16.8** |
Table 3. Mass estimation performance of mainstream methods on ECUSTFD. The best result is highlighted in bold.

| Methods | Mass PMAE (%) |
|---|---|
| Google-Nutrition-rgb | 31.0 |
| Coarse-to-Fine Nutrition | 22.4 |
| ViT | 20.1 |
| Swin-Nutrition | **16.9** |
| RDINet (RGB-only) | 18.2 |
Table 4. Ablation study of key architectural components on Nutrition5k. The best results are highlighted in bold.

| Methods | Calorie PMAE (%) | Mass PMAE (%) | Fat PMAE (%) | Carb. PMAE (%) | Protein PMAE (%) | Mean PMAE (%) |
|---|---|---|---|---|---|---|
| RGB | 28.3 | 23.7 | 38.8 | 39.1 | 37.2 | 33.4 |
| Depth | 26.7 | 21.9 | 35.7 | 34.2 | 33.3 | 30.4 |
| RGB + Depth (channel attention) | 23.1 | 19.2 | 28.5 | 30.4 | 29.1 | 26.1 |
| RGB + Depth (spatial attention) | 24.0 | 20.1 | 29.7 | 31.2 | 30.3 | 27.1 |
| RGB + Depth (channel–spatial attention, RGB-D Fusion Module) | 21.5 | 17.3 | 25.8 | 28.3 | 26.7 | 23.9 |
| RGB + Depth + Ingredient (RGB-D Fusion Module + cross-attention) | 16.8 | 12.9 | 21.5 | 20.7 | 21.2 | 18.6 |
| RGB + Depth + Ingredient (RGB-D Fusion Module + self-attention) | 19.2 | 15.1 | 24.3 | 23.5 | 23.8 | 21.2 |
| RGB + Depth + Ingredient (RGB-D Fusion Module + cross-attention + self-attention, i.e., RGB-D Fusion Module + Ingredient Fusion Module) | **14.9** | **11.2** | **19.7** | **18.9** | **19.5** | **16.8** |
Table 5. Ablation study on the number of feature extraction layers on Nutrition5k. The best result in each column is highlighted in bold.

| Methods | Calorie PMAE (%) | Mass PMAE (%) | Fat PMAE (%) | Carb. PMAE (%) | Protein PMAE (%) | Mean PMAE (%) |
|---|---|---|---|---|---|---|
| One layer (11) | 17.7 | 19.5 | 26.3 | 25.0 | 24.8 | 22.7 |
| Two layers (6, 11) | 15.9 | 13.9 | 23.8 | 22.8 | 23.2 | 19.9 |
| Three layers (0, 6, 11) | 14.9 | 11.2 | **19.7** | **18.9** | **19.5** | **16.8** |
| Four layers (0, 4, 8, 11) | **14.6** | **11.1** | 19.9 | 19.3 | 19.8 | 16.9 |
Table 6. Ablation study on different backbone networks for feature extraction on Nutrition5k. The best results are highlighted in bold.

| Methods | Calorie PMAE (%) | Mass PMAE (%) | Fat PMAE (%) | Carb. PMAE (%) | Protein PMAE (%) | Mean PMAE (%) |
|---|---|---|---|---|---|---|
| VGG16 | 22.0 | 17.8 | 28.4 | 30.2 | 25.1 | 24.7 |
| InceptionV3 | 21.8 | 17.1 | 26.0 | 28.8 | 24.3 | 23.6 |
| ResNet-18 | 21.3 | 16.7 | 25.8 | 27.4 | 24.1 | 23.1 |
| ResNet-50 | 20.9 | 16.4 | 25.7 | 26.8 | 23.6 | 22.7 |
| PVT | 16.2 | 13.2 | 23.3 | 24.6 | 22.4 | 19.9 |
| DINOv2 | **14.9** | **11.2** | **19.7** | **18.9** | **19.5** | **16.8** |
Table 7. Ablation study of shared and separate ingredient fusion strategies. The best results are highlighted in bold.

| Methods | Calorie PMAE (%) | Mass PMAE (%) | Fat PMAE (%) | Carb. PMAE (%) | Protein PMAE (%) | Mean PMAE (%) |
|---|---|---|---|---|---|---|
| Shared ingredient fusion | 17.1 | 13.1 | 25.3 | 24.7 | 23.6 | 20.8 |
| Separate ingredient fusion | **14.9** | **11.2** | **19.7** | **18.9** | **19.5** | **16.8** |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
