Article

IGSMNet: Ingredient-Guided Semantic Modeling Network for Food Nutrition Estimation

by Donglin Zhang, Weixiang Shi, Boyuan Ma, Weiqing Min and Xiao-Jun Wu
1 School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
2 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100086, China
* Author to whom correspondence should be addressed.
Foods 2025, 14(21), 3697; https://doi.org/10.3390/foods14213697
Submission received: 17 July 2025 / Revised: 21 September 2025 / Accepted: 29 September 2025 / Published: 30 October 2025
(This article belongs to the Section Food Nutrition)

Abstract

In recent years, food nutrition estimation has received growing attention due to its critical role in dietary analysis and public health. Traditional nutrition assessment methods often rely on manual measurements and expert knowledge, which are time-consuming and not easily scalable. With the advancement of computer vision, RGB-based methods have been proposed, and more recently, RGB-D-based approaches have further improved performance by incorporating depth information to capture spatial cues. While these methods have shown promising results, they still face challenges in complex food scenes, such as limited ability to distinguish visually similar items with different ingredients and insufficient modeling of spatial or semantic relationships. To solve these issues, we propose an Ingredient-Guided Semantic Modeling Network (IGSMNet) for food nutrition estimation. The method introduces an ingredient-guided module that encodes ingredient information using a pre-trained language model and aligns it with visual features via cross-modal attention. At the same time, an internal semantic modeling component is designed to enhance structural understanding through dynamic positional encoding and localized attention, allowing for fine-grained relational reasoning. On the Nutrition5k dataset, our method achieves PMAE values of 12.2% for Calories, 9.4% for Mass, 19.1% for Fat, 18.3% for Carb, and 16.0% for Protein. These results demonstrate that our IGSMNet consistently outperforms existing baselines, validating its effectiveness.

1. Introduction

Food nutrition plays a vital role in both daily diet management and clinical nutrition planning, particularly as public attention to health and wellness continues to increase [1]. In its earlier stages, nutrition assessment primarily relied on traditional biochemical methods [2,3]. While these conventional methods are well-established, they often require domain expertise for operation and interpretation, limiting their accessibility and scalability. Furthermore, traditional nutritional assessment methods are often labor-intensive and time-consuming. These limitations hinder their ability to meet the growing demand for rapid, accessible, and precise nutritional evaluation [4,5,6]. To address these issues, recent developments [7,8,9,10,11] in artificial intelligence [12,13,14], particularly in computer vision, have opened new possibilities for automated, accurate, and scalable nutritional assessment, offering a promising alternative to conventional approaches.
In recent years, some vision-based nutritional assessment methods [11,15] have been proposed. These methods employ deep learning [16,17,18,19,20] or machine learning [21,22,23,24] to analyze food images. They learn visual representations that support the direct prediction of nutritional content from visual data. By directly using the visual feature representation, such approaches aim to simplify the assessment process and minimize manual intervention. For instance, Swin-nutrition [25] integrates the Swin Transformer [26] with a feature fusion module and a nutrient prediction module to estimate nutritional content directly from images. Similarly, RoDE [27] employs a mixture-of-experts framework that dynamically adapts to varying task complexities, enhancing precision in nutritional estimation. Despite these advancements, these methods remain constrained by several challenges. For example, fluctuations in lighting conditions can distort the quality of RGB images and subtle food components are often difficult to detect due to their low visual contrast and the limited saliency in RGB images. Moreover, a more fundamental limitation lies in the absence of spatial or depth information, which restricts the ability to capture structural cues from food images. These challenges reduce the performance of nutrient estimation. Therefore, combining depth information with RGB images provides a promising direction to improve the performance of vision-based nutritional estimation methods.
To address the limitations of RGB-only methods, recent research has explored RGB-D-based nutritional assessment, where depth information is integrated with RGB images to provide richer spatial and structural cues. These methods [19,28,29] generally outperform purely RGB-based models. Building on this direction, several representative methods have been developed, including Google-Nutrition [28], IGFNutrition [10], and ADFE [19]. To be specific, Google-Nutrition first investigates depth information for nutrition estimation, showing that spatial cues can significantly enhance nutrient prediction performance. Afterwards, IGFNutrition further combines RGB images, depth maps, and ingredient information to build a more comprehensive food representation for nutritional analysis. ADFE adopts window-based and shifted-window self-attention to enhance visual feature learning, improving the performance significantly. While these methods have achieved good results, several challenges remain, such as distinguishing visually similar dishes with different compositions and capturing fine-grained spatial or semantic relationships. Therefore, how to develop a more effective RGB-D-based method that can fully exploit the complementary strengths of visual and depth information remains an open and pressing problem.
To solve the above problem, we propose an Ingredient-Guided Semantic Modeling Network (IGSMNet) for food nutrition estimation. The developed IGSMNet contains two core components: an ingredient-guided module and an internal semantic modeling scheme. Specifically, the first module focuses on bridging ingredient semantics and visual representation. We utilize a dual-branch network to extract multi-scale features from RGB and depth images, capturing complementary texture and spatial cues. The second module addresses the need for internal spatial and contextual reasoning. It comprises two components: a dynamic position encoding (DPE) mechanism and a fine-grained semantic modeling mechanism. The DPE mechanism introduces learnable spatial bias into the attention process, helping the model preserve relative position information. Meanwhile, the fine-grained modeling component enhances local semantic interaction by allowing each feature point to attend to its surrounding context. These components together enrich the internal dependencies among features. The contributions of this work are summarized as follows:
  • We propose a novel Ingredient-Guided Semantic Modeling Network (IGSMNet) for food nutrition estimation, which jointly integrates RGB-D visual features and ingredient semantics to enhance the estimation accuracy.
  • We develop an ingredient-guided fusion module that utilizes ingredient information to guide visual feature learning. This enables the network to focus on nutritionally relevant regions and enhances its discrimination.
  • We introduce an internal semantic modeling strategy composed of dynamic position encoding and fine-grained semantic modeling, which collectively strengthen contextual feature representation.
  • Extensive experiments on the Nutrition5k dataset show that the proposed IGSMNet can achieve promising results.
The remainder of this paper is organized as follows: Section 2 briefly reviews the related work. Section 3 introduces the proposed method in detail, and Section 4 reports the experimental results. A discussion is provided in Section 5. Finally, we conclude the work in Section 6.

2. Related Work

In the early stages, nutritional assessment primarily relied on manual recording and laboratory-based analysis. While these traditional methods [2,3] are effective, they are often time-consuming and require significant human effort. Recently, with the development of deep learning, a variety of vision-based methods have emerged, enabling faster and more accessible nutritional estimation. In this section, we mainly review recent advances in vision-based nutritional assessment. Current approaches in this field can be broadly categorized into three groups: methods based on RGB images, those incorporating RGB-D data, and methods guided by ingredient information.

2.1. RGB Image-Based Methods

Early RGB-based methods mainly relied on convolutional neural networks (CNNs) to extract visual features from food images for nutrition estimation. For example, NR et al. [30] utilize CNNs to learn image features for direct prediction of nutritional content. To enhance the relevance of visual features to caloric estimation, Ege et al. [31] formulate food classification as an auxiliary objective, exploiting the inherent association between food types and caloric content. The approach performs simultaneous dish localization and calorie estimation through a multitask learning scheme. Similarly, Fang et al. [32] observe a strong correlation between energy density and caloric distribution and propose a generative adversarial network to produce energy density maps, improving the relevance of visual cues to calorie estimation. To capture more complex dependencies beyond the capability of CNNs, Shao et al. [25] adopt a transformer-based model and perform multi-scale feature fusion for better representation. In response to the limited performance of direct regression approaches, Wang et al. [33] reformulate the estimation task as classification and design a coarse-to-fine strategy using a structured smoothing objective function. With the increasing application of large multimodal models in vision tasks, Jiao et al. [27] propose a linear rectification mixture of experts to address task conflicts during multi-objective fine-tuning. While these RGB-based methods achieve competitive results, they lack the spatial depth cues necessary for estimating volume, which is critical for accurate nutrient assessment. As a result, recent research has begun to explore the integration of RGB and depth information to address this limitation.

2.2. RGB-D Image-Based Methods

Recent works have incorporated depth information to perform the nutrition estimation task, aiming to address the limitations of RGB images in capturing spatial information. For example, Meyers et al. [34] formulate the caloric estimation task as a two-step process, where food class recognition is first used to assign predefined caloric densities and volume is subsequently inferred from depth data to compute the final nutritional value. Based on this, Thames et al. [28] integrate depth images directly into the visual input by appending them as a fourth channel to the RGB representation, which is then processed by a CNN to assess nutritional information, including Calories, Proteins, Fats, and Carbs. Although RGB-D input is effective, early methods often suffer from insufficient modeling of cross-modal relationships and insufficient exploitation of multi-scale features. To solve these limitations, many methods have been developed. For instance, Shao et al. [11] introduce the RGB-D network, which employs multi-scale feature aggregation and cross-modal fusion schemes to capture fine-grained structure, thereby improving estimation performance. Ma et al. [8] develop the FBFPN model, which incorporates a bidirectional feature pyramid alongside an RGB-D fusion module to enhance visual representation. In addition, considering that depth images are difficult to obtain in practical applications, Han et al. [35] propose DPF-Nutrition, which leverages a depth prediction module and fuses the generated depth with RGB features via an attention scheme. Although these RGB-D-based approaches can achieve promising results, they fail to account for ingredient-level information embedded in food images. Therefore, how to incorporate ingredient cues to improve nutritional estimation needs to be further explored.

2.3. Ingredient-Guided Methods

To address the limitation of RGB-D-based methods in overlooking ingredient information, recent studies have explored the integration of ingredient-level semantics to further enhance nutritional assessment. For instance, Nian et al. [36] represent ingredient information using textual descriptors, which are encoded into semantic features and aligned with visual representations to enhance representational completeness. Feng et al. [10] perform RGB-D feature fusion by transforming visual representations into the frequency domain and employing a hierarchical multi-scale fusion scheme. Moreover, to enhance semantic alignment, ingredient information is subsequently incorporated as guidance during feature learning, improving nutritional estimation performance. While these studies have demonstrated the efficacy of using ingredient information, most existing methods perform ingredient guidance at a coarse level and lack schemes to effectively model the fine-grained contextual dependencies between ingredients and visual features. Therefore, how to combine ingredient information and improve the effectiveness of semantic modeling needs further research.

3. Method

In this section, we first provide an overview of our proposed model, followed by a detailed description of its individual modules.

3.1. Overview

Figure 1 illustrates the overall structure of our method, which is composed of two primary modules. The first is the ingredient-guided (IG) module, which incorporates the ingredient information to enhance visual representations, enabling the model to better associate visual patterns with nutritional content. The second is the internal semantic modeling (ISM) module, which aims to improve the semantic expressiveness of the visual feature pyramid through intrinsic semantic perception. Moreover, the second module includes two subcomponents: dynamic positional encoding and fine-granularity modeling. The positional encoding dynamically adapts to features at different scales. The fine-granularity modeling component captures contextual relationships among visual elements at each level of the pyramid. Together, these two modules significantly enhance the representational capacity of the constructed food feature pyramid, thereby improving the performance.
In the proposed method, we take as input an RGB image $X_{\mathrm{rgb}} \in \mathbb{R}^{H \times W \times 3}$ and its corresponding depth image $X_{\mathrm{depth}} \in \mathbb{R}^{H \times W \times 3}$, where H and W denote the image height and width, respectively. To extract modality-specific features, an asymmetric dual-branch architecture is employed, with each branch structured as a four-stage hierarchical network. The spatial resolution is progressively reduced by half at each stage, thereby increasing the receptive field and enhancing semantic abstraction. This process yields multi-scale feature sets for both modalities, represented as $F_{\mathrm{rgb}} = \{ F_{\mathrm{rgb}}^{i} \mid i = 1, 2, 3, 4 \}$ and $F_{\mathrm{depth}} = \{ F_{\mathrm{depth}}^{i} \mid i = 1, 2, 3, 4 \}$. Meanwhile, high-level features are further enriched through semantic fusion with CLIP-derived embeddings, while the ingredient-guided and internal semantic modeling modules jointly enhance representation learning. Finally, a multitask learning framework is employed to infer the content of five nutritional components.
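For concreteness, the following PyTorch snippet sketches one way to realize the asymmetric dual-branch, four-stage encoder described above. The paper does not specify the backbones, so ResNet-50 (RGB branch) and ResNet-18 (depth branch) are purely hypothetical stand-ins, and the 1×1 projections that align the depth-branch channels for the later element-wise fusion are likewise an assumption.

```python
# Minimal sketch of the asymmetric dual-branch, four-stage encoder (assumptions noted above).
# Requires torchvision >= 0.13 for the string-based `weights` argument.
import torch
import torch.nn as nn
from torchvision.models import resnet18, resnet50

class DualBranchEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.rgb = resnet50(weights="IMAGENET1K_V2")    # ImageNet-pre-trained, as stated in Sec. 4.1
        self.depth = resnet18(weights="IMAGENET1K_V1")
        # Hypothetical 1x1 convs aligning depth channels (64/128/256/512) with the RGB branch.
        self.align = nn.ModuleList(
            nn.Conv2d(cd, cr, kernel_size=1)
            for cd, cr in zip((64, 128, 256, 512), (256, 512, 1024, 2048))
        )

    @staticmethod
    def _stages(net, x):
        # Stem, then the four hierarchical stages; spatial resolution halves at each stage.
        x = net.maxpool(net.relu(net.bn1(net.conv1(x))))
        feats = []
        for stage in (net.layer1, net.layer2, net.layer3, net.layer4):
            x = stage(x)
            feats.append(x)
        return feats                                    # [F^1, F^2, F^3, F^4]

    def forward(self, x_rgb, x_depth):
        f_rgb = self._stages(self.rgb, x_rgb)
        f_depth = [proj(f) for proj, f in zip(self.align, self._stages(self.depth, x_depth))]
        return f_rgb, f_depth

enc = DualBranchEncoder()
f_rgb, f_depth = enc(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
print([tuple(f.shape) for f in f_rgb])                  # strides 4, 8, 16, 32
```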

3.2. Ingredient-Guided Module

In food nutrition estimation, it is essential to not only recognize ingredients based on visual appearance but also to understand their spatial information, as this directly impacts the accuracy of portion and nutrient assessment. While RGB images capture rich visual cues such as color and surface texture, they lack the ability to describe the spatial information (such as depth and volume). To address this issue, the depth images are leveraged in this work, which can help the model estimate size and spatial arrangement more precisely. To fully exploit the complementary strengths of RGB and depth modalities, we utilize a hierarchical fusion scheme. Specifically, we integrate features from both modalities at different levels. The formulation can be defined as follows:
$$F_{\mathrm{fused}}^{i} = F_{\mathrm{rgb}}^{i} + F_{\mathrm{depth}}^{i},$$
where $F_{\mathrm{fused}}^{i} \in \mathbb{R}^{B \times C \times H \times W}$ represents the fused feature at the i-th level. By integrating features from both RGB and depth modalities across scales, it can concurrently exploit appearance-based information and spatial structure. Moreover, lower-level representations can preserve detailed spatial and edge information, while higher-level features encode abstract semantic relationships.
To further improve the accuracy of food nutrition estimation, we introduce ingredient knowledge as a semantic guide for visual feature learning. By embedding ingredient information into the model, the network can be guided to attend to visual regions that are semantically aligned with specific food components, thereby improving its ability to recognize individual ingredients. As a result, the guidance mechanism directly contributes to more accurate and reliable nutritional estimation. Specifically, we utilize the pre-trained CLIP textual encoder to encode the ingredient information, which can be defined as follows:
$$F_{\mathrm{ingredient}} = \mathrm{CLIP}(\mathrm{ingredients}),$$
where $F_{\mathrm{ingredient}} \in \mathbb{R}^{B \times C}$ represents the food ingredient features, B is the batch size, and C is the embedding dimensionality. To facilitate the guidance of visual feature extraction using ingredient semantics, we use a cross-attention scheme [16] that captures the interaction between ingredient semantics and visual information. Prior to attention computation, we convert the two features into compatible shapes to allow effective alignment: $F_{\mathrm{ingredient}} \in \mathbb{R}^{B \times 1 \times C}$ and $F_{\mathrm{fused}}^{i} \in \mathbb{R}^{B \times (H \times W) \times C}$. Specifically, as illustrated in Figure 2, the cross-attention strategy is used to associate textual ingredient features with visual representations. The ingredient embedding serves as the query, while the fused visual features act as both keys and values. Then, we have:
$$Q = W_{Q} F_{\mathrm{ingredient}}, \quad K = W_{K} F_{\mathrm{fused}}^{i}, \quad V = W_{V} F_{\mathrm{fused}}^{i},$$
where $W_{Q}, W_{K}, W_{V} \in \mathbb{R}^{C \times C}$ are linear transformation matrices for the query, key, and value, respectively. The output of cross-attention is computed as:
$$\mathrm{CrossAttention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V,$$
where $d_{k}$ denotes the dimension of the keys, which in this case equals C. Combining the above elements, we obtain the ingredient-guided visual feature at each scale:
$$F_{\mathrm{guide}}^{i} = \mathrm{softmax}\!\left(\frac{(W_{Q} F_{\mathrm{ingredient}})(W_{K} F_{\mathrm{fused}}^{i})^{T}}{\sqrt{C}}\right)(W_{V} F_{\mathrm{fused}}^{i}).$$
The above operations yield the ingredient-guided feature map $F_{\mathrm{guide}}^{i}$ at scale i, in which the visual representation is enhanced based on the semantic relevance of the corresponding ingredient information. To ensure that the original visual cues are preserved during this guidance process, we incorporate a residual connection by summing the guided feature with its corresponding fused visual representation:
$$\bar{F}_{\mathrm{guide}}^{i} = F_{\mathrm{guide}}^{i} + F_{\mathrm{fused}}^{i}.$$
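The ingredient-guided fusion above can be sketched in PyTorch as follows. This is a minimal single-head version that assumes the ingredient text has already been encoded by the frozen CLIP text encoder into a vector of shape (B, C); layer names and dimensions are illustrative rather than the authors' exact implementation.

```python
import math
import torch
import torch.nn as nn

class IngredientGuidedFusion(nn.Module):
    """Sketch of RGB-D fusion followed by ingredient-as-query cross-attention."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W_Q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)   # W_V

    def forward(self, f_rgb, f_depth, f_ingredient):
        # Element-wise fusion of same-scale RGB and depth features.
        f_fused = f_rgb + f_depth                    # (B, C, H, W)
        B, C, H, W = f_fused.shape
        tokens = f_fused.flatten(2).transpose(1, 2)  # (B, H*W, C): keys/values
        query = f_ingredient.unsqueeze(1)            # (B, 1, C):   query

        # Cross-attention with the ingredient embedding as the query (d_k = C).
        q, k, v = self.w_q(query), self.w_k(tokens), self.w_v(tokens)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(C), dim=-1)  # (B, 1, H*W)
        guided = attn @ v                            # (B, 1, C) ingredient-aware summary

        # Residual connection: the summary broadcasts over all H*W positions.
        return (tokens + guided).transpose(1, 2).reshape(B, C, H, W)

# Usage on a mid-level feature map (shapes are illustrative):
fuse = IngredientGuidedFusion(dim=256)
out = fuse(torch.randn(2, 256, 28, 28), torch.randn(2, 256, 28, 28), torch.randn(2, 256))
```

Note that, following the formulation above, the attention output is a single ingredient-aware vector per image, which the residual connection broadcasts across all spatial positions of the fused feature map.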

3.3. Internal Semantic Modeling

After obtaining the ingredient-guided features, we observe that while these representations incorporate ingredient semantic information, they lack explicit modeling of internal spatial and compositional relationships among food components. To this end, and inspired by [37,38], we develop the internal semantic modeling strategy to further enhance representational capacity. Figure 3 shows the framework. This module can capture structural dependencies among feature elements by modeling the internal relationships within each feature map. Specifically, the proposed module consists of two key components. First, dynamic positional encoding is applied to introduce position-specific representations, enabling the network to differentiate visual elements based on their spatial locations. Second, fine-grained semantic modeling is used to capture local contextual relationships, ensuring that each feature element can attend to its surrounding semantic structure.

3.3.1. Dynamic Position Encoding

To encode relative spatial information among features, we adopt a dynamic position encoding (DPE) mechanism, which introduces learnable positional bias into the attention computation. Specifically, the attention map in the internal semantic modeling (ISM) module is rewritten as follows:
$$\mathrm{Attnmap} = \mathrm{Softmax}\left(QK^{T}/d + b\right)V,$$
where $Q, K, V \in \mathbb{R}^{G^{2} \times D}$ denote the query, key, and value matrices, respectively, d is a scaling factor for normalization, D is the embedding dimension, and G is the patch size, which will be formally defined in the next section. The term $b \in \mathbb{R}^{G^{2} \times G^{2}}$ represents the relative position bias matrix introduced by DPE. In [26], the bias is computed from a fixed table, $b_{i,j} = \hat{b}_{\Delta x_{ij}, \Delta y_{ij}}$, where $\hat{b}$ is a fixed-size matrix and $(\Delta x_{ij}, \Delta y_{ij})$ is the offset between tokens i and j. However, this scheme lacks adaptability to varying patch sizes across feature pyramid levels. To address this limitation, we introduce a dynamically generated relative position module named DPE, which is defined as:
$$b_{i,j} = \mathrm{DPE}(\Delta x_{ij}, \Delta y_{ij}),$$
where DPE is implemented as a multi-layer perceptron (MLP). It takes the offset $(\Delta x_{ij}, \Delta y_{ij})$ as input and produces a scalar output. The module consists of three fully connected layers with ReLU activations [39] and LayerNorm [40], and the intermediate layer dimension is set to D/4. This design ensures that the learned positional bias can adapt to different spatial contexts and generalize across variable input resolutions.
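A minimal sketch of the DPE described above is given below, assuming the bias table is regenerated for whatever window size is in use; beyond the stated three linear layers, ReLU, LayerNorm, and D/4 hidden width, the layer arrangement is an assumption.

```python
import torch
import torch.nn as nn

class DynamicPositionEncoding(nn.Module):
    """MLP that maps each relative offset (dx, dy) to a scalar bias b_{i,j}."""
    def __init__(self, embed_dim):
        super().__init__()
        hidden = embed_dim // 4
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.LayerNorm(hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, patch_size):
        G = patch_size
        device = next(self.mlp.parameters()).device
        coords = torch.stack(torch.meshgrid(
            torch.arange(G, device=device),
            torch.arange(G, device=device), indexing="ij"), dim=-1).reshape(-1, 2)
        # Offsets (dx, dy) between every pair of the G*G tokens inside one patch.
        offsets = (coords[:, None, :] - coords[None, :, :]).float()   # (G^2, G^2, 2)
        return self.mlp(offsets).squeeze(-1)                          # b in R^{G^2 x G^2}

# Usage: the returned bias is added inside the attention softmax shown above.
dpe = DynamicPositionEncoding(embed_dim=256)
b = dpe(patch_size=7)    # (49, 49) bias, regenerated for any window size
```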

3.3.2. Fine-Grained Modeling

To further capture local dependencies within patches, we employ a fine-grained modeling (FGM) mechanism. Given an ingredient-guided feature map $\bar{F}_{\mathrm{guide}}^{i}$ at level i, we partition it into N non-overlapping square patches of size $G \times G$. Each patch is independently flattened and projected into queries, keys, and values. Then, Q, K, and V are $f_{Q} = \bar{F}_{\mathrm{guide}}^{i,k} W_{Q}$, $f_{K} = \bar{F}_{\mathrm{guide}}^{i,k} W_{K}$, and $f_{V} = \bar{F}_{\mathrm{guide}}^{i,k} W_{V}$, respectively. For each attention head h in a multi-head self-attention module with H heads, the attention weights are computed as:
$$W_{1}^{h} = \mathrm{softmax}\!\left(\frac{f_{Q_{1}}^{h,i}\left(f_{K_{1}}^{h,j}\right)^{T}}{\sqrt{d/H}}\right),$$
where D/H is the feature dimension processed by each head. For FGM, k and j denote the index of the embedding, and h is the index of the head. The output for all heads is aggregated as:
$$f_{\mathrm{FGM}} = \sum_{h=1}^{H} \sum_{k,j=1}^{G} W_{1}^{h} \cdot f_{V_{1}}^{h,j}.$$
In addition, the internal semantic modeling strategy contains four components: DPE, FGM, a normalization layer (LayerNorm, LN), and a multilayer perceptron (MLP). The FGM module adopts a window-based multi-head self-attention scheme, where attention blocks are alternated and each block is equipped with residual connections. The transformer block applied to image input $x_{i}$ can be defined as:
$$\hat{U}_{j} = \mathrm{FGM}\left(\mathrm{LN}(U_{j-1})\right) + U_{j-1}, \qquad U_{j} = \mathrm{MLP}\left(\mathrm{LN}(\hat{U}_{j})\right) + \hat{U}_{j},$$
where $U_{j-1}$ is the input from the previous block and LN is the LayerNorm operation. The initial input $\bar{F}_{\mathrm{guide}}^{i}$ passes through L stacked ISM blocks, yielding the final refined feature representation at each pyramid level, denoted by $F_{\mathrm{ism}}^{i}$.
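Putting the pieces together, one ISM block can be sketched as below, reusing the DynamicPositionEncoding class from the previous snippet. The window size, head count, MLP expansion ratio, and GELU activation are illustrative assumptions; only the overall structure (window-based multi-head self-attention with a DPE bias, followed by an MLP, each sub-block wrapped with LayerNorm and a residual connection) follows the description above.

```python
import torch
import torch.nn as nn

class ISMBlock(nn.Module):
    """One internal semantic modeling block: FGM attention + MLP, pre-norm residuals."""
    def __init__(self, dim, num_heads=8, window=7):
        super().__init__()
        self.window, self.num_heads = window, num_heads
        self.norm1 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.dpe = DynamicPositionEncoding(dim)      # from the DPE sketch above
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def fgm(self, x):
        # x: (num_windows * B, G*G, D) tokens, one window per row of the batch axis.
        B_, N, D = x.shape
        H = self.num_heads
        q, k, v = self.qkv(x).reshape(B_, N, 3, H, D // H).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) / (D // H) ** 0.5   # scaled dot-product per head
        attn = attn + self.dpe(self.window)                  # add the DPE bias b
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B_, N, D)   # aggregate over heads
        return self.proj(out)

    def forward(self, x):
        x = x + self.fgm(self.norm1(x))   # FGM(LN(U)) + U
        x = x + self.mlp(self.norm2(x))   # MLP(LN(U^)) + U^
        return x

# Usage on window-partitioned tokens (4 windows of 7x7 tokens, 256-dim each):
block = ISMBlock(dim=256, num_heads=8, window=7)
refined = block(torch.randn(4, 49, 256))
```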

3.4. Training Objective

After obtaining the final feature pyramid $F_{\mathrm{ism}}$, we perform global feature aggregation to facilitate nutritional prediction. Specifically, each level of the pyramid is processed using adaptive average pooling. Then, we have:
$$P_{i} = \mathrm{AdaptiveAvgPool2d}\left(F_{\mathrm{ism}}^{i}, \mathrm{output\_size} = (1,1)\right),$$
The pooled outputs from all four pyramid levels are then concatenated along the channel dimension:
$$F_{\mathrm{concat}} = \mathrm{Cat}\left((P_{1}, P_{2}, P_{3}, P_{4}), \mathrm{dim} = 1\right),$$
where $F_{\mathrm{concat}} \in \mathbb{R}^{B \times (C \times 4)}$. The concatenated vector is then passed through a series of MLP layers to obtain the corresponding predicted results. Specifically, our method is trained using a multitask learning scheme to jointly predict five nutrients (i.e., Calories, Mass, Fats, Proteins, and Carbohydrates). As the value scales of these nutrients vary considerably, we normalize the loss for each nutritional component and apply the overall loss function as follows:
$$\mathcal{L} = l_{\mathrm{cal}} + l_{\mathrm{carb}} + l_{\mathrm{pro}} + l_{\mathrm{fat}} + l_{\mathrm{mass}},$$
where $l_{\mathrm{cal}}$, $l_{\mathrm{mass}}$, $l_{\mathrm{pro}}$, $l_{\mathrm{carb}}$, and $l_{\mathrm{fat}}$ represent the losses for Calories, Mass, Proteins, Carbohydrates, and Fats, respectively. Each component is computed as the normalized absolute error between the predicted and ground-truth values. For example, $l_{\mathrm{cal}}$ is computed as:
$$l_{\mathrm{cal}} = \frac{\sum_{i=1}^{N} |y_{i} - \hat{y}_{i}|}{\sum_{i=1}^{N} y_{i}},$$
where $y_{i}$ is the true nutritional value and $\hat{y}_{i}$ is the predicted nutritional value. The same formulation is applied to all five tasks, ensuring consistent error metrics across different nutrient types.
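A minimal sketch of the prediction head and the normalized multitask loss is shown below. Whether the authors use a shared MLP or per-task heads, and their widths, is not stated, so the per-task two-layer heads here are an assumption; the pooling, concatenation, and normalized absolute error follow the formulation above (the small clamp is only a numerical safeguard).

```python
import torch
import torch.nn as nn

TASKS = ("calories", "mass", "fat", "carb", "protein")

class NutritionHead(nn.Module):
    """Pool each pyramid level, concatenate, and regress the five nutrients."""
    def __init__(self, channels=(256, 512, 1024, 2048), hidden=512):
        super().__init__()
        in_dim = sum(channels)                     # C x 4 after concatenation
        self.heads = nn.ModuleDict({
            t: nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
                             nn.Linear(hidden, 1))
            for t in TASKS
        })

    def forward(self, pyramid):
        # pyramid: list of 4 maps F_ism^i with shapes (B, C_i, H_i, W_i).
        pooled = [nn.functional.adaptive_avg_pool2d(f, 1).flatten(1) for f in pyramid]
        feat = torch.cat(pooled, dim=1)            # (B, sum of channels)
        return {t: head(feat).squeeze(-1) for t, head in self.heads.items()}

def nutrition_loss(pred, target):
    # Sum of normalized absolute errors over the five tasks.
    return sum(torch.sum(torch.abs(pred[t] - target[t])) /
               torch.sum(target[t]).clamp(min=1e-8)
               for t in TASKS)
```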

3.5. Evaluation Metrics

To quantitatively assess the efficacy of our method, we adopt the percentage of mean absolute error (PMAE) as the metric. Let N denote the number of samples. The mean absolute error (MAE) is computed as:
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_{i} - \hat{y}_{i}|.$$
Based on this, PMAE is defined by normalizing MAE as follows:
$$\mathrm{PMAE} = \frac{\mathrm{MAE}}{\frac{1}{N} \sum_{i=1}^{N} y_{i}},$$
where PMAE reflects the prediction error relative to the scale of the true values. A smaller PMAE indicates better estimation accuracy.
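For reference, PMAE can be computed as in the short sketch below; the numbers in the usage line are toy values, not results from the dataset.

```python
import numpy as np

def pmae(y_true, y_pred):
    """Percentage of mean absolute error: MAE divided by the mean ground-truth value."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_true - y_pred)) / np.mean(y_true)

# Toy example (illustrative numbers only):
print(round(pmae([200, 350, 500], [180, 380, 520]), 1))  # -> 6.7
```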

4. Experiments

4.1. Experimental Setup

We conduct experiments on the Nutrition5k [28] dataset. Specifically, the dataset contains 5000 dishes across a wide range of categories and includes annotations for over 250 types of ingredients. The dataset provides single-view and 360° captures, from which we select top-view RGB-D pairs acquired using an Intel RealSense D435 camera [28]. The data is partitioned into training and test sets following a 5:1 ratio. More detailed information can be found in [28]. All experiments are implemented on a workstation equipped with an NVIDIA RTX 3090 GPU. The backbone network is initialized with ImageNet-pre-trained weights. Input images are uniformly cropped and resized to 224 × 224. Moreover, we apply synchronized flipping augmentation to the RGB-D training pairs to improve robustness. Model optimization is carried out using the Adam optimization algorithm, starting with a learning rate of $5 \times 10^{-5}$ that decays exponentially at a rate of 0.99. Training is performed for 150 epochs with a batch size of 32.
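The optimization setup above corresponds to a training loop along the following lines. Here `model`, `train_loader`, and the batch format are placeholders, the per-epoch scheduler step is an assumption (the text does not state the decay interval), and `nutrition_loss` refers to the loss sketch in Section 3.4.

```python
import torch

def train(model, train_loader, device="cuda", epochs=150):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)
    for epoch in range(epochs):
        model.train()
        for rgb, depth, ingredients, targets in train_loader:
            optimizer.zero_grad()
            preds = model(rgb.to(device), depth.to(device), ingredients)
            loss = nutrition_loss(preds, {k: v.to(device) for k, v in targets.items()})
            loss.backward()
            optimizer.step()
        scheduler.step()   # decay the learning rate by 0.99 once per epoch (assumed interval)
```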

4.2. Experimental Results and Analysis

In this section, we conduct a comprehensive comparison between our IGSMNet and several recent competitive methods on the Nutrition5k dataset. The compared approaches include representative RGB-based methods, such as Google-Nutrition-rgb [28], Portion-Nutrition [41], Swin-nutrition [25], and DPF-Nutrition [35]. In addition, we evaluate against recent RGB-D fusion approaches, including CMX [42], HINet [29], CDINet [43], DEFNet [44], TriTransNet [26], Deliver [45], Google-Nutrition-rgbd [28], IMIR-Net [36], and Feng et al. [10]. The experimental results are given in Table 1.

4.2.1. Comparison with RGB Image-Based Methods

As shown in Table 1, we can see that the proposed IGSMNet outperforms all RGB-only baselines. Our IGSMNet achieves consistent improvements across all tasks. Specifically, in terms of the overall average, our model obtains a PMAE of 15.0%, surpassing the best-performing RGB-based method, DPF-Nutrition [35], which reports 17.8%. Regarding specific nutritional estimation tasks, our method also demonstrates clear advantages. For instance, the PMAE for Protein estimation is reduced from 20.2% to 16.0%, marking a 4.2% improvement. On the Mass metric, our approach yields a PMAE of 9.4%, outperforming the 10.6% reported by DPF-Nutrition by 1.2%. These results demonstrate that our framework, by incorporating richer information beyond RGB inputs, significantly enhances prediction accuracy. Overall, the proposed method consistently outperforms RGB-based approaches, highlighting its effectiveness in addressing the limitations of RGB-only nutritional estimation.

4.2.2. Comparison with RGB-D-Based Methods

We also compare with RGB-D-based methods, as summarized in Table 1. It can be observed that our proposed approach achieves consistent improvements over existing RGB-D fusion approaches in predicting most nutritional components. Specifically, our method achieves a mean PMAE of 15.0%, significantly outperforming all other methods. For instance, the best-performing compared approach, Google-Nutrition-rgbd, reports a mean PMAE of 20.1%, indicating a substantial 5.1% improvement obtained by our model. In terms of individual nutrients, our model consistently reports the lowest PMAE values: 12.2% for Calories, 9.4% for Mass, 19.1% for Fat, 18.3% for Carb, and 16.0% for Protein. Compared to these RGB-D-based methods such as Google-Nutrition-rgbd and Deliver, which rely on cross-modal attention or multimodal feature alignment schemes, our model exhibits a clear advantage in prediction accuracy. Moreover, the Mass metric shows limited improvement because it depends mainly on physical cues such as food volume and thermal radiation, which are not the primary focus of the IG and ISM modules. These modules are designed to enhance ingredient-level semantics and contextual reasoning, leading to clear gains in nutrient estimation but only marginal benefits for Mass prediction. It is worth noting that while most methods suffer from high PMAE in Fat and Protein prediction, our model maintains relatively low errors, indicating better generalization across nutrients. The better performance of our method can be attributed to two main factors: (1) the ingredient-guided scheme, which provides explicit semantic information aligned with nutrient properties, and (2) the fine-grained semantic modeling module, which further promotes the integration and complementarity between visual representations and ingredient information.

4.3. Ablation Study

This section analyzes the impact of different components on nutritional assessment performance. Specifically, we conduct ablation studies on the ingredient-guided scheme, the fine-grained modeling strategy, and the dynamic positional encoding module. The experimental results and a detailed analysis are presented below.

4.3.1. Effectiveness of the Ingredient-Guided Module

As shown in Table 2, the ingredient-guided module plays a critical role in enhancing estimation performance. When this module is introduced into the baseline model, the overall prediction error (mean PMAE) is reduced from 16.9% to 15.6%, yielding a 1.3% improvement. Among the individual nutrient predictions, all components except Mass benefit from this module. In particular, the PMAE for Calories drops from 13.7% to 13.3%, Fat from 22.6% to 19.5%, Carb from 19.4% to 18.5%, and Protein from 19.6% to 16.6%. These results indicate that ingredient-level information contributes significantly to the semantic alignment between visual features and nutrition-related properties. Although the ingredient-guided module effectively improves predictions for most nutritional components, a slight performance drop is observed for Mass. This may be due to the weak correlation between ingredient semantics and physical quantity, as ingredient presence does not directly reflect Mass.

4.3.2. Effectiveness of the Fine-Grained Modeling Scheme

As shown in the fourth row of Table 2, incorporating the FGM scheme leads to a further reduction in mean PMAE, from 15.6% to 15.2%, representing a 0.4% improvement. This gain is primarily reflected in the predictions of Calories, Mass, and Protein. Compared to using the ingredient-guided module alone, where the Mass error increases, the inclusion of FGM reduces the Mass PMAE from 10.2% to 9.4%. This indicates that fine-grained modeling strengthens the internal structure of visual features and effectively preserves semantics unrelated to ingredient information. In addition, the improved modeling of fine-grained information contributes to a more stable and discriminative representation, enhancing the effectiveness of ingredient-guided features.

4.3.3. Effectiveness of the Dynamic Position Encoding

As observed in the last row of Table 2, introducing the DPE module leads to a further reduction in mean PMAE from 15.2% to 15.0%, indicating a marginal improvement of 0.2%. This improvement is primarily reflected in the predictions of Calories, Fat, and Protein. The performance gain indicates that, without position encoding, the model may not capture spatial associations among visual elements, limiting its ability to identify whether features belong to the same food item. By incorporating DPE, the model can establish positional context across local regions, reinforcing the semantic coherence of visual features and improving the performance.

4.4. Further Analysis

4.4.1. Comparison of Ingredient-Guided Integration Strategies

To explore the efficacy of different ingredient-guided integration strategies, we conduct some experiments (including Add, MLP, and Cross-Attention schemes); the results are given in Table 3. It can be observed that the incorporation of ingredient-guided schemes, regardless of the specific integration strategy, leads to consistent performance improvements compared to the baseline without ingredient guidance. Among these schemes, the cross-attention strategy achieves the best result, reducing the mean PMAE to 15.0% and yielding the lowest errors for Calories, Fat, Carb, and Protein. In contrast, the Add and MLP strategies yield less substantial improvements, with mean PMAE values of 16.5% and 16.6%, respectively. The performance advantage of the cross-attention scheme indicates its efficacy in capturing the relationships between ingredient information and visual features. Compared with the Add and MLP strategies, the cross-attention scheme effectively establishes associations between ingredient information and relevant visual regions. This interaction suppresses interference from unrelated ingredients and improves the semantic relevance of the fused features. As a result, the cross-attention scheme achieves more accurate overall predictions and enhances the precision of nutritional component estimation.

4.4.2. Modeling Order of IG and ISM

As shown in Table 4, we can see that incorporating the ISM strategy improves overall performance, regardless of the integration order. Specifically, both modeling orders outperform the single IG baseline, with mean PMAEs reduced from 15.6% to 15.0% and 15.1%, respectively. Notably, applying IG before ISM achieves slightly better results across most nutritional components. The improvement is mainly due to the fact that IG introduces semantic alignment but may weaken local visual details. Using ISM after IG helps recover this lost information by modeling contextual dependencies, thus improving overall performance. In contrast, when ISM is applied first, this refinement cannot occur, leading to reduced performance in certain components, such as Carb. These results indicate that the order of applying IG before ISM is more effective for maintaining semantic consistency and improving prediction accuracy.

5. Discussion

Vision-based nutritional analysis offers an efficient and scalable solution for estimating nutrient composition directly from food images, supporting individualized dietary monitoring and nutritional intervention. Although our method achieves good results compared to existing baselines, several limitations are observed. Specifically, the proposed method shows no improvement on the Mass task. This may result from the weak association between ingredient information and physical quantity, as ingredient presence does not directly reflect volume or density. Consequently, the model may lack sensitivity to structural information. Moreover, while the model leverages ingredient information as a guiding signal, it assumes complete and accurate information availability. In practical scenarios, ingredient annotations are often partial, noisy, or user-provided. The current model may lack robustness under such conditions. This highlights the need for more flexible schemes that can adapt to uncertain or partial ingredient descriptions while maintaining promising performance. In addition, another promising direction is the integration of personalization into nutritional assessment, where user history and dietary patterns are incorporated to achieve more practical outcomes.

6. Conclusions

In this paper, we propose a novel food nutrition estimation method named IGSMNet, which integrates ingredient information with RGB-D visual features to further improve the performance. The proposed method first combines RGB images with depth information to extract semantically enriched RGB-D representations. Ingredient information is then employed to guide these representations, enhancing their alignment with nutritional attributes. To further refine feature discrimination, an internal semantic modeling scheme is introduced, which performs fine-granularity encoding to capture local contextual dependencies among fused features. In addition, a dynamic position encoding module is incorporated to strengthen the ability to perceive spatial relations. Experimental results on the Nutrition5k dataset demonstrate the superior performance of the developed IGSMNet compared with existing baselines. In the future, we will explore lightweight model designs to improve efficiency and facilitate deployment in real-world nutritional assessment scenarios.

Author Contributions

Methodology, D.Z., W.S., and B.M.; Validation, W.S.; Writing—original draft preparation, D.Z., W.S., and B.M.; Writing—review and editing, W.M. and X.-J.W.; Supervision, W.M. and X.-J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China (2023YFF1105102).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data is available at https://github.com/dmcsy/ISMIG, accessed on 28 September 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kaushal, S.; Tammineni, D.K.; Rana, P.; Sharma, M.; Sridhar, K.; Chen, H.H. Computer vision and deep learning-based approaches for detection of food nutrients/nutrition: New insights and advances. Trends Food Sci. Technol. 2024, 146, 104408. [Google Scholar] [CrossRef]
  2. Jacobs, D.R.; Tapsell, L.C. Food, not nutrients, is the fundamental unit in nutrition. Nutr. Rev. 2007, 65, 439–450. [Google Scholar] [CrossRef]
  3. Gargano, D.; Appanna, R.; Santonicola, A.; De Bartolomeis, F.; Stellato, C.; Cianferoni, A.; Casolaro, V.; Iovino, P. Food allergy and intolerance: A narrative review on nutritional concerns. Nutrients 2021, 13, 1638. [Google Scholar] [CrossRef]
  4. Zhou, L.; Zhang, C.; Liu, F.; Qiu, Z.; He, Y. Application of deep learning in food: A review. Compr. Rev. Food Sci. Food Saf. 2019, 18, 1793–1811. [Google Scholar] [CrossRef]
  5. Subar, A.F.; Kirkpatrick, S.I.; Mittl, B.; Zimmerman, T.P.; Thompson, F.E.; Bingley, C.; Willis, G.; Islam, N.G.; Baranowski, T.; McNutt, S.; et al. The automated self-administered 24-h dietary recall (ASA24): A resource for researchers, clinicians and educators from the National Cancer Institute. J. Acad. Nutr. Diet. 2012, 112, 1134. [Google Scholar] [CrossRef]
  6. Bianco, R.; Coluccia, S.; Marinoni, M.; Falcon, A.; Fiori, F.; Serra, G.; Ferraroni, M.; Edefonti, V.; Parpinel, M. 2D Prediction of the Nutritional Composition of Dishes from Food Images: Deep Learning Algorithm Selection and Data Curation Beyond the Nutrition5k Project. Nutrients 2025, 17, 2196. [Google Scholar] [CrossRef] [PubMed]
  7. Yin, Y.; Qi, H.; Zhu, B.; Chen, J.; Jiang, Y.G.; Ngo, C.W. Foodlmm: A versatile food assistant using large multi-modal model. IEEE Trans. Multimed. 2025. [Google Scholar] [CrossRef]
  8. Ma, B.; Zhang, D.; Wu, X.J. Food nutrition estimation with RGB-D fusion module and bidirectional feature pyramid network. Multimed. Syst. 2025, 31, 1–11. [Google Scholar] [CrossRef]
  9. Saad, A.M.; Rahi, M.R.H.; Islam, M.M.; Rabbani, G. Diet engine: A real-time food nutrition assistant system for personalized dietary guidance. Food Chem. Adv. 2025, 7, 100978. [Google Scholar] [CrossRef]
  10. Feng, Z.; Xiong, H.; Min, W.; Hou, S.; Duan, H.; Liu, Z.; Jiang, S. Ingredient-Guided RGB-D Fusion Network for Nutritional Assessment. IEEE Trans. Agrifood Electron. 2025, 3, 156–166. [Google Scholar] [CrossRef]
  11. Shao, W.; Min, W.; Hou, S.; Luo, M.; Li, T.; Zheng, Y.; Jiang, S. Vision-based food nutrition estimation via RGB-D fusion network. Food Chem. 2023, 424, 136309. [Google Scholar] [CrossRef] [PubMed]
  12. Jovanovic, L.; Bacanin, N.; Petrovic, A.; Zivkovic, M.; Antonijevic, M.; Gajic, V.; Elsayed, M.M.; Abouhawwash, M. Exploring artificial intelligence potential in solar energy production forecasting: Methodology based on modified PSO optimized attention augmented recurrent networks. Sustain. Comput. Inform. Syst. 2025, 47, 101174. [Google Scholar] [CrossRef]
  13. Chen, J.J.; Ngo, C.W.; Chua, T.S. Cross-modal recipe retrieval with rich food attributes. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1771–1779. [Google Scholar]
  14. Ming, Z.Y.; Chen, J.; Cao, Y.; Forde, C.; Ngo, C.W.; Chua, T.S. Food photo recognition for dietary tracking: System and experiment. In Proceedings of the International Conference on Multimedia Modeling, Bangkok, Thailand, 5–7 February 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 129–141. [Google Scholar]
  15. Sosa-Holwerda, A.; Park, O.H.; Albracht-Schulte, K.; Niraula, S.; Thompson, L.; Oldewage-Theron, W. The role of artificial intelligence in nutrition research: A scoping review. Nutrients 2024, 16, 2066. [Google Scholar] [CrossRef]
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  17. Haq, M.A. CNN based automated weed detection system using UAV imagery. Comput. Syst. Sci. Eng. 2022, 42, 2. [Google Scholar] [CrossRef]
  18. Bidyalakshmi, T.; Jyoti, B.; Mansuri, S.M.; Srivastava, A.; Mohapatra, D.; Kalnar, Y.B.; Narsaiah, K.; Indore, N. Application of artificial intelligence in food processing: Current status and future prospects. Food Eng. Rev. 2025, 17, 27–54. [Google Scholar] [CrossRef]
  19. Zhang, D.; Ma, B.; Wu, X.J. Adaptive Feature Fusion and Enhancement Network for Food Nutrition Estimation. IEEE Trans. Agrifood Electron. 2025. [Google Scholar] [CrossRef]
  20. Zhang, F.; Yin, J.; Wu, N.; Hu, X.; Sun, S.; Wang, Y. A dual-path model merging CNN and RNN with attention mechanism for crop classification. Eur. J. Agron. 2024, 159, 127273. [Google Scholar] [CrossRef]
  21. Chang, J.; Wang, H.; Su, W.; He, X.; Tan, M. Artificial intelligence in food bioactive peptides screening: Recent advances and future prospects. Trends Food Sci. Technol. 2025, 156, 104845. [Google Scholar] [CrossRef]
  22. Zhang, D.; Wu, X.J.; Yu, J. Label consistent flexible matrix factorization hashing for efficient cross-modal retrieval. Acm Trans. Multimed. Comput. Commun. Appl. (TOMM) 2021, 17, 1–18. [Google Scholar] [CrossRef]
  23. Zhang, D.; Wu, X.J.; Yu, J. Discrete bidirectional matrix factorization hashing for zero-shot cross-media retrieval. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Zhuhai, China, 29 October–1 November 2021; pp. 524–536. [Google Scholar]
  24. García-Infante, M.; Castro-Valdecantos, P.; Delgado-Pertinez, M.; Teixeira, A.; Guzmán, J.L.; Horcada, A. Effectiveness of machine learning algorithms as a tool to meat traceability system. A case study to classify Spanish Mediterranean lamb carcasses. Food Control 2024, 164, 110604. [Google Scholar] [CrossRef]
  25. Shao, W.; Hou, S.; Jia, W.; Zheng, Y. Rapid non-destructive analysis of food nutrient content using swin-nutrition. Foods 2022, 11, 3429. [Google Scholar] [CrossRef]
  26. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  27. Jiao, P.; Wu, X.; Zhu, B.; Chen, J.; Ngo, C.W.; Jiang, Y. Rode: Linear rectified mixture of diverse experts for food large multi-modal models. arXiv 2024, arXiv:2407.12730. [Google Scholar]
  28. Thames, Q.; Karpur, A.; Norris, W.; Xia, F.; Panait, L.; Weyand, T.; Sim, J. Nutrition5k: Towards automatic nutritional understanding of generic food. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8903–8911. [Google Scholar]
  29. Bi, H.; Wu, R.; Liu, Z.; Zhu, H.; Zhang, C.; Xiang, T.Z. Cross-modal hierarchical interaction network for RGB-D salient object detection. Pattern Recognit. 2023, 136, 109194. [Google Scholar] [CrossRef]
  30. NR, D.; GK, D.S.; Kumar Pareek, D.P. A Framework for Food recognition and predicting its Nutritional value through Convolution neural network. In Proceedings of the International Conference on Innovative Computing & Communication (ICICC), Delhi, India, 19–20 February 2022. [Google Scholar]
  31. Ege, T.; Yanai, K. Simultaneous estimation of dish locations and calories with multi-task learning. IEICE Trans. Inf. Syst. 2019, 102, 1240–1246. [Google Scholar] [CrossRef]
  32. Fang, S.; Shao, Z.; Kerr, D.A.; Boushey, C.J.; Zhu, F. An end-to-end image-based automatic food energy estimation technique based on learned energy distribution images: Protocol and methodology. Nutrients 2019, 11, 877. [Google Scholar] [CrossRef]
  33. Wang, B.; Bu, T.; Hu, Z.; Yang, L.; Zhao, Y.; Li, X. Coarse-to-fine nutrition prediction. IEEE Trans. Multimed. 2023, 26, 3651–3662. [Google Scholar] [CrossRef]
  34. Meyers, A.; Johnston, N.; Rathod, V.; Korattikara, A.; Gorban, A.; Silberman, N.; Guadarrama, S.; Papandreou, G.; Huang, J.; Murphy, K.P. Im2Calories: Towards an automated mobile vision food diary. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1233–1241. [Google Scholar]
  35. Han, Y.; Cheng, Q.; Wu, W.; Huang, Z. Dpf-nutrition: Food nutrition estimation via depth prediction and fusion. Foods 2023, 12, 4293. [Google Scholar] [CrossRef]
  36. Nian, F.; Hu, Y.; Gu, Y.; Wu, Z.; Yang, S.; Shu, J. Ingredient-guided multi-modal interaction and refinement network for RGB-D food nutrition assessment. Digit. Signal Process. 2024, 153, 104664. [Google Scholar] [CrossRef]
  37. Wang, W.; Chen, W.; Qiu, Q.; Chen, L.; Wu, B.; Lin, B.; He, X.; Liu, W. Crossformer++: A versatile vision transformer hinging on cross-scale attention. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 3123–3136. [Google Scholar] [CrossRef]
  38. Wang, W.; Guo, Z.; Jiang, W.; Lan, Y.; Ma, W. CrossHash: Cross-scale Vision Transformer Hashing for Image Retrieval. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
  39. Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
  40. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar] [CrossRef]
  41. Shao, Z.; Vinod, G.; He, J.; Zhu, F. An end-to-end food portion estimation framework based on shape reconstruction from monocular image. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 942–947. [Google Scholar]
  42. Zhang, J.; Liu, H.; Yang, K.; Hu, X.; Liu, R.; Stiefelhagen, R. CMX: Cross-modal fusion for RGB-X semantic segmentation with transformers. IEEE Trans. Intell. Transp. Syst. 2023, 24, 14679–14694. [Google Scholar] [CrossRef]
  43. Zhang, C.; Cong, R.; Lin, Q.; Ma, L.; Li, F.; Zhao, Y.; Kwong, S. Cross-modality discrepant interaction network for RGB-D salient object detection. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, China, 20–24 October 2021; pp. 2094–2102. [Google Scholar]
  44. Zhou, W.; Pan, Y.; Lei, J.; Ye, L.; Yu, L. DEFNet: Dual-branch enhanced feature fusion network for RGB-T crowd counting. IEEE Trans. Intell. Transp. Syst. 2022, 23, 24540–24549. [Google Scholar] [CrossRef]
  45. Zhang, J.; Liu, R.; Shi, H.; Yang, K.; Reiß, S.; Peng, K.; Fu, H.; Wang, K.; Stiefelhagen, R. Delivering arbitrary-modal semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1136–1147. [Google Scholar]
Figure 1. The framework of our IGSMNet, which contains two primary modules. The first is the ingredient-guided module and the second is the internal semantic modeling module.
Figure 2. The framework of our ingredient-guided module.
Figure 3. The framework of the internal semantic modeling scheme.
Table 1. Performance comparison of our IGSMNet and other methods on Nutrition5K; the best results are highlighted in bold. All values are PMAE (%).

Method Type | Methods | Calories | Mass | Fat | Carb | Protein | Mean
RGB images | Google-Nutrition-rgb [28] | 26.1 | 18.8 | 34.2 | 31.9 | 29.5 | 29.1
RGB images | Coarse-to-Fine Nutrition [33] | 24.1 | 19.4 | 36.0 | 32.1 | 33.5 | 29.0
RGB images | Swin-nutrition [25] | 16.2 | 13.7 | 24.9 | 21.8 | 25.4 | 20.4
RGB images | Portion-Nutrition [41] | 15.8 | - | - | - | - | -
RGB images | RoDE [27] | 52.4 | 38.4 | 67.1 | 47.8 | 53.9 | 51.9
RGB images | DPF-Nutrition [35] | 14.7 | 10.6 | 22.6 | 20.7 | 20.2 | 17.8
RGB-D images | CMX [42] | 21.8 | 20.7 | 34.8 | 37.0 | 33.2 | 29.5
RGB-D images | HINet [29] | 24.5 | 25.2 | 43.4 | 39.9 | 38.8 | 34.3
RGB-D images | CDINet [43] | 21.1 | 20.4 | 37.1 | 37.1 | 32.8 | 29.7
RGB-D images | DEFNet [44] | 32.7 | 34.2 | 48.9 | 40.3 | 43.8 | 39.9
RGB-D images | TriTransNet [26] | 22.1 | 20.1 | 37.5 | 34.8 | 38.0 | 30.5
RGB-D images | Deliver [45] | 29.5 | 25.9 | 48.3 | 47.7 | 46.1 | 39.5
RGB-D images | Google-Nutrition-rgbd [28] | 18.8 | 18.9 | 18.1 | 23.8 | 20.9 | 20.1
RGB-D images | IMIR-Net [36] | 14.7 | 11.4 | 23.3 | 20.9 | 21.6 | 18.4
RGB-D images | Feng et al. [10] | 13.7 | 9.8 | 19.2 | 19.3 | 17.6 | 15.9
RGB-D images | IGSMNet | 12.2 | 9.4 | 19.1 | 18.3 | 16.0 | 15.0
Table 2. Ablation study evaluating the impact of each module (the best results are marked in bold). A ✓ indicates that the corresponding component is included.

Baseline | IG | FGM | DPE | Calories | Mass | Fat | Carb | Protein | Mean
✓ |   |   |   | 13.7 | 9.4 | 22.6 | 19.4 | 19.6 | 16.9
✓ | ✓ |   |   | 13.3 | 10.2 | 19.5 | 18.5 | 16.6 | 15.6
✓ | ✓ | ✓ |   | 12.6 | 9.4 | 19.5 | 18.3 | 16.2 | 15.2
✓ | ✓ | ✓ | ✓ | 12.2 | 9.4 | 19.1 | 18.3 | 16.0 | 15.0
Table 3. Results of different ingredient-guided integration strategies (we mark the best results in bold).

IG Integration Strategy | Calories | Mass | Fat | Carb | Protein | Mean
w/o IG | 13.7 | 9.4 | 22.6 | 19.4 | 19.6 | 16.9
Add | 13.5 | 9.7 | 21.8 | 19.2 | 18.5 | 16.5
MLP | 13.4 | 9.3 | 22.0 | 19.8 | 18.8 | 16.6
Cross-Attention | 12.2 | 9.4 | 19.1 | 18.3 | 16.0 | 15.0
Table 4. Effects of different integration orders between IG and ISM (the best results are marked in bold).

Configuration | Calories | Mass | Fat | Carb | Protein | Mean
w/o IG | 13.7 | 9.4 | 22.6 | 19.4 | 19.6 | 16.9
IG only | 13.3 | 10.2 | 19.5 | 18.5 | 16.6 | 15.6
ISM → IG | 12.9 | 9.7 | 18.3 | 19.1 | 15.5 | 15.1
IG → ISM | 12.2 | 9.4 | 19.1 | 18.3 | 16.0 | 15.0
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
