Article

Multi-View Edge Attention Network for Fine-Grained Food Image Segmentation

1
School of Information and Electrical Engineering, Ludong University, Yantai 264025, China
2
The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
3
University of Chinese Academy of Sciences, Beijing 100049, China
4
Institute of Science & Technology, Jiangnan University, Wuxi 214122, China
*
Author to whom correspondence should be addressed.
Foods 2025, 14(17), 3016; https://doi.org/10.3390/foods14173016
Submission received: 8 July 2025 / Revised: 23 August 2025 / Accepted: 26 August 2025 / Published: 28 August 2025
(This article belongs to the Special Issue Food Computing-Enabled Precision Nutrition)

Abstract

Precisely identifying and delineating food regions automatically from images, a task known as food image segmentation, is crucial for enabling applications in food science such as automated dietary logging, accurate nutritional analysis, and food safety monitoring. However, accurately segmenting food images, particularly delineating food edges with precision, remains challenging due to the wide variety and diverse forms of food items, frequent inter-food occlusion, and ambiguous boundaries between food and backgrounds or containers. To overcome these challenges, we propose a novel method called the Multi-view Edge Attention Network (MVEANet), which focuses on enhancing the fine-grained segmentation of food edges. The core idea behind this method is to integrate information obtained from observing food from different perspectives to achieve a more comprehensive understanding of its shape and, specifically, to strengthen the processing of food contour details. Testing on two large public food image datasets, FoodSeg103 and UEC-FoodPIX Complete, demonstrates that MVEANet surpasses existing state-of-the-art methods in segmentation accuracy and performs especially well in depicting clear and precise food boundaries. This work provides the field of food science with a more accurate and reliable tool for automated food image segmentation, offering technical support for the development of more intelligent dietary assessment, nutritional research, and health management systems.

1. Introduction

Driven by increasing public health awareness and the growing demand for scientific dietary planning, the ability to automatically and precisely identify and separate food regions from images has become a critical technical requirement in the field of food science [1]. This technique, commonly referred to as food image segmentation, serves as an essential foundation for applications such as automated dietary tracking, accurate nutritional analysis, and food safety supervision. Its main objective is to meticulously delineate the specific contours of each food item within complex images containing food, utensils and backgrounds [2]. Only with such precise food contour information can subsequent processes like automatic food recognition [3,4,5,6,7], nutritional content calculation [8,9,10,11], and personalized health management recommendations [12,13,14] become truly reliable. Despite recent advances in related technologies, achieving high-precision food contour extraction in real-world scenarios (e.g., stacked food items, blurred boundaries, and diverse food categories) still poses significant challenges. Therefore, the development of more accurate and robust food image segmentation methods has substantial research value and broad application prospects to advance nutrition research, develop intelligent health applications, and improve the efficiency of food safety monitoring.
However, food image segmentation is fraught with the following distinct challenges: (1) High Intra-Class Variance: A single food item can exhibit vastly different appearances due to varied preparation and cooking methods (e.g., steaming, stir-frying, stewing). This challenge is particularly pronounced in complex cuisines, such as Chinese food, and is further exacerbated when ingredients possess irregular shapes or lack distinctive color and texture features. (2) Ambiguous Boundaries: The delineation between different food items or between food and the background is often ill-defined. For instance, a serving of rice, composed of numerous small, individual components, exemplifies this issue. Although the overall region is discernible, the scattered nature of individual grains impedes the precise definition of a compact boundary.
Recent years have witnessed significant progress in food image segmentation [15], primarily driven by rapid advancements in machine learning and computer vision. Early innovations, such as Fully Convolutional Networks (FCNs), pioneered end-to-end pixel-level classification [16]. Subsequent architectures introduced critical enhancements; for example, dilated convolutions [17] expanded the receptive field without compromising spatial resolution, and the Pyramid Scene Parsing Network (PSPNet) aggregated multi-scale contextual information through diverse pooling layers [18]. To better capture long-range dependencies, non-local methods were developed to model relationships between pixel pairs across feature maps [19], a concept later refined with cross-attention mechanisms for improved computational efficiency [20].
More recently, Transformer-based architectures [21,22] have instigated a paradigm shift. Vision Transformers (ViTs) have been effectively applied to food image segmentation, more proficiently leveraging global contextual information to enrich feature representations and yield superior segmentation results [23,24]. The advent of the Segment Anything Model (SAM) has also provided a powerful tool, subsequently adapted and fine-tuned for improved performance on food datasets [25,26,27]. Furthermore, dedicated efforts in food-specific domains have shown notable advancements. For instance, Wu et al. [28] developed ReLeM, the first pretraining model for food images, which integrates recipe information with visual features to mitigate intra-class variance. Currently, generative techniques have been explored; Jaswanthi et al. [29] proposed a hybrid method that uses generative adversarial networks (GANs) [30] to produce auxiliary masks for CNN-based classification. Generic segmentation and detection methods have also found applications in food image segmentation [31,32,33]. The burgeoning field of Large Language Models (LLMs) is also poised to offer novel, multimodal solutions for food image segmentation [34,35].
Despite these substantial advancements, many existing models still operate at a coarse granularity, often resulting in imprecise segmentation, particularly at the fine-grained edges and boundaries of food items. We hypothesize that standard foundation models like SAM, while powerful, produce suboptimal results on fine-grained segmentation tasks because their decoders lack sufficient domain-specific detail from the early-stage features. We posit that by explicitly engineering a multi-view feature extraction pipeline to capture and fuse complementary local and global information, we can create an enriched feature representation. This enhanced representation, when integrated with a detail-focused decoder, will enable the model to delineate complex food boundaries more accurately than architectures that lack this targeted feature enhancement. To address these limitations, our paper introduces the Multi-view Edge Attention Network (MVEANet), a novel food image segmentation method. Built upon the Segment Anything Model (SAM) [27], MVEANet integrates a multi-view feature fusion mechanism, inspired by the Multi-view Aggregation Network (MVANet) [36], and incorporates High-Quality Tokens (HQ-Tokens) [37] to improve the prediction of fine-grained mask details. The code is available at https://github.com/Axboexx/MVEANet (accessed on 20 August 2025).

2. Materials and Methods

2.1. Datasets

We evaluate our method using the following publicly available food image segmentation datasets:
FoodSeg103 [28]: This dataset was constructed based on the Recipe1M dataset [38]. Initially, the most frequent ingredient categories from Recipe1M were identified, and the top 124 were selected. After a further screening process to ensure class distinction and quality, this set was refined to a final 103 ingredient categories. The creators then selected images from Recipe1M that contained between 2 and 16 distinct and clearly annotatable ingredients. This selection process yielded a final dataset of 7118 images with corresponding pixel-level masks. Figure 1 shows several examples from this dataset.
UEC-FoodPIX Complete [39]: This large-scale dataset is a direct quality enhancement of the original UEC-FoodPix. It contains 10,000 images covering 102 dish categories. The key contribution of the “Complete” version is the meticulous, manual refinement of the segmentation masks. The masks in the training set of the original dataset were generated semi-automatically with the GrabCut algorithm, leading to boundary inaccuracies, whereas all masks in this version have been corrected by human annotators following a strict set of predefined rules to ensure high precision. Example images are provided in Figure 2.

2.2. Equipment and Experimental Setup

The operating system is Ubuntu 20.04 LTS. We use PyTorch 1.12.0 [40] and Python 3.8 to build our model, which is trained on a machine with an NVIDIA A800 GPU (80 GB), an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60 GHz, 8 GB RAM, and a 1 TB SSD. For the FoodSeg103 dataset, 4983 images with their corresponding masks are used for training and 2135 images with their corresponding masks for testing. Images are resized to 1024 × 1024, and the batch size is set to 1. For the UEC-FoodPIX Complete dataset, 9000 images are used for training and 1000 for testing. Images are likewise resized to 1024 × 1024, and the batch size is also set to 1.

2.3. Method

To validate the efficacy of multi-view and HQ-Token integration for food image segmentation, this paper introduces the Multi-view Edge Attention Network (MVEANet). The overall architecture of the model is illustrated in Figure 3 and is divided into three parts. The first part employs the Super Token Vision Transformer (STViT) [41] as its backbone network for rapid global feature extraction and to generate the distant view of the input data. The second part is the feature extraction module, where we use the feature extraction module of MVANet [36] to further process the output of the first part and generate multiple intermediate prediction masks. The third part is the HQ-SAM decoder, which fuses the HQ-Token [37] with the SAM decoder [27] to output the final segmentation result for the input image. Notably, during model optimization, the multiple intermediate masks generated by the second part, along with the final segmentation result, are fed into the loss function together. In the testing phase, only the predicted mask generated by the third part is output.
We adopt the loss function configuration from MVANet [36]. The total loss $L$ aggregates the losses on the intermediate representations and on the final prediction map. The intermediate representations comprise local representations, global representations, and attention maps, whose losses are denoted $l_l$, $l_g$, and $l_a$, respectively. The loss on the final prediction map is denoted $l_f$. Each loss $l$ combines the binary cross-entropy (BCE) loss with the weighted IoU loss, a common practice in segmentation tasks:
$$l = l_{BCE} + l_{IoU}$$
The total loss $L$ is therefore defined as follows:
$$L = l_f + \sum_{i=1}^{5} \left( l_l^{i} + \lambda_g l_g^{i} + \lambda_a l_a^{i} \right).$$
Here, $\lambda_g$ and $\lambda_a$ are weighting coefficients; following MVANet, we keep both at 0.3.
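To make the aggregation concrete, the following is a minimal sketch of the loss in plain Python. It is an illustration under stated assumptions, not the authors' implementation: predictions are flattened lists of probabilities, the IoU term is an unweighted soft IoU (the boundary-aware weighting of the weighted IoU loss is omitted), and all function names are hypothetical.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over flattened probabilities."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(pred)


def iou_loss(pred, target, eps=1e-7):
    """Soft IoU loss, 1 - intersection / union (unweighted simplification)."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return 1.0 - (inter + eps) / (union + eps)


def mask_loss(pred, target):
    """l = l_BCE + l_IoU for a single predicted map."""
    return bce_loss(pred, target) + iou_loss(pred, target)


def total_loss(final_pred, local_preds, global_preds, attn_preds, target,
               lam_g=0.3, lam_a=0.3):
    """L = l_f + sum_i (l_l^i + lam_g * l_g^i + lam_a * l_a^i)."""
    loss = mask_loss(final_pred, target)
    for loc, glo, att in zip(local_preds, global_preds, attn_preds):
        loss += (mask_loss(loc, target)
                 + lam_g * mask_loss(glo, target)
                 + lam_a * mask_loss(att, target))
    return loss
```

With five intermediate levels and both coefficients at 0.3, feeding the same map to every term scales the single-map loss by 1 + 5 × 1.6 = 9, which is a quick sanity check on the weighting.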

2.3.1. STViT Backbone

Our model employs STViT [41] as its backbone. STViT is a general-purpose Vision Transformer designed to address the high computational complexity of the self-attention mechanism in traditional Vision Transformers. The core of STViT is its Super Token Attention (STA) mechanism, which comprises three processes: Super Token Sampling (STS), Multi-Head Self-Attention (MHSA), and Token Upsampling (TU). In particular, STS reduces complexity through iterative steps and sparse computation: for each token, only its surrounding 3 × 3 super tokens are used to compute associations. The structure of the basic STViT module, the Super Token Transformer Block, is shown in Figure 4.
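The sparsity of STS can be illustrated with a small sketch that lists, for one token, the 3 × 3 neighbourhood of super tokens it is associated with. The regular grid layout, the `cell` size, and the function name are assumptions made for illustration; the iterative association updates of STS are not shown.

```python
def neighbor_super_tokens(y, x, cell, grid_h, grid_w):
    """Return flat indices of the (up to) 3 x 3 super tokens around token (y, x).

    Assumes tokens lie on a regular grid and each super token initially
    covers a cell x cell block of tokens (hypothetical layout).
    """
    cy, cx = y // cell, x // cell  # super token cell containing the token
    indices = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ny, nx = cy + dy, cx + dx
            if 0 <= ny < grid_h and 0 <= nx < grid_w:  # skip cells off the grid
                indices.append(ny * grid_w + nx)
    return indices
```

Each token therefore interacts with at most 9 super tokens regardless of how many super tokens exist, which is where the reduction in association cost comes from.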

2.3.2. Feature Extraction

Conventional image segmentation methods typically proceed directly to decoding after encoding by a backbone model or encoder. However, given that food images are fine-grained and present greater segmentation challenges than general images, we introduce a new set of feature extraction methods between the encoder and decoder. These methods are derived from the Multi-view Complementary Localization Module (MCLM) and Multi-view Complementary Refinement Module (MCRM) proposed in MVANet [36]. MCLM aims to achieve complementary localization of global and local features through multi-grained pooling and cross-attention. MCRM utilizes the detailed information from local features to refine global features and enhances multi-view complementarity through cross-attention. The structures of MCLM and MCRM are shown in Figure 5 and Figure 6, respectively.
After processing in Section 2.3.1, multi-level feature maps are generated and denoted $\{E_i \mid i = 1, 2, 3, 4, 5\}$. Among these, $E_5$ represents the panoramic view, while $E_1$ to $E_4$ correspond to the local views. First, the feature map $E_5$ is divided into a global feature $E_5^{G} \in \mathbb{R}^{B \times C \times \frac{H}{32} \times \frac{W}{32}}$ and a set of local features $\{E_5^{L_m}\}_{m=1}^{M}$, where $E_5^{L_m} \in \mathbb{R}^{B \times C \times \frac{H}{32} \times \frac{W}{32}}$. Subsequently, aligned with their respective positions in the original image, these local features are assembled into a unified global feature $E_5^{L_g} \in \mathbb{R}^{B \times C \times \frac{H}{32} \times \frac{W}{32}}$. Following this, multi-grained pooling is used to generate pyramid features:
$$P_n = \mathrm{AvgPool}_n\left(E_5^{L_g}\right), \quad n \in \{1, 2, \ldots, N\},$$
where $E_5^{L_g}$ is the unified global feature, and $N$ denotes the number of parallel pooling branches. Subsequently, cross-attention is performed between the global features and the multi-grained features:
$$T^{G} = T(E_5^{G}) + \mathrm{LN}\big(\mathrm{MHCA}\big(T(E_5^{G})\,W^{Q},\ [T(P_1), \ldots, T(P_N)]\,W^{K,V},\ [T(P_1), \ldots, T(P_N)]\,W^{K,V}\big)\big).$$
Here, $T(\cdot)$ signifies the tokenization operation, $W^{Q}$ and $W^{K,V}$ are projection matrices, MHCA refers to Multi-Head Cross-Attention, LN is Layer Normalization, and FFN is the Feed-Forward Network. This is immediately followed by cross-attention between the local features and the global tokens,
$$T_m^{L} = \mathrm{MHCA}\big(T(E_5^{L_m})\,W_m^{Q},\ T^{G_m},\ T^{G_m}\big),$$
where $E_5^{L_m}$ is the $m$-th local feature, and $T^{G_m}$ corresponds to the portion of the rearranged global token aligning with the local region. Finally, feature fusion is performed to generate the feature map for subsequent processing:
$$D_5 = \left[E_5^{G},\ \{E_5^{L_m}\}_{m=1}^{M}\right].$$
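The split, assemble, and multi-grained pooling steps can be sketched on a single-channel 2D map in plain Python. This is a simplified illustration only: the 2 × 2 patch layout (so M = 4), the pooling kernel sizes, and all function names are assumptions, and batch and channel dimensions are omitted.

```python
def split_patches(fmap, rows=2, cols=2):
    """Split a 2D map into rows * cols equal patches in row-major order."""
    h, w = len(fmap), len(fmap[0])
    ph, pw = h // rows, w // cols
    return [[row[c * pw:(c + 1) * pw] for row in fmap[r * ph:(r + 1) * ph]]
            for r in range(rows) for c in range(cols)]


def assemble_patches(patches, rows=2, cols=2):
    """Inverse of split_patches: place patches back at their positions."""
    out = []
    for r in range(rows):
        band = patches[r * cols:(r + 1) * cols]  # patches in one row band
        for i in range(len(band[0])):
            out.append([v for patch in band for v in patch[i]])
    return out


def avg_pool(fmap, k):
    """Non-overlapping k x k average pooling."""
    h, w = len(fmap), len(fmap[0])
    return [[sum(fmap[y + dy][x + dx] for dy in range(k) for dx in range(k)) / (k * k)
             for x in range(0, w - k + 1, k)]
            for y in range(0, h - k + 1, k)]


def multi_grained_pool(fmap, kernels=(2, 4)):
    """Pyramid features P_1..P_N, one per pooling granularity."""
    return [avg_pool(fmap, k) for k in kernels]
```

For example, on a 4 × 4 map the two assumed branches yield a 2 × 2 and a 1 × 1 pooled map; tokenizing and concatenating such maps would give the keys and values for the cross-attention step.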
The MCRM (Multi-view Complementary Refinement Module) takes input features denoted as $D_i$, where $i \in \{1, 2, 3, 4, 5\}$ represents the layer index. Similar to MCLM, the feature $D_i$ is partitioned along the batch dimension into $D_i^{G}$ and $\{D_i^{L_m}\}_{m=1}^{M}$ before processing,
$$A = \mathrm{sigmoid}\left(\mathrm{conv}\left(D_i^{G}\right)\right),$$
where $D_i^{G}$ is the global feature, $D_i^{L_m}$ is the local feature, $A$ is the attention map, $\odot$ denotes the Hadamard product, and assemble and split are operations for combination and decomposition of features, respectively. Subsequently, a multi-grained pooling process similar to that in MCLM is applied to $\{D_i^{L_m}\}_{m=1}^{M}$ to obtain multi-perceptual tokens $T_i^{G_m}$ with different contextual information for the $m$-th patch. These tokens are concatenated to serve as the $K$ and $V$ for cross-attention, followed by the cross-attention operation:
$$T_i^{L_m} = \mathrm{MHCA}\big([T(D_i^{L_1}), \ldots, T(D_i^{L_M})]\,W^{Q_i},\ [T_i^{G_1}, \ldots, T_i^{G_M}]\,W^{K_i,V_i},\ [T_i^{G_1}, \ldots, T_i^{G_M}]\,W^{K_i,V_i}\big).$$
Finally, refined feature fusion is performed as follows:
$$D_i^{G} = D_i^{G} + \mathrm{sum}\left(\{D_i^{L_m}\}_{m=1}^{M}\right),$$
$$D_i = \left[\{D_i^{L_m}\}_{m=1}^{M},\ D_i^{G}\right],$$
where, on the right-hand side of the first equation, $D_i^{L_m}$ represents the local features reconstructed from the updated local tokens, and the resulting $D_i^{G}$ on the left-hand side is the detail-enhanced globally optimized feature.
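The cross-attention used by both MCLM and MCRM, and the final additive fusion, can be sketched as follows. This is a deliberately reduced version: single-head, without the learned projection matrices, and operating on plain Python lists. The function names are hypothetical, and `refine_global` assumes the local features have already been brought to the shape of the global feature.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


def refine_global(global_feat, local_feats):
    """Element-wise D_i^G + sum over the local features (same shape assumed)."""
    return [g + sum(loc[j] for loc in local_feats)
            for j, g in enumerate(global_feat)]
```

In this sketch the queries would come from the local tokens and the keys and values from the pooled multi-perceptual tokens, so each local token gathers context from several granularities before the refined local features are added back into the global feature.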

2.3.3. Detail Enhancement Decoder

To address the issue of insufficient mask quality often encountered in traditional segmentation models when dealing with complex structures and fine boundaries, HQ-SAM [37] introduced the innovative concept of a High-Quality Output Token (HQ-Token). The core idea behind the HQ-token is to enable the model to generate higher-quality segmentation masks without significantly increasing the complexity or computational cost of the model. Specifically, the HQ-token is designed as a special, learnable token injected into the mask decoder. It not only operates on intrinsic features of the decoder, but, more crucially, it can effectively fuse features extracted from the early and final layers of the backbone network (typically low-level and high-level features from a Vision Transformer). This fusion mechanism allows the HQ-Token to simultaneously capture both global contextual information and fine-grained local details from the image. Through this approach, the HQ-Token guides the model during the decoding process to pay closer attention to the precision of object boundaries, the integrity of internal structures, and the expressiveness of details, thereby significantly enhancing the overall quality of generated masks and mitigating common artifacts, holes, or unsmooth boundary issues.
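A rough sketch of the idea follows: early-layer and final-layer features are fused per pixel, and the mask is obtained from a point-wise product between the token and the fused features. This is an illustration under strong simplifying assumptions, not the HQ-SAM implementation: fusion is a plain element-wise sum, the learned convolution and MLP layers are omitted, features are lists of per-pixel channel vectors, and the function names are hypothetical.

```python
def fuse_features(early, final):
    """Fuse early-layer and final-layer features by element-wise sum.

    Both inputs are lists of per-pixel channel vectors with identical shape
    (assumed to be already resized to a common resolution).
    """
    return [[e + f for e, f in zip(pe, pf)] for pe, pf in zip(early, final)]


def predict_mask(token, feats, threshold=0.0):
    """Per-pixel dot product between the token and fused features, then threshold."""
    logits = [sum(t * c for t, c in zip(token, px)) for px in feats]
    return [1 if z > threshold else 0 for z in logits]
```

Because the token is compared against features that carry both low-level edge cues and high-level semantics, a pixel is marked as foreground only when the combined evidence agrees with the learned token.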

3. Results

This section details the performance of MVEANet on the FoodSeg103 and UEC-FoodPIX Complete datasets, including a comparison with other segmentation models. Subsequently, we present the setup and results of our ablation studies, validating the positive contribution of each component of MVEANet to the segmentation results. Finally, we present the qualitative results of our proposed method in the food segmentation task.
Evaluation Metrics. Mean Absolute Error ($MAE$) quantifies the average pixel-wise absolute difference between a continuous prediction map and a binarized ground truth mask (gt). The average is taken over all $W \times H$ pixels, where $W$ and $H$ represent the width and height of the image, respectively. Lower $MAE$ indicates superior performance. $F_\beta^{max}$ and $F_\beta^{\omega}$ are the maximum and weighted F-measures, which combine precision and recall, where $\beta^2$ is set to 0.3. $S_m$ assesses the structural similarity between the prediction and the mask, considering both region-level and object-level characteristics. $E_m$ is widely used for evaluating both pixel-level and image-level correspondence. Mean Intersection over Union ($mIoU$) measures the overlap between the prediction and the ground truth.
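For reference, three of these metrics can be computed on flattened maps as in the sketch below. It uses the standard textbook definitions rather than the authors' evaluation code; the threshold grid for the maximum F-measure, the default binarization threshold of 0.5, and the function names are assumptions.

```python
def mae(pred, gt):
    """Mean absolute error between a continuous map and a binary mask."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def f_beta(pred, gt, threshold=0.5, beta2=0.3, eps=1e-8):
    """F-measure at one binarization threshold, with beta^2 = 0.3."""
    binary = [1 if p >= threshold else 0 for p in pred]
    tp = sum(b * g for b, g in zip(binary, gt))
    precision = tp / (sum(binary) + eps)
    recall = tp / (sum(gt) + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)


def max_f_beta(pred, gt, steps=255, beta2=0.3):
    """Maximum F-measure over a uniform grid of thresholds in [0, 1]."""
    return max(f_beta(pred, gt, t / steps, beta2) for t in range(steps + 1))


def iou(pred, gt, threshold=0.5):
    """Intersection over union of the binarized prediction and the mask."""
    binary = [1 if p >= threshold else 0 for p in pred]
    inter = sum(1 for b, g in zip(binary, gt) if b and g)
    union = sum(1 for b, g in zip(binary, gt) if b or g)
    return inter / union if union else 1.0
```

Averaging `iou` over all evaluated masks gives a mean IoU; the weighted F-measure, $S_m$, and $E_m$ need additional spatial weighting and structure terms and are not covered here.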

3.1. Comparative Experimental Results

Our proposed model establishes a new state-of-the-art (SOTA) in food image segmentation, consistently outperforming existing methods across two benchmark datasets. As detailed in Table 1 and Table 2, our model exhibits superior performance across a range of evaluation metrics. Specifically, it improves on the $mIoU$ of MVANet by 1.6% on UEC-FoodPIX Complete and by a more substantial 4.6% on FoodSeg103.

3.2. Ablation Study

To comprehensively evaluate the impact of various backbone architectures on model performance, we conducted a comparative analysis. The experiment was designed to evaluate the effectiveness of different backbones in designated tasks, their generalizability and their computational efficiency, measured in Frames Per Second (FPS). First, we perform 10 untimed inference iterations as a warm-up to eliminate interference from irrelevant factors. We then measure the total time of 100 consecutive inference runs. The final FPS is calculated as the average of these 100 runs. We synchronize the GPU computation flow before and after the timing loop to ensure accurate measurement of the GPU execution time.
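The timing protocol described above can be sketched as a small helper. The `run_inference` and `sync` callables are placeholders: in a PyTorch setup, `sync` would typically be `torch.cuda.synchronize`, but the helper itself makes no assumption about the framework.

```python
import time


def measure_fps(run_inference, sync=lambda: None, warmup=10, runs=100):
    """Average frames per second over `runs` timed calls after a warm-up.

    run_inference: callable performing one forward pass on one image.
    sync: callable that blocks until pending device work has finished
          (a no-op by default, which is adequate for CPU-only code).
    """
    for _ in range(warmup):  # untimed warm-up iterations
        run_inference()
    sync()  # make sure warm-up work is finished before timing starts
    start = time.perf_counter()
    for _ in range(runs):
        run_inference()
    sync()  # wait for the last timed run to complete
    elapsed = time.perf_counter() - start
    return runs / max(elapsed, 1e-12)
```

Synchronizing on both sides of the loop matters because GPU kernels are launched asynchronously; without it, the timer would measure launch overhead rather than execution time.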
A critical aspect of our study was ensuring our proposed model is not only accurate but also computationally efficient, a key requirement for practical applications. To this end, we conducted a quantitative analysis of the trade-off between performance and speed across several backbone architectures, with results presented in Table 3. This study serves as our primary investigation into the efficiency of the model.
The results show that while the Swin-Transformer [51] backbone achieves the highest accuracy, it does so at a significant computational cost (5.8 FPS). In contrast, STViT [41] provides a more compelling balance, delivering strong segmentation performance (0.693 $mIoU$ on FoodSeg103) at a faster inference speed (6.3 FPS). Therefore, we selected STViT [41] as the final backbone for MVEANet, as it represents the best compromise between high accuracy and practical efficiency. This FPS comparison provides a direct, practical measure of the end-to-end computational cost of each configuration, informing our final architectural design.
To further validate the effectiveness of each proposed component, we performed an ablation study, summarized in Table 4. Our model integrates four principal design elements: a Multi-View strategy, MCLM, MCRM, and the HQ-Token. The study quantifies the contribution of each component through a controlled incremental analysis. The baseline, which uses the Multi-View strategy without MCLM or MCRM, achieves an $mIoU$ of 0.594. Adding only MCLM raises this to 0.641 $mIoU$. The key takeaway is that MCLM is effective at its primary task, improving object localization: by using global tokens to guide the local feature patches, it helps the model identify where the food items lie within the detailed close-up views and filter out background noise. Adding only MCRM instead raises performance to 0.633 $mIoU$. This demonstrates the distinct contribution of MCRM, enhancing fine-grained details: it uses detailed information from the local views to refine the texture and boundaries of object masks. When both modules are used together, the model achieves an $mIoU$ of 0.652. This combined gain (0.058 over the baseline) exceeds either individual gain (0.047 and 0.039), although it is smaller than their sum (0.086), indicating that the two modules are complementary but partly overlapping in effect. The complementarity arises because the modules perform sequential tasks: MCLM first provides a clean, well-localized feature map, which then allows MCRM to apply its detail refinement more effectively. Building on this foundation, the HQ-Token further improves $mIoU$ to 0.667.
The HQ-Token acts as a specialized tool in the decoder stage; it is designed to translate the refined features produced by the MCLM-MCRM pipeline into a final segmentation mask. For performance evaluation, we used $MAE$ and $mIoU$ as key metrics. A comparison across these configurations supports two findings: (1) each of our proposed modules contributes a measurable performance gain, and (2) the combination of all modules yields the best performance, achieved by our final model.
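The incremental gains in this ablation are simple differences from the baseline and can be tabulated directly from the reported values. In the sketch below, the numbers are the mIoU values quoted in the text, while the configuration labels are shorthand introduced here for illustration.

```python
# mIoU values as reported in the ablation discussion; labels are shorthand.
miou = {
    "baseline": 0.594,
    "+MCLM": 0.641,
    "+MCRM": 0.633,
    "+MCLM+MCRM": 0.652,
    "+MCLM+MCRM+HQ-Token": 0.667,
}

base = miou["baseline"]
# Gain of each configuration over the baseline, rounded to three decimals.
gains = {name: round(value - base, 3)
         for name, value in miou.items() if name != "baseline"}

# Sum of the two single-module gains, for comparison with the joint gain.
sum_of_single_gains = round(gains["+MCLM"] + gains["+MCRM"], 3)

for name, gain in gains.items():
    print(f"{name}: +{gain:.3f}")
print(f"sum of single-module gains: +{sum_of_single_gains:.3f}")
```

Comparing the joint gain with the sum of the single-module gains is a quick way to judge how much of each module's effect overlaps with the other's.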

3.3. Qualitative Evaluation

Figure 7 presents the qualitative segmentation results of our proposed model along with other mainstream methods on selected samples from UEC-FoodPIX Complete. Compared with the ground truth, our model shows clear advantages in the completeness of the generated segmentation mask and the accuracy of edge details. For example, when dealing with food with complex and irregular shapes, as shown in the second and seventh rows, PGNet tends to produce large areas of wrong segmentation (over-segmentation), while BSANet and F³Net fail to capture the entire food area. Although the results of MVANet are relatively good, details are still visibly lost and adjacent regions adhere to each other for the food in the bowl in the fourth row and along the boundaries of the food in the fifth row. In contrast, our model accurately outlines the contours of food, effectively distinguishes different food entities, and retains key internal details. Its segmentation results are visually closest to the ground truth.
To further verify the generalization and robustness of our model, Figure 8 shows the qualitative comparison results on the FoodSeg103 with more diverse scenes. The results once again confirm the superiority of our model. A very convincing example is the ring-shaped food in the third row: all other models, especially PGNet and MVANet, failed to identify the empty area in the center of the food. Our method correctly distinguishes food from background. In addition, when processing the cake in the fifth row and the multi-object scene in the sixth row, our method is able to generate better edges than other methods and better separate neighboring foods (such as chicken and asparagus). These visualization results strongly demonstrate that compared to existing methods, our model has a stronger ability to handle complex spatial layouts, maintain detail integrity, and accurately locate boundaries.

4. Discussion

Despite the promising performance of MVEANet on the FoodSeg103 and UEC-FoodPIX Complete datasets, several avenues of improvement remain. First, the current backbone network could potentially be replaced by more specialized alternatives, such as models specifically designed for food computing or those extensively pretrained on large-scale food datasets. Second, we observed that the performance of the model degrades when segmenting foods with amorphous boundaries, heavily mixed ingredients, or significant overlap, such as in stews or mixed salads. We hypothesize that purely visual approaches face inherent limitations in these ambiguous scenarios. Therefore, a promising direction for future research is to explore multimodal solutions, such as integrating textual information like recipes or ingredient lists, to provide the contextual priors needed to resolve these visual ambiguities. Finally, with rapid advancements in large models, there is significant potential to support food image segmentation, as their image understanding and generation capabilities can provide more effective feature information for segmentation tasks.

5. Conclusions

Multi-view Edge Attention Network (MVEANet) addresses limitations in fine-grained food image segmentation, particularly concerning imprecise boundaries and diverse appearances, common in previous models. This SAM-based method integrates multi-view and HQ-Token, utilizing an STViT backbone for global feature extraction. MCLM and MCRM of MVANet provide complementary localization and refine features, while the HQ-Token enhances mask quality by fusing multi-level features for accurate boundary depiction. We validate the effectiveness of MVEANet using the FoodSeg103 and UEC-FoodPIX Complete datasets. This approach aims to improve the localization and boundary delineation of pixels, allowing applications such as nutritional assessment to be improved. Future work will focus on quantitative analysis, generalizability, and real-time performance.

Author Contributions

Methodology, C.L.; Validation, C.L.; Writing—original draft, C.L.; Writing—review & editing, G.S. and W.M.; Supervision, W.M., X.W. and S.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Min, W.; Jiang, S.; Liu, L.; Rui, Y.; Jain, R. A survey on food computing. ACM Comput. Surv. (CSUR) 2019, 52, 1–36. [Google Scholar] [CrossRef]
  2. Wu, X.; Yu, S.; Lim, E.P.; Ngo, C.W. Ovfoodseg: Elevating open-vocabulary food image segmentation via image-informed textual representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4144–4153. [Google Scholar]
  3. Min, W.; Wang, Z.; Liu, Y.; Luo, M.; Kang, L.; Wei, X.; Wei, X.; Jiang, S. Large scale visual food recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9932–9949. [Google Scholar] [CrossRef] [PubMed]
  4. Li, W.; Li, J.; Ma, M.; Hong, X.; Fan, X. Multi-scale spiking pyramid wireless communication framework for food recognition. IEEE Trans. Multimed. 2024, 27, 2734–2746. [Google Scholar] [CrossRef]
  5. Sun, K.; Zhang, Y.J.; Tong, S.Y.; Tang, M.D.; Wang, C.B. Study on rice grain mildewed region recognition based on microscopic computer vision and YOLO-v5 model. Foods 2022, 11, 4031. [Google Scholar] [CrossRef] [PubMed]
  6. Liang, S.; Gu, Y. A Coarse-to-Fine Feature Aggregation Neural Network with a Boundary-Aware Module for Accurate Food Recognition. Foods 2025, 14, 383. [Google Scholar] [CrossRef]
  7. Chen, Z.; Wang, J.; Wang, Y. Enhancing Food Image Recognition by Multi-Level Fusion and the Attention Mechanism. Foods 2025, 14, 461. [Google Scholar] [CrossRef]
  8. Shao, W.; Min, W.; Hou, S.; Luo, M.; Li, T.; Zheng, Y.; Jiang, S. Vision-based food nutrition estimation via RGB-D fusion network. Food Chem. 2023, 424, 136309. [Google Scholar] [CrossRef]
  9. Yang, X.; Ho, C.T.; Gao, X.; Chen, N.; Chen, F.; Zhu, Y.; Zhang, X. Machine learning: An effective tool for monitoring and ensuring food safety, quality, and nutrition. Food Chem. 2025, 477, 143391. [Google Scholar] [CrossRef]
  10. Shao, W.; Hou, S.; Jia, W.; Zheng, Y. Rapid non-destructive analysis of food nutrient content using swin-nutrition. Foods 2022, 11, 3429. [Google Scholar] [CrossRef]
  11. Li, T.; Wei, W.; Xing, S.; Min, W.; Zhang, C.; Jiang, S. Deep learning-based near-infrared hyperspectral imaging for food nutrition estimation. Foods 2023, 12, 3145. [Google Scholar] [CrossRef] [PubMed]
  12. Naseem, S.; Rizwan, M. The role of artificial intelligence in advancing food safety: A strategic path to zero contamination. Food Control 2025, 175, 111292. [Google Scholar] [CrossRef]
  13. Panahi, O. The Future of Healthcare: AI. Public Health Digit. Revolut. Med. Clin. Case Rep. J. 2025, 3, 763–766. [Google Scholar]
  14. Panahi, O. The role of artificial intelligence in shaping future health planning. Int. J. Health Policy Plan. 2025, 4, 1–5. [Google Scholar]
  15. Wang, W.; Min, W.; Li, T.; Dong, X.; Li, H.; Jiang, S. A review on vision-based analysis for automatic dietary assessment. Trends Food Sci. Technol. 2022, 122, 223–237. [Google Scholar] [CrossRef]
  16. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  17. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A. Semantic image segmentation with deep convolutional nets and fully connected crfs. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  18. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  19. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  20. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 603–612. [Google Scholar]
  21. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  22. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  23. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
  24. Wang, Q.; Dong, X.; Wang, R.; Sun, H. Swin transformer based pyramid pooling network for food segmentation. In Proceedings of the 2022 IEEE 2nd International Conference on Software Engineering and Artificial Intelligence (SEAI), Xiamen, China, 10–12 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 64–68. [Google Scholar]
  25. Alahmari, S.S.; Gardner, M.; Salem, T. Segment Anything in Food Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3715–3720. [Google Scholar]
  26. Lan, X.; Lyu, J.; Jiang, H.; Dong, K.; Niu, Z.; Zhang, Y.; Xue, J. Foodsam: Any food segmentation. IEEE Trans. Multimed. 2023, 27, 2795–2808. [Google Scholar] [CrossRef]
  27. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4015–4026. [Google Scholar]
  28. Wu, X.; Fu, X.; Liu, Y.; Lim, E.P.; Hoi, S.C.; Sun, Q. A large-scale benchmark for food image segmentation. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 506–515. [Google Scholar]
  29. Jaswanthi, R.; Amruthatulasi, E.; Bhavyasree, C.; Satapathy, A. A hybrid network based on GAN and CNN for food segmentation and calorie estimation. In Proceedings of the 2022 International Conference on Sustainable Computing and Data Communication Systems (ICSCDS), Erode, India, 7–9 April 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 436–441. [Google Scholar]
  30. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  31. Muñoz, B.; Martínez-Arroyo, A.; Acevedo, C.; Aguilar, E. Lightweight DeepLabv3+ for Semantic Food Segmentation. Foods 2025, 14, 1306. [Google Scholar] [CrossRef] [PubMed]
  32. Liang, X.; Jia, X.; Huang, W.; He, X.; Li, L.; Fan, S.; Li, J.; Zhao, C.; Zhang, C. Real-time grading of defect apples using semantic segmentation combination with a pruned YOLO V4 network. Foods 2022, 11, 3150. [Google Scholar] [CrossRef]
  33. Verk, J.; Hernavs, J.; Klančnik, S. Using a Region-Based Convolutional Neural Network (R-CNN) for Potato Segmentation in a Sorting Process. Foods 2025, 14, 1131. [Google Scholar] [CrossRef]
  34. Rodríguez-de Vera, J.M.; Villacorta, P.; Estepa, I.G.; Bolaños, M.; Sarasúa, I.; Nagarajan, B.; Radeva, P. Dining on details: Llm-guided expert networks for fine-grained food recognition. In Proceedings of the 8th International Workshop on Multimedia Assisted Dietary Management, Ottawa, ON, Canada, 29 October 2023; pp. 43–52. [Google Scholar]
  35. Ponte, D.; Aguilar, E.; Ribera, M.; Radeva, P. Multi-task visual food recognition by integrating an ontology supported with LLM. J. Vis. Commun. Image Represent. 2025, 10, 104484. [Google Scholar] [CrossRef]
  36. Yu, Q.; Zhao, X.; Pang, Y.; Zhang, L.; Lu, H. Multi-view aggregation network for dichotomous image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3921–3930. [Google Scholar]
  37. Ke, L.; Ye, M.; Danelljan, M.; Tai, Y.W.; Tang, C.K.; Yu, F. Segment anything in high quality. Adv. Neural Inf. Process. Syst. 2023, 36, 29914–29934. [Google Scholar]
  38. Salvador, A.; Hynes, N.; Aytar, Y.; Marin, J.; Ofli, F.; Weber, I.; Torralba, A. Learning cross-modal embeddings for cooking recipes and food images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3020–3028. [Google Scholar]
  39. Okamoto, K.; Yanai, K. Uec-foodpix complete: A large-scale food image segmentation dataset. In Proceedings of the Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, January 10–15, 2021; Proceedings, Part V; Springer: Berlin/Heidelberg, Germany, 2021; pp. 647–659. [Google Scholar]
  40. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in PyTorch. 2017. Available online: https://openreview.net/pdf?id=BJJsrmfCZ (accessed on 20 August 2025).
  41. Huang, H.; Zhou, X.; Cao, J.; He, R.; Tan, T. Vision transformer with super token sampling. arXiv 2022, arXiv:2211.11167. [Google Scholar]
  42. Wei, J.; Wang, S.; Huang, Q. F3Net: Fusion, feedback and focus for salient object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12321–12328. [Google Scholar]
  43. Chen, Z.; Xu, Q.; Cong, R.; Huang, Q. Global context-aware progressive aggregation network for salient object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10599–10606. [Google Scholar]
  44. Mei, H.; Ji, G.P.; Wei, Z.; Yang, X.; Wei, X.; Fan, D.P. Camouflaged object segmentation with distraction mining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8772–8781. [Google Scholar]
  45. Zhu, H.; Li, P.; Xie, H.; Yan, X.; Liang, D.; Chen, D.; Wei, M.; Qin, J. I can find you! boundary-guided separated attention network for camouflaged object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 3608–3616. [Google Scholar]
  46. Hu, H.; Chen, Y.; Xu, J.; Borse, S.; Cai, H.; Porikli, F.; Wang, X. Learning implicit feature alignment function for semantic segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 487–505. [Google Scholar]
  47. Xie, C.; Xia, C.; Ma, M.; Zhao, Z.; Chen, X.; Li, J. Pyramid grafting network for one-stage high resolution saliency detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11717–11726. [Google Scholar]
  48. Qin, X.; Dai, H.; Hu, X.; Fan, D.P.; Shao, L.; Van Gool, L. Highly accurate dichotomous image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 38–56. [Google Scholar]
  49. Pei, J.; Zhou, Z.; Jin, Y.; Tang, H.; Heng, P.A. Unite-divide-unite: Joint boosting trunk and structure for high-accuracy dichotomous image segmentation. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 3 November 2023; pp. 2139–2147. [Google Scholar]
  50. Zhou, Y.; Dong, B.; Wu, Y.; Zhu, W.; Chen, G.; Zhang, Y. Dichotomous image segmentation with frequency priors. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, Macao, China, 19–25 August 2023. IJCAI ’23. [Google Scholar] [CrossRef]
  51. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  52. Zhang, T.; Li, L.; Zhou, Y.; Liu, W.; Qian, C.; Hwang, J.N.; Ji, X. Cas-vit: Convolutional additive self-attention vision transformers for efficient mobile applications. arXiv 2024, arXiv:2408.03703. [Google Scholar]
Figure 1. (a–c) Food images from the FoodSeg103 dataset alongside their corresponding ground-truth semantic segmentation masks.
Figure 2. (a–c) Food images from the UEC-FoodPIX Complete dataset alongside their corresponding ground-truth semantic segmentation masks.
Figure 3. Overall architecture of the MVEANet. STViT acts as the backbone, extracting global features and creating distant views. MCLM and MCRM process these features further, generating multiple intermediate prediction masks. The decoder produces the prediction mask. During training, we optimize the loss function using both the intermediate masks and the final prediction mask, but for testing, only the final prediction mask is output.
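The training scheme described in the Figure 3 caption (supervising both the intermediate masks and the final mask, then keeping only the final mask at test time) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the use of binary cross-entropy, the `aux_weight` value of 0.5, and the function names `bce`, `deep_supervision_loss`, and `predict` are all assumptions introduced for this example, and the masks are assumed to share the resolution of the ground truth.

```python
import math
from typing import Sequence

Mask = Sequence[Sequence[float]]


def bce(pred: Mask, target: Mask, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between a probability map and a binary mask."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1 - p)
            count += 1
    return total / count


def deep_supervision_loss(
    final_pred: Mask,
    intermediate_preds: Sequence[Mask],
    target: Mask,
    aux_weight: float = 0.5,  # hypothetical weight, not taken from the paper
) -> float:
    """Loss on the final mask plus a weighted loss on each intermediate mask."""
    loss = bce(final_pred, target)
    for pred in intermediate_preds:
        loss += aux_weight * bce(pred, target)
    return loss


def predict(final_pred: Mask, intermediate_preds: Sequence[Mask]) -> Mask:
    """At test time the intermediate masks are discarded; only the final mask is returned."""
    return final_pred
```

In this sketch the intermediate masks influence training only through the extra loss terms, so dropping them at inference does not change the final prediction.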
Figure 4. Architecture of the Super Token Vision Transformer (STViT) [41], which uses super token sampling for efficiency.
Figure 5. Architecture of the Multi-view Complementary Localization Module (MCLM), using multi-grained pooling and cross-attention to achieve complementary localization of global and local features.
Figure 6. Architecture of the Multi-view Complementary Refinement Module (MCRM), refining global features using detailed local information and boosting multi-view complementarity via cross-attention.
Figure 7. Segmentation results of the models on UEC-FoodPix Complete.
Figure 8. Segmentation results of the models on FoodSeg103.
Table 1. Comparison with other segmentation methods on UEC-FoodPix Complete. The dataset contains 10,000 images covering 102 dish categories. ↓ indicates that lower values are better, while ↑ indicates that higher values are better.

| Method | MAE ↓ | $F_\beta^{max}$ ↑ | $F_\beta^{\omega}$ ↑ | $S_m$ ↑ | $E_m$ ↑ | mIoU ↑ |
|---|---|---|---|---|---|---|
| F3Net [42] | 0.281 | 0.779 | 0.598 | 0.681 | 0.639 | 0.572 |
| GCPANet [43] | 0.194 | 0.807 | 0.626 | 0.729 | 0.662 | 0.627 |
| PFNet [44] | 0.259 | 0.776 | 0.483 | 0.618 | 0.624 | 0.317 |
| BSANet [45] | 0.295 | 0.759 | 0.535 | 0.661 | 0.607 | 0.437 |
| IFA [46] | 0.321 | 0.726 | 0.503 | 0.636 | 0.585 | 0.420 |
| PGNet [47] | 0.158 | 0.819 | 0.674 | 0.738 | 0.653 | 0.631 |
| ISNet [48] | 0.264 | 0.784 | 0.612 | 0.706 | 0.651 | 0.604 |
| UDUN [49] | 0.163 | 0.822 | 0.663 | 0.747 | 0.685 | 0.647 |
| FP-DIS [50] | 0.163 | 0.832 | 0.653 | 0.741 | 0.668 | 0.628 |
| MVANet [36] | 0.153 | 0.867 | 0.679 | 0.778 | 0.708 | 0.652 |
| Ours | 0.131 | 0.886 | 0.699 | 0.796 | 0.743 | 0.668 |
Table 2. Comparison with other segmentation methods on FoodSeg103. The dataset consists of 7118 images across 103 ingredient categories. ↓ indicates that lower values are better, while ↑ indicates that higher values are better.

| Method | MAE ↓ | $F_\beta^{max}$ ↑ | $F_\beta^{\omega}$ ↑ | $S_m$ ↑ | $E_m$ ↑ | mIoU ↑ |
|---|---|---|---|---|---|---|
| F3Net [42] | 0.268 | 0.698 | 0.559 | 0.653 | 0.519 | 0.526 |
| GCPANet [43] | 0.249 | 0.718 | 0.598 | 0.681 | 0.557 | 0.556 |
| PFNet [44] | 0.303 | 0.769 | 0.497 | 0.643 | 0.503 | 0.465 |
| BSANet [45] | 0.270 | 0.689 | 0.538 | 0.649 | 0.508 | 0.486 |
| IFA [46] | 0.279 | 0.668 | 0.528 | 0.638 | 0.498 | 0.478 |
| PGNet [47] | 0.237 | 0.721 | 0.628 | 0.698 | 0.565 | 0.574 |
| ISNet [48] | 0.253 | 0.712 | 0.562 | 0.677 | 0.532 | 0.529 |
| UDUN [49] | 0.201 | 0.749 | 0.653 | 0.712 | 0.581 | 0.607 |
| FP-DIS [50] | 0.201 | 0.756 | 0.664 | 0.702 | 0.564 | 0.585 |
| MVANet [36] | 0.187 | 0.763 | 0.676 | 0.754 | 0.619 | 0.647 |
| Ours | 0.158 | 0.786 | 0.722 | 0.758 | 0.718 | 0.693 |
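As a rough illustration of how two of the metrics reported in Tables 1 and 2 behave, the sketch below computes MAE and IoU for a single foreground mask in plain Python. The function names, the 0.5 binarization threshold, and the choice to average IoU over images are assumptions made for this example; the evaluation protocol behind the tables is not specified in this section and may differ (for instance, by averaging over classes).

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[float]]


def mean_absolute_error(pred: Mask, gt: Mask) -> float:
    """Average absolute per-pixel difference between a probability map and a binary mask."""
    total, count = 0.0, 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            total += abs(p - g)
            count += 1
    return total / count


def iou(pred: Mask, gt: Mask, threshold: float = 0.5) -> float:
    """Intersection over union of the thresholded prediction and the ground-truth mask."""
    inter, union = 0, 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            p_fg = p >= threshold  # assumed binarization rule
            g_fg = g >= threshold
            inter += p_fg and g_fg
            union += p_fg or g_fg
    # Two empty masks agree perfectly, so treat that case as IoU = 1.
    return inter / union if union else 1.0


def mean_iou(pairs: Sequence[Tuple[Mask, Mask]], threshold: float = 0.5) -> float:
    """Mean of per-image IoU scores (an assumed averaging scheme)."""
    scores = [iou(pred, gt, threshold) for pred, gt in pairs]
    return sum(scores) / len(scores)
```

Lower MAE means the predicted probabilities sit closer to the ground truth at every pixel, while higher IoU means the thresholded regions overlap more, which is why the two columns carry opposite arrows.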
Table 3. Ablation experiments with different backbones on UEC-FoodPix Complete and FoodSeg103. ↑ indicates that higher values are better.

| Method | FPS | UEC-FoodPix Complete mIoU ↑ | FoodSeg103 mIoU ↑ |
|---|---|---|---|
| Swin-Transformer [51] | 5.8 | 0.681 | 0.703 |
| SAM-Encoder [27] | 4.9 | 0.610 | 0.629 |
| CAS-ViT [52] | 8.6 | 0.532 | 0.496 |
| STViT [41] | 6.3 | 0.668 | 0.693 |
Table 4. Ablation experiments of each component. ↓ indicates that lower values are better, while ↑ indicates that higher values are better.

| Multi-View | MCLM | MCRM | HQ-Token | MAE ↓ (UEC-FoodPIX Complete) | mIoU ↑ (UEC-FoodPIX Complete) |
|---|---|---|---|---|---|
| | | | | 0.179 | 0.594 |
| | | | | 0.171 | 0.641 |
| | | | | 0.173 | 0.633 |
| | | | | 0.169 | 0.652 |
| | | | | 0.163 | 0.667 |
| | | | | 0.158 | 0.693 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, C.; Sheng, G.; Min, W.; Wu, X.; Jiang, S. Multi-View Edge Attention Network for Fine-Grained Food Image Segmentation. Foods 2025, 14, 3016. https://doi.org/10.3390/foods14173016

Chicago/Turabian Style

Liu, Chengxu, Guorui Sheng, Weiqing Min, Xiaojun Wu, and Shuqiang Jiang. 2025. "Multi-View Edge Attention Network for Fine-Grained Food Image Segmentation" Foods 14, no. 17: 3016. https://doi.org/10.3390/foods14173016