A Method for Reconstructing and Predicting the Volume of Bowl-Type Tableware and Its Application in Dietary Analysis

Ji, Xu; Song, Kai; Sun, Lianzheng; Lu, Haolin; Zhang, Hengyuan; Feng, Yiran

doi:10.3390/sym18010199

by Xu Ji ¹, Kai Song ¹, Lianzheng Sun ², Haolin Lu ¹, Hengyuan Zhang ¹ and Yiran Feng ¹,³,⁴,*

¹ Department of Mechanical Engineering and Automation, Dalian Polytechnic University, Dalian 116034, China

² Qingdao Yeelink Information Technology Co., Ltd., Qingdao 266100, China

³ SKL of Marine Food Processing & Safety Control, National Engineering Research Center of Seafood, Dalian Polytechnic University, Dalian 116034, China

⁴ Department of Key Laboratory of Marine Food Processing Technology and Equipment of Liaoning Province, Dalian Polytechnic University, Dalian 116034, China

* Author to whom correspondence should be addressed.

Symmetry 2026, 18(1), 199; https://doi.org/10.3390/sym18010199

Submission received: 26 December 2025 / Revised: 14 January 2026 / Accepted: 19 January 2026 / Published: 21 January 2026

(This article belongs to the Section Computer)

Abstract

To overcome the low accuracy of conventional methods for estimating liquid volume and food nutrient content in bowl-type tableware, as well as the tool dependence and time-consuming nature of manual measurements, this study proposes an integrated approach that combines geometric reconstruction with deep learning–based segmentation. After a one-time camera calibration, only a frontal and a top-down image of a bowl are required. The pipeline automatically extracts key geometric information, including rim diameter, base diameter, bowl height, and the inner-wall profile, to complete geometric modeling and capacity computation. The estimated parameters are stored in a reusable bowl database, enabling repeated predictions of liquid volume and food nutrient content at different fill heights. We further propose Bowl Thick Net to predict bowl wall thickness with millimeter-level accuracy. In addition, we developed a Geometry-aware Feature Pyramid Network (GFPN) module and integrated it into an improved Mask R-CNN (Region-based Convolutional Neural Network) framework to enable precise segmentation of bowl contours. By integrating the contour mask with the predicted bowl wall thickness, precise geometric parameters for capacity estimation can be obtained. Liquid volume is then predicted using the geometric relationship of the liquid or food surface, while food nutrient content is estimated by coupling predicted food weight with a nutritional composition database. Experiments demonstrate an arithmetic mean error of −3.03% for bowl capacity estimation, a mean liquid-volume prediction error of 9.24%, and a mean nutrient-content (by weight) prediction error of 11.49% across eight food categories.

Keywords:

image segmentation; nutrient content prediction; model reconstruction; volume estimation

1. Introduction

With the growing global emphasis on healthy diets, precision nutrition, and quality-of-life management, dietary management and nutritional assessment systems have become active research topics in public health, intelligent healthcare, and data-driven science. Conventional dietary assessment approaches—such as manual weighing, handwritten logging, and subjective experience–based estimation—are labor-intensive and time-consuming, prone to error, and thus insufficient for the increasingly accurate and efficient dietary management demanded by modern society. In recent years, automated dietary assessment systems based on image analysis, computer vision, and deep learning have emerged, with particular interest in food nutrient prediction. However, most existing methods still rely on combining 2D imagery with depth cameras, which increases hardware cost and deployment complexity and makes robust use in everyday environments challenging [1,2,3].

In the field of computer vision and deep learning, Dehais et al. proposed a method for reconstructing bowl-shaped container models based on two-dimensional images. This approach employs multi-view image reconstruction techniques to model the container’s geometry, enabling liquid volume estimation [4]. Jia et al. suggested restoring the 3D inner-wall curve of a bowl using a single top-down view with an attached calibrated paper tape. The reconstructed bowl is then projected onto an image containing the contents, allowing volume estimation by reading the liquid surface or food height [5]. Li et al. further parameterised bowl shapes as power-function cross-sections and formulated an optimisation problem solved from ruler-marked points [6]. This approach enables volume estimation from a single image, even without the constraint of a camera positioned directly overhead. Schober et al. addressed transparent liquids by employing a CNN (Convolutional Neural Network) to segment liquid surfaces from containers, integrating depth information to translate liquid level and geometric inference into volume estimation [7]. Cobo et al. constructed and trained a CNN regression model to directly predict red wine volume in a glass from a single-view photograph [8]. Among approaches that predict volume before weight or nutritional content, Zhao et al. used overhead food images as input, first segmenting dish regions with U-Net and then regressing calories and various nutrients with ResNet [9]. Han et al. introduced monocular depth prediction and RGB-D feature fusion to enhance three-dimensional food representation and support nutritional estimation [10]. Table 1 summarises the results and limitations of this prior work on bowl volume prediction and on nutritional content estimation for liquids or food within bowls, using image-based or other methods.

To address the challenge of accurately predicting liquid volume and food nutrient content, this paper proposes a capacity prediction method that integrates deep learning–based segmentation with model reconstruction of bowl-type tableware. Using commonly used bowls as the study objects, we extract key geometric parameters—including rim diameter, base diameter, effective height, and the inner-wall contour—and construct a reconstructed model to enable subsequent estimation of liquid volume and prediction of food nutrient content.

The main contributions of this study are as follows:

(a) Compared with conventional manual measurement, which is time-consuming, and existing methods that often yield inaccurate reconstruction of bowl-type container models, this study uses only a frontal-view and a top-down-view image of the bowl, without requiring a depth camera. We adopt an improved deep learning–based mask segmentation model and leverage multi-view images to reconstruct a geometry-parameterized bowl model, enabling accurate capacity estimation. The resulting capacity model further supports subsequent prediction of the liquid volume and food nutrient content of items contained in the bowl.

(b) We propose Bowl Thick Net, a model that predicts bowl wall thickness by detecting and fitting circles to the inner and outer rims at the bowl mouth, thereby further improving the accuracy of bowl capacity estimation.

(c) We propose a Geometry-aware Feature Pyramid Network (GFPN) module that, when integrated into an improved Mask R-CNN framework, enables accurate segmentation of bowl contour masks. Combined with Bowl Thick Net and geometric reasoning, the proposed pipeline recovers the bowl’s geometric parameters, constructs a reconstructed bowl model, and estimates its capacity, achieving an arithmetic mean error of −3.03% in bowl volume prediction.

(d) We propose a method for predicting the liquid volume contained in a bowl and the nutrient content of the served food, and we develop a visualization platform that integrates image processing, geometric modeling, and liquid-volume or nutrient-content prediction. After a one-time camera calibration, the system requires only a frontal view and a top-down view of the bowl to automatically extract the rim, base, height, and inner-wall contour dimensions, enabling non-contact, semi-automated modeling and capacity estimation. The resulting parameters are stored in a reusable bowl database to support repeated use. Given a top-down image of the liquid or food in the bowl, the platform can rapidly predict liquid volume and food nutrient content at different fill heights. Experimental results show a mean liquid-volume prediction error of 9.24% and a mean prediction error of 11.49% for nutrient content (by weight) across eight food categories.

2. Materials and Methods

2.1. Overall Task Flow

This section outlines the overall workflow of this research, shown in Figure 1. The work is divided into four parts. Section 2.2 covers camera calibration and distortion correction: it obtains the intrinsic and extrinsic parameters and establishes the conversion from pixels to physical dimensions, providing a unified scale reference for all subsequent geometric measurements. Sections 2.3 and 2.4 preprocess the top and front views and extract contours. In the top view, the inner- and outer-wall circles of the bowl are fitted and identified, and the wall thickness is predicted using Bowl Thick Net to obtain the key geometric constraints required for subsequent modeling. Sections 2.5 and 2.6 extract the outer-wall contour points from the front-view segmentation results and perform parametric fitting. The wall-thickness information is then used to correct the inner-wall contour, completing the axisymmetric reconstruction of the bowl and the calculation of its volume. The reconstructed bowl model is also stored in a bowl instance library so that different bowl types can be reused quickly. Section 2.7 demonstrates the application of the method to dietary analysis, given a known bowl instance model. Masks and information are extracted separately for liquids and food. Liquid volume prediction computes the liquid height and volume from the liquid-surface information and the bowl model. Food nutrient prediction first identifies the food category and extracts its mask, then combines the mask with the bowl model to calculate the food volume, and finally converts this volume into food weight and the corresponding nutrient content using the density and nutrient information database.

2.2. Camera Calibration and Perspective Correction

The frontal-view and top-down images of bowl-type tableware in this study were captured using a Hikvision (Hangzhou, China) MV-CS060-10UC-PRO industrial camera with a resolution of 3072 × 2048 and a maximum frame rate of 59.6 fps. Camera calibration is a critical step to ensure accurate estimation of tableware dimensions. Based on the pinhole camera model, we establish the mapping between pixel coordinates and world coordinates, thereby enabling dimensional measurement from 2D images to 3D models and compensating for errors caused by lens distortion. Variations in object distance induced by different camera models and settings may change the scale factor between image pixels and real-world physical units, thereby degrading measurement accuracy. To improve the precision of tableware dimension measurement, it is necessary to calibrate object-distance variations and compensate for height-related errors. Although the images used in the subsequent experiments were resized to $640 \times 640$, an industrial camera was chosen to ensure the stability and reproducibility of geometric measurements. Its low distortion, controllable intrinsic exposure, and ability to retain clear edges even after downscaling from high-resolution acquisition are beneficial for accurate fitting and scale conversion. In contrast, mobile phones or ordinary cameras often employ implicit processing such as autofocus, auto exposure, multi-frame fusion, distortion correction, and food enhancement, which can easily introduce scale drift and edge distortion, reducing measurement consistency.

This paper adopts the planar calibration method proposed by Zhang, using a two-dimensional checkerboard calibration target whose squares have a side length of 25 mm. During the calibration process, 15 images of the calibration target are acquired by the camera at different heights and viewpoints (these 15 images are calibration images, not dataset samples), and the pixel coordinates of the checkerboard corner points are extracted [11]. Subsequently, the homography matrix at each viewpoint is calculated to estimate the camera's intrinsic parameters and its extrinsic (pose) parameters. Lens distortion parameters are further optimized using nonlinear least squares, and finally, maximum likelihood estimation is used to integrate the multi-view results to obtain stable calibration parameters. The above process is completed using the Camera Calibrator tool in MATLAB version R2019b, which establishes the mapping between pixel coordinates and the true physical scale, providing a unified scale reference for subsequent geometric fitting and volume calculation and reducing systematic errors caused by distortion.

In each calibration image, corner detection is performed to obtain the pixel coordinates of the checkerboard corners. We then solve for the mapping between the image pixel coordinates and the corresponding real-world geometric coordinates, thereby estimating the camera’s geometric parameters—including focal length, principal point, and intrinsic matrix—as well as its pose parameters, i.e., the rotation matrix, translation vector, and extrinsic matrix, together with the lens distortion coefficients [12]. These parameters determine the camera pose in the physical world, and the accuracy of the calibration model is verified by statistics of the reprojection error. In our experiments, the mean reprojection error is 0.23 pixels, indicating a small calibration error and ensuring the reliability of subsequent measurements. Figure 2 shows the reprojection error statistics of the camera calibration.

During calibration, the checkerboard target was placed within the camera’s depth of field, and its pose was corrected using a spirit level and the camera pose parameters. Based on the established mapping between pixel and world coordinates, we obtained a pixel-to-world scale of $P_x = 0.1719$ mm/pixel when the camera-to-target distance was 954.325 mm and the image resolution was $1280 \times 1280$ pixels. To facilitate subsequent image processing and bowl segmentation, the images were resized to $640 \times 640$. Accordingly, the pixel-to-world mapping was adjusted using the scaling factor, yielding $P_x = 0.3438$ mm/pixel at a resolution of $640 \times 640$.
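The scale adjustment described above is a simple proportionality, and a minimal sketch of it is given here. The function names are illustrative and not taken from the paper; the only values used are the scale of 0.1719 mm/pixel at 1280 × 1280 and the target resolution of 640 × 640 reported in the text.

```python
def rescale_pixel_size(mm_per_px: float, old_side_px: int, new_side_px: int) -> float:
    """Adjust a pixel-to-world scale after resizing a square image.

    Halving the image side means each pixel covers twice the physical length.
    """
    return mm_per_px * old_side_px / new_side_px


def pixels_to_mm(length_px: float, mm_per_px: float) -> float:
    """Convert a length measured in pixels to millimetres."""
    return length_px * mm_per_px


# Values reported in the text: 0.1719 mm/pixel at 1280 x 1280.
scale_640 = rescale_pixel_size(0.1719, 1280, 640)
print(round(scale_640, 4))  # 0.3438
```

The scale is only valid for objects lying at the calibrated camera-to-target distance, which is why the text fixes that distance at 954.325 mm.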

2.3. Image Enhancement and Preprocessing for Geometric Measurement

The outline of a bowl is usually composed of straight lines and arcs, and subsequent contour fitting and volume estimation are very sensitive to the accuracy of edge coordinates. Preprocessing therefore needs to improve robustness without changing the pixel-to-physical scale mapping. To reproduce common image degradation in actual acquisition, we simulated three phenomena that occur easily during shooting: local highlights (bright edges caused by specular reflection or direct light), salt-and-pepper noise (impulse noise introduced by reflected flash, compression artifacts, or transient interference), and random impulse noise clusters (clusters of bright and dark areas caused by electromagnetic interference or sensor transients). Through image comparison, we found that median filtering has a stronger ability to suppress the aforementioned outliers and impulse noise, and better preserves edge position and sharpness, thus providing a more stable input for contour segmentation, geometric fitting, and volume calculation [13]. Figure 3 shows the three image degradation cases and the corresponding images after grayscale conversion and median filtering.
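The edge-preserving behaviour attributed to median filtering can be illustrated with a small sketch. It assumes a 3 × 3 window with replicated borders (the paper does not state the kernel size) and uses a synthetic step edge rather than a bowl image.

```python
def median_filter_3x3(img):
    """Apply a 3x3 median filter to a 2D list of grey values (borders replicated)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = []
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr, 0), h - 1)  # clamp row index to the image
                    cc = min(max(c + dc, 0), w - 1)  # clamp column index to the image
                    window.append(img[rr][cc])
            out[r][c] = sorted(window)[4]  # middle of the 9 sorted values
    return out


# Synthetic vertical step edge between columns 2 and 3.
clean = [[10, 10, 10, 200, 200, 200] for _ in range(5)]
noisy = [row[:] for row in clean]
noisy[2][1] = 255  # "salt" impulse on the dark side
noisy[1][4] = 0    # "pepper" impulse on the bright side

filtered = median_filter_3x3(noisy)
print(filtered == clean)  # True: impulses removed, edge position unchanged
```

Because the filter only replaces each pixel with a value already present in its neighbourhood, isolated impulses vanish while the step stays at the same column, which is the property the measurement pipeline relies on.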

In addition to the three types of degradation enhancements mentioned above, to improve the algorithm’s adaptability to changes in shooting conditions, we also adopted enhancement strategies that only change the grayscale distribution without altering the geometric relationships. These include small perturbations in brightness, contrast, and gamma, slight color jitter before grayscale processing, and moderate Gaussian noise and slight blurring. These do not change pixel coordinates and calibration scale, thus not disrupting the mapping relationship. Random scaling, cropping, rotation resampling, affine and perspective transformations, and non-uniform stretching directly change the relationship between pixels and physical scale, leading to systematic deviations in diameter, wall thickness, and volume integrals; therefore, they were not used in this experiment. To unify the processing flow and facilitate dataset construction, even the unenhanced original images underwent uniform grayscale conversion and median filtering. Since median filtering is a local, lightweight non-linear smoothing, it only introduces a small degree of pixel change in low-noise samples, but it can significantly improve the consistency of data across different batches and under different lighting conditions, as well as the robustness of subsequent contour extraction. Finally, both the front and top views of the bowl were preprocessed using a unified strategy of first grayscale conversion and then median filtering.

After filtering the frontal-view and top-down images, we extract the inner and outer rim contours from the top-down view to support subsequent bowl wall-thickness prediction. To this end, we design a two-stage pipeline consisting of (i) image processing and (ii) contour extraction. To evaluate the effectiveness of different method combinations, we tested 40 combinations (each comprising one image-processing method and one contour-extraction method). By visually inspecting and comparing the resulting images, we selected the optimal scheme for subsequent bowl wall-thickness prediction.

In stage (i), we apply seven common image-processing methods to enhance bowl contrast and improve contour separability. (a) CLAHE (Contrast Limited Adaptive Histogram Equalization) equalizes the histogram within local regions under a contrast limit, enhancing low-contrast details and emphasizing bowl edges and specular highlights [14]. (b) Histogram Equalization adjusts the global intensity distribution to alleviate uneven background illumination [15]. (c) Inversion reverses pixel intensities to increase the visual distinction between the bowl and the background when their contrast is weak. The remaining four methods are morphological operations [16]. (d) The Top Hat transform subtracts the opening of the image from the original image, enhancing bright details and suppressing slowly changing background components, thereby highlighting contour structures [17]. (e) Black Hat subtracts the original image from the closed image to enhance dark-region details, which is beneficial when the bowl’s inner surface appears darker [18]. (f) The Closing Operation applies dilation followed by erosion to fill small gaps and holes, improving contour continuity. (g) The Opening Operation applies erosion followed by dilation to remove small artifacts and isolated noise, producing cleaner images for subsequent contour extraction [19].

In stage (ii), we employ four edge-detection techniques for contour extraction. (a) The High-Pass Filter preserves high-frequency components to accentuate intensity transitions, thereby sharpening the bowl-rim contours [20]. (b) Canny applies a multi-step procedure (denoising, gradient computation, non-maximum suppression, and hysteresis thresholding) to stably extract the primary outer contours and edges while suppressing noise [21]. (c) The Laplacian operator localizes edges using the second-order grayscale derivative; although it is sensitive to noise, it provides complementary capability for capturing inner-wall details [22]. (d) The Morphological Gradient extracts contours by computing the difference between dilation and erosion, which is well suited to bowls with regular structures and distinct edges, highlighting the outer boundary while reducing interference from internal texture [23].
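Of the four detectors, the morphological gradient has the most compact definition (dilation minus erosion), so a minimal sketch is given for it. The 3 × 3 square structuring element and the toy image are assumptions made for illustration only.

```python
def morphological_gradient_3x3(img):
    """Return dilation minus erosion using a 3x3 square structuring element."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [
                img[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, h))
                for cc in range(max(c - 1, 0), min(c + 2, w))
            ]
            # Dilation is the local maximum, erosion is the local minimum.
            out[r][c] = max(window) - min(window)
    return out


# Toy image: a bright 3x3 block on a dark 7x7 background.
image = [[0] * 7 for _ in range(7)]
for r in range(2, 5):
    for c in range(2, 5):
        image[r][c] = 255

gradient = morphological_gradient_3x3(image)
for row in gradient:
    print(row)
```

The response is zero in flat regions (background and block interior) and non-zero only in a band around the boundary, which is why the operator suits objects with regular shapes and distinct edges.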

By combining the stage-one image processing methods with the stage-two contour extraction methods, we processed the top-down images of bowls and obtained image-processing results for 40 different method combinations. The experimental results show that, as illustrated in Figure 4, the combination of CLAHE and Canny significantly enhances the difference between the inner and outer walls, outperforming the other methods as well as the original image. Under this setting, both high-pass filter and Canny can extract the inner- and outer-wall contours; however, high-pass filtering is prone to adhesion. In contrast, Canny produces fewer breakpoints and branches, yields smoother contours, and provides clearer separation, making the processed top-down bowl images more suitable for subsequent bowl thickness prediction.

2.4. Bowl Wall Thickness Estimation Model—Bowl Thick Net

In bowl model reconstruction and capacity prediction, accurate estimation of geometric parameters is crucial, particularly because differences between the inner and outer wall structures affect the accuracy of capacity computation. In this section, circle localization is performed not on the original image but on the edge map obtained after CLAHE and Canny edge extraction. Under this setting, the classic Hough circle transform could also be used for circle detection, but it usually outputs multiple candidate circles, which may simultaneously cover arcs or circles formed by the outer wall, the inner wall, and other edge structures [24]. Stably obtaining the largest (outer) circle and the second-largest (inner) circle together with their geometric parameters then requires additional candidate-selection and constraint strategies, such as radius-range limits, circle-center consistency checks, deduplication, and sorting. To keep the measurement process uniform and suitable for batch processing, this paper adopts a structured-output learning method that directly predicts and distinguishes the largest and second-largest circles and outputs their geometric parameters, thereby reducing additional rule design and parameter processing and making subsequent contour fitting and volume estimation more consistent and reproducible. In this study, we adopt ConvNeXt-Tiny as the backbone network and use ResNet-18 as a baseline for comparison. We incorporate a Transformer-based framework with the Hungarian matching algorithm, and investigate the effects of Spatial Positional Encoding and Circle NMS (Circle Non-Maximum Suppression) on the recognition accuracy of the inner- and outer-rim contours at the bowl mouth [25]. We design a Bowl Thick Net model that represents the inner and outer rim contours as two equivalent circles. The model detects the circular contours of the inner and outer rims in the top-down image and predicts bowl wall thickness from the size difference between the two fitted circles. This subsection details the feature extraction module, encoder–decoder architecture, post-processing, underlying principles, and experimental evaluation of the proposed model.

2.4.1. Feature Extraction for the Bowl Thick Net Model

In the feature extraction stage, the input consists of a batch of $640 \times 640$ RGB (Red Green Blue) top-down images of a bowl. Features are extracted using a backbone network, where we compare ConvNeXt-Tiny with ResNet-18; this subsection focuses on ConvNeXt-Tiny [26]. Compared with the conventional ResNet-18, ConvNeXt employs convolutional operations such as depthwise convolution and pointwise convolution. Its hierarchical design from Stage 1 to Stage 4 progressively increases the channel dimension while reducing the feature-map resolution, thereby enhancing representational capacity. This design is particularly advantageous over ResNet-18 in capturing fine image details and higher-level features. The first layer is the stem, which maps the three-channel input to 96 channels, producing a feature map of size $B \times 96 \times 160 \times 160$. Stage 1 contains three ConvNeXt blocks and performs feature extraction via depthwise and pointwise convolutions, reducing the feature map to $80 \times 80$ with 192 channels. Stage 2 contains three ConvNeXt blocks, reducing the feature map to $40 \times 40$ with 384 channels. Stage 3 contains nine ConvNeXt blocks, further reducing the feature map to $20 \times 20$ with 768 channels. Finally, Stage 4 extracts higher-level features and outputs a feature map of size $B \times 256 \times 20 \times 20$.
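The shape bookkeeping in this subsection can be checked with a few lines of arithmetic. The sketch below only reproduces the channel counts and resolutions stated in the text; the stem stride of 4 is inferred from the 640 to 160 reduction, and the function is illustrative rather than a description of any library's ConvNeXt implementation.

```python
def backbone_shapes(input_side: int = 640, out_channels: int = 256):
    """List (name, channels, spatial side) for each backbone step described in the text."""
    side = input_side // 4  # assumed stem stride of 4: 640 -> 160
    shapes = [("stem", 96, side)]
    for name, channels in (("stage1", 192), ("stage2", 384), ("stage3", 768)):
        side //= 2  # each of these stages halves the resolution, per the text
        shapes.append((name, channels, side))
    # Stage 4 keeps the 20 x 20 resolution and outputs 256 channels, per the text.
    shapes.append(("stage4", out_channels, side))
    return shapes


shapes = backbone_shapes()
for name, channels, side in shapes:
    print(f"{name}: B x {channels} x {side} x {side}")

num_tokens = shapes[-1][2] ** 2  # 20 * 20 positions become the Transformer sequence
print(num_tokens)  # 400
```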

2.4.2. Spatial Positional Encodings

To enable the network to capture spatial location information, particularly for detecting circular geometric parameters, we explicitly incorporate PE (Positional Encoding) into the feature maps [27]. Because the Transformer architecture is inherently not position-aware, we generate PE using 2D sine and cosine functions to ensure that the network can model spatial relationships across different locations. The formulations of the 2D sine and cosine positional encodings are given in Equations (1) and (2), respectively:

$$PE_{(i,j),\,2k} = \sin\left(\frac{i}{10000^{2k/D}}\right) \tag{1}$$

$$PE_{(i,j),\,2k+1} = \cos\left(\frac{i}{10000^{2k/D}}\right) \tag{2}$$

In these equations, $i$ and $j$ denote the row and column indices of a location in the feature map. Since the feature map size is $20 \times 20$, $i$ and $j$ range from 0 to 19, i.e., $i, j \in \{0, 1, \dots, 19\}$. The variable $k$ is the index of the positional-encoding dimension, and $D = 256$ is the positional-encoding dimensionality. The term $10000^{2k/D}$ controls the frequency of the sine or cosine functions at different dimensions, so that the encoding of each position has different scales across dimensions, ensuring encoding diversity and strong positional discriminability.

The PE generated by the above equations is added element-wise to the backbone output feature map $F$ of size $B \times 256 \times 20 \times 20$, yielding a position-aware feature map $\tilde{F}$ with the same size $B \times 256 \times 20 \times 20$. The spatial resolution and channel dimension remain unchanged, while each spatial location now contains explicit positional information. The feature map is then flattened into a sequence $Z_0$ of size $B \times 400 \times 256$, where 400 is the sequence length after flattening ($20 \times 20$). Specifically, each pixel in the $20 \times 20$ feature map is converted into a 256-dimensional token, preserving spatial correspondence and providing the input for subsequent processing.
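A minimal sketch of such an encoding and of the flattening step follows. Equations (1) and (2) are written in terms of the row index $i$ only; the sketch assumes, as is common in DETR-style models, that the first half of the channels encodes the row and the second half encodes the column with the same frequencies. That channel layout is an assumption, not something stated in the paper.

```python
import math


def sine_position_encoding(height=20, width=20, dim=256, temperature=10000.0):
    """Build a height x width x dim sinusoidal encoding (assumed row/column split)."""
    half = dim // 2
    pe = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    for i in range(height):
        for j in range(width):
            for k in range(half // 2):
                freq = temperature ** (2 * k / dim)  # 10000^(2k/D) from Eqs. (1)-(2)
                pe[i][j][2 * k] = math.sin(i / freq)             # Eq. (1), row index
                pe[i][j][2 * k + 1] = math.cos(i / freq)         # Eq. (2), row index
                pe[i][j][half + 2 * k] = math.sin(j / freq)      # assumed column analogue
                pe[i][j][half + 2 * k + 1] = math.cos(j / freq)  # assumed column analogue
    return pe


pe = sine_position_encoding()

# Flatten the 20 x 20 grid row by row into 400 tokens of 256 dimensions each.
tokens = [pe[i][j] for i in range(20) for j in range(20)]
print(len(tokens), len(tokens[0]))  # 400 256
```

In the model the encoding is added to the backbone features before flattening; only the encoding itself is shown here, so the token list corresponds to the positional part of $Z_0$.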

2.4.3. Encoder and Decoder

The encoder consists of six layers, each comprising a multi-head self-attention module and a FFN (Feed-Forward Network). In each layer, the input is processed with Add & Norm to ensure stable gradient propagation. The encoder outputs a tensor of size $B \times 400 \times 256$, which is fed into the decoder and serves as the Key and Value in cross-attention. In the decoder, the inputs include the encoder output and 10 learnable object queries. Each query interacts with the encoded features through self-attention and cross-attention. Each decoder layer likewise contains multi-head self-attention and an FFN, with the central objective of generating the final predictions, including the circle center coordinates $c_x, c_y$ and the circle diameter $d$. The regression is performed via two branches: the Circle Head, which predicts geometric attributes, and the Obj Head, which determines target existence [28]. The predicted circular parameters are then de-normalized by mapping the coordinates and diameter from $[0, 1]$ back to pixel values, as shown in Equation (3):

$$c_x = c_x^{n} \times 640, \quad c_y = c_y^{n} \times 640, \quad d = d^{n} \times 640 \tag{3}$$

In this equation, $c_x^{n}$, $c_y^{n}$, and $d^{n}$ are the normalized outputs of the model, representing the normalized x-coordinate of the circle center, the normalized y-coordinate of the circle center, and the normalized circle diameter, respectively. All three values lie in the range $[0, 1]$, and 640 denotes the input image resolution.
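As a worked instance of Equation (3), the sketch below maps one normalized prediction back to pixels. The function name and the example values are hypothetical.

```python
def denormalize_circle(cx_n: float, cy_n: float, d_n: float, image_side: int = 640):
    """Map a normalized circle (centre x, centre y, diameter) in [0, 1] to pixels, Eq. (3)."""
    for value in (cx_n, cy_n, d_n):
        if not 0.0 <= value <= 1.0:
            raise ValueError("normalized outputs are expected to lie in [0, 1]")
    return cx_n * image_side, cy_n * image_side, d_n * image_side


# Hypothetical prediction: a circle centred in the image spanning a quarter of its width.
circle_px = denormalize_circle(0.5, 0.5, 0.25)
print(circle_px)  # (320.0, 320.0, 160.0)
```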

2.4.4. Post-Processing and Matching

In the post-processing and matching stage, to accurately match each predicted circle to its corresponding ground-truth circle, we employ the Hungarian matching algorithm to compute the matching cost between predicted and ground-truth circles. The cost matrix consists of a geometric loss and an existence loss. The geometric loss is measured using the Smooth L1 loss [29], while the existence loss is computed using the BCEWithLogits loss [30]. The existence loss is defined in Equation (4):

$$L_{exist} = -\frac{1}{N} \sum_{i} \left[ y_i \log \sigma(\hat{y}_i) + (1 - y_i) \log\left(1 - \sigma(\hat{y}_i)\right) \right] \tag{4}$$

In this equation, $N$ denotes the number of samples, which here refers to the total number of detected circles. $y_i$ is the ground-truth label for the $i$-th sample: $y_i = 1$ indicates that the circle corresponds to a target (i.e., the target exists), whereas $y_i = 0$ indicates that it is not a target (i.e., the target does not exist). $\hat{y}_i$ is the model’s raw prediction score for the $i$-th sample. This score can take any real value and is used to determine whether the sample is a target. $\sigma(\hat{y}_i)$ is the output of the sigmoid activation function, representing the probability that the $i$-th sample is a target by converting the raw score into a probability. By combining the geometric loss and the existence loss, we obtain the overall matching cost, as defined in Equation (5):

$$L_{total} = \lambda_{geo} \cdot L_{geo} + \lambda_{exist} \cdot L_{exist} \tag{5}$$

In this equation, $L_{geo}$ denotes the geometric loss and $L_{exist}$ denotes the existence loss. $\lambda_{geo}$ and $\lambda_{exist}$ are hyperparameters used to balance the respective contributions of the geometric and existence losses.
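The matching cost of Equations (4) and (5) can be sketched as follows. Several choices here are assumptions rather than details from the paper: both weights are set to 1, the Smooth L1 threshold `beta` is 1, and an exhaustive search over assignments stands in for the Hungarian algorithm (it finds the same minimum-cost assignment, but is only practical for a handful of queries and targets).

```python
import math
from itertools import permutations


def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 distance between two parameter tuples (cx, cy, d)."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)


def bce_with_logits(logit, label):
    """Numerically stable form of -[y log s(x) + (1 - y) log(1 - s(x))], Eq. (4) for one sample."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def match(pred_circles, pred_logits, gt_circles, lam_geo=1.0, lam_exist=1.0):
    """Return the query index assigned to each ground-truth circle (minimum total cost)."""
    best_cost, best_assign = float("inf"), None
    for assign in permutations(range(len(pred_circles)), len(gt_circles)):
        cost = sum(
            lam_geo * smooth_l1(pred_circles[q], gt)          # geometric term of Eq. (5)
            + lam_exist * bce_with_logits(pred_logits[q], 1.0)  # matched query should "exist"
            for q, gt in zip(assign, gt_circles)
        )
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_assign


# Hypothetical normalized predictions (cx, cy, d) and existence logits for three queries.
preds = [(0.5, 0.5, 0.30), (0.5, 0.5, 0.80), (0.1, 0.1, 0.20)]
logits = [2.0, 2.0, -3.0]
# Ground truth: outer rim first, inner rim second.
targets = [(0.5, 0.5, 0.82), (0.5, 0.5, 0.31)]

assignment = match(preds, logits, targets)
print(assignment)  # (1, 0): query 1 matches the outer rim, query 0 the inner rim
```

For larger problems a polynomial-time assignment solver would normally replace the exhaustive search.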

2.4.5. Circle NMS

Circle NMS is a critical step for handling multiple circle predictions, particularly when detecting the outer and inner rims of a bowl, where several predictions may be produced and some may be redundant or noisy. Circle NMS suppresses duplicate circles based on the distance between circle centers and the relative difference in diameters, ensuring that only the most accurate outer- and inner-rim predictions are retained [31]. This procedure is essential for improving the stability and robustness of the model.
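A minimal sketch of this suppression rule is shown below. The two thresholds (`center_tol` in pixels and `rel_diam_tol` as a relative diameter difference) and the example detections are hypothetical, since the paper does not report the values it uses.

```python
import math


def circle_nms(circles, scores, center_tol=5.0, rel_diam_tol=0.05):
    """Keep the highest-scoring circle of each near-duplicate group.

    circles: list of (cx, cy, d) in pixels; scores: confidence per circle.
    Returns the indices of the retained circles, ordered by descending score.
    """
    order = sorted(range(len(circles)), key=lambda idx: scores[idx], reverse=True)
    kept = []
    for idx in order:
        cx, cy, d = circles[idx]
        duplicate = False
        for kept_idx in kept:
            kx, ky, kd = circles[kept_idx]
            center_dist = math.hypot(cx - kx, cy - ky)
            rel_diam_diff = abs(d - kd) / max(d, kd)
            # A circle is a duplicate only if BOTH centre and diameter nearly coincide.
            if center_dist <= center_tol and rel_diam_diff <= rel_diam_tol:
                duplicate = True
                break
        if not duplicate:
            kept.append(idx)
    return kept


# Hypothetical detections: two near-identical outer rims and two near-identical inner rims.
detections = [(320, 320, 400), (321, 319, 402), (320, 320, 370), (322, 321, 371)]
confidences = [0.90, 0.80, 0.85, 0.60]

kept = circle_nms(detections, confidences)
print(kept)  # [0, 2]: one outer-rim and one inner-rim circle survive
```

Requiring agreement in diameter as well as in centre is what lets the concentric inner and outer rims coexist: they share a centre but differ in size, so neither suppresses the other.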

2.4.6. Principle of Bowl Thick Net for Wall-Thickness Prediction

The Bowl Thick Net model approximates the inner and outer rim contours at the bowl mouth as two circles. By detecting the two largest circular contours in the top-down image, it maps the predicted circle centers and diameters from the $[0, 1]$ range back to pixel values. Using the pixel-to-world mapping and the camera calibration parameters, the diameters of the two circles are then computed in real-world units. The bowl wall thickness is obtained by halving the difference between the two diameters. The corresponding formula is given in Equation (6):

$$T = \frac{d_{outer} - d_{inner}}{2} \tag{6}$$

In this equation, $T$ denotes the bowl wall thickness, $d_{outer}$ is the diameter of the largest circle (the outer-wall rim contour), and $d_{inner}$ is the diameter of the second-largest circle (the inner-wall rim contour). Figure 5 illustrates the architecture of Bowl Thick Net: Bowl Wall Thickness Estimation Network.
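Equation (6) combined with the pixel-to-world scale gives the thickness in millimetres. The sketch below assumes the rim lies at the calibrated distance, so that the 0.3438 mm/pixel scale from Section 2.2 applies; the pixel diameters are hypothetical.

```python
def wall_thickness_mm(diameters_px, mm_per_px=0.3438):
    """Compute wall thickness from detected circle diameters (pixels), following Eq. (6).

    The two largest diameters are taken as the outer and inner rim, respectively.
    """
    if len(diameters_px) < 2:
        raise ValueError("need at least two circles (outer and inner rim)")
    d_outer_px, d_inner_px = sorted(diameters_px, reverse=True)[:2]
    d_outer_mm = d_outer_px * mm_per_px
    d_inner_mm = d_inner_px * mm_per_px
    return (d_outer_mm - d_inner_mm) / 2.0


# Hypothetical detections at 640 x 640: outer rim 400 px, inner rim 380 px.
thickness = wall_thickness_mm([380.0, 400.0])
print(round(thickness, 3))  # 3.438
```

At this scale one pixel of diameter error changes the thickness by about 0.17 mm, which indicates how precisely the two circles must be localized for millimetre-level results.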

2.4.7. Experiments on the Bowl Thick Net Model

All images in the dataset were captured using an industrial camera under fixed height and illumination conditions, and were resized to

640 \times 640

to ensure accuracy and consistency. Experiments were conducted on Windows 10 using PyTorch 1.9.2, Python 3.8, CUDA 11.2, and an NVIDIA RTX 4060 GPU (Santa Clara, CA, USA). To enlarge the bowl image dataset, we applied data augmentation to expand the original 363 images to 1089 images. The dataset for this study was not entirely acquired in a checkerboard environment. In addition to checkerboard background samples, we also collected a large number of images that are closer to real-world application scenarios, including various backgrounds such as ordinary desktops and restaurant tables, to cover a wider range of textures and lighting conditions. The augmentation pipeline included a series of operations such as rotation, cropping, and flipping, while ensuring that these transformations did not alter the mapping between image pixels and physical dimensions. To maintain geometric consistency with the original images, each augmented image was further processed with CLAHE and Canny edge extraction to enhance contrast and edge features, facilitating subsequent contour detection and analysis. For bowls of different sizes and materials, we manually annotated the standard circle parameters of the outer and inner rims, and measured the bowl wall thickness using vernier calipers to construct millimeter-level ground-truth labels. The dataset was split into training and test sets at a ratio of 8:2.

To systematically evaluate the impact of each module on the performance of Bowl Thick Net, we designed eight model variants. The identifiers and meanings of these variants are as follows. Bowl Thick Net uses ConvNeXt-Tiny as the backbone and includes all modules, representing the final model performance when all components operate jointly. A replaces ConvNeXt-Tiny with ResNet-18 to assess the performance of a conventional convolutional network for bowl wall-thickness prediction. B removes Spatial Positional Encoding, so the model can no longer explicitly exploit positional encoding to capture spatial relationships. C removes Circle NMS, performing no geometric deduplication and directly using the circle parameters output by the decoder. D–F each remove two of the three components and retain only one, enabling analysis of the effect of a single remaining component. G removes all three components to examine the maximum performance degradation when they are all absent. Table 2 summarizes the modules used in each model variant.

All input images were normalized to

[0, 1]

and channel-wise standardized using ImageNet statistics. The backbone was initialized with ImageNet-pretrained weights. Both the encoder and the decoder employed a 6-layer, 8-head self-attention architecture, and the number of object queries was set to 10. In the loss function, the weight ratio between the geometric term and the existence term was set to 2:1, and the model was trained for 100 epochs. Optimization was performed using the SGD (Stochastic Gradient Descent) optimizer, and the mini-batch size and initial learning rate were tuned on the validation set. To improve robustness, early stopping and cross-validation were adopted during training to mitigate overfitting and ensure that the model effectively learns the key geometric features in the images.

A prerequisite for accurate bowl wall-thickness prediction is the correct identification of the two circular contours corresponding to the outer and inner walls in the image. We therefore evaluated the classification accuracy of the largest detected circle

C_{1}

(outer-wall contour) and the second-largest detected circle

C_{2}

(inner-wall contour) produced by the eight models on 218 test images; experiments on dimensional accuracy are reported subsequently. The results are summarized in Table 3.

As Table 3 shows, Bowl Thick Net achieves strong performance on the 218 test images in classifying both the largest circle

C_{1}

(outer-wall contour) and the second-largest circle

C_{2}

(inner-wall contour). The recognition accuracy reaches 97.2% for

C_{1}

and 95.4% for

C_{2}

, and the proportion of test images in which both

C_{1}

and

C_{2}

are correctly identified is 93.6%. The other model variants show slightly lower accuracies compared with Bowl Thick Net. In addition, the confusion matrix of Bowl Thick Net, which attains the highest recognition accuracy, is shown in Figure 6, where

C_{3}

denotes circles other than

C_{1}

and

C_{2}

. To demonstrate the training stability and convergence behavior of Bowl Thick Net in identifying the largest outer circle

C_{1}

and the second largest inner circle

C_{2}

, Figure 7 presents the loss curves for the training and validation sets. During training, the model uses the predicted circle parameters and circle existence as supervision signals, and the loss is a weighted sum of a geometric regression term and an existence classification term. As the number of training epochs increases, both the training loss and the validation loss decrease steadily and stabilize in the later stages.

Because the accuracy of thickness prediction depends on the output parameters of the two circles, any error in identifying them propagates to the subsequent prediction steps. Common failure scenarios include partial occlusion of the rim, edge loss due to reflection, and inaccurate fitting of multiple arc segments. In these cases, the model may misclassify non-target arcs as candidate circles, or the order of the two circles’ radii may be swapped.

To handle these failures, we introduced a fault-tolerance mechanism into the workflow and added geometric consistency constraints to the outputs of the two circles. We require that the center positions of the two circles be consistent and that the radii meet predefined order and range constraints. If the output does not meet these constraints, the workflow returns to the previous identification step and readjusts. In addition, we manually reviewed the low-confidence predictions of the two circles to ensure that erroneous identifications do not affect subsequent volume derivation. Together, these measures reduce the impact of identification failures in the thickness prediction stage and thereby improve the accuracy of the final volume estimation.
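
The geometric consistency constraints on the two circles can be illustrated with a small check. This is a sketch under stated assumptions: the function name and the numeric tolerances (`max_center_offset`, `min_ratio`, `max_ratio`) are hypothetical, since the text does not give the actual thresholds.

```python
import math


def circles_are_consistent(outer, inner, max_center_offset=0.02,
                           min_ratio=0.80, max_ratio=0.99):
    """Check an (outer, inner) circle pair, each given as (cx, cy, d).

    Units are normalized image coordinates. The pair passes when the centers
    nearly coincide, the inner diameter is smaller than the outer one, and
    the inner/outer diameter ratio lies in an assumed plausible range.
    """
    (ox, oy, od), (ix, iy, idm) = outer, inner
    concentric = math.hypot(ox - ix, oy - iy) <= max_center_offset
    ordered = 0 < idm < od
    in_range = ordered and (min_ratio <= idm / od <= max_ratio)
    return concentric and in_range
```

A pair that fails this check would be sent back to the identification step or flagged for manual review, in line with the workflow described above.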

From the above model variants, we selected the output images in which both

C_{1}

and

C_{2}

were correctly identified to conduct experiments on the geometric size errors of the two circles. Since subsequent wall-thickness prediction requires only the circle diameters, we did not report the predicted circle-center coordinates. Instead, we de-normalized the predicted diameters to pixel values, computed the corresponding real-world diameters using the pixel-to-world mapping, and then calculated bowl wall thickness according to Equation (6). We report the predicted diameters of

C_{1}

and

C_{2}

and the resulting wall-thickness estimates for each model variant, and compare them with manual measurements.

In this study, we use

MPE

(Mean Prediction Error) to quantify the discrepancy between the model-predicted circle diameter or bowl wall thickness and the corresponding manual measurements.

MPE

computes the absolute error between the predicted and measured values for each image and then averages these errors over all images; a smaller value indicates closer agreement with the ground truth. We also report

MWTPA

(Mean Wall Thickness Prediction Accuracy), defined as the ratio between the predicted wall thickness and the manually measured wall thickness for each image, averaged across all images [32]. This metric effectively reflects the model’s accuracy in wall-thickness prediction, with higher values indicating more accurate predictions.
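
The two metrics can be written down directly from their verbal definitions. The sketch below implements them literally as worded; the function names are illustrative, and the ratio form of the accuracy metric is taken at face value from the text (it would exceed 1 for over-predictions), so treat it as an assumption about the exact formula.

```python
def mean_prediction_error(predicted, measured):
    """Mean absolute error between predictions and manual measurements."""
    errors = [abs(p - m) for p, m in zip(predicted, measured)]
    return sum(errors) / len(errors)


def mean_wall_thickness_prediction_accuracy(predicted, measured):
    """Per-image ratio predicted / measured, averaged over all images.

    Measured thicknesses are assumed to be strictly positive.
    """
    ratios = [p / m for p, m in zip(predicted, measured)]
    return sum(ratios) / len(ratios)
```

For predicted thicknesses of 3.0 mm and 4.5 mm against measured values of 4.0 mm and 5.0 mm, the mean error is 0.75 mm and the mean ratio is 0.825.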

Table 4 presents the predicted diameters of

C_{1}

and

C_{2}

and the resulting bowl wall-thickness estimates for each model variant. Bowl Thick Net achieves the best overall performance among all versions: the

MPE

for the diameters of

C_{1}

and

C_{2}

are 0.64 mm and 0.78 mm, respectively; the

MPE

for wall-thickness prediction is 0.75 mm, and the

MWTPA

is 73.3%. We further conducted capacity measurement experiments on multiple bowls. The results show that, using the wall-thickness predictions from this model, the percentage difference in capacity falls within 1.5–4%, demonstrating the effectiveness and accuracy of the proposed model in practical applications.

2.5. Mask Segmentation of Bowls

In this subsection, we use the frontal-view images paired with the bowl top-down images as input and adopt Mask R-CNN as the base framework. Since each image contains only a single bowl, the primary goal is to obtain a high-quality mask with smooth boundaries and accurate details. However, the standard Mask R-CNN, which is designed for multi-object and multi-class detection, introduces redundant computation and may provide insufficient contour delineation for our task [33]. To address this issue, we build an improved Mask R-CNN framework and compare three backbone networks. We further propose a geometry-aware feature pyramid structure, GFPN (Geometry-aware Feature Pyramid Network), tailored to this study, and investigate its effect on bowl mask segmentation. Finally, we select the model with the highest segmentation accuracy and extract keypoints from the predicted bowl mask to infer the rim diameter, base diameter, bowl height, and the sidewall contour corrected by the predicted wall thickness, which are then used to estimate bowl capacity.

2.5.1. Improved Mask R-CNN Framework

In this subsection, we propose an improved Mask R-CNN baseline that is optimized for the characteristics of the mask segmentation task. Conventional Mask R-CNN is primarily designed for multi-object, multi-class detection and must jointly support classification and detection, which leads to substantial computational overhead on classification and insufficient delineation of fine bowl contours. To address these issues, we retain the two-stage architecture and the mask branch, while improving the feature extraction and pyramid fusion components.

For region proposal generation, we continue to use the RPN (Region Proposal Network) [34]. However, considering the single-bowl image setting and the requirement for high geometric precision, we simplify the RPN structure and its output configuration. At each scale, we keep only a small set of scale and aspect-ratio combinations that better match bowl contours, thereby reducing the number of proposals and improving convergence efficiency. Standard Mask R-CNN produces hundreds of proposals and then filters them via NMS before further processing; in our task, this is redundant and may introduce uncertainty. Therefore, after NMS we retain only 50–100 high-confidence proposals and feed them into RoI Align (Region of Interest Align) and the segmentation head for subsequent processing.

In the RoI (Region of Interest) head, we perform single-class classification without distinguishing specific bowl types [35]. This avoids allocating additional fully connected layers and classification losses for fine-grained categorization. Meanwhile, we keep the original loss functions, which helps the feature representations in the mask and geometric branches remain more focused, thereby improving contour segmentation accuracy.

2.5.2. Three Different Backbone Networks

For the backbone, we adopt three architectures, namely ResNet-50, ResNeXt-50 (32 × 4d), and ResNet-C4, to maintain compatibility with the standard Mask R-CNN and to facilitate controlled comparisons. The reasons for selecting these three backbones are as follows:

ResNet-50 serves as the baseline for comparison. By introducing residual blocks and skip connections, it effectively alleviates the vanishing-gradient problem in deep networks [36].
ResNeXt-50 (32 × 4d) can be regarded as augmenting the ResNet-50 bottleneck with 32 parallel 3 × 3 convolutional groups, each with 4 channels, implemented via grouped convolutions, thereby improving texture representation and contour modeling capability [37].
ResNet-C4 retains only C1–C4 as the shared backbone and does not further downsample to C5, trading some high-level semantic information for higher spatial resolution and finer local structure, which benefits precise localization of geometric boundaries such as the bowl rim and base [38].

The architectures of the three backbones are illustrated in the upper-left part of Figure 8. The three boxes correspond to the three backbone variants. Beneath C2, using C2 as an example, the figure further compares the convolutional operation in ResNet-50 with standard convolutions against that in ResNeXt-50 (32 × 4d) with grouped convolutions. By comparing these three backbones, we can analyze how backbone architecture affects mask quality and geometric fitting accuracy.

2.5.3. Geometry-Aware Feature Pyramid Network

The FPN (Feature Pyramid Network) is a multi-scale architecture widely used in object detection [39]. It enhances the recognition of targets at different scales by fusing the feature maps produced at multiple hierarchical levels of the bottom-up backbone through a top-down pathway with lateral connections. In our task, FPN plays the following roles:

Improved contour segmentation accuracy: Because bowl contours may exhibit subtle variations, FPN effectively combines fine-grained details from shallow layers with high-level semantic information from deeper layers, thereby improving the accuracy of bowl contour segmentation. In particular, for segmenting the bowl rim and base, FPN strengthens the model’s sensitivity to fine details, ensuring accurate extraction of circular contours.
Reduced computational redundancy: In conventional Mask R-CNN, processing multi-scale features can introduce redundant computation. FPN mitigates this issue through efficient feature fusion, improving computational efficiency. This advantage is especially important for bowl segmentation tasks that require real-time performance or high-throughput processing.
Multi-scale feature extraction: FPN extracts features at different hierarchical levels and fuses them, enabling the model to recognize targets at various scales within the image.

To further emphasize bowl contour features, we propose a Geometry-aware Feature Pyramid Network (GFPN) that includes only P3–P5, built on the C3–C5 outputs of ResNet-50 and ResNeXt-50 (32 × 4d). This pyramid follows the standard top-down design with

1 \times 1

convolutions, upsampling, and

3 \times 3

convolutions. On this basis, we introduce a GPM (Geometry-aware Prior Modulation) module and the CBAM (Convolutional Block Attention Module) attention mechanism. With multi-scale feature enhancement, GFPN strengthens boundary responses and thereby improves the accuracy of bowl contour segmentation.

GPM enhances the model’s understanding of geometric structures by incorporating geometric prior knowledge. In mask segmentation, GPM modulates the feature maps according to the geometric shape information in the input image, thereby improving sensitivity to geometric boundaries such as bowl contours [40]. CBAM strengthens feature representation through two components: channel attention and spatial attention. Channel attention models the importance of each channel and selectively amplifies informative channel features, whereas spatial attention emphasizes different spatial locations to increase the model’s responses to salient regions [41]. GFPN is derived from the original FPN with targeted modifications. The overall architecture of GFPN is shown in Figure 9, and the procedure is described as follows:

The frontal-view bowl image is first converted to grayscale, and the resulting grayscale image is denoted as

I_{g}

. The corresponding morphological gradient image is defined in Equation (7):

G = \mathrm{dilate}(I_{g}) - \mathrm{erode}(I_{g})

(7)

In this formulation,

G \in R^{H \times W}

takes larger values near the bowl contour and is close to zero in background regions, where

H

and

W

denote the height and width of the original image. To inject scale-matched geometric priors into each FPN level,

G

is downsampled at multiple scales and encoded via convolution. Taking the third level with a stride of 8 as an example, average pooling is first applied to obtain a coarse-scale boundary map

G_{3}

, which is then encoded using a

3 \times 3

convolution to produce the geometric prior feature

E_{3}

. The details are given in Equations (8) and (9):

G_{3} = \mathrm{AvgPool}_{k = 8, s = 8}(G) \in R^{H_{3} \times W_{3}}

(8)

In these equations,

G_{3}

denotes the coarse-scale boundary map at the third level,

\mathrm{AvgPool}(\cdot)

represents the average pooling operation, and

H_{3} = H / 8

and

W_{3} = W / 8

.

E_{3} = ϕ(G_{3}) = \mathrm{Conv}_{3 \times 3}(G_{3})

(9)

In these equations,

E_{3}

denotes the geometric prior feature corresponding to a stride of 8.

ϕ (\cdot)

is a function symbol representing the convolution operation, which transforms the input map

G_{3}

into a new feature map

E_{3}

.

\mathrm{Conv}_{3 \times 3}(\cdot)

denotes a

3 \times 3

convolution.

Similarly, for the levels with strides of 16 and 32, the corresponding geometric prior features

E_{4}

and

E_{5}

can be obtained. Thus, for the three levels with strides of 8, 16, and 32, we derive the geometric prior features

E_{3}

,

E_{4}

, and

E_{5}

. These features are spatially aligned with the backbone outputs C3–C5 and serve as the geometric priors for their respective levels.
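
To make Equations (7) and (8) concrete, the following pure-Python sketch computes a morphological gradient and then average-pools it with kernel and stride 8. It is only an illustration: the 3 × 3 structuring element and the border handling are assumptions, and the learned convolution of Equation (9) is omitted because its weights come from training.

```python
def morph_gradient(img):
    """Morphological gradient G = dilate(I) - erode(I) on a 2D list.

    Uses an assumed 3x3 square structuring element; windows are clipped
    at the image border.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            # Dilation is the local maximum, erosion the local minimum.
            out[y][x] = max(window) - min(window)
    return out


def avg_pool(img, k):
    """Average pooling with kernel size k and stride k (no padding)."""
    out_h, out_w = len(img) // k, len(img[0]) // k
    return [[sum(img[y * k + j][x * k + i]
                 for j in range(k) for i in range(k)) / (k * k)
             for x in range(out_w)]
            for y in range(out_h)]


# A 16x16 image with a vertical edge: the gradient is nonzero only near the edge.
image = [[0] * 8 + [255] * 8 for _ in range(16)]
g3 = avg_pool(morph_gradient(image), 8)  # 2x2 coarse boundary map
```

The coarse map keeps a nonzero response in every pooled cell that touches the edge, which is the property the geometric prior relies on.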

For feature fusion, we adopt a top-down pathway and retain only P3–P5, which match a single, medium-scale bowl target. Let the backbone outputs at C3, C4, and C5 be

C_{3}

,

C_{4}

, and

C_{5}

, respectively. The top-level pyramid output

P_{5}

is first obtained by applying

1 \times 1

and

3 \times 3

convolutions to

C_{5}

to produce the base fused feature

{\hat{F}}_{5}

, as defined in Equation (10):

{\tilde{C}}_{5} = \mathrm{Conv}_{1 \times 1}(C_{5}), \quad {\hat{F}}_{5} = \mathrm{Conv}_{3 \times 3}({\tilde{C}}_{5})

(10)

In these equations,

{\tilde{C}}_{5}

denotes the channel-compressed top-level feature, and

{\hat{F}}_{5}

denotes the base fused feature at the top level.

\mathrm{Conv}_{1 \times 1}(\cdot)

and

\mathrm{Conv}_{3 \times 3}(\cdot)

represent

1 \times 1

and

3 \times 3

convolutions, respectively. For the intermediate level, taking the fourth backbone output

C_{4}

as an example, we first apply a

1 \times 1

convolution to

C_{4}

. We then upsample the pyramid feature

P_{5}

by a factor of two so that its spatial resolution matches that of C4. The two features are added element-wise and passed through a

3 \times 3

convolution to obtain the base fused feature for this level. This process is defined in Equations (11) and (12):

{\tilde{C}}_{4} = \mathrm{Conv}_{1 \times 1}(C_{4}), \quad U_{5} = \mathrm{Up}(P_{5})

(11)

In these equations,

{\tilde{C}}_{4}

denotes the channel-compressed feature at the fourth level.

\mathrm{Up}(\cdot)

denotes a two-fold upsampling operation; here, it upsamples the pyramid output from the upper level (

P_{5}

in this example) by a factor of two, producing the upsampled feature

U_{5}

.

S_{4} = {\tilde{C}}_{4} + U_{5}, \quad {\hat{F}}_{4} = \mathrm{Conv}_{3 \times 3}(S_{4})

(12)

In these equations,

S_{4}

denotes the fused feature obtained by element-wise addition at the fourth level, and

{\hat{F}}_{4}

denotes the base fused feature at this level. Subsequently, the GPM module applies a

3 \times 3

convolution followed by a sigmoid activation to each level’s geometric prior feature

E_{l}

to produce a normalized geometric weight map. This weight map modulates the base feature

{\hat{F}}_{l}

in a residual manner, yielding the geometry-modulated feature map

F_{l}

at level

l

. The process is defined in Equations (13) and (14):

M_{l} = σ (W_{g} * E_{l}), M_{l} \in [0, 1]^{H_{l} \times W_{l}}

(13)

In these equations,

l

denotes the pyramid level (here,

l = 3

to

5

).

M_{l}

is the normalized geometric weight map at level

l

,

σ (\cdot)

denotes the sigmoid function,

*

denotes the convolution operator,

W_{g}

is a

3 \times 3

convolution kernel, and

E_{l}

is the geometric prior feature at level

l

.

F_{l} = \mathrm{GPM}({\hat{F}}_{l}, E_{l}) = {\hat{F}}_{l} ⊙ (1 + M_{l})

(14)

In these equations,

l

denotes the pyramid level (here,

l = 3

to

5

).

F_{l}

is the feature map at level

l

after geometric prior modulation.

\mathrm{GPM}(\cdot)

denotes the geometric prior modulation module.

{\hat{F}}_{l}

is the base feature at level

l

, and

E_{l}

is the geometric prior feature at the same level.

⊙

denotes element-wise multiplication, and

M_{l}

is the normalized geometric weight map for level

l

.
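
The residual modulation in Equations (13) and (14) can be illustrated on a single-channel map. In this simplified sketch the convolution response is passed in directly as `z` (a stand-in for the learned 3 × 3 convolution applied to the geometric prior), so the function name and its inputs are assumptions made for the example.

```python
import math


def gpm_modulate(f_hat, z):
    """Return f_hat * (1 + sigmoid(z)) element-wise for 2D lists.

    f_hat is the base fused feature map; z holds the pre-activation values
    that the learned convolution would produce from the geometric prior.
    Very large negative z values are not guarded against overflow here.
    """
    return [[f * (1.0 + 1.0 / (1.0 + math.exp(-zv)))
             for f, zv in zip(f_row, z_row)]
            for f_row, z_row in zip(f_hat, z)]


# Away from the contour (z = 0) the feature is scaled by 1.5;
# on a strong contour response (large z) it is nearly doubled.
modulated = gpm_modulate([[2.0, 2.0]], [[0.0, 50.0]])
```

Because the weight map lies in [0, 1], the modulated feature always stays between one and two times the base feature, so the base response is never suppressed.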

Finally, we apply the CBAM module to

F_{l}

to adaptively reweight the features along both the channel and spatial dimensions, producing the geometry-aware pyramid output

P_{l}

. This process is given in Equation (15):

P_{l} = \mathrm{CBAM}(F_{l})

(15)

In this equation,

l

denotes the pyramid level (here,

l = 3

to

5

).

P_{l}

is the geometry-aware pyramid output at level

l

, and

\mathrm{CBAM}(\cdot)

denotes the CBAM module.

2.5.4. Bowl Mask Segmentation Model Process

This subsection summarizes the overall workflow of the proposed bowl mask segmentation model based on an improved Mask R-CNN. First, we adopt the improved Mask R-CNN as the baseline framework. To meet the requirements of single-bowl images and high geometric precision, we optimize region proposal generation and introduce the GFPN structure to strengthen the model’s sensitivity to bowl contours. By incorporating the GPM module and the CBAM attention mechanism, the model can more accurately capture fine-grained features of the bowl rim, base, and sidewall. To further improve segmentation accuracy, GFPN enhances the bowl’s geometric boundaries across multiple scales and performs feature fusion on the outputs of the C3–C5 layers. With these optimizations, the model effectively reduces background noise interference and improves the accuracy of bowl contour segmentation. Figure 8 illustrates the architecture of the improved bowl mask segmentation model.

2.5.5. Experimental Results and Analysis

In this experiment, we use 363 frontal-view images of bowls that were captured together with the corresponding top-down images, and expand the dataset to 1089 images through data augmentation. The dataset includes bowls of various sizes. All images were manually annotated using the LabelMe tool version 5.8.1, and converted to the standard COCO annotation format. Unlike the original COCO setting, this task does not involve bowl-type classification; therefore, all instances are labeled as a single class, while still retaining bounding boxes and segmentation masks. The hardware and software settings are the same as those in Section 2.4. Each image has an original resolution of

640 \times 640

pixels. The dataset was randomly split, with 80% of the data used for training and 20% used for testing.

In this study, we adopt a Mask R-CNN model pre-trained on the COCO dataset and fine-tune it by loading the corresponding pre-trained weights. To improve accuracy and efficiency, we adjust the RPN_ANCHOR_SCALES parameter to (16, 32, 64, 128, 256) to better accommodate feature extraction for smaller image sizes. The initial learning rate is set to 0.0005, with a momentum of 0.9, weight decay of 0.00005, a batch size of 8, and 100 training epochs to ensure stable optimization on the reduced image resolution. During training, we use a stepwise learning-rate decay schedule, reducing the learning rate by a factor of 0.1 every 15 epochs. To further improve training stability and convergence speed, we employ a staged training strategy. In the early stage, the first several layers of ResNet are frozen and only the subsequent layers are trained, allowing the network to focus on task-relevant features. As training progresses, more ResNet layers are gradually unfrozen, ultimately enabling end-to-end optimization of all layers.
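
The stepwise decay described above amounts to a one-line rule. The sketch below assumes zero-based epoch indexing and uses the stated values (initial rate 0.0005, factor 0.1, step of 15 epochs); the function name is illustrative.

```python
def step_lr(epoch, base_lr=0.0005, gamma=0.1, step_size=15):
    """Learning rate at a given zero-based epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate at the start of each decay stage over 100 epochs.
schedule = {epoch: step_lr(epoch) for epoch in range(0, 100, 15)}
```

Under this rule the rate stays at 0.0005 for epochs 0 to 14, drops to 0.00005 at epoch 15, and continues to shrink by a factor of ten every 15 epochs.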

For model evaluation, we adopt COCO-style segmentation metrics, including

IOU

(Intersection over Union),

mIOU

(Mean Intersection over Union),

AP

(Average Precision), and

mAP

(Mean Average Precision) [42,43]. Specifically, in this experiment,

IOU

is defined as the ratio between the area of intersection and the area of union of the predicted mask and the ground-truth mask, and

mIOU

is the average

IOU

over all images. Since our task involves segmentation of only a single class,

mAP

reduces to the

AP

of that class. The general calculation is given in Equation (16), where N is the number of classes (N = 1 in this task) and AP_i is the average precision of class i:

\mathrm{mAP} = \frac{1}{N} \sum_{i = 1}^{N} \mathrm{AP}_{i}

(16)
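
For reference, the mask-level IOU and its average over images can be computed as follows. This is a minimal sketch on binary masks stored as nested lists; the function names and the convention of returning 1.0 for two empty masks are assumptions.

```python
def mask_iou(pred, gt):
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks are treated as a perfect match (assumed convention).
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average IOU over a list of (predicted_mask, ground_truth_mask) pairs."""
    return sum(mask_iou(p, g) for p, g in pairs) / len(pairs)
```

A prediction that covers two pixels when only one of them is foreground scores 0.5, and averaging it with a perfect prediction gives 0.75.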

To quantify the model’s boundary prediction accuracy, we evaluate it using the

Boundary F1 Score

with a 3-pixel tolerance. This metric assesses boundary performance by computing

Precision

, defined as the proportion of the predicted boundary that overlaps the ground-truth boundary, and

Recall

, defined as the proportion of the ground-truth boundary that overlaps the predicted boundary [44]. The final

Boundary F1 Score

is the harmonic mean of

Precision

and

Recall

, as given in Equation (17):

\mathrm{Boundary\ F1\ Score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(17)
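
A tolerance-based boundary F1 can be sketched on two sets of boundary pixels. In this illustration a boundary point counts as matched when some point of the other boundary lies within the tolerance in Euclidean distance; that matching rule and the function name are assumptions, since evaluation toolkits differ in how they implement the tolerance.

```python
import math


def boundary_f1(pred_pts, gt_pts, tol=3.0):
    """Boundary F1 between two lists of (x, y) boundary points."""

    def matched_fraction(src, ref):
        # Fraction of src points with at least one ref point within tol.
        if not src:
            return 0.0
        hits = sum(
            1 for (x, y) in src
            if any(math.hypot(x - rx, y - ry) <= tol for rx, ry in ref)
        )
        return hits / len(src)

    precision = matched_fraction(pred_pts, gt_pts)
    recall = matched_fraction(gt_pts, pred_pts)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With a precision of 2/3 and a recall of 1/2, the score is 4/7, which shows how the harmonic mean penalizes the weaker of the two terms.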

In this experiment, we report

mIOU

and

mAP

at two

IOU

thresholds, 0.50 and 0.75, as well as the mean

mIOU

and

mAP

averaged over these two thresholds. The corresponding metrics are

mIOU_{0.5}

,

mIOU_{0.75}

,

mAP_{0.5}

,

mAP_{0.75}

,

mIOU

, and

mAP

, together with the

Boundary F1 Score

.

To systematically evaluate the effects of different backbones and the proposed GFPN on bowl mask segmentation performance, we conduct a comparative study based on the improved Mask R-CNN baseline with different configurations. Three representative residual-network backbones are considered: ResNet-50, ResNeXt-50 (32 × 4d), and ResNet-C4. Here, ResNet-C4 serves as a shallower backbone without FPN, whereas the other two backbones are equipped with either the standard FPN or the proposed geometry-aware feature pyramid GFPN. All models are trained and evaluated on the same grayscale-processed frontal-view bowl dataset using the metrics described above. Table 5 reports the bowl mask segmentation results for the different module configurations.

Under the same experimental settings and hyperparameters, we take the best-performing configuration described above (ResNeXt-50 (32 × 4d) as the backbone with GFPN as the pyramid), referred to as Ours, and compare it with other mask segmentation models, including U-Net, HTC, and B Mask R-CNN. All baseline models are evaluated using their original architectures without any module modifications. U-Net employs skip connections to fuse multi-scale features, balancing global semantics and local boundary details for pixel-level mask prediction [45]. HTC alternately optimizes the detection and mask branches and progressively refines predictions through a multi-stage cascade, improving localization and segmentation quality [46]. B Mask R-CNN is a boundary-enhanced variant of Mask R-CNN that introduces contour supervision and feature enhancement to strengthen edge representations and improve mask boundary accuracy and consistency [47]. Table 6 reports the comparative results of these four models. Figure 10 presents qualitative segmentation outputs for all models, along with the training and validation loss curves of our model and the

mIOU

and

mAP

curves of the four models.

2.6. Estimation of the Volume of a Bowl

Using the bowl mask produced by the segmentation model from the previous subsection, we define 10 geometric keypoints on the mask: the outer-contour extremal points

P_{1} \sim P_{6}

and four uniformly sampled points along the right-side arc length,

S_{1} \sim S_{4}

. These keypoints are used to derive the rim diameter, base diameter, and bowl height, and to fit the inner-wall contour by incorporating the predicted wall thickness [48]. Finally, the bowl volume is computed via axisymmetric integration. The detailed procedure is as follows:

1. Foreground pixel set and contour set: After bowl mask segmentation, let the binary mask be

M(x, y) \in \{0, 1\}

. We first construct the foreground pixel set and the contour set on the mask, which are used for subsequent definition of geometric keypoints and fitting of the outer-wall curve, as given in Equation (18):

Ω = \{(x, y) \mid M(x, y) = 1\}

(18)

In this equation,

Ω

denotes the set of foreground pixels in the mask image.

M (x, y)

is the mask value at pixel

(x, y)

, where the foreground is 1 and the background is 0, and

x

and

y

are the horizontal and vertical pixel coordinates in the image coordinate system.

\partial Ω

denotes the contour set extracted using OpenCV’s contour detection.

2. Six outer-contour extremal points

P_{1} \sim P_{6}

: To stably obtain the geometric boundaries of the bowl rim and base from the mask contour, we first define four extremal points on

\partial Ω

: the left and right upper-rim points

P_{1}

and

P_{2}

, and the left and right lower-base points

P_{3}

and

P_{4}

. The upper-rim points characterize the rim width, whereas the lower-base points characterize the base width and help determine the base position. Their definitions are given in Equations (19) and (20):

P_{1} = \arg\max_{(x, y) \in \partial Ω,\ x = \min x} y, \quad P_{2} = \arg\max_{(x, y) \in \partial Ω,\ x = \max x} y

(19)

In this equation,

\min x

and

\max x

denote the minimum and maximum horizontal coordinates on the contour set

\partial Ω

, respectively.

\arg\max(\cdot)

denotes the point within the specified domain that maximizes the objective function.

P_{1} = (x_{p_{1}}, y_{p_{1}})

and

P_{2} = (x_{p_{2}}, y_{p_{2}})

correspond to the left and right upper-rim points, respectively. Specifically,

P_{1}

and

P_{2}

are the highest points in the leftmost and rightmost contour columns, respectively.

P_{3} = \arg\min_{(x, y) \in \partial Ω,\ x = \min x} y, \quad P_{4} = \arg\min_{(x, y) \in \partial Ω,\ x = \max x} y

(20)

In this equation,

\arg\min(\cdot)

denotes the point within the specified domain that minimizes the objective function.

P_{3} = (x_{p_{3}}, y_{p_{3}})

and

P_{4} = (x_{p_{4}}, y_{p_{4}})

correspond to the left and right lower-base points, respectively. Specifically,

P_{3}

and

P_{4}

are the lowest points in the leftmost and rightmost contour columns, respectively.
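
Equations (19) and (20) can be transcribed almost literally. The sketch below assumes contour points given as (x, y) pairs with the y axis pointing upward, as the equations imply; contours extracted with OpenCV use image coordinates in which y grows downward, so the y values would need to be flipped first. The function name is illustrative.

```python
def rim_and_base_points(contour):
    """Return (P1, P2, P3, P4) from a list of (x, y) contour points.

    P1/P2: highest points in the leftmost/rightmost contour columns.
    P3/P4: lowest points in the leftmost/rightmost contour columns.
    Assumes y increases upward.
    """
    xs = [x for x, _ in contour]
    x_min, x_max = min(xs), max(xs)
    left_ys = [y for x, y in contour if x == x_min]
    right_ys = [y for x, y in contour if x == x_max]
    p1 = (x_min, max(left_ys))
    p2 = (x_max, max(right_ys))
    p3 = (x_min, min(left_ys))
    p4 = (x_max, min(right_ys))
    return p1, p2, p3, p4
```

On a simple rectangular contour the four points are its corners, which makes the function easy to sanity-check before applying it to a real mask.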

Because the base primarily provides structural support and does not contribute to the effective holding volume, directly using

P_{3}

and

P_{4}

would lead to an overestimation of bowl height. Therefore, we analyze the variation in the mask’s horizontal width along the vertical direction to locate the shape transition between the base and the bowl body. The left and right boundaries at this transition are defined as

P_{5}

and

P_{6}

, which are used for estimating the effective height and fitting the sidewall. The definition is given in Equation (21):

w(y) = x_{max}(y) - x_{min}(y)

(21)

In this equation,

w (y)

denotes the foreground width of the mask at row

y

, and

x_{max}(y)

and

x_{min}(y)

are the rightmost and leftmost horizontal coordinates of the foreground region at row

y

, respectively.

In the base region, $w(y)$ varies only slightly, whereas once entering the bowl body sidewall, $w(y)$ increases markedly as $y$ increases. We scan upward from the base and identify the first row at which the width change rate exceeds a threshold $\tau$, denoted as $y = y_{split}$. In our implementation, $\tau$ is set to 5 px, and the left and right boundary points at this split row are defined as $P_5$ and $P_6$, as given in Equation (22):

$$P_5 = (x_{\min}(y_{split}),\, y_{split}), \qquad P_6 = (x_{\max}(y_{split}),\, y_{split}) \tag{22}$$

In this equation, $y_{split}$ is the pixel $y$-coordinate of the height where the base transitions to the bowl body. $P_5$ is the left boundary point at the start of the bowl body after removing the base, and $P_6$ is the corresponding right boundary point.
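The upward scan behind Equations (21) and (22) can be sketched as follows. This is only an illustration under stated assumptions: the mask is given as a list of rows ordered from the bottom of the bowl upward, each row a list of 0/1 values, and the "width change rate" is interpreted as the width difference between consecutive rows. The text does not spell out that definition, so this interpretation and the name `find_split_row` are hypothetical.

```python
def find_split_row(mask_bottom_up, tau=5):
    """Return (P5, P6) at the first row, scanning upward, whose width exceeds
    the previous row's width by more than tau pixels; None if no such row.

    mask_bottom_up: list of rows (row 0 = lowest row), each a list of 0/1.
    The row-to-row difference is an assumed reading of "width change rate".
    """
    prev_width = None
    for y, row in enumerate(mask_bottom_up):
        xs = [x for x, value in enumerate(row) if value]
        if not xs:
            continue                       # skip rows without foreground
        x_min, x_max = xs[0], xs[-1]
        width = x_max - x_min              # w(y) from Eq. (21)
        if prev_width is not None and width - prev_width > tau:
            return (x_min, y), (x_max, y)  # P5, P6 from Eq. (22)
        prev_width = width
    return None
```

With a real mask, some smoothing of $w(y)$ before thresholding would probably be needed to keep single-pixel segmentation noise from triggering the split early.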

3. Pixel-scale geometric quantities: rim diameter, base diameter, and effective height. After obtaining the keypoints $P_1$ to $P_6$, we can directly compute the bowl rim diameter, base diameter, and effective height in pixel units. The pixel-scale rim diameter is defined in Equation (23):

$$D_{rim}^{(pix)} = | x_{P_2} - x_{P_1} | \tag{23}$$

In this equation, $D_{rim}^{(pix)}$ denotes the rim diameter in pixels, and $x_{P_1}$ and $x_{P_2}$ are the horizontal coordinates of points $P_1$ and $P_2$, respectively. The pixel-scale base diameter is defined in Equation (24):

$$D_{base}^{(pix)} = | x_{P_4} - x_{P_3} | \tag{24}$$

In this equation, $D_{base}^{(pix)}$ denotes the base diameter in pixels, and $x_{P_3}$ and $x_{P_4}$ are the horizontal coordinates of points $P_3$ and $P_4$, respectively. The bowl’s effective height uses the midpoint of the rim as the upper reference point and the base–body split line defined by $P_5$ and $P_6$ as the lower reference, thereby reducing the influence of the base height. The pixel-scale effective height is defined in Equation (25):

$$H^{(pix)} = \left| \frac{y_{P_5} + y_{P_6}}{2} - \frac{y_{P_1} + y_{P_2}}{2} \right| \tag{25}$$

In this equation, $H^{(pix)}$ denotes the effective bowl height in pixels, and $y_{P_1}$, $y_{P_2}$, $y_{P_5}$, and $y_{P_6}$ are the vertical coordinates of $P_1$, $P_2$, $P_5$, and $P_6$, respectively.

4. Conversion from pixel units to physical scale: Using the pixel-to-world mapping obtained in Section 2.2, $P_x = 0.3438$ mm/pixel, we convert the pixel-scale measurements to real-world dimensions, including the rim diameter $D_{rim}$, base diameter $D_{base}$, and bowl height $H$. The conversion is given in Equation (26):

$$D_{rim} = D_{rim}^{(pix)} \cdot P_x, \qquad D_{base} = D_{base}^{(pix)} \cdot P_x, \qquad H = H^{(pix)} \cdot P_x \tag{26}$$

In this equation, $D_{rim}$, $D_{base}$, and $H$ denote the rim diameter, base diameter, and bowl height in real-world units, respectively, while $D_{rim}^{(pix)}$, $D_{base}^{(pix)}$, and $H^{(pix)}$ denote the corresponding measurements in pixel units.
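Equations (23) to (26) amount to a few coordinate differences followed by a single scale factor. The sketch below shows that arithmetic. The function name and the keypoint tuples passed to it are hypothetical, and only the scale factor 0.3438 mm/pixel comes from the text.

```python
P_X_MM_PER_PIXEL = 0.3438  # calibration factor reported in Section 2.2

def bowl_dimensions_mm(p1, p2, p3, p4, p5, p6, px=P_X_MM_PER_PIXEL):
    """Return (D_rim, D_base, H) in millimetres from six (x, y) keypoints."""
    d_rim_pix = abs(p2[0] - p1[0])                      # Eq. (23)
    d_base_pix = abs(p4[0] - p3[0])                     # Eq. (24)
    h_pix = abs((p5[1] + p6[1]) / 2.0
                - (p1[1] + p2[1]) / 2.0)                # Eq. (25)
    return d_rim_pix * px, d_base_pix * px, h_pix * px  # Eq. (26)
```

For instance, a rim that spans 400 pixels would correspond to roughly 137.5 mm at this scale factor.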

5. Sampling the right-side outer contour: Since a bowl can be approximated as an axisymmetric solid generated by rotation about its central axis, we model the bowl using the right-side outer contour. Specifically, we select the contour arc segment from $P_6$ to $P_2$ and denote the contour point sequence as $\Gamma_R = \{(x_i, y_i)\}_{i=0}^{N}$. Using arc-length parameterization, we uniformly sample four interior points $S_1$ to $S_4$ along this arc. Together with the endpoints $P_6$ and $P_2$, these points form a representative set that characterizes the outer-wall shape. The arc-length quantities are defined in Equation (27):

$$\Delta s_i = \sqrt{(x_i - x_{i-1})^2 + (y_i - y_{i-1})^2}, \quad i = 1, \dots, N, \qquad s_i = \sum_{j=1}^{i} \Delta s_j, \qquad s_0 = 0, \qquad s_{tot} = s_N \tag{27}$$

In this equation, $\Delta s_i$ denotes the arc-length increment between adjacent contour points, where $(x_i, y_i)$ and $(x_{i-1}, y_{i-1})$ are the coordinates of the $i$-th and $(i-1)$-th points. The index $i$ ranges from 1 to $N$, so the selected segment contains $N + 1$ contour points indexed from 0 to $N$. $s_i$ represents the cumulative arc length from the starting point $P_6$ to the $i$-th point, with $j$ the summation index ranging from 1 to $i$. $s_0$ is the arc length at the starting point and is set to 0. $s_{tot}$ denotes the total arc length from the starting point $P_6$ to the endpoint $P_2$, i.e., the full length of the selected contour segment. We then select four interior point locations by dividing the arc-length interval $[0, s_{tot}]$ into five equal parts, as defined in Equation (28):

$$s^{(k)} = \frac{k}{5}\, s_{tot}, \qquad k = 1, 2, 3, 4 \tag{28}$$

In this equation, $s^{(k)}$ denotes the arc-length position of the $k$-th sampled point, where $k$ is the sampling index and takes values from 1 to 4. By performing linear interpolation on $\Gamma_R$ at $s^{(k)}$, we obtain the sampled point $S_k = (x_{S_k}, y_{S_k})$, i.e., the coordinates of $S_1$ to $S_4$.
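The arc-length sampling of Equations (27) and (28) can be sketched in a few lines. This sketch assumes the right-side contour from $P_6$ to $P_2$ is already available as an ordered list of at least two (x, y) points. The function name `sample_by_arc_length` is illustrative.

```python
import math

def sample_by_arc_length(points, n_interior=4):
    """Return n_interior points equally spaced in arc length along a polyline.

    points: ordered list of (x, y) tuples from P6 to P2 (at least two points).
    """
    # Cumulative arc length s_0 .. s_N, Eq. (27).
    s = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        s.append(s[-1] + math.hypot(x1 - x0, y1 - y0))
    s_tot = s[-1]

    samples = []
    seg = 1  # index of the segment end point currently being inspected
    for k in range(1, n_interior + 1):
        target = k / (n_interior + 1) * s_tot   # s^(k), Eq. (28)
        while s[seg] < target:
            seg += 1
        span = s[seg] - s[seg - 1]
        t = (target - s[seg - 1]) / span if span > 0 else 0.0
        (x0, y0), (x1, y1) = points[seg - 1], points[seg]
        # Linear interpolation between the two bracketing contour points.
        samples.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return samples
```

With the default `n_interior=4`, the interval $[0, s_{tot}]$ is split into five equal parts, matching Equation (28).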

6. Outer-wall fitting and inner-wall correction: To convert the contour points into a continuous geometric representation, we determine the bowl’s central axis using the midpoint of the rim and project the right-side sampled points onto the height–radius plane. In physical units, we fit the outer-wall radius function $r_{out}(z)$. We then combine this with the wall thickness $\tilde{t}$ predicted by Bowl Thick Net to obtain the inner-wall radius function $r_{in}(z)$, as defined in Equation (29):

$$r_{out}(z) = a_3 z^3 + a_2 z^2 + a_1 z + a_0, \qquad r_{in}(z) = r_{out}(z) - \tilde{t} \tag{29}$$

In this equation, $r_{out}(z)$ denotes the outer-wall radius as a function of height $z$, and $r_{in}(z)$ denotes the inner-wall radius as a function of height $z$. The coefficients $a_0$, $a_1$, $a_2$, and $a_3$ are obtained via least-squares fitting.
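A cubic least-squares fit of the kind described by Equation (29) can be sketched with the normal equations, as below. The text does not say which solver the authors use, so this pure-Python version is an assumption made for illustration. In practice a library routine such as `numpy.polyfit` would be the more usual choice, and heights should be rescaled if they span a wide range, because the normal equations become ill-conditioned.

```python
def fit_cubic(zs, rs):
    """Least-squares fit of r(z) = a0 + a1*z + a2*z^2 + a3*z^3.

    Solves the 4x4 normal equations by Gaussian elimination with partial
    pivoting and returns [a0, a1, a2, a3]. Needs at least four distinct zs.
    """
    n = 4
    m = [[sum(z ** (i + j) for z in zs) for j in range(n)] for i in range(n)]
    b = [sum(r * z ** i for z, r in zip(zs, rs)) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n):
                m[row][k] -= factor * m[col][k]
            b[row] -= factor * b[col]
    a = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        a[i] = (b[i] - sum(m[i][k] * a[k] for k in range(i + 1, n))) / m[i][i]
    return a

def r_out(z, a):
    """Outer-wall radius from the fitted coefficients."""
    return a[0] + a[1] * z + a[2] * z ** 2 + a[3] * z ** 3

def r_in(z, a, thickness):
    """Inner-wall radius: outer radius minus the predicted wall thickness."""
    return r_out(z, a) - thickness
```

The six points used in the paper ($P_6$, $S_1$ to $S_4$, $P_2$) would be passed as `zs` (heights) and `rs` (radii measured from the central axis), both already converted to millimetres.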

7. Bowl capacity via axisymmetric integration: After obtaining the inner-wall profile $r_{in}(z)$, we compute the bowl’s inner-cavity volume using the volume-of-revolution formula. The integration limits correspond to the effective height range, starting from the height of $P_6$ at the bottom and ending at the rim at $P_2$. The formulation is given in Equation (30):

$$V_{inner} = \pi \int_{z_{\min}}^{z_{\max}} r_{in}^2(z)\, dz \tag{30}$$

In this equation, $V_{inner}$ denotes the bowl capacity, $z_{\min}$ and $z_{\max}$ are the lower and upper bounds of the effective height in physical coordinates, and $r_{in}(z)$ is the inner-wall radius function.
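The integral in Equation (30) can be evaluated numerically. The sketch below uses composite Simpson's rule, which is an assumed choice because the text does not name the quadrature scheme. It also assumes that radii and heights are in millimetres, so the result is divided by 1000 to give millilitres. The name `capacity_ml` is illustrative.

```python
import math

def capacity_ml(r_in, z_min, z_max, n=200):
    """Approximate V = pi * integral of r_in(z)^2 dz by composite Simpson's rule.

    r_in: callable returning the inner radius in mm at height z (mm).
    Returns the volume in mL (1 mL = 1000 mm^3).
    """
    if n % 2:          # Simpson's rule needs an even number of intervals
        n += 1
    h = (z_max - z_min) / n
    total = r_in(z_min) ** 2 + r_in(z_max) ** 2
    for i in range(1, n):
        weight = 4 if i % 2 else 2
        total += weight * r_in(z_min + i * h) ** 2
    return math.pi * total * h / 3.0 / 1000.0
```

Because $r_{in}(z)$ is a cubic polynomial here, the integral could also be written in closed form. The numeric version has the advantage of working unchanged for any callable profile.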

In summary, based on the frontal-view bowl images, we use the outputs of our improved mask segmentation model to extract key geometric parameters, including the rim diameter, base diameter, and bowl height. By further incorporating the wall thickness predicted by Bowl Thick Net, we construct axisymmetric contour curves for the inner and outer walls, derive an integral formulation for bowl capacity, and complete the bowl reconstruction model. Figure 11 illustrates the bowl capacity estimation procedure and the corresponding results.

Following the above pipeline, we conduct a capacity validation experiment on eight bowls of different sizes. Figure 12 shows examples of real-world images of eight bowls used for prediction. These bowls vary in color, material, and size and are presented in a focused comparison to visually demonstrate the applicability and generalization ability of the scale factor and geometric modeling process used in this study with different bowl shapes.

For each bowl, we first acquire a top-down image and a frontal-view image. The rim contour in the top-down image is used to predict wall thickness, while the mask and contour point set from the frontal-view image are used to obtain the bowl’s geometric parameters and contour curve. The outer-wall contour is then corrected to the inner-wall contour, yielding image-based geometric parameters, a reconstructed bowl model, and the predicted bowl capacity. As ground-truth references, we manually measure the rim diameter, base diameter, effective height, and actual holding volume of each bowl using tools such as vernier calipers (Shanghai, China) and a graduated cylinder (Yancheng, Jiangsu, China). Finally, we evaluate the accuracy of the proposed bowl model reconstruction and volume estimation method by comparing the predicted and measured geometric dimensions and capacities of the eight bowls on a per-bowl basis and conducting error analysis. Table 7 presents the measured and predicted values of rim diameter, base diameter, effective height, and bowl capacity for each bowl.

As shown in Table 7, the prediction error for the rim diameter of the bowl ranges from 2.19% to 4.67%, with an arithmetic mean error of 1.09%. For the base diameter of the bowl, the error ranges from 1.03% to 3.42%, with an arithmetic mean error of 1.14%. For the effective height of the bowl, the error ranges from −2.26% to 1.03%, with an arithmetic mean error of 0.40%. For the bowl volume, the error ranges from −7.29% to 3.41%, with an arithmetic mean error of −3.03%. To further analyze the impact of network structure selection on the accuracy of bowl volume prediction, we conducted a comparative experiment between the backbone network and the feature pyramid structure, as shown in Table 8. While keeping other training settings consistent, we compared the impact of different backbones and FPN types on the final volume error, using the arithmetic mean error of the bowl volume as a unified metric.

2.7. Application of This Study in Dietary Analysis

After obtaining the key geometric parameters of each bowl (rim diameter, base diameter, effective height, and capacity) and its reconstructed model, we store these parameters together with the corresponding bowl ID in a database, which serves as prior knowledge for subsequent prediction of liquid volume and food nutrient content. In practical use, the system first selects the bowl with the specified ID and captures a top-down image of the liquid or food in the bowl. To ensure consistent scale conversion and geometric mapping, the camera parameters, mounting height, and top-down viewing angle must be kept strictly identical to those used during the acquisition of the bowl top-down images described above. Next, the visible upper-surface region of the contents is segmented to obtain its top-down projection, and a circle is fitted to this region to estimate the diameter of the fitted surface circle. Combined with the inner-wall contour function of the selected bowl stored in the database and the corresponding cavity geometry constraints, this diameter can be mapped to the associated height, enabling inference of the liquid level and computation of the liquid volume via integration. Please note that this volume estimation assumes the upper surface of the liquid or food is as horizontal as possible; pronounced concavities, bulges, or stacked structures may introduce additional error. For food nutrient prediction, the system can estimate nutrients for a single food type. After estimating food volume, density is introduced to convert volume to weight, and the nutrient content (e.g., calories, carbohydrates, protein, and fat) is then obtained by linear scaling based on a per-unit-weight nutrient table for different dishes, yielding the estimated nutrient amounts for the food in the bowl [49,50].

2.7.1. Prediction of the Volume of Liquid in the Bowl

After completing bowl mask segmentation and obtaining the bowl geometric parameters as described in the previous subsection, we have the rim diameter $D_{rim}$, base diameter $D_{base}$, and effective height $H$ in physical units, as well as the inner-wall radius function $r_{in}(z)$, which is obtained by fitting the outer wall and then correcting it using the predicted wall thickness. Specifically, $z$ denotes the height coordinate along the bowl’s central axis, and $r_{in}(z)$ is the radius from the inner wall to the central axis at height $z$. Under the axisymmetry assumption, the bowl cavity can be regarded as a solid of revolution generated by rotating $r_{in}(z)$ about the central axis. Therefore, when the liquid surface is approximately horizontal, the liquid volume can be computed by integrating the volume of revolution below the liquid level.

We perform mask segmentation on the surface region in the top-down liquid image and fit an equivalent circle to the liquid-surface contour to obtain the pixel-scale radius $R_{liq}^{(pix)}$. Using the calibration factor $P_x$ (mm/pixel), we convert it to the physical liquid-surface radius $r_{liq}$, as given in Equation (31):

$$r_{liq} = P_x \cdot R_{liq}^{(pix)} \tag{31}$$

Since the inner-wall profile is represented as a continuous function $r_{in}(z)$, the liquid level can be inferred from the liquid-surface radius. In implementation, $r_{in}(z)$ is interpolated within the integration interval defined by the effective height, and the corresponding $z_{liq}$ is solved using a bisection method. The liquid-surface radius $r_{liq}$ and the liquid level $z_{liq}$ satisfy Equation (32):

$$r_{in}(z_{liq}) = r_{liq} \tag{32}$$

After determining the liquid level $z_{liq}$, the liquid volume is computed as the volume of revolution from the bottom $z_{\min}$ up to $z_{liq}$, as given in Equation (33):

$$V_{liq} = \pi \int_{z_{\min}}^{z_{liq}} r_{in}^2(z)\, dz \tag{33}$$

In this equation, $V_{liq}$ denotes the predicted liquid volume in mL. Figure 13 illustrates the liquid mask segmentation and the principle of liquid volume prediction.
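Equations (31) to (33) chain together as a scale conversion, a root-finding step, and an integral. The sketch below illustrates that chain under two assumptions that go beyond the text: the inner-wall profile is non-decreasing over the effective height (so bisection brackets a single root), and the integral is evaluated with composite Simpson's rule. Function names are illustrative.

```python
import math

def solve_level(r_in, r_target, z_min, z_max, tol=1e-6):
    """Bisection for r_in(z) = r_target, Eq. (32).

    Assumes r_in is non-decreasing on [z_min, z_max].
    """
    if not r_in(z_min) <= r_target <= r_in(z_max):
        raise ValueError("surface radius lies outside the stored profile")
    lo, hi = z_min, z_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if r_in(mid) < r_target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def liquid_volume_ml(r_in, r_liq_pix, px, z_min, z_max, n=200):
    """Return (z_liq in mm, V_liq in mL); n must be even (Simpson's rule)."""
    r_liq = px * r_liq_pix                           # Eq. (31)
    z_liq = solve_level(r_in, r_liq, z_min, z_max)   # Eq. (32)
    h = (z_liq - z_min) / n
    total = r_in(z_min) ** 2 + r_in(z_liq) ** 2
    for i in range(1, n):
        total += (4 if i % 2 else 2) * r_in(z_min + i * h) ** 2
    return z_liq, math.pi * total * h / 3.0 / 1000.0  # Eq. (33), mm^3 -> mL
```

If the fitted profile is not monotonic, for example near a curled rim, several heights can share the same radius and the bracketing check above is no longer sufficient.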

2.7.2. Predicting the Nutritional Content of Food in a Bowl

After obtaining the bowl’s geometric parameters and the inner-wall contour function $r_{in}(z)$, this subsection further implements the prediction of the nutrient content of food contained in the bowl. We consider eight common food categories: Rice, Mapo Tofu, Kung Pao Chicken, Fried Noodles, Stir-fried Vegetables, Braised Eggplant, Tomato Scrambled Eggs, and Stir-fried Shredded Potatoes. To make predictions, food is placed in a bowl that has been registered in the database, and the food surface is kept as level as possible, without pronounced concavities or protrusions. In this study, only one type of food was placed in the bowl at a time in the nutritional composition experiment to reduce the uncertainty caused by the combination of multiple foods. We then capture images under the same top-down setup as in the previous experiments, keeping the camera parameters and viewing angle unchanged, and perform instance mask segmentation on the food surface region. Unlike the single-class frontal-view segmentation of the “bowl” in Section 2.5, this subsection requires simultaneous dish category recognition during segmentation. Therefore, the adopted mask segmentation network performs category classification alongside mask prediction. Table 9 lists the eight food categories and the number of samples within different portion-size ranges, which are used to train and evaluate the model.

In the geometric estimation stage, the procedure is consistent with the liquid volume prediction in Section 2.7.1. First, an equivalent circle radius (in pixel units) is obtained by fitting a circle to the segmented food surface in the top-down image, and the calibration factor is used to convert it to a physical-scale radius. Next, the corresponding height $z_{food}$ is inferred from the inner-wall function $r_{in}(z)$, and the food volume in the bowl, $V_{food}$, is computed via axisymmetric integration. We then introduce the density parameter $\rho_c$ for each food category and convert volume to weight as $m_{food} = \rho_c \cdot V_{food}$. Finally, based on the per-unit-weight contents of calories, carbohydrates, protein, and fat for each dish, we compute the total nutrient amounts for the serving in the bowl, thereby completing nutrient content prediction. Figure 14 presents the workflow for food nutrient content prediction.
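The volume-to-nutrient step is a pair of linear scalings, sketched below. The function name, the dictionary keys, and any numbers passed to it are hypothetical placeholders for illustration and are not values from the paper's density and nutrient tables.

```python
def nutrient_totals(volume_ml, density_g_per_ml, per_100g):
    """Convert an estimated food volume to mass and scale per-100 g nutrients.

    volume_ml: estimated food volume V_food in mL.
    density_g_per_ml: category density rho_c in g/mL.
    per_100g: dict mapping nutrient name -> amount per 100 g of the dish.
    Returns (mass_g, dict of nutrient totals for the serving).
    """
    mass_g = density_g_per_ml * volume_ml  # m_food = rho_c * V_food
    totals = {name: amount * mass_g / 100.0
              for name, amount in per_100g.items()}
    return mass_g, totals
```

Because every nutrient total is proportional to the estimated mass, any relative error in the mass carries over unchanged to each nutrient.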

2.7.3. Experimental Results and Analysis of Estimating the Relationship Between Liquid Volume and Food Nutrient Content

In this subsection, under the same experimental equipment and imaging conditions as in Section 2.4.7 (with identical camera parameters, top-down viewing angle, resolution, and calibration factor), we evaluate the performance of different mask segmentation models on two tasks: liquid volume estimation in bowls and food volume estimation in bowls. Our dataset contains 48 images of liquids in bowls, and the number of images for the eight food categories is as listed in Table 2. We assign an ID to each image and record the corresponding liquid volume and food weight in a spreadsheet. Data augmentation is applied to both the liquid and food images, resulting in 144 augmented liquid images and 1929 augmented images in total for the eight food categories. We then train standard Mask R-CNN models for the two tasks separately. To ensure a fair comparison, both tasks use the same input preprocessing pipeline and the same training/test split strategy; the only difference lies in the category space. The liquid task is a single-class mask segmentation problem that distinguishes only liquid from background, whereas the food task is a multi-class instance segmentation problem that outputs both masks and category labels, with the category set consisting of the eight dishes listed in Table 9.

During inference, the model first outputs the mask of the liquid or the visible food surface in the bowl, and the food task additionally outputs the predicted category. The mask is then processed by geometric circularization. Specifically, boundary points are extracted from the mask, with OpenCV (version 4.10.0) used for contour extraction and circle fitting. Based on the extracted boundary point set, an equivalent circle is estimated using a least-squares strategy, yielding the pixel-scale radius of the fitted surface circle. This radius is converted to a physical radius using the pixel-to-world mapping factor $P_x$. Because the bowl cavity is a solid of revolution formed by rotating the inner-wall profile $r_{in}(z)$ about the central axis, a given surface radius corresponds to a unique height as long as $r_{in}(z)$ increases monotonically with $z$. Therefore, the height of the liquid level or food surface above the bowl bottom can be obtained by solving $r_{in}(z) = r$. Finally, given the upper height limit, we integrate $r_{in}(z)$ using the axisymmetric volume formula to obtain the liquid volume or food volume in the bowl. For the food task, after volume estimation, the nutrient amounts are further quantified by combining the dish-specific density and the per-unit-mass nutrient table for each category.
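To make the least-squares circle step concrete, the sketch below uses a standard algebraic fit on centroid-shifted boundary points. It is a stand-in chosen for illustration: the text states only that OpenCV and a least-squares strategy are used, not which formulation, so the function `fit_circle` and its algorithm are assumptions rather than the authors' implementation.

```python
import math

def fit_circle(points):
    """Algebraic least-squares circle fit; returns (cx, cy, radius).

    points: list of (x, y) boundary points, at least three, not collinear.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    u = [x - mx for x, _ in points]   # coordinates relative to the centroid
    v = [y - my for _, y in points]
    suu = sum(a * a for a in u)
    svv = sum(b * b for b in v)
    suv = sum(a * b for a, b in zip(u, v))
    suuu = sum(a ** 3 for a in u)
    svvv = sum(b ** 3 for b in v)
    suvv = sum(a * b * b for a, b in zip(u, v))
    svuu = sum(b * a * a for a, b in zip(u, v))
    # Solve the 2x2 linear system for the centre offset (uc, vc).
    b1 = 0.5 * (suuu + suvv)
    b2 = 0.5 * (svvv + svuu)
    det = suu * svv - suv * suv
    uc = (b1 * svv - b2 * suv) / det
    vc = (b2 * suu - b1 * suv) / det
    radius = math.sqrt(uc * uc + vc * vc + (suu + svv) / n)
    return mx + uc, my + vc, radius
```

The returned radius is in pixels and would then be multiplied by $P_x$ before solving $r_{in}(z) = r$.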

We evaluate the mask segmentation quality of different models using COCO-style metrics, including the mean average precision for bounding-box detection (bbox_mAP) and the mean average precision for mask segmentation (mask_mAP).

In the bowl liquid-volume prediction experiment, we use the Mask R-CNN segmentation model to extract surface masks from the liquid images in the test set and predict liquid volume according to the aforementioned method. The predicted volumes are then compared with the recorded volumes in the spreadsheet to compute the prediction error. The liquid-surface mask segmentation performance and the liquid-volume prediction results are summarized in Table 10, and Table 11 compares the average errors of different research methods in liquid volume prediction.

As shown in Table 10, the mean prediction error for liquid volume estimation is 9.24%.

We measured the density of each food category using a 200 mL measuring cup (Yancheng, Jiangsu, China), following the same filling procedure for all foods. To address the potential volume-to-weight conversion bias introduced by uneven food density and variations in packing tightness, we conducted 20 repeated measurements for each food category during the density measurement phase. Density data were obtained under both lightly and tightly packed conditions, and the average value was taken to improve the robustness of density estimation. For nutritional data, priority was given to food composition resources with standardized and traceable sources, including the Foundation Foods and SR Legacy categories from USDA FoodData Central, as well as the UK’s CoFID. Cross-checking across multiple databases was performed when necessary to reduce the impact of item-matching bias and of differences between individual databases. In related validation studies, Evenepoel et al. compared MyFitnessPal’s nutritional calculations with the research-grade Nubel food composition table, reporting errors of approximately 1.3% for energy and 1.2% for fat [51]; Chiplonkar et al. compared food composition table estimates with laboratory chemical analysis results of cooked foods, indicating that the differences in protein and carbohydrate content were typically around 5% [52]. Within the experimental procedure described in this paper, these differences mainly manifest as limited perturbations in nutrient values per unit weight. Such perturbations are typically smaller than the error contributions introduced by volume estimation and density measurement, and they represent an external error source common to all compared methods. Therefore, their overall impact on the nutrient prediction conclusions is relatively limited. We obtained the nutrient information per 100 g for each dish, and Table 12 summarizes the density and nutrient composition per unit weight for each food category.

In the food nutrient content prediction experiment, we likewise use a Mask R-CNN segmentation model to extract surface masks from the food images in the test set and estimate food volume according to the aforementioned procedure. The estimated volume is converted to food weight using the density of the corresponding food category, and the predicted weight is compared with the recorded weights in the spreadsheet to compute the prediction error. We then estimate the nutrient amounts by combining the predicted food weight with the per-100 g nutrient values in Table 12. Because each nutrient amount is a fixed multiple of the predicted weight, the relative prediction error of food nutrient content equals the relative prediction error of food weight. The mask segmentation performance for different food categories and the weight prediction results are reported in Table 13.

As shown in Table 13, the mean weight prediction error for each food category ranges from 8.55% to 14.91%, and the overall mean weight prediction error across the eight food categories is 11.49%. Stir-fried Vegetables exhibits the largest weight prediction error. This is mainly because the density of a given dish is not a fixed constant and can vary substantially with cooking and serving conditions. Specifically, even within the same category, differences in oil usage, moisture content of sauces, and ingredient composition can directly change the solid content and porosity per unit volume. In addition, the stacking pattern and degree of compaction during filling affect the actual volume distribution and effective density of the food in the bowl, which in turn leads to larger deviations in weight prediction. Table 14 compares the average errors of different research methods in predicting the content of food nutrients.

2.7.4. Visual User Platform

Based on the above analysis pipeline, we developed a visualization and user-facing platform, as shown in Figure 15. After a one-time camera calibration, the user only needs to provide a frontal-view and a top-down image of a bowl. The system can then automatically identify geometric parameters such as the rim diameter, base diameter, effective height, and the inner-wall contour curve, and store the recognized results and the reconstructed model as a bowl instance in the database. For liquid volume and food nutrient content prediction, the user selects the bowl type and inputs a top-down image of the liquid or food in the bowl, and the platform outputs the predicted liquid volume or the food category together with its estimated nutrient content.

The platform can be applied to standardized portioning and output management in food retail, dietary intake monitoring in hospitals or elderly-care settings, calorie logging for fitness and weight management, as well as quantitative liquid dispensing and container capacity assessment under laboratory conditions. Looking ahead, the system can be extended to more vessel types and more complex food geometries, and multi-view or depth cues can be incorporated to improve robustness to uneven surfaces and occlusions. In addition, developing finer-grained prediction modules and providing AI-driven dietary recommendations could further enhance the generalization capability and interpretability of the nutritional assessment.

3. Conclusions

To address the low accuracy of conventional methods for predicting the liquid volume and food nutrient content in bowl-type tableware, as well as the tool dependence and time-consuming nature of manual measurements, this study proposes a new approach that integrates 3D reconstruction with deep learning–based segmentation to predict the liquid volume in a bowl and the nutrient content of the contained food with relatively high accuracy.

Section 2.2 ensures accurate bowl dimension estimation by establishing a pixel-to-world mapping based on the pinhole camera model. Zhang’s calibration method is adopted, where 15 images of a 25 mm checkerboard are captured for corner detection to estimate camera intrinsics, extrinsics, and distortion parameters, followed by nonlinear least-squares refinement. The mean reprojection error is 0.23 pixels, and a scale factor of $P_x = 0.3438$ mm/pixel at a resolution of $640 \times 640$ is obtained for subsequent conversion. Section 2.3 focuses on the accuracy requirements of geometric measurements for contour edge coordinates. To improve robustness without altering the pixel-to-physical scale mapping, we converted both the top and front views to grayscale and applied median filtering as basic preprocessing to suppress outliers such as local highlights, impulse noise, and sensor dead pixels, while preserving edge position and sharpness as much as possible. Subsequently, a two-stage workflow of “image enhancement and edge extraction” was constructed, and various combinations of enhancement and edge detection methods were compared and evaluated. Experimental results show that the combination of CLAHE and Canny separates the inner and outer wall edge contours more stably under different bowl shapes and imaging conditions, providing a more consistent and reliable input for the subsequent extraction of the two circle parameters and wall thickness prediction.

Section 2.4 presents Bowl Thick Net for bowl wall-thickness prediction. Using ConvNeXt-Tiny as the backbone with ResNet-18 as a baseline, the model introduces a Transformer-based encoder–decoder architecture and incorporates spatial positional encoding, Hungarian matching, and Circle NMS to improve the stability of detecting the two circles corresponding to the inner and outer rims in top-down images. The model outputs the circle centers and diameters; after de-normalization and conversion using the scale factor $P_x$, the wall thickness is computed. The model is trained on 363 images augmented to 1089 and evaluated on 218 test images. Bowl Thick Net achieves recognition accuracies of 97.2% for $C_1$ and 95.4% for $C_2$, with 93.6% of images having both circles correctly identified. The MPE values of the predicted diameters are 0.64 mm for $C_1$ and 0.78 mm for $C_2$, and the wall-thickness prediction MPE is 0.75 mm, with an MWTPA of 73.3%. Capacity measurement experiments on multiple bowls further show that, using the wall-thickness predictions from this model, the capacity difference falls within 1.5–4%, demonstrating its effectiveness and accuracy in practical applications.

In Section 2.5 and Section 2.6, we propose a mask segmentation and capacity estimation process for single-bowl scenes that takes a frontal-view bowl image as input and achieves high geometric accuracy. Built on Mask R-CNN, the method retains the two-stage architecture and mask branch, while simplifying RPN proposals for the single-object setting to reduce redundant computation and improve contour delineation. Three backbones (ResNet-50, ResNeXt-50 (32 × 4d), and ResNet-C4) are compared. In addition, we improve the FPN on the C3–C5 features by proposing GFPN, which incorporates GPM geometric priors and CBAM attention to enhance multi-scale boundary responses. Using 363 images augmented to 1089, we evaluate performance with mIoU, mAP, and Boundary F1 Score. ResNeXt-50 with GFPN achieves the best results and outperforms U-Net, HTC, and BMask R-CNN. Section 2.6 defines 10 keypoints ($P_1$ to $P_6$ and $S_1$ to $S_4$) on the high-quality masks. We construct the foreground and contour sets, locate extremal points for the rim and base, and use $P_5$ and $P_6$ to exclude the base and obtain the effective height. The rim diameter, base diameter, and height are then computed and converted to millimeters. Next, we uniformly sample the arc length of the right-side contour to fit the outer-wall function. Combined with the wall thickness predicted by Bowl Thick Net, the outer-wall curve is corrected to the inner-wall curve. Finally, bowl capacity and the reconstructed model are obtained via axisymmetric integration and validated through error analysis, with the minimum bowl-volume prediction error reaching 3.41%.

Section 2.7 stores each bowl’s rim diameter, base diameter, effective height, capacity, and inner-wall function in a database indexed by bowl ID after the geometric parameters and reconstructed model are obtained. During prediction, the user selects the bowl type and captures a top-down image of the liquid or food. Under consistent camera parameters and viewing angle, the visible upper surface is segmented and circularly fitted to obtain the surface radius. The corresponding height is then inferred from the inner-wall function, and the liquid or food volume is computed via axisymmetric integration. For food nutrient content prediction, a classification head is further retained to recognize the eight dish categories. The estimated volume is converted to weight using dish-specific density, and calories, carbohydrates, protein, and fat are computed according to the per-unit-mass nutrient table for each dish. Experimental results show a mean prediction error of 9.24% for liquid volume and a mean prediction error of 11.49% for nutrient content by weight across the eight food categories.

4. Discussion

In this subsection, we discuss several important issues, including the assumptions, limitations, and applicability of our method for bowl capacity prediction, liquid volume estimation in bowls, and nutrient content prediction for food contained in bowls. The proposed approach simplifies the bowl cavity as an axisymmetric solid of revolution about the central axis and represents its geometry using a continuous inner-wall function. The top-down mask of the liquid or food surface is fitted with an equivalent circle radius, which is then used to infer the corresponding height and compute volume via axisymmetric integration. This assumption is effective for regular, circular bowls; however, for vessels with non-circular rims, eccentric structures, handles, or pronounced local irregularities on the inner wall, geometric consistency is violated, leading to systematic bias. In this study, scale conversion relies on camera calibration and the pixel-to-physical mapping. Therefore, during inference, the camera parameters, mounting height, top-down viewing angle, and resolution should be kept as close as possible to the experimental setup to ensure that the mask radius and the inner-wall function are defined in the same geometric coordinate system. If the capture distance changes, the lens differs, or the viewing angle deviates, the pixel-to-physical mapping must be recalibrated; however, recalibration is unnecessary when the system is deployed as a fixed device mounted on a stand or tabletop. Moreover, liquid volume estimation assumes that the liquid surface can be approximated as a flat, level plane, and it is primarily validated under homogeneous liquid conditions. When solid particles, foam, wall adhesion, high-viscosity fluids, or substantial surface fluctuations occur, the equivalent circular radius from the top-down view no longer corresponds to a unique and stable height, and the inferred volume may be distorted. 
Mixed foods and uneven density cannot be ignored in real-world dining scenarios, but reliable solutions remain difficult to obtain from single RGB or depth images alone. When multiple types of food are placed in a bowl simultaneously, with occlusion and stacking, the image typically covers only the top visible surface, leaving the occluded regions without direct observation. Decomposing the contents further into the quantity of each food type often requires empirical assumptions or fixed proportions, leading to unstable results and large error fluctuations. Therefore, the nutritional composition experiment in this paper places only one type of food in the bowl at a time to reduce the uncertainty caused by stacking multiple foods; modeling and estimation for mixed-food scenarios is left as a future research direction. To address the volume-to-weight conversion bias that may be introduced by uneven food density and differences in packing tightness, we perform 20 repeated measurements for each type of food in the density statistics stage, covering both lightly compacted and relatively tightly compacted packing states, and average the results to improve the robustness of the density estimate across packing states during testing. Table 15 provides an overview of the methods and limitations of this study at each stage.

Despite these limitations, the key advantage of our method is that it unifies frontal-view and top-down information into a physical-scale representation and enables a closed-loop, interpretable inference pipeline from images to geometric parameters and then to volume estimation. The workflow does not require depth sensors and can be stably reused under controlled imaging conditions after a one-time calibration. Axisymmetric integration gives the volume and nutrient estimates a clear physical meaning, facilitating error analysis and practical deployment. In addition, database management enables rapid switching among multiple bowls and supports batch prediction.

Author Contributions

Conceptualization, X.J. and Y.F.; methodology, X.J.; validation, X.J.; investigation, X.J., H.L., K.S., H.Z. and Y.F.; resources, L.S.; data curation, X.J. and H.L.; writing—original draft preparation, X.J.; writing—review and editing, Y.F., X.J., H.L., L.S., K.S. and H.Z.; visualization, H.L.; supervision, Y.F.; funding acquisition, L.S.; project administration, Y.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financially supported by the Research Project of Liaoning Provincial Applied Basic Research Program Project (2025JH2/101330041) and the National Key R&D Program of China (2018YFD0400800).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Lianzheng Sun was employed by Qingdao Yeelink Information Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CNN	Convolutional Neural Network
R-CNN	Region-based Convolutional Neural Network
CLAHE	Contrast Limited Adaptive Histogram Equalization
Circle NMS	Circle Non-Maximum Suppression
RGB	Red Green Blue
PE	Positional Encoding
FFN	Feed-Forward Network
SGD	Stochastic Gradient Descent
$MPE$	Mean Prediction Error
$MWTPA$	Mean Wall Thickness Prediction Accuracy
FPN	Feature Pyramid Network
GFPN	Geometry-aware Feature Pyramid Network
GPM	Geometry-aware Prior Modulation
CBAM	Convolutional Block Attention Module
Conv	Convolution
RPN	Region Proposal Network
RoI	Region of Interest
RoIAlign	Region of Interest Align
$IOU$	Intersection over Union
$mIOU$	Mean Intersection over Union
$AP$	Average Precision
$mAP$	Mean Average Precision
$MV$	Measured Value
$PV$	Prediction Value

References

1. Lo, F.P.-W.; Qiu, J.; Jobarteh, M.L.; Sun, Y.; Wang, Z.; Jiang, S.; Baranowski, T.; Anderson, A.K.; Mccrory, M.A.; Sazonov, E. AI-enabled wearable cameras for assisting dietary assessment in African populations. npj Digit. Med. 2024, 7, 356.
2. Lee, D.-s.; Kwon, S.-k. Amount estimation method for food intake based on color and depth images through deep learning. Sensors 2024, 24, 2044.
3. Yan, R.; Luo, H.; Lu, J.; Liu, D.; Posluszny, H.; Dhaliwal, M.P.; MacLeod, J.; Qin, Y.; Yang, C.; Hartman, T.J. DietAI24 as a framework for comprehensive nutrition estimation using multimodal large language models. Commun. Med. 2025, 5, 458.
4. Dehais, J.; Anthimopoulos, M.; Shevchik, S.; Mougiakakou, S. Two-view 3D reconstruction for food volume estimation. IEEE Trans. Multimed. 2016, 19, 1090–1099.
5. Jia, W.; Ren, Y.; Li, B.; Beatrice, B.; Que, J.; Cao, S.; Wu, Z.; Mao, Z.-H.; Lo, B.; Anderson, A.K. A novel approach to dining bowl reconstruction for image-based food volume estimation. Sensors 2022, 22, 1493.
6. Li, B.; Sun, M.; Mao, Z.-H.; Jia, W. Dining Bowl Modeling and Optimization for Single-Image-Based Dietary Assessment. Sensors 2024, 24, 6058.
7. Schober, D.; Güldenring, R.; Love, J.; Nalpantidis, L. Vision-based robot manipulation of transparent liquid containers in a laboratory setting. In Proceedings of the 2025 IEEE/SICE International Symposium on System Integration (SII), Munich, Germany, 21–24 January 2025; pp. 1193–1200.
8. Cobo, M.; Heredia, I.; Aguilar, F.; Iglesias, L.L.; García, D.; Bartolomé, B.; Moreno-Arribas, M.V.; Yuste, S.; Pérez-Matute, P.; Motilva, M.-J. Artificial intelligence to estimate wine volume from single-view images. Heliyon 2022, 8, e10557.
9. Zhao, Y.; Zhu, P.; Jiang, Y.; Xia, K. Visual nutrition analysis: Leveraging segmentation and regression for food nutrient estimation. Front. Nutr. 2024, 11, 1469878.
10. Han, Y.; Cheng, Q.; Wu, W.; Huang, Z. Dpf-nutrition: Food nutrition estimation via depth prediction and fusion. Foods 2023, 12, 4293.
11. Gutiérrez-Moizant, R.; Boada, M.J.L.; Ramírez-Berasategui, M.; Al-Kaff, A. Novel Bayesian Inference-Based Approach for the Uncertainty Characterization of Zhang’s Camera Calibration Method. Sensors 2023, 23, 7903.
12. Hao, Y.; Tai, V.C.; Tan, Y.C. A systematic stereo camera calibration strategy: Leveraging latin hypercube sampling and 2k full-factorial design of experiment methods. Sensors 2023, 23, 8240.
13. Guan, S.; Liu, B.; Chen, S.; Wu, Y.; Wang, F.; Liu, X.; Wei, R. Adaptive median filter salt and pepper noise suppression approach for common path coherent dispersion spectrometer. Sci. Rep. 2024, 14, 17445.
14. Buriboev, A.S.; Khashimov, A.; Abduvaitov, A.; Jeon, H.S. CNN-Based Kidney Segmentation Using a Modified CLAHE Algorithm. Sensors 2024, 24, 7703.
15. Ahmad, M.S.Z.; Aziz, N.A.A.; Lim, H.S.; Ghazali, A.K.; Latiff, A.A. Impact of Image Enhancement Using Contrast-Limited Adaptive Histogram Equalization (CLAHE), Anisotropic Diffusion, and Histogram Equalization on Spine X-Ray Segmentation with U-Net, Mask R-CNN, and Transfer Learning. Algorithms 2025, 18, 796.
16. El Houby, E.M. Acute lymphoblastic leukemia diagnosis using machine learning techniques based on selected features. Sci. Rep. 2025, 15, 28056.
17. He, Y.; Kang, S.; Li, W.; Xu, H.; Liu, S. Advanced enhancement technique for infrared images of wind turbine blades utilizing adaptive difference multi-scale top-hat transformation. Sci. Rep. 2024, 14, 15604.
18. Cao, S.; Zhao, C.; Dong, J.; Fu, X. Ship detection in synthetic aperture radar images under complex geographical environments, based on deep learning and morphological networks. Sensors 2024, 24, 4290.
19. Zhong, X.; Liang, G.; Meng, L.; Xi, W.; Gu, L.; Tian, N.; Zhai, Y.; He, Y.; Huang, Y.; Jin, F. Automated Particle Size Analysis of Supported Nanoparticle TEM Images Using a Pre-Trained SAM Model. Nanomaterials 2025, 15, 1886.
20. Xie, W.; Zhou, D.; Zhang, W.; Wang, W. EAFormer: Edge-Aware Guided Adaptive Frequency-Navigator Network for Image Restoration. Sensors 2025, 25, 5912.
21. Li, Y.; Zhang, D. Toward Efficient Edge Detection: A Novel Optimization Method Based on Integral Image Technology and Canny Edge Detection. Processes 2025, 13, 293.
22. Avendaño, J.C.; Leander, J.; Karoumi, R. Image-based concrete crack detection method using the median absolute deviation. Sensors 2024, 24, 2736.
23. Wu, Y.; Li, Q. The algorithm of watershed color image segmentation based on morphological gradient. Sensors 2022, 22, 8202.
24. Ordoñez, C.; Pastore, J.; Blotta, E. Strategies for the Calculation of the Circle Hough Transform in Low-Resources Systems. In Proceedings of the Congreso Argentino de Bioingeniería, Buenos Aires, Argentina, 3–6 October 2023; pp. 590–598.
25. Choi, C.-Y.; Lee, S.-W. NeXt-DETR: A scalable and efficient transformer-based detector for resource-constrained systems. Inf. Sci. 2025, 731, 122913.
26. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986.
27. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229.
28. Zhang, H.; Liang, P.; Sun, Z.; Song, B.; Cheng, E. CircleFormer: Circular nuclei detection in whole slide images with circle queries and attention. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2023; pp. 493–502.
29. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
30. Huang, Y.; Kruyer, A.; Syed, S.; Kayasandik, C.B.; Papadakis, M.; Labate, D. Automated detection of GFAP-labeled astrocytes in micrographs using YOLOv5. Sci. Rep. 2022, 12, 22263.
31. Li, Y.; Bao, H.; Ge, Z.; Yang, J.; Sun, J.; Li, Z. Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 1486–1494.
32. Zuo, L.a.; Ling, J.; Hu, N.; Chen, R. Establishment and validation of a population pharmacokinetic model for apatinib in patients with tumors. BMC Cancer 2024, 24, 1346.
33. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Vancouver, BC, Canada, 7–14 July 2017; pp. 2961–2969.
34. Zhao, K.; Wang, X.; Chen, X.; Zhang, R.; Shen, W. Rethinking mask heads for partially supervised instance segmentation. Neurocomputing 2022, 514, 426–434.
35. He, M.; He, K.; Huang, Q.; Xiao, H.; Zhang, H.; Li, G.; Chen, A. Lightweight mask R-CNN for instance segmentation and particle physical property analysis in multiphase flow. Powder Technol. 2025, 449, 120366.
36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
37. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500.
38. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
39. Han, B.; He, L.; Ke, J.; Tang, C.; Gao, X. Weighted parallel decoupled feature pyramid network for object detection. Neurocomputing 2024, 593, 127809.
40. Xiu, J.; Li, Y.; Zhao, N.; Fang, H.; Wang, X.; Yao, A. Geometric Alignment and Prior Modulation for View-Guided Point Cloud Completion on Unseen Categories. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 11–15 June 2025; pp. 27435–27444.
41. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
42. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
43. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
44. Cheng, B.; Girshick, R.; Dollár, P.; Berg, A.C.; Kirillov, A. Boundary IoU: Improving object-centric image segmentation evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15334–15342.
45. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241.
46. Chen, K.; Pang, J.; Wang, J.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Shi, J.; Ouyang, W. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4974–4983.
47. Cheng, T.; Wang, X.; Huang, L.; Liu, W. Boundary-preserving mask r-cnn. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 660–676.
48. Siswantoro, J.; Asmawati, E.; Siswantoro, M.Z. A rapid and accurate computer vision system for measuring the volume of axi-symmetric natural products based on cubic spline interpolation. J. Food Eng. 2022, 333, 111139.
49. Jia, W.; Li, B.; Xu, Q.; Chen, G.; Mao, Z.-H.; McCrory, M.A.; Baranowski, T.; Burke, L.E.; Lo, B.; Anderson, A.K. Image-based volume estimation for food in a bowl. J. Food Eng. 2024, 372, 111943.
50. Cheng, S.-T.; Lyu, Y.-J.; Teng, C. Image-Based Nutritional Advisory System: Employing Multimodal Deep Learning for Food Classification and Nutritional Analysis. Appl. Sci. 2025, 15, 4911.
51. Evenepoel, C.; Clevers, E.; Deroover, L.; Van Loo, W.; Matthys, C.; Verbeke, K. Accuracy of nutrient calculations using the consumer-focused online app MyFitnessPal: Validation study. J. Med. Internet Res. 2020, 22, e18237.
52. Chiplonkar, S.A.; Agte, V.V. Extent of error in estimating nutrient intakes from food tables versus laboratory estimates of cooked foods. Asia Pac. J. Clin. Nutr. 2007, 16, 227.

Figure 1. Overall Task Flowchart.

Figure 2. Statistical graph of re-projection error in camera calibration.

Figure 3. Degradation simulation and median filtering effect of the acquired image: The top image is the degradation simulation image, and the bottom image is the image after median filtering. (a) is local highlight, (b) is salt and pepper noise, and (c) is random impulse noise cluster.

Figure 4. Two-stage experimental results of the bowl top view.

Figure 5. Bowl Thick Net: Model Architecture Diagram of Bowl Wall Thickness Estimation Network.

Figure 6. Confusion matrices for two-circle recognition (Bowl Thick Net): In the figure, the numbers against the blue (dark blue and light blue) background represent the number of correctly identified images; the numbers against the orange (four different shades of orange) background represent the number of incorrectly identified images. The shade of the background color indicates the proportion of correctly or incorrectly identified images.

Figure 7. Training and validation loss curves of the eight models: (a) is the loss curve during the training process, and (b) is the loss curve during the validation process.

Figure 8. Structure of the improved mask segmentation model for bowl images. The lower part of the figure illustrates the overall pipeline of the bowl mask segmentation model, from image input to mask output. The three overlapping boxes in the upper-left corner represent the three backbone architectures: ResNet-50, ResNeXt-50 (32 × 4d), and ResNet-C4. Within the ResNet-50 and ResNeXt-50 (32 × 4d) backbone boxes, the figure highlights the standard convolutional structure of ResNet-50 and the 32 parallel $3 \times 3$ grouped convolutions of ResNeXt-50 (32 × 4d), with 4 channels per group, which enhance texture and contour modeling capability. The box in the upper-right corner shows the architecture of the proposed GFPN.

Figure 9. Structure diagram of GFPN module. In the figure, the orange image layer on the left represents the feature maps output at different stages, and the green arrows indicate the feature extraction process of the backbone network from bottom to top; the blue image layer on the right represents the pyramid feature layers of each level output after GFPN fusion.

Figure 10. The mask segmentation results of the four models, along with the training and validation loss curves of our model and the $mIOU$ and $mAP$ curves of the four models: (a) The mask segmentation results of our model. (b) B Mask R-CNN mask segmentation results. (c) U-Net mask segmentation results. (d) HTC mask segmentation results. (e) Our model’s training and validation loss curves. (f) $mIOU$ curves of the four models. (g) $mAP$ curves of the four models. The bowl mask colors in panels (a) to (d) differ because each model uses its own mask display settings; this does not affect the experimental results.

Figure 11. Bowl capacity prediction principle and error statistics: (a) Reconstruction model of the bowl. (b) Bowl key outline extraction (red key points represent extreme points of the outer contour of the bowl, and yellow key points represent sampling points of the outer contour arc of the bowl). (c) The predicted inner contour curve and the actual inner contour curve of the bowl (green is the predicted curve, and blue is the actual curve). (d) Statistical chart of error values between the predicted inner contour curve of the bowl and the actual contour curve of the bowl.

Figure 12. Real images of eight types of experimental bowls. The bowl numbers in the figure correspond to the bowl numbers in Table 7.

Figure 13. Mask segmentation and volume prediction principle of liquid in a bowl: (a) Liquid mask segmentation image. (b) Liquid volume prediction principle diagram. In the figure, the red square represents the bounding box of the mask recognition result, the green arrow represents $r_{liq}$ (liquid surface radius), and the red arrow represents $z_{liq}$ (liquid surface height).

Figure 14. Flowchart for predicting the nutritional content of food. In the figure, the green box and arrow represent the model reconstruction of the bowl, the blue box and arrow represent the predicted weight of the food in the bowl, and the red box represents the predicted nutritional components of the food.

Figure 15. Estimation and Visualization Platform for the Volume of Liquid in Bowls: (a) Food nutrient content estimation page. (b) Liquid volume prediction page. (c) Bowl type and parameter data page.

Table 1. Experimental results and limitations of each method.

Method (Source of Reference)	Container Volume Prediction Results	Liquid Volume or Food Nutrient Composition Prediction Results	Limitation
Jia et al. (2022) [5]	The relative errors in the volume predictions for the nine bowls were mostly less than 5%, with the largest being 10.6%	The average relative error of the liquid volume prediction is −7.0%	The process of attaching graduated tape to the inside of the bowl and manually calibrating it is tedious and disrupts the dining experience
Li et al. (2024) [6]	The relative errors in volume prediction for the five simulated bowls were mostly less than 1%, with a maximum of 4.7%; the relative errors in volume prediction for the seven real bowls were mostly around 10%, with a maximum of 26.9%	No studies were conducted on liquid volume or food nutrient content	This method borrows from the above approach and introduces the concept of fullness, but it has a relatively large margin of error in experiments with real bowls
Schober et al. (2025) [7]	The experiment was conducted with the container volume as a known condition	The absolute errors of the three variant models for liquid volume prediction ranged from a minimum of 9.39% to a maximum of 24.21%	End-to-end volume regression was used, but experimental results showed that this method is sensitive to container material and only achieves good results on glass containers
Cobo et al. (2022) [8]	The experiment was conducted with the container volume as a known condition	The mean absolute error for predicting liquid volumes of 50–300 mL is 8 mL	The prediction was performed using an end-to-end method, but the liquid container was a wine glass, and the error was larger when the liquid was relatively full
Zhao et al. (2024) [9]	No container volume prediction experiment was conducted	The average absolute error in predicting the five nutrients in food is 17.06%	The method places high demands on the dataset, since multi-view images are required, and the food container is a plate, so depth estimation is avoided and bowl-type containers cannot be handled
Han et al. (2023) [10]	No container volume prediction experiment was conducted	The average absolute error in predicting the five nutrients in food is 17.76%	A depth camera is used to predict depth without prior knowledge of the container’s depth, and accuracy decreases for deeper containers

Table 2. Usage of each model version module.

Model	Replace Backbone *	Spatial Positional Encoding	Circle NMS
Bowl Thick Net	√	√	√
A		√	√
B	√		√
C	√	√
D			√
E		√
F	√
G

* In the “Replace Backbone” column, a check mark (√) indicates that the backbone is ConvNeXt-Tiny; otherwise, ResNet-18 is used as the backbone.

Table 3. Statistical table of recognition accuracy of C₁ and C₂ in various models.

Model	Number of Images with C₁ Correctly Identified	Number of Images with C₂ Correctly Identified	C₁ Recognition Accuracy (%)	C₂ Recognition Accuracy (%)	Rate of Images with Both C₁ and C₂ Correctly Identified (%)
Bowl Thick Net	212	208	97.2	95.4	93.6
A	203	196	93.1	89.9	87.4
B	196	187	89.9	85.8	82.7
C	194	184	89.0	84.4	81.1
D	187	158	85.8	72.5	66.9
E	184	166	84.4	76.1	72.3
F	185	162	84.9	74.3	67.5
G	179	171	82.1	78.4	70.6

Table 4. Predicted results of the diameter and bowl wall thickness of C₁ and C₂ in each model version.

Model	MPE-C₁ (mm)	MPE-C₂ (mm)	MPE-Wall Thickness (mm)	MWTPA (%)
Bowl Thick Net	0.64	0.78	0.75	73.3
A	0.76	0.73	0.88	67.5
B	0.61	0.97	0.96	53.2
C	1.02	1.72	1.21	59.9
D	1.11	1.64	1.39	57.7
E	0.74	0.88	0.91	68.1
F	0.97	1.24	1.33	56.8
G	1.23	1.45	1.41	54.2

Table 5. Experimental results of mask segmentation of bowls in models of different modules.

Backbone	FPN Type *	$mIOU_{0.5}$	$mIOU_{0.75}$	$mAP_{0.5}$	$mAP_{0.75}$	$mIOU$	$mAP$	Boundary F1 Score
ResNet-50	FPN	0.8662	0.8634	0.9521	0.9487	0.8658	0.9491	0.84
ResNeXt-50 (32 × 4d)	FPN	0.8835	0.8812	0.9728	0.9676	0.8871	0.9694	0.89
ResNet-C4	/	0.8426	0.8453	0.9139	0.9021	0.8429	0.9083	0.76
ResNet-50	GFPN	0.9024	0.9057	0.9648	0.9532	0.9044	0.9576	0.87
ResNeXt-50 (32 × 4d)	GFPN	0.9257	0.9341	0.9852	0.9817	0.9357	0.9824	0.95

* In the table, “/” indicates that ResNet-C4 is used without FPN or GFPN. The best result for each metric is highlighted in red, and the second-best is highlighted in blue.

Table 6. Comparison of experimental results for four different mask models.

Network Model	$mIOU_{0.5}$	$mIOU_{0.75}$	$mAP_{0.5}$	$mAP_{0.75}$	$mIOU$	$mAP$	Boundary F1 Score
Ours	0.9257 *	0.9341	0.9852	0.9817	0.9357	0.9824	0.95
U-Net	0.8169	0.8207	0.9094	0.9181	0.8173	0.9122	0.82
HTC	0.8416	0.8468	0.9167	0.9243	0.8455	0.9208	0.87
B Mask R-CNN	0.8825 *	0.8739	0.9273	0.9361	0.8786	0.9311	0.91

* The best value for each metric in the table is highlighted in red, and the second-best is highlighted in blue.

Table 7. A comparison table of measured and predicted values for the rim diameter, base diameter, effective height, and volume of each bowl.

	Rim Diameter of Bowl			Base Diameter of Bowl			Effective Height of Bowl			Volume of Bowl
	$MV$ (mm) *	$PV$ (mm) *	$Error$ (%) *	$MV$ (mm)	$PV$ (mm)	$Error$ (%)	$MV$ (mm)	$PV$ (mm)	$Error$ (%)	$MV$ (mL)	$PV$ (mL)	$Error$ (%)
No.1 *	159	163.5	2.83	74	72.3	−2.30	59	59.9	1.53	661	626.6	−5.20
No.2	183	189.2	3.39	85	86.8	2.12	79	77.9	−1.39	714	746.8	4.59
No.3	161	157.0	−2.48	77	78.4	1.82	58	58.6	1.03	608	570.9	−6.10
No.4	137	142.6	4.09	92	90.6	−1.52	74	75.5	2.03	375	387.8	3.41
No.5	151	154.3	2.19	68	68.7	1.03	80	81.3	1.62	458	424.6	−7.29
No.6	167	174.8	4.67	87	89.6	2.99	63	62.2	−1.27	350	335.6	−4.11
No.7	166	160.9	−3.07	63	64.0	1.59	84	82.1	−2.26	673	634.0	−5.79
No.8	160	155.4	−2.87	73	75.5	3.42	72	73.4	1.94	337	324.5	−3.71

* No.1 to No.8 denote the bowl IDs. $MV$ is short for Measured Value, and $PV$ is short for Prediction Value. The error is computed as $Error = \frac{PV - MV}{MV} \times 100\%$.
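As a worked check of the error definition in the Table 7 footnote, the short sketch below recomputes the signed volume errors from the measured and predicted capacities listed in the table and averages them. The variable and function names are illustrative only.

```python
# Measured (MV) and predicted (PV) bowl capacities in mL, copied from Table 7.
measured_ml = [661, 714, 608, 375, 458, 350, 673, 337]
predicted_ml = [626.6, 746.8, 570.9, 387.8, 424.6, 335.6, 634.0, 324.5]


def signed_error_pct(pv: float, mv: float) -> float:
    """Signed relative error (PV - MV) / MV * 100, as in the table footnote."""
    return (pv - mv) / mv * 100.0


errors = [signed_error_pct(pv, mv) for pv, mv in zip(predicted_ml, measured_ml)]
mean_error = sum(errors) / len(errors)  # arithmetic mean of signed errors

for bowl_id, err in enumerate(errors, start=1):
    print(f"No.{bowl_id}: {err:+.2f}%")
print(f"arithmetic mean error: {mean_error:+.2f}%")
```

Because the errors are signed, over- and under-estimates partially cancel in the arithmetic mean, so the mean is smaller in magnitude than most of the individual bowl errors.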

Table 8. The impact of different backbone networks and feature pyramid structures on bowl volume prediction error.

Backbone	FPN Type	Arithmetic Mean Error of the Bowl Volume (%)
ResNet-50	FPN	9.11
ResNeXt-50 (32 × 4d)	FPN	−5.27
ResNet-C4	/ *	−9.58
ResNet-50	GFPN	5.64
ResNeXt-50 (32 × 4d)	GFPN	−3.03

* ‘/’ indicates that this model does not use the FPN module.

Table 9. Eight food categories and the number of samples in different serving sizes.

		Number of Images of Food Within Each Weight Range
Food Types	Number of Images	$W$ * $< 100$ g	$100 \leq W < 200$ g	$200 \leq W < 300$ g	$300 \leq W < 400$ g	$W \geq 400$ g
Rice	194	40	70	36	42	6
Mapo Tofu	86	19	16	22	23	5
Kung Pao Chicken	71	16	1	37	15	1
Fried Noodles	58	10	19	13	16	0
Stir-fried Vegetables	64	17	16	16	9	6
Braised Eggplant	51	7	14	15	8	8
Scrambled Eggs with Tomatoes	76	19	13	19	20	5
Stir-fried Shredded Potatoes	43	17	8	14	1	2

* $W$ is short for Weight.

Table 10. Experimental Results of Mask Segmentation Performance and Volume Prediction of Liquid Surface Inside Bowl.

Test Parameters	Result
$bbox\_mAP$	0.9453
$mask\_mAP$	0.8561
Mean Liquid Volume Prediction Error * (%)	9.24

* $\text{Mean liquid volume prediction error} = \frac{1}{N} \sum_{i=1}^{N} \frac{|\text{Predicted liquid volume}_i - \text{Measured liquid volume}_i|}{\text{Measured liquid volume}_i} \times 100\%$, where $N$ denotes the number of test images.

Table 11. Comparison of average errors of different research methods in liquid volume prediction.

Method (Source of Reference)	Mean Liquid Volume Prediction Error (%)
Ours	9.24
Jia et al. (2022) [5]	7.00
Schober et al. (2025) [7]	9.39

Table 12. Density and content of various nutrients per unit weight of various foods.

Food Types	Density (g/mL)	Calories * (kcal)	Carbohydrates (g)	Protein (g)	Fat (g)
Rice	0.667	130	28.17	2.69	0.28
Mapo Tofu	0.943	119	5.96	5.91	8.38
Kung Pao Chicken	0.684	268	14.03	17.14	16.13
Fried Noodles	0.932	88	11.23	5.88	2.27
Stir-fried Vegetables	0.537	106	4.96	2.29	8.35
Braised Eggplant	0.867	140	14.14	2.39	8.19
Scrambled Eggs with Tomatoes	0.916	179	5.07	13.25	11.58
Stir-fried Shredded Potatoes	0.527	71	15.73	1.67	2.48

* In the table, Calories, Carbohydrates, Protein, and Fat denote the amounts of the corresponding nutrients contained in 100 g of each dish.
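The conversion from an estimated volume to weight and then to nutrient amounts can be sketched with two rows of Table 12. This is an illustrative example, not the implementation used in this study: the dictionary layout, the function name `estimate_nutrients`, and the 300 mL input volume are assumptions.

```python
# Two rows copied from Table 12: density in g/mL, nutrients per 100 g of dish.
NUTRIENT_TABLE = {
    "Rice": {"density": 0.667, "calories_kcal": 130.0,
             "carbohydrates_g": 28.17, "protein_g": 2.69, "fat_g": 0.28},
    "Mapo Tofu": {"density": 0.943, "calories_kcal": 119.0,
                  "carbohydrates_g": 5.96, "protein_g": 5.91, "fat_g": 8.38},
}


def estimate_nutrients(food: str, volume_ml: float) -> dict:
    """Convert volume to weight via density, then scale per-100 g nutrients."""
    row = NUTRIENT_TABLE[food]
    weight_g = row["density"] * volume_ml
    scale = weight_g / 100.0  # table values are given per 100 g
    result = {"weight_g": weight_g}
    for name, per_100g in row.items():
        if name != "density":
            result[name] = per_100g * scale
    return result


# Hypothetical input: 300 mL of rice estimated from the top-down image.
rice = estimate_nutrients("Rice", 300.0)
print({key: round(value, 2) for key, value in rice.items()})
```

Since every nutrient amount is proportional to the predicted weight, the relative error of each nutrient equals the relative error of the weight for a given dish, provided the table values themselves are taken as exact.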

Table 13. Experimental Results of Food Mask Segmentation Performance and Weight Prediction on Food Surface Inside Bowl.

Food Types	$bbox\_mAP$	$mask\_mAP$	Mean Food Weight Prediction Error * (%)
Rice	0.9463	0.8537	8.55
Mapo Tofu	0.9316	0.8359	11.49
Kung Pao Chicken	0.9041	0.8280	13.61
Fried Noodles	0.9324	0.8538	10.37
Stir-fried Vegetables	0.9268	0.8103	14.91
Braised Eggplant	0.9574	0.8458	9.67
Scrambled Eggs with Tomatoes	0.9407	0.8174	11.28
Stir-fried Shredded Potatoes	0.9352	0.8566	12.04

* $\text{Mean food weight prediction error} = \frac{1}{N} \sum_{i = 1}^{N} \frac{\left|\text{Predicted food weight}_i - \text{Measured food weight}_i\right|}{\text{Measured food weight}_i} \times 100\%$, where $N$ denotes the number of test images.
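As a quick consistency check, the unweighted average of the eight per-category errors in Table 13 reproduces the 11.49% listed for "Ours" in Table 14 and in the abstract. Treating the overall figure as a simple arithmetic mean over categories is an assumption; the source does not state how it was aggregated.

```latex
\[
\frac{8.55 + 11.49 + 13.61 + 10.37 + 14.91 + 9.67 + 11.28 + 12.04}{8}
= \frac{91.92}{8} = 11.49\,\%
\]
```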

Table 14. Comparison of average errors of different research methods in predicting the content of food nutrients.

Method (Source of Reference)	Mean Nutrient Composition Prediction Error (%)
Ours	11.49
Zhao et al. (2024) [9]	17.06
Han et al. (2023) [10]	17.76

Table 15. Summary of the methods and limitations of each stage of this study.

Stage	Method	Limitation
Acquisition and Calibration	A one-time camera calibration is performed, and the resulting calibration factor is used for scale conversion.	Subsequent test images must be captured with the same camera parameters and mounting height.
Dataset Construction	The bowl segmentation dataset was expanded from 363 to 1089 images using single-class manual annotation.	The dataset was built only in simulated restaurant and canteen settings and in experimental grid layouts, and it contains only bowl-type containers; no dataset was created for plate-type containers.
Prediction of Bowl Thickness Based on Top-Down View	The Bowl Thick Net model predicts wall thickness from a two-circle structured output and is evaluated with module ablation comparisons.	The method relies on the circularity assumption and on the quality of the visible edges; the two circles become difficult to distinguish when the rim is not a standard circle or when edges are missing.
Geometric Modelling and Volume Prediction of Bowls	Wall-thickness prediction is combined with parametric bowl model reconstruction, and volume is predicted from the resulting geometric characteristics.	The bowl prediction process assumes an axisymmetric container; non-axisymmetric vessels produce structural deviations.
Prediction of Liquid Volume or Nutritional Content of Food	Liquid volume is predicted from the liquid surface area, which is fitted to an equivalent circle within the container model. Food nutrient content prediction combines the relationships among food volume, weight, and nutrient content.	Liquids that cling to the walls or foam may reduce prediction accuracy. Food predictions can be made for only one food item at a time, and discrepancies may arise from variations in ingredient ratios and preparation methods.
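The liquid-volume stage summarized in Table 15 (surface area fitted to an equivalent circle inside an axisymmetric container model) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the inner wall is represented here as a piecewise-linear profile of (height, inner radius) points with non-decreasing radius, and the names `equivalent_radius` and `liquid_volume_ml` and the sample profile are hypothetical.

```python
import math


def equivalent_radius(surface_area_mm2):
    """Radius of the circle whose area equals the segmented liquid surface area."""
    return math.sqrt(surface_area_mm2 / math.pi)


def liquid_volume_ml(profile, surface_area_mm2):
    """Integrate the axisymmetric bowl up to the height where the inner radius
    equals the equivalent radius of the liquid surface.

    profile: list of (height_mm, inner_radius_mm), heights increasing and
    radii non-decreasing (an assumption of this sketch).
    """
    r_liquid = equivalent_radius(surface_area_mm2)
    volume_mm3 = 0.0
    for (h0, r0), (h1, r1) in zip(profile, profile[1:]):
        if r_liquid <= r0:
            break  # liquid surface lies below this segment
        if r_liquid < r1:
            # Surface lies inside this segment: interpolate the fill height.
            t = (r_liquid - r0) / (r1 - r0)
            h1, r1 = h0 + t * (h1 - h0), r_liquid
        # Volume of a conical frustum between the two radii.
        volume_mm3 += math.pi * (h1 - h0) * (r0 * r0 + r0 * r1 + r1 * r1) / 3.0
    return volume_mm3 / 1000.0  # 1 mL = 1000 mm^3


# Hypothetical bowl: base radius 20 mm, rim radius 60 mm, height 40 mm.
bowl_profile = [(0.0, 20.0), (40.0, 60.0)]
area = math.pi * 40.0 ** 2  # liquid surface with equivalent radius 40 mm
print(f"{liquid_volume_ml(bowl_profile, area):.2f} mL")
```

The mapping from radius to fill height is only unique while the inner radius grows with height, which is why the sketch restricts itself to non-decreasing profiles; the axisymmetry limitation noted in Table 15 applies here as well.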

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

