An End-to-End Image-Based Automatic Food Energy Estimation Technique Based on Learned Energy Distribution Images: Protocol and Methodology

Obtaining accurate food portion estimation automatically is challenging since the processes of food preparation and consumption impose large variations on food shapes and appearances. The aim of this paper was to estimate the numeric food energy value from eating occasion images captured using the mobile food record. To model the characteristics of food energy distribution in an eating scene, a new concept of "food energy distribution" was introduced. The mapping of a food image to its energy distribution was learned using the Generative Adversarial Network (GAN) architecture. Food energy was estimated from the image based on the energy distribution image predicted by the GAN. The proposed method was validated on a set of food images collected from a 7-day dietary study among 45 community-dwelling men and women aged 21 to 65 years. The ground truth food energy was obtained from pre-weighed foods provided to the participants. The food energy values predicted by our end-to-end energy estimation system were compared to the ground truth food energy values. The average error in the estimated energy was 209 kcal per eating occasion. These results show promise for improving the accuracy of image-based dietary assessment.


Introduction
Leading causes of death in the United States, including cancer, diabetes, and heart disease, can be linked to diet [1,2]. Accurately measuring dietary intake is considered an open research problem, and developing accurate methods for dietary assessment and evaluation continues to be a challenge. Underreporting is well documented amongst dietary assessment methods. Compared to traditional dietary assessment methods that often involve detailed handwritten reports, technology-assisted dietary assessment approaches reduce the burden of keeping such a detailed report and are preferred over traditional written dietary records for monitoring everyday activity [3].
In recent years, mobile telephones have emerged and provide unique mechanisms to monitor personal health and to collect dietary information [4]. Image-based approaches integrating application technology for mobile devices have been developed which aim at capturing all eating occasions by images as the primary record of dietary intake [3]. The "energy distribution image", which assigns relative food energy to each corresponding pixel location, enabled us to first visualize how food energy was spatially distributed across the eating scene.
Generative models learn from a real data distribution and can generate samples that are similar to those in that distribution by taking random noise as input (for example, generating fake faces that look realistic [34]). In addition, generative models can also take prior information when generating new samples [27]. Therefore, they are suitable for image-to-image translation tasks. We used generative models to predict the energy distribution image based on the eating occasion image, as generative models are a natural fit for image-to-image translation given their proven capability of learning the correspondences from one data distribution to another [27]. The aim of this paper was to develop a novel dietary assessment method to estimate the numeric food energy value from eating occasion images.

Methods
To estimate food portions (in energy), we introduce the energy distribution image, a new representation that visualizes where foods are located in the image and how much relative energy is present in different food regions. We used the Generative Adversarial Network (GAN) architecture to train a generative model that predicts food energy distribution images from eating occasion images. We built a food image data set with paired images for training the GAN [33]. To complete the end-to-end task of estimating the food energy value from a single-view eating occasion image, we used a CNN-based regression model to estimate the numeric food energy value from the learned energy distribution images.

Image-to-Energy Data Set
Food images were collected using the mobile food record (mFR™) as part of the Food in Focus study, a community-dwelling study of 45 adults (15 men and 30 women) between 21 and 65 years of age over a 7-day study period [35]. Pre-weighed food pack-outs were distributed to the participants, and uneaten foods were returned and weighed. Briefly, participants captured images of each eating occasion over the entire period using the mFR™. Providing known foods and amounts supported the objective of being able to identify the foods consumed and their amounts, which were used as ground truth for evaluating the proposed method. The food categories provided for breakfast, lunch, and dinner are listed in Table 1.
Since there is no public data set available for training our generative model, a data set of image pairs, consisting of eating occasion images and corresponding energy distribution images, was constructed using the Food in Focus study. The purpose of this data set was to learn the mappings from food images to food energy distribution images [33]. This data set was based on the ground truth food labels, segmentation masks, and energy information from the study where known foods and amounts were provided [35]. To build this data set, ground truth food labels, segmentation masks, food energy information, and the presence of the known-size fiducial marker were required. To the best of our knowledge, we are the only group that has collected such a food image data set with all of the required information listed above. We used the GAN [34] architecture to train the generative model for the task of predicting the food energy distribution image, as GAN has shown impressive success in training generative models [27-29,36,37]. In addition, GAN is able to effectively reduce the adversarial space during training [34] compared to other generative models, such as Variational Autoencoders (VAEs) [38]. Our image-to-energy data set described in Section 2.1 could not cover all food types, eating scenes, and all possible food combinations. Therefore, GAN's characteristic of reducing the adversarial space was important for our task, as it reduced the chance of the generative model overfitting on training image pairs. The energy value of the meal image is estimated from the learned food energy distribution image by training a CNN. Figure 1 shows the design of the proposed end-to-end food energy estimation based on a single-view eating occasion image. To train the GAN for the task of mapping eating occasion images to energy distribution images, pairs of eating occasion images and energy distribution images were required.
There is no device that can directly capture an "energy distribution image". We constructed the image-to-energy distribution data set using food images collected from the Food in Focus study [35]. Each food item in each eating occasion image was manually labeled and segmented. The ground truth energy information of each weighed food item in each eating occasion image was estimated using the energy values in the USDA Food and Nutrient Database for Dietary Studies.
In order to construct the energy distribution image, we first detected the location of the fiducial marker [39]. A fiducial marker is a colored checkerboard, as shown in Figure 2a, which is included in each eating occasion scene image. The marker is used to correct the color of the acquired images to match the reference colors during food identification and for camera calibration in portion size estimation [40,41]. The image-to-energy distribution data set could not be constructed if any of the above components (ground truth food labels, segmentation masks, food energy information, and the presence of the known-size fiducial marker) were missing. With the reference of the known-size fiducial marker, we removed the projective distortion in the original image using the Direct Linear Transform (DLT) [42] based on the estimated homography matrix H to create a rectified image. Suppose I is the original eating occasion image; we denote the rectified image as \hat{I} = H^{-1} I. Following the same rule of notation, for each food k and its associated segmentation mask S_k, the rectified segmentation can be expressed as \hat{S}_k = H^{-1} S_k. For each pixel location (\hat{i}, \hat{j}) \in \hat{S}_k, a scale factor \hat{w}_{\hat{i},\hat{j}} is assigned to reflect the distance between the pixel location (\hat{i}, \hat{j}) and the centroid of the segmentation mask \hat{S}_k. Based on the scale factor assigned to each pixel location in \hat{S}_k, the weighted segmentation mask \hat{S}_k can be projected back to the original pixel coordinates, denoted as \tilde{S}_k = H \hat{S}_k, and we learn the parameter P_k such that:

c_k = P_k \sum_{(i,j) \in \tilde{S}_k} \tilde{w}_{i,j},    (1)

where c_k is the ground truth energy associated with food k, P_k is the energy mapping coefficient for \tilde{S}_k, and \tilde{w}_{i,j} is the energy weight factor at each pixel that makes up the ground truth energy distribution image. We can then update the energy weight factors \tilde{w}_{i,j} in \tilde{S}_k as:

\tilde{w}_{i,j} \leftarrow P_k \, \tilde{w}_{i,j}.    (2)

We repeat the above process for all k \in \{1, \ldots, M\}, where M is the total number of food items in the eating occasion image, and then overlay all segments \tilde{S}_k onto the ground truth energy distribution image W, whose size is the same as image I = H\hat{I}. A pair of image I and the energy distribution image W is shown in Figure 2a,b, accordingly. The estimated energy distribution image shown in Figure 2c is denoted as \tilde{W}, which is learned from training on pairs of images I and the ground truth energy distribution image W.
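The construction described above can be sketched numerically. This is a minimal sketch under simplifying assumptions: the homography rectification is omitted, the masks are toy rectangles, and the distance-based falloff `1/(1+d)` is an illustrative choice of initial pixel weight, not the study's exact energy spread function.

```python
import numpy as np

def energy_distribution_image(masks, energies, shape):
    """Build a toy ground-truth energy distribution image.

    masks: list of boolean arrays (one per food item, all of the given shape)
    energies: list of ground truth energies c_k (kcal), one per food item
    The pixel weights of each segment are scaled by P_k so that
    they sum to c_k, mirroring c_k = P_k * sum(w_ij).
    """
    W = np.zeros(shape, dtype=float)
    for mask, c_k in zip(masks, energies):
        ys, xs = np.nonzero(mask)
        cy, cx = ys.mean(), xs.mean()        # centroid of the mask
        # initial weight: higher near the centroid (illustrative falloff)
        d = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)
        w = 1.0 / (1.0 + d)
        P_k = c_k / w.sum()                  # energy mapping coefficient
        W[ys, xs] += P_k * w                 # scaled weights now sum to c_k
    return W

# Toy example: two "foods" on a 64x64 canvas with hypothetical energies
m1 = np.zeros((64, 64), bool); m1[10:30, 10:30] = True   # e.g., 95 kcal item
m2 = np.zeros((64, 64), bool); m2[40:60, 40:60] = True   # e.g., 113 kcal item
W = energy_distribution_image([m1, m2], [95.0, 113.0], (64, 64))
print(round(W.sum(), 1))  # 208.0 -- pixel values sum to the total energy
```

By design, the pixel values inside each segment sum to that food's ground truth energy, so summing the whole image recovers the total energy of the eating occasion.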


Generative Adversarial Networks (GAN)
GAN architecture has shown impressive success in training generative models [27-29,36,37]. In a GAN, two models are trained simultaneously: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than from G [34]. The common analogy for the GAN architecture is a game between producing counterfeits (the generative model) and detecting counterfeits (the discriminative model) [34]. To formulate the GAN, we specify the cost functions. We use \theta^{(G)} to denote the parameters of the generative model G and \theta^{(D)} to denote the parameters of the discriminative model D. The generative model G attempts to minimize the cost function:

J^{(G)}(\theta^{(D)}, \theta^{(G)}) = \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))],    (3)

where the discriminative model D attempts to minimize the cost function:

J^{(D)}(\theta^{(D)}, \theta^{(G)}) = -\mathbb{E}_{x \sim p_{data}}[\log D(x)] - \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))].    (4)

In a zero-sum game, we have:

J^{(G)} = -J^{(D)}.    (5)

Therefore, the overall cost can be formulated as the value function:

V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))],    (6)

where x is sampled from the true data distribution p_{data} and z is random noise generated by distribution p_z. The generative model takes z and generates a fake sample G(z). The goal of the minimax game would then be:

\min_G \max_D V(D, G).    (7)

Adversarial samples are data that can easily lead neural networks to make mistakes. The GAN takes adversarial training samples by its nature; therefore, it can significantly reduce the adversarial space in which the generative model makes mistakes. As a result, the use of the GAN architecture can greatly reduce the training samples needed to model the statistical insights of the true data. During each update of the generative model G, the generated fake sample G(z) becomes more like the true sample x. Therefore, after sufficient epochs of training, the discriminator D is unable to differentiate between the two distributions x and G(z) [34].
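As a loose numeric illustration of the value function above, the snippet below evaluates V(D, G) for hand-picked discriminator outputs (toy probabilities, not a trained model): a discriminator that separates real from fake yields a larger value than the equilibrium case where D outputs 0.5 everywhere.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))], estimated on samples.

    d_real: D's outputs on true samples x; d_fake: D's outputs on G(z).
    """
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# Early in training: D separates real (p ~ 0.9) from fake (p ~ 0.1) samples.
early = gan_value(np.array([0.9, 0.85]), np.array([0.1, 0.15]))

# At equilibrium: D outputs 0.5 everywhere, so V = 2 * log(0.5).
eq = gan_value(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
print(round(eq, 3))  # -1.386
```

The inner maximization over D pushes the value up (as in `early`), while training G pushes it back down toward the equilibrium value 2·log(0.5).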

The Use of Conditional GAN (cGAN) for Image Mappings
We used the conditional GAN (cGAN) [27] to estimate the energy distribution image [33], as cGAN is a natural fit for predicting an image output based on an input image. A cGAN attempts to learn the mapping from a random noise vector z to a target image y conditioned on the observed image x: G(x, z) \rightarrow y. The objective of a cGAN can be expressed as:

\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))].    (8)

In addition, a conditional loss \mathcal{L}_{conditional}(G) [27] is added to further improve G(x, z) \rightarrow y:

\mathcal{L}_{conditional}(G) = \mathbb{E}_{x,y,z}[D(y, G(x, z))],    (9)

where D(y, G(x, z)) is a distance measure between y and G(x, z). Common criteria are the L_2 distance [43]:

D(y, G(x, z)) = \| y - G(x, z) \|_2,    (10)

the L_1 distance [27]:

D(y, G(x, z)) = \| y - G(x, z) \|_1,    (11)

and a smoothed version of the L_1 distance:

D(y, G(x, z)) = \sum_i \mathrm{smooth}(y_i - G(x, z)_i), \quad \mathrm{smooth}(a) = \begin{cases} 0.5 a^2 & |a| < 1 \\ |a| - 0.5 & \text{otherwise.} \end{cases}    (12)

So, the final objective [27,34] is:

G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{conditional}(G),    (13)

where the generative model G^* is used to estimate the energy distribution image \tilde{W} based on the input eating occasion image I.
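The combined objective can be illustrated with a loose numeric sketch (not the training loop): `lam` stands in for λ, the adversarial term is evaluated on a single sample without expectations, and the arrays are toy values rather than real images.

```python
import numpy as np

def cgan_loss(d_real, d_fake, y, g, lam=100.0):
    """Toy evaluation of L_cGAN + lam * L_conditional for one sample.

    d_real: discriminator probability on the real pair (x, y)
    d_fake: discriminator probability on the fake pair (x, G(x, z))
    y, g:   ground truth image and generator output G(x, z)
    """
    adv = np.log(d_real) + np.log(1.0 - d_fake)   # single-sample L_cGAN term
    l1 = np.abs(y - g).mean()                     # L1 conditional loss
    return adv + lam * l1

y = np.array([[1.0, 2.0], [3.0, 4.0]])
g_good = y + 0.01    # prediction close to the target
g_bad = y + 1.0      # prediction far from the target
# With the discriminator fixed, the conditional term penalizes g_bad heavily.
assert cgan_loss(0.5, 0.5, y, g_good) < cgan_loss(0.5, 0.5, y, g_bad)
```

With λ = 100 (the value used later in the experiments), the conditional L1 term dominates, which is what drives the generator output toward the ground truth energy image pixel-wise.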

Food Energy Estimation Based on Energy Distribution Images
We were able to obtain the energy distribution image [33] for each RGB eating occasion image using the generative model G trained by the GAN. An example of an original food image and an estimated energy distribution image is shown in Figure 2a,c. Energy distribution images represent how food energy is distributed in the eating scene. Our goal was to estimate food energy (a numerical value) based on the estimated energy distribution image. This is essentially a regression task, as shown in Figure 3. We used a CNN-based regression model for the task of estimating energy from energy distribution images. For the regression model, we used a VGG-16-based [23] architecture, as shown in Figure 4. As VGG-16 has shown impressive results on object detection tasks, it is sufficient for learning complex image features. We modified the original VGG-16 architecture and added an additional linear layer, as shown in Figure 4, so that the CNN-based architecture was suitable for the energy value regression task. Instead of using random initialization for VGG-16 and training from scratch, we used VGG-16 weights pre-trained on ImageNet [44]. The pre-trained weights are indicated in the dashed bounding box in Figure 4. We used random initialization for the linear layer. We then fine-tuned the pre-trained weights of the VGG-16 network for the energy value prediction task, building on the complex image features originally learned from ImageNet [44]. With the regression model, we can predict the energy of the foods in a single-view eating occasion image.

Learning Image-to-Energy Mappings
We used 202 food images [35] that were manually annotated with ground truth segmentation masks and labels for training. Data augmentation techniques, such as rotating, cropping, and flipping, were used to expand the database. In total, there were 1875 paired images (an image pair contains one eating occasion image and its corresponding energy distribution image) used to train the cGAN and 220 paired images for testing.
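A detail worth noting about paired augmentation: any geometric transform (rotation, crop, flip) must be applied identically to the eating occasion image and its energy distribution image so the pairs stay pixel-aligned. A minimal sketch of this idea, with toy arrays standing in for real images:

```python
import numpy as np

def augment_pair(img, energy_img, k_rot=1, flip=False):
    """Apply the same rotation/flip to both halves of a training pair."""
    out_i, out_e = np.rot90(img, k_rot), np.rot90(energy_img, k_rot)
    if flip:
        out_i, out_e = np.fliplr(out_i), np.fliplr(out_e)
    return out_i, out_e

img = np.arange(16).reshape(4, 4)
energy = img * 0.5                      # toy paired energy image
a_img, a_energy = augment_pair(img, energy, k_rot=2, flip=True)

# The transform preserves the pixel-wise pairing between the two images.
assert np.allclose(a_energy, a_img * 0.5)
```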
Once the cGAN estimated the energy distribution image \tilde{W}, we could then determine the energy for a food image (portion size estimation) as:

\hat{E} = \sum_{(i,j)} \tilde{W}_{i,j}.    (14)

To compare the estimated energy image \tilde{W} (Figure 2c) with the ground truth energy image W (Figure 2b), we defined the error between \tilde{W} and W as:

e = \frac{\left| \sum_{(i,j)} \tilde{W}_{i,j} - \sum_{(i,j)} W_{i,j} \right|}{\sum_{(i,j)} W_{i,j}}.    (15)

We compared the energy estimation error rates at different epochs for the two different cGAN models we used, the encoder-decoder architecture (Figure 5) and the U-Net architecture (Figure 6). Compared to the encoder-decoder architecture (Figure 5), the U-Net architecture (Figure 6) was more accurate and stable. The reason is that information from the "encoder" layers can be directly copied to the corresponding "decoder" layers in the U-Net architecture to provide precise locations [45], an idea similar to the skip connections in ResNet [25].
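The portion estimate and error rate reduce to sums over pixel values. A small sketch with hypothetical uniform images (the numbers are illustrative, not study data):

```python
import numpy as np

def estimate_energy(W_hat):
    """Total energy estimate: sum of the energy distribution image's pixels."""
    return float(W_hat.sum())

def error_rate(W_hat, W):
    """Relative error between estimated and ground truth energy images."""
    return abs(W_hat.sum() - W.sum()) / W.sum()

W = np.full((8, 8), 8.0)        # ground truth: 512 kcal spread uniformly
W_hat = np.full((8, 8), 7.2)    # hypothetical prediction: 460.8 kcal
print(round(estimate_energy(W_hat), 1), round(error_rate(W_hat, W), 2))
```

An error rate of 0.10 on this toy pair corresponds to the ~10% error rates reported for the U-Net model at epoch 200.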


We also compared the energy estimation error rates under different conditional loss settings, \mathcal{L}_{conditional}(G), using U-Net. We used a batch size of 16 with \lambda = 100 in Equation (13), the Adam [46] solver with initial learning rate \alpha = 0.0002, and momentum parameters \beta_1 = 0.5, \beta_2 = 0.999 [27]. We observed that the distance measure D(y, G(x, z)) defined in Equations (10)-(12) performs better using the L_1 or L_2 norms than using the smoothed L_1 norm. At epoch 200, the energy estimation error rates were 10.89% (using the L_1 criterion) and 12.67% (using the L_2 criterion), respectively. In the experiments, we included food types whose shapes are difficult to define (for example, fries). Predicting the energy for these food types is very challenging using a geometric-model-based approach [17].

Food Energy Estimation Based on Energy Distribution Images
We predicted the food energy of each eating occasion image based on its energy distribution image generated by the generative model. The dimension of the predicted energy distribution image was 256 by 256 pixels. We resized the predicted energy distribution image from 256 by 256 to 224 by 224 to fit the input size of the VGG-16 architecture, using the OpenCV implementation of image resize, which is based on linear interpolation. The food energy estimation was then compared to the ground truth food energy from the Food in Focus study. We used 1390 eating occasion images, also collected from the Food in Focus study [35], with ground truth food energy (kilocalories) for each food item in each eating occasion image. A total of 1043 of these eating occasion images were used for training and 347 for testing, selected by random sampling. All of the eating occasion images were captured by users sitting naturally at a table; there were no extreme changes in viewing angle. The errors for predicted food energy in Figure 7 are defined as:
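The resize step can be sketched in plain numpy. This is a minimal bilinear resize approximating OpenCV's linear interpolation (`cv2.resize` with `INTER_LINEAR`), not the OpenCV implementation itself:

```python
import numpy as np

def resize_bilinear(img, out_h, out_w):
    """Minimal bilinear resize for a 2D array."""
    in_h, in_w = img.shape
    # Map each output pixel center back to input coordinates.
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    # Blend the four neighboring input pixels for each output pixel.
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

# Resize a 256x256 predicted energy distribution image to VGG-16's 224x224 input.
pred = np.full((256, 256), 2.0)
out = resize_bilinear(pred, 224, 224)
print(out.shape)  # (224, 224)
```

Note that interpolation changes the pixel sum slightly, which is one reason the downstream regression model, rather than a raw pixel sum, produces the final energy estimate.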

Error = Estimated Food Energy − Ground Truth Food Energy.    (16)

Figure 8 shows the relationship between the ground truth food energy and the food energy estimation of the eating occasion images in the testing data set. The dashed line in Figure 8 indicates where the ground truth and estimated energy are equal, i.e., the estimation error is zero. Therefore, points above this line are overestimated, and points below this line are underestimated. Figures 9 and 10 show examples of food energies that have been over- and underestimated, and we use "+" and "−" to indicate over- and underestimation, respectively. The average ground truth energy of an eating occasion image in the testing data set was 538 kilocalories.

We observed that the estimation was more accurate for eating occasion images with ground truth energy around the average, compared to those with extremely high or low ground truth energy, such as zero kilocalories. This is because there were not sufficient eating occasion images with very high or low ground truth energy in our data set to train the neural networks.
The error distribution of predicted food energies for the 347 testing eating occasion images is shown in Figure 7. We found that the average energy estimation error was 209 kilocalories. An overestimation is displayed as a positive number.
The average ground truth for all eating occasion images was 546 kilocalories; for breakfast, lunch, and dinner eating occasion images, it was 531, 603, and 506 kilocalories, respectively. The average energy estimation error we obtained was 209 kilocalories; for breakfast, lunch, and dinner eating occasion images, it was 204, 211, and 210 kilocalories, respectively. Several sample eating occasion images with overestimated food energy are shown in Figure 9, and eating occasion images with underestimated food energy are shown in Figure 10, accordingly.
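The per-meal error summary above can be reproduced in miniature. The numbers below are hypothetical stand-ins, not the study's data; they only illustrate the bookkeeping of signed error (Equation (16), overestimation positive) and average absolute error per meal type:

```python
import numpy as np

# Hypothetical estimates and ground truth (kcal) for four eating occasions.
est = np.array([600.0, 450.0, 700.0, 300.0])
gt = np.array([520.0, 610.0, 500.0, 540.0])
meal = np.array(["breakfast", "lunch", "dinner", "lunch"])

error = est - gt                       # positive => overestimated
for m in ("breakfast", "lunch", "dinner"):
    print(m, round(np.mean(np.abs(error[meal == m])), 1))
```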



Discussion
We have advanced the field of research for automatic food portion estimation by developing a novel, food-image-based end-to-end system to estimate food energy using learned energy distribution images. The contributions of this work can be summarized as follows: We introduced a method for modeling the characteristics of energy distribution in an eating scene using generative models. Based on the predicted food energy distribution image, we designed a CNN-based regression model to estimate the energy value from the learned energy distribution images. We designed and implemented a novel end-to-end system to estimate food energy based on a single-view RGB eating occasion image. The results were validated using data from the Food in Focus study, in which 45 community-dwelling men and women aged 21-65 years consumed known foods and amounts over 7 days [35].
The advantage of our technique over geometric model-based techniques is that the system is training based: pre-defined geometric models cover only certain food types with known shapes, a limitation that training-based methods do not share. In addition, the "energy distribution image" we introduced makes it possible to visualize how food energy is spatially distributed across the eating scene (for example, regions of the image containing apple should have smaller weights, due to lower energy in kcal, than regions containing cheese). Error analysis can therefore draw not only on the final estimated numeric energy values but also on the intermediate "energy distribution image".
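The intuition behind the energy distribution image can be illustrated with a toy NumPy sketch. The region masks and per-food kcal values below are hypothetical and are not taken from the study data; the sketch only shows the property described above: each pixel carries a weight proportional to the energy of the food it depicts, so apple pixels receive smaller weights than cheese pixels, and integrating the distribution image recovers the occasion's total energy.

```python
import numpy as np

# Toy 8x8 "eating scene": left half apple, right half cheese (hypothetical masks).
H, W = 8, 8
apple_mask = np.zeros((H, W), dtype=bool)
cheese_mask = np.zeros((H, W), dtype=bool)
apple_mask[:, : W // 2] = True
cheese_mask[:, W // 2 :] = True

# Hypothetical per-food energy totals (kcal) for this eating occasion.
apple_kcal, cheese_kcal = 95.0, 340.0

# Build an energy distribution image: each food's kcal is spread
# uniformly over its pixels, so its pixel weights sum to its energy.
energy_image = np.zeros((H, W))
energy_image[apple_mask] = apple_kcal / apple_mask.sum()
energy_image[cheese_mask] = cheese_kcal / cheese_mask.sum()

# Apple pixels carry smaller weights than cheese pixels ...
assert energy_image[0, 0] < energy_image[0, -1]

# ... and summing the pixel weights recovers the total energy.
total_kcal = energy_image.sum()
print(round(total_kcal, 1))  # 435.0
```

In the actual system, this distribution image is predicted by the GAN, and the numeric value is read off by the CNN regression model rather than by direct summation; the sketch only makes the spatial-weighting idea concrete.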
As our end-to-end food portion estimation is a training-based system, its limitations are mainly determined by the training data. Expanding the training data set with a larger sample size, images captured over a longer period of time, and more food types could improve the accuracy of automatic food portion estimation. For wider application, future studies need to include diverse eating styles and patterns, broadening these methods to diverse population groups. These results point to the importance of controlled feeding studies using known foods and amounts.
The results of such studies, conducted on a wider scale, would support broader application of these automated image-based methods and improve the accuracy of their results. An image-based system, such as TADA™, which uses the mFR™, is necessary for automatic food portion estimation.
There are several reasons that may have led to the observed food energy estimation errors. First, although we used 1875 paired food images to train the generative model with the GAN architecture [33], these images did not cover all eating occasions; similarly, the regression model for numeric energy value prediction was trained on 1043 eating occasion images, and more eating occasion images and food types could improve the accuracy of the end-to-end system. Second, when building the image-to-energy data set [33], the energy distribution images were synthetic images defined by handcrafted energy spread functions rather than derived from real 3D structure or depth information; neither depth nor real 3D structure information was available when the eating occasion images were captured [3]. To further improve accuracy and address this challenge, we are currently investigating techniques to incorporate depth information into the end-to-end system so that the 3D structural features of the foods in the images can also be learned by the neural networks.
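To make the notion of a handcrafted energy spread function concrete, the sketch below spreads a food's energy around its image location with a Gaussian kernel normalized so the pixel weights still sum to the food's kcal. This is a generic illustration under our own assumptions, not the specific spread function used in [33]; the food positions and kcal values are hypothetical.

```python
import numpy as np

def gaussian_spread(shape, center, sigma, kcal):
    """Spread `kcal` over an image of `shape` with a Gaussian centered at
    `center`, normalized so the pixel weights sum exactly to `kcal`.
    (Illustrative only; not the spread function from the original study.)"""
    ys, xs = np.mgrid[: shape[0], : shape[1]]
    kernel = np.exp(-((ys - center[0]) ** 2 + (xs - center[1]) ** 2) / (2 * sigma ** 2))
    return kcal * kernel / kernel.sum()

# Hypothetical scene: one food item at (20, 20) worth 140 kcal,
# another at (20, 44) worth 80 kcal.
dist = gaussian_spread((64, 64), (20, 20), 6.0, 140.0)
dist += gaussian_spread((64, 64), (20, 44), 6.0, 80.0)

# The synthetic ground truth image still integrates to the total energy.
print(round(dist.sum(), 1))  # 220.0
```

Because each kernel is normalized before scaling, the synthetic ground truth keeps the desired property that summing pixel weights yields the eating occasion's total kcal, regardless of kernel width or placement.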

Conclusions
In this work, we proposed a novel end-to-end system that directly estimates food energy from eating occasion images captured with an image-based system. Our system first learned the image-to-energy mapping using a GAN architecture; based on the predicted food energy distribution image, a CNN-based regression model then estimated the numeric energy value. To our knowledge, this method represents a paradigm shift in dietary assessment. The proposed method was validated using data collected from 45 men and women between 21-65 years old. We obtained accurate food energy estimation with an average error of 209 kilocalories per eating occasion on images collected from the Food in Focus study using the mFR™. The training-based technique for end-to-end food energy estimation no longer requires fitting geometric models onto food objects, an approach that scales poorly because a large number of geometric models is needed to cover the many food types appearing in food images. In the future, combining automatically detected food labels, segmentation masks, and contextual dietary information has the potential to further improve the accuracy of such an end-to-end food portion estimation system.

Author Contributions:
The manuscript represents the collaborative work of all the authors. The work was conceptualized by S.F. and F.Z. S.F. and Z.S. developed the methodology and performed the analysis with supervision from F.Z. The Food in Focus study was designed and conducted by C.J.B. and D.A.K. The original draft was prepared by S.F. and Z.S., and all authors reviewed and edited the manuscript. All authors read and approved the final manuscript.
Funding: This work was partially sponsored by the National Science Foundation under grant 1657262, NIH, NCI (1U01CA130784-01); NIH, NIDDK (1R01-DK073711-01A1) for the mobile food record and by the endowment of the Charles William Harrison Distinguished Professorship at Purdue University. Address all correspondence to Fengqing Zhu, zhu0@ecn.purdue.edu or see www.tadaproject.org.