
Dietary Assessment on a Mobile Phone Using Image Processing and Pattern Recognition Techniques: Algorithm Design and System Prototyping

by 1,*, 2,†, 3,† and 4,†
1 School of Medicine, University of Wollongong, Wollongong, NSW 2522, Australia
2 School of Computing and Information Technology, University of Wollongong, Wollongong, NSW 2522, Australia
3 Faculty of Information Technology, University of Science, Ho Chi Minh City 70000, Vietnam
4 School of Computing and Information Technology, University of Wollongong, Wollongong, NSW 2522, Australia
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Nutrients 2015, 7(8), 6128–6138
Received: 17 April 2015 / Accepted: 8 July 2015 / Published: 27 July 2015
(This article belongs to the Special Issue Assessment of Nutrient Intakes)


Dietary assessment, while traditionally based on pen-and-paper, is rapidly moving towards automatic approaches. This study describes an Australian automatic food record method and its prototype for dietary assessment using a mobile phone together with image processing and pattern recognition techniques. Common visual features, including the scale-invariant feature transform (SIFT), local binary patterns (LBP), and colour, are used to describe food images. The popular bag-of-words (BoW) model is employed to recognize the images taken by a mobile phone for dietary assessment. Technical details are provided together with a discussion of the issues and future work.
Keywords: mHealth; food record; image processing; pattern recognition; food image

1. Introduction

Assessment of dietary intake is a process vital to dietetic care across different disciplines and specialties of practice. Although it is a fundamental skill taught during early dietetic training, the manual process of conducting a dietary assessment with a participant is inherently prone to various forms of bias, which depend on both the method of assessment being applied [1] and the format of the assessment. Traditionally, assessments were completed using a paper-and-pen format to record usual intake through a diet history interview, repeated 24-h recalls or a food frequency questionnaire, and they are often affected by the memory and cognition of the person recalling their intake. In addition, the time burden of the process and the literacy of the target group for which dietary intake data are required also affect the data being recorded. This is particularly evident in assessing the intakes of children, where the age of the child plays an important role in the method employed. Whether the recall is obtained from the child who consumed the food or provided by the child's carer or parents may affect the accuracy of what is reported as consumed [2]. With the combination of these challenges and technological advances, many dietary assessment methods have been partially or fully automated in an attempt to reduce bias, ease the burden on participants and streamline the steps applied within each method [1].
Automation of dietary assessment appears to have begun during the 1960s with early movements towards computerized processing of intake data [3]. Along with the move to computerized analysis of nutrients, automation has expanded to the intake assessment process itself. This initially began with the use of software on a standalone desktop computer and later advanced to interactive processes through the Internet [1]. Today, web-based dietary recalls are not uncommon in large cohort studies due to their efficiency and ability to streamline processes within one study time point [3]. The automated version of the 24-h recall, for example, minimizes the need for an interviewer by employing on-screen avatars [4] and probing questions to guide participants through the recall. This change has also reduced resource requirements overall, allowing the method to be implemented with a very large number of participants.
In contrast, the most common method of dietary assessment used within randomized controlled trials remains the food record or food diary [5], a record of actual food intake. This method places increased burden on the person completing the assessment and may involve estimating or measuring the food portion size after recording the name of the food item that has been consumed. The method is retrospective in nature and, due to the participant burden, modifications to intake often occur during the recording period, with longer recording periods resulting in greater bias or underestimation of intake [6]. Automation of the food record method aims to address this issue. More recently, the food record method has shifted to portable devices such as tablets and smartphones [1]. This shift should not simply be thought of as the creation of data collection forms on a phone but rather as an essential change to the method within which data collection occurs. During the past decade, research in this area has been expanding rapidly. Although there are many applications (apps) available for download to a smartphone, the credibility of these apps remains questionable. Key groups in the United States have developed credible, evidence-based applications used within the research setting [1]. The Technology Assisted Dietary Assessment (TADA) project, for example, has employed a process of image segmentation to accurately detect food items, with an initial prototype trialed on iPhones/iPods. This group reported challenges in detecting colour and texture [7] and employed a fiducial marker to assist with the accuracy of the detection process. Illumination and the angle of the photos being taken have been further noted as concerns [8]. The Bag-of-Words model, used as the basis for the prototype described in this paper, was also used by the TADA team.
Also using image recognition, the Food Intake Visual and voice Recognizer (FIVR) project similarly used a fiducial marker to assist with the image recognition. In contrast to the TADA project, which initially used text to assist with image tagging, FIVR used voice for this process and, in addition, relied on the user's descriptions (text or voice) to identify the food in the photograph [9]. While the above tools aim to automate the assessment, notable progress has also been made with the Remote Food Photography Method (RFPM), which uses a semi-automated food photograph classification approach. The RFPM is focused on the portion size of the food items and uses bilateral filtering to reduce the noise in the images taken [10]. User training is required to ensure that photographs are captured correctly [11].
This paper describes the development of a baseline prototype of an automated food record using image processing and pattern recognition algorithms. To the knowledge of the authors, the automated food record described in this paper is the first Australian-developed prototype of this nature. The described prototype supports automation of food recording and recognition. The prototype can be extended to determine the portion size of the food, and finally all data can be easily matched to food composition databases. Determination of food portion is crucial for nutrient calculations, and automation of this process can reduce the need for manual calculations from the recognized foods. The focus of this paper is to provide a comprehensive overview of a baseline prototype with the function of image recognition and to address the challenges that are potentially faced when moving an algorithm from a laboratory-based test environment [12] to a user-based application.
The baseline prototype also adopts the Bag-of-Words model for recognition due to its computational efficiency and the promising results obtained recently [11,16]. Many of the challenging issues mentioned above are dealt with through the careful selection of features. However, the work differs from the previous studies [11,16] in the following respects. Firstly, the system does not require any fiducial markers or additional user annotation such as text or voice. Secondly, the system aims to recognize multiple foods in a single image, i.e., more than one food type can be present in an input image. Our system is thus more realistic in practice. Thirdly, we are also simultaneously deploying the food image recognition on mobile phones. This enables us to investigate practical factors of the proposed dietary assessment prototype, including the discriminative power of each feature type, the combination of features for particular food categories and the computational speed of various interest point detectors.

2. Experimental Section: Image Classification Using Bag-of-Words (BoW) Model

The BoW model was originally devised for text classification [13]. In the model, a text document is encoded by a histogram representing the frequency of the occurrence of codewords. The codewords are predefined in a discrete vocabulary referred to as a codebook. The codeword histograms obtained from training samples (small collections of text documents) are used to train a discriminative classifier, e.g., Support Vector Machine (SVM) [14]. The trained classifier is then used to classify test documents (larger collections of text).
To automate the food record, the BoW model was applied to the classification of food images captured using a person's mobile phone. In this task, the features of an image are used as codewords. The images employed in this study are the photographs of foods recorded as part of the food record method. Recording the text-based name of a food risks poor quality detail due to long food selection lists, incorrect spelling or typographical errors, whereas a photograph can provide additional details about the foods. Using photographs also has the potential to minimize the burden on the person completing the dietary assessment. The food image classification method in this study was implemented in the C++ programming language. The training phase and evaluation of the food image recognition were conducted on a desktop computer running Microsoft Windows 7. The prototype was deployed on a smartphone using the Android mobile operating system.
Image classification using the BoW model requires feature selection, codebook creation, and discriminative training. In the following sections, each of these components is described in further detail.

2.1. Feature Selection

For the purpose of image recognition, the features are used to describe the visual properties of the foods in the photographs. In this study, we investigated three common types of features: the Scale-Invariant Feature Transform (SIFT) [15], Local Binary Patterns (LBP) [16], and Colour [17]. SIFT [15] is used to describe the local shape of visual objects. It is constructed as a histogram of the orientations of edges in the food photographs. LBP [16] is used to capture the texture information of the foods. LBP is known for its simplicity of implementation, low computational complexity and robustness under varying illumination conditions. Colour [17] plays an important role in food image classification, e.g., the red colour of a tomato is useful to distinguish it from an apple sharing a similar shape. In our experiments, we quantise each colour channel (Red, Green, and Blue) of an image pixel into four bins (intervals). The colour features of an image can then be constructed as the histogram of the colours of all pixels in that image.
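The colour quantisation described above can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation. It assumes 8-bit RGB pixels and a joint histogram over the three quantised channels (4 × 4 × 4 = 64 bins), normalised to sum to one; the paper does not state whether the channels are binned jointly or separately, so that choice, and the function name `colour_histogram`, are assumptions made for illustration.

```python
def colour_histogram(pixels, bins_per_channel=4):
    """Return a normalised joint RGB histogram.

    `pixels` is a list of (r, g, b) tuples with 8-bit values (0-255).
    Assumption: the three quantised channels are combined into one
    joint histogram with bins_per_channel ** 3 entries.
    """
    bin_width = 256 // bins_per_channel  # 64 intensity levels per bin for 4 bins
    counts = [0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Quantise each channel, clamping to the last bin for safety.
        rq = min(r // bin_width, bins_per_channel - 1)
        gq = min(g // bin_width, bins_per_channel - 1)
        bq = min(b // bin_width, bins_per_channel - 1)
        counts[(rq * bins_per_channel + gq) * bins_per_channel + bq] += 1
    total = len(pixels)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


# Two reddish pixels and one blue pixel.
example = colour_histogram([(255, 0, 0), (250, 10, 5), (0, 0, 255)])
```

In this example the two reddish pixels fall into the same bin, so small variations in shade do not change the descriptor.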

2.2. Codebook Creation

Suppose that there are $N$ food categories $C_1, C_2, \ldots, C_N$ (e.g., carrots, muscle meat, etc.) and $A$ is the set of training images. The SIFT interest point detector of Lowe [15] was invoked to detect interest points on the training images of $A$. SIFT [15] and LBP [16] features were then extracted at the interest points. Let $v_p$ be the $D$-dimensional feature extracted at interest point $p$. These features were then clustered into $K$ groups using a K-means algorithm. During clustering, the dissimilarity (based on distance) between two features was computed using a metric, e.g., the $\chi^2$ distance (as in our implementation). Equation (1) defines the dissimilarity between two features as follows,
$$d(v_p, v_q) = \frac{1}{2} \sum_{i=1}^{D} \frac{[v_p(i) - v_q(i)]^2}{v_p(i) + v_q(i)} \qquad (1)$$
where $v_p(i)$ is the $i$-th element of $v_p$.
Completion of this step resulted in a codebook $G = \{w_1, w_2, \ldots, w_K\}$ in which the codewords $w_i$ were taken to be the centres of the $i$-th clusters.
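As an illustration of the codebook creation step, the following Python sketch clusters feature vectors with a plain K-means loop that uses the χ² distance of Equation (1) for the assignment step. It is a simplified sketch under stated assumptions, not the authors' C++ code: features are assumed to be non-negative (as histogram-based descriptors are), terms with a zero denominator are skipped, centres are initialised by random sampling, and the update step uses the arithmetic mean. The names `chi2_distance` and `build_codebook` are hypothetical.

```python
import random


def chi2_distance(v, w):
    """Chi-square distance of Equation (1) between two non-negative vectors.

    Assumption: terms whose denominator is zero contribute nothing.
    """
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(v, w) if a + b > 0)


def build_codebook(features, k, iterations=20, seed=0):
    """Cluster `features` into `k` groups and return the cluster centres."""
    rng = random.Random(seed)
    centres = [list(f) for f in rng.sample(features, k)]
    for _ in range(iterations):
        # Assignment step: attach every feature to its closest centre.
        clusters = [[] for _ in range(k)]
        for f in features:
            nearest = min(range(k), key=lambda i: chi2_distance(f, centres[i]))
            clusters[nearest].append(f)
        # Update step: move each centre to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centre
                centres[i] = [sum(col) / len(members) for col in zip(*members)]
    return centres


# Four toy 2-dimensional features forming two obvious groups.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
codebook = build_codebook(toy_features, k=2)
```

A production system would normally rely on an optimised clustering library; the loop is spelled out here only to show where the dissimilarity measure enters the procedure.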

2.3. Discriminative Training

To classify $N$ food categories, $N$ binary classifiers $f_1, f_2, \ldots, f_N$ were used. Each classifier $f_i$ classified a given test sample (a food image in this context) into one of two classes: $C_i$ or non-$C_i$. Given a food image and codebook $G$, the SIFT interest point detector was used to detect a set of interest points from the food image. Let $p$ be an interest point and $v_p$ the feature extracted at $p$. The best matching codeword $w(p) \in G$ was determined as outlined in Equation (2),
$$w(p) = \operatorname*{argmin}_{w_i \in G} d(v_p, w_i) \qquad (2)$$
where $d(v_p, w_i)$ is defined in Equation (1).
Figure 1 provides an example of describing a food image by codewords. In this figure, the red points are SIFT interest points, and $w_1$, $w_2$, $w_3$ are the best matching codewords of the features extracted at each of these interest points. The food image was then encoded by a histogram of occurrence of the codewords. Such histograms were collected from all training images of the $i$-th food category and from the training images of other categories to train the classifier $f_i$. In our implementation, SVMs were used as the classifiers $f_i$.
Figure 1. An image of snow peas with interest points shown as red points and the corresponding best matching codewords labelled w1, w2, w3.
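The encoding of an image as a histogram of codewords can be sketched in Python as follows. Each feature is assigned to its best matching codeword using the χ² distance, as in Equation (2), and the occurrences are counted. Normalising the histogram to sum to one is an assumption added here so that images with different numbers of interest points remain comparable; the paper does not state whether this was done. The function names are hypothetical.

```python
def chi2_distance(v, w):
    """Chi-square distance between two non-negative vectors (Equation (1))."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(v, w) if a + b > 0)


def best_matching_codeword(feature, codebook):
    """Index of the codeword closest to `feature` (Equation (2))."""
    return min(range(len(codebook)), key=lambda i: chi2_distance(feature, codebook[i]))


def encode_image(features, codebook):
    """Histogram of codeword occurrences for the features of one image."""
    histogram = [0] * len(codebook)
    for feature in features:
        histogram[best_matching_codeword(feature, codebook)] += 1
    total = sum(histogram)
    # Assumption: normalise so the entries sum to one.
    return [count / total for count in histogram] if total else [0.0] * len(codebook)


# Toy codebook with two codewords and three features from one image.
toy_codebook = [[1.0, 0.0], [0.0, 1.0]]
toy_histogram = encode_image([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]], toy_codebook)
```

Here two of the three features match the first codeword, so the first histogram entry is twice the second.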

2.4. Testing

Given a test image $I$, the histogram of the occurrence of the codewords obtained on $I$ was computed as in the training phase and denoted $h(I)$. Let $f_i(h(I))$ be the classification score of the trained classifier $f_i$ applied to $h(I)$. If the image $I$ contains only one food category, this food category was determined as shown in Equation (3),
$$C_i \ \text{ if } \ f_i(h(I)) = \max_{j \in \{1, \ldots, N\}} f_j(h(I)) \qquad (3)$$
If $I$ contains more than one food category (such as for a meal on a plate), the food category $C_i$ was considered to be present in $I$ if $f_i(h(I)) > \epsilon$, where $\epsilon$ is a user-defined threshold, referred to as the recognition sensitivity.
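The two decision rules can be sketched in Python as follows, assuming the classification scores $f_1(h(I)), \ldots, f_N(h(I))$ have already been computed and are passed in as a list. The function names and the example scores are hypothetical and serve only to show how the maximum rule of Equation (3) and the threshold rule differ.

```python
def classify_single(scores):
    """Index of the category with the highest score (Equation (3))."""
    return max(range(len(scores)), key=lambda i: scores[i])


def classify_multi(scores, epsilon):
    """Indices of all categories whose score exceeds the threshold epsilon."""
    return [i for i, score in enumerate(scores) if score > epsilon]


# Hypothetical classifier scores for three food categories.
example_scores = [-0.4, 1.2, 0.3]
single = classify_single(example_scores)      # only the best category
multi = classify_multi(example_scores, 0.0)   # every category above the threshold
```

Raising the threshold makes the multi-food rule more conservative, while lowering it admits more categories at the risk of false detections.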

3. Results

The SIFT interest point detector took on average approximately 74 s for an image captured by a smartphone's camera. To speed up the interest point detector, input images were downsized by a factor of two if either the width or height of the input image exceeded 2000 pixels. The experiments showed that reducing the image size improved the speed of the interest point detector and also reduced the overall number of interest points. In the experiments, the detection of interest points on the resized images took approximately 20 s. Note that the time also depends on the computing resources available on the smartphone. Different implementations of the SIFT interest point detector outlined by Lowe [15] were also trialed. Table 1 shows the times taken by those interest point detectors.
Table 1. Processing time of interest point detectors.
Interest Point Detector                 Processing Time (s/image)
Original interest point detector [9]    74 *
* Obtained without resizing the input image.
The ezSift interest point detector was found to be the fastest detector, taking less than 15 s. The number of interest points generated by this detector was lower than that generated by the other detectors, likely due to the lower number of scales used in the ezSift detector. However, the number of detected interest points was found empirically to be usually sufficient for recognition. The study also found that the zerofog implementation had been optimized using OpenMP to run on parallel processors.
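The downsizing rule applied before interest point detection (halve the image when its width or height is over 2000 pixels) can be written as a small Python helper. This is a sketch of the rule as described in the text only: the function name `target_size` is hypothetical, integer division is an assumption for odd dimensions, and the actual pixel resampling would be done by an image library.

```python
def target_size(width, height, max_side=2000, factor=2):
    """Return the image size to use for interest point detection.

    The image is downsized by `factor` when either side is over
    `max_side` pixels; otherwise the original size is kept.
    """
    if width > max_side or height > max_side:
        return width // factor, height // factor
    return width, height


# A typical 8-megapixel phone photo is halved; a small image is left alone.
large = target_size(3264, 2448)
small = target_size(1600, 1200)
```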
To describe the appearance of food images, three codebooks corresponding to three different feature types were created for each food category. The codebook size (i.e., the number of codewords) for the SIFT, LBP, and colour feature was 100, 40, and 50, respectively. To combine the three features, for each food image, three histograms of codewords corresponding to the three codebooks were concatenated to form a longer histogram. Linear SVMs [18] were employed as classifiers.
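The combination of the three feature types amounts to concatenating the three codeword histograms. A minimal Python sketch follows, using the codebook sizes reported above (100, 40 and 50), which gives a 190-dimensional descriptor per image. The function name and the size check are illustrative additions.

```python
SIFT_SIZE, LBP_SIZE, COLOUR_SIZE = 100, 40, 50  # codebook sizes reported in the text


def combine_histograms(sift_hist, lbp_hist, colour_hist):
    """Concatenate the three codeword histograms into one descriptor."""
    if (len(sift_hist), len(lbp_hist), len(colour_hist)) != (SIFT_SIZE, LBP_SIZE, COLOUR_SIZE):
        raise ValueError("unexpected histogram length")
    return list(sift_hist) + list(lbp_hist) + list(colour_hist)


# Placeholder histograms filled with zeros, only to show the resulting length.
descriptor = combine_histograms([0.0] * 100, [0.0] * 40, [0.0] * 50)
```

Because the order of concatenation fixes the meaning of each dimension, the same order must be used for training and test images.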
The food image classification method was evaluated on a newly created dataset [19,20,21]. Table 2 summarises the dataset used in the evaluation. Note that in this dataset, one food image may contain more than one food category (see Figure 2). The dataset was organised so that the training sets (used during codebook creation) and test sets were separated.
Table 2. Summary of the dataset used for training and testing of the food image recognition method. Positive images of a food category indicate the number of images contained in each food category.
Food Category    Positive Training Images, n    Test Images, n
Muscle meat734
Figure 2. An example of a food image containing five food categories: cheese, tomato, oranges, beans, and carrots.
Since more than one food category could be contained in a food image, as would be consumed in real life, a food category $C_i$ was considered to be contained in the food image $I$ if $f_i(h(I)) > \epsilon$. Thus, the classification (recognition) performance was investigated by varying the threshold $\epsilon$. Let $M$ be a set of test images and $M_i$ be the subset of $M$ containing the food category $C_i$. For a given $\epsilon$, let $M_i^R(\epsilon) \subseteq M$ be the set of images whose classification score is greater than $\epsilon$, i.e., $M_i^R(\epsilon)$ is the set of images recognised as containing instances of $C_i$. The recognition performance associated with $\epsilon$ for the food category $C_i$ was represented by $r_i(\epsilon)$ and defined in Equation (4) as follows,
$$r_i(\epsilon) = \frac{|M_i^R(\epsilon) \cap M_i|}{|M_i^R(\epsilon) \cup M_i|} \qquad (4)$$
where $\cap$ and $\cup$ are the intersection and union operators, respectively, and $|M|$ is the cardinality of the set $M$.
As shown in Equation (4), the higher $r_i(\epsilon)$ is, the better the recognition performance. In this study, $r_i = \max_{\epsilon} r_i(\epsilon)$ was used as the recognition accuracy of the food image recognition method on the food category $C_i$. Table 3 presents the accuracy of the method with the various feature types. The accuracy was computed for each food category and across all food categories.
Table 3. Recognition accuracy of the food image recognition method with various feature types. SIFT, scale-invariant feature transform; LBP, local binary patterns.
Food Category    SIFT    LBP     Colour    SIFT + LBP + Colour
Muscle meat      0.40    0.32    0.5       0.55
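The recognition performance of Equation (4) is the ratio of the intersection to the union of the recognised and ground-truth image sets, and the reported accuracy is its maximum over the threshold. A minimal Python sketch is given below. The image identifiers, scores and candidate thresholds are made up for illustration and are not data from the study; returning 0 when both sets are empty is also an assumption.

```python
def recognition_performance(scores, ground_truth, epsilon):
    """Equation (4): |recognised & truth| / |recognised | truth| for one category.

    `scores` maps an image identifier to the classifier score for the category;
    `ground_truth` is the set of images that really contain the category.
    """
    recognised = {image for image, score in scores.items() if score > epsilon}
    union = recognised | ground_truth
    if not union:
        return 0.0  # assumption: define the ratio as 0 when both sets are empty
    return len(recognised & ground_truth) / len(union)


def recognition_accuracy(scores, ground_truth, thresholds):
    """Best recognition performance over a list of candidate thresholds."""
    return max(recognition_performance(scores, ground_truth, e) for e in thresholds)


# Hypothetical scores of one classifier on four test images.
toy_scores = {"a": 0.9, "b": 0.4, "c": -0.2, "d": -0.7}
toy_truth = {"a", "b"}
best = recognition_accuracy(toy_scores, toy_truth, [-0.5, 0.0, 0.5])
```

In this toy example a threshold that is too low admits a false detection and one that is too high misses a true image, so the intermediate threshold gives the highest ratio.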

4. Discussion

A prototype of an automated dietary assessment on mobile devices has been described in this paper. Various challenges were identified during this step, and future work will continue to address them. The challenges faced were similar to those reported in existing studies [7] in this field, with colour and multiple food items being of particular focus. As shown in Table 3, on average, LBP outperformed both SIFT and the colour histogram, and SIFT gave the poorest performance overall. The combination of all features (SIFT, LBP and colour) gave a better performance overall, as anticipated given the differing advantages and disadvantages of each feature. Of particular note was the “Cheese” category, where the accuracy was low. This was likely due to the presence of other food types in the same image as the cheese. The accuracy could be improved if the “Cheese” items were captured at a closer distance, i.e., so that outliers and food items other than cheese are not present in the image.
In the current implementation, linear SVMs were used. More sophisticated SVMs, such as radial basis function (RBF) and polynomial kernel SVMs, often achieve better performance. Thus, those kernel types will be implemented, tested, and compared with the linear kernel in future studies. In addition, some advanced machine learning techniques, e.g., deep learning [22] and extreme learning machines [23], will also be considered to improve the recognition accuracy. These techniques are recent developments in the area of pattern recognition.
Since multiple food types can co-occur in the same image, extracting individual food items would help to improve the recognition accuracy. Therefore, future work will aim to detect and recognise food simultaneously. The accuracy of the food image recognition may also be improved if the classifiers are trained on large and diverse datasets that include various illumination conditions, complex backgrounds, and food images captured from various viewpoints. More challenging datasets with more food categories have been collected and will be released in our future work.

5. Conclusions

Applying image processing and pattern recognition techniques on a mobile device has allowed for the development of an automated food record via the use of a smartphone. Continuing the prototype developed in this study to the subsequent stages of the food record, namely portion identification and translation to nutrient data, will complete the process and allow for a practical, user-friendly approach to dietary data collection within the Australian context. It is vital that the mapping of food images to their corresponding food items within a food composition database is performed carefully to allow the most accurate output data to be provided. Development of applications for use in dietetic practice needs to account for the inherent bias of the underpinning method of dietary assessment. Such applications should also consider advancements in technology that could potentially reduce some of these biases. This will provide impetus for more robust dietary assessment processes that are streamlined in their methods and less resource intensive in nature. Considering these two concepts together will mean that nutrition researchers or clinicians in practice can spend additional time with clients working on behaviour changes for better health, rather than on data entry and nutrient analysis as was previously the case. Embracing credible and suitably selected technologies that work within existing nutrition care processes should be considered to the advantage of both clients and clinicians. The work of this study is one of the first of this nature in the Australian context.


Acknowledgments

The authors would like to thank the University of Wollongong Science Medicine and Health Advancement funding for supporting the prototype development work of this project. The authors would also like to acknowledge the contribution to the image dataset provided by Dr Megan Rollo during the earlier phases of this project.

Author Contributions

YP conceived the initial project idea, sourced funding support and contributed the initial test images and the main proportion of training images to the dataset. DT and WL performed the experiments to refine the image recognition algorithm from the original Bag-of-Words model, and MT assisted with transferring the developed algorithm to Android, testing and intensive experimentation on Android. DT and WL contributed to the analysis of the data, and all authors contributed to the writing and editing of the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.


References

  1. Probst, Y.; Nguyen, D.T.; Rollo, M.; Li, W. mHealth diet and nutrition guidance. In mHealth Multidisciplinary Verticals; CRC Press/Taylor & Francis Group: Boca Raton, FL, USA, 2014.
  2. Bell, L.; Golley, R.; Magarey, A. Short tools to assess young children’s dietary intake: A systematic review focusing on application to dietary index research. J. Obes. 2013, 2013.
  3. Probst, Y.C.; Tapsell, L.C. Overview of computerized dietary assessment programs for research and practice in nutrition education. J. Nutr. Educ. Behav. 2005, 37, 20–26.
  4. Subar, A.F.; Kirkpatrick, S.I.; Mittl, B.; Zimmerman, T.P.; Thompson, F.E.; Bingley, C.; Willis, G.; Islam, N.G.; Baranowski, T.; McNutt, S.; et al. The automated self-administered 24-hour dietary recall (ASA-24): A resource for researchers, clinicians, and educators from the National Cancer Institute. J. Acad. Nutr. Diet. 2012, 112, 1134–1137.
  5. Probst, Y.; Zammit, G. Predictors for reporting of dietary assessment methods in food-based randomized controlled trials over a ten year period. Crit. Rev. Food Sci. Nutr. 2015, in press.
  6. Black, A.E.; Paul, A.A.; Hall, C. Footnotes to food tables. 2. The underestimations of intakes of lesser B vitamins by pregnant and lactating women as calculated using the fourth edition of McCance and Widdowson’s “The Composition of Foods”. Hum. Nutr. Appl. Nutr. 1985, 39, 19–22.
  7. Zhu, F.; Bosch, M.; Boushey, C.J.; Delp, E.J. An image analysis system for dietary assessment and evaluation. Proc. Intl. Conf. Image Proc. 2010, 1853–1856.
  8. Bosch, M.; Zhu, F.; Khanna, N.; Boushey, C.J.; Delp, E.J. Combining global and local features for food identification in dietary assessment. IEEE Trans. Image Process. 2011, 1789–1792.
  9. Weiss, R.; Stumbo, P.J.; Divakaran, A. Automatic food documentation and volume computation using digital imaging and electronic transmission. J. Am. Diet. Assoc. 2010, 110, 42.
  10. Zhang, M.; Gunturk, B.K. Multiresolution bilateral filtering for image denoising. IEEE Trans. Image Process. 2008, 17, 2324–2333.
  11. Martin, C.K.; Han, H.; Coulon, S.M.; Allen, H.R.; Champagne, C.M.; Anton, S.D. A novel method to remotely measure food intake of free-living people in real-time: The remote food photography method (RFPM). Br. J. Nutr. 2009, 101, 446–456.
  12. Nguyen, D.T.; Zong, Z.; Ogunbona, P.O.; Probst, Y.; Li, W. Food image classification using local appearance and global structural information. Neurocomputing 2014, 140, 242–251.
  13. Nigam, K.; Lafferty, J.; McCallum, A. Using maximum entropy for text classification. In Proceedings of the International Joint Conference on Artificial Intelligence Workshop on Machine Learning for Information Filtering, Stockholm, Sweden, 31 July–6 August 1999; pp. 61–67.
  14. Burges, C.J.C. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 1998, 2, 121–167.
  15. Lowe, D. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
  16. Zong, Z.; Nguyen, D.; Ogunbona, P.; Li, W. On the combination of local texture and global structure of food classification. In Proceedings of the IEEE International Symposium on Multimedia, Taichung, Taiwan, 13–15 December 2010; pp. 204–211.
  17. Chen, M.; Dhingra, K.; Wu, W.; Yang, L.; Sukthankar, R.; Yang, J. PFID: Pittsburgh fast-food image dataset. In Proceedings of the 16th IEEE International Conference on Image Processing; IEEE: Piscataway, NJ, USA, 2009; pp. 289–292.
  18. Chang, C.; Lin, C. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 1–27.
  19. Probst, Y.; Jones, H.; Sampson, G.; Smith, K. Development of Australian portion size photographs to enhance self-administered online dietary assessments for adults. Nutr. Diet. 2010, 67, 275–280.
  20. Rollo, M.; Ash, S.; Lyons-Wall, P.; Russell, A. Trial of a mobile phone method for recording dietary intake in adults with type 2 diabetes: Evaluation and implications for future applications. J. Telemed. Telecare 2011, 17, 318–323.
  21. Walton, K.; McMahon, A.; Brewer, C.; Baker, J.; Fish, J.; Manning, F.; Grafenauer, S.; Kennedy, M.; Probst, Y.C. Novel digital food photos resource enhances knowledge of nutrition and dietetics students. In Proceedings of the Higher Education Research and Development Society of Australasia Conference, Hobart, TAS, Australia, 2–5 July 2012.
  22. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification; Microsoft Research: Redmond, WA, USA, 2015.
  23. Zhang, S.; He, B.; Nian, R.; Wang, J.; Han, B.; Lendasse, A.; Yuan, G. Fast image recognition based on independent component analysis and extreme learning machine. Cogn. Comput. 2014, 6, 405–422.