Vision-Based Methods for Food and Fluid Intake Monitoring: A Literature Review

Food and fluid intake monitoring is essential for reducing the risk of dehydration, malnutrition, and obesity. Existing research has focused predominantly on dietary monitoring, whereas fluid intake monitoring is often neglected. Food and fluid intake monitoring can be based on wearable sensors, environmental sensors, smart containers, and the collaborative use of multiple sensors. Vision-based intake monitoring methods have been widely explored alongside the development of visual devices and computer vision algorithms. Vision-based methods provide non-intrusive solutions for monitoring and have shown promising performance in food/beverage recognition and segmentation, human intake action detection and classification, and food volume/fluid amount estimation. However, occlusion, privacy, computational efficiency, and practicality pose significant challenges. This paper reviews the existing work (253 articles) on vision-based intake (food and fluid) monitoring methods to assess the size and scope of the available literature and to identify current challenges and research gaps. Tables and graphs are used to depict the patterns of device selection, viewing angle, tasks, algorithms, experimental settings, and performance of the existing monitoring systems.


Introduction
Maintaining healthy food intake and adequate hydration is essential for physiological and physical health [1][2][3].
The quality of food intake has been shown to be associated with the metabolic function of the human body [4]. Unbalanced nutrition intake increases the risk of many diseases, including diabetes, obesity, cardiovascular disease, and certain cancers [1,5]. To understand the human body dynamics associated with being underweight, overweight, and obese, it is important to objectively assess energy intake (EI); energy intake assessment involves food type recognition, estimation of the amount consumed, and portion size estimation [6]. Being underweight can result from energy expenditure exceeding energy intake over an extended period, which leads to health risks such as malnutrition and premature death [7]. Overweight and obesity are associated with energy intake exceeding energy expenditure, leading to chronic diseases such as type 2 diabetes, cardiovascular diseases, cancers, and musculoskeletal disorders [6][7][8]. A dietary assessment system could be used to monitor daily food intake and control eating habits by triggering a just-in-time intervention during energy intake to prevent health issues [8].
Low-intake dehydration, caused by inadequate fluid intake, endangers public health and is often underemphasised [9,10]. Mild dehydration occurs commonly and increases the risk of chronic diseases [11,12]. A notable example is the significant association between urolithiasis (kidney stones) and low daily water intake [4,5]. Furthermore, dehydration is closely associated with disability, hospitalisation, and mortality [13].

The literature search was conducted in two stages. Stage 1: 'fluid/drink/water/liquid/food/nutrition/energy/dietary + vision/camera + monitoring/detection/recognition' was searched across all metadata, with 'intake' required in the full text and metadata. Stage 2: to extend the scope of the search, 'vision/camera/image + human + action/gesture/activity/motion + recognition/detection/monitoring' was searched across abstract and title. In addition, 'drink/water/liquid/food/nutrition/energy/dietary' had to appear in the text. This allowed a wider and more exhaustive search to find potential papers involving intake monitoring in another research field (human action recognition, HAR).

Screening
The following eligibility criteria were applied: (1) at least one kind of vision-based technology (e.g., an RGB-D camera or a wearable camera) was used in the paper; (2) eating activities, drinking activities, or both were identified in the paper; (3) the paper used data from human participants; (4) at least one evaluation criterion (e.g., F1-score) was used to assess the performance of the design.
When the number of records in the search results exceeded 300, the first 100 were taken by rank of relevance in each database, except for Google Scholar, from which only the first 50 records were taken. This accounts for the enormous volume of records on Google Scholar and the fact that most of the relevant literature was already covered by the other selected academic databases.
The retrieved records were first imported into Zotero, and duplicate items were removed. Then, all papers' titles and abstracts were reviewed to remove articles not on human subjects and those not mentioning visual methods in their titles/abstracts. A full-text review was applied in the next step of the eligibility assessment. Papers not mentioning intake activities and not evaluated with reasonable criteria were eliminated. Research on human action recognition and daily activity monitoring that addressed fluid intake activity was included. The screening process is summarised in the flow diagram of Figure 1, where 253 full texts from 2010 to 2022 were reviewed and included. There were 24 review papers, 34 papers proposing datasets, and 195 papers that provided methods, including algorithms, systems, or other solutions, for different intake monitoring tasks or problems.
Figure 1. Diagram of the paper searching and screening process. ACM is short for Association for Computing Machinery; IEEE is short for Institute of Electrical and Electronics Engineers; SCOPUS is a source-neutral abstract and citation database; PubMed is a free interface for searching MEDLINE, the National Library of Medicine's premier bibliographic database and the most popular bibliographic database in the health and medical sciences.

Active and Passive Methods
In vision-based methods, there are two approaches to capturing images: active and passive [23,24]. Active methods require the user to take pictures and record their intake manually, while passive methods automatically access the food or fluid intake information. Active methods are widely used in practice. Traditionally, active food intake monitoring was in the form of food records, recalls, or questionnaires [25]. For active fluid intake monitoring, a fluid balance chart is used as a self-reporting tool to identify a positive (fluid input higher than output) or negative (fluid output higher than input) balance in hospitals or nursing homes [18,26]. A fluid balance chart includes information on the time, approach, and amount of body fluid input (oral, intravenous, etc.) and output (urine, tube, etc.), which can be completed by trained nurses, doctors, or patients themselves [26].
With the development of cameras, images of meals and drinks are more commonly used for dietary monitoring. In vision-based monitoring, active methods are not as widely seen as passive methods and mostly rely on mobile phone cameras. For example, [27] proposed a food and nutrition measurement approach that analyses the images taken by users before and after a meal, which provided up to 97.2% correct classification of food type and only 1% misreported nutrient information. A similar nutrition logger called DietCam was proposed in [28], based on self-taken videos or images before and after a meal, for food type classification and intake amount estimation. Another food logger based on images actively captured with a mobile phone was developed together with an inertial smartwatch and a load cell [29], which also required manual food weighing. To reduce the time and effort of human labour and improve the validity of portion size estimation, Jia et al. developed 'eButton', a semi-automated system which combined manual annotation with software and led to less bias and variability compared with fully manual annotation [30,31]. Another approach provided a user assistance system with a 360-degree RGB camera, combining active and passive methods to improve the quality of dietary and nutrition assessment. This system improved on active food monitoring with fewer under-reporting cases and less perceived effort in keeping the food diary [24].
Nevertheless, manually recording intake information by writing, weighing, or triggering a camera can be time-consuming and burdensome [24,32], and hence not ideal for daily application. Moreover, self-reporting is not an option for patients with functional difficulties or older adults with cognitive decline. Therefore, in recent years, passive sensing methods with different devices and automatic strategies have predominated over traditional active methods.

Environmental Settings
The environmental settings were categorised into free-living, pseudo-free-living, laboratory, and others. In free-living studies, the systems were assessed with sufficient data collected by sensors configured in the user's natural living environment. The pseudo-free-living environment replicates the user's natural living environment in a laboratory. The controlled laboratory environment covers only the specific actions needed as input data for the system (e.g., biting or drinking). In contrast, the 'others' category includes methods based on existing datasets without considering the experimental environment. A camera's viewing angle can be either first-person or third-person (see Section 4 for details). Notably, most of the third-person methods were only considered in a controlled testing environment and were not tested in a free-living scenario. As for first-person cameras in free-living conditions, the highest accuracy achieved on food/non-food classification was 95% [33,34], and on eating action detection it was 89.68% [35]. One recent free-living study reached an F1-score of 89% on drinking and eating episode detection but only 56.7% precision on fluid intake level estimation [36]. Therefore, there is still a gap in harnessing cameras in free-living scenarios. Factors identified as affecting performance include unstable lighting conditions, occlusion, low frame rate, and motion blur.

Privacy Issue
Most papers investigating vision-based monitoring failed to discuss privacy issues, even though some of the concerns were evident, with the participant's face and body shown in the figures of the paper [37][38][39][40][41][42]. In active methods, cameras can be controlled manually to avoid taking images with privacy concerns, which is inconvenient and labour-demanding [43,44]. Another approach seen in both active and passive methods was reviewing the photos after they were taken and deleting the ones with privacy concerns, which could also be time-consuming and burdensome [35,45]. Hence, passive methods with approaches to eliminate privacy issues have received the most attention. Passive methods are more likely to face privacy concerns because most images captured passively are not related to food or drink consumption only [23]. Therefore, in some designs, the intake action was detected by a smartwatch or glasses, and the camera was only turned on when an eating or drinking episode was highly probable [23,36,46]. In the survey of privacy concerns among users of AIM-2, the average level of concern was reduced from 5.0 to 1.9 when images were captured only during intake action rather than continuously [23]. However, this method requires the users to wear multiple sensors, which can be cosmetically unpleasant, uncomfortable, and intrusive, especially for the elderly or groups with disabilities [44]. Moreover, the collaboration of different sensors increases the system's complexity and makes it more challenging to maintain. An alternative solution was detecting intake-related images from the cameras without the involvement of humans or extra sensors [44,47]. For example, a pre-trained MobileNet was used for food/non-food classification and helped the system save only food-related images [47]. However, this study's precision was below 90%, so classification algorithms with better performance were needed. Additionally, the scenario in which both the human face and food were present in the same image was not mentioned and remains to be considered. In other research, human face identification algorithms were applied to the obtained images to blur or remove the subjects' faces [48][49][50]. Similarly, Android's FaceDetector class was used in [33] to eliminate images with visible human faces. Some studies relied only on depth information from RGB-D cameras to reduce the concern of privacy [51-54], which could suffer from a high false positive rate [52], low accuracy (<90%) [54], or up to 148.8 mm error in mean distance [53]. Hence, algorithms with better performance in tracking body movement and recognising intake activities based only on depth information were needed. An alternative hardware solution to preserve privacy was proposed in [55], which creatively mounted the camera on a cap, facing down, to avoid capturing the surroundings.
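As an illustration of the face-removal strategy cited above, the following minimal sketch (assuming OpenCV and its bundled Haar cascade, not the specific pipeline of any cited study) detects faces and blurs them before an image is stored.

```python
# Sketch: anonymise frames by blurring detected faces before storage.
# Assumes OpenCV is installed; the cascade file ships with opencv-python.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def anonymise(frame):
    """Return a copy of the frame with detected faces Gaussian-blurred."""
    out = frame.copy()
    grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(grey, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        roi = out[y:y + h, x:x + w]
        out[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (51, 51), 0)
    return out
```

Such a step could be run on-device before any image leaves the camera platform, trading a small amount of computation for a reduced privacy risk.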

Viewing Angles and Devices in Monitoring Systems
In passive vision-based intake monitoring, a camera's viewing angle can be either first-person or third-person. A first-person camera (egocentric or wearable camera) is typically attached to the human body, pointing out at the food or container. In contrast, a third-person camera is mounted in the living environment, pointing at the subject. Among the included papers that proposed a monitoring system or nutrition log application, 49 studies were based on first-person cameras, 39 on external third-person cameras, and 28 took advantage of the users' smartphones. Notably, most of the phone-based methods were active, meaning the users needed to take the food/drink picture manually, e.g., the food/nutrition/dietary logs proposed in [28,56,57]. These phone-based applications are not considered in the rest of this section, as we focus on passive methods and automatic systems.
Four tasks were identified in intake monitoring methods: binary classification, to distinguish food/drink intake from other activities; food/drink type classification, to detect the type of items consumed; food/fluid amount estimation, which is crucially related to energy intake; and intake action recognition, to recognise human body movement. In the binary classification task, elements such as fingers, hands, containers, cutlery, and food can be detected, and different criteria can be set and followed as an indication of an intake activity.
Regarding the placement of devices, first-person cameras could take the form of glasses, a watch, or a pendant, while third-person cameras could be mounted on the ceiling for a top-down view or placed around the subjects. The selected devices vary across studies, and the cameras seen in the existing papers were mainly RGB and RGB-D cameras. RGB cameras were also used in combination with other non-vision sensors. The pattern of viewing angles and devices found in the papers is shown in Figure 2.
From the pattern of device selection shown in Figure 2, it is evident that RGB cameras were the most used, primarily as first-person cameras. In contrast, depth cameras were not used as first-person cameras and were barely used collaboratively with non-vision sensors. Moreover, no system covered all three sensor types: an RGB camera, a depth camera, and any of the non-vision sensors.

First-Person Approaches
As shown in Figure 2, of the 49 first-person methods, 36 relied on RGB information alone. The remaining 13 used RGB cameras collaboratively with other non-vision sensors, including accelerometers [23,50,58,59], gyroscopes, flex sensors [23,50], load cells [29], proximity sensors, and IMUs [36]. The most common technology setting is an inertial smartwatch and a wearable RGB camera [33,34,46,55,60]. For example, Annapurna [33,34,60] is a smartwatch with a built-in camera proposed for autonomous food recording. The inertial sensor of the watch was used for gesture recognition to identify the eating activity, and the camera then took only pictures that were likely to be useful. Thus, compared to methods with a camera constantly in operation, redundant images were reduced, along with storage requirements, privacy concerns, computation, and camera power consumption. However, one fundamental problem with an inertial smartwatch is that the intake action could be missed when the user is drinking with a straw or using the hand that is not wearing the watch. Unlike the approaches mentioned above, which mainly focus on food intake detection, an intake monitoring system for fluid combining glasses, a smartwatch, and a phone was proposed in [46]. The system achieved 85.6% accuracy on drinking action detection and 84% on liquid type classification.
Smart glasses are another form of wearable device. Automatic Ingestion Monitor Version 2 (AIM-2) [23] was proposed with an accelerometer, a flex sensor, and an RGB camera. However, in this design, images captured by the camera were only used to validate the performance of the other wearable sensors on intake detection; no visual methods were considered. FitByte [36] was a glasses-based diet monitoring system that applied six IMUs for chewing and swallowing detection, a proximity sensor for hand-to-mouth gesture detection, and an RGB camera pointing downward to capture potential food images. Both eating and drinking episodes were detected in this design. However, only 56.7% precision was achieved in fluid intake detection, compared with 92.8% for food intake. FitByte thus performed considerably worse in fluid intake detection because it was only sensitive to simple, continuous drinking scenarios, not to short sips or drinking that overlaps with other, unrelated activities. A motion-adaptive algorithm was proposed for removing blurred images, which reduced the power consumption by 12% and increased battery life for a glasses-based system with an onboard camera and accelerometer [58].
Assessments have been made of the efficiency of first-person cameras for dietary monitoring. Thomaz et al. (2013) proposed and evaluated a dietary monitoring system based on a neck-worn camera and human computation. Images were taken by the camera every thirty seconds and sent to Amazon Mechanical Turk (AMT) (a platform providing human intelligence labour) to identify food by human labour. This design resulted in 89.68% accuracy in identifying eating activities [35]. In 2015, the wearable camera SenseCam was evaluated for its potential in dietary assessment. SenseCam made it possible to determine the subjects' external environment, physical position, and interactive social condition. Regarding accuracy, only 71% of the eating episodes could be identified from the images. Hence, wearable cameras were deemed not reliable enough for stand-alone use in dietary monitoring but acceptable as a complementary tool for enhancing traditional self-report [45,61].
The feasibility evaluations mentioned above revealed the limitations of utilising first-person cameras for passive dietary monitoring. The first was occlusion of the view. For example, if the image did not provide a complete observation of the food, the estimation accuracy of portion size could be low [30]. The uncertainty of wear time, battery sustainability, and noncompliance in wearing the camera were other problems, especially for older adults or patients with cognitive decline. As for image acquisition, dark and blurry images obtained in poor lighting conditions could make classification difficult. In addition, some eating episodes could be missed if they occurred between shots, so a higher image-capturing frequency was needed. A higher frequency, however, raises the problems of a larger dataset, heavier computation, and substantial manual annotation. Nevertheless, in recent research, wearable cameras were used to assess food and beverage consumption during transportation [43], providing evidence that, with the development of sensor applications and computer vision algorithms, first-person cameras can be used for dietary assessment in a free-living environment.
In summary, the common system architecture of first-person methods combined one first-person RGB camera with other sensors. Cameras can take the form of smartwatches [33,34], glasses [23,36], or even caps [55]. Combining cameras with other sensors can reduce the energy consumption of cameras, extend battery life, save storage space, and reduce privacy concerns by turning the camera on only when a candidate movement is detected [33,34,60]. One fundamental limitation of inertial smartwatches is that the intake action could be missed when the user drinks with a straw or uses the contralateral hand not wearing the watch. The inconvenience of wearable devices is another limitation. Another fact worth noting is that in all the methods mentioned above based on RGB cameras with non-vision sensors, the intake detection task was conducted by inertial or proximity sensors rather than by the camera itself. In other words, when used with non-vision sensors, visual information was only used for food or fluid type classification and volume estimation instead of intake action detection.
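The sensor-triggered capture strategy described above can be illustrated with a minimal sketch: a window of wrist accelerometer magnitudes is checked for a movement burst followed by a steady hold before the camera is woken. The thresholds and window length are illustrative assumptions, not values from the cited systems.

```python
import numpy as np

def should_capture(accel_magnitudes_g, lift_thresh=1.8, still_thresh=0.05):
    """Gate camera capture on a candidate hand-to-mouth movement.

    accel_magnitudes_g: recent window (~1-2 s) of wrist accelerometer
    magnitudes in g. An arm lift shows up as a burst of acceleration,
    and a hand held near the mouth shows up as a near-static tail.
    """
    mags = np.asarray(accel_magnitudes_g, dtype=float)
    lifted = mags.max() > lift_thresh             # burst of movement (arm lift)
    held_still = mags[-10:].std() < still_thresh  # steady hold at the end of the window
    return bool(lifted and held_still)

# Example: only wake the camera when a candidate gesture is detected.
window = [1.0, 1.1, 2.2, 1.6, 1.2] + [1.0] * 10
if should_capture(window):
    print("trigger camera capture")
```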

Third-Person Approaches
Compared to first-person cameras, third-person cameras have the advantage of being non-intrusive to the user [62]. The placement of cameras is one of the primary issues to consider. Most research used only one position for a single camera, placed on the ceiling for a top-down view [53,63,64] or pointing at the subject at a fixed distance of 0.6 m to 2 m [24,37,38,48,52]. Multiple cameras could be placed around the subject for different viewing angles to compensate for possible occlusion and achieve a more robust system [39,49,65]. However, there is no systematic comparison of the performance of single or multiple cameras in various positions, so specific experiments are needed to choose suitable distances and pointing angles for dietary monitoring systems based on third-person methods.
In third-person methods, RGB and depth information can be used individually or collaboratively for action detection. Specifically, 17 papers used RGB information only, nine used depth information from an RGB-D camera, and seven were based on the fusion of RGB and depth information, as seen in Figure 2. Unlike with first-person cameras, non-vision sensors are used less frequently with third-person cameras. The main reason was that the kinematic or distance information provided by IMUs and proximity sensors could also be obtained from the visual information of the third-person camera [37,52].
Microsoft Kinect was dominantly adopted in existing research; it can work day and night, with the infrared sensor generating the depth images and the skeleton tracking toolkit providing the joint coordinates [52,66]. The effectiveness of MS Kinect was tested for detecting the eating behaviour of older adults by placing the camera in front of the subject, resulting in an average success rate of 89% [37]. However, no occlusion problem was addressed, and only the experimental environment was considered in this research.
To reduce privacy and image data concerns, some studies used only depth information from RGB-D cameras. For example, Kinect skeletal tracking was used for counting bites by tracking the jaw face point [67] and the wrist roll joint of users based on depth information, achieving an overall accuracy of around 94% [52]. A system with an average accuracy of 96.2% was proposed, relying on the depth information of wrist joint and elbow joint motion obtained by a Kinect camera. However, although this study was presented for free-living calorie intake monitoring, only one camera position was tested, and no occlusion problem was considered [51]. The fusion of depth and RGB information was another option, with the depth information used for skeleton definition and body movement tracking and the RGB data used for detecting specific intake-related objects [63].
RGB cameras were also popular third-person devices in intake monitoring. They can be embedded in the ceiling, pointing down [64], or placed on the dining table, pointing at the subject [38]. The fusion of RGB and depth information has the potential to reach higher accuracy than a single modality. An example can be seen in [62], where an adapted version of the self-organising map algorithm was applied to the skeleton model obtained from depth information for movement tracking, while the RGB stream was used for recognising eating-related items such as glasses. This method achieved 98.3% overall accuracy. All RGB-D cameras were used as third-person cameras (as seen in Figure 2). Table 2 details the methods and accuracy of the seven papers utilising both RGB and depth information, which indicates that the collaborative use of RGB and depth information has become popular in recent years and has the potential to provide promising performance in intake monitoring tasks.

Algorithms by Task
As was mentioned in the introduction, four tasks were observed in the papers. In these four tasks, binary classification ('food/drink' versus 'other'), food/drink type classification and food/fluid amount estimation mainly focused on retrieving information from the image of food/drink, while intake action recognition was aimed at human body movement.
The viewing angles and tasks are illustrated in Figure 3, indicating that third-person cameras were mostly used for intake action detection but were not commonly used for food/drink classification tasks or amount estimation. In contrast, first-person cameras were commonly used for food/drink detection or amount estimation rather than action recognition. The algorithms used in each task are described in the following subsections.
The proportion of papers on these four tasks is shown in Figure 4. The pie chart shows that research on food has substantially outnumbered that on fluid. Moreover, food and fluid type classification was the most studied intake monitoring task, followed by amount estimation. However, the efforts on drink/non-drink classification and fluid amount estimation were significantly limited.

Binary Classification
Eliminating unrelated images was a preliminary step for identifying candidate intake activities. This was commonly proposed as a binary classification approach to distinguish food/drink from other objects or to identify low-quality images and delete them. For example, to distinguish sharp images from blurry images and ensure adequate image quality, the Fast Fourier Transform (FFT) of each image was computed to analyse its sharpness, resulting in 10-15% misclassification [50].
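As a rough illustration of such an FFT-based sharpness check (a minimal sketch, not the implementation in [50]; the radius and threshold are illustrative and would need tuning on real data):

```python
import numpy as np

def is_sharp(grey_image, radius=30, threshold=10.0):
    """Classify an image as sharp or blurry from its high-frequency energy.

    grey_image: 2-D greyscale array. Low frequencies around the spectrum
    centre are suppressed; the mean log-magnitude of what remains is a
    simple proxy for sharpness.
    """
    f = np.fft.fftshift(np.fft.fft2(grey_image))
    h, w = grey_image.shape
    cy, cx = h // 2, w // 2
    f[cy - radius:cy + radius, cx - radius:cx + radius] = 0  # drop low frequencies
    magnitude = 20 * np.log(np.abs(f) + 1e-8)
    return magnitude.mean() > threshold
```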
Im2Calories was a food intake monitoring system proposed in 2015, in which a GoogLeNet CNN was trained with a modified Food101 dataset. One of the tasks for Im2Calories was to determine whether an image was related to a meal, achieving an accuracy of 99.02% [73]. Similarly, a GoogLeNet model was trained for food/non-food classification by Singla et al. [74] and achieved an accuracy of 99.2%, in which Food-5K, created from Food101, was used as training data. The two works mentioned above were based on the same pretrained model and similar food datasets, and both achieved promising performance on the binary classification task. Another example was iLog, a stress-eating monitoring system based on a seven-layer CNN model and camera-mounted glasses, which achieved around 97% accuracy in food detection [75].
The GoogLeNet in Im2Calories [73] was tuned on a Titan X GPU with 12 GB of memory and then implemented in an Android app of less than 40 MB, which could classify an image within one second. iLog could also operate on edge-level, low-performance computing platforms, such as mobile phones, sensors, and single-board computers [75]. Apart from the networks mentioned above, for real-time and portable monitoring, a derived MobileNet was proposed and implemented on a Cortex-M7 microcontroller for dietary image capturing, achieving an average precision of 82% in identifying food-related images [47]. The training was conducted on Google Colab using 400 food images and 400 non-food images, taking up to 5.5 h, while only 761.99 KB of flash and 501.76 KB of RAM were needed to deploy the algorithm. Hence, networks such as GoogLeNet and MobileNet could provide portable, edge-level, and real-time solutions for binary classification tasks. Still, the training process could be demanding on computation power, requiring a high-performance device or server.
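The general recipe behind these lightweight food/non-food classifiers can be sketched as transfer learning on a pretrained backbone followed by conversion for on-device inference. The sketch below assumes TensorFlow/Keras, a hypothetical data/{food,non_food} image folder, and MobileNetV2; it is illustrative rather than a reproduction of [47] or [73].

```python
import tensorflow as tf

# Binary food/non-food dataset from a folder with one subdirectory per class.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data", image_size=(224, 224), batch_size=32)

# Frozen ImageNet-pretrained backbone; only the small head is trained.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),      # food vs. non-food
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=5)

# Convert for on-device inference (e.g., a phone or microcontroller toolchain).
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()
```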
Annapurna was a multimodal system with a camera mounted on an inertial smartwatch for dietary recording [33,34]. In this design, the camera was switched on only when the watch detected an intake action. A mobile phone was first used as a lightweight computing platform to eliminate images with human faces and blurred edges. Then, the 37% of remaining images containing food items were transferred to a server for further processing, where the Clarifai API was used to identify the presence of food items in pictures based on a CNN, and a depth map was created to detect food too far from the camera (considered unrelated to the meal). As a result, 95% of meals could be recalled by the proposed system in a free-living environment. For Annapurna, the computation was performed first on mobile phones to remove blank, blurry, and misleading images and reduce the runtime of further computing. However, the latency of the smartwatch in capturing an image was around 0.9 s, which limited the response speed of the whole system [34].
The Clarifai API used as the server-side service in Annapurna [33,34] was also used in [44], where it generated tag outputs (e.g., 'food', 'car', 'dish') for an input image to determine whether the image was food-related. This method was tested on both Food-5K and eButton, reaching a specificity of 87% on Food-5K (created in [74]), higher than the results on eButton. This was because eButton was an egocentric free-living dataset with 17.7% blurred images, complex backgrounds, and more diverse objects. According to the authors, although the burden of manually observing and recording dietary activities in previous work [76] was reduced, the effectiveness of automatic monitoring was still limited by the quality of the captured images.
Only a limited number of papers addressed the binary classification of fluid/drink/beverage. An example covering both food and fluid was [77], which trained a YOLOv5 network to detect and localise food and beverage items among other objects. The study aimed to distinguish food and beverages from other objects and added 'screen' and 'person' as extra classes. As a result, an overall mean average precision of 80.6% was achieved for classifying these four object classes, which was still far from practical use. NutriNet was another deep neural network proposed for both food and beverages; the detection model's output was either 'food/drink' or 'other' [78]. NutriNet was trained using an NVIDIA GeForce GTX TITAN X in a local computer and fine-tuned on an NVIDIA Tesla K80 in a server environment. It was compared to AlexNet, GoogLeNet, and ResNet with three different solver types (SGD, NAG, and AdaGrad), in which NutriNet with the NAG solver achieved the best detection accuracy of 94.47%. Among the compared networks, the training time could take up to 135 h, with the ResNet models being the most time-consuming due to their deeper architecture.
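For illustration, the detection-based approach of [77] can be approximated with an off-the-shelf, COCO-pretrained YOLOv5 model loaded from torch.hub, keeping only detections whose class names loosely indicate intake-related objects. The class list, confidence threshold, and image path are assumptions for this sketch, not the custom-trained model of the cited study.

```python
import torch

# COCO-pretrained YOLOv5s via torch.hub (requires internet on first load).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

# COCO class names loosely related to food and beverage intake.
INTAKE_CLASSES = {"bottle", "cup", "bowl", "wine glass", "banana", "apple",
                  "sandwich", "pizza", "donut", "cake"}

def intake_related(image_path, conf=0.5):
    """Return whether the image contains intake-related objects, plus detections."""
    results = model(image_path)
    det = results.pandas().xyxy[0]
    det = det[(det["confidence"] >= conf) & (det["name"].isin(INTAKE_CLASSES))]
    return not det.empty, det[["name", "confidence"]]

related, detections = intake_related("frame_0001.jpg")  # illustrative path
print(related, detections, sep="\n")
```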
It is noted from the above review that most of the binary classification tasks were based on first-person images. Obtaining a clear, intake-related image is the preliminary step in vision-based intake monitoring technologies. Deep learning algorithms were used to detect food/drink-related images and eliminate irrelevant ones, achieving promising results (up to 99.2%) on food-related datasets such as Food101 and Food-5K. However, as the eButton dataset results show, developing robust algorithms for free-living data is still a challenge because body movements can easily cause blur and occlusion in the captured image. The existing methods have focused on food/non-food classification, while only a few address both food and beverages.

Food/Drink Type Classification
After images of interest were acquired, more advanced classification was needed to identify the type of food/drink being consumed. Table 3 provides an organised overview of the algorithms used in different papers. Of the investigated papers, 29 adopted machine learning (ML) methods, 51 used deep learning (DL) methods, and 11 used other methods.
As with the binary classification task in Section 5.1, most of the food/drink type classification approaches were based on first-person cameras, and most of them considered food type identification only, without beverages. An example of drink type classification was [40], in which drink region segmentation and a bag of features (BoF) were proposed. Both speeded-up robust features (SURF) and colour-based features were used for recognising the types of drinks, and an accuracy greater than 89% was achieved [40]. HydraDoctor was another example for fluid intake [46], where a trained Faster R-CNN was used for container identification and classification from the captured videos. In this work, six types of drinks, including juice, coffee, cola, water, milk, and beer, were classified, of which coffee and milk achieved the best accuracy; the overall accuracy was 84.3%. The challenge was that poor image quality, caused by the position and viewing angle of the container in the image, made recognition difficult. As a solution, HydraDoctor captured a set of images (a short video) to determine the drinking period, and the validated images were taken after the drinking action was completed [46]. However, although this study provided a real-time monitoring system, the runtime of the algorithm was not reported in the paper.
An early study that harnessed an SVM with a Gaussian radial basis function kernel to train a food type classifier achieved an accuracy of 97.3% when the training data made up 50% of the dataset [27]. Remarkably, only 1% of nutrient information was misreported in this study. However, there was only one food item in each image, so the robustness of the proposed algorithm could be limited when tested on images with multiple food items or complex backgrounds [27].
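A minimal sketch of this kind of classical pipeline is given below: per-image feature vectors (e.g., colour histograms) are fed to an RBF-kernel SVM via scikit-learn. The random features stand in for real data, and the hyperparameters are illustrative rather than those of [27].

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder features and labels: 300 images, 64-D feature vectors, 5 food types.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 64))
y = rng.integers(0, 5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Standardise features, then fit an RBF-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```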
As mentioned in Section 3.1, DietCam was a food logger that relied on self-taken images and achieved 92% accuracy in food classification [28]. The food was first recognised by matching it against a food database, and three images of one item were required to reduce the risk of occlusion. OCR techniques and user input were optional for food not differentiable by appearance. However, the matching algorithms used were not well suited to classification (for example, 'cheeseburgers' and 'double cheeseburgers' are the same type of food with different appearances), so a Bayes decision theory-based probabilistic algorithm was proposed for food classification after matching [28]. Another observation was that accuracy was positively related to the number of references in the database; hence, an extensive database covering a large number of patterns was needed to achieve high accuracy [28].
For automatic and larger-scale image analysis, computer vision algorithms were used in later research. CNNs trained on labelled image data provided another method for food classification. Im2Calories, mentioned in the last section, is an example of a GoogLeNet CNN trained with different datasets created from existing datasets online. Im2Calories [73] trained the GoogLeNet with a self-made multi-label dataset and achieved an average precision of 80% [73]. In Ref. [74], Food-11 was created for training, validation, and evaluation, resulting in 83.6% of food categories being recognised. The work mentioned in Section 5.1 verified the performance of CNNs on food/non-food classification tasks; however, the accuracy of food type recognition was limited. The reason could be the mixture of food items in images and the similarity across some food categories. Hence, to achieve higher accuracy in food type recognition, multiple training labels and multiple outputs per image were suggested for further research, as were different CNN models [74]. NutriNet was also trained for food and beverage recognition with an accuracy of 92.18% [92]. This experiment was conducted on a server equipped with an Intel Core i7-8700K CPU, an Nvidia GeForce GTX 1080 Ti GPU, and 32 GB of RAM. However, NutriNet was limited to only one output for each image, so pixel-level classification was then considered for recognising multiple foods and beverages in one image [92]. Specifically, the FCN-8s network was used to output a pixel map instead of a single result [109].
Deep neural networks were the most likely to achieve very high performance (over 99% accuracy) in classification and recognition tasks. The networks could be used for both food and fluid classification, with Inception ResNet V2, ResNet50 V2, ResNet152, MobileNet V2 and V3, and GoogLeNet all reaching over 95% accuracy. Apart from deep neural networks, machine learning methods such as RF, SVM, and KNN could also reach over 90% accuracy. However, DL methods can require high-performance devices and time-consuming training, and the performance of the models relies on a sufficient amount and variety of training data. The value of a deep neural network therefore lies in the trade-off between its performance and its simplicity.

Intake Action Recognition
The process of an intake activity can be segmented into preparing, delivering, and swallowing, where the preparing phase includes the action of grasping a container and delivering refers to lifting the hand to one's mouth. Most of the methods took the observation of food or fluid in human hands as a representation of intake, which turned the action recognition problem into a simple object detection problem. However, taking the presence of food/drink objects as a representation of intake activities has a high false positive rate. For example, in [44], some food preparation and shopping images were misclassified as intake-related images. Hence, identifying the actual body movement of intake is preferable and more challenging. Efforts have been made to recognise body movement and understand human behaviour through vision. In terms of intake monitoring, the last section mainly focused on the 'what' problem, trying to monitor what the person was drinking or eating, while this section addresses the 'when'.
Most of the action detection tasks depended on third-person cameras rather than first-person cameras; in those third-person cameras, depth cameras were more popular than RGB cameras. Microsoft Kinect was the most used device, of which the SDK could provide skeleton tracking for 25 joints on each body for up to six people, from 0.8 to 4 m, as well as six types of streams, including depth, infrared, colour, skeleton and audio [52,53]. As for hardware settings, [53] was tested on a computer running Windows 8 with an Intel Core i5 processor and 8 GB of RAM.
Starting with third-person methods, RGB information was used for intake detection before the development of depth cameras. One example was the method based on fuzzy vector quantization proposed in 2012, in which activities were considered as 3D volumes formed by a sequence of human poses [48]. Fuzzy vector quantization was used to associate the 3D volume representation of an activity video with 3D volume prototypes; linear discriminant analysis was then used to map activity representations into a low-dimensional discriminant feature space. In this space, a simple nearest centroid classification procedure was used to classify activities, including eating, drinking, and apraxia, which achieved an overall correct classification rate of 93.3% [48]. There was no mention of the computational requirements or the hardware used for the experiments.
Another example using only RGB information was a real-time eating monitoring system for Alzheimer's patients presented in 2018 [38]. This design detected human hand movements with an RGB camera pointing at the subject, resulting in 89% accuracy at a frame rate of 3.9 fps. In this study, hand and mouth regions were detected by a Haar-Cascade classifier, and an HSV skin-colour filtering approach was used to track hand movements between two reference points, one being the position of the mouth and the other a reference object placed by the food tray. The notable limitation was that an extra reference point was needed near the food, which complicated the system [38]. With the development of deep learning and CNNs, an automatic eating monitoring system was proposed that first identified faces using a Faster R-CNN and then counted bites and chews from affine optical flow parameters using a pre-trained AlexNet in MATLAB [39], achieving an accuracy of 85.4% ± 6.2% in counting bites and 88.9% ± 7.4% in counting chews. False predictions in this research were mainly caused by gestures resembling bringing hands to the mouth, such as wiping the mouth [39].
The Naive Bayes classifier was first used with a Kinect in [3] in 2013 to classify input images for patient fluid intake monitoring. The Naive Bayes classifier was chosen because it assumes each attribute is mathematically independent and can estimate prior probabilities from small datasets. The performance with different positions of the subject and partial occlusions of the camera was tested. However, the limitation of this method was that the Naive Bayes classifier was only applicable to a relatively small dataset and test case, so the effectiveness of this approach in large-scale free-living scenarios remains to be validated. Moreover, the experimental test set was insufficient, with only three replications and 10 s of data for each [3].
In 2014, an automatic drinking activity identification system was proposed based on the dynamic time warping (DTW) algorithm [54]. DTW computes the distance between two signals and analyses the final cost distance to identify the differences between them [147]; it is commonly used in speech recognition and ECG signal recognition. The distance between the user's hands and the camera was used to judge whether the person was drinking. A total accuracy of 89% was achieved when tested with three camera locations [54].
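A minimal DTW sketch is shown below: the cumulative alignment cost between an observed hand-to-camera distance sequence and a drinking template is compared against a threshold. The template, signal values, and threshold are illustrative assumptions, not those of [54].

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D signals."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Compare an observed hand-to-camera distance sequence (metres) with a
# drinking template; a small DTW cost suggests a drinking episode.
template = np.array([0.9, 0.7, 0.5, 0.3, 0.3, 0.5, 0.7, 0.9])
observed = np.array([0.95, 0.8, 0.55, 0.35, 0.3, 0.45, 0.8, 0.9])
print("drinking" if dtw_distance(observed, template) < 1.0 else "other")
```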
In later years, more information from the camera, rather than a single indicator, was used for more accurate detection. In [37], the skeleton coordinates from the depth image provided by the Kinect were used to analyse the movements of drinking soup, drinking water, and eating the main course. The distances from both hands to the head and from the plate to the head were used to characterise and classify the gestures, resulting in an 89% average success rate for three subjects [37]. However, no algorithm was presented in this study, no occlusion was considered during the test, and only three subjects were observed and evaluated, which could lead to bias due to personal dietary habits. Despite these limitations, this study validated the feasibility of using the distances between hands, head, and plate for intake monitoring. In [52], the angles of the upper limb joints, divided into shoulder, elbow, wrist, and hand, were detected using the skeleton tracking function of the MS Kinect. The data were then used to train an SVM to classify the sitting posture, and the number of bites was counted based on jaw movement and the distance between the hand and the mouth [52].
A similar method was seen in recent research in 2020 for intake counting, implemented on an Intel Core i7 CPU with 8 GB of RAM [51]. This research detected intake by analysing human joint motion during food/drink intake captured by a Kinect depth camera and achieved an average accuracy of 96.2% (also mentioned in Section 4.2). Specifically, the system counted one food intake activity when the hand, wrist, and mouth were detected close enough together and the elbow joint and wrist angles exceeded certain thresholds. Moreover, interfering actions, including hand at chin, hand at nose, etc., were all considered and analysed [51], which was one of the reasons for the meaningful increase in accuracy.
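The threshold-on-joints idea used in these Kinect-based studies can be sketched as follows, assuming per-frame 3-D joint coordinates (e.g., from a skeleton tracker) and purely illustrative distance and angle thresholds rather than the values of [51] or [52].

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by 3-D points a-b-c."""
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

def count_intakes(frames, dist_thresh=0.15, angle_thresh=60.0):
    """Count hand-to-mouth events from per-frame joint coordinates.

    Each frame is a dict of 3-D positions (metres) for 'hand', 'wrist',
    'elbow', 'shoulder', and 'head'. An intake is counted when the hand is
    near the head while the elbow is strongly flexed; a new count requires
    the hand to move away first, to avoid double counting.
    """
    count, in_event = 0, False
    for f in frames:
        near_mouth = np.linalg.norm(f["hand"] - f["head"]) < dist_thresh
        flexed = joint_angle(f["shoulder"], f["elbow"], f["wrist"]) < angle_thresh
        if near_mouth and flexed and not in_event:
            count += 1
            in_event = True
        elif not near_mouth:
            in_event = False
    return count
```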
Apart from threshold setting and the classifiers mentioned above (Naive Bayes, Haar-Cascade, and SVM), the hidden Markov model (HMM) was another algorithm used. In [41], an HMM was used to detect eating gestures and classify soup and main dishes in conjunction with an MS Kinect camera. The feature used to indicate a candidate intake activity was the distance between the hands and the plate/glass. Unlike the studies that only considered the distance value, the duration of the intake movement was also measured and evaluated in this study. This study achieved 72.7-90% sensitivity on detection and a success rate below 83% on classification.
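An HMM over such distance sequences can be sketched with the hmmlearn library: a Gaussian HMM is fitted on hand-to-plate distance sequences of known intake gestures, and new sequences are scored by log-likelihood. The synthetic data, number of states, and use of hmmlearn are assumptions for illustration, not the setup of [41].

```python
import numpy as np
from hmmlearn import hmm

# Synthetic hand-to-plate distance sequences (metres) for known eating gestures:
# approach the plate, lift to the mouth, and return.
rng = np.random.default_rng(1)
train_seqs = [rng.normal([0.5, 0.3, 0.1, 0.3, 0.5], 0.03).reshape(-1, 1)
              for _ in range(20)]
X = np.concatenate(train_seqs)
lengths = [len(s) for s in train_seqs]

# Three hidden states, roughly: reaching, at the plate, returning.
model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=50)
model.fit(X, lengths)

# Score a new sequence; a high log-likelihood marks a candidate intake gesture.
test_seq = rng.normal([0.5, 0.3, 0.1, 0.3, 0.5], 0.03).reshape(-1, 1)
print("log-likelihood:", model.score(test_seq))
```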
In the dimension of computer vision and machine learning, unsupervised machine learning algorithms, including the self-organising map (SOM) [148], the extended SOM [149], and the growing neural gas network (GNG) [150], were used for tracking food intake movements [53]. The positions of the head and two hands were used to build nodes in the self-organising neural networks. The best network, GNG, achieved less than 37 mm mean distance error on hand and head tracking [53]. Methods built on object detection algorithms were seen in [42], where skin regions of the hands and face were colour-tracked over video frames in the combined YCbCr and YIQ colour spaces, and intake activities were detected by calculating and evaluating the Euclidean distance between the bounding boxes encircling the tracked skin regions. The results indicated that a correct detection rate of 90.82% was achieved on around 200 eating episodes. However, neither the occlusion problem nor the clothing colour of subjects was addressed in this study, which could notably influence the result. Recently, eating behaviour, food type, and food amount were detected by a model trained on a video dataset collected with a 360-degree camera [24]. In this pilot experiment, a six-layer CNN (a simplified AlexNet) was trained to recognise hand-to-mouth movements, achieving 70% accuracy, and then extended to distinguish the gestures of consuming different foods and using different containers. The proposed method attempted food type classification by recognising the gesture of people consuming the food, which differed from the previous object detection-based methods. However, only a small amount of data was used in this research, so the training process remains to be conducted on a larger dataset.
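The colour-tracking idea in [42] can be roughly illustrated with the sketch below, which segments skin in the YCrCb space, treats the two largest regions as face and hand, and flags a candidate intake when their bounding-box centres come close. The colour bounds and pixel threshold are common illustrative values, not the cited study's parameters (which combined YCbCr and YIQ).

```python
import cv2
import numpy as np

# Commonly used YCrCb skin-colour bounds (illustrative, not tuned per subject).
LOWER = np.array([0, 133, 77], dtype=np.uint8)
UPPER = np.array([255, 173, 127], dtype=np.uint8)

def candidate_intake(frame_bgr, dist_thresh=80):
    """Flag a frame as a candidate intake when the two largest skin regions
    (assumed to be face and hand) come within dist_thresh pixels."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    mask = cv2.inRange(ycrcb, LOWER, UPPER)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    regions = sorted(contours, key=cv2.contourArea, reverse=True)[:2]
    if len(regions) < 2:
        return False
    centres = []
    for c in regions:
        x, y, w, h = cv2.boundingRect(c)
        centres.append((x + w / 2, y + h / 2))
    dist = np.hypot(centres[0][0] - centres[1][0], centres[0][1] - centres[1][1])
    return dist < dist_thresh
```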
Some of the previous research addressed the computational burden of processing images and videos of food/drink intake activities. While most of the computation happened on a server or an offline computer, a microcontroller board was used in [151] as the computation platform for real-time intake detection, based on the joint information of hand gestures and jaw movement provided by a Kinect.
Compared to third-person methods, first-person methods were employed much less often for intake action detection. The main technical reason was that using a third-person camera could reduce the high false alarm rate of wearable devices [51]. However, the feasibility of using a wearable camera for recording and analysing daily activities was tested in [152], which successfully reconstructed daily time usage from wearable cameras. In this study, a mean of 19.2 activities per day was self-reported, while 41.1 were revealed by the captured image data, showing that first-person cameras can help capture daily activities more completely than manual reporting. Similarly, a wearable camera was used to record the activities of users during transportation, with a set of image coding including posture recognition, eating episode detection, and food and beverage type recognition [43]. The specific algorithms and methods were not presented in the paper, but this work evaluated the feasibility of monitoring dietary activities during transportation using a wearable camera.

Intake Amount Estimation
The studies mentioned above mostly addressed intake detection and classification rather than intake amount estimation. However, food volume estimation is another problem to consider, namely the 'how much' problem [73,115,128,153-155]. Meal estimation could be realised based on the respective number of intake gestures for consuming liquid, soup, and a meal [41], but the accuracy was not evaluated. Volume estimation based on 3D reconstruction algorithms from images taken by phone was seen in [155], resulting in less than 0.02 inches of absolute error for radius estimation (for radii ranging from 0.8 to 1.45 inches). Im2Calories was another example, which first predicted the distance of each pixel from the camera using a CNN trained on the NYUv2 RGB-D dataset, resulting in an average relative error of 0.18 m, which was too high [73]. The depth map was then converted to a voxel representation for food size estimation, resulting in less than 400 mL absolute volume error [73]. Similarly, a system called FIVR (food intake and voice recognizer) was developed for quantitative nutrition information acquisition from a set of three images and the speech of a user describing the meal; 3D reconstruction algorithms were also used in this design, reaching a (5.75 ± 3.75)% error in volume [154]. A CNN was proposed for depth prediction and volume estimation and significantly improved performance with less than 0.2 s runtime, which was 25 times shorter than conventional 3D reconstruction methods [128]. A geometric model for food amount estimation from single-view images was proposed and achieved less than 6% error for energy estimation, but only under the assumption of accurate segmentation and food classification [153]. Stress-log was another system proposed for calorie counting, which achieved 97% accuracy. In this design, 1000 food-related images were collected from Pixabay (an open-access repository) and used to train an object detection model with the TensorFlow application programming interface; the Firebase database was then used to generate calorie information [115].
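The depth-map-to-volume step can be sketched as integrating, over the segmented food pixels, the per-pixel height above the plate plane multiplied by the metric footprint of each pixel. The sketch below assumes a roughly top-down view, a known plate-plane depth, and pinhole intrinsics; all values are synthetic and illustrative, not the voxel pipeline of [73].

```python
import numpy as np

def volume_from_depth(depth, mask, plate_depth, fx, fy):
    """Rough food volume (m^3) from a single depth map.

    depth: per-pixel distance to camera (m); mask: boolean food segmentation;
    plate_depth: depth of the empty plate plane (m); fx/fy: focal lengths (px).
    """
    height = np.clip(plate_depth - depth, 0, None)   # food height above the plate
    pixel_area = (depth / fx) * (depth / fy)         # metric footprint of each pixel
    return float(np.sum(height[mask] * pixel_area[mask]))

# Example with synthetic data: a flat, 2 cm-high square region seen from 0.5 m.
fx = fy = 525.0
depth = np.full((480, 640), 0.5)
mask = np.zeros_like(depth, dtype=bool)
mask[190:290, 270:370] = True   # 100 x 100 px food region
depth[mask] = 0.48              # 2 cm above the plate plane
print(volume_from_depth(depth, mask, plate_depth=0.5, fx=fx, fy=fy), "m^3")
```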
As for fluid amount estimation, a design called 'Playful Bottle' was proposed, which combined the camera and accelerometer of a phone to realise fluid intake tracking and reminders [156]. The accelerometer was used for drinking action detection, where 21.1% of detections could be false positives caused by shaking the bottle without actually drinking from it. The camera was used to capture images of the liquid in the bottle for water level estimation whenever a drinking action was detected, with a 3.86% average error rate over the 16 subjects [156].
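The water-level estimation step can be illustrated with a minimal sketch: assuming an upright, cropped view of a translucent bottle, the liquid surface is located as the strongest horizontal edge and converted into a fill fraction. This only illustrates the principle and is not the 'Playful Bottle' implementation [156].

```python
import cv2
import numpy as np

def estimate_fill_fraction(bottle_roi_bgr):
    """Estimate the fill level (0..1) of a bottle from a cropped, upright ROI.

    Hypothetical sketch: assumes the liquid surface appears as the strongest
    horizontal edge in the cropped bottle region and returns the fraction of
    the bottle height that lies below it.
    """
    gray = cv2.cvtColor(bottle_roi_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)
    # Vertical intensity gradient highlights horizontal structures (the surface line)
    grad_y = cv2.Sobel(gray, cv2.CV_32F, dx=0, dy=1, ksize=3)
    row_energy = np.abs(grad_y).sum(axis=1)       # edge strength per image row
    surface_row = int(np.argmax(row_energy))      # row of the strongest horizontal edge
    height = gray.shape[0]
    return (height - surface_row) / height        # proportion of bottle below the surface

# Usage: compare fractions before and after a detected drinking action
# before = estimate_fill_fraction(cv2.imread("bottle_before.jpg"))
# after = estimate_fill_fraction(cv2.imread("bottle_after.jpg"))
# consumed_fraction = max(0.0, before - after)
```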

Discussion
Both first-person and third-person methods face viewing occlusion in a free-living environment. However, fewer third-person methods than first-person methods were tested in free-living environments. For wearable cameras, the camera's position can change with body movement and produce unusable frames, or the camera can be accidentally covered by hair or clothing. For third-person cameras, occlusion happens when subjects move into a blind spot or block the expected view with body parts or clothing, which leads to using multiple cameras in the living environment. Therefore, compared to third-person cameras, first-person cameras, which move around with the subject, have the advantage of being usable on their own in a free-living environment. However, the inconvenience of wearing the device, uncertainty about the wear time, and battery sustainability are problems hindering the utilization of first-person cameras. In contrast, as a non-intrusive and almost transparent approach, third-person cameras are more popular for noncompliant groups or people who have difficulty using wearable devices, including older adults.
Both RGB and depth cameras were used as third-person cameras, while only RGB cameras were used as first-person cameras. The reason that no depth camera was used as a wearable camera could be the unsuitable size and weight of the device. RGB cameras are commonly used with other non-vision sensors for intake monitoring, potentially improving performance and reducing power consumption; however, in this case, the action detection task was done mainly by the non-vision sensors rather than the camera. Depth cameras in third-person methods were primarily used independently, without other non-vision sensors. This could be because a depth camera can provide the information that other sensors provide, including acceleration, distance, and pitch, roll, and yaw angles. MS Kinect was used far more than other RGB-D camera modules, possibly because of Microsoft's off-the-shelf development kit for skeleton extraction and body motion tracking. The fusion of RGB and depth information has been increasingly seen in recent years and has been proven to improve the performance of intake monitoring, but it also faces a trade-off between computation, power consumption, and accuracy.
The binary classification task was mainly based on first-person images obtained from a wearable RGB camera in the form of glasses, a watch, or a pendant. Body movement can easily cause motion blur and occlusion in the captured image, so obtaining clear intake-related images is the preliminary step for robust and effective intake monitoring in free-living scenarios. Most of the action detection methods used third-person cameras, with MS Kinect dominating, in which RGB and depth information were used individually or collaboratively. The distance and orientation of different body parts were evaluated to determine actions, such as 'the distance between the mouth and the hands' or 'the angle between the elbow joint and the wrist'; the duration of the movement could also be considered in the judgement, as sketched below. However, because bringing the hands to the mouth was often taken as an indicator of intake, similar actions, including touching the nose, wiping the mouth, and adjusting glasses, can easily be mistaken for intake actions. This has not been thoroughly considered in the existing research. Almost all the vision-based intake amount estimation methods were designed for food/calorie quantification rather than fluid.
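The distance- and angle-based decision rules described above can be sketched as follows, assuming per-frame 3D joint positions (e.g., from a Kinect skeleton). The joint names, thresholds, and dwell-time criterion are illustrative assumptions rather than values taken from any reviewed study.

```python
import numpy as np

def elbow_angle(shoulder, elbow, wrist):
    """Angle (degrees) at the elbow joint between the upper arm and forearm."""
    u = np.asarray(shoulder) - np.asarray(elbow)
    v = np.asarray(wrist) - np.asarray(elbow)
    cos_a = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)
    return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))

def detect_intake_gestures(frames, dist_thresh=0.15, angle_thresh=70.0, min_frames=15):
    """Flag candidate intake gestures from per-frame 3D skeleton joints (metres).

    Hypothetical rule-based sketch: a frame is a candidate when the wrist is
    close to the head and the elbow is sharply flexed; a gesture is reported
    only if the condition holds for `min_frames` consecutive frames (the
    dwell-time criterion). Thresholds are illustrative.
    """
    candidate = []
    for f in frames:  # f: dict with 'head', 'shoulder', 'elbow', 'wrist' 3D points
        d = np.linalg.norm(np.asarray(f["wrist"]) - np.asarray(f["head"]))
        a = elbow_angle(f["shoulder"], f["elbow"], f["wrist"])
        candidate.append(d < dist_thresh and a < angle_thresh)
    # Keep only runs of consecutive candidate frames that last long enough
    events, run_start = [], None
    for i, c in enumerate(candidate + [False]):
        if c and run_start is None:
            run_start = i
        elif not c and run_start is not None:
            if i - run_start >= min_frames:
                events.append((run_start, i - 1))
            run_start = None
    return events
```

Such simple rules help explain why non-intake hand-to-mouth movements (touching the nose, wiping the mouth) are easily confused with intake: they satisfy the same distance, angle, and duration conditions.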
Among the investigated algorithms, DL methods were the most popular and tended to achieve high performance, while classical ML methods could be used together with DL to boost accuracy further. However, the training process can be time- and energy-consuming. In real-life practice, the simplicity and robustness of the system are essential, and privacy is always an issue. Therefore, if the computational power is in place and the training sample size is sufficient, DL methods are recommended to maximise accuracy, especially if real-time performance is not a requirement. The acceptance of monitoring technologies can differ between individuals; the methods mentioned in this review give general solutions to the monitoring tasks and can serve as guidance for designing personalised systems for specific individuals. Only limited research was built or evaluated in real-life living scenarios.
Regarding privacy preservation, the existing solutions include avoiding taking privacy-sensitive images or manually deleting them during pre-processing. Algorithms for detecting and tracking human faces were developed so that faces could be removed or blurred. Another popular approach was to conduct intake monitoring based only on depth information showing the contour or skeleton of the person without an identifiable face, although this complicated the system.
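A minimal sketch of the face-obfuscation approach is given below, using an off-the-shelf face detector and Gaussian blurring; the reviewed systems may rely on different detectors or trackers, but the privacy-preserving principle is the same.

```python
import cv2

def blur_faces(frame_bgr):
    """Blur detected faces in a frame before storage or transmission.

    Hypothetical sketch using OpenCV's bundled Haar cascade; this is one
    possible implementation, not the one used in any specific reviewed study.
    """
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    detector = cv2.CascadeClassifier(cascade_path)
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        roi = frame_bgr[y:y + h, x:x + w]
        frame_bgr[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (51, 51), 0)
    return frame_bgr
```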

Conclusions on Research Gaps
This extensive review of vision-based methods for food and fluid intake monitoring provides the following conclusions and research gaps to drive future directions in this area. Only a limited number of papers were found on drink/non-drink (binary) classification, whereas many exist for food/non-food classification. This is the preliminary step in identifying intake activities, yet interfering daily activities (e.g., wiping the mouth) were not included in current studies to improve the accuracy of the binary classification task. Furthermore, far fewer papers were found on fluid type classification than on food type classification; the drink types included were also limited, and the performance of the proposed methods for fluid type classification was lower than that achieved for food type classification.
The first-person method was not commonly used for intake action recognition; when an RGB first-person camera was used with non-vision sensors, the action recognition task was mostly conducted by the non-vision sensors rather than the camera. The non-vision sensors used with first-person cameras include accelerometers, gyroscopes, flex sensors, load cells, proximity sensors, and IMUs. EMG sensors and microphones were not commonly used but could be an option. Combining first-person RGB cameras with other sensors has the potential to reduce the energy consumption of cameras, extend battery life, save storage space, and reduce privacy concerns by turning the camera on only when a candidate movement is detected.
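The sensor-triggered capture idea can be sketched as follows; the accelerometer and camera interfaces are hypothetical placeholders for device-specific APIs, and the threshold and burst length are illustrative.

```python
import time

MOTION_THRESHOLD = 1.8    # in g; illustrative threshold for a hand-raising movement
BURST_FRAMES = 5          # number of frames captured per triggered event

def classify_intake(frames):
    """Placeholder for the downstream drink / eat / other-activity classifier."""
    pass

def monitor(accelerometer, camera):
    """Duty-cycle a wearable camera using accelerometer triggers.

    Hypothetical sketch: `accelerometer` and `camera` stand in for
    device-specific APIs (read_magnitude, power_on, capture, power_off are
    assumed methods). The camera stays off until the accelerometer reports a
    movement large enough to be a candidate intake gesture, which saves
    energy and limits how many images are recorded.
    """
    while True:
        if accelerometer.read_magnitude() > MOTION_THRESHOLD:
            camera.power_on()
            frames = [camera.capture() for _ in range(BURST_FRAMES)]
            camera.power_off()
            classify_intake(frames)
        time.sleep(0.05)  # poll the low-power sensor at ~20 Hz
```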
Of the four tasks discussed in this paper, third-person methods were mostly used for action recognition rather than the other tasks, and third-person cameras were not used collaboratively with non-vision sensors. The results of RGB and depth fusion were promising for intake detection, but the number of papers using this approach was limited. The limitations of utilizing first-person cameras for intake monitoring include occlusion of the view, dark and blurry images obtained in poor lighting conditions, noncompliance in wearing the camera, battery sustainability issues, and privacy issues.
To reduce privacy concerns associated with image data, some studies used only depth information from RGB-D cameras through skeletal tracking. There is no standalone dataset related to fluid intake, such as an image dataset of containers or a video dataset of people drinking with different postures, temperatures, containers, and amounts of fluid. Vision-based methods were barely used for fluid intake amount estimation, which is typically done with smart containers, EMG sensors, or microphones. Finally, the performance of the intake monitoring systems proposed in current studies was not adequately tested in free-living environments.

Conflicts of Interest:
The authors declare no conflict of interest.