Dietary Nutritional Information Autonomous Perception Method Based on Machine Vision in Smart Homes

In order to automatically perceive the user’s dietary nutritional information in the smart home environment, this paper proposes a dietary nutritional information autonomous perception method based on machine vision in smart homes. Firstly, we proposed a food-recognition algorithm based on YOLOv5 to monitor the user’s dietary intake using the social robot. Secondly, in order to obtain the nutritional composition of the user’s dietary intake, we calibrated the weight of food ingredients and designed the method for the calculation of food nutritional composition; then, we proposed a dietary nutritional information autonomous perception method based on machine vision (DNPM) that supports the quantitative analysis of nutritional composition. Finally, the proposed algorithm was tested on the self-expanded dataset CFNet-34 based on the Chinese food dataset ChineseFoodNet. The test results show that the average recognition accuracy of the food-recognition algorithm based on YOLOv5 is 89.7%, showing good accuracy and robustness. According to the performance test results of the dietary nutritional information autonomous perception system in smart homes, the average nutritional composition perception accuracy of the system was 90.1%, the response time was less than 6 ms, and the speed was higher than 18 fps, showing excellent robustness and nutritional composition perception performance.


Introduction
Along with the gradual development of IoT, big data and artificial intelligence, smart homes are changing people's lives and habits to a certain extent [1]. According to data released by Strategy Analytics, since 2016, the number of households with smart home devices in the world and the market size of smart home devices have both continued to grow. In 2020, the global smart home equipment market will reach 121 billion US dollars, and the number of households with smart home equipment in the world will reach 235 million. In addition, deep learning has brought state-of-the-art performance to tasks in various fields, including speech recognition and natural language understanding [2], image recognition and classification [3], system identification and parameter estimation [4][5][6].
According to the 2020 World Health Organization (WHO) survey report, obesity and overweight are currently critical factors endangering health [7]. Indisputably, obesity may cause heart disease, stroke, diabetes, high blood pressure and other diseases [8]. Since 2016, more than 1.9 billion adults worldwide have been identified as overweight, especially in the United States. In 2019, the rate of obesity in all states was more than 30%, and such patients spent USD 1429 more a year on medical diseases than normal people [9]. Six of the ten leading causes of death in the United States, including cancer, diabetes and heart disease, can be directly linked to diet [10]. Though there are various factors that may cause obesity such as certain medications, emotional issues such as stress, less exercise, poor

•
In order to monitor the user's dietary intake using the social robot, we proposed a food-recognition algorithm based on YOLOv5. The algorithm can recognize multiple foods in multiple dishes in a dining-table scenario and has powerful target detection capabilities and real-time performance.

•
In order to obtain the nutritional composition of the user's diet, we calibrated the weight of food ingredients and designed the method for the calculation of food nutritional composition; then, we proposed a dietary nutritional information autonomous perception method based on machine vision (DNPM) that supports the quantitative analysis of nutritional composition. • Deployed the proposed algorithms on the experimental platform and integrate it into the application system for testing. The test results show that the system shows excellent robustness, generalization ability and nutritional composition perception performance.
The remainder is arranged as follows. Section 2 introduces the related work, Section 3 introduces the food-recognition algorithm, Section 4 focuses on the proposed method, and Section 5 details the experimental environment. Section 6 presents the performance experiment and analysis of the food-recognition algorithm. Section 7 discusses the performance testing and analysis of the application system. Section 8 is a necessary discussion of the results of this paper. Finally, in Section 9 we conclude this paper and discuss possible future works.

Related Work
In the last decade, the perception technology of dietary nutritional composition has been widely researched by scholars at home and abroad. Researchers have carried out a series of studies on food dataset construction, food recognition and diet quality assessment.
In the construction of food datasets, training data are mainly collected by manual annotation methods to construct a food image dataset in the early stages [23][24][25]. However, the method based on the manual labeling of datasets is expensive and poorly scalable. Moreover, coupled with factors such as variable image shooting distances and angles, and mutual occlusion among food components, it is difficult to guarantee the accuracy of artificial image classification standards. Compared with the data obtained based on manual annotation methods, Bossard L. et al. [26] collected 101,000 food images containing 101 categories from food photo-sharing websites and established the ETH Food-101 dataset; however, one image often inevitably contains multiple foods. ChineseFoodNet [27] consists of 185,628 Chinese food images in 208 categories, but no fruit images are involved, and the definition of image categories in the dataset is relatively vague. Parneet et al. [28] constructed the FoodX-251 dataset based on the Food-101 public dataset for fine-grained food classification, which contains 158,000 images and 251 fine-grained food categories, although most of them are Western-style food.
In terms of food recognition, Matsuda et al. [29] incorporated the information on co-occurrence relationships between foods. Specifically, four kinds of detectors are used to detect the candidate regions of the image; then, the candidate regions are fused. After extracting a variety of image features, the images are classified, and the manifold sorting method is adopted to identify a variety of foods. Zhu et al. [30] developed a mobile food image recognition method. Firstly, the food region in the image is located by the image segmentation method; then, the color and texture features of the region are extracted and fused for food image recognition. Kong et al. [31] provided a food recognition system DietCam, which extracts SIFT features as food image features with characteristics such as illumination, scale and affine invariance, and obtains three food images from three different shooting angles at a time; then, it performs more robust recognition based on these three food images. However, the existing food image recognition methods are mainly aimed at a single task, such as food classification, while there are few studies on simultaneously predicting food ingredients' energy and other information corresponding to food images. Food image recognition can be improved by learning food categories and food ingredients' attributes at the same time through multi-task learning. Dehais et al. [32] performed a 3D reconstruction of food based on multi-angle images to predict the carbohydrate content of food. However, the food volume estimation method based on multi-angle images requires the input of multiple images and has higher requirements for shooting distance and angle, which is not convenient for users to operate. Myers and Johnston et al. [33] designed a mobile app called Im2Calories that predicts calorie values based on food images. 
Firstly, the food category is recognized by the GoogLeNet model; then, the different foods in the image are identified and located by target recognition, semantic segmentation, and the food volume is estimated based on the depth image. Finally, the calorie value is calculated by querying the USDA food information database. However, the related information has to be ignored in the training process, because the sub-tasks are independent of each other.
In fact, the quality assessment of a user's diet can be further completed according to the associated components of the food image. Regarding diet quality assessment, Javaid Nabi et al. [34] proposed a Smart Dietary Monitoring System (SDMS) that integrates Wireless Sensor Networks (WSN) into the Internet of Things, tracks user's dietary intake through sensors and analyzes data through statistical methods, so as to track and guide user's nutritional needs. Rodrigo Zenun Franco [35] designed a recommendation system for assessing dietary intake, which systematically integrates the individual user's dietary preferences, population data and expert recommendations for personalized dietary recommendations. Abul Doulah et al. [36] proposed a sensor-based dietary assessment and behavioral monitoring method in 2018 that obtains the user's dietary intake through video and sensors, as well as differentiated statistics on eating time, intake time, intake times and pause time between eating times for the assessment of the user's diet. In 2020, Landu Jiang et al. [11] developed a food image analysis and dietary assessment system based on the depth model, which was used to study and analyze food projects based on daily dietary images. In general, the dietary monitoring and assessment systems proposed above can track and monitor the user's dietary behavior and assess the dietary intake, but cannot effectively assess the user's diet quality. More fundamentally, the proposed systems do not correlate food image recognition algorithms, nor do they fully consider the main components of the diet; moreover, the food analyzed is too simple.
However, it is worth explaining that only by building an expanded multi-food dataset, realizing the multi-target recognition of foods to deal with complex life scenarios and making qualitative and quantitative analyses of the intake can we accurately assess the dietary intake of user's and guide them toward healthier lifestyle choices. Though dataset construction, food recognition and diet quality assessment have been well discussed in the above work, three fundamental challenges remain. Firstly, most dataset images have only one type of food, and most methods of food recognition deal with images of a single food. Secondly, it is still time-consuming (2 s in general) to detect and classify the food in images. Finally, there is a lack of effective assessment of the user's diet quality. In this paper, we aim to address these issues and propose a dietary nutritional information autonomous perception method based on machine vision (DNPM), recognizing foods through cameras, and correlating food nutritional composition to generate diet quality assessments for long-term healthcare plans.

Food-Recognition Algorithm Based on YOLOv5
In order to recognize multiple foods in multiple dishes in a dining-table scenario, using the powerful multi-target detection capability of YOLOv5 [37], we propose a foodrecognition algorithm based on YOLOv5. Its overall architecture diagram is shown in Figure 1, and its detailed steps are shown in Algorithm 1. and pause time between eating times for the assessment of the user's diet. In 2020, Landu Jiang et al. [11] developed a food image analysis and dietary assessment system based on the depth model, which was used to study and analyze food projects based on daily dietary images. In general, the dietary monitoring and assessment systems proposed above can track and monitor the user's dietary behavior and assess the dietary intake, but cannot effectively assess the user's diet quality. More fundamentally, the proposed systems do not correlate food image recognition algorithms, nor do they fully consider the main components of the diet; moreover, the food analyzed is too simple.
However, it is worth explaining that only by building an expanded multi-food dataset, realizing the multi-target recognition of foods to deal with complex life scenarios and making qualitative and quantitative analyses of the intake can we accurately assess the dietary intake of user's and guide them toward healthier lifestyle choices. Though dataset construction, food recognition and diet quality assessment have been well discussed in the above work, three fundamental challenges remain. Firstly, most dataset images have only one type of food, and most methods of food recognition deal with images of a single food. Secondly, it is still time-consuming (2 s in general) to detect and classify the food in images. Finally, there is a lack of effective assessment of the user's diet quality. In this paper, we aim to address these issues and propose a dietary nutritional information autonomous perception method based on machine vision (DNPM), recognizing foods through cameras, and correlating food nutritional composition to generate diet quality assessments for long-term healthcare plans.

Food-Recognition Algorithm Based on YOLOv5
In order to recognize multiple foods in multiple dishes in a dining-table scenario, using the powerful multi-target detection capability of YOLOv5 [37], we propose a foodrecognition algorithm based on YOLOv5. Its overall architecture diagram is shown in Figure 1, and its detailed steps are shown in Algorithm 1.   In Figure 1, the Input layer preprocesses the training dataset through the Mosaic data-enhancement method, adaptive anchor-frame calculation, adaptive picture scaling and other methods [38]; it initializes the model parameters and obtains the required picture size of the model. The Backbone layer divides the picture in the dataset through the Focus structure; then, it scales the length and width of the image continuously through the CSP structure. The Neck layer fuses the data set through FPN operation and PAN In Figure 1, the Input layer preprocesses the training dataset through the Mosaic data-enhancement method, adaptive anchor-frame calculation, adaptive picture scaling and other methods [38]; it initializes the model parameters and obtains the required picture size of the model. The Backbone layer divides the picture in the dataset through the Focus structure; then, it scales the length and width of the image continuously through the CSP structure. The Neck layer fuses the data set through FPN operation and PAN operation to obtain the prediction feature map of the dataset. The Precision layer calculates the gap between the prediction box and the real box through the calculation of the loss function; then, it updates the parameters of the iterative model through the back-propagation algorithm and filters the prediction box through the NMS operation weighted by the model post-processing operation to obtain the prediction results of the model. Use adaptive image scaling technology to uniformly modify the size of the image to 608 × 608 × 3, and obtain the dataset D = {d 1 , d 2 , d 3 , . . . , d n };

5:
Input Input the feature map G = {g 1 , g 2 , . . . , g n } to the Prediction layer. The Prediction layer calculates the difference between the prediction frame and the real frame by calculating the loss, mainly the classification loss , and then reversely updates the iterative model parameters; 10: The model algorithm will generate multiple prediction boxes, use the weighted NMS operation to filter the prediction boxes, and finally get the model prediction result dataset F = {F 1 , . . . , F i , . . . , F n }.

Dietary Nutritional Information Autonomous Perception Method Based on Machine Vision (DNPM)
In order to obtain the nutritional composition of foods, the weight of food ingredients needs to be calibrated first, and the standard weight of each food ingredient is calibrated according to the amount of food ingredients of "Meishi Jie" [39] (see Table 1). The nutritional composition of each food with a weight of 100 g is queried according to the National Nutrition Database-Food Nutritional Composition Query Platform [40] and Shi An Tong-Food Nutritional Composition Query Platform [41], and the recognized food is mapped to the nutritional composition table. Assuming that there are c kinds of main ingredients to form a dish, and the standard nutritional composition of the jth ingredient is Y ij , then the nutritional composition of the jth ingredient N ij = Y ij × G j /100, where G represents the calibrated weight of ingredients, i = 1, 2, 3, 4, 5, . . . , 33 represent 33 nutritional compositions (see Table 2), j = 1, 2, . . . , c represent the c main ingredients of the dish (see Table 1). The nutritional compositions of the main ingredients in the dish are accumulated to obtain the nutritional composition of the dish. The calculation method is shown in Equation (1): where CP i represents the ith nutritional composition of the dish, i = 1, 2, . . . , 33.
Using Algorithm 1, the robot can obtain the feature model w of food recognition, that is, the food-recognition database F = {F 1 , . . . , F i , . . . , F n }. In order to obtain the food nutritional composition consumed by each user after the robot recognizes foods and faces through vision, we propose a dietary nutritional information autonomous perception method based on machine vision (DNPM), where the specific steps are shown in Algorithm 2.
In Step 4, firstly, capture face information and person name information in advance using the camera and store them locally; then, extract 128D feature values from multiple face images using the face database Dlib; calculate the 128D feature mean value of the monitoring object, and store the 128D feature mean value locally. When the system is working, recognize the face in the video stream, extract the feature points in the face and store the local face image information to match the Euclidean distance to determine whether it is the same face; if so, return the corresponding person identity information, if not, it displays unknown. When the threshold set for face recognition is 0.4 and the Euclidean metric matching degree is less than or equal to 0.4, return the corresponding character identity information, and face recognition is successful. Output T = tb, DT and FT. In Step 4, firstly, capture face information and person name information in advance using the camera and store them locally; then, extract 128D feature values from multiple face images using the face database Dlib; calculate the 128D feature mean value of the monitoring object, and store the 128D feature mean value locally. When the system is working, recognize the face in the video stream, extract the feature points in the face and store the local face image information to match the Euclidean distance to determine whether it is the same face; if so, return the corresponding person identity information, if Output T = t b , D T and F T .
In Step 6, consider the food taboos of users, such as the following: seafood-allergic people do not eat seafood; Hui people do not eat pork; vegetarians do not eat meat, eggs and milk; and pregnant women are not allowed to eat cold foods. As a result, build a taboo food database G (see Table 3).

Experimental Environment
The smart home experimental environment built in this paper is shown in Figure 2; in this setting, multiple cameras and a social robot with a depth camera were deployed to monitor the user's dietary behavior, and the frame data captured from multiple camera video streams at the same moment were transmitted to the workstation in real-time through wireless communication, while the training and analysis of the data were performed by a Dell Tower 5810 workstation (Intel i7-6770HQ; CPU, 2600 MHz; 32G memory. NVIDIA Quadro GV100 GPU; 32G memory) [42,43]. The hardware of the social robot included an Intel NUC mini host, EAI DashgoB1 mobile chassis, IPad display screen and Microsoft Kinect V2 depth camera, and the communication control between hardware modules was implemented using the ROS (robot operation system) framework [44]. At the software level, the social robot's platform host and workstations were installed with the Ubuntu 16.04 LTS operating system, TensorFlow deep learning framework, YOLO and machine vision toolkit Opencv3.3.0.  Figure 3 shows the workflow chart of the autonomous perception system for dietary nutritional information in a smart home environment. First of all, the Dell Tower 5810 workstation uses Algorithm 1 to train the food image dataset to obtain the food-recognition feature model w. Then, the obtained feature model w is transmitted to the social robot, which receives the model and loads it, and the multiple cameras deployed to the smart home environment and the social robot with depth cameras apply DNPM to start foodrecognition detection while importing the face data feature database for face recognition. Finally, the food category information is mapped to the nutritional composition database according to the detected results, the nutritional composition is calculated (see Section 4), and the nutritional composition information of the user is obtained and stored in the user dietary information database. 
Users can query their dietary information through the terminal.

Dataset
The 23 most common kinds of food were selected from the ChinesFoodNet [27] dataset as the training set and test set, including cereal, potato, vegetable, meat, egg, milk and seafood. Considering that the food type in the actual scenario should also include  Figure 3 shows the workflow chart of the autonomous perception system for dietary nutritional information in a smart home environment. First of all, the Dell Tower 5810 workstation uses Algorithm 1 to train the food image dataset to obtain the food-recognition feature model w. Then, the obtained feature model w is transmitted to the social robot, which receives the model and loads it, and the multiple cameras deployed to the smart home environment and the social robot with depth cameras apply DNPM to start foodrecognition detection while importing the face data feature database for face recognition. Finally, the food category information is mapped to the nutritional composition database according to the detected results, the nutritional composition is calculated (see Section 4), and the nutritional composition information of the user is obtained and stored in the user dietary information database. Users can query their dietary information through the terminal.  Figure 3 shows the workflow chart of the autonomous perception system for dietary nutritional information in a smart home environment. First of all, the Dell Tower 5810 workstation uses Algorithm 1 to train the food image dataset to obtain the food-recognition feature model w. Then, the obtained feature model w is transmitted to the social robot, which receives the model and loads it, and the multiple cameras deployed to the smart home environment and the social robot with depth cameras apply DNPM to start foodrecognition detection while importing the face data feature database for face recognition. 
Finally, the food category information is mapped to the nutritional composition database according to the detected results, the nutritional composition is calculated (see Section 4), and the nutritional composition information of the user is obtained and stored in the user dietary information database. Users can query their dietary information through the terminal.

Dataset
The 23 most common kinds of food were selected from the ChinesFoodNet [27] dataset as the training set and test set, including cereal, potato, vegetable, meat, egg, milk and seafood. Considering that the food type in the actual scenario should also include

Dataset
The 23 most common kinds of food were selected from the ChinesFoodNet [27] dataset as the training set and test set, including cereal, potato, vegetable, meat, egg, milk and seafood. Considering that the food type in the actual scenario should also include milk and fruit, milk and 10 kinds of fruits were added to expand the dataset; in total, 34 kinds of food images were formed, and the dataset CFNet-34 was formed. We took 80% of the CFNet-34 dataset as the training dataset and 20% as the test dataset for training and testing, respectively. Dataset acquisition address: https://pan.baidu.com/s/ 1laUwRuhyEEOmWq8asi0uoA, (accessed on 19 June 2022) Extraction code: 71l4.

Performance Indicators
Four indicators of precision rate P (see Equation (2)), recall rate R (see Equation (3)), mAP@0.5 and mAP@0.5:0.95 were used to evaluate the food-recognition model.
where TP i represents the number of foods of category i that are correctly predicted, N represents the total number of categories of foods, FP i represents the number of other foods that are incorrectly predicted as foods of category i, and FN i represents the number of foods of category i that are incorrectly predicted as other foods. mAP@0.5 represents the mAP when the IoU threshold is 0.5, reflecting the recognition ability of the model. mAP@0.5:0.95 represents the average value of each mAP when the IoU threshold is from 0.5 to 0.95 and the step size is 0.05, which reflects the localization effect and boundary regression ability of the model. The values of these six evaluation indicators are all positively correlated with the detection effect. AP in mAP is the area under the PR curve, and its calculation method is shown in Equation (4). (4)

Experimental Results and Analysis
The hyperparameters of the experiment were set as follows: iteration times, 600; batch size, 32; learning rate, 0.001; size of all input images, 640; confidence threshold, 0.01; IoU threshold, 0.06; and the test set was tested. Table 4 shows the evaluation results of food recognition obtained by testing the YOLOv5 model on the test set. Obviously, the more obvious the image features, the easier they were to identify. For example, the recognition accuracy of fruits was higher, and the recognition accuracy of strawberries with the most obvious features reached 100%. The inter-class similarity among three kinds of dishes, i.e., braised pork, barbecued pork and cola chicken wings, is too large, which can easily lead to recognition errors. Therefore, the recognition accuracy was low, and the recognition accuracy of cola chicken wings was the lowest, at 69.5%. The average accuracy of the model test was 89.7%, the average recall rate was 91.4%, and the average mAP@0.5 and mAP@0.5:0.95 were 94.8% and 87.1%, respectively.  Table 5 shows the experimental results of different image recognition algorithms on the test set. It can be seen from Table 5 that Algorithm 1 performs well on the whole, and the Top-1 and Top-5 accuracy rates of the test set are higher than other algorithms, and a more robust feature model can be obtained, thereby improving the recognition accuracy of the algorithm. It shows that Algorithm 1 has higher recognition accuracy and robustness in food recognition.

Experiment Solution
See Table 6 for indications to set test scenarios considering the possible number of family members and the number of foods. Table 6. Test scenario settings.

No.
Scenario Information C 1 1 person eats 1-3 foods C 2 2 people eat 2-4 foods C 3 3 people eat 3, 4, 6 foods C 4 4 people eat 4, 6, 8 foods C 5 5 people eat 6, 8, 9 foods In order to test the food recognition and nutritional composition perception performance of the system, seven types of test sets were designed from the aspects of test object change, food change, etc.
Test set a: There was only one kind of food in the sample image, and the sample image was divided into six categories, including cereal, potato, vegetable, meat, egg, milk, seafood and fruit. Each category had 10 images, for a total of 60, and did not intersect with the training set.
Test set b: There were 60 images with two kinds of food in the sample image. Test set c: There were 60 images with three kinds of food in the sample image. Test set d: There were 60 images with four kinds of food in the sample image.
Test set e: There were 60 images with six kinds of food in the sample image. Test set f: There were 60 images with eight kinds of food in the sample image. Test set g: There were 60 images with nine kinds of food in the sample image. The working parameters of the camera are not easy to calculate, so the test set used in the test is usually prepared in advance, and the data is sent to the system by simulating the working mechanism of the camera.

Test Results and Analysis
When the proposed algorithm was deployed on the social robot platform, the hyperparameters were set as follows: number of iterations, 600; batch size, 32; learning rate, 0.001. The five scenarios and seven types of test sets designed in Section 7.1 were tested. The response time and speed of the system for different test sets and the perceived accuracy of nutritional composition are shown in Tables 7-9. The box diagram of nutritional composition perception accuracy is shown in Figure 4. The effect chart of the systematic diet assessment is shown in Figure 5. When the proposed algorithm was deployed on the social robot platform, the hyperparameters were set as follows: number of iterations, 600; batch size, 32; learning rate, 0.001. The five scenarios and seven types of test sets designed in Section 7.1 were tested. The response time and speed of the system for different test sets and the perceived accuracy of nutritional composition are shown in Tables 7-9. The box diagram of nutritional composition perception accuracy is shown in Figure 4. The effect chart of the systematic diet assessment is shown in Figure 5.       After food recognition and face recognition, Algorithm 2 can be used to quickly correlate food nutritional information, so the accuracy of food recognition is the perception accuracy of the nutritional composition.
According to Table 7, the average response time of the system was 4.6 ms, and the average response times of test set a~test set g were 3.8 ms, 4.1 ms, 4.5 ms, 4.6 ms, 4.9 ms, 5.1 ms and 5.5 ms, respectively. The average response time of the system for different test sets increased with the increase in personnel and food, indicating that the detection and recognition of the system was more time-consuming in the scenario with more food and personnel; however, the response speed is in the millisecond range, which meets the real-time working requirements of the system. According to Table 8, the average recognition speed of the system was 21.8 fps, and the average recognition speeds of test set a~test set g were 26.3 fps, 24.4 fps, 22.0 fps, 21.6 fps, 20.3 fps, 19.7 fps and 18.2 fps, respectively. According to Table 9, the total average nutritional composition perception accuracy of the system was 90.1%, and the average nutritional composition perception accuracy values of test set a~test set g were 89.7%, 92.5%, 93.3%, 97.2%, 96.5%, 80.9% and 80.3%, respectively. In the scenario where 3 or 4 people eat four foods and six foods, the nutritional composition perception of the system was the most accurate, while in the case of more food and personnel, the performance of the system was affected to a certain extent, but on the whole, the nutritional composition perception accuracy of the system was good.
According to Figure 4, the median scale of Figure 4a-c is higher than 80.0%, indicating that the system showed good recognition performance for the data of the C 1 , C 2 and C 3 scenarios, while Figure 4d,e indicates that the lowest-value scale line is close to 30.0%, indicating that the system showed poor recognition performance for the data of the C 4 and C 5 scenarios. In general, the nutritional composition perception accuracy of the system was 90.1%, but in the case of complex personnel and food, the recognition and perception performance of the system was low; therefore, the recognition and perception robustness of the system needs to be further improved.
To sum up, the tests showed that the response time of the system's food recognition and face recognition was less than 6 ms for all test sets, and the speed was higher than 18 fps. The overall nutritional composition perception accuracy of the system was 90.1%, indicating that the feature model output of the algorithm has a certain generalization ability, that the algorithm has a strong feature-learning ability, and that the system has good robustness.

Discussion
Although our proposed algorithm performs well on the self-built dataset, there is still room for improvement compared with some state-of-the-art algorithms.
Model complexity has always been a major factor affecting the performance of deep learning models. Owing to hardware limitations, we had to make a trade-off between processing time and system accuracy. In our experiments, we used YOLOv5 for food recognition. YOLOv5 is among the most advanced object detection methods available, but its training process is time-consuming and its detection accuracy can still be improved. In the future, we may improve the YOLOv5 model structure to reduce training time and increase recognition accuracy, for example by further combining the feature fusion of each module with multi-scale detection [45] and by introducing attention mechanism modules at different positions in the model [46].
The second challenge is to build a good dataset for capturing food images from daily diets. As we found in our evaluation, although ChineseFoodNet is a popular image dataset, some of its images are inaccurately classified. Moreover, some food items exhibit high intra-class variance or low inter-class variance: items in the same category with high intra-class variance may look quite different, while two different foods with low inter-class variance may have similar appearances. Both issues can significantly reduce the accuracy of the detection model. To address this, we need to search for more datasets with which to augment the CFNet-34 dataset. In the future, we will continue to label CFNet-34 to extend it to a wider range of food categories; combining it with other datasets to create a more diverse food dataset is also desirable.
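The intra-/inter-class variance issue described above can be made concrete on feature embeddings. The following sketch uses hypothetical 2-D embeddings (illustrative values only, not drawn from our dataset) and measures intra-class variance as the mean squared distance of samples to their class centroid, and inter-class variance as the squared distance between the two class centroids:

```python
import numpy as np

# Hypothetical 2-D feature embeddings for two food classes (illustrative only).
class_a = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]])
class_b = np.array([[0.3, 0.2], [0.4, 0.1], [0.2, 0.2]])

def intra_class_variance(x):
    # Mean squared distance of each sample to its class centroid:
    # large values mean the class is visually spread out.
    centroid = x.mean(axis=0)
    return float(np.mean(np.sum((x - centroid) ** 2, axis=1)))

def inter_class_variance(x, y):
    # Squared distance between the two class centroids:
    # small values mean the two classes look alike on average.
    diff = x.mean(axis=0) - y.mean(axis=0)
    return float(np.sum(diff ** 2))

# High intra-class variance combined with low inter-class variance
# makes the two classes hard for a detector to separate.
print(intra_class_variance(class_a))
print(inter_class_variance(class_a, class_b))
```

A dataset audit along these lines (with real embeddings from the recognition backbone) can flag the category pairs most likely to be confused, which is one way to prioritize relabeling and augmentation effort.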

Conclusions
In order to reduce the risk of disease caused by obesity and overweight and to regulate users' dietary intake from the perspective of dietary behavior, it is necessary to develop a social robot with dietary behavior monitoring and dietary quality assessment functions. Focusing on these needs in the smart home environment, this paper proposes a dietary nutritional information autonomous perception method based on machine vision in smart homes. The method applies deep learning, image processing, database storage and management, and other technologies to acquire and store the user's dietary information. Firstly, we proposed a food-recognition algorithm based on YOLOv5 to recognize the food on the table. Then, in order to quantitatively analyze the user's dietary information, we calibrated the weight of the food ingredients and designed a method for calculating the nutritional composition of the foods. On this basis, we proposed a dietary nutritional information autonomous perception method based on machine vision (DNPM) to calculate the user's nutritional composition intake. The acquired dietary information is stored in the autonomous perception system for the user to query. Finally, the proposed method was deployed and tested in the smart home environment. The test results show that the system response time was less than 6 ms and the nutritional composition perception accuracy was 90.1%, demonstrating good real-time performance, robustness and nutritional composition perception performance. However, this study still has limitations. Firstly, the social robot lacks the ability to dynamically and autonomously add new foods and users. In addition, face recognition alone is not sufficient to establish a stable human-machine relationship between the user and the social robot.
In future research, we will focus on designing functions that allow the social robot to autonomously add new foods and users and on building a stable relationship between humans and machines. In addition, we will continue to work on improving the recognition accuracy of the system and reducing its processing time.