Real-Time Detection of Cook Assistant Overalls Based on Embedded Reasoning

Currently, target detection based on convolutional neural networks plays an important role in image recognition, speech recognition and other fields. However, current network models feature complex structures and huge numbers of parameters, and demand considerable computing resources. This makes them difficult to apply in embedded devices with limited computational capabilities and extreme sensitivity to power consumption, which restricts the application scenarios of deep learning. This paper proposes a real-time detection scheme for cook assistant overalls based on the Hi3559A embedded processor. With YOLOv3 as the benchmark network, the scheme fully mobilizes the hardware acceleration resources of the processor through network model optimization and parallel processing, and improves the network reasoning speed, so that the embedded device can complete the real-time detection task locally. The experimental results show that, through purposeful cropping, segmentation and in-depth optimization of the neural network for the specific processor, the network can still recognize images accurately. In an application environment where the power consumption is only 5.5 W, the recognition speed on the embedded end is increased to about 28 frames per second (the design requirement was 25 frames per second or more), so that the optimized network can be effectively applied in the back-kitchen overalls identification scene.


Introduction
Food safety problems are prevalent around us. According to the World Health Organization, about 600 million people worldwide get sick and 420,000 die from eating contaminated food every year, causing a loss of 33 million healthy life years [1]. Since 2014, the CFDA has promoted the "transparent kitchen and stoves" initiative, which aims to make the food processing process public through video display, low partition walls, open kitchens and other forms, and to place the key parts and links of catering service under social supervision. Kitchen data can be collected via cameras, but how to correctly recognize and supervise these data has become a problem. The traditional way of manual supervision requires a lot of time and labor, with unsatisfactory results. Therefore, this paper uses deep learning to let computer equipment automatically detect whether the staff entering the kitchen are wearing kitchen overalls.
Video processing technology with deep learning as the core has been fully applied in face recognition and vehicle recognition [2][3][4]. The current mainstream deep learning solutions perform reasoning on high-performance GPUs in cloud servers and then deliver the results to local end users. However, in actual use case scenarios, this scheme faces high costs, high power consumption and limited bandwidth. We therefore try to perform reasoning calculations locally using embedded devices, which can effectively reduce the device cost and bandwidth requirements while controlling the power consumption. At present, embedded devices cannot directly meet the requirements of deep learning due to their weak computational capabilities. This paper aims to design a solution for cook assistant overalls recognition on an embedded device based on YOLOv3 and provide a feasible technical scheme for embedded applications of deep learning [5]. The common acceleration schemes are roughly divided into the following three categories: (1) A more lightweight network model is designed, which finds a better balance between precision and memory consumption. For example, MobileNet is a new lightweight model structure [6] that achieves VGG-16-level performance with a roughly 10-fold decrease in computation. (2) Software optimization techniques and mathematical methods are used to trim the network, so as to reduce the network calculation and memory requirements. For example, He Yang's team reduced the computation of ResNet-50 on ILSVRC-2012 by over 40% through progressive soft filter pruning, with only a 0.14% drop in accuracy [7]. (3) A hardware accelerator customized for deep learning algorithms, or an FPGA, is used to perform highly concurrent operations and complete the acceleration.
For example, Kang's team won the first prize of the LPIRC competition through algorithm optimization on an NVIDIA TX2 [8], and Ma's team increased the reasoning speed of tiny-YOLO to 20 frames per second using an FPGA [9,10].
This paper proposes the scheme of pipelining [11,12] and parallel processing on the basis of integrating the above schemes, with an aim to improve the network reasoning speed from the perspective of improving the system hardware resources utilization.
The main contributions of this paper are: (1) Deep learning is applied to kitchen target detection to detect whether the workers entering the kitchen are wearing kitchen overalls. In addition, the whole system is deployed on low-power embedded products, which greatly reduces the equipment cost and power consumption required by traditional deep learning target detection, so that deep learning can be applied in more use case scenarios. (2) This paper introduces the idea of software pipelining into deep learning to transform the network reasoning process from single tasks executed step by step into different stages of multiple concurrent tasks, thus greatly improving the utilization rate of hardware resources and the recognition speed of the system. (3) Speed matching between the anterior and posterior networks is achieved through unbalanced segmentation of the model and multithreaded optimization [13]. After the introduction of the circle buffer, an error in a single task does not affect other tasks. (4) The goal of this scheme is to complete the intelligent processing of video data on the "edge side". The videos of multiple cameras on the local switch can be analyzed by the embedded reasoning card to recognize whether the chefs wear the overalls correctly. The upper computer can display the recognition results, and set the recognition rate of the card, the threshold and the number of cameras.
The rest of this paper is arranged as follows: Section 2 introduces the implementation process of the system and each part; Section 3 introduces the deep learning acceleration methods based on the Hi3559A embedded processor; Section 4 displays the experimental processes and results; Section 5 gives the conclusions.

Scheme Introduction
The system scheme is shown in Figure 1, in which the reasoning card is connected to the local switch and the console software controls the reasoning card. The reasoning card may decode/recognize/code the video streams and display the coded detection results on the monitor. The results are uploaded to the front-end WEB system via JSON coding.

Embedded Deployment Scheme
The block diagram for embedded software is shown in Figure 2. Through multi-core deployment and collaboration, the overall computing power and resource throughput rate are significantly improved. In Figure 2, an A53 kernel is responsible for the network process, which can complete the data interaction and communication with the console; an A73 kernel is responsible for the core scheduling and controlling the coordination between image coding/decoding and pipelining; a DSP kernel is responsible for the video coding/output.


Interface of Console Software
The interface of the console software is shown in Figure 3: after the system starts, the console searches for 59A reasoning card devices in the local switch by means of multicast and adds the discovered devices to the menu bar. Then, one can search for the input camera and video sources, and finally set the parameters to start the device.


Introduction of Pipelining Technology
Pipelining is a parallelization technique that overlaps multiple instructions within an instruction cycle. The technology is generally applied in general-purpose computing processors. Pipelining reduces the waiting time of the CPU and keeps the CPU resources in continuous use, thus improving the overall utilization of the CPU. See Figure 4 for a diagram of pipelining.
The advantages of introducing pipelining are: (1) It greatly improves the utilization rate of hardware resources and realizes the parallel execution of multiple tasks at different stages at the same time; (2) Although pipelining operates on the data of multiple frames at the same time, the network only needs to be loaded once, which greatly reduces the memory occupation compared with running multiple network instances. This is very important for embedded devices; (3) Pipelining with a buffer zone can effectively reduce the processor's idle time through reasonable task distribution, without additional multi-process overhead; (4) Pipelining preserves the order relationship between consecutive frames: the data are processed in order, without additional thread synchronization tools.
The scheme of this paper draws lessons from the idea of pipeline technology in the CPU. By dividing the network and the task, the whole process is split into image acquisition, decoding, reasoning, post-processing and video coding output, which greatly improves the utilization rate of hardware resources and finally recognizes one image frame per reasoning cycle. Figure 4a shows that the cycle time of a single sequential processing cycle is 92 ms. After the idea of pipelining is introduced in Figure 4b, the processing cycle is compressed to about 30 ms due to the reduction of the waiting time between the hardware units.
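The effect of pipelining on throughput can be sketched with a simple timing model: in sequential execution each frame pays the sum of all stage times, while in a full pipeline the steady-state cycle equals the slowest stage. The five stages match the task split described above, but their individual durations below are hypothetical values chosen only to mirror the 92 ms and roughly 30 ms figures.

```python
# Minimal timing model of software pipelining. The individual stage
# durations are assumptions for illustration; only the totals mirror
# the figures reported in the text.

def sequential_cycle_ms(stage_times):
    # Without pipelining, each frame passes through every stage in turn,
    # so its latency (and the per-frame cycle) is the sum of all stages.
    return sum(stage_times)

def pipelined_cycle_ms(stage_times):
    # With pipelining, stages of different frames overlap; steady-state
    # throughput is limited only by the slowest stage.
    return max(stage_times)

# Hypothetical split: acquisition, decoding, reasoning, post-processing,
# coding output (all in ms).
stages = [10, 12, 30, 15, 25]
```

With these assumed values the sequential cycle is 92 ms while the pipelined cycle is 30 ms, matching the order of magnitude shown in Figure 4.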

Network Cutting
Complex models generally have redundant parameters to ensure a balance between performance and robustness. These redundant parameters can be removed by tools: the model parameters are reduced by cutting out link layers with small weights to achieve model compression [14,15]. In this cutting scheme, the number of network channels is kept at a multiple of 16 through additional inducing factors, which effectively improves the memory-access efficiency and computing performance of the HiSilicon hardware. Figure 5 shows the dimension changes before and after pruning.
L1 regularization is used for sparsity training in the training process. After L1 regularization training, a neural network has many parameters approaching 0, which contribute little to the final result. Sparsity training and pruning are combined in this paper: on the one hand, pruning removes the unimportant link layers; on the other hand, the weights are induced toward sparsity during training to make the network sparser.
After testing, the accuracy of network recognition dropped significantly when the pruning ratio was greater than 50%. Through multiple experimental comparisons, a pruning ratio of 40% was selected, reducing the computation to 58.59% of the original while reducing the accuracy by only 0.19%. This improves the reasoning speed of the network effectively while ensuring accuracy.
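The cutting procedure above can be sketched as ranking layers (here, convolution filters) by the L1 norm of their weights, discarding the weakest fraction, and aligning the surviving channel count to a multiple of 16. The 40% ratio and 16-channel alignment come from the text; the function names and toy weights are illustrative assumptions.

```python
def channels_to_keep(num_filters, prune_ratio, align=16):
    # Number of surviving channels, rounded down to a multiple of `align`
    # (but never below one alignment unit), as the cutting scheme requires.
    kept = int(num_filters * (1.0 - prune_ratio))
    return max(align, (kept // align) * align)

def prune_filters(filters, prune_ratio, align=16):
    # Keep the filters with the largest L1 weight norms; the rest,
    # driven toward zero by L1 regularization, are cut.
    keep = channels_to_keep(len(filters), prune_ratio, align)
    norms = [sum(abs(w) for w in f) for f in filters]
    order = sorted(range(len(filters)), key=lambda i: -norms[i])
    return sorted(order[:keep])  # indices of the surviving filters
```

For example, pruning a 256-channel layer at the 40% ratio leaves 144 channels after alignment to a multiple of 16.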

Model Segmentation and Quantization
A Hi3559AV100 with two neural network inference engines (NNIEs) is used as the main control chip. This scheme tries two deployment modes: parallel deployment and serial deployment. Parallel deployment means that a complete network runs separately in each NNIE and the input images are assigned to the two networks alternately. Serial deployment splits the network into two halves, with each NNIE running only a portion of the entire network. In theory, both schemes can achieve twice the speed of a single NNIE. However, comparative experiments showed that the parallel deployment scheme was limited by the memory bandwidth and could not unlock the full computing power, so we adopted serial deployment as the preliminary deployment scheme. Model quantization refers to coding techniques that reduce the memory space occupied by the parameters in the model, thus achieving model compression. Because HiSilicon quantizes the network structure (to 8 bits) in the process of network model transformation, a large number of computations are consumed in the quantization process. Therefore, to balance the anterior and posterior networks, the reasoning time of the anterior network needs to include the calculation time of the quantization process. After comparing various segmentation schemes, we finally divided the whole network model into two unbalanced sub-networks, which run independently in different network acceleration engines. The quantization Formula (1) was obtained by data fitting. In the formula, x represents the input floating-point number, i represents the quantized value after conversion, and data represents the set of input data. To ensure that the input of the ln function is a positive value, we consider whether x is greater than 0.
According to the dimension of the input image, the calculation amount of the quantization process is about 7 GFLOPS. Therefore, an unbalanced segmentation is required to balance the anterior and posterior networks. The segmentation results are shown in Figure 6 (the anterior network is marked with a red circle). Following the segmentation in Figure 6, the network calculation and data size are shown in Table 1.
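Since Formula (1) is a fitted, HiSilicon-specific mapping, the sketch below instead uses plain symmetric linear 8-bit quantization as a stand-in to illustrate the general idea: map floats to 8-bit integers via a scale factor, at the cost of a bounded rounding error. The scale is an assumption chosen to cover the roughly ±20 value range mentioned in the fixed-point discussion below.

```python
def quantize8(x, scale):
    # Map a float to a signed 8-bit integer (generic linear scheme,
    # not HiSilicon's fitted Formula (1)).
    q = round(x / scale)
    return max(-128, min(127, q))  # clamp to the int8 range

def dequantize8(q, scale):
    # Recover an approximation of the original float.
    return q * scale

# Assumed scale: map the observed value range of roughly [-20, 20]
# onto the signed 8-bit range.
SCALE = 20.0 / 127.0
```

For example, quantize8(3.7, SCALE) stores 3.7 in a single byte, and dequantizing it back yields a value within half a quantization step of the original.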

Fixed-Point Data Flow
In most CPUs, calculating with floating-point numbers is much slower than with integers. Taking the Hi3559 series of chips as an example, the inner NNIE core calculates fixed-point data much faster than floating-point numbers. Therefore, the data need to be transformed from floating-point numbers to fixed-point numbers before entering the computing core [16].
In the forward reasoning process of the network, only forward multiplication and addition are required because the weights are fixed, so the error after fixed-point conversion does not grow exponentially. Statistics and testing on the server show that the network weights and activation values of YOLOv3 are generally distributed between ±20. There is no significant accuracy difference on the COCO dataset after 12-bit fixed-point conversion, except that the confidence is a few percentage points lower relative to the original data [17]. Therefore, it is feasible to use fixed-point data for network reasoning.
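The float-to-fixed transformation can be sketched as a simple Q-format encoding; the sketch below assumes a signed format with 12 fractional bits (matching the 12-bit figure above), and the helper names are illustrative.

```python
FRAC_BITS = 12          # assumed fractional width
SCALE = 1 << FRAC_BITS  # 4096: value of 1.0 in fixed point

def to_fixed(x):
    # Encode a float as an integer carrying 12 fractional bits.
    return int(round(x * SCALE))

def to_float(q):
    # Decode a fixed-point integer back to a float.
    return q / SCALE

def fixed_mul(a, b):
    # Multiply two fixed-point values; the double-width product carries
    # twice the fractional bits, so shift the extra bits back out.
    return (a * b) >> FRAC_BITS
```

For example, to_fixed(1.5) is 6144, and fixed_mul(to_fixed(1.5), to_fixed(2.0)) decodes back to 3.0 with only integer arithmetic in the multiply.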
The coding information of the network output results at certain dimensions is shown in Figure 7. Supposing two tensors are adopted in the network, x and y are the offsets of the bounding box center relative to the upper left corner of the grid; w and h are the width and height of the bounding box, respectively. C is used to determine the credibility of the target, and P(C) determines the probability that the target belongs to a certain class. Formulas (2)-(5) represent the transformation between the relative and the absolute coordinates:

b_x = σ(t_x) + c_x (2)
b_y = σ(t_y) + c_y (3)
b_w = p_w · e^(t_w) (4)
b_h = p_h · e^(t_h) (5)

Here, c_x and c_y represent the position of the upper left corner of the grid cell relative to the entire picture, and t_x and t_y represent the offsets of the center position of the recognition target relative to the upper left corner of the current grid cell. To prevent the calculated center coordinates from exceeding the range of the current grid cell, t_x and t_y are normalized using the sigmoid function σ; b_x and b_y are the absolute coordinates of the target center position. p_w and p_h represent the width and height of the anchor box, respectively; t_w and t_h represent the width and height directly predicted by the bounding box; b_w and b_h indicate the actual predicted width and height, respectively.
The network output result is a fixed-point number, while the network screening and NMS steps in the traditional post-processing process require floating-point results. In this design, the conversion of the network output into floating-point numbers is cancelled. By reconstructing the filtering algorithm, fixed-point numbers are used directly to complete the data filtering and accuracy calculations, which greatly reduces the complexity of the calculations.
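Formulas (2)-(5) can be applied directly in code. The sketch below decodes one box from grid-relative predictions; variable names follow the formulas, and the concrete inputs in the example are illustrative.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    # Formulas (2)-(5): grid-relative predictions -> absolute box parameters.
    bx = sigmoid(tx) + cx   # (2): sigmoid keeps the center inside the cell
    by = sigmoid(ty) + cy   # (3)
    bw = pw * math.exp(tw)  # (4): anchor width scaled by the prediction
    bh = ph * math.exp(th)  # (5): anchor height scaled by the prediction
    return bx, by, bw, bh
```

With zero offsets the center lands in the middle of cell (c_x, c_y) and the box takes exactly the anchor size.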

Figure 7 shows the coding of the 7 × 7 × 11 network output tensor: x, y, w, h, C; x, y, w, h, C; P(c). The pseudo-code of Algorithm 1 shows the process of filtering the network results after reloading.
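The key idea of that filtering can be sketched as follows: instead of converting every fixed-point confidence to a float, the float threshold is converted to fixed point once, and boxes are then filtered by pure integer comparison. The list representation and the FRAC_BITS value are assumptions, not the paper's exact algorithm.

```python
FRAC_BITS = 12  # assumed fixed-point format of the NNIE output

def filter_boxes_fixed(conf_fixed_list, conf_threshold):
    # Convert the float threshold to fixed point once...
    thr_fixed = int(conf_threshold * (1 << FRAC_BITS))
    # ...then keep boxes by integer comparison, with no per-box
    # fixed-to-float conversion.
    return [i for i, c in enumerate(conf_fixed_list) if c > thr_fixed]
```

A threshold of 0.5 becomes the integer 2048, so a box with fixed-point confidence 2049 passes while 1000 is rejected.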

Multithreaded Optimization
After the overall task is segmented into mutually independent sub-tasks, the combined tasks are deployed as multiple threads on the embedded main controller. Thread access safety is guaranteed through the mutual exclusion and synchronization mechanisms of the threads. The data flow diagram for the tasks in this design is shown in Figure 8.

Shared Memory in the Critical Zone
There is a single-directional data interaction between the individual tasks. Current multithreaded synchronization and mutual exclusion mechanisms include semaphores, read/write locks, condition variables, mutex locks, spin locks, etc. As shown in Figure 8, the data volume of the system is huge and the overall real-time performance of the system must be guaranteed by a stable data processing time, so we chose a shared memory method for data exchange.
Shared memory refers to mapping a piece of memory so that it can be accessed jointly by multiple processes or threads; it is a resource in the critical region [18,19]. As shown in Figure 9, shared memory can communicate a large amount of data without additional data replication. Because shared memory is accessed directly in the memory space, atomicity between threads cannot be guaranteed. Therefore, shared memory itself does not provide a solution for process synchronization, and the problem of inter-thread synchronization needs to be solved with other synchronization tools.
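Since shared memory provides no synchronization by itself, each critical zone must be paired with an external synchronization tool. A minimal sketch, assuming a mutex plus condition variables guarding a bounded buffer shared between two pipeline stages (the class name and capacity are illustrative):

```python
import threading
from collections import deque

class SharedBuffer:
    # A bounded buffer standing in for the shared-memory critical zone:
    # one mutex serializes all access, and two condition variables block
    # the producer when the buffer is full and the consumer when empty.
    def __init__(self, capacity):
        self.buf = deque()
        self.capacity = capacity
        self.lock = threading.Lock()
        self.not_empty = threading.Condition(self.lock)
        self.not_full = threading.Condition(self.lock)

    def put(self, item):
        with self.not_full:
            while len(self.buf) >= self.capacity:
                self.not_full.wait()
            self.buf.append(item)
            self.not_empty.notify()

    def get(self):
        with self.not_empty:
            while not self.buf:
                self.not_empty.wait()
            item = self.buf.popleft()
            self.not_full.notify()
            return item
```

A producer thread can safely put frames while a consumer thread gets them, with no data replication beyond the item references themselves.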

Introduction of a Circle Buffer
The network processing time of each pipelining task is affected by the working environment, the number of targets in the picture, and the quality of the input image. Unexpected factors can slow execution and block the pipeline, thus affecting the overall speed of the system, and can even lead to reading dirty data or accessing invalid memory. Among all the links, image acquisition is the most vulnerable to the network environment, the reading speed of the storage media and other factors. Therefore, it is necessary to add an appropriate buffer zone to the original scheme to ensure the overall stability of the system.
The entire reasoning process requires a large amount of data and frequent requests to allocate and release memory. A linear buffer may produce a large amount of memory fragmentation. To avoid system collapse due to excessive memory fragmentation, this scheme establishes a circular buffer pool mechanism that effectively avoids the frequent creation, allocation and release of memory in a linear buffer. As shown in Figure 10, the circle buffer sets read/write indicators after the memory is requested. The write pointer points to the next available location in the current memory segment; when the current location address exceeds the requested memory area, the write pointer returns to the start position of the area. The read pointer reads the valid data in memory along the route of the write pointer. This design is a one-way communication without multiple writers, so a mutex lock is not required to ensure memory safety.
With the circle queue buffer, the system's memory usage increased by 12% and the speed increased to about 28 frames per second. Moreover, the system was more stable during long periods of work: after six hours of testing, no access to invalid memory occurred.
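The read/write-pointer behaviour described above can be sketched as a single-producer/single-consumer circle buffer; the capacity and slot contents below are illustrative.

```python
class CircleBuffer:
    # Fixed-size circle buffer: the write index points at the next free
    # slot and wraps to the start when it passes the end of the requested
    # area. One slot is kept free to distinguish "full" from "empty".
    # With a single reader and a single writer, no mutex is needed,
    # matching the one-way design described in the text.
    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.read = 0
        self.write = 0

    def push(self, item):
        nxt = (self.write + 1) % len(self.slots)
        if nxt == self.read:
            return False  # full: drop rather than block the pipeline
        self.slots[self.write] = item
        self.write = nxt
        return True

    def pop(self):
        if self.read == self.write:
            return None  # empty
        item = self.slots[self.read]
        self.read = (self.read + 1) % len(self.slots)
        return item
```

Because the memory is allocated once up front and only the two indices move, no per-frame allocation or release occurs, which is what avoids the fragmentation of a linear buffer.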

Experiments
A Hi3559AV100 was selected as the embedded main control chip, and the dataset was a custom dataset for cook assistant overalls. The training platform used an Intel i7-6850K CPU @ 3.60 GHz with three NVIDIA GTX 1080 Ti GPUs; the operating system was Ubuntu 18.04.

Experimental Environment Preparation
Numerous parameter computations are required for deep learning network training, so this part needs to run on a server with sufficient computational capability. The computer configuration used for training is shown in Table 2. The network training process relied on the darknet framework [20], and the Caffe framework [21] was required for the model transformation. After installing the above environment, the upsampling layer needed to be added manually in Caffe to ensure correct model conversion.

Dataset Used for the Experiment
The dataset used herein is a self-built kitchen overall dataset. A total of 33,254 monitoring pictures were collected in 37 different scenes, forming what we call the cook dataset. There are eight categories in the dataset: 0-gray, 1-black, 2-white, 3-blue, 4-red, 5-yellow, 6-pink and 7-others. Subsequently, the dataset was expanded using tilting, mirroring, Gaussian blur and histogram equalization, resulting in 43,255 pictures in the final dataset, as shown in Figure 11.


Network Result Validation
To verify the overall identification algorithm, we first needed to train the weights of the YOLO network. The ratio of the training, validation and test sets was set to 8:1.5:0.5, giving 34,604, 6488 and 2163 images, respectively.
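The 8:1.5:0.5 split above can be reproduced with a short helper; this is a generic sketch (the `split_dataset` name and fixed seed are our own choices, not from the paper), but applying the stated ratios to 43,255 images does yield the 34,604/6488/2163 split reported.

```python
import random

def split_dataset(paths, ratios=(8, 1.5, 0.5), seed=42):
    """Shuffle the file list once, then partition it by the given ratios."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)  # fixed seed for a reproducible split
    total = sum(ratios)
    n = len(paths)
    n_train = round(n * ratios[0] / total)
    n_val = round(n * ratios[1] / total)
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

train, val, test = split_dataset([f"img_{i}.jpg" for i in range(43255)])
# len(train), len(val), len(test) → 34604, 6488, 2163
```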
The hyper-parameter settings for the YOLO network are shown in Table 3, and the loss curve from the training process is shown in Figure 12. The loss curve reflects the difference between the model's recognition results and the actual object categories. As the number of training samples seen increases, this difference steadily decreases to a stable value, indicating that the recognition rate keeps improving during training and approaches the theoretically best recognition result.

The performance of the network was verified by examining the precision-recall curve; the statistical results are shown in Figure 13. Apart from the category "others", the gray category achieves the highest accuracy at 99% and the pink category the lowest at 75%; the mean average precision (mAP) is about 90.25%. This performance meets the metric requirements.

Reasoning Speed and Stability Test
The system's working status and stability were tested by simulation in the laboratory. The schematic diagram of the equipment connections is shown in Figure 14, and the system's operating status is shown in Figure 15. The onboard HDMI output interface is connected to a monitor for real-time viewing. The overalls recognition console software runs on a computer and communicates with the embedded device over the network.

Figure 14. Connection diagram of the testing platform (HDMI output, serial port output, network interface, power).


Finally, in order to test the stability of the solution under long-term operation, we recorded the system frame rate, chip temperature and power consumption during three hours of continuous operation. The recorded data are shown in Figure 16. The test data show that the frame rate remains essentially stable at about 28 frames per second, with no significant frame drops during the test period, and the chip surface temperature stabilizes after an initial rise. These results indicate that the scheme can run for long periods in actual production environments and meets the real-time requirements of this paper.
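A long-run stability log of this kind can be collected with a small monitoring helper. The sketch below is our own illustration, not the paper's tooling: the sysfs thermal path is an assumption that varies by SoC and kernel, and `FpsMeter` is a hypothetical name for a simple exponential moving average over inter-frame intervals.

```python
import time

# Platform-specific assumption: Linux exposes SoC temperature in millidegrees here.
THERMAL_PATH = "/sys/class/thermal/thermal_zone0/temp"

def read_temp_c(path=THERMAL_PATH):
    """Read the SoC temperature in degrees Celsius from sysfs."""
    with open(path) as f:
        return int(f.read().strip()) / 1000.0

class FpsMeter:
    """Smoothed frames-per-second estimate from successive frame timestamps."""
    def __init__(self, alpha=0.1):
        self.alpha, self.last, self.fps = alpha, None, 0.0

    def tick(self, now=None):
        """Call once per processed frame; returns the smoothed FPS."""
        now = time.monotonic() if now is None else now
        if self.last is not None:
            inst = 1.0 / max(now - self.last, 1e-6)
            self.fps = inst if self.fps == 0.0 else \
                (1 - self.alpha) * self.fps + self.alpha * inst
        self.last = now
        return self.fps
```

A monitoring loop would then call `tick()` per frame and periodically append the FPS, `read_temp_c()` and a power reading to a log file for the three-hour run.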

Results Comparison and Analysis
In this section, to compare the hardware acceleration scheme of this article with the TensorRT software acceleration scheme on the NVIDIA platform, both platforms were tested for the mAP and FPS of YOLOv3-416 on the COCO dataset. The comparison results in terms of accuracy and real-time performance are shown in Figure 17.

Figure 17 reports the mean average precision (mAP) and frames per second (FPS) of YOLOv3 (IoU = 0.75) on the COCO dataset under different hardware platforms. We chose NVIDIA's embedded edge computing platform, the Jetson AGX, and a server with a GTX 1080Ti as references for this solution.
Comparing the results before and after acceleration, the FPS of both acceleration schemes improved significantly, reaching a "usable" level, while the accelerated mAP showed no significant drop and remained within an acceptable range.
Even when compared with the 1080Ti server, the mAP and FPS of the solution in this article on the COCO dataset are competitive. The influence of different pruning ratios on accuracy and speed is shown in Figure 18. According to the comparison results, when the pruning ratio exceeds 20%, the accuracy drops rapidly as the ratio increases; therefore, 20% was chosen as the final network pruning ratio.
This paper also compared the influence of different deployment and segmentation schemes on recognition accuracy and frame rate; the results are shown in Table 4. According to the comparison results, the pipelining technique preserves recognition accuracy while greatly reducing memory consumption. Compared with double-threaded simultaneous reasoning, pipelining greatly reduced both the memory consumption and the time spent reading from disk and writing output images. The design also constrained the network dimensions to multiples of 16, which further improved the reasoning speed of the network.
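The two deployment ideas discussed above, a two-stage pipeline that overlaps preprocessing with inference and rounding network dimensions up to a multiple of 16, can be sketched as follows. This is a generic producer-consumer illustration under our own assumptions (the stage functions and names are hypothetical), not the actual Hi3559A/NNIE implementation:

```python
import threading
import queue

def align16(n):
    """Round a dimension up to the next multiple of 16."""
    return (n + 15) // 16 * 16

def pipeline(frames, stage1, stage2, depth=2):
    """Two-stage pipeline: stage1 (e.g. preprocessing) runs in a worker
    thread and overlaps with stage2 (e.g. inference) in the caller."""
    q = queue.Queue(maxsize=depth)
    results = []

    def producer():
        for f in frames:
            q.put(stage1(f))
        q.put(None)  # sentinel: no more frames

    t = threading.Thread(target=producer)
    t.start()
    while (item := q.get()) is not None:
        results.append(stage2(item))
    t.join()
    return results
```

The bounded queue (`depth`) keeps memory use fixed, which mirrors the memory saving the paper attributes to pipelining versus running two full inference threads simultaneously.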

On-Site Deployment Test
During the on-site deployment test, the system was installed in the kitchen and the data were collected and saved. A total of 1000 images from 10 cameras were randomly selected for the result analysis; the results are shown in Table 5. As shown in Table 5, the accuracy rate in most scenes exceeds 97%, although the miss rate in some scenes is relatively high; the average accuracy rate of 98.29% basically meets the design requirements. Field test results are shown in Figure 19.

Conclusions
This paper proposed a scheme for deploying neural networks on embedded devices and applied it to the task of kitchen overalls recognition. The solution significantly reduces the power consumption and equipment cost required for neural-network-based target recognition, further expanding the application range of neural networks. Through pruning, segmentation and quantization of the network model, together with algorithm optimizations for the Hi3559A embedded processor hardware, the design achieves good recognition accuracy while increasing the recognition frame rate to about 28 FPS, thereby achieving the expected design goal (a recognition frame rate greater than 25 FPS). We showed that, with suitable optimization, the embedded platform can complete the kitchen overalls recognition task. In future work, we will try to incorporate some of the latest software acceleration techniques and integrate them with the existing hardware acceleration scheme to achieve better acceleration.
