YOLO-GD: A Deep Learning-Based Object Detection Algorithm for Empty-Dish Recycling Robots

Abstract: Due to the workforce shortage caused by the declining birth rate and aging population, robotics is one of the solutions for replacing humans and overcoming this urgent problem. This paper introduces a deep learning-based object detection algorithm for empty-dish recycling robots that automatically recycle dishes in restaurants, canteens, etc. In detail, a lightweight object detection model, YOLO-GD (Ghost Net and Depthwise convolution), is proposed for detecting dishes in images, such as cups, chopsticks, bowls, and towels, and an image-processing-based catch point calculation is designed for extracting the catch point coordinates of the different types of dishes. The coordinates are used to recycle the target dishes by controlling the robot arm. A Jetson Nano is equipped on the robot as the computing module, and the YOLO-GD model is quantized with TensorRT to improve inference performance. The experimental results demonstrate that the YOLO-GD model is only 1/5 the size of the state-of-the-art model YOLOv4, and the mAP of YOLO-GD reaches 97.38%, 3.41% higher than that of YOLOv4. After quantization, the YOLO-GD model decreases the inference time per image from 207.92 ms to 32.75 ms, and the mAP is 97.42%, slightly higher than that of the model without quantization. Through the proposed image processing method, the catch points of various types of dishes are effectively extracted. The functions of empty-dish recycling are realized and will lead to further development toward practical use.


Introduction
Currently, the workforce shortage is accelerating due to the declining birth rate and aging population of the world, which have brought heavy pressure on economic and social development. We target this problem and design an automatic empty-dish recycling robot for collecting empty dishes, such as cups, bowls, chopsticks, and towels, after breakfast, lunch, or dinner in a restaurant, canteen, cafeteria, etc.
The global robot industry has entered a period of rapid development, and robots have been widely used in various fields, such as factory automation [1], medical services [2], search-and-rescue [3], automated kitchens [4,5], etc., gradually replacing humans to overcome the problem of the workforce shortage. Yin et al. present a table cleaning and inspection method using a Human Support Robot (HSR), which can be operated in a typical food court setting; in addition, a lightweight deep convolutional neural network (DCNN) is proposed to recognize the food litter on top of the table [6]. The commercial feeding assistant robot acquires food without feedback and moves to a pre-programmed location to deliver the food. Candeias et al. use visual feedback to determine whether the food is captured; thus, the food is effectively brought to the user's mouth rather than to a pre-programmed feeding location [7]. However, robots are still less involved in the food domain. The main contributions of this paper are as follows:
• We design a lightweight dish detection model, YOLO-GD, for empty-dish recycling robots, which significantly reduces the number of parameters and improves the detection accuracy.
• We design a dish catch point method to effectively extract the catch points of different types of dishes. The catch points are used to recycle the dishes by controlling the robot arm.
• We realize the quantization of the lightweight dish detection model YOLO-GD without losing accuracy and deploy it on the embedded mobile device, Jetson Nano.
• This paper also creates a dish dataset named Dish-20 (http://www.ihpc.se.ritsumei.ac.jp/obidataset.html; accessed on 28 March 2022), which contains 506 images in 20 classes. It not only provides training data for object detection in this paper but also helps the field of empty-dish recycling automation.
The rest of the paper is organized as follows. Section 2 introduces related work on robotics applications, object detection, and model quantization deployed on embedded devices. Section 3 gives a detailed explanation of the deep learning-based object detection equipped on the empty-dish recycling robot, dish catch point extraction, the TensorRT quantization model, and deployment on embedded mobile devices. Section 4 presents the results of the relevant model comparison and the experimental results without and with model quantization, including detection accuracy, model weights, inference speed, etc. A discussion and future work are provided in Section 5. Finally, Section 6 concludes the paper.

Related Work
The realization of automatic empty-dish recycling can effectively replace human beings in completing this work and alleviate the problem of workforce shortage. This paper aims to design an empty-dish recycling robot for realizing automatic empty-dish recycling. However, high-accuracy object detection for detecting the target dishes and the calculation of catch points for controlling the robotic arm are still important issues. Furthermore, compacting the object detection model for the embedded robot is also a challenge in this research. In this section, we review current related research on robotics, object detection algorithms, model quantization, and deployment on embedded devices.

Research of Robotics
Fukuzawa et al. proposed a robotic system consisting of a robotic manipulator with six degrees of freedom, a robotic hand capable of catch and suction operations, and a 3D camera for dish detection, to perform the take-out operation of various dishes from a commercial dishwasher [4]. Zhu et al. developed an automated dish tidying-up robot mechanism for cleaning dishes in a self-service restaurant with a large number of dishes. The dishes are placed on the conveyor belt by the guest, and the robot is mainly responsible for the process of sorting and collecting the dishes [20]. Kawamura et al. designed a three-degree-of-freedom micro-hand consisting of a thin pneumatic rubber actuator generating three degrees of freedom of motion. The micro-hand contracts in the longitudinal direction and bends in any direction by changing the air pressure pattern applied to the artificial muscles, and may be expected to be used in areas such as the flexible catching of a dish [21]. Kinugawa et al. developed a new underactuated robotic hand for circular standard dishes (square or other types of dishes are not considered), which is an important factor for the realization of a fully automatic dishwashing system [22].

Object Detection
Object detection models based on deep learning are capable of achieving high-speed object detection and object bounding box segmentation. These models are mainly divided into two categories. One is the one-stage detection algorithms, including YOLO [23], SSD [24], RetinaNet [25], etc. The other is the two-stage detection algorithms, including R-CNN [26], Fast R-CNN [27], Faster R-CNN [28], etc. YOLOv4 is widely adopted due to its high speed, high accuracy, and relatively simple design [29].
YOLO predicts multiple bounding box (BBox) positions and classes at once, regards detection as a regression problem, and combines the two stages of candidate region proposal and detection, giving it a simple structure and fast detection speed [30]. A modified Tiny YOLOv2 is proposed to recognize small objects such as the shuttlecock; by modifying the loss function, the detection speed for small objects is improved adaptively, and the model can be applied to other tasks of detecting small objects at high speed [31]. Zhang et al. proposed a state-of-the-art lightweight detector, namely CSL-YOLO. Through a lightweight convolution method, the Cross-Stage Lightweight (CSL) Module, it generates redundant features from cheap operations with excellent results [32]. TRC-YOLO is proposed by pruning the convolution kernel of YOLOv4-tiny and introducing an expanded convolution layer in the residual module of the network, which improves the model's mean average precision (mAP) and real-time detection speed [33]. A lightweight three-stage detection framework composed of a Coarse Region Proposal (CRP) module, the lightweight Railway Obstacle Detection Network (RODNet), and a post-processing stage is used to identify obstacles in single railway images [34].

Quantification and Deployment
Hirose et al. simultaneously measured the input data and the distribution of values in the middle layers during quantization with TensorRT, suppressing the deterioration of accuracy caused by quantization [35]. Jeong et al. proposed a parallelization approach to maximize the throughput of a single deep learning application using GPUs and NPUs by exploiting various types of parallelism in TensorRT [36]. Jeong et al. proposed a TensorRT-based framework that supports various optimization parameters to accelerate deep learning applications targeting Jetson devices with heterogeneous processors, including multithreading, pipelining, buffer assignment, and network duplication [37]. Stäcker et al. analyzed two representative object detection networks deployed on edge AI platforms and observed a slight advantage in using TensorRT for convolutional layers and TorchScript for fully connected layers. In terms of the optimized setup selection for deployment, quantization significantly reduces the runtime while having only a small impact on the detection performance [38].
A novel neural network based on the SSD framework, including a feature extractor using the improved MobileNet and a lightweight module, is proposed for fast and low-cost high-speed railway intrusion detection. It is deployed on the Jetson TX2 with quantization by TensorRT, achieving 98.00% mAP with a 38.6 ms average processing time per frame [39]. The advantage of YOLO has been proven in a wide range of applications [40,41], and its excellent real-time performance and small number of network parameters enable YOLO to be applied in edge detection. Yue et al. proposed a deep learning-based empty-dish recycling robot, using YOLOv4 as the dish detection network, applying FP16 quantization to the detection model with TensorRT, and deploying it on the Jetson Nano; more than 96.00% accuracy on Precision, Recall, and F1 values and an inference speed of 0.44 s per image were achieved. However, this method only detects dishes and does not extract catch points, and the inference time of the detection model does not meet the requirements of real-time detection [42]. Wang et al. used the neural network YOLOv4 to detect dirty eggs and used TensorRT to accelerate the detection process; the system was deployed on the Jetson Nano. The method obtained an accuracy of 75.88% and achieved a speed of 2.3 frames per second (FPS) [15].

Figure 1a shows the proposed empty-dish recycling robot, which consists of a robotic body, robotic arms, cameras, robot fingers, and recycling stations. Figure 1 shows the workflow of the empty-dish recycling robot. (a) Shows the initial state of the empty-dish recycling robot after arriving at the food-receiving place. The dish detection model is loaded in this process, waiting for the camera to take an image for detection.
(b) Shows the process of collecting images and detecting dishes: images are taken by an Intel RealSense D435, the dish is detected by the proposed YOLO-GD, and the catch points of the different dish types are calculated by different image processing methods. (c) Shows the process of the robot catching dishes. Through the dish category and catch points provided in (b), the embedded control system controls the robotic arm to catch the dish. (d) Shows the process of recycling the dish and putting it into the recycling station.

YOLO-GD Framework
To achieve high-speed and real-time dish detection, a lightweight detection network YOLO-GD is proposed, as shown in Figure 2. The network mainly consists of three parts: feature extraction, feature fusion, and result prediction. The purpose of network development is to realize a high-accuracy network with low computation. YOLO-GD adopts a lightweight feature extraction module; in addition, both the depthwise separable convolution and pointwise convolution are used to replace the traditional convolution operation, which effectively reduces the computational overhead of the network.
In the feature extraction stage, Ghost Net [19,43] replaces the CSPDarknet53 [44] module in the YOLOv4 network. Ghost Net aims to generate more feature maps with cheap operations. The main operation generates Ghost feature maps by applying a series of linear transformations to the original feature maps, extracting the required information from the original features at a low overhead. Figure 3 details each module in the Ghost Net. G_Bottleneck is mainly composed of G_bottleneck, where s represents the stride size, SE represents adding an SE Net [45] module, and "×" represents an iterative operation. G_bottleneck is mainly composed of Ghost modules. In the case where stride = 1, the first Ghost module is used as an expansion layer to increase the number of channels, and the second Ghost module reduces the number of channels to match the shortcut path. The input of the first Ghost module and the output of the second Ghost module are connected by a shortcut. After the first layer, batch normalization (BN) and the ReLU nonlinearity are used; after the second layer, only BN is used. In the case where stride = 2, the shortcut path uses depthwise separable convolution with stride = 2 for downsampling and pointwise convolution for channel adjustment.

Figure 3. Detailed explanation of the feature extraction module.
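As an aside on the cost saving, the parameter count of a Ghost module can be compared with that of a standard convolution. The sketch below is illustrative only; the layer sizes, kernel sizes, and the ratio s = 2 are assumptions for the example, not values from this paper:

```python
def conv_params(c_in, c_out, k):
    # Parameters of a standard k x k convolution (bias terms ignored).
    return c_in * c_out * k * k

def ghost_params(c_in, c_out, k=3, d=3, s=2):
    # Ghost module: a primary convolution produces c_out / s "intrinsic"
    # feature maps, then cheap d x d depthwise operations generate the
    # remaining (s - 1) * c_out / s "ghost" maps.
    intrinsic = c_out // s
    primary = c_in * intrinsic * k * k
    cheap = intrinsic * (s - 1) * d * d
    return primary + cheap

standard = conv_params(256, 256, 3)           # 589,824 parameters
ghost = ghost_params(256, 256, k=3, d=3, s=2)  # 296,064 parameters
print(standard, ghost, round(standard / ghost, 1))  # ratio close to s = 2
```

With the ratio s = 2, roughly half of the output maps come from cheap linear operations, which is where the parameter reduction of the Ghost Net backbone comes from.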
In the feature fusion and result prediction stages, spatial pyramid pooling (SPP) is inserted at the output of the network to extract spatial feature information of different sizes and increase the receptive field of the network. SPP can improve the robustness of the model to spatial layout and object variability [46]. The calculation of the SPP is as follows:

F_out = C(MaxPool_f13×13(F), MaxPool_f9×9(F), MaxPool_f5×5(F), F) (1)

Among them, F means the feature map, C means the concatenate operation, f 5×5 means a 5 × 5 filter, and MaxPool means the max-pooling operation. The Path Aggregation Network (PANet) [47] can fuse features between three different output network layers, as shown in Figure 2. PANet obtains geometric detail information from the bottom of the network and contour information from the top of the network, ensuring rich semantic information and strengthening the feature extraction ability of the network.
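The SPP computation can be sketched in plain Python. The 5/9/13 kernel sizes follow the standard YOLOv4 SPP configuration; the tiny 4 × 4 single-channel feature map is a made-up example:

```python
def maxpool_same(fmap, k):
    # k x k max pooling with stride 1 and "same" padding, so the
    # spatial size of the feature map is preserved.
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            out[i][j] = max(fmap[y][x]
                            for y in range(max(0, i - r), min(h, i + r + 1))
                            for x in range(max(0, j - r), min(w, j + r + 1)))
    return out

def spp(fmap, kernels=(5, 9, 13)):
    # Pool at each kernel size, then concatenate the pooled maps with the
    # input along the channel axis (here: a list of channel groups).
    return [maxpool_same(fmap, k) for k in kernels] + [fmap]

fmap = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
branches = spp(fmap)
print(len(branches))  # 4 channel groups: three pooled maps plus the input
```

Because all pooling uses stride 1 with same padding, every branch keeps the input's spatial size, which is what allows the channel-wise concatenation in Equation (1).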
YOLOHead predicts the classes, confidence, and coordinate information of the dish at the same time by setting the number of filters of the convolution operation. A detailed explanation is shown in Figure 4. To reduce the overhead of the model in the feature fusion and result prediction stages, all 3 × 3 convolution operations are replaced by a sequence of 1 × 1 convolution, 3 × 3 depthwise separable convolution, and 1 × 1 convolution [30].
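For reference, the number of output filters a YOLO head needs follows directly from the prediction layout: each anchor predicts 4 box coordinates, 1 confidence score, and one score per class. A minimal sketch (the three-anchors-per-head value matches common YOLO configurations and is an assumption here):

```python
def head_filters(num_classes, num_anchors=3):
    # Each anchor predicts 4 box offsets + 1 objectness confidence
    # + num_classes class scores.
    return num_anchors * (num_classes + 5)

print(head_filters(20))  # 75 output filters per head for the 20-class Dish-20
```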

Extraction of Catch Points
When extracting catch points in the whole image, mutual interference exists between different classes. Therefore, we use the detection results and the coordinate information of YOLO-GD to segment the target dish and extract the catch points. For different dish types, we use different feature point extraction methods. The types are mainly divided into circle, ellipse, square, and polygon.
• Circle: The Hough transform is used to detect the contours of the circular dish. The equation of a circle in Cartesian coordinates is shown in Equation (2):

(x - a)^2 + (y - b)^2 = r^2 (2)
where (a, b) is the center of the circle and r is the radius. The circle can also be expressed in parametric form as Equation (3):

x = a + r cos θ, y = b + r sin θ (3)
In the Cartesian xy coordinate system, all points on the same circle satisfy the same circle equation, so they map to the same point in the a-b-r parameter space, and the number of votes accumulated at that point equals the total number of edge pixels on the circle. By counting the votes at each intersection in the a-b-r space, points with more votes than a threshold are considered circles. For the segmented circular dish images, grayscale conversion, Canny edge detection [48], and Gaussian filtering [49] are performed to extract the contours of the dish and reduce interference. Through Hough transform circle detection, the center coordinates, radius, and other information of the contour are extracted. The center point of the circle is moved up by a distance of one radius and set as the catch point [50].
• Ellipse: In the Cartesian xy coordinate system, consider the maximum distance from each candidate point to the ellipse contour: the point with the smallest such maximum distance is the center of the ellipse, and that smallest maximum distance is the length of the long axis of the ellipse. A rotated ellipse is described by Equation (4):

((x - p) cos θ + (y - q) sin θ)^2 / a^2 + ((y - q) cos θ - (x - p) sin θ)^2 / b^2 = 1 (4)
where (p, q) is the center of the ellipse, a and b are the major and minor axes of the ellipse, respectively, and θ is the rotation angle. For the elliptical dish, grayscale conversion and Canny edge detection are used to extract ellipse features. The disconnected contour lines are connected and their boundaries are smoothed by the closing operation in morphological processing [10]. The contour finding method is used to find the contour points of the ellipse, and the center, major axis, minor axis, and rotation angle of the ellipse are extracted by ellipse fitting in OpenCV.
For the segmented elliptical dish image, the coordinates of the catch point are given by Equation (5).
• Square: The straight-line equation in polar (Hough) form is as follows:

ρ = x cos θ + y sin θ (6)

where ρ is the distance from the straight line to the origin and θ is the angle of the line's normal with respect to the positive direction of the Cartesian x-axis. The points on a straight line are transformed in the polar coordinate plane ρ-θ into a set of sinusoids intersecting at one point. Two-dimensional statistics are accumulated on the polar coordinate plane and the peak values are selected; each peak gives the parameters of a straight line in the image space, thus realizing straight-line detection in Cartesian coordinates. We consider the intersection of two lines, L1 and L2, in Cartesian coordinates, with L1 defined by two distinct points, (x1, y1) and (x2, y2), and L2 defined by two distinct points, (x3, y3) and (x4, y4).
where a and b are the direction vectors of L1 and L2, respectively, and α is the intersection angle between L1 and L2:

cos α = (a · b) / (|a||b|) (7)

The intersection P of L1 and L2 can be defined using determinants:

P_x = (det(x1 y1; x2 y2)(x3 - x4) - (x1 - x2) det(x3 y3; x4 y4)) / ((x1 - x2)(y3 - y4) - (y1 - y2)(x3 - x4))
P_y = (det(x1 y1; x2 y2)(y3 - y4) - (y1 - y2) det(x3 y3; x4 y4)) / ((x1 - x2)(y3 - y4) - (y1 - y2)(x3 - x4)) (8)

The determinants are written out as:

det(x1 y1; x2 y2) = x1 y2 - y1 x2, det(x3 y3; x4 y4) = x3 y4 - y3 x4 (9)

The edge features of the square dish are highlighted by grayscale conversion and Canny edge detection. The straight lines in the image are extracted using Hough transform line detection [51], and the lines with angles of approximately 90° are selected by calculating the angles of all the lines. The intersection points of the retained lines are calculated, and the minimum circumscribed rectangle of all intersection points is computed. The catch point is the midpoint of one side of the minimum circumscribed rectangle.
• Polygon: For the irregular dish, grayscale conversion, Gaussian filtering, and binarization are performed to clarify the dish contours. The contour finding function in OpenCV is applied to find all connected contours, and the largest one is taken as the feature of the dish. All points in the contour are enclosed by the minimum circumscribed rectangle, and its center point is extracted as the catch point.
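Two of the catch-point computations above reduce to simple geometry. The sketch below illustrates the circle case (the center moved up by one radius) and the determinant-based line intersection used for the square dish; it is a simplified illustration under the stated geometry, not the paper's implementation:

```python
def circle_catch_point(center, radius):
    # The detected circle centre is moved "up" by one radius to give the
    # catch point on the dish rim (the image y-axis points downward).
    cx, cy = center
    return (cx, cy - radius)

def intersection(p1, p2, p3, p4):
    # Intersection P of line L1 through p1, p2 and line L2 through p3, p4,
    # using the standard determinant formula.
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p1, p2, p3, p4
    d = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if d == 0:
        return None  # the lines are parallel
    a = x1 * y2 - y1 * x2  # det(x1 y1; x2 y2)
    b = x3 * y4 - y3 * x4  # det(x3 y3; x4 y4)
    return ((a * (x3 - x4) - (x1 - x2) * b) / d,
            (a * (y3 - y4) - (y1 - y2) * b) / d)

print(circle_catch_point((320, 240), 50))            # (320, 190)
print(intersection((0, 0), (2, 2), (0, 2), (2, 0)))  # (1.0, 1.0)
```

In the full pipeline, the intersections of the near-perpendicular Hough lines would feed the minimum circumscribed rectangle computation described above.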

Model Quantification and Deployment
TensorRT is a high-performance deep learning inference optimizer that provides low-latency, high-throughput deployment inference for deep learning applications. TensorRT supports INT8 and FP16 computation to accelerate inference by achieving an ideal trade-off between reducing computation and maintaining accuracy. TensorRT provides only forward propagation, i.e., inference, without a training process [35]. Figure 5 shows the steps TensorRT takes to reconstruct and optimize the network structure, which are mainly divided into two parts: model compression and hardware mapping. Model compression eliminates useless output layers in the network by parsing the network model, reducing computation. In vertical network integration, the convolution, BN, and ReLU layers of current mainstream neural networks are merged into one layer. Horizontal network combination fuses layers whose inputs are the same tensor and which perform the same operation. For the concatenate layer, the input of the concatenate layer is sent directly to the following operation, without performing the concatenation separately and computing its input. Furthermore, 32-bit floating-point operations are quantized to 16-bit floating-point operations or 8-bit integer operations. In hardware mapping, the kernel selects the best pre-implemented algorithm based on different batch sizes and problem complexity and uses streaming techniques in CUDA to maximize parallel operations [36,52]. The robot system employs the Jetson Nano [42] as the computing module, which does not support the INT8 data type. Hence, the robot system uses TensorRT to quantize the object detection model to FP16 data.
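The accuracy-preserving effect of FP16 quantization can be illustrated by round-tripping a weight through IEEE 754 half precision with Python's standard library. This is only a conceptual illustration of FP16 rounding, not how TensorRT performs quantization internally:

```python
import struct

def to_fp16(x):
    # Round-trip a Python float through IEEE 754 half precision ('e' format),
    # mimicking the precision loss introduced by storing a weight in FP16.
    return struct.unpack('<e', struct.pack('<e', x))[0]

w = 0.123456789
print(w, to_fp16(w))  # FP16 keeps roughly 3 significant decimal digits
```

Because FP16's roughly three decimal digits of precision are usually far more than a trained weight's useful precision, the mAP barely changes after quantization, which matches the experimental results reported later.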

Evaluation
In terms of evaluation, the operating system is Ubuntu 21.04, the CPU is an Intel Core i9-10900 2.8 GHz processor with 32 GB RAM, the GPU is an RTX 3080Ti, the CUDA version is 11.4, and the version of the GPU acceleration library cuDNN is 8.

Dataset
This paper first creates a dish dataset, named Dish-20, which contains 506 images in 20 classes. In the experimentation, 409 images are used for training, 46 images are used for validation, and 51 images are used for testing. Figure 6a shows an example of a dish image, and (b) shows the definition of the 20 classes. The image size of the dataset is set as the default size of YOLO-GD (416 × 416).

Performance Indexes
Precision, Recall, F1, and mAP have been used to evaluate and compare the dish detection performance; they are defined in Equations (10)-(14), respectively:

Precision = TP / (TP + FP) (10)
Recall = TP / (TP + FN) (11)
F1 = 2 × Precision × Recall / (Precision + Recall) (12)
TP represents a positive sample correctly classified, FP represents a negative sample misclassified as positive, and FN represents a positive sample misclassified as negative [55]. Precision represents the proportion of correctly predicted positive samples among all samples predicted as positive, and Recall represents the proportion of correctly predicted positive samples among all actual positive samples.
The F1 score is the harmonic mean of Precision and Recall, namely the reciprocal of the average of the reciprocals of Precision and Recall [56].
The Precision-Recall curve (P-R curve) is derived from the relationship between Precision and Recall. The Average Precision (AP) of a class is the area of the region enclosed by the curve and the axes. In practical applications, we do not compute the area under the raw P-R curve directly but smooth the P-R curve first: for each point on the P-R curve, the value of Precision takes the value of the largest Precision to the right of that point, as shown in Equation (13):

AP = Σ_n (R_{n+1} - R_n) P_max(R_{n+1}) (13)

Among them, R_n represents the n-th Recall value, and P_max represents the largest Precision value to the right of that Recall value [57].
The mAP is obtained by averaging the AP over all classes (C) in the dataset, as shown in Equation (14):

mAP = (1/C) Σ_{i=1}^{C} AP_i (14)
We also use the detection and evaluation indicators of the COCO API, where the calculation of AP differs from that mentioned above. For a given class, Recall is equidistantly divided into eleven values [0, 0.1, . . . , 0.9, 1], the maximum Precision is calculated for each Recall value, and the average of these eleven Precision values is the Average Precision [53,57], as shown in Equation (15):

AP = (1/11) Σ_{r ∈ {0, 0.1, ..., 1}} P_max(r) (15)
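The eleven-point interpolated AP of Equation (15) can be sketched in a few lines of Python; the recall/precision values in the example are made up for illustration:

```python
def ap_11point(recalls, precisions):
    # 11-point interpolated AP: for each recall level r in {0.0, 0.1, ..., 1.0},
    # take the maximum precision among P-R points whose recall is >= r,
    # then average the eleven interpolated precision values.
    levels = [i / 10 for i in range(11)]
    total = 0.0
    for r in levels:
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / 11

recalls = [0.2, 0.4, 0.6, 0.8, 1.0]
precisions = [1.0, 0.9, 0.8, 0.7, 0.6]
print(round(ap_11point(recalls, precisions), 3))  # 0.818
```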

Experimental Results
To verify the performance of YOLO-GD, we analyze the test results of YOLOv4 and YOLO-GD. In addition, the parameters, weights, and Floating-Point Operations (FLOPs) of the two models are compared. Inference times and test results are also compared with and without YOLO-GD quantization on the Jetson Nano.

Performance Validation of the YOLO-GD Model
AP and mAP are important indicators reflecting the detection accuracy of object detection models: the larger the AP or mAP value, the higher the accuracy of the model. Figure 7a shows that the mAP of YOLOv4 is 93.97%, and Figure 7b shows that the mAP of YOLO-GD reaches 97.38%, which is 3.41% higher than YOLOv4. This demonstrates that the overall performance of YOLO-GD is better than that of YOLOv4. Tables 1 and 2 show the experimental results of YOLOv4 and YOLO-GD, respectively. The results of YOLO-GD are significantly better than those of YOLOv4, and several classes show clear accuracy improvements. For example, the F1 value of "Chopsticks-one" is increased from 0.93 to 1.00, the Recall from 87.50% to 100%, and the AP from 87.50% to 100%. The F1 value of "Fish-dish" is increased from 0.96 to 0.98, the Recall from 96.30% to 100%, the Precision from 96.30% to 96.43%, and the AP from 95.06% to 99.74%. The F1 value of "Rice-bowl" is increased from 0.98 to 1.00, the Recall from 95.24% to 100%, and the AP from 95.24% to 100%. The F1 value of "Spoon" is increased from 0.84 to 0.98, the Recall from 78.57% to 95.24%, the Precision from 89.19% to 100%, and the AP from 78.06% to 95.24%. The F1 value of "Square-bowl" is increased from 0.91 to 0.96, the Recall from 83.33% to 100%, and the AP from 83.33% to 100%, while the Precision is decreased from 100% to 92.31%. The F1 value of "Waster-paper" is increased from 0.78 to 0.94, the Recall from 78.38% to 91.89%, the Precision from 78.38% to 97.14%, and the AP from 75.65% to 89.88%. The F1 value of "Water-cup" is increased from 0.95 to 0.97, the Recall from 91.18% to 94.12%, the Precision from 98.41% to 100%, and the AP from 90.45% to 94.12%. The F1 value of "Wine-cup" is increased from 0.95 to 0.99, and the Recall is increased from 90.54% to 98.65%.
AP is increased from 90.54% to 98.65%.
However, the accuracy of some classes is decreased: the F 1 value of "Chopsticks-two" is decreased from 0.91 to 0.89, the Recall is decreased from 87.04% to 81.48%, the Precision is increased from 95.92% to 97.78%, and the AP is decreased from 86.57% to 81.28%. The F 1 value of "Paper" is decreased from 1.00 to 0.96, the Recall is decreased from 100% to 92.86%, and the AP is decreased from 100% to 92.86%.
Among the twenty classes, eight classes improve, two classes decrease, and the remaining classes are unchanged. The Precision of "Chopsticks-two" increased by 1.86%, but the Recall decreased by 5.56%, indicating that some instances of "Chopsticks-two" are predicted as other classes, which leads to decreases in the F1 and AP values. "Paper" comes in various shapes, forming a single class with multiple shapes, which leads to a decrease in recognition accuracy. The results demonstrate that YOLO-GD is better than YOLOv4 at detecting dishes. The COCO API is employed to evaluate the performance of the training model. The performance of YOLO-GD is tested by calculating AP11 (Average Precision) and AR (Average Recall) based on different IoU values, area sizes, and numbers of objects contained in the image. AP11 is averaged over the 10 IoU thresholds from 0.50 to 0.95 (with a step of 0.05), and the AP calculation is also performed at IoU = 0.50 and IoU = 0.75, respectively. AP11 and AR are calculated for different detection areas (Small, Medium, or Large) of the object. AR is calculated for different maximum numbers of objects detected per image (1, 10, and 20). Tables 3 and 4 show that for Small detection areas, the AP11 and AR values are -1.000, which means no relevant dish is detected in the Small detection area. The AP11 and AR of the Large detection area are higher than those of the Medium detection area. Dishes are detected well in the Large detection area, indicating that the anchor boxes of YOLO-GD should be adjusted to improve detection in small areas. As the maximum number of detected objects increases from 1 to 10, the AP11 and AR values increase significantly, but when the maximum number of detected objects increases from 10 to 20, the AP11 and AR values remain unchanged. This demonstrates that when the model processes an image, the number of detections for each dish does not exceed 10.
The values of YOLO-GD are higher than those of YOLOv4. These evaluation indicators validate that the performance of the YOLO-GD model is better.
A comparison of the relevant parameters of YOLOv4 and YOLO-GD is shown in Table 5. The weights, parameters, and FLOPs of YOLO-GD are significantly lower than those of YOLOv4. The weight file of YOLO-GD is 45.80 MB, which is 82.12% smaller than YOLOv4 (256.20 MB); the number of parameters is 11.17 M, which is 82.50% lower than YOLOv4 (63.84 M); and the FLOPs are 6.61 G, which is 88.69% lower than YOLOv4 (58.43 G). This proves that the YOLO-GD model is only 1/5 the size of the YOLOv4 model and is more lightweight. We deploy the YOLO-GD dish detection model on the robot control system, the Jetson Nano. Table 6 shows the per-image inference time and FPS on the Jetson Nano without and with quantization. The per-image inference time is the average inference time over 49 images. Without quantization, the per-image inference time of the YOLO-GD model on the Jetson Nano is 207.92 ms and the FPS is 4.81. With FP16 quantization using TensorRT, the per-image inference time is 32.75 ms and the FPS is 30.53. After quantization, the per-image inference time is decreased by 84.25%, which meets the real-time detection requirements. Figure 8 shows the quantized YOLO-GD results with an mAP of 97.42%, which is slightly higher than that of the model without quantization. The overall detection accuracy of YOLO-GD does not decrease with quantization, while the per-image inference time decreases by 84.25%, which proves the feasibility of the quantization method. Tables 2 and 7 compare the results of YOLO-GD without and with quantization: the results of "Chopsticks-two" are improved as a whole, but the Precision and F1 values of "Chopsticks-one" are decreased, and the AP value of "Waster-paper" is decreased by 0.07%. The Recall, AP, and F1 values of "Water-cup" are also decreased, and the Recall and AP values of "Wine-cup" are decreased. Figure 7 shows that the mAP value of YOLO-GD with quantization is 97.42%, which is higher than without quantization.
It is proven that after YOLO-GD is quantized by TensorRT's FP16, the detection accuracy remains unchanged.
The results of YOLO-GD using the COCO API without and with quantization are shown in Tables 4 and 8. The AP11 value decreases by 0.006 overall, by 0.004 when IoU = 0.75, and by 0.007 when the detection area is Large; AR decreases by 0.007 when the maximum number of detected objects is 1, and by 0.008 when the detection area is Large or the maximum number of detected objects is 10 or 20. Figure 9 shows the experimental results of dish image extraction, from dish detection to catch point calculation. (a) and (d) are the captured dish images. (b) is the detection result of (a). The result shows that "Towel" is not recognized, and "Waster-paper" in "Cup" and "Spoon" in "Square-bowl" have low detection accuracy, mainly because placing the two dishes together affects the detection accuracy. (c) is the catch point extraction image of (a); the catch point of "Chopsticks-one" is not positioned at the center, and the catch point of "Towel-dish" is slightly off the edge. (e) is the detection result of (d), and all the detection results are above 95%. (f) is the catch point extraction image of (d); the catch point of the "Wine-cup" on the left is extracted with an error because this class is a transparent object: in the process of image processing, the lower edge of the dish is fitted into a circle, which causes the catch point to shift to the lower edge. The robotic fingers that catch the dish are pneumatic fingers, which are expanded by pneumatic force to clamp the dish, so a catch point within a certain error does not affect the catching of the dish. During field testing, the method achieves high accuracy, proving the effectiveness of the catch point extraction method proposed in this paper.

Discussion and Future Work
When the confidence score is set to 0.5 and the IoU is set to 0.5, the mAP value of YOLO-GD is 97.38%, which is 3.41% higher than YOLOv4. The weight file of YOLO-GD is only 45.80 MB, the number of parameters is 11.17 M, and the FLOPs are 6.61 G, about 1/5 of YOLOv4. After TensorRT's FP16 quantization and deployment on the Jetson Nano, the inference time per image is 32.75 ms, reaching 30.53 FPS; the inference speed is approximately 6.3 times that without quantization. Moreover, the mAP with quantization is 97.42%, which is 0.04% higher than without quantization. In addition, the power consumption of the Jetson Nano is only 5-10 W, which meets the low-power requirements of robots.
In the detection process of YOLO-GD, some dishes could not be recognized effectively because of the variety of dishes and their overlapping placement. For example, "Towel" in Figure 9b is not recognized, so its catch point fails to be extracted. In the process of extracting the catch points, the dish contour location and other information are misjudged during the image processing because the environment, such as the lighting, has a significant impact on the image. For example, for the "Wine-cup" in Figure 9f, the catch point is located at the bottom contour position.
In future work, the YOLO-GD model will be further compressed using pruning techniques to make it more lightweight, and the catch point extraction method will be optimized to ensure more accurate extraction of the catch points. The depth and video streams of the Intel RealSense Depth Camera D435 will be used to capture the height of the dish catch point and feed the detailed three-dimensional information of the catch point to the robot. In addition, we will design new algorithms to optimize the order in which dishes are caught.

Conclusions
This article introduces a deep learning-based object detection algorithm, YOLO-GD, for empty-dish recycling robots. The object detection algorithm YOLO-GD is based on YOLOv4. By replacing the backbone structure with the lightweight Ghost Net, and by replacing traditional convolution with depthwise separable convolution and pointwise convolution in the feature fusion and result prediction stages, a lightweight one-stage detection model, YOLO-GD, is formed. According to the detection results, different image processing methods are performed to extract the catch points of the dishes. The coordinate information of the catch point is transmitted to the robot, and the robotic arm is used to catch the dish. To improve the detection speed, TensorRT is used to quantize the object detection model YOLO-GD to FP16, and the model is deployed on the robot control system, the Jetson Nano. The experimental results demonstrate that the object detection model is only 1/5 the size of YOLOv4, and its mAP value is 97.38%, which is 3.41% higher than the 93.97% of YOLOv4. After YOLO-GD quantization, the inference time per image is decreased from 207.92 ms to 32.75 ms, and the mAP is increased from 97.38% to 97.42%. Although there is a certain error in the extraction of the catch point coordinates, it meets the error tolerance of the robotic fingers. In summary, the system can effectively detect dishes and extract catch points, which has far-reaching significance for empty-dish recycling robots.