Table Cleaning Task by Human Support Robot Using Deep Learning Technique

This work presents a table cleaning and inspection method using a Human Support Robot (HSR) which can operate in a typical food court setting. The HSR is able to perform a cleanliness inspection and also clean the food litter on the table by implementing a deep learning technique and planner framework. A lightweight Deep Convolutional Neural Network (DCNN) has been proposed to recognize the food litter on top of the table. In addition, the planner framework was proposed to HSR for accomplishing the table cleaning task which generates the cleaning path according to the detection of food litter and then the cleaning action is carried out. The effectiveness of the food litter detection module is verified with the cleanliness inspection task using Toyota HSR, and its detection results are verified with standard quality metrics. The experimental results show that the food litter detection module achieves an average of 96% detection accuracy, which is more suitable for deploying the HSR robots for performing the cleanliness inspection and also helps to select the different cleaning modes. Further, the planner part has been tested through the table cleaning tasks. The experimental results show that the planner generated the cleaning path in real time and its generated path is optimal which reduces the cleaning time by grouping based cleaning action for removing the food litters from the table.


Introduction
Due to long working hours, low wages, unwillingness to work as a cleaner, workforce shortage has been a constant problem for food court cleaning and maintenance tasks in recent times [1]. Recently, many robotic platforms are designed for different cleaning application which include floor cleaning [2,3], facade cleaning [4], staircase cleaning [5], pavement cleaning [6,7] and garden cleaning [8]. However, these robot architectures could not support table cleaning and maintenance tasks. In this context, HSR can be a viable candidate for this task [9,10]. However, configuring HSR for cleanliness inspection and food litter collection is a challenging task [11]. Because, the robots need optimal food litter (food scraps, stains, spillage) detection system and real-time planner algorithm to execute the cleaning and inspection task [12,13].
Various techniques have been developed for cleaning robots to recognize the different class of litter (garbage, dirt, liquid spillages, and stains) and compute the cleaning strategy. Among them, computer static environments. As a consequence, the PRM algorithm is applied for the table cleaning task where the objects of interest stay static. Furthermore, positional information, muscle stiffness of the human arm, contact force with the environment also play important roles in understanding and generating human-like manipulation behaviors for robots as in Reference [31].
Motivated by the works mentioned above, this work proposes the deep learning-based food litter detection system and a path planner algorithm for the Toyota Human Support Robot (HSR) [10,32] to accomplish the table cleaning and inspection task for the standard food court setting. The technical contributions of the paper are as follows.
(1) The 2D location from the output of the lightweight Deep Convolutional Neural Network (DCNN) based litter detection model is combined with the depth data to yield the 3D location of the food litter on top of the table.
(2)The proposed framework can provide a high confidence level in classifying various types of litters like liquid, solid.
(3) The planner algorithm computes the cleaning mode and find the cleaning path for removing the food litters from the table. The planning and execution behaviors of how the robot cleans the dirty table are inspired by the actual cleaning activities conducted by the human. The real-time planner uses the Depth-First Search (DFS) [33] and Probabilistic Road Map (PRM) [34] algorithm, which are unified for generating the cleaning path. The DFS technique generates the initial path map for collecting food litters from the table, and the PRM scheme optimizes the path according to the motion planning function.
(4) The proposed method is validated in Toyota HSR robot and its efficiency has been verified with standard quality metrics.
The rest of the paper is organized out as follows: Section 2 describes the litter detection framework and planner module. Section 3 describes the experimental setup and the experimental results. Conclusions and future work are finally presented in Section 4. Figure 1 shows the functional block diagram of proposed scheme. It comprised of two modules, include DCNN based litter detection framework and planner module. The detection model comprises of two parts-a feature extractor and a bounding box predictor. Here, the feature extractors extract the specialized features pertaining to a litter classification then generates the feature map. As a consequence, bounding box predictors locate the objects in an image and distinguish the litter class (c) using feature maps extracted by the feature extractor. The detected boundary box region coordinates b(x,y) is further converted to 3D coordinate in the world frame in meters from the robot base frame. Then 3D coordinates of the center of the boundary box become the representative for each object of interest, which is used to control the HSR arm actuator. The detailed description of each module is described as follows.

CNN Based Litter Detection Framework
The proposed CNN network contains 16 layers with 9 convolutional layers and 6 pooling layers. The network is built on python with DarkFlow as a backend. DarkFlow is an open-source object detection algorithm that can be used for object detection and localization. The advantage with DarkFlow is that the architecture of the network can be altered, that is, changing the activation functions, network layers, and training using custom objects. The number of convolutional and pooling layers has been decided to make sure that the network does not cause any overfitting. The huge number of hidden layers could, in turn, cause overfitting issues. We started with adding convolutional and hidden layers until we receive a good enough F-1 score [35]. Figure 2 shows the functional block diagram of litter detection architecture and detail of each layer include filter size, padding, stride are given in Table 1.  Convolutional layers are used to extract higher-level features that could be used for performing some complicated classification. In a nutshell, the convolutional layer performs the operation of the convolution between its input and filter of the desired size. The number and size of the filters are given by the user as a parameter to the layer. The Equation (1) describing the operation of convolution function.
where, f and g are two variables that are involved in convolution. f is the input and g is the filter function.
The output is a two-dimensional activation map that provides the response of convolution at each spatial position. Based on the size of the filter and the size of the input, the size of the output can be determined using Equation (2).
where, I is input image matrix, K is the filter size, P is zero padding and S is the stride length. The first convolutional layer takes in images of size 416 × 416 × 3. The number of channels in the filter of a convolutional layer must be equal to the number of channels in the output of the previous layer.

Pooling layer
Pooling layers are generally designed to prevent over-fitting and, by applying non-linear down-sampling, it reduces the size and robustness to expedite computation. Zero-padding is not performed within the pooling layer as it would work against the purpose of the layer. The predominantly used pooling methods are: Max pooling: The maximum adjacent value is taken from the input image based on the position of pixels. This is implemented through input channels. Median pooling: The average of all values within the region covered by the filter is taken and assigned as the output value. The other important settings for the network layers are the hyper-parameters, which were set as follows: • Batch size = 64 • Learning Rate = 0.001 • policy = steps • Threshold = 0.5

Bounding Box
In this work, we are focusing on litter object localization with class. Specifically, the customized CNN network will give the exact location or Region Of Interest (ROI) in the color image. The location of the litter object is wrapped inside the bounding box. The idea behind the bounding box is that each image is divided into segments with an identical area, and a target vector has been generated for them (Equation (3)). The bounding box target vector for training would be as follows.
where P-binary value which determines if there is an object of interest in the image, X min -Upper left x Bounding Box Coordinate, Y min -upper left Y bounding box coordinate, X max -Lower right x bounding box coordinate, Y max -Lower right y bounding box coordinate, C 1 is 1 if the object belongs to class 1, else zero, C 2 is if the object belongs to class 2, else zero. Further, Intersection Over Union (IOU) method (Equation (4)) is used to remove any overlapping bounding boxes and to measure the accuracy of the bounding box with respect to the ground truth. IOU is the ratio of the area of overlap to the area of union.
For fixing the various IOU thresholds were tested out during the experiments, and the best one has been picked. In our case, we have chosen the IOU threshold to be 0.5, which has been widely reported in the literature and provide more stable results. This means when evaluating the output; if the IOU calculated from the predicted bounding box and the actual bounding box is equal to more than 0.5, we would consider that as a correct output, whereas anything below 0.5 is considered as the wrong predictions of the bounding box coordinates. The average IOU matching calculated over all the bounding boxes in the test set is 0.7044, with an overall confidence 0.589.

Planner
The planner module is developed for accomplishing the table cleaning task through HSR. The planner has two functions, namely finding the cleaning method and constructing the cleaning path. Figure 3 shows the process flow diagram of the planner module. It uses the litter detection framework for finding the cleaning method and constructing the cleaning path. Two cleaning methods are adopted for the table cleaning task, which includes sweeping and wiping, where the sweeping method is used to clean the solid and semi-solid food litter, and wiping mode is used to remove stains. Straight move or grouping based sweeping action is adopted for cleaning solid, and semi-solid food litter where grouping based sweeping is used for cleaning the multiple litters in one shot and straight move is used for removing the separated food litters. Further, zig-zag cleaning action is considered for wiping the stains. Figure 4 demonstrate the three different cleaning action.

Construction of Cleaning Path
To construct the path, the planner uses the probabilistic road map algorithms. Here, the DFS technique generates the initial path from the detected bounding region, then PRM scheme has been applied for fine-tune path according to HSR mobility function. The key element of path planning function P = (nodes, D goal (x, y), edges). Here, the detected bounding box are considered as nodes is the robot dust collection point or edge of the table and P is the feasible path for connecting the nodes. The path planner constructs two-way path, such as direct path planning and grouping based path planning.
Initially, the planner starts the path planning by exploring the path which starts from the leftmost upper node, and that node is marked as q start . The DFS scheme has been used to explore the path between q start and remaining nodes. It follows the vertical path searching, and it starts with the first neighbor and continues down the line as far as possible. Once it reaches the final node, then it backtracks to the first node, where it was faced with a possibility to change the next neighbor. Here, the backtracks nodes are marked as connection point of D goal (x, y, d). Once DFS explores the path. PRM has been adopted to post-processing the path that fine-tune the cleaning path according to HSR mobility function. The post-processing steps first connect the backtrack point with D goal (x, y, d) and generate the new path. In addition, analysing the distance of each connected node in the generated path used Equations (5) and (6). As long as the distance between the two consecutive nodes in the path is less than or equal to a predefined threshold (maximum arm reachable distance), the node considered as neighbor and path is valid. Otherwise, a node has been neglected, and the new pathfinding process has been initiated for the neglected node. The process repeats for all neglected nodes. The Algorithm 1 outline the path planning scheme.
where D max is threshold, q is new generated node, q is neighbor node, and D(q, q ) is Euclidean distance Equation (6).
Algorithm 1 Cleaning Path Generation 1: Step 1 2: N is an array holds all the nodes position 3: P is an array used to store the cleaning path 4: q start = leftmost(N); Set uppermost left nodes as q start 5: load q n = N − 1 6: q goal = def_goal(litter collection point); the function load the litter collection point as q goal 7: Step 2 8: while cleaning path between q start to q n not discovered do 9: Q, backtracknode, internodes and Path are variable used for store the temporary value 10: for n = 0; n < q n ; n + + do 11: Q= dfs_paths(q start , q n ); Explore the path between q start and N − 1 nodes 12: end for 13: backtracknode = backtrack.find(Q); the function find the backtrack node in the explored path 14: internodes = group.find(q start , backtracknode); the function collect intermediate node between q start and backtracknode 15: Path = def_PRM_node_connect(q start , internodes, backtracknode, q goal , R); PRM function connect the nodes according to threshold function R 16: plt.plot (Path) 17: end while 18: visited.add(Path); Mark all path identified node in N 19: Store the path in P 20: load q n = N − Path; exclude path generate node from N 21: set.new(q start , N); set new q start from updated N 22: Run the Step 2 up-to generate the path for remaining nodes

Experimental Setup & Results
This section describes the experimental setup and experimental results of the proposed scheme. There are three phases in this section which including configuring algorithms in HSR, training data preparation and validation, and evaluating the table cleaning and inspection with HSR.

Configuring HSR
Toyota HSR is utilized in our experiment to test the proposed scheme. The contour structure of the HSR robot is shown as in Figure 1. There are five key modules in the HSR robot, including an RGB-D camera in the head region to perceive the objects in the environment, 4 degrees of freedom arm manipulator with a capacity of flexible manipulation enabling it to reach all the points in the 3D space.
The proposed system is built on the Robot Operation System (ROS) platform [36]. ROS provides the infrastructure and mechanism that enable hardware components to work smoothly together. The ROS master installed on the main system monitors the entire ROS system. The ROS topics transmitted in the ROS network through local connection or data networks enable the communication between ROS nodes. The 3D robot coordinate frames such as the base movement of the robot, RGB camera, base, arm, grabber are maintained by transformation frame service in the ROS system. By referring to the lookup table in the transformation frame, HSR understands the relative 3D offset between the pair of these frames to control the corresponding actuators. Figure 5 shows the schematic representation of HSR hardware architecture configuration, which comprises of two computing units, namely primary and secondary systems. The primary system contains master ROS which has access to the sensor data and the motion planner framework. Here, the primary system is configured to perform the path planning and arm manipulation task. The primary system uses the MoveIt open-source software framework application program interface (API) to control and plan the motion of the robotic arms. The secondary system is Nvidia's Jetson TK1 board (GPU), operating as a ROS slave. The slave block contains a control block and a litter detection framework. The control block governs the interface between the deep learning framework and ROS. It wraps the data received from the deep learning framework into the ROS compatible messages and sends it to the primary system. Similarly, it unwraps the ROS message into a suitable format that is required by the deep learning framework. Both the primary and the secondary systems are connected over Transmission Control Protocol/Internet Protocol (TCP/IP) and share all the ROS topics.
To execute the cleaning task, the primary system enables the RGB-D sensor and collect the RGB-D stream of image data to the secondary system. The GPU in the secondary system, which is running the deep learning framework, takes in the input data and performs the object detection and classification task and returns detected information, which contains the coordinates and types of litters.
Since the boundary box of the detected object is in the image coordinate frame, the knowledge of the point in the 3D world coordinate system is essential for the robot to manipulate the arm and grabber. The pixel coordinates in the color image of the boundary box, including the center, top left, bottom left, and top right, are converted to 3D coordinate in the world frame in meter from the robot base frame. Then the 3D coordinate of the center of boundary box becomes the representative for each object of interest. In this paper, we assume that the RGB-D camera is set up so that its Field of View (FOV) covers the whole table, which needs to be cleaned. To do the 2D to 3D conversion, projective geometry is used for mapping the point in the image plane to 3D point in the world frame. Intrinsic and extrinsic parameters of the camera are estimated by the camera calibration method [6]. Translation matrix T = [t 1 t 2 t 3 ] is assumed from perception system after calibration from the origin of world coordinate and orientation matrix R = [roll, pitch, yawn]. Camera intrinsic parameters lies in K with focal length f x , f y , principle point (x c , y c ), pixel size (sx, sy) and distortion coefficients σ, any pixel p = [p x p y ] on the image plane with 3D coordinate W = [X Y d] world plane and W can be calculated by p = HW where H = K[R T] is the homogeneous matrix. After identifying the point in the 3D world coordinate frame and assigning it to a specific object frame, the transformation frame service of ROS will maintain the relative location between this point to robot coordinate frames.
Upon receiving the control information, the primary system then generates the cleaning map for executing the cleaning action through arm motion. After the path has been derived from DFS and PRM techniques, then Moveit [37] service of ROS has been used to path following, arm controlling, and base motions which have been optimized by Toyota for HSR. Specifically, by considering robot kinematics, curvature continuities of the robot arm, path feasibility, Moveit executes the robot arm trajectory to avoid smooth sudden stops and collisions with fragile objects. Upon executing the motion plan to clean the litters, the arms return to its home position. After visiting all the litters locations at least once, the primary system requests the secondary system (GPU) for a confirmation of the cleaning by sending in the RGB-D image data. If the food litters are still present, then the secondary system will request the primary system for another round of cleaning.

Training Data Preparation and Validation
The HSR RGB-D camera is used to capture the litter data set. The specification of RGB-D camera is given Table 2. Images are collected from the robot perspective with a different angle. In total, there are 3000 images captured at different table backgrounds with various types of food litters, include food scrub, liquid spillage, and stains. In addition to that, the CNN learning rate was improved, and over-fitting was prevented by applying data expansion to the captured images. Data expansion applies simple geometrical operations on images like scaling, rotating, shifting, and flipping to increase the number of samples in the data set. Further, the images are resized to 416 × 416 pixels to reduce the computational time for training. Furthermore, the dataset is labeled as two classes that include solid and liquid. The dataset was divided into two classes, with 1500 images per class. One class of image contains only a single type (solid or liquid) food litters. The other one consist of mixed food litter in the same image. The CNN network was developed in the Tensorflow framework and trained in Intel Xeon E5-1600 V4 CPU with 64 GB RAM and an NVIDIA Quadro P4000 GPU with 12 GB Video memory. K-fold cross-validation is adopted in this work to validate the dataset. In this method, the dataset is divided into k subsets, and K-1 is used for training. This work uses 10 fold cross-validation. The images shown are obtained from the model with the highest accuracy.

Evaluate the Table Cleanliness Inspection with HSR
The cleanliness inspection part has been assessed through food litter detection on the dining table. This is a crucial component of the proposed scheme. Hence, the performance of the robot was ensured for the accuracy of the detection model. To carry out the first experiment, the trained model was configured in HSR secondary system. The experiment was tested in square and circular dining tables, which are arranged like a food court dining pattern, as shown in Figure 6. After configuring the algorithm in hardware, the robot is placed in a work-space. For experimental purposes, various food litter (solid food class, stains, and spillage) are scattered on the table looks like unclean table, and detection has been analyzed through a remote console. The results ensure that the performance of the developed litter detection framework can detect most of the food litters includes solid class foods, stains, and liquid spillages on the dining table. Solid food litter detected typically has a 97% or higher confidence level and stains, and liquid spillage has been recognized at 96% or higher confidence level, respectively. Further, the miss rate (Equation (7)) and false rate (Equation (8)) metric [13] are evaluated for the proposed litter detection framework. These two scenarios can be better understood by observing the Figure 9b,c,h. In Figure 9b,c are examples of miss detection, where some food litter is not detected by the model. Figure 9h is an example of false detection where the solid type food liters are detected as the liquid class. Here the miss rate represents the case, where target litter is not detected by the model from given input image set. Whereas, false positive indicates that the type of litter (solid, liquid etc.) is wrongly detected from the input image set. In our model, the overall miss rate and false rate is less than 3% and 2% respectively for an input image set comprising of hundred solid and liquid food litter objects.Since the depth features of the solid and liquid litter of trash are not so obvious and are affected by depth sensor noise, perspective viewpoint, depth data alone is not sufficient. In addition, using only the depth is hard to identify litter from other solid and liquid objects such as food or items on the table because they have the same depth features as litter. As a consequence, the DNN network needs to be customized to give the acceptable solid/liquid classification on the table with only depth information. We will consider using depth data directly for classification in the future works η miss = n missnum n testset × 100% η f alse = n f alsenum n testset × 100% (8) where n testset is total number of test objects.

Comparison with Existing Schemes
This section describes the comparative analysis of proposed work with other existing case studies in the literature. The comparison analysis has been performed based on CNN and non-CNN methods used on different cleaning robot applications for the detection of different kinds of litter ( dirt, garbage, marine debris). Table 3 shows the comparison with non-deep learning based approaches. Table 3 results indicate that non-deep learning based schemes detection accuracy is average of 80%, and detection accuracy heavily relies on background surface and brightness of the litter objects which lead to false detection of objects. However, the authors describe that false detection can be overcome by re-scaling the filter response manually. Further, models are able to detect the litter based filter function and cannot classify the litter classes, which is the fundamental limitation of non-deep learning based litter detection model. Table 3. Non-Deep Learning Based Detection.

Case Study Algorithm Detection Accuracy
Floor cleaning: Dirt and Mud detection on floor [14] Spectral residual filter 75.45 Floor cleaning: Dirt and mud detection [38] Spectral residual filter + Maximally Stable Extremal Regions 80.12 Garbage Detection [39] Histogram of Oriented Gradients (HOG) + Gabor + Color 80.32 Trash detection [40] SVM + Scale-invariant feature transform (SIFT) 63 Table 4 shows CNN case study and Table 5 shows comparative analysis of CNN based litter detection schemes, which are implemented using various CNN based object detection schemes. Generally, the object detection framework efficiency has been assess through accuracy, precision, recall, and F-1 score, miss rate and false rate metric [13]. In literature point of view, compare our proposed method with the other litter detection frameworks is very hard, because each model uses a different CNN topology and training parameters. Also, the data-sets are different. Hence, we resort to describing the essential characteristics of each model and enlist its pros and cons for the common attributes.  In the literature, Faster RCNN with ResNet or inception, Mobilnet V2 SSD [42], and YOLO [] object detection framework are widely reported framework for litter detection. The table results indicate that faster RCNN-resnet or faster RCNN-inception framework has higher accuracy compared to other models. However, faster RCNN requires tremendous computing resources due to its region-based convolutional, which makes it less suitable to be deployed in low computing devices. At the same time, SSD-MobileNet based implementation provides a right balance between accuracy and computational time where the SSD mobilenet model uses a defined set of sizes for the bounding boxes and uses the more efficient depthwise convolution layers which significantly reduces the computation time and inference time. YOLO v2 is much faster than SSD Mobilenet, which is built on a dark flow framework and uses a single convolutional network to predict the bounding boxes and the class probabilities for these boxes. Hence, it required less computation power and low inference time than other models. Furthermore, our framework follows the YOLO dark flow; hence it achieves considerable accuracy and low computation time similar to the YOLO model. YOLO is prone to overfitting, and therefore, we have set the learning rate to be much less than many other previous works. Further, we have also limited the number of hidden after the convolutional and Max pooling layers. Finally, we have made sure that the training data is not skewed that is, the number of training data in each of classes is pretty much equivalent for training.
In contrast to a non-deep learning model, Deep learning techniques offer a significant advantage in scenarios when sufficient data is available. These techniques are able to autonomously extract features from images, which allow them to learn features and patterns which are difficult to figure out statistically. CNN architectures are specifically good for extracting features from images since they use a combination of pixels next to each other for features.

Validate the Planner Module
The efficiency of the planner module has been tested through cleaning path generation and table cleaning tasks using generated cleaning path. Two experiments are conducted to assess the table cleaning function. In the first experiment, Robot cleans the solid food litter on the table. The second experiment is set to clean the presence of both solid waste and liquid spillages. To carry out the experiments, firstly, the litter detection module was run to recognize the food litter, then the planner was executed to generate the cleaning path. Figure 9 shows the litter detection results and their respective cleaning path for two experiments. The experimental results indicate that the path planner covers most of the detection region efficiently. Further, the generated path was loaded into the arm and base motion planning module to execute the cleaning task. The outcome of the cleaning task was verified after executing all the cleaning path. Figure 10 shows the outcome of the task after executing the cleaning task. Here, the stains and liquid spillage litters were spread when the robot tries to clean. So, those litter regions need to do multiple rounds of cleaning action and cleanliness inspection, as shown Figure 11 (cleanliness inspection after executing the cleaning task). Further, the solid food litter was cleaned in one iteration. However, some of the litter's regions are not able to clean in the first iteration because some food litter is scattered, and some of the litter regions are not able to cover accurately due to noise of depth data, which affect the process of 3D coordinate into the world coordinate.  The Table 6 shows the average computation time for detection, path planning, and motion planning. The planner part runs on the CPU and takes about an average of 0.0145 seconds. Further, for cleaning the three to five food litters, the execution module takes approximately an average of 210.5 seconds. The test results indicate that the proposed algorithm can satisfy the real-time requirement. It is more suitable for executing the cleanliness inspection and table cleaning application using HSR.

Practical Limitations
In this work, the cleaning actions depend upon the precise value of the depth information acquired by the RGB-D cameras, which have a restricted resolution. Even though the use of cloth has some advantages, the expected movement of the litter is not reached by the actions. Further, The Toyota arm has 5 degrees of freedom rotation, because of which the arm planning becomes complicated by involving the robot base to reach certain points on the table. This can be solved by adding an additional degree of freedom to the robotic arm. In our feature work, we plan to improve the quality of the RGB-D vision system and HSR motion planner function, which helps to predict the object position and orientation accurately and improve the cleaning efficiency. The computing power of the existing CPU installed on the HSR is relatively slow, an extra computing board with additional hardware computing power can be used to expedite the detection and identification of food litter.

Conclusions
Developing a cleaning robot platform should have two critical features. The environment should be of real-world setting, and the platform should be able to reach every nook and corner of the cleaning region. This proposed cleaning platform Toyota HSR is developed for assisting humans in a general setting, and proven to be efficient in conventional environments. The proposed framework is tested on a common food-court like setting so that the real-world implementation can be without any issues. Unlike the earlier results, the proposed work focuses on classifying the litter into various types, for which the cleaning process is different. Moreover, the classification module is augmented with an optimal path planning module and control module to achieve high efficiency. The efficiency of the detection framework is evaluated through cleanliness inspection, and its detection accuracy was measured through standard quality metrics. The experimental results show that the proposed cleaning framework detects most of the food litter with highest detection accuracy among the considered methods, and HSR cleaning part takes an average of reasonable time to clean the three to five food litters on the table. The proposed method can be used to clean vertical surfaces like glass panels, windows in homes, among others. The module can be used to identify litters not only on tables but also on the floors. This scaling of this application can open new doors in the cleaning industry.